  1. XWiki Platform
  2. XWIKI-14330

Upgrade to Tika 1.17


Details

    • Task
    • Resolution: Fixed
    • Major
    • 10.1-rc-1
    • 9.4
    • Dependency Upgrades
    • None
    • Unknown
    • N/A

    Description

      See https://github.com/apache/tika/blob/1.17/CHANGES.txt

      Release 1.17 - December 8, 2017
      
        ***NOTE: THIS IS THE LAST VERSION OF TIKA THAT WILL RUN
           ON Java 7.  The next versions will require Java 8***
      
        * Fix thread-safety in ChmExtractor (TIKA-2519).
      
        * Upgrade cxf to 3.0.16 (TIKA-2516).
      
        * Allow users to configure maxMainMemoryBytes for PDFs via shrike (PR-213).
      
        * Extract underline and strikethrough in docx (TIKA-2347 and TIKA-2512).
      
        * Cache TikaConfig in EmbeddedDocumentUtil for better performance
          in documents with large number of attachments (TIKA-2511).
      
        * Extract media files from ooxml (TIKA-2510).
      
        * Standardize the way the Image and Video captioning 
          dockers and extraction work (TIKA-2400, GitHub-208)
      
        * Upgrade to xmpcore 5.1.3 (TIKA-2034).
      
        * Upgrade to metadata-extractor 2.10.1 (TIKA-2486).
      
        * Upgrade to OpenNLP 1.8.3 (TIKA-2502).
      
        * Upgrade to Jackson 2.9.2 (TIKA-2501).
      
        * Catch potential NPE in getting InputStream for attachments
          in PST file (TIKA-2488).
      
        * Upgrade to PDFBox 2.0.8 (TIKA-2489).
      
        * Allow configuration of markLimit in EncodingDetectors
          via tika-config.xml (TIKA-2485).
      
        * RFC822Parser now selects the best alternative for
          multipart/alternative body components.  This aligns with the
          behavior of the OutlookParser (TIKA-2478).  Users can select
          legacy behavior via the "extractAllAlternatives" parameter
          in the RFC822 parser definition in tika-config.xml.
      
        * Narrow mime detection for ms-owner files and add detection
          for .nls files (TIKA-2469).
      
        * Fix bug in CharsetDetector that led to different detected charsets
          depending on whether user setText with a byte[] or an InputStream
          via Sean Story (TIKA-2475).
      
        * Remove JAXB for easier use with Java 9 via Robert Munteanu (TIKA-2466).
      
        * Upgrade to POI 3.17 (TIKA-2429).
      
        * Enabling extraction of standard references from text (TIKA-2449).
      
        * Load external custom mimetypes XML from system property 
          tika.custom-mimetypes (TIKA-2460). 
      
        * Extract number of tiffs in a multi-page tiff (TIKA-2451).
      
        * Fix detection of emails extracted from mbox (TIKA-2456).
        
        * Add OverrideDetector and allow PSTParser to specify body content type
          as text or html -- to avoid incorrect auto-detection of
          rfc/mbox, etc. (TIKA-2454)
      
        * AutoDetectParser throws ZeroByteFileException for zero-byte files after
          detection on the file extension (TIKA-2450).
      
        * Extract phonetic runs in docx with experimental SAX parser (TIKA-2448).
      
        * Extract phonetic runs from xls and allow users to turn off extraction
          of phonetic runs in both xls and xlsx (TIKA-2440).
      
        * OOXML locale should be set by POI's LocaleUtil not Locale.getDefault().
          Fix unit tests to be robust against different locales in OOXML
          and ExcelParser (TIKA-2438).
      
        * Upgrade to PDFBox 2.0.7 (TIKA-2431).
      
        * Tika now has support for automatic image captioning, that
          combines Computer Vision and Natural Language Processing to
          automatically generate a readable caption for an image 
          (TIKA-2262, TIKA-2355, TIKA-2402, Gh-198, Gh-196, Gh-189).
      
        * Add TestCorruptedFiles to allow devs to test parsers against
          corrupted input files (TIKA-2430).
      
        * Correct Mimetype definition for Windows batch files (CMD and BAT)
          which are the same (TIKA-2445)
      
        * PSDParser memory use improvements (TIKA-2447)
      
        * Add underline extraction from Word documents (doc/docx) via Stuart Hendren
          as well as strikethrough extraction in docx (TIKA-2347, GitHub-173)
      
        * Corrected Tesseract OCR rotation.py script and made it a configurable
          option via Peter Weiss (TIKA-2385)
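
The tika.custom-mimetypes entry above (TIKA-2460) can be tried out with a small setup step. The following is a minimal sketch, not taken from the Tika sources: it writes a mime-info file in the format Tika uses for custom mimetypes and points the system property at it. The class name, the `application/x-example` type and the `*.example` glob are hypothetical, and the sketch assumes the property has to be set before Tika first loads its mime type registry.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class CustomMimeTypesSetup {

    // Property name as given in the changelog entry (TIKA-2460).
    static final String PROPERTY = "tika.custom-mimetypes";

    // Hypothetical type and extension, used only for illustration.
    static final String MIME_XML =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        + "<mime-info>\n"
        + "  <mime-type type=\"application/x-example\">\n"
        + "    <glob pattern=\"*.example\"/>\n"
        + "  </mime-type>\n"
        + "</mime-info>\n";

    /** Writes the definitions to a temp file and points the property at it. */
    static Path install() {
        try {
            Path file = Files.createTempFile("custom-mimetypes", ".xml");
            Files.write(file, MIME_XML.getBytes(StandardCharsets.UTF_8));
            System.setProperty(PROPERTY, file.toAbsolutePath().toString());
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path file = install();
        System.out.println(PROPERTY + " = " + System.getProperty(PROPERTY));
        System.out.println("definitions written to " + file);
    }
}
```

Passing `-Dtika.custom-mimetypes=/path/to/file.xml` on the JVM command line sets the same property without any code.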
      
      Release 1.16 - 7/7/2017
      
        * Exclude jj2000 from edu.ucar grib to avoid potential
          license conflicts with ASL 2.0
      
        * Add Age recognition using Ensemble model for Linear regression
          and Apache OpenNLP Maximum Entropy. Tika can now detect age from
          text (TIKA-1988).
      
        * Add Tika Deep Learning support for the VGG16 model for
          Very Deep Convolutional Networks for Large-Scale Image Recognition.
          Now Tika supports both Inception v3/v4 and VGG16 based image 
          recognition (TIKA-2298).
      
        * Extract macros from PPT (TIKA-2089).
      
        * Extract absolute path for last saved location when available
          in .xlsx and .xlsb (TIKA-2335).
      
        * Rename SentimentParser to SentimentAnalysisParser to
          prevent conflict with dependency (TIKA-2368).
      
        * tika-app now extracts inline images in PDFs by
          default, and it includes a warning to users that this is not the
          default behavior elsewhere in Tika (TIKA-2374).
      
        * Allow configurability of warnings for problems during
          parser initialization (TIKA-2389).
      
        * Upgrade to Jackcess 2.1.8 (TIKA-2380).
      
        * Upgrade to POI 3.17-beta1 (TIKA-2336).
      
        * Remove non-ASL-2.0-compatible org.json (TIKA-1804).
      
        * Allow extraction of <script> elements in HTML as embedded "MACRO".
          Users must turn this on via TikaConfig (TIKA-2391).
      
        * Allow users to turn off extraction of headers and footers
          from .doc, .docx, .xls, .xlsx, .xlsb (TIKA-2362)
      
        * Extract text from charts in .docx, .pptx, .xlsx and .xlsb
          (TIKA-2254).
      
        * Extract text from diagrams in .docx, .pptx, .xlsx and .xlsb
          (TIKA-1945).
      
        * Fix bug in tika-server that led to an attempt to close the
          input stream twice (TIKA-2384).
      
        * Enable base32 encoding of digests and enable BouncyCastle implementations
          of digest algorithms (TIKA-2386).
      
        * Canonical Mimetype of WAVE audio changed to match RFC 2361 defined
          version, audio/vnd.wave, older audio/x-wav remains as an alias
      
        * Upgrade "provided" xerial to 3.19.3 (TIKA-2412).
      
        * Upgrade Gson to 2.8.1 (TIKA-2414).
      
        * Upgrade mime4j to 0.8.1 (TIKA-2413).
      
        * Mime magic improvements for GraphViz (TIKA-2422), HTML files which
          claim to be XML but aren't quite valid XML (TIKA-2419) and QuickTime
          / MP4 (TIKA-2418)
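
To illustrate what a base32-encoded digest (TIKA-2386 above) looks like, here is a minimal, self-contained sketch in plain Java. It does not use Tika's API. It implements the standard RFC 4648 base32 alphabet with `=` padding and applies it to a SHA-256 digest. The changelog does not say which alphabet variant or padding Tika uses, so those details are assumptions, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Base32Digest {

    private static final char[] ALPHABET =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567".toCharArray();

    /** Encodes bytes as RFC 4648 base32 with '=' padding. */
    static String base32(byte[] data) {
        StringBuilder out = new StringBuilder();
        int buffer = 0; // pending bits, most recent byte in the low bits
        int bits = 0;   // number of pending bits not yet emitted
        for (byte b : data) {
            buffer = (buffer << 8) | (b & 0xFF);
            bits += 8;
            while (bits >= 5) {
                out.append(ALPHABET[(buffer >> (bits - 5)) & 0x1F]);
                bits -= 5;
            }
        }
        if (bits > 0) {
            // Left-align the remaining bits in a final 5-bit group.
            out.append(ALPHABET[(buffer << (5 - bits)) & 0x1F]);
        }
        while (out.length() % 8 != 0) {
            out.append('=');
        }
        return out.toString();
    }

    /** Computes SHA-256 over the UTF-8 bytes of the text and base32-encodes it. */
    static String sha256Base32(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            return base32(md.digest(text.getBytes(StandardCharsets.UTF_8)));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(base32("foobar".getBytes(StandardCharsets.UTF_8)));
        System.out.println(sha256Base32("hello"));
    }
}
```

Because base32 uses only upper-case letters and the digits 2 to 7, the result stays intact on case-insensitive systems, which is a common reason to prefer it over hex or base64 for digests.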
      
      Release 1.15 - 05/23/2017
      
        * Tika now has a module for Deep Learning powered by the 
          DL4J toolkit. The initial included model is for InceptionV3
          and so using this module, natively in Java, Tika can use 
          Deep learning for metadata/text extraction from Images using
          the power of the Inception model (Github-165).
      
        * A new parser for sentiment analysis using categorical
          (multi-class: angry, sad, neutral, like, love) and binary
          (positive/negative) models was added, leveraging the USC data
          science work (TIKA-2016).
      
        * Tika now has the ability to automatically detect objects in videos,
          using OpenCV and Tensorflow (TIKA-2322).
      
        * Change default behavior to parse embedded documents even if the user
          forgets to specify a Parser.class in the ParseContext (TIKA-2096).
          Users who wish to parse only the container document should set
          an EmptyParser as the Parser.class in the ParseContext.
      
        * Change default behavior of Office Parsers to _not_ extract
          Macros.  User needs to setExtractMacros to "true" (TIKA-2302).
      
        * Added tika-eval module (TIKA-1332).
      
        * Unified logging across Tika: SLF4J as logging API, Apache Log4j as
          implementation with JCL and JUL bridges in standalone tools like
          tika-app, tika-batch and tika-server (TIKA-2245).
      
        * Add parser for XLSB files (TIKA-1195).
      
        * Add parsers for EMF/WMF files (TIKA-2246/TIKA-2247).
      
        * Add parsers for WordPerfect and QuattroPro (.qpw) files.
          Contributed by Pascal Essiembre (TIKA-1946 and TIKA-2228).
      
        * Add experimental SAX parser for .pptx files. To select this parser,
          set useSAXPptxExtractor(true) on OfficeParserConfig (TIKA-2210).
      
        * Add experimental SAX parser for .docx files. To select this parser,
          set useSAXDocxExtractor(true) on OfficeParserConfig (TIKA-1321, TIKA-2191).
      
        * Add mime detection and parser for Word 2006ML format (TIKA-2179).
      
        * Bug fix for WordPerfect via Pascal Essiembre (TIKA-2352).
      
        * Added "text-main" equivalent option to tika-server via
          /tika/main (TIKA-2343).
      
        * Enabled configuration of the EncodingDetector used by
          parsers that extend AbstractEncodingDetectorParser (TIKA-2273).
      
        * Prevent easily preventable OOMs for both detection and parsing
          of some compression formats (TIKA-2330).
      
        * Extract images and thumbnails from ODT via Sam Bayer (TIKA-2295).
      
        * Fix potential NPE in FeedParser via Julien Nioche (TIKA-2269).
      
        * Official mime types for BMP, EMF and WMF have been registered with
          IANA, so switch to these (image/bmp image/emf image/wmf) (TIKA-2250)
      
        * Be more parsimonious with BufferedInputStreams via Josh Hight
          (TIKA-2244).
      
        * Enable handling of hyphenated language codes in TesseractOCRParser
          via Graham Russell (TIKA-2231).
      
        * Improve style tags in ODT (TIKA-2242).
      
        * Add container detection for embedded MSEquation files (TIKA-2238).
      
        * Add parsing of JBIG2 and extraction of JBIG2 from PDFs when
          required dependencies are added to class path by user.
          Contributed by Pascal Essiembre (TIKA-2232).
      
        * Mime magic for the OneNote family (.one / .onetoc / .onepkg), no parser
          (TIKA-2224).
      
        * Add configurability of "preserve-interword-spacing" to
          TesseractOCRParser (TIKA-2190).
      
        * Upgrade to PDFBox 2.0.6 and JempBox 1.8.13 (TIKA-2209/TIKA-2236/TIKA-2361).
      
        * Refactor MockParser to consolidate service loading
          and mime types into tika-core/src/test (TIKA-2195).
      
        * Enabled extraction of embedded objects from headers, footers,
          footnotes, endnotes and comments in legacy .docx parser (TIKA-2192).
      
        * Allow extraction of PDActions (including Javascript) from
          PDFs (TIKA-2090).  This is turned off by default.  Users
          must setExtractActions(true) on the PDFParserConfig.
      
        * Change default behavior in experimental .docx parser to ignore
          deleted text to align with .doc (TIKA-2187).
      
        * Upgrade to POI 3.16 (TIKA-2116, TIKA-2181, TIKA-2329).
      
        * Allow configuration of timeout for ForkParser (TIKA-2170).
      
        * Add extraction of .jpx inline images from PDFs when required
          dependencies are added by user to class path (TIKA-2175).
      
        * Add .jpx, .jp2, .ppm to formats handled by Tesseract (TIKA-2174).
      
        * Upgrade SQLite "provided" dependency to 3.16.1 (TIKA-2334).
      
        * Update Apache CXF version to 3.0.12 (TIKA-2292).
      
        * Add Lingo24 Language Detector (TIKA-2297).
      
        * Further mime magic for WebVTT (TIKA-1772)
      
        * Extend support for increased PSM options up to 13 for modern 
          versions of Tesseract (TIKA-2357).
      
        * Prevent potential resource leak by closing TrueTypeFont
          via Cameron Rollheiser (TIKA-2370).
      

            People

              tmortagne Thomas Mortagne