XWiki Platform / XWIKI-16450

Upgrade to Tika 1.21

Details

    • Type: Task
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: 11.6-rc-1
    • Affects Version/s: 11.4
    • Component/s: Dependency Upgrades
    • Labels: None
    • Unknown
    • N/A

    Description

      See https://dist.apache.org/repos/dist/release/tika/CHANGES-1.21.txt

         * Add optional AUTO mode to OCR'ing of PDFs.  If tesseract is installed
           and on the path, and this option is selected programmatically
           or via TikaConfig(), the PDFParser will use heuristics to decide
           whether or not to run OCR per page on PDFs. (TIKA-2749)
      
         * The ZipContainerDetector's default behavior was changed to run
           streaming detection up to its markLimit.  Users can get the
           legacy behavior (spool-to-file/rely-on-underlying-file-in-TikaInputStream)
           by setting markLimit=-1. The POIFSContainerDetector requires an underlying file;
           it will try to spool the file to disk; if the file's length is > markLimit,
           it will not attempt detection; set markLimit to -1 for legacy behavior (TIKA-2849).
      
         * Upgrade PDFBox to 2.0.14 (TIKA-2834).
      
         * Add CSV detection and replace TXTParser with TextAndCSVParser;
           users can turn off CSV detection by excluding the TextAndCSVParser
           and adding back the TXTParser via tika-config (TIKA-2833).
      
         * Add a CSVParser.  CSV detection is currently based solely on filename
           and/or information conveyed via Metadata (TIKA-2826).
      
         * General upgrades: asm, bouncycastle, commons-codec, commons-lang3, cxf,
           guava, h2, httpcomponents, jackcess, junrar, Lucene, mime4j, opennlp, parso,
           sqlite-jdbc (provided), zstd-jni (provided) (TIKA-2824)
      
         * Bundle xerces2 with tika-parsers (TIKA-2802).
      
         * Upgrade jaxb to 2.3.2 (TIKA-2819).
      
         * Upgrade jackson to 2.9.8 (TIKA-2717).
      
         * Update tika-eval's common tokens lists (TIKA-2822).
      
         * Handle bad tags in tika-eval more robustly (TIKA-2810).
      
         * Add reports for tags in tika-eval (TIKA-2809).
      
         * Extract text from SDT element within textboxes in .docx files (TIKA-2807).
      
         * Try to handle truncated OOXML files more robustly (TIKA-2765).
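
      The markLimit change for container detection (TIKA-2849) is the entry most likely to affect callers, so here is a minimal sketch of the general idea of streaming detection bounded by a mark limit. It uses only java.io and is not Tika's implementation: the class MarkLimitSketch and the method looksLikeZip are hypothetical names, and the legacy markLimit=-1 spool-to-file path is deliberately not modelled. It only shows that the detector peeks at no more than markLimit bytes and then rewinds the stream so that a parser can still read it from the start.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical illustration of mark/reset-bounded detection; not part of Tika.
class MarkLimitSketch {

    // ZIP local file header magic bytes: 'P', 'K', 0x03, 0x04.
    private static final byte[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    /**
     * Peeks at up to markLimit bytes of the stream, reports whether it starts
     * with the ZIP magic, and rewinds the stream before returning.
     * Negative limits (the legacy behaviour in the changelog) are rejected
     * because this sketch does not implement spooling to a file.
     */
    static boolean looksLikeZip(InputStream in, int markLimit) throws IOException {
        if (markLimit < 0) {
            throw new IllegalArgumentException("legacy spool-to-file mode is not modelled here");
        }
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(markLimit);
        try {
            // Never read more than markLimit bytes, otherwise reset() may fail.
            byte[] head = new byte[Math.min(markLimit, ZIP_MAGIC.length)];
            int read = 0;
            while (read < head.length) {
                int n = in.read(head, read, head.length - read);
                if (n < 0) {
                    break; // end of stream before the buffer was filled
                }
                read += n;
            }
            if (read < ZIP_MAGIC.length) {
                return false; // not enough bytes within the limit to decide
            }
            for (int i = 0; i < ZIP_MAGIC.length; i++) {
                if (head[i] != ZIP_MAGIC[i]) {
                    return false;
                }
            }
            return true;
        } finally {
            in.reset(); // leave the stream positioned where the caller handed it over
        }
    }
}
```

      In this sketch a limit that is too small simply yields "not detected", which loosely mirrors the changelog's note that detection is skipped when the data exceeds markLimit. Real container detection inspects far more than the first four bytes.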
      

            People

              Assignee: Thomas Mortagne (tmortagne)
              Reporter: Thomas Mortagne (tmortagne)