  1. XWiki Platform
  2. XWIKI-15236

Upgrade to Tika 1.18

Details

    • Type: Task
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: 10.4-rc-1
    • Affects Version/s: 10.3
    • Component/s: Dependency Upgrades
    • None
    • Unknown
    • N/A

    Description

      See https://github.com/apache/tika/blob/1.18/CHANGES.txt

         * Upgrade Jackson to 2.9.5 (TIKA-2634).
      
         * Add support for brotli (TIKA-2621).
      
         * Upgrade PDFBox to 2.0.9 and include new jbig2-imageio
           from org.apache.pdfbox (TIKA-2579 and TIKA-2607).
      
         * Support for TIFF images in PDF files (TIKA-2338)
         
         * Detection of fully encrypted 7z files (TIKA-2568)
      
         * Various new mimes and typo fixes in tika-mimetypes.xml
           via Andreas Meier (TIKA-2527).
      
         * Revert to listenForAllRecords=false in ExcelExtractor
           via Grigoriy Alekseev (TIKA-2590)
      
         * Add workaround to identify TIFFs that might confuse
           commons-compress's tar detection via Daniel Schmidt
           (TIKA-2591)
      
         * Ignore non-IANA supported charsets in HTML meta-headers
           during charset detection in HTMLEncodingDetector
           via Andreas Meier (TIKA-2592)
      
         * Add detection and parsing of zstd (if user provides
           com.github.luben:zstd-jni) via Andreas Meier (TIKA-2576)
      
         * Allow for RFC822 detection for files starting with "dkim-"
           and/or "x-" via Andreas Meier (TIKA-2578 and TIKA-2587)
      
         * Extract xlsx files embedded in OLE objects within PPT and PPTX
           via Brian McColgan (TIKA-2588).
      
         * Extract files embedded in HTML and javascript inside HTML
           that are stored in the Data URI scheme (TIKA-2563).
      
         * Extract text from grouped text boxes in PPT (TIKA-2569).
      
         * Extract language metadata item from PDF files via Matt Sheppard (TIKA-2559)
      
         * RFC822 with multipart/mixed, first text element should be treated
           as the main body of the email, not an attachment (TIKA-2547).
      
         * Swap out com.tdunning:json for com.github.openjson:openjson to avoid
           jar conflicts (TIKA-2556).
      
         * No longer hardcode HtmlParser for XML files in tika-server (TIKA-2551).
      
         * Require Java 8 (TIKA-2553).
      
         * Add a parser for XPS (TIKA-2524).
      
         * Mime magic for Dolby Digital AC3 and EAC3 files
      
         * Fixed bug where TesseractOCRParser ignores configured ImageMagickPath,
           and set rotation script to ignore Python warnings (TIKA-2509)
      
         * Upgrade geo-apis to 3.0.1 (TIKA-2535).
      
         * Added local Docker image build using dockerfile-maven-plugin to allow
           images to be built from source (TIKA-1518).
      

          People

            Assignee: Thomas Mortagne (tmortagne)
            Reporter: Thomas Mortagne (tmortagne)