Details

Type: Task
Resolution: Fixed
Priority: Major
Fix Version/s: 9.0-rc-1
Affects Version/s: 7.4
Component/s: Dependency Upgrades
Labels:
None

Difficulty:
Unknown
Documentation:
N/A
Documentation in Release Notes:
http://www.xwiki.org/xwiki/bin/view/ReleaseNotes/Data/XWiki/9.0RC1/#HUpgrades

Description

See http://www.apache.org/dist/tika/CHANGES-1.14.txt

Release 1.14 - 10/19/2016

  * Extract all headers from MSG/RFC822 (TIKA-2122).

  * Upgrade metadata-extractor to 2.9.1 (TIKA-2113).

  * Extract PDF DocInfo metadata into separate keys to prevent
    overwriting by XMP metadata (TIKA-2057).

  * Re-enable fileUrl for tika-server (TIKA-2081).  If you choose
    to use this feature, beware of the security vulnerabilities!
    See: https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2015-3271

  * Add Tesseract's hOCR output format as an option, via Eric Pugh
    (TIKA-2093)

  * Extract macros from MSOffice files (TIKA-2069).

  * Maintain passed-in mime in TXTParser (TIKA-2047).

  * Upgrade to POI 3.15 (TIKA-2013).

  * Upgrade to PDFBox 2.0.3 (TIKA-2051).

  * Fix hyperlinks with formatting in DOC and DOCX (TIKA-1255
    and TIKA-2078)

  * Tika is now integrated with the TensorFlow library from Google
    and it can use its Inception v3 image classification model to 
    identify objects in images (TIKA-1993).

  * Parser configuration is now type-safe and parameters for parsers
    can have assigned types (TIKA-1508, TIKA-1986).

  * Prevent OOM/permanent hang on some corrupt CHM files (TIKA-2040).

  * Upgrade ICU4J charset detection components to fix multithreading
    bug (TIKA-2041).

  * Upgrade to Jackcess 2.1.4 (TIKA-2039).

  * Maintain more significant digits in cells of "General" format
    in XLS and XLSX (TIKA-2025).

  * Avoid mark/reset issues when extracting or detecting embedded resources
    in RFC822 emails (TIKA-2037).

  * Improve accuracy of Tesseract for better extraction of numeric
    and alphanumeric text from images (TIKA-2021, TIKA-2031).

  * Improve extraction of embedded documents from PPT, PPTX and XLSX
    (TIKA-2026).

  * Add parser for applefile (AppleSingle) (TIKA-2022).

  * Add mime types, mime magic and/or globs for:
     * Endnote Import File (TIKA-2011)
     * DJVU files (TIKA-2009)
     * MS Owner File (TIKA-2008)
     * Windows Media Metafile (TIKA-2004)
     * iCal and vCalendar (TIKA-2006)
     * MBOX (TIKA-2042)
     * Stata DTA (TIKA-2064)

  * Add configurable maximum threshold for number of events extracted
    from the XMP Media Management Schema in JempboxExtractor (TIKA-1999).

  * Integrate TesseractOCR with full page image rendering for PDFs (TIKA-1994).

  * Add mime detection (via Nick C) and parser for DBF files (TIKA-1513).
  
  * Add mime detection and parsers for MSOffice 2003 XML Word
    and Excel formats (TIKA-1958).

  * Extract hyperlinks from PPT, PPTX, XLSX (TIKA-1454).

  * Upgrade to Commons Compress 1.12 (supports progress on TIKA-1358)

People

Assignee: Vincent Massol

Reporter: Vincent Massol

Votes: 0

Watchers: 1

Dates

Created: 10/Nov/16 09:33

Updated: 10/Nov/16 09:38

Resolved: 10/Nov/16 09:38