  1. XWiki Platform
  2. XWIKI-11669

Upgrade to Tika 1.7





      See http://www.apache.org/dist/tika/CHANGES-1.7.txt.

      Release 1.7 - 1/9/2015
        * Fixed resource leak in OutlookPSTParser that caused TikaException 
          when invoked via AutoDetectParser on Windows (TIKA-1506).
        * HTML tags are properly stripped from content by FeedParser
        * Tika Server support for selecting a single metadata key;
          wrapped MetadataEP into MetadataResource (TIKA-1499).
        * Tika Server support for JSON and XMP views of metadata (TIKA-1497).
        * Tika Parent uses dependency management to keep duplicate 
          dependencies in different modules the same version (TIKA-1384).
        * Upgraded slf4j to version 1.7.7 (TIKA-1496).
        * Tika Server support for RecursiveParserWrapper's JSON output
          (endpoint=rmeta) equivalent to (TIKA-1451's) -J option 
          in tika-app (TIKA-1498).
        * Tika Server support for providing the password for files on a 
          per-request basis through the Password http header (TIKA-1494).
        * Simple support for the BPG (Better Portable Graphics) image format
          (TIKA-1491, TIKA-1495).
        * Prevent exceptions from being thrown for some malformed
          mp3 files (TIKA-1218).
        * Reformat pom.xml files to use two spaces per indent (TIKA-1475).
        * Fix warning of slf4j logger on Tika Server startup (TIKA-1472).
        * Tika CLI and GUI now have option to view JSON rendering of output
          of RecursiveParserWrapper (TIKA-1451).
        * Tika now integrates the Geospatial Data Abstraction Library
          (GDAL) for parsing hundreds of geospatial formats (TIKA-605,
        * ExternalParsers can now use Regexs to specify dynamic keys
        * Thread safety issues in ImageMetadataExtractor were resolved
        * The ForkParser service is now registered in Activator
        * The Rome Library was upgraded to version 1.5 (TIKA-1435).
        * Add markup for files embedded in PDFs (TIKA-1427).
        * Extract files embedded in annotations in PDFs (TIKA-1433).
        * Upgrade to PDFBox 1.8.8 (TIKA-1419, TIKA-1442).
        * Add RecursiveParserWrapper (aka Jukka's and Nick's) 
          RecursiveMetadataParser (TIKA-1329)
        * Add example for how to dump TikaConfig to XML (TIKA-1418).
        * Allow users to specify a tika config file for tika-app (TIKA-1426).
        * PackageParser includes the last-modified date from the archive
          in the metadata, when handling embedded entries (TIKA-1246)
        * Created a new Tesseract OCR Parser to extract text from images.
          Requires installation of Tesseract before use (TIKA-93).
        * Basic parser for older Excel formats, such as Excel 4, 5 and 95,
          which can get simple text, and metadata for Excel 5+95 (TIKA-1490)
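The per-request password support noted above (TIKA-1494) can be sketched from the client side as follows. This is a minimal sketch that only builds the request and does not send it. The header name `Password` comes from the entry above; the server URL `http://localhost:9998/tika`, the use of PUT, and the class and method names are assumptions made for illustration.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

final class PasswordHeaderSketch {
    // Builds (but does not send) a PUT request that carries the document
    // password in the "Password" HTTP header. The URL is supplied by the caller.
    static HttpRequest build(String serverUrl, byte[] document, String password) {
        return HttpRequest.newBuilder(URI.create(serverUrl))
                .header("Password", password)
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }

    public static void main(String[] args) {
        // Assumed endpoint; adjust host, port and path to the actual server.
        HttpRequest request = build("http://localhost:9998/tika",
                "placeholder bytes".getBytes(StandardCharsets.UTF_8), "secret");
        System.out.println(request.method() + " " + request.uri());
        System.out.println(request.headers().map());
    }
}
```

The request would then be sent with `java.net.http.HttpClient`; since the password travels in a plain header, a TLS-protected connection is advisable.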
      Release 1.6 - 08/31/2014
        * Parse output should indicate which Parser was actually used
        * Use the forbidden-apis Maven plugin to check for unsafe Java
          operations (TIKA-1387).
        * Created an ExternalTranslator class to interface with command
          line Translators (TIKA-1385).
        * Created a MosesTranslator as a subclass of ExternalTranslator
          that calls the Moses Decoder machine translation program (TIKA-1385).
        * Created the tika-example module. It will have examples of how to
          use the main Tika interfaces (TIKA-1390).
        * Upgraded to Commons Compress 1.8.1 (TIKA-1275).
        * Upgraded to POI 3.11-beta1 (TIKA-1380).
        * Tika now extracts SDTCell content from tables in .docx files (TIKA-1317).
        * Tika now supports detection of the Persian/Farsi language.
        * The Tika Detector interface is now exposed through the JAX-RS
          server (TIKA-1336).
        * Tika now has support for parsing binary Matlab files as part of 
          our larger effort to increase the number of scientific data formats 
          supported. (TIKA-1327)
        * The Tika Server URLs for the unpacker resources have been changed,
          to bring them under a common prefix (TIKA-1324). The mapping is
          /unpacker/{id} -> /unpack/{id}
          /all/{id}      -> /unpack/all/{id}
        * Added module and core Tika interface for translating text between
          languages and added a default implementation that calls Microsoft's
          translate service (TIKA-1319)
        * Added a Translator implementation that calls Lingo24's Premium
          Machine Translation API (TIKA-1381)
        * Made RTFParser's list handling slightly more robust against corrupt
          list metadata (TIKA-1305)
        * Fixed bug in CLI json output (TIKA-1291/TIKA-1310)
        * Added ability to turn off image extraction from PDFs (TIKA-1294).
          Users must now turn on this capability via the PDFParserConfig.
        * Upgrade to PDFBox 1.8.6 (TIKA-1290, TIKA-1231, TIKA-1233, TIKA-1352)
        * Zip Container Detection for DWFX and XPS formats, which are OPC
          based (TIKA-1204, TIKA-1221)
        * Added a user facing welcome page to the Tika Server, which
          says what it is, and a very brief summary of what is available. 
        * Added Tika Server endpoints to list the available mime types,
          Parsers and Detectors, similar to the --list-<foo> methods on
          the Tika CLI App (TIKA-1270)
        * Improvements to NetCDF and HDF parsing to mimic the output of
          ncdump and extract text dimensions and spatial and variable
          information from scientific data files (TIKA-1265)
        * Extract attachments from RTF files (TIKA-1010)
        * Support Outlook Personal Folders File Format *.pst (TIKA-623)
        * Added mime entries for additional Ogg based formats (TIKA-1259)
        * Updated the Ogg Vorbis plugin to v0.4, which adds detection for a wider
          range of Ogg formats, and parsers for more Ogg Audio ones (TIKA-1113)
        * PDF: Images in PDF documents can now be extracted as embedded resources.
        * Fixed RuntimeException thrown for certain Word Documents (TIKA-1251).
        * CLI: TikaCLI now has another option: --list-parser-details-apt, which outputs
          the list of supported parsers in APT format. This is used to generate the list
          on the formats page (TIKA-411).
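Clients that still call the old Tika Server unpacker URLs need to apply the mapping from TIKA-1324 listed above. The following is a minimal sketch of that rewrite; the class and method names are hypothetical, and only the two path prefixes come from the changelog.

```java
final class UnpackerPaths {
    // Rewrites a pre-1.6 unpacker path to its 1.6 equivalent:
    //   /unpacker/{id} -> /unpack/{id}
    //   /all/{id}      -> /unpack/all/{id}
    // Any other path is returned unchanged.
    static String migrate(String path) {
        if (path.startsWith("/unpacker/")) {
            return "/unpack/" + path.substring("/unpacker/".length());
        }
        if (path.startsWith("/all/")) {
            return "/unpack/all/" + path.substring("/all/".length());
        }
        return path;
    }

    public static void main(String[] args) {
        System.out.println(migrate("/unpacker/42"));
        System.out.println(migrate("/all/42"));
    }
}
```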
      Release 1.5 - 02/04/2014
        * Fixed bug in handling of embedded file processing in PDFs (TIKA-1228).
        * Added SourceCodeParser to support Java, Groovy, C++ files (TIKA-1224).
        * Updated Tika Server to support multipart/form-data payloads (TIKA-1198).
        * Updated Tika Server to CXF 2.7.8 (TIKA-1197).
        * Updated Tika Server to accept requests over wildcard addresses (TIKA-1196).
        * Added option to use alternate NonSequentialPDFParser (TIKA-1201).
        * Content from PDF AcroForms is now extracted (TIKA-973).
        * Fixed invalid asterisks from master slide in PPT (TIKA-1171).
        * Added test cases to confirm handling of auto-date in PPT and PPTX (TIKA-817).
        * Text from tables in PPT files is once again extracted correctly (TIKA-1076).
        * Text is extracted from text boxes in XLSX (TIKA-1100).
        * Tika no longer hangs when processing Excel files with custom fraction format (TIKA-1132).
        * Disconcerting stacktrace from missing beans no longer printed for some DOCX files (TIKA-792).
        * Upgraded POI to 3.10-beta2 (TIKA-1173).
        * Upgraded PDFBox to 1.8.4 (TIKA-1230).
        * Made HtmlEncodingDetector more flexible in finding meta 
          header charset (TIKA-1001).
        * Added sanitized test HTML file for local file test (TIKA-1139).
        * Fixed bug that prevented attachments within a PDF from being processed
          if the PDF itself was an attachment (TIKA-1124).
        * Text from paragraph-level structured document tags in DOCX files is now extracted (TIKA-1130).
        * RTF: Fixed ArrayIndexOutOfBoundsException when parsing list override (TIKA-1192).
        * CLI: TikaCLI now escapes invalid filename characters as hex
          characters (TIKA-1078).
      Release 1.4 - 06/15/2013
        * Removed a test HTML file with a poorly chosen GPL text in it (TIKA-1129).
        * Improvements to tika-server to allow it to produce text/html and
          text/xml content (TIKA-1126, TIKA-1127).
        * Improvements were made to the Compressor Parser to handle gzipped files
          that require the decompressConcatenated option set to true (TIKA-1096).
        * Addressed a typographic error that was preventing detection of
          awk files (TIKA-1081).
        * Added a new end-point to Tika's JAX-RS REST server that only detects
          the media-type based on a small portion of the document submitted
        * RTF: Ordered and unordered lists are now extracted (TIKA-1062).
        * MP3: Audio duration is now extracted (TIKA-991)
        * Java .class files: upgraded from ASM 3.1 to ASM 4.1 for parsing
          the Java bytecodes (TIKA-1053).
        * Mime Types: Definitions extended to optionally include Link (URL) and
          UTI, along with details for several common formats (TIKA-1012 / TIKA-1083)
        * Exceptions when parsing OLE10 embedded documents, when parsing
          summary information from Office documents, and when saving
          embedded documents in TikaCLI are now logged instead
          of aborting extraction (TIKA-1074)
        * MS Word: line tabular character is now replaced with newline
        * XML: ElementMetadataHandlers can now optionally accept duplicate
          and empty values (TIKA-1133)
      Release 1.3 - 01/19/2013
        * Mimetype definitions added for more common programming languages,
          including common extensions, but not magic patterns. (TIKA-1055)
        * MS Word: When a Word (.doc) document contains embedded files or
          links to external documents, Tika now places a <div
          class="embedded" id="_XXX"/> placeholder into the XHTML so you can
          see where in the main text the embedded document occurred
          (TIKA-956, TIKA-1019).  Embedded Wordpad/RTF documents are now
          recognized (TIKA-982).
        * PDF: Text from pop-up annotations is now extracted (TIKA-981).
          Text from bookmarks is now extracted (TIKA-1035).
        * PKCS7: Detached signatures no longer throw NullPointerException
        * iWork: The chart name for charts embedded in numbers documents is
          now extracted (TIKA-918).
        * CLI: TikaCLI -m now handles multi-valued metadata keys correctly
          (previously it only printed the first value).  (TIKA-920)
        * MS Word (.docx): When a Word (.docx) document contains embedded
          files, Tika now places a <div class="embedded" id="XXX"/> into the
          XHTML so you can see where in the main text the embedded document
          occurred.  The id (rId) is included in the Metadata of each
          embedded document as the new Metadata.EMBEDDED_RELATIONSHIP_ID
          key, and TikaCLI prepends the rId (if present) onto the filename
          it extracts (TIKA-989).  Fixed NullPointerException when style is
          null (TIKA-1006).  Text inside text boxes is now extracted
        * RTF: Page, word, character count and creation date metadata are
          now extracted for RTF documents (TIKA-999).
        * MS PowerPoint (.pptx): When a PowerPoint (.pptx) document contains
          embedded files, Tika now places a <div class="embedded" id="XXX"/> into the
          XHTML so you can see where in the main text the embedded document
          occurred.  The id (rId) is included in the Metadata of each
          embedded document as the new Metadata.EMBEDDED_RELATIONSHIP_ID
          key, and TikaCLI prepends the rId (if present) onto the filename
          it extracts (TIKA-997, TIKA-1032).
        * MS PowerPoint (.ppt): When a PowerPoint (.ppt) document contains
          embedded files, Tika now places a <div class="embedded" id="XXX"/> into the
          XHTML so you can see where in the main text the embedded document
          occurred (TIKA-1025).  Text from the master slide is now extracted
        * MHTML: fixed Null charset name exception when a mime part has an
          unrecognized charset (TIKA-1011).
        * MP3: if an ID3 tag was encoded in UTF-16 with only the BOM then on
          certain JVMs this would incorrectly extract the BOM as the tag's
          value (TIKA-1024).
        * ZIP: placeholders (<div class="embedded" id="<entry name>"/>) are
          now left in the XHTML so you can see where each archive member
          appears (TIKA-1036). TikaCLI would hit FileNotFoundException when
          extracting files that were under sub-directories from a ZIP
          archive, because it failed to create the parent directories first
        * XML: a space character is now added before each element
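Several 1.3 entries above describe `<div class="embedded" id="..."/>` placeholders left in the XHTML output. A consumer that wants to know which embedded documents were referenced, and in what order, could collect the ids as sketched below. This is an illustration only: it assumes the placeholder is serialized exactly in the self-closing form quoted in the changelog, with `class` before `id`, and the class and method names are hypothetical. A real consumer would more likely inspect the SAX events than scan serialized text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EmbeddedPlaceholders {
    // Matches <div class="embedded" id="..."/> and captures the id value.
    private static final Pattern PLACEHOLDER =
            Pattern.compile("<div\\s+class=\"embedded\"\\s+id=\"([^\"]*)\"\\s*/>");

    // Returns the placeholder ids in document order.
    static List<String> ids(String xhtml) {
        List<String> result = new ArrayList<>();
        Matcher matcher = PLACEHOLDER.matcher(xhtml);
        while (matcher.find()) {
            result.add(matcher.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String xhtml = "<p>Intro</p><div class=\"embedded\" id=\"rId7\"/><p>Outro</p>";
        System.out.println(ids(xhtml));
    }
}
```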
      Release 1.2 - 07/10/2012
        * Tika's JAX-RS based Network server now is based on Apache CXF,
          which is available in Maven Central and now allows the server
          module to be packaged and included in our release
          (TIKA-593, TIKA-901).
        * Tika: parseToString now lets you specify the max string length
          per-call, in addition to per-Tika-instance. (TIKA-870)
        * Tika now has the ability to detect FITS (Flexible Image Transport System) 
          files (TIKA-874).
        * Images: Fixed file handle leak in ImageParser. (TIKA-875)
        * iWork: Comments in Pages files are now extracted (TIKA-907).
          Headers, footers and footnotes in Pages files are now extracted
          (TIKA-906).  Don't throw NullPointerException on password
          protected iWork files, even though we can't parse their contents
          yet (TIKA-903).  Text extracted from Keynote text boxes and bullet
          points no longer runs together (TIKA-910). Also extract text for
          Pages documents created in layout mode (TIKA-904).  Table names
          are now extracted in Numbers documents (TIKA-924).  Content added
          to master slides is also extracted (TIKA-923).
        * Archive and compression formats: The Commons Compress dependency was
          upgraded from 1.3 to 1.4.1. With this change Tika can now also parse
          Unix dump archives and documents compressed using the XZ and Pack200
          compression formats. (TIKA-932)
        * KML: Tika now has basic support for Keyhole Markup Language documents
          (KML and KMZ) used by tools like Google Earth. See also
          http://www.opengeospatial.org/standards/kml/. (TIKA-941)
        * CLI: You can now use the TIKA_PASSWORD environment variable or the
          --password=X command line option to specify the password that Tika CLI
          should use for opening encrypted documents (TIKA-943).
        * Character encodings: Tika's character encoding detection mechanism was
          improved by adding integration to the juniversalchardet library that
          implements Mozilla's universal charset detection algorithm. The slower
          ICU4J algorithms are still used as a fallback thanks to their wider
          coverage of custom character encodings. (TIKA-322, TIKA-471)
        * Charset parameter: Related to the character encoding improvements
          mentioned above, Tika now returns the detected character encoding as
          a "charset" parameter of the content type metadata field for text/plain
          and text/html documents. For example, instead of just "text/plain", the
          returned content type will be something like "text/plain; charset=UTF-8"
          for a UTF-8 encoded text document. Character encoding information is still
          present also in the content encoding metadata field for backwards
          compatibility, but that field should be considered deprecated. (TIKA-431)
        * Extraction of embedded resources from OLE2 Office Documents, where
          the resource isn't another office document, has been fixed (TIKA-948)
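Because of the charset parameter change above (TIKA-431), code that compared the content type against a bare value such as "text/plain" may now see "text/plain; charset=UTF-8". A minimal sketch of reading the charset from such a value follows; the class and method names are hypothetical, and the parsing is deliberately simple (it does not handle every quoting form a media type may use).

```java
import java.util.Optional;

final class ContentTypeCharset {
    // Extracts the value of the "charset" parameter from a content type
    // such as "text/plain; charset=UTF-8", if one is present.
    static Optional<String> charsetOf(String contentType) {
        String[] parts = contentType.split(";");
        // parts[0] is the media type itself; parameters start at index 1.
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            int eq = param.indexOf('=');
            if (eq > 0 && param.substring(0, eq).trim().equalsIgnoreCase("charset")) {
                String value = param.substring(eq + 1).trim();
                // Strip one pair of surrounding double quotes, if any.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? Optional.empty() : Optional.of(value);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/plain; charset=UTF-8"));
        System.out.println(charsetOf("text/plain"));
    }
}
```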
      Release 1.1 - 3/7/2012
       * Link Extraction: The rel attribute is now extracted from
         links per the LinkContentHandler. (TIKA-824)
       * MP3: Fixed handling of UTF-16 (two byte) ID3v2 tags (previously
         the last character in a UTF-16 tag could be corrupted) (TIKA-793)
       * Performance: Loading of the default media type registry is now
         significantly faster. (TIKA-780)
       * PDF: Allow controlling whether overlapping duplicated text should
         be removed.  Disabling this (the default) can give big
         speedups to text extraction and may work around cases where
         non-duplicated characters were incorrectly removed (TIKA-767).
         Allow controlling whether text tokens should be sorted by their x/y
         position before extracting text (TIKA-612); this is necessary for
         certain PDFs.  Fixed cases where too many </p> tags appear in the
         XHTML output, causing NPE when opening some PDFs with the GUI
       * RTF: Fixed case where a font change would result in processing
         bytes in the wrong font's charset, producing bogus text output
         (TIKA-777).  Don't output whitespace in ignored group states,
         avoiding excessive whitespace output (TIKA-781).  Binary embedded
         content (using \bin control word) is now skipped correctly;
         previously it could cause the parser to incorrectly extract binary
         content as text (TIKA-782).
       * CLI: New TikaCLI option "--list-detectors", which displays the
         mimetype detectors that are available, similar to the existing
         "--list-parsers" option for parsers. (TIKA-785).
       * Detectors: The order of detectors, as supplied via the service
         registry loader, is now controlled. User supplied detectors are
         preferred, then Tika detectors (such as the container aware ones),
         and finally the core Tika MimeTypes is used as a backup. This
         allows for specific, detailed detectors to take preference over
         the default mime magic + filename detector. (TIKA-786)
       * Microsoft Project (MPP): Filetype detection has been fixed,
         and basic metadata (but no text) is now extracted. (TIKA-789)
       * Outlook: fixed NullPointerException in TikaGUI when messages with
         embedded RTF or HTML content were filtered (TIKA-801).
       * Ogg Vorbis and FLAC: Parser added for Ogg Vorbis and FLAC audio
         files, which extract audio metadata and tags (TIKA-747)
       * MP4: Improved mime magic detection for MP4 based formats (including
         QuickTime, MP4 Video and Audio, and 3GPP) (TIKA-851)
       * MP4: Basic metadata extracting parser for MP4 files added, which includes
         limited audio and video metadata, along with the iTunes media metadata
         (such as Artist and Title) (TIKA-852)
       * Document Passwords: A new ParseContext object, PasswordProvider,
         has been added. This provides a way to supply the password for 
         a document during processing. Currently, only password protected
         PDFs and Microsoft OOXML Files are supported. (TIKA-850)
      Release 1.0 - 11/4/2011
      The most notable changes in Tika 1.0 over previous releases are:
       * API: All methods, classes and interfaces that were marked as
         deprecated in Tika 0.10 have been removed to clean up the API
         (TIKA-703). You may need to adjust and recompile client code
         accordingly. The declared OSGi package versions are now 1.0, and
         will thus not resolve for client bundles that still refer to 0.x
         versions (TIKA-565).
       * Configuration: The context class loader of the current thread is
         no longer used as the default for loading configured parser and
         detector classes. You can still pass an explicit class loader
         to the configuration mechanism to get the previous behaviour.
       * OSGi: The tika-core bundle will now automatically pick up and use
         any available Parser and Detector services when deployed to an OSGi
         environment. The tika-parsers bundle provides such services
         for all the supported file formats for which the upstream parser library
         is available. If you don't want to track all the parser libraries as
         separate OSGi bundles, you can use the tika-bundle bundle that packages
         tika-parsers together with all its upstream dependencies. (TIKA-565)
       * RTF: Hyperlinks in RTF documents are now extracted as an <a
         href=...>...</a> element (TIKA-632). The RTF parser is also now
         more robust when encountering too many closing }'s vs. opening {'s
       * MS Word: From Word (.doc) documents we now extract optional hyphen
         as Unicode zero-width space (U+200B), and non-breaking hyphen as
         Unicode non-breaking hyphen (U+2011). (TIKA-711)
       * Outlook: Tika can now also process attachments in Outlook messages.
       * MS Office: Performance of extracting embedded office docs was improved.
       * PDF: The PDF parser now extracts paragraphs within each page 
         (TIKA-742) and  can now optionally extract text from PDF 
         annotations (TIKA-738). There's also an option to enable (the 
         default) or disable auto-space insertion (TIKA-724). 
       * Language detection: Tika can now detect Belarusian, Catalan,
         Esperanto, Galician, Lithuanian (TIKA-582), Romanian, Slovak,
         Slovenian, and Ukrainian (TIKA-681).
       * Java: Tika no longer ships retrotranslated Java 1.4 binaries along
         with the normal ones that work with Java 5 and higher. (TIKA-744)
       * OpenOffice documents: header/footer text is now extracted for text,
         presentation and spreadsheet documents (TIKA-736)
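After the MS Word change above (TIKA-711), text extracted from .doc files may contain U+200B and U+2011. Code that indexes or compares this text may want to normalize those characters first. The following is a minimal sketch of one possible normalization; dropping the zero-width space and mapping the non-breaking hyphen to an ASCII hyphen is an assumption about what a consumer wants, not something Tika does, and the class name is hypothetical.

```java
final class HyphenNormalizer {
    private static final String ZERO_WIDTH_SPACE = "\u200B";   // emitted for optional hyphens
    private static final char NON_BREAKING_HYPHEN = '\u2011';  // emitted for non-breaking hyphens

    // Removes zero-width spaces and replaces non-breaking hyphens with '-'.
    static String normalize(String text) {
        return text.replace(ZERO_WIDTH_SPACE, "")
                   .replace(NON_BREAKING_HYPHEN, '-');
    }

    public static void main(String[] args) {
        System.out.println(normalize("co\u200Boperate non\u2011breaking"));
    }
}
```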
      Tika 1.0 relies on the following set of major dependencies (generated using
      mvn dependency:tree from tika-parsers):
         +- org.apache.tika:tika-core:jar:1.0:compile
         +- edu.ucar:netcdf:jar:4.2-min:compile
         |  \- org.slf4j:slf4j-api:jar:1.5.6:compile
         +- org.apache.james:apache-mime4j-core:jar:0.7:compile
         +- org.apache.james:apache-mime4j-dom:jar:0.7:compile
         +- org.apache.commons:commons-compress:jar:1.3:compile
         +- commons-codec:commons-codec:jar:1.5:compile
         +- org.apache.pdfbox:pdfbox:jar:1.6.0:compile
         |  +- org.apache.pdfbox:fontbox:jar:1.6.0:compile
         |  +- org.apache.pdfbox:jempbox:jar:1.6.0:compile
         |  \- commons-logging:commons-logging:jar:1.1.1:compile
         +- org.bouncycastle:bcmail-jdk15:jar:1.45:compile
         +- org.bouncycastle:bcprov-jdk15:jar:1.45:compile
         +- org.apache.poi:poi:jar:3.8-beta4:compile
         +- org.apache.poi:poi-scratchpad:jar:3.8-beta4:compile
         +- org.apache.poi:poi-ooxml:jar:3.8-beta4:compile
         |  +- org.apache.poi:poi-ooxml-schemas:jar:3.8-beta4:compile
         |  |  \- org.apache.xmlbeans:xmlbeans:jar:2.3.0:compile
         |  \- dom4j:dom4j:jar:1.6.1:compile
         +- org.apache.geronimo.specs:geronimo-stax-api_1.0_spec:jar:1.0.1:compile
         +- org.ccil.cowan.tagsoup:tagsoup:jar:1.2.1:compile
         +- asm:asm:jar:3.1:compile
         +- com.drewnoakes:metadata-extractor:jar:2.4.0-beta-1:compile
         +- de.l3s.boilerpipe:boilerpipe:jar:1.1.0:compile
         +- rome:rome:jar:0.9:compile
            \- jdom:jdom:jar:1.0:compile
      The following people have contributed to Tika 1.0 by submitting or commenting
      on the issues resolved in this release:
      Andrzej Bialecki
      Antoni Mylka
      Benson Margulies
      Chris A. Mattmann
      Cristian Vat
      Dave Meikle
      David Smiley
      Dennis Adler
      Erik Hetzner
      Ingo Renner
      Jeremias Maerki
      Jeremy Anderson
      Jeroen van Vianen
      John Bartak
      Jukka Zitting
      Julien Nioche
      Ken Krugler
      Mark Butler
      Maxim Valyanskiy
      Michael Bryant
      Michael McCandless 
      Nick Burch
      Pablo Queixalos
      Uwe Schindler
      Žygimantas Medelis
      See http://s.apache.org/Zk6 for more details on these contributions.
      Release 0.10 - 09/25/2011
      The most notable changes in Tika 0.10 over previous releases are:
       * A parser for CHM help files was added. (TIKA-245)
       * TIKA-698: Invalid characters are now replaced with the Unicode
         replacement character (U+FFFD), whereas before such characters were
         replaced with spaces, so you may need to change your processing of
         Tika's output to now handle U+FFFD.
       * The RTF parser was rewritten to perform its own direct shallow
         parse of the RTF content, instead of using RTFEditorKit from
         javax.swing.  This fixes several issues in the old parser,
         including doubling of Unicode characters in certain cases
         (TIKA-683), exceptions on mal-formed RTF docs (TIKA-666), and
         missing text from some elements (header/footer, hyperlinks,
         footnotes, text inside pictures).
       * Handling of temporary files within Tika was much improved
         (TIKA-701, TIKA-654, TIKA-645, TIKA-153)
       * The Tika GUI got a facelift and some extra features (TIKA-635)
       * The apache-mime4j dependency of the email message parser was upgraded
         from version 0.6 to 0.7 (TIKA-716). The parser also now accepts a
         MimeConfig object in the ParseContext as configuration (TIKA-640).
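The TIKA-698 entry above means downstream code may now see U+FFFD where it previously saw spaces. The sketch below shows two ways a consumer might react: counting replacement characters to flag damaged input, or mapping them back to spaces to approximate the pre-0.10 output. The class and method names are hypothetical, and which reaction is appropriate depends on the application.

```java
final class ReplacementChars {
    static final char REPLACEMENT = '\uFFFD';

    // Counts how many replacement characters the extracted text contains.
    static long count(String text) {
        return text.chars().filter(c -> c == REPLACEMENT).count();
    }

    // Maps each replacement character back to a space, approximating
    // the output that older Tika versions produced.
    static String toSpaces(String text) {
        return text.replace(REPLACEMENT, ' ');
    }

    public static void main(String[] args) {
        String extracted = "caf\uFFFD au lait";
        System.out.println(count(extracted));
        System.out.println(toSpaces(extracted));
    }
}
```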
      Tika 0.10 relies on the following set of major dependencies (generated using
      mvn dependency:tree from tika-parsers):
         +- org.apache.tika:tika-core:jar:0.10:compile
         +- edu.ucar:netcdf:jar:4.2-min:compile
         |  \- org.slf4j:slf4j-api:jar:1.5.6:compile
         +- org.apache.james:apache-mime4j-core:jar:0.7:compile
         +- org.apache.james:apache-mime4j-dom:jar:0.7:compile
         +- org.apache.commons:commons-compress:jar:1.1:compile
         +- commons-codec:commons-codec:jar:1.4:compile
         +- org.apache.pdfbox:pdfbox:jar:1.6.0:compile
         |  +- org.apache.pdfbox:fontbox:jar:1.6.0:compile
         |  +- org.apache.pdfbox:jempbox:jar:1.6.0:compile
         |  \- commons-logging:commons-logging:jar:1.1.1:compile
         +- org.bouncycastle:bcmail-jdk15:jar:1.45:compile
         +- org.bouncycastle:bcprov-jdk15:jar:1.45:compile
         +- org.apache.poi:poi:jar:3.8-beta4:compile
         +- org.apache.poi:poi-scratchpad:jar:3.8-beta4:compile
         +- org.apache.poi:poi-ooxml:jar:3.8-beta4:compile
         |  +- org.apache.poi:poi-ooxml-schemas:jar:3.8-beta4:compile
         |  |  \- org.apache.xmlbeans:xmlbeans:jar:2.3.0:compile
         |  \- dom4j:dom4j:jar:1.6.1:compile
         +- org.apache.geronimo.specs:geronimo-stax-api_1.0_spec:jar:1.0.1:compile
         +- org.ccil.cowan.tagsoup:tagsoup:jar:1.2.1:compile
         +- asm:asm:jar:3.1:compile
         +- com.drewnoakes:metadata-extractor:jar:2.4.0-beta-1:compile
         +- de.l3s.boilerpipe:boilerpipe:jar:1.1.0:compile
         +- rome:rome:jar:0.9:compile
            \- jdom:jdom:jar:1.0:compile
      The following people have contributed to Tika 0.10 by submitting or commenting
      on the issues resolved in this release:
         Alain Viret
         Alex Ott
         Alexander Chow
         Andreas Kemkes
         Andrew Khoury
         Babak Farhang
         Benjamin Douglas
         Benson Margulies
         Chris A. Mattmann
         chris hudson
         Chris Lott
         Cristian Vat
         Curt Arnold
         Cynthia L Wong
         Dave Brosius
         David Benson
         Enrico Donelli
         Erik Hetzner
         Erna de Groot
         Gabriele Columbro
         Geoff Jarrad
         Gregory Kanevsky
         gunter rombauts
         Henning Gross
         Henri Bergius
         Ingo Renner
         Ingo Wiarda
         Izaak Alpert
         Jan Høydahl
         Jens Wilmer
         Jeremy Anderson
         Joseph Vychtrle
         Joshua Turner
         Jukka Zitting
         Julien Nioche
         Karl Heinz Marbaise
         Ken Krugler
         Kostya Gribov
         Luciano Leggieri
         Mads Hansen
         Mark Butler
         Matt Sheppard
         Maxim Valyanskiy
         Michael McCandless
         Michael Pisula
         Murad Shahid
         Nick Burch
         Oleg Tikhonov
         Pablo Queixalos
         Paul Jakubik
         Raimund Merkert
         Rajiv Kumar
         Robert Trickey
         Sami Siren
         Selva Ganesan
         Sjoerd Smeets
         Stephen Duncan Jr
         Tran Nam Quang
         Uwe Schindler
         Vitaliy Filippov
      See http://s.apache.org/vR for more details on these contributions.
      Release 0.9 - 02/13/2011
      The most notable changes in Tika 0.9 over previous releases are:
       * A critical bug that prevented metadata from printing to the
         command line when the underlying Parser didn't generate
         XHTML output was fixed. (TIKA-596)
       * The 0.8 version of Tika included a NetCDF jar file that pulled 
         in tremendous amounts of redundant dependencies. This has 
         been addressed in Tika 0.9 by republishing a minimal NetCDF 
         jar and changing Tika to depend on that. (TIKA-556)
       * MIME detection for iWork, and OpenXML documents has been 
         improved. (TIKA-533, TIKA-562, TIKA-588)
       * A critical backwards incompatible bug in PDF parsing that 
         was introduced in Tika 0.8 has been fixed. (TIKA-548)
       * Support for forked parsing in separate processes was added. 
       * Tika's language identifier now supports the Lithuanian 
         language. (TIKA-582)
      Tika 0.9 relies on the following set of major dependencies (generated using
      mvn dependency:tree from tika-parsers):
         +- org.apache.tika:tika-core:jar:0.9:compile
         +- edu.ucar:netcdf:jar:4.2-min:compile
         |  \- org.slf4j:slf4j-api:jar:1.5.6:compile
         +- commons-httpclient:commons-httpclient:jar:3.1:compile
         |  +- commons-logging:commons-logging:jar:1.1.1:compile (version managed from 1.0.4)
         |  \- commons-codec:commons-codec:jar:1.2:compile
         +- org.apache.james:apache-mime4j:jar:0.6:compile
         +- org.apache.commons:commons-compress:jar:1.1:compile
         +- org.apache.pdfbox:pdfbox:jar:1.4.0:compile
         |  +- org.apache.pdfbox:fontbox:jar:1.4.0:compile
         |  \- org.apache.pdfbox:jempbox:jar:1.4.0:compile
         +- org.bouncycastle:bcmail-jdk15:jar:1.45:compile
         +- org.bouncycastle:bcprov-jdk15:jar:1.45:compile
         +- org.apache.poi:poi:jar:3.7:compile
         +- org.apache.poi:poi-scratchpad:jar:3.7:compile
         +- org.apache.poi:poi-ooxml:jar:3.7:compile
         |  +- org.apache.poi:poi-ooxml-schemas:jar:3.7:compile
         |  |  \- org.apache.xmlbeans:xmlbeans:jar:2.3.0:compile
         |  \- dom4j:dom4j:jar:1.6.1:compile
         +- org.apache.geronimo.specs:geronimo-stax-api_1.0_spec:jar:1.0.1:compile
         +- org.ccil.cowan.tagsoup:tagsoup:jar:1.2:compile
         +- asm:asm:jar:3.1:compile
         +- com.drewnoakes:metadata-extractor:jar:2.4.0-beta-1:compile
         +- de.l3s.boilerpipe:boilerpipe:jar:1.1.0:compile
         +- rome:rome:jar:0.9:compile
            \- jdom:jdom:jar:1.0:compile
      The following people have contributed to Tika 0.9 by submitting or commenting
      on the issues resolved in this release:
         Alex Skochin
         Alexander Chow
         Antoine L.
         Antoni Mylka
         Benjamin Douglas
         Benson Margulies
         Chris A. Mattmann
         Cristian Vat
         Cyriel Vringer
         David Benson
         Erik Hetzner
         Gabriel Miklos
         Geoff Jarrad
         Jukka Zitting
         Ken Krugler
         Kostya Gribov
         Leszek Piotrowicz
         Martijn van Groningen
         Maxim Valyanskiy
         Michel Tremblay
         Nick Burch
         Paul Pearcy
         Peter van Raamsdonk
         Piotr Bartosiewicz
         Reinhard Schwab
         Scott Severtson
         Shinsuke Sugaya
         Staffan Olsson
         Steve Kearns
         Tom Klonikowski
         Žygimantas Medelis
      See http://s.apache.org/qi for more details on these contributions.
      Release 0.8 - 11/07/2010
      The most notable changes in Tika 0.8 over previous releases are:
       * Language identification is now dynamically configurable, 
         managed via a config file loaded from the classpath. (TIKA-490)
       * Tika now supports parsing Feeds by wrapping the underlying
         Rome library. (TIKA-466)
       * A quick-start guide for Tika parsing was contributed. (TIKA-464)
       * An approach for plumbing through XHTML attributes was added. (TIKA-379)
       * Media type hierarchy information is now taken into account when
         selecting the best parser for a given input document. (TIKA-298)
       * Support for parsing common scientific data formats including netCDF
         and HDF4/5 was added (TIKA-400 and TIKA-399).
       * Unit tests for Windows have been fixed, allowing TestParsers
         to complete. (TIKA-398)
      Tika 0.8 relies on the following set of major dependencies (generated using
      mvn dependency:tree from tika-parsers):
         +- org.apache.tika:tika-core:jar:0.8:compile
         +- edu.ucar:netcdf:jar:4.2:compile
         |  \- org.slf4j:slf4j-api:jar:1.5.6:compile
         +- commons-httpclient:commons-httpclient:jar:3.1:compile
         |  +- commons-logging:commons-logging:jar:1.1.1:compile (version managed from 1.0.4)
         |  \- commons-codec:commons-codec:jar:1.2:compile
         +- org.apache.commons:commons-compress:jar:1.1:compile
         +- org.apache.pdfbox:pdfbox:jar:1.3.1:compile
         |  +- org.apache.pdfbox:fontbox:jar:1.3.1:compile
         |  \- org.apache.pdfbox:jempbox:jar:1.3.1:compile
         +- org.bouncycastle:bcmail-jdk15:jar:1.45:compile
         +- org.bouncycastle:bcprov-jdk15:jar:1.45:compile
         +- org.apache.poi:poi:jar:3.7:compile
         +- org.apache.poi:poi-scratchpad:jar:3.7:compile
         +- org.apache.poi:poi-ooxml:jar:3.7:compile
         |  +- org.apache.poi:poi-ooxml-schemas:jar:3.7:compile
         |  |  \- org.apache.xmlbeans:xmlbeans:jar:2.3.0:compile
         |  \- dom4j:dom4j:jar:1.6.1:compile
         +- org.apache.geronimo.specs:geronimo-stax-api_1.0_spec:jar:1.0.1:compile
         +- org.ccil.cowan.tagsoup:tagsoup:jar:1.2:compile
         +- asm:asm:jar:3.1:compile
         +- com.drewnoakes:metadata-extractor:jar:2.4.0-beta-1:compile
         +- de.l3s.boilerpipe:boilerpipe:jar:1.1.0:compile
         +- rome:rome:jar:0.9:compile
            \- jdom:jdom:jar:1.0:compile
      The following people have contributed to Tika 0.8 by submitting or commenting
      on the issues resolved in this release:
         Łukasz Wiktor
         Adam Wilmer
         Alex Baranau
         Alex Ott
         André Ricardo
         Andrey Barhatov
         Andrey Sidorenko
         Antoni Mylka
         Arturo Beltran
         Attila Király
         Brad Greenlee
         Bruno Dumon
         Chris A. Mattmann
         Chris Bamford
         Christophe Gourmelon
         Dave Meikle
         David Weekly
         Dmitry Kuzmenko
         Erik Hetzner
         Geoff Jarrad
         Gerd Bremer
         Grant Ingersoll
         Jan Høydahl
         Jean-Philippe Ricard
         Jeremias Maerki
         Joao Garcia
         Jukka Zitting
         Julien Nioche
         Ken Krugler
         Liam O'Boyle
         Mads Hansen
         Marcel May
         Markus Goldbach
         Martijn van Groningen
         Maxim Valyanskiy
         Mike Hays
         Miroslav Pokorny
         Nick Burch
         Otis Gospodnetic
         Peter van Raamsdonk
         Peter Wolanin
         Piotr Bartosiewicz
         Rajiv Kumar
         Reinhard Schwab
         rick cameron
         Robert Muir
         Sanjeev Rao
         Simon Tyler
         Sjoerd Smeets
         Slavomir Varchula
         Staffan Olsson
         Tom De Leu
         Uwe Schindler
         Victor Kazakov
      See http://s.apache.org/ab0 for more details on these contributions.
      Release 0.7 - 3/31/2010
      The most notable changes in Tika 0.7 over previous releases are:
       * MP3 file parsing was improved, including Channel and SampleRate 
         extraction and ID3v2 support (TIKA-368, TIKA-372). Further, audio
         parsing mime detection was also improved for the MIDI format. (TIKA-199)
       * Tika no longer relies on X11 for its RTF parsing functionality. (TIKA-386)
       * A thread-safety bug in the AutoDetectParser was discovered and
         addressed. (TIKA-374)
       * Upgrade to PDFBox 1.0.0. The new PDFBox version improves PDF parsing
         performance and fixes a number of text extraction issues. (TIKA-380)
      The following people have contributed to Tika 0.7 by submitting or commenting
      on the issues resolved in this release:
         Adam Rauch
         Benson Margulies
         Brett S.
         Chris A. Mattmann
         Daan de Wit
         Dave Meikle
         Ingo Renner
         Jukka Zitting
         Ken Krugler
         Kenny Neal
         Markus Goldbach
         Maxim Valyanskiy
         Nick Burch
         Sami Siren
         Uwe Schindler
      See http://tinyurl.com/yklopby for more details on these contributions.
      Release 0.6 - 01/20/2010
      The most notable changes in Tika 0.6 over the previous release are:
       * Mime-type detection for HTML (and all types) has been improved. Malformed
         HTML files, and HTML files that require more observed content before
         the type can be detected, are now correctly identified by
         the AutoDetectParser. (TIKA-327, TIKA-357, TIKA-366, TIKA-367)
       * Tika now has an additional OSGi bundle packaging that includes all the
         required parser libraries. This bundle package makes it easy to use all
         Tika features in an OSGi environment. (TIKA-340, TIKA-342)
       * The Apache POI dependency used for parsing Microsoft Office file formats
         has been upgraded to version 3.6. The most visible improvement in this
         version is the notably reduced ooxml jar file size. The tika-app jar size
         is now down to 15MB from the 25MB in Tika 0.5. (TIKA-353)
       * Handling of character encoding information in input metadata and HTML
         <meta> tags has been improved. When no applicable encoding information is
         available, the encoding is detected by looking at the input data.
         (TIKA-332, TIKA-334, TIKA-335, TIKA-341) 
       * Some document types like Excel spreadsheets contain content like
         numbers or formulas whose exact text format depends on the current locale.
         So far Tika has used the platform default locale in such cases, but
         clients can now explicitly specify the locale by passing a Locale instance
         in the parse context. (TIKA-125)
       * The default text output encoding of the tika-app jar is now UTF-8
         when running on Mac OS X. This is because the default encoding used
         by Java is not compatible with the console application in Mac OS X.
         On all other platforms the text output from tika-app still uses
         the platform default encoding. (TIKA-324)
       * A flash video (video/x-flv) parser has been added. (TIKA-328)
       * Handling of Number and Date cell formatting within Microsoft Excel
         documents has been added. This includes currencies, percentages and
         scientific formats. (TIKA-103)
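
The locale item above (TIKA-125) is easier to picture with a concrete case. The following is a minimal sketch using only the JDK, not Tika itself; the class and method names are hypothetical. It shows why the same numeric cell value can produce different extracted text depending on the locale in effect, which is the reason a client may want to pass an explicit Locale in the parse context.

```java
import java.text.NumberFormat;
import java.util.Locale;

// Hypothetical helper, not part of Tika: illustrates locale-dependent text.
final class LocaleFormattingDemo {
    /** Formats a numeric cell value the way the given locale displays it. */
    static String formatCell(double value, Locale locale) {
        return NumberFormat.getNumberInstance(locale).format(value);
    }

    public static void main(String[] args) {
        double cell = 1234.5;
        // Same value, different grouping and decimal separators.
        System.out.println(formatCell(cell, Locale.US));
        System.out.println(formatCell(cell, Locale.GERMANY));
    }
}
```

Without an explicit locale, the result depends on the platform default, so the same spreadsheet can yield different text on different machines.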
      The following people have contributed to Tika 0.6 by submitting or commenting
      on the issues resolved in this release:
         Andrzej Bialecki
         Bertrand Delacretaz
         Chris A. Mattmann
         Dave Meikle
         Erik Hetzner
         Felix Meschberger
         Jukka Zitting
         Julien Nioche
         Ken Krugler
         Luke Nezda
         Maxim Valyanskiy
         Niall Pemberton
         Peter Wolanin
         Piotr B.
         Sami Siren
         Yuan-Fang Li
      See http://tinyurl.com/yc3dk67 for more details on these contributions.
      Release 0.5 - 11/14/2009
      The most notable changes in Tika 0.5 over the previous release are:
       * Improved RDF/OWL mime detection using both MIME magic as well as
         pattern matching (TIKA-309)
       * An org.apache.tika.Tika facade class has been added to simplify common
         text extraction and type detection use cases. (TIKA-269)
       * A new parse context argument was added to the Parser.parse() method.
         This context map can be used to pass things like a delegate parser or
         other settings to the parsing process. The previous parse() method
         signature has been deprecated and will be removed in Tika 1.0. (TIKA-275)
       * A simple ngram-based language detection mechanism has been added along
         with predefined language profiles for 18 languages. (TIKA-209)
       * The media type registry in Tika was synchronized with the MIME type
         configuration in the Apache HTTP Server. Tika now knows about 1274
         different media types and can detect 672 of those using 927 file
         extension and 280 magic byte patterns. (TIKA-285)
       * Tika now uses the Apache PDFBox version 0.8.0-incubating for parsing PDF
         documents. This version is notably better than the 0.7.3 release used
         earlier. (TIKA-158)
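
As a rough illustration of what ngram-based language detection (TIKA-209) involves, here is a minimal sketch in plain Java. It is not Tika's implementation: the class name, the choice of trigrams, and the cosine-similarity scoring are assumptions made for the example, and the reference profiles would normally be built from much larger text samples.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of trigram-based language guessing; not Tika code.
final class NgramLanguageSketch {
    /** Counts overlapping character trigrams in lower-cased text. */
    static Map<String, Integer> profile(String text) {
        String normalized = text.toLowerCase(Locale.ROOT);
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 3 <= normalized.length(); i++) {
            counts.merge(normalized.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    /** Cosine similarity between two trigram profiles, in the range [0, 1]. */
    static double similarity(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += e.getValue() * (double) e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * (double) other;
            }
        }
        for (int v : b.values()) {
            normB += v * (double) v;
        }
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    /** Returns the key of the reference profile most similar to the text. */
    static String detect(String text, Map<String, Map<String, Integer>> references) {
        Map<String, Integer> textProfile = profile(text);
        String best = null;
        double bestScore = -1;
        for (Map.Entry<String, Map<String, Integer>> e : references.entrySet()) {
            double score = similarity(textProfile, e.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }
}
```

A caller would build one reference profile per language up front and then call `detect` for each input text.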
      The following people have contributed to Tika 0.5 by submitting or commenting
      on the issues resolved in this release:
         Alex Baranov
         Bart Hanssens
         Benson Margulies
         Chris A. Mattmann
         Daan de Wit
         Erik Hetzner
         Frank Hellwig
         Jeff Cadow
         Joachim Zittmayr
         Jukka Zitting
         Julien Nioche
         Ken Krugler
         Maxim Valyanskiy
         Paul Borgermans
         Piotr B.
         Robert Newson
         Sascha Szott
         Ted Dunning
         Thilo Goetz
         Uwe Schindler
         Yuan-Fang Li
      See http://tinyurl.com/yl9prwp for more details on these contributions.
      Release 0.4 - 07/14/2009
      The most notable changes in Tika 0.4 over the previous release are:
        * Tika has been split into three different components for increased
          modularity. The tika-core component contains the key interfaces and
          core functionality of Tika, tika-parsers contains all the adapters
          to external parser libraries, and tika-app bundles everything together
          in a single executable jar file. (TIKA-219)
        * All three Tika components are packaged as OSGi bundles. (TIKA-228)
        * Tika now uses the new Commons Compress library for improved support
          of compression and packaging formats like gzip, bzip2, tar, cpio,
          ar, zip and jar. (TIKA-204)
        * The memory use of parsing Excel sheets with lots of numbers
          has been considerably reduced. (TIKA-211)
        * The AutoDetectParser now has basic protection against "zip bomb"
          attacks, where a specially crafted input document can expand to
          a practically infinite amount of output text. (TIKA-216)
        * The ParsingReader class can now use a thread pool or a more complex
          execution model (java.util.concurrent.Executor) for the background
          parsing task. (TIKA-215)
        * Automatic type detection of text- and XML-based documents has been
          improved. (TIKA-225)
        * Charset detection functionality from the ICU4J library was inlined
          in Tika to avoid the dependency on the large ICU4J jar. (TIKA-229)
        * Composite parsers like the AutoDetectParser now make sure that any
          RuntimeExceptions, IOExceptions or SAXExceptions unrelated to the given
          document stream or content handler are converted to TikaExceptions
          before being passed to the client. (TIKA-198, TIKA-237)
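
The "zip bomb" protection mentioned above (TIKA-216) can be pictured as a guard that compares how much text has been produced with how many input bytes have been read. The sketch below is only one possible approach under assumed names and thresholds; it is not taken from the Tika source, and the unchecked exception is chosen just to keep the example short.

```java
// Hypothetical guard, not Tika code: trips when output grows far faster
// than the input that produced it.
final class ExpansionGuard {
    private final long minOutputBeforeCheck; // ignore small documents
    private final long maxRatio;             // allowed output chars per input byte
    private long inputBytes;
    private long outputChars;

    ExpansionGuard(long minOutputBeforeCheck, long maxRatio) {
        this.minOutputBeforeCheck = minOutputBeforeCheck;
        this.maxRatio = maxRatio;
    }

    /** Records that the parser has read this many more input bytes. */
    void consumed(long bytes) {
        inputBytes += bytes;
    }

    /** Records emitted text and fails once the expansion ratio is exceeded. */
    void produced(long chars) {
        outputChars += chars;
        if (outputChars > minOutputBeforeCheck && outputChars > inputBytes * maxRatio) {
            throw new IllegalStateException("Suspected zip bomb: " + outputChars
                    + " output characters from " + inputBytes + " input bytes");
        }
    }
}
```

A parser wrapper would call `consumed` as it reads the stream and `produced` as it emits characters, so a runaway expansion is stopped early rather than after memory is exhausted.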
      The following people have contributed to Tika 0.4 by submitting or commenting
      on the issues resolved in this release:
         Chris A. Mattmann
         Daan de Wit
         Dave Meikle
         David Weekly
         Jeremias Maerki
         Jonathan Koren
         Jukka Zitting
         Karl Heinz Marbaise
         Keith R. Bennett
         Maxim Valyanskiy
         Niall Pemberton
         Robert Burrell Donkin
         Sami Siren
         Siddharth Gargate
         Uwe Schindler
      See http://tinyurl.com/mgv9o3 for more details on these contributions.
      Release 0.3 - 03/09/2009
      The most notable changes in Tika 0.3 over the previous release are:
       * Tika now supports mime type glob patterns specified using
         standard JDK 1.4 (and beyond) syntax via the isregex attribute
         on the glob tag. See:
         for more information. (TIKA-194)
       * Tika now supports the Office Open XML format used by
         Microsoft Office 2007. (TIKA-152)
       * All the metadata keys for Microsoft Office document properties are now
         included as constants in the MSOffice interface. Clients should use
         these constants instead of the raw string values to refer to specific
         metadata items. (TIKA-186)
       * Automatic detection of document types in Tika has been improved.
         For example Tika can now detect plain text just by looking at the first
         few bytes of the document. (TIKA-154)
       * Tika now disables the loading of all external entities in XML files
         that it parses as input documents. This improves security and avoids
         problems with potentially broken references. (TIKA-185)
       * Tika now replaces all invalid XML characters in the extracted text
         content with spaces. This prevents problems when output from Tika
         is processed with XML tools. (TIKA-180)
       * The Tika CLI now correctly flushes its buffers when invoked with the
         --text argument. This prevents the end of the text output from being
         lost. (TIKA-179)
       * Embedded text in MIDI files is now extracted. For example many karaoke
         files contain song lyrics embedded as MIDI text.
       * The text content of Microsoft Outlook message files no longer appears as
         multiple copies in the extracted text. (TIKA-197)
       * The ParsingReader class now makes most document metadata available
         already before any of the extracted text is consumed. This makes it
         easier for example to construct Lucene Document instances that contain
         both extracted text and metadata. (TIKA-203)
      See http://tinyurl.com/tika-0-3-changes for a list of all changes in Tika 0.3.
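
To make the invalid-character item (TIKA-180) concrete, here is a minimal sketch of replacing characters that XML 1.0 does not allow with spaces. The class and method names are hypothetical and the code is not copied from Tika; the valid ranges follow the Char production of the XML 1.0 specification.

```java
// Hypothetical sanitizer, not Tika code.
final class XmlTextSanitizer {
    /** True if the code point is allowed by the XML 1.0 Char production. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Replaces every code point that XML 1.0 forbids with a single space. */
    static String replaceInvalid(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (isValidXmlChar(cp)) {
                out.appendCodePoint(cp);
            } else {
                out.append(' ');
            }
        }
        return out.toString();
    }
}
```

Tabs, line feeds and carriage returns pass through unchanged, while other control characters and unpaired surrogates become spaces.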
      The following people have contributed to Tika 0.3 by submitting or commenting
      on the issues resolved in this release:
         Andrzej Rusin
         Chris A. Mattmann
         Dave Meikle
         Georger Araújo
         Guillermo Arribas
         Jonathan Koren
         Jukka Zitting
         Karl Heinz Marbaise
         Kumar Raja Jana
         Paul Borgermans
         Peter Becker
         Sébastien Michel
         Uwe Schindler
      See http://tinyurl.com/tika-0-3-contributions for more details on
      these contributions.
      Release 0.2 - 12/04/2008
      1.  TIKA-109 - WordParser fails on some Word files (Dave Meikle)
      2.  TIKA-105 - Excel parser implementation based on POI's Event API
                     (Niall Pemberton)
      3.  TIKA-116 - Streaming parser for OpenDocument files (Jukka Zitting)
      4.  TIKA-117 - Drop JDOM and Jaxen dependencies (Jukka Zitting)
      5.  TIKA-115 - Tika package with all the dependencies (Jukka Zitting)
      6.  TIKA-97  - Tika GUI (Jukka Zitting)
      7.  TIKA-96  - Tika CLI (Jukka Zitting)
      8.  TIKA-112 - Use Commons IO 1.4 (Jukka Zitting)
      9.  TIKA-127 - Add support for Visio files (Jukka Zitting)
      10. TIKA-129 - node() support for the streaming XPath utility (Jukka Zitting)
      11. TIKA-130 - self-or-descendant axis does not match self in streaming XPath
                     (Jukka Zitting)
      12. TIKA-131 - Lazy XHTML prefix generation (Jukka Zitting)
      13. TIKA-128 - HTML parser should produce XHTML SAX events (Jukka Zitting)
      14. TIKA-133 - TeeContentHandler constructor should use varargs (Jukka Zitting)
      15. TIKA-132 - Refactor Excel extractor to parse per sheet and add
                     hyperlink support (Niall Pemberton)
      16. TIKA-134 - mvn package does not produce packages for bin/src
                     (Karl Heinz Marbaise)
      17. TIKA-138 - Ignore HTML style and script content (Jukka Zitting)
      18. TIKA-113 - Metadata (such as title) should not be part of content
                     (Jukka Zitting)
      19. TIKA-139 - Add a composite parser (Jukka Zitting)
      20. TIKA-142 - Include application/xhtml+xml as valid mime type for XMLParser
      21. TIKA-143 - Add ParsingReader (Jukka Zitting)
      22. TIKA-144 - Upgrade nekohtml dependency (Jukka Zitting)
      23. TIKA-145 - Separate NOTICEs and LICENSEs for binary and source packages
                     (Jukka Zitting)
      24. TIKA-146 - Upgrade to POI 3.1 (Jukka Zitting)
      25. TIKA-99  - Support external parser programs (Jukka Zitting)
      26. TIKA-149 - Parser for Zip files (Dave Meikle & Jukka Zitting)
      27. TIKA-150 - Parser for tar files (Jukka Zitting)
      28. TIKA-151 - Stream compression support (Jukka Zitting)
      29. TIKA-156 - Some MIME magic patterns are ignored by MimeTypes
                     (Jukka Zitting)
      30. TIKA-155 - Java class file parser (Dave Brosius & Jukka Zitting)
      31. TIKA-108 - New Tika logos (Yongqian Li & Jukka Zitting)
      32. TIKA-120 - Add support for retrieving ID3 tags from MP3 files
                     (Dave Meikle & Jukka Zitting)
      33. TIKA-54  - Outlook msg parser
                     (Rida Benjelloun, Dave Meikle & Jukka Zitting)
      34. TIKA-114 - PDFParser : Getting content of the document using
                     "writer.ToString ()" , some words are stuck together
                     (Dave Meikle)
      35. TIKA-161 - Enable PMD reports (Jukka Zitting)
      36. TIKA-159 - Add support for parsing basic audio types: wav, aiff, au, midi
                     (Sami Siren)
      37. TIKA-140 - HTML parser unable to extract text
                     (Julien Nioche & Jukka Zitting)
      38. TIKA-163 - GUI does not support drag and drop in Gnome or KDE (Dave Meikle)
      39. TIKA-166 - Update HTMLParser to parse contents of meta tags (Dave Meikle)
      40. TIKA-164 - Upgrade of the nekohtml dependency to 1.9.9 (Jukka Zitting)
      41. TIKA-165 - Upgrade of the ICU4J dependency to version 3.8 (Jukka Zitting)
      42. TIKA-172 - New Open Document Parser that emits structured XHTML content
                     (Uwe Schindler & Jukka Zitting)
      43. TIKA-175 - Retrotranslate Tika for use in Java 1.4 environments (Jukka Zitting)
      44. TIKA-177 - Improvements to build instruction in README (Chris Hostetter & Jukka Zitting)
      45. TIKA-171 - New ContentHandler for plain text output that has no problem with
                     missing white space after XHTML block tags (Uwe Schindler & Jukka Zitting)
      Release 0.1-incubating - 12/27/2007
      1. TIKA-5 - Port Metadata Framework from Nutch (mattmann)
      2. TIKA-11 - Consolidate test classes into a src/test/java directory tree (mattmann)
      3. TIKA-15 - Utils.print does not print a Content having no value (jukka)
      4. TIKA-19 - org.apache.tika.TestParsers fails (bdelacretaz)
      5. TIKA-16 - Issues with data files used for testing by TestParsers (bdelacretaz)
      6. TIKA-14 - MimeTypeUtils.getMimeType() returns the default mime type for
                   .odt (Open Office) file (bdelacretaz)
      7. TIKA-12 - Add URL capability to MimeTypesUtils (jukka)
      8. TIKA-13 - Fix obsolete package names in config.xml (siren)
      9. TIKA-10 - Remove MimeInfoException catch clauses and import from TestParsers (siren)
      10. TIKA-8 - Replaced the jmimeinfo dependency with a trivial mime type detector (jukka)
      11. TIKA-7 - Added the Lius Lite code. Added missing dependencies to POM (jukka)
      12. TIKA-18 - "Office" interface should be renamed "MSOffice" (mattmann)
      13. TIKA-23 - Decouple Parser from ParserConfig (jukka)
      14. TIKA-6 - Port Nutch (or better) MimeType detection system into Tika (J. Charron & mattmann)
      15. TIKA-25 - Removed hardcoded reference to C:\oo.xml in OpenOfficeParser (K. Bennett & jukka)
      16. TIKA-17 - Need to support URL's for input resources. (K. Bennett & mattmann)
      17. TIKA-22 - Remove @author tags from the java source (mattmann)
      18. TIKA-21 - Simplified configuration code (jukka)
      19. TIKA-17 - Rename all "Lius" classes to be "Tika" classes (jukka)
      20. TIKA-30 - Added utility constructors to TikaConfig (K. Bennett & jukka)
      21. TIKA-28 - Rename config.xml to tika-config.xml or similar (mattmann)
      22. TIKA-26 - Use Map<String, Content> instead of List<Content> (jukka)
      23. TIKA-31 - protected Parser.parse(InputStream stream,
                    Iterable<Content> contents) (jukka & K. Bennett)
      24. TIKA-36 - A convenience method for getting a document's content's text
                    would be helpful (K. Bennett & mattmann)
      25. TIKA-33 - Stateless parsers (jukka)
      26. TIKA-38 - TXTParser adds a space to the content it reads from a file (K. Bennett & ridabenjelloun)
      27. TIKA-35 - Extract MsOffice properties, use RereadableInputStream developed by K. Bennett (ridabenjelloun & K. Bennett)
      28. TIKA-39 - Excel parsing improvements (siren & ridabenjelloun)
      29. TIKA-34 - Provide a method that will return a default configuration
                    (TikaConfig) (K. Bennett & mattmann)
      30. TIKA-42 - Content class needs (String, String, String) constructor (K. Bennett)
      31. TIKA-43 - Parser interface (jukka)
      32. TIKA-47 - Remove TikaLogger (jukka)
      33. TIKA-46 - Use Metadata in Parser (jukka & mattmann)
      34. TIKA-48 - Merge MS Extractors and Parsers (jukka)
      35. TIKA-45 - RereadableInputStream needs to be able to read to
                    the end of the original stream on first rewind. (K. Bennett)
      36. TIKA-41 - Resource files occur twice in jar file. (jukka)
      37. TIKA-49 - Some files have old-style license headers, fixed (Robert Burrell Donkin & bdelacretaz)
      38. TIKA-51 - Leftover temp files after running Tika tests, fixed (bdelacretaz)
      39. TIKA-40 - Tika needs to support diverse character encodings (jukka)
      40. TIKA-55 - ParseUtils.getParser() method variants should have consistent parameter orders
                    (K. Bennett)
      41. TIKA-52 - RereadableInputStream needs to support not closing the input stream it wraps.
                    (K. Bennett via bdelacretaz)
      42. TIKA-53 - XHTML SAX events from parsers (jukka)
      43. TIKA-57 - Rename org.apache.tika.ms to org.apache.tika.parser.ms (jukka)
      44. TIKA-62 - Use TikaConfig.getDefaultConfig() instead of a hardcoded
                    config path in TestParsers (jukka)
      45. TIKA-58 - Replace jtidy html parser with nekohtml based parser (siren)
      46. TIKA-60 - Rename Microsoft parser classes (jukka)
      47. TIKA-63 - Avoid multiple passes over the input stream in Microsoft parsers
      48. TIKA-66 - Use Java 5 features in org.apache.tika.mime (jukka)
      49. TIKA-56 - Mime type detection fails with upper case file extensions such as "PDF"
      50. TIKA-65 - Add encode detection support for HTML parser (siren)
      51. TIKA-68 - Add dummy parser classes to be used as sentinels (jukka)
      52. TIKA-67 - Add an auto-detecting Parser implementation (jukka)
      53. TIKA-70 - Better MIME information for the Open Document formats (jukka)
      54. TIKA-71 - Remove ParserConfig and ParserFactory (jukka)
      55. TIKA-83 - Create a org.apache.tika.sax package for SAX utilities (jukka)
      56. TIKA-84 - Add MimeTypes.getMimeType(InputStream) (jukka)
      57. TIKA-85 - Add glob patterns from the ASF svn:eol-style documentation (jukka)
      58. TIKA-100 - Structured PDF parsing (jukka)
      59. TIKA-101 - Improve site and build (mattmann)
      60. TIKA-102 - Parser implementations loading a large amount of content
                     into a single String could be problematic (Niall Pemberton)
      61. TIKA-107 - Remove use of assertions for argument checking (Niall Pemberton)
      62. TIKA-104 - Add utility methods to throw IOException with the cause
                     initialized (jukka & Niall Pemberton)
      63. TIKA-106 - Remove dependency on Jakarta ORO - use JDK 1.4 Regex
                     (Niall Pemberton)
      64. TIKA-111 - Missing license headers (jukka)
      65. TIKA-112 - XMLParser improvement (ridabenjelloun)




            tmortagne Thomas Mortagne

