  XWiki Platform / XWIKI-13405

PDF attachments are not indexed by SOLR

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Duplicate
    • Affects Version/s: 7.4.3
    • Fix Version/s: None
    • Component/s: Search - Solr
    • Labels: None
    • Difficulty: Unknown

      Description

      The following is an example of the error raised while indexing such an attachment:

      2016-05-10 10:36:08,839 [XWiki Solr index thread] ERROR o.a.p.f.FlateFilter            - FlateFilter: stop reading corrupt stream due to a DataFormatException 
      2016-05-10 10:36:08,839 [XWiki Solr index thread] ERROR o.a.p.f.FlateFilter            - FlateFilter: stop reading corrupt stream due to a DataFormatException 
      2016-05-10 10:36:08,861 [XWiki Solr index thread] ERROR .DocumentSolrMetadataExtractor - Failed to retrieve the content of attachment [Attachment xwiki:Main.TestPage1@Global Contact 9.pdf] 
      org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@54dd5350
      	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282) ~[tika-core-1.11.jar:1.11]
      	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[tika-core-1.11.jar:1.11]
      	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) ~[tika-core-1.11.jar:1.11]
      	at org.apache.tika.Tika.parseToString(Tika.java:496) ~[tika-core-1.11.jar:1.11]
      	at org.xwiki.search.solr.internal.metadata.AbstractSolrMetadataExtractor.getContentAsText(AbstractSolrMetadataExtractor.java:507) [xwiki-platform-search-solr-api-7.4.3.jar:7.4.3]
      	at org.xwiki.search.solr.internal.metadata.DocumentSolrMetadataExtractor.setAttachment(DocumentSolrMetadataExtractor.java:299) [xwiki-platform-search-solr-api-7.4.3.jar:7.4.3]
      	at org.xwiki.search.solr.internal.metadata.DocumentSolrMetadataExtractor.setAttachments(DocumentSolrMetadataExtractor.java:279) [xwiki-platform-search-solr-api-7.4.3.jar:7.4.3]
      	at org.xwiki.search.solr.internal.metadata.DocumentSolrMetadataExtractor.setExtras(DocumentSolrMetadataExtractor.java:205) [xwiki-platform-search-solr-api-7.4.3.jar:7.4.3]
      	at org.xwiki.search.solr.internal.metadata.DocumentSolrMetadataExtractor.setFieldsInternal(DocumentSolrMetadataExtractor.java:145) [xwiki-platform-search-solr-api-7.4.3.jar:7.4.3]
      	at org.xwiki.search.solr.internal.metadata.AbstractSolrMetadataExtractor.getSolrDocument(AbstractSolrMetadataExtractor.java:132) [xwiki-platform-search-solr-api-7.4.3.jar:7.4.3]
      	at org.xwiki.search.solr.internal.DefaultSolrIndexer.getSolrDocument(DefaultSolrIndexer.java:518) [xwiki-platform-search-solr-api-7.4.3.jar:7.4.3]
      	at org.xwiki.search.solr.internal.DefaultSolrIndexer.processBatch(DefaultSolrIndexer.java:425) [xwiki-platform-search-solr-api-7.4.3.jar:7.4.3]
      	at org.xwiki.search.solr.internal.DefaultSolrIndexer.run(DefaultSolrIndexer.java:391) [xwiki-platform-search-solr-api-7.4.3.jar:7.4.3]
      	at java.lang.Thread.run(Thread.java:745) [na:1.7.0_79]
      Caused by: java.lang.RuntimeException: java.io.IOException: Unknown dir object c=')' cInt=41 peek=')' peekInt=41 3734
      	at org.apache.pdfbox.pdfparser.PDFStreamParser$1.tryNext(PDFStreamParser.java:198) ~[pdfbox-1.8.10.jar:1.8.10]
      	at org.apache.pdfbox.pdfparser.PDFStreamParser$1.hasNext(PDFStreamParser.java:205) ~[pdfbox-1.8.10.jar:1.8.10]
      	at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:255) ~[pdfbox-1.8.10.jar:1.8.10]
      	at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235) ~[pdfbox-1.8.10.jar:1.8.10]
      	at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215) ~[pdfbox-1.8.10.jar:1.8.10]
      	at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:458) ~[pdfbox-1.8.10.jar:1.8.10]
      	at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:383) ~[pdfbox-1.8.10.jar:1.8.10]
      	at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:342) ~[pdfbox-1.8.10.jar:1.8.10]
      	at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:148) ~[tika-parsers-1.11.jar:1.11]
      	at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:148) ~[tika-parsers-1.11.jar:1.11]
      	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[tika-core-1.11.jar:1.11]
      	... 13 common frames omitted
      Caused by: java.io.IOException: Unknown dir object c=')' cInt=41 peek=')' peekInt=41 3734
      	at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:1362) ~[pdfbox-1.8.10.jar:1.8.10]
      	at org.apache.pdfbox.pdfparser.BaseParser.parseCOSArray(BaseParser.java:1066) ~[pdfbox-1.8.10.jar:1.8.10]
      	at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:1275) ~[pdfbox-1.8.10.jar:1.8.10]
      	at org.apache.pdfbox.pdfparser.BaseParser.parseCOSArray(BaseParser.java:1066) ~[pdfbox-1.8.10.jar:1.8.10]
      	at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:1275) ~[pdfbox-1.8.10.jar:1.8.10]
      	at org.apache.pdfbox.pdfparser.BaseParser.parseCOSArray(BaseParser.java:1066) ~[pdfbox-1.8.10.jar:1.8.10]
      	at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:1275) ~[pdfbox-1.8.10.jar:1.8.10]
      	at org.apache.pdfbox.pdfparser.BaseParser.parseCOSArray(BaseParser.java:1066) ~[pdfbox-1.8.10.jar:1.8.10]
      	at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:1275) ~[pdfbox-1.8.10.jar:1.8.10]
      	at org.apache.pdfbox.pdfparser.BaseParser.parseCOSArray(BaseParser.java:1066) ~[pdfbox-1.8.10.jar:1.8.10]
      	at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:1275) ~[pdfbox-1.8.10.jar:1.8.10]
      	at org.apache.pdfbox.pdfparser.BaseParser.parseCOSArray(BaseParser.java:1066) ~[pdfbox-1.8.10.jar:1.8.10]
      	at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:1275) ~[pdfbox-1.8.10.jar:1.8.10]
      	at org.apache.pdfbox.pdfparser.BaseParser.parseCOSArray(BaseParser.java:1066) ~[pdfbox-1.8.10.jar:1.8.10]
      	at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:1275) ~[pdfbox-1.8.10.jar:1.8.10]
      	at org.apache.pdfbox.pdfparser.BaseParser.parseCOSArray(BaseParser.java:1066) ~[pdfbox-1.8.10.jar:1.8.10]
      	at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:1275) ~[pdfbox-1.8.10.jar:1.8.10]
      	at org.apache.pdfbox.pdfparser.BaseParser.parseCOSArray(BaseParser.java:1066) ~[pdfbox-1.8.10.jar:1.8.10]
      	at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:1275) ~[pdfbox-1.8.10.jar:1.8.10]
      	at org.apache.pdfbox.pdfparser.BaseParser.parseCOSArray(BaseParser.java:1066) ~[pdfbox-1.8.10.jar:1.8.10]
      	at org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:276) ~[pdfbox-1.8.10.jar:1.8.10]
      	at org.apache.pdfbox.pdfparser.PDFStreamParser.access$000(PDFStreamParser.java:49) ~[pdfbox-1.8.10.jar:1.8.10]
      	at org.apache.pdfbox.pdfparser.PDFStreamParser$1.tryNext(PDFStreamParser.java:193) ~[pdfbox-1.8.10.jar:1.8.10]
      	... 23 common frames omitted
      

      As a result, these attachments are not indexed, and searching for their content with Solr returns no results.
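
The trace above nests three exceptions: the TikaException logged by XWiki, a RuntimeException thrown inside PDFBox's stream parser, and the underlying IOException raised in BaseParser.parseDirObject. As a rough illustration of how to reach that innermost cause programmatically, the sketch below walks a cause chain using only JDK classes. The names RootCauseDemo, rootCause and sampleFailure are hypothetical and invented for this example, and a plain Exception stands in for TikaException so that no external library is needed. None of this is XWiki's or Tika's actual code.

```java
import java.io.IOException;

// Illustrative only: not XWiki or Tika code. It mimics the exception nesting
// seen in the log using plain JDK exception types.
final class RootCauseDemo {

    // Follow getCause() until the innermost exception is reached.
    // The depth limit guards against (unusual) cyclic cause chains.
    static Throwable rootCause(Throwable error) {
        Throwable current = error;
        for (int depth = 0; depth < 100 && current.getCause() != null; depth++) {
            current = current.getCause();
        }
        return current;
    }

    // Build a chain shaped like the one in the log: an outer wrapper,
    // a RuntimeException in the middle, and an IOException at the bottom.
    // The plain Exception stands in for TikaException (an assumption made
    // so the sketch compiles without external libraries).
    static Exception sampleFailure() {
        IOException parseError =
            new IOException("Unknown dir object c=')' cInt=41 peek=')' peekInt=41 3734");
        RuntimeException wrapped = new RuntimeException(parseError);
        return new Exception("Unexpected RuntimeException from PDFParser", wrapped);
    }

    public static void main(String[] args) {
        Throwable root = rootCause(sampleFailure());
        System.out.println(root.getClass().getName() + ": " + root.getMessage());
    }
}
```

Logging the root cause next to the attachment reference can make it easier to tell a malformed PDF content stream, as here, apart from other extraction failures.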

              People

              • Assignee: Thomas Mortagne (tmortagne)
              • Reporter: Raluca Stavro (ralucamorosan)
              • Votes: 0
              • Watchers: 3
