  1. XWiki Platform
  2. XWIKI-20139

Large PDF attachments are not indexed


Details

    • Bug
    • Resolution: Unresolved
    • Major
    • None
    • 13.10.5
    • Search - Solr
    • None
    • Unknown

    Description

      For some PDF documents we get the following error message after uploading them (in our case, to the File Manager Application):

       

      Sep 16 15:24:32 ptbwiki server[31564]: 2022-09-16 15:24:32,372 [XWiki Solr index thread] ERROR ttachmentSolrMetadataExtractor - Failed to retrieve the content of attachment [Attachment explosionsschutz:FileManager.Warnatz1993_Technische_Verbrennung_until_chapter_5_page_1\.pdf@Warnatz1993_Technische_Verbrennung_until_chapter_5_page_1.pdf]
      
      Sep 16 15:24:32 ptbwiki server[31564]: org.apache.tika.exception.TikaException: Unable to extract PDF content
      
      Sep 16 15:24:32 ptbwiki server[31564]:         at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:119)
      
      Sep 16 15:24:32 ptbwiki server[31564]:         at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:170)
      
      Sep 16 15:24:32 ptbwiki server[31564]:         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:289)
      
      Sep 16 15:24:32 ptbwiki server[31564]:         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:289)
      
      Sep 16 15:24:32 ptbwiki server[31564]:         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:150)
      
      Sep 16 15:24:32 ptbwiki server[31564]:         at org.apache.tika.Tika.parseToString(Tika.java:525)
      
      Sep 16 15:24:32 ptbwiki server[31564]:         at org.apache.tika.Tika.parseToString(Tika.java:495)
      
      Sep 16 15:24:32 ptbwiki server[31564]:         at org.xwiki.tika.internal.TikaUtils.parseToString(TikaUtils.java:153)
      
      Sep 16 15:24:32 ptbwiki server[31564]:         at org.xwiki.search.solr.internal.metadata.AbstractSolrMetadataExtractor.getContentAsText(AbstractSolrMetadataExtractor.java:528)
      
      Sep 16 15:24:32 ptbwiki server[31564]:         at org.xwiki.search.solr.internal.metadata.AttachmentSolrMetadataExtractor.setLocaleAndContentFields(AttachmentSolrMetadataExtractor.java:111)
      
      Sep 16 15:24:32 ptbwiki server[31564]:         at org.xwiki.search.solr.internal.metadata.AttachmentSolrMetadataExtractor.setFieldsInternal(AttachmentSolrMetadataExtractor.java:93)
      
      Sep 16 15:24:32 ptbwiki server[31564]:         at org.xwiki.search.solr.internal.metadata.AbstractSolrMetadataExtractor.getSolrDocument(AbstractSolrMetadataExtractor.java:151)
      
      Sep 16 15:24:32 ptbwiki server[31564]:         at org.xwiki.search.solr.internal.DefaultSolrIndexer.getSolrDocument(DefaultSolrIndexer.java:499)
      
      Sep 16 15:24:32 ptbwiki server[31564]:         at org.xwiki.search.solr.internal.DefaultSolrIndexer.processBatch(DefaultSolrIndexer.java:408)
      
      Sep 16 15:24:32 ptbwiki server[31564]:         at org.xwiki.search.solr.internal.DefaultSolrIndexer.run(DefaultSolrIndexer.java:376)
      
      Sep 16 15:24:32 ptbwiki server[31564]:         at java.lang.Thread.run(Thread.java:825)
      
      Sep 16 15:24:32 ptbwiki server[31564]: Caused by: java.io.IOException: Unable to write a string: gewisse physikalische Gräßen transportiert werden. Diffusion ist Transport von
      
      Sep 16 15:24:32 ptbwiki server[31564]:         at org.apache.tika.parser.pdf.PDF2XHTML.writeString(PDF2XHTML.java:195)
      
      Sep 16 15:24:32 ptbwiki server[31564]:         at org.apache.pdfbox.text.PDFTextStripper.writeString(PDFTextStripper.java:785)
      
      Sep 16 15:24:32 ptbwiki server[31564]:         at org.apache.pdfbox.text.PDFTextStripper.writeLine(PDFTextStripper.java:1744)
      
      Sep 16 15:24:32 ptbwiki server[31564]:         at org.apache.pdfbox.text.PDFTextStripper.writePage(PDFTextStripper.java:666)
      
      Sep 16 15:24:32 ptbwiki server[31564]:         at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:395)
      
      Sep 16 15:24:32 ptbwiki server[31564]:         at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:126)
      
      Sep 16 15:24:32 ptbwiki server[31564]:         at org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:1055)
      
      Sep 16 15:24:32 ptbwiki server[31564]:         at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:269)
      
      Sep 16 15:24:32 ptbwiki server[31564]:         at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:97)
      
      Sep 16 15:24:32 ptbwiki server[31564]:         ... 15 common frames omitted
      
      Sep 16 15:24:32 ptbwiki server[31564]: Caused by: org.apache.tika.sax.TaggedSAXException: Your document contained more than 100000 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
      
      Sep 16 15:24:32 ptbwiki server[31564]:         at org.apache.tika.sax.TaggedContentHandler.handleException(TaggedContentHandler.java:113)
      
      Sep 16 15:24:32 ptbwiki server[31564]:         at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:143)
      
      Sep 16 15:24:32 ptbwiki server[31564]:         at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:141)
      
      Sep 16 15:24:32 ptbwiki server[31564]:         at org.apache.tika.sax.SafeContentHandler.access$201(SafeContentHandler.java:47)
      
      Sep 16 15:24:32 ptbwiki server[31564]:         at org.apache.tika.sax.SafeContentHandler.lambda$new$0(SafeContentHandler.java:57)
      
      Sep 16 15:24:32 ptbwiki server[31564]:         at org.apache.tika.sax.SafeContentHandler$$Lambda$1199/0x00000000d4524c80.write(Unknown Source)
      
      Sep 16 15:24:32 ptbwiki server[31564]:         at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:106)
      
      Sep 16 15:24:32 ptbwiki server[31564]:         at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:250)
      
      Sep 16 15:24:32 ptbwiki server[31564]:         at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:270)
      
      Sep 16 15:24:32 ptbwiki server[31564]:         at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:295)
      
      Sep 16 15:24:32 ptbwiki server[31564]:         at org.apache.tika.parser.pdf.PDF2XHTML.writeString(PDF2XHTML.java:193)
      
      Sep 16 15:24:32 ptbwiki server[31564]:         ... 23 common frames omitted
      
      

      As tmortagne reported back in 2017, there should be no such limit, unless this has changed in the meantime. I attached the file in question, truncated at the page on which the limit of 100,000 characters is exceeded.
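
      The trace above shows the write-limit message three levels down: a TaggedSAXException wrapped in an IOException, wrapped in a TikaException. A caller that inspects only the top-level exception therefore sees "Unable to extract PDF content" and not the limit. A minimal sketch of walking the cause chain to spot it follows. It uses plain JDK exceptions as stand-ins for the Tika types, and the class name, method name, and message marker are assumptions made for illustration, not XWiki or Tika code.

```java
import java.io.IOException;

public class WriteLimitCheck {
    // Fragment of the TaggedSAXException message quoted in the log (assumed stable).
    public static final String LIMIT_MARKER = "your requested limit has been reached";

    /** Returns true if any exception in the cause chain mentions the write limit. */
    public static boolean isWriteLimitReached(Throwable error) {
        int depth = 0;
        // The depth cap guards against unusually deep or cyclic cause chains.
        for (Throwable t = error; t != null && depth < 32; t = t.getCause(), depth++) {
            String message = t.getMessage();
            if (message != null && message.contains(LIMIT_MARKER)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Rebuild the three-level chain from the log with stand-in exceptions.
        Exception limit = new Exception("Your document contained more than 100000 characters, "
            + "and so your requested limit has been reached.");
        IOException io = new IOException("Unable to write a string", limit);
        Exception top = new Exception("Unable to extract PDF content", io);

        System.out.println(isWriteLimitReached(top));   // prints "true"
        System.out.println(isWriteLimitReached(
            new Exception("Unable to extract PDF content")));   // prints "false"
    }
}
```

      With a check of this kind, the indexer could in principle keep the text extracted up to the limit instead of treating the attachment as failed. Tika itself provides WriteOutContentHandler.isWriteLimitReached(Throwable) for a similar purpose, and Tika.setMaxStringLength(int) to change the limit; whether and how XWiki uses either in 13.10.5 is not established by this report.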

       

       


          People

            Assignee: Unassigned
            Reporter: Björn Ludwig (bluxwi)
