XWiki Platform / XWIKI-19668

Out of Memory when Solr is indexing a 500MB zip/pdf attachment


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: 13.10.4
    • Component/s: Search - Solr
    • Labels: None
    • Environment: tomcat9, debian
    • Difficulty: Unknown

    Description

      Heap size: 1GB

      The problem occurs with a 500 MB zip file containing some smaller PDFs and DOC files, but also:

      • one 60 MB PDF
      • six 15-30 MB PDFs
      • one 360 MB PDF

      I isolated the problem to the page containing this attachment, removed it from the index, and reindexed it. Immediately after, the following exception is thrown:

      2022-04-27 07:36:24,102 [XWiki Solr index thread] ERROR o.x.s.s.i.DefaultSolrIndexer   - Failed to process entry [INDEX xwiki:Documents.20596869] 
      java.lang.OutOfMemoryError: Java heap space
      	at org.apache.pdfbox.io.ScratchFileBuffer.addPage(ScratchFileBuffer.java:132)
      	at org.apache.pdfbox.io.ScratchFileBuffer.<init>(ScratchFileBuffer.java:84)
      	at org.apache.pdfbox.io.ScratchFile.createBuffer(ScratchFile.java:424)
      	at org.apache.pdfbox.cos.COSStream.createRawOutputStream(COSStream.java:273)
      	at org.apache.pdfbox.pdfparser.COSParser.parseCOSStream(COSParser.java:1126)
      	at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:915)
      	at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:876)
      	at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:796)
      	at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:756)
      	at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:187)
      	at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:226)
      	at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1228)
      	at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1202)
      	at org.apache.tika.parser.pdf.PDFParser.getPDDocument(PDFParser.java:186)
      	at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:147)
      	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:289)
      	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:289)
      	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:150)
      	at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:71)
      	at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:106)
      	at org.apache.tika.parser.pkg.PackageParser.parseEntry(PackageParser.java:432)
      	at org.apache.tika.parser.pkg.PackageParser.parseEntries(PackageParser.java:349)
      	at org.apache.tika.parser.pkg.PackageParser.parse(PackageParser.java:299)
      	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:289)
      	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:289)
      	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:150)
      	at org.apache.tika.Tika.parseToString(Tika.java:525)
      	at org.apache.tika.Tika.parseToString(Tika.java:495)
      	at org.xwiki.tika.internal.TikaUtils.parseToString(TikaUtils.java:153)
      	at org.xwiki.search.solr.internal.metadata.AbstractSolrMetadataExtractor.getContentAsText(AbstractSolrMetadataExtractor.java:528)
      	at org.xwiki.search.solr.internal.metadata.DocumentSolrMetadataExtractor.setAttachment(DocumentSolrMetadataExtractor.java:281)
      	at org.xwiki.search.solr.internal.metadata.DocumentSolrMetadataExtractor.setAttachments(DocumentSolrMetadataExtractor.java:261)
      
      
      Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "Glowroot-Stack-Trace-Collector"
      
      Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "Catalina-utility-2"
      
      Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "Glowroot-Gauge-Collection"
      
      Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "AsyncFileHandlerWriter-1961595039"
      
      Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "Glowroot-Background-0"
      
      Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "MVStore background writer nio:/var/lib/xwiki/data/mentions/mvqueue"
      
      Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "SolrRrdBackendFactory-11-thread-1"
      ...
      etc
      

      The failure seems to occur deep inside Tika when it tries to parse the large PDF found inside the zip. So it's not an issue of returning the result, but of actually parsing the provided input. Ideally, disk buffers should be used instead of loading everything into memory.
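
      As a point of reference, here is a minimal sketch of what disk-backed parsing could look like, assuming a Tika 1.x version that exposes PDFParserConfig.setMaxMainMemoryBytes(); this is not what AbstractSolrMetadataExtractor does today, and the class name and threshold below are only illustrative:

      import java.io.InputStream;

      import org.apache.tika.metadata.Metadata;
      import org.apache.tika.parser.AutoDetectParser;
      import org.apache.tika.parser.ParseContext;
      import org.apache.tika.parser.pdf.PDFParserConfig;
      import org.apache.tika.sax.BodyContentHandler;

      public class DiskBufferedExtraction
      {
          public static String parseToString(InputStream stream) throws Exception
          {
              // Keep at most ~50 MB of each PDF in heap; PDFBox buffers the rest in temp files.
              PDFParserConfig pdfConfig = new PDFParserConfig();
              pdfConfig.setMaxMainMemoryBytes(50L * 1024 * 1024);

              // The ParseContext is shared with embedded documents, so PDFs found inside the
              // zip by PackageParser should get the same memory limit.
              ParseContext context = new ParseContext();
              context.set(PDFParserConfig.class, pdfConfig);

              AutoDetectParser parser = new AutoDetectParser();
              // -1 removes BodyContentHandler's default 100k character cap.
              BodyContentHandler handler = new BodyContentHandler(-1);
              parser.parse(stream, handler, new Metadata(), context);
              return handler.toString();
          }
      }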

      The XWiki code calling tika is located at https://github.com/xwiki/xwiki-platform/blob/xwiki-platform-13.10.4/xwiki-platform-core/xwiki-platform-search/xwiki-platform-search-solr/xwiki-platform-search-solr-api/src/main/java/org/xwiki/search/solr/internal/metadata/AbstractSolrMetadataExtractor.java#L528

      The problem is that the wiki becomes completely unresponsive once the OOM occurs. The heap space is never reclaimed and the process eventually gets restarted by xinit.

      Even worse, at restart, Solr notices the missing index document and tries to reindex it, triggering the OOM yet again.

      This cycle repeats over and over, effectively making the wiki unusable. I consider this a blocker scenario.

      The workaround is to increase the heap size enough for the problem attachment to fit in memory and get indexed.

      One solution could be for the Solr Attachment Metadata Extractor to check the attachment it is supposed to analyze and, if the file size is larger than the remaining heap size, return an empty string (or null) to be indexed in Solr. IMO, it's better to not have the content of some attachments indexed than to lose control of your wiki after you upload a large file.
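
      A rough sketch of what that guard could look like (the threshold, class and method names below are invented for illustration; the real fix would need to be hooked into getContentAsText()):

      public final class AttachmentIndexingGuard
      {
          private AttachmentIndexingGuard()
          {
          }

          /**
           * Decide whether it looks safe to hand this attachment to Tika.
           * Compares the attachment size with the heap that could still be
           * allocated (max heap minus currently used memory).
           */
          public static boolean canIndexContent(long attachmentSize)
          {
              Runtime runtime = Runtime.getRuntime();
              long used = runtime.totalMemory() - runtime.freeMemory();
              long stillAvailable = runtime.maxMemory() - used;

              // Leave headroom: parsing can need several times the file size
              // (decompression, per-page scratch buffers, extracted text).
              return attachmentSize * 3 < stillAvailable;
          }
      }

      If canIndexContent() returned false, the extractor would index an empty string instead of the parsed content, so the document still ends up in Solr and is not retried after every restart.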

      This strategy could be applied to all attachments OR based on file types: i.e. enforce it for large documents (pdf, doc, docx, etc.), where we know Tika will try to go into detail and load data into memory, but let Tika process large media files, since it usually extracts minimal information from them without causing issues (i.e. it reads metadata sequentially and does not load an entire movie into memory for deep analysis).
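
      A sketch of the type-based variant, assuming the decision key is the attachment's media type (the list of memory-heavy types below is only an example, not an agreed policy):

      import java.util.Set;

      public final class AttachmentTypeFilter
      {
          // Formats where Tika (via PDFBox/POI) builds large in-memory structures.
          private static final Set<String> MEMORY_HEAVY_TYPES = Set.of(
              "application/pdf",
              "application/msword",
              "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
              "application/zip");

          private AttachmentTypeFilter()
          {
          }

          /**
           * Skip content extraction only for large attachments of memory-heavy types.
           * Large media files (video, audio) can still go to Tika, since it reads
           * their metadata sequentially without loading the whole file.
           */
          public static boolean shouldSkipContent(String mimeType, long size, long maxHeavySize)
          {
              return MEMORY_HEAVY_TYPES.contains(mimeType) && size > maxHeavySize;
          }
      }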

      Ideally, it would be nice to detect the OOM and return an empty string (so that Solr does not retry next time), but by that time the app is too unstable to do anything.

          People

            Assignee: Unassigned
            Reporter: Eduard Moraru (enygma)