XWiki Platform / XWIKI-19668

Out of Memory when Solr is indexing a 500MB zip/pdf attachment


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: 13.10.4
    • Component/s: Search - Solr
    • Labels: None
    • Environment: tomcat9, debian
    • Difficulty: Unknown

    Description

      Heap size: 1GB

      The problem occurs with a 500 MB zip file containing some smaller PDFs and DOC files, but also:

      • one 60 MB PDF
      • six 15-30 MB PDFs
      • one 360 MB PDF

      I isolated the problem to the page containing this attachment, removed it from the index, and reindexed it. Immediately after, the following exception is thrown:

      2022-04-27 07:36:24,102 [XWiki Solr index thread] ERROR o.x.s.s.i.DefaultSolrIndexer   - Failed to process entry [INDEX xwiki:Documents.20596869] 
      java.lang.OutOfMemoryError: Java heap space
      	at org.apache.pdfbox.io.ScratchFileBuffer.addPage(ScratchFileBuffer.java:132)
      	at org.apache.pdfbox.io.ScratchFileBuffer.<init>(ScratchFileBuffer.java:84)
      	at org.apache.pdfbox.io.ScratchFile.createBuffer(ScratchFile.java:424)
      	at org.apache.pdfbox.cos.COSStream.createRawOutputStream(COSStream.java:273)
      	at org.apache.pdfbox.pdfparser.COSParser.parseCOSStream(COSParser.java:1126)
      	at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:915)
      	at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:876)
      	at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:796)
      	at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:756)
      	at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:187)
      	at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:226)
      	at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1228)
      	at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1202)
      	at org.apache.tika.parser.pdf.PDFParser.getPDDocument(PDFParser.java:186)
      	at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:147)
      	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:289)
      	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:289)
      	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:150)
      	at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:71)
      	at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:106)
      	at org.apache.tika.parser.pkg.PackageParser.parseEntry(PackageParser.java:432)
      	at org.apache.tika.parser.pkg.PackageParser.parseEntries(PackageParser.java:349)
      	at org.apache.tika.parser.pkg.PackageParser.parse(PackageParser.java:299)
      	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:289)
      	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:289)
      	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:150)
      	at org.apache.tika.Tika.parseToString(Tika.java:525)
      	at org.apache.tika.Tika.parseToString(Tika.java:495)
      	at org.xwiki.tika.internal.TikaUtils.parseToString(TikaUtils.java:153)
      	at org.xwiki.search.solr.internal.metadata.AbstractSolrMetadataExtractor.getContentAsText(AbstractSolrMetadataExtractor.java:528)
      	at org.xwiki.search.solr.internal.metadata.DocumentSolrMetadataExtractor.setAttachment(DocumentSolrMetadataExtractor.java:281)
      	at org.xwiki.search.solr.internal.metadata.DocumentSolrMetadataExtractor.setAttachments(DocumentSolrMetadataExtractor.java:261)
      
      
      Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "Glowroot-Stack-Trace-Collector"
      
      Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "Catalina-utility-2"
      
      Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "Glowroot-Gauge-Collection"
      
      Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "AsyncFileHandlerWriter-1961595039"
      
      Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "Glowroot-Background-0"
      
      Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "MVStore background writer nio:/var/lib/xwiki/data/mentions/mvqueue"
      
      Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "SolrRrdBackendFactory-11-thread-1"
      ...
      etc
      

      The failure seems to occur deep inside Tika when it tries to parse the large PDF found inside the zip. So it's not an issue of returning the result, but of actually parsing the provided input. Ideally, disk buffers should be used instead of loading everything into memory.
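
      As a point of reference, here is a minimal sketch of what disk-backed parsing could look like, assuming a Tika 1.x version that exposes PDFParserConfig.setMaxMainMemoryBytes(); this is not what AbstractSolrMetadataExtractor does today, and the class name and threshold below are only illustrative:

      import java.io.InputStream;

      import org.apache.tika.metadata.Metadata;
      import org.apache.tika.parser.AutoDetectParser;
      import org.apache.tika.parser.ParseContext;
      import org.apache.tika.parser.pdf.PDFParserConfig;
      import org.apache.tika.sax.BodyContentHandler;

      public class DiskBufferedExtraction
      {
          public static String parseToString(InputStream stream) throws Exception
          {
              // Keep at most ~50 MB of each PDF in heap; PDFBox buffers the rest in temp files.
              PDFParserConfig pdfConfig = new PDFParserConfig();
              pdfConfig.setMaxMainMemoryBytes(50L * 1024 * 1024);

              // The ParseContext is shared with embedded documents, so PDFs found inside the
              // zip by PackageParser should get the same memory limit.
              ParseContext context = new ParseContext();
              context.set(PDFParserConfig.class, pdfConfig);

              AutoDetectParser parser = new AutoDetectParser();
              // -1 removes BodyContentHandler's default 100k character cap.
              BodyContentHandler handler = new BodyContentHandler(-1);
              parser.parse(stream, handler, new Metadata(), context);
              return handler.toString();
          }
      }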

      The XWiki code calling tika is located at https://github.com/xwiki/xwiki-platform/blob/xwiki-platform-13.10.4/xwiki-platform-core/xwiki-platform-search/xwiki-platform-search-solr/xwiki-platform-search-solr-api/src/main/java/org/xwiki/search/solr/internal/metadata/AbstractSolrMetadataExtractor.java#L528

      The problem is that the wiki becomes completely unresponsive once the OOM occurs. The heap space is never reclaimed and the process eventually gets restarted by xinit.

      Even worse, at restart, Solr notices the missing index document and tries to reindex it, triggering the OOM yet again.

      This cycle repeats over and over, effectively making the wiki unusable. I consider this a blocker scenario.

      The workaround is to increase the heap size enough for the problem attachment to fit in memory and get indexed.

      One solution could be for the Solr Attachment Metadata Extractor to check the attachment it is supposed to analyze and, if the file size is larger than the remaining heap size, return an empty string (or null) to be indexed in Solr. IMO, it's better to not have the content of some attachments indexed than to lose control of your wiki after you upload a large file.
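
      A rough sketch of what that guard could look like (the threshold, class and method names below are invented for illustration; the real fix would need to be hooked into getContentAsText()):

      public final class AttachmentIndexingGuard
      {
          private AttachmentIndexingGuard()
          {
          }

          /**
           * Decide whether it looks safe to hand this attachment to Tika.
           * Compares the attachment size with the heap that could still be
           * allocated (max heap minus currently used memory).
           */
          public static boolean canIndexContent(long attachmentSize)
          {
              Runtime runtime = Runtime.getRuntime();
              long used = runtime.totalMemory() - runtime.freeMemory();
              long stillAvailable = runtime.maxMemory() - used;

              // Leave headroom: parsing can need several times the file size
              // (decompression, per-page scratch buffers, extracted text).
              return attachmentSize * 3 < stillAvailable;
          }
      }

      If canIndexContent() returned false, the extractor would index an empty string instead of the parsed content, so the document still ends up in Solr and is not retried after every restart.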

      This strategy could be applied to all attachments OR based on file types: i.e. enforce it for large documents (pdf, doc, docx, etc.), where we know Tika will try to go into detail and load data into memory, but let Tika process large media files, since it usually extracts minimal information from them without causing issues (i.e. it reads metadata sequentially and does not load an entire movie into memory for deep analysis).
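
      A sketch of the type-based variant, assuming the decision key is the attachment's media type (the list of memory-heavy types below is only an example, not an agreed policy):

      import java.util.Set;

      public final class AttachmentTypeFilter
      {
          // Formats where Tika (via PDFBox/POI) builds large in-memory structures.
          private static final Set<String> MEMORY_HEAVY_TYPES = Set.of(
              "application/pdf",
              "application/msword",
              "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
              "application/zip");

          private AttachmentTypeFilter()
          {
          }

          /**
           * Skip content extraction only for large attachments of memory-heavy types.
           * Large media files (video, audio) can still go to Tika, since it reads
           * their metadata sequentially without loading the whole file.
           */
          public static boolean shouldSkipContent(String mimeType, long size, long maxHeavySize)
          {
              return MEMORY_HEAVY_TYPES.contains(mimeType) && size > maxHeavySize;
          }
      }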

      Ideally, it would be nice to detect the OOM and return an empty string (so that Solr does not retry next time), but by that time the app is too unstable to do anything.

          People

            Assignee: Unassigned
            Reporter: Eduard Moraru (enygma)