Details
Bug
Resolution: Cannot Reproduce
Critical
None
11.10.4
Unknown
Description
I've recently come across a case in which a 5 MB *.vsdx file (Microsoft Visio diagram) led the Solr index thread into what looks like a memory leak that very quickly fills up the heap, spikes CPU usage (mainly because of GC) and eventually results in java.lang.OutOfMemoryError: GC overhead limit exceeded. (I was able to reproduce this locally with a fresh instance.)
The worst part is that once the Solr indexing fails with the OOM exception, the heap remains full. Even worse, after a restart, since the attachment was not properly indexed, Solr starts over again, leading to another OOM and practically locking access to the wiki. (I did not reproduce this post-restart part locally: there, Solr reported 1 document added and finished the indexing job at startup.)
Here is a thread dump from my initial remote XWiki instance where I noticed the issue, just before the heap managed to get full again (i.e. after a restart):
"XWiki Solr index thread" #105
   java.lang.Thread.State: RUNNABLE
        at java.util.Arrays.copyOf(Arrays.java:3181)
        at java.util.ArrayList.grow(ArrayList.java:265)
        at java.util.ArrayList.ensureExplicitCapacity(ArrayList.java:239)
        at java.util.ArrayList.ensureCapacityInternal(ArrayList.java:231)
        at java.util.ArrayList.add(ArrayList.java:462)
        at org.apache.xmlbeans.impl.values.NamespaceContext$NamespaceContextStack.push(NamespaceContext.java:81)
        at org.apache.xmlbeans.impl.values.NamespaceContext.push(NamespaceContext.java:106)
        at org.apache.xmlbeans.impl.values.XmlObjectBase.check_dated(XmlObjectBase.java:1318)
        at org.apache.xmlbeans.impl.values.XmlObjectBase.getStringValue(XmlObjectBase.java:1529)
        - locked on org.apache.xmlbeans.impl.store.Locale@59102ef5
        at com.microsoft.schemas.office.visio.x2012.main.impl.CellTypeImpl.getV()
        - locked on org.apache.xmlbeans.impl.store.Locale@59102ef5
        at org.apache.poi.xdgf.usermodel.XDGFCell.parseDoubleValue(XDGFCell.java:84)
        at org.apache.poi.xdgf.usermodel.section.geometry.LineTo.<init>(LineTo.java:49)
        at sun.reflect.GeneratedConstructorAccessor206.newInstance()
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at org.apache.poi.xdgf.util.ObjectFactory.load(ObjectFactory.java:49)
        at org.apache.poi.xdgf.usermodel.section.geometry.GeometryRowFactory.load(GeometryRowFactory.java:58)
        at org.apache.poi.xdgf.usermodel.section.GeometrySection.<init>(GeometrySection.java:55)
        at org.apache.poi.xdgf.usermodel.XDGFSheet.<init>(XDGFSheet.java:77)
        at org.apache.poi.xdgf.usermodel.XDGFShape.<init>(XDGFShape.java:113)
        at org.apache.poi.xdgf.usermodel.XDGFShape.<init>(XDGFShape.java:126)
        at org.apache.poi.xdgf.usermodel.XDGFShape.<init>(XDGFShape.java:107)
        at org.apache.poi.xdgf.usermodel.XDGFBaseContents.onDocumentRead(XDGFBaseContents.java:73)
        at org.apache.poi.xdgf.usermodel.XDGFPageContents.onDocumentRead(XDGFPageContents.java:62)
        at org.apache.poi.xdgf.usermodel.XDGFPages.onDocumentRead(XDGFPages.java:83)
        at org.apache.poi.xdgf.usermodel.XmlVisioDocument.onDocumentRead(XmlVisioDocument.java:105)
        at org.apache.poi.ooxml.POIXMLDocument.load(POIXMLDocument.java:184)
        at org.apache.poi.xdgf.usermodel.XmlVisioDocument.<init>(XmlVisioDocument.java:76)
        at org.apache.poi.xdgf.extractor.XDGFVisioExtractor.<init>(XDGFVisioExtractor.java:41)
        at org.apache.poi.ooxml.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:199)
        at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:161)
        at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:110)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
        at org.apache.tika.Tika.parseToString(Tika.java:527)
        at org.xwiki.tika.internal.TikaUtils.parseToString(TikaUtils.java:153)
        at org.xwiki.search.solr.internal.metadata.AbstractSolrMetadataExtractor.getContentAsText(AbstractSolrMetadataExtractor.java:509)
        at org.xwiki.search.solr.internal.metadata.AttachmentSolrMetadataExtractor.setLocaleAndContentFields(AttachmentSolrMetadataExtractor.java:111)
        at org.xwiki.search.solr.internal.metadata.AttachmentSolrMetadataExtractor.setFieldsInternal(AttachmentSolrMetadataExtractor.java:93)
        at org.xwiki.search.solr.internal.metadata.AbstractSolrMetadataExtractor.getSolrDocument(AbstractSolrMetadataExtractor.java:133)
        at org.xwiki.search.solr.internal.DefaultSolrIndexer.getSolrDocument(DefaultSolrIndexer.java:504)
        at org.xwiki.search.solr.internal.DefaultSolrIndexer.processBatch(DefaultSolrIndexer.java:411)
        at org.xwiki.search.solr.internal.DefaultSolrIndexer.run(DefaultSolrIndexer.java:377)
        at java.lang.Thread.run(Thread.java:748)
And the top heap consumers around the same time, retrieved with glowroot's "Heap histogram" view:
Class name | Bytes | Count
org.apache.xmlbeans.impl.store.Xobj$AttrXobj | 365,408,928 | 3,806,343
org.apache.xmlbeans.impl.store.Xobj$ElementXobj | 143,546,592 | 1,495,277
byte[] | 99,377,840 | 63,643
char[] | 88,724,368 | 366,691
short[] | 41,184,224 | 4,323
java.lang.String | 38,768,160 | 1,615,340
org.apache.xmlbeans.impl.values.XmlStringImpl | 23,440,752 | 976,698
java.util.HashMap$Node | 12,195,296 | 381,103
java.util.TreeMap$Entry | 10,547,960 | 263,699
com.microsoft.schemas.office.visio.x2012.main.impl.CellTypeImpl | 10,274,088 | 428,087
org.apache.xmlbeans.impl.values.XmlUnsignedIntImpl | 8,182,520 | 204,563
java.lang.Double | 8,136,312 | 339,013
java.lang.Object[] | 7,121,568 | 118,269
java.util.HashMap$Node[] | 6,783,560 | 67,629
java.util.concurrent.ConcurrentHashMap$Node | 5,760,416 | 180,013
java.util.HashMap | 4,395,360 | 91,570
org.apache.poi.xdgf.usermodel.section.geometry.LineTo | 4,123,424 | 128,857
com.microsoft.schemas.office.visio.x2012.main.impl.RowTypeImpl | 4,036,704 | 168,196
java.util.LinkedHashMap$Entry | 3,854,880 | 96,372
...
Total | 963,829,800 | 12,730,081
I tried locally with a sample vsdx file and everything was indexed fine; I was even able to find text from inside the diagram. However, once I attached the problematic vsdx file, I immediately got an OOM locally as well, as mentioned. Some particular content must be tripping up Apache Tika.
Local stacktrace:
java.lang.OutOfMemoryError: GC overhead limit exceeded
Dumping heap to data/java_pid207527.hprof ...
Heap dump file created [1267724775 bytes in 9.929 secs]
2021-02-12 17:21:23,103 [XWiki Solr index thread] ERROR o.x.s.s.i.DefaultSolrIndexer - Failed to process entry [INDEX Attachment xwiki:test.WebHome@cleanName.vsdx]
java.lang.OutOfMemoryError: GC overhead limit exceeded
        at org.apache.xmlbeans.impl.store.CharUtil.getString(CharUtil.java:97)
        at org.apache.xmlbeans.impl.store.Xobj.getValueAsString(Xobj.java:1196)
        at org.apache.xmlbeans.impl.store.Xobj.fetch_text(Xobj.java:1814)
        at org.apache.xmlbeans.impl.values.XmlObjectBase.get_wscanon_text(XmlObjectBase.java:1377)
        at org.apache.xmlbeans.impl.values.XmlObjectBase.check_dated(XmlObjectBase.java:1314)
        at org.apache.xmlbeans.impl.values.XmlObjectBase.getStringValue(XmlObjectBase.java:1529)
        at com.microsoft.schemas.office.visio.x2012.main.impl.SectionTypeImpl.getN(Unknown Source)
        at org.apache.poi.xdgf.usermodel.XDGFSheet.<init>(XDGFSheet.java:75)
        at org.apache.poi.xdgf.usermodel.XDGFShape.<init>(XDGFShape.java:113)
        at org.apache.poi.xdgf.usermodel.XDGFShape.<init>(XDGFShape.java:126)
        at org.apache.poi.xdgf.usermodel.XDGFShape.<init>(XDGFShape.java:107)
        at org.apache.poi.xdgf.usermodel.XDGFBaseContents.onDocumentRead(XDGFBaseContents.java:73)
        at org.apache.poi.xdgf.usermodel.XDGFPageContents.onDocumentRead(XDGFPageContents.java:62)
        at org.apache.poi.xdgf.usermodel.XDGFPages.onDocumentRead(XDGFPages.java:83)
        at org.apache.poi.xdgf.usermodel.XmlVisioDocument.onDocumentRead(XmlVisioDocument.java:105)
        at org.apache.poi.ooxml.POIXMLDocument.load(POIXMLDocument.java:184)
        at org.apache.poi.xdgf.usermodel.XmlVisioDocument.<init>(XmlVisioDocument.java:76)
        at org.apache.poi.xdgf.extractor.XDGFVisioExtractor.<init>(XDGFVisioExtractor.java:41)
        at org.apache.poi.ooxml.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:199)
        at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:161)
        at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:110)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
        at org.apache.tika.Tika.parseToString(Tika.java:527)
        at org.xwiki.tika.internal.TikaUtils.parseToString(TikaUtils.java:153)
        at org.xwiki.search.solr.internal.metadata.AbstractSolrMetadataExtractor.getContentAsText(AbstractSolrMetadataExtractor.java:509)
        at org.xwiki.search.solr.internal.metadata.AttachmentSolrMetadataExtractor.setLocaleAndContentFields(AttachmentSolrMetadataExtractor.java:111)
        at org.xwiki.search.solr.internal.metadata.AttachmentSolrMetadataExtractor.setFieldsInternal(AttachmentSolrMetadataExtractor.java:93)
        at org.xwiki.search.solr.internal.metadata.AbstractSolrMetadataExtractor.getSolrDocument(AbstractSolrMetadataExtractor.java:133)
        at org.xwiki.search.solr.internal.DefaultSolrIndexer.getSolrDocument(DefaultSolrIndexer.java:504)
        at org.xwiki.search.solr.internal.DefaultSolrIndexer.processBatch(DefaultSolrIndexer.java:411)
2021-02-12 17:21:43,895 [XWiki Solr index thread] ERROR o.x.s.s.i.DefaultSolrIndexer - Failed to process entry [INDEX xwiki:test.WebHome]
java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.util.Arrays.copyOf(Arrays.java:3181)
        at java.util.ArrayList.grow(ArrayList.java:267)
        at java.util.ArrayList.ensureExplicitCapacity(ArrayList.java:241)
        at java.util.ArrayList.ensureCapacityInternal(ArrayList.java:233)
        at java.util.ArrayList.add(ArrayList.java:464)
        at org.apache.xmlbeans.impl.values.NamespaceContext$NamespaceContextStack.push(NamespaceContext.java:81)
        at org.apache.xmlbeans.impl.values.NamespaceContext.push(NamespaceContext.java:106)
        at org.apache.xmlbeans.impl.values.XmlObjectBase.check_dated(XmlObjectBase.java:1318)
        at org.apache.xmlbeans.impl.values.XmlObjectBase.getStringValue(XmlObjectBase.java:1529)
        at com.microsoft.schemas.office.visio.x2012.main.impl.RowTypeImpl.getT(Unknown Source)
        at org.apache.poi.xdgf.usermodel.section.geometry.GeometryRowFactory.load(GeometryRowFactory.java:58)
        at org.apache.poi.xdgf.usermodel.section.GeometrySection.<init>(GeometrySection.java:55)
        at org.apache.poi.xdgf.usermodel.XDGFSheet.<init>(XDGFSheet.java:77)
        at org.apache.poi.xdgf.usermodel.XDGFShape.<init>(XDGFShape.java:113)
        at org.apache.poi.xdgf.usermodel.XDGFShape.<init>(XDGFShape.java:126)
        at org.apache.poi.xdgf.usermodel.XDGFShape.<init>(XDGFShape.java:107)
        at org.apache.poi.xdgf.usermodel.XDGFBaseContents.onDocumentRead(XDGFBaseContents.java:73)
        at org.apache.poi.xdgf.usermodel.XDGFPageContents.onDocumentRead(XDGFPageContents.java:62)
        at org.apache.poi.xdgf.usermodel.XDGFPages.onDocumentRead(XDGFPages.java:83)
        at org.apache.poi.xdgf.usermodel.XmlVisioDocument.onDocumentRead(XmlVisioDocument.java:105)
        at org.apache.poi.ooxml.POIXMLDocument.load(POIXMLDocument.java:184)
        at org.apache.poi.xdgf.usermodel.XmlVisioDocument.<init>(XmlVisioDocument.java:76)
        at org.apache.poi.xdgf.extractor.XDGFVisioExtractor.<init>(XDGFVisioExtractor.java:41)
        at org.apache.poi.ooxml.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:199)
        at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:161)
        at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:110)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
        at org.apache.tika.Tika.parseToString(Tika.java:527)
        at org.xwiki.tika.internal.TikaUtils.parseToString(TikaUtils.java:153)
        at org.xwiki.search.solr.internal.metadata.AbstractSolrMetadataExtractor.getContentAsText(AbstractSolrMetadataExtractor.java:509)
Not sure if the solution is to upgrade Tika or to add some sort of exclude option (file name/path/reference) to the Solr indexer, in order to allow users to work around problematic content.
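To make the exclude idea a bit more concrete, here is a minimal sketch of what a file-name based exclusion check could look like. It is only an illustration: the class AttachmentIndexExclusions and its method isExcluded are hypothetical names, not existing XWiki or Solr APIs, and where the indexer would call it (presumably before handing the attachment content to Tika) is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/**
 * Hypothetical helper (not part of XWiki): decides, from the attachment file
 * name alone, whether content extraction should be skipped for an attachment.
 */
final class AttachmentIndexExclusions {
    private final List<Pattern> patterns = new ArrayList<>();

    /**
     * @param regexes regular expressions matched against the whole file name,
     *                case-insensitively (for example ".*\\.vsdx")
     */
    AttachmentIndexExclusions(List<String> regexes) {
        for (String regex : regexes) {
            patterns.add(Pattern.compile(regex, Pattern.CASE_INSENSITIVE));
        }
    }

    /** Returns true if the file name fully matches at least one configured pattern. */
    boolean isExcluded(String fileName) {
        for (Pattern pattern : patterns) {
            if (pattern.matcher(fileName).matches()) {
                return true;
            }
        }
        return false;
    }
}
```

With a pattern such as ".*\\.vsdx" configured, an attachment named cleanName.vsdx would be skipped for content extraction while its metadata could still be indexed; that split between metadata and content is a design suggestion, not current behavior.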
Note: I am unable to attach the problematic file publicly due to privacy reasons.