  1. XWiki Platform
  2. XWIKI-18340

Microsoft Visio *.vsdx file attachment causes OOM when Solr tries to index it

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • None
    • Affects Version: 11.10.4
    • Component: Search - Solr
    • None
    • Unknown

    Description

      I recently came across a case where a 5 MB *.vsdx file (Microsoft Visio diagram) led the Solr index thread into what looks like a memory leak: the heap fills up very quickly, CPU usage spikes (mainly because of GC), and the thread eventually fails with java.lang.OutOfMemoryError: GC overhead limit exceeded. I was able to reproduce this locally on a fresh instance.

      The worst part is that once Solr indexing fails with the OOM exception, the heap remains full. Even worse, after a restart, since the attachment was not properly indexed, Solr starts over again, leading to another OOM and practically locking access to the wiki. (I did not reproduce this post-restart behaviour locally: there, Solr reported "1 documents added" and finished the indexing job at startup.)

      Here is a thread dump from my initial remote XWiki instance where I noticed the issue, just before the heap managed to get full again (i.e. after a restart):

      "XWiki Solr index thread" #105
         java.lang.Thread.State: RUNNABLE
              at java.util.Arrays.copyOf(Arrays.java:3181)
              at java.util.ArrayList.grow(ArrayList.java:265)
              at java.util.ArrayList.ensureExplicitCapacity(ArrayList.java:239)
              at java.util.ArrayList.ensureCapacityInternal(ArrayList.java:231)
              at java.util.ArrayList.add(ArrayList.java:462)
              at org.apache.xmlbeans.impl.values.NamespaceContext$NamespaceContextStack.push(NamespaceContext.java:81)
              at org.apache.xmlbeans.impl.values.NamespaceContext.push(NamespaceContext.java:106)
              at org.apache.xmlbeans.impl.values.XmlObjectBase.check_dated(XmlObjectBase.java:1318)
              at org.apache.xmlbeans.impl.values.XmlObjectBase.getStringValue(XmlObjectBase.java:1529)
              - locked on org.apache.xmlbeans.impl.store.Locale@59102ef5
              at com.microsoft.schemas.office.visio.x2012.main.impl.CellTypeImpl.getV()
              - locked on org.apache.xmlbeans.impl.store.Locale@59102ef5
              at org.apache.poi.xdgf.usermodel.XDGFCell.parseDoubleValue(XDGFCell.java:84)
              at org.apache.poi.xdgf.usermodel.section.geometry.LineTo.<init>(LineTo.java:49)
              at sun.reflect.GeneratedConstructorAccessor206.newInstance()
              at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
              at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
              at org.apache.poi.xdgf.util.ObjectFactory.load(ObjectFactory.java:49)
              at org.apache.poi.xdgf.usermodel.section.geometry.GeometryRowFactory.load(GeometryRowFactory.java:58)
              at org.apache.poi.xdgf.usermodel.section.GeometrySection.<init>(GeometrySection.java:55)
              at org.apache.poi.xdgf.usermodel.XDGFSheet.<init>(XDGFSheet.java:77)
              at org.apache.poi.xdgf.usermodel.XDGFShape.<init>(XDGFShape.java:113)
              at org.apache.poi.xdgf.usermodel.XDGFShape.<init>(XDGFShape.java:126)
              at org.apache.poi.xdgf.usermodel.XDGFShape.<init>(XDGFShape.java:107)
              at org.apache.poi.xdgf.usermodel.XDGFBaseContents.onDocumentRead(XDGFBaseContents.java:73)
              at org.apache.poi.xdgf.usermodel.XDGFPageContents.onDocumentRead(XDGFPageContents.java:62)
              at org.apache.poi.xdgf.usermodel.XDGFPages.onDocumentRead(XDGFPages.java:83)
              at org.apache.poi.xdgf.usermodel.XmlVisioDocument.onDocumentRead(XmlVisioDocument.java:105)
              at org.apache.poi.ooxml.POIXMLDocument.load(POIXMLDocument.java:184)
              at org.apache.poi.xdgf.usermodel.XmlVisioDocument.<init>(XmlVisioDocument.java:76)
              at org.apache.poi.xdgf.extractor.XDGFVisioExtractor.<init>(XDGFVisioExtractor.java:41)
              at org.apache.poi.ooxml.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:199)
              at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:161)
              at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:110)
              at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
              at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
              at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
              at org.apache.tika.Tika.parseToString(Tika.java:527)
              at org.xwiki.tika.internal.TikaUtils.parseToString(TikaUtils.java:153)
              at org.xwiki.search.solr.internal.metadata.AbstractSolrMetadataExtractor.getContentAsText(AbstractSolrMetadataExtractor.java:509)
              at org.xwiki.search.solr.internal.metadata.AttachmentSolrMetadataExtractor.setLocaleAndContentFields(AttachmentSolrMetadataExtractor.java:111)
              at org.xwiki.search.solr.internal.metadata.AttachmentSolrMetadataExtractor.setFieldsInternal(AttachmentSolrMetadataExtractor.java:93)
              at org.xwiki.search.solr.internal.metadata.AbstractSolrMetadataExtractor.getSolrDocument(AbstractSolrMetadataExtractor.java:133)
              at org.xwiki.search.solr.internal.DefaultSolrIndexer.getSolrDocument(DefaultSolrIndexer.java:504)
              at org.xwiki.search.solr.internal.DefaultSolrIndexer.processBatch(DefaultSolrIndexer.java:411)
              at org.xwiki.search.solr.internal.DefaultSolrIndexer.run(DefaultSolrIndexer.java:377)
              at java.lang.Thread.run(Thread.java:748)
      

      And here are the top heap consumers around the same time, retrieved with Glowroot's "Heap histogram" view:

      Class name 	Bytes 	Count
      org.apache.xmlbeans.impl.store.Xobj$AttrXobj	365,408,928 	3,806,343
      org.apache.xmlbeans.impl.store.Xobj$ElementXobj	143,546,592 	1,495,277
      byte[]	99,377,840 	63,643
      char[]	88,724,368 	366,691
      short[]	41,184,224 	4,323
      java.lang.String	38,768,160 	1,615,340
      org.apache.xmlbeans.impl.values.XmlStringImpl	23,440,752 	976,698
      java.util.HashMap$Node	12,195,296 	381,103
      java.util.TreeMap$Entry	10,547,960 	263,699
      com.microsoft.schemas.office.visio.x2012.main.impl.CellTypeImpl	10,274,088 	428,087
      org.apache.xmlbeans.impl.values.XmlUnsignedIntImpl	8,182,520 	204,563
      java.lang.Double	8,136,312 	339,013
      java.lang.Object[]	7,121,568 	118,269
      java.util.HashMap$Node[]	6,783,560 	67,629
      java.util.concurrent.ConcurrentHashMap$Node	5,760,416 	180,013
      java.util.HashMap	4,395,360 	91,570
      org.apache.poi.xdgf.usermodel.section.geometry.LineTo	4,123,424 	128,857
      com.microsoft.schemas.office.visio.x2012.main.impl.RowTypeImpl	4,036,704 	168,196
      java.util.LinkedHashMap$Entry	3,854,880 	96,372
      ...
      Total 	963,829,800 	12,730,081
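
To put the histogram in perspective, here is a minimal sketch that computes each class's share of the reported total and its average size per instance. The class and method names (HeapHistogramShare, parseCount, sharePercent) are made up for this illustration; only the numbers come from the table above.

```java
/**
 * Illustrative helper (hypothetical, not part of XWiki or Glowroot):
 * works out how much of the heap the top histogram rows account for.
 */
final class HeapHistogramShare {
    /** Parses a number that may contain thousands separators, e.g. "365,408,928". */
    static long parseCount(String text) {
        return Long.parseLong(text.replace(",", "").trim());
    }

    /** Share of the total in percent, rounded to one decimal place. */
    static double sharePercent(long bytes, long totalBytes) {
        return Math.round(bytes * 1000.0 / totalBytes) / 10.0;
    }

    /** Average number of bytes per instance (integer division). */
    static long bytesPerInstance(long bytes, long count) {
        return bytes / count;
    }

    public static void main(String[] args) {
        long total = parseCount("963829800");

        // Values copied from the first two rows of the histogram.
        long attrBytes = parseCount("365,408,928");
        long attrCount = parseCount("3,806,343");
        long elemBytes = parseCount("143,546,592");
        long elemCount = parseCount("1,495,277");

        System.out.println("Xobj$AttrXobj    share %: " + sharePercent(attrBytes, total)
            + ", bytes/instance: " + bytesPerInstance(attrBytes, attrCount));
        System.out.println("Xobj$ElementXobj share %: " + sharePercent(elemBytes, total)
            + ", bytes/instance: " + bytesPerInstance(elemBytes, elemCount));
        System.out.println("Both combined    share %: "
            + sharePercent(attrBytes + elemBytes, total));
    }
}
```

With the figures above, the two XmlBeans Xobj classes alone account for a little over half of the reported heap, at 96 bytes per instance, which suggests the memory goes into the in-memory XML object tree for the diagram rather than into extracted text.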
      

      I tried locally with a sample vsdx file and everything was indexed fine; I was even able to find text from inside the diagram. However, as soon as I attached the problematic vsdx file, I immediately got an OOM locally as well, as mentioned above. Some particular content in that file must be tripping up Apache Tika (or, judging by the stack traces, the Apache POI Visio parser it delegates to).

      Local stacktrace:

      java.lang.OutOfMemoryError: GC overhead limit exceeded
      Dumping heap to data/java_pid207527.hprof ...
      Heap dump file created [1267724775 bytes in 9.929 secs]
      2021-02-12 17:21:23,103 [XWiki Solr index thread] ERROR o.x.s.s.i.DefaultSolrIndexer   - Failed to process entry [INDEX Attachment xwiki:test.WebHome@cleanName.vsdx] 
      java.lang.OutOfMemoryError: GC overhead limit exceeded
              at org.apache.xmlbeans.impl.store.CharUtil.getString(CharUtil.java:97)
              at org.apache.xmlbeans.impl.store.Xobj.getValueAsString(Xobj.java:1196)
              at org.apache.xmlbeans.impl.store.Xobj.fetch_text(Xobj.java:1814)
              at org.apache.xmlbeans.impl.values.XmlObjectBase.get_wscanon_text(XmlObjectBase.java:1377)
              at org.apache.xmlbeans.impl.values.XmlObjectBase.check_dated(XmlObjectBase.java:1314)
              at org.apache.xmlbeans.impl.values.XmlObjectBase.getStringValue(XmlObjectBase.java:1529)
              at com.microsoft.schemas.office.visio.x2012.main.impl.SectionTypeImpl.getN(Unknown Source)
              at org.apache.poi.xdgf.usermodel.XDGFSheet.<init>(XDGFSheet.java:75)
              at org.apache.poi.xdgf.usermodel.XDGFShape.<init>(XDGFShape.java:113)
              at org.apache.poi.xdgf.usermodel.XDGFShape.<init>(XDGFShape.java:126)
              at org.apache.poi.xdgf.usermodel.XDGFShape.<init>(XDGFShape.java:107)
              at org.apache.poi.xdgf.usermodel.XDGFBaseContents.onDocumentRead(XDGFBaseContents.java:73)
              at org.apache.poi.xdgf.usermodel.XDGFPageContents.onDocumentRead(XDGFPageContents.java:62)
              at org.apache.poi.xdgf.usermodel.XDGFPages.onDocumentRead(XDGFPages.java:83)
              at org.apache.poi.xdgf.usermodel.XmlVisioDocument.onDocumentRead(XmlVisioDocument.java:105)
              at org.apache.poi.ooxml.POIXMLDocument.load(POIXMLDocument.java:184)
              at org.apache.poi.xdgf.usermodel.XmlVisioDocument.<init>(XmlVisioDocument.java:76)
              at org.apache.poi.xdgf.extractor.XDGFVisioExtractor.<init>(XDGFVisioExtractor.java:41)
              at org.apache.poi.ooxml.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:199)
              at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:161)
              at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:110)
              at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
              at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
              at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
              at org.apache.tika.Tika.parseToString(Tika.java:527)
              at org.xwiki.tika.internal.TikaUtils.parseToString(TikaUtils.java:153)
              at org.xwiki.search.solr.internal.metadata.AbstractSolrMetadataExtractor.getContentAsText(AbstractSolrMetadataExtractor.java:509)
              at org.xwiki.search.solr.internal.metadata.AttachmentSolrMetadataExtractor.setLocaleAndContentFields(AttachmentSolrMetadataExtractor.java:111)
              at org.xwiki.search.solr.internal.metadata.AttachmentSolrMetadataExtractor.setFieldsInternal(AttachmentSolrMetadataExtractor.java:93)
              at org.xwiki.search.solr.internal.metadata.AbstractSolrMetadataExtractor.getSolrDocument(AbstractSolrMetadataExtractor.java:133)
              at org.xwiki.search.solr.internal.DefaultSolrIndexer.getSolrDocument(DefaultSolrIndexer.java:504)
              at org.xwiki.search.solr.internal.DefaultSolrIndexer.processBatch(DefaultSolrIndexer.java:411)
      2021-02-12 17:21:43,895 [XWiki Solr index thread] ERROR o.x.s.s.i.DefaultSolrIndexer   - Failed to process entry [INDEX xwiki:test.WebHome] 
      java.lang.OutOfMemoryError: GC overhead limit exceeded
              at java.util.Arrays.copyOf(Arrays.java:3181)
              at java.util.ArrayList.grow(ArrayList.java:267)
              at java.util.ArrayList.ensureExplicitCapacity(ArrayList.java:241)
              at java.util.ArrayList.ensureCapacityInternal(ArrayList.java:233)
              at java.util.ArrayList.add(ArrayList.java:464)
              at org.apache.xmlbeans.impl.values.NamespaceContext$NamespaceContextStack.push(NamespaceContext.java:81)
              at org.apache.xmlbeans.impl.values.NamespaceContext.push(NamespaceContext.java:106)
              at org.apache.xmlbeans.impl.values.XmlObjectBase.check_dated(XmlObjectBase.java:1318)
              at org.apache.xmlbeans.impl.values.XmlObjectBase.getStringValue(XmlObjectBase.java:1529)
              at com.microsoft.schemas.office.visio.x2012.main.impl.RowTypeImpl.getT(Unknown Source)
              at org.apache.poi.xdgf.usermodel.section.geometry.GeometryRowFactory.load(GeometryRowFactory.java:58)
              at org.apache.poi.xdgf.usermodel.section.GeometrySection.<init>(GeometrySection.java:55)
              at org.apache.poi.xdgf.usermodel.XDGFSheet.<init>(XDGFSheet.java:77)
              at org.apache.poi.xdgf.usermodel.XDGFShape.<init>(XDGFShape.java:113)
              at org.apache.poi.xdgf.usermodel.XDGFShape.<init>(XDGFShape.java:126)
              at org.apache.poi.xdgf.usermodel.XDGFShape.<init>(XDGFShape.java:107)
              at org.apache.poi.xdgf.usermodel.XDGFBaseContents.onDocumentRead(XDGFBaseContents.java:73)
              at org.apache.poi.xdgf.usermodel.XDGFPageContents.onDocumentRead(XDGFPageContents.java:62)
              at org.apache.poi.xdgf.usermodel.XDGFPages.onDocumentRead(XDGFPages.java:83)
              at org.apache.poi.xdgf.usermodel.XmlVisioDocument.onDocumentRead(XmlVisioDocument.java:105)
              at org.apache.poi.ooxml.POIXMLDocument.load(POIXMLDocument.java:184)
              at org.apache.poi.xdgf.usermodel.XmlVisioDocument.<init>(XmlVisioDocument.java:76)
              at org.apache.poi.xdgf.extractor.XDGFVisioExtractor.<init>(XDGFVisioExtractor.java:41)
              at org.apache.poi.ooxml.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:199)
              at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:161)
              at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:110)
              at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
              at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
              at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
              at org.apache.tika.Tika.parseToString(Tika.java:527)
              at org.xwiki.tika.internal.TikaUtils.parseToString(TikaUtils.java:153)
              at org.xwiki.search.solr.internal.metadata.AbstractSolrMetadataExtractor.getContentAsText(AbstractSolrMetadataExtractor.java:509)
      

      I'm not sure whether the solution is to upgrade Tika or to add some sort of exclude option (by file name, path or reference) to the Solr indexer, so that users can work around problematic content.
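
To make the exclude idea more concrete, here is a rough sketch of what such a check could look like. Everything in it is hypothetical: AttachmentIndexExclusions and skipsContent are names invented for this illustration and do not exist in XWiki, and the sketch assumes attachment references shaped like the one in the log above (xwiki:test.WebHome@cleanName.vsdx), with the file name after the last '@'.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/**
 * Hypothetical exclusion check for attachment content indexing.
 * Not an existing XWiki API; a sketch of the proposed option only.
 */
final class AttachmentIndexExclusions {
    private final Set<String> excludedExtensions = new HashSet<>();
    private final Set<String> excludedReferences;

    AttachmentIndexExclusions(Set<String> extensions, Set<String> references) {
        // Normalise extensions once so that "VSDX" and "vsdx" behave the same.
        for (String extension : extensions) {
            this.excludedExtensions.add(extension.toLowerCase(Locale.ROOT));
        }
        this.excludedReferences = new HashSet<>(references);
    }

    /** Returns true when the attachment content should not be passed to the text extractor. */
    boolean skipsContent(String attachmentReference) {
        if (excludedReferences.contains(attachmentReference)) {
            return true;
        }
        // Assumption: the file name is whatever follows the last '@' in the reference.
        String fileName = attachmentReference.substring(attachmentReference.lastIndexOf('@') + 1);
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return false; // no extension to compare
        }
        String extension = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return excludedExtensions.contains(extension);
    }
}
```

An indexer could consult such a check before calling the content extractor and still index the attachment's metadata (name, author, date), so the file would remain findable by name even though its content is skipped.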

      Note: I am unable to attach the problematic file publicly for privacy reasons.

          People

            Assignee: Unassigned
            Reporter: Eduard Moraru (enygma)
            Votes: 0
            Watchers: 3
