  1. XWiki Platform
  2. XWIKI-23271

Cache attachment content extracted by Tika


Details

    • Improvement
    • Resolution: Unresolved
    • Major
    • None
    • 16.10.0
    • Search - Solr
    • Unknown

    Description

      The Solr indexer currently extracts the textual content of an attachment with Tika every time the attachment is indexed. The same attachment is usually indexed at least twice: once for the document entry and once for the separate attachment entry. If the document has translations, it is indexed once more for every translation. Whenever a document changes, its attachments are re-parsed even if they did not change. Because parsing attachments with Tika can be slow and resource-intensive, we should introduce a cache for the extracted text. Ideally, this cache should be stored in the same store as the attachment itself. There should be a way to clear the cache manually or automatically, e.g., on Tika upgrades.
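
      A minimal sketch of the idea, not the actual design: the class name ExtractedTextCache, its methods, and the extractor function are hypothetical, and the extractor only stands in for the real Tika call. The sketch keeps entries in memory for brevity, whereas the proposal is to persist them next to the attachment. The cache key combines a hash of the attachment bytes with a parser version string, so an unchanged attachment is parsed once and a Tika upgrade implicitly invalidates old entries.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical in-memory cache of text extracted from attachments. */
class ExtractedTextCache {
    private final Map<String, String> textByKey = new ConcurrentHashMap<>();
    private final String parserVersion;
    private final Function<byte[], String> extractor;

    /**
     * @param parserVersion assumed identifier of the parser release, e.g. the Tika version
     * @param extractor placeholder for the expensive text extraction
     */
    ExtractedTextCache(String parserVersion, Function<byte[], String> extractor) {
        this.parserVersion = parserVersion;
        this.extractor = extractor;
    }

    /** Returns the cached text, running the extractor only on a cache miss. */
    String getText(byte[] attachmentContent) {
        return textByKey.computeIfAbsent(key(attachmentContent),
            k -> extractor.apply(attachmentContent));
    }

    /** Manual invalidation, e.g. triggered by an administrator. */
    void clear() {
        textByKey.clear();
    }

    /** Key = SHA-256 over the parser version, a separator byte, and the content. */
    private String key(byte[] content) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            digest.update(parserVersion.getBytes(StandardCharsets.UTF_8));
            digest.update((byte) 0);
            return new BigInteger(1, digest.digest(content)).toString(16);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 should be available on every JVM", e);
        }
    }
}
```

      With such a cache, indexing the document entry, the attachment entry, and each translation would call getText with the same bytes and trigger a single extraction. Keying on the content rather than on the document means that an edit to the document alone leaves the cached text valid.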

          People

            Assignee: Unassigned
            Reporter: Michael Hamann
            Votes: 0
            Watchers: 1
