Details

Type: Improvement
Resolution: Unresolved
Priority: Major
Fix Version/s: None
Affects Version/s: 16.10.0
Component/s: Search - Solr
Labels:
- performance

Difficulty: Unknown
Description

The Solr indexer currently extracts the textual content of an attachment with Tika every time the attachment is indexed. The same attachment is usually indexed at least twice: once for the document entry and once for the separate attachment entry. If the document has translations, it is indexed once more for every translation. Whenever a document changes, its attachments are re-parsed even if they did not change. Since parsing attachments with Tika can be slow and resource-intensive, we should introduce a cache for the extracted text. Ideally, this cache is stored in the same store as the attachment itself. There should be a way to clear the cache manually or automatically, e.g., on Tika upgrades.
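
A minimal sketch of the idea, not a proposed implementation: the class `ExtractedTextCache`, its methods, and the `parserVersion` string are all hypothetical names and do not exist in XWiki. The extractor is passed in as a plain function standing in for the Tika call, and an in-memory map stands in for the attachment store. Entries are keyed by a hash of the parser version plus the attachment content, so unchanged content is parsed only once and a Tika upgrade naturally misses the old entries.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical cache for text extracted from attachments (illustration only). */
final class ExtractedTextCache {
    // Assumption: an in-memory map stands in for the attachment store.
    private final Map<String, String> entries = new ConcurrentHashMap<>();
    // Assumption: stands in for the actual Tika-based extraction.
    private final Function<byte[], String> extractor;
    // Assumption: an identifier that changes whenever the parser is upgraded.
    private final String parserVersion;

    ExtractedTextCache(Function<byte[], String> extractor, String parserVersion) {
        this.extractor = extractor;
        this.parserVersion = parserVersion;
    }

    /** Returns the cached text, running the extractor only on a cache miss. */
    String getText(byte[] content) {
        return entries.computeIfAbsent(key(content), k -> extractor.apply(content));
    }

    /** Manual clearing, e.g. to drop stale entries after a parser upgrade. */
    void clear() {
        entries.clear();
    }

    /** Number of cached entries. */
    int size() {
        return entries.size();
    }

    // Key = SHA-256 over parser version, a separator byte, and the content.
    private String key(byte[] content) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            digest.update(parserVersion.getBytes(StandardCharsets.UTF_8));
            digest.update((byte) 0);
            digest.update(content);
            return Base64.getEncoder().encodeToString(digest.digest());
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }
}
```

With such a cache, indexing the document entry, the attachment entry, and each translation would call `getText` with the same bytes and trigger a single extraction. Keying by content hash rather than by document reference is one possible design choice; it also avoids re-parsing when the document changes but the attachment does not.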

People

Assignee: Unassigned

Reporter: Michael Hamann

Votes: 0

Watchers: 1

Dates

Created: 04/Jun/25 17:30

Updated: 04/Jun/25 17:30