  1. LLM AI Integration
  2. LLMAI-79

Improve chunking by taking newlines and headings into account

Details

    • Improvement
    • Resolution: Fixed
    • Major
    • 0.5
    • 0.4
    • None
    • Unit
    • Unknown
    • N/A

    Description

      Chunking should try to put whole sections and paragraphs of the document into a chunk instead of splitting the content in the middle of words. If this isn't possible, chunking should at least try to split on a space character or word boundary.

      A possible chunking algorithm could be to read the input line by line and do the following:

      • If adding the next line would make the chunk too big, don't add the line. However, if the chunk is still smaller than, e.g., half of the maximum size, add a part of the line (ideally splitting at the end of a sentence, or at least at a word boundary).
      • While the last line of a chunk is a heading, remove that heading (so the next chunk starts with it) unless it makes the chunk too small.
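
The two rules above could be sketched roughly as follows. This is only an illustration in Python, not the implementation used in the project: the names `chunk_text`, `split_point`, `max_chars` and `min_fill` are hypothetical, sizes are counted in characters rather than tokens, and headings are assumed to start with Markdown-style `#` or XWiki-style `=` markers.

```python
import re

# Assumption: a heading line starts with 1-6 '#' (Markdown) or '=' (XWiki) characters.
HEADING = re.compile(r"^(#{1,6}|={1,6})\s")


def split_point(line: str, limit: int) -> int:
    """Return a cut index <= limit, preferring a sentence end, then a space."""
    window = line[:limit]
    sentence_ends = [m.end() for m in re.finditer(r"[.!?]\s", window)]
    if sentence_ends:
        return sentence_ends[-1]
    space = window.rfind(" ")
    if space > 0:
        return space + 1
    return limit  # no boundary found: fall back to a hard cut


def chunk_text(text: str, max_chars: int, min_fill: float = 0.5) -> list[str]:
    """Split text into chunks of at most max_chars characters, line by line."""
    if max_chars < 1 or not 0 < min_fill <= 1:
        raise ValueError("max_chars must be >= 1 and min_fill in (0, 1]")
    min_chars = max_chars * min_fill
    # Reversed so that pop() returns the next line and append() pushes one back.
    pending = text.splitlines(keepends=True)[::-1]
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    while pending:
        line = pending.pop()
        if size + len(line) <= max_chars:
            current.append(line)
            size += len(line)
            continue
        if size < min_chars:
            # Chunk is still small: take the part of the line that fits.
            cut = split_point(line, max_chars - size)
            current.append(line[:cut])
            size += cut
            pending.append(line[cut:])
        else:
            pending.append(line)
        # Move trailing headings to the next chunk unless the chunk gets too small.
        while HEADING.match(current[-1]) and size - len(current[-1]) >= min_chars:
            heading = current.pop()
            size -= len(heading)
            pending.append(heading)
        chunks.append("".join(current))
        current, size = [], 0
    if current:
        chunks.append("".join(current))
    return chunks
```

Concatenating the returned chunks reproduces the input, so no content is lost. A real implementation would probably measure chunk size in tokens and might add overlap between chunks, both of which are left out here.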

      We should also check what other projects do.

          People

            MichaelHamann Michael Hamann
