Details
-
Improvement
-
Resolution: Fixed
-
Major
-
0.4
-
None
Description
Chunking should try to put whole sections and paragraphs of the document into a chunk instead of splitting the content in the middle of words. If this isn't possible, chunking should at least try to split on a space character or word boundary.
A possible chunking algorithm could be to read the input line by line and do the following:
- If adding the next line makes the chunk too big, don't add the line. If the chunk is then smaller than, e.g., half of the maximum size, add a part of the line (ideally splitting at a word boundary or at the end of a sentence).
- While the last line of a chunk is a heading, remove that heading (so the next chunk starts with it) unless it makes the chunk too small.
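The two rules above could be sketched roughly as follows. This is only an illustration under assumptions, not the implemented fix: the names chunk_lines, is_heading and split_at_word_boundary are hypothetical, headings are assumed to be lines starting with "#", sizes are counted in characters, and "too small" is taken to mean less than half of the maximum unless min_size is given.

```python
from collections import deque


def is_heading(line: str) -> bool:
    # Assumption: headings start with "#"; adapt to the actual syntax.
    return line.startswith("#")


def chunk_len(lines: list[str]) -> int:
    # Length of the chunk once its lines are joined with newlines.
    return sum(len(l) for l in lines) + max(len(lines) - 1, 0)


def split_at_word_boundary(line: str, room: int) -> tuple[str, str]:
    """Split so the first part has at most `room` (> 0) characters."""
    if len(line) <= room:
        return line, ""
    cut = line.rfind(" ", 0, room + 1)
    if cut <= 0:
        cut = room  # no usable space: fall back to a hard cut
    return line[:cut], line[cut:].lstrip(" ")


def chunk_lines(lines, max_size: int, min_size=None) -> list[str]:
    if min_size is None:
        min_size = max(max_size // 2, 1)
    if not 0 < min_size <= max_size:
        raise ValueError("need 0 < min_size <= max_size")
    queue = deque(lines)
    chunks: list[str] = []
    current: list[str] = []
    while queue:
        line = queue.popleft()
        size = chunk_len(current)
        sep = 1 if current else 0  # newline joining the line to the chunk
        if size + sep + len(line) <= max_size:
            current.append(line)
            continue
        # The line does not fit. If the chunk is still small, add a part of it.
        room = max_size - size - sep
        if size < min_size and room > 0:
            head, line = split_at_word_boundary(line, room)
            current.append(head)
        # Move trailing headings to the next chunk unless that would
        # leave the current chunk too small.
        carried: list[str] = []
        while (len(current) > 1 and is_heading(current[-1])
               and chunk_len(current[:-1]) >= min_size):
            carried.insert(0, current.pop())
        chunks.append("\n".join(current))
        current = []
        # Put the unused rest of the line and the carried headings back.
        if line:
            queue.appendleft(line)
        for heading in reversed(carried):
            queue.appendleft(heading)
    if current:
        chunks.append("\n".join(current))
    return chunks
```

For example, chunk_lines(["aaaa bbbb cccc dddd", "# Heading", "eeee ffff gggg hhhh"], max_size=30) closes the first chunk before "# Heading", so the heading starts the second chunk together with the text that follows it. Splitting at the end of a sentence is not attempted in this sketch; only spaces are used as split points.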
We should also check what other projects do.