Package org.apache.tika.inference
Class MarkdownChunker
java.lang.Object
org.apache.tika.inference.MarkdownChunker
Splits markdown text into chunks that respect structural boundaries.
The chunker first splits on markdown heading boundaries (# ...).
If a section exceeds the maximum chunk size, it is further split on
paragraph boundaries (double newlines). As a last resort, oversized
paragraphs are split at the character limit.
Consecutive chunks can overlap by a configurable number of characters to avoid losing context at boundaries.
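The three-level strategy described above can be illustrated with a small standalone sketch. This is not the Tika implementation: the class name MarkdownSplitSketch and its split method are hypothetical, it returns plain strings rather than chunks with offsets, it emits each paragraph of an oversized section as its own chunk, and it applies the overlap only at the last-resort character cut. It also assumes headings are ATX-style lines ("#" to "######" followed by a space) and does not skip headings inside fenced code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; names and behavior are assumptions, not Tika's API.
final class MarkdownSplitSketch {

    /** Splits text by headings, then paragraphs, then a hard character cut. */
    static List<String> split(String text, int maxChars, int overlapChars) {
        if (maxChars <= 0 || overlapChars < 0 || overlapChars >= maxChars) {
            throw new IllegalArgumentException("need 0 <= overlapChars < maxChars");
        }
        List<String> out = new ArrayList<>();
        // Step 1: cut before every line that starts with a heading marker.
        for (String section : text.split("(?m)^(?=#{1,6} )")) {
            if (section.length() <= maxChars) {
                addIfNotBlank(out, section);
                continue;
            }
            // Step 2: the section is too large, so cut on blank lines.
            for (String para : section.split("\\n\\s*\\n")) {
                if (para.length() <= maxChars) {
                    addIfNotBlank(out, para);
                    continue;
                }
                // Step 3: last resort, fixed-size windows that share overlapChars.
                int step = maxChars - overlapChars;
                for (int start = 0; start < para.length(); start += step) {
                    int end = Math.min(start + maxChars, para.length());
                    out.add(para.substring(start, end));
                    if (end == para.length()) {
                        break; // the final window reached the end of the paragraph
                    }
                }
            }
        }
        return out;
    }

    private static void addIfNotBlank(List<String> out, String s) {
        String trimmed = s.trim();
        if (!trimmed.isEmpty()) {
            out.add(trimmed);
        }
    }

    public static void main(String[] args) {
        String doc = "# Intro\nShort section.\n\n# Details\n" + "x".repeat(40);
        for (String piece : split(doc, 20, 4)) {
            System.out.println("[" + piece.length() + "] " + piece.replace("\n", "\\n"));
        }
    }
}
```

In this sketch each hard-cut window advances by maxChars minus overlapChars, so the tail of one window is repeated at the head of the next. That is one plausible reading of how the two constructor arguments interact, stated here as an assumption.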
Constructor Details
MarkdownChunker
public MarkdownChunker(int maxChunkChars, int overlapChars)
Method Details
chunk
Chunk the given markdown text.
Parameters:
text - the full markdown content
Returns:
ordered list of chunks with offsets relative to text