Class MarkdownChunker

java.lang.Object
org.apache.tika.inference.MarkdownChunker

public class MarkdownChunker extends Object
Splits Markdown text into chunks that respect structural boundaries.

The chunker first splits the text at Markdown heading boundaries (lines beginning with #). Any section that exceeds the maximum chunk size is split further at paragraph boundaries (double newlines). As a last resort, a paragraph that is still too large is split at the character limit.

Consecutive chunks can overlap by a configurable number of characters to avoid losing context at boundaries.
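
The splitting order described above can be illustrated with a minimal, self-contained sketch. It is not the Tika implementation: the class ChunkSketch, the Piece type, and both regular expressions are hypothetical names and rules chosen for illustration. The sketch also makes simplifying assumptions: it does not merge small paragraphs back together, it only recognizes "\n" line endings, and it applies overlap by prepending trailing characters of the preceding text, so a chunk may grow to maxChars + overlap characters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative sketch only; not the Tika implementation. */
class ChunkSketch {

    /** Hypothetical chunk type: the text plus its [start, end) offsets in the input. */
    static final class Piece {
        final String text;
        final int start;
        final int end;

        Piece(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    // Assumed heading rule: a line starting with one to six '#' followed by a space.
    private static final Pattern HEADING = Pattern.compile("(?m)^#{1,6} ");
    // Zero-width position just after a blank line (two consecutive newlines).
    private static final Pattern PARAGRAPH = Pattern.compile("(?<=\\n\\n)(?!\\n)");

    static List<Piece> chunk(String text, int maxChars, int overlap) {
        if (maxChars <= 0 || overlap < 0) {
            throw new IllegalArgumentException("maxChars must be > 0 and overlap >= 0");
        }
        List<Piece> out = new ArrayList<>();
        List<Integer> sections = cuts(text, 0, text.length(), HEADING);
        for (int i = 0; i + 1 < sections.size(); i++) {
            int s = sections.get(i);
            int e = sections.get(i + 1);
            if (e - s <= maxChars) {
                add(out, text, s, e, overlap); // section fits as a single chunk
                continue;
            }
            // Oversized section: fall back to paragraph boundaries.
            List<Integer> paragraphs = cuts(text, s, e, PARAGRAPH);
            for (int j = 0; j + 1 < paragraphs.size(); j++) {
                int pEnd = paragraphs.get(j + 1);
                // Last resort: hard split at the limit (runs once if the paragraph fits).
                for (int k = paragraphs.get(j); k < pEnd; k += maxChars) {
                    add(out, text, k, Math.min(k + maxChars, pEnd), overlap);
                }
            }
        }
        return out;
    }

    /** Returns from, every match start strictly inside (from, to), then to. */
    private static List<Integer> cuts(String text, int from, int to, Pattern p) {
        List<Integer> cuts = new ArrayList<>();
        cuts.add(from);
        Matcher m = p.matcher(text).region(from, to);
        while (m.find()) {
            if (m.start() > from && m.start() < to) {
                cuts.add(m.start());
            }
        }
        cuts.add(to);
        return cuts;
    }

    /** Appends text[start, end), extended backwards by the overlap except for the first chunk. */
    private static void add(List<Piece> out, String text, int start, int end, int overlap) {
        if (start >= end) {
            return; // skip empty spans
        }
        int from = out.isEmpty() ? start : Math.max(0, start - overlap);
        out.add(new Piece(text.substring(from, end), from, end));
    }
}
```

In this sketch every Piece satisfies text.substring(start, end).equals(piece.text), which is one plausible reading of "offsets relative to text". Calling ChunkSketch.chunk with a 50-character input made of a short "# A" section followed by a "# B" section containing a 25-character paragraph and a second short paragraph, with maxChars = 20 and overlap = 5, yields four pieces covering the spans [0, 11), [6, 31), [26, 42) and [37, 50).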

  • Constructor Details

    • MarkdownChunker

      public MarkdownChunker(int maxChunkChars, int overlapChars)
  • Method Details

    • chunk

      public List<Chunk> chunk(String text)
      Chunks the given Markdown text.
      Parameters:
      text - the full markdown content
      Returns:
      ordered list of chunks with offsets relative to text