org.apache.tika.inference.MarkdownChunker

public class MarkdownChunker extends Object

Splits markdown text into chunks that respect structural boundaries.

The chunker first splits on markdown heading boundaries (# ...). If a section exceeds the maximum chunk size, it is further split on paragraph boundaries (double newlines). As a last resort, oversized paragraphs are split at the character limit.
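The three-tier fallback described above can be sketched as follows. This is a minimal illustration, not the Tika implementation: the class name `TieredSplitSketch`, the two regular expressions, and the choice to emit each paragraph as its own piece are all assumptions made for the example, and overlap is deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only; names and patterns here are hypothetical. */
final class TieredSplitSketch {

    // Assumption: a heading is a line starting with one to six '#' and a space.
    private static final Pattern HEADING = Pattern.compile("(?m)^#{1,6} ");
    // Assumption: a paragraph boundary is a blank line between two newlines.
    private static final Pattern PARAGRAPH_BREAK = Pattern.compile("\\n\\s*\\n");

    /** Splits text into pieces of at most maxChars characters each. */
    static List<String> split(String text, int maxChars) {
        if (maxChars <= 0) {
            throw new IllegalArgumentException("maxChars must be positive");
        }
        List<String> out = new ArrayList<>();
        // Tier 1: cut in front of every heading.
        for (String section : cutAt(text, HEADING, true)) {
            if (section.length() <= maxChars) {
                out.add(section);
                continue;
            }
            // Tier 2: the section is too large, so cut after each blank line.
            for (String paragraph : cutAt(section, PARAGRAPH_BREAK, false)) {
                // Tier 3: hard split at the character limit. A paragraph that
                // already fits goes through this loop exactly once.
                for (int i = 0; i < paragraph.length(); i += maxChars) {
                    int end = Math.min(paragraph.length(), i + maxChars);
                    out.add(paragraph.substring(i, end));
                }
            }
        }
        return out;
    }

    /** Cuts s before (or after) every match of p; the parts concatenate back to s. */
    private static List<String> cutAt(String s, Pattern p, boolean beforeMatch) {
        List<String> parts = new ArrayList<>();
        Matcher m = p.matcher(s);
        int start = 0;
        while (m.find()) {
            int cut = beforeMatch ? m.start() : m.end();
            if (cut > start) {
                parts.add(s.substring(start, cut));
                start = cut;
            }
        }
        if (start < s.length()) {
            parts.add(s.substring(start));
        }
        return parts;
    }
}
```

Because every cut keeps the delimiter with one of the neighbouring pieces, concatenating the pieces reproduces the input exactly, which makes it straightforward to derive offsets into the original text. The sketch does not recognise fenced code blocks, so a `# comment` line inside a fence would be treated as a heading.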

Consecutive chunks can overlap by a configurable number of characters to avoid losing context at boundaries.

Constructor Summary

- MarkdownChunker(int maxChunkChars, int overlapChars)

Method Summary

- List<Chunk> chunk(String text): Chunk the given markdown text.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Details
- MarkdownChunker
  
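  public MarkdownChunker(int maxChunkChars, int overlapChars)
  
  Parameters:
  
  maxChunkChars - the maximum chunk size, in characters
  
  overlapChars - the number of characters by which consecutive chunks overlap

Method Details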
- chunk
  
  public List<Chunk> chunk(String text)
  
  Chunk the given markdown text.
  
  Parameters:
  
  text - the full markdown content
  
  Returns:
  
  ordered list of chunks with offsets relative to text
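
To make "offsets relative to text" concrete, here is a small sketch of how overlapping ranges might be computed over a string. It is an assumption-laden illustration: `OverlapSketch` and its `Span` class are hypothetical stand-ins rather than the real `Chunk` type, and the sketch uses fixed-size windows only, whereas the actual chunker works on structural boundaries first.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only; not the Tika Chunk type or algorithm. */
final class OverlapSketch {

    /** Hypothetical stand-in for a chunk: a half-open [start, end) range into the text. */
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }

        /** Recovers the chunk content from the original text. */
        String of(String text) {
            return text.substring(start, end);
        }
    }

    /** Produces windows of at most maxChars, each sharing overlapChars with its predecessor. */
    static List<Span> windows(int length, int maxChars, int overlapChars) {
        if (maxChars <= 0 || overlapChars < 0 || overlapChars >= maxChars) {
            throw new IllegalArgumentException("need 0 <= overlapChars < maxChars");
        }
        List<Span> spans = new ArrayList<>();
        int start = 0;
        while (start < length) {
            int end = Math.min(length, start + maxChars);
            spans.add(new Span(start, end));
            if (end == length) {
                break; // the last window reached the end of the text
            }
            // Step back by the overlap so the next window repeats the tail of this one.
            start = end - overlapChars;
        }
        return spans;
    }
}
```

Requiring `overlapChars < maxChars` guarantees that each window starts strictly after the previous one, so the loop always terminates. Keeping offsets rather than copied strings lets a caller map any chunk back to its position in the source document.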
