Using Tika as a Library (Java API)
This section covers using Apache Tika programmatically in your Java applications.
Overview
Tika can be embedded directly into your Java applications as a library. This gives you full control over parsing, detection, and configuration.
However, for most use cases we recommend using tika-server or tika-grpc instead. See Getting Started for guidance on choosing the right approach.
Parsers
The org.apache.tika.parser.Parser interface is Tika’s fundamental mechanism for document
processing. It defines a single parse method that accepts an input stream, a content handler,
a metadata object, and a parse context.
Design Principles
Tika’s parser architecture prioritizes:
- Streamed Processing - When possible, documents aren’t held entirely in memory, enabling efficient handling of large files
- Structured Output - Preserves document hierarchy (headings, links, etc.) for relevance assessment
- Input Metadata - Accepts file names and content types to guide parsing decisions
- Output Metadata - Returns extracted metadata like author information alongside content
- Context Sensitivity - Allows fine-grained control through parse context injection
The Parse Method
The parse method accepts four arguments:
void parse(InputStream stream,
           ContentHandler handler,
           Metadata metadata,
           ParseContext context) throws IOException, SAXException, TikaException;
- InputStream - the document content
- ContentHandler - receives XHTML SAX events
- Metadata - bidirectional metadata exchange
- ParseContext - context-specific settings
Output Format
Parsers generate XHTML SAX events structured as:
<html>
  <head>
    <title>...</title>
  </head>
  <body>...</body>
</html>
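To make the event model concrete, the sketch below feeds a hand-written XHTML string through the JDK's SAX parser into a small handler that collects the title and body text separately. The string is a stand-in for what a parser would emit, not real Tika output. The class name XhtmlEventSketch, its methods, and the sample document are hypothetical, and only JDK classes are used, so no Tika API is assumed.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects <title> text and <body> text from XHTML SAX events.
class XhtmlEventSketch extends DefaultHandler {
    private final StringBuilder title = new StringBuilder();
    private final StringBuilder body = new StringBuilder();
    private boolean inTitle;
    private boolean inBody;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // The default JDK SAX parser is not namespace-aware, so qName holds the raw tag name.
        if ("title".equals(qName)) inTitle = true;
        if ("body".equals(qName)) inBody = true;
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("title".equals(qName)) inTitle = false;
        if ("body".equals(qName)) inBody = false;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Character events may arrive in several chunks, so append rather than assign.
        if (inTitle) {
            title.append(ch, start, length);
        } else if (inBody) {
            body.append(ch, start, length);
        }
    }

    String title() { return title.toString().trim(); }

    String body() { return body.toString().trim(); }

    // Parses an XHTML string and returns the populated handler.
    static XhtmlEventSketch collect(String xhtml) {
        XhtmlEventSketch handler = new XhtmlEventSketch();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XHTML", e);
        }
        return handler;
    }

    public static void main(String[] args) {
        String sample = "<html><head><title>Report</title></head>"
                + "<body><p>Hello, Tika.</p></body></html>";
        XhtmlEventSketch result = collect(sample);
        System.out.println(result.title());
        System.out.println(result.body());
    }
}
```

Running main prints the title (Report) and then the body text (Hello, Tika.). A handler passed to a real parser would receive the same kinds of callbacks, which is why any org.xml.sax.ContentHandler implementation can consume parser output.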
AutoDetectParser
The AutoDetectParser class automatically determines document type and selects the
appropriate parser, encapsulating all Tika functionality in a single parser:
try (TikaInputStream stream = TikaInputStream.get(path)) {
    AutoDetectParser parser = new AutoDetectParser();
    BodyContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    ParseContext context = new ParseContext();
    // Detects the document type, selects a parser, and sends XHTML events to the handler
    parser.parse(stream, handler, metadata, context);
    String content = handler.toString();                    // extracted body text
    String title = metadata.get(TikaCoreProperties.TITLE);  // null if no title was found
}
Always use TikaInputStream and pass the original resource directly when possible.
For example, use TikaInputStream.get(path) for a Path, or TikaInputStream.get(bytes)
for a byte[]. This allows Tika to access the underlying resource efficiently and enables
features like mark/reset support that many parsers and detectors require.
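Why mark/reset matters can be seen with plain JDK streams: a detector needs to peek at the leading bytes and then rewind, so that the parser still sees the document from its first byte. The sketch below uses BufferedInputStream as a stand-in. It does not show how TikaInputStream is implemented, and the class and method names are hypothetical.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of peek-then-rewind using only JDK streams.
class MarkResetSketch {

    // Reads up to n leading bytes, then rewinds so the next read starts at byte 0 again.
    static byte[] peek(InputStream in, int n) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(n);                    // remember the current position
        try {
            return in.readNBytes(n);   // requires Java 11 or later
        } finally {
            in.reset();                // rewind to the marked position
        }
    }

    // Peeks at the first n bytes, then reads the whole stream; returns "head|all".
    static String peekThenReadAll(byte[] data, int n) {
        try (InputStream in = new BufferedInputStream(new ByteArrayInputStream(data))) {
            String head = new String(peek(in, n), StandardCharsets.US_ASCII);
            String all = new String(in.readAllBytes(), StandardCharsets.US_ASCII);
            return head + "|" + all;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = "%PDF-1.7 body".getBytes(StandardCharsets.US_ASCII);
        System.out.println(peekThenReadAll(data, 5));
    }
}
```

Because the stream is rewound after the peek, the second read still returns the complete content. A stream without mark/reset support would have lost the peeked bytes.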
Content Handlers
Tika provides several content handlers that control the output format:
- BodyContentHandler - Extracts and converts the body content to streams or strings.
- ToTextContentHandler - Outputs plain text.
- ToHTMLContentHandler - Outputs HTML.
- ToXMLContentHandler - Outputs XHTML/XML.
- ToMarkdownContentHandler - Outputs Markdown, preserving structural semantics like headings, lists, tables, code blocks, emphasis, and links.
- ParsingReader - Uses background threading to return extracted text as character streams.
Use BasicContentHandlerFactory to create handlers by type: TEXT, HTML, XML, BODY, MARKDOWN, IGNORE.
Detectors
The org.apache.tika.detect.Detector interface is the foundation of Tika’s detection system.
All detection approaches implement a shared method:
MediaType detect(TikaInputStream tis, Metadata metadata, ParseContext parseContext)
        throws IOException;
This method examines a TikaInputStream, metadata object, and parse context, returning a
MediaType representing the detected file type.
Detection Types
- Magic Detection - Identifies files by analyzing special byte patterns near the file start using the Freedesktop MIME-info format. Works through MimeTypes and configuration files like tika-mimetypes.xml.
- Name-Based Detection - Uses filename patterns to guess file types via NameDetector. Quick but potentially unreliable if files are renamed.
- Known Content Type - Leverages pre-existing MIME type information (from web servers or repositories) to refine detection.
- Container-Aware Detection - Handles formats stored within containers (OLE2 for .doc/.ppt, ZIP for iWork files). Requires TikaInputStream and the Tika Parsers jar to inspect container contents.
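As a rough illustration of the first two approaches, the sketch below checks two well-known leading byte signatures and falls back to the file extension. It is a deliberately tiny stand-in: Tika's real detection is driven by tika-mimetypes.xml and covers far more types. The class name, the method names, the small extension table, and the fallback order are all assumptions made for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.Map;

// Hypothetical, heavily simplified magic-then-name detection (not Tika's implementation).
class DetectionSketch {
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] ZIP_MAGIC = {'P', 'K', 3, 4};

    // Illustrative extension table; a real registry is far larger.
    private static final Map<String, String> BY_EXTENSION = Map.of(
            "pdf", "application/pdf",
            "zip", "application/zip",
            "txt", "text/plain");

    // True if data begins with the given signature bytes.
    static boolean startsWith(byte[] data, byte[] magic) {
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    // Tries magic bytes first, then the file extension, then a generic fallback.
    static String detect(byte[] head, String fileName) {
        if (startsWith(head, PDF_MAGIC)) {
            return "application/pdf";
        }
        if (startsWith(head, ZIP_MAGIC)) {
            return "application/zip";
        }
        if (fileName != null) {
            int dot = fileName.lastIndexOf('.');
            if (dot >= 0) {
                String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
                String byName = BY_EXTENSION.get(ext);
                if (byName != null) {
                    return byName;
                }
            }
        }
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        byte[] head = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        // A PDF renamed to notes.txt is still reported by its content, not its name.
        System.out.println(detect(head, "notes.txt"));
    }
}
```

In this sketch a renamed PDF is still recognized because the byte signature is checked before the name, which shows why name-only detection can mislead. It also shows the limit of magic bytes: a bare ZIP signature cannot tell an iWork file from a plain archive, which is the gap container-aware detection fills.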
DefaultDetector
DefaultDetector uses service loaders to discover and try all available detectors automatically.
With only Tika Core, it provides magic and name detection. With Tika Parsers included,
container detection becomes available.
Detection Example
TikaConfig tika = new TikaConfig();
ParseContext parseContext = new ParseContext();
for (Path p : myListOfPaths) {
    Metadata metadata = new Metadata();
    try (TikaInputStream stream = TikaInputStream.get(p, metadata)) {
        MediaType mimetype = tika.getDetector().detect(stream, metadata, parseContext);
        System.out.println("File " + p + " is " + mimetype);
    }
}
TikaInputStream.get(path, metadata) automatically sets the resource name in the
metadata, so you don’t need to set it manually.
Topics
- Getting Started - Recommendations and PipesForkParser usage