Apache Tika Documentation

This reference guide was generated with the assistance of AI and requires human review before it can be fully trusted. This documentation serves as an example and a starting point, but more work remains. Contributions and corrections are welcome.

The Apache Tika toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). All of these file types can be parsed through a single interface, making Tika useful for search engine indexing, content analysis, translation, and much more.

Features

  • Unified Interface - Parse all supported file types through a single, consistent API

  • Broad Format Support - Over 1,000 different file types supported

  • Metadata Extraction - Automatically identify and extract document metadata

  • Text Extraction - Pull readable content from complex file formats

  • Content Detection - Identify file types regardless of file extension
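To make the content-detection idea concrete, here is a minimal sketch of identifying a file type from its leading bytes instead of its extension. It is not Tika's implementation or API: the class name `MagicSniffer`, the `detect` method, and the three-entry signature table are all hypothetical and use only the Java standard library. A real detector covers far more formats and inspects more than a fixed prefix.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not part of Apache Tika.
public class MagicSniffer {

    // Tiny, assumed signature table: media type -> leading "magic" bytes.
    private static final Map<String, byte[]> SIGNATURES = new LinkedHashMap<>();
    static {
        SIGNATURES.put("application/pdf",
                "%PDF-".getBytes(StandardCharsets.US_ASCII));
        SIGNATURES.put("image/png",
                new byte[] {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A});
        SIGNATURES.put("application/zip",
                new byte[] {'P', 'K', 0x03, 0x04});
    }

    // Returns the first media type whose signature prefixes the data,
    // or the generic binary type when nothing matches.
    public static String detect(byte[] head) {
        for (Map.Entry<String, byte[]> entry : SIGNATURES.entrySet()) {
            if (startsWith(head, entry.getValue())) {
                return entry.getKey();
            }
        }
        return "application/octet-stream";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The bytes decide the type; no file name is consulted at all.
        byte[] pdfBytes = "%PDF-1.7\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(pdfBytes)); // prints "application/pdf"
    }
}
```

A prefix match alone is coarse. Several document formats are ZIP containers and would all be reported as `application/zip` by this sketch, which is one reason a full toolkit needs deeper inspection than a fixed signature table.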

Getting Started

Start with Using Tika to choose your integration method, then explore the rest of the documentation from there.

Apache Tika is an Apache Software Foundation project, formerly a subproject of Apache Lucene.

Built from commit: c5b9849d9 (2026-04-09)