Configuring Tika

Out of the box, Apache Tika will attempt to start with all available Detectors and Parsers, running with sensible defaults. For most users, this default configuration will work well.

This page gives you information on how to configure the various components of Apache Tika, such as Parsers and Detectors, if you need fine-grained control over ordering, exclusions and the like.

Configuring Parsers

In Tika 1.9, there is some support for configuring Parsers in the Tika Config XML. You can provide a custom list of parsers to use, in a custom order, and you can also force certain mime types to be handled, or not handled, by particular parsers. You can do so with a Tika Config something like:

<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Default Parser for most things, except for 2 mime types -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <mime-exclude>image/jpeg</mime-exclude>
      <mime-exclude>application/pdf</mime-exclude>
    </parser>
    <!-- Use a different parser for PDF -->
    <parser class="org.apache.tika.parser.EmptyParser">
      <mime>application/pdf</mime>
    </parser>
  </parsers>
</properties>

In code, the key classes to use to build up your own custom parser hierarchy are org.apache.tika.parser.DefaultParser, org.apache.tika.parser.CompositeParser and org.apache.tika.parser.ParserDecorator.
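Before handing a Tika Config XML file like the one above to Tika, it can be useful to check what it actually declares. The following is a minimal sketch using only the JDK's XML APIs, not any Tika class. The names ParserConfigSummary and summarize are hypothetical, and the element names (parser, mime, mime-exclude) are taken from the example config above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Tika: lists the <parser> entries of a config.
class ParserConfigSummary {

    /** Returns one line per parser element, in document order. */
    static List<String> summarize(String configXml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(configXml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not read config XML", e);
        }
        List<String> lines = new ArrayList<>();
        NodeList parsers = doc.getElementsByTagName("parser");
        for (int i = 0; i < parsers.getLength(); i++) {
            Element parser = (Element) parsers.item(i);
            lines.add(parser.getAttribute("class")
                    + " mime=" + texts(parser, "mime")
                    + " mime-exclude=" + texts(parser, "mime-exclude"));
        }
        return lines;
    }

    // Collects the trimmed text of every descendant element with the given tag name.
    private static List<String> texts(Element parent, String tag) {
        List<String> values = new ArrayList<>();
        NodeList nodes = parent.getElementsByTagName(tag);
        for (int i = 0; i < nodes.getLength(); i++) {
            values.add(nodes.item(i).getTextContent().trim());
        }
        return values;
    }

    public static void main(String[] args) {
        String xml = "<properties><parsers>"
                + "<parser class=\"org.apache.tika.parser.DefaultParser\">"
                + "<mime-exclude>image/jpeg</mime-exclude>"
                + "<mime-exclude>application/pdf</mime-exclude>"
                + "</parser>"
                + "<parser class=\"org.apache.tika.parser.EmptyParser\">"
                + "<mime>application/pdf</mime>"
                + "</parser>"
                + "</parsers></properties>";
        summarize(xml).forEach(System.out::println);
    }
}
```

This only reports what the file says; how Tika resolves the listed parsers at runtime is decided by Tika itself.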

Configuring Detectors

In Tika 1.9, there is limited support for configuring Detectors in the Tika Config XML. You can provide a custom list of detectors to use, in a custom order, with a Tika Config something like:

<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <detectors>
    <!-- Only use these two detectors, and ignore all others -->
    <detector class="org.apache.tika.parser.pkg.ZipContainerDetector"/>
    <detector class="org.apache.tika.mime.MimeTypes"/>
  </detectors>
</properties>

In code, the key classes to use to build up your own custom detector hierarchy are org.apache.tika.detect.DefaultDetector and org.apache.tika.detect.CompositeDetector.

Configuring Mime Types

TODO Mention non-standard paths, and custom mime type files

Configuring Language Identifiers

At this time, there is no unified way to configure language identifiers. While the work on that is ongoing, for now you will need to review the Tika Javadocs to see how individual identifiers are configured.

Configuring Translators

At this time, there is no unified way to configure Translators. While the work on that is ongoing, for now you will need to review the Tika Javadocs to see how individual Translators are configured.

Using a Tika Configuration XML file

However you call Tika, the System Property of tika.config is checked first, and the Environment Variable of TIKA_CONFIG is tried next. Setting one of those will cause Tika to use your given Tika Config XML file.
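The lookup order described above can be sketched in plain Java. This is an illustration of the precedence only, not Tika's own implementation. The class name TikaConfigLocation and the method resolve are hypothetical, and the properties and environment are passed in as arguments so the behaviour is easy to try out.

```java
import java.util.Map;
import java.util.Optional;
import java.util.Properties;

// Hypothetical helper, not part of Tika: shows which setting would win.
class TikaConfigLocation {

    /**
     * Returns the config location, preferring the tika.config system
     * property and falling back to the TIKA_CONFIG environment variable.
     */
    static Optional<String> resolve(Properties systemProps, Map<String, String> env) {
        String fromProperty = systemProps.getProperty("tika.config");
        if (fromProperty != null) {
            return Optional.of(fromProperty);
        }
        // Only consulted when the system property is not set.
        return Optional.ofNullable(env.get("TIKA_CONFIG"));
    }

    public static void main(String[] args) {
        Optional<String> location = resolve(System.getProperties(), System.getenv());
        System.out.println(location
                .map(path -> "Tika Config XML file: " + path)
                .orElse("Neither tika.config nor TIKA_CONFIG is set"));
    }
}
```

Running the main method with -Dtika.config=/path/to/tika-config.xml on the java command line, or with TIKA_CONFIG exported in the environment, shows which of the two would take effect.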

If you are calling Tika from your own code, then you can pass in the location of your Tika Config XML file when you construct your TikaConfig instance. From that, you can fetch your configured parsers, detectors etc.

TikaConfig config = new TikaConfig("/path/to/tika-config.xml");
Detector detector = config.getDetector();
Parser autoDetectParser = new AutoDetectParser(config);

For users of the Tika App, in addition to the system property and the environment variable, you can also use the --config=[tika-config.xml] option to select a different Tika Config XML file to use.

For users of the Tika Server, in addition to the system property and the environment variable, you can also use the -c [tika-config.xml] or --config [tika-config.xml] options to select a different Tika Config XML file to use.