Apache Tika is an Open Source project built and maintained by a diverse range of contributors. We welcome contributions of all types to the project - code, documentation, testing, bug triage, user support, and more! Send an email to the Tika development list if you're looking for somewhere to help.
To download the source code for the latest release of Apache Tika, please see the Download page.
When reporting an issue, please try to include the details, steps and documents required to reproduce it. If several documents trigger the issue, a small file that we can use in unit testing would be great. A JUnit unit test showing the problem can be helpful, but isn't required.
If you're new to reporting problems, you might find the How to Report Bugs Effectively essay (amongst many others) useful for learning more about what makes an effective and helpful bug report.
The Parser Quick Start Guide provides instructions on adding new mime types and new parsers to Tika.
If your new Parser or Detector depends on libraries which we cannot include in Tika for license reasons, you are encouraged to list it on the 3rd Party Parser Plugins page on the Tika wiki.
All enhancements and fixes should have a JIRA Issue or Enhancement opened for them. This should describe the problem and the proposed fix / new code. The JIRA can be used for discussions on the code, and provides a single identifier for the change.
SVN - If you use SVN, you can run svn diff to generate a patch file of your changes, which can then be attached to the issue. Note that an SVN diff won't normally include new or binary files, so these will need to be attached separately.
Git - Git users can run git diff --no-prefix to generate an SVN-compatible patch, which can then be attached to an issue.
GitHub Pull Requests - If you are working from our GitHub mirror, you can open a pull request for your change. Please include the JIRA Issue number in the pull request, so that the ASF GitHub bot can link it.
ReviewBoard - If you have a Work-In-Progress patch for which you would like feedback / review / assistance, you can use the Apache ReviewBoard Instance to post your code. Please reference the JIRA Issue number in the review request, and add a link to the review request on the JIRA Issue.
Unit tests, License Headers - Wherever possible, we like new functionality and fixes to include small-ish unit tests. Whenever you make changes, please re-run the unit test suite (mvn install is one way to trigger this), and ensure your changes don't break anything. If adding new files, please include the Apache License v2 license header at the top of the file.
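As a rough illustration of the license header requirement, a new Java source file might start as sketched below. The header text is the standard ASF source header. The class WhitespaceNormalizer and its method are hypothetical, invented only so the file has some content, and are not part of Tika.

```java
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

// Hypothetical example class (not part of Tika); it exists only to show
// where the license header sits relative to the code.
final class WhitespaceNormalizer {

    // Trims the text and collapses each run of whitespace into one space.
    static String normalize(String text) {
        return text.trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        System.out.println(normalize("  Apache \t Tika\n")); // prints "Apache Tika"
    }
}
```

The header goes above everything else in the file, including any package declaration and imports.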
Any new dependencies introduced must be under a suitable license. Broadly, they must be Open Source, and must not place restrictions on larger works they are incorporated within. A list of the allowed licenses is maintained by the ASF Legal Affairs Committee. If in doubt, check on the dev list.
All new and updated dependencies must be in Maven Central. (Apache releases cannot depend on additional repositories in their POMs.) If possible, the project producing the dependency should be asked to publish it to Central, such as through the Sonatype OSS Maven Repo. If that isn't possible, someone will need to upload it via the Sonatype 3rd Party OSS Artifacts process. This must be completed before any patches depending on the new library can be committed to Tika.
Java code should be indented with 4 spaces, no tabs. Opening brackets should normally be on the same line as the statement. Java coding standards are normally followed, but if in doubt follow what the existing code does!
Imports should normally be explicit; wildcard (foo.*) imports should not normally be used. The imports should be ordered by java, then javax, then other.
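A minimal sketch of what these formatting rules look like in practice: 4-space indentation with no tabs, opening braces on the same line as the statement, and explicit rather than wildcard imports. The class MimeNameSorter is hypothetical and used purely for illustration; when in doubt, the existing Tika code remains the better guide.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper class (not part of Tika), used only to illustrate
// the formatting conventions.
final class MimeNameSorter {

    // Returns a sorted copy of the given names, skipping null and empty entries.
    static List<String> sortedNonEmpty(List<String> names) {
        List<String> result = new ArrayList<>();
        for (String name : names) {
            if (name != null && !name.isEmpty()) {
                result.add(name);
            }
        }
        Collections.sort(result);
        return result;
    }

    public static void main(String[] args) {
        List<String> input = new ArrayList<>();
        input.add("text/plain");
        input.add("");
        input.add("application/pdf");
        System.out.println(sortedNonEmpty(input)); // prints [application/pdf, text/plain]
    }
}
```

Each of the three java.util classes is imported by name rather than through java.util.*, which keeps it obvious where every type comes from.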
From time to time, you may find that code you are working on doesn't follow these rules. If you find that, please don't submit a single patch with logic changes + formatting together, as those are very hard to review. Instead, please submit two patches, one to correct formatting problems, and a second for your logic changes / fixes.
- The Apache Community Development project (ComDev) provides general advice on getting started with contributing to Apache projects
- The Apache Nutch project provides a comprehensive guide on becoming a Nutch Developer, much of which applies equally to Apache Tika
- The book Tika in Action has a lot of great information on how Tika works, and how to extend it