Apache Tika is an Open Source project built and maintained by a diverse range of contributors. We welcome contributions of all types to the project - code, documentation, testing, bug triage, user support, and more! Send an email to the Tika development list if you're looking for somewhere to help.
To download the source code for the latest release of Apache Tika, please see the Download page.
The master copy of the Apache Tika source code is held in Git. You can clone (checkout) the code from https://git-wip-us.apache.org/repos/asf/tika.git and you can browse it online through the Git web interface.
For those who prefer working on GitHub, we also maintain a GitHub mirror, which you are welcome to fork and open pull requests against.
When reporting an issue, please try to include the details, steps and documents required to reproduce it. If several documents trigger the issue, a small one that we can use in unit testing would be great. A JUnit unit test showing the problem can be helpful, but isn't required.
If you're new to reporting problems, you might find the How to Report Bugs Effectively essay (amongst many others) useful for learning more about what makes an effective and helpful bug report.
The Parser Quick Start Guide provides instructions on adding new MIME types and new parsers to Tika.
If your new Parser or Detector depends on libraries which we cannot include in Tika for license reasons, you are encouraged to list it on the 3rd Party Parser Plugins page on the Tika wiki.
All enhancements and fixes should have a JIRA Issue or Enhancement opened for them. This should describe the problem and the proposed fix / new code. The JIRA issue can be used for discussions on the code, and provides a single identifier for the change.
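Git - Git users can run git diff --no-prefix to generate a patch of their changes, which can then be attached to an issue. Note that new files only appear in the patch once they have been added with git add (use git diff --cached --no-prefix to diff staged changes), and that binary content is only included when the --binary option is given.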
GitHub Pull Requests - If you are working from our GitHub mirror, you can open a pull request for your change. Please include the JIRA Issue number in the pull request, so it can be linked by the ASF GitHub bot.
ReviewBoard - If you have a Work-In-Progress patch for which you would like feedback / review / assistance, you can use the Apache ReviewBoard Instance to post your code. Please reference the JIRA Issue number in the review request, and add a link to the review request on the JIRA Issue.
Unit tests, License Headers - Wherever possible, we like new functionality and fixes to include small-ish unit tests. Whenever you make changes, please re-run the unit test suite (mvn install is one way to trigger this), and ensure your changes don't break anything. If adding new files, please include the Apache License v2 license header at the top of the file.
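As an illustration of the license header requirement, the sketch below shows the standard ASF source header at the top of a new Java file. The class ExampleHelper and its normalize method are hypothetical placeholders and not part of Tika; a real source file would also declare its package directly after the header.

```java
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

// Hypothetical example class: the name and method are placeholders that
// only show where the header sits relative to the code.
final class ExampleHelper {

    private ExampleHelper() {
        // Utility class, not meant to be instantiated.
    }

    /** Returns the trimmed input, or an empty string when the input is null. */
    static String normalize(String value) {
        return value == null ? "" : value.trim();
    }
}
```

The header comes first, before any package or import statements, so that it is the first thing visible when the file is opened.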
Any new dependencies introduced must be under a suitable license. Broadly, they must be Open Source, and must not place restrictions on larger works they are incorporated within. A list of the allowed licenses is maintained by the ASF Legal Affairs Committee. If in doubt, check on the dev list.
All new and updated dependencies must be in Maven Central. (It is not possible for Apache releases to depend on additional repositories in their POMs.) If possible, the project producing the dependency should be asked to publish it to Central, such as through the Sonatype OSS Maven Repo. If that isn't possible, someone will need to upload it via the Sonatype 3rd Party OSS Artifacts process. This will need to be completed before any patches depending on the new library can be committed to Tika.
Java code should be indented with 4 spaces, no tabs. Opening braces should normally be on the same line as the statement. The standard Java coding conventions are normally followed, but if in doubt follow what the existing code does!
Imports should normally be explicit; wildcard (foo.*) imports should not normally be used. The imports should be ordered by javax, then java, then other.
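The formatting and import rules can be sketched in a small file like the one below. It is only an illustration: the class StyleExample and its methods are hypothetical and not part of Tika, and because the sketch uses only JDK classes, the "other" import group is empty.

```java
import javax.xml.XMLConstants;

import java.util.ArrayList;
import java.util.List;

// Third-party and project imports (the "other" group) would follow here,
// each one listed explicitly rather than with a wildcard.

// Hypothetical example class used only to show the layout conventions.
final class StyleExample {

    private StyleExample() {
        // Utility class, not meant to be instantiated.
    }

    // Returns the trimmed, non-empty entries of the input, in their original order.
    static List<String> nonEmpty(List<String> values) {
        List<String> result = new ArrayList<>();
        for (String value : values) {
            // Opening braces stay on the same line as the statement,
            // and each nesting level is indented by 4 spaces.
            if (value != null && !value.trim().isEmpty()) {
                result.add(value.trim());
            }
        }
        return result;
    }

    // Uses the javax import so that every import in the file is needed.
    static String xmlNamespace() {
        return XMLConstants.XML_NS_URI;
    }
}
```

Where the surrounding Tika code is laid out differently, the existing code remains the better guide.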
From time to time, you may find that code you are working on doesn't follow these rules. If you find that, please don't submit a single patch with logic changes + formatting together, as those are very hard to review. Instead, please submit two patches, one to correct formatting problems, and a second for your logic changes / fixes.
- The Apache Community Development project (ComDev) provides general advice on getting started with contributing to Apache projects
- The Apache Nutch project provides a comprehensive guide on becoming a Nutch Developer, much of which applies equally to Apache Tika
- The book Tika in Action has a lot of great information on how Tika works, and how to extend it