Apache Tika

Apache Tika 1.1

The most notable changes in Tika 1.1 over the previous release are:

Link Extraction: The rel attribute is now extracted from links by the LinkContentHandler. (TIKA-824)
MP3: Fixed handling of UTF-16 (two byte) ID3v2 tags (previously the last character in a UTF-16 tag could be corrupted) (TIKA-793)
Performance: Loading of the default media type registry is now significantly faster. (TIKA-780)
PDF: Allow controlling whether overlapping duplicated text should be removed. Disabling this (the default) can give big speedups to text extraction and may work around cases where non-duplicated characters were incorrectly removed (TIKA-767). Allow controlling whether text tokens should be sorted by their x/y position before extracting text (TIKA-612); this is necessary for certain PDFs. Fixed cases where too many </p> tags appear in the XHTML output, causing an NPE when opening some PDFs with the GUI (TIKA-778).
RTF: Fixed case where a font change would result in processing bytes in the wrong font's charset, producing bogus text output (TIKA-777). Don't output whitespace in ignored group states, avoiding excessive whitespace output (TIKA-781). Binary embedded content (using \bin control word) is now skipped correctly; previously it could cause the parser to incorrectly extract binary content as text (TIKA-782).
CLI: New TikaCLI option "--list-detectors", which displays the mimetype detectors that are available, similar to the existing "--list-parsers" option for parsers. (TIKA-785)
Detectors: The order of detectors, as supplied via the service registry loader, is now controlled. User supplied detectors are preferred, then Tika detectors (such as the container aware ones), and finally the core Tika MimeTypes is used as a backup. This allows for specific, detailed detectors to take precedence over the default mime magic + filename detector. (TIKA-786)
Microsoft Project (MPP): Filetype detection has been fixed, and basic metadata (but no text) is now extracted. (TIKA-789)
Outlook: Fixed NullPointerException in TikaGUI when messages with embedded RTF or HTML content were filtered (TIKA-801).
Ogg Vorbis and FLAC: Parsers added for Ogg Vorbis and FLAC audio files, which extract audio metadata and tags (TIKA-747).
MP4: Improved mime magic detection for MP4 based formats (including QuickTime, MP4 Video and Audio, and 3GPP) (TIKA-851).
MP4: Basic metadata extracting parser for MP4 files added, which includes limited audio and video metadata, along with the iTunes media metadata (such as Artist and Title) (TIKA-852).
Document Passwords: A new ParseContext object, PasswordProvider, has been added. This provides a way to supply the password for a document during processing. Currently, only password protected PDFs and Microsoft OOXML files are supported. (TIKA-850)
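
The detector ordering described under "Detectors" (TIKA-786) can be illustrated with a small, self-contained sketch. It does not use Tika's classes: the names DetectorOrderSketch, NamedDetector, Origin and the sample detectors are all hypothetical, and the "first non-null answer wins" rule is a simplifying assumption of the sketch rather than a statement of how Tika combines detector results. It only shows the priority order: user supplied detectors, then Tika detectors, then the core MimeTypes fallback.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.Function;

// Illustrative model only; none of these types are part of Tika's API.
class DetectorOrderSketch {

    // Declaration order doubles as priority: USER is consulted first.
    enum Origin { USER, TIKA, CORE_MIME_TYPES }

    // Hypothetical detector: maps a file name to a media type, or null if unsure.
    static final class NamedDetector {
        final String name;
        final Origin origin;
        final Function<String, String> rule;

        NamedDetector(String name, Origin origin, Function<String, String> rule) {
            this.name = name;
            this.origin = origin;
            this.rule = rule;
        }
    }

    // Returns a copy sorted by origin; the sort is stable, so detectors
    // with the same origin keep their discovery order.
    static List<NamedDetector> ordered(List<NamedDetector> discovered) {
        List<NamedDetector> sorted = new ArrayList<>(discovered);
        sorted.sort(Comparator.comparing((NamedDetector d) -> d.origin));
        return sorted;
    }

    // Assumption of this sketch: the first detector that answers wins.
    static String detect(List<NamedDetector> discovered, String fileName) {
        for (NamedDetector d : ordered(discovered)) {
            String type = d.rule.apply(fileName);
            if (type != null) {
                return type;
            }
        }
        return "application/octet-stream";
    }

    // Detectors listed in an arbitrary "discovery" order, fallback first.
    static List<NamedDetector> sampleDetectors() {
        List<NamedDetector> list = new ArrayList<>();
        list.add(new NamedDetector("core-mime-types", Origin.CORE_MIME_TYPES,
                f -> f.endsWith(".docx") ? "application/zip"
                   : f.endsWith(".txt") ? "text/plain" : null));
        list.add(new NamedDetector("container-aware", Origin.TIKA,
                f -> f.endsWith(".docx")
                   ? "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
                   : null));
        list.add(new NamedDetector("user-acme", Origin.USER,
                f -> f.endsWith(".acme") ? "application/x-acme" : null));
        return list;
    }

    public static void main(String[] args) {
        List<NamedDetector> detectors = sampleDetectors();
        for (String file : new String[] {"report.docx", "notes.txt", "data.acme"}) {
            System.out.println(file + " -> " + detect(detectors, file));
        }
    }
}
```

In this model the more specific container-aware detector answers for report.docx before the generic fallback is consulted, which is the effect the ordering change is meant to achieve.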

The following people have contributed to Tika 1.1 by submitting or commenting on the issues resolved in this release:

Alex Ott
Alexander Chow
Ali Oral
Andrzej Bialecki
Antoni Mylka
Arjohn Kampman
Bastian Mathes
Chris A. Mattmann
Craig Stires
David Tran
Etienne Jouvin
Fabian Lange
Geoff Jarrad
Jan Høydahl
Jerome Lacoste
John Mastarone
Jukka Zitting
Julien Nioche
Ken Krugler
Lau Brino
Markus Jelsma
Maxim Valyanskiy
Michael McCandless
Nick Burch
Pablo Queixalos
Paul Hill
Paul Pearcy
peter royal
PNS
Radek
Ray Gauss II
Stephan Mühlstrasser
Swapna Vuppala
Torsten Krah
William Seemann
Yegor Kozlov

See http://s.apache.org/Jn4 for more details on these contributions.
