Apache Tika 1.13

The most notable changes in Tika 1.13 over the previous release are:

  • Major changes to the PDFParser, including an upgrade to PDFBox 2.0.1 (TIKA-1285, TIKA-1959). As a result, the classic sequential parser is no longer available, TIFF files are no longer extracted by default, and some truncated or corrupted files that had some content extracted with PDFBox 1.8.x may have no content extracted with 2.0.x.
  • The MIT-NLP Information Extraction (MITIE) Named Entity Recognition (NER) system is now supported in Tika (TIKA-1913, Github-108).
  • Tika now supports the use of the Yandex translation service (TIKA-1943, Github-106).
  • Tika now uses NER to extract scientific measurements from text using either GROBID Quantities, which uses conditional random fields, or NLTK, which uses regular expressions (TIKA-1917, Github-104).
  • Fixed JournalParser to handle null responses from GROBID and to log a message (TIKA-1925).
  • Refactored Language Detector into the tika-langdetect module and added a default N-Gram implementation, an Optimaize LangDetector implementation, and an MIT Text.jl implementation (TIKA-1872, TIKA-1696, TIKA-1723).
  • Extract metadata from MP4 videos whether or not the PooledTimeSeries parser is available, via Aditya Dhulipala (TIKA-1844).
  • Fix NPE when trying to get embedded image identifier in WordParser (TIKA-1956).
  • Improvements to MIME database for detection of Scientific and other formats present in the TREC-DD-Polar dataset (TIKA-1881, Github-85, TIKA-1883, TIKA-1884, TIKA-1886, TIKA-1882).
  • LinkContentHandler now extracts links from script tags via Joseph Naegele (TIKA-1937).
  • Handle per page IOExceptions more robustly in PDFParser (TIKA-1948).
  • Upgrade commons-compress to 1.11 (TIKA-1949).
  • Add detection for embedded MSChart.Graph files (TIKA-1033).
  • Fix NPE in Sqlite parser from Nick C (TIKA-1927).
  • Fix NPE in Open Document parser from Nick C (TIKA-1916).
  • Upgrade mp4parser's isoparser to 1.1.7 (TIKA-1924 and TIKA-1931).
  • Upgrade BouncyCastle to 1.54 (TIKA-1923).
  • Upgrade Jackcess to 2.1.3 (TIKA-1922).
  • Upgrade Drew Noakes' metadata-extractor to 2.8.1 (TIKA-1921).
  • Upgrade Gson in tika-serialization to 2.6.2 (TIKA-1920).
  • Upgrade commons-cli in tika-batch to 1.3.1 (TIKA-1919).
  • Add XMPMM support to PDFParser and JpegParser via Jempbox (TIKA-1894).
  • Move serialization of TikaConfig to tika-core and enable dumping of the config file via tika-app (TIKA-1657).
  • Tika now incorporates the Natural Language Toolkit (NLTK) from the Python community as an option for Named Entity Recognition (TIKA-1876).
  • Add support for XFA extraction via Pascal Essiembre (TIKA-1857).
  • Upgrade to sqlite-jdbc 3.8.11.2 (TIKA-1861). NOTE: this dependency is still <scope>provided</scope>. You need to include this dependency in order to parse sqlite files.
  • Upgrade to POI 3.15-beta1 (TIKA-1895).
  • Upgrade to Jackson 2.7.1 (TIKA-1869).
  • Upgrade to Apache SIS 0.6 (TIKA-1878).
  • RichTextContentHandler moved from the Server package to Core (TIKA-1870).
  • Added ZeroSizeFileDetector to support application/x-zerovalue via Adesh Gupta (TIKA-1885).
  • Addition of types information to Grobid quantities parser via Can Menekse (TIKA-1965).
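
    The sqlite-jdbc entry above notes that the dependency is provided-scoped, so a project that wants to parse sqlite files has to declare it itself. A minimal sketch of such a declaration, assuming a Maven build and the org.xerial coordinates under which sqlite-jdbc is published on Maven Central (adjust for other build tools):

```xml
<!-- Assumption: a Maven project that already depends on Tika's parsers. -->
<!-- Declares sqlite-jdbc explicitly, since Tika marks it as provided. -->
<dependency>
  <groupId>org.xerial</groupId>
  <artifactId>sqlite-jdbc</artifactId>
  <version>3.8.11.2</version>
</dependency>
```

    The version shown matches the one named in this release; a different version may work but is not covered by these notes.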

    The following people have contributed to Tika 1.13 by submitting or commenting on the issues resolved in this release:

    • Adesh Gupta
    • Ajay Kumar Loganathan Ravichandran
    • Alessandro De Angelis
    • Avinash
    • Ayesha Hasan
    • Can Menekse
    • Chris A. Mattmann
    • Dave Meikle
    • Franco Catto
    • Hendy Irawan
    • Ian Williams
    • Jeremy Anderson
    • John Patrick
    • Jorge Spinsanti
    • Joseph Naegele
    • Ken Krugler
    • Lewis John McGibbney
    • Luca Moretti
    • Manali Shah
    • Manisha Kampasi
    • Mark Duske
    • Namitha Sanjeeva Ganiga
    • Nandan Chandrashekar
    • Nick Burch
    • Nick C
    • Pascal Essiembre
    • Paul Ramirez
    • Prasad Nagaraj Subramanya
    • Ramit Wadhwa
    • Ray Gauss II
    • Sergey Beryozkin
    • Shawn Johnson
    • Steffen Netz
    • Suman Kashyap
    • Thamme Gowda N
    • Tim Allison
    • Trevor Lewis
    • Yash Tanna
    • aoeu
    • kostali

    See https://s.apache.org/BaRc for more details on these contributions.