Class XmlToJsonConfigConverter

java.lang.Object
org.apache.tika.cli.XmlToJsonConfigConverter

public class XmlToJsonConfigConverter extends Object
Converts legacy XML Tika configuration files to the new JSON format.

Currently converts only the "parsers" section of tika-config.xml files, and only for parsers in the tika-parsers-standard module.

Supported parameter types: bool, int, long, double, float, string, list, and map.
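The scalar types above map naturally onto JSON booleans, numbers, and strings. The following is a minimal sketch of that mapping, not the converter's actual code: the class name ParamValueSketch and the method coerce are hypothetical, and trimming the element text before parsing is an assumption of this sketch.

```java
final class ParamValueSketch {

    /**
     * Converts the text of a &lt;param&gt; element into a Java value based on
     * its "type" attribute. Only scalar types are handled here; list and
     * map params have child elements and need separate handling.
     */
    static Object coerce(String type, String text) {
        // Assumption: surrounding whitespace in the XML is not significant.
        String value = text.trim();
        switch (type) {
            case "bool":
                return Boolean.parseBoolean(value);
            case "int":
                return Integer.parseInt(value);
            case "long":
                return Long.parseLong(value);
            case "double":
                return Double.parseDouble(value);
            case "float":
                return Float.parseFloat(value);
            case "string":
                return value;
            default:
                throw new IllegalArgumentException("Unsupported scalar type: " + type);
        }
    }

    public static void main(String[] args) {
        System.out.println(coerce("bool", "true"));
        System.out.println(coerce("int", " 1000 "));
    }
}
```

Returning typed values rather than strings is what lets a JSON writer emit `true` and `1000` unquoted, as in the JSON sample further down.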

Special case: TesseractOCR's otherTesseractSettings list, whose entries are space-delimited key-value pairs, is automatically converted to the otherTesseractConfig map expected by the JSON configuration.
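To illustrate the list-to-map conversion described above, here is a minimal sketch under stated assumptions. The class TesseractSettingsSketch and the method toConfigMap are hypothetical names, and the choice to split on the first run of whitespace and to reject entries without a value is an assumption, not documented behavior of the converter.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class TesseractSettingsSketch {

    /**
     * Turns entries such as "textord_noise_hfract 0.15625" into a map of
     * key to value. Insertion order is preserved so the JSON output keeps
     * the order of the XML list.
     */
    static Map<String, String> toConfigMap(List<String> settings) {
        Map<String, String> config = new LinkedHashMap<>();
        for (String entry : settings) {
            // Split into at most two parts: the key, then the rest as the value.
            String[] parts = entry.trim().split("\\s+", 2);
            if (parts.length != 2) {
                throw new IllegalArgumentException(
                        "Expected 'key value' but got: '" + entry + "'");
            }
            config.put(parts[0], parts[1]);
        }
        return config;
    }

    public static void main(String[] args) {
        System.out.println(toConfigMap(List.of(
                "textord_initialx_ile 0.75",
                "textord_noise_hfract 0.15625")));
    }
}
```

Values stay as strings in this sketch, matching the quoted values shown for otherTesseractConfig in the JSON sample.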

Example usage:

 XmlToJsonConfigConverter.convert(
     Paths.get("tika-config.xml"),
     Paths.get("tika-config.json")
 );
 

XML Format (with various parameter types):

 <properties>
   <parsers>
     <parser class="org.apache.tika.parser.pdf.PDFParser">
       <params>
         <param name="sortByPosition" type="bool">true</param>
         <param name="maxPages" type="int">1000</param>
       </params>
     </parser>
     <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
       <params>
         <!-- Special case: space-delimited key-value pairs -->
         <param name="otherTesseractSettings" type="list">
           <string>textord_initialx_ile 0.75</string>
           <string>textord_noise_hfract 0.15625</string>
         </param>
         <param name="envVars" type="map">
           <TESSDATA_PREFIX>/usr/share/tesseract</TESSDATA_PREFIX>
         </param>
       </params>
     </parser>
     <parser class="org.apache.tika.parser.DefaultParser">
       <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
     </parser>
   </parsers>
 </properties>
 

JSON Format:

 {
   "parsers": [
     {
       "pdf-parser": {
         "sortByPosition": true,
         "maxPages": 1000
       }
     },
     {
       "tesseract-ocr-parser": {
         "otherTesseractConfig": {
           "textord_initialx_ile": "0.75",
           "textord_noise_hfract": "0.15625"
         },
         "envVars": {
           "TESSDATA_PREFIX": "/usr/share/tesseract"
         }
       }
     },
     {
       "default-parser": {
         "_exclude": ["pdf-parser"]
       }
     }
   ]
 }
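In the samples above, each fully qualified parser class appears in the JSON under a kebab-case name (PDFParser becomes pdf-parser, TesseractOCRParser becomes tesseract-ocr-parser). The sketch below reproduces that naming pattern with a simple regex heuristic. It is only an illustration of the pattern visible in the samples: how the converter actually resolves component names is not stated here, and KebabNameSketch and toKebabCase are hypothetical names.

```java
import java.util.Locale;

final class KebabNameSketch {

    /** Derives a kebab-case name from a fully qualified or simple class name. */
    static String toKebabCase(String className) {
        // Keep only the simple name; lastIndexOf returns -1 when there is no package.
        String simple = className.substring(className.lastIndexOf('.') + 1);
        return simple
                // Boundary between a lowercase letter or digit and an uppercase letter.
                .replaceAll("([a-z0-9])([A-Z])", "$1-$2")
                // Boundary between an acronym and a following capitalized word.
                .replaceAll("([A-Z]+)([A-Z][a-z])", "$1-$2")
                .toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(toKebabCase("org.apache.tika.parser.pdf.PDFParser"));
        System.out.println(toKebabCase("org.apache.tika.parser.ocr.TesseractOCRParser"));
        System.out.println(toKebabCase("org.apache.tika.parser.DefaultParser"));
    }
}
```

A heuristic like this breaks down for names that do not follow the convention, so a lookup against registered component names would be the more robust design.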
 
  • Method Details