org.apache.tika.cli.XmlToJsonConfigConverter

public class XmlToJsonConfigConverter extends Object

Converts legacy XML Tika configuration files to the new JSON format.

Currently supports converting the "parsers" section of tika-config.xml files for parsers in the tika-parsers-standard module.

Supports parameter types: bool, int, long, double, float, string, list, and map.
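
As an illustration of the scalar parameter types listed above, here is a minimal sketch of how a `type` attribute and its text content could be turned into a typed value before being written as JSON. The class `ParamTypeSketch` and its `coerce` method are hypothetical names introduced for this example; they are not part of Tika, and the converter's real handling may differ. The structural types `list` and `map` are not covered here.

```java
// Hypothetical illustration only: not part of Tika's API.
class ParamTypeSketch {

    // Maps a legacy <param type="..."> scalar to a typed Java value.
    // Assumption: numeric and boolean text is trimmed before parsing.
    static Object coerce(String type, String text) {
        switch (type) {
            case "bool":
                return Boolean.parseBoolean(text.trim());
            case "int":
                return Integer.parseInt(text.trim());
            case "long":
                return Long.parseLong(text.trim());
            case "double":
                return Double.parseDouble(text.trim());
            case "float":
                return Float.parseFloat(text.trim());
            case "string":
                return text; // strings are passed through unchanged
            default:
                throw new IllegalArgumentException("Unsupported scalar type: " + type);
        }
    }

    public static void main(String[] args) {
        // Values taken from the PDFParser example on this page.
        System.out.println(coerce("bool", "true"));
        System.out.println(coerce("int", "1000"));
    }
}
```

Keeping the typed value (rather than the raw string) is what lets a JSON writer emit `true` and `1000` without quotes.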

Special Case: TesseractOCR's otherTesseractSettings list (containing space-delimited key-value pairs) is automatically converted to the otherTesseractConfig map format expected by the JSON configuration.
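
The special case above can be sketched as follows. This is a hedged illustration of the list-to-map idea, assuming each entry is split on the first run of whitespace; `TesseractSettingsSketch` and `toMap` are hypothetical names, not the converter's actual code, and the real converter may treat malformed entries differently.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: not part of Tika's API.
class TesseractSettingsSketch {

    // Converts entries such as "textord_noise_hfract 0.15625" into key/value pairs.
    // Insertion order is preserved so the output mirrors the order of the XML list.
    static Map<String, String> toMap(List<String> settings) {
        Map<String, String> config = new LinkedHashMap<>();
        for (String entry : settings) {
            String trimmed = entry.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank entries
            }
            // Split into at most two parts: the key and the rest of the line.
            String[] parts = trimmed.split("\\s+", 2);
            if (parts.length != 2) {
                throw new IllegalArgumentException("Expected \"key value\" but got: " + entry);
            }
            config.put(parts[0], parts[1]);
        }
        return config;
    }

    public static void main(String[] args) {
        List<String> legacy = List.of(
                "textord_initialx_ile 0.75",
                "textord_noise_hfract 0.15625");
        System.out.println(toMap(legacy));
    }
}
```

Values stay as strings in this sketch, which matches the quoted values shown under `otherTesseractConfig` in the JSON example on this page.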

Example usage:

 XmlToJsonConfigConverter.convert(
     Paths.get("tika-config.xml"),
     Paths.get("tika-config.json")
 );

XML Format (with various parameter types):

 <properties>
   <parsers>
     <parser class="org.apache.tika.parser.pdf.PDFParser">
       <params>
         <param name="sortByPosition" type="bool">true</param>
         <param name="maxPages" type="int">1000</param>
       </params>
     </parser>
     <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
       <params>
         <!-- Special case: space-delimited key-value pairs -->
         <param name="otherTesseractSettings" type="list">
           <string>textord_initialx_ile 0.75</string>
           <string>textord_noise_hfract 0.15625</string>
         </param>
         <param name="envVars" type="map">
           <TESSDATA_PREFIX>/usr/share/tesseract</TESSDATA_PREFIX>
         </param>
       </params>
     </parser>
     <parser class="org.apache.tika.parser.DefaultParser">
       <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
     </parser>
   </parsers>
 </properties>

JSON Format:

 {
   "parsers": [
     {
       "pdf-parser": {
         "sortByPosition": true,
         "maxPages": 1000
       }
     },
     {
       "tesseract-ocr-parser": {
         "otherTesseractConfig": {
           "textord_initialx_ile": "0.75",
           "textord_noise_hfract": "0.15625"
         },
         "envVars": {
           "TESSDATA_PREFIX": "/usr/share/tesseract"
         }
       }
     },
     {
       "default-parser": {
         "_exclude": ["pdf-parser"]
       }
     }
   ]
 }
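
The component names in the JSON above (`pdf-parser`, `tesseract-ocr-parser`, `default-parser`) are consistent with a kebab-case form of each parser's simple class name. The sketch below only illustrates that naming pattern. The converter itself resolves names through a component registry (see the `classLoader` parameter), so this derivation is an assumption made for the example, and `ComponentNameSketch` and `toKebabCase` are hypothetical names.

```java
import java.util.Locale;

// Hypothetical illustration only: the real converter looks names up in a component registry.
class ComponentNameSketch {

    // Turns a (possibly fully qualified) class name into a kebab-case name.
    static String toKebabCase(String className) {
        String simple = className.substring(className.lastIndexOf('.') + 1);
        return simple
                // Break after an acronym: "PDFParser" -> "PDF-Parser"
                .replaceAll("([A-Z]+)([A-Z][a-z])", "$1-$2")
                // Break at a lower-to-upper boundary: "DefaultParser" -> "Default-Parser"
                .replaceAll("([a-z0-9])([A-Z])", "$1-$2")
                .toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(toKebabCase("org.apache.tika.parser.pdf.PDFParser"));
        System.out.println(toKebabCase("org.apache.tika.parser.ocr.TesseractOCRParser"));
        System.out.println(toKebabCase("org.apache.tika.parser.DefaultParser"));
    }
}
```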

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| static void | convert(InputStream xmlInput, OutputStream jsonOutput) | Converts an XML Tika configuration stream to JSON format. |
| static void | convert(InputStream xmlInput, OutputStream jsonOutput, ClassLoader classLoader) | Converts an XML Tika configuration stream to JSON format. |
| static void | convert(Path xmlPath, Path jsonPath) | Converts an XML Tika configuration file to JSON format. |

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Method Details
- convert
  
  public static void convert(Path xmlPath, Path jsonPath) throws TikaConfigException, IOException
  
  Converts an XML Tika configuration file to JSON format.
  
  Parameters:
  
  xmlPath - path to the XML configuration file
  
  jsonPath - path where the JSON output should be written
  
  Throws:
  
  TikaConfigException - if conversion fails
  
  IOException - if file I/O fails
- convert
  
  public static void convert(InputStream xmlInput, OutputStream jsonOutput) throws TikaConfigException, IOException
  
  Converts an XML Tika configuration stream to JSON format.
  
  Parameters:
  
  xmlInput - input stream containing XML configuration
  
  jsonOutput - output stream where JSON will be written
  
  Throws:
  
  TikaConfigException - if conversion fails
  
  IOException - if stream I/O fails
- convert
  
  public static void convert(InputStream xmlInput, OutputStream jsonOutput, ClassLoader classLoader) throws TikaConfigException, IOException
  
  Converts an XML Tika configuration stream to JSON format, using the given class loader for the component registry.
  
  Parameters:
  
  xmlInput - input stream containing XML configuration
  
  jsonOutput - output stream where JSON will be written
  
  classLoader - class loader to use for component registry
  
  Throws:
  
  TikaConfigException - if conversion fails
  
  IOException - if stream I/O fails
