Package org.apache.tika.cli
Class XmlToJsonConfigConverter
java.lang.Object
org.apache.tika.cli.XmlToJsonConfigConverter
Converts legacy XML Tika configuration files to the new JSON format.
Currently supports converting the "parsers" section of tika-config.xml files for parsers in the tika-parsers-standard module.
Supports parameter types: bool, int, long, double, float, string, list, and map.
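The scalar types in this list map naturally onto Java values. The following is a minimal sketch of one way such a mapping could look, not the converter's actual code: the class `ParamValueSketch` and the method `parseScalar` are hypothetical names introduced here for illustration, and the `list` and `map` types are deliberately left out.

```java
public final class ParamValueSketch {

    // Hypothetical helper: turns the text of a <param> element into a typed
    // Java value based on its "type" attribute. Only scalar types are handled.
    public static Object parseScalar(String type, String text) {
        String value = text.trim();
        switch (type) {
            case "bool":
                return Boolean.parseBoolean(value);
            case "int":
                return Integer.parseInt(value);
            case "long":
                return Long.parseLong(value);
            case "double":
                return Double.parseDouble(value);
            case "float":
                return Float.parseFloat(value);
            case "string":
                return value;
            default:
                throw new IllegalArgumentException("Unsupported scalar type: " + type);
        }
    }

    public static void main(String[] args) {
        // Values taken from the PDFParser example in the XML format section.
        System.out.println(parseScalar("bool", "true"));
        System.out.println(parseScalar("int", "1000"));
    }
}
```

Returning typed values rather than strings is what would let a JSON writer emit `true` and `1000` unquoted, as in the JSON example.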
Special Case: TesseractOCR's otherTesseractSettings list
(containing space-delimited key-value pairs) is automatically converted to the
otherTesseractConfig map format expected by the JSON configuration.
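The special case described above amounts to splitting each list entry into a key and a value. Here is a rough sketch of that idea, assuming each entry is split at its first run of whitespace; the class `TesseractSettingsSketch` and the method `toConfigMap` are hypothetical and are not part of the Tika API.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public final class TesseractSettingsSketch {

    // Hypothetical helper: converts entries such as "textord_noise_hfract 0.15625"
    // into map entries, keeping the values as strings and preserving input order.
    public static Map<String, String> toConfigMap(List<String> settings) {
        Map<String, String> config = new LinkedHashMap<>();
        for (String entry : settings) {
            // Split at the first run of whitespace only, so the value may contain spaces.
            String[] parts = entry.trim().split("\\s+", 2);
            if (parts.length != 2) {
                throw new IllegalArgumentException("Expected 'key value' but got: " + entry);
            }
            config.put(parts[0], parts[1]);
        }
        return config;
    }

    public static void main(String[] args) {
        List<String> settings = Arrays.asList(
                "textord_initialx_ile 0.75",
                "textord_noise_hfract 0.15625");
        System.out.println(toConfigMap(settings));
    }
}
```

Values stay strings here because the JSON example shows them quoted (`"0.75"`, `"0.15625"`).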
Example usage:
XmlToJsonConfigConverter.convert(
    Paths.get("tika-config.xml"),
    Paths.get("tika-config.json")
);
XML Format (with various parameter types):
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="sortByPosition" type="bool">true</param>
        <param name="maxPages" type="int">1000</param>
      </params>
    </parser>
    <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
      <params>
        <!-- Special case: space-delimited key-value pairs -->
        <param name="otherTesseractSettings" type="list">
          <string>textord_initialx_ile 0.75</string>
          <string>textord_noise_hfract 0.15625</string>
        </param>
        <param name="envVars" type="map">
          <TESSDATA_PREFIX>/usr/share/tesseract</TESSDATA_PREFIX>
        </param>
      </params>
    </parser>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
  </parsers>
</properties>
JSON Format:
{
  "parsers": [
    {
      "pdf-parser": {
        "sortByPosition": true,
        "maxPages": 1000
      }
    },
    {
      "tesseract-ocr-parser": {
        "otherTesseractConfig": {
          "textord_initialx_ile": "0.75",
          "textord_noise_hfract": "0.15625"
        },
        "envVars": {
          "TESSDATA_PREFIX": "/usr/share/tesseract"
        }
      }
    },
    {
      "default-parser": {
        "_exclude": ["pdf-parser"]
      }
    }
  ]
}
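In the two examples above, each fully qualified parser class appears in the JSON under a kebab-case name (`PDFParser` becomes `pdf-parser`, `TesseractOCRParser` becomes `tesseract-ocr-parser`). The method details mention a component registry, so the converter presumably looks these names up rather than deriving them. The sketch below only illustrates the naming pattern visible in the examples; `ComponentNameSketch` and `toKebabCase` are hypothetical names, and the regex-based derivation is an assumption, not Tika's method.

```java
public final class ComponentNameSketch {

    // Hypothetical helper: derives a kebab-case name from a class name.
    // Assumption: this mirrors the pattern in the examples only; the real
    // converter may resolve names differently (for example via a registry).
    public static String toKebabCase(String className) {
        // Drop the package prefix, if any.
        String simple = className.substring(className.lastIndexOf('.') + 1);
        return simple
                // Break between an acronym and the capitalized word after it: "OCRParser" -> "OCR-Parser".
                .replaceAll("([A-Z]+)([A-Z][a-z])", "$1-$2")
                // Break between a lowercase letter or digit and an uppercase letter: "tO" -> "t-O".
                .replaceAll("([a-z0-9])([A-Z])", "$1-$2")
                .toLowerCase();
    }

    public static void main(String[] args) {
        System.out.println(toKebabCase("org.apache.tika.parser.pdf.PDFParser"));
        System.out.println(toKebabCase("org.apache.tika.parser.ocr.TesseractOCRParser"));
        System.out.println(toKebabCase("org.apache.tika.parser.DefaultParser"));
    }
}
```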
Method Summary
Modifier and Type | Method | Description
static void | convert(InputStream xmlInput, OutputStream jsonOutput) | Converts an XML Tika configuration stream to JSON format.
static void | convert(InputStream xmlInput, OutputStream jsonOutput, ClassLoader classLoader) | Converts an XML Tika configuration stream to JSON format.
static void | convert(Path xmlPath, Path jsonPath) | Converts an XML Tika configuration file to JSON format.
Method Details
convert
public static void convert(Path xmlPath, Path jsonPath) throws TikaConfigException, IOException
Converts an XML Tika configuration file to JSON format.
Parameters:
xmlPath - path to the XML configuration file
jsonPath - path where the JSON output should be written
Throws:
TikaConfigException - if conversion fails
IOException - if file I/O fails
convert
public static void convert(InputStream xmlInput, OutputStream jsonOutput) throws TikaConfigException, IOException
Converts an XML Tika configuration stream to JSON format.
Parameters:
xmlInput - input stream containing XML configuration
jsonOutput - output stream where JSON will be written
Throws:
TikaConfigException - if conversion fails
IOException - if stream I/O fails
convert
public static void convert(InputStream xmlInput, OutputStream jsonOutput, ClassLoader classLoader) throws TikaConfigException, IOException
Converts an XML Tika configuration stream to JSON format.
Parameters:
xmlInput - input stream containing XML configuration
jsonOutput - output stream where JSON will be written
classLoader - class loader to use for component registry
Throws:
TikaConfigException - if conversion fails
IOException - if stream I/O fails