Class TikaResource
-
Field Summary
Fields -
Constructor Summary
Constructors -
Method Summary
Modifier and Type   Method   Description
static ParseContext createParseContext()
    Creates a new ParseContext with defaults loaded from tika-config.
static Parser createParser()
static String detectFilename(jakarta.ws.rs.core.MultivaluedMap<String, String> httpHeaders)
static void fillMetadata(Parser parser, Metadata metadata, jakarta.ws.rs.core.MultivaluedMap<String, String> httpHeaders)
jakarta.ws.rs.core.StreamingOutput getHtml(InputStream is, jakarta.ws.rs.core.HttpHeaders httpHeaders)
    Parse document and return HTML content.
Metadata getJson(InputStream is, jakarta.ws.rs.core.HttpHeaders httpHeaders, String handlerTypeName)
    Parse document and return JSON with metadata and specified content type.
Metadata getJsonDefault(InputStream is, jakarta.ws.rs.core.HttpHeaders httpHeaders)
    Parse document and return JSON with metadata and text content.
jakarta.ws.rs.core.StreamingOutput getMarkdown(InputStream is, jakarta.ws.rs.core.HttpHeaders httpHeaders)
    Parse document and return Markdown content.
static PipesParsingHelper getPipesParsingHelper()
    Gets the PipesParsingHelper instance.
jakarta.ws.rs.core.StreamingOutput getText(InputStream is, jakarta.ws.rs.core.HttpHeaders httpHeaders)
    Parse document and return plain text content.
static boolean getThrowOnWriteLimitReached(jakarta.ws.rs.core.MultivaluedMap<String, String> httpHeaders)
static TikaLoader getTikaLoader()
static int getWriteLimit(jakarta.ws.rs.core.MultivaluedMap<String, String> httpHeaders)
    Parses the writeLimit header value from HTTP headers.
jakarta.ws.rs.core.StreamingOutput getXhtml(InputStream is, jakarta.ws.rs.core.HttpHeaders httpHeaders)
    Parse document and return XHTML content.
jakarta.ws.rs.core.StreamingOutput getXml(InputStream is, jakarta.ws.rs.core.HttpHeaders httpHeaders)
    Parse document and return XML content.
static void init(TikaLoader tikaLoader, ServerStatus serverStatus, PipesParsingHelper pipesParsingHelper)
    Initialize TikaResource with pipes-based parsing for process isolation.
static void logRequest(org.slf4j.Logger logger, String endpoint, Metadata metadata)
static void mergeParseContextFromConfig(String configJson, ParseContext context)
    Parses config JSON and merges parseContext entries into the provided ParseContext.
static void parse(Parser parser, org.slf4j.Logger logger, String path, TikaInputStream inputStream, ContentHandler handler, Metadata metadata, ParseContext parseContext)
    Use this to call a parser and unify exception handling.
static List<Metadata> parseWithPipes(TikaInputStream tis, Metadata metadata, ParseContext parseContext, ParseMode parseMode)
    Parses using pipes-based parsing with process isolation.
jakarta.ws.rs.core.StreamingOutput postHtml(List<org.apache.cxf.jaxrs.ext.multipart.Attachment> attachments, jakarta.ws.rs.core.HttpHeaders httpHeaders)
    Parse multipart document with optional config, return HTML.
Metadata postJson(List<org.apache.cxf.jaxrs.ext.multipart.Attachment> attachments, jakarta.ws.rs.core.HttpHeaders httpHeaders)
    Parse multipart document with optional config, return JSON.
jakarta.ws.rs.core.StreamingOutput postMarkdown(List<org.apache.cxf.jaxrs.ext.multipart.Attachment> attachments, jakarta.ws.rs.core.HttpHeaders httpHeaders)
    Parse multipart document with optional config, return Markdown.
jakarta.ws.rs.core.StreamingOutput postRaw(List<org.apache.cxf.jaxrs.ext.multipart.Attachment> attachments, jakarta.ws.rs.core.HttpHeaders httpHeaders)
    Parse multipart document with optional config, return XHTML output.
jakarta.ws.rs.core.StreamingOutput postText(List<org.apache.cxf.jaxrs.ext.multipart.Attachment> attachments, jakarta.ws.rs.core.HttpHeaders httpHeaders)
    Parse multipart document with optional config, return plain text.
jakarta.ws.rs.core.StreamingOutput postXml(List<org.apache.cxf.jaxrs.ext.multipart.Attachment> attachments, jakarta.ws.rs.core.HttpHeaders httpHeaders)
    Parse multipart document with optional config, return XML.
static org.apache.cxf.jaxrs.impl.MetadataMap<String,String> preparePostHeaderMap(org.apache.cxf.jaxrs.ext.multipart.Attachment att, jakarta.ws.rs.core.HttpHeaders httpHeaders)
    Prepares a multivalued map, combining attachment headers and request headers.
static void setupContentHandlerFactory(ParseContext context, String handlerTypeName, int writeLimit, boolean throwOnWriteLimitReached)
    Sets up the ContentHandlerFactory in the ParseContext based on explicit parameters.
static void setupContentHandlerFactory(ParseContext context, String handlerTypeName, jakarta.ws.rs.core.MultivaluedMap<String, String> httpHeaders)
    Sets up the ContentHandlerFactory in the ParseContext based on handler type and HTTP headers.
static void setupContentHandlerFactoryIfNeeded(ParseContext context, String handlerTypeName, int writeLimit, boolean throwOnWriteLimitReached)
    Sets up the ContentHandlerFactory in the ParseContext if not already set.
static void setupContentHandlerFactoryIfNeeded(ParseContext context, String handlerTypeName, jakarta.ws.rs.core.MultivaluedMap<String, String> httpHeaders)
    Sets up the ContentHandlerFactory in the ParseContext if not already set.
static TikaInputStream setupMultipartConfig(List<org.apache.cxf.jaxrs.ext.multipart.Attachment> attachments, Metadata metadata, ParseContext context)
    Processes multipart attachments for /config endpoints.
-
Field Details
-
GREETING
-
HANDLER_TYPE_HEADER
Header to specify the handler type for content extraction. Valid values: text, html, xml, markdown, ignore (default: text).
-
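The value list documented for HANDLER_TYPE_HEADER can be illustrated with a small lookup. This is only a sketch: the class HandlerTypeSketch and the method resolve are hypothetical, and both the case-insensitive matching and the rejection of unknown values are assumptions rather than documented behavior of TikaResource. Only the value list and the text default come from the field description.

```java
import java.util.Locale;
import java.util.Set;

class HandlerTypeSketch {
    // Valid values as listed in the field description.
    static final Set<String> VALID = Set.of("text", "html", "xml", "markdown", "ignore");
    static final String DEFAULT = "text";

    // Hypothetical helper: returns the default when the header is absent or empty,
    // otherwise the normalized value. Rejecting unknown values is an assumption.
    static String resolve(String headerValue) {
        if (headerValue == null || headerValue.trim().isEmpty()) {
            return DEFAULT;
        }
        String normalized = headerValue.trim().toLowerCase(Locale.ROOT);
        if (!VALID.contains(normalized)) {
            throw new IllegalArgumentException("Unknown handler type: " + headerValue);
        }
        return normalized;
    }

    public static void main(String[] args) {
        System.out.println(resolve(null));
        System.out.println(resolve("Markdown"));
    }
}
```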
-
Constructor Details
-
TikaResource
public TikaResource()
-
-
Method Details
-
init
public static void init(TikaLoader tikaLoader, ServerStatus serverStatus, PipesParsingHelper pipesParsingHelper)
Initialize TikaResource with pipes-based parsing for process isolation.
Parameters:
tikaLoader - the Tika loader
serverStatus - server status tracker
pipesParsingHelper - helper for pipes-based parsing; may be null if the /tika endpoint is not enabled
-
getPipesParsingHelper
Gets the PipesParsingHelper instance.
Returns:
the helper
-
createParseContext
Creates a new ParseContext with defaults loaded from tika-config. This loads components from "parse-context" such as DigesterFactory and MetadataWriteLimiterFactory.
Returns:
a new ParseContext with defaults applied
-
createParser
Throws:
TikaConfigException
IOException
-
getTikaLoader
-
detectFilename
-
mergeParseContextFromConfig
public static void mergeParseContextFromConfig(String configJson, ParseContext context) throws IOException, TikaConfigException
Parses config JSON and merges parseContext entries into the provided ParseContext.
Parameters:
configJson - the JSON config string
context - the ParseContext to merge into
Throws:
IOException - if parsing fails
TikaConfigException
-
fillMetadata
-
setupMultipartConfig
public static TikaInputStream setupMultipartConfig(List<org.apache.cxf.jaxrs.ext.multipart.Attachment> attachments, Metadata metadata, ParseContext context) throws IOException, TikaConfigException
Processes multipart attachments for /config endpoints. Extracts the "file" and optional "config" attachments, sets up metadata (filename, content-type) from the file attachment, and processes any config JSON into the ParseContext.
Parameters:
attachments - the multipart attachments
metadata - metadata to populate with filename and content-type
context - parse context to populate from config JSON
Returns:
TikaInputStream wrapping the file attachment's content
Throws:
IOException - if the file attachment is missing or config processing fails
TikaConfigException
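The selection step described for setupMultipartConfig (a required "file" part, an optional "config" part, and an IOException when the file is missing) can be sketched without any Tika or CXF types. Everything named here is hypothetical: Part stands in for the CXF Attachment, Selection for the combined result, and select for the lookup. It does not reproduce how the real method populates Metadata or the ParseContext.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

class MultipartSelectionSketch {
    // Hypothetical stand-in for a multipart attachment; not the CXF Attachment class.
    static final class Part {
        final String name;
        final String filename;
        final String contentType;
        final byte[] body;

        Part(String name, String filename, String contentType, byte[] body) {
            this.name = name;
            this.filename = filename;
            this.contentType = contentType;
            this.body = body;
        }
    }

    // The chosen file part plus the config JSON (null when no "config" part was sent).
    static final class Selection {
        final Part file;
        final String configJson;

        Selection(Part file, String configJson) {
            this.file = file;
            this.configJson = configJson;
        }
    }

    static Selection select(List<Part> parts) throws IOException {
        Part file = null;
        String configJson = null;
        for (Part p : parts) {
            if ("file".equals(p.name)) {
                file = p;
            } else if ("config".equals(p.name)) {
                configJson = new String(p.body, StandardCharsets.UTF_8);
            }
        }
        if (file == null) {
            // Mirrors the documented failure mode: a missing file attachment is an IOException.
            throw new IOException("Missing required \"file\" part");
        }
        return new Selection(file, configJson);
    }

    public static void main(String[] args) throws IOException {
        Part file = new Part("file", "report.pdf", "application/pdf", new byte[] {1, 2, 3});
        // "{}" is a placeholder; the real config schema is not described in this page.
        Part config = new Part("config", null, "application/json", "{}".getBytes(StandardCharsets.UTF_8));
        Selection s = select(Arrays.asList(file, config));
        System.out.println(s.file.filename + " " + s.file.contentType + " config=" + s.configJson);
    }
}
```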
-
parse
public static void parse(Parser parser, org.slf4j.Logger logger, String path, TikaInputStream inputStream, ContentHandler handler, Metadata metadata, ParseContext parseContext) throws IOException
Use this to call a parser and unify exception handling.
NOTE: This call to parse closes the TikaInputStream. DO NOT surround the call in an auto-close block.
This method is used by endpoints that don't yet use pipes-based parsing (UnpackerResource, MetadataResource). For /tika and /rmeta endpoints, use parseWithPipes() instead.
Parameters:
parser - parser to use
logger - logger to use
path - file path
inputStream - TikaInputStream (which is closed by this call!)
handler - handler to use
metadata - metadata
parseContext - parse context
Throws:
IOException - wrapper for all exceptions
-
parseWithPipes
public static List<Metadata> parseWithPipes(TikaInputStream tis, Metadata metadata, ParseContext parseContext, ParseMode parseMode) throws IOException
Parses using pipes-based parsing with process isolation.
The TikaInputStream should already be spooled to a temp file via TikaInputStream.getPath(). The caller is responsible for closing the TikaInputStream after this method returns, which will clean up any temp files.
Parameters:
tis - the TikaInputStream to parse
metadata - metadata to pass to the parser
parseContext - parse context with handler configuration
parseMode - RMETA or CONCATENATE
Returns:
list of metadata objects from parsing
Throws:
IOException - if parsing fails
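The ownership rule stated for parseWithPipes (the input is spooled to a temp file, and the caller closes it afterwards so the temp file is removed) can be illustrated with standard library classes only. This is a sketch of the pattern, not of Tika's implementation: SpooledInput is a hypothetical stand-in for a spooled TikaInputStream, and parseFromPath is a hypothetical placeholder for work that only needs the file path.

```java
import java.io.ByteArrayInputStream;
import java.io.Closeable;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class CallerClosesSketch {
    // Hypothetical stand-in for a stream that has been spooled to a temp file.
    static final class SpooledInput implements Closeable {
        private final Path path;

        SpooledInput(InputStream in) throws IOException {
            path = Files.createTempFile("spooled-", ".bin");
            // createTempFile already created the file, so the copy must replace it.
            Files.copy(in, path, StandardCopyOption.REPLACE_EXISTING);
        }

        Path getPath() {
            return path;
        }

        @Override
        public void close() throws IOException {
            // Closing is what removes the temp file.
            Files.deleteIfExists(path);
        }
    }

    // Hypothetical parse step that only needs the path; here it just reports the size.
    static long parseFromPath(Path path) throws IOException {
        return Files.size(path);
    }

    // Remembered only so the demonstration can check that cleanup happened.
    static Path lastPath;

    static long run(byte[] data) throws IOException {
        // The caller owns the resource: try-with-resources closes it after parsing returns.
        try (SpooledInput input = new SpooledInput(new ByteArrayInputStream(data))) {
            lastPath = input.getPath();
            return parseFromPath(input.getPath());
        }
    }

    public static void main(String[] args) throws IOException {
        long size = run(new byte[] {1, 2, 3});
        System.out.println("size=" + size + " tempFileExists=" + Files.exists(lastPath));
    }
}
```

Note that this is the opposite of the rule for parse(), which closes the stream itself.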
-
logRequest
-
getThrowOnWriteLimitReached
-
getWriteLimit
Parses the writeLimit header value from HTTP headers.
Parameters:
httpHeaders - the HTTP headers
Returns:
the write limit value, or -1 if not specified
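A minimal sketch of the documented contract (return the header value, or -1 when the header is absent) follows. It uses a plain Map in place of MultivaluedMap, and the class and method names are hypothetical. Throwing IllegalArgumentException for a non-numeric value is an assumption; this page does not say how the real method reports malformed input.

```java
import java.util.List;
import java.util.Map;

class WriteLimitSketch {
    // Header name as used in the description; exact matching rules are an assumption.
    static final String WRITE_LIMIT = "writeLimit";

    static int parseWriteLimit(Map<String, List<String>> headers) {
        List<String> values = headers.get(WRITE_LIMIT);
        if (values == null || values.isEmpty()) {
            return -1; // documented result when the header is not specified
        }
        String raw = values.get(0).trim();
        try {
            return Integer.parseInt(raw);
        } catch (NumberFormatException e) {
            // Assumption: malformed values are rejected rather than ignored.
            throw new IllegalArgumentException("writeLimit must be an integer: " + raw, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parseWriteLimit(Map.of()));
        System.out.println(parseWriteLimit(Map.of(WRITE_LIMIT, List.of("1000"))));
    }
}
```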
-
setupContentHandlerFactory
public static void setupContentHandlerFactory(ParseContext context, String handlerTypeName, jakarta.ws.rs.core.MultivaluedMap<String, String> httpHeaders)
Sets up the ContentHandlerFactory in the ParseContext based on handler type and HTTP headers. This is a shared utility method used by both /tika and /rmeta endpoints.
Parameters:
context - the ParseContext to configure
handlerTypeName - the handler type name (text, html, xml, ignore); may be null for the default
httpHeaders - the HTTP headers containing writeLimit and throwOnWriteLimitReached
-
setupContentHandlerFactory
public static void setupContentHandlerFactory(ParseContext context, String handlerTypeName, int writeLimit, boolean throwOnWriteLimitReached)
Sets up the ContentHandlerFactory in the ParseContext based on explicit parameters. This overload is used when the values have already been parsed (e.g., from ServerHandlerConfig).
Parameters:
context - the ParseContext to configure
handlerTypeName - the handler type name (text, html, xml, ignore); may be null for the default
writeLimit - the write limit, or -1 for unlimited
throwOnWriteLimitReached - whether to throw when the write limit is reached
-
setupContentHandlerFactoryIfNeeded
public static void setupContentHandlerFactoryIfNeeded(ParseContext context, String handlerTypeName, jakarta.ws.rs.core.MultivaluedMap<String, String> httpHeaders)
Sets up the ContentHandlerFactory in the ParseContext if not already set. Used when a ParseContext may already have a factory configured.
Parameters:
context - the ParseContext to configure
handlerTypeName - the handler type name
httpHeaders - the HTTP headers
-
setupContentHandlerFactoryIfNeeded
public static void setupContentHandlerFactoryIfNeeded(ParseContext context, String handlerTypeName, int writeLimit, boolean throwOnWriteLimitReached)
Sets up the ContentHandlerFactory in the ParseContext if not already set. This overload is used when the values have already been parsed.
Parameters:
context - the ParseContext to configure
handlerTypeName - the handler type name
writeLimit - the write limit, or -1 for unlimited
throwOnWriteLimitReached - whether to throw when the write limit is reached
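The difference between the setupContentHandlerFactory and setupContentHandlerFactoryIfNeeded overloads is "always set" versus "set only if absent". The sketch below shows that distinction with hypothetical stand-ins: MiniContext is a class-keyed map loosely modeled on a parse context, and HandlerSettings bundles the three explicit parameters. Neither is a Tika type, and the real factory object is not reproduced here.

```java
import java.util.HashMap;
import java.util.Map;

class IfNeededSketch {
    // Hypothetical class-keyed context; not the Tika ParseContext.
    static final class MiniContext {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key) {
            return key.cast(entries.get(key)); // null when nothing is registered
        }
    }

    // Hypothetical bundle of the explicit parameters.
    static final class HandlerSettings {
        final String handlerType;
        final int writeLimit;
        final boolean throwOnWriteLimitReached;

        HandlerSettings(String handlerType, int writeLimit, boolean throwOnWriteLimitReached) {
            this.handlerType = handlerType;
            this.writeLimit = writeLimit;
            this.throwOnWriteLimitReached = throwOnWriteLimitReached;
        }
    }

    // Always overwrites; a null type falls back to the documented default, text.
    static void setup(MiniContext ctx, String type, int writeLimit, boolean throwOnLimit) {
        String effective = (type == null) ? "text" : type;
        ctx.set(HandlerSettings.class, new HandlerSettings(effective, writeLimit, throwOnLimit));
    }

    // Leaves an existing entry untouched.
    static void setupIfNeeded(MiniContext ctx, String type, int writeLimit, boolean throwOnLimit) {
        if (ctx.get(HandlerSettings.class) == null) {
            setup(ctx, type, writeLimit, throwOnLimit);
        }
    }

    public static void main(String[] args) {
        MiniContext ctx = new MiniContext();
        setup(ctx, "html", 1000, true);
        setupIfNeeded(ctx, "xml", -1, false); // no effect: settings already present
        System.out.println(ctx.get(HandlerSettings.class).handlerType);
    }
}
```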
-
getMessage
-
getXhtml
@PUT
@Consumes("*/*")
@Produces("text/xml")
public jakarta.ws.rs.core.StreamingOutput getXhtml(InputStream is, @Context jakarta.ws.rs.core.HttpHeaders httpHeaders) throws IOException
Parse document and return XHTML content.
Throws:
IOException
-
getText
@PUT
@Consumes("*/*")
@Produces("text/plain")
@Path("text")
public jakarta.ws.rs.core.StreamingOutput getText(InputStream is, @Context jakarta.ws.rs.core.HttpHeaders httpHeaders) throws IOException
Parse document and return plain text content.
Throws:
IOException
-
getHtml
@PUT
@Consumes("*/*")
@Produces("text/html")
@Path("html")
public jakarta.ws.rs.core.StreamingOutput getHtml(InputStream is, @Context jakarta.ws.rs.core.HttpHeaders httpHeaders) throws IOException
Parse document and return HTML content.
Throws:
IOException
-
getXml
@PUT
@Consumes("*/*")
@Produces("text/xml")
@Path("xml")
public jakarta.ws.rs.core.StreamingOutput getXml(InputStream is, @Context jakarta.ws.rs.core.HttpHeaders httpHeaders) throws IOException
Parse document and return XML content.
Throws:
IOException
-
getMarkdown
@PUT
@Consumes("*/*")
@Produces("text/plain")
@Path("md")
public jakarta.ws.rs.core.StreamingOutput getMarkdown(InputStream is, @Context jakarta.ws.rs.core.HttpHeaders httpHeaders) throws IOException
Parse document and return Markdown content.
Throws:
IOException
-
getJsonDefault
@PUT
@Consumes("*/*")
@Produces("application/json")
@Path("json")
public Metadata getJsonDefault(InputStream is, @Context jakarta.ws.rs.core.HttpHeaders httpHeaders) throws IOException
Parse document and return JSON with metadata and text content.
Throws:
IOException
-
getJson
@PUT
@Consumes("*/*")
@Produces("application/json")
@Path("json/{handler}")
public Metadata getJson(InputStream is, @Context jakarta.ws.rs.core.HttpHeaders httpHeaders, @PathParam("handler") String handlerTypeName) throws IOException
Parse document and return JSON with metadata and specified content type.
Parameters:
handlerTypeName - content handler type: text, html, or xml
Throws:
IOException
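From a client's point of view, the PUT endpoints above differ only in their sub-path. The sketch below builds (but does not send) such requests with java.net.http. The base URL http://localhost:9998/tika is an assumption about a local deployment, and the class and method names are hypothetical. The sub-paths are taken from the @Path annotations, with /tika as the resource root implied by the /tika/config/... paths mentioned elsewhere on this page.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

class TikaPutRequestSketch {
    // Assumed base URL of a locally running server; adjust to your deployment.
    static final String BASE = "http://localhost:9998/tika";

    // Sub-path per output format, following the @Path annotations.
    static String pathFor(String format) {
        switch (format) {
            case "xhtml": return "";
            case "text": return "/text";
            case "html": return "/html";
            case "xml": return "/xml";
            case "markdown": return "/md";
            case "json": return "/json";
            default: throw new IllegalArgumentException("Unknown format: " + format);
        }
    }

    // Builds a PUT request whose body is the raw document bytes.
    static HttpRequest build(String format, byte[] document) {
        return HttpRequest.newBuilder(URI.create(BASE + pathFor(format)))
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }

    public static void main(String[] args) {
        byte[] doc = "hello".getBytes(StandardCharsets.UTF_8);
        HttpRequest request = build("text", doc);
        System.out.println(request.method() + " " + request.uri());
    }
}
```

To actually send the request, pass it to HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()) against a running server.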
-
postRaw
@POST
@Consumes("multipart/form-data")
@Produces("text/xml")
@Path("config")
public jakarta.ws.rs.core.StreamingOutput postRaw(List<org.apache.cxf.jaxrs.ext.multipart.Attachment> attachments, @Context jakarta.ws.rs.core.HttpHeaders httpHeaders) throws IOException, TikaConfigException
Parse multipart document with optional config, return XHTML output.
Accepts multipart with:
- "file" part (required): the document to parse
- "config" part (optional): JSON configuration for parser settings and handler type
Returns XHTML by default. Use /tika/config/text, /tika/config/html, or /tika/config/xml for other formats.
This endpoint is gated behind enableUnsecureFeatures=true because per-request configuration could enable dangerous operations.
Throws:
IOException
TikaConfigException
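A client-side sketch of the multipart layout these /config endpoints accept: a "file" part and an optional "config" part. The body is assembled by hand so the structure is visible. The base URL, the class and method names, and the "{}" placeholder config are assumptions; the config JSON schema is not described on this page. Per the note above, the server must be running with enableUnsecureFeatures=true for such a request to be accepted.

```java
import java.io.ByteArrayOutputStream;
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

class MultipartConfigRequestSketch {
    // Assumed base URL of a locally running server; adjust to your deployment.
    static final String BASE = "http://localhost:9998/tika";

    // Writes one multipart part: boundary line, part headers, blank line, body, CRLF.
    static void writePart(ByteArrayOutputStream out, String boundary, String name,
                          String filename, String contentType, byte[] body) {
        StringBuilder head = new StringBuilder();
        head.append("--").append(boundary).append("\r\n");
        head.append("Content-Disposition: form-data; name=\"").append(name).append('"');
        if (filename != null) {
            head.append("; filename=\"").append(filename).append('"');
        }
        head.append("\r\n");
        head.append("Content-Type: ").append(contentType).append("\r\n\r\n");
        out.writeBytes(head.toString().getBytes(StandardCharsets.UTF_8));
        out.writeBytes(body);
        out.writeBytes("\r\n".getBytes(StandardCharsets.UTF_8));
    }

    // "file" is required; "config" is written only when configJson is non-null.
    static byte[] buildBody(String boundary, String filename, String fileContentType,
                            byte[] file, String configJson) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writePart(out, boundary, "file", filename, fileContentType, file);
        if (configJson != null) {
            writePart(out, boundary, "config", null, "application/json",
                    configJson.getBytes(StandardCharsets.UTF_8));
        }
        out.writeBytes(("--" + boundary + "--\r\n").getBytes(StandardCharsets.UTF_8));
        return out.toByteArray();
    }

    // subPath is "" for XHTML, or "/text", "/html", "/xml", "/md", "/json".
    static HttpRequest buildRequest(String subPath, String boundary, byte[] body) {
        return HttpRequest.newBuilder(URI.create(BASE + "/config" + subPath))
                .header("Content-Type", "multipart/form-data; boundary=" + boundary)
                .POST(HttpRequest.BodyPublishers.ofByteArray(body))
                .build();
    }

    public static void main(String[] args) {
        String boundary = "sketch-boundary-1234";
        byte[] body = buildBody(boundary, "note.txt", "text/plain",
                "hello".getBytes(StandardCharsets.UTF_8), "{}");
        HttpRequest request = buildRequest("/text", boundary, body);
        System.out.println(request.method() + " " + request.uri());
    }
}
```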
-
postText
@POST
@Consumes("multipart/form-data")
@Produces("text/plain")
@Path("config/text")
public jakarta.ws.rs.core.StreamingOutput postText(List<org.apache.cxf.jaxrs.ext.multipart.Attachment> attachments, @Context jakarta.ws.rs.core.HttpHeaders httpHeaders) throws IOException, TikaConfigException
Parse multipart document with optional config, return plain text.
Accepts multipart with:
- "file" part (required): the document to parse
- "config" part (optional): JSON configuration for parser settings
This endpoint is gated behind enableUnsecureFeatures=true because per-request configuration could enable dangerous operations.
Throws:
IOException
TikaConfigException
-
postHtml
@POST
@Consumes("multipart/form-data")
@Produces("text/html")
@Path("config/html")
public jakarta.ws.rs.core.StreamingOutput postHtml(List<org.apache.cxf.jaxrs.ext.multipart.Attachment> attachments, @Context jakarta.ws.rs.core.HttpHeaders httpHeaders) throws IOException, TikaConfigException
Parse multipart document with optional config, return HTML.
Accepts multipart with:
- "file" part (required): the document to parse
- "config" part (optional): JSON configuration for parser settings
This endpoint is gated behind enableUnsecureFeatures=true because per-request configuration could enable dangerous operations.
Throws:
IOException
TikaConfigException
-
postXml
@POST
@Consumes("multipart/form-data")
@Produces("text/xml")
@Path("config/xml")
public jakarta.ws.rs.core.StreamingOutput postXml(List<org.apache.cxf.jaxrs.ext.multipart.Attachment> attachments, @Context jakarta.ws.rs.core.HttpHeaders httpHeaders) throws IOException, TikaConfigException
Parse multipart document with optional config, return XML.
Accepts multipart with:
- "file" part (required): the document to parse
- "config" part (optional): JSON configuration for parser settings
This endpoint is gated behind enableUnsecureFeatures=true because per-request configuration could enable dangerous operations.
Throws:
IOException
TikaConfigException
-
postMarkdown
@POST
@Consumes("multipart/form-data")
@Produces("text/plain")
@Path("config/md")
public jakarta.ws.rs.core.StreamingOutput postMarkdown(List<org.apache.cxf.jaxrs.ext.multipart.Attachment> attachments, @Context jakarta.ws.rs.core.HttpHeaders httpHeaders) throws IOException, TikaConfigException
Parse multipart document with optional config, return Markdown.
Accepts multipart with:
- "file" part (required): the document to parse
- "config" part (optional): JSON configuration for parser settings
This endpoint is gated behind enableUnsecureFeatures=true because per-request configuration could enable dangerous operations.
Throws:
IOException
TikaConfigException
-
postJson
@POST
@Consumes("multipart/form-data")
@Produces("application/json")
@Path("config/json")
public Metadata postJson(List<org.apache.cxf.jaxrs.ext.multipart.Attachment> attachments, @Context jakarta.ws.rs.core.HttpHeaders httpHeaders) throws IOException, TikaConfigException
Parse multipart document with optional config, return JSON.
Accepts multipart with:
- "file" part (required): the document to parse
- "config" part (optional): JSON configuration for parser settings and handler type
The default handler is text. Use the config part to specify a different handler type.
This endpoint is gated behind enableUnsecureFeatures=true because per-request configuration could enable dangerous operations.
Throws:
IOException
TikaConfigException
-
preparePostHeaderMap
public static org.apache.cxf.jaxrs.impl.MetadataMap<String,String> preparePostHeaderMap(org.apache.cxf.jaxrs.ext.multipart.Attachment att, jakarta.ws.rs.core.HttpHeaders httpHeaders)
Prepares a multivalued map, combining attachment headers and request headers. For multipart requests, the attachment's Content-Type takes priority over the request's Content-Type (which is multipart/form-data).
Parameters:
att - the attachment
httpHeaders - the HTTP headers, fetched from context
Returns:
the case-insensitive MetadataMap containing the combined headers
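The merge rule described for preparePostHeaderMap can be sketched with a case-insensitive TreeMap in place of the CXF MetadataMap. The class and method names are hypothetical. The description only states that the attachment's Content-Type wins; letting attachment values replace request values for every shared header name is an assumption made to keep the sketch short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class HeaderMergeSketch {
    static Map<String, List<String>> merge(Map<String, List<String>> requestHeaders,
                                           Map<String, List<String>> attachmentHeaders) {
        // Case-insensitive keys, so "Content-Type" and "content-type" are the same entry.
        Map<String, List<String>> merged = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        for (Map.Entry<String, List<String>> e : requestHeaders.entrySet()) {
            merged.put(e.getKey(), new ArrayList<>(e.getValue()));
        }
        // Attachment headers are applied last, so they take priority
        // (documented for Content-Type; assumed here for other headers).
        for (Map.Entry<String, List<String>> e : attachmentHeaders.entrySet()) {
            merged.put(e.getKey(), new ArrayList<>(e.getValue()));
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, List<String>> request = Map.of(
                "Content-Type", List.of("multipart/form-data; boundary=x"),
                "writeLimit", List.of("100"));
        Map<String, List<String>> attachment = Map.of(
                "content-type", List.of("application/pdf"));
        Map<String, List<String>> merged = merge(request, attachment);
        System.out.println(merged.get("CONTENT-TYPE"));
        System.out.println(merged.get("writelimit"));
    }
}
```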
-