Class TikaResource

java.lang.Object
org.apache.tika.server.core.resource.TikaResource

@Path("/tika") public class TikaResource extends Object
  • Field Details

    • GREETING

      public static final String GREETING
    • HANDLER_TYPE_HEADER

      public static final String HANDLER_TYPE_HEADER
      Header to specify the handler type for content extraction. Valid values: text, html, xml, markdown, ignore (default: text)
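      The valid values listed for this header can be modelled as a small enum. What follows is a minimal sketch, not Tika's implementation: HandlerTypeSketch and resolve are hypothetical names, and both the case-insensitive matching and the rejection of unknown values are assumptions, since the source only lists the valid values and the default.

```java
import java.util.Locale;

final class HandlerTypeSketch {
    /** The handler types named in the HANDLER_TYPE_HEADER description. */
    enum Type { TEXT, HTML, XML, MARKDOWN, IGNORE }

    /** Maps a raw header value to a Type; a missing or blank value falls back to TEXT. */
    static Type resolve(String headerValue) {
        if (headerValue == null || headerValue.isBlank()) {
            return Type.TEXT; // documented default
        }
        try {
            return Type.valueOf(headerValue.trim().toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            // Assumption: unknown values are rejected rather than silently defaulted.
            throw new IllegalArgumentException("Unknown handler type: " + headerValue, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve(null));
        System.out.println(resolve("markdown"));
    }
}
```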
  • Constructor Details

    • TikaResource

      public TikaResource()
  • Method Details

    • init

      public static void init(TikaLoader tikaLoader, ServerStatus serverStatus, PipesParsingHelper pipesParsingHelper)
      Initialize TikaResource with pipes-based parsing for process isolation.
      Parameters:
      tikaLoader - the Tika loader
      serverStatus - server status tracker
      pipesParsingHelper - helper for pipes-based parsing; may be null if the /tika endpoint is not enabled
    • getPipesParsingHelper

      public static PipesParsingHelper getPipesParsingHelper()
      Gets the PipesParsingHelper instance.
      Returns:
      the helper
    • createParseContext

      public static ParseContext createParseContext()
      Creates a new ParseContext with defaults loaded from tika-config. This loads components from "parse-context" such as DigesterFactory and MetadataWriteLimiterFactory.
      Returns:
      a new ParseContext with defaults applied
    • createParser

      public static Parser createParser() throws TikaConfigException, IOException
      Throws:
      TikaConfigException
      IOException
    • getTikaLoader

      public static TikaLoader getTikaLoader()
    • detectFilename

      public static String detectFilename(jakarta.ws.rs.core.MultivaluedMap<String,String> httpHeaders)
    • mergeParseContextFromConfig

      public static void mergeParseContextFromConfig(String configJson, ParseContext context) throws IOException, TikaConfigException
      Parses config JSON and merges parseContext entries into the provided ParseContext.
      Parameters:
      configJson - the JSON config string
      context - the ParseContext to merge into
      Throws:
      IOException - if parsing fails
      TikaConfigException
    • fillMetadata

      public static void fillMetadata(Parser parser, Metadata metadata, jakarta.ws.rs.core.MultivaluedMap<String,String> httpHeaders)
    • setupMultipartConfig

      public static TikaInputStream setupMultipartConfig(List<org.apache.cxf.jaxrs.ext.multipart.Attachment> attachments, Metadata metadata, ParseContext context) throws IOException, TikaConfigException
      Processes multipart attachments for /config endpoints. Extracts the "file" and optional "config" attachments, sets up metadata (filename, content-type) from the file attachment, and processes any config JSON into the ParseContext.
      Parameters:
      attachments - the multipart attachments
      metadata - metadata to populate with filename and content-type
      context - parse context to populate from config JSON
      Returns:
      TikaInputStream wrapping the file attachment's content
      Throws:
      IOException - if file attachment is missing or config processing fails
      TikaConfigException
    • parse

      public static void parse(Parser parser, org.slf4j.Logger logger, String path, TikaInputStream inputStream, ContentHandler handler, Metadata metadata, ParseContext parseContext) throws IOException
      Use this to call a parser and unify exception handling. NOTE: This call to parse closes the TikaInputStream. DO NOT wrap the call in a try-with-resources (auto-close) block.

      This method is used by endpoints that don't yet use pipes-based parsing (UnpackerResource, MetadataResource). For /tika and /rmeta endpoints, use parseWithPipes() instead.

      Parameters:
      parser - parser to use
      logger - logger to use
      path - file path
      inputStream - TikaInputStream (which is closed by this call!)
      handler - handler to use
      metadata - metadata
      parseContext - parse context
      Throws:
      IOException - wrapper for all exceptions
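      The ownership rule in the note above (the callee closes the stream, so the caller must not) can be illustrated without any Tika classes. This is only a sketch under stated assumptions: CalleeClosesSketch, TrackingStream, and consumeAndClose are hypothetical stand-ins, and consumeAndClose merely imitates a method that takes ownership of its input.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

final class CalleeClosesSketch {
    /** In-memory stream that records whether close() was called. */
    static final class TrackingStream extends ByteArrayInputStream {
        boolean closed;

        TrackingStream(byte[] data) {
            super(data);
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    /** Stand-in for a parse call that owns its input: it reads everything, then closes the stream. */
    static int consumeAndClose(InputStream in) {
        try (InputStream owned = in) {
            return owned.readAllBytes().length;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        TrackingStream stream = new TrackingStream(new byte[] {1, 2, 3});
        // No try-with-resources here: the callee is responsible for closing.
        int bytesRead = consumeAndClose(stream);
        System.out.println(bytesRead + " bytes read, closed=" + stream.closed);
    }
}
```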
    • parseWithPipes

      public static List<Metadata> parseWithPipes(TikaInputStream tis, Metadata metadata, ParseContext parseContext, ParseMode parseMode) throws IOException
      Parses using pipes-based parsing with process isolation.

      The TikaInputStream should already be spooled to a temp file via TikaInputStream.getPath(). The caller is responsible for closing the TikaInputStream after this method returns, which will clean up any temp files.

      Parameters:
      tis - the TikaInputStream to parse
      metadata - metadata to pass to the parser
      parseContext - parse context with handler configuration
      parseMode - RMETA or CONCATENATE
      Returns:
      list of metadata objects from parsing
      Throws:
      IOException - if parsing fails
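      Here the ownership rule is the reverse of parse: the caller closes the stream, and closing removes the temp file. The sketch below imitates that lifecycle with plain JDK classes. SpoolSketch and SpooledInput are hypothetical stand-ins for illustration only; they are not TikaInputStream and make no claim about how it spools internally.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class SpoolSketch {
    /** Hypothetical stand-in: copies a stream to a temp file and deletes the file on close. */
    static final class SpooledInput implements AutoCloseable {
        private final Path path;

        private SpooledInput(Path path) {
            this.path = path;
        }

        static SpooledInput spool(InputStream in) {
            try {
                Path tmp = Files.createTempFile("spool-", ".bin");
                Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
                return new SpooledInput(tmp);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }

        Path getPath() {
            return path;
        }

        @Override
        public void close() {
            try {
                Files.deleteIfExists(path);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    public static void main(String[] args) {
        byte[] data = "hello".getBytes(StandardCharsets.UTF_8);
        // The caller owns the resource, so try-with-resources is appropriate here.
        try (SpooledInput input = SpooledInput.spool(new ByteArrayInputStream(data))) {
            System.out.println("spooled to " + input.getPath());
            // A process-isolated parser would be handed input.getPath() at this point.
        }
    }
}
```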
    • logRequest

      public static void logRequest(org.slf4j.Logger logger, String endpoint, Metadata metadata)
    • getThrowOnWriteLimitReached

      public static boolean getThrowOnWriteLimitReached(jakarta.ws.rs.core.MultivaluedMap<String,String> httpHeaders)
    • getWriteLimit

      public static int getWriteLimit(jakarta.ws.rs.core.MultivaluedMap<String,String> httpHeaders)
      Parses the writeLimit header value from HTTP headers.
      Parameters:
      httpHeaders - the HTTP headers
      Returns:
      the write limit value, or -1 if not specified
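      The documented contract (the parsed value, or -1 when the header is absent) might look like the sketch below. Several things here are assumptions: a plain Map stands in for MultivaluedMap, the header name "writeLimit" is taken from the setupContentHandlerFactory parameter description rather than from a confirmed constant, and a non-numeric value is allowed to raise NumberFormatException because the source does not say how such input is handled.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class WriteLimitSketch {
    /** Returns the first "writeLimit" header value as an int, or -1 if it is absent or blank. */
    static int writeLimit(Map<String, List<String>> headers) {
        List<String> values = headers.get("writeLimit"); // assumed header name
        if (values == null || values.isEmpty()) {
            return -1;
        }
        String raw = values.get(0);
        if (raw == null || raw.isBlank()) {
            return -1;
        }
        // Assumption: malformed numbers propagate as NumberFormatException.
        return Integer.parseInt(raw.trim());
    }

    public static void main(String[] args) {
        Map<String, List<String>> headers = new HashMap<>();
        System.out.println(writeLimit(headers));
        headers.put("writeLimit", List.of("100000"));
        System.out.println(writeLimit(headers));
    }
}
```

      Note that this stand-in map is case-sensitive, whereas HTTP header names are not; a real header map would normally handle that.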
    • setupContentHandlerFactory

      public static void setupContentHandlerFactory(ParseContext context, String handlerTypeName, jakarta.ws.rs.core.MultivaluedMap<String,String> httpHeaders)
      Sets up the ContentHandlerFactory in the ParseContext based on handler type and HTTP headers. This is a shared utility method used by both /tika and /rmeta endpoints.
      Parameters:
      context - the ParseContext to configure
      handlerTypeName - the handler type name (text, html, xml, markdown, ignore), may be null for default
      httpHeaders - the HTTP headers containing writeLimit and throwOnWriteLimitReached
    • setupContentHandlerFactory

      public static void setupContentHandlerFactory(ParseContext context, String handlerTypeName, int writeLimit, boolean throwOnWriteLimitReached)
      Sets up the ContentHandlerFactory in the ParseContext based on explicit parameters. This overload is used when the values have already been parsed (e.g., from ServerHandlerConfig).
      Parameters:
      context - the ParseContext to configure
      handlerTypeName - the handler type name (text, html, xml, markdown, ignore), may be null for default
      writeLimit - the write limit, or -1 for unlimited
      throwOnWriteLimitReached - whether to throw when write limit is reached
    • setupContentHandlerFactoryIfNeeded

      public static void setupContentHandlerFactoryIfNeeded(ParseContext context, String handlerTypeName, jakarta.ws.rs.core.MultivaluedMap<String,String> httpHeaders)
      Sets up the ContentHandlerFactory in the ParseContext if not already set. Used when a ParseContext may already have a factory configured.
      Parameters:
      context - the ParseContext to configure
      handlerTypeName - the handler type name
      httpHeaders - the HTTP headers
    • setupContentHandlerFactoryIfNeeded

      public static void setupContentHandlerFactoryIfNeeded(ParseContext context, String handlerTypeName, int writeLimit, boolean throwOnWriteLimitReached)
      Sets up the ContentHandlerFactory in the ParseContext if not already set. This overload is used when the values have already been parsed.
      Parameters:
      context - the ParseContext to configure
      handlerTypeName - the handler type name
      writeLimit - the write limit, or -1 for unlimited
      throwOnWriteLimitReached - whether to throw when write limit is reached
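      The "if needed" variants follow a set-if-absent pattern: an existing factory in the context wins, and defaults are applied only when none is present. The sketch below shows that pattern with hypothetical stand-ins (Context for ParseContext, HandlerFactory for the content handler factory); it does not reproduce Tika's classes, and the "text" fallback for a null handler type is taken from the documented default.

```java
import java.util.HashMap;
import java.util.Map;

final class IfNeededSketch {
    /** Hypothetical stand-in for a type-keyed parse context. */
    static final class Context {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key) {
            return key.cast(entries.get(key)); // cast(null) returns null
        }
    }

    /** Hypothetical stand-in for a configured content handler factory. */
    static final class HandlerFactory {
        final String handlerType;
        final int writeLimit;
        final boolean throwOnWriteLimitReached;

        HandlerFactory(String handlerType, int writeLimit, boolean throwOnWriteLimitReached) {
            this.handlerType = handlerType;
            this.writeLimit = writeLimit;
            this.throwOnWriteLimitReached = throwOnWriteLimitReached;
        }
    }

    /** Installs a factory only when the context does not already hold one. */
    static void setupIfNeeded(Context ctx, String handlerType, int writeLimit, boolean throwOnLimit) {
        if (ctx.get(HandlerFactory.class) == null) {
            String type = handlerType == null ? "text" : handlerType;
            ctx.set(HandlerFactory.class, new HandlerFactory(type, writeLimit, throwOnLimit));
        }
    }

    public static void main(String[] args) {
        Context ctx = new Context();
        setupIfNeeded(ctx, "xml", -1, false);
        setupIfNeeded(ctx, "html", 500, true); // ignored: a factory is already set
        System.out.println(ctx.get(HandlerFactory.class).handlerType);
    }
}
```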
    • getMessage

      @GET @Produces("text/plain") public String getMessage()
    • getXhtml

      @PUT @Consumes("*/*") @Produces("text/xml") public jakarta.ws.rs.core.StreamingOutput getXhtml(InputStream is, @Context jakarta.ws.rs.core.HttpHeaders httpHeaders) throws IOException
      Parse document and return XHTML content.
      Throws:
      IOException
    • getText

      @PUT @Consumes("*/*") @Produces("text/plain") @Path("text") public jakarta.ws.rs.core.StreamingOutput getText(InputStream is, @Context jakarta.ws.rs.core.HttpHeaders httpHeaders) throws IOException
      Parse document and return plain text content.
      Throws:
      IOException
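      From a client's point of view, this endpoint is a PUT to /tika/text (the class-level @Path("/tika") joined with the method-level @Path("text")) with the raw document as the body. The sketch below builds such a request with the JDK's java.net.http API. The base URL http://localhost:9998 and the class name TikaTextRequestSketch are assumptions for illustration; sending the request requires a running server, so it is left as a comment.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

final class TikaTextRequestSketch {
    /** Builds a PUT request for the plain-text endpoint; baseUrl must not end with a slash. */
    static HttpRequest buildTextRequest(String baseUrl, byte[] document) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/tika/text"))
                .header("Accept", "text/plain")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }

    public static void main(String[] args) {
        byte[] document = "Hello, Tika".getBytes(StandardCharsets.UTF_8);
        HttpRequest request = buildTextRequest("http://localhost:9998", document); // assumed address
        System.out.println(request.method() + " " + request.uri());
        // To send it against a running server:
        // HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    }
}
```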
    • getHtml

      @PUT @Consumes("*/*") @Produces("text/html") @Path("html") public jakarta.ws.rs.core.StreamingOutput getHtml(InputStream is, @Context jakarta.ws.rs.core.HttpHeaders httpHeaders) throws IOException
      Parse document and return HTML content.
      Throws:
      IOException
    • getXml

      @PUT @Consumes("*/*") @Produces("text/xml") @Path("xml") public jakarta.ws.rs.core.StreamingOutput getXml(InputStream is, @Context jakarta.ws.rs.core.HttpHeaders httpHeaders) throws IOException
      Parse document and return XML content.
      Throws:
      IOException
    • getMarkdown

      @PUT @Consumes("*/*") @Produces("text/plain") @Path("md") public jakarta.ws.rs.core.StreamingOutput getMarkdown(InputStream is, @Context jakarta.ws.rs.core.HttpHeaders httpHeaders) throws IOException
      Parse document and return Markdown content.
      Throws:
      IOException
    • getJsonDefault

      @PUT @Consumes("*/*") @Produces("application/json") @Path("json") public Metadata getJsonDefault(InputStream is, @Context jakarta.ws.rs.core.HttpHeaders httpHeaders) throws IOException
      Parse document and return JSON with metadata and text content.
      Throws:
      IOException
    • getJson

      @PUT @Consumes("*/*") @Produces("application/json") @Path("json/{handler}") public Metadata getJson(InputStream is, @Context jakarta.ws.rs.core.HttpHeaders httpHeaders, @PathParam("handler") String handlerTypeName) throws IOException
      Parse document and return JSON with metadata and specified content type.
      Parameters:
      handlerTypeName - content handler type: text, html, or xml
      Throws:
      IOException
    • postRaw

      @POST @Consumes("multipart/form-data") @Produces("text/xml") @Path("config") public jakarta.ws.rs.core.StreamingOutput postRaw(List<org.apache.cxf.jaxrs.ext.multipart.Attachment> attachments, @Context jakarta.ws.rs.core.HttpHeaders httpHeaders) throws IOException, TikaConfigException
      Parse multipart document with optional config, return XHTML output.

      Accepts multipart with:
      - "file" part (required): the document to parse
      - "config" part (optional): JSON configuration for parser settings and handler type

      Returns XHTML by default. Use /tika/config/text, /tika/config/html, or /tika/config/xml for other formats.

      This endpoint is gated behind enableUnsecureFeatures=true because per-request configuration could enable dangerous operations.

      Throws:
      IOException
      TikaConfigException
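      The multipart layout described above (a required "file" part and an optional "config" part) can be assembled by hand as follows. This is a sketch under stated assumptions: MultipartSketch and buildBody are hypothetical names, the per-part Content-Type values are illustrative, the file content is treated as text for brevity (binary documents would need byte-level assembly), and the "{}" config is a placeholder because the config JSON schema is not described here.

```java
final class MultipartSketch {
    /** Builds a multipart/form-data body with a "file" part and, if configJson is non-null, a "config" part. */
    static String buildBody(String boundary, String filename, String fileText, String configJson) {
        StringBuilder sb = new StringBuilder();
        sb.append("--").append(boundary).append("\r\n")
          .append("Content-Disposition: form-data; name=\"file\"; filename=\"")
          .append(filename).append("\"\r\n")
          .append("Content-Type: text/plain\r\n\r\n")
          .append(fileText).append("\r\n");
        if (configJson != null) {
            sb.append("--").append(boundary).append("\r\n")
              .append("Content-Disposition: form-data; name=\"config\"\r\n")
              .append("Content-Type: application/json\r\n\r\n")
              .append(configJson).append("\r\n");
        }
        // Closing delimiter: the boundary followed by two hyphens.
        sb.append("--").append(boundary).append("--\r\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        String body = buildBody("sketch-boundary", "note.txt", "Hello, Tika", "{}");
        System.out.print(body);
    }
}
```

      The body would be sent in a POST to /tika/config with the request header Content-Type: multipart/form-data; boundary=sketch-boundary.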
    • postText

      @POST @Consumes("multipart/form-data") @Produces("text/plain") @Path("config/text") public jakarta.ws.rs.core.StreamingOutput postText(List<org.apache.cxf.jaxrs.ext.multipart.Attachment> attachments, @Context jakarta.ws.rs.core.HttpHeaders httpHeaders) throws IOException, TikaConfigException
      Parse multipart document with optional config, return plain text.

      Accepts multipart with:
      - "file" part (required): the document to parse
      - "config" part (optional): JSON configuration for parser settings

      This endpoint is gated behind enableUnsecureFeatures=true because per-request configuration could enable dangerous operations.

      Throws:
      IOException
      TikaConfigException
    • postHtml

      @POST @Consumes("multipart/form-data") @Produces("text/html") @Path("config/html") public jakarta.ws.rs.core.StreamingOutput postHtml(List<org.apache.cxf.jaxrs.ext.multipart.Attachment> attachments, @Context jakarta.ws.rs.core.HttpHeaders httpHeaders) throws IOException, TikaConfigException
      Parse multipart document with optional config, return HTML.

      Accepts multipart with:
      - "file" part (required): the document to parse
      - "config" part (optional): JSON configuration for parser settings

      This endpoint is gated behind enableUnsecureFeatures=true because per-request configuration could enable dangerous operations.

      Throws:
      IOException
      TikaConfigException
    • postXml

      @POST @Consumes("multipart/form-data") @Produces("text/xml") @Path("config/xml") public jakarta.ws.rs.core.StreamingOutput postXml(List<org.apache.cxf.jaxrs.ext.multipart.Attachment> attachments, @Context jakarta.ws.rs.core.HttpHeaders httpHeaders) throws IOException, TikaConfigException
      Parse multipart document with optional config, return XML.

      Accepts multipart with:
      - "file" part (required): the document to parse
      - "config" part (optional): JSON configuration for parser settings

      This endpoint is gated behind enableUnsecureFeatures=true because per-request configuration could enable dangerous operations.

      Throws:
      IOException
      TikaConfigException
    • postMarkdown

      @POST @Consumes("multipart/form-data") @Produces("text/plain") @Path("config/md") public jakarta.ws.rs.core.StreamingOutput postMarkdown(List<org.apache.cxf.jaxrs.ext.multipart.Attachment> attachments, @Context jakarta.ws.rs.core.HttpHeaders httpHeaders) throws IOException, TikaConfigException
      Parse multipart document with optional config, return Markdown.

      Accepts multipart with:
      - "file" part (required): the document to parse
      - "config" part (optional): JSON configuration for parser settings

      This endpoint is gated behind enableUnsecureFeatures=true because per-request configuration could enable dangerous operations.

      Throws:
      IOException
      TikaConfigException
    • postJson

      @POST @Consumes("multipart/form-data") @Produces("application/json") @Path("config/json") public Metadata postJson(List<org.apache.cxf.jaxrs.ext.multipart.Attachment> attachments, @Context jakarta.ws.rs.core.HttpHeaders httpHeaders) throws IOException, TikaConfigException
      Parse multipart document with optional config, return JSON.

      Accepts multipart with:
      - "file" part (required): the document to parse
      - "config" part (optional): JSON configuration for parser settings and handler type

      The default handler is text. Use the "config" part to specify a different handler type.

      This endpoint is gated behind enableUnsecureFeatures=true because per-request configuration could enable dangerous operations.

      Throws:
      IOException
      TikaConfigException
    • preparePostHeaderMap

      public static org.apache.cxf.jaxrs.impl.MetadataMap<String,String> preparePostHeaderMap(org.apache.cxf.jaxrs.ext.multipart.Attachment att, jakarta.ws.rs.core.HttpHeaders httpHeaders)
      Prepares a multivalued map, combining attachment headers and request headers. For multipart requests, the attachment's Content-Type takes priority over the request's Content-Type (which is multipart/form-data).
      Parameters:
      att - the attachment
      httpHeaders - the HTTP headers, fetched from context
      Returns:
      the case-insensitive MetadataMap containing combined headers
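      The merge described above can be sketched with a case-insensitive TreeMap standing in for MetadataMap. HeaderMergeSketch and merge are hypothetical names. The source only states that the attachment's Content-Type takes priority; letting attachment values win for every colliding header name, as this sketch does, is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class HeaderMergeSketch {
    /** Combines request and attachment headers; attachment values replace request values of the same name. */
    static Map<String, List<String>> merge(Map<String, List<String>> attachmentHeaders,
                                           Map<String, List<String>> requestHeaders) {
        // Case-insensitive keys, so "content-type" and "Content-Type" are the same entry.
        Map<String, List<String>> merged = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        requestHeaders.forEach((name, values) -> merged.put(name, new ArrayList<>(values)));
        // Applied last so the attachment's Content-Type overrides multipart/form-data.
        attachmentHeaders.forEach((name, values) -> merged.put(name, new ArrayList<>(values)));
        return merged;
    }

    public static void main(String[] args) {
        Map<String, List<String>> request = Map.of("Content-Type", List.of("multipart/form-data"));
        Map<String, List<String>> attachment = Map.of("content-type", List.of("application/pdf"));
        System.out.println(merge(attachment, request).get("CONTENT-TYPE"));
    }
}
```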