Class BasicContentHandlerFactory

java.lang.Object
org.apache.tika.sax.BasicContentHandlerFactory
All Implemented Interfaces:
Serializable, ContentHandlerFactory, StreamingContentHandlerFactory, WriteLimiter

public class BasicContentHandlerFactory extends Object implements StreamingContentHandlerFactory, WriteLimiter
Basic factory for creating common types of ContentHandlers.

Implements StreamingContentHandlerFactory to support both in-memory content extraction and streaming output to an OutputStream.

  • Constructor Details

    • BasicContentHandlerFactory

      public BasicContentHandlerFactory()
      No-arg constructor for bean-style configuration (e.g., Jackson deserialization). Creates a factory with the TEXT handler type, no write limit, and throwOnWriteLimitReached=true.
    • BasicContentHandlerFactory

      public BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE type, int writeLimit)
      Creates a BasicContentHandlerFactory with throwOnWriteLimitReached set to true.
      Parameters:
      type - basic type of handler
      writeLimit - max number of characters to store; if < 0, the handler will store all characters
    • BasicContentHandlerFactory

      public BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE type, int writeLimit, boolean throwOnWriteLimitReached, ParseContext parseContext)
      Parameters:
      type - basic type of handler
      writeLimit - maximum number of characters to store
      throwOnWriteLimitReached - whether or not to throw a WriteLimitReachedException when the write limit has been reached
      parseContext - the context in which to store the write-limit-reached warning if throwOnWriteLimitReached is set to false
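
      The interaction between writeLimit and throwOnWriteLimitReached described above can be illustrated with a small, JDK-only SAX handler. This is a minimal sketch under stated assumptions, not Tika's implementation: LimitedTextHandler and its feed helper are hypothetical names, a plain SAXException stands in for WriteLimitReachedException, and keeping the characters that still fit before stopping is an assumption about truncation behavior.

      ```java
      import org.xml.sax.SAXException;
      import org.xml.sax.helpers.DefaultHandler;

      /** Hypothetical stand-in (not a Tika class): collects text up to a character limit. */
      class LimitedTextHandler extends DefaultHandler {
          private final StringBuilder text = new StringBuilder();
          private final int writeLimit;                    // negative means "store everything"
          private final boolean throwOnWriteLimitReached;
          private boolean limitReached = false;

          LimitedTextHandler(int writeLimit, boolean throwOnWriteLimitReached) {
              this.writeLimit = writeLimit;
              this.throwOnWriteLimitReached = throwOnWriteLimitReached;
          }

          @Override
          public void characters(char[] ch, int start, int length) throws SAXException {
              if (limitReached) {
                  return; // already stopped; ignore further content
              }
              if (writeLimit < 0 || text.length() + length <= writeLimit) {
                  text.append(ch, start, length);
                  return;
              }
              // Assumption: keep only the characters that still fit, then stop.
              int room = writeLimit - text.length();
              text.append(ch, start, room);
              limitReached = true;
              if (throwOnWriteLimitReached) {
                  // Stand-in for Tika's WriteLimitReachedException.
                  throw new SAXException("write limit of " + writeLimit + " characters reached");
              }
          }

          boolean isLimitReached() {
              return limitReached;
          }

          @Override
          public String toString() {
              return text.toString();
          }

          /** Feeds one string to a fresh handler and reports the outcome as a string. */
          static String feed(String input, int writeLimit, boolean throwOnLimit) {
              LimitedTextHandler handler = new LimitedTextHandler(writeLimit, throwOnLimit);
              try {
                  handler.characters(input.toCharArray(), 0, input.length());
                  return handler.toString();
              } catch (SAXException e) {
                  return "EXCEPTION after: " + handler;
              }
          }
      }
      ```

      In this sketch, a negative limit stores the whole input, a limit of 5 with throwOnLimit=false silently keeps the first five characters, and the same limit with throwOnLimit=true raises the exception after those five characters have been stored.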
  • Method Details

    • newInstance

      Creates a new BasicContentHandlerFactory configured from OutputLimits in the ParseContext.

      If OutputLimits is present in the context, the factory will be configured with those limits (writeLimit, throwOnWriteLimit). Otherwise, default values are used.

      Parameters:
      type - the handler type
      context - the ParseContext (required if throwOnWriteLimit is false)
      Returns:
      a configured BasicContentHandlerFactory
    • parseHandlerType

      public static BasicContentHandlerFactory.HANDLER_TYPE parseHandlerType(String handlerTypeName, BasicContentHandlerFactory.HANDLER_TYPE defaultType)
      Tries to parse a string into a handler type. Returns the default type if the string is null or parsing fails.

      Options: xml, html, text, body, ignore (no content), markdown/md

      Parameters:
      handlerTypeName - string to parse
      defaultType - type to return if parse fails
      Returns:
      handler type
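
      The parse-with-fallback contract above can be sketched without the Tika classes. The enum and the parse method below are hypothetical stand-ins for HANDLER_TYPE and parseHandlerType; the constant names and the case-insensitive matching are assumptions, and only the option strings and the fallback rule come from the description.

      ```java
      import java.util.Locale;

      /** Hypothetical stand-in (not a Tika class) for the documented parsing rules. */
      class HandlerTypeParsing {

          // Assumed constant names; the documentation lists only the option strings.
          enum HandlerType { XML, HTML, TEXT, BODY, IGNORE, MARKDOWN }

          /** Returns the matching type, or defaultType if name is null or unrecognized. */
          static HandlerType parse(String name, HandlerType defaultType) {
              if (name == null) {
                  return defaultType;
              }
              // Assumption: matching ignores case.
              switch (name.toLowerCase(Locale.ROOT)) {
                  case "xml":
                      return HandlerType.XML;
                  case "html":
                      return HandlerType.HTML;
                  case "text":
                      return HandlerType.TEXT;
                  case "body":
                      return HandlerType.BODY;
                  case "ignore":
                      return HandlerType.IGNORE;
                  case "markdown":
                  case "md":
                      return HandlerType.MARKDOWN;
                  default:
                      return defaultType;
              }
          }
      }
      ```

      Returning a caller-supplied default rather than throwing means a misconfigured handler name degrades to a known type instead of failing the parse.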
    • createHandler

      public ContentHandler createHandler()
      Description copied from interface: ContentHandlerFactory
      Creates a new ContentHandler for extracting content.
      Specified by:
      createHandler in interface ContentHandlerFactory
      Returns:
      a new ContentHandler instance
    • createHandler

      public ContentHandler createHandler(OutputStream os, Charset charset)
      Description copied from interface: StreamingContentHandlerFactory
      Creates a new ContentHandler that writes output directly to the given OutputStream.
      Specified by:
      createHandler in interface StreamingContentHandlerFactory
      Parameters:
      os - the output stream to write to
      charset - the character encoding to use
      Returns:
      a new ContentHandler instance that writes to the stream
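
      To make the streaming variant concrete, the following JDK-only sketch shows one way a text handler could write SAX character events straight to an OutputStream in a given Charset. StreamingTextHandler and its encode helper are hypothetical names, and the handler Tika actually returns may differ in buffering and in the markup it emits for non-text types.

      ```java
      import java.io.ByteArrayOutputStream;
      import java.io.IOException;
      import java.io.OutputStream;
      import java.io.OutputStreamWriter;
      import java.io.Writer;
      import java.nio.charset.Charset;
      import org.xml.sax.SAXException;
      import org.xml.sax.helpers.DefaultHandler;

      /** Hypothetical stand-in (not a Tika class): streams character events to an OutputStream. */
      class StreamingTextHandler extends DefaultHandler {
          private final Writer writer;

          StreamingTextHandler(OutputStream os, Charset charset) {
              // The writer encodes characters with the requested charset as they arrive.
              this.writer = new OutputStreamWriter(os, charset);
          }

          @Override
          public void characters(char[] ch, int start, int length) throws SAXException {
              try {
                  writer.write(ch, start, length);
              } catch (IOException e) {
                  throw new SAXException(e);
              }
          }

          @Override
          public void endDocument() throws SAXException {
              try {
                  writer.flush(); // push any buffered bytes to the stream; the caller owns closing it
              } catch (IOException e) {
                  throw new SAXException(e);
              }
          }

          /** Runs one document through the handler and returns the encoded bytes. */
          static byte[] encode(String text, Charset charset) {
              ByteArrayOutputStream out = new ByteArrayOutputStream();
              StreamingTextHandler handler = new StreamingTextHandler(out, charset);
              try {
                  handler.startDocument();
                  handler.characters(text.toCharArray(), 0, text.length());
                  handler.endDocument();
              } catch (SAXException e) {
                  throw new IllegalStateException(e);
              }
              return out.toByteArray();
          }
      }
      ```

      Because nothing is accumulated in memory beyond the writer's buffer, this style suits large documents; the charset argument determines the byte length of the output, so the same text can produce different byte counts under UTF-8 and ISO-8859-1.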
    • getType

      Returns:
      handler type used by this factory
    • handlerTypeName

      public String handlerTypeName()
      Description copied from interface: ContentHandlerFactory
      Returns the name of the handler type produced by this factory (e.g. TEXT, MARKDOWN, HTML, XML).

      This value is written to TikaCoreProperties.TIKA_CONTENT_HANDLER_TYPE so that downstream components (such as the inference pipeline) can determine what format tika:content is in without guessing.

      Specified by:
      handlerTypeName in interface ContentHandlerFactory
      Returns:
      handler type name, never null
    • setType

      public void setType(BasicContentHandlerFactory.HANDLER_TYPE type)
      Sets the handler type.
      Parameters:
      type - the handler type
    • getWriteLimit

      public int getWriteLimit()
      Specified by:
      getWriteLimit in interface WriteLimiter
    • setWriteLimit

      public void setWriteLimit(int writeLimit)
      Sets the write limit.
      Parameters:
      writeLimit - max characters to extract; -1 for unlimited
    • isThrowOnWriteLimitReached

      public boolean isThrowOnWriteLimitReached()
      Specified by:
      isThrowOnWriteLimitReached in interface WriteLimiter
    • setThrowOnWriteLimitReached

      public void setThrowOnWriteLimitReached(boolean throwOnWriteLimitReached)
      Sets whether to throw an exception when write limit is reached.
      Parameters:
      throwOnWriteLimitReached - true to throw, false to silently stop
    • setParseContext

      public void setParseContext(ParseContext parseContext)
      Sets the parse context for storing warnings when throwOnWriteLimitReached is false.
      Parameters:
      parseContext - the parse context
    • equals

      public boolean equals(Object o)
      Overrides:
      equals in class Object
    • hashCode

      public int hashCode()
      Overrides:
      hashCode in class Object