Package org.apache.tika.example
Class PickBestTextEncodingParser
java.lang.Object
org.apache.tika.parser.AbstractParser
org.apache.tika.parser.multiple.AbstractMultipleParser
org.apache.tika.example.PickBestTextEncodingParser
- All Implemented Interfaces:
Serializable, Parser
Deprecated.
Currently not suitable for real use; this is more of a demo / prototype!
Inspired by TIKA-1443 and https://wiki.apache.org/tika/CompositeParserDiscussion,
this parser tries several different text encodings, then does the real
text parsing based on whichever is "best".
The logic for "best" needs a lot of work!
This is not recommended for actual production use. It exists mostly to
prove that the AbstractMultipleParser environment is sufficient to support this use case.
TODO: Implement proper "junk" detection.
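The "try several encodings, keep the best" idea described above can be illustrated without Tika at all. The following is only a minimal sketch under stated assumptions: the class BestCharsetSketch, its methods, and the junk heuristic (counting replacement and control characters) are hypothetical and are not taken from this class's implementation, whose own "best" logic is described above as unfinished.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch: decode the same bytes with several charsets and keep the "best". */
class BestCharsetSketch {

    /** Crude junk score (an assumption of this sketch): fraction of replacement or control characters. */
    static double junkScore(String text) {
        if (text.isEmpty()) {
            return 1.0; // treat empty output as the worst case
        }
        int junk = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean control = Character.isISOControl(c) && c != '\n' && c != '\r' && c != '\t';
            if (c == '\uFFFD' || control) {
                junk++;
            }
        }
        return (double) junk / text.length();
    }

    /** Returns the first candidate charset with the lowest junk score. */
    static Charset pickBest(byte[] data, String[] charsetNames) {
        Charset best = null;
        double bestScore = Double.MAX_VALUE;
        for (String name : charsetNames) {
            Charset candidate = Charset.forName(name);
            // Malformed input is replaced with U+FFFD rather than raising an error.
            String decoded = new String(data, candidate);
            double score = junkScore(decoded);
            if (score < bestScore) {
                bestScore = score;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        byte[] data = "Gr\u00fc\u00dfe aus M\u00fcnchen".getBytes(StandardCharsets.UTF_8);
        System.out.println(pickBest(data, new String[] {"US-ASCII", "UTF-8"}));
    }
}
```

A heuristic this simple is easy to fool: many single-byte encodings decode any byte sequence without producing replacement characters, which is one reason proper junk detection is still listed as a TODO.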
-
Nested Class Summary
Modifier and Type, Class, Description
protected class
    Deprecated.
protected class
    Deprecated.
Nested classes/interfaces inherited from class org.apache.tika.parser.multiple.AbstractMultipleParser
AbstractMultipleParser.MetadataPolicy
-
Field Summary
Fields inherited from class org.apache.tika.parser.multiple.AbstractMultipleParser
METADATA_POLICY_CONFIG_KEY
-
Constructor Summary
Constructor, Description
PickBestTextEncodingParser(MediaTypeRegistry registry, String[] charsets)
    Deprecated.
-
Method Summary
Modifier and Type, Method, Description
void parse(InputStream stream, ContentHandlerFactory handlers, Metadata metadata, ParseContext context)
    Deprecated. Processes the given Stream through one or more parsers, resetting things between parsers as requested by policy.
void parse(InputStream stream, ContentHandler handler, Metadata originalMetadata, ParseContext context)
    Deprecated. Processes the given Stream through one or more parsers, resetting things between parsers as requested by policy.
protected boolean parserCompleted(Parser parser, Metadata metadata, ContentHandler handler, ParseContext context, Exception exception)
    Deprecated. Used to notify implementations that a Parser has finished or failed, and to allow them to decide to continue or abort further parsing.
protected void parserPrepare(Parser parser, Metadata metadata, ParseContext context)
    Deprecated. Used to allow implementations to prepare or change things before parsing occurs.
Methods inherited from class org.apache.tika.parser.multiple.AbstractMultipleParser
getAllParsers, getMediaTypeRegistry, getMetadataPolicy, getMetadataPolicy, getSupportedTypes, mergeMetadata, setMediaTypeRegistry
Methods inherited from class org.apache.tika.parser.AbstractParser
parse
-
Constructor Details
-
PickBestTextEncodingParser
Deprecated.
-
-
Method Details
-
parserPrepare
protected void parserPrepare(Parser parser, Metadata metadata, ParseContext context)
Deprecated.
Description copied from class: AbstractMultipleParser
Used to allow implementations to prepare or change things before parsing occurs.
- Overrides:
parserPrepare in class AbstractMultipleParser
-
parserCompleted
protected boolean parserCompleted(Parser parser, Metadata metadata, ContentHandler handler, ParseContext context, Exception exception)
Deprecated.
Description copied from class: AbstractMultipleParser
Used to notify implementations that a Parser has finished or failed, and to allow them to decide to continue or abort further parsing.
- Specified by:
parserCompleted in class AbstractMultipleParser
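The continue-or-abort contract of a completion callback can be shown in isolation. This is a minimal sketch, not Tika code: the class CompletionCallbackSketch, the Step interface, and the policy "keep going while steps fail, stop at the first success" are all assumptions made for illustration. The real method receives Tika's Parser, Metadata, ContentHandler and ParseContext, and its meaning of the boolean return value should be checked against AbstractMultipleParser.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a "completed" callback that decides whether to keep going. */
class CompletionCallbackSketch {

    /** Stand-in for a parser (not a Tika type): returns text or throws. */
    interface Step {
        String run() throws Exception;
    }

    /** Names of the steps that were attempted, in order. */
    final List<String> attempted = new ArrayList<>();

    /** Called after each step; exception is null on success. Returns true to continue. */
    boolean stepCompleted(String name, Exception exception) {
        attempted.add(name);
        // Policy assumed for this sketch: continue only while steps keep failing.
        return exception != null;
    }

    /** Runs the steps in order, consulting the callback after each one. */
    void runAll(String[] names, Step[] steps) {
        for (int i = 0; i < steps.length; i++) {
            Exception failure = null;
            try {
                steps[i].run();
            } catch (Exception e) {
                failure = e; // remember the failure and report it to the callback
            }
            if (!stepCompleted(names[i], failure)) {
                break; // the callback asked to abort further processing
            }
        }
    }
}
```

Passing the exception to the callback, rather than rethrowing it straight away, is what lets an implementation treat one parser's failure as a reason to try the next candidate.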
-
parse
public void parse(InputStream stream, ContentHandler handler, Metadata originalMetadata, ParseContext context) throws IOException, SAXException, TikaException
Deprecated.
Description copied from class: AbstractMultipleParser
Processes the given Stream through one or more parsers, resetting things between parsers as requested by policy. The actual processing is delegated to one or more Parsers.
Note that you'll get text from every parser this way. To control which content comes from which parser, call the method with a ContentHandlerFactory instead.
- Specified by:
parse in interface Parser
- Overrides:
parse in class AbstractMultipleParser
- Parameters:
stream - the document stream (input)
handler - handler for the XHTML SAX events (output)
originalMetadata - document metadata (input and output)
context - parse context
- Throws:
IOException - if the document stream could not be read
SAXException - if the SAX events could not be processed
TikaException - if the document could not be parsed
-
parse
public void parse(InputStream stream, ContentHandlerFactory handlers, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException
Deprecated.
Description copied from class: AbstractMultipleParser
Processes the given Stream through one or more parsers, resetting things between parsers as requested by policy. The actual processing is delegated to one or more Parsers. You will get one ContentHandler fetched for each Parser used.
TODO Do we need to return all the ContentHandler instances we created?
- Overrides:
parse in class AbstractMultipleParser
- Throws:
IOException
SAXException
TikaException
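The "one handler per parser" idea can be sketched with plain Java types. Everything here is hypothetical: HandlerPerParserSketch, decodeEach, and the use of a Supplier of StringBuilder as a stand-in for ContentHandlerFactory are assumptions for illustration, and each "parser" is reduced to decoding the bytes with one charset. Unlike the method documented above, which returns void, this sketch hands every sink back to the caller, which is one possible answer to the TODO rather than what the class does.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical sketch: fetch one fresh output sink from a factory per candidate decoding. */
class HandlerPerParserSketch {

    /** Decodes the data once per charset, writing each result into its own sink. */
    static List<StringBuilder> decodeEach(byte[] data, String[] charsetNames,
                                          Supplier<StringBuilder> sinkFactory) {
        List<StringBuilder> sinks = new ArrayList<>();
        for (String name : charsetNames) {
            StringBuilder sink = sinkFactory.get(); // a new sink per pass, never shared
            sink.append(new String(data, Charset.forName(name)));
            sinks.add(sink);
        }
        return sinks; // results stay separate, so the caller can compare them
    }

    public static void main(String[] args) {
        byte[] data = "plain text".getBytes(StandardCharsets.US_ASCII);
        List<StringBuilder> sinks =
                decodeEach(data, new String[] {"US-ASCII", "UTF-16BE"}, StringBuilder::new);
        for (StringBuilder sink : sinks) {
            System.out.println(sink.length());
        }
    }
}
```

Keeping the outputs separate is what makes a later "pick the best" step possible; with a single shared handler the text from every pass would be interleaved.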
-