java.lang.Object > org.apache.tika.parser.AbstractParser > org.apache.tika.parser.pdf.PDFParser
public class PDFParser extends AbstractParser
PDF parser.
This parser can also process encrypted PDF documents if the required password is given as part of the input metadata associated with the document. If no password is given, the parser tries to decrypt the document with the empty password, which is often used with PDFs.
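The fallback described above can be sketched in plain Java without any Tika classes. This is only an illustration: `PASSWORD_KEY`, `resolvePassword`, and the use of a `Map` and a `Function` as stand-ins for the metadata and the PasswordProvider are hypothetical, and the precedence shown (provider first, then metadata, then the empty password) is an assumption rather than something this page states.

```java
import java.util.Map;
import java.util.function.Function;

class PasswordFallbackSketch {
    // Hypothetical metadata key; a stand-in for the value of PDFParser.PASSWORD.
    static final String PASSWORD_KEY = "example.pdf.password";

    /**
     * Picks the password to try when opening a document.
     * The provider may be null; the map stands in for the document metadata.
     * Assumed order: provider, then metadata entry, then the empty password.
     */
    static String resolvePassword(Function<Map<String, String>, String> provider,
                                  Map<String, String> metadata) {
        String password = null;
        if (provider != null) {
            password = provider.apply(metadata);
        }
        if (password == null) {
            password = metadata.get(PASSWORD_KEY);
        }
        // No password supplied anywhere: fall back to the empty password.
        return password != null ? password : "";
    }

    public static void main(String[] args) {
        String chosen = resolvePassword(null, Map.of(PASSWORD_KEY, "s3cret"));
        System.out.println("Password to try: '" + chosen + "'");
    }
}
```

With no provider and no metadata entry, the sketch returns the empty string, which mirrors the empty-password attempt mentioned in the class description.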
Field Summary | |
---|---|
static String | PASSWORD: Deprecated. Supply a PasswordProvider on the ParseContext instead. |
Constructor Summary |
---|
PDFParser() |
Method Summary | |
---|---|
boolean | getEnableAutoSpace() |
boolean | getExtractAnnotationText(): If true, text in annotations will be extracted. |
boolean | getSortByPosition() |
Set<MediaType> | getSupportedTypes(ParseContext context): Returns the set of media types supported by this parser when used with the given parse context. |
boolean | getSuppressDuplicateOverlappingText() |
void | parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context): Parses a document stream into a sequence of XHTML SAX events. |
void | setEnableAutoSpace(boolean v): If true (the default), the parser should estimate where spaces should be inserted between words. |
void | setExtractAnnotationText(boolean v): If true (the default), text in annotations will be extracted. |
void | setSortByPosition(boolean v): If true, sort text tokens by their x/y position before extracting text. |
void | setSuppressDuplicateOverlappingText(boolean v): If true, the parser should try to remove duplicated text over the same region. |
Methods inherited from class org.apache.tika.parser.AbstractParser |
---|
parse |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
public static final String PASSWORD
Deprecated. Supply a PasswordProvider on the ParseContext instead.
Constructor Detail |
---|
public PDFParser()
Method Detail |
---|
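public Set<MediaType> getSupportedTypes(ParseContext context)
Returns the set of media types supported by this parser when used with the given parse context.
Specified by: getSupportedTypes in interface Parser
Parameters:
context - parse context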
public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException
Parses a document stream into a sequence of XHTML SAX events.
Specified by: parse in interface Parser
The given document stream is consumed but not closed by this method. The responsibility to close the stream remains on the caller.
Information about the parsing context can be passed in the context parameter. See the parser implementations for the kinds of context information they expect.
Parameters:
stream - the document stream (input)
handler - handler for the XHTML SAX events (output)
metadata - document metadata (input and output)
context - parse context
Throws:
IOException - if the document stream could not be read
SAXException - if the SAX events could not be processed
TikaException - if the document could not be parsed
public void setEnableAutoSpace(boolean v)
public boolean getEnableAutoSpace()
See Also: setEnableAutoSpace(boolean)
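According to the method summary, setEnableAutoSpace asks the parser to estimate where spaces should be inserted between words. One way such an estimate could work is shown below as a minimal sketch. It is not the algorithm the parser or its underlying PDF library uses; the `Glyph` type, the `GAP_FACTOR` threshold, and the method name are all hypothetical.

```java
import java.util.List;

class AutoSpaceSketch {
    /** Hypothetical glyph box: its text, left edge, and width in page units. */
    static final class Glyph {
        final String text;
        final double x;
        final double width;

        Glyph(String text, double x, double width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    // Assumed threshold: a gap wider than 30% of the previous glyph counts as a word break.
    static final double GAP_FACTOR = 0.3;

    /** Joins glyphs that are already in reading order, adding a space at each wide gap. */
    static String joinWithEstimatedSpaces(List<Glyph> line) {
        StringBuilder out = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : line) {
            if (prev != null) {
                double gap = g.x - (prev.x + prev.width);
                if (gap > GAP_FACTOR * prev.width) {
                    out.append(' ');
                }
            }
            out.append(g.text);
            prev = g;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Glyph> line = List.of(
                new Glyph("H", 0, 10), new Glyph("i", 10, 4),
                new Glyph("y", 20, 8), new Glyph("o", 28, 8));
        System.out.println(joinWithEstimatedSpaces(line));
    }
}
```

A fixed ratio like this is fragile for justified or letter-spaced text, which is presumably why the setting can be turned off.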
public void setExtractAnnotationText(boolean v)
public boolean getExtractAnnotationText()
public void setSuppressDuplicateOverlappingText(boolean v)
public boolean getSuppressDuplicateOverlappingText()
See Also: setSuppressDuplicateOverlappingText(boolean)
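The method summary describes setSuppressDuplicateOverlappingText as removing duplicated text over the same region, a pattern that can appear when a PDF draws the same string twice at nearly the same spot. The following sketch illustrates the idea only. The `Token` type, the `TOLERANCE` value, and the quadratic scan are assumptions made for illustration, not the parser's actual implementation.

```java
import java.util.ArrayList;
import java.util.List;

class DuplicateTextSketch {
    /** Hypothetical text token with the position of its origin on the page. */
    static final class Token {
        final String text;
        final double x;
        final double y;

        Token(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Assumed tolerance (page units) within which two positions count as the same region.
    static final double TOLERANCE = 1.0;

    /** Keeps the first occurrence of each text and drops later copies drawn on top of it. */
    static List<Token> suppressDuplicates(List<Token> tokens) {
        List<Token> kept = new ArrayList<>();
        for (Token t : tokens) {
            boolean duplicate = false;
            for (Token k : kept) {
                if (k.text.equals(t.text)
                        && Math.abs(k.x - t.x) <= TOLERANCE
                        && Math.abs(k.y - t.y) <= TOLERANCE) {
                    duplicate = true;
                    break;
                }
            }
            if (!duplicate) {
                kept.add(t);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Token> tokens = List.of(
                new Token("Title", 10, 10), new Token("Title", 10.4, 10.2),
                new Token("Title", 10, 50), new Token("Body", 10, 10));
        System.out.println(suppressDuplicates(tokens).size() + " tokens kept");
    }
}
```

The same word at a clearly different position is kept, so only overlapping copies are removed.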
public void setSortByPosition(boolean v)
public boolean getSortByPosition()
See Also: setSortByPosition(boolean)
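Per the method summary, setSortByPosition makes the parser sort text tokens by their x/y position before extracting text, which matters when a PDF stores text in an order that differs from the visual reading order. A minimal sketch of such a sort follows. The `Token` type and the assumption that y grows downward (top of page first) are hypothetical, and real extractors typically also have to group tokens into lines, which this sketch ignores.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class SortByPositionSketch {
    /** Hypothetical text token; y is assumed to grow from the top of the page downward. */
    static final class Token {
        final String text;
        final double x;
        final double y;

        Token(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Returns a copy ordered top to bottom, then left to right. */
    static List<Token> sortByPosition(List<Token> tokens) {
        List<Token> sorted = new ArrayList<>(tokens);
        sorted.sort(Comparator.comparingDouble((Token t) -> t.y)
                .thenComparingDouble(t -> t.x));
        return sorted;
    }

    /** Joins the token texts with single spaces. */
    static String render(List<Token> tokens) {
        StringBuilder out = new StringBuilder();
        for (Token t : tokens) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(t.text);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Stream order differs from visual order.
        List<Token> streamOrder = List.of(
                new Token("world", 60, 10), new Token("Second", 10, 30),
                new Token("Hello", 10, 10));
        System.out.println(render(sortByPosition(streamOrder)));
    }
}
```

Sorting is optional here because it can scramble multi-column layouts, where stream order is often closer to the intended reading order.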