Package org.apache.tika.parser.pdf
Class PDFMarkedContent2XHTML
java.lang.Object
  org.apache.pdfbox.contentstream.PDFStreamEngine
    org.apache.pdfbox.text.PDFTextStripper
      org.apache.tika.parser.pdf.PDFMarkedContent2XHTML
public class PDFMarkedContent2XHTML extends org.apache.pdfbox.text.PDFTextStripper
Added in Tika 1.24 as an alpha version of a text extractor. It builds the text from the marked content tree and includes or normalizes some of the structural tags.
- Since:
- 1.24
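The idea in the class description, walking a tagged tree and normalizing structural tags into XHTML, can be pictured with a toy model. This is only a sketch under stated assumptions: `MarkedContentSketch`, `Node`, and the tag mapping are hypothetical names invented for illustration, and nothing here is taken from the actual Tika implementation.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of turning a tagged structure tree into XHTML. All names here are hypothetical. */
class MarkedContentSketch {

    /** A structure element with a type such as "P" or "H1", optional text, and children. */
    static final class Node {
        final String type;
        final String text;      // may be null for pure containers
        final List<Node> kids;

        Node(String type, String text, Node... kids) {
            this.type = type;
            this.text = text;
            this.kids = Arrays.asList(kids);
        }
    }

    // Illustrative mapping only; it is not claimed to be the mapping Tika applies.
    private static final Map<String, String> TAGS = new HashMap<>();
    static {
        TAGS.put("P", "p");
        TAGS.put("H1", "h1");
        TAGS.put("L", "ul");
        TAGS.put("LI", "li");
    }

    /** Depth-first walk: each node becomes one element, unknown types fall back to div. */
    static String toXhtml(Node node) {
        String tag = TAGS.getOrDefault(node.type, "div");
        StringBuilder sb = new StringBuilder();
        sb.append('<').append(tag).append('>');
        if (node.text != null) {
            sb.append(escape(node.text));
        }
        for (Node kid : node.kids) {
            sb.append(toXhtml(kid));
        }
        sb.append("</").append(tag).append('>');
        return sb.toString();
    }

    /** Escapes the three characters that would otherwise break the markup. */
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        Node doc = new Node("Document", null,
                new Node("H1", "Title"),
                new Node("P", "a < b"));
        System.out.println(toXhtml(doc));
    }
}
```

The fallback to `div` for unmapped types is one possible normalization choice for the sketch; the real class may treat unknown tags differently.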
-
-
Field Summary
Modifier and Type Field Description
static String
XMP_DOCUMENT_CATALOG_LOCATION
static String
XMP_PAGE_LOCATION_PREFIX
-
Method Summary
Modifier and Type Method Description
protected float
computeFontHeight(org.apache.pdfbox.pdmodel.font.PDFont arg0)
protected void
endDocument(org.apache.pdfbox.pdmodel.PDDocument pdf)
protected void
endPage(org.apache.pdfbox.pdmodel.PDPage page)
int
getCurrentPageNo()
Overridden because this class also overrides PDFTextStripper.processPages(PDPageTree).
int
getStartPage()
static void
process(org.apache.pdfbox.pdmodel.PDDocument pdDocument, ContentHandler handler, ParseContext context, Metadata metadata, PDFParserConfig config)
Converts the given PDF document (and related metadata) to a stream of XHTML SAX events sent to the given content handler.
void
processPage(org.apache.pdfbox.pdmodel.PDPage page)
protected void
processPages(org.apache.pdfbox.pdmodel.PDPageTree pages)
See TIKA-2845 for why we need to override this.
void
setEndBookmark(org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem pdOutlineItem)
void
setStartBookmark(org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem pdOutlineItem)
void
setStartPage(int startPage)
protected void
showGlyph(org.apache.pdfbox.util.Matrix textRenderingMatrix, org.apache.pdfbox.pdmodel.font.PDFont font, int code, String unicode, org.apache.pdfbox.util.Vector displacement)
protected void
startDocument(org.apache.pdfbox.pdmodel.PDDocument pdf)
protected void
startPage(org.apache.pdfbox.pdmodel.PDPage page)
protected void
writeCharacters(org.apache.pdfbox.text.TextPosition text)
protected void
writeLineSeparator()
protected void
writeParagraphEnd()
protected void
writeParagraphStart()
protected void
writeString(String text)
protected void
writeWordSeparator()
-
Methods inherited from class org.apache.pdfbox.text.PDFTextStripper
endArticle, getAddMoreFormatting, getArticleEnd, getArticleStart, getAverageCharTolerance, getCharactersByArticle, getDropThreshold, getEndBookmark, getEndPage, getIndentThreshold, getLineSeparator, getListItemPatterns, getOutput, getPageEnd, getPageStart, getParagraphEnd, getParagraphStart, getSeparateByBeads, getSortByPosition, getSpacingTolerance, getStartBookmark, getSuppressDuplicateOverlappingText, getText, getWordSeparator, matchPattern, processTextPosition, setAddMoreFormatting, setArticleEnd, setArticleStart, setAverageCharTolerance, setDropThreshold, setEndPage, setIndentThreshold, setLineSeparator, setListItemPatterns, setPageEnd, setPageStart, setParagraphEnd, setParagraphStart, setShouldSeparateByBeads, setSortByPosition, setSpacingTolerance, setSuppressDuplicateOverlappingText, setWordSeparator, startArticle, startArticle, writePage, writePageEnd, writePageStart, writeParagraphSeparator, writeString, writeText
-
Methods inherited from class org.apache.pdfbox.contentstream.PDFStreamEngine
addOperator, applyTextAdjustment, beginMarkedContentSequence, beginText, decreaseLevel, endMarkedContentSequence, endText, getAppearance, getCurrentPage, getGraphicsStackSize, getGraphicsState, getInitialMatrix, getLevel, getResources, getTextLineMatrix, getTextMatrix, increaseLevel, operatorException, processAnnotation, processChildStream, processOperator, processOperator, processSoftMask, processTilingPattern, processTilingPattern, processTransparencyGroup, processType3Stream, registerOperatorProcessor, restoreGraphicsStack, restoreGraphicsState, saveGraphicsStack, saveGraphicsState, setLineDashPattern, setTextLineMatrix, setTextMatrix, showAnnotation, showFontGlyph, showFontGlyph, showForm, showGlyph, showText, showTextString, showTextStrings, showTransparencyGroup, showType3Glyph, showType3Glyph, transformedPoint, transformWidth, unsupportedOperator
-
-
-
-
Field Detail
-
XMP_DOCUMENT_CATALOG_LOCATION
public static final String XMP_DOCUMENT_CATALOG_LOCATION
- See Also:
- Constant Field Values
-
XMP_PAGE_LOCATION_PREFIX
public static final String XMP_PAGE_LOCATION_PREFIX
- See Also:
- Constant Field Values
-
-
Method Detail
-
process
public static void process(org.apache.pdfbox.pdmodel.PDDocument pdDocument, ContentHandler handler, ParseContext context, Metadata metadata, PDFParserConfig config) throws SAXException, TikaException
Converts the given PDF document (and related metadata) to a stream of XHTML SAX events sent to the given content handler.
- Parameters:
pdDocument - PDF document
handler - SAX content handler
metadata - PDF metadata
- Throws:
SAXException - if the content handler fails to process SAX events
TikaException - if there was an exception outside of per page processing
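As a rough illustration of consuming the SAX event stream that process(...) sends to its content handler, the sketch below defines a handler that collects paragraph text using only the JDK's SAX classes. `ParagraphCollector`, `fireSampleEvents`, and `demo` are hypothetical names introduced for this example. The events are fired by hand as a stand-in for a real parse, and it is an assumption that the real stream contains `p` elements.

```java
import java.util.ArrayList;
import java.util.List;

import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: collects the text of each p element from XHTML SAX events. */
class ParagraphCollector extends DefaultHandler {
    private static final String XHTML = "http://www.w3.org/1999/xhtml";

    private final List<String> paragraphs = new ArrayList<>();
    private StringBuilder current; // non-null only while inside a p element

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("p".equals(localName)) {
            current = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("p".equals(localName) && current != null) {
            paragraphs.add(current.toString());
            current = null;
        }
    }

    List<String> paragraphs() {
        return paragraphs;
    }

    /** Stand-in for a real parse: fires a small, hand-written XHTML event stream. */
    static void fireSampleEvents(ContentHandler handler) throws SAXException {
        Attributes none = new AttributesImpl();
        handler.startDocument();
        handler.startElement(XHTML, "html", "html", none);
        handler.startElement(XHTML, "body", "body", none);
        for (String text : new String[] {"First paragraph", "Second paragraph"}) {
            handler.startElement(XHTML, "p", "p", none);
            handler.characters(text.toCharArray(), 0, text.length());
            handler.endElement(XHTML, "p", "p");
        }
        handler.endElement(XHTML, "body", "body");
        handler.endElement(XHTML, "html", "html");
        handler.endDocument();
    }

    /** Runs the sample stream through a collector and returns the collected paragraphs. */
    static List<String> demo() {
        ParagraphCollector collector = new ParagraphCollector();
        try {
            // With Tika and PDFBox on the classpath, this line would instead be:
            // PDFMarkedContent2XHTML.process(pdDocument, collector, context, metadata, config);
            fireSampleEvents(collector);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return collector.paragraphs();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Because the handler depends only on the `ContentHandler` contract, it can be checked against hand-fired events before being wired to a real document.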
-
processPages
protected void processPages(org.apache.pdfbox.pdmodel.PDPageTree pages) throws IOException
See TIKA-2845 for why we need to override this.
- Throws:
IOException
-
processPage
public void processPage(org.apache.pdfbox.pdmodel.PDPage page) throws IOException
- Overrides:
processPage
in class org.apache.pdfbox.text.PDFTextStripper
- Throws:
IOException
-
endPage
protected void endPage(org.apache.pdfbox.pdmodel.PDPage page) throws IOException
- Throws:
IOException
-
writeParagraphStart
protected void writeParagraphStart() throws IOException
- Overrides:
writeParagraphStart
in class org.apache.pdfbox.text.PDFTextStripper
- Throws:
IOException
-
writeParagraphEnd
protected void writeParagraphEnd() throws IOException
- Overrides:
writeParagraphEnd
in class org.apache.pdfbox.text.PDFTextStripper
- Throws:
IOException
-
writeString
protected void writeString(String text) throws IOException
- Overrides:
writeString
in class org.apache.pdfbox.text.PDFTextStripper
- Throws:
IOException
-
writeCharacters
protected void writeCharacters(org.apache.pdfbox.text.TextPosition text) throws IOException
- Overrides:
writeCharacters
in class org.apache.pdfbox.text.PDFTextStripper
- Throws:
IOException
-
writeWordSeparator
protected void writeWordSeparator() throws IOException
- Overrides:
writeWordSeparator
in class org.apache.pdfbox.text.PDFTextStripper
- Throws:
IOException
-
writeLineSeparator
protected void writeLineSeparator() throws IOException
- Overrides:
writeLineSeparator
in class org.apache.pdfbox.text.PDFTextStripper
- Throws:
IOException
-
startPage
protected void startPage(org.apache.pdfbox.pdmodel.PDPage page) throws IOException
- Overrides:
startPage
in class org.apache.pdfbox.text.PDFTextStripper
- Throws:
IOException
-
startDocument
protected void startDocument(org.apache.pdfbox.pdmodel.PDDocument pdf) throws IOException
- Overrides:
startDocument
in class org.apache.pdfbox.text.PDFTextStripper
- Throws:
IOException
-
endDocument
protected void endDocument(org.apache.pdfbox.pdmodel.PDDocument pdf) throws IOException
- Overrides:
endDocument
in class org.apache.pdfbox.text.PDFTextStripper
- Throws:
IOException
-
getCurrentPageNo
public int getCurrentPageNo()
Overridden because this class also overrides PDFTextStripper.processPages(PDPageTree).
- Overrides:
getCurrentPageNo
in class org.apache.pdfbox.text.PDFTextStripper
- Returns:
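Why overriding processPages also calls for overriding getCurrentPageNo can be shown with a small stand-alone model. This is a sketch, assuming the base class keeps its page counter in a private field that is only advanced inside its own processPages. `BaseStripper`, `NaiveStripper`, `CustomStripper`, and `PageCounterDemo` are hypothetical stand-ins, not PDFBox or Tika classes, and pages are modeled as plain strings.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical base class: the counter is private and only advanced in processPages. */
class BaseStripper {
    private int currentPageNo = 0;

    protected void processPages(List<String> pages) {
        for (String page : pages) {
            currentPageNo++;
            processPage(page);
        }
    }

    protected void processPage(String page) {
        // Per-page work would go here.
    }

    public int getCurrentPageNo() {
        return currentPageNo;
    }
}

/** Replaces the page loop but forgets the getter, so the reported page number stays at 0. */
class NaiveStripper extends BaseStripper {
    @Override
    protected void processPages(List<String> pages) {
        for (String page : pages) {
            processPage(page);
        }
    }
}

/** Replaces the page loop and keeps its own counter, exposed through the overridden getter. */
class CustomStripper extends BaseStripper {
    private int pageNo = 0;

    @Override
    protected void processPages(List<String> pages) {
        for (String page : pages) {
            pageNo++;
            processPage(page);
        }
    }

    @Override
    public int getCurrentPageNo() {
        return pageNo;
    }
}

class PageCounterDemo {
    /** Processes three dummy pages and returns the page number reported afterwards. */
    static int run(BaseStripper stripper) {
        stripper.processPages(Arrays.asList("page 1", "page 2", "page 3"));
        return stripper.getCurrentPageNo();
    }

    public static void main(String[] args) {
        System.out.println("base:   " + run(new BaseStripper()));
        System.out.println("naive:  " + run(new NaiveStripper()));
        System.out.println("custom: " + run(new CustomStripper()));
    }
}
```

In this model the subclass cannot reach the private counter of its parent, so a replacement page loop has to track the page number itself and report it through its own getter.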
-
setStartBookmark
public void setStartBookmark(org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem pdOutlineItem)
- Overrides:
setStartBookmark
in class org.apache.pdfbox.text.PDFTextStripper
-
setEndBookmark
public void setEndBookmark(org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem pdOutlineItem)
- Overrides:
setEndBookmark
in class org.apache.pdfbox.text.PDFTextStripper
-
getStartPage
public int getStartPage()
- Overrides:
getStartPage
in class org.apache.pdfbox.text.PDFTextStripper
-
setStartPage
public void setStartPage(int startPage)
- Overrides:
setStartPage
in class org.apache.pdfbox.text.PDFTextStripper
-
showGlyph
protected void showGlyph(org.apache.pdfbox.util.Matrix textRenderingMatrix, org.apache.pdfbox.pdmodel.font.PDFont font, int code, String unicode, org.apache.pdfbox.util.Vector displacement) throws IOException
- Throws:
IOException
-
computeFontHeight
protected float computeFontHeight(org.apache.pdfbox.pdmodel.font.PDFont arg0) throws IOException
- Throws:
IOException
-
-