Package org.apache.tika.parser.pdf
Class PDFMarkedContent2XHTML
- java.lang.Object
-
- org.apache.pdfbox.contentstream.PDFStreamEngine
-
- org.apache.pdfbox.text.PDFTextStripper
-
- org.apache.tika.parser.pdf.PDFMarkedContent2XHTML
-
public class PDFMarkedContent2XHTML extends org.apache.pdfbox.text.PDFTextStripper
This was added in Tika 1.24 as an alpha version of a text extractor that builds the text from the marked text tree and includes/normalizes some of the structural tags.
- Since:
- 1.24
-
-
Field Summary
Fields (Modifier and Type, Field):
- static String XMP_DOCUMENT_CATALOG_LOCATION
- static String XMP_PAGE_LOCATION_PREFIX
-
Method Summary
Methods (Modifier and Type, Method, Description):
- protected float computeFontHeight(org.apache.pdfbox.pdmodel.font.PDFont arg0)
- protected void endDocument(org.apache.pdfbox.pdmodel.PDDocument pdf)
- protected void endPage(org.apache.pdfbox.pdmodel.PDPage page)
- int getCurrentPageNo()
  We need to override this because we are overriding PDFTextStripper.processPages(PDPageTree).
- static void process(org.apache.pdfbox.pdmodel.PDDocument pdDocument, ContentHandler handler, ParseContext context, Metadata metadata, PDFParserConfig config)
  Converts the given PDF document (and related metadata) to a stream of XHTML SAX events sent to the given content handler.
- void processPage(org.apache.pdfbox.pdmodel.PDPage page)
- protected void processPages(org.apache.pdfbox.pdmodel.PDPageTree pageTree)
  See TIKA-2845 for why we need to override this.
- void setEndBookmark(org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem pdOutlineItem)
- void setStartBookmark(org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem pdOutlineItem)
- protected void showGlyph(org.apache.pdfbox.util.Matrix textRenderingMatrix, org.apache.pdfbox.pdmodel.font.PDFont font, int code, org.apache.pdfbox.util.Vector displacement)
- protected void startDocument(org.apache.pdfbox.pdmodel.PDDocument pdf)
- protected void startPage(org.apache.pdfbox.pdmodel.PDPage page)
- protected void writeCharacters(org.apache.pdfbox.text.TextPosition text)
- protected void writeLineSeparator()
- protected void writeParagraphEnd()
- protected void writeParagraphStart()
- protected void writeString(String text)
- protected void writeWordSeparator()
-
Methods inherited from class org.apache.pdfbox.text.PDFTextStripper
beginMarkedContentSequence, endArticle, endMarkedContentSequence, getAddMoreFormatting, getArticleEnd, getArticleStart, getAverageCharTolerance, getCharactersByArticle, getDropThreshold, getEndBookmark, getEndPage, getIgnoreContentStreamSpaceGlyphs, getIndentThreshold, getLineSeparator, getListItemPatterns, getOutput, getPageEnd, getPageStart, getParagraphEnd, getParagraphStart, getSeparateByBeads, getSortByPosition, getSpacingTolerance, getStartBookmark, getStartPage, getSuppressDuplicateOverlappingText, getText, getWordSeparator, matchPattern, processTextPosition, setAddMoreFormatting, setArticleEnd, setArticleStart, setAverageCharTolerance, setDropThreshold, setEndPage, setIgnoreContentStreamSpaceGlyphs, setIndentThreshold, setLineSeparator, setListItemPatterns, setPageEnd, setPageStart, setParagraphEnd, setParagraphStart, setShouldSeparateByBeads, setSortByPosition, setSpacingTolerance, setStartPage, setSuppressDuplicateOverlappingText, setWordSeparator, startArticle, startArticle, writePage, writePageEnd, writePageStart, writeParagraphSeparator, writeString, writeText
-
Methods inherited from class org.apache.pdfbox.contentstream.PDFStreamEngine
addOperator, applyTextAdjustment, beginText, decreaseLevel, endText, getAppearance, getCurrentPage, getGraphicsStackSize, getGraphicsState, getInitialMatrix, getLevel, getResources, getTextLineMatrix, getTextMatrix, increaseLevel, isShouldProcessColorOperators, markedContentPoint, operatorException, processAnnotation, processChildStream, processOperator, processOperator, processSoftMask, processTilingPattern, processTilingPattern, processTransparencyGroup, processType3Stream, restoreGraphicsStack, restoreGraphicsState, saveGraphicsStack, saveGraphicsState, setLineDashPattern, setTextLineMatrix, setTextMatrix, showAnnotation, showFontGlyph, showForm, showText, showTextString, showTextStrings, showTransparencyGroup, showType3Glyph, transformedPoint, transformWidth, unsupportedOperator
-
-
-
-
Field Detail
-
XMP_DOCUMENT_CATALOG_LOCATION
public static final String XMP_DOCUMENT_CATALOG_LOCATION
- See Also:
- Constant Field Values
-
XMP_PAGE_LOCATION_PREFIX
public static final String XMP_PAGE_LOCATION_PREFIX
- See Also:
- Constant Field Values
-
-
Method Detail
-
process
public static void process(org.apache.pdfbox.pdmodel.PDDocument pdDocument, ContentHandler handler, ParseContext context, Metadata metadata, PDFParserConfig config) throws SAXException, TikaException
Converts the given PDF document (and related metadata) to a stream of XHTML SAX events sent to the given content handler.
- Parameters:
pdDocument - PDF document
handler - SAX content handler
context -
metadata - PDF metadata
config -
- Throws:
SAXException - if the content handler fails to process SAX events
TikaException - if there was an exception outside of per-page processing
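Since process sends its output as XHTML SAX events, the caller decides what to do with them by supplying the ContentHandler. The sketch below is a minimal example of such a handler built only on the JDK's org.xml.sax classes. XhtmlEventCollector and its demo method are hypothetical names introduced for illustration, and the events in demo are fired by hand to stand in for what an extractor might emit. The actual event sequence produced by PDFMarkedContent2XHTML is not specified on this page.

```java
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Tika: counts <p> elements and gathers their text.
class XhtmlEventCollector extends DefaultHandler {
    private static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    private final StringBuilder text = new StringBuilder();
    private int paragraphs = 0;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("p".equals(localName)) {
            paragraphs++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Terminate each paragraph with a newline so paragraphs stay separated.
        if ("p".equals(localName)) {
            text.append('\n');
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    String text() {
        return text.toString();
    }

    int paragraphs() {
        return paragraphs;
    }

    // Fires a small, hand-written event sequence at a fresh collector.
    // With Tika and PDFBox on the classpath, the events would instead come from:
    //   PDFMarkedContent2XHTML.process(pdDocument, collector, context, metadata, config);
    static XhtmlEventCollector demo() throws SAXException {
        XhtmlEventCollector collector = new XhtmlEventCollector();
        collector.startDocument();
        for (String paragraph : new String[] {"Hello", "World"}) {
            collector.startElement(XHTML_NS, "p", "p", new AttributesImpl());
            char[] chars = paragraph.toCharArray();
            collector.characters(chars, 0, chars.length);
            collector.endElement(XHTML_NS, "p", "p");
        }
        collector.endDocument();
        return collector;
    }

    public static void main(String[] args) throws SAXException {
        XhtmlEventCollector collector = demo();
        System.out.println(collector.paragraphs() + " paragraph(s)");
        System.out.print(collector.text());
    }
}
```

Extending DefaultHandler keeps the sketch short because only the three callbacks of interest need overriding. A handler that has to preserve the structural tags would track the element stack rather than discard it.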
-
processPages
protected void processPages(org.apache.pdfbox.pdmodel.PDPageTree pageTree) throws IOException
See TIKA-2845 for why we need to override this.
- Throws:
IOException
-
processPage
public void processPage(org.apache.pdfbox.pdmodel.PDPage page) throws IOException
- Overrides:
processPage in class org.apache.pdfbox.text.PDFTextStripper
- Throws:
IOException
-
endPage
protected void endPage(org.apache.pdfbox.pdmodel.PDPage page) throws IOException
- Throws:
IOException
-
writeParagraphStart
protected void writeParagraphStart() throws IOException
- Overrides:
writeParagraphStart in class org.apache.pdfbox.text.PDFTextStripper
- Throws:
IOException
-
writeParagraphEnd
protected void writeParagraphEnd() throws IOException
- Overrides:
writeParagraphEnd in class org.apache.pdfbox.text.PDFTextStripper
- Throws:
IOException
-
writeString
protected void writeString(String text) throws IOException
- Overrides:
writeString in class org.apache.pdfbox.text.PDFTextStripper
- Throws:
IOException
-
writeCharacters
protected void writeCharacters(org.apache.pdfbox.text.TextPosition text) throws IOException
- Overrides:
writeCharacters in class org.apache.pdfbox.text.PDFTextStripper
- Throws:
IOException
-
writeWordSeparator
protected void writeWordSeparator() throws IOException
- Overrides:
writeWordSeparator in class org.apache.pdfbox.text.PDFTextStripper
- Throws:
IOException
-
writeLineSeparator
protected void writeLineSeparator() throws IOException
- Overrides:
writeLineSeparator in class org.apache.pdfbox.text.PDFTextStripper
- Throws:
IOException
-
startPage
protected void startPage(org.apache.pdfbox.pdmodel.PDPage page) throws IOException
- Overrides:
startPage in class org.apache.pdfbox.text.PDFTextStripper
- Throws:
IOException
-
startDocument
protected void startDocument(org.apache.pdfbox.pdmodel.PDDocument pdf) throws IOException
- Overrides:
startDocument in class org.apache.pdfbox.text.PDFTextStripper
- Throws:
IOException
-
endDocument
protected void endDocument(org.apache.pdfbox.pdmodel.PDDocument pdf) throws IOException
- Overrides:
endDocument in class org.apache.pdfbox.text.PDFTextStripper
- Throws:
IOException
-
getCurrentPageNo
public int getCurrentPageNo()
We need to override this because we are overriding PDFTextStripper.processPages(PDPageTree).
- Overrides:
getCurrentPageNo in class org.apache.pdfbox.text.PDFTextStripper
- Returns:
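The dependency between processPages and getCurrentPageNo can be illustrated with a small, self-contained sketch. It assumes the base class keeps its page counter in a private field that only its own processPages loop advances. That assumption is about the general pattern, not a statement of the PDFBox source. BaseStripper, ForgetfulStripper, FixedStripper and PageCounterDemo are hypothetical stand-ins, and pages are modelled as plain strings.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical demo; the classes below are simplified stand-ins, not PDFBox or Tika code.
class PageCounterDemo {
    // Runs the stripper over three fake pages and returns the page number it reports afterwards.
    static int pageNoAfterRun(BaseStripper stripper) {
        stripper.processPages(Arrays.asList("page A", "page B", "page C"));
        return stripper.getCurrentPageNo();
    }

    public static void main(String[] args) {
        System.out.println("base:      " + pageNoAfterRun(new BaseStripper()));
        System.out.println("forgetful: " + pageNoAfterRun(new ForgetfulStripper()));
        System.out.println("fixed:     " + pageNoAfterRun(new FixedStripper()));
    }
}

// Stand-in for a text-stripper base class whose counter is private (assumption).
class BaseStripper {
    private int currentPageNo = 0;

    protected void processPages(List<String> pages) {
        for (String page : pages) {
            currentPageNo++; // only this loop ever advances the private counter
            processPage(page);
        }
    }

    protected void processPage(String page) {
        // Page content handling is irrelevant to this sketch.
    }

    public int getCurrentPageNo() {
        return currentPageNo;
    }
}

// Replaces the page loop but keeps the inherited getter: the reported number stays at 0.
class ForgetfulStripper extends BaseStripper {
    @Override
    protected void processPages(List<String> pages) {
        for (String page : pages) {
            processPage(page);
        }
    }
}

// Replaces the page loop and the getter together, keeping its own counter.
class FixedStripper extends BaseStripper {
    private int pageNo = 0;

    @Override
    protected void processPages(List<String> pages) {
        for (String page : pages) {
            pageNo++;
            processPage(page);
        }
    }

    @Override
    public int getCurrentPageNo() {
        return pageNo;
    }
}
```

In this sketch the base and fixed variants report 3 after the run, while the forgetful variant reports 0, which is the kind of mismatch that overriding both methods together avoids.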
-
setStartBookmark
public void setStartBookmark(org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem pdOutlineItem)
- Overrides:
setStartBookmark in class org.apache.pdfbox.text.PDFTextStripper
-
setEndBookmark
public void setEndBookmark(org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem pdOutlineItem)
- Overrides:
setEndBookmark in class org.apache.pdfbox.text.PDFTextStripper
-
showGlyph
protected void showGlyph(org.apache.pdfbox.util.Matrix textRenderingMatrix, org.apache.pdfbox.pdmodel.font.PDFont font, int code, org.apache.pdfbox.util.Vector displacement) throws IOException
- Throws:
IOException
-
computeFontHeight
protected float computeFontHeight(org.apache.pdfbox.pdmodel.font.PDFont arg0) throws IOException
- Throws:
IOException
-
-