Package org.apache.tika.parser.pdf
Class PDFMarkedContent2XHTML
- java.lang.Object
  - org.apache.pdfbox.contentstream.PDFStreamEngine
    - org.apache.pdfbox.text.PDFTextStripper
      - org.apache.tika.parser.pdf.PDFMarkedContent2XHTML

public class PDFMarkedContent2XHTML extends org.apache.pdfbox.text.PDFTextStripper

This was added in Tika 1.24 as an alpha version of a text extractor that builds the text from the marked text tree and includes/normalizes some of the structural tags.
- Since:
- 1.24
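
To make "normalizes some of the structural tags" more concrete, here is a minimal sketch of what such a normalization step could look like. It is not taken from the Tika source: the class name StructureTagNormalizer, the mapping table, and the "div" fallback are all assumptions made for illustration. Only the structure type names (P, H1, L, LI, Table, TR, TD) come from the PDF standard structure types.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: maps PDF structure types to XHTML element names. */
class StructureTagNormalizer {
    // Assumed mapping for illustration only; the real class may differ.
    private static final Map<String, String> TAGS = new LinkedHashMap<>();
    static {
        TAGS.put("P", "p");
        TAGS.put("H1", "h1");
        TAGS.put("L", "ul");
        TAGS.put("LI", "li");
        TAGS.put("Table", "table");
        TAGS.put("TR", "tr");
        TAGS.put("TD", "td");
    }

    /** Returns the XHTML element for a structure type, or "div" when unmapped. */
    static String normalize(String structureType) {
        return TAGS.getOrDefault(structureType, "div");
    }

    public static void main(String[] args) {
        for (String type : new String[] {"H1", "P", "LI", "Sect"}) {
            System.out.println(type + " -> " + normalize(type));
        }
    }
}
```

Falling back to a generic container for unmapped types keeps the output well-formed even when a document uses custom or unexpected structure types.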
 
Field Summary
- static String XMP_DOCUMENT_CATALOG_LOCATION
- static String XMP_PAGE_LOCATION_PREFIX

Method Summary
- protected float computeFontHeight(org.apache.pdfbox.pdmodel.font.PDFont arg0)
- protected void endDocument(org.apache.pdfbox.pdmodel.PDDocument pdf)
- protected void endPage(org.apache.pdfbox.pdmodel.PDPage page)
- int getCurrentPageNo()
  We need to override this because we are overriding PDFTextStripper.processPages(PDPageTree).
- int getStartPage()
- static void process(org.apache.pdfbox.pdmodel.PDDocument pdDocument, ContentHandler handler, ParseContext context, Metadata metadata, PDFParserConfig config)
  Converts the given PDF document (and related metadata) to a stream of XHTML SAX events sent to the given content handler.
- void processPage(org.apache.pdfbox.pdmodel.PDPage page)
- protected void processPages(org.apache.pdfbox.pdmodel.PDPageTree pages)
  See TIKA-2845 for why we need to override this.
- void setEndBookmark(org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem pdOutlineItem)
- void setStartBookmark(org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem pdOutlineItem)
- void setStartPage(int startPage)
- protected void showGlyph(org.apache.pdfbox.util.Matrix arg0, org.apache.pdfbox.pdmodel.font.PDFont arg1, int arg2, String arg3, org.apache.pdfbox.util.Vector arg4)
- protected void showGlyph(org.apache.pdfbox.util.Matrix textRenderingMatrix, org.apache.pdfbox.pdmodel.font.PDFont font, int code, org.apache.pdfbox.util.Vector displacement)
- protected void startDocument(org.apache.pdfbox.pdmodel.PDDocument pdf)
- protected void startPage(org.apache.pdfbox.pdmodel.PDPage page)
- protected void writeCharacters(org.apache.pdfbox.text.TextPosition text)
- protected void writeLineSeparator()
- protected void writeParagraphEnd()
- protected void writeParagraphStart()
- protected void writeString(String text)
- protected void writeWordSeparator()
Methods inherited from class org.apache.pdfbox.text.PDFTextStripper
endArticle, getAddMoreFormatting, getArticleEnd, getArticleStart, getAverageCharTolerance, getCharactersByArticle, getDropThreshold, getEndBookmark, getEndPage, getIndentThreshold, getLineSeparator, getListItemPatterns, getOutput, getPageEnd, getPageStart, getParagraphEnd, getParagraphStart, getSeparateByBeads, getSortByPosition, getSpacingTolerance, getStartBookmark, getSuppressDuplicateOverlappingText, getText, getWordSeparator, matchPattern, processTextPosition, setAddMoreFormatting, setArticleEnd, setArticleStart, setAverageCharTolerance, setDropThreshold, setEndPage, setIndentThreshold, setLineSeparator, setListItemPatterns, setPageEnd, setPageStart, setParagraphEnd, setParagraphStart, setShouldSeparateByBeads, setSortByPosition, setSpacingTolerance, setSuppressDuplicateOverlappingText, setWordSeparator, startArticle, startArticle, writePage, writePageEnd, writePageStart, writeParagraphSeparator, writeString, writeText

Methods inherited from class org.apache.pdfbox.contentstream.PDFStreamEngine
addOperator, applyTextAdjustment, beginMarkedContentSequence, beginText, decreaseLevel, endMarkedContentSequence, endText, getAppearance, getCurrentPage, getGraphicsStackSize, getGraphicsState, getInitialMatrix, getLevel, getResources, getTextLineMatrix, getTextMatrix, increaseLevel, operatorException, processAnnotation, processChildStream, processOperator, processOperator, processSoftMask, processTilingPattern, processTilingPattern, processTransparencyGroup, processType3Stream, registerOperatorProcessor, restoreGraphicsStack, restoreGraphicsState, saveGraphicsStack, saveGraphicsState, setLineDashPattern, setTextLineMatrix, setTextMatrix, showAnnotation, showFontGlyph, showFontGlyph, showForm, showText, showTextString, showTextStrings, showTransparencyGroup, showType3Glyph, showType3Glyph, transformedPoint, transformWidth, unsupportedOperator
 
Field Detail

XMP_DOCUMENT_CATALOG_LOCATION
public static final String XMP_DOCUMENT_CATALOG_LOCATION
- See Also:
- Constant Field Values
 
XMP_PAGE_LOCATION_PREFIX
public static final String XMP_PAGE_LOCATION_PREFIX
- See Also:
- Constant Field Values
 
 
Method Detail

process
public static void process(org.apache.pdfbox.pdmodel.PDDocument pdDocument, ContentHandler handler, ParseContext context, Metadata metadata, PDFParserConfig config) throws SAXException, TikaException
Converts the given PDF document (and related metadata) to a stream of XHTML SAX events sent to the given content handler.
- Parameters:
- pdDocument - PDF document
- handler - SAX content handler
- context -
- metadata - PDF metadata
- config -
- Throws:
- SAXException - if the content handler fails to process SAX events
- TikaException - if there was an exception outside of per page processing
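
Because process sends its output to a SAX ContentHandler instead of returning a string, the caller supplies the handler that decides what to keep. The sketch below illustrates the receiving side using only JDK classes. ParagraphCollector is a hypothetical handler, and the event stream is replayed by hand because a real call needs Tika and PDFBox on the classpath. The use of p elements in the XHTML namespace is an assumption about the emitted events, not something this page states.

```java
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: collects character data, one line per p element. */
class ParagraphCollector extends DefaultHandler {
    private static final String XHTML = "http://www.w3.org/1999/xhtml";
    private final StringBuilder out = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        out.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("p".equals(localName)) {
            out.append('\n'); // end each paragraph with a newline
        }
    }

    String text() {
        return out.toString();
    }

    /** Replays a tiny hand-written event stream in place of a real PDF parse. */
    static String demo() {
        ParagraphCollector handler = new ParagraphCollector();
        try {
            // With Tika and PDFBox available, the simulated events below would be
            // replaced by the call documented above:
            // PDFMarkedContent2XHTML.process(pdDocument, handler, context, metadata, config);
            Attributes none = new AttributesImpl();
            handler.startDocument();
            for (String para : new String[] {"First paragraph", "Second paragraph"}) {
                handler.startElement(XHTML, "p", "p", none);
                handler.characters(para.toCharArray(), 0, para.length());
                handler.endElement(XHTML, "p", "p");
            }
            handler.endDocument();
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return handler.text();
    }

    public static void main(String[] args) {
        System.out.print(demo());
    }
}
```

A handler like this can be swapped for any other ContentHandler, for example one that serializes the events back to markup, without changing the call to process.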
 
processPages
protected void processPages(org.apache.pdfbox.pdmodel.PDPageTree pages) throws IOException
See TIKA-2845 for why we need to override this.
- Throws:
- IOException
 
processPage
public void processPage(org.apache.pdfbox.pdmodel.PDPage page) throws IOException
- Overrides:
- processPage in class org.apache.pdfbox.text.PDFTextStripper
- Throws:
- IOException
 
endPage
protected void endPage(org.apache.pdfbox.pdmodel.PDPage page) throws IOException
- Throws:
- IOException
 
writeParagraphStart
protected void writeParagraphStart() throws IOException
- Overrides:
- writeParagraphStart in class org.apache.pdfbox.text.PDFTextStripper
- Throws:
- IOException
 
writeParagraphEnd
protected void writeParagraphEnd() throws IOException
- Overrides:
- writeParagraphEnd in class org.apache.pdfbox.text.PDFTextStripper
- Throws:
- IOException
 
writeString
protected void writeString(String text) throws IOException
- Overrides:
- writeString in class org.apache.pdfbox.text.PDFTextStripper
- Throws:
- IOException
 
writeCharacters
protected void writeCharacters(org.apache.pdfbox.text.TextPosition text) throws IOException
- Overrides:
- writeCharacters in class org.apache.pdfbox.text.PDFTextStripper
- Throws:
- IOException
 
writeWordSeparator
protected void writeWordSeparator() throws IOException
- Overrides:
- writeWordSeparator in class org.apache.pdfbox.text.PDFTextStripper
- Throws:
- IOException
 
writeLineSeparator
protected void writeLineSeparator() throws IOException
- Overrides:
- writeLineSeparator in class org.apache.pdfbox.text.PDFTextStripper
- Throws:
- IOException
 
startPage
protected void startPage(org.apache.pdfbox.pdmodel.PDPage page) throws IOException
- Overrides:
- startPage in class org.apache.pdfbox.text.PDFTextStripper
- Throws:
- IOException
 
startDocument
protected void startDocument(org.apache.pdfbox.pdmodel.PDDocument pdf) throws IOException
- Overrides:
- startDocument in class org.apache.pdfbox.text.PDFTextStripper
- Throws:
- IOException
 
endDocument
protected void endDocument(org.apache.pdfbox.pdmodel.PDDocument pdf) throws IOException
- Overrides:
- endDocument in class org.apache.pdfbox.text.PDFTextStripper
- Throws:
- IOException
 
getCurrentPageNo
public int getCurrentPageNo()
We need to override this because we are overriding PDFTextStripper.processPages(PDPageTree).
- Overrides:
- getCurrentPageNo in class org.apache.pdfbox.text.PDFTextStripper
- Returns:
- the current page number
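
The reason one override forces the other can be shown with a small, JDK-only sketch. It assumes the base class keeps its page counter private and advances it only inside its own page loop, so a subclass that replaces the loop has to maintain and expose its own counter. BaseStripper and CustomStripper are hypothetical stand-ins, not the PDFBox or Tika classes, and pages are modelled as plain strings.

```java
import java.util.Arrays;
import java.util.List;

class PageCounterDemo {
    /** Hypothetical base class: the counter is private and only its loop updates it. */
    static class BaseStripper {
        private int currentPageNo = 0;

        protected void processPages(List<String> pages) {
            for (String page : pages) {
                currentPageNo++;
                processPage(page);
            }
        }

        protected void processPage(String page) { }

        public int getCurrentPageNo() {
            return currentPageNo;
        }
    }

    /** Replaces the page loop, so it must track and report the page number itself. */
    static class CustomStripper extends BaseStripper {
        private int pageNo = 0;
        final StringBuilder log = new StringBuilder();

        @Override
        protected void processPages(List<String> pages) {
            for (String page : pages) {
                pageNo++; // the base counter is never advanced by this loop
                processPage(page);
            }
        }

        @Override
        protected void processPage(String page) {
            log.append(getCurrentPageNo()).append(':').append(page).append(' ');
        }

        @Override
        public int getCurrentPageNo() {
            return pageNo;
        }
    }

    /** Runs three dummy pages through the custom loop and returns the log. */
    static String run() {
        CustomStripper stripper = new CustomStripper();
        stripper.processPages(Arrays.asList("a", "b", "c"));
        return stripper.log.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints "1:a 2:b 3:c"
    }
}
```

Without the getCurrentPageNo override in CustomStripper, every entry in the log would report page 0, because the inherited getter reads a counter that the replaced loop no longer touches.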
 
setStartBookmark
public void setStartBookmark(org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem pdOutlineItem)
- Overrides:
- setStartBookmark in class org.apache.pdfbox.text.PDFTextStripper
 
setEndBookmark
public void setEndBookmark(org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem pdOutlineItem)
- Overrides:
- setEndBookmark in class org.apache.pdfbox.text.PDFTextStripper
 
getStartPage
public int getStartPage()
- Overrides:
- getStartPage in class org.apache.pdfbox.text.PDFTextStripper
 
setStartPage
public void setStartPage(int startPage)
- Overrides:
- setStartPage in class org.apache.pdfbox.text.PDFTextStripper
 
showGlyph
protected void showGlyph(org.apache.pdfbox.util.Matrix textRenderingMatrix, org.apache.pdfbox.pdmodel.font.PDFont font, int code, org.apache.pdfbox.util.Vector displacement) throws IOException
- Overrides:
- showGlyph in class org.apache.pdfbox.contentstream.PDFStreamEngine
- Throws:
- IOException
 
showGlyph
protected void showGlyph(org.apache.pdfbox.util.Matrix arg0, org.apache.pdfbox.pdmodel.font.PDFont arg1, int arg2, String arg3, org.apache.pdfbox.util.Vector arg4) throws IOException
- Overrides:
- showGlyph in class org.apache.pdfbox.contentstream.PDFStreamEngine
- Throws:
- IOException
 
computeFontHeight
protected float computeFontHeight(org.apache.pdfbox.pdmodel.font.PDFont arg0) throws IOException
- Throws:
- IOException
 
 