Package org.apache.tika.parser.pdf
Class PDFMarkedContent2XHTML
- java.lang.Object
- org.apache.pdfbox.contentstream.PDFStreamEngine
- org.apache.pdfbox.text.PDFTextStripper
- org.apache.tika.parser.pdf.PDFMarkedContent2XHTML

public class PDFMarkedContent2XHTML extends org.apache.pdfbox.text.PDFTextStripper

This was added in Tika 1.24 as an alpha version of a text extractor that builds the text from the marked text tree and includes/normalizes some of the structural tags.
- Since: 1.24

Field Summary
- static String XMP_DOCUMENT_CATALOG_LOCATION
- static String XMP_PAGE_LOCATION_PREFIX

Method Summary
- protected float computeFontHeight(org.apache.pdfbox.pdmodel.font.PDFont arg0)
- protected void endDocument(org.apache.pdfbox.pdmodel.PDDocument pdf)
- protected void endPage(org.apache.pdfbox.pdmodel.PDPage page)
- int getCurrentPageNo() - We need to override this because we are overriding PDFTextStripper.processPages(PDPageTree).
- static void process(org.apache.pdfbox.pdmodel.PDDocument pdDocument, ContentHandler handler, ParseContext context, Metadata metadata, PDFParserConfig config) - Converts the given PDF document (and related metadata) to a stream of XHTML SAX events sent to the given content handler.
- void processPage(org.apache.pdfbox.pdmodel.PDPage page)
- protected void processPages(org.apache.pdfbox.pdmodel.PDPageTree pageTree) - See TIKA-2845 for why we need to override this.
- void setEndBookmark(org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem pdOutlineItem)
- void setStartBookmark(org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem pdOutlineItem)
- protected void showGlyph(org.apache.pdfbox.util.Matrix textRenderingMatrix, org.apache.pdfbox.pdmodel.font.PDFont font, int code, org.apache.pdfbox.util.Vector displacement)
- protected void startDocument(org.apache.pdfbox.pdmodel.PDDocument pdf)
- protected void startPage(org.apache.pdfbox.pdmodel.PDPage page)
- protected void writeCharacters(org.apache.pdfbox.text.TextPosition text)
- protected void writeLineSeparator()
- protected void writeParagraphEnd()
- protected void writeParagraphStart()
- protected void writeString(String text)
- protected void writeWordSeparator()

Methods inherited from class org.apache.pdfbox.text.PDFTextStripper
beginMarkedContentSequence, endArticle, endMarkedContentSequence, getAddMoreFormatting, getArticleEnd, getArticleStart, getAverageCharTolerance, getCharactersByArticle, getDropThreshold, getEndBookmark, getEndPage, getIgnoreContentStreamSpaceGlyphs, getIndentThreshold, getLineSeparator, getListItemPatterns, getOutput, getPageEnd, getPageStart, getParagraphEnd, getParagraphStart, getSeparateByBeads, getSortByPosition, getSpacingTolerance, getStartBookmark, getStartPage, getSuppressDuplicateOverlappingText, getText, getWordSeparator, matchPattern, processTextPosition, setAddMoreFormatting, setArticleEnd, setArticleStart, setAverageCharTolerance, setDropThreshold, setEndPage, setIgnoreContentStreamSpaceGlyphs, setIndentThreshold, setLineSeparator, setListItemPatterns, setPageEnd, setPageStart, setParagraphEnd, setParagraphStart, setShouldSeparateByBeads, setSortByPosition, setSpacingTolerance, setStartPage, setSuppressDuplicateOverlappingText, setWordSeparator, startArticle, startArticle, writePage, writePageEnd, writePageStart, writeParagraphSeparator, writeString, writeText

Methods inherited from class org.apache.pdfbox.contentstream.PDFStreamEngine
addOperator, applyTextAdjustment, beginText, decreaseLevel, endText, getAppearance, getCurrentPage, getGraphicsStackSize, getGraphicsState, getInitialMatrix, getLevel, getResources, getTextLineMatrix, getTextMatrix, increaseLevel, isShouldProcessColorOperators, operatorException, processAnnotation, processChildStream, processOperator, processOperator, processSoftMask, processTilingPattern, processTilingPattern, processTransparencyGroup, processType3Stream, restoreGraphicsStack, restoreGraphicsState, saveGraphicsStack, saveGraphicsState, setLineDashPattern, setTextLineMatrix, setTextMatrix, showAnnotation, showFontGlyph, showForm, showText, showTextString, showTextStrings, showTransparencyGroup, showType3Glyph, transformedPoint, transformWidth, unsupportedOperator

Field Detail

XMP_DOCUMENT_CATALOG_LOCATION
public static final String XMP_DOCUMENT_CATALOG_LOCATION
- See Also: Constant Field Values

XMP_PAGE_LOCATION_PREFIX
public static final String XMP_PAGE_LOCATION_PREFIX
- See Also: Constant Field Values

Method Detail

process
public static void process(org.apache.pdfbox.pdmodel.PDDocument pdDocument, ContentHandler handler, ParseContext context, Metadata metadata, PDFParserConfig config) throws SAXException, TikaException
Converts the given PDF document (and related metadata) to a stream of XHTML SAX events sent to the given content handler.
- Parameters:
- pdDocument - PDF document
- handler - SAX content handler
- context - parse context
- metadata - PDF metadata
- config - PDF parser configuration
- Throws:
- SAXException - if the content handler fails to process SAX events
- TikaException - if there was an exception outside of per-page processing
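
The handler argument is an ordinary SAX ContentHandler, so the receiving side can be sketched with JDK classes alone. The example below is a minimal sketch, not Tika code: TextCollectingHandler and its demo() method are hypothetical names, and the hand-fired events in demo() only stand in for the XHTML events that a real process(...) call would be expected to send. Which elements the extractor actually emits is not stated here, so the single p element is an assumption for illustration.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: collects character data and the element names it sees. */
class TextCollectingHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private final StringBuilder elements = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Record each opened element, separated by single spaces.
        if (elements.length() > 0) {
            elements.append(' ');
        }
        elements.append(localName);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Append only the slice of the buffer that SAX hands over.
        text.append(ch, start, length);
    }

    String text() {
        return text.toString();
    }

    String elements() {
        return elements.toString();
    }

    /** Fires a few hand-written XHTML events; a stand-in for a real process(...) call. */
    static TextCollectingHandler demo() {
        TextCollectingHandler handler = new TextCollectingHandler();
        ContentHandler sink = handler; // the type a SAX producer writes to
        String xhtml = "http://www.w3.org/1999/xhtml";
        char[] chars = "Hello PDF".toCharArray();
        try {
            sink.startDocument();
            sink.startElement(xhtml, "p", "p", new AttributesImpl());
            sink.characters(chars, 0, chars.length);
            sink.endElement(xhtml, "p", "p");
            sink.endDocument();
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return handler;
    }

    public static void main(String[] args) {
        TextCollectingHandler h = demo();
        System.out.println(h.elements() + ": " + h.text());
    }
}
```

In real use, an instance of such a handler would be passed as the handler argument together with an open PDDocument, a ParseContext, a Metadata object and a PDFParserConfig.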
 
processPages
protected void processPages(org.apache.pdfbox.pdmodel.PDPageTree pageTree) throws IOException
See TIKA-2845 for why we need to override this.
- Throws:
- IOException
 
processPage
public void processPage(org.apache.pdfbox.pdmodel.PDPage page) throws IOException
- Overrides:
- processPage in class org.apache.pdfbox.text.PDFTextStripper
- Throws:
- IOException
 
endPage
protected void endPage(org.apache.pdfbox.pdmodel.PDPage page) throws IOException
- Throws:
- IOException
 
writeParagraphStart
protected void writeParagraphStart() throws IOException
- Overrides:
- writeParagraphStart in class org.apache.pdfbox.text.PDFTextStripper
- Throws:
- IOException
 
writeParagraphEnd
protected void writeParagraphEnd() throws IOException
- Overrides:
- writeParagraphEnd in class org.apache.pdfbox.text.PDFTextStripper
- Throws:
- IOException
 
writeString
protected void writeString(String text) throws IOException
- Overrides:
- writeString in class org.apache.pdfbox.text.PDFTextStripper
- Throws:
- IOException
 
writeCharacters
protected void writeCharacters(org.apache.pdfbox.text.TextPosition text) throws IOException
- Overrides:
- writeCharacters in class org.apache.pdfbox.text.PDFTextStripper
- Throws:
- IOException
 
writeWordSeparator
protected void writeWordSeparator() throws IOException
- Overrides:
- writeWordSeparator in class org.apache.pdfbox.text.PDFTextStripper
- Throws:
- IOException
 
writeLineSeparator
protected void writeLineSeparator() throws IOException
- Overrides:
- writeLineSeparator in class org.apache.pdfbox.text.PDFTextStripper
- Throws:
- IOException
 
startPage
protected void startPage(org.apache.pdfbox.pdmodel.PDPage page) throws IOException
- Overrides:
- startPage in class org.apache.pdfbox.text.PDFTextStripper
- Throws:
- IOException
 
startDocument
protected void startDocument(org.apache.pdfbox.pdmodel.PDDocument pdf) throws IOException
- Overrides:
- startDocument in class org.apache.pdfbox.text.PDFTextStripper
- Throws:
- IOException
 
endDocument
protected void endDocument(org.apache.pdfbox.pdmodel.PDDocument pdf) throws IOException
- Overrides:
- endDocument in class org.apache.pdfbox.text.PDFTextStripper
- Throws:
- IOException
 
getCurrentPageNo
public int getCurrentPageNo()
We need to override this because we are overriding PDFTextStripper.processPages(PDPageTree).
- Overrides:
- getCurrentPageNo in class org.apache.pdfbox.text.PDFTextStripper
- Returns:
- the current page number
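
The link between the two overrides can be illustrated with a small stand-alone sketch. It assumes, as a plausible reading that this page does not spell out, that the parent class advances its page counter only inside its own page loop, so a subclass that replaces the loop must keep and report a counter of its own. BaseStripper and CustomStripper are hypothetical stand-ins, not PDFBox or Tika classes, and pages are modelled as plain strings.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical base class: the page counter is advanced only by its own page loop. */
class BaseStripper {
    private int currentPageNo = 0;

    protected void processPages(List<String> pages) {
        for (String page : pages) {
            currentPageNo++;
            processPage(page);
        }
    }

    protected void processPage(String page) {
        // No-op in this sketch.
    }

    public int getCurrentPageNo() {
        return currentPageNo;
    }
}

/** Replaces the page loop, so it must maintain and expose its own counter. */
class CustomStripper extends BaseStripper {
    private int pageNo = 0;
    final StringBuilder seen = new StringBuilder();

    @Override
    protected void processPages(List<String> pages) {
        // The base loop never runs, so the base counter is never advanced.
        for (String page : pages) {
            pageNo++;
            processPage(page);
        }
    }

    @Override
    protected void processPage(String page) {
        // Per-page code asks for the page number through the overridden getter.
        seen.append(getCurrentPageNo()).append(':').append(page).append(' ');
    }

    @Override
    public int getCurrentPageNo() {
        return pageNo;
    }

    /** Exposes the untouched base counter for comparison. */
    int baseCount() {
        return super.getCurrentPageNo();
    }

    static CustomStripper run(List<String> pages) {
        CustomStripper stripper = new CustomStripper();
        stripper.processPages(pages);
        return stripper;
    }

    public static void main(String[] args) {
        CustomStripper s = run(Arrays.asList("a", "b"));
        System.out.println(s.seen + "| own=" + s.getCurrentPageNo() + " base=" + s.baseCount());
    }
}
```

Without the getter override in this sketch, any code that asks for the page number during processing would read the stale base counter.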
 
setStartBookmark
public void setStartBookmark(org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem pdOutlineItem)
- Overrides:
- setStartBookmark in class org.apache.pdfbox.text.PDFTextStripper
 
setEndBookmark
public void setEndBookmark(org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem pdOutlineItem)
- Overrides:
- setEndBookmark in class org.apache.pdfbox.text.PDFTextStripper
 
showGlyph
protected void showGlyph(org.apache.pdfbox.util.Matrix textRenderingMatrix, org.apache.pdfbox.pdmodel.font.PDFont font, int code, org.apache.pdfbox.util.Vector displacement) throws IOException
- Throws:
- IOException
 
computeFontHeight
protected float computeFontHeight(org.apache.pdfbox.pdmodel.font.PDFont arg0) throws IOException
- Throws:
- IOException
 
 