Package org.apache.tika.parser.pdf
Class PDFMarkedContent2XHTML
java.lang.Object
org.apache.pdfbox.contentstream.PDFStreamEngine
org.apache.pdfbox.text.PDFTextStripper
org.apache.tika.parser.pdf.PDFMarkedContent2XHTML
public class PDFMarkedContent2XHTML
extends org.apache.pdfbox.text.PDFTextStripper
This class was added in Tika 1.24 as an alpha version of a text extractor. It builds the text from the marked content tree and includes or normalizes some of the structural tags.
- Since:
- 1.24
-
Field Summary
Fields
Fields inherited from class org.apache.pdfbox.text.PDFTextStripper
charactersByArticle, document, LINE_SEPARATOR, output -
Method Summary
| Modifier and Type | Method | Description |
| --- | --- | --- |
| protected float | computeFontHeight(org.apache.pdfbox.pdmodel.font.PDFont arg0) | |
| protected void | endDocument(org.apache.pdfbox.pdmodel.PDDocument pdf) | |
| protected void | endPage(org.apache.pdfbox.pdmodel.PDPage page) | |
| int | getCurrentPageNo() | We need to override this because we are overriding PDFTextStripper.processPages(PDPageTree) |
| static void | process(org.apache.pdfbox.pdmodel.PDDocument pdDocument, ContentHandler handler, ParseContext context, Metadata metadata, PDFParserConfig config) | Converts the given PDF document (and related metadata) to a stream of XHTML SAX events sent to the given content handler. |
| void | processPage(org.apache.pdfbox.pdmodel.PDPage page) | |
| protected void | processPages(org.apache.pdfbox.pdmodel.PDPageTree pageTree) | See TIKA-2845 for why we need to override this. |
| void | setEndBookmark(org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem pdOutlineItem) | |
| void | setStartBookmark(org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem pdOutlineItem) | |
| protected void | showGlyph(org.apache.pdfbox.util.Matrix textRenderingMatrix, org.apache.pdfbox.pdmodel.font.PDFont font, int code, org.apache.pdfbox.util.Vector displacement) | |
| protected void | startDocument(org.apache.pdfbox.pdmodel.PDDocument pdf) | |
| protected void | startPage(org.apache.pdfbox.pdmodel.PDPage page) | |
| protected void | writeCharacters(org.apache.pdfbox.text.TextPosition text) | |
| protected void | writeLineSeparator() | |
| protected void | writeParagraphEnd() | |
| protected void | writeParagraphStart() | |
| protected void | writeString(String text) | |
| protected void | writeWordSeparator() | |

Methods inherited from class org.apache.pdfbox.text.PDFTextStripper
beginMarkedContentSequence, endArticle, endMarkedContentSequence, getAddMoreFormatting, getArticleEnd, getArticleStart, getAverageCharTolerance, getCharactersByArticle, getDropThreshold, getEndBookmark, getEndPage, getIgnoreContentStreamSpaceGlyphs, getIndentThreshold, getLineSeparator, getListItemPatterns, getOutput, getPageEnd, getPageStart, getParagraphEnd, getParagraphStart, getSeparateByBeads, getSortByPosition, getSpacingTolerance, getStartBookmark, getStartPage, getSuppressDuplicateOverlappingText, getText, getWordSeparator, matchPattern, processTextPosition, setAddMoreFormatting, setArticleEnd, setArticleStart, setAverageCharTolerance, setDropThreshold, setEndPage, setIgnoreContentStreamSpaceGlyphs, setIndentThreshold, setLineSeparator, setListItemPatterns, setPageEnd, setPageStart, setParagraphEnd, setParagraphStart, setShouldSeparateByBeads, setSortByPosition, setSpacingTolerance, setStartPage, setSuppressDuplicateOverlappingText, setWordSeparator, startArticle, startArticle, writePage, writePageEnd, writePageStart, writeParagraphSeparator, writeString, writeText
Methods inherited from class org.apache.pdfbox.contentstream.PDFStreamEngine
addOperator, applyTextAdjustment, beginText, decreaseLevel, endText, getAppearance, getCurrentPage, getGraphicsStackSize, getGraphicsState, getInitialMatrix, getLevel, getResources, getTextLineMatrix, getTextMatrix, increaseLevel, isShouldProcessColorOperators, markedContentPoint, operatorException, processAnnotation, processChildStream, processOperator, processOperator, processSoftMask, processTilingPattern, processTilingPattern, processTransparencyGroup, processType3Stream, restoreGraphicsStack, restoreGraphicsState, saveGraphicsStack, saveGraphicsState, setLineDashPattern, setTextLineMatrix, setTextMatrix, showAnnotation, showFontGlyph, showForm, showText, showTextString, showTextStrings, showTransparencyGroup, showType3Glyph, transformedPoint, transformWidth, unsupportedOperator
-
Field Details
-
XMP_DOCUMENT_CATALOG_LOCATION
- See Also:
-
XMP_PAGE_LOCATION_PREFIX
- See Also:
-
-
Method Details
-
process
public static void process(org.apache.pdfbox.pdmodel.PDDocument pdDocument, ContentHandler handler, ParseContext context, Metadata metadata, PDFParserConfig config) throws SAXException, TikaException
Converts the given PDF document (and related metadata) to a stream of XHTML SAX events sent to the given content handler.
- Parameters:
pdDocument - PDF document
handler - SAX content handler
context -
metadata - PDF metadata
config -
- Throws:
SAXException - if the content handler fails to process SAX events
TikaException - if there was an exception outside of per page processing
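To make "a stream of XHTML SAX events sent to the given content handler" concrete, the following is a minimal sketch that uses only the JDK's `org.xml.sax` classes. It does not call Tika or PDFBox. `SaxEventSketch`, `RecordingHandler`, `emitPage`, and the div/p layout are hypothetical stand-ins for illustration, not the events this class actually emits.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

public class SaxEventSketch {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    /** Records element boundaries and text so the event order can be inspected. */
    static class RecordingHandler extends DefaultHandler {
        final List<String> events = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            events.add("start:" + localName);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            events.add("end:" + localName);
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            events.add("text:" + new String(ch, start, length));
        }
    }

    /** Hypothetical producer: emits one div that wraps one paragraph of text. */
    static void emitPage(ContentHandler handler, String text) throws SAXException {
        AttributesImpl none = new AttributesImpl();
        handler.startElement(XHTML, "div", "div", none);
        handler.startElement(XHTML, "p", "p", none);
        char[] chars = text.toCharArray();
        handler.characters(chars, 0, chars.length);
        handler.endElement(XHTML, "p", "p");
        handler.endElement(XHTML, "div", "div");
    }

    /** Drives the producer against a recording handler and returns the events. */
    static List<String> collect(String text) throws SAXException {
        RecordingHandler handler = new RecordingHandler();
        handler.startDocument();
        emitPage(handler, text);
        handler.endDocument();
        return handler.events;
    }

    public static void main(String[] args) throws SAXException {
        // Prints start:div, start:p, text:Hello, end:p, end:div (one per line)
        collect("Hello").forEach(System.out::println);
    }
}
```

A caller of `process` would pass its own `ContentHandler` in the same role that `RecordingHandler` plays here; the handler decides whether to serialize, index, or discard the events it receives.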
-
processPages
See TIKA-2845 for why we need to override this.
- Throws:
IOException
-
processPage
- Overrides:
processPage in class org.apache.pdfbox.text.PDFTextStripper
- Throws:
IOException
-
endPage
- Throws:
IOException
-
writeParagraphStart
- Overrides:
writeParagraphStart in class org.apache.pdfbox.text.PDFTextStripper
- Throws:
IOException
-
writeParagraphEnd
- Overrides:
writeParagraphEnd in class org.apache.pdfbox.text.PDFTextStripper
- Throws:
IOException
-
writeString
- Overrides:
writeString in class org.apache.pdfbox.text.PDFTextStripper
- Throws:
IOException
-
writeCharacters
- Overrides:
writeCharacters in class org.apache.pdfbox.text.PDFTextStripper
- Throws:
IOException
-
writeWordSeparator
- Overrides:
writeWordSeparator in class org.apache.pdfbox.text.PDFTextStripper
- Throws:
IOException
-
writeLineSeparator
- Overrides:
writeLineSeparator in class org.apache.pdfbox.text.PDFTextStripper
- Throws:
IOException
-
startPage
- Overrides:
startPage in class org.apache.pdfbox.text.PDFTextStripper
- Throws:
IOException
-
startDocument
- Overrides:
startDocument in class org.apache.pdfbox.text.PDFTextStripper
- Throws:
IOException
-
endDocument
- Overrides:
endDocument in class org.apache.pdfbox.text.PDFTextStripper
- Throws:
IOException
-
getCurrentPageNo
public int getCurrentPageNo()
We need to override this because we are overriding PDFTextStripper.processPages(PDPageTree).
- Overrides:
getCurrentPageNo in class org.apache.pdfbox.text.PDFTextStripper
- Returns:
-
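The link between overriding `processPages` and overriding `getCurrentPageNo` can be illustrated with a small, self-contained sketch. It assumes a base class whose page counter is private and advanced only inside its own page loop. `BaseStripper`, `NaiveStripper`, and `CountingStripper` are hypothetical stand-ins written for this example; they are not PDFBox or Tika code, and pages are modeled as plain strings.

```java
import java.util.ArrayList;
import java.util.List;

public class PageCounterSketch {

    /** Hypothetical stand-in for a base stripper; not PDFBox code. */
    static class BaseStripper {
        private int currentPageNo = 0; // advanced only by the base page loop

        protected void processPages(List<String> pages) {
            for (String page : pages) {
                currentPageNo++;
                processPage(page);
            }
        }

        protected void processPage(String page) {
            // no-op in this sketch
        }

        public int getCurrentPageNo() {
            return currentPageNo;
        }
    }

    /** Replaces the page loop but not the counter: the page number stays 0. */
    static class NaiveStripper extends BaseStripper {
        final List<Integer> seen = new ArrayList<>();

        @Override
        protected void processPages(List<String> pages) {
            for (String page : pages) {
                processPage(page);
            }
        }

        @Override
        protected void processPage(String page) {
            seen.add(getCurrentPageNo()); // record the page number visible per page
        }
    }

    /** Replaces the page loop and keeps its own counter, exposed via the getter. */
    static class CountingStripper extends NaiveStripper {
        private int pageNo = 0;

        @Override
        protected void processPages(List<String> pages) {
            for (String page : pages) {
                pageNo++;
                processPage(page);
            }
        }

        @Override
        public int getCurrentPageNo() {
            return pageNo;
        }
    }

    static List<Integer> pageNumbersSeen(NaiveStripper stripper, List<String> pages) {
        stripper.processPages(pages);
        return stripper.seen;
    }

    public static void main(String[] args) {
        List<String> pages = List.of("a", "b", "c");
        System.out.println(pageNumbersSeen(new NaiveStripper(), pages));    // [0, 0, 0]
        System.out.println(pageNumbersSeen(new CountingStripper(), pages)); // [1, 2, 3]
    }
}
```

Because the base counter is private in this sketch, a subclass that supplies its own page loop cannot advance it, so the only way to report a correct page number is to track one locally and override the getter.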
showGlyph
protected void showGlyph(org.apache.pdfbox.util.Matrix textRenderingMatrix, org.apache.pdfbox.pdmodel.font.PDFont font, int code, org.apache.pdfbox.util.Vector displacement) throws IOException
- Throws:
IOException
-
computeFontHeight
- Throws:
IOException
-