Package org.apache.tika.parser.pdf
Class PDFMarkedContent2XHTML
java.lang.Object
org.apache.pdfbox.contentstream.PDFStreamEngine
org.apache.pdfbox.text.PDFTextStripper
org.apache.tika.parser.pdf.PDFMarkedContent2XHTML
public class PDFMarkedContent2XHTML
extends org.apache.pdfbox.text.PDFTextStripper
This was added in Tika 1.24 as an alpha version of a text extractor that builds the text from the marked text tree and includes/normalizes some of the structural tags.
- Since:
- 1.24
-
Field Summary
Fields inherited from class org.apache.pdfbox.text.PDFTextStripper
charactersByArticle, document, LINE_SEPARATOR, output
-
Method Summary
| Modifier and Type | Method | Description |
| --- | --- | --- |
| protected float | computeFontHeight(org.apache.pdfbox.pdmodel.font.PDFont arg0) | |
| protected void | endDocument(org.apache.pdfbox.pdmodel.PDDocument pdf) | |
| protected void | endPage(org.apache.pdfbox.pdmodel.PDPage page) | |
| int | getCurrentPageNo() | We need to override this because we are overriding PDFTextStripper.processPages(PDPageTree). |
| int | getStartPage() | |
| static void | process(org.apache.pdfbox.pdmodel.PDDocument pdDocument, ContentHandler handler, ParseContext context, Metadata metadata, PDFParserConfig config) | Converts the given PDF document (and related metadata) to a stream of XHTML SAX events sent to the given content handler. |
| void | processPage(org.apache.pdfbox.pdmodel.PDPage page) | |
| protected void | processPages(org.apache.pdfbox.pdmodel.PDPageTree pages) | See TIKA-2845 for why we need to override this. |
| void | setEndBookmark(org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem pdOutlineItem) | |
| void | setStartBookmark(org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem pdOutlineItem) | |
| void | setStartPage(int startPage) | |
| protected void | showGlyph(org.apache.pdfbox.util.Matrix arg0, org.apache.pdfbox.pdmodel.font.PDFont arg1, int arg2, String arg3, org.apache.pdfbox.util.Vector arg4) | |
| protected void | showGlyph(org.apache.pdfbox.util.Matrix textRenderingMatrix, org.apache.pdfbox.pdmodel.font.PDFont font, int code, org.apache.pdfbox.util.Vector displacement) | |
| protected void | startDocument(org.apache.pdfbox.pdmodel.PDDocument pdf) | |
| protected void | startPage(org.apache.pdfbox.pdmodel.PDPage page) | |
| protected void | writeCharacters(org.apache.pdfbox.text.TextPosition text) | |
| protected void | writeLineSeparator() | |
| protected void | writeParagraphEnd() | |
| protected void | writeParagraphStart() | |
| protected void | writeString(String text) | |
| protected void | writeWordSeparator() | |
Methods inherited from class org.apache.pdfbox.text.PDFTextStripper
endArticle, getAddMoreFormatting, getArticleEnd, getArticleStart, getAverageCharTolerance, getCharactersByArticle, getDropThreshold, getEndBookmark, getEndPage, getIndentThreshold, getLineSeparator, getListItemPatterns, getOutput, getPageEnd, getPageStart, getParagraphEnd, getParagraphStart, getSeparateByBeads, getSortByPosition, getSpacingTolerance, getStartBookmark, getSuppressDuplicateOverlappingText, getText, getWordSeparator, matchPattern, processTextPosition, setAddMoreFormatting, setArticleEnd, setArticleStart, setAverageCharTolerance, setDropThreshold, setEndPage, setIndentThreshold, setLineSeparator, setListItemPatterns, setPageEnd, setPageStart, setParagraphEnd, setParagraphStart, setShouldSeparateByBeads, setSortByPosition, setSpacingTolerance, setSuppressDuplicateOverlappingText, setWordSeparator, startArticle, startArticle, writePage, writePageEnd, writePageStart, writeParagraphSeparator, writeString, writeText
Methods inherited from class org.apache.pdfbox.contentstream.PDFStreamEngine
addOperator, applyTextAdjustment, beginMarkedContentSequence, beginText, decreaseLevel, endMarkedContentSequence, endText, getAppearance, getCurrentPage, getGraphicsStackSize, getGraphicsState, getInitialMatrix, getLevel, getResources, getTextLineMatrix, getTextMatrix, increaseLevel, operatorException, processAnnotation, processChildStream, processOperator, processOperator, processSoftMask, processTilingPattern, processTilingPattern, processTransparencyGroup, processType3Stream, registerOperatorProcessor, restoreGraphicsStack, restoreGraphicsState, saveGraphicsStack, saveGraphicsState, setLineDashPattern, setTextLineMatrix, setTextMatrix, showAnnotation, showFontGlyph, showFontGlyph, showForm, showText, showTextString, showTextStrings, showTransparencyGroup, showType3Glyph, showType3Glyph, transformedPoint, transformWidth, unsupportedOperator
-
Field Details
-
XMP_DOCUMENT_CATALOG_LOCATION
-
XMP_PAGE_LOCATION_PREFIX
-
-
Method Details
-
process
public static void process(org.apache.pdfbox.pdmodel.PDDocument pdDocument, ContentHandler handler, ParseContext context, Metadata metadata, PDFParserConfig config) throws SAXException, TikaException
Converts the given PDF document (and related metadata) to a stream of XHTML SAX events sent to the given content handler.
- Parameters:
pdDocument - PDF document
handler - SAX content handler
context
metadata - PDF metadata
config
- Throws:
SAXException - if the content handler fails to process SAX events
TikaException - if there was an exception outside of per-page processing
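The handler argument receives the XHTML as SAX events rather than as a string, so the caller decides what to do with each event. The following is a minimal sketch of such a handler, assuming `ContentHandler` here is the standard `org.xml.sax.ContentHandler`. The class names `TextCollectingHandler` and `SaxHandlerSketch` are hypothetical, and the event stream is fired by hand as a stand-in for the extractor; the elements this class actually emits are not specified on this page.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler (not part of Tika): collects character data and counts elements. */
class TextCollectingHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private int elementCount = 0;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        elementCount++; // one more element opened
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length); // accumulate the text content
    }

    String getText() {
        return text.toString();
    }

    int getElementCount() {
        return elementCount;
    }
}

class SaxHandlerSketch {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    /** Fires a tiny hand-written event stream; an assumption standing in for real extractor output. */
    static void emitSample(ContentHandler handler) {
        try {
            handler.startDocument();
            handler.startElement(XHTML, "p", "p", new AttributesImpl());
            char[] chars = "Hello PDF".toCharArray();
            handler.characters(chars, 0, chars.length);
            handler.endElement(XHTML, "p", "p");
            handler.endDocument();
        } catch (SAXException e) {
            throw new IllegalStateException("handler rejected an event", e);
        }
    }

    public static void main(String[] args) {
        TextCollectingHandler handler = new TextCollectingHandler();
        emitSample(handler);
        System.out.println(handler.getElementCount() + " element(s), text: " + handler.getText());
    }
}
```

In real use, a handler like this would be passed as the `handler` argument of `process` in place of the hand-fired events.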
-
processPages
See TIKA-2845 for why we need to override this.
- Throws:
IOException
-
processPage
- Overrides:
processPage in class org.apache.pdfbox.text.PDFTextStripper
- Throws:
IOException
-
endPage
- Throws:
IOException
-
writeParagraphStart
- Overrides:
writeParagraphStart in class org.apache.pdfbox.text.PDFTextStripper
- Throws:
IOException
-
writeParagraphEnd
- Overrides:
writeParagraphEnd in class org.apache.pdfbox.text.PDFTextStripper
- Throws:
IOException
-
writeString
- Overrides:
writeString in class org.apache.pdfbox.text.PDFTextStripper
- Throws:
IOException
-
writeCharacters
- Overrides:
writeCharacters in class org.apache.pdfbox.text.PDFTextStripper
- Throws:
IOException
-
writeWordSeparator
- Overrides:
writeWordSeparator in class org.apache.pdfbox.text.PDFTextStripper
- Throws:
IOException
-
writeLineSeparator
- Overrides:
writeLineSeparator in class org.apache.pdfbox.text.PDFTextStripper
- Throws:
IOException
-
startPage
- Overrides:
startPage in class org.apache.pdfbox.text.PDFTextStripper
- Throws:
IOException
-
startDocument
- Overrides:
startDocument in class org.apache.pdfbox.text.PDFTextStripper
- Throws:
IOException
-
endDocument
- Overrides:
endDocument in class org.apache.pdfbox.text.PDFTextStripper
- Throws:
IOException
-
getCurrentPageNo
public int getCurrentPageNo()
We need to override this because we are overriding PDFTextStripper.processPages(PDPageTree).
- Overrides:
getCurrentPageNo in class org.apache.pdfbox.text.PDFTextStripper
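The reason a replaced page loop forces a replaced page-number getter can be sketched with plain stand-in classes. This is an illustration under an assumption, not the Tika or PDFBox source: it assumes the base class advances a private counter only inside its own `processPages`, so a subclass that does not call `super.processPages` has to keep its own counter. All class and method names below other than `processPages`, `processPage`, and `getCurrentPageNo` are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

/** Stand-in for a base stripper: the counter is private and only advanced by processPages. */
class BaseStripperSketch {
    private int currentPageNo = 0;

    protected void processPages(List<String> pages) {
        for (String page : pages) {
            currentPageNo++;
            processPage(page);
        }
    }

    protected void processPage(String page) {
        // page-level work would go here
    }

    public int getCurrentPageNo() {
        return currentPageNo;
    }
}

/** Subclass that replaces the page loop and therefore tracks the page number itself. */
class OverridingStripperSketch extends BaseStripperSketch {
    private int pageNo = 0;

    @Override
    protected void processPages(List<String> pages) {
        // super.processPages is never called, so the base counter stays at 0.
        for (String page : pages) {
            pageNo++;
            processPage(page);
        }
    }

    @Override
    public int getCurrentPageNo() {
        return pageNo; // report the counter this class maintains
    }

    /** Exposes the stale base value to show why the override is needed. */
    int basePageNo() {
        return super.getCurrentPageNo();
    }
}

class PageCounterDemo {
    public static void main(String[] args) {
        OverridingStripperSketch stripper = new OverridingStripperSketch();
        stripper.processPages(Arrays.asList("page one", "page two", "page three"));
        System.out.println("overridden getter: " + stripper.getCurrentPageNo()); // 3
        System.out.println("base getter: " + stripper.basePageNo());             // 0
    }
}
```

Without the overridden getter, callers would see the base value, which never moves once the loop is replaced.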
-
getStartPage
public int getStartPage()
- Overrides:
getStartPage in class org.apache.pdfbox.text.PDFTextStripper
-
setStartPage
public void setStartPage(int startPage)
- Overrides:
setStartPage in class org.apache.pdfbox.text.PDFTextStripper
-
showGlyph
protected void showGlyph(org.apache.pdfbox.util.Matrix textRenderingMatrix, org.apache.pdfbox.pdmodel.font.PDFont font, int code, org.apache.pdfbox.util.Vector displacement) throws IOException
- Overrides:
showGlyph in class org.apache.pdfbox.contentstream.PDFStreamEngine
- Throws:
IOException
-
showGlyph
protected void showGlyph(org.apache.pdfbox.util.Matrix arg0, org.apache.pdfbox.pdmodel.font.PDFont arg1, int arg2, String arg3, org.apache.pdfbox.util.Vector arg4) throws IOException
- Overrides:
showGlyph in class org.apache.pdfbox.contentstream.PDFStreamEngine
- Throws:
IOException
-
computeFontHeight
- Throws:
IOException
-