Class PDFMarkedContent2XHTML


  • public class PDFMarkedContent2XHTML
    extends org.apache.pdfbox.text.PDFTextStripper

    Added in Tika 1.24 as an alpha-quality text extractor. It builds the text from the PDF's marked content tree and includes or normalizes some of the structural tags.

    Since:
    1.24
    • Method Summary

      Modifier and Type Method Description
      protected float computeFontHeight(org.apache.pdfbox.pdmodel.font.PDFont arg0)
      protected void endDocument(org.apache.pdfbox.pdmodel.PDDocument pdf)
      protected void endPage(org.apache.pdfbox.pdmodel.PDPage page)
      int getCurrentPageNo()
      Overridden because this class also overrides PDFTextStripper.processPages(PDPageTree).
      int getStartPage()
      static void process(org.apache.pdfbox.pdmodel.PDDocument pdDocument, ContentHandler handler, ParseContext context, Metadata metadata, PDFParserConfig config)
      Converts the given PDF document (and related metadata) to a stream of XHTML SAX events sent to the given content handler.
      void processPage(org.apache.pdfbox.pdmodel.PDPage page)
      protected void processPages(org.apache.pdfbox.pdmodel.PDPageTree pages)
      See TIKA-2845 for why we need to override this.
      void setEndBookmark(org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem pdOutlineItem)
      void setStartBookmark(org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem pdOutlineItem)
      void setStartPage(int startPage)
      protected void showGlyph(org.apache.pdfbox.util.Matrix textRenderingMatrix, org.apache.pdfbox.pdmodel.font.PDFont font, int code, String unicode, org.apache.pdfbox.util.Vector displacement)
      protected void startDocument(org.apache.pdfbox.pdmodel.PDDocument pdf)
      protected void startPage(org.apache.pdfbox.pdmodel.PDPage page)
      protected void writeCharacters(org.apache.pdfbox.text.TextPosition text)
      protected void writeLineSeparator()
      protected void writeParagraphEnd()
      protected void writeParagraphStart()
      protected void writeString(String text)
      protected void writeWordSeparator()
      • Methods inherited from class org.apache.pdfbox.text.PDFTextStripper

        endArticle, getAddMoreFormatting, getArticleEnd, getArticleStart, getAverageCharTolerance, getCharactersByArticle, getDropThreshold, getEndBookmark, getEndPage, getIndentThreshold, getLineSeparator, getListItemPatterns, getOutput, getPageEnd, getPageStart, getParagraphEnd, getParagraphStart, getSeparateByBeads, getSortByPosition, getSpacingTolerance, getStartBookmark, getSuppressDuplicateOverlappingText, getText, getWordSeparator, matchPattern, processTextPosition, setAddMoreFormatting, setArticleEnd, setArticleStart, setAverageCharTolerance, setDropThreshold, setEndPage, setIndentThreshold, setLineSeparator, setListItemPatterns, setPageEnd, setPageStart, setParagraphEnd, setParagraphStart, setShouldSeparateByBeads, setSortByPosition, setSpacingTolerance, setSuppressDuplicateOverlappingText, setWordSeparator, startArticle, startArticle, writePage, writePageEnd, writePageStart, writeParagraphSeparator, writeString, writeText
      • Methods inherited from class org.apache.pdfbox.contentstream.PDFStreamEngine

        addOperator, applyTextAdjustment, beginMarkedContentSequence, beginText, decreaseLevel, endMarkedContentSequence, endText, getAppearance, getCurrentPage, getGraphicsStackSize, getGraphicsState, getInitialMatrix, getLevel, getResources, getTextLineMatrix, getTextMatrix, increaseLevel, operatorException, processAnnotation, processChildStream, processOperator, processOperator, processSoftMask, processTilingPattern, processTilingPattern, processTransparencyGroup, processType3Stream, registerOperatorProcessor, restoreGraphicsStack, restoreGraphicsState, saveGraphicsStack, saveGraphicsState, setLineDashPattern, setTextLineMatrix, setTextMatrix, showAnnotation, showFontGlyph, showFontGlyph, showForm, showGlyph, showText, showTextString, showTextStrings, showTransparencyGroup, showType3Glyph, showType3Glyph, transformedPoint, transformWidth, unsupportedOperator
    • Method Detail

      • process

        public static void process(org.apache.pdfbox.pdmodel.PDDocument pdDocument,
                                   ContentHandler handler,
                                   ParseContext context,
                                   Metadata metadata,
                                   PDFParserConfig config)
                            throws SAXException,
                                   TikaException
        Converts the given PDF document (and related metadata) to a stream of XHTML SAX events sent to the given content handler.
        Parameters:
        pdDocument - PDF document
        handler - SAX content handler
        context - parse context
        metadata - PDF metadata
        config - PDF parser configuration
        Throws:
        SAXException - if the content handler fails to process SAX events
        TikaException - if there was an exception outside of per page processing
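        To make "a stream of XHTML SAX events sent to the given content handler" concrete, here is a minimal sketch that uses only the JDK's org.xml.sax classes. SaxEventSketch, RecordingHandler, and emitParagraph are hypothetical names introduced for illustration; emitParagraph is a stand-in for an extractor and is not Tika's implementation.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

class SaxEventSketch {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    /** Records every SAX event it receives as a short string. */
    static class RecordingHandler extends DefaultHandler {
        final List<String> events = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            events.add("start:" + localName);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            events.add("end:" + localName);
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            events.add("text:" + new String(ch, start, length));
        }
    }

    /** Hypothetical stand-in for an extractor: emits one paragraph as XHTML SAX events. */
    static void emitParagraph(ContentHandler handler, String text) throws SAXException {
        handler.startElement(XHTML, "p", "p", new AttributesImpl());
        char[] chars = text.toCharArray();
        handler.characters(chars, 0, chars.length);
        handler.endElement(XHTML, "p", "p");
    }

    /** Runs the sketch and returns the recorded events. */
    static List<String> demo() {
        RecordingHandler handler = new RecordingHandler();
        try {
            emitParagraph(handler, "Hello");
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return handler.events;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [start:p, text:Hello, end:p]
    }
}
```

        The handler never sees a complete document object. It only receives start, text, and end callbacks in order, which is why a failure inside the handler surfaces as a SAXException from the emitting code.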
      • processPages

        protected void processPages(org.apache.pdfbox.pdmodel.PDPageTree pages)
                             throws IOException
        See TIKA-2845 for why we need to override this.
        Throws:
        IOException
      • processPage

        public void processPage(org.apache.pdfbox.pdmodel.PDPage page)
                         throws IOException
        Overrides:
        processPage in class org.apache.pdfbox.text.PDFTextStripper
        Throws:
        IOException
      • endPage

        protected void endPage(org.apache.pdfbox.pdmodel.PDPage page)
                        throws IOException
        Throws:
        IOException
      • writeParagraphStart

        protected void writeParagraphStart()
                                    throws IOException
        Overrides:
        writeParagraphStart in class org.apache.pdfbox.text.PDFTextStripper
        Throws:
        IOException
      • writeParagraphEnd

        protected void writeParagraphEnd()
                                  throws IOException
        Overrides:
        writeParagraphEnd in class org.apache.pdfbox.text.PDFTextStripper
        Throws:
        IOException
      • writeString

        protected void writeString(String text)
                            throws IOException
        Overrides:
        writeString in class org.apache.pdfbox.text.PDFTextStripper
        Throws:
        IOException
      • writeCharacters

        protected void writeCharacters(org.apache.pdfbox.text.TextPosition text)
                                throws IOException
        Overrides:
        writeCharacters in class org.apache.pdfbox.text.PDFTextStripper
        Throws:
        IOException
      • writeWordSeparator

        protected void writeWordSeparator()
                                   throws IOException
        Overrides:
        writeWordSeparator in class org.apache.pdfbox.text.PDFTextStripper
        Throws:
        IOException
      • writeLineSeparator

        protected void writeLineSeparator()
                                   throws IOException
        Overrides:
        writeLineSeparator in class org.apache.pdfbox.text.PDFTextStripper
        Throws:
        IOException
      • startPage

        protected void startPage(org.apache.pdfbox.pdmodel.PDPage page)
                          throws IOException
        Overrides:
        startPage in class org.apache.pdfbox.text.PDFTextStripper
        Throws:
        IOException
      • startDocument

        protected void startDocument(org.apache.pdfbox.pdmodel.PDDocument pdf)
                              throws IOException
        Overrides:
        startDocument in class org.apache.pdfbox.text.PDFTextStripper
        Throws:
        IOException
      • endDocument

        protected void endDocument(org.apache.pdfbox.pdmodel.PDDocument pdf)
                            throws IOException
        Overrides:
        endDocument in class org.apache.pdfbox.text.PDFTextStripper
        Throws:
        IOException
      • getCurrentPageNo

        public int getCurrentPageNo()
        Overridden because this class also overrides PDFTextStripper.processPages(PDPageTree).
        Overrides:
        getCurrentPageNo in class org.apache.pdfbox.text.PDFTextStripper
        Returns:
        the number of the page currently being processed
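        The reason given for overriding getCurrentPageNo follows a general Java pattern: if a base class updates a private counter inside a method, a subclass that replaces that method can no longer update the counter, so it has to keep its own and override the getter too. The sketch below shows the pattern with hypothetical classes BaseStripper and MarkedStripper. It assumes the base class keeps its page counter private and is not the actual PDFBox or Tika source.

```java
import java.util.List;

class BaseStripper {
    // Private: subclasses cannot read or update this field directly.
    private int currentPageNo = 0;

    protected void processPages(List<String> pages) {
        for (String page : pages) {
            currentPageNo++;
            processPage(page);
        }
    }

    protected void processPage(String page) {
        // Page handling omitted in this sketch.
    }

    public int getCurrentPageNo() {
        return currentPageNo;
    }
}

class MarkedStripper extends BaseStripper {
    // The subclass tracks pages itself because its processPages
    // replaces the base loop that incremented the private counter.
    private int myPageNo = 0;

    @Override
    protected void processPages(List<String> pages) {
        for (String page : pages) {
            myPageNo++;
            processPage(page);
        }
    }

    // Without this override the getter would report 0, the untouched base counter.
    @Override
    public int getCurrentPageNo() {
        return myPageNo;
    }

    /** Processes the given pages and returns the last page number seen. */
    static int pagesSeen(List<String> pages) {
        MarkedStripper stripper = new MarkedStripper();
        stripper.processPages(pages);
        return stripper.getCurrentPageNo();
    }
}
```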
      • setStartBookmark

        public void setStartBookmark(org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem pdOutlineItem)
        Overrides:
        setStartBookmark in class org.apache.pdfbox.text.PDFTextStripper
      • setEndBookmark

        public void setEndBookmark(org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem pdOutlineItem)
        Overrides:
        setEndBookmark in class org.apache.pdfbox.text.PDFTextStripper
      • getStartPage

        public int getStartPage()
        Overrides:
        getStartPage in class org.apache.pdfbox.text.PDFTextStripper
      • setStartPage

        public void setStartPage(int startPage)
        Overrides:
        setStartPage in class org.apache.pdfbox.text.PDFTextStripper
      • showGlyph

        protected void showGlyph(org.apache.pdfbox.util.Matrix textRenderingMatrix,
                                 org.apache.pdfbox.pdmodel.font.PDFont font,
                                 int code,
                                 String unicode,
                                 org.apache.pdfbox.util.Vector displacement)
                          throws IOException
        Throws:
        IOException
      • computeFontHeight

        protected float computeFontHeight(org.apache.pdfbox.pdmodel.font.PDFont arg0)
                                   throws IOException
        Throws:
        IOException