Class BoilerpipeContentHandler

  • All Implemented Interfaces:
    ContentHandler

    public class BoilerpipeContentHandler
    extends de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
    Uses the boilerpipe library to automatically extract the main content from a web page.

    Pass an instance of this class as the ContentHandler argument to HtmlParser#parse(java.io.InputStream, ContentHandler, Metadata, org.apache.tika.parser.ParseContext).
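
The usage described above can be sketched roughly as follows. This is a minimal example, not taken from the Tika documentation: the class name MainContentExample and the helper extractMainContent are hypothetical, BodyContentHandler is just one possible delegate, and the import path of BoilerpipeContentHandler is an assumption that depends on the Tika version on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.HtmlParser;
// Assumption: Tika 2.x package layout. In Tika 1.x the class is
// org.apache.tika.parser.html.BoilerpipeContentHandler instead.
import org.apache.tika.parser.html.boilerpipe.BoilerpipeContentHandler;
import org.apache.tika.sax.BodyContentHandler;

// Hypothetical example class, for illustration only.
public class MainContentExample {

    // Parses the given HTML and returns whatever boilerpipe kept as main content.
    static String extractMainContent(String html) throws Exception {
        // The delegate collects the text that boilerpipe forwards to it.
        BodyContentHandler text = new BodyContentHandler();
        // Single-argument constructor: DefaultExtractor rules are used.
        BoilerpipeContentHandler handler = new BoilerpipeContentHandler(text);

        try (InputStream in = new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8))) {
            new HtmlParser().parse(in, handler, new Metadata(), new ParseContext());
        }
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><body><div>Menu | Login</div>"
                + "<p>This paragraph stands in for the main article text of a page.</p>"
                + "</body></html>";
        System.out.println(extractMainContent(html));
    }
}
```

What ends up in the output depends on boilerpipe's heuristics; on very short pages the extractor may keep little or nothing, so the sample input here only illustrates the call sequence.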

    • Constructor Detail

      • BoilerpipeContentHandler

        public BoilerpipeContentHandler(ContentHandler delegate)
        Creates a new boilerpipe-based content extractor that uses the DefaultExtractor extraction rules and passes the extracted main content to the delegate content handler.
        Parameters:
        delegate - The ContentHandler that receives the extracted main content
      • BoilerpipeContentHandler

        public BoilerpipeContentHandler(Writer writer)
        Creates a content handler that writes XHTML body character events to the given writer.
        Parameters:
        writer - The Writer that receives the XHTML body character events
      • BoilerpipeContentHandler

        public BoilerpipeContentHandler(ContentHandler delegate,
                                        de.l3s.boilerpipe.BoilerpipeExtractor extractor)
        Creates a new boilerpipe-based content extractor that uses the given extraction rules. The extracted main content is passed to the delegate content handler.
        Parameters:
        delegate - The ContentHandler that receives the extracted main content
        extractor - Extraction rules to use, e.g. ArticleExtractor
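
As an illustration of the two-argument constructor, the sketch below selects ArticleExtractor explicitly instead of relying on the default rules. The class name ArticleExtractionExample and the helper extractArticle are hypothetical, the BoilerpipeContentHandler import path is again an assumption tied to the Tika version, and the input file is whatever path is passed on the command line.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import de.l3s.boilerpipe.extractors.ArticleExtractor;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.HtmlParser;
// Assumption: Tika 2.x package layout. In Tika 1.x the class is
// org.apache.tika.parser.html.BoilerpipeContentHandler instead.
import org.apache.tika.parser.html.boilerpipe.BoilerpipeContentHandler;
import org.apache.tika.sax.BodyContentHandler;

// Hypothetical example class, for illustration only.
public class ArticleExtractionExample {

    // Runs the HTML stream through boilerpipe using the ArticleExtractor rules.
    static String extractArticle(InputStream html) throws Exception {
        // -1 disables BodyContentHandler's default write limit.
        BodyContentHandler text = new BodyContentHandler(-1);
        BoilerpipeContentHandler handler =
                new BoilerpipeContentHandler(text, ArticleExtractor.INSTANCE);
        new HtmlParser().parse(html, handler, new Metadata(), new ParseContext());
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // args[0]: path to a local HTML file.
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println(extractArticle(in));
        }
    }
}
```

Any other BoilerpipeExtractor implementation can be passed in the same position; which one suits a given site is best checked against sample pages.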