Package org.apache.tika.sax.boilerpipe
Class BoilerpipeContentHandler
java.lang.Object
de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
org.apache.tika.sax.boilerpipe.BoilerpipeContentHandler
All Implemented Interfaces:
ContentHandler
public class BoilerpipeContentHandler
extends de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
Uses the boilerpipe library to automatically extract the main content from a web page.
Use this as a ContentHandler object passed to HtmlParser#parse(java.io.InputStream, ContentHandler, Metadata, org.apache.tika.parser.ParseContext).
Constructor Summary
Constructors
Constructor    Description
BoilerpipeContentHandler(Writer writer)    Creates a content handler that writes XHTML body character events to the given writer.
BoilerpipeContentHandler(ContentHandler delegate)    Creates a new boilerpipe-based content extractor, using the DefaultExtractor extraction rules and "delegate" as the content handler.
BoilerpipeContentHandler(ContentHandler delegate, de.l3s.boilerpipe.BoilerpipeExtractor extractor)    Creates a new boilerpipe-based content extractor, using the given extraction rules.
Method Summary
Modifier and Type    Method    Description
void    characters(char[] chars, int offset, int length)
void    endDocument()
void    endElement(String uri, String localName, String qName)
de.l3s.boilerpipe.document.TextDocument    getTextDocument()    Retrieves the built TextDocument.
boolean    isIncludeMarkup()
void    setIncludeMarkup(boolean includeMarkup)
void    startDocument()
void    startElement(String uri, String localName, String qName, Attributes atts)
void    startPrefixMapping(String prefix, String uri)
Methods inherited from class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
addTextBlock, addWhitespaceIfNecessary, endPrefixMapping, getTitle, ignorableWhitespace, processingInstruction, recycle, setDocumentLocator, setTitle, skippedEntity, toTextDocument
Constructor Details
BoilerpipeContentHandler
public BoilerpipeContentHandler(ContentHandler delegate)
Creates a new boilerpipe-based content extractor, using the DefaultExtractor extraction rules and "delegate" as the content handler.
Parameters:
delegate - The ContentHandler object

BoilerpipeContentHandler
public BoilerpipeContentHandler(Writer writer)
Creates a content handler that writes XHTML body character events to the given writer.
Parameters:
writer - The writer that receives the XHTML body character events

BoilerpipeContentHandler
public BoilerpipeContentHandler(ContentHandler delegate, de.l3s.boilerpipe.BoilerpipeExtractor extractor)
Creates a new boilerpipe-based content extractor, using the given extraction rules. The extracted main content will be passed to the delegate content handler.
Parameters:
delegate - The ContentHandler object
extractor - Extraction rules to use, e.g. ArticleExtractor
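The delegate arrangement described for these constructors can be pictured with a minimal sketch that uses only the JDK's org.xml.sax API. It is not the Tika or boilerpipe implementation: BufferingHandler, its minLength threshold, and the demo input are hypothetical stand-ins, and the length check merely takes the place of real extraction rules such as DefaultExtractor or ArticleExtractor. The sketch buffers text per element and forwards only the retained blocks to the delegate once the document ends.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

import java.util.ArrayList;
import java.util.List;

final class BufferingDelegateSketch {

    /** Hypothetical wrapper: buffers text blocks and forwards the kept ones at endDocument. */
    static final class BufferingHandler extends DefaultHandler {
        private final ContentHandler delegate;
        private final int minLength; // stand-in for real extraction rules (assumption)
        private final List<String> blocks = new ArrayList<>();
        private final StringBuilder current = new StringBuilder();

        BufferingHandler(ContentHandler delegate, int minLength) {
            this.delegate = delegate;
            this.minLength = minLength;
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            current.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            // Treat every closing element as a text-block boundary.
            String text = current.toString().trim();
            if (!text.isEmpty()) {
                blocks.add(text);
            }
            current.setLength(0);
        }

        @Override
        public void endDocument() throws SAXException {
            // Only now does the delegate see anything: the blocks that passed the rule.
            delegate.startDocument();
            for (String block : blocks) {
                if (block.length() >= minLength) {
                    char[] chars = (block + "\n").toCharArray();
                    delegate.characters(chars, 0, chars.length);
                }
            }
            delegate.endDocument();
        }
    }

    /** Feeds two hand-made elements through the wrapper and returns what the delegate received. */
    static String demo() {
        StringBuilder out = new StringBuilder();
        ContentHandler sink = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }
        };
        BufferingHandler handler = new BufferingHandler(sink, 20);
        Attributes none = new AttributesImpl();
        try {
            handler.startDocument();
            emit(handler, "a", "Home", none);
            emit(handler, "p", "This paragraph is long enough to be kept.", none);
            handler.endDocument();
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }

    private static void emit(ContentHandler h, String tag, String text, Attributes atts)
            throws SAXException {
        h.startElement("", tag, tag, atts);
        h.characters(text.toCharArray(), 0, text.length());
        h.endElement("", tag, tag);
    }

    public static void main(String[] args) {
        System.out.print(demo());
    }
}
```

In this sketch the short "Home" text is dropped and only the longer paragraph reaches the delegate, which mirrors the idea of passing extracted main content onward while leaving the actual selection rules to the extractor.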

Method Details
isIncludeMarkup
public boolean isIncludeMarkup()

setIncludeMarkup
public void setIncludeMarkup(boolean includeMarkup)

getTextDocument
public de.l3s.boilerpipe.document.TextDocument getTextDocument()
Retrieves the built TextDocument.
Returns:
TextDocument

startDocument
public void startDocument() throws SAXException
Specified by:
startDocument in interface ContentHandler
Overrides:
startDocument in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
Throws:
SAXException

startPrefixMapping
public void startPrefixMapping(String prefix, String uri) throws SAXException
Specified by:
startPrefixMapping in interface ContentHandler
Overrides:
startPrefixMapping in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
Throws:
SAXException

startElement
public void startElement(String uri, String localName, String qName, Attributes atts) throws SAXException
Specified by:
startElement in interface ContentHandler
Overrides:
startElement in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
Throws:
SAXException

characters
public void characters(char[] chars, int offset, int length) throws SAXException
Specified by:
characters in interface ContentHandler
Overrides:
characters in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
Throws:
SAXException

endElement
public void endElement(String uri, String localName, String qName) throws SAXException
Specified by:
endElement in interface ContentHandler
Overrides:
endElement in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
Throws:
SAXException

endDocument
public void endDocument() throws SAXException
Specified by:
endDocument in interface ContentHandler
Overrides:
endDocument in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
Throws:
SAXException
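
The callbacks overridden above are invoked by whatever SAX parser drives the handler. As a rough illustration of the order in which they arrive, the following sketch records events from the JDK's own namespace-aware SAX parser. It involves neither Tika nor boilerpipe, and the class name SaxEventOrderSketch and the sample input are assumptions made for the example.

```java
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

import javax.xml.parsers.SAXParserFactory;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

final class SaxEventOrderSketch {

    /** Parses the given XML and returns the names of the ContentHandler callbacks in call order. */
    static List<String> record(String xml) {
        List<String> events = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startDocument() {
                events.add("startDocument");
            }

            @Override
            public void startPrefixMapping(String prefix, String uri) {
                events.add("startPrefixMapping");
            }

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                events.add("startElement:" + qName);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // A parser may split one text run into several calls; record the run once.
                if (events.isEmpty() || !events.get(events.size() - 1).equals("characters")) {
                    events.add("characters");
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                events.add("endElement:" + qName);
            }

            @Override
            public void endDocument() {
                events.add("endDocument");
            }
        };
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // needed for startPrefixMapping to be reported
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return events;
    }

    public static void main(String[] args) {
        System.out.println(record("<p xmlns=\"http://www.w3.org/1999/xhtml\">hi</p>"));
    }
}
```

For this input the recorded sequence should be startDocument, startPrefixMapping, startElement:p, characters, endElement:p, endDocument, which is the same order in which the methods are listed in this section.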