Package org.apache.tika.sax.boilerpipe
Class BoilerpipeContentHandler
- java.lang.Object
  - de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
    - org.apache.tika.sax.boilerpipe.BoilerpipeContentHandler
All Implemented Interfaces:
ContentHandler
public class BoilerpipeContentHandler extends de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
Uses the boilerpipe library to automatically extract the main content from a web page. Use this as a ContentHandler object passed to HtmlParser#parse(java.io.InputStream, ContentHandler, Metadata, org.apache.tika.parser.ParseContext).
-
-
Constructor Summary
Constructors
- BoilerpipeContentHandler(Writer writer): Creates a content handler that writes XHTML body character events to the given writer.
- BoilerpipeContentHandler(ContentHandler delegate): Creates a new boilerpipe-based content extractor, using the DefaultExtractor extraction rules and "delegate" as the content handler.
- BoilerpipeContentHandler(ContentHandler delegate, de.l3s.boilerpipe.BoilerpipeExtractor extractor): Creates a new boilerpipe-based content extractor, using the given extraction rules.
-
Method Summary
Methods (modifier and type, method, description)
- void characters(char[] chars, int offset, int length)
- void endDocument()
- void endElement(String uri, String localName, String qName)
- de.l3s.boilerpipe.document.TextDocument getTextDocument(): Retrieves the built TextDocument.
- boolean isIncludeMarkup()
- void setIncludeMarkup(boolean includeMarkup)
- void startDocument()
- void startElement(String uri, String localName, String qName, Attributes atts)
- void startPrefixMapping(String prefix, String uri)
-
-
-
Constructor Detail
-
BoilerpipeContentHandler
public BoilerpipeContentHandler(ContentHandler delegate)
Creates a new boilerpipe-based content extractor, using the DefaultExtractor extraction rules and "delegate" as the content handler.
Parameters:
delegate - The ContentHandler object
-
BoilerpipeContentHandler
public BoilerpipeContentHandler(Writer writer)
Creates a content handler that writes XHTML body character events to the given writer.
Parameters:
writer - writer
-
BoilerpipeContentHandler
public BoilerpipeContentHandler(ContentHandler delegate, de.l3s.boilerpipe.BoilerpipeExtractor extractor)
Creates a new boilerpipe-based content extractor, using the given extraction rules. The extracted main content will be passed to the delegate content handler.
Parameters:
delegate - The ContentHandler object
extractor - Extraction rules to use, e.g. ArticleExtractor
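The "delegate" taken by these constructors is an ordinary SAX ContentHandler. As a minimal sketch of what such an object can look like, the class below (TextCollector is a hypothetical name, not part of Tika or boilerpipe) accumulates the text it receives through characters(). To stay free of external dependencies, it is driven here by the JDK's own SAX parser on a small XHTML string instead of by HtmlParser and BoilerpipeContentHandler, so it shows only the receiving side of the delegation.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical delegate: a ContentHandler that collects character events.
class TextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] chars, int offset, int length) {
        // Append only the slice of the buffer that this event covers.
        text.append(chars, offset, length);
    }

    String text() {
        return text.toString();
    }

    // Assumption: the JDK SAX parser stands in for Tika's HtmlParser here.
    static String collect(String xhtml) {
        try {
            TextCollector handler = new TextCollector();
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.newSAXParser()
                   .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.text();
        } catch (Exception e) {
            // Wrap checked parser exceptions to keep the sketch short.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(collect("<html><body><p>Main text</p></body></html>"));
    }
}
```

An instance of a handler like this could then be supplied as the delegate argument, in which case it would see only the content that the extractor keeps.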
-
-
Method Detail
-
isIncludeMarkup
public boolean isIncludeMarkup()
-
setIncludeMarkup
public void setIncludeMarkup(boolean includeMarkup)
-
getTextDocument
public de.l3s.boilerpipe.document.TextDocument getTextDocument()
Retrieves the built TextDocument.
Returns:
TextDocument
-
startDocument
public void startDocument() throws SAXException
Specified by:
startDocument in interface ContentHandler
Overrides:
startDocument in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
Throws:
SAXException
-
startPrefixMapping
public void startPrefixMapping(String prefix, String uri) throws SAXException
Specified by:
startPrefixMapping in interface ContentHandler
Overrides:
startPrefixMapping in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
Throws:
SAXException
-
startElement
public void startElement(String uri, String localName, String qName, Attributes atts) throws SAXException
Specified by:
startElement in interface ContentHandler
Overrides:
startElement in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
Throws:
SAXException
-
characters
public void characters(char[] chars, int offset, int length) throws SAXException
Specified by:
characters in interface ContentHandler
Overrides:
characters in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
Throws:
SAXException
-
endElement
public void endElement(String uri, String localName, String qName) throws SAXException
Specified by:
endElement in interface ContentHandler
Overrides:
endElement in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
Throws:
SAXException
-
endDocument
public void endDocument() throws SAXException
Specified by:
endDocument in interface ContentHandler
Overrides:
endDocument in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
Throws:
SAXException
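The six callbacks overridden above are the standard SAX ContentHandler events, which a parser delivers in document order. The following sketch records that order with a plain DefaultHandler; SaxEventOrder is a hypothetical name, and the JDK SAX parser and the inline XHTML string are assumptions made so the example runs without Tika or boilerpipe. It illustrates the event sequence only, not what BoilerpipeContentHandler does inside each callback.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical tracer: logs the name of each SAX callback as it fires.
class SaxEventOrder extends DefaultHandler {
    private final List<String> events = new ArrayList<>();

    private void record(String event) {
        // A parser may split one text run into several characters() calls,
        // so consecutive "characters" entries are collapsed into one.
        boolean repeatedText = event.equals("characters")
                && !events.isEmpty()
                && events.get(events.size() - 1).equals("characters");
        if (!repeatedText) {
            events.add(event);
        }
    }

    @Override public void startDocument() { record("startDocument"); }

    @Override public void startPrefixMapping(String prefix, String uri) {
        record("startPrefixMapping");
    }

    @Override public void startElement(String uri, String localName,
                                       String qName, Attributes atts) {
        record("startElement:" + localName);
    }

    @Override public void characters(char[] chars, int offset, int length) {
        record("characters");
    }

    @Override public void endElement(String uri, String localName, String qName) {
        record("endElement:" + localName);
    }

    @Override public void endDocument() { record("endDocument"); }

    // Parses the input with a namespace-aware JDK parser and returns the
    // recorded callback names joined by commas.
    static String trace(String xhtml) {
        try {
            SaxEventOrder handler = new SaxEventOrder();
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.newSAXParser()
                   .parse(new InputSource(new StringReader(xhtml)), handler);
            return String.join(",", handler.events);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(trace(
            "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>Hi</body></html>"));
    }
}
```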
-
-