Package org.apache.tika.sax.boilerpipe
Class BoilerpipeContentHandler
- java.lang.Object
  - de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
    - org.apache.tika.sax.boilerpipe.BoilerpipeContentHandler

- All Implemented Interfaces:
- ContentHandler
 
public class BoilerpipeContentHandler extends de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler

Uses the boilerpipe library to automatically extract the main content from a web page. Use this as a ContentHandler object passed to HtmlParser#parse(java.io.InputStream, ContentHandler, Metadata, org.apache.tika.parser.ParseContext).
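
As a rough illustration of the usage described above, the following sketch wraps a delegate handler in a BoilerpipeContentHandler and passes it to HtmlParser#parse. It assumes Tika's HTML parser module and the boilerpipe library are on the classpath. The class name MainContentExample, the method extractMainContent, and the choice of BodyContentHandler as the delegate are illustrative assumptions, not part of this class's documentation.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.sax.boilerpipe.BoilerpipeContentHandler;

// Hypothetical helper class, named here only for illustration.
public class MainContentExample {

    // Parses the given HTML and returns whatever text boilerpipe's default
    // rules forward to the delegate handler.
    static String extractMainContent(String html) throws Exception {
        // Assumption: a plain-text body handler is used as the delegate.
        BodyContentHandler delegate = new BodyContentHandler();

        // Single-argument constructor: uses the DefaultExtractor rules.
        BoilerpipeContentHandler handler = new BoilerpipeContentHandler(delegate);

        try (InputStream in =
                new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8))) {
            new HtmlParser().parse(in, handler, new Metadata(), new ParseContext());
        }
        return delegate.toString();
    }
}
```

What ends up in the returned string depends on how boilerpipe classifies each text block, so very short pages may yield little or no text.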

Constructor Summary

Constructors
- BoilerpipeContentHandler(Writer writer): Creates a content handler that writes XHTML body character events to the given writer.
- BoilerpipeContentHandler(ContentHandler delegate): Creates a new boilerpipe-based content extractor, using the DefaultExtractor extraction rules and "delegate" as the content handler.
- BoilerpipeContentHandler(ContentHandler delegate, de.l3s.boilerpipe.BoilerpipeExtractor extractor): Creates a new boilerpipe-based content extractor, using the given extraction rules.

Method Summary

Methods
- void characters(char[] chars, int offset, int length)
- void endDocument()
- void endElement(String uri, String localName, String qName)
- de.l3s.boilerpipe.document.TextDocument getTextDocument(): Retrieves the built TextDocument.
- boolean isIncludeMarkup()
- void setIncludeMarkup(boolean includeMarkup)
- void startDocument()
- void startElement(String uri, String localName, String qName, Attributes atts)
- void startPrefixMapping(String prefix, String uri)
 
Constructor Detail

BoilerpipeContentHandler
public BoilerpipeContentHandler(ContentHandler delegate)
Creates a new boilerpipe-based content extractor, using the DefaultExtractor extraction rules and "delegate" as the content handler.
- Parameters:
  - delegate - The ContentHandler object
 
BoilerpipeContentHandler
public BoilerpipeContentHandler(Writer writer)
Creates a content handler that writes XHTML body character events to the given writer.
- Parameters:
  - writer - the writer that receives the character events
 
BoilerpipeContentHandler
public BoilerpipeContentHandler(ContentHandler delegate, de.l3s.boilerpipe.BoilerpipeExtractor extractor)
Creates a new boilerpipe-based content extractor, using the given extraction rules. The extracted main content will be passed to the delegate content handler.
- Parameters:
  - delegate - The ContentHandler object
  - extractor - Extraction rules to use, e.g. ArticleExtractor
 
 
Method Detail

isIncludeMarkup
public boolean isIncludeMarkup()

setIncludeMarkup
public void setIncludeMarkup(boolean includeMarkup)

getTextDocument
public de.l3s.boilerpipe.document.TextDocument getTextDocument()
Retrieves the built TextDocument.
- Returns: TextDocument
 
startDocument
public void startDocument() throws SAXException
- Specified by: startDocument in interface ContentHandler
- Overrides: startDocument in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
- Throws: SAXException
 
startPrefixMapping
public void startPrefixMapping(String prefix, String uri) throws SAXException
- Specified by: startPrefixMapping in interface ContentHandler
- Overrides: startPrefixMapping in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
- Throws: SAXException
 
startElement
public void startElement(String uri, String localName, String qName, Attributes atts) throws SAXException
- Specified by: startElement in interface ContentHandler
- Overrides: startElement in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
- Throws: SAXException
 
characters
public void characters(char[] chars, int offset, int length) throws SAXException
- Specified by: characters in interface ContentHandler
- Overrides: characters in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
- Throws: SAXException
 
endElement
public void endElement(String uri, String localName, String qName) throws SAXException
- Specified by: endElement in interface ContentHandler
- Overrides: endElement in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
- Throws: SAXException
 
endDocument
public void endDocument() throws SAXException
- Specified by: endDocument in interface ContentHandler
- Overrides: endDocument in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
- Throws: SAXException
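
The overridden methods above are the standard SAX ContentHandler callbacks. To illustrate the order in which a SAX parser typically fires them, here is a minimal sketch that uses only the JDK's built-in SAX parser and does not involve Tika or boilerpipe. The class SaxEventOrder and its nested Recorder handler are hypothetical names introduced for this example. A handler of this kind could also serve as the "delegate" argument of the constructors documented earlier.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of Tika or boilerpipe.
public class SaxEventOrder {

    // Records structural callbacks in order and accumulates character data.
    static class Recorder extends DefaultHandler {
        final List<String> events = new ArrayList<>();
        final StringBuilder text = new StringBuilder();

        @Override
        public void startDocument() {
            events.add("startDocument");
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            events.add("startElement:" + qName);
        }

        @Override
        public void characters(char[] chars, int offset, int length) {
            // A parser may deliver text in several chunks, so append each one.
            text.append(chars, offset, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            events.add("endElement:" + qName);
        }

        @Override
        public void endDocument() {
            events.add("endDocument");
        }
    }

    // Parses a small XML string with the JDK parser and returns the recorder.
    static Recorder record(String xml) throws Exception {
        Recorder recorder = new Recorder();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), recorder);
        return recorder;
    }
}
```

Calling SaxEventOrder.record("<p>Hello</p>") yields the document start, the element start and end for p, and the document end, with the text collected separately.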
 
 