Package org.apache.tika.parser.html
Class BoilerpipeContentHandler
- java.lang.Object
  - de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
    - org.apache.tika.parser.html.BoilerpipeContentHandler
- All Implemented Interfaces:
ContentHandler
public class BoilerpipeContentHandler extends de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
Uses the boilerpipe library to automatically extract the main content from a web page. Use this as a ContentHandler object passed to HtmlParser.parse(java.io.InputStream, ContentHandler, Metadata, org.apache.tika.parser.ParseContext).
-
Constructor Summary
Constructors
- BoilerpipeContentHandler(Writer writer)
  Creates a content handler that writes XHTML body character events to the given writer.
- BoilerpipeContentHandler(ContentHandler delegate)
  Creates a new boilerpipe-based content extractor, using the DefaultExtractor extraction rules and "delegate" as the content handler.
- BoilerpipeContentHandler(ContentHandler delegate, de.l3s.boilerpipe.BoilerpipeExtractor extractor)
  Creates a new boilerpipe-based content extractor, using the given extraction rules.
-
Method Summary
Modifier and Type, Method, Description
- void characters(char[] chars, int offset, int length)
- void endDocument()
- void endElement(String uri, String localName, String qName)
- de.l3s.boilerpipe.document.TextDocument getTextDocument()
  Retrieves the built TextDocument.
- boolean isIncludeMarkup()
- void setIncludeMarkup(boolean includeMarkup)
- void startDocument()
- void startElement(String uri, String localName, String qName, Attributes atts)
- void startPrefixMapping(String prefix, String uri)
-
Constructor Detail
-
BoilerpipeContentHandler
public BoilerpipeContentHandler(ContentHandler delegate)
Creates a new boilerpipe-based content extractor, using the DefaultExtractor extraction rules and "delegate" as the content handler.
- Parameters:
delegate - The ContentHandler object
-
BoilerpipeContentHandler
public BoilerpipeContentHandler(Writer writer)
Creates a content handler that writes XHTML body character events to the given writer.
- Parameters:
writer - The Writer that receives the XHTML body character events
-
BoilerpipeContentHandler
public BoilerpipeContentHandler(ContentHandler delegate, de.l3s.boilerpipe.BoilerpipeExtractor extractor)
Creates a new boilerpipe-based content extractor, using the given extraction rules. The extracted main content will be passed to the delegate content handler.
- Parameters:
delegate - The ContentHandler object
extractor - Extraction rules to use, e.g. ArticleExtractor
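The delegate arrangement described above can be illustrated with a minimal sketch that uses only the JDK's SAX classes. Everything in it is hypothetical and stands in for the real classes: FilteringHandler plays the role of the extracting handler, BlockRule plays the role of the extraction rules, and Collector is the delegate. The length-based rule is an assumption chosen for illustration and is not boilerpipe's algorithm.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class DelegateSketch {

    /** Hypothetical stand-in for extraction rules: decides which text blocks to keep. */
    interface BlockRule {
        boolean keep(String block);
    }

    /** Buffers the text of each <p> element, then forwards the kept blocks to the delegate. */
    static class FilteringHandler extends DefaultHandler {
        private final ContentHandler delegate;
        private final BlockRule rule;
        private final List<String> blocks = new ArrayList<>();
        private StringBuilder current; // non-null only while inside a <p> element

        FilteringHandler(ContentHandler delegate, BlockRule rule) {
            this.delegate = delegate;
            this.rule = rule;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("p".equals(qName)) {
                current = new StringBuilder();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (current != null) {
                current.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("p".equals(qName) && current != null) {
                blocks.add(current.toString());
                current = null;
            }
        }

        @Override
        public void endDocument() throws SAXException {
            // Only the blocks accepted by the rule reach the delegate.
            delegate.startDocument();
            for (String block : blocks) {
                if (rule.keep(block)) {
                    char[] chars = block.toCharArray();
                    delegate.characters(chars, 0, chars.length);
                }
            }
            delegate.endDocument();
        }
    }

    /** Delegate that collects whatever text is forwarded to it. */
    static class Collector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    static String extract(String xhtml, BlockRule rule) throws Exception {
        Collector collector = new Collector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), new FilteringHandler(collector, rule));
        return collector.text.toString();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><p>Home</p>"
                + "<p>This paragraph is the main article text.</p></body></html>";
        // Assumed rule: keep blocks longer than 20 characters.
        System.out.println(extract(page, block -> block.length() > 20));
    }
}
```

Swapping the rule passed to extract changes what the delegate sees without touching the delegate itself, which is the same separation the two-argument constructor offers between the extractor and the delegate.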
-
Method Detail
-
isIncludeMarkup
public boolean isIncludeMarkup()
-
setIncludeMarkup
public void setIncludeMarkup(boolean includeMarkup)
-
getTextDocument
public de.l3s.boilerpipe.document.TextDocument getTextDocument()
Retrieves the built TextDocument.
- Returns:
TextDocument
-
startDocument
public void startDocument() throws SAXException
- Specified by:
startDocument in interface ContentHandler
- Overrides:
startDocument in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
- Throws:
SAXException
-
startPrefixMapping
public void startPrefixMapping(String prefix, String uri) throws SAXException
- Specified by:
startPrefixMapping in interface ContentHandler
- Overrides:
startPrefixMapping in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
- Throws:
SAXException
-
startElement
public void startElement(String uri, String localName, String qName, Attributes atts) throws SAXException
- Specified by:
startElement in interface ContentHandler
- Overrides:
startElement in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
- Throws:
SAXException
-
characters
public void characters(char[] chars, int offset, int length) throws SAXException
- Specified by:
characters in interface ContentHandler
- Overrides:
characters in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
- Throws:
SAXException
-
endElement
public void endElement(String uri, String localName, String qName) throws SAXException
- Specified by:
endElement in interface ContentHandler
- Overrides:
endElement in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
- Throws:
SAXException
-
endDocument
public void endDocument() throws SAXException
- Specified by:
endDocument in interface ContentHandler
- Overrides:
endDocument in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
- Throws:
SAXException
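The overridden methods above are the standard SAX ContentHandler callbacks. As a rough illustration of the order in which a parser fires them, the following sketch records events with a hypothetical RecordingHandler and the JDK's own XML parser. It does not use BoilerpipeContentHandler or HtmlParser, and the class and method names are assumptions made for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxEventOrder {

    /** Hypothetical handler that records the name of each callback as it fires. */
    static class RecordingHandler extends DefaultHandler {
        final List<String> events = new ArrayList<>();

        @Override
        public void startDocument() {
            events.add("startDocument");
        }

        @Override
        public void startPrefixMapping(String prefix, String uri) {
            events.add("startPrefixMapping");
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            events.add("startElement:" + localName);
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // A parser may split one text run across several characters() calls.
            events.add("characters");
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            events.add("endElement:" + localName);
        }

        @Override
        public void endDocument() {
            events.add("endDocument");
        }
    }

    static List<String> record(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        // Namespace awareness is needed for startPrefixMapping and localName to be reported.
        factory.setNamespaceAware(true);
        RecordingHandler handler = new RecordingHandler();
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        return handler.events;
    }

    public static void main(String[] args) throws Exception {
        for (String event : record("<p xmlns=\"http://www.w3.org/1999/xhtml\">Hi</p>")) {
            System.out.println(event);
        }
    }
}
```

Any handler passed to a SAX-style parser sees the same bracketing: startDocument first, endDocument last, with prefix mappings announced before the element that declares them.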