Package org.apache.tika.sax.boilerpipe
Class BoilerpipeContentHandler
java.lang.Object
de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
org.apache.tika.sax.boilerpipe.BoilerpipeContentHandler
All Implemented Interfaces:
ContentHandler
public class BoilerpipeContentHandler
extends de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
Uses the boilerpipe library to automatically extract the main content from a web page.
Use this as a ContentHandler object passed to HtmlParser#parse(java.io.InputStream, ContentHandler, Metadata, org.apache.tika.parser.ParseContext).
-
Constructor Summary
Constructor    Description
BoilerpipeContentHandler(Writer writer)    Creates a content handler that writes XHTML body character events to the given writer.
BoilerpipeContentHandler(ContentHandler delegate)    Creates a new boilerpipe-based content extractor, using the DefaultExtractor extraction rules and "delegate" as the content handler.
BoilerpipeContentHandler(ContentHandler delegate, de.l3s.boilerpipe.BoilerpipeExtractor extractor)    Creates a new boilerpipe-based content extractor, using the given extraction rules.
-
Method Summary
Modifier and Type    Method    Description
void    characters(char[] chars, int offset, int length)
void    endDocument()
void    endElement(String uri, String localName, String qName)
de.l3s.boilerpipe.document.TextDocument    getTextDocument()    Retrieves the built TextDocument
boolean    isIncludeMarkup()
void    setIncludeMarkup(boolean includeMarkup)
void    startDocument()
void    startElement(String uri, String localName, String qName, Attributes atts)
void    startPrefixMapping(String prefix, String uri)
Methods inherited from class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
addTextBlock, addWhitespaceIfNecessary, endPrefixMapping, getTitle, ignorableWhitespace, processingInstruction, recycle, setDocumentLocator, setTitle, skippedEntity, toTextDocument
-
Constructor Details
-
BoilerpipeContentHandler
public BoilerpipeContentHandler(ContentHandler delegate)
Creates a new boilerpipe-based content extractor, using the DefaultExtractor extraction rules and "delegate" as the content handler.
Parameters:
delegate - The ContentHandler object
-
BoilerpipeContentHandler
public BoilerpipeContentHandler(Writer writer)
Creates a content handler that writes XHTML body character events to the given writer.
Parameters:
writer - writer
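To illustrate what "writes XHTML body character events to the given writer" means in SAX terms, here is a minimal sketch that uses only the JDK's SAX classes. It is not Tika's implementation and performs no boilerpipe extraction. The names BodyTextWriterSketch, BodyTextHandler, and bodyText are hypothetical and exist only for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class BodyTextWriterSketch {

    /** Hypothetical handler: forwards character events inside <body> to a Writer. */
    static class BodyTextHandler extends DefaultHandler {
        private final Writer writer;
        private int bodyDepth = 0; // greater than 0 while inside <body>

        BodyTextHandler(Writer writer) {
            this.writer = writer;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            // Start counting at <body>, then track nesting below it.
            if (bodyDepth > 0 || "body".equals(qName)) {
                bodyDepth++;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (bodyDepth > 0) {
                bodyDepth--;
            }
        }

        @Override
        public void characters(char[] chars, int offset, int length) throws SAXException {
            if (bodyDepth > 0) {
                try {
                    writer.write(chars, offset, length);
                } catch (IOException e) {
                    throw new SAXException(e);
                }
            }
        }
    }

    /** Parses well-formed XHTML and returns the text found inside <body>. */
    public static String bodyText(String xhtml) {
        try {
            StringWriter out = new StringWriter();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), new BodyTextHandler(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Parsing failed", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><head><title>T</title></head><body><p>Hello</p></body></html>";
        System.out.println(bodyText(page)); // prints "Hello"
    }
}
```

The title text is skipped because it arrives before the body element opens. Only character events received while bodyDepth is positive reach the writer.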
-
BoilerpipeContentHandler
public BoilerpipeContentHandler(ContentHandler delegate, de.l3s.boilerpipe.BoilerpipeExtractor extractor)
Creates a new boilerpipe-based content extractor, using the given extraction rules. The extracted main content will be passed to the delegate content handler.
Parameters:
delegate - The ContentHandler object
extractor - Extraction rules to use, e.g. ArticleExtractor
-
-
Method Details
-
isIncludeMarkup
public boolean isIncludeMarkup()
-
setIncludeMarkup
public void setIncludeMarkup(boolean includeMarkup)
-
getTextDocument
public de.l3s.boilerpipe.document.TextDocument getTextDocument()
Retrieves the built TextDocument.
Returns:
TextDocument
-
startDocument
public void startDocument() throws SAXException
Specified by:
startDocument in interface ContentHandler
Overrides:
startDocument in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
Throws:
SAXException
-
startPrefixMapping
public void startPrefixMapping(String prefix, String uri) throws SAXException
Specified by:
startPrefixMapping in interface ContentHandler
Overrides:
startPrefixMapping in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
Throws:
SAXException
-
startElement
public void startElement(String uri, String localName, String qName, Attributes atts) throws SAXException
Specified by:
startElement in interface ContentHandler
Overrides:
startElement in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
Throws:
SAXException
-
characters
public void characters(char[] chars, int offset, int length) throws SAXException
Specified by:
characters in interface ContentHandler
Overrides:
characters in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
Throws:
SAXException
-
endElement
public void endElement(String uri, String localName, String qName) throws SAXException
Specified by:
endElement in interface ContentHandler
Overrides:
endElement in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
Throws:
SAXException
-
endDocument
public void endDocument() throws SAXException
Specified by:
endDocument in interface ContentHandler
Overrides:
endDocument in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
Throws:
SAXException
-
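The callbacks listed under Method Details are all specified by the SAX ContentHandler interface, so a parser invokes them in a fixed sequence. The sketch below records that sequence with a plain JDK SAX parser. It does not use Tika or boilerpipe, and the names SaxEventOrderSketch, RecordingHandler, and recordEvents are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxEventOrderSketch {

    /** Hypothetical handler: records the name of each ContentHandler callback as it fires. */
    static class RecordingHandler extends DefaultHandler {
        final List<String> events = new ArrayList<>();

        @Override
        public void startDocument() {
            events.add("startDocument");
        }

        @Override
        public void startPrefixMapping(String prefix, String uri) {
            events.add("startPrefixMapping");
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            events.add("startElement:" + localName);
        }

        @Override
        public void characters(char[] chars, int offset, int length) {
            // A parser may split one text run into several calls; record the run once.
            if (events.isEmpty() || !events.get(events.size() - 1).equals("characters")) {
                events.add("characters");
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            events.add("endElement:" + localName);
        }

        @Override
        public void endDocument() {
            events.add("endDocument");
        }
    }

    /** Parses the XML string and returns the recorded callback names in order. */
    public static List<String> recordEvents(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // required for startPrefixMapping events
            RecordingHandler handler = new RecordingHandler();
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
            return handler.events;
        } catch (Exception e) {
            throw new IllegalStateException("Parsing failed", e);
        }
    }

    public static void main(String[] args) {
        for (String event : recordEvents("<p xmlns=\"http://www.w3.org/1999/xhtml\">hi</p>")) {
            System.out.println(event);
        }
    }
}
```

For the one-element input in main, the recorded order is startDocument, startPrefixMapping, startElement:p, characters, endElement:p, endDocument. The endPrefixMapping callback also fires but is not recorded here because the sketch does not override it.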