java.lang.Object
  de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
    org.apache.tika.parser.html.BoilerpipeContentHandler

public class BoilerpipeContentHandler
extends de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
Uses the boilerpipe library to automatically extract the main content from a web page.

Use this as the ContentHandler object passed to HtmlParser.parse(java.io.InputStream, ContentHandler, Metadata, org.apache.tika.parser.ParseContext).
Constructor Summary | |
---|---|
BoilerpipeContentHandler(org.xml.sax.ContentHandler delegate) | Creates a new boilerpipe-based content extractor, using the DefaultExtractor extraction rules and "delegate" as the content handler. |
BoilerpipeContentHandler(org.xml.sax.ContentHandler delegate, de.l3s.boilerpipe.BoilerpipeExtractor extractor) | Creates a new boilerpipe-based content extractor, using the given extraction rules. |
BoilerpipeContentHandler(java.io.Writer writer) | Creates a content handler that writes XHTML body character events to the given writer. |
Method Summary | |
---|---|
void | endDocument() |
Methods inherited from class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler |
---|
addTextBlock, addWhitespaceIfNecessary, characters, endElement, endPrefixMapping, getTitle, ignorableWhitespace, processingInstruction, recycle, setDocumentLocator, setTitle, skippedEntity, startDocument, startElement, startPrefixMapping, toTextDocument |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
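Constructor Detail |
---|
public BoilerpipeContentHandler(org.xml.sax.ContentHandler delegate)

Creates a new boilerpipe-based content extractor, using the DefaultExtractor extraction rules and "delegate" as the content handler.

Parameters:
delegate - The ContentHandler object

public BoilerpipeContentHandler(java.io.Writer writer)

Creates a content handler that writes XHTML body character events to the given writer.

Parameters:
writer - The writer

public BoilerpipeContentHandler(org.xml.sax.ContentHandler delegate, de.l3s.boilerpipe.BoilerpipeExtractor extractor)

Creates a new boilerpipe-based content extractor, using the given extraction rules.

Parameters:
delegate - The ContentHandler object
extractor - Extraction rules to use, e.g. ArticleExtractor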
Method Detail |
---|
public void endDocument() throws org.xml.sax.SAXException

Specified by:
endDocument in interface org.xml.sax.ContentHandler
Overrides:
endDocument in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
Throws:
org.xml.sax.SAXException
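
The constructors above take a delegate ContentHandler (or a Writer) plus a set of extraction rules, and endDocument() is the only method this class overrides. One plausible reading is a wrap-and-forward design: text is buffered while the page is parsed, and the blocks that pass the rules are handed to the delegate once the document ends. The following is a minimal sketch of that pattern using only the JDK's SAX classes. The names DelegateSketch, WriterHandler and FilteringHandler, and the length-based rule, are hypothetical stand-ins for illustration; they are not Tika's or boilerpipe's actual implementation.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

class DelegateSketch {

    /** Hypothetical sink: writes every character event to the given Writer. */
    static class WriterHandler extends DefaultHandler {
        private final Writer writer;

        WriterHandler(Writer writer) {
            this.writer = writer;
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            try {
                writer.write(ch, start, length);
            } catch (IOException e) {
                throw new SAXException(e);
            }
        }
    }

    /** Hypothetical wrapper: buffers text blocks, forwards the accepted ones at endDocument(). */
    static class FilteringHandler extends DefaultHandler {
        private final ContentHandler delegate;
        private final Predicate<String> rules; // stand-in for real extraction rules
        private final List<String> blocks = new ArrayList<>();
        private final StringBuilder current = new StringBuilder();

        FilteringHandler(ContentHandler delegate, Predicate<String> rules) {
            this.delegate = delegate;
            this.rules = rules;
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            current.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            // Treat each closed element as the end of one text block.
            if (current.length() > 0) {
                blocks.add(current.toString());
                current.setLength(0);
            }
        }

        @Override
        public void endDocument() throws SAXException {
            // Only now is the whole page known, so filtering happens here.
            delegate.startDocument();
            for (String block : blocks) {
                if (rules.test(block)) {
                    char[] chars = (block + "\n").toCharArray();
                    delegate.characters(chars, 0, chars.length);
                }
            }
            delegate.endDocument();
        }
    }

    /** Feeds three blocks through the wrapper and returns what reached the writer. */
    static String demo() {
        StringWriter out = new StringWriter();
        // Assumed toy rule: keep blocks of at least 20 characters.
        FilteringHandler handler =
                new FilteringHandler(new WriterHandler(out), block -> block.length() >= 20);
        try {
            handler.startDocument();
            emit(handler, "Home | About");
            emit(handler, "This paragraph is the main content of the page.");
            emit(handler, "Copyright");
            handler.endDocument();
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }

    private static void emit(ContentHandler handler, String text) throws SAXException {
        handler.startElement("", "p", "p", new AttributesImpl());
        handler.characters(text.toCharArray(), 0, text.length());
        handler.endElement("", "p", "p");
    }

    public static void main(String[] args) {
        System.out.print(demo());
    }
}
```

In this sketch the short navigation and footer blocks are dropped and only the long paragraph reaches the writer. With the real class, the analogous step would be passing a BoilerpipeContentHandler that wraps your own handler to HtmlParser.parse, as described at the top of this page.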