org.apache.tika.parser.html
Class BoilerpipeContentHandler
java.lang.Object
de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
org.apache.tika.parser.html.BoilerpipeContentHandler
- All Implemented Interfaces: ContentHandler
public class BoilerpipeContentHandler
extends de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
Uses the boilerpipe library to automatically extract the main content from a web page.
Use this as the ContentHandler object passed to
HtmlParser.parse(java.io.InputStream, ContentHandler, Metadata, org.apache.tika.parser.ParseContext).
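As a minimal sketch of the usage described above, the following wraps a BodyContentHandler in a BoilerpipeContentHandler and passes it to HtmlParser.parse. The class name MainContentExample and the helper method extractMainContent are hypothetical names chosen for illustration, and the code assumes that Tika and the boilerpipe library are both on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.BoilerpipeContentHandler;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.sax.BodyContentHandler;

// Hypothetical example class; not part of Tika.
class MainContentExample {

    // Parses the given HTML and returns whatever text boilerpipe
    // keeps as the main content of the page.
    static String extractMainContent(String html) throws Exception {
        // The delegate collects the text that boilerpipe passes on.
        BodyContentHandler textHandler = new BodyContentHandler();
        BoilerpipeContentHandler handler = new BoilerpipeContentHandler(textHandler);

        try (InputStream in =
                 new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8))) {
            new HtmlParser().parse(in, handler, new Metadata(), new ParseContext());
        }
        return textHandler.toString();
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><head><title>Sample</title></head>"
                    + "<body><p>Some article text for the sample page.</p></body></html>";
        System.out.println(extractMainContent(html));
    }
}
```

How much text survives depends on the extraction rules and on the page itself, so very short sample pages may yield little or no output.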
Methods inherited from class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
addTextBlock, addWhitespaceIfNecessary, endPrefixMapping, getTitle, ignorableWhitespace, processingInstruction, recycle, setDocumentLocator, setTitle, skippedEntity, toTextDocument
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
BoilerpipeContentHandler
public BoilerpipeContentHandler(ContentHandler delegate)
- Creates a new boilerpipe-based content extractor, using the DefaultExtractor extraction rules and "delegate" as the content handler.
- Parameters:
delegate - The ContentHandler object that receives the extracted main content
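Any SAX ContentHandler can act as the delegate. As a rough sketch, the class below (TextCollectingHandler, a hypothetical name that is not part of Tika or boilerpipe) extends the JDK's DefaultHandler and collects character events into a string. The main method feeds it events by hand only to show the calls a delegate receives; in real use the events would come from the BoilerpipeContentHandler.

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical delegate: collects all character data it receives.
class TextCollectingHandler extends DefaultHandler {

    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] chars, int offset, int length) {
        // Append only the slice of the buffer that belongs to this event.
        text.append(chars, offset, length);
    }

    String getText() {
        return text.toString();
    }

    public static void main(String[] args) throws SAXException {
        TextCollectingHandler handler = new TextCollectingHandler();
        char[] chars = "main content".toCharArray();

        // Simulate the SAX events a delegate would receive.
        handler.startDocument();
        handler.startElement("", "p", "p", new AttributesImpl());
        handler.characters(chars, 0, chars.length);
        handler.endElement("", "p", "p");
        handler.endDocument();

        System.out.println(handler.getText()); // prints "main content"
    }
}
```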
BoilerpipeContentHandler
public BoilerpipeContentHandler(Writer writer)
- Creates a content handler that writes XHTML body character events to the given writer.
- Parameters:
writer - The Writer that receives the character events
BoilerpipeContentHandler
public BoilerpipeContentHandler(ContentHandler delegate,
de.l3s.boilerpipe.BoilerpipeExtractor extractor)
- Creates a new boilerpipe-based content extractor, using the given extraction rules. The extracted main content will be passed to the content handler.
- Parameters:
delegate - The ContentHandler object
extractor - Extraction rules to use, e.g. ArticleExtractor
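A short sketch of this constructor with the ArticleExtractor rules mentioned above. The class ArticleHandlerExample and its factory method are hypothetical names; ArticleExtractor.INSTANCE is assumed to be available from the boilerpipe library on the classpath.

```java
import org.apache.tika.parser.html.BoilerpipeContentHandler;
import org.xml.sax.ContentHandler;

import de.l3s.boilerpipe.extractors.ArticleExtractor;

// Hypothetical example class; not part of Tika.
class ArticleHandlerExample {

    // Builds a handler that applies the article extraction rules and
    // forwards the extracted main content to the given delegate.
    static BoilerpipeContentHandler articleHandler(ContentHandler delegate) {
        return new BoilerpipeContentHandler(delegate, ArticleExtractor.INSTANCE);
    }
}
```

The returned handler is passed to HtmlParser.parse in the same way as one built with the single-argument constructor.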
setIncludeMarkup
public void setIncludeMarkup(boolean includeMarkup)
isIncludeMarkup
public boolean isIncludeMarkup()
startDocument
public void startDocument()
throws SAXException
- Specified by: startDocument in interface ContentHandler
- Overrides: startDocument in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
- Throws: SAXException
startPrefixMapping
public void startPrefixMapping(String prefix,
String uri)
throws SAXException
- Specified by: startPrefixMapping in interface ContentHandler
- Overrides: startPrefixMapping in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
- Throws: SAXException
startElement
public void startElement(String uri,
String localName,
String qName,
Attributes atts)
throws SAXException
- Specified by: startElement in interface ContentHandler
- Overrides: startElement in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
- Throws: SAXException
characters
public void characters(char[] chars,
int offset,
int length)
throws SAXException
- Specified by: characters in interface ContentHandler
- Overrides: characters in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
- Throws: SAXException
endElement
public void endElement(String uri,
String localName,
String qName)
throws SAXException
- Specified by: endElement in interface ContentHandler
- Overrides: endElement in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
- Throws: SAXException
endDocument
public void endDocument()
throws SAXException
- Specified by: endDocument in interface ContentHandler
- Overrides: endDocument in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
- Throws: SAXException
Copyright © 2007-2012 The Apache Software Foundation. All Rights Reserved.