org.apache.tika.parser.html
Class BoilerpipeContentHandler
java.lang.Object
de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
org.apache.tika.parser.html.BoilerpipeContentHandler
- All Implemented Interfaces:
- org.xml.sax.ContentHandler
public class BoilerpipeContentHandler
- extends de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
Uses the boilerpipe library to automatically extract the main content from a web page.
Pass an instance of this class as the ContentHandler argument to
HtmlParser.parse(java.io.InputStream, ContentHandler, Metadata, org.apache.tika.parser.ParseContext).
Constructor Summary
BoilerpipeContentHandler(org.xml.sax.ContentHandler delegate)
Creates a new boilerpipe-based content extractor, using the
DefaultExtractor extraction rules and "delegate" as the content handler.
BoilerpipeContentHandler(org.xml.sax.ContentHandler delegate,
de.l3s.boilerpipe.BoilerpipeExtractor extractor)
Creates a new boilerpipe-based content extractor, using the given
extraction rules.
BoilerpipeContentHandler(java.io.Writer writer)
Creates a content handler that writes XHTML body character events to
the given writer.
Methods inherited from class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
addTextBlock, addWhitespaceIfNecessary, endPrefixMapping, getTitle, ignorableWhitespace, processingInstruction, recycle, setDocumentLocator, setTitle, skippedEntity, toTextDocument
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
BoilerpipeContentHandler
public BoilerpipeContentHandler(org.xml.sax.ContentHandler delegate)
- Creates a new boilerpipe-based content extractor, using the
DefaultExtractor
extraction rules and "delegate" as the content handler.
- Parameters:
delegate - The ContentHandler object that receives the extracted content
BoilerpipeContentHandler
public BoilerpipeContentHandler(java.io.Writer writer)
- Creates a content handler that writes XHTML body character events to
the given writer.
- Parameters:
writer - The Writer that receives the XHTML body character events
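The idea of "writing XHTML body character events to a writer" can be illustrated with plain JDK SAX classes. The sketch below is not Tika's implementation: BodyTextWriterSketch is a hypothetical stand-in that forwards only the character events occurring inside body to a Writer, and the event stream in demo() is written by hand purely for illustration.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in (not a Tika class): copies character events that
// occur inside <body> to a Writer and ignores everything else.
final class BodyTextWriterSketch extends DefaultHandler {
    private static final String XHTML = "http://www.w3.org/1999/xhtml";

    private final Writer writer;
    private int bodyDepth = 0; // greater than 0 while inside <body>

    BodyTextWriterSketch(Writer writer) {
        this.writer = writer;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (bodyDepth > 0 || "body".equals(localName)) {
            bodyDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (bodyDepth > 0) {
            bodyDepth--;
        }
    }

    @Override
    public void characters(char[] chars, int offset, int length) throws SAXException {
        if (bodyDepth == 0) {
            return; // text outside <body>, e.g. the title, is dropped
        }
        try {
            writer.write(chars, offset, length);
        } catch (IOException e) {
            throw new SAXException(e);
        }
    }

    private static void text(DefaultHandler h, String s) throws SAXException {
        h.characters(s.toCharArray(), 0, s.length());
    }

    // Feeds a small hand-written event stream through the handler.
    static String demo() {
        StringWriter out = new StringWriter();
        BodyTextWriterSketch h = new BodyTextWriterSketch(out);
        Attributes none = new AttributesImpl();
        try {
            h.startDocument();
            h.startElement(XHTML, "html", "html", none);
            h.startElement(XHTML, "title", "title", none);
            text(h, "Ignored title");
            h.endElement(XHTML, "title", "title");
            h.startElement(XHTML, "body", "body", none);
            h.startElement(XHTML, "p", "p", none);
            text(h, "Main text");
            h.endElement(XHTML, "p", "p");
            h.endElement(XHTML, "body", "body");
            h.endElement(XHTML, "html", "html");
            h.endDocument();
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }
}
```

Calling BodyTextWriterSketch.demo() returns only the text that appeared inside body; the title text never reaches the writer.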
BoilerpipeContentHandler
public BoilerpipeContentHandler(org.xml.sax.ContentHandler delegate,
de.l3s.boilerpipe.BoilerpipeExtractor extractor)
- Creates a new boilerpipe-based content extractor, using the given
extraction rules. The extracted main content will be passed to the
content handler.
- Parameters:
delegate - The ContentHandler object
extractor - Extraction rules to use, e.g. ArticleExtractor
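One plausible way to picture "extract, then pass the main content to the content handler" is a handler that buffers text blocks during parsing and forwards only the blocks a rule accepts once the document ends. The sketch below uses only JDK SAX classes and makes several assumptions: FilteringHandlerSketch and BlockRule are hypothetical names, each element end is treated as a block boundary, and the five-word threshold is a toy rule that stands in for, and is far simpler than, boilerpipe's DefaultExtractor or ArticleExtractor.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch (not a Tika or boilerpipe class): buffers text blocks,
// then forwards the blocks accepted by a rule to a delegate handler.
final class FilteringHandlerSketch extends DefaultHandler {
    /** Hypothetical stand-in for a set of extraction rules. */
    interface BlockRule {
        boolean keep(String block);
    }

    private final ContentHandler delegate;
    private final BlockRule rule;
    private final List<String> blocks = new ArrayList<>();
    private final StringBuilder current = new StringBuilder();

    FilteringHandlerSketch(ContentHandler delegate, BlockRule rule) {
        this.delegate = delegate;
        this.rule = rule;
    }

    @Override
    public void characters(char[] chars, int offset, int length) {
        current.append(chars, offset, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Assumption: every element end closes the current text block.
        String block = current.toString().trim();
        current.setLength(0);
        if (!block.isEmpty()) {
            blocks.add(block);
        }
    }

    @Override
    public void endDocument() throws SAXException {
        // Only now, with the whole page seen, is content sent to the delegate.
        delegate.startDocument();
        for (String block : blocks) {
            if (rule.keep(block)) {
                char[] out = (block + "\n").toCharArray();
                delegate.characters(out, 0, out.length);
            }
        }
        delegate.endDocument();
    }

    static String demo() {
        StringBuilder seen = new StringBuilder();
        ContentHandler collector = new DefaultHandler() {
            @Override
            public void characters(char[] chars, int offset, int length) {
                seen.append(chars, offset, length);
            }
        };
        // Toy rule: keep blocks of at least five words.
        FilteringHandlerSketch h = new FilteringHandlerSketch(
                collector, block -> block.split("\\s+").length >= 5);
        Attributes none = new AttributesImpl();
        String nav = "Home About";
        String article = "This paragraph is the actual article text.";
        try {
            h.startDocument();
            h.startElement("", "p", "p", none);
            h.characters(nav.toCharArray(), 0, nav.length());
            h.endElement("", "p", "p");
            h.startElement("", "p", "p", none);
            h.characters(article.toCharArray(), 0, article.length());
            h.endElement("", "p", "p");
            h.endDocument();
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return seen.toString();
    }
}
```

In this sketch the delegate sees nothing until endDocument() runs, because the rule needs the complete set of blocks before deciding what counts as main content. The short navigation block is dropped and the longer paragraph is forwarded.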
setIncludeMarkup
public void setIncludeMarkup(boolean includeMarkup)
isIncludeMarkup
public boolean isIncludeMarkup()
startDocument
public void startDocument()
throws org.xml.sax.SAXException
- Specified by:
startDocument
in interface org.xml.sax.ContentHandler
- Overrides:
startDocument
in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
- Throws:
org.xml.sax.SAXException
startPrefixMapping
public void startPrefixMapping(java.lang.String prefix,
java.lang.String uri)
throws org.xml.sax.SAXException
- Specified by:
startPrefixMapping
in interface org.xml.sax.ContentHandler
- Overrides:
startPrefixMapping
in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
- Throws:
org.xml.sax.SAXException
startElement
public void startElement(java.lang.String uri,
java.lang.String localName,
java.lang.String qName,
org.xml.sax.Attributes atts)
throws org.xml.sax.SAXException
- Specified by:
startElement
in interface org.xml.sax.ContentHandler
- Overrides:
startElement
in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
- Throws:
org.xml.sax.SAXException
characters
public void characters(char[] chars,
int offset,
int length)
throws org.xml.sax.SAXException
- Specified by:
characters
in interface org.xml.sax.ContentHandler
- Overrides:
characters
in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
- Throws:
org.xml.sax.SAXException
endElement
public void endElement(java.lang.String uri,
java.lang.String localName,
java.lang.String qName)
throws org.xml.sax.SAXException
- Specified by:
endElement
in interface org.xml.sax.ContentHandler
- Overrides:
endElement
in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
- Throws:
org.xml.sax.SAXException
endDocument
public void endDocument()
throws org.xml.sax.SAXException
- Specified by:
endDocument
in interface org.xml.sax.ContentHandler
- Overrides:
endDocument
in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
- Throws:
org.xml.sax.SAXException
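The methods above are the standard org.xml.sax.ContentHandler callbacks, so the order in which a parser invokes them follows the SAX contract. As a rough illustration that does not involve Tika or boilerpipe, the sketch below records the callbacks the JDK's own SAX parser makes for a tiny well-formed document. CallbackOrderSketch is a hypothetical name, and adjacent characters calls are merged because a parser is allowed to deliver text in several chunks.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: logs the order of ContentHandler callbacks.
final class CallbackOrderSketch extends DefaultHandler {
    final List<String> events = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();

    // Emits any pending text as a single merged "characters" entry.
    private void flushText() {
        if (text.length() > 0) {
            events.add("characters(" + text + ")");
            text.setLength(0);
        }
    }

    @Override
    public void startDocument() {
        events.add("startDocument");
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        flushText();
        events.add("startElement(" + qName + ")");
    }

    @Override
    public void characters(char[] chars, int offset, int length) {
        text.append(chars, offset, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        flushText();
        events.add("endElement(" + qName + ")");
    }

    @Override
    public void endDocument() {
        flushText();
        events.add("endDocument");
    }

    // Parses the given XML with the JDK SAX parser and returns the event log.
    static List<String> record(String xml) {
        try {
            CallbackOrderSketch handler = new CallbackOrderSketch();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.events;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        for (String event : record("<html><body><p>Hi</p></body></html>")) {
            System.out.println(event);
        }
    }
}
```

startDocument is always the first callback and endDocument the last, with element and character events nested in document order between them.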
Copyright © 2007-2011 The Apache Software Foundation. All Rights Reserved.