org.apache.tika.parser.html
Class BoilerpipeContentHandler

java.lang.Object
  extended by de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
      extended by org.apache.tika.parser.html.BoilerpipeContentHandler
All Implemented Interfaces:
ContentHandler

public class BoilerpipeContentHandler
extends de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler

Uses the boilerpipe library to automatically extract the main content from a web page. Use this as the ContentHandler object passed to HtmlParser.parse(java.io.InputStream, ContentHandler, Metadata, org.apache.tika.parser.ParseContext).
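
The usage described above might look like the following minimal sketch. The class name BoilerpipeUsageSketch, the helper extractMainContent, and the sample HTML are hypothetical and exist only for illustration. Using BodyContentHandler as the delegate is an assumption; any ContentHandler should work.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.BoilerpipeContentHandler;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.sax.BodyContentHandler;

public class BoilerpipeUsageSketch {

    // Hypothetical helper: parses an HTML string and returns the text that
    // the DefaultExtractor rules keep as main content.
    static String extractMainContent(String html) throws Exception {
        // The delegate collects whatever the boilerpipe handler forwards to it.
        BodyContentHandler delegate = new BodyContentHandler();
        BoilerpipeContentHandler handler = new BoilerpipeContentHandler(delegate);

        try (InputStream in = new ByteArrayInputStream(
                html.getBytes(StandardCharsets.UTF_8))) {
            // Same four-argument parse call named in the class description.
            new HtmlParser().parse(in, handler, new Metadata(), new ParseContext());
        }
        return delegate.toString();
    }

    public static void main(String[] args) throws Exception {
        // Illustrative input only; what counts as "main content" is decided
        // by boilerpipe, so no particular output is promised here.
        String html = "<html><head><title>Sample</title></head><body>"
                + "<div><a href=\"/\">Home</a> <a href=\"/about\">About</a></div>"
                + "<p>This is a longer paragraph of article text that a reader "
                + "would consider the main content of the page, written as "
                + "full sentences rather than short navigation links.</p>"
                + "</body></html>";
        System.out.println(extractMainContent(html));
    }
}
```

The extracted text is only available after parse returns, since the handler needs the whole document before it can decide which blocks to keep.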


Constructor Summary
BoilerpipeContentHandler(ContentHandler delegate)
          Creates a new boilerpipe-based content extractor, using the DefaultExtractor extraction rules and "delegate" as the content handler.
BoilerpipeContentHandler(ContentHandler delegate, de.l3s.boilerpipe.BoilerpipeExtractor extractor)
          Creates a new boilerpipe-based content extractor, using the given extraction rules.
BoilerpipeContentHandler(Writer writer)
          Creates a content handler that writes XHTML body character events to the given writer.
 
Method Summary
 void characters(char[] chars, int offset, int length)
           
 void endDocument()
           
 void endElement(String uri, String localName, String qName)
           
 boolean isIncludeMarkup()
           
 void setIncludeMarkup(boolean includeMarkup)
           
 void startDocument()
           
 void startElement(String uri, String localName, String qName, Attributes atts)
           
 void startPrefixMapping(String prefix, String uri)
           
 
Methods inherited from class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
addTextBlock, addWhitespaceIfNecessary, endPrefixMapping, getTitle, ignorableWhitespace, processingInstruction, recycle, setDocumentLocator, setTitle, skippedEntity, toTextDocument
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

BoilerpipeContentHandler

public BoilerpipeContentHandler(ContentHandler delegate)
Creates a new boilerpipe-based content extractor, using the DefaultExtractor extraction rules and "delegate" as the content handler.

Parameters:
delegate - The ContentHandler object that receives the extracted main content

BoilerpipeContentHandler

public BoilerpipeContentHandler(Writer writer)
Creates a content handler that writes XHTML body character events to the given writer.

Parameters:
writer - The Writer that receives the character events
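
A minimal sketch of the Writer-based constructor, assuming a StringWriter as the target. The class name BoilerpipeWriterSketch and the helper extractToWriter are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.BoilerpipeContentHandler;
import org.apache.tika.parser.html.HtmlParser;

public class BoilerpipeWriterSketch {

    // Hypothetical helper: sends the character events to a StringWriter
    // and returns what was written.
    static String extractToWriter(String html) throws Exception {
        StringWriter out = new StringWriter();
        // No delegate ContentHandler is needed with this constructor.
        BoilerpipeContentHandler handler = new BoilerpipeContentHandler(out);

        try (InputStream in = new ByteArrayInputStream(
                html.getBytes(StandardCharsets.UTF_8))) {
            new HtmlParser().parse(in, handler, new Metadata(), new ParseContext());
        }
        return out.toString();
    }
}
```

Any java.io.Writer should be usable in place of the StringWriter, for example a FileWriter when the text should go straight to disk.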

BoilerpipeContentHandler

public BoilerpipeContentHandler(ContentHandler delegate,
                                de.l3s.boilerpipe.BoilerpipeExtractor extractor)
Creates a new boilerpipe-based content extractor, using the given extraction rules. The extracted main content will be passed to the content handler.

Parameters:
delegate - The ContentHandler object that receives the extracted main content
extractor - Extraction rules to use, e.g. ArticleExtractor
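
A minimal sketch of passing explicit extraction rules, using the ArticleExtractor mentioned above through its shared ArticleExtractor.INSTANCE from the boilerpipe library. The class name BoilerpipeExtractorSketch and the helper extractArticle are hypothetical, and BodyContentHandler as the delegate is an assumption.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import de.l3s.boilerpipe.extractors.ArticleExtractor;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.BoilerpipeContentHandler;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.sax.BodyContentHandler;

public class BoilerpipeExtractorSketch {

    // Hypothetical helper: same flow as the single-argument constructor,
    // but with ArticleExtractor rules instead of the DefaultExtractor rules.
    static String extractArticle(String html) throws Exception {
        BodyContentHandler delegate = new BodyContentHandler();
        BoilerpipeContentHandler handler =
                new BoilerpipeContentHandler(delegate, ArticleExtractor.INSTANCE);

        try (InputStream in = new ByteArrayInputStream(
                html.getBytes(StandardCharsets.UTF_8))) {
            new HtmlParser().parse(in, handler, new Metadata(), new ParseContext());
        }
        return delegate.toString();
    }
}
```

Any other BoilerpipeExtractor implementation can be supplied in the same position.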

Method Detail

setIncludeMarkup

public void setIncludeMarkup(boolean includeMarkup)

isIncludeMarkup

public boolean isIncludeMarkup()
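
This page gives no description for setIncludeMarkup and isIncludeMarkup. Going by the name alone, the flag presumably controls whether element events, and not only text, are forwarded to the delegate; treat that reading as an assumption. The sketch below only shows how to set the flag and observe which elements a delegate sees. The class name BoilerpipeMarkupSketch and the helper elementsSeenByDelegate are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.BoilerpipeContentHandler;
import org.apache.tika.parser.html.HtmlParser;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class BoilerpipeMarkupSketch {

    // Hypothetical helper: records the local name of every element start
    // event that reaches the delegate, so the two settings can be compared.
    static List<String> elementsSeenByDelegate(String html, boolean includeMarkup)
            throws Exception {
        final List<String> names = new ArrayList<>();
        DefaultHandler delegate = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                names.add(localName);
            }
        };

        BoilerpipeContentHandler handler = new BoilerpipeContentHandler(delegate);
        // Set the flag before parsing starts.
        handler.setIncludeMarkup(includeMarkup);

        try (InputStream in = new ByteArrayInputStream(
                html.getBytes(StandardCharsets.UTF_8))) {
            new HtmlParser().parse(in, handler, new Metadata(), new ParseContext());
        }
        return names;
    }
}
```

Calling the helper once with true and once with false on the same input is a simple way to check what the flag changes in a given Tika version.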

startDocument

public void startDocument()
                   throws SAXException
Specified by:
startDocument in interface ContentHandler
Overrides:
startDocument in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
Throws:
SAXException

startPrefixMapping

public void startPrefixMapping(String prefix,
                               String uri)
                        throws SAXException
Specified by:
startPrefixMapping in interface ContentHandler
Overrides:
startPrefixMapping in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
Throws:
SAXException

startElement

public void startElement(String uri,
                         String localName,
                         String qName,
                         Attributes atts)
                  throws SAXException
Specified by:
startElement in interface ContentHandler
Overrides:
startElement in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
Throws:
SAXException

characters

public void characters(char[] chars,
                       int offset,
                       int length)
                throws SAXException
Specified by:
characters in interface ContentHandler
Overrides:
characters in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
Throws:
SAXException

endElement

public void endElement(String uri,
                       String localName,
                       String qName)
                throws SAXException
Specified by:
endElement in interface ContentHandler
Overrides:
endElement in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
Throws:
SAXException

endDocument

public void endDocument()
                 throws SAXException
Specified by:
endDocument in interface ContentHandler
Overrides:
endDocument in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
Throws:
SAXException


Copyright © 2007-2012 The Apache Software Foundation. All Rights Reserved.