org.apache.tika.parser.html
Class BoilerpipeContentHandler

java.lang.Object
  extended by de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
      extended by org.apache.tika.parser.html.BoilerpipeContentHandler
All Implemented Interfaces:
org.xml.sax.ContentHandler

public class BoilerpipeContentHandler
extends de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler

Uses the boilerpipe library to automatically extract the main content from a web page. Pass an instance of this class as the ContentHandler argument to HtmlParser.parse(java.io.InputStream, ContentHandler, Metadata, org.apache.tika.parser.ParseContext).
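
A minimal usage sketch of the call described above, assuming Tika and boilerpipe are on the classpath. The class name MainContentExample, the method name extractMainContent, and the sample HTML are illustrative and not part of Tika. BodyContentHandler (from org.apache.tika.sax) is used here as the delegate only because it collects text into a string; any ContentHandler could take its place.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.BoilerpipeContentHandler;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.sax.BodyContentHandler;

public class MainContentExample {

    /** Parses the HTML and returns the text that reached the delegate. */
    public static String extractMainContent(String html) throws Exception {
        // The delegate: collects whatever text is forwarded to it.
        BodyContentHandler textCollector = new BodyContentHandler();
        // Wrap the delegate; this constructor uses the DefaultExtractor rules.
        BoilerpipeContentHandler handler = new BoilerpipeContentHandler(textCollector);

        try (InputStream in =
                new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8))) {
            new HtmlParser().parse(in, handler, new Metadata(), new ParseContext());
        }
        return textCollector.toString();
    }

    public static void main(String[] args) throws Exception {
        // Illustrative input only; what is kept depends on the extraction rules.
        String html = "<html><head><title>Sample</title></head><body>"
                + "<div>Home | About | Contact</div>"
                + "<p>This paragraph stands in for the article text of a page.</p>"
                + "</body></html>";
        System.out.println(extractMainContent(html));
    }
}
```

Which blocks survive depends on the extractor's heuristics, so very short test pages may yield little or no output.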


Constructor Summary
BoilerpipeContentHandler(org.xml.sax.ContentHandler delegate)
          Creates a new boilerpipe-based content extractor, using the DefaultExtractor extraction rules and "delegate" as the content handler.
BoilerpipeContentHandler(org.xml.sax.ContentHandler delegate, de.l3s.boilerpipe.BoilerpipeExtractor extractor)
          Creates a new boilerpipe-based content extractor, using the given extraction rules.
BoilerpipeContentHandler(java.io.Writer writer)
          Creates a content handler that writes XHTML body character events to the given writer.
 
Method Summary
 void endDocument()
           
 
Methods inherited from class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
addTextBlock, addWhitespaceIfNecessary, characters, endElement, endPrefixMapping, getTitle, ignorableWhitespace, processingInstruction, recycle, setDocumentLocator, setTitle, skippedEntity, startDocument, startElement, startPrefixMapping, toTextDocument
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

BoilerpipeContentHandler

public BoilerpipeContentHandler(org.xml.sax.ContentHandler delegate)
Creates a new boilerpipe-based content extractor, using the DefaultExtractor extraction rules and "delegate" as the content handler.

Parameters:
delegate - the ContentHandler that receives the extracted main content

BoilerpipeContentHandler

public BoilerpipeContentHandler(java.io.Writer writer)
Creates a content handler that writes XHTML body character events to the given writer.

Parameters:
writer - the Writer that receives the XHTML body character events

BoilerpipeContentHandler

public BoilerpipeContentHandler(org.xml.sax.ContentHandler delegate,
                                de.l3s.boilerpipe.BoilerpipeExtractor extractor)
Creates a new boilerpipe-based content extractor, using the given extraction rules. The extracted main content will be passed to the content handler.

Parameters:
delegate - the ContentHandler that receives the extracted main content
extractor - Extraction rules to use, e.g. ArticleExtractor
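
A short sketch of selecting different extraction rules through this constructor, again assuming Tika and boilerpipe are on the classpath. ArticleExtractor comes from the boilerpipe package de.l3s.boilerpipe.extractors; the wrapper class ArticleHandlerExample and the method articleHandler are hypothetical names used only for illustration.

```java
import org.apache.tika.parser.html.BoilerpipeContentHandler;
import org.xml.sax.ContentHandler;

import de.l3s.boilerpipe.extractors.ArticleExtractor;

public class ArticleHandlerExample {

    /** Builds a handler that applies boilerpipe's ArticleExtractor rules. */
    public static BoilerpipeContentHandler articleHandler(ContentHandler delegate) {
        // The second argument replaces the default extraction rules.
        return new BoilerpipeContentHandler(delegate, ArticleExtractor.getInstance());
    }
}
```

The returned handler is passed to HtmlParser.parse in the same way as one built with the single-argument constructor.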

Method Detail

endDocument

public void endDocument()
                 throws org.xml.sax.SAXException
Specified by:
endDocument in interface org.xml.sax.ContentHandler
Overrides:
endDocument in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
Throws:
org.xml.sax.SAXException
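
This page gives no description for endDocument. Since the constructor text says the extracted main content is passed to the delegate, one plausible reading is that text is gathered while the page is parsed and forwarded only once the document ends. The sketch below illustrates that general deferred-forwarding pattern using JDK SAX classes only. BufferingHandler, Collector, and the length-based filter are hypothetical stand-ins, not the Tika or boilerpipe implementation.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class DeferredForwardingSketch {

    /** Hypothetical handler: buffers text blocks, forwards a subset at endDocument. */
    static class BufferingHandler extends DefaultHandler {
        private final ContentHandler delegate;
        private final int minLength;
        private final List<String> blocks = new ArrayList<>();
        private final StringBuilder current = new StringBuilder();

        BufferingHandler(ContentHandler delegate, int minLength) {
            this.delegate = delegate;
            this.minLength = minLength;
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            current.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            // Simplification: every closing tag ends the current text block.
            String text = current.toString().trim();
            if (!text.isEmpty()) {
                blocks.add(text);
            }
            current.setLength(0);
        }

        @Override
        public void endDocument() throws SAXException {
            // Nothing reaches the delegate until the whole document has been seen.
            delegate.startDocument();
            for (String block : blocks) {
                if (block.length() >= minLength) { // toy filter, not boilerpipe's rules
                    char[] chars = block.toCharArray();
                    delegate.characters(chars, 0, chars.length);
                }
            }
            delegate.endDocument();
        }
    }

    /** Delegate that records the text forwarded to it. */
    static class Collector extends DefaultHandler {
        final StringBuilder out = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            out.append(ch, start, length);
        }
    }

    /** Parses the XML through BufferingHandler and returns what the delegate received. */
    public static String run(String xml, int minLength) throws Exception {
        Collector collector = new Collector();
        SAXParserFactory.newInstance().newSAXParser().parse(
                new InputSource(new StringReader(xml)),
                new BufferingHandler(collector, minLength));
        return collector.out.toString();
    }
}
```

With the input `<doc><nav>Home</nav><p>A much longer paragraph of text.</p></doc>` and a minimum length of 10, only the paragraph text reaches the delegate.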


Copyright © 2007-2010 The Apache Software Foundation. All Rights Reserved.