Class HtmlByteStripper

java.lang.Object
org.apache.tika.ml.chardetect.HtmlByteStripper

public final class HtmlByteStripper extends Object
Byte-level HTML tag stripper used as a preprocess for charset detection.

Operates on raw bytes rather than decoded characters: <, >, quotes, and the ASCII letters in script/style are single-byte and preserved in every ASCII-compatible encoding the detector arbitrates between (UTF-8, ISO-8859-*, windows-125*, KOI8-*). This lets us strip once on the raw probe bytes and feed the shorter result to every candidate decoder, instead of decoding-then-stripping N times.

Not safe for UTF-16 or UTF-32 input: < is 2 or 4 bytes there. Callers should handle those candidates separately (they are almost always BOM-identified).
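
The byte-level assumption in the two paragraphs above can be checked with a small sketch. The class and method names here are hypothetical and not part of Tika; the code uses only standard JDK charset APIs to test whether `<` encodes to the single byte 0x3C in a given charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of HtmlByteStripper.
final class AsciiMarkupBytes {
    private AsciiMarkupBytes() {
    }

    /** Returns true if '<' encodes to exactly one byte, 0x3C, in the given charset. */
    static boolean ltIsSingleAsciiByte(Charset cs) {
        byte[] encoded = "<".getBytes(cs);
        return encoded.length == 1 && encoded[0] == 0x3C;
    }
}
```

Called with `StandardCharsets.UTF_8` or `StandardCharsets.ISO_8859_1`, the check should pass; with `StandardCharsets.UTF_16LE` or `StandardCharsets.UTF_16BE` it should fail, because `<` occupies two bytes there.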

Contents of <script> and <style> elements, as well as HTML comments, are dropped: they carry bytes that are not natural-language text and pollute language-model scoring.

The strip can be performed in place: when src and dst are the same array, bytes are written back into the input buffer, and the returned length marks the end of the content prefix. This is safe because the state machine maintains the invariant w <= i at the start of every iteration (where w is the next write index and i is the read index), and w <= i - 1 when entering states that may write two bytes in a single tick (LT stray-<). Every read of buf[i] is captured into a local before any write that could overlap it, and the i + 1, i + 2 peeks in the LT-!-- check reference positions the write cursor has not reached.
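
The read/write invariant can be illustrated with a much simpler loop. This is a minimal sketch under stated assumptions, not the actual state machine: it drops only `<` ... `>` runs, ignores script, style, comments, and stray `<`, and the names `InPlaceTagDropSketch` and `dropTagsInPlace` are hypothetical.

```java
// Simplified illustration only; the real HtmlByteStripper handles many more states.
final class InPlaceTagDropSketch {
    private InPlaceTagDropSketch() {
    }

    /**
     * Removes every '<' ... '>' run from buf[0 .. len) in place and returns
     * the length of the kept prefix.
     */
    static int dropTagsInPlace(byte[] buf, int len) {
        int w = 0;                 // next write index
        boolean inTag = false;
        for (int i = 0; i < len; i++) {
            // Invariant: w <= i, so a write can never clobber an unread byte.
            byte b = buf[i];       // capture the byte before any write
            if (inTag) {
                if (b == '>') {
                    inTag = false; // tag closed; the '>' itself is dropped
                }
            } else if (b == '<') {
                inTag = true;      // tag opened; the '<' itself is dropped
            } else {
                buf[w++] = b;      // at most one byte written per byte read
            }
        }
        return w;
    }
}
```

Because each iteration reads one byte and writes at most one, w can never overtake i in this sketch. The stricter w <= i - 1 condition described above only matters for states that write two bytes in one step, which this simplified loop does not have.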

Ported from tika-encoding-detector-charsoup in the tika-main repository. Placed in this module so the Naive-Bayes pipeline detector can preprocess probes without a cross-module dependency on charsoup.

  • Nested Class Summary

    Nested Classes
    Modifier and Type
    Class
    Description
    static final class
    HtmlByteStripper.Result
    Result of a strip operation: new content length and the number of well-formed tags (including comments) successfully parsed.
  • Method Summary

    Modifier and Type
    Method
    Description
    static HtmlByteStripper.Result
    strip(byte[] src, int srcOffset, int srcLen, byte[] dst, int dstOffset)
    Strip HTML/XML tags, comments, and the bodies of <script> and <style> elements from src[srcOffset .. srcOffset+srcLen) into dst starting at dstOffset.

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Method Details

    • strip

      public static HtmlByteStripper.Result strip(byte[] src, int srcOffset, int srcLen, byte[] dst, int dstOffset)
      Strip HTML/XML tags, comments, and the bodies of <script> and <style> elements from src[srcOffset .. srcOffset+srcLen) into dst starting at dstOffset. Returns a HtmlByteStripper.Result carrying the number of content bytes written into dst and the number of well-formed tags (including comments) parsed.

      dst.length - dstOffset must be at least srcLen (stripping never produces more output than input).

      src and dst may refer to the same array; the source and destination ranges may overlap if dstOffset <= srcOffset. In that case the state machine's invariant (write index never leads read index) guarantees every source byte is loaded into a local before any write that could reach its position. Other overlap shapes are not supported and must not be used.
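
The size and overlap rules above can be expressed as a caller-side guard. This is a sketch under assumptions: `StripArgsCheck` and `isSupported` are hypothetical names, and the documentation does not say whether strip itself validates its arguments. The bounds checks on src are an added precaution, not a stated requirement.

```java
// Hypothetical caller-side guard; not part of HtmlByteStripper.
final class StripArgsCheck {
    private StripArgsCheck() {
    }

    /** Returns true if the arguments match the documented size and overlap rules. */
    static boolean isSupported(byte[] src, int srcOffset, int srcLen,
                               byte[] dst, int dstOffset) {
        if (srcOffset < 0 || srcLen < 0 || dstOffset < 0) {
            return false;
        }
        if (srcOffset > src.length - srcLen) {
            return false;          // source range runs past the end of src
        }
        if (dst.length - dstOffset < srcLen) {
            return false;          // dst must have room for at least srcLen bytes
        }
        if (src != dst) {
            return true;           // distinct arrays never overlap
        }
        // Same array: writes may trail reads (dstOffset <= srcOffset), or the
        // destination may start at or after the end of the source range.
        return dstOffset <= srcOffset || dstOffset >= srcOffset + srcLen;
    }
}
```

For a single 10-byte array `a`, `isSupported(a, 2, 5, a, 0)` and `isSupported(a, 2, 5, a, 2)` describe supported shapes, while `isSupported(a, 2, 5, a, 3)` describes an overlap where the write position would start ahead of the read position.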