org.apache.tika.ml.chardetect.HtmlByteStripper

public final class HtmlByteStripper extends Object

Byte-level HTML tag stripper used as a preprocessing step for charset detection.

Operates on raw bytes rather than decoded characters: <, >, quotes, and the ASCII letters in script/style are single-byte and preserved in every ASCII-compatible encoding the detector arbitrates between (UTF-8, ISO-8859-*, windows-125*, KOI8-*). This lets us strip once on the raw probe bytes and feed the shorter result to every candidate decoder, instead of decoding-then-stripping N times.

Not safe for UTF-16 or UTF-32 input: < is 2 or 4 bytes there. Callers should handle those candidates separately (they are almost always BOM-identified).
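The byte-width claims above can be checked with the JDK's standard charsets. This is a small illustrative sketch, not part of Tika; the class name AngleBracketBytes and the helper encodeLt are hypothetical. It only covers charsets guaranteed by StandardCharsets, so windows-125* and KOI8-* are not exercised here.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class AngleBracketBytes {

    /** Hypothetical helper: returns the encoded form of "<" in the given charset. */
    static byte[] encodeLt(Charset cs) {
        return "<".getBytes(cs);
    }

    public static void main(String[] args) {
        // ASCII-compatible charsets: '<' is the single byte 0x3C in each.
        Charset[] asciiCompatible = {
            StandardCharsets.US_ASCII,
            StandardCharsets.ISO_8859_1,
            StandardCharsets.UTF_8
        };
        for (Charset cs : asciiCompatible) {
            byte[] b = encodeLt(cs);
            System.out.println(cs + ": " + b.length + " byte(s), first = 0x"
                    + Integer.toHexString(b[0] & 0xFF));
        }

        // UTF-16 (explicit byte order, so no BOM is emitted): '<' takes two bytes,
        // which is why a single-byte scan for 0x3C is not safe on such input.
        byte[] utf16 = encodeLt(StandardCharsets.UTF_16BE);
        System.out.println("UTF-16BE: " + utf16.length + " byte(s)");
    }
}
```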

Contents of <script> and <style> elements, as well as HTML comments, are dropped: they carry bytes that are not natural-language text and pollute language-model scoring.

The strip can be performed in place: when src and dst refer to the same array, bytes are written back into the input buffer, and the returned length marks the end of the content prefix. This is safe because the state machine maintains the invariant w <= i at the start of every iteration (where w is the next write index and i is the read index), and w <= i - 1 when entering states that may write two bytes in a single tick (the LT stray-< case). Every read of buf[i] is captured into a local before any write that could overlap it, and the i + 1 and i + 2 peeks in the LT-!-- check reference positions the write cursor has not reached.
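The w <= i invariant can be illustrated with a deliberately simplified in-place stripper. This is a sketch under stated assumptions, not the HtmlByteStripper state machine: the class InPlaceStripSketch and method stripInPlace are hypothetical, and the sketch ignores comments, script/style bodies, quoted attribute values, and the stray-< case (so it never needs the two-byte write or the lookahead peeks).

```java
import java.nio.charset.StandardCharsets;

public class InPlaceStripSketch {

    /**
     * Hypothetical, simplified stripper: removes every "<...>" run from
     * buf[0 .. len) in place and returns the new content length.
     * An unclosed "<" swallows the rest of the input.
     */
    static int stripInPlace(byte[] buf, int len) {
        int w = 0;                  // next write index
        boolean inTag = false;
        for (int i = 0; i < len; i++) {   // i is the read index
            byte b = buf[i];        // capture the byte before any write
            if (inTag) {
                if (b == '>') {
                    inTag = false;  // tag closed; nothing is written
                }
            } else if (b == '<') {
                inTag = true;       // tag opened; nothing is written
            } else {
                // At most one write per read, so w <= i holds here and the
                // write can never overwrite a byte that has not been read yet.
                buf[w++] = b;
            }
        }
        return w;
    }

    public static void main(String[] args) {
        byte[] buf = "<p>Hello <b>world</b></p>".getBytes(StandardCharsets.US_ASCII);
        int n = stripInPlace(buf, buf.length);
        // Only buf[0 .. n) is meaningful after the call.
        System.out.println(new String(buf, 0, n, StandardCharsets.US_ASCII)); // prints "Hello world"
    }
}
```

Bytes past the returned length are leftovers from the original input and should be ignored by the caller.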

Ported from tika-encoding-detector-charsoup in the tika-main repository. Placed in this module so the Naive-Bayes pipeline detector can preprocess probes without a cross-module dependency on charsoup.

Nested Class Summary

Nested Classes

Modifier and Type

Class

Description

static final class

HtmlByteStripper.Result

Result of a strip operation: new content length and the number of well-formed tags (including comments) successfully parsed.

Method Summary

Modifier and Type

Method

Description

static HtmlByteStripper.Result

strip(byte[] src, int srcOffset, int srcLen, byte[] dst, int dstOffset)

Strip HTML/XML tags, comments, and the bodies of <script> and <style> elements from src[srcOffset .. srcOffset+srcLen) into dst starting at dstOffset.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Method Details
- strip
  
  public static HtmlByteStripper.Result strip(byte[] src, int srcOffset, int srcLen, byte[] dst, int dstOffset)
  
  Strip HTML/XML tags, comments, and the bodies of <script> and <style> elements from src[srcOffset .. srcOffset+srcLen) into dst starting at dstOffset. Returns an HtmlByteStripper.Result carrying the number of content bytes written into dst and the number of well-formed tags (including comments) parsed.
  
  dst.length - dstOffset must be at least srcLen (stripping never produces more output than input).
  
  src and dst may refer to the same array; the source and destination ranges may overlap if dstOffset <= srcOffset. In that case the state machine's invariant (write index never leads read index) guarantees every source byte is loaded into a local before any write that could reach its position. Other overlap shapes are not supported and must not be used.
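The size precondition and the supported overlap shape can be sketched with a stand-in that has the same five-parameter shape as strip. Everything here is an assumption for illustration: the class OverlapSketch and method stripInto are hypothetical, the tag handling is reduced to dropping "<...>" runs, the return value is a plain int rather than a Result, and the argument checks are this sketch's own choice (the source does not say whether the real method validates its arguments).

```java
import java.nio.charset.StandardCharsets;

public class OverlapSketch {

    /**
     * Hypothetical stand-in with the same parameter shape as strip(...).
     * Drops "<...>" runs only and returns the number of bytes written.
     */
    static int stripInto(byte[] src, int srcOffset, int srcLen, byte[] dst, int dstOffset) {
        // Output never exceeds input, so srcLen bytes of room always suffice.
        if (dst.length - dstOffset < srcLen) {
            throw new IllegalArgumentException("dst too small");
        }
        // Same array with the destination starting inside the source range:
        // writes could run ahead of reads, so this sketch rejects it (assumption).
        if (src == dst && dstOffset > srcOffset && dstOffset < srcOffset + srcLen) {
            throw new IllegalArgumentException("unsupported overlap");
        }
        int w = dstOffset;                       // write index into dst
        int end = srcOffset + srcLen;
        boolean inTag = false;
        for (int i = srcOffset; i < end; i++) {  // read index into src
            byte b = src[i];                     // load before any write
            if (inTag) {
                if (b == '>') inTag = false;
            } else if (b == '<') {
                inTag = true;
            } else {
                dst[w++] = b;                    // w <= i when src == dst and dstOffset <= srcOffset
            }
        }
        return w - dstOffset;
    }

    public static void main(String[] args) {
        // Two leading bytes, then 9 bytes of markup; strip the markup to the front.
        byte[] buf = "XX<i>hi</i>".getBytes(StandardCharsets.US_ASCII);
        int n = stripInto(buf, 2, 9, buf, 0);    // same array, dstOffset <= srcOffset
        System.out.println(new String(buf, 0, n, StandardCharsets.US_ASCII)); // prints "hi"
    }
}
```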
