Class HtmlByteStripper
Operates on raw bytes rather than decoded characters: <, >,
quotes, and the ASCII letters in script/style are single-byte
and preserved in every ASCII-compatible encoding the detector arbitrates
between (UTF-8, ISO-8859-*, windows-125*, KOI8-*). This lets us strip once
on the raw probe bytes and feed the shorter result to every candidate
decoder, instead of decoding-then-stripping N times.
Not safe for UTF-16 or UTF-32 input: < is 2 or 4 bytes there.
Callers should handle those candidates separately (they are almost always
BOM-identified).
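The byte-width claim above can be checked directly against the JVM's charset tables. The following is a minimal sketch, not part of this class: the name AngleBracketWidths and its methods are hypothetical, and charsets beyond the JVM's guaranteed set are guarded with Charset.isSupported because a minimal runtime may not ship them.

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only (hypothetical helper): reports how '<' is encoded per charset. */
final class AngleBracketWidths {

    /** Returns charset name -> encoded length of "<", skipping charsets this JVM lacks. */
    static Map<String, Integer> widths(String... charsetNames) {
        Map<String, Integer> out = new LinkedHashMap<>();
        for (String name : charsetNames) {
            if (!Charset.isSupported(name)) {
                continue; // optional charsets may be absent on a minimal runtime
            }
            out.put(name, "<".getBytes(Charset.forName(name)).length);
        }
        return out;
    }

    /** True if '<' encodes to the single byte 0x3C in the named charset. */
    static boolean isSingleByteLt(String charsetName) {
        byte[] b = "<".getBytes(Charset.forName(charsetName));
        return b.length == 1 && b[0] == 0x3C;
    }
}
```

Calling widths("UTF-8", "ISO-8859-1", "windows-1251", "KOI8-R", "UTF-16LE", "UTF-32LE") is one way to see which candidates are safe to strip at the byte level: the ASCII-compatible ones report a width of 1, while the UTF-16 and UTF-32 variants do not.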
Contents of <script> and <style> elements, as well as
HTML comments, are dropped: they carry bytes that are not natural-language
text and pollute language-model scoring.
The strip is performed in place: bytes are written back into
the input buffer, and the returned length marks the end of the content
prefix. This is safe because the state machine maintains the invariant
w <= i at the start of every iteration (where w is the
next write index and i is the read index), and w <= i - 1
when entering states that may write two bytes in a single tick
(LT stray-<). Every read of buf[i] is captured into a
local before any write that could overlap it, and the i + 1,
i + 2 peeks in the LT-!-- check reference positions the
write cursor has not reached.
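The invariant described above can be illustrated with a much smaller loop. This is a simplified sketch under stated assumptions, not the state machine used by this class: it only drops <...> runs, and it omits comments, script/style bodies, and the stray-< case that may write two bytes in one tick. The class and method names are hypothetical.

```java
/** Simplified illustration; NOT the HtmlByteStripper state machine. */
final class InPlaceTagDropSketch {

    /**
     * Drops every "<...>" run from buf[0, len) in place and returns the new length.
     * An unterminated '<' swallows the rest of the input in this sketch.
     */
    static int dropTags(byte[] buf, int len) {
        int w = 0;                 // next write index
        boolean inTag = false;
        for (int i = 0; i < len; i++) {   // i is the read index
            byte b = buf[i];       // capture the byte before any write can overlap it
            if (inTag) {
                if (b == '>') {
                    inTag = false; // tag closed; nothing is written
                }
            } else if (b == '<') {
                inTag = true;      // tag opened; nothing is written
            } else {
                buf[w++] = b;      // w <= i holds here, so no unread byte is overwritten
            }
        }
        return w;                  // buf[0, w) is the content prefix
    }
}
```

Because each iteration advances i by one and writes at most one byte, w can never pass i, which is the same argument the real implementation relies on for its single-byte states.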
Ported from tika-encoding-detector-charsoup in the tika-main
repository. Placed in this module so the Naive-Bayes pipeline detector
can preprocess probes without a cross-module dependency on charsoup.
Nested Class Summary
Nested Classes
Modifier and Type: static final class
Class: HtmlByteStripper.Result
Description: Result of a strip operation: new content length and the number of well-formed tags (including comments) successfully parsed.
Method Summary
Modifier and Type: static HtmlByteStripper.Result
Method: strip(byte[] src, int srcOffset, int srcLen, byte[] dst, int dstOffset)
Description: Strip HTML/XML tags, comments, and the bodies of <script> and <style> elements from src[srcOffset .. srcOffset+srcLen) into dst starting at dstOffset.
Method Details
strip
public static HtmlByteStripper.Result strip(byte[] src, int srcOffset, int srcLen, byte[] dst, int dstOffset)
Strip HTML/XML tags, comments, and the bodies of <script> and <style> elements from src[srcOffset .. srcOffset+srcLen) into dst starting at dstOffset. Returns a HtmlByteStripper.Result carrying the number of content bytes written into dst.
dst.length - dstOffset must be at least srcLen (stripping never produces more output than input).
src and dst may refer to the same array; the source and destination ranges may overlap if dstOffset <= srcOffset. In that case the state machine's invariant (the write index never leads the read index) guarantees that every source byte is loaded into a local before any write that could reach its position. Other overlap shapes are not supported and must not be used.
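The size and overlap rules for strip can be restated as a precondition check. The sketch below is hypothetical: StripArgsSketch and satisfiesContract are not part of this API, and whether strip itself validates its arguments is not stated here. It only encodes the documented contract, so a caller could use something like it in a test or an assertion.

```java
/** Hypothetical helper; the real strip method may validate differently or not at all. */
final class StripArgsSketch {

    /** Returns true if the arguments satisfy the documented contract of strip. */
    static boolean satisfiesContract(byte[] src, int srcOffset, int srcLen,
                                     byte[] dst, int dstOffset) {
        if (srcOffset < 0 || srcLen < 0 || dstOffset < 0) {
            return false;
        }
        if (srcLen > src.length - srcOffset) {
            return false;                      // source range must be in bounds
        }
        if (dst.length - dstOffset < srcLen) {
            return false;                      // room for the worst case: nothing stripped
        }
        if (src == dst) {
            // Worst-case destination range is [dstOffset, dstOffset + srcLen).
            boolean overlap = srcLen > 0
                    && dstOffset < srcOffset + srcLen
                    && srcOffset < dstOffset + srcLen;
            if (overlap && dstOffset > srcOffset) {
                return false;                  // write index would start ahead of the read index
            }
        }
        return true;
    }
}
```

Under this reading, stripping fully in place (same array, dstOffset equal to srcOffset) and compacting toward the front of the array are both allowed, while shifting output toward the back of an overlapping range is rejected.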