Class StartXRefScanner
java.lang.Object
    org.apache.tika.parser.pdf.updates.StartXRefScanner
public class StartXRefScanner extends Object
This is a first draft of a scanner to extract incremental updates out of PDFs. It effectively scans the byte stream for the regular expression startxref\s*(\d+)\s*(%%EOF\n?)? but does not validate that the startxrefs point to actual xrefs. If the number component ends at the literal end of the file (e.g. the file is truncated or malformed), the startxref will not be reported.
There may be false positives, especially in adversarial settings. For example, there may be a startxref string in a comment or inside a stream or object.
The good parts come directly from PDFBox.
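The pattern described above can be tried out with a minimal sketch built on java.util.regex. This is not the scanner's implementation, which reads from a RandomAccessRead. The class name StartXRefRegexSketch and the method findOffsets are hypothetical names chosen for this illustration. The sketch assumes the whole file fits in memory and skips a number that runs to the very end of the input, mirroring the behaviour documented above.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of Tika or PDFBox.
public class StartXRefRegexSketch {

    // The pattern quoted in the class description, escaped for a Java string literal.
    private static final Pattern STARTXREF =
            Pattern.compile("startxref\\s*(\\d+)\\s*(%%EOF\\n?)?");

    /** Returns every number that follows a "startxref" keyword in the given bytes. */
    public static List<Long> findOffsets(byte[] pdfBytes) {
        // ISO-8859-1 maps each byte to exactly one char, so no bytes are lost or merged.
        String text = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        List<Long> offsets = new ArrayList<>();
        Matcher m = STARTXREF.matcher(text);
        while (m.find()) {
            // Skip a number that ends at the literal end of the input (truncated file).
            if (m.end(1) == text.length()) {
                continue;
            }
            try {
                offsets.add(Long.parseLong(m.group(1)));
            } catch (NumberFormatException e) {
                // A digit run too long for a long cannot be a usable offset; ignore it.
            }
        }
        return offsets;
    }

    public static void main(String[] args) {
        byte[] sample = "startxref\n116\n%%EOF\nstartxref\n542\n%%EOF\n"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(findOffsets(sample)); // prints [116, 542]
    }
}
```

Like the scanner itself, this sketch does not check that an offset points at a real xref, and it reports a startxref string that sits inside a comment or stream, so the false-positive caveat above applies to it as well.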
-
Constructor Summary
Constructor                                                       Description
StartXRefScanner(org.apache.pdfbox.io.RandomAccessRead source)
-
Method Summary
Modifier and Type            Method               Description
protected boolean            isEOL(int c)         This will tell if the next byte to be read is an end of line byte.
protected boolean            isWhitespace(int c)
protected long               readLong()
protected StringBuilder      readStringNumber()   This method is used by the readLong() method to read a token.
List<StartXRefOffset>        scan()
protected void               skipSpaces()         This will skip all spaces and comments that are present.
protected void               skipWhiteSpaces()
-
Method Detail
-
scan
public List<StartXRefOffset> scan() throws IOException
- Throws:
IOException
-
skipWhiteSpaces
protected void skipWhiteSpaces() throws IOException
- Throws:
IOException
-
isWhitespace
protected boolean isWhitespace(int c)
-
readLong
protected long readLong() throws IOException
- Throws:
IOException
-
skipSpaces
protected void skipSpaces() throws IOException
This will skip all spaces and comments that are present.
- Throws:
IOException
- If there is an error reading from the stream.
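Skipping spaces and comments can be illustrated with a minimal sketch over an in-memory byte array. It is not the class's code. The sketch assumes the usual PDF conventions, which this page does not spell out: whitespace is NUL, HT, LF, FF, CR or SP, and a comment runs from '%' to the end of the line. The class name SkipSpacesSketch and the index-based signature are hypothetical; the real method reads from the scanner's source and can throw IOException.

```java
// Hypothetical illustration class; not part of Tika or PDFBox.
public class SkipSpacesSketch {

    /** Returns the index of the first byte at or after start that is neither whitespace nor inside a comment. */
    public static int skipSpaces(byte[] data, int start) {
        int pos = start;
        while (pos < data.length) {
            int c = data[pos] & 0xFF;
            if (c == '%') {
                // Assumed comment rule: everything from '%' up to the next end of line byte.
                while (pos < data.length && !isEOL(data[pos] & 0xFF)) {
                    pos++;
                }
            } else if (isWhitespace(c)) {
                pos++;
            } else {
                break;
            }
        }
        return pos;
    }

    /** End of line bytes as documented for isEOL: 0x0A or 0x0D. */
    public static boolean isEOL(int c) {
        return c == 0x0A || c == 0x0D;
    }

    /** Assumed PDF whitespace set: NUL, HT, LF, FF, CR, SP. */
    public static boolean isWhitespace(int c) {
        return c == 0 || c == 9 || c == 10 || c == 12 || c == 13 || c == 32;
    }

    public static void main(String[] args) {
        byte[] sample = "  % note\n 42".getBytes(java.nio.charset.StandardCharsets.ISO_8859_1);
        System.out.println(skipSpaces(sample, 0)); // prints 10, the index of '4'
    }
}
```

Under this comment rule a line starting with %%EOF is itself skipped as a comment.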
-
readStringNumber
protected final StringBuilder readStringNumber() throws IOException
This method is used by the readLong() method to read a token. Valid delimiters are any non-digit values.
- Returns:
- the token to parse as integer or long by the calling method.
- Throws:
IOException
- thrown by the source methods.
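The relationship between readStringNumber() and readLong() can be sketched as follows: collect consecutive digits until any non-digit byte is met, then parse the token. This is an illustrative stand-in rather than the class's code. DigitTokenSketch, its byte-array constructor and position() are hypothetical. The real methods read from the scanner's source and declare IOException, whereas this in-memory version lets Long.parseLong throw NumberFormatException on an empty or overflowing token.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; not part of Tika or PDFBox.
public class DigitTokenSketch {

    private final byte[] data; // in-memory stand-in for the scanner's source (assumption)
    private int pos = 0;

    public DigitTokenSketch(byte[] data) {
        this.data = data;
    }

    /** Collects consecutive ASCII digits; any non-digit byte ends the token and is left unread. */
    public StringBuilder readStringNumber() {
        StringBuilder token = new StringBuilder();
        while (pos < data.length && data[pos] >= '0' && data[pos] <= '9') {
            token.append((char) data[pos]);
            pos++;
        }
        return token;
    }

    /** Parses the digit token as a long; throws NumberFormatException if it is empty or overflows. */
    public long readLong() {
        return Long.parseLong(readStringNumber().toString());
    }

    /** Current read position, exposed only so the example can show where reading stopped. */
    public int position() {
        return pos;
    }

    public static void main(String[] args) {
        DigitTokenSketch s = new DigitTokenSketch("116\n%%EOF".getBytes(StandardCharsets.ISO_8859_1));
        System.out.println(s.readLong());  // prints 116
        System.out.println(s.position());  // prints 3, the index of the '\n' delimiter
    }
}
```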
-
isEOL
protected boolean isEOL(int c)
This will tell if the next byte to be read is an end of line byte.
- Parameters:
c
- The character to check against end of line
- Returns:
- true if the next byte is 0x0A or 0x0D.
-