org.apache.tika.parser.pdf.updates.StartXRefScanner

public class StartXRefScanner extends Object

This is a first draft of a scanner that extracts incremental updates from PDFs. It scans the byte stream for matches of the pattern startxref\s*(\d+)\s*(%%EOF\n?)? and does not validate that the reported startxrefs point to actual xrefs.

If the number component ends at the literal end of the file (e.g. the file is truncated or malformed), the startxref will not be reported.
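To make the pattern and the end-of-file rule concrete, here is a minimal sketch that uses java.util.regex rather than the class's own byte-level parsing. The class name StartXRefPatternSketch and the method findStartXRefs are hypothetical and are not part of Tika or PDFBox. Java's \s also differs slightly from the PDF white-space set, so treat this as an approximation of the described behavior.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; not the Tika implementation.
class StartXRefPatternSketch {

    // The pattern quoted in the class description, written as a Java regex.
    private static final Pattern START_XREF =
            Pattern.compile("startxref\\s*(\\d+)\\s*(%%EOF\\n?)?");

    /** Returns the number that follows each startxref keyword. */
    static List<Long> findStartXRefs(byte[] pdf) {
        // ISO-8859-1 maps each byte to one char, so char positions equal byte offsets.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        List<Long> values = new ArrayList<>();
        Matcher m = START_XREF.matcher(text);
        while (m.find()) {
            // Mimic the documented rule: a number that runs to the end of the data is not reported.
            if (m.end(1) == text.length()) {
                continue;
            }
            values.add(Long.parseLong(m.group(1)));
        }
        return values;
    }

    public static void main(String[] args) {
        String complete = "%PDF-1.4\n(original body)\nstartxref\n116\n%%EOF\n"
                + "(incremental update)\nstartxref\n4096\n%%EOF\n";
        // Same data, cut off directly after the digits of the second startxref.
        String truncated = complete.substring(0, complete.lastIndexOf("4096") + 4);

        System.out.println(findStartXRefs(complete.getBytes(StandardCharsets.ISO_8859_1)));  // prints [116, 4096]
        System.out.println(findStartXRefs(truncated.getBytes(StandardCharsets.ISO_8859_1))); // prints [116]
    }
}
```

Each reported value is a candidate offset of one revision's cross-reference data, so two values suggest one incremental update on top of the original file.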

There may be false positives, especially in adversarial settings. For example, there may be a startxref string in a comment or inside a stream or object.
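Because the scanner does not check where a reported offset points, a caller that cares about false positives could filter the candidates itself. The following is a minimal sketch of one such check, assuming the caller has the whole file in memory. XRefTargetCheckSketch and looksLikeXRef are hypothetical names, and the heuristic (an xref keyword or an indirect object header at the offset) is an assumption about what a reasonable filter might look for, not something this class does.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Hypothetical caller-side filter; not part of Tika or PDFBox.
class XRefTargetCheckSketch {

    // A cross-reference stream starts with an indirect object header such as "12 0 obj".
    private static final Pattern OBJ_HEADER = Pattern.compile("\\d+\\s+\\d+\\s+obj");

    /** Returns true if the bytes at offset look like an xref table or an xref stream object. */
    static boolean looksLikeXRef(byte[] pdf, long offset) {
        if (offset < 0 || offset >= pdf.length) {
            return false;
        }
        int start = (int) offset; // safe: offset is below pdf.length
        int len = Math.min(32, pdf.length - start);
        String head = new String(pdf, start, len, StandardCharsets.ISO_8859_1);
        return head.startsWith("xref") || OBJ_HEADER.matcher(head).lookingAt();
    }

    public static void main(String[] args) {
        String body = "%PDF-1.4\n"
                + "% startxref 3 is only a comment\n"
                + "xref\n0 1\n0000000000 65535 f \n"
                + "trailer\n<< /Size 1 >>\n";
        int xrefOffset = body.indexOf("\nxref\n") + 1;
        byte[] pdf = (body + "startxref\n" + xrefOffset + "\n%%EOF\n")
                .getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(looksLikeXRef(pdf, xrefOffset)); // prints true
        System.out.println(looksLikeXRef(pdf, 3));          // prints false (value taken from the comment)
    }
}
```

A check like this only reduces false positives. An adversarial file can still place a plausible-looking xref keyword wherever it likes.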

The good parts come directly from PDFBox.

Constructor Summary

- StartXRefScanner(org.apache.pdfbox.io.RandomAccessRead source)

Method Summary

- protected boolean isEOL(int c)
  Tells whether the given byte is an end-of-line byte.
- protected boolean isWhitespace(int c)
- protected long readLong()
- protected final StringBuilder readStringNumber()
  Reads the token that the readLong() method parses.
- List<StartXRefOffset> scan()
- protected void skipSpaces()
  Skips all spaces and comments that are present.
- protected void skipWhiteSpaces()

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Details
- StartXRefScanner
  
  public StartXRefScanner(org.apache.pdfbox.io.RandomAccessRead source)
Method Details
- scan
  
  public List<StartXRefOffset> scan() throws IOException
  
  Throws:
  
  IOException
- skipWhiteSpaces
  
  protected void skipWhiteSpaces() throws IOException
  
  Throws:
  
  IOException
- isWhitespace
  
  protected boolean isWhitespace(int c)
- readLong
  
  protected long readLong() throws IOException
  
  Throws:
  
  IOException
- skipSpaces
  
  protected void skipSpaces() throws IOException
  
  This will skip all spaces and comments that are present.
  
  Throws:
  
  IOException - If there is an error reading from the stream.
- readStringNumber
  
  protected final StringBuilder readStringNumber() throws IOException
  
  Reads the token that the readLong() method parses. Any non-digit value acts as a delimiter.
  
  Returns:
  
  the token for the calling method to parse as an integer or long.
  
  Throws:
  
  IOException - thrown by the source methods.
- isEOL
  
  protected boolean isEOL(int c)
  
  Tells whether the given byte is an end-of-line byte.
  
  Parameters:
  
  c - The character to check against end of line
  
  Returns:
  
  true if c is 0x0A (line feed) or 0x0D (carriage return).
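
The protected helpers documented above can be pictured with the following minimal sketch, which reimplements them over a java.io.PushbackInputStream instead of a PDFBox RandomAccessRead. The class name ScannerHelpersSketch is hypothetical. The whitespace set in isWhitespace is an assumption based on the PDF white-space characters, since the page does not document it, and the comment handling in skipSpaces assumes that a comment starts at '%' and runs to the next end-of-line byte.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; not the Tika or PDFBox implementation.
class ScannerHelpersSketch {

    private final PushbackInputStream in;

    ScannerHelpersSketch(byte[] data) {
        this.in = new PushbackInputStream(new ByteArrayInputStream(data));
    }

    /** True for line feed (0x0A) or carriage return (0x0D), as documented for isEOL. */
    static boolean isEOL(int c) {
        return c == 0x0A || c == 0x0D;
    }

    /** Assumed set: NUL, tab, form feed, line feed, carriage return, space. */
    static boolean isWhitespace(int c) {
        return c == 0x00 || c == 0x09 || c == 0x0C || c == 0x0A || c == 0x0D || c == 0x20;
    }

    /** Skips whitespace and '%' comments, leaving the stream at the next significant byte. */
    void skipSpaces() throws IOException {
        int c = in.read();
        while (isWhitespace(c) || c == '%') {
            if (c == '%') {
                // Consume the rest of the comment up to an end-of-line byte or end of data.
                c = in.read();
                while (c != -1 && !isEOL(c)) {
                    c = in.read();
                }
            } else {
                c = in.read();
            }
        }
        if (c != -1) {
            in.unread(c);
        }
    }

    /** Collects consecutive ASCII digits; the first non-digit byte is pushed back. */
    StringBuilder readStringNumber() throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = in.read()) >= '0' && c <= '9') {
            sb.append((char) c);
        }
        if (c != -1) {
            in.unread(c);
        }
        return sb;
    }

    /** Skips leading spaces and comments, then parses the digit token as a long. */
    long readLong() throws IOException {
        skipSpaces();
        StringBuilder token = readStringNumber();
        try {
            return Long.parseLong(token.toString());
        } catch (NumberFormatException e) {
            throw new IOException("Expected a long, found '" + token + "'", e);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "  % a comment\r\n  116\n%%EOF".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(new ScannerHelpersSketch(data).readLong()); // prints 116
    }
}
```

Pushing the delimiter back after the digits keeps it available to the caller, which matters when the next step is to look for an optional %%EOF marker.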
