Class TikaInputStream
- All Implemented Interfaces:
Closeable, AutoCloseable
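Input stream with extended capabilities. The purpose of this class is to allow files and other resources and information to be associated with the InputStream instance passed through the Parser interface and other similar APIs.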
TikaInputStream instances can be created using the various static get() factory methods. Most of these methods take an optional Metadata argument that is then filled with the available input metadata from the given resource. The created TikaInputStream instance keeps track of the original resource used to create it, while otherwise behaving just like a normal, buffered InputStream. A TikaInputStream instance is also guaranteed to support the mark(int) feature.
Code that wants to access the underlying file or other resources associated with a TikaInputStream should first use the get(InputStream) factory method to cast or wrap a given InputStream into a TikaInputStream instance.
TikaInputStream includes a few safety features to protect against parsers that may fail to check for an EOF or may incorrectly rely on the unreliable value returned from FileInputStream.skip(long). These parser failures can lead to infinite loops. We strongly encourage the use of TikaInputStream.
- Since:
- Apache Tika 0.8
-
Field Summary
Fields inherited from class java.io.FilterInputStream
in
-
Method Summary
Modifier and Type | Method | Description
void | addCloseableResource(Closeable closeable) |
protected void | afterRead(int n) |
static TikaInputStream | cast(InputStream stream) | Returns the given stream cast to a TikaInputStream, or null if the stream is not a TikaInputStream.
void | close() |
static TikaInputStream | get(byte[] data) | Creates a TikaInputStream from the given array of bytes.
static TikaInputStream | get(byte[] data, Metadata metadata) | Creates a TikaInputStream from the given array of bytes.
static TikaInputStream | get(File file) | Deprecated. Use get(Path).
static TikaInputStream | get(File file, Metadata metadata) | Deprecated. Use get(Path, Metadata).
static TikaInputStream | get(InputStream stream) | Casts or wraps the given stream to a TikaInputStream instance.
static TikaInputStream | get(InputStream stream, TemporaryResources tmp, Metadata metadata) | Casts or wraps the given stream to a TikaInputStream instance.
static TikaInputStream | get(URI uri) | Creates a TikaInputStream from the resource at the given URI.
static TikaInputStream | get(URI uri, Metadata metadata) | Creates a TikaInputStream from the resource at the given URI.
static TikaInputStream | get(URL url) | Creates a TikaInputStream from the resource at the given URL.
static TikaInputStream | get(URL url, Metadata metadata) | Creates a TikaInputStream from the resource at the given URL.
static TikaInputStream | get(Path path) | Creates a TikaInputStream from the file at the given path.
static TikaInputStream | get(Path path, Metadata metadata) | Creates a TikaInputStream from the file at the given path.
static TikaInputStream | get(Path path, Metadata metadata, TemporaryResources tmp) |
static TikaInputStream | get(Blob blob) | Creates a TikaInputStream from the given database BLOB.
static TikaInputStream | get(Blob blob, Metadata metadata) | Creates a TikaInputStream from the given database BLOB.
static TikaInputStream | get(InputStreamFactory factory) | Creates a TikaInputStream from a Factory which can create fresh InputStreams for the same resource multiple times.
static TikaInputStream | get(InputStreamFactory factory, TemporaryResources tmp) | Creates a TikaInputStream from a Factory which can create fresh InputStreams for the same resource multiple times.
File | getFile() |
FileChannel | getFileChannel() |
InputStreamFactory | getInputStreamFactory() | If the stream was created from an InputStreamFactory, returns that, otherwise null.
long | getLength() | Returns the length (in bytes) of this stream.
Object | getOpenContainer() | Returns the open container object if any, such as a POIFS FileSystem in the event of an OLE2 document being detected and processed by the OLE2 detector.
Path | getPath() | If the user created this TikaInputStream with a file, the original file will be returned.
Path | getPath(int maxBytes) |
long | getPosition() | Returns the current position within the stream.
boolean | hasFile() |
boolean | hasInputStreamFactory() |
boolean | hasLength() |
static boolean | isTikaInputStream(InputStream stream) | Checks whether the given stream is a TikaInputStream instance.
void | mark(int readlimit) |
boolean | markSupported() |
int | peek(byte[] buffer) | Fills the given buffer with upcoming bytes from this stream without advancing the current stream position.
void | reset() |
void | setOpenContainer(Object container) | Stores the open container object against the stream, e.g. after a Zip contents detector has loaded the file to decide what it contains.
long | skip(long ln) | This relies on IOUtils.skip(InputStream, long, byte[]) to ensure that the alleged bytes skipped were actually skipped.
String | toString() |
Methods inherited from class org.apache.commons.io.input.TaggedInputStream
handleIOException, isCauseOf, throwIfCauseOf
Methods inherited from class org.apache.commons.io.input.ProxyInputStream
available, beforeRead, read, read, read, unwrap
Methods inherited from class java.io.InputStream
nullInputStream, readAllBytes, readNBytes, readNBytes, transferTo
-
Method Details
-
isTikaInputStream
Checks whether the given stream is a TikaInputStream instance. The given stream can be null, in which case the return value is false.
- Parameters:
stream - input stream, possibly null
- Returns:
- true if the stream is a TikaInputStream instance, false otherwise
-
get
public static TikaInputStream get(InputStream stream, TemporaryResources tmp, Metadata metadata)
Casts or wraps the given stream to a TikaInputStream instance. This method can be used to access the functionality of this class even when given just a normal input stream instance.
The given temporary file provider is used for any temporary files, and should be disposed when the returned stream is no longer used.
Use this method instead of the get(InputStream) alternative when you don't explicitly close the returned stream. The recommended access pattern is:
    try (TemporaryResources tmp = new TemporaryResources()) {
        TikaInputStream stream = TikaInputStream.get(..., tmp);
        // process stream but don't close it
    }
The given stream instance will not be closed when the TemporaryResources.close() method is called by the try-with-resources statement. The caller is expected to explicitly close the original stream when it's no longer used.
- Parameters:
stream - normal input stream
- Returns:
- a TikaInputStream instance
- Since:
- Apache Tika 0.10
-
get
public static TikaInputStream get(InputStream stream)
Casts or wraps the given stream to a TikaInputStream instance. This method can be used to access the functionality of this class even when given just a normal input stream instance.
Use this method instead of the get(InputStream, TemporaryResources, Metadata) alternative when you do explicitly close the returned stream. The recommended access pattern is:
    try (TikaInputStream stream = TikaInputStream.get(...)) {
        // process stream
    }
The given stream instance will be closed along with any other resources associated with the returned TikaInputStream instance when the close() method is called by the try-with-resources statement.
- Parameters:
stream - normal input stream
- Returns:
- a TikaInputStream instance
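The close behaviour described above can be illustrated without Tika on the classpath. The following is a minimal sketch using only JDK classes; `TrackingStream` and `Wrapper` are hypothetical stand-ins (not Tika types), with `Wrapper` playing the role of the stream returned by a factory method. It only shows that closing a wrapping stream through try-with-resources also closes the stream it wraps.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

class CloseSketch {
    /** Hypothetical stream that records whether close() was called on it. */
    static class TrackingStream extends ByteArrayInputStream {
        boolean closed = false;

        TrackingStream(byte[] data) {
            super(data);
        }

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    /** Hypothetical wrapper standing in for the stream a factory method returns. */
    static class Wrapper extends FilterInputStream {
        Wrapper(InputStream in) {
            super(in);
        }
    }

    /** Consumes the data through a wrapper and reports whether the original was closed. */
    static boolean originalClosedAfterTry(byte[] data) throws IOException {
        TrackingStream original = new TrackingStream(data);
        try (InputStream wrapped = new Wrapper(original)) {
            while (wrapped.read() != -1) {
                // process stream
            }
        }
        return original.closed;
    }

    /** Runs the pattern on a small in-memory stream. */
    static boolean demo() {
        try {
            return originalClosedAfterTry(new byte[] {1, 2, 3});
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

With this pattern the caller does not need a separate close call for the original stream, which is the situation the text recommends this overload for.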
-
cast
Returns the given stream cast to a TikaInputStream, or null if the stream is not a TikaInputStream.
- Parameters:
stream - normal input stream
- Returns:
- a TikaInputStream instance
- Since:
- Apache Tika 0.10
-
get
public static TikaInputStream get(byte[] data)
Creates a TikaInputStream from the given array of bytes.
Note that you must always explicitly close the returned stream as in some cases it may end up writing the given data to a temporary file.
- Parameters:
data - input data
- Returns:
- a TikaInputStream instance
-
get
public static TikaInputStream get(byte[] data, Metadata metadata)
Creates a TikaInputStream from the given array of bytes. The length of the array is stored as input metadata in the given metadata instance.
Note that you must always explicitly close the returned stream as in some cases it may end up writing the given data to a temporary file.
- Parameters:
data - input data
metadata - metadata instance
- Returns:
- a TikaInputStream instance
-
get
public static TikaInputStream get(Path path) throws IOException
Creates a TikaInputStream from the file at the given path.
Note that you must always explicitly close the returned stream to prevent leaking open file handles.
- Parameters:
path - input file
- Returns:
- a TikaInputStream instance
- Throws:
IOException - if an I/O error occurs
-
get
public static TikaInputStream get(Path path, Metadata metadata) throws IOException
Creates a TikaInputStream from the file at the given path. The file name and length are stored as input metadata in the given metadata instance.
If there's a TikaCoreProperties.RESOURCE_NAME_KEY in the metadata object, this will not overwrite that value with the path's name.
Note that you must always explicitly close the returned stream to prevent leaking open file handles.
- Parameters:
path - input file
metadata - metadata instance
- Returns:
- a TikaInputStream instance
- Throws:
IOException - if an I/O error occurs
-
get
public static TikaInputStream get(Path path, Metadata metadata, TemporaryResources tmp) throws IOException
- Throws:
IOException
-
get
@Deprecated public static TikaInputStream get(File file) throws FileNotFoundException
Deprecated. Use get(Path). In Tika 2.0, this will be removed or modified to throw an IOException.
Creates a TikaInputStream from the given file.
Note that you must always explicitly close the returned stream to prevent leaking open file handles.
- Parameters:
file - input file
- Returns:
- a TikaInputStream instance
- Throws:
FileNotFoundException - if the file does not exist
-
get
@Deprecated public static TikaInputStream get(File file, Metadata metadata) throws FileNotFoundException
Deprecated. Use get(Path, Metadata). In Tika 2.0, this will be removed or modified to throw an IOException.
Creates a TikaInputStream from the given file. The file name and length are stored as input metadata in the given metadata instance.
Note that you must always explicitly close the returned stream to prevent leaking open file handles.
- Parameters:
file - input file
metadata - metadata instance
- Returns:
- a TikaInputStream instance
- Throws:
FileNotFoundException - if the file does not exist or cannot be opened for reading
-
get
public static TikaInputStream get(InputStreamFactory factory) throws IOException
Creates a TikaInputStream from a Factory which can create fresh InputStreams for the same resource multiple times.
This is typically desired when working with Parsers that need to re-read the stream multiple times, where other forms of buffering (e.g. File) are slower than just getting a fresh new stream each time.
- Throws:
IOException
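The idea of obtaining a fresh stream for each pass, rather than buffering, can be sketched with plain JDK types. In the example below, `StreamSource` is a hypothetical interface introduced only for illustration; it is not Tika's InputStreamFactory, and the sketch makes no claim about how TikaInputStream uses a factory internally.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class FreshStreamSketch {
    /** Hypothetical stand-in for a factory that can reopen the same resource. */
    interface StreamSource {
        InputStream open() throws IOException;
    }

    /** Counts bytes by reading a freshly opened stream to its end. */
    static int countBytes(StreamSource source) throws IOException {
        int count = 0;
        try (InputStream in = source.open()) {
            while (in.read() != -1) {
                count++;
            }
        }
        return count;
    }

    /** Reads the first byte of a freshly opened stream, or -1 if it is empty. */
    static int firstByte(StreamSource source) throws IOException {
        try (InputStream in = source.open()) {
            return in.read();
        }
    }

    /** Makes two independent passes over the same in-memory resource. */
    static String demo() {
        byte[] data = "tika".getBytes(StandardCharsets.US_ASCII);
        StreamSource source = () -> new ByteArrayInputStream(data);
        try {
            int total = countBytes(source); // first pass consumes a whole stream
            int first = firstByte(source);  // second pass starts again at byte 0
            return total + ":" + (char) first;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "4:t"
    }
}
```

Each pass opens and closes its own stream, so no pass depends on mark/reset or on spooling the data to disk.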
-
get
public static TikaInputStream get(InputStreamFactory factory, TemporaryResources tmp) throws IOException
Creates a TikaInputStream from a Factory which can create fresh InputStreams for the same resource multiple times.
This is typically desired when working with Parsers that need to re-read the stream multiple times, where other forms of buffering (e.g. File) are slower than just getting a fresh new stream each time.
- Throws:
IOException
-
get
public static TikaInputStream get(Blob blob) throws SQLException
Creates a TikaInputStream from the given database BLOB.
Note that the result set containing the BLOB may need to be kept open until the returned TikaInputStream has been processed and closed. You must also always explicitly close the returned stream as in some cases it may end up writing the blob data to a temporary file.
- Parameters:
blob - database BLOB
- Returns:
- a TikaInputStream instance
- Throws:
SQLException - if BLOB data cannot be accessed
-
get
public static TikaInputStream get(Blob blob, Metadata metadata) throws SQLException
Creates a TikaInputStream from the given database BLOB. The BLOB length (if available) is stored as input metadata in the given metadata instance.
Note that the result set containing the BLOB may need to be kept open until the returned TikaInputStream has been processed and closed. You must also always explicitly close the returned stream as in some cases it may end up writing the blob data to a temporary file.
- Parameters:
blob - database BLOB
metadata - metadata instance
- Returns:
- a TikaInputStream instance
- Throws:
SQLException - if BLOB data cannot be accessed
-
get
public static TikaInputStream get(URI uri) throws IOException
Creates a TikaInputStream from the resource at the given URI.
Note that you must always explicitly close the returned stream as in some cases it may end up writing the resource to a temporary file.
- Parameters:
uri - resource URI
- Returns:
- a TikaInputStream instance
- Throws:
IOException - if the resource cannot be accessed
-
get
public static TikaInputStream get(URI uri, Metadata metadata) throws IOException
Creates a TikaInputStream from the resource at the given URI. The available input metadata is stored in the given metadata instance.
Note that you must always explicitly close the returned stream as in some cases it may end up writing the resource to a temporary file.
- Parameters:
uri - resource URI
metadata - metadata instance
- Returns:
- a TikaInputStream instance
- Throws:
IOException - if the resource cannot be accessed
-
get
public static TikaInputStream get(URL url) throws IOException
Creates a TikaInputStream from the resource at the given URL.
Note that you must always explicitly close the returned stream as in some cases it may end up writing the resource to a temporary file.
- Parameters:
url - resource URL
- Returns:
- a TikaInputStream instance
- Throws:
IOException - if the resource cannot be accessed
-
get
public static TikaInputStream get(URL url, Metadata metadata) throws IOException
Creates a TikaInputStream from the resource at the given URL. The available input metadata is stored in the given metadata instance.
Note that you must always explicitly close the returned stream as in some cases it may end up writing the resource to a temporary file.
- Parameters:
url - resource URL
metadata - metadata instance
- Returns:
- a TikaInputStream instance
- Throws:
IOException - if the resource cannot be accessed
-
peek
Fills the given buffer with upcoming bytes from this stream without advancing the current stream position. The buffer is filled up unless the end of stream is encountered before that. This method will block if not enough bytes are immediately available.
- Parameters:
buffer - byte buffer
- Returns:
- number of bytes written to the buffer
- Throws:
IOException - if the stream cannot be read
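The peek contract described above (fill the buffer, stop early at end of stream, leave the position unchanged) can be approximated on any mark-capable JDK stream. The following sketch is an illustration built on mark(int) and reset(); it assumes a BufferedInputStream as the underlying stream and is not a description of how TikaInputStream implements peek.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

class PeekSketch {
    /**
     * Fills the buffer with upcoming bytes and then rewinds the stream,
     * so the caller's position is unchanged. Needs mark/reset support.
     */
    static int peek(InputStream in, byte[] buffer) throws IOException {
        if (!in.markSupported()) {
            throw new IOException("mark/reset not supported");
        }
        in.mark(buffer.length);
        int total = 0;
        try {
            // Keep reading until the buffer is full or the stream ends.
            while (total < buffer.length) {
                int n = in.read(buffer, total, buffer.length - total);
                if (n == -1) {
                    break;
                }
                total += n;
            }
        } finally {
            in.reset(); // rewind to the marked position
        }
        return total;
    }

    /** Peeks at a three-byte stream with a larger buffer, then reads normally. */
    static String demo() {
        InputStream in = new BufferedInputStream(
                new ByteArrayInputStream(new byte[] {10, 20, 30}));
        try {
            byte[] buffer = new byte[8];
            int peeked = peek(in, buffer); // only 3 bytes are available
            int next = in.read();          // still the first byte
            return peeked + ":" + next;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "3:10"
    }
}
```

The return value is the number of bytes actually placed in the buffer, which is smaller than the buffer length only when the stream ends first.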
-
getOpenContainer
Returns the open container object if any, such as a POIFS FileSystem in the event of an OLE2 document being detected and processed by the OLE2 detector.
- Returns:
- Open Container for this stream, or null if none
-
setOpenContainer
Stores the open container object against the stream, e.g. after a Zip contents detector has loaded the file to decide what it contains.
-
addCloseableResource
- Parameters:
closeable
-
hasInputStreamFactory
public boolean hasInputStreamFactory()
-
getInputStreamFactory
If the stream was created from an InputStreamFactory, returns that, otherwise null.
-
hasFile
public boolean hasFile()
-
getPath
If the user created this TikaInputStream with a file, the original file will be returned. If not, the entire stream will be spooled to a temporary file which will be deleted upon the close of this TikaInputStream.
- Throws:
IOException
-
getPath
- Parameters:
maxBytes - if this is less than 0 and if an underlying file doesn't already exist, the full file will be spooled to disk
- Returns:
- the original path used in the initialization of this TikaInputStream, a temporary file if the stream was shorter than maxBytes, or null if the underlying stream was longer than maxBytes.
- Throws:
IOException
-
getFile
- Throws:
IOException
- See Also:
-
getFileChannel
- Throws:
IOException
-
hasLength
public boolean hasLength()
-
getLength
Returns the length (in bytes) of this stream. Note that if the length was not available when this stream was instantiated, then this method will use the getPath() method to buffer the entire stream to a temporary file in order to calculate the stream length. This case will only work if the stream has not yet been consumed.
- Returns:
- stream length
- Throws:
IOException - if the length cannot be determined
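To see why spooling gives a length, and why it only works on an unconsumed stream, consider the JDK-only sketch below. `spooledLength` is a hypothetical helper, not part of Tika; unlike the behaviour described for getPath(), it deletes its temporary file immediately instead of keeping it until the stream is closed.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class LengthSketch {
    /** Copies the remaining bytes of the stream to a temporary file and measures it. */
    static long spooledLength(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("length-sketch-", ".tmp");
        try {
            // createTempFile already created the file, so it must be replaced.
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            return Files.size(tmp);
        } finally {
            Files.deleteIfExists(tmp);
        }
    }

    /** Measures a fresh stream and a stream that has already been partly read. */
    static String demo() {
        try {
            long fresh = spooledLength(new ByteArrayInputStream(new byte[1234]));

            InputStream partlyRead = new ByteArrayInputStream(new byte[1234]);
            partlyRead.read(new byte[4]); // consume four bytes first
            long afterRead = spooledLength(partlyRead);

            return fresh + ":" + afterRead;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "1234:1230"
    }
}
```

The second measurement is short by the four bytes already read, which is the failure mode the note about unconsumed streams warns against.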
-
getPosition
public long getPosition()
Returns the current position within the stream.
- Returns:
- stream position
-
skip
This relies on IOUtils.skip(InputStream, long, byte[]) to ensure that the alleged bytes skipped were actually skipped.
- Overrides:
skip in class org.apache.commons.io.input.ProxyInputStream
- Parameters:
ln - the number of bytes to skip
- Returns:
- the number of bytes skipped
- Throws:
IOException - if the number of bytes requested to be skipped does not match the number of bytes skipped or if there's an IOException during the read.
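The general technique of verifying a skip by reading and discarding bytes can be sketched with the JDK alone. In the example below, `StingyStream` and `skipFully` are hypothetical names used for illustration; the code is not the IOUtils.skip implementation, only a minimal instance of the same idea, and it signals a short skip with an EOFException (a subclass of IOException).

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

class SkipSketch {
    /** Hypothetical stream whose skip() never skips more than one byte per call. */
    static class StingyStream extends FilterInputStream {
        StingyStream(InputStream in) {
            super(in);
        }

        @Override
        public long skip(long n) throws IOException {
            return super.skip(Math.min(n, 1));
        }
    }

    /** Skips exactly n bytes by reading and discarding them, or fails at end of stream. */
    static long skipFully(InputStream in, long n) throws IOException {
        byte[] scratch = new byte[4096];
        long remaining = n;
        while (remaining > 0) {
            int read = in.read(scratch, 0, (int) Math.min(scratch.length, remaining));
            if (read == -1) {
                throw new EOFException("skipped " + (n - remaining) + " of " + n + " bytes");
            }
            remaining -= read;
        }
        return n;
    }

    /** Compares a single skip() call with the read-based loop. */
    static String demo() {
        byte[] data = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
        InputStream in = new StingyStream(new ByteArrayInputStream(data));
        try {
            long naive = in.skip(5); // asks for 5, but only 1 byte is skipped
            skipFully(in, 4);        // reliably skips the remaining 4
            int next = in.read();    // position is now 5
            return naive + ":" + next;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "1:5"
    }
}
```

A parser that trusted the first call and assumed five bytes were skipped would be four bytes out of position, which is the kind of error the safety features are meant to prevent.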
-
mark
public void mark(int readlimit)
- Overrides:
mark in class org.apache.commons.io.input.ProxyInputStream
-
markSupported
public boolean markSupported()
- Overrides:
markSupported in class org.apache.commons.io.input.ProxyInputStream
-
reset
- Overrides:
reset in class org.apache.commons.io.input.ProxyInputStream
- Throws:
IOException
-
close
- Specified by:
close in interface AutoCloseable
- Specified by:
close in interface Closeable
- Overrides:
close in class org.apache.commons.io.input.ProxyInputStream
- Throws:
IOException
-
afterRead
- Overrides:
afterRead in class org.apache.commons.io.input.ProxyInputStream
- Throws:
IOException
-
toString
-