Package org.apache.tika.batch
Class FileResourceCrawler
- java.lang.Object
-
- org.apache.tika.batch.FileResourceCrawler
-
- All Implemented Interfaces:
Callable<IFileProcessorFutureResult>
- Direct Known Subclasses:
FSDirectoryCrawler, FSListCrawler
public abstract class FileResourceCrawler extends Object implements Callable<IFileProcessorFutureResult>
-
-
Constructor Summary
Constructor Description
FileResourceCrawler(ArrayBlockingQueue<FileResource> queue, int numConsumers)
-
Method Summary
Modifier and Type Method Description
org.apache.tika.batch.FileResourceCrawlerFutureResult
call()
int
getAdded()
int
getConsidered()
boolean
isActive()
If the crawler stops for any reason, it is no longer active.
boolean
isQueueEmpty()
Use sparingly.
protected boolean
select(Metadata m)
void
setDocumentSelector(DocumentSelector documentSelector)
void
setMaxConsecWaitInMillis(long maxConsecWaitInMillis)
void
setMaxFilesToAdd(int maxFilesToAdd)
Maximum number of files to add.
void
setMaxFilesToConsider(int maxFilesToConsider)
Maximum number of files to consider.
void
shutDownNoPoison()
Requests that the FileResourceCrawler shut down without adding poison.
abstract void
start()
Implement this to control the addition of FileResources.
protected int
tryToAdd(FileResource fileResource)
boolean
wasTimedOut()
Returns whether the crawler timed out while trying to add a resource to the queue.
-
-
-
Field Detail
-
LOG
protected static final org.slf4j.Logger LOG
-
SKIPPED
protected static final int SKIPPED
- See Also:
- Constant Field Values
-
ADDED
protected static final int ADDED
- See Also:
- Constant Field Values
-
STOP_NOW
protected static final int STOP_NOW
- See Also:
- Constant Field Values
-
-
Constructor Detail
-
FileResourceCrawler
public FileResourceCrawler(ArrayBlockingQueue<FileResource> queue, int numConsumers)
- Parameters:
queue
- shared queue
numConsumers
- number of consumers (needs to know how many poisons to add when done)
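The numConsumers parameter reflects the poison-pill shutdown convention: each consumer stops after taking one poison item, so the producer has to enqueue one per consumer. The following is a minimal sketch of that convention using only java.util.concurrent. It does not use the Tika classes; PoisonPillSketch, POISON, and runDemo are hypothetical names chosen for illustration, and plain strings stand in for FileResource.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class PoisonPillSketch {
    // Hypothetical sentinel, compared by identity so real data can never match it.
    static final String POISON = new String("POISON");

    /** Starts numConsumers consumers, feeds them numItems items, and returns the total consumed. */
    static int runDemo(int numConsumers, int numItems) {
        ArrayBlockingQueue<String> queue = new ArrayBlockingQueue<>(4);
        ExecutorService pool = Executors.newFixedThreadPool(numConsumers);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (int i = 0; i < numConsumers; i++) {
                results.add(pool.submit(() -> {
                    int consumed = 0;
                    while (true) {
                        String item = queue.take();
                        if (item == POISON) {
                            return consumed; // each consumer takes exactly one poison
                        }
                        consumed++;
                    }
                }));
            }
            for (int i = 0; i < numItems; i++) {
                queue.put("file-" + i);
            }
            // One poison per consumer; fewer would leave some consumers blocked in take().
            for (int i = 0; i < numConsumers; i++) {
                queue.put(POISON);
            }
            int total = 0;
            for (Future<Integer> result : results) {
                total += result.get();
            }
            return total;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(runDemo(3, 10)); // prints 10
    }
}
```

If the producer added fewer poisons than there are consumers, the remaining consumers would wait on the queue indefinitely, which is why the count is passed in at construction time.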
-
-
Method Detail
-
start
public abstract void start() throws InterruptedException
Implement this to control the addition of FileResources. Call tryToAdd(org.apache.tika.batch.FileResource) to add FileResources to the queue.
- Throws:
InterruptedException
-
call
public org.apache.tika.batch.FileResourceCrawlerFutureResult call()
- Specified by:
call
in interface Callable<IFileProcessorFutureResult>
-
tryToAdd
protected int tryToAdd(FileResource fileResource) throws InterruptedException
- Parameters:
fileResource
- resource to add
- Returns:
- int status of the attempt (SKIPPED, ADDED, STOP_NOW) to add the resource to the queue.
- Throws:
InterruptedException
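To illustrate how a start() implementation might react to the three status values, here is a minimal sketch that does not use the Tika classes. It is one plausible shape for the contract, not the library's implementation: the numeric constant values are hypothetical (the real ones are listed under Constant Field Values), a Predicate stands in for the DocumentSelector, a single timed offer stands in for the maxConsecWaitInMillis handling, and the sketch swallows interruption instead of declaring InterruptedException.

```java
import java.util.Arrays;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Predicate;

final class TryToAddSketch {
    // Hypothetical values; the real constants may differ.
    static final int SKIPPED = 0;
    static final int ADDED = 1;
    static final int STOP_NOW = 2;

    private final ArrayBlockingQueue<String> queue;
    private final Predicate<String> selector; // stand-in for a DocumentSelector
    private final long maxWaitMillis;         // stand-in for maxConsecWaitInMillis

    TryToAddSketch(ArrayBlockingQueue<String> queue, Predicate<String> selector, long maxWaitMillis) {
        this.queue = queue;
        this.selector = selector;
        this.maxWaitMillis = maxWaitMillis;
    }

    /** Reports what happened to one resource instead of blocking forever on a full queue. */
    int tryToAdd(String resource) {
        if (!selector.test(resource)) {
            return SKIPPED; // rejected by the selector, keep crawling
        }
        try {
            boolean added = queue.offer(resource, maxWaitMillis, TimeUnit.MILLISECONDS);
            return added ? ADDED : STOP_NOW; // a timed-out offer means stop crawling
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return STOP_NOW;
        }
    }

    /** A start()-style loop: offer each name in turn and stop at the first STOP_NOW. */
    static int[] demoStatuses() {
        // Capacity 2 and no consumer, so the third accepted resource times out.
        TryToAddSketch crawler = new TryToAddSketch(
                new ArrayBlockingQueue<>(2), name -> name.endsWith(".txt"), 10L);
        String[] names = {"a.txt", "b.bin", "c.txt", "d.txt", "e.txt"};
        int[] statuses = new int[names.length];
        Arrays.fill(statuses, -1); // -1 marks names that were never offered
        for (int i = 0; i < names.length; i++) {
            statuses[i] = crawler.tryToAdd(names[i]);
            if (statuses[i] == STOP_NOW) {
                break;
            }
        }
        return statuses;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(demoStatuses())); // prints [1, 0, 1, 2, -1]
    }
}
```

The point of returning a status rather than a boolean is that the caller can tell a harmless skip apart from a condition that should end the crawl.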
-
isActive
public boolean isActive()
If the crawler stops for any reason, it is no longer active.
- Returns:
- whether crawler is active or not
-
setMaxConsecWaitInMillis
public void setMaxConsecWaitInMillis(long maxConsecWaitInMillis)
-
setDocumentSelector
public void setDocumentSelector(DocumentSelector documentSelector)
-
getConsidered
public int getConsidered()
-
select
protected boolean select(Metadata m)
-
setMaxFilesToAdd
public void setMaxFilesToAdd(int maxFilesToAdd)
Maximum number of files to add. If maxFilesToAdd < 0 (default), then this crawler will add all documents.
- Parameters:
maxFilesToAdd
- maximum number of files to add to the queue
-
setMaxFilesToConsider
public void setMaxFilesToConsider(int maxFilesToConsider)
Maximum number of files to consider. A file counts as considered whether or not the DocumentSelector selects it. If maxFilesToConsider < 0 (default), then this crawler will consider all documents.
- Parameters:
maxFilesToConsider
- maximum number of files to consider adding to the queue
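The difference between the two limits can be sketched as follows. This is an illustrative model only, not the Tika implementation: CrawlLimitsSketch and its members are hypothetical, a Predicate stands in for the DocumentSelector, and the exact boundary behaviour (whether the limit itself is inclusive) is an assumption.

```java
import java.util.Arrays;
import java.util.function.Predicate;

final class CrawlLimitsSketch {
    private final int maxFilesToAdd;      // < 0 means no limit
    private final int maxFilesToConsider; // < 0 means no limit
    private int considered = 0;
    private int added = 0;

    CrawlLimitsSketch(int maxFilesToAdd, int maxFilesToConsider) {
        this.maxFilesToAdd = maxFilesToAdd;
        this.maxFilesToConsider = maxFilesToConsider;
    }

    /** Returns false once either limit is reached; the caller should then stop crawling. */
    boolean consider(String name, Predicate<String> selector) {
        if (maxFilesToAdd >= 0 && added >= maxFilesToAdd) {
            return false;
        }
        if (maxFilesToConsider >= 0 && considered >= maxFilesToConsider) {
            return false;
        }
        considered++; // counted whether or not the selector accepts the file
        if (selector.test(name)) {
            added++;
        }
        return true;
    }

    /** Runs five sample names through the limits and returns {considered, added}. */
    static int[] run(int maxFilesToAdd, int maxFilesToConsider) {
        CrawlLimitsSketch limits = new CrawlLimitsSketch(maxFilesToAdd, maxFilesToConsider);
        String[] names = {"a.txt", "b.bin", "c.txt", "d.txt", "e.txt"};
        for (String name : names) {
            if (!limits.consider(name, n -> n.endsWith(".txt"))) {
                break;
            }
        }
        return new int[] {limits.considered, limits.added};
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(run(-1, -1))); // prints [5, 4]
        System.out.println(Arrays.toString(run(-1, 3)));  // prints [3, 2]
        System.out.println(Arrays.toString(run(1, -1)));  // prints [1, 1]
    }
}
```

In the second run the rejected file b.bin still uses up one of the three considered slots, which is the distinction the description above draws.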
-
isQueueEmpty
public boolean isQueueEmpty()
Use sparingly. This synchronizes on the queue!
- Returns:
- whether this queue contains any non-poison file resources
-
wasTimedOut
public boolean wasTimedOut()
Returns whether the crawler timed out while trying to add a resource to the queue. If the crawler timed out while trying to add poison, this is not set to true.
- Returns:
- whether this was timed out or not
-
getAdded
public int getAdded()
- Returns:
- number of files that this crawler added to the queue
-
shutDownNoPoison
public void shutDownNoPoison()
Requests that the FileResourceCrawler shut down without adding poison. Do this only if you have already used another mechanism to ask the consumers to shut down. This prevents a potential deadlock in which the crawler is trying to add poison to the queue, but the queue is full and no consumer is left to drain it.
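The idea can be sketched with a flag that the end-of-crawl step checks before enqueuing poison. This is a hedged illustration rather than the Tika implementation: NoPoisonSketch, addPoison, and demo are hypothetical names, and plain strings stand in for FileResource.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.TimeUnit;

final class NoPoisonSketch {
    static final String POISON = new String("POISON"); // hypothetical sentinel

    private final ArrayBlockingQueue<String> queue;
    private final int numConsumers;
    // volatile so a call from another thread is seen by the crawler thread
    private volatile boolean shutDownNoPoison = false;

    NoPoisonSketch(ArrayBlockingQueue<String> queue, int numConsumers) {
        this.queue = queue;
        this.numConsumers = numConsumers;
    }

    void shutDownNoPoison() {
        shutDownNoPoison = true;
    }

    /** End-of-crawl step: returns how many poison items were enqueued. */
    int addPoison(long waitMillis) {
        int poisons = 0;
        // If the flag is set, skip poison entirely; with a full queue and no
        // live consumers, a blocking add here would never return.
        for (int i = 0; i < numConsumers && !shutDownNoPoison; i++) {
            try {
                if (queue.offer(POISON, waitMillis, TimeUnit.MILLISECONDS)) {
                    poisons++;
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return poisons;
    }

    /** Builds a crawler for two consumers and reports how many poisons it enqueued. */
    static int demo(boolean noPoison) {
        NoPoisonSketch crawler = new NoPoisonSketch(new ArrayBlockingQueue<>(4), 2);
        if (noPoison) {
            crawler.shutDownNoPoison();
        }
        return crawler.addPoison(10L);
    }

    public static void main(String[] args) {
        System.out.println(demo(false)); // prints 2
        System.out.println(demo(true));  // prints 0
    }
}
```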
-
-