Class FileResourceCrawler

    • Constructor Detail

      • FileResourceCrawler

        public FileResourceCrawler(ArrayBlockingQueue<FileResource> queue,
                                   int numConsumers)
        Parameters:
        queue - shared queue
        numConsumers - number of consumers; the crawler needs this to know how many poison resources to add to the queue when it is done
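        The reason the constructor needs numConsumers can be illustrated with a generic poison-pill sketch. This is not the library's code: the class PoisonPillSketch, its POISON sentinel, and the use of String in place of FileResource are all hypothetical stand-ins. It only shows why a producer typically adds one poison per consumer to a shared queue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration; String stands in for FileResource.
class PoisonPillSketch {
    // A distinct instance, compared by identity, marks "no more work".
    static final String POISON = new String("POISON");

    /** Feeds items to numConsumers threads and returns how many real items were consumed. */
    static int run(List<String> items, int numConsumers) {
        ArrayBlockingQueue<String> queue = new ArrayBlockingQueue<>(2);
        AtomicInteger consumed = new AtomicInteger();
        List<Thread> consumers = new ArrayList<>();
        for (int i = 0; i < numConsumers; i++) {
            Thread t = new Thread(() -> {
                try {
                    while (true) {
                        String item = queue.take();
                        if (item == POISON) {
                            return; // each consumer removes exactly one poison
                        }
                        consumed.incrementAndGet();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            t.start();
            consumers.add(t);
        }
        try {
            for (String item : items) {
                queue.put(item);
            }
            // One poison per consumer; with fewer, some consumers would wait forever.
            for (int i = 0; i < numConsumers; i++) {
                queue.put(POISON);
            }
            for (Thread t : consumers) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return consumed.get();
    }
}
```

        Under these assumptions, calling PoisonPillSketch.run(items, 3) lets all three consumer threads exit cleanly once every real item has been taken.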
    • Method Detail

      • tryToAdd

        protected int tryToAdd(FileResource fileResource)
                        throws InterruptedException
        Parameters:
        fileResource - resource to add
        Returns:
        status of the attempt to add the resource to the queue: SKIPPED, ADDED, or STOP_NOW
        Throws:
        InterruptedException
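        As a rough illustration of the three status values, the sketch below shows one plausible way such a method could behave on a bounded queue. It is an assumption, not the actual implementation: the class TryToAddSketch, the numeric values of the constants, the Predicate used as a selector, and the timeout handling are all hypothetical, and String stands in for FileResource.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Predicate;

// Hypothetical illustration; not the library's code.
class TryToAddSketch {
    // The numeric values are assumed for this sketch.
    static final int SKIPPED = 0;
    static final int ADDED = 1;
    static final int STOP_NOW = 2;

    private final ArrayBlockingQueue<String> queue;
    private final Predicate<String> selector;
    private final long maxWaitMillis;
    private boolean timedOut = false;

    TryToAddSketch(ArrayBlockingQueue<String> queue, Predicate<String> selector, long maxWaitMillis) {
        this.queue = queue;
        this.selector = selector;
        this.maxWaitMillis = maxWaitMillis;
    }

    int tryToAdd(String resource) throws InterruptedException {
        if (!selector.test(resource)) {
            return SKIPPED; // rejected by the selector, queue untouched
        }
        // Wait a bounded time for space instead of blocking forever.
        if (queue.offer(resource, maxWaitMillis, TimeUnit.MILLISECONDS)) {
            return ADDED;
        }
        timedOut = true; // the queue stayed full for the whole wait
        return STOP_NOW;
    }

    boolean wasTimedOut() {
        return timedOut;
    }

    /** Returns the three statuses plus a timed-out flag (1 or 0) for a capacity-1 queue. */
    static int[] demo() {
        TryToAddSketch crawler = new TryToAddSketch(
                new ArrayBlockingQueue<>(1), name -> !name.endsWith(".tmp"), 10L);
        try {
            int first = crawler.tryToAdd("a.txt");  // fits in the queue
            int second = crawler.tryToAdd("b.tmp"); // not selected
            int third = crawler.tryToAdd("c.txt");  // queue is full, nobody consumes
            return new int[] {first, second, third, crawler.wasTimedOut() ? 1 : 0};
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

        In this sketch a caller would stop crawling as soon as it sees STOP_NOW, which is also the point at which the timed-out flag is set.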
      • isActive

        public boolean isActive()
        If the crawler stops for any reason, it is no longer active.
        Returns:
        whether the crawler is active
      • setMaxConsecWaitInMillis

        public void setMaxConsecWaitInMillis(long maxConsecWaitInMillis)
      • setDocumentSelector

        public void setDocumentSelector(DocumentSelector documentSelector)
      • getConsidered

        public int getConsidered()
      • select

        protected boolean select(Metadata m)
      • setMaxFilesToAdd

        public void setMaxFilesToAdd(int maxFilesToAdd)
        Maximum number of files to add. If maxFilesToAdd < 0 (default), then this crawler will add all documents.
        Parameters:
        maxFilesToAdd - maximum number of files to add to the queue
      • setMaxFilesToConsider

        public void setMaxFilesToConsider(int maxFilesToConsider)
        Maximum number of files to consider. A file counts as considered whether or not the DocumentSelector selects it.

        If maxFilesToConsider < 0 (default), then this crawler will consider all documents.

        Parameters:
        maxFilesToConsider - maximum number of files to consider adding to the queue
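        The difference between "considered" and "added" can be sketched as two counters with independent limits. This is a minimal illustration under stated assumptions, not the library's logic: the class CrawlLimitsSketch, its fields, the order of the checks, and the ".pdf" selector are hypothetical, and String stands in for a file.

```java
import java.util.List;
import java.util.function.Predicate;

// Hypothetical illustration; not the library's code.
class CrawlLimitsSketch {
    int maxFilesToAdd = -1;      // negative means no limit
    int maxFilesToConsider = -1; // negative means no limit
    int considered = 0;
    int added = 0;

    void crawl(List<String> candidates, Predicate<String> selector) {
        for (String candidate : candidates) {
            if (maxFilesToConsider > -1 && considered >= maxFilesToConsider) {
                break;
            }
            if (maxFilesToAdd > -1 && added >= maxFilesToAdd) {
                break;
            }
            considered++; // counted whether or not the selector accepts it
            if (selector.test(candidate)) {
                added++;
            }
        }
    }

    /** Crawls five sample names and returns {considered, added}. */
    static int[] run(int maxAdd, int maxConsider) {
        CrawlLimitsSketch crawler = new CrawlLimitsSketch();
        crawler.maxFilesToAdd = maxAdd;
        crawler.maxFilesToConsider = maxConsider;
        crawler.crawl(List.of("a.pdf", "b.tmp", "c.pdf", "d.pdf", "e.tmp"),
                name -> name.endsWith(".pdf"));
        return new int[] {crawler.considered, crawler.added};
    }
}
```

        With no limits, all five names are considered and the three ".pdf" names are added; a consider-limit of 2 stops after "b.tmp" even though only one file was added.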
      • isQueueEmpty

        public boolean isQueueEmpty()
        Use sparingly. This synchronizes on the queue!
        Returns:
        true if the queue contains no non-poison file resources
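        A check of this kind has to scan the queue, which is why it is worth using sparingly. The sketch below is an assumption about the general shape of such a check, not the library's code: the class QueueEmptySketch and its POISON sentinel are hypothetical, and String stands in for FileResource.

```java
import java.util.concurrent.ArrayBlockingQueue;

// Hypothetical illustration; not the library's code.
class QueueEmptySketch {
    static final String POISON = new String("POISON"); // compared by identity

    /** Returns true if the queue holds nothing except poison. */
    static boolean isQueueEmpty(ArrayBlockingQueue<String> queue) {
        // Holding the queue's monitor only excludes other code that also
        // synchronizes on the queue; the scan itself is linear in queue size.
        synchronized (queue) {
            for (String item : queue) {
                if (item != POISON) {
                    return false;
                }
            }
        }
        return true;
    }

    /** Returns results for an empty queue, a poison-only queue, and a queue with real work. */
    static boolean[] demo() {
        ArrayBlockingQueue<String> empty = new ArrayBlockingQueue<>(4);
        ArrayBlockingQueue<String> poisonOnly = new ArrayBlockingQueue<>(4);
        poisonOnly.add(POISON);
        ArrayBlockingQueue<String> withWork = new ArrayBlockingQueue<>(4);
        withWork.add("a.txt");
        withWork.add(POISON);
        return new boolean[] {isQueueEmpty(empty), isQueueEmpty(poisonOnly), isQueueEmpty(withWork)};
    }
}
```

        Note that a queue holding only poison is reported as empty here, matching the "non-poison" wording of the return description.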
      • wasTimedOut

        public boolean wasTimedOut()
        Returns whether the crawler timed out while trying to add a resource to the queue.

        If the crawler timed out while trying to add poison, this is not set to true.

        Returns:
        whether the crawler timed out while adding a resource
      • getAdded

        public int getAdded()
        Returns:
        number of files that this crawler added to the queue
      • shutDownNoPoison

        public void shutDownNoPoison()
        Shuts down the FileResourceCrawler without adding poison to the queue. Call this only if you have already used another mechanism to request that consumers shut down. This avoids a potential deadlock in which the crawler is trying to add to the queue while the queue is full.
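        The deadlock being avoided can be sketched as follows: if no consumer drains a full queue, a crawler that insists on adding poison never returns. This is a hypothetical illustration, not the library's code; the class NoPoisonShutdownSketch, its volatile flag, the 50 ms retry interval, and the use of String for FileResource are all assumptions.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration; not the library's code.
class NoPoisonShutdownSketch implements Runnable {
    static final String POISON = new String("POISON");

    private final ArrayBlockingQueue<String> queue;
    private volatile boolean shutDownNoPoison = false;

    NoPoisonShutdownSketch(ArrayBlockingQueue<String> queue) {
        this.queue = queue;
    }

    void shutDownNoPoison() {
        shutDownNoPoison = true;
    }

    @Override
    public void run() {
        try {
            // Retry with a short timeout so the flag is re-checked while the queue is full.
            while (!shutDownNoPoison) {
                if (queue.offer(POISON, 50, TimeUnit.MILLISECONDS)) {
                    return; // poison delivered
                }
            }
            // Flag set: give up without adding poison.
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Returns true if the crawler thread exits even though the queue is full and undrained. */
    static boolean demo() {
        ArrayBlockingQueue<String> full = new ArrayBlockingQueue<>(1);
        full.add("unconsumed"); // consumers have already stopped, so nothing drains this
        NoPoisonShutdownSketch crawler = new NoPoisonShutdownSketch(full);
        Thread thread = new Thread(crawler);
        thread.start();
        crawler.shutDownNoPoison();
        try {
            thread.join(5000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        boolean poisonPresent = false;
        for (String item : full) {
            if (item == POISON) {
                poisonPresent = true;
            }
        }
        return !thread.isAlive() && !poisonPresent;
    }
}
```

        Without the flag, the loop in this sketch would keep retrying against the full queue indefinitely, which is the situation the method description warns about.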