Package org.apache.tika.eval.app
Class AbstractProfiler
java.lang.Object
org.apache.tika.batch.FileResourceConsumer
org.apache.tika.eval.app.AbstractProfiler
- All Implemented Interfaces:
Callable<IFileProcessorFutureResult>
- Direct Known Subclasses:
ExtractComparer, ExtractProfiler, FileProfiler
-
Nested Class Summary
Modifier and Type    Class    Description
static enum
static enum    If information was gathered from the log file about a parse error
-
Field Summary
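Modifier and Type    Field    Description
static final String    FALSE
protected static final AtomicInteger    ID
static TableInfo    MIME_TABLE
static TableInfo    REF_EXTRACT_EXCEPTION_TYPES
static TableInfo    REF_PARSE_ERROR_TYPES
static TableInfo    REF_PARSE_EXCEPTION_TYPES
static final String    TRUE
protected IDBWriter    writer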
Fields inherited from class org.apache.tika.batch.FileResourceConsumer
ELAPSED_MILLIS, IO_IS, IO_OS, OOM, PARSE_ERR, PARSE_EX, TIMED_OUT
-
Constructor Summary
Constructor    Description
AbstractProfiler(ArrayBlockingQueue<FileResource> fileQueue, IDBWriter writer)
-
Method Summary
Modifier and Type    Method    Description
calcTextStats(ContentTags contentTags)
void    closeWriter
protected static ContentTags    getContent(org.apache.tika.eval.app.EvalFilePaths evalFilePaths, Metadata metadata)
protected long    getFileLength
protected org.apache.tika.eval.app.EvalFilePaths    getPathsFromExtractCrawl(Metadata metadata, Path extracts)
protected org.apache.tika.eval.app.EvalFilePaths    getPathsFromSrcCrawl(Metadata metadata, Path srcDir, Path extracts)
protected long    getSourceFileLength(org.apache.tika.eval.app.EvalFilePaths fps, List<Metadata> metadataList)
static void    loadCommonTokens(Path p, String defaultLangCode)
void    setMaxContentLength(int maxContentLength)    Truncates the content string to this length if it is longer.
void    setMaxContentLengthForLangId(int maxContentLengthForLangId)    For language identification, truncates the content string to this length if it is longer.
void    setMaxTokens(int maxTokens)    Adds a LimitTokenCountFilterFactory if maxTokens is greater than -1.
protected static String    truncateContent(ContentTags contentTags, int maxLength, Map<Cols, String> data)    Gets the content and records in data, under Cols.CONTENT_TRUNCATED_AT_MAX_LEN, whether the string was truncated.
protected void    writeContentData(String fileId, Map<Class, Object> textStats, TableInfo contentsTable)    Checks whether metadata is null or content is empty (null or only whitespace).
protected void    writeExceptionData(String fileId, Metadata m, TableInfo exceptionTable)
protected void    writeExtractException(TableInfo extractExceptionTable, String containerId, String filePath, ExtractReaderException.TYPE type)
protected void    writeProfileData(org.apache.tika.eval.app.EvalFilePaths fps, int i, ContentTags contentTags, Metadata m, String fileId, String containerId, List<Integer> numAttachments, TableInfo profileTable)
Methods inherited from class org.apache.tika.batch.FileResourceConsumer
call, checkForTimedOutMillis, close, flushAndClose, getCurrentFile, getNumHandledExceptions, getNumResourcesConsumed, getXMLifiedLogMsg, getXMLifiedLogMsg, incrementHandledExceptions, isStillActive, parse, pleaseShutdown, processFileResource
-
Field Details
-
TRUE
-
FALSE
-
ID
-
REF_EXTRACT_EXCEPTION_TYPES
-
REF_PARSE_ERROR_TYPES
-
REF_PARSE_EXCEPTION_TYPES
-
MIME_TABLE
-
writer
-
-
Constructor Details
-
AbstractProfiler
-
-
Method Details
-
loadCommonTokens
- Parameters:
p - path to the common_tokens directory. If this is null, the tokens are loaded from the classpath.
defaultLangCode - the language code to use if a common_words list doesn't exist for the detected language; can be null
- Throws:
IOException
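The fallback described for defaultLangCode can be pictured with a small lookup. This is only a sketch in plain Java, not Tika's implementation: the class CommonTokensSketch, the method tokensFor, and the in-memory map of token sets are all hypothetical names introduced for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; this is not the Tika code.
final class CommonTokensSketch {

    /**
     * Looks up the token list for the detected language. If none exists and
     * defaultLangCode is non-null, falls back to the default language's list.
     * Returns an empty set when neither list is available (an assumption of
     * this sketch).
     */
    static Set<String> tokensFor(String detectedLang, String defaultLangCode,
                                 Map<String, Set<String>> byLang) {
        Set<String> tokens = detectedLang == null ? null : byLang.get(detectedLang);
        if (tokens == null && defaultLangCode != null) {
            tokens = byLang.get(defaultLangCode);
        }
        return tokens == null ? Collections.<String>emptySet() : tokens;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> byLang = new HashMap<>();
        byLang.put("en", new HashSet<>(Arrays.asList("the", "and")));
        // No list exists for "xx", so the default code "en" is used.
        System.out.println(tokensFor("xx", "en", byLang).contains("the")); // true
    }
}
```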
-
truncateContent
protected static String truncateContent(ContentTags contentTags, int maxLength, Map<Cols, String> data)
Gets the content and records in data, under Cols.CONTENT_TRUNCATED_AT_MAX_LEN, whether the string was truncated.
- Parameters:
contentTags -
maxLength -
data -
- Returns:
the content string, possibly truncated
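As a rough picture of the truncate-and-record pattern that truncateContent describes, here is a minimal sketch in plain Java. It is not the Tika implementation: the class TruncateSketch, the String-keyed map, the "TRUE"/"FALSE" values, and the treatment of a negative maxLength as "no limit" are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; this is not the Tika code.
final class TruncateSketch {

    // Assumed key name, mirroring Cols.CONTENT_TRUNCATED_AT_MAX_LEN.
    static final String TRUNCATED_KEY = "CONTENT_TRUNCATED_AT_MAX_LEN";

    /**
     * Returns content cut to at most maxLength chars and records in data
     * whether a cut happened. A negative maxLength means "no limit" here.
     * Note that substring works on UTF-16 chars, so a cut could split a
     * surrogate pair.
     */
    static String truncate(String content, int maxLength, Map<String, String> data) {
        if (content == null) {
            data.put(TRUNCATED_KEY, "FALSE");
            return "";
        }
        boolean truncated = maxLength >= 0 && content.length() > maxLength;
        data.put(TRUNCATED_KEY, truncated ? "TRUE" : "FALSE");
        return truncated ? content.substring(0, maxLength) : content;
    }

    public static void main(String[] args) {
        Map<String, String> data = new HashMap<>();
        System.out.println(truncate("hello world", 5, data)); // hello
        System.out.println(data.get(TRUNCATED_KEY));          // TRUE
    }
}
```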
-
getContent
protected static ContentTags getContent(org.apache.tika.eval.app.EvalFilePaths evalFilePaths, Metadata metadata)
-
setMaxContentLength
public void setMaxContentLength(int maxContentLength)
Truncates the content string to this length if it is longer.
- Parameters:
maxContentLength -
-
setMaxContentLengthForLangId
public void setMaxContentLengthForLangId(int maxContentLengthForLangId)
For language identification, truncates the content string to this length if it is longer.
- Parameters:
maxContentLengthForLangId -
-
setMaxTokens
public void setMaxTokens(int maxTokens)
Adds a LimitTokenCountFilterFactory if maxTokens is greater than -1.
- Parameters:
maxTokens -
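The "-1 means no limit" convention can be sketched without any analysis library. The example below uses a plain Java stream rather than LimitTokenCountFilterFactory, so it only mirrors the sentinel check. The class MaxTokensSketch and the method limit are hypothetical, and treating values below -1 as "no limit" is an assumption.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical illustration only; this is not the Tika code.
final class MaxTokensSketch {

    /**
     * Keeps at most maxTokens tokens when maxTokens is greater than -1.
     * Otherwise returns the tokens unchanged.
     */
    static List<String> limit(List<String> tokens, int maxTokens) {
        if (maxTokens <= -1) {
            return tokens; // no limit configured
        }
        return tokens.stream().limit(maxTokens).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("a", "b", "c", "d");
        System.out.println(limit(tokens, 2));  // [a, b]
        System.out.println(limit(tokens, -1)); // [a, b, c, d]
    }
}
```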
-
writeExtractException
protected void writeExtractException(TableInfo extractExceptionTable, String containerId, String filePath, ExtractReaderException.TYPE type) throws IOException
- Throws:
IOException
-
writeProfileData
-
writeExceptionData
-
calcTextStats
-
writeContentData
protected void writeContentData(String fileId, Map<Class, Object> textStats, TableInfo contentsTable) throws IOException
Checks whether metadata is null or content is empty (null or only whitespace). In either case, no processing is done and the fileId is not entered into the content table.
- Parameters:
fileId -
textStats -
contentsTable -
- Throws:
IOException
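The guard described for writeContentData (skip the row when metadata is null or the content is null or only whitespace) might look roughly like the sketch below. It is not the Tika implementation: the class ContentGuardSketch, the method writeIfPresent, its parameters, and the map standing in for the content table are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; this is not the Tika code.
final class ContentGuardSketch {

    /** True if s is null, empty, or consists only of whitespace that trim() removes. */
    static boolean isBlank(String s) {
        return s == null || s.trim().isEmpty();
    }

    /**
     * Writes fileId -> content into the stand-in table unless metadata is null
     * or the content is blank. Returns whether a row was written.
     */
    static boolean writeIfPresent(String fileId, Object metadata, String content,
                                  Map<String, String> contentsTable) {
        if (metadata == null || isBlank(content)) {
            return false; // no processing; the fileId is not entered
        }
        contentsTable.put(fileId, content);
        return true;
    }

    public static void main(String[] args) {
        Map<String, String> table = new HashMap<>();
        System.out.println(writeIfPresent("1", new Object(), "   ", table));  // false
        System.out.println(writeIfPresent("2", new Object(), "text", table)); // true
        System.out.println(table.keySet());                                   // [2]
    }
}
```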
-
closeWriter
- Throws:
IOException
-
getPathsFromExtractCrawl
protected org.apache.tika.eval.app.EvalFilePaths getPathsFromExtractCrawl(Metadata metadata, Path extracts)
- Parameters:
metadata -
extracts -
- Returns:
the EvalFilePaths for the files when crawling an extract directory
-
getPathsFromSrcCrawl
-
getSourceFileLength
-
getFileLength
-