Package org.apache.tika.eval.app
Class AbstractProfiler
java.lang.Object
org.apache.tika.batch.FileResourceConsumer
org.apache.tika.eval.app.AbstractProfiler
- All Implemented Interfaces:
Callable<IFileProcessorFutureResult>
- Direct Known Subclasses:
ExtractComparer, ExtractProfiler, FileProfiler
-
Nested Class Summary
Nested Classes
Modifier and Type | Class | Description
static enum | |
static enum | | If information was gathered from the log file about a parse error
-
Field Summary
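Fields
Modifier and Type | Field | Description
static final String | FALSE
protected static final AtomicInteger | ID
static TableInfo | MIME_TABLE
static TableInfo | REF_EXTRACT_EXCEPTION_TYPES
static TableInfo | REF_PARSE_ERROR_TYPES
static TableInfo | REF_PARSE_EXCEPTION_TYPES
static final String | TRUE
protected IDBWriter | writer
Fields inherited from class org.apache.tika.batch.FileResourceConsumer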
ELAPSED_MILLIS, IO_IS, IO_OS, OOM, PARSE_ERR, PARSE_EX, TIMED_OUT
-
Constructor Summary
Constructors
Constructor | Description
AbstractProfiler(ArrayBlockingQueue<FileResource> fileQueue, IDBWriter writer)
-
Method Summary
Modifier and Type | Method | Description
 | calcTextStats(ContentTags contentTags)
void | closeWriter
protected static ContentTags | getContent(org.apache.tika.eval.app.EvalFilePaths evalFilePaths, Metadata metadata)
protected long | getFileLength
protected org.apache.tika.eval.app.EvalFilePaths | getPathsFromExtractCrawl(Metadata metadata, Path extracts)
protected org.apache.tika.eval.app.EvalFilePaths | getPathsFromSrcCrawl(Metadata metadata, Path srcDir, Path extracts)
protected long | getSourceFileLength(org.apache.tika.eval.app.EvalFilePaths fps, List<Metadata> metadataList)
static void | loadCommonTokens(Path p, String defaultLangCode)
void | setMaxContentLength(int maxContentLength) | Truncate the content string to this length if it is longer
void | setMaxContentLengthForLangId(int maxContentLengthForLangId) | Truncate the content string to this length for lang id if it is longer
void | setMaxTokens(int maxTokens) | Add a LimitTokenCountFilterFactory if > -1
protected static String | truncateContent(ContentTags contentTags, int maxLength, Map<Cols, String> data) | Get the content and record in the data, under Cols.CONTENT_TRUNCATED_AT_MAX_LEN, whether the string was truncated
protected void | writeContentData(String fileId, Map<Class, Object> textStats, TableInfo contentsTable) | Checks to see if metadata is null or content is empty (null or only whitespace).
protected void | writeExceptionData(String fileId, Metadata m, TableInfo exceptionTable)
protected void | writeExtractException(TableInfo extractExceptionTable, String containerId, String filePath, ExtractReaderException.TYPE type)
protected void | writeProfileData(org.apache.tika.eval.app.EvalFilePaths fps, int i, ContentTags contentTags, Metadata m, String fileId, String containerId, List<Integer> numAttachments, TableInfo profileTable)
Methods inherited from class org.apache.tika.batch.FileResourceConsumer
call, checkForTimedOutMillis, close, flushAndClose, getCurrentFile, getNumHandledExceptions, getNumResourcesConsumed, getXMLifiedLogMsg, getXMLifiedLogMsg, incrementHandledExceptions, isStillActive, parse, pleaseShutdown, processFileResource
-
Field Details
-
TRUE
-
FALSE
-
ID
-
REF_EXTRACT_EXCEPTION_TYPES
-
REF_PARSE_ERROR_TYPES
-
REF_PARSE_EXCEPTION_TYPES
-
MIME_TABLE
-
writer
-
-
Constructor Details
-
AbstractProfiler
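The constructor takes an ArrayBlockingQueue and the class implements Callable, which suggests a queue-consumer pattern. The sketch below illustrates that general pattern only and does not use any Tika classes. QueueConsumerSketch, the String work items, the POISON end marker, and the one-second poll timeout are all hypothetical stand-ins, not part of AbstractProfiler or FileResourceConsumer.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class QueueConsumerSketch implements Callable<Integer> {
    // Hypothetical end-of-work marker; the real consumer may signal completion differently.
    static final String POISON = "__done__";

    private final ArrayBlockingQueue<String> queue;

    QueueConsumerSketch(ArrayBlockingQueue<String> queue) {
        this.queue = queue;
    }

    // Pulls items until the marker arrives or the queue stays empty for one second.
    @Override
    public Integer call() throws InterruptedException {
        int consumed = 0;
        while (true) {
            String item = queue.poll(1, TimeUnit.SECONDS);
            if (item == null || POISON.equals(item)) {
                return consumed;
            }
            consumed++; // a real consumer would process the item here
        }
    }

    // Fills a queue, runs one consumer on a worker thread, and returns its count.
    static int run(String... items) {
        ArrayBlockingQueue<String> q = new ArrayBlockingQueue<>(items.length + 1);
        for (String s : items) {
            q.add(s);
        }
        q.add(POISON);
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            return pool.submit(new QueueConsumerSketch(q)).get();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(run("a", "b", "c")); // prints 3
    }
}
```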
-
-
Method Details
-
loadCommonTokens
- Parameters:
p - path to the common_tokens directory. If this is null, try to load from the classpath
defaultLangCode - the language code to use if a common_words list doesn't exist for the detected language; can be null
- Throws:
IOException
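The fallback described for defaultLangCode can be sketched as a two-step lookup. This is an illustration under stated assumptions, not the method's actual implementation: CommonTokensSketch, tokensFor, and the in-memory map of language code to word set are hypothetical, and returning an empty set when nothing matches is an assumption made here.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class CommonTokensSketch {
    // Looks up the list for the detected language, then falls back to the default code.
    static Set<String> tokensFor(String detectedLang, String defaultLangCode,
                                 Map<String, Set<String>> lists) {
        Set<String> tokens = lists.get(detectedLang);
        if (tokens == null && defaultLangCode != null) {
            tokens = lists.get(defaultLangCode);
        }
        // Assumption: no list at all yields an empty set rather than an error.
        return tokens == null ? Collections.<String>emptySet() : tokens;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> lists = new HashMap<>();
        lists.put("en", new HashSet<>(Arrays.asList("the", "and", "of")));
        // "xx" has no list, so the "en" list is used instead.
        System.out.println(tokensFor("xx", "en", lists).contains("the"));
        // With a null default there is nothing to fall back to.
        System.out.println(tokensFor("xx", null, lists).isEmpty());
    }
}
```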
-
truncateContent
protected static String truncateContent(ContentTags contentTags, int maxLength, Map<Cols, String> data)
Get the content and record in the data, under Cols.CONTENT_TRUNCATED_AT_MAX_LEN, whether the string was truncated.
- Parameters:
contentTags -
maxLength -
data -
- Returns:
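As a rough illustration of the truncate-and-record behaviour described for truncateContent, the sketch below works on a plain String instead of ContentTags. TruncateSketch and its Col enum are hypothetical stand-ins for the real types, the "TRUE"/"FALSE" values are assumed (the values of the TRUE and FALSE fields are not shown on this page), and maxLength is assumed to be non-negative.

```java
import java.util.EnumMap;
import java.util.Map;

final class TruncateSketch {
    // Hypothetical stand-in for Cols; only the one column used here.
    enum Col { CONTENT_TRUNCATED_AT_MAX_LEN }

    // Returns content cut to maxLength and records whether a cut happened.
    static String truncate(String content, int maxLength, Map<Col, String> data) {
        if (content == null) {
            content = "";
        }
        boolean truncated = content.length() > maxLength;
        // Assumption: the flag is stored as the strings "TRUE" or "FALSE".
        data.put(Col.CONTENT_TRUNCATED_AT_MAX_LEN, truncated ? "TRUE" : "FALSE");
        return truncated ? content.substring(0, maxLength) : content;
    }

    public static void main(String[] args) {
        Map<Col, String> data = new EnumMap<>(Col.class);
        System.out.println(truncate("abcdef", 3, data)); // prints abc
        System.out.println(data.get(Col.CONTENT_TRUNCATED_AT_MAX_LEN)); // prints TRUE
    }
}
```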
-
getContent
protected static ContentTags getContent(org.apache.tika.eval.app.EvalFilePaths evalFilePaths, Metadata metadata) -
setMaxContentLength
public void setMaxContentLength(int maxContentLength)
Truncate the content string to this length if it is longer.
- Parameters:
maxContentLength -
-
setMaxContentLengthForLangId
public void setMaxContentLengthForLangId(int maxContentLengthForLangId)
Truncate the content string to this length for lang id if it is longer.
- Parameters:
maxContentLengthForLangId -
-
setMaxTokens
public void setMaxTokens(int maxTokens)
Add a LimitTokenCountFilterFactory if > -1
- Parameters:
maxTokens -
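The "if > -1" rule means a token limit is applied only when maxTokens is zero or greater. The sketch below shows that conditional limit on a whitespace tokenizer using plain Java streams; it is a stand-in for the idea and does not use Lucene's LimitTokenCountFilterFactory. MaxTokensSketch and tokenize are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class MaxTokensSketch {
    // Splits on whitespace and caps the token count only when maxTokens > -1.
    static List<String> tokenize(String text, int maxTokens) {
        Stream<String> tokens = Arrays.stream(text.trim().split("\\s+"))
                .filter(t -> !t.isEmpty());
        if (maxTokens > -1) {
            tokens = tokens.limit(maxTokens);
        }
        return tokens.collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(tokenize("a b c d", 2));  // prints [a, b]
        System.out.println(tokenize("a b c d", -1)); // prints [a, b, c, d]
    }
}
```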
-
writeExtractException
protected void writeExtractException(TableInfo extractExceptionTable, String containerId, String filePath, ExtractReaderException.TYPE type) throws IOException
- Throws:
IOException
-
writeProfileData
-
writeExceptionData
-
calcTextStats
-
writeContentData
protected void writeContentData(String fileId, Map<Class, Object> textStats, TableInfo contentsTable) throws IOException
Checks whether metadata is null or content is empty (null or only whitespace). If either is the case, no processing is done and the fileId is not entered into the content table.
- Parameters:
fileId -
textStats -
contentsTable -
- Throws:
IOException
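The guard described above can be sketched as follows. This is a simplified illustration, not the real method: ContentGuardSketch, its parameters, and the Map standing in for the content table are hypothetical, and the real method takes textStats and a TableInfo.

```java
import java.util.HashMap;
import java.util.Map;

final class ContentGuardSketch {
    // True when content is null or contains only whitespace.
    static boolean isEmpty(String content) {
        return content == null || content.trim().isEmpty();
    }

    // Writes a row only when metadata is present and content is non-empty.
    // Returns false when the row is skipped, so fileId never reaches the table.
    static boolean writeContent(String fileId, Object metadata, String content,
                                Map<String, String> contentsTable) {
        if (metadata == null || isEmpty(content)) {
            return false;
        }
        contentsTable.put(fileId, content);
        return true;
    }

    public static void main(String[] args) {
        Map<String, String> table = new HashMap<>();
        System.out.println(writeContent("1", new Object(), "  \n ", table)); // prints false
        System.out.println(writeContent("2", new Object(), "some text", table)); // prints true
        System.out.println(table.keySet()); // prints [2]
    }
}
```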
-
closeWriter
- Throws:
IOException
-
getPathsFromExtractCrawl
protected org.apache.tika.eval.app.EvalFilePaths getPathsFromExtractCrawl(Metadata metadata, Path extracts)
- Parameters:
metadata -
extracts -
- Returns:
EvalFilePaths for files if crawling an extract directory
-
getPathsFromSrcCrawl
-
getSourceFileLength
-
getFileLength
-