Class |
Description |
AbstractChunking |
Base class for file chunking.
|
AbstractConsumersBuilder |
|
AbstractConverter |
Base class for Tika Metadata to XMP converter which provides some needed common functionality.
|
AbstractDBParser |
Abstract class that handles iterating through tables within a database.
|
AbstractEmitter |
|
AbstractEncodingDetectorParser |
|
AbstractExternalProcessParser |
Abstract base class for parsers that call external processes.
|
AbstractFetcher |
|
AbstractFSConsumer |
|
AbstractImageParser |
|
AbstractListManager |
|
AbstractListManager.LevelTuple |
|
AbstractListManager.ParagraphLevelCounter |
|
AbstractMultipleParser |
Abstract base class for parser wrappers which may / will
process a given stream multiple times, merging the results
of the various parsers used.
|
AbstractMultipleParser.MetadataPolicy |
The various strategies for handling metadata emitted by
multiple parsers.
|
AbstractOfficeParser |
|
AbstractOOXMLExtractor |
Base class for all Tika OOXML extractors.
|
AbstractParser |
Abstract base class for new parsers.
|
AbstractProfiler |
|
AbstractProfiler.EXCEPTION_TYPE |
|
AbstractProfiler.PARSE_ERROR_TYPE |
If information was gathered from the log file about
a parse error
|
AbstractRecursiveParserWrapperHandler |
|
AbstractTranslator |
|
AbstractXML2003Parser |
|
AccessChecker |
Checks whether or not a document allows extraction generally
or extraction for accessibility only.
|
AccessPermissionException |
Exception to be thrown when a document does not allow content extraction.
|
AccessPermissions |
Until we can find a common standard, we'll use these options.
|
Activator |
|
AdapterHelper |
|
AdobeFontMetricParser |
Parser for AFM Font Files
|
AdvancedTypeDetector |
|
AgeRecogniser |
Parser for extracting features from text.
|
AgeRecogniserConfig |
Stores URL for AgePredictor
|
AlphaIdeographFilterFactory |
Factory for filter that only allows tokens with characters that "isAlphabetic" or "isIdeographic" through.
|
AlternativePackaging |
|
AmazonTranscribe |
|
AnalyzerManager |
|
AnnotationUtils |
This class contains utilities for dealing with tika annotations
|
AppleSingleFileParser |
Parser that strips the header off of AppleSingle and AppleDouble
files.
|
AppParserFactoryBuilder |
|
ArrayNumber |
The class is used to represent the number of the array.
|
AsyncConfig |
|
AsyncEmitter |
Worker thread that takes EmitData off the queue, batches it
and tries to emit it as a batch
|
AsyncProcessor |
This is the main class for handling async requests.
|
AsyncRequest |
|
AsyncResource |
|
AttributeDependantMetadataHandler |
This adds a Metadata entry for a given node.
|
AttributeMatcher |
Final evaluation state of a .../@* XPath expression.
|
AttributeMetadataHandler |
SAX event handler that maps the contents of an XML attribute into
a metadata field.
|
AudioFrame |
An Audio Frame in an MP3 file.
|
AudioParser |
|
AutoDetectParser |
|
AutoDetectParserConfig |
This config object can be used to tune how conservative we want to be
when parsing data that is extremely compressible and resembles a ZIP
bomb.
|
AutoDetectParserFactory |
Simple class for AutoDetectParser
|
AutoDetectParserFactory |
Factory for an AutoDetectParser
|
AutoDetectReader |
An input stream reader that automatically detects the character encoding
to be used for converting bytes to characters.
|
AutoDetectTransformer |
|
AZBlobEmitter |
Emit files to Azure blob storage.
|
AZBlobFetcher |
Fetches files from Azure blob storage.
|
AZBlobPipesIterator |
|
BasicContentHandlerFactory |
Basic factory for creating common types of ContentHandlers
|
BasicContentHandlerFactory.HANDLER_TYPE |
Common handler types for content.
|
BasicObject |
Base object for FSSHTTPB.
|
BasicTikaFSConsumer |
Basic FileResourceConsumer that reads files from an input
directory and writes content to the output directory.
|
BasicTikaFSConsumersBuilder |
|
BasicTokenCountStatsCalculator |
|
BatchNoRestartError |
FileResourceConsumers should throw this if something
catastrophic has happened and the BatchProcess should shutdown
and not be restarted.
|
BatchProcess |
This is the main processor class for a single process.
|
BatchProcess.BATCH_CONSTANTS |
|
BatchProcessBuilder |
Builds a BatchProcessor from a combination of runtime arguments and the
config file.
|
BatchProcessDriverCLI |
|
BatchTopCommonTokenCounter |
Utility class that runs TopCommonTokenCounter against a directory
of table files (named {lang}_table.gz or Leipzig-like afr_...-sentences.txt)
and outputs common tokens files for each input table file in the output directory.
|
BinaryItem |
|
Bit |
The class is used to read/set bit value for a byte array
|
BitConverter |
|
BitReader |
A class used to extract values across byte boundaries with arbitrary bit positions.
|
BitWriter |
|
BodyContentHandler |
Content handler decorator that only passes everything inside
the XHTML <body/> tag to the underlying handler.
|
BoilerpipeContentHandler |
Uses the boilerpipe
library to automatically extract the main content from a web page.
|
BouncyCastleDigester |
Digester that relies on BouncyCastle for MessageDigest implementations.
|
BoundedInputStream |
Very slight modification of Commons' BoundedInputStream
so that we can figure out if this hit the bound or not.
|
BPGParser |
Parser for the Better Portable Graphics (BPG) File Format.
|
BPListDetector |
Detector for BPList with utility functions for PList.
|
ByteDeleter |
|
ByteFlipper |
|
ByteInjector |
|
BytesRefCalculator<T> |
Interface for calculators that require a string
|
BytesRefCalculator.BytesRefCalcInstance<T> |
|
ByteUtil |
|
CachedTranslator |
CachedTranslator.
|
CallablePipesIterator |
This is a simple wrapper around PipesIterator
that allows it to be called in its own thread.
|
CantFuzzException |
|
CaptionObject |
A model for caption objects from graphics and texts, which typically includes a
human-readable sentence, the language of the sentence, and a confidence score.
|
Cell |
Cell of content.
|
CellDecorator |
Cell decorator.
|
CellID |
|
CellIDArray |
|
CellManifestCurrentRevision |
|
CellManifestDataElementData |
Cell manifest data element
|
CharsetDetector |
CharsetDetector provides a facility for detecting the
charset or encoding of character data in an unknown format.
|
CharsetMatch |
This class represents a charset that has been identified by a CharsetDetector
as a possible encoding for a set of input data.
|
CharsetUtils |
|
ChildMatcher |
Intermediate evaluation state of a .../*... XPath expression.
|
ChmAccessor<T> |
Defines an accessor interface
|
ChmAssert |
Contains chm extractor assertions
|
ChmBlockInfo |
A container that contains chm block information such as: i.
|
ChmCommons |
|
ChmCommons.EntryType |
Represents entry types: uncompressed, compressed
|
ChmCommons.IntelState |
Represents intel file states during decompression
|
ChmCommons.LzxState |
Represents lzx states: started decoding, not started decoding
|
ChmConstants |
|
ChmDirectoryListingSet |
Holds chm listing entries
|
ChmExtractor |
Extracts text from chm file.
|
ChmItsfHeader |
The Header:
0000: char[4] 'ITSF'
0004: DWORD 3 (Version number)
0008: DWORD Total header length, including header section table and following data.
|
ChmItspHeader |
Directory header. The directory starts with a header; its format is as follows:
0000: char[4] 'ITSP'
0004: DWORD Version number 1
0008: DWORD Length of the directory header
000C: DWORD $0a (unknown)
0010: DWORD $1000 Directory chunk size
0014: DWORD "Density" of quickref section, usually 2
0018: DWORD Depth of the index tree: 1 if there is no index, 2 if there is one level of PMGI chunks
001C: DWORD Chunk number of root index chunk, -1 if there is none (though at least one file has 0 despite there being no index chunk, probably a bug)
0020: DWORD Chunk number of first PMGL (listing) chunk
0024: DWORD Chunk number of last PMGL (listing) chunk
0028: DWORD -1 (unknown)
002C: DWORD Number of directory chunks (total)
0030: DWORD Windows language ID
0034: GUID {5D02926A-212E-11D0-9DF9-00A0C922E6EC}
0044: DWORD $54 (This is the length again)
0048: DWORD -1 (unknown)
004C: DWORD -1 (unknown)
0050: DWORD -1 (unknown)
|
ChmLzxBlock |
Decompresses a chm block.
|
ChmLzxcControlData |
::DataSpace/Storage//ControlData This file contains $20 bytes of
information on the compression.
|
ChmLzxcResetTable |
LZXC reset table For ensuring a decompression.
|
ChmLzxState |
|
ChmParser |
|
ChmParsingException |
|
ChmPmgiHeader |
Description. Note: does not always exist. An index chunk has the following format:
0000: char[4] 'PMGI'
0004: DWORD Length of quickref/free area at end of directory chunk
0008: Directory index entries (to quickref/free area)
The quickref area in a PMGI is the same as in a PMGL. The format of a directory index entry is as follows:
BYTE: length of name
BYTEs: name (UTF-8 encoded)
ENCINT: directory listing chunk which starts with name
Encoded Integers (aka ENCINT): an ENCINT is a variable-length integer.
|
ChmPmglHeader |
Description. There are two types of directory chunks: index chunks and
listing chunks.
|
ChmSection |
|
ChmWrapper |
|
ChunkingFactory |
This class is used to create instance of AbstractChunking.
|
ChunkingMethod |
|
CJKBigramAwareLengthFilterFactory |
Creates a very narrowly focused TokenFilter that limits tokens based on length
_unless_ they've been identified as <DOUBLE> or <SINGLE>
by the CJKBigramFilter.
|
ClassLoaderUtil |
|
ClassParser |
Parser for Java .class files.
|
CleanPhoneText |
Class to help de-obfuscate phone numbers in text.
|
ClearByMimeMetadataFilter |
This class clears the entire metadata object if the
mime matches the mime filter.
|
ClimateForcast |
|
ColInfo |
|
Cols |
|
CommandLineParserBuilder |
Reads configurable options from a config file and returns org.apache.commons.cli.Options
object to be used in commandline parser.
|
CommonsDigester |
|
CommonsDigester.DigestAlgorithm |
|
CommonsDigesterFactory |
Simple factory for CommonsDigester with
default markLimit = 1000000 and md5 digester.
|
CommonTokenCountManager |
|
CommonTokenOverlapCounter |
|
CommonTokenResult |
|
CommonTokens |
|
CommonTokensBhattacharyya |
|
CommonTokensCosine |
|
CommonTokensHellinger |
|
CommonTokensKLDivergence |
|
CommonTokensKLDNormed |
|
Compact64bitInt |
A 9-byte encoding of values in the range 0x0002000000000000 through 0xFFFFFFFFFFFFFFFF
|
CompactID |
This class is used to represent the CompactID structure.
|
CompareUtils |
|
CompositeDetector |
Content type detector that combines multiple different detection mechanisms.
|
CompositeDigester |
|
CompositeEncodingDetector |
|
CompositeExternalParser |
A Composite Parser that wraps up all the available External Parsers,
and provides an easy way to access them.
|
CompositeMatcher |
Composite XPath evaluation state.
|
CompositeMetadataFilter |
|
CompositeParseContextConfig |
|
CompositeParser |
Composite parser that delegates parsing tasks to a component parser
based on the declared content type of the incoming document.
|
CompositePipesReporter |
|
CompositeRenderer |
|
CompositeTagHandler |
Takes an array of ID3Tags in preference order, and when asked for
a given tag, will return it from the first ID3Tags that has it.
|
CompositeTextStatsCalculator |
|
CompressorConstants |
|
CompressorParser |
Parser for various compression formats.
|
CompressorParserOptions |
|
ConcurrentUtils |
Utility Class for Concurrency in Tika
|
ConfigBase |
|
ConfigurableThreadPoolExecutor |
Allows Thread Pool to be Configurable.
|
ConsumersManager |
Simple interface around a collection of consumers that allows
for initializing and shutting shared resources (e.g.
|
ContainerExtractor |
Tika container extractor interface.
|
ContentHandlerDecorator |
|
ContentHandlerDecoratorFactory |
|
ContentHandlerExample |
Examples of using different Content Handlers to
get different parts of the file's contents
|
ContentHandlerFactory |
Interface to allow easier injection of code for getting a new ContentHandler
|
ContentLengthCalculator |
|
ContentTagParser |
|
ContentTags |
|
ContrastStatistics |
|
CoreNLPNERecogniser |
This class offers an implementation of NERecogniser based on
CRF classifiers from Stanford CoreNLP.
|
CorruptedFileException |
This exception should be thrown when the parse absolutely, positively has to stop.
|
CreativeCommons |
A collection of Creative Commons property names.
|
CryptoParser |
Decrypts the incoming document stream and delegates further parsing to
another parser instance.
|
CSVMessageBodyWriter |
|
CSVParams |
|
CSVPipesIterator |
Iterates through a UTF-8 CSV file.
|
CSVResult |
|
CTAKESAnnotationProperty |
This enumeration includes the properties that an IdentifiedAnnotation object can provide.
|
CTAKESConfig |
|
CTAKESContentHandler |
Class used to extract biomedical information while parsing.
|
CTAKESParser |
CTAKESParser decorates a Parser and leverages
CTAKESContentHandler to extract biomedical information from
clinical text using Apache cTAKES.
|
CTAKESSerializer |
Enumeration for types of cTAKES (UIMA) CAS serializer supported by cTAKES.
|
CTAKESUtils |
This class provides methods to extract biomedical information from plain text
using CTAKESContentHandler that relies on Apache cTAKES.
|
CustomMimeInfo |
|
Database |
|
DataElement |
|
DataElementData |
Base class of data element
|
DataElementHash |
Specifies a data element hash stream object
|
DataElementPackage |
|
DataElementParseErrorException |
|
DataElementType |
The enumeration of the data element type
|
DataElementUtils |
|
DataHashObject |
|
DataNodeObjectData |
Data Node Object data
|
DataSizeObject |
Data Size Object
|
DataURIScheme |
|
DataURISchemeParseException |
|
DataURISchemeUtil |
Not thread safe.
|
DateNormalizingMetadataFilter |
Some dates in some file formats do not have a timezone.
|
DateUtils |
Date related utility methods and constants
|
DBBuffer |
|
DBConsumersManager |
|
DBFParser |
This is a Tika wrapper around the DBFReader.
|
DBWriter |
This is still in its early stages.
|
DcXMLParser |
Dublin Core metadata parser
|
DefaultContentHandlerFactoryBuilder |
Builds BasicContentHandler with type defined by attribute "basicHandlerType"
with possible values: xml, html, text, body, ignore.
|
DefaultDetector |
|
DefaultEmbeddedStreamTranslator |
Loads EmbeddedStreamTranslators via service loading.
|
DefaultEncodingDetector |
|
DefaultHtmlMapper |
The default HTML mapping rules in Tika.
|
DefaultInputStreamFactory |
Passthrough -- returns InputStream as is
|
DefaultMetadataFilter |
|
DefaultParser |
|
DefaultProbDetector |
A version of DefaultDetector for probabilistic mime
detectors, which use statistical techniques to blend the
results of differing underlying detectors when attempting
to detect the type of a given file.
|
DefaultTranslator |
|
DefaultZipContainerDetector |
|
DelegatingParser |
Base class for parser implementations that want to delegate parts of the
task of parsing an input document to another parser.
|
DeprecatedStreamingZipContainerDetector |
|
DeprecatedZipContainerDetector |
A detector that works on Zip documents and tries to figure out
basic types -- epub, jar, ear, war, kmz and StarOffice
|
DescribeMetadata |
Print the supported Tika Metadata models and their fields.
|
Detector |
Content type detector.
|
DetectorResource |
|
DGN8Parser |
This is a VERY LIMITED parser.
|
DIFContentHandler |
|
DIFContentHandler |
|
DIFParser |
|
DigestingAutoDetectParserFactory |
|
DigestingParser |
|
DigestingParser.Digester |
Interface for digester.
|
DigestingParser.DigesterFactory |
|
DigestingParser.Encoder |
Encodes byte array from a MessageDigest to String
|
DirectoryListingEntry |
The format of a directory listing entry is as follows:
BYTE: length of name
BYTEs: name (UTF-8 encoded)
ENCINT: content section
ENCINT: offset
ENCINT: length
The offset is from the beginning of the content section the file is in, after the section has been decompressed (if appropriate).
|
DirListParser |
Parses the output of /bin/ls and counts the number of files and the number of
executables using Tika.
|
DisplayMetInstance |
Grabs a PDF file from a URL and prints its Metadata
|
DL4JInceptionV3Net |
|
DL4JVGG16Net |
|
DocumentSelector |
Interface for different document selection strategies for purposes like
embedded document extraction by a ContainerExtractor instance.
|
DocumentSelectorConfig |
|
DublinCore |
A collection of Dublin Core metadata names.
|
DumpTikaConfigExample |
This class shows how to dump a TikaConfig object to a configuration file.
|
DurationFormatUtils |
Functionality and naming conventions (roughly) copied from org.apache.commons.lang3
so that we didn't have to add another dependency.
|
DWGParser |
DWG (CAD Drawing) parser.
|
EightBytesOfData |
This class is used to represent a property that contains 8 bytes of data in the PropertySet.rgData stream field.
|
ElementMappingContentHandler |
Content handler decorator that maps element QNames using
a Map.
|
ElementMappingContentHandler.TargetElement |
|
ElementMatcher |
Final evaluation state of an XPath expression that targets an element.
|
ElementMetadataHandler |
SAX event handler that maps the contents of an XML element into
a metadata field.
|
EmbeddedContentHandler |
|
EmbeddedDocumentExtractor |
|
EmbeddedDocumentExtractorFactory |
|
EmbeddedDocumentUtil |
Utility class to handle common issues with embedded documents.
|
EmbeddedResourceHandler |
Tika container extractor callback interface.
|
EmbeddedStreamTranslator |
Interface for different filtering of embedded streams.
|
Embedder |
Tika embedder interface
|
EMFParser |
Extracts files embedded in EMF and offers a
very rough capability to extract text if there
is text stored in the EMF.
|
EmitData |
|
EmitKey |
|
Emitter |
|
EmitterManager |
Utility class that will apply the appropriate fetcher
to the fetcherString based on the prefix.
|
EmptyDetector |
Dummy detector that returns application/octet-stream for all documents.
|
EmptyEmitter |
|
EmptyFetcher |
|
EmptyParser |
Dummy parser that always produces an empty XHTML document without even
attempting to parse the given document stream.
|
EmptyTranslator |
Dummy translator that always declines to give any text.
|
EncodingDetector |
Character encoding detector.
|
EncryptedDocumentException |
|
EncryptedPrescriptionDetector |
|
EncryptedPrescriptionParser |
|
EndDocumentShieldingContentHandler |
|
EndianUtils |
General Endian Related Utilities.
|
EndianUtils.BufferUnderrunException |
|
EnviHeaderParser |
|
EpubContentParser |
Parser for EPUB OPS *.html files.
|
EpubParser |
Epub parser
|
Error |
|
ErrorParser |
Dummy parser that always throws a TikaException without even
attempting to parse the given document stream.
|
EvalConsumerBuilder |
|
EvalConsumersBuilder |
|
EvalExceptionUtils |
|
EvilCOSWriter |
|
ExcelExtractor |
Excel parser implementation which uses POI's Event API
to handle the contents of a Workbook.
|
ExceptionUtils |
|
ExcludeFieldMetadataFilter |
|
ExecutableParser |
Parser for executable files.
|
ExGuid |
|
ExGUIDArray |
|
ExpandedTitleContentHandler |
|
ExtendedGUID |
|
ExternalEmbedder |
Embedder that uses an external program (like sed or exiftool) to embed text
content and metadata into a given document.
|
ExternalParser |
Parser that uses an external program (like catdoc or pdf2txt) to extract
text content and metadata from a given document.
|
ExternalParser |
This is a next generation external parser that uses some of the more
recent additions to Tika.
|
ExternalParser.LineConsumer |
Consumer contract
|
ExternalParsersConfigReader |
Builds up ExternalParser instances based on XML file(s)
which define what to run, for what, and how to process
any output metadata.
|
ExternalParsersConfigReaderMetKeys |
|
ExternalParsersFactory |
Creates instances of ExternalParser based on XML
configuration files.
|
ExternalProcess |
|
ExternalTranslator |
Abstract class used to interact with command line/external Translators.
|
ExtractComparer |
|
ExtractComparerBuilder |
|
ExtractEmbeddedFiles |
|
ExtractProfiler |
|
ExtractProfilerBuilder |
|
ExtractReader |
|
ExtractReader.ALTER_METADATA_LIST |
|
ExtractReaderException |
Exception when trying to read extract
|
ExtractReaderException.TYPE |
|
FallbackParser |
Tries multiple parsers in turn, until one succeeds.
|
FeedParser |
Feed parser.
|
FetchEmitTuple |
|
FetchEmitTuple.ON_PARSE_EXCEPTION |
|
Fetcher |
Interface for an object that will fetch an InputStream given
a fetch string.
|
FetcherManager |
Utility class to hold multiple fetchers.
|
FetcherStreamFactory |
This class looks for "fetcherName" in the http header.
|
FetcherStringException |
If something goes wrong in parsing the fetcher string
|
FetchKey |
Pair of fetcherName (which fetcher to call) and the key
to send to that fetcher to retrieve a specific file.
|
FictionBookParser |
|
Field |
Field annotation is a contract for binding Param value from
Tika Configuration to an object.
|
FieldNameMappingFilter |
|
FileCommandDetector |
This runs the Linux 'file' command against a file.
|
FileListPipesIterator |
Reads a list of file names/relative paths from a UTF-8 file.
|
FilenameUtils |
|
FileProcessResult |
|
FileProfiler |
This class profiles actual files as opposed to extracts e.g.
|
FileProfilerBuilder |
|
FileResource |
This is a basic interface to handle a logical "file".
|
FileResourceConsumer |
This is a base class for file consumers.
|
FileResourceCrawler |
|
FileSystemEmitter |
Emitter to write to a file system.
|
FileSystemFetcher |
|
FileSystemPipesIterator |
|
FileSystemStatusReporter |
This is intended to write summary statistics to disk
periodically.
|
FileTooLongException |
|
FlatOpenDocumentParser |
|
FLVParser |
Parser for metadata contained in Flash Videos (.flv).
|
Font |
|
ForkParser |
|
ForkProxy |
|
ForkResource |
|
FormattingUtils |
|
FormattingUtils.Tag |
|
FourBytesOfData |
This class is used to represent a property that contains 4 bytes of data in the PropertySet.rgData stream field.
|
FrictionlessPackageDetector |
|
FSBatchProcessCLI |
|
FSConsumersManager |
|
FSCrawlerBuilder |
Builds either an FSDirectoryCrawler or an FSListCrawler.
|
FSDirectoryCrawler |
|
FSDirectoryCrawler.CRAWL_ORDER |
|
FSDocumentSelector |
Selector that chooses files based on their file name
and their size, as determined by TikaCoreProperties.RESOURCE_NAME_KEY and Metadata.CONTENT_LENGTH.
|
FSFileResource |
FileSystem(FS)Resource wraps a file name.
|
FSListCrawler |
Class that "crawls" a list of files.
|
FSOutputStreamFactory |
|
FSOutputStreamFactory.COMPRESSION |
|
FSProperties |
|
FSUtil |
Utility class to handle some common issues when
reading from and writing to a file system (FS).
|
FSUtil.HANDLE_EXISTING |
|
FuzzingCLI |
|
FuzzingCLIConfig |
|
FuzzOne |
Forked process that runs against a single input file
|
GCSEmitter |
|
GCSFetcher |
Fetches files from Google Cloud Storage.
|
GCSPipesIterator |
|
GDALParser |
|
GeneralTransformer |
|
GenericConverter |
Tries to convert as many of the properties in the Metadata map as possible to XMP namespaces.
|
GeoGazetteerClient |
|
Geographic |
Geographic schema.
|
GeographicInformationParser |
|
GeoParser |
|
GeoParserConfig |
|
GeoTag |
|
GlobalIdTableEntry3FNDX |
|
GlobalIdTableEntryFNDX |
|
GoogleTranslator |
|
GrabPhoneNumbersExample |
|
GribParser |
|
GrobidNERecogniser |
|
GrobidRESTParser |
|
GUID |
|
GuidUtil |
|
H2Util |
|
HandlerConfig |
|
HandlerConfig.PARSE_MODE |
|
HDFParser |
|
HeaderCell |
|
HeifParser |
|
HexCoDec |
A set of Hex encoding and decoding utility methods.
|
HSLFExtractor |
|
HTML |
|
HtmlEncodingDetector |
Character encoding detector for determining the character encoding of a
HTML document based on the potential charset parameter found in a
Content-Type http-equiv meta tag somewhere near the beginning.
|
HTMLHelper |
Helps produce user facing HTML output.
|
HtmlMapper |
HTML mapper used to make incoming HTML documents easier to handle by
Tika clients.
|
HtmlParser |
HTML parser.
|
HttpClientFactory |
This holds quite a bit of state and is not thread safe.
|
HttpClientUtil |
|
HttpFetcher |
Based on Apache httpclient
|
HttpHeaders |
A collection of HTTP header names.
|
HttpParser |
|
HwpStreamReader |
|
HwpTextExtractorV5 |
|
HwpV5Parser |
|
ICNSParser |
A basic parser class for Apple ICNS icon files
|
IContentHandlerFactoryBuilder |
|
ICrawlerBuilder |
|
Icu4jEncodingDetector |
|
ID3Tags |
Interface that defines the common interface for ID3 tag parsers,
such as ID3v1 and ID3v2.3.
|
ID3Tags.ID3Comment |
Represents a comment in ID3 (especially ID3 v2), which is
made up of several parts
|
ID3v1Handler |
This is used to parse ID3 Version 1 Tag information from an MP3 file,
if available.
|
ID3v22Handler |
This is used to parse ID3 Version 2.2 Tag information from an MP3 file,
if available.
|
ID3v23Handler |
This is used to parse ID3 Version 2.3 Tag information from an MP3 file,
if available.
|
ID3v24Handler |
This is used to parse ID3 Version 2.4 Tag information from an MP3 file,
if available.
|
ID3v2Frame |
A frame of ID3v2 data, which is then passed to a handler to
be turned into useful data.
|
ID3v2Frame.RawTag |
|
ID3v2Frame.TextEncoding |
|
IDBWriter |
|
IdentityHtmlMapper |
Alternative HTML mapping rules that pass the input HTML as-is without any
modifications.
|
IDMLParser |
Adobe InDesign IDML Parser.
|
IFileProcessorFutureResult |
Stub interface to allow for different result types from different processors
|
IFSSHTTPBSerializable |
FSSHTTPB Serialize interface.
|
ImageDeskew |
|
ImageDeskew.HoughLine |
|
ImageGraphicsEngine |
Copied nearly verbatim from PDFBox
|
ImageGraphicsEngineFactory |
|
ImageMetadataExtractor |
Uses the Metadata Extractor library
to read EXIF and IPTC image metadata and map to Tika fields.
|
ImageParser |
|
ImageUtil |
|
ImportContextImpl |
ImportContextImpl ...
|
IncludeFieldMetadataFilter |
|
Initializable |
Components that must do special processing across multiple fields
at initialization time should implement this interface.
|
InitializableProblemHandler |
This is to be used to handle potential recoverable problems that
might arise during initialization.
|
InputStreamDigester |
|
InputStreamFactory |
A factory which returns a fresh InputStream for the same
resource each time.
|
InputStreamFactory |
Interface to allow for custom/consistent creation of InputStream
|
IntermediateNodeObject |
|
IntermediateNodeObject.RootNodeObjectBuilder |
The class is used to build a root node object.
|
InterruptableParsingExample |
This example demonstrates how to interrupt document parsing if
some condition is met.
|
Interrupter |
Class that waits for input on System.in.
|
InterrupterBuilder |
Builds an Interrupter
|
InterrupterFutureResult |
|
IOUtils |
|
IPADetector |
|
IParserFactoryBuilder |
|
IProperty |
The interface of the property in OneNote file.
|
IPTC |
IPTC photo metadata schema.
|
IptcAnpaParser |
Parser for IPTC ANPA News Wire Feeds
|
ISArchiveParser |
|
ISATabUtils |
|
ITikaToXMPConverter |
Interface for the specific Metadata to XMP converters
|
IWork13PackageParser |
|
IWork13PackageParser.IWork13DocumentType |
|
IWork18PackageParser |
For now, this parser isn't even registered.
|
IWork18PackageParser.IWork18DocumentType |
|
IWorkDetector |
|
IWorkPackageParser |
A parser for the IWork container files.
|
IWorkPackageParser.IWORKDocumentType |
|
JackcessParser |
Parser that handles Microsoft Access files via
Jackcess
|
JarDetector |
|
JCID |
This class is used to represent a JCID
|
JCIDObject |
This class is used to represent the JCID object.
|
JDBCEmitter |
This is only an initial, basic implementation of an emitter for JDBC.
|
JDBCEmitter.AttachmentStrategy |
|
JDBCPipesIterator |
Iterates through the results from a SQL call via JDBC.
|
JDBCTableReader |
General base class to iterate through rows of a JDBC table
|
JDBCUtil |
|
JDBCUtil.CREATE_TABLE |
|
JempboxExtractor |
|
JoshuaNetworkTranslator |
This translator is designed to work with a TCP-IP available
Joshua translation server, specifically the
REST-based Joshua server.
|
JournalParser |
|
JpegParser |
|
JsonEmitData |
|
JsonFetchEmitTuple |
|
JsonFetchEmitTupleList |
|
JSONMessageBodyWriter |
|
JsonMetadata |
|
JsonMetadataDeserializer |
|
JsonMetadataList |
|
JsonMetadataSerializer |
|
JSONObjWriter |
|
JsonResponse |
|
JsonResponse |
|
JsonStreamingSerializer |
|
JXLParser |
Tries to scrape XMP out of JXL
|
KafkaEmitter |
Emits the now-parsed documents into a specified Apache Kafka topic.
|
KafkaPipesIterator |
|
KMZDetector |
|
LangModel |
|
Language |
|
LanguageAwareTokenCountStats<T> |
Interface for calculators that require language probabilities and token stats
|
LanguageConfidence |
|
LanguageDetectingParser |
|
LanguageDetector |
|
LanguageDetectorExample |
|
LanguageDetectorTest |
|
LanguageHandler |
SAX content handler that updates a language detector based on all the
received character content.
|
LanguageIdentifier |
Identifier of the language that best matches a given content profile.
|
LanguageIDWrapper |
|
LanguageNames |
Support for language tags (as defined by https://tools.ietf.org/html/bcp47)
|
LanguageProfile |
Language profile based on ngram counts.
|
LanguageProfilerBuilder |
This class runs an ngram analysis over submitted text; the results might be used
for automatic language identification.
|
LanguageResource |
|
LanguageResult |
|
LanguageWriter |
Writer that builds a language profile based on all the written content.
|
Latin1StringsParser |
Parser to extract printable Latin1 strings from arbitrary files with pure java
without running any external process.
|
LeafNodeObject |
|
LeafNodeObject.IntermediateNodeObjectBuilder |
The class is used to build an intermediate node object.
|
LeipzigHelper |
|
LeipzigSampler |
|
Lingo24LangDetector |
|
Lingo24Translator |
|
Link |
|
LinkContentHandler |
Content handler that collects links from an XHTML document.
|
LinkedCell |
Linked cell.
|
ListDescriptor |
Contains the information for a single list in the list or list override tables.
|
ListManager |
Computes the number text which goes at the beginning of each list paragraph
|
LittleEndianBitConverter |
Implement a converter which converts to/from little-endian byte arrays
|
LoadErrorHandler |
Interface for error handling strategies in service class loading.
|
Location |
|
LookaheadInputStream |
Stream wrapper that makes it easy to read up to n bytes ahead from
a stream that supports the mark feature.
|
LuceneIndexer |
|
LuceneIndexerExtended |
|
LyricsHandler |
This is used to parse Lyrics3 tag information
from an MP3 file, if available.
|
MachineMetadata |
Metadata for describing machines, such as their
architecture, type and endian-ness
|
MachineMetadata.Endian |
|
MagicDetector |
Content type detection based on magic bytes, i.e.
|
MailDateParser |
|
MailUtil |
|
MappedBufferCleaner |
Copied/pasted from the Apache Lucene/Solr project.
|
MarianTranslator |
Translator that uses the Marian NMT decoder for translation.
|
MarianTranslator.MarianServerClient |
Internal Client for marian-server Web Socket Server.
|
Matcher |
XPath element matcher.
|
MatchingContentHandler |
Content handler decorator that only passes the elements, attributes,
and text nodes that match the given XPath expression.
|
MatParser |
|
MboxParser |
Mbox (mailbox) parser.
|
MediaType |
Internet media type.
|
MediaTypeExample |
|
MediaTypeRegistry |
Registry of known Internet media types.
|
Message |
A collection of Message related property names.
|
Metadata |
A multi-valued metadata container.
|
MetadataAwareLuceneIndexer |
Builds on the LuceneIndexer from Chapter 5 and adds indexing of Metadata.
|
MetadataExtractor |
OOXML metadata extractor.
|
MetadataFields |
Knows about all declared Metadata fields.
|
MetadataFilter |
Filters the metadata in place after the parse
|
MetadataHandler |
Deprecated.
|
MetadataList |
Wrapper class to make isWriteable in MetadataListMBW simpler
|
MetadataListMessageBodyWriter |
|
MetadataResource |
|
MetadataWriteFilter |
|
MetadataWriteFilterFactory |
|
MicrosoftTranslator |
Wrapper class to access the Windows translation service.
|
MidiParser |
|
MIFContentHandler |
Content handler for MIF Content and Metadata.
|
MIFExtractor |
Helper Class to Parse and Extract Adobe MIF Files.
|
MIFParser |
|
MimeBuffer |
|
MimeType |
Internet media type.
|
MimeTypeException |
A class to encapsulate MimeType related exceptions.
|
MimeTypes |
This class is a MimeType repository.
|
MimeTypesFactory |
Creates instances of MimeTypes.
|
MimeTypesReader |
A reader for XML files compliant with the freedesktop MIME-info DTD.
|
MimeTypesReaderMetKeys |
|
MiscOLEDetector |
A detector that works on a POIFS OLE2 document
to figure out exactly what the file is.
|
MITIENERecogniser |
This class offers an implementation of NERecogniser based on
trained models using state-of-the-art information extraction tools.
|
MosesTranslator |
Translator that uses the Moses decoder for translation.
|
MP3Frame |
A frame in an MP3 file, such as ID3v2 Tags or some
audio.
|
Mp3Parser |
The Mp3Parser is used to parse ID3 Version 1 Tag information
from an MP3 file, if available.
|
Mp3Parser.ID3TagsAndAudio |
|
MP4Parser |
Parser for the MP4 media container format, as well as the older
QuickTime format that MP4 is based on.
|
MSEmbeddedStreamTranslator |
|
MSOfficeBinaryConverter |
Tika to XMP mapping for the binary MS formats Word (.doc), Excel (.xls) and PowerPoint (.ppt).
|
MSOfficeXMLConverter |
Tika to XMP mapping for the Office Open XML formats Word (.docx), Excel (.xlsx) and PowerPoint
(.pptx).
|
MSOneStorePackage |
|
MSOneStoreParser |
|
MSOwnerFileParser |
Parser for temporary MSOffice files.
|
MuPDFRenderer |
|
MyFirstTika |
Demonstrates how to call the different components within Tika: its
Detector framework (aka MIME identification and repository), its
Parser interface, its org.apache.tika.language.LanguageIdentifier and other goodies.
|
NamedAttributeMatcher |
Final evaluation state of a .../@name XPath expression.
|
NamedElementMatcher |
Intermediate evaluation state of a .../name... XPath
expression.
|
NamedEntityParser |
This implementation of Parser extracts
entity names from text content and adds them to the metadata.
|
NameDetector |
Content type detection based on the resource name.
|
NameEntityExtractor |
|
Namespace |
Utility class to hold namespace information.
|
NERecogniser |
Defines a contract for named entity recogniser.
|
NetCDFParser |
|
NetworkParser |
|
NLTKNERecogniser |
This class offers an implementation of NERecogniser based on
ne_chunk() module of NLTK.
|
NNExampleModelDetector |
|
NNTrainedModel |
|
NNTrainedModelBuilder |
|
NoData |
This class is used to represent a property that contains no data.
|
NodeMatcher |
Final evaluation state of a .../node() XPath expression.
|
NodeObject |
|
NonDetectingEncodingDetector |
Always returns the charset passed in via the initializer
|
NoOpFilter |
This filter performs no operations on the metadata
and leaves it untouched.
|
NoTextPDFRenderer |
This class extends the PDFRenderer to exclude rendering of electronic text.
|
NSNormalizerContentHandler |
Content handler decorator that:
Maps old OpenOffice 1.0 Namespaces to the OpenDocument ones
Returns a fake DTD when parser requests OpenOffice DTD
|
NumberCell |
Number cell.
|
ObjectFromDOMAndQueueBuilder<T> |
|
ObjectFromDOMBuilder<T> |
Interface for things that build objects from a DOM Node and a map of runtime attributes
|
ObjectGroupData |
The ObjectGroupData class.
|
ObjectGroupDataElementData |
|
ObjectGroupDataElementData.Builder |
The internal class for building a list of DataElement from a node object.
|
ObjectGroupDeclarations |
Object Group Declarations
|
ObjectGroupMetadata |
Specifies an object group metadata
|
ObjectGroupMetadataDeclarations |
Object Metadata Declaration
|
ObjectGroupObjectBLOBDataDeclaration |
object data BLOB declaration
|
ObjectGroupObjectData |
|
ObjectGroupObjectDataBLOBReference |
object data BLOB reference
|
ObjectGroupObjectDeclare |
|
ObjectRecogniser |
|
ObjectRecognitionParser |
This parser recognises objects from Images.
|
ObjectSpaceObjectPropSet |
This class is used to represent an ObjectSpaceObjectPropSet.
|
ObjectSpaceObjectPropSet |
|
ObjectSpaceObjectStreamHeader |
|
ObjectSpaceObjectStreamOfContextIDs |
This class is used to represent an ObjectSpaceObjectStreamOfContextIDs.
|
ObjectSpaceObjectStreamOfOIDs |
This class is used to represent an ObjectSpaceObjectStreamOfOIDs.
|
ObjectSpaceObjectStreamOfOSIDs |
This class is used to represent an ObjectSpaceObjectStreamOfOSIDs.
|
OfferLargerThanQueueSize |
|
Office |
Office Document properties collection.
|
OfficeOpenXMLCore |
Core properties as defined in the Office Open XML specification part Two that are not
in the DublinCore namespace.
|
OfficeOpenXMLExtended |
Extended properties as defined in the Office Open XML specification part Four.
|
OfficeParser |
Defines a Microsoft document content extractor.
|
OfficeParser.POIFSDocumentType |
|
OfficeParserConfig |
|
OfflineContentHandler |
|
OldExcelParser |
A POI-powered Tika Parser for very old versions of Excel, from
pre-OLE2 days, such as Excel 4.
|
OneByteOfData |
This class is used to represent a property that contains 1 byte of data in the PropertySet.rgData stream field.
|
OneNoteParser |
OneNote tika parser capable of parsing Microsoft OneNote files.
|
OneNotePropertyEnum |
|
OneNoteTreeWalkerOptions |
Options when walking the one note tree.
|
OOXMLExtractor |
Interface implemented by all Tika OOXML extractors.
|
OOXMLExtractorFactory |
Figures out the correct OOXMLExtractor for the supplied document and
returns it.
|
OOXMLParser |
Office Open XML (OOXML) parser.
|
OOXMLTikaBodyPartHandler |
|
OOXMLWordAndPowerPointTextHandler |
This class is intended to handle anything that might contain IBodyElements:
main document, headers, footers, notes, slides, etc.
|
OOXMLWordAndPowerPointTextHandler.EditType |
|
OOXMLWordAndPowerPointTextHandler.XWPFBodyContentsHandler |
|
OPCPackageDetector |
|
OPCPackageWrapper |
This is a wrapper around OPCPackage that calls revert() instead of close().
|
OpenDocumentContentParser |
Parser for ODF content.xml files.
|
OpenDocumentConverter |
Tika to XMP mapping for the Open Document formats: Text (.odt), Spreadsheet (.ods), Graphics
(.odg) and Presentation (.odp).
|
OpenDocumentDetector |
|
OpenDocumentMetaParser |
Parser for OpenDocument meta.xml files.
|
OpenDocumentParser |
OpenOffice parser
|
OpenNLPDetector |
This is based on OpenNLP's language detector.
|
OpenNLPMetadataFilter |
|
OpenNLPNameFinder |
An implementation of NERecogniser that finds names in text using Open NLP Model.
|
OpenNLPNERecogniser |
|
OpenSearchClient |
|
OpenSearchClient |
|
OpenSearchEmitter |
|
OpenSearchEmitter.AttachmentStrategy |
|
OpenSearchEmitter.UpdateStrategy |
|
OpenSearchPipesReporter |
As of the 2.5.0 release, this is an ALPHA version.
|
OptimaizeLangDetector |
Implementation of the LanguageDetector API that uses
https://github.com/optimaize/language-detector
|
OptimaizeMetadataFilter |
|
OutlookExtractor |
Outlook Message Parser.
|
OutlookExtractor.RECIPIENT_TYPE |
|
OutlookPSTParser |
Parser for MS Outlook PST email storage files
|
OutputStreamFactory |
|
OverrideDetector |
|
PackageConstants |
|
PackageParser |
Parser for various packaging formats.
|
PageBasedRenderResults |
|
PagedText |
XMP Paged-text schema.
|
PageRangeRequest |
The range of pages to render.
|
ParagraphProperties |
|
ParallelFileProcessingResult |
|
Param<T> |
This is a serializable model class for parameters from configuration file.
|
ParamField |
This class stores metadata for Field annotations, which is used to map them
to Param at runtime
|
ParseContext |
Parse context.
|
ParseContextConfig |
Implementations must be thread-safe!
|
Parser |
Tika parser interface.
|
ParserContainerExtractor |
|
ParserDecorator |
Decorator base class for the Parser interface.
|
ParseRecord |
Use this class to store exceptions, warnings and other information
during the parse.
|
ParserFactory |
|
ParserFactory |
|
ParserFactoryBuilder |
|
ParserFactoryFactory |
Lightweight, easily serializable class that contains enough information
to build a ParserFactory
|
ParserPostProcessor |
Parser decorator that post-processes the results from a decorated parser.
|
ParserUtils |
Helper util methods for Parsers themselves.
|
ParsingEmbeddedDocumentExtractor |
Helper class for parsers of package archives or other compound document
formats that support embedded or attached component documents.
|
ParsingEmbeddedDocumentExtractorFactory |
|
ParsingExample |
|
ParsingReader |
Reader for the text content from a given binary stream.
|
PasswordProvider |
Interface for providing a password to a Parser for handling Encrypted
and Password Protected Documents.
|
PasswordProviderConfig |
|
PDDocumentRenderer |
stub interface for the PDFParser to use to figure out if it needs
to pass on the PDDocument or create a temp file to be used
by a file-based renderer down the road.
|
PDF |
PDF properties collection.
|
PDFBoxRenderer |
|
PDFMarkedContent2XHTML |
This was added in Tika 1.24 as an alpha version of a text extractor
that builds the text from the marked text tree and includes/normalizes
some of the structural tags.
|
PDFParser |
PDF parser.
|
PDFParserConfig |
Config for PDFParser.
|
PDFParserConfig.IMAGE_STRATEGY |
|
PDFParserConfig.OCR_RENDERING_STRATEGY |
|
PDFParserConfig.OCR_STRATEGY |
|
PDFParserConfig.OCRStrategyAuto |
Encapsulate the numbers used to control OCR Strategy when set to auto
|
PDFRenderingState |
|
PDFServerConfig |
PDF parser configuration, for the request
|
PDFTransformer |
|
PDFTransformerConfig |
|
PDMetadataExtractor |
|
Pharmacy |
|
PhoneExtractingContentHandler |
Class used to extract phone numbers while parsing.
|
Photoshop |
XMP Photoshop metadata schema.
|
PickBestTextEncodingParser |
Deprecated.
|
PipesClient |
The PipesClient is designed to be single-threaded.
|
PipesConfig |
|
PipesConfigBase |
|
PipesException |
Fatal exception that means that something went seriously wrong.
|
PipesIterator |
Abstract class that handles the testing for timeouts/thread safety
issues.
|
PipesParser |
|
PipesReporter |
This is called asynchronously by the AsyncProcessor.
|
PipesResource |
|
PipesResult |
|
PipesResult.STATUS |
|
PipesServer |
This server is forked from the PipesClient.
|
PipesServer.STATUS |
|
Pkcs7Parser |
Basic parser for PKCS7 data.
|
PListParser |
Parser for Apple's plist and bplist.
|
POIFSContainerDetector |
A detector that works on a POIFS OLE2 document
to figure out exactly what the file is.
|
POIXMLTextExtractorDecorator |
|
PooledTimeSeriesParser |
Uses the Pooled Time Series algorithm + command line tool, to
generate a numeric representation of the video suitable for
similarity searches.
|
PrescriptionParser |
|
PrettyMetadataKeyComparator |
|
ProbabilisticMimeDetectionSelector |
Selector for combining different mime detection results
based on probability
|
ProbabilisticMimeDetectionSelector.Builder |
Builder class for setting probability parameters
|
ProcessUtils |
|
ProduceTypeResourceComparator |
Resource comparator based on produce type.
|
ProfilingWriter |
Writer that builds a language profile based on all the written content.
|
Property |
XMP property definition.
|
Property.PropertyType |
|
Property.ValueType |
|
PropertyID |
This class is used to represent a PropertyID.
|
PropertySet |
This class is used to represent a PropertySet.
|
PropertySetObject |
This class is used to represent the property set.
|
PropertyType |
|
PropertyTypeException |
XMP property definition violation exception.
|
PropsUtil |
Utility class to handle properties.
|
PrtArrayOfPropertyValues |
The class is used to represent the prtArrayOfPropertyValues.
|
PrtFourBytesOfLengthFollowedByData |
This class is used to represent the prtFourBytesOfLengthFollowedByData.
|
PRTParser |
A basic text extracting parser for the CADKey PRT (CAD Drawing)
format.
|
PSDParser |
Parser for the Adobe Photoshop PSD File Format.
|
QuattroPro |
QuattroPro properties collection.
|
QuattroProParser |
Parser for Corel QuattroPro documents (part of Corel WordPerfect
Office Suite).
|
RangeFetcher |
This class extracts a range of bytes from a given fetch key.
|
RarParser |
Parser for Rar files.
|
RDCAnalysisChunking |
This class is used to process RDC analysis chunking
|
RecentFiles |
Builds on top of the LuceneIndexer and the Metadata discussions in Chapter 6
to output an RSS (or RDF) feed of files crawled by the LuceneIndexer within
the last N minutes.
|
RecognisedObject |
A model for recognised objects from graphics and texts; it typically includes a
human-readable label for the object, the language of the label, an id and a confidence score.
|
RecursiveMetadataResource |
|
RecursiveParserWrapper |
This is a helper class that wraps a parser in a recursive handler.
|
RecursiveParserWrapperFSConsumer |
This runs a RecursiveParserWrapper against an input file
and outputs the json metadata to an output file.
|
RecursiveParserWrapperHandler |
|
RegexCaptureParser |
|
RegexNERecogniser |
This class offers an implementation of NERecogniser based on
Regular Expressions.
|
RegexUtils |
Inspired from Nutch code class OutlinkExtractor.
|
Renderer |
Interface for a renderer.
|
Rendering |
|
RenderingParser |
|
RenderingState |
This should be used to track state for each file (embedded or otherwise).
|
RenderingTracker |
Use this in the ParseContext to keep track of unique ids for rendered
images in embedded docs.
|
RenderRequest |
Empty interface for requests to a renderer.
|
RenderResult |
|
RenderResult.STATUS |
|
RenderResults |
|
ReplacementCharset |
An implementation of the standard "replacement" charset defined by the W3C.
|
Report |
This class represents a single report.
|
ReporterBuilder |
Interface for reporter builders
|
RequestTypes |
The enumeration of request type.
|
RereadableInputStream |
Wraps an input stream, reading it only once, but making it available
for rereading an arbitrary number of times.
|
ResultsReporter |
|
RevisionManifest |
|
RevisionManifestDataElementData |
|
RevisionManifestObjectGroupReferences |
Specifies a revision manifest object group references, each followed by object group extended GUIDs
|
RevisionManifestRootDeclare |
Specifies a revision manifest root declare, each followed by root and object extended GUIDs
|
RevisionStoreObject |
The class is used to represent the revision store object.
|
RevisionStoreObjectGroup |
|
RFC822Parser |
Uses apache-mime4j to parse emails.
|
RichTextContentHandler |
Content handler for Rich Text; it will extract the XHTML <img/>
tag's <alt/> attribute and the XHTML <a/> tag's <name/>
attribute into the output.
|
RollbackSoftware |
Demonstrates Tika and its ability to sense symlinks.
|
RTFConverter |
Tika to XMP mapping for the RTF format.
|
RTFMetadata |
|
RTFParser |
RTF parser
|
RTGTranslator |
This translator is designed to work with a TCP-IP available
RTG translation server, specifically the
REST-based RTG server.
|
RunProperties |
WARNING: This class is mutable.
|
RuntimeSAXException |
Use this to throw a SAXException in subclassed methods that don't throw SAXExceptions
|
S3Emitter |
Emits to existing s3 bucket
|
S3Fetcher |
Fetches files from s3.
|
S3PipesIterator |
|
SafeContentHandler |
|
SafeContentHandler.Output |
Internal interface that allows both character and
ignorable whitespace content to be filtered the same way.
|
SAS7BDATParser |
Processes the SAS7BDAT data columnar database file used by SAS and
other similar languages.
|
SecureContentHandler |
Content handler decorator that attempts to prevent denial of service
attacks against Tika parsers.
|
SentimentAnalysisParser |
This parser classifies documents based on the sentiment of the document.
|
SequenceNumberGenerator |
|
SerialNumber |
|
ServerStatus |
|
ServerStatus.STATUS |
|
ServerStatus.TASK |
|
ServerStatusResource |
|
ServerStatusWatcher |
|
ServiceLoader |
Internal utility class that Tika uses to look up service providers.
|
ServiceLoaderUtils |
Service Loading and Ordering related utils
|
SignatureObject |
Signature Object
|
SimpleChunking |
|
SimpleLogReporterBuilder |
|
SimpleTextExtractor |
|
SimpleThreadPoolExecutor |
Simple Thread Pool Executor
|
SimpleTypeDetector |
|
SlowCompositeReaderWrapper |
COPIED VERBATIM FROM LUCENE
This class forces a composite reader (e.g. a MultiReader or DirectoryReader) to emulate a
LeafReader.
|
SolrEmitter |
|
SolrEmitter.AttachmentStrategy |
|
SolrEmitter.UpdateStrategy |
|
SolrPipesIterator |
Iterates through results from a Solr query.
|
SourceCodeParser |
Generic Source code parser for Java, Groovy, C++.
|
SpanSwapper |
randomly swaps spans from the input
|
SpreadsheetMLParser |
Parses SpreadsheetML 2003 format Excel files.
|
SpringExample |
|
SQLite3Parser |
This is the main class for parsing SQLite3 files.
|
StandardHtmlEncodingDetector |
An encoding detector that tries to respect the spirit of the HTML spec
part 12.2.3 "The input byte stream", or at least the part that is compatible with
the implementation of tika.
|
StandardOrganizations |
This class provides a collection of the most important technical standard organizations.
|
StandardReference |
Class that represents a standard reference.
|
StandardReference.StandardReferenceBuilder |
|
StandardsExtractingContentHandler |
StandardsExtractingContentHandler is a Content Handler used to extract
standard references while parsing.
|
StandardsExtractionExample |
|
StandardsText |
StandardText relies on regular expressions to extract standard references
from text.
|
StandardWriteFilter |
|
StandardWriteFilterFactory |
|
StarOfficeDetector |
|
StatefulParser |
The RecursiveParserWrapper wraps the parser sent
into the parsecontext and then uses that parser
to store state (among many other things).
|
StatusReporter |
Basic class to use for reporting status from both the crawler and the consumers.
|
StatusReporterBuilder |
|
StatusReporterFutureResult |
Empty class for what a StatusReporter returns when it finishes.
|
StoppingEarlyException |
Sentinel exception to stop parsing xml once target is found
while SAX parsing.
|
StorageIndexCellMapping |
Specifies the storage index cell mappings (with cell identifier, cell mapping extended GUID,
and cell mapping serial number)
|
StorageIndexDataElementData |
|
StorageIndexManifestMapping |
|
StorageIndexRevisionMapping |
Specifies the storage index revision mappings (with revision and revision mapping
extended GUIDs, and revision mapping serial number)
|
StorageManifestDataElementData |
|
StorageManifestRootDeclare |
Specifies one or more storage manifest root declare.
|
StorageManifestSchemaGUID |
Specifies a storage manifest schema GUID
|
StrawManTikaAppDriver |
Simple single-threaded class that calls tika-app against every file in a directory.
|
StreamEmitter |
|
StreamGobbler |
|
StreamingDetectContext |
|
StreamingZipContainerDetector |
Currently only used in tests.
|
StreamObject |
|
StreamObjectHeaderEnd |
|
StreamObjectHeaderEnd16bit |
A 16-bit header for a compound object would indicate the end of a stream object
|
StreamObjectHeaderEnd8bit |
An 8-bit header for a compound object would indicate the end of a stream object
|
StreamObjectHeaderStart |
This class specifies the base class for 16-bit or 32-bit stream object header start
|
StreamObjectHeaderStart16bit |
A 16-bit header for a compound object would indicate the start of a stream object
|
StreamObjectHeaderStart32bit |
A 32-bit header for a compound object would indicate the start of a stream object
|
StreamObjectParseErrorException |
|
StreamObjectTypeHeaderEnd |
|
StreamObjectTypeHeaderStart |
The enumeration of the stream object type header start
|
StreamOutRPWFSConsumer |
|
StringsConfig |
Configuration for the "strings" (or strings-alternative) command.
|
StringsEncoding |
Character encoding of the strings that are to be found using the "strings" command.
|
StringsParser |
Parser that uses the "strings" (or strings-alternative) command to find the
printable strings in an object, or other binary, file
(application/octet-stream).
|
StringStatsCalculator<T> |
Interface for calculators that require a string
|
StringUtils |
|
SubtreeMatcher |
Evaluation state of a ...//... XPath expression.
|
SummaryExtractor |
Extractor for Common OLE2 (HPSF) metadata
|
SupplementingParser |
|
SXSLFPowerPointExtractorDecorator |
SAX/Streaming pptx extractor
|
SXWPFWordExtractorDecorator |
This is an experimental, alternative extractor for docx files.
|
SystemUtils |
Copied from commons-lang to avoid requiring the dependency
|
TableInfo |
|
TaggedContentHandler |
A content handler decorator that tags potential exceptions so that the
handler that caused the exception can easily be identified.
|
TaggedSAXException |
A SAXException wrapper that tags the wrapped exception with
a given object reference.
|
TailStream |
A specialized input stream implementation which records the last portion read
from an underlying stream.
|
TarWriter |
|
TaskStatus |
|
TeeContentHandler |
Content handler proxy that forwards the received SAX events to zero or
more underlying content handlers.
|
TEIDOMParser |
|
TemporaryResources |
Utility class for tracking and ultimately closing or otherwise disposing
a collection of temporary resources.
|
TensorflowImageRecParser |
|
TensorflowRESTCaptioner |
Tensorflow image captioner.
|
TensorflowRESTRecogniser |
Tensor Flow image recogniser which has high performance.
|
TensorflowRESTVideoRecogniser |
Tensor Flow video recogniser which has high performance.
|
TesseractOCRConfig |
Configuration for TesseractOCRParser.
|
TesseractOCRConfig.OUTPUT_TYPE |
|
TesseractOCRParser |
TesseractOCRParser powered by tesseract-ocr engine.
|
TesseractServerConfig |
Tesseract configuration, for the request
|
TextAndAttributeContentHandler |
|
TextAndAttributeXMLParser |
|
TextAndCSVParser |
|
TextCell |
Text cell.
|
TextContentHandler |
|
TextDetector |
Content type detection of plain text documents.
|
TextLangDetector |
Language Detection using MIT Lincoln Lab’s Text.jl library
https://github.com/trevorlewis/TextREST.jl
|
TextMatcher |
Final evaluation state of a .../text() XPath expression.
|
TextMessageBodyWriter |
Returns simple text string for a particular metadata value.
|
TextOnlyPDFRenderer |
This class extends the PDFRenderer to render only the textual
elements
|
TextProfileSignature |
Copied nearly directly from Apache Nutch:
https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/TextProfileSignature.java
|
TextSha256Signature |
Calculates the base32 encoded SHA-256 checksum on the analyzed text
|
TextStatistics |
Utility class for computing a histogram of the bytes seen in a stream.
|
TextStatsCalculator |
Base text stats interface
|
TextStatsFromTikaEval |
|
TIAParsingExample |
|
TIFF |
XMP Exif TIFF schema.
|
TiffParser |
|
Tika |
Facade class for accessing Tika functionality.
|
TikaActivator |
Bundle activator that adjusts the class loading mechanism of the
ServiceLoader class to work correctly in an OSGi environment.
|
TikaCLI |
Simple command line interface for Apache Tika.
|
TikaClient |
|
TikaClientCLI |
|
TikaClientConfigException |
|
TikaClientException |
|
TikaConfig |
Parse xml config file.
|
TikaConfigException |
Tika Config Exception is an exception that occurs when there is an error
in the Tika config file and/or one or more of the parsers failed to initialize
from that erroneous config.
|
TikaConfigSerializer |
|
TikaConfigSerializer.Mode |
|
TikaCoreProperties |
Contains a core set of basic Tika metadata properties, which all parsers
will attempt to supply (where the file format permits).
|
TikaCoreProperties.EmbeddedResourceType |
A file might contain different types of embedded documents.
|
TikaDetectors |
Provides details of all the Detectors registered with
Apache Tika, similar to --list-detectors with the Tika CLI.
|
TikaEmitterException |
|
TikaEmitterResult |
|
TikaEvalCLI |
|
TikaEvalMetadataFilter |
|
TikaEvalResource |
|
TikaExcelDataFormatter |
Overrides Excel's General format to include more
significant digits than the MS Spec allows.
|
TikaExcelGeneralFormat |
A Format that allows up to 15 significant digits for integers.
|
TikaException |
Tika exception
|
TikaFileTypeDetector |
|
TikaGUI |
Simple Swing GUI for Apache Tika.
|
TikaInputStream |
Input stream with extended capabilities.
|
TikaLanguageDetector |
This is Tika's original legacy, homegrown language detector.
|
TikaLoggingFilter |
|
TikaMemoryLimitException |
|
TikaMimeKeys |
A collection of Tika metadata keys used in Mime Type resolution
|
TikaMimeTypes |
Provides details of all the mimetypes known to Apache Tika,
similar to --list-supported-types with the Tika CLI.
|
TikaMp4BoxHandler |
|
TikaPagedText |
Metadata properties for paged text, metadata appropriate
for an individual page (useful for embedded document handlers
called on individual pages).
|
TikaParsers |
Provides details of all the Parsers registered with
Apache Tika, similar to --list-parsers and
--list-parser-details within the Tika CLI.
|
TikaResource |
|
TikaServerCli |
|
TikaServerClientConfig |
|
TikaServerConfig |
|
TikaServerParseException |
Simple wrapper exception to be thrown for consistent handling
of exceptions that can happen during a parse.
|
TikaServerParseExceptionMapper |
|
TikaServerProcess |
|
TikaServerResource |
Stub interface to allow for loading of resources via SPI
|
TikaServerStatus |
|
TikaServerWatchDog |
|
TikaServerWriter<T> |
Stub interface to allow for SPI loading from other modules
without opening up service loading to any generic MessageBodyWriter
|
TikaTaskTimeout |
|
TikaTimeoutException |
|
TikaToXMP |
|
TikaUserDataBox |
|
TikaVersion |
|
TikaWelcome |
Provides a basic welcome to the Apache Tika Server.
|
TikaWelcome.Endpoint |
|
TimeoutConfig |
|
TlsConfig |
|
TMXContentHandler |
Content Handler for Translation Memory eXchange (TMX) files.
|
TMXParser |
Parser for Translation Memory eXchange (TMX) files.
|
TNEFParser |
A POI-powered Tika Parser for TNEF (Transport Neutral
Encoding Format) messages, aka winmail.dat
|
ToHTMLContentHandler |
SAX event handler that serializes the HTML document to a character stream.
|
TokenContraster |
Computes some corpus contrast statistics.
|
TokenCounter |
Deprecated.
|
TokenCountPriorityQueue |
|
TokenCountPriorityQueue |
|
TokenCounts |
|
TokenCountStatsCalculator<T> |
Interface for calculators that require token stats
|
TokenEntropy |
|
TokenIntPair |
|
TokenLengths |
|
TokenStatistics |
|
TopCommonTokenCounter |
Utility class that reads in a UTF-8 input file with one document per row
and outputs the 20000 tokens with the highest document frequencies.
|
TopNTokens |
|
ToTextContentHandler |
SAX event handler that writes all character content out to a character
stream.
|
ToXMLContentHandler |
SAX event handler that serializes the XML document to a character stream.
|
TrainedModel |
|
TrainedModelDetector |
|
TrainTestSplit |
|
TranscribeTranslateExample |
This example demonstrates primitive logic for
chaining Tika API calls.
|
Transformer |
|
TranslateResource |
|
Translator |
Interface for Translator services.
|
TranslatorExample |
|
TrecDocumentGenerator |
Generates document summaries for corpus analysis in the Open Relevance
project.
|
TrueTypeParser |
Parser for TrueType font files (TTF).
|
Truncator |
|
TSDParser |
Tika parser for Time Stamped Data Envelope (application/timestamped-data)
|
TwoBytesOfData |
This class is used to represent a property that contains 2 bytes of data in the PropertySet.rgData stream field.
|
TXTParser |
Plain text parser.
|
TypeDetector |
Content type detection based on a content type hint.
|
UByte |
The unsigned byte type
|
UInteger |
The unsigned int type
|
ULong |
The unsigned long type
|
UMath |
|
UnicodeBlockCounter |
|
UniversalEncodingDetector |
|
UnpackerResource |
|
UnrarParser |
Parser for Rar files.
|
Unsigned |
A utility class for static access to unsigned number functionality.
|
UnsupportedFormatException |
Parsers should throw this exception when they encounter
a file format that they do not support.
|
UNumber |
A base type for unsigned numbers.
|
URLEmailNormalizingFilterFactory |
Factory for filter that normalizes urls and emails to __url__ and __email__
respectively.
|
UrlFetcher |
Simple fetcher for URLs.
|
UShort |
The unsigned short type
|
UuidUtils |
|
VectorGraphicsOnlyPDFRenderer |
This class extends the PDFRenderer to render only the vector graphics
elements
|
WACZParser |
|
WARC |
|
WARCParser |
|
WatchDogResult |
|
WebPParser |
|
WMFParser |
This parser offers a very rough capability to extract text if there
is text stored in the WMF files.
|
Word2006MLParser |
|
WordExtractor |
|
WordExtractor.TagAndStyle |
|
WordMLParser |
Parses wordml 2003 format word files.
|
WordPerfect |
WordPerfect properties collection.
|
WordPerfectParser |
Parser for Corel WordPerfect documents.
|
WriteLimiter |
|
WriteLimitReachedException |
|
WriteOutContentHandler |
SAX event handler that writes content up to an optional write
limit out to a character stream or other decorated handler.
|
XHTMLContentHandler |
Content handler decorator that simplifies the task of producing XHTML
events for Tika content parsers.
|
XLIFF12ContentHandler |
Content Handler for XLIFF 1.2 documents.
|
XLIFF12Parser |
Parser for XLIFF 1.2 files.
|
XLSXHREFFormatter |
|
XLZParser |
Parser for XLZ Archives.
|
XMLDOMUtil |
|
XMLErrorLogUpdater |
This is a very task specific class that reads a log file and updates
the "comparisons" table.
|
XMLLogMsgHandler |
|
XMLLogReader |
|
XMLParser |
XML parser.
|
XMLProfiler |
|
XMLReaderUtils |
Utility functions for reading XML.
|
XmlRootExtractor |
Utility class that uses a SAXParser to determine
the namespace URI and local name of the root element of an XML file.
|
XMP |
|
XMPContentHandler |
Content handler decorator that simplifies the task of producing XMP output.
|
XMPDM |
XMP Dynamic Media schema.
|
XMPDM.ChannelTypePropertyConverter |
Deprecated.
|
XMPIdq |
|
XMPMessageBodyWriter |
|
XMPMetadata |
Provides a conversion of the Metadata map from Tika to the XMP data model by also providing the
Metadata API for clients to ease transition.
|
XMPMetadataExtractor |
XMP Metadata Extractor based on Apache XmpBox.
|
XMPMetadataResource |
|
XMPMM |
|
XMPPacketScanner |
This class is a parser for XMP packets.
|
XMPRights |
XMP Rights management schema.
|
XMPSchemaPDFUA |
|
XMPSchemaPDFVT |
|
XMPSchemaPDFX |
This is somewhat of a hack to handle the older pdfx:
See also the more modern XMPSchemaPDFXId
|
XMPSchemaPDFXId |
|
XPathParser |
Parser for a very simple XPath subset.
|
XPSExtractorDecorator |
|
XPSTextExtractor |
Currently, mostly a pass-through class to hold pkg and properties
and keep the general framework similar to our other POI-integrated
extractors.
|
XSLFEventBasedPowerPointExtractor |
|
XSLFPowerPointExtractorDecorator |
|
XSSFBExcelExtractorDecorator |
|
XSSFExcelExtractorDecorator |
|
XSSFExcelExtractorDecorator.HeaderFooterFromString |
|
XSSFExcelExtractorDecorator.SheetTextAsHTML |
Turns formatted sheet events into HTML
|
XSSFExcelExtractorDecorator.XSSFSheetInterestingPartsCapturer |
Captures information on interesting tags, whilst
delegating the main work to the formatting handler
|
XUserDefinedCharset |
|
XUserDefinedCharset.NotImplementedException |
|
XWPFEventBasedWordExtractor |
Experimental class that is based on POI's XSSFEventBasedExcelExtractor
|
XWPFListManager |
|
XWPFNumberingShim |
Stub class of POI's XWPFNumbering because onDocumentRead() is protected
|
XWPFStylesShim |
For Tika, all we need (so far) is a mapping between styleId and a style's name.
|
XWPFWordExtractorDecorator |
|
YandexTranslator |
|
ZeroByteFileException |
Exception thrown by the AutoDetectParser when a file contains zero bytes.
|
ZeroByteFileException.IgnoreZeroByteFileException |
|
ZeroSizeFileDetector |
Detector to identify zero length files as application/x-zerovalue
|
ZipContainerDetector |
Classes that implement this must be able to detect on a ZipFile and in streaming mode.
|
ZipFilesChunking |
This class is used to process zip file chunking
|
ZipHeader |
|
ZipListFiles |
Example code listing from Chapter 1.
|
ZipSalvager |
|
ZipWriter |
|