Package org.apache.tika.parser.microsoft.chm


package org.apache.tika.parser.microsoft.chm
  • Class
    Description
    Defines an accessor interface
    Contains chm extractor assertions
    A container that contains chm block information such as: i. initial block is using to reset main tree ii. start block is using for knowing where to start iii. end block is using for knowing where to stop iv. start offset is using for knowing where to start reading v. end offset is using for knowing where to stop reading
     
    Represents entry types: uncompressed, compressed
    Represents intel file states during decompression
    Represents lzx states: started decoding, not started decoding
     
    Holds chm listing entries
    Extracts text from chm file.
    The Header 0000: char[4] 'ITSF' 0004: DWORD 3 (Version number) 0008: DWORD Total header length, including header section table and following data. 000C: DWORD 1 (unknown) 0010: DWORD a timestamp 0014: DWORD Windows Language ID 0018: GUID {7C01FD10-7BAA-11D0-9E0C-00A0-C922-E6EC} 0028: GUID {7C01FD11-7BAA-11D0-9E0C-00A0-C922-E6EC} Note: a GUID is $10 bytes, arranged as 1 DWORD, 2 WORDs, and 8 BYTEs. 0000: QWORD Offset of section from beginning of file 0008: QWORD Length of section Following the header section table is 8 bytes of additional header data.
    Directory header The directory starts with a header; its format is as follows: 0000: char[4] 'ITSP' 0004: DWORD Version number 1 0008: DWORD Length of the directory header 000C: DWORD $0a (unknown) 0010: DWORD $1000 Directory chunk size 0014: DWORD "Density" of quickref section, usually 2 0018: DWORD Depth of the index tree - 1 there is no index, 2 if there is one level of PMGI chunks 001C: DWORD Chunk number of root index chunk, -1 if there is none (though at least one file has 0 despite there being no index chunk, probably a bug) 0020: DWORD Chunk number of first PMGL (listing) chunk 0024: DWORD Chunk number of last PMGL (listing) chunk 0028: DWORD -1 (unknown) 002C: DWORD Number of directory chunks (total) 0030: DWORD Windows language ID 0034: GUID {5D02926A-212E-11D0-9DF9-00A0C922E6EC} 0044: DWORD $54 (This is the length again) 0048: DWORD -1 (unknown) 004C: DWORD -1 (unknown) 0050: DWORD -1 (unknown)
    Decompresses a chm block.
    ::DataSpace/Storage//ControlData This file contains $20 bytes of information on the compression.
    LZXC reset table For ensuring a decompression.
     
     
     
    Description Note: not always exists An index chunk has the following format: 0000: char[4] 'PMGI' 0004: DWORD Length of quickref/free area at end of directory chunk 0008: Directory index entries (to quickref/free area) The quickref area in an PMGI is the same as in an PMGL The format of a directory index entry is as follows: BYTE: length of name BYTEs: name (UTF-8 encoded) ENCINT: directory listing chunk which starts with name Encoded Integers aka ENCINT An ENCINT is a variable-length integer.
    Description There are two types of directory chunks -- index chunks, and listing chunks.
     
     
    The format of a directory listing entry is as follows: BYTE: length of name BYTEs: name (UTF-8 encoded) ENCINT: content section ENCINT: offset ENCINT: length The offset is from the beginning of the content section the file is in, after the section has been decompressed (if appropriate).