As always, thank you, Stefan!

We might add a kluge at the Tika level to check for TIFF first...unless you'd 
like that kluge in your code? 😉

The reporter recommended one option: a conditional that checks whether the 
tarHeader variable starts with one of the TIFF magic numbers ("II" = 49 49 
2A 00, or "MM" = 4D 4D 00 2A).
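
For illustration only, a rough sketch of what such a check might look like. 
The class and method names here are hypothetical and are not taken from Tika 
or Commons Compress; the only facts assumed are the two four-byte TIFF 
signatures quoted above.

```java
public final class TiffMagicCheck {

    private TiffMagicCheck() {
        // static helper only; hypothetical class, not part of any library
    }

    /**
     * Returns true if the buffer starts with either TIFF signature:
     * "II" 2A 00 (little-endian) or "MM" 00 2A (big-endian).
     */
    public static boolean startsWithTiffMagic(byte[] header) {
        if (header == null || header.length < 4) {
            return false;
        }
        boolean littleEndian = header[0] == 0x49 && header[1] == 0x49
                && header[2] == 0x2A && header[3] == 0x00;
        boolean bigEndian = header[0] == 0x4D && header[1] == 0x4D
                && header[2] == 0x00 && header[3] == 0x2A;
        return littleEndian || bigEndian;
    }
}
```

A caller could run this on the first bytes of the stream and skip TAR 
detection when it returns true, wherever that check ends up living.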



-----Original Message-----
From: Stefan Bodewig [mailto:bode...@apache.org] 
Sent: Tuesday, February 27, 2018 3:46 PM
To: Stefan Bodewig <bode...@apache.org>
Cc: Allison, Timothy B. <talli...@mitre.org>; Commons Developers List 
<dev@commons.apache.org>
Subject: Re: [COMPRESS] TIFF file identified as TAR

On 2018-02-27, Stefan Bodewig wrote:

> On 2018-02-27, Allison, Timothy B. wrote:

>>    On TIKA-2591[0], a user reports that a specific type of TIFF is
>>    being identified as a TAR file.  Is this something we should try to
>>    fix at the Tika level, or is this something that would be better
>>    fixed in COMPRESS?

> TAR auto-detection is, erm, clumsy. But this is due to the format not 
> being built for being detected.

> This is how it works right now:

> * read the first candidate header of 512 bytes

> * look at the eight bytes that contain the "ustar" string and the
>   version and verify they look like something we support.

> * verify the checksum of the candidate tar header

Actually, I was misreading the code. It is either "ustar and version look good" 
or "parses as tar header with correct checksum". So the chance of false 
positives is higher than I described above.

Unfortunately this has proven necessary to detect all valid TAR
archives: https://issues.apache.org/jira/browse/COMPRESS-117
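
To make the either/or logic concrete, here is a simplified sketch. It is not 
the Commons Compress code: the class and method names are hypothetical, the 
offsets (checksum field at 148, magic at 257) come from the common tar header 
layout, only the five "ustar" characters are compared rather than the full 
magic and version, and only the unsigned checksum is computed.

```java
import java.nio.charset.StandardCharsets;

public final class TarHeuristicSketch {

    private static final int HEADER_SIZE = 512;
    private static final int CHKSUM_OFFSET = 148;
    private static final int CHKSUM_LEN = 8;
    private static final int MAGIC_OFFSET = 257;
    private static final byte[] USTAR = "ustar".getBytes(StandardCharsets.US_ASCII);

    private TarHeuristicSketch() {
    }

    /** Accepts the header if EITHER test passes, which is why false positives occur. */
    public static boolean looksLikeTar(byte[] header) {
        if (header == null || header.length < HEADER_SIZE) {
            return false;
        }
        return hasUstarMagic(header) || hasValidChecksum(header);
    }

    static boolean hasUstarMagic(byte[] h) {
        for (int i = 0; i < USTAR.length; i++) {
            if (h[MAGIC_OFFSET + i] != USTAR[i]) {
                return false;
            }
        }
        return true;
    }

    /** Sums all 512 bytes, counting the checksum field itself as spaces. */
    static boolean hasValidChecksum(byte[] h) {
        long stored = parseOctal(h, CHKSUM_OFFSET, CHKSUM_LEN);
        if (stored < 0) {
            return false;
        }
        long sum = 0;
        for (int i = 0; i < HEADER_SIZE; i++) {
            boolean inField = i >= CHKSUM_OFFSET && i < CHKSUM_OFFSET + CHKSUM_LEN;
            sum += inField ? ' ' : (h[i] & 0xFF);
        }
        return sum == stored;
    }

    /** Parses leading-space-padded octal digits; returns -1 if there are none. */
    static long parseOctal(byte[] h, int off, int len) {
        int i = off;
        int end = off + len;
        while (i < end && h[i] == ' ') {
            i++;
        }
        long value = 0;
        int digits = 0;
        for (; i < end && h[i] >= '0' && h[i] <= '7'; i++) {
            value = value * 8 + (h[i] - '0');
            digits++;
        }
        return digits == 0 ? -1 : value;
    }
}
```

In this sketch, any 512-byte block whose bytes at 148..155 happen to hold an 
octal number equal to the byte sum is accepted, with no magic string required.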

Stefan

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org