[jira] [Comment Edited] (TIKA-4650) Improve zip parsing in 4.x

Tim Allison (Jira) Thu, 05 Feb 2026 05:47:54 -0800


    [ 
https://issues.apache.org/jira/browse/TIKA-4650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18056663#comment-18056663
 ]


Tim Allison edited comment on TIKA-4650 at 2/5/26 1:45 PM:
-----------------------------------------------------------

Not sure I agree with all of Claude's conclusions. So, yes, the new features do 
add some cost on large files, but users can turn off the integrity checks if 
they notice, and I think those are important features to have.
h2. ZIP Parser Benchmark Results (Final)
h4. Test Files
||Name||Entries||Size||
|Small|10|10 KB|
|Medium|1,000|97 MB|
|Large|5,000|2,441 MB (~2.4 GB)|
h4. DefaultHandler Mode
||Branch||Small (10 entries, 10 KB)||Medium (1,000 entries, 97 MB)||Large (5,000 entries, 2.4 GB)||
|Tika 3.x (original)|11.4 ms|728 ms|4059 ms|
|Tika 3.x (+ openContainer fix)|12.0 ms|663 ms|2986 ms|
|Main (4.x)|7.6 ms|625 ms|3589 ms|
|Feature (4.x) - no integrity/metadata|11.3 ms|561 ms|*2571 ms*|
|Feature (4.x) - full|7.8 ms|580 ms|3378 ms|
h4. RecursiveParserWrapper Mode
||Branch||Small (10 entries, 10 KB)||Medium (1,000 entries, 97 MB)||Large (5,000 entries, 2.4 GB)||
|Tika 3.x (original)|12.8 ms|842 ms|4170 ms|
|Tika 3.x (+ openContainer fix)|13.0 ms|618 ms|2961 ms|
|Main (4.x)|7.5 ms|622 ms|3645 ms|
|Feature (4.x) - no integrity/metadata|7.3 ms|567 ms|*2762 ms*|
|Feature (4.x) - full|8.4 ms|595 ms|3453 ms|
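
The large-file timings in the two tables above can be turned into relative differences with a little arithmetic. The sketch below is only illustrative: the class and method names are made up, "percent faster" is assumed to mean the reduction in run time relative to the slower run, and "overhead" is assumed to mean the extra time relative to the run without integrity check + metadata extraction.

```java
public class ZipBenchmarkMath {

    /** Percent reduction in run time when going from baselineMs to candidateMs. */
    public static double percentFaster(double baselineMs, double candidateMs) {
        return 100.0 * (baselineMs - candidateMs) / baselineMs;
    }

    /** Extra time of the full run, as a percentage of the run without the extra work. */
    public static double percentOverhead(double withoutMs, double withMs) {
        return 100.0 * (withMs - withoutMs) / withoutMs;
    }

    /** Prints the comparisons for one handler mode; inputs are large-file timings in ms. */
    static void report(String mode, double orig3x, double fixed3x, double lean4x, double full4x) {
        System.out.printf("%s%n", mode);
        System.out.printf("  3.x fix vs 3.x original:   %.1f%% faster%n", percentFaster(orig3x, fixed3x));
        System.out.printf("  4.x lean vs 3.x fix:       %.1f%% faster%n", percentFaster(fixed3x, lean4x));
        System.out.printf("  4.x full vs 3.x original:  %.1f%% faster%n", percentFaster(orig3x, full4x));
        System.out.printf("  4.x full vs 4.x lean:      %.0f ms extra (%.1f%% overhead)%n",
                full4x - lean4x, percentOverhead(lean4x, full4x));
    }

    public static void main(String[] args) {
        // Values copied from the "Large (5,000 entries, 2.4 GB)" column of each table.
        report("DefaultHandler", 4059, 2986, 2571, 3378);
        report("RecursiveParserWrapper", 4170, 2961, 2762, 3453);
    }
}
```

With the table values, DefaultHandler mode comes out at roughly 26%, 14%, and 17% faster for the three comparisons, and RecursiveParserWrapper mode at roughly 29%, 7%, and 17%. The extra cost of the full feature set is about 807 ms (~31% on top of the lean run) in DefaultHandler mode and about 691 ms (~25%) in RecursiveParserWrapper mode.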
h4. Key Findings
 * The openContainer optimization gives Tika 3.x a *26-29% speedup* on large 
ZIPs
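 * Feature branch (4.x) without integrity check + metadata extraction is *14% 
faster* than 3.x with fix in DefaultHandler mode (2571 ms vs. 2986 ms) and ~7% 
faster in RecursiveParserWrapper mode (2762 ms vs. 2961 ms)
 * The integrity check + metadata extraction adds ~800 ms on 2.4 GB ZIPs in 
DefaultHandler mode (~24% of the full run time, or ~31% on top of the run 
without it) and ~700 ms in RecursiveParserWrapper mode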
 * With full features enabled, feature branch is still *17% faster* than 
original 3.x
 * The 3.x openContainer fix requires two changes:
 ** DefaultZipContainerDetector: Store ZipFile in openContainer even for plain 
ZIPs
 ** PackageParser: Check openContainer for existing ZipFile before creating new 
ArchiveInputStream
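
The two-part openContainer change listed above follows a general pattern: the detector opens the archive once and leaves the handle on the stream, and the parser reuses that handle instead of opening the archive a second time. The sketch below only illustrates that pattern. `ArchiveHandle` and `SharedStream` are hypothetical stand-ins written for this example, not Tika or Commons Compress classes, and the real detector and parser code is not shown here.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class OpenContainerSketch {

    /** Hypothetical stand-in for an expensive-to-open archive handle (for example a ZipFile). */
    public static final class ArchiveHandle {
        /** Counts how many times an archive was opened, to make the saving visible. */
        public static final AtomicInteger OPEN_COUNT = new AtomicInteger();
        public final String path;

        public ArchiveHandle(String path) {
            this.path = path;
            OPEN_COUNT.incrementAndGet();
        }
    }

    /** Hypothetical stand-in for a stream that can carry an already-opened container. */
    public static final class SharedStream {
        public final String path;
        private Object openContainer;

        public SharedStream(String path) {
            this.path = path;
        }

        public Object getOpenContainer() {
            return openContainer;
        }

        public void setOpenContainer(Object container) {
            this.openContainer = container;
        }
    }

    /** Detector step: open the archive once and store the handle, even for a plain ZIP. */
    public static String detect(SharedStream stream) {
        ArchiveHandle handle = new ArchiveHandle(stream.path);
        stream.setOpenContainer(handle);
        return "application/zip";
    }

    /** Parser step: reuse the stored handle if there is one, otherwise open the archive. */
    public static ArchiveHandle parse(SharedStream stream) {
        Object container = stream.getOpenContainer();
        if (container instanceof ArchiveHandle) {
            return (ArchiveHandle) container;
        }
        return new ArchiveHandle(stream.path);
    }

    public static void main(String[] args) {
        SharedStream stream = new SharedStream("plain.zip");
        detect(stream);
        parse(stream);
        System.out.println("archive opened " + ArchiveHandle.OPEN_COUNT.get() + " time(s)");
    }
}
```

If the detector did not store the handle, `parse` would fall through to the last line and open the archive again, which is the duplicated work this pattern avoids.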



> Improve zip parsing in 4.x
> --------------------------
>
>                 Key: TIKA-4650
>                 URL: https://issues.apache.org/jira/browse/TIKA-4650
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>
> Zip parsing has a number of quirks that require special processing. Over time 
> those have accreted in the PackageParser. Further, there's not great 
> coordination between the zip detector and the zip parser...there are some 
> areas where we could streamline the detect+parse steps.
> Let's create a standalone zip parser and improve the coordination between 
> detection and parsing for zip files.



