[
https://issues.apache.org/jira/browse/TIKA-4650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18056663#comment-18056663
]
Tim Allison edited comment on TIKA-4650 at 2/5/26 1:45 PM:
-----------------------------------------------------------
Not sure I agree with all of Claude's conclusions. So, yes, the new features do
add some cost on large files, but users can turn off the integrity checks if
they notice, and I think those are important features to have.
h2. ZIP Parser Benchmark Results (Final)
h4. Test Files
||Name||Entries||Size||
|Small|10|10 KB|
|Medium|1,000|97 MB|
|Large|5,000|2,441 MB (~2.4 GB)|
h4. DefaultHandler Mode
||Branch||Small (10 entries, 10 KB)||Medium (1,000 entries, 97 MB)||Large (5,000 entries, 2.4 GB)||
|Tika 3.x (original)|11.4 ms|728 ms|4059 ms|
|Tika 3.x (+ openContainer fix)|12.0 ms|663 ms|2986 ms|
|Main (4.x)|7.6 ms|625 ms|3589 ms|
|Feature (4.x) - no integrity/metadata|11.3 ms|561 ms|*2571 ms*|
|Feature (4.x) - full|7.8 ms|580 ms|3378 ms|
h4. RecursiveParserWrapper Mode
||Branch||Small (10 entries, 10 KB)||Medium (1,000 entries, 97 MB)||Large (5,000 entries, 2.4 GB)||
|Tika 3.x (original)|12.8 ms|842 ms|4170 ms|
|Tika 3.x (+ openContainer fix)|13.0 ms|618 ms|2961 ms|
|Main (4.x)|7.5 ms|622 ms|3645 ms|
|Feature (4.x) - no integrity/metadata|7.3 ms|567 ms|*2762 ms*|
|Feature (4.x) - full|8.4 ms|595 ms|3453 ms|
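The relative differences on the Large file can be recomputed directly from the two tables above. The snippet below is only a worked-arithmetic sketch: the class and method names are made up for illustration, and the inputs are the Large-column timings copied from the tables. Note that the result depends on the mode and on which run is used as the denominator.

```java
// Recomputes relative differences from the "Large" column of the tables above.
// ZipBenchmarkMath is a hypothetical helper, not part of Tika.
final class ZipBenchmarkMath {

    /** Percent reduction in elapsed time when going from baselineMs to candidateMs. */
    static double percentFaster(double baselineMs, double candidateMs) {
        return 100.0 * (baselineMs - candidateMs) / baselineMs;
    }

    /** Percent of extra elapsed time of candidateMs relative to baselineMs. */
    static double percentOverhead(double baselineMs, double candidateMs) {
        return 100.0 * (candidateMs - baselineMs) / baselineMs;
    }

    public static void main(String[] args) {
        // Order of each pair: DefaultHandler mode, then RecursiveParserWrapper mode.
        System.out.printf("3.x original -> 3.x + fix: %.1f%% / %.1f%% faster%n",
                percentFaster(4059, 2986), percentFaster(4170, 2961));
        System.out.printf("3.x + fix -> feature (no integrity/metadata): %.1f%% / %.1f%% faster%n",
                percentFaster(2986, 2571), percentFaster(2961, 2762));
        System.out.printf("feature (no integrity/metadata) -> feature (full): %.1f%% / %.1f%% overhead%n",
                percentOverhead(2571, 3378), percentOverhead(2762, 3453));
        System.out.printf("3.x original -> feature (full): %.1f%% / %.1f%% faster%n",
                percentFaster(4059, 3378), percentFaster(4170, 3453));
    }
}
```

These are single timings per cell, so small differences between rows should be read with some caution.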
h4. Key Findings
* The openContainer optimization gives Tika 3.x a *26-29% speedup* on large
ZIPs
* Feature branch (4.x) without integrity check + metadata extraction is *14%
faster* than 3.x with fix in DefaultHandler mode (about 7% in
RecursiveParserWrapper mode)
* The integrity check + metadata extraction adds ~700-800 ms on 2.4 GB ZIPs
(roughly 25-31% over the run without them)
* With full features enabled, feature branch is still *17% faster* than
original 3.x
* The 3.x openContainer fix requires two changes:
** DefaultZipContainerDetector: Store ZipFile in openContainer even for plain
ZIPs
** PackageParser: Check openContainer for existing ZipFile before creating new
ArchiveInputStream
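
The two-part openContainer fix described above follows a "detect once, reuse in the parser" pattern. A minimal, self-contained sketch of that pattern using only java.util.zip is below. Everything in it is a hypothetical stand-in (Source, detect, parse, runDemo); it is not the code of DefaultZipContainerDetector or PackageParser, and it only lists entry names rather than parsing entry content.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class OpenContainerSketch {

    /** Hypothetical stand-in for an input wrapper that can carry an already-opened container. */
    static final class Source implements AutoCloseable {
        final Path path;
        Object openContainer; // set by the detector, read by the parser

        Source(Path path) { this.path = path; }

        @Override
        public void close() throws IOException {
            if (openContainer instanceof ZipFile) {
                ((ZipFile) openContainer).close();
            }
        }
    }

    /** Counts how often parse() had to fall back to a second, streaming pass. */
    static int streamingPasses = 0;

    /** Change 1 (sketch): the detector keeps the ZipFile it opened, even for a plain ZIP. */
    static String detect(Source src) throws IOException {
        src.openContainer = new ZipFile(src.path.toFile());
        return "application/zip";
    }

    /** Change 2 (sketch): the parser checks for an existing ZipFile before streaming. */
    static List<String> parse(Source src) throws IOException {
        List<String> names = new ArrayList<>();
        if (src.openContainer instanceof ZipFile) {
            ZipFile zip = (ZipFile) src.openContainer;
            for (ZipEntry e : Collections.list(zip.entries())) {
                names.add(e.getName());
            }
            return names;
        }
        // Fallback: no open container, so read the archive again as a stream.
        streamingPasses++;
        try (ZipInputStream in = new ZipInputStream(Files.newInputStream(src.path))) {
            for (ZipEntry e = in.getNextEntry(); e != null; e = in.getNextEntry()) {
                names.add(e.getName());
            }
        }
        return names;
    }

    /** Builds a two-entry temp ZIP, runs detect then parse, and returns the entry names. */
    static List<String> runDemo() {
        try {
            Path zip = Files.createTempFile("open-container-sketch", ".zip");
            try (ZipOutputStream out = new ZipOutputStream(Files.newOutputStream(zip))) {
                for (String name : new String[] {"a.txt", "b.txt"}) {
                    out.putNextEntry(new ZipEntry(name));
                    out.write(name.getBytes(StandardCharsets.UTF_8));
                    out.closeEntry();
                }
            }
            try (Source src = new Source(zip)) {
                detect(src);
                return parse(src);
            } finally {
                Files.deleteIfExists(zip);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(runDemo() + ", streaming passes: " + streamingPasses);
    }
}
```

In this sketch the parser never reaches the streaming branch once the detector has stored the ZipFile, which is the kind of duplicated open/read work the fix is presumably avoiding on large archives.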
> Improve zip parsing in 4.x
> --------------------------
>
> Key: TIKA-4650
> URL: https://issues.apache.org/jira/browse/TIKA-4650
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
>
> Zip parsing has a number of quirks that require special processing. Over time
> those have accreted in the PackageParser. Further, there's not great
> coordination between the zip detector and the zip parser...there are some
> areas where we could streamline the detect+parse steps.
> Let's create a standalone zip parser and improve the coordination between
> detection and parsing for zip files.