[ https://issues.apache.org/jira/browse/TIKA-4650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18056655#comment-18056655 ]
Tim Allison commented on TIKA-4650:
-----------------------------------
Benchmark update from Claude, now including a fix for 3.x.
Here's the updated JIRA table, with the 3.x + openContainer optimization added by Claude:
h3. ZIP Parser Benchmark Results (Final)
h4. Test Files
|| Name || Entries || Size ||
| Small | 10 | 10 KB |
| Medium | 1,000 | 97 MB |
| Large | 5,000 | 2,441 MB (~2.4 GB) |
h4. DefaultHandler Mode
|| Branch || Small (10 entries, 10 KB) || Medium (1,000 entries, 97 MB) || Large (5,000 entries, 2.4 GB) ||
| Tika 3.x (original) | 11.4 ms | 728 ms | 4059 ms |
| Tika 3.x (+ openContainer fix) | 12.0 ms | 663 ms | *2986 ms* |
| Main (4.x) | 7.6 ms | 625 ms | 3589 ms |
| Feature (4.x) | 7.8 ms | 580 ms | 3378 ms |
h4. RecursiveParserWrapper Mode
|| Branch || Small (10 entries, 10 KB) || Medium (1,000 entries, 97 MB) || Large (5,000 entries, 2.4 GB) ||
| Tika 3.x (original) | 12.8 ms | 842 ms | 4170 ms |
| Tika 3.x (+ openContainer fix) | 13.0 ms | 618 ms | *2961 ms* |
| Main (4.x) | 7.5 ms | 622 ms | 3645 ms |
| Feature (4.x) | 8.4 ms | 595 ms | 3453 ms |
h4. Key Findings
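* The openContainer optimization cuts Tika 3.x parse time on large ZIPs by *26-29%* (4059 ms to 2986 ms in DefaultHandler mode, 4170 ms to 2961 ms in RecursiveParserWrapper mode)
* With this fix, 3.x large ZIP performance (*2961-2986 ms*) is actually *faster* than both 4.x main (3589-3645 ms) and the 4.x feature branch (3378-3453 ms)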
* The fix requires two changes:
** DefaultZipContainerDetector: Store the ZipFile in openContainer even for plain ZIPs (it was previously being closed)
** PackageParser: Check openContainer for an existing ZipFile before creating a new ArchiveInputStream
* Small ZIPs show no improvement (setup cost dominates)
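* 4.x still has an edge on small ZIPs and, in most cases, on medium ZIPs (attributed to Java 17 optimizations); the exception is medium ZIPs in RecursiveParserWrapper mode, where 3.x with the fix (618 ms) is on par with main (622 ms)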
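
The two-part fix described above can be sketched roughly as follows. This is only an illustration of the pattern (the detector leaves the opened ZIP on the stream, the parser reuses it), not the actual Tika change. It assumes the JDK's java.util.zip.ZipFile rather than whatever ZipFile class the real code uses, and StreamHandle, detectZip and listEntries are hypothetical names standing in for the object that carries openContainer, the detector, and the parser.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipInputStream;

class OpenContainerSketch {

    /** Hypothetical stand-in for the stream object that carries an already-opened container. */
    static final class StreamHandle implements AutoCloseable {
        final Path path;
        private Object openContainer;

        StreamHandle(Path path) {
            this.path = path;
        }

        Object getOpenContainer() {
            return openContainer;
        }

        void setOpenContainer(Object container) {
            this.openContainer = container;
        }

        @Override
        public void close() throws IOException {
            // The handle owns the container, so it is closed exactly once, here.
            if (openContainer instanceof ZipFile) {
                ((ZipFile) openContainer).close();
            }
        }
    }

    /** Detection step (hypothetical): opens the ZIP and leaves it on the handle instead of closing it. */
    static boolean detectZip(StreamHandle handle) {
        try {
            ZipFile zip = new ZipFile(handle.path.toFile());
            handle.setOpenContainer(zip); // keep it open so the parser can reuse it
            return true;
        } catch (IOException e) {
            return false; // not a readable ZIP
        }
    }

    /** Parse step (hypothetical): reuses the open ZipFile if present, else falls back to streaming. */
    static List<String> listEntries(StreamHandle handle) throws IOException {
        List<String> names = new ArrayList<>();
        Object container = handle.getOpenContainer();
        if (container instanceof ZipFile) {
            // Fast path: the central directory was already read during detection.
            Enumeration<? extends ZipEntry> entries = ((ZipFile) container).entries();
            while (entries.hasMoreElements()) {
                names.add(entries.nextElement().getName());
            }
            return names;
        }
        // Fallback: no open container, so read the archive again as a stream.
        try (ZipInputStream zis = new ZipInputStream(Files.newInputStream(handle.path))) {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }
}
```

In this sketch the handle closes the container, so detection and parsing share a single open of the file; without the detectZip step the parser still works, but it has to read the archive a second time.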
> Improve zip parsing in 4.x
> --------------------------
>
> Key: TIKA-4650
> URL: https://issues.apache.org/jira/browse/TIKA-4650
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
>
> Zip parsing has a number of quirks that require special processing. Over time
> those have accreted in the PackageParser. Further, there's not great
> coordination between the zip detector and the zip parser...there are some
> areas where we could streamline the detect+parse steps.
> Let's create a standalone zip parser and improve the coordination between
> detection and parsing for zip files.