[
https://issues.apache.org/jira/browse/TIKA-4650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18056650#comment-18056650
]
Tim Allison commented on TIKA-4650:
-----------------------------------
Benchmarks by Claude.
Complete results, including Tika 3.x:
h3. ZIP Parser Benchmark Results
h4. DefaultHandler Mode
||Branch||Small (10 entries)||Medium (1000 entries)||Large (5000 entries)||
|Tika 3.x|11.395 ms|728 ms|4059 ms|
|Main (4.x)|7.558 ms|625 ms|3589 ms|
|Feature (4.x)|7.790 ms|580 ms|3378 ms|
h4. RecursiveParserWrapper Mode
||Branch||Small (10 entries)||Medium (1000 entries)||Large (5000 entries)||
|Tika 3.x|12.810 ms|842 ms|4170 ms|
|Main (4.x)|7.485 ms|622 ms|3645 ms|
|Feature (4.x)|8.444 ms|595 ms|3453 ms|
h4. Performance Comparison: Feature (4.x) vs Tika 3.x
||Mode||Small||Medium||Large||
|DefaultHandler|32% faster|20% faster|17% faster|
|RecursiveParserWrapper|34% faster|29% faster|17% faster|
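The percentages in this table can be re-derived from the timing tables. A minimal sketch, assuming the comparison is feature branch vs Tika 3.x and that "X% faster" means the relative reduction in elapsed time, rounded to the nearest integer. The class and method names are made up for illustration:

```java
final class ZipBenchSpeedup {

    // Relative reduction in elapsed time vs the baseline, as a rounded percentage.
    // A negative result means the candidate was slower than the baseline.
    static long percentFaster(double baselineMs, double candidateMs) {
        return Math.round((baselineMs - candidateMs) / baselineMs * 100.0);
    }

    public static void main(String[] args) {
        String[] sizes = {"Small", "Medium", "Large"};

        // Timings in ms, copied from the tables in this comment.
        double[] tika3Default   = {11.395, 728, 4059};
        double[] featureDefault = {7.790, 580, 3378};
        double[] tika3Rpw       = {12.810, 842, 4170};
        double[] featureRpw     = {8.444, 595, 3453};

        for (int i = 0; i < sizes.length; i++) {
            System.out.printf("%-6s DefaultHandler: %d%%  RecursiveParserWrapper: %d%%%n",
                    sizes[i],
                    percentFaster(tika3Default[i], featureDefault[i]),
                    percentFaster(tika3Rpw[i], featureRpw[i]));
        }
    }
}
```

The same helper applied to main vs feature gives the medium/large deltas quoted in the summary, and a negative value for the small ZIPs.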
h4. Key Findings
* Feature branch (4.x) with full metadata extraction + integrity checking
outperforms Tika 3.x at every size and main (4.x) on medium and large ZIPs
* Tika 4.x main is already significantly faster than 3.x, by 12-42% in these
runs (likely due to Java 17 baseline and other improvements)
* The new ZipParser adds metadata extraction and integrity checking with *no
performance penalty* on medium and large ZIPs
* Small ZIPs show minimal overhead vs main (~0.2-1 ms) from the new architecture
* Larger ZIPs benefit from the ZipFile-based approach with detector hints
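To illustrate the last point: a ZipFile-style reader needs a seekable file and lists entries from the central directory at the end of the archive, whereas a streaming reader walks the local file headers in order. The sketch below uses only the JDK's {{java.util.zip}} as a stand-in. It is not the feature branch's code, and the class and method names are hypothetical:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class ZipAccessSketch {

    // Writes a throwaway ZIP with the given number of small text entries.
    static Path writeSampleZip(int entries) throws IOException {
        Path zip = Files.createTempFile("zip-sketch", ".zip");
        try (OutputStream out = Files.newOutputStream(zip);
             ZipOutputStream zos = new ZipOutputStream(out)) {
            for (int i = 0; i < entries; i++) {
                zos.putNextEntry(new ZipEntry("entry-" + i + ".txt"));
                zos.write(("payload " + i).getBytes(StandardCharsets.UTF_8));
                zos.closeEntry();
            }
        }
        return zip;
    }

    // Random access: entry metadata comes from the central directory.
    static int countViaZipFile(Path zip) throws IOException {
        try (ZipFile zf = new ZipFile(zip.toFile())) {
            int n = 0;
            Enumeration<? extends ZipEntry> e = zf.entries();
            while (e.hasMoreElements()) {
                e.nextElement();
                n++;
            }
            return n;
        }
    }

    // Streaming: entries are discovered one local header at a time.
    static int countViaStream(Path zip) throws IOException {
        try (InputStream in = Files.newInputStream(zip);
             ZipInputStream zis = new ZipInputStream(in)) {
            int n = 0;
            while (zis.getNextEntry() != null) {
                n++;
            }
            return n;
        }
    }

    // Returns {count via ZipFile, count via ZipInputStream} for a fresh sample archive.
    static int[] countBothWays(int entries) {
        try {
            Path zip = writeSampleZip(entries);
            try {
                return new int[] {countViaZipFile(zip), countViaStream(zip)};
            } finally {
                Files.deleteIfExists(zip);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Both paths see the same entries on a well-formed archive. The difference is the access pattern, which is where a detector that already knows the input is a ZIP on disk could steer the parser toward the file-based path.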
Summary: The feature branch is 17-34% faster than Tika 3.x across all sizes and
4-7% faster than main (4.x) on medium/large ZIPs, while adding metadata
extraction and integrity checking capabilities. On small ZIPs it is 0.2-1 ms
slower than main.
> Improve zip parsing in 4.x
> --------------------------
>
> Key: TIKA-4650
> URL: https://issues.apache.org/jira/browse/TIKA-4650
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
>
> Zip parsing has a number of quirks that require special processing. Over time
> those have accreted in the PackageParser. Further, there's not great
> coordination between the zip detector and the zip parser...there are some
> areas where we could streamline the detect+parse steps.
> Let's create a standalone zip parser and improve the coordination between
> detection and parsing for zip files.