[
https://issues.apache.org/jira/browse/TIKA-4650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18056652#comment-18056652
]
Tim Allison commented on TIKA-4650:
-----------------------------------
The feature branch includes an integrity check that compares the entries found
while streaming against the entries in the central directory. It also checks for
duplicate paths.
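As a rough illustration of that kind of check (not the code on the feature branch), the sketch below uses only the JDK's java.util.zip classes. ZipInputStream walks the local file headers in stream order, and ZipFile reads the central directory. The class name ZipIntegritySketch and the Report holder are hypothetical names made up for this example, and the JDK classes are an assumption: the comment above does not say which zip library the branch uses.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Enumeration;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipInputStream;

// Hypothetical sketch; neither this class nor Report exists in Tika.
class ZipIntegritySketch {

    /** Hypothetical holder for the differences found between the two views. */
    static final class Report {
        final Set<String> onlyInStream = new TreeSet<>();
        final Set<String> onlyInCentralDirectory = new TreeSet<>();
        final Set<String> duplicatePaths = new TreeSet<>();

        boolean isConsistent() {
            return onlyInStream.isEmpty()
                    && onlyInCentralDirectory.isEmpty()
                    && duplicatePaths.isEmpty();
        }
    }

    static Report check(Path zip) throws IOException {
        Report report = new Report();

        // Pass 1: streaming view, driven by the local file headers.
        Set<String> streamed = new HashSet<>();
        try (InputStream in = Files.newInputStream(zip);
             ZipInputStream zis = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                if (!streamed.add(entry.getName())) {
                    report.duplicatePaths.add(entry.getName());
                }
            }
        }

        // Pass 2: central directory view, as exposed by ZipFile.
        Set<String> central = new HashSet<>();
        try (ZipFile zf = new ZipFile(zip.toFile())) {
            Enumeration<? extends ZipEntry> entries = zf.entries();
            while (entries.hasMoreElements()) {
                String name = entries.nextElement().getName();
                if (!central.add(name)) {
                    report.duplicatePaths.add(name);
                }
            }
        }

        // Record names that appear in one view but not the other.
        for (String name : streamed) {
            if (!central.contains(name)) {
                report.onlyInStream.add(name);
            }
        }
        for (String name : central) {
            if (!streamed.contains(name)) {
                report.onlyInCentralDirectory.add(name);
            }
        }
        return report;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: ZipIntegritySketch <file.zip>");
            return;
        }
        Report r = check(Paths.get(args[0]));
        System.out.println("consistent: " + r.isConsistent());
        System.out.println("only in stream: " + r.onlyInStream);
        System.out.println("only in central directory: " + r.onlyInCentralDirectory);
        System.out.println("duplicate paths: " + r.duplicatePaths);
    }
}
```

The sketch compares names only. A fuller check would presumably also compare sizes, CRCs, and offsets, and how duplicate names in the central directory are surfaced can depend on the zip library and its version.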
The feature branch also extracts much more metadata from each entry.
I think there's an easy fix on 3.x that I'll apply if the benchmarks indicate
that it helps.
> Improve zip parsing in 4.x
> --------------------------
>
> Key: TIKA-4650
> URL: https://issues.apache.org/jira/browse/TIKA-4650
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
>
> Zip parsing has a number of quirks that require special processing. Over time
> those have accreted in the PackageParser. Further, coordination between the
> zip detector and the zip parser is weak, and there are some areas where we
> could streamline the detect+parse steps.
> Let's create a standalone zip parser and improve the coordination between
> detection and parsing for zip files.