[ https://issues.apache.org/jira/browse/TIKA-4650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18056655#comment-18056655 ]

Tim Allison commented on TIKA-4650:
-----------------------------------

Claude benchmark update with the fix applied to 3.x

Here's the updated JIRA table with the 3.x + openContainer optimization, by Claude:

h3. ZIP Parser Benchmark Results (Final)

  h4. Test Files

  || Name || Entries || Size ||
  | Small | 10 | 10 KB |
  | Medium | 1,000 | 97 MB |
  | Large | 5,000 | 2,441 MB (~2.4 GB) |

  h4. DefaultHandler Mode

  || Branch || Small (10 entries, 10 KB) || Medium (1,000 entries, 97 MB) || Large (5,000 entries, 2.4 GB) ||
  | Tika 3.x (original) | 11.4 ms | 728 ms | 4059 ms |
  | Tika 3.x (+ openContainer fix) | 12.0 ms | 663 ms | *2986 ms* |
  | Main (4.x) | 7.6 ms | 625 ms | 3589 ms |
  | Feature (4.x) | 7.8 ms | 580 ms | 3378 ms |

  h4. RecursiveParserWrapper Mode

  || Branch || Small (10 entries, 10 KB) || Medium (1,000 entries, 97 MB) || Large (5,000 entries, 2.4 GB) ||
  | Tika 3.x (original) | 12.8 ms | 842 ms | 4170 ms |
  | Tika 3.x (+ openContainer fix) | 13.0 ms | 618 ms | *2961 ms* |
  | Main (4.x) | 7.5 ms | 622 ms | 3645 ms |
  | Feature (4.x) | 8.4 ms | 595 ms | 3453 ms |
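
  The large-file reduction for 3.x can be recomputed from the two tables above. A minimal sketch, where the class and method names are purely illustrative and not part of Tika; it works out to roughly 26.4% in DefaultHandler mode and 29.0% in RecursiveParserWrapper mode:

```java
import java.util.Locale;

public class SpeedupCheck {

    /** Percentage reduction in elapsed time going from beforeMs to afterMs. */
    static double reductionPercent(double beforeMs, double afterMs) {
        return 100.0 * (beforeMs - afterMs) / beforeMs;
    }

    public static void main(String[] args) {
        // Large ZIP timings for Tika 3.x (original) vs. 3.x (+ openContainer fix),
        // copied from the tables above.
        System.out.printf(Locale.ROOT, "DefaultHandler: %.1f%%%n",
                reductionPercent(4059, 2986));
        System.out.printf(Locale.ROOT, "RecursiveParserWrapper: %.1f%%%n",
                reductionPercent(4170, 2961));
    }
}
```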

  h4. Key Findings
  * The openContainer optimization gives Tika 3.x a *26-29% speedup* on large ZIPs
  * With this fix, 3.x large ZIP performance (*2961-2986 ms*) is actually *faster* than 4.x main (3589-3645 ms)
  * The fix requires two changes:
  ** DefaultZipContainerDetector: Store the ZipFile in openContainer even for plain ZIPs (it was previously being closed)
  ** PackageParser: Check openContainer for an existing ZipFile before creating a new ArchiveInputStream
  * Small ZIPs show no improvement (setup cost dominates)
  * 4.x still has an edge on small/medium ZIPs due to Java 17 optimizations
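
  The two-part fix described above (the detector keeps the opened ZipFile, the parser reuses it) can be sketched as follows. This is only an illustration of the hand-off pattern, not the actual patch: SharedInput is a hypothetical stand-in for the stream wrapper that carries openContainer, java.util.zip.ZipFile is used in place of the commons-compress class so the example runs on a bare JDK, and the detect/parse methods and the opens counter are made up for the demo.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

public class OpenContainerSketch {

    /** Hypothetical stand-in for a stream wrapper that can carry an opened container. */
    static final class SharedInput {
        final Path path;
        Object openContainer; // set by the detector, read by the parser
        SharedInput(Path path) { this.path = path; }
    }

    /** Counts how often the ZIP is actually opened (demo instrumentation only). */
    static int opens = 0;

    static ZipFile open(Path path) throws IOException {
        opens++;
        return new ZipFile(path.toFile());
    }

    /** Detector step: open the ZIP once and keep it for the parser instead of closing it. */
    static String detect(SharedInput in) throws IOException {
        in.openContainer = open(in.path);
        return "application/zip";
    }

    /** Parser step: reuse the detector's ZipFile if present, otherwise open the file itself. */
    static List<String> parse(SharedInput in) throws IOException {
        ZipFile zip = (in.openContainer instanceof ZipFile)
                ? (ZipFile) in.openContainer
                : open(in.path);
        try (ZipFile z = zip) {
            List<String> names = new ArrayList<>();
            for (ZipEntry e : Collections.list(z.entries())) {
                names.add(e.getName());
            }
            return names;
        } finally {
            in.openContainer = null; // the container is closed, so drop the reference
        }
    }

    /** Writes a tiny two-entry ZIP to a temp file for the demo. */
    static Path writeSampleZip() throws IOException {
        Path p = Files.createTempFile("sketch", ".zip");
        try (ZipOutputStream out = new ZipOutputStream(Files.newOutputStream(p))) {
            for (String name : new String[] {"a.txt", "b.txt"}) {
                out.putNextEntry(new ZipEntry(name));
                out.write(name.getBytes(StandardCharsets.UTF_8));
                out.closeEntry();
            }
        }
        return p;
    }

    public static void main(String[] args) throws IOException {
        Path p = writeSampleZip();
        SharedInput in = new SharedInput(p);
        detect(in);
        System.out.println(parse(in) + " opened " + opens + " time(s)");
        Files.delete(p);
    }
}
```

  With the hand-off in place the file is opened once across detect + parse; without it (openContainer left null) the parser falls back to opening the file a second time.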

> Improve zip parsing in 4.x
> --------------------------
>
>                 Key: TIKA-4650
>                 URL: https://issues.apache.org/jira/browse/TIKA-4650
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>
> Zip parsing has a number of quirks that require special processing. Over time 
> those have accreted in the PackageParser. Further, there's not great 
> coordination between the zip detector and the zip parser...there are some 
> areas where we could streamline the detect+parse steps.
> Let's create a standalone zip parser and improve the coordination between 
> detection and parsing for zip files.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
