[ https://issues.apache.org/jira/browse/TIKA-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17694096#comment-17694096 ]

ASF GitHub Bot commented on TIKA-3979:
--------------------------------------

apismensky commented on PR #985:
URL: https://github.com/apache/tika/pull/985#issuecomment-1446743975

   I was going to submit this issue last week.
   My observation was similar: a lot of overhead around BitSet, in both
   memory allocations and CPU.
   We switched from Tika 1.27 to 2.7.0.
   For one of the files we saw this difference:
   Extraction took: 2199 (Tika 1.27) vs
   Extraction took: 27010 (Tika 2.7.0)

   Both are in ms, so 2.7.0 is more than 12 times slower.
   The original file size is 50.5 MB.

> OneNoteParser - Improve performance for deserialization
> -------------------------------------------------------
>
>                 Key: TIKA-3979
>                 URL: https://issues.apache.org/jira/browse/TIKA-3979
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 2.7.0
>            Reporter: David Xie
>            Priority: Major
>         Attachments: image-2023-02-20-14-42-10-590.png, 
> image-2023-02-25-12-01-40-311.png
>
>
> We noticed some performance issues specific to parsing OneNote files. Our CPU 
> profiler reports that the parser spends a lot of time deserializing byte 
> arrays (image included below).
> !image-2023-02-20-14-42-10-590.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
