[
https://issues.apache.org/jira/browse/TIKA-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17694096#comment-17694096
]
ASF GitHub Bot commented on TIKA-3979:
--------------------------------------
apismensky commented on PR #985:
URL: https://github.com/apache/tika/pull/985#issuecomment-1446743975
I was going to submit this issue last week.
My observation was similar: a lot of overhead around BitSet, in both memory
allocations and CPU.
We switched from Tika 1.27 to 2.7.0.
For one of the files we saw the difference:
Extraction took: 2199 ( tika 1.27) vs
Extraction took: 27010 ( tika 2.7.0)
Both are in ms, so 2.7.0 is more than 10 times slower (roughly 12x).
The original file size is 50.5 MB.
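To illustrate the kind of overhead described above, here is a minimal,
stdlib-only sketch. It is not the OneNoteParser code: the class and method
names are hypothetical, and it assumes the hot path resembles decoding a
little-endian 32-bit value from a byte array. It contrasts a decode that goes
through a freshly allocated BitSet with one that uses plain shifts.

```java
import java.util.BitSet;

// Hypothetical illustration only; not taken from Tika's OneNoteParser.
class BitSetOverheadSketch {

    // Decodes an unsigned little-endian 32-bit value via BitSet.
    // Each call allocates a temporary byte[], a BitSet, and a long[].
    static long decodeViaBitSet(byte[] data, int offset) {
        byte[] slice = new byte[4];
        System.arraycopy(data, offset, slice, 0, 4);
        long[] words = BitSet.valueOf(slice).toLongArray();
        // toLongArray() returns an empty array when no bits are set.
        return words.length == 0 ? 0L : words[0];
    }

    // Decodes the same value with shifts and masks; no allocation.
    static long decodeDirect(byte[] data, int offset) {
        return (data[offset] & 0xFFL)
                | ((data[offset + 1] & 0xFFL) << 8)
                | ((data[offset + 2] & 0xFFL) << 16)
                | ((data[offset + 3] & 0xFFL) << 24);
    }

    public static void main(String[] args) {
        byte[] data = new byte[4 * 1_000_000];
        new java.util.Random(42).nextBytes(data);

        long sumBitSet = 0;
        long t0 = System.nanoTime();
        for (int i = 0; i < data.length; i += 4) {
            sumBitSet += decodeViaBitSet(data, i);
        }
        long t1 = System.nanoTime();

        long sumDirect = 0;
        for (int i = 0; i < data.length; i += 4) {
            sumDirect += decodeDirect(data, i);
        }
        long t2 = System.nanoTime();

        // Both paths must agree; the timings are only a rough indication.
        System.out.println("results match: " + (sumBitSet == sumDirect));
        System.out.println("BitSet path ms: " + (t1 - t0) / 1_000_000);
        System.out.println("direct path ms: " + (t2 - t1) / 1_000_000);
    }
}
```

The timings from a loop like this are only indicative (no JIT warm-up, no
benchmark harness); a profiler run against the real parser, as in the issue
description, is the better evidence.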
> OneNoteParser - Improve performance for deserialization
> -------------------------------------------------------
>
> Key: TIKA-3979
> URL: https://issues.apache.org/jira/browse/TIKA-3979
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 2.7.0
> Reporter: David Xie
> Priority: Major
> Attachments: image-2023-02-20-14-42-10-590.png,
> image-2023-02-25-12-01-40-311.png
>
>
> We noticed some performance issues specific to parsing OneNote files. Our CPU
> profiler reports that the parser spends a lot of time deserializing byte
> arrays (image included below).
> !image-2023-02-20-14-42-10-590.png!
--
This message was sent by Atlassian Jira
(v8.20.10#820010)