[jira] [Commented] (TIKA-4388) Performance degradation observed in Tika 3.1.0

Tim Allison (Jira) Wed, 26 Feb 2025 07:48:10 -0800


    [ 
https://issues.apache.org/jira/browse/TIKA-4388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17930738#comment-17930738
 ]


Tim Allison commented on TIKA-4388:
-----------------------------------

I created a multithreaded test to try to reproduce this: 
https://github.com/apache/tika/blob/TIKA-4388/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/mime/TestMimeTypesMultithreaded.java

If you want to revert to the old XMLReaderUtils, copy/paste from here: 
https://raw.githubusercontent.com/apache/tika/035682cdd9e993cd441f005f62a3b36f410c50b6/tika-core/src/main/java/org/apache/tika/utils/XMLReaderUtils.java
 or simply revert to that commit.

If you can replicate the behavior there, let me know what you did. You'll have 
to modify the path to your test files, then select the number of threads 
and the exact {{detect-then-parse}} options that you're using.

I'm generally finding a 15x slowdown when calling 
{{mimeTypes.getMimeType(File f)}} because that creates a new AutoDetectParser 
and a new TikaConfig on every call. If you are doing that, switch to 
{{mimeTypes.detect(InputStream, Metadata)}} and you'll be much better off.
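
To make the comparison concrete, here is a minimal sketch of a multithreaded harness in the same spirit: it runs one detection function over a list of files on a fixed number of threads. It is not the linked test; {{DetectBench}}, {{run}} and the stand-in function in {{main}} are hypothetical names, and the Tika wiring is shown only in a comment (untested here) so the block runs with the JDK alone.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/**
 * Hypothetical harness: applies one shared detection function to every file
 * using a fixed-size thread pool.
 *
 * Assumed Tika wiring (needs tika-core on the classpath, not compiled here):
 *
 *   MimeTypes mimeTypes = MimeTypes.getDefaultMimeTypes(); // built once, shared
 *   Function<Path, String> detect = p -> {
 *       try (InputStream is = TikaInputStream.get(p)) {
 *           return mimeTypes.detect(is, new Metadata()).toString();
 *       } catch (IOException e) {
 *           throw new UncheckedIOException(e);
 *       }
 *   };
 */
final class DetectBench {

    /** Returns the detected type for each file, in the same order as the input. */
    static List<String> run(List<Path> files, int numThreads,
                            Function<Path, String> detect) {
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (Path f : files) {
                Callable<String> task = () -> detect.apply(f);
                futures.add(pool.submit(task));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> fut : futures) {
                results.add(fut.get()); // blocks until that task has finished
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        } catch (ExecutionException e) {
            throw new RuntimeException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Placeholder paths and a stand-in detector keyed on the file extension.
        List<Path> files = Arrays.asList(Paths.get("a.html"), Paths.get("b.txt"));
        Function<Path, String> standIn =
                p -> p.toString().endsWith(".html") ? "text/html" : "text/plain";

        long start = System.nanoTime();
        List<String> types = run(files, 4, standIn);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(types + " in " + elapsedMs + " ms");
    }
}
```

The point of the shape is that whatever is expensive to build is created once, outside {{run}}, and only the per-file call happens inside the worker threads. Timing the same file list with a function that rebuilds its detector on every call, and again with one shared instance, should show whether the per-call construction is what you are paying for.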

Let me know what you find.

> Performance degradation observed in Tika 3.1.0
> ----------------------------------------------
>
>                 Key: TIKA-4388
>                 URL: https://issues.apache.org/jira/browse/TIKA-4388
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 3.1.0
>            Reporter: Sandeep Kulkarni
>            Assignee: Tim Allison
>            Priority: Major
>
> We are using Tika as a library and, after upgrading to 3.1.0, started observing 
> a degradation in the time taken for text extraction. We are observing the 
> degradation for many file types, but one specific case is HTML files.
> I used 
> [https://www.cs.cornell.edu/people/pabo/movie-review-data/polarity_html.zip] 
> dataset from https://www.cs.cornell.edu/people/pabo/movie-review-data/.
> On a test machine with 12 cores, I am getting many of the warnings shown below:
> {noformat}
> [XMLReaderUtils] Contention waiting for a SAXParser. Consider increasing the 
> XMLReaderUtils.POOL_SIZE{noformat}
> Then I set the pool size equal to the number of available cores using a 
> call to XMLReaderUtils.setPoolSize(). But that had an even worse effect on 
> performance: the time taken increased to 2x what it was earlier. I also 
> started getting another warning, and even more frequently:
> {noformat}
> [XMLReaderUtils] SAXParser not taken back into pool.  If you haven't resized 
> the pool this could be a sign that there are more calls to 'acquire' than to 
> 'release'{noformat}
> It looks like the changes made in commit 
> [https://github.com/apache/tika/commit/6305da41756e59dcf19e92acf70657624581cfe3]
>  are somehow causing this behaviour.
> With Tika 3.0.0 which we are currently using, I don't see any warning and 
> performance is also good.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
