[ 
https://issues.apache.org/jira/browse/TIKA-4388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17930664#comment-17930664
 ] 

Tim Allison commented on TIKA-4388:
-----------------------------------

Thank you for identifying this problem, opening this ticket and sharing a 
corpus for us to work with. How are you calling Tika? Programmatically with the 
parse method? tika-app, tika-pipes, tika-server?

Some file formats require 2x (or more) XML parsers per file. If you set the pool 
size to 4x your number of CPUs/threads, does that help?
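
As a minimal sketch of that sizing, assuming one parsing thread per core: the 
PoolSizing class and suggestedPoolSize helper below are hypothetical names used 
only for illustration. XMLReaderUtils.setPoolSize() is the call mentioned in the 
ticket; it is left as a comment so the snippet compiles without tika-core on the 
classpath.

```java
// Sketch only: PoolSizing and suggestedPoolSize are hypothetical, not part of Tika.
class PoolSizing {

    // Pool size as a multiple of the number of parsing threads.
    static int suggestedPoolSize(int parsingThreads, int multiplier) {
        if (parsingThreads < 1 || multiplier < 1) {
            throw new IllegalArgumentException("threads and multiplier must be >= 1");
        }
        return Math.multiplyExact(parsingThreads, multiplier);
    }

    public static void main(String[] args) {
        // Assumption: one parsing thread per available core.
        int threads = Runtime.getRuntime().availableProcessors();
        int poolSize = suggestedPoolSize(threads, 4);
        System.out.println("Suggested XMLReaderUtils pool size: " + poolSize);

        // With tika-core on the classpath, apply it once at startup, before parsing:
        // org.apache.tika.utils.XMLReaderUtils.setPoolSize(poolSize);
    }
}
```

If the JVM reports 12 available processors, as on the test machine described in 
the ticket, this gives a pool size of 48.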

Thank you, again.

> Performance degradation observed in Tika 3.1.0
> ----------------------------------------------
>
>                 Key: TIKA-4388
>                 URL: https://issues.apache.org/jira/browse/TIKA-4388
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 3.1.0
>            Reporter: Sandeep Kulkarni
>            Priority: Major
>
> We are using Tika as a library, and after upgrading to 3.1.0 we started 
> observing a degradation in the time taken for text extraction. We see the 
> degradation for many file types, but one specific case is HTML files.
> I used the 
> [https://www.cs.cornell.edu/people/pabo/movie-review-data/polarity_html.zip] 
> dataset from https://www.cs.cornell.edu/people/pabo/movie-review-data/.
> On a test machine with 12 cores, I am getting many warnings like the one below:
> {noformat}
> [XMLReaderUtils] Contention waiting for a SAXParser. Consider increasing the 
> XMLReaderUtils.POOL_SIZE{noformat}
> Then I set the pool size equal to the number of available cores by calling 
> XMLReaderUtils.setPoolSize(). That made performance even worse: extraction 
> took 2x the time it took earlier. I also started getting another warning, 
> and even more frequently:
> {noformat}
> [XMLReaderUtils] SAXParser not taken back into pool.  If you haven't resized 
> the pool this could be a sign that there are more calls to 'acquire' than to 
> 'release'{noformat}
> It looks like the changes made in commit 
> [https://github.com/apache/tika/commit/6305da41756e59dcf19e92acf70657624581cfe3]
> are somehow causing this behaviour.
> With Tika 3.0.0, which we are currently using, I don't see any warnings, and 
> performance is good.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
