Sandeep Kulkarni created TIKA-4388:
--------------------------------------

             Summary: Performance degradation observed in Tika 3.1.0
                 Key: TIKA-4388
                 URL: https://issues.apache.org/jira/browse/TIKA-4388
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 3.1.0
            Reporter: Sandeep Kulkarni


We are using Tika as a library and, after upgrading to 3.1.0, started observing 
a degradation in the time taken for text extraction. We see the degradation for 
many file types; one specific case is HTML files.

I used the 
[https://www.cs.cornell.edu/people/pabo/movie-review-data/polarity_html.zip] 
dataset from https://www.cs.cornell.edu/people/pabo/movie-review-data/.

On a test machine with 12 cores, I am getting many occurrences of the warning shown below:
{noformat}
[XMLReaderUtils] Contention waiting for a SAXParser. Consider increasing the 
XMLReaderUtils.POOL_SIZE{noformat}
Then I set the pool size equal to the number of available cores using a 
call to XMLReaderUtils.setPoolSize(). But that had an even worse effect on 
performance: the time taken increased to 2x the time taken earlier. I also 
started getting another warning, and even more frequently:
{noformat}
[XMLReaderUtils] SAXParser not taken back into pool.  If you haven't resized 
the pool this could be a sign that there are more calls to 'acquire' than to 
'release'{noformat}
It looks like the changes made in commit 
[https://github.com/apache/tika/commit/6305da41756e59dcf19e92acf70657624581cfe3]
 are somehow causing this behaviour.

With Tika 3.0.0, which we are currently using, I don't see any warnings and 
performance is also good.
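
For anyone trying to reproduce the timings, here is a minimal sketch of a multi-threaded timing harness. It is not the exact code we run: the class name ExtractionTimer, the Extractor hook and the helper methods are hypothetical names made up for illustration. The harness itself uses only the JDK; the comments in main() indicate where the Tika calls would be plugged in, assuming Tika is on the classpath.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ExtractionTimer {

    /** Hypothetical hook wrapping whatever extraction call is being measured. */
    public interface Extractor {
        String extract(Path file) throws Exception;
    }

    /** Collects all regular files ending in ".html" under the given root directory. */
    public static List<Path> listHtmlFiles(Path root) throws IOException {
        try (Stream<Path> walk = Files.walk(root)) {
            return walk.filter(Files::isRegularFile)
                       .filter(p -> p.getFileName().toString().endsWith(".html"))
                       .collect(Collectors.toList());
        }
    }

    /** Runs the extractor over all files on a fixed thread pool; returns elapsed milliseconds. */
    public static long timeExtraction(List<Path> files, int threads, Extractor extractor)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            long start = System.nanoTime();
            List<Future<String>> results = new ArrayList<>();
            for (Path file : files) {
                Callable<String> task = () -> extractor.extract(file);
                results.add(pool.submit(task));
            }
            for (Future<String> result : results) {
                result.get(); // wait for completion and surface any extraction failure
            }
            return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // args[0]: directory containing the unpacked polarity_html dataset
        List<Path> files = listHtmlFiles(Path.of(args[0]));
        int threads = Runtime.getRuntime().availableProcessors();

        // Placeholder extractor that only reads the file contents.
        // With Tika on the classpath, this would be replaced by something like:
        //   Tika tika = new Tika();
        //   Extractor extractor = tika::parseToString;
        // and, to try a larger parser pool before the run:
        //   XMLReaderUtils.setPoolSize(threads);
        Extractor extractor =
                p -> new String(Files.readAllBytes(p), StandardCharsets.ISO_8859_1);

        long millis = timeExtraction(files, threads, extractor);
        System.out.println(files.size() + " files, " + threads + " threads, " + millis + " ms");
    }
}
```

Running the same harness against 3.0.0 and 3.1.0, with and without the pool size change, should make the difference easy to compare.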

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
