[ https://issues.apache.org/jira/browse/TIKA-4388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17930680#comment-17930680 ]

Tim Allison commented on TIKA-4388:
-----------------------------------

With tika-app in batch mode ({{java -jar tika-app-3.1.0.jar -J -t -i input -o output}}), I'm not getting the warning with 15 workers (16 cores) on that data set. And the performance is roughly 4 seconds for both 3.1.0 and 3.0.0.

When I drop the pool size down to 2, I do get the warning, but with only a trivial hit on performance (4.4 seconds for 3.1.0 and 4.2 seconds for 3.0.0).

I'm not doubting your findings! I need more info to be able to replicate. Thank you, again.

> Performance degradation observed in Tika 3.1.0
> ----------------------------------------------
>
>                 Key: TIKA-4388
>                 URL: https://issues.apache.org/jira/browse/TIKA-4388
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 3.1.0
>            Reporter: Sandeep Kulkarni
>            Assignee: Tim Allison
>            Priority: Major
>
> We are using Tika as a library and, after upgrading to 3.1.0, started observing
> degradation in the time taken for text extraction. We are observing degradation
> for many file types, but one specific case is HTML files.
> I used the
> [https://www.cs.cornell.edu/people/pabo/movie-review-data/polarity_html.zip]
> dataset from https://www.cs.cornell.edu/people/pabo/movie-review-data/.
> On a test machine with 12 cores, I am getting many instances of the warning shown below:
> {noformat}
> [XMLReaderUtils] Contention waiting for a SAXParser. Consider increasing the
> XMLReaderUtils.POOL_SIZE{noformat}
> Then I set the pool size equal to the number of cores available using a
> call to XMLReaderUtils.setPoolSize(). But that had an even worse effect on
> performance: the time taken increased to 2x what it was earlier. I also started
> getting another warning, and that one more frequently.
> {noformat}
> [XMLReaderUtils] SAXParser not taken back into pool. If you haven't resized
> the pool this could be a sign that there are more calls to 'acquire' than to
> 'release'{noformat}
> It looks like the changes made in commit
> [https://github.com/apache/tika/commit/6305da41756e59dcf19e92acf70657624581cfe3]
> are somehow causing this behaviour.
> With Tika 3.0.0, which we are currently using, I don't see any warnings and
> performance is also good.
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)