[ https://issues.apache.org/jira/browse/TIKA-4388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17930665#comment-17930665 ]
Tim Allison commented on TIKA-4388:
-----------------------------------

Also, I'm guessing that you're getting this in the detector, not the parse, because these are HTML files?

> Performance degradation observed in Tika 3.1.0
> ----------------------------------------------
>
>                 Key: TIKA-4388
>                 URL: https://issues.apache.org/jira/browse/TIKA-4388
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 3.1.0
>            Reporter: Sandeep Kulkarni
>            Priority: Major
>
> We are using Tika as a library and, after upgrading to 3.1.0, started observing
> a degradation in the time taken for text extraction. We see the degradation
> for many file types, but one specific case is HTML files.
> I used the
> [https://www.cs.cornell.edu/people/pabo/movie-review-data/polarity_html.zip]
> dataset from https://www.cs.cornell.edu/people/pabo/movie-review-data/.
> On a test machine with 12 cores, I am getting many occurrences of the warning shown below:
> {noformat}
> [XMLReaderUtils] Contention waiting for a SAXParser. Consider increasing the XMLReaderUtils.POOL_SIZE
> {noformat}
> I then set the pool size equal to the number of available cores with a
> call to XMLReaderUtils.setPoolSize(). But that had an even worse effect on
> performance: the time taken doubled compared to before. I also started getting
> another warning, and even more frequently:
> {noformat}
> [XMLReaderUtils] SAXParser not taken back into pool. If you haven't resized the pool this could be a sign that there are more calls to 'acquire' than to 'release'
> {noformat}
> It looks like the changes made in commit
> [https://github.com/apache/tika/commit/6305da41756e59dcf19e92acf70657624581cfe3]
> are somehow causing this behaviour.
> With Tika 3.0.0, which we are currently using, I don't see any warnings and
> performance is also good.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
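
For readers following the thread, here is a minimal sketch of the kind of bounded parser pool that could produce the two warnings quoted in the issue. It is an illustration under stated assumptions, not Tika's XMLReaderUtils code: the class name ParserPoolSketch and its acquire/release/available methods are hypothetical, and only the JDK's own javax.xml.parsers and java.util.concurrent classes are used. A timed-out acquire stands in for the "Contention waiting for a SAXParser" case, and a rejected release stands in for the "SAXParser not taken back into pool" case.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.TimeUnit;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;

/** Hypothetical fixed-size SAXParser pool; NOT Tika's implementation. */
final class ParserPoolSketch {
    private final ArrayBlockingQueue<SAXParser> pool;

    ParserPoolSketch(int size) throws ParserConfigurationException, SAXException {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        pool = new ArrayBlockingQueue<>(size);
        // Pre-fill the pool so that at most 'size' parsers are ever handed out at once.
        for (int i = 0; i < size; i++) {
            pool.offer(factory.newSAXParser());
        }
    }

    /**
     * Waits up to timeoutMillis for a free parser.
     * Returns null if none became available, which is the "contention" case.
     */
    SAXParser acquire(long timeoutMillis) throws InterruptedException {
        return pool.poll(timeoutMillis, TimeUnit.MILLISECONDS);
    }

    /**
     * Hands a parser back. Returns false if the pool is already full,
     * which is the "not taken back into pool" case.
     */
    boolean release(SAXParser parser) {
        return pool.offer(parser);
    }

    /** Number of parsers currently idle in the pool. */
    int available() {
        return pool.size();
    }
}
```

In this sketch, when more threads call acquire than there are parsers, the extra callers wait out the timeout and get null. A release is rejected only when the pool is already full, for example after the pool has been replaced by a new, pre-filled one while parsers from the old one were still checked out. Whether either effect explains the slowdown reported here is not established by this sketch.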