[ https://issues.apache.org/jira/browse/TIKA-4427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17955280#comment-17955280 ]
Tim Allison edited comment on TIKA-4427 at 5/30/25 6:06 PM:
------------------------------------------------------------

Thank you for opening this issue and identifying the source of the memory leak.

We're calling reset on the parser when we return it to the pool:
https://github.com/apache/tika/blob/branch_3x/tika-core/src/main/java/org/apache/tika/utils/XMLReaderUtils.java#L1064

Any idea why this might not be enough? Is reset not working (e.g. is this a Java problem), or are we not doing enough to clear the parser between calls?

From the last screenshot, you're using the default Java XML parser, and it looks like the cache only has 10 of them...so that's good.


was (Author: talli...@mitre.org):
Thank you for opening this issue and identifying the source of the memory leak.

We're calling reset on the parser when we return it to the pool:
https://github.com/apache/tika/blob/branch_3x/tika-core/src/main/java/org/apache/tika/utils/XMLReaderUtils.java#L1064

Any idea why this might not be enough? Is reset not working

From the last screenshot, you're using the default java xml parser, and it looks like the cache only has 10 of them...so that's good.


> Memory Leak when parsing a large (110K+) number of documents
> --------------------------------------------------------------
>
>                 Key: TIKA-4427
>                 URL: https://issues.apache.org/jira/browse/TIKA-4427
>             Project: Tika
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 3.2.0
>            Reporter: Tim Barrett
>            Priority: Major
>         Attachments: Screenshot 2025-05-30 at 17.22.38.png,
>                      Screenshot 2025-05-30 at 18.31.01.png,
>                      Screenshot 2025-05-30 at 18.31.47.png
>
>
> When parsing a very large number of documents, which include a lot of eml
> files, we see that the static field XMLReaderUtils.SAX_PARSERS is holding a
> massive amount of memory: 3.28 GB. This is a static pool of cached SAXParser
> instances, each of which is holding onto substantial amounts of memory,
> apparently in the fDocumentHandler field.
> This is a big data test we run regularly; the memory issues did not occur in
> Tika version 2.x.
>
> I have attached JVM monitor screenshots.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
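
For readers following the question in the comment above (whether calling reset on a pooled parser is enough), here is a minimal sketch of a SAXParser pool that resets each parser on return and, as a defensive extra step, points the underlying XMLReader at a shared no-op handler so the caller's handler is no longer referenced by the idle parser. This is not the code in XMLReaderUtils: the class name SaxParserPoolSketch, its methods, and the ElementCounter helper are all hypothetical. Whether this extra step would release the memory reported in fDocumentHandler is an untested assumption.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only; not Tika's XMLReaderUtils.
public class SaxParserPoolSketch {
    // Shared handler that does nothing; installed on idle parsers so they
    // no longer reference the previous caller's handler.
    private static final DefaultHandler NO_OP = new DefaultHandler();

    private final BlockingQueue<SAXParser> pool;

    public SaxParserPoolSketch(int size) throws ParserConfigurationException, SAXException {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        pool = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            pool.add(factory.newSAXParser());
        }
    }

    // Blocks until a parser is available.
    public SAXParser acquire() throws InterruptedException {
        return pool.take();
    }

    // Resets the parser, swaps in no-op handlers, and returns it to the pool.
    public void release(SAXParser parser) {
        try {
            parser.reset();
        } catch (UnsupportedOperationException e) {
            // The base SAXParser.reset() throws this if the implementation
            // does not override it; fall through to the handler swap.
        }
        try {
            XMLReader reader = parser.getXMLReader();
            reader.setContentHandler(NO_OP);
            reader.setDTDHandler(NO_OP);
            reader.setEntityResolver(NO_OP);
            reader.setErrorHandler(NO_OP);
        } catch (SAXException e) {
            // getXMLReader() failed; nothing more to clear in this sketch.
        }
        pool.offer(parser);
    }

    // Small handler used to demonstrate a parse: counts start tags.
    public static class ElementCounter extends DefaultHandler {
        public int count;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            count++;
        }
    }

    // Borrows a parser, counts the elements in the given XML, and always
    // returns the parser to the pool.
    public int countElements(String xml) throws IOException, SAXException, InterruptedException {
        SAXParser parser = acquire();
        try {
            ElementCounter counter = new ElementCounter();
            parser.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), counter);
            return counter.count;
        } finally {
            release(parser);
        }
    }
}
```

A heap dump taken after release() with and without the handler swap would show whether the idle parsers still retain the per-document state; if they do, the retention is likely deeper in the parser implementation than the handler references this sketch clears.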