[ https://issues.apache.org/jira/browse/TIKA-4427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17955416#comment-17955416 ]
Tim Barrett commented on TIKA-4427:
-----------------------------------

It looks to me as though reset isn't working. The pooled instances all seem to hold a reference chain of the form saxParser -> xmlReader -> fDocumentSource -> fDocumentHandler -> fDocumentSource -> ... which seems to go down to a great depth. This happened originally with Java 21; I went back to Java 11 but the problem remained. When I said this didn't happen under Tika 2.x, I'm not sure that was accurate: the last time we ran a big data test was a few months ago, so I'm not sure which version of Tika we were on then.

I have put a workaround in XMLReaderUtils: I always call getSAXParser(). There is no noticeable performance hit and no more memory is leaked. It begs the question of why a pool was needed at all.

    /**
     * This checks context for a user specified {@link SAXParser}. If one is not
     * found, this reuses a SAXParser from the pool.
     *
     * @param is             InputStream to parse
     * @param contentHandler handler to use; this wraps an {@link OfflineContentHandler}
     *                       around the content handler as an extra layer of defense
     *                       against external entity vulnerabilities
     * @param context        context to use
     * @throws TikaException
     * @throws IOException
     * @throws SAXException
     * @since Apache Tika 1.19
     *
     * Workaround Tim Barrett Nalanda 31/05/2025 - always get a new SAX
     * parser due to memory leak in XMLReader
     */
    public static void parseSAX(InputStream is, ContentHandler contentHandler, ParseContext context)
            throws TikaException, IOException, SAXException {
        SAXParser saxParser = context.get(SAXParser.class);
        PoolSAXParser poolSAXParser = null;
        // Pool path disabled as a workaround: always build a fresh parser
        // instead of borrowing one from the leaking pool.
        // if (saxParser == null) {
        //     poolSAXParser = acquireSAXParser();
        //     if (poolSAXParser != null) {
        //         saxParser = poolSAXParser.getSAXParser();
        //     } else {
        saxParser = getSAXParser();
        //     }
        // }
        try {
            saxParser.parse(is, new OfflineContentHandler(contentHandler));
        } finally {
            if (poolSAXParser != null) {
                releaseParser(poolSAXParser);
            }
        }
    }

> Memory Leak when parsing a large (110K+) number of documents
> -------------------------------------------------------------
>
> Key: TIKA-4427
> URL: https://issues.apache.org/jira/browse/TIKA-4427
> Project: Tika
> Issue Type: Bug
> Components: core
> Affects Versions: 3.2.0
> Reporter: Tim Barrett
> Priority: Major
> Attachments: Screenshot 2025-05-30 at 17.22.38.png, Screenshot 2025-05-30 at 18.31.01.png, Screenshot 2025-05-30 at 18.31.47.png
>
> When parsing a very large number of documents, which include a lot of eml
> files, we see that the static field XMLReaderUtils.SAX_PARSERS is holding a
> massive amount of memory: 3.28 GB. This is a static pool of cached SAXParser
> instances, each of which is holding onto substantial amounts of memory,
> apparently in the fDocumentHandler field.
>
> This is a big data test we run regularly; the memory issues did not occur in
> Tika version 2.x.
>
> I have attached JVM monitor screenshots.
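For readers weighing the pool-versus-fresh-parser trade-off described above: if the pool were kept, one possible mitigation is to detach handler references when a parser is returned, so the XMLReader cannot pin the previous document's handler chain between uses. The following is a minimal sketch, not Tika's actual fix; the ParserPoolHygiene class and scrubHandlers method are hypothetical names, and whether replacing the SAX-level handlers actually severs the Xerces-internal fDocumentSource -> fDocumentHandler chain seen in the heap dumps would need to be verified against the Xerces implementation.

    import javax.xml.parsers.SAXParser;
    import org.xml.sax.SAXException;
    import org.xml.sax.XMLReader;
    import org.xml.sax.helpers.DefaultHandler;

    // Hypothetical sketch: scrub a parser before returning it to a shared
    // pool, replacing every handler reference with a fresh no-op
    // DefaultHandler so the last document's ContentHandler (and whatever it
    // references) becomes garbage-collectable while the parser sits idle.
    final class ParserPoolHygiene {
        static void scrubHandlers(SAXParser saxParser) throws SAXException {
            XMLReader reader = saxParser.getXMLReader();
            DefaultHandler noop = new DefaultHandler();
            reader.setContentHandler(noop);
            reader.setDTDHandler(noop);
            reader.setErrorHandler(noop);
            reader.setEntityResolver(noop);
        }
    }

If scrubbing like this did not release the retained memory, that would support the observation in the comment that reset isn't working inside the parser itself, in which case the always-call-getSAXParser() workaround above is the safer option.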