Hi all,

We recently identified and fixed an issue with the OpenNLP dictionary-based lemmatizer that appears to affect all versions. It caused high memory usage, random OOM exceptions, high server load during both indexing and querying, and overall unstable performance.
It turns out that the issue was related to the way the dictionary is cached internally in Solr/Lucene: instead of caching the generated dictionary hashmap, only the raw string contents of the dictionary file were cached. As a result, the hashmap had to be re-generated in memory on every call to TokenFilterFactory.create(). In our case the dictionary was fairly large, so the effects were magnified (see the sketch at the end of this message).

I have submitted a patch to Lucene, but I'm also posting here for visibility and in case someone can help review and merge. The patch is available here: https://github.com/apache/lucene/pull/380, and this is the corresponding ticket: https://issues.apache.org/jira/projects/LUCENE/issues/LUCENE-10171

Please let me know if you need any more details.

Thanks!
Spyros
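
P.S. For anyone curious, here is a minimal sketch of the difference between the two caching strategies. The class and method names are hypothetical, not the actual Lucene/Solr code; it only illustrates why caching the raw text instead of the parsed structure leads to repeated parsing and allocation:

  import java.io.IOException;
  import java.io.UncheckedIOException;
  import java.nio.file.Files;
  import java.nio.file.Paths;
  import java.util.HashMap;
  import java.util.Map;
  import java.util.concurrent.ConcurrentHashMap;

  // Hypothetical illustration of the bug -- not the actual Lucene/Solr code.
  public class LemmaDictionaryCache {

    // Buggy variant: only the raw file contents are cached, so every
    // TokenFilterFactory.create() call re-parses the text into a new hashmap.
    private static final Map<String, String> RAW_CACHE = new ConcurrentHashMap<>();

    static Map<String, String> getDictionaryBuggy(String path) {
      String raw = RAW_CACHE.computeIfAbsent(path, LemmaDictionaryCache::readFile);
      return parse(raw); // rebuilt on every call: repeated CPU work and heap churn
    }

    // Fixed variant: the parsed dictionary itself is cached, so the hashmap
    // is built once and shared by all filter instances.
    private static final Map<String, Map<String, String>> PARSED_CACHE =
        new ConcurrentHashMap<>();

    static Map<String, String> getDictionaryFixed(String path) {
      return PARSED_CACHE.computeIfAbsent(path, p -> parse(readFile(p)));
    }

    private static String readFile(String path) {
      try {
        return Files.readString(Paths.get(path));
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
    }

    // Parse "word<TAB>lemma" lines into a map; real lemma dictionaries also
    // carry a POS tag column, elided here for brevity.
    private static Map<String, String> parse(String raw) {
      Map<String, String> dict = new HashMap<>();
      for (String line : raw.split("\n")) {
        String[] cols = line.split("\t");
        if (cols.length >= 2) {
          dict.put(cols[0], cols[1]);
        }
      }
      return dict;
    }
  }

With a large dictionary, the buggy variant allocates a fresh hashmap for every analyzer instance, which matches the memory and GC behavior we observed.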