Hi all,

We recently identified and fixed an issue with the OpenNLP dictionary-based lemmatizer that appears to affect all versions. It caused high memory usage, random OOM exceptions, high server load during both indexing and querying, and overall unstable performance.
It turns out that the issue was related to the way the dictionary is cached internally in Solr/Lucene: instead of caching the generated dictionary hashmap, only the raw string contents of the dictionary file were cached. As a result, the hashmap had to be re-generated in memory on every call to TokenFilterFactory.create(). In our case the dictionary was fairly large, so the effects were magnified (see the sketch at the end of this message).

I have submitted a patch to Lucene, but I'm also posting here for visibility and in case someone can help review and merge. The patch is available here: https://github.com/apache/lucene/pull/380, and this is the corresponding ticket: https://issues.apache.org/jira/projects/LUCENE/issues/LUCENE-10171

Please let me know if you need any more details.

Thanks!
Spyros
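
P.S. For anyone curious, here is a minimal sketch of the difference between the two caching strategies. The class and method names are hypothetical, not the actual Lucene/Solr code; it only illustrates why caching the raw text instead of the parsed structure leads to repeated parsing and allocation:

  import java.io.IOException;
  import java.io.UncheckedIOException;
  import java.nio.file.Files;
  import java.nio.file.Paths;
  import java.util.HashMap;
  import java.util.Map;
  import java.util.concurrent.ConcurrentHashMap;

  // Hypothetical illustration of the bug -- not the actual Lucene/Solr code.
  public class LemmaDictionaryCache {

    // Buggy variant: only the raw file contents are cached, so every
    // TokenFilterFactory.create() call re-parses the text into a new hashmap.
    private static final Map<String, String> RAW_CACHE = new ConcurrentHashMap<>();

    static Map<String, String> getDictionaryBuggy(String path) {
      String raw = RAW_CACHE.computeIfAbsent(path, LemmaDictionaryCache::readFile);
      return parse(raw); // rebuilt on every call: repeated CPU work and heap churn
    }

    // Fixed variant: the parsed dictionary itself is cached, so the hashmap
    // is built once and shared by all filter instances.
    private static final Map<String, Map<String, String>> PARSED_CACHE =
        new ConcurrentHashMap<>();

    static Map<String, String> getDictionaryFixed(String path) {
      return PARSED_CACHE.computeIfAbsent(path, p -> parse(readFile(p)));
    }

    private static String readFile(String path) {
      try {
        return Files.readString(Paths.get(path));
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
    }

    // Parse "word<TAB>lemma" lines into a map; real lemma dictionaries also
    // carry a POS tag column, elided here for brevity.
    private static Map<String, String> parse(String raw) {
      Map<String, String> dict = new HashMap<>();
      for (String line : raw.split("\n")) {
        String[] cols = line.split("\t");
        if (cols.length >= 2) {
          dict.put(cols[0], cols[1]);
        }
      }
      return dict;
    }
  }

With a large dictionary, the buggy variant allocates a fresh hashmap for every analyzer instance, which matches the memory and GC behavior we observed.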