Hi, I've done an initial review and it looks OK to me! Before committing, I've added a couple of other committers to the loop; let's see if they have any insights, and in a couple of weeks we can merge!

Cheers
--------------------------
Alessandro Benedetti
Apache Lucene/Solr Committer
Director, R&D Software Engineer, Search Consultant
www.sease.io

On Mon, 15 Nov 2021 at 19:53, Spyros Kapnissis <ska...@gmail.com> wrote:
> Hi all,
>
> We recently identified and fixed an issue with the OpenNLP dictionary-based
> lemmatizer that seems to affect all versions. It resulted in generally high
> memory usage, random OOM exceptions, high server load during both indexing
> and querying, and overall unstable performance.
>
> It turns out that the issue was related to the way the dictionary is cached
> internally in Solr/Lucene: instead of caching the generated dictionary
> hashmap, the raw string contents were cached. As a result, the dictionary
> hashmap had to be re-generated in memory on every call to
> TokenFilterFactory.create(). In our case, the dictionary was pretty large,
> so the effects were magnified.
>
> I have submitted a patch on Lucene, but I am also posting here for
> visibility and in case someone can help review & merge. The patch is
> available here: https://github.com/apache/lucene/pull/380, and this is the
> corresponding ticket:
> https://issues.apache.org/jira/projects/LUCENE/issues/LUCENE-10171
>
> Please let me know if you need any more details.
>
> Thanks!
> Spyros
>
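For anyone following along, the caching pattern described in the quoted message can be sketched roughly as below. This is a minimal illustration, not Lucene's actual code: the class and method names (`DictionaryCache`, `getDictionary`, `parse`) are hypothetical. The buggy version cached the raw dictionary text, so every `TokenFilterFactory.create()` call re-parsed it into a fresh hashmap; the fix is to cache the parsed map itself and reuse it.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the fixed caching pattern (names are
// illustrative, not Lucene's real API).
public class DictionaryCache {

    // Keyed by resource name; stores the *parsed* dictionary map,
    // not its raw string contents.
    private static final Map<String, Map<String, String>> CACHE =
            new ConcurrentHashMap<>();

    /** Parse "word lemma" lines into a lookup map. */
    static Map<String, String> parse(String contents) {
        Map<String, String> dict = new HashMap<>();
        for (String line : contents.split("\n")) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length == 2) {
                dict.put(parts[0], parts[1]);
            }
        }
        return dict;
    }

    /**
     * Fixed pattern: parse the dictionary once on first use and
     * return the same map on every subsequent call, instead of
     * re-parsing the string contents each time.
     */
    static Map<String, String> getDictionary(String name, String contents) {
        return CACHE.computeIfAbsent(name, k -> parse(contents));
    }
}
```

With a large dictionary, re-running `parse()` on every factory `create()` call is what produced the high memory churn and OOMs described above; caching the parsed map makes the cost a one-time hit.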