Thank you, Alessandro, for your comments and for getting back so quickly. That sounds great!
On Tue, Nov 16, 2021 at 7:35 PM Alessandro Benedetti <a.benede...@sease.io> wrote:

> Hi,
> I've done an initial review and it looks ok to me!
> Before committing I added a couple of other committers to the loop; let's
> see if they have any insight, and in a couple of weeks we merge!
> Cheers
> --------------------------
> Alessandro Benedetti
> Apache Lucene/Solr Committer
> Director, R&D Software Engineer, Search Consultant
>
> www.sease.io
>
>
> On Mon, 15 Nov 2021 at 19:53, Spyros Kapnissis <ska...@gmail.com> wrote:
>
> > Hi all,
> >
> > We recently identified and fixed an issue with the OpenNLP
> > dictionary-based lemmatizer that seems to affect all versions. It
> > resulted in high memory usage, random OOM exceptions, high server load
> > during both indexing and querying, and overall unstable performance.
> >
> > It turns out that the issue was related to the way the dictionary is
> > cached internally in Solr/Lucene: instead of caching the generated
> > dictionary hashmap, the raw string contents were cached. As a result,
> > the dictionary hashmap had to be re-generated in memory on every call
> > to TokenFilterFactory.create(). In our case the dictionary was pretty
> > large, so the effects were magnified.
> >
> > I have submitted a patch on Lucene, but I am also posting here for
> > visibility and in case someone can help review & merge. The patch is
> > available here: https://github.com/apache/lucene/pull/380, and this is
> > the corresponding ticket:
> > https://issues.apache.org/jira/projects/LUCENE/issues/LUCENE-10171
> >
> > Please let me know if you need any more details.
> >
> > Thanks!
> > Spyros
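
For anyone reviewing, here is a minimal sketch of the caching pattern described in the quoted report: key the cache on the dictionary resource and store the parsed structure, not the raw file contents. The class and method names below are illustrative placeholders, not the actual Lucene/OpenNLP classes or the code in the patch.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical cache illustrating the fix: parse each dictionary resource
// once and reuse the resulting map, instead of caching the raw string
// contents and re-parsing them on every TokenFilterFactory.create() call.
public class LemmaDictionaryCache {

    // Maps a dictionary resource name to its parsed form (word -> lemmas).
    private static final Map<String, Map<String, String[]>> CACHE =
            new ConcurrentHashMap<>();

    // Stand-in for the OpenNLP dictionary-loading code.
    public interface DictionaryParser {
        Map<String, String[]> parse(String resourceName);
    }

    // Parses the dictionary only on the first request for a given resource;
    // subsequent calls return the already-built map.
    public static Map<String, String[]> get(String resourceName,
                                            DictionaryParser parser) {
        return CACHE.computeIfAbsent(resourceName, parser::parse);
    }
}

With this shape, only the first create() call for a given dictionary pays the parse cost; later calls share the same in-memory map, which avoids both the repeated CPU work and the duplicate copies that drive memory usage up.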