Hi, I've done an initial review and it looks OK to me! Before committing, I've added a couple of other committers to the loop; let's see if they have any insights, and in a couple of weeks we can merge!

Cheers
--------------------------
Alessandro Benedetti
Apache Lucene/Solr Committer
Director, R&D Software Engineer, Search Consultant
www.sease.io

On Mon, 15 Nov 2021 at 19:53, Spyros Kapnissis <ska...@gmail.com> wrote:
> Hi all,
>
> We recently identified and fixed an issue with the OpenNLP dictionary-based
> lemmatizer that seems to affect all versions. It resulted in generally high
> memory usage, random OOM exceptions, high server load during both indexing
> and querying, and overall unstable performance.
>
> It turns out that the issue was related to the way the dictionary is cached
> internally in Solr/Lucene: instead of caching the generated dictionary
> hashmap, the raw string contents were cached. As a result, the dictionary
> hashmap had to be re-generated in memory on every call to
> TokenFilterFactory.create(). In our case, the dictionary was pretty large,
> so the effects were magnified.
>
> I have submitted a patch on Lucene, but I am also posting here for
> visibility and in case someone can help review & merge. The patch is
> available here: https://github.com/apache/lucene/pull/380, and this is the
> corresponding ticket:
> https://issues.apache.org/jira/projects/LUCENE/issues/LUCENE-10171
>
> Please let me know if you need any more details.
>
> Thanks!
> Spyros
>
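For anyone following along, the caching pattern described in the quoted message can be sketched roughly as below. This is a minimal illustration, not Lucene's actual code: the class and method names (`DictionaryCache`, `getDictionary`, `parse`) are hypothetical. The buggy version cached the raw dictionary text, so every `TokenFilterFactory.create()` call re-parsed it into a fresh hashmap; the fix is to cache the parsed map itself and reuse it.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the fixed caching pattern (names are
// illustrative, not Lucene's real API).
public class DictionaryCache {

    // Keyed by resource name; stores the *parsed* dictionary map,
    // not its raw string contents.
    private static final Map<String, Map<String, String>> CACHE =
            new ConcurrentHashMap<>();

    /** Parse "word lemma" lines into a lookup map. */
    static Map<String, String> parse(String contents) {
        Map<String, String> dict = new HashMap<>();
        for (String line : contents.split("\n")) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length == 2) {
                dict.put(parts[0], parts[1]);
            }
        }
        return dict;
    }

    /**
     * Fixed pattern: parse the dictionary once on first use and
     * return the same map on every subsequent call, instead of
     * re-parsing the string contents each time.
     */
    static Map<String, String> getDictionary(String name, String contents) {
        return CACHE.computeIfAbsent(name, k -> parse(contents));
    }
}
```

With a large dictionary, re-running `parse()` on every factory `create()` call is what produced the high memory churn and OOMs described above; caching the parsed map makes the cost a one-time hit.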