Thank you, Alessandro, for your comments and for getting back so quickly. That sounds great!
On Tue, Nov 16, 2021 at 7:35 PM Alessandro Benedetti <a.benede...@sease.io> wrote:

> Hi,
> I've done an initial review and it looks ok to me!
> Before committing I added a couple of other committers to the loop; let's
> see if they have any insight, and in a couple of weeks we merge!
> Cheers
> --------------------------
> Alessandro Benedetti
> Apache Lucene/Solr Committer
> Director, R&D Software Engineer, Search Consultant
>
> www.sease.io
>
>
> On Mon, 15 Nov 2021 at 19:53, Spyros Kapnissis <ska...@gmail.com> wrote:
>
> > Hi all,
> >
> > We recently identified and fixed an issue with the OpenNLP
> > dictionary-based lemmatizer that seems to affect all versions. It
> > resulted in high memory usage, random OOM exceptions, high server load
> > during both indexing and querying, and overall unstable performance.
> >
> > It turns out that the issue was related to the way the dictionary is
> > cached internally in Solr/Lucene: instead of caching the generated
> > dictionary hashmap, the raw string contents were cached. As a result,
> > the dictionary hashmap had to be re-generated in memory on every call
> > to TokenFilterFactory.create(). In our case the dictionary was pretty
> > large, so the effects were magnified.
> >
> > I have submitted a patch on Lucene, but I am also posting here for
> > visibility and in case someone can help review & merge. The patch is
> > available here: https://github.com/apache/lucene/pull/380, and this is
> > the corresponding ticket:
> > https://issues.apache.org/jira/projects/LUCENE/issues/LUCENE-10171
> >
> > Please let me know if you need any more details.
> >
> > Thanks!
> > Spyros
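
For anyone reviewing, here is a minimal sketch of the caching pattern described in the quoted report: key the cache on the dictionary resource and store the parsed structure, not the raw file contents. The class and method names below are illustrative placeholders, not the actual Lucene/OpenNLP classes or the code in the patch.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical cache illustrating the fix: parse each dictionary resource
// once and reuse the resulting map, instead of caching the raw string
// contents and re-parsing them on every TokenFilterFactory.create() call.
public class LemmaDictionaryCache {

    // Maps a dictionary resource name to its parsed form (word -> lemmas).
    private static final Map<String, Map<String, String[]>> CACHE =
            new ConcurrentHashMap<>();

    // Stand-in for the OpenNLP dictionary-loading code.
    public interface DictionaryParser {
        Map<String, String[]> parse(String resourceName);
    }

    // Parses the dictionary only on the first request for a given resource;
    // subsequent calls return the already-built map.
    public static Map<String, String[]> get(String resourceName,
                                            DictionaryParser parser) {
        return CACHE.computeIfAbsent(resourceName, parser::parse);
    }
}

With this shape, only the first create() call for a given dictionary pays the parse cost; later calls share the same in-memory map, which avoids both the repeated CPU work and the duplicate copies that drive memory usage up.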