Hi Luke,
Thank you for your work and for sharing this information. From my point
of view, lemmatization is just one use case of text token annotation. I
have been working with Lucene since 2006 to index lexicographic and
linguistic data, and I have always missed two things: (1) token
attributes are not searchable, and (2) it is not straightforward to get
all text tokens indexed at the same position (synonyms) directly from a
span query (ideas and suggestions are welcome!). I think the NLP
community would be grateful if Lucene offered a simple way to search on
token annotations (attributes). The MTAS project achieves this
(https://github.com/textexploration/mtas), built on Lucene, and supports
the CQL query language
(https://meertensinstituut.github.io/mtas/search_cql.html). MTAS is an
inspiring project I came across recently, and you might draw inspiration
from it too. But I am currently hesitant to use it because I have no
guarantee that its authors will port their code to support new Lucene
versions. I might come up with my own solution, but without (2) I don't
yet see how I could achieve it simply without redoing what MTAS already
did!
Thank you.
Benoit
On 2022-11-19 at 22:26, Luke Kot-Zaniewski (BLOOMBERG/ 919 3RD A) wrote:
Greetings,
I would greatly appreciate anyone sharing their experience doing
NLP/lemmatization and am also very curious to gauge the opinion of the lucene
community regarding open-nlp. I know there are a few other libraries out there,
some of which can’t be directly included in the lucene project because of
licensing issues. If anyone has any suggestions/experiences, please do share
them :-)
As a side note I’ll add that I’ve been experimenting with open-nlp’s
PoS/lemmatization capabilities via lucene’s integration. During the process I
uncovered some issues which made me question whether open-nlp is the right tool
for the job. The first issue was a “low-hanging bug” that would most likely
have been addressed sooner if this solution were more popular; this simple bug
was at least 5 years old -> https://github.com/apache/lucene/issues/11771
The second issue has more to do with the open-nlp library itself. It is not
thread-safe in some very unexpected ways. Looking at the library internals
reveals unsynchronized lazy initialization of shared components. Unfortunately
the lucene integration kind of sweeps this under the rug by wrapping everything
in a pretty big synchronized block; here is an example:
https://github.com/apache/lucene/blob/main/lucene/analysis/opennlp/src/java/org/apache/lucene/analysis/opennlp/tools/NLPPOSTaggerOp.java#L36
This is problematic in itself because these functions run in really tight loops
and probably shouldn’t be blocking. Even if one did decide to do blocking
initialization, it could still be done at a much lower level than it is
currently. From what I gather, the functions that are synchronized at the
lucene level could be made thread-safe in a much more performant way if they
were fixed in open-nlp.
But I am also starting to doubt if this is worth pursuing since I don't know
whether anyone would find this useful, hence the original inquiry.
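For what it's worth, a common alternative to serializing every call through one big synchronized block is to share the immutable model but give each thread its own (non-thread-safe) tagger instance via ThreadLocal. Below is a stdlib-only sketch of that pattern under stated assumptions: Model and Tagger are hypothetical stand-ins, not the actual open-nlp classes, and merely mirror the split between a shareable model and stateful per-use machinery:

```java
import java.util.Locale;

// Hypothetical stand-in for an expensive-to-load but immutable model;
// immutability makes it safe to share across threads.
final class Model {
    final String name;
    Model(String name) { this.name = name; }
}

// Hypothetical stand-in for a tagger: it holds mutable scratch state,
// so a single instance must not be shared between threads.
final class Tagger {
    private final Model model;
    private final StringBuilder scratch = new StringBuilder(); // mutable -> not thread-safe

    Tagger(Model model) { this.model = model; }

    String tag(String token) {
        scratch.setLength(0);
        scratch.append(token.toUpperCase(Locale.ROOT)).append("/NN");
        return scratch.toString();
    }
}

public class PerThreadTagger {
    // Load the model once and share it.
    private static final Model MODEL = new Model("en-pos");

    // Each thread lazily gets its own Tagger, so the hot tagging path
    // needs no synchronized block at all.
    private static final ThreadLocal<Tagger> TAGGER =
        ThreadLocal.withInitial(() -> new Tagger(MODEL));

    public static String tag(String token) {
        return TAGGER.get().tag(token);
    }

    public static void main(String[] args) throws InterruptedException {
        Runnable work = () -> System.out.println(
            Thread.currentThread().getName() + ": " + tag("house"));
        Thread t1 = new Thread(work), t2 = new Thread(work);
        t1.start(); t2.start();
        t1.join(); t2.join();
    }
}
```

Whether this works for open-nlp depends on which of its components are actually safe to share, which is exactly the question its unsynchronized lazy initialization makes hard to answer.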
I’ll add that I have separately used the open-nlp sentence break iterator
(which suffers from the same problem:
https://github.com/apache/lucene/blob/main/lucene/analysis/opennlp/src/java/org/apache/lucene/analysis/opennlp/tools/NLPSentenceDetectorOp.java#L39
) at production scale and saw really bad performance under certain
conditions, which I attribute to this unnecessary synchronization. I suspect
this may have impacted others as well:
https://stackoverflow.com/questions/42960569/indexing-taking-long-time-when-using-opennlp-lemmatizer-with-solr
Many thanks,
Luke Kot-Zaniewski