[
https://issues.apache.org/jira/browse/LUCENE-7287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15323800#comment-15323800
]
Andriy Rysin commented on LUCENE-7287:
--------------------------------------
Ok, I've imported lucene-solr and the Ukrainian analyzer project from
[~mr_gambal] into Eclipse and looked through the code.
Unfortunately we can't use the whole morfologik package as is, since it's very
specific to Polish. We could probably still use part of morfologik for a compact
dictionary representation. The whole Ukrainian dictionary in this format with
POS tags is ~1.6MB compared to 98MB in CSV, and we could probably make it
smaller if we strip the tags.
There are several things I'd like to note:
1) this dictionary covers inflections (not related words), so this stemming
will produce lemmas rather than root words (this is probably ok, and in some
cases even better?)
2) as this is dictionary-based stemming, it won't stem unknown words (but the
dictionary contains ~200K lemmas, so it should give good output)
3) as Ukrainian has a high level of inflection (nouns produce up to 7 forms,
direct verbs up to 20, reverse verbs up to 30 forms), with many rules and
exceptions, developing quality rule-based stemming will not be trivial
4) I was planning to work on the Ukrainian analyzer in a separate project, but
if it's better for the review process I can fork lucene-solr and work inside
the fork
5) I am thinking of creating org.apache.lucene.analysis.uk classes based on
[~mr_gambal]'s work and the csv file we have and, once that is working, trying
a more compact representation
The question: once we have it working, shall we include the dictionary in the
lucene project or make it an external dependency (like with
morfologik-polish.jar)? The first is simpler, but the second will allow easy
updates for the dictionary (which I can see being actively developed for
another year or two) and will also keep the binary blob out of the project. I
am leaning towards the second but am open to discussion.
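To illustrate points 2 and 5 above, here is a minimal sketch of dictionary-based
lemmatization, written in Python for brevity rather than as Lucene code. The
two-column "word form,lemma" layout, the sample entries and the names
load_lemma_map and lemmatize are assumptions made for illustration only; the
actual csv file may be structured differently.

```python
import csv
import io

# Hypothetical sample data: "word form,lemma" pairs.
# The real dictionary csv may use a different layout.
SAMPLE_CSV = """\
книги,книга
книгу,книга
книгою,книга
читаю,читати
читав,читати
"""


def load_lemma_map(csv_text):
    """Build a lowercase word-form -> lemma mapping from CSV text."""
    lemma_map = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        form = row[0].strip().lower()
        lemma = row[1].strip().lower()
        lemma_map[form] = lemma
    return lemma_map


def lemmatize(token, lemma_map):
    """Return the lemma of a known form; pass unknown tokens through unchanged."""
    return lemma_map.get(token.lower(), token)


lemma_map = load_lemma_map(SAMPLE_CSV)
for word in ["книгу", "Читав", "невідомеслово"]:
    print(word, "->", lemmatize(word, lemma_map))
```

Known forms map to their lemma, while a word missing from the dictionary comes
back unchanged, which is the limitation noted in point 2.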
> New lemma-tizer plugin for ukrainian language.
> ----------------------------------------------
>
> Key: LUCENE-7287
> URL: https://issues.apache.org/jira/browse/LUCENE-7287
> Project: Lucene - Core
> Issue Type: New Feature
> Components: modules/analysis
> Reporter: Dmytro Hambal
> Priority: Minor
> Labels: analysis, language, plugin
>
> Hi all,
> I wonder whether you are interested in supporting a plugin which provides a
> mapping between ukrainian word forms and their lemmas. Some tests and docs go
> out-of-the-box =) .
> https://github.com/mrgambal/elasticsearch-ukrainian-lemmatizer
> It's really simple but still works and generates some value for its users.
> More: https://github.com/elastic/elasticsearch/issues/18303
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)