How to use a Hunspell dictionary to do the reverse of stemming?

2017-10-24 Thread julien Blaize
Hello,

I am looking for a way to efficiently do the reverse of stemming.
Example: if I give the program the verb "drug", it will give me
"drugged", "drugging", "drugs", "drugstore", etc.

I have used the wordforms program from Hunspell to generate all possible
combinations for the input word (even the ridiculous ones that do not
match a real word). Then I use the org.apache.lucene.analysis.hunspell.Dictionary
class to check whether each candidate exists and maps back to the original word.
This is really slow and inefficient.
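
For illustration, the generate-and-check step can be precomputed once instead
of repeated per query: stem every word of a known word list and keep a map
from each stem back to its surface forms. The sketch below is only that idea
in plain Java. ReverseStemIndex, toyStem and the word list are hypothetical
names, and the toy suffix stripper merely stands in for a real
Hunspell-backed stemmer.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.function.Function;

// Hypothetical helper (not part of Lucene or Hunspell): inverts a stemmer
// over a known word list so that a stem can be expanded to its forms.
class ReverseStemIndex {
    private final Map<String, SortedSet<String>> formsByStem = new HashMap<>();

    // The stemmer is passed in as a function; a real one may return
    // several stems for an ambiguous word, hence the List.
    ReverseStemIndex(Collection<String> words, Function<String, List<String>> stemmer) {
        for (String word : words) {
            for (String stem : stemmer.apply(word)) {
                formsByStem.computeIfAbsent(stem, k -> new TreeSet<>()).add(word);
            }
        }
    }

    // Returns every indexed word whose stem equals the given stem.
    SortedSet<String> expand(String stem) {
        return formsByStem.getOrDefault(stem, Collections.emptySortedSet());
    }

    // Toy stand-in for a real stemmer: strips a few English suffixes.
    static List<String> toyStem(String word) {
        for (String suffix : new String[] {"ging", "ged", "s"}) {
            if (word.endsWith(suffix) && word.length() > suffix.length() + 2) {
                String stem = word.substring(0, word.length() - suffix.length());
                return Collections.singletonList(stem);
            }
        }
        return Collections.singletonList(word);
    }

    public static void main(String[] args) {
        List<String> words =
            Arrays.asList("drug", "drugged", "drugging", "drugs", "dog", "dogs");
        ReverseStemIndex index = new ReverseStemIndex(words, ReverseStemIndex::toyStem);
        System.out.println(index.expand("drug")); // [drug, drugged, drugging, drugs]
    }
}
```

The index costs one stemming pass over the word list, after which each lookup
is a single map access. A compound such as "drugstore" is only returned if the
stemmer in use actually reduces it to "drug".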

I was looking at the internals of the Dictionary class and saw the use of
patterns and FSTs (finite state transducers). This seems a very efficient way
to look up the stem of a word, but I was unable to find a way to do the
reverse operation.

I am wondering if anyone has tried to do something similar? Can someone
who understands FSTs and the usage of patterns in the Dictionary class give
me hints as to whether what I am trying to do is possible and would be
efficient?

Kind Regards.

--
Julien Blaize


Re: WordDelimiterGraphFilter swallows emojis

2018-07-03 Thread julien Blaize
Hello Michael,

I previously worked on emoji detection with Lucene.

I had to extend the Tokenizer class (and not TokenFilter, as
WordDelimiterFilter does) to preserve the delimiter attribute.
I also had to keep track of consecutive delimiters in the character stream
because Lucene's default implementation only keeps the last one.

Maybe it can put you on the right track to start by looking at the
Tokenizer instead of the TokenFilter.

By the way, I used the emoji list from this project to detect sequences of
characters:
https://github.com/jolicode/emoji-search/blob/master/synonyms/cldr-emoji-annotation-synonyms-fr.txt
I scan the characters and, while the current sequence is still a possible
emoji, I keep tracking it; once I have a full emoji I put it in the
CharTermAttribute so it is treated as a word and not as a delimiter.
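
A minimal sketch of that tracking logic, in plain Java rather than as an
actual Lucene Tokenizer subclass: EmojiAwareSplitter and its methods are
hypothetical names, the emoji list is assumed to be already loaded into a
collection of strings, and letters and digits are the only other word
characters considered.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: emits known emoji sequences as their own tokens and
// treats every other non-letter, non-digit character as a delimiter.
class EmojiAwareSplitter {
    private final Set<String> emoji = new HashSet<>();
    private final Set<String> prefixes = new HashSet<>();

    EmojiAwareSplitter(Collection<String> emojiSequences) {
        for (String seq : emojiSequences) {
            emoji.add(seq);
            // Record each code-point-aligned prefix so partial matches can be tracked.
            int end = 0;
            while (end < seq.length()) {
                end += Character.charCount(seq.codePointAt(end));
                prefixes.add(seq.substring(0, end));
            }
        }
    }

    // Builds a string from code points, e.g. cp(0x1F44D) for the thumbs-up sign.
    static String cp(int... codePoints) {
        return new String(codePoints, 0, codePoints.length);
    }

    // Length in chars of the longest known emoji starting at pos, or 0 if none.
    int longestEmojiAt(String text, int pos) {
        int best = 0;
        int end = pos;
        while (end < text.length()) {
            end += Character.charCount(text.codePointAt(end));
            String candidate = text.substring(pos, end);
            if (!prefixes.contains(candidate)) {
                break; // no known emoji can start with this sequence
            }
            if (emoji.contains(candidate)) {
                best = end - pos; // full emoji seen; keep going for a longer one
            }
        }
        return best;
    }

    List<String> split(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            int emojiLength = longestEmojiAt(text, i);
            if (emojiLength > 0) {
                flush(word, tokens);
                tokens.add(text.substring(i, i + emojiLength));
                i += emojiLength;
                continue;
            }
            int codePoint = text.codePointAt(i);
            if (Character.isLetterOrDigit(codePoint)) {
                word.appendCodePoint(codePoint);
            } else {
                flush(word, tokens); // delimiter ends the current word
            }
            i += Character.charCount(codePoint);
        }
        flush(word, tokens);
        return tokens;
    }

    private static void flush(StringBuilder word, List<String> tokens) {
        if (word.length() > 0) {
            tokens.add(word.toString());
            word.setLength(0);
        }
    }
}
```

Keeping the prefix set next to the full set is what allows multi-code-point
emoji (for example sequences joined with U+200D) to be matched as one token
instead of being cut at the first code point that is itself an emoji.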

Regards
--
Julien Blaize


On Tue, 3 Jul 2018 at 14:00, Michael Sokolov wrote:

> WDGF (and WordDelimiterFilter) treat emoji as "SUBWORD_DELIM" characters
> like punctuation and thus remove them, but we would like to be able to
> search for emoji and use this filter for handling dashes, dots and other
> intra-word punctuation.
>
> These filters identify non-word and non-digit characters by two mechanisms:
> direct lookup in a character table, and fallback to Unicode class. The
> character table can't easily be used to handle emoji since it would need to
> be populated with the entire Unicode character set in order to reach
> emoji-land. On the other hand, if we change the handling of emoji by class,
> and say treat them as word-characters, this will also end up pulling in all
> the other OTHER_SYMBOL characters as well. Maybe that's OK, but I think
> some of these other symbols are more like punctuation (this class is a grab
> bag of all kinds of beautiful dingbats like trademark, degrees-symbols, etc
> https://www.compart.com/en/unicode/category/So). On the other other hand,
> how do we even identify emoji? I don't think the Java Character API is
> adequate to the task. Perhaps we must incorporate a table.
>
> Suppose we come up with a good way to classify emoji; then how should they
> be treated in this class? Sometimes they may be embedded in tokens with
> other characters: I see people using emoji and other symbols as part of
> their names, and sometimes they stand alone (with whitespace separation). I
> think one way forward here would be to treat these as a special class akin
> to words and numbers, and provide similar options (SPLIT_ON_EMOJI,
> CATENATE_EMOJI) as we have for those classes.
>
> Or maybe as a convenience, we provide a way to get a table that encodes the
> default classifications of all characters up to some given limit, and then
> let the caller modify it? That would at least provide an easy way to treat
> emoji as letters.
>
> Any thoughts?
>