We have started to implement named entity recognition on top of
AnalyzingSuggester, which offers great support for synonyms, stopwords, etc.
For this, we slightly modified AnalyzingSuggester.lookup() to return only the
exactFirst hits: we run the exactFirst code block only and skip its
'sameSurfaceForm' check and break, so that the synonym hits are returned too.

This works pretty well. Our next step is to add some fuzziness to tolerate
spelling mistakes. The idea was to do exactly the same, but with
FuzzySuggester instead.

The problem is that 'EXACT_FIRST' in FuzzySuggester is still strictly exact.
It does not only exclude terms that merely share the query as a prefix:
misspelled terms that differ from the query within the edit distance are
also considered 'not exact'. As a result, we get the same results as with
AnalyzingSuggester.


query: "screen"
misspelled query: "screan"
dictionary: "screen", "screensaver"

AnalyzingSuggester hits: screen, screensaver
AnalyzingSuggester hits on misspelled query: <empty>
AnalyzingSuggester EXACT_FIRST hits: screen
AnalyzingSuggester EXACT_FIRST hits on misspelled query: <empty>

FuzzySuggester hits: screen, screensaver
FuzzySuggester hits on misspelled query: screen, screensaver
FuzzySuggester EXACT_FIRST hits: screen
FuzzySuggester EXACT_FIRST hits on misspelled query: <empty> => TARGET: screen
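
To make the TARGET behavior concrete, here is a minimal sketch of a
post-filter in plain Java. It is not part of Lucene: the class
FuzzyExactFilter and its methods are hypothetical names, and it assumes the
surface forms returned by the ordinary (non-exact) FuzzySuggester lookup are
available as strings. It keeps only hits whose whole surface form is within
maxEdits of the query, so prefix-only matches are dropped. It compares
surface forms rather than analyzed forms, so it would not cover synonyms or
stopwords, and it uses plain Levenshtein distance without transpositions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not a Lucene class.
public class FuzzyExactFilter {

    // Plain Levenshtein distance (insert, delete, substitute), two-row DP.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Keep only hits whose full surface form is within maxEdits of the query.
    static List<String> fuzzyExact(String query, List<String> hits, int maxEdits) {
        List<String> out = new ArrayList<>();
        for (String hit : hits) {
            if (editDistance(query, hit) <= maxEdits) {
                out.add(hit);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed input: the plain FuzzySuggester hits from the example.
        List<String> hits = Arrays.asList("screen", "screensaver");
        System.out.println(fuzzyExact("screan", hits, 1)); // [screen]
        System.out.println(fuzzyExact("screen", hits, 1)); // [screen]
    }
}
```

This only illustrates the desired result. The real question remains whether
the same distinction can be made inside the FST lookup itself.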


Is there a way to distinguish these two cases? I see that the 'exact'
criterion relies on an FST property, namely whether an END_BYTE arc leaves
the current node. Maybe these arcs can be set differently when building the
Levenshtein automata? I have no clue.
