[
https://issues.apache.org/jira/browse/LUCENE-5030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13693931#comment-13693931
]
Michael McCandless commented on LUCENE-5030:
--------------------------------------------
Hmm, testStolenBytes should be using the 0x1f byte ... the intention
of the test is to ensure that an incoming token that contains
SEP_LABEL still works correctly (i.e., that the escaping we do is
working).
When I change the 0xff in the patch back to 0x1f I indeed see the
(unexpected) failure without the PRESERVE_SEP option, which is curious
because we do no escaping without PRESERVE_SEP.
OK I see the issue: before, when POS_SEP was 256 and the input space
was a byte, replaceSep always worked correctly because there was no
way for any byte input to be confused with POS_SEP. But now that we
are increasing the input space to all Unicode chars, there is no
"safe" value for POS_SEP.
OK given all this I think we should stop trying to not-steal the byte:
I think we should simply declare we steal both 0x1e and 0x1f. This
means we can remove the escaping code, put back your previous code
that I had asked you to remove (sorry) that threw IAE on 0x1f (and now
also 0x1e), remove testStolenBytes, and then improve your new
testIllegalLookupArgument to also verify 0x1f gets the
IllegalArgumentException?
Also, we could maybe eliminate some code dup here, e.g. the two
toFiniteStrings ... maybe by having TS2A and TS2UA share a base class
/ interface. Hmm, maybe we should just merge TS2UA back into TS2A,
and add a unicodeAware option to it?
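If TS2UA were merged back into TS2A, the unicodeAware option might reduce to choosing how a term's text becomes arc labels: one label per UTF-8 byte, or one per Unicode code point. A rough self-contained illustration (names hypothetical, no Lucene types):

```java
import java.nio.charset.StandardCharsets;

// Sketch of the suggested merge: a single converter with a
// unicodeAware flag instead of separate TS2A and TS2UA classes.
class LabelEmitter {
    private final boolean unicodeAware;

    LabelEmitter(boolean unicodeAware) {
        this.unicodeAware = unicodeAware;
    }

    // One automaton arc label per code point (unicodeAware)
    // or per UTF-8 byte (legacy byte-space behavior).
    int[] labels(String termText) {
        if (unicodeAware) {
            return termText.codePoints().toArray();
        }
        byte[] utf8 = termText.getBytes(StandardCharsets.UTF_8);
        int[] out = new int[utf8.length];
        for (int i = 0; i < utf8.length; i++) {
            out[i] = utf8[i] & 0xff;  // unsigned byte label
        }
        return out;
    }
}
```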
> FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work
> correctly for 1-byte (like English) and multi-byte (non-Latin) letters
> ------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: LUCENE-5030
> URL: https://issues.apache.org/jira/browse/LUCENE-5030
> Project: Lucene - Core
> Issue Type: Bug
> Affects Versions: 4.3
> Reporter: Artem Lukanin
> Attachments: benchmark-INFO_SEP.txt, benchmark-old.txt,
> benchmark-wo_convertion.txt, nonlatin_fuzzySuggester1.patch,
> nonlatin_fuzzySuggester2.patch, nonlatin_fuzzySuggester3.patch,
> nonlatin_fuzzySuggester4.patch, nonlatin_fuzzySuggester_combo1.patch,
> nonlatin_fuzzySuggester_combo2.patch, nonlatin_fuzzySuggester_combo.patch,
> nonlatin_fuzzySuggester.patch, nonlatin_fuzzySuggester.patch,
> nonlatin_fuzzySuggester.patch, run-suggest-benchmark.patch
>
>
> There is a limitation in the current FuzzySuggester implementation: it
> computes edits in UTF-8 space instead of Unicode character (code point)
> space.
> This should be fixable: we'd need to fix TokenStreamToAutomaton to work in
> Unicode character space, then fix FuzzySuggester to do the same steps that
> FuzzyQuery does: do the LevN expansion in Unicode character space, then
> convert that automaton to UTF-8, then intersect with the suggest FST.
> See the discussion here:
> http://lucene.472066.n3.nabble.com/minFuzzyLength-in-FuzzySuggester-behaves-differently-for-English-and-Russian-td4067018.html#none
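The byte-space vs code-point-space mismatch described above can be shown with a small standalone check (plain Java, no Lucene classes; names are illustrative): a one-character Cyrillic substitution counts as two edits when distance is measured over UTF-8 bytes, but one edit over code points, which is why English and Russian behave differently at the same minFuzzyLength.

```java
import java.nio.charset.StandardCharsets;

// Demonstrates why edit distance over UTF-8 bytes differs from edit
// distance over Unicode code points: one Cyrillic char is two UTF-8
// bytes, so a single "real" edit costs two byte-level edits.
class EditDistanceDemo {

    // Classic dynamic-programming Levenshtein over int labels.
    static int levenshtein(int[] a, int[] b) {
        int[] prev = new int[b.length + 1];
        int[] cur = new int[b.length + 1];
        for (int j = 0; j <= b.length; j++) prev[j] = j;
        for (int i = 1; i <= a.length; i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length; j++) {
                int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1),
                                  prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = cur; cur = tmp;
        }
        return prev[b.length];
    }

    // Distance over UTF-8 bytes (the pre-fix behavior).
    static int byteDistance(String s, String t) {
        byte[] sb = s.getBytes(StandardCharsets.UTF_8);
        byte[] tb = t.getBytes(StandardCharsets.UTF_8);
        int[] sa = new int[sb.length], ta = new int[tb.length];
        for (int i = 0; i < sb.length; i++) sa[i] = sb[i] & 0xff;
        for (int i = 0; i < tb.length; i++) ta[i] = tb[i] & 0xff;
        return levenshtein(sa, ta);
    }

    // Distance over Unicode code points (what the fix moves to).
    static int codePointDistance(String s, String t) {
        return levenshtein(s.codePoints().toArray(),
                           t.codePoints().toArray());
    }
}
```

For example "мама" → "рама" is 1 code-point edit but 2 byte edits, while "mama" → "rama" is 1 edit either way.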