Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]

via GitHub Thu, 13 Mar 2025 09:39:03 -0700


dweiss commented on PR #14350:
URL: https://github.com/apache/lucene/pull/14350#issuecomment-2720601955


   I think this will work just fine in most cases and is a rather inexpensive 
way to implement case-insensitive matching, but it comes at a cost: the output 
automaton may no longer be minimal. Consider this example:
   ```
       List<BytesRef> terms = new ArrayList<>(List.of(
               newBytesRef("abc"),
               newBytesRef("aBC")));
       Collections.sort(terms);
       Automaton a = build(terms, false, false);
   ```
   which produces:
   
![image](https://github.com/user-attachments/assets/5d89a382-0fe9-4b42-bdcc-a67cb2b90ef5)
   
   However, when you naively expand just the transitions for each letter 
variant, you get this:
   
![image](https://github.com/user-attachments/assets/2c55bbc5-1c16-43c9-b148-2effbd6b1efb)
   which clearly isn't minimal (and doesn't pass checkMinimized).
   
   I think the absolute worst case is that the number of transitions doubles 
while the number of states remains the same. So it's not going to expand 
uncontrollably, but the automaton is no longer minimal. Perhaps this is 
acceptable, given the constrained worst case?
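
To make the bound concrete, here is a minimal, Lucene-free sketch of the same
example. It is only an illustration under stated assumptions: the class
`NaiveCaseExpansion` and all of its methods are hypothetical, the DFA for
{"abc", "aBC"} is built by hand rather than by the PR's code, and "naive
expansion" is taken to mean "add the other-case label to a state whenever that
label is not already present", which may differ from what the PR actually does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class NaiveCaseExpansion {

    /** Index of the only accepting state in the hand-built example. */
    static final int ACCEPT = 4;

    /** Hand-built minimal DFA for {"abc", "aBC"}; state 0 is the initial state. */
    public static List<Map<Character, Integer>> exampleDfa() {
        List<Map<Character, Integer>> dfa = new ArrayList<>();
        for (int i = 0; i < 5; i++) {
            dfa.add(new TreeMap<>());
        }
        dfa.get(0).put('a', 1);
        dfa.get(1).put('b', 2);
        dfa.get(1).put('B', 3);
        dfa.get(2).put('c', 4);
        dfa.get(3).put('C', 4);
        return dfa;
    }

    static char swapCase(char c) {
        return Character.isUpperCase(c) ? Character.toLowerCase(c) : Character.toUpperCase(c);
    }

    /** Assumed naive expansion: add the other-case label unless the state already has it. */
    public static void expandCase(List<Map<Character, Integer>> dfa) {
        for (Map<Character, Integer> transitions : dfa) {
            // Iterate over a snapshot so the live map can be modified safely.
            Map<Character, Integer> snapshot = new TreeMap<>(transitions);
            for (Map.Entry<Character, Integer> e : snapshot.entrySet()) {
                transitions.putIfAbsent(swapCase(e.getKey()), e.getValue());
            }
        }
    }

    public static int countTransitions(List<Map<Character, Integer>> dfa) {
        int total = 0;
        for (Map<Character, Integer> transitions : dfa) {
            total += transitions.size();
        }
        return total;
    }

    /**
     * True if two distinct states agree on acceptance and have identical outgoing
     * transitions. Such states can be merged, so the automaton is not minimal.
     * This is a sufficient condition only, not a full minimality check.
     */
    public static boolean hasMergeableStates(List<Map<Character, Integer>> dfa) {
        for (int i = 0; i < dfa.size(); i++) {
            for (int j = i + 1; j < dfa.size(); j++) {
                boolean sameAcceptance = (i == ACCEPT) == (j == ACCEPT);
                if (sameAcceptance && dfa.get(i).equals(dfa.get(j))) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Map<Character, Integer>> dfa = exampleDfa();
        System.out.println("before: states=" + dfa.size()
                + " transitions=" + countTransitions(dfa)
                + " mergeable=" + hasMergeableStates(dfa));
        expandCase(dfa);
        System.out.println("after:  states=" + dfa.size()
                + " transitions=" + countTransitions(dfa)
                + " mergeable=" + hasMergeableStates(dfa));
    }
}
```

In this sketch the state count stays at 5 while the transitions grow from 5 to
8, and the two middle states end up with identical outgoing transitions, so they
could be merged. Each original transition contributes at most one extra label,
which is where the "at most double" bound comes from.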


