[PR] Add Automata.makeCharSet(int[]) to optimize caseless matching. [lucene]

via GitHub Mon, 03 Feb 2025 18:04:42 -0800


rmuir opened a new pull request, #14193:
URL: https://github.com/apache/lucene/pull/14193


   Previously, caseless matching was implemented with code such as this:
   
   ```java
     Operations.union(Automata.makeChar('x'), Automata.makeChar('X'))
   ```
   
   The proposed Unicode caseless matching (#14192) implements it with repeated unions:
   
   ```java
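     // Each union wraps the previous result, so states accumulate with every added char.
     Automaton a1 = Operations.union(Automata.makeChar('x'), Automata.makeChar('X'));
     Automaton a2 = Operations.union(a1, Automata.makeChar('y'));
     Automaton a3 = Operations.union(a2, Automata.makeChar('Y'));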
   ```
   The union operation doesn't return a minimal automaton. Improving union would always be nice, but this change offers a simple API for the task that returns an automaton with half as many states.
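
To illustrate why a dedicated char-set constructor helps, here is a minimal, self-contained sketch. It is a toy model, not Lucene code: `ToyAutomaton` and all of its methods are hypothetical names invented for this example, and its `union` is deliberately naive (it adds a fresh start state, copies both operands, and does no cleanup). Lucene's real `Operations.union` behaves differently, so the state counts below only show the general effect of repeated non-minimizing unions and do not reproduce the numbers in this PR.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy automaton (NOT Lucene's Automaton): states are ints, state 0 is the start. */
final class ToyAutomaton {
  int numStates = 0;
  final List<int[]> edges = new ArrayList<>();   // each edge is {from, codePoint, to}
  final Set<Integer> accepting = new HashSet<>();

  int addState(boolean accept) {
    int s = numStates++;
    if (accept) accepting.add(s);
    return s;
  }

  /** Two states: start --c--> accept. */
  static ToyAutomaton makeChar(int c) {
    return makeCharSet(c);
  }

  /** Still two states, with one edge per code point (toy analogue of a char-set constructor). */
  static ToyAutomaton makeCharSet(int... codePoints) {
    ToyAutomaton a = new ToyAutomaton();
    int start = a.addState(false);
    int end = a.addState(true);
    for (int c : codePoints) a.edges.add(new int[] {start, c, end});
    return a;
  }

  /** Naive union: fresh start state plus copies of both operands, no minimization. */
  static ToyAutomaton union(ToyAutomaton x, ToyAutomaton y) {
    ToyAutomaton u = new ToyAutomaton();
    u.addState(false); // new start state 0
    for (ToyAutomaton src : Arrays.asList(x, y)) {
      int offset = u.numStates;
      for (int s = 0; s < src.numStates; s++) u.addState(src.accepting.contains(s));
      for (int[] e : src.edges) {
        u.edges.add(new int[] {e[0] + offset, e[1], e[2] + offset});
        // Mirror the operand's start-state edges onto the new start state.
        if (e[0] == 0) u.edges.add(new int[] {0, e[1], e[2] + offset});
      }
    }
    return u;
  }

  /** Set-based simulation, since the naive union may be nondeterministic. */
  boolean accepts(int... input) {
    Set<Integer> current = new HashSet<>();
    current.add(0);
    for (int c : input) {
      Set<Integer> next = new HashSet<>();
      for (int[] e : edges) if (current.contains(e[0]) && e[1] == c) next.add(e[2]);
      current = next;
    }
    for (int s : current) if (accepting.contains(s)) return true;
    return false;
  }

  public static void main(String[] args) {
    ToyAutomaton a1 = union(makeChar('x'), makeChar('X'));
    ToyAutomaton a2 = union(a1, makeChar('y'));
    ToyAutomaton a3 = union(a2, makeChar('Y'));
    ToyAutomaton direct = makeCharSet('x', 'X', 'y', 'Y');
    System.out.println("repeated union states: " + a3.numStates);   // 11 in this toy model
    System.out.println("char set states:       " + direct.numStates); // 2
  }
}
```

Both automata accept exactly the same four single-character inputs. The char-set version simply never creates the intermediate states, so nothing has to be removed afterwards.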
   
   Before: caseless match of "a":
   
![a_before](https://github.com/user-attachments/assets/5f1f7cc2-a792-434f-98ec-a5e629678f16)
   
   After:
   
![a_after](https://github.com/user-attachments/assets/aa60f883-cfd4-4291-9fb7-1bbecf0825ea)
   
   Before: caseless match of "lucene":
   
![lucene_before](https://github.com/user-attachments/assets/76e38dc6-5cb4-4b77-923a-22dbab345241)
   
   After:
   
![lucene_after](https://github.com/user-attachments/assets/3ca7755e-826d-4f79-b22e-8a514497d69e)
   
   Just like `union`, `concatenate` adds some useless states, but they are less of a problem than the ones from before.
   
   I didn't try anything more, such as repeated union or Kleene star, to see if I could construct a really bad case. I felt this was good enough to get it to a better place. We can still look at optimizing union/concatenate separately, but that's always more dangerous and tricky.
   



