rmuir opened a new pull request, #14193:
URL: https://github.com/apache/lucene/pull/14193
Previously, caseless matching was implemented via code such as this:
```java
Operations.union(Automata.makeChar('x'), Automata.makeChar('X'))
```
The proposed Unicode caseless matching (#14192) implements it with repeated
unions:
```java
Automaton a1 = Operations.union(Automata.makeChar('x'), Automata.makeChar('X'));
Automaton a2 = Operations.union(a1, Automata.makeChar('y'));
Automaton a3 = Operations.union(a2, Automata.makeChar('Y'));
```
The union operation doesn't return a minimal automaton. Improving union
would always be nice, but this change offers a simple API for the task that
returns half the number of states.
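To make the state-count argument concrete, here is a toy model in plain Java. It is only a sketch and not Lucene code: `ToyNfa`, its naive `union` (fresh initial state plus copies of both operands), and the `fromChars` helper are hypothetical names invented for illustration, and Lucene's real `Operations.union` may differ in its details.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy epsilon-free NFA; state 0 is always the initial state. Not Lucene's Automaton. */
final class ToyNfa {
  /** One labeled edge between two states. */
  static final class Edge {
    final int from;
    final char label;
    final int to;

    Edge(int from, char label, int to) {
      this.from = from;
      this.label = label;
      this.to = to;
    }
  }

  int numStates = 1; // state 0 exists from the start
  final List<Edge> edges = new ArrayList<>();
  final Set<Integer> accept = new HashSet<>();

  /** Two states and one transition: matches exactly one character. */
  static ToyNfa makeChar(char c) {
    return fromChars(String.valueOf(c));
  }

  /** Hypothetical direct construction: two states, one transition per character. */
  static ToyNfa fromChars(String chars) {
    ToyNfa a = new ToyNfa();
    int end = a.numStates++;
    for (char c : chars.toCharArray()) {
      a.edges.add(new Edge(0, c, end));
    }
    a.accept.add(end);
    return a;
  }

  /** Naive union: fresh initial state, copy both operands, mirror their initial edges. */
  static ToyNfa union(ToyNfa x, ToyNfa y) {
    ToyNfa u = new ToyNfa();
    for (ToyNfa src : new ToyNfa[] {x, y}) {
      int offset = u.numStates; // where this operand's states land in u
      u.numStates += src.numStates;
      for (Edge e : src.edges) {
        u.edges.add(new Edge(e.from + offset, e.label, e.to + offset));
        if (e.from == 0) {
          u.edges.add(new Edge(0, e.label, e.to + offset));
        }
      }
      for (int s : src.accept) {
        u.accept.add(s + offset);
      }
      if (src.accept.contains(0)) {
        u.accept.add(0);
      }
    }
    return u;
  }

  /** Number of states reachable from the initial state. */
  int reachableStates() {
    Set<Integer> seen = new HashSet<>();
    Deque<Integer> todo = new ArrayDeque<>();
    seen.add(0);
    todo.push(0);
    while (!todo.isEmpty()) {
      int s = todo.pop();
      for (Edge e : edges) {
        if (e.from == s && seen.add(e.to)) {
          todo.push(e.to);
        }
      }
    }
    return seen.size();
  }

  /** Standard NFA simulation over the set of currently active states. */
  boolean accepts(String input) {
    Set<Integer> current = new HashSet<>();
    current.add(0);
    for (char c : input.toCharArray()) {
      Set<Integer> next = new HashSet<>();
      for (Edge e : edges) {
        if (e.label == c && current.contains(e.from)) {
          next.add(e.to);
        }
      }
      current = next;
    }
    for (int s : current) {
      if (accept.contains(s)) {
        return true;
      }
    }
    return false;
  }
}
```

In this toy model, chaining three unions over `x`, `X`, `y`, `Y` leaves five reachable states, because every union keeps a separate accept state per operand, while `ToyNfa.fromChars("xXyY")` accepts the same four strings with two states.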
Before: caseless match of "a": [automaton diagram]

After: [automaton diagram]

Before: caseless match of "lucene": [automaton diagram]

After: [automaton diagram]

Just like `union`, `concatenate` adds some useless states, but they
are less of a problem than the ones from before.
I didn't try anything more, such as repeated union or Kleene star, to see if
I could construct a really bad case. This felt good enough to get things to
a better place. We can still look at optimizing union/concatenate separately,
but that's always more dangerous and tricky.