Thomas Poppe created LUCENE-7921:
------------------------------------
Summary: More efficient way to transform a RegExp to an Automaton
Key: LUCENE-7921
URL: https://issues.apache.org/jira/browse/LUCENE-7921
Project: Lucene - Core
Issue Type: Improvement
Affects Versions: 6.5.1
Reporter: Thomas Poppe
Priority: Minor
Consider the following example:
public static void main(String[] args) {
org.apache.lucene.util.automaton.RegExp regExp =
new
org.apache.lucene.util.automaton.RegExp("[a-z]{1,13}x[a-z][a-z]?[a-z]?[a-z]?[a-z]?[a-z]{0,8}");
org.apache.lucene.util.automaton.Automaton automaton =
regExp.toAutomaton();
System.out.println("states: " + automaton.getNumStates());
System.out.println("transitions: " + automaton.getNumTransitions());
System.out.println("-------------------------------");
try {
regExp = new
org.apache.lucene.util.automaton.RegExp("[a-z]{1,13}x[a-z]{1,13}");
automaton = regExp.toAutomaton();
System.out.println("Will not happen...");
} catch
(org.apache.lucene.util.automaton.TooComplexToDeterminizeException e) {
automaton = regExp.toAutomaton(1_000_000);
System.out.println("states: " + automaton.getNumStates());
System.out.println("transitions: " + automaton.getNumTransitions());
System.out.println("-------------------------------");
}
}
Both regular expressions are equivalent, but it's much more efficient to
"unroll" the repetition. It might be possible to optimize the
Regex#toAutomaton() method to handle this repetition without going over the
default number of determinized states, and using less memory and CPU?
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]