[ https://issues.apache.org/jira/browse/LUCENE-6046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194395#comment-14194395 ]
Michael McCandless commented on LUCENE-6046:
--------------------------------------------
Hmm, two bugs here.
First off, RegExp.toAutomaton is an inherently costly method: wasteful of RAM
and CPU, doing minimize after each recursive operation, in order to build a DFA
in the end. It's unfortunately quite easy to concoct regular expressions that
make it consume ridiculous resources. I'll look at this example and see if we
can improve it, but in the end it will always have its "adversarial cases"
unless we give up on making the resulting automaton deterministic, which would
be a very big change.
Maybe we should add adversary defenses to it, e.g. you set a limit on the
number of states it's allowed to create, and it throws a RegExpTooHardException
if it would exceed that?
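
As a rough illustration of such a limit, here is a toy subset construction over a two-symbol alphabet that gives up once it would create more than maxStates DFA states. None of this is Lucene code: BoundedDeterminize, determinize, nthFromEnd, maxStates and the RegExpTooHardException class below are hypothetical names used only for this sketch, and the real Operations.determinize is considerably more involved.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical exception, named after the one proposed above; not an existing Lucene class.
class RegExpTooHardException extends RuntimeException {
    RegExpTooHardException(int maxStates) {
        super("determinization would exceed " + maxStates + " states");
    }
}

class BoundedDeterminize {
    /**
     * Runs subset construction over the alphabet {0, 1} and returns the number of
     * DFA states created. nfa[s][c] lists the NFA states reachable from s on symbol c.
     * Throws RegExpTooHardException instead of creating more than maxStates states.
     */
    static int determinize(int[][][] nfa, int start, int maxStates) {
        Map<Set<Integer>, Integer> ids = new HashMap<>();
        Deque<Set<Integer>> work = new ArrayDeque<>();
        Set<Integer> init = new TreeSet<>();
        init.add(start);
        ids.put(init, 0);
        work.add(init);
        while (!work.isEmpty()) {
            Set<Integer> cur = work.remove();
            for (int c = 0; c < 2; c++) {
                Set<Integer> next = new TreeSet<>();
                for (int s : cur) {
                    for (int t : nfa[s][c]) {
                        next.add(t);
                    }
                }
                if (next.isEmpty() || ids.containsKey(next)) {
                    continue; // dead end, or a subset we have already numbered
                }
                if (ids.size() >= maxStates) {
                    // The budget check happens before the new state is recorded.
                    throw new RegExpTooHardException(maxStates);
                }
                ids.put(next, ids.size());
                work.add(next);
            }
        }
        return ids.size();
    }

    /** NFA for (0|1)*1(0|1){n}, a classic case whose DFA needs 2^(n+1) states. */
    static int[][][] nthFromEnd(int n) {
        int[][][] nfa = new int[n + 2][2][];
        nfa[0][0] = new int[] {0};
        nfa[0][1] = new int[] {0, 1};
        for (int s = 1; s <= n; s++) {
            nfa[s][0] = new int[] {s + 1};
            nfa[s][1] = new int[] {s + 1};
        }
        nfa[n + 1][0] = new int[0];
        nfa[n + 1][1] = new int[0];
        return nfa;
    }

    public static void main(String[] args) {
        // Small case: 2^(3+1) = 16 DFA states, well under the budget.
        System.out.println(determinize(nthFromEnd(3), 0, 1000));
        try {
            determinize(nthFromEnd(20), 0, 10000);
        } catch (RegExpTooHardException e) {
            System.out.println("gave up: " + e.getMessage());
        }
    }
}
```

With a check like this, the work done on an adversarial pattern is bounded by the state budget rather than by the size of the heap.
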
Second off, ArrayUtil.oversize has the wrong (too large) value for
MAX_ARRAY_LENGTH, which is a bug from LUCENE-5844. Which JVM did you run this
test on?
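
For context, the cap matters because growth code has to clamp to the largest array the JVM will actually allocate, and that limit varies by JVM. Below is a minimal sketch of clamped growth, assuming a conservative cap of Integer.MAX_VALUE - 8 (the value the JDK's own collections have used, not necessarily the right one for Lucene). SafeGrow and its oversize method are hypothetical and are not ArrayUtil's code.

```java
class SafeGrow {
    // Assumed conservative cap for this sketch; not the value Lucene uses.
    static final int MAX_ARRAY_LENGTH = Integer.MAX_VALUE - 8;

    /** Returns a capacity >= minSize with roughly 1/8th headroom, never above the cap. */
    static int oversize(int minSize) {
        if (minSize < 0) {
            throw new IllegalArgumentException("invalid array size " + minSize);
        }
        if (minSize > MAX_ARRAY_LENGTH) {
            throw new IllegalArgumentException("requested array size " + minSize
                + " exceeds maximum array in java (" + MAX_ARRAY_LENGTH + ")");
        }
        if (minSize == 0) {
            return 0;
        }
        // Do the arithmetic in long so the headroom cannot overflow int.
        long grown = (long) minSize + Math.max(3, minSize >> 3);
        return (int) Math.min(grown, (long) MAX_ARRAY_LENGTH);
    }

    public static void main(String[] args) {
        System.out.println(oversize(16));                   // 16 + max(3, 2) = 19
        System.out.println(oversize(MAX_ARRAY_LENGTH - 1)); // clamped to the cap
    }
}
```

If the cap is set higher than what the JVM can allocate, a request that passes this check can still fail at allocation time, which is why the constant has to be chosen conservatively.
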
> RegExp.toAutomaton high memory use
> ----------------------------------
>
> Key: LUCENE-6046
> URL: https://issues.apache.org/jira/browse/LUCENE-6046
> Project: Lucene - Core
> Issue Type: Bug
> Components: core/queryparser
> Affects Versions: 4.10.1
> Reporter: Lee Hinman
> Priority: Minor
>
> When creating an automaton from an org.apache.lucene.util.automaton.RegExp,
> it's possible for the automaton to use so much memory it exceeds the maximum
> array size for Java.
> The following caused an OutOfMemoryError with a 32gb heap:
> {noformat}
> new RegExp("\\[\\[(Datei|File|Bild|Image):[^]]*alt=[^]|}]{50,200}").toAutomaton();
> {noformat}
> When increased to a 60gb heap, the following exception is thrown:
> {noformat}
> 1> java.lang.IllegalArgumentException: requested array size 2147483624 exceeds maximum array in java (2147483623)
> 1> __randomizedtesting.SeedInfo.seed([7BE81EF678615C32:95C8057A4ABA5B52]:0)
> 1> org.apache.lucene.util.ArrayUtil.oversize(ArrayUtil.java:168)
> 1> org.apache.lucene.util.ArrayUtil.grow(ArrayUtil.java:295)
> 1> org.apache.lucene.util.automaton.Automaton$Builder.addTransition(Automaton.java:639)
> 1> org.apache.lucene.util.automaton.Operations.determinize(Operations.java:741)
> 1> org.apache.lucene.util.automaton.MinimizationOperations.minimizeHopcroft(MinimizationOperations.java:62)
> 1> org.apache.lucene.util.automaton.MinimizationOperations.minimize(MinimizationOperations.java:51)
> 1> org.apache.lucene.util.automaton.RegExp.toAutomaton(RegExp.java:477)
> 1> org.apache.lucene.util.automaton.RegExp.toAutomaton(RegExp.java:426)
> {noformat}