I looked into this, because it sure sounds like a similar issue from a few
years ago that was fixed in https://issues.apache.org/jira/browse/SPARK-5984

The change in that JIRA actually looks almost identical to the change
mentioned in the JDK bug:
http://hg.openjdk.java.net/jdk/jdk/rev/3a6d47df8239

Reading the paper
http://drops.dagstuhl.de/opus/volltexte/2018/9467/pdf/LIPIcs-ESA-2018-4.pdf in
section 5 a little more, I think they are saying that there were two ways
to fix the original problem: a) fix the invariant, or b) increase some data
structure size. Java did the latter, it seems, and now they've shown it's
actually still busted. However Python and the original paper did the
former, and we seem to have copied that fix from
http://envisage-project.eu/proving-android-java-and-python-sorting-algorithm-is-broken-and-how-to-fix-it/
My
understanding is that this still works, and is what Java *now* implements.

The only difference I can see in implementation is an extra check for a
negative array index before dereferencing array[n]. We can add that for
full consistency with the JVM change, I suppose. I don't think it's
relevant to the problem reported in the paper, but could be an issue
otherwise. The JVM implementation suggests it thinks this needs to be
guarded.

I did hack together a crude translation of the paper's bug reproduction at
http://igm.univ-mlv.fr/~pivoteau/Timsort/Test.java by copying some Spark
test code. It does need a huge amount of memory to run (>32g.. ended up at
44g) so not so feasible to put in the test suite. Running it on Spark
master nets a JVM crash:

# Problematic frame:
# J 10195 C2
org.apache.spark.util.collection.TimSort.countRunAndMakeAscending(Ljava/lang/Object;IILjava/util/Comparator;)I
(198 bytes) @ 0x00007ff64dd9a262 [0x00007ff64dd99f20+0x342]

Thats... not good, but I can't tell if it's really due to this same issue
or something else going on in the off-heap code. Making the tiny change I
mentioned above doesn't do anything.

On Fri, Aug 31, 2018 at 2:37 AM Reynold Xin <r...@databricks.com> wrote:

> “As a byproduct of our study, we uncover a bug in the Java implementation
> that can cause the sorting method to fail during the execution.”
>
> http://drops.dagstuhl.de/opus/volltexte/2018/9467/
>
> This might impact Spark since we took the Java based TimSort
> implementation. I have seen in the wild TimSort failing in the past. Maybe
> this is the cause.
>

Reply via email to