Re: Correctness and data loss issues

Dongjoon Hyun Tue, 21 Jan 2020 23:57:53 -0800

Thank you for checking, Wenchen! Sure, we need to do that.

Another question is "What can we do for the 2.4.5 release?"
Some of the fixes cannot be backported due to technical difficulty, like
the following.


    1. https://issues.apache.org/jira/browse/SPARK-26154
        Stream-stream joins - left outer join gives inconsistent output
        (Like this one, there are eight correctness fixes which land only in
3.0.0.)

    2. https://github.com/apache/spark/pull/27233
        [SPARK-29701][SQL] Correct behaviours of group analytical queries
when empty input given
        (This is an ongoing PR which is currently blocking 2.4.5 RC2.)

Bests,
Dongjoon.

On Tue, Jan 21, 2020 at 11:10 PM Wenchen Fan <cloud0...@gmail.com> wrote:

> I think we need to go through them during the 3.0 QA period, and try to
> fix the valid ones.
>
> For example, the first ticket should be fixed already in
> https://issues.apache.org/jira/browse/SPARK-28344
>
> On Mon, Jan 20, 2020 at 2:07 PM Dongjoon Hyun <dongjoon.h...@gmail.com>
> wrote:
>
>> Hi, All.
>>
>> According to our policy, "Correctness and data loss issues should be
>> considered Blockers".
>>
>>     - http://spark.apache.org/contributing.html
>>
>> Since we are close to branch-3.0 cut,
>> I want to ask your opinions on the following correctness and data loss
>> issues.
>>
>>     SPARK-30218 Columns used in inequality conditions for joins not
>> resolved correctly in case of common lineage
>>     SPARK-29701 Different answers when empty input given in GROUPING SETS
>>     SPARK-29699 Different answers in nested aggregates with window
>> functions
>>     SPARK-29419 Seq.toDS / spark.createDataset(Seq) is not thread-safe
>>     SPARK-28125 dataframes created by randomSplit have overlapping rows
>>     SPARK-28067 Incorrect results in decimal aggregation with whole-stage
>> code gen enabled
>>     SPARK-28024 Incorrect numeric values when out of range
>>     SPARK-27784 Alias ID reuse can break correctness when substituting
>> foldable expressions
>>     SPARK-27619 MapType should be prohibited in hash expressions
>>     SPARK-27298 Dataset except operation gives different results(dataset
>> count) on Spark 2.3.0 Windows and Spark 2.3.0 Linux environment
>>     SPARK-27282 Spark incorrect results when using UNION with GROUP BY
>> clause
>>     SPARK-27213 Unexpected results when filter is used after distinct
>>     SPARK-26836 Columns get switched in Spark SQL using Avro backed Hive
>> table if schema evolves
>>     SPARK-25150 Joining DataFrames derived from the same source yields
>> confusing/incorrect results
>>     SPARK-21774 The rule PromoteStrings cast string to a wrong data type
>>     SPARK-19248 Regex_replace works in 1.6 but not in 2.0
>>
>> Some of them are targeted at 3.0.0, but the others are not.
>> Although we will work on them until 3.0.0,
>> I'm not sure we can reach a status with no known correctness and data
>> loss issues.
>>
>> What do you think about the above issues?
>>
>> Bests,
>> Dongjoon.
>>
>
