It's great if you can help with it! Basically, we need to propagate the
column-level deterministic information and sort the inputs if the partition
key lineage has a nondeterministic part.
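
To make the idea above concrete, here is a toy sketch of what "propagating deterministic information" through a key's lineage could look like. It does not use Spark's classes: Expr, self_deterministic and needs_sort_before_round_robin are hypothetical names made up for illustration. The only assumption is that an expression counts as deterministic when it and all of its inputs are.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Expr:
    """Hypothetical expression node; not a Spark/Catalyst class."""
    name: str
    children: List["Expr"] = field(default_factory=list)
    self_deterministic: bool = True  # False for things like rand()

    @property
    def deterministic(self) -> bool:
        # An expression is deterministic only if the node itself and
        # everything in its lineage are deterministic.
        return self.self_deterministic and all(
            child.deterministic for child in self.children
        )


def needs_sort_before_round_robin(partition_key: Expr) -> bool:
    """Hypothetical planner check: sort inputs when the key lineage
    contains a nondeterministic part."""
    return not partition_key.deterministic


col_a = Expr("a")
rand = Expr("rand", self_deterministic=False)
key_ok = Expr("upper", [col_a])        # lineage: a
key_bad = Expr("add", [col_a, rand])   # lineage: a, rand()
```

With these definitions, key_ok is reported as deterministic and key_bad is not, so only key_bad would trigger the extra sort in this model.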
On Wed, Mar 16, 2022 at 5:28 AM Jason Xu wrote:
Hi Wenchen, thanks for the insight. Agree, the previous fix for repartition
works for deterministic data. With non-deterministic data, I didn't find an
API to pass DeterministicLevel to the underlying RDD.
Do you plan to continue work on integration with SQL operators? If not, I'm
available to take a s
We fixed the repartition correctness bug before, by sorting the data before
doing round-robin partitioning. But the issue is that we need to propagate
the isDeterministic property through SQL operators.
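
As a rough illustration of why sorting first helps, the sketch below deals rows out round-robin with and without a prior sort. It is a simplified pure-Python model, not Spark's implementation: round_robin and sorted_round_robin are hypothetical helpers, and it assumes a retry yields the same rows in a different order.

```python
def round_robin(rows, num_partitions):
    """Deal rows out to partitions in arrival order."""
    parts = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        parts[i % num_partitions].append(row)
    return parts


def sorted_round_robin(rows, num_partitions):
    """Sort first, so the assignment no longer depends on arrival order."""
    return round_robin(sorted(rows), num_partitions)


first_attempt = ["a", "b", "c", "d", "e"]
retry_attempt = ["c", "a", "e", "b", "d"]  # same rows, different order

# Without the sort, the two attempts place rows in different partitions.
unsorted_differs = round_robin(first_attempt, 2) != round_robin(retry_attempt, 2)

# With the sort, both attempts produce identical partitions.
sorted_matches = sorted_round_robin(first_attempt, 2) == sorted_round_robin(retry_attempt, 2)
```

Note that the sort only helps when the rows themselves are the same across attempts. If the values change between attempts, the sorted order changes too.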
On Tue, Mar 15, 2022 at 1:50 AM Jason Xu wrote:
Hi Reynold, do you suggest removing RoundRobinPartitioning in
repartition(numPartitions: Int) API implementation? If that's the direction
we're considering, before we have a new implementation, should we suggest
users avoid using the repartition(numPartitions: Int) API?
On Sat, Mar 12, 2022 at 1:4
This is why RoundRobinPartitioning shouldn't be used ...
On Sat, Mar 12, 2022 at 12:08 PM, Jason Xu < jasonxu.sp...@gmail.com > wrote:
Hi Spark community,
I reported a data correctness issue in
https://issues.apache.org/jira/browse/SPARK-38388. In short,
non-deterministic data + Repartition + FetchFailure could result in
incorrect data. This is an issue we run into in production pipelines. I
have an example to reproduce the bug i
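
Separately from the reproduction mentioned above (this is not the example from the JIRA ticket), the failure mode can be modeled in a few lines of plain Python. The sketch assumes the simplest case: a fetch failure forces the upstream stage to be recomputed, the recomputation yields the same rows in a different order, and one downstream partition was already read from the first attempt. All names are hypothetical.

```python
from collections import Counter


def round_robin(rows, num_partitions):
    """Deal rows out to partitions in arrival order."""
    parts = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        parts[i % num_partitions].append(row)
    return parts


first_attempt = ["a", "b", "c", "d", "e"]
retry_attempt = ["c", "a", "e", "b", "d"]  # recomputed after the fetch failure

attempt1 = round_robin(first_attempt, 2)
attempt2 = round_robin(retry_attempt, 2)

# Partition 0 was already consumed from attempt 1; partition 1 is
# fetched after the failure and therefore comes from attempt 2.
combined = attempt1[0] + attempt2[1]
counts = Counter(combined)

duplicated = sorted(row for row, n in counts.items() if n > 1)
missing = sorted(set(first_attempt) - set(combined))
```

In this model the row count is unchanged, but one row appears twice and another is lost, which is the kind of silent incorrectness described in the report.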