Behaviour of operators like Outer Join when using indeterministic joining keys seems to be full of contradictions

2025-01-24 Thread Asif Shahid
Hi, I was testing a use case where the query had an outer join such that the joining key of the left outer table had either a valid value or a random value (salting to avoid skew). The case was reported to give incorrect results when a node fails and the work is retried. On debugging the code, have found followi
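The concern raised in this message can be illustrated with a small standalone sketch. It does not use Spark and does not reproduce the reported bug; it only shows, under stated assumptions, why a salt drawn from a random generator can send the same row to a different partition when it is recomputed on retry, while a salt derived from a stable row attribute cannot. All names here (`random_salt_key`, `stable_salt_key`, `partition_of`, `NUM_PARTITIONS`, the `row_id` field) are hypothetical, and the assumption that salting replaces a missing key is ours, not the thread's.

```python
import random
import zlib

NUM_PARTITIONS = 8  # hypothetical number of shuffle partitions


def random_salt_key(key, rng):
    """Keep a valid key; replace a missing one with a random salt.

    The salt changes every time the expression is evaluated.
    """
    return key if key is not None else f"salt_{rng.randrange(1_000_000)}"


def stable_salt_key(key, row_id):
    """Keep a valid key; replace a missing one with a salt derived
    from a stable row attribute, so re-evaluation gives the same value."""
    if key is not None:
        return key
    return f"salt_{zlib.crc32(str(row_id).encode()) % 1_000_000}"


def partition_of(join_key):
    """Deterministic hash partitioning of a join key."""
    return zlib.crc32(join_key.encode()) % NUM_PARTITIONS


def assign(rows, key_fn):
    """Return the partition chosen for each (row_id, key) pair."""
    return [partition_of(key_fn(row_id, key)) for row_id, key in rows]


# 100 rows whose join key is missing and therefore gets salted.
rows = [(row_id, None) for row_id in range(100)]

# Two generators stand in for the first attempt and a retry of the same task.
first_rng = random.Random(1)
retry_rng = random.Random(2)

random_first = assign(rows, lambda rid, k: random_salt_key(k, first_rng))
random_retry = assign(rows, lambda rid, k: random_salt_key(k, retry_rng))

stable_first = assign(rows, lambda rid, k: stable_salt_key(k, rid))
stable_retry = assign(rows, lambda rid, k: stable_salt_key(k, rid))

print("random salt stable across retry:", random_first == random_retry)
print("hashed salt stable across retry:", stable_first == stable_retry)
```

If only part of a stage is recomputed after a failure, rows that move between partitions in this way can in principle be duplicated or dropped. Whether that matches the behaviour found in the thread is not something this sketch establishes.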

Re: [Connect] Install additional python packages after session creation

2025-01-24 Thread Hyukjin Kwon
That's me. It's not anywhere yet and it's WIP as mentioned in the talk. I'm still dealing with its design. On Sat, Jan 25, 2025 at 1:00 AM Deependra Patel wrote: > Hi all, > There are ways through the `addArtifacts` API in an existing session but > for that we need to have dependencies properly

Re: Proposal to improve data skew debugging

2025-01-24 Thread Mich Talebzadeh
OK, so the Catalyst optimizer will use this method of inline key counting to give the Spark optimizer prior notification, so that it identifies the hot keys? What is this inline key counting based on? Likely the Count-Min Sketch algorithm! HTH Mich Talebzadeh, Architect | Data Science | Financial Crime |
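For readers unfamiliar with the Count-Min Sketch guessed at in this reply, here is a minimal standalone sketch of the data structure in plain Python. It is not taken from the proposal or from Spark's code, and nothing in the thread snippet confirms that the prototype uses it. The class name, the `width` and `depth` defaults, and the SHA-256 based row hashing are illustrative assumptions.

```python
import hashlib


class CountMinSketch:
    """Approximate per-key counter using fixed memory.

    Estimates never undercount; hash collisions can only inflate them.
    """

    def __init__(self, width=256, depth=4):
        self.width = width  # counters per row
        self.depth = depth  # number of independent hash rows
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, key, row):
        # Derive one hash per row by mixing the row number into the input.
        digest = hashlib.sha256(f"{row}:{key}".encode()).digest()
        return int.from_bytes(digest[:8], "little") % self.width

    def add(self, key, count=1):
        for row in range(self.depth):
            self.table[row][self._index(key, row)] += count

    def estimate(self, key):
        # The smallest counter across rows is the least-inflated estimate.
        return min(
            self.table[row][self._index(key, row)] for row in range(self.depth)
        )


sketch = CountMinSketch()

# One hot key seen 1000 times, plus 100 distinct keys seen once each.
for _ in range(1000):
    sketch.add("hot_key")
for i in range(100):
    sketch.add(f"key_{i}")

hot_estimate = sketch.estimate("hot_key")
print("estimated count for hot_key:", hot_estimate)
```

The estimate for `hot_key` is at least its true count of 1000 and cannot exceed the total number of insertions (1100), which is what makes the structure usable for flagging hot keys without storing every key.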

Proposal to improve data skew debugging

2025-01-24 Thread Rob Reeves
Hi Spark devs, I recently worked on a prototype to make it easier to identify the root cause of data skew in Spark. I wanted to see if the community was interested in it before working on contributing the changes (SPIP and PRs). *Problem* When a query has data skew today, you see outlier tasks ta

FYI: SPARK-49700 Unified Scala Interface for Connect and Classic

2025-01-24 Thread Dongjoon Hyun
Hi, All. SPARK-49700 landed one hour ago. Since this is another huge package redesign across 399 files in Spark 4.0, please check that you are not accidentally affected. Best Regards, Dongjoon.

[Connect] Install additional python packages after session creation

2025-01-24 Thread Deependra Patel
Hi all, There are ways through the `addArtifacts` API in an existing session but for that we need to have dependencies properly gzipped. In the case of different kernel/OS between client and server, it won't work either I believe. What I am interested in is doing some sort of `pip install https://y