Re: Proposal to improve data skew debugging

2025-01-29 Thread Mich Talebzadeh
Hi Rob, As a matter of interest, do you have a ballpark figure for the percentage of queries that end up with a skewed distribution? Thanks, Mich Talebzadeh, Architect | Data Science | Financial Crime | Forensic Analysis | GDPR

Re: Proposal to improve data skew debugging

2025-01-27 Thread Rob Reeves
The counting does use a count-min sketch and publishes the top K keys above a skew threshold to an accumulator. The core implementation in my prototype is in InlineApproxCountExec
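
For readers unfamiliar with the technique, here is a minimal, standalone Python sketch of count-min-sketch key counting that reports the top K keys above a skew threshold. It is not the code in InlineApproxCountExec and does not use Spark or accumulators. The names `CountMinSketch` and `hot_keys`, the sketch dimensions, the pruning policy, and the reading of "skew threshold" as a fraction of total rows are all assumptions made for illustration.

```python
import hashlib


class CountMinSketch:
    """Fixed-size approximate counter. Estimates can overcount, never undercount."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]
        self.total = 0  # total number of rows counted so far

    def _cells(self, key):
        # One hash per row, derived by salting the hash with the row index.
        data = repr(key).encode()
        for row in range(self.depth):
            digest = hashlib.blake2b(
                data, digest_size=8, salt=row.to_bytes(8, "little")
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, key, count=1):
        self.total += count
        for row, col in self._cells(key):
            self.table[row][col] += count

    def estimate(self, key):
        # The minimum across rows is the estimate least inflated by collisions.
        return min(self.table[row][col] for row, col in self._cells(key))


def hot_keys(keys, k=3, skew_threshold=0.2, width=1024, depth=4):
    """Return up to k (key, estimated_count) pairs, largest first, whose
    estimated share of all rows is at least skew_threshold (assumed meaning)."""
    sketch = CountMinSketch(width, depth)
    candidates = {}
    for key in keys:
        sketch.add(key)
        candidates[key] = sketch.estimate(key)
        # Bound memory: keep only the strongest candidates (hypothetical policy).
        if len(candidates) > 4 * k:
            top = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)
            candidates = dict(top[: 2 * k])
    cutoff = skew_threshold * sketch.total
    hot = [(key, est) for key, est in candidates.items() if est >= cutoff]
    hot.sort(key=lambda kv: kv[1], reverse=True)
    return hot[:k]
```

In a distributed setting, each task would presumably build its own sketch over the rows it processes and the per-task results would be merged on the driver. That merge step is left out here.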

Re: Proposal to improve data skew debugging

2025-01-24 Thread Mich Talebzadeh
OK, so the Catalyst optimizer will use this method of inline key counting to give the Spark optimizer prior notification, so that it can identify the hot keys? What is this inline key counting based on? Likely the Count-Min Sketch algorithm! HTH Mich Talebzadeh, Architect | Data Science | Financial Crime |

Proposal to improve data skew debugging

2025-01-24 Thread Rob Reeves
Hi Spark devs, I recently worked on a prototype to make it easier to identify the root cause of data skew in Spark. I wanted to see if the community was interested in it before working on contributing the changes (SPIP and PRs). *Problem* When a query has data skew today, you see outlier tasks ta