Re: Proposal to improve data skew debugging

2025-01-29 Thread Mich Talebzadeh
Hi Rob, As a matter of interest, do you have a ballpark figure for the percentage of queries that end up with a skewed distribution? Thanks, Mich Talebzadeh, Architect | Data Science | Financial Crime | Forensic Analysis | GDPR

Re: Proposal to improve data skew debugging

2025-01-27 Thread Rob Reeves
The counting does use count-min sketch and publishes the top K keys above a skew threshold to an accumulator. The core implementation in my prototype is in InlineApproxCountExec
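
To make the idea concrete, here is a minimal, self-contained sketch of count-min-sketch counting that reports the top K keys above a skew threshold. It is not the InlineApproxCountExec code from the prototype: the names CountMinSketch, hot_keys, width, depth and skew_threshold are hypothetical, the threshold is assumed to be a fraction of all rows seen, and the result is simply returned where a Spark operator would presumably publish it to an accumulator.

```python
import hashlib
import heapq


class CountMinSketch:
    """Fixed-size approximate counter; estimates never undercount."""

    def __init__(self, width=2048, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]
        self.total = 0  # number of items added so far

    def _index(self, key, row):
        # One independent-looking hash per row, derived by salting with the row number.
        digest = hashlib.blake2b(
            repr(key).encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, key, count=1):
        self.total += count
        for row in range(self.depth):
            self.table[row][self._index(key, row)] += count

    def estimate(self, key):
        # The minimum across rows is the least collision-inflated counter.
        return min(
            self.table[row][self._index(key, row)] for row in range(self.depth)
        )


def hot_keys(keys, k=3, skew_threshold=0.2, width=2048, depth=5):
    """Return up to k (key, estimated_count) pairs whose estimated share of
    all rows is at least skew_threshold (an assumed definition of "skewed")."""
    sketch = CountMinSketch(width, depth)
    candidates = set()
    for key in keys:
        sketch.add(key)
        # Remember any key that looks hot at the moment it is seen.
        # Note: this candidate set is not size-bounded in this simple sketch.
        if sketch.estimate(key) >= skew_threshold * sketch.total:
            candidates.add(key)
    # Re-check candidates against the final total, then keep the k largest.
    cutoff = skew_threshold * sketch.total
    survivors = [(key, sketch.estimate(key)) for key in candidates]
    survivors = [(key, est) for key, est in survivors if est >= cutoff]
    return heapq.nlargest(k, survivors, key=lambda pair: pair[1])


if __name__ == "__main__":
    rows = ["a"] * 60 + ["b"] * 30 + [f"k{i}" for i in range(10)]
    for key, est in hot_keys(rows, k=3, skew_threshold=0.2):
        print(key, est)
```

Because a count-min sketch only ever overestimates, a key that is truly above the threshold cannot be missed here. The cost is that hash collisions can occasionally promote a key that is not actually hot, and a wider or deeper table reduces that risk.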

Re: Proposal to improve data skew debugging

2025-01-24 Thread Mich Talebzadeh
OK, so the Catalyst optimizer will use this method of inline key counting to get prior notification of the hot keys? What is this inline key counting based on? Likely the Count-Min Sketch algorithm! HTH Mich Talebzadeh, Architect | Data Science | Financial Crime |