Hi Enrico,

Thanks for your time :)
Consider a huge data-volume scenario: if I don't use any keywords like DISTINCT, which one would be faster, a window with partitionBy or normal SQL aggregation methods? And how does df.groupByKey().reduceGroups() work internally?

Thanks,
Sid

On Mon, Feb 28, 2022 at 12:59 AM Enrico Minack <i...@enrico.minack.dev> wrote:

> Sid,
>
> Your Aggregation Query selects all employees for whom fewer than three
> distinct larger salaries exist. So, both queries seem to do the same.
>
> The Windowing Query is explicit in what it does: give me the rank of the
> salaries per department in the given order and pick the top 3 per
> department.
>
> The Aggregation Query tries to reach the same conclusion by constructing
> a comparison. The former is the better approach; the latter scales badly,
> because it counts the distinct salaries larger than each salary in E.
> This looks like a Cartesian product of Employees. You make it very hard
> for the query engine to optimize or execute.
>
> And as you say, your example is very small, so it will not give any
> insights into big data.
>
> Enrico
>
>
> On 27.02.22 19:30, Sid wrote:
>
> My bad.
>
> Aggregation Query:
>
> # Write your MySQL query statement below
>
> SELECT D.Name AS Department, E.Name AS Employee, E.Salary AS Salary
> FROM Employee E INNER JOIN Department D ON E.DepartmentId = D.Id
> WHERE (SELECT COUNT(DISTINCT Salary) FROM Employee
>        WHERE DepartmentId = E.DepartmentId AND Salary > E.Salary) < 3
> ORDER BY E.DepartmentId, E.Salary DESC
>
> Time taken: 1212 ms
>
> Windowing Query:
>
> SELECT Department, Employee, Salary FROM (
>   SELECT d.name AS Department, e.name AS Employee, e.salary AS Salary,
>          dense_rank() OVER (PARTITION BY d.name ORDER BY e.salary DESC) AS rnk
>   FROM Department d JOIN Employee e ON e.departmentId = d.id
> ) a WHERE rnk <= 3
>
> Time taken: 790 ms
>
> Thanks,
> Sid
>
>
> On Sun, Feb 27, 2022 at 11:35 PM Sean Owen <sro...@gmail.com> wrote:
>
>> Those two queries are identical?
>>
>> On Sun, Feb 27, 2022 at 11:30 AM Sid <flinkbyhe...@gmail.com> wrote:
>>
>>> Hi Team,
>>>
>>> I am aware that if windowing functions are used, it first loads the
>>> entire dataset into one window, scans it, and then performs the other
>>> mentioned operations for that particular window, which could be slower
>>> when dealing with trillions/billions of records.
>>>
>>> I did a POC where I used an example to find the top 3 highest salaries
>>> per department for employees. So I wrote the below queries and compared
>>> their timings:
>>>
>>> Windowing Query:
>>>
>>> select Department,Employee,Salary from (
>>> select d.name as Department, e.name as Employee, e.salary as Salary,
>>> dense_rank() over(partition by d.name order by e.salary desc) as rnk
>>> from Department d join Employee e on e.departmentId=d.id
>>> ) a where rnk<=3
>>>
>>> Time taken: 790 ms
>>>
>>> Aggregation Query:
>>>
>>> select Department,Employee,Salary from (
>>> select d.name as Department, e.name as Employee, e.salary as Salary,
>>> dense_rank() over(partition by d.name order by e.salary desc) as rnk
>>> from Department d join Employee e on e.departmentId=d.id
>>> ) a where rnk<=3
>>>
>>> Time taken: 1212 ms
>>>
>>> But as per my understanding, the aggregation should have run faster. So
>>> my whole point is: if the dataset is huge, should I force some kind of
>>> map-reduce job, like the df.groupByKey().reduceGroups() option we have?
>>>
>>> So I think the aggregation query is taking more time because the dataset
>>> here is smaller, and as we all know, map-reduce works faster when there
>>> is a huge volume of data. I haven't tested it on big data yet, but I
>>> needed some expert guidance here.
>>>
>>> Please correct me if I am wrong.
>>>
>>> TIA,
>>> Sid
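Enrico's scaling argument can be sketched outside Spark. Below is a minimal Python sketch (the employee rows and function names are invented for illustration, not taken from the thread) that computes the same top-3-salaries-per-department result both ways: once by dense-ranking each partition after one sort, roughly O(n log n) work per department, and once in the style of the correlated subquery, which compares every employee against every other row in the same department, roughly O(n^2):

```python
from collections import defaultdict

# Toy stand-in for the Employee/Department join: (name, department, salary).
employees = [
    ("Joe", "IT", 85000),
    ("Henry", "IT", 80000),
    ("Sam", "IT", 60000),
    ("Max", "IT", 90000),
    ("Janet", "Sales", 69000),
    ("Randy", "Sales", 85000),
]

def top3_window_style(rows):
    """Windowing approach: sort each department's distinct salaries once,
    assign dense ranks, keep rows with rank <= 3."""
    by_dept = defaultdict(list)
    for name, dept, salary in rows:
        by_dept[dept].append((name, salary))
    result = set()
    for dept, members in by_dept.items():
        distinct = sorted({s for _, s in members}, reverse=True)
        rank = {s: i + 1 for i, s in enumerate(distinct)}  # dense_rank()
        for name, salary in members:
            if rank[salary] <= 3:
                result.add((dept, name, salary))
    return result

def top3_subquery_style(rows):
    """Correlated-subquery approach: for every employee, count the distinct
    higher salaries in the same department. Every row is compared against
    every other row of its department, which is the Cartesian-product-like
    blowup Enrico describes."""
    result = set()
    for name, dept, salary in rows:
        higher = {s for _, d, s in rows if d == dept and s > salary}
        if len(higher) < 3:
            result.add((dept, name, salary))
    return result
```

On this toy data both functions return the same rows (everyone except Sam, whose salary ranks fourth in IT); the two approaches only diverge in how much comparison work they do as the row count grows.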