Hi Enrico,

Thanks for your time :)

Consider a huge-data-volume scenario: if I don't use any keywords like
DISTINCT, which one would be faster, a window with partitionBy or normal SQL
aggregation methods? And how does df.groupBy().reduceByGroups() work
internally?
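[Editor's note on the groupBy/reduce question: in Spark's Scala Dataset API the closest real method pair is `Dataset.groupByKey(...).reduceGroups(...)`; a `reduceByGroups` method does not exist. Conceptually, this kind of aggregation can reduce within each partition first and only shuffle one partial result per key. The following is a minimal plain-Python sketch of that partial-then-merge pattern, not actual Spark internals; all names are illustrative.]

```python
from operator import add

def reduce_by_key(partitions, reduce_fn):
    """Map-reduce-style aggregation: reduce inside each partition first,
    then merge the per-partition partials (what a shuffle would exchange)."""
    # Stage 1: partial (map-side) reduce inside each partition.
    partials = []
    for part in partitions:
        acc = {}
        for key, value in part:
            acc[key] = reduce_fn(acc[key], value) if key in acc else value
        partials.append(acc)
    # Stage 2: merge the small per-partition results into the final answer.
    merged = {}
    for acc in partials:
        for key, value in acc.items():
            merged[key] = reduce_fn(merged[key], value) if key in merged else value
    return merged

parts = [[("IT", 60), ("Sales", 40)], [("IT", 30), ("IT", 10)]]
assert reduce_by_key(parts, add) == {"IT": 100, "Sales": 40}
```

The point of stage 1 is that, for an associative reduce function, each partition contributes at most one record per key to the merge, which is why this pattern scales well on large data.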

Thanks,
Sid
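[Editor's note: to make the window-vs-aggregation comparison concrete, the two queries discussed further down in this thread select the same rows, because dense_rank equals one plus the number of distinct larger salaries in the partition. Below is a plain-Python sketch (toy data, illustrative names, not Spark) of both strategies, showing why the correlated count is quadratic per department while the ranking approach needs only a sort.]

```python
from collections import defaultdict

employees = [
    ("IT", "Max", 90), ("IT", "Joe", 70), ("IT", "Ann", 80),
    ("IT", "Bob", 70), ("IT", "Sam", 60),
    ("Sales", "Eve", 50), ("Sales", "Tom", 40),
]

def top3_window(rows):
    """dense_rank() OVER (PARTITION BY dept ORDER BY salary DESC), rnk <= 3.
    One sort of the distinct salaries per department."""
    by_dept = defaultdict(list)
    for dept, name, sal in rows:
        by_dept[dept].append((name, sal))
    out = set()
    for dept, people in by_dept.items():
        # dense_rank: ties share a rank, next distinct value gets rank + 1
        ranks = {s: r + 1 for r, s in
                 enumerate(sorted({s for _, s in people}, reverse=True))}
        out.update((dept, n, s) for n, s in people if ranks[s] <= 3)
    return out

def top3_correlated(rows):
    """WHERE (SELECT COUNT(DISTINCT Salary) ... Salary > E.Salary) < 3:
    every row is compared against every row of its department, O(n^2)."""
    out = set()
    for dept, name, sal in rows:
        higher = {s for d, _, s in rows if d == dept and s > sal}
        if len(higher) < 3:
            out.add((dept, name, sal))
    return out

assert top3_window(employees) == top3_correlated(employees)
```

Same result, very different cost model, which matches Enrico's point below about the correlated subquery behaving like a Cartesian product of Employee with itself.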

On Mon, Feb 28, 2022 at 12:59 AM Enrico Minack <i...@enrico.minack.dev>
wrote:

> Sid,
>
> Your Aggregation Query selects all employees for which fewer than three
> larger distinct salaries exist in their department. So both queries seem
> to do the same thing.
>
> The Windowing Query is explicit in what it does: give me the rank for
> salaries per department in the given order and pick the top 3 per
> department.
>
> The Aggregation Query tries to reach the same conclusion by constructing
> a comparison. The former is the better approach; the latter scales badly,
> because it counts the distinct salaries that are larger than each salary
> in E. That looks like a Cartesian product of Employee with itself, which
> is very hard for the query engine to optimize or execute.
>
> And as you say, your example is very small, so this will not give any
> insights into big data.
>
> Enrico
>
>
> Am 27.02.22 um 19:30 schrieb Sid:
>
> My bad.
>
> Aggregation Query:
>
> # Write your MySQL query statement below
>
>    SELECT D.Name AS Department, E.Name AS Employee, E.Salary AS Salary
>    FROM Employee E INNER JOIN Department D ON E.DepartmentId = D.Id
>    WHERE (SELECT COUNT(DISTINCT Salary) FROM Employee
>           WHERE DepartmentId = E.DepartmentId AND Salary > E.Salary) < 3
>    ORDER BY E.DepartmentId, E.Salary DESC
>
> Time Taken: 1212 ms
>
> Windowing Query:
>
> SELECT Department, Employee, Salary FROM (
>   SELECT d.name AS Department, e.name AS Employee, e.salary AS Salary,
>          dense_rank() OVER (PARTITION BY d.name ORDER BY e.salary DESC) AS rnk
>   FROM Department d JOIN Employee e ON e.departmentId = d.id
> ) a WHERE rnk <= 3
>
> Time Taken: 790 ms
>
> Thanks,
> Sid
>
>
> On Sun, Feb 27, 2022 at 11:35 PM Sean Owen <sro...@gmail.com> wrote:
>
>> Those two queries are identical?
>>
>> On Sun, Feb 27, 2022 at 11:30 AM Sid <flinkbyhe...@gmail.com> wrote:
>>
>>> Hi Team,
>>>
>>> I am aware that when windowing functions are used, the entire dataset
>>> for a window is first loaded and scanned, and then the other mentioned
>>> operations are performed for that particular window, which could be
>>> slower when dealing with billions or trillions of records.
>>>
>>> I did a POC using an example that finds the top 3 highest salaries per
>>> department. I wrote the queries below and compared their timings:
>>>
>>> Windowing Query:
>>>
>>> SELECT Department, Employee, Salary FROM (
>>>   SELECT d.name AS Department, e.name AS Employee, e.salary AS Salary,
>>>          dense_rank() OVER (PARTITION BY d.name ORDER BY e.salary DESC) AS rnk
>>>   FROM Department d JOIN Employee e ON e.departmentId = d.id
>>> ) a WHERE rnk <= 3
>>>
>>> Time taken: 790 ms
>>>
>>> Aggregation Query:
>>>
>>> SELECT Department, Employee, Salary FROM (
>>>   SELECT d.name AS Department, e.name AS Employee, e.salary AS Salary,
>>>          dense_rank() OVER (PARTITION BY d.name ORDER BY e.salary DESC) AS rnk
>>>   FROM Department d JOIN Employee e ON e.departmentId = d.id
>>> ) a WHERE rnk <= 3
>>>
>>> Time taken: 1212 ms
>>>
>>> But as per my understanding, the aggregation should have run faster. So
>>> my whole point is: if the dataset is huge, should I force some kind of
>>> map-reduce job, like the df.groupby().reduceByGroups() option?
>>>
>>> I think the aggregation query takes more time here because the dataset
>>> is small, and map-reduce works faster when there is a huge volume of
>>> data. I haven't tested it on big data yet, but I need some expert
>>> guidance here.
>>>
>>> Please correct me if I am wrong.
>>>
>>> TIA,
>>> Sid
>>>
>>>
>>>
>>>
>
