[jira] [Created] (SPARK-17662) Dedup UDAF

Ohad Raviv (JIRA) Sat, 24 Sep 2016 23:40:02 -0700

Ohad Raviv created SPARK-17662:
----------------------------------

             Summary: Dedup UDAF
                 Key: SPARK-17662
                 URL: https://issues.apache.org/jira/browse/SPARK-17662
             Project: Spark
          Issue Type: New Feature
            Reporter: Ohad Raviv



We have a common use case of deduplicating a table by creation order.
For example, we have an event log of user actions, in which a user marks a 
favorite category from time to time.
In our analytics we want to know only each user's most recent favorite category.
The data:
user_id    action_type     value    date
123        fav category    1        2016-02-01
123        fav category    4        2016-02-02
123        fav category    8        2016-02-03
123        fav category    2        2016-02-04

We would like to keep only the latest update per user, as determined by the date column.

We could, of course, do it in SQL:
select * from (
  select *,
         row_number() over (partition by user_id, action_type
                            order by date desc) as rnum
  from tbl)
where rnum = 1;

However, I believe this cannot be optimized on the mapper side: all the data 
gets shuffled to the reducers instead of being partially aggregated on the map 
side.
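
To make the idea of map-side partial aggregation concrete, here is a minimal sketch in plain Python (not Spark code). It assumes "keep the latest row" can be expressed as an associative, commutative merge, so each partition can be reduced to one row per key before anything is shuffled. The names latest, partial_dedup, merge and dedup are hypothetical and used only for illustration.

```python
from functools import reduce

# Rows are (user_id, action_type, value, date). ISO-8601 dates sort
# correctly as plain strings, so no date parsing is needed here.
ROWS = [
    (123, "fav category", 1, "2016-02-01"),
    (123, "fav category", 4, "2016-02-02"),
    (123, "fav category", 8, "2016-02-03"),
    (123, "fav category", 2, "2016-02-04"),
]

def latest(a, b):
    """Keep the row with the later date; ties are broken by value."""
    return max(a, b, key=lambda r: (r[3], r[2]))

def partial_dedup(partition):
    """Map side: reduce one partition to one row per (user_id, action_type)."""
    best = {}
    for row in partition:
        key = (row[0], row[1])
        best[key] = latest(best[key], row) if key in best else row
    return best

def merge(left, right):
    """Reduce side: combine two partial results key by key."""
    out = dict(left)
    for key, row in right.items():
        out[key] = latest(out[key], row) if key in out else row
    return out

def dedup(partitions):
    """Deduplicate across partitions; only the partial results are merged."""
    return reduce(merge, (partial_dedup(p) for p in partitions), {})

# Simulate two input partitions of the event log.
result = dedup([ROWS[:2], ROWS[2:]])
print(result)
```

In this sketch each partition contributes at most one row per key to the merge step, which is the behavior the row_number() query above does not appear to give us.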

We have written a UDAF for this, but that raises other issues, such as blocking 
predicate push-down on columns.

Do you have any ideas for a proper solution?




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
