Ohad Raviv created SPARK-17662:
----------------------------------
Summary: Dedup UDAF
Key: SPARK-17662
URL: https://issues.apache.org/jira/browse/SPARK-17662
Project: Spark
Issue Type: New Feature
Reporter: Ohad Raviv
We have a common use case od deduping a table in a creation order.
For example, we have an event log of user actions. A user marks his favorite
category from time to time.
In our analytics we would like to know only the user's last favorite category.
The data:
user_id action_type value date
123 fav category 1 2016-02-01
123 fav category 4 2016-02-02
123 fav category 8 2016-02-03
123 fav category 2 2016-02-04
we would like to get only the last update by the date column.
we could of-course do it in sql:
select * from (
select *, row_number() over (partition by user_id,action_type order by date
desc) as rnum from tbl)
where rnum=1;
but then, I believe it can't be optimized on the mappers side and we'll get all
the data shuffled to the reducers instead of partially aggregated in the map
side.
We have written a UDAF for this, but then we have other issues - like blocking
push-down-predicate for columns.
do you have any idea for a proper solution?
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]