Correction: I had it backwards. collect_list keeps duplicates; collect_set is the one that eliminates them.
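To make the difference concrete, here is a minimal pure-Python sketch of what the two Hive UDAFs produce after the union + group by step discussed below. The data and names are illustrative only; in Hive itself this would be something like `SELECT user_id, collect_set(feature) FROM unioned GROUP BY user_id`.

```python
from collections import defaultdict

# Simulated union of Data1, Data2, Data3 as (user_id, feature) rows.
# A duplicate row is included to show how the two UDAFs differ.
rows = [
    ("user_id1", "feature1"),
    ("user_id1", "feature2"),
    ("user_id1", "feature3"),
    ("user_id1", "feature3"),  # duplicate across datasets
]

def collect_list(rows):
    """Like Hive's collect_list: group by user_id, keep duplicates."""
    out = defaultdict(list)
    for user, feature in rows:
        out[user].append(feature)
    return dict(out)

def collect_set(rows):
    """Like Hive's collect_set: group by user_id, drop duplicates.
    (Insertion order is kept here for readability; Hive gives no
    ordering guarantee for collect_set.)"""
    out = defaultdict(list)
    for user, feature in rows:
        if feature not in out[user]:
            out[user].append(feature)
    return dict(out)

print(collect_list(rows)["user_id1"])  # ['feature1', 'feature2', 'feature3', 'feature3']
print(collect_set(rows)["user_id1"])   # ['feature1', 'feature2', 'feature3']
```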

2017-05-16 0:50 GMT+09:00 goun na <gou...@gmail.com>:

> Hi, Jone Zhang
>
> 1. Hive UDF
> You might need collect_set or collect_list (collect_set eliminates
> duplicates), but make sure to reduce the cardinality before applying the
> UDF, since it can cause problems when handling 1 billion records. Union
> datasets 1, 2, 3 -> group by user_id1 -> collect_set(feature column)
> would work.
>
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
>
> 2. Spark DataFrame Pivot
> https://databricks.com/blog/2016/02/09/reshaping-data-with-pivot-in-apache-spark.html
>
> - Goun
>
> 2017-05-15 22:15 GMT+09:00 Jone Zhang <joyoungzh...@gmail.com>:
>
>> For example
>> Data1(has 1 billion records)
>> user_id1  feature1
>> user_id1  feature2
>>
>> Data2(has 1 billion records)
>> user_id1  feature3
>>
>> Data3(has 1 billion records)
>> user_id1  feature4
>> user_id1  feature5
>> ...
>> user_id1  feature100
>>
>> I want to get the result as follows
>> user_id1  feature1 feature2 feature3 feature4 feature5...feature100
>>
>> Is there a more efficient way than a join?
>>
>> Thanks!
>>
>
>
