Correction: I had it backwards. collect_list keeps duplicate values; collect_set is the one that eliminates them.

2017-05-16 0:50 GMT+09:00 goun na <gou...@gmail.com>:
> Hi, Jone Zhang
>
> 1. Hive UDF
> You might need collect_set or collect_list (to eliminate duplication), but
> make sure to reduce its cardinality before applying UDFs, as it can cause
> problems while handling 1 billion records. Union dataset 1,2,3 -> group by
> user_id1 -> collect_set (feature column) would work.
>
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
>
> 2. Spark Dataframe Pivot
> https://databricks.com/blog/2016/02/09/reshaping-data-with-pivot-in-apache-spark.html
>
> - Goun
>
> 2017-05-15 22:15 GMT+09:00 Jone Zhang <joyoungzh...@gmail.com>:
>
>> For example
>> Data1(has 1 billion records)
>> user_id1 feature1
>> user_id1 feature2
>>
>> Data2(has 1 billion records)
>> user_id1 feature3
>>
>> Data3(has 1 billion records)
>> user_id1 feature4
>> user_id1 feature5
>> ...
>> user_id1 feature100
>>
>> I want to get the result as follows
>> user_id1 feature1 feature2 feature3 feature4 feature5...feature100
>>
>> Is there a more efficient way except join?
>>
>> Thanks!
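
To make the union -> group by -> collect step concrete, here is a minimal plain-Python sketch of the semantics only. It is not Hive or Spark code: the toy rows, the deliberately repeated feature1 row, and the helper name group_features are all assumptions made for illustration. The list models what collect_list would return per user (duplicates kept), and the set models collect_set (duplicates dropped).

```python
from collections import defaultdict

# Hypothetical toy rows standing in for the three large datasets.
data1 = [("user_id1", "feature1"), ("user_id1", "feature2")]
data2 = [("user_id1", "feature3"), ("user_id1", "feature1")]  # feature1 repeated on purpose
data3 = [("user_id1", "feature4"), ("user_id1", "feature5")]


def group_features(*datasets):
    """Union the datasets, group by user id, and collect features two ways."""
    as_list = defaultdict(list)  # models collect_list: keeps duplicates
    as_set = defaultdict(set)    # models collect_set: drops duplicates
    for rows in datasets:        # iterating every dataset is the "union" step
        for user_id, feature in rows:
            as_list[user_id].append(feature)
            as_set[user_id].add(feature)
    return dict(as_list), dict(as_set)


collected_list, collected_set = group_features(data1, data2, data3)
print(collected_list["user_id1"])         # feature1 appears twice here
print(sorted(collected_set["user_id1"]))  # each feature appears once
```

In an actual Hive or Spark job the same shape would be a union of the three tables followed by a group by on the user id with collect_set on the feature column, so the data is shuffled once instead of once per join.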