All right, thanks for the inputs. Is there any way Spark can process all combinations in parallel in one job?
Is it OK to load the input CSV file into a DataFrame, use flatMap to create key pairs, and then use reduceByKey to sum the double arrays? I believe that would work the same as the agg function you are suggesting (rough sketches of both approaches follow the quoted thread below).

Best Regards,
Anil Langote
+1-425-633-9747

> On Nov 11, 2016, at 7:10 PM, ayan guha <guha.a...@gmail.com> wrote:
>
> You can explore grouping sets in SQL and write an aggregate function to add
> the arrays element-wise.
>
> It will boil down to something like
>
> Select attr1, attr2, ..., yourAgg(val)
> From t
> Group by attr1, attr2, ...
> Grouping sets ((attr1, attr2), (attr1))
>
>> On 12 Nov 2016 04:57, "Anil Langote" <anillangote0...@gmail.com> wrote:
>> Hi All,
>>
>> I have been working on one use case and couldn't think of a better
>> solution. I have seen you are very active on the Spark user list, so please
>> share your thoughts on the implementation. Below is the requirement.
>>
>> I have tried using a Dataset by splitting the double array column, but it
>> fails when the array size grows. When I declare the column as an array
>> schema type, Spark doesn't let me sum it because sum is allowed only on
>> numeric types. If I store the data as one Parquet file per combination,
>> there will be too many Parquet files.
>>
>> Input: the input file will look like the sample below; in the real data
>> there are 20 attributes and the double array has 50,000 elements.
>>
>> Attribute_0 | Attribute_1 | Attribute_2 | Attribute_3 | DoubleArray
>> 5 | 3 | 5 | 3 | 0.2938933463658645 0.0437040427073041 0.23002681025029648 0.18003221216680454
>> 3 | 2 | 1 | 3 | 0.5353599620508771 0.026777650111232787 0.31473082754161674 0.2647786522276575
>> 5 | 3 | 5 | 2 | 0.8803063581705307 0.8101324740101096 0.48523937757683544 0.5897714618376072
>> 3 | 2 | 1 | 3 | 0.33960064683141955 0.46537001358164043 0.543428826489435 0.42653939565053034
>> 2 | 2 | 0 | 5 | 0.5108235777360906 0.4368119043922922 0.8651556676944931 0.7451477943975504
>>
>> Below are all the possible combinations for this data set:
>>
>> 1. Attribute_0, Attribute_1
>> 2. Attribute_0, Attribute_2
>> 3. Attribute_0, Attribute_3
>> 4. Attribute_1, Attribute_2
>> 5. Attribute_2, Attribute_3
>> 6. Attribute_1, Attribute_3
>> 7. Attribute_0, Attribute_1, Attribute_2
>> 8. Attribute_0, Attribute_1, Attribute_3
>> 9. Attribute_0, Attribute_2, Attribute_3
>> 10. Attribute_1, Attribute_2, Attribute_3
>> 11. Attribute_0, Attribute_1, Attribute_2, Attribute_3
>>
>> All of these combinations have to be processed on the input data, preferably
>> in parallel, to get good performance.
>>
>> Attribute_0, Attribute_1
>>
>> In this iteration the other attributes (Attribute_2, Attribute_3) are not
>> required; all we need are the Attribute_0 and Attribute_1 columns plus the
>> double array. In the sample data two keys occur, 5_3 and 3_2; we keep only
>> keys that appear in at least 2 rows, and in the real data there will be
>> thousands of them.
>> Attribute_0 | Attribute_1 | Attribute_2 | Attribute_3 | DoubleArray
>> 5 | 3 | 5 | 3 | 0.2938933463658645 0.0437040427073041 0.23002681025029648 0.18003221216680454
>> 3 | 2 | 1 | 3 | 0.5353599620508771 0.026777650111232787 0.31473082754161674 0.2647786522276575
>> 5 | 3 | 5 | 2 | 0.8803063581705307 0.8101324740101096 0.48523937757683544 0.5897714618376072
>> 3 | 2 | 1 | 3 | 0.33960064683141955 0.46537001358164043 0.543428826489435 0.42653939565053034
>> 2 | 2 | 0 | 5 | 0.5108235777360906 0.4368119043922922 0.8651556676944931 0.7451477943975504
>>
>> When we group the above dataset by Attribute_0 and Attribute_1 we get two
>> records with keys 5_3 and 3_2, and each key holds two double arrays:
>>
>> 5_3 ==> 0.2938933463658645 0.0437040427073041 0.23002681025029648 0.18003221216680454
>>         0.8803063581705307 0.8101324740101096 0.48523937757683544 0.5897714618376072
>>
>> 3_2 ==> 0.5353599620508771 0.026777650111232787 0.31473082754161674 0.2647786522276575
>>         0.33960064683141955 0.46537001358164043 0.543428826489435 0.42653939565053034
>>
>> Now we have to add these double arrays index-wise to produce one array per key:
>>
>> 5_3 ==> [1.1741997045363952, 0.8538365167174137, 0.7152661878271319, 0.7698036740044117]
>> 3_2 ==> [0.8749606088822967, 0.4921476636928732, 0.8581596540310518, 0.6913180478781878]
>>
>> After adding we have to compute the average, min, max, etc. on these vectors
>> and store the results against the keys.
>>
>> The same process is repeated for the next combination.
>>
>> Thank you
>> Anil Langote
>> +1-425-633-9747
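
Below is a minimal sketch, in Scala, of the flatMap + reduceByKey idea from the top of this thread. The file path, the assumption of a headerless comma-separated file with the space-separated doubles in the fifth field, and the hard-coded list of attribute-index combinations are illustrative assumptions only, not details from the original post.

import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object CombinationSums {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("combination-sums").getOrCreate()
    val sc: SparkContext = spark.sparkContext

    // Attribute-column indices for each combination (hypothetical, for the
    // 4-attribute sample; the real job would generate these for 20 attributes).
    val combinations: Seq[Seq[Int]] = Seq(
      Seq(0, 1), Seq(0, 2), Seq(0, 3), Seq(1, 2), Seq(2, 3), Seq(1, 3),
      Seq(0, 1, 2), Seq(0, 1, 3), Seq(0, 2, 3), Seq(1, 2, 3), Seq(0, 1, 2, 3))

    // Assumes a headerless comma-separated file whose last field is the
    // space-separated double array.
    val lines = sc.textFile("input.csv")

    // One pass over the data: every row is emitted once per combination,
    // keyed by (combination id, attribute values), so every combination is
    // aggregated in the same job.
    val keyed = lines.map(_.split(",")).flatMap { fields =>
      val attrs = fields.take(4)
      val arr = fields(4).trim.split("\\s+").map(_.toDouble)
      combinations.zipWithIndex.map { case (combo, comboId) =>
        val key = comboId + ":" + combo.map(attrs(_)).mkString("_")
        (key, (arr, 1L))                    // carry a row count alongside the array
      }
    }

    // Element-wise sum of the arrays plus the row count per key.
    val summed = keyed.reduceByKey { case ((a, ca), (b, cb)) =>
      (a.zip(b).map { case (x, y) => x + y }, ca + cb)
    }.filter { case (_, (_, count)) => count >= 2 }   // keep keys with at least 2 rows

    // Average, min and max of the summed vector for each surviving key.
    val stats = summed.mapValues { case (vec, count) =>
      (vec.sum / vec.length, vec.min, vec.max, count)
    }

    stats.take(5).foreach(println)
    spark.stop()
  }
}

Because every combination key is emitted from the same flatMap, all combinations are aggregated through a single shuffle in one job, which is what the question at the top of the thread asks about.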
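
And a sketch of the grouping-sets direction from the reply above, as it might be typed into spark-shell, assuming Spark 2.1+ (for posexplode) and the same assumed CSV layout. Instead of writing the custom aggregate function the reply mentions, this version explodes the array by position so the built-in SUM does the index-wise addition for several grouping sets in one job; the per-position sums are then re-grouped to get the statistics per key.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("grouping-sets-sketch").getOrCreate()

// Assumed layout: header row, comma-separated attribute columns, and the
// space-separated doubles in a DoubleArray column.
val df = spark.read.option("header", "true").csv("input.csv")
  .withColumn("DoubleArray", split(col("DoubleArray"), " ").cast("array<double>"))

df.createOrReplaceTempView("t")

// Index-wise sums for two grouping sets computed in a single job. Rows
// produced by the (Attribute_0, pos) set carry NULL in Attribute_1.
val summed = spark.sql("""
  SELECT Attribute_0, Attribute_1, pos, SUM(v) AS s, COUNT(*) AS cnt
  FROM t
  LATERAL VIEW posexplode(DoubleArray) x AS pos, v
  GROUP BY Attribute_0, Attribute_1, pos
  GROUPING SETS ((Attribute_0, Attribute_1, pos), (Attribute_0, pos))
""")

// Drop keys seen in fewer than 2 input rows, then compute average, min and
// max over the summed vector of each key.
val stats = summed.filter(col("cnt") >= 2)
  .groupBy("Attribute_0", "Attribute_1")
  .agg(avg("s").as("mean"), min("s").as("min"), max("s").as("max"))

stats.show()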