Hi All,
I have been working on a use case and haven't been able to come up with a good solution. I have seen you are very active on the Spark user list, so please share your thoughts on the implementation. The requirement is below.

I have tried using a Dataset by splitting the double array column into separate columns, but that fails when the array size grows. When I declare the column as an array-of-double type in the schema, Spark doesn't let me sum the arrays, because sum is supported only on numeric types. If I store the data per combination as Parquet, there will be too many Parquet files.

Input: the input file looks like the table below. In the real data there will be 20 attributes and the double array will have 50,000 elements.

Attribute_0 | Attribute_1 | Attribute_2 | Attribute_3 | DoubleArray
5 | 3 | 5 | 3 | 0.2938933463658645 0.0437040427073041 0.23002681025029648 0.18003221216680454
3 | 2 | 1 | 3 | 0.5353599620508771 0.026777650111232787 0.31473082754161674 0.2647786522276575
5 | 3 | 5 | 2 | 0.8803063581705307 0.8101324740101096 0.48523937757683544 0.5897714618376072
3 | 2 | 1 | 3 | 0.33960064683141955 0.46537001358164043 0.543428826489435 0.42653939565053034
2 | 2 | 0 | 5 | 0.5108235777360906 0.4368119043922922 0.8651556676944931 0.7451477943975504

Below are all the possible combinations for the above data set:

1. Attribute_0, Attribute_1
2. Attribute_0, Attribute_2
3. Attribute_0, Attribute_3
4. Attribute_1, Attribute_2
5. Attribute_2, Attribute_3
6. Attribute_1, Attribute_3
7. Attribute_0, Attribute_1, Attribute_2
8. Attribute_0, Attribute_1, Attribute_3
9. Attribute_0, Attribute_2, Attribute_3
10. Attribute_1, Attribute_2, Attribute_3
11. Attribute_0, Attribute_1, Attribute_2, Attribute_3

We have to process all of these combinations on the input data, preferably in parallel, to get good performance.

Take the combination Attribute_0, Attribute_1. In this iteration the other attributes (Attribute_2, Attribute_3) are not required; all we need are the Attribute_0, Attribute_1 and double array columns. Looking at the data, two key values occur more than once: 5_3 and 3_2. We have to pick only the keys that occur at least twice (the key 2_2 appears in a single row, so it is dropped); in the real data there will be thousands of such keys.

When we do a groupBy on the dataset with columns Attribute_0, Attribute_1 we get two records, with keys 5_3 and 3_2, and each key has two double arrays:

5_3 ==> 0.2938933463658645 0.0437040427073041 0.23002681025029648 0.18003221216680454
        0.8803063581705307 0.8101324740101096 0.48523937757683544 0.5897714618376072
3_2 ==> 0.5353599620508771 0.026777650111232787 0.31473082754161674 0.2647786522276575
        0.33960064683141955 0.46537001358164043 0.543428826489435 0.42653939565053034

Now we have to add these double arrays index-wise to produce one array per key:

5_3 ==> [1.1741997045363952, 0.8538365167174137, 0.7152661878271319, 0.7698036740044117]
3_2 ==> [0.8749606088822967, 0.4921476636928732, 0.8581596540310518, 0.6913180478781878]

After adding, we have to compute the average, min, max, etc. on these vectors and store the results against the keys. The same process is repeated for the next combination. I have put a rough sketch of what I have in mind in the P.S. below.

Thank you
Anil Langote
+1-425-633-9747
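P.S. To make the single-combination step concrete, here is a minimal sketch of what I have in mind, written for the spark-shell. The inlined sample rows and the RDD reduceByKey approach are my own assumptions, not a tested solution; the idea is to carry a row count next to each array so keys that occur only once can be filtered out after the index-wise sum.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("combination-agg").getOrCreate()

// Sample rows from the table above: (Attribute_0..Attribute_3, DoubleArray).
val rows = Seq(
  (5, 3, 5, 3, Array(0.2938933463658645, 0.0437040427073041, 0.23002681025029648, 0.18003221216680454)),
  (3, 2, 1, 3, Array(0.5353599620508771, 0.026777650111232787, 0.31473082754161674, 0.2647786522276575)),
  (5, 3, 5, 2, Array(0.8803063581705307, 0.8101324740101096, 0.48523937757683544, 0.5897714618376072)),
  (3, 2, 1, 3, Array(0.33960064683141955, 0.46537001358164043, 0.543428826489435, 0.42653939565053034)),
  (2, 2, 0, 5, Array(0.5108235777360906, 0.4368119043922922, 0.8651556676944931, 0.7451477943975504))
)
val rdd = spark.sparkContext.parallelize(rows)

// Combination (Attribute_0, Attribute_1): key each row by "a0_a1" and carry a
// row count alongside the array so singleton keys can be dropped afterwards.
val summed = rdd
  .map { case (a0, a1, _, _, arr) => (s"${a0}_${a1}", (1L, arr)) }
  .reduceByKey { case ((n1, v1), (n2, v2)) =>
    (n1 + n2, v1.zip(v2).map { case (x, y) => x + y }) // index-wise sum
  }
  .filter { case (_, (n, _)) => n >= 2 } // keep keys with at least 2 rows

// avg / min / max over each summed vector, stored against the key.
val stats = summed.mapValues { case (n, v) => (v.sum / v.length, v.min, v.max) }
stats.collect().foreach(println) // one (key, (avg, min, max)) per surviving key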
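For the parallel part, my rough idea (again only a sketch, continuing from the code above; the generic row shape, the size-2-to-4 combination generation, and the one-Future-per-combination scheme are all my own assumptions) is to reshape each row into (attribute vector, array) and submit one Spark job per combination, since the scheduler can run jobs from different driver threads concurrently:

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Reshape rows to (all attribute values, array) so any combination can be
// expressed as a list of attribute indices; cache so all jobs share one scan.
val generic = rdd.map { case (a0, a1, a2, a3, arr) => (Vector(a0, a1, a2, a3), arr) }.cache()

// All index combinations of size 2 to 4 over the sample's 4 attributes (the
// real data would generate these over 20 attributes instead).
val combos: Seq[Seq[Int]] = (2 to 4).flatMap(k => (0 to 3).combinations(k).map(_.toSeq))

// One Spark job per combination, submitted concurrently from the driver.
val jobs = combos.map { combo =>
  Future {
    val stats = generic
      .map { case (attrs, arr) => (combo.map(attrs).mkString("_"), (1L, arr)) }
      .reduceByKey { case ((n1, v1), (n2, v2)) =>
        (n1 + n2, v1.zip(v2).map { case (x, y) => x + y })
      }
      .filter { case (_, (n, _)) => n >= 2 }
      .mapValues { case (n, v) => (v.sum / v.length, v.min, v.max) }
      .collect()
    (combo.map(i => s"Attribute_$i").mkString(", "), stats)
  }
}
val results = Await.result(Future.sequence(jobs), Duration.Inf)

I am not sure this scales to 50,000-element arrays and thousands of keys, which is exactly where I am stuck, so any pointers are appreciated.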