Hi All,
I have been working on a use case and haven't been able to come up with a good solution. I have seen you are very active on the Spark user list, so please share your thoughts on the implementation. The requirement is below.

I have tried using a Dataset by splitting the double array column into separate columns, but that fails when the array size grows. When I declare the column as an array-of-double type in the schema, Spark doesn't let me sum the arrays, because sum is supported only on numeric types. If I store the data per combination as Parquet, there will be too many Parquet files.

Input: the input file looks like the table below. In the real data there will be 20 attributes and the double array will have 50,000 elements.

Attribute_0 | Attribute_1 | Attribute_2 | Attribute_3 | DoubleArray
5 | 3 | 5 | 3 | 0.2938933463658645 0.0437040427073041 0.23002681025029648 0.18003221216680454
3 | 2 | 1 | 3 | 0.5353599620508771 0.026777650111232787 0.31473082754161674 0.2647786522276575
5 | 3 | 5 | 2 | 0.8803063581705307 0.8101324740101096 0.48523937757683544 0.5897714618376072
3 | 2 | 1 | 3 | 0.33960064683141955 0.46537001358164043 0.543428826489435 0.42653939565053034
2 | 2 | 0 | 5 | 0.5108235777360906 0.4368119043922922 0.8651556676944931 0.7451477943975504

Below are all the possible combinations for the above data set:

1. Attribute_0, Attribute_1
2. Attribute_0, Attribute_2
3. Attribute_0, Attribute_3
4. Attribute_1, Attribute_2
5. Attribute_2, Attribute_3
6. Attribute_1, Attribute_3
7. Attribute_0, Attribute_1, Attribute_2
8. Attribute_0, Attribute_1, Attribute_3
9. Attribute_0, Attribute_2, Attribute_3
10. Attribute_1, Attribute_2, Attribute_3
11. Attribute_0, Attribute_1, Attribute_2, Attribute_3

We have to process all of these combinations on the input data, preferably in parallel, to get good performance.

Take the combination Attribute_0, Attribute_1. In this iteration the other attributes (Attribute_2, Attribute_3) are not required; all we need are the Attribute_0, Attribute_1 and double array columns. Looking at the data, two key values occur more than once: 5_3 and 3_2. We have to pick only the keys that occur at least twice (the key 2_2 appears in a single row, so it is dropped); in the real data there will be thousands of such keys.

When we do a groupBy on the dataset with columns Attribute_0, Attribute_1 we get two records, with keys 5_3 and 3_2, and each key has two double arrays:

5_3 ==> 0.2938933463658645 0.0437040427073041 0.23002681025029648 0.18003221216680454
        0.8803063581705307 0.8101324740101096 0.48523937757683544 0.5897714618376072
3_2 ==> 0.5353599620508771 0.026777650111232787 0.31473082754161674 0.2647786522276575
        0.33960064683141955 0.46537001358164043 0.543428826489435 0.42653939565053034

Now we have to add these double arrays index-wise to produce one array per key:

5_3 ==> [1.1741997045363952, 0.8538365167174137, 0.7152661878271319, 0.7698036740044117]
3_2 ==> [0.8749606088822967, 0.4921476636928732, 0.8581596540310518, 0.6913180478781878]

After adding, we have to compute the average, min, max, etc. on these vectors and store the results against the keys. The same process is repeated for the next combination. I have put a rough sketch of what I have in mind in the P.S. below.

Thank you
Anil Langote
+1-425-633-9747
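P.S. To make the single-combination step concrete, here is a minimal sketch of what I have in mind, written for the spark-shell. The inlined sample rows and the RDD reduceByKey approach are my own assumptions, not a tested solution; the idea is to carry a row count next to each array so keys that occur only once can be filtered out after the index-wise sum.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("combination-agg").getOrCreate()

// Sample rows from the table above: (Attribute_0..Attribute_3, DoubleArray).
val rows = Seq(
  (5, 3, 5, 3, Array(0.2938933463658645, 0.0437040427073041, 0.23002681025029648, 0.18003221216680454)),
  (3, 2, 1, 3, Array(0.5353599620508771, 0.026777650111232787, 0.31473082754161674, 0.2647786522276575)),
  (5, 3, 5, 2, Array(0.8803063581705307, 0.8101324740101096, 0.48523937757683544, 0.5897714618376072)),
  (3, 2, 1, 3, Array(0.33960064683141955, 0.46537001358164043, 0.543428826489435, 0.42653939565053034)),
  (2, 2, 0, 5, Array(0.5108235777360906, 0.4368119043922922, 0.8651556676944931, 0.7451477943975504))
)
val rdd = spark.sparkContext.parallelize(rows)

// Combination (Attribute_0, Attribute_1): key each row by "a0_a1" and carry a
// row count alongside the array so singleton keys can be dropped afterwards.
val summed = rdd
  .map { case (a0, a1, _, _, arr) => (s"${a0}_${a1}", (1L, arr)) }
  .reduceByKey { case ((n1, v1), (n2, v2)) =>
    (n1 + n2, v1.zip(v2).map { case (x, y) => x + y }) // index-wise sum
  }
  .filter { case (_, (n, _)) => n >= 2 } // keep keys with at least 2 rows

// avg / min / max over each summed vector, stored against the key.
val stats = summed.mapValues { case (n, v) => (v.sum / v.length, v.min, v.max) }
stats.collect().foreach(println) // one (key, (avg, min, max)) per surviving key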
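For the parallel part, my rough idea (again only a sketch, continuing from the code above; the generic row shape, the size-2-to-4 combination generation, and the one-Future-per-combination scheme are all my own assumptions) is to reshape each row into (attribute vector, array) and submit one Spark job per combination, since the scheduler can run jobs from different driver threads concurrently:

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Reshape rows to (all attribute values, array) so any combination can be
// expressed as a list of attribute indices; cache so all jobs share one scan.
val generic = rdd.map { case (a0, a1, a2, a3, arr) => (Vector(a0, a1, a2, a3), arr) }.cache()

// All index combinations of size 2 to 4 over the sample's 4 attributes (the
// real data would generate these over 20 attributes instead).
val combos: Seq[Seq[Int]] = (2 to 4).flatMap(k => (0 to 3).combinations(k).map(_.toSeq))

// One Spark job per combination, submitted concurrently from the driver.
val jobs = combos.map { combo =>
  Future {
    val stats = generic
      .map { case (attrs, arr) => (combo.map(attrs).mkString("_"), (1L, arr)) }
      .reduceByKey { case ((n1, v1), (n2, v2)) =>
        (n1 + n2, v1.zip(v2).map { case (x, y) => x + y })
      }
      .filter { case (_, (n, _)) => n >= 2 }
      .mapValues { case (n, v) => (v.sum / v.length, v.min, v.max) }
      .collect()
    (combo.map(i => s"Attribute_$i").mkString(", "), stats)
  }
}
val results = Await.result(Future.sequence(jobs), Duration.Inf)

I am not sure this scales to 50,000-element arrays and thousands of keys, which is exactly where I am stuck, so any pointers are appreciated.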