All right, thanks for the inputs. Is there any way Spark can process all the combinations 
in parallel in one job?

Is it OK to load the input CSV file into a DataFrame and use flatMap to create 
key pairs, then use reduceByKey to sum the double arrays? I believe that would 
work the same as the agg function you are suggesting.
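
For reference, here is a rough, untested sketch of what I have in mind (Scala; the 
input path and column names are illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("combination-agg").getOrCreate()

// attribute columns and every combination of two or more of them
val attrCols = Seq("Attribute_0", "Attribute_1", "Attribute_2", "Attribute_3")
val combos = (2 to attrCols.size).flatMap(k => attrCols.combinations(k).map(_.toSeq))

val df = spark.read.option("header", "true").csv("/path/to/input.csv")  // illustrative path

// emit one (combination + key, doubleArray) pair per row per combination,
// then sum the arrays index-wise with reduceByKey
val pairs = df.rdd.flatMap { row =>
  val values = row.getAs[String]("DoubleArray").trim.split("\\s+").map(_.toDouble)
  combos.map { combo =>
    val key = combo.mkString(",") + "|" + combo.map(c => row.getAs[String](c)).mkString("_")
    (key, values)
  }
}
val summed = pairs.reduceByKey((a, b) => a.zip(b).map { case (x, y) => x + y })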

Best Regards,
Anil Langote
+1-425-633-9747

> On Nov 11, 2016, at 7:10 PM, ayan guha <guha.a...@gmail.com> wrote:
> 
> You can explore grouping sets in SQL and write an aggregate function to do the 
> array-wise sum.
> 
> It will boil down to something like
> 
> Select attr1,attr2...,yourAgg(Val)
> From t
> Group by attr1,attr2...
> Grouping sets((attr1,attr2),(attr1))
> 
>> On 12 Nov 2016 04:57, "Anil Langote" <anillangote0...@gmail.com> wrote:
>> Hi All,
>> 
>>  
>> 
>> I have been working on one use case and couldn't come up with a good solution. 
>> I have seen you are very active on the Spark user list, so please share your 
>> thoughts on the implementation. Below is the requirement.
>> 
>>  
>> 
>> I have tried using a Dataset and splitting the double array into separate columns, 
>> but that fails when the array size grows. When I define the column with an 
>> array-of-double data type, Spark doesn't let me sum it, because aggregation is 
>> supported only on numeric types. If I store the data per combination in Parquet, 
>> there will be too many Parquet files.
>> 
>>  
>> 
>> Input: the input file will look like the table below; in the real data there will 
>> be 20 attributes and the double array will have 50,000 elements.
>> 
>>  
>> 
>>  
>> 
>> Attribute_0  Attribute_1  Attribute_2  Attribute_3  DoubleArray
>> 5            3            5            3            0.2938933463658645  0.0437040427073041  0.23002681025029648  0.18003221216680454
>> 3            2            1            3            0.5353599620508771  0.026777650111232787  0.31473082754161674  0.2647786522276575
>> 5            3            5            2            0.8803063581705307  0.8101324740101096  0.48523937757683544  0.5897714618376072
>> 3            2            1            3            0.33960064683141955  0.46537001358164043  0.543428826489435  0.42653939565053034
>> 2            2            0            5            0.5108235777360906  0.4368119043922922  0.8651556676944931  0.7451477943975504
>> 
>>  
>> 
>> Below are all the possible combinations for the above data set (a short sketch for 
>> generating them programmatically follows the list):
>> 
>>  
>> 
>> 1.  Attribute_0, Attribute_1
>> 2.  Attribute_0, Attribute_2
>> 3.  Attribute_0, Attribute_3
>> 4.  Attribute_1, Attribute_2
>> 5.  Attribute_2, Attribute_3
>> 6.  Attribute_1, Attribute_3
>> 7.  Attribute_0, Attribute_1, Attribute_2
>> 8.  Attribute_0, Attribute_1, Attribute_3
>> 9.  Attribute_0, Attribute_2, Attribute_3
>> 10. Attribute_1, Attribute_2, Attribute_3
>> 11. Attribute_0, Attribute_1, Attribute_2, Attribute_3
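>> 
>> These could be generated programmatically; a rough Scala sketch, assuming the 
>> attribute column names shown above:
>> 
>> val attrCols = Seq("Attribute_0", "Attribute_1", "Attribute_2", "Attribute_3")
>> // every subset of two or more attribute columns: C(4,2) + C(4,3) + C(4,4) = 11 combinations
>> val combos: Seq[Seq[String]] = (2 to attrCols.size).flatMap(k => attrCols.combinations(k).map(_.toSeq))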
>> 
>>  
>> 
>> Now we have to process all of these combinations on the input data, preferably in 
>> parallel, to get good performance.
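>> 
>> One rough, untested way to run them in parallel from the driver (Scala futures; 
>> processCombination is a hypothetical placeholder for the per-combination 
>> aggregation, not a real API):
>> 
>> import scala.concurrent.{Await, Future}
>> import scala.concurrent.ExecutionContext.Implicits.global
>> import scala.concurrent.duration.Duration
>> 
>> // combos is the Seq[Seq[String]] of attribute combinations generated above;
>> // processCombination(df, combo) would do the groupBy/sum for one combination.
>> val jobs = combos.map(combo => Future { processCombination(df, combo) })
>> Await.result(Future.sequence(jobs), Duration.Inf)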
>> 
>>  
>> 
>> Attribute_0, Attribute_1
>> 
>>  
>> 
>> In this iteration the other attributes (Attribute_2, Attribute_3) are not 
>> required; all we need are Attribute_0, Attribute_1 and the double array column. If 
>> you look at the data there are two keys present for this combination, 5_3 and 3_2. 
>> We have to pick only the keys that occur at least twice; in the real data we will 
>> get thousands of them.
>> 
>>  
>> 
>>  
>> 
>> Attribute_0  Attribute_1  Attribute_2  Attribute_3  DoubleArray
>> 5            3            5            3            0.2938933463658645  0.0437040427073041  0.23002681025029648  0.18003221216680454
>> 3            2            1            3            0.5353599620508771  0.026777650111232787  0.31473082754161674  0.2647786522276575
>> 5            3            5            2            0.8803063581705307  0.8101324740101096  0.48523937757683544  0.5897714618376072
>> 3            2            1            3            0.33960064683141955  0.46537001358164043  0.543428826489435  0.42653939565053034
>> 2            2            0            5            0.5108235777360906  0.4368119043922922  0.8651556676944931  0.7451477943975504
>> 
>>  
>> 
>> When we do the groupBy on the above dataset with columns Attribute_0 and 
>> Attribute_1 we get two records, with keys 5_3 and 3_2, and each key has two 
>> double arrays.
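>> 
>> A rough DataFrame sketch of this grouping (untested; each element of "arrays" is 
>> still the space-separated string from the CSV, and the cnt >= 2 filter keeps only 
>> keys that occur at least twice):
>> 
>> import org.apache.spark.sql.functions.{col, collect_list, count}
>> 
>> val grouped = df.groupBy("Attribute_0", "Attribute_1")
>>   .agg(collect_list(col("DoubleArray")).as("arrays"), count(col("DoubleArray")).as("cnt"))
>>   .filter(col("cnt") >= 2)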
>> 
>>  
>> 
>> 5_3 ==> [0.2938933463658645, 0.0437040427073041, 0.23002681025029648, 0.18003221216680454]
>>       & [0.8803063581705307, 0.8101324740101096, 0.48523937757683544, 0.5897714618376072]
>> 
>> 3_2 ==> [0.5353599620508771, 0.026777650111232787, 0.31473082754161674, 0.2647786522276575]
>>       & [0.33960064683141955, 0.46537001358164043, 0.543428826489435, 0.42653939565053034]
>> 
>>  
>> 
>> Now we have to add these double arrays index-wise and produce one array:
>> 
>>  
>> 
>> 5_3 ==> [1.1741997045363952, 0.8538365167174137, 0.7152661878271319, 0.7698036740044117]
>> 
>> 3_2 ==> [0.8749606088822967, 0.4921476636928732, 0.8581596540310518, 0.6913180478781878]
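>> 
>> In code, the index-wise addition for the 5_3 key is just a zip and add in plain 
>> Scala, for example:
>> 
>> val a = Array(0.2938933463658645, 0.0437040427073041, 0.23002681025029648, 0.18003221216680454)
>> val b = Array(0.8803063581705307, 0.8101324740101096, 0.48523937757683544, 0.5897714618376072)
>> val added = a.zip(b).map { case (x, y) => x + y }
>> // added: [1.1741997045363952, 0.8538365167174137, 0.7152661878271319, 0.7698036740044117]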
>> 
>>  
>> 
>> After adding we have to compute the average, min, max, etc. on this vector and 
>> store the results against the key.
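>> 
>> A minimal sketch of those statistics on one summed vector:
>> 
>> val v = Array(1.1741997045363952, 0.8538365167174137, 0.7152661878271319, 0.7698036740044117)
>> val avg = v.sum / v.length   // average of the summed vector
>> val mn  = v.min
>> val mx  = v.max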
>> 
>>  
>> 
>> The same process will be repeated for the next combinations.
>> 
>>  
>> 
>>  
>> 
>>  
>> 
>> Thank you
>> 
>> Anil Langote
>> 
>> +1-425-633-9747
>> 
>>  
