Sure. Let me explain my requirement. I have an input file with 25 attribute columns, and the last column is an array of doubles (14,500 elements in the original file). Below is a simplified sample with 4 attributes and 4-element arrays:
 
Attribute_0  Attribute_1  Attribute_2  Attribute_3  DoubleArray
5            3            5            3            0.2938933463658645  0.0437040427073041  0.23002681025029648  0.18003221216680454
3            2            1            3            0.5353599620508771  0.026777650111232787  0.31473082754161674  0.2647786522276575
5            3            5            2            0.8803063581705307  0.8101324740101096  0.48523937757683544  0.5897714618376072
3            2            1            3            0.33960064683141955  0.46537001358164043  0.543428826489435  0.42653939565053034
2            2            0            5            0.5108235777360906  0.4368119043922922  0.8651556676944931  0.7451477943975504
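
For reference, this is roughly how such a table maps onto a Spark SQL schema (a sketch mirroring the 4-attribute sample; the integer attribute types here are an assumption, and the real data has 25 attribute columns and 14,500-element arrays):

  import org.apache.spark.sql.types._

  // Sketch of the sample's schema: four attribute columns plus an
  // array-of-doubles column. Names and types mirror the sample above.
  val schema = StructType(Seq(
    StructField("Attribute_0", IntegerType),
    StructField("Attribute_1", IntegerType),
    StructField("Attribute_2", IntegerType),
    StructField("Attribute_3", IntegerType),
    StructField("DoubleArray", ArrayType(DoubleType))
  ))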
 
Now I have to compute the element-wise addition of the double arrays for any given combination of attributes. For example, the above file gives the possible combinations below (a small sketch of enumerating them follows the list).
 
1.      Attribute_0, Attribute_1
2.      Attribute_0, Attribute_2
3.      Attribute_0, Attribute_3
4.      Attribute_1, Attribute_2
5.      Attribute_2, Attribute_3
6.      Attribute_1, Attribute_3
7.      Attribute_0, Attribute_1, Attribute_2
8.      Attribute_0, Attribute_1, Attribute_3
9.      Attribute_0, Attribute_2, Attribute_3
10.     Attribute_1, Attribute_2, Attribute_3
11.     Attribute_0, Attribute_1, Attribute_2, Attribute_3
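
Just to illustrate, a minimal sketch of enumerating these column combinations in plain Scala (the attribute names are the four from the sample; the real file has 25 attributes, and my use case ends up with about 400 combinations):

  // Attribute column names from the sample; the real file has 25 of them.
  val attributes = Seq("Attribute_0", "Attribute_1", "Attribute_2", "Attribute_3")

  // All combinations of 2 or more attributes (6 pairs + 4 triples + 1 quadruple = 11).
  val combos: Seq[Seq[String]] =
    (2 to attributes.size).flatMap(k => attributes.combinations(k).toSeq)

  combos.foreach(c => println(c.mkString(", ")))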
 
Now, if we process the Attribute_0, Attribute_1 combination, we want the output below. All of the other combinations have to be processed in the same way.
 
5_3 ==>  [1.1741997045363952, 0.8538365167174137, 0.7152661878271319, 
0.7698036740044117]
3_2 ==> [0.8749606088822967, 0.4921476636928732, 0.8581596540310518, 
0.6913180478781878]
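
Each output vector is simply the element-wise sum of the DoubleArray values of all rows in that group; a minimal sketch in plain Scala, using the two 5_3 rows from the sample:

  // Element-wise sum of two equal-length double arrays.
  def addArrays(a: Array[Double], b: Array[Double]): Array[Double] =
    a.zip(b).map { case (x, y) => x + y }

  // The two rows with Attribute_0 = 5 and Attribute_1 = 3 in the sample.
  val row1 = Array(0.2938933463658645, 0.0437040427073041, 0.23002681025029648, 0.18003221216680454)
  val row3 = Array(0.8803063581705307, 0.8101324740101096, 0.48523937757683544, 0.5897714618376072)

  // Prints the 5_3 result shown above:
  // [1.1741997045363952, 0.8538365167174137, 0.7152661878271319, 0.7698036740044117]
  println(addArrays(row1, row3).mkString("[", ", ", "]"))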
 
Solution tried
 
I have created a parquet file with this schema, where the last column is the array of doubles. The parquet file I have is 276 GB and holds 2.65 M records.
 
I have implemented a UDAF with the following schemas:
 
Input schema : array of doubles
Buffer schema : array of doubles 
Return schema : array of doubles
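
For context, a minimal sketch of what a UDAF with these schemas can look like on the Spark 2.x UserDefinedAggregateFunction API (the class name and the array-length constructor parameter are placeholders, not my exact implementation):

  import org.apache.spark.sql.Row
  import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
  import org.apache.spark.sql.types._

  // Sketch of an element-wise array-sum UDAF; assumes all arrays have the
  // same length and contain no nulls.
  class ArraySum(arrayLength: Int) extends UserDefinedAggregateFunction {

    override def inputSchema: StructType =
      StructType(StructField("value", ArrayType(DoubleType)) :: Nil)

    override def bufferSchema: StructType =
      StructType(StructField("sum", ArrayType(DoubleType)) :: Nil)

    override def dataType: DataType = ArrayType(DoubleType)

    override def deterministic: Boolean = true

    // Start each group with a zero vector.
    override def initialize(buffer: MutableAggregationBuffer): Unit =
      buffer(0) = Seq.fill(arrayLength)(0.0)

    // Add one input row's array into the running sum.
    override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
      val acc = buffer.getSeq[Double](0)
      val in = input.getSeq[Double](0)
      buffer(0) = acc.zip(in).map { case (a, b) => a + b }
    }

    // Combine two partial sums.
    override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
      val a = buffer1.getSeq[Double](0)
      val b = buffer2.getSeq[Double](0)
      buffer1(0) = a.zip(b).map { case (x, y) => x + y }
    }

    override def evaluate(buffer: Row): Any = buffer.getSeq[Double](0)
  }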
 
I load the data from the parquet file and then register the UDAF for use in the query below (note that SUM here is the UDAF):
 
SELECT COUNT(*) AS MATCHES, SUM(DOUBLEARRAY), Attribute_0, Attribute_1 FROM 
RAW_TABLE GROUP BY Attribute_0, Attribute_1 HAVING COUNT(*)>1
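
Roughly, the load-and-query step looks like this sketch (the parquet path, the app name, and the ArraySum class from the sketch above are placeholders):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("ArraySumAggregation").getOrCreate()

  // Placeholder path; the actual file is ~276 GB with 2.65 M records.
  val raw = spark.read.parquet("/path/to/raw_table.parquet")
  raw.createOrReplaceTempView("RAW_TABLE")

  // Register the array-sum UDAF under the name SUM, as used in the query.
  spark.udf.register("SUM", new ArraySum(14500))

  // One combination; the others only differ in the grouping columns.
  val result = spark.sql(
    """SELECT COUNT(*) AS MATCHES, SUM(DOUBLEARRAY), Attribute_0, Attribute_1
      |FROM RAW_TABLE
      |GROUP BY Attribute_0, Attribute_1
      |HAVING COUNT(*) > 1""".stripMargin)

  result.show(truncate = false)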
 
This works fine and takes 1.2 minutes for one combination. My use case will have 400 combinations, which means about 8 hours; that does not meet the SLA, which requires the whole run to finish in under 1 hour. What is the best way to implement this use case?

Best Regards,
Anil Langote
+1-425-633-9747

> On Jan 8, 2017, at 8:17 PM, Holden Karau <hol...@pigscanfly.ca> wrote:
> 
> To start with, caching and having a known partitioner will help a bit; then 
> there is also the IndexedRDD project, but in general Spark might not be the 
> best tool for the job. Have you considered having Spark output to something 
> like memcache?
> 
> What is the goal you are trying to accomplish?
> 
>> On Sun, Jan 8, 2017 at 5:04 PM Anil Langote <anillangote0...@gmail.com> 
>> wrote:
>> Hi All,
>> 
>> I have a requirement where I want to build a distributed HashMap which 
>> holds 10M key-value pairs and provides very efficient lookups for each key. 
>> I tried loading the file into a JavaPairRDD and calling the lookup method, 
>> but it is very slow.
>> 
>> How can I achieve very fast lookups by a given key?
>> 
>> Thank you
>> Anil Langote 
