Sure, let me explain my requirement. I have an input file with 25 attribute columns, and the last column is an array of doubles (14,500 elements in the original file). Here is a small sample with four attributes and four-element arrays:

Attribute_0  Attribute_1  Attribute_2  Attribute_3  DoubleArray
5  3  5  3  [0.2938933463658645, 0.0437040427073041, 0.23002681025029648, 0.18003221216680454]
3  2  1  3  [0.5353599620508771, 0.026777650111232787, 0.31473082754161674, 0.2647786522276575]
5  3  5  2  [0.8803063581705307, 0.8101324740101096, 0.48523937757683544, 0.5897714618376072]
3  2  1  3  [0.33960064683141955, 0.46537001358164043, 0.543428826489435, 0.42653939565053034]
2  2  0  5  [0.5108235777360906, 0.4368119043922922, 0.8651556676944931, 0.7451477943975504]

Now I have to compute the element-wise sum of the double arrays for any given combination of attributes. For the sample above we have the following possible combinations:

1. Attribute_0, Attribute_1
2. Attribute_0, Attribute_2
3. Attribute_0, Attribute_3
4. Attribute_1, Attribute_2
5. Attribute_2, Attribute_3
6. Attribute_1, Attribute_3
7. Attribute_0, Attribute_1, Attribute_2
8. Attribute_0, Attribute_1, Attribute_3
9. Attribute_0, Attribute_2, Attribute_3
10. Attribute_1, Attribute_2, Attribute_3
11. Attribute_0, Attribute_1, Attribute_2, Attribute_3

If we process the Attribute_0, Attribute_1 combination we want the output below (key = the attribute values joined by "_", value = the element-wise sum of the arrays in that group). All the other combinations are processed the same way:

5_3 ==> [1.1741997045363952, 0.8538365167174137, 0.7152661878271319, 0.7698036740044117]
3_2 ==> [0.8749606088822967, 0.4921476636928732, 0.8581596540310518, 0.6913180478781878]

Solution tried: I created a parquet file with this schema, the last column being an array of doubles. The parquet file is 276 GB and has 2.65M records. I implemented a UDAF with

Input schema:  array of doubles
Buffer schema: array of doubles
Return schema: array of doubles

(a sketch of such a UDAF follows at the end of this mail). I load the data from the parquet file, register the UDAF, and run the query below once per combination; note that SUM here is the UDAF, not the built-in (a driver sketch also follows):

SELECT COUNT(*) AS MATCHES, SUM(DOUBLEARRAY), Attribute_0, Attribute_1
FROM RAW_TABLE
GROUP BY Attribute_0, Attribute_1
HAVING COUNT(*) > 1

This works fine, but it takes 1.2 minutes for one combination. My use case will have 400 combinations, which means about 8 hours, and that does not meet the SLA: we want this below 1 hour. What is the best way to implement this use case?
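For reference, here is a minimal sketch of the kind of UDAF I mean, assuming the Spark 2.x UserDefinedAggregateFunction Java API (the class name, constructor, and field names are illustrative, not my exact code):

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.MutableAggregationBuffer;
import org.apache.spark.sql.expressions.UserDefinedAggregateFunction;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Element-wise sum over an array<double> column.
public class DoubleArraySum extends UserDefinedAggregateFunction {

    private final int length; // 14500 in the real file

    public DoubleArraySum(int length) { this.length = length; }

    @Override public StructType inputSchema() {
        return new StructType().add("doubleArray", DataTypes.createArrayType(DataTypes.DoubleType));
    }

    @Override public StructType bufferSchema() {
        return new StructType().add("sums", DataTypes.createArrayType(DataTypes.DoubleType));
    }

    @Override public DataType dataType() {
        return DataTypes.createArrayType(DataTypes.DoubleType);
    }

    @Override public boolean deterministic() { return true; }

    @Override public void initialize(MutableAggregationBuffer buffer) {
        List<Double> zeros = new ArrayList<>(length);
        for (int i = 0; i < length; i++) zeros.add(0.0);
        buffer.update(0, zeros);
    }

    @Override public void update(MutableAggregationBuffer buffer, Row input) {
        if (input.isNullAt(0)) return;
        buffer.update(0, add(buffer.getList(0), input.getList(0)));
    }

    @Override public void merge(MutableAggregationBuffer buffer, Row other) {
        buffer.update(0, add(buffer.getList(0), other.getList(0)));
    }

    @Override public Object evaluate(Row buffer) {
        return buffer.getList(0);
    }

    // The list returned by getList() is read-only, so build a fresh list.
    private static List<Double> add(List<Double> a, List<Double> b) {
        List<Double> out = new ArrayList<>(a.size());
        for (int i = 0; i < a.size(); i++) out.add(a.get(i) + b.get(i));
        return out;
    }
}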
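And a sketch of the driver that registers the UDAF and runs the query per combination. The paths are placeholders, only two of the 400 combinations are listed, and the UDAF is registered as SUM_ARRAY here (rather than SUM, as in the query above) for clarity; caching the input once avoids re-reading the 276 GB parquet for every combination:

import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public final class Driver {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("array-sum").getOrCreate();

        // Read the parquet input once and keep it cached across all queries.
        Dataset<Row> raw = spark.read().parquet("/data/raw_table.parquet"); // placeholder path
        raw.cache();
        raw.createOrReplaceTempView("RAW_TABLE");
        spark.udf().register("SUM_ARRAY", new DoubleArraySum(14500));

        // The 400 attribute combinations; two shown here as placeholders.
        List<List<String>> combinations = Arrays.asList(
            Arrays.asList("Attribute_0", "Attribute_1"),
            Arrays.asList("Attribute_0", "Attribute_2"));

        for (List<String> combo : combinations) {
            String cols = String.join(", ", combo);
            Dataset<Row> result = spark.sql(
                "SELECT COUNT(*) AS MATCHES, SUM_ARRAY(DOUBLEARRAY) AS SUMS, " + cols
                + " FROM RAW_TABLE GROUP BY " + cols + " HAVING COUNT(*) > 1");
            result.write().parquet("/data/out/" + String.join("_", combo)); // placeholder output
        }
    }
}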
Best Regards,
Anil Langote
+1-425-633-9747

> On Jan 8, 2017, at 8:17 PM, Holden Karau <hol...@pigscanfly.ca> wrote:
>
> To start with, caching and having a known partitioner will help a bit; then
> there is also the IndexedRDD project, but in general Spark might not be the
> best tool for the job. Have you considered having Spark output to something
> like memcache?
>
> What is the goal you are trying to accomplish?
>
>> On Sun, Jan 8, 2017 at 5:04 PM Anil Langote <anillangote0...@gmail.com>
>> wrote:
>> Hi All,
>>
>> I have a requirement where I want to build a distributed HashMap which
>> holds 10M key-value pairs and provides very efficient lookups for each key.
>> I tried loading the file into a JavaPairRDD and calling the lookup method,
>> but it is very slow.
>>
>> How can I achieve very fast lookups by a given key?
>>
>> Thank you
>> Anil Langote
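For reference, a minimal sketch of the cache-plus-known-partitioner approach suggested above, using the Spark Java API (the input path, partition count, and lookup key are placeholders):

import java.util.List;

import org.apache.spark.HashPartitioner;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public final class PartitionedLookup {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("lookup"));

        // Hash-partition so every key has a known home partition, then cache
        // so repeated lookups do not re-read the source file.
        JavaPairRDD<String, String> pairs = sc
            .textFile("/data/pairs.tsv")            // placeholder path, "key<TAB>value" lines
            .mapToPair(line -> {
                String[] kv = line.split("\t", 2);
                return new Tuple2<>(kv[0], kv[1]);
            })
            .partitionBy(new HashPartitioner(200))  // partition count is illustrative
            .cache();
        pairs.count(); // force materialization before serving lookups

        // With a known partitioner, lookup() scans only the owning partition
        // instead of the whole RDD.
        List<String> values = pairs.lookup("someKey");
        System.out.println(values);
    }
}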