Hi,
Would appreciate insights and wisdom on a problem we are working on:
1. Context:
- Given a csv file like:
- d1,c1,a1
- d1,c1,a2
- d1,c2,a1
- d1,c1,a1
- d2,c1,a3
- d2,c2,a1
- d3,c1,a1
- d3,c3,a1
- d3,c2,a1
- d3,c3,a2
- d5,c1,a3
- d5,c2,a2
- d5,c3,a2
- We want the total and unique counts of the d_ values for each c_ value,
each a_ value, and each c_-a_ combination:
- Tot Unique
- c1 6 4
- c2 4 4
- c3 3 2
- a1 7 3
- a2 4 3
- a3 2 2
- c1-a1 ...
- c1-a2 ...
- c1-a3 ...
- c2-a1 ...
- c2-a2 ...
- ...
- c3-a3
- Obviously the real data has millions of records and more
attributes/dimensions, so scalability is key.
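To make the expected result concrete, here is a minimal sketch that computes the totals and uniques for the sample above with plain Scala collections rather than Spark, so it runs without a cluster. The names `DimensionStats`, `rows`, `stats`, `byC`, `byA` and `byCA` are made up for illustration, and the in-memory `Seq` is only a stand-in for the real file.

```scala
object DimensionStats {
  // Sample rows from this post, as (d, c, a) triples.
  val rows: Seq[(String, String, String)] = Seq(
    ("d1", "c1", "a1"), ("d1", "c1", "a2"), ("d1", "c2", "a1"), ("d1", "c1", "a1"),
    ("d2", "c1", "a3"), ("d2", "c2", "a1"),
    ("d3", "c1", "a1"), ("d3", "c3", "a1"), ("d3", "c2", "a1"), ("d3", "c3", "a2"),
    ("d5", "c1", "a3"), ("d5", "c2", "a2"), ("d5", "c3", "a2")
  )

  // For each key, return (total number of rows, number of distinct d values).
  def stats(pairs: Seq[(String, String)]): Map[String, (Int, Int)] =
    pairs.groupBy(_._1).map { case (key, group) =>
      (key, (group.size, group.map(_._2).distinct.size))
    }

  // Key by c, by a, and by the c-a combination; the value is always d.
  val byC: Map[String, (Int, Int)] = stats(rows.map { case (d, c, _) => (c, d) })
  val byA: Map[String, (Int, Int)] = stats(rows.map { case (d, _, a) => (a, d) })
  val byCA: Map[String, (Int, Int)] = stats(rows.map { case (d, c, a) => (s"$c-$a", d) })

  // Print every key with its total and unique counts, sorted by key.
  def main(args: Array[String]): Unit =
    (byC.toSeq ++ byA.toSeq ++ byCA.toSeq).sortBy(_._1).foreach {
      case (key, (total, unique)) => println(s"$key $total $unique")
    }
}
```

For example, `DimensionStats.byC("c1")` gives `(6, 4)`. Grouping everything in memory like this is exactly what will not scale to millions of records, which is why we are looking at Spark.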
2. We think Spark is a good stack for this problem, but we have a few
questions:
3. From a Spark substrate perspective, what are the best
transformations to use, and what should we watch out for?
4. Is PairRDD the best data representation? groupByKey et al. are only
available on a PairRDD.
5. On a pragmatic level, file.map().map() results in a plain RDD. How do I
transform it into a PairRDD?
1. .map(fields => (fields(1), fields(0))) results in Unit.
2. .map(fields => fields(1) -> fields(0)) does not work either.
3. Neither of these results in a PairRDD.
4. I am missing something fundamental.
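For comparison, here is a minimal sketch of the same two expressions applied to a plain Scala `Seq`, which is an assumption standing in for the RDD so that it runs without Spark. The names `PairSketch`, `lines`, `pairs`, `arrowPairs` and `uniquePerKey` are hypothetical. Both forms produce a sequence of `Tuple2`, which suggests the lambdas themselves are fine and the problem lies elsewhere. One guess: in Spark releases before 1.3 the pair operations such as groupByKey arrive through an implicit conversion that needs `import org.apache.spark.SparkContext._`.

```scala
object PairSketch {
  // Stand-in for the lines of the CSV file (a tiny hypothetical sample).
  val lines: Seq[String] = Seq("d1,c1,a1", "d1,c1,a2", "d2,c1,a3")

  // Split each line, then key each record by its c_ field with d_ as the value.
  val pairs: Seq[(String, String)] =
    lines.map(_.split(",")).map(fields => (fields(1), fields(0)))

  // The arrow form builds the same Tuple2 values.
  val arrowPairs: Seq[(String, String)] =
    lines.map(_.split(",")).map(fields => fields(1) -> fields(0))

  // Number of distinct d_ values per key: drop duplicate pairs, then count per key.
  val uniquePerKey: Map[String, Int] =
    pairs.distinct.groupBy(_._1).map { case (key, group) => (key, group.size) }
}
```

On an RDD[String] the same lambdas would be expected to give an RDD[(String, String)], so a Unit result may come from whatever follows the map (for instance a trailing foreach or println) rather than from the map itself.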
Cheers & Have a nice weekend
<k/>