Multi-dimensional Uniques over large dataset

Krishna Sankar Fri, 13 Jun 2014 20:53:18 -0700

Hi,
   Would appreciate insights and wisdom on a problem we are working on:


   1. Context:
      - Given a csv file like:
      - d1,c1,a1
      - d1,c1,a2
      - d1,c2,a1
      - d1,c1,a1
      - d2,c1,a3
      - d2,c2,a1
      - d3,c1,a1
      - d3,c3,a1
      - d3,c2,a1
      - d3,c3,a2
      - d5,c1,a3
      - d5,c2,a2
      - d5,c3,a2
      - We want the total and unique counts of the d_ values for each c_
      and a_ value, and for each c_-a_ combination:
      -         Tot   Unique
         - c1      6      4
         - c2      4      4
         - c3      3      2
         - a1      7      3
         - a2      4      3
         - a3      2      2
         - c1-a1  ...
         - c1-a2 ...
         - c1-a3 ...
         - c2-a1 ...
         - c2-a2 ...
         - ...
         - c3-a3
      - Obviously the real data has millions of records and more
      attributes/dimensions, so scalability is key.
   2. We think Spark is a good stack for this problem, but we have a few
   questions:
   3. From a Spark substrate perspective, which transformations are the
   best fit, and what should we watch out for?
   4. Is PairRDD the best data representation? groupByKey et al. are only
   available on PairRDDs.
   5. On a pragmatic level, file.map().map() results in an RDD. How do I
   transform it into a PairRDD?
      1. .map(fields => (fields(1), fields(0))) - results in Unit
      2. .map(fields => fields(1) -> fields(0)) - also does not work
      3. Neither of these results in a PairRDD.
      4. I am missing something fundamental.
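
To make the target computation concrete, here is a minimal sketch in plain Scala collections (not Spark) over the sample rows above. The names UniquesSketch, rows, keyed and stats are made up for illustration, and it assumes the CSV has already been parsed into (d, c, a) triples. It only pins down what "total" and "unique" mean per key; it is not meant as a scalable solution.

```scala
object UniquesSketch {
  // Sample rows from the post, as (d, c, a) triples.
  val rows: Seq[(String, String, String)] = Seq(
    ("d1", "c1", "a1"), ("d1", "c1", "a2"), ("d1", "c2", "a1"), ("d1", "c1", "a1"),
    ("d2", "c1", "a3"), ("d2", "c2", "a1"),
    ("d3", "c1", "a1"), ("d3", "c3", "a1"), ("d3", "c2", "a1"), ("d3", "c3", "a2"),
    ("d5", "c1", "a3"), ("d5", "c2", "a2"), ("d5", "c3", "a2")
  )

  // For every row, emit one (key, d) pair per dimension of interest:
  // the c value, the a value, and the combined "c-a" key.
  def keyed(rows: Seq[(String, String, String)]): Seq[(String, String)] =
    rows.flatMap { case (d, c, a) => Seq((c, d), (a, d), (s"$c-$a", d)) }

  // Per key: (total number of rows, number of distinct d values).
  def stats(rows: Seq[(String, String, String)]): Map[String, (Int, Int)] =
    keyed(rows).groupBy(_._1).map { case (key, pairs) =>
      val ds = pairs.map(_._2)
      (key, (ds.size, ds.distinct.size))
    }

  def main(args: Array[String]): Unit =
    stats(rows).toSeq.sortBy(_._1).foreach { case (key, (total, unique)) =>
      println(f"$key%-6s $total%4d $unique%6d")
    }
}
```

The shape is a flatMap into (key, d) pairs followed by a per-key aggregation. Presumably that is also the shape the Spark version would take, with the (key, d) tuples being what the pair operations work on, though how best to do the per-key distinct count at scale is part of the question.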

Cheers & Have a nice weekend
<k/>
