Hi lmk, I think iterators and for-comprehensions may help here. I wrote a snippet that implements your first two requirements:

    def distance(a: (Double, Double), b: (Double, Double)): Double = ???

    // Defines some total ordering among locations.
    def lessThan(a: (Double, Double), b: (Double, Double)): Boolean = ???

    sc.textFile("input")
      .map { line =>
        val Array(_, latitude, longitude, ip, _, _) = line.split(",")
        ip -> (latitude.toDouble, longitude.toDouble)
      }
      .groupByKey()
      .mapValues { positions =>
        for {
          a <- positions.iterator
          b <- positions.iterator
          if lessThan(a, b) && distance(a, b) < 100
        } yield {
          (a, b)
        }
      }

The key point is that iterators are lazily evaluated, so you don’t need to store the whole Cartesian product in memory. The lessThan check also ensures that each unordered pair is produced only once and that a location is never paired with itself.

I didn’t quite get your 3rd requirement, but I think you can implement it with a similar approach.

Cheng

On Thu, Jun 5, 2014 at 1:11 PM, lmk <lakshmi.muralikrish...@gmail.com> wrote:

> Hi Oleg/Andrew,
> Thanks much for the prompt response.
>
> We expect thousands of lat/lon pairs for each IP address. And that is my
> concern with the Cartesian product approach.
> Currently for a small sample of this data (5000 rows) I am grouping by IP
> address and then computing the distance between lat/lon coordinates using
> array manipulation techniques.
> But I understand this approach is not right when the data volume goes up.
> My code is as follows:
>
> val dataset:RDD[String] = sc.textFile("x.csv")
> val data = dataset.map(l=>l.split(","))
> val grpData = data.map(r =>
> (r(3),((r(1).toDouble),r(2).toDouble))).groupByKey()
>
> Now, I have the data grouped by ipaddress as Array[(String,
> Iterable[(Double, Double)])]
> ex..
> Array((ip1,ArrayBuffer((lat1,lon1), (lat2,lon2), (lat3,lon3)))
>
> Now I have to find the distance between (lat1,lon1) and (lat2,lon2) and
> then
> between (lat1,lon1) and (lat3,lon3) and so on for all combinations.
>
> This is where I get stuck. Please guide me on this.
>
> Thanks Again.
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Can-this-be-handled-in-map-reduce-using-RDDs-tp6905p7016.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
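
P.S. In case it helps, here is a rough sketch of one way the two `???` placeholders in my snippet could be filled in. Everything in it is an assumption rather than something taken from your requirements: it treats the threshold as kilometres, uses the haversine formula with a mean Earth radius of 6371 km for `distance`, and uses a plain lexicographic order on (latitude, longitude) for `lessThan`. The `PairDistances` object and the `closePairs` helper are hypothetical names, and the code runs on ordinary Scala collections so it can be tried without a Spark cluster.

```scala
// Hypothetical fill-ins for the `???` placeholders; a sketch, not a definitive implementation.
object PairDistances {
  type Location = (Double, Double) // (latitude, longitude) in degrees

  // Mean Earth radius in kilometres (assumes the distance threshold is in km).
  val EarthRadiusKm = 6371.0

  // Great-circle distance in kilometres, computed with the haversine formula.
  def distance(a: Location, b: Location): Double = {
    val (lat1, lon1) = a
    val (lat2, lon2) = b
    val dLat = math.toRadians(lat2 - lat1)
    val dLon = math.toRadians(lon2 - lon1)
    val h = math.pow(math.sin(dLat / 2), 2) +
      math.cos(math.toRadians(lat1)) * math.cos(math.toRadians(lat2)) *
        math.pow(math.sin(dLon / 2), 2)
    2 * EarthRadiusKm * math.asin(math.sqrt(h))
  }

  // Lexicographic order on (latitude, longitude). Because it is a strict
  // total order, each unordered pair is visited once and (a, a) is skipped.
  def lessThan(a: Location, b: Location): Boolean =
    a._1 < b._1 || (a._1 == b._1 && a._2 < b._2)

  // Lazily enumerates pairs closer than maxKm; no pair is computed
  // until the returned iterator is consumed.
  def closePairs(positions: Iterable[Location], maxKm: Double): Iterator[(Location, Location)] =
    for {
      a <- positions.iterator
      b <- positions.iterator
      if lessThan(a, b) && distance(a, b) < maxKm
    } yield (a, b)
}
```

With these definitions, the body of `mapValues` in the Spark snippet could simply call `PairDistances.closePairs(positions, 100)`. Note that duplicate coordinates for the same IP are never paired with each other under this ordering, which may or may not be what you want.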