When you group by IP address in step 1 to get this:
(ip1,(lat1,lon1),(lat2,lon2))
(ip2,(lat3,lon3),(lat4,lon4))
how many lat/lon locations do you expect for each IP address? Both the
average and the maximum are of interest.
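
These two numbers matter because an IP address with n points produces
n * (n - 1) / 2 unordered pairs to compare, so a single heavy IP address can
dominate the job. A minimal sketch of how they could be computed, in plain
Python rather than Spark; the sample records and the name
points_per_ip_stats are made up for illustration:

```python
from collections import defaultdict

# Hypothetical sample records: (ip, lat, lon). Values are invented for illustration.
records = [
    ("ip1", 12.9716, 77.5946),
    ("ip1", 12.9720, 77.5950),
    ("ip1", 12.9716, 77.5946),
    ("ip2", 40.7128, -74.0060),
]


def points_per_ip_stats(records):
    """Return (average, maximum) number of lat/lon points per IP address."""
    counts = defaultdict(int)
    for ip, _lat, _lon in records:
        counts[ip] += 1
    sizes = list(counts.values())
    return sum(sizes) / len(sizes), max(sizes)


avg_points, max_points = points_per_ip_stats(records)

# An IP address with n points yields n * (n - 1) / 2 unordered pairs to compare.
worst_case_pairs = max_points * (max_points - 1) // 2
```

If the maximum is in the thousands, the pairwise step for that one IP address
already means millions of comparisons.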
Andrew
On Wed, Jun 4, 2014 at 5:29 AM, Oleg Proudnikov <[email protected]>
wrote:
> It is possible if you use a cartesian product to produce all possible
> pairs for each IP address and 2 stages of map-reduce:
> - the first keyed by pair of points, to find the total for each pair, and
> - the second keyed by IP address, to find, for each IP address, the pair
> with the maximum count.
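
One possible reading of the two stages above, sketched in plain Python rather
than Spark so that it runs on its own. The function name
most_frequent_pair_per_ip and the (ip, lat, lon) record layout are
assumptions, and the distance threshold from the original question is left
out; it would be applied as a filter on the pairs before they are counted.

```python
from collections import Counter
from itertools import combinations


def most_frequent_pair_per_ip(records):
    """records: iterable of (ip, lat, lon). Returns {ip: ((point_a, point_b), count)}."""
    # Group the observed points by IP address (step 1 of the original question).
    by_ip = {}
    for ip, lat, lon in records:
        by_ip.setdefault(ip, []).append((lat, lon))

    # Stage 1: emit every unordered pair of observations under the same IP
    # address and count how often each (ip, point_a, point_b) key occurs.
    # Repeated observations of the same point also pair with each other.
    pair_counts = Counter()
    for ip, points in by_ip.items():
        for a, b in combinations(points, 2):
            first, second = sorted((a, b))  # make (a, b) and (b, a) the same key
            pair_counts[(ip, first, second)] += 1

    # Stage 2: re-key by IP address and keep the pair with the highest count.
    best = {}
    for (ip, first, second), count in pair_counts.items():
        if ip not in best or count > best[ip][1]:
            best[ip] = ((first, second), count)
    return best
```

An IP address with a single observation produces no pairs and therefore does
not appear in the result. In Spark, each stage would typically become a
map followed by a reduce on the corresponding key.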
>
> Oleg
>
>
>
> On 4 June 2014 11:49, lmk <[email protected]> wrote:
>
>> Hi,
>> I am a new Spark user. Please let me know how to handle the following
>> scenario:
>>
>> I have a data set with the following fields:
>> 1. DeviceId
>> 2. latitude
>> 3. longitude
>> 4. ip address
>> 5. Datetime
>> 6. Mobile application name
>>
>> With the above data, I would like to perform the following steps:
>> 1. Collect all lat/lon pairs for each IP address:
>> (ip1,(lat1,lon1),(lat2,lon2))
>> (ip2,(lat3,lon3),(lat4,lon4))
>> 2. For each IP:
>>    1. Find the distance between each lat/lon coordinate pair and all
>>       the other pairs under the same IP.
>>    2. Select those coordinates whose distances fall under a specific
>>       threshold (say 100 m).
>>    3. Find the coordinate pair with the maximum occurrences.
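
For the distance and threshold sub-steps above, one common choice is the
haversine great-circle distance. That is an assumption here, since the thread
does not name a formula. A minimal sketch in plain Python; the names
haversine_m and within_threshold and the rounded Earth radius are
illustrative:

```python
import math

# Mean Earth radius in metres; a rounded approximation, good enough for a 100 m test.
EARTH_RADIUS_M = 6_371_000


def haversine_m(p, q):
    """Great-circle distance in metres between two (lat, lon) points given in degrees."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def within_threshold(p, q, threshold_m=100.0):
    """True if the two points are no more than threshold_m metres apart."""
    return haversine_m(p, q) <= threshold_m
```

A predicate like within_threshold could be used to filter the candidate pairs
under each IP address before counting occurrences.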
>>
>> In this case, how can I iterate and compare each coordinate pair with all
>> the other pairs?
>> Can this be done in a distributed manner, as this data set is going to
>> have a few million records?
>> Can we do this with map/reduce operations?
>>
>> Thanks.
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Can-this-be-done-in-map-reduce-technique-in-parallel-tp6905.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>