Drop is a method on scala’s collections (array, list, etc) - not on Spark’s
RDDs. You can look at it as long as you use mapPartitions or something like
reduceByKey, but it totally depends on the use-cases you have for analytics.
The others have suggested better solutions using only spark’s APIs.
-adrian
From: Sampo Niskanen
Date: Thursday, October 22, 2015 at 2:12 PM
To: Adrian Tanase
Cc: user
Subject: Re: Analyzing consecutive elements
Hi,
Sorry, I'm not very familiar with those methods and cannot find the 'drop'
method anywhere.
As an example:
val arr = Array((1, "A"), (8, "D"), (7, "C"), (3, "B"), (9, "E"))
val rdd = sc.parallelize(arr)
val sorted = rdd.sortByKey(true)
// ... then what?
Thanks.
Best regards,
Sampo Niskanen
Lead developer / Wellmo
[email protected]<mailto:[email protected]>
+358 40 820 5291
On Thu, Oct 22, 2015 at 10:43 AM, Adrian Tanase
<[email protected]<mailto:[email protected]>> wrote:
I'm not sure if there is a better way to do it directly using Spark APIs but I
would try to use mapPartitions and then within each partition Iterable to:
rdd.zip(rdd.drop(1)) - using the Scala collection APIs
This should give you what you need inside a partition. I'm hoping that you can
partition your data somehow (e.g by user id or session id) that makes you
algorithm parallel. In that case you can use the snippet above in a reduceByKey.
hope this helps
-adrian
Sent from my iPhone
On 22 Oct 2015, at 09:36, Sampo Niskanen
<[email protected]<mailto:[email protected]>> wrote:
Hi,
I have analytics data with timestamps on each element. I'd like to analyze
consecutive elements using Spark, but haven't figured out how to do this.
Essentially what I'd want is a transform from a sorted RDD [A, B, C, D, E] to
an RDD [(A,B), (B,C), (C,D), (D,E)]. (Or some other way to analyze
time-related elements.)
How can this be achieved?
Sampo Niskanen
Lead developer / Wellmo
[email protected]<mailto:[email protected]>
+358 40 820 5291