Hi Lee,

You should be able to create a pair RDD using the nonce as the key and the
AnalyticsEvent as the value. I'm fairly new to Spark myself, but here is
some uncompilable pseudo-code that may or may not help:

events
  .map(event => (event.getNonce, event))  // key each event by its nonce
  .reduceByKey((a, b) => a)               // keep one event per nonce
  .map(_._2)                              // drop the keys again

The above is more Scala-like, since that's the syntax I'm more familiar
with. The Spark Java 8 API looks similar, but you won't get the implicit
conversion to a pair RDD when you map to a 2-tuple. Instead, you'll need to
use the "mapToPair" function - there's a good example in the Spark
Programming Guide under "Working with Key-Value Pairs":
https://spark.apache.org/docs/latest/programming-guide.html#working-with-key-value-pairs

Hope this helps!

Regards,
Will

On Thu, Jun 4, 2015 at 1:10 PM, lbierman <leebier...@gmail.com> wrote:

> I'm still a bit new to Spark and am struggling to figure out the best way
> to dedupe my events.
>
> I load my Avro files from HDFS and then I want to dedupe events that have
> the same nonce.
>
> For example my code so far:
>
>  JavaRDD<AnalyticsEvent> events = ((JavaRDD<AvroKey<AnalyticsEvent>>)
> context.newAPIHadoopRDD(
>             context.hadoopConfiguration(),
>             AvroKeyInputFormat.class,
>             AvroKey.class,
>             NullWritable.class
>         ).keys())
>         .map(event -> AnalyticsEvent.newBuilder(event.datum()).build())
>         .filter(key ->
> Optional.ofNullable(key.getStepEventKey()).isPresent());
>
> Now I want to get back an RDD of AnalyticsEvents that are unique. So I
> basically want to do:
> if AnalyticsEvent.getNonce() == AnalyticsEvent2.getNonce() only return 1 of
> them.
>
> I'm not sure how to do this. If I do reduceByKey, doesn't it reduce by
> the AnalyticsEvent itself rather than by the values inside it?
>
> Any guidance on how I can walk this list of events and return only the
> events with unique nonces would be much appreciated.
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Deduping-events-using-Spark-tp23153.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
