I think if you create a bidirectional mapping from AnalyticsEvent to
another type that wraps it and uses the nonce for equality, you could
then do something like reduceByKey to group by nonce and map back to
AnalyticsEvent afterward.
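The core idea (one value kept per nonce key) can be sketched locally in plain Java; in Spark itself the same grouping could be done with `mapToPair(e -> new Tuple2<>(e.getNonce(), e)).reduceByKey((a, b) -> a).values()`. The `AnalyticsEvent` below is a hypothetical stand-in for your Avro-generated class, just to make the sketch self-contained:

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.stream.Collectors;

public class DedupeByNonce {

    // Hypothetical stand-in for the Avro-generated AnalyticsEvent.
    static class AnalyticsEvent {
        final String nonce;
        final String payload;
        AnalyticsEvent(String nonce, String payload) {
            this.nonce = nonce;
            this.payload = payload;
        }
        String getNonce() { return nonce; }
    }

    // Keep exactly one event per nonce. The merge function (a, b) -> a
    // plays the same role as reduceByKey((a, b) -> a) in Spark: when two
    // events share a key, the first one wins.
    static Collection<AnalyticsEvent> dedupe(List<AnalyticsEvent> events) {
        return events.stream()
            .collect(Collectors.toMap(AnalyticsEvent::getNonce, e -> e, (a, b) -> a))
            .values();
    }

    public static void main(String[] args) {
        List<AnalyticsEvent> events = Arrays.asList(
            new AnalyticsEvent("n1", "first"),
            new AnalyticsEvent("n1", "duplicate of first"),
            new AnalyticsEvent("n2", "second"));
        // Two distinct nonces, so two events survive.
        System.out.println(dedupe(events).size()); // prints 2
    }
}
```

The merge function is where you would put any tie-breaking logic (e.g. keep the earliest timestamp) if "any one of them" isn't quite what you want.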

On Thu, Jun 4, 2015 at 1:10 PM, lbierman <leebier...@gmail.com> wrote:

> I'm still a bit new to Spark and am struggling to figure out the best way
> to dedupe my events.
>
> I load my Avro files from HDFS and then I want to dedupe events that have
> the same nonce.
>
> For example my code so far:
>
>  JavaRDD<AnalyticsEvent> events = ((JavaRDD<AvroKey<AnalyticsEvent>>)
>         context.newAPIHadoopRDD(
>             context.hadoopConfiguration(),
>             AvroKeyInputFormat.class,
>             AvroKey.class,
>             NullWritable.class
>         ).keys())
>         .map(event -> AnalyticsEvent.newBuilder(event.datum()).build())
>         .filter(key -> Optional.ofNullable(key.getStepEventKey()).isPresent());
>
> Now I want to get back an RDD of AnalyticsEvents that are unique. So I
> basically want to do:
> if event1.getNonce().equals(event2.getNonce()), only return one of them.
>
> I'm not sure how to do this. If I do reduceByKey, it reduces by the whole
> AnalyticsEvent, not by the values inside it.
>
> Any guidance on how I can walk this list of events and return only events
> with unique nonces would be much appreciated.
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Deduping-events-using-Spark-tp23153.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
