I'm not sure if it's a typo or not, but how are you using groupByKey() to get the summed_values? groupByKey() only collects the values for each key (and shuffles every individual value across the network); you would still need to sum them afterwards. Assuming you meant reduceByKey(), which combines values on each partition before the shuffle, these workflows seem pretty efficient.
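For anyone following along, here is a plain-Python sketch (not actual Spark code) of the difference: with groupByKey every value for a key crosses the shuffle before being summed, while reduceByKey first produces one partial sum per key per partition. The "partitions" below are just lists standing in for Spark partitions.

```python
from operator import add

def group_by_key_then_sum(partitions):
    # groupByKey semantics: every (key, value) record is shuffled to the
    # reducer, which only then collapses the full list of values per key.
    shuffled = {}
    for part in partitions:
        for key, value in part:
            shuffled.setdefault(key, []).append(value)  # all values cross the shuffle
    return {key: sum(values) for key, values in shuffled.items()}

def reduce_by_key(partitions, func=add):
    # reduceByKey semantics: values are combined within each partition
    # first, so only one partial result per key per partition is shuffled.
    partials = []
    for part in partitions:
        local = {}
        for key, value in part:
            local[key] = func(local[key], value) if key in local else value
        partials.append(local)
    result = {}
    for local in partials:
        for key, value in local.items():
            result[key] = func(result[key], value) if key in result else value
    return result

# Two toy "partitions" of ((StringKey1, StringKey2), value) pairs.
partitions = [
    [(("k1", "k2"), 1), (("k1", "k2"), 1)],
    [(("k1", "k2"), 1), (("a", "b"), 5)],
]
assert group_by_key_then_sum(partitions) == reduce_by_key(partitions)
```

Both produce the same totals; the difference on a real cluster is how much data moves during the shuffle.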
TD

On Thu, Aug 7, 2014 at 10:18 AM, Dan H. <dch.ema...@gmail.com> wrote:
> I wanted to post for validation, to understand whether there is a more
> efficient way to achieve my goal. I'm currently performing this flow for
> two distinct calculations executing in parallel:
> 1) Sum key/value pairs by a simple witnessed count (apply 1 in a
> mapToPair() and then groupByKey())
> 2) Sum the actual values in my key/value pairs, transforming the data so
> it groups properly by groupByKey()
>
> DataSource: RDDStream_in
>
> Workflow 1:
> Generate a DStream using flatMap() from the input RDDStream_in, which
> splits the data into:
> <StringKey1, StringKey2, Value1_to_be_inspected>
>
> Next I apply a filter() to pull only the values I want to see
> witnessed, which creates a smaller DStream:
> <StringKey1, StringKey2, Value1_inspected>
>
> I generate a PairDStream with mapToPair() from the previous step,
> providing a way to append a summable value, yielding:
> <<StringKey1, StringKey2, Value1_inspected>, to_be_summed_value of 1>
>
> Next I apply groupByKey() to the PairDStream to get:
> <<StringKey1, StringKey2, Value1>, summed_value by key/values>
>
> Workflow 2:
> Generate a DStream using flatMap() from the input RDDStream_in, which
> splits the data into:
> <StringKey1, StringKey2, Value1_to_be_summed>
>
> Next, I apply mapToPair() to the previous DStream, providing a way to
> sum Value1 and remove Value1 from the original key, yielding:
> <<StringKey1, StringKey2>, Value1_to_be_summed>
>
> Next I apply groupByKey() and get:
> <<StringKey1, StringKey2>, Value1_summed by keys>
>
> Are there more efficient approaches I should be considering, such as
> method chaining or another technique to increase workflow efficiency?
>
> Thanks for your feedback in advance.
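To make the two quoted workflows concrete, here is a plain-Python sketch of one micro-batch of each pipeline. The record format, the is_interesting() predicate, and the sample data are all invented for illustration; in Spark Streaming each step below corresponds to flatMap(), filter(), mapToPair(), and (per the reply above) a final reduceByKey() on the DStream.

```python
def split(record):
    # stands in for flatMap(): split a raw record into (key1, key2, value)
    key1, key2, value = record.split(",")
    return [(key1, key2, int(value))]

def is_interesting(triple):
    # stands in for filter(): a hypothetical rule for "witnessed" values
    return triple[2] > 0

def sum_by_key(pairs):
    # stands in for reduceByKey(_ + _): per-key sum of the pair values
    out = {}
    for key, value in pairs:
        out[key] = out.get(key, 0) + value
    return out

# One toy micro-batch of raw "key1,key2,value" records.
batch = ["a,b,3", "a,b,4", "c,d,-1", "c,d,7"]
triples = [t for rec in batch for t in split(rec)]

# Workflow 1: count witnessed occurrences.
# Key on the full (key1, key2, value) triple, pair it with 1, then sum.
workflow1 = sum_by_key(
    ((k1, k2, v), 1) for (k1, k2, v) in triples if is_interesting((k1, k2, v))
)

# Workflow 2: sum the values.
# Key on (key1, key2) only, pair it with the value, then sum.
workflow2 = sum_by_key(((k1, k2), v) for (k1, k2, v) in triples)
```

Since both workflows start from the same flatMap() output, the main structural saving beyond reduceByKey would be computing the split DStream once and feeding both pipelines from it, rather than parsing the input twice.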
>
> DH
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Workflow-Validation-tp11677.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org