I am not sure if it is a typo-error or not, but how are you using
groupByKey to get the summed_values? Assuming you meant reduceByKey(),
these workflows seems pretty efficient.

TD


On Thu, Aug 7, 2014 at 10:18 AM, Dan H. <dch.ema...@gmail.com> wrote:

> I wanted to post for validation to understand if there is more efficient
> way
> to achieve my goal.  I'm currently performing this flow for two distinct
> calculations executing in parallel:
>   1) Sum key/value pair, by using a simple witnessed count(apply 1 to a
> mapToPair() and then groupByKey()
>   2) Sum the actual values, in my key/value pair and transform the data so
> group properly by groupByKey()
>
> DataSource: RDDStream_in
>
> Workflow1:
> Generate DStream using flatmap() from input RDDStream_in, which splits data
> into:
> <StringKey1, StringKey2, Value1_to_be_inspected>
>
> Next I apply a filter() to pull the values I only want to see as
> witnessed...which creates a smaller DStream
> <<StringKey1, StringKey2, Value1_inspected>
>
> I generate a PairDStream from mapToPair() from the previous step, providing
> a way to append a summable value yielding:
> <StringKey1, StringKey2, Value1_inspected>, to_be_summed_valueof 1>
>
> Next I apply the groupByKey() to the PairDStream get:
> <<StringKey1, StringKey2, Value1>, summed_value by key/values>
>
>
> Workflow 2:
> Generate DStream using flatmap() from input RDDStream_in, which splits data
> into:
> <StringKey1, StringKey2, Value1_to_be_summed>
>
> Next, I apply mapToPair() from the previous DStream, thus providing a way
> to
> sum the Value1 and remove the Value1 from the original StringKey, thus
> yielding:
> <<StringKey1, StringKey2>, Value1_to_be_summed>
>
> Next I apply the groupByKey() and I get:
> <<StringKey1, StringKey2>, Value1_summed by Keys>
>
> Are there more efficient approaches I should be considering, such as
> method.chaining or another technique to increase work flow efficiency?
>
> Thanks for your feedback in advance.
>
> DH
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Workflow-Validation-tp11677.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>

Reply via email to