Hi,
We have been implementing several Spark Streaming jobs that basically
process data and insert it into Cassandra, sorting it into different
keyspaces.
We've been following the pattern:
dstream.foreachRDD { rdd =>
  // build the records once per batch, then write one filtered subset per target
  val records = rdd.map(elem => record(elem))
  targets.foreach { target =>
    records.filter(record => isTarget(target, record)).writeToCassandra(target, table)
  }
}
I've been wondering whether there would be a performance difference between
transforming the DStream itself and transforming the RDDs within the DStream,
with regard to how the transformations get scheduled.
Instead of the RDD-centric computation, I could transform the DStream up to
the last step, where I need an RDD in order to store it.
For example, the previous transformation could be written as:
val recordStream = dstream.map(elem => record(elem))
targets.foreach { target =>
  recordStream.filter(record => isTarget(target, record))
    .foreachRDD(_.writeToCassandra(target, table))
}
Would there be a difference in execution and/or performance? What would be
the preferred way to do this?
Bonus question: Is there a better (more performant) way to sort the data
into different "buckets" than filtering the whole data collection once per
bucket?
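To make that bonus question more concrete, here is a rough sketch of the kind
of alternative we are imagining (purely illustrative: bucketOf is a
hypothetical helper that returns the single target a record belongs to, and
writeToCassandra / record / targets are the same placeholders as above). It
classifies each record once and caches the keyed RDD, but it still issues one
write per target, so we are not sure it is actually any better:

val keyed = dstream.map(elem => record(elem)).map(r => (bucketOf(r), r))
keyed.foreachRDD { rdd =>
  rdd.cache()  // reuse the keyed records across the per-target writes
  targets.foreach { target =>
    // per-target pass is now just a key comparison, not a full isTarget check
    rdd.filter(_._1 == target).map(_._2).writeToCassandra(target, table)
  }
  rdd.unpersist()
}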
thanks, Gerard.