I haven't done anything other than performance tuning on Spark Streaming for
the past few weeks. rdd.cache() makes a huge difference. It's a must in this case,
where you want to iterate over the same RDD several times.
Intuitively, I also thought that all the data was in memory already, so that
wouldn't make a difference.
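For what it's worth, a minimal sketch of the pattern in Scala (the stream of
(key, value) records and the filter predicate are just placeholders, not your
actual job):

import org.apache.spark.streaming.dstream.DStream

def process(records: DStream[(String, String)]): Unit = {
  records.foreachRDD { rdd =>
    // Cache because the same RDD is traversed more than once below;
    // without cache() each action recomputes the lineage from the source.
    rdd.cache()

    val total  = rdd.count()                                                // first pass
    val errors = rdd.filter { case (_, v) => v.contains("ERROR") }.count()  // second pass

    println(s"batch total=$total errors=$errors")

    rdd.unpersist()  // free the cached blocks once the batch is done
  }
}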
Ah! That sounds very much like what I need. A very basic question (most
likely): why is rdd.cache() critical? Isn't it already true that in Spark
Streaming, DStreams are cached in memory anyway?
Also, any experience with minutes-long batch intervals?
Thanks for the quick answer!
On Sun, Dec 14, 20
Hi Jean-Pascal,
At Virdata we do a similar thing to 'bucketize' our data into different
keyspaces in Cassandra.
The basic construction would be to filter the DStream (or the underlying
RDD) for each key and then apply the usual storage operations on that new
data set.
Given that, in your case, you
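For reference, a rough sketch of that filter-per-bucket construction in Scala.
The (date, payload) pair stream, the bucket list and the S3 paths are
placeholders; substitute whatever keys and storage call apply in your case:

import org.apache.spark.streaming.dstream.DStream

// Hypothetical fixed set of buckets; for you these would be dates, for us keyspaces.
val buckets = Seq("2014-12-12", "2014-12-13", "2014-12-14")

def bucketize(records: DStream[(String, String)]): Unit = {
  records.foreachRDD { (rdd, time) =>
    rdd.cache()  // the same RDD is filtered once per bucket
    buckets.foreach { bucket =>
      val subset = rdd.filter { case (date, _) => date == bucket }
      // Placeholder sink: one storage call per bucket per batch
      // (saveAsTextFile here, a Cassandra writer in our case).
      subset.saveAsTextFile(s"s3n://my-bucket/events/dt=$bucket/batch-${time.milliseconds}")
    }
    rdd.unpersist()
  }
}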
Hey,
I am running an experiment with Spark Streaming that consists of moving data
from Kafka to S3 locations while partitioning by date. I have already
looked into LinkedIn Camus and Pinterest Secor, and while both are workable
solutions, it just feels that Spark Streaming should be able to be on par
with
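To make the question concrete, this is roughly the kind of job I have in mind
(Scala; the ZooKeeper quorum, consumer group, topic and S3 bucket are
placeholders, and it partitions by batch time rather than by each record's own
timestamp):

import java.text.SimpleDateFormat
import java.util.Date
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaToS3 {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kafka-to-s3")
    // Minutes-long batch interval, so each batch maps to one reasonably sized S3 write.
    val ssc = new StreamingContext(conf, Minutes(5))

    // Receiver-based Kafka stream of (key, message) pairs; connection details are placeholders.
    val messages = KafkaUtils.createStream(ssc, "zk1:2181", "s3-archiver", Map("events" -> 2))

    messages.foreachRDD { (rdd, time) =>
      // Derive the partition date from the batch time, not from the payload.
      val day = new SimpleDateFormat("yyyy-MM-dd").format(new Date(time.milliseconds))
      val payload = rdd.map(_._2)  // keep only the message payload
      payload.saveAsTextFile(s"s3n://my-bucket/events/dt=$day/batch-${time.milliseconds}")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}

Partitioning by a date carried inside each record, rather than by batch time,
would instead need a per-date filter or a partitioned write.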