Re: DStream demultiplexer based on a key

2014-12-14 Thread Gerard Maas
I haven't done anything other than performance tuning on Spark Streaming for the past few weeks. rdd.cache() makes a huge difference. A must in this case, where you want to iterate over the same RDD several times. Intuitively, I also thought that all the data was in memory already, so it wouldn't make a difference…
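
A minimal sketch of why cache() matters here, in plain batch terms; the input file name is a placeholder. Each action on an uncached RDD re-runs its whole lineage, so several passes over one batch would re-read the source several times:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(
      new SparkConf().setAppName("cache-demo").setMaster("local[2]"))

    // Transformations only build lineage; nothing is computed yet.
    val parsed = sc.textFile("data.txt").map(_.split(","))
    parsed.cache() // mark the RDD for in-memory storage

    val total = parsed.count()                      // first action: computes and caches
    val wide  = parsed.filter(_.length > 3).count() // second action: reads from the cache

    parsed.unpersist() // release the memory once the passes are done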

Re: DStream demultiplexer based on a key

2014-12-14 Thread Jean-Pascal Billaud
Ah! That sounds very much like what I need. A very basic question (most likely): why is "rdd.cache()" critical? Isn't it already true that in Spark Streaming, DStreams are cached in memory anyway? Also, any experience with minutes-long batch intervals? Thanks for the quick answer!

Re: DStream demultiplexer based on a key

2014-12-14 Thread Gerard Maas
Hi Jean-Pascal, At Virdata we do a similar thing to 'bucketize' our data into different keyspaces in Cassandra. The basic construction would be to filter the DStream (or the underlying RDD) for each key and then apply the usual storage operations on that new dataset. Given that, in your case, you…
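
A minimal sketch of that construction, assuming a stream of (key, payload) pairs and a hypothetical save function standing in for the actual storage call (a Cassandra keyspace write, an S3 prefix, ...):

    import org.apache.spark.SparkContext._ // pair-RDD implicits (pre-1.3 Spark)
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.dstream.DStream

    def demux(stream: DStream[(String, String)],
              save: (String, RDD[String]) => Unit): Unit = {
      stream.foreachRDD { rdd =>
        rdd.cache() // each per-key filter below is a separate pass over rdd
        val keys = rdd.map(_._1).distinct().collect()
        keys.foreach { k =>
          save(k, rdd.filter(_._1 == k).values)
        }
        rdd.unpersist()
      }
    }

The cache()/unpersist() pair is the detail discussed above: every per-key filter is its own action over the same batch RDD, so without caching each one recomputes the batch from the source.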

DStream demultiplexer based on a key

2014-12-14 Thread Jean-Pascal Billaud
Hey, I am doing an experiment with Spark Streaming consisting of moving data from Kafka to S3 locations while partitioning by date. I have already looked into LinkedIn Camus and Pinterest Secor, and while both are workable solutions, it just feels that Spark Streaming should be able to be on par with…
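
A minimal sketch of that experiment, assuming the receiver-based Kafka API available at the time; the ZooKeeper quorum, consumer group, topic, and bucket names are placeholders, and partitioning here is by batch time rather than by a timestamp inside each record:

    import java.text.SimpleDateFormat
    import java.util.Date
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object KafkaToS3 {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(
          new SparkConf().setAppName("kafka-to-s3"), Seconds(60))

        // Receiver-based Kafka stream; quorum, group, and topic are placeholders.
        val lines = KafkaUtils
          .createStream(ssc, "zk1:2181", "s3-copier", Map("events" -> 1))
          .map(_._2) // keep the payload, drop the Kafka message key

        val fmt = new SimpleDateFormat("yyyy/MM/dd")
        lines.foreachRDD { (rdd, time) =>
          // Lay out the S3 keys by the batch time; bucket is a placeholder.
          val day = fmt.format(new Date(time.milliseconds))
          rdd.saveAsTextFile("s3n://my-bucket/events/" + day + "/" + time.milliseconds)
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }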