Hello Spark Devs/Users, Im trying to solve the use case with Spark Streaming 1.6.2 where for every batch ( say 2 mins) data needs to go to the same reducer node after grouping by key. The underlying storage is Cassandra and not HDFS.
This is a map-reduce job, where also trying to use the partitions of the Cassandra table to batch the data for the same partition. The requirement of sticky session/partition across batches is because the operations which we need to do, needs to read data for every key and then merge this with the current batch aggregate values. So, currently when there is no stickyness across batches, we have to read for every key, merge and then write back. and reads are very expensive. So, if we have sticky session, we can avoid read in every batch and have a cache of till last batch aggregates across batches. So, there are few options, can think of: 1. to change the TaskSchedulerImpl, as its using Random to identify the node for mapper/reducer before starting the batch/phase. Not sure if there is a custom scheduler way of achieving it? 2. Can custom RDD can help to find the node for the key-->node. there is a getPreferredLocation() method. But not sure, whether this will be persistent or can vary for some edge cases? Thanks in advance for you help and time ! Regards, Manish