How does MapWithStateRDD distribute the data

2016-08-03 Thread Soumitra Johri
Hi, I am running a streaming job with 4 executors and 16 cores so that each executor has two cores to work with. The input Kafka topic has 4 partitions. With this configuration I was expecting MapWithStateRDD to be evenly distributed across all executors; however, I see that it uses only two

inter spark application communication

2016-04-18 Thread Soumitra Johri
Hi, I have two applications: App1 and App2. On a single cluster I have to spawn 5 instances of App1 and 1 instance of App2. What would be the best way to send data from the 5 App1 instances to the single App2 instance? Right now I am using Kafka to send data from one Spark application to the s

UpdateStateByKey : Partitioning and Shuffle

2016-01-05 Thread Soumitra Johri
Hi, I am relatively new to Spark and am using the updateStateByKey() operation to maintain state in my Spark Streaming application. The input data is coming through a Kafka topic. 1. I want to understand how DStreams are partitioned. 2. How does the partitioning work with mapWithState() or u