Hi all, I am having problem of understanding how RDD will be partitioned after calling mapToPair function. Could anyone give me more information about parititoning in this function?
I have a simple application doing following job: JavaPairInputDStream<String, String> messages = KafkaUtils.createDirectStream(...) JavaPairDStream<String, Double> stats = messages.mapToPair(JSON_DECODE) .reduceByKey(SUM); saveToDB(stats) I setup 2 workers (each dedicate 20 cores) for this job. My kafka topic has 40 partitions (I want each core handle a partition), and the messages send to queue are partitioned by the same key as mapToPair function. I'm using default Partitioner of both Kafka and Sprark. Ideally, I shouldn't see the data shuffle between cores in mapToPair stage, right? However, in my Spark UI, I see that the "Locality Level" for this stage is "ANY", which means data need to be transfered. Any comments on this? -- Thanks Kien
