Hi all,

I am having trouble understanding how an RDD is partitioned after
calling the mapToPair function.
Could anyone give me more information about partitioning in this function?

I have a simple application that does the following:

JavaPairInputDStream<String, String> messages =
    KafkaUtils.createDirectStream(...);

JavaPairDStream<String, Double> stats = messages
    .mapToPair(JSON_DECODE)
    .reduceByKey(SUM);

saveToDB(stats);

I set up 2 workers (each dedicating 20 cores) for this job.
My Kafka topic has 40 partitions (I want each core to handle one partition), and
the messages sent to the queue are partitioned by the same key that the
mapToPair function emits.
I'm using the default Partitioner of both Kafka and Spark.
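
To make the question concrete, here is a minimal sketch of how I understand Spark's default HashPartitioner to assign a key to a partition: the key's hashCode() modulo the number of partitions, shifted into the non-negative range. The class name PartitionSketch, the method name sparkHashPartition, the sample keys, and the choice of 40 reduce partitions are my own assumptions for illustration, not something taken from my job. As far as I know, Kafka's default partitioner in the Java producer instead hashes the serialized key bytes with murmur2, so the partition index it picks need not match the one computed here.

```java
// Plain-Java illustration only; it does not call Spark or Kafka.
class PartitionSketch {

    // Same arithmetic as Spark's HashPartitioner (to my understanding):
    // null keys go to partition 0, otherwise hashCode() mod numPartitions,
    // with negative remainders shifted into [0, numPartitions).
    public static int sparkHashPartition(Object key, int numPartitions) {
        if (key == null) {
            return 0;
        }
        int rawMod = key.hashCode() % numPartitions;
        return rawMod < 0 ? rawMod + numPartitions : rawMod;
    }

    public static void main(String[] args) {
        int numPartitions = 40; // assumption: one reduce partition per core
        String[] sampleKeys = {"user-1", "user-2", "user-3"}; // hypothetical keys
        for (String key : sampleKeys) {
            System.out.println(key + " -> partition "
                    + sparkHashPartition(key, numPartitions));
        }
    }
}
```

If the two hash functions differ as I suspect, a key that Kafka placed in one partition could be assigned to a different partition by Spark in the reduceByKey step.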

Ideally, I shouldn't see any data shuffled between cores in the mapToPair stage,
right?
However, in the Spark UI, the "Locality Level" for this stage is
"ANY", which I take to mean that the data needs to be transferred.
Any comments on this?

-- 
Thanks
Kien
