The Spark direct stream doesn't set a partitioner on its RDDs, so key-based operations like reduceByKey will shuffle even when the data is already partitioned by key in Kafka. If you know the keys you want to operate on are already partitioned by Kafka, use mapPartitions or foreachPartition instead to avoid the shuffle.
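Something along these lines (just a sketch, assuming a Spark version whose Java foreachRDD takes a VoidFunction; jsonDecode and saveToDB below are hypothetical stand-ins for your JSON_DECODE / saveToDB steps):

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.streaming.api.java.JavaPairInputDStream;

    import scala.Tuple2;

    public class PerPartitionSum {

      // Hypothetical stand-ins for the JSON_DECODE and saveToDB steps in the question.
      static Tuple2<String, Double> jsonDecode(String json) {
        return new Tuple2<>("someKey", 1.0);  // parse the JSON payload here
      }

      static void saveToDB(Map<String, Double> stats) {
        // write the per-partition sums to your store here
      }

      static void process(JavaPairInputDStream<String, String> messages) {
        messages.foreachRDD((JavaPairRDD<String, String> rdd) ->
            rdd.foreachPartition(records -> {
              // With the direct stream, each RDD partition maps 1:1 to a Kafka
              // partition, so all records for a Kafka-partitioned key stay local.
              Map<String, Double> sums = new HashMap<>();
              while (records.hasNext()) {
                Tuple2<String, String> record = records.next();
                Tuple2<String, Double> kv = jsonDecode(record._2());
                sums.merge(kv._1(), kv._2(), Double::sum);
              }
              saveToDB(sums);  // one write per partition, no shuffle needed
            })
        );
      }
    }

Doing the per-key sum and the DB write inside foreachPartition keeps everything local to each Kafka partition, which is why the shuffle goes away. The trade-off is that you lose reduceByKey's cross-partition combining, so this only works when a given key never spans more than one Kafka partition.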
On Sat, Nov 21, 2015 at 11:46 AM, trung kien <kient...@gmail.com> wrote:
> Hi all,
>
> I am having trouble understanding how an RDD will be partitioned after
> calling the mapToPair function.
> Could anyone give me more information about partitioning in this function?
>
> I have a simple application doing the following job:
>
>     JavaPairInputDStream<String, String> messages =
>         KafkaUtils.createDirectStream(...)
>
>     JavaPairDStream<String, Double> stats = messages.mapToPair(JSON_DECODE)
>                                                      .reduceByKey(SUM);
>
>     saveToDB(stats)
>
> I set up 2 workers (each dedicating 20 cores) for this job.
> My Kafka topic has 40 partitions (I want each core to handle one partition),
> and the messages sent to the queue are partitioned by the same key as in the
> mapToPair function.
> I'm using the default partitioner for both Kafka and Spark.
>
> Ideally, I shouldn't see data shuffled between cores in the mapToPair stage,
> right?
> However, in my Spark UI, I see that the "Locality Level" for this stage is
> "ANY", which means data needs to be transferred.
> Any comments on this?
>
> --
> Thanks
> Kien