The Spark direct stream doesn't set a partitioner on its RDDs, so key-based operations like reduceByKey will shuffle even when the data is already partitioned by key in Kafka. If you know the keys you want to operate on are already partitioned by Kafka, use mapPartitions or foreachPartition instead to avoid the shuffle.
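Something along these lines (just a sketch, assuming a Spark version whose Java foreachRDD takes a VoidFunction; jsonDecode and saveToDB below are hypothetical stand-ins for your JSON_DECODE / saveToDB steps):

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.streaming.api.java.JavaPairInputDStream;

    import scala.Tuple2;

    public class PerPartitionSum {

      // Hypothetical stand-ins for the JSON_DECODE and saveToDB steps in the question.
      static Tuple2<String, Double> jsonDecode(String json) {
        return new Tuple2<>("someKey", 1.0);  // parse the JSON payload here
      }

      static void saveToDB(Map<String, Double> stats) {
        // write the per-partition sums to your store here
      }

      static void process(JavaPairInputDStream<String, String> messages) {
        messages.foreachRDD((JavaPairRDD<String, String> rdd) ->
            rdd.foreachPartition(records -> {
              // With the direct stream, each RDD partition maps 1:1 to a Kafka
              // partition, so all records for a Kafka-partitioned key stay local.
              Map<String, Double> sums = new HashMap<>();
              while (records.hasNext()) {
                Tuple2<String, String> record = records.next();
                Tuple2<String, Double> kv = jsonDecode(record._2());
                sums.merge(kv._1(), kv._2(), Double::sum);
              }
              saveToDB(sums);  // one write per partition, no shuffle needed
            })
        );
      }
    }

Doing the per-key sum and the DB write inside foreachPartition keeps everything local to each Kafka partition, which is why the shuffle goes away. The trade-off is that you lose reduceByKey's cross-partition combining, so this only works when a given key never spans more than one Kafka partition.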
On Sat, Nov 21, 2015 at 11:46 AM, trung kien <kient...@gmail.com> wrote:
> Hi all,
>
> I am having trouble understanding how an RDD will be partitioned after
> calling the mapToPair function.
> Could anyone give me more information about partitioning in this function?
>
> I have a simple application doing the following job:
>
>     JavaPairInputDStream<String, String> messages =
>         KafkaUtils.createDirectStream(...)
>
>     JavaPairDStream<String, Double> stats = messages.mapToPair(JSON_DECODE)
>                                                      .reduceByKey(SUM);
>
>     saveToDB(stats)
>
> I set up 2 workers (each dedicating 20 cores) for this job.
> My Kafka topic has 40 partitions (I want each core to handle one partition),
> and the messages sent to the queue are partitioned by the same key as in the
> mapToPair function.
> I'm using the default partitioner for both Kafka and Spark.
>
> Ideally, I shouldn't see data shuffled between cores in the mapToPair stage,
> right?
> However, in my Spark UI, I see that the "Locality Level" for this stage is
> "ANY", which means data needs to be transferred.
> Any comments on this?
>
> --
> Thanks
> Kien