Re: RDD partition after calling mapToPair

2015-11-24 Thread trung kien
Thanks, Cody, for the very useful information. It's much clearer to me now. I had a lot of wrong assumptions. On Nov 23, 2015 10:19 PM, "Cody Koeninger" wrote: > Partitioner is an optional field when defining an RDD. KafkaRDD doesn't > define one, so you can't really assume anything about the way

Re: RDD partition after calling mapToPair

2015-11-23 Thread Cody Koeninger
Partitioner is an optional field when defining an RDD. KafkaRDD doesn't define one, so you can't really assume anything about the way it's partitioned, because Spark doesn't know anything about the way it's partitioned. If you want to rely on some property of how things were partitioned as they w

Re: RDD partition after calling mapToPair

2015-11-23 Thread Thúy Hằng Lê
Thanks Cody, I still have concerns about this. What do you mean by saying the Spark direct stream doesn't have a default partitioner? Could you please explain more? When I assign 20 cores to 20 Kafka partitions, I expect each core to work on one partition. Is that correct? I'm still

Re: RDD partition after calling mapToPair

2015-11-22 Thread Cody Koeninger
The Spark direct stream doesn't have a default partitioner. If you know that you want to do an operation on keys that are already partitioned by Kafka, just use mapPartitions or foreachPartition to avoid a shuffle. On Sat, Nov 21, 2015 at 11:46 AM, trung kien wrote: > Hi all, > > I am having proble
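
The mapPartitions suggestion in the 2015-11-22 reply can be illustrated without a cluster. What follows is a Spark-free sketch in plain Python, not Spark or Kafka code: partitions are modelled as lists, and every name in it (kafka_style_partition, sum_by_key_in_partition, map_partitions) is hypothetical. The byte-sum hash is only a deterministic stand-in and is not Kafka's real partitioner. It assumes that all records for a given key were written to the same partition; under that assumption, aggregating inside each partition already gives complete per-key results, so no records have to move between partitions.

```python
from collections import defaultdict


def kafka_style_partition(records, num_partitions):
    """Place each (key, value) record in a partition chosen from its key.

    Stand-in for a keyed producer: the same key always lands in the
    same partition. The hash used here is illustrative only.
    """
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        index = sum(key.encode("utf-8")) % num_partitions
        partitions[index].append((key, value))
    return partitions


def sum_by_key_in_partition(partition):
    """Aggregate values per key using only the records of one partition."""
    totals = defaultdict(int)
    for key, value in partition:
        totals[key] += value
    return dict(totals)


def map_partitions(partitions, fn):
    """Apply fn to each partition independently, the way a per-partition
    operation would; no record ever leaves its partition."""
    return [fn(partition) for partition in partitions]


records = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
partitions = kafka_style_partition(records, num_partitions=3)
per_partition_totals = map_partitions(partitions, sum_by_key_in_partition)

for index, totals in enumerate(per_partition_totals):
    print(index, totals)
```

If the co-location assumption does not hold (for example, records were produced without keys, or the data was repartitioned along the way), the per-partition totals are only partial and still have to be combined across partitions, which is the shuffle that a keyed operation such as reduceByKey performs.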