Re: Kafka Streaming and partitioning

2017-02-26 Thread tonyye
Hi Dave, I had the same question and was wondering if you had found a way to do the join without causing a shuffle? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Kafka-Streaming-and-partitioning-tp25955p28425.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Kafka Streaming and partitioning

2016-01-13 Thread David D
Yep that's exactly what we want. Thanks for all the info, Cody. Dave. On 13 Jan 2016 18:29, "Cody Koeninger" wrote: > The idea here is that the custom partitioner shouldn't actually get used > for repartitioning the kafka stream (because that would involve a shuffle, > which is what you're trying to avoid).

Re: Kafka Streaming and partitioning

2016-01-13 Thread Cody Koeninger
The idea here is that the custom partitioner shouldn't actually get used for repartitioning the kafka stream (because that would involve a shuffle, which is what you're trying to avoid). You're just assigning a partitioner because you know how it already is partitioned. On Wed, Jan 13, 2016 at 1
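Cody's point above can be sketched without Spark at all: the records were already bucketed by key when they landed in Kafka, so attaching a partitioner is purely a metadata claim about the existing layout, and no record has to move. This is a minimal, hypothetical Python illustration (the names `HashPartitioner` and `layout_matches` are mine, not Spark's API), not the actual Spark implementation:

```python
class HashPartitioner:
    """Analogue of a hash partitioner: maps a key to a partition index."""
    def __init__(self, num_partitions):
        self.num_partitions = num_partitions

    def get_partition(self, key):
        return hash(key) % self.num_partitions

def layout_matches(partitions, partitioner):
    """Check that the data, as it already sits, obeys the partitioner.
    If this holds, assigning the partitioner requires no shuffle."""
    return all(
        partitioner.get_partition(key) == idx
        for idx, part in enumerate(partitions)
        for key, _value in part
    )

# Records as they already sit in two Kafka partitions, keyed by int % 2.
partitions = [
    [(0, "a"), (2, "c")],   # partition 0: even keys
    [(1, "b"), (3, "d")],   # partition 1: odd keys
]
print(layout_matches(partitions, HashPartitioner(2)))  # True
```

If the check passed, "assigning" the partitioner is just recording this fact on the RDD; only when the layout did not match would a real repartition (and hence a shuffle) be needed.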

Re: Kafka Streaming and partitioning

2016-01-13 Thread Dave
So for case 1 below - subclass or modify the direct stream and kafkardd. They're private, so you'd need to rebuild just the external kafka project, not all of spark When the data is read from Kafka it will be partitioned correctly with the Custom Partitioner passed in to the new direct stream a

Re: Kafka Streaming and partitioning

2016-01-13 Thread Cody Koeninger
In the case here of a kafkaRDD, the data doesn't reside on the cluster; it's not cached by default. If you're running Kafka on the same nodes as Spark, then data locality would play a factor, but that should be handled by the existing getPreferredLocations method. On Wed, Jan 13, 2016 at 10:46 AM
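The locality mechanism Cody refers to can be sketched as follows. This is a Spark-free toy model, assuming a hypothetical `KafkaPartition` and scheduler function of my own naming (the real API is `RDD.getPreferredLocations`, which returns location hints, not guarantees): a partition's preferred location is its Kafka leader host, so when brokers and executors share nodes the task lands next to the data.

```python
class KafkaPartition:
    """Toy stand-in for a KafkaRDD partition: it remembers which
    broker host leads the underlying Kafka topic-partition."""
    def __init__(self, index, leader_host):
        self.index = index
        self.leader_host = leader_host

def preferred_locations(partition):
    # Analogue of getPreferredLocations: a hint, not a guarantee.
    return [partition.leader_host]

def pick_executor(partition, executor_hosts):
    """Toy scheduler: honor the locality hint when possible,
    otherwise fall back to any available executor."""
    for host in preferred_locations(partition):
        if host in executor_hosts:
            return host           # data-local placement
    return executor_hosts[0]      # no co-located broker: any host will do

executors = ["node1", "node2"]
print(pick_executor(KafkaPartition(0, "node2"), executors))  # node2
```

The fallback branch is the "data doesn't reside on the cluster" case: with no co-located broker, every executor is equally (non-)local, so the hint simply has no effect.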

Re: Kafka Streaming and partitioning

2016-01-13 Thread Dave
Thanks Cody, appreciate the response. With this pattern the partitioners will now match when the join is executed. However, does the wrapper RDD not need to set the partition metadata on the wrapped RDD in order to allow Spark to know where the data for each partition resides in the cluster?

Re: Kafka Streaming and partitioning

2016-01-13 Thread Cody Koeninger
If two rdds have an identical partitioner, joining should not involve a shuffle. You should be able to override the partitioner without calling partitionBy. Two ways I can think of to do this: - subclass or modify the direct stream and kafkardd. They're private, so you'd need to rebuild just the external kafka project, not all of spark