Hi Dave,
I had the same question and was wondering: did you ever find a way to do
the join without causing a shuffle?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Kafka-Streaming-and-partitioning-tp25955p28425.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Yep that's exactly what we want. Thanks for all the info Cody.
Dave.
On 13 Jan 2016 18:29, "Cody Koeninger" wrote:
> The idea here is that the custom partitioner shouldn't actually get used
> for repartitioning the kafka stream (because that would involve a shuffle,
> which is what you're trying to avoid). You're just assigning a partitioner
> because you know how it already is partitioned.
The idea here is that the custom partitioner shouldn't actually get used
for repartitioning the kafka stream (because that would involve a shuffle,
which is what you're trying to avoid). You're just assigning a partitioner
because you know how it already is partitioned.
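A minimal sketch of what "assigning a partitioner you know the data already conforms to" could look like; the class name and its shape are illustrative, not the actual implementation discussed in this thread. The wrapper moves no data: it only reports a Partitioner and delegates compute() to its parent.

```scala
import scala.reflect.ClassTag

import org.apache.spark.{Partition, Partitioner, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical wrapper: the data in `prev` is assumed to already be laid
// out according to `part`, so we just advertise that partitioner.
// No shuffle occurs; compute() simply reads through to the parent.
class AssumePartitionedRDD[K: ClassTag, V: ClassTag](
    prev: RDD[(K, V)],
    part: Partitioner
) extends RDD[(K, V)](prev) {

  // Report the partitioner so a later join() sees matching partitioners.
  override val partitioner: Option[Partitioner] = Some(part)

  // Same partitions as the parent; nothing is repartitioned.
  override protected def getPartitions: Array[Partition] =
    firstParent[(K, V)].partitions

  override def compute(split: Partition, context: TaskContext): Iterator[(K, V)] =
    firstParent[(K, V)].iterator(split, context)
}
```

Note this is only safe if the data genuinely is partitioned the way the Partitioner claims; Spark does not verify it.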
On Wed, Jan 13, 2016
So for case 1 below
- subclass or modify the direct stream and kafkardd. They're private,
so you'd need to rebuild just the external kafka project, not all of spark
When the data is read from Kafka, will it be partitioned correctly with
the custom Partitioner passed in to the new direct stream?
In the case here of a kafkaRDD, the data doesn't reside on the cluster;
it's not cached by default. If you're running kafka on the same nodes as
spark, then data locality would play a factor, but that should be handled
by the existing getPreferredLocations method.
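For context, this is roughly the shape of the locality hook being referred to; the partition field name is a stand-in for however KafkaRDD's partition type records its Kafka leader's host, and a subclass would inherit this behavior unchanged.

```scala
// Illustrative only: how an RDD advertises preferred executor locations.
// `KafkaRDDPartition` and its `host` field stand in for the real
// partition type in the external kafka module.
override def getPreferredLocations(split: Partition): Seq[String] = {
  val part = split.asInstanceOf[KafkaRDDPartition]
  // Prefer scheduling this task on the node hosting the Kafka leader
  // for the corresponding topic/partition.
  Seq(part.host)
}
```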
On Wed, Jan 13, 2016 at 10:46 AM
Thanks Cody, appreciate the response.
With this pattern the partitioners will now match when the join is
executed.
However, does the wrapper RDD not need to set the partition metadata on
the wrapped RDD so that Spark knows where the data for each partition
resides in the cluster?
If two rdds have an identical partitioner, joining should not involve a
shuffle.
You should be able to override the partitioner without calling partitionBy.
Two ways I can think of to do this:
- subclass or modify the direct stream and kafkardd. They're private, so
you'd need to rebuild just the external kafka project, not all of spark
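The co-partitioning property itself is easy to see with stock Spark; this sketch assumes an existing SparkContext `sc` and uses partitionBy only to set up the precondition (partitionBy shuffles once, but the subsequent join does not):

```scala
import org.apache.spark.HashPartitioner

// Both RDDs share the same Partitioner instance, so join() is planned
// with narrow dependencies: no extra shuffle stage for the join itself.
val p = new HashPartitioner(4)
val left  = sc.parallelize(Seq(1 -> "a", 2 -> "b")).partitionBy(p)
val right = sc.parallelize(Seq(1 -> "x", 2 -> "y")).partitionBy(p)

assert(left.partitioner == right.partitioner)
val joined = left.join(right) // co-partitioned join, no shuffle
```

The point of the two options above is to get the kafka RDD into the `left.partitioner == right.partitioner` state without paying for the initial partitionBy shuffle at all.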