Re: Kafka Direct Stream join without data shuffle

Cody Koeninger Wed, 02 Sep 2015 13:47:25 -0700

No, KafkaRDD doesn't define a partitioner (a KafkaRDD may not even be a pair
RDD, for instance).


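It sounds to me like if it's a self-join, you should be able to do it in a
single mapPartitions operation: compute the derived records and pair them
with the originals inside the same partition function, so nothing is shuffled.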
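
A rough sketch of what such a partition function could look like, written
against plain Python iterators so it runs without Spark. Everything named
here (REFERENCE_TABLE, enrich, join_partition_with_derived) is hypothetical
and not from the thread. It assumes records are (key, value) pairs and that
one partition fits in memory, since the partition is buffered to do the
local join.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical reference table used for enrichment (an assumption).
REFERENCE_TABLE: Dict[str, str] = {"a": "alpha", "b": "beta"}


def enrich(key: str, value: str) -> str:
    """Hypothetical enrichment step: tag the value with a reference lookup."""
    return f"{value}:{REFERENCE_TABLE.get(key, 'unknown')}"


def join_partition_with_derived(
    records: Iterable[Tuple[str, str]],
) -> Iterator[Tuple[str, Tuple[str, str]]]:
    """Derive and join within one partition, so no shuffle is involved."""
    # Buffer the partition; assumes it fits in memory.
    originals: List[Tuple[str, str]] = list(records)

    # Build the derived side, grouped by key, from the same records.
    derived: Dict[str, List[str]] = defaultdict(list)
    for key, value in originals:
        derived[key].append(enrich(key, value))

    # Partition-local inner join of originals against derived records.
    for key, value in originals:
        for derived_value in derived[key]:
            yield key, (value, derived_value)


if __name__ == "__main__":
    partition = iter([("a", "1"), ("b", "2"), ("a", "3")])
    for row in join_partition_with_derived(partition):
        print(row)
```

With PySpark this function would presumably be passed as
inputStream.mapPartitions(join_partition_with_derived). If every derived
record only needs to be paired with the one record it came from, the
buffering can be dropped and each pair yielded directly.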

On Wed, Sep 2, 2015 at 3:06 PM, Chen Song <chen.song...@gmail.com> wrote:

> I have a stream obtained from Kafka with the direct approach, say
> inputStream. I need to:
>
> 1. Create another DStream, derivedStream, with map or mapPartitions on
> inputStream (enriching the data from a reference table)
> 2. Join derivedStream with inputStream
>
> In my use case, I don't need to shuffle data. Each partition in
> derivedStream only needs to be joined with the corresponding partition in
> the original parent inputStream it is generated from.
>
> My questions are:
>
> 1. Is there a Partitioner defined in KafkaRDD at all?
> 2. How would I preserve the partitioning scheme and avoid data shuffle?
>
> --
> Chen Song
>
>
