Consuming from Kafka is inherently limited to a number of consumer
nodes less than or equal to the number of Kafka partitions.  If you think
about it, you're going to pay some network cost to repartition that
data from a consumer to different processing nodes, regardless of which
Spark consumer library you use.

If you really need finer-grained parallelism, and want to do it more
efficiently, you need to move that partitioning to the producer side (i.e.
add more partitions to the Kafka topic).

On Thu, Oct 29, 2015 at 6:11 AM, Adrian Tanase <atan...@adobe.com> wrote:

> You can call .repartition on the DStream created by the Kafka direct
> consumer. You take the one-time hit of a shuffle but gain the ability to
> scale out processing beyond your number of Kafka partitions.
>
> We’re doing this to scale up from 36 partitions per topic to 140 Spark
> partitions (20 cores * 7 nodes) and it works great.
>
> -adrian
>
> From: varun sharma
> Date: Thursday, October 29, 2015 at 8:27 AM
> To: user
> Subject: Need more tasks in KafkaDirectStream
>
> Right now, there is a one-to-one correspondence between Kafka partitions
> and Spark partitions.
> I don't have a requirement of one-to-one semantics.
> I need more tasks to be generated in the job so that it can be
> parallelized and the batch can complete faster. In the previous
> Receiver-based approach, the number of tasks created was independent of
> Kafka partitions; I need something like that.
> Is there any config available if I don't need one-to-one semantics?
> Is there any way I can repartition without incurring any additional cost?
>
> Thanks
> *VARUN SHARMA*
>
>
