Hello
Can anyone help me understand this? We currently use the Receiver-based
approach and are trying to move to the Direct approach. There is no problem as
such in moving from the former to the latter; I am just trying to understand
the inner details bottom up.

Please help.

Regards
Sheel

On Mon 13 May, 2019, 12:28 AM Sheel Pancholi, <sheelst...@gmail.com> wrote:

> Hello Everyone
> I am trying to understand the internals of Spark Streaming (not Structured
> Streaming), specifically the way tasks see the DStream. I am going over the
> Spark source code (in Scala), here <https://github.com/apache/spark>. I
> understand the call stack:
>
> CoarseGrainedExecutorBackend (main) -> Executor.launchTask(...) ->
> TaskRunner (a Runnable).run() -> Task.run(...)
>
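> To check my understanding of that path, I wrote a heavily simplified,
> self-contained toy model (my own classes, not Spark's actual code) of what
> it boils down to: the Executor wraps each launched task in a TaskRunner
> (a Runnable) and submits it to a thread pool.
>
> import java.util.concurrent.Executors
>
> // Toy model only: names loosely mirror Spark's, behaviour is simplified.
> trait Task[T] { def run(taskId: Long): T }
>
> class TaskRunner(taskId: Long, task: Task[_]) extends Runnable {
>   // launchTask wraps each task in a TaskRunner and hands it to a thread
>   // pool; the Runnable then calls the task's run method.
>   override def run(): Unit = {
>     val result = task.run(taskId)
>     println(s"task $taskId finished with result $result")
>   }
> }
>
> class Executor {
>   private val threadPool = Executors.newFixedThreadPool(4)
>   def launchTask(taskId: Long, task: Task[_]): Unit =
>     threadPool.execute(new TaskRunner(taskId, task))
>   def stop(): Unit = threadPool.shutdown()
> }
>
> object LaunchPathSketch extends App {
>   val executor = new Executor
>   (1L to 5L).foreach { id =>
>     executor.launchTask(id, new Task[Int] { def run(taskId: Long): Int = taskId.toInt * 2 })
>   }
>   executor.stop()
> }
>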
> I understand that a DStream is essentially a map from batch time to the RDD
> generated for that batch, but I am trying to understand the way tasks see the
> DStream.
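>
> My mental model of that map, as a tiny self-contained sketch (a Seq stands
> in for the RDD[T]; this is not the real DStream.scala):
>
> import scala.collection.mutable
>
> final case class BatchTime(ms: Long)
>
> // Toy model of a DStream: per batch time, remember the "RDD" generated
> // for that micro-batch.
> class SimpleDStream[T] {
>   private val generatedRDDs = mutable.HashMap.empty[BatchTime, Seq[T]]
>
>   // Compute the batch's RDD once and cache it, keyed by its batch time.
>   def getOrCompute(time: BatchTime)(compute: => Seq[T]): Seq[T] =
>     generatedRDDs.getOrElseUpdate(time, compute)
> }
>
> object DStreamModelSketch extends App {
>   val stream = new SimpleDStream[String]
>   println(stream.getOrCompute(BatchTime(0L))(Seq("record-1", "record-2")))
> }
>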
> I know that there are basically two approaches to Kafka-Spark integration:
>
>    - *Receiver-based, using the High-Level Kafka Consumer APIs*
>
>    Here a new (micro-)batch is created at every batch interval (say 5
>    secs) by the *Receiver* task, with say 5 partitions (one per 1-sec
>    block interval), and handed downstream to *regular* tasks. A rough
>    setup sketch for this case is included after this list.
>
>    *Question:* Considering our example, where a microbatch is created
>    every 5 secs, has exactly 5 partitions, and all the partitions of all the
>    microbatches are supposed to be DAG-ged downstream in the exact same way:
>    is the same *regular* task re-used over and over again for the same
>    partition id of every microbatch (RDD), as a long-running task? E.g.
>
>    If *ubatch1* with partitions *(P1,P2,P3,P4,P5)* at time *T0* is assigned
>    to task ids *(T1, T2, T3, T4, T5)*, will *ubatch2* with partitions
>    *(P1',P2',P3',P4',P5')* at time *T5* also be assigned to the same set
>    of tasks *(T1, T2, T3, T4, T5)*, or will new tasks *(T6, T7, T8, T9,
>    T10)* be created for *ubatch2*?
>
>    If the latter is the case, then wouldn't it be expensive to have to send
>    new tasks over the network to the executors every 5 seconds, when you
>    already know that there are tasks doing the exact same thing that could
>    be re-used as long-running tasks?
>    - *Direct, using the Low-Level Kafka Consumer APIs*
>
>    Here a Kafka partition maps to a Spark partition and therefore to a task.
>    Again, considering 5 Kafka partitions for a topic *t*, we get 5 Spark
>    partitions and their corresponding tasks. A rough setup sketch for this
>    case is also included after this list.
>
>    *Question:* Say *ubatch1* at *T0* has partitions
>    *(P1,P2,P3,P4,P5)* assigned to tasks *(T1, T2, T3, T4, T5)*. Will
>    *ubatch2* with partitions *(P1',P2',P3',P4',P5')* at time *T5* also be
>    assigned to the same set of tasks *(T1, T2, T3, T4, T5)*, or will new
>    tasks *(T6, T7, T8, T9, T10)* be created for *ubatch2*?
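>
> For concreteness, here is roughly how I would set up the receiver-based
> case with the numbers above (just a sketch, not our actual job; the
> ZooKeeper address, group id and topic name are placeholders): a 5 sec batch
> interval with a 1 sec block interval, so each micro-batch ends up with 5
> blocks and therefore 5 partitions.
>
> import org.apache.spark.SparkConf
> import org.apache.spark.streaming.{Seconds, StreamingContext}
> import org.apache.spark.streaming.kafka.KafkaUtils  // spark-streaming-kafka-0-8
>
> object ReceiverBasedSketch extends App {
>   val conf = new SparkConf()
>     .setAppName("receiver-based-sketch")
>     .set("spark.streaming.blockInterval", "1s")      // 5 blocks per 5 sec batch
>   val ssc = new StreamingContext(conf, Seconds(5))   // 5 sec batch interval
>
>   // Receiver-based (high-level consumer) stream; connection details are placeholders.
>   val stream = KafkaUtils.createStream(
>     ssc,
>     "zk-host:2181",        // ZooKeeper quorum (placeholder)
>     "my-consumer-group",   // consumer group id (placeholder)
>     Map("t" -> 1))         // topic -> number of receiver threads
>
>   stream.map(_._2).count().print()
>
>   ssc.start()
>   ssc.awaitTermination()
> }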
>
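> And a similar sketch for the direct case, using the kafka-0-10 integration
> (again just a sketch; the broker address and group id are placeholders).
> Here each of the 5 Kafka partitions of topic *t* becomes one Spark
> partition, and hence one task, per micro-batch:
>
> import org.apache.kafka.common.serialization.StringDeserializer
> import org.apache.spark.SparkConf
> import org.apache.spark.streaming.{Seconds, StreamingContext}
> import org.apache.spark.streaming.kafka010.KafkaUtils
> import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
> import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
>
> object DirectStreamSketch extends App {
>   val conf = new SparkConf().setAppName("direct-stream-sketch")
>   val ssc = new StreamingContext(conf, Seconds(5))   // 5 sec batch interval
>
>   val kafkaParams = Map[String, Object](
>     "bootstrap.servers" -> "broker-host:9092",       // placeholder
>     "key.deserializer" -> classOf[StringDeserializer],
>     "value.deserializer" -> classOf[StringDeserializer],
>     "group.id" -> "my-consumer-group",               // placeholder
>     "auto.offset.reset" -> "latest",
>     "enable.auto.commit" -> (false: java.lang.Boolean))
>
>   // Direct (no receiver) stream: one Spark partition, and so one task,
>   // per Kafka partition of topic "t" in every micro-batch.
>   val stream = KafkaUtils.createDirectStream[String, String](
>     ssc, PreferConsistent, Subscribe[String, String](Seq("t"), kafkaParams))
>
>   stream.map(_.value).count().print()
>
>   ssc.start()
>   ssc.awaitTermination()
> }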
>
> I have put up this question on SO @
> https://stackoverflow.com/questions/56102094/kafka-spark-streaming-integration-relation-between-tasks-and-dstreams
> .
>
> Regards
> Sheel
>
