Thanks Lukasz, that was really helpful.

Regards,
Chandan
On Thu, May 17, 2018 at 8:26 PM, Lukasz Cwik <lc...@google.com> wrote:
>
> On Wed, May 16, 2018 at 10:46 PM chandan prakash <chandanbaran...@gmail.com> wrote:
>>
>> Thanks Ismaël.
>> Your answers were quite useful for a novice user.
>> I guess this answer will help many like me.
>>
>> Regarding your answer to point 2:
>>
>> "Checkpointing is supported, Kafka offset management (if I understand
>> what you mean) is managed by the KafkaIO connector + the runner"
>>
>> Beam provides IO connectors like KafkaIO, etc.
>>
>> 1. Does it mean that Beam code deals with fetching data from the source
>>    and writing to the sink (after processing is done by the chosen runner)?
>> 2. If the answer to 1 is YES, then Beam is responsible for data transfer
>>    between the source/sink and the processing machines. In that case, an
>>    underlying runner like Spark is only responsible for the processing
>>    from one PCollection to another PCollection, not for the data transfer
>>    from/to the source/sink.
>> 3. If the answer to 1 is NO, then Beam only provides an abstraction for
>>    IO as well, and internally both data transfer and processing are done
>>    by the runner like Spark?
>>
>> Can you please tell me which assumption is right, point 2 or point 3?
>> Feel free to add anything else which I might have missed.
>
> Yes, Apache Beam code is responsible for reading from sources and writing
> to sinks (typically sinks are a set of PTransforms). Sources are slightly
> more complicated because a runner interacts with a user-written source
> through interfaces like BoundedSource and UnboundedSource to support
> progress reporting, splitting, checkpointing, etc. This means that all
> runners (if they abide by the source contracts) support all user-written
> sources. The Apache Beam sources are "user"-written sources that implement
> either the BoundedSource or UnboundedSource interface.
>
>> Also,
>> it would be very useful if, in the following sample MinimalWordCount
>> program, it could be clearly pointed out which part is executed by Beam
>> and which part is executed by the runner:
>>
>> PipelineOptions options = PipelineOptionsFactory.create();
>>
>> Pipeline p = Pipeline.create(options);
>>
>> p.apply(TextIO.read().from("gs://apache-beam-samples/shakespeare/*"))
>>
>>     .apply(FlatMapElements
>>         .into(TypeDescriptors.strings())
>>         .via((String word) -> Arrays.asList(word.split("[^\\p{L}]+"))))
>>
>>     .apply(Filter.by((String word) -> !word.isEmpty()))
>>
>>     .apply(Count.perElement())
>>
>>     .apply(MapElements
>>         .into(TypeDescriptors.strings())
>>         .via((KV<String, Long> wordCount) ->
>>             wordCount.getKey() + ": " + wordCount.getValue()))
>>
>>     .apply(TextIO.write().to("chandan"));
>>
>> p.run().waitUntilFinish();
>
> As discussed above, runners are required to fulfill the BoundedSource
> contract to power TextIO. Runners are also responsible for making sure
> that elements produced by one PTransform can be consumed by the next
> PTransform (e.g. the words output by the filter function can be consumed
> by the count). And finally, runners are responsible for implementing
> GroupByKey (the Count.perElement transform is composed of a GroupByKey
> followed by a combiner). Runners are also responsible for the lifecycle
> of the job (e.g. distributing your application to a cluster of machines
> and managing execution across that cluster).
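>
> To make the GroupByKey point concrete, here is a rough sketch of the kind
> of composition Count.perElement() performs (an illustration only, not
> Beam's exact internals; it assumes "words" is the PCollection<String>
> produced by the Filter step above, plus imports for KV, Sum, and
> TypeDescriptors):
>
> // Pair every word with 1L, then combine per key. The per-key combine
> // runs on top of the GroupByKey primitive that the runner implements.
> PCollection<KV<String, Long>> counts = words
>     .apply(MapElements
>         .into(TypeDescriptors.kvs(
>             TypeDescriptors.strings(), TypeDescriptors.longs()))
>         .via((String word) -> KV.of(word, 1L)))
>     .apply(Sum.longsPerKey()); // a Combine.perKey over a GroupByKey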
>>
>> Thanks & Regards,
>> Chandan
>>
>> On Wed, May 16, 2018 at 8:40 PM, Ismaël Mejía <ieme...@gmail.com> wrote:
>>>
>>> Hello,
>>>
>>> Answers to the questions inline:
>>>
>>> > 1. Are there any limitations in terms of implementation, functionality,
>>> > or performance if we want to run streaming on Beam with the Spark
>>> > runner vs streaming on Spark Streaming directly?
>>>
>>> At this moment the Spark runner does not support some parts of the Beam
>>> model in streaming mode, e.g. side inputs and the state/timer API.
>>> Comparing this with pure Spark Streaming is not easy given the semantic
>>> differences of Beam.
>>>
>>> > 2. Spark features like checkpointing and Kafka offset management: how
>>> > are they supported in Apache Beam? Do we need to do some extra work
>>> > for them?
>>>
>>> Checkpointing is supported. Kafka offset management (if I understand
>>> what you mean) is managed by the KafkaIO connector + the runner, so this
>>> should be ok.
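>>>
>>> For illustration, a minimal KafkaIO read sketch (the broker address and
>>> topic are placeholders, and the exact options vary by Beam version):
>>>
>>> // The source tracks consumed offsets in its checkpoint mark, which the
>>> // runner persists as part of its own checkpointing mechanism.
>>> p.apply(KafkaIO.<Long, String>read()
>>>     .withBootstrapServers("broker-1:9092")          // placeholder
>>>     .withTopic("my-topic")                          // placeholder
>>>     .withKeyDeserializer(LongDeserializer.class)
>>>     .withValueDeserializer(StringDeserializer.class)
>>>     .withoutMetadata());  // yields PCollection<KV<Long, String>>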
>>>
>>> > 3. With Spark 2.x Structured Streaming, if we want to switch across
>>> > different modes, e.g. from micro-batching to continuous streaming
>>> > mode, how can it be done while using Beam?
>>>
>>> To do this the Spark runner would need to translate the Beam pipeline
>>> using the Structured Streaming API, which is not the case today. It uses
>>> the RDD-based API, but we expect to tackle this in the not-so-distant
>>> future. However, even if we did, Spark's continuous mode is quite
>>> limited at this moment because it does not support aggregation
>>> functions.
>>>
>>> https://spark.apache.org/docs/2.3.0/structured-streaming-programming-guide.html#continuous-processing
>>>
>>> Don't hesitate to give Beam and the Spark runner a try, and reach out
>>> to us if you have questions or find any issues.
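>>>
>>> For example, switching a pipeline to the Spark runner is mostly a
>>> matter of options (a sketch; it assumes the beam-runners-spark artifact
>>> is on the classpath):
>>>
>>> // The pipeline code itself is unchanged; only the runner choice differs.
>>> // Alternatively, pass --runner=SparkRunner on the command line via
>>> // PipelineOptionsFactory.fromArgs(args).
>>> SparkPipelineOptions options =
>>>     PipelineOptionsFactory.as(SparkPipelineOptions.class);
>>> options.setRunner(SparkRunner.class);
>>> Pipeline p = Pipeline.create(options);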
>>>
>>> Regards,
>>> Ismaël
>>>
>>> On Tue, May 15, 2018 at 2:22 PM chandan prakash <chandanbaran...@gmail.com> wrote:
>>>
>>> > Also,
>>> > 3. With Spark 2.x Structured Streaming, if we want to switch across
>>> > different modes, e.g. from micro-batching to continuous streaming
>>> > mode, how can it be done while using Beam?
>>> > These are some of the initial questions which I am not able to figure
>>> > out currently.
>>> >
>>> > Regards,
>>> > Chandan
>>> >
>>> > On Tue, May 15, 2018 at 5:45 PM, chandan prakash <chandanbaran...@gmail.com> wrote:
>>> >
>>> >> Hi Everyone,
>>> >> I have just started exploring and understanding Apache Beam for a
>>> >> new project in my firm.
>>> >> In particular, we have to decide whether to implement our product
>>> >> over Spark Streaming (as Spark batch is already in our ecosystem) or
>>> >> to use Beam over the Spark runner to have the future liberty of
>>> >> changing the underlying runner.
>>> >>
>>> >> A couple of questions I have after going through the Beam docs and
>>> >> examples:
>>> >>
>>> >> 1. Are there any limitations in terms of implementation,
>>> >> functionality, or performance if we want to run streaming on Beam
>>> >> with the Spark runner vs streaming on Spark Streaming directly?
>>> >> 2. Spark features like checkpointing and Kafka offset management:
>>> >> how are they supported in Apache Beam? Do we need to do some extra
>>> >> work for them?
>>> >>
>>> >> Any answer or link to a similar discussion would be really
>>> >> appreciated.
>>> >> Thanks in advance.
>>> >>
>>> >> Regards,
>>> >> --
>>> >> Chandan Prakash
>>>
>>> > --
>>> > Chandan Prakash
>>
>> --
>> Chandan Prakash

--
Chandan Prakash