Thanks Ismaël. Your answers were quite useful for a novice user like me, and I am sure they will help many others as well.
*Regarding your answer to point 2:* *"Checkpointing is supported, Kafka offset management (if I understand what you mean) is managed by the KafkaIO connector + the runner"*

Beam provides IO connectors like KafkaIO, etc.

1. Does it mean that Beam code deals with fetching data from the source and
writing to the sink (after processing is done by the chosen runner)?

2. If the answer to 1 is YES, then Beam is responsible for data transfer
between the source/sink and the processing machines. In that case, an
underlying runner like Spark is only responsible for processing from one
PCollection to another PCollection, and not for data transfer from/to the
source/sink.

3. If the answer to 1 is NO, then Beam only provides an abstraction for IO as
well, and internally both data transfer and processing are done by a runner
like Spark?

Could you please tell me which assumption is right, point 2 or point 3? Feel
free to add anything else which I might have missed.

Also, it would be very useful if, for the following sample MinimalWordCount
program, you could point out which parts are executed by Beam and which parts
are executed by the runner:

PipelineOptions options = PipelineOptionsFactory.create();
Pipeline p = Pipeline.create(options);
p.apply(TextIO.read().from("gs://apache-beam-samples/shakespeare/*"))
 .apply(FlatMapElements
     .into(TypeDescriptors.strings())
     .via((String word) -> Arrays.asList(word.split("[^\\p{L}]+"))))
 .apply(Filter.by((String word) -> !word.isEmpty()))
 .apply(Count.perElement())
 .apply(MapElements
     .into(TypeDescriptors.strings())
     .via((KV<String, Long> wordCount) ->
         wordCount.getKey() + ": " + wordCount.getValue()))
 .apply(TextIO.write().to("chandan"));
p.run().waitUntilFinish();

Thanks & Regards,
Chandan

On Wed, May 16, 2018 at 8:40 PM, Ismaël Mejía <ieme...@gmail.com> wrote:

> Hello,
>
> Answers to the questions inline:
>
> > 1.
> > Are there any limitations in terms of implementations, functionalities
> > or performance if we want to run streaming on Beam with the Spark runner
> > vs streaming on Spark-Streaming directly?
>
> At this moment the Spark runner does not support some parts of the Beam
> model in streaming mode, e.g. side inputs and the state/timer API.
> Comparing this with pure Spark Streaming is not easy given the semantic
> differences of Beam.
>
> > 2. Spark features like checkpointing, Kafka offset management: how are
> > they supported in Apache Beam? Do we need to do some extra work for them?
>
> Checkpointing is supported. Kafka offset management (if I understand what
> you mean) is managed by the KafkaIO connector + the runner, so this should
> be ok.
>
> > 3. With Spark 2.x Structured Streaming, if we want to switch across
> > different modes, like from micro-batching to continuous streaming mode,
> > how can it be done while using Beam?
>
> To do this the Spark runner would need to translate the Beam pipeline
> using the Structured Streaming API, which is not the case today: it uses
> the RDD-based API, but we expect to tackle this in the not so distant
> future. However, even if we did, Spark's continuous mode is quite limited
> at this moment because it does not support aggregation functions.
>
> https://spark.apache.org/docs/2.3.0/structured-streaming-programming-guide.html#continuous-processing
>
> Don't hesitate to give Beam and the Spark runner a try, and reach out to
> us if you have questions or find any issues.
>
> Regards,
> Ismaël
>
> On Tue, May 15, 2018 at 2:22 PM chandan prakash
> <chandanbaran...@gmail.com> wrote:
>
> > Also,
> > 3. With Spark 2.x Structured Streaming, if we want to switch across
> > different modes, like from micro-batching to continuous streaming mode,
> > how can it be done while using Beam?
> >
> > These are some of the initial questions which I am not able to
> > understand currently.
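For reference, the per-element logic that the FlatMapElements / Filter / Count.perElement transforms in the MinimalWordCount sample express can be sketched in plain Java with no Beam dependency at all. The `WordCountLogic` class below is purely illustrative (not Beam code): it only mirrors what each transform computes on an element, while in a real pipeline the runner decides how and where those computations execute.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;

public class WordCountLogic {

    // Mirrors the pipeline's per-element logic:
    //   FlatMapElements: split each line on non-letter characters
    //   Filter.by:       drop empty strings
    //   Count.perElement: count occurrences of each word
    static Map<String, Long> countWords(String[] lines) {
        return Arrays.stream(lines)
                .flatMap(line -> Arrays.stream(line.split("[^\\p{L}]+")))
                .filter(word -> !word.isEmpty())
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
    }

    public static void main(String[] args) {
        Map<String, Long> counts = countWords(new String[]{"to be or not to be"});
        // Mirrors the MapElements formatting step: "word: count"
        counts.forEach((w, c) -> System.out.println(w + ": " + c));
    }
}
```

Running this on "to be or not to be" yields counts of 2 for "to" and "be" and 1 for "or" and "not", which is exactly what the Beam pipeline would produce per key.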
> > Regards,
> > Chandan
> >
> > On Tue, May 15, 2018 at 5:45 PM, chandan prakash
> > <chandanbaran...@gmail.com> wrote:
> >
> >> Hi Everyone,
> >> I have just started exploring and understanding Apache Beam for a new
> >> project in my firm.
> >> In particular, we have to decide whether to implement our product over
> >> Spark Streaming (as Spark batch is already in our ecosystem) or to use
> >> Beam over the Spark runner, to have the future liberty of changing the
> >> underlying runner.
> >>
> >> A couple of questions I have after going through the Beam docs and
> >> examples:
> >>
> >> 1. Are there any limitations in terms of implementations,
> >> functionalities or performance if we want to run streaming on Beam with
> >> the Spark runner vs streaming on Spark-Streaming directly?
> >>
> >> 2. Spark features like checkpointing, Kafka offset management: how are
> >> they supported in Apache Beam? Do we need to do some extra work for them?
> >>
> >> Any answer or link to a similar discussion would be really appreciated.
> >> Thanks in advance.
> >>
> >> Regards,
> >> --
> >> Chandan Prakash

--
Chandan Prakash
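On the checkpointing/offset question (answer 2 above): as far as I know, KafkaIO records the offsets it has consumed in a checkpoint mark that the runner persists as part of its own checkpointing, so a restarted pipeline resumes where reading left off rather than from the beginning. The class below is a hypothetical, in-memory illustration of that idea only; none of these names are real Beam or KafkaIO APIs.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of offset checkpointing: record the next offset to read
// per partition when a checkpoint is taken, and resume from it after a
// restart. In Beam, KafkaIO + the runner handle this for you; this class is
// illustrative only.
public class OffsetCheckpointSketch {

    // partition number -> next offset to read
    private final Map<Integer, Long> committedOffsets = new HashMap<>();

    // Called once the runner has durably checkpointed the processed records.
    void commit(int partition, long nextOffset) {
        committedOffsets.put(partition, nextOffset);
    }

    // On restart, resume reading from the committed offset (0 if none exists).
    long resumeFrom(int partition) {
        return committedOffsets.getOrDefault(partition, 0L);
    }

    public static void main(String[] args) {
        OffsetCheckpointSketch cp = new OffsetCheckpointSketch();
        System.out.println(cp.resumeFrom(0)); // fresh start: 0
        cp.commit(0, 42L);                    // checkpoint after reading offsets 0..41
        System.out.println(cp.resumeFrom(0)); // after a restart: 42
    }
}
```

The point of the sketch is that no extra user code is needed for this in Beam: the connector supplies the "what to remember" (offsets) and the runner supplies the "when and where to persist it" (its checkpoint mechanism).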