Thanks Lukasz, that was really helpful.

Regards,
Chandan
On Thu, May 17, 2018 at 8:26 PM, Lukasz Cwik <lc...@google.com> wrote:
>
> On Wed, May 16, 2018 at 10:46 PM chandan prakash <chandanbaran...@gmail.com> wrote:
>>
>> Thanks Ismaël.
>> Your answers were quite useful for a novice user.
>> I guess this answer will help many like me.
>>
>> Regarding your answer to point 2:
>>
>> "Checkpointing is supported, Kafka offset management (if I understand
>> what you mean) is managed by the KafkaIO connector + the runner"
>>
>> Beam provides IO connectors like KafkaIO, etc.
>>
>> 1. Does it mean that Beam code deals with fetching data from the source
>>    and writing to the sink (after processing is done by the chosen runner)?
>> 2. If the answer to 1 is YES, then Beam is responsible for data transfer
>>    between the source/sink and the processing machines. In that case, an
>>    underlying runner like Spark is only responsible for the processing
>>    from one PCollection to another PCollection, not for the data transfer
>>    from/to the source/sink.
>> 3. If the answer to 1 is NO, then Beam only provides an abstraction for
>>    IO as well, and internally both data transfer and processing are done
>>    by the runner like Spark?
>>
>> Can you please tell me which assumption is right, point 2 or point 3?
>> Feel free to add anything else which I might have missed.
>
> Yes, Apache Beam code is responsible for reading from sources and writing
> to sinks (typically sinks are a set of PTransforms). Sources are slightly
> more complicated because a runner interacts with a user-written source
> through interfaces like BoundedSource and UnboundedSource to support
> progress reporting, splitting, checkpointing, etc. This means that all
> runners (if they abide by the source contracts) support all user-written
> sources. The Apache Beam sources are "user"-written sources that implement
> either the BoundedSource or UnboundedSource interface.
>
>> Also,
>> it would be very useful if, in the following sample MinimalWordCount
>> program, it could be clearly pointed out which part is executed by Beam
>> and which part is executed by the runner:
>>
>> PipelineOptions options = PipelineOptionsFactory.create();
>>
>> Pipeline p = Pipeline.create(options);
>>
>> p.apply(TextIO.read().from("gs://apache-beam-samples/shakespeare/*"))
>>
>>     .apply(FlatMapElements
>>         .into(TypeDescriptors.strings())
>>         .via((String word) -> Arrays.asList(word.split("[^\\p{L}]+"))))
>>
>>     .apply(Filter.by((String word) -> !word.isEmpty()))
>>
>>     .apply(Count.perElement())
>>
>>     .apply(MapElements
>>         .into(TypeDescriptors.strings())
>>         .via((KV<String, Long> wordCount) ->
>>             wordCount.getKey() + ": " + wordCount.getValue()))
>>
>>     .apply(TextIO.write().to("chandan"));
>>
>> p.run().waitUntilFinish();
>
> As discussed above, runners are required to fulfill the BoundedSource
> contract to power TextIO. Runners are also responsible for making sure
> that elements produced by one PTransform can be consumed by the next
> PTransform (e.g. the words output by the filter function can be consumed
> by the count). And finally, runners are responsible for implementing
> GroupByKey (the Count.perElement transform is composed of a GroupByKey
> followed by a combiner). Runners are also responsible for the lifecycle
> of the job (e.g. distributing your application to a cluster of machines
> and managing execution across that cluster).
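>
> To make the GroupByKey point concrete, here is a rough sketch of the kind
> of composition Count.perElement() performs (an illustration only, not
> Beam's exact internals; it assumes "words" is the PCollection<String>
> produced by the Filter step above, plus imports for KV, Sum, and
> TypeDescriptors):
>
> // Pair every word with 1L, then combine per key. The per-key combine
> // runs on top of the GroupByKey primitive that the runner implements.
> PCollection<KV<String, Long>> counts = words
>     .apply(MapElements
>         .into(TypeDescriptors.kvs(
>             TypeDescriptors.strings(), TypeDescriptors.longs()))
>         .via((String word) -> KV.of(word, 1L)))
>     .apply(Sum.longsPerKey()); // a Combine.perKey over a GroupByKey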
>>
>> Thanks & Regards,
>> Chandan
>>
>> On Wed, May 16, 2018 at 8:40 PM, Ismaël Mejía <ieme...@gmail.com> wrote:
>>>
>>> Hello,
>>>
>>> Answers to the questions inline:
>>>
>>> > 1. Are there any limitations in terms of implementation, functionality,
>>> > or performance if we want to run streaming on Beam with the Spark
>>> > runner vs streaming on Spark Streaming directly?
>>>
>>> At this moment the Spark runner does not support some parts of the Beam
>>> model in streaming mode, e.g. side inputs and the state/timer API.
>>> Comparing this with pure Spark Streaming is not easy given the semantic
>>> differences of Beam.
>>>
>>> > 2. Spark features like checkpointing and Kafka offset management: how
>>> > are they supported in Apache Beam? Do we need to do some extra work
>>> > for them?
>>>
>>> Checkpointing is supported. Kafka offset management (if I understand
>>> what you mean) is managed by the KafkaIO connector + the runner, so this
>>> should be ok.
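>>>
>>> For illustration, a minimal KafkaIO read sketch (the broker address and
>>> topic are placeholders, and the exact options vary by Beam version):
>>>
>>> // The source tracks consumed offsets in its checkpoint mark, which the
>>> // runner persists as part of its own checkpointing mechanism.
>>> p.apply(KafkaIO.<Long, String>read()
>>>     .withBootstrapServers("broker-1:9092")          // placeholder
>>>     .withTopic("my-topic")                          // placeholder
>>>     .withKeyDeserializer(LongDeserializer.class)
>>>     .withValueDeserializer(StringDeserializer.class)
>>>     .withoutMetadata());  // yields PCollection<KV<Long, String>>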
>>>
>>> > 3. With Spark 2.x Structured Streaming, if we want to switch across
>>> > different modes, e.g. from micro-batching to continuous streaming
>>> > mode, how can it be done while using Beam?
>>>
>>> To do this the Spark runner would need to translate the Beam pipeline
>>> using the Structured Streaming API, which is not the case today. It uses
>>> the RDD-based API, but we expect to tackle this in the not-so-distant
>>> future. However, even if we did, Spark's continuous mode is quite
>>> limited at this moment because it does not support aggregation
>>> functions.
>>>
>>> https://spark.apache.org/docs/2.3.0/structured-streaming-programming-guide.html#continuous-processing
>>>
>>> Don't hesitate to give Beam and the Spark runner a try, and reach out
>>> to us if you have questions or find any issues.
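>>>
>>> For example, switching a pipeline to the Spark runner is mostly a
>>> matter of options (a sketch; it assumes the beam-runners-spark artifact
>>> is on the classpath):
>>>
>>> // The pipeline code itself is unchanged; only the runner choice differs.
>>> // Alternatively, pass --runner=SparkRunner on the command line via
>>> // PipelineOptionsFactory.fromArgs(args).
>>> SparkPipelineOptions options =
>>>     PipelineOptionsFactory.as(SparkPipelineOptions.class);
>>> options.setRunner(SparkRunner.class);
>>> Pipeline p = Pipeline.create(options);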
>>>
>>> Regards,
>>> Ismaël
>>>
>>> On Tue, May 15, 2018 at 2:22 PM chandan prakash <chandanbaran...@gmail.com> wrote:
>>>
>>> > Also,
>>> > 3. With Spark 2.x Structured Streaming, if we want to switch across
>>> > different modes, e.g. from micro-batching to continuous streaming
>>> > mode, how can it be done while using Beam?
>>> > These are some of the initial questions which I am not able to figure
>>> > out currently.
>>> >
>>> > Regards,
>>> > Chandan
>>> >
>>> > On Tue, May 15, 2018 at 5:45 PM, chandan prakash <chandanbaran...@gmail.com> wrote:
>>> >
>>> >> Hi Everyone,
>>> >> I have just started exploring and understanding Apache Beam for a
>>> >> new project in my firm.
>>> >> In particular, we have to decide whether to implement our product
>>> >> over Spark Streaming (as Spark batch is already in our ecosystem) or
>>> >> to use Beam over the Spark runner to have the future liberty of
>>> >> changing the underlying runner.
>>> >>
>>> >> A couple of questions I have after going through the Beam docs and
>>> >> examples:
>>> >>
>>> >> 1. Are there any limitations in terms of implementation,
>>> >> functionality, or performance if we want to run streaming on Beam
>>> >> with the Spark runner vs streaming on Spark Streaming directly?
>>> >> 2. Spark features like checkpointing and Kafka offset management:
>>> >> how are they supported in Apache Beam? Do we need to do some extra
>>> >> work for them?
>>> >>
>>> >> Any answer or link to a similar discussion would be really
>>> >> appreciated.
>>> >> Thanks in advance.
>>> >>
>>> >> Regards,
>>> >> --
>>> >> Chandan Prakash
>>>
>>> > --
>>> > Chandan Prakash
>>
>> --
>> Chandan Prakash

--
Chandan Prakash