Hello,

Answers to the questions inline:
> 1. Are there any limitations in terms of implementations, functionalities or performance if we want to run streaming on Beam with the Spark runner vs streaming on Spark Streaming directly?

At this moment the Spark runner does not support some parts of the Beam model in streaming mode, e.g. side inputs and the state/timer API. Comparing this with pure Spark Streaming is not easy given the semantic differences of Beam.

> 2. Spark features like checkpointing, kafka offset management, how are they supported in Apache Beam? Do we need to do some extra work for them?

Checkpointing is supported, and Kafka offset management (if I understand what you mean) is handled by the KafkaIO connector plus the runner, so this should be OK.

> 3. with spark 2.x structured streaming, if we want to switch across different modes like from micro-batching to continuous streaming mode, how it can be done while using Beam?

To do this the Spark runner would need to translate the Beam pipeline using the Structured Streaming API, which is not the case today: it uses the RDD-based API. We expect to tackle this in the not-so-distant future. However, even if we did, Spark's continuous mode is quite limited at this moment in time because it does not support aggregation functions:
https://spark.apache.org/docs/2.3.0/structured-streaming-programming-guide.html#continuous-processing

Don't hesitate to give Beam and the Spark runner a try, and refer back to us if you have questions or find any issues.

Regards,
Ismaël

On Tue, May 15, 2018 at 2:22 PM chandan prakash <chandanbaran...@gmail.com> wrote:
> Also,
> 3. with spark 2.x structured streaming, if we want to switch across different modes like from micro-batching to continuous streaming mode, how it can be done while using Beam?
> These are some of the initial questions which I am not able to understand currently.
> Regards,
> Chandan
>
> On Tue, May 15, 2018 at 5:45 PM, chandan prakash <chandanbaran...@gmail.com> wrote:
>> Hi Everyone,
>> I have just started exploring and understanding Apache Beam for a new project in my firm.
>> In particular, we have to decide whether to implement our product over Spark Streaming (as Spark batch is already in our ecosystem) or to use Beam over the Spark runner to have future liberty of changing the underlying runner.
>> A couple of questions, after going through the Beam docs and examples:
>> 1. Are there any limitations in terms of implementations, functionalities or performance if we want to run streaming on Beam with the Spark runner vs streaming on Spark Streaming directly?
>> 2. Spark features like checkpointing, kafka offset management, how are they supported in Apache Beam? Do we need to do some extra work for them?
>> Any answer or link to a related discussion will be really appreciated.
>> Thanks in advance.
>> Regards,
>> --
>> Chandan Prakash
>
> --
> Chandan Prakash
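P.S. In case a concrete starting point helps with the checkpointing question: a Beam pipeline bundled as a jar is submitted to Spark like any other Spark application, with the runner and checkpoint directory chosen via Beam pipeline options. This is only a sketch; the class, jar, master, and HDFS path below are hypothetical placeholders, while `--runner=SparkRunner` and `--checkpointDir` are the actual Spark runner options.

```shell
# Hypothetical submission of a bundled Beam pipeline jar to Spark.
# org.example.MyBeamPipeline, my-pipeline-bundled.jar, and the HDFS path
# are placeholder names; --runner and --checkpointDir are Beam
# SparkPipelineOptions, parsed by the pipeline itself, not by Spark.
spark-submit \
  --class org.example.MyBeamPipeline \
  --master yarn \
  my-pipeline-bundled.jar \
  --runner=SparkRunner \
  --checkpointDir=hdfs:///beam/checkpoints
```

With this setup the Spark runner persists its checkpoint state under the given directory, and KafkaIO's offsets are tracked as part of the pipeline state rather than managed by hand.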