If MQTT can be configured with a long enough ACK timeout and can buffer enough
events while waiting for the Spark job to restart, then I think one could do
without the WAL, assuming the Spark job shuts down gracefully, possibly saving
its own custom metadata somewhere (e.g. ZooKeeper) if that is required to
restart the Spark job.
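The scheme described here, where the broker buffers every message and redelivers anything unacknowledged after a restart, can be sketched with a toy model in plain Scala (no Spark or MQTT dependencies; the broker and consumer classes are purely illustrative, not a real MQTT client API):

```scala
import scala.collection.mutable

final case class Msg(id: Int, payload: String)

// Toy broker: holds every published message until the consumer acks it,
// modeling MQTT's "redeliver until acknowledged" behavior.
class BufferingBroker {
  private val unacked = mutable.LinkedHashMap[Int, Msg]()
  def publish(m: Msg): Unit = unacked(m.id) = m
  // Redelivers everything still unacked, e.g. to a freshly restarted consumer.
  def pending(): Seq[Msg] = unacked.values.toSeq
  def ack(id: Int): Unit = unacked.remove(id)
}

// Consumer acks only after a message has been durably handled, so a crash
// before the ack means the broker simply redelivers on the next start.
class Consumer(broker: BufferingBroker) {
  val processed = mutable.ListBuffer[Msg]()
  def drain(): Unit =
    for (m <- broker.pending()) { processed += m; broker.ack(m.id) }
}
```

MQTT QoS 1/2 with a persistent session behaves roughly like this, which is why the broker's buffer capacity and ACK timeout bound how long the redeploy window can be before events are dropped.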
On Tue, Sep 22, 2015 at 10:53 AM, Petr Novak <oss.mli...@gmail.com> wrote:

> Ah, the problem is probably the asynchronous ingestion into the Spark
> receiver's buffers, hence the WAL is required, I would say.
>
> On Tue, Sep 22, 2015 at 10:52 AM, Petr Novak <oss.mli...@gmail.com> wrote:
>
>> If MQTT can be configured with a long enough ACK timeout and can buffer
>> enough events while waiting for the Spark job to restart, then I think
>> one could do without the WAL, assuming the Spark job shuts down
>> gracefully, possibly saving its own custom metadata somewhere (e.g.
>> ZooKeeper) if that is required to restart the Spark job.
>>
>> Petr
>>
>> On Mon, Sep 21, 2015 at 8:49 PM, Adrian Tanase <atan...@adobe.com> wrote:
>>
>>> I'm wondering, isn't this the canonical use case for the WAL plus a
>>> reliable receiver?
>>>
>>> As far as I know you can tune the MQTT server to wait for an ack on
>>> messages (QoS level 2?). With some support from the client library you
>>> could achieve exactly-once semantics on the read side if you ack a
>>> message only after writing it to the WAL, correct?
>>>
>>> -adrian
>>>
>>> On 21 Sep 2015, at 12:35, Petr Novak <oss.mli...@gmail.com> wrote:
>>>
>>> In short, there is no direct support for this in Spark, AFAIK. You
>>> will either manage it in MQTT or have to add another layer of
>>> indirection, either memory based (observable streams, an in-memory db)
>>> or disk based (Kafka, HDFS files, a db), which will hold your
>>> unprocessed events.
>>>
>>> Now that I think of it, there is support for backpressure in v1.5.0,
>>> but I don't know if it could be exploited here, i.e. whether it is
>>> possible to decouple reading events into memory from the actual
>>> processing code in Spark so that the latter could be swapped on the
>>> fly. Probably not without some custom-built facility for it.
>>>
>>> Petr
>>>
>>> On Mon, Sep 21, 2015 at 11:26 AM, Petr Novak <oss.mli...@gmail.com> wrote:
>>>
>>>> I should read my posts at least once to avoid so many typos.
>>>> Hopefully you are brave enough to read through.
>>>>
>>>> Petr
>>>>
>>>> On Mon, Sep 21, 2015 at 11:23 AM, Petr Novak <oss.mli...@gmail.com> wrote:
>>>>
>>>>> I think you would have to persist the events somehow if you don't
>>>>> want to miss them; I don't see any other option. Either in MQTT, if
>>>>> it supports that, or by routing them through Kafka.
>>>>>
>>>>> There is the WriteAheadLog in Spark, but you would have to decouple
>>>>> the MQTT stream reading and the processing into two separate jobs,
>>>>> so that you could upgrade the processing one while the reading one
>>>>> stays stable (unchanged) across versions. But this is problematic
>>>>> because there is no easy way to share DStreams between jobs; you
>>>>> would have to develop your own facility for it.
>>>>>
>>>>> Alternatively, the reading job could save the MQTT events in their
>>>>> rawest form into files, to limit the need to ever change its code,
>>>>> and the processing job would then work on top of those files using
>>>>> file-based Spark Streaming. I think this is inefficient and can get
>>>>> quite complex if you want to make it reliable.
>>>>>
>>>>> Basically, either MQTT supports persistence (which I don't know) or
>>>>> there is Kafka for these use cases.
>>>>>
>>>>> Another option, I think, would be to place observable streams with
>>>>> backpressure between MQTT and Spark Streaming, so that you could
>>>>> perform the upgrade before the buffers fill up.
>>>>>
>>>>> I'm sorry this is not well thought out on my side, it is just a
>>>>> brainstorm, but it might lead you somewhere.
>>>>>
>>>>> Regards,
>>>>> Petr
>>>>>
>>>>> On Mon, Sep 21, 2015 at 10:09 AM, Jeetendra Gangele <gangele...@gmail.com> wrote:
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> I have a Spark Streaming application with a 10 ms batch interval
>>>>>> which reads an MQTT channel and dumps the data from MQTT to HDFS.
>>>>>>
>>>>>> So suppose I have to deploy a new application jar (with changes in
>>>>>> the Spark Streaming application); what is the best way to deploy
>>>>>> it? Currently I do the following:
>>>>>>
>>>>>> 1. kill the running streaming app using yarn application -kill ID
>>>>>> 2. start the application again
>>>>>>
>>>>>> The problem with the above approach is that, since we are not
>>>>>> persisting the events in MQTT, we miss the events for the duration
>>>>>> of the deploy.
>>>>>>
>>>>>> How can this case be handled?
>>>>>>
>>>>>> Regards,
>>>>>> Jeetendra
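Petr's suggestion in the thread above of decoupling a stable reading job from an upgradable processing job via raw files can be sketched in plain Scala (stdlib only, no Spark; the directory layout and file naming are made up for illustration). A real processing job would then watch the spool directory with something like Spark Streaming's textFileStream:

```scala
import java.nio.file.{Files, Path}
import scala.collection.JavaConverters._

// Reader side: spool one batch of raw events to a file, one event per line.
// Note: in real use, zero-pad batchId so lexicographic order matches numeric.
def spool(dir: Path, batchId: Long, rawEvents: Seq[String]): Path =
  Files.write(dir.resolve(s"events-$batchId.txt"), rawEvents.asJava)

// Processing side: pick up every spooled event, in file order. A redeployed
// processor simply reads whatever landed in the directory while it was down.
def readSpooled(dir: Path): Seq[String] = {
  val files = Files.list(dir).iterator.asScala.toSeq.sorted
  files.flatMap(p => Files.readAllLines(p).asScala)
}
```

This keeps the reader trivial (so it rarely needs redeploying) at the cost of an extra hop through storage, which is essentially what the thread's Kafka suggestion buys in a more robust form.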