Ahh, the problem probably is the async ingestion into the Spark receiver's buffers, hence the WAL is required, I would say.
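Roughly what I have in mind, as a sketch only - the broker URL, topic, paths and batch interval are all invented, this just shows where the WAL switch and the checkpoint directory come in:

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.mqtt.MQTTUtils

val conf = new SparkConf()
  .setAppName("mqtt-ingest")
  // write received blocks to the WAL before processing, so buffered
  // events survive a driver restart
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")

val ssc = new StreamingContext(conf, Seconds(10))
// the WAL lives under the checkpoint directory, so it must be on a
// reliable store such as HDFS
ssc.checkpoint("hdfs:///checkpoints/mqtt-ingest")

// with the WAL enabled, in-memory replication is redundant, hence
// MEMORY_AND_DISK_SER instead of the default MEMORY_AND_DISK_SER_2
val events = MQTTUtils.createStream(
  ssc, "tcp://broker:1883", "some/topic",
  StorageLevel.MEMORY_AND_DISK_SER)

events.saveAsTextFiles("hdfs:///data/mqtt/events")

ssc.start()
ssc.awaitTermination()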
Petr

On Tue, Sep 22, 2015 at 10:53 AM, Petr Novak <oss.mli...@gmail.com> wrote:

> If MQTT can be configured with a long enough timeout for the ACK and can
> buffer enough events while waiting for the Spark job to restart, then I
> think one could do even without the WAL, assuming the Spark job shuts
> down gracefully - possibly saving its own custom metadata somewhere,
> e.g. ZooKeeper, if that is required to restart the Spark job.
>
> On Tue, Sep 22, 2015 at 10:53 AM, Petr Novak <oss.mli...@gmail.com> wrote:
>
>> Ahh, the problem probably is the async ingestion into the Spark
>> receiver's buffers, hence the WAL is required, I would say.
>>
>> On Tue, Sep 22, 2015 at 10:52 AM, Petr Novak <oss.mli...@gmail.com> wrote:
>>
>>> If MQTT can be configured with a long enough timeout for the ACK and
>>> can buffer enough events while waiting for the Spark job to restart,
>>> then I think one could do even without the WAL, assuming the Spark job
>>> shuts down gracefully - possibly saving its own custom metadata
>>> somewhere, e.g. ZooKeeper, if that is required to restart the Spark
>>> job.
>>>
>>> Petr
>>>
>>> On Mon, Sep 21, 2015 at 8:49 PM, Adrian Tanase <atan...@adobe.com> wrote:
>>>
>>>> I'm wondering, isn't this the canonical use case for the WAL + a
>>>> reliable receiver?
>>>>
>>>> As far as I know you can tune the MQTT server to wait for an ack on
>>>> messages (QoS level 2?). With some support from the client library
>>>> you could achieve exactly-once semantics on the read side, if you ack
>>>> a message only after writing it to the WAL, correct?
>>>>
>>>> -adrian
>>>>
>>>> Sent from my iPhone
>>>>
>>>> On 21 Sep 2015, at 12:35, Petr Novak <oss.mli...@gmail.com> wrote:
>>>>
>>>> In short, there is no direct support for it in Spark AFAIK. You will
>>>> either manage it in MQTT or have to add another layer of indirection -
>>>> either memory-based (observable streams, an in-memory db) or
>>>> disk-based (Kafka, HDFS files, a db) - which will keep your
>>>> unprocessed events.
>>>>
>>>> Now realizing, there is support for backpressure in v1.5.0, but I
>>>> don't know if it could be exploited, i.e. I don't know whether it is
>>>> possible to decouple reading events into memory from the actual
>>>> processing code in Spark, so that the latter could be swapped on the
>>>> fly. Probably not without some custom-built facility for it.
>>>>
>>>> Petr
>>>>
>>>> On Mon, Sep 21, 2015 at 11:26 AM, Petr Novak <oss.mli...@gmail.com> wrote:
>>>>
>>>>> I should read my posts at least once to avoid so many typos.
>>>>> Hopefully you are brave enough to read through.
>>>>>
>>>>> Petr
>>>>>
>>>>> On Mon, Sep 21, 2015 at 11:23 AM, Petr Novak <oss.mli...@gmail.com> wrote:
>>>>>
>>>>>> I think you would have to persist the events somehow if you don't
>>>>>> want to miss them; I don't see any other option there. Either in
>>>>>> MQTT, if it is supported there, or by routing them through Kafka.
>>>>>>
>>>>>> There is the WriteAheadLog in Spark, but you would have to decouple
>>>>>> MQTT stream reading and processing into 2 separate jobs, so that
>>>>>> you could upgrade the processing one while the reading one stays
>>>>>> stable (without changes) across versions. But it is problematic
>>>>>> because there is no easy way to share DStreams between jobs - you
>>>>>> would have to develop your own facility for it.
>>>>>>
>>>>>> Alternatively, the reading job could save the MQTT events in their
>>>>>> most raw form into files - to limit the need to change its code -
>>>>>> and the processing job would then work on top of them using
>>>>>> file-based Spark streaming, roughly like the sketch below. I think
>>>>>> this is inefficient and can get quite complex if you would like to
>>>>>> make it reliable.
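>>>>>> Just to sketch the handoff - two separate applications, with paths,
>>>>>> topics and intervals invented:
>>>>>>
>>>>>> import org.apache.spark.SparkConf
>>>>>> import org.apache.spark.streaming.{Seconds, StreamingContext}
>>>>>> import org.apache.spark.streaming.mqtt.MQTTUtils
>>>>>>
>>>>>> // application 1, the reading job - kept trivial so it (almost)
>>>>>> // never needs to be redeployed
>>>>>> val readCtx = new StreamingContext(
>>>>>>   new SparkConf().setAppName("mqtt-reader"), Seconds(10))
>>>>>> MQTTUtils.createStream(readCtx, "tcp://broker:1883", "some/topic")
>>>>>>   .saveAsTextFiles("hdfs:///staging/mqtt/raw")
>>>>>> readCtx.start()
>>>>>> readCtx.awaitTermination()
>>>>>>
>>>>>> // application 2, the processing job - redeploy this one freely,
>>>>>> // it only ever sees files, never the receiver
>>>>>> val procCtx = new StreamingContext(
>>>>>>   new SparkConf().setAppName("mqtt-processor"), Seconds(10))
>>>>>> procCtx.textFileStream("hdfs:///staging/mqtt/raw")
>>>>>>   .print() // stand-in for the real processing
>>>>>> procCtx.start()
>>>>>> procCtx.awaitTermination()
>>>>>>
>>>>>> (In practice you would still have to reconcile saveAsTextFiles'
>>>>>> per-batch output directories with the flat directory that
>>>>>> textFileStream monitors - part of the complexity I mean above.)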
>>>>>> Basically, either MQTT supports persistence (which I don't know)
>>>>>> or there is Kafka for these use cases.
>>>>>>
>>>>>> Another option, I think, would be to place observable streams with
>>>>>> backpressure in between MQTT and Spark streaming; that would work
>>>>>> as long as you can perform the upgrade before the buffers fill up.
>>>>>>
>>>>>> I'm sorry that it is not thought out well from my side, it is just
>>>>>> a brainstorm, but it might lead you somewhere.
>>>>>>
>>>>>> Regards,
>>>>>> Petr
>>>>>>
>>>>>> On Mon, Sep 21, 2015 at 10:09 AM, Jeetendra Gangele <gangele...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi All,
>>>>>>>
>>>>>>> I have a Spark streaming application with a small batch interval
>>>>>>> (10 ms) which is reading the MQTT channel and dumping the data
>>>>>>> from MQTT to HDFS.
>>>>>>>
>>>>>>> So suppose I have to deploy a new application jar (with changes in
>>>>>>> the Spark streaming application) - what is the best way to deploy?
>>>>>>> Currently I am doing it as below:
>>>>>>>
>>>>>>> 1. killing the running streaming app using yarn application -kill ID
>>>>>>> 2. and then starting the application again
>>>>>>>
>>>>>>> The problem with the above approach is that since we are not
>>>>>>> persisting the events in MQTT, we will miss the events for the
>>>>>>> period of the deploy.
>>>>>>>
>>>>>>> How to handle this case?
>>>>>>>
>>>>>>> regards
>>>>>>> Jeetendra
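P.S. to Jeetendra's original question: instead of yarn application -kill,
a graceful stop at least drains the batches the receiver has already
accepted; events arriving while the app is down still need MQTT-side
buffering, Kafka, or the WAL as discussed above. A sketch only - the
marker-file trigger is just one common pattern, and the paths are
invented:

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("mqtt-ingest")
  // let a plain SIGTERM finish in-flight batches instead of dropping them
  .set("spark.streaming.stopGracefullyOnShutdown", "true")
val ssc = new StreamingContext(conf, Seconds(10))
// ... build the MQTT stream here as usual ...
ssc.start()

// explicit variant: the deploy script drops a marker file instead of
// killing the app; we notice it and stop gracefully, processing
// everything the receiver has already accepted
val fs = FileSystem.get(ssc.sparkContext.hadoopConfiguration)
while (!fs.exists(new Path("hdfs:///control/stop-mqtt-ingest"))) {
  Thread.sleep(5000)
}
ssc.stop(stopSparkContext = true, stopGracefully = true)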