Ahh, the problem probably is the async ingestion into the Spark receiver's buffers, hence the WAL is required, I would say.
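Roughly what I have in mind, as a sketch only - the broker URL, topic, paths and batch interval are all invented, this just shows where the WAL switch and the checkpoint directory come in:

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.mqtt.MQTTUtils

val conf = new SparkConf()
  .setAppName("mqtt-ingest")
  // write received blocks to the WAL before processing, so buffered
  // events survive a driver restart
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")

val ssc = new StreamingContext(conf, Seconds(10))
// the WAL lives under the checkpoint directory, so it must be on a
// reliable store such as HDFS
ssc.checkpoint("hdfs:///checkpoints/mqtt-ingest")

// with the WAL enabled, in-memory replication is redundant, hence
// MEMORY_AND_DISK_SER instead of the default MEMORY_AND_DISK_SER_2
val events = MQTTUtils.createStream(
  ssc, "tcp://broker:1883", "some/topic",
  StorageLevel.MEMORY_AND_DISK_SER)

events.saveAsTextFiles("hdfs:///data/mqtt/events")

ssc.start()
ssc.awaitTermination()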
Petr

On Tue, Sep 22, 2015 at 10:53 AM, Petr Novak <oss.mli...@gmail.com> wrote:

> If MQTT can be configured with a long enough timeout for the ACK and can
> buffer enough events while waiting for the Spark job to restart, then I
> think one could do even without the WAL, assuming the Spark job shuts
> down gracefully - possibly saving its own custom metadata somewhere,
> e.g. ZooKeeper, if that is required to restart the Spark job.
>
> On Tue, Sep 22, 2015 at 10:53 AM, Petr Novak <oss.mli...@gmail.com> wrote:
>
>> Ahh, the problem probably is the async ingestion into the Spark
>> receiver's buffers, hence the WAL is required, I would say.
>>
>> On Tue, Sep 22, 2015 at 10:52 AM, Petr Novak <oss.mli...@gmail.com> wrote:
>>
>>> If MQTT can be configured with a long enough timeout for the ACK and
>>> can buffer enough events while waiting for the Spark job to restart,
>>> then I think one could do even without the WAL, assuming the Spark job
>>> shuts down gracefully - possibly saving its own custom metadata
>>> somewhere, e.g. ZooKeeper, if that is required to restart the Spark
>>> job.
>>>
>>> Petr
>>>
>>> On Mon, Sep 21, 2015 at 8:49 PM, Adrian Tanase <atan...@adobe.com> wrote:
>>>
>>>> I'm wondering, isn't this the canonical use case for the WAL + a
>>>> reliable receiver?
>>>>
>>>> As far as I know you can tune the MQTT server to wait for an ack on
>>>> messages (QoS level 2?). With some support from the client library
>>>> you could achieve exactly-once semantics on the read side, if you ack
>>>> a message only after writing it to the WAL, correct?
>>>>
>>>> -adrian
>>>>
>>>> Sent from my iPhone
>>>>
>>>> On 21 Sep 2015, at 12:35, Petr Novak <oss.mli...@gmail.com> wrote:
>>>>
>>>> In short, there is no direct support for it in Spark AFAIK. You will
>>>> either manage it in MQTT or have to add another layer of indirection -
>>>> either memory-based (observable streams, an in-memory db) or
>>>> disk-based (Kafka, HDFS files, a db) - which will keep your
>>>> unprocessed events.
>>>>
>>>> Now realizing, there is support for backpressure in v1.5.0, but I
>>>> don't know if it could be exploited, i.e. I don't know whether it is
>>>> possible to decouple reading events into memory from the actual
>>>> processing code in Spark, so that the latter could be swapped on the
>>>> fly. Probably not without some custom-built facility for it.
>>>>
>>>> Petr
>>>>
>>>> On Mon, Sep 21, 2015 at 11:26 AM, Petr Novak <oss.mli...@gmail.com> wrote:
>>>>
>>>>> I should read my posts at least once to avoid so many typos.
>>>>> Hopefully you are brave enough to read through.
>>>>>
>>>>> Petr
>>>>>
>>>>> On Mon, Sep 21, 2015 at 11:23 AM, Petr Novak <oss.mli...@gmail.com> wrote:
>>>>>
>>>>>> I think you would have to persist the events somehow if you don't
>>>>>> want to miss them; I don't see any other option there. Either in
>>>>>> MQTT, if it is supported there, or by routing them through Kafka.
>>>>>>
>>>>>> There is the WriteAheadLog in Spark, but you would have to decouple
>>>>>> MQTT stream reading and processing into 2 separate jobs, so that
>>>>>> you could upgrade the processing one while the reading one stays
>>>>>> stable (without changes) across versions. But it is problematic
>>>>>> because there is no easy way to share DStreams between jobs - you
>>>>>> would have to develop your own facility for it.
>>>>>>
>>>>>> Alternatively, the reading job could save the MQTT events in their
>>>>>> most raw form into files - to limit the need to change its code -
>>>>>> and the processing job would then work on top of them using
>>>>>> file-based Spark streaming, roughly like the sketch below. I think
>>>>>> this is inefficient and can get quite complex if you would like to
>>>>>> make it reliable.
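>>>>>> Just to sketch the handoff - two separate applications, with paths,
>>>>>> topics and intervals invented:
>>>>>>
>>>>>> import org.apache.spark.SparkConf
>>>>>> import org.apache.spark.streaming.{Seconds, StreamingContext}
>>>>>> import org.apache.spark.streaming.mqtt.MQTTUtils
>>>>>>
>>>>>> // application 1, the reading job - kept trivial so it (almost)
>>>>>> // never needs to be redeployed
>>>>>> val readCtx = new StreamingContext(
>>>>>>   new SparkConf().setAppName("mqtt-reader"), Seconds(10))
>>>>>> MQTTUtils.createStream(readCtx, "tcp://broker:1883", "some/topic")
>>>>>>   .saveAsTextFiles("hdfs:///staging/mqtt/raw")
>>>>>> readCtx.start()
>>>>>> readCtx.awaitTermination()
>>>>>>
>>>>>> // application 2, the processing job - redeploy this one freely,
>>>>>> // it only ever sees files, never the receiver
>>>>>> val procCtx = new StreamingContext(
>>>>>>   new SparkConf().setAppName("mqtt-processor"), Seconds(10))
>>>>>> procCtx.textFileStream("hdfs:///staging/mqtt/raw")
>>>>>>   .print() // stand-in for the real processing
>>>>>> procCtx.start()
>>>>>> procCtx.awaitTermination()
>>>>>>
>>>>>> (In practice you would still have to reconcile saveAsTextFiles'
>>>>>> per-batch output directories with the flat directory that
>>>>>> textFileStream monitors - part of the complexity I mean above.)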
>>>>>> Basically, either MQTT supports persistence (which I don't know)
>>>>>> or there is Kafka for these use cases.
>>>>>>
>>>>>> Another option, I think, would be to place observable streams with
>>>>>> backpressure in between MQTT and Spark streaming; that would work
>>>>>> as long as you can perform the upgrade before the buffers fill up.
>>>>>>
>>>>>> I'm sorry that it is not thought out well from my side, it is just
>>>>>> a brainstorm, but it might lead you somewhere.
>>>>>>
>>>>>> Regards,
>>>>>> Petr
>>>>>>
>>>>>> On Mon, Sep 21, 2015 at 10:09 AM, Jeetendra Gangele <gangele...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi All,
>>>>>>>
>>>>>>> I have a Spark streaming application with a small batch interval
>>>>>>> (10 ms) which is reading the MQTT channel and dumping the data
>>>>>>> from MQTT to HDFS.
>>>>>>>
>>>>>>> So suppose I have to deploy a new application jar (with changes in
>>>>>>> the Spark streaming application) - what is the best way to deploy?
>>>>>>> Currently I am doing it as below:
>>>>>>>
>>>>>>> 1. killing the running streaming app using yarn application -kill ID
>>>>>>> 2. and then starting the application again
>>>>>>>
>>>>>>> The problem with the above approach is that since we are not
>>>>>>> persisting the events in MQTT, we will miss the events for the
>>>>>>> period of the deploy.
>>>>>>>
>>>>>>> How to handle this case?
>>>>>>>
>>>>>>> regards
>>>>>>> Jeetendra
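P.S. to Jeetendra's original question: instead of yarn application -kill,
a graceful stop at least drains the batches the receiver has already
accepted; events arriving while the app is down still need MQTT-side
buffering, Kafka, or the WAL as discussed above. A sketch only - the
marker-file trigger is just one common pattern, and the paths are
invented:

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("mqtt-ingest")
  // let a plain SIGTERM finish in-flight batches instead of dropping them
  .set("spark.streaming.stopGracefullyOnShutdown", "true")
val ssc = new StreamingContext(conf, Seconds(10))
// ... build the MQTT stream here as usual ...
ssc.start()

// explicit variant: the deploy script drops a marker file instead of
// killing the app; we notice it and stop gracefully, processing
// everything the receiver has already accepted
val fs = FileSystem.get(ssc.sparkContext.hadoopConfiguration)
while (!fs.exists(new Path("hdfs:///control/stop-mqtt-ingest"))) {
  Thread.sleep(5000)
}
ssc.stop(stopSparkContext = true, stopGracefully = true)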