Re: Flink recovery

Fabian Hueske Tue, 17 May 2016 10:08:27 -0700

Thanks for reporting back, Naveen!



2016-05-17 18:55 GMT+02:00 Madhire, Naveen <naveen.madh...@capitalone.com>:

> Hi Robert, with the use of manual savepoints, I was able to obtain
> exactly-once output with Kafka and the HDFS rolling sink.
>
> Thanks to you and Fabian for the help.
>
>
> From: Robert Metzger <rmetz...@apache.org>
> Reply-To: "user@flink.apache.org" <user@flink.apache.org>
> Date: Tuesday, May 17, 2016 at 10:02 AM
>
> To: "user@flink.apache.org" <user@flink.apache.org>
> Subject: Re: Flink recovery
>
> Hi,
>
> Savepoints are exactly for that use case:
> https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming/savepoints.html
>
> http://data-artisans.com/how-apache-flink-enables-new-streaming-applications/
>
> Regards,
> Robert
>
> On Tue, May 17, 2016 at 4:25 PM, Madhire, Naveen <
> naveen.madh...@capitalone.com> wrote:
>
>> Hey Robert,
>>
>> What is the best way to stop a streaming job in production if I want to
>> upgrade the application without losing messages or causing duplicates?
>> How can I test this scenario?
>> We are testing a few recovery mechanisms like job failure, application
>> upgrade and node failure.
>>
>>
>>
>> Thanks,
>> Naveen
>>
>> From: Robert Metzger <rmetz...@apache.org>
>> Reply-To: "user@flink.apache.org" <user@flink.apache.org>
>> Date: Tuesday, May 17, 2016 at 6:58 AM
>> To: "user@flink.apache.org" <user@flink.apache.org>
>> Subject: Re: Flink recovery
>>
>> Hi Naveen,
>>
>> I think cancelling a job is not the right approach for testing our
>> exactly-once guarantees. By cancelling a job, you are discarding the state
>> of your job. Restarting from scratch (without using a savepoint) will cause
>> duplicates.
>> What you can do to validate the behavior is randomly killing a task
>> manager running your job. Then, the job should restart on the remaining
>> machines (make sure that enough slots are available even after the failure)
>> and you shouldn't have any duplicates in HDFS.
>>
>> Regards,
>> Robert
>>
>>
>>
>>
>>
>> On Tue, May 17, 2016 at 11:27 AM, Stephan Ewen <se...@apache.org> wrote:
>>
>>> Hi Naveen!
>>>
>>> I assume you are using Hadoop 2.7+? Then you should not see the
>>> ".valid-length" file.
>>>
>>> The fix you mentioned is part of later Flink releases (like 1.0.3)
>>>
>>> Stephan
>>>
>>>
>>> On Mon, May 16, 2016 at 11:46 PM, Madhire, Naveen <
>>> naveen.madh...@capitalone.com> wrote:
>>>
>>>> Thanks Fabian. Actually, I don’t see a .valid-length suffix file in the
>>>> output HDFS folder.
>>>> Can you please tell me how I could debug this issue, or do you suggest
>>>> anything else to solve this duplicates problem?
>>>>
>>>>
>>>> Thank you.
>>>>
>>>> From: Fabian Hueske <fhue...@gmail.com>
>>>> Reply-To: "user@flink.apache.org" <user@flink.apache.org>
>>>> Date: Saturday, May 14, 2016 at 4:10 AM
>>>> To: "user@flink.apache.org" <user@flink.apache.org>
>>>> Subject: Re: Flink recovery
>>>>
>>>> The behavior of the RollingSink depends on the capabilities of the
>>>> file system.
>>>> If the file system does not support truncating files (as is the case with
>>>> older HDFS versions), an additional file with a .valid-length suffix is
>>>> written to indicate how much of the file is valid.
>>>> All records / data that come after the valid-length are duplicates.
>>>> Please refer to the JavaDocs of the RollingSink for details [1].
>>>>
>>>> If the .valid-length file does not solve the problem, you might have
>>>> found a bug and we should have a closer look at the problem.
>>>>
>>>> Best, Fabian
>>>>
>>>> [1]
>>>> https://ci.apache.org/projects/flink/flink-docs-master/api/java/org/apache/flink/streaming/connectors/fs/RollingSink.html
>>>>
>>>> 2016-05-14 4:17 GMT+02:00 Madhire, Naveen <
>>>> naveen.madh...@capitalone.com>:
>>>>
>>>>> Thanks Fabian. Yes, I am seeing a few records more than once in the
>>>>> output.
>>>>> I am running the job, canceling it from the dashboard, and running it
>>>>> again, using different HDFS output files each time. I was thinking that
>>>>> when I cancel the job, it’s not doing a clean cancel.
>>>>> Is there anything else which I have to use to make it exactly once in
>>>>> the output?
>>>>>
>>>>> I am using a simple pipeline: read from Kafka, transformations, and a
>>>>> rolling file sink.
>>>>>
>>>>>
>>>>>
>>>>> Thanks,
>>>>> Naveen
>>>>>
>>>>> From: Fabian Hueske <fhue...@gmail.com>
>>>>> Reply-To: "user@flink.apache.org" <user@flink.apache.org>
>>>>> Date: Friday, May 13, 2016 at 4:26 PM
>>>>>
>>>>> To: "user@flink.apache.org" <user@flink.apache.org>
>>>>> Subject: Re: Flink recovery
>>>>>
>>>>> Hi Naveen,
>>>>>
>>>>> the RollingSink supports exactly-once output. So you should be
>>>>> good.
>>>>>
>>>>> Did you see events being emitted multiple times (should not happen
>>>>> with the RollingSink) or being processed multiple times within the
>>>>> Flink program (might happen as explained before)?
>>>>>
>>>>> Best, Fabian
>>>>>
>>>>> 2016-05-13 23:19 GMT+02:00 Madhire, Naveen <
>>>>> naveen.madh...@capitalone.com>:
>>>>>
>>>>>> Thank you Fabian.
>>>>>>
>>>>>> I am using the HDFS rolling sink. This should support exactly-once
>>>>>> output in case of failures, shouldn’t it? I am following the
>>>>>> documentation below:
>>>>>>
>>>>>>
>>>>>> https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming/fault_tolerance.html#fault-tolerance-guarantees-of-data-sources-and-sinks
>>>>>>
>>>>>> If not, what other sinks can I use to get exactly-once output? Getting
>>>>>> exactly-once output is critical for our use case.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Naveen
>>>>>>
>>>>>> From: Fabian Hueske <fhue...@gmail.com>
>>>>>> Reply-To: "user@flink.apache.org" <user@flink.apache.org>
>>>>>> Date: Friday, May 13, 2016 at 4:13 PM
>>>>>> To: "user@flink.apache.org" <user@flink.apache.org>
>>>>>> Subject: Re: Flink recovery
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Flink's exactly-once semantics do not mean that events are processed
>>>>>> exactly-once but that events will contribute exactly-once to the state of
>>>>>> an operator such as a counter.
>>>>>> Roughly, the mechanism works as follows:
>>>>>> - Flink periodically injects checkpoint markers into the data stream.
>>>>>> This happens synchronously across all sources and markers.
>>>>>> - When an operator receives a checkpoint marker from all its sources,
>>>>>> it checkpoints its state and forwards the marker.
>>>>>> - Once the marker has been received by all sinks, the distributed
>>>>>> checkpoint is noted as successful.
>>>>>>
>>>>>> In case of a failure, the state of all operators is reset to the last
>>>>>> successful checkpoint and the sources are reset to the point when the
>>>>>> marker was injected.
>>>>>> Hence, some events are sent a second time to the operators, but the
>>>>>> state of the operators was reset as well. So the repeated events
>>>>>> contribute exactly once to the state of an operator.
>>>>>>
>>>>>> Note, you need a SinkFunction that supports Flink's checkpointing
>>>>>> mechanism to achieve exactly-once output. Otherwise, it might happen that
>>>>>> results are emitted multiple times.
>>>>>>
>>>>>> Cheers, Fabian
>>>>>>
>>>>>> 2016-05-13 22:58 GMT+02:00 Madhire, Naveen <
>>>>>> naveen.madh...@capitalone.com>:
>>>>>>
>>>>>>> I checked JIRA, and it looks like FLINK-2111 should address the
>>>>>>> issue I am facing. I am canceling the job from the dashboard.
>>>>>>>
>>>>>>> I am using a Kafka source and the HDFS rolling sink.
>>>>>>>
>>>>>>> https://issues.apache.org/jira/browse/FLINK-2111
>>>>>>>
>>>>>>> Is this JIRA part of Flink 1.0.0?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Naveen
>>>>>>>
>>>>>>> From: "Madhire, Venkat Naveen Kumar Reddy" <
>>>>>>> naveen.madh...@capitalone.com>
>>>>>>> Reply-To: "user@flink.apache.org" <user@flink.apache.org>
>>>>>>> Date: Friday, May 13, 2016 at 10:58 AM
>>>>>>> To: "user@flink.apache.org" <user@flink.apache.org>
>>>>>>> Subject: Flink recovery
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> We are trying to test the recovery mechanism of Flink with Kafka and
>>>>>>> HDFS sink during failures.
>>>>>>>
>>>>>>> I’ve killed the job after processing some messages and restarted the
>>>>>>> same job again. Some of the messages are processed more than once,
>>>>>>> which does not follow exactly-once semantics.
>>>>>>>
>>>>>>>
>>>>>>> I am also using the checkpointing mechanism and saving the state
>>>>>>> checkpoints to HDFS.
>>>>>>> Below is the checkpoint code:
>>>>>>>
>>>>>>> envStream.enableCheckpointing(11);
>>>>>>> envStream.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
>>>>>>> envStream.getCheckpointConfig().setCheckpointTimeout(60000);
>>>>>>> envStream.getCheckpointConfig().setMaxConcurrentCheckpoints(4);
>>>>>>>
>>>>>>> envStream.setStateBackend(new FsStateBackend("hdfs://ipaddr/mount/cp/checkpoint/"));
>>>>>>>
>>>>>>>
>>>>>>> One thing I’ve noticed is that lowering the checkpointing interval
>>>>>>> lowers the number of messages processed more than once, and 11 ms
>>>>>>> is the lowest I can use.
>>>>>>>
>>>>>>> Is there anything else I should try in order to get exactly-once
>>>>>>> message processing?
>>>>>>>
>>>>>>> I am using Flink 1.0.0 and Kafka 0.8.
>>>>>>>
>>>>>>>
>>>>>>> Thank you.
>>>>>>>
>>>>>>> ------------------------------
>>>>>>>
>>>>>>> The information contained in this e-mail is confidential and/or
>>>>>>> proprietary to Capital One and/or its affiliates and may only be used
>>>>>>> solely in performance of work or services for Capital One. The
>>>>>>> information transmitted herewith is intended only for use by the
>>>>>>> individual or entity to which it is addressed. If the reader of this
>>>>>>> message is not the intended recipient, you are hereby notified that any
>>>>>>> review, retransmission, dissemination, distribution, copying or other
>>>>>>> use of, or taking of any action in reliance upon this information is
>>>>>>> strictly prohibited. If you have received this communication in error,
>>>>>>> please contact the sender and delete the material from your computer.
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>
>
>
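
For readers of this thread: the checkpoint-and-replay behavior described in
the quoted messages (state and sources are reset to the last successful
checkpoint, so replayed events still contribute exactly once to operator
state) can be illustrated with a small toy model. This is only a sketch in
plain Java, not Flink code. The class name, the single counter "operator",
and the parameters checkpointEvery and failAtOffset are all hypothetical and
exist only for illustration.

```java
import java.util.Arrays;
import java.util.List;

/** Toy model of checkpoint-and-replay. Illustrative only; this is not Flink code. */
final class CheckpointReplaySketch {

    /** Snapshot taken at a checkpoint: the source read position plus the operator state. */
    static final class Checkpoint {
        final int sourceOffset;
        final long counter;

        Checkpoint(int sourceOffset, long counter) {
            this.sourceOffset = sourceOffset;
            this.counter = counter;
        }
    }

    /** Final operator state and the number of times an event was handed to the operator. */
    static final class Result {
        final long counter;
        final int processedCalls;

        Result(long counter, int processedCalls) {
            this.counter = counter;
            this.processedCalls = processedCalls;
        }
    }

    /**
     * Sums all events, taking a checkpoint every checkpointEvery events and
     * simulating one failure the first time the source reaches failAtOffset.
     */
    static Result run(List<Integer> events, int checkpointEvery, int failAtOffset) {
        Checkpoint last = new Checkpoint(0, 0L);
        long counter = 0L;
        int offset = 0;
        int processedCalls = 0;
        boolean failed = false;

        while (offset < events.size()) {
            if (!failed && offset == failAtOffset) {
                // Simulated failure: reset BOTH the state and the source position
                // to the last successful checkpoint, then replay from there.
                failed = true;
                counter = last.counter;
                offset = last.sourceOffset;
                continue;
            }
            counter += events.get(offset);
            offset++;
            processedCalls++;
            if (offset % checkpointEvery == 0) {
                last = new Checkpoint(offset, counter);
            }
        }
        return new Result(counter, processedCalls);
    }

    public static void main(String[] args) {
        Result r = run(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), 4, 6);
        System.out.println("counter=" + r.counter + " processedCalls=" + r.processedCalls);
    }
}
```

In this toy run some events are handed to the operator twice, yet the final
counter equals the plain sum of the input, because the state was rolled back
together with the source. A sink that writes every record it is handed,
without taking part in checkpointing, would emit those replayed records
twice.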

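For readers of this thread: the .valid-length mechanism mentioned in the
quoted messages means a consumer should ignore everything in a part file
beyond the recorded valid length. The sketch below shows that idea in plain
Java on an in-memory byte array. It assumes the valid length has already
been obtained as a number; how the .valid-length file itself is encoded is
not covered here, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Illustrative only: keeps the valid prefix of a part file's content. */
final class ValidLengthSketch {

    /**
     * Returns the first validLength bytes of fileContent. Bytes after that
     * point are treated as duplicates and dropped.
     */
    static byte[] validPrefix(byte[] fileContent, long validLength) {
        if (validLength < 0) {
            throw new IllegalArgumentException("validLength must be non-negative");
        }
        // Never read past the end of the content, even if validLength is larger.
        int n = (int) Math.min(validLength, (long) fileContent.length);
        return Arrays.copyOf(fileContent, n);
    }

    public static void main(String[] args) {
        // Hypothetical part file: the first 6 bytes are valid, the rest is
        // data that should be ignored.
        byte[] part = "a\nb\nc\nb\nc\n".getBytes(StandardCharsets.UTF_8);
        byte[] valid = validPrefix(part, 6L);
        System.out.print(new String(valid, StandardCharsets.UTF_8));
    }
}
```

For real part files on HDFS, the same idea would apply to a stream read that
stops after validLength bytes rather than to an in-memory array.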