Hi,

Savepoints are exactly for that use case:
https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming/savepoints.html
http://data-artisans.com/how-apache-flink-enables-new-streaming-applications/
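[Editor's note: based on the savepoint docs linked above, the stop-and-upgrade cycle roughly follows the sequence below. This is a sketch only; `<jobId>`, the savepoint path, and the jar name are placeholders, and the exact CLI flags should be checked against your Flink version.]

```shell
# Trigger a savepoint for the running job; the CLI prints the savepoint path.
bin/flink savepoint <jobId>

# Stop the old job once the savepoint has completed.
bin/flink cancel <jobId>

# Start the upgraded application from the savepoint.
bin/flink run -s <savepointPath> upgraded-app.jar
```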
Regards,
Robert

On Tue, May 17, 2016 at 4:25 PM, Madhire, Naveen <naveen.madh...@capitalone.com> wrote:

> Hey Robert,
>
> What is the best way to stop a streaming job in production if I want to
> upgrade the application without losing messages or causing duplicates?
> How can I test this scenario?
> We are testing a few recovery mechanisms: job failure, application
> upgrade, and node failure.
>
> Thanks,
> Naveen
>
> From: Robert Metzger <rmetz...@apache.org>
> Reply-To: "user@flink.apache.org" <user@flink.apache.org>
> Date: Tuesday, May 17, 2016 at 6:58 AM
> To: "user@flink.apache.org" <user@flink.apache.org>
> Subject: Re: Flink recovery
>
> Hi Naveen,
>
> I think cancelling a job is not the right approach for testing our
> exactly-once guarantees. By cancelling a job, you discard its state, and
> restarting from scratch (without using a savepoint) will cause
> duplicates.
> What you can do to validate the behavior is randomly kill a TaskManager
> running your job. The job should then restart on the remaining machines
> (make sure enough slots are available even after the failure), and you
> shouldn't have any duplicates in HDFS.
>
> Regards,
> Robert
>
> On Tue, May 17, 2016 at 11:27 AM, Stephan Ewen <se...@apache.org> wrote:
>
>> Hi Naveen!
>>
>> I assume you are using Hadoop 2.7+? Then you should not see the
>> ".valid-length" file.
>>
>> The fix you mentioned is part of later Flink releases (like 1.0.3).
>>
>> Stephan
>>
>> On Mon, May 16, 2016 at 11:46 PM, Madhire, Naveen <naveen.madh...@capitalone.com> wrote:
>>
>>> Thanks Fabian. Actually, I don’t see a .valid-length suffix file in
>>> the output HDFS folder.
>>> Can you please tell me how I would debug this issue, or do you suggest
>>> anything else to solve this duplicates problem?
>>>
>>> Thank you.
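[Editor's note: where a `.valid-length` file *is* present, applying it is a simple truncation, sketched below on a local copy of the file. The file names are hypothetical; for data still in HDFS you would copy the part file out, trim it, and copy it back (or use `hdfs dfs -truncate` where the Hadoop version supports it).]

```shell
# Stand-in for a recovered part file plus its .valid-length marker.
part="part-0-0"
printf 'hello world, with some trailing duplicates' > "$part"
printf '11' > "$part.valid-length"   # number of bytes that are valid

# Everything past the valid length is duplicate data from the failed run.
len=$(cat "$part.valid-length")
truncate -s "$len" "$part"
wc -c < "$part"
```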
>>> From: Fabian Hueske <fhue...@gmail.com>
>>> Reply-To: "user@flink.apache.org" <user@flink.apache.org>
>>> Date: Saturday, May 14, 2016 at 4:10 AM
>>> To: "user@flink.apache.org" <user@flink.apache.org>
>>> Subject: Re: Flink recovery
>>>
>>> The behavior of the rolling file sink depends on the capabilities of
>>> the file system.
>>> If the file system does not support truncating files, such as older
>>> HDFS versions, an additional file with a .valid-length suffix is
>>> written to indicate how much of the file is valid.
>>> All records / data that come after the valid length are duplicates.
>>> Please refer to the JavaDocs of the RollingSink for details [1].
>>>
>>> If the .valid-length file does not solve the problem, you might have
>>> found a bug and we should have a closer look at the problem.
>>>
>>> Best, Fabian
>>>
>>> [1] https://ci.apache.org/projects/flink/flink-docs-master/api/java/org/apache/flink/streaming/connectors/fs/RollingSink.html
>>>
>>> 2016-05-14 4:17 GMT+02:00 Madhire, Naveen <naveen.madh...@capitalone.com>:
>>>
>>>> Thanks Fabian. Yes, I am seeing a few records more than once in the
>>>> output.
>>>> I am running the job, canceling it from the dashboard, and running it
>>>> again, using different HDFS output files both times. I was thinking
>>>> that when I cancel the job, it’s not doing a clean cancel.
>>>> Is there anything else I have to use to make the output exactly once?
>>>>
>>>> I am using a simple read-from-Kafka, transformations, and rolling
>>>> file sink pipeline.
>>>>
>>>> Thanks,
>>>> Naveen
>>>>
>>>> From: Fabian Hueske <fhue...@gmail.com>
>>>> Reply-To: "user@flink.apache.org" <user@flink.apache.org>
>>>> Date: Friday, May 13, 2016 at 4:26 PM
>>>> To: "user@flink.apache.org" <user@flink.apache.org>
>>>> Subject: Re: Flink recovery
>>>>
>>>> Hi Naveen,
>>>>
>>>> the rolling file sink supports exactly-once output, so you should be
>>>> good.
>>>> Did you see events being emitted multiple times (should not happen
>>>> with the rolling file sink) or being processed multiple times within
>>>> the Flink program (might happen, as explained before)?
>>>>
>>>> Best, Fabian
>>>>
>>>> 2016-05-13 23:19 GMT+02:00 Madhire, Naveen <naveen.madh...@capitalone.com>:
>>>>
>>>>> Thank you Fabian.
>>>>>
>>>>> I am using the HDFS rolling sink. This should support exactly-once
>>>>> output in case of failures, shouldn’t it? I am following the
>>>>> documentation below:
>>>>>
>>>>> https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming/fault_tolerance.html#fault-tolerance-guarantees-of-data-sources-and-sinks
>>>>>
>>>>> If not, what other sinks can I use to get exactly-once output?
>>>>> Getting exactly-once output is critical for our use case.
>>>>>
>>>>> Thanks,
>>>>> Naveen
>>>>>
>>>>> From: Fabian Hueske <fhue...@gmail.com>
>>>>> Reply-To: "user@flink.apache.org" <user@flink.apache.org>
>>>>> Date: Friday, May 13, 2016 at 4:13 PM
>>>>> To: "user@flink.apache.org" <user@flink.apache.org>
>>>>> Subject: Re: Flink recovery
>>>>>
>>>>> Hi,
>>>>>
>>>>> Flink's exactly-once semantics do not mean that events are processed
>>>>> exactly once, but that events contribute exactly once to the state
>>>>> of an operator, such as a counter.
>>>>> Roughly, the mechanism works as follows:
>>>>> - Flink periodically injects checkpoint markers into the data
>>>>> stream. This happens synchronously across all sources.
>>>>> - When an operator receives a checkpoint marker from all its
>>>>> sources, it checkpoints its state and forwards the marker.
>>>>> - When the marker has been received by all sinks, the distributed
>>>>> checkpoint is noted as successful.
>>>>>
>>>>> In case of a failure, the state of all operators is reset to the
>>>>> last successful checkpoint, and the sources are reset to the point
>>>>> when the marker was injected.
>>>>> Hence, some events are sent a second time to the operators, but the
>>>>> state of the operators was reset as well, so the repeated events
>>>>> contribute exactly once to the state of an operator.
>>>>>
>>>>> Note, you need a SinkFunction that supports Flink's checkpointing
>>>>> mechanism to achieve exactly-once output. Otherwise, it might happen
>>>>> that results are emitted multiple times.
>>>>>
>>>>> Cheers, Fabian
>>>>>
>>>>> 2016-05-13 22:58 GMT+02:00 Madhire, Naveen <naveen.madh...@capitalone.com>:
>>>>>
>>>>>> I checked the JIRA, and it looks like FLINK-2111 should address the
>>>>>> issue I am facing. I am canceling the job from the dashboard.
>>>>>>
>>>>>> I am using a Kafka source and the HDFS rolling sink.
>>>>>>
>>>>>> https://issues.apache.org/jira/browse/FLINK-2111
>>>>>>
>>>>>> Is this JIRA part of Flink 1.0.0?
>>>>>>
>>>>>> Thanks,
>>>>>> Naveen
>>>>>>
>>>>>> From: "Madhire, Venkat Naveen Kumar Reddy" <naveen.madh...@capitalone.com>
>>>>>> Reply-To: "user@flink.apache.org" <user@flink.apache.org>
>>>>>> Date: Friday, May 13, 2016 at 10:58 AM
>>>>>> To: "user@flink.apache.org" <user@flink.apache.org>
>>>>>> Subject: Flink recovery
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> We are trying to test the recovery mechanism of Flink with Kafka
>>>>>> and an HDFS sink during failures.
>>>>>>
>>>>>> I’ve killed the job after processing some messages and restarted
>>>>>> the same job again. Some of the messages are processed more than
>>>>>> once, not following the exactly-once semantics.
>>>>>>
>>>>>> We are also using the checkpointing mechanism and saving the state
>>>>>> checkpoints into HDFS.
>>>>>> Below is the checkpoint code:
>>>>>>
>>>>>> envStream.enableCheckpointing(11);
>>>>>> envStream.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
>>>>>> envStream.getCheckpointConfig().setCheckpointTimeout(60000);
>>>>>> envStream.getCheckpointConfig().setMaxConcurrentCheckpoints(4);
>>>>>>
>>>>>> envStream.setStateBackend(new FsStateBackend("hdfs://ipaddr/mount/cp/checkpoint/"));
>>>>>>
>>>>>> One thing I’ve noticed is that lowering the checkpoint interval
>>>>>> actually lowers the number of messages processed more than once,
>>>>>> and 11 ms is the lowest I can use.
>>>>>>
>>>>>> Is there anything else I should try to get exactly-once message
>>>>>> processing?
>>>>>>
>>>>>> I am using Flink 1.0.0 and Kafka 0.8.
>>>>>>
>>>>>> Thank you.
>>>>>>
>>>>>> ------------------------------
>>>>>>
>>>>>> The information contained in this e-mail is confidential and/or
>>>>>> proprietary to Capital One and/or its affiliates and may only be
>>>>>> used solely in performance of work or services for Capital One. The
>>>>>> information transmitted herewith is intended only for use by the
>>>>>> individual or entity to which it is addressed. If the reader of
>>>>>> this message is not the intended recipient, you are hereby notified
>>>>>> that any review, retransmission, dissemination, distribution,
>>>>>> copying or other use of, or taking of any action in reliance upon
>>>>>> this information is strictly prohibited. If you have received this
>>>>>> communication in error, please contact the sender and delete the
>>>>>> material from your computer.