Thanks James for sharing your experience. I find it very interesting :-)

On Tue, Aug 22, 2017 at 9:50 PM, Hao Sun <ha...@zendesk.com> wrote:
> Great suggestions, the etcd operator is very interesting, thanks James.
>
>
> On Tue, Aug 22, 2017, 12:42 James Bucher <jbuc...@expedia.com> wrote:
>>
>> Just wanted to throw in a couple more details here from what I have
>> learned from working with Kubernetes.
>>
>> All processes restart (a lost JobManager restarts eventually). Should be
>> given in Kubernetes:
>>
>> This works very well, we run multiple jobs with a single Jobmanager and
>> Flink/Kubernetes recovers quite well.
>>
>> A way for TaskManagers to discover the restarted JobManager. Should work
>> via Kubernetes as well (restarted containers retain the external hostname):
>>
>> We use StatefulSets which provide a DNS based discovery mechanism.
>> Provided DNS is set up correctly with TTLs this works well. You could also
>> leverage the built-in Kubernetes services if you are only running a single
>> Job Manager. Kubernetes will just route the traffic to the single pod. This
>> works fine with a single Job Manager (I have tested it). However multiple
>> Job Managers won’t work because Kubernetes will route this round-robin to
>> the Job Managers
>>
>> A way to isolate different "leader sessions" against each other. Flink
>> currently uses ZooKeeper to also attach a "leader session ID" to leader
>> election, which is a fencing token to avoid that processes talk to each
>> other despite having different views on who is the leader, or whether the
>> leaser lost and re-gained leadership:
>>
>> This is probably the most difficult thing. You could leverage the built in
>> ETCD cluster. Connecting directly to the Kubernetes ETCD database directly
>> is probably a bad idea however. You should be able to create a counter using
>> the PATCH API that Kubernetes supplies in the API which follows:
>> https://tools.ietf.org/html/rfc6902 you could probably leverage
>> https://tools.ietf.org/html/rfc6902#section-4.6 to allow for atomic updates
>> to counters. Combining this with:
>> https://kubernetes.io/docs/concepts/api-extension/custom-resources/#custom-resources
>> should give a good way to work with ETCD without actually connecting
>> directly to the Kubernetes ETCD directly. This integration would require
>> modifying the Job Manager leader election code.
>>
>> A distributed atomic counter for the checkpoint ID. This is crucial to
>> ensure correctness of checkpoints in the presence of JobManager failures and
>> re-elections or split-brain situations.
>>
>> This is very similar to the above, we should be able to accomplish that
>> through the PATCH API combined with update if condition.
>>
>> If you don’t want to actually rip way into the code for the Job Manager
>> the ETCD Operator would be a good way to bring up an ETCD cluster that is
>> separate from the core Kubernetes ETCD database. Combined with zetcd you
>> could probably have that up and running quickly.
>>
>> Thanks,
>> James Bucher
>>
>> From: Hao Sun <ha...@zendesk.com>
>> Date: Monday, August 21, 2017 at 9:45 AM
>> To: Stephan Ewen <se...@apache.org>, Shannon Carey <sca...@expedia.com>
>> Cc: "user@flink.apache.org" <user@flink.apache.org>
>> Subject: Re: Flink HA with Kubernetes, without Zookeeper
>>
>> Thanks Shannon for the https://github.com/coreos/zetcd tips, I will check
>> that out and share my results if we proceed on that path.
>> Thanks Stephan for the details, this is very useful, I was about to ask
>> what exactly is stored into zookeeper, haha.
>>
>> On Mon, Aug 21, 2017 at 9:31 AM Stephan Ewen <se...@apache.org> wrote:
>>>
>>> Hi!
>>>
>>> That is a very interesting proposition. In cases where you have a single
>>> master only, you may bet away with quite good guarantees without ZK. In
>>> fact, Flink does not store significant data in ZK at all, it only uses locks
>>> and counters.
>>>
>>> You can have a setup without ZK, provided you have the following:
>>>
>>>   - All processes restart (a lost JobManager restarts eventually). Should
>>> be given in Kubernetes.
>>>
>>>   - A way for TaskManagers to discover the restarted JobManager. Should
>>> work via Kubernetes as well (restarted containers retain the external
>>> hostname)
>>>
>>>   - A way to isolate different "leader sessions" against each other.
>>> Flink currently uses ZooKeeper to also attach a "leader session ID" to
>>> leader election, which is a fencing token to avoid that processes talk to
>>> each other despite having different views on who is the leader, or whether
>>> the leaser lost and re-gained leadership.
>>>
>>>   - An atomic marker for what is the latest completed checkpoint.
>>>
>>>   - A distributed atomic counter for the checkpoint ID. This is crucial
>>> to ensure correctness of checkpoints in the presence of JobManager failures
>>> and re-elections or split-brain situations.
>>>
>>> I would assume that etcd can provide all of those services. The best way
>>> to integrate it would probably be to add an implementation of Flink's
>>> "HighAvailabilityServices" based on etcd.
>>>
>>> Have a look at this class:
>>> https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/highavailability/zookeeper/ZooKeeperHaServices.java
>>>
>>> If you want to contribute an extension of Flink using etcd, that would be
>>> awesome.
>>> This should have a FLIP though, and a plan on how to set up rigorous unit
>>> testing of that implementation (because its correctness is very crucial to
>>> Flink's HA resilience).
>>>
>>> Best,
>>> Stephan
>>>
>>>
>>> On Mon, Aug 21, 2017 at 4:15 PM, Shannon Carey <sca...@expedia.com>
>>> wrote:
>>>>
>>>> Zookeeper should still be necessary even in that case, because it is
>>>> where the JobManager stores information which needs to be recovered after
>>>> the JobManager fails.
>>>>
>>>> We're eyeing https://github.com/coreos/zetcd as a way to run Zookeeper
>>>> on top of Kubernetes' etcd cluster so that we don't have to rely on a
>>>> separate Zookeeper cluster. However, we haven't tried it yet.
>>>>
>>>> -Shannon
>>>>
>>>> From: Hao Sun <ha...@zendesk.com>
>>>> Date: Sunday, August 20, 2017 at 9:04 PM
>>>> To: "user@flink.apache.org" <user@flink.apache.org>
>>>> Subject: Flink HA with Kubernetes, without Zookeeper
>>>>
>>>> Hi, I am new to Flink and trying to bring up a Flink cluster on top of
>>>> Kubernetes.
>>>>
>>>> For HA setup, with kubernetes, I think I just need one job manager and
>>>> do not need Zookeeper? I will store all states to S3 buckets. So in case of
>>>> failure, kubernetes can just bring up a new job manager without losing
>>>> anything?
>>>>
>>>> I want to confirm my assumptions above make sense. Thanks
>>>
>>>
>

Reply via email to