Great suggestions, the etcd operator is very interesting, thanks James.

On Tue, Aug 22, 2017, 12:42 James Bucher <jbuc...@expedia.com> wrote:

> Just wanted to throw in a couple more details here from what I have
> learned from working with Kubernetes.
>
> *All processes restart (a lost JobManager restarts eventually). Should be
> given in Kubernetes*:
>
>    - This works very well. We run multiple jobs with a single JobManager,
>    and Flink/Kubernetes recovers quite well.
>
> *A way for TaskManagers to discover the restarted JobManager. Should work
> via Kubernetes as well (restarted containers retain the external hostname):*
>
>    - We use StatefulSets, which provide a DNS-based discovery mechanism.
>    Provided DNS is set up correctly with TTLs, this works well. You could also
>    use a built-in Kubernetes Service if you are only running a single
>    JobManager; Kubernetes will just route the traffic to the single pod (I
>    have tested this). However, multiple JobManagers won’t work this way,
>    because Kubernetes will route the traffic round-robin across all of the
>    JobManagers.
>
> *A way to isolate different "leader sessions" against each other. Flink
> currently uses ZooKeeper to also attach a "leader session ID" to leader
> election, which is a fencing token to avoid that processes talk to each
> other despite having different views on who is the leader, or whether the
> leader lost and re-gained leadership:*
>
>    - This is probably the most difficult thing. You could leverage the
>    built-in etcd cluster, although connecting directly to the Kubernetes
>    etcd database is probably a bad idea. You should be able to create a
>    counter using the PATCH API that Kubernetes supplies, which follows
>    JSON Patch: https://tools.ietf.org/html/rfc6902
>    You could probably use the "test" operation
>    (https://tools.ietf.org/html/rfc6902#section-4.6) to allow for atomic
>    updates to counters. Combining this with custom resources
>    (https://kubernetes.io/docs/concepts/api-extension/custom-resources/#custom-resources)
>    should give a good way to work with etcd without connecting to the
>    Kubernetes etcd directly. This integration would require modifying the
>    JobManager leader election code.
>
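
To illustrate the "test"-then-update idea from the bullet above, here is a minimal in-memory sketch in Python. It is not Kubernetes client code: apply_patch implements only a tiny subset of RFC 6902 ("test" and "replace" on top-level keys), and the resource layout and the leaderEpoch field are hypothetical names chosen for the example. It only shows why a guarded patch lets exactly one of two competing writers claim the next fencing token.

```python
import copy


class PatchTestFailed(Exception):
    """Raised when a JSON Patch "test" operation does not match."""


def apply_patch(doc, patch):
    """Apply a tiny subset of RFC 6902: "test" and "replace" on top-level keys.

    The patch is all-or-nothing: the input document is never mutated, and a
    failed "test" aborts the whole patch.
    """
    result = copy.deepcopy(doc)
    for op in patch:
        key = op["path"].lstrip("/")
        if op["op"] == "test":
            if result.get(key) != op["value"]:
                raise PatchTestFailed(key)
        elif op["op"] == "replace":
            if key not in result:
                raise KeyError(key)
            result[key] = op["value"]
        else:
            raise ValueError("unsupported op: " + op["op"])
    return result


def bump_patch(field, seen):
    """Build a patch that increments `field` only if it still equals `seen`."""
    return [
        {"op": "test", "path": "/" + field, "value": seen},
        {"op": "replace", "path": "/" + field, "value": seen + 1},
    ]


# Hypothetical resource holding the leader epoch (the fencing token).
resource = {"leaderEpoch": 7}

# Two candidates both read epoch 7 and both try to claim epoch 8.
patch_a = bump_patch("leaderEpoch", 7)
patch_b = bump_patch("leaderEpoch", 7)

resource = apply_patch(resource, patch_a)  # first writer wins
try:
    resource = apply_patch(resource, patch_b)  # second writer's "test" fails
    b_won = True
except PatchTestFailed:
    b_won = False
```

After both attempts the epoch has advanced exactly once, and the losing candidate knows it must re-read the state before acting as leader.
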
> *A distributed atomic counter for the checkpoint ID. This is crucial to
> ensure correctness of checkpoints in the presence of JobManager failures
> and re-elections or split-brain situations*.
>
>    - This is very similar to the above; we should be able to accomplish
>    it through the PATCH API combined with a conditional update (apply the
>    update only if a condition still holds).
>
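
As a rough sketch of how a conditional update can serve as a distributed counter for checkpoint IDs, the Python below uses an in-memory stand-in for the server. ConditionalStore, update_if and next_checkpoint_id are hypothetical names invented for this example, not Flink or Kubernetes APIs, and the lock merely models a server that applies one update at a time. The point is the read / conditional-write / retry loop on the client side.

```python
import threading


class ConditionalStore:
    """Stand-in for a server that supports "update only if unchanged"."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()  # models the server serializing updates

    def read(self):
        with self._lock:
            return self._value

    def update_if(self, expected, new):
        """Set the value to `new` only if it still equals `expected`."""
        with self._lock:
            if self._value != expected:
                return False
            self._value = new
            return True


def next_checkpoint_id(store):
    """Claim the next ID, retrying whenever another writer got there first."""
    while True:
        seen = store.read()
        if store.update_if(seen, seen + 1):
            return seen + 1


store = ConditionalStore()
claimed = []
claimed_lock = threading.Lock()


def worker(count):
    # Each worker plays a JobManager that needs `count` fresh checkpoint IDs.
    for _ in range(count):
        checkpoint_id = next_checkpoint_id(store)
        with claimed_lock:
            claimed.append(checkpoint_id)


threads = [threading.Thread(target=worker, args=(50,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because a write succeeds only when the value is unchanged since it was read, no two callers can obtain the same ID, even if several of them believe they are the leader at the same time.
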
> If you don’t want to dig that deep into the JobManager code, the etcd
> Operator <https://github.com/coreos/etcd-operator> would be a good way to
> bring up an etcd cluster that is separate from the core Kubernetes etcd
> database. Combined with zetcd, you could probably have that up and running
> quickly.
>
> Thanks,
> James Bucher
>
> From: Hao Sun <ha...@zendesk.com>
> Date: Monday, August 21, 2017 at 9:45 AM
> To: Stephan Ewen <se...@apache.org>, Shannon Carey <sca...@expedia.com>
> Cc: "user@flink.apache.org" <user@flink.apache.org>
> Subject: Re: Flink HA with Kubernetes, without Zookeeper
>
> Thanks Shannon for the https://github.com/coreos/zetcd tip. I will check
> that out and share my results if we proceed on that path.
> Thanks Stephan for the details, this is very useful. I was about to ask
> what exactly is stored in ZooKeeper, haha.
>
> On Mon, Aug 21, 2017 at 9:31 AM Stephan Ewen <se...@apache.org> wrote:
>
>> Hi!
>>
>> That is a very interesting proposition. In cases where you have a single
>> master only, you may get away with quite good guarantees without ZK. In
>> fact, Flink does not store significant data in ZK at all, it only uses
>> locks and counters.
>>
>> You can have a setup without ZK, provided you have the following:
>>
>>   - All processes restart (a lost JobManager restarts eventually). Should
>> be given in Kubernetes.
>>
>>   - A way for TaskManagers to discover the restarted JobManager. Should
>> work via Kubernetes as well (restarted containers retain the external
>> hostname)
>>
>>   - A way to isolate different "leader sessions" against each other.
>> Flink currently uses ZooKeeper to also attach a "leader session ID" to
>> leader election, which is a fencing token to avoid that processes talk to
>> each other despite having different views on who is the leader, or whether
>> the leader lost and re-gained leadership.
>>
>>   - An atomic marker for what is the latest completed checkpoint.
>>
>>   - A distributed atomic counter for the checkpoint ID. This is crucial
>> to ensure correctness of checkpoints in the presence of JobManager failures
>> and re-elections or split-brain situations.
>>
>> I would assume that etcd can provide all of those services. The best way
>> to integrate it would probably be to add an implementation of Flink's
>> "HighAvailabilityServices" based on etcd.
>>
>> Have a look at this class:
>> https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/highavailability/zookeeper/ZooKeeperHaServices.java
>>
>> If you want to contribute an extension of Flink using etcd, that would be
>> awesome.
>> This should have a FLIP though, and a plan on how to set up rigorous unit
>> testing of that implementation (because its correctness is very crucial to
>> Flink's HA resilience).
>>
>> Best,
>> Stephan
>>
>>
>> On Mon, Aug 21, 2017 at 4:15 PM, Shannon Carey <sca...@expedia.com>
>> wrote:
>>
>>> Zookeeper should still be necessary even in that case, because it is
>>> where the JobManager stores information which needs to be recovered after
>>> the JobManager fails.
>>>
>>> We're eyeing https://github.com/coreos/zetcd as a way to run
>>> Zookeeper on top of Kubernetes' etcd cluster so that we don't have to rely
>>> on a separate Zookeeper cluster. However, we haven't tried it yet.
>>>
>>> -Shannon
>>>
>>> From: Hao Sun <ha...@zendesk.com>
>>> Date: Sunday, August 20, 2017 at 9:04 PM
>>> To: "user@flink.apache.org" <user@flink.apache.org>
>>> Subject: Flink HA with Kubernetes, without Zookeeper
>>>
>>> Hi, I am new to Flink and trying to bring up a Flink cluster on top of
>>> Kubernetes.
>>>
>>> For HA setup, with kubernetes, I think I just need one job manager and
>>> do not need Zookeeper? I will store all states to S3 buckets. So in case of
>>> failure, kubernetes can just bring up a new job manager without losing
>>> anything?
>>>
>>> I want to confirm my assumptions above make sense. Thanks
>>>
>>
>>
