Re: Flink AutoScaling EMR

2020-11-16 Thread Rex Fenley
Thanks for all the input!

On Sun, Nov 15, 2020 at 6:59 PM Xintong Song wrote:
> > Is there a way to make the new yarn job only on the new hardware?
>
> I think you can simply decommission the nodes from Yarn, so that new
> containers will not be allocated from those nodes. You might also need a …

Re: Flink AutoScaling EMR

2020-11-15 Thread Xintong Song
> > Is there a way to make the new yarn job only on the new hardware?

I think you can simply decommission the nodes from Yarn, so that new containers will not be allocated from those nodes. You might also need a large decommission timeout, upon which all the remaining running containers on the decom…
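A minimal sketch of that graceful-decommission step, assuming Hadoop 2.8+ syntax; the host names, excludes-file path, and timeout below are placeholders, not values from this thread:

import subprocess

# Nodes to drain (placeholders) and the ResourceManager's excludes file; the
# path must match yarn.resourcemanager.nodes.exclude-path in yarn-site.xml.
OLD_NODES = ["ip-10-0-1-12.ec2.internal", "ip-10-0-1-13.ec2.internal"]
EXCLUDE_FILE = "/etc/hadoop/conf/yarn.exclude"
DECOMMISSION_TIMEOUT_S = 3600  # generous, so running containers can finish

def decommission(nodes):
    # 1. List the nodes in the excludes file read by the ResourceManager.
    with open(EXCLUDE_FILE, "a") as f:
        for node in nodes:
            f.write(node + "\n")
    # 2. Re-read the node lists and start a *graceful* decommission with a
    #    timeout (Hadoop 2.8+: yarn rmadmin -refreshNodes -g <seconds> -client).
    subprocess.run(
        ["yarn", "rmadmin", "-refreshNodes", "-g", str(DECOMMISSION_TIMEOUT_S), "-client"],
        check=True,
    )

if __name__ == "__main__":
    decommission(OLD_NODES)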

Re: Flink AutoScaling EMR

2020-11-12 Thread Robert Metzger
Hi, it seems that YARN has a feature for targeting specific hardware: https://hadoop.apache.org/docs/r3.1.0/hadoop-yarn/hadoop-yarn-site/PlacementConstraints.html
In any case, you'll need enough spare resources for some time to be able to run your job twice for this kind of "zero downtime handover" …
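If the cluster uses YARN node labels rather than placement constraints, one way to steer the standby job onto only the new instances is roughly the sketch below. The "newhw" label, paths, parallelism, and the dynamic-property flag ("-yD" here; the exact spelling varies across Flink versions) are assumptions, not something from this thread:

import subprocess

# Assumes a YARN node label "newhw" already exists and is assigned to the new
# instances, and that this Flink version supports yarn.application.node-label.
subprocess.run(
    [
        "flink", "run",
        "-m", "yarn-cluster",
        "-yD", "yarn.application.node-label=newhw",  # only allocate containers on labeled nodes
        "-s", "s3://my-bucket/savepoints/savepoint-abc123",  # resume from the savepoint
        "-p", "16",  # parallelism for the standby job
        "/opt/jobs/my-flink-job.jar",
    ],
    check=True,
)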

Re: Flink AutoScaling EMR

2020-11-12 Thread Rex Fenley
Awesome, thanks! Is there a way to make the new yarn job only on the new hardware? Or would the two jobs have to run on intersecting hardware and then be switched on/off, which means we'll need a buffer of resources for our orchestration? Also, good point on recovery. I'll spend some time lo…

Re: Flink AutoScaling EMR

2020-11-11 Thread Robert Metzger
Hey Rex, the second approach (spinning up a standby job and then doing a handover) sounds more promising to implement, without rewriting half of the Flink codebase ;) What you need is a tool that orchestrates creating a savepoint, starting a second job from the savepoint, and then communicating with …
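A rough sketch of such an orchestration tool against Flink's REST API (trigger a savepoint on the running job, wait for it to finish, then start the second job from it). The host, job id, jar id, savepoint directory, and parallelism are placeholders:

import time
import requests

FLINK_REST = "http://jobmanager-host:8081"   # placeholder JobManager address
OLD_JOB_ID = "feedcafe0123456789abcdef0123"  # placeholder id of the running job
JAR_ID = "abc_my-flink-job.jar"              # id returned earlier by POST /jars/upload
SAVEPOINT_DIR = "s3://my-bucket/savepoints"

def trigger_savepoint(job_id):
    resp = requests.post(
        f"{FLINK_REST}/jobs/{job_id}/savepoints",
        json={"target-directory": SAVEPOINT_DIR, "cancel-job": False},
    )
    resp.raise_for_status()
    return resp.json()["request-id"]

def wait_for_savepoint(job_id, request_id):
    # Poll the asynchronous savepoint operation until it completes.
    while True:
        status = requests.get(f"{FLINK_REST}/jobs/{job_id}/savepoints/{request_id}").json()
        if status["status"]["id"] == "COMPLETED":
            return status["operation"]["location"]
        time.sleep(5)

def start_standby_job(savepoint_path):
    resp = requests.post(
        f"{FLINK_REST}/jars/{JAR_ID}/run",
        json={"savepointPath": savepoint_path, "parallelism": 16},
    )
    resp.raise_for_status()
    return resp.json()["jobid"]

if __name__ == "__main__":
    req_id = trigger_savepoint(OLD_JOB_ID)
    path = wait_for_savepoint(OLD_JOB_ID, req_id)
    new_job_id = start_standby_job(path)
    # Once the new job has caught up, cancel the old one, e.g.:
    # requests.patch(f"{FLINK_REST}/jobs/{OLD_JOB_ID}")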

Re: Flink AutoScaling EMR

2020-11-11 Thread Rex Fenley
Another thought, would it be possible to:
* Spin up new core or task nodes.
* Run a new copy of the same job on these new nodes from a savepoint.
* Have the new job *not* write to the sink until the other job is torn down?

This would allow us to be eventually consistent and maintain writes going through …
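Purely to illustrate that gating idea (this is not a Flink API, just a hypothetical sketch): the standby job computes everything but suppresses sink writes until an external flag says the old job has been torn down. The control endpoint and flag value are made up:

import json
import urllib.request

GATE_URL = "http://orchestrator-host/handover-state"  # hypothetical control endpoint

def standby_is_primary() -> bool:
    # The orchestrator flips this to "standby-is-primary" after tearing down
    # the old job; in practice this would be cached, not fetched per record.
    with urllib.request.urlopen(GATE_URL) as resp:
        return json.load(resp)["state"] == "standby-is-primary"

def gated_write(record, real_write):
    # Suppress (or buffer) writes while the old job still owns the sink, so
    # the two jobs never write concurrently; once promoted, write normally.
    if standby_is_primary():
        real_write(record)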

Flink AutoScaling EMR

2020-11-11 Thread Rex Fenley
Hello, I'm trying to find a solution for auto scaling our Flink EMR cluster with 0 downtime, using RocksDB as state storage and an S3 backing store. My current thoughts are like so:
* Scaling an Operator dynamically would require all keyed state to be available to the set of subtasks for that operator …
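For reference, the basic rescale-via-savepoint cycle that this thread is trying to automate looks roughly like the sketch below. Job id, paths, and parallelism are placeholders, and exact CLI flags vary between Flink versions:

import subprocess

JOB_ID = "feedcafe0123456789abcdef0123"      # placeholder
SAVEPOINT_DIR = "s3://my-bucket/savepoints"  # S3-backed savepoint location
JOB_JAR = "/opt/jobs/my-flink-job.jar"
NEW_PARALLELISM = 16

# 1. Stop the running job, taking a savepoint first ("flink stop" in 1.9+).
out = subprocess.run(
    ["flink", "stop", "-p", SAVEPOINT_DIR, JOB_ID],
    check=True, capture_output=True, text=True,
).stdout
savepoint_path = out.strip().split()[-1]  # naive parse of the printed savepoint path

# 2. Resubmit from the savepoint with the new parallelism; keyed RocksDB state
#    is redistributed across the new set of subtasks on restore.
subprocess.run(
    ["flink", "run", "-s", savepoint_path, "-p", str(NEW_PARALLELISM), JOB_JAR],
    check=True,
)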