Thank you very much for the new release that makes auto-scaling possible. I'm currently running multiple flink jobs and I've hand tuned the parallelism of each of the operators to achieve the best throughput. I would much rather use the auto-scaling capabilities of flink than have to hand tune my jobs but it seems there are a few gaps:
1. Setting max parallelism seems to be the only user controlled knob at the moment. As flink tries to achieve operator chaining by launching same number of sub-tasks for each operator, I'm afraid the current auto-scaling will be very inefficient. At a minimum, we need to support user provided ratios that will be used to distribute sub-tasks among operators. E.g. O1:O2 = 4:1 will mean that 4 sub-tasks of O1 should be started for each sub-task of O2. 2. Allow for external system to set parallelism of operators. Perhaps job manager's rest api can be extended to support such scaling requests 3. The doc says that local recovery doesn't work. This makes sense when a restart is due to a scaling action but I couldn't quite understand why that needs to be the case when a task manager is recovering from a crash 4. Is there any metric that allows us to distinguish between restart due to scaling as opposed to restart due to some other reason? Based on the section on limitations, there isn't but it would be good to add this as people will eventually want to monitor and alert based on restarts due to failures alone. 5. Suppose the number of containers are fixed and the job is running. Will flink internally rebalance by adding sub-tasks of one operator and removing sub-tasks of another? This could be driven by back-pressure for instance. The doc doesn't mention this so I'm assuming that current scaling is designed to maximize operator chaining. However, it does make sense to incorporate back-pressure to rebalance. Should this be how future versions of auto-scaling will work then we'll need to have some toggles to avoid restart loops. 6. How is the implementation different from taking a savepoint and manually rescaling? Are there any operator specific gotchas that we should watch out for? For instance, we use AsyncIO operator and wanted to know how inflight requests to a database would be handled when it's parallelism changes. Once again, thank you for your continued support! -- Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/