Hi CVP,

On how people use Flink, you can check this blog post to see how Alibaba does it: http://data-artisans.com/blink-flink-alibaba-search/
In addition, you can find more information on the matter in the talks from the last Flink Forward conference: http://berlin.flink-forward.org/program/sessions/
For example, Netflix also shares some information here: http://berlin.flink-forward.org/kb_sessions/beaming-flink-to-the-cloud-netflix/

Now for how things work under the hood, I will provide links to the Flink documentation. I hope that this will also help you figure out what fits your needs best.

For deployment and operations, the main resource is the Flink documentation: https://ci.apache.org/projects/flink/flink-docs-release-1.2/setup/cluster_setup.html
For what is about to come on that front, you can check out the FLIP-6 page: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=65147077

To dynamically scale your Flink job, you have to take a savepoint and restart your job from it with a different parallelism. You can find some details here: https://www.slideshare.net/tillrohrmann/dynamic-scaling-how-apache-flink-adapts-to-changing-workloads
Unfortunately, this talk is a little bit outdated. We will update our documentation on dynamic scaling soon.
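To make the savepoint-and-restart steps for rescaling a bit more concrete, here is a rough sketch in Python that only builds the two Flink CLI command lines (`flink cancel -s` and `flink run -s ... -p`). It does not execute anything. The helper names, the `bin/flink` location, the savepoint directory, and the jar name are all assumptions made for illustration, so please check the CLI documentation of your Flink version before relying on it.

```python
from typing import List


def cancel_with_savepoint_cmd(job_id: str, savepoint_dir: str) -> List[str]:
    """Build the command that takes a savepoint and then cancels the job."""
    # Hypothetical helper; assumes the CLI is invoked as bin/flink.
    return ["bin/flink", "cancel", "-s", savepoint_dir, job_id]


def resume_cmd(savepoint_path: str, parallelism: int, job_jar: str) -> List[str]:
    """Build the command that resubmits the job from a savepoint
    with a different parallelism."""
    if parallelism < 1:
        raise ValueError("parallelism must be at least 1")
    return ["bin/flink", "run", "-s", savepoint_path,
            "-p", str(parallelism), job_jar]


if __name__ == "__main__":
    # All values below are placeholders, not real job IDs or paths.
    print(" ".join(cancel_with_savepoint_cmd("<jobId>", "hdfs:///flink/savepoints")))
    # The CLI reports the path of the completed savepoint; pass that path here.
    print(" ".join(resume_cmd("<savepointPath>", 8, "my-job.jar")))
```

The two steps are kept separate on purpose: the savepoint path is only known once the first command has finished, so the second command has to be built afterwards. As far as I know, the new parallelism also cannot exceed the maximum parallelism configured for the job.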

For resource allocation and job scheduling, you can check the links I included for deployment and operations, and also this: https://ci.apache.org/projects/flink/flink-docs-release-1.3/internals/job_scheduling.html

For metrics and monitoring, you can check here: https://ci.apache.org/projects/flink/flink-docs-release-1.2/monitoring/metrics.html
and the related pages in the "Debugging & Monitoring" section of the Flink documentation.

I hope this can help as a first step,
Kostas

> On Feb 22, 2017, at 12:30 PM, Chakravarthy varaga <chakravarth...@gmail.com>
> wrote:
>
> Hi Team,
>
> We are analysing different deployment options for managing Flink Jobs on
> AWS EC2 instances.
>
> Basically, the options (Resource Managers) in front of us are using:
> -> Standalone cluster
> -> On YARN
> -> Deploy using Mesos/Marathon
> -> Deploy using Kubernetes/Docker
>
> The Resource Managers options are a bit confusing as we are unable to
> decide on which one to go with. What we are looking at as inputs to our
> analysis is:
> -> Dynamic Scaling of resources
> -> Resource Allocation
> -> Jobs Scheduling
> -> No-Downtime upgrades
> -> Monitoring & Metrics.
>
> Right now our plan is to do a paper based study evaluating these options.
>
> I'm sure lot of you guys in production/support would have encountered
> issues around these.
> Can someone point out to blogs/research papers/material
> focussing on the approach taken and the considerations for evaluation?
>
> Any help here is highly appreciated !
>
> Best Regards
> CVP