Hi Joey,

We are currently running 2000+ small Flink clusters on top of k8s, at the moment on about 100 nodes. Do you see yourself scaling to 10k nodes, given that each node can run a significant number of Flink jobs?

On Thu, Mar 4, 2021 at 10:51 AM Piotr Nowojski <pnowoj...@apache.org> wrote:

> Maybe a stupid question Joey, but if the problem is in the resource
> managers, have you tried running standalone Flink clusters without any
> resource manager? You would probably still hit the JobManager problems that
> Xintong mentioned, but those are problems we can help address.
>
> Piotrek
>
> On Thu, Mar 4, 2021 at 09:30, Xintong Song <tonysong...@gmail.com> wrote:
>
>> Hi Joey,
>>
>> Quick question: by *nodes*, do you mean Flink task manager processes, or
>> physical/virtual machines (like ecs, yarn NM)?
>>
>> In our production, we run Flink workloads on several Yarn/Kubernetes
>> clusters, where each cluster typically has 2k~5k machines. Most Flink
>> workloads are deployed in single-job mode, giving us thousands (sometimes
>> more than 10k) of Flink instances running concurrently on each cluster. In
>> this way, the scale of each Flink instance is usually not extremely large
>> (fewer than 1000 TMs), and we rely on the power of Yarn/Kubernetes to deal
>> with the large number of instances.
>>
>> There are also cases where a single Flink job is extremely large. We had a
>> batch workload from last year's double-11 event, with 8k max per-stage
>> parallelism and up to 30k task managers running at the same time. At that
>> scale, we ran into problems with the single-point JobManager process, such
>> as tremendous memory consumption, a busy RPC main thread, etc. To make that
>> case work, we did many optimizations on our internal Flink version, which
>> we are trying to contribute to the community. See FLINK-21110 [1] for the
>> details.
>>
>> Thank you~
>>
>> Xintong Song
>>
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-21110
>>
>> On Thu, Mar 4, 2021 at 3:39 PM Piotr Nowojski <pnowoj...@apache.org>
>> wrote:
>>
>>> Hi Joey,
>>>
>>> Sorry for not responding to your question sooner. As you can imagine,
>>> there are not many users running Flink at such scale. As far as I know,
>>> Alibaba is running the largest, or one of the largest, clusters. I'm
>>> asking someone who is familiar with those deployments to take a look at
>>> this conversation. I hope someone will respond here soon :)
>>>
>>> Best,
>>> Piotrek
>>>
>>> On Mon, Mar 1, 2021 at 14:43, Joey Tran <joey.t...@schrodinger.com>
>>> wrote:
>>>
>>>> Hi, I was looking at Apache Beam/Flink for some of our data processing
>>>> needs, but when reading about the resource managers
>>>> (YARN/mesos/Kubernetes), it seems like they all top out at around 10k
>>>> nodes. What are recommended solutions for scaling higher than this?
>>>>
>>>> Thanks in advance,
>>>> Joey
>>>>
>>>

--
Best Regards,
Yuval Itzchakov.