Thanks Peter, even with ECS we have autoscaling enabled but the issue is during autoscaling we need to stop the job and start with new parallelism which creates a downtime.
Thanks On Fri, Nov 8, 2019 at 1:01 PM Peter Groesbeck <peter.groesb...@gmail.com> wrote: > We use EMR instead of ECS but if that’s an option for your team, you can > configure auto scaling rules in your cloud formation so that your task/job > load dynamically controls cluster sizing. > > Sent from my iPhone > > > On Nov 8, 2019, at 1:40 AM, Navneeth Krishnan <reachnavnee...@gmail.com> > wrote: > > > > Hello All, > > > > I have a streaming job running in production which is processing over 2 > > billion events per day and it does some heavy processing on each event. > We > > have been facing some challenges in managing flink in production like > > scaling in and out, restarting the job with savepoint etc. Flink > provides a > > lot of features which seemed as an obvious choice at that time but now > with > > all the operational overhead we are thinking should we still use flink > for > > our stream processing requirements or choose kafka streams. > > > > We currently deploy flink on ECR. Bringing up a new cluster for another > > stream job is too expensive but on the flip side running it on the same > > cluster becomes difficult since there are no ways to say this job has to > be > > run on a dedicated server versus this can run on a shared instance. Also > > savepoint point, cancel and submit a new job results in some downtime. > The > > most critical part being there is no shared state among all tasks sort > of a > > global state. We sort of achieve this today using an external redis cache > > but that incurs cost as well. > > > > If we are moving to kafka streams, it makes our deployment life much > > easier, each new stream job will be a microservice that can scale > > independently. With global state it's much easier to share state without > > using external cache. But the disadvantage is we have to rely on the > > partitions for parallelism. Although this might initially sound easier, > > when we need to scale much higher this will become a bottleneck. > > > > Do you guys have any suggestions on this? We need to decide which way to > > move forward and any suggestions would be of much greater help. > > > > Thanks >