I started to answer these questions and then realized I was making an assumption about your environment. Do you have a reliable persistent file system such as HDFS or S3 at your disposal or do you truly mean to run on a single node?
If the you are truly thinking to run on a single node only there's no way to make this guaranteed to be reliable. You would be open to machine and disk failures, etc. I think the minimal reasonable production setup must use at least 3 physical nodes with the following services running: 1) HDFS or some other reliable filesystem (for persistent state storage) 2) Zookeeper for the Flink HA JobManager setup The rest is configuration.. With regard to scaling up after your initial deployment: right now in the latest Flink release (1.0.3) you cannot stop and restart a job with a different parallelism without losing your computed state. What this means is that if you know you will likely scale up and you don't want to lose that state you can provision many, many slots on the TaskManagers you do run, essentially over-provisioning them, and run your job now with the max parallelism you expect to need to scale to. This will all be much simpler to do in future Flink versions (though not in 1.1) but for now this would be a decent approach. In Flink versions after 1.1 Flink will be able to scale parallelism up and down while preserving all of the previously computed state. -Jamie On Fri, Jul 1, 2016 at 6:41 AM, Ryan Crumley <crum...@gmail.com> wrote: > Hi, > > I am evaluating flink for use in stateful streaming application. Some > information about the intended use: > > - Will run in a mesos cluster and deployed via marathon in a docker > container > - Initial throughput ~ 100 messages per second (from kafka) > - Will need to scale to 10x that soon after launch > - State will be much larger than memory available > > In order to quickly get this out the door I am considering postponing the > YARN / HA setup of a cluster with the idea that the current application can > easily fit within a single jvm and handle the throughput. Hopefully by the > time I need more scale flink support for mesos will be available and I can > use that to distribute the job to the cluster with minimal code rewrite. > > Questions: > 1. Is this a viable approach? Any pitfalls to be aware of? > > 2. What is the correct term for this deployment mode? Single node > standalone? Local? > > 3. Will the RocksDB state backend work in a single jvm mode? > > 4. When the single jvm process becomes unhealthy and is restarted by > marathon will flink recover appropriately or is failure recovery a function > of HA? > > 5. How would I migrate the RocksDB state once I move to HA mode? Is there > a straight forward path? > > Thanks for your time, > > Ryan > -- Jamie Grier data Artisans, Director of Applications Engineering @jamiegrier <https://twitter.com/jamiegrier> ja...@data-artisans.com