Hello, I did not find where to open a discussion about features, but after working with Aurora here are a few ideas:

1. No pre-defined instance count. Currently you must specify the exact number of instances on which to run your job. It would be great if I could specify only constraints describing which servers I want to run on. An example is clustered software (a long-running service) where I want to add/remove nodes simply by spinning up servers, without updating the job each time.

2. Relative resource allocation. Currently you must specify the exact amount of memory/CPU/... you want to use. I would like to specify a relative amount, e.g. "available - 0.5" for CPU or "available - 1024M" for memory, and have the job scheduled only if the remaining amount of the resource is greater than min_required_cpu/memory.
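To make the rule in 2. concrete, here is a minimal sketch of the "available minus reserve" check. None of these names exist in Aurora today; relative_allocation and its parameters are made up purely to illustrate the proposal:

```python
# Hypothetical sketch of the "relative resource allocation" rule from idea 2.
# All names here are invented for illustration; this is not Aurora syntax.

def relative_allocation(available, reserve, min_required):
    """Return the amount to allocate ("available - reserve"), or None if the
    remainder falls below the minimum the job needs (host not eligible)."""
    remaining = available - reserve
    return remaining if remaining >= min_required else None

# CPU: ask for "available - 0.5" cores, require at least 1 core remaining.
cpu = relative_allocation(available=8.0, reserve=0.5, min_required=1.0)      # 7.5

# RAM: ask for "available - 1024M", require at least 2048M remaining.
ram = relative_allocation(available=16384, reserve=1024, min_required=2048)  # 15360

# A host too small for the job is simply skipped by the scheduler:
too_small = relative_allocation(available=1024, reserve=1024, min_required=256)  # None
```

The point is that the same job definition would then size itself to whatever dedicated server it lands on, instead of baking exact numbers into the .aurora file.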
Both of these are useful when I run services on dedicated servers. While 1. is easy to implement externally, 2. is not easily possible today.

3. When evaluating the .aurora file, inject some extra variables such as actual_cpu, actual_ram, and an array of all attributes of the node the job was allocated to (in the context of 2.). Today I define variables inside the .aurora file (cpu = x, ram = y) and pass those numbers down to a Process to compute configuration files from them. If 2. were implemented, I could avoid hard-coding them and use the real values of the allocated job.

4. RUNNING and healthy states. I have a job with n tasks: download, download, configure, wait, ..., configure, wait, ..., main program. Currently it seems the RUNNING state begins as soon as the first download starts. Even while the application is still downloading or performing configuration tasks, it is already considered RUNNING, and it is hard to say in advance whether that will take 10 seconds or 5 minutes. Maybe we could declare the app RUNNING only when it reaches the "main program" task?

Health checks: when we hit /health we expect OK. Maybe we could add states like:

* pending_start - would indicate that the application is starting up and running, but cannot yet be considered HEALTHY, with an optional max_pending_time. Scenario: we are performing an update or restart, and a normal application start sometimes takes 10 seconds, sometimes 10 minutes. I want to move on to the next node as soon as possible. Today only initial_interval_secs controls when the real health check starts, so to allow servers up to 10 minutes of startup I would have to set initial_interval_secs = 600.

* pending_quit - after /quitquitquit I would like more time than the hard-coded 5 seconds to perform a graceful shutdown, and maybe de-registration and other post-shutdown steps. We could also add max_pending_quit_time, just in case.

* pending_abort - same as pending_quit :)

Aurora would query every n seconds, up to the specified max_... time.
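The "query every n seconds up to the max time" behaviour proposed for pending_quit could look roughly like this; wait_for_shutdown and its parameters are invented names for illustration, not anything Aurora provides:

```python
import time

# Sketch of the proposed pending_quit behaviour: after /quitquitquit, instead
# of a hard-coded 5-second grace period, poll until the process reports it is
# down or max_pending_quit_time elapses. All names here are hypothetical.

def wait_for_shutdown(is_down, interval_secs=1.0, max_pending_quit_time=600.0):
    """Poll is_down() every interval_secs; return True if the process shut
    down within max_pending_quit_time, False if we must escalate (SIGKILL)."""
    deadline = time.monotonic() + max_pending_quit_time
    while time.monotonic() < deadline:
        if is_down():
            return True
        time.sleep(interval_secs)
    return False

# A process that needs three polls before it finishes draining:
polls = iter([False, False, True])
assert wait_for_shutdown(lambda: next(polls), interval_secs=0.01) is True
```

The same loop, with different limits, would cover pending_start (poll /health until HEALTHY or max_pending_time) and pending_abort.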
A real example: I was playing with Cassandra, Elasticsearch, Kafka, etc., plus some in-house clustered software. One way is to write a scheduler for Mesos, or to use one from Mesosphere (like cassandra-mesos), but those have some limitations, with promises to improve in the future. My approach was to use Aurora with the announcer and my own configuration server. When a process starts up I can see it in ZooKeeper (via the announcer), and when the process requests its configuration files, the configuration server receives cpu/mem/disk etc. information from the install process, like:

    downloadConfig = Process(
      name = 'downloadConfig',
      cmdline = 'wget -q -O cassandra/conf/cassandra.yaml '
                '"http://confserver:port/cassandra/`/bin/hostname`/'
                + str(cpu) + '/' + str(ram) + '/' + str(disk)
                + '/' + role + '/' + env + '/' + job_name
                + '/cassandra.yaml&version=' + conf_version + '"'
    )

(The URL is quoted so the shell does not interpret the '&'.) This takes a cassandra.yaml template, fills in variables computed from the resources, and takes the seeds from ZooKeeper under /aurora/..../job_name.

Ideas 1-3 would make it possible to work better with dedicated servers: add/remove nodes from clusters, help the configuration server compute config files, and make things more dynamic. I just spin up servers, tag them properly (attributes), and the job is done. Idea 4 would allow upgrades/restarts to be done properly. Taking C* as an example, startup can take an unpredictable amount of time, and the shutdown procedure with all the draining can take some time as well.

This is probably not the best way today to run long-running permanent clusters like C* or ES, but when you have a lot of small clusters and need to spin them up/down dynamically, it starts making sense.

p.s. Maybe some of what I wrote is already possible and I just couldn't find it in the docs.

Thanks,
Haralds
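For reference, the URL that downloadConfig requests could be assembled in one place like this; build_config_url is my own helper name, the path layout just mirrors the example above, and I use a proper '?version=' query separator where the example shows '&version=':

```python
# Helper mirroring the URL layout used by the downloadConfig Process above.
# build_config_url is a made-up name; the path segments come from the example.

def build_config_url(host, cpu, ram, disk, role, env, job_name, conf_version,
                     confserver='confserver:port'):
    return ('http://%s/cassandra/%s/%s/%s/%s/%s/%s/%s/cassandra.yaml?version=%s'
            % (confserver, host, cpu, ram, disk, role, env, job_name, conf_version))

url = build_config_url('node-1', 4, 8192, 100, 'cassandra', 'prod', 'ring1', 7)
# -> http://confserver:port/cassandra/node-1/4/8192/100/cassandra/prod/ring1/cassandra.yaml?version=7
```

With idea 3 in place, the cpu/ram/disk arguments here would come from injected actual_* values instead of hard-coded variables in the .aurora file.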