Hello, I did not find where to open a discussion about features, but after
working with Aurora for a while, here are a few ideas:
1. Non-predefined instance count.
Currently you need to specify the exact number of instances on which you
want to run your app. It would be great if I could specify only constraints
describing the servers I want to run on.
An example would be some clustered software (a long-running service), where
I want to add/remove nodes just by spinning up servers, without having to
update the job each time.
2. Relative resource allocation.
Right now you need to specify the exact amount of memory/cpu/... you want
to use. I'd like to specify a relative amount, e.g. "available - 0.5" for
cpu or "available - 1024M" for memory, and schedule the job only if the
remaining amount of the resource is greater than min_required_cpu/memory.

Both of these are mostly useful when I run services on dedicated servers.
While 1. is easy enough to implement externally, 2. is not possible in any
easy way. A rough sketch of what I have in mind for both follows below.
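
Purely as an illustration, a hypothetical .aurora fragment for 1. and 2.
could look something like this. Service and constraints exist in Aurora
today; RelativeResources, Available(), min_cpu/min_ram and the "no
instances" behaviour are made-up names, and cassandra_task stands for a
Task defined elsewhere (like the one in the example at the end):

# hypothetical .aurora fragment - NOT something Aurora supports today
jobs = [
  Service(
    cluster = 'devcluster',
    role = 'cassandra',
    environment = 'prod',
    name = 'cassandra',
    task = cassandra_task,
    # idea 1: no instances = N, run on every agent matching these attributes
    constraints = {'dedicated': 'cassandra'},
    # idea 2: relative resources with a minimum floor per instance
    resources = RelativeResources(
      cpu = Available() - 0.5,        # leave half a core for the OS
      ram = Available() - 1024*MB,    # leave 1 GB for the OS
      min_cpu = 2.0,                  # only schedule if at least this much is left
      min_ram = 4*GB,
    ),
  )
]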

3. When evaluating the .aurora file, inject some extra variables, such as
actual_cpu, actual_ram, and the array of all attributes of the node the job
was allocated to (in the context of 2.).
Right now I define variables inside the .aurora file (cpu = x, ram = y) and
pass those numbers down to Processes that compute configuration files from
them. If 2. were implemented, I could avoid hard-coding them and use the
real values of the allocated job instead, as shown in the sketch below.
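
To make 3. more concrete, this is roughly what I do today (the numbers are
just for illustration), with a comment on what I would like instead; the
injected names actual_cpu/actual_ram are of course made up:

# today: resource numbers are hard-coded at the top of the .aurora file ...
cpu  = 4        # cores
ram  = 8192     # MB
disk = 102400   # MB

# ... and have to be kept in sync by hand between the task resources and the
# processes that build the configuration files (see the wget example below)
resources = Resources(cpu = cpu, ram = ram * MB, disk = disk * MB)

# what I would like: variables injected by Aurora for the allocated instance,
# e.g. something like actual_cpu, actual_ram and the attribute list of the
# agent the instance landed on (hypothetical names)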

4. RUNNING and health states.
About the RUNNING state: I have a job with n tasks: download, download,
configure, wait, ..., configure, wait, ..., main program.
Right now it seems the job is RUNNING as soon as the first download starts.
Even if the application is still downloading or performing configuration
steps, it is already considered RUNNING. Sometimes it is hard to say
whether that will take 10 seconds or 5 minutes. Maybe we could be allowed
to say that the app is RUNNING only once it reaches the "main program"
task?
Health check: if we hit /health, we expect Ok.
Maybe we could add states like:
* pending_start - would indicate that the application is starting up and
running, but cannot yet be considered HEALTHY. I could also specify a
max_pending_time if I wanted.
Scenario: we are performing an update or restart procedure, and sometimes a
normal application start takes 10 seconds, sometimes 10 minutes. I want to
be able to move on to the next node as soon as possible. Right now we only
have initial_interval_secs to control when the real health checks start, so
to allow servers up to 10 minutes to start I guess I need to set
initial_interval_secs = 600 (see the sketch after this list).
* pending_quit - after /quitquitquit I'd like to have more time than the
hard-coded 5 seconds to perform a graceful shutdown and maybe
de-registration and other post-shutdown steps. We could also add a
max_pending_quit_time, just in case.
* pending_abort - same as pending_quit :)
Aurora would query every n seconds, up to the specified max_... time.
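
For reference, the workaround I mean looks roughly like this.
HealthCheckConfig and its fields are taken from the docs, the numbers are
just an example, and the pending_* ideas in the comment are the
hypothetical part:

# today: the only knob for a slow-starting service is to push the first
# health check far into the future, which also delays failure detection for
# instances that come up quickly
health_check = HealthCheckConfig(
  initial_interval_secs = 600,       # allow up to 10 minutes of startup
  interval_secs = 10,
  timeout_secs = 5,
  max_consecutive_failures = 3,
)

# what I would like: a pending_start state with max_pending_time (and similar
# budgets for pending_quit / pending_abort), during which Aurora polls
# /health every n seconds and moves on as soon as it gets Ok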


About a real example:
I was playing with Cassandra, Elasticsearch, Kafka, etc. and some in-house
clustered software. One way is to write your own scheduler for Mesos or to
use one from Mesosphere (like cassandra-mesos). But those have some
limitations, with promises to improve and get better in the future.

My approach was to use Aurora with the announcer and my own configuration
server. When a process starts up, I can see it in ZooKeeper (via the
announcer), and when the process requests its configuration files, the
configuration server receives cpu/mem/disk etc. information from the
install process, like:
downloadConfig = Process(
    name = 'downloadConfig',
    # quote the URL so the shell does not treat '&' as a background operator
    cmdline = 'wget -q -O cassandra/conf/cassandra.yaml '
              '"http://confserver:port/cassandra/`/bin/hostname`/'
              + str(cpu) + '/' + str(ram) + '/' + str(disk) + '/'
              + role + '/' + env + '/' + job_name
              + '/cassandra.yaml&version=' + conf_version + '"'
)
This takes the cassandra.yaml template and fills in some variables computed
from the resources; the seeds are taken from ZooKeeper under
/aurora/..../job_name.
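
For context, roughly how that Process fits into the rest of my job file
(simplified: the download/configure steps before it and the port setup for
the announcer are omitted, and the run process and cluster name are just
placeholders):

run = Process(
  name = 'run',
  cmdline = 'cassandra/bin/cassandra -f'
)

cassandra_task = Task(
  name = 'cassandra',
  processes = [downloadConfig, run],
  constraints = order(downloadConfig, run),
  resources = Resources(cpu = cpu, ram = ram * MB, disk = disk * MB)
)

jobs = [
  Service(
    cluster = 'devcluster',
    role = role,
    environment = env,
    name = job_name,
    task = cassandra_task,
    # the announcer is what makes each instance show up in zookeeper, so the
    # configuration server can build the seed list from there
    announce = Announcer(),
  )
]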

1, 2 and 3 would actually let me work better with dedicated servers:
add/remove nodes to/from clusters, help the configuration server compute
config files, and make things more dynamic. I would just spin up servers,
tag them properly (attributes), and the job is done.

4. would allow doing upgrades/restarts properly. If we take C*, startup can
take an unpredictable amount of time, and the shutdown procedure with all
the draining can take some time as well.

It is probably not the best way right now to run long-running permanent
clusters like C*, ES, etc., but when you have a lot of small clusters and
need to spin them up/down dynamically, then it starts to make sense.

p.s. Maybe some of what I wrote is already possible and I just couldn't
find it in the docs.

Thanks,
Haralds
