Hi all,

Rolling updates of services is a crucial feature in Aurora. As such, we want to take great care when changing its behavior. Today, Aurora operates by delegating this functionality to the client (or any API client, for that matter). While this has provided a nice abstraction, it turns out there are some shortcomings with this approach:
1. Visibility: since the scheduler does not know about updates, it cannot display useful information about an in-progress update
2. Visibility: for two users to diagnose a failed update, they must be at the same terminal, or copy/paste terminal output
3. Usability: the scheduler has no means to show information about how an application's packages or configuration changed over time
4. Usability: update orchestration in the client means a lost connection to the scheduler halts an update

Some of the above issues can be addressed by moving update orchestration to a service external to the scheduler. At first glance, this approach is attractive, as there is a firm separation of concerns. However, there are a few pitfalls with this approach:

1. Usability: setup and maintenance of an Aurora cluster becomes even more complicated (additional service + storage system)
2. Usability: the user interface becomes more complicated to stitch together, as end-users really should only have to visit one website to view job information
3. Complexity: implementing a new production-ready service from scratch will take a non-trivial amount of time

With these issues in mind, I propose that the scheduler take over the responsibility of application update orchestration. This will allow us to solve the current design shortcomings, without the pitfalls of the separate service approach.

I'm interested in thoughts others have on this. Does the reasoning seem sound? Are there things I'm missing?

-=Bill