OK, have just gotten off a chat with the folks at summit.

I am glad that I've managed to get my concerns about this approach out there. For people reading my notes here, I've gotten the answer to my question about how database access code is written for a system that is moving from some particular schema structure A to a new one B.

Essentially, suppose we've released "L", and we are now in development for "M". Over the course of M, we are adding new objects, e.g. tables and columns; let's call these M1, M2, M3. These objects are meant to replace older objects in L, say L1, L2, L3.

As M is being developed, the model and database access layer must at all times handle both L1, L2, L3 and M1, M2, M3 at the same time. This means the notion of a schema migration as something you commit, after which you can cleanly change your model, is gone. The model needs to be able to load data from the L1/L2/L3 objects and ensure that it gets copied to M1, M2, M3, either as the APIs are accessed under normal use, or via a "background process" that moves data over from L to M. Only when the data is fully moved to M, and when *all database-connected applications* have been moved up as well, can the "contract" phase be run. The "contract" phase will then drop every object in the database that is not in the currently running model, including any additional objects that were added by the operator.
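
To illustrate, here is a minimal, hypothetical sketch of what that kind of bridging access code tends to look like; the Instance model, column names, and lookup function are made up for illustration and are not Nova's actual schema:

# Hypothetical sketch only: an "M"-era model that still carries an "L"-era
# column, with access code that copies data forward as rows are touched.
from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Instance(Base):
    __tablename__ = "instances"

    id = Column(Integer, primary_key=True)
    # old column from release "L", not yet dropped (an "L1"-style object)
    flavor_name = Column(String(255), nullable=True)
    # new column added in release "M" (an "M1"-style object)
    flavor_id = Column(Integer, nullable=True)


def get_flavor_id(session, instance, lookup_flavor_id):
    """Read the new column, falling back to the old one and copying the
    value forward as a side effect of normal API access."""
    if instance.flavor_id is None and instance.flavor_name is not None:
        instance.flavor_id = lookup_flavor_id(instance.flavor_name)
        session.add(instance)  # the data migration rides along with the request
    return instance.flavor_id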

Now, the approach of having a model that can bridge the gap between two schemas, in order to delay the full migration of changes, is in fact very common in the real world of database applications, and it is often a necessary technique.

What is dramatically different in Nova's case is that something which is normally just a particularly tedious tool one can choose to use in specific situations, namely the model that must bridge two different schema designs and slowly migrate data, now becomes an absolute hard requirement in all cases. It is no longer considered tenable for developers to decide on a case-by-case basis which kinds of migrations are trivial and can safely be run during an "expand"-type phase, versus those that are data- and lock-intensive if done in bulk and therefore should be carefully rolled out over time at low scale.

Let me be clear that one of the big things I want to work on is cleaning up the model and database access code I see in Nova and many other Openstack applications. Right now it's complicated, slow, and riddled with evidence that people didn't always have a firm grasp of the APIs when they wrote it. But what we are talking about is creating a hard link between the complexity of the model/DB access code and the ability to make necessary changes and improvements to the schema. It means that every schema change now injects verbosity and complexity directly into the model and database access logic, rather than living in a self-contained, write-once-and-forget-it database migration script elsewhere; data migrations and business-model access code are to be literally merged together, in many cases most likely into the same functions. This is definitely going to make my job of cleaning up, simplifying, and vastly improving the performance of this logic that much more difficult. This is the squeeze point within the whole approach, and it is also the one on which the Nova team could offer the least specifics. While simple things like column transitions shouldn't be too terrible, more significant changes like table moves or restructurings will be very difficult; and while this might be fine for Nova, it definitely is not appropriate for less mature Openstack projects that are just starting out with new schema designs and will have a greater need for periodic refactorings.

The rationale for this hard-edged decision is that all-at-once data migrations are slow and place an enormous load on the database, and therefore must be banned in all cases, no matter how trivial. An anecdotal reference to some obviously serious outage that occurred during a Nova migration was cited as evidence.

I'm generally not in favor of this approach to a problem. The driving philosophy of SQLAlchemy and related tools is one of developer empowerment, not of shuttling database details away behind one-size-fits-all abstractions that keep developers as far from the pointy and sharp edges as possible; the edges aren't as sharp as you remember, and a good developer is more deft with tools than you think. This philosophy is one I developed over many years working at companies and watching how various forms of technical anxiety led to all kinds of obtuse, awkward, and sometimes outright byzantine ways of operating, all because something a long time ago failed to work as expected and was therefore banned forever. It was usually my job to extricate teams from these ways of thinking and re-acquaint them with more flexible and fluent approaches, while assuaging their anxiety by showing that we can in fact use our brains to solve problems correctly as they come up, rather than relying on iron-bound constraints that are extremely difficult to modify from a technical perspective.

My proposal to Nova is not that they shouldn't go with this approach, but only that they proceed with a version of the idea that has an escape hatch, and at the same time that we make clear to other projects that this approach is a very specific road to travel and it should not be assumed to be appropriate for everyone. If Nova goes full on with online schema migrations, it means there will no longer be any fixed schema migration files, and no way that even the most trivial data migration can be implemented without going through the new system of building out a model and database access layer that talks to both logical schemas and has to migrate its own data over time. If OTOH they implement the exact same workflow, such that the migrations are still generated into files that represent discrete and fixed states of a schema, they will be able to maintain that approach to a varying degree, as they are ultimately exercising the new workflow on top of a traditional system which can still allow for tuning and version control of schema changes as well as inline data migrations where appropriate. As a bonus, the system works in a fixed way and won't delete the objects planted by the operator; it also allows for a traditional dependency model that will ensure that certain moves always happen before others, such as ensuring a "contract" against the previous version is completed before the next version's "expand" proceeds, thus allowing the database to remain in a clean and defined state. If I understood correctly, the current plan is that "contract" is an optional thing that perhaps some operators might never do at all; they'd just have a database which has old tables and columns from many versions ago still lying around.
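
To illustrate what I mean by fixed files, the generated output could be something like the following pair of ordinary Alembic revision files. The revision identifiers, table, and column names here are made up; the point is only that the dependency chain pins the previous release's "contract" ahead of the next release's "expand":

# --- hypothetical generated file: versions/m_expand_01_expand_for_m.py ---
"""expand for release M: add instances.flavor_id (illustrative only)"""
from alembic import op
import sqlalchemy as sa

revision = "m_expand_01"
down_revision = "l_contract_01"   # previous release's contract must already be applied

def upgrade():
    op.add_column("instances", sa.Column("flavor_id", sa.Integer(), nullable=True))


# --- hypothetical generated file: versions/m_contract_01_contract_for_m.py ---
"""contract for release M: drop the old instances.flavor_name (illustrative only)"""
from alembic import op

revision = "m_contract_01"
down_revision = "m_expand_01"     # intended to run only once expand and the data copy are complete

def upgrade():
    op.drop_column("instances", "flavor_name")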

The one objection raised to my alternative proposal is based on the notion that a certain kind of database "move" might apply in one way to a particular target database and in a different way to another. In the general case, this notion doesn't hold a lot of validity, because the system is emitting Alembic directives in any case, and those directives are themselves database-agnostic; I only propose that we render the directives into a fixed file first. The specific concept that was raised, however, regards a schema operation that in one case wants to be done in the "expand" phase and in another wants to be done in the "migrate" phase. When I asked for an example, what was raised was the issue of certain indexes that behave differently on different MySQL versions: an index addition that would be a performance blocker during the "migrate" phase on an older MySQL version should be blocked from the "expand" phase, but might be safer to run within "expand" for later versions of MySQL.

But again, this is not a very difficult issue to overcome. The current online schema migration code already has within it a ruleset that can accommodate such a rule; we simply move that rule to be within the migration directive itself, so that in the expand phase, instead of "op.create_index(indexname)", we have some kind of qualifier such as "op.create_index(indexname, create_rule=requires_high_rowcount)", or similar. Again, this is not manually coded in a migration; it is rendered out by the autogenerate facilities, which would be utilized by the online schema engine that has already been built. The original online schema blueprint referred to the advantage of working with "declarative" structures vs. "imperative" structures, and I certainly agree; that's why Alembic's directives are themselves declarative, and why these new rules, as embedded, would be very high-level and declarative themselves. I doubt very many of these new directives will be needed, and they will be simple to implement in any case.
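
For concreteness, here is a small sketch of both forms. The create_rule keyword and requires_high_rowcount rule are hypothetical (Alembic's real signature is op.create_index(index_name, table_name, columns)), the index and table names are made up, and the hand-coded version below is only how the same decision could be expressed in a fixed migration file today:

# Proposed declarative form (hypothetical; create_rule is not a real Alembic
# keyword, and requires_high_rowcount would be supplied by the online-schema
# ruleset):
#
#     op.create_index("ix_instances_uuid", "instances", ["uuid"],
#                     create_rule=requires_high_rowcount)
#
# The same decision hand-coded in a fixed migration file today:
from alembic import op

def upgrade():
    bind = op.get_bind()
    # MySQL 5.6+ can add secondary indexes online; earlier versions block
    online_ddl = (
        bind.dialect.name == "mysql"
        and bind.dialect.server_version_info >= (5, 6)
    )
    if online_ddl:
        op.create_index("ix_instances_uuid", "instances", ["uuid"])
    # otherwise, leave the index to the online "migrate" phase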

Alembic also supports alternative migration flows for tables, such as "copy and move", which can be evaluated as options in some cases. In any case, a system where we can manually establish that particular migrations work in certain ways is more flexible than one where a ruleset has to know ahead of time how to detect and adapt to certain tables and conditions on the fly, with no pre-defined direction. Without any place to establish migration behaviors declaratively other than the model itself, I can imagine that we would ultimately have to start adding "hints" to our models, like "use_migration_style_X_on_mysql" on mapped classes, so that the online schema system has clues as to what we want to happen in certain cases. That version of the system would also be tasked with making guesses in some cases; after all, a declaration in the model itself doesn't actually know what it's being migrated *from*, since online schema changes start from a schema that is essentially in an undefined state. It would be better if migration directives still had a dedicated place of their own where they can be explicitly laid out, version-controlled, built against a specific and known previous state, and tuned and configured as needed, without mixing them into the object model's declarations.
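
For what it's worth, the "copy and move" flow I'm referring to is what Alembic's batch operations do, and the choice is made explicitly inside a particular migration file rather than inferred from the model. A rough example, with table and column names that are again purely illustrative:

# "Copy and move" via Alembic batch operations, chosen explicitly in one
# migration file.  recreate="always" forces the copy-and-move strategy even
# on databases that could ALTER in place.
from alembic import op
import sqlalchemy as sa

def upgrade():
    with op.batch_alter_table("instances", recreate="always") as batch_op:
        batch_op.alter_column(
            "flavor_name",
            existing_type=sa.String(255),
            type_=sa.String(64),
        )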

Having the expand/contract workflow available in a way that is compatible with traditional migration files means that this becomes something Alembic can continue to add support for, and could even become a feature within Alembic itself. Right now, the "autogenerate" feature has a pretty straightforward job: gather a list of changes and spit out a single migration file. It would be a great idea to open up this API so that different kinds of workflows can be plugged in, such that rulesets can interpret the autogenerate change stream into different revision stream structures. We'd get all the capabilities of the expand/contract workflow without being rigidly welded to it, and as an Alembic feature it would mean the production of new kinds of workflows would be available through an upstream declarative system with the benefit of real-world use by teams outside of Openstack.
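
To sketch what "plugging in" could look like, assume a hook that hands the autogenerate change stream to user code before rendering (something with the shape of Alembic's process_revision_directives hook). A ruleset that keeps a generated revision "expand"-only might look roughly like this; the partitioning rule is deliberately simplistic and purely illustrative, and a fuller implementation would emit the stripped-out drops as a separate "contract" revision:

# Rough sketch of a pluggable autogenerate workflow: strip destructive
# operations from the generated revision so it remains "expand"-only.
from alembic.operations import ops

CONTRACT_OPS = (ops.DropTableOp, ops.DropColumnOp, ops.DropIndexOp)

def expand_only(context, revision, directives):
    script = directives[0]
    for table_ops in list(script.upgrade_ops.ops):
        if isinstance(table_ops, ops.ModifyTableOps):
            # per-table container: filter out column and index drops
            table_ops.ops = [
                op_ for op_ in table_ops.ops
                if not isinstance(op_, CONTRACT_OPS)
            ]
        elif isinstance(table_ops, CONTRACT_OPS):
            # top-level table drops are removed entirely
            script.upgrade_ops.ops.remove(table_ops)

# wired up in env.py:
#   context.configure(..., process_revision_directives=expand_only)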

The impact on other projects, not just other Openstack projects but also Alembic itself, is really why I'm motivated to comment on this system. It's not that it's so important to me whether Nova has a certain process or not (though I do want to clean up their database access code). It's more that Nova is always looked upon as the driver for how all other Openstack applications do things; what they do is what we will all be doing soon enough. Just look at the subject of this email thread; it's not about Nova at all, it's about the Heat project, which is eagerly looking to copy Nova's approach and wondering if they should just do that instead of migrating to Alembic traditionally. This is why I really want to get my reservations out there. While I do want all projects to be on similar database approaches that ultimately derive from the oslo.* namespace, I'd hope that this one can be opened up a bit before it is taken on by everyone else. The Nova team seemed to hear me on this, and they'd like to encourage other projects to wait on moving to this approach until it can be proven. But they also agreed that yeah, everyone likes to copy Nova a lot. The proof, they say, will be if this approach fails completely and they decide it isn't working. I don't think that will actually happen. Patterns like these just as often drag development down in a slower and more subtle way, and over time contribute towards that calcified "don't change it!" culture that takes months or years to develop. For reference, google "Frog in the Water". It's a known thing. :)

Allowing certain patterns while always providing for a flexible "escape hatch" to work at different levels simultaneously, combined with a strong emphasis on explicitness, has always been the driving philosophy of SQLAlchemy. That's why there's a Core and an ORM which are separate but highly interactive together. It's an approach that works and I'd like to continue to encourage Openstack projects to subscribe to this philosophy.


