On 12/14/22 08:28, Frode Nordahl via discuss wrote:
> Hello,
> 
> When performing an online schema conversion for a clustered DB the
> `ovsdb-client` connects to the current leader of the cluster and
> requests it to convert the DB to a new schema.
> 
> The main thread of the leader ovsdb-server will then parse the new
> schema and copy the entire database into a new in-memory copy using
> the new schema. For a moderately sized database, let's say 650MB
> on-disk, this process can take north of 24 seconds on a modern
> adequately performant system.
> 
> While this is happening the ovsdb-server process will not process any
> raft election events or inactivity probes, so by the time the
> conversion is done and the now past leader wants to write the
> converted database to the cluster, its connection to the cluster is
> dead.
> 
> The past leader will keep repeating this process indefinitely, until
> the client requesting the conversion disconnects. No message is passed
> to the client.
> 
> Meanwhile the other nodes in the cluster have moved on with a new leader.
> 
> A workaround for this scenario would be to increase the election timer
> to a value great enough so that the conversion can succeed within an
> election window.
> 
> I don't view this as a permanent solution though, as it would be
> unfair to leave the end user with guessing the correct election timer
> in order for their upgrades to succeed.
> 
> Maybe we need to hand off conversion to a thread and make the main
> loop only process raft requests until it is done, similar to the
> recent addition of preparing snapshot JSON in a separate thread [0].
> 
> Any other thoughts or ideas?
> 
> 0: 
> https://github.com/openvswitch/ovs/commit/3cd2cbd684e023682d04dd11d2640b53e4725790
> 

Hi, Frode.  Thanks for starting this conversation.

First of all, I'd still respectfully disagree that 650 MB is a
moderately sized database. :)  ovsdb-server on its own doesn't limit
how much data users can put in, but that doesn't mean there is no
point at which handling the database becomes difficult or even
impossible.  From my experience, 650 MB is far beyond the threshold
for smooth operation.

Allowing the database to grow to such a size might be considered a
user error, or a CMS error.  In any case, setups should be tested
at the desired (or at least simulated) scale, including upgrades,
before deploying in a production environment, so as not to run into
such issues unexpectedly.

Another way out of this situation, besides bumping the election
timer, might be to pin ovn-controllers, destroy the database (maybe
keeping port bindings, etc.) and let northd re-create it after the
conversion.  Not sure if that will actually work though, as I
haven't tried it.


As for threads, I'll reiterate my view that throwing more cores at
the problem is the absolute last thing we should do, and only if
there is no other choice.  Simply because many parts of
ovsdb-server were never optimized for performance, and there are
likely many things we can improve without blindly using more
resources and increasing the code complexity by adding threads.


Speaking of suboptimal code, the conversion process seems very
inefficient.  Let's deconstruct it.  (I'll skip the standalone
case and focus on the clustered mode.)

There are a few main steps:

1. ovsdb_convert() - Creates a copy of the database, converting
   each column along the way, and checks the constraints.

2. ovsdb_to_txn_json() - Converts the new database into a
   transaction JSON object.

3. ovsdb_txn_propose_schema_change() - Writes the new schema
   and the transaction JSON to the storage (RAFT).

4. ovsdb_destroy() - The copy of the database is destroyed.

   -----

5. read_db()/parse_txn() - Reads the new schema and the
   transaction JSON from the storage, replaces the current
   database with an empty one and replays the transaction
   that creates a new converted database.

There is always a storage run between steps 4 and 5, so we
generally only need steps 1-4 and step 5 to separately stay below
the election timer threshold.
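
Just to make the step boundaries explicit, here is a toy model of
that flow.  All names and types below are invented for illustration
(toy_db, toy_convert(), and so on); they only stand in for the real
ovsdb_convert(), ovsdb_to_txn_json(), ovsdb_txn_propose_schema_change(),
ovsdb_destroy() and the read_db()/parse_txn() path, and do none of
the actual work.

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

struct toy_db     { size_t n_rows; };
struct toy_schema { int version; };

/* Step 1: deep copy + per-column conversion + transaction checks. */
static struct toy_db *
toy_convert(const struct toy_db *src, const struct toy_schema *schema)
{
    (void) schema;                    /* toy: schema is not used */
    struct toy_db *copy = malloc(sizeof *copy);
    *copy = *src;                     /* the expensive part in real life */
    return copy;
}

/* Step 2: serialize the converted copy as one big transaction. */
static char *
toy_to_txn_json(const struct toy_db *db)
{
    char *txn = malloc(64);
    snprintf(txn, 64, "{\"rows\": %zu}", db->n_rows);
    return txn;
}

/* Step 3: append the schema and the transaction to the RAFT log. */
static bool
toy_propose_schema_change(const struct toy_schema *schema, const char *txn)
{
    printf("append to log: schema v%d, txn %s\n", schema->version, txn);
    return true;
}

int
main(void)
{
    struct toy_db db = { .n_rows = 1000000 };
    struct toy_schema new_schema = { .version = 2 };

    /* Steps 1-4 run on the leader back to back, without returning
     * to the main loop; that is why a long step 1 starves RAFT. */
    struct toy_db *copy = toy_convert(&db, &new_schema);    /* 1 */
    char *txn = toy_to_txn_json(copy);                      /* 2 */
    toy_propose_schema_change(&new_schema, txn);            /* 3 */
    free(txn);
    free(copy);                                             /* 4 */

    /* Step 5 happens later, on every server, when the entry is
     * read back from the storage and the transaction is replayed. */
    return 0;
}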


Now let's look closer at step 1, which is the most time-consuming
one.  It has two stages: data conversion and the transaction
check.  The data conversion part makes sure that we're creating
all the rows in the new database with all the new columns and
without the removed ones.  It also makes sure that all the datum
objects are converted from the old column type to the new column
type by calling ovsdb_datum_convert() for every one of them.

Datum conversion is a very heavy operation, because it involves
converting the datum to JSON and back.  However, in the vast
majority of cases column types do not change at all, and even when
they do, it only happens for a few columns, not for all of them.

This part can be optimized by not converting columns whose type
didn't change, and just creating a shallow copy of the datum with
ovsdb_datum_clone() instead.
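
To illustrate the idea, here is a minimal, self-contained sketch of
that fast path.  The types and helpers (toy_type, toy_datum_convert()
and friends) are made up for the example; in the real code the
comparison would be between the column types of the two schemas and
the fast path would be a shallow clone of the datum instead of the
JSON round trip.

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

struct toy_type  { char name[16]; };          /* stand-in for a column type */
struct toy_datum { struct toy_type type; long long value; };

static bool
toy_type_equals(const struct toy_type *a, const struct toy_type *b)
{
    return !strcmp(a->name, b->name);
}

/* Slow path: emulates the "convert to JSON and parse it back" round
 * trip.  No real type conversion happens in this toy; it only mimics
 * the shape and the cost of the operation. */
static struct toy_datum
toy_datum_convert(const struct toy_datum *src, const struct toy_type *dst_type)
{
    char json[64];
    snprintf(json, sizeof json, "%lld", src->value);     /* to "JSON"   */
    struct toy_datum dst = { .type = *dst_type };
    sscanf(json, "%lld", &dst.value);                    /* from "JSON" */
    return dst;
}

/* Fast path: same type, so a plain (shallow) copy is enough. */
static struct toy_datum
toy_datum_clone(const struct toy_datum *src)
{
    return *src;
}

static struct toy_datum
convert_column(const struct toy_datum *src, const struct toy_type *dst_type)
{
    return toy_type_equals(&src->type, dst_type)
           ? toy_datum_clone(src)               /* most columns, cheap   */
           : toy_datum_convert(src, dst_type);  /* only changed columns  */
}

int
main(void)
{
    struct toy_datum d = { .type = { "integer" }, .value = 42 };
    struct toy_type same = { "integer" }, changed = { "string" };

    printf("%lld\n", convert_column(&d, &same).value);     /* fast path */
    printf("%lld\n", convert_column(&d, &changed).value);  /* slow path */
    return 0;
}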

In my preliminary testing this saves about 70% of the time spent
in ovsdb_convert() and reduces the overall conversion time by
about 30%.  Creation of a shallow copy also speeds up the database
destruction at step 4 and saves some memory.

I'll post the patches I have for this a bit later, after cleaning
them up.

The next part of ovsdb_convert() is ovsdb_txn_replay_commit(),
which we can't really get rid of, because we have to perform all
the transaction checks, including the referential integrity checks
that take up most of the time.  We might look at further
optimizations for the weak reference re-assessment process in the
future though, as it seems to be the heaviest part.



Now to steps 2 and 3.
Creating the transaction JSON and writing it to the storage seems
to be completely redundant.  I understand that it was probably done
this way because conversion to JSON and back was cheaper than
ovsdb_convert().  However, with the change to step 1, that will no
longer be the case.

Database conversion is a deterministic process.  All we need to
perform it is the database content, which is already in the
storage, and the new schema.  So, we should be able to drop step 2
and write only the new schema to the storage in step 3.  In step 5
we'll need to read the schema and call ovsdb_convert() again, but
with the above optimizations that should not be heavier than
parsing the transaction from JSON and replaying it.  RAFT will
ensure that we're converting the same data as in the first
ovsdb_convert().
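
Sketched as a toy again (reusing the invented toy_db/toy_convert()
stand-ins from the earlier sketch, not the real API), the proposed
flow would look roughly like this: the leader still converts locally
to validate the constraints, but appends only the schema to the log,
and every server re-runs the same deterministic conversion when it
reads that entry back.

#include <stdio.h>
#include <stdlib.h>

struct toy_db     { size_t n_rows; };
struct toy_schema { int version; };

/* Same deterministic conversion as in step 1 (stubbed). */
static struct toy_db *
toy_convert(const struct toy_db *src, const struct toy_schema *schema)
{
    (void) schema;                   /* toy: schema is not used */
    struct toy_db *copy = malloc(sizeof *copy);
    *copy = *src;
    return copy;
}

/* Leader: validate the conversion locally, then append only the schema. */
static void
leader_propose(const struct toy_db *db, const struct toy_schema *new_schema)
{
    struct toy_db *check = toy_convert(db, new_schema); /* constraint check */
    free(check);                                        /* result discarded */
    printf("append to log: schema v%d (no data)\n", new_schema->version);
}

/* Every server, on reading that log entry: convert its own copy. */
static struct toy_db *
apply_conversion(const struct toy_db *db, const struct toy_schema *new_schema)
{
    return toy_convert(db, new_schema);   /* same input, same result */
}

int
main(void)
{
    struct toy_db db = { .n_rows = 1000000 };
    struct toy_schema v2 = { .version = 2 };

    leader_propose(&db, &v2);                      /* steps 1 + 3, no step 2 */
    struct toy_db *converted = apply_conversion(&db, &v2);       /* step 5  */
    printf("converted db has %zu rows\n", converted->n_rows);
    free(converted);
    return 0;
}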

Writing only the new schema to the storage may mean a
backward-incompatible database file format change, but that should
not be very hard to handle, taking into account that it only
happens during conversion and the next compaction will get rid of
the incompatibility.

I'll sketch up some patches for this as well.  I haven't tried it
yet, so I don't have any performance data.


All in all, the conversion process will be condensed to just two
calls to the optimized ovsdb_convert() and two calls to
ovsdb_destroy(): one before writing to the storage and one after
reading from it.  This should be enough to cover most of the use
cases, I hope.

Thoughts?

Best regards, Ilya Maximets.