Hi Kristian, Kristian Nielsen <kniel...@knielsen-hq.org> writes: > > Marko Mäkelä <marko.mak...@mariadb.com> writes: > > > Could you please take a look at this InnoDB regression in MariaDB 10.2 > > and later: > > https://jira.mariadb.org/browse/MDEV-15740 > > > > Because of this bug, we can no longer write tests that would ensure > > that certain changes are present in the redo log, before killing and > > But what is the bug here? This is a Galera-only issue, right?
I was under the impression that it affects normal InnoDB too. But indeed, it looks like you could be right that it is only affecting Galera. I am sorry for blaming you; I think that the test innodb.innodb_force_recovery proves that the redo log will be flushed correctly in 10.2 as well as 10.4. I also tested with a variant that relies on the commit of DML (INSERT) instead of DDL (DROP TABLE) for flushing the redo log. I have now reassigned MDEV-15740 as a Galera bug. > It is not a bug that commit does not flush the log to disk when > trx->active_commit_ordered, it's an important performance optimisation. The > transaction is already durable at this point (it was synced in prepare, and > will be recovered by the transaction coordinator during crash recovery, eg. > from the binlog). > > It's the same in MySQL, just using a different mechanism > (thd_requested_durability(trx->mysql_thd) == HA_IGNORE_DURABILITY)). Thanks, this makes sense now. > Is the problem just about getting the test cases to work? That's easy to do, > for example with a DBUG_EXECUTE_IF() to add a log flush to disk at some > appropriate place. I prefer to have tests that run on non-debug binaries, and innodb_flush_log_at_trx_commit=1 should trigger a redo log write (which will also persist any incomplete transactions from concurrent sessions). For those cases where we need debug instrumentation, I prefer DEBUG_SYNC. For example, if I have to kill the server at a specific point, I prefer to do it externally, like this: connection other; SET DEBUG_SYNC = 'some_point SIGNAL hung WAIT_FOR ever'; … connection default; SET DEBUG_SYNC = 'now WAIT_FOR hung'; --source include/kill_mysqld.inc disconnect other; --source include/start_mysqld.inc In this way, the test will not cause any memory leak errors in Valgrind or ASAN. > Or is the problem that Galera somehow disables crash recovery from the > binlog, and requires an fsync during every commit to be crash safe, not just > for test-purposes? In that case, isn't the problem the same in MySQL? Then > it sounds like the problem is that Galera in MySQL does something to avoid > HA_IGNORE_DURABILITY being set, and this "something" was not merged > correctly to MariaDB to work within the commit_ordered() framework? > > Galera needs to either respect the commit_ordered() semantics (see eg. > comments in sql/handler.h). Or it needs to take over the transaction > coordinator role and control commit_ordered() itself. The latter is the > correct approach (but is not done currently as far as I am aware). > > In general, Galera has never been properly integrated with the transaction > coordinator / commit_ordered() framework in MariaDB, and thus we keep > getting issues like this popping up. But without knowing how Galera intends > this to work, it's hard to say what the real problem is... I have not participated in Galera development, so I cannot say either. > > - || thd_requested_durability(trx->mysql_thd) > > - == HA_IGNORE_DURABILITY) { > > + || (srv_flush_log_at_trx_commit == 1 && > > trx->active_commit_ordered)) { > > > > This looks particularly strange because changing > > innodb_flush_log_at_trx_commit from 1 to 2 "fixes" the bug due to the > > condition srv_flush_log_at_trx_commit == 1 not being true anymore. But 1 > > is supposed to be more durable and slow compared to 2. Now, with the > > current code 2 is more durable! > > It could be that this could instead use srv_flush_log_at_trx_commit>=1 to > also avoid an unnecessary log flush (without fsync) to disk. I didn't check > now, as I think this is not the question at hand here? Yes, it could be changed to >=1 as part of the Galera fix. > > Related to this, I discussed an idea with Andrei Elkin: Actually there > > should be no need to flush the InnoDB redo log at transaction commit if > > logical logging is enabled. On recovery, we could retrieve the latest > > committed logical log position from InnoDB, and then replay the logical > > log from that point of time. We would only have to wait for InnoDB redo > > log write before discarding the head of the logical log. For this, we > > could implement some API call like "flush logs up to the LSN corresponding > > to this binlog position". > > It's an interesting idea, I had the same one 8 years ago ;-) > > http://worklog.askmonty.org/worklog/Server-RawIdeaBin/?tid=164 > > (I think there may be a jira corresponding to this MWL#164, but not sure how > to find it). > > It would be really nice to see this implemented. There are a few > complications though, eg. multi-engine transactions. And it still requires > that storage engines (/Galera) respect the commit_ordered() semantics. Yes, I believe that multi-engine transactions would still need a XA 2PC mechanism. In that case, it seems to me that the binlog would have to continue to do extra fsync() at all of XA PREPARE, XA COMMIT, XA ROLLBACK. > If you can describe in more detail what the problem is, I can look a bit > deeper? I think that I must carefully debug the issue that I observed in 10.4 to see what exactly is wrong there. I was making a false assumption, but luckily as a result of this, we got MDEV-15740 properly recategorized as a Galera bug. Thanks for the help! Marko -- Marko Mäkelä, Lead Developer InnoDB MariaDB Corporation _______________________________________________ Mailing list: https://launchpad.net/~maria-developers Post to : maria-developers@lists.launchpad.net Unsubscribe : https://launchpad.net/~maria-developers More help : https://help.launchpad.net/ListHelp