On 05/17/2016 08:55 PM, Clint Byrum wrote:
> I missed your reply originally, so sorry for the 2 week lag...
>
> Excerpts from Mike Bayer's message of 2016-04-30 15:14:05 -0500:
>>
>> On 04/30/2016 10:50 AM, Clint Byrum wrote:
>>> Excerpts from Roman Podoliaka's message of 2016-04-29 12:04:49 -0700:
>>>>
>>>
>>> I'm curious why you think setting wsrep_sync_wait=1 wouldn't help.
>>>
>>> The exact example appears in the Galera documentation:
>>>
>>> http://galeracluster.com/documentation-webpages/mysqlwsrepoptions.html#wsrep-sync-wait
>>>
>>> The moment you say 'SET SESSION wsrep_sync_wait=1', the behavior should prevent the list problem you see, and it should not matter that it is a separate session, as that is the entire point of the variable:
>>
>> we prefer to keep it off and just point applications at a single node using master/passive/passive in HAProxy, so that we don't have the unnecessary performance hit of waiting for all transactions to propagate; we just stick on one node at a time. We've fixed a lot of issues in our config in ensuring that HAProxy definitely keeps all clients on exactly one Galera node at a time.
>>
> Indeed, haproxy does a good job at shifting over rapidly. But it's not atomic, so you will likely have a few seconds where commits landed on the new demoted backup.
>
>>> "When you enable this parameter, the node triggers causality checks in response to certain types of queries. During the check, the node blocks new queries while the database server catches up with all updates made in the cluster to the point where the check was begun. Once it reaches this point, the node executes the original query."
>>>
>>> In the active/passive case where you never use the passive node as a read slave, one could actually set wsrep_sync_wait=1 globally. This will cause a ton of lag while new queries happen on the new active and old transactions are still being applied, but that's exactly what you want, so that when you fail over, nothing proceeds until all writes from the original active node are applied and available on the new active node. It would help if your failover technology actually _breaks_ connections to a presumed dead node, so writes stop happening on the old one.
>>
>> If HAProxy is failing over from the master, which is no longer reachable, to another passive node, which is reachable, that means that master is partitioned and will leave the Galera primary component. It also means all current database connections are going to be bounced off, which will cause errors for those clients either in the middle of an operation, or if a pooled connection is reused before it is known that the connection has been reset. So failover is usually not an error-free situation in any case from a database client perspective and retry schemes are always going to be needed.
>>
> There are some really big assumptions above, so I want to enumerate them:
>
> 1. You assume that a partition between haproxy and a node is a partition between that node and the other galera nodes.
> 2. You assume that I never want to failover on purpose, smoothly.
>
> In the case of (1), there are absolutely times where the load balancer thinks a node is dead, and it is quite happily chugging along doing its job. Transactions will be already committed in this scenario that have not propagated, and there may be more than one load balancer, and only one of them thinks that node is dead.
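
(A side note from me: for anyone who would rather flip wsrep_sync_wait from the application rather than in my.cnf, a minimal sketch with plain SQLAlchemy might look like the snippet below. The engine URL is made up, and setting the variable on every pooled connection is just one illustrative policy, not something the thread or oslo.db prescribes.)

    from sqlalchemy import create_engine, event

    engine = create_engine("mysql+pymysql://user:secret@galera-vip/mydb")

    @event.listens_for(engine, "connect")
    def _enable_causal_reads(dbapi_conn, connection_record):
        # Ask Galera to block this session's queries until the node has
        # applied all write sets known at the moment of the check.
        cursor = dbapi_conn.cursor()
        cursor.execute("SET SESSION wsrep_sync_wait = 1")
        cursor.close()
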
>
> For the limited partition problem, having wsrep_sync_wait turned on would result in consistency, and the lag would only be minimal as the transactions propagate onto the new primary server.
>
> For the multiple haproxy problem, lag would be _horrible_ on all nodes that are getting reads as long as there's another one getting writes, so a solution for making sure only one is specified would need to be developed using a leader election strategy. If haproxy is able to query wsrep status, that might be ideal, as galera will in fact elect leaders for you (assuming all of your wsrep nodes are also mysql nodes, which is not the case if you're using 2 nodes + garbd for example).
>
> This is, however, a bit of a strawman, as most people don't need active/active haproxy nodes, so the simplest solution is to go active/passive on your haproxy nodes with something like UCARP handling the failover there. As long as they all use the same primary/backup ordering, then a new UCARP target should just result in using the same node, and a very tiny window for inconsistency and connection errors.
>
> The second assumption is handled by leader election as well. If there's always one leader node that load balancers send traffic to, then one should be able to force promotion of a different node as the leader, and all new transactions and queries go to the new leader. The window for that would be pretty small, and so wsrep_sync_wait time should be able to be very low, if not 0. I'm not super familiar with the way haproxy gracefully reloads configuration, but if you can just change the preferred server and poke it with a signal that sends new stuff to the new master, then you only have a window the size of however long the last transaction takes to worry about inconsistency.
>
>> Additionally, the purpose of the enginefacade [1] is to allow Openstack applications to fix their often incorrectly written database access logic such that in many (most?) cases, a single logical operation is no longer unnecessarily split among multiple transactions when possible. I know that this is not always feasible in the case where multiple web requests are coordinating, however.
>>
> Yeah, that's really the problem. You can't be in control of all of the ways the data is expected to be consistent. IMO, we should do a better job in our API contracts to specify whether data is consistent or not. A lot of this angst over whether we even need to deal with races with Galera would go away if we could just make clear guarantees about reads after writes.
>
>> That leaves only the very infrequent scenario of, the master has finished sending a write set off, the passives haven't finished committing that write set, the master goes down and HAProxy fails over to one of the passives, and the application that just happens to also be connecting fresh onto that new passive node in order to perform the next operation that relies upon the previously committed data so it does not see a database error, and instead runs straight onto the node where the committed data it's expecting hasn't arrived yet. I can't make the judgment for all applications if this scenario can't be handled like any other transient error that occurs during a failover situation, however if there is such a case, then IMO the wsrep_sync_wait (formerly known as wsrep_causal_reads) may be used on a per-transaction basis for that very critical, not-retryable-even-during-failover operation. Allowing this variable to be set for the scope of a transaction and reset afterwards, and only when talking to Galera, is something we've planned to work into the enginefacade as well as an declarative transaction attribute that would be a pass-through on other systems.
>>
> It's not infrequent if you're failing over so you can update the current master without interrupting service. Thinking through the common case so that it doesn't erupt in database errors (a small percentage is ok, a large percentage is not) or inconsistencies in the data seems like a prudent thing to do.
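
(To make the per-transaction wsrep_sync_wait idea quoted above a bit more concrete: a rough sketch of a causal read scoped to a single transaction, in plain SQLAlchemy, could look like the following. The engine URL, table and UUID are made up, and this is of course not the proposed enginefacade API, just roughly the kind of SQL it would have to issue.)

    from sqlalchemy import create_engine, text

    engine = create_engine("mysql+pymysql://user:secret@galera-vip/mydb")
    instance_uuid = "00000000-0000-0000-0000-000000000000"  # illustrative

    with engine.connect() as conn:
        with conn.begin():
            # Causal read for this critical operation only; SET SESSION
            # takes effect immediately, independent of the transaction.
            conn.execute(text("SET SESSION wsrep_sync_wait = 1"))
            row = conn.execute(
                text("SELECT status FROM instances WHERE uuid = :uuid"),
                {"uuid": instance_uuid},
            ).fetchone()
            # Reset before the connection goes back to the pool.
            conn.execute(text("SET SESSION wsrep_sync_wait = 0"))
        print(row)
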
Please note that one of the main purposes of the subject work *is* to demonstrate that there are no more blockers to start using A/A Galera without backup/standby nodes, which would allow all of the complex things above to be skipped.

I believe the additional test cases added in Appendix B also cover the most generic case of an OpenStack service running a DB transaction via the SQLAlchemy ORM (or oslo.db's enginefacade?): one which rolls back transactions on deadlocks, uses the default REPEATABLE READ transaction isolation level, does not specify with_lockmode() / with_for_update(), and, it seems, does not require wsrep_sync_wait/wsrep_causal_reads to be enabled (a rough sketch of that pattern is in the P.S. below). Please correct me if I'm missing something important.

I suggest we quickly revise the DB-related code in OpenStack projects (perhaps looking for things like *.query or *.filter_by?) and recommend that all operators switch to A/A writes and reads if a Galera cluster is used.

I'm not sure, though, how to address/cover with Jepsen cases the issues like the ones Roman P. described above, nor how to compose any test showing a benefit from using wsrep_sync_wait=1 or other values > 0. Any help is appreciated.

-- 
Best regards,
Bogdan Dobrelya,
Irc #bogdando
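
P.S. To illustrate what I mean by the "generic" ORM pattern above, a minimal sketch in plain SQLAlchemy: a transaction with no explicit row locking that is simply rolled back and retried when the driver reports a deadlock (which is also how a Galera certification failure surfaces). The model, engine URL and retry count are all made up for the example; in a real service oslo.db translates such driver errors to DBDeadlock and provides its own retry helpers.

    from sqlalchemy import Column, String, create_engine
    from sqlalchemy.exc import DBAPIError
    from sqlalchemy.ext.declarative import declarative_base
    from sqlalchemy.orm import sessionmaker

    Base = declarative_base()

    class Instance(Base):
        # Illustrative model only, not a real OpenStack schema.
        __tablename__ = "instances"
        uuid = Column(String(36), primary_key=True)
        status = Column(String(16))

    engine = create_engine("mysql+pymysql://user:secret@galera-vip/mydb")
    Session = sessionmaker(bind=engine)

    def set_instance_status(uuid, status, attempts=3):
        for attempt in range(attempts):
            session = Session()
            try:
                # Default REPEATABLE READ, no with_for_update()/with_lockmode().
                instance = session.query(Instance).filter_by(uuid=uuid).one()
                instance.status = status
                session.commit()
                return
            except DBAPIError:
                # On a deadlock (caught broadly here as DBAPIError to keep the
                # sketch short): roll back and rerun the whole transaction.
                session.rollback()
                if attempt == attempts - 1:
                    raise
            finally:
                session.close()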