On Fri, Mar 17, 2017 at 08:33:07AM +0100, Kristian Nielsen wrote:
> Reinder Cuperus <rein...@snt.utwente.nl> writes:
>
> > The problem is, as soon as I stop that connection, master2 and
> > master3 have different GTID positions for domain 0, and stop/start on
> > replication master3->backup results in the error:
> > "Got fatal error 1236 from master when reading data from binary log:
> > 'Error: connecting slave requested to start from GTID 0-1-3898746614,
> > which is not in the master's binlog'"
>
> Yes. backup sees that it is ahead of master3 in domain 0, so it aborts to
> avoid the risk of diverging replication.
>
> > I have tried moving master1/2 to domain_id:1, and removing
> > domain_id:0 from the gtid_slave_pos on backup, but starting the
> > replication master2->backup results in the error:
> > "Got fatal error 1236 from master when reading data from binary log:
> > 'Could not find GTID state requested by slave in any binlog files.
> > Probably the slave state is too old and required binlog files have been
> > purged.'"
>
> Yes. Because backup now sees that it is far behind in domain 0 (as it sees
> the world), and aborts so as not to silently lose transactions.
>
> > I tried to find a way to purge domain 0 from master3/master4, but the
> > only way I have found so far is doing a "RESET MASTER" on master3, which
> > would break replication between master3 and master4.
>
> Yes, I guess this is what you need. You have made a copy and removed half of
> the data, and now you need to similarly remove half of the binlog. Even if
> there are no actual transactions left from a domain in non-purged binlogs,
> the binlogs still remember the history of all domains, in order not to
> silently lose transactions for a slave that gets far behind.
>
> It would be useful in general to be able to purge a domain from a binlog.
> But currently the only way I can think of is RESET MASTER.
>
> You can see how this binlog history looks by checking @@gtid_binlog_state,
> and in the GTID_LIST events at the head of each binlog file.
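For the archives, the binlog history Kristian refers to can be inspected like this (a sketch; the binlog file name below is just a placeholder, use one listed by SHOW BINARY LOGS):

```sql
-- Shows the latest GTID the binlog remembers per replication domain,
-- even for domains that no longer have any transactions in the binlogs.
SELECT @@GLOBAL.gtid_binlog_state;

-- The GTID_LIST event near the head of each binlog file records the
-- same per-domain history.
SHOW BINLOG EVENTS IN 'mysql-bin.000001' LIMIT 10;
```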
Thank you for this insightful information.

> > I have tried to find a way to insert an empty transaction with the last
> > GTID on domain_id:0 on master3, to bring master2/master3 in sync
> > again on that domain, but I could not find a way to do that in MariaDB.
>
> The server will not binlog an empty transaction, but a dummy transaction
> should work, e.g. create and drop a dummy table, for example:
>
> CREATE TABLE dummy_table (a INT PRIMARY KEY);
> SET gtid_domain_id= 0;
> SET gtid_server_id= 1;
> SET gtid_seq_no= 3898746614;
> DROP TABLE dummy_table;
>
> Maybe this way you can make the binlogs look like they are in sync to the
> replication, not sure. It might be tricky, but then you do seem to have a
> good grasp of the various issues involved.

I have created a test setup with Vagrant, and will use it to test this
method. So far this seems to be the solution with the least impact. If at a
later point MariaDB gets an option for purging a domain, I'll use that to
remove this unused domain from all the machines.

> > Are there other ways to fix this issue, so I can have reliable
> > replication master3->backup without having to keep the dummy replication
> > backup->master3 indefinitely?
>
> I guess you would need to stop traffic to master3/master4 while getting them
> in sync with one another, then do RESET MASTER on both and SET GLOBAL
> gtid_slave_pos="" to start replication from scratch. You would then also
> need to have server 'backup' up-to-date with master3 before the RESET
> MASTER, and remove domain id 20 from the gtid_slave_pos on backup after
> the RESET MASTER.
>
> So that is quite intrusive.

Stopping traffic to master3/4 might indeed be quite disruptive.
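If the full reset route were ever taken, the raw statements would be roughly these (a sketch only, based on Kristian's description; it assumes traffic to both masters is already stopped and they are in sync, and the gtid_slave_pos value shown for backup is a placeholder):

```sql
-- On master3 and on master4, once both are quiet and identical:
RESET MASTER;                    -- discards the binlogs, and with them the
                                 -- remembered history of all domains
SET GLOBAL gtid_slave_pos = '';  -- restart the GTID slave state from scratch

-- On backup (which must already be up-to-date with master3 before the
-- RESET MASTER): re-set gtid_slave_pos without the domain 20 entry,
-- keeping the positions of the remaining domains, e.g.:
SET GLOBAL gtid_slave_pos = '0-1-12345';  -- placeholder position
```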
I could imagine the following sequence to work in that case, assuming
master3 is primary in keepalived:

- master3> FLUSH TABLES WITH READ LOCK; STOP SLAVE;
- master4> wait until the slave is quiet --> STOP SLAVE;
- master3> RESET MASTER; SET GLOBAL gtid_slave_pos="";
- master4> RESET MASTER; SET GLOBAL gtid_slave_pos="";
- master3> UNLOCK TABLES;
- master4> START SLAVE;
- master3> START SLAVE;

Properly scripted, and run at a moment when there are no long transactions
and no replication lag, the impact would not be noticeable to the end users.
But failure would mean broken replication and a need to rebuild master4,
causing a temporary loss of redundancy.

Thank you for the information.

Regards,
Reinder Cuperus

_______________________________________________
Mailing list: https://launchpad.net/~maria-discuss
Post to     : maria-discuss@lists.launchpad.net
Unsubscribe : https://launchpad.net/~maria-discuss
More help   : https://help.launchpad.net/ListHelp