Hello everyone,

Sorry for the delay in responding, but I had to be out of town for the last four days and didn't have access to my mail.
Here is a recap for those who are just joining: I have multiple branch locations that need to have their transactions rolled up into a common statistics database. Each branch will host a master database replicating its transactions to a central server (multiple slaves, one per branch). The common statistics database needs to be kept as up to date as possible. The changes to the common statistics database will be replicated in a hub-and-spoke pattern (master at the central server, slaves at each branch). What Garth and I have been discussing are ideas for synchronizing an external "merge/process" agent with a slave's replication activity. By coordinating with each slave, I can extract the new records from the central office's slaves and add the processed data to the statistics database in as close to real time as possible.

> On Tue, 2004-04-13 at 20:06, [EMAIL PROTECTED] wrote:
> > On Tue, 2004-04-13 at 13:13, [EMAIL PROTECTED] wrote:
> > [snip]
> > > > I could frequently poll the slave servers using SHOW SLAVE STATUS or
> > > > SHOW PROCESSLIST and watch for their status to change.
> > > >
> > > > I could start each slave with "--log-slave-updates" and watch the
> > > > slaves' binlogs for changes.
> > > >
> > > > I could watch for the creation/deletion of the slave's relay logs.
> > >
> > > This seems to indicate that you are afraid of selecting rows on the
> > > slave that are in the middle of being updated from the master. A single
> > > update statement is still atomic, so you don't need to poll log files to
> > > determine if an update statement has finished.
> > >
> > > On the other hand, if there is some set of multiple updates and inserts
> > > that constitute a collection of data that you want to merge only when
> > > this collection is complete, you're better off finding a way to signal
> > > this through the database.
> > > You could have the master lock the tables in
> > > question until it's finished, and then the program querying the slave knows
> > > that when it gets a read lock, it will see the full set of data. You
> > > could also have a status column or a status table with a flag
> > > letting the program on the slave side know when the data is ready.
> > >
> > > If this is off the mark, maybe some example statements would help...
> >
> > [more snipping]
> >
> > Yes, I AM concerned about getting a partial update to the warehouse. I know
> > that transactions aren't logged until after they commit. If I use
> > transactional boundaries to post multi-table reports (for instance, an
> > invoice takes two tables: one for the base information and one for the line
> > items) into the branch masters, then they will arrive intact and I won't
> > corrupt the central slaves. (That's NOT the issue I am worried about!)
> >
> > I am worried that if I start processing those new records from the slave
> > database to the warehouse before all of the records have been processed
> > from the relay logs (let's say I started trying to merge records when I see
> > the relay log being created), I could miss some data (like the last few
> > items on the invoice). That's why I am so worried about not merging until
> > the END of the cycle. I need to be sure that everything has been committed
> > to my central slave BEFORE I can merge the latest updates with the
> > warehouse database.
> >
> > I can lock a slave database so that it won't replicate in the middle of my
> > merging, so I know that if I can catch a slave when it goes back to "sleep"
> > ("Waiting for master to send event"), I would have a complete set of data.
> > I could use "--log-slave-updates" to copy the updates to the slave's binlog
> > and check that to see if I need to merge records (if slave status is
> > "waiting" and the binlog is not "empty", then merge). Each merge could flush
> > the binlog.
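Stepping out of the quote for a moment: the "waiting and binlog not empty" check I describe above could be sketched roughly like this. This is untested Python, not a live query; the dict stands in for a SHOW SLAVE STATUS row already fetched through whatever client library is handy, and whether the "Waiting for master to send event" state shows up in SHOW SLAVE STATUS or only in SHOW PROCESSLIST depends on the server version.

```python
# Sketch of the "slave is idle and has unmerged updates" check, assuming
# a SHOW SLAVE STATUS row has been fetched into a dict (a stand-in here).

def slave_idle(status):
    """True when the slave looks safe to merge from: both replication
    threads running and the I/O thread blocked waiting on the master."""
    return (status.get("Slave_IO_Running") == "Yes"
            and status.get("Slave_SQL_Running") == "Yes"
            and status.get("Slave_IO_State") == "Waiting for master to send event")

def should_merge(status, binlog_bytes):
    # Merge only when the slave is idle AND --log-slave-updates has left
    # something in the slave's binlog since the last flush.
    return slave_idle(status) and binlog_bytes > 0
```

(Of course, as Garth points out below, there is still a window between this check and the actual SELECT.)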
> > However, there was a post from someone using binlogs for a
> > similar purpose who said that for 4.1.x+ the binlogs vary in size after
> > flushing, so I don't know how reliable that would be as a check. How could I
> > tell when a binlog is empty?
>
> Using the binlog in this way will lead to a race condition. What if
> another update comes in during the few milliseconds between your 'ready'
> check (an empty binlog and a 'waiting' status) and when you select those
> rows for processing? Plus, if MySQL does any buffering of its output to
> the binlog, you could be basing your check on seconds-old data, further
> aggravating the problem.
>
> The only solution I can think of that won't cause a race condition is to
> lock your tables, but your program running on the slave database would
> have to be able to connect to the master. Your slave program would lock
> the necessary tables on the master side, wait to receive the lock, wait
> until the slave had caught up, do your merge, then release the lock.
> Likewise, the code updating tables on the master would need to lock the
> tables while they write. You could make the process a little friendlier by
> creating a separate control table that the slave and master alternately
> lock, rather than locking all the tables you'll use. That way,
> processes on the master that just want to read the data don't have to
> wait for a lock.

Yes, my "merge" process could talk directly to my branch master databases. If I wanted, I could even write the merge process to get the records straight from the branch masters (and avoid all of this replication), but that adds several live WAN connections and I am faced with a new problem: how would I detect changes to the master databases at each of my branches without filling up my WAN pipes with monitoring queries? I think I am better off letting the internal replication mechanics get the database changes to me (in the middle) and processing them there.
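Just to make sure I follow the handshake you describe: something along these lines? This is only a sketch; the Conn class just records statements instead of sending them, "merge_lock" is an invented control-table name, the binlog file/position would really come from SHOW MASTER STATUS on the master, and a real version would check MASTER_POS_WAIT()'s return value for timeouts and errors.

```python
# Sketch of the master-side lock handshake. Conn is a stand-in that
# records SQL rather than executing it; "merge_lock" is an invented
# control-table name used only for illustration.

class Conn:
    def __init__(self):
        self.sent = []

    def execute(self, sql):
        self.sent.append(sql)

def merge_cycle(master, slave, do_merge):
    # 1. Block branch writers by locking the control table on the MASTER.
    master.execute("LOCK TABLES merge_lock WRITE")
    try:
        # 2. Wait for the slave to apply everything up to the master's
        #    current binlog position (taken from SHOW MASTER STATUS in
        #    real code; MASTER_POS_WAIT blocks until the slave reaches it).
        slave.execute("SELECT MASTER_POS_WAIT('branch-bin.001', 4)")
        # 3. The slave's copy is now stable: merge into the warehouse.
        do_merge()
    finally:
        # 4. Release the lock so branch writers can continue.
        master.execute("UNLOCK TABLES")
```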
I spent some time after your last response going over the source code that covers replication. If I read it correctly, the replication process is subject to the same locking as any other client of the database. So I believe that if I lock the new records with FOR UPDATE, I should be able to avoid the race. A less elegant solution would be to lock the entire database. The least elegant control would be to use STOP SLAVE/START SLAVE every time I need to process new records.

However, we have now returned to my original problem(s):

a) How can I tell when a slave has started receiving records?

b) How can I tell when a slave has just finished processing the last record from its relay logs?

c) How could I tell that both of those things (a and b) happened since I last checked (assuming a really quick update and/or a monitoring timer that is too slow to catch the slave in the middle of the act)? Even if I am checking every 0.01 seconds, it could be possible to miss something.

d) Without keeping a third copy of the data, how could I tell what kind of slave update just happened? I believe that I have to use the slave's binlog to know for certain, as the slave's relay log should be erased right after processing.

Any other ideas? Ideally, I need to lock out changes to a slave as soon as possible after b) or c) and apply the knowledge of d) to make the appropriate adjustments to my statistics tables. Is anyone else doing this kind of thing?

My fall-back scenario is to lock down the clients every 10 or 15 seconds and process any changes then. However, I still have to deal with issue d). What would be an optimal combination of techniques I could use to coordinate my slave databases and my statistics processing application?

Thank you all very much for your patience in this matter!!

Shawn

--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe: http://lists.mysql.com/[EMAIL PROTECTED]
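P.S. In case it helps the discussion, here is roughly what I mean by the fall-back lock-down cycle. Again just a sketch: the conn object is assumed to expose a DB-API-style execute(), merge_new_rows stands in for whatever does the actual warehouse merge, and error handling is omitted.

```python
import time

# Sketch of the fall-back: every `interval` seconds, stop replication on
# the slave, merge whatever has arrived, and restart replication.

def lockdown_loop(conn, merge_new_rows, interval=15, rounds=None):
    n = 0
    while rounds is None or n < rounds:
        conn.execute("STOP SLAVE")       # freeze relay-log application
        try:
            merge_new_rows()             # slave data is stable: merge it
        finally:
            conn.execute("START SLAVE")  # resume replication
        time.sleep(interval)
        n += 1
```

This still leaves issue d) open, since stopping the slave tells me nothing about what kind of updates arrived.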