Hello. Kristian Nielsen <kniel...@knielsen-hq.org> writes:
> andrei.el...@pp.inet.fi writes: > >> First the parent errros out goes to `finish_event_group()' but it's >> possible it does not have yet the child in its `subsequent_commits_list' > >> So I understood so far that the retrying worker needs to check >> `stop_on_error_sub_id' that effectively reflects the error status. > > Agree, the basic fix looks correct, of checking stop_on_error_sub_id. > > And yes, there are probably cases where after stop_on_error_sub_id is set, > earlier transactions are aborted early without setting up the > subsequent-transaction-chain, so that wait_for_prior_commit cannot be relied > upon, and stop_on_error_sub_id must be checked. > > So again, nice catch, this would be nasty to try and debug from errors seen > only in a user's setup. Thanks! > > I am only wondering about two points, if there is something that needs to be > adjusted in the patch around the error exit after stop_on_error_sub_id, or > perhaps something else not yet understood going on: > > > > Number one: > >>>> + if (!rgi->worker_error) >>>> + rgi->worker_error= 1; >>>> + } >>> >>> I would have expected an abort (goto err) here (or well just below, after >>> unlocking the mutex). Is this not needed, and if so, why not? >> >> I left it to upcoming (in 2 lines) THD::wait_for_prior_commit(). It >> takes care to check out `worker_error'. >> >> int wait_for_prior_commit(THD *thd) >> { >> /* >> Quick inline check, to avoid function call and locking in the common >> case >> where no wakeup is registered, or a registered wait was already >> signalled. >> */ >> if (waitee) >> return wait_for_prior_commit2(thd); >> else >> { >> if (wakeup_error) >> my_error(ER_PRIOR_COMMIT_FAILED, MYF(0)); >> return wakeup_error; >> } >> } >> >> No waitee is guaranteed as this worker skipped >> `register_wait_for_prior_event_group_commit()'. > > Sorry, I don't get it. wait_for_prior_commit() checks > thd->wait_for_commit_ptr->wakeup_error, not rgi->worker_error? > wait_for_prior_commit() doesn't have access to rgi. Thanks for this check. Indeed, I did not fully recover from earlier confusion between the two (too much closely named (in some metrics) identifier they are!). thd->wait_for_commit_ptr->wakeup_error := 1 is meant even though the test passes (must be thanks to rgi->worker_error later checks). > >> list. And that's how the parent's `wakeup_error' never reaches the >> child. This is certainly the case when a child retries after a temp >> failure. In the test condition the child's `worker_error' is zero, but >> if a non-zero case `worker_error' gets cleared anyway through >> `unregister_wait_for_prior_commit()'. > > Isn't there some confusion here? Between "wakeup_error", which sits inside > struct wait_for_commit, and indicates that a prior event group failed. And > "worker_error" in rpl_group_info, which indicates that _this_ event group > failed, and is used to remember to do proper error cleanup inside the worker > thread? To fix. > > Number two: > >> You must mean the retrying worker stops doing it optimistically. I saw that. >> But I saw through logs that at times a worker >> could re-try about the number of worker pool size. >> Perhaps this unrestrained behaviour was caused by lack of a check of the >> current patch. > > Yes, that is what I mean. And yes, the underlying bug with missing check on > stop_on_error_sub_id might have caused this as a secondary effect. > > My point was just that users should not need to set a high retry count just > from using parallel replication (remember, several thousand worker threads > have been used with success on production data). So if this is necessary in > a test case, it might indicate a problem with the code... True. I am making changes to submit a newer version very soon. Thanks for your help! Andrei > > - Kristian. _______________________________________________ Mailing list: https://launchpad.net/~maria-developers Post to : maria-developers@lists.launchpad.net Unsubscribe : https://launchpad.net/~maria-developers More help : https://help.launchpad.net/ListHelp