On Fri, Feb 3, 2023 at 11:12 AM Peter Smith <smithpb2...@gmail.com> wrote: > > On Fri, Feb 3, 2023 at 4:21 PM Amit Kapila <amit.kapil...@gmail.com> wrote: > > > > On Fri, Feb 3, 2023 at 6:41 AM Peter Smith <smithpb2...@gmail.com> wrote: > > > > > > On Thu, Feb 2, 2023 at 7:21 PM Takamichi Osumi (Fujitsu) > > > <osumi.takami...@fujitsu.com> wrote: > > > > > > > ... > > > > > > > > > > > > > Besides, I am not sure it's a stable test to check the log. Is it > > > > > possible that there's > > > > > no such log on a slow machine? I modified the code to sleep 1s at the > > > > > beginning > > > > > of apply_dispatch(), then the new added test failed because the > > > > > server log > > > > > cannot match. > > > > To get the log by itself is necessary to ensure > > > > that the delay is conducted by the apply worker, because we emit the > > > > diffms > > > > only if it's bigger than 0 in maybe_apply_delay(). If we omit the step, > > > > we are not sure the delay is caused by other reasons or the > > > > time-delayed feature. > > > > > > > > As you mentioned, it's possible that no log is emitted on slow machine. > > > > Then, > > > > the idea to make the test safer for such machines should be to make the > > > > delayed time longer. > > > > But we shortened the delay time to 1 second to mitigate the long test > > > > execution time of this TAP test. > > > > So, I'm not sure if it's a good idea to make it longer again. > > > > > > I think there are a couple of things that can be done about this problem: > > > > > > 1. If you need the code/test to remain as-is then at least the test > > > message could include some comforting text like "(this can fail on > > > slow machines when the delay time is already exceeded)" so then a test > > > failure will not cause undue alarm. > > > > > > 2. Try moving the DEBUG2 elog (in function maybe_apply_delay) so that > > > it will *always* log the remaining wait time even if that wait time > > > becomes negative. Then I think the test cases can be made > > > deterministic instead of relying on good luck. This seems like the > > > better option. > > > > > > > I don't understand why we have to do any of this instead of using 3s > > as min_apply_delay similar to what we are doing in > > src/test/recovery/t/005_replay_delay. Also, I think we should use > > exactly the same way to verify the test even though we want to keep > > the log level as DEBUG2 to check logs in case of any failures. > > > > IIUC the reasons are due to conflicting requirements. e.g. > - A longer delay like 3s might work better for testing this feature, but OTOH > - A longer delay will also cause the whole BF execution to take longer >
Sure, but we already have the same test for a similar feature and it seems to be a proven reliable way to test the feature. We do seem to have seen buildfarm failures for tests related to recovery_min_apply_delay and the current way is quite stable, so I would prefer to go with that. -- With Regards, Amit Kapila.