On Tue, Nov 5, 2019 at 4:48 PM Ashutosh Sharma <ashu.coe...@gmail.com> wrote:
> From the stack trace shared by Prabhat, I understand that the checkpointer
> process panicked due to one of the following two reasons:
>
> 1) The fsync() failed on the first attempt itself, and the reason for the
> failure was not the file being dropped or truncated, i.e. fsync failed with
> an error other than ENOENT. Refer to ProcessSyncRequests() for details,
> esp. the code inside the
> for (failures = 0; !entry->canceled; failures++) loop.
>
> 2) The first attempt to fsync() failed with ENOENT because, just before the
> fsync function was called, the file being synced was either dropped or
> truncated. When this happened, the checkpointer process called
> AbsorbSyncRequests() to update the entry for the deleted file in the hash
> table, but it seems AbsorbSyncRequests() failed to do so, and that's why
> "entry->canceled" couldn't be set to true. Because of this, fsync() was
> performed on the same file twice and failed both times. As the checkpointer
> process doesn't expect fsync on the same file to fail twice, it panicked.
> Again, please check ProcessSyncRequests() for details, esp. the code inside
> the for (failures = 0; !entry->canceled; failures++) loop.
>
> Now, the point of discussion here is: which of the above two reasons could
> be the cause of the panic? To me, point #2 doesn't look like a plausible
> reason. Just before a file is unlinked, the backend first sends a
> SYNC_FORGET_REQUEST to the checkpointer process, which marks the entry for
> this file in the hash table as cancelled, and only then removes the file.
> With that understanding, it is hard to believe that, once the first fsync()
> for a file has failed with ENOENT, a call to AbsorbSyncRequests() made
> immediately afterwards wouldn't update the entry for this file in the hash
> table, because the backend only removes the file once it has successfully
> sent the SYNC_FORGET_REQUEST for that file to the checkpointer process.
> See mdunlinkfork() -> register_forget_request() for details on this.
>
> So, I think the first point mentioned above is the probable reason for the
> checkpointer process panicking. But, having said all that, it would be good
> to have some evidence for it, which can be confirmed by inspecting the
> server logfile.
>
> Prabhat, is it possible for you to re-run the test case with
> log_min_messages set to DEBUG1 and save the logfile for the test case that
> crashes? This would be helpful in knowing whether the fsync was performed
> just once or twice, i.e. whether point #1 or point #2 is the reason for the
> panic.
>

I have run the same test cases with and without the patch multiple times
with the debug option (log_min_messages = DEBUG1), but this time I am not
able to reproduce the crash.
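To make sure we are talking about the same behaviour, here is a small,
self-contained toy program in plain C illustrating the
"retry exactly once on ENOENT, otherwise treat the failure as fatal" policy
described above. It is only a sketch: the names sync_one_file(),
absorb_sync_requests(), entry_canceled and the sample path are placeholders
of mine, not the actual code of ProcessSyncRequests().

#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static bool entry_canceled = false;

/*
 * Stand-in for AbsorbSyncRequests(): in the real checkpointer this can
 * mark the entry as canceled if a SYNC_FORGET_REQUEST arrived because the
 * file was dropped or truncated.
 */
static void
absorb_sync_requests(void)
{
}

/*
 * Toy version of the retry loop
 * for (failures = 0; !entry->canceled; failures++) in ProcessSyncRequests().
 */
static void
sync_one_file(const char *path)
{
    for (int failures = 0; !entry_canceled; failures++)
    {
        int         fd = open(path, O_RDWR);

        if (fd >= 0)
        {
            if (fsync(fd) == 0)
            {
                close(fd);
                return;         /* synced successfully */
            }
            /* fsync failed: preserve its errno across close() */
            {
                int         save_errno = errno;

                close(fd);
                errno = save_errno;
            }
        }

        if (errno != ENOENT || failures > 0)
        {
            /* Any non-ENOENT error, or a second failure of any kind. */
            fprintf(stderr, "could not fsync file \"%s\": %s\n",
                    path, strerror(errno));
            exit(1);
        }

        /*
         * First failure with ENOENT: the file may just have been dropped
         * or truncated.  Absorb pending forget requests (which may cancel
         * this entry) and then retry exactly once.
         */
        absorb_sync_requests();
    }
}

int
main(void)
{
    sync_one_file("base/12345/67890");  /* hypothetical relation file */
    return 0;
}

As far as I understand, in the real checkpointer the failure in the
error branch above is reported through data_sync_elevel(ERROR), which is
promoted to PANIC unless data_sync_retry is enabled, and that would explain
the PANIC seen in the stack trace.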
> Thanks,
>
> --
> With Regards,
> Ashutosh Sharma
> EnterpriseDB: http://www.enterprisedb.com
>
> On Thu, Oct 31, 2019 at 10:26 AM Prabhat Sahu <prabhat.s...@enterprisedb.com> wrote:
>
>> On Wed, Oct 30, 2019 at 9:46 PM Robert Haas <robertmh...@gmail.com> wrote:
>>
>>> On Wed, Oct 30, 2019 at 3:49 AM Prabhat Sahu <prabhat.s...@enterprisedb.com> wrote:
>>>
>>>> While testing the Toast patch (PG + v7 patch) I found the server crash below.
>>>> System configuration:
>>>> VCPUs: 4, RAM: 8GB, Storage: 320GB
>>>>
>>>> This issue is not frequently reproducible; we need to repeat the same
>>>> test case multiple times.
>>>
>>> I wonder if this is an independent bug, because the backtrace doesn't
>>> look like it's related to the stuff this is changing. Your report doesn't
>>> specify whether you can also reproduce the problem without the patch,
>>> which is something that you should always check before reporting a bug
>>> in a particular patch.
>>
>> Hi Robert,
>>
>> My sincere apologies that I did not describe the issue in more detail.
>> I have run the same case against both PG HEAD and HEAD + patch multiple
>> times (7, 10, 20 runs), and I found that the issue does not occur on HEAD
>> while the same case is reproducible on HEAD + patch (again, I was not sure
>> whether the backtrace is related to the patch or not).
>>
>>> --
>>> Robert Haas
>>> EnterpriseDB: http://www.enterprisedb.com
>>> The Enterprise PostgreSQL Company
>>
>> --
>> With Regards,
>> Prabhat Kumar Sahu
>> Skype ID: prabhat.sahu1984
>> EnterpriseDB Software India Pvt. Ltd.
>> The Postgres Database Company

--
With Regards,
Prabhat Kumar Sahu
Skype ID: prabhat.sahu1984
EnterpriseDB Software India Pvt. Ltd.
The Postgres Database Company