> On 22 May 2018 at 20:59, Andres Freund <and...@anarazel.de> wrote:
> On 2018-05-22 20:54:46 +0200, Dmitry Dolgov wrote:
>> > On 22 May 2018 at 18:47, Andres Freund <and...@anarazel.de> wrote:
>> > On 2018-05-22 08:57:18 -0700, Andres Freund wrote:
>> >> Hi,
>> >>
>> >> On 2018-05-22 17:37:28 +0200, Dmitry Dolgov wrote:
>> >> > Thanks for the patch. Out of curiosity I tried to play with it a bit.
>> >>
>> >> Thanks.
>> >>
>> >> > `pgbench -i -s 100` actually hung on my machine, because the
>> >> > COPY process ended up waiting after `pg_uds_send_with_fd` had
>> >>
>> >> Hm, that had worked at some point...
>> >>
>> >> > errno == EWOULDBLOCK || errno == EAGAIN
>> >> >
>> >> > as well as the checkpointer process.
>> >>
>> >> What do you mean by that last sentence?
>>
>> To investigate what was happening I attached gdb to two processes, the
>> COPY process from pgbench and the checkpointer (since I assumed it might
>> be involved). Both were waiting in WaitLatchOrSocket right after
>> SendFsyncRequest.
>
> Huh? Checkpointer was in SendFsyncRequest()? Could you share the
> backtrace?
Well, that's what I've got from gdb:

#0  0x00007fae03fae9f3 in __epoll_wait_nocancel () at ../sysdeps/unix/syscall-template.S:84
#1  0x000000000077a979 in WaitEventSetWaitBlock (nevents=1, occurred_events=0x7ffe37529ec0, cur_timeout=-1, set=0x23cddf8) at latch.c:1048
#2  WaitEventSetWait (set=set@entry=0x23cddf8, timeout=timeout@entry=-1, occurred_events=occurred_events@entry=0x7ffe37529ec0, nevents=nevents@entry=1, wait_event_info=wait_event_info@entry=0) at latch.c:1000
#3  0x000000000077ad08 in WaitLatchOrSocket (latch=latch@entry=0x0, wakeEvents=wakeEvents@entry=4, sock=8, timeout=timeout@entry=-1, wait_event_info=wait_event_info@entry=0) at latch.c:385
#4  0x00000000007152cb in SendFsyncRequest (request=request@entry=0x7ffe37529f40, fd=fd@entry=-1) at checkpointer.c:1345
#5  0x0000000000716223 in AbsorbAllFsyncRequests () at checkpointer.c:1207
#6  0x000000000079a5f0 in mdsync () at md.c:1339
#7  0x000000000079c672 in smgrsync () at smgr.c:766
#8  0x000000000076dd53 in CheckPointBuffers (flags=flags@entry=64) at bufmgr.c:2581
#9  0x000000000051c681 in CheckPointGuts (checkPointRedo=722254352, flags=flags@entry=64) at xlog.c:9079
#10 0x0000000000523c4a in CreateCheckPoint (flags=flags@entry=64) at xlog.c:8863
#11 0x0000000000715f41 in CheckpointerMain () at checkpointer.c:494
#12 0x00000000005329f4 in AuxiliaryProcessMain (argc=argc@entry=2, argv=argv@entry=0x7ffe3752a220) at bootstrap.c:451
#13 0x0000000000720c28 in StartChildProcess (type=type@entry=CheckpointerProcess) at postmaster.c:5340
#14 0x0000000000721c23 in reaper (postgres_signal_arg=<optimized out>) at postmaster.c:2875
#15 <signal handler called>
#16 0x00007fae03fa45b3 in __select_nocancel () at ../sysdeps/unix/syscall-template.S:84
#17 0x0000000000722968 in ServerLoop () at postmaster.c:1679
#18 0x0000000000723cde in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x23a00e0) at postmaster.c:1388
#19 0x000000000068979f in main (argc=3, argv=0x23a00e0) at main.c:228

>> >> > Looks like with the default configuration and `max_wal_size=1GB`
>> >> > more is written to the socket than is read from it, and the buffer
>> >> > eventually becomes full.
>> >>
>> >> That's intended to then wake up the checkpointer immediately, so it can
>> >> absorb the requests. So something isn't right yet.
>> >
>> > Doesn't hang here, but it's way too slow.
>>
>> Yep, in my case it was also getting slower, but eventually hung.
>>
>> > Reason for that is that I wrongly resolved a merge conflict. Attached is
>> > a fixup patch - does that address the issue for you?
>>
>> Hm... is that the correct patch? I see the same change already committed
>> in 8c3debbbf61892dabd8b6f3f8d55e600a7901f2b, so I can't really apply it.
>
> Yea, sorry for that. Too many files in my patch directory... The right one
> is attached.

Yes, this patch solves the problem, thanks.
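
For readers who haven't looked at the patch itself, the top of the backtrace
corresponds to a send loop roughly like the sketch below. This is only an
illustration of the pattern being discussed, not the patch's code: the names
send_request_with_fd() and request_payload are made up, and plain poll() here
stands in for the real SendFsyncRequest()/pg_uds_send_with_fd() and
WaitLatchOrSocket() machinery. The idea is that a process writes a small
fsync request (optionally attaching a file descriptor via SCM_RIGHTS) to a
non-blocking Unix domain socket; when the socket buffer is full, the send
fails with EWOULDBLOCK/EAGAIN and the sender waits for the socket to become
writable before retrying.

    /*
     * Minimal sketch (assumed names, not the actual patch code) of sending a
     * request, optionally with a file descriptor attached via SCM_RIGHTS,
     * over a non-blocking Unix domain socket, waiting when the buffer is full.
     */
    #include <errno.h>
    #include <poll.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    typedef struct
    {
        int     type;           /* kind of request */
        int     segno;          /* segment number, or similar payload */
    } request_payload;

    static int
    send_request_with_fd(int sock, const request_payload *req, int fd)
    {
        struct msghdr   msg;
        struct iovec    iov;
        union
        {
            struct cmsghdr  align;
            char            buf[CMSG_SPACE(sizeof(int))];
        }               control;

        memset(&msg, 0, sizeof(msg));
        iov.iov_base = (void *) req;
        iov.iov_len = sizeof(*req);
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;

        if (fd >= 0)
        {
            struct cmsghdr *cmsg;

            /* attach the file descriptor as ancillary data */
            memset(&control, 0, sizeof(control));
            msg.msg_control = control.buf;
            msg.msg_controllen = sizeof(control.buf);
            cmsg = CMSG_FIRSTHDR(&msg);
            cmsg->cmsg_level = SOL_SOCKET;
            cmsg->cmsg_type = SCM_RIGHTS;
            cmsg->cmsg_len = CMSG_LEN(sizeof(int));
            memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
        }

        for (;;)
        {
            if (sendmsg(sock, &msg, 0) >= 0)
                return 0;       /* request (and fd, if any) handed over */

            if (errno == EWOULDBLOCK || errno == EAGAIN)
            {
                /*
                 * Socket buffer is full: wait until the receiver has drained
                 * some of it.  This is the step that can deadlock if the only
                 * reader is itself stuck in this loop.
                 */
                struct pollfd pfd = {.fd = sock, .events = POLLOUT};

                (void) poll(&pfd, 1, -1);
                continue;
            }
            if (errno == EINTR)
                continue;
            return -1;          /* real error */
        }
    }

The hang reported above looks like exactly the failure mode this loop is
vulnerable to: the checkpointer, the process that is supposed to drain the
socket, was itself waiting in SendFsyncRequest() on what was presumably the
same full socket, so the buffer could never empty until the fixup patch
corrected the mis-merged logic.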