You might want to try pgreplay: http://laurenz.github.io/pgreplay/
On Thu, Jul 30, 2015 at 7:23 AM, Spiros Ioannou <siv...@inaccess.com> wrote:

> I'm very sorry, but we don't have a synthetic load generator for our
> testing setup, only production, and that is under an SLA. I would be
> happy to test the next release though.
>
> *Spiros Ioannou
> IT Manager, inAccess
> www.inaccess.com <http://www.inaccess.com>
> M: +30 6973-903808
> T: +30 210-6802-358*
>
> On 29 July 2015 at 13:42, Heikki Linnakangas <hlinn...@iki.fi> wrote:
>
>> On 07/28/2015 11:36 PM, Heikki Linnakangas wrote:
>>
>>> A-ha, I managed to reproduce this now on my laptop, with pgbench! It
>>> seems to be important to have a very large number of connections:
>>>
>>> pgbench -n -c400 -j4 -T600 -P5
>>>
>>> That got stuck after a few minutes. I'm using commit_delay=100.
>>>
>>> Now that I have something to work with, I'll investigate this more
>>> tomorrow.
>>>
>>
>> Ok, it seems that this is caused by the same issue that I found with my
>> synthetic test case, after all. It is possible to get a lockup because of
>> it.
>>
>> For the archives, here's a hopefully easier-to-understand explanation of
>> how the lockup happens. It involves three backends. A and C are inserting
>> WAL records, while B is flushing the WAL with commit_delay. The byte
>> positions 2000, 2100, 2200, and 2300 are offsets within a WAL page. 2000
>> points to the beginning of the page, while the others are later positions
>> on the same page. WaitToFinish() is an abbreviation for
>> WaitXLogInsertionsToFinish(). "Update pos X" means a call to
>> WALInsertLockUpdateInsertingAt(X). "Reserve A-B" means a call to
>> ReserveXLogInsertLocation, which returned StartPos A and EndPos B.
>>
>> Backend A                 Backend B                 Backend C
>> ---------                 ---------                 ---------
>> Acquire InsertLock 2
>> Reserve 2100-2200
>>                           Calls WaitToFinish()
>>                             reservedUpto is 2200
>>                             sees that Lock 1 is
>>                             free
>>                                                     Acquire InsertLock 1
>>                                                     Reserve 2200-2300
>>                                                     GetXLogBuffer(2200)
>>                                                      page not in cache
>>                                                      Update pos 2000
>>                                                      AdvanceXLInsertBuffer()
>>                                                       run until about to
>>                                                       acquire WALWriteLock
>> GetXLogBuffer(2100)
>>  page not in cache
>>  Update pos 2000
>>  AdvanceXLInsertBuffer()
>>   Acquire WALWriteLock
>>   write out old page
>>   initialize new page
>>   Release WALWriteLock
>> finishes insertion
>> release InsertLock 2
>>                           WaitToFinish() continues
>>                             sees that lock 2 is
>>                             free. Returns 2200.
>>
>>                           Acquire WALWriteLock
>>                           Call WaitToFinish(2200)
>>                             blocks on Lock 1,
>>                             whose insertingAt
>>                             is 2000.
>>
>> At this point, there is a deadlock between B and C. B is waiting for C to
>> release the lock or update its insertingAt value past 2200, while C is
>> waiting for WALWriteLock, held by B.
>>
>> To fix that, let's fix GetXLogBuffer() to always advertise the exact
>> position, not the beginning of the page (except when inserting the first
>> record on the page, just after the page header, see comments).
>>
>> This fixes the problem for me. I've been running pgbench for about 30
>> minutes without lockups now, while without the patch it locked up within a
>> couple of minutes. Spiros, can you easily test this patch in your
>> environment? Would be nice to get a confirmation that this fixes the
>> problem for you too.
>>
>> - Heikki
>>
>
-- 
To understand recursion, one must first understand recursion.
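
For readers of the archive, the difference between the two advertising rules in Heikki's explanation can be sketched as a small toy model. This is not the PostgreSQL source and not the actual patch: the function names, the 1000-byte page size and the 24-byte page header are hypothetical values chosen only to match the 2000/2200 positions used in the example above, and the wait condition is a simplified reading of "release the lock or update its insertingAt value".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t wal_pos;   /* hypothetical stand-in for a WAL byte position */

/* Behaviour described as the problem: advertise the beginning of the page. */
wal_pos advertise_page_start(wal_pos ptr, wal_pos page_size)
{
    return ptr - ptr % page_size;
}

/*
 * Behaviour described as the fix: advertise the exact position, except when
 * inserting the first record on the page (just after the page header), in
 * which case the beginning of the page is advertised.
 */
wal_pos advertise_exact(wal_pos ptr, wal_pos page_size, wal_pos header_size)
{
    if (ptr % page_size == header_size)
        return ptr - header_size;
    return ptr;
}

/*
 * Simplified model of the flusher's wait: it must wait on an insertion lock
 * that is held and whose advertised position has not yet reached 'upto'.
 */
bool flusher_must_wait(bool lock_held, wal_pos inserting_at, wal_pos upto)
{
    return lock_held && inserting_at < upto;
}

/* Replays the B-versus-C step of the scenario with the assumed numbers. */
void check_scenario(void)
{
    const wal_pos page_size = 1000;   /* assumption, so that 2000 starts a page */
    const wal_pos header = 24;        /* assumption, illustrative header size */

    /* Old rule: C advertises 2000, so B (waiting up to 2200) blocks on C. */
    wal_pos old_pos = advertise_page_start(2200, page_size);
    assert(old_pos == 2000);
    assert(flusher_must_wait(true, old_pos, 2200));

    /* New rule: C advertises 2200, so B no longer has to wait for C. */
    wal_pos new_pos = advertise_exact(2200, page_size, header);
    assert(new_pos == 2200);
    assert(!flusher_must_wait(true, new_pos, 2200));

    /* First record on the page still advertises the page start. */
    assert(advertise_exact(2000 + header, page_size, header) == 2000);
}
```

Calling check_scenario() walks through the last step of the table: with the page-start rule B blocks on Lock 1 while holding WALWriteLock, and with the exact-position rule it does not, which is the cycle the patch is meant to break.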