That would need a replica of the data probably which is not possible
(tablespace is 4TB).








*Spiros Ioannou IT Manager, inAccesswww.inaccess.com
<http://www.inaccess.com>M: +30 6973-903808T: +30 210-6802-358*

On 30 July 2015 at 21:47, Scott Marlowe <scott.marl...@gmail.com> wrote:

> You might want to try pg replay: http://laurenz.github.io/pgreplay/
>
> On Thu, Jul 30, 2015 at 7:23 AM, Spiros Ioannou <siv...@inaccess.com>
> wrote:
>
>> I'm very sorry but we don't have a synthetic load generator for our
>> testing setup, only production and that is on SLA. I would be happy to test
>> the next release though.
>>
>>
>>
>>
>>
>>
>>
>>
>> *Spiros Ioannou IT Manager, inAccesswww.inaccess.com
>> <http://www.inaccess.com>M: +30 6973-903808T: +30 210-6802-358*
>>
>> On 29 July 2015 at 13:42, Heikki Linnakangas <hlinn...@iki.fi> wrote:
>>
>>> On 07/28/2015 11:36 PM, Heikki Linnakangas wrote:
>>>
>>>> A-ha, I succeeded to reproduce this now on my laptop, with pgbench! It
>>>> seems to be important to have a very large number of connections:
>>>>
>>>> pgbench -n -c400 -j4 -T600 -P5
>>>>
>>>> That got stuck after a few minutes. I'm using commit_delay=100.
>>>>
>>>> Now that I have something to work with, I'll investigate this more
>>>> tomorrow.
>>>>
>>>
>>> Ok, it seems that this is caused by the same issue that I found with my
>>> synthetic test case, after all. It is possible to get a lockup because of
>>> it.
>>>
>>> For the archives, here's a hopefully easier-to-understand explanation of
>>> how the lockup happens. It involves three backends. A and C are insertion
>>> WAL records, while B is flushing the WAL with commit_delay. The byte
>>> positions 2000, 2100, 2200, and 2300 are offsets within a WAL page. 2000
>>> points to the beginning of the page, while the others are later positions
>>> on the same page. WaitToFinish() is an abbreviation for
>>> WaitXLogInsertionsToFinish(). "Update pos X" means a call to
>>> WALInsertLockUpdateInsertingAt(X). "Reserve A-B" means a call to
>>> ReserveXLogInsertLocation, which returned StartPos A and EndPos B.
>>>
>>> Backend A               Backend B               Backend C
>>> ---------               ---------               ---------
>>> Acquire InsertLock 2
>>> Reserve 2100-2200
>>>                         Calls WaitToFinish()
>>>                           reservedUpto is 2200
>>>                           sees that Lock 1 is
>>>                           free
>>>                                                 Acquire InsertLock 1
>>>                                                 Reserve 2200-2300
>>>                                                 GetXLogBuffer(2200)
>>>                                                  page not in cache
>>>                                                  Update pos 2000
>>>                                                  AdvanceXLInsertBuffer()
>>>                                                   run until about to
>>>                                                   acquire WALWriteLock
>>> GetXLogBuffer(2100)
>>>  page not in cache
>>>  Update pos 2000
>>>  AdvanceXLInsertBuffer()
>>>   Acquire WALWriteLock
>>>   write out old page
>>>   initialize new page
>>>   Release WALWriteLock
>>> finishes insertion
>>> release InsertLock 2
>>>                         WaitToFinish() continues
>>>                           sees that lock 2 is
>>>                           free. Returns 2200.
>>>
>>>                         Acquire WALWriteLock
>>>                         Call WaitToFinish(2200)
>>>                           blocks on Lock 1,
>>>                           whose initializedUpto
>>>                           is 2000.
>>>
>>> At this point, there is a deadlock between B and C. B is waiting for C
>>> to release the lock or update its insertingAt value past 2200, while C is
>>> waiting for WALInsertLock, held by B.
>>>
>>> To fix that, let's fix GetXLogBuffer() to always advertise the exact
>>> position, not the beginning of the page (except when inserting the first
>>> record on the page, just after the page header, see comments).
>>>
>>> This fixes the problem for me. I've been running pgbench for about 30
>>> minutes without lockups now, while without the patch it locked up within a
>>> couple of minutes. Spiros, can you easily test this patch in your
>>> environment? Would be nice to get a confirmation that this fixes the
>>> problem for you too.
>>>
>>> - Heikki
>>>
>>>
>>
>
>
> --
> To understand recursion, one must first understand recursion.
>

Reply via email to