Multi-second pauses blocking even trivial activity

2018-09-07 Thread Patrick Molgaard
Hi folks,

I've been seeing some curious behaviour on a postgres server I administer.

Intermittently (one or two times a week), all queries on that host are
simultaneously blocked for extended periods (10s of seconds).

The blocked queries are trivial & not related to locking - I'm seeing
slowlogs of the form:

`LOG: duration: 22627.299 ms statement: SET client_encoding='''utf-8''';`

where this is the first statement on a fresh connection.

It happens even for connections from the same host - so it doesn't appear
to be e.g. network slowness, if that is even counted in query duration.

Does anyone have any hints for where to look for a cause?

Thanks,

Patrick

-

Set up information:

Postgres version:  PostgreSQL 9.6.5 on x86_64-pc-linux-gnu, compiled by gcc
(Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609, 64-bit

Full Table and Index Schema: not applicable, I think

EXPLAIN ANALYSE: n/a

Hardware: AWS i3.2xlarge

Maintenance Setup: autovacuum yes, but the times it runs don't correlate to
the incidences of slow queries

WAL settings: shipped to S3 with wal-e, stored on same disk for interim
period


Re: Multi-second pauses blocking even trivial activity

2018-09-07 Thread Patrick Molgaard
Hi Jeff,

Thanks for your reply. Are locks relevant in this case, though?

To be clear, the slow statements are the first thing happening on the
connection and don't look like they should be acquiring any kind of lock -
eg. 'select version();' also seems to be paused when it occurs.

Or are there some system level locks that a trivial query, touching no
relations, might be contending for?

Best
Patrick

On Fri, 7 Sep 2018, 15:32 Jeff Janes,  wrote:

> On Fri, Sep 7, 2018 at 8:00 AM Patrick Molgaard 
> wrote:
>
>> Hi folks,
>>
>> I've been seeing some curious behaviour on a postgres server I administer.
>>
>> Intermittently (one or two times a week), all queries on that host are
>> simultaneously blocked for extended periods (10s of seconds).
>>
>> The blocked queries are trivial & not related to locking - I'm seeing
>> slowlogs of the form:
>>
>> `LOG: duration: 22627.299 ms statement: SET client_encoding='''utf-8''';`
>>
>>
> Do you have log_lock_waits set to on?  If not, you might want to turn it
> on.
>
> Cheers,
>
> Jeff
>


Re: Multi-second pauses blocking even trivial activity

2018-09-07 Thread Patrick Molgaard
Oh, to be clear - I'll be implementing your suggestion regardless, it seems
valuable whether or not it gets me closer to the root cause this time :)

I was just trying to dig into why it may be relevant -- I want to really
get a good grip on the mechanism behind this phenomenon.

Cheers
Patrick
On Fri, 7 Sep 2018, 20:20 Jeff Janes,  wrote:

> On Fri, Sep 7, 2018 at 2:03 PM Patrick Molgaard 
> wrote:
>
>>
>> Hi Jeff,
>>
>> Thanks for your reply. Are locks relevant in this case, though?
>>
>
> I don't know, but why theorize when we can know for sure?  It at least
> invokes VirtualXactLockTableInsert.  I don't see how that could block on a
> heavyweight lock, though. But again, why theorize when logging it is simple?
>
> Is it always the first statement in a connection which is blocking, or
> will established connections also block at the same time the new ones start
> to block?
>
> Cheers,
>
> Jeff
>
>>


Re: Multi-second pauses blocking even trivial activity

2018-09-07 Thread Patrick Molgaard
This sounds extremely plausible -- thanks for the tip, Andreas.

Best,
Patrick

On Fri, 7 Sep 2018, 19:20 Andreas Kretschmer, 
wrote:

>
> >
> >Intermittently (one or two times a week), all queries on that host are
> >simultaneously blocked for extended periods (10s of seconds).
> >
> >The blocked queries are trivial & not related to locking - I'm seeing
> >slowlogs of the form:
> >
>
>
> please check if THP are enabled.
>
>
> Regards, Andreas
>
>
>
> --
> 2ndQuadrant - The PostgreSQL Support Company
>
>


Re: How Do You Associate a Query With its Invoking Procedure?

2018-09-15 Thread Patrick Molgaard
You might find application-level tracing a more practical answer - e.g.
check out Datadog APM for a (commercial) plug and play approach or Jaeger
for a self-hostable option.

Patrick
On Fri, Sep 14, 2018 at 4:38 PM Fred Habash  wrote:

> Any ideas, please?
>
> On Thu, Sep 13, 2018, 3:49 PM Fd Habash  wrote:
>
>> In API function may invoke 10 queries. Ideally, I would like to know what
>> queries are invoked by it and how long each took.
>>
>>
>>
>> I’m using pg_stat_statement. I can see the API function statement, but
>> how do I deterministically identify all queries invoked by it?
>>
>>
>>
>>
>>
>> 
>> Thank you
>>
>>
>>
>


Re: Multi-second pauses blocking even trivial activity

2018-09-21 Thread Patrick Molgaard
Andreas -- just following up to say that this was indeed the root cause.
Thanks again.

Patrick

On Sat, 8 Sep 2018, 01:34 Patrick Molgaard,  wrote:

> This sounds extremely plausible -- thanks for the tip, Andreas.
>
> Best,
> Patrick
>
>
> On Fri, 7 Sep 2018, 19:20 Andreas Kretschmer, 
> wrote:
>
>>
>> >
>> >Intermittently (one or two times a week), all queries on that host are
>> >simultaneously blocked for extended periods (10s of seconds).
>> >
>> >The blocked queries are trivial & not related to locking - I'm seeing
>> >slowlogs of the form:
>> >
>>
>>
>> please check if THP are enabled.
>>
>>
>> Regards, Andreas
>>
>>
>>
>> --
>> 2ndQuadrant - The PostgreSQL Support Company
>>
>>


Understanding logical replication lag

2021-01-21 Thread Patrick Molgaard
Hi all,

*My top-level query is*: I'm using logical replication under pg 9.6 to do a
kind of change data capture and I'm seeing occasional extended periods of
significant lag. I'm not sure what conceptual model I'm missing in order to
understand why this happens.

*The details:*

I'm running Postgres 9.6.19 from the postgres debian apt repos & the
wal2json extension.

I have a custom client application which essentially executes
pg_logical_slot_get_changes() for some manually-created logical replication
slot on a loop.

I'm monitoring replication lag defined as pg_current_xlog_location() -
confirmed_flush_lsn for that slot.

What I'm observing is - very occasionally - an extended period (hours long)
wherein:

* The normal database write load continues or slightly increases
* calls to pg_logical_slot_get_changes() return no rows and
confirmed_flush_lsn doesn't move
* the duration of a call to pg_logical_slot_get_changes() rises linearly
over time

I understand from the docs and research that this is usually caused by a
long-running write transaction, but I notice I'm still confused.

* I'm not 100% sure - I'm still confirming - but I'm fairly confident that
I don't have any egregiously long write transactions (at least on that
scale of hours). Are there any other common scenarios that can result in a
similar 'blockage'? e.g some categories of long read-only transactions, or
advisory locks, or other kinds of database activity like a vacuum?

* Conversely, from experimenting, it seems as if not all long-running write
transactions cause pg_logical_slot_get_changes() to be unable to advance.
In fact, I'm not able so far to produce a minimal set of simple queries
which show that behaviour.

Given the following sequence of queries I see changes emitted:

-- session 1
begin;
insert into foo(bar,baz) values (1, 1);

-- session 2
begin;
insert into foo(bar,baz) values (2,2);
commit;

-- session 3
 select data from
pg_logical_slot_get_changes('example-slot', NULL, NULL, 'format-version',
'2');

Session 3 is able to return the row from session 2 despite session 1's
ongoing transaction starting first and not yet committing. Can you help me
understand (or better yet point me to a resource which explains) the
underlying logic defining how logical decoding does in fact get blocked by
in-flight transactions?

Thanks,
Patrick