On Thu, Jul 14, 2022 at 21:26 Aleš Zelený <zeleny.a...@gmail.com> wrote:
>
> On Wed, Jul 13, 2022 at 2:20 AM Michael Paquier <mich...@paquier.xyz>
> wrote:
>
>> On Mon, Jul 11, 2022 at 10:50:23AM +0200, Aleš Zelený wrote:
>> > So far it has happened three times (during a single week) since the
>> > 14.3 -> 14.4 upgrade; before 14.4 we hadn't suffered from such an
>> > issue.
>> >
>> > Questions:
>> > 1) Can we safely downgrade from 14.4 to 14.3 by shutting down the
>> > instance and reinstalling the 14.3 PG packages (to prove that the
>> > issue disappears)?
>> > 2) What is the best way to diagnose the root cause?
>>
>> Hmm.  14.4 has nothing in its release notes that would point to a
>> change in the vacuum or autovacuum code paths:
>> https://www.postgresql.org/docs/14/release-14-4.html#id-1.11.6.5.4
>>
>> There is nothing specific after a look at the changes, and I am not
>> seeing anything that would imply a change in memory context handling
>> either:
>> `git log --stat REL_14_3..REL_14_4`
>> `git diff REL_14_3..REL_14_4 -- *.c`
>>
>> Saying that, you should be able to downgrade safely as there are no
>> changes in the WAL format or the like that would break things.  That
>> said, the corruption issue caused by CONCURRENTLY is something you'd
>> still have to face.
>
> Thanks, good to know; we can use that for a test case, since we already
> hit the CONCURRENTLY bug on 14.3.
>
>> > 2022-07-02 14:48:07 CEST [3930]: [3-1] user=,db=,host=,app= ERROR:
>> > out of memory
>> > 2022-07-02 14:48:07 CEST [3930]: [4-1] user=,db=,host=,app= DETAIL:
>> > Failed on request of size 152094068 in memory context
>> > "TopTransactionContext".
>> > 2022-07-02 14:48:07 CEST [3930]: [5-1] user=,db=,host=,app= CONTEXT:
>> > automatic vacuum of table "prematch.replication.tab_queue_tmp"
>>
>> This is the interesting part.  Do you happen to use logical
>> replication in a custom C++ plugin?
>
> We are using logical replication to other instances (pgoutput) and
> decoderbufs (https://github.com/debezium/postgres-decoderbufs) for
> other applications.
>
>> > 2022-07-02 14:48:47 CEST [4476]: [43-1] user=,db=,host=,app= LOG:
>> > could not fork worker process: Cannot allocate memory
>> > terminate called after throwing an instance of 'std::bad_alloc'
>> >   what():  std::bad_alloc
>> >
>> > DETAIL:  parameters: $1 = '1', $2 = '1748010445', $3 = '0', $4 = '1000'
>> > terminate called after throwing an instance of 'std::bad_alloc'
>> > terminate called after throwing an instance of 'std::bad_alloc'
>> >   what():  what():  std::bad_allocstd::bad_alloc
>> > 2022-07-08 14:54:23 CEST [4476]: [49-1] user=,db=,host=,app= LOG:
>> > background worker "parallel worker" (PID 25251) was terminated by
>> > signal 6: Aborted
>> > 2022-07-08 14:54:23 CEST [4476]: [51-1] user=,db=,host=,app= LOG:
>> > terminating any other active server processes
>>
>> Looks like something is going wrong in the memory handling of one of
>> your C++ extensions here.  If you can isolate an issue using a query
>> without any custom code, that would be a Postgres problem, but I think
>> that you are missing a trick in it.
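
Since the failure above is reported against a named memory context
("TopTransactionContext"), it can help to see where the memory of a
suspected backend actually goes while it is still growing. PostgreSQL 14
provides the pg_backend_memory_contexts view and
pg_log_backend_memory_contexts() for that. Below is a minimal sketch,
assuming psycopg2 and a superuser connection; the connection string and
the target PID are placeholders only.

# Minimal sketch: inspect backend memory contexts on PostgreSQL 14+.
# Assumptions: psycopg2 is installed, the DSN and the suspected PID are
# placeholders, and the connecting role is a superuser (required for
# pg_log_backend_memory_contexts() on 14).
import psycopg2

conn = psycopg2.connect("dbname=prematch user=postgres")
conn.autocommit = True

with conn.cursor() as cur:
    # Largest memory contexts of the *current* backend, as a baseline.
    cur.execute("""
        SELECT name, parent, total_bytes, used_bytes
        FROM pg_backend_memory_contexts
        ORDER BY total_bytes DESC
        LIMIT 10
    """)
    for name, parent, total_bytes, used_bytes in cur.fetchall():
        print(f"{name} (parent {parent}): total={total_bytes} used={used_bytes}")

    # Ask a suspected backend (PID taken from pg_stat_activity) to dump
    # its memory contexts into the server log.
    suspected_pid = 12345  # placeholder
    cur.execute("SELECT pg_log_backend_memory_contexts(%s)", (suspected_pid,))

conn.close()

pg_log_backend_memory_contexts() writes the dump into the server log, so
it can be called repeatedly while a backend's memory grows and the
snapshots compared to see which context is responsible.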
> Here are the extensions installed in the database served by the cluster:
>
> prematch=# \dx
>                                 List of installed extensions
>         Name        | Version |   Schema   |                       Description
> --------------------+---------+------------+------------------------------------------------------------------------
>  amcheck            | 1.3     | public     | functions for verifying relation integrity
>  dblink             | 1.2     | public     | connect to other PostgreSQL databases from within a database
>  file_fdw           | 1.0     | public     | foreign-data wrapper for flat file access
>  hstore             | 1.8     | public     | data type for storing sets of (key, value) pairs
>  hypopg             | 1.3.1   | public     | Hypothetical indexes for PostgreSQL
>  pageinspect        | 1.9     | public     | inspect the contents of database pages at a low level
>  pg_buffercache     | 1.3     | public     | examine the shared buffer cache
>  pg_stat_kcache     | 2.2.0   | public     | Kernel statistics gathering
>  pg_stat_statements | 1.9     | public     | track planning and execution statistics of all SQL statements executed
>  pgcrypto           | 1.3     | public     | cryptographic functions
>  pgstattuple        | 1.5     | public     | show tuple-level statistics
>  plpgsql            | 1.0     | pg_catalog | PL/pgSQL procedural language
>  plpgsql_check      | 2.1     | public     | extended check for plpgsql functions
>  postgres_fdw       | 1.1     | public     | foreign-data wrapper for remote PostgreSQL servers
>  tablefunc          | 1.0     | public     | functions that manipulate whole tables, including crosstab
> (15 rows)
>
> When we started experiencing these issues, I began collecting
> /proc/<pid>/status for all PG instance processes in an infinite loop,
> based on what the Prometheus node-exporter statistics showed.
> From this I can extract some data: it looks like RssAnon is the memory I
> can see growing, from several hundred kB to 1-2 GB over some period
> (depending on workload, but usually 24 hours).
> The sessions whose RssAnon grows serve PHP client applications; with
> multiple connections they can consume 10 GB of memory. It is the same
> application (I asked about deployed changes, but nothing was released
> close to the upgrade time).
> Comparing with the Prometheus server data, RssAnon seems to directly
> follow the node_memory_Active_anon_bytes node-exporter metric. If the
> "suspected" sessions disconnect, node_memory_Active_anon_bytes drops
> back down.
> I checked the node_memory_Active_anon_bytes metric before the upgrade:
> with the same application workload it was pretty stable on PG 13.8 and
> later on 14.3, but immediately after the upgrade
> node_memory_Active_anon_bytes started growing, and once the commit limit
> was reached the fork()/out-of-memory issues appeared.
>
> So for the application user running the PHP code I set
> log_min_duration_statement = 0, but no unexpected statements or arguments
> (some functions take an int array argument) were found in the logfile.
>
> Checking RssAnon from /proc/<pid>/status, I found some points where
> RssAnon usage grew very steeply within a minute, but again no
> "suspicious" queries/arguments were found in the instance logfile.
>
> Any hint on how to find the root cause would be appreciated, since so far
> I have failed to isolate the issue into a reproducible test case.
> At least I hope that watching the RssAnon process memory is an
> appropriate metric; if not, let me know and I'll try to update the
> monitoring accordingly.
>
> I can imagine a workaround where the client application reconnects
> regularly, but up to 14.3 it worked, so I'd like to fix the issue either
> on our application side or on the PG side if it is a PG problem.

Theoretically, there can be a memory leak in libpq. There were some changes.
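
To narrow it down, the /proc/<pid>/status sampling described above can be
kept running and correlated with the statement log afterwards. A minimal
sketch of such a collector follows (standard-library Python only; the
output path, the 60-second interval, and the "postgres" process-name
filter are assumptions):

# Minimal sketch of a /proc/<pid>/status collector for RssAnon tracking.
# Assumptions: Linux, server processes named "postgres", output path and
# sampling interval chosen arbitrarily.
import glob
import time

FIELDS = ("Name", "VmRSS", "RssAnon", "RssFile", "RssShmem")

def sample():
    """Return (pid, fields) for every process whose Name is "postgres"."""
    rows = []
    for path in glob.glob("/proc/[0-9]*/status"):
        try:
            with open(path) as f:
                status = dict(line.split(":", 1) for line in f if ":" in line)
        except OSError:
            continue  # the process exited between listing and reading
        if status.get("Name", "").strip() != "postgres":
            continue
        pid = path.split("/")[2]
        rows.append((pid, {k: status.get(k, "").strip() for k in FIELDS}))
    return rows

if __name__ == "__main__":
    with open("/tmp/pg_rss_anon.log", "a") as out:
        while True:
            ts = time.strftime("%Y-%m-%d %H:%M:%S")
            for pid, values in sample():
                fields = " ".join(f"{k}={v}" for k, v in values.items())
                out.write(f"{ts} pid={pid} {fields}\n")
            out.flush()
            time.sleep(60)

One line per process per sample keeps the output easy to join with the
log_min_duration_statement output by timestamp, which should show whether
the steep RssAnon jumps line up with particular statements.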
Regards

Pavel


> Kind regards
> Ales Zeleny
>
>> --
>> Michael