Re: PoC Refactor AM analyse API

2020-12-08 Thread Denis Smirnov
> I suggest a refactoring of analyze AM API as it is too much heap specific at 
> the moment. The problem was inspired by Greenplum’s analyze improvement for 
> append-optimized row and column AM with variable size compressed blocks.
> Currently we do analyze in two steps.
> 
> 1. Sample fixed-size blocks with algorithm S from Knuth (the BlockSampler functions).
> 2. Collect tuples into a reservoir with algorithm Z from Vitter.
> 
> So this doesn’t work, for example, for AMs that use variable-sized physical 
> blocks. They need weighted random sampling (WRS) algorithms like A-Chao, or 
> logical blocks to which Knuth’s algorithm S can be applied (and they have a 
> problem with RelationGetNumberOfBlocks(), which estimates the number of 
> physical blocks). Another problem concerns columns: they are not passed to 
> the analyze begin-scan function and can’t benefit from column storage at 
> ANALYZE TABLE (col).
> 
> The suggestion is to replace table_scan_analyze_next_block() and 
> table_scan_analyze_next_tuple() with a single function: 
> table_acquire_sample_rows(). The AM implementation of 
> table_acquire_sample_rows() can use the BlockSampler functions if it wants 
> to, but if the AM is not block-oriented, it could do something else. This 
> suggestion also passes VacAttrStats to table_acquire_sample_rows() for 
> column-oriented AMs and removes PROGRESS_ANALYZE_BLOCKS_TOTAL and 
> PROGRESS_ANALYZE_BLOCKS_DONE definitions as not all AMs can be block-oriented.
> 
> 
> 
> 
> 
> Best regards,
> Denis Smirnov | Developer
> s...@arenadata.io 
> Arenadata | Godovikova 9-17, Moscow 129085 Russia
> 



Re: PoC Refactor AM analyse API

2020-12-08 Thread Denis Smirnov
Andrey, thanks for your feedback!

I agree that AMs with fixed-size blocks can have very similar code in 
acquire_sample_rows() (though it is not a rule). But there are several points 
about the current sampling in master.

* It is not perfect - AM developers may want to improve it with other sampling 
algorithms.
* It is designed under a heavy influence of the heap AM - for example, 
RelationGetNumberOfBlocks() returns uint32, while other AMs can have a bigger 
number of blocks.
* heapam_acquire_sample_rows() is a small function - I don't think it would be 
much trouble for any AM developer to write something similar.
* Some AMs may want single-level sampling (only algorithm Z from Vitter, for 
example) - why not?

As a result we get a single and clear method to acquire rows for statistics. If 
we don’t modify but rather extend the current API (for example, in the manner it 
is done for FDWs), the code becomes more complicated and difficult to understand.
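
To illustrate the single-level option from the list above: a minimal, 
self-contained sketch of plain reservoir sampling (algorithm R; Vitter's 
algorithm Z is the faster variant that skips rows instead of testing each one). 
An AM could run something like this directly over its tuple stream, with no 
block stage at all - the names here are illustrative, not taken from the patch.

#include <stdlib.h>

/* Keep a uniform random sample of up to targrows items from a stream. */
static void
reservoir_add(int *reservoir, int targrows, int *numrows,
			  double *seen, int item)
{
	*seen += 1;
	if (*numrows < targrows)
		reservoir[(*numrows)++] = item; /* reservoir not full yet */
	else
	{
		/* replace a random slot with probability targrows / seen */
		long		k = random() % (long) *seen;

		if (k < targrows)
			reservoir[k] = item;
	}
}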

> On 8 Dec 2020, at 18:42, Andrey Borodin wrote:
> 
> Hi Denis!
> 
>> On 7 Dec 2020, at 18:23, Denis Smirnov wrote:
>> 
>> I suggest a refactoring of analyze AM API as it is too much heap specific at 
>> the moment. The problem was inspired by Greenplum’s analyze improvement for 
>> append-optimized row and column AM with variable size compressed blocks.
>> Currently we do analyze in two steps.
>> 
>> 1. Sample fixed-size blocks with algorithm S from Knuth (the BlockSampler functions).
>> 2. Collect tuples into a reservoir with algorithm Z from Vitter.
>> 
>> So this doesn’t work, for example, for AMs that use variable-sized physical 
>> blocks. They need weighted random sampling (WRS) algorithms like A-Chao, or 
>> logical blocks to which Knuth’s algorithm S can be applied (and they have a 
>> problem with RelationGetNumberOfBlocks(), which estimates the number of 
>> physical blocks). Another problem concerns columns: they are not passed to 
>> the analyze begin-scan function and can’t benefit from column storage at 
>> ANALYZE TABLE (col).
>> 
>> The suggestion is to replace table_scan_analyze_next_block() and 
>> table_scan_analyze_next_tuple() with a single function: 
>> table_acquire_sample_rows(). The AM implementation of 
>> table_acquire_sample_rows() can use the BlockSampler functions if it wants 
>> to, but if the AM is not block-oriented, it could do something else. This 
>> suggestion also passes VacAttrStats to table_acquire_sample_rows() for 
>> column-oriented AMs and removes PROGRESS_ANALYZE_BLOCKS_TOTAL and 
>> PROGRESS_ANALYZE_BLOCKS_DONE definitions as not all AMs can be 
>> block-oriented.
> 
> Just a few random notes about the idea.
> Heap pages are not of fixed size when measured in tuple count, and a comment 
> in the code describes this:
> * Although every row has an equal chance of ending up in the final
> * sample, this sampling method is not perfect: not every possible
> * sample has an equal chance of being selected.  For large relations
> * the number of different blocks represented by the sample tends to be
> * too small.  We can live with that for now.  Improvements are welcome.
> 
> The current implementation provides a framework with shared code, though this 
> framework is only suitable for block-of-tuples AMs, and it has statistical 
> downsides when the tuple count per block varies too much.
> Maybe we can just provide a richer API? To have both: avoid copying code and 
> provide flexibility.
> 
> Best regards, Andrey Borodin.
> 





Re: libpq compression

2020-12-17 Thread Denis Smirnov
Hello all,

I’ve finally read the whole thread (it was huge). It is extremely sad that this 
patch has hung without progress for such a long time. It seems that the main 
problem in the discussion is that everyone has their own view of which problems 
should be solved by this patch. Here are some of the positions (not all of them):

1. Add compression for networks with bad bandwidth (and make the patch as 
simple and maintainable as possible) - the author’s position.
2. Don’t change the current network protocol and related code much.
3. Refactor the compression API (and network compression as well).
4. Solve cloud providers’ problems: buy network bandwidth on demand at the cost 
of CPU utilisation, and vice versa.

All of these requirements have a different nature and sometimes conflict with 
each other. Without clearly formulated requirements this patch will never be 
released.

Anyway, I have rebased it onto the current master branch, applied pgindent, 
tested it on macOS and fixed a macOS-specific problem with strcpy in 
build_compressors_list(): the behaviour is undefined when the source and 
destination strings overlap.
-   *client_compressors = src = dst = strdup(value);
+   *client_compressors = src = strdup(value);
+   dst = strdup(value);
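
The failing pattern, reduced to a minimal illustration (this is not the actual 
build_compressors_list() code, just the shape of the bug): once src and dst 
alias the same buffer, a later in-place copy hands strcpy() overlapping ranges, 
which is undefined behaviour; the fortified strcpy on macOS detects the overlap 
and traps, hence the crash.

#include <string.h>

int
main(void)
{
	char		buf[] = "zstd,zlib";

	strcpy(buf, buf + 1);		/* overlapping ranges: undefined behaviour */
	/* memmove(buf, buf + 1, strlen(buf + 1) + 1) is the safe in-place move */
	return 0;
}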

According to my very simple tests with randomly generated data, zstd gives 
about 3x compression (zlib has a slightly worse compression ratio and slightly 
higher CPU utilisation). This seems to be a normal ratio for any streaming data - 
Greenplum also uses zstd/zlib to compress append-optimized tables, and the 
compression ratio is usually about 3-5x. Also, from my Greenplum experience, 
the most commonly used zstd compression level is 1, while for zlib it is 
usually in the range of 1-5. CPU usage and execution time were not affected 
much compared to uncompressed data (but my tests were very simple and should 
not be treated as reliable).
From e33b9c99a6fdac54df40aa0c14eb43b6b1f3709d Mon Sep 17 00:00:00 2001
From: Denis Smirnov 
Date: Wed, 16 Dec 2020 22:02:49 +1000
Subject: [PATCH] Rebase patch-27 to current master and fix strcpy

1. Patch-27 had conflicts with current master. They were fixed.
2. On macOS psql failed with EXC_BAD_INSTRUCTION on the "compression"
   option. The problem was in the OS-specific strcpy implementation:
   "The source and destination strings should not overlap, as the
   behavior is undefined."
3. pgindent was applied to the code.
---
 configure | 102 +++-
 configure.ac  |  21 +
 .../postgres_fdw/expected/postgres_fdw.out|   2 +-
 doc/src/sgml/config.sgml  |  18 +
 doc/src/sgml/libpq.sgml   |  16 +
 doc/src/sgml/protocol.sgml|  97 +++
 src/Makefile.global.in|   1 +
 src/backend/Makefile  |   8 +
 src/backend/catalog/system_views.sql  |   9 +
 src/backend/libpq/pqcomm.c| 245 ++--
 src/backend/postmaster/pgstat.c   |  30 +
 src/backend/postmaster/postmaster.c   |  13 +-
 src/backend/utils/adt/pgstatfuncs.c   |  50 +-
 src/backend/utils/misc/guc.c  |  10 +
 src/common/Makefile   |   3 +-
 src/common/zpq_stream.c   | 578 ++
 src/include/catalog/pg_proc.dat   |  18 +-
 src/include/common/zpq_stream.h   |  36 ++
 src/include/libpq/libpq-be.h  |   3 +
 src/include/libpq/libpq.h |   1 +
 src/include/libpq/pqcomm.h|   1 +
 src/include/pg_config.h.in|   3 +
 src/include/pgstat.h  |   7 +
 src/interfaces/libpq/Makefile |  14 +
 src/interfaces/libpq/fe-connect.c |  83 ++-
 src/interfaces/libpq/fe-exec.c|   4 +-
 src/interfaces/libpq/fe-misc.c|  54 +-
 src/interfaces/libpq/fe-protocol3.c   | 162 -
 src/interfaces/libpq/libpq-int.h  |  20 +
 src/test/regress/expected/rules.out   |  14 +-
 src/tools/msvc/Mkvcbuild.pm   |   2 +-
 31 files changed, 1532 insertions(+), 93 deletions(-)
 create mode 100644 src/common/zpq_stream.c
 create mode 100644 src/include/common/zpq_stream.h

diff --git a/configure b/configure
index 11a4284e5b..a64125f3ec 100755
--- a/configure
+++ b/configure
@@ -699,6 +699,7 @@ LD
 LDFLAGS_SL
 LDFLAGS_EX
 with_zlib
+with_zstd
 with_system_tzdata
 with_libxslt
 XML2_LIBS
@@ -866,6 +867,7 @@ with_libxml
 with_libxslt
 with_system_tzdata
 with_zlib
+with_zstd
 with_gnu_ld
 enable_largefile
 '
@@ -8573,40 +8575,120 @@ fi
 
 
 #
-# Zlib
+# Assignments
 #
 
+CPPFLAGS="$CPPFLAGS $INCLUDES"
+LDFLAGS="$LDFLAGS $LIBDIRS"
 
 
-# Check whether --with-zlib was given.
-if test "${with_zlib+set}" = set; then :
-  withval=$with_zlib;
+
+
+#
+# ZStd
+#
+
+
+
+# Check whether --with-zstd was given.
+if test

Re: PoC Refactor AM analyse API

2020-12-30 Thread Denis Smirnov


> But why do you pass int natts and VacAttrStats **stats to 
> acquire_sample_rows()? Is it of any use? It seems to break abstraction too.

Yes, it is really a kluge that should be discussed. The main problem is that we 
don’t pass projection information to the analyze scan (analyze begin scan relies 
only on relation information during initialization). As a result we can’t 
benefit from column AMs during "ANALYZE t(col)" and consume data only from the 
target columns. These parameters were added to solve this problem.
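
As a hedged sketch of what those parameters enable (the callback shape follows 
the quoted question; the columnam_* name and the projection logic are 
illustrative assumptions, not patch code): a column-oriented AM could build a 
projection set from the VacAttrStats array and read only the target columns.

static int
columnam_acquire_sample_rows(Relation rel, int elevel,
							 BufferAccessStrategy bstrategy,
							 HeapTuple *rows, int targrows,
							 double *totalrows, double *totaldeadrows,
							 int natts, VacAttrStats **stats)
{
	Bitmapset  *proj = NULL;

	/* Collect the attribute numbers that ANALYZE actually asked for. */
	for (int i = 0; i < natts; i++)
		proj = bms_add_member(proj, stats[i]->tupattnum);

	/* ... open a column scan that reads only the "proj" columns ... */
	return 0;
}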

Best regards,
Denis Smirnov | Developer
s...@arenadata.io 
Arenadata | Godovikova 9-17, Moscow 129085 Russia





Re: PoC Refactor AM analyse API

2021-02-18 Thread Denis Smirnov
Thanks for your review, Heikki.

I have made the changes you have requested.

1. All modifications related to column projection were reverted (they 
should be added in https://commitfest.postgresql.org/31/2922 if the current 
patch is merged one day).
2. I have returned PROGRESS_ANALYZE_* states.
3. qsort() was moved into heapam_acquire_sample_rows(). Also a comment was 
added, that the acquire_sample_rows() AM function must return the tuples in a 
physical table order.
4. table_beginscan_analyze() was removed as a redundant function.
5. acquire_sample_rows() comment about reservoir was changed.

From a5104b15683524a375aab1beb9615f690de31882 Mon Sep 17 00:00:00 2001
From: Denis Smirnov 
Date: Sat, 5 Dec 2020 23:25:29 +1000
Subject: [PATCH v2] Refactor AM analyze (acquire rows method)

Analyze AM functions were implemented for fixed-size block AMs (a heap
influence). Other AMs with variable-size blocks are not able to use
the current two-stage block sampling (Knuth's algorithm S and Vitter's
algorithm Z). This refactoring gives AM developers more freedom to
implement analyze for non-fixed-size AMs.

Discussion: 
https://www.postgresql.org/message-id/flat/C7CFE16B-F192-4124-BEBB-7864285E0FF7%40arenadata.io
---
 src/backend/access/heap/heapam_handler.c | 199 ++-
 src/backend/access/table/tableamapi.c|   3 +-
 src/backend/commands/analyze.c   | 189 +
 src/include/access/tableam.h |  99 +++
 4 files changed, 227 insertions(+), 263 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 4a70e20a14..6bf0c887be 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -33,9 +33,11 @@
 #include "catalog/storage.h"
 #include "catalog/storage_xlog.h"
 #include "commands/progress.h"
+#include "commands/vacuum.h"
 #include "executor/executor.h"
 #include "miscadmin.h"
 #include "pgstat.h"
+#include "port.h"
 #include "storage/bufmgr.h"
 #include "storage/bufpage.h"
 #include "storage/lmgr.h"
@@ -44,6 +46,7 @@
 #include "storage/smgr.h"
 #include "utils/builtins.h"
 #include "utils/rel.h"
+#include "utils/sampling.h"
 
 static void reform_and_rewrite_tuple(HeapTuple tuple,
 									 Relation OldHeap, Relation NewHeap,
@@ -53,6 +56,8 @@ static bool SampleHeapTupleVisible(TableScanDesc scan, Buffer buffer,
 								   HeapTuple tuple,
 								   OffsetNumber tupoffset);
 
+static int compare_rows(const void *a, const void *b);
+
 static BlockNumber heapam_scan_get_blocks_done(HeapScanDesc hscan);
 
 static const TableAmRoutine heapam_methods;
@@ -1154,6 +1159,173 @@ heapam_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
return false;
 }
 
+/*
+ * As of May 2004 we use a new two-stage method:  Stage one selects up
+ * to targrows random blocks (or all blocks, if there aren't so many).
+ * Stage two scans these blocks and uses the Vitter algorithm to create
+ * a random sample of targrows rows (or less, if there are less in the
+ * sample of blocks).  The two stages are executed simultaneously: each
+ * block is processed as soon as stage one returns its number and while
+ * the rows are read stage two controls which ones are to be inserted
+ * into the sample.
+ *
+ * Although every row has an equal chance of ending up in the final
+ * sample, this sampling method is not perfect: not every possible
+ * sample has an equal chance of being selected.  For large relations
+ * the number of different blocks represented by the sample tends to be
+ * too small.  We can live with that for now.  Improvements are welcome.
+ *
+ * An important property of this sampling method is that because we do
+ * look at a statistically unbiased set of blocks, we should get
+ * unbiased estimates of the average numbers of live and dead rows per
+ * block.  The previous sampling method put too much credence in the row
+ * density near the start of the table.
+ */
+static int
+heapam_acquire_sample_rows(Relation rel, int elevel,
+						   BufferAccessStrategy bstrategy,
+						   HeapTuple *rows, int targrows,
+						   double *totalrows, double *totaldeadrows)
+{
+	int			numrows = 0;	/* # rows now in reservoir */
+	double		samplerows = 0; /* total # rows collected */
+	double		liverows = 0;	/* # live rows seen */
+	double		deadrows = 0;	/* # dead rows seen */
+	double		rowstosk

Re: PoC Refactor AM analyse API

2021-02-18 Thread Denis Smirnov
Hello,  Zhihong.

Thanks for your comments.

1. I am afraid that the equivalence of "floor(val + 0.5)" to "ceil(val)" is 
incorrect: "floor(2.1 + 0.5)" returns 2 while "ceil(2.1)" returns 3. We can’t 
use the replacement you have suggested.

2. >> For compare_rows(), it seems the computation of oa and ob can be delayed 
to when ba == bb (after the first two if statements).
I have checked some examples of the ASM code generated by different compilers 
with -O2/O3 flags on https://gcc.godbolt.org and didn’t see any big benefit in 
the resulting CPU instructions. You can check the attached example yourself.



simplified_compare_rows.c
Description: Binary data
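
Since the attachment is not preserved here, a sketch of the two variants under 
comparison: the first mirrors compare_rows() as it exists in 
src/backend/commands/analyze.c, the second is my reading of the suggested 
change, with the offset lookups delayed until the block numbers compare equal.

static int
compare_rows(const void *a, const void *b)
{
	HeapTuple	ha = *(const HeapTuple *) a;
	HeapTuple	hb = *(const HeapTuple *) b;
	BlockNumber ba = ItemPointerGetBlockNumber(&ha->t_self);
	OffsetNumber oa = ItemPointerGetOffsetNumber(&ha->t_self);
	BlockNumber bb = ItemPointerGetBlockNumber(&hb->t_self);
	OffsetNumber ob = ItemPointerGetOffsetNumber(&hb->t_self);

	if (ba < bb)
		return -1;
	if (ba > bb)
		return 1;
	if (oa < ob)
		return -1;
	if (oa > ob)
		return 1;
	return 0;
}

/* Suggested variant: delay the offset lookups until they are needed. */
static int
compare_rows_delayed(const void *a, const void *b)
{
	HeapTuple	ha = *(const HeapTuple *) a;
	HeapTuple	hb = *(const HeapTuple *) b;
	BlockNumber ba = ItemPointerGetBlockNumber(&ha->t_self);
	BlockNumber bb = ItemPointerGetBlockNumber(&hb->t_self);
	OffsetNumber oa;
	OffsetNumber ob;

	if (ba < bb)
		return -1;
	if (ba > bb)
		return 1;
	oa = ItemPointerGetOffsetNumber(&ha->t_self);
	ob = ItemPointerGetOffsetNumber(&hb->t_self);
	if (oa < ob)
		return -1;
	if (oa > ob)
		return 1;
	return 0;
}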


Best regards,
Denis Smirnov | Developer
s...@arenadata.io 
Arenadata | Godovikova 9-17, Moscow 129085 Russia



Re: Use virtual tuple slot for Unique node

2023-10-29 Thread Denis Smirnov
I have taken a look at this discussion and at the code, and I am confused about 
how we choose the tuple table slot (TTS) type in PG. Maybe you can clarify this 
topic for me.

1. Brief intro. There are four types of TTS. Plan tree "leaves":
- buffer heap (produced by index and table scans; has system columns and keeps 
shared buffer pins)
- heap (produced by FDWs; has system columns, but doesn’t keep any pins)
- minimal (produced by values and materialization nodes like sort, agg, etc.)
Plan "branches":
- virtual (keeps datum references to the columns of the tuples in the child 
nodes)

Virtual TTS is cheaper to copy up the plan (as we copy datum references), 
but more expensive to materialize (we have to construct a tuple from pieces).

Leaves are cheaper to materialize (essentially a memcpy under the hood), but 
very expensive to copy (we copy the value, not the datum reference).

2. If we take a look at the materialization nodes in the plan, they produce 
different result TTS types.
- Outer child's TTS type: gather, gather merge, lock rows, limit;
- Minimal: material, sort, incremental sort, memoize, unique, hash, setop (can 
be heap as well);
- Virtual: group, agg, window agg.

From my point of view, a materialization node should preserve the incoming 
TTS type. For the sort node (which materializes incoming tuples as minimal) it 
is OK to output minimal results as well. It looks like unique should use the 
outer child’s TTS type (instead of the hardcoded minimal one). But can anyone 
explain to me why group, agg and window agg return virtual tuples instead of 
the same TTS type as the outer child? Do we expect that the parent node exists 
and requires exactly virtual tuples (but what if the parent is a sort node that 
benefits from minimal TTS)? So it looks like we need to examine not only 
unique, but all the materialization nodes.
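
For what it's worth, a minimal sketch of the preserve-the-child's-type idea (an 
assumption to discuss, not a tested change): ExecGetResultSlotOps() already 
reports which slot ops a child node produces, so ExecInitUnique() could ask the 
outer child instead of hardcoding the minimal ops.

/* In ExecInitUnique(), after the outer plan has been initialized: */
const TupleTableSlotOps *ops =
	ExecGetResultSlotOps(outerPlanState(uniquestate), NULL);

ExecInitResultTupleSlotTL(&uniquestate->ps, ops);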






Re: Use virtual tuple slot for Unique node

2023-08-30 Thread Denis Smirnov
It looks like my patch was not analyzed by the hackers mailing list due to an 
incorrect MIME type, so I duplicate it here.
commit 2852a3f2fab8e723f208d81c1ad1eb6a6a377b09
Author: Denis Smirnov 
Date:   Thu Aug 31 08:51:14 2023 +0700

Change tuple table slot for Unique node to "virtual"

The Unique node does a very simple thing in the plan - it processes
the incoming sorted tuple stream and adds the unique tuples to the
resulting tuple table slot. Previously the Unique node palloc'ed
the results as "minimal" tuples. That is redundant, so now
we simply collect references to the child node's tuples with
"virtual" tuples.

diff --git a/src/backend/executor/nodeUnique.c b/src/backend/executor/nodeUnique.c
index 45035d74fa..c859add6e0 100644
--- a/src/backend/executor/nodeUnique.c
+++ b/src/backend/executor/nodeUnique.c
@@ -141,7 +141,7 @@ ExecInitUnique(Unique *node, EState *estate, int eflags)
 * Initialize result slot and type. Unique nodes do no projections, so
 * initialize projection info for this node appropriately.
 */
-   ExecInitResultTupleSlotTL(&uniquestate->ps, &TTSOpsMinimalTuple);
+   ExecInitResultTupleSlotTL(&uniquestate->ps, &TTSOpsVirtual);
uniquestate->ps.ps_ProjInfo = NULL;
 
/*


> On 31 Aug 2023, at 10:06, Denis Smirnov wrote:
> 
> 



Re: Use virtual tuple slot for Unique node

2023-08-31 Thread Denis Smirnov
I have done some small research and found out that though the patch itself is 
correct (i.e. we can benefit from replacing TTSOpsMinimalTuple with 
TTSOpsVirtual for the Unique node), my explanation of WHY was wrong.

1. We always materialize the new unique tuple in the slot, no matter what type 
of tuple table slot we use.
2. But virtual tuple materialization (tts_virtual_copyslot) has performance 
benefits over the minimal tuple one (tts_minimal_copyslot):
    - tts_minimal_copyslot() always allocates zeroed memory with palloc0 
      (expensive according to the flame graph);
    - tts_virtual_copyslot() doesn’t allocate additional memory if the tuple is 
      constructed from pass-by-value columns (though for variable-size columns 
      we still need an allocation);
    - if tts_virtual_copyslot() needs an allocation, it doesn’t need to zero 
      the memory.

So as a result we seriously benefit from virtual TTS when a Unique node in the 
plan handles tuples constructed from fixed-size pass-by-value columns.
commit 148642d81f046b7d72b3a40182c165e42a8ab6d7
Author: Denis Smirnov 
Date:   Thu Aug 31 08:51:14 2023 +0700

Change tuple table slot for Unique node to "virtual"

The Unique node uses the minimal TTS implementation to copy the unique
tuples from the sorted stream into the resulting tuple slot. But if
we replace the minimal TTS copy method with the virtual one, the
performance improves.

1. Minimal TTS always allocates zeroed memory for the materialized
   tuple.
2. Virtual TTS doesn't allocate additional memory for tuples whose
   columns are passed by value. For columns with external
   memory we don't need to zero the bytes but can simply take the
   memory chunk from the free list "as is".

diff --git a/src/backend/executor/nodeUnique.c b/src/backend/executor/nodeUnique.c
index 45035d74fa..c859add6e0 100644
--- a/src/backend/executor/nodeUnique.c
+++ b/src/backend/executor/nodeUnique.c
@@ -141,7 +141,7 @@ ExecInitUnique(Unique *node, EState *estate, int eflags)
 * Initialize result slot and type. Unique nodes do no projections, so
 * initialize projection info for this node appropriately.
 */
-   ExecInitResultTupleSlotTL(&uniquestate->ps, &TTSOpsMinimalTuple);
+   ExecInitResultTupleSlotTL(&uniquestate->ps, &TTSOpsVirtual);
uniquestate->ps.ps_ProjInfo = NULL;
 
/*
> On 31 Aug 2023, at 10:28, Denis Smirnov wrote: It looks like my patch was not 
> analyzed by the hackers mailing list due to an incorrect MIME type, so I 
> duplicate it here.
>> On 31 Aug 2023, at 10:06, Denis Smirnov wrote:

Re: Async-unsafe functions in signal handlers

2021-08-26 Thread Denis Smirnov
As far as I understand, the main problem with backtrace_symbols() is the 
internal malloc() call. A backend can lock up forever if malloc() is interrupted 
by a signal and then called again inside the signal handler.

At the moment Greenplum uses "addr2line -s -e" (on Linux) and "atos -o" (on 
macOS) for each stack address instead of backtrace_symbols(). Both of these 
utilities don’t use malloc() under the hood, although there is no guarantee 
that their implementations never change in the future. It seems to be a safer 
approach, but looks like a dirty hack.
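
The deadlock scenario, reduced to a minimal self-contained sketch (not 
PostgreSQL code): malloc() takes an internal lock, and if a signal arrives 
while that lock is held, a handler that calls malloc() again waits on a lock 
its own thread already owns.

#include <signal.h>
#include <stdlib.h>

static void
handler(int signo)
{
	/* async-unsafe: may re-enter malloc() while its lock is held */
	void	   *p = malloc(64);

	free(p);
}

int
main(void)
{
	signal(SIGUSR1, handler);
	for (;;)
	{
		/* the signal may land in the middle of this call */
		void	   *p = malloc(128);

		free(p);
	}
}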

> On 26 Aug 2021, at 08:52, Andrey Borodin wrote:
> 
> 
> 
>> On 25 Aug 2021, at 19:22, Denis Smirnov wrote:
>> 
>> I am going to refactor Greenplum backtraces for error messages and want to 
>> make it more compatible with PostgreSQL code. Backtraces in PostgreSQL were 
>> introduced by 71a8a4f6e36547bb060dbcc961ea9b57420f7190 commit (original 
>> discussion 
>> https://www.postgresql.org/message-id/CAMsr+YGL+yfWE=jvbubnpwtrrzney7hj07+zt4byjdvp4sz...@mail.gmail.com
>>  ) and rely on backtrace() and backtrace_symbols() functions. They are used 
>> inside errfinish() that is wrapped by ereport() macros. ereport() is invoked 
>> inside bgworker_die() and FloatExceptionHandler() signal handlers. I am 
>> confused by this fact - both backtrace functions are async-unsafe: 
>> backtrace_symbols() - always, backtrace() - only for the first call due to 
>> dlopen. I wonder why PostgreSQL uses async-unsafe functions in signal 
>> handlers?
> 
> In my view the GUC backtrace_functions is expected to be used for debug 
> purposes, not enabled on a production server for bgworker_die() or 
> FloatExceptionHandler().
> Is there any way to call backtrace_symbols() without touching 
> backtrace_functions?
> 
> Best regards, Andrey Borodin.
> 

Best regards,
Denis Smirnov | Developer
s...@arenadata.io 
Arenadata | Godovikova 9-17, Moscow 129085 Russia





Re: Async-unsafe functions in signal handlers

2021-08-27 Thread Denis Smirnov


> On 26 Aug 2021, at 23:39, Tom Lane wrote:
> 
> (BTW, I think it's pretty silly to imagine that adding backtrace()
> calls inside ereport is making things any more dangerous.  ereport
> has pretty much always carried a likelihood of calling malloc(),
> for example.)

I have taken a look through the signal handlers and found out that many of them 
use malloc() via ereport() and elog(). Here is the list:

SIGUSR1:
- procsignal_sigusr1_handler(): autoprewarm, autovacuum, bgworker, bgwriter, 
checkpointer, pgarch, startup, walwriter, walreceiver, walsender
- sigusr1_handler(): postmaster

SIGFPE:
- FloatExceptionHandler(): autovacuum, bgworker, postgres, plperl

SIGHUP:
- SIGHUP_handler(): postmaster

SIGCHLD:
- reaper(): postmaster

SIGQUIT:
- quickdie(): postgres

SIGTERM:
- bgworker_die(): bgworker

SIGALRM:
- handle_sig_alarm(): autovacuum, bgworker, postmaster, startup, walsender, 
postgres

I suspect there are lots of potential ways to deadlock on malloc() inside any 
of these handlers. An interesting question is why there is still no evidence of 
such deadlocks.

Best regards,
Denis Smirnov | Developer
s...@arenadata.io 
Arenadata | Godovikova 9-17, Moscow 129085 Russia





Re: Async-unsafe functions in signal handlers

2021-08-27 Thread Denis Smirnov


> On 28 Aug 2021, at 07:05, Andres Freund wrote:
> 
> However, we have a
> bandaid that deals with possible hangs, by SIGKILLing when processes don't
> shut down (at that point things have already gone quite south, so that's not
> an issue).

Thanks for the explanation. I can see that the child-process SIGKILL machinery 
was introduced by commit 82233ce7ea42d6ba519aaec63008aff49da6c7af to fix a 
malloc() deadlock in the quickdie() signal handler. As a result, all child 
processes that take too long to die are killed with SIGKILL in ServerLoop(). 
But bgworker_die() is a problem, as we initialize bgworkers right before 
ServerLoop(). So we can face a malloc() deadlock on postmaster startup (before 
ServerLoop() has started). Maybe we should simply use write() and exit() 
instead of ereport() for bgworker_die()?
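
A hedged sketch of what that replacement could look like (the message text and 
exit code are illustrative assumptions; only write() and _exit() are called, 
both of which are on the POSIX async-signal-safe list):

#include <unistd.h>

static void
bgworker_die_safe(int signo)
{
	static const char msg[] =
	"FATAL: terminating background worker due to administrator command\n";

	write(STDERR_FILENO, msg, sizeof(msg) - 1);
	_exit(1);
}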

Best regards,
Denis Smirnov | Developer
s...@arenadata.io 
Arenadata | Godovikova 9-17, Moscow 129085 Russia





Re: Async-unsafe functions in signal handlers

2021-08-30 Thread Denis Smirnov
Honestly, I don’t know what to do with bgworker_die(). At the moment it 
produces ereport(FATAL) with the async-unsafe proc_exit_prepare() and exit() 
under the hood. I can see three solutions:

1. Leave the code as is. Then SIGTERM can produce deadlocks in the bgworker's 
signal handler. The locked process can be terminated with an immediate shutdown 
<https://github.com/postgres/postgres/commit/82233ce7ea42d6ba519aaec63008aff49da6c7af>
 of the cluster. Maybe it is OK, as we don’t expect to send SIGTERM to a 
bgworker too often.

2. Use the async-safe _exit() in the signal handler instead of 
proc_exit_prepare() and exit(). In this case we’ll have to go through cluster 
recovery, as the bgworker doesn't properly clean up its shared memory. This 
solution is even worse than an immediate shutdown, as we recover for every 
SIGTERM that has been sent to a bgworker.

3. Set a signal flag inside the handler (something like the miscadmin.h 
XXX_INTERRUPTS() macros), as sketched below. It then becomes the extension 
developer's responsibility to properly handle this flag in the bgworker’s code. 
This approach breaks backward compatibility.
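
A minimal sketch of option 3, assuming illustrative names rather than the 
actual PostgreSQL macros: the handler only sets a volatile flag, and the 
worker's main loop checks it at a safe point.

#include <signal.h>

static volatile sig_atomic_t shutdown_requested = 0;

static void
handle_sigterm(int signo)
{
	shutdown_requested = 1;		/* async-safe: just set a flag */
}

static void
worker_main_loop(void)
{
	for (;;)
	{
		if (shutdown_requested)
			break;				/* the clean exit path runs outside the handler */
		/* ... do one unit of work ... */
	}
}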

Maybe I've missed a good solution; do you see one?

Best regards,
Denis Smirnov | Developer
s...@arenadata.io 
Arenadata | Godovikova 9-17, Moscow 129085 Russia



Async-unsafe functions in signal handlers

2021-08-25 Thread Denis Smirnov
Hello all,

I am going to refactor Greenplum backtraces for error messages and want to make 
it more compatible with PostgreSQL code. Backtraces in PostgreSQL were 
introduced by 71a8a4f6e36547bb060dbcc961ea9b57420f7190 commit (original 
discussion 
https://www.postgresql.org/message-id/CAMsr+YGL+yfWE=jvbubnpwtrrzney7hj07+zt4byjdvp4sz...@mail.gmail.com
 ) and rely on backtrace() and backtrace_symbols() functions. They are used 
inside errfinish() that is wrapped by ereport() macros. ereport() is invoked 
inside bgworker_die() and FloatExceptionHandler() signal handlers. I am 
confused by this fact - both backtrace functions are async-unsafe: 
backtrace_symbols() - always, backtrace() - only for the first call due to 
dlopen. I wonder why PostgreSQL uses async-unsafe functions in signal 
handlers?

Best regards,
Denis Smirnov | Developer
s...@arenadata.io 
Arenadata | Godovikova 9-17, Moscow 129085 Russia