Re: OpenSSL 3.0.0 compatibility
On Mon, Jun 01, 2020 at 10:44:17AM +0200, Peter Eisentraut wrote:
> Then we can stick a OPENSSL_API_COMPAT=908 into at least PG10, 11, and 12,
> and probably also into PG9.5 and 9.6 without harm.

FWIW, I am fine with that for PG >= 10, and I don't think that it is worth
bothering with PG <= 9.6.
--
Michael
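For context, the proposal amounts to roughly this (a sketch; the exact value
and where the define lands in the tree are assumptions, 0x00908000L being
the version-number form of the "908" above):

	/* must be defined before any OpenSSL header is included */
	#define OPENSSL_API_COMPAT 0x00908000L
	#include <openssl/ssl.h>

This tells OpenSSL that the code is written against the 0.9.8 API, so newer
headers keep those declarations available without deprecation warnings.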
Re: valgrind error
On Sun, May 10, 2020 at 09:29:05AM -0400, Andrew Dunstan wrote:
> On 4/18/20 9:15 AM, Andrew Dunstan wrote:
> > I was just trying to revive lousyjack, my valgrind buildfarm animal
> > which has been offline for 12 days, after having upgraded the machine
> > (fedora 31, gcc 9.3.1, valgrind 3.15) and noticed lots of errors like this:
> > {
> >    Memcheck:Value8
> >    fun:pg_comp_crc32c_sb8
> >    fun:XLogRecordAssemble
> >    fun:XLogInsert
> >    fun:LogCurrentRunningXacts
> >    fun:LogStandbySnapshot
> >    fun:CreateCheckPoint
> >    fun:CheckpointerMain
> >    fun:AuxiliaryProcessMain
> >    fun:StartChildProcess
> >    fun:reaper
> >    obj:/usr/lib64/libpthread-2.30.so
> >    fun:select
> >    fun:ServerLoop
> >    fun:PostmasterMain
> >    fun:main
> > }
>
> After many hours of testing I have a culprit for this. The error appears
> with valgrind 3.15.0 with everything else held constant. 3.14.0 does
> not produce the problem.

I suspect 3.15.0 is just better at tracking the uninitialized data. A
more-remote possibility is valgrind-3.14.0 emulating sse42. That would
make pg_crc32c_sse42_available() return true, avoiding the
pg_comp_crc32c_sb8().

> andrew@freddo:bf (master)*$ lscpu ...
> Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr
> pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext
> fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl
> nonstop_tsc cpuid extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy
> svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs
> skinit wdt hw_pstate vmmcall npt lbrv svm_lock nrip_save
>
> I did not manage to reproduce this anywhere else, tried on various
> physical, Virtualbox and Docker instances.

I can reproduce this on a 2017-vintage CPU with ./configure ...
USE_SLICING_BY_8_CRC32C=1 and then running "make installcheck-parallel"
under valgrind-3.15.0 (as packaged by RHEL 7.8). valgrind.supp has a
suppression for CRC calculations, but it didn't get the memo when commit
4f700bc renamed the function. The attached patch fixes the suppression.

Author: Noah Misch
Commit: Noah Misch

    Refresh function name in CRC-associated Valgrind suppressions.

    Back-patch to 9.5, where commit 4f700bcd20c087f60346cb8aefd0e269be8e2157
    first appeared. Reviewed by FIXME. Reported by Andrew Dunstan.

    Discussion: https://postgr.es/m/4dfabec2-a3ad-0546-2d62-f816c97ed...@2ndquadrant.com

diff --git a/src/tools/valgrind.supp b/src/tools/valgrind.supp
index ec47a22..acdb620 100644
--- a/src/tools/valgrind.supp
+++ b/src/tools/valgrind.supp
@@ -45,7 +45,7 @@
 	padding_XLogRecData_CRC
 	Memcheck:Value8
 
-	fun:pg_comp_crc32c
+	fun:pg_comp_crc32c*
 	fun:XLogRecordAssemble
 }
 
@@ -89,7 +89,7 @@
 {
 	padding_twophase_CRC
 	Memcheck:Value8
 
-	fun:pg_comp_crc32c
+	fun:pg_comp_crc32c*
 	fun:EndPrepare
 }
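To see the class of error being suppressed, a toy program like the
following (illustrative only, not PostgreSQL code) produces the same kind
of Memcheck:Value8 report: the branch depends on a value computed from
uninitialized struct padding, and Memcheck complains at the *use* of that
value inside the checksum function's caller chain - which is why the
suppression matches on the CRC function plus its caller:

	#include <stdio.h>

	struct rec
	{
		char	flag;			/* typically 3 bytes of padding follow */
		int		len;
	};

	static unsigned
	toy_crc(const unsigned char *p, size_t n)
	{
		unsigned	crc = ~0u;

		while (n-- > 0)
			crc = (crc >> 1) ^ ((*p++ & 1) ? 0xEDB88320u : 0);
		return ~crc;
	}

	int
	main(void)
	{
		struct rec	r;			/* deliberately not zeroed */

		r.flag = 'x';
		r.len = 42;
		/* Memcheck reports "use of uninitialised value" at this branch */
		if (toy_crc((const unsigned char *) &r, sizeof(r)) == 0)
			puts("unlikely");
		return 0;
	}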
Re: BufFileRead() error signalling
On Fri, Jun 05, 2020 at 06:03:59PM +1200, Thomas Munro wrote:
> I didn't change BufFileWrite() to be void, to be friendly to existing
> callers outside the tree (if there are any), though I removed all the
> code that checks the return code.  We can make it void later.

You are missing one entry in AppendStringToManifest(). It sounds right not
to change the signature of the routine in back-branches, to avoid any ABI
breakage. I think that it could change on HEAD. Anyway, why are we sure
that it is fine not to complain even if BufFileWrite() does a partial
write? fwrite() is mentioned at the top of the thread, but why is that OK?

> For the future: it feels a bit like we're missing a one line way to
> say "read this many bytes and error out if you run out".

-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not write block %ld of temporary file: %m",
-							blknum)));
-	}
+		elog(ERROR, "could not seek block %ld temporary file", blknum);

You mean "in temporary file" in the new message, no?

+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not write to \"%s\" : %m",
+							FilePathName(thisfile;

Nit: "could not write [to] file \"%s\": %m" is a more common error string.
--
Michael
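On the "read this many bytes and error out" point, a minimal sketch of such
a helper (the name and message wording are illustrative, not an actual
PostgreSQL API):

	static void
	BufFileReadExact(BufFile *file, void *ptr, size_t size)
	{
		size_t		nread = BufFileRead(file, ptr, size);

		if (nread != size)
			ereport(ERROR,
					(errcode_for_file_access(),
					 errmsg("could not read from temporary file: read only %zu of %zu bytes",
							nread, size)));
	}

Callers that currently check BufFileRead()'s return value by hand could
then be one-liners.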
Make more use of RELKIND_HAS_STORAGE()
This is a patch to make use of RELKIND_HAS_STORAGE() where appropriate, instead of listing out the relkinds individually. No behavior change is intended. This was originally part of the patch from [0], but it seems worth moving forward independently. [0]: https://www.postgresql.org/message-id/flat/dc35a398-37d0-75ce-07ea-1dd71d98f8ec%402ndquadrant.com -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services From e5135e2324ff38dc03998af00268f96b6725cd23 Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Fri, 5 Jun 2020 08:57:28 +0200 Subject: [PATCH] Make more use of RELKIND_HAS_STORAGE() Make use of RELKIND_HAS_STORAGE() where appropriate, instead of listing out the relkinds individually. No behavior change intended. --- src/backend/catalog/heap.c | 7 +--- src/backend/postmaster/pgstat.c | 6 +-- src/backend/utils/adt/dbsize.c | 73 + 3 files changed, 31 insertions(+), 55 deletions(-) diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c index e393c93a45..9c45544815 100644 --- a/src/backend/catalog/heap.c +++ b/src/backend/catalog/heap.c @@ -1943,13 +1943,8 @@ heap_drop_with_catalog(Oid relid) /* * Schedule unlinking of the relation's physical files at commit. */ - if (rel->rd_rel->relkind != RELKIND_VIEW && - rel->rd_rel->relkind != RELKIND_COMPOSITE_TYPE && - rel->rd_rel->relkind != RELKIND_FOREIGN_TABLE && - rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE) - { + if (RELKIND_HAS_STORAGE(rel->rd_rel->relkind)) RelationDropStorage(rel); - } /* * Close relcache entry, but *keep* AccessExclusiveLock on the relation diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c index d7f99d9944..166d8e3d15 100644 --- a/src/backend/postmaster/pgstat.c +++ b/src/backend/postmaster/pgstat.c @@ -1807,11 +1807,7 @@ pgstat_initstats(Relation rel) charrelkind = rel->rd_rel->relkind; /* We only count stats for things that have storage */ - if (!(relkind == RELKIND_RELATION || - relkind == RELKIND_MATVIEW || - relkind == RELKIND_INDEX || - relkind == RELKIND_TOASTVALUE || - relkind == RELKIND_SEQUENCE)) + if (!RELKIND_HAS_STORAGE(relkind)) { rel->pgstat_info = NULL; return; diff --git a/src/backend/utils/adt/dbsize.c b/src/backend/utils/adt/dbsize.c index 840664429e..2320c06a9b 100644 --- a/src/backend/utils/adt/dbsize.c +++ b/src/backend/utils/adt/dbsize.c @@ -874,25 +874,18 @@ pg_relation_filenode(PG_FUNCTION_ARGS) PG_RETURN_NULL(); relform = (Form_pg_class) GETSTRUCT(tuple); - switch (relform->relkind) + if (RELKIND_HAS_STORAGE(relform->relkind)) { - case RELKIND_RELATION: - case RELKIND_MATVIEW: - case RELKIND_INDEX: - case RELKIND_SEQUENCE: - case RELKIND_TOASTVALUE: - /* okay, these have storage */ - if (relform->relfilenode) - result = relform->relfilenode; - else/* Consult the relation mapper */ - result = RelationMapOidToFilenode(relid, - relform->relisshared); - break; - - default: - /* no storage, return NULL */ - result = InvalidOid; - break; + if (relform->relfilenode) + result = relform->relfilenode; + else/* Consult the relation mapper */ + result = RelationMapOidToFilenode(relid, + relform->relisshared); + } + else + { + /* no storage, return NULL */ + result = InvalidOid; } ReleaseSysCache(tuple); @@ -951,38 +944,30 @@ pg_relation_filepath(PG_FUNCTION_ARGS) PG_RETURN_NULL(); relform = (Form_pg_class) GETSTRUCT(tuple); - switch (relform->relkind) + if (RELKIND_HAS_STORAGE(relform->relkind)) + { + /* This logic should match RelationInitPhysicalAddr */ + if (relform->reltablespace) + 
rnode.spcNode = relform->reltablespace; + else + rnode.spcNode = MyDatabaseTableSpace; + if (rnode.spcNode == GLOBALTABLESPACE_OID) + rnode.dbNode = InvalidOid; + else + rnode.dbNode
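For reference, the macro being substituted in throughout the patch (as it
stands in src/include/catalog/pg_class.h around this time; quoted here for
convenience) is just the same relkind list, centralized in one place:

	#define RELKIND_HAS_STORAGE(relkind) \
		((relkind) == RELKIND_RELATION || \
		 (relkind) == RELKIND_INDEX || \
		 (relkind) == RELKIND_SEQUENCE || \
		 (relkind) == RELKIND_TOASTVALUE || \
		 (relkind) == RELKIND_MATVIEW)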
Re: OpenSSL 3.0.0 compatibility
On 2020-05-30 11:29, Peter Eisentraut wrote: I suggest to update the test data in PG13+, since we require OpenSSL 1.0.1 there. For the older branches, I would look into changing the test driver setup so that it loads a custom openssl.cnf that loads the legacy providers. I have pushed your 0002 patch (the one that updates the test data) to master. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: proposal - function string_to_table
Hi

On Thu, Jun 4, 2020 at 11:49 AM, movead...@highgo.ca wrote:

> +{ oid => '2228', descr => 'split delimited text',
> +  proname => 'string_to_table', prorows => '1000', proretset => 't',
> +  prorettype => 'text', proargtypes => 'text text',
> +  prosrc => 'text_to_table' },
> +{ oid => '2282', descr => 'split delimited text with null string',
> +  proname => 'string_to_table', prorows => '1000', proretset => 't',
> +  prorettype => 'text', proargtypes => 'text text text',
> +  prosrc => 'text_to_table_null' },
>
> I went through the patch, and everything looks good to me. But I do not
> know why it needs a 'text_to_table_null()'; it's OK to put a
> 'text_to_table' there, I think.

It is a convention in Postgres - every unique SQL signature has its own
unique internal C function.

I am sending a refreshed patch.

Regards

Pavel

> --
> Regards,
> Highgo Software (Canada/China/Pakistan)
> URL : www.highgo.ca
> EMAIL: mailto:movead(dot)li(at)highgo(dot)ca

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 7c06afd3ea..974499ff2e 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -3437,6 +3437,27 @@
 repeat('Pg', 4) PgPgPgPg
+
+
+
+ string_to_table
+
+string_to_table ( string text, delimiter text , nullstr text )
+set of text
+
+
+splits string into table using supplied delimiter and
+optional null string.
+
+
+string_to_table('xx~^~yy~^~zz', '~^~', 'yy')
+
+xx
+yy,
+zz
+
+
+
@@ -16717,21 +16738,6 @@
 strict $.track.segments[*].location
-
-
-string starts with string
-boolean
-
-
-Tests whether the second operand is an initial substring of the first
-operand.
-
-
-jsonb_path_query('["John Smith", "Mary Stone", "Bob Johnson"]', '$[*] ? (@ starts with "John")')
-"John Smith"
-
-
-
 exists ( path_expression )
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 2eaabd6231..650813d8b8 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -26,6 +26,7 @@
 #include "lib/hyperloglog.h"
 #include "libpq/pqformat.h"
 #include "miscadmin.h"
+#include "nodes/execnodes.h"
 #include "parser/scansup.h"
 #include "port/pg_bswap.h"
 #include "regex/regex.h"
@@ -35,6 +36,7 @@
 #include "utils/memutils.h"
 #include "utils/pg_locale.h"
 #include "utils/sortsupport.h"
+#include "utils/tuplestore.h"
 #include "utils/varlena.h"
@@ -92,6 +94,16 @@
 	pg_locale_t locale;
 } VarStringSortSupport;
 
+/*
+ * Holds target metadata used for split string to array or to table.
+ */
+typedef struct
+{
+	ArrayBuildState *astate;
+	Tuplestorestate *tupstore;
+	TupleDesc	tupdesc;
+} SplitStringTargetData;
+
 /*
  * This should be large enough that most strings will fit, but small enough
  * that we feel comfortable putting it on the stack
@@ -139,7 +151,7 @@
 static bytea *bytea_substring(Datum str, bool length_not_specified);
 static bytea *bytea_overlay(bytea *t1, bytea *t2, int sp, int sl);
 static void appendStringInfoText(StringInfo str, const text *t);
-static Datum text_to_array_internal(PG_FUNCTION_ARGS);
+static bool text_to_array_internal(FunctionCallInfo fcinfo, SplitStringTargetData *tstate);
 static text *array_to_text_internal(FunctionCallInfo fcinfo, ArrayType *v,
 									const char *fldsep, const char *null_string);
 static StringInfo makeStringAggState(FunctionCallInfo fcinfo);
@@ -4679,7 +4691,19 @@
 Datum
 text_to_array(PG_FUNCTION_ARGS)
 {
-	return text_to_array_internal(fcinfo);
+	SplitStringTargetData tstate;
+
+	/* reset tstate */
+	memset(&tstate, 0, sizeof(tstate));
+
+	if (!text_to_array_internal(fcinfo, &tstate))
+		PG_RETURN_NULL();
+
+	if (!tstate.astate)
+		PG_RETURN_ARRAYTYPE_P(construct_empty_array(TEXTOID));
+
+	PG_RETURN_ARRAYTYPE_P(makeArrayResult(tstate.astate,
+										  CurrentMemoryContext));
 }
 
 /*
@@ -4693,16 +4717,98 @@
 Datum
 text_to_array_null(PG_FUNCTION_ARGS)
 {
-	return text_to_array_internal(fcinfo);
+	return text_to_array(fcinfo);
+}
+
+/*
+ * text_to_table
+ * Parse input string and returns substrings as a table.
+ */
+Datum
+text_to_table(PG_FUNCTION_ARGS)
+{
+	ReturnSetInfo *rsi = (ReturnSetInfo *) fcinfo->resultinfo;
+	SplitStringTargetData tstate;
+	MemoryContext old_cxt;
+
+	/* check to see if caller supports us returning a tuplestore */
+	if (rsi == NULL || !IsA(rsi, ReturnSetInfo))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("set-valued function called in context that cannot accept a set")));
+
+	if (!(rsi->allowedModes & SFRM_Materialize))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_S
minor doc fix - garbage in example of result of unnest
Hi

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 7c06afd3ea..3b810c0eb4 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -17821,7 +17821,6 @@ SELECT NULLIF(value, '(none)') ...
 1
 2
-(2 rows in result)

Regards

Pavel
Re: [Proposal] Page Compression for OLTP
Hi hackers,

> # Page storage (Plan C)
>
> Further, since the size of the compress address file is fixed, the above
> address file and data file can also be combined into one file
>
>  0      1      2           123071     0       1       2
> +======+======+======+    ...    +=======+=======+=======+
> | head |  1   |  2   |    ...    | data1 | data2 |  ...
> +======+======+======+    ...    +=======+=======+=======+
>   head |      address      |            data |

I made a prototype according to the above storage method. Any suggestions
are welcome.

# Page compress file storage related definitions

/*
 * layout of Page Compress file:
 *
 * - PageCompressHeader
 * - PageCompressAddr[]
 * - chunks of PageCompressData
 */
typedef struct PageCompressHeader
{
	pg_atomic_uint32 nblocks;			/* number of total blocks in this segment */
	pg_atomic_uint32 allocated_chunks;	/* number of total allocated chunks in data area */
	uint16		chunk_size;				/* size of each chunk, must be 1/2, 1/4 or 1/8 of BLCKSZ */
	uint8		algorithm;				/* compress algorithm, 1=pglz, 2=lz4 */
} PageCompressHeader;

typedef struct PageCompressAddr
{
	uint8		nchunks;				/* number of chunks for this block */
	uint8		allocated_chunks;		/* number of allocated chunks for this block */

	/*
	 * variable-length field: 1-based chunk number array for this block;
	 * size of the array must be 2, 4 or 8
	 */
	pc_chunk_number_t chunknos[FLEXIBLE_ARRAY_MEMBER];
} PageCompressAddr;

typedef struct PageCompressData
{
	char		page_header[SizeOfPageHeaderData];	/* page header */
	uint32		size;								/* size of compressed data */
	char		data[FLEXIBLE_ARRAY_MEMBER];		/* compressed page, except for the page header */
} PageCompressData;

# Usage

Set whether to use compression through storage parameters of tables and
indexes:

- compress_type
  Sets whether to compress and which compression algorithm is used;
  supported values: none, pglz, zstd

- compress_chunk_size
  A chunk is the smallest unit of storage space allocated for compressed
  pages. The size of a chunk can only be 1/2, 1/4 or 1/8 of BLCKSZ

- compress_prealloc_chunks
  The number of chunks pre-allocated for each page. The maximum value
  allowed is BLCKSZ/compress_chunk_size - 1. If the number of chunks
  required for a compressed page is less than compress_prealloc_chunks,
  it allocates compress_prealloc_chunks chunks to avoid future storage
  fragmentation when the page needs more storage space.
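If I read the combined-file layout right, locating a chunk is then a
simple linear computation over a fixed-size address area; a sketch (my
reading, not code from the prototype):

	/*
	 * addr_area_size covers the PageCompressHeader plus the fixed-size
	 * PageCompressAddr[] array; chunk numbers are 1-based.
	 */
	static off_t
	pc_chunk_offset(off_t addr_area_size, int chunk_size,
					pc_chunk_number_t chunkno)
	{
		return addr_area_size + (off_t) (chunkno - 1) * chunk_size;
	}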
# Sample

## requirement

- zstd

## build

./configure --with-zstd
make
make install

## create compressed table and index

create table tb1(id int,c1 text);
create table tb1_zstd(id int,c1 text) with(compress_type=zstd,compress_chunk_size=1024);
create table tb1_zstd_4(id int,c1 text) with(compress_type=zstd,compress_chunk_size=1024,compress_prealloc_chunks=4);

create index tb1_idx_id on tb1(id);
create index tb1_idx_id_zstd on tb1(id) with(compress_type=zstd,compress_chunk_size=1024);
create index tb1_idx_id_zstd_4 on tb1(id) with(compress_type=zstd,compress_chunk_size=1024,compress_prealloc_chunks=4);

create index tb1_idx_c1 on tb1(c1);
create index tb1_idx_c1_zstd on tb1(c1) with(compress_type=zstd,compress_chunk_size=1024);
create index tb1_idx_c1_zstd_4 on tb1(c1) with(compress_type=zstd,compress_chunk_size=1024,compress_prealloc_chunks=4);

insert into tb1 select generate_series(1,1000000),md5(random()::text);
insert into tb1_zstd select generate_series(1,1000000),md5(random()::text);
insert into tb1_zstd_4 select generate_series(1,1000000),md5(random()::text);

## show size of table and index

postgres=# \d+
                          List of relations
 Schema |    Name    | Type  |  Owner   | Persistence | Size  | Description
--------+------------+-------+----------+-------------+-------+-------------
 public | tb1        | table | postgres | permanent   | 65 MB |
 public | tb1_zstd   | table | postgres | permanent   | 37 MB |
 public | tb1_zstd_4 | table | postgres | permanent   | 37 MB |
(3 rows)

postgres=# \di+
                                 List of relations
 Schema |       Name        | Type  |  Owner   | Table | Persistence | Size  | Description
--------+-------------------+-------+----------+-------+-------------+-------+-------------
 public | tb1_idx_c1        | index | postgres | tb1   | permanent   | 73 MB |
 public | tb1_idx_c1_zstd   | index | postgres | tb1   | permanent   | 36 MB |
 public | tb1_idx_c1_zstd_4 | index | postgres | tb1   | permanent   | 41 MB |
 public | tb1_idx_id        | index | postgres | tb1   | permanent   | 21 MB |
 public | tb1_idx_id_zstd   | index | postgres | tb1   | permanent   | 13 MB |
 public | tb1_idx_id_zstd_4 | index | postgres | tb1   | permanent   | 15 MB |
(6 rows)

# pgbench performance testing (TPC-B) Compre
Re: [PATCH] pg_dump: Add example and link for --encoding option
I've modified my previous patch because it linked the wrong document, so I
fixed it, and I added a highlight to the encoding list for readability.

What do you think about this change?

Thanks,
Dong Wook

On Thu, Jun 4, 2020 at 10:20 PM, 이동욱 wrote:
> To let users know what kind of character set
> can be used, add examples and a link to the --encoding option.
>
> Thanks,
> Dong wook

v2-0001-pg_dump-Add-example-and-link-for-encoding-option.patch
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762
On Thu, 4 Jun 2020 at 19:46, Alvaro Herrera wrote:
> On 2020-Jun-04, Andres Freund wrote:
>
> > postgres[52656][1]=# SELECT 1;
> > ┌──────────┐
> > │ ?column? │
> > ├──────────┤
> > │        1 │
> > └──────────┘
> > (1 row)
> >
> > I am very much not in love with the way that was implemented, but it's
> > there, and it's used as far as I know (cf tablesync.c).
>
> Ouch ... so they made IDENT in the replication grammar be a trigger to
> enter the regular grammar.  Crazy.  No way to put those worms back in
> the tin now, I guess.

Is that documented?

> It is still my opinion that we should prohibit a logical replication
> connection from being used to do physical replication.  Horiguchi-san,
> Sawada-san and Masao-san are all of the same opinion.  Dave Cramer (of
> the JDBC team) is not opposed to the change -- he says they're just
> using it because they didn't realize they should be doing differently.

I think my exact words were:

"I don't see this is a valid reason to keep doing something. If it is
broken then fix it. Clients can deal with the change."

in response to:

> Well, I don't really think that we can just break a behavior that
> exists since 9.4 as you could break applications relying on the
> existing behavior, and that's also the point of Vladimir upthread.

Which is different than not being opposed to the change. I don't see this
as broken, and it's quite possible that some of our users are using it.

It certainly needs to be documented.

Dave
Re: Removal of currtid()/currtid2() and some table AM cleanup
On 2020/06/05 15:22, Michael Paquier wrote:
> On Wed, Jun 03, 2020 at 10:10:21PM +0900, Inoue, Hiroshi wrote:
> > On 2020/06/03 11:14, Michael Paquier wrote:
> > > I have been looking at the ODBC driver and the need for currtid() as
> > > well as currtid2(), and as mentioned already in [1], matching with my
> > > lookup of things, these are actually not needed by the driver as long
> > > as we connect to a server newer than 8.2 able to support RETURNING.
> >
> > Though currtid2() is necessary even for servers which support
> > RETURNING, I don't object to remove it.
>
> In which cases is it getting used then?

Keyset-driven cursors always detect changes made by other applications
(and themselves).  currtid() is necessary to detect the changes.
CTIDs are changed by updates, unfortunately.

regards,
Hiroshi Inoue

> From what I can see there is zero coverage for that part in the tests.
> And based on a rough read of the code, this would get called with
> LATEST_TUPLE_LOAD being set, where there is some kind of bulk deletion
> involved.  Couldn't that be a problem?
> --
> Michael
Re: significant slowdown of HashAggregate between 9.6 and 10
On Thu, Jun 04, 2020 at 06:57:58PM -0700, Andres Freund wrote: Hi, On 2020-06-04 18:22:03 -0700, Jeff Davis wrote: On Thu, 2020-06-04 at 11:41 -0700, Andres Freund wrote: > +/* minimum number of initial hash table buckets */ > +#define HASHAGG_MIN_BUCKETS 256 > > > I don't really see much explanation for that part in the commit, > perhaps > Jeff can chime in? I did this in response to a review comment (point #5): https://www.postgresql.org/message-id/20200219191636.gvdywx32kwbix6kd@development Tomas suggested a min of 1024, and I thought I was being more conservative choosing 256. Still too high, I guess? I can lower it. What do you think is a reasonable minimum? I don't really see why there needs to be a minimum bigger than 1? I think you're right. I think I was worried about having to resize the hash table in case of an under-estimate, and it seemed fine to waste a tiny bit more memory to prevent that. But this example shows we may need to scan the hash table sequentially, which means it's not just about memory consumption. So in hindsight we either don't need the limit at all, or maybe it could be much lower (IIRC it reduces probability of collision, but maybe dynahash does that anyway internally). I wonder if hashjoin has the same issue, but probably not - I don't think we'll ever scan that internal hash table sequentially. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Internal key management system
Hello Bruce,

>> Hmmm. This seems to suggest that interacting with something outside
>> should be an option.
>
> Our goal is not to implement every possible security idea someone has,
> because we will never finish, and the final result would be too complex
> to be usable.

Sure. I'm trying to propose something both simple and extensible, so that
other people could plug in their own KMS if they are not fully satisfied
with the way the internal pg KMS works, which IMHO should be the case if
someone is motivated and paranoid enough to set up a KMS in the first
place.

> You will need to explain exactly why having a separate process has value
> over coding/user complexity, and you will need to get agreement from a
> sufficient number of people to move that idea forward.

ISTM that the value is simple: The whole KMS idea turns around a "KEK",
which is a secret key that allows to unlock/retrieve/recompute many data
keys, aka DEKs. Losing the KEK basically means losing all data keys,
past, present and possibly future, depending on how the KEK/DEK mechanism
operates internally. So the thing you should not want is to lose your KEK.

Keeping it inside the pg process means that any pg process compromise
would result in the KEK being compromised as well, while the whole point
of doing this KMS business was to provide security by isolating realms of
data encryption. If you provide an interface instead, which I'm
advocating, then where the KEK is does not concern pg, which just has to
ask for DEKs. A compromise of pg would compromise local DEKs, but not the
KEK "master key". The KEK may be somewhere on the same host, in another
process, or possibly on another host, on some attached specialized
quantum hardware inaccessible to human beings. Postgres should not decide
where the user should put its KEK, because that would depend on the
threat model being addressed, which you do not know.

From an implementation point of view, what I suggest is reasonably simple
and allows people to interface with the KMS of their choice, including
one based on the patch, which would be a demo of what can be done, but
other systems would be accessible just as well. The other software
engineering aspect is that a KMS is a complex/sensitive thing, so
reinventing our own and forcing it on users seems like a bad idea.

From what I understood from the code, the KEK is loaded into the postgres
process. That is what I'm disagreeing with; only the needed DEKs should
be there.

> One option would be to send the data needing to be encrypted to an
> external command, and get the decrypted data back. In that way, the KEK
> is never on the Postgres server. However, the API for doing such an
> interface seems very complex and could lead to failures.

I was more thinking of an interface to retrieve DEKs, but to still keep
encryption/decryption inside postgres, to limit traffic, but what you
suggest could be a valid option, so maybe it should be allowed. I
disagree about the implementation complexity, though. Basically the
protocol's only function is sending "GET <key>" and retrieving a response
which is either the DEK or an error, which looks like a manageable
complexity. Attached is a simplistic server-side implementation of that
for illustration. If you want externalized encryption/decryption as well,
it would be sending "ENC/DEC <data>" and the response is an error, or the
translated data. Looks manageable as well. Allowing both approaches looks
ok.
Obviously it requires some more thinking and design, but my point is that
postgres should not hold a KEK, ever, nor presume how DEKs are to be
managed by a KMS, and that is not very difficult to achieve by putting it
outside of pg and defining how interactions take place. Providing a
reference/example implementation would be nice as well, and Masahiko-san's
code can be rewrapped quite easily.

--
Fabien.

#! /usr/bin/env python3
#
# This (powerful) code is public domain.
import sys
from hashlib import sha256 as H

# obviously the KEK should be retrieved from somewhere else
KEK = b"super secret default KEK"

while True:
    try:
        # parse GET 0xdeadbeef
        req = sys.stdin.readline()
        if not req:
            # stop at EOF instead of looping on errors
            break
        assert req[0:6] == "GET 0x" and req[-1] == "\n"
        hdata = bytes.fromhex(req[6:-1])
        # flush so the reply is not stuck in stdio buffering on a pipe
        print(f"DEK 0x{H(hdata + KEK).hexdigest()}", flush=True)
    except Exception as e:
        print(f"ERR {e}", flush=True)
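A matching client-side sketch in C (my illustration of the line protocol
above; the pipe setup and error handling are assumptions, not part of any
patch):

	#include <stdio.h>
	#include <string.h>

	/* Ask the external KMS for the DEK derived from "keyid" (hex, no 0x). */
	static int
	kms_get_dek(FILE *to_kms, FILE *from_kms, const char *keyid,
				char *dek_hex, size_t dek_hex_size)
	{
		char		line[512];

		fprintf(to_kms, "GET 0x%s\n", keyid);
		fflush(to_kms);

		if (fgets(line, sizeof(line), from_kms) == NULL)
			return -1;			/* helper process died */
		if (strncmp(line, "DEK 0x", 6) != 0)
			return -1;			/* "ERR ..." or garbage */

		/* strip trailing newline and copy out the hex-encoded key */
		line[strcspn(line, "\n")] = '\0';
		if (strlen(line + 6) >= dek_hex_size)
			return -1;
		strcpy(dek_hex, line + 6);
		return 0;
	}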
Re: minor doc fix - garbage in example of result of unnest
Pavel Stehule writes:
> diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
> index 7c06afd3ea..3b810c0eb4 100644
> --- a/doc/src/sgml/func.sgml
> +++ b/doc/src/sgml/func.sgml
> @@ -17821,7 +17821,6 @@ SELECT NULLIF(value, '(none)') ...
>  1
>  2
>
> -(2 rows in result)

That's not "garbage", I put it there intentionally.

			regards, tom lane
Re: WIP: WAL prefetch (another approach)
Hi,

I've spent some time testing this, mostly from the performance point of
view. I've done a very simple thing, in order to have a reproducible test:

1) I've initialized pgbench with scale 8000 (so ~120GB on a machine with
   only 64GB of RAM)

2) created a physical backup, enabled WAL archiving

3) did 1h pgbench run with 32 clients

4) disabled full-page writes and did another 1h pgbench run

Once I had this, I did a recovery using the physical backup and WAL
archive, measuring how long it took to apply each WAL segment. First
without any prefetching (current master), then twice with prefetching.
First with default values (m_io_c=10, distance=256kB) and then with
higher values (100 + 2MB).

I did this on two storage systems I have in the system - NVME SSD and
SATA RAID (3 x 7.2k drives). So, a fast one and a slow one.

1) NVME

On the NVME, this generates ~26k WAL segments (~400GB), and each of the
pgbench runs generates ~120M transactions (~33k tps). Of course, the vast
majority of the WAL segments (~16k) comes from the first run, because
there's a lot of FPI due to the random nature of the workload.

I have not expected a significant improvement from the prefetching, as
the NVME is pretty good in handling random I/O. The total duration looks
like this:

    no prefetch    prefetch    prefetch2
          10618       10385         9403

So the default is a tiny bit faster, and the more aggressive config makes
it about 10% faster. Not bad, considering the expectations.

Attached is a chart comparing the three runs. There are three clearly
visible parts - first the 1h run with f_p_w=on, with two checkpoints.
That's the first ~16k segments. Then there's a bit of a gap before the
second pgbench run was started - I think it's mostly autovacuum etc. And
then at segment ~23k the second pgbench (f_p_w=off) starts. I think this
shows the prefetching starts to help as the number of FPIs decreases.
It's subtle, but it's there.

2) SATA

On SATA it's just ~550 segments (~8.5GB), and the pgbench runs generate
only about 1M transactions. Again, the vast majority of the segments
comes from the first run, due to FPI. In this case, I don't have complete
results, but after processing 542 segments (out of the ~550) it looks
like this:

    no prefetch    prefetch    prefetch2
           6644        6635         8282

So the no prefetch and "default" prefetch are roughly on par, but the
"aggressive" prefetch is way slower. I'll get back to this shortly, but
I'd like to point out this is entirely due to the "no FPI" pgbench,
because after the first ~525 initial segments it looks like this:

    no prefetch    prefetch    prefetch2
             58          65           57

So it goes very fast by the initial segments with plenty of FPIs, and
then we get to the "no FPI" segments and the prefetch either does not
help or makes it slower. Looking at how long it takes to apply the last
few segments, it looks like this:

    no prefetch    prefetch    prefetch2
            280         298          478

which is not particularly great, I guess. There however seems to be
something wrong, because with the prefetching I see this in the log:

prefetch:
2020-06-05 02:47:25.970 CEST 1591318045.970 [22961] LOG:  recovery no
longer prefetching: unexpected pageaddr 108/E8000000 in log segment
0000000100000108000000FF, offset 0

prefetch2:
2020-06-05 15:29:23.895 CEST 1591363763.895 [26676] LOG:  recovery no
longer prefetching: unexpected pageaddr 108/E8000000 in log segment
000000010000010900000001, offset 0

Which seems pretty suspicious, but I have no idea what's wrong. I admit
the archive/restore commands are a bit hacky, but I've only seen this
with prefetching on the SATA storage, while all other cases seem to be
just fine.
I haven't seen it on NVME (which processes much more WAL). And the SATA
baseline (no prefetching) also worked fine.

Moreover, the pageaddr value is the same in both cases, but the WAL
segments are different (but just one segment apart). Seems strange.

regards

--
Tomas Vondra
http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: minor doc fix - garbage in example of result of unnest
On Fri, Jun 05, 2020 at 09:56:54AM -0400, Tom Lane wrote: Pavel Stehule writes: diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 7c06afd3ea..3b810c0eb4 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -17821,7 +17821,6 @@ SELECT NULLIF(value, '(none)') ... 1 2 -(2 rows in result) That's not "garbage", I put it there intentionally. Why is it outside the though? Also, the next unnest example does not include the number of rows - why the difference? -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: minor doc fix - garbage in example of result of unnest
On Fri, Jun 5, 2020 at 8:38 AM Tomas Vondra wrote:
> On Fri, Jun 05, 2020 at 09:56:54AM -0400, Tom Lane wrote:
> > Pavel Stehule writes:
> > > diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
> > > index 7c06afd3ea..3b810c0eb4 100644
> > > --- a/doc/src/sgml/func.sgml
> > > +++ b/doc/src/sgml/func.sgml
> > > @@ -17821,7 +17821,6 @@ SELECT NULLIF(value, '(none)') ...
> > >  1
> > >  2
> > >
> > > -(2 rows in result)
> >
> > That's not "garbage", I put it there intentionally.
>
> Why is it outside the though? Also, the next unnest example does not
> include the number of rows - why the difference?

More generally I think the final unnest listing is probably the best
presentation for a set-returning function, and the one under discussion
should be consistent with it. The function reference documentation
doesn't seem like a place to introduce strictly psql output variations
like \pset footer on|off.

A set-returning function should be displayed in the example output as a
structured table like the last unnest example. I don't think it's
necessary for "single column" sets to be an exception. Since the examples
convert null to (none), there aren't even whitespace concerns in the
output where you want to display a bottom boundary for clarity.

Related: it seems arbitrary that the case-related full block examples do
not have row counts but the aggregate full block examples do. I don't see
where having the row counts in the output adds clarity; it's just more
text for the reader to mentally filter out.

David J.
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762
On 2020-Jun-05, Dave Cramer wrote: > On Thu, 4 Jun 2020 at 19:46, Alvaro Herrera > wrote: > > Ouch ... so they made IDENT in the replication grammar be a trigger to > > enter the regular grammar. Crazy. No way to put those worms back in > > the tin now, I guess. > > Is that documented ? I don't think it is. > > It is still my opinion that we should prohibit a logical replication > > connection from being used to do physical replication. Horiguchi-san, > > Sawada-san and Masao-san are all of the same opinion. Dave Cramer (of > > the JDBC team) is not opposed to the change -- he says they're just > > using it because they didn't realize they should be doing differently. > > I think my exact words were > > "I don't see this is a valid reason to keep doing something. If it is > broken then fix it. > Clients can deal with the change." > > in response to: > > > Well, I don't really think that we can just break a behavior that > > exists since 9.4 as you could break applications relying on the > > existing behavior, and that's also the point of Vladimir upthread. > > Which is different than not being opposed to the change. I don't see this > as broken, and it's quite possible that some of our users are using > it. Apologies for misinterpreting. > It certainly needs to be documented I'd rather not. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Make more use of RELKIND_HAS_STORAGE()
Peter Eisentraut writes: > This is a patch to make use of RELKIND_HAS_STORAGE() where appropriate, > instead of listing out the relkinds individually. No behavior change is > intended. > This was originally part of the patch from [0], but it seems worth > moving forward independently. Passes eyeball examination. I did not try to search for other places where RELKIND_HAS_STORAGE should be used. regards, tom lane
Improve planner cost estimations for alternative subplans
Hello,

Currently, in case of alternative subplans that do hashed vs per-row
lookups, the per-row estimate is used when planning the rest of the
query. It's also coded in a not quite explicit way. In [1] we found a
situation where this leads to a suboptimal plan: it bloats the overall
cost into large figures, decisions related to an outer part of the plan
look negligible to the planner, and as a result it doesn't elaborate on
choosing the optimal one. The patch is to fix it.

Our linear model for costs cannot quite accommodate the piecewise linear
matter of alternative subplans, so it is based on ugly heuristics and
still cannot be very precise, but I think it's better than the current
one. Thoughts?

Best,

Alex

[1] https://www.postgresql.org/message-id/flat/ff42b25b-ff03-27f8-ed11-b8255d658cd5%40imap.cc

diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index b976afb69d..5e2c5ec822 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -92,6 +92,7 @@
 #include "optimizer/planmain.h"
 #include "optimizer/restrictinfo.h"
 #include "parser/parsetree.h"
+#include "utils/float.h"
 #include "utils/lsyscache.h"
 #include "utils/selfuncs.h"
 #include "utils/spccache.h"
@@ -4283,15 +4284,57 @@ cost_qual_eval_walker(Node *node, cost_qual_eval_context *context)
 	else if (IsA(node, AlternativeSubPlan))
 	{
 		/*
-		 * Arbitrarily use the first alternative plan for costing.  (We should
-		 * certainly only include one alternative, and we don't yet have
-		 * enough information to know which one the executor is most likely to
-		 * use.)
+		 * Estimate the cost as some mean of the alternatives.
+		 * It's not uncommon for the alternative costs to be of different
+		 * orders of magnitude, so arithmetic mean would be too biased towards
+		 * the slower one. That's why for per-tuple coefficient we use geometric
+		 * mean.
+		 *
+		 * It's tempting to use geometric mean for startup costs too, but
+		 * one of them can be zero, which will result in substantial
+		 * underestimation. So instead we estimate the cost of returning *first*
+		 * tuple as geometric mean of single-tuple costs for the alternatives.
 		 */
-		AlternativeSubPlan *asplan = (AlternativeSubPlan *) node;
+		List	   *subplans = ((AlternativeSubPlan *) node)->subplans;
+		Cost		per_tuple_mean = 1;
+		Cost		first_tuple_mean = 1;
+		Cost		startup_min = get_float8_infinity();
+		Cost		startup_max = 0;
+		Cost		startup_mean;
+		ListCell   *lc;
+
+		foreach(lc, subplans)
+		{
+			cost_qual_eval_context subplanContext;
+
+			subplanContext.root = context->root;
+			subplanContext.total.startup = 0;
+			subplanContext.total.per_tuple = 0;
 
-		return cost_qual_eval_walker((Node *) linitial(asplan->subplans),
-									 context);
+			cost_qual_eval_walker((Node *) lfirst(lc), &subplanContext);
+
+			Assert(subplanContext.total.startup >= 0);
+			Assert(subplanContext.total.per_tuple >= 0);
+
+			per_tuple_mean *= subplanContext.total.per_tuple;
+			first_tuple_mean *= subplanContext.total.startup + subplanContext.total.per_tuple;
+			startup_min = float8_min(startup_min, subplanContext.total.startup);
+			startup_max = float8_max(startup_min, subplanContext.total.startup);
+		}
+		per_tuple_mean = pow(per_tuple_mean, 1.0 / list_length(subplans));
+		first_tuple_mean = pow(first_tuple_mean, 1.0 / list_length(subplans));
+		startup_mean = first_tuple_mean - per_tuple_mean;
+		/* make sure this calculation makes a sane value */
+		startup_mean = float8_max(startup_mean, startup_min);
+		startup_mean = float8_min(startup_mean, startup_max);
+
+		context->total.startup += startup_mean;
+		context->total.per_tuple += per_tuple_mean;
+
+		/* We've already recursed into children, skip the generic recursion */
+		return false;
 	}
 	else if (IsA(node, PlaceHolderVar))
 	{
diff --git a/src/test/regress/expected/updatable_views.out b/src/test/regress/expected/updatable_views.out
index 5de53f2782..98787e75e0 100644
--- a/src/test/regress/expected/updatable_views.out
+++ b/src/test/regress/expected/updatable_views.out
@@ -2301,17 +2301,16 @@ SELECT * FROM v1 WHERE a=8;
 EXPLAIN (VERBOSE, COSTS OFF)
 UPDATE v1 SET a=100 WHERE snoop(a) AND leakproof(a) AND a < 7 AND a != 6;
-                                    QUERY PLAN
+                                 QUERY PLAN
 ------------------------------------------------------------------------
  Update on public.t1
    Update on public.t1
    Update on public.t11 t1_1
    Update on public.t12 t1_2
    Update on public.t111 t1_3
-   ->
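A quick numeric illustration of the mean choice above (made-up numbers,
not from the thread): for two alternatives costing 1 and 10000 per tuple,
the arithmetic mean (1 + 10000) / 2 ≈ 5000 is dominated entirely by the
slower plan, while the geometric mean sqrt(1 * 10000) = 100 stays within
two orders of magnitude of both - which is why the patch multiplies the
costs and takes the n-th root via pow(..., 1.0 / n).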
Re: valgrind error
Noah Misch writes:
> I can reproduce this on a 2017-vintage CPU with ./configure
> ... USE_SLICING_BY_8_CRC32C=1 and then running "make installcheck-parallel"
> under valgrind-3.15.0 (as packaged by RHEL 7.8).  valgrind.supp has a
> suppression for CRC calculations, but it didn't get the memo when commit
> 4f700bc renamed the function.  The attached patch fixes the suppression.

I can also reproduce this, on RHEL 8.2 which likewise has valgrind-3.15.0,
using the same configuration to force use of that CRC function.  I concur
with your diagnosis that this is just a missed update of the pre-existing
suppression rule.

However, rather than

-	fun:pg_comp_crc32c
+	fun:pg_comp_crc32c*

as you have it, I'd prefer to use

-	fun:pg_comp_crc32c
+	fun:pg_comp_crc32c_sb8

which precisely matches what 4f700bc did.  The other way seems like it's
giving a free pass to problems that could lurk in unrelated CRC
implementations.

			regards, tom lane
Re: minor doc fix - garbage in example of result of unnest
"David G. Johnston" writes: >> On Fri, Jun 05, 2020 at 09:56:54AM -0400, Tom Lane wrote: >>> That's not "garbage", I put it there intentionally. > I don't see where having the row counts in the output adds clarity and it > just more text for the reader to mentally filter out. Seems like I'm getting outvoted here. If there aren't other votes, I'll change it. regards, tom lane
Re: significant slowdown of HashAggregate between 9.6 and 10
Hi,

On 2020-06-05 15:25:26 +0200, Tomas Vondra wrote:
> I think you're right. I think I was worried about having to resize the
> hash table in case of an under-estimate, and it seemed fine to waste a
> tiny bit more memory to prevent that.

It's pretty cheap to resize a hashtable with a handful of entries, so I'm
not worried about that. It's also how it has worked for a *long* time, so
I think unless we have some good reason to change that, I wouldn't.

> But this example shows we may need to scan the hash table
> sequentially, which means it's not just about memory consumption.

We *always* scan the hashtable sequentially, no? Otherwise there's no way
to get at the aggregated data.

> So in hindsight we either don't need the limit at all, or maybe it
> could be much lower (IIRC it reduces probability of collision, but
> maybe dynahash does that anyway internally).

This is simplehash-using code, which resizes at a load factor of 0.9.

> I wonder if hashjoin has the same issue, but probably not - I don't
> think we'll ever scan that internal hash table sequentially.

I think we do for some outer joins (c.f.
ExecPrepHashTableForUnmatched()), but it's probably not relevant
performance-wise.

Greetings,

Andres Freund
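A sketch of the resize rule just mentioned (illustrative; simplehash's
actual code is generated from simplehash.h and differs in detail):

	/* grow when the next insertion would push occupancy past ~90% */
	static inline bool
	needs_grow(uint32 members, uint32 nbuckets)
	{
		return (double) (members + 1) >= (double) nbuckets * 0.9;
	}

So even a table started with a single bucket only pays a few cheap
doublings before reaching its steady-state size.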
Re: WIP: WAL prefetch (another approach)
On Fri, Jun 05, 2020 at 05:20:52PM +0200, Tomas Vondra wrote:
> ... which is not particularly great, I guess. There however seems to be
> something wrong, because with the prefetching I see this in the log:
>
> prefetch:
> 2020-06-05 02:47:25.970 CEST 1591318045.970 [22961] LOG:  recovery no
> longer prefetching: unexpected pageaddr 108/E8000000 in log segment
> 0000000100000108000000FF, offset 0
>
> prefetch2:
> 2020-06-05 15:29:23.895 CEST 1591363763.895 [26676] LOG:  recovery no
> longer prefetching: unexpected pageaddr 108/E8000000 in log segment
> 000000010000010900000001, offset 0
>
> Which seems pretty suspicious, but I have no idea what's wrong. I admit
> the archive/restore commands are a bit hacky, but I've only seen this
> with prefetching on the SATA storage, while all other cases seem to be
> just fine. I haven't seen it on NVME (which processes much more WAL).
> And the SATA baseline (no prefetching) also worked fine.
>
> Moreover, the pageaddr value is the same in both cases, but the WAL
> segments are different (but just one segment apart). Seems strange.

I suspected it might be due to a somewhat hackish restore_command that
prefetches some of the WAL segments, so I tried again with a much simpler
restore_command - essentially just:

restore_command = 'cp /archive/%f %p.tmp && mv %p.tmp %p'

which I think should be fine for testing purposes. And I got this:

LOG:  recovery no longer prefetching: unexpected pageaddr 108/57000000
in log segment 0000000100000108000000FF, offset 0
LOG:  restored log file "0000000100000108000000FF" from archive

which is the same segment as in the earlier examples, but with a
different pageaddr value. Of course, there's no such pageaddr in the WAL
segment (and recovery of that segment succeeds). So I think there's
something broken ...

regards

--
Tomas Vondra
http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: WIP: WAL prefetch (another approach)
On Fri, Jun 05, 2020 at 10:04:14PM +0200, Tomas Vondra wrote: On Fri, Jun 05, 2020 at 05:20:52PM +0200, Tomas Vondra wrote: ... which is not particularly great, I guess. There however seems to be something wrong, because with the prefetching I see this in the log: prefetch: 2020-06-05 02:47:25.970 CEST 1591318045.970 [22961] LOG: recovery no longer prefetching: unexpected pageaddr 108/E800 in log segment 0001010800FF, offset 0 prefetch2: 2020-06-05 15:29:23.895 CEST 1591363763.895 [26676] LOG: recovery no longer prefetching: unexpected pageaddr 108/E800 in log segment 000101090001, offset 0 Which seems pretty suspicious, but I have no idea what's wrong. I admit the archive/restore commands are a bit hacky, but I've only seen this with prefetching on the SATA storage, while all other cases seem to be just fine. I haven't seen in on NVME (which processes much more WAL). And the SATA baseline (no prefetching) also worked fine. Moreover, the pageaddr value is the same in both cases, but the WAL segments are different (but just one segment apart). Seems strange. I suspected it might be due to a somewhat hackish restore_command that prefetches some of the WAL segments, so I tried again with a much simpler restore_command - essentially just: restore_command = 'cp /archive/%f %p.tmp && mv %p.tmp %p' which I think should be fine for testing purposes. And I got this: LOG: recovery no longer prefetching: unexpected pageaddr 108/5700 in log segment 0001010800FF, offset 0 LOG: restored log file "0001010800FF" from archive which is the same segment as in the earlier examples, but with a different pageaddr value. Of course, there's no such pageaddr in the WAL segment (and recovery of that segment succeeds). So I think there's something broken ... BTW in all three cases it happens right after the first restart point in the WAL stream: LOG: restored log file "0001010800FD" from archive LOG: restartpoint starting: time LOG: restored log file "0001010800FE" from archive LOG: restartpoint complete: wrote 236092 buffers (22.5%); 0 WAL ... LOG: recovery restart point at 108/FC28 DETAIL: Last completed transaction was at log time 2020-06-04 15:27:00.95139+02. LOG: recovery no longer prefetching: unexpected pageaddr 108/5700 in log segment 0001010800FF, offset 0 LOG: restored log file "0001010800FF" from archive It looks exactly like this in case of all 3 failures ... regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: repeat() function, CHECK_FOR_INTERRUPTS(), and unlikely()
On 6/4/20 5:20 PM, Alvaro Herrera wrote:
> On 2020-May-28, Joe Conway wrote:
>
>> I backpatched and pushed the changes to the repeat() function. Any other
>> opinions regarding backpatch of the unlikely() addition to
>> CHECK_FOR_INTERRUPTS()?
>
> We don't use unlikely() in 9.6 at all, so I would stop that backpatching
> at 10 anyhow.  (We did backpatch unlikely()'s definition afterwards.)

Correct you are -- thanks for the heads up!

Pushed to REL_10_STABLE and later.

Joe

--
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development
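For reference, the change being discussed amounts to this (a simplified,
non-Windows sketch of the macro in miscadmin.h):

	#define CHECK_FOR_INTERRUPTS() \
	do { \
		if (unlikely(InterruptPending)) \
			ProcessInterrupts(); \
	} while (0)

unlikely() moves the ProcessInterrupts() path out of the hot instruction
stream, which is why it only matters in very tight loops such as
repeat()'s.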
Re: WIP: WAL prefetch (another approach)
On Sat, Jun 6, 2020 at 8:41 AM Tomas Vondra wrote: > BTW in all three cases it happens right after the first restart point in > the WAL stream: > > LOG: restored log file "0001010800FD" from archive > LOG: restartpoint starting: time > LOG: restored log file "0001010800FE" from archive > LOG: restartpoint complete: wrote 236092 buffers (22.5%); 0 WAL ... > LOG: recovery restart point at 108/FC28 > DETAIL: Last completed transaction was at log time 2020-06-04 > 15:27:00.95139+02. > LOG: recovery no longer prefetching: unexpected pageaddr > 108/5700 in log segment 0001010800FF, offset 0 > LOG: restored log file "0001010800FF" from archive > > It looks exactly like this in case of all 3 failures ... Huh. Thanks! I'll try to reproduce this here.
Re: Trouble with hashagg spill I/O pattern and costing
Hello Is this patch the only thing missing before this open item can be considered closed? Thanks -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: v13: Performance regression related to FORTIFY_SOURCE
Hi, On 2020-04-19 15:07:22 -0700, Jeff Davis wrote: > I brought up an issue where GCC in combination with FORTIFY_SOURCE[2] > causes a perf regression for logical tapes after introducing > LogicalTapeSetExtend()[3]. Unfortunately, FORTIFY_SOURCE is used by > default on ubuntu. I have not observed the problem with clang. > > There is no reason why the change should trigger the regression, but it > does. The slowdown is due to GCC switching to an inlined version of > memcpy() for LogicalTapeWrite() at logtape.c:768. The change[3] seems > to have little if anything to do with that. FWIW, with gcc 10 and glibc 2.30 I don't see such a switch. Taking a profile shows me: │ nthistime = TapeBlockPayloadSize - lt->pos; │ if (nthistime > size) 3.01 │1 b0: cmp %rdx,%r12 1.09 │ cmovbe %r12,%rdx │ memcpy(): │ │ __fortify_function void * │ __NTH (memcpy (void *__restrict __dest, const void *__restrict __src, │ size_t __len)) │ { │ return __builtin___memcpy_chk (__dest, __src, __len, __bos0 (__dest)); 2.44 │ mov %r13,%rsi │ LogicalTapeWrite(): │ nthistime = size; │ Assert(nthistime > 0); │ │ memcpy(lt->buffer + lt->pos, ptr, nthistime); 2.49 │ add 0x28(%rbx),%rdi 0.28 │ mov %rdx,%r15 │ memcpy(): 4.65 │ → callqmemcpy@plt │ LogicalTapeWrite(): I.e. normal memcpy is getting called. That's with -D_FORTIFY_SOURCE=2 With which compiler / libc versions did you encounter this? Greetings, Andres Freund
Re: Trouble with hashagg spill I/O pattern and costing
On Fri, Jun 05, 2020 at 05:19:43PM -0400, Alvaro Herrera wrote: Hello Is this patch the only thing missing before this open item can be considered closed? I've already pushed this as 4cad2534da6d17067d98cf04be2dfc1bda8f2cd0, sorry for not mentioning it in this thread explicitly. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: WIP: WAL prefetch (another approach)
Hi, I wonder if we can collect some stats to measure how effective the prefetching actually is. Ultimately we want something like cache hit ratio, but we're only preloading into page cache, so we can't easily measure that. Perhaps we could measure I/O timings in redo, though? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Trouble with hashagg spill I/O pattern and costing
On 2020-Jun-06, Tomas Vondra wrote: > On Fri, Jun 05, 2020 at 05:19:43PM -0400, Alvaro Herrera wrote: > > > Is this patch the only thing missing before this open item can be > > considered closed? > > I've already pushed this as 4cad2534da6d17067d98cf04be2dfc1bda8f2cd0, > sorry for not mentioning it in this thread explicitly. That's great to know, thanks. The other bit necessary to answer my question is whether do we need to do anything else in this area -- if no, then we can mark the open item as closed: https://wiki.postgresql.org/wiki/PostgreSQL_13_Open_Items#Open_Issues Thanks -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Trouble with hashagg spill I/O pattern and costing
On Fri, Jun 05, 2020 at 06:51:34PM -0400, Alvaro Herrera wrote: On 2020-Jun-06, Tomas Vondra wrote: On Fri, Jun 05, 2020 at 05:19:43PM -0400, Alvaro Herrera wrote: > Is this patch the only thing missing before this open item can be > considered closed? I've already pushed this as 4cad2534da6d17067d98cf04be2dfc1bda8f2cd0, sorry for not mentioning it in this thread explicitly. That's great to know, thanks. The other bit necessary to answer my question is whether do we need to do anything else in this area -- if no, then we can mark the open item as closed: https://wiki.postgresql.org/wiki/PostgreSQL_13_Open_Items#Open_Issues Hmmm, good question. There was some discussion about maybe tweaking the costing model to make it a bit more pessimistic (assuming more random I/O or something like that), but I'm not sure it's still needed. Increasing random_page_cost for the temp tablespace did the trick for me. So I'd say we can mark it as closed ... regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Atomic operations within spinlocks
Hi,

On 2020-06-04 19:33:02 -0700, Andres Freund wrote:
> But it looks like that code is currently buggy (and looks like it always
> has been), because we don't look at the nested argument when
> initializing the semaphore. So we currently allocate too many
> semaphores, without benefiting from them :(.

I wrote a patch for this, and when I got around to testing it, I found
that our tests currently don't pass when using both --disable-spinlocks
and --disable-atomics.

Turns out that is not related to the issue above, but to the global
barrier support added in 13. That *reads* two 64 bit atomics in a signal
handler. Which is normally fine, but not at all cool when atomics (or
just 64 bit atomics) are backed by spinlocks. Because we can "self
interrupt" while already holding the spinlock.

It looks to me like that's a danger whenever 64bit atomics are backed by
spinlocks, not just when both --disable-spinlocks and --disable-atomics
are used. But I suspect that it's really hard to hit the tiny window of
danger when those options aren't used. While we have buildfarm animals
testing each of those separately, we don't have one that tests both
together...

I'm not really sure what to do about that issue. The easiest thing would
probably be to change the barrier generation to 32bit (which doesn't
have to use locks for reads in any situation). I tested doing that, and
it fixes the hangs for me.

Randomly noticed while looking at the code:

	uint64		flagbit = UINT64CONST(1) << (uint64) type;

that shouldn't be 64bit, right?

Greetings,

Andres Freund
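A sketch of the hazard described above (illustrative, not the actual
fallback code): with spinlock-backed 64-bit atomics, every read takes a
non-reentrant lock, so a signal handler that performs the same read while
the backend already holds that lock spins forever:

	static uint64
	emulated_read_u64(volatile pg_atomic_uint64 *ptr)
	{
		uint64		val;

		SpinLockAcquire(&ptr->lock);	/* a signal arriving here ... */
		val = ptr->value;
		SpinLockRelease(&ptr->lock);	/* ... deadlocks any handler that
										 * attempts the same read */
		return val;
	}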
Re: WIP: WAL prefetch (another approach)
Greetings,

* Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:
> I wonder if we can collect some stats to measure how effective the
> prefetching actually is. Ultimately we want something like cache hit
> ratio, but we're only preloading into page cache, so we can't easily
> measure that. Perhaps we could measure I/O timings in redo, though?

That would certainly be interesting, particularly as this optimization
seems likely to be useful on some platforms (eg, zfs, where the
filesystem block size is larger than ours..) and less on others
(traditional systems which have a smaller block size).

Thanks,

Stephen
Re: Atomic operations within spinlocks
Andres Freund writes: > I wrote a patch for this, and when I got around to to testing it, I > found that our tests currently don't pass when using both > --disable-spinlocks and --disable-atomics. Turns out to not be related > to the issue above, but the global barrier support added in 13. > That *reads* two 64 bit atomics in a signal handler. Which is normally > fine, but not at all cool when atomics (or just 64 bit atomics) are > backed by spinlocks. Because we can "self interrupt" while already > holding the spinlock. This is the sort of weird platform-specific problem that I'd prefer to avoid by minimizing our expectations of what spinlocks can be used for. > I'm not really sure what to do about that issue. The easisest thing > would probably be to change the barrier generation to 32bit (which > doesn't have to use locks for reads in any situation). Yeah, I think we need a hard rule that you can't use a spinlock in an interrupt handler --- which means no atomics that don't have non-spinlock implementations on every platform. At some point I think we'll have to give up --disable-spinlocks; it's really of pretty marginal use (how often does anyone port PG to a new CPU type?) and the number of weird interactions it adds in this area seems like more than it's worth. But of course requiring 64-bit atomics is still a step too far. > Randomly noticed while looking at the code: > uint64 flagbit = UINT64CONST(1) << (uint64) type; I'm surprised we didn't get any compiler warnings about that. regards, tom lane
Re: v13: Performance regression related to FORTIFY_SOURCE
On Thu, 2020-06-04 at 21:41 -0400, Tom Lane wrote: > I think what'd make more sense is to file this as a gcc bug ("why > doesn't > it remove the useless object size check?") Filed: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95556 Regards, Jeff Davis
Re: v13: Performance regression related to FORTIFY_SOURCE
On Fri, 2020-06-05 at 14:49 -0700, Andres Freund wrote:
> FWIW, with gcc 10 and glibc 2.30 I don't see such a switch. Taking a
> profile shows me:
...
>  4.65 │ → callq  memcpy@plt
>       │ LogicalTapeWrite():
>
> I.e. normal memcpy is getting called.
>
> That's with -D_FORTIFY_SOURCE=2

That's good news, although people will be using ubuntu 18.04 for a
while.

Just to confirm, would you mind trying the example programs in the GCC
bug report?

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95556

> With which compiler / libc versions did you encounter this?

gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
gcc-9 (Ubuntu 9.2.1-19ubuntu1~18.04.york0) 9.2.1 20191109
libc-dev-bin/bionic,now 2.27-3ubuntu1 amd64 [installed]

Regards,
	Jeff Davis
Re: v13: Performance regression related to FORTIFY_SOURCE
On Thu, 2020-06-04 at 16:35 -0400, Alvaro Herrera wrote:
> If it is something worth worrying about, let's discuss what's a good
> fix for it.

While making a minimal test case for the GCC bug report, I found
another surprisingly-small workaround. Patch attached.

Regards,
	Jeff Davis

diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index 666a7c0e81c..c09e1e308ae 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -99,7 +99,7 @@ typedef struct TapeBlockTrailer
 								 * bytes on last block (if < 0) */
 } TapeBlockTrailer;
 
-#define TapeBlockPayloadSize  (BLCKSZ - sizeof(TapeBlockTrailer))
+#define TapeBlockPayloadSize  (BLCKSZ - 16)
 #define TapeBlockGetTrailer(buf) \
 	((TapeBlockTrailer *) ((char *) buf + TapeBlockPayloadSize))
Re: v13: Performance regression related to FORTIFY_SOURCE
Jeff Davis writes:
> On Thu, 2020-06-04 at 16:35 -0400, Alvaro Herrera wrote:
>> If it is something worth worrying about, let's discuss what's a good
>> fix for it.

> While making a minimal test case for the GCC bug report, I found
> another surprisingly-small workaround. Patch attached.

Ugh :-( ... but perhaps you could get the same result like this:

-#define TapeBlockPayloadSize  (BLCKSZ - sizeof(TapeBlockTrailer))
+#define TapeBlockPayloadSize  (BLCKSZ - (int) sizeof(TapeBlockTrailer))

Or possibly casting the whole thing to int or unsigned int would be
better. Point being that I bet it's int vs long that is making the
difference.

			regards, tom lane
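To make the int-vs-long hypothesis concrete, here is a simplified,
self-contained sketch of the two macro variants under discussion (an
illustration only; whether the fortified check is elided depends on the
compiler and glibc version):

#include <string.h>

#define BLCKSZ 8192

typedef struct TapeBlockTrailer
{
	long		prev;
	long		next;
} TapeBlockTrailer;

/* size_t (unsigned long) arithmetic: the variant reported to keep the
 * runtime object-size check under gcc 7 with -D_FORTIFY_SOURCE=2 */
#define TapeBlockPayloadSizeL	(BLCKSZ - sizeof(TapeBlockTrailer))

/* int arithmetic: the suggested workaround */
#define TapeBlockPayloadSizeI	(BLCKSZ - (int) sizeof(TapeBlockTrailer))

void
copy_long(char *dst, const char *src)
{
	memcpy(dst, src, TapeBlockPayloadSizeL);	/* may become __memcpy_chk */
}

void
copy_int(char *dst, const char *src)
{
	memcpy(dst, src, TapeBlockPayloadSizeI);	/* plain memcpy expected */
}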
Re: v13: Performance regression related to FORTIFY_SOURCE
On Fri, 2020-06-05 at 21:50 -0400, Tom Lane wrote:
> Or possibly casting the whole thing to int or unsigned int would be
> better. Point being that I bet it's int vs long that is making the
> difference.

That did it, and it's much more tolerable as a workaround. Thank you.

I haven't tested end-to-end that it solves the problem, but I'm pretty
sure it will.

Regards,
	Jeff Davis
Re: Atomic operations within spinlocks
Hi,

On 2020-06-05 21:01:56 -0400, Tom Lane wrote:
> > I'm not really sure what to do about that issue. The easiest thing
> > would probably be to change the barrier generation to 32bit (which
> > doesn't have to use locks for reads in any situation).
>
> Yeah, I think we need a hard rule that you can't use a spinlock in
> an interrupt handler --- which means no atomics that don't have
> non-spinlock implementations on every platform.

Yea, that might be the easiest thing to do.

The only other thing I can think of would be to mask all signals for the
duration of the atomic-using-spinlock operation. That'd make the
fallback noticeably more expensive, but otoh, do we care enough?

I think a SIGNAL_HANDLER_BEGIN(); SIGNAL_HANDLER_END(); to back an
Assert(!InSignalHandler()); could be quite useful. Could also save
errno etc.

> At some point I think we'll have to give up --disable-spinlocks; it's
> really of pretty marginal use (how often does anyone port PG to a new
> CPU type?) and the number of weird interactions it adds in this area
> seems like more than it's worth.

Indeed. And any new architecture one would port PG to would have good
enough compiler intrinsics to make that trivial. I still think it'd make
sense to have a fallback implementation using compiler intrinsics...

And I think we should just require 32bit atomics at the same time. Would
probably kill gaur though.

I did just find a longstanding bug in the spinlock emulation code:

void
s_init_lock_sema(volatile slock_t *lock, bool nested)
{
	static int	counter = 0;

	*lock = ((++counter) % NUM_SPINLOCK_SEMAPHORES) + 1;
}

void
s_unlock_sema(volatile slock_t *lock)
{
	int			lockndx = *lock;

	if (lockndx <= 0 || lockndx > NUM_SPINLOCK_SEMAPHORES)
		elog(ERROR, "invalid spinlock number: %d", lockndx);

	PGSemaphoreUnlock(SpinlockSemaArray[lockndx - 1]);
}

I don't think it's ok that counter is a signed integer... While it used
to be unlikely that we'd ever have that many spinlocks, I don't think
that's hard anymore, because we dynamically allocate them for a lot of
parallel query stuff.

A small regression test that initializes enough spinlocks indeed errors
out with:

2020-06-05 18:08:29.110 PDT [734946][3/2:0] ERROR:  invalid spinlock number: -126
2020-06-05 18:08:29.110 PDT [734946][3/2:0] STATEMENT:  SELECT test_atomic_ops();

> > Randomly noticed while looking at the code:
> > 	uint64		flagbit = UINT64CONST(1) << (uint64) type;
>
> I'm surprised we didn't get any compiler warnings about that.

Unfortunately I don't think one can currently compile postgres with
warnings for "implicit casts" enabled :(.

> But of course requiring 64-bit atomics is still a step too far.

If we had a 32bit compare-exchange it ought to be possible to write a
signal-safe emulation of 64bit atomics. I think. Something *roughly*
like:

typedef struct pg_atomic_uint64
{
	/*
	 * Meaning of state bits:
	 * 0-1: currently valid
	 * 2-4: currently proposed
	 * 5: in signal handler
	 * 6-31: pid of proposer
	 */
	pg_atomic_uint32 state;

	/*
	 * One current value, two different proposed values.
	 */
	uint64		value[3];
} pg_atomic_uint64;

The update protocol would be something roughly like:

bool
pg_atomic_compare_exchange_u64_impl(volatile pg_atomic_uint64 *ptr,
									uint64 *expected, uint64 newval)
{
	while (true)
	{
		uint32		old_state = pg_atomic_read_u32(&ptr->state);
		uint32		updater_pid = PID_FROM_STATE(old_state);
		uint32		new_state;
		uint64		old_value;
		int			proposing;

		/*
		 * Value changed, so fail. This is obviously racy, but we'll
		 * notice concurrent updates later.
		 */
		if (ptr->value[VALID_FIELD(old_state)] != *expected)
		{
			return false;
		}

		if (updater_pid == INVALID_PID)
		{
			new_state = old_state;

			/* signal that current process is updating */
			new_state |= MyProcPid << PID_STATE_SHIFT;

			if (InSignalHandler)
				new_state |= PROPOSER_IN_SIGNAL_HANDLER_BIT;

			/* set which index is being proposed */
			new_state = (new_state & ~PROPOSER_BITS) |
				NEXT_PROPOSED_FIELD(old_state, &proposing);

			/*
			 * If we successfully can update state to contain our new
			 * value, we have a right to do so, and can only be
			 * interrupted by ourselves, in a signal handler.
			 */
			if (!pg_atomic_compare_exchange(&ptr->state, &old_state, new_state))
			{
				/* somebody else updated, restart */
				continue;
			}

			old_state = new_state;

			/*
			 * It's ok to compare the values now. If we are interrupted
			 * by a signal handler, we'll notice when updating
			 * state. There's no danger upda
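Coming back to the counter bug above: a minimal sketch of one possible
fix, in terms of the PostgreSQL typedefs already used in that code (an
illustration, not a committed patch):

void
s_init_lock_sema(volatile slock_t *lock, bool nested)
{
	static uint32 counter = 0;

	/*
	 * With an unsigned counter, the modulo result is always in
	 * [0, NUM_SPINLOCK_SEMAPHORES - 1], even once the counter wraps
	 * around, so the stored index stays in the valid
	 * 1..NUM_SPINLOCK_SEMAPHORES range.  The signed counter overflows
	 * to a negative value after INT_MAX increments, yielding the
	 * "invalid spinlock number: -126" failure above.
	 */
	*lock = (counter++ % NUM_SPINLOCK_SEMAPHORES) + 1;
}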
Re: v13: Performance regression related to FORTIFY_SOURCE
Hi,

On 2020-06-05 18:39:28 -0700, Jeff Davis wrote:
> On Fri, 2020-06-05 at 14:49 -0700, Andres Freund wrote:
> > FWIW, with gcc 10 and glibc 2.30 I don't see such a switch. Taking a
> > profile shows me:
> ...
> >  4.65 │ → callq  memcpy@plt
> >       │ LogicalTapeWrite():
> >
> > I.e. normal memcpy is getting called.
> >
> > That's with -D_FORTIFY_SOURCE=2
>
> That's good news, although people will be using ubuntu 18.04 for a
> while.
>
> Just to confirm, would you mind trying the example programs in the GCC
> bug report?
>
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95556

I get "call memcpy@PLT" for both files. With various debian versions of
gcc (7, 8, 9, 10).

But, very curiously, I do see the difference when compiling with
gcc-snapshot (which is a debian package wrapping a recent snapshot from
upstream gcc).

Greetings,

Andres Freund
Re: Atomic operations within spinlocks
Andres Freund writes:
> On 2020-06-05 21:01:56 -0400, Tom Lane wrote:
>> At some point I think we'll have to give up --disable-spinlocks; it's
>> really of pretty marginal use (how often does anyone port PG to a new
>> CPU type?) and the number of weird interactions it adds in this area
>> seems like more than it's worth.

> Indeed. And any new architecture one would port PG to would have good
> enough compiler intrinsics to make that trivial. I still think it'd make
> sense to have a fallback implementation using compiler intrinsics...

> And I think we should just require 32bit atomics at the same time. Would
> probably kill gaur though.

Not only gaur. A quick buildfarm survey finds these active members
reporting not having 32-bit atomics:

 anole      | 2020-06-05 11:20:17 | pgac_cv_gcc_atomic_int32_cas=no
 chipmunk   | 2020-05-29 22:27:56 | pgac_cv_gcc_atomic_int32_cas=no
 curculio   | 2020-06-05 22:30:06 | pgac_cv_gcc_atomic_int32_cas=no
 frogfish   | 2020-05-31 13:00:25 | pgac_cv_gcc_atomic_int32_cas=no
 gaur       | 2020-05-19 13:33:25 | pgac_cv_gcc_atomic_int32_cas=no
 gharial    | 2020-06-05 12:41:14 | pgac_cv_gcc_atomic_int32_cas=no
 hornet     | 2020-06-05 09:11:26 | pgac_cv_gcc_atomic_int32_cas=no
 hoverfly   | 2020-06-05 22:06:14 | pgac_cv_gcc_atomic_int32_cas=no
 locust     | 2020-06-05 10:14:29 | pgac_cv_gcc_atomic_int32_cas=no
 mandrill   | 2020-06-05 09:20:03 | pgac_cv_gcc_atomic_int32_cas=no
 prairiedog | 2020-06-05 09:55:49 | pgac_cv_gcc_atomic_int32_cas=no

It looks to me like this is mostly about compiler support not the
hardware; that doesn't make it not a problem, though. (I also remain
skeptical about the quality of the compiler intrinsics on
non-mainstream hardware.)

			regards, tom lane
Re: valgrind error
On Fri, Jun 05, 2020 at 12:17:54PM -0400, Tom Lane wrote:
> Noah Misch writes:
> > I can reproduce this on a 2017-vintage CPU with ./configure
> > ... USE_SLICING_BY_8_CRC32C=1 and then running "make installcheck-parallel"
> > under valgrind-3.15.0 (as packaged by RHEL 7.8). valgrind.supp has a
> > suppression for CRC calculations, but it didn't get the memo when commit
> > 4f700bc renamed the function. The attached patch fixes the suppression.
>
> I can also reproduce this, on RHEL 8.2 which likewise has valgrind-3.15.0,
> using the same configuration to force use of that CRC function. I concur
> with your diagnosis that this is just a missed update of the pre-existing
> suppression rule. However, rather than
>
> - fun:pg_comp_crc32c
> + fun:pg_comp_crc32c*
>
> as you have it, I'd prefer to use
>
> - fun:pg_comp_crc32c
> + fun:pg_comp_crc32c_sb8
>
> which precisely matches what 4f700bc did. The other way seems like
> it's giving a free pass to problems that could lurk in unrelated CRC
> implementations.

The undefined data is in the CRC input, namely the padding bytes in xl_*
structs. Apparently, valgrind-3.15.0 doesn't complain about undefined
input to _mm_crc32_u* functions. We should not be surprised if Valgrind
gains the features necessary to complain about the other implementations.
Most COMP_CRC32C callers don't have a suppression, so Valgrind still
studies each CRC implementation via those other callers.
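As an illustration of the padding issue, here is a simplified stand-in
(using zlib's crc32() rather than pg_comp_crc32c, and a made-up struct
rather than a real xl_* record):

#include <stdint.h>
#include <zlib.h>

typedef struct
{
	uint8_t		info;		/* 1 byte of real data */
	/* 3 bytes of compiler-inserted padding land here */
	uint32_t	xid;		/* 4 more bytes of real data */
} fake_xlog_rec;

uLong
crc_of_record(void)
{
	fake_xlog_rec rec;		/* stack variable: padding stays uninitialized */

	rec.info = 1;
	rec.xid = 42;

	/*
	 * The CRC covers sizeof(rec) == 8 bytes, including the 3 undefined
	 * padding bytes, so Memcheck flags the value derived from them in
	 * the CRC routine unless a suppression covers that stack.
	 */
	return crc32(crc32(0L, Z_NULL, 0), (const Bytef *) &rec, sizeof(rec));
}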
Re: valgrind error
Noah Misch writes:
> On Fri, Jun 05, 2020 at 12:17:54PM -0400, Tom Lane wrote:
>> as you have it, I'd prefer to use
>> - fun:pg_comp_crc32c
>> + fun:pg_comp_crc32c_sb8
>> which precisely matches what 4f700bc did. The other way seems like
>> it's giving a free pass to problems that could lurk in unrelated CRC
>> implementations.

> The undefined data is in the CRC input, namely the padding bytes in xl_*
> structs.

Oh, I see. Objection withdrawn.

> Apparently, valgrind-3.15.0 doesn't complain about undefined input
> to _mm_crc32_u* functions. We should not be surprised if Valgrind gains the
> features necessary to complain about the other implementations.

Perhaps it already has ... I wonder if anyone's tried this on ARMv8
lately.

			regards, tom lane
Re: Atomic operations within spinlocks
Hi,

On 2020-06-05 22:52:47 -0400, Tom Lane wrote:
> Andres Freund writes:
> > On 2020-06-05 21:01:56 -0400, Tom Lane wrote:
> >> At some point I think we'll have to give up --disable-spinlocks; it's
> >> really of pretty marginal use (how often does anyone port PG to a new
> >> CPU type?) and the number of weird interactions it adds in this area
> >> seems like more than it's worth.
>
> > Indeed. And any new architecture one would port PG to would have good
> > enough compiler intrinsics to make that trivial. I still think it'd make
> > sense to have a fallback implementation using compiler intrinsics...
>
> > And I think we should just require 32bit atomics at the same time. Would
> > probably kill gaur though.
>
> Not only gaur. A quick buildfarm survey finds these active members
> reporting not having 32-bit atomics:

Hm, I don't think that's the right test. We have bespoke code to support
most of these, I think:

> anole      | 2020-06-05 11:20:17 | pgac_cv_gcc_atomic_int32_cas=no

Has support via acc specific intrinsics.

> chipmunk   | 2020-05-29 22:27:56 | pgac_cv_gcc_atomic_int32_cas=no

Doesn't have support for __atomic, but does have support for 32bit
__sync.

> gharial    | 2020-06-05 12:41:14 | pgac_cv_gcc_atomic_int32_cas=no

__sync support for both 32 and 64 bit.

> curculio   | 2020-06-05 22:30:06 | pgac_cv_gcc_atomic_int32_cas=no
> frogfish   | 2020-05-31 13:00:25 | pgac_cv_gcc_atomic_int32_cas=no

__sync support for both 32 and 64 bit.

> mandrill   | 2020-06-05 09:20:03 | pgac_cv_gcc_atomic_int32_cas=no

__sync support for 32 bit, as well as inline asm for 32bit atomics
(although we might be able to add 64 bit).

> hornet     | 2020-06-05 09:11:26 | pgac_cv_gcc_atomic_int32_cas=no
> hoverfly   | 2020-06-05 22:06:14 | pgac_cv_gcc_atomic_int32_cas=no

__sync support for both 32 and 64 bit, and we have open coded ppc asm.

> locust     | 2020-06-05 10:14:29 | pgac_cv_gcc_atomic_int32_cas=no
> prairiedog | 2020-06-05 09:55:49 | pgac_cv_gcc_atomic_int32_cas=no

Wee, these don't have __sync? But I think they should be able to use the
asm ppc implementation for 32 bit atomics.

> gaur       | 2020-05-19 13:33:25 | pgac_cv_gcc_atomic_int32_cas=no

As far as I understand pa-risc doesn't have any atomic instructions
except for TAS. So I think gaur is really the only one that'd drop.

> It looks to me like this is mostly about compiler support not the
> hardware; that doesn't make it not a problem, though. (I also
> remain skeptical about the quality of the compiler intrinsics
> on non-mainstream hardware.)

I think that's fair enough for really old platforms, but at least for
gcc / clang I don't think it's a huge concern for newer ones, even if
not mainstream. For gcc/clang the intrinsics basically back the
C11/C++11 "language level" atomics support. And those are extremely
widely used these days.

Greetings,

Andres Freund
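For reference, a generic fallback along those lines could be as small as
this sketch (an assumption about how such a wrapper might look, not
PostgreSQL's existing port code), since the __atomic builtins are what
back C11 atomics in gcc and clang:

#include <stdbool.h>
#include <stdint.h>

/*
 * Generic 32bit compare-and-swap built on the gcc/clang __atomic
 * builtins: returns true and stores newval if *ptr equaled *expected,
 * otherwise writes the current value back into *expected.
 */
static inline bool
atomic_cas_u32_generic(uint32_t *ptr, uint32_t *expected, uint32_t newval)
{
	return __atomic_compare_exchange_n(ptr, expected, newval,
									   false,	/* strong variant */
									   __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}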
hashagg slowdown due to spill changes
Hi,

While preparing my pgcon talk I noticed that our hash-agg performance
degraded noticeably. Looks to me like it's due to the spilling-hashagg
changes.

Sample benchmark:

config:
-c huge_pages=on -c shared_buffers=32GB -c jit=0
-c max_parallel_workers_per_gather=0
(largely just to reduce variance)

data prep:
CREATE TABLE fewgroups_many_rows AS
SELECT (random() * 4)::int cat, (random()*1)::int val
FROM generate_series(1, 1);
VACUUM (FREEZE, ANALYZE) fewgroups_many_rows;

test prep:
CREATE EXTENSION IF NOT EXISTS pg_prewarm;
SELECT pg_prewarm('fewgroups_many_rows', 'buffer');

test:
SELECT cat, count(*) FROM fewgroups_many_rows GROUP BY 1;

Using best-of-three timing:

12      12221.031 ms
master  13855.129 ms

While not the end of the world, that's a definitely noticeable and
reproducible slowdown (~12%).

I don't think this is actually an inherent cost, but a question of how
the code ended up being organized. Here's a perf diff of profiles for
both versions:

# Baseline  Delta Abs  Shared Object  Symbol
# ........  .........  .............  .......................................
#
             +6.70%    postgres       [.] LookupTupleHashEntryHash
             +6.37%    postgres       [.] prepare_hash_slot
             +4.74%    postgres       [.] TupleHashTableHash_internal.isra.0
   20.36%    -2.89%    postgres       [.] ExecInterpExpr
    6.31%    -2.73%    postgres       [.] lookup_hash_entries
             +2.36%    postgres       [.] lookup_hash_entry
             +2.14%    postgres       [.] ExecJustAssignScanVar
    2.28%    +1.97%    postgres       [.] ExecScan
    2.54%    +1.93%    postgres       [.] MemoryContextReset
    3.84%    -1.86%    postgres       [.] SeqNext
   10.19%    -1.50%    postgres       [.] tts_buffer_heap_getsomeattrs
             +1.42%    postgres       [.] hash_bytes_uint32
             +1.39%    postgres       [.] TupleHashTableHash
             +1.10%    postgres       [.] tts_virtual_clear
    3.36%    -0.74%    postgres       [.] ExecAgg
             +0.45%    postgres       [.] CheckForSerializableConflictOutNeeded
    0.25%    +0.44%    postgres       [.] hashint4
    5.80%    -0.35%    postgres       [.] tts_minimal_getsomeattrs
    1.91%    -0.33%    postgres       [.] heap_getnextslot
    4.86%    -0.32%    postgres       [.] heapgettup_pagemode
    1.46%    -0.32%    postgres       [.] tts_minimal_clear

While some of this likely is just noise, it's pretty clear that we spend
a substantial amount of additional time below lookup_hash_entries().

And looking at the code, I'm not too surprised: Before, there was
basically one call from nodeAgg.c to execGrouping.c for each tuple and
hash table. Now it's a lot more complicated:

1) nodeAgg.c:      prepare_hash_slot()
2) execGrouping.c: TupleHashTableHash()
3) nodeAgg.c:      lookup_hash_entry()
4) execGrouping.c: LookupTupleHashEntryHash()

For each of these, data needs to be peeled out of one or more of
AggState / AggStatePerHashData / TupleHashTable. There's no way the
compiler can know that nothing inside those changes, therefore it has to
reload the contents repeatedly. By my look at the profiles, that's where
most of the time is going.

There's also the issue that the signalling of whether to insert / not to
insert got unnecessarily complicated. There are several checks:

1) lookup_hash_entry()
   (p_isnew = aggstate->hash_spill_mode ? NULL : &isnew;)
2) LookupTupleHashEntry_internal()
   (if (isnew))
3) lookup_hash_entry()
   (if (entry == NULL) and if (isnew))
4) lookup_hash_entries()
   (if (!in_hash_table))

Not performance related: I am a bit confused why the new per-hash stuff
in lookup_hash_entries() isn't in lookup_hash_entry()? I assume that's
because of agg_refill_hash_table()?

Why isn't the flow more like this:

1) prepare_hash_slot()
2) if (aggstate->hash_spill_mode) goto 3; else goto 4
3) entry = LookupTupleHashEntry(&hash);
   if (!entry)
       hashagg_spill_tuple();
4) InsertTupleHashEntry(&hash, &isnew);
   if (isnew)
       initialize(entry)

That way there's again exactly one call to execGrouping.c, there's no
need for nodeAgg to separately compute the hash, and there are far fewer
branches...

Doing things this way might perhaps make agg_refill_hash_table() a tiny
bit more complicated, but it'll also avoid the slowdown for the vast
majority of cases where we're not spilling.

Greetings,

Andres Freund
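To illustrate the repeated-reload point above with a self-contained
sketch (hypothetical types and field names; the pattern is the point,
not the actual executor structs):

#include <stdint.h>

typedef struct HashTable HashTable;

extern uint32_t ComputeHash(HashTable *ht, const char *tuple);
extern void    *LookupEntry(HashTable *ht, const char *tuple, uint32_t hash);

typedef struct
{
	HashTable  *hashtable;
	const char *tuple;
} PerHashState;

void *
lookup_split_across_calls(PerHashState *state)
{
	/*
	 * ComputeHash() is an external call, so the compiler must assume it
	 * may have modified *state and reloads state->hashtable and
	 * state->tuple from memory before the second call.
	 */
	uint32_t	hash = ComputeHash(state->hashtable, state->tuple);

	return LookupEntry(state->hashtable, state->tuple, hash);
}

void *
lookup_single_call(PerHashState *state)
{
	/*
	 * Hoisting the fields into locals whose address is never taken (or
	 * doing the whole lookup inside one callee) avoids the reloads.
	 */
	HashTable  *ht = state->hashtable;
	const char *tuple = state->tuple;

	return LookupEntry(ht, tuple, ComputeHash(ht, tuple));
}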
Re: libpq copy error handling busted
Hi,

On 2020-06-03 18:41:28 -0400, Tom Lane wrote:
> Andres Freund writes:
> > When libpq is used to COPY data to the server, it doesn't properly
> > handle errors.
> > This is partially an old problem, and partially got recently
> > worse. Before the below commit we detected terminated connections, but
> > we didn't handle copy failing.
>
> Yeah. After poking at that for a little bit, there are at least three
> problems:
>
> * pqSendSome() is responsible not only for pushing out data, but for
> calling pqReadData in any situation where it can't get rid of the data
> promptly. 1f39a1c06 overlooked that requirement, and the upshot is
> that we don't necessarily notice that the connection is broken (it's
> pqReadData's job to detect that). Putting a pqReadData call into
> the early-exit path helps, but doesn't fix things completely.

Is that fully necessary? Couldn't we handle at least the case I had by
looking at write_failed in additional places?

It might still be the right thing to continue to call pqReadData() from
pqSendSome(), don't get me wrong.

> * The more longstanding problem is that the PQputCopyData code path
> doesn't have any mechanism for consuming an 'E' (error) message
> once pqReadData has collected it. AFAICS that's ancient.

Yea, I looked back quite a bit, and it looked that way for a long
time. I thought for a moment that it might be related to the copy-both
introduction, but it wasn't.

> I think that the idea was to let the client dump all its copy data and
> then report the error message when PQendCopy is called, but as you
> say, that's none too friendly when there's gigabytes of data involved.
> I'm not sure we can do anything about this without effectively
> changing the client API for copy errors, though.

Hm. Why would it *really* be an API change? Until recently connection
failures etc were returned from PQputCopyData(), and it is documented
that way:

/*
 * PQputCopyData - send some data to the backend during COPY IN or COPY BOTH
 *
 * Returns 1 if successful, 0 if data could not be sent (only possible
 * in nonblock mode), or -1 if an error occurs.
 */
int
PQputCopyData(PGconn *conn, const char *buffer, int nbytes)

So consuming 'E' when in copy mode doesn't seem like a crazy change to
me. Probably a bit too big to backpatch though. But given that this
hasn't been complained about much in however many years...

Greetings,

Andres Freund
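For context, a typical client-side COPY loop looks something like this
sketch (read_next_chunk() is a hypothetical application helper); today a
server 'E' message generally only surfaces at the PQgetResult() stage,
after all the data has been sent:

#include <stdio.h>
#include <libpq-fe.h>

/* hypothetical helper: fills buf, returns bytes read, 0 at end of input */
extern int read_next_chunk(char *buf, int bufsize);

void
copy_in(PGconn *conn)
{
	char		buf[8192];
	int			n;
	PGresult   *res;

	while ((n = read_next_chunk(buf, sizeof(buf))) > 0)
	{
		/*
		 * Per the documented contract, -1 signals an error; assumes a
		 * blocking-mode connection, where 0 can't be returned.
		 */
		if (PQputCopyData(conn, buf, n) < 0)
		{
			fprintf(stderr, "copy data failed: %s", PQerrorMessage(conn));
			break;
		}
	}

	if (PQputCopyEnd(conn, NULL) < 0)
		fprintf(stderr, "copy end failed: %s", PQerrorMessage(conn));

	/* ... but in practice a server error often only shows up here */
	while ((res = PQgetResult(conn)) != NULL)
	{
		if (PQresultStatus(res) != PGRES_COMMAND_OK)
			fprintf(stderr, "COPY failed: %s", PQresultErrorMessage(res));
		PQclear(res);
	}
}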
Vacuuming the operating system documentation
Hi,

We're carrying a bunch of obsolete and in one case insecure advice on
kernel settings. Here's an attempt to clean some of that up.

Linux:
 * Drop reference to ancient systems that didn't have a sysctl command.
 * Drop references to Linux before 2.6.
 * I was tempted to remove the reference to oom_adj, which was apparently
   deprecated from 2.6.36, but that's probably recent enough to keep
   (RHEL6 may outlive humanity).

macOS:
 * Drop reference to 10.2 and 10.3 systems. That's 15-16 years ago. Even
   the ancient PPC systems in the build farm run 10.4+.

FreeBSD:
 * Drop insecure and outdated jail instructions. I moved the pre-FreeBSD
   11 behaviour into a brief note in parentheses, because FreeBSD 11 is
   the oldest release of that OS that is still in support. In that
   parenthetical note, I dropped the reference to port numbers and UIDs
   in shmem keys since we now use pgdata inode numbers instead.
 * Drop SysV semaphore instruction. We switched to POSIX on this platform
   in PostgreSQL 10, and we don't bother to give the redundant
   instructions about semaphores for Linux so we might as well drop this
   noise for FreeBSD too.
 * Clarify that kern.ipc.shm_use_phys only has a useful effect if
   shared_memory_type=sysv, which is not the default.
 * Drop some stuff about pre-4.0 systems. That was 20 years ago.

NetBSD:
 * Drop reference to pre-5.0 systems. That was 11 years ago. Maybe
   someone wants to argue with me on this one?

OpenBSD:
 * Drop instruction on recompiling the kernel on pre-3.3 systems. That
   was 17 years ago.

Solaris/illumos:
 * Drop instructions on Solaris 6-9 systems. 10 came out 15 years ago, 9
   was fully desupported 6 years ago. The last person to mention Solaris
   9 on the mailing list was ... me. That machine had cobwebs even then.
 * Drop reference to OpenSolaris, which was cancelled ten years ago; the
   surviving project goes by illumos, so use that name.

AIX:
 * Drop reference to 5.1, since there is no way older systems than that
   are going to be running new PostgreSQL releases. 5.1 itself was
   desupported by IBM 14 years ago.

HP-UX:
 * Drop advice for v10. 11.x came out 23 years ago.

It's a bit inconsistent that we bother to explain the SysV shmem sysctls
on some systems but not others, just because once upon a time it was
necessary to tweak them on some systems and not others due to defaults.
You shouldn't need that anywhere now IIUC, unless you run a lot of
clusters or use shared_memory_type=sysv. I'm not proposing to add it
where it's missing, as I don't have the information and I doubt it's
really useful anyway; you can find that stuff elsewhere if you really
need it.

From be61fe032a2b8ac3b6130c1f7a38c918d9423ec8 Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Sat, 6 Jun 2020 15:20:14 +1200
Subject: [PATCH] doc: Clean up references to obsolete OS versions.

Modernize the documentation to remove insecure and/or obsolete
instructions about old operating systems.
---
 doc/src/sgml/runtime.sgml | 152 +-
 1 file changed, 19 insertions(+), 133 deletions(-)

diff --git a/doc/src/sgml/runtime.sgml b/doc/src/sgml/runtime.sgml
index a8bb85e6f5..c0d1860bf2 100644
--- a/doc/src/sgml/runtime.sgml
+++ b/doc/src/sgml/runtime.sgml
@@ -872,7 +872,7 @@ psql: could not connect to server: No such file or directory
-    At least as of version 5.1, it should not be necessary to do
+    It should not be necessary to do
     any special configuration for such parameters as
     SHMMAX, as it appears this is configured to
     allow all memory to be used as shared memory.  That is the
@@ -907,41 +907,24 @@ psql: could not connect to server: No such file or directory
     /etc/sysctl.conf.
-
-    These semaphore-related settings are read-only as far as
-    sysctl is concerned, but can be set in
-    /boot/loader.conf:
-
-kern.ipc.semmni=256
-kern.ipc.semmns=512
-
-    After modifying that file, a reboot is required for the new
-    settings to take effect.
-
-
-
-    You might also want to configure your kernel to lock System V shared
+
+    If you have set shared_memory_type to
+    sysv (not the default, see ),
+    you might also want to configure your kernel to lock System V shared
     memory into RAM and prevent it from being paged out to swap.
     This can be accomplished using the sysctl
     setting kern.ipc.shm_use_phys.
-
-    If running in FreeBSD jails by enabling sysctl's
-    security.jail.sysvipc_allowed, postmasters
-    running in different jails should be run by different operating system
-    users.  This improves security because it prevents non-root users
-    from interfering with shared memory or semaphores in different jails,
-    and it allows the PostgreSQL IPC cleanup code to function properly.
-    (In FreeBSD 6.0 and later the IPC cleanup code does not properly