Re: Server crashed with dense_rank on partition table.
On 13 June 2018 at 17:55, Michael Paquier wrote:
> On Wed, Jun 13, 2018 at 11:08:38AM +0530, Rajkumar Raghuwanshi wrote:
>> postgres=# SELECT dense_rank(b) WITHIN GROUP (ORDER BY a) FROM pagg_tab
>> GROUP BY b ORDER BY 1;
>> server closed the connection unexpectedly
>> This probably means the server terminated abnormally
>> before or while processing the request.
>> The connection to the server was lost. Attempting reset: Failed.
>
> Indeed, thanks for the test case. This used to work in v10 but this is
> failing with v11 so I am adding an open item. The plans of the pre-10
> query and the query on HEAD are rather similar, and the memory context
> at execution time looks messed up.

Looks like some memory is being stomped on somewhere. bf6c614a2f2 (Do execGrouping.c via expression eval machinery, take two.) appears to be the first bad commit.

-- 
David Rowley    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: Server crashed with dense_rank on partition table.
Hi.

On 2018/06/13 14:55, Michael Paquier wrote:
> On Wed, Jun 13, 2018 at 11:08:38AM +0530, Rajkumar Raghuwanshi wrote:
>> postgres=# SELECT dense_rank(b) WITHIN GROUP (ORDER BY a) FROM pagg_tab
>> GROUP BY b ORDER BY 1;
>> server closed the connection unexpectedly
>> This probably means the server terminated abnormally
>> before or while processing the request.
>> The connection to the server was lost. Attempting reset: Failed.
>
> Indeed, thanks for the test case. This used to work in v10 but this is
> failing with v11 so I am adding an open item. The plans of the pre-10
> query and the query on HEAD are rather similar, and the memory context
> at execution time looks messed up.

Fwiw, I see that the crash can also occur even when using a non-partitioned table in the query, as shown in the following example, which reuses Rajkumar's test data and query:

create table foo (a int, b int, c text);
insert into foo select i%20, i%30, to_char(i%12, 'FM') from generate_series(0, 36) i;
select dense_rank(b) within group (order by a) from foo group by b order by 1;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

The following query in the regression test suite can also be made to crash by adding a group by clause:

select dense_rank(3) within group (order by x) from (values (1),(1),(2),(2),(3),(3),(4)) v(x) group by (x);
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

Looking at the core dump of this, it seems the following commit may be relevant:

commit bf6c614a2f2c58312b3be34a47e7fb7362e07bcb
Author: Andres Freund
Date:   Thu Feb 15 21:55:31 2018 -0800

    Do execGrouping.c via expression eval machinery, take two.

Thanks,
Amit
Re: Duplicate Item Pointers in Gin index
On Wed, Jun 13, 2018 at 3:32 PM, Peter Geoghegan wrote: > On Tue, Jun 12, 2018 at 11:01 PM, Masahiko Sawada > wrote: >> FWIW, I've looked at this again. I think that the situation Siva >> reported in the first mail can happen before we get commit 3b2787e. >> That is, gin indexes had had a data corruption bug. I've reproduced >> the situation with PostgreSQL 10.1 and observed that a gin index can >> corrupt. > > So, you've recreated the problem with Postgres from before 3b2787e, > but not after 3b2787e? Are you suggesting that 3b2787e might have > fixed it, or that it only hid the problem, or something else? I meant 3b2787e fixed it. I checked that at least the situation doesn't happen after 3b2787e. > How did you recreate the problem? Do you have a test case you can share? I recreated it by executing each steps step by step using gdb. So I can share the test case but it might not help. create extension pageinspect; create table g (c int[]); insert into g select ARRAY[1] from generate_series(1,1000); create index g_idx on g using gin (c); alter table g set (autovacuum_enabled = off); insert into g select ARRAY[1] from generate_series(1, 408); -- 408 items fit in exactly one page of pending list insert into g select ARRAY[1] from generate_series(1, 100); -- insert into 2nd page of pending list select n_pending_pages, n_pending_tuples from gin_metapage_info(get_raw_page('g_idx', 0)); insert into g select ARRAY[999]; -- insert into 2nd pending list page delete from g where c = ARRAY[999]; -- At this point, gin entry of 'ARRAY[999]' exists on 2nd page of pending list and deleted. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Index maintenance function for BRIN doesn't check RecoveryInProgress()
Hi,

Three functions: brin_summarize_new_values, brin_summarize_range and brin_desummarize_range can be called during recovery as follows.

=# select brin_summarize_new_values('a_idx');
ERROR: cannot acquire lock mode ShareUpdateExclusiveLock on database objects while recovery is in progress
HINT: Only RowExclusiveLock or less can be acquired on database objects during recovery.

I think we should raise a "recovery is in progress" error in this case rather than erroring out due to lock modes. Attached patch fixes them.

Regards,

-- 
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

brin_maintenance_func.patch
Description: Binary data
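[For reference, not the attached patch itself: a minimal sketch of the kind of guard such a patch would presumably add at the top of each of the three functions, following the pattern used by the WAL control functions. The helper name and hint wording are assumptions.]

#include "postgres.h"
#include "access/xlog.h"        /* RecoveryInProgress() */

/*
 * Hypothetical helper: each SQL-callable BRIN maintenance function would
 * call this before trying to take any locks.
 */
static void
brin_func_check_recovery(const char *funcname)
{
    if (RecoveryInProgress())
        ereport(ERROR,
                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                 errmsg("recovery is in progress"),
                 errhint("%s() cannot be executed during recovery.", funcname)));
}

/* e.g.: brin_func_check_recovery("brin_summarize_new_values"); */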
Re: Slow planning time for simple query
On 09.06.2018 22:49, Tom Lane wrote:
> Maksim Milyutin writes:
>> On hot standby I faced with the similar problem.
>> ... is planned 4.940 ms on master and *254.741* ms on standby.
>
> (I wonder though why, if you executed the same query on the master, its
> setting of the index-entry-is-dead bits didn't propagate to the standby.)

I have verified the number of dead item pointers (through the pageinspect extension) in the first leaf page of the index participating in the query ('main.message_instance_pkey') on the master and slave nodes and have noticed a big difference.

SELECT * FROM monitoring.bt_page_stats('main.message_instance_pkey', 3705);

On master:

 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
  3705 | l    |          1 |         58 |            24 |      8192 |      6496 |         0 |      3719 |    0 |         65

On standby:

 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
  3705 | l    |         59 |          0 |            24 |      8192 |      6496 |         0 |      3719 |    0 |          1

The vacuum routine improves the situation. Could it be that I have configured WAL logging or replication incorrectly?

> I wonder if we should extend the "SnapshotNonVacuumable" logic introduced
> in commit 3ca930fc3 so that in hot standby, *all* index entries are deemed
> non vacuumable. This would essentially get rid of long standby planning
> times in this sort of scenario by instead accepting worse (possibly much
> worse) planner range estimates. I'm unsure if that's a good tradeoff or not.

I applied the patch introduced in this commit to the standby (not the master; I don't know if this is correct) and haven't noticed any differences.

-- 
Regards,
Maksim Milyutin
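[To make Tom's suggestion above concrete, the change might look roughly like this. This is a sketch only, not a tested patch; whether keying on RecoveryInProgress() here is the right trade-off is exactly the open question.]

#include "postgres.h"
#include "access/xlog.h"        /* RecoveryInProgress() */
#include "utils/tqual.h"

bool
HeapTupleSatisfiesNonVacuumable(HeapTuple htup, Snapshot snapshot,
                                Buffer buffer)
{
    /*
     * Sketch: on a hot standby, deem every index entry non-vacuumable,
     * trading worse planner range estimates for bounded planning time.
     */
    if (RecoveryInProgress())
        return true;

    return HeapTupleSatisfiesVacuum(htup, snapshot->xmin, buffer)
        != HEAPTUPLE_DEAD;
}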
Re: Index maintenance function for BRIN doesn't check RecoveryInProgress()
On Wed, Jun 13, 2018 at 2:28 PM, Masahiko Sawada wrote: > Hi, > > Three functions: brin_summarize_new_values, brin_summarize_range and > brin_desummarize_range can be called during recovery as follows. > > =# select brin_summarize_new_values('a_idx'); > ERROR: cannot acquire lock mode ShareUpdateExclusiveLock on database > objects while recovery is in progress > HINT: Only RowExclusiveLock or less can be acquired on database > objects during recovery. > > I think we should complaint "recovery is in progress" error in this > case rather than erroring due to lock modes. +1 -- Thanks & Regards, Kuntal Ghosh EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors
On 10-06-2018 10:38, Fabien COELHO wrote: Hello Marina, Hello! v9-0003-Pgbench-errors-use-the-ereport-macro-to-report-de.patch - a patch for the ereport() macro (this is used to report client failures that do not cause an aborts and this depends on the level of debugging). ISTM that abort() is called under FATAL. If you mean abortion of the client, this is not an abortion of the main program. - implementation: if possible, use the local ErrorData structure during the errstart()/errmsg()/errfinish() calls. Otherwise use a static variable protected by a mutex if necessary. To do all of this export the function appendPQExpBufferVA from libpq. This patch applies cleanly on top of the other ones (there are minimal interactions), compiles cleanly, global & pgbench "make check" are ok. :-) IMO this patch is more controversial than the other ones. It is not really related to the aim of the patch series, which could do without, couldn't it? I'd suggest that it should be an independent submission, unrelated to the pgbench error management patch. I suppose that this is related; because of my patch there may be a lot of such code (see v7 in [1]): - fprintf(stderr, - "malformed variable \"%s\" value: \"%s\"\n", - var->name, var->svalue); + if (debug_level >= DEBUG_FAILS) + { + fprintf(stderr, + "malformed variable \"%s\" value: \"%s\"\n", + var->name, var->svalue); + } - if (debug) + if (debug_level >= DEBUG_ALL) fprintf(stderr, "client %d sending %s\n", st->id, sql); That's why it was suggested to make the error function which hides all these things (see [2]): There is a lot of checks like "if (debug_level >= DEBUG_FAILS)" with corresponding fprintf(stderr..) I think it's time to do it like in the main code, wrap with some function like log(level, msg). Moreover, it changes pgbench current behavior, which might be admissible, but should be discussed clearly. The semantics of the existing code is changed, the FATAL levels calls abort() and replace existing exit(1) calls. Maybe you want an ERROR level as well. Oh, thanks, I agree with you. And I do not want to change the program exit code without good reasons, but I'm sorry I may not know all pros and cons in this matter.. Or did you also mean other changes? The code adapts/duplicates existing server-side "ereport" stuff and brings it to the frontend, where the logging needs are somehow quite different. I'd prefer to avoid duplication and/or have some code sharing. I was recommended to use the same interface in [3]: On elog/errstart: we already have a convention for what ereport() calls look like; I suggest to use that instead of inventing your own. If it really needs to be duplicated, I'd suggest to put all this stuff in separated files. If we want to do that, I think that it would belong to fe_utils, and where it could/should be used by all front-end programs. I'll try to do it.. I do not understand why names are changed, eg ELEVEL_FATAL instead of FATAL. ISTM that part of the point of the move would be to be homogeneous, which suggests that the same names should be reused. Ok! For logging purposes, ISTM that the "elog" macro interface is nicer, closer to the existing "fprintf(stderr", as it would not introduce the additional parentheses hack for "rest". I was also recommended to use ereport() instead of elog() in [3]: With that, is there a need for elog()? In the backend we have it because $HISTORY but there's no need for that here -- I propose to lose elog() and use only ereport everywhere. 
I see no actual value in creating on the fly a dynamic buffer through plenty of macros and functions when the end result is just to print the message out to stderr in the end:

errfinishImpl: fprintf(stderr, "%s", error->message.data);

This looks like overkill. From reading the code, this does not look like an improvement:

fprintf(stderr, "invalid socket: %s", PQerrorMessage(st->con));

vs

ereport(ELEVEL_LOG, (errmsg("invalid socket: %s", PQerrorMessage(st->con))));

The whole complexity of the server-side interface only makes sense because of the TRY/CATCH stuff and complex logging requirements (eg several outputs) in the backend. The patch adds quite some code and complexity without clear added value that I can see. My 0.02€: maybe you just want to turn

fprintf(stderr, format, ...); // then possibly exit or abort depending...

into

elog(level, format, ...);

which maybe would exit or abort depending on level, and possibly not actually report under some levels and/or some conditions. For that, it could be enough to just provide a nice "elog" function. I agree
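[For what it's worth, a minimal frontend "elog" along those lines could be as small as this. The level names, threshold variable and exit/abort behaviour are assumptions for illustration, not the submitted patch.]

#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

typedef enum { DEBUG, LOG, ERROR, FATAL } elevel;

static elevel log_min_level = LOG;      /* would be set from pgbench's debug options */

/* Print the message if the level is high enough; exit on ERROR, abort on FATAL. */
static void
elog(elevel level, const char *fmt, ...)
{
    va_list ap;

    if (level >= log_min_level)
    {
        va_start(ap, fmt);
        vfprintf(stderr, fmt, ap);
        va_end(ap);
    }
    if (level == ERROR)
        exit(1);
    if (level == FATAL)
        abort();
}

/* usage, e.g.: elog(LOG, "invalid socket: %s\n", PQerrorMessage(st->con)); */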
Re: Index maintenance function for BRIN doesn't check RecoveryInProgress()
On Wed, Jun 13, 2018 at 12:48 PM Kuntal Ghosh wrote: > On Wed, Jun 13, 2018 at 2:28 PM, Masahiko Sawada > wrote: > > Hi, > > > > Three functions: brin_summarize_new_values, brin_summarize_range and > > brin_desummarize_range can be called during recovery as follows. > > > > =# select brin_summarize_new_values('a_idx'); > > ERROR: cannot acquire lock mode ShareUpdateExclusiveLock on database > > objects while recovery is in progress > > HINT: Only RowExclusiveLock or less can be acquired on database > > objects during recovery. > > > > I think we should complaint "recovery is in progress" error in this > > case rather than erroring due to lock modes. > +1 +1, but current behavior doesn't seem to be bug, but rather not precise enough error reporting. So, I think we shouldn't consider backpatching this. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Partitioning with temp tables is broken
Hi,

It seems temp tables with partitioned tables are not working correctly. 9140cf8269b0c has not considered that in build_simple_rel() an AppendRelInfo could be missing due to it having been skipped in expand_partitioned_rtentry().

Assert(cnt_parts == nparts); fails in build_simple_rel, and in a build without asserts enabled it just crashes later when trying to dereference a NULL pointer in prune_append_rel_partitions() or apply_scanjoin_target_to_paths(), depending on whether partition pruning is used and does not prune the relation.

There's also something pretty weird around the removal of the temp relation from the partition bound. I've had cases where the session that attached the temp table is long gone, but \d+ shows the table is still there and I can't attach another partition due to an overlap, and can't drop the temp table due to the session not existing anymore. I've not got a test case for that one yet, but the test case for the crash is:

-- Session 1:
create table listp (a int) partition by list(a);
create table listp1 partition of listp for values in(1);
create temp table listp2 partition of listp for values in (2);

-- Session 2:
select * from listp;

We might want to create the RelOptInfo and somehow set it as a dummy rel, or consider setting it to NULL and skipping the NULLs. I'd rather see the latter for the future, but for PG11 we might prefer the former. I've not looked into how this would be done or if it can be done sanely.

I'll add this to the open items list.

-- 
David Rowley    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On 06/11/2018 11:22 AM, Masahiko Sawada wrote: On Fri, May 25, 2018 at 8:41 PM, Moon, Insung wrote: Hello Hackers, This propose a way to develop "Table-level" Transparent Data Encryption (TDE) and Key Management Service (KMS) support in PostgreSQL. ... As per discussion at PGCon unconference, I think that firstly we need to discuss what threats we want to defend database data against. If user wants to defend against a threat that is malicious user who logged in OS or database steals an important data on datbase this design TDE would not help. Because such user can steal the data by getting a memory dump or by SQL. That is of course differs depending on system requirements or security compliance but what threats do you want to defend database data against? and why? I do agree with this - a description of the threat model needs to be part of the design discussion, otherwise it's not possible to compare it to alternative solutions (e.g. full-disk encryption using LUKS or using existing privilege controls and/or RLS). TDE was proposed/discussed repeatedly in the past, and every time it died exactly because it was not very clear which issue it was attempting to solve. Let me share some of the issues mentioned as possibly addressed by TDE (I'm not entirely sure TDE actually solves them, I'm just saying those were mentioned in previous discussions): 1) enterprise requirement - Companies want in-database encryption, for various reasons (because "enterprise solution" or something). 2) like FDE, but OS/filesystem independent - Same config on any OS and filesystem, which may make maintenance easier. 3) does not require special OS/filesystem setup - Does not require help from system adminitrators, setup of LUKS devices or whatever. 4) all filesystem access (basebackups/rsync) is encrypted anyway 5) solves key management (the main challenge with pgcrypto) 6) allows encrypting only some of the data (tables, columns) to minimize performance impact IMHO it makes sense to have TDE even if it provides the same "security" as disk-level encryption, assuming it's more convenient to setup/use from the database. Also, if I understand correctly, at unconference session there also were two suggestions about the design other than the suggestion by Alexander: implementing TDE at column level using POLICY, and implementing TDE at table-space level. The former was suggested by Joe but I'm not sure the detail of that suggestion. I'd love to hear the deal of that suggestion. The latter was suggested by Tsunakawa-san. Have you considered that? You mentioned that encryption of temporary data for query processing and large objects are still under the consideration. But other than them you should consider the temporary data generated by other subsystems such as reorderbuffer and transition table as well. The severity of those limitations is likely related to the threat model. I don't think encrypting temporary data would be a big problem, assuming you know which key to use. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
WAL prefetch
There was a very interesting presentation at PGCon about pg_prefaulter: http://www.pgcon.org/2018/schedule/events/1204.en.html
But it is implemented in Go and uses pg_waldump. I tried to do the same but using the built-in Postgres WAL traverse functions. I have implemented it as an extension for simplicity of integration. In principle it can be started as a BG worker.

First of all I tried to estimate the effect of preloading data. I have implemented a prefetch utility which is also attached to this mail. It performs random reads of blocks of some large file and spawns some number of prefetch threads:

Just normal read without prefetch:
./prefetch -n 0 SOME_BIG_FILE

One prefetch thread which uses pread:
./prefetch SOME_BIG_FILE

One prefetch thread which uses posix_fadvise:
./prefetch -f SOME_BIG_FILE

4 prefetch threads which use posix_fadvise:
./prefetch -f -n 4 SOME_BIG_FILE

Based on these experiments (on my desktop), I made the following conclusions:

1. Prefetching on HDD doesn't give any positive effect.
2. Using posix_fadvise can speed up random reads on SSD by up to 2 times.
3. posix_fadvise(WILLNEED) is more efficient than performing normal reads.
4. Calling posix_fadvise from more than one thread makes no sense.

I have tested wal_prefetch on two powerful servers with 24 cores, a 3Tb NVMe RAID 10 storage device and 256Gb of RAM, connected using InfiniBand. The speed of synchronous replication between the two nodes increased from 56k TPS to 60k TPS (on pgbench with scale 1000).

Usage:
1. At master: create extension wal_prefetch
2. At replica: call the pg_wal_prefetch() function: it will not return until you interrupt it.

The pg_wal_prefetch function will infinitely traverse WAL and prefetch block references in WAL records using the posix_fadvise(WILLNEED) system call. It is possible to explicitly specify a start LSN for the pg_wal_prefetch() function. Otherwise, the WAL redo position will be used as the start LSN.
-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <unistd.h>
#include <fcntl.h>
#include <pthread.h>
#include <assert.h>
#include <time.h>

#define BLOCK_SIZE  8192
#define INIT_SEED   1999
#define MAX_THREADS 8

size_t file_size;
char*  file_name;
size_t n_prefetchers = 1;
int    use_fadvise;
int    n_iterations = 10*1024*1024;

/* Produce a pseudo-random, block-aligned offset within the file */
static off_t random_offset(uint64_t* seed)
{
    off_t rnd = (3141592621u * *seed + 2718281829u) % 17u;
    *seed = rnd;
    return rnd % file_size * BLOCK_SIZE;
}

/* Read random blocks and report the achieved read bandwidth */
void reader(void)
{
    uint64_t seed = INIT_SEED;
    char page[BLOCK_SIZE];
    time_t start = time(NULL);
    int i;
    int fd = open(file_name, O_RDONLY);
    assert(fd >= 0);
    for (i = 0; i < n_iterations; i++)
    {
        off_t offs = random_offset(&seed);
        ssize_t rc = pread(fd, page, sizeof page, offs);
        time_t now;
        assert(rc == BLOCK_SIZE);
        now = time(NULL);
        if (i % 1024 == 0 && now != start)
        {
            printf("%d: %.2f Mb/sec \r", i/1024,
                   (double)(i+1)*BLOCK_SIZE/1024/1024/(now - start));
            fflush(stdout);
        }
    }
}

/* Walk the same pseudo-random sequence ahead of the reader, either with
 * posix_fadvise(WILLNEED) or with plain pread */
void* prefetcher(void* arg)
{
    size_t id = (size_t)arg;
    uint64_t seed = INIT_SEED;
    char page[BLOCK_SIZE];
    int fd = open(file_name, O_RDONLY);
    int i;
    assert(fd >= 0);
    for (i = 0;; i++)
    {
        off_t offs = random_offset(&seed);
        if (i % n_prefetchers == id)
        {
            if (use_fadvise)
            {
                int rc = posix_fadvise(fd, offs, BLOCK_SIZE, POSIX_FADV_WILLNEED);
                assert(rc == 0);
            }
            else
            {
                ssize_t rc = pread(fd, page, sizeof page, offs);
                assert(rc == BLOCK_SIZE);
            }
        }
    }
    return 0;
}

int main(int argc, char* argv[])
{
    pthread_t prefetchers[MAX_THREADS];
    int i;
    int fd;

    for (i = 1; i < argc; i++)
    {
        if (argv[i][0] == '-')
        {
            switch (argv[i][1])
            {
              case 'f':
                use_fadvise = 1;
                continue;
              case 'n':
                n_prefetchers = atoi(argv[++i]);
                continue;
              case 'i':
                n_iterations = atoi(argv[++i]);
                continue;
              default:
              help:
                fprintf(stderr, "prefetch [-f] [-n THREADS] [-i ITERATIONS] file\n");
                return 1;
            }
        }
        else
        {
            file_name = argv[i];
        }
    }
    if (file_name == NULL)
    {
        goto help;
    }
    fd = open(file_name, O_RDONLY);
    assert(fd >= 0);
    file_size = lseek(fd, 0, SEEK_END)/BLOCK_SIZE;
    assert(file_size != 0);

    /* spawn the prefetch threads, then run the timed reader in this thread */
    for (i = 0; i < n_prefetchers; i++)
    {
        pthread_create(&prefetchers[i], NULL, prefetcher, (void*)(size_t)i);
    }
    reader();
    puts("\nDone");
    return 0;
}

wal_prefetch.tgz
Description: application/compressed-tar
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
Hi,

On 05/25/2018 01:41 PM, Moon, Insung wrote:
> Hello Hackers,
> ...
> BTW, I want to support CBC mode encryption[3]. However, I'm not sure how
> to use the IV in CBC mode for this proposal. I'd like to hear opinions by
> security engineer.

I'm not a cryptographer either, but this is exactly where you need a prior discussion about the threat models - there are a couple of chaining modes, each with different weaknesses.

FWIW it may also matter if data_checksums are enabled, because that may prevent malleability attacks affecting some of the modes. Assuming an active attacker (with the ability to modify the data files) is part of the threat model, of course.

regards

-- 
Tomas Vondra    http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
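[To make the IV question concrete, here is a sketch using OpenSSL's EVP interface to encrypt one page with AES-256-CBC. Deriving the IV from the block number is purely an assumption for illustration, not a recommendation: it keeps the IV unique per block but also predictable, which is exactly the kind of property the threat-model discussion has to settle, along with CBC's malleability in the absence of integrity protection.]

#include <stdint.h>
#include <string.h>
#include <openssl/evp.h>

/* Sketch: encrypt one 8kB page with AES-256-CBC and a per-block IV. */
static int
encrypt_page_cbc(const unsigned char key[32], uint32_t blkno,
                 const unsigned char *page, int page_len,
                 unsigned char *out)        /* must have room for page_len + 16 */
{
    unsigned char iv[16] = {0};
    int len, outlen = -1;
    EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();

    if (ctx == NULL)
        return -1;
    memcpy(iv, &blkno, sizeof(blkno));      /* hypothetical IV scheme: block number */

    if (EVP_EncryptInit_ex(ctx, EVP_aes_256_cbc(), NULL, key, iv) == 1 &&
        EVP_EncryptUpdate(ctx, out, &len, page, page_len) == 1)
    {
        outlen = len;
        if (EVP_EncryptFinal_ex(ctx, out + len, &len) == 1)
            outlen += len;
        else
            outlen = -1;
    }
    EVP_CIPHER_CTX_free(ctx);
    return outlen;
}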
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On 06/11/2018 05:22 AM, Masahiko Sawada wrote: > As per discussion at PGCon unconference, I think that firstly we need > to discuss what threats we want to defend database data against. Exactly. While certainly there is demand for encryption for the sake of "checking a box", different designs will defend against different threats, and we should be clear on which ones we are trying to protect against for any particular design. > Also, if I understand correctly, at unconference session there also > were two suggestions about the design other than the suggestion by > Alexander: implementing TDE at column level using POLICY, and > implementing TDE at table-space level. The former was suggested by Joe > but I'm not sure the detail of that suggestion. I'd love to hear the > deal of that suggestion. The idea has not been extensively fleshed out yet, but the thought was that we create column level POLICY, which would transparently apply some kind of transform on input and/or output. The transforms would presumably be expressions, which in turn could use functions (extension or builtin) to do their work. That would allow encryption/decryption, DLP (data loss prevention) schemes (masking, redacting), etc. to be applied based on the policies. This, in and of itself, would not address key management. There is probably a separate need for some kind of built in key management -- perhaps a flexible way to integrate with external systems such as Vault for example, or maybe something self contained, or perhaps both. Or maybe key management is really tied into the separately discussed effort to create SQL VARIABLEs somehow. In any case certainly a lot of room for discussion. Joe -- Crunchy Data - http://crunchydata.com PostgreSQL Support for Secure Enterprises Consulting, Training, & Open Source Development
Re: Duplicate Item Pointers in Gin index
On Wed, Jun 13, 2018 at 11:40 AM Masahiko Sawada wrote: > > On Wed, Jun 13, 2018 at 3:32 PM, Peter Geoghegan wrote: > > On Tue, Jun 12, 2018 at 11:01 PM, Masahiko Sawada > > wrote: > >> FWIW, I've looked at this again. I think that the situation Siva > >> reported in the first mail can happen before we get commit 3b2787e. > >> That is, gin indexes had had a data corruption bug. I've reproduced > >> the situation with PostgreSQL 10.1 and observed that a gin index can > >> corrupt. > > > > So, you've recreated the problem with Postgres from before 3b2787e, > > but not after 3b2787e? Are you suggesting that 3b2787e might have > > fixed it, or that it only hid the problem, or something else? > > I meant 3b2787e fixed it. I checked that at least the situation > doesn't happen after 3b2787e. I also think that 3b2787e should fix such problems. After 3b2787e, vacuum is forced to cleanup all pending list entries, which were inserted before vacuum start. So, vacuum should have everything to be vaccumed merged into posting lists/trees. > > How did you recreate the problem? Do you have a test case you can share? > > I recreated it by executing each steps step by step using gdb. So I > can share the test case but it might not help. > > create extension pageinspect; > create table g (c int[]); > insert into g select ARRAY[1] from generate_series(1,1000); > create index g_idx on g using gin (c); > alter table g set (autovacuum_enabled = off); > insert into g select ARRAY[1] from generate_series(1, 408); -- 408 > items fit in exactly one page of pending list > insert into g select ARRAY[1] from generate_series(1, 100); -- insert > into 2nd page of pending list > select n_pending_pages, n_pending_tuples from > gin_metapage_info(get_raw_page('g_idx', 0)); > insert into g select ARRAY[999]; -- insert into 2nd pending list page > delete from g where c = ARRAY[999]; > -- At this point, gin entry of 'ARRAY[999]' exists on 2nd page of > pending list and deleted. Is this test case completed? It looks like there should be a continuation with concurrent vacuum and insertions managed by gdb... -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: why partition pruning doesn't work?
On Wed, Jun 13, 2018 at 12:15 AM, Tom Lane wrote: >> As for whether to change it at this point in the release cycle, I >> guess that, to me, depends on how invasive the fix is. > > It seems not to be that bad: we just need a shutdown call for the > PartitionPruneState, and then we can remember the open relation there. > The attached is based on David's patch from yesterday. Seems reasonable. Really, I think we should look for a way to hang onto the relation at the point where it's originally opened and locked instead of reopening it here. But that's probably more invasive than we can really justify right at the moment, and I think this is a step in a good direction. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Partitioning with temp tables is broken
On Wed, Jun 13, 2018 at 5:36 PM, David Rowley wrote:
> Hi,
>
> It seems temp tables with partitioned tables are not working correctly.
> 9140cf8269b0c has not considered that in build_simple_rel() an
> AppendRelInfo could be missing due to it having been skipped in
> expand_partitioned_rtentry().
>
> Assert(cnt_parts == nparts); fails in build_simple_rel, and in a build
> without asserts enabled it just crashes later when trying to dereference
> a NULL pointer in prune_append_rel_partitions() or
> apply_scanjoin_target_to_paths(), depending on whether partition pruning
> is used and does not prune the relation.
>
> There's also something pretty weird around the removal of the temp
> relation from the partition bound. I've had cases where the session
> that attached the temp table is long gone, but \d+ shows the table is
> still there and I can't attach another partition due to an overlap,
> and can't drop the temp table due to the session not existing anymore.
> I've not got a test case for that one yet, but the test case for the
> crash is:
>
> -- Session 1:
> create table listp (a int) partition by list(a);
> create table listp1 partition of listp for values in(1);
> create temp table listp2 partition of listp for values in (2);
>
> -- Session 2:
> select * from listp;

The culprit here is the following code in expand_partitioned_rtentry():

    /* As in expand_inherited_rtentry, skip non-local temp tables */
    if (RELATION_IS_OTHER_TEMP(childrel))
    {
        heap_close(childrel, lockmode);
        continue;
    }

> We might want to create the RelOptInfo and somehow set it as a dummy
> rel, or consider setting it to NULL and skipping the NULLs. I'd rather
> see the latter for the future, but for PG11 we might prefer the
> former. I've not looked into how this would be done or if it can be
> done sanely.

A temporary table is a base relation. In order to create a RelOptInfo for a base relation we need to have an entry available for it in query->rtables and simple_rel_array. We need a corresponding RTE and AppendRelInfo so that we can reserve an entry there. Somehow this AppendRelInfo or the RTE should convey that it's a temporary relation from the other session and mark the RelOptInfo empty when it gets created in build_simple_rel() or when it goes into the part_rels array. We would need to access metadata about the temporary table while building the RelOptInfo. That may be fine. I haven't checked.

I haven't thought about setting NULL in the part_rels array. That may be safer than the first option since the places where we scan part_rels are fewer and straightforward.

But having a temporary table as a partition has other effects too, e.g.

\d+ t1
                                   Table "public.t1"
 Column |  Type   | Collation | Nullable | Default | Storage | Stats target | Description
--------+---------+-----------+----------+---------+---------+--------------+-------------
 a      | integer |           |          |         | plain   |              |
 b      | integer |           |          |         | plain   |              |
Partition key: RANGE (a)
Indexes:
    "t1_a" btree (a)
Partitions: t1p1 FOR VALUES FROM (0) TO (100),
            t1p2 FOR VALUES FROM (100) TO (200)

Both t1p1 and t1p2 are regular tables.
create a temporary table as default partition from some other session and insert data

create temporary table t1p3 partition of t1 default;
insert into t1 values (350, 1);

now try creating a partition from a session other than the one where the temporary table was created

create table t1p4 partition of t1 for values from (300) to (400);
ERROR: cannot access temporary tables of other sessions

The reason that happens is that when creating a regular partition we scan the default partition to check whether the regular partition has any data that belongs to the partition being created. A similar error will result if we try to insert a row into the partitioned table which belongs to a temporary partition from another session. I think it's debatable whether that's a bug or not. But it's not as bad as a crash.

So, a solution to fix the crash in your reproduction is to just remove the code in expand_partitioned_rtentry() which skips the temporary tables from the other session:

    /* As in expand_inherited_rtentry, skip non-local temp tables */
    if (RELATION_IS_OTHER_TEMP(childrel))
    {
        heap_close(childrel, lockmode);
        continue;
    }

If we do that, we will see the above error when the partition corresponding to the temporary table from the other session is scanned, i.e. if such a partition is not pruned.

But a larger question is what use such temporary partitions are. Should we just prohibit adding temporary partitions to a permanent partitioned table? We should allow adding temporary partitions to a temporary partitioned table only if they both belong to the same session.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
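[For what it's worth, a sketch of what such a prohibition could look like. Where the check would live (e.g. DefineRelation()/ATExecAttachPartition()), the helper name and the exact message wording are assumptions here.]

#include "postgres.h"
#include "utils/rel.h"

/* Sketch: refuse to mix temporary and permanent relations in one partition tree. */
static void
check_partition_persistence(Relation parent, Relation child)
{
    if (parent->rd_rel->relpersistence == RELPERSISTENCE_TEMP)
    {
        if (RELATION_IS_OTHER_TEMP(parent))
            ereport(ERROR,
                    (errcode(ERRCODE_WRONG_OBJECT_TYPE),
                     errmsg("cannot create a partition of a temporary relation of another session")));
        if (child->rd_rel->relpersistence != RELPERSISTENCE_TEMP)
            ereport(ERROR,
                    (errcode(ERRCODE_WRONG_OBJECT_TYPE),
                     errmsg("cannot create a permanent relation as partition of temporary relation \"%s\"",
                            RelationGetRelationName(parent))));
    }
    else if (child->rd_rel->relpersistence == RELPERSISTENCE_TEMP)
        ereport(ERROR,
                (errcode(ERRCODE_WRONG_OBJECT_TYPE),
                 errmsg("cannot attach a temporary relation as partition of permanent relation \"%s\"",
                        RelationGetRelationName(parent))));
}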
Re: Spilling hashed SetOps and aggregates to disk
On Mon, Jun 11, 2018 at 1:50 PM, Jeff Davis wrote: > I think average group size is the wrong way to look at it, because > averages are only useful for a normal distribution. The two most > interesting cases are: > > 1. Fairly uniform group size (e.g. normal dist) > 2. Skew values, power law distribution, etc., where some groups are > very large and most are tiny by comparison. I am calling the very large > groups "skewed groups". I wasn't using the term "average" in a mathematical sense. I suppose I really meant "typical". I agree that thinking about cases where the group size is nonuniform is a good idea, but I don't think I agree that all uniform distributions are created equal. Uniform distributions with 1 row per group are very different than uniform distributions with 1000 rows per group. > If we get the skewed groups into the hash table, and avoid OOM, I think > that is a success. My patch did that, except it didn't account for two > cases: > > (a) clustered input, where skewed groups may not appear until the > hash table is already full; and > (b) variable-sized transition values that grow with the group size I think that many of the algorithms under consideration could be implemented without worrying about variable-sized transition values, and then the approach could be extended later. However, whether or not a given algorithm can handle clustered input seems like a fairly basic property of the algorithm. I don't think we should set the bar too high because no algorithm is going to be perfect in every case; at the same time, clustered input is pretty common in real-world scenarios, and an algorithm that handles such cases better has a significant leg up over one that can't, all other things being equal. I'm not sure I remember precisely what your proposal was any more. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: hot_standby_feedback vs excludeVacuum and snapshots
On 11 June 2018 at 17:56, Andres Freund wrote: > I don't think this is a good idea. We shouldn't continue down the path > of having running xacts not actually running xacts, but rather go back > to including everything. The case presented in the thread didn't > actually do what it claimed originally, and there's a fair amount of > potential for the excluded xids to cause problems down the line. > > Especially not when the fixes should be backpatched. I think the > earlier patch should be reverted, and then the AEL lock release problem > should be fixed separately. Since Greg has not reappeared to speak either way, I agree we should revert, though I will add comments to document this. I will do this today. Looks like we would need a multi-node isolation tester to formally test the AEL lock release, so I won't add tests for that. -- Simon Riggshttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: why partition pruning doesn't work?
Robert Haas writes: > Seems reasonable. Really, I think we should look for a way to hang > onto the relation at the point where it's originally opened and locked > instead of reopening it here. But that's probably more invasive than > we can really justify right at the moment, and I think this is a step > in a good direction. The existing coding there makes me itch a bit, because there's only a rather fragile line of reasoning justifying the assumption that there is a pre-existing lock at all. So I'd be in favor of what you suggest just to get rid of the "open(NoLock)" hazard. But I agree that it'd be rather invasive and right now is probably not the time for it. regards, tom lane
Re: PostgreSQL vs SQL Standard
On Sun, Jun 10, 2018 at 11:32 AM, Tom Lane wrote: > Andrew Gierth writes: >> I beat at the grammar a bit to see what it would take to fix it at least >> to the extent of allowing a_expr ColId in a select list after removing >> postfix ops. It looked like it was doable by making these keywords more >> reserved (all of which are already reserved words per spec): >> DOUBLE, DAY, FILTER, HOUR, MINUTE, MONTH, OVER, PRECISION, SECOND, >> VARYING, WITHIN, WITHOUT, YEAR > > Yeah, a side effect of allowing "a_expr ColId" is that we can expect, > going forward, that a lot of new keywords are going to have to become > fully reserved that otherwise wouldn't have to be. This is particularly > a problem because of the SQL committee's COBOL-hangover tendency to > invent new syntax that involves sequences of keywords; we usually > don't have a good way to deal with that short of making the first > keyword(s) reserved. > > It's arguable that going down that path will, in the long run, lead to > breaking more applications (via their table/column names suddenly becoming > reserved in some new version) than we rescue from having to quote their > SELECT aliases. At the very least we need to recognize that this is far > from cost-free. It depends on the source of those applications. Applications originally written for PostgreSQL are going to break if we reserve those keywords. Applications originally written for other databases that already reserve those words won't, and migrating to PostgreSQL will become easier for those people. I think this is to some extent a question about whether it's more important to have backward compatibility with our own previous releases or compatibility with other SQL database systems. > (wanders away wondering exactly what parsing technology the SQL committee > thinks implementations use...) I wonder about this, too, because I've noticed that one other system in particular seems to do parsing in a quite different manner than we do (but I have no idea specifically how they do it). bison is a good tool in many ways and converting to something else would be a huge amount of work, but there are also a number of things about it that are not so great: - The fact that conflicts can only be resolved with precedence declarations sucks. You can only set the priority of rule A relative to rule B unless they've both got a precedence assigned, and there's frequently no non-invasive way to accomplish that. It would be nice to be able to mark the postfix operator production as "weak" so that it's only used when no other interpretation is possible. Of course this kind of thing could be overdone leading to a confusing grammar, but it's not a bad idea to be able to do things like this to handle corner cases, at least IMHO. - It really only handles LR(1), but for certain cases more tokens of lookahead would be really useful. NULLS_LA and NOT_LA are nasty hacks to work around the lack of adequate lookahead, and there are other cases where we could do nicer things if we had it. - There's no way to modify the parser at runtime, which means our extension facility can't invent new SQL syntax, not even e.g. CREATE THINGY and DROP THINGY. (Our lack of support for thingys is really holding us back.) - The release cadence is extremely slow. They just released 3.0.5, but 3.0.4 came out in 2015 and 3.0.3 in 2013. It's hard to imagine bison going away completely, but it's also pretty hard to imagine it ever having any new features that would help us. 
And if we find a bug, waiting for the next release to fix it is going to be painful. To all appearances, it's only a little ahead of abandonware. I'm not sure that there's a better tool than bison out there and I'm not volunteering to write one in my spare time, but if something turns up, we might not want to dismiss it out of hand. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Spilling hashed SetOps and aggregates to disk
On Tue, Jun 5, 2018 at 5:05 AM Tomas Vondra wrote: > > On 06/05/2018 07:46 AM, Jeff Davis wrote: > > On Tue, 2018-06-05 at 07:04 +0200, Tomas Vondra wrote: > >> I expect the eviction strategy to be the primary design challenge of > >> this patch. The other bits will be mostly determined by this one > >> piece. > > > > Not sure I agree that this is the primary challenge. > > > > The cases that benefit from eviction are probably a minority. I see two > > categories that would benefit: > > > >* Natural clustering in the heap. This sounds fairly common, but a > > lot of the cases that come to mind are too low-cardinality to be > > compelling; e.g. timestamps grouped by hour/day/month. If someone has > > run into a high-cardinality natural grouping case, let me know. > >* ARRAY_AGG (or similar): individual state values can be large enough > > that we need to evict to avoid exceeding work_mem even if not adding > > any new groups. > > > > In either case, it seems like a fairly simple eviction strategy would > > work. For instance, we could just evict the entire hash table if > > work_mem is exceeded or if the hit rate on the hash table falls below a > > certain threshold. If there was really something important that should > > have stayed in the hash table, it will go back in soon anyway. > > > > So why should eviction be a major driver for the entire design? I agree > > it should be an area of improvement for the future, so let me know if > > you see a major problem, but I haven't been as focused on eviction. > > > > My concern is more about what happens when the input tuple ordering is > inherently incompatible with the eviction strategy, greatly increasing > the amount of data written to disk during evictions. > > Say for example that we can fit 1000 groups into work_mem, and that > there are 2000 groups in the input data set. If the input is correlated > with the groups, everything is peachy because we'll evict the first > batch, and then group the remaining 1000 groups (or something like > that). But if the input data is random (which can easily happen, e.g. > for IP addresses, UUIDs and such) we'll hit the limit repeatedly and > will evict much sooner. > > I know you suggested simply dumping the whole hash table and starting > from scratch while we talked about this at pgcon, but ISTM it has > exactly this issue. > > But I don't know if there actually is a better option - maybe we simply > have to accept this problem. After all, running slowly is still better > than OOM (which may or may not happen now). > > I wonder if we can somehow detect this at run-time and maybe fall-back > to groupagg. E.g. we could compare number of groups vs. number of input > tuples when we first hit the limit. It's a rough heuristics, but maybe > sufficient. I've been applying a strategy like that to do massive streaming aggregation quite successfully. The code I have is in python and in a private repo. There have been talks of both opensourcing it, and of porting it into postgres as a kind of aggregate node, because it sounds like a good idea. It seems very related to this thread. In essence, the technique I've been using always uses a fixed-size hash table to do partial grouping. The table is never grown, when collisions do happen, the existing entry in the hash table is flushed to disk and the aggregate state in that bucket reset for the incoming key. To avoid excessive spilling due to frequent collisions, we use a kind of "lazy cuckoo" hash table. 
Lazy in the sense that it does no relocations, it just checks 2 hash values, and if it has to evict, it evicts the "oldest" value (with some cheap definition of "old"). The technique works very well to reduce temporary data size with a fixed amount of working memory. The resulting spill file can then be processed again to finalize the computation. What I was pondering PG could do, is feed the spilled tuples to a sort node, using the partial hash aggregation as a data-reducing node only. scan -> partial hash agg -> sort -> final group agg The group agg would have to know to consume and combine aggregate states instead of producing them, but in essence it seems a relatively efficient process. An adaptive hash agg node would start as a hash agg, and turn into a "partial hash agg + sort + final group agg" when OOM is detected. The benefit over ordinary sort+group agg is that the sort is happening on a potentially much smaller data set. When the buffer is large enough to capture enough key repetitions, the output of the partial hash agg can be orders of magnitude smaller than the scan. My proposal was to use this for all group aggs, not only when the planner chooses a hash agg, because it can speed up the sort and reduce temp storage considerably. I guess the trick lies in estimating that cardinality reduction to avoid applying this when there's no hope of getting a payoff. The overhead of such a lazy hash table isn't much, really. But yes, its cache locality is terri
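[A self-contained toy sketch in C of the fixed-size "lazy cuckoo" eviction idea described above. Names and details are mine, not Claudio's private implementation: two candidate buckets per key, no relocations, and the "older" occupant is spilled when both candidates are taken.]

#include <stdio.h>
#include <stdint.h>

#define NBUCKETS 1024            /* fixed size: the table never grows */

typedef struct {
    int64_t  key;
    int64_t  sum;                /* the partial aggregate state */
    uint64_t stamp;              /* crude "age" used for eviction */
    int      used;
} Bucket;

static Bucket   table[NBUCKETS];
static uint64_t clock_ticks;
static uint64_t n_spilled;

static uint64_t hash1(int64_t k) { return (uint64_t) k * 0x9E3779B97F4A7C15ull; }
static uint64_t hash2(int64_t k) { return ((uint64_t) k ^ 0xC2B2AE3D27D4EB4Full) * 0xFF51AFD7ED558CCDull; }

/* Stand-in for appending (key, partial state) to a spill file that a later
 * sort + final-aggregate pass would merge. */
static void spill(const Bucket *b)
{
    (void) b;
    n_spilled++;
}

/* Add "value" to key's group; when both candidate buckets hold other keys,
 * evict the older one -- no relocation, hence "lazy" cuckoo. */
static void partial_agg(int64_t key, int64_t value)
{
    Bucket *a = &table[hash1(key) % NBUCKETS];
    Bucket *b = &table[hash2(key) % NBUCKETS];
    Bucket *victim;

    clock_ticks++;
    if (a->used && a->key == key) { a->sum += value; a->stamp = clock_ticks; return; }
    if (b->used && b->key == key) { b->sum += value; b->stamp = clock_ticks; return; }
    if (!a->used)      victim = a;
    else if (!b->used) victim = b;
    else               victim = (a->stamp < b->stamp) ? a : b;   /* evict the "oldest" */

    if (victim->used)
        spill(victim);
    victim->key = key;
    victim->sum = value;
    victim->stamp = clock_ticks;
    victim->used = 1;
}

int main(void)
{
    uint64_t in_memory = 0;

    /* toy input: 5000 distinct keys, far more than the table can hold */
    for (int i = 0; i < 100000; i++)
        partial_agg(i % 5000, 1);

    for (int i = 0; i < NBUCKETS; i++)
        if (table[i].used)
            in_memory++;

    printf("groups still in memory: %llu, partial groups spilled: %llu\n",
           (unsigned long long) in_memory, (unsigned long long) n_spilled);
    return 0;
}

The spilled partial states would then be fed to the "sort + final group agg" stage described above, which combines aggregate states instead of producing them from scratch.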
Re: Index maintenance function for BRIN doesn't check RecoveryInProgress()
On 2018-Jun-13, Alexander Korotkov wrote: > On Wed, Jun 13, 2018 at 12:48 PM Kuntal Ghosh > wrote: > > On Wed, Jun 13, 2018 at 2:28 PM, Masahiko Sawada > > wrote: > > > Hi, > > > > > > Three functions: brin_summarize_new_values, brin_summarize_range and > > > brin_desummarize_range can be called during recovery as follows. > > > > > > =# select brin_summarize_new_values('a_idx'); > > > ERROR: cannot acquire lock mode ShareUpdateExclusiveLock on database > > > objects while recovery is in progress > > > HINT: Only RowExclusiveLock or less can be acquired on database > > > objects during recovery. Good catch! > > > I think we should complaint "recovery is in progress" error in this > > > case rather than erroring due to lock modes. > > +1 > > +1, > but current behavior doesn't seem to be bug, but rather not precise > enough error reporting. So, I think we shouldn't consider > backpatching this. I guess you could go either way ... we're just changing one unhelpful error with a better one: there is no change in behavior. I would backpatch this, myself, and avoid the code divergence. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Partitioning with temp tables is broken
Ashutosh Bapat writes: > [ lots o' problems with $subject ] > But a larger question is what use such temporary partitions are? > Should we just prohibit adding temporary partitions to a permanant > partitioned table? We should allow adding temporary partitions to a > temporary partitioned table if only they both belong to the same > session. Even if you want to argue that there's a use case for these situations, it seems far too late in the release cycle to be trying to fix all these issues. I think we need to forbid the problematic cases for now, and leave relaxing the prohibition to be treated as a future new feature. regards, tom lane
Re: Index maintenance function for BRIN doesn't check RecoveryInProgress()
On 13 June 2018 at 15:51, Alvaro Herrera wrote: > On 2018-Jun-13, Alexander Korotkov wrote: > >> On Wed, Jun 13, 2018 at 12:48 PM Kuntal Ghosh >> wrote: >> > On Wed, Jun 13, 2018 at 2:28 PM, Masahiko Sawada >> > wrote: >> > > Hi, >> > > >> > > Three functions: brin_summarize_new_values, brin_summarize_range and >> > > brin_desummarize_range can be called during recovery as follows. >> > > >> > > =# select brin_summarize_new_values('a_idx'); >> > > ERROR: cannot acquire lock mode ShareUpdateExclusiveLock on database >> > > objects while recovery is in progress >> > > HINT: Only RowExclusiveLock or less can be acquired on database >> > > objects during recovery. > > Good catch! > >> > > I think we should complaint "recovery is in progress" error in this >> > > case rather than erroring due to lock modes. >> > +1 >> >> +1, >> but current behavior doesn't seem to be bug, but rather not precise >> enough error reporting. So, I think we shouldn't consider >> backpatching this. > > I guess you could go either way ... we're just changing one unhelpful > error with a better one: there is no change in behavior. I would > backpatch this, myself, and avoid the code divergence. WAL control functions all say the same thing, so we can do that here also. I'd prefer it if the message was more generic, so remove the summarization/desummarization wording from the message. i.e. "BRIN control functions cannot be executed during recovery" -- Simon Riggshttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Duplicate Item Pointers in Gin index
On Wed, Jun 13, 2018 at 10:22 PM, Alexander Korotkov wrote: > On Wed, Jun 13, 2018 at 11:40 AM Masahiko Sawada > wrote: >> >> On Wed, Jun 13, 2018 at 3:32 PM, Peter Geoghegan wrote: >> > On Tue, Jun 12, 2018 at 11:01 PM, Masahiko Sawada >> > wrote: >> >> FWIW, I've looked at this again. I think that the situation Siva >> >> reported in the first mail can happen before we get commit 3b2787e. >> >> That is, gin indexes had had a data corruption bug. I've reproduced >> >> the situation with PostgreSQL 10.1 and observed that a gin index can >> >> corrupt. >> > >> > So, you've recreated the problem with Postgres from before 3b2787e, >> > but not after 3b2787e? Are you suggesting that 3b2787e might have >> > fixed it, or that it only hid the problem, or something else? >> >> I meant 3b2787e fixed it. I checked that at least the situation >> doesn't happen after 3b2787e. > > I also think that 3b2787e should fix such problems. After 3b2787e, > vacuum is forced to cleanup all pending list entries, which were > inserted before vacuum start. So, vacuum should have everything to be > vaccumed merged into posting lists/trees. > >> > How did you recreate the problem? Do you have a test case you can share? >> >> I recreated it by executing each steps step by step using gdb. So I >> can share the test case but it might not help. >> >> create extension pageinspect; >> create table g (c int[]); >> insert into g select ARRAY[1] from generate_series(1,1000); >> create index g_idx on g using gin (c); >> alter table g set (autovacuum_enabled = off); >> insert into g select ARRAY[1] from generate_series(1, 408); -- 408 >> items fit in exactly one page of pending list >> insert into g select ARRAY[1] from generate_series(1, 100); -- insert >> into 2nd page of pending list >> select n_pending_pages, n_pending_tuples from >> gin_metapage_info(get_raw_page('g_idx', 0)); >> insert into g select ARRAY[999]; -- insert into 2nd pending list page >> delete from g where c = ARRAY[999]; >> -- At this point, gin entry of 'ARRAY[999]' exists on 2nd page of >> pending list and deleted. > > Is this test case completed? It looks like there should be a > continuation with concurrent vacuum and insertions managed by gdb... > This is not completed test case. This is only for step 1 and we need concurrent vacuum and insertions as you said. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: why partition pruning doesn't work?
David Rowley writes: > On 13 June 2018 at 16:15, Tom Lane wrote: >> It seems not to be that bad: we just need a shutdown call for the >> PartitionPruneState, and then we can remember the open relation there. >> The attached is based on David's patch from yesterday. > I've looked over this and it seems better than mine. Especially so > that you've done things so that the FmgrInfo is copied into a memory > context that's not about to get reset. Pushed, thanks for looking it over. > One small thing is that I'd move the: > context.partrel = NULL; > to under: > /* These are not valid when being called from the planner */ Judgment call I guess. I tend to initialize struct fields in field order unless there's a good reason to do differently, just so it's easier to confirm that none were overlooked. But I can see the point of your suggestion too, so done that way. There's still one thing I'm a bit confused about here. I noticed that we weren't actually using the partopfamily and partopcintype fields in PartitionPruneContext, so I removed them. But that still leaves both partsupfunc and partcollation as pointers into the relcache that were subject to this hazard. My testing agrees with lousyjack's results that both of those were, in fact, being improperly accessed. The OID comparison effect I mentioned upthread explains why the buildfarm's cache-clobbering members failed to notice any problem with garbage partsupfunc data ... but why did we not see any complaints about invalid collation OIDs? Tis strange. regards, tom lane
Re: POC: GROUP BY optimization
I see v9 is now calling estimate_num_groups(), so it already benefits from extended statistics. Nice! I think the algorithm is more or less the greedy option I proposed in one of the earlier messages - I don't know if it's sufficient or not, but the only other idea I have is essentially an exhaustive search through all possible permutations. All possible permutation could be slow with extremely large group by clause, so we will need some some stop limit - like join_collapse_limit. I don't like this idea. So that's a nice improvement, although I think we should also consider non-uniform distributions (using the per-column MCV lists). Could you clarify how to do that? 2) Planing cost is the same for all queries. So, cost_sort() doesn't take into account even number of columns. Yeah. As I wrote before, I think we'll need to come up with a model for comparison_cost depending on the column order, which should make costs for those paths different. Yeah. But I still think it should be separated patch which will improve cost estimation and plan choosing in other optimizer part too. OK. I think we should also have a debug GUC for the first patch (i.e. one that would allow reordering group_pathkeys to match the input path). Added to 0002-opt_group_by_index_and_order-v10.patch . debug_group_by_reorder_by_pathkeys. Regarding the two GUCs introduced in v9, I'm not sure I 100% understand what they are doing. Let me try explaining what I think they do: 1) debug_cheapest_group_by - Reorder GROUP BY clauses to using ndistinct stats for columns, placing columns with more distinct values first (so that we don't need to compare the following columns as often). yes 2) debug_group_by_match_order_by - Try to reorder the GROUP BY clauses to match the ORDER BY, so that the group aggregate produces results in the desired output (and we don't need an explicit Sort). yes FWIW I doubt about the debug_group_by_match_order_by optimization. The trouble is that it's only based on comparing the pathkeys lists, and entirely ignores that the two result sets may operate on very different numbers of rows. Consider for example "SELECT * FROM t GROUP BY a,b,c,d ORDER BY c,d" where table "t" has 1B rows, but there are only ~10k rows in the result (which is fairly common in fact tables in star-schema DBs). IIUC the optimization will ensure "c,d" is placed first in the GROUP BY, and then we reorder "a,b" using ndistinct. But this may be much less efficient than simply reordering (a,b,c,d) per ndistinct and then adding explicit Sort on top of that (because on 10k rows that's going to be cheap). As you pointed in next email, this optimization is already implemented in preprocess_groupclause(). And correct resolving of this issue could be done only after implementing of correct cost_sort() - which will pay attention to comparison cost, number of distinct value in columns and how often it's needed to compare second and next columns. The other "issue" I have with get_cheapest_group_keys_order() is how it interacts with group_keys_reorder_by_pathkeys(). I mean, we first call n_preordered = group_keys_reorder_by_pathkeys(path->pathkeys, ...); so the reordering tries to maintain ordering of the input path. Then if (n_preordered == 0) we do this: group_keys_reorder_by_pathkeys(root->sort_pathkeys, ...); Doesn't that mean that if we happen to have input path with partially matching ordering (so still requiring explicit Sort before grouping) we may end up ignoring the ORDER BY entirely? Instead of using Sort that would match the ORDER BY? 
I might have misunderstood this, though. Hm. I believe ordering the input of the GROUP BY clause is always more expensive - just because the output of the GROUP BY clause should have fewer rows than its input, which means cheaper ordering. So here it uses the ORDER BY only if we don't have group pathkey(s) matched by the path's pathkeys. I'm not sure why the first group_keys_reorder_by_pathkeys() call does not simply return 0 when the input path ordering is not sufficient for the grouping? Is there some reasoning behind that (e.g. that sorting partially sorted input is faster)? Hm, what do you mean - sufficient? group_keys_reorder_by_pathkeys() always tries to order GROUP BY columns by preexisting pathkeys. But of course, incremental sort will add more benefits here. I'd like to hope that the incremental sort patch is close to commit. Overall I think this patch introduces four different optimizations that are indeed very interesting: 1) consider additional sorted paths for grouping Agree 2) reorder GROUP BY clauses per ndistinct to minimize comparisons Agree (with the help of estimate_num_groups(); the patch tries to maximize the number of groups on each step) 3) when adding Sort for grouping, maintain partial input ordering Agree - and the incremental sort patch will be helpful here. 4) when adding Sort for grouping, try producing the right output order (if the ORDER BY was specified)
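To illustrate the greedy strategy discussed above, here is a toy standalone C sketch (not planner code): at each step it picks the not-yet-chosen column with the highest estimated number of distinct values, so the earlier sort keys discriminate as much as possible. The column names and ndistinct figures are made up, and the real patch is described as using estimate_num_groups() on each prefix rather than raw per-column ndistinct.

#include <stdio.h>

/*
 * Toy illustration of greedy GROUP BY reordering: order columns so that
 * the one expected to produce the most groups comes first.  The ndistinct
 * numbers below are invented inputs, not statistics from a real table.
 */
int
main(void)
{
	const char *cols[] = {"a", "b", "c", "d"};
	double		ndistinct[] = {10.0, 1000.0, 4.0, 250.0};
	int			chosen[4] = {0, 0, 0, 0};

	printf("suggested GROUP BY order:");
	for (int step = 0; step < 4; step++)
	{
		int			best = -1;

		for (int i = 0; i < 4; i++)
			if (!chosen[i] && (best < 0 || ndistinct[i] > ndistinct[best]))
				best = i;
		chosen[best] = 1;
		printf(" %s", cols[best]);
	}
	printf("\n");				/* prints: b d a c */
	return 0;
}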
Re: Partitioning with temp tables is broken
On Wed, Jun 13, 2018, 8:34 PM Tom Lane wrote: > Ashutosh Bapat writes: > > [ lots o' problems with $subject ] > > > But a larger question is what use such temporary partitions are? > > Should we just prohibit adding temporary partitions to a permanent > > partitioned table? We should allow adding temporary partitions to a > > temporary partitioned table only if they both belong to the same > > session. > > Even if you want to argue that there's a use case for these situations, > it seems far too late in the release cycle to be trying to fix all these > issues. I think we need to forbid the problematic cases for now, and > leave relaxing the prohibition to be treated as a future new feature. > +1, forbid the problematic case. Regards, Amul Sent from a mobile device. Please excuse brevity and typos. >
Re: POC: GROUP BY optimization
I.e. we already do reorder the group clauses to match ORDER BY, to only require a single sort. This happens in preprocess_groupclause(), which also explains the reasoning behind that. Huh. I missed that. That means the group_keys_reorder_by_pathkeys() call inside get_cheapest_group_keys_order() duplicates the work of preprocess_groupclause(). And so it's possible to replace it with simply counting the number of matching pathkeys at the beginning of the lists. Or remove most of preprocess_groupclause() - but it seems to me we should use the first option: preprocess_groupclause() is simpler, as it doesn't operate on pathkeys, only on SortGroupClause. BTW, the incremental sort patch provides pathkeys_common(), exactly what we need. I wonder if some of the new code reordering group pathkeys could/should be moved here (not sure, maybe it's too early for those decisions). In any case, it might be appropriate to update some of the comments before preprocess_groupclause() which claim we don't do certain things added by the proposed patches. preprocess_groupclause() is too early a step to use something like group_keys_reorder_by_pathkeys() because the pathkeys are not known there yet. This probably also somewhat refutes my claim that the order of grouping keys is currently fully determined by users (and so they may pick the most efficient order), while the reorder-by-ndistinct patch would make that impossible. Apparently when there's ORDER BY, we already mess with the order of group clauses - there are ways to get around it (subquery with OFFSET 0) but it's much less clear. I like the moment when objections go away :) -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: Index maintenance function for BRIN doesn't check RecoveryInProgress()
On Thu, Jun 14, 2018 at 12:04 AM, Simon Riggs wrote: > On 13 June 2018 at 15:51, Alvaro Herrera wrote: >> On 2018-Jun-13, Alexander Korotkov wrote: >> >>> On Wed, Jun 13, 2018 at 12:48 PM Kuntal Ghosh >>> wrote: >>> > On Wed, Jun 13, 2018 at 2:28 PM, Masahiko Sawada >>> > wrote: >>> > > Hi, >>> > > >>> > > Three functions: brin_summarize_new_values, brin_summarize_range and >>> > > brin_desummarize_range can be called during recovery as follows. >>> > > >>> > > =# select brin_summarize_new_values('a_idx'); >>> > > ERROR: cannot acquire lock mode ShareUpdateExclusiveLock on database >>> > > objects while recovery is in progress >>> > > HINT: Only RowExclusiveLock or less can be acquired on database >>> > > objects during recovery. >> >> Good catch! >> >>> > > I think we should complaint "recovery is in progress" error in this >>> > > case rather than erroring due to lock modes. >>> > +1 >>> >>> +1, >>> but current behavior doesn't seem to be bug, but rather not precise >>> enough error reporting. So, I think we shouldn't consider >>> backpatching this. >> >> I guess you could go either way ... we're just changing one unhelpful >> error with a better one: there is no change in behavior. I would >> backpatch this, myself, and avoid the code divergence. > > WAL control functions all say the same thing, so we can do that here also. +1 > I'd prefer it if the message was more generic, so remove the > summarization/desummarization wording from the message. i.e. > "BRIN control functions cannot be executed during recovery" > Agreed. Attached an updated patch. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c index 60e650d..8453bfe 100644 --- a/src/backend/access/brin/brin.c +++ b/src/backend/access/brin/brin.c @@ -871,6 +871,12 @@ brin_summarize_range(PG_FUNCTION_ARGS) Relation heapRel; double numSummarized = 0; + if (RecoveryInProgress()) + ereport(ERROR, +(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("recovery is in progress"), + errhint("BRIN control functions cannot be executed during recovery."))); + if (heapBlk64 > BRIN_ALL_BLOCKRANGES || heapBlk64 < 0) { char *blk = psprintf(INT64_FORMAT, heapBlk64); @@ -942,6 +948,12 @@ brin_desummarize_range(PG_FUNCTION_ARGS) Relation indexRel; bool done; + if (RecoveryInProgress()) + ereport(ERROR, +(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("recovery is in progress"), + errhint("BRIN control functions cannot be executed during recovery."))); + if (heapBlk64 > MaxBlockNumber || heapBlk64 < 0) { char *blk = psprintf(INT64_FORMAT, heapBlk64);
Re: Portability concerns over pq_sendbyte?
Hi, On 2018-05-24 18:13:23 +0100, Andrew Gierth wrote: > In PG11, pq_sendbyte got changed from taking an int parameter to taking > an int8. > > While that seems to work in general, it does mean that there are now > several places in the code that do the equivalent of: > > unsigned char x = 128; > pq_sendbyte(&buf, x); > > which I believe is not well-defined since pq_sendbyte takes an int8, and > conversions of unrepresentable values to _signed_ integer types are > (iirc) implementation-dependent. It's not implementation defined in postgres' dialect of C - we rely on accurate signed->unsigned conversions in a number of places. But I don't think we should increase that reliance, so I think you're right we should do something about this. > There are also some cases where pq_sendint16 is being called for an > unsigned value or a value that might exceed 32767. Hm, which case were you thinking of here? The calls usually are exactly the types that the wire protocol expects, no? > Would it be better for these to take unsigned values, or have unsigned > variants? I wonder if we should just take 'int' out of the name. Say, pg_send{8,16,32,64}(unsigned ...). Greetings, Andres Freund
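As a standalone illustration of the conversion being discussed (not code from the tree): converting a value that does not fit into a signed 8-bit type is implementation-defined per C99 6.3.1.3p3, while the unsigned direction is fully defined.

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	unsigned char c = 128;

	/*
	 * 128 is not representable in int8: the result of this conversion is
	 * implementation-defined (or may raise an implementation-defined
	 * signal) per C99 6.3.1.3p3.  On common two's-complement platforms it
	 * simply wraps to -128, which is why such code works in practice.
	 */
	int8_t		s = (int8_t) c;

	/*
	 * The reverse direction is fully defined: conversion to an unsigned
	 * type reduces the value modulo 2^8, so -128 becomes 128 again.
	 */
	uint8_t		u = (uint8_t) s;

	printf("%d %u\n", (int) s, (unsigned) u);
	return 0;
}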
Re: Portability concerns over pq_sendbyte?
On 2018-06-13 14:10:37 +0900, Michael Paquier wrote: > On Mon, Jun 11, 2018 at 02:25:44PM +0900, Michael Paquier wrote: > > On Wed, Jun 06, 2018 at 12:27:58PM -0400, Alvaro Herrera wrote: > >> Do you have an answer to this question? Does anybody else? > >> > >> (My guts tell me it'd be better to change these routines to take > >> unsigned values, without creating extra variants. But guts frequently > >> misspeak.) > > > > My guts are telling me as well to not have more variants. Agreed. > > On top of that it seems to me that we'd want to rename any new > > routines to include "uint" in their name instead of "int", and for > > compatibility with past code pq_sendint should not be touched. I'm very doubtful about this one, unless you mean that just the signature shouldn't be touched. Otherwise we'll just increase code duplication unnecessarily? > And also pq_sendint64 needs to be kept around for compatibility. :(. Wonder if it's better to just break people's code. Greetings, Andres Freund
Re: [HACKERS] Optional message to user when terminating/cancelling backend
> On 9 Apr 2018, at 23:55, Andres Freund wrote: > On 2017-06-20 13:01:35 -0700, Andres Freund wrote: >> For extensions it'd also be useful if it'd be possible to overwrite the >> error code. E.g. for citus there's a distributed deadlock detector, >> running out of process because there's no way to interrupt lock waits >> locally, and we've to do some ugly hacking to generate proper error >> messages and code from another session. > > What happened to this request? Seems we're out of the crunch mode and > could round the feature out a littlebit more… Revisiting old patches, I took a stab at this request. Since I don’t really have a use case for altering the sqlerrcode other than the one that Citus cited, I modelled the API around that. The slot now holds a sqlerrcode as well as a message, with functions to just set the message keeping the default sqlerrcode for when that is all one wants to do. There is no function for just altering the sqlerrcode without a message as that seems not useful to me. The combination of sqlerrcode and message was dubbed SignalFeedback for lack of a better term. With this I also addressed something that annoyed me; I had called all the functions Cancel* which technically isn’t true since we also support termination. There are no new user-facing changes in this patch compared to the previous version. This patchset still has the refactoring that Alvaro brought up upthread. Parking this again in the commitfest as it was returned with feedback from the last one it was in (all review comments addressed, see upthread). cheers ./daniel 0001-Refactor-backend-signalling-code-v11.patch Description: Binary data 0002-Support-optional-message-in-backend-cancel-terminate-v11.patch Description: Binary data
Re: [HACKERS] Optional message to user when terminating/cancelling backend
Hi, Onder, I CCed you because it seems worthwhile to ensure that the relevant Citus code could use this instead of the gross hack you and I committed... On 2018-06-13 20:54:03 +0200, Daniel Gustafsson wrote: > > On 9 Apr 2018, at 23:55, Andres Freund wrote: > > On 2017-06-20 13:01:35 -0700, Andres Freund wrote: > >> For extensions it'd also be useful if it'd be possible to overwrite the > >> error code. E.g. for citus there's a distributed deadlock detector, > >> running out of process because there's no way to interrupt lock waits > >> locally, and we've to do some ugly hacking to generate proper error > >> messages and code from another session. > > > > What happened to this request? Seems we're out of the crunch mode and > > could round the feature out a littlebit more… > > Revisiting old patches, I took a stab at this request. > > Since I don’t really have a use case for altering the sqlerrcode other than > the > on that Citus.. cited, I modelled the API around that. Cool. Onder? > diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml > index b851fe023a..ad9763698f 100644 > --- a/doc/src/sgml/func.sgml > +++ b/doc/src/sgml/func.sgml > @@ -18540,7 +18540,7 @@ SELECT set_config('log_statement_stats', 'off', > false); > > > > -pg_cancel_backend(pid > int) > +pg_cancel_backend(pid > int [, message > text]) > > boolean > Cancel a backend's current query. This is also allowed if the > @@ -18565,7 +18565,7 @@ SELECT set_config('log_statement_stats', 'off', > false); > > > > -pg_terminate_backend(pid > int) > +pg_terminate_backend(pid > int [, message > text]) > > boolean > Terminate a backend. This is also allowed if the calling role > @@ -18596,6 +18596,14 @@ SELECT set_config('log_statement_stats', 'off', > false); > The role of an active backend can be found from the > usename column of the > pg_stat_activity view. > +If the optional message parameter is set, the text > +will be appended to the error message returned to the signalled backend. > +message is limited to 128 bytes, any longer text > +will be truncated. An example where we cancel our own backend: > + > +postgres=# SELECT pg_cancel_backend(pg_backend_pid(), 'Cancellation message > text'); > +ERROR: canceling statement due to user request: Cancellation message text > + > I'm not sure I really like the appending bit. There's a security argument to be made about doing so, but from a user POV that mostly seems restrictive. I wonder if that aspect would be better handled by adding an error context that contains information about which process did the cancellation (or DETAIL?)? > +/* > + * Structure for registering a feedback payload to be sent to a cancelled, or > + * terminated backend. Each backend is registered per pid in the array which > is > + * indexed by Backend ID. Reading and writing the message is protected by a > + * per-slot spinlock. > + */ > +typedef struct > +{ > + pid_t pid; This is the pid of the receiving process? If so, why do we need this in here? That's just duplicated data, no? > + slock_t mutex; > + charmessage[MAX_CANCEL_MSG]; > + int len; > + int sqlerrcode; I'd vote for including the pid of the process that did the cancelling in here. 
> +Datum > +pg_cancel_backend(PG_FUNCTION_ARGS) > +{ > + PG_RETURN_BOOL(pg_cancel_backend_internal(PG_GETARG_INT32(0), NULL)); > +} > + > +Datum > +pg_cancel_backend_msg(PG_FUNCTION_ARGS) > +{ > + pid_t pid; > + char *msg = NULL; > + > + if (PG_ARGISNULL(0)) > + PG_RETURN_NULL(); > + > + pid = PG_GETARG_INT32(0); > + > + if (PG_NARGS() == 2 && !PG_ARGISNULL(1)) > + msg = text_to_cstring(PG_GETARG_TEXT_PP(1)); > + > + PG_RETURN_BOOL(pg_cancel_backend_internal(pid, msg)); > +} Why isn't this just one function? Given that you already have a PG_NARGS check, I don't quite see the point? > /* > * Signal to terminate a backend process. This is allowed if you are a > member > * of the role whose process is being terminated. > * > * Note that only superusers can signal superuser-owned processes. > */ > -Datum > -pg_terminate_backend(PG_FUNCTION_ARGS) > +static bool > +pg_terminate_backend_internal(pid_t pid, char *msg) > { > - int r = pg_signal_backend(PG_GETARG_INT32(0), > SIGTERM); > + int r = pg_signal_backend(pid, SIGTERM, msg); > > if (r == SIGNAL_BACKEND_NOSUPERUSER) > ereport(ERROR, > @@ -146,7 +203,30 @@ pg_terminate_backend(PG_FUNCTION_ARGS) > (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), >(errmsg("must be a member of the role whose > process is being terminated or member of pg_signal_backend"; > > - PG_RETURN_BOOL(r == SIGN
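To make the slot protocol discussed in the review above easier to follow, here is a rough sketch of how the sending side might fill such a per-backend slot under its spinlock. The struct layout follows the quoted patch, but the names SignalFeedbackSlot and SetSignalFeedback are invented for illustration and are not from the patch.

/*
 * Hypothetical sketch of the sender side: copy the feedback message and
 * error code into the target backend's slot under its per-slot spinlock.
 */
#include "postgres.h"
#include "storage/spin.h"

#define MAX_CANCEL_MSG 128

typedef struct SignalFeedbackSlot
{
	pid_t		pid;			/* target backend's pid */
	slock_t		mutex;			/* protects the fields below */
	char		message[MAX_CANCEL_MSG];
	int			len;
	int			sqlerrcode;
} SignalFeedbackSlot;

static void
SetSignalFeedback(SignalFeedbackSlot *slot, const char *message,
				  int sqlerrcode)
{
	SpinLockAcquire(&slot->mutex);
	strlcpy(slot->message, message, sizeof(slot->message));
	slot->len = strlen(slot->message);
	slot->sqlerrcode = sqlerrcode;
	SpinLockRelease(&slot->mutex);
}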
Re: Spilling hashed SetOps and aggregates to disk
On Wed, 2018-06-13 at 10:08 -0400, Robert Haas wrote: > I wasn't using the term "average" in a mathematical sense. I suppose > I really meant "typical". I agree that thinking about cases where > the > group size is nonuniform is a good idea, but I don't think I agree > that all uniform distributions are created equal. Uniform > distributions with 1 row per group are very different than uniform > distributions with 1000 rows per group. The only mechanism we have for dealing efficiently with skewed groups is hashing; no alternative has been proposed. I think what you are saying is that the case of medium-large groups with clustered input are kind of like skewed groups because they have enough locality to benefit from grouping before spilling. I can see that. So how do we handle both of these cases (skewed groups and clustered groups) well? I tend toward a simple approach, which is to just put whatever we can in the hash table. Once it's full, if the hit rate on the hash table (the whole table) drops below a certain threshold, we just dump all the transition states out to disk and empty the hash table. That would handle both skewed groups and clustering fairly well. Clustering would cause the hit rate to rapidly go to zero when we are past the current batch of groups, causing fairly quick switching to new groups which should handle Tomas's case[1]. And it would also handle ordinary skewed groups fairly well -- it may write them out sometimes (depending on how smart our eviction strategy is), but the cost of that is not necessarily bad because it will go back into the hash table quickly. It also offers an incremental implementation strategy: something resembling my patch could be first (which doesn't dump transition states at all), followed by something that can dump transition states, followed by some tweaking to make it smarter. That still leaves the question about what to do with the small groups: partition them (like my patch does) or feed them into a sort and group them? Regards, Jeff Davis [1] https://www.postgresql.org/message-id/46734151-3245-54ac-76fc-17742fb0e6d9%402ndquadrant.com
Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors
Hello Marina, I suppose that this is related; because of my patch there may be a lot of such code (see v7 in [1]): - fprintf(stderr, - "malformed variable \"%s\" value: \"%s\"\n", - var->name, var->svalue); + if (debug_level >= DEBUG_FAILS) + { + fprintf(stderr, + "malformed variable \"%s\" value: \"%s\"\n", + var->name, var->svalue); + } - if (debug) + if (debug_level >= DEBUG_ALL) fprintf(stderr, "client %d sending %s\n", st->id, sql); I'm not sure that debug messages need to be kept after debugging, if it is about debugging pgbench itself. That is debatable. That's why it was suggested to make an error function which hides all these things (see [2]): There are a lot of checks like "if (debug_level >= DEBUG_FAILS)" with corresponding fprintf(stderr..) I think it's time to do it like in the main code, wrap with some function like log(level, msg). Yep. I did not write that, but I agree with an "elog" suggestion to switch if (...) { fprintf(...); exit/abort/continue/... } to a simpler: elog(level, ...) Moreover, it changes pgbench's current behavior, which might be admissible, but should be discussed clearly. The semantics of the existing code are changed: the FATAL level calls abort() and replaces existing exit(1) calls. Maybe you want an ERROR level as well. Oh, thanks, I agree with you. And I do not want to change the program exit code without good reasons, but I'm sorry I may not know all pros and cons in this matter.. Or did you also mean other changes? AFAICR I meant switching exit to abort in some cases. The code adapts/duplicates existing server-side "ereport" stuff and brings it to the frontend, where the logging needs are somehow quite different. I'd prefer to avoid duplication and/or have some code sharing. I was recommended to use the same interface in [3]: On elog/errstart: we already have a convention for what ereport() calls look like; I suggest to use that instead of inventing your own. The "elog" interface already exists, it is not an invention. "ereport" is a hack which is somehow necessary in some cases. I prefer a simple function call if possible for the purpose, and ISTM that this is the case. If it really needs to be duplicated, I'd suggest putting all this stuff in separate files. If we want to do that, I think that it would belong in fe_utils, where it could/should be used by all front-end programs. I'll try to do it.. Dunno. If you only need one "elog" function which prints a message to stderr and decides whether to abort/exit/whatever, maybe it can just be kept in pgbench. If there are several complicated functions and macros, better with a file. So I'd say it depends. For logging purposes, ISTM that the "elog" macro interface is nicer, closer to the existing "fprintf(stderr", as it would not introduce the additional parentheses hack for "rest". I was also recommended to use ereport() instead of elog() in [3]: Probably. Are you hoping that advice from different reviewers will be consistent? That seems optimistic:-) With that, is there a need for elog()? In the backend we have it because $HISTORY but there's no need for that here -- I propose to lose elog() and use only ereport everywhere. See commit 8a07ebb3c172 which turns some ereport into elog... My 0.02€: maybe you just want to turn fprintf(stderr, format, ...); // then possibly exit or abort depending... into elog(level, format, ...); which maybe would exit or abort depending on level, and possibly not actually report under some levels and/or some conditions. 
For that, it could be enough to just provide a nice "elog" function. I agree that elog() can be coded in this way. To use ereport() I need a structure to store the error level as a condition to exit. Yep. That is a lot of complication which is justified server-side where logging requirements are special, but in this case I see it as overkill. So my current view is that if you only need an "elog" function, it is simpler to add it to "pgbench.c". -- Fabien.
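For readers following the thread, a minimal sketch of what such a pgbench-local elog() could look like; the level names and the exit/abort policy below are illustrative assumptions rather than anything taken from the actual patch.

/*
 * Minimal sketch of a pgbench-local elog(): print to stderr and decide
 * whether to continue, exit or abort based on the level.  The levels and
 * the policy here are invented for illustration only.
 */
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

typedef enum
{
	DEBUG,						/* only shown when debugging is enabled */
	LOG,						/* always shown */
	ERROR,						/* shown, then exit(1) */
	FATAL						/* shown, then abort() */
} ElogLevel;

static int	debug_level = 0;	/* would be set from the command line */

static void
elog(ElogLevel level, const char *fmt,...)
{
	va_list		ap;

	if (level == DEBUG && !debug_level)
		return;					/* message suppressed */

	va_start(ap, fmt);
	vfprintf(stderr, fmt, ap);
	va_end(ap);

	if (level == ERROR)
		exit(1);
	else if (level == FATAL)
		abort();
}

A call site would then shrink from the quoted if-block to something like elog(DEBUG, "client %d sending %s\n", st->id, sql).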
Re: Spilling hashed SetOps and aggregates to disk
On Wed, 2018-06-13 at 11:50 -0300, Claudio Freire wrote: > In essence, the technique I've been using always uses a fixed-size > hash table to do partial grouping. The table is never grown, when > collisions do happen, the existing entry in the hash table is flushed > to disk and the aggregate state in that bucket reset for the incoming > key. To avoid excessive spilling due to frequent collisions, we use a > kind of "lazy cuckoo" hash table. Lazy in the sense that it does no > relocations, it just checks 2 hash values, and if it has to evict, it > evicts the "oldest" value (with some cheap definition of "old"). ... > An adaptive hash agg node would start as a hash agg, and turn into a > "partial hash agg + sort + final group agg" when OOM is detected. > > The benefit over ordinary sort+group agg is that the sort is > happening > on a potentially much smaller data set. When the buffer is large > enough to capture enough key repetitions, the output of the partial > hash agg can be orders of magnitude smaller than the scan. > > My proposal was to use this for all group aggs, not only when the > planner chooses a hash agg, because it can speed up the sort and > reduce temp storage considerably. I guess the trick lies in > estimating > that cardinality reduction to avoid applying this when there's no > hope > of getting a payoff. The overhead of such a lazy hash table isn't > much, really. But yes, its cache locality is terrible, so it needs to > be considered. I think this is a promising approach because it means the planner has one less decision to make. And the planner doesn't have great information to make its decision on, anyway (ndistinct estimates are hard enough on plain tables, and all bets are off a few levels up in the plan). Unfortunately, it means that either all data types must support hashing and sorting[1], or we need to have a fallback path that can get by with hashing alone (which would not be tested nearly as well as more typical paths). I still like this idea, though. Regards, Jeff Davis [1] https://www.postgresql.org/message-id/9584.1528739975%40sss.pgh.pa.us
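For illustration, a rough standalone sketch of the fixed-size, evict-on-collision table described above; this is not Claudio's implementation, and the key/state types, hash functions and spill_entry() are placeholders.

/*
 * Sketch of a fixed-size partial-aggregation hash table: every key hashes
 * to two candidate buckets; on a miss with both buckets occupied, the
 * older entry is spilled and its bucket reused for the incoming key.
 */
#include <stdbool.h>
#include <stdint.h>

#define NBUCKETS 1024

typedef struct
{
	bool		used;
	int64_t		key;
	int64_t		agg_state;		/* e.g. a running count or sum */
	uint64_t	last_used;		/* cheap notion of "age" */
} Bucket;

static Bucket buckets[NBUCKETS];
static uint64_t clock_tick;

static uint32_t
hash1(int64_t key)
{
	return (uint32_t) (((uint64_t) key * 2654435761u) % NBUCKETS);
}

static uint32_t
hash2(int64_t key)
{
	uint64_t	k = (uint64_t) key;

	return (uint32_t) (((k >> 17) ^ (k * 40503u)) % NBUCKETS);
}

static void
spill_entry(const Bucket *b)
{
	/* Real code would append (key, partial state) to a spill file. */
	(void) b;
}

/* Fold "value" into the partial aggregate state for "key". */
static void
partial_agg_add(int64_t key, int64_t value)
{
	uint32_t	h1 = hash1(key);
	uint32_t	h2 = hash2(key);
	Bucket	   *b;

	clock_tick++;

	if (buckets[h1].used && buckets[h1].key == key)
		b = &buckets[h1];
	else if (buckets[h2].used && buckets[h2].key == key)
		b = &buckets[h2];
	else if (!buckets[h1].used)
		b = &buckets[h1];
	else if (!buckets[h2].used)
		b = &buckets[h2];
	else
	{
		/* Both candidates hold other keys: evict the older one. */
		b = (buckets[h1].last_used <= buckets[h2].last_used)
			? &buckets[h1] : &buckets[h2];
		spill_entry(b);
		b->used = false;
	}

	if (!b->used)
	{
		b->used = true;
		b->key = key;
		b->agg_state = 0;
	}
	b->agg_state += value;
	b->last_used = clock_tick;
}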
Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors
On 2018-Jun-13, Fabien COELHO wrote: > > > > With that, is there a need for elog()? In the backend we have > > > > it because $HISTORY but there's no need for that here -- I > > > > propose to lose elog() and use only ereport everywhere. > > See commit 8a07ebb3c172 which turns some ereport into elog... For context: in the backend, elog() is only used for internal messages (i.e. "can't-happen" conditions), and ereport() is used for user-facing messages. There are many things ereport() has that elog() doesn't, such as additional message fields (HINT, DETAIL, etc) that I think could have some use in pgbench as well. If you use elog() then you can't have that. Another difference is that in the backend, elog() messages are never translated, while ereport() message are translated. Since pgbench is translatable I think it would be best to keep those things in sync, to avoid confusion. (Although of course you could do it differently in pgbench than backend.) One thing that just came to mind is that pgbench uses some src/fe_utils stuff. I hope having ereport() doesn't cause a conflict with that ... BTW I think abort() is not the right thing, as it'll cause core dumps if enabled. Why not just exit(1)? -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
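For reference, this is what the extra fields look like in a backend-style ereport() call (server-side API, shown only to indicate what a frontend equivalent would need to cover; the detail and hint wording here is invented, and name/value are placeholders):

/* Backend-style ereport() with the additional fields mentioned above. */
ereport(ERROR,
		(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
		 errmsg("malformed variable \"%s\" value: \"%s\"", name, value),
		 errdetail("The value could not be parsed as a number."),
		 errhint("Check the command that assigns this variable.")));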
Re: Portability concerns over pq_sendbyte?
> "Andres" == Andres Freund writes: >> unsigned char x = 128; >> pq_sendbyte(&buf, x); >> >> which I believe is not well-defined since pq_sendbyte takes an int8, >> and conversions of unrepresentable values to _signed_ integer types >> are (iirc) implementation-dependent. Andres> It's not implementation defined in postgres' dialect of C - we Andres> rely on accurate signed->unsigned conversions in a number of Andres> places. Converting signed integer to unsigned is ok as I understand it - what's happening here is the reverse, converting an unrepresentable unsigned value to a signed type. >> There are also some cases where pq_sendint16 is being called for an >> unsigned value or a value that might exceed 32767. Andres> Hm, which case were you thinking of here? The calls usually are Andres> exactly the types that the wire protocol expects, no? There are cases where it's not actually clear what the wire protocol expects - I'm thinking in particular of the number of entries in a list of parameter types/formats. -- Andrew (irc:RhodiumToad)
Re: Portability concerns over pq_sendbyte?
Hi, On 2018-06-13 22:02:13 +0100, Andrew Gierth wrote: > > "Andres" == Andres Freund writes: > > >> unsigned char x = 128; > >> pq_sendbyte(&buf, x); > >> > >> which I believe is not well-defined since pq_sendbyte takes an int8, > >> and conversions of unrepresentable values to _signed_ integer types > >> are (iirc) implementation-dependent. > > Andres> It's not implementation defined in postgres' dialect of C - we > Andres> rely on accurate signed->unsigned conversions in a number of > Andres> places. > > Converting signed integer to unsigned is ok as I understand it - what's > happening here is the reverse, converting an unrepresentable unsigned > value to a signed type. Err, yes, I was thinking about that conversion. Sorry for the confusion. > >> There are also some cases where pq_sendint16 is being called for an > >> unsigned value or a value that might exceed 32767. > > Andres> Hm, which case were you thinking of here? The calls usually are > Andres> exactly the types that the wire protocol expects, no? > > There are cases where it's not actually clear what the wire protocol > expects - I'm thinking in particular of the number of entries in a list > of parameter types/formats. But that didn't change around the pq_send* changes? So I'm not sure I understand how this is related? I mean I'm all for documenting the wire protocol more extensively, but we can't just change the width? Greetings, Andres Freund
Re: Locking B-tree leafs immediately in exclusive mode
On Mon, Jun 11, 2018 at 9:30 AM, Alexander Korotkov wrote: > On Mon, Jun 11, 2018 at 1:06 PM Simon Riggs wrote: >> It's a good idea. How does it perform with many duplicate entries? I agree that this is a good idea. It independently occurred to me to do this. The L&Y algorithm doesn't take a position on this at all, supposing that *no* locks are needed at all (that's why I recommend people skip L&Y and just read the Lanin & Shasha paper -- L&Y is unnecessarily confusing). This idea seems relatively low risk. > Our B-tree is currently maintaining duplicates unordered. So, during > insertion > we can traverse rightlinks in order to find a page which would fit the new > index tuple. > However, in this case we're traversing pages in exclusive lock mode, and > that happens already after re-lock. _bt_search() doesn't do anything with > that. > So, I assume there shouldn't be any degradation in the case of many > duplicate entries. BTW, I have a prototype patch that makes the keys at the leaf level unique. It is actually an implementation of nbtree suffix truncation that sometimes *adds* a new attribute in pivot tuples, rather than truncating away non-distinguishing attributes -- it adds a special heap TID attribute when it must. The patch typically increases fan-in, of course, but the real goal was to make all entries at the leaf level truly unique, as L&Y intended (we cannot do it without suffix truncation because that will kill fan-in). My prototype has full amcheck coverage, which really helped me gain confidence in my approach. There are still big problems with my patch, though. It seems to hurt performance more often than it helps when there is a lot of contention, so as things stand I don't see a path to making it committable. I am still going to clean it up some more and post it to -hackers, though. I still think it's quite interesting. I am not pursuing it much further now because it seems like my timing is bad -- not because it seems like a bad idea. I can imagine something like zheap making this patch important again. I can also imagine someone seeing something that I missed. -- Peter Geoghegan
Re: Portability concerns over pq_sendbyte?
On Wed, Jun 13, 2018 at 11:53:21AM -0700, Andres Freund wrote: > On 2018-06-13 14:10:37 +0900, Michael Paquier wrote: >> On Mon, Jun 11, 2018 at 02:25:44PM +0900, Michael Paquier wrote: >>> On top of that it seems to me that we'd want to rename any new >>> routines to include "uint" in their name instead of "int", and for >>> compatibility with past code pq_sendint should not be touched. > > I'm very doubtful about this one, unless you mean that just the > signature shouldn't be touched. Otherwise we'll just increase code > duplication unnecessarily? Yeah, actually that would be assuming that many modules use it, but that does not seem to be much the case, at least from github's point of view. >> And also pq_sendint64 needs to be kept around for compatibility. > > :(. Wonder if it's better to just break people's code. Indeed. At least breaking compilation has the advantage of making people directly aware of the change and hopefully think about it. A search on github shows a bunch of people having copies of pqformat.h, as there are a bunch of copies of Postgres, so with this much noise it is not easy to find out what would be broken. In-core contrib and test modules don't make use of those interfaces either, except for hstore. So that could be acceptable. For pq_sendint there are many matches with printsimple.c. -- Michael signature.asc Description: PGP signature
Logging transaction IDs for DDL.
I just noticed a problem with log_statement = 'ddl' and log_line_prefix containing '%x'. If the statement is the first in the transaction, it will be logged before it is executed, and more importantly, before a transaction ID is assigned. That means that %x will be 0. If the administrator has configured postgres like this in order to ease PITR on bad DDL, it's actually of no use whatsoever and they'll have to use pg_waldump to try to figure things out. PFA a simple patch that I hope addresses the issue in a clean way. It also handles the same problem for 'mod'. I'm undecided whether this is a bugfix or an improvement. I'm leaning towards bugfix. -- Vik Fearing +33 6 46 75 15 36 http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c index f4133953be..27027a0fa8 100644 --- a/src/backend/tcop/postgres.c +++ b/src/backend/tcop/postgres.c @@ -38,6 +38,7 @@ #include "access/parallel.h" #include "access/printtup.h" +#include "access/transam.h" #include "access/xact.h" #include "catalog/pg_type.h" #include "commands/async.h" @@ -2098,16 +2099,31 @@ check_log_statement(List *stmt_list) if (log_statement == LOGSTMT_NONE) return false; - if (log_statement == LOGSTMT_ALL) + + /* + * If we're supposed to log mod or ddl statements, we need to make sure we + * have a TransactionId so it appears in log_line_prefix's %x wildcard. + * If the user is logging all statements, we can fast-track out if we + * already have a TransactionId, otherwise we need to loop through the + * statements. + */ + if (log_statement == LOGSTMT_ALL && TransactionIdIsValid(GetTopTransactionIdIfAny())) return true; /* Else we have to inspect the statement(s) to see whether to log */ foreach(stmt_item, stmt_list) { Node *stmt = (Node *) lfirst(stmt_item); + LogStmtLevel level = GetCommandLogLevel(stmt); + + if (level <= log_statement) + { + /* Make sure we have a TransactionId for mod and ddl statements. */ + if (level == LOGSTMT_DDL || level == LOGSTMT_MOD) +(void) GetTopTransactionId(); - if (GetCommandLogLevel(stmt) <= log_statement) return true; + } } return false;
Re: Logging transaction IDs for DDL.
Hi, On 2018-06-14 00:34:54 +0200, Vik Fearing wrote: > I just noticed a problem with log_statement = 'ddl' and log_line_prefix > containing '%x'. If the statement is the first in the transaction, it > will be logged before it is executed, and more importantly, before a > transaction ID is assigned. That means that %x will be 0. This isn't particularly related to DDL, no? It's a general weakness of how %x is handled. What's even worse is that by log_min_duration_statement time the xid is already cleared out, when using a single-statement transaction. > If the administrator has configured postgres like this in order to ease > PITR on bad DDL, it's actually of no use whatsoever and they'll have to > use pg_waldump to try to figure things out. To me that part seems like it'd be better handled if we had an option that allows logging at a time when the xid is already assigned. Making log_statement assign xids imo has a number of problems, but preserving the xid for logging at the *end* of a statement should be much less problematic. > I'm undecided whether this is a bugfix or an improvement. I'm leaning > towards bugfix. It's been reported a couple times, without anybody working to fix it. Your fix has a noticeable regression potential, so I'm not immediately in favor of backpatching. > -- > Vik Fearing +33 6 46 75 15 36 > http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support > diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c > index f4133953be..27027a0fa8 100644 > --- a/src/backend/tcop/postgres.c > +++ b/src/backend/tcop/postgres.c > @@ -38,6 +38,7 @@ > > #include "access/parallel.h" > #include "access/printtup.h" > +#include "access/transam.h" > #include "access/xact.h" > #include "catalog/pg_type.h" > #include "commands/async.h" > @@ -2098,16 +2099,31 @@ check_log_statement(List *stmt_list) > > if (log_statement == LOGSTMT_NONE) > return false; > - if (log_statement == LOGSTMT_ALL) > + > + /* > + * If we're supposed to log mod or ddl statements, we need to make sure > we > + * have a TransactionId so it appears in log_line_prefix's %x wildcard. > + * If the user is logging all statements, we can fast-track out if we > + * already have a TransactionId, otherwise we need to loop through the > + * statements. > + */ > + if (log_statement == LOGSTMT_ALL && > TransactionIdIsValid(GetTopTransactionIdIfAny())) > return true; > > /* Else we have to inspect the statement(s) to see whether to log */ > foreach(stmt_item, stmt_list) > { > Node *stmt = (Node *) lfirst(stmt_item); > + LogStmtLevel level = GetCommandLogLevel(stmt); > + > + if (level <= log_statement) > + { > + /* Make sure we have a TransactionId for mod and ddl > statements. */ > + if (level == LOGSTMT_DDL || level == LOGSTMT_MOD) > + (void) GetTopTransactionId(); > > - if (GetCommandLogLevel(stmt) <= log_statement) > return true; > + } > } I'm not at all on board with this. For one this'll break in a number of weird ways on a standby (wrong error messages at the least). For another it increases the overhead of logging quite noticeably. Thirdly moving transaction assignment to a different place based on logging settings seems like it'll be very confusing too. Greetings, Andres Freund
Re: Slow planning time for simple query
13.06.2018 12:40, Maksim Milyutin wrote: On 09.06.2018 22:49, Tom Lane wrote: Maksim Milyutin writes: On hot standby I faced a similar problem. ... is planned 4.940 ms on master and *254.741* ms on standby. (I wonder though why, if you executed the same query on the master, its setting of the index-entry-is-dead bits didn't propagate to the standby.) I have verified the number of dead item pointers (through the pageinspect extension) in the first leaf page of the index participating in the query ('main.message_instance_pkey') on the master and slave nodes and have noticed a big difference. SELECT * FROM monitoring.bt_page_stats('main.message_instance_pkey', 3705);

On master:
 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
  3705 | l    |          1 |         58 |            24 |      8192 |      6496 |         0 |      3719 |    0 |         65

On standby:
 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
  3705 | l    |         59 |          0 |            24 |      8192 |      6496 |         0 |      3719 |    0 |          1

At this point I want to highlight the issue that the changes in *lp_flags* bits (namely, setting items as dead) for index item pointers don't propagate from master to replica in my case. As a consequence, on standby I have live index items most of which are marked as dead on master. And my queries at the planning stage are forced to descend to heap pages during *get_actual_variable_range* execution, which considerably slows down planning. Is it a bug, a restriction of the implementation, or a misconfiguration of WAL/replication? -- Regards, Maksim Milyutin
Re: commitfest 2018-07
On Mon, Jun 11, 2018 at 10:34:55AM -0700, Andres Freund wrote: > On 2018-06-11 11:50:55 +0900, Michael Paquier wrote: >> So, we have confirmation from the RTM that there will be a 2018-07. And >> there is visibly a consensus that renaming 2018-09 to 2018-07 makes the >> most sense. Any objections to moving forward with this renaming and the >> 5-CF plan for v12? > > FWIW, I still think it'd be a better plan to explicitly move things, > rather than just rename. But consensus isn't unanimity, so ... Well, at least we have a consensus. I have gone through the CF app and just did the renaming, creating new commit fests on the way. -- Michael signature.asc Description: PGP signature
RE: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
From: Tomas Vondra [mailto:tomas.von...@2ndquadrant.com] > Let me share some of the issues mentioned as possibly addressed by TDE > (I'm not entirely sure TDE actually solves them, I'm just saying those > were mentioned in previous discussions): FYI, our product provides TDE like Oracle and SQL Server, which enables encryption per tablespace. Relations, WAL records and temporary files related to encrypted tablespace are encrypted. http://www.fujitsu.com/global/products/software/middleware/opensource/postgres/ (I wonder why the web site doesn't offer the online manual... I've recognized we need to fix this situation. Anyway, I guess the downloadable trial version includes the manual.) > 1) enterprise requirement - Companies want in-database encryption, for > various reasons (because "enterprise solution" or something). To assist compliance with PCI DSS, HIPAA, etc. > 2) like FDE, but OS/filesystem independent - Same config on any OS and > filesystem, which may make maintenance easier. > > 3) does not require special OS/filesystem setup - Does not require help > from system adminitrators, setup of LUKS devices or whatever. > > 4) all filesystem access (basebackups/rsync) is encrypted anyway > > 5) solves key management (the main challenge with pgcrypto) > > 6) allows encrypting only some of the data (tables, columns) to minimize > performance impact All yes. > IMHO it makes sense to have TDE even if it provides the same "security" > as disk-level encryption, assuming it's more convenient to setup/use > from the database. Agreed. Regards Takayuki Tsunakawa
Re: pg_config.h.win32 missing a set of flags from pg_config.h.in added in v11 development
On 06/12/2018 08:07 PM, Michael Paquier wrote: On Tue, Jun 12, 2018 at 04:51:59PM -0400, Andrew Dunstan wrote: On 06/12/2018 10:51 AM, Andrew Dunstan wrote: meanwhile, Bowerbird (MSVC) is on 1.0.1d, lorikeet (Cygwin64) is on 1.0.2d and jacana (Mingw) doesn't build with openssl. I will look at upgrading bowerbird ASAP. Well, that crashed and burned. What versions of openssl are we supporting? What kind of failures are you seeing? I just compiled Postgres two days ago with MSVC and OpenSSL 1.0.2o (oldest version with a Windows installer I could find), and that was able to compile. On HEAD, OpenSSL should be supported down to 0.9.8. This thread discusses about whether we want to enforce HAVE_X509_GET_SIGNATURE_NID unconditionally or not, as it is disabled now. Even if the code is linked to 1.0.2 and the flag is not set, then the code should be able to compile. I installed 1.1.0h and got errors you can see on the buildfarm. I've now installed 1.0.2o and everything is good. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
RE: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
> From: Tomas Vondra [mailto:tomas.von...@2ndquadrant.com] > On 05/25/2018 01:41 PM, Moon, Insung wrote: > > BTW, I want to support CBC mode encryption[3]. However, I'm not sure > > how to use the IV in CBC mode for this proposal. I'd like to hear > > opinions by security engineer. > > > > I'm not a cryptographer either, but this is exactly where you need a > prior discussion about the threat models - there are a couple of > chaining modes, each with different weaknesses. Our products uses XTS, which recent FDE software like BitLocker and TrueCrypt uses instead of CBC. https://en.wikipedia.org/wiki/Disk_encryption_theory#XTS "According to SP 800-38E, "In the absence of authentication or access control, XTS-AES provides more protection than the other approved confidentiality-only modes against unauthorized manipulation of the encrypted data."" > FWIW it may also matter if data_checksums are enabled, because that may > prevent malleability attacks affecting of the modes. Assuming active > attacker (with the ability to modify the data files) is part of the > threat model, of course. Encrypt the page after embedding its checksum value. If a malicious attacker modifies a page on disk, then the decrypted page would be corrupt anyway, which can be detected by checksum. Regards Takayuki Tsunakawa
Re: pg_config.h.win32 missing a set of flags from pg_config.h.in added in v11 development
On Wed, Jun 13, 2018 at 08:55:46PM -0400, Andrew Dunstan wrote: > I installed 1.1.0h and got errors you can see on the buildfarm. I've now > installed 1.0.2o and everything is good. Good thing you tested that. I have just used the LTS 1.0.2 for my tests. And there are a couple of problems here. HAVE_BIO_GET_DATA is neither defined nor enforced in pg_config.h.win32, and BIO_get_data has been introduced in 1.1.0, so that explains the failures, as you would need to add it to the config files. I imagine that most folks packaging Postgres on Windows simply rely on 1.0.2 (I do!) which is why we have not seen reports of those failures... A second failure is related to HAVE_BIO_METH_NEW, with all routine sets like BIO_meth_set_write & co new as of OpenSSL 1.1.0. The configure check uses BIO_meth_new(). A third problem is related to HAVE_ASN1_STRING_GET0_DATA, but I don't see a complaint in the buildfarm logs, which is interesting, but that's already present in pg_config.h.win32. A fourth problem is with HAVE_OPENSSL_INIT_SSL, which is missing in pg_config.h.win32. We claim support for OpenSSL 1.1.0 down to 9.4 per bb132cdd, so I think that we should document all those flags even in back-branches. Thoughts from people here? For now, I would suggest keeping track of HAVE_BIO_GET_DATA, HAVE_BIO_METH_NEW and others in pg_config.h.win32 but mark them as undef. Anybody shipping Windows stuff also likely patches the MSVC scripts lightly (I do for some paths!), so they could always enforce those flags if building with OpenSSL 1.1.0... Documenting them is really important as well. So attached is an updated patch which should be applied on HEAD to close the gap and close this open item with all the gaps mentioned in the commit message. I'd like to document (but disable!) the OpenSSL 1.1.0 flags down to 9.4 as well, since that's where we claim support for this version, as bb132cd missed the shot. This would break Windows MSVC builds linking to OpenSSL 1.0.1 or older, so the buildfarm will likely turn red here or there. Thoughts? -- Michael From c571b8d693a5498f2f49f786b065f911e7e4b505 Mon Sep 17 00:00:00 2001 From: Michael Paquier Date: Thu, 14 Jun 2018 10:35:06 +0900 Subject: [PATCH] Track new configure flags introduced for 11 in pg_config.h.win32 The following set of flags mainly matters when building Postgres code with MSVC and those have been forgotten in the latest developments: - HAVE_LDAP_INITIALIZE, added by 35c0754f, but tracked as disabled for now. ldap_initialize() is a non-standard extension that provides a way to use "ldaps" with OpenLDAP, but it is not supported on Windows, and instead the non-standard ldap_sslinit() is used if WIN32 is defined. Per input from Thomas Munro. - HAVE_X509_GET_SIGNATURE_NID, added by 054e8c6c, which is used by SCRAM's channel binding tls-server-end-point. Having this flag disabled would cause this channel binding type to be unsupported for Windows builds. - HAVE_SSL_CLEAR_OPTIONS, added recently as of a364dfa4 to disable SSL compression. - HAVE_ASN1_STRING_GET0_DATA, added by 5c6df67, which is used to track a new compatibility with OpenSSL 1.1.0. This was missing from pg_config.h.win32 and is not enabled by default. HAVE_BIO_GET_DATA, HAVE_OPENSSL_INIT_SSL and HAVE_BIO_METH_NEW gain the same treatment. The second and third flags are enabled with this commit, which raises the bar of OpenSSL support to 1.0.2 on Windows as a minimum. 
As this is the TLS version of community and knowing that all recent installers referred by upstream don't have anymore 1.0.1 or older, we could live with that requirement. In order to allow the code to compile with OpenSSL 1.1.0, all the flags mentioned above need to be enabled in pg_config.h.win32. Author: Michael Paquier Discussion: https://postgr.es/m/20180529211559.gf6...@paquier.xyz --- src/include/pg_config.h.win32 | 21 + 1 file changed, 21 insertions(+) diff --git a/src/include/pg_config.h.win32 b/src/include/pg_config.h.win32 index 2c701fa718..0c15d7f624 100644 --- a/src/include/pg_config.h.win32 +++ b/src/include/pg_config.h.win32 @@ -72,6 +72,15 @@ # define gettimeofday(a,b) gettimeofday(a) #endif +/* Define to 1 if you have the `ASN1_STRING_get0_data' function. */ +/* #undef HAVE_ASN1_STRING_GET0_DATA */ + +/* Define to 1 if you have the `BIO_get_data' function. */ +/* #undef HAVE_BIO_GET_DATA */ + +/* Define to 1 if you have the `BIO_meth_new' function. */ +/* #undef HAVE_BIO_METH_NEW */ + /* Define to 1 if you have the `cbrt' function. */ //#define HAVE_CBRT 1 @@ -233,6 +242,9 @@ /* Define to 1 if you have the header file. */ /* #undef HAVE_LDAP_H */ +/* Define to 1 if you have the `ldap_initialize' function. */ +/* #undef HAVE_LDAP_INITIALIZE */ + /* Define to 1 if you have the `crypto' library (-lcrypto). */ /* #undef HAVE_LIBCRYPTO */ @@ -288,6 +300,9 @@ /* Define to 1 if you have the header file. */ /* #undef HAVE_NETINET_TCP_H */ +/* Define to 1 if you have the `OPENSSL_init_ss
Re: Partitioning with temp tables is broken
On Wed, Jun 13, 2018 at 10:25:23PM +0530, amul sul wrote: > On Wed, Jun 13, 2018, 8:34 PM Tom Lane wrote: >> Even if you want to argue that there's a use case for these situations, >> it seems far too late in the release cycle to be trying to fix all these >> issues. I think we need to forbid the problematic cases for now, and >> leave relaxing the prohibition to be treated as a future new feature. > > +1, forbid the problematic case. +1 on that. And I can see that nobody on this thread has tested with REL_10_STABLE, right? In that case you don't get a crash, but other sessions can see another session's temporary partition, say with \d+. So it seems to me that we should forbid the use of temporary relations in a partition tree and back-patch it as well to v10 for consistency. And when trying to insert into the temporary relation you get a nice error: =# insert into listp values (2); ERROR: 0A000: cannot access temporary tables of other sessions LOCATION: ReadBufferExtended, bufmgr.c:657 This is also a bit counterintuitive, as I would think of it as an allowed operation; still, my gut tells me that the code complications are not really worth the use cases: =# create temp table listp2 partition of listp for values in (2); ERROR: 42P17: partition "listp2" would overlap partition "listp2" LOCATION: check_new_partition_bound, partition.c:81 -- Michael signature.asc Description: PGP signature
Re: Index maintenance function for BRIN doesn't check RecoveryInProgress()
On Thu, Jun 14, 2018 at 02:06:57AM +0900, Masahiko Sawada wrote: > On Thu, Jun 14, 2018 at 12:04 AM, Simon Riggs wrote: >> On 13 June 2018 at 15:51, Alvaro Herrera wrote: >>> I guess you could go either way ... we're just changing one unhelpful >>> error with a better one: there is no change in behavior. I would >>> backpatch this, myself, and avoid the code divergence. >> >> WAL control functions all say the same thing, so we can do that here also. > > +1 +1 for a back-patch to v10. The new error message brings clarity in my opinion. -- Michael signature.asc Description: PGP signature
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Wed, Jun 13, 2018 at 10:03 PM, Tomas Vondra wrote: > On 06/11/2018 11:22 AM, Masahiko Sawada wrote: >> >> On Fri, May 25, 2018 at 8:41 PM, Moon, Insung >> wrote: >>> >>> Hello Hackers, >>> >>> This propose a way to develop "Table-level" Transparent Data Encryption >>> (TDE) and Key Management Service (KMS) support in PostgreSQL. >>> >>> ... >> >> >> As per discussion at PGCon unconference, I think that firstly we >> need to discuss what threats we want to defend database data against. >> If user wants to defend against a threat that is malicious user who logged >> in OS or database steals an important data on datbase this design TDE would >> not help. Because such user can steal the data by getting a memory dump or >> by SQL. That is of course differs depending on system requirements or >> security compliance but what threats do >> you want to defend database data against? and why? >> > > I do agree with this - a description of the threat model needs to be part of > the design discussion, otherwise it's not possible to compare it to > alternative solutions (e.g. full-disk encryption using LUKS or using > existing privilege controls and/or RLS). > > TDE was proposed/discussed repeatedly in the past, and every time it died > exactly because it was not very clear which issue it was attempting to > solve. > > Let me share some of the issues mentioned as possibly addressed by TDE (I'm > not entirely sure TDE actually solves them, I'm just saying those were > mentioned in previous discussions): Thank you for sharing! > > 1) enterprise requirement - Companies want in-database encryption, for > various reasons (because "enterprise solution" or something). Yes, I'm often asked it by our customers especially for database migration from DBMS that supports TDE in order to reduce costs of migration. > > 2) like FDE, but OS/filesystem independent - Same config on any OS and > filesystem, which may make maintenance easier. > > 3) does not require special OS/filesystem setup - Does not require help from > system adminitrators, setup of LUKS devices or whatever. > > 4) all filesystem access (basebackups/rsync) is encrypted anyway > > 5) solves key management (the main challenge with pgcrypto) > > 6) allows encrypting only some of the data (tables, columns) to minimize > performance impact > > IMHO it makes sense to have TDE even if it provides the same "security" as > disk-level encryption, assuming it's more convenient to setup/use from the > database. Agreed. > >> Also, if I understand correctly, at unconference session there also were >> two suggestions about the design other than the suggestion by Alexander: >> implementing TDE at column level using POLICY, and implementing TDE at >> table-space level. The former was suggested by >> Joe but I'm not sure the detail of that suggestion. I'd love to hear >> the deal of that suggestion. The latter was suggested by >> Tsunakawa-san. Have you considered that? >> >> You mentioned that encryption of temporary data for query processing and >> large objects are still under the consideration. But other than them you >> should consider the temporary data generated by other subsystems such as >> reorderbuffer and transition table as well. >> > > The severity of those limitations is likely related to the threat model. I > don't think encrypting temporary data would be a big problem, assuming you > know which key to use. Agreed. 
I thought about the possibility of non-encrypted temporary data in backups, but since we don't include temporary files in backups it would not be a big problem. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: [HACKERS] Pluggable storage
On Thu, Jun 14, 2018 at 1:50 AM, Haribabu Kommi wrote: > > On Fri, Apr 20, 2018 at 4:44 PM Haribabu Kommi > wrote: > > VACUUM: > Not many changes are done in this, apart from moving the Vacuum visibility > functions into the > storage. But the idea for Vacuum was that each access method can define how > it should perform. > We are planning to have a somewhat different mechanism for vacuum (for non-delete marked indexes), so if you can provide some details or discuss what you have in mind before implementing the same, that would be great. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Shared access methods?
Hi, Several features in various discussed access methods would benefit from being able to perform actions when writing out a buffer. As an example, because it doesn't require understanding any of the new proposed storage formats, it'd be good for performance if we could eagerly set hint bits / perform page level pruning when cleaning a dirty buffer either during checkpoint writeout or bgwriter / backend reclaim. That'd allow avoiding the write amplification issues in several of the current and proposed cleanup schemes. Unfortunately that's currently not really easy to do. Buffers don't currently know which AM they belong to, therefore we can't know how to treat them at writeout time. It's not that hard to make space in the buffer descriptor to additionally store the oid of the associated AM, e.g. we could just replace buf_id with a small bit of pointer math. But even if we had an AM oid, it'd be unclear what to do with it as it'd be specific to a database. Which makes it pretty much useless for tasks happening on writeout of victim buffers / checkpoint. Thus I think it'd be a better design to have pg_am be a shared relation. That'd imply two things: a) amhandler couldn't be regproc but would need to be two fields, one pointing to internal or a shlib, the other to the function name. b) extensions containing AMs would need to do something like INSERT ... ON CONFLICT DO NOTHING. I don't think this is the most urgent feature for making pluggable AMs useful, but given that we're likely going to whack around pg_am, and that pg_am is fairly new in its current incarnation, it seems like a good idea to discuss this now. Comments? Greetings, Andres Freund PS: I could have written more on this, but people are urging me to come to dinner, so thank them ;)
[GSoC] current working status
Hi mentors and hackers, The first evaluation is coming. Here is my progress so far. During the first stage of work, I have implemented the thrift binary protocol as the format of the postgresql plugin. Currently, the main interface is bytes. Users who use this plugin need to provide thrift bytes to the plugin, and there are helpers for each data type to parse out the value contained in the bytes. This method has been verified by the use of several unit tests. However, the current interface requires users to understand the thrift format very well to use this plugin. After discussing with my mentors, it will be more useful to implement the concept of a "thrift type", which is a custom type where the user provides a user-understandable format (e.g., json) and the plugin converts it into bytes. I am currently busy with implementing this feature and still need some time to complete it. Once this part is done, I will go ahead and implement the thrift compact protocol. Let me know if you have comments. You can always track the progress on github ( https://github.com/charles-cui/pg_thrift) Thanks, Charles!
Re: WAL prefetch
On Wed, Jun 13, 2018 at 6:39 PM, Konstantin Knizhnik wrote: > There was a very interesting presentation at PGCon about pg_prefaulter: > > http://www.pgcon.org/2018/schedule/events/1204.en.html > > But it is implemented in Go and uses pg_waldump. > I tried to do the same, but using built-in Postgres WAL traversal functions. > I have implemented it as an extension for simplicity of integration. > In principle it can be started as a BG worker. > Right, or in other words, it could do something like autoprewarm [1], which could allow a more user-friendly interface for this utility if we decide to include it. > First of all I tried to estimate the effect of preloading data. > I have implemented a prefetch utility which is also attached to this mail. > It performs random reads of blocks of some large file and spawns some number > of prefetch threads: > > Just normal read without prefetch: > ./prefetch -n 0 SOME_BIG_FILE > > One prefetch thread which uses pread: > ./prefetch SOME_BIG_FILE > > One prefetch thread which uses posix_fadvise: > ./prefetch -f SOME_BIG_FILE > > 4 prefetch threads which use posix_fadvise: > ./prefetch -f -n 4 SOME_BIG_FILE > > Based on these experiments (on my desktop), I made the following conclusions: > > 1. Prefetch on HDD doesn't give any positive effect. > 2. Using posix_fadvise speeds up random reads on SSD by up to 2x. > 3. posix_fadvise(WILLNEED) is more efficient than performing normal reads. > 4. Calling posix_fadvise in more than one thread makes no sense. > > I have tested wal_prefetch on two powerful servers with 24 cores, 3Tb NVME > RAID 10 storage and 256Gb of RAM, connected using InfiniBand. > The speed of synchronous replication between the two nodes increased from 56k > TPS to 60k TPS (on pgbench with scale 1000). > That's a reasonable improvement. > Usage: > 1. At master: create extension wal_prefetch > 2. At replica: call the pg_wal_prefetch() function: it will not return until you > interrupt it. > I think it is not a very user-friendly interface, but the idea sounds good to me; it can help some other workloads. I think this can help in recovery as well. [1] - https://www.postgresql.org/docs/devel/static/pgprewarm.html -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
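As a side note for readers following this thread, the kernel call at the centre of these measurements is a one-liner. A minimal sketch, where prefetch_block is a made-up helper and BLCKSZ is hard-coded to PostgreSQL's default block size:

#include <fcntl.h>

#define BLCKSZ 8192				/* PostgreSQL's default block size */

/*
 * Ask the kernel to start reading one block into the OS page cache without
 * blocking the caller.  Returns 0 on success, otherwise an errno value.
 */
static int
prefetch_block(int fd, off_t blkno)
{
	return posix_fadvise(fd, blkno * BLCKSZ, BLCKSZ, POSIX_FADV_WILLNEED);
}

On Linux this merely schedules readahead and returns immediately, which matches the observations above: it can hide I/O latency on SSDs, while on HDDs the seek still dominates and little is gained.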
Two round for Client Authentication
Hello everyone. I am a newcomer to PostgreSQL, and I don't know if it is proper to post my question here, but I really need some help. Currently I am reading and testing code related to Client Authentication, and I find that two processes are forked if I log in using psql, while only one process is forked when using pgAdmin. My pg_hba.conf is as follows:

local all all md5
host all all 192.168.64.56/32 md5

If I log in with psql, the following stack is called twice:

ServerLoop
BackendStartup
BackendRun
PostgresMain
InitPostgres
PerformAuthentication
ClientAuthentication

In the first round, the variable `status` in ClientAuthentication() is always STATUS_EOF (at least in my test). In the second round, `status` is what I would expect: STATUS_OK for a correct password, STATUS_ERROR for a wrong password, and STATUS_EOF for an empty password. Why are two such processes forked? I think one round should be enough, as when using pgAdmin. Besides, I find it hard to debug the ServerLoop() stack, compared with the backend process for a query, because there are too many subprocesses and signals involved. It would be of great help if anyone could give me some instructions on how to debug the postmaster using gdb. Thanks in advance! Best Regards
Re: Partitioning with temp tables is broken
On 2018/06/14 11:09, Michael Paquier wrote: > On Wed, Jun 13, 2018 at 10:25:23PM +0530, amul sul wrote: >> On Wed, Jun 13, 2018, 8:34 PM Tom Lane wrote: >>> Even if you want to argue that there's a use case for these situations, >>> it seems far too late in the release cycle to be trying to fix all these >>> issues. I think we need to forbid the problematic cases for now, and >>> leave relaxing the prohibition to be treated as a future new feature. >> >> +1, forbid the problematic case. > > +1 on that. And I can see that nobody on this thread has tested with > REL_10_STABLE, right? In that case you don't get a crash, but other > sessions can see the temporary partition of another's session, say with > \d+. So it seems to me that we should forbid the use of temporary > relations in a partition tree and back-patch it as well to v10 for > consistency. And if trying to insert into the temporary relation you > get a nice error: > > =# insert into listp values (2); > ERROR: 0A000: cannot access temporary tables of other sessions > LOCATION: ReadBufferExtended, bufmgr.c:657 > > This is also a bit unintuitive as I would think of it as an authorized > operation, still my guts tell me that the code complications are not > really worth the use-cases: > > =# create temp table listp2 partition of listp for values in (2); > ERROR: 42P17: partition "listp2" would overlap partition "listp2" > LOCATION: check_new_partition_bound, partition.c:81

I have to agree with reverting this "feature". As Michael points out, even PG 10 fails to give a satisfactory error message when using a declarative feature like tuple routing. It should've been "partition not found for tuple" rather than "cannot access temporary tables of other sessions". Further, as David and Ashutosh point out, PG 11 added even more code around declarative partitioning that failed to consider the possibility that some partitions may not be accessible in a given session even though they are visible through the relcache. I'm attaching a patch here to forbid adding a temporary table as a partition of a permanent table. I didn't however touch the feature that allows *all* members in a partition tree to be temporary tables.

Thanks, Amit

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 8b848f91a7..fb7cd561e2 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -1985,6 +1985,16 @@ MergeAttributes(List *schema, List *supers, char relpersistence,
 				 (errcode(ERRCODE_WRONG_OBJECT_TYPE),
 				  errmsg("inherited relation \"%s\" is not a table or foreign table",
 						 parent->relname)));
+
+		/* If the parent is permanent, so must be all of its partitions. */
+		if (is_partition &&
+			relation->rd_rel->relpersistence != RELPERSISTENCE_TEMP &&
+			relpersistence == RELPERSISTENCE_TEMP)
+			ereport(ERROR,
+					(errcode(ERRCODE_WRONG_OBJECT_TYPE),
+					 errmsg("cannot create a temporary relation as partition of permanent relation \"%s\"",
+							RelationGetRelationName(relation))));
+
 		/* Permanent rels cannot inherit from temporary ones */
 		if (relpersistence != RELPERSISTENCE_TEMP &&
 			relation->rd_rel->relpersistence == RELPERSISTENCE_TEMP)
@@ -14135,7 +14145,15 @@ ATExecAttachPartition(List **wqueue, Relation rel, PartitionCmd *cmd)
 						RelationGetRelationName(rel),
 						RelationGetRelationName(attachrel))));
 
-	/* Temp parent cannot have a partition that is itself not a temp */
+	/* If the parent is permanent, so must be all of its partitions. */
+	if (rel->rd_rel->relpersistence != RELPERSISTENCE_TEMP &&
+		attachrel->rd_rel->relpersistence == RELPERSISTENCE_TEMP)
+		ereport(ERROR,
+				(errcode(ERRCODE_WRONG_OBJECT_TYPE),
+				 errmsg("cannot attach a temporary relation as partition of permanent relation \"%s\"",
+						RelationGetRelationName(rel))));
+
+	/* If the parent is temporary relation, so must be all of its partitions */
 	if (rel->rd_rel->relpersistence == RELPERSISTENCE_TEMP &&
 		attachrel->rd_rel->relpersistence != RELPERSISTENCE_TEMP)
 		ereport(ERROR,
@@ -14143,14 +14161,14 @@ ATExecAttachPartition(List **wqueue, Relation rel, PartitionCmd *cmd)
 				 errmsg("cannot attach a permanent relation as partition of temporary relation \"%s\"",
 						RelationGetRelationName(rel))));
Re: Two round for Client Authentication
On Thu, Jun 14, 2018 at 7:12 AM, Yinjie Lin wrote: > Currently I am reading and testing code about Client Authentication, but I > find that there are two progresses forked if I login using psql, while only > one progress is forked if using pgAdmin. > If psql finds the server asks for a password, it closes the first connection, displays a password prompt to the user, and then does another connection attempt with the password the user entered. You can avoid the first attempt with the -W flag; though there's usually no reason to do that in practice. .m
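For the archives: the same two-attempt behaviour can be reproduced from any libpq program. A rough sketch of what psql effectively does (error handling trimmed; connect_like_psql is a made-up name, and the password is passed in rather than prompted for):

#include <stdio.h>
#include "libpq-fe.h"

PGconn *
connect_like_psql(const char *conninfo, const char *password_from_prompt)
{
	PGconn	   *conn = PQconnectdb(conninfo);	/* first attempt, first backend */

	if (PQstatus(conn) == CONNECTION_BAD && PQconnectionNeedsPassword(conn))
	{
		char		with_password[1024];

		PQfinish(conn);			/* throw the first connection away */
		snprintf(with_password, sizeof(with_password), "%s password=%s",
				 conninfo, password_from_prompt);
		conn = PQconnectdb(with_password);	/* second attempt, second backend */
	}
	return conn;
}

Each PQconnectdb() call makes the postmaster fork a backend that runs ClientAuthentication(), which is why the stack in the original report appears twice and why the first round ends with STATUS_EOF.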
Re: Two round for Client Authentication
On Wednesday, June 13, 2018, Yinjie Lin wrote: > > Why are there two such progresses forked? I think one round should be > enough, like when using pgAdmin. > You can use the --password option to prevent it. """ This option is never essential, since psql will automatically prompt for a password if the server demands password authentication. However, psql will waste a connection attempt finding out that the server wants a password. In some cases it is worth typing -W to avoid the extra connection attempt. """ In pgAdmin you've saved a password to the profile so the initial attempt uses it. psql doesn't have a similar capability. Though I am unsure whether the use of .pgpass would make any difference here... David J.
Re: Partitioning with temp tables is broken
On 2018/06/13 21:06, David Rowley wrote: > There's also something pretty weird around the removal of the temp > relation from the partition bound. I've had cases where the session > that attached the temp table is long gone, but \d+ shows the table is > still there and I can't attach another partition due to an overlap, > and can't drop the temp table due to the session not existing anymore. > I've not got a test case for that one yet, but the test case for the > crash is: > > -- Session 1: > create table listp (a int) partition by list(a); > create table listp1 partition of listp for values in(1); > create temp table listp2 partition of listp for values in (2); > > -- Session 2: > select * from listp; When Session 2 crashes (kill -9'ing it would also suffice), for some reason, Session 1 doesn't get an opportunity to perform RemoveTempRelationsCallback(). So both listp2's pg_class entry and any references to it (such as its pg_inherits entry as a partition of listp) persist. listp2 won't be removed from the partition bound until all of those catalog entries are removed. Thanks, Amit
Re: Postgres, fsync, and OSs (specifically linux)
On Wed, May 23, 2018 at 8:02 AM, Andres Freund wrote: > [patches] Hi Andres, Obviously there is more work to be done here but the basic idea in your clone-fd-checkpointer branch as of today seems OK to me. I think Craig and I both had similar ideas (sending file descriptors that have an old enough f_wb_err) but we thought rlimit management was going to be hard (shared memory counters + deduplication, bleugh). You made it look easy. Nice work. First, let me describe in my own words what's going on, mostly to make sure I really understand this: 1. We adopt a new fd lifecycle that is carefully tailored to avoid error loss on Linux, and can't hurt on other OSes. By sending the file descriptor used for every write to the checkpointer, we guarantee that (1) the inode stays pinned (the file is "in flight" in the socket, even if the sender closes it before the checkpointer receives it) so Linux won't be tempted to throw away the precious information in i_mapping->wb_err, and (2) the checkpointer finishes up with a file descriptor that points to the very same "struct file" with the f_wb_err value that was originally sampled before the write, by the sender. So we can't miss any write-back errors. Wahey! However, errors are still reported only once, so we probably need to panic. Hmm... Is there any way that the *sender* could finish up in file_check_and_advance_wb_err() for the same struct file, before the checkpointer manages to call fsync() on its dup'd fd? I don't immediately see how (it looks like you have to call one of the various sync functions to reach that, and your patch removes the fallback just-call-FileSync()-myself code from register_dirty_segment()). I guess it would be bad if, say, close() were to do that in some future Linux release because then we'd have no race-free way to tell the checkpointer that the file is borked before it runs fsync() and potentially gets an OK response and reports a successful checkpoint (even if we panicked, with sufficiently bad timing it might manage to report a successful checkpoint). 2. In order to make sure that we don't exceed our soft limit on the number of file descriptors per process, you use a simple backend-local counter in the checkpointer, on the theory that we don't care about fds (or, technically, the files they point to) waiting in the socket buffer, we care only about how many the checkpointer has actually received but not yet closed. As long as we make sure there is space for at least one more before we read one message, everything should be fine. Good and simple. One reason I thought this would be harder is because I had no model of how RLIMIT_NOFILE would apply when you start flinging fds around between processes (considering there can be moments when neither end has the file open), so I feared the worst and thought we would need to maintain a counter in shared memory and have senders and receiver cooperate to respect it. My intuition that this was murky and required pessimism wasn't too far from the truth actually: apparently the kernel didn't do a great job at accounting for that until a patch[1] landed for CVE-2013-4312. The behaviour in older releases is super lax, so no problem there. The behaviour from 4.9 onward (or is it 4.4.1?) is that you get a separate per-user RLIMIT_NOFILE allowance for in-flight fds. So effectively the sender doesn't have to worry about about fds it has sent but closed and the receiver doesn't have to care about fds it hasn't received yet, so your counting scheme seems OK. 
As for exceeding RLIMIT_NOFILE with in-flight fds, it's at least bounded by the fact that the socket would block/EWOULDBLOCK if the receiver isn't draining it fast enough and can only hold a small and finite amount of data and thus file descriptors, so we can probably just ignore that. If you did manage to exceed it, I think you'd find out about that with ETOOMANYREFS at send time (curiously absent from the sendmsg() man page, but there in black and white in the source for unix_attach_fds()), and then you'd just raise an error (currently FATAL in your patch). I have no idea how the rlimit for SCM-ified files works on other Unixoid systems though. Some actual code feedback:

+		if (entry->syncfds[forknum][segno] == -1)
+		{
+			char *path = mdpath(entry->rnode, forknum, segno);
+			open_fsync_queue_files++;
+			/* caller must have reserved entry */
+			entry->syncfds[forknum][segno] =
+				FileOpenForFd(fd, path);
+			pfree(path);
+		}
+		else
+		{
+			/*
+			 * File is already open. Have to keep the older fd, errors
+			 * might only be reported to it, thus close the one
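For anyone who hasn't worked with this corner of the sockets API, the fd transfer the patch relies on uses SCM_RIGHTS ancillary data on a Unix domain socket. A bare-bones sketch of a sending side, with socket setup and most error handling omitted (send_fd is an illustrative name, not code from the patch):

#include <string.h>
#include <sys/socket.h>

/*
 * Send one file descriptor over a connected AF_UNIX socket.  While the fd
 * is "in flight" the kernel keeps the underlying struct file referenced,
 * which is what pins the error state discussed above.
 */
static ssize_t
send_fd(int sock, int fd_to_send)
{
	struct msghdr msg = {0};
	struct iovec iov;
	char		payload = 'F';	/* at least one byte of real data is required */
	union
	{
		char		buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;	/* guarantees alignment of the control buffer */
	}			u;
	struct cmsghdr *cmsg;

	memset(u.buf, 0, sizeof(u.buf));
	iov.iov_base = &payload;
	iov.iov_len = 1;
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd_to_send, sizeof(int));

	/* can fail with e.g. ETOOMANYREFS when fd-passing limits are hit */
	return sendmsg(sock, &msg, 0);
}

The receiving side does the mirror-image recvmsg() and pulls the new descriptor out of the ancillary data, at which point it counts against the receiver's own RLIMIT_NOFILE, which is the accounting question discussed above.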
ntile() throws ERROR when hashagg is false
Hi

ntile() throws ERROR when hashagg is false, test case given below.

postgres=# create table foo (a int, b int, c text);
CREATE TABLE
postgres=# insert into foo select i%20, i%30, to_char(i%12, 'FM') from generate_series(0, 36) i;
INSERT 0 37
postgres=# explain select ntile(a) OVER () from foo GROUP BY a;
                           QUERY PLAN
-----------------------------------------------------------------
 WindowAgg  (cost=25.00..29.50 rows=200 width=8)
   ->  HashAggregate  (cost=25.00..27.00 rows=200 width=4)
         Group Key: a
         ->  Seq Scan on foo  (cost=0.00..22.00 rows=1200 width=4)
(4 rows)

postgres=# select ntile(a) OVER () from foo GROUP BY a;
 ntile
-------
     1
     1
     2
     2
     3
     3
     4
     4
     5
     5
     6
     6
     7
     7
     8
     8
     9
     9
    10
    11
(20 rows)

postgres=# set enable_hashagg to false;
SET
postgres=# explain select ntile(a) OVER () from foo GROUP BY a;
                              QUERY PLAN
-------------------------------------------------------------------------
 WindowAgg  (cost=83.37..91.87 rows=200 width=8)
   ->  Group  (cost=83.37..89.37 rows=200 width=4)
         Group Key: a
         ->  Sort  (cost=83.37..86.37 rows=1200 width=4)
               Sort Key: a
               ->  Seq Scan on foo  (cost=0.00..22.00 rows=1200 width=4)
(6 rows)

postgres=# select ntile(a) OVER () from foo GROUP BY a;
ERROR:  argument of ntile must be greater than zero

Thanks & Regards, Rajkumar Raghuwanshi QMG, EnterpriseDB Corporation
Re: WAL prefetch
On Thu, Jun 14, 2018 at 1:09 AM, Konstantin Knizhnik wrote: > pg_wal_prefetch function will infinitely traverse WAL and prefetch block > references in WAL records > using posix_fadvise(WILLNEED) system call. Hi Konstantin, Why stop at the page cache... what about shared buffers? -- Thomas Munro http://www.enterprisedb.com
Re: ntile() throws ERROR when hashagg is false
> "Rajkumar" == Rajkumar Raghuwanshi > writes: Rajkumar> Hi Rajkumar> ntile() throws ERROR when hashagg is false, test case given Rajkumar> below. Rajkumar> postgres=# create table foo (a int, b int, c text); Rajkumar> CREATE TABLE Rajkumar> postgres=# insert into foo select i%20, i%30, to_char(i%12, 'FM') from Rajkumar> generate_series(0, 36) i; Rajkumar> INSERT 0 37 Rajkumar> postgres=# explain select ntile(a) OVER () from foo GROUP BY a; This query isn't actually legal per the spec; the argument of ntile is restricted to being a constant or parameter, so it can't change from row to row. PG is more flexible, but that doesn't make the query any more meaningful. What I think pg is actually doing is taking the value of the ntile() argument from the first row and using that for the whole partition. In your example, enabling or disabling hashagg changes the order of the input rows for the window function (since you've specified no ordering in the window definition), and with hashagg off, you get the smallest value of a first (which is 0 and thus an error). -- Andrew (irc:RhodiumToad)