Re: [PATCH] minor doc fix for create-role
Hi Tom, I agree mostly. It actually does have the words “SQL identifier” in the patch. But you are right it doesn’t link to what a SQL identifier is, but it does provide a practical solution of quoting. That was the part I cared about as a user, I just wanted to solve my problem of an email address as a role name (yes I know that’s sort of dumb as email addresses change). This also addresses the question, why just here, because this was a pain point in the docs for me yesterday :) I also agree your ideal solution is definitely better than what I pushed. But I’m not ready to take that on. If someone else is, I welcome their patch over mine. -Tara — “Rivers know this: there is no hurry. We shall get there some day.” > On Aug 18, 2019, at 9:41 AM, Tom Lane wrote: > > t...@anne.cat writes: >> Attached is a minor patch to fix the name param documentation for create >> role, just adding a direct quote from user-manag.sgml talking about what the >> role name is allowed to be. I was searching for this information and >> figured the reference page should have it as well. > > Hm, I guess my reaction to this proposal is "why just here?". We have > an awful lot of different CREATE commands, and none of them say more > about the target name than this one does. (Not to mention ALTER, DROP, > etc.) Perhaps it's worth adding some boilerplate text to all those > places, but I'm dubious. > > Also, the specific text proposed for addition doesn't seem that helpful, > since it doesn't define which characters are "special characters". > I'd rather see something like "The name must be a valid SQL identifier > as defined in ." But, while that would work fine > in HTML output, it would not be particularly short or useful in man-page > output. > > Perhaps the ideal solution would be further markup on the synopsis > sections that somehow identifies each term as an "identifier" or > other appropriate syntactic category, and provides a hyperlink to > a definition (in output formats that are friendly to that). Seems > like a lot of work though :-( > >regards, tom lane
Re: Fix typos and inconsistencies for HEAD (take 11)
On Mon, Aug 19, 2019 at 07:04:04AM +0300, Alexander Lakhin wrote: > 11.23 TupleLockUpdate -> LockTupleNoKeyExclusive Not sure about this one, so discarded for now. Alvaro? > 11.25 typstore -> tupstore This one is cute. It actually does not cause a compilation failure as tuplestore_donestoring() is a no-op. > 11.33 visca -> vice This one is interesting Latin. > 11.57 PASSBYVALUE -> PASSEDBYVALUE Will fix this one separately and back-patch. -- Michael signature.asc Description: PGP signature
Re: Fix typos and inconsistencies for HEAD (take 11)
On Mon, Aug 19, 2019 at 04:22:44PM +0900, Michael Paquier wrote: > On Mon, Aug 19, 2019 at 07:04:04AM +0300, Alexander Lakhin wrote: >> 11.57 PASSBYVALUE -> PASSEDBYVALUE > > Will fix this one separately and back-patch. No need to, actually, as the error comes from 7bdc655. -- Michael signature.asc Description: PGP signature
Re: FETCH FIRST clause PERCENT option
On 2019-08-19 01:33, Ryan Lambert wrote:
> The following review has been posted through the commitfest application:
>   make installcheck-world:  tested, passed
>   Implements feature:       tested, passed
>   Spec compliant:           not tested
>   Documentation:            tested, passed
>
> The latest patch [1] and the cleanup patch [2] apply cleanly to the latest master (5f110933).
> I reviewed the conversation and don't see any outstanding questions or concerns.
> Updating status to ready for committer.
>
> [1] https://www.postgresql.org/message-id/attachment/103028/percent-incremental-v6.patch
> [2] https://www.postgresql.org/message-id/attachment/103157/percent-incremental-v6-comment-cleanup.patch
>
> Ryan Lambert
>
> The new status of this patch is: Ready for Committer

Hi,

(running with those two patches applied)

   select * from onek where thousand < 5 order by thousand fetch first -1 percent rows only

is correctly caught (with "ERROR: PERCENT must not be negative"), but:

   select * from onek where thousand < 5 order by thousand fetch first 101 percent rows only

is not. It doesn't return, and cannot be Ctrl-C'ed out of, which I guess is another bug?

thanks,
Erik Rijkers
Cleanup isolation specs from unused steps
Hi all, I have been looking at the isolation tests, and we have in some specs steps which are defined but not used in any permutations. In order to detect them, I have been using the attached trick to track which permutations are used. This allows to find immediately any over-engineered spec by generating diffs about steps defined by not used in any permutations. On HEAD, we have six specs entering in this category. Would that be useful? -- Michael diff --git a/src/test/isolation/isolationtester.c b/src/test/isolation/isolationtester.c index f98bb1cf64..402bb1f546 100644 --- a/src/test/isolation/isolationtester.c +++ b/src/test/isolation/isolationtester.c @@ -239,6 +239,17 @@ main(int argc, char **argv) */ run_testspec(testspec); + /* + * Verify that all steps have been used, complaining about anything + * defined but not used. + */ + for (i = 0; i < testspec->nallsteps; i++) + { + if (!testspec->allsteps[i]->used) + fprintf(stderr, "unused step name: %s\n", + testspec->allsteps[i]->name); + } + return 0; } @@ -467,7 +478,11 @@ run_permutation(TestSpec *testspec, int nsteps, Step **steps) printf("\nstarting permutation:"); for (i = 0; i < nsteps; i++) + { + /* Track the permutation as in-use */ + steps[i]->used = true; printf(" %s", steps[i]->name); + } printf("\n"); /* Perform setup */ diff --git a/src/test/isolation/isolationtester.h b/src/test/isolation/isolationtester.h index 7f91e6433f..d9d2a14ecf 100644 --- a/src/test/isolation/isolationtester.h +++ b/src/test/isolation/isolationtester.h @@ -29,6 +29,7 @@ struct Session struct Step { int session; + bool used; char *name; char *sql; char *errormsg; diff --git a/src/test/isolation/specparse.y b/src/test/isolation/specparse.y index fb8a4d706c..2dfe3533ff 100644 --- a/src/test/isolation/specparse.y +++ b/src/test/isolation/specparse.y @@ -145,6 +145,7 @@ step: $$ = pg_malloc(sizeof(Step)); $$->name = $2; $$->sql = $3; +$$->used = false; $$->errormsg = NULL; } ; diff --git a/src/test/isolation/specs/freeze-the-dead.spec b/src/test/isolation/specs/freeze-the-dead.spec index 8c3649902a..915bf15b92 100644 --- a/src/test/isolation/specs/freeze-the-dead.spec +++ b/src/test/isolation/specs/freeze-the-dead.spec @@ -19,7 +19,6 @@ session "s1" step "s1_begin" { BEGIN; } step "s1_update" { UPDATE tab_freeze SET x = x + 1 WHERE id = 3; } step "s1_commit" { COMMIT; } -step "s1_vacuum" { VACUUM FREEZE tab_freeze; } step "s1_selectone" { BEGIN; SET LOCAL enable_seqscan = false; @@ -28,7 +27,6 @@ step "s1_selectone" { COMMIT; } step "s1_selectall" { SELECT * FROM tab_freeze ORDER BY name, id; } -step "s1_reindex" { REINDEX TABLE tab_freeze; } session "s2" step "s2_begin" { BEGIN; } @@ -40,7 +38,6 @@ session "s3" step "s3_begin" { BEGIN; } step "s3_key_share" { SELECT id FROM tab_freeze WHERE id = 3 FOR KEY SHARE; } step "s3_commit" { COMMIT; } -step "s3_vacuum" { VACUUM FREEZE tab_freeze; } # This permutation verifies that a previous bug # https://postgr.es/m/e5711e62-8fdf-4dca-a888-c200bf6b5...@amazon.com diff --git a/src/test/isolation/specs/insert-conflict-do-nothing.spec b/src/test/isolation/specs/insert-conflict-do-nothing.spec index 9b92c35cec..71acc380c7 100644 --- a/src/test/isolation/specs/insert-conflict-do-nothing.spec +++ b/src/test/isolation/specs/insert-conflict-do-nothing.spec @@ -33,7 +33,6 @@ setup step "donothing2" { INSERT INTO ints(key, val) VALUES(1, 'donothing2') ON CONFLICT DO NOTHING; } step "select2" { SELECT * FROM ints; } step "c2" { COMMIT; } -step "a2" { ABORT; } # Regular case where one session block-waits on another to 
determine if it # should proceed with an insert or do nothing. diff --git a/src/test/isolation/specs/insert-conflict-do-update-2.spec b/src/test/isolation/specs/insert-conflict-do-update-2.spec index f5b4f601b5..12f6be8000 100644 --- a/src/test/isolation/specs/insert-conflict-do-update-2.spec +++ b/src/test/isolation/specs/insert-conflict-do-update-2.spec @@ -32,7 +32,6 @@ setup step "insert2" { INSERT INTO upsert(key, payload) VALUES('FOOFOO', 'insert2') ON CONFLICT (lower(key)) DO UPDATE set key = EXCLUDED.key, payload = upsert.payload || ' updated by insert2'; } step "select2" { SELECT * FROM upsert; } step "c2" { COMMIT; } -step "a2" { ABORT; } # One session (session 2) block-waits on another (session 1) to determine if it # should proceed with an insert or update. The user can still usefully UPDATE diff --git a/src/test/isolation/specs/insert-conflict-do-update.spec b/src/test/isolation/specs/insert-conflict-do-update.spec index 5d335a3444..7c8cb47100 100644 --- a/src/test/isolation/specs/insert-conflict-do-update.spec +++ b/src/test/isolation/specs/insert-conflict-do-update.spec @@ -30,7 +30,6 @@ setup step "insert2" { INSERT INTO upsert(key, val) VALUES(1, 'insert2') ON CONFLICT (key) DO UPDATE set val = upsert.val || ' updated by insert2'; } step "select2" { SELECT *
Re: pgsql: doc: Add some images
On 2019-08-16 22:00, Alvaro Herrera wrote: > Now when I test Jürgen's new proposed image genetic-algorithm I find > that this stuff doesn't work in VPATH builds, at least for PDF -- I > don't get a build failure, but instead I get just a section title that > doesn't precede any actual image. (There's a very small warning hidden > in the tons of other fop output). If I edit the .fo file by hand to > make the path to .svg absolute, the image appears correctly. fixed (I'm puzzled that one can't tell FOP to abort on an error like this. The ASF JIRA is down right now, so I can't research this, but I'm tempted to file a bug about this.) -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Global temporary tables
On 18.08.2019 11:28, Pavel Stehule wrote:

On Sun, 18. 8. 2019 at 9:02, Konstantin Knizhnik <k.knizh...@postgrespro.ru> wrote:

On 16.08.2019 20:17, Pavel Stehule wrote:

On Fri, 16. 8. 2019 at 16:12, Konstantin Knizhnik <k.knizh...@postgrespro.ru> wrote:

I did more investigation of the performance of global temp tables with shared buffers vs. vanilla (local) temp tables.

1. Combination of persistent and temporary tables in the same query.

Preparation:
create table big(pk bigint primary key, val bigint);
insert into big values (generate_series(1,1),generate_series(1,1));
create temp table lt(key bigint, count bigint);
create global temp table gt(key bigint, count bigint);

The size of the table is about 6Gb. I ran this test on a desktop with 16GB of RAM and postgres with 1Gb shared buffers. I ran two queries:

insert into T (select count(*),pk/P as key from big group by key);
select sum(count) from T;

where P is (100,10,1) and T is the name of the temp table (lt or gt). The table below contains the times of both queries (insert / select) in msec:

                     Percent of selected data
                     1%             10%            100%
Local temp table     44610 / 90     47920 / 891    63414 / 21612
Global temp table    44669 / 35     47939 / 298    59159 / 26015

As you can see, the insertion time into the temporary table is almost the same, and the traversal time of the temporary table is about twice smaller for the global temp table when it fits in RAM together with the persistent table, and slightly worse when it doesn't fit.

2. Temporary table only access.
The same system, but Postgres is configured with shared_buffers=10GB, max_parallel_workers = 4, max_parallel_workers_per_gather = 4

Local temp tables:
create temp table local_temp(x1 bigint, x2 bigint, x3 bigint, x4 bigint, x5 bigint, x6 bigint, x7 bigint, x8 bigint, x9 bigint);
insert into local_temp values (generate_series(1,1),0,0,0,0,0,0,0,0);
select sum(x1) from local_temp;

Global temp tables:
create global temporary table global_temp(x1 bigint, x2 bigint, x3 bigint, x4 bigint, x5 bigint, x6 bigint, x7 bigint, x8 bigint, x9 bigint);
insert into global_temp values (generate_series(1,1),0,0,0,0,0,0,0,0);
select sum(x1) from global_temp;

Results (msec):

                     Insert    Select
Local temp table     37489     48322
Global temp table    44358     3003

So insertion into the local temp table is performed slightly faster, but select is 16 times slower!

Conclusion: assuming the temp table fits in memory, global temp tables with shared buffers provide better performance than local temp tables. I didn't consider global temp tables with local buffers here, because for them the results should be similar to local temp tables.

Probably there is not a reason why shared buffers should be slower than local buffers when the system is under low load. Access to shared memory is protected by spin locks (cheap for a few processes), so tests in one or a few processes are not too important (or they are just one side of the space).

Another topic can be performance on MS systems - there are stories about not perfect performance of shared memory there.

Regards
Pavel

One more test, used to simulate access to temp tables under high load. I am using "upsert" into a temp table in multiple connections.
create global temp table gtemp (x integer primary key, y bigint);

upsert.sql:
insert into gtemp values (random() * 100, 0) on conflict(x) do update set y=gtemp.y+1;

pgbench -c 10 -M prepared -T 100 -P 1 -n -f upsert.sql postgres

I failed to find a standard way in pgbench to perform per-session initialization to create the local temp table, so I just inserted this code into pgbench:

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 570cf33..af6a431 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -5994,6 +5994,7 @@ threadRun(void *arg)
 {
     if ((state[i].con = doConnect()) == NULL)
         goto done;
+    executeStatement(state[i].con, "create temp table ltemp(x integer primary key, y bigint)");
 }
 }

Results are the following:
Global temp table: 117526 TPS
Re: Duplicated LSN in ReorderBuffer
On Tue, Aug 13, 2019 at 6:36 AM Alvaro Herrera wrote: > > On 2019-Aug-07, Andres Freund wrote: > > > I think we would need to do this for all values of > > SnapBuildCurrentState() - after all the problem occurs because we > > *previously* didn't assign subxids to the toplevel xid. Compared to the > > cost of catalog changes, ReorderBufferAssignChild() is really cheap. So > > I don't think there's any problem just calling it unconditionally (when > > top_xid <> xid, of course). > > BTW I wrote the code as suggested and it passes all the tests ... but I > then noticed that the unpatched code doesn't fail Ildar's original > pgbench-based test for me, either. So maybe my laptop is not powerful > enough to reproduce it, or maybe I'm doing something wrong. If you share the patch fixing this issue I'll test it on my environment where I could reproduce the original problem. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: FETCH FIRST clause PERCENT option
Hi Erik On Mon, Aug 19, 2019 at 10:47 AM Erik Rijkers wrote: > Hi, > > (running with those two patches applied) > >select * from onek where thousand < 5 order by thousand fetch first -1 > percent rows only > is correctly caught (with "ERROR: PERCENT must not be negative") but: > >select * from onek where thousand < 5 order by thousand fetch first > 101 percent rows only > is not. It doesn't return, and cannot be Ctrl-C'ed out of, which I guess > is another bug? > Fixed regards Surafel diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c index 1759b9e1b6..fb9c2f5319 100644 --- a/contrib/postgres_fdw/postgres_fdw.c +++ b/contrib/postgres_fdw/postgres_fdw.c @@ -3035,7 +3035,8 @@ estimate_path_cost_size(PlannerInfo *root, if (fpextra && fpextra->has_limit) { adjust_limit_rows_costs(&rows, &startup_cost, &total_cost, - fpextra->offset_est, fpextra->count_est); + fpextra->offset_est, fpextra->count_est, + EXACT_NUMBER); retrieved_rows = rows; } } diff --git a/doc/src/sgml/ref/select.sgml b/doc/src/sgml/ref/select.sgml index 06d611b64c..4ef42df51d 100644 --- a/doc/src/sgml/ref/select.sgml +++ b/doc/src/sgml/ref/select.sgml @@ -44,7 +44,7 @@ SELECT [ ALL | DISTINCT [ ON ( expressionexpression [ ASC | DESC | USING operator ] [ NULLS { FIRST | LAST } ] [, ...] ] [ LIMIT { count | ALL } ] [ OFFSET start [ ROW | ROWS ] ] -[ FETCH { FIRST | NEXT } [ count ] { ROW | ROWS } ONLY ] +[ FETCH { FIRST | NEXT } [ count ] [ PERCENT ] { ROW | ROWS } ONLY ] [ FOR { UPDATE | NO KEY UPDATE | SHARE | KEY SHARE } [ OF table_name [, ...] ] [ NOWAIT | SKIP LOCKED ] [...] ] where from_item can be one of: @@ -1430,7 +1430,7 @@ OFFSET start which PostgreSQL also supports. It is: OFFSET start { ROW | ROWS } -FETCH { FIRST | NEXT } [ count ] { ROW | ROWS } ONLY +FETCH { FIRST | NEXT } [ count ] [ PERCENT ] { ROW | ROWS } ONLY In this syntax, the start or count value is required by @@ -1440,8 +1440,10 @@ FETCH { FIRST | NEXT } [ count ] { ambiguity. If count is omitted in a FETCH clause, it defaults to 1. -ROW -and ROWS as well as FIRST +With PERCENT count specifies the maximum number of rows to return +in percentage round up to the nearest integer. Using PERCENT +is best suited to returning single-digit percentages of the query's total row count. +ROW and ROWS as well as FIRST and NEXT are noise words that don't influence the effects of these clauses. According to the standard, the OFFSET clause must come diff --git a/src/backend/executor/nodeLimit.c b/src/backend/executor/nodeLimit.c index baa669abe8..7a9626d7f9 100644 --- a/src/backend/executor/nodeLimit.c +++ b/src/backend/executor/nodeLimit.c @@ -21,6 +21,8 @@ #include "postgres.h" +#include + #include "executor/executor.h" #include "executor/nodeLimit.h" #include "miscadmin.h" @@ -52,6 +54,7 @@ ExecLimit(PlanState *pstate) */ direction = node->ps.state->es_direction; outerPlan = outerPlanState(node); + slot = node->subSlot; /* * The main logic is a simple state machine. @@ -81,7 +84,15 @@ ExecLimit(PlanState *pstate) /* * Check for empty window; if so, treat like empty subplan. 
*/ - if (node->count <= 0 && !node->noCount) + if (node->limitOption == PERCENTAGE) + { +if (node->percent == 0.0) +{ + node->lstate = LIMIT_EMPTY; + return NULL; +} + } + else if (node->count <= 0 && !node->noCount) { node->lstate = LIMIT_EMPTY; return NULL; @@ -102,11 +113,32 @@ ExecLimit(PlanState *pstate) node->lstate = LIMIT_EMPTY; return NULL; } + +/* + * Count the number of rows to return in exact number so far + */ +if (node->limitOption == PERCENTAGE) +{ + node->perExactCount = ceil(node->percent * node->position / 100.0); + + if (node->perExactCount == node->perExactCount + 1) + node->perExactCount++; +} node->subSlot = slot; if (++node->position > node->offset) break; } + /* + * We may needed this tuple in subsequent scan so put it into + * tuplestore. + */ + if (node->limitOption == PERCENTAGE) + { +tuplestore_puttupleslot(node->tupleStore, slot); +tuplestore_advance(node->tupleStore, true); + } + /* * Okay, we have the first tuple of the window. */ @@ -124,6 +156,106 @@ ExecLimit(PlanState *pstate) case LIMIT_INWINDOW: if (ScanDirectionIsForward(direction)) { +/* + * In case of coming back from backward scan the tuple is + * already in tuple store. + */ +if (node->limitOption == PERCENTAGE && node->backwardPosition > 0) +{ + if (tuplestore_gettupleslot_heaptuple(node->tupleStore, true, true, slot)) + { + node->subSlot = slot; + node->position++; + node->backwardPosition--; + return slot; + } + else + { + node->lstate = LIMIT_SUBPLANEOF; +
Re: FETCH FIRST clause PERCENT option
On 2019-08-19 11:18, Surafel Temesgen wrote: [..] [percent-incremental-v7.patch] Thanks. Another little thing, not sure it's a bug: limit interprets its argument by rounding up or down as one would expect: table onek limit 10.4; --> gives 10 rows table onek limit 10.6; --> gives 11 rows but FETCH count PERCENT does not do that; it rounds always up. select * from (table onek limit 100) f fetch first 10.4 percent rows only; --> gives 11 rows select * from (table onek limit 100) f fetch first 10.6 percent rows only; --> gives 11 rows I see that it's documented in the .sgml to behave as it does, but it seems wrong to me; shouldn't that 10.4-percent-query yield 10 rows instead of 11? thanks, Erik Rijkers
Re: Rearranging ALTER TABLE to avoid multi-operations bugs
> This review seems not very on-point, because I made no claim to have fixed
> any of those bugs. The issue at the moment is how to structure the code

I am sorry for that, and I have another question now. I researched the related code and found the following:

Code:
~~
case AT_AddIdentity:
{
    ...
    attnum = get_attnum(relid, cmd->name);

    /*
     * if attribute not found, something will error about it
     * later
     */
    if (attnum != InvalidAttrNumber)
        generateSerialExtraStmts(&cxt, newdef,
                                 get_atttype(relid, attnum),
                                 def->options, true, NULL, NULL);
    ...
}
~~

Test case 1:
create table t10 (f1 int);
alter table t10 add column f2 int not null, alter column f2 add generated always as identity;

When I run test case 1, the value of 'attnum' is 0 because the 'f2' column does not exist yet, so no sequence can be generated (generateSerialExtraStmts() is never reached). The code comment says 'something will error about it later', so I think this should report an error instead of executing successfully.

Test case 2:
create table t11 (f1 int);
alter table t11 add column f2 int, alter column f2 type int8;

The code for 'alter column type' has the same comment, and if you run test case 2 you do get an error report. I use test case 2 to argue that test case 1 should likewise report an error instead of executing successfully.

--
Movead.Li

The new status of this patch is: Waiting on Author
Re: [HACKERS] WAL logging problem in 9.4.3?
Thank you for taking time. At Sat, 17 Aug 2019 20:52:30 -0700, Noah Misch wrote in <20190818035230.gb3021...@rfd.leadboat.com> > For two-phase commit, PrepareTransaction() needs to execute pending syncs. Now TwoPhaseFileHeader has two new members for (commit-time) pending syncs. Pending-syncs are useless on wal-replay, but that is needed for commit-prepared. > On Thu, Jul 25, 2019 at 10:39:36AM +0900, Kyotaro Horiguchi wrote: > > --- a/src/backend/access/heap/heapam_handler.c > > +++ b/src/backend/access/heap/heapam_handler.c > > @@ -715,12 +702,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, > > Relation NewHeap, ... > > - use_wal = XLogIsNeeded() && RelationNeedsWAL(NewHeap); > > - > > /* use_wal off requires smgr_targblock be initially invalid */ > > Assert(RelationGetTargetBlock(NewHeap) == InvalidBlockNumber); > > Since you're deleting the use_wal variable, update that last comment. Oops. Rewrote it. > > --- a/src/backend/catalog/storage.c > > +++ b/src/backend/catalog/storage.c > > @@ -428,21 +450,34 @@ smgrDoPendingDeletes(bool isCommit) ... > > + smgrimmedsync(srel, MAIN_FORKNUM); > > This should sync all forks, not just MAIN_FORKNUM. Code that writes WAL for > FSM_FORKNUM and VISIBILITYMAP_FORKNUM checks RelationNeedsWAL(). There may be > no bug today, but it's conceptually wrong to make RelationNeedsWAL() return > false due to this code, use RelationNeedsWAL() for multiple forks, and then > not actually sync all forks. I agree that all forks needs syncing, but FSM and VM are checking RelationNeedsWAL(modified). To make sure, are you suggesting to sync all forks instead of emitting WAL for them, or suggesting that VM and FSM to emit WALs even when the modified RelationNeedsWAL returns false (+ sync all forks)? > The https://postgr.es/m/559fa0ba.3080...@iki.fi design had another component > not appearing here. It said, "Instead, at COMMIT, we'd fsync() the relation, > or if it's smaller than some threshold, WAL-log the contents of the whole file > at that point." Please write the part to WAL-log the contents of small files > instead of syncing them. I'm not sure the point of the behavior. I suppose that the "log" is a sequence of new_page records. It also needs to be synced and it is always larger than the file to be synced. I can't think of an appropriate threshold without the point. > > --- a/src/backend/commands/copy.c > > +++ b/src/backend/commands/copy.c > > @@ -2725,28 +2722,9 @@ CopyFrom(CopyState cstate) > > * If it does commit, we'll have done the table_finish_bulk_insert() at > > * the bottom of this routine first. > > * > > -* As mentioned in comments in utils/rel.h, the in-same-transaction test > > -* is not always set correctly, since in rare cases > > rd_newRelfilenodeSubid > > -* can be cleared before the end of the transaction. The exact case is > > -* when a relation sets a new relfilenode twice in same transaction, yet > > -* the second one fails in an aborted subtransaction, e.g. > > -* > > -* BEGIN; > > -* TRUNCATE t; > > -* SAVEPOINT save; > > -* TRUNCATE t; > > -* ROLLBACK TO save; > > -* COPY ... > > The comment material being deleted is still correct, so don't delete it. > Moreover, the code managing rd_firstRelfilenodeSubid has a similar bug. The > attached patch adds an assertion that RelationNeedsWAL() and the > pendingDeletes array have the same opinion about the relfilenode, and it > expands a test case to fail that assertion. (Un?)Fortunately, that doesn't fail.. (with rebased version on the recent master) I'll recheck that tomorrow. 
> > --- a/src/include/utils/rel.h > > +++ b/src/include/utils/rel.h > > @@ -74,11 +74,13 @@ typedef struct RelationData > > SubTransactionId rd_createSubid;/* rel was created in current > > xact */ > > SubTransactionId rd_newRelfilenodeSubid;/* new relfilenode > > assigned in > > > > * current xact */ > > + SubTransactionId rd_firstRelfilenodeSubid; /* new relfilenode > > assigned > > + > > * first in current xact */ > > In general, to add a field like this, run "git grep -n 'rd_.*Subid'" and audit > all the lines printed. Many bits of code need to look at all three, > e.g. RelationClose(). Agreed. I'll recheck that. > This field needs to be 100% reliable. In other words, > it must equal InvalidSubTransactionId if and only if the relfilenode matches > the relfilenode that would be in place if the top transaction rolled back. I don't get this. I think the variable moves as you suggested. It is handled same way with fd_new* in AtEOSubXact_cleanup but the difference is in assignment but rollback. rd_fist* won't change after the first assignment so rollback of the subid means relfi
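To make the fork-sync point above concrete, here is a minimal sketch, not taken from the posted patch, of a commit-time sync that covers every fork rather than just the main fork, using the existing smgrexists()/smgrimmedsync() calls; the function name and its placement in the pending-sync machinery are hypothetical:

#include "common/relpath.h"		/* ForkNumber, MAX_FORKNUM */
#include "storage/smgr.h"		/* smgrexists(), smgrimmedsync() */

/*
 * Sketch only: flush every existing fork of a pending-sync relation to disk
 * at commit.  The surrounding pending-sync bookkeeping is omitted.
 */
static void
sync_all_forks(SMgrRelation srel)
{
	ForkNumber	fork;

	for (fork = MAIN_FORKNUM; fork <= MAX_FORKNUM; fork++)
	{
		if (smgrexists(srel, fork))
			smgrimmedsync(srel, fork);
	}
}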
Re: Global temporary tables
On 19.08.2019 11:51, Konstantin Knizhnik wrote: On 18.08.2019 11:28, Pavel Stehule wrote: ne 18. 8. 2019 v 9:02 odesílatel Konstantin Knizhnik mailto:k.knizh...@postgrespro.ru>> napsal: On 16.08.2019 20:17, Pavel Stehule wrote: pá 16. 8. 2019 v 16:12 odesílatel Konstantin Knizhnik mailto:k.knizh...@postgrespro.ru>> napsal: I did more investigations of performance of global temp tables with shared buffers vs. vanilla (local) temp tables. 1. Combination of persistent and temporary tables in the same query. Preparation: create table big(pk bigint primary key, val bigint); insert into big values (generate_series(1,1),generate_series(1,1)); create temp table lt(key bigint, count bigint); create global temp table gt(key bigint, count bigint); Size of table is about 6Gb, I run this test on desktop with 16GB of RAM and postgres with 1Gb shared buffers. I run two queries: insert into T (select count(*),pk/P as key from big group by key); select sum(count) from T; where P is (100,10,1) and T is name of temp table (lt or gt). The table below contains times of both queries in msec: Percent of selected data 1% 10% 100% Local temp table 44610 90 47920 891 63414 21612 Global temp table 44669 35 47939 298 59159 26015 As you can see, time of insertion in temporary table is almost the same and time of traversal of temporary table is about twice smaller for global temp table when it fits in RAM together with persistent table and slightly worser when it doesn't fit. 2. Temporary table only access. The same system, but Postgres is configured with shared_buffers=10GB, max_parallel_workers = 4, max_parallel_workers_per_gather = 4 Local temp tables: create temp table local_temp(x1 bigint, x2 bigint, x3 bigint, x4 bigint, x5 bigint, x6 bigint, x7 bigint, x8 bigint, x9 bigint); insert into local_temp values (generate_series(1,1),0,0,0,0,0,0,0,0); select sum(x1) from local_temp; Global temp tables: create global temporary table global_temp(x1 bigint, x2 bigint, x3 bigint, x4 bigint, x5 bigint, x6 bigint, x7 bigint, x8 bigint, x9 bigint); insert into global_temp values (generate_series(1,1),0,0,0,0,0,0,0,0); select sum(x1) from global_temp; Results (msec): Insert Select Local temp table37489 48322 Global temp table 44358 3003 So insertion in local temp table is performed slightly faster but select is 16 times slower! Conclusion: In the assumption then temp table fits in memory, global temp tables with shared buffers provides better performance than local temp table. I didn't consider here global temp tables with local buffers because for them results should be similar with local temp tables. Probably there is not a reason why shared buffers should be slower than local buffers when system is under low load. access to shared memory is protected by spin locks (are cheap for few processes), so tests in one or few process are not too important (or it is just one side of space) another topic can be performance on MS Sys - there are stories about not perfect performance of shared memory there. Regards Pavel One more test which is used to simulate access to temp tables under high load. I am using "upsert" into temp table in multiple connections. 
create global temp table gtemp (x integer primary key, y bigint); upsert.sql: insert into gtemp values (random() * 100, 0) on conflict(x) do update set y=gtemp.y+1; pgbench -c 10 -M prepared -T 100 -P 1 -n -f upsert.sql postgres I failed to find some standard way in pgbech to perform per-session initialization to create local temp table, so I just insert this code in pgbench code: diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c index 570cf33..af6a431 100644 --- a/src/bin/pgbench/pgbench.c +++ b/src/bin/pgbench/pgbench.c @@ -5994,6 +5994,7 @@ threadRun(void *arg) { if ((state[i].con = doConnect()) == NULL) goto done; + executeStatement(state[i].con, "create temp table ltemp(x integer primary key, y bigint)"); } } Results a
Re: Global temporary tables
po 19. 8. 2019 v 13:16 odesílatel Konstantin Knizhnik < k.knizh...@postgrespro.ru> napsal: > > > On 19.08.2019 11:51, Konstantin Knizhnik wrote: > > > > On 18.08.2019 11:28, Pavel Stehule wrote: > > > > ne 18. 8. 2019 v 9:02 odesílatel Konstantin Knizhnik < > k.knizh...@postgrespro.ru> napsal: > >> >> >> On 16.08.2019 20:17, Pavel Stehule wrote: >> >> >> >> pá 16. 8. 2019 v 16:12 odesílatel Konstantin Knizhnik < >> k.knizh...@postgrespro.ru> napsal: >> >>> I did more investigations of performance of global temp tables with >>> shared buffers vs. vanilla (local) temp tables. >>> >>> 1. Combination of persistent and temporary tables in the same query. >>> >>> Preparation: >>> create table big(pk bigint primary key, val bigint); >>> insert into big values >>> (generate_series(1,1),generate_series(1,1)); >>> create temp table lt(key bigint, count bigint); >>> create global temp table gt(key bigint, count bigint); >>> >>> Size of table is about 6Gb, I run this test on desktop with 16GB of RAM >>> and postgres with 1Gb shared buffers. >>> I run two queries: >>> >>> insert into T (select count(*),pk/P as key from big group by key); >>> select sum(count) from T; >>> >>> where P is (100,10,1) and T is name of temp table (lt or gt). >>> The table below contains times of both queries in msec: >>> >>> Percent of selected data >>> 1% >>> 10% >>> 100% >>> Local temp table >>> 44610 >>> 90 >>> 47920 >>> 891 >>> 63414 >>> 21612 >>> Global temp table >>> 44669 >>> 35 >>> 47939 >>> 298 >>> 59159 >>> 26015 >>> >>> As you can see, time of insertion in temporary table is almost the same >>> and time of traversal of temporary table is about twice smaller for >>> global temp table >>> when it fits in RAM together with persistent table and slightly worser >>> when it doesn't fit. >>> >>> >>> >>> 2. Temporary table only access. >>> The same system, but Postgres is configured with shared_buffers=10GB, >>> max_parallel_workers = 4, max_parallel_workers_per_gather = 4 >>> >>> Local temp tables: >>> create temp table local_temp(x1 bigint, x2 bigint, x3 bigint, x4 bigint, >>> x5 bigint, x6 bigint, x7 bigint, x8 bigint, x9 bigint); >>> insert into local_temp values >>> (generate_series(1,1),0,0,0,0,0,0,0,0); >>> select sum(x1) from local_temp; >>> >>> Global temp tables: >>> create global temporary table global_temp(x1 bigint, x2 bigint, x3 >>> bigint, x4 bigint, x5 bigint, x6 bigint, x7 bigint, x8 bigint, x9 bigint); >>> insert into global_temp values >>> (generate_series(1,1),0,0,0,0,0,0,0,0); >>> select sum(x1) from global_temp; >>> >>> Results (msec): >>> >>> Insert >>> Select >>> Local temp table 37489 >>> 48322 >>> Global temp table 44358 >>> 3003 >>> >>> So insertion in local temp table is performed slightly faster but select >>> is 16 times slower! >>> >>> Conclusion: >>> In the assumption then temp table fits in memory, global temp tables >>> with shared buffers provides better performance than local temp table. >>> I didn't consider here global temp tables with local buffers because for >>> them results should be similar with local temp tables. >>> >> >> Probably there is not a reason why shared buffers should be slower than >> local buffers when system is under low load. >> >> access to shared memory is protected by spin locks (are cheap for few >> processes), so tests in one or few process are not too important (or it is >> just one side of space) >> >> another topic can be performance on MS Sys - there are stories about not >> perfect performance of shared memory there. 
>> >> Regards >> >> Pavel >> >> One more test which is used to simulate access to temp tables under high >> load. >> I am using "upsert" into temp table in multiple connections. >> >> create global temp table gtemp (x integer primary key, y bigint); >> >> upsert.sql: >> insert into gtemp values (random() * 100, 0) on conflict(x) do update >> set y=gtemp.y+1; >> >> pgbench -c 10 -M prepared -T 100 -P 1 -n -f upsert.sql postgres >> >> >> I failed to find some standard way in pgbech to perform per-session >> initialization to create local temp table, >> so I just insert this code in pgbench code: >> >> diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c >> index 570cf33..af6a431 100644 >> --- a/src/bin/pgbench/pgbench.c >> +++ b/src/bin/pgbench/pgbench.c >> @@ -5994,6 +5994,7 @@ threadRun(void *arg) >> { >> if ((state[i].con = doConnect()) == NULL) >> goto done; >> + executeStatement(state[i].con, "create temp table >> ltemp(x integer primary key, y bigint)"); >> } >> } >> >> >> Results are the following: >> Global temp table: 117526 TPS >> Local temp table: 107802 TPS >> >> >> So even for this workload global temp table with shared buffers are a >> little bit faster. >> I will be pleased if you can propose some other testing scenario. >> > > please, try to increase number of connections. >
Re: FETCH FIRST clause PERCENT option
On Mon, Aug 19, 2019 at 1:55 PM Erik Rijkers wrote: > Another little thing, not sure it's a bug: > > limit interprets its argument by rounding up or down as one would > expect: > > table onek limit 10.4; --> gives 10 rows > table onek limit 10.6; --> gives 11 rows > > but FETCH count PERCENT does not do that; it rounds always up. > > select * from (table onek limit 100) f fetch first 10.4 percent rows > only; --> gives 11 rows > select * from (table onek limit 100) f fetch first 10.6 percent rows > only; --> gives 11 rows > > I see that it's documented in the .sgml to behave as it does, but it > seems wrong to me; > shouldn't that 10.4-percent-query yield 10 rows instead of 11? > > According to sql standard we have to use the result ceiling value regards Surafel
Re: Global temporary tables
On 19.08.2019 14:25, Pavel Stehule wrote: po 19. 8. 2019 v 13:16 odesílatel Konstantin Knizhnik mailto:k.knizh...@postgrespro.ru>> napsal: On 19.08.2019 11:51, Konstantin Knizhnik wrote: On 18.08.2019 11:28, Pavel Stehule wrote: ne 18. 8. 2019 v 9:02 odesílatel Konstantin Knizhnik mailto:k.knizh...@postgrespro.ru>> napsal: On 16.08.2019 20:17, Pavel Stehule wrote: pá 16. 8. 2019 v 16:12 odesílatel Konstantin Knizhnik mailto:k.knizh...@postgrespro.ru>> napsal: I did more investigations of performance of global temp tables with shared buffers vs. vanilla (local) temp tables. 1. Combination of persistent and temporary tables in the same query. Preparation: create table big(pk bigint primary key, val bigint); insert into big values (generate_series(1,1),generate_series(1,1)); create temp table lt(key bigint, count bigint); create global temp table gt(key bigint, count bigint); Size of table is about 6Gb, I run this test on desktop with 16GB of RAM and postgres with 1Gb shared buffers. I run two queries: insert into T (select count(*),pk/P as key from big group by key); select sum(count) from T; where P is (100,10,1) and T is name of temp table (lt or gt). The table below contains times of both queries in msec: Percent of selected data 1% 10% 100% Local temp table 44610 90 47920 891 63414 21612 Global temp table 44669 35 47939 298 59159 26015 As you can see, time of insertion in temporary table is almost the same and time of traversal of temporary table is about twice smaller for global temp table when it fits in RAM together with persistent table and slightly worser when it doesn't fit. 2. Temporary table only access. The same system, but Postgres is configured with shared_buffers=10GB, max_parallel_workers = 4, max_parallel_workers_per_gather = 4 Local temp tables: create temp table local_temp(x1 bigint, x2 bigint, x3 bigint, x4 bigint, x5 bigint, x6 bigint, x7 bigint, x8 bigint, x9 bigint); insert into local_temp values (generate_series(1,1),0,0,0,0,0,0,0,0); select sum(x1) from local_temp; Global temp tables: create global temporary table global_temp(x1 bigint, x2 bigint, x3 bigint, x4 bigint, x5 bigint, x6 bigint, x7 bigint, x8 bigint, x9 bigint); insert into global_temp values (generate_series(1,1),0,0,0,0,0,0,0,0); select sum(x1) from global_temp; Results (msec): Insert Select Local temp table37489 48322 Global temp table 44358 3003 So insertion in local temp table is performed slightly faster but select is 16 times slower! Conclusion: In the assumption then temp table fits in memory, global temp tables with shared buffers provides better performance than local temp table. I didn't consider here global temp tables with local buffers because for them results should be similar with local temp tables. Probably there is not a reason why shared buffers should be slower than local buffers when system is under low load. access to shared memory is protected by spin locks (are cheap for few processes), so tests in one or few process are not too important (or it is just one side of space) another topic can be performance on MS Sys - there are stories about not perfect performance of shared memory there. Regards Pavel One more test which is used to simulate access to temp tables under high load. I am using "upsert" into temp table in multiple connections. 
create global temp table gtemp (x integer primary key, y bigint); upsert.sql: insert into gtemp values (random() * 100, 0) on conflict(x) do update set y=gtemp.y+1; pgbench -c 10 -M prepared -T 100 -P 1 -n -f upsert.sql postgres I failed to find some standard way in pgbech to perform per-session initialization to create local temp table, so I just insert this code in pgbench code: diff --git a/
Re: POC: Cleaning up orphaned files using undo logs
On Sat, Aug 17, 2019 at 10:58 PM Andres Freund wrote: > > Hi, > > On 2019-08-17 12:05:21 -0400, Robert Haas wrote: > > On Wed, Aug 14, 2019 at 12:39 PM Andres Freund wrote: > > > > > Again, I think it's not ok to just assume you can lock an essentially > > > > > unbounded number of buffers. This seems almost guaranteed to result in > > > > > deadlocks. And there's limits on how many lwlocks one can hold etc. > > > > > > > > I think for controlling that we need to put a limit on max prepared > > > > undo? I am not sure any other way of limiting the number of buffers > > > > because we must lock all the buffer in which we are going to insert > > > > the undo record under one WAL logged operation. > > > > > > I heard that a number of times. But I still don't know why that'd > > > actually be true. Why would it not be sufficient to just lock the buffer > > > currently being written to, rather than all buffers? It'd require a bit > > > of care updating the official current "logical end" of a log, but > > > otherwise ought to not be particularly hard? Only one backend can extend > > > the log after all, and until the log is externally visibily extended, > > > nobody can read or write those buffers, no? > > > > Well, I don't understand why you're on about this. We've discussed it > > a number of times but I'm still confused. > > There's two reasons here: > > The primary one in the context here is that if we do *not* have to lock > the buffers all ahead of time, we can simplify the interface. We > certainly can't lock the buffers over IO (due to buffer reclaim) as > we're doing right now, so we'd need another phase, called by the "user" > during undo insertion. But if we do not need to lock the buffers before > the insertion over all starts, the inserting location doesn't have to > care. > > Secondarily, all the reasoning for needing to lock all buffers ahead of > time was imo fairly unconvincing. Following the "recipe" for WAL > insertions is a good idea when writing a new run-of-the-mill WAL > inserting location - but when writing a new fundamental facility, that > already needs to modify how WAL works, then I find that much less > convincing. > One point to remember in this regard is that we do need to modify the LSN in undo pages after writing WAL, so all the undo pages need to be locked by that time or we again need to take the lock on them. > > > 1. It's absolutely fine to just put a limit on this, because the > > higher-level facilities that use this shouldn't be doing a single > > WAL-logged operation that touches a zillion buffers. Right, by default a WAL log can only cover 4 buffers. If we need to touch more buffers, then the caller needs to call XLogEnsureRecordSpace. So, I agree with the point that generally, it should be few buffers (2 or 3) of undo that need to be touched in a single operation and if there are more, either callers need to change or at the very least they need to be careful about the same. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
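For reference, a hedged sketch (the function name and record contents are placeholders, not code from the undo patches) of how a caller that must register more than the default number of buffers in a single WAL record would arrange that with XLogEnsureRecordSpace(); marking the buffers dirty and stamping their page LSNs with the returned location is omitted:

#include "access/xloginsert.h"

/*
 * Sketch only: WAL-log one operation touching nbuffers buffers.  The default
 * insertion workspace only allows a handful of registered blocks, so the
 * limit is widened up front (this must happen before XLogBeginInsert()).
 */
static XLogRecPtr
log_multi_buffer_op(Buffer *buffers, int nbuffers, RmgrId rmid, uint8 info)
{
	int			i;

	XLogEnsureRecordSpace(nbuffers, 0);

	XLogBeginInsert();
	for (i = 0; i < nbuffers; i++)
		XLogRegisterBuffer(i, buffers[i], REGBUF_STANDARD);

	return XLogInsert(rmid, info);
}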
Re: (select query)/relation as first class citizen
Hi John, I think you've outlined the problem and possible solutions quite well. It's great to see that the goal might not be that far from being implemented.
psql \r command is working?
Hello

Does the psql command \r actually work? I expected that \r followed by \e would start the editor without the text of the last query.

LANG=C EDITOR=cat psql
psql (11.5 (Debian 11.5-1.pgdg100+1))
Type "help" for help.

melkij=> select 1;
 ?column?
----------
        1
(1 row)

melkij=> \r
Query buffer reset (cleared).
melkij=> \e
select 1;
 ?column?
----------
        1
(1 row)

The buffer is still there, with normal external editors too.

Same test on REL9_6_STABLE:

postgres=# select 1;
 ?column?
----------
        1
(1 row)

postgres=# \r
Query buffer reset (cleared).
postgres=# \e

The buffer was actually cleared. I see this behavior change on versions >= 10.

regards,
Sergei
Re: Unused header file inclusion
On 2019-Aug-18, Tom Lane wrote: > I wrote: > > (My headerscheck script is missing that header; I need to update it to > > match the latest version of cpluspluscheck.) > > I did that, and ended up with the attached. I'm rather tempted to stick > this into src/tools/ alongside cpluspluscheck, because it seems to find > rather different trouble spots than cpluspluscheck does. Thoughts? Yeah, let's include this. I've written its equivalent a couple of times already. (My strategy is just to compile the .h file directly though, which creates a .gch file, rather than writing a temp .c file.) -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: psql \r command is working?
Sergei Kornilov writes: > Is psql command \r actually works? I expected \r and then \e command should > start editor without text of last query. \r clears the current query buffer. \e without an argument is defined to edit the current query buffer, or if that is empty, the last query. I don't see anything particularly wrong here. > I see this behavior change on >= 10 versions Possibly fixed as a side effect of e984ef586. regards, tom lane
Re: psql \r command is working?
so this is expected behavior, ok, thank you! regards, Sergei
Re: Duplicated LSN in ReorderBuffer
On 2019-Aug-19, Masahiko Sawada wrote: > On Tue, Aug 13, 2019 at 6:36 AM Alvaro Herrera > wrote: > > BTW I wrote the code as suggested and it passes all the tests ... but I > > then noticed that the unpatched code doesn't fail Ildar's original > > pgbench-based test for me, either. So maybe my laptop is not powerful > > enough to reproduce it, or maybe I'm doing something wrong. > > If you share the patch fixing this issue I'll test it on my > environment where I could reproduce the original problem. Never mind. I was able to reproduce it later, and verify that Andres' proposed strategy doesn't seem to fix the problem. I'm going to study the problem again today. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Wrong sentence in the README?
Hi, in the top-level README there is this:

PostgreSQL has many language interfaces, many of which are listed here: https://www.postgresql.org/download

I don't think the download page lists any language interfaces, or am I missing something?

Regards
Daniel
Re: Cleanup isolation specs from unused steps
Michael Paquier writes: > I have been looking at the isolation tests, and we have in some specs > steps which are defined but not used in any permutations. Hmm, might any of those represent actual bugs? Or are they just leftovers from test development? > In order to > detect them, I have been using the attached trick to track which > permutations are used. This allows to find immediately any > over-engineered spec by generating diffs about steps defined by not > used in any permutations. On HEAD, we have six specs entering in this > category. Seems like a good idea; I'm surprised we've got so many cases. regards, tom lane
Re: Partial join
Richard Guo wrote:
> Please refer to the code snippet below:
>
> @@ -1164,8 +1164,8 @@ generate_join_implied_equalities(PlannerInfo *root,
>     List *sublist = NIL;
>
>     /* ECs containing consts do not need any further enforcement */
>     if (ec->ec_has_const)
>         continue;

Sorry, I'm quite busy currently. And thanks! That was a good read.

I might be wrong, but I think have_partkey_equi_join in joinrels.c should be aware of the const case. My naive approach would be to keep pointers to the first few constant clauses that reference a yet-unmatched partition key, to keep the memory footprint feasible, in a manner similar to pk_has_clause. The question would be what to do if there are a lot of const expressions on the partition keys. One could palloc additional memory in that case, hoping that it will be quite rare. Or is there a different, better way to go about that?

Thank you for your feedback!

Regards
Arne
Re: Duplicated LSN in ReorderBuffer
Hi, On August 19, 2019 7:43:28 AM PDT, Alvaro Herrera wrote: >On 2019-Aug-19, Masahiko Sawada wrote: > >> On Tue, Aug 13, 2019 at 6:36 AM Alvaro Herrera > wrote: > >> > BTW I wrote the code as suggested and it passes all the tests ... >but I >> > then noticed that the unpatched code doesn't fail Ildar's original >> > pgbench-based test for me, either. So maybe my laptop is not >powerful >> > enough to reproduce it, or maybe I'm doing something wrong. >> >> If you share the patch fixing this issue I'll test it on my >> environment where I could reproduce the original problem. > >Never mind. I was able to reproduce it later, and verify that Andres' >proposed strategy doesn't seem to fix the problem. I'm going to study >the problem again today. Could you post the patch? Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On 8/19/19 8:51 AM, Ahsan Hadi wrote:
> I have shared a calendar invite for the TDE/KMS weekly meeting with the
> members who expressed interest in joining the meeting in this chain.
> Hopefully I haven't missed anyone.
>
> I am not aware of everyone's timezone, but I have tried to set up a time
> that's not very inconvenient. It won't be ideal for everyone as we are
> dealing with multiple timezones, but do let me know if it is too bizarre for
> you and I will try to find another slot.
>
> I will share a zoom link for the meeting on the invite in due course.

Please add me as well. I would like to join when I can.

Joe

--
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
I have shared a calendar invite for the TDE/KMS weekly meeting with the members who expressed interest in joining the meeting in this chain. Hopefully I haven't missed anyone.

I am not aware of everyone's timezone, but I have tried to set up a time that's not very inconvenient. It won't be ideal for everyone as we are dealing with multiple timezones, but do let me know if it is too bizarre for you and I will try to find another slot.

I will share a zoom link for the meeting on the invite in due course.

-- Ahsan

On Mon, Aug 19, 2019 at 9:26 AM Ahsan Hadi wrote:
>
> On Mon, 19 Aug 2019 at 6:23 AM, Smith, Peter wrote:
>
>> > From: Ibrar Ahmed Sent: Sunday, 18 August 2019 2:43 AM
>> > +1 for voice call, bruce we usually have a weekly TDE call. I will
>> > include you in
>>
>> If you don't mind, please also add me to that TDE call list.
>
> Sure will do.
>
>> Thanks/Regards,
>> ---
>> Peter Smith
>> Fujitsu Australia
Re: Global temporary tables
> Certainly, the default (small) temp buffer size plays a role.
> But on this IPC host this difference is not so important.
> Result with local temp tables and temp_buffers = 1GB: 859k TPS.
> It is a little bit unexpected result.

I understand that it is partially a generic problem of access to smaller dedicated caches versus access to a bigger shared cache. But it is hard to imagine that access to the local cache is 10% slower than access to the shared cache. Maybe there is some bottleneck - maybe our implementation of local buffers is suboptimal.

Using local buffers for global temporary tables can be interesting for another reason - they use temporary files, and temporary files can be placed on ephemeral IO on the Amazon cloud (with much better performance than persistent IO).

> --
> Konstantin Knizhnik
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company
Re: Cleanup isolation specs from unused steps
On Mon, Aug 19, 2019 at 1:08 AM Michael Paquier wrote: > Hi all, > > I have been looking at the isolation tests, and we have in some specs > steps which are defined but not used in any permutations. In order to > detect them, I have been using the attached trick to track which > permutations are used. This allows to find immediately any > over-engineered spec by generating diffs about steps defined by not > used in any permutations. On HEAD, we have six specs entering in this > category. > > Would that be useful? > I think it is useful. could you do the check that all steps have been used in dry_run mode instead of when running the tests for real? -- Melanie Plageman
Re: Unused header file inclusion
Alvaro Herrera writes: > On 2019-Aug-18, Tom Lane wrote: >> I did that, and ended up with the attached. I'm rather tempted to stick >> this into src/tools/ alongside cpluspluscheck, because it seems to find >> rather different trouble spots than cpluspluscheck does. Thoughts? > Yeah, let's include this. Done. regards, tom lane
io_uring support
Hi,

For some time already I have been following the new Linux IO interface "io_uring", which was introduced relatively recently [1]. The short description says:

    Shared application/kernel submission and completion ring pairs, for
    supporting fast/efficient IO.

For us the important part is probably that it is asynchronous IO that can work not only with O_DIRECT, but also with buffered access. Plus there are claims that it's pretty efficient (efficiency was one of the design goals [2]). The interface consists of submit/complete queues and data structures shared between an application and the kernel. To facilitate application development there is also a nice library to utilize io_uring from user space [3].

Since I haven't found that many discussions in the hackers archives about async IO, and out of curiosity, I decided to prepare an experimental patch to see how it would look to use io_uring in PostgreSQL. I've tested this patch so far only inside a qemu vm on the latest io_uring branch from the linux-block tree. The result is relatively simple, and introduces the new interface functions smgrqueueread, smgrsubmitread and smgrwaitread to queue any read we want, then submit the queue to the kernel and then wait for a result. The simplest example of how this interface could be used I found in pg_prewarm for buffer prefetching.

As a result of this experiment I have a few questions, open points and requests for the community's experience:

* I guess a proper implementation of async IO is a big deal, but it could also bring significant performance advantages. Is there any (nearest) future for such kind of async IO in PostgreSQL? Buffer prefetching is the simplest example, but taking into account that io_uring supports ordering, barriers and linked events, there are probably more use cases where it could be useful.

* Assuming that the answer to the previous question is positive, there could be different strategies for how to use io_uring. So far I see different opportunities for waiting. Let's say we have prepared a batch of async IO operations and submitted it. Then we can e.g.

  -> just wait for the batch to be finished
  -> wait (in the same syscall as submitting) for previously submitted batches, then start submitting again, and at the end wait for the leftovers
  -> peek whether any events are completed, and get only those without waiting for the whole batch (in this case it's necessary to make sure the submission queue does not overflow)

  So it's open what to use, and when.

* Does it make sense to use io_uring for smgrprefetch? Originally I added io_uring parts to FilePrefetch as well (in the form of preparing and submitting just one buffer), but I'm not sure this API is suitable.

* How might a data structure look that describes IO from the PostgreSQL perspective? With io_uring we need to somehow identify IO operations that were completed. For now I'm just using a buffer number. Btw, this experimental patch has many limitations, e.g. only one ring is used for everything, which is of course far from ideal and makes identification even more important.

* There are a few more degrees of freedom that io_uring introduces - how many rings to use, how many events per ring (which is going to be n for sqe and 2*n for cqe), how many IO operations per event to do (similar to preadv/pwritev we can provide a vector), and what the balance would be between submit and complete queues. I guess it will require a lot of benchmarking to find good values for these.
[1]: https://github.com/torvalds/linux/commit/38e7571c07be01f9f19b355a9306a4e3d5cb0f5b
[2]: http://kernel.dk/io_uring.pdf
[3]: http://git.kernel.dk/cgit/liburing/

v1-0001-io-uring.patch Description: Binary data
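For readers who haven't seen liburing yet, here is a minimal self-contained example (not from the patch; the file name and read size are arbitrary) of the queue/submit/wait pattern that the smgrqueueread/smgrsubmitread/smgrwaitread interface above mirrors. Build with -luring against liburing [3]:

#include <fcntl.h>
#include <stdio.h>
#include <sys/uio.h>
#include <liburing.h>

int
main(void)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	static char buf[8192];
	struct iovec iov = {.iov_base = buf, .iov_len = sizeof(buf)};
	int			fd;

	if (io_uring_queue_init(8, &ring, 0) < 0)
		return 1;

	fd = open("/tmp/testfile", O_RDONLY);
	if (fd < 0)
		return 1;

	/* Queue one buffered read at offset 0 and tag it with a user pointer. */
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_readv(sqe, fd, &iov, 1, 0);
	io_uring_sqe_set_data(sqe, buf);

	/* Hand the whole submission queue to the kernel in one syscall. */
	io_uring_submit(&ring);

	/* Wait for the completion and inspect its result. */
	if (io_uring_wait_cqe(&ring, &cqe) == 0)
	{
		printf("read returned %d bytes (tag %p)\n", cqe->res,
			   io_uring_cqe_get_data(cqe));
		io_uring_cqe_seen(&ring, cqe);
	}

	io_uring_queue_exit(&ring);
	return 0;
}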
Re: Unused header file inclusion
Hi, On 2019-08-18 14:37:34 -0400, Tom Lane wrote: > Andres Freund writes: > > I've pushed the other ones. > > Checking whether header files compile standalone shows you were overly > aggressive about removing fmgr.h includes: > > In file included from /tmp/headerscheck.Ss8bVx/test.c:3: > ./src/include/utils/selfuncs.h:143: error: expected declaration specifiers or > '...' before 'FmgrInfo' > ./src/include/utils/selfuncs.h:146: error: expected declaration specifiers or > '...' before 'FmgrInfo' > ./src/include/utils/selfuncs.h:152: error: expected declaration specifiers or > '...' before 'FmgrInfo' Darn. Pushed the obvious fix of adding a direct fmgr.h include, rather than the preivous indirect include. > That's with a script I use that's like cpluspluscheck except it tests > with plain gcc not g++. I attached it for the archives' sake. > > Oddly, cpluspluscheck does not complain about those cases, but it > does complain about Hm. I don't understand why it's not complaining. Wonder if it's a question of the flags or such. > In file included from /tmp/cpluspluscheck.FgX2SW/test.cpp:4: > ./src/bin/scripts/scripts_parallel.h:18: error: ISO C++ forbids declaration > of 'PGconn' with no type > ./src/bin/scripts/scripts_parallel.h:18: error: expected ';' before '*' token > ./src/bin/scripts/scripts_parallel.h:29: error: 'PGconn' has not been declared I noticed that "manually" earlier, when looking at the openbsd issue. > (My headerscheck script is missing that header; I need to update it to > match the latest version of cpluspluscheck.) I wonder if we should just add a #ifndef HEADERCHECK or such to the headers that we don't want to process as standalone headers (or #ifndef NOT_STANDALONE or whatever). That way multiple tools can rely on these markers, rather than copying knowledge about that kind of information into multiple places. I wish we could move the whole logic of those scripts into makefiles, so we could employ parallelism. Greetings, Andres Freund
Re: io_uring support
Hi,

On 2019-08-19 20:20:46 +0200, Dmitry Dolgov wrote:
> For already some time I'm following the new linux IO interface "io_uring", > that > was introduced relatively recently [1]. Short description says: > > Shared application/kernel submission and completion ring pairs, for > supporting fast/efficient IO.

Yes, it's quite promising. I also played around some with it. One thing I particularly like is that it seems somewhat realistic to have an abstraction that supports both io_uring and Windows' IOCP - personally I don't think we need support for more than those.

> For us the important part is probably that it's an asynchronious IO, that can > work not only with O_DIRECT, but with also with buffered access.

Note that while the buffered access does allow for some acceleration, it currently does have quite noticeable CPU overhead.

> Since I haven't found that many discussions in the hackers archives about > async > IO, and out of curiosity decided to prepare an experimental patch to see how > this would looks like to use io_uring in PostgreSQL.

Cool!

> I've tested this patch so far only inside a qemu vm on the latest > io_uring branch from linux-block tree. The result is relatively > simple, and introduces new interface smgrqueueread, smgrsubmitread and > smgrwaitread to queue any read we want, then submit a queue to a > kernel and then wait for a result. The simplest example of how this > interface could be used I found in pg_prewarm for buffers prefetching.

Hm. I'm a bit doubtful that that's going in the direction of being the right interface. I think we'd basically have to insist that all AIO-capable smgrs use one common AIO layer (note that the UNDO patches add another smgr implementation). Otherwise I think we'll have a very hard time making them cooperate. An interface like this would also lead to a lot of duplicated interfaces, because we'd basically need most of the smgr interface functions duplicated.

I suspect we'd rather have to build something where the existing functions grow a parameter controlling synchronicity. If AIO is allowed and supported, the smgr implementation would initiate the IO, together with a completion function for it, and return some value allowing the caller to wait for the result if desirable.

> As a result of this experiment I have few questions, open points and requests > for the community experience: > > * I guess the proper implementation to use async IO is a big deal, but could > bring also significant performance advantages. Is there any (nearest) future > for such kind of async IO in PostgreSQL? Buffer prefetching is a simplest > example, but taking into account that io_uring supports ordering, barriers > and linked events, there are probably more use cases when it could be > useful.

The lowest hanging fruit that I can see - and which I played with - is making the writeback flushing use async IO. That's particularly interesting for bgwriter. As it commonly only performs random IO, and as we need to keep the number of dirty buffers in the kernel small to avoid huge latency spikes, being able to submit IOs asynchronously can yield significant benefits.

> * Assuming that the answer for previous question is positive, there could be > different strategies how to use io_uring. So far I see different > opportunities for waiting. Let's say we have prepared a batch of async IO > operations and submitted it. Then we can e.g.
> > -> just wait for a batch to be finished > -> wait (in the same syscall as submitting) for previously submitted > batches, > then start submitting again, and at the end wait for the leftovers > -> peek if there are any events completed, and get only those without > waiting > for the whole batch (in this case it's necessary to make sure submission > queue is not overflowed) > > So it's open what and when to use.

I don't think there's much point in working only with complete batches. I think we'd lose too much of the benefit by introducing unnecessary synchronous operations. I think we'd need to design the interface in a way that there can constantly be in-progress IOs, block when the queue is full, and handle finished IOs using a callback mechanism or such.

> * Does it makes sense to use io_uring for smgrprefetch? Originally I've added > io_uring parts into FilePrefetch also (in the form of preparing and > submiting > just one buffer), but not sure if this API is suitable.

I have a hard time seeing that being worthwhile, unless we change the way it's used significantly. I think to really benefit, we'd have to be able to lock multiple buffers, and have io_uring prefetch directly into buffers.

> * How may look like a data structure, that can describe IO from PostgreSQL > perspective? With io_uring we need to somehow identify IO operations that > were completed. For now I'm just using a buffer number.

In my hacks I've used the sqe's user_data to point to a struct wi
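To illustrate the direction sketched above (existing smgr functions growing a parameter controlling synchronicity plus a completion callback), a purely hypothetical declaration could look something like this; none of these names exist in the tree or in any posted patch, they are invented here for illustration only:

/* hypothetical sketch only */
#include "postgres.h"

#include "storage/block.h"
#include "storage/smgr.h"

typedef void (*SMgrReadCompletion) (void *cb_arg, int io_result);

typedef struct SMgrAioHandle SMgrAioHandle;     /* opaque handle to wait on */

extern SMgrAioHandle *smgrread_async(SMgrRelation reln, ForkNumber forknum,
                                     BlockNumber blocknum, char *buffer,
                                     bool async,
                                     SMgrReadCompletion on_complete,
                                     void *cb_arg);
extern void smgrwait(SMgrAioHandle *handle);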
Re: Unused header file inclusion
Hi,

On 2019-08-19 13:07:50 -0700, Andres Freund wrote:
> On 2019-08-18 14:37:34 -0400, Tom Lane wrote: > > That's with a script I use that's like cpluspluscheck except it tests > > with plain gcc not g++. I attached it for the archives' sake. > > > > Oddly, cpluspluscheck does not complain about those cases, but it > > does complain about > > Hm. I don't understand why it's not complaining. Wonder if it's a > question of the flags or such.

It's caused by -fsyntax-only:

# These switches are g++ specific, you may override if necessary.
CXXFLAGS=${CXXFLAGS:- -fsyntax-only -Wall}

which also explains why your headerscheck doesn't have that problem, it doesn't use -fsyntax-only.

> > (My headerscheck script is missing that header; I need to update it to > > match the latest version of cpluspluscheck.) > > I wonder if we should just add a #ifndef HEADERCHECK or such to the > headers that we don't want to process as standalone headers (or #ifndef > NOT_STANDALONE or whatever). That way multiple tools can rely on these > markers, > rather than copying knowledge about that kind of information into > multiple places. > > I wish we could move the whole logic of those scripts into makefiles, so > we could employ parallelism.

Hm. Perhaps the way to do that would be to use gcc's -include to include postgres.h, and use -Wc++-compat to detect C++ issues, rather than using g++. Without tempfiles it ought to be a lot easier to just do all of the relevant work in make, without a separate shell script.

The python/perl/ecpg logic would be a bit annoying, but probably not too bad? Think we could just always add all of them?

Greetings,

Andres Freund
Re: POC: Cleaning up orphaned files using undo logs
On 2019-08-19 17:52:24 +0530, Amit Kapila wrote:
> On Sat, Aug 17, 2019 at 10:58 PM Andres Freund wrote: > > > > Hi, > > > > On 2019-08-17 12:05:21 -0400, Robert Haas wrote: > > > On Wed, Aug 14, 2019 at 12:39 PM Andres Freund wrote: > > > > > > Again, I think it's not ok to just assume you can lock an > > > > > > essentially > > > > > > unbounded number of buffers. This seems almost guaranteed to result > > > > > > in > > > > > > deadlocks. And there's limits on how many lwlocks one can hold etc. > > > > > > > > > > I think for controlling that we need to put a limit on max prepared > > > > > undo? I am not sure any other way of limiting the number of buffers > > > > > because we must lock all the buffer in which we are going to insert > > > > > the undo record under one WAL logged operation. > > > > > > > > I heard that a number of times. But I still don't know why that'd > > > > actually be true. Why would it not be sufficient to just lock the buffer > > > > currently being written to, rather than all buffers? It'd require a bit > > > > of care updating the official current "logical end" of a log, but > > > > otherwise ought to not be particularly hard? Only one backend can extend > > > > the log after all, and until the log is externally visibily extended, > > > > nobody can read or write those buffers, no? > > > > > > Well, I don't understand why you're on about this. We've discussed it > > > a number of times but I'm still confused. > > > > There's two reasons here: > > > > The primary one in the context here is that if we do *not* have to lock > > the buffers all ahead of time, we can simplify the interface. We > > certainly can't lock the buffers over IO (due to buffer reclaim) as > > we're doing right now, so we'd need another phase, called by the "user" > > during undo insertion. But if we do not need to lock the buffers before > > the insertion over all starts, the inserting location doesn't have to > > care. > > > > Secondarily, all the reasoning for needing to lock all buffers ahead of > > time was imo fairly unconvincing. Following the "recipe" for WAL > > insertions is a good idea when writing a new run-of-the-mill WAL > > inserting location - but when writing a new fundamental facility, that > > already needs to modify how WAL works, then I find that much less > > convincing. > > > > One point to remember in this regard is that we do need to modify the > LSN in undo pages after writing WAL, so all the undo pages need to be > locked by that time or we again need to take the lock on them.

Well, my main point, which so far has largely been ignored, was that we may not acquire page locks when we still need to search for victim buffers later. If we don't need to lock the pages up-front, but only do so once we're actually copying the records into the undo pages, then we don't need a separate phase to acquire the locks. We can still hold all of the page locks at the same time, as long as we just acquire them at the later stage.

My secondary point was that *none* of this actually is documented, even if it's entirely unobvious to the reader that the relevant code can only run during WAL insertion, due to being pretty far removed from that.

Greetings,

Andres Freund
Re: Unused header file inclusion
On 2019-Aug-19, Andres Freund wrote:
> > I wish we could move the whole logic of those scripts into makefiles, so > > we could employ parallelism. > > Hm. Perhaps the way to do that would be to use gcc's -include to include > postgres.h, and use -Wc++-compat to detect c++ issues, rather than using > g++. Without tempfiles it ought to be a lot easier to just do all of the > relevant work in make, without a separate shell script.

I used to have this:
https://postgr.es/m/1293469595-sup-1...@alvh.no-ip.org

Not sure how much this helps, since it's a shell line in make, so not very parallelizable. And you still have to build the exclusions somehow.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Unused header file inclusion
Andres Freund writes: > On 2019-08-19 13:07:50 -0700, Andres Freund wrote: >> On 2019-08-18 14:37:34 -0400, Tom Lane wrote: >>> Oddly, cpluspluscheck does not complain about those cases, but it >>> does complain about >> Hm. I don't understand why it's not complaining. Wonder if it's a >> question of the flags or such. > I'ts caused by -fsyntax-only Ah-hah. Should we change that to something else? That's probably a hangover from thinking that all we had to do was check for C++ keywords. >> I wish we could move the whole logic of those scripts into makefiles, so >> we could employ parallelism. I can't really get excited about expending a whole bunch of additional work on these scripts. regards, tom lane
Re: PANIC: could not flush dirty data: Operation not permitted power8, Redhat Centos
On Mon, Aug 19, 2019 at 7:32 AM Thomas Munro wrote:
> On Wed, Apr 17, 2019 at 1:04 PM Thomas Munro wrote: > > On Mon, Apr 15, 2019 at 7:57 PM wrote: > > > I forgot to mention that this is happening in a docker container. > > > > Huh, so there may be some configuration of Linux container that can > > fail here with EPERM, even though that error that does not appear in > > the man page, and doesn't make much intuitive sense. Would be good to > > figure out how that happens. > > Steve Dodd ran into the same problem in Borg[1]. It looks like what's > happening here is that on PowerPC and ARM systems, there is a second > system call sync_file_range2 that has the arguments arranged in a > better order for their calling conventions (see Notes section of man > sync_file_range), and glibc helpfully translates for you, but some > container technologies forgot to include sync_file_range2 in their > syscall forwarding table. Perhaps we should just handle this with the > not_implemented_by_kernel mechanism I added for WSL.

I've just heard that it was fixed overnight in seccomp, which is probably what Docker is using to give you EPERM for syscalls it doesn't like the look of:

https://github.com/systemd/systemd/pull/13352/commits/90ddac6087b5f8f3736364cfdf698e713f7e8869

Not being a Docker user, I'm not sure if/when that will flow into the right places in a timely fashion, but if not, it looks like you can always configure your own profile or take one from somewhere else, probably something like this:

https://github.com/moby/moby/commit/52d8f582c331e35f7b841171a1c22e2d9bbfd0b8

So it looks like we don't need to do anything at all on our side, unless someone knows better.

--
Thomas Munro
https://enterprisedb.com
Re: Zedstore - compressed in-core columnar storage
On Sun, Aug 18, 2019 at 12:35 PM Justin Pryzby wrote:
> > . I was missing a way to check for compression ratio;

Here are the ways to check compression ratio for zedstore:

Table level:
select sum(uncompressedsz::numeric) / sum(totalsz) as compratio from pg_zs_btree_pages();

Per column level:
select attno, count(*), sum(uncompressedsz::numeric) / sum(totalsz) as compratio from pg_zs_btree_pages() group by attno order by attno;

> it looks like zedstore >with lz4 gets ~4.6x for our largest customer's largest table. zfs using >compress=gzip-1 gives 6x compression across all their partitioned > tables, >and I'm surprised it beats zedstore . >

What kind of tables did you use? Is it possible to give us the schema of the table? Did you perform 'INSERT INTO ... SELECT' or COPY? Currently COPY gives better compression ratios than single INSERTs because it generates fewer pages for metadata. The per-column query above will show which columns have a lower compression ratio. We plan to add other compression algorithms like RLE and delta encoding, which should give better compression ratios for a column store along with LZ4.
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Tue, Aug 13, 2019 at 8:45 AM Anastasia Lubennikova wrote: > > I still need to think about the exact details of alignment within > > _bt_insertonpg_in_posting(). I'm worried about boundary cases there. I > > could be wrong. > Could you explain more about these cases? > Now I don't understand the problem. Maybe there is no problem. > Thank you for the patch. > Still, I'd suggest to leave it as a possible future improvement, so that > it doesn't > distract us from the original feature. I don't even think that it's useful work for the future. It's just nice to be sure that we could support unique index deduplication if it made sense. Which it doesn't. If I didn't write the patch that implements deduplication for unique indexes, I might still not realize that we need the index_compute_xid_horizon_for_tuples() stuff in certain other places. I'm not serious about it at all, except as a learning exercise/experiment. > I added to v6 another related fix for _bt_compress_one_page(). > Previous code was implicitly deleted DEAD items without > calling index_compute_xid_horizon_for_tuples(). > New code has a check whether DEAD items on the page exist and remove > them if any. > Another possible solution is to copy dead items as is from old page to > the new one, > but I think it's good to remove dead tuples as fast as possible. I think that what you've done in v7 is probably the best way to do it. It's certainly simple, which is appropriate given that we're not really expecting to see LP_DEAD items within _bt_compress_one_page() (we just need to be prepared for them). > > v5 makes _bt_insertonpg_in_posting() prepared to overwrite an > > existing item if it's an LP_DEAD item that falls in the same TID range > > (that's _bt_compare()-wise "equal" to an existing tuple, which may or > > may not be a posting list tuple already). I haven't made this code do > > something like call index_compute_xid_horizon_for_tuples(), even > > though that's needed for correctness (i.e. this new code is currently > > broken in the same way that I mentioned unique index support is > > broken). > Is it possible that DEAD tuple to delete was smaller than itup? I'm not sure what you mean by this. I suppose that it doesn't matter, since we both prefer the alternative that you came up with anyway. > > How do you feel about officially calling this deduplication, not > > compression? I think that it's a more accurate name for the technique. > I agree. > Should I rename all related names of functions and variables in the patch? Please rename them when convenient. -- Peter Geoghegan
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Fri, Aug 16, 2019 at 8:56 AM Anastasia Lubennikova wrote: > Now the algorithm is the following: > > - If bt_findinsertloc() found out that tuple belongs to existing posting > tuple's > TID interval, it sets 'in_posting_offset' variable and passes it to > _bt_insertonpg() > > - If 'in_posting_offset' is valid and origtup is valid, > merge our itup into origtup. > > It can result in one tuple neworigtup, that must replace origtup; or two > tuples: > neworigtup and newrighttup, if the result exceeds BTMaxItemSize, That sounds like the right way to do it. > - If two new tuple(s) fit into the old page, we're lucky. > call _bt_delete_and_insert(..., neworigtup, newrighttup, newitemoff) to > atomically replace oldtup with new tuple(s) and generate xlog record. > > - In case page split is needed, pass both tuples to _bt_split(). > _bt_findsplitloc() is now aware of upcoming replacement of origtup with > neworigtup, so it uses correct item size where needed. That makes sense, since _bt_split() is responsible for both splitting the page, and inserting the new item on either the left or right page, as part of the first phase of a page split. In other words, if you're adding something new to _bt_insertonpg(), you probably also need to add something new to _bt_split(). So that's what you did. > It seems that now all replace operations are crash-safe. The new patch passes > all regression tests, so I think it's ready for review again. I'm looking at it now. I'm going to spend a significant amount of time on this tomorrow. I think that we should start to think about efficient WAL-logging now. > In the meantime, I'll run more stress-tests. As you probably realize, wal_consistency_checking is a good thing to use with your tests here. -- Peter Geoghegan
Plug-in common/logging.h with vacuumlo and oid2name
Hi all, While refactoring some code, I have bumped into $subject, which causes both tools to have incorrect error message formats and progname being used all through the code, sometimes incorrectly or just missing. At the same time, we are missing a call to set_pglocale_pgservice() for both binaries. Any objections to the cleanup attached? -- Michael diff --git a/contrib/oid2name/oid2name.c b/contrib/oid2name/oid2name.c index d3a4a50005..b33e4d3d82 100644 --- a/contrib/oid2name/oid2name.c +++ b/contrib/oid2name/oid2name.c @@ -11,6 +11,7 @@ #include "catalog/pg_class_d.h" +#include "common/logging.h" #include "fe_utils/connect.h" #include "libpq-fe.h" #include "pg_getopt.h" @@ -85,6 +86,8 @@ get_opts(int argc, char **argv, struct options *my_opts) const char *progname; int optindex; + pg_logging_init(argv[0]); + set_pglocale_pgservice(argv[0], PG_TEXTDOMAIN("oid2name")); progname = get_progname(argv[0]); /* set the defaults */ @@ -320,8 +323,8 @@ sql_conn(struct options *my_opts) if (!conn) { - fprintf(stderr, "%s: could not connect to database %s\n", - "oid2name", my_opts->dbname); + pg_log_error("could not connect to database %s", + my_opts->dbname); exit(1); } @@ -339,8 +342,8 @@ sql_conn(struct options *my_opts) /* check to see that the backend connection was successfully made */ if (PQstatus(conn) == CONNECTION_BAD) { - fprintf(stderr, "%s: could not connect to database %s: %s", -"oid2name", my_opts->dbname, PQerrorMessage(conn)); + pg_log_error("could not connect to database %s: %s", + my_opts->dbname, PQerrorMessage(conn)); PQfinish(conn); exit(1); } @@ -348,8 +351,8 @@ sql_conn(struct options *my_opts) res = PQexec(conn, ALWAYS_SECURE_SEARCH_PATH_SQL); if (PQresultStatus(res) != PGRES_TUPLES_OK) { - fprintf(stderr, "oid2name: could not clear search_path: %s\n", -PQerrorMessage(conn)); + pg_log_error("could not clear search_path: %s", + PQerrorMessage(conn)); PQclear(res); PQfinish(conn); exit(-1); @@ -382,8 +385,8 @@ sql_exec(PGconn *conn, const char *todo, bool quiet) /* check and deal with errors */ if (!res || PQresultStatus(res) > 2) { - fprintf(stderr, "oid2name: query failed: %s\n", PQerrorMessage(conn)); - fprintf(stderr, "oid2name: query was: %s\n", todo); + pg_log_error("query failed: %s", PQerrorMessage(conn)); + pg_log_error("query was: %s", todo); PQclear(res); PQfinish(conn); diff --git a/contrib/vacuumlo/vacuumlo.c b/contrib/vacuumlo/vacuumlo.c index 73c06a043e..7fbcaee846 100644 --- a/contrib/vacuumlo/vacuumlo.c +++ b/contrib/vacuumlo/vacuumlo.c @@ -23,6 +23,7 @@ #include "catalog/pg_class_d.h" +#include "common/logging.h" #include "fe_utils/connect.h" #include "libpq-fe.h" #include "pg_getopt.h" @@ -109,8 +110,7 @@ vacuumlo(const char *database, const struct _param *param) conn = PQconnectdbParams(keywords, values, true); if (!conn) { - fprintf(stderr, "Connection to database \"%s\" failed\n", - database); + pg_log_error("connection to database \"%s\" failed", database); return -1; } @@ -129,8 +129,8 @@ vacuumlo(const char *database, const struct _param *param) /* check to see that the backend connection was successfully made */ if (PQstatus(conn) == CONNECTION_BAD) { - fprintf(stderr, "Connection to database \"%s\" failed:\n%s", -database, PQerrorMessage(conn)); + pg_log_error("connection to database \"%s\" failed: %s", + database, PQerrorMessage(conn)); PQfinish(conn); return -1; } @@ -145,8 +145,7 @@ vacuumlo(const char *database, const struct _param *param) res = PQexec(conn, ALWAYS_SECURE_SEARCH_PATH_SQL); if (PQresultStatus(res) != PGRES_TUPLES_OK) { - 
fprintf(stderr, "Failed to set search_path:\n"); - fprintf(stderr, "%s", PQerrorMessage(conn)); + pg_log_error("failed to set search_path: %s", PQerrorMessage(conn)); PQclear(res); PQfinish(conn); return -1; @@ -165,8 +164,7 @@ vacuumlo(const char *database, const struct _param *param) res = PQexec(conn, buf); if (PQresultStatus(res) != PGRES_COMMAND_OK) { - fprintf(stderr, "Failed to create temp table:\n"); - fprintf(stderr, "%s", PQerrorMessage(conn)); + pg_log_error("failed to create temp table: %s", PQerrorMessage(conn)); PQclear(res); PQfinish(conn); return -1; @@ -182,8 +180,7 @@ vacuumlo(const char *database, const struct _param *param) res = PQexec(conn, buf); if (PQresultStatus(res) != PGRES_COMMAND_OK) { - fprintf(stderr, "Failed to vacuum temp table:\n"); - fprintf(stderr, "%s", PQerrorMessage(conn)); + pg_log_error("failed to vacuum temp table: %s", PQerrorMessage(conn)); PQclear(res); PQfinish(conn); return -1; @@ -212,8 +209,7 @@ vacuumlo(const char *database, const struct _param *param) res = PQexec(conn, buf); if (PQresultStatus(res) != PGRES_TUPLES_OK) { - fprintf(stderr, "Failed to find OID columns:\n"); - fprintf(stderr, "%s", PQerrorMessage(conn)); + pg_log_error("fai
Re: Cleanup isolation specs from unused steps
On Mon, Aug 19, 2019 at 11:02:42AM -0400, Tom Lane wrote: > Michael Paquier writes: >> I have been looking at the isolation tests, and we have in some specs >> steps which are defined but not used in any permutations. > > Hmm, might any of those represent actual bugs? Or are they just > leftovers from test development? I cannot yet enter the minds of each test author back this much in time, but I think that's a mix of both. When working on a new isolation spec, I personally tend to do a lot of copy-pasting of the same queries for multiple sessions and then manipulate the permutations to produce a set of useful tests. It is rather easy to forget to remove some steps when doing that. I guess that's what happened with tuplelock-upgrade, insert-conflict-do-update* and freeze-the-dead. -- Michael signature.asc Description: PGP signature
Re: Cleanup isolation specs from unused steps
On Mon, Aug 19, 2019 at 10:23:19AM -0700, Melanie Plageman wrote: > Could you do the check that all steps have been used in dry_run mode > instead of when running the tests for real? Sure, I was hesitating to do so. I have no issue in moving the check into run_testspec(). So done as attached. It is rather a pain to pass down custom options to isolationtester. For example, I have tested the updated version attached after hijacking -n into isolation_start_test(). Ugly hack, but for testing that's enough. Do you make use of this tool in a particular way in greenplum? Just wondering. (Could it make sense to have long options for isolationtester by the way?) -- Michael diff --git a/src/test/isolation/isolationtester.c b/src/test/isolation/isolationtester.c index f98bb1cf64..f8ec7ca7b4 100644 --- a/src/test/isolation/isolationtester.c +++ b/src/test/isolation/isolationtester.c @@ -251,10 +251,24 @@ static int *piles; static void run_testspec(TestSpec *testspec) { + int i; + if (testspec->permutations) run_named_permutations(testspec); else run_all_permutations(testspec); + + /* + * Verify that all steps have been used, complaining about anything + * defined but not used. This check happens here so as dry-run is able + * to use it. + */ + for (i = 0; i < testspec->nallsteps; i++) + { + if (!testspec->allsteps[i]->used) + fprintf(stderr, "unused step name: %s\n", + testspec->allsteps[i]->name); + } } /* @@ -457,7 +471,11 @@ run_permutation(TestSpec *testspec, int nsteps, Step **steps) { printf("permutation"); for (i = 0; i < nsteps; i++) + { + /* Track the permutation as in-use */ + steps[i]->used = true; printf(" \"%s\"", steps[i]->name); + } printf("\n"); return; } @@ -467,7 +485,11 @@ run_permutation(TestSpec *testspec, int nsteps, Step **steps) printf("\nstarting permutation:"); for (i = 0; i < nsteps; i++) + { + /* Track the permutation as in-use */ + steps[i]->used = true; printf(" %s", steps[i]->name); + } printf("\n"); /* Perform setup */ diff --git a/src/test/isolation/isolationtester.h b/src/test/isolation/isolationtester.h index 7f91e6433f..d9d2a14ecf 100644 --- a/src/test/isolation/isolationtester.h +++ b/src/test/isolation/isolationtester.h @@ -29,6 +29,7 @@ struct Session struct Step { int session; + bool used; char *name; char *sql; char *errormsg; diff --git a/src/test/isolation/specparse.y b/src/test/isolation/specparse.y index fb8a4d706c..2dfe3533ff 100644 --- a/src/test/isolation/specparse.y +++ b/src/test/isolation/specparse.y @@ -145,6 +145,7 @@ step: $$ = pg_malloc(sizeof(Step)); $$->name = $2; $$->sql = $3; +$$->used = false; $$->errormsg = NULL; } ; diff --git a/src/test/isolation/specs/freeze-the-dead.spec b/src/test/isolation/specs/freeze-the-dead.spec index 8c3649902a..915bf15b92 100644 --- a/src/test/isolation/specs/freeze-the-dead.spec +++ b/src/test/isolation/specs/freeze-the-dead.spec @@ -19,7 +19,6 @@ session "s1" step "s1_begin" { BEGIN; } step "s1_update" { UPDATE tab_freeze SET x = x + 1 WHERE id = 3; } step "s1_commit" { COMMIT; } -step "s1_vacuum" { VACUUM FREEZE tab_freeze; } step "s1_selectone" { BEGIN; SET LOCAL enable_seqscan = false; @@ -28,7 +27,6 @@ step "s1_selectone" { COMMIT; } step "s1_selectall" { SELECT * FROM tab_freeze ORDER BY name, id; } -step "s1_reindex" { REINDEX TABLE tab_freeze; } session "s2" step "s2_begin" { BEGIN; } @@ -40,7 +38,6 @@ session "s3" step "s3_begin" { BEGIN; } step "s3_key_share" { SELECT id FROM tab_freeze WHERE id = 3 FOR KEY SHARE; } step "s3_commit" { COMMIT; } -step "s3_vacuum" { VACUUM FREEZE tab_freeze; } # This 
permutation verifies that a previous bug # https://postgr.es/m/e5711e62-8fdf-4dca-a888-c200bf6b5...@amazon.com diff --git a/src/test/isolation/specs/insert-conflict-do-nothing.spec b/src/test/isolation/specs/insert-conflict-do-nothing.spec index 9b92c35cec..71acc380c7 100644 --- a/src/test/isolation/specs/insert-conflict-do-nothing.spec +++ b/src/test/isolation/specs/insert-conflict-do-nothing.spec @@ -33,7 +33,6 @@ setup step "donothing2" { INSERT INTO ints(key, val) VALUES(1, 'donothing2') ON CONFLICT DO NOTHING; } step "select2" { SELECT * FROM ints; } step "c2" { COMMIT; } -step "a2" { ABORT; } # Regular case where one session block-waits on another to determine if it # should proceed with an insert or do nothing. diff --git a/src/test/isolation/specs/insert-conflict-do-update-2.spec b/src/test/isolation/specs/insert-conflict-do-update-2.spec index f5b4f601b5..12f6be8000 100644 --- a/src/test/isolation/specs/insert-conflict-do-update-2.spec +++ b/src/test/isolation/specs/insert-conflict-do-update-2.spec @@ -32,7 +32,6 @@ setup step "insert2" { INSERT INTO upsert(key, payload) VALUES('FOOFOO', 'insert2') ON CONFLICT (lower(key)) DO UPDATE set key = EXCLUDED.key, payload = upsert.payload || ' updated by insert2'; } step "select2" { SELECT * FRO
Re: Zedstore - compressed in-core columnar storage
On Mon, Aug 19, 2019 at 04:15:30PM -0700, Alexandra Wang wrote: > On Sun, Aug 18, 2019 at 12:35 PM Justin Pryzby wrote: > > > . I was missing a way to check for compression ratio; > > Here are the ways to check compression ratio for zedstore: > > Table level: > SELECT sum(uncompressedsz::numeric)/sum(totalsz) AS compratio FROM > pg_zs_btree_pages(); postgres=# SELECT sum(uncompressedsz::numeric)/sum(totalsz) AS compratio FROM pg_zs_btree_pages('child.cdrs_huawei_pgwrecord_2019_07_01'); compratio | 4.2730304163521529 For a fair test, I created a separate ZFS tablspace for storing just a copy of that table. ts=# CREATE TABLE test TABLESPACE testcomp AS SELECT * FROM child.cdrs_huawei_pgwrecord_2019_07_01; SELECT 39933381 Time: 882417.775 ms (14:42.418) zfs/testJTP20190819 compressratio 6.01x - zfs/testJTP20190819 compressiongzip-1inherited from zfs > Per column level: > select attno, count(*), sum(uncompressedsz::numeric)/sum(totalsz) as > compratio from pg_zs_btree_pages() group by attno order by attno; Order by 3; I see we have SOME highly compressed columns. It's still surprising to me that's as low as it is, given their content: phone numbers and IPv4 addresses in text form, using characters limited to [[:digit:].] (I realize we can probably save space using inet type.) 0 | 4743 | 1. 32 | 21912 | 1.05953637381493823513 80 | 36441 | 1.2416446300175039 4 | 45059 | 1.3184106811322728 83 | 45059 | 1.3184106811322728 52 | 39208 | 1.3900788061770992 ... 74 | 3464 |10.8258665101057364 17 | 3535 |10.8776086243096534 3 | 7092 |11.0388009154683678 11 | 3518 |11.4396055611832109 65 | |14.6594723104237634 35 | 14077 |15.1642131499381887 ... 43 | 1601 |21.4200106784573211 79 | 1599 |21.4487670806076829 89 | 1934 |23.6292134031933401 33 | 1934 |23.6292134031933401 It seems clear the columns with high n_distinct have low compress ratio, and columns with high compress ratio are those with n_distinct=1... CREATE TEMP TABLE zs AS SELECT zs.*, n_distinct, avg_width, a.attname FROM (SELECT 'child.cdrs_huawei_pgwrecord_2019_07_01'::regclass t)t , LATERAL (SELECT attno, count(*), sum(uncompressedsz::numeric)/sum(totalsz) AS compratio FROM pg_zs_btree_pages(t) GROUP BY attno)zs , pg_attribute a, pg_class c, pg_stats s WHERE a.attrelid=t AND a.attnum=zs.attno AND c.oid=a.attrelid AND c.relname=s.tablename AND s.attname=a.attname; n_distinct | compratio + 217141 | 1.2416446300175039 154829 | 1.5306062496764190 144486 | 1.3900788061770992 128334 | 1.5395022739568842 121324 | 1.4005533187886683 86341 | 1.6262709389296389 84073 | 4.4379336418590519 65413 | 5.1890181028038757 63703 | 5.5029855093836425 63637 | 5.3648468796642262 46450 | 1.3184106811322728 46450 | 1.3184106811322728 43029 | 1.8003513772661308 39363 | 1.5845730687475706 36720 | 1.4751147557399539 36445 | 1.8403087513759131 36445 | 1.5453935268318613 11455 | 1.05953637381493823513 2862 | 9.8649823666870671 2625 | 2.3573614181847621 1376 | 1.7895024285340428 1335 | 2.2812551964262787 807 | 7.1192324141359373 610 | 7.9373623460089360 16 |11.4396055611832109 10 | 5.5429763442365557 7 | 5.0440578041440675 7 | 5.2000132813261135 4 | 6.9741514753325536 4 | 4.2872818036896340 3 | 1.9080838412634827 3 | 2.9915954457453485 3 | 2.3056387009407882 2 |10.8776086243096534 2 | 5.5950929307378287 2 |18.5796576388128741 2 |10.8258665101057364 2 | 9.1112820658021406 2 | 3.4986057630739795 2 | 4.6250999234025238 2 |11.0388009154683678 1 |15.1642131499381887 1 | 2.8855860118178798 1 |23.6292134031933401 1 |21.4200106784573211 [...] 
> > it looks like zedstore > >with lz4 gets ~4.6x for our largest customer's largest table. zfs using > >compress=gzip-1 gives 6x compression across all their partitioned > > tables, > >and I'm surprised it beats zedstore . > > > > What kind of tables did you use? Is it possible to give us the schema > of the table? Did you perform 'INSERT INTO ... SELECT' or COPY? I did this: |time ~/src/postgresql.bin/bin/pg_restore /srv/cdrperfbackup/ts/final/child.cdrs_huawei_pgwrecord_2019_07_01 -f- |PGOPTIONS='-cdefault_table_access_method=zedstore' psql --port 5678 postgres --host /tmp ... COPY 39933381 ... real100m25.764s child | cdrs_huawei_pgwrecord_2019_07_01 | table | pryz
Fixing typos and inconsistencies
Hello hackers, Now that the unicums checking is finished, I would like to share the script I used to find them. Maybe it can be useful to recheck the source tree from time to time... I don't think that the check could be fully automated, but with some eyeballing it allows to maintain a more consistent state. Best regards, Alexander find-new-unicums.sh Description: application/shellscript
Re: Cleanup isolation specs from unused steps
On 2019-Aug-20, Michael Paquier wrote: > On Mon, Aug 19, 2019 at 10:23:19AM -0700, Melanie Plageman wrote: > > Could you do the check that all steps have been used in dry_run mode > > instead of when running the tests for real? > > Sure, I was hesitating to do so. I have no issue in moving the check > into run_testspec(). So done as attached. I created the dry-run mode to be able to easily generate the set of possible permutations for a new test, then edit the result and put it back in the spec file; but after the deadlock tests were added (with necessary hacking of the lock-detection in isolationtester) that manner of operation became almost completely useless. Maybe we need to rethink what facilities isolationtester offers -- possibly making dry-run have a completely different behavior than currently, which I doubt anybody is using. All that being said, I have no objections to this patch (but I didn't review it closely). -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Cleanup isolation specs from unused steps
On Tue, Aug 20, 2019 at 12:34:45AM -0400, Alvaro Herrera wrote:
> I created the dry-run mode to be able to easily generate the set of > possible permutations for a new test, then edit the result and put it > back in the spec file; but after the deadlock tests were added (with > necessary hacking of the lock-detection in isolationtester) that manner > of operation became almost completely useless. Maybe we need to rethink > what facilities isolationtester offers -- possibly making dry-run have a > completely different behavior than currently, which I doubt anybody is > using.

I am not sure exactly how it could be redesigned, and with n! permutations it easily leads to bloat of the generated output. I think that --dry-run (well, -n) is a bit misleading as an option name, though, as it only prints the permutations. Still, keeping it around has no real cost, so it is not a big deal. (Looking at the gpdb code, it does not seem to be used.)

> All that being said, I have no objections to this patch (but I didn't > review it closely).

Thanks.

--
Michael

signature.asc Description: PGP signature
Re: [HACKERS] WAL logging problem in 9.4.3?
On Mon, Aug 19, 2019 at 06:59:59PM +0900, Kyotaro Horiguchi wrote:
> At Sat, 17 Aug 2019 20:52:30 -0700, Noah Misch wrote in > <20190818035230.gb3021...@rfd.leadboat.com> > > For two-phase commit, PrepareTransaction() needs to execute pending syncs. > > Now TwoPhaseFileHeader has two new members for (commit-time) > pending syncs. Pending-syncs are useless on wal-replay, but that > is needed for commit-prepared.

There's no need to modify TwoPhaseFileHeader or the COMMIT PREPARED sql command, which is far too late to be syncing new relation files. (A crash may have already destroyed their data.) PrepareTransaction(), which implements the PREPARE TRANSACTION command, is the right place for these syncs. A failure in these new syncs needs to prevent the transaction from being marked committed. Hence, in CommitTransaction(), these new syncs need to happen after the last step that could assign a new relfilenode and before RecordTransactionCommit(). I suspect it's best to do it after PreCommit_on_commit_actions() and before AtEOXact_LargeObject().

> > On Thu, Jul 25, 2019 at 10:39:36AM +0900, Kyotaro Horiguchi wrote: > > > --- a/src/backend/catalog/storage.c > > > +++ b/src/backend/catalog/storage.c > > > @@ -428,21 +450,34 @@ smgrDoPendingDeletes(bool isCommit) > ... > > > + smgrimmedsync(srel, MAIN_FORKNUM); > > > > This should sync all forks, not just MAIN_FORKNUM. Code that writes WAL for > > FSM_FORKNUM and VISIBILITYMAP_FORKNUM checks RelationNeedsWAL(). There may > > be > > no bug today, but it's conceptually wrong to make RelationNeedsWAL() return > > false due to this code, use RelationNeedsWAL() for multiple forks, and then > > not actually sync all forks. > > I agree that all forks needs syncing, but FSM and VM are checking > RelationNeedsWAL(modified). To make sure, are you suggesting to > sync all forks instead of emitting WAL for them, or suggesting > that VM and FSM to emit WALs even when the modified > RelationNeedsWAL returns false (+ sync all forks)?

I hadn't thought that far. What do you think is best?

> > The https://postgr.es/m/559fa0ba.3080...@iki.fi design had another component > > not appearing here. It said, "Instead, at COMMIT, we'd fsync() the > > relation, > > or if it's smaller than some threshold, WAL-log the contents of the whole > > file > > at that point." Please write the part to WAL-log the contents of small > > files > > instead of syncing them. > > I'm not sure the point of the behavior. I suppose that the "log" > is a sequence of new_page records. It also needs to be synced and > it is always larger than the file to be synced. I can't think of > an appropriate threshold without the point.

Yes, it would be a sequence of new-page records. FlushRelationBuffers() locks every buffer header containing a buffer of the current database. The belief has been that writing one page to xlog is cheaper than FlushRelationBuffers() in a busy system with large shared_buffers.

> > > --- a/src/backend/commands/copy.c > > > +++ b/src/backend/commands/copy.c > > > @@ -2725,28 +2722,9 @@ CopyFrom(CopyState cstate) > > >* If it does commit, we'll have done the table_finish_bulk_insert() at > > >* the bottom of this routine first. > > >* > > > - * As mentioned in comments in utils/rel.h, the in-same-transaction test > > > - * is not always set correctly, since in rare cases > > > rd_newRelfilenodeSubid > > > - * can be cleared before the end of the transaction.
The exact case is > > > - * when a relation sets a new relfilenode twice in same transaction, yet > > > - * the second one fails in an aborted subtransaction, e.g. > > > - * > > > - * BEGIN; > > > - * TRUNCATE t; > > > - * SAVEPOINT save; > > > - * TRUNCATE t; > > > - * ROLLBACK TO save; > > > - * COPY ... > > > > The comment material being deleted is still correct, so don't delete it. > > Moreover, the code managing rd_firstRelfilenodeSubid has a similar bug. The > > attached patch adds an assertion that RelationNeedsWAL() and the > > pendingDeletes array have the same opinion about the relfilenode, and it > > expands a test case to fail that assertion. > > (Un?)Fortunately, that doesn't fail.. (with rebased version on > the recent master) I'll recheck that tomorrow. Did you build with --enable-cassert? > > > --- a/src/include/utils/rel.h > > > +++ b/src/include/utils/rel.h > > > @@ -74,11 +74,13 @@ typedef struct RelationData > > > SubTransactionId rd_createSubid;/* rel was created in current > > > xact */ > > > SubTransactionId rd_newRelfilenodeSubid;/* new relfilenode > > > assigned in > > > > > > * current xact */ > > > + SubTransactionId rd_firstRelfilenodeSubid; /* new relfilenode > > > assigned > > > + > > > * first in current xact */
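To spell out the commit-time decision described above, a very rough sketch could look like the following; the helper name and the threshold are invented here, it is not from any posted patch, and it assumes any dirty shared buffers for the relation have already been flushed so that smgrread() sees the current contents:

/* illustrative sketch only */
#include "postgres.h"

#include "access/xloginsert.h"
#include "storage/bufpage.h"
#include "storage/smgr.h"

#define SYNC_VS_LOG_THRESHOLD   64  /* blocks; placeholder value */

static void
sync_or_log_pending_rel(SMgrRelation srel)
{
    BlockNumber nblocks = smgrnblocks(srel, MAIN_FORKNUM);

    if (nblocks <= SYNC_VS_LOG_THRESHOLD)
    {
        /* Small file: emit one new-page record per block instead of fsyncing. */
        PGAlignedBlock buf;
        BlockNumber blkno;

        for (blkno = 0; blkno < nblocks; blkno++)
        {
            smgrread(srel, MAIN_FORKNUM, blkno, buf.data);
            /* assumes a standard page layout, hence page_std = true */
            log_newpage(&srel->smgr_rnode.node, MAIN_FORKNUM, blkno,
                        (Page) buf.data, true);
        }
    }
    else
        smgrimmedsync(srel, MAIN_FORKNUM);
}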
Re: FETCH FIRST clause PERCENT option
Hi,

At Wed, 7 Aug 2019 10:20:09 +0300, Surafel Temesgen wrote in
> Hi > On Wed, Aug 7, 2019 at 6:11 AM Kyotaro Horiguchi > wrote: > > > > > I have some comments. > > > > This patch uses distinct parameters for exact number and > > percentage. On the othe hand planner has a notion of > > tuple_fraction covering the both. The same handling is also used > > for tuple number estimation. Using the same scheme will make > > things far simpler. See the comment of grouping_planner(). > > > > > Its because of data type difference .In planner the data type is the same

I meant that, with the usage of tuple_fraction, one variable can represent both the absolute limit and the relative limit. This simplifies parse structs.

> > In executor part, addition to LimiteState.position, this patch > > uses LimiteState.backwardPosition to count how many tuples we're > > back from the end of the current tuplestore. But things will get > > simpler by just asking the tuplestore for the number of holding > > tuples. > > > > > backwardPosition hold how many tuple returned in backward scan

I meant that we don't need to hold the number in estate.

> > +slot = node->subSlot; > > > > Why is this needed? The variable is properly set before use and > > the assignment is bogus in the first place. > > > > > its because Tuplestore needs initialized slot.

I meant that the initialized slot is overwritten before first use.

> > The new code block in LIMIT_RESCAN in ExecLimit is useless since > > it is exatctly the same with existing code block. Why didn't you > > use the existing if block? > > > > But they test different scenario

I meant that the two different scenarios can share the code block.

> > >if (node->limitOption == PERCENTAGE) > > +{ > > +node->perExactCount = ceil(node->percent * > > node->position / 100.0); > > + > > + > > > > node->position is the number of tuples returned to upper node so > > far (not the number of tuples this node has read from the lower > > node so far). I don't understand what the expression means. > > > > node->position hold the number of tuples this node has read from the lower > node so far. see LIMIT_RESCAN state

Really? node->position is incremented when tuplestore_gettupleslot_heaptuple() succeeds and returns the tuple to the caller immediately...

> > +if (node->perExactCount == node->perExactCount + 1) > > +node->perExactCount++; > > > > What? The condition never be true. As the result, the following > > if block below won't run. > > > > it became true according to the number of tuple returned in from the lower > node so far > and percentage specification.

Mmm. How do you think X can be equal to (X + 1)?

> > >/* > > + * Return the tuple up to the number of exact count in > > OFFSET > > + * clause without percentage value consideration. > > + */ > > +if (node->perExactCount > 0) > > +{ > > + > > > > > > > > > > +/* > > + * We may needed this tuple in backward scan so put it into > > + * tuplestore. > > + */ > > +if (node->limitOption == PERCENTAGE) > > +{ > > +tuplestore_puttupleslot(node->tupleStore, slot); > > +tuplestore_advance(node->tupleStore, true); > > +} > > > > "needed"->"need" ? The comment says that this is needed for > > backward scan, but it is actually required even in forward > > scan. More to the point, tuplestore_advance lacks comment. > > > ok > > > > > > Anyway, the code in LIMIT_RESCAN is broken in some ways. For > > example, if it is used as the inner scan of a NestLoop, the > > tuplestore accumulates tuples by every scan.
> > You will see that > > the tuplestore stores up to 1000 tuples (10 times of the inner) > > by the following query. > > > > It this because in percentage we scan the whole table

It's useless and rather harmful that the tuplestore holds an indefinite number of duplicate sets of the whole tuples from the lower node. We must reuse the tuples already stored in the tuplestore or clear it before the next round.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
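To make the last point concrete, one way to avoid the accumulation would be to empty the store whenever the node is rescanned, roughly as in the sketch below; node->tupleStore is the field added by the patch under review, and the helper name is only an illustration, not part of the patch:

/* illustration only; not taken from the patch */
#include "postgres.h"

#include "nodes/execnodes.h"
#include "utils/tuplestore.h"

static void
reset_percent_tuplestore(LimitState *node)
{
    /*
     * Discard the previous round's copy of the subplan output, so that a
     * rescan (e.g. as the inner side of a nestloop) does not append yet
     * another duplicate set of tuples to the store.
     */
    if (node->tupleStore != NULL)
        tuplestore_clear(node->tupleStore);
}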
Re: POC: Cleaning up orphaned files using undo logs
On Tue, Aug 20, 2019 at 2:46 AM Andres Freund wrote: > > On 2019-08-19 17:52:24 +0530, Amit Kapila wrote: > > On Sat, Aug 17, 2019 at 10:58 PM Andres Freund wrote: > > > > > > > Well, I don't understand why you're on about this. We've discussed it > > > > a number of times but I'm still confused. > > > > > > There's two reasons here: > > > > > > The primary one in the context here is that if we do *not* have to lock > > > the buffers all ahead of time, we can simplify the interface. We > > > certainly can't lock the buffers over IO (due to buffer reclaim) as > > > we're doing right now, so we'd need another phase, called by the "user" > > > during undo insertion. But if we do not need to lock the buffers before > > > the insertion over all starts, the inserting location doesn't have to > > > care. > > > > > > Secondarily, all the reasoning for needing to lock all buffers ahead of > > > time was imo fairly unconvincing. Following the "recipe" for WAL > > > insertions is a good idea when writing a new run-of-the-mill WAL > > > inserting location - but when writing a new fundamental facility, that > > > already needs to modify how WAL works, then I find that much less > > > convincing. > > > > > > > One point to remember in this regard is that we do need to modify the > > LSN in undo pages after writing WAL, so all the undo pages need to be > > locked by that time or we again need to take the lock on them. > > Well, my main point, which so far has largely been ignored, was that we > may not acquire page locks when we still need to search for victim > buffers later. If we don't need to lock the pages up-front, but only do > so once we're actually copying the records into the undo pages, then we > don't a separate phase to acquire the locks. We can still hold all of > the page locks at the same time, as long as we just acquire them at the > later stage. > Okay, IIUC, this means that we should have a separate phase where we call LockUndoBuffers (or something like that) before InsertPreparedUndo and after PrepareUndoInsert. The LockUndoBuffers will lock all the buffers pinned during PrepareUndoInsert. We can probably call LockUndoBuffers before entering the critical section to avoid any kind of failure in critical section. If so, that sounds reasonable to me. > My secondary point was that *none* of this actually is > documented, even if it's entirely unobvious to the reader that the > relevant code can only run during WAL insertion, due to being pretty far > removed from that. > I think this can be clearly mentioned in README or someplace else. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
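For the archives, the ordering being converged on above can be summarized roughly as follows; LockUndoBuffers is the hypothetical extra phase mentioned in the message, the other undo functions come from the patch series, and their argument lists are omitted here:

/* ordering sketch only */
#include "postgres.h"

#include "miscadmin.h"

static void
undo_insert_ordering_sketch(void)
{
    /* 1. Reserve undo space and pin (but do not lock) the needed buffers. */
    /*    PrepareUndoInsert(...); */

    /*
     * 2. Lock the pinned buffers, still outside the critical section, so
     *    that a failure while locking cannot escalate into a PANIC.
     */
    /*    LockUndoBuffers(...);    -- hypothetical */

    START_CRIT_SECTION();

    /*
     * 3. Copy the prepared records into the locked pages, WAL-log them, and
     *    stamp the undo pages with the WAL record's LSN.
     */
    /*    InsertPreparedUndo(...); XLogInsert(...); PageSetLSN(...); */

    END_CRIT_SECTION();
}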