Re: [PATCH] minor doc fix for create-role
Hi Tom, I agree mostly. It actually does have the words “SQL identifier” in the patch. But you are right it doesn’t link to what a SQL identifier is, but it does provide a practical solution of quoting. That was the part I cared about as a user, I just wanted to solve my problem of an email address as a role name (yes I know that’s sort of dumb as email addresses change). This also addresses the question, why just here, because this was a pain point in the docs for me yesterday :) I also agree your ideal solution is definitely better than what I pushed. But I’m not ready to take that on. If someone else is, I welcome their patch over mine. -Tara — “Rivers know this: there is no hurry. We shall get there some day.” > On Aug 18, 2019, at 9:41 AM, Tom Lane wrote: > > t...@anne.cat writes: >> Attached is a minor patch to fix the name param documentation for create >> role, just adding a direct quote from user-manag.sgml talking about what the >> role name is allowed to be. I was searching for this information and >> figured the reference page should have it as well. > > Hm, I guess my reaction to this proposal is "why just here?". We have > an awful lot of different CREATE commands, and none of them say more > about the target name than this one does. (Not to mention ALTER, DROP, > etc.) Perhaps it's worth adding some boilerplate text to all those > places, but I'm dubious. > > Also, the specific text proposed for addition doesn't seem that helpful, > since it doesn't define which characters are "special characters". > I'd rather see something like "The name must be a valid SQL identifier > as defined in ." But, while that would work fine > in HTML output, it would not be particularly short or useful in man-page > output. > > Perhaps the ideal solution would be further markup on the synopsis > sections that somehow identifies each term as an "identifier" or > other appropriate syntactic category, and provides a hyperlink to > a definition (in output formats that are friendly to that). Seems > like a lot of work though :-( > >regards, tom lane
Re: Fix typos and inconsistencies for HEAD (take 11)
On Mon, Aug 19, 2019 at 07:04:04AM +0300, Alexander Lakhin wrote: > 11.23 TupleLockUpdate -> LockTupleNoKeyExclusive Not sure about this one, so discarded for now. Alvaro? > 11.25 typstore -> tupstore This one is cute. It actually does not cause a compilation failure as tuplestore_donestoring() is a no-op. > 11.33 visca -> vice This one is interesting Latin. > 11.57 PASSBYVALUE -> PASSEDBYVALUE Will fix this one separately and back-patch. -- Michael signature.asc Description: PGP signature
Re: Fix typos and inconsistencies for HEAD (take 11)
On Mon, Aug 19, 2019 at 04:22:44PM +0900, Michael Paquier wrote: > On Mon, Aug 19, 2019 at 07:04:04AM +0300, Alexander Lakhin wrote: >> 11.57 PASSBYVALUE -> PASSEDBYVALUE > > Will fix this one separately and back-patch. No need to, actually, as the error comes from 7bdc655. -- Michael signature.asc Description: PGP signature
Re: FETCH FIRST clause PERCENT option
On 2019-08-19 01:33, Ryan Lambert wrote:
> The following review has been posted through the commitfest application:
>   make installcheck-world:  tested, passed
>   Implements feature:       tested, passed
>   Spec compliant:           not tested
>   Documentation:            tested, passed
>
> The latest patch [1] and the cleanup patch [2] apply cleanly to the latest master (5f110933).
> I reviewed the conversation and don't see any outstanding questions or concerns.
> Updating status to ready for committer.
>
> [1] https://www.postgresql.org/message-id/attachment/103028/percent-incremental-v6.patch
> [2] https://www.postgresql.org/message-id/attachment/103157/percent-incremental-v6-comment-cleanup.patch
>
> Ryan Lambert
>
> The new status of this patch is: Ready for Committer

Hi,

(running with those two patches applied)

   select * from onek where thousand < 5 order by thousand fetch first -1 percent rows only

is correctly caught (with "ERROR: PERCENT must not be negative"), but:

   select * from onek where thousand < 5 order by thousand fetch first 101 percent rows only

is not. It doesn't return, and cannot be Ctrl-C'ed out of, which I guess is another bug?

thanks,
Erik Rijkers
Cleanup isolation specs from unused steps
Hi all, I have been looking at the isolation tests, and we have in some specs steps which are defined but not used in any permutations. In order to detect them, I have been using the attached trick to track which permutations are used. This allows to find immediately any over-engineered spec by generating diffs about steps defined by not used in any permutations. On HEAD, we have six specs entering in this category. Would that be useful? -- Michael diff --git a/src/test/isolation/isolationtester.c b/src/test/isolation/isolationtester.c index f98bb1cf64..402bb1f546 100644 --- a/src/test/isolation/isolationtester.c +++ b/src/test/isolation/isolationtester.c @@ -239,6 +239,17 @@ main(int argc, char **argv) */ run_testspec(testspec); + /* + * Verify that all steps have been used, complaining about anything + * defined but not used. + */ + for (i = 0; i < testspec->nallsteps; i++) + { + if (!testspec->allsteps[i]->used) + fprintf(stderr, "unused step name: %s\n", + testspec->allsteps[i]->name); + } + return 0; } @@ -467,7 +478,11 @@ run_permutation(TestSpec *testspec, int nsteps, Step **steps) printf("\nstarting permutation:"); for (i = 0; i < nsteps; i++) + { + /* Track the permutation as in-use */ + steps[i]->used = true; printf(" %s", steps[i]->name); + } printf("\n"); /* Perform setup */ diff --git a/src/test/isolation/isolationtester.h b/src/test/isolation/isolationtester.h index 7f91e6433f..d9d2a14ecf 100644 --- a/src/test/isolation/isolationtester.h +++ b/src/test/isolation/isolationtester.h @@ -29,6 +29,7 @@ struct Session struct Step { int session; + bool used; char *name; char *sql; char *errormsg; diff --git a/src/test/isolation/specparse.y b/src/test/isolation/specparse.y index fb8a4d706c..2dfe3533ff 100644 --- a/src/test/isolation/specparse.y +++ b/src/test/isolation/specparse.y @@ -145,6 +145,7 @@ step: $$ = pg_malloc(sizeof(Step)); $$->name = $2; $$->sql = $3; +$$->used = false; $$->errormsg = NULL; } ; diff --git a/src/test/isolation/specs/freeze-the-dead.spec b/src/test/isolation/specs/freeze-the-dead.spec index 8c3649902a..915bf15b92 100644 --- a/src/test/isolation/specs/freeze-the-dead.spec +++ b/src/test/isolation/specs/freeze-the-dead.spec @@ -19,7 +19,6 @@ session "s1" step "s1_begin" { BEGIN; } step "s1_update" { UPDATE tab_freeze SET x = x + 1 WHERE id = 3; } step "s1_commit" { COMMIT; } -step "s1_vacuum" { VACUUM FREEZE tab_freeze; } step "s1_selectone" { BEGIN; SET LOCAL enable_seqscan = false; @@ -28,7 +27,6 @@ step "s1_selectone" { COMMIT; } step "s1_selectall" { SELECT * FROM tab_freeze ORDER BY name, id; } -step "s1_reindex" { REINDEX TABLE tab_freeze; } session "s2" step "s2_begin" { BEGIN; } @@ -40,7 +38,6 @@ session "s3" step "s3_begin" { BEGIN; } step "s3_key_share" { SELECT id FROM tab_freeze WHERE id = 3 FOR KEY SHARE; } step "s3_commit" { COMMIT; } -step "s3_vacuum" { VACUUM FREEZE tab_freeze; } # This permutation verifies that a previous bug # https://postgr.es/m/e5711e62-8fdf-4dca-a888-c200bf6b5...@amazon.com diff --git a/src/test/isolation/specs/insert-conflict-do-nothing.spec b/src/test/isolation/specs/insert-conflict-do-nothing.spec index 9b92c35cec..71acc380c7 100644 --- a/src/test/isolation/specs/insert-conflict-do-nothing.spec +++ b/src/test/isolation/specs/insert-conflict-do-nothing.spec @@ -33,7 +33,6 @@ setup step "donothing2" { INSERT INTO ints(key, val) VALUES(1, 'donothing2') ON CONFLICT DO NOTHING; } step "select2" { SELECT * FROM ints; } step "c2" { COMMIT; } -step "a2" { ABORT; } # Regular case where one session block-waits on another to 
determine if it # should proceed with an insert or do nothing. diff --git a/src/test/isolation/specs/insert-conflict-do-update-2.spec b/src/test/isolation/specs/insert-conflict-do-update-2.spec index f5b4f601b5..12f6be8000 100644 --- a/src/test/isolation/specs/insert-conflict-do-update-2.spec +++ b/src/test/isolation/specs/insert-conflict-do-update-2.spec @@ -32,7 +32,6 @@ setup step "insert2" { INSERT INTO upsert(key, payload) VALUES('FOOFOO', 'insert2') ON CONFLICT (lower(key)) DO UPDATE set key = EXCLUDED.key, payload = upsert.payload || ' updated by insert2'; } step "select2" { SELECT * FROM upsert; } step "c2" { COMMIT; } -step "a2" { ABORT; } # One session (session 2) block-waits on another (session 1) to determine if it # should proceed with an insert or update. The user can still usefully UPDATE diff --git a/src/test/isolation/specs/insert-conflict-do-update.spec b/src/test/isolation/specs/insert-conflict-do-update.spec index 5d335a3444..7c8cb47100 100644 --- a/src/test/isolation/specs/insert-conflict-do-update.spec +++ b/src/test/isolation/specs/insert-conflict-do-update.spec @@ -30,7 +30,6 @@ setup step "insert2" { INSERT INTO upsert(key, val) VALUES(1, 'insert2') ON CONFLICT (key) DO UPDATE set val = upsert.val || ' updated by insert2'; } step "select2" { SELECT *
Re: pgsql: doc: Add some images
On 2019-08-16 22:00, Alvaro Herrera wrote: > Now when I test Jürgen's new proposed image genetic-algorithm I find > that this stuff doesn't work in VPATH builds, at least for PDF -- I > don't get a build failure, but instead I get just a section title that > doesn't precede any actual image. (There's a very small warning hidden > in the tons of other fop output). If I edit the .fo file by hand to > make the path to .svg absolute, the image appears correctly. fixed (I'm puzzled that one can't tell FOP to abort on an error like this. The ASF JIRA is down right now, so I can't research this, but I'm tempted to file a bug about this.) -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Global temporary tables
On 18.08.2019 11:28, Pavel Stehule wrote:

On Sun, 18. 8. 2019 at 9:02, Konstantin Knizhnik <k.knizh...@postgrespro.ru> wrote:

On 16.08.2019 20:17, Pavel Stehule wrote:

On Fri, 16. 8. 2019 at 16:12, Konstantin Knizhnik <k.knizh...@postgrespro.ru> wrote:

I did more investigation of the performance of global temp tables with shared buffers vs. vanilla (local) temp tables.

1. Combination of persistent and temporary tables in the same query.

Preparation:
create table big(pk bigint primary key, val bigint);
insert into big values (generate_series(1,1),generate_series(1,1));
create temp table lt(key bigint, count bigint);
create global temp table gt(key bigint, count bigint);

The size of the table is about 6Gb. I ran this test on a desktop with 16GB of RAM and postgres with 1Gb shared buffers. I ran two queries:

insert into T (select count(*),pk/P as key from big group by key);
select sum(count) from T;

where P is (100,10,1) and T is the name of the temp table (lt or gt). The table below contains the times of both queries (insert / select) in msec:

                     Percent of selected data
                     1%             10%            100%
Local temp table     44610 / 90     47920 / 891    63414 / 21612
Global temp table    44669 / 35     47939 / 298    59159 / 26015

As you can see, the insertion time into the temporary table is almost the same, and the traversal time of the temporary table is about twice smaller for the global temp table when it fits in RAM together with the persistent table, and slightly worse when it doesn't fit.

2. Temporary table only access.
The same system, but Postgres is configured with shared_buffers=10GB, max_parallel_workers = 4, max_parallel_workers_per_gather = 4

Local temp tables:
create temp table local_temp(x1 bigint, x2 bigint, x3 bigint, x4 bigint, x5 bigint, x6 bigint, x7 bigint, x8 bigint, x9 bigint);
insert into local_temp values (generate_series(1,1),0,0,0,0,0,0,0,0);
select sum(x1) from local_temp;

Global temp tables:
create global temporary table global_temp(x1 bigint, x2 bigint, x3 bigint, x4 bigint, x5 bigint, x6 bigint, x7 bigint, x8 bigint, x9 bigint);
insert into global_temp values (generate_series(1,1),0,0,0,0,0,0,0,0);
select sum(x1) from global_temp;

Results (msec):

                     Insert    Select
Local temp table     37489     48322
Global temp table    44358     3003

So insertion into the local temp table is performed slightly faster, but select is 16 times slower!

Conclusion: assuming the temp table fits in memory, global temp tables with shared buffers provide better performance than local temp tables. I didn't consider global temp tables with local buffers here, because for them the results should be similar to local temp tables.

Probably there is not a reason why shared buffers should be slower than local buffers when the system is under low load. Access to shared memory is protected by spin locks (cheap for a few processes), so tests in one or a few processes are not too important (or they are just one side of the space).

Another topic can be performance on MS systems - there are stories about not perfect performance of shared memory there.

Regards
Pavel

One more test, used to simulate access to temp tables under high load. I am using "upsert" into a temp table in multiple connections.
create global temp table gtemp (x integer primary key, y bigint);

upsert.sql:
insert into gtemp values (random() * 100, 0) on conflict(x) do update set y=gtemp.y+1;

pgbench -c 10 -M prepared -T 100 -P 1 -n -f upsert.sql postgres

I failed to find a standard way in pgbench to perform per-session initialization to create the local temp table, so I just inserted this code into pgbench:

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 570cf33..af6a431 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -5994,6 +5994,7 @@ threadRun(void *arg)
 {
     if ((state[i].con = doConnect()) == NULL)
         goto done;
+    executeStatement(state[i].con, "create temp table ltemp(x integer primary key, y bigint)");
 }
 }

Results are the following:
Global temp table: 117526 TPS
Re: Duplicated LSN in ReorderBuffer
On Tue, Aug 13, 2019 at 6:36 AM Alvaro Herrera wrote: > > On 2019-Aug-07, Andres Freund wrote: > > > I think we would need to do this for all values of > > SnapBuildCurrentState() - after all the problem occurs because we > > *previously* didn't assign subxids to the toplevel xid. Compared to the > > cost of catalog changes, ReorderBufferAssignChild() is really cheap. So > > I don't think there's any problem just calling it unconditionally (when > > top_xid <> xid, of course). > > BTW I wrote the code as suggested and it passes all the tests ... but I > then noticed that the unpatched code doesn't fail Ildar's original > pgbench-based test for me, either. So maybe my laptop is not powerful > enough to reproduce it, or maybe I'm doing something wrong. If you share the patch fixing this issue I'll test it on my environment where I could reproduce the original problem. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: FETCH FIRST clause PERCENT option
Hi Erik On Mon, Aug 19, 2019 at 10:47 AM Erik Rijkers wrote: > Hi, > > (running with those two patches applied) > >select * from onek where thousand < 5 order by thousand fetch first -1 > percent rows only > is correctly caught (with "ERROR: PERCENT must not be negative") but: > >select * from onek where thousand < 5 order by thousand fetch first > 101 percent rows only > is not. It doesn't return, and cannot be Ctrl-C'ed out of, which I guess > is another bug? > Fixed regards Surafel diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c index 1759b9e1b6..fb9c2f5319 100644 --- a/contrib/postgres_fdw/postgres_fdw.c +++ b/contrib/postgres_fdw/postgres_fdw.c @@ -3035,7 +3035,8 @@ estimate_path_cost_size(PlannerInfo *root, if (fpextra && fpextra->has_limit) { adjust_limit_rows_costs(&rows, &startup_cost, &total_cost, - fpextra->offset_est, fpextra->count_est); + fpextra->offset_est, fpextra->count_est, + EXACT_NUMBER); retrieved_rows = rows; } } diff --git a/doc/src/sgml/ref/select.sgml b/doc/src/sgml/ref/select.sgml index 06d611b64c..4ef42df51d 100644 --- a/doc/src/sgml/ref/select.sgml +++ b/doc/src/sgml/ref/select.sgml @@ -44,7 +44,7 @@ SELECT [ ALL | DISTINCT [ ON ( expressionexpression [ ASC | DESC | USING operator ] [ NULLS { FIRST | LAST } ] [, ...] ] [ LIMIT { count | ALL } ] [ OFFSET start [ ROW | ROWS ] ] -[ FETCH { FIRST | NEXT } [ count ] { ROW | ROWS } ONLY ] +[ FETCH { FIRST | NEXT } [ count ] [ PERCENT ] { ROW | ROWS } ONLY ] [ FOR { UPDATE | NO KEY UPDATE | SHARE | KEY SHARE } [ OF table_name [, ...] ] [ NOWAIT | SKIP LOCKED ] [...] ] where from_item can be one of: @@ -1430,7 +1430,7 @@ OFFSET start which PostgreSQL also supports. It is: OFFSET start { ROW | ROWS } -FETCH { FIRST | NEXT } [ count ] { ROW | ROWS } ONLY +FETCH { FIRST | NEXT } [ count ] [ PERCENT ] { ROW | ROWS } ONLY In this syntax, the start or count value is required by @@ -1440,8 +1440,10 @@ FETCH { FIRST | NEXT } [ count ] { ambiguity. If count is omitted in a FETCH clause, it defaults to 1. -ROW -and ROWS as well as FIRST +With PERCENT count specifies the maximum number of rows to return +in percentage round up to the nearest integer. Using PERCENT +is best suited to returning single-digit percentages of the query's total row count. +ROW and ROWS as well as FIRST and NEXT are noise words that don't influence the effects of these clauses. According to the standard, the OFFSET clause must come diff --git a/src/backend/executor/nodeLimit.c b/src/backend/executor/nodeLimit.c index baa669abe8..7a9626d7f9 100644 --- a/src/backend/executor/nodeLimit.c +++ b/src/backend/executor/nodeLimit.c @@ -21,6 +21,8 @@ #include "postgres.h" +#include + #include "executor/executor.h" #include "executor/nodeLimit.h" #include "miscadmin.h" @@ -52,6 +54,7 @@ ExecLimit(PlanState *pstate) */ direction = node->ps.state->es_direction; outerPlan = outerPlanState(node); + slot = node->subSlot; /* * The main logic is a simple state machine. @@ -81,7 +84,15 @@ ExecLimit(PlanState *pstate) /* * Check for empty window; if so, treat like empty subplan. 
*/ - if (node->count <= 0 && !node->noCount) + if (node->limitOption == PERCENTAGE) + { +if (node->percent == 0.0) +{ + node->lstate = LIMIT_EMPTY; + return NULL; +} + } + else if (node->count <= 0 && !node->noCount) { node->lstate = LIMIT_EMPTY; return NULL; @@ -102,11 +113,32 @@ ExecLimit(PlanState *pstate) node->lstate = LIMIT_EMPTY; return NULL; } + +/* + * Count the number of rows to return in exact number so far + */ +if (node->limitOption == PERCENTAGE) +{ + node->perExactCount = ceil(node->percent * node->position / 100.0); + + if (node->perExactCount == node->perExactCount + 1) + node->perExactCount++; +} node->subSlot = slot; if (++node->position > node->offset) break; } + /* + * We may needed this tuple in subsequent scan so put it into + * tuplestore. + */ + if (node->limitOption == PERCENTAGE) + { +tuplestore_puttupleslot(node->tupleStore, slot); +tuplestore_advance(node->tupleStore, true); + } + /* * Okay, we have the first tuple of the window. */ @@ -124,6 +156,106 @@ ExecLimit(PlanState *pstate) case LIMIT_INWINDOW: if (ScanDirectionIsForward(direction)) { +/* + * In case of coming back from backward scan the tuple is + * already in tuple store. + */ +if (node->limitOption == PERCENTAGE && node->backwardPosition > 0) +{ + if (tuplestore_gettupleslot_heaptuple(node->tupleStore, true, true, slot)) + { + node->subSlot = slot; + node->position++; + node->backwardPosition--; + return slot; + } + else + { + node->lstate = LIMIT_SUBPLANEOF; +
Re: FETCH FIRST clause PERCENT option
On 2019-08-19 11:18, Surafel Temesgen wrote: [..] [percent-incremental-v7.patch] Thanks. Another little thing, not sure it's a bug: limit interprets its argument by rounding up or down as one would expect: table onek limit 10.4; --> gives 10 rows table onek limit 10.6; --> gives 11 rows but FETCH count PERCENT does not do that; it rounds always up. select * from (table onek limit 100) f fetch first 10.4 percent rows only; --> gives 11 rows select * from (table onek limit 100) f fetch first 10.6 percent rows only; --> gives 11 rows I see that it's documented in the .sgml to behave as it does, but it seems wrong to me; shouldn't that 10.4-percent-query yield 10 rows instead of 11? thanks, Erik Rijkers
Re: Rearranging ALTER TABLE to avoid multi-operations bugs
> This review seems not very on-point, because I made no claim to have fixed
> any of those bugs. The issue at the moment is how to structure the code

I am sorry for that, and I have another question now. I researched the related code and found the following:

Code:
~~
case AT_AddIdentity:
{
    ...
    attnum = get_attnum(relid, cmd->name);

    /*
     * if attribute not found, something will error about it
     * later
     */
    if (attnum != InvalidAttrNumber)
        generateSerialExtraStmts(&cxt, newdef,
                                 get_atttype(relid, attnum),
                                 def->options, true, NULL, NULL);
    ...
}
~~

Test case 1:
create table t10 (f1 int);
alter table t10 add column f2 int not null, alter column f2 add generated always as identity;

When I run test case 1, the value of 'attnum' is 0 because the 'f2' column does not exist yet, so no sequence can be generated (generateSerialExtraStmts() is never reached). The code comment says 'something will error about it later', so I think this should report an error instead of executing successfully.

Test case 2:
create table t11 (f1 int);
alter table t11 add column f2 int, alter column f2 type int8;

The code for 'alter column type' has the same comment, and if you run test case 2 you do get an error report. I use test case 2 to argue that test case 1 should likewise report an error instead of executing successfully.

--
Movead.Li

The new status of this patch is: Waiting on Author
Re: [HACKERS] WAL logging problem in 9.4.3?
Thank you for taking time. At Sat, 17 Aug 2019 20:52:30 -0700, Noah Misch wrote in <20190818035230.gb3021...@rfd.leadboat.com> > For two-phase commit, PrepareTransaction() needs to execute pending syncs. Now TwoPhaseFileHeader has two new members for (commit-time) pending syncs. Pending-syncs are useless on wal-replay, but that is needed for commit-prepared. > On Thu, Jul 25, 2019 at 10:39:36AM +0900, Kyotaro Horiguchi wrote: > > --- a/src/backend/access/heap/heapam_handler.c > > +++ b/src/backend/access/heap/heapam_handler.c > > @@ -715,12 +702,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, > > Relation NewHeap, ... > > - use_wal = XLogIsNeeded() && RelationNeedsWAL(NewHeap); > > - > > /* use_wal off requires smgr_targblock be initially invalid */ > > Assert(RelationGetTargetBlock(NewHeap) == InvalidBlockNumber); > > Since you're deleting the use_wal variable, update that last comment. Oops. Rewrote it. > > --- a/src/backend/catalog/storage.c > > +++ b/src/backend/catalog/storage.c > > @@ -428,21 +450,34 @@ smgrDoPendingDeletes(bool isCommit) ... > > + smgrimmedsync(srel, MAIN_FORKNUM); > > This should sync all forks, not just MAIN_FORKNUM. Code that writes WAL for > FSM_FORKNUM and VISIBILITYMAP_FORKNUM checks RelationNeedsWAL(). There may be > no bug today, but it's conceptually wrong to make RelationNeedsWAL() return > false due to this code, use RelationNeedsWAL() for multiple forks, and then > not actually sync all forks. I agree that all forks needs syncing, but FSM and VM are checking RelationNeedsWAL(modified). To make sure, are you suggesting to sync all forks instead of emitting WAL for them, or suggesting that VM and FSM to emit WALs even when the modified RelationNeedsWAL returns false (+ sync all forks)? > The https://postgr.es/m/559fa0ba.3080...@iki.fi design had another component > not appearing here. It said, "Instead, at COMMIT, we'd fsync() the relation, > or if it's smaller than some threshold, WAL-log the contents of the whole file > at that point." Please write the part to WAL-log the contents of small files > instead of syncing them. I'm not sure the point of the behavior. I suppose that the "log" is a sequence of new_page records. It also needs to be synced and it is always larger than the file to be synced. I can't think of an appropriate threshold without the point. > > --- a/src/backend/commands/copy.c > > +++ b/src/backend/commands/copy.c > > @@ -2725,28 +2722,9 @@ CopyFrom(CopyState cstate) > > * If it does commit, we'll have done the table_finish_bulk_insert() at > > * the bottom of this routine first. > > * > > -* As mentioned in comments in utils/rel.h, the in-same-transaction test > > -* is not always set correctly, since in rare cases > > rd_newRelfilenodeSubid > > -* can be cleared before the end of the transaction. The exact case is > > -* when a relation sets a new relfilenode twice in same transaction, yet > > -* the second one fails in an aborted subtransaction, e.g. > > -* > > -* BEGIN; > > -* TRUNCATE t; > > -* SAVEPOINT save; > > -* TRUNCATE t; > > -* ROLLBACK TO save; > > -* COPY ... > > The comment material being deleted is still correct, so don't delete it. > Moreover, the code managing rd_firstRelfilenodeSubid has a similar bug. The > attached patch adds an assertion that RelationNeedsWAL() and the > pendingDeletes array have the same opinion about the relfilenode, and it > expands a test case to fail that assertion. (Un?)Fortunately, that doesn't fail.. (with rebased version on the recent master) I'll recheck that tomorrow. 
> > --- a/src/include/utils/rel.h > > +++ b/src/include/utils/rel.h > > @@ -74,11 +74,13 @@ typedef struct RelationData > > SubTransactionId rd_createSubid;/* rel was created in current > > xact */ > > SubTransactionId rd_newRelfilenodeSubid;/* new relfilenode > > assigned in > > > > * current xact */ > > + SubTransactionId rd_firstRelfilenodeSubid; /* new relfilenode > > assigned > > + > > * first in current xact */ > > In general, to add a field like this, run "git grep -n 'rd_.*Subid'" and audit > all the lines printed. Many bits of code need to look at all three, > e.g. RelationClose(). Agreed. I'll recheck that. > This field needs to be 100% reliable. In other words, > it must equal InvalidSubTransactionId if and only if the relfilenode matches > the relfilenode that would be in place if the top transaction rolled back. I don't get this. I think the variable moves as you suggested. It is handled same way with fd_new* in AtEOSubXact_cleanup but the difference is in assignment but rollback. rd_fist* won't change after the first assignment so rollback of the subid means relfi
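To make the fork-sync point above concrete, here is a minimal sketch, not taken from the posted patch, of a commit-time sync that covers every fork rather than just the main fork, using the existing smgrexists()/smgrimmedsync() calls; the function name and its placement in the pending-sync machinery are hypothetical:

#include "common/relpath.h"		/* ForkNumber, MAX_FORKNUM */
#include "storage/smgr.h"		/* smgrexists(), smgrimmedsync() */

/*
 * Sketch only: flush every existing fork of a pending-sync relation to disk
 * at commit.  The surrounding pending-sync bookkeeping is omitted.
 */
static void
sync_all_forks(SMgrRelation srel)
{
	ForkNumber	fork;

	for (fork = MAIN_FORKNUM; fork <= MAX_FORKNUM; fork++)
	{
		if (smgrexists(srel, fork))
			smgrimmedsync(srel, fork);
	}
}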
Re: Global temporary tables
On 19.08.2019 11:51, Konstantin Knizhnik wrote: On 18.08.2019 11:28, Pavel Stehule wrote: ne 18. 8. 2019 v 9:02 odesílatel Konstantin Knizhnik mailto:k.knizh...@postgrespro.ru>> napsal: On 16.08.2019 20:17, Pavel Stehule wrote: pá 16. 8. 2019 v 16:12 odesílatel Konstantin Knizhnik mailto:k.knizh...@postgrespro.ru>> napsal: I did more investigations of performance of global temp tables with shared buffers vs. vanilla (local) temp tables. 1. Combination of persistent and temporary tables in the same query. Preparation: create table big(pk bigint primary key, val bigint); insert into big values (generate_series(1,1),generate_series(1,1)); create temp table lt(key bigint, count bigint); create global temp table gt(key bigint, count bigint); Size of table is about 6Gb, I run this test on desktop with 16GB of RAM and postgres with 1Gb shared buffers. I run two queries: insert into T (select count(*),pk/P as key from big group by key); select sum(count) from T; where P is (100,10,1) and T is name of temp table (lt or gt). The table below contains times of both queries in msec: Percent of selected data 1% 10% 100% Local temp table 44610 90 47920 891 63414 21612 Global temp table 44669 35 47939 298 59159 26015 As you can see, time of insertion in temporary table is almost the same and time of traversal of temporary table is about twice smaller for global temp table when it fits in RAM together with persistent table and slightly worser when it doesn't fit. 2. Temporary table only access. The same system, but Postgres is configured with shared_buffers=10GB, max_parallel_workers = 4, max_parallel_workers_per_gather = 4 Local temp tables: create temp table local_temp(x1 bigint, x2 bigint, x3 bigint, x4 bigint, x5 bigint, x6 bigint, x7 bigint, x8 bigint, x9 bigint); insert into local_temp values (generate_series(1,1),0,0,0,0,0,0,0,0); select sum(x1) from local_temp; Global temp tables: create global temporary table global_temp(x1 bigint, x2 bigint, x3 bigint, x4 bigint, x5 bigint, x6 bigint, x7 bigint, x8 bigint, x9 bigint); insert into global_temp values (generate_series(1,1),0,0,0,0,0,0,0,0); select sum(x1) from global_temp; Results (msec): Insert Select Local temp table37489 48322 Global temp table 44358 3003 So insertion in local temp table is performed slightly faster but select is 16 times slower! Conclusion: In the assumption then temp table fits in memory, global temp tables with shared buffers provides better performance than local temp table. I didn't consider here global temp tables with local buffers because for them results should be similar with local temp tables. Probably there is not a reason why shared buffers should be slower than local buffers when system is under low load. access to shared memory is protected by spin locks (are cheap for few processes), so tests in one or few process are not too important (or it is just one side of space) another topic can be performance on MS Sys - there are stories about not perfect performance of shared memory there. Regards Pavel One more test which is used to simulate access to temp tables under high load. I am using "upsert" into temp table in multiple connections. 
create global temp table gtemp (x integer primary key, y bigint); upsert.sql: insert into gtemp values (random() * 100, 0) on conflict(x) do update set y=gtemp.y+1; pgbench -c 10 -M prepared -T 100 -P 1 -n -f upsert.sql postgres I failed to find some standard way in pgbech to perform per-session initialization to create local temp table, so I just insert this code in pgbench code: diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c index 570cf33..af6a431 100644 --- a/src/bin/pgbench/pgbench.c +++ b/src/bin/pgbench/pgbench.c @@ -5994,6 +5994,7 @@ threadRun(void *arg) { if ((state[i].con = doConnect()) == NULL) goto done; + executeStatement(state[i].con, "create temp table ltemp(x integer primary key, y bigint)"); } } Results a
Re: Global temporary tables
po 19. 8. 2019 v 13:16 odesílatel Konstantin Knizhnik < k.knizh...@postgrespro.ru> napsal: > > > On 19.08.2019 11:51, Konstantin Knizhnik wrote: > > > > On 18.08.2019 11:28, Pavel Stehule wrote: > > > > ne 18. 8. 2019 v 9:02 odesílatel Konstantin Knizhnik < > k.knizh...@postgrespro.ru> napsal: > >> >> >> On 16.08.2019 20:17, Pavel Stehule wrote: >> >> >> >> pá 16. 8. 2019 v 16:12 odesílatel Konstantin Knizhnik < >> k.knizh...@postgrespro.ru> napsal: >> >>> I did more investigations of performance of global temp tables with >>> shared buffers vs. vanilla (local) temp tables. >>> >>> 1. Combination of persistent and temporary tables in the same query. >>> >>> Preparation: >>> create table big(pk bigint primary key, val bigint); >>> insert into big values >>> (generate_series(1,1),generate_series(1,1)); >>> create temp table lt(key bigint, count bigint); >>> create global temp table gt(key bigint, count bigint); >>> >>> Size of table is about 6Gb, I run this test on desktop with 16GB of RAM >>> and postgres with 1Gb shared buffers. >>> I run two queries: >>> >>> insert into T (select count(*),pk/P as key from big group by key); >>> select sum(count) from T; >>> >>> where P is (100,10,1) and T is name of temp table (lt or gt). >>> The table below contains times of both queries in msec: >>> >>> Percent of selected data >>> 1% >>> 10% >>> 100% >>> Local temp table >>> 44610 >>> 90 >>> 47920 >>> 891 >>> 63414 >>> 21612 >>> Global temp table >>> 44669 >>> 35 >>> 47939 >>> 298 >>> 59159 >>> 26015 >>> >>> As you can see, time of insertion in temporary table is almost the same >>> and time of traversal of temporary table is about twice smaller for >>> global temp table >>> when it fits in RAM together with persistent table and slightly worser >>> when it doesn't fit. >>> >>> >>> >>> 2. Temporary table only access. >>> The same system, but Postgres is configured with shared_buffers=10GB, >>> max_parallel_workers = 4, max_parallel_workers_per_gather = 4 >>> >>> Local temp tables: >>> create temp table local_temp(x1 bigint, x2 bigint, x3 bigint, x4 bigint, >>> x5 bigint, x6 bigint, x7 bigint, x8 bigint, x9 bigint); >>> insert into local_temp values >>> (generate_series(1,1),0,0,0,0,0,0,0,0); >>> select sum(x1) from local_temp; >>> >>> Global temp tables: >>> create global temporary table global_temp(x1 bigint, x2 bigint, x3 >>> bigint, x4 bigint, x5 bigint, x6 bigint, x7 bigint, x8 bigint, x9 bigint); >>> insert into global_temp values >>> (generate_series(1,1),0,0,0,0,0,0,0,0); >>> select sum(x1) from global_temp; >>> >>> Results (msec): >>> >>> Insert >>> Select >>> Local temp table 37489 >>> 48322 >>> Global temp table 44358 >>> 3003 >>> >>> So insertion in local temp table is performed slightly faster but select >>> is 16 times slower! >>> >>> Conclusion: >>> In the assumption then temp table fits in memory, global temp tables >>> with shared buffers provides better performance than local temp table. >>> I didn't consider here global temp tables with local buffers because for >>> them results should be similar with local temp tables. >>> >> >> Probably there is not a reason why shared buffers should be slower than >> local buffers when system is under low load. >> >> access to shared memory is protected by spin locks (are cheap for few >> processes), so tests in one or few process are not too important (or it is >> just one side of space) >> >> another topic can be performance on MS Sys - there are stories about not >> perfect performance of shared memory there. 
>> >> Regards >> >> Pavel >> >> One more test which is used to simulate access to temp tables under high >> load. >> I am using "upsert" into temp table in multiple connections. >> >> create global temp table gtemp (x integer primary key, y bigint); >> >> upsert.sql: >> insert into gtemp values (random() * 100, 0) on conflict(x) do update >> set y=gtemp.y+1; >> >> pgbench -c 10 -M prepared -T 100 -P 1 -n -f upsert.sql postgres >> >> >> I failed to find some standard way in pgbech to perform per-session >> initialization to create local temp table, >> so I just insert this code in pgbench code: >> >> diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c >> index 570cf33..af6a431 100644 >> --- a/src/bin/pgbench/pgbench.c >> +++ b/src/bin/pgbench/pgbench.c >> @@ -5994,6 +5994,7 @@ threadRun(void *arg) >> { >> if ((state[i].con = doConnect()) == NULL) >> goto done; >> + executeStatement(state[i].con, "create temp table >> ltemp(x integer primary key, y bigint)"); >> } >> } >> >> >> Results are the following: >> Global temp table: 117526 TPS >> Local temp table: 107802 TPS >> >> >> So even for this workload global temp table with shared buffers are a >> little bit faster. >> I will be pleased if you can propose some other testing scenario. >> > > please, try to increase number of connections. >
Re: FETCH FIRST clause PERCENT option
On Mon, Aug 19, 2019 at 1:55 PM Erik Rijkers wrote: > Another little thing, not sure it's a bug: > > limit interprets its argument by rounding up or down as one would > expect: > > table onek limit 10.4; --> gives 10 rows > table onek limit 10.6; --> gives 11 rows > > but FETCH count PERCENT does not do that; it rounds always up. > > select * from (table onek limit 100) f fetch first 10.4 percent rows > only; --> gives 11 rows > select * from (table onek limit 100) f fetch first 10.6 percent rows > only; --> gives 11 rows > > I see that it's documented in the .sgml to behave as it does, but it > seems wrong to me; > shouldn't that 10.4-percent-query yield 10 rows instead of 11? > > According to sql standard we have to use the result ceiling value regards Surafel
Re: Global temporary tables
On 19.08.2019 14:25, Pavel Stehule wrote: po 19. 8. 2019 v 13:16 odesílatel Konstantin Knizhnik mailto:k.knizh...@postgrespro.ru>> napsal: On 19.08.2019 11:51, Konstantin Knizhnik wrote: On 18.08.2019 11:28, Pavel Stehule wrote: ne 18. 8. 2019 v 9:02 odesílatel Konstantin Knizhnik mailto:k.knizh...@postgrespro.ru>> napsal: On 16.08.2019 20:17, Pavel Stehule wrote: pá 16. 8. 2019 v 16:12 odesílatel Konstantin Knizhnik mailto:k.knizh...@postgrespro.ru>> napsal: I did more investigations of performance of global temp tables with shared buffers vs. vanilla (local) temp tables. 1. Combination of persistent and temporary tables in the same query. Preparation: create table big(pk bigint primary key, val bigint); insert into big values (generate_series(1,1),generate_series(1,1)); create temp table lt(key bigint, count bigint); create global temp table gt(key bigint, count bigint); Size of table is about 6Gb, I run this test on desktop with 16GB of RAM and postgres with 1Gb shared buffers. I run two queries: insert into T (select count(*),pk/P as key from big group by key); select sum(count) from T; where P is (100,10,1) and T is name of temp table (lt or gt). The table below contains times of both queries in msec: Percent of selected data 1% 10% 100% Local temp table 44610 90 47920 891 63414 21612 Global temp table 44669 35 47939 298 59159 26015 As you can see, time of insertion in temporary table is almost the same and time of traversal of temporary table is about twice smaller for global temp table when it fits in RAM together with persistent table and slightly worser when it doesn't fit. 2. Temporary table only access. The same system, but Postgres is configured with shared_buffers=10GB, max_parallel_workers = 4, max_parallel_workers_per_gather = 4 Local temp tables: create temp table local_temp(x1 bigint, x2 bigint, x3 bigint, x4 bigint, x5 bigint, x6 bigint, x7 bigint, x8 bigint, x9 bigint); insert into local_temp values (generate_series(1,1),0,0,0,0,0,0,0,0); select sum(x1) from local_temp; Global temp tables: create global temporary table global_temp(x1 bigint, x2 bigint, x3 bigint, x4 bigint, x5 bigint, x6 bigint, x7 bigint, x8 bigint, x9 bigint); insert into global_temp values (generate_series(1,1),0,0,0,0,0,0,0,0); select sum(x1) from global_temp; Results (msec): Insert Select Local temp table37489 48322 Global temp table 44358 3003 So insertion in local temp table is performed slightly faster but select is 16 times slower! Conclusion: In the assumption then temp table fits in memory, global temp tables with shared buffers provides better performance than local temp table. I didn't consider here global temp tables with local buffers because for them results should be similar with local temp tables. Probably there is not a reason why shared buffers should be slower than local buffers when system is under low load. access to shared memory is protected by spin locks (are cheap for few processes), so tests in one or few process are not too important (or it is just one side of space) another topic can be performance on MS Sys - there are stories about not perfect performance of shared memory there. Regards Pavel One more test which is used to simulate access to temp tables under high load. I am using "upsert" into temp table in multiple connections. 
create global temp table gtemp (x integer primary key, y bigint); upsert.sql: insert into gtemp values (random() * 100, 0) on conflict(x) do update set y=gtemp.y+1; pgbench -c 10 -M prepared -T 100 -P 1 -n -f upsert.sql postgres I failed to find some standard way in pgbech to perform per-session initialization to create local temp table, so I just insert this code in pgbench code: diff --git a/
Re: POC: Cleaning up orphaned files using undo logs
On Sat, Aug 17, 2019 at 10:58 PM Andres Freund wrote: > > Hi, > > On 2019-08-17 12:05:21 -0400, Robert Haas wrote: > > On Wed, Aug 14, 2019 at 12:39 PM Andres Freund wrote: > > > > > Again, I think it's not ok to just assume you can lock an essentially > > > > > unbounded number of buffers. This seems almost guaranteed to result in > > > > > deadlocks. And there's limits on how many lwlocks one can hold etc. > > > > > > > > I think for controlling that we need to put a limit on max prepared > > > > undo? I am not sure any other way of limiting the number of buffers > > > > because we must lock all the buffer in which we are going to insert > > > > the undo record under one WAL logged operation. > > > > > > I heard that a number of times. But I still don't know why that'd > > > actually be true. Why would it not be sufficient to just lock the buffer > > > currently being written to, rather than all buffers? It'd require a bit > > > of care updating the official current "logical end" of a log, but > > > otherwise ought to not be particularly hard? Only one backend can extend > > > the log after all, and until the log is externally visibily extended, > > > nobody can read or write those buffers, no? > > > > Well, I don't understand why you're on about this. We've discussed it > > a number of times but I'm still confused. > > There's two reasons here: > > The primary one in the context here is that if we do *not* have to lock > the buffers all ahead of time, we can simplify the interface. We > certainly can't lock the buffers over IO (due to buffer reclaim) as > we're doing right now, so we'd need another phase, called by the "user" > during undo insertion. But if we do not need to lock the buffers before > the insertion over all starts, the inserting location doesn't have to > care. > > Secondarily, all the reasoning for needing to lock all buffers ahead of > time was imo fairly unconvincing. Following the "recipe" for WAL > insertions is a good idea when writing a new run-of-the-mill WAL > inserting location - but when writing a new fundamental facility, that > already needs to modify how WAL works, then I find that much less > convincing. > One point to remember in this regard is that we do need to modify the LSN in undo pages after writing WAL, so all the undo pages need to be locked by that time or we again need to take the lock on them. > > > 1. It's absolutely fine to just put a limit on this, because the > > higher-level facilities that use this shouldn't be doing a single > > WAL-logged operation that touches a zillion buffers. Right, by default a WAL log can only cover 4 buffers. If we need to touch more buffers, then the caller needs to call XLogEnsureRecordSpace. So, I agree with the point that generally, it should be few buffers (2 or 3) of undo that need to be touched in a single operation and if there are more, either callers need to change or at the very least they need to be careful about the same. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
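For reference, a hedged sketch (the function name and record contents are placeholders, not code from the undo patches) of how a caller that must register more than the default number of buffers in a single WAL record would arrange that with XLogEnsureRecordSpace(); marking the buffers dirty and stamping their page LSNs with the returned location is omitted:

#include "access/xloginsert.h"

/*
 * Sketch only: WAL-log one operation touching nbuffers buffers.  The default
 * insertion workspace only allows a handful of registered blocks, so the
 * limit is widened up front (this must happen before XLogBeginInsert()).
 */
static XLogRecPtr
log_multi_buffer_op(Buffer *buffers, int nbuffers, RmgrId rmid, uint8 info)
{
	int			i;

	XLogEnsureRecordSpace(nbuffers, 0);

	XLogBeginInsert();
	for (i = 0; i < nbuffers; i++)
		XLogRegisterBuffer(i, buffers[i], REGBUF_STANDARD);

	return XLogInsert(rmid, info);
}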
Re: (select query)/relation as first class citizen
Hi John, I think you've outlined the problem and possible solutions quite well. It's great to see that the goal might not be that far from being implemented.
psql \r command is working?
Hello

Does the psql command \r actually work? I expected that \r followed by \e would start the editor without the text of the last query.

LANG=C EDITOR=cat psql
psql (11.5 (Debian 11.5-1.pgdg100+1))
Type "help" for help.

melkij=> select 1;
 ?column?
----------
        1
(1 row)

melkij=> \r
Query buffer reset (cleared).
melkij=> \e
select 1;
 ?column?
----------
        1
(1 row)

The buffer is still there, with normal external editors too.

Same test on REL9_6_STABLE:

postgres=# select 1;
 ?column?
----------
        1
(1 row)

postgres=# \r
Query buffer reset (cleared).
postgres=# \e

The buffer was actually cleared. I see this behavior change on versions >= 10.

regards,
Sergei
Re: Unused header file inclusion
On 2019-Aug-18, Tom Lane wrote: > I wrote: > > (My headerscheck script is missing that header; I need to update it to > > match the latest version of cpluspluscheck.) > > I did that, and ended up with the attached. I'm rather tempted to stick > this into src/tools/ alongside cpluspluscheck, because it seems to find > rather different trouble spots than cpluspluscheck does. Thoughts? Yeah, let's include this. I've written its equivalent a couple of times already. (My strategy is just to compile the .h file directly though, which creates a .gch file, rather than writing a temp .c file.) -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: psql \r command is working?
Sergei Kornilov writes: > Is psql command \r actually works? I expected \r and then \e command should > start editor without text of last query. \r clears the current query buffer. \e without an argument is defined to edit the current query buffer, or if that is empty, the last query. I don't see anything particularly wrong here. > I see this behavior change on >= 10 versions Possibly fixed as a side effect of e984ef586. regards, tom lane
Re: psql \r command is working?
so this is expected behavior, ok, thank you! regards, Sergei
Re: Duplicated LSN in ReorderBuffer
On 2019-Aug-19, Masahiko Sawada wrote: > On Tue, Aug 13, 2019 at 6:36 AM Alvaro Herrera > wrote: > > BTW I wrote the code as suggested and it passes all the tests ... but I > > then noticed that the unpatched code doesn't fail Ildar's original > > pgbench-based test for me, either. So maybe my laptop is not powerful > > enough to reproduce it, or maybe I'm doing something wrong. > > If you share the patch fixing this issue I'll test it on my > environment where I could reproduce the original problem. Never mind. I was able to reproduce it later, and verify that Andres' proposed strategy doesn't seem to fix the problem. I'm going to study the problem again today. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Wrong sentence in the README?
Hi, in the top-level README there is this:

PostgreSQL has many language interfaces, many of which are listed here: https://www.postgresql.org/download

I don't think the download page lists any language interfaces, or am I missing something?

Regards
Daniel
Re: Cleanup isolation specs from unused steps
Michael Paquier writes: > I have been looking at the isolation tests, and we have in some specs > steps which are defined but not used in any permutations. Hmm, might any of those represent actual bugs? Or are they just leftovers from test development? > In order to > detect them, I have been using the attached trick to track which > permutations are used. This allows to find immediately any > over-engineered spec by generating diffs about steps defined by not > used in any permutations. On HEAD, we have six specs entering in this > category. Seems like a good idea; I'm surprised we've got so many cases. regards, tom lane
Re: Partial join
Richard Guo wrote:
> Please refer to the code snippet below:
>
> @@ -1164,8 +1164,8 @@ generate_join_implied_equalities(PlannerInfo *root,
>     List *sublist = NIL;
>
>     /* ECs containing consts do not need any further enforcement */
>     if (ec->ec_has_const)
>         continue;

Sorry, I'm quite busy currently. And thanks! That was a good read.

I might be wrong, but I think have_partkey_equi_join in joinrels.c should be aware of the const case. My naive approach would be to keep pointers to the first few constant clauses that reference a yet-unmatched partition key, to keep the memory footprint feasible, in a manner similar to pk_has_clause. The question would be what to do if there are a lot of const expressions on the partition keys. One could palloc additional memory in that case, hoping that it will be quite rare. Or is there a different, better way to go about that?

Thank you for your feedback!

Regards
Arne
Re: Duplicated LSN in ReorderBuffer
Hi, On August 19, 2019 7:43:28 AM PDT, Alvaro Herrera wrote: >On 2019-Aug-19, Masahiko Sawada wrote: > >> On Tue, Aug 13, 2019 at 6:36 AM Alvaro Herrera > wrote: > >> > BTW I wrote the code as suggested and it passes all the tests ... >but I >> > then noticed that the unpatched code doesn't fail Ildar's original >> > pgbench-based test for me, either. So maybe my laptop is not >powerful >> > enough to reproduce it, or maybe I'm doing something wrong. >> >> If you share the patch fixing this issue I'll test it on my >> environment where I could reproduce the original problem. > >Never mind. I was able to reproduce it later, and verify that Andres' >proposed strategy doesn't seem to fix the problem. I'm going to study >the problem again today. Could you post the patch? Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On 8/19/19 8:51 AM, Ahsan Hadi wrote:
> I have shared a calendar invite for the TDE/KMS weekly meeting with the
> members who expressed interest in joining the meeting in this chain.
> Hopefully I haven't missed anyone.
>
> I am not aware of everyone's timezone, but I have tried to set up a time
> that's not very inconvenient. It won't be ideal for everyone as we are
> dealing with multiple timezones, but do let me know if it is too bizarre for
> you and I will try to find another slot.
>
> I will share a zoom link for the meeting on the invite in due course.

Please add me as well. I would like to join when I can.

Joe

--
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
I have shared a calendar invite for the TDE/KMS weekly meeting with the members who expressed interest in joining the meeting in this chain. Hopefully I haven't missed anyone.

I am not aware of everyone's timezone, but I have tried to set up a time that's not very inconvenient. It won't be ideal for everyone as we are dealing with multiple timezones, but do let me know if it is too bizarre for you and I will try to find another slot.

I will share a zoom link for the meeting on the invite in due course.

-- Ahsan

On Mon, Aug 19, 2019 at 9:26 AM Ahsan Hadi wrote:
>
> On Mon, 19 Aug 2019 at 6:23 AM, Smith, Peter wrote:
>
>> > From: Ibrar Ahmed Sent: Sunday, 18 August 2019 2:43 AM
>> > +1 for voice call, bruce we usually have a weekly TDE call. I will
>> > include you in
>>
>> If you don't mind, please also add me to that TDE call list.
>
> Sure will do.
>
>> Thanks/Regards,
>> ---
>> Peter Smith
>> Fujitsu Australia
Re: Global temporary tables
> Certainly, the default (small) temp buffer size plays a role.
> But on this IPC host this difference is not so important.
> Result with local temp tables and temp_buffers = 1GB: 859k TPS.
> It is a little bit unexpected result.

I understand that it is partially a generic problem of access to smaller dedicated caches versus access to a bigger shared cache. But it is hard to imagine that access to the local cache is 10% slower than access to the shared cache. Maybe there is some bottleneck - maybe our implementation of local buffers is suboptimal.

Using local buffers for global temporary tables can be interesting for another reason - they use temporary files, and temporary files can be placed on ephemeral IO on the Amazon cloud (with much better performance than persistent IO).

> --
> Konstantin Knizhnik
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company
Re: Cleanup isolation specs from unused steps
On Mon, Aug 19, 2019 at 1:08 AM Michael Paquier wrote: > Hi all, > > I have been looking at the isolation tests, and we have in some specs > steps which are defined but not used in any permutations. In order to > detect them, I have been using the attached trick to track which > permutations are used. This allows to find immediately any > over-engineered spec by generating diffs about steps defined by not > used in any permutations. On HEAD, we have six specs entering in this > category. > > Would that be useful? > I think it is useful. could you do the check that all steps have been used in dry_run mode instead of when running the tests for real? -- Melanie Plageman
Re: Unused header file inclusion
Alvaro Herrera writes: > On 2019-Aug-18, Tom Lane wrote: >> I did that, and ended up with the attached. I'm rather tempted to stick >> this into src/tools/ alongside cpluspluscheck, because it seems to find >> rather different trouble spots than cpluspluscheck does. Thoughts? > Yeah, let's include this. Done. regards, tom lane
io_uring support
Hi,

For some time already I have been following the new Linux IO interface "io_uring", which was introduced relatively recently [1]. The short description says:

    Shared application/kernel submission and completion ring pairs, for
    supporting fast/efficient IO.

For us the important part is probably that it is asynchronous IO that can work not only with O_DIRECT, but also with buffered access. Plus there are claims that it's pretty efficient (efficiency was one of the design goals [2]). The interface consists of submit/complete queues and data structures shared between an application and the kernel. To facilitate application development there is also a nice library to utilize io_uring from user space [3].

Since I haven't found that many discussions in the hackers archives about async IO, and out of curiosity, I decided to prepare an experimental patch to see how it would look to use io_uring in PostgreSQL. I've tested this patch so far only inside a qemu vm on the latest io_uring branch from the linux-block tree. The result is relatively simple, and introduces the new interface functions smgrqueueread, smgrsubmitread and smgrwaitread to queue any read we want, then submit the queue to the kernel and then wait for a result. The simplest example of how this interface could be used I found in pg_prewarm for buffer prefetching.

As a result of this experiment I have a few questions, open points and requests for the community's experience:

* I guess a proper implementation of async IO is a big deal, but it could also bring significant performance advantages. Is there any (nearest) future for such kind of async IO in PostgreSQL? Buffer prefetching is the simplest example, but taking into account that io_uring supports ordering, barriers and linked events, there are probably more use cases where it could be useful.

* Assuming that the answer to the previous question is positive, there could be different strategies for how to use io_uring. So far I see different opportunities for waiting. Let's say we have prepared a batch of async IO operations and submitted it. Then we can e.g.

  -> just wait for the batch to be finished
  -> wait (in the same syscall as submitting) for previously submitted batches, then start submitting again, and at the end wait for the leftovers
  -> peek whether any events are completed, and get only those without waiting for the whole batch (in this case it's necessary to make sure the submission queue does not overflow)

  So it's open what to use, and when.

* Does it make sense to use io_uring for smgrprefetch? Originally I added io_uring parts to FilePrefetch as well (in the form of preparing and submitting just one buffer), but I'm not sure this API is suitable.

* How might a data structure look that describes IO from the PostgreSQL perspective? With io_uring we need to somehow identify IO operations that were completed. For now I'm just using a buffer number. Btw, this experimental patch has many limitations, e.g. only one ring is used for everything, which is of course far from ideal and makes identification even more important.

* There are a few more degrees of freedom that io_uring introduces - how many rings to use, how many events per ring (which is going to be n for sqe and 2*n for cqe), how many IO operations per event to do (similar to preadv/pwritev we can provide a vector), and what the balance would be between submit and complete queues. I guess it will require a lot of benchmarking to find good values for these.
[1]: https://github.com/torvalds/linux/commit/38e7571c07be01f9f19b355a9306a4e3d5cb0f5b
[2]: http://kernel.dk/io_uring.pdf
[3]: http://git.kernel.dk/cgit/liburing/

v1-0001-io-uring.patch Description: Binary data
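For readers who haven't seen liburing yet, here is a minimal self-contained example (not from the patch; the file name and read size are arbitrary) of the queue/submit/wait pattern that the smgrqueueread/smgrsubmitread/smgrwaitread interface above mirrors. Build with -luring against liburing [3]:

#include <fcntl.h>
#include <stdio.h>
#include <sys/uio.h>
#include <liburing.h>

int
main(void)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	static char buf[8192];
	struct iovec iov = {.iov_base = buf, .iov_len = sizeof(buf)};
	int			fd;

	if (io_uring_queue_init(8, &ring, 0) < 0)
		return 1;

	fd = open("/tmp/testfile", O_RDONLY);
	if (fd < 0)
		return 1;

	/* Queue one buffered read at offset 0 and tag it with a user pointer. */
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_readv(sqe, fd, &iov, 1, 0);
	io_uring_sqe_set_data(sqe, buf);

	/* Hand the whole submission queue to the kernel in one syscall. */
	io_uring_submit(&ring);

	/* Wait for the completion and inspect its result. */
	if (io_uring_wait_cqe(&ring, &cqe) == 0)
	{
		printf("read returned %d bytes (tag %p)\n", cqe->res,
			   io_uring_cqe_get_data(cqe));
		io_uring_cqe_seen(&ring, cqe);
	}

	io_uring_queue_exit(&ring);
	return 0;
}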
Re: Unused header file inclusion
Hi, On 2019-08-18 14:37:34 -0400, Tom Lane wrote: > Andres Freund writes: > > I've pushed the other ones. > > Checking whether header files compile standalone shows you were overly > aggressive about removing fmgr.h includes: > > In file included from /tmp/headerscheck.Ss8bVx/test.c:3: > ./src/include/utils/selfuncs.h:143: error: expected declaration specifiers or > '...' before 'FmgrInfo' > ./src/include/utils/selfuncs.h:146: error: expected declaration specifiers or > '...' before 'FmgrInfo' > ./src/include/utils/selfuncs.h:152: error: expected declaration specifiers or > '...' before 'FmgrInfo' Darn. Pushed the obvious fix of adding a direct fmgr.h include, rather than the preivous indirect include. > That's with a script I use that's like cpluspluscheck except it tests > with plain gcc not g++. I attached it for the archives' sake. > > Oddly, cpluspluscheck does not complain about those cases, but it > does complain about Hm. I don't understand why it's not complaining. Wonder if it's a question of the flags or such. > In file included from /tmp/cpluspluscheck.FgX2SW/test.cpp:4: > ./src/bin/scripts/scripts_parallel.h:18: error: ISO C++ forbids declaration > of 'PGconn' with no type > ./src/bin/scripts/scripts_parallel.h:18: error: expected ';' before '*' token > ./src/bin/scripts/scripts_parallel.h:29: error: 'PGconn' has not been declared I noticed that "manually" earlier, when looking at the openbsd issue. > (My headerscheck script is missing that header; I need to update it to > match the latest version of cpluspluscheck.) I wonder if we should just add a #ifndef HEADERCHECK or such to the headers that we don't want to process as standalone headers (or #ifndef NOT_STANDALONE or whatever). That way multiple tools can rely on these markers, rather than copying knowledge about that kind of information into multiple places. I wish we could move the whole logic of those scripts into makefiles, so we could employ parallelism. Greetings, Andres Freund
Re: io_uring support
Hi,

On 2019-08-19 20:20:46 +0200, Dmitry Dolgov wrote:
> For already some time I'm following the new linux IO interface "io_uring", > that > was introduced relatively recently [1]. Short description says: > > Shared application/kernel submission and completion ring pairs, for > supporting fast/efficient IO.

Yes, it's quite promising. I also played around some with it. One thing I particularly like is that it seems somewhat realistic to have an abstraction that supports both io_uring and Windows' IOCP - personally I don't think we need support for more than those.

> For us the important part is probably that it's an asynchronious IO, that can > work not only with O_DIRECT, but with also with buffered access.

Note that while the buffered access does allow for some acceleration, it currently does have quite noticeable CPU overhead.

> Since I haven't found that many discussions in the hackers archives about > async > IO, and out of curiosity decided to prepare an experimental patch to see how > this would looks like to use io_uring in PostgreSQL.

Cool!

> I've tested this patch so far only inside a qemu vm on the latest > io_uring branch from linux-block tree. The result is relatively > simple, and introduces new interface smgrqueueread, smgrsubmitread and > smgrwaitread to queue any read we want, then submit a queue to a > kernel and then wait for a result. The simplest example of how this > interface could be used I found in pg_prewarm for buffers prefetching.

Hm. I'm a bit doubtful that that's going in the direction of being the right interface. I think we'd basically have to insist that all AIO-capable smgrs use one common AIO layer (note that the UNDO patches add another smgr implementation). Otherwise I think we'll have a very hard time making them cooperate. An interface like this would also lead to a lot of duplicated interfaces, because we'd basically need most of the smgr interface functions duplicated.

I suspect we'd rather have to build something where the existing functions grow a parameter controlling synchronicity. If AIO is allowed and supported, the smgr implementation would initiate the IO, together with a completion function for it, and return some value allowing the caller to wait for the result if desirable.

> As a result of this experiment I have few questions, open points and requests > for the community experience: > > * I guess the proper implementation to use async IO is a big deal, but could > bring also significant performance advantages. Is there any (nearest) future > for such kind of async IO in PostgreSQL? Buffer prefetching is a simplest > example, but taking into account that io_uring supports ordering, barriers > and linked events, there are probably more use cases when it could be > useful.

The lowest hanging fruit that I can see - and which I played with - is making the writeback flushing use async IO. That's particularly interesting for bgwriter. As it commonly only performs random IO, and as we need to keep the number of dirty buffers in the kernel small to avoid huge latency spikes, being able to submit IOs asynchronously can yield significant benefits.

> * Assuming that the answer for previous question is positive, there could be > different strategies how to use io_uring. So far I see different > opportunities for waiting. Let's say we have prepared a batch of async IO > operations and submitted it. Then we can e.g.
> > -> just wait for a batch to be finished > -> wait (in the same syscall as submitting) for previously submitted > batches, > then start submitting again, and at the end wait for the leftovers > -> peek if there are any events completed, and get only those without > waiting > for the whole batch (in this case it's necessary to make sure submission > queue is not overflowed) > > So it's open what and when to use.

I don't think there's much point in working only with complete batches. I think we'd lose too much of the benefit by introducing unnecessary synchronous operations. I think we'd need to design the interface in a way that there can constantly be in-progress IOs, block when the queue is full, and handle finished IOs using a callback mechanism or such.

> * Does it makes sense to use io_uring for smgrprefetch? Originally I've added > io_uring parts into FilePrefetch also (in the form of preparing and > submiting > just one buffer), but not sure if this API is suitable.

I have a hard time seeing that being worthwhile, unless we change the way it's used significantly. I think to really benefit, we'd have to be able to lock multiple buffers, and have io_uring prefetch directly into buffers.

> * How may look like a data structure, that can describe IO from PostgreSQL > perspective? With io_uring we need to somehow identify IO operations that > were completed. For now I'm just using a buffer number.

In my hacks I've used the sqe's user_data to point to a struct wi
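To illustrate the direction sketched above (existing smgr functions growing a parameter controlling synchronicity plus a completion callback), a purely hypothetical declaration could look something like this; none of these names exist in the tree or in any posted patch, they are invented here for illustration only:

/* hypothetical sketch only */
#include "postgres.h"

#include "storage/block.h"
#include "storage/smgr.h"

typedef void (*SMgrReadCompletion) (void *cb_arg, int io_result);

typedef struct SMgrAioHandle SMgrAioHandle;     /* opaque handle to wait on */

extern SMgrAioHandle *smgrread_async(SMgrRelation reln, ForkNumber forknum,
                                     BlockNumber blocknum, char *buffer,
                                     bool async,
                                     SMgrReadCompletion on_complete,
                                     void *cb_arg);
extern void smgrwait(SMgrAioHandle *handle);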
Re: Unused header file inclusion
Hi,

On 2019-08-19 13:07:50 -0700, Andres Freund wrote:
> On 2019-08-18 14:37:34 -0400, Tom Lane wrote: > > That's with a script I use that's like cpluspluscheck except it tests > > with plain gcc not g++. I attached it for the archives' sake. > > > > Oddly, cpluspluscheck does not complain about those cases, but it > > does complain about > > Hm. I don't understand why it's not complaining. Wonder if it's a > question of the flags or such.

It's caused by -fsyntax-only:

# These switches are g++ specific, you may override if necessary.
CXXFLAGS=${CXXFLAGS:- -fsyntax-only -Wall}

which also explains why your headerscheck doesn't have that problem, it doesn't use -fsyntax-only.

> > (My headerscheck script is missing that header; I need to update it to > > match the latest version of cpluspluscheck.) > > I wonder if we should just add a #ifndef HEADERCHECK or such to the > headers that we don't want to process as standalone headers (or #ifndef > NOT_STANDALONE or whatever). That way multiple tools can rely on these > markers, > rather than copying knowledge about that kind of information into > multiple places. > > I wish we could move the whole logic of those scripts into makefiles, so > we could employ parallelism.

Hm. Perhaps the way to do that would be to use gcc's -include to include postgres.h, and use -Wc++-compat to detect C++ issues, rather than using g++. Without tempfiles it ought to be a lot easier to just do all of the relevant work in make, without a separate shell script.

The python/perl/ecpg logic would be a bit annoying, but probably not too bad? Think we could just always add all of them?

Greetings,

Andres Freund
Re: POC: Cleaning up orphaned files using undo logs
On 2019-08-19 17:52:24 +0530, Amit Kapila wrote:
> On Sat, Aug 17, 2019 at 10:58 PM Andres Freund wrote: > > > > Hi, > > > > On 2019-08-17 12:05:21 -0400, Robert Haas wrote: > > > On Wed, Aug 14, 2019 at 12:39 PM Andres Freund wrote: > > > > > > Again, I think it's not ok to just assume you can lock an > > > > > > essentially > > > > > > unbounded number of buffers. This seems almost guaranteed to result > > > > > > in > > > > > > deadlocks. And there's limits on how many lwlocks one can hold etc. > > > > > > > > > > I think for controlling that we need to put a limit on max prepared > > > > > undo? I am not sure any other way of limiting the number of buffers > > > > > because we must lock all the buffer in which we are going to insert > > > > > the undo record under one WAL logged operation. > > > > > > > > I heard that a number of times. But I still don't know why that'd > > > > actually be true. Why would it not be sufficient to just lock the buffer > > > > currently being written to, rather than all buffers? It'd require a bit > > > > of care updating the official current "logical end" of a log, but > > > > otherwise ought to not be particularly hard? Only one backend can extend > > > > the log after all, and until the log is externally visibily extended, > > > > nobody can read or write those buffers, no? > > > > > > Well, I don't understand why you're on about this. We've discussed it > > > a number of times but I'm still confused. > > > > There's two reasons here: > > > > The primary one in the context here is that if we do *not* have to lock > > the buffers all ahead of time, we can simplify the interface. We > > certainly can't lock the buffers over IO (due to buffer reclaim) as > > we're doing right now, so we'd need another phase, called by the "user" > > during undo insertion. But if we do not need to lock the buffers before > > the insertion over all starts, the inserting location doesn't have to > > care. > > > > Secondarily, all the reasoning for needing to lock all buffers ahead of > > time was imo fairly unconvincing. Following the "recipe" for WAL > > insertions is a good idea when writing a new run-of-the-mill WAL > > inserting location - but when writing a new fundamental facility, that > > already needs to modify how WAL works, then I find that much less > > convincing. > > > > One point to remember in this regard is that we do need to modify the > LSN in undo pages after writing WAL, so all the undo pages need to be > locked by that time or we again need to take the lock on them.

Well, my main point, which so far has largely been ignored, was that we may not acquire page locks when we still need to search for victim buffers later. If we don't need to lock the pages up-front, but only do so once we're actually copying the records into the undo pages, then we don't need a separate phase to acquire the locks. We can still hold all of the page locks at the same time, as long as we just acquire them at the later stage.

My secondary point was that *none* of this actually is documented, even if it's entirely unobvious to the reader that the relevant code can only run during WAL insertion, due to being pretty far removed from that.

Greetings,

Andres Freund
Re: Unused header file inclusion
On 2019-Aug-19, Andres Freund wrote:
> > I wish we could move the whole logic of those scripts into makefiles, so > > we could employ parallelism. > > Hm. Perhaps the way to do that would be to use gcc's -include to include > postgres.h, and use -Wc++-compat to detect c++ issues, rather than using > g++. Without tempfiles it ought to be a lot easier to just do all of the > relevant work in make, without a separate shell script.

I used to have this:
https://postgr.es/m/1293469595-sup-1...@alvh.no-ip.org

Not sure how much this helps, since it's a shell line in make, so not very parallelizable. And you still have to build the exclusions somehow.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Unused header file inclusion
Andres Freund writes: > On 2019-08-19 13:07:50 -0700, Andres Freund wrote: >> On 2019-08-18 14:37:34 -0400, Tom Lane wrote: >>> Oddly, cpluspluscheck does not complain about those cases, but it >>> does complain about >> Hm. I don't understand why it's not complaining. Wonder if it's a >> question of the flags or such. > I'ts caused by -fsyntax-only Ah-hah. Should we change that to something else? That's probably a hangover from thinking that all we had to do was check for C++ keywords. >> I wish we could move the whole logic of those scripts into makefiles, so >> we could employ parallelism. I can't really get excited about expending a whole bunch of additional work on these scripts. regards, tom lane
Re: PANIC: could not flush dirty data: Operation not permitted power8, Redhat Centos
On Mon, Aug 19, 2019 at 7:32 AM Thomas Munro wrote:
> On Wed, Apr 17, 2019 at 1:04 PM Thomas Munro wrote: > > On Mon, Apr 15, 2019 at 7:57 PM wrote: > > > I forgot to mention that this is happening in a docker container. > > > > Huh, so there may be some configuration of Linux container that can > > fail here with EPERM, even though that error that does not appear in > > the man page, and doesn't make much intuitive sense. Would be good to > > figure out how that happens. > > Steve Dodd ran into the same problem in Borg[1]. It looks like what's > happening here is that on PowerPC and ARM systems, there is a second > system call sync_file_range2 that has the arguments arranged in a > better order for their calling conventions (see Notes section of man > sync_file_range), and glibc helpfully translates for you, but some > container technologies forgot to include sync_file_range2 in their > syscall forwarding table. Perhaps we should just handle this with the > not_implemented_by_kernel mechanism I added for WSL.

I've just heard that it was fixed overnight in seccomp, which is probably what Docker is using to give you EPERM for syscalls it doesn't like the look of:

https://github.com/systemd/systemd/pull/13352/commits/90ddac6087b5f8f3736364cfdf698e713f7e8869

Not being a Docker user, I'm not sure if/when that will flow into the right places in a timely fashion, but if not, it looks like you can always configure your own profile or take one from somewhere else, probably something like this:

https://github.com/moby/moby/commit/52d8f582c331e35f7b841171a1c22e2d9bbfd0b8

So it looks like we don't need to do anything at all on our side, unless someone knows better.

--
Thomas Munro
https://enterprisedb.com
Re: Zedstore - compressed in-core columnar storage
On Sun, Aug 18, 2019 at 12:35 PM Justin Pryzby wrote:
> > . I was missing a way to check for compression ratio;

Here are the ways to check compression ratio for zedstore:

Table level:
select sum(uncompressedsz::numeric) / sum(totalsz) as compratio from pg_zs_btree_pages();

Per column level:
select attno, count(*), sum(uncompressedsz::numeric) / sum(totalsz) as compratio from pg_zs_btree_pages() group by attno order by attno;

> it looks like zedstore >with lz4 gets ~4.6x for our largest customer's largest table. zfs using >compress=gzip-1 gives 6x compression across all their partitioned > tables, >and I'm surprised it beats zedstore . >

What kind of tables did you use? Is it possible to give us the schema of the table? Did you perform 'INSERT INTO ... SELECT' or COPY? Currently COPY gives better compression ratios than single INSERTs because it generates fewer pages for metadata. The per-column query above will show which columns have a lower compression ratio. We plan to add other compression algorithms like RLE and delta encoding, which should give better compression ratios for a column store along with LZ4.
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Tue, Aug 13, 2019 at 8:45 AM Anastasia Lubennikova wrote: > > I still need to think about the exact details of alignment within > > _bt_insertonpg_in_posting(). I'm worried about boundary cases there. I > > could be wrong. > Could you explain more about these cases? > Now I don't understand the problem. Maybe there is no problem. > Thank you for the patch. > Still, I'd suggest to leave it as a possible future improvement, so that > it doesn't > distract us from the original feature. I don't even think that it's useful work for the future. It's just nice to be sure that we could support unique index deduplication if it made sense. Which it doesn't. If I didn't write the patch that implements deduplication for unique indexes, I might still not realize that we need the index_compute_xid_horizon_for_tuples() stuff in certain other places. I'm not serious about it at all, except as a learning exercise/experiment. > I added to v6 another related fix for _bt_compress_one_page(). > Previous code was implicitly deleted DEAD items without > calling index_compute_xid_horizon_for_tuples(). > New code has a check whether DEAD items on the page exist and remove > them if any. > Another possible solution is to copy dead items as is from old page to > the new one, > but I think it's good to remove dead tuples as fast as possible. I think that what you've done in v7 is probably the best way to do it. It's certainly simple, which is appropriate given that we're not really expecting to see LP_DEAD items within _bt_compress_one_page() (we just need to be prepared for them). > > v5 makes _bt_insertonpg_in_posting() prepared to overwrite an > > existing item if it's an LP_DEAD item that falls in the same TID range > > (that's _bt_compare()-wise "equal" to an existing tuple, which may or > > may not be a posting list tuple already). I haven't made this code do > > something like call index_compute_xid_horizon_for_tuples(), even > > though that's needed for correctness (i.e. this new code is currently > > broken in the same way that I mentioned unique index support is > > broken). > Is it possible that DEAD tuple to delete was smaller than itup? I'm not sure what you mean by this. I suppose that it doesn't matter, since we both prefer the alternative that you came up with anyway. > > How do you feel about officially calling this deduplication, not > > compression? I think that it's a more accurate name for the technique. > I agree. > Should I rename all related names of functions and variables in the patch? Please rename them when convenient. -- Peter Geoghegan
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Fri, Aug 16, 2019 at 8:56 AM Anastasia Lubennikova wrote: > Now the algorithm is the following: > > - If bt_findinsertloc() found out that tuple belongs to existing posting > tuple's > TID interval, it sets 'in_posting_offset' variable and passes it to > _bt_insertonpg() > > - If 'in_posting_offset' is valid and origtup is valid, > merge our itup into origtup. > > It can result in one tuple neworigtup, that must replace origtup; or two > tuples: > neworigtup and newrighttup, if the result exceeds BTMaxItemSize, That sounds like the right way to do it. > - If two new tuple(s) fit into the old page, we're lucky. > call _bt_delete_and_insert(..., neworigtup, newrighttup, newitemoff) to > atomically replace oldtup with new tuple(s) and generate xlog record. > > - In case page split is needed, pass both tuples to _bt_split(). > _bt_findsplitloc() is now aware of upcoming replacement of origtup with > neworigtup, so it uses correct item size where needed. That makes sense, since _bt_split() is responsible for both splitting the page, and inserting the new item on either the left or right page, as part of the first phase of a page split. In other words, if you're adding something new to _bt_insertonpg(), you probably also need to add something new to _bt_split(). So that's what you did. > It seems that now all replace operations are crash-safe. The new patch passes > all regression tests, so I think it's ready for review again. I'm looking at it now. I'm going to spend a significant amount of time on this tomorrow. I think that we should start to think about efficient WAL-logging now. > In the meantime, I'll run more stress-tests. As you probably realize, wal_consistency_checking is a good thing to use with your tests here. -- Peter Geoghegan
Plug-in common/logging.h with vacuumlo and oid2name
Hi all, While refactoring some code, I have bumped into $subject, which causes both tools to have incorrect error message formats and progname being used all through the code, sometimes incorrectly or just missing. At the same time, we are missing a call to set_pglocale_pgservice() for both binaries. Any objections to the cleanup attached? -- Michael diff --git a/contrib/oid2name/oid2name.c b/contrib/oid2name/oid2name.c index d3a4a50005..b33e4d3d82 100644 --- a/contrib/oid2name/oid2name.c +++ b/contrib/oid2name/oid2name.c @@ -11,6 +11,7 @@ #include "catalog/pg_class_d.h" +#include "common/logging.h" #include "fe_utils/connect.h" #include "libpq-fe.h" #include "pg_getopt.h" @@ -85,6 +86,8 @@ get_opts(int argc, char **argv, struct options *my_opts) const char *progname; int optindex; + pg_logging_init(argv[0]); + set_pglocale_pgservice(argv[0], PG_TEXTDOMAIN("oid2name")); progname = get_progname(argv[0]); /* set the defaults */ @@ -320,8 +323,8 @@ sql_conn(struct options *my_opts) if (!conn) { - fprintf(stderr, "%s: could not connect to database %s\n", - "oid2name", my_opts->dbname); + pg_log_error("could not connect to database %s", + my_opts->dbname); exit(1); } @@ -339,8 +342,8 @@ sql_conn(struct options *my_opts) /* check to see that the backend connection was successfully made */ if (PQstatus(conn) == CONNECTION_BAD) { - fprintf(stderr, "%s: could not connect to database %s: %s", -"oid2name", my_opts->dbname, PQerrorMessage(conn)); + pg_log_error("could not connect to database %s: %s", + my_opts->dbname, PQerrorMessage(conn)); PQfinish(conn); exit(1); } @@ -348,8 +351,8 @@ sql_conn(struct options *my_opts) res = PQexec(conn, ALWAYS_SECURE_SEARCH_PATH_SQL); if (PQresultStatus(res) != PGRES_TUPLES_OK) { - fprintf(stderr, "oid2name: could not clear search_path: %s\n", -PQerrorMessage(conn)); + pg_log_error("could not clear search_path: %s", + PQerrorMessage(conn)); PQclear(res); PQfinish(conn); exit(-1); @@ -382,8 +385,8 @@ sql_exec(PGconn *conn, const char *todo, bool quiet) /* check and deal with errors */ if (!res || PQresultStatus(res) > 2) { - fprintf(stderr, "oid2name: query failed: %s\n", PQerrorMessage(conn)); - fprintf(stderr, "oid2name: query was: %s\n", todo); + pg_log_error("query failed: %s", PQerrorMessage(conn)); + pg_log_error("query was: %s", todo); PQclear(res); PQfinish(conn); diff --git a/contrib/vacuumlo/vacuumlo.c b/contrib/vacuumlo/vacuumlo.c index 73c06a043e..7fbcaee846 100644 --- a/contrib/vacuumlo/vacuumlo.c +++ b/contrib/vacuumlo/vacuumlo.c @@ -23,6 +23,7 @@ #include "catalog/pg_class_d.h" +#include "common/logging.h" #include "fe_utils/connect.h" #include "libpq-fe.h" #include "pg_getopt.h" @@ -109,8 +110,7 @@ vacuumlo(const char *database, const struct _param *param) conn = PQconnectdbParams(keywords, values, true); if (!conn) { - fprintf(stderr, "Connection to database \"%s\" failed\n", - database); + pg_log_error("connection to database \"%s\" failed", database); return -1; } @@ -129,8 +129,8 @@ vacuumlo(const char *database, const struct _param *param) /* check to see that the backend connection was successfully made */ if (PQstatus(conn) == CONNECTION_BAD) { - fprintf(stderr, "Connection to database \"%s\" failed:\n%s", -database, PQerrorMessage(conn)); + pg_log_error("connection to database \"%s\" failed: %s", + database, PQerrorMessage(conn)); PQfinish(conn); return -1; } @@ -145,8 +145,7 @@ vacuumlo(const char *database, const struct _param *param) res = PQexec(conn, ALWAYS_SECURE_SEARCH_PATH_SQL); if (PQresultStatus(res) != PGRES_TUPLES_OK) { - 
fprintf(stderr, "Failed to set search_path:\n"); - fprintf(stderr, "%s", PQerrorMessage(conn)); + pg_log_error("failed to set search_path: %s", PQerrorMessage(conn)); PQclear(res); PQfinish(conn); return -1; @@ -165,8 +164,7 @@ vacuumlo(const char *database, const struct _param *param) res = PQexec(conn, buf); if (PQresultStatus(res) != PGRES_COMMAND_OK) { - fprintf(stderr, "Failed to create temp table:\n"); - fprintf(stderr, "%s", PQerrorMessage(conn)); + pg_log_error("failed to create temp table: %s", PQerrorMessage(conn)); PQclear(res); PQfinish(conn); return -1; @@ -182,8 +180,7 @@ vacuumlo(const char *database, const struct _param *param) res = PQexec(conn, buf); if (PQresultStatus(res) != PGRES_COMMAND_OK) { - fprintf(stderr, "Failed to vacuum temp table:\n"); - fprintf(stderr, "%s", PQerrorMessage(conn)); + pg_log_error("failed to vacuum temp table: %s", PQerrorMessage(conn)); PQclear(res); PQfinish(conn); return -1; @@ -212,8 +209,7 @@ vacuumlo(const char *database, const struct _param *param) res = PQexec(conn, buf); if (PQresultStatus(res) != PGRES_TUPLES_OK) { - fprintf(stderr, "Failed to find OID columns:\n"); - fprintf(stderr, "%s", PQerrorMessage(conn)); + pg_log_error("fai
Re: Cleanup isolation specs from unused steps
On Mon, Aug 19, 2019 at 11:02:42AM -0400, Tom Lane wrote: > Michael Paquier writes: >> I have been looking at the isolation tests, and we have in some specs >> steps which are defined but not used in any permutations. > > Hmm, might any of those represent actual bugs? Or are they just > leftovers from test development? I cannot yet enter the minds of each test author back this much in time, but I think that's a mix of both. When working on a new isolation spec, I personally tend to do a lot of copy-pasting of the same queries for multiple sessions and then manipulate the permutations to produce a set of useful tests. It is rather easy to forget to remove some steps when doing that. I guess that's what happened with tuplelock-upgrade, insert-conflict-do-update* and freeze-the-dead. -- Michael signature.asc Description: PGP signature
Re: Cleanup isolation specs from unused steps
On Mon, Aug 19, 2019 at 10:23:19AM -0700, Melanie Plageman wrote: > Could you do the check that all steps have been used in dry_run mode > instead of when running the tests for real? Sure, I was hesitating to do so. I have no issue in moving the check into run_testspec(). So done as attached. It is rather a pain to pass down custom options to isolationtester. For example, I have tested the updated version attached after hijacking -n into isolation_start_test(). Ugly hack, but for testing that's enough. Do you make use of this tool in a particular way in greenplum? Just wondering. (Could it make sense to have long options for isolationtester by the way?) -- Michael diff --git a/src/test/isolation/isolationtester.c b/src/test/isolation/isolationtester.c index f98bb1cf64..f8ec7ca7b4 100644 --- a/src/test/isolation/isolationtester.c +++ b/src/test/isolation/isolationtester.c @@ -251,10 +251,24 @@ static int *piles; static void run_testspec(TestSpec *testspec) { + int i; + if (testspec->permutations) run_named_permutations(testspec); else run_all_permutations(testspec); + + /* + * Verify that all steps have been used, complaining about anything + * defined but not used. This check happens here so as dry-run is able + * to use it. + */ + for (i = 0; i < testspec->nallsteps; i++) + { + if (!testspec->allsteps[i]->used) + fprintf(stderr, "unused step name: %s\n", + testspec->allsteps[i]->name); + } } /* @@ -457,7 +471,11 @@ run_permutation(TestSpec *testspec, int nsteps, Step **steps) { printf("permutation"); for (i = 0; i < nsteps; i++) + { + /* Track the permutation as in-use */ + steps[i]->used = true; printf(" \"%s\"", steps[i]->name); + } printf("\n"); return; } @@ -467,7 +485,11 @@ run_permutation(TestSpec *testspec, int nsteps, Step **steps) printf("\nstarting permutation:"); for (i = 0; i < nsteps; i++) + { + /* Track the permutation as in-use */ + steps[i]->used = true; printf(" %s", steps[i]->name); + } printf("\n"); /* Perform setup */ diff --git a/src/test/isolation/isolationtester.h b/src/test/isolation/isolationtester.h index 7f91e6433f..d9d2a14ecf 100644 --- a/src/test/isolation/isolationtester.h +++ b/src/test/isolation/isolationtester.h @@ -29,6 +29,7 @@ struct Session struct Step { int session; + bool used; char *name; char *sql; char *errormsg; diff --git a/src/test/isolation/specparse.y b/src/test/isolation/specparse.y index fb8a4d706c..2dfe3533ff 100644 --- a/src/test/isolation/specparse.y +++ b/src/test/isolation/specparse.y @@ -145,6 +145,7 @@ step: $$ = pg_malloc(sizeof(Step)); $$->name = $2; $$->sql = $3; +$$->used = false; $$->errormsg = NULL; } ; diff --git a/src/test/isolation/specs/freeze-the-dead.spec b/src/test/isolation/specs/freeze-the-dead.spec index 8c3649902a..915bf15b92 100644 --- a/src/test/isolation/specs/freeze-the-dead.spec +++ b/src/test/isolation/specs/freeze-the-dead.spec @@ -19,7 +19,6 @@ session "s1" step "s1_begin" { BEGIN; } step "s1_update" { UPDATE tab_freeze SET x = x + 1 WHERE id = 3; } step "s1_commit" { COMMIT; } -step "s1_vacuum" { VACUUM FREEZE tab_freeze; } step "s1_selectone" { BEGIN; SET LOCAL enable_seqscan = false; @@ -28,7 +27,6 @@ step "s1_selectone" { COMMIT; } step "s1_selectall" { SELECT * FROM tab_freeze ORDER BY name, id; } -step "s1_reindex" { REINDEX TABLE tab_freeze; } session "s2" step "s2_begin" { BEGIN; } @@ -40,7 +38,6 @@ session "s3" step "s3_begin" { BEGIN; } step "s3_key_share" { SELECT id FROM tab_freeze WHERE id = 3 FOR KEY SHARE; } step "s3_commit" { COMMIT; } -step "s3_vacuum" { VACUUM FREEZE tab_freeze; } # This 
permutation verifies that a previous bug # https://postgr.es/m/e5711e62-8fdf-4dca-a888-c200bf6b5...@amazon.com diff --git a/src/test/isolation/specs/insert-conflict-do-nothing.spec b/src/test/isolation/specs/insert-conflict-do-nothing.spec index 9b92c35cec..71acc380c7 100644 --- a/src/test/isolation/specs/insert-conflict-do-nothing.spec +++ b/src/test/isolation/specs/insert-conflict-do-nothing.spec @@ -33,7 +33,6 @@ setup step "donothing2" { INSERT INTO ints(key, val) VALUES(1, 'donothing2') ON CONFLICT DO NOTHING; } step "select2" { SELECT * FROM ints; } step "c2" { COMMIT; } -step "a2" { ABORT; } # Regular case where one session block-waits on another to determine if it # should proceed with an insert or do nothing. diff --git a/src/test/isolation/specs/insert-conflict-do-update-2.spec b/src/test/isolation/specs/insert-conflict-do-update-2.spec index f5b4f601b5..12f6be8000 100644 --- a/src/test/isolation/specs/insert-conflict-do-update-2.spec +++ b/src/test/isolation/specs/insert-conflict-do-update-2.spec @@ -32,7 +32,6 @@ setup step "insert2" { INSERT INTO upsert(key, payload) VALUES('FOOFOO', 'insert2') ON CONFLICT (lower(key)) DO UPDATE set key = EXCLUDED.key, payload = upsert.payload || ' updated by insert2'; } step "select2" { SELECT * FRO
Re: Zedstore - compressed in-core columnar storage
On Mon, Aug 19, 2019 at 04:15:30PM -0700, Alexandra Wang wrote: > On Sun, Aug 18, 2019 at 12:35 PM Justin Pryzby wrote: > > > . I was missing a way to check for compression ratio; > > Here are the ways to check compression ratio for zedstore: > > Table level: > SELECT sum(uncompressedsz::numeric)/sum(totalsz) AS compratio FROM > pg_zs_btree_pages(); postgres=# SELECT sum(uncompressedsz::numeric)/sum(totalsz) AS compratio FROM pg_zs_btree_pages('child.cdrs_huawei_pgwrecord_2019_07_01'); compratio | 4.2730304163521529 For a fair test, I created a separate ZFS tablspace for storing just a copy of that table. ts=# CREATE TABLE test TABLESPACE testcomp AS SELECT * FROM child.cdrs_huawei_pgwrecord_2019_07_01; SELECT 39933381 Time: 882417.775 ms (14:42.418) zfs/testJTP20190819 compressratio 6.01x - zfs/testJTP20190819 compressiongzip-1inherited from zfs > Per column level: > select attno, count(*), sum(uncompressedsz::numeric)/sum(totalsz) as > compratio from pg_zs_btree_pages() group by attno order by attno; Order by 3; I see we have SOME highly compressed columns. It's still surprising to me that's as low as it is, given their content: phone numbers and IPv4 addresses in text form, using characters limited to [[:digit:].] (I realize we can probably save space using inet type.) 0 | 4743 | 1. 32 | 21912 | 1.05953637381493823513 80 | 36441 | 1.2416446300175039 4 | 45059 | 1.3184106811322728 83 | 45059 | 1.3184106811322728 52 | 39208 | 1.3900788061770992 ... 74 | 3464 |10.8258665101057364 17 | 3535 |10.8776086243096534 3 | 7092 |11.0388009154683678 11 | 3518 |11.4396055611832109 65 | |14.6594723104237634 35 | 14077 |15.1642131499381887 ... 43 | 1601 |21.4200106784573211 79 | 1599 |21.4487670806076829 89 | 1934 |23.6292134031933401 33 | 1934 |23.6292134031933401 It seems clear the columns with high n_distinct have low compress ratio, and columns with high compress ratio are those with n_distinct=1... CREATE TEMP TABLE zs AS SELECT zs.*, n_distinct, avg_width, a.attname FROM (SELECT 'child.cdrs_huawei_pgwrecord_2019_07_01'::regclass t)t , LATERAL (SELECT attno, count(*), sum(uncompressedsz::numeric)/sum(totalsz) AS compratio FROM pg_zs_btree_pages(t) GROUP BY attno)zs , pg_attribute a, pg_class c, pg_stats s WHERE a.attrelid=t AND a.attnum=zs.attno AND c.oid=a.attrelid AND c.relname=s.tablename AND s.attname=a.attname; n_distinct | compratio + 217141 | 1.2416446300175039 154829 | 1.5306062496764190 144486 | 1.3900788061770992 128334 | 1.5395022739568842 121324 | 1.4005533187886683 86341 | 1.6262709389296389 84073 | 4.4379336418590519 65413 | 5.1890181028038757 63703 | 5.5029855093836425 63637 | 5.3648468796642262 46450 | 1.3184106811322728 46450 | 1.3184106811322728 43029 | 1.8003513772661308 39363 | 1.5845730687475706 36720 | 1.4751147557399539 36445 | 1.8403087513759131 36445 | 1.5453935268318613 11455 | 1.05953637381493823513 2862 | 9.8649823666870671 2625 | 2.3573614181847621 1376 | 1.7895024285340428 1335 | 2.2812551964262787 807 | 7.1192324141359373 610 | 7.9373623460089360 16 |11.4396055611832109 10 | 5.5429763442365557 7 | 5.0440578041440675 7 | 5.2000132813261135 4 | 6.9741514753325536 4 | 4.2872818036896340 3 | 1.9080838412634827 3 | 2.9915954457453485 3 | 2.3056387009407882 2 |10.8776086243096534 2 | 5.5950929307378287 2 |18.5796576388128741 2 |10.8258665101057364 2 | 9.1112820658021406 2 | 3.4986057630739795 2 | 4.6250999234025238 2 |11.0388009154683678 1 |15.1642131499381887 1 | 2.8855860118178798 1 |23.6292134031933401 1 |21.4200106784573211 [...] 
> > it looks like zedstore > >with lz4 gets ~4.6x for our largest customer's largest table. zfs using > >compress=gzip-1 gives 6x compression across all their partitioned > > tables, > >and I'm surprised it beats zedstore . > > > > What kind of tables did you use? Is it possible to give us the schema > of the table? Did you perform 'INSERT INTO ... SELECT' or COPY? I did this: |time ~/src/postgresql.bin/bin/pg_restore /srv/cdrperfbackup/ts/final/child.cdrs_huawei_pgwrecord_2019_07_01 -f- |PGOPTIONS='-cdefault_table_access_method=zedstore' psql --port 5678 postgres --host /tmp ... COPY 39933381 ... real100m25.764s child | cdrs_huawei_pgwrecord_2019_07_01 | table | pryz
Fixing typos and inconsistencies
Hello hackers, Now that the unicums checking is finished, I would like to share the script I used to find them. Maybe it can be useful to recheck the source tree from time to time... I don't think that the check could be fully automated, but with some eyeballing it allows to maintain a more consistent state. Best regards, Alexander find-new-unicums.sh Description: application/shellscript
Re: Cleanup isolation specs from unused steps
On 2019-Aug-20, Michael Paquier wrote: > On Mon, Aug 19, 2019 at 10:23:19AM -0700, Melanie Plageman wrote: > > Could you do the check that all steps have been used in dry_run mode > > instead of when running the tests for real? > > Sure, I was hesitating to do so. I have no issue in moving the check > into run_testspec(). So done as attached. I created the dry-run mode to be able to easily generate the set of possible permutations for a new test, then edit the result and put it back in the spec file; but after the deadlock tests were added (with necessary hacking of the lock-detection in isolationtester) that manner of operation became almost completely useless. Maybe we need to rethink what facilities isolationtester offers -- possibly making dry-run have a completely different behavior than currently, which I doubt anybody is using. All that being said, I have no objections to this patch (but I didn't review it closely). -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Cleanup isolation specs from unused steps
On Tue, Aug 20, 2019 at 12:34:45AM -0400, Alvaro Herrera wrote:
> I created the dry-run mode to be able to easily generate the set of > possible permutations for a new test, then edit the result and put it > back in the spec file; but after the deadlock tests were added (with > necessary hacking of the lock-detection in isolationtester) that manner > of operation became almost completely useless. Maybe we need to rethink > what facilities isolationtester offers -- possibly making dry-run have a > completely different behavior than currently, which I doubt anybody is > using.

I am not sure exactly how it could be redesigned, and with n! permutations it easily leads to bloat of the generated output. I think that --dry-run (well, -n) is a bit misleading as an option name, though, as it only prints the permutations. Still, keeping it around has no real cost, so it is not a big deal. (Looking at the gpdb code, it does not seem to be used.)

> All that being said, I have no objections to this patch (but I didn't > review it closely).

Thanks.

--
Michael

signature.asc Description: PGP signature
Re: [HACKERS] WAL logging problem in 9.4.3?
On Mon, Aug 19, 2019 at 06:59:59PM +0900, Kyotaro Horiguchi wrote:
> At Sat, 17 Aug 2019 20:52:30 -0700, Noah Misch wrote in > <20190818035230.gb3021...@rfd.leadboat.com> > > For two-phase commit, PrepareTransaction() needs to execute pending syncs. > > Now TwoPhaseFileHeader has two new members for (commit-time) > pending syncs. Pending-syncs are useless on wal-replay, but that > is needed for commit-prepared.

There's no need to modify TwoPhaseFileHeader or the COMMIT PREPARED sql command, which is far too late to be syncing new relation files. (A crash may have already destroyed their data.) PrepareTransaction(), which implements the PREPARE TRANSACTION command, is the right place for these syncs. A failure in these new syncs needs to prevent the transaction from being marked committed. Hence, in CommitTransaction(), these new syncs need to happen after the last step that could assign a new relfilenode and before RecordTransactionCommit(). I suspect it's best to do it after PreCommit_on_commit_actions() and before AtEOXact_LargeObject().

> > On Thu, Jul 25, 2019 at 10:39:36AM +0900, Kyotaro Horiguchi wrote: > > > --- a/src/backend/catalog/storage.c > > > +++ b/src/backend/catalog/storage.c > > > @@ -428,21 +450,34 @@ smgrDoPendingDeletes(bool isCommit) > ... > > > + smgrimmedsync(srel, MAIN_FORKNUM); > > > > This should sync all forks, not just MAIN_FORKNUM. Code that writes WAL for > > FSM_FORKNUM and VISIBILITYMAP_FORKNUM checks RelationNeedsWAL(). There may > > be > > no bug today, but it's conceptually wrong to make RelationNeedsWAL() return > > false due to this code, use RelationNeedsWAL() for multiple forks, and then > > not actually sync all forks. > > I agree that all forks needs syncing, but FSM and VM are checking > RelationNeedsWAL(modified). To make sure, are you suggesting to > sync all forks instead of emitting WAL for them, or suggesting > that VM and FSM to emit WALs even when the modified > RelationNeedsWAL returns false (+ sync all forks)?

I hadn't thought that far. What do you think is best?

> > The https://postgr.es/m/559fa0ba.3080...@iki.fi design had another component > > not appearing here. It said, "Instead, at COMMIT, we'd fsync() the > > relation, > > or if it's smaller than some threshold, WAL-log the contents of the whole > > file > > at that point." Please write the part to WAL-log the contents of small > > files > > instead of syncing them. > > I'm not sure the point of the behavior. I suppose that the "log" > is a sequence of new_page records. It also needs to be synced and > it is always larger than the file to be synced. I can't think of > an appropriate threshold without the point.

Yes, it would be a sequence of new-page records. FlushRelationBuffers() locks every buffer header containing a buffer of the current database. The belief has been that writing one page to xlog is cheaper than FlushRelationBuffers() in a busy system with large shared_buffers.

> > > --- a/src/backend/commands/copy.c > > > +++ b/src/backend/commands/copy.c > > > @@ -2725,28 +2722,9 @@ CopyFrom(CopyState cstate) > > >* If it does commit, we'll have done the table_finish_bulk_insert() at > > >* the bottom of this routine first. > > >* > > > - * As mentioned in comments in utils/rel.h, the in-same-transaction test > > > - * is not always set correctly, since in rare cases > > > rd_newRelfilenodeSubid > > > - * can be cleared before the end of the transaction.
The exact case is > > > - * when a relation sets a new relfilenode twice in same transaction, yet > > > - * the second one fails in an aborted subtransaction, e.g. > > > - * > > > - * BEGIN; > > > - * TRUNCATE t; > > > - * SAVEPOINT save; > > > - * TRUNCATE t; > > > - * ROLLBACK TO save; > > > - * COPY ... > > > > The comment material being deleted is still correct, so don't delete it. > > Moreover, the code managing rd_firstRelfilenodeSubid has a similar bug. The > > attached patch adds an assertion that RelationNeedsWAL() and the > > pendingDeletes array have the same opinion about the relfilenode, and it > > expands a test case to fail that assertion. > > (Un?)Fortunately, that doesn't fail.. (with rebased version on > the recent master) I'll recheck that tomorrow. Did you build with --enable-cassert? > > > --- a/src/include/utils/rel.h > > > +++ b/src/include/utils/rel.h > > > @@ -74,11 +74,13 @@ typedef struct RelationData > > > SubTransactionId rd_createSubid;/* rel was created in current > > > xact */ > > > SubTransactionId rd_newRelfilenodeSubid;/* new relfilenode > > > assigned in > > > > > > * current xact */ > > > + SubTransactionId rd_firstRelfilenodeSubid; /* new relfilenode > > > assigned > > > + > > > * first in current xact */
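To spell out the commit-time decision described above, a very rough sketch could look like the following; the helper name and the threshold are invented here, it is not from any posted patch, and it assumes any dirty shared buffers for the relation have already been flushed so that smgrread() sees the current contents:

/* illustrative sketch only */
#include "postgres.h"

#include "access/xloginsert.h"
#include "storage/bufpage.h"
#include "storage/smgr.h"

#define SYNC_VS_LOG_THRESHOLD   64  /* blocks; placeholder value */

static void
sync_or_log_pending_rel(SMgrRelation srel)
{
    BlockNumber nblocks = smgrnblocks(srel, MAIN_FORKNUM);

    if (nblocks <= SYNC_VS_LOG_THRESHOLD)
    {
        /* Small file: emit one new-page record per block instead of fsyncing. */
        PGAlignedBlock buf;
        BlockNumber blkno;

        for (blkno = 0; blkno < nblocks; blkno++)
        {
            smgrread(srel, MAIN_FORKNUM, blkno, buf.data);
            /* assumes a standard page layout, hence page_std = true */
            log_newpage(&srel->smgr_rnode.node, MAIN_FORKNUM, blkno,
                        (Page) buf.data, true);
        }
    }
    else
        smgrimmedsync(srel, MAIN_FORKNUM);
}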
Re: FETCH FIRST clause PERCENT option
Hi,

At Wed, 7 Aug 2019 10:20:09 +0300, Surafel Temesgen wrote in
> Hi > On Wed, Aug 7, 2019 at 6:11 AM Kyotaro Horiguchi > wrote: > > > > > I have some comments. > > > > This patch uses distinct parameters for exact number and > > percentage. On the othe hand planner has a notion of > > tuple_fraction covering the both. The same handling is also used > > for tuple number estimation. Using the same scheme will make > > things far simpler. See the comment of grouping_planner(). > > > > > Its because of data type difference .In planner the data type is the same

I meant that, with the usage of tuple_fraction, one variable can represent both the absolute limit and the relative limit. This simplifies parse structs.

> > In executor part, addition to LimiteState.position, this patch > > uses LimiteState.backwardPosition to count how many tuples we're > > back from the end of the current tuplestore. But things will get > > simpler by just asking the tuplestore for the number of holding > > tuples. > > > > > backwardPosition hold how many tuple returned in backward scan

I meant that we don't need to hold the number in estate.

> > +slot = node->subSlot; > > > > Why is this needed? The variable is properly set before use and > > the assignment is bogus in the first place. > > > > > its because Tuplestore needs initialized slot.

I meant that the initialized slot is overwritten before first use.

> > The new code block in LIMIT_RESCAN in ExecLimit is useless since > > it is exatctly the same with existing code block. Why didn't you > > use the existing if block? > > > > But they test different scenario

I meant that the two different scenarios can share the code block.

> > >if (node->limitOption == PERCENTAGE) > > +{ > > +node->perExactCount = ceil(node->percent * > > node->position / 100.0); > > + > > + > > > > node->position is the number of tuples returned to upper node so > > far (not the number of tuples this node has read from the lower > > node so far). I don't understand what the expression means. > > > > node->position hold the number of tuples this node has read from the lower > node so far. see LIMIT_RESCAN state

Really? node->position is incremented when tuplestore_gettupleslot_heaptuple() succeeds and returns the tuple to the caller immediately...

> > +if (node->perExactCount == node->perExactCount + 1) > > +node->perExactCount++; > > > > What? The condition never be true. As the result, the following > > if block below won't run. > > > > it became true according to the number of tuple returned in from the lower > node so far > and percentage specification.

Mmm. How do you think X can be equal to (X + 1)?

> > >/* > > + * Return the tuple up to the number of exact count in > > OFFSET > > + * clause without percentage value consideration. > > + */ > > +if (node->perExactCount > 0) > > +{ > > + > > > > > > > > > > +/* > > + * We may needed this tuple in backward scan so put it into > > + * tuplestore. > > + */ > > +if (node->limitOption == PERCENTAGE) > > +{ > > +tuplestore_puttupleslot(node->tupleStore, slot); > > +tuplestore_advance(node->tupleStore, true); > > +} > > > > "needed"->"need" ? The comment says that this is needed for > > backward scan, but it is actually required even in forward > > scan. More to the point, tuplestore_advance lacks comment. > > > ok > > > > > > Anyway, the code in LIMIT_RESCAN is broken in some ways. For > > example, if it is used as the inner scan of a NestLoop, the > > tuplestore accumulates tuples by every scan.
> > You will see that > > the tuplestore stores up to 1000 tuples (10 times of the inner) > > by the following query. > > > > It this because in percentage we scan the whole table

It's useless and rather harmful that the tuplestore holds an indefinite number of duplicate sets of the whole tuples from the lower node. We must reuse the tuples already stored in the tuplestore or clear it before the next round.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
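To make the last point concrete, one way to avoid the accumulation would be to empty the store whenever the node is rescanned, roughly as in the sketch below; node->tupleStore is the field added by the patch under review, and the helper name is only an illustration, not part of the patch:

/* illustration only; not taken from the patch */
#include "postgres.h"

#include "nodes/execnodes.h"
#include "utils/tuplestore.h"

static void
reset_percent_tuplestore(LimitState *node)
{
    /*
     * Discard the previous round's copy of the subplan output, so that a
     * rescan (e.g. as the inner side of a nestloop) does not append yet
     * another duplicate set of tuples to the store.
     */
    if (node->tupleStore != NULL)
        tuplestore_clear(node->tupleStore);
}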
Re: POC: Cleaning up orphaned files using undo logs
On Tue, Aug 20, 2019 at 2:46 AM Andres Freund wrote: > > On 2019-08-19 17:52:24 +0530, Amit Kapila wrote: > > On Sat, Aug 17, 2019 at 10:58 PM Andres Freund wrote: > > > > > > > Well, I don't understand why you're on about this. We've discussed it > > > > a number of times but I'm still confused. > > > > > > There's two reasons here: > > > > > > The primary one in the context here is that if we do *not* have to lock > > > the buffers all ahead of time, we can simplify the interface. We > > > certainly can't lock the buffers over IO (due to buffer reclaim) as > > > we're doing right now, so we'd need another phase, called by the "user" > > > during undo insertion. But if we do not need to lock the buffers before > > > the insertion over all starts, the inserting location doesn't have to > > > care. > > > > > > Secondarily, all the reasoning for needing to lock all buffers ahead of > > > time was imo fairly unconvincing. Following the "recipe" for WAL > > > insertions is a good idea when writing a new run-of-the-mill WAL > > > inserting location - but when writing a new fundamental facility, that > > > already needs to modify how WAL works, then I find that much less > > > convincing. > > > > > > > One point to remember in this regard is that we do need to modify the > > LSN in undo pages after writing WAL, so all the undo pages need to be > > locked by that time or we again need to take the lock on them. > > Well, my main point, which so far has largely been ignored, was that we > may not acquire page locks when we still need to search for victim > buffers later. If we don't need to lock the pages up-front, but only do > so once we're actually copying the records into the undo pages, then we > don't a separate phase to acquire the locks. We can still hold all of > the page locks at the same time, as long as we just acquire them at the > later stage. > Okay, IIUC, this means that we should have a separate phase where we call LockUndoBuffers (or something like that) before InsertPreparedUndo and after PrepareUndoInsert. The LockUndoBuffers will lock all the buffers pinned during PrepareUndoInsert. We can probably call LockUndoBuffers before entering the critical section to avoid any kind of failure in critical section. If so, that sounds reasonable to me. > My secondary point was that *none* of this actually is > documented, even if it's entirely unobvious to the reader that the > relevant code can only run during WAL insertion, due to being pretty far > removed from that. > I think this can be clearly mentioned in README or someplace else. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
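For the archives, the ordering being converged on above can be summarized roughly as follows; LockUndoBuffers is the hypothetical extra phase mentioned in the message, the other undo functions come from the patch series, and their argument lists are omitted here:

/* ordering sketch only */
#include "postgres.h"

#include "miscadmin.h"

static void
undo_insert_ordering_sketch(void)
{
    /* 1. Reserve undo space and pin (but do not lock) the needed buffers. */
    /*    PrepareUndoInsert(...); */

    /*
     * 2. Lock the pinned buffers, still outside the critical section, so
     *    that a failure while locking cannot escalate into a PANIC.
     */
    /*    LockUndoBuffers(...);    -- hypothetical */

    START_CRIT_SECTION();

    /*
     * 3. Copy the prepared records into the locked pages, WAL-log them, and
     *    stamp the undo pages with the WAL record's LSN.
     */
    /*    InsertPreparedUndo(...); XLogInsert(...); PageSetLSN(...); */

    END_CRIT_SECTION();
}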