Re: Server crashed with dense_rank on partition table.
On 13 June 2018 at 17:55, Michael Paquier wrote:
> On Wed, Jun 13, 2018 at 11:08:38AM +0530, Rajkumar Raghuwanshi wrote:
>> postgres=# SELECT dense_rank(b) WITHIN GROUP (ORDER BY a) FROM pagg_tab
>> GROUP BY b ORDER BY 1;
>> server closed the connection unexpectedly
>> This probably means the server terminated abnormally
>> before or while processing the request.
>> The connection to the server was lost. Attempting reset: Failed.
>
> Indeed, thanks for the test case. This used to work in v10 but this is
> failing with v11 so I am adding an open item. The plans of the pre-10
> query and the query on HEAD are rather similar, and the memory context
> at execution time looks messed up.

Looks like some memory is being stomped on somewhere. bf6c614a2f2 (Do execGrouping.c via expression eval machinery, take two.) appears to be the first bad commit.

-- 
David Rowley    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: Server crashed with dense_rank on partition table.
Hi.

On 2018/06/13 14:55, Michael Paquier wrote:
> On Wed, Jun 13, 2018 at 11:08:38AM +0530, Rajkumar Raghuwanshi wrote:
>> postgres=# SELECT dense_rank(b) WITHIN GROUP (ORDER BY a) FROM pagg_tab
>> GROUP BY b ORDER BY 1;
>> server closed the connection unexpectedly
>> This probably means the server terminated abnormally
>> before or while processing the request.
>> The connection to the server was lost. Attempting reset: Failed.
>
> Indeed, thanks for the test case. This used to work in v10 but this is
> failing with v11 so I am adding an open item. The plans of the pre-10
> query and the query on HEAD are rather similar, and the memory context
> at execution time looks messed up.

Fwiw, I see that the crash can also occur even when using a non-partitioned table in the query, as shown in the following example, which reuses Rajkumar's test data and query:

create table foo (a int, b int, c text);
insert into foo select i%20, i%30, to_char(i%12, 'FM') from generate_series(0, 36) i;
select dense_rank(b) within group (order by a) from foo group by b order by 1;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

The following query in the regression test suite can also be made to crash by adding a group by clause:

select dense_rank(3) within group (order by x) from (values (1),(1),(2),(2),(3),(3),(4)) v(x) group by (x);
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

Looking at the core dump of this, it seems the following commit may be relevant:

commit bf6c614a2f2c58312b3be34a47e7fb7362e07bcb
Author: Andres Freund
Date:   Thu Feb 15 21:55:31 2018 -0800

    Do execGrouping.c via expression eval machinery, take two.

Thanks,
Amit
Re: Duplicate Item Pointers in Gin index
On Wed, Jun 13, 2018 at 3:32 PM, Peter Geoghegan wrote: > On Tue, Jun 12, 2018 at 11:01 PM, Masahiko Sawada > wrote: >> FWIW, I've looked at this again. I think that the situation Siva >> reported in the first mail can happen before we get commit 3b2787e. >> That is, gin indexes had had a data corruption bug. I've reproduced >> the situation with PostgreSQL 10.1 and observed that a gin index can >> corrupt. > > So, you've recreated the problem with Postgres from before 3b2787e, > but not after 3b2787e? Are you suggesting that 3b2787e might have > fixed it, or that it only hid the problem, or something else? I meant 3b2787e fixed it. I checked that at least the situation doesn't happen after 3b2787e. > How did you recreate the problem? Do you have a test case you can share? I recreated it by executing each steps step by step using gdb. So I can share the test case but it might not help. create extension pageinspect; create table g (c int[]); insert into g select ARRAY[1] from generate_series(1,1000); create index g_idx on g using gin (c); alter table g set (autovacuum_enabled = off); insert into g select ARRAY[1] from generate_series(1, 408); -- 408 items fit in exactly one page of pending list insert into g select ARRAY[1] from generate_series(1, 100); -- insert into 2nd page of pending list select n_pending_pages, n_pending_tuples from gin_metapage_info(get_raw_page('g_idx', 0)); insert into g select ARRAY[999]; -- insert into 2nd pending list page delete from g where c = ARRAY[999]; -- At this point, gin entry of 'ARRAY[999]' exists on 2nd page of pending list and deleted. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Index maintenance function for BRIN doesn't check RecoveryInProgress()
Hi,

Three functions: brin_summarize_new_values, brin_summarize_range and brin_desummarize_range can be called during recovery as follows.

=# select brin_summarize_new_values('a_idx');
ERROR: cannot acquire lock mode ShareUpdateExclusiveLock on database objects while recovery is in progress
HINT: Only RowExclusiveLock or less can be acquired on database objects during recovery.

I think we should raise a "recovery is in progress" error in this case rather than erroring out due to lock modes. Attached patch fixes them.

Regards,

-- 
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

brin_maintenance_func.patch
Description: Binary data
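[For reference, not the attached patch itself: a minimal sketch of the kind of guard such a patch would presumably add at the top of each of the three functions, following the pattern used by the WAL control functions. The helper name and hint wording are assumptions.]

#include "postgres.h"
#include "access/xlog.h"        /* RecoveryInProgress() */

/*
 * Hypothetical helper: each SQL-callable BRIN maintenance function would
 * call this before trying to take any locks.
 */
static void
brin_func_check_recovery(const char *funcname)
{
    if (RecoveryInProgress())
        ereport(ERROR,
                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                 errmsg("recovery is in progress"),
                 errhint("%s() cannot be executed during recovery.", funcname)));
}

/* e.g.: brin_func_check_recovery("brin_summarize_new_values"); */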
Re: Slow planning time for simple query
On 09.06.2018 22:49, Tom Lane wrote:
> Maksim Milyutin writes:
>> On hot standby I faced with the similar problem.
>> ... is planned 4.940 ms on master and *254.741* ms on standby.
>
> (I wonder though why, if you executed the same query on the master, its
> setting of the index-entry-is-dead bits didn't propagate to the standby.)

I have verified the number of dead item pointers (through the pageinspect extension) in the first leaf page of the index participating in the query ('main.message_instance_pkey') on the master and slave nodes and have noticed a big difference.

SELECT * FROM monitoring.bt_page_stats('main.message_instance_pkey', 3705);

On master:

 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
  3705 | l    |          1 |         58 |            24 |      8192 |      6496 |         0 |      3719 |    0 |         65

On standby:

 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
  3705 | l    |         59 |          0 |            24 |      8192 |      6496 |         0 |      3719 |    0 |          1

The vacuum routine improves the situation. Could it be that I have configured WAL logging or replication incorrectly?

> I wonder if we should extend the "SnapshotNonVacuumable" logic introduced
> in commit 3ca930fc3 so that in hot standby, *all* index entries are deemed
> non vacuumable. This would essentially get rid of long standby planning
> times in this sort of scenario by instead accepting worse (possibly much
> worse) planner range estimates. I'm unsure if that's a good tradeoff or not.

I applied the patch introduced in this commit to the standby (not the master; I don't know if this is correct) and haven't noticed any differences.

-- 
Regards,
Maksim Milyutin
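[To make Tom's suggestion above concrete, the change might look roughly like this. This is a sketch only, not a tested patch; whether keying on RecoveryInProgress() here is the right trade-off is exactly the open question.]

#include "postgres.h"
#include "access/xlog.h"        /* RecoveryInProgress() */
#include "utils/tqual.h"

bool
HeapTupleSatisfiesNonVacuumable(HeapTuple htup, Snapshot snapshot,
                                Buffer buffer)
{
    /*
     * Sketch: on a hot standby, deem every index entry non-vacuumable,
     * trading worse planner range estimates for bounded planning time.
     */
    if (RecoveryInProgress())
        return true;

    return HeapTupleSatisfiesVacuum(htup, snapshot->xmin, buffer)
        != HEAPTUPLE_DEAD;
}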
Re: Index maintenance function for BRIN doesn't check RecoveryInProgress()
On Wed, Jun 13, 2018 at 2:28 PM, Masahiko Sawada wrote: > Hi, > > Three functions: brin_summarize_new_values, brin_summarize_range and > brin_desummarize_range can be called during recovery as follows. > > =# select brin_summarize_new_values('a_idx'); > ERROR: cannot acquire lock mode ShareUpdateExclusiveLock on database > objects while recovery is in progress > HINT: Only RowExclusiveLock or less can be acquired on database > objects during recovery. > > I think we should complaint "recovery is in progress" error in this > case rather than erroring due to lock modes. +1 -- Thanks & Regards, Kuntal Ghosh EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors
On 10-06-2018 10:38, Fabien COELHO wrote: Hello Marina, Hello! v9-0003-Pgbench-errors-use-the-ereport-macro-to-report-de.patch - a patch for the ereport() macro (this is used to report client failures that do not cause an aborts and this depends on the level of debugging). ISTM that abort() is called under FATAL. If you mean abortion of the client, this is not an abortion of the main program. - implementation: if possible, use the local ErrorData structure during the errstart()/errmsg()/errfinish() calls. Otherwise use a static variable protected by a mutex if necessary. To do all of this export the function appendPQExpBufferVA from libpq. This patch applies cleanly on top of the other ones (there are minimal interactions), compiles cleanly, global & pgbench "make check" are ok. :-) IMO this patch is more controversial than the other ones. It is not really related to the aim of the patch series, which could do without, couldn't it? I'd suggest that it should be an independent submission, unrelated to the pgbench error management patch. I suppose that this is related; because of my patch there may be a lot of such code (see v7 in [1]): - fprintf(stderr, - "malformed variable \"%s\" value: \"%s\"\n", - var->name, var->svalue); + if (debug_level >= DEBUG_FAILS) + { + fprintf(stderr, + "malformed variable \"%s\" value: \"%s\"\n", + var->name, var->svalue); + } - if (debug) + if (debug_level >= DEBUG_ALL) fprintf(stderr, "client %d sending %s\n", st->id, sql); That's why it was suggested to make the error function which hides all these things (see [2]): There is a lot of checks like "if (debug_level >= DEBUG_FAILS)" with corresponding fprintf(stderr..) I think it's time to do it like in the main code, wrap with some function like log(level, msg). Moreover, it changes pgbench current behavior, which might be admissible, but should be discussed clearly. The semantics of the existing code is changed, the FATAL levels calls abort() and replace existing exit(1) calls. Maybe you want an ERROR level as well. Oh, thanks, I agree with you. And I do not want to change the program exit code without good reasons, but I'm sorry I may not know all pros and cons in this matter.. Or did you also mean other changes? The code adapts/duplicates existing server-side "ereport" stuff and brings it to the frontend, where the logging needs are somehow quite different. I'd prefer to avoid duplication and/or have some code sharing. I was recommended to use the same interface in [3]: On elog/errstart: we already have a convention for what ereport() calls look like; I suggest to use that instead of inventing your own. If it really needs to be duplicated, I'd suggest to put all this stuff in separated files. If we want to do that, I think that it would belong to fe_utils, and where it could/should be used by all front-end programs. I'll try to do it.. I do not understand why names are changed, eg ELEVEL_FATAL instead of FATAL. ISTM that part of the point of the move would be to be homogeneous, which suggests that the same names should be reused. Ok! For logging purposes, ISTM that the "elog" macro interface is nicer, closer to the existing "fprintf(stderr", as it would not introduce the additional parentheses hack for "rest". I was also recommended to use ereport() instead of elog() in [3]: With that, is there a need for elog()? In the backend we have it because $HISTORY but there's no need for that here -- I propose to lose elog() and use only ereport everywhere. 
I see no actual value in creating on the fly a dynamic buffer through plenty of macros and functions when the end result is just to print the message out to stderr in the end:

errfinishImpl: fprintf(stderr, "%s", error->message.data);

This looks like overkill. From reading the code, this does not look like an improvement:

fprintf(stderr, "invalid socket: %s", PQerrorMessage(st->con));

vs

ereport(ELEVEL_LOG, (errmsg("invalid socket: %s", PQerrorMessage(st->con))));

The whole complexity of the server-side interface only makes sense because of the TRY/CATCH stuff and complex logging requirements (eg several outputs) in the backend. The patch adds quite some code and complexity without clear added value that I can see. My 0.02€: maybe you just want to turn

fprintf(stderr, format, ...); // then possibly exit or abort depending...

into

elog(level, format, ...);

which maybe would exit or abort depending on level, and possibly not actually report under some levels and/or some conditions. For that, it could be enough to just provide a nice "elog" function. I agree
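[For what it's worth, a minimal frontend "elog" along those lines could be as small as this. The level names, threshold variable and exit/abort behaviour are assumptions for illustration, not the submitted patch.]

#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

typedef enum { DEBUG, LOG, ERROR, FATAL } elevel;

static elevel log_min_level = LOG;      /* would be set from pgbench's debug options */

/* Print the message if the level is high enough; exit on ERROR, abort on FATAL. */
static void
elog(elevel level, const char *fmt, ...)
{
    va_list ap;

    if (level >= log_min_level)
    {
        va_start(ap, fmt);
        vfprintf(stderr, fmt, ap);
        va_end(ap);
    }
    if (level == ERROR)
        exit(1);
    if (level == FATAL)
        abort();
}

/* usage, e.g.: elog(LOG, "invalid socket: %s\n", PQerrorMessage(st->con)); */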
Re: Index maintenance function for BRIN doesn't check RecoveryInProgress()
On Wed, Jun 13, 2018 at 12:48 PM Kuntal Ghosh wrote: > On Wed, Jun 13, 2018 at 2:28 PM, Masahiko Sawada > wrote: > > Hi, > > > > Three functions: brin_summarize_new_values, brin_summarize_range and > > brin_desummarize_range can be called during recovery as follows. > > > > =# select brin_summarize_new_values('a_idx'); > > ERROR: cannot acquire lock mode ShareUpdateExclusiveLock on database > > objects while recovery is in progress > > HINT: Only RowExclusiveLock or less can be acquired on database > > objects during recovery. > > > > I think we should complaint "recovery is in progress" error in this > > case rather than erroring due to lock modes. > +1 +1, but current behavior doesn't seem to be bug, but rather not precise enough error reporting. So, I think we shouldn't consider backpatching this. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Partitioning with temp tables is broken
Hi,

It seems temp tables with partitioned tables are not working correctly. 9140cf8269b0c has not considered that in build_simple_rel() an AppendRelInfo could be missing due to it having been skipped in expand_partitioned_rtentry().

Assert(cnt_parts == nparts); fails in build_simple_rel, and in a build without asserts enabled it just crashes later when trying to dereference a NULL pointer in prune_append_rel_partitions() or apply_scanjoin_target_to_paths(), depending on whether partition pruning is used and does not prune the relation.

There's also something pretty weird around the removal of the temp relation from the partition bound. I've had cases where the session that attached the temp table is long gone, but \d+ shows the table is still there and I can't attach another partition due to an overlap, and can't drop the temp table due to the session not existing anymore. I've not got a test case for that one yet, but the test case for the crash is:

-- Session 1:
create table listp (a int) partition by list(a);
create table listp1 partition of listp for values in(1);
create temp table listp2 partition of listp for values in (2);

-- Session 2:
select * from listp;

We might want to create the RelOptInfo and somehow set it as a dummy rel, or consider setting it to NULL and skipping the NULLs. I'd rather see the latter for the future, but for PG11 we might prefer the former. I've not looked into how this would be done or if it can be done sanely.

I'll add this to the open items list.

-- 
David Rowley    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On 06/11/2018 11:22 AM, Masahiko Sawada wrote: On Fri, May 25, 2018 at 8:41 PM, Moon, Insung wrote: Hello Hackers, This propose a way to develop "Table-level" Transparent Data Encryption (TDE) and Key Management Service (KMS) support in PostgreSQL. ... As per discussion at PGCon unconference, I think that firstly we need to discuss what threats we want to defend database data against. If user wants to defend against a threat that is malicious user who logged in OS or database steals an important data on datbase this design TDE would not help. Because such user can steal the data by getting a memory dump or by SQL. That is of course differs depending on system requirements or security compliance but what threats do you want to defend database data against? and why? I do agree with this - a description of the threat model needs to be part of the design discussion, otherwise it's not possible to compare it to alternative solutions (e.g. full-disk encryption using LUKS or using existing privilege controls and/or RLS). TDE was proposed/discussed repeatedly in the past, and every time it died exactly because it was not very clear which issue it was attempting to solve. Let me share some of the issues mentioned as possibly addressed by TDE (I'm not entirely sure TDE actually solves them, I'm just saying those were mentioned in previous discussions): 1) enterprise requirement - Companies want in-database encryption, for various reasons (because "enterprise solution" or something). 2) like FDE, but OS/filesystem independent - Same config on any OS and filesystem, which may make maintenance easier. 3) does not require special OS/filesystem setup - Does not require help from system adminitrators, setup of LUKS devices or whatever. 4) all filesystem access (basebackups/rsync) is encrypted anyway 5) solves key management (the main challenge with pgcrypto) 6) allows encrypting only some of the data (tables, columns) to minimize performance impact IMHO it makes sense to have TDE even if it provides the same "security" as disk-level encryption, assuming it's more convenient to setup/use from the database. Also, if I understand correctly, at unconference session there also were two suggestions about the design other than the suggestion by Alexander: implementing TDE at column level using POLICY, and implementing TDE at table-space level. The former was suggested by Joe but I'm not sure the detail of that suggestion. I'd love to hear the deal of that suggestion. The latter was suggested by Tsunakawa-san. Have you considered that? You mentioned that encryption of temporary data for query processing and large objects are still under the consideration. But other than them you should consider the temporary data generated by other subsystems such as reorderbuffer and transition table as well. The severity of those limitations is likely related to the threat model. I don't think encrypting temporary data would be a big problem, assuming you know which key to use. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
WAL prefetch
There was a very interesting presentation at PGCon about pg_prefaulter: http://www.pgcon.org/2018/schedule/events/1204.en.html
But it is implemented in Go and uses pg_waldump. I tried to do the same but using the built-in Postgres WAL traverse functions. I have implemented it as an extension for simplicity of integration. In principle it can be started as a BG worker.

First of all I tried to estimate the effect of preloading data. I have implemented a prefetch utility which is also attached to this mail. It performs random reads of blocks of some large file and spawns some number of prefetch threads:

Just normal read without prefetch:
./prefetch -n 0 SOME_BIG_FILE

One prefetch thread which uses pread:
./prefetch SOME_BIG_FILE

One prefetch thread which uses posix_fadvise:
./prefetch -f SOME_BIG_FILE

4 prefetch threads which use posix_fadvise:
./prefetch -f -n 4 SOME_BIG_FILE

Based on these experiments (on my desktop), I made the following conclusions:

1. Prefetching on HDD doesn't give any positive effect.
2. Using posix_fadvise can speed up random reads on SSD by up to 2 times.
3. posix_fadvise(WILLNEED) is more efficient than performing normal reads.
4. Calling posix_fadvise from more than one thread makes no sense.

I have tested wal_prefetch on two powerful servers with 24 cores, a 3Tb NVMe RAID 10 storage device and 256Gb of RAM, connected using InfiniBand. The speed of synchronous replication between the two nodes increased from 56k TPS to 60k TPS (on pgbench with scale 1000).

Usage:
1. At master: create extension wal_prefetch
2. At replica: call the pg_wal_prefetch() function: it will not return until you interrupt it.

The pg_wal_prefetch function will infinitely traverse WAL and prefetch block references in WAL records using the posix_fadvise(WILLNEED) system call. It is possible to explicitly specify a start LSN for the pg_wal_prefetch() function. Otherwise, the WAL redo position will be used as the start LSN.
-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <unistd.h>
#include <fcntl.h>
#include <pthread.h>
#include <assert.h>
#include <time.h>

#define BLOCK_SIZE  8192
#define INIT_SEED   1999
#define MAX_THREADS 8

size_t file_size;
char*  file_name;
size_t n_prefetchers = 1;
int    use_fadvise;
int    n_iterations = 10*1024*1024;

/* Produce a pseudo-random, block-aligned offset within the file */
static off_t random_offset(uint64_t* seed)
{
    off_t rnd = (3141592621u * *seed + 2718281829u) % 17u;
    *seed = rnd;
    return rnd % file_size * BLOCK_SIZE;
}

/* Read random blocks and report the achieved read bandwidth */
void reader(void)
{
    uint64_t seed = INIT_SEED;
    char page[BLOCK_SIZE];
    time_t start = time(NULL);
    int i;
    int fd = open(file_name, O_RDONLY);
    assert(fd >= 0);
    for (i = 0; i < n_iterations; i++)
    {
        off_t offs = random_offset(&seed);
        ssize_t rc = pread(fd, page, sizeof page, offs);
        time_t now;
        assert(rc == BLOCK_SIZE);
        now = time(NULL);
        if (i % 1024 == 0 && now != start)
        {
            printf("%d: %.2f Mb/sec \r", i/1024,
                   (double)(i+1)*BLOCK_SIZE/1024/1024/(now - start));
            fflush(stdout);
        }
    }
}

/* Walk the same pseudo-random sequence ahead of the reader, either with
 * posix_fadvise(WILLNEED) or with plain pread */
void* prefetcher(void* arg)
{
    size_t id = (size_t)arg;
    uint64_t seed = INIT_SEED;
    char page[BLOCK_SIZE];
    int fd = open(file_name, O_RDONLY);
    int i;
    assert(fd >= 0);
    for (i = 0;; i++)
    {
        off_t offs = random_offset(&seed);
        if (i % n_prefetchers == id)
        {
            if (use_fadvise)
            {
                int rc = posix_fadvise(fd, offs, BLOCK_SIZE, POSIX_FADV_WILLNEED);
                assert(rc == 0);
            }
            else
            {
                ssize_t rc = pread(fd, page, sizeof page, offs);
                assert(rc == BLOCK_SIZE);
            }
        }
    }
    return 0;
}

int main(int argc, char* argv[])
{
    pthread_t prefetchers[MAX_THREADS];
    int i;
    int fd;

    for (i = 1; i < argc; i++)
    {
        if (argv[i][0] == '-')
        {
            switch (argv[i][1])
            {
              case 'f':
                use_fadvise = 1;
                continue;
              case 'n':
                n_prefetchers = atoi(argv[++i]);
                continue;
              case 'i':
                n_iterations = atoi(argv[++i]);
                continue;
              default:
              help:
                fprintf(stderr, "prefetch [-f] [-n THREADS] [-i ITERATIONS] file\n");
                return 1;
            }
        }
        else
        {
            file_name = argv[i];
        }
    }
    if (file_name == NULL)
    {
        goto help;
    }
    fd = open(file_name, O_RDONLY);
    assert(fd >= 0);
    file_size = lseek(fd, 0, SEEK_END)/BLOCK_SIZE;
    assert(file_size != 0);

    /* spawn the prefetch threads, then run the timed reader in this thread */
    for (i = 0; i < n_prefetchers; i++)
    {
        pthread_create(&prefetchers[i], NULL, prefetcher, (void*)(size_t)i);
    }
    reader();
    puts("\nDone");
    return 0;
}

wal_prefetch.tgz
Description: application/compressed-tar
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
Hi,

On 05/25/2018 01:41 PM, Moon, Insung wrote:
> Hello Hackers,
> ...
> BTW, I want to support CBC mode encryption[3]. However, I'm not sure how
> to use the IV in CBC mode for this proposal. I'd like to hear opinions by
> security engineer.

I'm not a cryptographer either, but this is exactly where you need a prior discussion about the threat models - there are a couple of chaining modes, each with different weaknesses.

FWIW it may also matter if data_checksums are enabled, because that may prevent malleability attacks affecting some of the modes. Assuming an active attacker (with the ability to modify the data files) is part of the threat model, of course.

regards

-- 
Tomas Vondra    http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
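[To make the IV question concrete, here is a sketch using OpenSSL's EVP interface to encrypt one page with AES-256-CBC. Deriving the IV from the block number is purely an assumption for illustration, not a recommendation: it keeps the IV unique per block but also predictable, which is exactly the kind of property the threat-model discussion has to settle, along with CBC's malleability in the absence of integrity protection.]

#include <stdint.h>
#include <string.h>
#include <openssl/evp.h>

/* Sketch: encrypt one 8kB page with AES-256-CBC and a per-block IV. */
static int
encrypt_page_cbc(const unsigned char key[32], uint32_t blkno,
                 const unsigned char *page, int page_len,
                 unsigned char *out)        /* must have room for page_len + 16 */
{
    unsigned char iv[16] = {0};
    int len, outlen = -1;
    EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();

    if (ctx == NULL)
        return -1;
    memcpy(iv, &blkno, sizeof(blkno));      /* hypothetical IV scheme: block number */

    if (EVP_EncryptInit_ex(ctx, EVP_aes_256_cbc(), NULL, key, iv) == 1 &&
        EVP_EncryptUpdate(ctx, out, &len, page, page_len) == 1)
    {
        outlen = len;
        if (EVP_EncryptFinal_ex(ctx, out + len, &len) == 1)
            outlen += len;
        else
            outlen = -1;
    }
    EVP_CIPHER_CTX_free(ctx);
    return outlen;
}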
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On 06/11/2018 05:22 AM, Masahiko Sawada wrote: > As per discussion at PGCon unconference, I think that firstly we need > to discuss what threats we want to defend database data against. Exactly. While certainly there is demand for encryption for the sake of "checking a box", different designs will defend against different threats, and we should be clear on which ones we are trying to protect against for any particular design. > Also, if I understand correctly, at unconference session there also > were two suggestions about the design other than the suggestion by > Alexander: implementing TDE at column level using POLICY, and > implementing TDE at table-space level. The former was suggested by Joe > but I'm not sure the detail of that suggestion. I'd love to hear the > deal of that suggestion. The idea has not been extensively fleshed out yet, but the thought was that we create column level POLICY, which would transparently apply some kind of transform on input and/or output. The transforms would presumably be expressions, which in turn could use functions (extension or builtin) to do their work. That would allow encryption/decryption, DLP (data loss prevention) schemes (masking, redacting), etc. to be applied based on the policies. This, in and of itself, would not address key management. There is probably a separate need for some kind of built in key management -- perhaps a flexible way to integrate with external systems such as Vault for example, or maybe something self contained, or perhaps both. Or maybe key management is really tied into the separately discussed effort to create SQL VARIABLEs somehow. In any case certainly a lot of room for discussion. Joe -- Crunchy Data - http://crunchydata.com PostgreSQL Support for Secure Enterprises Consulting, Training, & Open Source Development
Re: Duplicate Item Pointers in Gin index
On Wed, Jun 13, 2018 at 11:40 AM Masahiko Sawada wrote: > > On Wed, Jun 13, 2018 at 3:32 PM, Peter Geoghegan wrote: > > On Tue, Jun 12, 2018 at 11:01 PM, Masahiko Sawada > > wrote: > >> FWIW, I've looked at this again. I think that the situation Siva > >> reported in the first mail can happen before we get commit 3b2787e. > >> That is, gin indexes had had a data corruption bug. I've reproduced > >> the situation with PostgreSQL 10.1 and observed that a gin index can > >> corrupt. > > > > So, you've recreated the problem with Postgres from before 3b2787e, > > but not after 3b2787e? Are you suggesting that 3b2787e might have > > fixed it, or that it only hid the problem, or something else? > > I meant 3b2787e fixed it. I checked that at least the situation > doesn't happen after 3b2787e. I also think that 3b2787e should fix such problems. After 3b2787e, vacuum is forced to cleanup all pending list entries, which were inserted before vacuum start. So, vacuum should have everything to be vaccumed merged into posting lists/trees. > > How did you recreate the problem? Do you have a test case you can share? > > I recreated it by executing each steps step by step using gdb. So I > can share the test case but it might not help. > > create extension pageinspect; > create table g (c int[]); > insert into g select ARRAY[1] from generate_series(1,1000); > create index g_idx on g using gin (c); > alter table g set (autovacuum_enabled = off); > insert into g select ARRAY[1] from generate_series(1, 408); -- 408 > items fit in exactly one page of pending list > insert into g select ARRAY[1] from generate_series(1, 100); -- insert > into 2nd page of pending list > select n_pending_pages, n_pending_tuples from > gin_metapage_info(get_raw_page('g_idx', 0)); > insert into g select ARRAY[999]; -- insert into 2nd pending list page > delete from g where c = ARRAY[999]; > -- At this point, gin entry of 'ARRAY[999]' exists on 2nd page of > pending list and deleted. Is this test case completed? It looks like there should be a continuation with concurrent vacuum and insertions managed by gdb... -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: why partition pruning doesn't work?
On Wed, Jun 13, 2018 at 12:15 AM, Tom Lane wrote: >> As for whether to change it at this point in the release cycle, I >> guess that, to me, depends on how invasive the fix is. > > It seems not to be that bad: we just need a shutdown call for the > PartitionPruneState, and then we can remember the open relation there. > The attached is based on David's patch from yesterday. Seems reasonable. Really, I think we should look for a way to hang onto the relation at the point where it's originally opened and locked instead of reopening it here. But that's probably more invasive than we can really justify right at the moment, and I think this is a step in a good direction. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Partitioning with temp tables is broken
On Wed, Jun 13, 2018 at 5:36 PM, David Rowley wrote:
> Hi,
>
> It seems temp tables with partitioned tables are not working correctly.
> 9140cf8269b0c has not considered that in build_simple_rel() an
> AppendRelInfo could be missing due to it having been skipped in
> expand_partitioned_rtentry().
>
> Assert(cnt_parts == nparts); fails in build_simple_rel, and in a build
> without asserts enabled it just crashes later when trying to dereference
> a NULL pointer in prune_append_rel_partitions() or
> apply_scanjoin_target_to_paths(), depending on whether partition pruning
> is used and does not prune the relation.
>
> There's also something pretty weird around the removal of the temp
> relation from the partition bound. I've had cases where the session
> that attached the temp table is long gone, but \d+ shows the table is
> still there and I can't attach another partition due to an overlap,
> and can't drop the temp table due to the session not existing anymore.
> I've not got a test case for that one yet, but the test case for the
> crash is:
>
> -- Session 1:
> create table listp (a int) partition by list(a);
> create table listp1 partition of listp for values in(1);
> create temp table listp2 partition of listp for values in (2);
>
> -- Session 2:
> select * from listp;

The culprit here is the following code in expand_partitioned_rtentry():

    /* As in expand_inherited_rtentry, skip non-local temp tables */
    if (RELATION_IS_OTHER_TEMP(childrel))
    {
        heap_close(childrel, lockmode);
        continue;
    }

> We might want to create the RelOptInfo and somehow set it as a dummy
> rel, or consider setting it to NULL and skipping the NULLs. I'd rather
> see the latter for the future, but for PG11 we might prefer the
> former. I've not looked into how this would be done or if it can be
> done sanely.

A temporary table is a base relation. In order to create a RelOptInfo for a base relation we need to have an entry available for it in query->rtables and simple_rel_array. We need a corresponding RTE and AppendRelInfo so that we can reserve an entry there. Somehow this AppendRelInfo or the RTE should convey that it's a temporary relation from the other session and mark the RelOptInfo empty when it gets created in build_simple_rel() or when it goes into the part_rels array. We would need to access metadata about the temporary table while building the RelOptInfo. That may be fine. I haven't checked.

I haven't thought about setting NULL in the part_rels array. That may be safer than the first option since the places where we scan part_rels are fewer and straightforward.

But having a temporary table as a partition has other effects too, e.g.

\d+ t1
                                   Table "public.t1"
 Column |  Type   | Collation | Nullable | Default | Storage | Stats target | Description
--------+---------+-----------+----------+---------+---------+--------------+-------------
 a      | integer |           |          |         | plain   |              |
 b      | integer |           |          |         | plain   |              |
Partition key: RANGE (a)
Indexes:
    "t1_a" btree (a)
Partitions: t1p1 FOR VALUES FROM (0) TO (100),
            t1p2 FOR VALUES FROM (100) TO (200)

Both t1p1 and t1p2 are regular tables.
create a temporary table as default partition from some other session and insert data

create temporary table t1p3 partition of t1 default;
insert into t1 values (350, 1);

now try creating a partition from a session other than the one where the temporary table was created

create table t1p4 partition of t1 for values from (300) to (400);
ERROR: cannot access temporary tables of other sessions

The reason that happens is that when creating a regular partition we scan the default partition to check whether the regular partition has any data that belongs to the partition being created. A similar error will result if we try to insert a row into the partitioned table which belongs to a temporary partition from another session. I think it's debatable whether that's a bug or not. But it's not as bad as a crash.

So, a solution to fix the crash in your reproduction is to just remove the code in expand_partitioned_rtentry() which skips the temporary tables from the other session:

    /* As in expand_inherited_rtentry, skip non-local temp tables */
    if (RELATION_IS_OTHER_TEMP(childrel))
    {
        heap_close(childrel, lockmode);
        continue;
    }

If we do that, we will see the above error when the partition corresponding to the temporary table from the other session is scanned, i.e. if such a partition is not pruned.

But a larger question is what use such temporary partitions are. Should we just prohibit adding temporary partitions to a permanent partitioned table? We should allow adding temporary partitions to a temporary partitioned table only if they both belong to the same session.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
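[For what it's worth, a sketch of what such a prohibition could look like. Where the check would live (e.g. DefineRelation()/ATExecAttachPartition()), the helper name and the exact message wording are assumptions here.]

#include "postgres.h"
#include "utils/rel.h"

/* Sketch: refuse to mix temporary and permanent relations in one partition tree. */
static void
check_partition_persistence(Relation parent, Relation child)
{
    if (parent->rd_rel->relpersistence == RELPERSISTENCE_TEMP)
    {
        if (RELATION_IS_OTHER_TEMP(parent))
            ereport(ERROR,
                    (errcode(ERRCODE_WRONG_OBJECT_TYPE),
                     errmsg("cannot create a partition of a temporary relation of another session")));
        if (child->rd_rel->relpersistence != RELPERSISTENCE_TEMP)
            ereport(ERROR,
                    (errcode(ERRCODE_WRONG_OBJECT_TYPE),
                     errmsg("cannot create a permanent relation as partition of temporary relation \"%s\"",
                            RelationGetRelationName(parent))));
    }
    else if (child->rd_rel->relpersistence == RELPERSISTENCE_TEMP)
        ereport(ERROR,
                (errcode(ERRCODE_WRONG_OBJECT_TYPE),
                 errmsg("cannot attach a temporary relation as partition of permanent relation \"%s\"",
                        RelationGetRelationName(parent))));
}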
Re: Spilling hashed SetOps and aggregates to disk
On Mon, Jun 11, 2018 at 1:50 PM, Jeff Davis wrote: > I think average group size is the wrong way to look at it, because > averages are only useful for a normal distribution. The two most > interesting cases are: > > 1. Fairly uniform group size (e.g. normal dist) > 2. Skew values, power law distribution, etc., where some groups are > very large and most are tiny by comparison. I am calling the very large > groups "skewed groups". I wasn't using the term "average" in a mathematical sense. I suppose I really meant "typical". I agree that thinking about cases where the group size is nonuniform is a good idea, but I don't think I agree that all uniform distributions are created equal. Uniform distributions with 1 row per group are very different than uniform distributions with 1000 rows per group. > If we get the skewed groups into the hash table, and avoid OOM, I think > that is a success. My patch did that, except it didn't account for two > cases: > > (a) clustered input, where skewed groups may not appear until the > hash table is already full; and > (b) variable-sized transition values that grow with the group size I think that many of the algorithms under consideration could be implemented without worrying about variable-sized transition values, and then the approach could be extended later. However, whether or not a given algorithm can handle clustered input seems like a fairly basic property of the algorithm. I don't think we should set the bar too high because no algorithm is going to be perfect in every case; at the same time, clustered input is pretty common in real-world scenarios, and an algorithm that handles such cases better has a significant leg up over one that can't, all other things being equal. I'm not sure I remember precisely what your proposal was any more. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: hot_standby_feedback vs excludeVacuum and snapshots
On 11 June 2018 at 17:56, Andres Freund wrote: > I don't think this is a good idea. We shouldn't continue down the path > of having running xacts not actually running xacts, but rather go back > to including everything. The case presented in the thread didn't > actually do what it claimed originally, and there's a fair amount of > potential for the excluded xids to cause problems down the line. > > Especially not when the fixes should be backpatched. I think the > earlier patch should be reverted, and then the AEL lock release problem > should be fixed separately. Since Greg has not reappeared to speak either way, I agree we should revert, though I will add comments to document this. I will do this today. Looks like we would need a multi-node isolation tester to formally test the AEL lock release, so I won't add tests for that. -- Simon Riggshttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: why partition pruning doesn't work?
Robert Haas writes: > Seems reasonable. Really, I think we should look for a way to hang > onto the relation at the point where it's originally opened and locked > instead of reopening it here. But that's probably more invasive than > we can really justify right at the moment, and I think this is a step > in a good direction. The existing coding there makes me itch a bit, because there's only a rather fragile line of reasoning justifying the assumption that there is a pre-existing lock at all. So I'd be in favor of what you suggest just to get rid of the "open(NoLock)" hazard. But I agree that it'd be rather invasive and right now is probably not the time for it. regards, tom lane
Re: PostgreSQL vs SQL Standard
On Sun, Jun 10, 2018 at 11:32 AM, Tom Lane wrote: > Andrew Gierth writes: >> I beat at the grammar a bit to see what it would take to fix it at least >> to the extent of allowing a_expr ColId in a select list after removing >> postfix ops. It looked like it was doable by making these keywords more >> reserved (all of which are already reserved words per spec): >> DOUBLE, DAY, FILTER, HOUR, MINUTE, MONTH, OVER, PRECISION, SECOND, >> VARYING, WITHIN, WITHOUT, YEAR > > Yeah, a side effect of allowing "a_expr ColId" is that we can expect, > going forward, that a lot of new keywords are going to have to become > fully reserved that otherwise wouldn't have to be. This is particularly > a problem because of the SQL committee's COBOL-hangover tendency to > invent new syntax that involves sequences of keywords; we usually > don't have a good way to deal with that short of making the first > keyword(s) reserved. > > It's arguable that going down that path will, in the long run, lead to > breaking more applications (via their table/column names suddenly becoming > reserved in some new version) than we rescue from having to quote their > SELECT aliases. At the very least we need to recognize that this is far > from cost-free. It depends on the source of those applications. Applications originally written for PostgreSQL are going to break if we reserve those keywords. Applications originally written for other databases that already reserve those words won't, and migrating to PostgreSQL will become easier for those people. I think this is to some extent a question about whether it's more important to have backward compatibility with our own previous releases or compatibility with other SQL database systems. > (wanders away wondering exactly what parsing technology the SQL committee > thinks implementations use...) I wonder about this, too, because I've noticed that one other system in particular seems to do parsing in a quite different manner than we do (but I have no idea specifically how they do it). bison is a good tool in many ways and converting to something else would be a huge amount of work, but there are also a number of things about it that are not so great: - The fact that conflicts can only be resolved with precedence declarations sucks. You can only set the priority of rule A relative to rule B unless they've both got a precedence assigned, and there's frequently no non-invasive way to accomplish that. It would be nice to be able to mark the postfix operator production as "weak" so that it's only used when no other interpretation is possible. Of course this kind of thing could be overdone leading to a confusing grammar, but it's not a bad idea to be able to do things like this to handle corner cases, at least IMHO. - It really only handles LR(1), but for certain cases more tokens of lookahead would be really useful. NULLS_LA and NOT_LA are nasty hacks to work around the lack of adequate lookahead, and there are other cases where we could do nicer things if we had it. - There's no way to modify the parser at runtime, which means our extension facility can't invent new SQL syntax, not even e.g. CREATE THINGY and DROP THINGY. (Our lack of support for thingys is really holding us back.) - The release cadence is extremely slow. They just released 3.0.5, but 3.0.4 came out in 2015 and 3.0.3 in 2013. It's hard to imagine bison going away completely, but it's also pretty hard to imagine it ever having any new features that would help us. 
And if we find a bug, waiting for the next release to fix it is going to be painful. To all appearances, it's only a little ahead of abandonware. I'm not sure that there's a better tool than bison out there and I'm not volunteering to write one in my spare time, but if something turns up, we might not want to dismiss it out of hand. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Spilling hashed SetOps and aggregates to disk
On Tue, Jun 5, 2018 at 5:05 AM Tomas Vondra wrote: > > On 06/05/2018 07:46 AM, Jeff Davis wrote: > > On Tue, 2018-06-05 at 07:04 +0200, Tomas Vondra wrote: > >> I expect the eviction strategy to be the primary design challenge of > >> this patch. The other bits will be mostly determined by this one > >> piece. > > > > Not sure I agree that this is the primary challenge. > > > > The cases that benefit from eviction are probably a minority. I see two > > categories that would benefit: > > > >* Natural clustering in the heap. This sounds fairly common, but a > > lot of the cases that come to mind are too low-cardinality to be > > compelling; e.g. timestamps grouped by hour/day/month. If someone has > > run into a high-cardinality natural grouping case, let me know. > >* ARRAY_AGG (or similar): individual state values can be large enough > > that we need to evict to avoid exceeding work_mem even if not adding > > any new groups. > > > > In either case, it seems like a fairly simple eviction strategy would > > work. For instance, we could just evict the entire hash table if > > work_mem is exceeded or if the hit rate on the hash table falls below a > > certain threshold. If there was really something important that should > > have stayed in the hash table, it will go back in soon anyway. > > > > So why should eviction be a major driver for the entire design? I agree > > it should be an area of improvement for the future, so let me know if > > you see a major problem, but I haven't been as focused on eviction. > > > > My concern is more about what happens when the input tuple ordering is > inherently incompatible with the eviction strategy, greatly increasing > the amount of data written to disk during evictions. > > Say for example that we can fit 1000 groups into work_mem, and that > there are 2000 groups in the input data set. If the input is correlated > with the groups, everything is peachy because we'll evict the first > batch, and then group the remaining 1000 groups (or something like > that). But if the input data is random (which can easily happen, e.g. > for IP addresses, UUIDs and such) we'll hit the limit repeatedly and > will evict much sooner. > > I know you suggested simply dumping the whole hash table and starting > from scratch while we talked about this at pgcon, but ISTM it has > exactly this issue. > > But I don't know if there actually is a better option - maybe we simply > have to accept this problem. After all, running slowly is still better > than OOM (which may or may not happen now). > > I wonder if we can somehow detect this at run-time and maybe fall-back > to groupagg. E.g. we could compare number of groups vs. number of input > tuples when we first hit the limit. It's a rough heuristics, but maybe > sufficient. I've been applying a strategy like that to do massive streaming aggregation quite successfully. The code I have is in python and in a private repo. There have been talks of both opensourcing it, and of porting it into postgres as a kind of aggregate node, because it sounds like a good idea. It seems very related to this thread. In essence, the technique I've been using always uses a fixed-size hash table to do partial grouping. The table is never grown, when collisions do happen, the existing entry in the hash table is flushed to disk and the aggregate state in that bucket reset for the incoming key. To avoid excessive spilling due to frequent collisions, we use a kind of "lazy cuckoo" hash table. 
Lazy in the sense that it does no relocations, it just checks 2 hash values, and if it has to evict, it evicts the "oldest" value (with some cheap definition of "old"). The technique works very well to reduce temporary data size with a fixed amount of working memory. The resulting spill file can then be processed again to finalize the computation. What I was pondering PG could do, is feed the spilled tuples to a sort node, using the partial hash aggregation as a data-reducing node only. scan -> partial hash agg -> sort -> final group agg The group agg would have to know to consume and combine aggregate states instead of producing them, but in essence it seems a relatively efficient process. An adaptive hash agg node would start as a hash agg, and turn into a "partial hash agg + sort + final group agg" when OOM is detected. The benefit over ordinary sort+group agg is that the sort is happening on a potentially much smaller data set. When the buffer is large enough to capture enough key repetitions, the output of the partial hash agg can be orders of magnitude smaller than the scan. My proposal was to use this for all group aggs, not only when the planner chooses a hash agg, because it can speed up the sort and reduce temp storage considerably. I guess the trick lies in estimating that cardinality reduction to avoid applying this when there's no hope of getting a payoff. The overhead of such a lazy hash table isn't much, really. But yes, its cache locality is terri
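[A self-contained toy sketch in C of the fixed-size "lazy cuckoo" eviction idea described above. Names and details are mine, not Claudio's private implementation: two candidate buckets per key, no relocations, and the "older" occupant is spilled when both candidates are taken.]

#include <stdio.h>
#include <stdint.h>

#define NBUCKETS 1024            /* fixed size: the table never grows */

typedef struct {
    int64_t  key;
    int64_t  sum;                /* the partial aggregate state */
    uint64_t stamp;              /* crude "age" used for eviction */
    int      used;
} Bucket;

static Bucket   table[NBUCKETS];
static uint64_t clock_ticks;
static uint64_t n_spilled;

static uint64_t hash1(int64_t k) { return (uint64_t) k * 0x9E3779B97F4A7C15ull; }
static uint64_t hash2(int64_t k) { return ((uint64_t) k ^ 0xC2B2AE3D27D4EB4Full) * 0xFF51AFD7ED558CCDull; }

/* Stand-in for appending (key, partial state) to a spill file that a later
 * sort + final-aggregate pass would merge. */
static void spill(const Bucket *b)
{
    (void) b;
    n_spilled++;
}

/* Add "value" to key's group; when both candidate buckets hold other keys,
 * evict the older one -- no relocation, hence "lazy" cuckoo. */
static void partial_agg(int64_t key, int64_t value)
{
    Bucket *a = &table[hash1(key) % NBUCKETS];
    Bucket *b = &table[hash2(key) % NBUCKETS];
    Bucket *victim;

    clock_ticks++;
    if (a->used && a->key == key) { a->sum += value; a->stamp = clock_ticks; return; }
    if (b->used && b->key == key) { b->sum += value; b->stamp = clock_ticks; return; }
    if (!a->used)      victim = a;
    else if (!b->used) victim = b;
    else               victim = (a->stamp < b->stamp) ? a : b;   /* evict the "oldest" */

    if (victim->used)
        spill(victim);
    victim->key = key;
    victim->sum = value;
    victim->stamp = clock_ticks;
    victim->used = 1;
}

int main(void)
{
    uint64_t in_memory = 0;

    /* toy input: 5000 distinct keys, far more than the table can hold */
    for (int i = 0; i < 100000; i++)
        partial_agg(i % 5000, 1);

    for (int i = 0; i < NBUCKETS; i++)
        if (table[i].used)
            in_memory++;

    printf("groups still in memory: %llu, partial groups spilled: %llu\n",
           (unsigned long long) in_memory, (unsigned long long) n_spilled);
    return 0;
}

The spilled partial states would then be fed to the "sort + final group agg" stage described above, which combines aggregate states instead of producing them from scratch.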
Re: Index maintenance function for BRIN doesn't check RecoveryInProgress()
On 2018-Jun-13, Alexander Korotkov wrote: > On Wed, Jun 13, 2018 at 12:48 PM Kuntal Ghosh > wrote: > > On Wed, Jun 13, 2018 at 2:28 PM, Masahiko Sawada > > wrote: > > > Hi, > > > > > > Three functions: brin_summarize_new_values, brin_summarize_range and > > > brin_desummarize_range can be called during recovery as follows. > > > > > > =# select brin_summarize_new_values('a_idx'); > > > ERROR: cannot acquire lock mode ShareUpdateExclusiveLock on database > > > objects while recovery is in progress > > > HINT: Only RowExclusiveLock or less can be acquired on database > > > objects during recovery. Good catch! > > > I think we should complaint "recovery is in progress" error in this > > > case rather than erroring due to lock modes. > > +1 > > +1, > but current behavior doesn't seem to be bug, but rather not precise > enough error reporting. So, I think we shouldn't consider > backpatching this. I guess you could go either way ... we're just changing one unhelpful error with a better one: there is no change in behavior. I would backpatch this, myself, and avoid the code divergence. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Partitioning with temp tables is broken
Ashutosh Bapat writes: > [ lots o' problems with $subject ] > But a larger question is what use such temporary partitions are? > Should we just prohibit adding temporary partitions to a permanant > partitioned table? We should allow adding temporary partitions to a > temporary partitioned table if only they both belong to the same > session. Even if you want to argue that there's a use case for these situations, it seems far too late in the release cycle to be trying to fix all these issues. I think we need to forbid the problematic cases for now, and leave relaxing the prohibition to be treated as a future new feature. regards, tom lane
Re: Index maintenance function for BRIN doesn't check RecoveryInProgress()
On 13 June 2018 at 15:51, Alvaro Herrera wrote: > On 2018-Jun-13, Alexander Korotkov wrote: > >> On Wed, Jun 13, 2018 at 12:48 PM Kuntal Ghosh >> wrote: >> > On Wed, Jun 13, 2018 at 2:28 PM, Masahiko Sawada >> > wrote: >> > > Hi, >> > > >> > > Three functions: brin_summarize_new_values, brin_summarize_range and >> > > brin_desummarize_range can be called during recovery as follows. >> > > >> > > =# select brin_summarize_new_values('a_idx'); >> > > ERROR: cannot acquire lock mode ShareUpdateExclusiveLock on database >> > > objects while recovery is in progress >> > > HINT: Only RowExclusiveLock or less can be acquired on database >> > > objects during recovery. > > Good catch! > >> > > I think we should complaint "recovery is in progress" error in this >> > > case rather than erroring due to lock modes. >> > +1 >> >> +1, >> but current behavior doesn't seem to be bug, but rather not precise >> enough error reporting. So, I think we shouldn't consider >> backpatching this. > > I guess you could go either way ... we're just changing one unhelpful > error with a better one: there is no change in behavior. I would > backpatch this, myself, and avoid the code divergence. WAL control functions all say the same thing, so we can do that here also. I'd prefer it if the message was more generic, so remove the summarization/desummarization wording from the message. i.e. "BRIN control functions cannot be executed during recovery" -- Simon Riggshttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Duplicate Item Pointers in Gin index
On Wed, Jun 13, 2018 at 10:22 PM, Alexander Korotkov wrote: > On Wed, Jun 13, 2018 at 11:40 AM Masahiko Sawada > wrote: >> >> On Wed, Jun 13, 2018 at 3:32 PM, Peter Geoghegan wrote: >> > On Tue, Jun 12, 2018 at 11:01 PM, Masahiko Sawada >> > wrote: >> >> FWIW, I've looked at this again. I think that the situation Siva >> >> reported in the first mail can happen before we get commit 3b2787e. >> >> That is, gin indexes had had a data corruption bug. I've reproduced >> >> the situation with PostgreSQL 10.1 and observed that a gin index can >> >> corrupt. >> > >> > So, you've recreated the problem with Postgres from before 3b2787e, >> > but not after 3b2787e? Are you suggesting that 3b2787e might have >> > fixed it, or that it only hid the problem, or something else? >> >> I meant 3b2787e fixed it. I checked that at least the situation >> doesn't happen after 3b2787e. > > I also think that 3b2787e should fix such problems. After 3b2787e, > vacuum is forced to cleanup all pending list entries, which were > inserted before vacuum start. So, vacuum should have everything to be > vaccumed merged into posting lists/trees. > >> > How did you recreate the problem? Do you have a test case you can share? >> >> I recreated it by executing each steps step by step using gdb. So I >> can share the test case but it might not help. >> >> create extension pageinspect; >> create table g (c int[]); >> insert into g select ARRAY[1] from generate_series(1,1000); >> create index g_idx on g using gin (c); >> alter table g set (autovacuum_enabled = off); >> insert into g select ARRAY[1] from generate_series(1, 408); -- 408 >> items fit in exactly one page of pending list >> insert into g select ARRAY[1] from generate_series(1, 100); -- insert >> into 2nd page of pending list >> select n_pending_pages, n_pending_tuples from >> gin_metapage_info(get_raw_page('g_idx', 0)); >> insert into g select ARRAY[999]; -- insert into 2nd pending list page >> delete from g where c = ARRAY[999]; >> -- At this point, gin entry of 'ARRAY[999]' exists on 2nd page of >> pending list and deleted. > > Is this test case completed? It looks like there should be a > continuation with concurrent vacuum and insertions managed by gdb... > This is not completed test case. This is only for step 1 and we need concurrent vacuum and insertions as you said. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: why partition pruning doesn't work?
David Rowley writes: > On 13 June 2018 at 16:15, Tom Lane wrote: >> It seems not to be that bad: we just need a shutdown call for the >> PartitionPruneState, and then we can remember the open relation there. >> The attached is based on David's patch from yesterday. > I've looked over this and it seems better than mine. Especially so > that you've done things so that the FmgrInfo is copied into a memory > context that's not about to get reset. Pushed, thanks for looking it over. > One small thing is that I'd move the: > context.partrel = NULL; > to under: > /* These are not valid when being called from the planner */ Judgment call I guess. I tend to initialize struct fields in field order unless there's a good reason to do differently, just so it's easier to confirm that none were overlooked. But I can see the point of your suggestion too, so done that way. There's still one thing I'm a bit confused about here. I noticed that we weren't actually using the partopfamily and partopcintype fields in PartitionPruneContext, so I removed them. But that still leaves both partsupfunc and partcollation as pointers into the relcache that were subject to this hazard. My testing agrees with lousyjack's results that both of those were, in fact, being improperly accessed. The OID comparison effect I mentioned upthread explains why the buildfarm's cache-clobbering members failed to notice any problem with garbage partsupfunc data ... but why did we not see any complaints about invalid collation OIDs? Tis strange. regards, tom lane
Re: POC: GROUP BY optimization
I see v9 is now calling estimate_num_groups(), so it already benefits from extended statistics. Nice! I think the algorithm is more or less the greedy option I proposed in one of the earlier messages - I don't know if it's sufficient or not, but the only other idea I have is essentially an exhaustive search through all possible permutations. All possible permutation could be slow with extremely large group by clause, so we will need some some stop limit - like join_collapse_limit. I don't like this idea. So that's a nice improvement, although I think we should also consider non-uniform distributions (using the per-column MCV lists). Could you clarify how to do that? 2) Planing cost is the same for all queries. So, cost_sort() doesn't take into account even number of columns. Yeah. As I wrote before, I think we'll need to come up with a model for comparison_cost depending on the column order, which should make costs for those paths different. Yeah. But I still think it should be separated patch which will improve cost estimation and plan choosing in other optimizer part too. OK. I think we should also have a debug GUC for the first patch (i.e. one that would allow reordering group_pathkeys to match the input path). Added to 0002-opt_group_by_index_and_order-v10.patch . debug_group_by_reorder_by_pathkeys. Regarding the two GUCs introduced in v9, I'm not sure I 100% understand what they are doing. Let me try explaining what I think they do: 1) debug_cheapest_group_by - Reorder GROUP BY clauses to using ndistinct stats for columns, placing columns with more distinct values first (so that we don't need to compare the following columns as often). yes 2) debug_group_by_match_order_by - Try to reorder the GROUP BY clauses to match the ORDER BY, so that the group aggregate produces results in the desired output (and we don't need an explicit Sort). yes FWIW I doubt about the debug_group_by_match_order_by optimization. The trouble is that it's only based on comparing the pathkeys lists, and entirely ignores that the two result sets may operate on very different numbers of rows. Consider for example "SELECT * FROM t GROUP BY a,b,c,d ORDER BY c,d" where table "t" has 1B rows, but there are only ~10k rows in the result (which is fairly common in fact tables in star-schema DBs). IIUC the optimization will ensure "c,d" is placed first in the GROUP BY, and then we reorder "a,b" using ndistinct. But this may be much less efficient than simply reordering (a,b,c,d) per ndistinct and then adding explicit Sort on top of that (because on 10k rows that's going to be cheap). As you pointed in next email, this optimization is already implemented in preprocess_groupclause(). And correct resolving of this issue could be done only after implementing of correct cost_sort() - which will pay attention to comparison cost, number of distinct value in columns and how often it's needed to compare second and next columns. The other "issue" I have with get_cheapest_group_keys_order() is how it interacts with group_keys_reorder_by_pathkeys(). I mean, we first call n_preordered = group_keys_reorder_by_pathkeys(path->pathkeys, ...); so the reordering tries to maintain ordering of the input path. Then if (n_preordered == 0) we do this: group_keys_reorder_by_pathkeys(root->sort_pathkeys, ...); Doesn't that mean that if we happen to have input path with partially matching ordering (so still requiring explicit Sort before grouping) we may end up ignoring the ORDER BY entirely? Instead of using Sort that would match the ORDER BY? 
I might have misunderstood this, though. Hm. I believe ordering the input of the GROUP BY clause is always more expensive - just because the output of the GROUP BY clause should have fewer rows than its input, which means cheaper ordering. So here it uses the ORDER BY only if we don't have group pathkey(s) matched by the path's pathkeys. I'm not sure why the first group_keys_reorder_by_pathkeys() call does not simply return 0 when the input path ordering is not sufficient for the grouping? Is there some reasoning behind that (e.g. that sorting partially sorted input is faster)? Hm, what do you mean - sufficient? group_keys_reorder_by_pathkeys() always tries to order GROUP BY columns by preexisting pathkeys. But of course, incremental sort will add more benefits here. I'd like to hope that the incremental sort patch is close to commit. Overall I think this patch introduces four different optimizations that are indeed very interesting: 1) consider additional sorted paths for grouping Agree 2) reorder GROUP BY clauses per ndistinct to minimize comparisons Agree (with the help of estimate_num_groups(); the patch tries to maximize the number of groups on each step) 3) when adding Sort for grouping, maintain partial input ordering Agree - and the incremental sort patch will be helpful here. 4) when adding Sort for grouping, try producing the right output order (if the ORDER BY was specified)
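To illustrate the greedy strategy discussed above, here is a toy standalone C sketch (not planner code): at each step it picks the not-yet-chosen column with the highest estimated number of distinct values, so the earlier sort keys discriminate as much as possible. The column names and ndistinct figures are made up, and the real patch is described as using estimate_num_groups() on each prefix rather than raw per-column ndistinct.

#include <stdio.h>

/*
 * Toy illustration of greedy GROUP BY reordering: order columns so that
 * the one expected to produce the most groups comes first.  The ndistinct
 * numbers below are invented inputs, not statistics from a real table.
 */
int
main(void)
{
	const char *cols[] = {"a", "b", "c", "d"};
	double		ndistinct[] = {10.0, 1000.0, 4.0, 250.0};
	int			chosen[4] = {0, 0, 0, 0};

	printf("suggested GROUP BY order:");
	for (int step = 0; step < 4; step++)
	{
		int			best = -1;

		for (int i = 0; i < 4; i++)
			if (!chosen[i] && (best < 0 || ndistinct[i] > ndistinct[best]))
				best = i;
		chosen[best] = 1;
		printf(" %s", cols[best]);
	}
	printf("\n");				/* prints: b d a c */
	return 0;
}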
Re: Partitioning with temp tables is broken
On Wed, Jun 13, 2018, 8:34 PM Tom Lane wrote: > Ashutosh Bapat writes: > > [ lots o' problems with $subject ] > > > But a larger question is what use such temporary partitions are? > > Should we just prohibit adding temporary partitions to a permanent > > partitioned table? We should allow adding temporary partitions to a > > temporary partitioned table only if they both belong to the same > > session. > > Even if you want to argue that there's a use case for these situations, > it seems far too late in the release cycle to be trying to fix all these > issues. I think we need to forbid the problematic cases for now, and > leave relaxing the prohibition to be treated as a future new feature. > +1, forbid the problematic case. Regards, Amul Sent from a mobile device. Please excuse brevity and typos. >
Re: POC: GROUP BY optimization
I.e. we already do reorder the group clauses to match ORDER BY, to only require a single sort. This happens in preprocess_groupclause(), which also explains the reasoning behind that. Huh. I missed that. That means the group_keys_reorder_by_pathkeys() call inside get_cheapest_group_keys_order() duplicates the work of preprocess_groupclause(). And so it's possible to replace it with simply counting the number of matching pathkeys at the beginning of the lists. Or remove most of preprocess_groupclause() - but it seems to me we should use the first option: preprocess_groupclause() is simpler, as it doesn't operate on pathkeys, only on SortGroupClause. BTW, the incremental sort patch provides pathkeys_common(), exactly what we need. I wonder if some of the new code reordering group pathkeys could/should be moved here (not sure, maybe it's too early for those decisions). In any case, it might be appropriate to update some of the comments before preprocess_groupclause() which claim we don't do certain things added by the proposed patches. preprocess_groupclause() is too early a step to use something like group_keys_reorder_by_pathkeys() because the pathkeys are not known there yet. This probably also somewhat refutes my claim that the order of grouping keys is currently fully determined by users (and so they may pick the most efficient order), while the reorder-by-ndistinct patch would make that impossible. Apparently when there's ORDER BY, we already mess with the order of group clauses - there are ways to get around it (subquery with OFFSET 0) but it's much less clear. I like the moment when objections go away :) -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: Index maintenance function for BRIN doesn't check RecoveryInProgress()
On Thu, Jun 14, 2018 at 12:04 AM, Simon Riggs wrote: > On 13 June 2018 at 15:51, Alvaro Herrera wrote: >> On 2018-Jun-13, Alexander Korotkov wrote: >> >>> On Wed, Jun 13, 2018 at 12:48 PM Kuntal Ghosh >>> wrote: >>> > On Wed, Jun 13, 2018 at 2:28 PM, Masahiko Sawada >>> > wrote: >>> > > Hi, >>> > > >>> > > Three functions: brin_summarize_new_values, brin_summarize_range and >>> > > brin_desummarize_range can be called during recovery as follows. >>> > > >>> > > =# select brin_summarize_new_values('a_idx'); >>> > > ERROR: cannot acquire lock mode ShareUpdateExclusiveLock on database >>> > > objects while recovery is in progress >>> > > HINT: Only RowExclusiveLock or less can be acquired on database >>> > > objects during recovery. >> >> Good catch! >> >>> > > I think we should complaint "recovery is in progress" error in this >>> > > case rather than erroring due to lock modes. >>> > +1 >>> >>> +1, >>> but current behavior doesn't seem to be bug, but rather not precise >>> enough error reporting. So, I think we shouldn't consider >>> backpatching this. >> >> I guess you could go either way ... we're just changing one unhelpful >> error with a better one: there is no change in behavior. I would >> backpatch this, myself, and avoid the code divergence. > > WAL control functions all say the same thing, so we can do that here also. +1 > I'd prefer it if the message was more generic, so remove the > summarization/desummarization wording from the message. i.e. > "BRIN control functions cannot be executed during recovery" > Agreed. Attached an updated patch. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c index 60e650d..8453bfe 100644 --- a/src/backend/access/brin/brin.c +++ b/src/backend/access/brin/brin.c @@ -871,6 +871,12 @@ brin_summarize_range(PG_FUNCTION_ARGS) Relation heapRel; double numSummarized = 0; + if (RecoveryInProgress()) + ereport(ERROR, +(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("recovery is in progress"), + errhint("BRIN control functions cannot be executed during recovery."))); + if (heapBlk64 > BRIN_ALL_BLOCKRANGES || heapBlk64 < 0) { char *blk = psprintf(INT64_FORMAT, heapBlk64); @@ -942,6 +948,12 @@ brin_desummarize_range(PG_FUNCTION_ARGS) Relation indexRel; bool done; + if (RecoveryInProgress()) + ereport(ERROR, +(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("recovery is in progress"), + errhint("BRIN control functions cannot be executed during recovery."))); + if (heapBlk64 > MaxBlockNumber || heapBlk64 < 0) { char *blk = psprintf(INT64_FORMAT, heapBlk64);
Re: Portability concerns over pq_sendbyte?
Hi, On 2018-05-24 18:13:23 +0100, Andrew Gierth wrote: > In PG11, pq_sendbyte got changed from taking an int parameter to taking > an int8. > > While that seems to work in general, it does mean that there are now > several places in the code that do the equivalent of: > > unsigned char x = 128; > pq_sendbyte(&buf, x); > > which I believe is not well-defined since pq_sendbyte takes an int8, and > conversions of unrepresentable values to _signed_ integer types are > (iirc) implementation-dependent. It's not implementation defined in postgres' dialect of C - we rely on accurate signed->unsigned conversions in a number of places. But I don't think we should increase that reliance, so I think you're right we should do something about this. > There are also some cases where pq_sendint16 is being called for an > unsigned value or a value that might exceed 32767. Hm, which case were you thinking of here? The calls usually are exactly the types that the wire protocol expects, no? > Would it be better for these to take unsigned values, or have unsigned > variants? I wonder if we should just take 'int' out of the name. Say, pg_send{8,16,32,64}(unsigned ...). Greetings, Andres Freund
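As a standalone illustration of the conversion being discussed (not code from the tree): converting a value that does not fit into a signed 8-bit type is implementation-defined per C99 6.3.1.3p3, while the unsigned direction is fully defined.

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	unsigned char c = 128;

	/*
	 * 128 is not representable in int8: the result of this conversion is
	 * implementation-defined (or may raise an implementation-defined
	 * signal) per C99 6.3.1.3p3.  On common two's-complement platforms it
	 * simply wraps to -128, which is why such code works in practice.
	 */
	int8_t		s = (int8_t) c;

	/*
	 * The reverse direction is fully defined: conversion to an unsigned
	 * type reduces the value modulo 2^8, so -128 becomes 128 again.
	 */
	uint8_t		u = (uint8_t) s;

	printf("%d %u\n", (int) s, (unsigned) u);
	return 0;
}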
Re: Portability concerns over pq_sendbyte?
On 2018-06-13 14:10:37 +0900, Michael Paquier wrote: > On Mon, Jun 11, 2018 at 02:25:44PM +0900, Michael Paquier wrote: > > On Wed, Jun 06, 2018 at 12:27:58PM -0400, Alvaro Herrera wrote: > >> Do you have an answer to this question? Does anybody else? > >> > >> (My guts tell me it'd be better to change these routines to take > >> unsigned values, without creating extra variants. But guts frequently > >> misspeak.) > > > > My guts are telling me as well to not have more variants. Agreed. > > On top of that it seems to me that we'd want to rename any new > > routines to include "uint" in their name instead of "int", and for > > compatibility with past code pq_sendint should not be touched. I'm very doubtful about this one, unless you mean that just the signature shouldn't be touched. Otherwise we'll just increase code duplication unnecessarily? > And also pq_sendint64 needs to be kept around for compatibility. :(. Wonder if it's better to just break people's code. Greetings, Andres Freund
Re: [HACKERS] Optional message to user when terminating/cancelling backend
> On 9 Apr 2018, at 23:55, Andres Freund wrote: > On 2017-06-20 13:01:35 -0700, Andres Freund wrote: >> For extensions it'd also be useful if it'd be possible to overwrite the >> error code. E.g. for citus there's a distributed deadlock detector, >> running out of process because there's no way to interrupt lock waits >> locally, and we've to do some ugly hacking to generate proper error >> messages and code from another session. > > What happened to this request? Seems we're out of the crunch mode and > could round the feature out a littlebit more… Revisiting old patches, I took a stab at this request. Since I don’t really have a use case for altering the sqlerrcode other than the one that Citus cited, I modelled the API around that. The slot now holds a sqlerrcode as well as a message, with functions to just set the message keeping the default sqlerrcode for when that is all one wants to do. There is no function for just altering the sqlerrcode without a message as that seems not useful to me. The combination of sqlerrcode and message was dubbed SignalFeedback for lack of a better term. With this I also addressed something that annoyed me; I had called all the functions Cancel* which technically isn’t true since we also support termination. There are no new user-facing changes in this patch compared to the previous version. This patchset still has the refactoring that Alvaro brought up upthread. Parking this again in the commitfest as it was returned with feedback from the last one it was in (all review comments addressed, see upthread). cheers ./daniel 0001-Refactor-backend-signalling-code-v11.patch Description: Binary data 0002-Support-optional-message-in-backend-cancel-terminate-v11.patch Description: Binary data
Re: [HACKERS] Optional message to user when terminating/cancelling backend
Hi, Onder, I CCed you because it seems worthwhile to ensure that the relevant Citus code could use this instead of the gross hack you and I committed... On 2018-06-13 20:54:03 +0200, Daniel Gustafsson wrote: > > On 9 Apr 2018, at 23:55, Andres Freund wrote: > > On 2017-06-20 13:01:35 -0700, Andres Freund wrote: > >> For extensions it'd also be useful if it'd be possible to overwrite the > >> error code. E.g. for citus there's a distributed deadlock detector, > >> running out of process because there's no way to interrupt lock waits > >> locally, and we've to do some ugly hacking to generate proper error > >> messages and code from another session. > > > > What happened to this request? Seems we're out of the crunch mode and > > could round the feature out a littlebit more… > > Revisiting old patches, I took a stab at this request. > > Since I don’t really have a use case for altering the sqlerrcode other than > the > on that Citus.. cited, I modelled the API around that. Cool. Onder? > diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml > index b851fe023a..ad9763698f 100644 > --- a/doc/src/sgml/func.sgml > +++ b/doc/src/sgml/func.sgml > @@ -18540,7 +18540,7 @@ SELECT set_config('log_statement_stats', 'off', > false); > > > > -pg_cancel_backend(pid > int) > +pg_cancel_backend(pid > int [, message > text]) > > boolean > Cancel a backend's current query. This is also allowed if the > @@ -18565,7 +18565,7 @@ SELECT set_config('log_statement_stats', 'off', > false); > > > > -pg_terminate_backend(pid > int) > +pg_terminate_backend(pid > int [, message > text]) > > boolean > Terminate a backend. This is also allowed if the calling role > @@ -18596,6 +18596,14 @@ SELECT set_config('log_statement_stats', 'off', > false); > The role of an active backend can be found from the > usename column of the > pg_stat_activity view. > +If the optional message parameter is set, the text > +will be appended to the error message returned to the signalled backend. > +message is limited to 128 bytes, any longer text > +will be truncated. An example where we cancel our own backend: > + > +postgres=# SELECT pg_cancel_backend(pg_backend_pid(), 'Cancellation message > text'); > +ERROR: canceling statement due to user request: Cancellation message text > + > I'm not sure I really like the appending bit. There's a security argument to be made about doing so, but from a user POV that mostly seems restrictive. I wonder if that aspect would be better handled by adding an error context that contains information about which process did the cancellation (or DETAIL?)? > +/* > + * Structure for registering a feedback payload to be sent to a cancelled, or > + * terminated backend. Each backend is registered per pid in the array which > is > + * indexed by Backend ID. Reading and writing the message is protected by a > + * per-slot spinlock. > + */ > +typedef struct > +{ > + pid_t pid; This is the pid of the receiving process? If so, why do we need this in here? That's just duplicated data, no? > + slock_t mutex; > + charmessage[MAX_CANCEL_MSG]; > + int len; > + int sqlerrcode; I'd vote for including the pid of the process that did the cancelling in here. 
> +Datum > +pg_cancel_backend(PG_FUNCTION_ARGS) > +{ > + PG_RETURN_BOOL(pg_cancel_backend_internal(PG_GETARG_INT32(0), NULL)); > +} > + > +Datum > +pg_cancel_backend_msg(PG_FUNCTION_ARGS) > +{ > + pid_t pid; > + char *msg = NULL; > + > + if (PG_ARGISNULL(0)) > + PG_RETURN_NULL(); > + > + pid = PG_GETARG_INT32(0); > + > + if (PG_NARGS() == 2 && !PG_ARGISNULL(1)) > + msg = text_to_cstring(PG_GETARG_TEXT_PP(1)); > + > + PG_RETURN_BOOL(pg_cancel_backend_internal(pid, msg)); > +} Why isn't this just one function? Given that you already have a PG_NARGS check, I don't quite see the point? > /* > * Signal to terminate a backend process. This is allowed if you are a > member > * of the role whose process is being terminated. > * > * Note that only superusers can signal superuser-owned processes. > */ > -Datum > -pg_terminate_backend(PG_FUNCTION_ARGS) > +static bool > +pg_terminate_backend_internal(pid_t pid, char *msg) > { > - int r = pg_signal_backend(PG_GETARG_INT32(0), > SIGTERM); > + int r = pg_signal_backend(pid, SIGTERM, msg); > > if (r == SIGNAL_BACKEND_NOSUPERUSER) > ereport(ERROR, > @@ -146,7 +203,30 @@ pg_terminate_backend(PG_FUNCTION_ARGS) > (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), >(errmsg("must be a member of the role whose > process is being terminated or member of pg_signal_backend"; > > - PG_RETURN_BOOL(r == SIGN
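To make the slot protocol discussed in the review above easier to follow, here is a rough sketch of how the sending side might fill such a per-backend slot under its spinlock. The struct layout follows the quoted patch, but the names SignalFeedbackSlot and SetSignalFeedback are invented for illustration and are not from the patch.

/*
 * Hypothetical sketch of the sender side: copy the feedback message and
 * error code into the target backend's slot under its per-slot spinlock.
 */
#include "postgres.h"
#include "storage/spin.h"

#define MAX_CANCEL_MSG 128

typedef struct SignalFeedbackSlot
{
	pid_t		pid;			/* target backend's pid */
	slock_t		mutex;			/* protects the fields below */
	char		message[MAX_CANCEL_MSG];
	int			len;
	int			sqlerrcode;
} SignalFeedbackSlot;

static void
SetSignalFeedback(SignalFeedbackSlot *slot, const char *message,
				  int sqlerrcode)
{
	SpinLockAcquire(&slot->mutex);
	strlcpy(slot->message, message, sizeof(slot->message));
	slot->len = strlen(slot->message);
	slot->sqlerrcode = sqlerrcode;
	SpinLockRelease(&slot->mutex);
}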
Re: Spilling hashed SetOps and aggregates to disk
On Wed, 2018-06-13 at 10:08 -0400, Robert Haas wrote: > I wasn't using the term "average" in a mathematical sense. I suppose > I really meant "typical". I agree that thinking about cases where > the > group size is nonuniform is a good idea, but I don't think I agree > that all uniform distributions are created equal. Uniform > distributions with 1 row per group are very different than uniform > distributions with 1000 rows per group. The only mechanism we have for dealing efficiently with skewed groups is hashing; no alternative has been proposed. I think what you are saying is that the case of medium-large groups with clustered input are kind of like skewed groups because they have enough locality to benefit from grouping before spilling. I can see that. So how do we handle both of these cases (skewed groups and clustered groups) well? I tend toward a simple approach, which is to just put whatever we can in the hash table. Once it's full, if the hit rate on the hash table (the whole table) drops below a certain threshold, we just dump all the transition states out to disk and empty the hash table. That would handle both skewed groups and clustering fairly well. Clustering would cause the hit rate to rapidly go to zero when we are past the current batch of groups, causing fairly quick switching to new groups which should handle Tomas's case[1]. And it would also handle ordinary skewed groups fairly well -- it may write them out sometimes (depending on how smart our eviction strategy is), but the cost of that is not necessarily bad because it will go back into the hash table quickly. It also offers an incremental implementation strategy: something resembling my patch could be first (which doesn't dump transition states at all), followed by something that can dump transition states, followed by some tweaking to make it smarter. That still leaves the question about what to do with the small groups: partition them (like my patch does) or feed them into a sort and group them? Regards, Jeff Davis [1] https://www.postgresql.org/message-id/46734151-3245-54ac-76fc-17742fb0e6d9%402ndquadrant.com
Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors
Hello Marina, I suppose that this is related; because of my patch there may be a lot of such code (see v7 in [1]): - fprintf(stderr, - "malformed variable \"%s\" value: \"%s\"\n", - var->name, var->svalue); + if (debug_level >= DEBUG_FAILS) + { + fprintf(stderr, + "malformed variable \"%s\" value: \"%s\"\n", + var->name, var->svalue); + } - if (debug) + if (debug_level >= DEBUG_ALL) fprintf(stderr, "client %d sending %s\n", st->id, sql); I'm not sure that debug messages need to be kept after debugging, if it is about debugging pgbench itself. That is debatable. That's why it was suggested to make an error function which hides all these things (see [2]): There are a lot of checks like "if (debug_level >= DEBUG_FAILS)" with corresponding fprintf(stderr..) I think it's time to do it like in the main code, wrap with some function like log(level, msg). Yep. I did not write that, but I agree with an "elog" suggestion to switch if (...) { fprintf(...); exit/abort/continue/... } to a simpler: elog(level, ...) Moreover, it changes pgbench's current behavior, which might be admissible, but should be discussed clearly. The semantics of the existing code are changed: the FATAL level calls abort() and replaces existing exit(1) calls. Maybe you want an ERROR level as well. Oh, thanks, I agree with you. And I do not want to change the program exit code without good reasons, but I'm sorry I may not know all pros and cons in this matter.. Or did you also mean other changes? AFAICR I meant switching exit to abort in some cases. The code adapts/duplicates existing server-side "ereport" stuff and brings it to the frontend, where the logging needs are somehow quite different. I'd prefer to avoid duplication and/or have some code sharing. I was recommended to use the same interface in [3]: On elog/errstart: we already have a convention for what ereport() calls look like; I suggest to use that instead of inventing your own. The "elog" interface already exists, it is not an invention. "ereport" is a hack which is somehow necessary in some cases. I prefer a simple function call if possible for the purpose, and ISTM that this is the case. If it really needs to be duplicated, I'd suggest putting all this stuff in separate files. If we want to do that, I think that it would belong in fe_utils, where it could/should be used by all front-end programs. I'll try to do it.. Dunno. If you only need one "elog" function which prints a message to stderr and decides whether to abort/exit/whatever, maybe it can just be kept in pgbench. If there are several complicated functions and macros, better with a file. So I'd say it depends. For logging purposes, ISTM that the "elog" macro interface is nicer, closer to the existing "fprintf(stderr", as it would not introduce the additional parentheses hack for "rest". I was also recommended to use ereport() instead of elog() in [3]: Probably. Are you hoping that advice from different reviewers will be consistent? That seems optimistic:-) With that, is there a need for elog()? In the backend we have it because $HISTORY but there's no need for that here -- I propose to lose elog() and use only ereport everywhere. See commit 8a07ebb3c172 which turns some ereport into elog... My 0.02€: maybe you just want to turn fprintf(stderr, format, ...); // then possibly exit or abort depending... into elog(level, format, ...); which maybe would exit or abort depending on level, and possibly not actually report under some levels and/or some conditions. 
For that, it could be enough to just provide a nice "elog" function. I agree that elog() can be coded in this way. To use ereport() I need a structure to store the error level as a condition to exit. Yep. That is a lot of complication which is justified server-side where logging requirements are special, but in this case I see it as overkill. So my current view is that if you only need an "elog" function, it is simpler to add it to "pgbench.c". -- Fabien.
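For readers following the thread, a minimal sketch of what such a pgbench-local elog() could look like; the level names and the exit/abort policy below are illustrative assumptions rather than anything taken from the actual patch.

/*
 * Minimal sketch of a pgbench-local elog(): print to stderr and decide
 * whether to continue, exit or abort based on the level.  The levels and
 * the policy here are invented for illustration only.
 */
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

typedef enum
{
	DEBUG,						/* only shown when debugging is enabled */
	LOG,						/* always shown */
	ERROR,						/* shown, then exit(1) */
	FATAL						/* shown, then abort() */
} ElogLevel;

static int	debug_level = 0;	/* would be set from the command line */

static void
elog(ElogLevel level, const char *fmt,...)
{
	va_list		ap;

	if (level == DEBUG && !debug_level)
		return;					/* message suppressed */

	va_start(ap, fmt);
	vfprintf(stderr, fmt, ap);
	va_end(ap);

	if (level == ERROR)
		exit(1);
	else if (level == FATAL)
		abort();
}

A call site would then shrink from the quoted if-block to something like elog(DEBUG, "client %d sending %s\n", st->id, sql).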
Re: Spilling hashed SetOps and aggregates to disk
On Wed, 2018-06-13 at 11:50 -0300, Claudio Freire wrote: > In essence, the technique I've been using always uses a fixed-size > hash table to do partial grouping. The table is never grown, when > collisions do happen, the existing entry in the hash table is flushed > to disk and the aggregate state in that bucket reset for the incoming > key. To avoid excessive spilling due to frequent collisions, we use a > kind of "lazy cuckoo" hash table. Lazy in the sense that it does no > relocations, it just checks 2 hash values, and if it has to evict, it > evicts the "oldest" value (with some cheap definition of "old"). ... > An adaptive hash agg node would start as a hash agg, and turn into a > "partial hash agg + sort + final group agg" when OOM is detected. > > The benefit over ordinary sort+group agg is that the sort is > happening > on a potentially much smaller data set. When the buffer is large > enough to capture enough key repetitions, the output of the partial > hash agg can be orders of magnitude smaller than the scan. > > My proposal was to use this for all group aggs, not only when the > planner chooses a hash agg, because it can speed up the sort and > reduce temp storage considerably. I guess the trick lies in > estimating > that cardinality reduction to avoid applying this when there's no > hope > of getting a payoff. The overhead of such a lazy hash table isn't > much, really. But yes, its cache locality is terrible, so it needs to > be considered. I think this is a promising approach because it means the planner has one less decision to make. And the planner doesn't have great information to make its decision on, anyway (ndistinct estimates are hard enough on plain tables, and all bets are off a few levels up in the plan). Unfortunately, it means that either all data types must support hashing and sorting[1], or we need to have a fallback path that can get by with hashing alone (which would not be tested nearly as well as more typical paths). I still like this idea, though. Regards, Jeff Davis [1] https://www.postgresql.org/message-id/9584.1528739975%40sss.pgh.pa.us
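For illustration, a rough standalone sketch of the fixed-size, evict-on-collision table described above; this is not Claudio's implementation, and the key/state types, hash functions and spill_entry() are placeholders.

/*
 * Sketch of a fixed-size partial-aggregation hash table: every key hashes
 * to two candidate buckets; on a miss with both buckets occupied, the
 * older entry is spilled and its bucket reused for the incoming key.
 */
#include <stdbool.h>
#include <stdint.h>

#define NBUCKETS 1024

typedef struct
{
	bool		used;
	int64_t		key;
	int64_t		agg_state;		/* e.g. a running count or sum */
	uint64_t	last_used;		/* cheap notion of "age" */
} Bucket;

static Bucket buckets[NBUCKETS];
static uint64_t clock_tick;

static uint32_t
hash1(int64_t key)
{
	return (uint32_t) (((uint64_t) key * 2654435761u) % NBUCKETS);
}

static uint32_t
hash2(int64_t key)
{
	uint64_t	k = (uint64_t) key;

	return (uint32_t) (((k >> 17) ^ (k * 40503u)) % NBUCKETS);
}

static void
spill_entry(const Bucket *b)
{
	/* Real code would append (key, partial state) to a spill file. */
	(void) b;
}

/* Fold "value" into the partial aggregate state for "key". */
static void
partial_agg_add(int64_t key, int64_t value)
{
	uint32_t	h1 = hash1(key);
	uint32_t	h2 = hash2(key);
	Bucket	   *b;

	clock_tick++;

	if (buckets[h1].used && buckets[h1].key == key)
		b = &buckets[h1];
	else if (buckets[h2].used && buckets[h2].key == key)
		b = &buckets[h2];
	else if (!buckets[h1].used)
		b = &buckets[h1];
	else if (!buckets[h2].used)
		b = &buckets[h2];
	else
	{
		/* Both candidates hold other keys: evict the older one. */
		b = (buckets[h1].last_used <= buckets[h2].last_used)
			? &buckets[h1] : &buckets[h2];
		spill_entry(b);
		b->used = false;
	}

	if (!b->used)
	{
		b->used = true;
		b->key = key;
		b->agg_state = 0;
	}
	b->agg_state += value;
	b->last_used = clock_tick;
}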
Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors
On 2018-Jun-13, Fabien COELHO wrote: > > > > With that, is there a need for elog()? In the backend we have > > > > it because $HISTORY but there's no need for that here -- I > > > > propose to lose elog() and use only ereport everywhere. > > See commit 8a07ebb3c172 which turns some ereport into elog... For context: in the backend, elog() is only used for internal messages (i.e. "can't-happen" conditions), and ereport() is used for user-facing messages. There are many things ereport() has that elog() doesn't, such as additional message fields (HINT, DETAIL, etc) that I think could have some use in pgbench as well. If you use elog() then you can't have that. Another difference is that in the backend, elog() messages are never translated, while ereport() message are translated. Since pgbench is translatable I think it would be best to keep those things in sync, to avoid confusion. (Although of course you could do it differently in pgbench than backend.) One thing that just came to mind is that pgbench uses some src/fe_utils stuff. I hope having ereport() doesn't cause a conflict with that ... BTW I think abort() is not the right thing, as it'll cause core dumps if enabled. Why not just exit(1)? -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
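For reference, this is what the extra fields look like in a backend-style ereport() call (server-side API, shown only to indicate what a frontend equivalent would need to cover; the detail and hint wording here is invented, and name/value are placeholders):

/* Backend-style ereport() with the additional fields mentioned above. */
ereport(ERROR,
		(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
		 errmsg("malformed variable \"%s\" value: \"%s\"", name, value),
		 errdetail("The value could not be parsed as a number."),
		 errhint("Check the command that assigns this variable.")));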
Re: Portability concerns over pq_sendbyte?
> "Andres" == Andres Freund writes: >> unsigned char x = 128; >> pq_sendbyte(&buf, x); >> >> which I believe is not well-defined since pq_sendbyte takes an int8, >> and conversions of unrepresentable values to _signed_ integer types >> are (iirc) implementation-dependent. Andres> It's not implementation defined in postgres' dialect of C - we Andres> rely on accurate signed->unsigned conversions in a number of Andres> places. Converting signed integer to unsigned is ok as I understand it - what's happening here is the reverse, converting an unrepresentable unsigned value to a signed type. >> There are also some cases where pq_sendint16 is being called for an >> unsigned value or a value that might exceed 32767. Andres> Hm, which case were you thinking of here? The calls usually are Andres> exactly the types that the wire protocol expects, no? There are cases where it's not actually clear what the wire protocol expects - I'm thinking in particular of the number of entries in a list of parameter types/formats. -- Andrew (irc:RhodiumToad)
Re: Portability concerns over pq_sendbyte?
Hi, On 2018-06-13 22:02:13 +0100, Andrew Gierth wrote: > > "Andres" == Andres Freund writes: > > >> unsigned char x = 128; > >> pq_sendbyte(&buf, x); > >> > >> which I believe is not well-defined since pq_sendbyte takes an int8, > >> and conversions of unrepresentable values to _signed_ integer types > >> are (iirc) implementation-dependent. > > Andres> It's not implementation defined in postgres' dialect of C - we > Andres> rely on accurate signed->unsigned conversions in a number of > Andres> places. > > Converting signed integer to unsigned is ok as I understand it - what's > happening here is the reverse, converting an unrepresentable unsigned > value to a signed type. Err, yes, I was thinking about that conversion. Sorry for the confusion. > >> There are also some cases where pq_sendint16 is being called for an > >> unsigned value or a value that might exceed 32767. > > Andres> Hm, which case were you thinking of here? The calls usually are > Andres> exactly the types that the wire protocol expects, no? > > There are cases where it's not actually clear what the wire protocol > expects - I'm thinking in particular of the number of entries in a list > of parameter types/formats. But that didn't change around the pq_send* changes? So I'm not sure I understand how this is related? I mean I'm all for documenting the wire protocol more extensively, but we can't just change the width? Greetings, Andres Freund
Re: Locking B-tree leafs immediately in exclusive mode
On Mon, Jun 11, 2018 at 9:30 AM, Alexander Korotkov wrote: > On Mon, Jun 11, 2018 at 1:06 PM Simon Riggs wrote: >> It's a good idea. How does it perform with many duplicate entries? I agree that this is a good idea. It independently occurred to me to do this. The L&Y algorithm doesn't take a position on this at all, supposing that *no* locks are needed at all (that's why I recommend people skip L&Y and just read the Lanin & Shasha paper -- L&Y is unnecessarily confusing). This idea seems relatively low risk. > Our B-tree is currently maintaining duplicates unordered. So, during > insertion > we can traverse rightlinks in order to find a page which would fit the new > index tuple. > However, in this case we're traversing pages in exclusive lock mode, and > that happens already after re-lock. _bt_search() doesn't do anything with > that. > So, I assume there shouldn't be any degradation in the case of many > duplicate entries. BTW, I have a prototype patch that makes the keys at the leaf level unique. It is actually an implementation of nbtree suffix truncation that sometimes *adds* a new attribute in pivot tuples, rather than truncating away non-distinguishing attributes -- it adds a special heap TID attribute when it must. The patch typically increases fan-in, of course, but the real goal was to make all entries at the leaf level truly unique, as L&Y intended (we cannot do it without suffix truncation because that will kill fan-in). My prototype has full amcheck coverage, which really helped me gain confidence in my approach. There are still big problems with my patch, though. It seems to hurt performance more often than it helps when there is a lot of contention, so as things stand I don't see a path to making it committable. I am still going to clean it up some more and post it to -hackers, though. I still think it's quite interesting. I am not pursuing it much further now because it seems like my timing is bad -- not because it seems like a bad idea. I can imagine something like zheap making this patch important again. I can also imagine someone seeing something that I missed. -- Peter Geoghegan
Re: Portability concerns over pq_sendbyte?
On Wed, Jun 13, 2018 at 11:53:21AM -0700, Andres Freund wrote: > On 2018-06-13 14:10:37 +0900, Michael Paquier wrote: >> On Mon, Jun 11, 2018 at 02:25:44PM +0900, Michael Paquier wrote: >>> On top of that it seems to me that we'd want to rename any new >>> routines to include "uint" in their name instead of "int", and for >>> compatibility with past code pq_sendint should not be touched. > > I'm very doubtful about this one, unless you mean that just the > signature shouldn't be touched. Otherwise we'll just increase code > duplication unnecessarily? Yeah, actually that would be assuming that many modules use it, but that does not seem to be much the case, at least from github's point of view. >> And also pq_sendint64 needs to be kept around for compatibility. > > :(. Wonder if it's better to just break people's code. Indeed. At least breaking compilation has the advantage of making people directly aware of the change and hopefully think about it. A search on github shows a bunch of people having copies of pqformat.h, as there are a bunch of copies of Postgres, so with this much noise it is not easy to find out what would be broken. In-core contrib and test modules don't make use of those interfaces either, except for hstore. So that could be acceptable. For pq_sendint there are many matches with printsimple.c. -- Michael signature.asc Description: PGP signature
Logging transaction IDs for DDL.
I just noticed a problem with log_statement = 'ddl' and log_line_prefix containing '%x'. If the statement is the first in the transaction, it will be logged before it is executed, and more importantly, before a transaction ID is assigned. That means that %x will be 0. If the administrator has configured postgres like this in order to ease PITR on bad DDL, it's actually of no use whatsoever and they'll have to use pg_waldump to try to figure things out. PFA a simple patch that I hope addresses the issue in a clean way. It also handles the same problem for 'mod'. I'm undecided whether this is a bugfix or an improvement. I'm leaning towards bugfix. -- Vik Fearing +33 6 46 75 15 36 http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c index f4133953be..27027a0fa8 100644 --- a/src/backend/tcop/postgres.c +++ b/src/backend/tcop/postgres.c @@ -38,6 +38,7 @@ #include "access/parallel.h" #include "access/printtup.h" +#include "access/transam.h" #include "access/xact.h" #include "catalog/pg_type.h" #include "commands/async.h" @@ -2098,16 +2099,31 @@ check_log_statement(List *stmt_list) if (log_statement == LOGSTMT_NONE) return false; - if (log_statement == LOGSTMT_ALL) + + /* + * If we're supposed to log mod or ddl statements, we need to make sure we + * have a TransactionId so it appears in log_line_prefix's %x wildcard. + * If the user is logging all statements, we can fast-track out if we + * already have a TransactionId, otherwise we need to loop through the + * statements. + */ + if (log_statement == LOGSTMT_ALL && TransactionIdIsValid(GetTopTransactionIdIfAny())) return true; /* Else we have to inspect the statement(s) to see whether to log */ foreach(stmt_item, stmt_list) { Node *stmt = (Node *) lfirst(stmt_item); + LogStmtLevel level = GetCommandLogLevel(stmt); + + if (level <= log_statement) + { + /* Make sure we have a TransactionId for mod and ddl statements. */ + if (level == LOGSTMT_DDL || level == LOGSTMT_MOD) +(void) GetTopTransactionId(); - if (GetCommandLogLevel(stmt) <= log_statement) return true; + } } return false;
Re: Logging transaction IDs for DDL.
Hi, On 2018-06-14 00:34:54 +0200, Vik Fearing wrote: > I just noticed a problem with log_statement = 'ddl' and log_line_prefix > containing '%x'. If the statement is the first in the transaction, it > will be logged before it is executed, and more importantly, before a > transaction ID is assigned. That means that %x will be 0. This isn't particularly related to DDL, no? It's a general weakness of how %x is handled. What's even worse is that by log_min_duration_statement time the xid is already cleared out, when using a single-statement transaction. > If the administrator has configured postgres like this in order to ease > PITR on bad DDL, it's actually of no use whatsoever and they'll have to > use pg_waldump to try to figure things out. To me that part seems like it'd be better handled if we had an option that allows logging at a time when the xid is already assigned. Making log_statement assign xids imo has a number of problems, but preserving the xid for logging at the *end* of a statement should be much less problematic. > I'm undecided whether this is a bugfix or an improvement. I'm leaning > towards bugfix. It's been reported a couple times, without anybody working to fix it. Your fix has a noticeable regression potential, so I'm not immediately in favor of backpatching. > -- > Vik Fearing +33 6 46 75 15 36 > http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support > diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c > index f4133953be..27027a0fa8 100644 > --- a/src/backend/tcop/postgres.c > +++ b/src/backend/tcop/postgres.c > @@ -38,6 +38,7 @@ > > #include "access/parallel.h" > #include "access/printtup.h" > +#include "access/transam.h" > #include "access/xact.h" > #include "catalog/pg_type.h" > #include "commands/async.h" > @@ -2098,16 +2099,31 @@ check_log_statement(List *stmt_list) > > if (log_statement == LOGSTMT_NONE) > return false; > - if (log_statement == LOGSTMT_ALL) > + > + /* > + * If we're supposed to log mod or ddl statements, we need to make sure > we > + * have a TransactionId so it appears in log_line_prefix's %x wildcard. > + * If the user is logging all statements, we can fast-track out if we > + * already have a TransactionId, otherwise we need to loop through the > + * statements. > + */ > + if (log_statement == LOGSTMT_ALL && > TransactionIdIsValid(GetTopTransactionIdIfAny())) > return true; > > /* Else we have to inspect the statement(s) to see whether to log */ > foreach(stmt_item, stmt_list) > { > Node *stmt = (Node *) lfirst(stmt_item); > + LogStmtLevel level = GetCommandLogLevel(stmt); > + > + if (level <= log_statement) > + { > + /* Make sure we have a TransactionId for mod and ddl > statements. */ > + if (level == LOGSTMT_DDL || level == LOGSTMT_MOD) > + (void) GetTopTransactionId(); > > - if (GetCommandLogLevel(stmt) <= log_statement) > return true; > + } > } I'm not at all on board with this. For one this'll break in a number of weird ways on a standby (wrong error messages at the least). For another it increases the overhead of logging quite noticeably. Thirdly moving transaction assignment to a different place based on logging settings seems like it'll be very confusing too. Greetings, Andres Freund
Re: Slow planning time for simple query
13.06.2018 12:40, Maksim Milyutin wrote: On 09.06.2018 22:49, Tom Lane wrote: Maksim Milyutin writes: On hot standby I faced a similar problem. ... is planned 4.940 ms on master and *254.741* ms on standby. (I wonder though why, if you executed the same query on the master, its setting of the index-entry-is-dead bits didn't propagate to the standby.) I have verified the number of dead item pointers (through the pageinspect extension) in the first leaf page of the index participating in the query ('main.message_instance_pkey') on the master and slave nodes and have noticed a big difference. SELECT * FROM monitoring.bt_page_stats('main.message_instance_pkey', 3705);

On master:
 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
  3705 | l    |          1 |         58 |            24 |      8192 |      6496 |         0 |      3719 |    0 |         65

On standby:
 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
  3705 | l    |         59 |          0 |            24 |      8192 |      6496 |         0 |      3719 |    0 |          1

At this point I want to highlight the issue that the changes in *lp_flags* bits (namely, setting items as dead) for index item pointers don't propagate from master to replica in my case. As a consequence, on standby I have live index items most of which are marked as dead on master. And my queries at the planning stage are forced to descend to heap pages during *get_actual_variable_range* execution, which considerably slows down planning. Is it a bug, a restriction of the implementation, or a misconfiguration of WAL/replication? -- Regards, Maksim Milyutin
Re: commitfest 2018-07
On Mon, Jun 11, 2018 at 10:34:55AM -0700, Andres Freund wrote: > On 2018-06-11 11:50:55 +0900, Michael Paquier wrote: >> So, we have confirmation from the RTM that there will be a 2018-07. And >> there is visibly a consensus that renaming 2018-09 to 2018-07 makes the >> most sense. Any objections to moving forward with this renaming and the >> 5-CF plan for v12? > > FWIW, I still think it'd be a better plan to explicitly move things, > rather than just rename. But consensus isn't unanimity, so ... Well, at least we have a consensus. I have gone through the CF app and just did the renaming, creating new commit fests on the way. -- Michael signature.asc Description: PGP signature
RE: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
From: Tomas Vondra [mailto:tomas.von...@2ndquadrant.com] > Let me share some of the issues mentioned as possibly addressed by TDE > (I'm not entirely sure TDE actually solves them, I'm just saying those > were mentioned in previous discussions): FYI, our product provides TDE like Oracle and SQL Server, which enables encryption per tablespace. Relations, WAL records and temporary files related to encrypted tablespace are encrypted. http://www.fujitsu.com/global/products/software/middleware/opensource/postgres/ (I wonder why the web site doesn't offer the online manual... I've recognized we need to fix this situation. Anyway, I guess the downloadable trial version includes the manual.) > 1) enterprise requirement - Companies want in-database encryption, for > various reasons (because "enterprise solution" or something). To assist compliance with PCI DSS, HIPAA, etc. > 2) like FDE, but OS/filesystem independent - Same config on any OS and > filesystem, which may make maintenance easier. > > 3) does not require special OS/filesystem setup - Does not require help > from system adminitrators, setup of LUKS devices or whatever. > > 4) all filesystem access (basebackups/rsync) is encrypted anyway > > 5) solves key management (the main challenge with pgcrypto) > > 6) allows encrypting only some of the data (tables, columns) to minimize > performance impact All yes. > IMHO it makes sense to have TDE even if it provides the same "security" > as disk-level encryption, assuming it's more convenient to setup/use > from the database. Agreed. Regards Takayuki Tsunakawa
Re: pg_config.h.win32 missing a set of flags from pg_config.h.in added in v11 development
On 06/12/2018 08:07 PM, Michael Paquier wrote: On Tue, Jun 12, 2018 at 04:51:59PM -0400, Andrew Dunstan wrote: On 06/12/2018 10:51 AM, Andrew Dunstan wrote: meanwhile, Bowerbird (MSVC) is on 1.0.1d, lorikeet (Cygwin64) is on 1.0.2d and jacana (Mingw) doesn't build with openssl. I will look at upgrading bowerbird ASAP. Well, that crashed and burned. What versions of openssl are we supporting? What kind of failures are you seeing? I just compiled Postgres two days ago with MSVC and OpenSSL 1.0.2o (oldest version with a Windows installer I could find), and that was able to compile. On HEAD, OpenSSL should be supported down to 0.9.8. This thread discusses about whether we want to enforce HAVE_X509_GET_SIGNATURE_NID unconditionally or not, as it is disabled now. Even if the code is linked to 1.0.2 and the flag is not set, then the code should be able to compile. I installed 1.1.0h and got errors you can see on the buildfarm. I've now installed 1.0.2o and everything is good. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
RE: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
> From: Tomas Vondra [mailto:tomas.von...@2ndquadrant.com] > On 05/25/2018 01:41 PM, Moon, Insung wrote: > > BTW, I want to support CBC mode encryption[3]. However, I'm not sure > > how to use the IV in CBC mode for this proposal. I'd like to hear > > opinions by security engineer. > > > > I'm not a cryptographer either, but this is exactly where you need a > prior discussion about the threat models - there are a couple of > chaining modes, each with different weaknesses. Our products uses XTS, which recent FDE software like BitLocker and TrueCrypt uses instead of CBC. https://en.wikipedia.org/wiki/Disk_encryption_theory#XTS "According to SP 800-38E, "In the absence of authentication or access control, XTS-AES provides more protection than the other approved confidentiality-only modes against unauthorized manipulation of the encrypted data."" > FWIW it may also matter if data_checksums are enabled, because that may > prevent malleability attacks affecting of the modes. Assuming active > attacker (with the ability to modify the data files) is part of the > threat model, of course. Encrypt the page after embedding its checksum value. If a malicious attacker modifies a page on disk, then the decrypted page would be corrupt anyway, which can be detected by checksum. Regards Takayuki Tsunakawa
Re: pg_config.h.win32 missing a set of flags from pg_config.h.in added in v11 development
On Wed, Jun 13, 2018 at 08:55:46PM -0400, Andrew Dunstan wrote: > I installed 1.1.0h and got errors you can see on the buildfarm. I've now > installed 1.0.2o and everything is good. Good thing you tested that. I have just used the LTS 1.0.2 for my tests. And there are a couple of problems here. HAVE_BIO_GET_DATA is neither defined nor enforced in pg_config.h.win32, and BIO_get_data has been introduced in 1.1.0, so that explains the failures, as you would need to add it to the config files. I imagine that most folks packaging Postgres on Windows simply rely on 1.0.2 (I do!) which is why we have not seen reports of those failures... A second failure is related to HAVE_BIO_METH_NEW, with all routine sets like BIO_meth_set_write & co new as of OpenSSL 1.1.0. The configure check uses BIO_meth_new(). A third problem is related to HAVE_ASN1_STRING_GET0_DATA, but I don't see a complaint in the buildfarm logs, which is interesting, but that's already present in pg_config.h.win32. A fourth problem is with HAVE_OPENSSL_INIT_SSL, which is missing in pg_config.h.win32. We claim support for OpenSSL 1.1.0 down to 9.4 per bb132cdd, so I think that we should document all those flags even in back-branches. Thoughts from people here? For now, I would suggest keeping track of HAVE_BIO_GET_DATA, HAVE_BIO_METH_NEW and others in pg_config.h.win32 but mark them as undef. Anybody shipping Windows stuff also likely patches the MSVC scripts lightly (I do for some paths!), so they could always enforce those flags if building with OpenSSL 1.1.0... Documenting them is really important as well. So attached is an updated patch which should be applied on HEAD to close the gap and close this open item with all the gaps mentioned in the commit message. I'd like to document (but disable!) the OpenSSL 1.1.0 flags down to 9.4 as well, since that's where we claim support for this version, as bb132cd missed the shot. This would break Windows MSVC builds linking to OpenSSL 1.0.1 or older, so the buildfarm will likely turn red here or there. Thoughts? -- Michael From c571b8d693a5498f2f49f786b065f911e7e4b505 Mon Sep 17 00:00:00 2001 From: Michael Paquier Date: Thu, 14 Jun 2018 10:35:06 +0900 Subject: [PATCH] Track new configure flags introduced for 11 in pg_config.h.win32 The following set of flags mainly matters when building Postgres code with MSVC and those have been forgotten in the latest developments: - HAVE_LDAP_INITIALIZE, added by 35c0754f, but tracked as disabled for now. ldap_initialize() is a non-standard extension that provides a way to use "ldaps" with OpenLDAP, but it is not supported on Windows, and instead the non-standard ldap_sslinit() is used if WIN32 is defined. Per input from Thomas Munro. - HAVE_X509_GET_SIGNATURE_NID, added by 054e8c6c, which is used by SCRAM's channel binding tls-server-end-point. Having this flag disabled would cause this channel binding type to be unsupported for Windows builds. - HAVE_SSL_CLEAR_OPTIONS, added recently as of a364dfa4 to disable SSL compression. - HAVE_ASN1_STRING_GET0_DATA, added by 5c6df67, which is used to track a new compatibility with OpenSSL 1.1.0. This was missing from pg_config.h.win32 and is not enabled by default. HAVE_BIO_GET_DATA, HAVE_OPENSSL_INIT_SSL and HAVE_BIO_METH_NEW gain the same treatment. The second and third flags are enabled with this commit, which raises the bar of OpenSSL support to 1.0.2 on Windows as a minimum. 
As this is the TLS version of community and knowing that all recent installers referred by upstream don't have anymore 1.0.1 or older, we could live with that requirement. In order to allow the code to compile with OpenSSL 1.1.0, all the flags mentioned above need to be enabled in pg_config.h.win32. Author: Michael Paquier Discussion: https://postgr.es/m/20180529211559.gf6...@paquier.xyz --- src/include/pg_config.h.win32 | 21 + 1 file changed, 21 insertions(+) diff --git a/src/include/pg_config.h.win32 b/src/include/pg_config.h.win32 index 2c701fa718..0c15d7f624 100644 --- a/src/include/pg_config.h.win32 +++ b/src/include/pg_config.h.win32 @@ -72,6 +72,15 @@ # define gettimeofday(a,b) gettimeofday(a) #endif +/* Define to 1 if you have the `ASN1_STRING_get0_data' function. */ +/* #undef HAVE_ASN1_STRING_GET0_DATA */ + +/* Define to 1 if you have the `BIO_get_data' function. */ +/* #undef HAVE_BIO_GET_DATA */ + +/* Define to 1 if you have the `BIO_meth_new' function. */ +/* #undef HAVE_BIO_METH_NEW */ + /* Define to 1 if you have the `cbrt' function. */ //#define HAVE_CBRT 1 @@ -233,6 +242,9 @@ /* Define to 1 if you have the header file. */ /* #undef HAVE_LDAP_H */ +/* Define to 1 if you have the `ldap_initialize' function. */ +/* #undef HAVE_LDAP_INITIALIZE */ + /* Define to 1 if you have the `crypto' library (-lcrypto). */ /* #undef HAVE_LIBCRYPTO */ @@ -288,6 +300,9 @@ /* Define to 1 if you have the header file. */ /* #undef HAVE_NETINET_TCP_H */ +/* Define to 1 if you have the `OPENSSL_init_ss
Re: Partitioning with temp tables is broken
On Wed, Jun 13, 2018 at 10:25:23PM +0530, amul sul wrote: > On Wed, Jun 13, 2018, 8:34 PM Tom Lane wrote: >> Even if you want to argue that there's a use case for these situations, >> it seems far too late in the release cycle to be trying to fix all these >> issues. I think we need to forbid the problematic cases for now, and >> leave relaxing the prohibition to be treated as a future new feature. > > +1, forbid the problematic case. +1 on that. And I can see that nobody on this thread has tested with REL_10_STABLE, right? In that case you don't get a crash, but other sessions can see another session's temporary partition, say with \d+. So it seems to me that we should forbid the use of temporary relations in a partition tree and back-patch it as well to v10 for consistency. And when trying to insert into the temporary relation you get a nice error: =# insert into listp values (2); ERROR: 0A000: cannot access temporary tables of other sessions LOCATION: ReadBufferExtended, bufmgr.c:657 This is also a bit counterintuitive, as I would think of it as an allowed operation; still, my gut tells me that the code complications are not really worth the use cases: =# create temp table listp2 partition of listp for values in (2); ERROR: 42P17: partition "listp2" would overlap partition "listp2" LOCATION: check_new_partition_bound, partition.c:81 -- Michael signature.asc Description: PGP signature
Re: Index maintenance function for BRIN doesn't check RecoveryInProgress()
On Thu, Jun 14, 2018 at 02:06:57AM +0900, Masahiko Sawada wrote: > On Thu, Jun 14, 2018 at 12:04 AM, Simon Riggs wrote: >> On 13 June 2018 at 15:51, Alvaro Herrera wrote: >>> I guess you could go either way ... we're just changing one unhelpful >>> error with a better one: there is no change in behavior. I would >>> backpatch this, myself, and avoid the code divergence. >> >> WAL control functions all say the same thing, so we can do that here also. > > +1 +1 for a back-patch to v10. The new error message brings clarity in my opinion. -- Michael signature.asc Description: PGP signature
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Wed, Jun 13, 2018 at 10:03 PM, Tomas Vondra wrote: > On 06/11/2018 11:22 AM, Masahiko Sawada wrote: >> >> On Fri, May 25, 2018 at 8:41 PM, Moon, Insung >> wrote: >>> >>> Hello Hackers, >>> >>> This propose a way to develop "Table-level" Transparent Data Encryption >>> (TDE) and Key Management Service (KMS) support in PostgreSQL. >>> >>> ... >> >> >> As per discussion at PGCon unconference, I think that firstly we >> need to discuss what threats we want to defend database data against. >> If user wants to defend against a threat that is malicious user who logged >> in OS or database steals an important data on datbase this design TDE would >> not help. Because such user can steal the data by getting a memory dump or >> by SQL. That is of course differs depending on system requirements or >> security compliance but what threats do >> you want to defend database data against? and why? >> > > I do agree with this - a description of the threat model needs to be part of > the design discussion, otherwise it's not possible to compare it to > alternative solutions (e.g. full-disk encryption using LUKS or using > existing privilege controls and/or RLS). > > TDE was proposed/discussed repeatedly in the past, and every time it died > exactly because it was not very clear which issue it was attempting to > solve. > > Let me share some of the issues mentioned as possibly addressed by TDE (I'm > not entirely sure TDE actually solves them, I'm just saying those were > mentioned in previous discussions): Thank you for sharing! > > 1) enterprise requirement - Companies want in-database encryption, for > various reasons (because "enterprise solution" or something). Yes, I'm often asked it by our customers especially for database migration from DBMS that supports TDE in order to reduce costs of migration. > > 2) like FDE, but OS/filesystem independent - Same config on any OS and > filesystem, which may make maintenance easier. > > 3) does not require special OS/filesystem setup - Does not require help from > system adminitrators, setup of LUKS devices or whatever. > > 4) all filesystem access (basebackups/rsync) is encrypted anyway > > 5) solves key management (the main challenge with pgcrypto) > > 6) allows encrypting only some of the data (tables, columns) to minimize > performance impact > > IMHO it makes sense to have TDE even if it provides the same "security" as > disk-level encryption, assuming it's more convenient to setup/use from the > database. Agreed. > >> Also, if I understand correctly, at unconference session there also were >> two suggestions about the design other than the suggestion by Alexander: >> implementing TDE at column level using POLICY, and implementing TDE at >> table-space level. The former was suggested by >> Joe but I'm not sure the detail of that suggestion. I'd love to hear >> the deal of that suggestion. The latter was suggested by >> Tsunakawa-san. Have you considered that? >> >> You mentioned that encryption of temporary data for query processing and >> large objects are still under the consideration. But other than them you >> should consider the temporary data generated by other subsystems such as >> reorderbuffer and transition table as well. >> > > The severity of those limitations is likely related to the threat model. I > don't think encrypting temporary data would be a big problem, assuming you > know which key to use. Agreed. 
I thought about the possibility of non-encrypted temporary data in backups, but since we don't include temporary files in backups it would not be a big problem. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: [HACKERS] Pluggable storage
On Thu, Jun 14, 2018 at 1:50 AM, Haribabu Kommi wrote: > > On Fri, Apr 20, 2018 at 4:44 PM Haribabu Kommi > wrote: > > VACUUM: > Not many changes are done in this, apart from moving the Vacuum visibility > functions into the > storage. But the idea for Vacuum was that each access method can define how > it should perform. > We are planning to have a somewhat different mechanism for vacuum (for non-delete marked indexes), so if you can provide some details or discuss what you have in mind before implementing the same, that would be great. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Shared access methods?
Hi, Several features in various discussed access methods would benefit from being able to perform actions when writing out a buffer. As an example, because it doesn't require understanding any of the new proposed storage formats, it'd be good for performance if we could eagerly set hint bits / perform page level pruning when cleaning a dirty buffer either during checkpoint writeout or bgwriter / backend reclaim. That'd allow avoiding the write amplification issues in several of the current and proposed cleanup schemes. Unfortunately that's currently not really easy to do. Buffers don't currently know which AM they belong to, therefore we can't know how to treat them at writeout time. It's not that hard to make space in the buffer descriptor to additionally store the oid of the associated AM, e.g. we could just replace buf_id with a small bit of pointer math. But even if we had an AM oid, it'd be unclear what to do with it as it'd be specific to a database. Which makes it pretty much useless for tasks happening on writeout of victim buffers / checkpoint. Thus I think it'd be a better design to have pg_am be a shared relation. That'd imply two things: a) amhandler couldn't be regproc but would need to be two fields, one pointing to internal or a shlib, the other to the function name. b) extensions containing AMs would need to do something like INSERT ... ON CONFLICT DO NOTHING. I don't think this is the most urgent feature for making pluggable AMs useful, but given that we're likely going to whack around pg_am, and that pg_am is fairly new in its current incarnation, it seems like a good idea to discuss this now. Comments? Greetings, Andres Freund PS: I could have written more on this, but people are urging me to come to dinner, so thank them ;)
[GSoC] current working status
Hi mentors and hackers, The first evaluation is coming. Here is my progress so far. During the first stage of work, I have implemented the thrift binary protocol as the format of the postgresql plugin. Currently, the main interface is bytes. Users who use this plugin need to provide thrift bytes to the plugin, and there are helpers for each data type to parse out the value contained in the bytes. This method has been verified by the use of several unit tests. However, the current interface requires users to understand the thrift format very well to use this plugin. After discussing with my mentors, it will be more useful to implement the concept of a "thrift type", which is a custom type where the user provides a user-understandable format (e.g., json) and the plugin converts it into bytes. I am currently busy with implementing this feature and still need some time to complete it. Once this part is done, I will go ahead and implement the thrift compact protocol. Let me know if you have comments. You can always track the progress on github ( https://github.com/charles-cui/pg_thrift) Thanks, Charles!
Re: WAL prefetch
On Wed, Jun 13, 2018 at 6:39 PM, Konstantin Knizhnik wrote: > There was a very interesting presentation at PGCon about pg_prefaulter: > > http://www.pgcon.org/2018/schedule/events/1204.en.html > > But it is implemented in Go and uses pg_waldump. > I tried to do the same, but using built-in Postgres WAL traversal functions. > I have implemented it as an extension for simplicity of integration. > In principle it can be started as a BG worker. > Right, or in other words, it could do something like autoprewarm [1], which could allow a more user-friendly interface for this utility if we decide to include it. > First of all I tried to estimate the effect of preloading data. > I have implemented a prefetch utility which is also attached to this mail. > It performs random reads of blocks of some large file and spawns some number > of prefetch threads: > > Just normal read without prefetch: > ./prefetch -n 0 SOME_BIG_FILE > > One prefetch thread which uses pread: > ./prefetch SOME_BIG_FILE > > One prefetch thread which uses posix_fadvise: > ./prefetch -f SOME_BIG_FILE > > 4 prefetch threads which use posix_fadvise: > ./prefetch -f -n 4 SOME_BIG_FILE > > Based on these experiments (on my desktop), I made the following conclusions: > > 1. Prefetch on HDD doesn't give any positive effect. > 2. Using posix_fadvise speeds up random reads on SSD by up to 2x. > 3. posix_fadvise(WILLNEED) is more efficient than performing normal reads. > 4. Calling posix_fadvise in more than one thread makes no sense. > > I have tested wal_prefetch on two powerful servers with 24 cores, 3Tb NVME > RAID 10 storage and 256Gb of RAM, connected using InfiniBand. > The speed of synchronous replication between the two nodes increased from 56k > TPS to 60k TPS (on pgbench with scale 1000). > That's a reasonable improvement. > Usage: > 1. At master: create extension wal_prefetch > 2. At replica: call the pg_wal_prefetch() function: it will not return until you > interrupt it. > I think it is not a very user-friendly interface, but the idea sounds good to me; it can help some other workloads. I think this can help in recovery as well. [1] - https://www.postgresql.org/docs/devel/static/pgprewarm.html -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
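As a side note for readers following this thread, the kernel call at the centre of these measurements is a one-liner. A minimal sketch, where prefetch_block is a made-up helper and BLCKSZ is hard-coded to PostgreSQL's default block size:

#include <fcntl.h>

#define BLCKSZ 8192				/* PostgreSQL's default block size */

/*
 * Ask the kernel to start reading one block into the OS page cache without
 * blocking the caller.  Returns 0 on success, otherwise an errno value.
 */
static int
prefetch_block(int fd, off_t blkno)
{
	return posix_fadvise(fd, blkno * BLCKSZ, BLCKSZ, POSIX_FADV_WILLNEED);
}

On Linux this merely schedules readahead and returns immediately, which matches the observations above: it can hide I/O latency on SSDs, while on HDDs the seek still dominates and little is gained.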
Two round for Client Authentication
Hello everyone. I am a newcomer to PostgreSQL, and I don't know if it is proper to post my question here, but I really need some help. Currently I am reading and testing code related to Client Authentication, and I find that two processes are forked if I log in using psql, while only one process is forked when using pgAdmin. My pg_hba.conf is as follows:

local all all md5
host all all 192.168.64.56/32 md5

If I log in with psql, the following stack is called twice:

ServerLoop
BackendStartup
BackendRun
PostgresMain
InitPostgres
PerformAuthentication
ClientAuthentication

In the first round, the variable `status` in ClientAuthentication() is always STATUS_EOF (at least in my test). In the second round, `status` is what I would expect: STATUS_OK for a correct password, STATUS_ERROR for a wrong password, and STATUS_EOF for an empty password. Why are two such processes forked? I think one round should be enough, as when using pgAdmin. Besides, I find it hard to debug the ServerLoop() stack, compared with the backend process for a query, because there are too many subprocesses and signals involved. It would be of great help if anyone could give me some instructions on how to debug the postmaster using gdb. Thanks in advance! Best Regards
Re: Partitioning with temp tables is broken
On 2018/06/14 11:09, Michael Paquier wrote: > On Wed, Jun 13, 2018 at 10:25:23PM +0530, amul sul wrote: >> On Wed, Jun 13, 2018, 8:34 PM Tom Lane wrote: >>> Even if you want to argue that there's a use case for these situations, >>> it seems far too late in the release cycle to be trying to fix all these >>> issues. I think we need to forbid the problematic cases for now, and >>> leave relaxing the prohibition to be treated as a future new feature. >> >> +1, forbid the problematic case. > > +1 on that. And I can see that nobody on this thread has tested with > REL_10_STABLE, right? In that case you don't get a crash, but other > sessions can see the temporary partition of another's session, say with > \d+. So it seems to me that we should forbid the use of temporary > relations in a partition tree and back-patch it as well to v10 for > consistency. And if trying to insert into the temporary relation you > get a nice error: > > =# insert into listp values (2); > ERROR: 0A000: cannot access temporary tables of other sessions > LOCATION: ReadBufferExtended, bufmgr.c:657 > > This is also a bit unintuitive as I would think of it as an authorized > operation, still my guts tell me that the code complications are not > really worth the use-cases: > > =# create temp table listp2 partition of listp for values in (2); > ERROR: 42P17: partition "listp2" would overlap partition "listp2" > LOCATION: check_new_partition_bound, partition.c:81

I have to agree with reverting this "feature". As Michael points out, even PG 10 fails to give a satisfactory error message when using a declarative feature like tuple routing. It should've been "partition not found for tuple" rather than "cannot access temporary tables of other sessions". Further, as David and Ashutosh point out, PG 11 added even more code around declarative partitioning that failed to consider the possibility that some partitions may not be accessible in a given session even though they are visible through the relcache. I'm attaching a patch here to forbid adding a temporary table as a partition of a permanent table. I didn't however touch the feature that allows *all* members in a partition tree to be temporary tables.

Thanks, Amit

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 8b848f91a7..fb7cd561e2 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -1985,6 +1985,16 @@ MergeAttributes(List *schema, List *supers, char relpersistence,
 				 (errcode(ERRCODE_WRONG_OBJECT_TYPE),
 				  errmsg("inherited relation \"%s\" is not a table or foreign table",
 						 parent->relname)));
+
+		/* If the parent is permanent, so must be all of its partitions. */
+		if (is_partition &&
+			relation->rd_rel->relpersistence != RELPERSISTENCE_TEMP &&
+			relpersistence == RELPERSISTENCE_TEMP)
+			ereport(ERROR,
+					(errcode(ERRCODE_WRONG_OBJECT_TYPE),
+					 errmsg("cannot create a temporary relation as partition of permanent relation \"%s\"",
+							RelationGetRelationName(relation))));
+
 		/* Permanent rels cannot inherit from temporary ones */
 		if (relpersistence != RELPERSISTENCE_TEMP &&
 			relation->rd_rel->relpersistence == RELPERSISTENCE_TEMP)
@@ -14135,7 +14145,15 @@ ATExecAttachPartition(List **wqueue, Relation rel, PartitionCmd *cmd)
 						RelationGetRelationName(rel),
 						RelationGetRelationName(attachrel))));
 
-	/* Temp parent cannot have a partition that is itself not a temp */
+	/* If the parent is permanent, so must be all of its partitions. */
+	if (rel->rd_rel->relpersistence != RELPERSISTENCE_TEMP &&
+		attachrel->rd_rel->relpersistence == RELPERSISTENCE_TEMP)
+		ereport(ERROR,
+				(errcode(ERRCODE_WRONG_OBJECT_TYPE),
+				 errmsg("cannot attach a temporary relation as partition of permanent relation \"%s\"",
+						RelationGetRelationName(rel))));
+
+	/* If the parent is temporary relation, so must be all of its partitions */
 	if (rel->rd_rel->relpersistence == RELPERSISTENCE_TEMP &&
 		attachrel->rd_rel->relpersistence != RELPERSISTENCE_TEMP)
 		ereport(ERROR,
@@ -14143,14 +14161,14 @@ ATExecAttachPartition(List **wqueue, Relation rel, PartitionCmd *cmd)
 				 errmsg("cannot attach a permanent relation as partition of temporary relation \"%s\"",
 						RelationGetRelationName(rel))));
Re: Two round for Client Authentication
On Thu, Jun 14, 2018 at 7:12 AM, Yinjie Lin wrote: > Currently I am reading and testing code about Client Authentication, but I > find that there are two progresses forked if I login using psql, while only > one progress is forked if using pgAdmin. > If psql finds the server asks for a password, it closes the first connection, displays a password prompt to the user, and then does another connection attempt with the password the user entered. You can avoid the first attempt with the -W flag; though there's usually no reason to do that in practice. .m
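For the archives: the same two-attempt behaviour can be reproduced from any libpq program. A rough sketch of what psql effectively does (error handling trimmed; connect_like_psql is a made-up name, and the password is passed in rather than prompted for):

#include <stdio.h>
#include "libpq-fe.h"

PGconn *
connect_like_psql(const char *conninfo, const char *password_from_prompt)
{
	PGconn	   *conn = PQconnectdb(conninfo);	/* first attempt, first backend */

	if (PQstatus(conn) == CONNECTION_BAD && PQconnectionNeedsPassword(conn))
	{
		char		with_password[1024];

		PQfinish(conn);			/* throw the first connection away */
		snprintf(with_password, sizeof(with_password), "%s password=%s",
				 conninfo, password_from_prompt);
		conn = PQconnectdb(with_password);	/* second attempt, second backend */
	}
	return conn;
}

Each PQconnectdb() call makes the postmaster fork a backend that runs ClientAuthentication(), which is why the stack in the original report appears twice and why the first round ends with STATUS_EOF.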
Re: Two round for Client Authentication
On Wednesday, June 13, 2018, Yinjie Lin wrote: > > Why are there two such progresses forked? I think one round should be > enough, like when using pgAdmin. > You can use the --password option to prevent it. """ This option is never essential, since psql will automatically prompt for a password if the server demands password authentication. However, psql will waste a connection attempt finding out that the server wants a password. In some cases it is worth typing -W to avoid the extra connection attempt. """ In pgAdmin you've saved a password to the profile so the initial attempt uses it. psql doesn't have a similar capability. Though I am unsure whether the use of .pgpass would make any difference here... David J.
Re: Partitioning with temp tables is broken
On 2018/06/13 21:06, David Rowley wrote: > There's also something pretty weird around the removal of the temp > relation from the partition bound. I've had cases where the session > that attached the temp table is long gone, but \d+ shows the table is > still there and I can't attach another partition due to an overlap, > and can't drop the temp table due to the session not existing anymore. > I've not got a test case for that one yet, but the test case for the > crash is: > > -- Session 1: > create table listp (a int) partition by list(a); > create table listp1 partition of listp for values in(1); > create temp table listp2 partition of listp for values in (2); > > -- Session 2: > select * from listp; When Session 2 crashes (kill -9'ing it would also suffice), for some reason, Session 1 doesn't get an opportunity to perform RemoveTempRelationsCallback(). So both listp2's pg_class entry and any references to it (such as its pg_inherits entry as a partition of listp) persist. listp2 won't be removed from the partition bound until all of those catalog entries are removed. Thanks, Amit
Re: Postgres, fsync, and OSs (specifically linux)
On Wed, May 23, 2018 at 8:02 AM, Andres Freund wrote: > [patches] Hi Andres, Obviously there is more work to be done here but the basic idea in your clone-fd-checkpointer branch as of today seems OK to me. I think Craig and I both had similar ideas (sending file descriptors that have an old enough f_wb_err) but we thought rlimit management was going to be hard (shared memory counters + deduplication, bleugh). You made it look easy. Nice work. First, let me describe in my own words what's going on, mostly to make sure I really understand this: 1. We adopt a new fd lifecycle that is carefully tailored to avoid error loss on Linux, and can't hurt on other OSes. By sending the file descriptor used for every write to the checkpointer, we guarantee that (1) the inode stays pinned (the file is "in flight" in the socket, even if the sender closes it before the checkpointer receives it) so Linux won't be tempted to throw away the precious information in i_mapping->wb_err, and (2) the checkpointer finishes up with a file descriptor that points to the very same "struct file" with the f_wb_err value that was originally sampled before the write, by the sender. So we can't miss any write-back errors. Wahey! However, errors are still reported only once, so we probably need to panic. Hmm... Is there any way that the *sender* could finish up in file_check_and_advance_wb_err() for the same struct file, before the checkpointer manages to call fsync() on its dup'd fd? I don't immediately see how (it looks like you have to call one of the various sync functions to reach that, and your patch removes the fallback just-call-FileSync()-myself code from register_dirty_segment()). I guess it would be bad if, say, close() were to do that in some future Linux release because then we'd have no race-free way to tell the checkpointer that the file is borked before it runs fsync() and potentially gets an OK response and reports a successful checkpoint (even if we panicked, with sufficiently bad timing it might manage to report a successful checkpoint). 2. In order to make sure that we don't exceed our soft limit on the number of file descriptors per process, you use a simple backend-local counter in the checkpointer, on the theory that we don't care about fds (or, technically, the files they point to) waiting in the socket buffer, we care only about how many the checkpointer has actually received but not yet closed. As long as we make sure there is space for at least one more before we read one message, everything should be fine. Good and simple. One reason I thought this would be harder is because I had no model of how RLIMIT_NOFILE would apply when you start flinging fds around between processes (considering there can be moments when neither end has the file open), so I feared the worst and thought we would need to maintain a counter in shared memory and have senders and receiver cooperate to respect it. My intuition that this was murky and required pessimism wasn't too far from the truth actually: apparently the kernel didn't do a great job at accounting for that until a patch[1] landed for CVE-2013-4312. The behaviour in older releases is super lax, so no problem there. The behaviour from 4.9 onward (or is it 4.4.1?) is that you get a separate per-user RLIMIT_NOFILE allowance for in-flight fds. So effectively the sender doesn't have to worry about about fds it has sent but closed and the receiver doesn't have to care about fds it hasn't received yet, so your counting scheme seems OK. 
As for exceeding RLIMIT_NOFILE with in-flight fds, it's at least bounded by the fact that the socket would block/EWOULDBLOCK if the receiver isn't draining it fast enough and can only hold a small and finite amount of data and thus file descriptors, so we can probably just ignore that. If you did manage to exceed it, I think you'd find out about that with ETOOMANYREFS at send time (curiously absent from the sendmsg() man page, but there in black and white in the source for unix_attach_fds()), and then you'd just raise an error (currently FATAL in your patch). I have no idea how the rlimit for SCM-ified files works on other Unixoid systems though. Some actual code feedback:

+		if (entry->syncfds[forknum][segno] == -1)
+		{
+			char *path = mdpath(entry->rnode, forknum, segno);
+			open_fsync_queue_files++;
+			/* caller must have reserved entry */
+			entry->syncfds[forknum][segno] =
+				FileOpenForFd(fd, path);
+			pfree(path);
+		}
+		else
+		{
+			/*
+			 * File is already open. Have to keep the older fd, errors
+			 * might only be reported to it, thus close the one
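For anyone who hasn't worked with this corner of the sockets API, the fd transfer the patch relies on uses SCM_RIGHTS ancillary data on a Unix domain socket. A bare-bones sketch of a sending side, with socket setup and most error handling omitted (send_fd is an illustrative name, not code from the patch):

#include <string.h>
#include <sys/socket.h>

/*
 * Send one file descriptor over a connected AF_UNIX socket.  While the fd
 * is "in flight" the kernel keeps the underlying struct file referenced,
 * which is what pins the error state discussed above.
 */
static ssize_t
send_fd(int sock, int fd_to_send)
{
	struct msghdr msg = {0};
	struct iovec iov;
	char		payload = 'F';	/* at least one byte of real data is required */
	union
	{
		char		buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;	/* guarantees alignment of the control buffer */
	}			u;
	struct cmsghdr *cmsg;

	memset(u.buf, 0, sizeof(u.buf));
	iov.iov_base = &payload;
	iov.iov_len = 1;
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd_to_send, sizeof(int));

	/* can fail with e.g. ETOOMANYREFS when fd-passing limits are hit */
	return sendmsg(sock, &msg, 0);
}

The receiving side does the mirror-image recvmsg() and pulls the new descriptor out of the ancillary data, at which point it counts against the receiver's own RLIMIT_NOFILE, which is the accounting question discussed above.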
ntile() throws ERROR when hashagg is false
Hi

ntile() throws ERROR when hashagg is false, test case given below.

postgres=# create table foo (a int, b int, c text);
CREATE TABLE
postgres=# insert into foo select i%20, i%30, to_char(i%12, 'FM') from generate_series(0, 36) i;
INSERT 0 37
postgres=# explain select ntile(a) OVER () from foo GROUP BY a;
                           QUERY PLAN
-----------------------------------------------------------------
 WindowAgg  (cost=25.00..29.50 rows=200 width=8)
   ->  HashAggregate  (cost=25.00..27.00 rows=200 width=4)
         Group Key: a
         ->  Seq Scan on foo  (cost=0.00..22.00 rows=1200 width=4)
(4 rows)

postgres=# select ntile(a) OVER () from foo GROUP BY a;
 ntile
-------
     1
     1
     2
     2
     3
     3
     4
     4
     5
     5
     6
     6
     7
     7
     8
     8
     9
     9
    10
    11
(20 rows)

postgres=# set enable_hashagg to false;
SET
postgres=# explain select ntile(a) OVER () from foo GROUP BY a;
                              QUERY PLAN
-------------------------------------------------------------------------
 WindowAgg  (cost=83.37..91.87 rows=200 width=8)
   ->  Group  (cost=83.37..89.37 rows=200 width=4)
         Group Key: a
         ->  Sort  (cost=83.37..86.37 rows=1200 width=4)
               Sort Key: a
               ->  Seq Scan on foo  (cost=0.00..22.00 rows=1200 width=4)
(6 rows)

postgres=# select ntile(a) OVER () from foo GROUP BY a;
ERROR:  argument of ntile must be greater than zero

Thanks & Regards, Rajkumar Raghuwanshi QMG, EnterpriseDB Corporation
Re: WAL prefetch
On Thu, Jun 14, 2018 at 1:09 AM, Konstantin Knizhnik wrote: > pg_wal_prefetch function will infinitely traverse WAL and prefetch block > references in WAL records > using posix_fadvise(WILLNEED) system call. Hi Konstantin, Why stop at the page cache... what about shared buffers? -- Thomas Munro http://www.enterprisedb.com
Re: ntile() throws ERROR when hashagg is false
> "Rajkumar" == Rajkumar Raghuwanshi > writes: Rajkumar> Hi Rajkumar> ntile() throws ERROR when hashagg is false, test case given Rajkumar> below. Rajkumar> postgres=# create table foo (a int, b int, c text); Rajkumar> CREATE TABLE Rajkumar> postgres=# insert into foo select i%20, i%30, to_char(i%12, 'FM') from Rajkumar> generate_series(0, 36) i; Rajkumar> INSERT 0 37 Rajkumar> postgres=# explain select ntile(a) OVER () from foo GROUP BY a; This query isn't actually legal per the spec; the argument of ntile is restricted to being a constant or parameter, so it can't change from row to row. PG is more flexible, but that doesn't make the query any more meaningful. What I think pg is actually doing is taking the value of the ntile() argument from the first row and using that for the whole partition. In your example, enabling or disabling hashagg changes the order of the input rows for the window function (since you've specified no ordering in the window definition), and with hashagg off, you get the smallest value of a first (which is 0 and thus an error). -- Andrew (irc:RhodiumToad)