Re: [PATCH] Logical decoding of TRUNCATE

2018-04-04 Thread Simon Riggs
On 4 April 2018 at 03:31, Peter Eisentraut wrote:

> I wonder why this approach was chosen.

I looked at coding it that way and it seemed worse than the way chosen.

> I'm going to try to hack up an alternative approach and see how it works
> out.

If you add a new filter for TRUNCATE it will mean compatibility
problems and the need for pg_dump support.

We'd need a note in the release notes to explain that people will need
to add the TRUNCATE filter to their existing publications. I was hoping
to have that picked up automatically, which can't happen if we have an
explicit filter for it.

>> I know this has been discussed in the thread already, but it really
>> strikes me as wrong to basically do some mini DDL replication feature
>> via per-command WAL records.

Don't really understand that comment.

> I think TRUNCATE is special in some ways and it's reasonable to treat it
> specially.  Real DDL is being worked on in the 2PC decoding thread.
> What we are discussing here isn't going to be applicable there and vice
> versa, I think.  In fact, one of the reasons for this effort is that in
> BDR TRUNCATE is currently handled like a DDL command, which doesn't work
> well.

Agreed

-- 
Simon Riggs    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Runtime Partition Pruning

2018-04-04 Thread David Rowley
On 4 April 2018 at 18:27, Amit Langote  wrote:
> On 2018/04/04 11:10, David Rowley wrote:
>> On 4 April 2018 at 05:44, Jesper Pedersen  wrote:
>>> Also, I'm seeing a regression for check-world in
>>> src/test/regress/results/inherit.out
>>>
>>> ***
>>> *** 642,648 
>>>   -+---+---+-
>>>mlparted_tab_part1  | 1 | a |
>>>mlparted_tab_part2a | 2 | a |
>>> !  mlparted_tab_part2b | 2 | b | xxx
>>>mlparted_tab_part3  | 3 | a | xxx
>>>   (4 rows)
>>>
>>> --- 642,648 
>>>   -+---+---+-
>>>mlparted_tab_part1  | 1 | a |
>>>mlparted_tab_part2a | 2 | a |
>>> !  mlparted_tab_part2b | 2 | b |
>>>mlparted_tab_part3  | 3 | a | xxx
>>>   (4 rows)
>>>
>>> I'll spend some more time tomorrow.
>>
>> Yeah, it's a bug in v46 faster partition pruning. Discussing a fix for
>> that with Amit over on [2].
>
> I'm not sure if we've yet discussed anything that'd be related to this on
> the faster pruning thread.

hmm, yeah, I didn't really explain the context, but the report was in [1]

Basically, the OR clause in the following SQL fragment was overwriting
the scan_all_nonnull value:

where (mlp.a = ss.a and mlp.b = 'b') or mlp.a = 3;

The culprit was this line:

result->scan_all_nonnull = step_result->scan_all_nonnull;

The minimal fix would have been to change that line to:

result->scan_all_nonnull |= step_result->scan_all_nonnull;

Anyway, it's all irrelevant now, as that code has since changed.

[1] 
https://www.postgresql.org/message-id/cakjs1f_shpuqdhqwjq-_p1kppqn7bjt71ypbdp_8b3rhwfq...@mail.gmail.com

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] path toward faster partition pruning

2018-04-04 Thread David Rowley
On 4 April 2018 at 17:42, Amit Langote  wrote:
> I'm not sure about the following change in your patch:
>
> -   if (!result->scan_null)
> -   result->scan_null = step_result->scan_null;
> -   if (!result->scan_default)
> -   result->scan_default = step_result->scan_default;
> +   result->scan_null |= step_result->scan_null;
> +   result->scan_default |= step_result->scan_default;
>
> Afaik, |= does bitwise OR, which even if it might give the result we want,
> is not a logical operation.  I had written the original code using the
> following definition of logical OR.
>
>   a OR b = if a then true else b

Ok, no problem. I only changed that to make it more compact.

For the record, we do the same in plenty of other places across the code base:

E.g.

parse->hasSubLinks |= subquery->hasSubLinks;

/* If subquery had any RLS conditions, now main query does too */
parse->hasRowSecurity |= subquery->hasRowSecurity;
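
For what it's worth, a tiny standalone illustration (my own, not from the
thread) of why |= behaves as a logical OR here: a C99 bool can only hold 0
or 1, so the bitwise OR of two bools matches the "if a then true else b"
definition given above.

#include <assert.h>
#include <stdbool.h>

int
main(void)
{
	/* Enumerate all four combinations of two bool flags. */
	for (int i = 0; i < 4; i++)
	{
		bool		a = (i & 1) != 0;
		bool		b = (i & 2) != 0;
		bool		or_assign = a;
		bool		if_style = a;

		or_assign |= b;			/* the compact spelling */
		if (!if_style)			/* the spelling used originally */
			if_style = b;

		assert(or_assign == if_style);
	}
	return 0;
}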


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] path toward faster partition pruning

2018-04-04 Thread David Rowley
On 4 April 2018 at 19:04, Amit Langote  wrote:
> Attached v49.

Thanks for including the changes. I'll look now.


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Runtime Partition Pruning

2018-04-04 Thread Amit Langote
On 2018/04/04 16:04, David Rowley wrote:
> On 4 April 2018 at 18:27, Amit Langote  wrote:
>> I'm not sure if we've yet discussed anything that'd be related to this on
>> the faster pruning thread.
> 
> hmm, yeah, I didn't really explain the context, but the report was in [1]
> [1] 
> https://www.postgresql.org/message-id/cakjs1f_shpuqdhqwjq-_p1kppqn7bjt71ypbdp_8b3rhwfq...@mail.gmail.com

Oh, I see.  Hopefully it is no longer an issue.

Thanks,
Amit




Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-04 Thread Thomas Munro
On Wed, Apr 4, 2018 at 6:00 PM, Craig Ringer  wrote:
> On 4 April 2018 at 13:29, Thomas Munro 
> wrote:
>> /* Ensure that we skip any errors that predate opening of the file */
>> f->f_wb_err = filemap_sample_wb_err(f->f_mapping);
>>
>> [...]
>
> Holy hell. So even PANICing on fsync() isn't sufficient, because the kernel
> will deliberately hide writeback errors that predate our fsync() call from
> us?

Predates the opening of the file by the process that calls fsync().
Yeah, it sure looks that way based on the above code fragment.  Does
anyone know better?
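
To make that concrete, here is a minimal single-file sketch (mine, not from
the kernel or PostgreSQL sources; the path is made up) of the
write-from-one-descriptor, fsync-from-a-later-one pattern we depend on.
Going by the code fragment quoted above, a writeback failure that happens
while nobody holds the file open may simply never be reported to the later
fsync().

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main(void)
{
	/* "Backend": dirty the file through one descriptor, then close it. */
	int			fd = open("/tmp/datafile", O_CREAT | O_WRONLY, 0644);

	if (fd < 0 || write(fd, "some data", 9) != 9)
	{
		perror("write");
		return EXIT_FAILURE;
	}
	close(fd);

	/*
	 * Suppose asynchronous writeback of that dirty page fails right here,
	 * while no process has the file open.
	 */

	/* "Checkpointer": open the same file later and fsync it. */
	fd = open("/tmp/datafile", O_WRONLY);
	if (fd < 0)
	{
		perror("open");
		return EXIT_FAILURE;
	}
	if (fsync(fd) != 0)			/* can return 0 despite the lost write,  */
		perror("fsync");		/* since errors predating this open()    */
	close(fd);					/* were sampled away per the code above. */

	return EXIT_SUCCESS;
}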

> Does that mean that the ONLY ways to do reliable I/O are:
>
> - single-process, single-file-descriptor write() then fsync(); on failure,
> retry all work since last successful fsync()

I suppose you could come up with some crazy complicated IPC scheme to
make sure that the checkpointer always has an fd older than any writes
to be flushed, with some fallback strategy for when it can't take any
more fds.

I haven't got any good ideas right now.

> - direct I/O

As a bit of an aside, I gather that when you resize files (think
truncating/extending relation files) you still need to call fsync()
even if you read/write all data with O_DIRECT, to make it flush the
filesystem meta-data.  I have no idea if that could also be affected
by eaten writeback errors.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-04 Thread Craig Ringer
On 4 April 2018 at 14:00, Craig Ringer  wrote:

> On 4 April 2018 at 13:29, Thomas Munro 
> wrote:
>
>> On Wed, Apr 4, 2018 at 2:44 PM, Thomas Munro
>>  wrote:
>> > On Wed, Apr 4, 2018 at 2:14 PM, Bruce Momjian  wrote:
>> >> Uh, are you sure it fixes our use-case?  From the email description it
>> >> sounded like it only reported fsync errors for every open file
>> >> descriptor at the time of the failure, but the checkpoint process might
>> >> open the file _after_ the failure and try to fsync a write that
>> happened
>> >> _before_ the failure.
>> >
>> > I'm not sure of anything.  I can see that it's designed to report
>> > errors since the last fsync() of the *file* (presumably via any fd),
>> > which sounds like the desired behaviour:
>> >
>> > [..]
>>
>> Scratch that.  Whenever you open a file descriptor you can't see any
>> preceding errors at all, because:
>>
>> /* Ensure that we skip any errors that predate opening of the file */
>> f->f_wb_err = filemap_sample_wb_err(f->f_mapping);
>>
>> https://github.com/torvalds/linux/blob/master/fs/open.c#L752
>>
>> Our whole design is based on being able to open, close and reopen
>> files at will from any process, and in particular to fsync() from a
>> different process that didn't inherit the fd but instead opened it
>> later.  But it looks like that might be able to eat errors that
>> occurred during asynchronous writeback (when there was nobody to
>> report them to), before you opened the file?
>>
>
> Holy hell. So even PANICing on fsync() isn't sufficient, because the
> kernel will deliberately hide writeback errors that predate our fsync()
> call from us?
>
> I'll see if I can expand my testcase for that. I'm presently dockerizing
> it to make it easier for others to use, but that turns out to be a major
> pain when using devmapper etc. Docker in privileged mode doesn't seem to
> play nice with device-mapper.
>
>
Done, you can find it in
https://github.com/ringerc/scrapcode/tree/master/testcases/fsync-error-clear
now.

Warning, this runs a Docker container in privileged mode on your system,
and it uses devicemapper. Read it before you run it, and while I've tried
to keep it safe, beware that it might eat your system.

For now it tests only xfs and EIO. Other FSs should be easy enough.

I haven't added coverage for multi-processing yet, but given what you found
above, I should. I'll probably just system() a copy of the same proc with
instructions to only fsync(). I'll do that next.

I haven't worked out a reliable way to trigger ENOSPC on fsync() yet, when
mapping without the error hole. It happens sometimes, but I don't know why;
it almost always happens on write() instead. I know it can happen on nfs,
but I'm hoping for a saner example than that to test with. ext4 and xfs do
delayed allocation but eager reservation, so it shouldn't happen to them.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Problem while setting the fpw with SIGHUP

2018-04-04 Thread Kyotaro HORIGUCHI
At Sat, 31 Mar 2018 17:43:58 +0530, Amit Kapila wrote:
> On Wed, Mar 28, 2018 at 12:10 PM, Kyotaro HORIGUCHI
>  wrote:
> > At Tue, 27 Mar 2018 22:02:26 +0900, Michael Paquier  
> > wrote in <20180327130226.ga1...@paquier.xyz>
> >> On Tue, Mar 27, 2018 at 09:01:20PM +0900, Kyotaro HORIGUCHI wrote:
> >> > The current UpdateFullPageWrites is safe on standby and promotion
> >> > so what we should consider is only the non-standby case. I think
> >> > what we should do is just calling RecoveryInProgress() at the
> >> > beginning of CheckPointerMain, which is just the same thing with
> >> > InitPostgres, but before setting up signal handler to avoid
> >> > processing SIGHUP before being ready to insert xlog.
> >>
> >> Your proposal does not fix the issue for a checkpointer process started
> >> on a standby.  After a promotion, if SIGHUP is issued with a change in
> >> full_page_writes, then the initialization of InitXLogInsert() would
> >> happen again in the critical section of UpdateFullPageWrites().  The
> >> window is rather small for normal promotions as the startup process
> >> requests a checkpoint which would do the initialization, and much larger
> >> for fallback_promote where the startup process is in charge of doing the
> >> end-of-recovery checkpoint.
> >
> > Yeah. I realized that after sending the mail.
> >
> > I looked closer and found several problems there.
> >
> > - On standby, StartupXLOG calls UpdateFullPageWrites and
> >   checkpointer can call the same function simultaneously, but it
> >   doesn't assume concurrent call.
> >
> > - StartupXLOG can make a concurrent write to
> >   Insert->fullPageWrite so it needs to be locked.
> >
> > - At the time of the very end of recovery, the startup process
> >   ignores possible change of full_page_writes GUC. It sticks with
> >   the startup value. It leads to loss of
> >   XLOG_CHANGE_FPW. (StartXLOG is not considering GUC changes by
> >   reload)
> >
> 
> Won't this be covered by checkpointer process?  Basically, the next
> time checkpointer sees that it has received the sighup, it will call
> UpdateFullPageWrites which will log the record if required.

Right. The checkpointer is doing the right thing, except that it doesn't
write XLOG_FPW_CHANGE records during recovery.

The problem is in StartupXLOG. It doesn't read the shared FPW flag during
recovery; it updates its local flag according to WAL records, and then
tries to issue XLOG_FPW_CHANGE if its local status and the shared flag
differ. This is correct.

But after that, the checkpointer still cannot write XLOG (before
SharedRecoveryInProgress becomes false), yet it can change the shared
fullPageWrites flag without writing WAL, so the WAL record is eventually
lost.

> In general, I was wondering why in the first place this variable
> (full_page_writes) is a SIGHUP variable?  I think if the user tries to
> switch it to 'on' from 'off', it won't guarantee the recovery from
> torn pages.  Yeah, one can turn it to 'off' from 'on' without any
> problem, but as the reverse doesn't guarantee anything, it can confuse
> users. What do you think?

I tend to agree with you. It works as expected after the next
checkpoint. So define the variable as "it can be changed at any time
but takes effect at the next checkpoint", then remove the
XLOG_FPW_CHANGE record. That also eliminates the problem of concurrent
calls, since the checkpointer becomes the only modifier of the
variable, and the problematic function UpdateFullPageWrites no longer
needs to write a WAL record. The information is conveyed only by
checkpoint records.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Optimize Arm64 crc32c implementation in Postgresql

2018-04-04 Thread Heikki Linnakangas

On 03/04/18 21:56, Daniel Gustafsson wrote:

The following line should say SSE and not SSSE (and the same typo is in
src/include/pg_config.h.in and src/include/pg_config.h.win32).  While not
introduced in this patch, it’s adjacent to the patched codepath and on topic so
may well be fixed while in there.

  AC_DEFINE(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use 
Intel SSSE 4.2 CRC instructions with a runtime check.])


I pushed that as a separate commit, as that goes back to 9.5. Also, I 
noticed that the description of USE_SLICING_BY_8_CRC32C was completely 
wrong, fixed that too. Thanks!


- Heikki



Postgres stucks in deadlock detection

2018-04-04 Thread Konstantin Knizhnik

Hi hackers,
Please note that this is not an April Fools' joke ;)

Several times we and our customers have suffered from the problem that
Postgres gets stuck in deadlock detection.
One scenario is a YCSB workload with a Zipfian distribution, where many
clients try to update the same record and compete for its lock.
Another scenario is a large number of clients performing inserts into the
same table. In this case the source of the problem is the relation
extension lock.

In both cases the number of connections is large: several hundred.

So what's happening? Due to high contention, backends are unable to obtain
the requested lock within the deadlock detection timeout (1 second by
default).
The wait is interrupted by the timeout and the backend tries to perform
deadlock detection. CheckDeadLock takes an exclusive lock on all lock
partitions... An avalanche of deadlock timeout expirations in the backends,
and their contention on the exclusive partition locks, causes Postgres to
get stuck.
Throughput falls almost to zero and it is not even possible to log in to
Postgres.


It is a well-known fact that Postgres does not scale well to such a large
number of connections, and that it is necessary to use pgbouncer or some
other connection pooler to limit the number of backends. But modern systems
have hundreds of CPU cores, and to utilize all of these resources we need
hundreds of active backends.
So this is not an artificial problem, but a real show stopper that occurs
on real workloads.


There are several ways to solve this problem.
The first is trivial: increase the deadlock detection timeout. In the YCSB
case it helps. But with many concurrent inserts, some backends wait for a
lock for several minutes, so there is no realistic value of the deadlock
detection timeout that completely solves the problem.
Also, significantly increasing the deadlock detection timeout may block
applications for an unacceptable amount of time when a real deadlock
occurs.


There is a patch in the commitfest proposed by Yury Sokolov:
https://commitfest.postgresql.org/18/1602/
It performs the deadlock check in two phases: first under a shared lock,
then under an exclusive lock.


I am proposing a much simpler patch (attached) which uses an atomic flag
to prevent concurrent deadlock detection by more than one backend.
The obvious drawback of this solution is that detection of unrelated
deadlock loops may take longer. But a deadlock is an abnormal situation in
any case, and I do not know of applications that treat deadlocks as normal
behavior. Also, I have never seen a situation where more than one
independent deadlock happened at the same time (though obviously it is
possible).


So, I see three possible ways to fix this problem:
1. Yury Sokolov's patch with a two-phase deadlock check
2. Avoid concurrent deadlock detection
3. Avoid concurrent deadlock detection + let CheckDeadLock detect all
deadlocks, not only those in which the current transaction is involved.


I would like to know the community's opinion on these approaches (or
perhaps there are other solutions).


Thanks in advance,

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index bfa8499..6412184 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -86,6 +86,8 @@ static LOCALLOCK *lockAwaited = NULL;
 
 static DeadLockState deadlock_state = DS_NOT_YET_CHECKED;
 
+static bool inside_deadlock_check = false;
+
 /* Is a deadlock check pending? */
 static volatile sig_atomic_t got_deadlock_timeout;
 
@@ -186,6 +188,7 @@ InitProcGlobal(void)
 	ProcGlobal->walwriterLatch = NULL;
 	ProcGlobal->checkpointerLatch = NULL;
 	pg_atomic_init_u32(&ProcGlobal->procArrayGroupFirst, INVALID_PGPROCNO);
+	pg_atomic_init_flag(&ProcGlobal->activeDeadlockCheck);
 
 	/*
 	 * Create and initialize all the PGPROC structures we'll need.  There are
@@ -754,6 +757,14 @@ ProcReleaseLocks(bool isCommit)
 {
 	if (!MyProc)
 		return;
+
+	/* Release deadlock detection flag if backend was interrupted inside deadlock check */
+	if (inside_deadlock_check)
+	{
+		pg_atomic_clear_flag(&ProcGlobal->activeDeadlockCheck);
+		inside_deadlock_check = false;
+	}
+
 	/* If waiting, get off wait queue (should only be needed after error) */
 	LockErrorCleanup();
 	/* Release standard locks, including session-level if aborting */
@@ -1658,6 +1669,14 @@ CheckDeadLock(void)
 	int			i;
 
 	/*
+	 * Ensure that only one backend is checking for deadlock.
+	 * Otherwise, under high load, a cascade of deadlock timeout expirations can cause Postgres to get stuck.
+	 */
+	if (!pg_atomic_test_set_flag(&ProcGlobal->activeDeadlockCheck))
+		return;
+	inside_deadlock_check = true;
+
+	/*
 	 * Acquire exclusive lock on the entire shared lock data structures. Must
 	 * grab LWLocks in partition-number order to avoid LWLock deadlock.
 	 *
@@ -1732,6 +1751,9 @@ CheckDeadLock(voi

Re: Optimize Arm64 crc32c implementation in Postgresql

2018-04-04 Thread Heikki Linnakangas

On 03/04/18 19:43, Andres Freund wrote:

Architecture manual time?  They're available freely IIRC and should
answer this.


Yeah. The best reference I could find was "ARM Cortex-A Series 
Programmer’s Guide for ARMv8-A" 
(http://infocenter.arm.com/help/topic/com.arm.doc.den0024a/ch08s01.html). 
In the "Porting to A64" section, it says:



Data and code must be aligned to appropriate boundaries. The
alignment of accesses can affect performance on ARM cores and can
represent a portability problem when moving code from an earlier
architecture to ARMv8-A. It is worth being aware of alignment issues
for performance reasons, or when porting code that makes assumptions
about pointers or 32-bit and 64-bit integer variables.


I was a bit surprised by the "must be aligned to appropriate boundaries" 
statement. Googling around, the strict alignment requirement was removed 
in ARMv7, and since then, unaligned access works similarly to Intel. I 
think there are some special instructions, like atomic ops, that require 
alignment though. Perhaps that's what that sentence refers to.


On 03/04/18 20:47, Tom Lane wrote:

I'm pretty sure that some ARM platforms emulate unaligned access through
kernel trap handlers, which would certainly make this a lot slower than
handling the unaligned bytes manually.  Maybe that doesn't apply to any
ARM CPU that has this instruction ... but as you said, it'd be better
to consider the presence of the instruction as orthogonal to other
CPU features.


I did some quick testing, and found that unaligned access is about 2x 
slower than aligned. I don't think it's being trapped by the kernel, I 
think that would be even slower, but clearly there is an effect there. 
So I added code to process the first 1-7 bytes separately, so that the 
main loop runs on 8-byte aligned addresses.
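
For illustration, a rough sketch of the shape such a fix typically takes
(this is my own outline, not the committed code; the two helpers are
hypothetical stand-ins for the per-byte and per-doubleword CRC32C
instructions): consume leading bytes one at a time until the pointer is
8-byte aligned, then run the main loop over aligned 8-byte chunks.

#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the hardware CRC32C instructions. */
extern uint32_t crc32c_byte(uint32_t crc, uint8_t b);
extern uint32_t crc32c_uint64(uint32_t crc, uint64_t v);

static uint32_t
crc32c_update(uint32_t crc, const void *data, size_t len)
{
	const unsigned char *p = data;

	/* Handle the first 1-7 bytes until p is 8-byte aligned. */
	while (len > 0 && ((uintptr_t) p & 7) != 0)
	{
		crc = crc32c_byte(crc, *p++);
		len--;
	}

	/* Main loop: aligned 8-byte chunks. */
	while (len >= 8)
	{
		crc = crc32c_uint64(crc, *(const uint64_t *) p);
		p += 8;
		len -= 8;
	}

	/* Trailing 0-7 bytes. */
	while (len-- > 0)
		crc = crc32c_byte(crc, *p++);

	return crc;
}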


Pushed, thanks everyone!

- Heikki



Re: pgsql: Validate page level checksums in base backups

2018-04-04 Thread Magnus Hagander
On Tue, Apr 3, 2018 at 10:48 PM, Michael Banck wrote:

> Hi,
>
> On Tue, Apr 03, 2018 at 08:48:08PM +0200, Magnus Hagander wrote:
> > On Tue, Apr 3, 2018 at 8:29 PM, Tom Lane  wrote:
> > I'd bet a good lunch that nondefault BLCKSZ would break it, as well,
> > > since the way in which the corruption is induced is just guessing
> > > as to where page boundaries are.
> >
> > Yeah, that might be a problem. Those should be calculated from the block
> > size.
> >
> > Also, scribbling on tables as sensitive as pg_class is just asking for
> > > trouble IMO.  I don't see anything in this test, for example, that
> > > prevents autovacuum from running and causing a PANIC before the test
> > > can complete.  Even with AV off, there's a good chance that clobber-
> > > cache-always animals will fall over because they do so many more
> > > physical accesses to the system catalogs.  I'd suggest inducing the
> > > corruption in some user table(s) that we can more tightly constrain
> > > the source server's accesses to.
> >
> > Yeah, that seems like a good idea. And probably also shut the server down
> > while writing the corruption, just in case.
> >
> > Will stick looking into that on my todo for when I'm back, unless beaten
> to
> > it. Michael, you want a stab at it?
>
> Attached is a patch which does that hopefully:
>
> 1. creates two user tables, one large enough for at least 6 blocks
> (around 360kb), the other just one block.
>
> 2. stops the cluster before scribbling over its data and starts it
> afterwards.
>
> 3. uses the blocksize (and the page header size) to determine offsets
> for scribbling.
>
> I've tested it with blocksizes 8 and 32 now, the latter should make sure
> that the first table is indeed large enough, but maybe something less
> arbitrary than "1 integers" should be used?
>
> Anyway, sorry for the hassle.
>

Applied, with the addition that I explicitly disabled autovacuum on those
tables as well.

We might want to enhance it further by calculating the figure 10,000 based
on blocksize perhaps?

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: [HACKERS] GUC for cleanup indexes threshold.

2018-04-04 Thread Alexander Korotkov
On Tue, Apr 3, 2018 at 6:42 PM, Masahiko Sawada wrote:

> On Tue, Mar 27, 2018 at 8:01 PM, Alexander Korotkov
> > So, I would like to clarify why could my patch block future improvements
> > in this area?  For instance, if we would decide to make btree able to
> skip
> > cleanup when some delete pages are pending for recycle, we can add
> > it in future.
> >
>
> Anyway, for approaches of this feature I agree your version patch and
> your patch looks good to me as the first step of this feature.


Agreed.  I think we have consensus that this patch is a good first step,
which doesn't block further enhancements in the future.

So, I'm attaching a rebased version of the patch and marking this RFC.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


0001-lazy-btree-cleanup-7.patch
Description: Binary data


Re: json(b)_to_tsvector with numeric values

2018-04-04 Thread Teodor Sigaev

the consistency of FTS.I think this is a bug, which needs to be fixed,
else inconsistency with existing full text search  will be gets
deeper.
Hm, it seems like a useful feature, but I suggest making a separate function
jsonb_any_to_tsvector and adding support for boolean too (if you know a better
name for the function, do not hide it). Changing the behavior of an existing
function is not obvious to users and, it seems, should not be backpatched.




--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/



Re: [HACKERS] Add support for tuple routing to foreign partitions

2018-04-04 Thread Etsuro Fujita

(2018/04/03 22:01), Etsuro Fujita wrote:

Attached is an updated version of the patch. Patch
foreign-routing-fdwapi-3.patch is created on top of patch
postgres-fdw-refactoring-3.patch and the bug-fix patch [1].


One thing I noticed about patch foreign-routing-fdwapi-3.patch is this 
bug: the server will crash when copying data into a foreign table that 
doesn't support the proposed APIs (eg, file_fdw foreign tables).  The 
reason is that the patch doesn't perform CheckValidResultRel before that 
operation in that case.  So I modified the patch as such and added 
regression tests for that.


Attached is an updated version of the patch set:
* As before, patch foreign-routing-fdwapi-4.patch is created on top of 
patch postgres-fdw-refactoring-4.patch and the bug-fix patch [1].
* I revised comments, docs, and regression tests a bit further, but no 
code changes other than the bug fix.


Best regards,
Etsuro Fujita

[1] https://www.postgresql.org/message-id/5aba4074.1090...@lab.ntt.co.jp
*** a/contrib/postgres_fdw/postgres_fdw.c
--- b/contrib/postgres_fdw/postgres_fdw.c
***
*** 376,387  static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
--- 376,396 
  static void create_cursor(ForeignScanState *node);
  static void fetch_more_data(ForeignScanState *node);
  static void close_cursor(PGconn *conn, unsigned int cursor_number);
+ static PgFdwModifyState *create_foreign_modify(EState *estate,
+ 	  ResultRelInfo *resultRelInfo,
+ 	  CmdType operation,
+ 	  Plan *subplan,
+ 	  char *query,
+ 	  List *target_attrs,
+ 	  bool has_returning,
+ 	  List *retrieved_attrs);
  static void prepare_foreign_modify(PgFdwModifyState *fmstate);
  static const char **convert_prep_stmt_params(PgFdwModifyState *fmstate,
  		 ItemPointer tupleid,
  		 TupleTableSlot *slot);
  static void store_returning_result(PgFdwModifyState *fmstate,
  	   TupleTableSlot *slot, PGresult *res);
+ static void finish_foreign_modify(PgFdwModifyState *fmstate);
  static List *build_remote_returning(Index rtindex, Relation rel,
  	   List *returningList);
  static void rebuild_fdw_scan_tlist(ForeignScan *fscan, List *tlist);
***
*** 1681,1698  postgresBeginForeignModify(ModifyTableState *mtstate,
  		   int eflags)
  {
  	PgFdwModifyState *fmstate;
! 	EState	   *estate = mtstate->ps.state;
! 	CmdType		operation = mtstate->operation;
! 	Relation	rel = resultRelInfo->ri_RelationDesc;
! 	RangeTblEntry *rte;
! 	Oid			userid;
! 	ForeignTable *table;
! 	UserMapping *user;
! 	AttrNumber	n_params;
! 	Oid			typefnoid;
! 	bool		isvarlena;
! 	ListCell   *lc;
! 	TupleDesc	tupdesc = RelationGetDescr(rel);
  
  	/*
  	 * Do nothing in EXPLAIN (no ANALYZE) case.  resultRelInfo->ri_FdwState
--- 1690,1699 
  		   int eflags)
  {
  	PgFdwModifyState *fmstate;
! 	char	   *query;
! 	List	   *target_attrs;
! 	bool		has_returning;
! 	List	   *retrieved_attrs;
  
  	/*
  	 * Do nothing in EXPLAIN (no ANALYZE) case.  resultRelInfo->ri_FdwState
***
*** 1701,1782  postgresBeginForeignModify(ModifyTableState *mtstate,
  	if (eflags & EXEC_FLAG_EXPLAIN_ONLY)
  		return;
  
- 	/* Begin constructing PgFdwModifyState. */
- 	fmstate = (PgFdwModifyState *) palloc0(sizeof(PgFdwModifyState));
- 	fmstate->rel = rel;
- 
- 	/*
- 	 * Identify which user to do the remote access as.  This should match what
- 	 * ExecCheckRTEPerms() does.
- 	 */
- 	rte = rt_fetch(resultRelInfo->ri_RangeTableIndex, estate->es_range_table);
- 	userid = rte->checkAsUser ? rte->checkAsUser : GetUserId();
- 
- 	/* Get info about foreign table. */
- 	table = GetForeignTable(RelationGetRelid(rel));
- 	user = GetUserMapping(userid, table->serverid);
- 
- 	/* Open connection; report that we'll create a prepared statement. */
- 	fmstate->conn = GetConnection(user, true);
- 	fmstate->p_name = NULL;		/* prepared statement not made yet */
- 
  	/* Deconstruct fdw_private data. */
! 	fmstate->query = strVal(list_nth(fdw_private,
! 	 FdwModifyPrivateUpdateSql));
! 	fmstate->target_attrs = (List *) list_nth(fdw_private,
! 			  FdwModifyPrivateTargetAttnums);
! 	fmstate->has_returning = intVal(list_nth(fdw_private,
! 			 FdwModifyPrivateHasReturning));
! 	fmstate->retrieved_attrs = (List *) list_nth(fdw_private,
!  FdwModifyPrivateRetrievedAttrs);
! 
! 	/* Create context for per-tuple temp workspace. */
! 	fmstate->temp_cxt = AllocSetContextCreate(estate->es_query_cxt,
! 			  "postgres_fdw temporary data",
! 			  ALLOCSET_SMALL_SIZES);
! 
! 	/* Prepare for input conversion of RETURNING results. */
! 	if (fmstate->has_returning)
! 		fmstate->attinmeta = TupleDescGetAttInMetadata(tupdesc);
! 
! 	/* Prepare for output conversion of parameters used in prepared stmt. */
! 	n_params = list_length(fmstate->target_attrs) + 1;
! 	fmstate->p_flinfo = (FmgrInfo *) palloc0(sizeof(FmgrInfo) * n_params);
! 	fmstate->p_nums = 0;
! 
! 	if (operation == CMD_UPDATE 

Re: Optimize Arm64 crc32c implementation in Postgresql

2018-04-04 Thread Thomas Munro
On Wed, Apr 4, 2018 at 9:23 PM, Heikki Linnakangas  wrote:
> Pushed, thanks everyone!

On eelpout two test_decoding tests failed:

test ddl  ... FAILED
test rewrite  ... FAILED

The output has several cases where pg_logical_slot_get_changes() is
invoked and then fails like this:

  SELECT data FROM pg_logical_slot_get_changes('regression_slot',
NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
! ERROR:  incorrect resource manager data checksum in record at 0/1BD6600

Not sure why it would break when called by pg_logical_slot_get_changes()...

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=eelpout&dt=2018-04-04%2009%3A58%3A56

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Optimize Arm64 crc32c implementation in Postgresql

2018-04-04 Thread Heikki Linnakangas

On 04/04/18 14:13, Thomas Munro wrote:

On Wed, Apr 4, 2018 at 9:23 PM, Heikki Linnakangas  wrote:

Pushed, thanks everyone!


On eelpout two test_decoding tests failed:

test ddl  ... FAILED
test rewrite  ... FAILED

The output has several cases where pg_logical_slot_get_changes() is
invoked and then fails like this:

   SELECT data FROM pg_logical_slot_get_changes('regression_slot',
NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
! ERROR:  incorrect resource manager data checksum in record at 0/1BD6600

Not sure why it would break when called by pg_logical_slot_get_changes()...

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=eelpout&dt=2018-04-04%2009%3A58%3A56


Yep. I got the code before the main loop, to handle the first 1-7 
unaligned bytes, wrong. Apparently those are the only tests that call 
the CRC function with very short and unaligned input.


I've got a fix ready, and I'm running "make check-world" on my ARM box 
now, to make sure there aren't any more failures. I'll push the fix once 
that's finished. Should've run the full test suite before pushing in the 
first place..


- Heikki



Re: json(b)_to_tsvector with numeric values

2018-04-04 Thread Dmitry Dolgov
> On 4 April 2018 at 11:52, Teodor Sigaev  wrote:
 the consistency of FTS.I think this is a bug, which needs to be fixed,
 else inconsistency with existing full text search  will be gets
 deeper.
>
> Hm, seems, it's useful feature, but I suggest to make separate function
> jsonb_any_to_tsvector and add support for boolean too (if you know better
> name for function, do not hide it). Changing behavior of existing function
> is not obvious for users and, seems, should not backpatched.

What do you think about having not a separate function, but a flag argument to
the existing one (like `create` in `jsonb_set`), that will have false as
default value? The result would be the same, but without an extra function with
almost the same implementation.



Re: Optimize Arm64 crc32c implementation in Postgresql

2018-04-04 Thread Thomas Munro
On Wed, Apr 4, 2018 at 11:21 PM, Heikki Linnakangas  wrote:
> Yep. I got the code before the main loop, to handle the first 1-7 unaligned
> bytes, wrong. Apparently those are the only tests that call the CRC function
> with very short and unaligned input.

BTW I did some googling just now and found out that some libraries use
a technique they call "CPU probing": just try it and see if you get
SIGILL.  Is that a bad idea for some reason?  Here is a quick hack --
anyone got an ARM system without crc that they could test it on?
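
For context, the usual shape of such a probe (a generic sketch, not the
attached patch, which isn't reproduced here) is to install a temporary
SIGILL handler and siglongjmp out of it if the instruction traps, falling
back to the portable code path in that case. try_crc_instruction() is a
placeholder for executing a single hardware CRC instruction.

#include <setjmp.h>
#include <signal.h>
#include <stdbool.h>
#include <string.h>

static sigjmp_buf illegal_instruction_jmp;

static void
sigill_handler(int signo)
{
	(void) signo;
	siglongjmp(illegal_instruction_jmp, 1);
}

/* Placeholder: execute one hardware CRC instruction (intrinsic or asm). */
extern void try_crc_instruction(void);

static bool
cpu_has_crc_instructions(void)
{
	struct sigaction act,
				oact;
	bool		ok = false;

	memset(&act, 0, sizeof(act));
	act.sa_handler = sigill_handler;
	sigemptyset(&act.sa_mask);
	sigaction(SIGILL, &act, &oact);

	if (sigsetjmp(illegal_instruction_jmp, 1) == 0)
	{
		try_crc_instruction();
		ok = true;				/* no SIGILL raised: instruction works */
	}

	sigaction(SIGILL, &oact, NULL);		/* restore the previous handler */
	return ok;
}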

-- 
Thomas Munro
http://www.enterprisedb.com


cpu-probe-hack.patch
Description: Binary data


Re: Optimize Arm64 crc32c implementation in Postgresql

2018-04-04 Thread Thomas Munro
On Wed, Apr 4, 2018 at 11:47 PM, Thomas Munro wrote:
> BTW I did some googling just now and found out that some libraries use
> a technique they call "CPU probing": just try it and see if you get
> SIGILL.  Is that a bad idea for some reason?  Here is a quick hack --
> anyone got an ARM system without crc that they could test it on?

Hmm.  I suppose there must be some horrendous non-portable 'step over
the instruction' bit missing in the signal handler.  Probably not a
great idea.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Add default role 'pg_access_server_files'

2018-04-04 Thread Stephen Frost
Michael,

* Michael Paquier (mich...@paquier.xyz) wrote:
> On Mon, Apr 02, 2018 at 05:09:21PM -0400, Stephen Frost wrote:
> > * Michael Paquier (mich...@paquier.xyz) wrote:
> >> No refactoring for pg_file_unlink and its v1.1?
> > 
> > I considered each function and thought about if it'd make sense to
> > refactor them or if they were simple enough that the additional function
> > wouldn't really be all that useful.  I'm open to revisiting that, but
> > just want to make it clear that it was something I thought about and
> > considered.  Since pg_file_unlink is basically two function calls, I
> > didn't think it worthwhile to refactor those into their own function.
> 
> I don't mind if this is done your way.
> 
> >> The argument checks are exactly the same for pg_file_rename and
> >> pg_file_rename_v1_1.  Why about just passing fcinfo around and simplify
> >> the patch?
> > 
> > In general, I prefer to keep the PG_FUNCTION_ARGS abstraction when we
> > can.  Unfortunately, that does fall apart when wrapping an SRF as in
> > pg_logdir_ls(), but with pg_file_rename we can maintain it and it's
> > really not that much code to do so.  As with the refactoring of
> > pg_file_unlink, this is something which could really go either way.
> 
> Okay...
> 
> > I'm not sure how useful it is to REVOKE the rights on the simple SQL
> > function; that would just mean that an admin has to go GRANT the rights
> > on that function as well as the three-argument version.
> 
> Indeed, I had a brain fade here.
> 
> > The more I think about it, the more I like the approach of just dropping
> > these deprecated "alternative names for things in core" with the new
> > adminpack version.  In terms of applications, as I understand it, they
> > aren't used in the latest version of pgAdmin3 and they also aren't used
> > with pgAdmin4, so I don't think we need to be worrying about supporting
> > them in v11.
> 
> +1 to simplify the code a bit.

Great, thanks.  I'll be doing more review of it myself and see about
pushing it later this afternoon.

Stephen


signature.asc
Description: PGP signature


Re: pgsql: Validate page level checksums in base backups

2018-04-04 Thread Michael Banck
Hi,

On Wed, Apr 04, 2018 at 11:38:35AM +0200, Magnus Hagander wrote:
> On Tue, Apr 3, 2018 at 10:48 PM, Michael Banck 
> wrote:
> 
> > Hi,
> >
> > On Tue, Apr 03, 2018 at 08:48:08PM +0200, Magnus Hagander wrote:
> > > On Tue, Apr 3, 2018 at 8:29 PM, Tom Lane  wrote:
> > > I'd bet a good lunch that nondefault BLCKSZ would break it, as well,
> > > > since the way in which the corruption is induced is just guessing
> > > > as to where page boundaries are.
> > >
> > > Yeah, that might be a problem. Those should be calculated from the block
> > > size.
> > >
> > > Also, scribbling on tables as sensitive as pg_class is just asking for
> > > > trouble IMO.  I don't see anything in this test, for example, that
> > > > prevents autovacuum from running and causing a PANIC before the test
> > > > can complete.  Even with AV off, there's a good chance that clobber-
> > > > cache-always animals will fall over because they do so many more
> > > > physical accesses to the system catalogs.  I'd suggest inducing the
> > > > corruption in some user table(s) that we can more tightly constrain
> > > > the source server's accesses to.
> > >
> > > Yeah, that seems like a good idea. And probably also shut the server down
> > > while writing the corruption, just in case.
> > >
> > > Will stick looking into that on my todo for when I'm back, unless beaten
> > to
> > > it. Michael, you want a stab at it?
> >
> > Attached is a patch which does that hopefully:
> >
> > 1. creates two user tables, one large enough for at least 6 blocks
> > (around 360kb), the other just one block.
> >
> > 2. stops the cluster before scribbling over its data and starts it
> > afterwards.
> >
> > 3. uses the blocksize (and the page header size) to determine offsets
> > for scribbling.
> >
> > I've tested it with blocksizes 8 and 32 now, the latter should make sure
> > that the first table is indeed large enough, but maybe something less
> > arbitrary than "1 integers" should be used?
> >
> > Anyway, sorry for the hassle.
> >
> 
> Applied, with the addition that I explicitly disabled autovacuum on those
> tables as well.
 
Thanks! It looks like there have been no further buildfarm failures so far;
let's see how this goes.

> We might want to enhance it further by calculating the figure 10,000 based
> on blocksize perhaps?

10,000 was roughly twice the size needed for 32k block sizes. If there
are concerns that this might not be enough, I am happy to invest some
more time here (next week probably). However, the pg_basebackup
testsuite takes up 800+ MB to run, so I don't see an urgent need to
optimize away 50-100 KB (which clearly everybody else thought as well)
if we are talking about disk space overhead.


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer



Re: Online enabling of checksums

2018-04-04 Thread Michael Banck
Hi,

On Tue, Apr 03, 2018 at 02:05:04PM +0200, Magnus Hagander wrote:
> PFA a rebase on top of the just committed verify-checksums patch.

For the record, I am on vacation this week and won't be able to do
further in-depth review or testing of this patch before the end of the
commitfest, sorry.


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer



Re: [HACKERS] MERGE SQL Statement for PG11

2018-04-04 Thread Jesper Pedersen

Hi Simon,

On 03/30/2018 07:10 AM, Simon Riggs wrote:

No problems found, but moving proposed commit to 2 April pm



There is a warning for this, as attached.

Best regards,
 Jesper
diff --git a/src/backend/executor/nodeMerge.c b/src/backend/executor/nodeMerge.c
index 0e0d0795d4..9aee073a94 100644
--- a/src/backend/executor/nodeMerge.c
+++ b/src/backend/executor/nodeMerge.c
@@ -485,7 +485,6 @@ ExecMerge(ModifyTableState *mtstate, EState *estate, TupleTableSlot *slot,
 	ItemPointer tupleid;
 	ItemPointerData tuple_ctid;
 	bool		matched = false;
-	char		relkind;
 	Datum		datum;
 	bool		isNull;
 


Re: [HACKERS] path toward faster partition pruning

2018-04-04 Thread David Rowley
On 4 April 2018 at 19:04, Amit Langote  wrote:
> On 2018/04/04 14:42, Amit Langote wrote:
>> Attached v48.
>
> I had forgotten to remove the static_pruning parameter I had added in the
> v47, because it is no longer used.  Static pruning now occurs even if a
> step contains all Params, in which case each of
> get_matching_hash/list/range_bounds() functions returns offsets of all
> non-null datums, because the Params cannot be resolved to actual values
> during static pruning.

Thanks for updating. I've made a pass over v49 and I didn't find very
much wrong with it.

The only real bug I found was a missing IsA(rinfo->clause, Const) in
the pseudoconstant check inside
generate_partition_pruning_steps_internal.

Most of the changes are comment fixes, with a few stylistic changes
thrown in which are pretty much all there just to try to shrink the code
a line or two or reduce indentation.

I feel pretty familiar with this code now, and assuming the attached is
included, I'm happy for someone else, hopefully a committer, to take a
look at it.

I'll leave the following notes:

1. Still not sure about RelOptInfo->has_default_part. This flag is
only looked at in generate_partition_pruning_steps. The RelOptInfo and
the boundinfo is available to look at, it's just that the
partition_bound_has_default macro is defined in partition.c rather
than partition.h.

2. I don't really like the new isopne variable name. It's not very
easy to decode; perhaps something like is_not_eq would be better?

3. The part of the code I'm least familiar with is
get_steps_using_prefix_recurse(). I admit to not having had time to
fully understand that and consider ways to break it.

Marking as ready for committer.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


v49_fixes_drowley.patch
Description: Binary data


Re: Foreign keys and partitioned tables

2018-04-04 Thread Peter Eisentraut
On 4/3/18 15:11, Alvaro Herrera wrote:
> 0003 is the main patch, which is a bit changed from v4, notably to cover
> your review comments:

Looks good now.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pgsql: Validate page level checksums in base backups

2018-04-04 Thread Alvaro Herrera
Michael Banck wrote:

> However, the pg_basebackup testsuite takes up 800+ MB to run,

Uh, you're right.  This seems a bit over the top.  Can we reduce that
without losing coverage?  We've gone great lengths to avoid large
amounts of data in tests elsewhere.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-04 Thread Bruce Momjian
On Wed, Apr  4, 2018 at 07:32:04PM +1200, Thomas Munro wrote:
> On Wed, Apr 4, 2018 at 6:00 PM, Craig Ringer  wrote:
> > On 4 April 2018 at 13:29, Thomas Munro 
> > wrote:
> >> /* Ensure that we skip any errors that predate opening of the file */
> >> f->f_wb_err = filemap_sample_wb_err(f->f_mapping);
> >>
> >> [...]
> >
> > Holy hell. So even PANICing on fsync() isn't sufficient, because the kernel
> > will deliberately hide writeback errors that predate our fsync() call from
> > us?
> 
> Predates the opening of the file by the process that calls fsync().
> Yeah, it sure looks that way based on the above code fragment.  Does
> anyone know better?

Uh, just to clarify, what is new here is that it is ignoring any
_errors_ that happened before the open().  It is not ignoring write()'s
that happened but have not been written to storage before the open().

FYI, pg_test_fsync has always tested the ability to fsync() write()s
from other processes:

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written on a different
descriptor.)
        write, fsync, close                   5360.341 ops/sec     187 usecs/op
        write, close, fsync                   4785.240 ops/sec     209 usecs/op

Those two numbers should be similar.  I added this as a check to make
sure the behavior we were relying on was working.  I never tested sync
errors though.

I think the fundamental issue is that we always assumed that writes to
the kernel that could not be written to storage would remain in the
kernel until they succeeded, and that fsync() would report their
existence.

I can understand why kernel developers don't want to keep failed sync
buffers in memory, and once they are gone we lose reporting of their
failure.  Also, if the kernel is going to not retry the syncs, how long
should it keep reporting the sync failure?  To the first fsync that
happens after the failure?  How long should it continue to record the
failure?  What if no fsync() ever happens, which is likely for
non-Postgres workloads?  I think once they decided to discard failed
syncs and not retry them, the fsync behavior we are complaining about
was almost required.

Our only option might be to tell administrators to closely watch for
kernel write failure messages, and then restore or failover.  :-(

The last time I remember being this surprised about storage was in the
early Postgres years when we learned that just because the BSD file
system uses 8k pages doesn't mean those are atomically written to
storage.  We knew the operating system wrote the data in 8k chunks to
storage but:

o  the 8k pages are written as separate 512-byte sectors
o  the 8k might be contiguous logically on the drive but not physically
o  even 512-byte sectors are not written atomically

This is why we added writing full-page images to WAL, which is what
full_page_writes controls.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-04 Thread Bruce Momjian
On Wed, Apr  4, 2018 at 10:40:16AM +0800, Craig Ringer wrote:
> The trouble with direct I/O is that it pushes a _lot_ of work back on
> PostgreSQL regarding knowledge of the storage subsystem, I/O scheduling, etc.
> It's absurd to have the kernel do this, unless you want it reliable, in which
> case you bypass it and drive the hardware directly.
> 
> We'd need pools of writer threads to deal with all the blocking I/O. It'd be
> such a nightmare. Hey, why bother having a kernel at all, except for drivers?

I believe this is how Oracle views the kernel, so there is precedent for
this approach, though I am not advocating it.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-04 Thread Craig Ringer
On 4 April 2018 at 15:51, Craig Ringer  wrote:

> On 4 April 2018 at 14:00, Craig Ringer  wrote:
>
>> On 4 April 2018 at 13:29, Thomas Munro 
>> wrote:
>>
>>> On Wed, Apr 4, 2018 at 2:44 PM, Thomas Munro
>>>  wrote:
>>> > On Wed, Apr 4, 2018 at 2:14 PM, Bruce Momjian 
>>> wrote:
>>> >> Uh, are you sure it fixes our use-case?  From the email description it
>>> >> sounded like it only reported fsync errors for every open file
>>> >> descriptor at the time of the failure, but the checkpoint process
>>> might
>>> >> open the file _after_ the failure and try to fsync a write that
>>> happened
>>> >> _before_ the failure.
>>> >
>>> > I'm not sure of anything.  I can see that it's designed to report
>>> > errors since the last fsync() of the *file* (presumably via any fd),
>>> > which sounds like the desired behaviour:
>>> >
>>> > [..]
>>>
>>> Scratch that.  Whenever you open a file descriptor you can't see any
>>> preceding errors at all, because:
>>>
>>> /* Ensure that we skip any errors that predate opening of the file */
>>> f->f_wb_err = filemap_sample_wb_err(f->f_mapping);
>>>
>>> https://github.com/torvalds/linux/blob/master/fs/open.c#L752
>>>
>>> Our whole design is based on being able to open, close and reopen
>>> files at will from any process, and in particular to fsync() from a
>>> different process that didn't inherit the fd but instead opened it
>>> later.  But it looks like that might be able to eat errors that
>>> occurred during asynchronous writeback (when there was nobody to
>>> report them to), before you opened the file?
>>>
>>
>> Holy hell. So even PANICing on fsync() isn't sufficient, because the
>> kernel will deliberately hide writeback errors that predate our fsync()
>> call from us?
>>
>> I'll see if I can expand my testcase for that. I'm presently dockerizing
>> it to make it easier for others to use, but that turns out to be a major
>> pain when using devmapper etc. Docker in privileged mode doesn't seem to
>> play nice with device-mapper.
>>
>>
> Done, you can find it in https://github.com/ringerc/scrapcode/tree/master/
> testcases/fsync-error-clear now.
>
>
Update. Now supports multiple FSes.

I've tried xfs, jfs, ext3, ext4, even vfat. All behave the same on EIO.
Didn't try zfs-on-linux or other platforms yet.

Still working on getting ENOSPC on fsync() rather than write(). Kernel code
reading suggests this is possible, but all the above FSes reserve space
eagerly on write() even if they do delayed allocation of the actual
storage, so it doesn't seem to happen at least in my simple single-process
test.

I'm not overly inclined to complain about a fsync() succeeding after a
write() error. That seems reasonable enough, the kernel told the app at the
time of the failure. What else is it going to do? I don't personally even
object hugely to the current fsync() behaviour if it were, say, DOCUMENTED
and conformant to the relevant standards, though not giving us any sane way
to find out the affected file ranges makes it drastically harder to recover
sensibly.

But what's come out since on this thread is that we cannot even rely on
fsync() giving us an EIO *once* when it loses our data, because:

- all currently widely deployed kernels can fail to deliver the info due to
a recently fixed limitation; and
- the kernel deliberately hides errors from us if they relate to writes
that occurred before we opened the FD (?)

... that's really troubling. I thought we could at least fix this by
PANICing on EIO, and was mostly worried about ENOSPC. But now it seems we
can't even do that and expect reliability. So what the @#$ are we meant to
do?

It's the error reporting issues around closing and reopening files with
outstanding buffered I/O that's really going to hurt us here. I'll be
expanding my test case to cover that shortly.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: json(b)_to_tsvector with numeric values

2018-04-04 Thread Teodor Sigaev



Hm, seems, it's useful feature, but I suggest to make separate function
jsonb_any_to_tsvector and add support for boolean too (if you know better
name for function, do not hide it). Changing behavior of existing function
is not obvious for users and, seems, should not backpatched.


What do you think about having not a separate function, but a flag argument to
the existing one (like `create` in `jsonb_set`), that will have false as
default value? The result would be the same, but without an extra function with
almost the same implementation.


tsvector jsonb_to_tsvector(jsonb[, bool]) ?
Agreed. Second arg should be optional.


--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-04 Thread Craig Ringer
On 4 April 2018 at 22:00, Craig Ringer  wrote:


> It's the error reporting issues around closing and reopening files with
> outstanding buffered I/O that's really going to hurt us here. I'll be
> expanding my test case to cover that shortly.
>
>
Also, just to be clear, this is not in any way confined to xfs and/or lvm
as I originally thought it might be.

Nor is ext3/ext4's errors=remount-ro protective. data_err=abort doesn't
help either (so what does it do?).

What bewilders me is that running with data=journal doesn't seem to be safe
either. WTF?

[26438.846111] EXT4-fs (dm-0): mounted filesystem with journalled data
mode. Opts: errors=remount-ro,data_err=abort,data=journal
[26454.125319] EXT4-fs warning (device dm-0): ext4_end_bio:323: I/O error
10 writing to inode 12 (offset 0 size 0 starting block 59393)
[26454.125326] Buffer I/O error on device dm-0, logical block 59393
[26454.125337] Buffer I/O error on device dm-0, logical block 59394
[26454.125343] Buffer I/O error on device dm-0, logical block 59395
[26454.125350] Buffer I/O error on device dm-0, logical block 59396

and splat, there goes your data anyway.

It's possible that this is in some way related to using the device-mapper
"error" target and a loopback device in testing. But I don't really see how.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-04 Thread Bruce Momjian
On Wed, Apr  4, 2018 at 10:09:09PM +0800, Craig Ringer wrote:
> On 4 April 2018 at 22:00, Craig Ringer  wrote:
>  
> 
> It's the error reporting issues around closing and reopening files with
> outstanding buffered I/O that's really going to hurt us here. I'll be
> expanding my test case to cover that shortly.
> 
> 
> 
> Also, just to be clear, this is not in any way confined to xfs and/or lvm as I
> originally thought it might be. 
> 
> Nor is ext3/ext4's errors=remount-ro protective. data_err=abort doesn't help
> either (so what does it do?).

Anthony Iliopoulos reported in this thread that errors=remount-ro only
applies to metadata writes.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: new function for tsquery creartion

2018-04-04 Thread Dmitry Ivanov

I'm not sure about the different result for these queries:
SELECT websearch_to_tsquery('simple', 'cat or ');
 websearch_to_tsquery
--
 'cat'
(1 row)
SELECT websearch_to_tsquery('simple', 'cat or');
 websearch_to_tsquery
--
 'cat' & 'or'
(1 row)


I guess both queries should produce just 'cat'. I've changed the 
definition of parse_or_operator().



I found an odd behavior of the query creation function in case:
SELECT websearch_to_tsquery('english', '"pg_class pg"');
websearch_to_tsquery
-
 ( 'pg' & 'class' ) <-> 'pg'
(1 row)

This query means that lexemes 'pg' and 'class' should be at the same
distance from the last 'pg'. In other words, they should have the same
position. But default parser will interpret pg_class as two separate
words during text parsing/vector creation.

The bug wasn't introduced in the patch and can be found in current
master. During the discussion of the patch with Dmitry, he noticed that
to_tsquery() function shares same odd behavior:
select to_tsquery('english', ' pg_class <-> pg');
 to_tsquery
-
 ( 'pg' & 'class' ) <-> 'pg'
(1 row)


I've been thinking about this for a while, and it seems that this should 
be fixed somewhere near parsetext(). Perhaps 'pg' and 'class' should 
share the same position. After all, somebody could implement a parser 
which would split some words using an arbitrary set of rules, for 
instance "split all words containing digits". I propose merging this 
patch provided that there are no objections.


--
Dmitry Ivanov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Companydiff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 9a1efc14cf..122f034f17 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -9630,6 +9630,18 @@ CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green', 'blue', 'purple
 phraseto_tsquery('english', 'The Fat Rats')
 'fat' <-> 'rat'

+   
+
+ 
+  websearch_to_tsquery
+ 
+  websearch_to_tsquery( config regconfig ,  query text)
+ 
+tsquery
+produce tsquery from a web search style query
+websearch_to_tsquery('english', '"fat rat" or rat')
+'fat' <-> 'rat' | 'rat'
+   

 
  
diff --git a/doc/src/sgml/textsearch.sgml b/doc/src/sgml/textsearch.sgml
index 610b7bf033..19f58511c8 100644
--- a/doc/src/sgml/textsearch.sgml
+++ b/doc/src/sgml/textsearch.sgml
@@ -797,13 +797,16 @@ UPDATE tt SET ti =

 PostgreSQL provides the
 functions to_tsquery,
-plainto_tsquery, and
-phraseto_tsquery
+plainto_tsquery,
+phraseto_tsquery and
+websearch_to_tsquery
 for converting a query to the tsquery data type.
 to_tsquery offers access to more features
 than either plainto_tsquery or
-phraseto_tsquery, but it is less forgiving
-about its input.
+phraseto_tsquery, but it is less forgiving about its
+input. websearch_to_tsquery is a simplified version
+of to_tsquery with an alternative syntax, similar
+to the one used by web search engines.

 

@@ -962,6 +965,87 @@ SELECT phraseto_tsquery('english', 'The Fat & Rats:C');
 

 
+
+websearch_to_tsquery( config regconfig,  querytext text) returns tsquery
+
+
+   
+websearch_to_tsquery creates a tsquery
+value from querytext using an alternative
+syntax in which simple unformatted text is a valid query.
+Unlike plainto_tsquery
+and phraseto_tsquery, it also recognizes certain
+operators. Moreover, this function should never raise syntax errors,
+which makes it possible to use raw user-supplied input for search.
+The following syntax is supported:
+
+ 
+   
+unquoted text: text not inside quote marks will be
+converted to terms separated by & operators, as
+if processed by
+plainto_tsquery.
+  
+ 
+ 
+   
+"quoted text": text inside quote marks will be
+converted to terms separated by <->
+operators, as if processed by phraseto_tsquery.
+  
+ 
+ 
+  
+   OR: logical or will be converted to
+   the | operator.
+  
+ 
+ 
+  
+   -: the logical not operator, converted to
+   the ! operator.
+  
+ 
+
+   
+   
+Examples:
+
+  select websearch_to_tsquery('english', 'The fat rats');
+   websearch_to_tsquery
+  -
+   'fat' & 'rat'
+  (1 row)
+
+
+  select websearch_to_tsquery('english', '"supernovae stars" -crab');
+ websearch_to_tsquery
+  --
+   'supernova' <-> 'star' & !'crab'
+  (1 row)
+
+
+  select websearch_to_tsquery('english', '"sad cat" or "fat rat"');
+ websearch_to_tsquery
+  ---
+   'sad' <->

Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-04 Thread Craig Ringer
On 4 April 2018 at 22:25, Bruce Momjian  wrote:

> On Wed, Apr  4, 2018 at 10:09:09PM +0800, Craig Ringer wrote:
> > On 4 April 2018 at 22:00, Craig Ringer  wrote:
> >
> >
> > It's the error reporting issues around closing and reopening files
> with
> > outstanding buffered I/O that's really going to hurt us here. I'll be
> > expanding my test case to cover that shortly.
> >
> >
> >
> > Also, just to be clear, this is not in any way confined to xfs and/or
> lvm as I
> > originally thought it might be.
> >
> > Nor is ext3/ext4's errors=remount-ro protective. data_err=abort doesn't
> help
> > either (so what does it do?).
>
> Anthony Iliopoulos reported in this thread that errors=remount-ro is
> only triggered by metadata write failures.


Yep, I gathered. I was referring to data_err.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: new function for tsquery creartion

2018-04-04 Thread Aleksandr Parfenov

On 2018-04-04 17:33, Dmitry Ivanov wrote:

I've been thinking about this for a while, and it seems that this
should be fixed somewhere near parsetext(). Perhaps 'pg' and 'class'
should share the same position. After all, somebody could implement a
parser which would split some words using an arbitrary set of rules,
for instance "split all words containing digits". I propose merging
this patch provided that there are no objections.


I agree that this problem should be solved in a separate patch and
that this feature can be merged without waiting for the fix.

--
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company



Re: Postgres stucks in deadlock detection

2018-04-04 Thread Teodor Sigaev

So, I see three possible ways to fix this problem:
1. Yury Sololov's patch with two phase deadlock check
I could be wrong, but LWLocks in pgsql aren't "fair" locks: if an LWLock is
share-locked then an exclusive-lock request will wait, but the next share-lock
request is granted immediately. If so, we have a risk that two deadlocked
processes will wait forever for the exclusive lock over the partitions while
other processes repeatedly take the lock in share mode to look for their own
deadlocks. I.e. the lock graph has one loop, and members of that loop cannot
acquire the exclusive lock because non-members of the loop constantly hold a
share lock. It has a low probability, but nevertheless.
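
To make the concern concrete, this is roughly the relevant part of today's
CheckDeadLock() in proc.c (excerpt, slightly simplified): it grabs every lock
partition exclusively before walking the wait-for graph, and releases them all
at the end:

    for (i = 0; i < NUM_LOCK_PARTITIONS; i++)
        LWLockAcquire(LockHashPartitionLockByIndex(i), LW_EXCLUSIVE);

    /* ... build and search the wait-for graph ... */

    for (i = NUM_LOCK_PARTITIONS; --i >= 0;)
        LWLockRelease(LockHashPartitionLockByIndex(i));

A two-phase check would presumably run a first pass with LW_SHARED, and with
unfair LWLocks a steady stream of such shared passes from other waiters could,
in the worst case, keep the LW_EXCLUSIVE pass of the two deadlocked backends
waiting.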




2. Avoid concurrent deadlock detection
Seems a simple, workable solution, with an unpredictable delay for the deadlock check,
although it helps very well in cases where there is no deadlock.



3. Avoid concurrent deadlock detection + let CheckDeadLock detect all deadlocks, 
not only those in which the current transaction is involved.

I think it's the better solution, but I'm afraid it's too late for v11...

--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/



Add support for printing/reading MergeAction nodes

2018-04-04 Thread Marina Polyakova

Hello, hackers!

When debugging is enabled for server logging, isolation tests fail 
because there are no corresponding output functions for InsertStmt / 
DeleteStmt / UpdateStmt that are used in the output of the MergeAction 
nodes (see the attached regression diffs and output). I also attached an 
attempt that makes the tests pass. Sorry if I missed that it was already 
discussed somewhere.


--
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index c8d9626..2411658 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -409,6 +409,68 @@ _outMergeAction(StringInfo str, const MergeAction *node)
 }
 
 static void
+_outInferClause(StringInfo str, const InferClause *node)
+{
+	WRITE_NODE_TYPE("INFERCLAUSE");
+
+	WRITE_NODE_FIELD(indexElems);
+	WRITE_NODE_FIELD(whereClause);
+	WRITE_STRING_FIELD(conname);
+	WRITE_LOCATION_FIELD(location);
+}
+
+static void
+_outOnConflictClause(StringInfo str, const OnConflictClause *node)
+{
+	WRITE_NODE_TYPE("ONCONFLICTCLAUSE");
+
+	WRITE_ENUM_FIELD(action, OnConflictAction);
+	WRITE_NODE_FIELD(infer);
+	WRITE_NODE_FIELD(targetList);
+	WRITE_NODE_FIELD(whereClause);
+	WRITE_LOCATION_FIELD(location);
+}
+
+static void
+_outInsertStmt(StringInfo str, const InsertStmt *node)
+{
+	WRITE_NODE_TYPE("INSERT");
+
+	WRITE_NODE_FIELD(relation);
+	WRITE_NODE_FIELD(cols);
+	WRITE_NODE_FIELD(selectStmt);
+	WRITE_NODE_FIELD(onConflictClause);
+	WRITE_NODE_FIELD(returningList);
+	WRITE_NODE_FIELD(withClause);
+	WRITE_ENUM_FIELD(override, OverridingKind);
+}
+
+static void
+_outDeleteStmt(StringInfo str, const DeleteStmt *node)
+{
+	WRITE_NODE_TYPE("DELETE");
+
+	WRITE_NODE_FIELD(relation);
+	WRITE_NODE_FIELD(usingClause);
+	WRITE_NODE_FIELD(whereClause);
+	WRITE_NODE_FIELD(returningList);
+	WRITE_NODE_FIELD(withClause);
+}
+
+static void
+_outUpdateStmt(StringInfo str, const UpdateStmt *node)
+{
+	WRITE_NODE_TYPE("UPDATE");
+
+	WRITE_NODE_FIELD(relation);
+	WRITE_NODE_FIELD(targetList);
+	WRITE_NODE_FIELD(whereClause);
+	WRITE_NODE_FIELD(fromClause);
+	WRITE_NODE_FIELD(returningList);
+	WRITE_NODE_FIELD(withClause);
+}
+
+static void
 _outAppend(StringInfo str, const Append *node)
 {
 	WRITE_NODE_TYPE("APPEND");
@@ -3682,6 +3744,21 @@ outNode(StringInfo str, const void *obj)
 			case T_MergeAction:
 _outMergeAction(str, obj);
 break;
+			case T_InferClause:
+_outInferClause(str, obj);
+break;
+			case T_OnConflictClause:
+_outOnConflictClause(str, obj);
+break;
+			case T_InsertStmt:
+_outInsertStmt(str, obj);
+break;
+			case T_DeleteStmt:
+_outDeleteStmt(str, obj);
+break;
+			case T_UpdateStmt:
+_outUpdateStmt(str, obj);
+break;
 			case T_Append:
 _outAppend(str, obj);
 break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 4518fa0..13891b1 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1620,6 +1620,93 @@ _readMergeAction(void)
 }
 
 /*
+ * _readInferClause
+ */
+static InferClause *
+_readInferClause(void)
+{
+	READ_LOCALS(InferClause);
+
+	READ_NODE_FIELD(indexElems);
+	READ_NODE_FIELD(whereClause);
+	READ_STRING_FIELD(conname);
+	READ_LOCATION_FIELD(location);
+
+	READ_DONE();
+}
+
+/*
+ * _readOnConflictClause
+ */
+static OnConflictClause *
+_readOnConflictClause(void)
+{
+	READ_LOCALS(OnConflictClause);
+
+	READ_ENUM_FIELD(action, OnConflictAction);
+	READ_NODE_FIELD(infer);
+	READ_NODE_FIELD(targetList);
+	READ_NODE_FIELD(whereClause);
+	READ_LOCATION_FIELD(location);
+
+	READ_DONE();
+}
+
+/*
+ * _readInsertStmt
+ */
+static InsertStmt *
+_readInsertStmt(void)
+{
+	READ_LOCALS(InsertStmt);
+
+	READ_NODE_FIELD(relation);
+	READ_NODE_FIELD(cols);
+	READ_NODE_FIELD(selectStmt);
+	READ_NODE_FIELD(onConflictClause);
+	READ_NODE_FIELD(returningList);
+	READ_NODE_FIELD(withClause);
+	READ_ENUM_FIELD(override, OverridingKind);
+
+	READ_DONE();
+}
+
+/*
+ * _readDeleteStmt
+ */
+static DeleteStmt *
+_readDeleteStmt(void)
+{
+	READ_LOCALS(DeleteStmt);
+
+	READ_NODE_FIELD(relation);
+	READ_NODE_FIELD(usingClause);
+	READ_NODE_FIELD(whereClause);
+	READ_NODE_FIELD(returningList);
+	READ_NODE_FIELD(withClause);
+
+	READ_DONE();
+}
+
+/*
+ * _readUpdateStmt
+ */
+static UpdateStmt *
+_readUpdateStmt(void)
+{
+	READ_LOCALS(UpdateStmt);
+
+	READ_NODE_FIELD(relation);
+	READ_NODE_FIELD(targetList);
+	READ_NODE_FIELD(whereClause);
+	READ_NODE_FIELD(fromClause);
+	READ_NODE_FIELD(returningList);
+	READ_NODE_FIELD(withClause);
+
+	READ_DONE();
+}
+
+/*
  * _readAppend
  */
 static Append *
@@ -2620,6 +2707,16 @@ parseNodeString(void)
 		return_value = _readModifyTable();
 	else if (MATCH("MERGEACTION", 11))
 		return_value = _readMergeAction();
+	else if (MATCH("INFERCLAUSE", 11))
+		return_value = _readInferClause();
+	else if (MATCH("ONCONFLICTCLAUSE", 16))
+		return_value = _readOnConflictClause();
+	else if (MATCH("INSERT", 6))
+		r

Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-04 Thread Antonis Iliopoulos
On Wed, Apr 4, 2018 at 4:42 PM, Craig Ringer  wrote:
>
> On 4 April 2018 at 22:25, Bruce Momjian  wrote:
>>
>> On Wed, Apr  4, 2018 at 10:09:09PM +0800, Craig Ringer wrote:
>> > On 4 April 2018 at 22:00, Craig Ringer  wrote:
>> >
>> >
>> > It's the error reporting issues around closing and reopening files
with
>> > outstanding buffered I/O that's really going to hurt us here. I'll
be
>> > expanding my test case to cover that shortly.
>> >
>> >
>> >
>> > Also, just to be clear, this is not in any way confined to xfs and/or
lvm as I
>> > originally thought it might be.
>> >
>> > Nor is ext3/ext4's errors=remount-ro protective. data_err=abort
doesn't help
>> > either (so what does it do?).
>>
>> Anthony Iliopoulos reported in this thread that errors=remount-ro is
>> only triggered by metadata write failures.
>
>
> Yep, I gathered. I was referring to data_err.

As far as I recall data_err=abort pertains to the jbd2 handling of
potential writeback errors. Jbd2 will internally attempt to drain
the data upon txn commit (and it's even kind enough to restore
the EIO at the address space level, that otherwise would get eaten).

When data_err=abort is set, then jbd2 forcibly shuts down the
entire journal, with the error being propagated upwards to ext4.
I am not sure at which point this would be manifested to userspace
and how, but in principle any subsequent fs operations would get
some filesystem error due to the journal being down (I would
assume similar to remounting the fs read-only).

Since you are using data=journal, I would indeed expect to see
something more than what you saw in dmesg.

I can have a look later, I plan to also respond to some of the other
interesting issues that you guys raised in the thread.

Best regards,
Anthony


Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-04 Thread Craig Ringer
On 4 April 2018 at 21:49, Bruce Momjian  wrote:

> On Wed, Apr  4, 2018 at 07:32:04PM +1200, Thomas Munro wrote:
> > On Wed, Apr 4, 2018 at 6:00 PM, Craig Ringer 
> wrote:
> > > On 4 April 2018 at 13:29, Thomas Munro 
> > > wrote:
> > >> /* Ensure that we skip any errors that predate opening of the file */
> > >> f->f_wb_err = filemap_sample_wb_err(f->f_mapping);
> > >>
> > >> [...]
> > >
> > > Holy hell. So even PANICing on fsync() isn't sufficient, because the
> kernel
> > > will deliberately hide writeback errors that predate our fsync() call
> from
> > > us?
> >
> > Predates the opening of the file by the process that calls fsync().
> > Yeah, it sure looks that way based on the above code fragment.  Does
> > anyone know better?
>
> Uh, just to clarify, what is new here is that it is ignoring any
> _errors_ that happened before the open().  It is not ignoring write()'s
> that happened but have not been written to storage before the open().
>
> FYI, pg_test_fsync has always tested the ability to fsync() writes()
> from other processes:
>
> Test if fsync on non-write file descriptor is honored:
> (If the times are similar, fsync() can sync data written on a different
> descriptor.)
> write, fsync, close               5360.341 ops/sec     187 usecs/op
> write, close, fsync               4785.240 ops/sec     209 usecs/op
>
> Those two numbers should be similar.  I added this as a check to make
> sure the behavior we were relying on was working.  I never tested sync
> errors though.
>
> I think the fundamental issue is that we always assumed that writes to
> the kernel that could not be written to storage would remain in the
> kernel until they succeeded, and that fsync() would report their
> existence.
>
> I can understand why kernel developers don't want to keep failed sync
> buffers in memory, and once they are gone we lose reporting of their
> failure.  Also, if the kernel is going to not retry the syncs, how long
> should it keep reporting the sync failure?


Ideally until the app tells it not to.

But there's no standard API for that.

The obvious answer seems to be "until the FD is closed". But we just
discussed how Pg relies on being able to open and close files freely. That
may not be as reasonable a thing to do as we thought it was when you
consider error reporting. What's the kernel meant to do? How long should it
remember "I had an error while doing writeback on this file"? Should it
flag the file metadata and remember across reboots? Obviously not, but
where does it stop? Tell the next program that does an fsync() and forget?
How could it associate a dirty buffer on a file with no open FDs with any
particular program at all? And what if the app did a write then closed the
file and went away, never to bother to check the file again, like most apps
do?
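
To make that pattern concrete, what Pg effectively does boils down to something
like the sketch below (plain POSIX calls, made-up file name, error handling
elided except for the one fsync we care about):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int
    main(void)
    {
        /* "backend": dirty the file through one descriptor, then close it */
        int fd = open("datafile", O_WRONLY | O_CREAT, 0600);
        write(fd, "some page contents", 18);
        close(fd);          /* no fsync here; that's deferred to checkpoint time */

        /*
         * ... writeback of that data may fail right here, after the close and
         * before anyone opens the file again ...
         */

        /* "checkpointer": open the same file later and fsync it */
        fd = open("datafile", O_WRONLY);
        if (fsync(fd) != 0)  /* may still return success despite the lost write */
            perror("fsync");
        close(fd);
        return 0;
    }

On a kernel that samples the writeback error state at open() time, the second
fsync() can report success even though the earlier write never made it to
storage.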

Some I/O errors are transient (network issue, etc). Some are recoverable
with some sort of action, like disk space issues, but may take a long time
before an admin steps in. Some are entirely unrecoverable (disk 1 in
striped array is on fire) and there's no possible recovery. Currently we
kind of hope the kernel will deal with figuring out which is which and
retrying. Turns out it doesn't do that so much, and I don't think the
reasons for that are wholly unreasonable. We may have been asking too much.

That does leave us in a pickle when it comes to the checkpointer and
opening/closing FDs. I don't know what the "right" thing for the kernel to
do from our perspective even is here, but the best I can come up with is
actually pretty close to what it does now. Report the fsync() error to the
first process that does an fsync() since the writeback error if one has
occurred, then forget about it. Ideally I'd have liked it to mark all FDs
pointing to the file with a flag to report EIO on next fsync too, but it
turns out that won't even help us due to our opening and closing behaviour,
so we're going to have to take responsibility for handling and
communicating that ourselves, preventing checkpoint completion if any
backend gets an fsync error. Probably by PANICing. Some extra work may be
needed to ensure reliable ordering and stop checkpoints completing if their
fsync() succeeds due to a recent failed fsync() on a normal backend that
hasn't PANICed or where the postmaster hasn't noticed yet.
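
In code terms, "PANIC instead of retrying" would be roughly this kind of change
at the md.c fsync call site (sketch only, not a worked-out patch; the current
code re-absorbs fsync requests, retries the sync and eventually raises ERROR):

    if (FileSync(seg->mdfd_vfd, WAIT_EVENT_DATA_FILE_SYNC) < 0)
        ereport(PANIC,
                (errcode_for_file_access(),
                 errmsg("could not fsync file \"%s\": %m",
                        FilePathName(seg->mdfd_vfd))));

plus equivalent treatment for the other places that fsync data we cannot
afford to lose.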

Our only option might be to tell administrators to closely watch for
> kernel write failure messages, and then restore or failover.  :-(
>

Speaking of, there's not necessarily any lost page write error in the logs
AFAICS. My tests often just show "Buffer I/O error on device dm-0, logical
block 59393" or the like.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] path toward faster partition pruning

2018-04-04 Thread Jesper Pedersen

Hi,

On 04/04/2018 09:29 AM, David Rowley wrote:

Thanks for updating. I've made a pass over v49 and I didn't find very
much wrong with it.

The only real bug I found was a missing IsA(rinfo->clause, Const) in
the pseudoconstant check inside
generate_partition_pruning_steps_internal.

Most of the changes are comment fixes with a few stylistic changes
thrown which are pretty much all there just to try to shrink the code
a line or two or reduce indentation.

I feel pretty familiar with this code now and assuming the attached is
included I'm happy for someone else, hopefully, a committer to take a
look at it.

I'll leave the following notes:

1. Still not sure about RelOptInfo->has_default_part. This flag is
only looked at in generate_partition_pruning_steps. The RelOptInfo and
the boundinfo is available to look at, it's just that the
partition_bound_has_default macro is defined in partition.c rather
than partition.h.

2. Don't really like the new isopne variable name. It's not very
simple to decode, perhaps something like is_not_eq is better?

3. The part of the code I'm least familiar with is
get_steps_using_prefix_recurse(). I admit to not having had time to
fully understand that and consider ways to break it.

Marking as ready for committer.



Passes check-world, and CommitFest app has been updated to reflect the 
current patch set. Trivial changes attached.


Best regards,
 Jesper
>From 82f718579dc8e06ab77d76df4ed72f0f03ed4a4e Mon Sep 17 00:00:00 2001
From: jesperpedersen 
Date: Wed, 4 Apr 2018 11:27:59 -0400
Subject: [PATCH] Trivial changes

---
 src/backend/catalog/partition.c| 10 +-
 src/backend/optimizer/util/partprune.c |  8 
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index d6bce9f348..7a268e05dc 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -146,7 +146,7 @@ typedef struct PruneStepResult
 {
 	/*
 	 * This contains the offsets of the bounds in a table's boundinfo, each of
-	 * which is a bound whose corresponding partition is selected by a a given
+	 * which is a bound whose corresponding partition is selected by a given
 	 * pruning step.
 	 */
 	Bitmapset  *bound_offsets;
@@ -2026,7 +2026,7 @@ get_matching_hash_bounds(PartitionPruneContext *context,
 		 int opstrategy, Datum *values, int nvalues,
 		 FmgrInfo *partsupfunc, Bitmapset *nullkeys)
 {
-	PruneStepResult *result = palloc0(sizeof(PruneStepResult));
+	PruneStepResult *result = (PruneStepResult *) palloc0(sizeof(PruneStepResult));
 	PartitionBoundInfo boundinfo = context->boundinfo;
 	int		   *partindices = boundinfo->indexes;
 	int			partnatts = context->partnatts;
@@ -2093,7 +2093,7 @@ get_matching_list_bounds(PartitionPruneContext *context,
 		 int opstrategy, Datum value, int nvalues,
 		 FmgrInfo *partsupfunc, Bitmapset *nullkeys)
 {
-	PruneStepResult *result = palloc0(sizeof(PruneStepResult));
+	PruneStepResult *result = (PruneStepResult *) palloc0(sizeof(PruneStepResult));
 	PartitionBoundInfo boundinfo = context->boundinfo;
 	int			off,
 minoff,
@@ -2147,7 +2147,7 @@ get_matching_list_bounds(PartitionPruneContext *context,
 		return result;
 	}
 
-	/* Speical case handling of values coming from a <> operator clause. */
+	/* Special case handling of values coming from a <> operator clause. */
 	if (opstrategy == InvalidStrategy)
 	{
 		/*
@@ -2295,7 +2295,7 @@ get_matching_range_bounds(PartitionPruneContext *context,
 		  int opstrategy, Datum *values, int nvalues,
 		  FmgrInfo *partsupfunc, Bitmapset *nullkeys)
 {
-	PruneStepResult *result = palloc0(sizeof(PruneStepResult));
+	PruneStepResult *result = (PruneStepResult *) palloc0(sizeof(PruneStepResult));
 	PartitionBoundInfo boundinfo = context->boundinfo;
 	Oid		   *partcollation = context->partcollation;
 	int			partnatts = context->partnatts;
diff --git a/src/backend/optimizer/util/partprune.c b/src/backend/optimizer/util/partprune.c
index 75b7232f5d..2d06c1a519 100644
--- a/src/backend/optimizer/util/partprune.c
+++ b/src/backend/optimizer/util/partprune.c
@@ -433,8 +433,8 @@ generate_partition_pruning_steps_internal(RelOptInfo *rel,
 			}
 
 			/*
-			 * Fall-through for a NOT clause, which if it's a Boolean clause
-			 * clause, will be handled in match_clause_to_partition_key(). We
+			 * Fall-through for a NOT clause, which if it's a Boolean clause,
+			 * will be handled in match_clause_to_partition_key(). We
 			 * currently don't perform any pruning for more complex NOT
 			 * clauses.
 			 */
@@ -665,7 +665,7 @@ match_clause_to_partition_key(RelOptInfo *rel,
 	 */
 	if (match_boolean_partition_clause(partopfamily, clause, partkey, &expr))
 	{
-		*pc = palloc(sizeof(PartClauseInfo));
+		*pc = (PartClauseInfo *) palloc(sizeof(PartClauseInfo));
 		(*pc)->keyno = partkeyidx;
 		/* Do pruning with the Boolean equality operator. */
 		(*pc)->opno = BooleanEqualOperator;
@@ -806,7 +806,7 @@ match_c

Re: Add support for printing/reading MergeAction nodes

2018-04-04 Thread Tom Lane
Marina Polyakova  writes:
> When debugging is enabled for server logging, isolation tests fail 
> because there are no corresponding output functions for InsertStmt / 
> DeleteStmt / UpdateStmt that are used in the output of the MergeAction 
> nodes (see the attached regression diffs and output). I also attached an 
> attempt that makes the tests pass. Sorry if I missed that it was already 
> discussed somewhere.

Uh ... what?

Those node types are supposed to appear in raw grammar output only;
they should never survive past parse analysis.

If the MERGE patch has broken this, I'm going to push back on that
and push back on it hard, because it probably means there are a
whole bunch of other raw-grammar-output-only node types that can
now get past the parser stage as well, as children of these nodes.
The answer to that is not to add a ton of new infrastructure, it's
"you did MERGE wrong".

BTW, poking around in the grammar, I notice that MergeStmt did not
get added to RuleActionStmt.  That seems like a rather serious
omission.

regards, tom lane



Re: [HACKERS] logical decoding of two-phase transactions

2018-04-04 Thread Tomas Vondra
Hi,

I think the patch looks mostly fine. I'm about to do a bit more testing
on it, but first a few comments. The attached diff marks the discussed
places / comments more closely.


1) There's a race condition in LogicalLockTransaction. The code does
roughly this:

  if (!BecomeDecodeGroupMember(...))
 ... bail out ...

  Assert(MyProc->decodeGroupLeader);
  lwlock = LockHashPartitionLockByProc(MyProc->decodeGroupLeader);
  ...

but AFAICS there is no guarantee that the transaction does not commit
(or even abort) right after we become a decode group member, in which
case LogicalDecodeRemoveTransaction might have already reset our leader
pointer to NULL, and then the Assert() and the locking will fail.

I initially thought this could be fixed by setting decodeLocked=true in
BecomeDecodeGroupMember, but that's not really true - that would fix the
race for aborts, but not commits. LogicalDecodeRemoveTransaction skips
the wait for commits entirely, and just resets the flags anyway.

So this needs a different fix, I think. BecomeDecodeGroupMember also
needs the leader PGPROC pointer, but it does not have the issue because
it gets it as a parameter. I think the same thing would work for here
too - that is, use the AssignDecodeGroupLeader() result instead.
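
Roughly what I mean, in the same shorthand as the snippet above (argument lists
elided, since I'm going from memory of the patch):

    PGPROC *leader = AssignDecodeGroupLeader(...);

    if (leader == NULL || !BecomeDecodeGroupMember(leader, ...))
       ... bail out, the transaction is already gone ...

    lwlock = LockHashPartitionLockByProc(leader);
    ...

i.e. keep using the PGPROC pointer we were handed, rather than re-reading
MyProc->decodeGroupLeader, which the leader may have reset to NULL in the
meantime.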


2) BecomeDecodeGroupMember sets the decodeGroupLeader=NULL when the
leader does not match the parameters, despite enforcing it by Assert()
at the beginning. Let's remove that assignment.


3) I don't quite understand why BecomeDecodeGroupMember does the
cross-check using PID. In which case would it help?


4) AssignDecodeGroupLeader still sets pid, which is never read. Remove.


5) ReorderBufferCommitInternal does elog(LOG) about interrupting the
decoding of an aborted transaction only in one place. There are about three
other places where we check LogicalLockTransaction. Seems inconsistent.


6) The comment before LogicalLockTransaction is somewhat inaccurate,
because it talks about adding/removing the backend to the group, but
that's not what's happening. We join the group on the first call and
then we only tweak the decodeLocked flag.


7) I propose minor changes to a couple of comments.


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 65382c2..b8b73a4 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -1236,13 +1236,19 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
  * while accessing any catalogs. To enforce that, each decoding backend
  * has to call LogicalLockTransaction prior to any catalog access, and
  * then LogicalUnlockTransaction immediately after it. These functions
- * add/remove the decoding backend from a "decoding group" for a given
- * transaction. While aborting a prepared transaction, the backend will
- * wait for all current members of the decoding group to leave (see
- * LogicalDecodeRemoveTransaction).
+ * add the decoding backend into a "decoding group" for the transaction
+ * (on the first call), and then update a flag indicating whether the
+ * decoding backend may be accessing any catalogs.
  *
- * The function return true when it's safe to access catalogs, and
- * false when the transaction aborted (or is being aborted) in which
+ * While aborting a prepared transaction, the backend is made to wait
+ * for all current members of the decoding group that may be currently
+ * accessing catalogs (see LogicalDecodeRemoveTransaction). Once the
+ * transaction completes (applies to both abort and commit), the group
+ * is destroyed and is not needed anymore (we can check transaction
+ * status directly, instead).
+ *
+ * The function returns true when it's safe to access catalogs, and
+ * false when the transaction aborted (or is being aborted), in which
  * case the plugin should stop decoding it.
  *
  * The decoding backend joins the decoding group only when actually
@@ -1324,6 +1330,12 @@ LogicalLockTransaction(ReorderBufferTXN *txn)
 	/*
 	 * If we were able to add ourself, then Abort processing will
 	 * interlock with us.
+	 *
+	 * XXX There's a race condition here, I think. BecomeDecodeGroupMember
+	 * made us a member of the group, but the transaction might have
+	 * finished since then. In which case (decodeGroupLeader == NULL).
+	 * We need to set (decodeLocked = true) in BecomeDecodeGroupMember,
+	 * so that the leader waits for us.
 	 */
 	Assert(MyProc->decodeGroupLeader);
 
@@ -1333,6 +1345,9 @@ LogicalLockTransaction(ReorderBufferTXN *txn)
 	/*
 	 * Re-check if we were told to abort by the leader after taking
 	 * the above lock
+	 *
+	 * XXX It's not quite clear to me why we need the separate flag
+	 * in our process. Why not to simply check the leader's flag?
 	 */
 	if (MyProc->decodeAbortPending)
 	{
@@ -1410,7 +1425,12 @@ LogicalUnlockTransaction(ReorderBufferTXN 

Re: [HACKERS] GUC for cleanup indexes threshold.

2018-04-04 Thread Teodor Sigaev

Thanks to everyone; pushed with minor editorialization


Alexander Korotkov wrote:
On Tue, Apr 3, 2018 at 6:42 PM, Masahiko Sawada > wrote:


On Tue, Mar 27, 2018 at 8:01 PM, Alexander Korotkov
> So, I would like to clarify why could my patch block future improvements
> in this area?  For instance, if we would decide to make btree able to skip
> cleanup when some delete pages are pending for recycle, we can add
> it in future.
>

Anyway, regarding the approach for this feature, I agree with your version of the
patch, and it looks good to me as the first step of this feature.


Agreed.  I think we got consensus that this patch is a good first step,
which doesn't block further enhancements in the future.

So, I'm attaching rebased version of patch and marking this RFC.

--
Alexander Korotkov
Postgres Professional:http://www.postgrespro.com 
The Russian Postgres Company


--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/



Re: Foreign keys and partitioned tables

2018-04-04 Thread Alvaro Herrera
Robert Haas wrote:
> On Sun, Dec 31, 2017 at 2:43 PM, Alvaro Herrera
>  wrote:
> > This patch removes all the ONLY markers from queries in ri_triggers.c.
> > That makes the queries work for the new use case, but I haven't figured
> > if it breaks things for other use cases.  I suppose not, since regular
> > inheritance isn't supposed to allow foreign keys in the first place, but
> > I haven't dug any further.
> 
> I suspect that this leads to bugs under concurrency, something to do
> with crosscheck_snapshot, but I couldn't say exactly what the problem
> is off the top of my head.   My hope is that partitioning might be
> immune on the strength of knowing that any given tuple could only be
> present in one particular partition, but that might be wishful
> thinking.

I think you're thinking of this problem: if I insert a row in
partitioned table F, and simultaneously remove the referenced row from
table P, it is possible that we fail to reject the insertion in some
corner-case scenario.  I suppose it's not completely far-fetched, if P
is partitioned.  I don't see any way in which it could be a problem if
only F is partitioned.

For the record: in the patch I'm about to push, I did not implement
foreign key references to partitioned tables.  So it should be safe.
More thought may be needed to implement the other direction.  Offhand, I
don't see a problem, but I may well be wrong.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Add support for printing/reading MergeAction nodes

2018-04-04 Thread Simon Riggs
On 4 April 2018 at 17:19, Tom Lane  wrote:

> BTW, poking around in the grammar, I notice that MergeStmt did not
> get added to RuleActionStmt.  That seems like a rather serious
> omission.

MERGE isn't a privilege, a trigger action or a policy action. Why
would it be in RuleActionStmt?

Could you explain what command you think should be supported?

-- 
Simon Riggshttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Add support for printing/reading MergeAction nodes

2018-04-04 Thread Simon Riggs
On 4 April 2018 at 17:19, Tom Lane  wrote:
> Marina Polyakova  writes:
>> When debugging is enabled for server logging, isolation tests fail
>> because there are no corresponding output functions for InsertStmt /
>> DeleteStmt / UpdateStmt that are used in the output of the MergeAction
>> nodes (see the attached regression diffs and output). I also attached an
>> attempt that makes the tests pass. Sorry if I missed that it was already
>> discussed somewhere.
>
> Uh ... what?
>
> Those node types are supposed to appear in raw grammar output only;
> they should never survive past parse analysis.

So if I understand this correctly, it has nothing to do with the
isolation tester, that's just the place where the report was from.

Which debug mode are we talking about, please?

> If the MERGE patch has broken this, I'm going to push back on that
> and push back on it hard, because it probably means there are a
> whole bunch of other raw-grammar-output-only node types that can
> now get past the parser stage as well, as children of these nodes.
> The answer to that is not to add a ton of new infrastructure, it's
> "you did MERGE wrong".

MERGE hasn't broken anything. Marina is saying that the debug output
for MERGE isn't generated correctly.

I accept it shouldn't give spurious messages and I'm sure we can fix that.

MERGE contains multiple actions of Insert, Update and Delete and these
could be output in various debug modes. I'm not clear what meaning we
might attach to them if we looked since that differs from normal
INSERTs, UPDATEs, DELETEs, but let's see.

-- 
Simon Riggshttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Add support for printing/reading MergeAction nodes

2018-04-04 Thread Tom Lane
Simon Riggs  writes:
> On 4 April 2018 at 17:19, Tom Lane  wrote:
>> BTW, poking around in the grammar, I notice that MergeStmt did not
>> get added to RuleActionStmt.  That seems like a rather serious
>> omission.

> MERGE isn't a privilege, a trigger action or a policy action. Why
> would it be in RuleActionStmt?

Because it seems likely that somebody would want to write a rule along
the lines of "ON UPDATE TO mytable DO INSTEAD MERGE ...".

Looking a little further ahead, one can easily envision somebody
wanting to do "ON MERGE TO mytable DO INSTEAD something".  I'd be
prepared to give a pass on that for the present, partly because
it's not very clear what stuff from the original MERGE needs to be
available to the rule.  But the other case seems pretty fundamental.
MERGE is not supposed to be a utility command IMO, it's supposed to
be DML, and that means it needs to work anywhere that you could
write e.g. UPDATE.

regards, tom lane



Re: pgsql: New files for MERGE

2018-04-04 Thread Andres Freund
Hi,

On 2018-04-03 08:32:45 -0700, Andres Freund wrote:
> Hi,
> 
> On 2018-04-03 09:24:12 +, Simon Riggs wrote:
> > New files for MERGE
> > src/backend/executor/nodeMerge.c   |  575 +++
> > src/backend/parser/parse_merge.c   |  660 
> > src/include/executor/nodeMerge.h   |   22 +
> > src/include/parser/parse_merge.h   |   19 +
> 
> Getting a bit grumpy here.  So you pushed this, without responding in
> any way to the objections I made in
> http://archives.postgresql.org/message-id/20180403021800.b5nsgiclzanobiup%40alap3.anarazel.de
> and did it in a manner that doesn't even compile?

This needs at the very least a response to the issues pointed out in the
referenced email that you chose to ignore without any sort of comment.

Greetings,

Andres Freund



Re: Foreign keys and partitioned tables

2018-04-04 Thread Alvaro Herrera
Peter Eisentraut wrote:
> On 4/3/18 15:11, Alvaro Herrera wrote:
> > 0003 is the main patch, which is a bit changed from v4, notably to cover
> > your review comments:
> 
> Looks good now.

Thanks, pushed.

I added a couple of test cases for ON UPDATE/DELETE and MATCH PARTIAL,
after noticing that ri_triggers.c could use some additional coverage
after deleting the parts of it that did not correspond to partitioned
tables.  I think it is possible to keep adding tests, if someone really
wanted to.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-04 Thread Gasper Zejn


On 04. 04. 2018 15:49, Bruce Momjian wrote:
> I can understand why kernel developers don't want to keep failed sync
> buffers in memory, and once they are gone we lose reporting of their
> failure.  Also, if the kernel is going to not retry the syncs, how long
> should it keep reporting the sync failure?  To the first fsync that
> happens after the failure?  How long should it continue to record the
> failure?  What if no fsync() ever happens, which is likely for
> non-Postgres workloads?  I think once they decided to discard failed
> syncs and not retry them, the fsync behavior we are complaining about
> was almost required.
Ideally the kernel would keep its data for as little time as possible.
With fsync, it doesn't really know which process is interested in
knowing about a write error, it just assumes the caller will know how to
deal with it. The most unfortunate issue is that there's no way to get
information about a write error.

Thinking aloud - couldn't/shouldn't a write error also be a file system
event reported by inotify? Admittedly that's only a thing on Linux, but
still.


Kind regards,
Gasper



Re: pgsql: Validate page level checksums in base backups

2018-04-04 Thread Magnus Hagander
On Wed, Apr 4, 2018 at 3:47 PM, Alvaro Herrera 
wrote:

> Michael Banck wrote:
>
> > However, the pg_basebackup testsuite takes up 800+ MB to run,
>
> Uh, you're right.  This seems a bit over the top.  Can we reduce that
> without losing coverage?  We've gone to great lengths to avoid large
> amounts of data in tests elsewhere.
>

That didn't come out of this patch, right? This is a pre-existing issue?

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: [HACKERS] Runtime Partition Pruning

2018-04-04 Thread Jesper Pedersen

Hi David,

On 04/03/2018 10:10 PM, David Rowley wrote:

The attached case doesn't trigger a generic plan, so basically all time is
spent in GetCachedPlan.


Yeah, there's still no resolution to the fact that a generic plan +
runtime pruning might be cheaper than a custom plan.  The problem is
the generic plan appears expensive to the custom vs generic plan
comparison due to it containing more Append subnodes and the run-time
pruning not being taken into account by that comparison.

There's been some discussion about this on this thread somewhere.



Forgot about that, sorry.


I think the best solution is probably the one suggested by Robert [1]
and that's to alter the Append plan's cost when run-time pruning is
enabled to try to account for the run-time pruning. This would be a
bit of a blind guess akin to what we do for clause selectivity
estimates for Params, but it's probably better than nothing, and
likely better than doing nothing.



Yeah, something based on the number of WHERE clauses, and on whether the
partition type has a DEFAULT / NULL partition, could help. Forcing
choose_custom_plan() to return false does help a lot (> 400%) for the
HASH case.
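
(For anyone wanting to reproduce that number: a throw-away test hack at the top
of the chooser in plancache.c is enough, e.g.

    static bool
    choose_custom_plan(CachedPlanSource *plansource, ParamListInfo boundParams)
    {
        return false;       /* test hack: always pick the generic plan */
        ...
    }

obviously not something to commit.)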


But maybe this area is best left for PG12.


Yeah, it's a bug in v46 faster partition pruning. Discussing a fix for
that with Amit over on [2].



I was running with a version of faster_part_prune_v45_fixups.patch.

Patch v49 with v18 (0001-0004) works. 0005 needs a rebase.

Thanks again,
 Jesper



Re: some last patches breaks plan cache

2018-04-04 Thread Tomas Vondra


On 04/01/2018 10:01 AM, Pavel Stehule wrote:
> 
> 
> 2018-04-01 1:00 GMT+02:00 Tomas Vondra  >:
> 
> 
> 
> On 03/31/2018 08:28 PM, Tomas Vondra wrote:
> >
> >
> > On 03/31/2018 07:56 PM, Tomas Vondra wrote:
> >> On 03/31/2018 07:38 PM, Pavel Stehule wrote:
> >>> Hi
> >>>
> >>> CREATE OR REPLACE PROCEDURE public.proc(a integer, INOUT b
> integer, c
> >>> integer)
> >>>  LANGUAGE plpgsql
> >>> AS $procedure$
> >>> begin
> >>>   b := a + c;
> >>> end;
> >>> $procedure$
> >>>
> >>> CREATE OR REPLACE PROCEDURE public.testproc()
> >>>  LANGUAGE plpgsql
> >>> AS $procedure$
> >>> declare r int;
> >>> begin
> >>>   call proc(10, r, 20);
> >>> end;
> >>> $procedure$
> >>>
> >>> postgres=# call testproc();
> >>> CALL
> >>> postgres=# call testproc();
> >>> ERROR:  SPI_execute_plan_with_paramlist failed executing query "CALL
> >>> proc(10, r, 20)": SPI_ERROR_ARGUMENT
> >>> CONTEXT:  PL/pgSQL function testproc() line 4 at CALL
> >>> postgres=#
> >>>
> >>> second call fails
> >>
> >> Yeah.
> >>
> >> d92bc83c48bdea9888e64cf1e2edbac9693099c9 seems to have broken
> this :-/
> >>
> >
> > FWIW it seems the issue is somewhere in exec_stmt_call, which does
> this:
> >
> >     /*
> >      * Don't save the plan if not in atomic context.  Otherwise,
> >      * transaction ends would cause warnings about plan leaks.
> >      */
> >     exec_prepare_plan(estate, expr, 0, estate->atomic);
> >
> > When executed outside transaction, CALL has estate->atomic=false,
> and so
> > calls exec_prepare_plan() with keepplan=false. And on the second
> call it
> > gets bogus Plan, of course (with the usual 0x7f7f7f7f7f7f7f7f
> patterns).
> >
> > When in a transaction, it sets keepplan=true, and everything works
> fine.
> >
> > So either estate->atomic is not sufficient on it's own, or we need to
> > reset the expr->plan somewhere.
> >
> 
> The attached patch fixes this, but I'm not really sure it's the right
> fix - I'd expect there to be a more principled way, resetting the
> plan pointer when 'plan->saved == false'.
> 
> 
> it fixes some issues, but not all
> 
> I see changes in plpgsql_check regress tests
> 
> CREATE OR REPLACE PROCEDURE public.testproc()
>  LANGUAGE plpgsql
> AS $procedure$
> declare r int;
> begin
>   call proc(10, r + 10, 20);
> end;
> $procedure$
> 
> postgres=# call testproc();
> ERROR:  argument 2 is an output argument but is not writable
> CONTEXT:  PL/pgSQL function testproc() line 4 at CALL
> postgres=# call testproc();
> ERROR:  SPI_execute_plan_with_paramlist failed executing query "CALL
> proc(10, r + 10, 20)": SPI_ERROR_ARGUMENT
> CONTEXT:  PL/pgSQL function testproc() line 4 at CALL
> 

This should do the trick - I've failed to realize exec_stmt_call may
exit by calling elog(ERROR) too, in which case the plan pointer was not
reset.

This does fix the failures presented here, but I don't think it's the
right solution - for example, if any other function call ends with
elog(ERROR), the dangling pointer will be there. There must be a better
place to clean this up automatically ...

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/pl/plpgsql/src/pl_exec.c b/src/pl/plpgsql/src/pl_exec.c
index 67123f8..4337b78 100644
--- a/src/pl/plpgsql/src/pl_exec.c
+++ b/src/pl/plpgsql/src/pl_exec.c
@@ -2167,9 +2167,15 @@ exec_stmt_call(PLpgSQL_execstate *estate, PLpgSQL_stmt_call *stmt)
 	Param	   *param;
 
 	if (!IsA(n, Param))
+	{
+		/* cleanup the plan pointer */
+		if (!estate->atomic)
+			expr->plan = NULL;
+
 		ereport(ERROR,
 (errcode(ERRCODE_SYNTAX_ERROR),
  errmsg("argument %d is an output argument but is not writable", i + 1)));
+	}
 
 	param = castNode(Param, n);
 	/* paramid is offset by 1 (see make_datum_param()) */
@@ -2193,6 +2199,10 @@ exec_stmt_call(PLpgSQL_execstate *estate, PLpgSQL_stmt_call *stmt)
 	exec_eval_cleanup(estate);
 	SPI_freetuptable(SPI_tuptable);
 
+	/* cleanup the plan pointer */
+	if (!estate->atomic)
+		expr->plan = NULL;
+
 	return PLPGSQL_RC_OK;
 }
 


Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-04 Thread Bruce Momjian
On Wed, Apr  4, 2018 at 11:23:51PM +0800, Craig Ringer wrote:
> On 4 April 2018 at 21:49, Bruce Momjian  wrote:
> I can understand why kernel developers don't want to keep failed sync
> buffers in memory, and once they are gone we lose reporting of their
> failure.  Also, if the kernel is going to not retry the syncs, how long
> should it keep reporting the sync failure?
> 
> Ideally until the app tells it not to.
> 
> But there's no standard API for that.

You would almost need an API that registers, _before_ the failure, that
you care about sync failures and that you plan to call fsync() to
gather such information.
the first fsync() to see the failure unless you added _another_ API to
clear the fsync failure, but I don't see the point since the first
fsync() might call that clear function.  How many applications are going
to know there is _another_ application that cares about the failure? Not
many.
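
To make that concrete, the kind of thing I mean would look vaguely like the
sketch below; every flag name in it is hypothetical, no such interface exists
in any kernel today:

    int fd = open("relfile", O_WRONLY);

    /* hypothetical: ask the kernel to retain writeback errors for this file */
    fcntl(fd, F_KEEP_SYNC_ERRORS, 1);

    /* ... writes happen, possibly from other processes ... */

    if (fsync(fd) != 0)
        handle_the_failure();

    /* hypothetical: acknowledge the error so the kernel may forget it */
    fcntl(fd, F_CLEAR_SYNC_ERRORS, 0);

And then you immediately hit the questions above: who is allowed to clear the
error, and what about applications that never registered at all.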

> Currently we kind of hope the kernel will deal with figuring out which
> is which and retrying. Turns out it doesn't do that so much, and I
> don't think the reasons for that are wholly unreasonable. We may have
> been asking too much.

Agreed.

> Our only option might be to tell administrators to closely watch for
> kernel write failure messages, and then restore or failover.  :-(
> 
> Speaking of, there's not necessarily any lost page write error in the logs
> AFAICS. My tests often just show "Buffer I/O error on device dm-0, logical
> block 59393" or the like.

I assume those are the kernel logs.  I am thinking the kernel logs have to
be monitored, but how many administrators do that?  The other issue I
think you are pointing out is: how is the administrator going to know
this is a Postgres file?  I guess any sync error on a device that
contains Postgres has to be assumed to mean Postgres is corrupted.  :-(

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: Add support for printing/reading MergeAction nodes

2018-04-04 Thread Tom Lane
Simon Riggs  writes:
> On 4 April 2018 at 17:19, Tom Lane  wrote:
>> If the MERGE patch has broken this, I'm going to push back on that
>> and push back on it hard, because it probably means there are a
>> whole bunch of other raw-grammar-output-only node types that can
>> now get past the parser stage as well, as children of these nodes.
>> The answer to that is not to add a ton of new infrastructure, it's
>> "you did MERGE wrong".

> MERGE contains multiple actions of Insert, Update and Delete and these
> could be output in various debug modes. I'm not clear what meaning we
> might attach to them if we looked since that differs from normal
> INSERTs, UPDATEs, DELETEs, but let's see.

What I'm complaining about is that that's a very poorly designed parsetree
representation.  It may not be the worst one I've ever seen, but it's
got claims in that direction.  You're repurposing InsertStmt et al to
do something that's *not* an INSERT statement, but by chance happens
to share some (not all) of the same fields.  This is bad because it
invites confusion, and then bugs of commission or omission (eg, assuming
that some particular processing has happened or not happened to subtrees
of that parse node).  The most egregious way in which it's a bad idea
is that, as I said, InsertStmt doesn't even exist in post-parse-analysis
trees so far as the normal type of INSERT is concerned.  This just opens
a whole batch of ways to screw up.  We have some types of raw parse nodes
that are replaced entirely during parse analysis, and we have some others
where it's convenient to use the same node type before and after parse
analysis, but we don't have any that are one way in one context and the
other way in other contexts.  And this is not the place to start.

I think you'd have been better off to fold all of those fields into
MergeAction, or perhaps make several variants of MergeAction.
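
To illustrate, "fold the fields into MergeAction" could look something like
this (the field list is only indicative, not a worked-out proposal):

    typedef struct MergeAction
    {
        NodeTag     type;
        bool        matched;        /* WHEN MATCHED vs WHEN NOT MATCHED */
        CmdType     commandType;    /* CMD_INSERT/CMD_UPDATE/CMD_DELETE/CMD_NOTHING */
        Node       *condition;      /* WHEN ... AND condition, or NULL */
        List       *targetList;     /* target list for INSERT/UPDATE actions */
        List       *values;         /* VALUES list for INSERT actions */
    } MergeAction;

so that neither raw-parse-only nodes nor half-analyzed InsertStmt/UpdateStmt/
DeleteStmt nodes survive into the post-analysis tree.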

regards, tom lane



Re: SET TRANSACTION in PL/pgSQL

2018-04-04 Thread Tomas Vondra
On 03/29/2018 06:30 PM, Peter Eisentraut wrote:
> On 3/15/18 17:49, Alexander Korotkov wrote:
>> I didn't dig deeply into this subject.  But should we rather teach SPI
>> to execute
>> utility statements without taking snapshot when not necessary.  That seems
>> like what executor do for client provided queries.  And that seems a bit
>> unlogical
>> that SPI behaves differently.
> 
> Here is the same patch rewritten using SPI, using the new no_snapshots
> facility recently introduced.
> 

Yeah, doing that using SPI seems much cleaner and more like the rest of
the commands. Most of the patch is boilerplate to support the grammar,
and the one interesting piece, exec_stmt_set, seems fine to me.

Barring any objections, I'll mark it as RFC tomorrow morning.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: some last patches breaks plan cache

2018-04-04 Thread Tom Lane
Tomas Vondra  writes:
> This should do the trick - I've failed to realize exec_stmt_call may
> exit by calling elog(ERROR) too, in which case the plan pointer was not
> reset.

> This does fix the failures presented here, but I don't think it's the
> right solution

No, it's completely unacceptable.

If there's really no other way, you could use a PG_TRY block to ensure
that the pointer gets reset on the way out.  But I question why we've
got a design that requires that in the first place.  It's likely to
have more problems than this.
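
For concreteness, the PG_TRY shape would be something along these lines in
exec_stmt_call, wrapping whatever runs while expr->plan points at the
transient, unsaved plan (a sketch only):

    PG_TRY();
    {
        rc = SPI_execute_plan_with_paramlist(expr->plan, paramLI,
                                             estate->readonly_func, 0);
        /* ... output-argument checks etc. ... */
    }
    PG_CATCH();
    {
        /* don't leave a dangling pointer to the discarded plan behind */
        if (!estate->atomic)
            expr->plan = NULL;
        PG_RE_THROW();
    }
    PG_END_TRY();

    if (!estate->atomic)
        expr->plan = NULL;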

regards, tom lane



Re: some last patches breaks plan cache

2018-04-04 Thread Tomas Vondra


On 04/04/2018 07:54 PM, Tom Lane wrote:
> Tomas Vondra  writes:
>> This should do the trick - I've failed to realize exec_stmt_call may
>> exit by calling elog(ERROR) too, in which case the plan pointer was not
>> reset.
> 
>> This does fix the failures presented here, but I don't think it's the
>> right solution
> 
> No, it's completely unacceptable.
> 

Yes, I realize that and I was not really suggesting this as a proper
fix. It was meant more to demonstrate that it's still the same issue
with the same dangling pointer.

> If there's really no other way, you could use a PG_TRY block to 
> ensure that the pointer gets reset on the way out. But I question
> why we've got a design that requires that in the first place. It's
> likely to have more problems than this.
> 

I agree it needs a solution that does not require us to track and
manually reset pointers in random places. No argument here.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Foreign keys and partitioned tables

2018-04-04 Thread Tom Lane
Alvaro Herrera  writes:
> Thanks, pushed.

This has broken the selinux regression tests, evidently because it
removed ONLY from the emitted FK test queries.  While we could change
the expected results, I would first like to hear a defense of why that
change is a good idea.  It seems highly likely to be the wrong thing
for non-partitioned cases.

regards, tom lane



Re: Add support for printing/reading MergeAction nodes

2018-04-04 Thread Simon Riggs
On 4 April 2018 at 18:51, Tom Lane  wrote:
> Simon Riggs  writes:
>> On 4 April 2018 at 17:19, Tom Lane  wrote:
>>> If the MERGE patch has broken this, I'm going to push back on that
>>> and push back on it hard, because it probably means there are a
>>> whole bunch of other raw-grammar-output-only node types that can
>>> now get past the parser stage as well, as children of these nodes.
>>> The answer to that is not to add a ton of new infrastructure, it's
>>> "you did MERGE wrong".
>
>> MERGE contains multiple actions of Insert, Update and Delete and these
>> could be output in various debug modes. I'm not clear what meaning we
>> might attach to them if we looked since that differs from normal
>> INSERTs, UPDATEs, DELETEs, but let's see.
>
> What I'm complaining about is that that's a very poorly designed parsetree
> representation.  It may not be the worst one I've ever seen, but it's
> got claims in that direction.  You're repurposing InsertStmt et al to
> do something that's *not* an INSERT statement, but by chance happens
> to share some (not all) of the same fields.  This is bad because it
> invites confusion, and then bugs of commission or omission (eg, assuming
> that some particular processing has happened or not happened to subtrees
> of that parse node).  The most egregious way in which it's a bad idea
> is that, as I said, InsertStmt doesn't even exist in post-parse-analysis
> trees so far as the normal type of INSERT is concerned.  This just opens
> a whole batch of ways to screw up.  We have some types of raw parse nodes
> that are replaced entirely during parse analysis, and we have some others
> where it's convenient to use the same node type before and after parse
> analysis, but we don't have any that are one way in one context and the
> other way in other contexts.  And this is not the place to start.
>
> I think you'd have been better off to fold all of those fields into
> MergeAction, or perhaps make several variants of MergeAction.

OK, that can be changed, will check and report back tomorrow.

-- 
Simon Riggshttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Postgres stucks in deadlock detection

2018-04-04 Thread Andres Freund
Hi,

On 2018-04-04 11:54:14 +0300, Konstantin Knizhnik wrote:
> Several times we and our customers have suffered from the problem that
> Postgres got stuck in deadlock detection.
> One scenario is a YCSB workload with Zipf's distribution where many clients are
> trying to update the same record and compete for its lock.

> Another scenario is large number of clients performing inserts in the same
> table. In this case the source of the problem is relation extension lock.
> In both cases number of connection is large enough: several hundreds.

Have you ever observed that in the field? This sounds more artificial
than real to me.


> So, I see three possible ways to fix this problem:
> 1. Yury Sololov's patch with two phase deadlock check
> 2. Avoid concurrent deadlock detection
> 3. Avoid concurrent deadlock detection + let CheckDeadLock detect all
> deadlocks, not only those in which current transaction is involved.

4) Don't do anything about deadlock detection, because this is just a
   symptom five steps removed from the cause.

We have to pay attention to fixing actual problems, rather than purely
benchmark ones. Complexity has a significant price.

Greetings,

Andres Freund



Re: pgsql: Validate page level checksums in base backups

2018-04-04 Thread Michael Banck
Hi,

On Wed, Apr 04, 2018 at 07:25:11PM +0200, Magnus Hagander wrote:
> On Wed, Apr 4, 2018 at 3:47 PM, Alvaro Herrera 
> wrote:
> > Michael Banck wrote:
> >
> > > However, the pg_basebackup testsuite takes up 800+ MB to run,
> >
> > Uh, you're right.  This seems a bit over the top.  Can we reduce that
> > without losing coverage?  We've gone to great lengths to avoid large
> > amounts of data in tests elsewhere.
> 
> That didn't come out of this patch, right? This is a pre-existing issue?

It was in that ballpark before, but this patch probably made
things worse as it adds four additional datadirs (around 40 MB each
here) and we are close to 1 GB now.

I haven't looked at the other testsuites, but if it is ok to clean up the
basebackups after each set of tests succeeds, that would alleviate the
problem.

Otherwise, I had a quick look and there is no obvious outlier; the
pgdata is 220 MB after the testrun (195 MB of which is WAL, maybe that
could be cut down somehow?) and the base backups are 22-40 MB each, and
there are around 20 of them, so that adds up to more than 750 MB.


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer



Re: pgsql: New files for MERGE

2018-04-04 Thread Pavan Deolasee
On Wed, Apr 4, 2018 at 10:40 PM, Andres Freund  wrote:

> Hi,
>
> On 2018-04-03 08:32:45 -0700, Andres Freund wrote:
> > Hi,
> >
> > On 2018-04-03 09:24:12 +, Simon Riggs wrote:
> > > New files for MERGE
> > > src/backend/executor/nodeMerge.c   |  575 +++
> > > src/backend/parser/parse_merge.c   |  660 
> > > src/include/executor/nodeMerge.h   |   22 +
> > > src/include/parser/parse_merge.h   |   19 +
> >
> > Getting a bit grumpy here.  So you pushed this, without responding in
> > any way to the objections I made in
> > http://archives.postgresql.org/message-id/20180403021800.
> b5nsgiclzanobiup%40alap3.anarazel.de
> > and did it in a manner that doesn't even compile?
>
> This needs at the very least a response to the issues pointed out in the
> referenced email that you chose to ignore without any sort of comment.
>
>
Apologies from my end. Simon checked with me regarding your referenced
email. I was in the middle of responding to it (with an add-on patch to take
care of your review comments), but got side tracked by some high priority
customer escalation. I shall respond soon.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Postgres stucks in deadlock detection

2018-04-04 Thread Teodor Sigaev

Have you ever observed that in the field? This sounds more artificial
than real to me.


Zabbix storage with 50Gb WAL per hour on insertion.

--
Teodor Sigaev  E-mail: teo...@sigaev.ru
  WWW: http://www.sigaev.ru/



Re: Postgres stucks in deadlock detection

2018-04-04 Thread Andres Freund
On 2018-04-04 21:34:27 +0300, Teodor Sigaev wrote:
> > Have you ever observed that in the field? This sounds more artificial
> > than real to me.
> 
> Zabbix storage with 50Gb WAL per hour on insertion.

That's not a very detailed description. Isn't that a largely insertion
only workload?



Re: pgsql: New files for MERGE

2018-04-04 Thread Andres Freund
Hi,

On 2018-04-05 00:02:06 +0530, Pavan Deolasee wrote:
> Apologies from my end. Simon checked with me regarding your referenced
> email. I was in the middle of responding to it (with a add-on patch to take
> care of your review comments), but got side tracked by some high priority
> customer escalation. I shall respond soon.

Hows that an explanation for just going ahead and committing? Without
even commenting on why one thinks the pointed out issues are something
that can be resolved later or somesuch?  This has an incredibly rushed
feel to it.

Greetings,

Andres Freund



Re: pgsql: New files for MERGE

2018-04-04 Thread Peter Geoghegan
On Wed, Apr 4, 2018 at 11:46 AM, Andres Freund  wrote:
> Hows that an explanation for just going ahead and committing? Without
> even commenting on why one thinks the pointed out issues are something
> that can be resolved later or somesuch?  This has an incredibly rushed
> feel to it.

I agree that this was handled in a way that was just unsatisfactory.
It was also unnecessary, since we still have a few days left until
feature freeze.

-- 
Peter Geoghegan



Re: Postgres stucks in deadlock detection

2018-04-04 Thread Teodor Sigaev

Have you ever observed that in the field? This sounds more artificial
than real to me.


Zabbix storage with 50Gb WAL per hour on insertion.


That's not a very detailed description. Isn't that a largely insertion
only workload?


It's mostly insert load, collecting monitoring information. Also there 
are some read-only transactions, ~10 per second.


We didn't invent this out of thin air; a client came to us with a support
request, and now we have two different cases.


--
Teodor Sigaev  E-mail: teo...@sigaev.ru
  WWW: http://www.sigaev.ru/



Re: pgsql: New files for MERGE

2018-04-04 Thread Pavan Deolasee
On Thu, Apr 5, 2018 at 12:16 AM, Andres Freund  wrote:

> Hi,
>
> On 2018-04-05 00:02:06 +0530, Pavan Deolasee wrote:
> > Apologies from my end. Simon checked with me regarding your referenced
> > email. I was in the middle of responding to it (with a add-on patch to
> take
> > care of your review comments), but got side tracked by some high priority
> > customer escalation. I shall respond soon.
>
> Hows that an explanation for just going ahead and committing? Without
> even commenting on why one thinks the pointed out issues are something
> that can be resolved later or somesuch?  This has an incredibly rushed
> feel to it.
>

While I don't want to answer that on Simon's behalf, my feeling is that he
may not have seen your email since it came pretty late. He had probably
planned to commit the patch again first thing in the morning with the fixes
I'd sent.

Anyway, I think your review comments are useful and I've incorporated
most of them. Obviously, certain things like creating completely new
executor machinery are not practical given where we are in the release
cycle, and I am not sure that would have any significant advantages over
what we have today.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Postgres stucks in deadlock detection

2018-04-04 Thread Andres Freund
Hi,

On 2018-04-04 21:55:10 +0300, Teodor Sigaev wrote:
> > > > Have you ever observed that in the field? This sounds more artificial
> > > > than real to me.
> > > 
> > > Zabbix storage with 50Gb WAL per hour on insertion.
> > 
> > That's not a very detailed description. Isn't that a largely insertion
> > only workload?
> 
> It's mostly insert load, collecting monitoring information. Also there are
> some read-only transactions, ~10 per second.

How are you running into deadlock detector issues then? Where does the
significant number of heavyweight lock waits come from?


> We didn't invent this from mind, client came to us with support request, now
> we have two different cases.

You gotta actually start describing those then.

Greetings,

Andres Freund



Re: pgsql: New files for MERGE

2018-04-04 Thread Tom Lane
[ removing -committers from cc ]

Pavan Deolasee  writes:
> On Thu, Apr 5, 2018 at 12:16 AM, Andres Freund  wrote:
>> Hows that an explanation for just going ahead and committing? Without
>> even commenting on why one thinks the pointed out issues are something
>> that can be resolved later or somesuch?  This has an incredibly rushed
>> feel to it.

> Anyways, I think your reviews comments are useful and I've incorporated
> most of those. Obviously certain things like creating a complete new
> executor machinery is not practical given where we're in the release cycle
> and I am not sure if that has any significant advantages over what we have
> today.

Well, what's on the table is reverting this patch and asking you to try
again in the v12 cycle.  Given Andres' concerns about the executor design,
and mine about the way the parsing end is built, there's certainly no way
that that's all getting fixed by Saturday.  Given pretty much everybody's
unhappiness with the way this patch was forced through at the last minute,
I do not think you should expect that we'll say, "okay, we'll let you ship
a bad version of MERGE because there's no more time in this cycle".

Personally, I didn't think we had consensus on whether the semantics
are right, let alone on whether this is a satisfactory implementation
code-wise.  I know I've never looked at the patch before today; I did not
think it was close enough to being committed that I would need to.

regards, tom lane



Re: Parallel Aggregates for string_agg and array_agg

2018-04-04 Thread Tomas Vondra


On 03/31/2018 04:42 PM, David Rowley wrote:
> On 30 March 2018 at 02:55, Tomas Vondra  wrote:
>> On 03/29/2018 03:09 PM, David Rowley wrote:
>>> I meant to mention earlier that I coded
>>> agg_args_have_sendreceive_funcs() to only check for send/receive
>>> functions. Really we could allow a byval types without send/receive
>>> functions, since the serial/deserial just send the raw datums in that
>>> case, but then the function becomes
>>> agg_byref_args_have_sendreceive_funcs(), which seemed a bit obscure,
>>> so I didn't do that.  Maybe I should?
>>
>> I'd do that. Not sure the function name needs to change, but perhaps
>> agg_args_support_sendreceive() would be better - it covers both byref
>> types (which require send/receive functions) and byval (which don't).
> 
> The attached patch implements this.
> 

Seems fine to me, although we should handle the anyarray case too, I
guess. That is, get_agg_clause_costs_walker() should do this too:

/* Same thing for array_agg_array_(de)serialize. */
if ((aggserialfn == F_ARRAY_AGG_ARRAY_SERIALIZE ||
 aggdeserialfn == F_ARRAY_AGG_ARRAY_DESERIALIZE) &&
 !agg_args_support_sendreceive(aggref))
costs->hasNonSerial = true;

Other than that, the patch seems fine to me, and it's already marked as
RFC so I'll leave it at that.

The last obstacle seems to be the argument about the risks of the patch
breaking queries of people relying on the ordering. Not sure what the
right next steps are in this regard ...

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pgsql: New files for MERGE

2018-04-04 Thread Peter Geoghegan
On Wed, Apr 4, 2018 at 12:09 PM, Tom Lane  wrote:
> Personally, I didn't think we had consensus on whether the semantics
> are right, let alone on whether this is a satisfactory implementation
> code-wise.  I know I've never looked at the patch before today; I did not
> think it was close enough to being committed that I would need to.

To be fair, I was happy with the semantics we came up with for READ
COMMITTED conflict handling, although it wasn't that long ago that
that ceased to be the big concern. This happened due to a truly heroic
effort from Pavan.

The problems that remained were with the representation used during
parsing, planning, and execution, which seem like they could have a
lot of unforeseen consequences. Plus a general lack of maturity.
Things like column-level privileges were broken as recently as a week
before commit, due to being totally untested. That was a consequence
of the representation in the executor.

-- 
Peter Geoghegan



Re: Foreign keys and partitioned tables

2018-04-04 Thread Alvaro Herrera
Tom Lane wrote:
> Alvaro Herrera  writes:
> > Thanks, pushed.
> 
> This has broken the selinux regression tests, evidently because it
> removed ONLY from the emitted FK test queries.  While we could change
> the expected results, I would first like to hear a defense of why that
> change is a good idea.  It seems highly likely to be the wrong thing
> for non-partitioned cases.

Yeah, there ain't one, because this was a reversal mistake.  I restored
that ONLY.  (There were two ONLYs in the original query; I initially
removed both, and then went over the file and included them
conditionally on the table not being a partitioned one, based on review
comments.  In this line I restored one conditionally but failed to
realize I should have been restoring the other unconditionally.)

Pushed a fix blind.  Let's see if it appeases rhinoceros.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pgsql: New files for MERGE

2018-04-04 Thread Simon Riggs
On 4 April 2018 at 20:09, Tom Lane  wrote:
> [ removing -committers from cc ]
>
> Pavan Deolasee  writes:
>> On Thu, Apr 5, 2018 at 12:16 AM, Andres Freund  wrote:
>>> Hows that an explanation for just going ahead and committing? Without
>>> even commenting on why one thinks the pointed out issues are something
>>> that can be resolved later or somesuch?  This has an incredibly rushed
>>> feel to it.
>
>> Anyways, I think your reviews comments are useful and I've incorporated
>> most of those. Obviously certain things like creating a complete new
>> executor machinery is not practical given where we're in the release cycle
>> and I am not sure if that has any significant advantages over what we have
>> today.
>
> Well, what's on the table is reverting this patch and asking you to try
> again in the v12 cycle.  Given Andres' concerns about the executor design,
> and mine about the way the parsing end is built, there's certainly no way
> that that's all getting fixed by Saturday.  Given pretty much everybody's
> unhappiness with the way this patch was forced through at the last minute,
> I do not think you should expect that we'll say, "okay, we'll let you ship
> a bad version of MERGE because there's no more time in this cycle".

This version works, with agreed semantics, all fully tested and documented.

It's also neat and tight. Look how easy it was for Peter to add WITH
semantics on top of it.

And it's isolated, so it's not a threat to anybody that doesn't choose
to use it. Users want it and will use this; if I didn't know that for
certain I wouldn't spend time on it.

> Personally, I didn't think we had consensus on whether the semantics
> are right, let alone on whether this is a satisfactory implementation
> code-wise.  I know I've never looked at the patch before today; I did not
> think it was close enough to being committed that I would need to.

I have rejected patches in the past for clearly stated reasons that
affect users. I regret that those people didn't discuss their designs
before they posted patches and serious holes were found. And in
response I provided clear design outlines of what was needed. That is
not what is happening here.

The normal way is to make review comments that allow change. Your
request for change of the parser data structures is fine and can be
done, possibly by Saturday.

If saying "I'm unhappy with something" is sufficient grounds for
rejecting a patch, I'm surprised to hear it. There has been no
discussion of what exactly would be better, only that what we have is
somehow wrong, a point which both Pavan and I dispute, not least
because the executor has already been rewritten once at Peter's
request.

I was under no pressure at all to commit this. In my opinion this is a
good version of MERGE and that is why I committed it. If it were not,
I would have no hesitation in pushing back a year or more, if I knew
of technical reasons to do so.

-- 
Simon Riggshttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pgsql: New files for MERGE

2018-04-04 Thread Andres Freund
Hi,

On 2018-04-04 21:07:25 +0100, Simon Riggs wrote:
> It's also neat and tight. Look how easy it was for Peter to add WITH
> semantics on top of it.

Err. Several parts of the code definitely do not look "neat and
tight". As detailed in my email. Possibly that's necessary, but you've
not argued that.


> And it's isolated, so its not a threat to anybody that doesn't choose
> to use it. Users want it and will use this; if I didn't know that for
> certain I wouldn't spend time on it.

Architectural costs are a thing.


> The normal way is to make review comments that allow change. Your
> request for change of the parser data structures is fine and can be
> done, possibly by Saturday

I did request changes, and you've so far ignored those requests.


> If saying "I'm unhappy with something" is sufficient grounds for
> rejecting a patch, I'm surprised to hear it. There has been no
> discussion of what exactly would be better, only that what we have is
> somehow wrong, a point which both Pavan and I dispute, not least
> because the executor has already been rewritten once at Peter's
> request.

You've not publicly disputed that, no.


> I was under no pressure at all to commit this. In my opinion this is a
> good version of MERGE and that is why I committed it. If it were not,

Why did you then commit a patch six hours after objections were raised?
Without responding to them? And again breaking the patch into commits in
a way that made no sense and in fact was not compilable for an hour?

That does look rushed, unless you provide a better explanation.

- Andres



Re: pgsql: New files for MERGE

2018-04-04 Thread Peter Geoghegan
On Wed, Apr 4, 2018 at 1:07 PM, Simon Riggs  wrote:
> This version works, with agreed semantics, all fully tested and documented.

I agree that it's more or less true that this works, and implements
the agreed-upon semantics. I also agree that that's very important.
That's beside the point, though.

> And it's isolated, so its not a threat to anybody that doesn't choose
> to use it. Users want it and will use this; if I didn't know that for
> certain I wouldn't spend time on it.

I strongly doubt it.

> If saying "I'm unhappy with something" is sufficient grounds for
> rejecting a patch, I'm surprised to hear it. There has been no
> discussion of what exactly would be better, only that what we have is
> somehow wrong, a point which both Pavan and I dispute, not least
> because the executor has already been rewritten once at Peter's
> request.

That's a total exaggeration. What happened was that Pavan cleaned up a
lot of the EPQ code, and related code in nodeModifyTable.c, as part of
getting the RC mode conflict handling right. Again, yes, that was
really essential work.

A lot of the things that are bad about this patch are the same things
that were bad about my own ON CONFLICT patch before Andres arrived on
the scene. Very little changed about the fundamental semantics after
he joined that project, but a lot changed about the representation
used within the parser, planner, and executor. I think that the same
thing needs to happen here.

I knew from direct experience that it would be perfectly possible to
have a very useful discussion about the most important issue, the
semantics, without really having to address the very real concerns
that I had about the representation until a later date. Indeed, we
managed to do that, and I'm very glad that we managed to do that. It
was almost certainly the best strategy available.

Perhaps I should have been more forceful about the fundamental issue,
rather than making specific points about specific consequence of that
problem, but it probably wouldn't have made a big difference in the
end. There is only so much time available.

-- 
Peter Geoghegan



Re: pgsql: New files for MERGE

2018-04-04 Thread Simon Riggs
On 4 April 2018 at 21:14, Andres Freund  wrote:

>> The normal way is to make review comments that allow change. Your
>> request for change of the parser data structures is fine and can be
>> done, possibly by Saturday
>
> I did request changes, and you've so far ignored those requests.

Pavan tells me he has replied to you and is working on specific changes.

-- 
Simon Riggshttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Foreign keys and partitioned tables

2018-04-04 Thread Alvaro Herrera
Robert Haas wrote:

> I suspect that this leads to bugs under concurrency, something to do
> with crosscheck_snapshot, but I couldn't say exactly what the problem
> is off the top of my head.   My hope is that partitioning might be
> immune on the strength of knowing that any given tuple could only be
> present in one particular partition, but that might be wishful
> thinking.

Speaking of crosscheck_snapshot, I just noticed that the case of FKs
with repeatable read or serializable snapshot seems not to be covered by
tests at all, judging from the coverage report:

2635             :     /*
2636             :      * In READ COMMITTED mode, we just need to use an up-to-date regular
2637             :      * snapshot, and we will see all rows that could be interesting. But in
2638             :      * transaction-snapshot mode, we can't change the transaction snapshot. If
2639             :      * the caller passes detectNewRows == false then it's okay to do the query
2640             :      * with the transaction snapshot; otherwise we use a current snapshot, and
2641             :      * tell the executor to error out if it finds any rows under the current
2642             :      * snapshot that wouldn't be visible per the transaction snapshot.  Note
2643             :      * that SPI_execute_snapshot will register the snapshots, so we don't need
2644             :      * to bother here.
2645             :      */
2646        3026 :     if (IsolationUsesXactSnapshot() && detectNewRows)
2647             :     {
2648           0 :         CommandCounterIncrement();  /* be sure all my own work is visible */
2649           0 :         test_snapshot = GetLatestSnapshot();
2650           0 :         crosscheck_snapshot = GetTransactionSnapshot();
2651             :     }
2652             :     else
2653             :     {
2654             :         /* the default SPI behavior is okay */
2655        3026 :         test_snapshot = InvalidSnapshot;
2656        3026 :         crosscheck_snapshot = InvalidSnapshot;
2657             :     }
https://coverage.postgresql.org/src/backend/utils/adt/ri_triggers.c.gcov.html
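
For the archives, a rough, untested sketch of a scenario that ought to reach
that branch (assuming a plain parent/child FK pair and two concurrent
sessions) would be something like:

 create table parent (id int primary key);
 create table child (parent_id int references parent);
 insert into parent values (1);

 -- session 1
 begin isolation level repeatable read;
 select count(*) from child;   -- establishes the transaction snapshot

 -- session 2 (autocommit)
 insert into child values (1);

 -- session 1 again: the RI check for the delete must look for referencing
 -- rows under a current snapshot and crosscheck against the transaction
 -- snapshot, so this should fail rather than silently miss the new row
 delete from parent where id = 1;

Whether that is best written as an isolation spec is a separate question.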

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



comments around heap_lock_tuple confus{ing,ed} around deleted tuples

2018-04-04 Thread Andres Freund
Hi,

While looking at resolving [1] I re-read heap_lock_tuple() and
subsidiary routines and got thoroughly confused for a while.

One reason was that function names and comments talk about updated, when
they also actually deal with deletes.  heap_lock_updated_tuple()
specifically is called on tuples that have not been updated, but have
been deleted.

/*
 * heap_lock_updated_tuple
 *  Follow update chain when locking an updated tuple, acquiring locks
 *  (row marks) on the updated versions.
 *
 * The initial tuple is assumed to be already locked.

So

a) The function name is wrong, we're not necessarily dealing with an
   updated tuple.
b) The initial tuple is actually not generally locked when the function
   is called. See the call below the
   /* if there are updates, follow the update chain */
   comment.

   Or is that supposed to mean that the initial tuple has already been
   locked with the heavyweight lock?  But that can't be true either,
   because afaics the heap_lock_updated_tuple() call for
   LockTupleKeyShare doesn't even do that?


It's also fairly weird that heap_lock_updated_tuple() returns
/* nothing to lock */
return HeapTupleMayBeUpdated;
when the tuple has been deleted (and thus
ItemPointerEquals(&tuple->t_self, ctid)). That'll not get returned by
heap_lock_tuple() itself, but seems thoroughly confusing.


There's some argument to be made for not changing this because "it seems
to work", but the wrong comments and function names are not unlikely to
cause future bugs...

Greetings,

Andres Freund

[1] 
http://archives.postgresql.org/message-id/CAAJ_b95PkwojoYfz0bzXU8OokcTVGzN6vYGCNVUukeUDrnF3dw%40mail.gmail.com



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-04 Thread Thomas Munro
On Thu, Apr 5, 2018 at 2:00 AM, Craig Ringer  wrote:
> I've tried xfs, jfs, ext3, ext4, even vfat. All behave the same on EIO.
> Didn't try zfs-on-linux or other platforms yet.

I think ZFS will be an outlier here, at least in a pure
write()/fsync() test.  (1) It doesn't even use the OS page cache,
except when you mmap()*.  (2) Its idea of syncing data is to journal
it, and its journal presumably isn't in the OS page cache.  In other
words it doesn't use Linux's usual write-back code paths.

While contemplating what exactly it would do (not sure), I came across
an interesting old thread on the freebsd-current mailing list that
discusses UFS, ZFS and the meaning of POSIX fsync().  Here we see a
report of FreeBSD + UFS doing exactly what the code suggests:

https://lists.freebsd.org/pipermail/freebsd-current/2007-August/076578.html

That is, it keeps the pages dirty so it tells the truth later.
Apparently like Solaris/Illumos (based on drive-by code inspection,
see explicit treatment of retrying, though I'm not entirely sure if
the retry flag is set just for async write-back), and apparently
unlike every other kernel I've tried to grok so far (things descended
from ancestral BSD but not descended from FreeBSD, with macOS/Darwin
apparently in the first category for this purpose).

Here's a new ticket in the NetBSD bug database for this stuff:

http://gnats.netbsd.org/53152

As mentioned in that ticket and by Andres earlier in this thread,
keeping the page dirty isn't the only strategy that would work and may
be problematic in different ways (it tells the truth but floods your
cache with unflushable stuff until eventually you force unmount it and
your buffers are eventually invalidated after ENXIO errors?  I don't
know.).  I have no qualified opinion on that.  I just know that we
need a way for fsync() to tell the truth about all preceding writes or
our checkpoints are busted.

*We mmap() + msync() in pg_flush_data() if you don't have
sync_file_range(), and I see now that that is probably not a great
idea on ZFS because you'll finish up double-buffering (or is that
triple-buffering?), flooding your page cache with transient data.
Oops.  That is off-topic and not relevant for the checkpoint
correctness topic of this thread though, since pg_flush_data() is
advisory only.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: comments around heap_lock_tuple confus{ing,ed} around deleted tuples

2018-04-04 Thread Alvaro Herrera
Andres Freund wrote:

> While looking at resolving [1] I re-read heap_lock_tuple() and
> subsidiary routines and got thoroughly confused for a while.
> 
> One reason was that function names and comments talk about updated, when
> they also actually deal with deletes.  heap_lock_updated_tuple()
> specifically is called on tuples that have not been updated, but have
> been deleted.

No objection to renaming the function.  I am certain that when I first
wrote it, it was going to be used for updated tuples; I never considered
deletes.  After it was repurposed, I never thought about renaming it.

> b) The initial tuple is actually not generally locked when the function
>is called. See the call below the
>/* if there are updates, follow the update chain */
>comment.

Hmm, OK, I don't remember this.  But no, it's not about the heavyweight
lock -- it's about the xmax-level tuple lock.

> It's also fairly weird that heap_lock_updated_tuple() returns
>   /* nothing to lock */
>   return HeapTupleMayBeUpdated;
> when the tuple has been deleted (and thus
> ItemPointerEquals(&tuple->t_self, ctid)). That'll not get returned by
> heap_lock_tuple() itself, but seems thoroughly confusing.

Yeah, what MayBeUpdated is supposed to mean in this case is "there is no
error, we were able to do the thing we were asked to do", rather than
exactly "yes, you may update the tuple".  I guess you could argue that
reusing HTSU result values for it was wrong.  It was certainly
convenient.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: comments around heap_lock_tuple confus{ing,ed} around deleted tuples

2018-04-04 Thread Andres Freund
Hi,

On 2018-04-04 18:34:26 -0300, Alvaro Herrera wrote:
> > It's also fairly weird that heap_lock_updated_tuple() returns
> > /* nothing to lock */
> > return HeapTupleMayBeUpdated;
> > when the tuple has been deleted (and thus
> > ItemPointerEquals(&tuple->t_self, ctid)). That'll not get returned by
> > heap_lock_tuple() itself, but seems thoroughly confusing.
> 
> Yeah, what MayBeUpdated is supposed to mean in this case is "there is no
> error, we were able to do the thing we were asked to do", rather than
> exactly "yes, you may update the tuple".  I guess you could argue that
> reusing HTSU result values for it was wrong.  It was certainly
> convenient.

I think just adding a comment along those lines should be good enough...

Greetings,

Andres Freund



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-04 Thread Thomas Munro
On Thu, Apr 5, 2018 at 9:28 AM, Thomas Munro
 wrote:
> On Thu, Apr 5, 2018 at 2:00 AM, Craig Ringer  wrote:
>> I've tried xfs, jfs, ext3, ext4, even vfat. All behave the same on EIO.
>> Didn't try zfs-on-linux or other platforms yet.
>
> While contemplating what exactly it would do (not sure),

See manual for failmode=wait | continue | panic.  Even "continue"
returns EIO to all new write requests, so they apparently didn't
bother to supply an 'eat-my-data-but-tell-me-everything-is-fine' mode.
Figures.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: WIP: a way forward on bootstrap data

2018-04-04 Thread Tom Lane
I'm starting to look through v13 seriously, and one thing struck
me that could use some general discussion: what is our policy
going to be for choosing the default values for catalog columns?
In particular, I noticed that you have for pg_proc

boolproisstrict BKI_DEFAULT(f);

charprovolatile BKI_DEFAULT(v);

charproparallel BKI_DEFAULT(u);

which do not comport at all with the most common values in those
columns.  As of HEAD, I see

postgres=# select proisstrict, count(*) from pg_proc group by 1;
 proisstrict | count 
-+---
 f   |   312
 t   |  2640
(2 rows)

postgres=# select provolatile, count(*) from pg_proc group by 1;
 provolatile | count 
-+---
 i   |  2080
 s   |   570
 v   |   302
(3 rows)

postgres=# select proparallel, count(*) from pg_proc group by 1;
 proparallel | count 
-+---
 r   |   139
 s   |  2722
 u   |91
(3 rows)

(Since this is from the final initdb state, this overstates the number
of .bki entries for pg_proc a bit, but not by much.)

I think there's no question that the default for proisstrict ought
to be "true" --- not only is that by far the more common choice,
but it's actually the safer choice.  A C function that needs to be
marked strict and isn't will at best do the wrong thing, and quite
likely will crash, if passed a NULL value.

The defaults for provolatile and proparallel maybe require more thought
though.  What you've chosen corresponds to the default assumptions of
CREATE FUNCTION, which are what we need for user-defined functions that
we don't know anything about; but I'm not sure that makes them the best
defaults for built-in functions.  I'm inclined to go with the majority
values here, in part because that will make the outliers stand out when
looking at pg_proc.dat.  I don't think it's great that we'll have 2800+
entries explicitly marked with proparallel 'r' or 's', but the less-than-100
with proparallel 'u' will be so only implicitly because the rewrite
script will strip out any field entries that match the default.  That's
really the worst of all worlds: it'd be better to have no default
in this column at all, I think, than to behave like that.

In short, I'm tempted to say that when there's a clear majority of
entries that would use a particular default, that's the default we
should use, whether or not it's "surprising" or "unsafe" according
to the semantics.  It's clearly not "surprising" for a C function
to be marked proparallel 's'; the other cases are more so.

I'm not seeing any other BKI_DEFAULT choices that I'm inclined to
question, so maybe it's a mistake to try to derive any general
policy choices from such a small number of cases.  But anyway
I'm inclined to change these cases.

Comments anyone?

regards, tom lane



Re: WIP: a way forward on bootstrap data

2018-04-04 Thread Andres Freund
Hi,

On 2018-04-04 18:29:31 -0400, Tom Lane wrote:
> I'm starting to look through v13 seriously, and one thing struck
> me that could use some general discussion: what is our policy
> going to be for choosing the default values for catalog columns?
>
> [...]
>
> In short, I'm tempted to say that when there's a clear majority of
> entries that would use a particular default, that's the default we
> should use, whether or not it's "surprising" or "unsafe" according
> to the semantics.  It's clearly not "surprising" for a C function
> to be marked proparallel 's'; the other cases are more so.
>
> [...]
>
> I'm not seeing any other BKI_DEFAULT choices that I'm inclined to
> question, so maybe it's a mistake to try to derive any general
> policy choices from such a small number of cases.  But anyway
> I'm inclined to change these cases.
> 
> Comments anyone?

I think choosing SQL defaults is defensible, but so is choosing the most
common value as the default to make uncommon values stand out more, and so is
choosing the safest values. In short, I don't think it matters terribly
much; we should just try to be reasonably consistent about it.

Greetings,

Andres Freund



Re: [PROPOSAL] timestamp informations to pg_stat_statements

2018-04-04 Thread legrand legrand
> No, the entries are not removed randomly. We track "usage" for each
> entry (essentially +1 for each time the query got executed, with a decay
> factor applied on each eviction (and we evict 5% at a time).

OK, I apologize; I hadn't looked at the source code in detail. Indeed, the
"usage" algorithm based on execution time and calls will keep the
longest-running queries (and that's a good thing). In some cases (after an
application release or a Postgres patch, for example) we may be more
interested in the least-used ones.

I'm still convinced that adding this kind of information
(with a planid, as done in pg_stat_plans) would be an improvement.

Maybe that's because I have used v$sql, first_load_time, last_active_time
and plan_hash_value for (too) many years.

It's always important to know when a new plan (good or bad) occurs, and pgss
could give this kind of information even without an aggregation collector.
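
For example (a sketch only; created and last_updated here are the columns
proposed by this patch, not existing pg_stat_statements columns), spotting
statements that first appeared after a release could be as simple as:

 -- created / last_updated are hypothetical columns from the proposed patch
 select queryid, calls, created, last_updated
   from pg_stat_statements
  where created > now() - interval '1 day'
  order by created desc;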

Last point (for the developer): after testing this patch, in most cases when
calls=1 the created and last_updated values are identical; they don't even
differ to reflect the execution (or planning) duration. Is that a precision
problem or a coding problem?

Regards
PAscal



--
Sent from: http://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html



Re: [HACKERS] [PATCH] Lockable views

2018-04-04 Thread Tatsuo Ishii
>> > > +typedef struct
>> > > +{
>> > > +Oid root_reloid;
>> > > +LOCKMODE lockmode;
>> > > +bool nowait;
>> > > +Oid viewowner;
>> > > +Oid viewoid;
>> > > +} LockViewRecurse_context;
>> > 
>> > Probably wouldn't hurt to pgindent the larger changes in the patch.

Yeah. Also, each struct member needs a comment.

>> > Hm, how can that happen? And if it can happen, why can it only happen
>> > with the root relation?
>> 
>> For example, the following queries cause the infinite recursion of views. 
>> This is detected and the error is raised.
>> 
>>  create table t (i int);
>>  create view v1 as select 1;
>>  create view v2 as select * from v1;
>>  create or replace view v1 as select * from v2;
>>  begin;
>>  lock v1;
>>  abort;
>> 
>> However, I found that the previous patch could not handle the following
>> situation in which the root relation itself doesn't have infinite recursion.
>> 
>>  create view v3 as select * from v1;
>>  begin;
>>  lock v3;
>>  abort;

Shouldn't they be in the regression test?

It's a shame that the create_view test does not include these cases, but
that's another story.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: [PATCH] btree_gin, add support for uuid, bool, name, bpchar and anyrange types

2018-04-04 Thread Tomas Vondra


On 03/30/2018 10:51 PM, Matheus de Oliveira wrote:
> Hi all.
> 
> On Wed, Mar 21, 2018 at 1:47 PM, Tomas Vondra
> mailto:tomas.von...@2ndquadrant.com>> wrote:
> 
> 
> Do you plan to post an updated version of the patch, of what is your
> response to the points raised last week?
> 
> 
> Very sorry about the long delay. I've been on a long trip, with no time to
> look at this carefully.
>  
> 
> I still haven't made my mind regarding usefulness of range opclasses, so
> I suggest to split the patch into two parts - 0001 for the opclasses
> we're confident are useful, and 0002 for those extras. The committer
> then may apply either 0001 or 0001+0002, depending on his judgment.
> 
> 
> I liked the idea. So, follows the patches:
> - 0001-btree_gin-uuid--base.v2.patch - all types but anyrange, and with
> the adjustments on comments you proposed
> - 0002-btree_gin-uuid--anyrange.v2.patch - adding the anyrange type
> (must be applied after 0001)
> 
> Anything else missing?
> 

I don't think so. I've marked it as RFC, but I have no idea if anyone is
going to push it by the end of this commitfest.
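
For anyone reading the thread later, what 0001 enables is essentially mixing
such scalar columns into a GIN index alongside the "native" GIN types.  A
minimal sketch (table and values made up, and assuming the uuid opclass from
0001 is installed):

 create extension btree_gin;
 create table items (id uuid, tags text[]);
 -- one multicolumn GIN index covering both the uuid and the array column
 create index items_gin_idx on items using gin (id, tags);
 select * from items
  where id = '2f61a7a1-4f34-4b53-9a1a-3b1f7a6f5a10' and tags @> '{foo}';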


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PATCH: Configurable file mode mask

2018-04-04 Thread David Steele
Hi Michael,

On 4/2/18 2:28 AM, Michael Paquier wrote:
> 
> @@ -285,7 +286,7 @@ dsm_impl_posix(dsm_op op, dsm_handle handle, Size 
> request_size,
>* returning.
>*/
>   flags = O_RDWR | (op == DSM_OP_CREATE ? O_CREAT | O_EXCL : 0);
> - if ((fd = shm_open(name, flags, 0600)) == -1)
> + if ((fd = shm_open(name, flags, PG_FILE_MODE_DEFAULT)) == -1)
> 
> Hm.  Sorry for the extra noise.  This one is actually incorrect as
> shm_open will use the path specified by glibc, which is out of
> pg_dynshmem so it does not matter for base backups, and we can keep
> 0600.  pg_dynshmem is used for mmap; still, this would map with the umask
> setup by the postmaster as OpenTransientFile & friends are used.  sysv
> uses IPCProtection but there is no need to care about it as well.  No
> need for a comment perhaps..

Yeah, I think I figured that out when I first went through the code but
managed to forget it.  Reverted.

> pg_basebackup.c creates recovery.conf with 0600 all the time ;)

Fixed.

> +  
> +If the data directory allows group read access then certificate files may
> +need to be located outside of the data directory in order to conform to the
> +security requirements outlined above.  Generally, group access is enabled
> +to allow an unprivileged user to backup the database, and in that case the
> +backup software will not be able to read the certificate files and will
> +likely error.
> 
> The answer is no for me and likely the same for others, but I am raising
> the point for the archives.  Should we relax
> check_ssl_key_file_permissions() for group permissions by the way from
> 0600 to 0640 when the file is owned by the user running Postgres?  If we
> don't do that, then SSL private keys will need to be used with 0600,
> potentially breaking backups...  At the same time this reduces the
> security of private keys but if the administrator is ready to accept
> group permissions that should be a risk he is ready to accept? 

I feel this should be considered in a separate patch.  These files are
not created by initdb so it seems to be an admin/packaging issue.  There
is always the option to locate the certs outside the data directory and
some distros do that by default.
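
For example (a sketch only, with made-up paths), relocating the key material
is just a matter of pointing the existing GUCs somewhere outside PGDATA:

 -- hypothetical locations; any path readable by the server user works
 alter system set ssl_cert_file = '/etc/postgresql/ssl/server.crt';
 alter system set ssl_key_file = '/etc/postgresql/ssl/server.key';
 select pg_reload_conf();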

> + &DataDirMode,
> + 0700, , 0777,
> + NULL, NULL, show_data_directory_mode
> Instead of a camel case, renaming that to data_directory_mode would be
> nice to ease future greps.  I do that all the time for existing code,
> pesting when things are not consistent.  A nit.

Done.

> There is a noise diff in miscinit.c.

Fixed.

> I am pretty sure that we want more documentation in pg_basebackup,
> pg_rewind and pg_resetwal telling that those consider grouping access.

I think this makes sense for pg_basebackup, pg_receivewal, and
pg_recvlogical so I have added notes for those.  Not sure I'm happy with
the language but at least we have something to bikeshed.

It seems to me that pg_resetwal and pg_rewind should be expected to not
break the permissions in PGDATA, just as they do now.

> There is no need to include sys/stat.h as this is already part of
> file_perm.h as DataDirectoryMask uses mode_t for its definition.

I removed it from file_perm.h instead.  With the new variables (see
below), most call sites will have no need for the mode constants.

> +/*
> + * Mode of the data directory.  The default is 0700 but may it be changed in
> + * checkDataDir() to 0750 if the data directory actually has that mode.
> + */
> +int DataDirMode = PG_DIR_MODE_NOGROUP
> 
>  /*
> - * Default mode for directories.
> + * Default mode for created files, unless something else is specified using
> + * the *Perm() function variants.
>   */
> -#define PG_DIR_MODE_DEFAULTS_IRWXU
> +#define PG_FILE_MODE_DEFAULT   (S_IRUSR | S_IWUSR | S_IRGRP)
> To reduce the code churn and simplify the logic, I would recommend to
> not use variables which have a negative meaning, so PG_DIR_MODE_DEFAULT
> should remain the same in patch 2, and there should be PG_DIR_MODE_GROUP
> instead of PG_DIR_MODE_NOGROUP.  That would be more consistent with the
> file modes as well.

I decided that the constants were a bit confusing in general.  What does
"default" mean, anyway? Instead I have created variables in file_perm.c
that hold the current file create mode, dir create mode, and mode mask.
All call sites use those variables (e.g. pg_dir_create_mode), which I
think is much clearer.

This causes a bit of churn with the constants we added last September
but those were added for v11 so it won't create more complications for
back-patching.

> Yes, we can.

Yes!  We can!

New patches attached.

Thanks,
-- 
-David
da...@pgmasters.net
From f8c2604ebb6ae16b3b843c2db491d10efe2737fe Mon Sep 17 00:00:00 2001
From: David Steele 
Date: Wed, 4 Apr 2018 19:17:46 -0400
Subject: [PATCH 1/2] Refactor file permissions in backend/frontend

Adds a new header (file_perm.h) and makes all front-end utilities use 

Re: Rewrite of pg_dump TAP tests

2018-04-04 Thread Michael Paquier
On Wed, Apr 04, 2018 at 10:25:03AM -0400, Stephen Frost wrote:
> I've updated those tests as well and rebased the patch (though no real
> changes were needed for the rebase).  Passes all tests.  I'll take
> another look through the changes again but plan to push them in a few
> hours, later on this afternoon.

Okay, I can see that those have been pushed as 446f7f5.
--
Michael


signature.asc
Description: PGP signature


Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

2018-04-04 Thread Andres Freund
Hi,

On 2018-04-02 11:26:38 -0400, Robert Haas wrote:
> On Wed, Mar 28, 2018 at 2:12 PM, Andres Freund  wrote:
> > How will it break it? They'll see an invalid ctid and conclude that the
> > tuple is dead? Without any changes that's already something that can
> > happen if a later tuple in the chain has been pruned away.  Sure, that
> > code won't realize it should error out because the tuple is now in a
> > different partition, but neither would a infomask bit.
> >
> > I think my big problem is that I just don't see what the worst that can
> > happen is. We'd potentially see a broken ctid chain, something that very
> > commonly happens, and consider the tuple to be invisible.  That seems
> > pretty sane behaviour for unadapted code, and not any worse than other
> > potential solutions.
>
> This is more or less my feeling as well.  I think it's better to
> conserve our limited supply of infomask bits as much as we can, and I
> do think that we should try to reclaimed HEAP_MOVED_IN and
> HEAP_MOVED_OFF in the future instead of defining the combination of
> the two of them to mean something now.

Yep.

It'd also make locking more complicated or require keeping more
information around in HeapUpdateFailureData. In a number of places we
currently release the buffer pin before switching over heap_lock_tuple
etc results, or there's not even a way to get at the infomask currently
(heap_update failing).


> Modulo implementation quality, I think the risk
> level of this patch is somewhat but not vastly higher than
> 37484ad2aacef5ec794f4dd3d5cf814475180a78, which similarly defined a
> previously-unused bit pattern in the tuple header.

Personally I think that change was vastly riskier, because it affected
freezing and wraparounds. Which is something we've repeatedly gotten
wrong.


> The reason I think this one might be somewhat riskier is because
> AFAICS it's not so easy to make sure we've found all the code, even in
> core, that might care, as it was in that case; and also because
> updates happen more than freezing.

But the consequences of not catching a changed piece of code are fairly
harmless. And I'd say things that happen more often are actually easier
to validate than something that with default settings requires hours of
testing...

I've attached a noticeably editorialized patch:

- I'm uncomfortable with the "moved" information not being crash-safe /
  replicated. Thus I added a new flag to preserve it, and removed the
  masking of the moved bit in the ctid from heap_mask().

- renamed macros to not mention valid / invalid block numbers, but
  rather
  HeapTupleHeaderSetMovedPartitions / HeapTupleHeaderIndicatesMovedPartitions
  and
  ItemPointerSetMovedPartitions /  ItemPointerIndicatesMovedPartitions

  I'm not wedded to these names, but I'll be adamant that they not talk
  about invalid block numbers; that makes the code harder to understand,
  IMO.

- removed new assertion from heap_get_latest_tid(), it's wrong for the
  case where all row versions are invisible.

- editorialized comments a bit

- added a few more assertions


I went through the existing code to make sure that
a) no checks were missed
b) to evaluate what the consequences when chasing chains would be
c) to evaluate what the consequences when we miss erroring out

WRT b), it's usually just superfluous extra work if the new checks
weren't there. I went through all callers accessing xmax (via GetRawXmax
and GetUpdateXid):

b)
- heap rewrites will keep a tuple in hashtable till end of run, then
  reset the ctid to self. No real corruption, but we'd not detect
  further errors when attempting to follow chain.
- EvalPlanQualFetch would fail to abort loop, attempt to fetch
  tuple. This'll extend the relation by a single page, because P_NEW ==
  InvalidBlockNumber.
- heap_prune_chain - no changes needed (delete isn't recursed through)
- heap_get_root_tuples - same
- heap_hot_search_buffer - only continues over hot updates
- heap_lock_tuple (and subsidiary routines) - same as EvalPlanQualFetch,
  would then return HeapTupleUpdated.

c)
- GetTupleForTrigger - the proper error wouldn't be raised, instead a
  NULL tuple would be passed to the trigger
- EvalPlanQualFetch - a NULL tuple would be returned after the
  consequences above
- RelationFindReplTupleBy* - wrong error message
- ExecLockRows - no error would be raised, continue normally
- ExecDelete() - tuple ignored without error
- ExecUpdate() - same


Questions:

- I'm not perfectly happy with
  "tuple to be locked was already moved to another partition due to concurrent 
update"
  as the error message. If somebody has a better suggestions.
- should heap_get_latest_tid() error out when the chain ends in a moved
  tuple?  I personally think it doesn't matter much, the functionality
  is so bonkers and underspecified that it doesn't matter anyway ;)
- I'm not that happy with the number of added spec test files with
  number postfixes. Can't we combine them into a single file?
- as remarked else

Re: pgsql: New files for MERGE

2018-04-04 Thread Michael Paquier
On Wed, Apr 04, 2018 at 10:10:46AM -0700, Andres Freund wrote:
> This needs at the very least a response to the issues pointed out in the
> referenced email that you chose to ignore without any sort of comment.

That's definitely not cool.
--
Michael


signature.asc
Description: PGP signature


Re: lazy detoasting

2018-04-04 Thread Chapman Flack
On 04/01/18 01:19, Tom Lane wrote:
> Chapman Flack  writes:
>> If I copy an out-of-line, on-disk TOAST pointer into a memory context
>> with transaction lifetime, with an eye to detoasting it later in the
>> same transaction, are there circumstances where it wouldn't work?
> 
> Should be safe *as long as you hold onto a snapshot that can see the
> toast value's parent row*.  Otherwise, it's not.
> 
> For a counterexample, see the changes we had to make to avoid depending
> on out-of-line toast values in the catcaches, commit 08e261cbc.

Thanks. I think I see two approaches happening in that commit: retaining
a snapshot, if the tuples will be used within the transaction, or eagerly
detoasting, when persisting a holdable portal so tuples can be used after
the transaction.

Please bear with me as I check my understanding of snapshot management
by looking at PersistHoldablePortal(). There's a
PushActiveSnapshot(queryDesc->snapshot) in there. Is that because:

(a) that snapshot might not be on the active stack at this point, and it
needs to be there to keep it alive during this executor re-run, or

(b) it's known to be somewhere on the active stack already, but not
necessarily on top, and needs to be pushed so it is topmost while
re-running this executor (does the top snapshot on the active stack
affect which tuples the executor sees? or is this stack's only purpose
to keep snapshots alive?), or

(c) it's known to be somewhere on the stack already, but needs to be
pushed to make sure some stack-depth invariant is preserved
(perhaps because of an implied pop in case of an error), or

(d) some other reason I haven't thought of ?

I *think* I'm smart enough to rule out choice (a), because it wouldn't
make sense to have queryDesc->snapshot refer to a snapshot that isn't
already on the stack (unless it's also been registered, in which case,
why would it need to be pushed on the stack too?), as then I would be
reckless to assume it's still alive to *be* pushed. Am I close?

I haven't yet gone to track down the code that assigns a snapshot to
queryDesc->snapshot.

Thanks,
-Chap



Re: WIP: Covering + unique indexes.

2018-04-04 Thread Peter Geoghegan
On Wed, Apr 4, 2018 at 3:09 PM, Alexander Korotkov
 wrote:
> Thank you for review!  Revised patchset is attached.

Cool.

* btree_xlog_split() still has this code:

/*
 * On leaf level, the high key of the left page is equal to the first key
 * on the right page.
 */
if (isleaf)
{
ItemId  hiItemId = PageGetItemId(rpage, P_FIRSTDATAKEY(ropaque));

left_hikey = (IndexTuple) PageGetItem(rpage, hiItemId);
left_hikeysz = ItemIdGetLength(hiItemId);
}

However, we never fail to store the high key now, even at the leaf
level, because of this change to the corresponding point in
_bt_split():

> -   /* Log left page */
> -   if (!isleaf)
> -   {
> -   /*
> -* We must also log the left page's high key, because the right
> -* page's leftmost key is suppressed on non-leaf levels.  Show it
> -* as belonging to the left page buffer, so that it is not stored
> -* if XLogInsert decides it needs a full-page image of the left
> -* page.
> -*/
> -   itemid = PageGetItemId(origpage, P_HIKEY);
> -   item = (IndexTuple) PageGetItem(origpage, itemid);
> -   XLogRegisterBufData(0, (char *) item, 
> MAXALIGN(IndexTupleSize(item)));
> -   }
> +   /*
> +* We must also log the left page's high key.  There are two reasons
> +* for that: right page's leftmost key is suppressed on non-leaf 
> levels,
> +* in covering indexes, included columns are truncated from high keys.
> +* For simplicity, we don't distinguish these cases, but log the high
> +* key every time.  Show it as belonging to the left page buffer, so
> +* that it is not stored if XLogInsert decides it needs a full-page
> +* image of the left page.
> +*/
> +   itemid = PageGetItemId(origpage, P_HIKEY);
> +   item = (IndexTuple) PageGetItem(origpage, itemid);
> +   XLogRegisterBufData(0, (char *) item, MAXALIGN(IndexTupleSize(item)));

So should we remove the first block of code? Note also that this
existing comment has been made obsolete:

/* don't release the buffer yet; we touch right page's first item below */

/* Now reconstruct left (original) sibling page */
if (XLogReadBufferForRedo(record, 0, &lbuf) == BLK_NEEDS_REDO)

Maybe we *should* release the right sibling buffer at the point of the
comment now?

* _bt_mkscankey() should assert that the IndexTuple has the correct
number of attributes.

I don't expect you to change routines like _bt_mkscankey() so they
actually respect the number of attributes from BTreeTupGetNAtts(),
rather than just relying on IndexRelationGetNumberOfKeyAttributes().
However, an assertion seems well worthwhile. It's a big reason for
having BTreeTupGetNAtts().

This also lets you get rid of at least one assertion from
_bt_doinsert(), I think.

* _bt_isequal() should assert that the IndexTuple was not truncated.

* The order could be switched here:

> @@ -443,6 +443,17 @@ _bt_compare(Relation rel,
> if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
> return 1;
>
> +   /*
> +* Check tuple has correct number of attributes.
> +*/
> +   if (unlikely(!_bt_check_natts(rel, page, offnum)))
> +   {
> +   ereport(ERROR,
> +   (errcode(ERRCODE_INTERNAL_ERROR),
> +errmsg("tuple has wrong number of attributes in index 
> \"%s\"",
> +   RelationGetRelationName(rel;
> +   }

In principle, we should also check _bt_check_natts() for "minus
infinity" items, just like you did within verify_nbtree.c. Also, there
is no need for parenthesis here.

* Maybe _bt_truncate_tuple() should assert that the caller has not
tried to truncate a tuple that has already been truncated.

I'm not sure if our assertion should be quite that strong, but I think
that that might be good because in general we only need to truncate on
the leaf level -- truncating at any other level on the tree (e.g.
doing traditional suffix truncation) is always subtly wrong. What we
definitely should do, at a minimum, is make sure that attempting to
truncate a tuple to 2 attributes when it already has 0 attributes
fails with an assertion failure.

Can you try adding the strong assertion (truncate only once) to
_bt_truncate_tuple()? Maybe that's not possible, but it seems worth a
try.

* I suggest we invent a new flag for 0x2000 within itup.h, to replace
"/* bit 0x2000 is reserved for index-AM specific usage */".

We can call it INDEX_AM_RESERVED_BIT. Then, we can change
INDEX_ALT_TID_MASK to use this rather than a raw 0x2000. We can do the
same for INDEX_MOVED_BY_SPLIT_MASK within hash.h, too. I find this
neater.

* We should "use" one of the 4 new status bits that are available from
an offset (for INDEX_ALT_TID_MASK index tuples) for future use by leaf
index tuples. Perhaps call it BT_ALT_TID_NONPIVOT.

I guess you could say that I want us to reserve one of our 4 reserve b

Re: [HACKERS] Runtime Partition Pruning

2018-04-04 Thread Amit Langote
Hi David.

On 2018/03/31 22:52, David Rowley wrote:
> The attached patchset is based on Amit's v45 faster partition pruning [1].
> 
> I've made a few changes since the v14 version. Since Amit's v45 patch
> now creates the partition pruning details in a data structure that can
> be copied from the planner over to the executor, it means this patch
> is now able to do much of the setup work for run-time pruning in the
> planner. Doing this now allows us to determine if run-time pruning is
> even possible at plan time, rather than the executor attempting it and
> sometimes wasting effort when it failed to find Params matching the
> partition key.
>
> Another change from the last version is that I've separated out the
> handling of exec Params and external Params.  The new patch now will
> perform a pruning step at executor startup if some external params
> match the partition key.  This is very beneficial to a PREPAREd OLTP
> type query against a partitioned table as it means we can skip
> sub-plan initialisation for all non-matching partitions.

This is quite nice.

> Initialising
> Append/MergeAppend/ModifyTable nodes with fewer subnodes than were
> originally planned did require a small change in explain.c in some
> code that was assuming the subnode arrays were the same length as the
> subplan list. I also ended up adding an Int property to EXPLAIN to show
> the number of subnodes that have been removed during execution.
> Unfortunately, there is a small corner case with this: in the case
> where all subnodes are removed, it leaves EXPLAIN with no subnodes to
> search for outer side Vars in. I didn't really see any sane way to
> handle this in EXPLAIN, so I think the best fix for this is what I've
> done, and that's just to always initialise at least 1 subnode, even
> when none match the external Params. Cost-wise, I don't think this is
> such a big deal as the cost savings here are coming from saving on
> initialising tens or hundreds of subnodes, not 1.

I have wondered about the possibility of setting a valid (non-dummy)
targetlist in Append and MergeAppend if they're created for a partitioned
table.  Currently they're overwritten by a dummy one using
set_dummy_tlist_references() in set_plan_refs() citing the following reason:

 * set_dummy_tlist_references
 *Replace the targetlist of an upper-level plan node with a simple
 *list of OUTER_VAR references to its child.
 *
 * This is used for plan types like Sort and Append that don't evaluate
 * their targetlists.  Although the executor doesn't care at all what's in
 * the tlist, EXPLAIN needs it to be realistic.

In fact, when I noticed this EXPLAIN behavior I wondered whether it was
something we should have discussed when d3cc37f1d801a [1] went in.

> To put the new patch to the test, I tried pgbench -S -M prepared -s
> 100 with and without having modified pgbench_accounts to separate into
> 10 RANGE partitions of equal size.
> 
> A non-partitioned table was getting 12503 TPS.
> With partitioned tables, the old version of this patch was getting: 5470 TPS.
> With partitioned tables, the attached version gets 11247 TPS.
> For perspective, today's master with a partitioned table gets 4719 TPS.

Quite nice!
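
For anyone wanting to reproduce this, the -M prepared case essentially boils
down to the following (a sketch only, using pgbench's own select-only query):

 prepare q (int) as
   select abalance from pgbench_accounts where aid = $1;
 -- once a generic plan is used, the patch performs its pruning step at
 -- executor startup, so only the matching partition's subplan is initialized
 execute q (42);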

> So you can see it's a pretty good performance boost by skipping
> initialisation of the 9 non-matching subplans.  It's not hard to
> imagine the gains getting more significant with a larger number of
> partitions. Ideally, the performance of a partitioned table would be
> faster than the non-partitioned case, but please remember this query
> is a single row PK lookup SELECT, so is a very fast query in both
> cases. Partitioning cases should improve more as the table grows and
> indexes struggle to fit in RAM, or when many rows are being taken from
> the partition and being aggregated.
> 
> Unfortunately, the story is not quite as good for the non -M prepared
> version of the benchmark. This shows that planning time of partitioned
> table queries could still use some improvements. Amit's v45 patch
> certainly makes a dent in the planner slow performance here, but it's
> still nothing like as fast as the non-partitioned case. More work is
> required there in the future.

Certainly.  Things like the issue that we both replied to yesterday should
be addressed for starters [2].

> This patchset also contains a patch to improve the performance of
> inheritance planning of UPDATE/DELETE queries. This patch also
> implements run-time pruning for UPDATE/DELETE too. This also has a
> significant performance improvement for planning of UPDATE/DELETE
> operations on partitioned tables with a large number of partitions,
> performance is as follows:
> 
> Values in transactions per second.
> 
> Partitions = 1
> Unpatched: 7323.3
> Patched: 6573.2 (-10.24%)
> 
> Partitions = 2
> Unpatched: 6784.8
> Patched: 6377.1 (-6.01%)
> 
> Partitions = 4
> Unpatched: 5903.0
> Patched: 6106.8 (3.45%)
> 
> Partitions = 8
> Unpatched: 4582.0
> Patched: 5579.9 (21.78%)
>

Re: Parallel Aggregates for string_agg and array_agg

2018-04-04 Thread David Rowley
Hi Tomas,

Thanks for taking another look.

On 5 April 2018 at 07:12, Tomas Vondra  wrote:
> Seems fine to me, although we should handle the anyarray case too, I
> guess. That is, get_agg_clause_costs_walker() should do this too:
>
> /* Same thing for array_agg_array_(de)serialize. */
> if ((aggserialfn == F_ARRAY_AGG_ARRAY_SERIALIZE ||
>  aggdeserialfn == F_ARRAY_AGG_ARRAY_DESERIALIZE) &&
>  !agg_args_support_sendreceive(aggref))
> costs->hasNonSerial = true;

hmm, array_agg_array_serialize and array_agg_array_deserialize don't
use the send/receive functions though, so not sure why that's
required?

> Other than that, the patch seems fine to me, and it's already marked as
> RFC so I'll leave it at that.

Thanks.

> The last obstacle seems to be the argument about the risks of the patch
> breaking queries of people relying on the ordering. Not sure what's are
> the right next steps in this regard ...

yeah, seems like a bit of a stalemate.

Personally, I think if the group of people Tom mentions does exist, then
they've already been through some troubled times since Parallel Query
was born. I'd hope they've already put up their defenses due to the
advent of that feature. I stand by my thoughts that it's pretty
bizarre to draw the line here when we've probably been causing these
people issues for many years already. I said my piece on this already
so likely not much point in going on about it. These people are also
perfectly capable of sidestepping the whole issue with setting
max_parallel_workers_per_gather TO 0.
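
That is, something as simple as:

  -- per session, or persistently via ALTER ROLE ... SET / ALTER DATABASE ... SET
  SET max_parallel_workers_per_gather = 0;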

Perhaps one solution is to drop the string_agg and keep the array_agg.
Unsure if that would be good enough for Tom? More people seem to have
voiced that array_agg ordering is generally less of a concern, which I
imagine is probably true, but my opinion might not matter here as I'm
the sort of person who, if I needed a specific ordering, would have
written an ORDER BY clause.
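
(That is, when the order matters, spell it out; with made-up table and
column names:

  SELECT string_agg(val, ',' ORDER BY val) FROM tab;
  SELECT array_agg(val ORDER BY val) FROM tab;

which stays deterministic however the rest of the query is executed.)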

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Unstable number of workers in select_parallel test on spurfowl

2018-04-04 Thread Thomas Munro
Hi,

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=spurfowl&dt=2018-04-05%2003%3A22%3A05

*** /home/pgbuild/buildfarm/HEAD/pgsql.build/src/test/regress/expected/select_parallel.out  2018-04-04 23:22:07.297108347 -0400
--- /home/pgbuild/buildfarm/HEAD/pgsql.build/src/test/regress/results/select_parallel.out  2018-04-04 23:24:11.709955623 -0400
***
*** 103,109 
  -
   Finalize Aggregate
 ->  Gather
!  Workers Planned: 1
   ->  Partial Aggregate
 ->  Append
   ->  Parallel Seq Scan on a_star
--- 103,109 
  -
   Finalize Aggregate
 ->  Gather
!  Workers Planned: 3
   ->  Partial Aggregate
 ->  Append
   ->  Parallel Seq Scan on a_star

The query in question is this:

-- Disable Parallel Append
alter table a_star reset (parallel_workers);
alter table b_star reset (parallel_workers);
alter table c_star reset (parallel_workers);
alter table d_star reset (parallel_workers);
alter table e_star reset (parallel_workers);
alter table f_star reset (parallel_workers);
set enable_parallel_append to off;
explain (costs off)
  select round(avg(aa)), sum(aa) from a_star;

I don't see why...

-- 
Thomas Munro
http://www.enterprisedb.com



Re: WIP: a way forward on bootstrap data

2018-04-04 Thread Tom Lane
Here are the results of an evening's desultory hacking on v13.

I was dissatisfied with the fact that we still had several
function-referencing columns that had numeric instead of symbolic
contents, for instance pg_aggregate.aggfnoid.  Of course, the main reason
is that those are declared regproc but reference functions with overloaded
names, which regproc can't handle.  Now that the lookups are being done in
genbki.pl there's no reason why we have to live with that limitation.
In the attached, I've generalized the BKI_LOOKUP(pg_proc) code so that
you can use either regproc-like or regprocedure-like notation, and then
applied that to relevant columns.
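
(The underlying limitation is the same one visible at the SQL level:

  SELECT 'sum'::regproc;             -- fails: more than one function named "sum"
  SELECT 'sum(int8)'::regprocedure;  -- fine

hence teaching genbki.pl to accept the regprocedure-like spelling too.)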

I did not like the hard-wired handling of proargtypes and proallargtypes
in genbki.pl; it hardly seems impossible that we'll want similar
conversions for other array-of-OID columns in future.  After a bit of
thought, it seemed like we could allow

oidvector  proargtypes BKI_LOOKUP(pg_type);

Oid  proallargtypes[1] BKI_DEFAULT(_null_) BKI_LOOKUP(pg_type);

and just teach genbki.pl that if a lookup rule is attached to
an oidvector or Oid[] column, it means to apply the rule to
each array element individually.

I also changed genbki.pl so that it'd warn about entries that aren't
recognized by the lookup rules.  This seems like a good idea for
catching errors, such as (ahem) applying BKI_LOOKUP to a column
that isn't even an OID.

bootstrap-v13-delta.patch is a diff atop your patch series for the
in-tree files, and convert_oid2name.patch adjusts that script to
make use of the additional conversion capability.

regards, tom lane

diff --git a/src/backend/catalog/README.data b/src/backend/catalog/README.data
index b7c680c..22ad0f2 100644
*** a/src/backend/catalog/README.data
--- b/src/backend/catalog/README.data
*** teach Catalog::ParseData() how to expand
*** 62,71 
  representation.
  
  - To aid readability, some values that are references to other catalog
! entries are represented by macros rather than numeric OIDs. This is
! the case for index access methods, opclasses, operators, opfamilies,
! and types. This is done for functions as well, but only if the proname
! is unique.
  
  Bootstrap Data Conventions
  ==
--- 62,103 
  representation.
  
  - To aid readability, some values that are references to other catalog
! entries are represented by names rather than numeric OIDs.  Currently
! this is the case for access methods, functions, operators, opclasses,
! opfamilies, and types.  The rules are as follows:
! 
! * Use of names rather than numbers is enabled for a particular catalog
! column by attaching BKI_LOOKUP(lookuprule) to the column's definition,
! where "lookuprule" is pg_am, pg_proc, pg_operator, pg_opclass,
! pg_opfamily, or pg_type.
! 
! * In a name-lookup column, all entries must use the name format except
! when writing "0" for InvalidOid.  (If the column is declared regproc,
! you can optionally write "-" instead of "0".)  genbki.pl will warn
! about unrecognized names.
! 
! * Access methods are just represented by their names, as are types.
! Type names must match the referenced pg_type entry's typname; you
! do not get to use any aliases such as "integer" for "int4".
! 
! * A function can be represented by its proname, if that is unique among
! the pg_proc.dat entries (this works like regproc input).  Otherwise,
! write it as "proname(argtypename,argtypename,...)", like regprocedure.
! The argument type names must be spelled exactly as they are in the
! pg_proc.dat entry's proargtypes field.  Do not insert any spaces.
! 
! * Operators are represented by "oprname(lefttype,righttype)", writing the
! type names exactly as they appear in the pg_operator.dat entry's oprleft
! and oprright fields.  (Write 0 for the omitted operand of a unary
! operator.)
! 
! * The names of opclasses and opfamilies are only unique within an access
! method, so they are represented by "access_method_name/object_name".
! 
! In none of these cases is there any provision for schema-qualification;
! all objects created during bootstrap are expected to be in the pg_catalog
! schema.
! 
  
  Bootstrap Data Conventions
  ==
*** You can also use the duplicate_oids scri
*** 105,111 
  build if a duplicate is found.)
  
  - The OID counter starts at 1 at bootstrap.  If a catalog row is
! in a table that requires OIDs, but no OID was preassigned by an "OID ="
  clause, then it will receive an OID of 1 or above.
  
  
--- 137,143 
  build if a duplicate is found.)
  
  - The OID counter starts at 1 at bootstrap.  If a catalog row is
! in a table that requires OIDs, but no OID was preassigned by an "oid =>"
  clause, then it will receive an OID of 1 or above.
  
  
diff --git a/src/backend/catalog/genbki.pl b/src/backend/catalog/genbki.pl
index 27494d9..f6be50a 100644
*** a/src/backend/catalog/genbki.pl
--- b/src/backend/catalog/genbk
