RE: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-08-13 Thread Smith, Peter
On Sat, Aug 10, 2019 at 1:19 AM Tomas Vondra  
wrote: 

> We can of course support "forced" re-encryption, but I think it's acceptable 
> if that's fairly expensive as long as it can be throttled and executed in the 
> background (kinda similar to the patch to enable checksums in the background).

As an alternative way to provide "forced" re-encryption, couldn't you just 
run pg_dumpall + psql?

Regards,
--
Peter Smith
Fujitsu Australia




Re: subscriptionCheck failures on nightjar

2019-08-13 Thread Michael Paquier
On Wed, Feb 13, 2019 at 01:51:47PM -0800, Andres Freund wrote:
> I'm not yet sure that that's actually something that's supposed to
> happen, I got to spend some time analysing how this actually
> happens. Normally the contents of the slot should actually prevent it
> from being removed (as they're newer than
> ReplicationSlotsComputeLogicalRestartLSN()). I kind of wonder if that's
> a bug in the drop logic in newer releases.

In the same context, could it be a consequence of 9915de6c, which
introduced a condition variable to control slot operations?  This
could have more easily exposed a pre-existing race condition.
--
Michael




Re: errbacktrace

2019-08-13 Thread Peter Eisentraut
On 2019-07-22 20:19, Peter Eisentraut wrote:
> On 2019-07-09 11:43, Peter Eisentraut wrote:
>> After further research I'm thinking about dropping the libunwind
>> support.  The backtrace()/backtrace_symbols() API is more widely
>> available: darwin, freebsd, linux, netbsd, openbsd (via port), solaris,
>> and of course it's built-in, whereas libunwind is only available for
>> linux, freebsd, hpux, solaris, and requires an external dependency.
> 
> Here is an updated patch without the libunwind support, some minor
> cleanups, documentation, and automatic back traces from assertion failures.

Another updated version.

I have changed the configuration setting to backtrace_functions plural,
so that you can debug more than one location at once.  I had originally
wanted to do that but using existing functions like
SplitIdentifierString() resulted in lots of complications with error
handling (inside error handling!).  So here I just hand-coded the list
splitting.  Seems simple enough.
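
For anyone who hasn't used it, the backtrace()/backtrace_symbols() API from
<execinfo.h> that the patch now relies on exclusively can be exercised with a
minimal standalone sketch like this (purely illustrative, not code from the
patch):

#include <execinfo.h>
#include <stdio.h>
#include <stdlib.h>

/* Print the current call stack to stderr. */
static void
print_stack(void)
{
    void   *frames[100];
    int     nframes = backtrace(frames, 100);
    char  **symbols = backtrace_symbols(frames, nframes);

    if (symbols == NULL)
        return;
    for (int i = 0; i < nframes; i++)
        fprintf(stderr, "%s\n", symbols[i]);
    free(symbols);      /* a single free() releases the whole array */
}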

I think this patch is now good to go from my perspective.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From f9d440ec41f49464661960c580c29ce0558a1642 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 13 Aug 2019 01:47:22 +0200
Subject: [PATCH v4] Add backtrace support

Add some support for automatically showing backtraces in certain error
situations in the server.  Backtraces are shown on assertion failure.
A new setting backtrace_functions can be set to a list of C function
names, and all ereport()s and elog()s from the functions will have
backtraces generated.  Finally, the function errbacktrace() can be
manually added to an ereport() call to generate a backtrace for that
call.

Discussion: 
https://www.postgresql.org/message-id/flat/5f48cb47-bf1e-05b6-7aae-3bf2cd015...@2ndquadrant.com
---
 configure|  61 +++-
 configure.in |   4 ++
 doc/src/sgml/config.sgml |  26 +++
 src/backend/utils/error/assert.c |  13 
 src/backend/utils/error/elog.c   | 115 +++
 src/backend/utils/misc/guc.c |  12 
 src/include/pg_config.h.in   |   6 ++
 src/include/utils/elog.h |   3 +
 src/include/utils/guc.h  |   1 +
 9 files changed, 239 insertions(+), 2 deletions(-)

diff --git a/configure b/configure
index 7a6bfc2339..fbd8224f50 100755
--- a/configure
+++ b/configure
@@ -11662,6 +11662,63 @@ if test "$ac_res" != no; then :
 
 fi
 
+# *BSD:
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for library containing 
backtrace_symbols" >&5
+$as_echo_n "checking for library containing backtrace_symbols... " >&6; }
+if ${ac_cv_search_backtrace_symbols+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  ac_func_search_save_LIBS=$LIBS
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+/* Override any GCC internal prototype to avoid an error.
+   Use char because int might match the return type of a GCC
+   builtin and then its argument prototype would still apply.  */
+#ifdef __cplusplus
+extern "C"
+#endif
+char backtrace_symbols ();
+int
+main ()
+{
+return backtrace_symbols ();
+  ;
+  return 0;
+}
+_ACEOF
+for ac_lib in '' execinfo; do
+  if test -z "$ac_lib"; then
+ac_res="none required"
+  else
+ac_res=-l$ac_lib
+LIBS="-l$ac_lib  $ac_func_search_save_LIBS"
+  fi
+  if ac_fn_c_try_link "$LINENO"; then :
+  ac_cv_search_backtrace_symbols=$ac_res
+fi
+rm -f core conftest.err conftest.$ac_objext \
+conftest$ac_exeext
+  if ${ac_cv_search_backtrace_symbols+:} false; then :
+  break
+fi
+done
+if ${ac_cv_search_backtrace_symbols+:} false; then :
+
+else
+  ac_cv_search_backtrace_symbols=no
+fi
+rm conftest.$ac_ext
+LIBS=$ac_func_search_save_LIBS
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: 
$ac_cv_search_backtrace_symbols" >&5
+$as_echo "$ac_cv_search_backtrace_symbols" >&6; }
+ac_res=$ac_cv_search_backtrace_symbols
+if test "$ac_res" != no; then :
+  test "$ac_res" = "none required" || LIBS="$ac_res $LIBS"
+
+fi
+
 
 if test "$with_readline" = yes; then
 
@@ -12760,7 +12817,7 @@ $as_echo "#define HAVE_STDBOOL_H 1" >>confdefs.h
 fi
 
 
-for ac_header in atomic.h copyfile.h crypt.h fp_class.h getopt.h ieeefp.h 
ifaddrs.h langinfo.h mbarrier.h poll.h sys/epoll.h sys/ipc.h sys/prctl.h 
sys/procctl.h sys/pstat.h sys/resource.h sys/select.h sys/sem.h sys/shm.h 
sys/sockio.h sys/tas.h sys/un.h termios.h ucred.h utime.h wchar.h wctype.h
+for ac_header in atomic.h copyfile.h crypt.h execinfo.h fp_class.h getopt.h 
ieeefp.h ifaddrs.h langinfo.h mbarrier.h poll.h sys/epoll.h sys/ipc.h 
sys/prctl.h sys/procctl.h sys/pstat.h sys/resource.h sys/select.h sys/sem.h 
sys/shm.h sys/sockio.h sys/tas.h sys/un.h termios.h ucred.h utime.h wchar.h 
wctype.h
 do :
   as_ac_Header=`$as_echo "ac_cv_header_$ac_header" | $as_tr_sh`
 ac_fn_c_check_header_mongrel "$LINENO" "$ac_header" "$as_ac_Header" 
"$ac_includes_de

Re: Global temporary tables

2019-08-13 Thread Konstantin Knizhnik



On 13.08.2019 8:34, Craig Ringer wrote:
On Tue, 13 Aug 2019 at 00:47, Pavel Stehule wrote:


But Postgres is not storing this information now anywhere else
except statistic, isn't it?


not only - critical numbers are reltuples, relpages from pg_class


That's a very good point. relallvisible too. How's the global temp 
table impl handling that right now, since you won't be changing the 
pg_class row? AFAICS relpages doesn't need to be up to date (and 
reltuples certainly doesn't) so presumably you're just leaving them as 
zero?
As far as I understand, relpages and reltuples are set only when you 
perform "analyze" on the table.




What happens right now if you ANALYZE or VACUUM ANALYZE a global temp 
table? Is it just disallowed?


No, it is not disallowed now.
It updates the statistics and also the fields in pg_class, which are shared by 
all backends.
So all backends will then build plans according to these statistics. 
Certainly this may lead to less efficient plans if there are large 
differences in the number of tuples stored in this table in different backends.
But that seems critical to me mostly when there are indexes on the 
temporary table, and it seems to me that users create indexes on 
temporary tables even more rarely than they analyze them.


I'll need to check, but I wonder if periodically updating those fields 
in pg_class impacts logical decoding too. Logical decoding must treat 
transactions with catalog changes as special cases where it creates 
custom snapshots and does other expensive additional work. 
(See ReorderBufferXidSetCatalogChanges in reorderbuffer.c and its 
callsites). We don't actually need to know relpages and reltuples 
during logical decoding. It can probably ignore relfrozenxid 
and relminmxid changes too, maybe even pg_statistic changes though I'd 
be less confident about that one.


At some point I need to patch in a bunch of extra tracepoints and do 
some perf tracing to see how often we do potentially unnecessary 
snapshot related work in logical decoding.


Temporary tables (both local and global), as well as unlogged tables, are 
not subject to logical replication, are they?



--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Global temporary tables

2019-08-13 Thread Craig Ringer
On Fri, 9 Aug 2019 at 22:07, Konstantin Knizhnik 
wrote:

>
>
> Ok, here it is: global_private_temp-1.patch
>


Initial pass review follows.

Relation name "SESSION" is odd. I guess you're avoiding "global" because
the data is session-scoped, not globally temporary. But I'm not sure
"session" fits either - after all regular temp tables are also session
temporary tables. I won't focus on naming further beyond asking that it be
consistent, though; right now there's a mix of "global" in some places and
"session" in others.


Why do you need to do all this indirection with changing RelFileNode to
RelFileNodeBackend in the bufmgr, changing BufferGetTag etc? Similarly,
your changes of RelFileNodeBackendIsTemp to RelFileNodeBackendIsLocalTemp .
Did you look into my suggestion of extending the relmapper so that global
temp tables would have a relfilenode of 0 like pg_class etc, and use a
backend-local map of oid-to-relfilenode mappings? I'm guessing you did it
the way you did instead to lay the groundwork for cross-backend sharing,
but if so it should IMO be in that patch. Maybe my understanding of the
existing temp table mechanics is just insufficient as I
see RelFileNodeBackendIsTemp is already used in some aspects of existing
temp relation handling.

Similarly, TruncateSessionRelations probably shouldn't need to exist in
this patch in its current form; there's no shared_buffers use to clean and
the same file cleanup mechanism should handle both session-temp and
local-temp relfilenodes.


A number of places make a change like this:

rel->rd_rel->relpersistence == RELPERSISTENCE_TEMP ||
+ rel->rd_rel->relpersistence == RELPERSISTENCE_SESSION

and I'd like to see a test macro or inline static for it since it's
repeated so much. Mostly to make the intent clear: "is this a relation with
temporary backend-scoped data?"
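
For illustration, a minimal sketch of the kind of helper I mean (the name is
just a placeholder, and RELPERSISTENCE_SESSION is the constant your patch
introduces):

/* Does this relation store temporary, backend-scoped data? */
static inline bool
RelationHasBackendScopedData(Relation rel)
{
    return rel->rd_rel->relpersistence == RELPERSISTENCE_TEMP ||
           rel->rd_rel->relpersistence == RELPERSISTENCE_SESSION;
}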


This test:

+ if (blkno == BTREE_METAPAGE && PageIsNew(BufferGetPage(buf)) &&
IsSessionRelationBackendId(rel->rd_backend))
+ _bt_initmetapage(BufferGetPage(buf), P_NONE, 0);

seems sensible but I'm wondering if there's a way to channel initialization
of global-temp objects through a bit more of a common path, so it reads
more obviously as a common test applying to all global-temp tables. "Global
temp table not initialized in session yet? Initialize it." So we don't have
to have every object type do an object type specific test for "am I
actually uninitialized?" in all paths it might hit. I guess I expected to
see something more like a

if (RelGlobalTempUninitialized(rel))

but maybe I've been doing too much Java ;)

A similar test reappears here for sequences:

+ if (rel->rd_rel->relpersistence == RELPERSISTENCE_SESSION &&
PageIsNew(page))

Why is this written differently?


Sequence initialization ignores sequence startval/firstval settings. Why?


- else if (newrelpersistence == RELPERSISTENCE_PERMANENT)
+ else if (newrelpersistence != RELPERSISTENCE_TEMP)

Doesn't this change the test outcome for RELPERSISTENCE_UNLOGGED?


In PreCommit_on_commit_actions, in the ONCOMMIT_DELETE_ROWS case, is
there any way to still respect the XACT_FLAGS_ACCESSEDTEMPNAMESPACE flag if
the oid is for a backend-temp table not a global-temp table?


+ bool isLocalBuf = SmgrIsTemp(smgr) && relpersistence ==
RELPERSISTENCE_TEMP;

Won't session-temp tables have local buffers too? Until your next patch
that adds shared_buffers storage for them anyway?


+ * These is no need to separate them at file system level, so just
subtract SessionRelFirstBackendId
+ * to avoid too long file names.

I agree that there's no reason to be able to differentiate between
local-temp and session-temp relfilenodes at the filesystem level.

> Also I have attached updated version of the global temp tables with shared
> buffers - global_shared_temp-1.patch
> It is certainly larger (~2k lines vs. 1.5k lines) because it is changing
> BufferTag and related functions.
> But I do not think that this different is so critical.
> Still have a wish to kill two birds with one stone:)
>
> --
>
> Konstantin Knizhnik
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company
>
>

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 2ndQuadrant - PostgreSQL Solutions for the Enterprise


Re: POC: Cleaning up orphaned files using undo logs

2019-08-13 Thread Dilip Kumar
On Tue, Jul 30, 2019 at 1:32 PM Andres Freund  wrote:
>
> Hi,
>
> Amit, short note: The patches aren't attached in patch order. Obviously
> a miniscule thing, but still nicer if that's not the case.
>
> Dilip, this also contains the start of a review for the undo record
> interface further down.

> > Subject: [PATCH 07/14] Provide interfaces to store and fetch undo records.
> >
> > Add the capability to form undo records and store them in undo logs.  We
> > also provide the capability to fetch the undo records.  This layer will use
> > undo-log-storage to reserve the space for the undo records and buffer
> > management routines to write and read the undo records.
> >
>
> > Undo records are stored in sequential order in the undo log.
>
> Maybe "In each und log undo records are stored in sequential order."?
Done
>
>
>
> > +++ b/src/backend/access/undo/README.undointerface
> > @@ -0,0 +1,29 @@
> > +Undo record interface layer
> > +---
> > +This is the next layer which sits on top of the undo log storage, which 
> > will
> > +provide an interface for prepare, insert, or fetch the undo records.  This
> > +layer will use undo-log-storage to reserve the space for the undo records
> > +and buffer management routine to write and read the undo records.
>
> The reference to "undo log storage" kinda seems like a reference into
> nothingness...
Changed
>
>
> > +Writing an undo record
> > +--
> > +To prepare an undo record, first, it will allocate required space using
> > +undo log storage module.  Next, it will pin and lock the required buffers 
> > and
> > +return an undo record pointer where it will insert the record.  Finally, it
> > +calls the Insert routine for final insertion of prepared record.  
> > Additionally,
> > +there is a mechanism for multi-insert, wherein multiple records are 
> > prepared
> > +and inserted at a time.
>
> I'm not sure whta this is telling me. Who is "it"?
>
> To me the filename ("interface"), and the title of this section,
> suggests this provides documentation on how to write code to insert undo
> records. But I don't think this does.
I have improved it
>
>
> > +Fetching and undo record
> > +
> > +To fetch an undo record, a caller must provide a valid undo record pointer.
> > +Optionally, the caller can provide a callback function with the 
> > information of
> > +the block and offset, which will help in faster retrieval of undo record,
> > +otherwise, it has to traverse the undo-chain.
>
> > +There is also an interface to bulk fetch the undo records.  Where the 
> > caller
> > +can provide a TO and FROM undo record pointer and the memory limit for 
> > storing
> > +the undo records.  This API will return all the undo record between FROM 
> > and TO
> > +undo record pointers if they can fit into provided memory limit otherwise, 
> > it
> > +return whatever can fit into the memory limit.  And, the caller can call it
> > +repeatedly until it fetches all the records.
>
> There's a lot of  terminology in this file that's not been introduced. I
> think this needs to be greatly expanded and restructured to allow people
> unfamiliar with the code to benefit.
I have improved it, but I think I still need to work on it to
introduce the terminology used.
>
>
> > +/*-
> > + *
> > + * undoaccess.c
> > + * entry points for inserting/fetching undo records
>
> > + * NOTES:
> > + * Undo record layout:
> > + *
> > + * Undo records are stored in sequential order in the undo log.  Each undo
> > + * record consists of a variable length header, tuple data, and payload
> > + * information.
>
> Is that actually true? There's records without tuples, no?
Right, changed this
>
> > The first undo record of each transaction contains a
> > + * transaction header that points to the next transaction's start
> > header.
>
> Seems like this needs to reference different persistence levels,
> otherwise it seems misleading, given there can be multiple first records
> in multiple undo logs?
I have changed it.
>
>
> > + * This allows us to discard the entire transaction's log at one-shot
> > rather
>
> s/at/in/
Fixed
>
> > + * than record-by-record.  The callers are not aware of transaction header,
>
> s/of/of the/
Fixed
>
> > + * this is entirely maintained and used by undo record layer.   See
>
> s/this/it/
Fixed
>
> > + * undorecord.h for detailed information about undo record header.
>
> s/undo record/the undo record/
Fixed
>
>
> I think at the very least there's explanations missing for:
> - what is the locking protocol for multiple buffers
> - what are the contexts for insertion
> - what phases an undo insertion happens in
> - updating previous records in general
> - what "packing" actually is
>
>
> > +
> > +/* Prototypes for static functions. */
>
>
> Don't think we commonly include that...
Changed, removed all unwanted prototypes
>
> > +static UnpackedUndoRecord *UndoGetOne

Re: Global temporary tables

2019-08-13 Thread Craig Ringer
On Tue, 13 Aug 2019 at 16:19, Konstantin Knizhnik 
wrote:

>
>
> On 13.08.2019 8:34, Craig Ringer wrote:
>
> On Tue, 13 Aug 2019 at 00:47, Pavel Stehule 
> wrote:
>
>
>> But Postgres is not storing this information now anywhere else except
>>> statistic, isn't it?
>>>
>>
>> not only - critical numbers are reltuples, relpages from pg_class
>>
>
> That's a very good point. relallvisible too. How's the global temp table
> impl handling that right now, since you won't be changing the pg_class row?
> AFAICS relpages doesn't need to be up to date (and reltuples certainly
> doesn't) so presumably you're just leaving them as zero?
>
> As far as I understand relpages and reltuples are set only when you
> perform "analyze" of the table.
>

Also autovacuum's autoanalyze.

What happens right now if you ANALYZE or VACUUM ANALYZE a global temp
> table? Is it just disallowed?
>
>
> No, it is not disallowed now.
> It updates the statistic and also fields in pg_class which are shared by
> all backends.
> So all backends will now build plans according to this statistic.
> Certainly it may lead to not so efficient plans if there are large
> differences in number of tuples stored in this table in different backends.
> But seems to me critical mostly in case of presence of indexes for
> temporary table. And it seems to me that users are created indexes for
> temporary tables even rarely than doing analyze for them.
>

That doesn't seem too bad TBH. Hacky but it doesn't seem dangerously wrong
and as likely to be helpful as not if clearly documented.


> Temporary tables (both local and global) as well as unlogged tables are
> not subject of logical replication, aren't them?
>
>
Right. But in the same way that they're still present in the catalogs, I
think they still affect catalog snapshots maintained by logical decoding's
historic snapshot manager as temp table creation/drop will still AFAIK
cause catalog invalidations to be written on commit. I need to double check
that.


-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 2ndQuadrant - PostgreSQL Solutions for the Enterprise


Re: using explicit_bzero

2019-08-13 Thread Peter Eisentraut
On 2019-07-30 07:08, Michael Paquier wrote:
> On Mon, Jul 29, 2019 at 11:30:53AM +0200, Peter Eisentraut wrote:
>> Another patch, with various fallback implementations.
> 
> I have spotted some issues with this patch:
> 1) The list of port files @pgportfiles in Mkvcbuild.pm has not been
> updated with the new file explicit_bzero.c, so the compilation would
> fail with MSVC.
> 2) pg_config.h.win32 does not include the two new flags (same as
> https://www.postgresql.org/message-id/20190624050850.ge1...@paquier.xyz)

Another patch, to attempt to fix the Windows build.

> 3) What about CreateRole() and AlterRole() which can manipulate a
> password in plain format before hashing?  (same message as previous
> point).

If you want to secure CREATE ROLE foo PASSWORD 'plaintext' then you need
to also analyze memory usage in protocol processing and parsing and the
like.  This would be a laborious and difficult to verify undertaking.
It's better to say, if you want to be secure, don't do that.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From df7a712cabe4f29b9bcc135db403e7ea8ed9f1e0 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 13 Aug 2019 10:25:17 +0200
Subject: [PATCH v5] Use explicit_bzero

Use the explicit_bzero() function in places where it is important that
security information such as passwords is cleared from memory.  There
might be other places where it could be useful; this is just an
initial collection.

For platforms that don't have explicit_bzero(), provide various
fallback implementations.  (explicit_bzero() itself isn't standard,
but as Linux/glibc and OpenBSD have it, it's the most common spelling,
so it makes sense to make that the invocation point.)

Discussion: 
https://www.postgresql.org/message-id/flat/42d26bde-5d5b-c90d-87ae-6cab875f73be%402ndquadrant.com
---
 configure| 15 +++-
 configure.in |  2 +
 src/backend/libpq/be-secure-common.c |  3 ++
 src/include/pg_config.h.in   |  6 +++
 src/include/pg_config.h.win32|  6 +++
 src/include/port.h   |  4 ++
 src/interfaces/libpq/fe-connect.c|  8 
 src/port/explicit_bzero.c| 55 
 src/tools/msvc/Mkvcbuild.pm  |  2 +-
 9 files changed, 99 insertions(+), 2 deletions(-)
 create mode 100644 src/port/explicit_bzero.c

diff --git a/configure b/configure
index 7a6bfc2339..d3b1108218 100755
--- a/configure
+++ b/configure
@@ -15143,7 +15143,7 @@ fi
 LIBS_including_readline="$LIBS"
 LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
 
-for ac_func in cbrt clock_gettime copyfile fdatasync getifaddrs getpeerucred 
getrlimit mbstowcs_l memmove poll posix_fallocate ppoll pstat 
pthread_is_threaded_np readlink setproctitle setproctitle_fast setsid shm_open 
strchrnul strsignal symlink sync_file_range uselocale utime utimes wcstombs_l
+for ac_func in cbrt clock_gettime copyfile fdatasync getifaddrs getpeerucred 
getrlimit mbstowcs_l memset_s memmove poll posix_fallocate ppoll pstat 
pthread_is_threaded_np readlink setproctitle setproctitle_fast setsid shm_open 
strchrnul strsignal symlink sync_file_range uselocale utime utimes wcstombs_l
 do :
   as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh`
 ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var"
@@ -15808,6 +15808,19 @@ esac
 
 fi
 
+ac_fn_c_check_func "$LINENO" "explicit_bzero" "ac_cv_func_explicit_bzero"
+if test "x$ac_cv_func_explicit_bzero" = xyes; then :
+  $as_echo "#define HAVE_EXPLICIT_BZERO 1" >>confdefs.h
+
+else
+  case " $LIBOBJS " in
+  *" explicit_bzero.$ac_objext "* ) ;;
+  *) LIBOBJS="$LIBOBJS explicit_bzero.$ac_objext"
+ ;;
+esac
+
+fi
+
 ac_fn_c_check_func "$LINENO" "fls" "ac_cv_func_fls"
 if test "x$ac_cv_func_fls" = xyes; then :
   $as_echo "#define HAVE_FLS 1" >>confdefs.h
diff --git a/configure.in b/configure.in
index dde3eec89f..c639a32a79 100644
--- a/configure.in
+++ b/configure.in
@@ -1596,6 +1596,7 @@ AC_CHECK_FUNCS(m4_normalize([
getpeerucred
getrlimit
mbstowcs_l
+   memset_s
memmove
poll
posix_fallocate
@@ -1694,6 +1695,7 @@ fi
 AC_REPLACE_FUNCS(m4_normalize([
crypt
dlopen
+   explicit_bzero
fls
getopt
getrusage
diff --git a/src/backend/libpq/be-secure-common.c 
b/src/backend/libpq/be-secure-common.c
index e8f27bc782..d801929ea2 100644
--- a/src/backend/libpq/be-secure-common.c
+++ b/src/backend/libpq/be-secure-common.c
@@ -87,6 +87,7 @@ run_ssl_passphrase_command(const char *prompt, bool 
is_server_start, char *buf,
{
if (ferror(fh))
{
+   explicit_bzero(buf, size);
ereport(loglevel,
(errcode_for_file_access(),
 errmsg("could not read from command 
\"%s\": %m",
@@ -98,6 +9

Re: using explicit_bzero

2019-08-13 Thread Peter Eisentraut
On 2019-07-30 07:58, Andres Freund wrote:
> I think it's better to have a pg_explicit_bzero or such, and implement
> that via the various platform dependant mechanisms.  It's considerably
> harder to understand code when one is surprised that a function normally
> not available is called, the buildsystem part is really hard to
> understand (with runtime and code filenames differing etc), and invites
> API breakages.  And it's not really more work to have our own name.

explicit_bzero() is a pretty established and quasi-standard name by now,
not too different from other things in src/port/.

>> +/*
>> + * Indirect call through a volatile pointer to hopefully avoid dead-store
>> + * optimisation eliminating the call.  (Idea taken from OpenSSH.)  We can't
>> + * assume bzero() is present either, so for simplicity we define our own.
>> + */
>> +
>> +static void
>> +bzero2(void *buf, size_t len)
>> +{
>> +memset(buf, 0, len);
>> +}
>> +
>> +static void (* volatile bzero_p)(void *, size_t) = bzero2;
> 
> Hm, I'm not really sure that this does that much. Especially when the
> call is via a function in the same translation unit.

This is the fallback implementation from OpenSSH, so it's plausible that
it does something.  It's worth verifying, of course.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Global temporary tables

2019-08-13 Thread Konstantin Knizhnik



On 13.08.2019 11:21, Craig Ringer wrote:
On Fri, 9 Aug 2019 at 22:07, Konstantin Knizhnik wrote:




Ok, here it is: global_private_temp-1.patch



Initial pass review follows.

Relation name "SESSION" is odd. I guess you're avoiding "global" 
because the data is session-scoped, not globally temporary. But I'm 
not sure "session" fits either - after all regular temp tables are 
also session temporary tables. I won't focus on naming further beyond 
asking that it be consistent though, right now there's a mix of 
"global" in some places and "session" in others.



I have supported both forms: "create session table" and "create global temp".
Both "session" and "global" are existing PostgreSQL keywords, so we do 
not need to introduce a new one (unlike "public" or "shared").
The form "global temp" is used in Oracle, so it seems natural to 
provide similar syntax.


I am not insisting on this syntax and will consider all other suggestions.
But IMHO almost any SQL keyword is overloaded and has different meanings.
Temporary tables have the session as their lifetime (or the transaction, if 
created with the "ON COMMIT DROP" clause).
So calling them "session tables" is actually clearer than just 
"temporary tables".
But "local temp tables" can also be considered session tables, so maybe it 
is really not such a good idea to use the "session" keyword for them.



Why do you need to do all this indirection with changing RelFileNode 
to RelFileNodeBackend in the bufmgr, changing BufferGetTag etc? 
Similarly, your changes of RelFileNodeBackendIsTemp 
to RelFileNodeBackendIsLocalTemp . Did you look into my suggestion of 
extending the relmapper so that global temp tables would have a 
relfilenode of 0 like pg_class etc, and use a backend-local map of 
oid-to-relfilenode mappings? I'm guessing you did it the way you did 
instead to lay the groundwork for cross-backend sharing, but if so it 
should IMO be in that patch. Maybe my understanding of the existing 
temp table mechanics is just insufficient as I 
see RelFileNodeBackendIsTemp is already used in some aspects of 
existing temp relation handling.


Sorry, are you really speaking about global_private_temp-1.patch?
This patch doesn't change the bufmgr file at all.
Maybe you looked at the other patch - global_shared_temp-1.patch - 
which accesses shared tables through shared buffers and so has to 
change the buffer tag to include the backend ID in it.




Similarly, TruncateSessionRelations probably shouldn't need to exist 
in this patch in its current form; there's no shared_buffers use to 
clean and the same file cleanup mechanism should handle both 
session-temp and local-temp relfilenodes.


In global_private_temp-1.patch, TruncateSessionRelations does nothing 
with shared buffers; it just deletes relation files.




A number of places make a change like this:
rel->rd_rel->relpersistence == RELPERSISTENCE_TEMP ||
+ rel->rd_rel->relpersistence == RELPERSISTENCE_SESSION

and I'd like to see a test macro or inline static for it since it's 
repeated so much. Mostly to make the intent clear: "is this a relation 
with temporary backend-scoped data?"


I considered calling such a macro IsSessionRelation() or something like that, 
but you do not like the notion of "session".

Is IsBackendScopedRelation() a better choice?



This test:

+ if (blkno == BTREE_METAPAGE && PageIsNew(BufferGetPage(buf)) && 
IsSessionRelationBackendId(rel->rd_backend))

+ _bt_initmetapage(BufferGetPage(buf), P_NONE, 0);

seems sensible but I'm wondering if there's a way to channel 
initialization of global-temp objects through a bit more of a common 
path, so it reads more obviously as a common test applying to all 
global-temp tables. "Global temp table not initialized in session yet? 
Initialize it." So we don't have to have every object type do an 
object type specific test for "am I actually uninitialized?" in all 
paths it might hit. I guess I expected to see something more like a


if (RelGlobalTempUninitialized(rel))

but maybe I've been doing too much Java ;)

A similar test reappears here for sequences:

+ if (rel->rd_rel->relpersistence == RELPERSISTENCE_SESSION && 
PageIsNew(page))


Why is this written differently?



Just because I wrote them at different moments in time :)
I think that adding RelGlobalTempUninitialized is a really good idea.



Sequence initialization ignores sequence startval/firstval settings. Why?


I am handling only the case of implicitly created sequences for 
SERIAL/BIGSERIAL columns.

Is it possible to explicitly specify the initial value and step for them?
If so, this place should definitely be rewritten.



- else if (newrelpersistence == RELPERSISTENCE_PERMANENT)
+ else if (newrelpersistence != RELPERSISTENCE_TEMP)

Doesn't this change the test outcome for RELPERSISTENCE_UNLOGGED?


The RELPERSISTENCE_UNLOGGED case is handled in the previous IF branch.



In PreCommit_on_commit_actions, in the the ONCOMMIT_DELETE_ROWS case, 
is there any way to still respect the XACT_FLAGS_

Re: Some memory not freed at the exit of RelationBuildPartitionDesc()

2019-08-13 Thread amul sul
On Sat, Aug 10, 2019 at 12:16 AM David Fetter  wrote:

> On Thu, Aug 08, 2019 at 05:42:21PM +0900, Amit Langote wrote:
> > On Thu, Aug 8, 2019 at 5:33 PM amul sul  wrote:
> > > On Thu, Aug 8, 2019 at 1:27 PM Amit Langote 
> wrote:
> >
> > >> Thanks for the patch.  This was discussed recently in the "hyrax vs.
> > >> RelationBuildPartitionDesc()" thread [1] and I think Alvaro proposed
> > >> an approach that's similar to yours.  Not sure why it wasn't pursued
> > >> though.  Maybe the reason is buried somewhere in that discussion.
> > >
> > > Oh, quite similar, thanks Amit for pointing that out.
> > >
> > > Look like "hyrax vs.RelationBuildPartitionDesc()" is in discussion for
> the master
> > > branch only, not sure though, but we need the similar fix for the back
> branches as well.
> >
> > Well, this is not a bug as such, so it's very unlikely that a fix like
> > this will be back-patched.  Also, if this becomes an issue only for
> > more than over 1000 partitions, then it's not very relevant for PG 10
> > and PG 11, because we don't recommend using so many partitions with
> > them.  Maybe we can consider fixing PG 12 though.
>
> A fix for the thousands-of-partitions case would be very welcome for 12.
>
>
Looks like commit d3f48dfae42 added the required fix, but it is enabled only
for clobber-cache builds :(

Regards,
Amul


Re: Converting NOT IN to anti-joins during planning

2019-08-13 Thread Antonin Houska
David Rowley  wrote:

> On Mon, 27 May 2019 at 20:43, Antonin Houska  wrote:
> > I've spent some time looking into this.
> 
> Thank you for having a look at this.
> 
> > One problem I see is that SubLink can be in the JOIN/ON clause and thus it's
> > not necessarily at the top of the join tree. Consider this example:
> >
> > CREATE TABLE a(i int);
> > CREATE TABLE b(j int);
> > CREATE TABLE c(k int NOT NULL);
> > CREATE TABLE d(l int);
> >
> >   SELECT *
> > FROM
> > a
> > JOIN b ON b.j NOT IN
> > ( SELECT
> > c.k
> > FROM
> > c)
> > JOIN d ON b.j = d.l;
> 
> hmm yeah. Since the proofs that are being used in
> expressions_are_not_nullable assume the join has already taken place,
> then we'll either need to not use the join conditions are proofs in
> that case, or just disable the optimisation instead.   I think it's
> fine to just disable the optimisation since it seem rather unlikely
> that someone would write a join condition like that.  Plus it seems
> quite a bit more complex to validate that the optimisation would even
> be correct if NULLs were not possible.
> 
> I've attached a patch which restricts the pullups to FromExpr quals.
> Anything below a JoinExpr disables the optimisation now.

ok. The planner pulls up other sublinks located in the ON clause, but it'd be
quite tricky to do the same for the NOT IN case.

Now that we only consider the WHERE clause, I wonder if the code can be
simplified a bit more. In particular, pull_up_sublinks_jointree_recurse()
passes a valid pointer for notnull_proofs to pull_up_sublinks_qual_recurse(),
while it also passes under_joinexpr=true. The latter should imply that NOT IN
won't be converted to an ANTI JOIN anyway, so no notnull_proofs should be needed.

BTW, I'm not sure if notnull_proofs=j->quals is correct in cases like this:

case JOIN_LEFT:
j->quals = pull_up_sublinks_qual_recurse(root, j->quals,
 &j->rarg,
 rightrelids,
 NULL, NULL, j->quals,
 true);

Even if j->quals evaluates to NULL or FALSE (due to NULL value on its input),
it does not remove any rows (possibly containing NULL values) from the input
of the SubLink's expression.

I'm not even sure that expressions_are_not_nullable() needs the notnull_proofs
argument. Now that we only consider SubLinks in the WHERE clause, it seems to
me that nonnullable_vars is always a subset of nonnullable_inner_vars, isn't
it?

A few more minor findings:

@@ -225,10 +227,13 @@ pull_up_sublinks(PlannerInfo *root)
  *
  * In addition to returning the possibly-modified jointree node, we return
  * a relids set of the contained rels into *relids.
+ *
+ * under_joinexpr must be passed as true if 'jtnode' is or is under a
+ * JoinExpr.
  */
 static Node *
 pull_up_sublinks_jointree_recurse(PlannerInfo *root, Node *jtnode,
- Relids 
*relids)
+ Relids 
*relids, bool under_joinexpr)
 {
if (jtnode == NULL)
{


The comment "if 'jtnode' is or is under ..." is unclear.


* is_NOTANY_compatible_with_antijoin()

  **  "'outerquery' is the parse of the query" -> "'outerquery' is the parse 
tree of the query"

  ** "2.  We assume that each join qual is an OpExpr" -> "2.  We assume that
 each sublink expression is an OpExpr" ?

  ** (OpExpr *) lfirst(lc) -> lfirst_node(OpExpr, lc)

  ** The kind of bool expression (AND_EXPR) might need to be tested here too:

+   /* Extract exprs from multiple expressions ANDed together */
+   else if (IsA(testexpr, BoolExpr))


* find_innerjoined_rels()

 "we locate all WHERE and JOIN/ON quals that constrain these rels add them to"
 ->
 " ... and add them ..."


* get_attnotnull()

  The comment says that FALSE is returned if the attribute is dropped, however
  the function does not test att_tup->attisdropped. (This patch should not
  call the function for a dropped attribute, so I'm only saying that the
  function code is not consistent with the comment.)

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




Feature: Use DNS SRV records for connecting

2019-08-13 Thread Feike Steenbergen
Hi all,

I'd like to get some feedback on whether or not implementing a DNS SRV feature
for connecting to PostgreSQL would be desirable/useful.

The main use case is to have a DNS SRV record that lists all the possible
primaries of a given replicated PostgreSQL cluster. With auto failover
solutions like patroni, pg_auto_failover, stolon, etc. any of these endpoints
could be serving the primary server at any point in time.

Combined with target_session_attrs a connection string to a highly-available
cluster could be something like:

   psql "dnssrv=mydb.prod.example.com target_session_attr=read_write"

Which would then resolve the SRV record _postgresql._tcp.mydb.prod.example.com
and using the method described in RFC 2782 connect to the host/port combination
one by one until it finds the primary.
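
To make the RFC 2782 part concrete, the ordering step (ascending priority,
weighted-random selection within a priority group) could look roughly like the
sketch below. This is only an illustration of the selection rule with made-up
names, not proposed libpq code, and it leaves out the RFC's special handling
of zero-weight records:

#include <stdlib.h>

typedef struct SrvRecord
{
    int     priority;       /* lower values are tried first */
    int     weight;         /* relative weight within one priority group */
    int     port;
    char    target[256];
} SrvRecord;

static int
srv_cmp_priority(const void *a, const void *b)
{
    return ((const SrvRecord *) a)->priority - ((const SrvRecord *) b)->priority;
}

/* Order records for connection attempts. */
static void
order_srv_records(SrvRecord *recs, int n)
{
    qsort(recs, n, sizeof(SrvRecord), srv_cmp_priority);

    for (int start = 0; start < n;)
    {
        int     end = start;
        int     total = 0;

        /* find the extent of this priority group and its total weight */
        while (end < n && recs[end].priority == recs[start].priority)
            total += recs[end++].weight;

        /* repeatedly pick one record, with probability proportional to weight */
        for (int i = start; i < end - 1; i++)
        {
            int     r = (total > 0) ? rand() % (total + 1) : 0;
            int     sum = 0;

            for (int j = i; j < end; j++)
            {
                sum += recs[j].weight;
                if (sum >= r || j == end - 1)
                {
                    SrvRecord tmp = recs[i];

                    recs[i] = recs[j];
                    recs[j] = tmp;
                    break;
                }
            }
            total -= recs[i].weight;
        }
        start = end;
    }
}

The connection logic would then walk the ordered list and try each
target/port pair until one satisfies target_session_attrs.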

A benefit of using SRV records would be that the port is also part of the DNS
record and therefore a single IP could be used to serve many databases on
separate ports. When working with a cloud environment or containerized setup
(or both) this would open up some good possibilities.

Note: We can currently already do this, to some extent, by specifying multiple
hosts/ports in the connection string; however, it would be useful if we could
refer to a single SRV record instead, as that would hold the list of hosts
and ports to connect to.

DNS SRV is described in detail here:
https://tools.ietf.org/html/rfc2782

I'd love to hear some support/dissent,

regards,

Feike




Re: Bison state table

2019-08-13 Thread John Naylor
On Sat, Jan 26, 2019 at 5:39 AM Bruce Momjian  wrote:
>
> With our scanner keywords list now more cache-aware, and with us
> planning to use Bison for years to come, has anyone ever looked at
> reordering the bison state machine array to be more cache aware, e.g.,
> having common states next to each other rather than scattered around the
> array?

I recently spent some time investigating how the various arrays are
generated and accessed, and found this informative article:

https://www.cs.uic.edu/~spopuri/cparser.html

It's dated from 2006, but still seems to be correct on the whole,
judging by the gram.c output file. Anyway, the short answer is,
grouping common states would require changing Bison's algorithm for
compressing a sparse 2-D array into multiple less-sparse 1-D arrays,
if it's even possible to control the effect you have in mind.

That said, I had an idea. When Bison consults yytable, it has to also
consult yycheck with the same index to make sure the result is valid.
If the two arrays of int16 were merged into a single array of structs,
the state and the check would be on the same cache line. I tried this
by directly patching the gram.c output file (see attached for the
basic idea) and #include'-ing a separate file with the merged array.
It didn't improve raw parsing microbenchmarks using information schema
or short pgbench-like queries. So I'm guessing either a). the
microbenchmark already has better cache behavior than real queries so
won't show much difference here, and/or b). the bottleneck is
elsewhere than accessing yytable and yycheck.
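
As a sketch of the idea (hypothetical names and layout, not the attached
patch or actual Bison output), the merged array would look something like
this:

#include <stdint.h>

typedef int16_t yytype_int16;

/*
 * Merged layout: the value normally held in yytable[] and the guard normally
 * held in yycheck[] sit side by side, so one lookup touches a single cache
 * line instead of two separate arrays.
 */
typedef struct yypacked_entry
{
    yytype_int16 value;     /* what yytable[idx] would hold */
    yytype_int16 check;     /* what yycheck[idx] would hold */
} yypacked_entry;

static inline int
yylookup(const yypacked_entry *tab, int idx, int key, int fallback)
{
    return (tab[idx].check == key) ? tab[idx].value : fallback;
}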

In any case, I hope to follow Bison development and test any
performance improvements the maintainers come up with, as that was
reported to be on the roadmap:

https://www.postgresql.org/message-id/74dd0f55-f3cf-447e-acf2-88c01e42a...@lrde.epita.fr

-- 
John Naylor  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


gram-combine.patch
Description: Binary data


Re: Feature: Use DNS SRV records for connecting

2019-08-13 Thread Graham Leggett
On 13 Aug 2019, at 11:50, Feike Steenbergen  wrote:

> I'd like to get some feedback on whether or not implementing a DNS SRV feature
> for connecting to PostgreSQL would be desirable/useful.

A big +1.

We currently use SRV records to tell postgresql what kind of server it is. This 
way all of our postgresql servers have an identical configuration; they just 
tailor themselves on startup as appropriate:

_postgresql-master._tcp.sql.example.com.

The above record in our case declares who the master is. If the postgresql 
startup says “hey, that’s me” it configures itself as a master. If the 
postgresql startup says “hey, that’s not me” it configures itself as a slave of 
the master.

We also use TXT records to define the databases we want (with protection 
against DNS security issues, we never remove a database based on a TXT record, 
but signed DNS records will help here).

_postgresql.sql.example.com TXT "v=PGSQL1;d=mydb;u=myuser"

We use a series of systemd “daemons” that are configured to run before and 
after postgresql to do the actual configuration on bootup, but it would be 
great if postgresql could just do this out the box.

Regards,
Graham
—





Re: POC: Cleaning up orphaned files using undo logs

2019-08-13 Thread Amit Kapila
On Fri, Aug 9, 2019 at 1:57 AM Robert Haas  wrote:
>
> On Thu, Aug 8, 2019 at 9:31 AM Andres Freund  wrote:
> > I know that Robert is working on a patch that revises the undo request
> > layer somewhat, it's possible that this is best discussed afterwards.
>
> Here's what I have at the moment.  This is not by any means a complete
> replacement for Amit's undo worker machinery, but it is a significant
> redesign (and I believe a significant improvement to) the queue
> management stuff from Amit's patch.  I wrote this pretty quickly, so
> while it passes simple testing, it probably has a number of bugs, and
> to actually use it, it would need to be integrated with xact.c; right
> now it's just a standalone module that doesn't do anything except let
> itself be tested.
>
> Some of the ways it is different from Amit's patches include:
>
> * It uses RBTree rather than binaryheap, so when we look ahead, we
> look ahead in the right order.
>
> * There's no limit to the lookahead distance; when looking ahead, it
> will search the entirety of all 3 RBTrees for an entry from the right
> database.
>
> * It doesn't have a separate hash table keyed by XID.  I didn't find
> that necessary.
>
> * It's better-isolated, as you can see from the fact that I've
> included a test module that tests this code without actually ever
> putting an UndoRequestManager in shared memory.  I would've liked to
> expand this test module, but I don't have time to do that today and
> felt it better to get this much sent out.
>
> * It has a lot of comments explaining the design and how it's intended
> to integrate with the rest of the system.
>
> Broadly, my vision for how this would get used is:
>
> - Create an UndoRecordManager in shared memory.
> - Before a transaction first attaches to a permanent or unlogged undo
> log, xact.c would call RegisterUndoRequest(); thereafter, xact.c would
> store a pointer to the UndoRecord for the lifetime of the toplevel
> transaction.

So, for top-level transaction rollback, we can get the start and end
locations directly from the UndoRequest *.  But what should we do
for sub-transactions (rollback to savepoint)?  One related point is
that we also need information about last_log_start_undo_location to
update the undo apply progress (the basic idea is that if the transaction's
undo is spanned across multiple logs, we update the progress in each
of the logs).  We can remember that in the transaction state or the
UndoRequest *.  Any suggestions?

> - Immediately after attaching to a permanent or unlogged undo log,
> xact.c would call UndoRequestSetLocation.
> - xact.c would track the number of bytes of permanent and unlogged
> undo records the transaction generates.  If the transaction goes onto
> abort, it reports these by calling FinalizeUndoRequest.
> - If the transaction commits, it doesn't need that information, but
> does need to call UnregisterUndoRequest() as a post-commit step in
> CommitTransaction().
>

IIUC, for each transaction, we have to take a lock the first time it
attaches to a log and then the same lock at commit time.  It seems the
work done under the lock is small, but still, can't this cause contention?
It seems to me this is similar to what we saw with ProcArrayLock, where the
work under the lock was a few instructions, but acquiring and releasing the
lock by each backend at commit time was causing a bottleneck.

It might be that for some reason this won't matter in a similar way, in
which case we can find that out after integrating it with the other patches
from the undo processing machinery and rebasing the zheap branch over it.

How will the computation of oldestXidHavingUnappliedUndo work?

We can probably check the fxid queue and error queue to get that
value.  However, I am not sure if that is sufficient, because in case we
perform the request in the foreground, it won't be present in the queues.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-08-13 Thread Amit Kapila
On Thu, Aug 8, 2019 at 7:01 PM Andres Freund  wrote:
> On 2019-08-07 14:50:17 +0530, Amit Kapila wrote:
> > On Tue, Aug 6, 2019 at 1:26 PM Andres Freund  wrote:
> > > On 2019-08-05 11:29:34 -0700, Andres Freund wrote:
>
> > > >  typedef struct TwoPhaseFileHeader
> > > >  {
> > > > @@ -927,6 +928,16 @@ typedef struct TwoPhaseFileHeader
> > > >   uint16  gidlen; /* length of the GID - 
> > > > GID follows the header */
> > > >   XLogRecPtr  origin_lsn; /* lsn of this record at 
> > > > origin node */
> > > >   TimestampTz origin_timestamp;   /* time of prepare at origin node 
> > > > */
> > > > +
> > > > + /*
> > > > +  * We need the locations of the start and end undo record 
> > > > pointers when
> > > > +  * rollbacks are to be performed for prepared transactions using 
> > > > undo-based
> > > > +  * relations.  We need to store this information in the file as 
> > > > the user
> > > > +  * might rollback the prepared transaction after recovery and for 
> > > > that we
> > > > +  * need its start and end undo locations.
> > > > +  */
> > > > + UndoRecPtr  start_urec_ptr[UndoLogCategories];
> > > > + UndoRecPtr  end_urec_ptr[UndoLogCategories];
> > > >  } TwoPhaseFileHeader;
> > >
> > > Why do we not need that knowledge for undo processing of a non-prepared
> > > transaction?
>
> > The non-prepared transaction also needs to be aware of that.  It is
> > stored in TransactionStateData.  I am not sure if I understand your
> > question here.
>
> My concern is that I think it's fairly ugly to store data like this in
> the 2pc state file. And it's not an insubstantial amount of additional
> data either, compared to the current size, even when no undo is in
> use. There's a difference between an unused feature increasing backend
> local memory and increasing the size of WAL logged data. Obviously it's
> not by a huge amount, but still.  It also just feels wrong to me.
>
> We don't need the UndoRecPtr's when recovering from a crash/restart to
> process undo. Now we obviously don't want to unnecessarily search for
> data that is expensive to gather, which is a good reason for keeping
> track of this data. But I do wonder if this is the right approach.
>
> I know that Robert is working on a patch that revises the undo request
> layer somewhat, it's possible that this is best discussed afterwards.
>

Okay, we have started working on integrating with Robert's patch.  I
think not only this but many of the other things will also change.
So, I will respond to other comments after integrating with Robert's
patch.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-08-13 Thread Dilip Kumar
On Mon, Aug 5, 2019 at 11:59 PM Andres Freund  wrote:
>
> Hi,
>
> (as I was out of context due to dealing with bugs, I've switched to
> lookign at the current zheap/undoprocessing branch.
>
> On 2019-07-30 01:02:20 -0700, Andres Freund wrote:
> > +/*
> > + * Insert a previously-prepared undo records.
> > + *
> > + * This function will write the actual undo record into the buffers which 
> > are
> > + * already pinned and locked in PreparedUndoInsert, and mark them dirty.  
> > This
> > + * step should be performed inside a critical section.
> > + */
>
> Again, I think it's not ok to just assume you can lock an essentially
> unbounded number of buffers. This seems almost guaranteed to result in
> deadlocks. And there's limits on how many lwlocks one can hold etc.

I think for controlling that we need to put a limit on max prepared
undo?  I am not sure of any other way of limiting the number of buffers,
because we must lock all the buffers in which we are going to insert
the undo records under one WAL-logged operation.

>
> As far as I can tell there's simply no deadlock avoidance scheme in use
> here *at all*? I must be missing something.

We are always locking buffers in block order, so I am not sure how it
can deadlock.  Am I missing something?

>
>
> > + /* Main loop for writing the undo record. */
> > + do
> > + {
>
> I'd prefer this to not be a do{} while(true) loop - as written I need to
> read to the end to see what the condition is. I don't think we have any
> loops like that in the code.
Right, changed
>
>
> > + /*
> > +  * During recovery, there might be some blocks which 
> > are already
> > +  * deleted due to some discard command so we can just 
> > skip
> > +  * inserting into those blocks.
> > +  */
> > + if (!BufferIsValid(buffer))
> > + {
> > + Assert(InRecovery);
> > +
> > + /*
> > +  * Skip actual writing just update the 
> > context so that we have
> > +  * write offset for inserting into next 
> > blocks.
> > +  */
> > + SkipInsertingUndoData(&ucontext, BLCKSZ - 
> > starting_byte);
> > + if (ucontext.stage == UNDO_PACK_STAGE_DONE)
> > + break;
> > + }
>
> How exactly can this happen?

Suppose you insert one record for a transaction which is split across
block 1 and block 2.  Now, before these blocks actually go to disk,
the transaction commits and becomes all-visible, and the undo logs are
discarded.  It's possible that block 1 is completely discarded but
block 2 is not, because it might have undo for the next transaction.
Now, during recovery (with FPW off), if block 1 is missing but block 2 is
there, we need to skip inserting undo for block 1 as it does not exist.

>
>
> > + else
> > + {
> > + page = BufferGetPage(buffer);
> > +
> > + /*
> > +  * Initialize the page whenever we try to 
> > write the first
> > +  * record in page.  We start writing 
> > immediately after the
> > +  * block header.
> > +  */
> > + if (starting_byte == UndoLogBlockHeaderSize)
> > + UndoPageInit(page, BLCKSZ, 
> > prepared_undo->urec->uur_info,
> > +  
> > ucontext.already_processed,
> > +  
> > prepared_undo->urec->uur_tuple.len,
> > +  
> > prepared_undo->urec->uur_payload.len);
> > +
> > + /*
> > +  * Try to insert the record into the current 
> > page. If it
> > +  * doesn't succeed then recall the routine 
> > with the next page.
> > +  */
> > + InsertUndoData(&ucontext, page, 
> > starting_byte);
> > + if (ucontext.stage == UNDO_PACK_STAGE_DONE)
> > + {
> > + MarkBufferDirty(buffer);
> > + break;
>
> At this point we're five indentation levels deep. I'd extract at least
> either the the per prepared undo code or the code performing the writing
> across block boundaries into a separate function. Perhaps both.

I have moved it to a separate function.
>
>
>
> > +/*
> > + * Helper function for UndoGetOneRecord
> > + *
> > + * If any of  rmid/reloid/xid/cid is not available in the undo record, 

Re: POC: Cleaning up orphaned files using undo logs

2019-08-13 Thread Dilip Kumar
On Tue, Aug 6, 2019 at 1:26 PM Andres Freund  wrote:
>
> Hi,
>
> On 2019-08-05 11:29:34 -0700, Andres Freund wrote:
> > Need to do something else for a bit. More later.
>
>
> > + /*
> > +  * Compute the header size of the undo record.
> > +  */
> > +Size
> > +UndoRecordHeaderSize(uint16 uur_info)
> > +{
> > + Sizesize;
> > +
> > + /* Add fixed header size. */
> > + size = SizeOfUndoRecordHeader;
> > +
> > + /* Add size of transaction header if it presets. */
> > + if ((uur_info & UREC_INFO_TRANSACTION) != 0)
> > + size += SizeOfUndoRecordTransaction;
> > +
> > + /* Add size of rmid if it presets. */
> > + if ((uur_info & UREC_INFO_RMID) != 0)
> > + size += sizeof(RmgrId);
> > +
> > + /* Add size of reloid if it presets. */
> > + if ((uur_info & UREC_INFO_RELOID) != 0)
> > + size += sizeof(Oid);
> > +
> > + /* Add size of fxid if it presets. */
> > + if ((uur_info & UREC_INFO_XID) != 0)
> > + size += sizeof(FullTransactionId);
> > +
> > + /* Add size of cid if it presets. */
> > + if ((uur_info & UREC_INFO_CID) != 0)
> > + size += sizeof(CommandId);
> > +
> > + /* Add size of forknum if it presets. */
> > + if ((uur_info & UREC_INFO_FORK) != 0)
> > + size += sizeof(ForkNumber);
> > +
> > + /* Add size of prevundo if it presets. */
> > + if ((uur_info & UREC_INFO_PREVUNDO) != 0)
> > + size += sizeof(UndoRecPtr);
> > +
> > + /* Add size of the block header if it presets. */
> > + if ((uur_info & UREC_INFO_BLOCK) != 0)
> > + size += SizeOfUndoRecordBlock;
> > +
> > + /* Add size of the log switch header if it presets. */
> > + if ((uur_info & UREC_INFO_LOGSWITCH) != 0)
> > + size += SizeOfUndoRecordLogSwitch;
> > +
> > + /* Add size of the payload header if it presets. */
> > + if ((uur_info & UREC_INFO_PAYLOAD) != 0)
> > + size += SizeOfUndoRecordPayload;
>
> There's numerous blocks with one if for each type, and the body copied
> basically the same for each alternative. That doesn't seem like a
> reasonable approach to me. Means that many places need to be adjusted
> when we invariably add another type, and seems likely to lead to bugs
> over time.
I think I have expressed my thought on this in another email
[https://www.postgresql.org/message-id/CAFiTN-vDrXuL6tHK1f_V9PAXp2%2BEFRpPtxCG_DRx08PZXAPkyw%40mail.gmail.com]
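
For illustration only, one table-driven shape for collapsing those repeated
if-blocks could be something like the sketch below (it reuses the UREC_INFO_*
flags and size constants from the patch; whether this is the right direction
is part of that other discussion):

/* Map each optional header flag to the size it adds. */
typedef struct UndoHeaderPart
{
    uint16      info_flag;      /* UREC_INFO_* bit */
    Size        part_size;      /* bytes added when the flag is set */
} UndoHeaderPart;

static const UndoHeaderPart undo_header_parts[] =
{
    {UREC_INFO_TRANSACTION, SizeOfUndoRecordTransaction},
    {UREC_INFO_RMID, sizeof(RmgrId)},
    {UREC_INFO_RELOID, sizeof(Oid)},
    {UREC_INFO_XID, sizeof(FullTransactionId)},
    {UREC_INFO_CID, sizeof(CommandId)},
    {UREC_INFO_FORK, sizeof(ForkNumber)},
    {UREC_INFO_PREVUNDO, sizeof(UndoRecPtr)},
    {UREC_INFO_BLOCK, SizeOfUndoRecordBlock},
    {UREC_INFO_LOGSWITCH, SizeOfUndoRecordLogSwitch},
    {UREC_INFO_PAYLOAD, SizeOfUndoRecordPayload},
};

static Size
UndoRecordHeaderSizeTableDriven(uint16 uur_info)
{
    Size        size = SizeOfUndoRecordHeader;

    for (int i = 0; i < lengthof(undo_header_parts); i++)
    {
        if (uur_info & undo_header_parts[i].info_flag)
            size += undo_header_parts[i].part_size;
    }
    return size;
}
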
>
> > + /* Add size of the payload header if it presets. */
>
> FWIW, repeating the same comment, with or without minor differences, 10
> times is a bad idea. Especially when the comment doesn't add *any* sort
> of information.
Ok, fixed
>
> Also, "if it presets" presumably is a typo?
Fixed
>
>
> > +/*
> > + * Compute and return the expected size of an undo record.
> > + */
> > +Size
> > +UndoRecordExpectedSize(UnpackedUndoRecord *uur)
> > +{
> > + Sizesize;
> > +
> > + /* Header size. */
> > + size = UndoRecordHeaderSize(uur->uur_info);
> > +
> > + /* Payload data size. */
> > + if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0)
> > + {
> > + size += uur->uur_payload.len;
> > + size += uur->uur_tuple.len;
> > + }
> > +
> > + /* Add undo record length size. */
> > + size += sizeof(uint16);
> > +
> > + return size;
> > +}
> > +
> > +/*
> > + * Calculate the size of the undo record stored on the page.
> > + */
> > +static inline Size
> > +UndoRecordSizeOnPage(char *page_ptr)
> > +{
> > + uint16  uur_info = ((UndoRecordHeader *) page_ptr)->urec_info;
> > + Sizesize;
> > +
> > + /* Header size. */
> > + size = UndoRecordHeaderSize(uur_info);
> > +
> > + /* Payload data size. */
> > + if ((uur_info & UREC_INFO_PAYLOAD) != 0)
> > + {
> > + UndoRecordPayload *payload = (UndoRecordPayload *) (page_ptr 
> > + size);
> > +
> > + size += payload->urec_payload_len;
> > + size += payload->urec_tuple_len;
> > + }
> > +
> > + return size;
> > +}
> > +
> > +/*
> > + * Compute size of the Unpacked undo record in memory
> > + */
> > +Size
> > +UnpackedUndoRecordSize(UnpackedUndoRecord *uur)
> > +{
> > + Sizesize;
> > +
> > + size = sizeof(UnpackedUndoRecord);
> > +
> > + /* Add payload size if record contains payload data. */
> > + if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0)
> > + {
> > + size += uur->uur_payload.len;
> > + size += uur->uur_tuple.len;
> > + }
> > +
> > + return size;
> > +}
>
> These functions are all basically the same. We shouldn't copy code over
> and over like this.
UnpackedUndoRecordSize computes the size of the unpacked undo
record, so it's different from the above two; just the payload part is
common, so I moved the payload size computation into a common function.

UndoRecordExpectedSize and UndoRecordSizeOnPage are two different
functions except for the header size com

Re: POC: Cleaning up orphaned files using undo logs

2019-08-13 Thread Robert Haas
On Tue, Aug 13, 2019 at 6:28 AM Amit Kapila  wrote:
> So, for top-level transactions rollback, we can directly refer from
> UndoRequest *, the start and end locations.  But, what should we do
> for sub-transactions (rollback to savepoint)?  One related point is
> that we also need information about last_log_start_undo_location to
> update the undo apply progress (The basic idea is if the transactions
> undo is spanned across multiple logs, we update the progress in each
> of the logs.).  We can remember that in the transaction state or
> undorequest *.  Any suggestion?

The UndoRequest is only for top-level rollback.  Any state that you
need in order to do subtransaction rollback needs to be maintained
someplace else, probably in the transaction state, or some
subsidiary data structure.  The point here is that the UndoRequest is
going to be stored in shared memory, but there is no reason ever to
store the information about a subtransaction in shared memory, because
that undo always has to be completed by the backend that is
responsible for that transaction. Those things should not get mixed
together.

> IIUC, for each transaction, we have to take a lock first time it
> attaches to a log and then the same lock at commit time.  It seems the
> work under lock is less, but still, can't this cause a contention?  It
> seems to me this is similar to what we saw in ProcArrayLock where work
> under lock was few instructions, but acquiring and releasing the lock
> by each backend at commit time was causing a bottleneck.

LWLocks are pretty fast these days and the critical section is pretty
short, so I think there's a chance it'll be just fine, but maybe it'll
cause enough cache line bouncing to be problematic. If so, I think
there are several possible ways to redesign the locking to improve
things, but it made sense to me to try the simple approach first.

> How will computation of oldestXidHavingUnappliedUndo will work?
>
> We can probably check the fxid queue and error queue to get that
> value.  However, I am not sure if that is sufficient because incase we
> perform the request in the foreground, it won't be present in queues.

Oh, I forgot about that requirement.  I think I can fix it so it does
that fairly easily, but it will require a little bit of redesign which
I won't have time to do this week.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Do not check unlogged indexes on standby

2019-08-13 Thread Andrey Borodin



> On 13 Aug 2019, at 3:23, Peter Geoghegan wrote:
> 
> I pushed your patch to all branches that have amcheck just now, so now
> we skip over unlogged relations when in recovery, though I made some
> revisions.
Oh, cool, thanks!

> Your patch didn't handle temp tables/indexes that were created in the
> first session correctly -- we must be careful about the distinction
> between unlogged tables, and tables that don't require WAL logging
> (the later includes temp tables). Also, I thought that it was a good
> idea to actively test for the presence of a main fork when we don't
> skip (i.e. when the system isn't in recovery and the B-Tree indexes
> isn't unlogged) -- we now give a clean report of corruption when that
> happens, rather than letting an ambiguous "can't happen" error get
> raised by low-level code. This might be possible with system catalog
> corruption, for example. Finally, I thought that the WARNING was a bit
> strong -- a NOTICE is more appropriate.
+1

> On 13 Aug 2019, at 3:36, Peter Geoghegan wrote:
> 
> On Mon, Aug 12, 2019 at 2:58 AM Andrey Borodin  wrote:
>> BTW I really want to enable rightlink-leftlink invariant validation on 
>> standby..
> 
> That seems very hard. My hope was that bt_check_index() can detect the
> same problem a different way. The bt_right_page_check_scankey()
> cross-page check (which needs nothing more than an AccessShareLock)
> will often detect such problems, because the page image itself will be
> totally wrong in some way.
> 
> I'm guessing that you have direct experience with that *not* being
> good enough, though. Can you share further details? I suppose that
> bt_right_page_check_scankey() helps with transposed pages, but doesn't
> help so much when you have WAL-level inconsistencies.

We have a bunch of internal testing HA clusters that suffered from corruption 
conditions.
We fixed everything that can be detected with parent-check on primaries or 
usual check on standbys.
(page updates were lost both on primary and during WAL replay)
But from time to time when clusters switch primary from one availability zone 
to another we observe
"right sibling's left-link doesn't match: block 32709 links to 37022 instead of 
expected 40953 in index"

We are going to search for these clusters with this [0], tolerating a possible 
fraction of false positives; we have them anyway.
But I think I could put some effort into making the corruption-detection tooling 
better.
I think if we observe a link discrepancy, we can acquire locks on the left and 
right pages and recheck.


Best regards, Andrey Borodin.

[0] 
https://github.com/x4m/amcheck/commit/894d8bafb3c9a26bbc168ea5f4f33bcd1fc9f495



Configuring bgw_restart_time

2019-08-13 Thread Jeremy Finzel
I am working on an extension that uses background workers, and interested
in adding code for it to auto-restart after a crash or server restart
(similar to as-coded in worker_spi).

But I'm also interested in being able to configure bgw_restart_time using a
GUC without having to restart the server, using only SIGHUP.  For example,
I want to normally have the worker restart after 10 seconds.  But if I am
doing maintenance on the server (without a db restart), I perhaps want to
change this to -1 (BGW_NEVER_RESTART), kill the worker, do my business,
then restart the worker.  Or another reason would be my background worker
has some bug and I want to disable it without having to restart my db
server.  For us as for many, a small outage for a db restart is expensive.

I have played around with this and done some digging around the codebase
in bgworker.c (with my limited knowledge thus far of the pg codebase), and
so far as I can tell, it isn't possible to change bgw_restart_time without
a server restart.  But I'm not sure if that's just because I don't know how
this code works, or if the current libraries actually don't support
modifying this part of the background worker.  I am setting the GUC in
_PG_init, but I can see that changing it after it has been registered has
no effect unless I restart the server.
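For illustration, here is a simplified sketch of the registration pattern in
question (the extension and GUC names are invented); note that bgw_restart_time
is copied into the BackgroundWorker struct once, at registration time, which
presumably is why a later SIGHUP reload never reaches the postmaster's copy:

#include "postgres.h"
#include "fmgr.h"
#include "postmaster/bgworker.h"
#include "utils/guc.h"

PG_MODULE_MAGIC;

static int	my_ext_restart_time = 10;	/* seconds, -1 = never restart */

void		_PG_init(void);

void
_PG_init(void)
{
	BackgroundWorker worker;

	DefineCustomIntVariable("my_ext.restart_time",
							"Delay before the worker is restarted after a crash (-1 disables).",
							NULL,
							&my_ext_restart_time,
							10, -1, 3600,
							PGC_SIGHUP,		/* reloadable, but see the note above */
							GUC_UNIT_S,
							NULL, NULL, NULL);

	memset(&worker, 0, sizeof(worker));
	worker.bgw_flags = BGWORKER_SHMEM_ACCESS;
	worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
	/* value captured here; later GUC changes do not reach the postmaster */
	worker.bgw_restart_time = (my_ext_restart_time < 0) ?
		BGW_NEVER_RESTART : my_ext_restart_time;
	snprintf(worker.bgw_library_name, BGW_MAXLEN, "my_ext");
	snprintf(worker.bgw_function_name, BGW_MAXLEN, "my_ext_main");
	snprintf(worker.bgw_name, BGW_MAXLEN, "my_ext worker");
	RegisterBackgroundWorker(&worker);
}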

If indeed this is possible, I'd be very grateful for some insight on how to
do it.  I may even try to add such an example to worker_spi.

Thanks!
Jeremy


Re: errbacktrace

2019-08-13 Thread Alvaro Herrera
On 2019-Aug-13, Peter Eisentraut wrote:

> I have changed the configuration setting to backtrace_functions plural,
> so that you can debug more than one location at once.  I had originally
> wanted to do that but using existing functions like
> SplitIdentifierString() resulted in lots of complications with error
> handling (inside error handling!).  So here I just hand-coded the list
> splitting.  Seems simple enough.

Hmm ... but is that the natural way to write this?  I would have thought
you'd split the list at config-read time (the assign hook for the GUC)
and turn it into a List of simple strings.  Then you don't have to
loop strtok() on each errfinish().

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Global temporary tables

2019-08-13 Thread Konstantin Knizhnik



On 13.08.2019 11:27, Craig Ringer wrote:



On Tue, 13 Aug 2019 at 16:19, Konstantin Knizhnik wrote:




On 13.08.2019 8:34, Craig Ringer wrote:

On Tue, 13 Aug 2019 at 00:47, Pavel Stehule wrote:

But Postgres is not storing this information now anywhere
else except statistic, isn't it?


not only - critical numbers are reltuples, relpages from pg_class


That's a very good point. relallvisible too. How's the global
temp table impl handling that right now, since you won't be
changing the pg_class row? AFAICS relpages doesn't need to be up
to date (and reltuples certainly doesn't) so presumably you're
just leaving them as zero?

As far as I understand relpages and reltuples are set only when
you perform "analyze" of the table.


Also autovacuum's autoanalyze.


When does that happen?
I have created a normal table, populated it with some data, and then waited 
several hours, but pg_class was not updated for this table.



I attach to this mail slightly refactored versions of these patches, with 
fixes for the issues reported in your review.


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index 1bd579f..2d93f6f 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -153,9 +153,9 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
 			buf_state = LockBufHdr(bufHdr);
 
 			fctx->record[i].bufferid = BufferDescriptorGetBuffer(bufHdr);
-			fctx->record[i].relfilenode = bufHdr->tag.rnode.relNode;
-			fctx->record[i].reltablespace = bufHdr->tag.rnode.spcNode;
-			fctx->record[i].reldatabase = bufHdr->tag.rnode.dbNode;
+			fctx->record[i].relfilenode = bufHdr->tag.rnode.node.relNode;
+			fctx->record[i].reltablespace = bufHdr->tag.rnode.node.spcNode;
+			fctx->record[i].reldatabase = bufHdr->tag.rnode.node.dbNode;
 			fctx->record[i].forknum = bufHdr->tag.forkNum;
 			fctx->record[i].blocknum = bufHdr->tag.blockNum;
 			fctx->record[i].usagecount = BUF_STATE_GET_USAGECOUNT(buf_state);
diff --git a/contrib/pg_prewarm/autoprewarm.c b/contrib/pg_prewarm/autoprewarm.c
index 38ae240..8a04954 100644
--- a/contrib/pg_prewarm/autoprewarm.c
+++ b/contrib/pg_prewarm/autoprewarm.c
@@ -608,9 +608,9 @@ apw_dump_now(bool is_bgworker, bool dump_unlogged)
 		if (buf_state & BM_TAG_VALID &&
 			((buf_state & BM_PERMANENT) || dump_unlogged))
 		{
-			block_info_array[num_blocks].database = bufHdr->tag.rnode.dbNode;
-			block_info_array[num_blocks].tablespace = bufHdr->tag.rnode.spcNode;
-			block_info_array[num_blocks].filenode = bufHdr->tag.rnode.relNode;
+			block_info_array[num_blocks].database = bufHdr->tag.rnode.node.dbNode;
+			block_info_array[num_blocks].tablespace = bufHdr->tag.rnode.node.spcNode;
+			block_info_array[num_blocks].filenode = bufHdr->tag.rnode.node.relNode;
 			block_info_array[num_blocks].forknum = bufHdr->tag.forkNum;
 			block_info_array[num_blocks].blocknum = bufHdr->tag.blockNum;
 			++num_blocks;
diff --git a/src/backend/access/gin/ginxlog.c b/src/backend/access/gin/ginxlog.c
index c945b28..14d4e48 100644
--- a/src/backend/access/gin/ginxlog.c
+++ b/src/backend/access/gin/ginxlog.c
@@ -95,13 +95,13 @@ ginRedoInsertEntry(Buffer buffer, bool isLeaf, BlockNumber rightblkno, void *rda
 
 	if (PageAddItem(page, (Item) itup, IndexTupleSize(itup), offset, false, false) == InvalidOffsetNumber)
 	{
-		RelFileNode node;
+		RelFileNodeBackend rnode;
 		ForkNumber	forknum;
 		BlockNumber blknum;
 
-		BufferGetTag(buffer, &node, &forknum, &blknum);
+		BufferGetTag(buffer, &rnode, &forknum, &blknum);
 		elog(ERROR, "failed to add item to index page in %u/%u/%u",
-			 node.spcNode, node.dbNode, node.relNode);
+			 rnode.node.spcNode, rnode.node.dbNode, rnode.node.relNode);
 	}
 }
 
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index 9726020..c99701d 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -1028,7 +1028,7 @@ gistGetFakeLSN(Relation rel)
 {
 	static XLogRecPtr counter = FirstNormalUnloggedLSN;
 
-	if (rel->rd_rel->relpersistence == RELPERSISTENCE_TEMP)
+	if (RelationHasSessionScope(rel))
 	{
 		/*
 		 * Temporary relations are only accessible in our session, so a simple
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 09bc6fe..c60effd 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -671,6 +671,7 @@ heapam_relation_copy_data(Relation rel, const RelFileNode *newrnode)
 			 * init fork of an unlogged relation.
 			 */
 			if (rel->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT ||
+rel->rd_rel->relpersistence == RELPERSISTENCE_SESSION ||
 (rel->rd_rel->relpersistence == RELPERSISTENCE

Re: progress report for ANALYZE

2019-08-13 Thread Alvaro Herrera
Hello,

On 2019-Jul-03, Tatsuro Yamada wrote:

> My ex-colleague Vinayak created same patch in 2017 [1], and he
> couldn't get commit because there are some reasons such as the
> patch couldn't handle analyzing Foreign table. Therefore, I wonder
> whether your patch is able to do that or not.

> [1] ANALYZE command progress checker
> https://www.postgresql.org/message-id/flat/968b4eda-2417-8b7b-d468-71643cf088b6%40openscg.com#574488592fcc9708c38fa44b0dae9006

So just now I went to check the old thread (which I should have
searched for before writing my own implementation!).  It seems clear
that many things are pretty similar in both patches, but I think for the
most part they are similar just because the underlying infrastructure
imposes a certain design already, rather than there being any actual
copying.  (To be perfectly clear: I didn't even know that this patch
existed and I didn't grab any code from there to produce my v1.)

However, you've now modified the patch from what I submitted and I'm
wondering if you've taken any inspiration from Vinayak's old patch.  If
so, it seems fair to credit him as co-author in the commit message.  It
would be good to get his input on the current patch, though.

I have not looked at the current version of the patch yet, but I intend
to do so during the upcoming commitfest.

Thanks for moving this forward!


On the subject of FDW support: I did look into supporting that before
submitting this.  I think it's not academically difficult: just have the
FDW's acquire_sample_rows callback invoke the update_param functions
once in a while.  Sadly, in practical terms it looks like postgres_fdw
is quite stupid about ANALYZE (it scans the whole table??) so doing
something that's actually useful may not be so easy.  At least, we know
the total relation size and maybe we can add the ctid column to the
cursor in postgresAcquireSampleRowsFunc so that we have a current block
number to report (being careful about synchronized seqscans).  I think
this should *not* be part of the main ANALYZE-progress commit, though,
because getting that properly sorted out is going to take some more
time.
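For illustration only, the callback side could look something like the sketch
below; the PROGRESS_ANALYZE_SAMPLE_ROWS_FETCHED parameter number is assumed to
come from the ANALYZE-progress patch, and fetch_next_remote_row() is a stand-in
for whatever the FDW actually does:

static int
my_fdw_acquire_sample_rows(Relation onerel, int elevel,
						   HeapTuple *rows, int targrows,
						   double *totalrows, double *totaldeadrows)
{
	int			numrows = 0;
	int64		fetched = 0;

	while (numrows < targrows && fetch_next_remote_row(&rows[numrows]))
	{
		numrows++;
		fetched++;

		/* report progress every so often while sampling the remote table */
		if (fetched % 1000 == 0)
			pgstat_progress_update_param(PROGRESS_ANALYZE_SAMPLE_ROWS_FETCHED,
										 fetched);
	}

	*totalrows = (double) fetched;
	*totaldeadrows = 0;
	return numrows;
}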

I do wonder why postgres_fdw doesn't use TABLESAMPLE.

I did not look at other FDWs at all, mind.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Configuring bgw_restart_time

2019-08-13 Thread Craig Ringer
On Tue, 13 Aug 2019 at 20:37, Jeremy Finzel  wrote:

> I am working on an extension that uses background workers, and interested
> in adding code for it to auto-restart after a crash or server restart
> (similar to as-coded in worker_spi).
>

What pglogical does for this is use dynamic background workers with restart
turned off. It does its own worker exit and restart handling from a manager
worker that's an always-running static bgworker.

It's not ideal as it involves a considerable amount of extra work, but with
BDR we rapidly found that letting postgres itself restart bgworkers was
much too inflexible and hard to control. Especially given the issues around
soft-crash restarts and worker registration (see thread
https://www.postgresql.org/message-id/flat/534E6569.1080506%402ndquadrant.com)
.

But I'm also interested in being able to configure bgw_restart_time using a
> GUC without having to restart the server, using only SIGHUP.  For example,
> I want to normally have the worker restart after 10 seconds.  But if I am
> doing maintenance on the server (without a db restart), I perhaps want to
> change this to -1 (BGW_NEVER_RESTART), kill the worker, do my business,
> then restart the worker.
>

Instead of doing that I suggest having a SQL-callable function that sets a
flag in a shared memory segment used by the worker then sets the worker's
ProcLatch to wake it up if it's sleeping. The flag can *ask* it to exit
cleanly. If its exit code is 0 it will not be restarted.

You could also choose to have the worker exit with code 0 on SIGTERM, again
causing itself to be unregistered and not restarted.
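A rough sketch of that arrangement (the shared struct, its initialization, and
all names are invented for illustration; PG_MODULE_MAGIC, the shmem startup
hook, and error handling are omitted):

#include "postgres.h"
#include "fmgr.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "postmaster/bgworker.h"
#include "storage/ipc.h"
#include "storage/latch.h"
#include "storage/proc.h"

typedef struct MyWorkerShmem
{
	PGPROC	   *worker_proc;	/* filled in by the worker at startup */
	bool		stop_requested;
} MyWorkerShmem;

static MyWorkerShmem *my_shmem;	/* attached in a shmem startup hook (not shown) */

/* SQL-callable: ask the worker to exit cleanly and stay down */
PG_FUNCTION_INFO_V1(my_worker_stop);
Datum
my_worker_stop(PG_FUNCTION_ARGS)
{
	my_shmem->stop_requested = true;
	if (my_shmem->worker_proc != NULL)
		SetLatch(&my_shmem->worker_proc->procLatch);	/* wake it if sleeping */
	PG_RETURN_VOID();
}

/* Worker main loop; a SIGTERM handler could set stop_requested the same way */
void
my_worker_main(Datum arg)
{
	BackgroundWorkerUnblockSignals();
	my_shmem->worker_proc = MyProc;

	for (;;)
	{
		int			rc;

		CHECK_FOR_INTERRUPTS();

		if (my_shmem->stop_requested)
			proc_exit(0);		/* exit code 0: postmaster will not restart us */

		/* ... do one unit of work here ... */

		rc = WaitLatch(MyLatch,
					   WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
					   10 * 1000L, PG_WAIT_EXTENSION);
		ResetLatch(MyLatch);
		if (rc & WL_POSTMASTER_DEATH)
			proc_exit(1);
	}
}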

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 2ndQuadrant - PostgreSQL Solutions for the Enterprise


Re: Problem while updating a foreign table pointing to a partitioned table on foreign server

2019-08-13 Thread Alvaro Herrera
On 2019-Aug-13, Michael Paquier wrote:

> On Mon, Aug 12, 2019 at 05:32:08PM -0400, Alvaro Herrera wrote:
> > So do we have an updated patch for this?  It's been a while since this
> > patch saw any movement ...
> 
> Please note that this involves a couple of people in Japan, and this
> week is the Obon vacation season for a lot of people.  So there could
> be delays in replies.

Understood, thanks for the info.  We still have two weeks to the start
of commitfest anyway.  And since it's been sleeping since November 2018,
I guess we can wait a little bit yet.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: errbacktrace

2019-08-13 Thread Tom Lane
Peter Eisentraut  writes:
> I have changed the configuration setting to backtrace_functions plural,
> so that you can debug more than one location at once.  I had originally
> wanted to do that but using existing functions like
> SplitIdentifierString() resulted in lots of complications with error
> handling (inside error handling!).  So here I just hand-coded the list
> splitting.  Seems simple enough.

I think it's a pretty bad idea for anything invocable from elog to
trample on the process-wide strtok() state.  Even if there's no
conflict today, there will be one eventually, unless you are going
to adopt the position that nobody else is allowed to use strtok().
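For what it's worth, a reentrant split avoids the process-wide state entirely;
a minimal sketch, assuming the GUC value fits in a local buffer (strtok_r() is
POSIX; Windows has strtok_s()):

static bool
backtrace_function_matches(const char *funcname, const char *guc_value)
{
	char		buf[1024];
	char	   *tok;
	char	   *save = NULL;

	strlcpy(buf, guc_value, sizeof(buf));
	for (tok = strtok_r(buf, ", ", &save);
		 tok != NULL;
		 tok = strtok_r(NULL, ", ", &save))
	{
		if (strcmp(tok, funcname) == 0)
			return true;
	}
	return false;
}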

regards, tom lane




Re: Some memory not freed at the exit of RelationBuildPartitionDesc()

2019-08-13 Thread David Fetter
On Tue, Aug 13, 2019 at 03:08:34PM +0530, amul sul wrote:
> On Sat, Aug 10, 2019 at 12:16 AM David Fetter  wrote:
> 
> > On Thu, Aug 08, 2019 at 05:42:21PM +0900, Amit Langote wrote:
> > > On Thu, Aug 8, 2019 at 5:33 PM amul sul  wrote:
> > > > On Thu, Aug 8, 2019 at 1:27 PM Amit Langote 
> > wrote:
> > >
> > > >> Thanks for the patch.  This was discussed recently in the "hyrax vs.
> > > >> RelationBuildPartitionDesc()" thread [1] and I think Alvaro proposed
> > > >> an approach that's similar to yours.  Not sure why it wasn't pursued
> > > >> though.  Maybe the reason is buried somewhere in that discussion.
> > > >
> > > > Oh, quite similar, thanks Amit for pointing that out.
> > > >
> > > > Look like "hyrax vs.RelationBuildPartitionDesc()" is in discussion for
> > the master
> > > > branch only, not sure though, but we need the similar fix for the back
> > branches as well.
> > >
> > > Well, this is not a bug as such, so it's very unlikely that a fix like
> > > this will be back-patched.  Also, if this becomes an issue only for
> > > more than over 1000 partitions, then it's not very relevant for PG 10
> > > and PG 11, because we don't recommend using so many partitions with
> > > them.  Maybe we can consider fixing PG 12 though.
> >
> > A fix for the thousands-of-partitions case would be very welcome for 12.
> >
> >
> Look like commit # d3f48dfae42 added the required fix but is enabled only
> for
> the clobber-cache builds :(

I've got a real world multi-tenancy case that would really be helped
by this. Can we enable it for all builds, please?

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate




Re: Some memory not freed at the exit of RelationBuildPartitionDesc()

2019-08-13 Thread Tom Lane
David Fetter  writes:
> On Tue, Aug 13, 2019 at 03:08:34PM +0530, amul sul wrote:
>> Look like commit # d3f48dfae42 added the required fix but is enabled only
>> for the clobber-cache builds :(

> I've got a real world multi-tenancy case that would really be helped
> by this. Can we enable it for all builds, please?

This sounds like nonsense to me.  As pointed out by the commit message,
that fix was just taking care of bloat observed with CCA on.

regards, tom lane




Re: Feature: Use DNS SRV records for connecting

2019-08-13 Thread Tom Lane
Feike Steenbergen  writes:
> I'd like to get some feedback on whether or not implementing a DNS SRV feature
> for connecting to PostgreSQL would be desirable/useful.

How would we get at that data without writing our own DNS client?
(AFAIK, our existing DNS interactions are all handled by getnameinfo()
or other library-supplied functions.)

Maybe that'd be worth doing, but it sounds like a lot of work and a
lot of new code to maintain, relative to the value of the feature.
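For scale, on platforms with the BIND-style resolver the raw lookup is roughly
the sketch below (error handling, RFC 2782 priority/weight selection, and a
Windows DnsQuery path omitted; link with -lresolv on many systems); not
enormous, but it is indeed resolver-level code that libpq has never had to carry:

#include <netinet/in.h>
#include <arpa/nameser.h>
#include <resolv.h>
#include <stdio.h>

static void
lookup_srv(const char *name)	/* e.g. "_postgresql._tcp.db.example.com" */
{
	unsigned char answer[4096];
	char		target[1024];
	ns_msg		msg;
	ns_rr		rr;
	int			len, i;

	len = res_query(name, ns_c_in, ns_t_srv, answer, sizeof(answer));
	if (len < 0 || ns_initparse(answer, len, &msg) < 0)
		return;					/* no SRV records, or lookup/parse failure */

	for (i = 0; i < ns_msg_count(msg, ns_s_an); i++)
	{
		const unsigned char *rd;

		if (ns_parserr(&msg, ns_s_an, i, &rr) < 0 ||
			ns_rr_type(rr) != ns_t_srv)
			continue;

		rd = ns_rr_rdata(rr);	/* priority, weight, port, then target name */
		if (dn_expand(ns_msg_base(msg), ns_msg_end(msg),
					  rd + 6, target, sizeof(target)) < 0)
			continue;

		printf("priority=%u weight=%u port=%u target=%s\n",
			   ns_get16(rd), ns_get16(rd + 2), ns_get16(rd + 4), target);
	}
}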

regards, tom lane




Re: Built-in connection pooler

2019-08-13 Thread Konstantin Knizhnik

An updated version of the patch is attached.
I rewrote the edge-triggered mode emulation and have tested it on macOS.
So right now the three major platforms are covered: Linux, macOS and Windows.
In theory it should work on most other Unix dialects.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/contrib/spi/refint.c b/contrib/spi/refint.c
index adf0490..5c2095f 100644
--- a/contrib/spi/refint.c
+++ b/contrib/spi/refint.c
@@ -11,6 +11,7 @@
 
 #include "commands/trigger.h"
 #include "executor/spi.h"
+#include "storage/proc.h"
 #include "utils/builtins.h"
 #include "utils/rel.h"
 
@@ -93,6 +94,8 @@ check_primary_key(PG_FUNCTION_ARGS)
 	else
 		tuple = trigdata->tg_newtuple;
 
+	MyProc->is_tainted = true;
+
 	trigger = trigdata->tg_trigger;
 	nargs = trigger->tgnargs;
 	args = trigger->tgargs;
@@ -284,6 +287,8 @@ check_foreign_key(PG_FUNCTION_ARGS)
 		/* internal error */
 		elog(ERROR, "check_foreign_key: cannot process INSERT events");
 
+	MyProc->is_tainted = true;
+
 	/* Have to check tg_trigtuple - tuple being deleted */
 	trigtuple = trigdata->tg_trigtuple;
 
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index c91e3e1..119daac 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -719,6 +719,137 @@ include_dir 'conf.d'
   
  
 
+ 
+  max_sessions (integer)
+  
+   max_sessions configuration parameter
+  
+  
+  
+
+  The maximum number of client sessions that can be handled by
+  one connection proxy when session pooling is enabled.
+  This parameter does not add any memory or CPU overhead, so
+  specifying a large max_sessions value
+  does not affect performance.
+  If the max_sessions limit is reached new connections are not accepted.
+
+
+  The default value is 1000. This parameter can only be set at server start.
+   
+  
+ 
+
+ 
+  session_pool_size (integer)
+  
+   session_pool_size configuration parameter
+  
+  
+  
+
+  Enables session pooling and defines the maximum number of
+  backends that can be used by client sessions for each database/user combination.
+  Launched non-tainted backends are never terminated even if there are no active sessions.
+  Backend is considered as tainted if client updates GUCs, creates temporary table or prepared statements.
+  Tainted backend can server only one client.
+
+
+  The default value is 10, so up to 10 backends will serve each database,
+
+  
+ 
+
+ 
+  proxy_port (integer)
+  
+   proxy_port configuration parameter
+  
+  
+  
+
+  Sets the TCP port for the connection pooler.
+  Clients connected to main "port" will be assigned dedicated backends,
+  while client connected to proxy port will be connected to backends through proxy which
+  performs transaction level scheduling. 
+   
+
+  The default value is 6543.
+
+  
+ 
+
+ 
+  connection_proxies (integer)
+  
+   connection_proxies configuration parameter
+  
+  
+  
+
+  Sets number of connection proxies.
+  Postmaster spawns separate worker process for each proxy. Postmaster scatters connections between proxies using one of scheduling policies (round-robin, random, load-balancing).
+  Each proxy launches its own subset of backends.
+  So maximal number of non-tainted backends is  session_pool_size*connection_proxies*databases*roles.
+   
+
+  The default value is 0, so session pooling is disabled.
+
+  
+ 
+
+ 
+  session_schedule (enum)
+  
+   session_schedule configuration parameter
+  
+  
+  
+
+  Specifies scheduling policy for assigning session to proxies in case of
+  connection pooling. Default policy is round-robin.
+
+
+  With round-robin policy postmaster cyclicly scatter sessions between proxies.
+
+
+  With random policy postmaster randomly choose proxy for new session.
+
+
+  With load-balancing policy postmaster choose proxy with lowest load average.
+  Load average of proxy is estimated by number of clients connection assigned to this proxy with extra weight for SSL connections.
+   
+  
+ 
+
+ 
+  idle_pool_worker_timeout (integer)
+  
+   idle_pool_worker_timeout configuration parameter
+  
+  
+  
+
+ Terminate an idle connection pool worker after the specified number of milliseconds.
+ The default value is 0, so pool workers are never terminated.
+   
+  
+ 
+
+ 
+  restart_pooler_on_reload (string)
+  
+   re

Re: Problem with default partition pruning

2019-08-13 Thread Alvaro Herrera
On 2019-Aug-13, Amit Langote wrote:

> Thanks a lot for revising.  Looks neat, except:
> 
> + * This is a measure of last resort only to be used because the default
> + * partition cannot be pruned using the steps; regular pruning, which is
> + * cheaper, is sufficient when no default partition exists.
> 
> This text appears to imply that the default can *never* be pruned with
> steps.  Maybe, the first sentence should read something like: "...the
> default cannot be pruned using the steps generated from clauses that
> contradict the parent's partition constraint".

Thanks!  I have pushed it with this change.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2019-08-13 Thread Anastasia Lubennikova

06.08.2019 4:28, Peter Geoghegan wrote:

Attached is v5, which is based on your v4. The three main differences
between this and v4 are:

* Removed BT_COMPRESS_THRESHOLD stuff, for the reasons explained in my
July 24 e-mail. We can always add something like this back during
performance validation of the patch. Right now, having no
BT_COMPRESS_THRESHOLD limit definitely improves space utilization for
certain important cases, which seems more important than the
uncertain/speculative downside.

Fair enough.
I think we can measure performance and make a decision when the patch
stabilizes.



* We now have experimental support for unique indexes. This is broken
out into its own patch.

* We now handle LP_DEAD items in a special way within
_bt_insertonpg_in_posting().

As you pointed out already, we do need to think about LP_DEAD items
directly, rather than assuming that they cannot be on the page that
_bt_insertonpg_in_posting() must process. More on that later.


If sizeof(t_info) + sizeof(key) < sizeof(t_tid), the resulting posting tuple
can be larger. It may happen if keysize <= 4 bytes.
In this situation the original tuples must have been aligned to 16 bytes each,
and the resulting tuple is at most 24 bytes (6+2+4+6+6). So this case is
also safe.

I still need to think about the exact details of alignment within
_bt_insertonpg_in_posting(). I'm worried about boundary cases there. I
could be wrong.

Could you explain more about these cases?
I don't yet understand the problem.


The main reason why I decided to avoid applying compression to unique
indexes
is the performance of microvacuum. It is not applied to items inside a
posting
tuple. And I expect it to be important for unique indexes, which ideally
contain only a few live values.

I found that the performance of my experimental patch with unique
index was significantly worse. It looks like this is a bad idea, as
you predicted, though we may still want to do
deduplication/compression with NULL values in unique indexes. I did
learn a few things from implementing unique index support, though.

BTW, there is a subtle bug in how my unique index patch does
WAL-logging -- see my comments within
index_compute_xid_horizon_for_tuples(). The bug shouldn't matter if
replication isn't used. I don't think that we're going to use this
experimental patch at all, so I didn't bother fixing the bug.

Thank you for the patch.
Still, I'd suggest leaving it as a possible future improvement, so that it doesn't
distract us from the original feature.

if (ItemIdIsDead(itemId))
continue;

In the previous review Rafia asked about "some reason".
Trying to figure out if this situation is possible, I changed this line to
Assert(!ItemIdIsDead(itemId)) in our test version. And it failed in a
performance
test. Unfortunately, I was not able to reproduce it.

I found it easy enough to see LP_DEAD items within
_bt_insertonpg_in_posting() when running pgbench with the extra unique
index patch. To give you a simple example of how this can happen,
consider the comments about BTP_HAS_GARBAGE within
_bt_delitems_vacuum(). That probably isn't the only way it can happen,
either. ISTM that we need to be prepared for LP_DEAD items during
deduplication, rather than trying to prevent deduplication from ever
having to see an LP_DEAD item.


I added to v6 another related fix for _bt_compress_one_page().
The previous code implicitly deleted DEAD items without
calling index_compute_xid_horizon_for_tuples().
The new code checks whether any DEAD items exist on the page and removes
them if so.
Another possible solution is to copy dead items as-is from the old page to
the new one,
but I think it's better to remove dead tuples as fast as possible.


v5 makes  _bt_insertonpg_in_posting() prepared to overwrite an
existing item if it's an LP_DEAD item that falls in the same TID range
(that's _bt_compare()-wise "equal" to an existing tuple, which may or
may not be a posting list tuple already). I haven't made this code do
something like call  index_compute_xid_horizon_for_tuples(), even
though that's needed for correctness (i.e. this new code is currently
broken in the same way that I mentioned unique index support is
broken).

Is it possible that the DEAD tuple to delete was smaller than itup?

  I also added a nearby FIXME comment to
_bt_insertonpg_in_posting() -- I don't think that the code for
splitting a posting list in two is currently crash-safe.


Good catch. It seems that I need to rearrange the code.
I'll send an updated patch this week.


How do you feel about officially calling this deduplication, not
compression? I think that it's a more accurate name for the technique.

I agree.
Should I rename all related names of functions and variables in the patch?

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

commit 9ac37503c71f7623413a2e406d81f5c9a4b02742
Author: Anastasia 
Date:   Tue Aug 13 17:00:41 2019 +0300

v6-0001-Compression-deduplication-in-nbtree.p

pgbench - rework variable management

2019-08-13 Thread Fabien COELHO


Hello pgdevs,

The attached patch improves pgbench variable management as discussed in:

https://www.postgresql.org/message-id/flat/alpine.DEB.2.21.1904081752210.5867@lancre

As discussed there as well, the overall effect is small compared to libpq 
& system costs when pgbench is talking to a postgres server. When someone 
says "pgbench is slow", they really mean "libpq & system are slow", 
because pgbench does not do much beyond jumping from one libpq call to the 
next. Anyway, the patch has a measurable positive effect.


###

Rework pgbench variables and associated values for better performance

 - a (hopefully) thread-safe symbol table which maps variable names to integers;
   note that all variables are statically known, except for those created by \gset
 - these numbers are then used to access per-client arrays

The symbol table stores names as distinct leaves in a tree on bytes.
Each symbol name is the shortest-prefix leaf, possibly including the final
'\0'. Some Windows-specific hacks are not tested. File "symbol_table_test.c"
does what it says and can be compiled standalone.

Most malloc/free cycles are taken out of running a benchmark:
 - there is a (large?) maximum number of variables of 32*MAX_SCRIPTS
 - variable names and string values are statically allocated,
   and limited to 64 bytes
 - a per-client persistent buffer is used for various purposes,
   to avoid mallocs/frees.

Functions assignVariables & parseQuery basically shared the same variable
substitution logic, but differed in what was substituted. The logic has been
abstracted into a common function.
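For readers who have not opened the patch, the name-to-number idea is essentially
the following sketch (deliberately simplified to a linear scan with invented
constants; the patch itself uses a byte-wise prefix tree):

#include <stdio.h>
#include <string.h>

#define SKETCH_MAX_SYMBOLS	4096	/* invented cap for this sketch */
#define SKETCH_MAX_NAME		64

static char symbol_names[SKETCH_MAX_SYMBOLS][SKETCH_MAX_NAME];
static int	n_symbols = 0;

/* return a stable small integer for a variable name, creating it if new */
static int
getOrCreateSymbolId_sketch(const char *name)
{
	int			i;

	for (i = 0; i < n_symbols; i++)
		if (strcmp(symbol_names[i], name) == 0)
			return i;

	if (n_symbols >= SKETCH_MAX_SYMBOLS)
		return -1;				/* out of slots */

	snprintf(symbol_names[n_symbols], SKETCH_MAX_NAME, "%s", name);
	return n_symbols++;
}

/* per-client variable values then live in a plain array indexed by that id */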

This patch brings pgbench-specific overheads down on some tests, one 
thread one client, on my laptop, with the attached scripts, in tps:

 - set_x_1.sql: 11.1M -> 14.2M
 - sets.sql: 0.8M -> 2.7M # 20 \set
 - set.sql: 1.5M -> 2.0M # 3 \set & "complex" expressions
 - empty.sql: 63.9K -> 64.1K (…)
 - select_aid.sql: 29.3K -> 29.3K
 - select_aids.sql: 23.4K -> 24.2K
 - gset_aid.sql: 28.3K -> 29.2K

So we are talking about significant improvements on pgbench-only scripts, but only
a few percent once pgbench must interact with a CPU-bound server, because
time is spent elsewhere.


--
Fabien.

empty.sql
Description: application/sql


select_aids.sql
Description: application/sql


select_aid.sql
Description: application/sql
diff --git a/src/bin/pgbench/Makefile b/src/bin/pgbench/Makefile
index 25abd0a875..39bba12c23 100644
--- a/src/bin/pgbench/Makefile
+++ b/src/bin/pgbench/Makefile
@@ -7,7 +7,7 @@ subdir = src/bin/pgbench
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = pgbench.o exprparse.o $(WIN32RES)
+OBJS = pgbench.o exprparse.o symbol_table.o $(WIN32RES)
 
 override CPPFLAGS := -I. -I$(srcdir) -I$(libpq_srcdir) $(CPPFLAGS)
 LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils $(libpq_pgport)
diff --git a/src/bin/pgbench/exprparse.y b/src/bin/pgbench/exprparse.y
index 17c9ec32c5..8c4ad7eff3 100644
--- a/src/bin/pgbench/exprparse.y
+++ b/src/bin/pgbench/exprparse.y
@@ -211,7 +211,7 @@ make_variable(char *varname)
 	PgBenchExpr *expr = pg_malloc(sizeof(PgBenchExpr));
 
 	expr->etype = ENODE_VARIABLE;
-	expr->u.variable.varname = varname;
+	expr->u.variable.varnum = getOrCreateSymbolId(symbol_table, varname);
 	return expr;
 }
 
diff --git a/src/bin/pgbench/exprscan.l b/src/bin/pgbench/exprscan.l
index e9020ad231..1c80e5cac6 100644
--- a/src/bin/pgbench/exprscan.l
+++ b/src/bin/pgbench/exprscan.l
@@ -207,19 +207,19 @@ notnull			[Nn][Oo][Tt][Nn][Uu][Ll][Ll]
 {digit}+		{
 	if (!strtoint64(yytext, true, &yylval->ival))
 		expr_yyerror_more(yyscanner, "bigint constant overflow",
-		  strdup(yytext));
+		  pg_strdup(yytext));
 	return INTEGER_CONST;
 }
 {digit}+(\.{digit}*)?([eE][-+]?{digit}+)?	{
 	if (!strtodouble(yytext, true, &yylval->dval))
 		expr_yyerror_more(yyscanner, "double constant overflow",
-		  strdup(yytext));
+		  pg_strdup(yytext));
 	return DOUBLE_CONST;
 }
 \.{digit}+([eE][-+]?{digit}+)?	{
 	if (!strtodouble(yytext, true, &yylval->dval))
 		expr_yyerror_more(yyscanner, "double constant overflow",
-		  strdup(yytext));
+		  pg_strdup(yytext));
 	return DOUBLE_CONST;
 }
 {alpha}{alnum}*	{
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 570cf3306a..8025be302d 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -232,24 +232,30 @@ const char *progname;
 volatile bool timer_exceeded = false;	/* flag from signal handler */
 
 /*
- * Variable definitions.
+ * Variable contents.
+ *
+ * Variable names are managed in symbol_table with a number.
  *
  * If a variable only has a string value, "svalue" is that value, and value is
- * "not set".  If the value is known, "value" contains the value (in any
- * variant).
+ * "not set".
+ *
+ * If the value is known, "value" contains the value (in any variant).
  *
  * In this case "svalue" contains the string equivalent of the value, if we've
- * 

Re: SegFault on 9.6.14

2019-08-13 Thread Tom Lane
Amit Kapila  writes:
> On Tue, Aug 13, 2019 at 3:18 AM Tom Lane  wrote:
>> To clarify my position --- I think it's definitely possible to improve
>> the situation a great deal.  We "just" have to pass down more information
>> about whether rescans are possible.

> Right, you have speculated above that it is possible via adding some
> eflag bits.  Can you please describe a bit more about that idea, so
> that somebody else can try to write a patch?

Well, there are two components to solving this problem:

1. What are we going to do about the executor's external API?

Right now, callers of ExecutorStart don't have to say whether they
might call ExecutorRewind.  We need some way for callers to make a
binding promise that they won't do any such thing.  Perhaps we just
want to invent another flag that's like EXEC_FLAG_BACKWARD, but it's
not clear how it should interact with the existing "soft" REWIND
flag.  Nor do I know how far up the call stack will we have to make
changes to make it possible to promise as much as we can -- for
instance, will we have to adapt the SPI interfaces?

2. What happens inside ExecutorStart in response to such promises?

I imagine that we translate them into additional eflags bits that
get passed down to node init functions, possibly with modification
(e.g., nodeNestloop.c would have to revoke the no-rescans promise
to its inner input).  You'd need to work out what is the most
convenient set of conventions (positive or negative sense of the
flag bits, etc), and go through all the non-leaf node types to
determine what they can pass down.
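To make point 2 concrete with an entirely hypothetical sketch (EXEC_FLAG_NO_REWIND
is an invented flag meaning "this subtree will never be rescanned", and the real
init function of course does much more than this):

static NestLoopState *
ExecInitNestLoop_sketch(NestLoop *node, EState *estate, int eflags)
{
	NestLoopState *nlstate = makeNode(NestLoopState);

	/* outer side: the caller's no-rescan promise can be passed through */
	outerPlanState(nlstate) =
		ExecInitNode(outerPlan(node), estate, eflags);

	/*
	 * inner side: revoke the promise, since nestloop rescans it for every
	 * outer row, and request cheap rescans as today.
	 */
	innerPlanState(nlstate) =
		ExecInitNode(innerPlan(node), estate,
					 (eflags & ~EXEC_FLAG_NO_REWIND) | EXEC_FLAG_REWIND);

	return nlstate;
}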

(BTW, unless I'm missing something, there's not currently any
enforcement of EXEC_FLAG_BACKWARD, ie a caller can fail to pass
that and then try to back up anyway.  We probably want to improve
that situation, and also enforce this new flag about
ExecutorRewind.)

The reason I'm dubious about back-patching this is that each
of these things seems likely to affect external code.  Point 1
could affect external callers of the executor, and point 2 is
likely to have consequences for FDWs and custom-scan providers.
Maybe we can set things up so that everything defaults in a
safe direction for unchanged code, but I don't want to contort
the design just to do that.

regards, tom lane




Re: pg_upgrade fails with non-standard ACL

2019-08-13 Thread Anastasia Lubennikova

29.07.2019 18:37, Stephen Frost wrote:

Greetings,

* Tom Lane (t...@sss.pgh.pa.us) wrote:

Bruce Momjian  writes:

On Thu, Jul 18, 2019 at 06:53:12PM +0300, Anastasia Lubennikova wrote:

pg_upgrade from 9.6 fails if old cluster had non-standard ACL
on pg_catalog functions that have changed between versions,
for example pg_stop_backup(boolean).

Uh, wouldn't this affect any default-installed function where the
permission are modified?  Is fixing only a few functions really helpful?

No, it's just functions whose signatures have changed enough that
a GRANT won't find them.  I think the idea is that the set of
potentially-affected functions is determinate.  I have to say that
the proposed patch seems like a complete kluge, though.  For one
thing we'd have to maintain the list of affected functions in each
future release, and I have no faith in our remembering to do that.

Well, we aren't likely to do that ourselves, no, but perhaps we could
manage it with some prodding by having the buildfarm check for such
cases, not unlike how library maintainers check the ABI between versions
of the library they manage.  Extension authors also deal with these
kinds of changes routinely when writing the upgrade scripts to go
between versions of their extension.  I'm not convinced that this is a
great approach to go down either, to be clear.  When going across major
versions, making people update their systems/code and re-test is
typically entirely reasonable to me.
Whatever we choose to do, we need to keep a list of changed functions. I don't
think that it will add too much extra work on top of maintaining other catalog
changes such as adding or renaming columns.
What's more, we must mention changed functions in the migration release notes.
I've checked the documentation [1] and found that function API changes are not
described properly.

I think it is an important omission, so I attached a patch for the documentation.
I'm not quite sure how many users have already migrated to version 10; still, I
believe it will help many others.


Suppressing the GRANT also seems reasonable for the case of objects
which have been renamed- clearly whatever is using those functions is
going to have to be modified to deal with the new name of the function,
requiring that the GRANT be re-issued doesn't seem like it's that much
more to ask of users.  On the other hand, properly written tools that
check the version of PG and use the right function names could possibly
"just work" following a major version upgrade, if the privilege was
brought across to the new major version correctly.

That's exactly the case.


We also don't want to mistakenly GRANT users more access then they
should have though- if pg_stop_backup() one day grows an
optional argument to run some server-side script, I don't think we'd
want to necessairly just give access to that ability to roles who,
today, can execute the current pg_stop_backup() function.  Of course, if
we added such a capability, hopefully we would do so in a way that less
privileged roles could continue to use the existing capability without
having access to run such a server-side script.

I also don't think that the current patch is actually sufficient to deal
with all the changes we've made between the versions- what about column
names on catalog tables/views that were removed, or changed/renamed..?

I don't quite get the problem you describe here. As far as I understand, various
changes to catalog tables and views are already handled correctly in
pg_upgrade.



In an ideal world, it seems like we'd make a judgement call and arrange
to pull the privileges across when we can do so without granting the
user privileges beyond those that were intended, and otherwise we'd
suppress the GRANT to avoid getting an error.  I'm not sure what a good
way is to go about either figuring out a way to pull the privileges
across, or to suppress the GRANTs when we need to (or always), would be
though.  Neither seems easy to solve in a clean way.

Certainly open to suggestions.
Based on our initial bug report, I would vote against suppressing the GRANTS,
because it leads to an unexpected failure in case a user has a special role for
use in a third-party utility such as a backup tool, which already took care of
internal API changes.

Still, I agree with your arguments about the possibility of providing more grants
than expected. Ideally, we would not change the behaviour of existing functions
that much, but in the real world it may happen.

Maybe, as a compromise, we can reset grants to default for all changed functions
and also generate a script that will allow a user to preserve the privileges of
the old cluster, by analogy with the analyze_new_cluster script.
What do you think?

[1] https://www.postgresql.org/docs/10/release-10.html

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/doc/src/sgml/release-10.sgml b/doc/src/sgml/release-10.sgml
index 7b2130e3c1..1b685b44da 100644
--- a

Re: Add "password_protocol" connection parameter to libpq

2019-08-13 Thread Jeff Davis
On Tue, 2019-08-13 at 11:56 +0900, Michael Paquier wrote:
> I tend to prefer #2 as well and that's the kind of approach we were
> tending to agree on when we discussed this issue during the v11 beta
> for the downgrade issues with libpq.  And as you say extend it so as
> we can apply filtering of more AUTH_REQ requests, including GSS and
> krb5.

Can you please offer a concrete proposal? I know the proposals I've put
out aren't perfect (otherwise there wouldn't be three of them), so if
you have something better, please share.

Regards,
Jeff Davis






Re: Bison state table

2019-08-13 Thread Bruce Momjian
On Tue, Aug 13, 2019 at 05:09:30PM +0700, John Naylor wrote:
> On Sat, Jan 26, 2019 at 5:39 AM Bruce Momjian  wrote:
> >
> > With our scanner keywords list now more cache-aware, and with us
> > planning to use Bison for years to come, has anyone ever looked at
> > reordering the bison state machine array to be more cache aware, e.g.,
> > having common states next to each other rather than scattered around the
> > array?
> 
> I recently spent some time investigating how the various arrays are
> generated and accessed, and found this informative article:
> 
> https://www.cs.uic.edu/~spopuri/cparser.html
> 
> It's dated from 2006, but still seems to be correct on the whole,
> judging by the gram.c output file. Anyway, the short answer is,
> grouping common states would require changing Bison's algorithm for
> compressing a sparse 2-D array into multiple less-sparse 1-D arrays,
> if it's even possible to control the effect you have in mind.
> 
> That said, I had an idea. When Bison consults yytable, it has to also
> consult yycheck with the same index to make sure the result is valid.
> If the two arrays of int16 were merged into a single array of structs,
> the state and the check would be on the same cache line. I tried this
> by directly patching the gram.c output file (see attached for the
> basic idea) and #include'-ing a separate file with the merged array.
> It didn't improve raw parsing microbenchmarks using information schema
> or short pgbench-like queries. So I'm guessing either a). the
> microbenchmark already has better cache behavior than real queries so
> won't show much difference here, and/or b). the bottleneck is
> elsewhere than accessing yytable and yycheck.
> 
> In any case, I hope to follow Bison development and test any
> performance improvements the maintainers come up with, as that was
> reported to be on the roadmap:

Very interesting.  Thanks for researching this.
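For readers following along, the merged-array layout described above is roughly:

typedef struct
{
	short		table;			/* former yytable entry */
	short		check;			/* former yycheck entry */
} yytc;

static const yytc yymerged[] = {{0, 0}};	/* generated by Bison in reality */

/*
 * Instead of two loads that may touch two different cache lines:
 *     if (yycheck[yyn] == yytoken)
 *         yyn = yytable[yyn];
 * the merged form touches a single struct (the value compared against the
 * check entry depends on which table access is being validated):
 *     if (yymerged[yyn].check == yytoken)
 *         yyn = yymerged[yyn].table;
 */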

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: pg_upgrade fails with non-standard ACL

2019-08-13 Thread Bruce Momjian
On Tue, Aug 13, 2019 at 07:04:42PM +0300, Anastasia Lubennikova wrote:
> > In an ideal world, it seems like we'd make a judgement call and arrange
> > to pull the privileges across when we can do so without granting the
> > user privileges beyond those that were intended, and otherwise we'd
> > suppress the GRANT to avoid getting an error.  I'm not sure what a good
> > way is to go about either figuring out a way to pull the privileges
> > across, or to suppress the GRANTs when we need to (or always), would be
> > though.  Neither seems easy to solve in a clean way.
> > 
> > Certainly open to suggestions.
> Based on our initial bug report, I would vote against suppressing the
> GRANTS,
> because it leads to an unexpected failure in case a user has a special role
> for
> use in a third-party utility such as a backup tool, which already took care
> of
> internal API changes.
> 
> Still I agree with your arguments about possibility of providing more grants
> than expected. Ideally, we do not change behaviour of existing functions
> that
> much, but in real-world it may happen.
> 
> Maybe, as a compromise, we can reset grants to default for all changed
> functions
> and also generate a script that will allow a user to preserve privileges of
> the
> old cluster by analogy with analyze_new_cluster script.
> What do you think?

I agree pg_upgrade should work without user correction as much as
possible.  However, as you can see, it can fail when user objects
reference system table objects that have changed between major releases.

As much as it would be nice if the release notes covered all that, and
we updated pg_upgrade to somehow handle them, it just isn't realistic. 
As we can see here, the problems often take years to show up, and even
then there were probably many other people who had the problem who never
reported it to us.

I think a realistic approach is to come up with a list of all the user
behaviors that can cause pg_upgrade to break (by reviewing previous
pg_upgrade bug reports), and then add code to pg_upgrade to detect them
and either fix them or report them in --check mode.

In summary, I am saying that the odds that patch authors, committers,
release note writers, and pg_upgrade maintainers are going to form a
consistent work flow that catches all these changes is unrealistic ---
our best bet is to create something in the pg_upgrade code to detects
this.  pg_upgrade already connects to the old and new cluster, so
technically it can perform system table modification checks itself.

The only positive is that when pg_upgrade does fail, at least we have a
system that clearly points to the cause, but unfortunately usually at
run-time, not at --check time.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: Do not check unlogged indexes on standby

2019-08-13 Thread Peter Geoghegan
On Tue, Aug 13, 2019 at 5:17 AM Andrey Borodin  wrote:
> We have a bunch of internal testing HA clusters that suffered from corruption 
> conditions.
> We fixed everything that can be detected with parent-check on primaries or 
> usual check on standbys.
> (page updates were lost both on primary and during WAL replay)
> But from time to time when clusters switch primary from one availability zone 
> to another we observe
> "right sibling's left-link doesn't match: block 32709 links to 37022 instead 
> of expected 40953 in index"

That sounds like an issue caused by a failure to replay all available
WAL, where only one page happened to get written out by a checkpoint
before a crash. It's something like that. That wouldn't be caught by
the cross-page bt_index_check() check that we do already.

> We are going to search for these clusters with this [0] tolerating possible 
> fraction of false positives, we have them anyway.
> But I think I could put some effort into making corruption-detection tooling 
> better.
> I think if we observe links discrepancy, we can acquire lock of left and 
> right pages and recheck.

That's one possibility. When I first designed amcheck it was important
to be conservative, so I invented a general rule about never acquiring
multiple buffer locks at once. I still think that that was the correct
decision for the bt_downlink_check() check (the main extra
bt_index_parent_check() check), but I think that you're right about
retrying to verify the sibling links when bt_index_check() is called
from SQL.

nbtree will often "couple" buffer locks on the leaf level; it will
acquire a lock on a leaf page, and not release that lock until it has
also acquired a lock on the right sibling page (I'm mostly thinking of
_bt_stepright()). I am in favor of a patch that makes amcheck perform
sibling link verification within bt_index_check(), by retrying while
pessimistically coupling buffer locks. (Though I think that that
should just happen on the leaf level. We should not try to be too
clever about ignorable/half-dead/deleted pages, to be conservative.)
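A sketch of the leaf-level recheck, using core nbtree helpers for brevity; real
amcheck code would go through its own page-read and corruption-reporting paths
and would skip half-dead/deleted pages:

#include "postgres.h"
#include "access/nbtree.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"

static void
recheck_sibling_link(Relation rel, BlockNumber blkno)
{
	Buffer		buf = _bt_getbuf(rel, blkno, BT_READ);
	Page		page = BufferGetPage(buf);
	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
	BlockNumber rightblkno = opaque->btpo_next;

	if (rightblkno != P_NONE)
	{
		/* couple: keep the lock on "page" while locking its right sibling */
		Buffer		rbuf = _bt_getbuf(rel, rightblkno, BT_READ);
		BTPageOpaque ropaque = (BTPageOpaque)
			PageGetSpecialPointer(BufferGetPage(rbuf));

		if (ropaque->btpo_prev != blkno)
			ereport(ERROR,
					(errcode(ERRCODE_INDEX_CORRUPTED),
					 errmsg("right sibling's left-link doesn't match: block %u links to %u instead of expected %u",
							rightblkno, ropaque->btpo_prev, blkno)));

		_bt_relbuf(rel, rbuf);
	}

	_bt_relbuf(rel, buf);
}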

-- 
Peter Geoghegan




Re: Unix-domain socket support on Windows

2019-08-13 Thread Peter Eisentraut
On 2019-08-07 15:56, Peter Eisentraut wrote:
> Depending on your Windows environment, there might not be a suitable
> /tmp directory, so you'll need to specify a directory explicitly using
> postgres -k or similar.  This leads to the question what the default for
> DEFAULT_PGSOCKET_DIR should be on Windows.  I think it's probably best,
> at least for now, to set it so that by default, neither server nor libpq
> use Unix sockets unless explicitly selected.  This can be done easily on
> the server side by defining DEFAULT_PGSOCKET_DIR as "".  But in libpq, I
> don't think the code would handle that correctly everywhere, so it would
> need some more analysis and restructuring possibly.

Updated patches, which now also address that issue: There is no default
socket dir on Windows and it's disabled by default on both client and
server.

Some comments on the patches:

v2-0001-Enable-Unix-domain-sockets-support-on-Windows.patch

This is pretty straightforward, apart from maybe some comments, but it
would need to be committed last, because it would enable all the Unix
socket related code on Windows, which needs to be fixed up by the
subsequent patches first.

v2-0002-Sort-out-getpeereid-and-struct-passwd-handling-on.patch

Maybe a more elegant way with fewer #ifdef WIN32 can be found?

v2-0003-psql-Remove-one-use-of-HAVE_UNIX_SOCKETS.patch

This could be committed independently.

v2-0004-libpq-Remove-unnecessary-uses-of-HAVE_UNIX_SOCKET.patch

This one as well.

v2-0005-initdb-Detect-Unix-domain-socket-support-dynamica.patch

I think this patch contains some nice improvements in general.  How much
of that ends up being useful depends on how the subsequent patches (esp.
0007) end up, since with Unix-domain sockets disabled by default on
Windows, we won't need initdb doing any detection.

v2-0006-Fix-handling-of-Unix-domain-sockets-on-Windows-in.patch

This is a fairly independent and isolated change.

v2-0007-Disable-Unix-sockets-by-default-on-Windows.patch

This one is a bit complicated.  Since there is no good default location
for Unix sockets on Windows, and many systems won't support them for
some time, the default implemented here is to not use them by default on
the server or client.  This needs a fair amount of restructuring in the code
to support the case of "supports Unix sockets but don't use them by
default", while maintaining the existing cases of "doesn't support Unix
sockets" and "use Unix sockets by default".  There is some room for
discussion here.


This patch set needs testers with various Windows versions to test
different configurations, combinations, and versions.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From c817dfc0ec2aead25e6fcd33f9e89bd55cb2c0b1 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Wed, 7 Aug 2019 15:44:19 +0200
Subject: [PATCH v2 1/7] Enable Unix-domain sockets support on Windows

As of Windows 10 version 1803, Unix-domain sockets are supported on
Windows.  But it's not automatically detected by configure because it
looks for struct sockaddr_un and Windows doesn't define that.  So we
just make our own definition on Windows and override the configure
result.
---
 config/c-library.m4|  5 +++--
 configure  |  5 -
 src/include/c.h| 11 +++
 src/include/pg_config.h.in |  6 +++---
 src/include/pg_config.h.win32  |  6 +++---
 src/include/pg_config_manual.h |  7 ---
 6 files changed, 24 insertions(+), 16 deletions(-)

diff --git a/config/c-library.m4 b/config/c-library.m4
index 6f2b0fbb4e..1469b07d2f 100644
--- a/config/c-library.m4
+++ b/config/c-library.m4
@@ -121,10 +121,11 @@ AC_DEFUN([PGAC_UNION_SEMUN],
 
 # PGAC_STRUCT_SOCKADDR_UN
 # ---
-# If `struct sockaddr_un' exists, define HAVE_UNIX_SOCKETS.
+# If `struct sockaddr_un' exists, define HAVE_STRUCT_SOCKADDR_UN.
+# If it is missing then one could define it.
 # (Requires test for !)
 AC_DEFUN([PGAC_STRUCT_SOCKADDR_UN],
-[AC_CHECK_TYPE([struct sockaddr_un], [AC_DEFINE(HAVE_UNIX_SOCKETS, 1, [Define 
to 1 if you have unix sockets.])], [],
+[AC_CHECK_TYPES([struct sockaddr_un], [], [],
 [#include 
 #ifdef HAVE_SYS_UN_H
 #include 
diff --git a/configure b/configure
index 7a6bfc2339..6e87537be3 100755
--- a/configure
+++ b/configure
@@ -14135,7 +14135,10 @@ ac_fn_c_check_type "$LINENO" "struct sockaddr_un" 
"ac_cv_type_struct_sockaddr_un
 "
 if test "x$ac_cv_type_struct_sockaddr_un" = xyes; then :
 
-$as_echo "#define HAVE_UNIX_SOCKETS 1" >>confdefs.h
+cat >>confdefs.h <<_ACEOF
+#define HAVE_STRUCT_SOCKADDR_UN 1
+_ACEOF
+
 
 fi
 
diff --git a/src/include/c.h b/src/include/c.h
index 2a082afab1..434c403269 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -1263,6 +1263,17 @@ extern unsigned long long strtoull(const char *str, char 
**endptr, int base);
 #define NON_EXEC_STATIC static
 #endif
 
+#ifdef HAVE_STRUCT_SOCKADDR_UN
+#define HAVE_UNIX_SOCKETS 1
+#e

Re: Fix runtime errors from -fsanitize=undefined

2019-08-13 Thread Peter Eisentraut
On 2019-07-05 19:10, Tom Lane wrote:
> Peter Eisentraut  writes:
>> On 2019-07-05 01:33, Noah Misch wrote:
>>> I just saw this proposal.  The undefined behavior in question is strictly
>>> academic.  These changes do remove the need for new users to discover
>>> -fno-sanitize=nonnull-attribute, but they make the code longer and no 
>>> clearer.
>>> Given the variety of code this touches, I expect future commits will
>>> reintroduce the complained-of usage patterns, prompting yet more commits to
>>> restore the invariant achieved here.  Hence, I'm -0 for this change.
> 
>> This sanitizer has found real problems in the past.  By removing these
>> trivial issues we can then set up a build farm animal or similar to
>> automatically check for any new issues.
> 
> I think Noah's point is just that we can do that with the addition of
> -fno-sanitize=nonnull-attribute.  I agree with him that it's very
> unclear why we should bother to make the code clean against that
> specific subset of warnings.

OK, I'm withdrawing this patch.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Use PageIndexTupleOverwrite() within nbtsort.c

2019-08-13 Thread Peter Geoghegan
On Tue, Aug 6, 2019 at 8:30 AM Anastasia Lubennikova
 wrote:
> Should we also update similar code in _bt_mark_page_halfdead()?
> I attached a new version of the patch with this change.

Pushed.

At first I thought that there might be a problem with doing the same
thing within _bt_mark_page_halfdead(), because we still won't use
PageIndexTupleOverwrite() in the corresponding recovery routine -- in
theory, that could break WAL consistency checking because the redo
routine works by reconstructing a half-deleted leaf page from scratch,
resulting in a logically equivalent though physically different page
(even after masking within btree_mask()). However, I eventually
decided that you had it right. Your _bt_mark_page_halfdead() change is
clearer overall and doesn't break WAL consistency checking in
practice, for reasons that are no less obvious than before.

Thanks!
-- 
Peter Geoghegan




Re: Small patch to fix build on Windows

2019-08-13 Thread Dmitry Igrishin
вт, 13 авг. 2019 г. в 06:19, Michael Paquier :
>
> On Fri, Aug 09, 2019 at 11:21:52AM +0300, Dmitry Igrishin wrote:
> > Personally I don't care. I used || notation only in order to be
> > consistent, since this notation is already used in Solution.pm. If
> > this consistency is not required let me provide a patch with {}
> > notation. What do you think?
>
> We are talking about one place in src/tools/msvc/ using qq on HEAD.
> So one or the other is fine by me as long as we remain in the
> acceptable ASCII ranks.
Okay. 5th version of patch is attached.
diff --git a/src/tools/msvc/Mkvcbuild.pm b/src/tools/msvc/Mkvcbuild.pm
index d1d0aed07e..67e9eede49 100644
--- a/src/tools/msvc/Mkvcbuild.pm
+++ b/src/tools/msvc/Mkvcbuild.pm
@@ -495,7 +495,7 @@ sub mkvcbuild
 		my $pythonprog = "import sys;print(sys.prefix);"
 		  . "print(str(sys.version_info[0])+str(sys.version_info[1]))";
 		my $prefixcmd =
-		  $solution->{options}->{python} . "\\python -c \"$pythonprog\"";
+		  qq{"$solution->{options}->{python}\\python" -c "$pythonprog"};
 		my $pyout = `$prefixcmd`;
 		die "Could not query for python version!\n" if $?;
 		my ($pyprefix, $pyver) = split(/\r?\n/, $pyout);
diff --git a/src/tools/msvc/Project.pm b/src/tools/msvc/Project.pm
index b5d1dc6e89..88e9e3187d 100644
--- a/src/tools/msvc/Project.pm
+++ b/src/tools/msvc/Project.pm
@@ -132,11 +132,6 @@ sub AddLibrary
 {
 	my ($self, $lib, $dbgsuffix) = @_;
 
-	if ($lib =~ m/\s/)
-	{
-		$lib = '"' . $lib . """;
-	}
-
 	push @{ $self->{libraries} }, $lib;
 	if ($dbgsuffix)
 	{
diff --git a/src/tools/msvc/Solution.pm b/src/tools/msvc/Solution.pm
index 318594db5d..400c6706fb 100644
--- a/src/tools/msvc/Solution.pm
+++ b/src/tools/msvc/Solution.pm
@@ -126,7 +126,7 @@ sub GetOpenSSLVersion
 	# Attempt to get OpenSSL version and location.  This assumes that
 	# openssl.exe is in the specified directory.
 	my $opensslcmd =
-	  $self->{options}->{openssl} . "\\bin\\openssl.exe version 2>&1";
+	  qq{"$self->{options}->{openssl}\\bin\\openssl.exe" version 2>&1};
 	my $sslout = `$opensslcmd`;
 
 	$? >> 8 == 0


Re: Improve search for missing parent downlinks in amcheck

2019-08-13 Thread Peter Geoghegan
On Mon, Aug 12, 2019 at 12:01 PM Alexander Korotkov
 wrote:
> BTW, there is next revision of patch I'm proposing for v13.

Cool.

> In this revision, the check for missing downlinks is combined with
> bt_downlink_check().  So, pages visited by bt_downlink_check() don't
> cause extra accesses.  It only causes the following additional
> page accesses:
> 1) Downlinks corresponding to "negative infinity" keys,
> 2) Pages of child level, which are not referenced by downlinks.
>
> But I think these two kinds are a small minority, and those accesses
> could be traded off against a more precise missing-downlink check without
> the bloom filter.  What do you think?

I am generally in favor of making the downlink check absolutely
reliable, and am not too worried about the modest additional overhead.
After all, bt_index_parent_check() is supposed to be thorough though
expensive. The only reason that I used a Bloom filter for
fingerprinting downlink blocks was that it seemed important to quickly
get amcheck coverage for subtle multi-level page deletion bugs just
after v11 feature freeze. We can now come up with a better design for
that.

I was confused about how this patch worked at first. But then I
remembered that Lehman and Yao consider downlinks to be distinct
things to separator keys in internal pages. The high key of an
internal page is the final separator key, so you have n downlinks and
n + 1 separator keys per internal page -- two distinct things that
appear in alternating order (the negative infinity item is not
considered to have any separator key here). So, while internal page
items are explicitly "(downlink, separator)" pairs that are grouped
into a single tuple in nbtree, with a separate tuple just for the high
key, Lehman and Yao would find it just as natural to treat them as
"(separator, downlink)" pairs. You have to skip the first downlink on
each internal level to make that work, but this makes our
bt_downlink_check() check have something from the target page (child's
parent page) that is like the high key in the child.

It's already really confusing that we don't quite mean the same thing
as Lehman and Yao when we say downlink (See also my long "why a
highkey is never truly a copy of another item" comment block within
bt_target_page_check()), and that is not your patch's fault. But maybe
we need to fix that to make your patch easier to understand. (i.e.
maybe we need to go over every use of the word "downlink" in nbtree,
and make it say something more precise, to make everything less
confusing.)

Other feedback:

* Did you miss the opportunity to verify that every high key matches
its right sibling page's downlink tuple in the parent page? We talked
about this already, but you don't seem to match the key (you only
match the downlink block).

* You are decoupling the new check from the bt_index_parent_check()
"heapallindexed" option. That's probably a good thing, but you need to
update the sgml docs.

* Didn't you forget to remove the BtreeCheckState.rightsplit flag?

* You've significantly changed the behavior of bt_downlink_check() --
I would note this in its header comments. This is where ~99% of the
new work happens.

* I don't like that you use the loaded term "target" here -- anything
called "target" should be BtreeCheckState.target:

>  static void
> -bt_downlink_missing_check(BtreeCheckState *state)
> +bt_downlink_missing_check(BtreeCheckState *state, bool rightsplit,
> + BlockNumber targetblock, Page target)
>  {

If it's unclear what I mean, this old comment should make it clearer:

/*
 * State associated with verifying a B-Tree index
 *
 * target is the point of reference for a verification operation.
 *
 * Other B-Tree pages may be allocated, but those are always auxiliary (e.g.,
 * they are current target's child pages).  Conceptually, problems are only
 * ever found in the current target page (or for a particular heap tuple during
 * heapallindexed verification).  Each page found by verification's left/right,
 * top/bottom scan becomes the target exactly once.
 */

-- 
Peter Geoghegan




Re: block-level incremental backup

2019-08-13 Thread Ibrar Ahmed
On Mon, Aug 12, 2019 at 4:57 PM Jeevan Chalke <
jeevan.cha...@enterprisedb.com> wrote:

>
>
> On Fri, Aug 9, 2019 at 6:36 PM Robert Haas  wrote:
>
>> On Wed, Aug 7, 2019 at 5:46 AM Jeevan Chalke
>>  wrote:
>> > So, do you mean we should just do fread() and fwrite() for the whole
>> file?
>> >
>> > I thought it is better if it was done by the OS itself instead of
>> reading 1GB
>> > into the memory and writing the same to the file.
>>
>> Well, 'cp' is just a C program.  If they can write code to copy a
>> file, so can we, and then we're not dependent on 'cp' being installed,
>> working properly, being in the user's path or at the hard-coded
>> pathname we expect, etc.  There's an existing copy_file() function in
>> src/backed/storage/file/copydir.c which I'd probably look into
>> adapting for frontend use.  I'm not sure whether it would be important
>> to adapt the data-flushing code that's present in that routine or
>> whether we could get by with just the loop to read() and write() data.
>>
>
> Agree that we can certainly use open(), read(), write(), and close() here,
> but
> given that pg_basebackup.c and basebackup.c are using file operations, I
> think
> using fopen(), fread(), fwrite(), and fclose() will be better here,
> at least
> for consistency.
>

+1 for using  fopen(), fread(), fwrite(), and fclose()


> Let me know if we still want to go with native OS calls.
>
>

-1 for OS call
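
For reference, a copy loop based on fopen()/fread()/fwrite()/fclose() can stay
quite small.  Here is a minimal, self-contained sketch of that shape (the file
names, buffer size, and error handling are only illustrative, and this is not
the code proposed for the patch):

#include <stdio.h>
#include <stdlib.h>

/* Minimal stdio-based file copy, roughly the shape discussed above. */
static void
copy_file_stdio(const char *src, const char *dst)
{
    char        buf[64 * 1024];     /* arbitrary buffer size */
    size_t      nread;
    FILE       *in = fopen(src, "rb");
    FILE       *out = fopen(dst, "wb");

    if (in == NULL || out == NULL)
    {
        fprintf(stderr, "could not open \"%s\" or \"%s\"\n", src, dst);
        exit(1);
    }

    while ((nread = fread(buf, 1, sizeof(buf), in)) > 0)
    {
        if (fwrite(buf, 1, nread, out) != nread)
        {
            fprintf(stderr, "could not write to \"%s\"\n", dst);
            exit(1);
        }
    }
    if (ferror(in))
    {
        fprintf(stderr, "could not read from \"%s\"\n", src);
        exit(1);
    }

    fclose(in);
    if (fclose(out) != 0)
    {
        fprintf(stderr, "could not close \"%s\"\n", dst);
        exit(1);
    }
}

int
main(int argc, char **argv)
{
    if (argc != 3)
    {
        fprintf(stderr, "usage: %s source destination\n", argv[0]);
        return 1;
    }
    copy_file_stdio(argv[1], argv[2]);
    return 0;
}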


>
>> --
>> Robert Haas
>> EnterpriseDB: http://www.enterprisedb.com
>> The Enterprise PostgreSQL Company
>>
>
>
> --
> Jeevan Chalke
> Technical Architect, Product Development
> EnterpriseDB Corporation
> The Enterprise PostgreSQL Company
>
>

-- 
Ibrar Ahmed


Re: Add "password_protocol" connection parameter to libpq

2019-08-13 Thread Jonathan S. Katz
On 8/13/19 12:25 PM, Jeff Davis wrote:
> On Tue, 2019-08-13 at 11:56 +0900, Michael Paquier wrote:
>> I tend to prefer #2 as well and that's the kind of approach we were
>> tending to agree on when we discussed this issue during the v11 beta
>> for the downgrade issues with libpq.  And as you say extend it so as
>> we can apply filtering of more AUTH_REQ requests, inclusing GSS and
>> krb5.
> 
> Can you please offer a concrete proposal? I know the proposals I've put
> out aren't perfect (otherwise there wouldn't be three of them), so if
> you have something better, please share.

I think all of them get at the same thing, i.e. specifying which
password protocol you want to use, and a lot of it is a matter of how
much onus we want to put on the user.

Back to the three proposals[1], I've warmed up to #3 a bit. I do think it
puts more onus on the client to set the correct knobs to get the desired
outcome, but what I like is the specific `channel_binding=require`
attribute.

However, I don't think it's completely future-proof against adding a new hash
digest. If we wanted to prevent someone from using scram-sha-256 in a
scram-sha-512 world, we'd likely need an option for that.

Alternatively, we could combine 2 & 3, e.g.:

  channel_binding = {disable|prefer|require}

  # comma-separated list of protocols that are ok to the user, remove
  # ones you don't want. empty means all is ok
  password_protocol = "plaintext,md5,scram-sha-256,scram-sha-256-plus"

If the client selects "channel_binding=require" but does not include a
protocol that supports it, we should error. Likewise, if the client does
something like "channel_binding=require" and
"password_protocol=scram-sha-256,scram-sha-256-plus" but the server
refuses to do channel binding, we should error.

I think this gives us both future-proofing against newer password digest
methods + the fix for the downgrade issue.

I would not be opposed to extending "password_protocol" to read
"auth_protocol" or the like and work for everything covered in AUTH_REQ,
but I would need to think about it some more.

Thanks,

Jonathan

[1]
https://www.postgresql.org/message-id/daf0017a1a5c2caabf88a4e00f66b4fcbdfeccad.camel%40j-davis.com



signature.asc
Description: OpenPGP digital signature


Re: Add "password_protocol" connection parameter to libpq

2019-08-13 Thread Jeff Davis
On Tue, 2019-08-13 at 16:51 -0400, Jonathan S. Katz wrote:
> Alternatively, we could combine 2 & 3, e.g.:
> 
>   channel_binding = {disable|prefer|require}
> 
>   # comma-separated list of protocols that are ok to the user, remove
>   # ones you don't want. empty means all is ok
>   password_protocol = "plaintext,md5,scram-sha-256,scram-sha-256-
> plus"

I still feel like lists are over-specifying things. Let me step back
and offer an MVP of a single new parameter:

  channel_binding={prefer|require}

And it has a lot of benefits:
* solves the immediate need to make channel binding useful, which
is a really nice feature
* compatible with most of the other proposals we're considering, so
we can always extend it when we have a better understanding and
consensus
* clear purpose for the user
* doesn't introduce new concepts that might be confusing to the
user, like SASL or the use of "-plus" to mean "with channel binding"
* guides users toward the good practice of using SSL and SCRAM
* simple to implement

The other use cases are less clear to me, and seem less urgent.

Regards,
Jeff Davis






12's AND CHAIN doesn't chain when transaction raised an error

2019-08-13 Thread Philip Dubé
The easiest way to see this is to BEGIN READ ONLY and then attempt an insert.
Execute either COMMIT AND CHAIN or ROLLBACK AND CHAIN and attempt the insert a
second time.

This seems incorrect. The documentation should at least point out this behavior
if it's intended.
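
For anyone trying to reproduce this, here is a minimal libpq sketch of the
sequence described above (chain_test is just a hypothetical scratch table, and
the comments describe the reported behavior, not necessarily the intended one):

#include <stdio.h>
#include <libpq-fe.h>

static void
run(PGconn *conn, const char *sql)
{
    PGresult   *res = PQexec(conn, sql);

    printf("%s -> %s\n", sql, PQresStatus(PQresultStatus(res)));
    PQclear(res);
}

int
main(void)
{
    PGconn     *conn = PQconnectdb("");     /* connection settings from environment */

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }

    run(conn, "CREATE TABLE chain_test (i int)");   /* scratch table */
    run(conn, "BEGIN READ ONLY");
    run(conn, "INSERT INTO chain_test VALUES (1)"); /* fails: read-only transaction */
    run(conn, "COMMIT AND CHAIN");      /* transaction is aborted, so this rolls back and chains */
    run(conn, "INSERT INTO chain_test VALUES (1)"); /* reportedly succeeds: READ ONLY not carried over */
    run(conn, "ROLLBACK");
    run(conn, "DROP TABLE chain_test");

    PQfinish(conn);
    return 0;
}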


Re: Add "password_protocol" connection parameter to libpq

2019-08-13 Thread Jonathan S. Katz
On 8/13/19 6:25 PM, Jeff Davis wrote:
> On Tue, 2019-08-13 at 16:51 -0400, Jonathan S. Katz wrote:
>> Alternatively, we could combine 2 & 3, e.g.:
>>
>>   channel_binding = {disable|prefer|require}
>>
>>   # comma-separated list of protocols that are ok to the user, remove
>>   # ones you don't want. empty means all is ok
>>   password_protocol = "plaintext,md5,scram-sha-256,scram-sha-256-
>> plus"
> 
> I still feel like lists are over-specifying things. Let me step back
> and offer an MVP of a single new parameter:
> 
>   channel_binding={prefer|require}
> 
> And has a lot of benefits:
> * solves the immediate need to make channel binding useful, which
> is a really nice feature
> * compatible with most of the other proposals we're considering, so
> we can always extend it when we have a better understanding and
> consensus
> * clear purpose for the user
> * doesn't introduce new concepts that might be confusing to the
> user, like SASL or the use of "-plus" to mean "with channel binding"
> * guides users toward the good practice of using SSL and SCRAM
> * simple to implement

+1; I agree with your overall argument. The only thing I debate is whether we
want to have an explicit "disable" option. Looking at the negotiation
standard[1] specified for channel binding with SCRAM, I don't think we
need an explicit disable option. I can't think of any good use cases for
"disable" off the top of my head either. The only point in its favor is that it
would be consistent with some of our other parameters in terms of having an
explicit "opt-out."

Jonathan

[1] https://tools.ietf.org/html/rfc5802#section-6



signature.asc
Description: OpenPGP digital signature


Re: Unexpected "shared memory block is still in use"

2019-08-13 Thread Tom Lane
Noah Misch  writes:
> On Fri, May 10, 2019 at 04:46:40PM -0400, Tom Lane wrote:
>> Done now, but while thinking more about the issue, I had an idea: why is
>> it that we base the shmem key on the postmaster's port number, and not
>> on the data directory's inode number?  Using the port number not only
>> increases the risk of collisions (though admittedly only in testing
>> situations), but it *decreases* our ability to detect real conflicts.
>> Consider case where DBA wants to change the installation's port number,
>> and he edits postgresql.conf, but then uses "kill -9 && rm postmaster.pid"
>> rather than some saner way of stopping the old postmaster.  When he
>> starts the new one, it won't detect any remaining children of the old
>> postmaster because it'll be looking in the wrong range of shmem keys.
>> It seems like something tied to the data directory's identity would
>> be much more trustworthy.

> Good point.  Since we now ignore (SHMSTATE_FOREIGN) any segment that bears
> (st_dev,st_ino) not matching $PGDATA, the change you describe couldn't make us
> fail to detect a real conflict or miss a cleanup opportunity.  It would reduce
> the ability to test sysv_shmem.c; I suppose one could add a debug GUC to
> override the start of the key space.

Attached is a draft patch to change both shmem and sema key selection
to be based on data directory inode rather than port.

I considered using "st_ino ^ st_dev", or some such, but decided that
that would largely just make it harder to manually correlate IPC
keys with running postmasters.  It's generally easy to find out the
data directory inode number with "ls", but the extra work to find out
and XOR in the device number is not so easy, and it's not clear what
it'd buy us in typical scenarios.

The Windows code seems fine as-is: it's already using data directory
name, not port, to set up shmem, and it doesn't need anything for
semaphores.

I'm not quite sure what's going on in src/test/recovery/t/017_shm.pl.
As expected, the test for port number non-collision no longer sees
a failure.  After fixing that, the test passes, but it takes a
ridiculously long time (minutes); apparently each postmaster start/stop
cycle takes much longer than it ought to.  I suppose this patch is
breaking its assumptions, but I've not studied it.  We'd have to do
something about that before this would be committable.

I'll add this to the next commitfest.

regards, tom lane

diff --git a/src/backend/port/posix_sema.c b/src/backend/port/posix_sema.c
index 3370adf..bdd6552 100644
--- a/src/backend/port/posix_sema.c
+++ b/src/backend/port/posix_sema.c
@@ -29,6 +29,7 @@
 #include 
 #include 
 #include 
+#include <sys/stat.h>
 
 #include "miscadmin.h"
 #include "storage/ipc.h"
@@ -181,10 +182,6 @@ PGSemaphoreShmemSize(int maxSemas)
  * are acquired here or in PGSemaphoreCreate, register an on_shmem_exit
  * callback to release them.
  *
- * The port number is passed for possible use as a key (for Posix, we use
- * it to generate the starting semaphore name).  In a standalone backend,
- * zero will be passed.
- *
  * In the Posix implementation, we acquire semaphores on-demand; the
  * maxSemas parameter is just used to size the arrays.  For unnamed
  * semaphores, there is an array of PGSemaphoreData structs in shared memory.
@@ -196,8 +193,22 @@ PGSemaphoreShmemSize(int maxSemas)
  * we don't have to expose the counters to other processes.)
  */
 void
-PGReserveSemaphores(int maxSemas, int port)
+PGReserveSemaphores(int maxSemas)
 {
+	struct stat statbuf;
+
+	/*
+	 * We use the data directory's inode number to seed the search for free
+	 * semaphore keys.  This minimizes the odds of collision with other
+	 * postmasters, while maximizing the odds that we will detect and clean up
+	 * semaphores left over from a crashed postmaster in our own directory.
+	 */
+	if (stat(DataDir, &statbuf) < 0)
+		ereport(FATAL,
+(errcode_for_file_access(),
+ errmsg("could not stat data directory \"%s\": %m",
+		DataDir)));
+
 #ifdef USE_NAMED_POSIX_SEMAPHORES
 	mySemPointers = (sem_t **) malloc(maxSemas * sizeof(sem_t *));
 	if (mySemPointers == NULL)
@@ -215,7 +226,7 @@ PGReserveSemaphores(int maxSemas, int port)
 
 	numSems = 0;
 	maxSems = maxSemas;
-	nextSemKey = port * 1000;
+	nextSemKey = statbuf.st_ino;
 
 	on_shmem_exit(ReleaseSemaphores, 0);
 }
diff --git a/src/backend/port/sysv_sema.c b/src/backend/port/sysv_sema.c
index ac5106f..a1652cc 100644
--- a/src/backend/port/sysv_sema.c
+++ b/src/backend/port/sysv_sema.c
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include <sys/stat.h>
 #ifdef HAVE_SYS_IPC_H
 #include 
 #endif
@@ -301,10 +302,6 @@ PGSemaphoreShmemSize(int maxSemas)
  * are acquired here or in PGSemaphoreCreate, register an on_shmem_exit
  * callback to release them.
  *
- * The port number is passed for possible use as a key (for SysV, we use
- * it to generate the starting semaphore key).  In a standalone backend,
- * zero will be passed.
- *
  * In the SysV impleme

BF failure: could not open relation with OID XXXX while querying pg_views

2019-08-13 Thread Thomas Munro
Hi,

Here are three strange recent failures in the "rules" test:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=blenny&dt=2019-08-13%2022:19:27
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=alewife&dt=2019-07-27%2009:39:05
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dory&dt=2019-07-18%2003:00:27

They all raised "ERROR:  could not open relation with OID "
while running:

 SELECT viewname, definition FROM pg_views
 WHERE schemaname IN ('pg_catalog', 'public')
 ORDER BY viewname;

-- 
Thomas Munro
https://enterprisedb.com




Re: pg_upgrade fails with non-standard ACL

2019-08-13 Thread Stephen Frost
Greetings,

* Bruce Momjian (br...@momjian.us) wrote:
> On Tue, Aug 13, 2019 at 07:04:42PM +0300, Anastasia Lubennikova wrote:
> > Maybe, as a compromise, we can reset grants to default for all changed
> > functions
> > and also generate a script that will allow a user to preserve privileges of
> > the
> > old cluster by analogy with analyze_new_cluster script.
> > What do you think?
> 
> I agree pg_upgrade should work without user correction as much as
> possible.  However, as you can see, it can fail when user objects
> reference system table objects that have changed between major releases.

Right.

> As much as it would be nice if the release notes covered all that, and
> we updated pg_upgrade to somehow handle them, it just isn't realistic. 
> As we can see here, the problems often take years to show up, and even
> then there were probably many other people who had the problem who never
> reported it to us.

Yeah, the possible changes when you think about column-level privileges
as well really get to be quite large..

This is why my thinking is that we should come up with additional
default roles, which aren't tied to specific catalog structures but
instead are for a more general set of capabilities which we manage and
users can either decide to use, or not.  If they decide to work with the
individual functions then they have to manage the upgrade process if and
when we make changes to those functions.

> I think a realistic approach is to come up with a list of all the user
> behaviors that can cause pg_upgrade to break (by reviewing previous
> pg_upgrade bug reports), and then add code to pg_upgrade to detect them
> and either fix them or report them in --check mode.

In this case, we could, at least conceptually, perform a comparison
between the different major versions and then check for any non-default
privileges for any of the objects changed and then report on those in
--check mode with a recommendation to revert to the default privileges
in the old cluster before running pg_upgrade, and then apply whatever
privileges are desired in the new cluster after the upgrade completes.

> In summary, I am saying that the odds that patch authors, committers,
> release note writers, and pg_upgrade maintainers are going to form a
> consistent work flow that catches all these changes is unrealistic ---
> our best bet is to create something in the pg_upgrade code to detects
> this.  pg_upgrade already connects to the old and new cluster, so
> technically it can perform system table modification checks itself.

It'd be pretty neat if pg_upgrade could connect to the old and new
clusters concurrently and then perform a generic catalog comparison
between them and identify all objects which have been changed and
determine if there's any non-default ACLs or dependencies on the catalog
objects which are different between the clusters.  That seems like an
awful lot of work though, and I'm not sure there's really any need,
given that we don't change the catalog for a given major version- we
could just generate the list using some queries whenever we release a
new major version and update pg_upgrade with it.

> The only positive is that when pg_upgrade does fail, at least we have a
> system that clearly points to the cause, but unfortunately usually at
> run-time, not at --check time.

Getting it to be at check time would certainly be a great improvement.

Solving this in pg_upgrade does seem like it's probably the better
approach rather than trying to do it in pg_dump.  Unfortunately, that
likely means that all we can do is have pg_upgrade point out to the user
when something will fail, with recommendations on how to address it, but
that's also something users are likely used to and willing to accept,
and puts the onus on them to consider their ACL decisions when we're
making catalog changes, and it keeps these issues out of pg_dump.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: Problem with default partition pruning

2019-08-13 Thread Amit Langote
On Wed, Aug 14, 2019 at 12:25 AM Alvaro Herrera
 wrote:
> On 2019-Aug-13, Amit Langote wrote:
>
> > Thanks a lot for revising.  Looks neat, except:
> >
> > + * This is a measure of last resort only to be used because the default
> > + * partition cannot be pruned using the steps; regular pruning, which 
> > is
> > + * cheaper, is sufficient when no default partition exists.
> >
> > This text appears to imply that the default can *never* be pruned with
> > steps.  Maybe, the first sentence should read something like: "...the
> > default cannot be pruned using the steps generated from clauses that
> > contradict the parent's partition constraint".
>
> Thanks!  I have pushed it with this change.

Thank you Alvaro.  This takes care of all the issues around default
partition pruning reported on this thread.  Thanks everyone.

Regards,
Amit




Re: pg_upgrade fails with non-standard ACL

2019-08-13 Thread Bruce Momjian
On Tue, Aug 13, 2019 at 08:28:12PM -0400, Stephen Frost wrote:
> Getting it to be at check time would certainly be a great improvement.
> 
> Solving this in pg_upgrade does seem like it's probably the better
> approach rather than trying to do it in pg_dump.  Unfortunately, that
> likely means that all we can do is have pg_upgrade point out to the user
> when something will fail, with recommendations on how to address it, but
> that's also something users are likely used to and willing to accept,
> and puts the onus on them to consider their ACL decisions when we're
> making catalog changes, and it keeps these issues out of pg_dump.

Yeah, I think we just need to bite the bullet and create some
infrastructure to help folks solve the problem.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: Recent failures in IsolationCheck deadlock-hard

2019-08-13 Thread Thomas Munro
On Tue, Aug 6, 2019 at 6:18 PM Tom Lane  wrote:
> Thomas Munro  writes:
> > On Sat, Aug 3, 2019 at 2:11 AM Tom Lane  wrote:
> >> Also worth noting is that anole failed its first try at the new
> >> deadlock-parallel isolation test:
> >> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=anole&dt=2019-08-01%2015%3A48%3A16
>
> > And friarbird (also CLOBBER_CACHE_ALWAYS) fails every time.
>
> Yeah, there have been half a dozen failures since deadlock-parallel
> went in, mostly on critters that are slowed by CLOBBER_CACHE_ALWAYS
> or valgrind.  I've tried repeatedly to reproduce that here, without
> success :-(.  It's unclear whether the failures represent a real
> code bug or just a problem in the test case, so I don't really want
> to speculate about fixes till I can reproduce it.

I managed to reproduce a failure that looks a lot like lousyjack's
(note that there are two slightly different failure modes):

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lousyjack&dt=2019-08-05%2011:33:02

I did that by changing the deadlock_timeout values for sessions d1 and
d2 to just a few milliseconds on my slowest computer, guessing that
this might be a race involving the deadlock timeout and the time it
takes for workers to fork and join a lock queue.  While normally
deadlock.c with DEBUG_DEADLOCK defined prints out something like this
during this test:

DeadLockCheck: lock 0x80a2812d0 queue  33087 33088 33089 33090 33091
rearranged to: lock 0x80a2812d0 queue  33091 33090 33089 33088 33087

... when it failed like lousyjack my run printed out:

DeadLockCheck: lock 0x80a2721f8 queue  33108 33114
rearranged to: lock 0x80a2721f8 queue  33114 33108

... and then it hung for a while, so I could inspect the lock table
and see that PID 33108 was e1l (not granted), and PID 33114 was gone
but was almost certainly the first worker for d2a1 (I can tell because
33110-33113 are the workers for d1a2 and they're still waiting and
d2a1's first worker should have had the next sequential PID, on my
OS).

Another thing I noticed is that all 4 times I managed to reproduce
this, the "rearranged to" queue had only two entries; I can understand
that d1's workers might not feature yet due to bad timing, but it's
not clear to me why there should always be only one d2a1 worker and
not more.  I don't have time to study this further today and I might
be way off, but my first guess is that in theory we need a way to make
sure that the d1-e2 edge exists before d2's deadlock timer expires,
no?  That's pretty tricky though, so maybe we just need to crank the
times up.

-- 
Thomas Munro
https://enterprisedb.com




Re: [HACKERS] CLUSTER command progress monitor

2019-08-13 Thread Tatsuro Yamada

Hi Alvaro and All,

On 2019/08/13 14:40, Tatsuro Yamada wrote:

Hi Alvaro!

On 2019/08/02 3:43, Alvaro Herrera wrote:

Hmm, I'm trying this out now and I don't see the index_rebuild_count
ever go up.  I think it's because the indexes are built using parallel
index build ... or maybe it was the table AM changes that moved things
around, not sure.  There's a period at the end when the CLUSTER command
keeps working, but it's gone from pg_stat_progress_cluster.



Thanks for your report.
I'll investigate it. :)



I did "git bisect" and found the commit:

 03f9e5cba0ee1633af4abe734504df50af46fbd8
 Report progress of REINDEX operations

In src/backend/catalog/index.c,
CLUSTER progress reporting increases index_rebuild_count in reindex_relation()
by calling pgstat_progress_update_param().
However, reindex_relation() calls reindex_index(), and REINDEX progress
reporting lives in the latter function, which calls
pgstat_progress_start_command() and pgstat_progress_end_command() for REINDEX
progress reporting.
Therefore, CLUSTER progress reporting fails to update index_rebuild_count
because the update mistakenly goes to the REINDEX view, I think.

My idea to fix that is the following:

 - Add a target view name parameter to the progress monitor's API.

   For example, change

     pgstat_progress_update_param(PROGRESS_CLUSTER_INDEX_REBUILD_COUNT, i)

   to

     pgstat_progress_update_param(PROGRESS_CLUSTER_VIEW,
                                  PROGRESS_CLUSTER_INDEX_REBUILD_COUNT, i).

However, I'm not sure whether it is feasible because I haven't read
the code of the API yet.
What do you think about that? :)

Thanks,
Tatsuro Yamada






Re: Add "password_protocol" connection parameter to libpq

2019-08-13 Thread Michael Paquier
On Tue, Aug 13, 2019 at 04:51:57PM -0400, Jonathan S. Katz wrote:
> On 8/13/19 12:25 PM, Jeff Davis wrote:
>> On Tue, 2019-08-13 at 11:56 +0900, Michael Paquier wrote:
>>> I tend to prefer #2 as well and that's the kind of approach we were
>>> tending to agree on when we discussed this issue during the v11 beta
>>> for the downgrade issues with libpq.  And as you say extend it so as
>>> we can apply filtering of more AUTH_REQ requests, inclusing GSS and
>>> krb5.
>> 
>> Can you please offer a concrete proposal? I know the proposals I've put
>> out aren't perfect (otherwise there wouldn't be three of them), so if
>> you have something better, please share.
> 
> I think all of them get at the same thing, i.e. specifying which
> password protocol you want to use, and a lot of it is a matter of how
> much onus we want to put on the user.

What I had in mind was a comma-separated list of authorized protocols
which can be specified as a connection parameter, which extends to all
the types of AUTH_REQ requests that libpq can understand, plus an
extra for channel binding.  I also liked the idea mentioned upthread
of "any" to be an alias to authorize everything, which should be the
default.  So you basically get at that:
auth_protocol = 
{any,password,md5,scram-sha-256,scram-sha-256-plus,krb5,gss,sspi}

So from an implementation point of view, just using bit flags would
make things rather clean.
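
As a rough illustration of the bit-flag idea (the flag names, values, and the
parsing helper below are made up for this sketch, not taken from any patch),
the comma-separated connection parameter could be folded into a mask along
these lines, which libpq could then test against each incoming AUTH_REQ:

#include <stdbool.h>
#include <string.h>

/* Hypothetical flags for the auth_protocol values discussed above. */
#define ALLOW_AUTH_PASSWORD             (1 << 0)
#define ALLOW_AUTH_MD5                  (1 << 1)
#define ALLOW_AUTH_SCRAM_SHA_256        (1 << 2)
#define ALLOW_AUTH_SCRAM_SHA_256_PLUS   (1 << 3)
#define ALLOW_AUTH_KRB5                 (1 << 4)
#define ALLOW_AUTH_GSS                  (1 << 5)
#define ALLOW_AUTH_SSPI                 (1 << 6)
#define ALLOW_AUTH_ANY                  (~0U)

/*
 * Parse a comma-separated auth_protocol value into a bit mask.
 * Returns false on an unrecognized keyword.
 */
static bool
parse_auth_protocol(const char *value, unsigned int *mask)
{
    char        buf[256];
    char       *tok;

    *mask = 0;
    strncpy(buf, value, sizeof(buf) - 1);
    buf[sizeof(buf) - 1] = '\0';

    for (tok = strtok(buf, ","); tok != NULL; tok = strtok(NULL, ","))
    {
        if (strcmp(tok, "any") == 0)
            *mask |= ALLOW_AUTH_ANY;
        else if (strcmp(tok, "password") == 0)
            *mask |= ALLOW_AUTH_PASSWORD;
        else if (strcmp(tok, "md5") == 0)
            *mask |= ALLOW_AUTH_MD5;
        else if (strcmp(tok, "scram-sha-256") == 0)
            *mask |= ALLOW_AUTH_SCRAM_SHA_256;
        else if (strcmp(tok, "scram-sha-256-plus") == 0)
            *mask |= ALLOW_AUTH_SCRAM_SHA_256_PLUS;
        else if (strcmp(tok, "krb5") == 0)
            *mask |= ALLOW_AUTH_KRB5;
        else if (strcmp(tok, "gss") == 0)
            *mask |= ALLOW_AUTH_GSS;
        else if (strcmp(tok, "sspi") == 0)
            *mask |= ALLOW_AUTH_SSPI;
        else
            return false;
    }
    return true;
}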

> Back to the three proposals[1], I've warmed up to #3 a bit. I do think it
> puts more onus on the client to set the correct knobs to get the desired
> outcome, but what I like is the specific `channel_binding=require`
> attribute.

I could see a point in separating the channel binding part into a
second parameter though.  We don't have (at least yet) an hba option
to allow only channel binding with scram, so a one-one mapping with
the elements of the connection parameter brings some consistency.

> If the client selects "channel_binding=require" but does not include a
> protocol that supports it, we should error.

Yep.

> Likewise, if the client does
> something like "channel_binding=require" and
> "password_protocol=scram-sha-256,scram-sha-256-plus" but the server
> refuses to do channel binding, we should error.

If using a second parameter to control channel binding requirement, I
don't think that there is any point in keeping scram-sha-256-plus in
password_protocol.

> I would not be opposed to extending "password_protocol" to read
> "auth_protocol" or the like and work for everything covered in AUTH_REQ,
> but I would need to think about it some more.

And for this one I would like to push for not only having
password-based methods considered.
--
Michael


signature.asc
Description: PGP signature


Re: Problem while updating a foreign table pointing to a partitioned table on foreign server

2019-08-13 Thread Etsuro Fujita
Hi Alvaro and Michael,

On Tue, Aug 13, 2019 at 11:04 PM Alvaro Herrera
 wrote:
> On 2019-Aug-13, Michael Paquier wrote:
> > On Mon, Aug 12, 2019 at 05:32:08PM -0400, Alvaro Herrera wrote:
> > > So do we have an updated patch for this?  It's been a while since this
> > > patch saw any movement ...

Thanks for reminding me about this, Alvaro!

> > Please note that this involves a couple of people in Japan, and this
> > week is the Obon vacation season for a lot of people.  So there could
> > be delays in replies.

Yeah, I was on that vacation.  Thanks, Michael!

> Understood, thanks for the info.  We still have two weeks to the start
> of commitfest anyway.  And since it's been sleeping since November 2018,
> I guess we can wait a little bit yet.

This is my TODO item for PG13, but I'll give priority to other things
in the next commitfest.  If anyone wants to work on it, feel free;
else I'll move this to the November commitfest when it opens.

Best regards,
Etsuro Fujita




Re: [HACKERS] CLUSTER command progress monitor

2019-08-13 Thread Michael Paquier
On Wed, Aug 14, 2019 at 11:38:01AM +0900, Tatsuro Yamada wrote:
> On 2019/08/13 14:40, Tatsuro Yamada wrote:
>> On 2019/08/02 3:43, Alvaro Herrera wrote:
>>> Hmm, I'm trying this out now and I don't see the index_rebuild_count
>>> ever go up.  I think it's because the indexes are built using parallel
>>> index build ... or maybe it was the table AM changes that moved things
>>> around, not sure.  There's a period at the end when the CLUSTER command
>>> keeps working, but it's gone from pg_stat_progress_cluster.
>> 
>> Thanks for your report.
>> I'll investigate it. :)
> 
> I did "git bisect" and found the commit:
> 
>  03f9e5cba0ee1633af4abe734504df50af46fbd8
>  Report progress of REINDEX operations

I am adding an open item for this one.
--
Michael


signature.asc
Description: PGP signature


Re: using explicit_bzero

2019-08-13 Thread Michael Paquier
On Tue, Aug 13, 2019 at 10:30:39AM +0200, Peter Eisentraut wrote:
> Another patch, to attempt to fix the Windows build.

I have not been able to test the compilation, but the changes look
good on this side.
--
Michael


signature.asc
Description: PGP signature


Re: BF failure: could not open relation with OID XXXX while querying pg_views

2019-08-13 Thread Tom Lane
Thomas Munro  writes:
> Here are three strange recent failures in the "rules" test:
> ...
> They all raised "ERROR:  could not open relation with OID "
> while running:
>  SELECT viewname, definition FROM pg_views
>  WHERE schemaname IN ('pg_catalog', 'public')
>  ORDER BY viewname;

I think the problem is probably that Peter ignored this bit of advice
in parallel_schedule:

# rules cannot run concurrently with any test that creates
# a view or rule in the public schema

when he inserted "collate.linux.utf8" concurrently with "rules".

(I suspect BTW that the point is not so much that you better not
*create* such an object, as that you better not *drop* it concurrently
with that query.)

regards, tom lane




Re: Regression test failure in regression test temp.sql

2019-08-13 Thread Michael Paquier
On Tue, Aug 13, 2019 at 12:15:26PM +0900, Michael Paquier wrote:
> Indeed.  Good catch!  Perhaps you would like to fix it?  There are two
> queries in need of an ORDER BY, and the second query even uses two
> semicolons (spoiler warning: that's a nit).

And fixed.  The test case was new as of v12.
--
Michael


signature.asc
Description: PGP signature


Re: SegFault on 9.6.14

2019-08-13 Thread Amit Kapila
On Tue, Aug 13, 2019 at 9:28 PM Tom Lane  wrote:
>
> Amit Kapila  writes:
> > On Tue, Aug 13, 2019 at 3:18 AM Tom Lane  wrote:
> >> To clarify my position --- I think it's definitely possible to improve
> >> the situation a great deal.  We "just" have to pass down more information
> >> about whether rescans are possible.
>
> > Right, you have speculated above that it is possible via adding some
> > eflag bits.  Can you please describe a bit more about that idea, so
> > that somebody else can try to write a patch?
>
> Well, there are two components to solving this problem:
>
> 1. What are we going to do about the executor's external API?
>
> Right now, callers of ExecutorStart don't have to say whether they
> might call ExecutorRewind.  We need some way for callers to make a
> binding promise that they won't do any such thing.  Perhaps we just
> want to invent another flag that's like EXEC_FLAG_BACKWARD, but it's
> not clear how it should interact with the existing "soft" REWIND
> flag.

Yeah, making it interact with REWIND will be a bit of a challenge, as I
think to some extent the REWIND flag already indicates the same thing.  Do I
understand correctly that we have some form of rule such that if
EXEC_FLAG_REWIND is set or the node's chgParam is NULL, then we can expect
that the node can support rescan?  If that is true, then maybe we need to
design this new flag in such a way that it covers the existing cases of
REWIND as well.

Another point I am wondering about is why we can't use the existing
REWIND flag to solve the current issue: basically, if we have access to
that information in nodeLimit.c (ExecLimit), then can't we just pass
it down to ExecShutdownNode?   I guess the problem could be that if
the Limit node doesn't support REWIND, but one of the nodes beneath it
does, then we won't be able to rely on the information
passed to ExecShutdownNode.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: SegFault on 9.6.14

2019-08-13 Thread Tom Lane
Amit Kapila  writes:
> Another point which I am wondering is why can't we use the existing
> REWIND flag to solve the current issue, basically if we have access to
> that information in nodeLimit.c (ExecLimit), then can't we just pass
> down that to ExecShutdownNode?

The existing REWIND flag tells subnodes whether they should *optimize*
for getting rewound or not.  I don't recall right now (well past
midnight) why that seemed like a useful definition, but if you grep for
places that are paying attention to that flag, I'm sure you'll find out.

We probably don't want to give up that distinction --- if it had been
equally good to define the flag as a hard yes-or-no, I'm sure we would
have taken that definition, because it's simpler.

regards, tom lane




Re: clean up obsolete initdb locale handling

2019-08-13 Thread Peter Eisentraut
On 2019-08-12 20:17, Tom Lane wrote:
> Peter Eisentraut  writes:
>> OK, let's do it like that.  Updated patch attached.
> 
> LGTM, but I don't have the ability to test it on Windows.

Committed, after some testing on Windows.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: BF failure: could not open relation with OID XXXX while querying pg_views

2019-08-13 Thread Peter Eisentraut
On 2019-08-14 05:52, Tom Lane wrote:
> Thomas Munro  writes:
>> Here are three strange recent failures in the "rules" test:
>> ...
>> They all raised "ERROR:  could not open relation with OID "
>> while running:
>>  SELECT viewname, definition FROM pg_views
>>  WHERE schemaname IN ('pg_catalog', 'public')
>>  ORDER BY viewname;
> 
> I think the problem is probably that Peter ignored this bit of advice
> in parallel_schedule:
> 
> # rules cannot run concurrently with any test that creates
> # a view or rule in the public schema
> 
> when he inserted "collate.linux.utf8" concurrently with "rules".

This test file is set up to create everything in the "collate_tests"
schema.  Is that not working correctly?

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [proposal] de-TOAST'ing using a iterator

2019-08-13 Thread John Naylor
On Sat, Aug 3, 2019 at 11:11 PM Binguo Bao  wrote:
>
> John Naylor  于2019年8月2日周五 下午3:12写道:
>>
>> I like the use of a macro here. However, I think we can find a better
>> location for the definition. See the header comment of fmgr.h:
>> "Definitions for the Postgres function manager and function-call
>> interface." Maybe tuptoaster.h is as good a place as any?
>
> PG_DETOAST_ITERATE isn't a sample function-call interface,
> but I notice that PG_FREE_IF_COPY is also defined in fmgr.h, whose logic is
> similar to PG_DETOAST_ITERATE, make condition check first and then
> decide whether to call the function. Besides, PG_DETOAST_DATUM,
> PG_DETOAST_DATUM_COPY, PG_DETOAST_DATUM_SLICE,
> PG_DETOAST_DATUM_PACKED are all defined in fmgr.h, it is reasonable
> to put all the de-TOAST interface together.

Hmm, it's strange that those macros ended up there, but now I see why
it makes sense to add new ones there also.

>> Okay, and see these functions now return void. The orignal
>> pglz_decompress() returned a value that was check against corruption.
>> Is there a similar check we can do for the iterative version?
>
> As far as I know, we can only do such a check after all compressed data is
> decompressed.
> If we are slicing, we can't do the check.

Okay.

>> >> 3). Speaking of pglz_decompress_iterate(), I diff'd it with
>> >> pglz_decompress(), and I have some questions on it:
>> >>
>> >> a).
>> >> + srcend = (const unsigned char *) (source->limit == source->capacity
>> >> ? source->limit : (source->limit - 4));
>> >>
>> >> What does the 4 here mean in this expression?
>> >
>> > Since we fetch chunks one by one, if we make srcend equals to the source 
>> > buffer limit,
>> > In the while loop "while (sp < srcend && dp < destend)", sp may exceed the 
>> > source buffer limit and read unallocated bytes.
>>
>> Why is this? That tells me the limit is incorrect. Can the setter not
>> determine the right value?
>
> There are three statments change `sp` value in the while loop `while (sp < 
> srcend && dp < destend)`:
> `ctrl = *sp++;`
> `off = ((sp[0]) & 0xf0) << 4) | sp[1]; sp += 2;`
> `len += *sp++`
> Although we make sure `sp` is less than `srcend` when enter while loop, `sp` 
> is likely to
> go beyond the `srcend` in the loop, and we should ensure that `sp` is always 
> smaller than `buf->limit` to avoid
> reading unallocated data. So, `srcend` can't be initialized to `buf->limit`. 
> Only one case is exceptional,
> we've fetched all data chunks and 'buf->limit' reaches 'buf->capacity', it's 
> imposisble to read unallocated
> data via `sp`.

Thank you for the detailed explanation and the comment.

>> Please explain further. Was the "sp + 1" correct behavior (and why),
>> or only for debugging setting ctrl/c correctly?
>
> In patch v5, If the condition is `sp < srcend`, suppose `sp = srcend - 1` 
> before
> entering the loop `while (sp < srcend && dp < destend)`, when entering the 
> loop
> and read a control byte(sp equals to `srcend` now), the program can't enter 
> the
> loop `for (; ctrlc < 8 && sp < srcend && dp < destend; ctrlc++)`, then set 
> `iter->ctrlc` to 0,
> exit the first loop and then this iteration is over. At the next iteration,
> the control byte will be reread since `iter->ctrlc` equals to 0, but the 
> previous control byte
> is not used. Changing the condition to `sp + 1 < srcend` avoid only one 
> control byte is read
> then the iterator is over.

Okay, that's quite subtle. I agree the v6/7 way is more clear in this regard.

>> Also, I don't think
>> the new logic for the ctrl/c variables is an improvement:
>>
>> 1. iter->ctrlc is intialized with '8' (even in the uncompressed case,
>> which is confusing). Any time you initialize with something not 0 or
>> 1, it's a magic number, and here it's far from where the loop variable
>> is used. This is harder to read.
>
> `iter->ctrlc` is used to record the value of `ctrl` in pglz_decompress at the 
> end of
> the last iteration(or loop). In the pglz_decompress, `ctrlc`’s valid value is 
> 0~7,
> When `ctrlc` reaches 8,  a control byte is read from the source
> buffer to `ctrl` then set `ctrlc` to 0. And a control byte should be read from the
> from the
> source buffer to `ctrlc` on the first iteration. So `iter->ctrlc` should be 
> intialized with '8'.

My point here is it looks strange out of context, but "0" looked
normal. Maybe a comment in init_detoast_buffer(), something like "8
means read a control byte from the source buffer on the first
iteration, see pg_lzdecompress_iterate()".

Or, possibly, we could have a macro like INVALID_CTRLC. That might
even improve the readability of the original function. This is just an
idea, and maybe others would disagree, so you don't need to change it
for now.
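
In case it helps, here is a minimal sketch of what I mean (the macro name and
its placement are just illustrative, not something already in the patch):

/*
 * A pglz control byte holds 8 flag bits, so the valid positions are 0..7.
 * 8 can therefore serve as a sentinel meaning "no control byte cached;
 * read one from the source buffer on the next iteration".
 */
#define INVALID_CTRLC   8

/* then, in init_detoast_buffer(), instead of a bare "8": */
iter->ctrlc = INVALID_CTRLC;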

>> 3. At the end of the loop, iter->ctrl/c are unconditionally set. In
>> v5, there was a condition which would usually avoid this copying of
>> values through pointers.
>
> Patch v6 just records the value of `ctrlc` at the end of each iteration(or 
> loop)
> whe

Re: BF failure: could not open relation with OID XXXX while querying pg_views

2019-08-13 Thread Tom Lane
Peter Eisentraut  writes:
> On 2019-08-14 05:52, Tom Lane wrote:
>> Thomas Munro  writes:
>>> They all raised "ERROR:  could not open relation with OID "
>>> while running:
>>> SELECT viewname, definition FROM pg_views
>>> WHERE schemaname IN ('pg_catalog', 'public')
>>> ORDER BY viewname;

>> I think the problem is probably that Peter ignored this bit of advice
>> in parallel_schedule:
>> # rules cannot run concurrently with any test that creates
>> # a view or rule in the public schema
>> when he inserted "collate.linux.utf8" concurrently with "rules".

> This test file is set up to create everything in the "collate_tests"
> schema.  Is that not working correctly?

Oh, hmm --- yeah, that should mean it's safe.  Maybe somebody incautiously
changed one of the other tests that run concurrently with "rules"?

regards, tom lane




Re: BF failure: could not open relation with OID XXXX while querying pg_views

2019-08-13 Thread Thomas Munro
On Wed, Aug 14, 2019 at 5:06 PM Tom Lane  wrote:
> Oh, hmm --- yeah, that should mean it's safe.  Maybe somebody incautiously
> changed one of the other tests that run concurrently with "rules"?

Looks like stats_ext.sql could be the problem.  It creates and drops
priv_test_view, not in a schema.  Adding Dean, author of commit
d7f8d26d.

-- 
Thomas Munro
https://enterprisedb.com




use of the term "verifier" with SCRAM

2019-08-13 Thread Peter Eisentraut
I'm confused by how the code uses the term "verifier" in relation to SCRAM.

ISTM that the code uses the term as meaning whatever is or would be
stored in pg_authid.rolpassword.

I don't see this usage supported in the RFCs.  In RFC 5802,

verifier= "v=" base64
;; base-64 encoded ServerSignature.

where

ServerSignature := HMAC(ServerKey, AuthMessage)
ServerKey   := HMAC(SaltedPassword, "Server Key")
AuthMessage := client-first-message-bare + "," +
   server-first-message + "," +
   client-final-message-without-proof

whereas what is stored in rolpassword is

SCRAM-SHA-256$<iteration count>:<salt>$<StoredKey>:<ServerKey>

where

StoredKey   := H(ClientKey)
ClientKey   := HMAC(SaltedPassword, "Client Key")

So while these are all related, I don't think it's accurate to call what
is in rolpassword a SCRAM "verifier".

RFC 5803 is titled "Lightweight Directory Access Protocol (LDAP) Schema
for Storing Salted Challenge Response Authentication Mechanism (SCRAM)
Secrets".  Following that, I think calling the contents of rolpassword a
"secret" or a "stored secret" would be better.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Custom table AMs need to include heapam.h because of BulkInsertState

2019-08-13 Thread David Rowley
On Wed, 17 Jul 2019 at 06:46, Andres Freund  wrote:
> 2) Add a execPartition.c function that returns all the used tables from
>a PartitionTupleRouting*.

Here's a patch which implements it that way.

I struggled a bit to think of a good name for the execPartition.c
function. I ended up with ExecGetRoutedToRelations. I'm open to better
ideas.

I also chose to leave the change of function signatures done in
f7c830f1a in place. I don't think the additional now unused parameter
is that out of place. Also, the function is inlined, so removing it
wouldn't help performance any.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


call_table_finish_bulk_insert_less_often.patch
Description: Binary data


Re: POC: Cleaning up orphaned files using undo logs

2019-08-13 Thread Andres Freund
Hi,

On 2019-08-06 14:18:42 -0700, Andres Freund wrote:
> Here's the last section of my low-leve review. Plan to write a higher
> level summary afterwards, now that I have a better picture of the code.

General comments:

- For a feature of this complexity, there's very little architectural
  documentation. Some individual parts have a bit, but there's basically
  nothing over-arching. That makes it extremely hard for anybody that is
  not already involved to understand the design constraints, and even
  for people involved it's hard to understand.

  I think it's very important for this to have a document that first
  explains what the goals, and non-goals, of this feature are. And then
  secondly explains the chosen architecture referencing those
  constraints.  Once that's there it's a lot easier to review this
  patchset, to discuss the overall architecture, etc.


- There are too many different email threads and git branches. The same
  patches are discussed in different threads, layers exist in slightly
  diverging versions in different git trees. Again making it very hard
  for anybody not primarily focussing on undo to join the discussion.

  I think most of the older git branches should be renamed into
  something indicating their historic status. The remaining branches
  should be referenced from a wiki page (linked to in each submission of
  a new patch version), explaining what they're about.  I don't think
  it's realistic to have people infer meaning from the current branch
  names (undo, proposal-undo-log, undo-log-storage, undo-log-storage-v2,
  undo_interface_v1, undoprocessing).

  Given the size of the the overall project it's quite possibly not
  realistic to manage the whole work in a single git branch. With
  separate git branches, as currently done, it's however hard to
  understand which version of a what layer is used.  I think at the very
  least higher layers need to indicate the version of the underlying
  layers is being used.  I suggest just adding a "v01: " style prefix to
  all commit subjects a branch is rebased onto.

  It's also currently hard to understand what version of a layer is
  being discussed. I think submissions all need to include a version
  number (c.f. git format-patch's -v option), and that version ought to
  be included in the subject line. Each new major version of a patch
  should be started as a reply to the first message of a thread, to
  keep the structure of a discussion in a managable shape. New versions
  should include explicit explanations about the major changes compared
  to the last version.


- The code quality of pretty significant parts of the patchset is not
  even close to being good enough. There are areas with a lot of code
  duplication. There are very few higher level explanations for
  interfaces. There's a lot of "i++; /* increment i to increment it */"
  style comments, but not enough higher level comments. There are
  significant issues with parts of the code that aren't noted anywhere
  in comments, leading to reviewers having to repeatedly re-discover
  them (and wasting time on that).

  There's different naming styles in related code without a discernible
  pattern (e.g. UndoRecordSetInfo being followed by
  get_undo_rec_cid_offset). The word-ordering of different layers is
  confusing (e.g. BeginUndoRecordInsert vs UndoLogBeginInsert vs
  PrepareUndoInsert). Different important externally visible functions
  have names that don't allow to determine which is supposed to do what
  (PrepareUndoInsert vs BeginUndoRecordInsert).


More specific comments:

- The whole sequencing of undo record insertion in combination with WAL
  logging does not appear to be right. It's a bit hard to say, because
  there's very little documentation on what the intended default
  sequence of operations is.

  My understanding is that the currently expected pattern is to:

  1) Collect information / perform work needed to perform the action
 that needs to be UNDO logged. E.g. perform visibility
 determinations, wait for lockers, compute new infomask, etc.

 This will likely end with the "content" page(s) (e.g. a table's
 page) being exclusively locked.

  2) Estimate space needed for all UNDO logging (BeginUndoRecordInsert)

  3) Prepare for each undo record, this includes building the
 content for each undo record. PrepareUndoInsert(). This acquires,
 pins and locks buffers for undo.

  4) begin a critical section

  5) modify content page, mark buffer dirty

  6) write UNDO, using InsertPreparedUndo()

  7) associate undo with WAL record (RegisterUndoLogBuffers)

  8) write WAL

  9) End critical section

  But despite reading through the code, including READMEs, I'm doubtful
  that's quite the intended pattern. It REALLY can't be right that one
  needs to parse many function headers to figure out how the most basic
  use of undo could possibly work.  There needs to be very clear
  documentation about how to write undo recor