Re: Add primary keys to system catalogs

2021-02-01 Thread Peter Eisentraut

On 2021-01-30 22:56, Tom Lane wrote:

Peter Eisentraut  writes:

Committed with your update, thanks.


Hmm, shouldn't there have been a catversion bump in there?


I suppose yes on the grounds that it introduces something new in a 
freshly initdb-ed database.  But I thought it wasn't necessary because 
there is no dependency between the binaries and the on-disk state.


There has already been another catversion change since, so it's no 
longer relevant.


--
Peter Eisentraut
2ndQuadrant, an EDB company
https://www.2ndquadrant.com/




Re: Is Recovery actually paused?

2021-02-01 Thread Dilip Kumar
On Mon, Feb 1, 2021 at 11:59 AM Kyotaro Horiguchi
 wrote:
>
> At Sun, 31 Jan 2021 11:24:30 +0530, Dilip Kumar  wrote 
> in
> > On Fri, Jan 29, 2021 at 4:33 PM Dilip Kumar  wrote:
> > >
> > > On Fri, Jan 29, 2021 at 3:25 PM Yugo NAGATA  wrote:
> > > >
> > > > On Thu, 28 Jan 2021 09:55:42 +0530
> > > > Dilip Kumar  wrote:
> > > >
> > > > > On Wed, Jan 27, 2021 at 2:28 PM Dilip Kumar  
> > > > > wrote:
> > > > > >
> > > > > > On Wed, Jan 27, 2021 at 2:06 PM Yugo NAGATA  
> > > > > > wrote:
> > > > > > >
> > > > > > > On Wed, 27 Jan 2021 13:29:23 +0530
> > > > > > > Dilip Kumar  wrote:
> > > > > > >
> > > > > > > > On Wed, Jan 27, 2021 at 12:50 PM Masahiko Sawada 
> > > > > > > >  wrote:
> > > > > > > > >
> > > > > > > > > On Tue, Jan 26, 2021 at 2:00 AM Robert Haas 
> > > > > > > > >  wrote:
> > > > > > > > > >
> > > > > > > > > > On Sat, Jan 23, 2021 at 6:10 AM Bharath Rupireddy
> > > > > > > > > >  wrote:
> > > > > > > > > > > +1 to just show the recovery pause state in the output of
> > > > > > > > > > > pg_is_wal_replay_paused. But, should the function name
> > > > > > > > > > > "pg_is_wal_replay_paused" be something like
> > > > > > > > > > > "pg_get_wal_replay_pause_state" or some other? To me, 
> > > > > > > > > > > when "is" exists
> > > > > > > > > > > in a function, I expect a boolean output. Others may have 
> > > > > > > > > > > better
> > > > > > > > > > > thoughts.
> > > > > > > > > >
> > > > > > > > > > Maybe we should leave the existing function 
> > > > > > > > > > pg_is_wal_replay_paused()
> > > > > > > > > > alone and add a new one with the name you suggest that 
> > > > > > > > > > returns text.
> > > > > > > > > > That would create less burden for tool authors.
> > > > > > > > >
> > > > > > > > > +1
> > > > > > > > >
> > > > > > > >
> > > > > > > > Yeah, we can do that, I will send an updated patch soon.
> > > > > > >
> > > > > > > This means pg_is_wal_replay_paused is left without any change and 
> > > > > > > this
> > > > > > > returns whether pause is requested or not? If so, it seems good 
> > > > > > > to modify
> > > > > > > the documentation of this function in order to note that this 
> > > > > > > could not
> > > > > > > return the actual pause state.
> > > > > >
> > > > > > Yes, we can say that it will return true if the replay pause is
> > > > > > requested.  I am changing that in my new patch.
> > > > >
> > > > > I have modified the patch, changes
> > > > >
> > > > > - I have added a new interface pg_get_wal_replay_pause_state to get
> > > > > the pause request state
> > > > > - Now, we are not waiting for the recovery to actually get paused so I
> > > > > think it doesn't make sense to put a lot of checkpoints to check the
> > > > > pause requested so I have removed that check from the
> > > > > recoveryApplyDelay but I think it better we still keep that check in
> > > > > the WaitForWalToBecomeAvailable because it can wait forever before the
> > > > > next wal get available.
> > > >
> > > > I think basically the check in WaitForWalToBecomeAvailable is 
> > > > independent
> > > > of the feature of pg_get_wal_replay_pause_state, that is, reporting the
> > > > actual pause state.  This function could just return 'pause requested'
> > > > if a pause is requested during waiting for WAL.
> > > >
> > > > However, I agree the change to allow recovery to transit the state to
> > > > 'paused' during WAL waiting because 'paused' has more useful information
> > > > for users than 'pause requested'.  Returning 'paused' lets users know
> > > > clearly that no more WAL are applied until recovery is resumed.  On the
> > > > other hand, when 'pause requested' is returned, the user can't say whether
> > > > the next WAL will be applied or not from this information.
> > > >
> > > > For the same reason, I think it is also useful to call 
> > > > recoveryPausesHere
> > > > in recoveryApplyDelay.
> > >
> > > IMHO the WaitForWalToBecomeAvailable wait can last until the next WAL
> > > becomes available, which the user cannot control, so it is good to put a
> > > check for the recovery pause there.  However, recoveryApplyDelay waits for
> > > the apply delay, which is configured by the user and is therefore a
> > > predictable value.  I don't have much objection to putting that check in
> > > recoveryApplyDelay as well, but I feel it is not necessary.  Any other
> > > thoughts on this?
> > >
> > > > In addition, in RecoveryRequiresIntParameter, recovery should get paused
> > > > if a parameter value has a problem.  However, 
> > > > pg_get_wal_replay_pause_state
> > > > will return 'pause requested' in this case. So, I think, we should pass
> > > > RECOVERY_PAUSED to SetRecoveryPause() instead of 
> > > > RECOVERY_PAUSE_REQUESTED,
> > > > or call CheckAndSetRecoveryPause() in the loop like 
> > > > recoveryPausesHere().
> > >
> > > Yeah, absolutely right, it must pass RECOVERY_PAUSED.  I will change
> > > this, thanks for noticing this.
> >
> > I have changed this in the new patch.
>
> It seems to work well. The 

RE: Parallel INSERT (INTO ... SELECT ...)

2021-02-01 Thread Tang, Haiying
Hi Greg,

Recently, I have been evaluating the performance of this patch (1/28 V13).
I found a regression in a test case that uses parallel insert with a bitmap
heap scan: when the target table has a primary key or an index, the patched
performance declines by 7%-19% compared with unpatched.

Could you please have a look at this?

I tried max_parallel_workers_per_gather=2/4/8, and I didn't tune other
parameters (such as GUCs or other parallelism-enforcing parameters).

1. max_parallel_workers_per_gather=2(default)
target_table        patched    master     %reg
-----------------------------------------------
without_PK_index    83.683     142.183    -41%
with_PK             382.824    321.101     19%
with_index          372.682    324.246     15%

2. max_parallel_workers_per_gather=4
target_table        patched    master     %reg
-----------------------------------------------
without_PK_index    73.189     141.879    -48%
with_PK             362.104    329.759     10%
with_index          372.237    333.718     12%

3. max_parallel_workers_per_gather=8 (also set max_parallel_workers=16, 
max_worker_processes = 16)
target_table        patched    master     %reg
-----------------------------------------------
without_PK_index    75.072     146.100    -49%
with_PK             365.312    324.339     13%
with_index          362.636    338.366      7%

Attached is test_bitmap.sql, which includes my test data and SQL, if you want
to have a look.

Regards,
Tang





test_bitmap.sql
Description: test_bitmap.sql


Re: [HACKERS] [PATCH] Generic type subscripting

2021-02-01 Thread Dmitry Dolgov
> On Fri, Jan 29, 2021 at 7:01 PM Alexander Korotkov  
> wrote:
> Pushed with minor cleanup.

Thanks a lot!

> On Sun, Jan 31, 2021 at 05:23:25PM -0500, Tom Lane wrote:
>
> thorntail seems unhappy:
>
> [From 7c5d57c...]
> Fix portability issue in new jsonbsubs code.
>
> On machines where sizeof(Datum) > sizeof(Oid) (that is, any 64-bit
> platform), the previous coding would compute a misaligned
> workspace->index pointer if nupper is odd.  Architectures where
> misaligned access is a hard no-no would then fail.  This appears
> to explain why thorntail is unhappy but other buildfarm members
> are not.

Yeah, that was an unexpected issue, thanks! I assume a few other failing
buildfarm members hit the same problem, as they show similar symptoms (e.g.
mussurana or ibisbill).
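
For context, the class of layout bug being fixed typically looks like the
following minimal sketch (hypothetical field names and a hypothetical "nupper"
count of subscripts; this is not the actual jsonbsubs code):

/* Broken layout: the Datum array is only 4-byte aligned when nupper is odd */
Size    oids_sz = sizeof(Oid) * nupper;
char   *ws = palloc(oids_sz + sizeof(Datum) * nupper);
Oid    *indexOids = (Oid *) ws;
Datum  *indexDatums = (Datum *) (ws + oids_sz);

/* Portable layout: round the Datum array's offset up to maximum alignment */
ws = palloc(MAXALIGN(oids_sz) + sizeof(Datum) * nupper);
indexOids = (Oid *) ws;
indexDatums = (Datum *) (ws + MAXALIGN(oids_sz));

On x86 the misaligned version merely works by luck; on strict-alignment
hardware it traps, which matches the symptom described for thorntail.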




Re: Single transaction in the tablesync worker?

2021-02-01 Thread Amit Kapila
On Mon, Feb 1, 2021 at 1:08 PM Peter Smith  wrote:
>
> On Mon, Feb 1, 2021 at 5:19 PM Amit Kapila  wrote:
>
>  > > > AFAIK there is always a potential race with DropSubscription dropping
> > > > > slots. The DropSubscription might be running at exactly the same time
> > > > > the apply worker has just dropped the very same tablesync slot.
> > > > >
> > > >
> > > > We stopped the workers before getting a list of NotReady relations and
> > > > then we try to drop the corresponding slots. So, how such a race
> > > > condition can happen? Note, because we have a lock on pg_subscrition,
> > > > there is no chance that the workers can restart till the transaction
> > > > end.
> > >
> > > OK. I think I was forgetting the logicalrep_worker_stop would also go
> > > into a loop waiting for the worker process to die. So even if the
> > > tablesync worker does simultaneously drop it's own slot, I think it
> > > will certainly at least be in SYNCDONE state before DropSubscription
> > > does anything else with that worker.
> > >
> >
> > How is that ensured? We don't have anything like HOLD_INTERRUPTS
> > between the time dropped the slot and updated rel state as SYNCDONE.
> > So, isn't it possible that after we dropped the slot and before we
> > update the state, the SIGTERM signal arrives and led to worker exit?
> >
>
> The worker has the SIGTERM handler of "die". IIUC the "die" function
> doesn't normally do anything except set some flags to say please die
> at the next convenient opportunity. My understanding is that the
> worker process will not actually exit until it next executes
> CHECK_FOR_INTERRUPTS(), whereupon it will see the ProcDiePending flag
> and *really* die. So even if the SIGTERM signal arrives immediately
> after the slot is dropped, the tablesync will still become SYNCDONE.
> Is this wrong understanding?
>
> But your scenario could still be possible if "die" exited immediately
> (e.g. only in single user mode?).
>

I think it is possible without that as well. There are many calls
in-between those two operations which can internally call
CHECK_FOR_INTERRUPTS. One of the flows where such a possibility exists
is 
UpdateSubscriptionRelState->SearchSysCacheCopy2->SearchSysCacheCopy->SearchSysCache->SearchCatCache->SearchCatCacheInternal->SearchCatCacheMiss->systable_getnext.
This can internally call heapgetpage where we have
CHECK_FOR_INTERRUPTS. I think even if there were no CFI call today, we
can't guarantee that for the future, as the calls used are quite
common. So, we probably need a missing_ok flag in DropSubscription.

One more point: in the tablesync code you are calling
ReplicationSlotDropAtPubNode with missing_ok as false. What if we get
an error after that and before we have marked the state as SYNCDONE? I
guess it will then always error in ReplicationSlotDropAtPubNode on retry,
because we had already dropped the slot.

-- 
With Regards,
Amit Kapila.




Re: Printing backtrace of postgres processes

2021-02-01 Thread Kyotaro Horiguchi
At Fri, 29 Jan 2021 19:10:24 +0530, vignesh C  wrote in 
> Attached v5 patch has the fixes for the same.

PMSIGNAL_BACKTRACE_EMIT is not used anywhere?
(the commit message seems stale.)

+++ b/src/bin/pg_ctl/t/005_backtrace.pl

pg_ctl doesn't (or no longer?) seem relevant to this patch.


+   eval {
+   $current_logfiles = slurp_file($node->data_dir . 
'/current_logfiles');
+   };
+   last unless $@;

Is there any reason not just to do "while (! -e 
+This is free software: you are free to change and redistribute it.
+There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
+and "show warranty" for details.
+This GDB was configured as "x86_64-redhat-linux-gnu".
+For bug reporting instructions, please see:
+...
+Reading symbols from /home/postgresdba/inst/bin/postgres...done.

Almost all of the banner lines seem to be useless here.


 #define SIGNAL_BACKEND_SUCCESS 0
+#define BACKEND_VALIDATION_SUCCESS 0
 #define SIGNAL_BACKEND_ERROR 1
 #define SIGNAL_BACKEND_NOPERMISSION 2
 #define SIGNAL_BACKEND_NOSUPERUSER 3

Even though I can share the feeling that SIGNAL_BACKEND_SUCCESS is a
bit odd to represent "sending the signal is allowed", I feel that it's
better to use the existing symbol than to introduce a new one.


+validate_backend_pid(int pid)

The function needs a comment. The name is somewhat
confusing; maybe check_privilege_to_send_signal()?

Maybe the return value of the function should be changed to an enum,
and its callers should use switch-case to handle the value.
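
For illustration, the enum-plus-switch shape being suggested could look
something like the following (a sketch with made-up names, not code from the
patch):

typedef enum BackendValidationResult
{
    BACKEND_VALIDATION_SUCCESS,
    BACKEND_VALIDATION_ERROR,
    BACKEND_VALIDATION_NOPERMISSION,
    BACKEND_VALIDATION_NOSUPERUSER
} BackendValidationResult;

/* Callers then handle every outcome explicitly instead of comparing ints. */
static const char *
validation_result_message(BackendValidationResult r)
{
    switch (r)
    {
        case BACKEND_VALIDATION_SUCCESS:
            return NULL;        /* ok to send the signal */
        case BACKEND_VALIDATION_NOSUPERUSER:
            return "must be a superuser to print backtrace of backend process";
        case BACKEND_VALIDATION_NOPERMISSION:
            return "must be a member of the role or of pg_signal_backend";
        case BACKEND_VALIDATION_ERROR:
        default:
            return "PID does not belong to a live backend";
    }
}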


+   if (!SendProcSignal(bt_pid, PROCSIG_BACKTRACE_PRINT, 
InvalidBackendId))
+   PG_RETURN_BOOL(true);
+   else
+   ereport(WARNING,
+   (errmsg("failed to send signal 
to postmaster: %m")));
+   }
+
+   PG_RETURN_BOOL(result);

The variable "result" seems useless.


+   elog(LOG_SERVER_ONLY, "current backtrace:%s", errtrace.data);
+
+   errno = save_errno;
+}

You need to release the resources held by errtrace. Also, the name
"errtrace" seems a bit confusing; why isn't it "backtrace"?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: row filtering for logical replication

2021-02-01 Thread japin


On Mon, 01 Feb 2021 at 08:23, Euler Taveira  wrote:
> On Mon, Mar 16, 2020, at 10:58 AM, David Steele wrote:
>> Please submit to a future CF when a new patch is available.
> Hi,
>
> This is another version of the row filter patch. Patch summary:
>
> 0001: refactor to remove dead code
> 0002: grammar refactor for row filter
> 0003: core code, documentation, and tests
> 0004: psql code
> 0005: pg_dump support
> 0006: debug messages (only for test purposes)
> 0007: measure row filter overhead (only for test purposes)
>

Thanks for updating the patch.  Here are some comments:

(1)
+ 
+  If this parameter is false, it uses the
+  WHERE clause from the partition; otherwise,the
+  WHERE clause from the partitioned table is used.
  

otherwise,the -> otherwise, the

(2)
+  
+  Columns used in the WHERE clause must be part of the
+  primary key or be covered by REPLICA IDENTITY otherwise
+  UPDATE and DELETE operations will not
+  be replicated.
+  
+

IMO we should indent one space here.

(3)
+
+  
+  The WHERE clause expression is executed with the role used
+  for the replication connection.
+  

Same as (2).

The documentation says:

>  Columns used in the WHERE clause must be part of the
>  primary key or be covered by REPLICA IDENTITY otherwise
>  UPDATE and DELETE operations will not
>  be replicated.

Why do we need this limitation? Am I missing something?

When I tested, I found that the UPDATE is replicated, while the DELETE
is not.  Here is my test case:

-- 1. Create tables and publications on publisher
CREATE TABLE t1 (a int primary key, b int);
CREATE TABLE t2 (a int primary key, b int);
INSERT INTO t1 VALUES (1, 11);
INSERT INTO t2 VALUES (1, 11);
CREATE PUBLICATION mypub1 FOR TABLE t1;
CREATE PUBLICATION mypub2 FOR TABLE t2 WHERE (b > 10);

-- 2. Create tables and subscriptions on subscriber
CREATE TABLE t1 (a int primary key, b int);
CREATE TABLE t2 (a int primary key, b int);
CREATE SUBSCRIPTION mysub1 CONNECTION 'host=localhost port=8765 dbname=postgres'
  PUBLICATION mypub1;
CREATE SUBSCRIPTION mysub2 CONNECTION 'host=localhost port=8765 dbname=postgres'
  PUBLICATION mypub2;

-- 3. Check publications on publisher
postgres=# \dRp+
   Publication mypub1
 Owner | All tables | Inserts | Updates | Deletes | Truncates | Via root
-------+------------+---------+---------+---------+-----------+----------
 japin | f          | t       | t       | t       | t         | f
Tables:
    "public.t1"

                           Publication mypub2
 Owner | All tables | Inserts | Updates | Deletes | Truncates | Via root
-------+------------+---------+---------+---------+-----------+----------
 japin | f          | t       | t       | t       | t         | f
Tables:
    "public.t2" WHERE (b > 10)

-- 4. Check initialization data on subscriber
postgres=# table t1;
 a | b
---+
 1 | 11
(1 row)

postgres=# table t2;
 a | b
---+
 1 | 11
(1 row)

-- 5. The update on publisher
postgres=# update t1 set b = 111 where b = 11;
UPDATE 1
postgres=# table t1;
 a |  b
---+-
 1 | 111
(1 row)

postgres=# update t2 set b = 111 where b = 11;
UPDATE 1
postgres=# table t2;
 a |  b
---+-
 1 | 111
(1 row)

-- 6. check the updated records on subscriber
postgres=# table  t1;
 a |  b
---+-
 1 | 111
(1 row)

postgres=# table  t2;
 a |  b
---+-
 1 | 111
(1 row)

-- 7. Delete records on publisher
postgres=# delete from t1 where b = 111;
DELETE 1
postgres=# table t1;
 a | b
---+---
(0 rows)

postgres=# delete from t2 where b = 111;
DELETE 1
postgres=# table t2;
 a | b
---+---
(0 rows)

-- 8. Check the deleted records on subscriber
postgres=# table t1;
 a | b
---+---
(0 rows)

postgres=# table t2;
 a |  b
---+-
 1 | 111
(1 row)

I did some simple debugging and found that pgoutput_row_filter() returns false
when I execute "delete from t2 where b = 111;".

Does the publication only load the REPLICA IDENTITY columns into the old tuple
when we execute a DELETE? If so, pgoutput_row_filter() cannot find the
non-REPLICA-IDENTITY columns, which causes it to return false, right?  If
that's right, UPDATE might not be limited by REPLICA IDENTITY, because all
columns are in the new tuple, isn't it?
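
For what it's worth, one way to see the interaction is that a DELETE's old
tuple carries only the replica identity columns, so a filter on any other
column has nothing to evaluate.  A hedged illustration (a workaround sketch,
not part of the patch under review) of making column b visible to the filter
in the test case above:

-- Log all columns in the old tuple, so the row filter (b > 10)
-- can also be evaluated for DELETEs on t2:
ALTER TABLE t2 REPLICA IDENTITY FULL;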

-- 
Regrads,
Japin Li.
ChengDu WenWu Information Tech

RE: Parallel INSERT (INTO ... SELECT ...)

2021-02-01 Thread Hou, Zhijie
Hi,

When developing the reloption patch, I noticed some issues in the patch.

1).
> - Reduce Insert parallel-safety checks required for some SQL, by noting
> that the subquery must operate on a relation (check for RTE_RELATION in
> subquery range-table)

+   foreach(lcSub, rte->subquery->rtable)
+   {
+   rteSub = lfirst_node(RangeTblEntry, lcSub);
+   if (rteSub->rtekind == RTE_RELATION)
+   {
+   hasSubQueryOnRelation = true;
+   break;
+   }
+   }
It seems we cannot search for RTE_RELATION only in rtable,
because an RTE_RELATION may exist in other places, for example:

---
--** explain insert into target select (select * from test);
Subplan's subplan

--** with cte as (select * from test) insert into target select * from cte;
In query's ctelist.
---

Maybe we should use a walker function [1] to
search the subquery and ctelist.

2).

+--
+-- Test INSERT into temporary table with underlying query.
+-- (should not use a parallel plan)
+--

Maybe the comment here needs some change, since
we currently support a parallel plan for temp tables.

3)
Do you think we can add a test case for foreign tables,
to test parallel query with serial insert on a foreign table?


[1]
static bool
relation_walker(Node *node)
{
    if (node == NULL)
        return false;

    else if (IsA(node, RangeTblEntry))
    {
        RangeTblEntry *rte = (RangeTblEntry *) node;

        if (rte->rtekind == RTE_RELATION)
            return true;

        return false;
    }

    else if (IsA(node, Query))
    {
        Query  *query = (Query *) node;

        /* Recurse into subselects */
        return query_tree_walker(query, relation_walker,
                                 NULL, QTW_EXAMINE_RTES_BEFORE);
    }

    /* Recurse to check arguments */
    return expression_tree_walker(node, relation_walker, NULL);
}


Best regards,
houzj




Re: Printing backtrace of postgres processes

2021-02-01 Thread Dilip Kumar
On Fri, Jan 29, 2021 at 7:10 PM vignesh C  wrote:
>
> Thanks Bharath for your review comments. Please find my comments inline below.
>
> On Thu, Jan 28, 2021 at 7:40 PM Bharath Rupireddy
>  wrote:
> >
> > On Thu, Jan 28, 2021 at 5:22 PM vignesh C  wrote:
> > > Thanks for the comments, I have fixed and attached an updated patch
> > > with the fixes for the same.
> > > Thoughts?
> >
> > Thanks for the patch. Here are few comments:
> >
> > 1) I think it's return SIGNAL_BACKEND_SUCCESS; instead of return 0; in
> > check_valid_pid?
> >
>
> I did not want to use SIGNAL_BACKEND_SUCCESS as we have not yet
> signalled the backend process at this time. I have added
> BACKEND_VALIDATION_SUCCESS macro and used it here for better
> readability.
>
> > 2) How about following in pg_signal_backend for more readability
> > +if (ret != SIGNAL_BACKEND_SUCCESS)
> > +return ret;
> > instead of
> > +if (ret)
> > +return ret;
> >
>
> Modified it to ret != BACKEND_VALIDATION_SUCCESS
>
> > 3) How about validate_backend_pid or some better name instead of
> > check_valid_pid?
> >
>
> Modified it to validate_backend_pid
>
> > 4) How about following
> > + errmsg("must be a superuser to print backtrace
> > of backend process")));
> > instead of
> > + errmsg("must be a superuser to print backtrace
> > of superuser query process")));
> >
>
> Here the message should include superuser, we cannot remove it. A
> non-superuser can log a non-superuser's backtrace, provided the user has
> permission for it.
>
> > 5) How about following
> >  errmsg("must be a member of the role whose backed
> > process's backtrace is being printed or member of
> > pg_signal_backend")));
> > instead of
> > + errmsg("must be a member of the role whose
> > backtrace is being logged or member of pg_signal_backend")));
> >
>
> Modified it.
>
> > 6) I'm not sure whether "backtrace" or "call stack" is a generic term
> > from the user/developer perspective. In the patch, the function name
> > and documentation says callstack(I think it is "call stack" actually),
> > but the error/warning messages says backtrace. IMHO, having
> > "backtrace" everywhere in the patch, even the function name changed to
> > pg_print_backtrace, looks better and consistent. Thoughts?
> >
>
> Modified it to pg_print_backtrace.
>
> > 7) How about following in pg_print_callstack?
> > {
> > intbt_pid = PG_ARGISNULL(0) ? -1 : PG_GETARG_INT32(0);
> > boolresult = false;
> >
> > if (r == SIGNAL_BACKEND_SUCCESS)
> > {
> > if (EmitProcSignalPrintCallStack(bt_pid))
> > result = true;
> > else
> > ereport(WARNING,
> > (errmsg("failed to send signal to postmaster: 
> > %m")));
> > }
> >
> > PG_RETURN_BOOL(result);
> > }
> >
>
> Modified similarly with slight change.
>
> > 8) How about following
> > +(errmsg("backtrace generation is not supported by
> > this PostgresSQL installation")));
> > instead of
> > +(errmsg("backtrace generation is not supported by
> > this installation")));
> >
>
> I used the existing message to maintain consistency with
> set_backtrace. I feel we can keep it the same.
>
> > 9) Typo - it's "example" +2) Using "addr2line -e postgres address", For 
> > exmple:
> >
>
> Modified it.
>
> > 10) How about
> > + * Handle print backtrace signal
> > instead of
> > + * Handle receipt of an print backtrace.
> >
>
> I used the existing message to maintain consistency similar to
> HandleProcSignalBarrierInterrupt. I feel we can keep it the same.
>
> > 11) Isn't below in documentation specific to Linux platform. What
> > happens if GDB is not there on the platform?
> > +
> > +1)  "info line *address" from gdb on postgres executable. For example:
> > +gdb ./postgres
> > +GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-115.el7
> >
>
> I have made changes "You can get the file name and line number by
> using gdb/addr2line in linux platforms, as a prerequisite users must
> ensure gdb/addr2line is already installed".
>
> User will get an error like this in windows:
> select pg_print_backtrace(pg_backend_pid());
> WARNING:  backtrace generation is not supported by this installation
>  pg_print_callstack
> 
>  f
> (1 row)
>
> The backtrace will not be logged in case of windows, it will throw a
> warning "backtrace generation is not supported by this installation"
> Thoughts?
>
> > 12) +The callstack will be logged in the log file. What happens if the
> > server is started without a log file , ./pg_ctl -D data start? Where
> > will the backtrace go?
> >
>
> Updated to: The backtrace will be logged to the log file if logging is
> enabled, if logging is disabled backtrace will be logged to the
> console where the postmaster was started.
>
> > 13) Not sure, if it's an overkill, but how about pg_print_callstack
> > returning a warning/

Re: Faulty HEAP_XMAX_LOCK_ONLY & HEAP_KEYS_UPDATED hintbit combination

2021-02-01 Thread Julien Rouhaud
On Mon, Feb 1, 2021 at 2:05 PM Dilip Kumar  wrote:
>
> On Sun, Jan 24, 2021 at 9:31 PM Alvaro Herrera  
> wrote:
> >
> > On 2021-Jan-24, Julien Rouhaud wrote:
> >
> > > + /*
> > > +  * Do not allow tuples with invalid combinations of hint bits to be 
> > > placed
> > > +  * on a page.  These combinations are detected as corruption by the
> > > +  * contrib/amcheck logic, so if you disable one or both of these
> > > +  * assertions, make corresponding changes there.
> > > +  */
> > > + Assert(!((tuple->t_data->t_infomask & HEAP_XMAX_LOCK_ONLY) &&
> > > +  (tuple->t_data->t_infomask2 & HEAP_KEYS_UPDATED)));
> > >
> > >
> > > I attach a simple self contained script to reproduce the problem, the last
> > > UPDATE triggering the Assert.
> > >
> > > I'm not really familiar with this part of the code, so it's not exactly 
> > > clear
> > > to me if some logic is missing in compute_new_xmax_infomask() /
> > > heap_prepare_insert(), or if this should actually be an allowed 
> > > combination of
> > > hint bit.
> >
> > Hmm, it's probably a bug in compute_new_xmax_infomask.  I don't think
> > the combination is sensible.
> >
>
> If we see the logic of GetMultiXactIdHintBits then it appeared that we
> can get this combination in the case of multi-xact.
>
> switch (members[i].status)
> {
> ...
>case MultiXactStatusForUpdate:
>bits2 |= HEAP_KEYS_UPDATED;
>break;
> }
>
> 
> if (!has_update)
> bits |= HEAP_XMAX_LOCK_ONLY;
>
> Basically, if it is "select for update" then we will mark infomask2 as
> HEAP_KEYS_UPDATED and the informask as HEAP_XMAX_LOCK_ONLY.

Yes I saw that too, I don't know if the MultiXactStatusForUpdate case
is ok or not.

Note that this hint bit can get cleaned later in heap_update in case
of hot_update or if there's TOAST:

/*
* To prevent concurrent sessions from updating the tuple, we have to
* temporarily mark it locked, while we release the page-level lock.
[...]
/* Clear obsolete visibility flags ... */
oldtup.t_data->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
oldtup.t_data->t_infomask2 &= ~HEAP_KEYS_UPDATED;




Re: How to expose session vs txn lock info in pg_locks view?

2021-02-01 Thread Craig Ringer
On Sun, 24 Jan 2021 at 09:12, Andres Freund  wrote:

> Hi,
>
> On 2021-01-19 14:16:07 +0800, Craig Ringer wrote:
> > AFAICS it'd be necessary to expand PROCLOCK to expose this in shmem.
> > Probably by adding a small bitfield where bit 0 is set if there's a txn
> > level lock and bit 1 is set if there's a session level lock. But I'm not
> > convinced that expanding PROCLOCK is justifiable for this.
> > sizeof(PROCLOCK) is 64 on a typical x64 machine. Adding anything to it
> > increases it to 72 bytes.
>
> Indeed - I really don't want to increase the size, it's already a
> problem.
>
>
> > It's frustrating to be unable to tell the difference between
> > session-level and txn-level locks in diagnostic output.
>
> It'd be useful, I agree.
>
>
> > And the deadlock detector has no way to tell the difference when
> > selecting a victim for a deadlock abort - it'd probably make sense to
> > prefer to send a deadlock abort for txn-only lockers.
>
> I'm doubtful this is worth going for.
>
>
> > But I'm not sure I see a sensible way to add the info - PROCLOCK is
> > already free of any padding, and I wouldn't want to use hacks like
> > pointer-tagging.
>
> I think there's an easy way to squeeze out space: make groupLeader be an
> integer index into allProcs instead. That requires only 4 bytes...
>
> Alternatively, I think it'd be reasonably easy to add the scope as a bit
> in LOCKMASK - there's plenty space.
>

I was wondering about that, but concerned that there would be impacts I did
not understand.

I'm happy to pursue that angle.
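
As a strawman, the LOCKMASK-bit idea could be as small as the following sketch
(hypothetical names, not from any posted patch); whether a spare bit can really
be claimed there is exactly the kind of impact worth checking first:

/*
 * LOCKMASK is an int bitmask indexed by lock mode (modes 1..MAX_LOCKMODES-1),
 * so a high bit well above that range could record "holder also has a
 * session-level lock" without growing PROCLOCK.
 */
#define LOCKBIT_SESSION_SCOPE   (1 << 30)   /* assumed to be unused */

static inline bool
lockmask_has_session_scope(LOCKMASK holdMask)
{
    return (holdMask & LOCKBIT_SESSION_SCOPE) != 0;
}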


Re: Faulty HEAP_XMAX_LOCK_ONLY & HEAP_KEYS_UPDATED hintbit combination

2021-02-01 Thread Dilip Kumar
On Mon, Feb 1, 2021 at 4:05 PM Julien Rouhaud  wrote:
>
> On Mon, Feb 1, 2021 at 2:05 PM Dilip Kumar  wrote:
> >
> > On Sun, Jan 24, 2021 at 9:31 PM Alvaro Herrera  
> > wrote:
> > >
> > > On 2021-Jan-24, Julien Rouhaud wrote:
> > >
> > > > + /*
> > > > +  * Do not allow tuples with invalid combinations of hint bits to 
> > > > be placed
> > > > +  * on a page.  These combinations are detected as corruption by 
> > > > the
> > > > +  * contrib/amcheck logic, so if you disable one or both of these
> > > > +  * assertions, make corresponding changes there.
> > > > +  */
> > > > + Assert(!((tuple->t_data->t_infomask & HEAP_XMAX_LOCK_ONLY) &&
> > > > +  (tuple->t_data->t_infomask2 & 
> > > > HEAP_KEYS_UPDATED)));
> > > >
> > > >
> > > > I attach a simple self contained script to reproduce the problem, the 
> > > > last
> > > > UPDATE triggering the Assert.
> > > >
> > > > I'm not really familiar with this part of the code, so it's not exactly 
> > > > clear
> > > > to me if some logic is missing in compute_new_xmax_infomask() /
> > > > heap_prepare_insert(), or if this should actually be an allowed 
> > > > combination of
> > > > hint bit.
> > >
> > > Hmm, it's probably a bug in compute_new_xmax_infomask.  I don't think
> > > the combination is sensible.
> > >
> >
> > If we see the logic of GetMultiXactIdHintBits then it appeared that we
> > can get this combination in the case of multi-xact.
> >
> > switch (members[i].status)
> > {
> > ...
> >case MultiXactStatusForUpdate:
> >bits2 |= HEAP_KEYS_UPDATED;
> >break;
> > }
> >
> > 
> > if (!has_update)
> > bits |= HEAP_XMAX_LOCK_ONLY;
> >
> > Basically, if it is "select for update" then we will mark infomask2 as
> > HEAP_KEYS_UPDATED and the informask as HEAP_XMAX_LOCK_ONLY.
>
> Yes I saw that too, I don't know if the MultiXactStatusForUpdate case
> is ok or not.

It seems it is done intentionally to handle some case, though I am not
sure which one.  But setting HEAP_KEYS_UPDATED in the case of "for
update" seems wrong.
The comment on this flag clearly says "tuple was updated and key
cols modified, or tuple deleted", and that is obviously not the case
here.

#define HEAP_KEYS_UPDATED 0x2000 /* tuple was updated and key cols
* modified, or tuple deleted */


> Note that this hint bit can get cleaned later in heap_update in case
> of hot_update or if there's TOAST:
>
> /*
> * To prevent concurrent sessions from updating the tuple, we have to
> * temporarily mark it locked, while we release the page-level lock.
> [...]
> /* Clear obsolete visibility flags ... */
> oldtup.t_data->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
> oldtup.t_data->t_infomask2 &= ~HEAP_KEYS_UPDATED;

I see.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: PATCH: Attempt to make dbsize a bit more consistent

2021-02-01 Thread Masahiko Sawada
On Thu, Nov 12, 2020 at 2:54 AM Soumyadeep Chakraborty
 wrote:
>
> Hey Georgios,
>
> On Tue, Nov 10, 2020 at 6:20 AM  wrote:
> >
> >
> >
> >
> >
> >
> > ‐‐‐ Original Message ‐‐‐
> > On Monday, November 9, 2020 7:50 PM, Soumyadeep Chakraborty 
> >  wrote:
> >
> > > Hey Georgios,
> > >
> > > Thanks for looking for more avenues to invoke tableAM APIS! Please find
> > > my review below:
> >
> > A great review Soumyadeep, it is much appreciated.
> > Please remember to add yourself as a reviewer in the commitfest
> > [https://commitfest.postgresql.org/30/2701/]
>
> Ah yes. Sorry, I forgot that!
>
> > >
> > > On Tue, Oct 13, 2020 at 6:28 AM gkokola...@pm.me wrote:
> > >
> > > 1.
> > >
> > > > /*
> > > >
> > > > -   -   heap size, including FSM and VM
> > > >
> > > > -   -   table size, including FSM and VM
> > > > */
> > > >
> > >
> > > We should not mention FSM and VM in dbsize.c at all as these are
> > > heap AM specific. We can say:
> > > table size, excluding TOAST relation
> >
> > Yeah, I was thinking that the notion that FSM and VM are still taken
> > into account should be stated. We are iterating over ForkNumber
> > after all.
> >
> > How about phrasing it as:
> >
> > + table size, including all implemented forks from the AM (e.g. FSM, VM)
> > + excluding TOAST relations
> >
> > Thoughts?
>
> Yes, I was thinking along the same lines. The concept of a "fork"
> should not be forced down into the tableAM. But that is a discussion for
> another day. We can probably say:
>
> + table size, including all implemented forks from the AM (e.g. FSM, VM
> + for the heap AM) excluding TOAST relations
>
> > >
> > > 2.
> > >
> > > > /*
> > > >
> > > > -   Size of toast relation
> > > > */
> > > > if (OidIsValid(rel->rd_rel->reltoastrelid))
> > > >
> > > >
> > > > -   size += calculate_toast_table_size(rel->rd_rel->reltoastrelid);
> > > >
> > > > -   {
> > > > -   Relation toastRel;
> > > > -
> > > > -   toastRel = relation_open(rel->rd_rel->reltoastrelid, 
> > > > AccessShareLock);
> > >
> > > We can replace the OidIsValid check with relation_needs_toast_table()
> > > and then have the OidIsValid() check in an Assert. Perhaps, that will
> > > make things more readable.
> >
> > Please, allow me to kindly disagree.
> >
> > Relation is already open at this stage. Even create_toast_table(), the
> > internal workhorse for creating toast relations, does check reltoastrelid
> > to test if the relation is already toasted.
> >
> > Furthermore, we do call:
> >
> > + toastRel = relation_open(rel->rd_rel->reltoastrelid, AccessShareLock);
> >
> > and in order to avoid elog() errors underneath us, we ought to have
> > verified the validity of reltoastrelid.
> >
> > In short, I think that the code in the proposal is not too unreadable
> > nor that it breaks the coding patterns throughout the codebase.
> >
> > Am I too wrong?
>
> No not at all. The code in the proposal is indeed adhering to the
> codebase. What I was going for here was to increase the usage of
> relation_needs_toast_table(). What I meant was:
>
> if (table_relation_needs_toast_table(rel))
> {
> if (!OidIsValid(rel->rd_rel->reltoastrelid))
> {
> elog(ERROR, );
> }
> //size calculation here..
> }
>
> We want to error out if the toast table can't be found and the relation
> is expected to have one, which the existing code doesn't handle.
>
>
> >
> > >
> > > 3.
> > >
> > > > -   if (rel->rd_rel->relkind == RELKIND_RELATION ||
> > > > -   rel->rd_rel->relkind == RELKIND_TOASTVALUE ||
> > > > -   rel->rd_rel->relkind == RELKIND_MATVIEW)
> > > > -   size = calculate_table_size(rel);
> > > > -   else
> > > > -   {
> > > > -   relation_close(rel, AccessShareLock);
> > > > -   PG_RETURN_NULL();
> > > > -   }
> > >
> > > This leads to behavioral changes:
> > >
> > > I am talking about the case where one calls pg_table_size() on an index.
> > > W/o your patch, it returns the size of the index. W/ your patch it
> > > returns NULL. Judging by the documentation, this function should not
> > > ideally apply to indexes but it does! I have a sinking feeling that lots
> > > of users use it for this purpose, as there is no function to calculate
> > > the size of a single specific index (except pg_relation_size()).
> > > The same argument I have made above applies to sequences. Users may have
> > > trial-and-errored their way into finding that pg_table_size() can tell
> > > them the size of a specific index/sequence! I don't know how widespread
> > > the use is in the user community, so IMO maybe we should be conservative
> > > and not introduce this change? Alternatively, we could call out that
> > > pg_table_size() is only for tables by throwing an error if anything
> > > other than a table is passed in.
> > >
> > > If we decide to preserve the existing behavior of the pg_table_size():
> > > It seems that for things not backed by the tableAM (indexes and
> > > sequences), they should still go through calculate_relation_size().
> > > We can call table_relatio

Re: Single transaction in the tablesync worker?

2021-02-01 Thread Amit Kapila
On Mon, Feb 1, 2021 at 11:23 AM Peter Smith  wrote:
>
> On Mon, Feb 1, 2021 at 3:44 PM Amit Kapila  wrote:
> >
> > On Mon, Feb 1, 2021 at 9:38 AM Peter Smith  wrote:
> >
> > No, there is a functionality change as well. The way we have code in
> > v22 can easily lead to a problem when we have dropped the slots but
> > get an error while removing origins or an entry from subscription rel.
> > In such cases, we won't be able to rollback the drop of slots but the
> > other database operations will be rolled back. This is the reason we
> > have to drop the slots at the end. We need to ensure the same thing
> > for AlterSubscription_refresh. Does this make sense now?
> >
>
> OK.
>

I have updated the patch to display WARNING for each of the tablesync
slots during DropSubscription. As discussed, I have moved the drop
slot related code towards the end in AlterSubscription_refresh. Apart
from this, I have fixed one more issue in tablesync code where in
after catching the exception we were not clearing the transaction
state on the publisher, see changes in LogicalRepSyncTableStart. I
have also fixed other comments raised by you. Additionally, I have
removed the test because it was creating a slot with the same name as the
tablesync worker's, and the tablesync worker removed it due to the new
logic in LogicalRepSyncTableStart. Earlier, it was not failing because of
the bug in that code, which I have fixed in the attached.

I wonder whether we should restrict creating slots with the prefix pg_,
because we are internally creating slots with those names? I think
this was a problem previously as well. We already prohibit it for a few
other objects like origins, schemas, etc.; see the usage of
IsReservedName.
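
For comparison, this is roughly how the existing IsReservedName check surfaces
for schemas today (error text approximate), and the idea would be for slot
names to behave similarly:

CREATE SCHEMA pg_myschema;
-- ERROR:  unacceptable schema name "pg_myschema"
-- DETAIL:  The prefix "pg_" is reserved for system schemas.

-- With such a restriction, a user-issued call like this would be rejected,
-- since internally created tablesync slots use pg_-prefixed names of
-- roughly this shape (name here is only illustrative):
SELECT pg_create_logical_replication_slot('pg_16399_sync_16401', 'pgoutput');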




--
With Regards,
Amit Kapila.


v24-0001-Tablesync-Solution1.patch
Description: Binary data


Re: Parallel INSERT (INTO ... SELECT ...)

2021-02-01 Thread Greg Nancarrow
On Mon, Feb 1, 2021 at 8:19 PM Hou, Zhijie  wrote:
>
>
> When developing the reloption patch, I noticed some issues in the patch.
>
> 1).
> > - Reduce Insert parallel-safety checks required for some SQL, by noting
> > that the subquery must operate on a relation (check for RTE_RELATION in
> > subquery range-table)
>
> +   foreach(lcSub, rte->subquery->rtable)
> +   {
> +   rteSub = lfirst_node(RangeTblEntry, lcSub);
> +   if (rteSub->rtekind == RTE_RELATION)
> +   {
> +   hasSubQueryOnRelation = true;
> +   break;
> +   }
> +   }
> It seems we can not only search RTE_RELATION in rtable,
> because RTE_RELATION may exist in other place like:
>
> ---
> --** explain insert into target select (select * from test);
> Subplan's subplan
>
> --** with cte as (select * from test) insert into target select * from cte;
> In query's ctelist.
> ---
>
> May be we should use a walker function [1] to
> search the subquery and ctelist.
>

Yes, the current checks are too simple; as you point out, there seem
to be more complex cases that they don't pick up. Unfortunately,
expanding the testing for them does detract from the original
intention of this code (which was to avoid extra parallel-safety check
processing on code that can't be run in parallel). I guess the
relation walker function should additionally check for SELECT queries
only (commandType == CMD_SELECT), and exclude SELECT FOR UPDATE/SHARE
(rowMarks != NIL) too. I'll need to look further into it, but will
certainly update the code for the next version of the patch.
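
To make that concrete, the extra conditions would slot into the walker roughly
like this (a sketch layered on the relation_walker from [1], not code from any
posted patch version):

static bool
relation_walker(Node *node)
{
    if (node == NULL)
        return false;

    if (IsA(node, RangeTblEntry))
    {
        RangeTblEntry *rte = (RangeTblEntry *) node;

        return rte->rtekind == RTE_RELATION;
    }

    if (IsA(node, Query))
    {
        Query  *query = (Query *) node;

        /* Only plain SELECT sub-queries are of interest here ... */
        if (query->commandType != CMD_SELECT)
            return false;

        /* ... and SELECT FOR UPDATE/SHARE is excluded as well */
        if (query->rowMarks != NIL)
            return false;

        /* Recurse into subselects and RTEs (covers the CTE case too) */
        return query_tree_walker(query, relation_walker,
                                 NULL, QTW_EXAMINE_RTES_BEFORE);
    }

    /* Recurse to check arguments */
    return expression_tree_walker(node, relation_walker, NULL);
}

Exactly how a "false" result for the excluded cases should feed back into the
parallel-safety decision still needs to be worked out, as noted above.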

> 2).
>
> +--
> +-- Test INSERT into temporary table with underlying query.
> +-- (should not use a parallel plan)
> +--
>
> May be the comment here need some change since
> we currently support parallel plan for temp table.
>

Thanks, it should say something like "should create the plan with
INSERT + parallel SELECT".

> 3)
> Do you think we can add a testcase for foreign-table ?
> To test parallel query with serial insert on foreign table.
>

I have intended to do it, but as a lower-priority task.

>
> [1]
> static bool
> relation_walker(Node *node)
> {
> if (node == NULL)
> return false;
>
> else if (IsA(node, RangeTblEntry))
> {
> RangeTblEntry *rte = (RangeTblEntry *) node;
> if (rte->rtekind == RTE_RELATION)
> return true;
>
> return false;
> }
>
> else if (IsA(node, Query))
> {
> Query  *query = (Query *) node;
>
> /* Recurse into subselects */
> return query_tree_walker(query, relation_walker,
>  NULL, 
> QTW_EXAMINE_RTES_BEFORE);
> }
>
> /* Recurse to check arguments */
> return expression_tree_walker(node,
>   
> relation_walker,
>   NULL);
> }
>

Regards,
Greg Nancarrow
Fujitsu Australia




Re: [PATCH] Add extra statistics to explain for Nested Loop

2021-02-01 Thread Yugo NAGATA
On Mon, 1 Feb 2021 13:28:45 +0800
Julien Rouhaud  wrote:

> On Thu, Jan 28, 2021 at 8:38 PM Yugo NAGATA  wrote:
> >
> > postgres=# explain (analyze, verbose) select * from a,b where a.i=b.j;
> > 
> > QUERY PLAN
> > ---
> >  Nested Loop  (cost=0.00..2752.00 rows=991 width=8) (actual 
> > time=0.021..17.651 rows=991 loops=1)
> >Output: a.i, b.j
> >Join Filter: (a.i = b.j)
> >Rows Removed by Join Filter: 99009
> >->  Seq Scan on public.b  (cost=0.00..2.00 rows=100 width=4) (actual 
> > time=0.009..0.023 rows=100 loops=1)
> >  Output: b.j
> >->  Seq Scan on public.a  (cost=0.00..15.00 rows=1000 width=4) (actual 
> > time=0.005..0.091 min_time=0.065 max_time=0.163 min_rows=1000 rows=1000 
> > max_rows=1000 loops=100)
> >  Output: a.i
> >  Planning Time: 0.066 ms
> >  Execution Time: 17.719 ms
> > (10 rows)
> >
> > I don't like this format, where the extra statistics appear on the same
> > line as existing information, because the output format differs depending
> > on whether the plan node's loops > 1 or not. This makes the line too
> > long. Also, other information reported by VERBOSE doesn't change the
> > existing row format and just adds extra rows for new information.
> >
> > Instead, it seems better to me to add extra rows for the new statistics
> > without changing the existing row format, as other VERBOSE information
> > does, like below.
> >
> >->  Seq Scan on public.a  (cost=0.00..15.00 rows=1000 width=4) (actual 
> > time=0.005..0.091 rows=1000  loops=100)
> >  Output: a.i
> >  Min Time: 0.065 ms
> >  Max Time: 0.163 ms
> >  Min Rows: 1000
> >  Max Rows: 1000
> >
> > or, like Buffers,
> >
> >->  Seq Scan on public.a  (cost=0.00..15.00 rows=1000 width=4) (actual 
> > time=0.005..0.091 rows=1000  loops=100)
> >  Output: a.i
> >  Loops: min_time=0.065 max_time=0.163 min_rows=1000 max_rows=1000
> >
> > and so  on. What do you think about it?
> 
> It's true that the current output is a bit long, which isn't really
> convenient to read.  Using one of those alternative formats would also
> have the advantage of not breaking compatibility with tools that
> process those entries.  I personally prefer the 2nd option with the
> extra "Loops:" line.  For non-text formats, should we keep the current
> format?

For non-text formats, I think "Max/Min Rows" and "Max/Min Times" are a bit
plain and their meaning is unclear. Instead, similar to the style of "Buffers",
does it make sense to use "Max/Min Rows in Loops" and "Max/Min Times in Loops"?

Regards,
Yugo Nagata

-- 
Yugo NAGATA 




Re: pgbench stopped supporting large number of client connections on Windows

2021-02-01 Thread Masahiko Sawada
Hi,

On Thu, Jan 7, 2021 at 10:48 PM Masahiko Sawada  wrote:
>
>  Hi Marina,
>
> On Sun, Nov 8, 2020 at 11:59 PM Marina Polyakova
>  wrote:
> >
> > On 2020-11-07 01:01, Fabien COELHO wrote:
> > > Hello Marina,
> >
> > Hello, Fabien!
> >
> > Thank you for your comments!
> >
> > >> While trying to test a patch that adds a synchronization barrier in
> > >> pgbench [1] on Windows,
> > >
> > > Thanks for trying that, I do not have a windows setup for testing, and
> > > the sync code I wrote for Windows is basically blind coding:-(
> >
> > FYI:
> >
> > 1) It looks like pgbench will no longer support Windows XP due to the
> > function DeleteSynchronizationBarrier. From
> > https://docs.microsoft.com/en-us/windows/win32/api/synchapi/nf-synchapi-deletesynchronizationbarrier
> > :
> >
> > Minimum supported client: Windows 8 [desktop apps only]
> > Minimum supported server: Windows Server 2012 [desktop apps only]
> >
> > On Windows Server 2008 R2 (MSVC 2013) the 6-th version of the patch [1]
> > has compiled without (new) warnings, but when running pgbench I got the
> > following error:
> >
> > The procedure entry point DeleteSynchronizationBarrier could not be
> > located in the dynamic link library KERNEL32.dll.
> >
> > 2) On Windows Server 2019 (MSVC 2019) the 6-th version of the patch [1]
> > with fix_max_client_conn_on_Windows.patch has compiled without (new)
> > warnings. I made a few runs (-M prepared -c 100 -j 10 -T 10 -P1 -S) with
> > and without your patches. On Linux (-M prepared -c 1000 -j 500 -T 10 -P1
> > -S) your patches fix problems with progress reports as in [2], but on
> > Windows I did not notice such changes, see attached
> > pgbench_runs_linux_vs_windows.zip.
> >
> > >> The almost same thing happens with reindexdb and vacuumdb (build on
> > >> commit [3]):
> > >
> > > Windows fd implementation is somehow buggy because it does not return
> > > the smallest number available, and then with the assumption that
> > > select uses a dense array indexed with them (true on linux, less so on
> > > Windows which probably uses a sparse array), so that the number gets
> > > over the limit, even if less are actually used, hence the catch, as
> > > you noted.
> >
> > I agree with you. It looks like the structure fd_set just contains used
> > sockets by this application on Windows, and the macro FD_SETSIZE is used
> > only here.
> >
> >  From
> > https://docs.microsoft.com/en-us/windows/win32/api/winsock/ns-winsock-fd_set
> > :
> >
> > typedef struct fd_set {
> >u_int  fd_count;
> >SOCKET fd_array[FD_SETSIZE];
> > } fd_set, FD_SET, *PFD_SET, *LPFD_SET;
> >
> >  From
> > https://docs.microsoft.com/en-us/windows/win32/winsock/maximum-number-of-sockets-supported-2
> > :
> >
> > The maximum number of sockets that a Windows Sockets application can use
> > is not affected by the manifest constant FD_SETSIZE. This value defined
> > in the Winsock2.h header file is used in constructing the FD_SET
> > structures used with select function.
> >
> > >> IIUC the checks below are not correct on Windows, since on this system
> > >> sockets can have values equal to or greater than FD_SETSIZE (see
> > >> Windows documentation [4] and pgbench debug output in attached
> > >> pgbench_debug.txt).
> > >
> > > Okay.
> > >
> > > But then, how may one detect that there are too many fds in the set?
> > >
> > > I think that an earlier version of the code needed to make assumptions
> > > about the internal implementation of windows (there is a counter
> > > somewhere in windows fd_set struct), which was rejected because if was
> > > breaking the interface. Now your patch is basically resurrecting that.
> >
> > I tried to keep the behaviour "we check if the socket value can be used
> > in select() at runtime", but now I will also read that thread...
> >
> > > Why not if there is no other solution, but this is quite depressing,
> > > and because it breaks the interface it would be broken if windows
> > > changed its internals for some reason:-(
> >
> > It looks like if the internals of the structure fd_set are changed, we
> > will also have problems with the function pgwin32_select from
> > src/backend/port/win32/socket.c, because it uses fd_set.fd_count too?..
> >
> > (I'm writing responses to the rest of your comments but it takes
> > time...)
> >
>
> This patch on Commitfest has been "Waiting on Author" for almost 2
> months. Could you share the current status? Are you updating the
> patch?

Status update for a commitfest entry.

Since this is a bug fix, I've moved this patch to the next commitfest.
I think if this patch is still inactive by the feature freeze, we
can remove this entry by setting it to "Returned with Feedback".

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Fix typo about generate_gather_paths

2021-02-01 Thread Masahiko Sawada
Hi,

On Mon, Feb 1, 2021 at 11:44 AM Masahiko Sawada  wrote:
>
> Hi,
>
> On Wed, Dec 23, 2020 at 3:24 AM Tomas Vondra
>  wrote:
> >
> > On 12/9/20 3:21 AM, Hou, Zhijie wrote:
> > > Hi
> > >
> > > Since ba3e76c,
> > > the optimizer calls generate_useful_gather_paths instead of
> > > generate_gather_paths() from outside.
> > >
> > > But I noticed that some comments still talk about generate_gather_paths,
> > > not generate_useful_gather_paths.
> > > I think we should fix these comments, and I have tried to replace
> > > "generate_gather_paths" with "generate_useful_gather_paths".
> > >
> >
> > Thanks. I started looking at this a bit more closely, and I think most
> > of the changes are fine - the code was changed to call a different
> > function, but the comments still reference generate_gather_paths().
> >
> > The one exception seems to be create_ordered_paths(), because that
> > comment also makes statements about what generate_gather_paths is
> > doing. And some of it does not apply to generate_useful_gather_paths.
> > For example it says it generates order-preserving Gather Merge paths,
> > but generate_useful_gather_paths also generates paths with sorts (which
> > are clearly not order-preserving).
> >
> > So I think this comment will need a bit more work to update ...
>
> Status update for a commitfest entry.
>
> This patch has been "Waiting on Author" without seeing any activity
> since Tomas sent review comments. I'm planning to set it to "Returned
> with Feedback", barring objections.
>

This patch, which you submitted to this CommitFest, has been awaiting
your attention for more than one month. As such, we have moved it to
"Returned with Feedback" and removed it from the reviewing queue.
Depending on timing, this may be reversible, so let us know if there
are extenuating circumstances. In any case, you are welcome to address
the feedback you have received, and resubmit the patch to the next
CommitFest.

Thank you for contributing to PostgreSQL.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Support for NSS as a libpq TLS backend

2021-02-01 Thread Michael Paquier
On Fri, Jan 29, 2021 at 01:57:02PM +0100, Daniel Gustafsson wrote:
> This has been discussed elsewhere in the thread, so let's continue that there.
> The attached v23 does however split off --with-ssl for OpenSSL in 0001, adding
> the nss option in 0002.

While going through 0001, I have found a couple of things.

-CF_SRCS = $(if $(subst no,,$(with_openssl)), $(OSSL_SRCS), $(INT_SRCS))
-CF_TESTS = $(if $(subst no,,$(with_openssl)), $(OSSL_TESTS), $(INT_TESTS))
+CF_SRCS = $(if $(subst openssl,,$(with_ssl)), $(OSSL_SRCS), $(INT_SRCS))
+CF_TESTS = $(if $(subst openssl,,$(with_ssl)), $(OSSL_TESTS), $(INT_TESTS))
It seems to me that this part is the opposite, aka here the OpenSSL
files and tests (OSSL*) would be used if with_ssl is not openssl.

-ifeq ($(with_openssl),yes)
+ifneq ($(with_ssl),no)
+OBJS += \
+   fe-secure-common.o
+endif
This split is better, good idea.

The two SSL tests still included a reference to with_openssl after
0001:
src/test/ssl/t/001_ssltests.pl:if ($ENV{with_openssl} eq 'yes')
src/test/ssl/t/002_scram.pl:if ($ENV{with_openssl} ne 'yes')

I have refreshed the docs on top to be consistent with the new
configuration, and applied it after more checks.  I'll try to look in
more details at the failures with cryptohashes I found upthread.
--
Michael




Re: [PATCH] remove deprecated v8.2 containment operators

2021-02-01 Thread Masahiko Sawada
On Thu, Jan 28, 2021 at 9:50 PM Masahiko Sawada  wrote:
>
> On Tue, Dec 1, 2020 at 4:09 AM Tom Lane  wrote:
> >
> > Justin Pryzby  writes:
> > > I think this is waiting on me to provide a patch for the contrib/ modules 
> > > with
> > > update script removing the SQL operators, and documentating their 
> > > deprecation.
> >
> > Right.  We can remove the SQL operators, but not (yet) the C code support.
> >
> > I'm not sure that the docs change would do more than remove any existing
> > mentions of the operators.
> >
>
> Status update for a commitfest entry.
>
> Almost 2 months passed since the last update. Are you planning to work
> on this, Justin? If not, I'm planning to set it "Returned with
> Feedback" barring objectinos.
>

This patch has been awaiting updates for more than one month. As such,
we have moved it to "Returned with Feedback" and removed it from the
reviewing queue. Depending on timing, this may be reversible, so let
us know if there are extenuating circumstances. In any case, you are
welcome to address the feedback you have received, and resubmit the
patch to the next CommitFest.

Thank you for contributing to PostgreSQL.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Recording foreign key relationships for the system catalogs

2021-02-01 Thread Joel Jacobson
Very nice. Thanks to this patch, I can get rid of my own parse-catalogs.pl hack 
and use pg_get_catalog_foreign_keys() instead.

+1

I can with high confidence assert the correctness of 
pg_get_catalog_foreign_keys()'s output,
as it matches the lookup tables for the tool I'm hacking on precisely:

--
-- verify single column foreign keys
--
WITH
a AS (
  SELECT
fktable::text,
fkcols[1]::text,
pktable::text,
pkcols[1]::text
  FROM pg_get_catalog_foreign_keys()
  WHERE cardinality(fkcols) = 1
),
b AS (
  SELECT
table_name,
column_name,
ref_table_name,
ref_column_name
  FROM pit.oid_joins
)
SELECT
  (SELECT COUNT(*) FROM (SELECT * FROM a EXCEPT SELECT * FROM b) AS x) AS 
except_b,
  (SELECT COUNT(*) FROM (SELECT * FROM b EXCEPT SELECT * FROM a) AS x) AS 
except_a,
  (SELECT COUNT(*) FROM (SELECT * FROM b INTERSECT SELECT * FROM a) AS x) AS 
a_intersect_b
;

 except_b | except_a | a_intersect_b
----------+----------+---------------
        0 |        0 |           209
(1 row)

--
-- verify multi-column foreign keys
--
WITH
a AS (
  SELECT
fktable::text,
fkcols,
pktable::text,
pkcols
  FROM pg_get_catalog_foreign_keys()
  WHERE cardinality(fkcols) > 1
),
b AS (
  SELECT
table_name,
ARRAY[rel_column_name,attnum_column_name],
'pg_attribute',
'{attrelid,attnum}'::text[]
  FROM pit.attnum_joins
)
SELECT
  (SELECT COUNT(*) FROM (SELECT * FROM a EXCEPT SELECT * FROM b) AS x) AS 
except_b,
  (SELECT COUNT(*) FROM (SELECT * FROM b EXCEPT SELECT * FROM a) AS x) AS 
except_a,
  (SELECT COUNT(*) FROM (SELECT * FROM b INTERSECT SELECT * FROM a) AS x) AS 
a_intersect_b
;

 except_b | except_a | a_intersect_b
----------+----------+---------------
        0 |        0 |             8
(1 row)

/Joel

On Sun, Jan 31, 2021, at 22:39, Tom Lane wrote:
> Now that dfb75e478 is in, I looked into whether we can have some
> machine-readable representation of the catalogs' foreign key
> relationships.  As per the previous discussion [1], it's not
> practical to have "real" SQL foreign key constraints, because
> the semantics we use aren't quite right (i.e., using 0 instead
> of NULL in rows with no reference).  Nonetheless it would be
> nice to have the knowledge available in some form, and we agreed
> that a set-returning function returning the data would be useful.
> 
> The attached patch creates that function, and rewrites the oidjoins.sql
> regression test to use it, in place of the very incomplete info that's
> reverse-engineered by findoidjoins.  It's mostly straightforward.
> 
> My original thought had been to add DECLARE_FOREIGN_KEY() macros
> for all references, but I soon realized that in a large majority of
> the cases, that's redundant with the BKI_LOOKUP() annotations we
> already have.  So I taught genbki.pl to extract FK data from
> BKI_LOOKUP() as well as the explicit DECLARE macros.  That didn't
> remove the work entirely, because it turned out that we hadn't
> bothered to apply BKI_LOOKUP() labels to most of the catalogs that
> have no hand-made data.  A big chunk of the patch consists in
> adding those as needed.  Also, I had to make the BKI_LOOKUP()
> mechanism a little more complete, because it failed on pg_namespace
> and pg_authid references.  (It will still fail on some other
> cases such as BKI_LOOKUP(pg_foreign_server), but I think there's
> no need to fill that in until/unless we have some built-in data
> that needs it.)
> 
> There are various loose ends yet to be cleaned up:
> 
> * I'm unsure whether it's better for the SRF to return the
> column names as textual names, or as column numbers.  Names was
> a bit easier for all the parts of the current patch so I did
> it that way, but maybe there's a case for the other way.
> Actually the whole API for the SRF is just spur-of-the-moment,
> so maybe a different API would be better.
> 
> * It would now be possible to remove the PGNSP and PGUID kluges
> entirely in favor of plain BKI_LOOKUP references to pg_namespace
> and pg_authid.  The catalog header usage would get a little
> more verbose: BKI_DEFAULT(PGNSP) becomes BKI_DEFAULT(pg_catalog)
> and BKI_DEFAULT(PGUID) becomes BKI_DEFAULT(POSTGRES).  I'm a bit
> inclined to do it, simply to remove one bit of mechanism that has
> to be documented; but it's material for a separate patch perhaps.
> 
> * src/tools/findoidjoins should be nuked entirely, AFAICS.
> Again, that could be a follow-on patch.
> 
> * I've not touched the SGML docs.  Certainly
> pg_get_catalog_foreign_keys() should be documented, and some
> adjustments in bki.sgml might be appropriate.
> 
> regards, tom lane
> 
> [1] 
> https://www.postgresql.org/message-id/flat/dc5f44d9-5ec1-a596-0251-dadadcdede98%402ndquadrant.com
> 
> 
> 
> *Attachments:*
>  * add-catalog-foreign-key-info-1.patch

Kind regards,

Joel


Re: bitmaps and correlation

2021-02-01 Thread Masahiko Sawada
On Thu, Jan 28, 2021 at 9:51 PM Masahiko Sawada  wrote:
>
> On Sat, Nov 28, 2020 at 5:49 AM Tom Lane  wrote:
> >
> > Heikki Linnakangas  writes:
> > > Other than that, and a quick pgindent run, this seems ready to me. I'll
> > > mark it as Ready for Committer.
> >
> > I dunno, this seems largely misguided to me.
> >
> > It's already the case that index correlation is just not the right
> > stat for this purpose, since it doesn't give you much of a toehold
> > on whether a particular scan is going to be accessing tightly-clumped
> > data.  For specific kinds of index conditions, such as a range query
> > on a btree index, maybe you could draw that conclusion ... but this
> > patch isn't paying any attention to the index condition in use.
> >
> > And then the rules for bitmap AND and OR correlations, if not just
> > plucked out of the air, still seem *far* too optimistic.  As an
> > example, even if my individual indexes are perfectly correlated and
> > so a probe would touch only one page, OR'ing ten such probes together
> > is likely going to touch ten different pages.  But unless I'm
> > misreading the patch, it's going to report back an OR correlation
> > that corresponds to touching one page.
> >
> > Even if we assume that the correlation is nonetheless predictive of
> > how big a part of the table we'll be examining, I don't see a lot
> > of basis for deciding that the equations the patch adds to
> > cost_bitmap_heap_scan are the right ones.
> >
> > I'd have expected this thread to focus a whole lot more on actual
> > examples than it has done, so that we could have some confidence
> > that these equations have something to do with reality.
> >
>
> Status update for a commitfest entry.
>
> The discussion has been inactive since the end of the last CF. It
> seems to me that we need some discussion on the point Tom mentioned.
> It looks either "Needs Review" or "Ready for Committer" status but
> Justin set it to "Waiting on Author" on 2020-12-03 by himself. Are you
> working on this, Justin?
>

Status update for a commitfest entry.

This patch, which you submitted to this CommitFest, has been awaiting
your attention for more than one month. As such, we have moved it to
"Returned with Feedback" and removed it from the reviewing queue.
Depending on timing, this may be reversible, so let us know if there
are extenuating circumstances. In any case, you are welcome to address
the feedback you have received, and resubmit the patch to the next
CommitFest.

Thank you for contributing to PostgreSQL.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: [BUG] orphaned function

2021-02-01 Thread Masahiko Sawada
On Mon, Feb 1, 2021 at 11:26 AM Masahiko Sawada  wrote:
>
> Hi Bertrand,
>
> On Fri, Dec 18, 2020 at 6:52 PM Gilles Darold  wrote:
> >
> > Le 18/12/2020 à 00:26, Tom Lane a écrit :
> > > Gilles Darold  writes:
> > >> The same problem applies if the returned type or the procedural language
> > >> is dropped. I have tried to fix that in ProcedureCreate() by using an
> > >> AccessSharedLock for the data types and languages involved in the
> > >> function declaration. With this patch the race condition with parameters
> > >> types, return types and PL languages are fixed. Only data types and
> > >> procedural languages with Oid greater than FirstBootstrapObjectId will
> > >> be locked locked. But this is probably more complex than this fix so it
> > >> is proposed as a separate patch
> > >> (v1-003-orphaned_function_types_language.patch) to not interfere with
> > >> the applying of Bertran's patch.
> > > Indeed.  This points up one of the things that is standing in the way
> > > of any serious consideration of applying this patch.  To my mind there
> > > are two fundamental, somewhat interrelated considerations that haven't
> > > been meaningfully addressed:
> > >
> > > 1. What set of objects is it worth trying to remove this race condition
> > > for, and with respect to what dependencies?  Bertrand gave no
> > > justification for worrying about function-to-namespace dependencies
> > > specifically, and you've likewise given none for expanding the patch
> > > to consider function-to-datatype dependencies.  There are dozens more
> > > cases that could be considered; but I sure don't want to be processing
> > > another ad-hoc fix every time someone thinks they're worried about
> > > another specific case.
> > >
> > > Taking a practical viewpoint, I'm far more worried about dependencies
> > > of tables than those of any other kind of object.  A messed-up function
> > > definition doesn't cost you much: you can just drop and recreate the
> > > function.  A table full of data is way more trouble to recreate, and
> > > indeed the data might be irreplaceable.  So it seems pretty silly to
> > > propose that we try to remove race conditions for functions' dependencies
> > > on datatypes without removing the same race condition for tables'
> > > dependencies on their column datatypes.
> > >
> > > But any of these options lead to the same question: why stop there?
> > > An approach that would actually be defensible, perhaps, is to incorporate
> > > this functionality into the dependency mechanism: any time we're about to
> > > create a pg_depend or pg_shdepend entry, take out an AccessShareLock on
> > > the referenced object, and then check to make sure it still exists.
> > > This would improve the dependency mechanism so it prevents creation-time
> > > race conditions, not just attempts to drop dependencies of
> > > already-committed objects.  It would mean that the one patch would be
> > > the end of the problem, rather than looking forward to a steady drip of
> > > new fixes.
> > >
> > > 2. How much are we willing to pay for this added protection?  The fact
> > > that we've gotten along fine without it for years suggests that the
> > > answer needs to be "not much".  But I'm not sure that that actually
> > > is the answer, especially if we don't have a policy that says "we'll
> > > protect against these race conditions but no other ones".  I think
> > > there are possibly-serious costs in three different areas:
> > >
> > > * Speed.  How much do all these additional lock acquisitions slow
> > > down a typical DDL command?
> > >
> > > * Number of locks taken per transaction.  This'd be particularly an
> > > issue for pg_restore runs using single-transaction mode: they'd take
> > > not only locks on the objects they create, but also a bunch of
> > > reference-protection locks.  It's not very hard to believe that this'd
> > > make a difference in whether restoring a large database is possible
> > > without increasing max_locks_per_transaction.
> > >
> > > * Risk of deadlock.  The reference locks themselves should be sharable,
> > > so maybe there isn't much of a problem, but I want to see this question
> > > seriously analyzed not just ignored.
> > >
> > > Obviously, the magnitude of these costs depends a lot on how many
> > > dependencies we want to remove the race condition for.  But that's
> > > exactly the reason why I don't want a piecemeal approach of fixing
> > > some problems now and some other problems later.  That's way too
> > > much like the old recipe for boiling a frog: we could gradually get
> > > into serious performance problems without anyone ever having stopped
> > > to consider the issue.
> > >
> > > In short, I think we should either go for a 100% solution if careful
> > > analysis shows we can afford it, or else have a reasoned policy
> > > why we are going to close these specific race conditions and no others
> > > (implying that we'll reject future patches in the same area).  We
> > > haven't got e

Re: avoid bitmapOR-ing indexes with scan condition inconsistent with partition constraint

2021-02-01 Thread Masahiko Sawada
On Mon, Feb 1, 2021 at 12:09 PM Masahiko Sawada  wrote:
>
> Hi,
>
> On Fri, Nov 13, 2020 at 5:14 AM Tom Lane  wrote:
> >
> > I started looking through this patch.  I really quite dislike solving
> > this via a kluge in indxpath.c.  There are multiple disadvantages
> > to that:
> >
> > * It only helps for the very specific problem of redundant bitmap
> > index scans, whereas the problem of applying redundant qual checks
> > in partition scans seems pretty general.
> >
> > * It's not unlikely that this will end up trying to make the same
> > proof multiple times (and the lack of any way to turn that off,
> > through constraint_exclusion or some other knob, isn't too cool).
> >
> > * It does nothing to fix rowcount estimates in the light of the
> > knowledge that some of the restriction clauses are no-ops.  Now,
> > if we have up-to-date stats we'll probably manage to come out with
> > an appropriate 0 or 1 selectivity anyway, but we might not have those.
> > In any case, spending significant effort to estimate a selectivity
> > when some other part of the code has taken the trouble to *prove* the
> > clause true or false seems very undesirable.
> >
> > * I'm not even convinced that the logic is correct, specifically that
> > it's okay to just "continue" if we refute the OR clause.  That seems
> > likely to break generate_bitmap_or_paths' surrounding loop logic about
> > "We must be able to match at least one index to each of the arms of
> > the OR".  At least, if that still works it requires more than zero
> > commentary about why.
> >
> >
> > So I like much better the idea of Konstantin's old patch, that we modify
> > the rel's baserestrictinfo list by removing quals that we can prove
> > true.  We could extend that to solve the bitmapscan problem by removing
> > OR arms that we can prove false.  So I started to review that patch more
> > carefully, and after awhile realized that it has a really fundamental
> > problem: it is trying to use CHECK predicates to prove WHERE clauses.
> > But we don't know that CHECK predicates are true, only that they are
> > not-false, and there is no proof mode in predtest.c that will allow
> > proving some clauses true based only on other ones being not-false.
> >
> > We can salvage something by restricting the input quals to be only
> > partition quals, since those are built to be guaranteed-true-or-false;
> > we can assume they don't yield NULL.  There's a hole in that for
> > hashing, as I noted elsewhere, but we'll fail to prove anything anyway
> > from a satisfies_hash_partition() qual.  (In principle we could also use
> > attnotnull quals, which also have that property.  But I'm dubious that
> > that will help often enough to be worth the extra cycles for predtest.c
> > to process them.)
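
As a minimal sketch of the kind of case at stake (hypothetical table; not
taken from the thread):

CREATE TABLE t (a int, b int) PARTITION BY RANGE (a);
CREATE TABLE t1 PARTITION OF t FOR VALUES FROM (1) TO (10);
CREATE INDEX ON t1 (a);
CREATE INDEX ON t1 (b);

-- For the scan of t1 the partition bound already proves "a < 10" true, so
-- ideally only the index on b would be used instead of a BitmapOr of both:
EXPLAIN SELECT * FROM t WHERE a < 10 OR b = 5;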
> >
> > So after a bit of coding I had the attached.  This follows Konstantin's
> > original patch in letting relation_excluded_by_constraints() change
> > the baserestrictinfo list.  I read the comments in the older thread
> > about people not liking that, and I can see the point.  But I'm not
> > convinced that the later iterations of the patch were an improvement,
> > because (a) the call locations for
> > remove_restrictions_implied_by_constraints() seemed pretty random, and
> > (b) it seems necessary to have relation_excluded_by_constraints() and
> > remove_restrictions_implied_by_constraints() pretty much in bed with
> > each other if we don't want to duplicate constraint-fetching work.
> > Note the comment on get_relation_constraints() that it's called at
> > most once per relation; that's not something I particularly desire
> > to give up, because a relcache open isn't terribly cheap.  Also
> > (c) I think it's important that there be a way to suppress this
> > overhead when it's not useful.  In the patch as attached, turning off
> > constraint_exclusion does that since relation_excluded_by_constraints()
> > falls out before getting to the new code.  If we make
> > remove_restrictions_implied_by_constraints() independent then it
> > will need some possibly-quite-duplicative logic to check
> > constraint_exclusion.  (Of course, if we'd rather condition this
> > on some other GUC then that argument falls down.  But I think we
> > need something.)  So, I'm not dead set on this code structure, but
> > I haven't seen one I like better.
> >
> > Anyway, this seems to work, and if the regression test changes are
> > any guide then it may fire often enough in the real world to be useful.
> > Nonetheless, I'm concerned about performance, because predtest.c is a
> > pretty expensive thing and there will be a lot of cases where the work
> > is useless.  I did a quick check using pgbench's option to partition
> > the tables, and observed that the -S (select only) test case seemed to
> > get about 2.5% slower with the patch than without.  That's not far
> > outside the noise floor, so maybe it's not real, but if it is real then
> > it seems pretty disastrous.  Perhaps we could avoid that pro

Re: Recording foreign key relationships for the system catalogs

2021-02-01 Thread Joel Jacobson
Could it be an idea to also add

   OUT can_be_zero boolean

to pg_get_catalog_foreign_keys()'s out parameters?

This information is useful for knowing whether one should do an INNER JOIN or a
LEFT JOIN on the foreign keys.
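
A minimal sketch of the difference, using pg_class.reltablespace (where 0
means the database's default tablespace):

-- An inner join drops every relation stored in the default tablespace,
-- because reltablespace = 0 matches no row in pg_tablespace:
SELECT c.relname, t.spcname
FROM pg_class c
JOIN pg_tablespace t ON t.oid = c.reltablespace;

-- A left join keeps those rows, with spcname coming back as NULL:
SELECT c.relname, t.spcname
FROM pg_class c
LEFT JOIN pg_tablespace t ON t.oid = c.reltablespace;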

The information is mostly available in the documentation already, but it is
not quite accurate; I've proposed a patch [1] to fix that.

[1] 
https://www.postgresql.org/message-id/4ed9a372-7bf9-479a-926c-ae8e77471...@www.fastmail.com


Re: [BUG] orphaned function

2021-02-01 Thread Masahiko Sawada
On Mon, Feb 1, 2021 at 11:06 PM Drouvot, Bertrand  wrote:
>
> Hi Sawada,
>
> On 2/1/21 2:37 PM, Masahiko Sawada wrote:
> >
> > On Mon, Feb 1, 2021 at 11:26 AM Masahiko Sawada  
> > wrote:
> >> Status update for a commitfest entry.
> >>
> >> This patch has been "Waiting on Author" and inactive for almost 2
> >> months. Are you planning to address the review comments Tom
> >> sent? This is categorized as bug fixes on CF app but I'm going to set
> >> it to "Returned with Feedback" barring objections as the patch got
> >> feedback but there has been no activity for a long time.
> >>
> > This patch, which you submitted to this CommitFest, has been awaiting
> > your attention for more than one month. As such, we have moved it to
> > "Returned with Feedback" and removed it from the reviewing queue.
> > Depending on timing, this may be reversible, so let us know if there
> > are extenuating circumstances. In any case, you are welcome to address
> > the feedback you have received, and resubmit the patch to the next
> > CommitFest.
> >
> I had in mind to reply to Tom's comments later this week (sorry for the
> delay).
>
> Please let me know if it's possible to put it back to the reviewing
> queue (if not I'll open a new thread).

Thank you for letting me know.

I've moved this patch to the next commitfest. Please check the
status on the CF app.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Add primary keys to system catalogs

2021-02-01 Thread Tom Lane
Peter Eisentraut  writes:
> On 2021-01-30 22:56, Tom Lane wrote:
>> Hmm, shouldn't there have been a catversion bump in there?

> I suppose yes on the grounds that it introduces something new in a 
> freshly initdb-ed database.  But I thought it wasn't necessary because 
> there is no dependency between the binaries and the on-disk state.

I've generally worked on the theory that a catversion bump is indicated
if you need to initdb in order to pass the updated regression tests.
Which one did in this case.  However ...

> There has already been another catversion change since, so it's no 
> longer relevant.

... yeah, it's moot now.

regards, tom lane




Commitfest 2021-01 is now closed.

2021-02-01 Thread Masahiko Sawada
Hi all,

Commitfest 2021-01 is now closed. All patches have now been dealt
with. Before starting on moving and closing patches the stats were:

Needs review: 146 (-4)
Waiting on Author: 25 (+1)
Ready for Committer : 23 (-1)
Committed: 56 (+4)
Moved to next CF: 2 (+2)
Withdrawn: 8 (+0)

The following 4 patches out of 25 WoA patches were waiting for the
author during this commit fest without any updates. I've closed those
patches as "Returned with Feedback":

* Fix comment about generate_gather_paths
* https://commitfest.postgresql.org/31/2876/
* remove deprecated v8.2 containment operators
* https://commitfest.postgresql.org/31/2798/
* bitmap cost should account for correlated indexes
* https://commitfest.postgresql.org/31/2310/
* avoid bitmapOR-ing indexes for scan conditions implied by partition constraint
* https://commitfest.postgresql.org/31/2644/

Other patches have been moved to the next commitfest. As a result, a lot
of patches are still in the reviewing queue. When closing patches, I
chose the ones that had clearly been inactive for more than about a
month; where I could confirm that the author planned to update the patch
soon, I did not close it. So I might have left too many patches for the
next commitfest. If you have a patch that was moved, and you intend to
rewrite enough of it to warrant a resubmission, then please go in and
close your entry.

Finally, while a lot of patches got reviewed and committed during this
commitfest, about 80 patches have been moved to the next commitfest
without any reviews. We clearly need more bandwidth among reviewers
and committers to cope with the increasing size of the commitfests.
From another point of view, those patches are likely to involve long
discussions and a certain level of difficulty, so they are relatively
hard for beginners. It would be good if the experienced hackers focused
more on such difficult patches.  It's just an idea, but I thought it
would be helpful if we could have something like a mark on the CF app
indicating that a patch is good for beginners, like the [E] mark on
the ToDo wiki page[1]. This would help new contributors choose a patch
to review and might increase the number of reviewers, which in turn
would let the experienced hackers focus on other patches. The mark
could be added/edited either by the patch author or the CFM.

I've closed this commitfest. If you have feedback or comments on my CFM
work, please tell me here or by emailing me directly. Thanks to everyone
who participated in writing and reviewing patches, joining the discussions,
and the commitfest!

Regards,

[1] https://wiki.postgresql.org/wiki/Todo

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Commitfest 2021-01 is now closed.

2021-02-01 Thread Laurenz Albe
On Mon, 2021-02-01 at 23:33 +0900, Masahiko Sawada wrote:
> I've closed this commitfest. If you have feedback or comment on my CFM
> work, please tell me here or by directly emailing me.

I think you did an excellent job.

Yours,
Laurenz Albe





Re: Commitfest 2021-01 is now closed.

2021-02-01 Thread Julien Rouhaud
On Mon, Feb 1, 2021 at 22:53, Laurenz Albe  wrote:

> On Mon, 2021-02-01 at 23:33 +0900, Masahiko Sawada wrote:
> > I've closed this commitfest. If you have feedback or comment on my CFM
> > work, please tell me here or by directly emailing me.
>
> I think you did an excellent job.
>

definitely agreed, thanks a lot for running the commit fest Sawada-san!

>


Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2021-02-01 Thread Alexey Kondratov

On 2021-01-30 05:23, Michael Paquier wrote:

On Fri, Jan 29, 2021 at 08:56:47PM +0300, Alexey Kondratov wrote:

On 2021-01-28 14:42, Alexey Kondratov wrote:

No, you are right, we are doing REINDEX with AccessExclusiveLock as it
was before. This part is a more specific one. It only applies to
partitioned indexes, which do not hold any data, so we do not reindex
them directly, only their leafs. However, if we are doing a TABLESPACE
change, we have to record it in their pg_class entry, so all future
leaf partitions were created in the proper tablespace.

That way, we open partitioned index relation only for a reference,
i.e. read-only, but modify pg_class entry under a proper lock
(RowExclusiveLock). That's why I thought that ShareLock will be
enough.

IIUC, 'ALTER TABLE ... SET TABLESPACE' uses AccessExclusiveLock even
for relations with no storage, since AlterTableGetLockLevel() chooses
it if AT_SetTableSpace is met. This is very similar to our case, so
probably we should do the same?

Actually it is not completely clear for me why
ShareUpdateExclusiveLock is sufficient for newly added
SetRelationTableSpace() as Michael wrote in the comment.


Nay, it was not fine.  That's something Alvaro has mentioned, leading
to 2484329.  This also means that the main patch of this thread should
refresh the comments at the top of CheckRelationTableSpaceMove() and
SetRelationTableSpace() to mention that this is used by REINDEX
CONCURRENTLY with a lower lock.



Hm, IIUC, REINDEX CONCURRENTLY doesn't use either of them. It directly 
uses index_create() with a proper tablespaceOid instead of 
SetRelationTableSpace(). And its checks structure is more restrictive 
even without tablespace change, so it doesn't use 
CheckRelationTableSpaceMove().


Changed patch to use AccessExclusiveLock in this part for now. This is
what 'ALTER TABLE/INDEX ... SET TABLESPACE' and 'REINDEX' usually do.
Anyway, all real leaf partitions are processed in the independent
transactions later.


+   if (partkind == RELKIND_PARTITIONED_INDEX)
+   {
+   Relation iRel = index_open(partoid, AccessExclusiveLock);
+
+   if (CheckRelationTableSpaceMove(iRel, params->tablespaceOid))
+   SetRelationTableSpace(iRel,
+ params->tablespaceOid,
+ InvalidOid);
+   index_close(iRel, NoLock);
Are you sure that this does not represent a risk of deadlocks as EAL
is not taken consistently across all the partitions?  A second issue
here is that this breaks the assumption of REINDEX CONCURRENTLY kicked
on partitioned relations that should use ShareUpdateExclusiveLock for
all its steps.  This would make the first transaction invasive for the
user, but we don't want that.

This makes me really wonder if we would not be better to restrict this
operation for partitioned relation as part of REINDEX as a first step.
Another thing, mentioned upthread, is that we could do this part of
the switch at the last transaction, or we could silently *not* do the
switch for partitioned indexes in the flow of REINDEX, letting users
handle that with an extra ALTER TABLE SET TABLESPACE once REINDEX has
finished on all the partitions, cascading the command only on the
partitioned relation of a tree.  It may be interesting to look as well
at if we could lower the lock used for partitioned relations with
ALTER TABLE SET TABLESPACE from AEL to SUEL, choosing AEL only if at
least one partition with storage is involved in the command,
CheckRelationTableSpaceMove() discarding anything that has no need to
change.



I am not sure right now, so I split previous patch into two parts:

0001: Adds TABLESPACE into REINDEX with tests, doc and all the stuff we 
did before with the only exception that it doesn't move partitioned 
indexes into the new tablespace.


Basically, it implements this option "we could silently *not* do the 
switch for partitioned indexes in the flow of REINDEX, letting users 
handle that with an extra ALTER TABLE SET TABLESPACE once REINDEX has 
finished". It probably makes sense, since we are doing tablespace change 
altogether with index relation rewrite and don't touch relations without 
storage. Doing ALTER INDEX ... SET TABLESPACE will be almost cost-less 
on them, since they do not hold any data.
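
To sketch the resulting workflow (object names are hypothetical; syntax as
proposed by the patch):

-- Rewrites all leaf partition indexes into the new tablespace; the
-- storage-less partitioned index itself is left alone by 0001:
REINDEX (TABLESPACE new_tblspc) TABLE parted_tab;

-- Cheap follow-up so that the partitioned index's pg_class entry (and thus
-- future partitions) also points at the new tablespace:
ALTER INDEX parted_tab_idx SET TABLESPACE new_tblspc;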


0002: Implements the remaining part where pg_class entry is also changed 
for partitioned indexes. I think that we should think more about it, 
maybe it is not so dangerous and proper locking strategy could be 
achieved.



Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres CompanyFrom 6322032b472e6b1a76e0ca9326974e5774371fb9 Mon Sep 17 00:00:00 2001
From: Alexey Kondratov 
Date: Mon, 1 Feb 2021 15:20:29 +0300
Subject: [PATCH v10 2/2] Change tablespace of partitioned indexes during
 REINDEX.

There are some doubts about proper locking of partitions
here. AccessExclusiveLock is surely enough, but may be
a reason of 

Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2021-02-01 Thread Justin Pryzby
On Mon, Feb 01, 2021 at 06:28:57PM +0300, Alexey Kondratov wrote:
> On 2021-01-30 05:23, Michael Paquier wrote:
> > This makes me really wonder if we would not be better to restrict this
> > operation for partitioned relation as part of REINDEX as a first step.
> > Another thing, mentioned upthread, is that we could do this part of
> > the switch at the last transaction, or we could silently *not* do the
> > switch for partitioned indexes in the flow of REINDEX, letting users
> > handle that with an extra ALTER TABLE SET TABLESPACE once REINDEX has
> > finished on all the partitions, cascading the command only on the
> > partitioned relation of a tree.  

I suggest that it'd be unintuitive to skip partitioned rels, silently
requiring a user to also run "ALTER .. SET TABLESPACE".

But I think it'd be okay if REINDEX(TABLESPACE) didn't support partitioned
tables/indexes at first.  I think it'd be better as an ERROR.

-- 
Justin




Re: Proposal: Save user's original authenticated identity for logging

2021-02-01 Thread Stephen Frost
Greetings,

* Magnus Hagander (mag...@hagander.net) wrote:
> On Sat, Jan 30, 2021 at 12:21 AM Jacob Champion  wrote:
> > > I'm also just generally not thrilled with
> > > putting much effort into LDAP as it's a demonstrably insecure
> > > authentication mechanism.
> >
> > Because Postgres has to proxy the password? Or is there something else?

Yes.

> Stephen is on a bit of a crusade against ldap :) Mostly for good
> reasons of course. A large number of those who choose ldap also have a
> kerberos system (because, say, active directory) and they pick ldap
> only because they think it's good, not because it is...

This is certainly one area of frustration, but even if Kerberos isn't
available, it doesn't make it a good idea to use LDAP.

> But yes, I think the enforced cleartext password proxying is at the
> core of the problem. LDAP also encourages the idea of centralized
> password-reuse, which is not exactly a great thing for security.

Right- passing around a user's password in the clear (or even through an
encrypted tunnel) has been strongly discouraged for a very long time,
for very good reason.  LDAP does double down on that by being a
centralized password, meaning that someone's entire identity (for all
the services that share that LDAP system, at least) is compromised if
any one system in the environment is.

Ideally, we'd have a 'PasswordAuthentication' option which would
disallow cleartext passwords, as has been discussed elsewhere, which
would make things like ldap and pam auth methods disallowed.

> That said, I don't think either of those are reasons not to improve on
> LDAP. It can certainly be a reason for somebody not to want to spend
> their own time on it, but there's no reason it should prevent
> improvements.

I realize that this isn't a popular opinion, but I'd much rather we
actively move in the direction of deprecating auth methods which use
cleartext passwords.  The one auth method we have that works that way
and isn't terrible is radius, though it also isn't great, since the PIN
doesn't change and would be compromised, not to mention that it likely
depends on the specific system whether an attacker could use the exact
same code provided to log into other systems if done quickly enough.

> > > > I propose that every auth method should store the string it uses to
> > > > identify a user -- what I'll call an "authenticated identity" -- into
> > > > one central location in Port, after authentication succeeds but before
> > > > any pg_ident authorization occurs. This field can then be exposed in
> > > > log_line_prefix. (It could additionally be exposed through a catalog
> > > > table or SQL function, if that were deemed useful.) This would let a
> > > > DBA more easily audit user activity when using more complicated
> > > > pg_ident setups.
> > >
> > > This seems like it would be good to include the CSV format log files
> > > also.
> >
> > Agreed in principle... Is the CSV format configurable? Forcing it into
> > CSV logs by default seems like it'd be a hard sell, especially for
> > people not using pg_ident.
> 
> For CSV, all columns are always included, and that's a feature -- it
> makes it predictable.
> 
> To make it optional it would have to be a configuration parameter that
> turns the field into an empty one. but it should still be there.

Yeah, we've been around this before and, as I recall anyway, there was
actually a prior patch proposed to add this information to the CSV log.
There is the question about if it's valuable enough to repeat on every
line or not.  These days, I think I lean in the same direction as the
majority on this thread that it's sufficient to log as part of the
connection authorized message.

> > > For some auth methods, eg: GSS, we've recently added information into
> > > the authentication method which logs what the authenticated identity
> > > was.  The advantage with that approach is that it avoids bloating the
> > > log by only logging that information once upon connection rather than
> > > on every log line...  I wonder if we should be focusing on a similar
> > > approach for other pg_ident.conf use-cases instead of having it via
> > > log_line_prefix, as the latter means we'd be logging the same value over
> > > and over again on every log line.
> >
> > As long as the identity can be easily logged and reviewed by DBAs, I'm
> > happy.
> 
> Yeah, per my previous mail, I think this is a better way - make it
> part of log_connections. But it would be good to find a way that we
> can log it the same way for all of them -- rather than slightly
> different ways depending on authentication method.

+1.

> With that I think it would also be useful to have it available in the
> system as well -- either as a column in pg_stat_activity or maybe just
> as a function like pg_get_authenticated_identity() since it might be
> something that's interesting to a smallish subset of users (but very
> interesting to those).

We've been trending in the direction of

Re: Faulty HEAP_XMAX_LOCK_ONLY & HEAP_KEYS_UPDATED hintbit combination

2021-02-01 Thread Alvaro Herrera
On 2021-Jan-24, Julien Rouhaud wrote:

> While working on pg14 compatibility for an extension relying on an apparently
> uncommon combination of FOR UPDATE and stored function calls, I hit some new
> Asserts introduced in 866e24d47db (Extend amcheck to check heap pages):
> 
> + /*
> +  * Do not allow tuples with invalid combinations of hint bits to be 
> placed
> +  * on a page.  These combinations are detected as corruption by the
> +  * contrib/amcheck logic, so if you disable one or both of these
> +  * assertions, make corresponding changes there.
> +  */
> + Assert(!((tuple->t_data->t_infomask & HEAP_XMAX_LOCK_ONLY) &&
> +  (tuple->t_data->t_infomask2 & HEAP_KEYS_UPDATED)));
> 
> 
> I attach a simple self contained script to reproduce the problem, the last
> UPDATE triggering the Assert.

Maybe we should contest the idea that this is a sensible thing to Assert
against.  AFAICS this was originally suggested here:
https://www.postgresql.org/message-id/flat/CAFiTN-syyHc3jZoou51v0SR8z0POoNfktqEO6MaGig4YS8mosA%40mail.gmail.com#ad215d0ee0606b5f67bbc57d011c96b8
and it appears now to have been a bad idea.  If I recall correctly,
HEAP_KEYS_UPDATED is supposed to distinguish locks/updates that don't
modify the key columns from those that do.  Since SELECT FOR UPDATE
stands for a future update that may modify arbitrary portions of the
tuple (including "key" columns), then it produces that bit, just as said
UPDATE or a DELETE; as opposed to SELECT FOR NO KEY UPDATE which stands
for a future UPDATE that will only change columns that aren't part of
any keys.
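
A minimal sketch of the two lock strengths being contrasted (hypothetical
table):

BEGIN;
-- the future UPDATE will not touch key columns: weaker lock, no
-- HEAP_KEYS_UPDATED
SELECT * FROM accounts WHERE id = 1 FOR NO KEY UPDATE;
ROLLBACK;

BEGIN;
-- the future UPDATE may change key columns (or the row may be deleted):
-- stronger lock, and per the above it sets HEAP_KEYS_UPDATED
SELECT * FROM accounts WHERE id = 1 FOR UPDATE;
ROLLBACK;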

So I think that I misspoke earlier in this thread when I said this is a
bug, and that the right fix here is to remove the Assert() and change
amcheck to match.

Separately, maybe it'd also be good to have a test case based on
Julien's SQL snippet that produces this particular infomask combination
(and other interesting combinations) and passes them through VACUUM etc
to see that everything behaves correctly.

You could also argue the HEAP_KEYS_UPDATED is a misnomer and that we'd
do well to change its name, and update README.tuplock.

-- 
Álvaro Herrera39°49'30"S 73°17'W
Bob [Floyd] used to say that he was planning to get a Ph.D. by the "green
stamp method," namely by saving envelopes addressed to him as 'Dr. Floyd'.
After collecting 500 such letters, he mused, a university somewhere in
Arizona would probably grant him a degree.  (Don Knuth)




Re: Proposal: Save user's original authenticated identity for logging

2021-02-01 Thread Stephen Frost
Greetings,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Greg Stark  writes:
> > I wonder if there isn't room to handle this the other way around. To
> > configure Postgres to not need a CREATE ROLE for every role but
> > delegate the user management to the external authentication service.
> 
> > So Postgres would consider the actual role to be the one kerberos said
> > it was even if that role didn't exist in pg_role. Presumably you would
> > want to delegate to a corresponding authorization system as well so if
> > the role was absent from pg_role (or more likely fit some pattern)
> > Postgres would ignore pg_role and consult the authorization system
> > configured like AD or whatever people use with Kerberos these days.
> 
> This doesn't sound particularly workable: how would you manage
> inside-the-database permissions?  Kerberos isn't going to know
> what "view foo" is, let alone know whether you should be allowed
> to read or write it.  So ISTM there has to be a role to hold
> those permissions.  Certainly, you could allow multiple external
> identities to share a role ... but that works today.

Agreed- we would need something in the database to tie it to and I don't
see it making much sense to try to invent something else for that when
that's what roles are.  What's been discussed before and would certainly
be nice, however, would be a way to have roles automatically created.
There's pg_ldap_sync for that today but it'd be nice to have something
built-in and which happens at connection/authentication time, or maybe a
background worker that connects to an ldap server and listens for
changes and creates appropriate roles when they're created.  Considering
we've got the LDAP code already, that'd be a really nice capability.

Thanks,

Stephen


signature.asc
Description: PGP signature


Next Commitfest Manager.

2021-02-01 Thread Ibrar Ahmed
As Commitfest 2021-01 is now closed, I am volunteering to manage the next
commitfest.


--
Ibrar Ahmed


Re: Proposal: Save user's original authenticated identity for logging

2021-02-01 Thread Tom Lane
Stephen Frost  writes:
> * Tom Lane (t...@sss.pgh.pa.us) wrote:
>> This doesn't sound particularly workable: how would you manage
>> inside-the-database permissions?  Kerberos isn't going to know
>> what "view foo" is, let alone know whether you should be allowed
>> to read or write it.  So ISTM there has to be a role to hold
>> those permissions.  Certainly, you could allow multiple external
>> identities to share a role ... but that works today.

> Agreed- we would need something in the database to tie it to and I don't
> see it making much sense to try to invent something else for that when
> that's what roles are.  What's been discussed before and would certainly
> be nice, however, would be a way to have roles automatically created.
> There's pg_ldap_sync for that today but it'd be nice to have something
> built-in and which happens at connection/authentication time, or maybe a
> background worker that connects to an ldap server and listens for
> changes and creates appropriate roles when they're created.  Considering
> we've got the LDAP code already, that'd be a really nice capability.

That's still got the same issue though: where does the role get any
permissions from?

I suppose you could say "allow auto-creation of new roles and make them
members of group X", where X holds the permissions that "everybody"
should have.  But I'm not sure how much that buys compared to just
letting everyone log in as X.
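
A minimal sketch of that arrangement (role and schema names hypothetical):

-- X holds the permissions "everybody" should have:
CREATE ROLE app_users NOLOGIN;
GRANT USAGE ON SCHEMA app TO app_users;
GRANT SELECT ON ALL TABLES IN SCHEMA app TO app_users;

-- What auto-creation would effectively do for each new external identity:
CREATE ROLE alice LOGIN;
GRANT app_users TO alice;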

regards, tom lane




[POC] verifying UTF-8 using SIMD instructions

2021-02-01 Thread John Naylor
Hi,

As of b80e10638e3, there is a new API for validating the encoding of
strings, and one of the side effects is that we have a wider choice of
algorithms. For UTF-8, it has been demonstrated that SIMD is much faster at
decoding [1] and validation [2] than the standard approach we use.

It makes sense to start with the ascii subset of UTF-8 for a couple
reasons. First, ascii is very widespread in database content, particularly
in bulk loads. Second, ascii can be validated using the simple SSE2
intrinsics that come with (I believe) any x86-64 chip, and I'm guessing we
can detect that at compile time and not mess with runtime checks. The
examples above using SSE for the general case are much more complicated and
involve SSE 4.2 or AVX.

Here are some numbers on my laptop (MacOS/clang 10 -- if the concept is
okay, I'll do Linux/gcc and add more inputs). The test is the same as
Heikki shared in [3], but I added a case with >95% Chinese characters just
to show how that compares to the mixed ascii/multibyte case.

master:

 chinese | mixed | ascii
---------+-------+-------
    1081 |   761 |   366

patch:

 chinese | mixed | ascii
---------+-------+-------
    1103 |   498 |    51

The speedup in the pure ascii case is nice.

In the attached POC, I just have a pro forma portability stub, and left
full portability detection for later. The fast path is inlined inside
pg_utf8_verifystr(). I imagine the ascii fast path could be abstracted into
a separate function to which is passed a function pointer for full encoding
validation. That would allow other encodings with strict ascii subsets to
use this as well, but coding that abstraction might be a little messy, and
b80e10638e3 already gives a performance boost over PG13.

I also gave a shot at doing full UTF-8 recognition using a DFA, but so far
that has made performance worse. If I ever have more success with that,
I'll add that in the mix.

[1] https://woboq.com/blog/utf-8-processing-using-simd.html
[2]
https://lemire.me/blog/2020/10/20/ridiculously-fast-unicode-utf-8-validation/
[3]
https://www.postgresql.org/message-id/06d45421-61b8-86dd-e765-f1ce527a5...@iki.fi

-- 
John Naylor
EDB: http://www.enterprisedb.com
diff --git a/src/common/wchar.c b/src/common/wchar.c
index 6e7d731e02..12b3a5e1a2 100644
--- a/src/common/wchar.c
+++ b/src/common/wchar.c
@@ -13,6 +13,10 @@
 #include "c.h"
 
 #include "mb/pg_wchar.h"
+#include "port/pg_bitutils.h"
+
+/* FIXME -- should go in src/include/port */
+#include 
 
 
 /*
@@ -1762,6 +1766,80 @@ pg_utf8_verifystr(const unsigned char *s, int len)
 {
 	const unsigned char *start = s;
 
+#ifdef __x86_64__
+
+
+	const __m128i	zero = _mm_setzero_si128();
+	__m128i			chunk,
+	cmp;
+
+	const int		chunk_size = sizeof(__m128i);
+	intzero_mask,
+	highbit_mask,
+	ascii_count,
+	remainder;
+
+	while (len >= chunk_size)
+	{
+		/* load next chunk */
+		chunk = _mm_loadu_si128((const __m128i *) s);
+
+		/* first detect any zero bytes */
+		cmp = _mm_cmpeq_epi8(chunk, zero);
+		zero_mask = _mm_movemask_epi8(cmp);
+
+		/* if there is a zero byte, let the slow path encounter it */
+		if (zero_mask)
+			break;
+
+		/* now check for non-ascii bytes */
+		highbit_mask = _mm_movemask_epi8(chunk);
+
+		if (!highbit_mask)
+		{
+			/* all ascii, so advance to the next chunk */
+			s += chunk_size;
+			len -= chunk_size;
+			continue;
+		}
+
+		/*
+		 * if not all ascii, maybe there is a solid block of ascii
+		 * at the beginning of the chunk. if so, skip it
+		 */
+		ascii_count = pg_rightmost_one_pos32(highbit_mask);
+
+		s += ascii_count;
+		len -= ascii_count;
+		remainder = chunk_size - ascii_count;
+
+		/* found non-ascii, so handle the remainder in the normal way */
+		while (remainder > 0)
+		{
+			int			l;
+
+			/*
+			 * fast path for ASCII-subset characters
+			 * we already know they're non-zero
+			 */
+			if (!IS_HIGHBIT_SET(*s))
+				l = 1;
+			else
+			{
+				l = pg_utf8_verifychar(s, len);
+				if (l == -1)
+					goto finish;
+			}
+			s += l;
+			len -= l;
+			remainder -= l;
+
+		}
+	}
+
+#endif			/* __x86_64__ */
+
+	/* handle last few bytes */
 	while (len > 0)
 	{
 		int			l;
@@ -1770,19 +1848,20 @@ pg_utf8_verifystr(const unsigned char *s, int len)
 		if (!IS_HIGHBIT_SET(*s))
 		{
 			if (*s == '\0')
-				break;
+				goto finish;
 			l = 1;
 		}
 		else
 		{
 			l = pg_utf8_verifychar(s, len);
 			if (l == -1)
-				break;
+				goto finish;
 		}
 		s += l;
 		len -= l;
 	}
 
+finish:
 	return s - start;
 }
 


Re: Proposal: Save user's original authenticated identity for logging

2021-02-01 Thread Stephen Frost
Greetings,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Stephen Frost  writes:
> > * Tom Lane (t...@sss.pgh.pa.us) wrote:
> >> This doesn't sound particularly workable: how would you manage
> >> inside-the-database permissions?  Kerberos isn't going to know
> >> what "view foo" is, let alone know whether you should be allowed
> >> to read or write it.  So ISTM there has to be a role to hold
> >> those permissions.  Certainly, you could allow multiple external
> >> identities to share a role ... but that works today.
> 
> > Agreed- we would need something in the database to tie it to and I don't
> > see it making much sense to try to invent something else for that when
> > that's what roles are.  What's been discussed before and would certainly
> > be nice, however, would be a way to have roles automatically created.
> > There's pg_ldap_sync for that today but it'd be nice to have something
> > built-in and which happens at connection/authentication time, or maybe a
> > background worker that connects to an ldap server and listens for
> > changes and creates appropriate roles when they're created.  Considering
> > we've got the LDAP code already, that'd be a really nice capability.
> 
> That's still got the same issue though: where does the role get any
> permissions from?
> 
> I suppose you could say "allow auto-creation of new roles and make them
> members of group X", where X holds the permissions that "everybody"
> should have.  But I'm not sure how much that buys compared to just
> letting everyone log in as X.

Right, pg_ldap_sync already supports making new roles a member of
another role in PG such as a group role, we'd want to do something
similar.  Also- once the role exists, then permissions could be assigned
directly as well, of course, which would be the advantage of a
background worker that's keeping the set of roles in sync, as the role
would be created at nearly the same time in both the authentication
system itself (eg: AD) and in PG.  That kind of integration exists in
other products and would go a long way to making PG easier to use and
administer.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: Proposal: Save user's original authenticated identity for logging

2021-02-01 Thread Magnus Hagander
On Mon, Feb 1, 2021 at 6:32 PM Tom Lane  wrote:
>
> Stephen Frost  writes:
> > * Tom Lane (t...@sss.pgh.pa.us) wrote:
> >> This doesn't sound particularly workable: how would you manage
> >> inside-the-database permissions?  Kerberos isn't going to know
> >> what "view foo" is, let alone know whether you should be allowed
> >> to read or write it.  So ISTM there has to be a role to hold
> >> those permissions.  Certainly, you could allow multiple external
> >> identities to share a role ... but that works today.
>
> > Agreed- we would need something in the database to tie it to and I don't
> > see it making much sense to try to invent something else for that when
> > that's what roles are.  What's been discussed before and would certainly
> > be nice, however, would be a way to have roles automatically created.
> > There's pg_ldap_sync for that today but it'd be nice to have something
> > built-in and which happens at connection/authentication time, or maybe a
> > background worker that connects to an ldap server and listens for
> > changes and creates appropriate roles when they're created.  Considering
> > we've got the LDAP code already, that'd be a really nice capability.
>
> That's still got the same issue though: where does the role get any
> permissions from?
>
> I suppose you could say "allow auto-creation of new roles and make them
> members of group X", where X holds the permissions that "everybody"
> should have.  But I'm not sure how much that buys compared to just
> letting everyone log in as X.

What people would *really* want, I think, is "allow auto-creation of new
roles, and then look up which other roles they should be members of
using ldap" (or "using this script over here" for a more flexible
approach). Which is of course a whole different thing to do in the
process of authentication.

The main thing you'd gain by auto-creating users rather than just
letting them log in is the ability to know exactly which user did
something, and view who it really is through pg_stat_activity. Adding
the "original auth id" as a field or available method would provide
that information in the mapped user case -- making the difference even
smaller. It's really the auto-membership that's the killer feature of
that one, I think.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: [POC] verifying UTF-8 using SIMD instructions

2021-02-01 Thread Heikki Linnakangas

On 01/02/2021 19:32, John Naylor wrote:
It makes sense to start with the ascii subset of UTF-8 for a couple 
reasons. First, ascii is very widespread in database content, 
particularly in bulk loads. Second, ascii can be validated using the 
simple SSE2 intrinsics that come with (I believe) any x86-64 chip, and 
I'm guessing we can detect that at compile time and not mess with 
runtime checks. The examples above using SSE for the general case are 
much more complicated and involve SSE 4.2 or AVX.


I wonder how using SSE compares with dealing with 64 or 32-bit words at 
a time, using regular instructions? That would be more portable.
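
For comparison, a minimal sketch of that word-at-a-time idea using plain
64-bit loads (the standard zero-byte and high-bit tricks; not taken from the
patch):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Return true if the 8 bytes at s are all non-zero ASCII. */
static inline bool
chunk_is_ascii64(const unsigned char *s)
{
	uint64_t	chunk;
	uint64_t	zero_detect;

	memcpy(&chunk, s, sizeof(chunk));

	/* sets the high bit of every byte that was zero */
	zero_detect = (chunk - UINT64_C(0x0101010101010101)) &
		~chunk & UINT64_C(0x8080808080808080);

	/* also reject any byte whose high bit is already set (non-ASCII) */
	return (zero_detect | (chunk & UINT64_C(0x8080808080808080))) == 0;
}

A verifier could consume the input eight bytes at a time with such a check
and fall back to pg_utf8_verifychar() at the first chunk that fails.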


Here are some numbers on my laptop (MacOS/clang 10 -- if the concept is 
okay, I'll do Linux/gcc and add more inputs). The test is the same as 
Heikki shared in [3], but I added a case with >95% Chinese characters 
just to show how that compares to the mixed ascii/multibyte case.


master:

 chinese | mixed | ascii
---------+-------+-------
    1081 |   761 |   366

patch:

 chinese | mixed | ascii
---------+-------+-------
    1103 |   498 |    51

The speedup in the pure ascii case is nice.


Yep.

In the attached POC, I just have a pro forma portability stub, and left 
full portability detection for later. The fast path is inlined inside 
pg_utf8_verifystr(). I imagine the ascii fast path could be abstracted 
into a separate function to which is passed a function pointer for full 
encoding validation. That would allow other encodings with strict ascii 
subsets to use this as well, but coding that abstraction might be a 
little messy, and b80e10638e3 already gives a performance boost over PG13.


All supported encodings are ASCII subsets. Might be best to put the 
ASCII-check into a static inline function and use it in all the verify 
functions. I presume it's only a few instructions, and these functions 
can be pretty performance sensitive.


I also gave a shot at doing full UTF-8 recognition using a DFA, but so 
far that has made performance worse. If I ever have more success with 
that, I'll add that in the mix.


That's disappointing. Perhaps the SIMD algorithms have higher startup 
costs, so that you need longer inputs to benefit? In that case, it might 
make sense to check the length of the input and only use the SIMD 
algorithm if the input is long enough.


- Heikki




Re: SEARCH and CYCLE clauses

2021-02-01 Thread Peter Eisentraut

On 2020-11-25 20:35, Pavel Stehule wrote:

I checked this patch, and I didn't find any issue.

make check-world passed
make doc passed

I'll mark it as ready for committer


This has been committed.  Thanks.

--
Peter Eisentraut
2ndQuadrant, an EDB company
https://www.2ndquadrant.com/




Re: SEARCH and CYCLE clauses

2021-02-01 Thread Pavel Stehule
On Mon, Feb 1, 2021 at 19:02, Peter Eisentraut
<peter.eisentr...@2ndquadrant.com> wrote:

> On 2020-11-25 20:35, Pavel Stehule wrote:
> > I checked this patch, and I didn't find any issue.
> >
> > make check-world passed
> > make doc passed
> >
> > I'll mark it as ready for committer
>
> This has been committed.  Thanks.
>

great!

Pavel


> --
> Peter Eisentraut
> 2ndQuadrant, an EDB company
> https://www.2ndquadrant.com/
>


Re: Fix DROP TABLESPACE on Windows with ProcSignalBarrier?

2021-02-01 Thread Andres Freund
Hi,

Thanks for developing this.

On 2021-01-31 02:11:06 +1300, Thomas Munro wrote:
> --- a/src/backend/commands/tablespace.c
> +++ b/src/backend/commands/tablespace.c
> @@ -520,15 +520,23 @@ DropTableSpace(DropTableSpaceStmt *stmt)
>* but we can't tell them apart from important data files that 
> we
>* mustn't delete.  So instead, we force a checkpoint which 
> will clean
>* out any lingering files, and try again.
> -  *
> -  * XXX On Windows, an unlinked file persists in the directory 
> listing
> -  * until no process retains an open handle for the file.  The 
> DDL
> -  * commands that schedule files for unlink send invalidation 
> messages
> -  * directing other PostgreSQL processes to close the files.  
> DROP
> -  * TABLESPACE should not give up on the tablespace becoming 
> empty
> -  * until all relevant invalidation processing is complete.
>*/
>   RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | 
> CHECKPOINT_WAIT);
> + /*
> +  * On Windows, an unlinked file persists in the directory 
> listing until
> +  * no process retains an open handle for the file.  The DDL
> +  * commands that schedule files for unlink send invalidation 
> messages
> +  * directing other PostgreSQL processes to close the files, but 
> nothing
> +  * guarantees they'll be processed in time.  So, we'll also use 
> a
> +  * global barrier to ask all backends to close all files, and 
> wait
> +  * until they're finished.
> +  */
> +#if defined(WIN32) || defined(USE_ASSERT_CHECKING)
> + LWLockRelease(TablespaceCreateLock);
> + 
> WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SMGRRELEASE));
> + LWLockAcquire(TablespaceCreateLock, LW_EXCLUSIVE);
> +#endif
> + /* And now try again. */
>   if (!destroy_tablespace_directories(tablespaceoid, false))
>   {
>   /* Still not empty, the files must be important then */

It's probably rare enough not to matter, but this still made me think about
whether we could avoid the checkpoint at all somehow. Requiring an immediate
checkpoint for dropping relations is quite a heavy hammer that
practically cannot be used in production without causing performance
problems. But it seems hard to process the fsync deletion queue in
another way.


> diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
> index 4dc24649df..0f8548747c 100644
> --- a/src/backend/storage/smgr/smgr.c
> +++ b/src/backend/storage/smgr/smgr.c
> @@ -298,6 +298,12 @@ smgrcloseall(void)
>   smgrclose(reln);
>  }
>  
> +void
> +smgrrelease(void)
> +{
> + mdrelease();
> +}

Probably should be something like
for (i = 0; i < NSmgr; i++)
{
	if (smgrsw[i].smgr_release)
		smgrsw[i].smgr_release();
}

Greetings,

Andres Freund




Re: row filtering for logical replication

2021-02-01 Thread Euler Taveira
On Mon, Feb 1, 2021, at 6:11 AM, japin wrote:
> Thanks for updating the patch.  Here are some comments:
Thanks for your review. I updated the documentation accordingly.

> The documentation says:
> 
> >  Columns used in the WHERE clause must be part of the
> >  primary key or be covered by REPLICA IDENTITY otherwise
> >  UPDATE and DELETE operations will not
> >  be replicated.
The UPDATE is an oversight from a previous version.

> 
> Does the publication only load the REPLICA IDENTITY columns into oldtuple
> when we execute DELETE? So pgoutput_row_filter() cannot find non REPLICA
> IDENTITY columns, which causes it to return false, right?  If that's right,
> the UPDATE might not be limited by REPLICA IDENTITY, because all columns
> are in newtuple, isn't it?
No. oldtuple could possibly be available for UPDATE and DELETE. However, the
row filter considers only one tuple for filtering. INSERT has only a newtuple;
the row filter uses it.  UPDATE has a newtuple and optionally an oldtuple (if
the table has a PK or REPLICA IDENTITY); the row filter uses the newtuple.
DELETE optionally has only an oldtuple; the row filter uses it (if available).
Keep in mind that if the expression evaluates to NULL, it returns false and
the row won't be replicated.
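
For reference, a minimal sketch of the syntax the patch adds (table and
column names are hypothetical):

CREATE PUBLICATION active_customers FOR TABLE customers WHERE (active IS TRUE);

-- An UPDATE is filtered on the new tuple, so this row starts being
-- replicated once "active" flips to true; a DELETE is filtered on the old
-- tuple, which only carries the filtered column if the replica identity
-- covers it:
UPDATE customers SET active = true WHERE id = 42;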

After the commit 3696a600e2, the last patch does not apply cleanly. I'm
attaching another version to address the documentation issues.


--
Euler Taveira
EnterpriseDB:  https://www.enterprisedb.com/
From fb45140efacea0cfc9478f2c3747d9a4e5cecbd0 Mon Sep 17 00:00:00 2001
From: Euler Taveira 
Date: Mon, 18 Jan 2021 11:13:23 -0300
Subject: [PATCH 1/7] Remove unused column from initial table synchronization

Column atttypmod was added in the commit 7c4f52409a, but it is not used.
The removal is safe because COPY from publisher does not need such
information.
---
 src/backend/replication/logical/tablesync.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 863d196fd7..a18f847ade 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -640,7 +640,7 @@ fetch_remote_table_info(char *nspname, char *relname,
 	StringInfoData cmd;
 	TupleTableSlot *slot;
 	Oid			tableRow[] = {OIDOID, CHAROID, CHAROID};
-	Oid			attrRow[] = {TEXTOID, OIDOID, INT4OID, BOOLOID};
+	Oid			attrRow[] = {TEXTOID, OIDOID, BOOLOID};
 	bool		isnull;
 	int			natt;
 
@@ -685,7 +685,6 @@ fetch_remote_table_info(char *nspname, char *relname,
 	appendStringInfo(&cmd,
 	 "SELECT a.attname,"
 	 "   a.atttypid,"
-	 "   a.atttypmod,"
 	 "   a.attnum = ANY(i.indkey)"
 	 "  FROM pg_catalog.pg_attribute a"
 	 "  LEFT JOIN pg_catalog.pg_index i"
@@ -718,7 +717,7 @@ fetch_remote_table_info(char *nspname, char *relname,
 		Assert(!isnull);
 		lrel->atttyps[natt] = DatumGetObjectId(slot_getattr(slot, 2, &isnull));
 		Assert(!isnull);
-		if (DatumGetBool(slot_getattr(slot, 4, &isnull)))
+		if (DatumGetBool(slot_getattr(slot, 3, &isnull)))
 			lrel->attkeys = bms_add_member(lrel->attkeys, natt);
 
 		/* Should never happen. */
-- 
2.20.1

From 4902323fca0efa8024de600853670b2a1f0e3ca0 Mon Sep 17 00:00:00 2001
From: Euler Taveira 
Date: Mon, 18 Jan 2021 11:53:34 -0300
Subject: [PATCH 2/7] Rename a WHERE node

A WHERE clause will be used for row filtering in logical replication. We
already have a similar node: 'WHERE (condition here)'. Let's rename the
node to a generic name and use it for row filtering too.
---
 src/backend/parser/gram.y | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index dd72a9fc3c..ecfd98ba5b 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -485,7 +485,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type 	def_arg columnElem where_clause where_or_current_clause
 a_expr b_expr c_expr AexprConst indirection_el opt_slice_bound
 columnref in_expr having_clause func_table xmltable array_expr
-ExclusionWhereClause operator_def_arg
+OptWhereClause operator_def_arg
 %type 	rowsfrom_item rowsfrom_list opt_col_def_list
 %type  opt_ordinality
 %type 	ExclusionConstraintList ExclusionConstraintElem
@@ -3806,7 +3806,7 @@ ConstraintElem:
 	$$ = (Node *)n;
 }
 			| EXCLUDE access_method_clause '(' ExclusionConstraintList ')'
-opt_c_include opt_definition OptConsTableSpace  ExclusionWhereClause
+opt_c_include opt_definition OptConsTableSpace  OptWhereClause
 ConstraintAttributeSpec
 {
 	Constraint *n = makeNode(Constraint);
@@ -3908,7 +3908,7 @@ ExclusionConstraintElem: index_elem WITH any_operator
 			}
 		;
 
-ExclusionWhereClause:
+OptWhereClause:
 			WHERE '(' a_expr ')'	{ $$ = $3; }
 			| /*EMPTY*/{ $$ = NULL; }
 		;
-- 
2.20.1

From a3693dc370e80166c5fb8ba08c765a1559aaf89c Mon Sep 17 00:00:00 2001
From: Euler Taveira 
Date: Mon, 18 Jan 2021 12:07:51

Re: Recording foreign key relationships for the system catalogs

2021-02-01 Thread Joel Jacobson
Hi again,

After trying to use pg_get_catalog_foreign_keys() to replace what I had before,
I notice one ambiguity which I think is a serious problem in the 
machine-readable context.

The is_array OUT parameter doesn't say which of the possibly many fkcols
is the array column.

One example:

    fktable    |      fkcols       |   pktable    |      pkcols       | is_array
---------------+-------------------+--------------+-------------------+----------
 pg_constraint | {conrelid,conkey} | pg_attribute | {attrelid,attnum} | t

Is the array "conrelid" or is it "conkey"? As a human, I know it's "conkey", 
but for a machine to figure it out, one would need to join 
information_schema.columns and check the data_type or something similar.

Suggestions on how to fix:

* Make is_array a boolean[], and let each element represent the is_array value 
for each fkcols element.

* Change the interface to be more like information_schema, and add an 
"ordinal_position" column, and return each column on a separate row.

I think I prefer the latter since it's more information_schema-conformant, but 
either works.

/Joel


Re: Recording foreign key relationships for the system catalogs

2021-02-01 Thread Joel Jacobson
On Mon, Feb 1, 2021, at 20:33, Joel Jacobson wrote:
>Suggestions on how to fix:
>
>* Make is_array an boolean[], and let each element represent the is_array 
>value for each fkcols element.
>
>* Change interface to be more like information_schema, and add a 
>"ordinal_position" column, and return each column on a separate row.
>
>I think I prefer the latter since it's more information_schema-conformant, but 
>any works.

Oops. I see a problem with just adding an "ordinal_position", since one 
would then also need to enumerate the "foreign key" constraints or give them names like 
"constraint_name" in information_schema.table_constraints (since there can be 
multiple foreign keys per fktable, there would be multiple rows with e.g. 
ordinal_position=1 per fktable).

Taking this into consideration, I think the easiest and cleanest solution is to 
make is_array a boolean[].

I like the usage of arrays; it makes the output much easier to understand,
as it's visually easy to see which group of columns references which
group of columns.

/Joel

Re: [sqlsmith] Failed assertion during partition pruning

2021-02-01 Thread Tom Lane
David Rowley  writes:
> On Mon, 1 Feb 2021 at 18:57, Tom Lane  wrote:
>> Oh, it's too late at night.  I now remember that the real problem
>> I had with that representation was that it cannot work for joinrels.
>> Currently we only apply this logic to partitioned baserels, but
>> don't you think it might someday be called on to optimize
>> partitionwise joins?

> I've not looked in detail, but I think the code would need a pretty
> big overhaul before that could happen.

Yeah, there's no doubt that a lot of other work would be needed too;
I'd just figured that maybe this code could be ready for it.  But on
looking at it again, I agree that the bitmapset approach is simpler
and cheaper than what I did.  I realized that a good bit of what
I did not like about the pre-existing logic is that it was calling
the values "Relids" rather than Bitmapsets.  To my mind, the Relids
typedef is meant to denote values that identify individual relations
(which might be joins) to the planner.  The values we're dealing with
in this code do *not* correspond to anything that is or could be a
RelOptInfo; rather they are sets of distinct relations.  It's fine
to use a Bitmapset if it's a convenient representation, but we should
call it that and not a Relids.

I renamed things that way, did some more work on the comments,
and pushed it.  Thanks for reviewing!

regards, tom lane




Re: Recording foreign key relationships for the system catalogs

2021-02-01 Thread Tom Lane
"Joel Jacobson"  writes:
> The is_array OUT parameter doesn't say which of the possibly many fkcols that 
> is the array column.

Yeah, I didn't write the sgml docs yet, but the comments explain that
the array is always the last fkcol.  Maybe someday that won't be
general enough, but we can cross that bridge when we come to it.

regards, tom lane




Re: Should we make Bitmapsets a kind of Node?

2021-02-01 Thread Tom Lane
Now that commit f003a7522 did away with the partitioned_rels fields,
my original motivation for doing $SUBJECT is gone.  It might still be
worth doing, but I'm not planning to tackle it right now.

regards, tom lane




Re: Support for NSS as a libpq TLS backend

2021-02-01 Thread Daniel Gustafsson
> On 29 Jan 2021, at 19:46, Jacob Champion  wrote:

> I think the bad news is that the static approach will need support for
> ENABLE_THREAD_SAFETY.

I did some more reading today and noticed that the NSS documentation (and their
sample code for doing crypto without TLS connections) says to use NSS_NoDB_Init
to perform a read-only init which doesn't require a matching close call.  Now,
the docs aren't terribly clear and also seem to have gone offline from MDN,
and skimming the code isn't entirely self-explanatory, so I may well have
missed something.  The v24 patchset posted upthread changes to this and at least
passes tests with decent performance, so it seems worth investigating.

> (It looks like the NSS implementation of pgtls_close() needs some thread
> support too?)


Storing the context in conn would probably be better?

> The good(?) news is that I don't understand why OpenSSL's
> implementation of cryptohash doesn't _also_ need the thread-safety
> code. (Shouldn't we need to call CRYPTO_set_locking_callback() et al
> before using any of its cryptohash implementation?) So maybe we can
> implement the same global setup/teardown API for OpenSSL too and not
> have to one-off it for NSS...

No idea here, wouldn't that impact pgcrypto as well in that case?

--
Daniel Gustafsson   https://vmware.com/





Re: Support for NSS as a libpq TLS backend

2021-02-01 Thread Daniel Gustafsson
> On 1 Feb 2021, at 14:25, Michael Paquier  wrote:

> I have refreshed the docs on top to be consistent with the new
> configuration, and applied it after more checks.

Thanks, I was just about to send a rebased version earlier today with the doc
changes in the 0001 patch when this email landed in my inbox =) The v24 posted
upthread is now rebased on top of this.

> I'll try to look in more details at the failures with cryptohashes I found
> upthread.

Great, thanks.

--
Daniel Gustafsson   https://vmware.com/





Re: [PATCH] pg_hba.conf error messages for logical replication connections

2021-02-01 Thread Paul Martinez
On Fri, Jan 29, 2021 at 8:41 PM Amit Kapila  wrote:
>
> Yeah, hints or more details might improve the situation but I am not
> sure we want to add more branching here. Can we write something
> similar to HOSTNAME_LOOKUP_DETAIL for hints? Also, I think what you
> are proposing to write is more of a errdetail kind of message. See
> more error routines in the docs [1].
>

Alright, I've updated both sets of error messages to use something like
HOSTNAME_LOOKUP_DETAIL, both for the error message itself, and for the
extra detail message about the replication keyword. Since now we specify
both an errdetail (sent to the client) and an errdetail_log (sent to the
log), I renamed HOSTNAME_LOOKUP_DETAIL to HOSTNAME_LOOKUP_DETAIL_LOG.

- Paul


pg_hba_conf_error_message_patch_v01.diff
Description: Binary data


Re: [HACKERS] Custom compression methods

2021-02-01 Thread Robert Haas
Some more review comments:

'git am' barfs on v0001 because it's got a whitespace error.

VARFLAGS_4B_C() doesn't seem to be used in any of the patches. I'm OK
with keeping it even if it's not used just because maybe someone will
need it later but, uh, don't we need to use it someplace?

To avoid moving the goalposts for a basic install, I suggest that
--with-lz4 should default to disabled. Maybe we'll want to rethink
that at some point, but since we're just getting started with this
whole thing, I don't think now is the time.

The change to ddl.sgml doesn't seem to make sense to me. There might
be someplace where we want to explain how properties are inherited in
partitioning hierarchies, but I don't think this is the right place,
and I don't think this explanation is particularly clear.

+  This clause adds the compression method to a column.  The Compression
+  method can be set from available compression methods.  The built-in
+  methods are pglz and lz4.
+  If no compression method is specified, then compressible types will have
+  the default compression method pglz.

Suggest: This sets the compression method for a column. The supported
compression methods are pglz and
lz4. lz4 is available only if
--with-lz4 was used when building
PostgreSQL. The default is
pglz.

We should make sure, if you haven't already, that trying to create a
column with LZ4 compression fails at table creation time if the build
does not support LZ4. But, someone could also create a table using a
build that has LZ4 support and then switch to a different set of
binaries that do not have it, so we need the runtime checks also.
However, those runtime checks shouldn't fail simply from trying to
access a table that is set to use LZ4 compression; they should only
fail if we actually need to decompress an LZ4'd value.

Since indexes don't have TOAST tables, it surprises me that
brin_form_tuple() thinks it can TOAST anything. But I guess that's not
this patch's problem, if it's a problem at all.

I like the fact that you changed the message "compressed data is
corrupt" to indicate the compression method, but I think the resulting
message doesn't follow style guidelines because I don't believe we
normally put something with a colon prefix at the beginning of a
primary error message. So instead of saying "pglz: compressed data is
corrupt" I think you should say something like "compressed pglz data
is corrupt". Also, I suggest that we take this opportunity to switch
to ereport() rather than elog() and set
errcode(ERRCODE_DATA_CORRUPTED).
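For illustration, here is a minimal sketch of the style I have in mind (the
wrapper function and the exact message text are only examples, not something
taken from the patch):

#include "postgres.h"

/*
 * Hypothetical example: report a decompression failure with ereport() and
 * ERRCODE_DATA_CORRUPTED, folding the method name into the message instead
 * of using a colon prefix.
 */
static void
report_corrupt_lz4_data(void)
{
	ereport(ERROR,
			(errcode(ERRCODE_DATA_CORRUPTED),
			 errmsg("compressed lz4 data is corrupt")));
}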

What testing have you done for performance impacts? Does the patch
slow things down noticeably with pglz? (Hopefully not.) Can you
measure a performance improvement with lz4? (Hopefully so.) Is it
likely to hurt performance that there's no minimum size for lz4
compression as we have for pglz? Seems like that could result in a lot
of wasted cycles trying to compress short strings.

pglz_cmcompress() cancels compression if the resulting value would be
larger than the original one, but it looks like lz4_cmcompress() will
just store the enlarged value. That seems bad.
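
To illustrate the kind of bail-out I mean, here is a rough sketch that only
assumes the stock LZ4_compress_default() API (this is not the patch's code):
capping the destination capacity at srcSize - 1 makes LZ4 report failure for
data that wouldn't shrink, which the caller can treat as "store uncompressed",
mirroring pglz's behavior.

#include <lz4.h>
#include <stdbool.h>

/* Illustrative only: return false when compression would not shrink the input. */
static bool
lz4_compress_if_smaller(const char *src, int srcSize, char *dst, int *dstSize)
{
	/* dst needs room for srcSize - 1 bytes; LZ4 returns 0 if it can't fit. */
	int			rc = LZ4_compress_default(src, dst, srcSize, srcSize - 1);

	if (rc <= 0)
		return false;			/* caller stores the value uncompressed */

	*dstSize = rc;
	return true;
}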

pglz_cmcompress() doesn't need to pfree(tmp) before elog(ERROR).

CompressionOidToId(), CompressionIdToOid() and maybe other places need
to remember the message style guidelines. Primary error messages are
not capitalized.

Why should we now have to include toast_internals.h in
reorderbuffer.c, which has no other changes? That definitely shouldn't
be necessary. If something in another header file now requires
something from toast_internals.h, then that header file would be
obliged to include toast_internals.h itself. But actually that
shouldn't happen, because the whole point of toast_internals.h is that
it should not be included in very many places at all. If we're adding
stuff there that is going to be broadly needed, we're adding it in the
wrong place.

varlena.c shouldn't need toast_internals.h either, and if it did, it
should be in alphabetical order.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Thoughts on "killed tuples" index hint bits support on standby

2021-02-01 Thread Michail Nikolaev
Hello, Peter.

Thanks a lot for your comments.
Here are some of my thoughts related to the “masked bits” solution and
your comments:

> During recovery, we will probably always have to consider the
> possibility that LP_DEAD bits that get set on the primary may be
> received by a replica through some implementation detail (e.g. LP_DEAD
> bits are set in FPIs we replay, maybe even some other thing that
> neither of us have thought of).

It is fine for the standby to receive a page from any source: `btpo_flags`
should have some kind of “LP_DEAD safe for standby” bit set to allow new bits
to be set and old ones to be read.

> We can't really mask LP_DEAD bits from
> the primary in recovery anyway, because of stuff like page-level
> checksums. I suspect that it just isn't useful to fight against that.

As far as I can see, there is no problem here. Checksums already differ
between standby and primary for both heap and index pages. Checksums are
calculated before the page is written to disk (not after applying an FPI).
So, masking the page while *applying* the FPI is semantically the same as
setting a bit in it one nanosecond afterwards.

And `btree_mask` (and other mask functions) already used for consistency
checks to exclude LP_DEAD.

> Plus you'll introduce new overhead for this process during replay,
> which creates significant overhead -- often most leaf pages have some
> LP_DEAD bits set during recovery.

I hope it is not significant, because FPIs are not too frequent, and the positive
effect will easily outweigh the additional CPU cycles of `btree_mask` (and the
page is already in the CPU cache at that moment).

> I don't like that idea. Apart from what I said already, you're
> assuming that setting LP_DEAD bits in indexes on the primary won't
> eventually have some value on the standby after it is promoted and can
> accept writes -- they really do have meaning and value on standbys.

I think it is fine to lose some of the LP_DEAD bits on promotion (in practice
they can only be received via FPIs). They will be set again on the first scan
anyway. Also, bits set by the standby may be used by the newly promoted
primary if we honor the OldestXmin of the previous primary while setting them
(we just need to add OldestXmin to xl_running_xacts and include it in the
dead-horizon on the standby).

> Why shouldn't this break page-level checksums (or wal_log_hints) in
> some way? What about pg_rewind, some eventual implementation of
> incremental backups, etc? I suspect that it will be necessary to
> invent some entirely new concept that is like a hint bit, but works on
> standbys (without breaking anything else).

As I said before, applying the mask on the *standby* will not break any
checksums, because the page is still dirty after that (and it is even
possible to call `PageSetChecksumInplace` for extra paranoia).
The actual checksums on standby and primary already have different values
(probably on most pages, because of LP_DEAD and “classic” hint bits).

> If you invent some entirely new category of standby-only hint bit at a
> level below the access method code, then you can use it inside access
> method code such as nbtree. Maybe you don't have to play games with
> minRecoveryPoint in code like the "if (RecoveryInProgress())" path
> from the XLogNeedsFlush() function. Maybe you can do some kind of
> rudimentary "masking" for the in recovery case at the point that we
> *write out* a buffer (*not* at the point hint bits are initially set)
> -- maybe this could happen near to or inside FlushBuffer(), and maybe
> only when checksums are enabled? I'm unsure.

Not sure I was able to understand your idea here, sorry.

> The difference may be subtle, but it's important here -- it justifies
> inventing a whole new type of LP_DEAD-style status bit that gets set
> only on standbys. Even today, heap tuples can have hint bits
> "independently" set on standbys, subject to certain limitations needed
> to avoid breaking things like data page checksums

Yes, and I see three major ways to implement it in the current
infrastructure:

1) Use LP_REDIRECT (or other free value) instead of LP_DEAD on standby
2) Use LP_DEAD on standby, but involve some kind of recovery conflicts
(like here -
https://www.postgresql.org/message-id/flat/CANtu0oiP18H31dSaEzn0B0rW6tA_q1G7%3D9Y92%2BUS_WHGOoQevg%40mail.gmail.com
)
3) Mask index FPI during a replay on hot standby + mark page as “primary
LP_DEAD free” in btpo_flags

Of course, each variant requires some special additional things to keep
everything safe.

As I see it, in SetHintBits the limitations are related to XLogNeedsFlush
(the check of minRecoveryPoint in the standby case).

> Conclusion: The whole minRecoveryPoint question that you're trying to
> answer to improve things for your patch is just the wrong question.
> Because LP_DEAD bits in indexes are not *true* "hint bits". Maybe it
> would be useful to set "true hint bits" on standbys earlier, and maybe
> thinking about minRecoveryPoint would help with that problem, but that
> doesn't help your index-related

Re: [sqlsmith] Failed assertion during partition pruning

2021-02-01 Thread David Rowley
On Tue, 2 Feb 2021 at 08:58, Tom Lane  wrote:
> I renamed things that way, did some more work on the comments,
> and pushed it.  Thanks for reviewing!

Thanks for working on this and coming up with the idea to nuke partitioned_rels.

David




Re: Proposal: Save user's original authenticated identity for logging

2021-02-01 Thread Jacob Champion
On Sun, 2021-01-31 at 12:27 +0100, Magnus Hagander wrote:
> > (There's also the fact that I think pg_ident mapping for LDAP would be
> > just as useful as it is for GSS or certs. That's for a different
> > conversation.)
> 
> Specifically for search+bind, I would assume?

Even for the simple bind case, I think it'd be useful to be able to
perform a pg_ident mapping of

ldapmap         /.*             ldapuser

so that anyone who is able to authenticate against the LDAP server is
allowed to assume the ldapuser role. (For this to work, you'd need to
be able to specify your LDAP username as a connection option, similar
to how you can specify a client certificate, so that you could set
PGUSER=ldapuser.)

But again, that's orthogonal to the current discussion.

> With that I think it would also be useful to have it available in the
> system as well -- either as a column in pg_stat_activity or maybe just
> as a function like pg_get_authenticated_identity() since it might be
> something that's interesting to a smallish subset of users (but very
> interesting to those).

Agreed, it would slot in nicely with the other per-backend stats functions.
--Jacob


Re: Proposal: Save user's original authenticated identity for logging

2021-02-01 Thread Jacob Champion
On Mon, 2021-02-01 at 11:49 -0500, Stephen Frost wrote:
> * Magnus Hagander (mag...@hagander.net) wrote:
> > But yes, I think the enforced cleartext password proxying is at the
> > core of the problem. LDAP also encourages the idea of centralized
> > password-reuse, which is not exactly a great thing for security.
> 
> Right- passing around a user's password in the clear (or even through an
> encrypted tunnel) has been strongly discouraged for a very long time,
> for very good reason.  LDAP does double-down on that by being a
> centralized password, meaning that someone's entire identity (for all
> the services that share that LDAP system, at least) are compromised if
> any one system in the environment is.

Sure. I don't disagree with anything you've said in that paragraph, but
as someone who's implementing solutions for other people who are
actually deploying, I don't have a lot of control over whether a
customer's IT department wants to use LDAP or not. And I'm not holding
my breath for LDAP servers to start implementing federated identity,
though that would be nice.

> Also, if we do add
> it, I would think we'd have it under the same check as the other
> sensitive pg_stat_activity fields and not be superuser-only.

Just the standard HAS_PGSTAT_PERMISSIONS(), then?

To double-check -- since giving this ability to the pg_read_all_stats
role would expand its scope -- could that be dangerous for anyone?

--Jacob


Re: Proposal: Save user's original authenticated identity for logging

2021-02-01 Thread Stephen Frost
Greetings,

* Jacob Champion (pchamp...@vmware.com) wrote:
> On Mon, 2021-02-01 at 11:49 -0500, Stephen Frost wrote:
> > * Magnus Hagander (mag...@hagander.net) wrote:
> > > But yes, I think the enforced cleartext password proxying is at the
> > > core of the problem. LDAP also encourages the idea of centralized
> > > password-reuse, which is not exactly a great thing for security.
> > 
> > Right- passing around a user's password in the clear (or even through an
> > encrypted tunnel) has been strongly discouraged for a very long time,
> > for very good reason.  LDAP does double-down on that by being a
> > centralized password, meaning that someone's entire identity (for all
> > the services that share that LDAP system, at least) are compromised if
> > any one system in the environment is.
> 
> Sure. I don't disagree with anything you've said in that paragraph, but
> as someone who's implementing solutions for other people who are
> actually deploying, I don't have a lot of control over whether a
> customer's IT department wants to use LDAP or not. And I'm not holding
> my breath for LDAP servers to start implementing federated identity,
> though that would be nice.

Not sure exactly what you're referring to here but AD already provides
Kerberos with cross-domain trusts (aka forests).  The future is here..?
:)

> > Also, if we do add
> > it, I would think we'd have it under the same check as the other
> > sensitive pg_stat_activity fields and not be superuser-only.
> 
> Just the standard HAS_PGSTAT_PERMISSIONS(), then?
> 
> To double-check -- since giving this ability to the pg_read_all_stats
> role would expand its scope -- could that be dangerous for anyone?

I don't agree that this really expands its scope- in fact, you'll see
that the GSSAPI and SSL user authentication information is already
allowed under HAS_PGSTAT_PERMISSIONS().

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: Proposal: Save user's original authenticated identity for logging

2021-02-01 Thread Magnus Hagander
On Mon, Feb 1, 2021 at 10:36 PM Jacob Champion  wrote:
>
> On Sun, 2021-01-31 at 12:27 +0100, Magnus Hagander wrote:
> > > (There's also the fact that I think pg_ident mapping for LDAP would be
> > > just as useful as it is for GSS or certs. That's for a different
> > > conversation.)
> >
> > Specifically for search+bind, I would assume?
>
> Even for the simple bind case, I think it'd be useful to be able to
> perform a pg_ident mapping of
>
> ldapmap/.*ldapuser
>
> so that anyone who is able to authenticate against the LDAP server is
> allowed to assume the ldapuser role. (For this to work, you'd need to
> be able to specify your LDAP username as a connection option, similar
> to how you can specify a client certificate, so that you could set
> PGUSER=ldapuser.)
>
> But again, that's orthogonal to the current discussion.

Right. I guess that's what I mean -- *just* adding support for user
mapping wouldn't be helpful. You'd have to change how the actual
authentication is done. The way that it's done now, mapping makes no
sense.

--
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: Key management with tests

2021-02-01 Thread Bruce Momjian
On Fri, Jan 29, 2021 at 05:05:06PM +0900, Masahiko Sawada wrote:
> TBH I’m confused a bit about the recent situation of this patch, but
> I

Yes, it is easy to get confused.

> can contribute to KMS work by discussing, writing, reviewing, and
> testing the patch. Also, I can work on the data encryption part of TDE

Great.

> (we need more discussion on that though). If the community concerns
> about the high-level design and thinks the design reviews by
> cryptography experts are still needed, we would need to do that first
> since the data encryption part of TDE depends on KMS. As far as I

I totally agree.  While we don't need to commit the key management patch
to the tree before moving forward, we should have agreement on the key
management patch before doing more work on this.  If we can't agree on
the key management part, there is no value in working further, as I
stated in an earlier email.

> know, we have done that many times on pgsql-hackers, on offl-line and
> including the discussion on the past proposal, etc but given that the
> community still has a concern, it seems that we haven’t been able
> to share the details of the discussion enough that led to the design
> decision or the design is still not good. Honestly, I’m not sure how
> this feature can get consensus. But maybe we would need to have a

Yes, I am also confused.

> break from refining the patch now and we need to marshal the
> discussions so far and the point behind the design so that everyone
> can understand why this feature is designed in that way. To do that,
> it might be a good start to sort the wiki page since it has data
> encryption part, KMS, and ToDo mixed.

What I ended up doing is moving the majority of the
non-data-encryption part of the wiki into the patch, either in docs or
README files, since people asked for more of this in the patch, and
having the information in two places is confusing.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Fix DROP TABLESPACE on Windows with ProcSignalBarrier?

2021-02-01 Thread Thomas Munro
On Tue, Feb 2, 2021 at 8:02 AM Andres Freund  wrote:
> It's probably rare enough to care, but this still made me think whether
> we could avoid the checkpoint at all somehow. Requiring an immediate
> checkpoint for dropping relations is quite a heavy hammer that
> practically cannot be used in production without causing performance
> problems. But it seems hard to process the fsync deletion queue in
> another way.

Right, the checkpoint itself is probably worse than this
"close-all-your-files!" thing in some cases (though it seems likely
that once we start using ProcSignalBarrier we're going to find out
about places that take a long time to get around to processing them
and that's going to be a thing to work on).  As a separate project,
perhaps we should find some other way to stop GetNewRelFileNode() from
recycling the relfilenode until the next checkpoint, so that we can
unlink the file eagerly at commit time, while still avoiding the
hazard described in the comment for mdunlink().  A straw-man idea
would be to touch a file under PGDATA/pg_dropped and fsync it so it
survives a power outage, have checkpoints clean that out, and have
GetNewRelFileNode() try to access() it.  Then we wouldn't need the
checkpoint here, I think; we'd just need this ProcSignalBarrier for
Windows.
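
To make the straw man a bit more concrete, something along these lines is
what I'm imagining for the GetNewRelFileNode() side -- the pg_dropped
directory, the file-name scheme and this helper are all made up for
illustration; nothing like it exists today:

#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

/* Sketch: has this relfilenode been dropped since the last checkpoint? */
static bool
relfilenode_recently_dropped(unsigned int spcOid, unsigned int dbOid,
							 unsigned int relNumber)
{
	char		path[64];

	snprintf(path, sizeof(path), "pg_dropped/%u_%u_%u",
			 spcOid, dbOid, relNumber);

	/* If the marker file still exists, skip this value and pick another. */
	return access(path, F_OK) == 0;
}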

> > +void
> > +smgrrelease(void)
> > +{
> > + mdrelease();
> > +}
>
> Probably should be something like
> for (i = 0; i < NSmgr; i++)
> {
> if (smgrsw[i].smgr_release)
> smgrsw[i].smgr_release();
> }

Fixed.  Thanks!


v3-0001-Use-a-global-barrier-to-fix-DROP-TABLESPACE-on-Wi.patch
Description: Binary data


v3-0002-Use-condition-variables-for-ProcSignalBarriers.patch
Description: Binary data


Re: Proposal: Save user's original authenticated identity for logging

2021-02-01 Thread Jacob Champion
On Mon, 2021-02-01 at 18:44 +0100, Magnus Hagander wrote:
> What people would *really* want I think is "alow auto-creation of new
> roles, and then look up which other roles they should be members of
> using ldap" (or "using this script over here" for a more flexible
> approach). Which is of course a whole different thing to do in the
> process of authentication.

Yep. I think there are at least three separate things:

1) third-party authentication ("tell me who this user is"), which I
think Postgres currently has a fairly good handle on;

2) third-party authorization ("tell me what roles this user can
assume"), which Postgres doesn't do, unless you have a script
automatically update pg_ident -- and even then you can't do it for
every authentication type; and

3) third-party role administration ("tell me what roles should exist in
the database, and what permissions they have"), which currently exists
in a limited handful of third-party tools.

Many users will want all three of these questions to be answered by the
same system, which is fine, but for more advanced use cases I think
it'd be really useful if you could answer them fully independently.

For really gigantic deployments, the overhead of hundreds of Postgres
instances randomly pinging a central server just to see if there have
been any new users can be a concern. Having a solid system for
authorization could potentially decrease the need for a role auto-
creation system, and reduce the number of moving parts. If you have a
small number of core roles (relative to the number of users), it might
not be as important to constantly keep role lists up to date, so long
as the central authority can tell you which of your existing roles a
user is authorized to become.

> The main thing you'd gain by auto-creating users rather than just
> letting them log in is the ability to know exactly which user did
> something, and view who it really is through pg_stat_activity. Adding
> the "original auth id" as a field or available method would provide
> that information in the mapped user case -- making the difference even
> smaller. It's really the auto-membership that's the killer feature of
> that one, I think.

Agreed. As long as it's possible for multiple user identities to assume
the same role, storing the original authenticated identity is still
important, regardless of how you administer the roles themselves.

--Jacob


Announcing Release 12 of the PostgreSQL Buildfarm client

2021-02-01 Thread Andrew Dunstan


At long last I have just pushed Release 12 of the PostgreSQL Buildfarm
client. It's been about 16 months since the last release.

Apart from some minor fixes and code tidy-up, this includes the
following more notable changes:

  * the TestUpgradeXVersion module is brought up to date with various
core code changes
  * a module-neutral, animal-based save mechanism replaces the bespoke
mechanism that was implemented in TestUpgradeXVersion. This will
enable future development like testing openssl builds against NSS
builds etc.
  * a standardized way of accumulating log files is implemented. This
will enable some planned server side improvements.
  * requests are now signed using SHA256 instead of SHA1.
  * there is a separate config section for settings to be used with
valgrind. This makes it easier to turn valgrind on or off.
  * typedefs detection is improved for OSX
  * there is a setting for additional docs build targets, in addition to
the standard html target.
  * use of the configure "accache" is made substantially more robust
  * timed out processes are more verbose

Downloads are available at
 and



cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Proposal: Save user's original authenticated identity for logging

2021-02-01 Thread Jacob Champion
On Mon, 2021-02-01 at 17:01 -0500, Stephen Frost wrote:
> * Jacob Champion (pchamp...@vmware.com) wrote:
> > And I'm not holding
> > my breath for LDAP servers to start implementing federated identity,
> > though that would be nice.
> 
> Not sure exactly what you're referring to here but AD already provides
> Kerberos with cross-domain trusts (aka forests).  The future is here..?
> :)

If the end user is actually using LDAP-on-top-of-AD, and comfortable
administering the Kerberos-related pieces of AD so that their *nix
servers and users can speak it instead, then sure. But I continue to
hear about customers who don't fit into that mold. :D Enough that I
have to keep an eye on the "pure" LDAP side of things, at least.

> > To double-check -- since giving this ability to the pg_read_all_stats
> > role would expand its scope -- could that be dangerous for anyone?
> 
> I don't agree that this really expands its scope- in fact, you'll see
> that the GSSAPI and SSL user authentication information is already
> allowed under HAS_PGSTAT_PERMISSIONS().

Ah, so they are. :) I think that's the way to go, then.

--Jacob


Re: Key management with tests

2021-02-01 Thread Bruce Momjian
On Sat, Jan 30, 2021 at 08:23:11AM -0500, Tom Kincaid wrote:
> I propose that we meet to discuss what approach we want to use to move TDE
> forward.  We then start a new thread with a proposal on the approach
> and finalize it via community consensus. I will invite Bruce, Stephen and
> Masahiko to this meeting. If anybody else would like to participate in this
> discussion and subsequently in the effort to get TDE in PG1x, please let me
> know. Assuming Bruce, Stephen and Masahiko are down for this, I (or a 
> volunteer
> from this meeting) will post the proposal for how we move this patch forward 
> in
> another thread. Hopefully, we can get consensus on that and subsequently
> restart the execution of delivering this feature.

We got complaints that decisions were not publicly discussed, or were
too long, so I am not sure this helps.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Thoughts on "killed tuples" index hint bits support on standby

2021-02-01 Thread Peter Geoghegan
On Mon, Feb 1, 2021 at 1:19 PM Michail Nikolaev
 wrote:
> It is fine to receive a page to the standby from any source: `btpo_flags` 
> should have some kind “LP_DEAD safe for standby” bit set to allow new bits to 
> be set and old - read.
>
> > We can't really mask LP_DEAD bits from
> > the primary in recovery anyway, because of stuff like page-level
> > checksums. I suspect that it just isn't useful to fight against that.
>
> As far as I can see - there is no problem here. Checksums already differ for 
> both heap and index pages on standby and primary.

AFAICT that's not true, at least not in any practical sense. See the
comment in the middle of MarkBufferDirtyHint() that begins with "If we
must not write WAL, due to a relfilenode-specific...", and see the
"Checksums" section at the end of src/backend/storage/page/README. The
last paragraph in the README is particularly relevant:

New WAL records cannot be written during recovery, so hint bits set during
recovery must not dirty the page if the buffer is not already dirty, when
checksums are enabled.  Systems in Hot-Standby mode may benefit from hint bits
being set, but with checksums enabled, a page cannot be dirtied after setting a
hint bit (due to the torn page risk). So, it must wait for full-page images
containing the hint bit updates to arrive from the primary.

IIUC the intention is that MarkBufferDirtyHint() is a no-op during hot
standby when we successfully set a hint bit, though only in the
XLogHintBitIsNeeded() case. So we don't really dirty the page within
SetHintBits() in this specific scenario. That is, the buffer header
won't actually get marked BM_DIRTY or BM_JUST_DIRTIED within
MarkBufferDirtyHint() when in Hot Standby + XLogHintBitIsNeeded().
What else could work at all? The only "alternative" is to write an
FPI, just like on the primary -- but writing new WAL records is not
possible during Hot Standby!

A comment within MarkBufferDirtyHint() spells it out directly -- we
can have hint bits set in hot standby independently of the primary,
but it works in a way that makes sure that the hint bits never make it
out to disk:

"We can set the hint, just not dirty the page as a result so the hint
is lost when we evict the page or shutdown"

You may be right in some narrow sense -- checksums can differ on a
standby. But that's like saying that it's sometimes okay to have a
torn page on disk. Yes, it's okay, but only because we expect the
problem during crash recovery, and can reliably repair it.

> Checksums are calculated before the page is written to the disk (not after 
> applying FPI). So, the masking page during *applying* the FPI is semantically 
> the same as setting a bit in it 1 nanosecond after.
>
> And `btree_mask` (and other mask functions) already used for consistency 
> checks to exclude LP_DEAD.

I don't see how that is relevant. btree_mask() is only used by
wal_consistency_checking, which is mostly just for Postgres hackers.

--
Peter Geoghegan




Re: POC: Cleaning up orphaned files using undo logs

2021-02-01 Thread Bruce Momjian
On Fri, Jan 29, 2021 at 06:30:15PM +0100, Antonin Houska wrote:
> Antonin Houska  wrote:
> > Well, on repeated run of the test I could also hit the first one. I could 
> > fix
> > it and will post a new version of the patch (along with some other small
> > changes) this week.
> 
> Attached is the next version. Changes done:

Yikes, this patch is 23k lines, and most of it looks like added lines of
code.  Is this size expected?

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Should we make Bitmapsets a kind of Node?

2021-02-01 Thread David Rowley
On Tue, 2 Feb 2021 at 09:23, Tom Lane  wrote:
>
> Now that commit f003a7522 did away with the partitioned_rels fields,
> my original motivation for doing $SUBJECT is gone.  It might still be
> worth doing, but I'm not planning to tackle it right now.

I'm not sure if the misuse of Lists to store non-Node types should be
all that surprising. lappend() accepts a void pointer rather than a
Node *. I also didn't catch anything that indicates storing non-Node
types is bad practice.

Maybe it's still worth adding something to the comments in list.c to
try and reduce the chances of someone making this mistake again in the
future?

David




Re: Proposal: Save user's original authenticated identity for logging

2021-02-01 Thread Stephen Frost
Greetings,

* Jacob Champion (pchamp...@vmware.com) wrote:
> On Mon, 2021-02-01 at 17:01 -0500, Stephen Frost wrote:
> > * Jacob Champion (pchamp...@vmware.com) wrote:
> > > And I'm not holding
> > > my breath for LDAP servers to start implementing federated identity,
> > > though that would be nice.
> > 
> > Not sure exactly what you're referring to here but AD already provides
> > Kerberos with cross-domain trusts (aka forests).  The future is here..?
> > :)
> 
> If the end user is actually using LDAP-on-top-of-AD, and comfortable
> administering the Kerberos-related pieces of AD so that their *nix
> servers and users can speak it instead, then sure. But I continue to
> hear about customers who don't fit into that mold. :D Enough that I
> have to keep an eye on the "pure" LDAP side of things, at least.

I suppose it's likely that I'll continue to run into people who are
horrified to learn that they've been using pass-the-password auth thanks
to using ldap.

> > > To double-check -- since giving this ability to the pg_read_all_stats
> > > role would expand its scope -- could that be dangerous for anyone?
> > 
> > I don't agree that this really expands its scope- in fact, you'll see
> > that the GSSAPI and SSL user authentication information is already
> > allowed under HAS_PGSTAT_PERMISSIONS().
> 
> Ah, so they are. :) I think that's the way to go, then.

Ok..  but what's 'go' mean here?  We already have views and such for GSS
and SSL, is the idea to add another view for LDAP and add in columns
that are returned by pg_stat_get_activity() which are then pulled out by
that view?  Or did you have something else in mind?

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: Proposal: Save user's original authenticated identity for logging

2021-02-01 Thread Jacob Champion
On Mon, 2021-02-01 at 18:01 -0500, Stephen Frost wrote:
> Ok..  but what's 'go' mean here?  We already have views and such for GSS
> and SSL, is the idea to add another view for LDAP and add in columns
> that are returned by pg_stat_get_activity() which are then pulled out by
> that view?  Or did you have something else in mind?

Magnus suggested a function like pg_get_authenticated_identity(), which
is what I was thinking of when I said that. I'm not too interested in
an LDAP-specific view, and I don't think anyone so far has asked for
that.

My goal is to get this one single point of reference, for all of the
auth backends. The LDAP mapping conversation is separate.

--Jacob


memory leak in auto_explain

2021-02-01 Thread Jeff Janes
I accidentally tried to populate a test case
while auto_explain.log_min_duration was set to
zero.  auto_explain.log_nested_statements was also on.

create or replace function gibberish(int) returns text language SQL as $_$
select left(string_agg(md5(random()::text),''),$1) from
generate_series(0,$1/32) $_$;

create table j1 as select x, md5(random()::text) as t11, gibberish(1500) as
t12 from generate_series(1,20e6) f(x);

I got logorrhea of course, but I also got a memory leak into the SQL
function context:

  TopPortalContext: 8192 total in 1 blocks; 7656 free (0 chunks); 536 used
PortalContext: 16384 total in 5 blocks; 5328 free (1 chunks); 11056
used: 
  ExecutorState: 4810120 total in 13 blocks; 4167160 free (74922
chunks); 642960 used
SQL function: 411058232 total in 60 blocks; 4916568 free (4
chunks); 406141664 used: gibberish

The memory usage grew until OOM killer stepped in.

Cheers,

Jeff


Re: memory leak in auto_explain

2021-02-01 Thread Jeff Janes
On Mon, Feb 1, 2021 at 6:09 PM Jeff Janes  wrote:

>
>
> create or replace function gibberish(int) returns text language SQL as $_$
> select left(string_agg(md5(random()::text),''),$1) from
> generate_series(0,$1/32) $_$;
>
> create table j1 as select x, md5(random()::text) as t11, gibberish(1500)
> as t12 from generate_series(1,20e6) f(x);
>

I should have added, found it on HEAD, verified it also in 12.5.

Cheers,

Jeff

>


Re: Fix DROP TABLESPACE on Windows with ProcSignalBarrier?

2021-02-01 Thread Thomas Munro
On Tue, Feb 2, 2021 at 11:16 AM Thomas Munro  wrote:
> ...  A straw-man idea
> would be to touch a file under PGDATA/pg_dropped and fsync it so it
> survives a power outage, have checkpoints clean that out, and have
> GetNewRelFileNode() to try access() it.  ...

I should add, the reason I mentioned fsyncing it is that in another
thread we've also discussed making the end-of-crash-recovery
checkpoint optional, and then I think you'd need to be sure you can
avoid reusing the relfilenode even after crash recovery, because if
you recycle the relfilenode and then crash again you'd be exposed to
that hazard during the second run through recovery.  But perhaps it's
enough to recreate the hypothetical pg_dropped file while replaying
the drop-relation record.  Not sure, would need more thought.




Re: Key management with tests

2021-02-01 Thread Stephen Frost
Greetings,

* Bruce Momjian (br...@momjian.us) wrote:
> On Fri, Jan 29, 2021 at 05:40:37PM -0500, Stephen Frost wrote:
> > I hope it's pretty clear that I'm also very much in support of both this
> > effort with the KMS and of TDE in general- TDE is specifically,
> 
> Yes, thanks.  I know we have privately talked about this recently, but
> it is nice to have it in public like this.

Certainly happy to lend my support and to spend some time working on
this to move it forward.

> > repeatedly, called out as a capability whose lack is blocking PG from
> > being able to be used for certain use-cases that it would otherwise be
> > well suited for, and that's really unfortunate.
> 
> So, below, I am going to copy two doc paragraphs from the patch:
> 
>   The purpose of cluster file encryption is to prevent users with read
>   access to the directories used to store database files and write-ahead
>   log files from being able to access the data stored in those files.
>   For example, when using cluster file encryption, users who have read
>   access to the cluster directories for backup purposes will not be able
>   to decrypt the data stored in these files.  It also protects against
>   decrypted data access after media theft.

That's one valid use-case and it particularly makes sense to consider,
now that we support group read-access to the data cluster.  The last
line seems a bit unclear- I would update it to say:

Cluster file encryption also provides data-at-rest security, protecting
users from data loss should the physical media on which the cluster is
stored be stolen, improperly deprovisioned (not wiped or destroyed), or
otherwise end up in the hands of an attacker.

>   File system write access can allow for unauthorized file system data
>   decryption if the writes can be used to weaken the system's security
>   and this weakened system is later supplied with externally-stored keys.

This isn't very clear as to exactly what the concern is or how an
attacker would be able to thwart the system if they had write access to
it.  An attacker with write access could possibly attempt to replace the
existing keys, but with the key wrapping that we're using, that should
result in just a decryption failure (unless, of course, the attacker has
the actual KEK that was used, but that's not terribly interesting to
worry about since then they could just go access the files directly).

Until and unless we solve the issue around storing the GCM tags for each
page, we will have the risk that an attacker could modify a page in a
manner that we wouldn't detect.  This is the biggest concern that I have
currently with the existing TDE patch sets.

There's two options that I see around how to address that issue- either
we arrange to create space in the page for the tag, such as by making
the 'special' space on a page a bit bigger and making sure that
everything understands that, or we'll need to add another fork in which
we store the tags (and possibly other TDE/encryption related
information).  If we go with a fork then it should be possible to do WAL
streaming from an unencrypted cluster to an encrypted one, which would
be pretty neat, but it means another fork and another page that has to
be read/written every time we modify a page.  Getting some input into
the trade-offs here would be really helpful.  I don't think it's really
reasonable to go out with TDE without having figured out the integrity
side.  Certainly, when I review things like NIST 800-53, it's very clear
that the requirement is for both confidentiality *and* integrity.

>   This also does not protect from users who have read access to system
>   memory.  This also does not detect or protect against users with write
>   access from removing or modifying database files.

The last seems a bit obvious, but the first sentence quoted above is
important to make clear.  I might even say:

All of the pages in memory and all of the keys which are used for the
encryption and decryption are stored in the clear in memory and
therefore an attacker who is able to read the memory allocated by
PostgreSQL would be able to decrypt the entire cluster.

> Given what I said above, is the value of this feature for compliance, or
> for actual additional security?  If it just compliance, are we willing
> to add all of this code just for that, even if it has limited security
> value?  We should answer this question now, and if we don't want it,
> let's document that so users know and can consider alternatives.

The feature is for both compliance and additional security.  While there
are other ways to achieve data-at-rest encryption, they are not always
available, for a variety of reasons.

> FYI, I don't think we can detect or protect against writers modifying
> the data files --- even if we could do it on a block level, they could
> remove trailing pages (might cause index lookup failures) or copy
> pages from other tables at the same offset.  Therefore, I think we can
> only offer 

Re: Key management with tests

2021-02-01 Thread Stephen Frost
Greetings,

* Bruce Momjian (br...@momjian.us) wrote:
> On Sat, Jan 30, 2021 at 08:23:11AM -0500, Tom Kincaid wrote:
> > I propose that we meet to discuss what approach we want to use to move TDE
> > forward.  We then start a new thread with a proposal on the approach
> > and finalize it via community consensus. I will invite Bruce, Stephen and
> > Masahiko to this meeting. If anybody else would like to participate in this
> > discussion and subsequently in the effort to get TDE in PG1x, please let me
> > know. Assuming Bruce, Stephen and Masahiko are down for this, I (or a 
> > volunteer
> > from this meeting) will post the proposal for how we move this patch 
> > forward in
> > another thread. Hopefully, we can get consensus on that and subsequently
> > restart the execution of delivering this feature.
> 
> We got complaints that decisions were not publicly discussed, or were
> too long, so I am not sure this helps.

If the notes are published afterwards as an explanation of why certain
choices were made, I suspect it'd be reasonably well received.  The
concern about back-room discussions is more that decisions are made
without explanation as to why; provided we avoid that, I believe they
can be helpful.

So, +1 for my part to have the conversation.

Thanks,

Stephen


signature.asc
Description: PGP signature


Typo in tablesync comment

2021-02-01 Thread Peter Smith
PSA a trivial patch to correct what seems like a typo in the tablesync comment.


Kind Regards,
Peter Smith.
Fujitsu Australia


fix-tablesync-comment-typo-20210202.patch
Description: Binary data


Re: Proposal: Save user's original authenticated identity for logging

2021-02-01 Thread Stephen Frost
Greetings,

* Jacob Champion (pchamp...@vmware.com) wrote:
> On Mon, 2021-02-01 at 18:01 -0500, Stephen Frost wrote:
> > Ok..  but what's 'go' mean here?  We already have views and such for GSS
> > and SSL, is the idea to add another view for LDAP and add in columns
> > that are returned by pg_stat_get_activity() which are then pulled out by
> > that view?  Or did you have something else in mind?
> 
> Magnus suggested a function like pg_get_authenticated_identity(), which
> is what I was thinking of when I said that. I'm not too interested in
> an LDAP-specific view, and I don't think anyone so far has asked for
> that.
> 
> My goal is to get this one single point of reference, for all of the
> auth backends. The LDAP mapping conversation is separate.

Presumably this would be the DN for SSL then..?  Not just the CN?  How
would the issuer DN be included?  And the serial?

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: Should we make Bitmapsets a kind of Node?

2021-02-01 Thread Tom Lane
David Rowley  writes:
> On Tue, 2 Feb 2021 at 09:23, Tom Lane  wrote:
>> Now that commit f003a7522 did away with the partitioned_rels fields,
>> my original motivation for doing $SUBJECT is gone.  It might still be
>> worth doing, but I'm not planning to tackle it right now.

> I'm not sure if the misuse of Lists to store non-Node types should be
> all that surprising.

Well, as I tried to clarify upthread, it's only a problem if the list
is a subfield of a recognized Node type.  Random private data structures
can and do contain lists of $whatever.  But if you put something in a
Node type then you'd better be prepared to teach backend/nodes/*.c about
it.

regards, tom lane




Re: Determine parallel-safety of partition relations for Inserts

2021-02-01 Thread Greg Nancarrow
On Mon, Feb 1, 2021 at 4:02 PM Hou, Zhijie  wrote:
>
> Attatching v2 patch which addressed the comments above.
>
> Some further refactor:
>
> Introducing a new function is_parallel_possible_for_modify() which decide 
> whether to do safety check.
>
> IMO, It seems more readable to extract all the check that we can do before 
> the safety-check and put them
> in the new function.
>
> Please consider it for further review.
>

I've updated your v2 patches and altered some comments and
documentation changes (but made no code changes) - please compare
against your v2 patches, and see whether you agree with the changes to
the wording.
In the documentation, you will also notice that in your V2 patch, it
says that the "parallel_dml_enabled" table option defaults to false.
As it actually defaults to true, I changed that in the documentation
too.

Regards,
Greg Nancarrow
Fujitsu Australia


v3_0004-reloption-parallel_dml-test-and-doc.patch
Description: Binary data


v3_0001-guc-option-enable_parallel_dml-src.patch
Description: Binary data


v3_0002-guc-option-enable_parallel_dml-doc-and-test.patch
Description: Binary data


v3_0003-reloption-parallel_dml-src.patch
Description: Binary data


Re: Determine parallel-safety of partition relations for Inserts

2021-02-01 Thread Zhihong Yu
Hi,
For v3_0003-reloption-parallel_dml-src.patch :

+* Check if parallel_dml_enabled is enabled for the target table,
+* if not, skip the safety checks and return PARALLEL_UNSAFE.

Looks like the return value is true / false. So the above comment should be
adjusted.

+   if (!RelationGetParallelDML(rel, true))
+   {
+   table_close(rel, NoLock);
+   return false;
+   }
+
+   table_close(rel, NoLock);

Since the rel would always be closed, it seems the return value
from RelationGetParallelDML() can be assigned to a variable, followed by
call to table_close(), then the return statement.
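
Something like the following fragment is what I mean (a rough sketch only;
RelationGetParallelDML() and the surrounding code come from the patch, so
the names and context are assumed):

	bool		parallel_dml = RelationGetParallelDML(rel, true);

	/* Close the relation unconditionally, then act on the saved value. */
	table_close(rel, NoLock);

	if (!parallel_dml)
		return false;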

Cheers

On Mon, Feb 1, 2021 at 3:56 PM Greg Nancarrow  wrote:

> On Mon, Feb 1, 2021 at 4:02 PM Hou, Zhijie 
> wrote:
> >
> > Attatching v2 patch which addressed the comments above.
> >
> > Some further refactor:
> >
> > Introducing a new function is_parallel_possible_for_modify() which
> decide whether to do safety check.
> >
> > IMO, It seems more readable to extract all the check that we can do
> before the safety-check and put them
> > in the new function.
> >
> > Please consider it for further review.
> >
>
> I've updated your v2 patches and altered some comments and
> documentation changes (but made no code changes) - please compare
> against your v2 patches, and see whether you agree with the changes to
> the wording.
> In the documentation, you will also notice that in your V2 patch, it
> says that the "parallel_dml_enabled" table option defaults to false.
> As it actually defaults to true, I changed that in the documentation
> too.
>
> Regards,
> Greg Nancarrow
> Fujitsu Australia
>


Re: Proposal: Save user's original authenticated identity for logging

2021-02-01 Thread Jacob Champion
On Mon, 2021-02-01 at 18:40 -0500, Stephen Frost wrote:
> * Jacob Champion (pchamp...@vmware.com) wrote:
> > My goal is to get this one single point of reference, for all of the
> > auth backends. The LDAP mapping conversation is separate.
> 
> Presumably this would be the DN for SSL then..?  Not just the CN?

Correct.

> How would the issuer DN be included?  And the serial?

In the current proposal, they're not. Seems like only the Subject
should be considered when determining the "identity of the user" --
knowing the issuer or the certificate fingerprint might be useful in
general, and perhaps they should be logged somewhere, but they're not
part of the user's identity.

If there were a feature that considered the issuer or serial number
when making role mappings, I think it'd be easier to make a case for
that. As of right now I don't think they should be incorporated into
this *particular* identifier.

--Jacob


Re: Key management with tests

2021-02-01 Thread Bruce Momjian
On Mon, Feb  1, 2021 at 06:34:53PM -0500, Stephen Frost wrote:
> Greetings,
> 
> * Bruce Momjian (br...@momjian.us) wrote:
> > On Sat, Jan 30, 2021 at 08:23:11AM -0500, Tom Kincaid wrote:
> > > I propose that we meet to discuss what approach we want to use to move TDE
> > > forward.  We then start a new thread with a proposal on the approach
> > > and finalize it via community consensus. I will invite Bruce, Stephen and
> > > Masahiko to this meeting. If anybody else would like to participate in 
> > > this
> > > discussion and subsequently in the effort to get TDE in PG1x, please let 
> > > me
> > > know. Assuming Bruce, Stephen and Masahiko are down for this, I (or a 
> > > volunteer
> > > from this meeting) will post the proposal for how we move this patch 
> > > forward in
> > > another thread. Hopefully, we can get consensus on that and subsequently
> > > restart the execution of delivering this feature.
> > 
> > We got complaints that decisions were not publicly discussed, or were
> > too long, so I am not sure this helps.
> 
> If the notes are published afterwords as an explanation of why certain
> choices were made, I suspect it'd be reasonably well received.  The
> concern about back-room discussions is more that decisions are made
> without explanation as to why, provided we avoid that, I believe they
> can be helpful.

Well, I thought that was what the wiki was, but I guess not.  I did
remove some of the decision logic recently since we had made a final
decision.  However, most of the questions were not covered on the wiki,
since, as I said, everyone comes with a different need for details.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Support for NSS as a libpq TLS backend

2021-02-01 Thread Jacob Champion
On Mon, 2021-02-01 at 21:49 +0100, Daniel Gustafsson wrote:
> > On 29 Jan 2021, at 19:46, Jacob Champion  wrote:
> > I think the bad news is that the static approach will need support for
> > ENABLE_THREAD_SAFETY.
> 
> I did some more reading today and noticed that the NSS documentation (and 
> their
> sample code for doing crypto without TLS connections) says to use 
> NSS_NoDB_Init
> to perform a read-only init which don't require a matching close call.  Now,
> the docs aren't terribly clear and also seems to have gone offline from MDN,
> and skimming the code isn't entirelt self-explanatory, so I may well have
> missed something.  The v24 patchset posted changes to this and at least passes
> tests with decent performance so it seems worth investigating.

Nice! Not having to close helps quite a bit.

(Looks like thread safety for NSS_Init was added in 3.13, so we have an
absolute version floor.)

> > (It looks like the NSS implementation of pgtls_close() needs some thread
> > support too?)
> 
> Storing the context in conn would probably be better?

Agreed.

> > The good(?) news is that I don't understand why OpenSSL's
> > implementation of cryptohash doesn't _also_ need the thread-safety
> > code. (Shouldn't we need to call CRYPTO_set_locking_callback() et al
> > before using any of its cryptohash implementation?) So maybe we can
> > implement the same global setup/teardown API for OpenSSL too and not
> > have to one-off it for NSS...
> 
> No idea here, wouldn't that impact pgcrypto as well in that case?

If pgcrypto is backend-only then I don't think it should need
multithreading protection; is that right?

--Jacob


RE: Commitfest 2021-01 is now closed.

2021-02-01 Thread tsunakawa.ta...@fujitsu.com
From: Masahiko Sawada 
> about 1 month. But if I confirm that the author has a plan to update
> the patch soon I didn't close them. So I might have left too many
> patches for the next commitfest. If you have a patch that was moved,
> and you intend to rewrite enough of it to warrant a resubmission then
> please go in and close your entry.

I appreciate this kind of careful treatment.  A great job and great thanks!  It 
must have been tough to sift through so many difficult discussions.


> From another point of view, those patches are likely to have a long
> discussion and a certain level of difficulty, so it's relatively hard
> for beginners. It would be good if the experienced hackers focused more
> on such difficult patches.  It's just an idea, but I thought it
> would be helpful if we could have something like a mark on the CF app
> indicating that a patch is good for beginners, like the [E] mark in
> the ToDo wiki page[1]. This would be a good indicator for new
> contributors choosing a patch to review and might help increase the
> number of reviewers, which would let the experienced hackers focus on
> other patches. The mark could be added/edited either by the patch author
> or the CFM.

+10
Or maybe we can add some difficulty score, like an e-commerce review score, so 
that multiple people (patch author(s), serious persistent reviewers, the CFM, and 
others who had a look but gave up reviewing) can reflect their impressions.
Further, something like stars or "Likes" could be encouraging (while a count of 
0 may be discouraging for the author).
Also, I'd be happy if I could know the patch set size -- the total of the last 
line of diffstat for each patch file.


Regards
Takayuki Tsunakawa





Re: Key management with tests

2021-02-01 Thread Bruce Momjian
On Mon, Feb  1, 2021 at 06:31:32PM -0500, Stephen Frost wrote:
> * Bruce Momjian (br...@momjian.us) wrote:
> >   The purpose of cluster file encryption is to prevent users with read
> >   access to the directories used to store database files and write-ahead
> >   log files from being able to access the data stored in those files.
> >   For example, when using cluster file encryption, users who have read
> >   access to the cluster directories for backup purposes will not be able
> >   to decrypt the data stored in these files.  It also protects against
> >   decrypted data access after media theft.
> 
> That's one valid use-case and it particularly makes sense to consider,
> now that we support group read-access to the data cluster.  The last

Do enough people use group read-access for this to be useful?

> line seems a bit unclear- I would update it to say:
> Cluster file encryption also provides data-at-rest security, protecting
> users from data loss should the physical media on which the cluster is
> stored be stolen, improperly deprovisioned (not wiped or destroyed), or
> otherwise end up in the hands of an attacker.

I have split the section into three paragraphs, trimmed down some of the
suggested text, and added it.  Full version below.

> >   File system write access can allow for unauthorized file system data
> >   decryption if the writes can be used to weaken the system's security
> >   and this weakened system is later supplied with externally-stored keys.
> 
> This isn't very clear as to exactly what the concern is or how an
> attacker would be able to thwart the system if they had write access to
> it.  An attacker with write access could possibly attempt to replace the
> existing keys, but with the key wrapping that we're using, that should
> result in just a decryption failure (unless, of course, the attacker has
> the actual KEK that was used, but that's not terribly interesting to
> worry about since then they could just go access the files directly).

Uh, well, they could modify postgresql.conf to change the script to save
the secret returned by the script before returning it to the PG server. 
We could require postgresql.conf to be somewhere secure, but then how do
we know that is secure?  I just don't see a clean solution here, but the
idea that you write and then wait for the key to show up seems like a
very valid avenue of attack, and it took me a while to be able to
articulate it.

> Until and unless we solve the issue around storing the GCM tags for each
> page, we will have the risk that an attacker could modify a page in a
> manner that we wouldn't detect.  This is the biggest concern that I have
> currently with the existing TDE patch sets.

Well, GCM certainly can detect page modification, but it can't detect
removing pages from the end of the table, or, since the nonce is
LSN/pageno, copying a page at the same offset from one table into
another table, particularly with partitioning where the
tables have the same columns.  We might be able to protect against the
latter with some kind of table-id in the nonce, but I don't see how table
truncation can be detected without adding a whole lot of overhead and
complexity.  And if we can't protect against those two, why bother with
detecting single-page modifications?  We have to do a full job for it to
be useful.

> There's two options that I see around how to address that issue- either
> we arrange to create space in the page for the tag, such as by making
> the 'special' space on a page a bit bigger and making sure that
> everything understands that, or we'll need to add another fork in which
> we store the tags (and possibly other TDE/encryption related
> information).  If we go with a fork then it should be possible to do WAL
> streaming from an unencrypted cluster to an encrypted one, which would
> be pretty neat, but it means another fork and another page that has to
> be read/written every time we modify a page.  Getting some input into
> the trade-offs here would be really helpful.  I don't think it's really
> reasonable to go out with TDE without having figured out the integrity
> side.  Certainly, when I review things like NIST 800-53, it's very clear
> that the requirement is for both confidentiality *and* integrity.

Wow, well, if they are both required, and we can't do both, is it
valuable to do just one?  Yes, we can do something later, but what if we
have no idea how to implement the second part?  Your fork idea above
might need to store some table-id used for the nonce (to prevent copying
from another table) and the number of pages in the table, which fixes
the integrity check issue, but adds a lot of complexity and perhaps
overhead.

> >   This also does not protect from users who have read access to system
> >   memory.  This also does not detect or protect against users with write
> >   access from removing or modifying database files.
> 
> The last seems a bit obvious, but the first sentence quoted above is

Re: Support for NSS as a libpq TLS backend

2021-02-01 Thread Jacob Champion
On Mon, 2021-02-01 at 21:49 +0100, Daniel Gustafsson wrote:
> > Embedded NULLs are now handled in a similar manner to the OpenSSL side,
> > though because this failure happens during the certificate
> > authentication callback, it results in a TLS alert rather than simply
> > closing the connection.
> 
> But returning SECFailure from the cert callback force NSS to terminate the
> connection immediately doesn't it?

IIRC NSS will send the alert first, whereas our OpenSSL implementation
will complete the handshake and then drop the connection. I'll rebuild
with the latest and confirm.

> > For easier review of just the parts I've changed, I've also attached a
> > since-v22.diff, which is part of the 0001 patch.
> 
> I confused my dev trees and failed to include this in the v23 that I sent out
> (which should've been v24), sorry about that.  Attached is a v24 which is
> rebased on top of today's --with-ssl commit, and now includes your changes.

No problem. Thanks!

--Jacob


Re: a verbose option for autovacuum

2021-02-01 Thread Tommy Li
Hi Masahiko

> If we set
> it per-table basis, it’s useful when the user already knows which
> tables are likely to take a long time for autovacuum

I would assume that's the default case; most apps I've seen are designed
around a small
number of large tables that take up most of the maintenance effort.

> Regarding when to log, we can have autovacuum emit index vacuum log
> after each lazy_vacuum/cleanup_index() end like VACUUM VERBOSE does,
> but I’m not sure how it could work together with
> log_autovacuum_min_duration.

I do like having this rolled into the existing configuration. This might be
an absurd idea, but
what if the autovacuum process accumulates the per-index vacuum information
until that
threshold is reached, and then spits out the logs all at once? And after
the min duration is
passed, it just logs the rest of the index vacuum information as they
occur. That way the
information is more likely to be available to investigate an abnormally
long running vacuum
while it's still happening.


Tommy


Re: Support for NSS as a libpq TLS backend

2021-02-01 Thread Michael Paquier
On Tue, Feb 02, 2021 at 12:42:23AM +, Jacob Champion wrote:
> (Looks like thread safety for NSS_Init was added in 3.13, so we have an
> absolute version floor.)

If that's the case, I would recommend adding at least something to the
section called install-requirements in the docs.

> If pgcrypto is backend-only then I don't think it should need
> multithreading protection; is that right?

No need for it in the backend, unless there are plans to switch from
processes to threads there :p

libpq, ecpg and anything using them have to care about that.  Worth
noting that OpenSSL also has some special handling in libpq with
CRYPTO_get_id_callback() and that it tracks the number of opened
connections.
--
Michael


signature.asc
Description: PGP signature


Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2021-02-01 Thread Michael Paquier
On Mon, Feb 01, 2021 at 06:28:57PM +0300, Alexey Kondratov wrote:
> Hm, IIUC, REINDEX CONCURRENTLY doesn't use either of them. It directly uses
> index_create() with a proper tablespaceOid instead of
> SetRelationTableSpace(). And its checks structure is more restrictive even
> without tablespace change, so it doesn't use CheckRelationTableSpaceMove().

Sure.  I have not checked the patch in detail, but even with that it
would be much safer to me if we apply the same sanity checks
everywhere.  That's fewer potential holes to worry about.

> Basically, it implements this option "we could silently *not* do the switch
> for partitioned indexes in the flow of REINDEX, letting users handle that
> with an extra ALTER TABLE SET TABLESPACE once REINDEX has finished". It
> probably makes sense, since we are doing tablespace change altogether with
> index relation rewrite and don't touch relations without storage. Doing
> ALTER INDEX ... SET TABLESPACE will be almost cost-less on them, since they
> do not hold any data.

Yeah, they'd still need an AEL for a short time on the partitioned
bits with what's on HEAD.  I'll keep in mind to look at the
possibility of downgrading this lock when cascading only to partitioned
tables.  The main takeaway is that AlterTableGetLockLevel() cannot select
a lock type based on the table metadata.  It's a tricky problem if
taken as a whole, but I guess that we should be able to tweak ALTER
TABLE ONLY on a partitioned table/index pretty easily (inh = false).
--
Michael


signature.asc
Description: PGP signature


Re: a verbose option for autovacuum

2021-02-01 Thread Euler Taveira
On Fri, Jan 29, 2021, at 4:35 AM, Masahiko Sawada wrote:
> I agree to have autovacuum log more information, especially index
> vacuums. Currently, the log related to index vacuum is only the number
> of index scans. I think it would be helpful if the log has more
> details about each index vacuum.
+1 for this feature. Sometimes this analysis is useful to confirm your theory;
without data, it is just a wild guess.

> But I'm not sure that neither always logging that nor having set the
> parameter per-table basis is a good idea. In the former case, it could
> be log spam for example in the case of anti-wraparound vacuums that
> vacuums on all tables (and their indexes) in the database. If we set
> it per-table basis, it’s useful when the user already knows which
> tables are likely to take a long time for autovacuum but won’t work
> when the users want to check the autovacuum details for tables that
> autovacuum could take a long time for.
I prefer a per-table parameter since it allows fine-grained tuning. It
covers the cases you provided above. You can disable it entirely and only enable
it for critical tables, or enable it and disable it for known-to-be-spam tables.

> Given that we already have log_autovacuum_min_duration, I think this
> verbose logging should work together with that. I’d prefer to enable
> the verbose logging by default for the same reason Stephen mentioned.
> Or maybe we can have a parameter to control verbosity, say
> log_autovacuum_verbosity.
IMO this new parameter is just an option to inject VERBOSE into the VACUUM
command. Since there is already a parameter to avoid spamming autovacuum
messages, this feature shouldn't hijack the log_autovacuum_min_duration
behavior. If the autovacuum execution time is less than l_a_m_d, the output
should be discarded.

I don't have a strong opinion about this parameter name, but your suggestion
(log_autovacuum_verbosity) makes it easier to guess what this parameter is
for.
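
To make that concrete, if this ends up as a reloption, the per-table tuning
could look something like the following (the parameter is only a proposal at
this point, so the name and the reloption form are assumptions, nothing that
exists yet):

    -- hypothetical, once the proposed per-table option exists
    ALTER TABLE critical_table SET (log_autovacuum_verbosity = on);
    ALTER TABLE noisy_table SET (log_autovacuum_verbosity = off);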


--
Euler Taveira
EDB   https://www.enterprisedb.com/


RE: Determine parallel-safety of partition relations for Inserts

2021-02-01 Thread Hou, Zhijie
> > IMO, it seems more readable to extract all the checks that we can do
> > before the safety-check and put them in the new function.
> >
> > Please consider it for further review.
> >
> 
> I've updated your v2 patches and altered some comments and documentation
> changes (but made no code changes) - please compare against your v2 patches,
> and see whether you agree with the changes to the wording.
> In the documentation, you will also notice that in your V2 patch, it says
> that the "parallel_dml_enabled" table option defaults to false.
> As it actually defaults to true, I changed that in the documentation too.

Hi greg,

Thanks a lot for the document update, LGTM.

Attaching v4 patches with the changes:
 * fix some typos in the code.
 * store the partitioned-table reloptions in a separate struct PartitionedOptions, 
   making it more standard and easier to expand in the future.

Please consider it for further review.
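
In case it helps with review, the knobs in this patch set look roughly like
the following (the spellings follow the names used earlier in the thread and
in the patch file names, so the final names may still change):

    SET enable_parallel_dml = on;                               -- new GUC
    ALTER TABLE target_table SET (parallel_dml_enabled = off);  -- new table option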

Best regards,
Houzj





v4_0001-guc-option-enable_parallel_dml-src.patch
Description: v4_0001-guc-option-enable_parallel_dml-src.patch


v4_0002-guc-option-enable_parallel_dml-doc-and-test.patch
Description: v4_0002-guc-option-enable_parallel_dml-doc-and-test.patch


v4_0003-reloption-parallel_dml-src.patch.patch
Description: v4_0003-reloption-parallel_dml-src.patch.patch


v4_0004-reloption-parallel_dml-test-and-doc.patch
Description: v4_0004-reloption-parallel_dml-test-and-doc.patch


Re: [PATCH] Doc: improve documentation of oid columns that can be zero. (correct version)

2021-02-01 Thread Euler Taveira
On Sun, Jan 31, 2021, at 10:27 AM, Joel Jacobson wrote:
> Here comes the patch again, now including data.
Joel, register this patch into the next CF [1] so we don't lose track of it.


[1] https://commitfest.postgresql.org/32/


--
Euler Taveira
EDB   https://www.enterprisedb.com/


Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit

2021-02-01 Thread Fujii Masao




On 2021/02/01 16:13, Bharath Rupireddy wrote:

On Mon, Feb 1, 2021 at 12:29 PM Fujii Masao  wrote:

On 2021/01/30 9:28, Bharath Rupireddy wrote:

On Sat, Jan 30, 2021 at 12:14 AM Fujii Masao
 wrote:

+   /*
+* It doesn't make sense to show this entry in the output with a
+* NULL server_name as it will be closed at the xact end.
+*/
+   continue;

-1 with this change because I still think that it's more useful to list
all the open connections.


If postgres_fdw_get_connections doesn't have a "valid" column, then I
thought it's better not showing server_name NULL in the output.


Or if we don't have strong reason to remove "valid" column,
the current design is enough?


My only worry was about the statement from [1]: "A cache flush should
not cause user-visible state changes."


If we follow this strictly, I'm afraid that postgres_fdw_get_connections()
itself would also be a problem because the cached connections are affected
by cache flush and postgres_fdw_get_connections() shows that to users.
I'm not sure if removing "valid" column is actually helpful for that statement.

Anyway, for now we have the following options;

(1) keep the feature as it is
(2) remove "valid" column
(2-1) show NULL for the connection whose server was dropped
(2-2) show fixed value (e.g., ) for the connection whose 
server was dropped
(3) remove "valid" column and don't display connection whose server was dropped
(4) remove postgres_fdw_get_connections()

For now I like (1), but if others think "valid" column should be dropped,
I'm fine with (2). But I'd like to avoid (3) because I think that
postgres_fdw_get_connections() should list all the connections that
are actually being established. I have no strong opinion about whether
(2-1) or (2-2) is better, for now.
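
Just to illustrate (1), with the current patch a connection whose server has
been dropped would show up something like this (the server names are made up):

    =# SELECT * FROM postgres_fdw_get_connections();
     server_name | valid
    -------------+-------
     loopback1   | t
                 | f
    (2 rows)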


But the newly added function
postgres_fdw_get_connections is VOLATILE, which means that the results
returned by postgres_fdw_get_connections() are also VOLATILE. Isn't
this enough, so that users will not get surprised with different
results in case invalidations occur within the server by the time they
run the query subsequent times and see different results than what
they saw in the first run?


I'm not sure about this...

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Single transaction in the tablesync worker?

2021-02-01 Thread Peter Smith
On Mon, Feb 1, 2021 at 11:26 PM Amit Kapila  wrote:

> I have updated the patch to display WARNING for each of the tablesync
> slots during DropSubscription. As discussed, I have moved the drop
> slot related code towards the end in AlterSubscription_refresh. Apart
> from this, I have fixed one more issue in the tablesync code wherein,
> after catching the exception, we were not clearing the transaction
> state on the publisher, see changes in LogicalRepSyncTableStart. I
> have also fixed other comments raised by you.

Here are some additional feedback comments about the v24 patch:

~~

ReportSlotConnectionError:

1,2,3,4.
+ foreach(lc, rstates)
+ {
+ SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
+ Oid relid = rstate->relid;
+
+ /* Only cleanup resources of tablesync workers */
+ if (!OidIsValid(relid))
+ continue;
+
+ /*
+ * Caller needs to ensure that we have appropriate locks so that
+ * relstate doesn't change underneath us.
+ */
+ if (rstate->state != SUBREL_STATE_SYNCDONE)
+ {
+ char syncslotname[NAMEDATALEN] = { 0 };
+
+ ReplicationSlotNameForTablesync(subid, relid, syncslotname);
+ elog(WARNING, "could not drop tablesync replication slot \"%s\"",
+ syncslotname);
+
+ }
+ }

1. I wonder if "rstates" would be better named something like
"not_ready_rstates", otherwise it is not apparent what states are in
this list

2. The comment "/* Only cleanup resources of tablesync workers */" is
not quite correct because there is no cleanup happening here. Maybe
change to:
if (!OidIsValid(relid))
continue; /* not a tablesync worker */

3. Maybe the "appropriate locks" comment can say what locks are the
"appropriate" ones?

4. Spurious blank line after the elog?

~~

AlterSubscription_refresh:

5.
+ /*
+ * Drop the tablesync slot. This has to be at the end because
otherwise if there
+ * is an error while doing the database operations we won't be able to rollback
+ * dropped slot.
+ */

Maybe "Drop the tablesync slot." should say "Drop the tablesync slots
associated with removed tables."

~~

DropSubscription:

6.
+ /*
+ * Cleanup of tablesync replication origins.
+ *
+ * Any READY-state relations would already have dealt with clean-ups.
+ *
+ * Note that the state can't change because we have already stopped both
+ * the apply and tablesync workers and they can't restart because of
+ * exclusive lock on the subscription.
+ */
+ rstates = GetSubscriptionNotReadyRelations(subid);
+ foreach(lc, rstates)

I wonder if "rstates" would be better named as "not_ready_rstates",
because it is used in several places where not READY is assumed.

7.
+ {
+ if (!slotname)
+ {
+ /* be tidy */
+ list_free(rstates);
+ return;
+ }
+ else
+ {
+ ReportSlotConnectionError(rstates, subid, slotname, err);
+ }
+
+ }

Spurious blank line above?

8.
The new logic of calling the ReportSlotConnectionError seems to be
expecting that the user has encountered some connection error, and
*after* that they have assigned slot_name = NONE as a workaround. In
this scenario the code looks ok since names of any dangling tablesync
slots were being logged at the time of the error.

But I am wondering about the case where the user might have set slot_name
= NONE *before* the connection is broken. In this scenario, there is
no ERROR, so if there are still (is it possible?) dangling tablesync
slots, their names are never getting logged at all. So how can the
user know what to delete?
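
To spell out that scenario, I am thinking of a sequence like this (the
subscription name is just an example):

ALTER SUBSCRIPTION sub DISABLE;
ALTER SUBSCRIPTION sub SET (slot_name = NONE);
DROP SUBSCRIPTION sub;  -- no publisher connection is attempted here

With slot_name already NONE, the code above just returns early, so the names
of any remaining tablesync slots are never reported.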

~~

> Additionally, I have
> removed the test because it was creating the same name slot as the
> tablesync worker and tablesync worker removed the same due to new
> logic in LogicalRepSyncStart. Earlier, it was not failing because of
> the bug in that code which I have fixed in the attached.

Wasn't causing a tablesync slot clash and seeing if it could recover
the whole point of that test? Why not just keep it, and modify the test to make
it work again? Isn't it still valuable because at least it would
execute the code through the PG_CATCH which otherwise may not get
executed by any other test?

>
> I wonder whether we should restrict creating slots with prefix pg_
> because we are internally creating slots with those names? I think
> this was a problem previously also. We already prohibit it for few
> other objects like origins, schema, etc., see the usage of
> IsReservedName.
>

Yes, we could restrict the create slot API if you really wanted to.
But, IMO it is implausible that a user could "accidentally" clash with
the internal tablesync slot name, so in practice maybe this change
would not help much but it might make it more difficult to test some
scenarios.


Kind Regards,
Peter Smith.
Fujitsu Australia




Re: New IndexAM API controlling index vacuum strategies

2021-02-01 Thread Peter Geoghegan
On Fri, Jan 29, 2021 at 5:26 PM Peter Geoghegan  wrote:
> It'll be essential to have good instrumentation as we do more
> benchmarking. We're probably going to have to make subjective
> assessments of benchmark results, based on multiple factors. That will
> probably be the only practical way to assess how much better (or
> worse) the patch is compared to master. This patch is more about
> efficiency and predictability than performance per se. Which is good,
> because that's where most of the real world problems actually are.

I've been thinking about how to get this patch committed for
PostgreSQL 14. This will probably require cutting scope, so that the
initial commit is not so ambitious. I think that "incremental VACUUM"
could easily take up a lot of my time for Postgres 15, and maybe even
Postgres 16.

I'm starting to think that the right short term goal should not
directly involve bottom-up index deletion. We should instead return to
the idea of "unifying" the vacuum_cleanup_index_scale_factor feature
with the INDEX_CLEANUP feature, which is kind of where this whole idea
started out at. This short term goal is much more than mere
refactoring. It is still a whole new user-visible feature. The patch
would teach VACUUM to skip doing any real index work within both
ambulkdelete() and amvacuumcleanup() in many important cases.

Here is a more detailed explanation:

Today we can skip all significant work in ambulkdelete() and
amvacuumcleanup() when there are zero dead tuples in the table. But
why is the threshold *precisely* zero? If we could treat cases that
have "practically zero" dead tuples in the same way (or almost the
same way) as cases with *exactly* zero dead tuple, that's still a big
improvement. And it still sets an important precedent that is crucial
for the wider "incremental VACUUM" project: the criteria for
triggering index vacuuming becomes truly "fuzzy" for the first time.
It is "fuzzy" in the sense that index vacuuming might not happen
during VACUUM at all now, even when the user didn't explicitly use
VACUUM's INDEX_CLEANUP option, and even when more than *precisely*
zero dead index tuples are involved (though not *much* more than zero,
can't be too aggressive). That really is a big change.
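
For comparison, the explicit form that already exists today is just:

    VACUUM (INDEX_CLEANUP off) my_table;   -- table name is only an example

The point is that a plain VACUUM would now sometimes make a similar choice on
its own, though within much tighter limits.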

A recap on vacuum_cleanup_index_scale_factor, just to avoid confusion:

The reader should note that this is very different to Masahiko's
vacuum_cleanup_index_scale_factor project, which skips *cleanup* in
VACUUM (not bulk delete), a question which only comes up when there
are definitely zero dead index tuples. The unifying work I'm talking
about now implies that we completely avoid scanning indexes during
vacuum, even when they are known to have at least a few dead index
tuples, and even when VACUUM's INDEX_CLEANUP emergency option is not
in use. Which, as I just said, is a big change.

Thoughts on triggering criteria for new "unified" design, ~99.9%
append-only tables:

Actually, in *one* sense the difference between "precisely zero" and
"practically zero" here *is* small. But it's still probably going to
result in skipping reading indexes during VACUUM in many important
cases. Like when you must VACUUM a table that is ~99.9% append-only.
In the real world things are rarely in exact discrete categories, even
when we imagine that they are. It's too easy to be wrong about one
tiny detail -- like one tiny UPDATE from 4 weeks ago, perhaps. Having
a tiny amount of "forgiveness" here is actually a *huge* improvement
on having precisely zero forgiveness. Small and big.

This should help cases that get big surprising spikes due to
anti-wraparound vacuums that must vacuum indexes for the first time in
ages -- indexes may be vacuumed despite only having a tiny absolute
number of dead tuples. I don't think that it's necessary to treat
anti-wraparound vacuums as special at all (not in Postgres 14 and
probably not ever), because simply considering cases where the table
has "practically zero" dead tuples alone should be enough. Vacuuming a
10GB index to delete only 10 tuples simply makes no sense. It doesn't
necessarily matter how we end up there, it just shouldn't happen.

The ~99.9% append-only table case is likely to be important and common
in the real world. We should start there for Postgres 14 because it's
easier, that's all. It's not fundamentally different to what happens
in workloads involving lots of bottom-up deletion -- it's just
simpler, and easier to reason about. Bottom-up deletion is an
important piece of the big puzzle here, but some variant of
"incremental VACUUM" really would still make sense in a world where
bottom-up index deletion does not exist. (In fact, I started thinking
about "incremental VACUUM" before bottom-up index deletion, and didn't
make any connection between them until work on bottom-up deletion had
already progressed significantly.)

Here is how the triggering criteria could work: maybe skipping
accessing all indexes during VACUUM happens when less than 1% or
10,000 of the items from the table are to be removed by VACUUM --
whichever is greater. Of course this is just the first thing I thought
of. It's a starting point for further discussion.

Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit

2021-02-01 Thread Fujii Masao




On 2021/02/01 16:39, Bharath Rupireddy wrote:

On Mon, Feb 1, 2021 at 12:43 PM Fujii Masao  wrote:

On 2021/01/27 10:06, Bharath Rupireddy wrote:

On Tue, Jan 26, 2021 at 8:38 AM Bharath Rupireddy
 wrote:

I will post "keep_connections" GUC and "keep_connection" server level
option patches later.


Attaching v19 patch set for "keep_connections" GUC and
"keep_connection" server level option. Please review them further.


Are these options still necessary now that we support idle_session_timeout?
If we want to disconnect foreign server connections that sit idle, to prevent
them from eating up the connection capacity of the foreign servers, we can
just set idle_session_timeout in those foreign servers. If we want to avoid a
cluster-wide setting of idle_session_timeout, we can set it per role. One
issue with this approach is that the connection entry remains even after
idle_session_timeout happens. So postgres_fdw_get_connections() returns that
connection even though it has actually been closed by the timeout. Which is
confusing. But it doesn't cause any actual problem, right? The next time the
foreign table is accessed, that connection entry is dropped, an error is
detected, and then a new connection is made.


First of all, idle_session_timeout is disabled by default (0), and
there is a chance that users don't use it and don't want to set it
just to avoid caching foreign server connections. A simple use case
where the server-level option can be useful: users access
foreign tables (maybe not that frequently, once in a while) from a
long-running local session, and they don't want
the local session to cache those connections. Then setting the
server-level option keep_connections to false makes their life
easier, without having to depend on setting idle_session_timeout on
the remote server.


Thanks for explaining this!

I understand that use case. But I still think that we can use
idle_session_timeout for that use case without keep_connections.
Per the past discussion, Robert seems to prefer controling the cached
connection by timeout rather than boolean, at [1]. Bruce seems to think
that idle_session_timeout is enough for the use case, at [2]. So I'm not
sure what the current consensus is...

Also Alexey seems to have thought that idle_session_timeout is not
suitable for cached connection because it's the cluster-wide option, at [3].
But since it's marked as PGC_USERSET, we can set it per-role, e.g.,
by using ALTER ROLE SET, so that it can affect only the foreign server
connections.
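
For example, running something like this on each remote server, for the role
used by the postgres_fdw user mapping (the role name here is only an example),
would limit the effect to the foreign server connections:

    ALTER ROLE fdw_mapped_user SET idle_session_timeout = '10min';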

One merit of keep_connections that I found is that we can use it even
when connecting to an older PostgreSQL server that doesn't support
idle_session_timeout. Also it seems simpler to use keep_connections
rather than setting idle_session_timeout in multiple remote servers.
So I'm inclined to add this feature, but I'd like to hear more opinions.

[1]
https://www.postgresql.org/message-id/CA%2BTgmob_nF7NkBfVLUhmQ%2Bt8JGVV4hXy%2BzkuMUtTSd-%3DHPBeuA%40mail.gmail.com

[2]
https://www.postgresql.org/message-id/20200714165822.GE7628%40momjian.us

[3]
https://www.postgresql.org/message-id/6df6525ca7a4b54a4a39f55e4dd6b3e9%40postgrespro.ru




And, just using idle_session_timeout on a remote server may not help
us completely, because the remote session may go away while we are
still using that cached connection in an explicit txn in the local
session. Our connection retry will also not work because we are in the
middle of an xact, so the local explicit txn gets aborted.


Regarding idle_in_transaction_session_timeout, this seems true. But
I was thinking that idle_session_timeout doesn't cause this issue because
it doesn't close the connection in the middle of a transaction. No?




So, IMO, we can still have both server level option as well as
postgres_fdw contrib level GUC (to tell the local session that "I
don't want to keep any foreign connections active" instead of setting
keep_connection server level option for each foreign server).


Sorry, I've not read the whole past discussion about this feature. If there is 
a consensus that these options are still necessary and useful even when we 
have idle_session_timeout, please correct me.

ISTM that it's intuitive (at least for me) to add this kind of option to the 
foreign server. But I'm not sure if it's a good idea to expose the option as a 
GUC. Again, if there is a consensus about this, please correct me.


See here [1].

[1] - 
https://www.postgresql.org/message-id/f58d1df4ae58f6cf3bfa560f923462e0%40postgrespro.ru


Thanks!


Here are some review comments.

-   (used_in_current_xact && !keep_connections))
+   (used_in_current_xact &&
+   (!keep_connections || !entry->keep_connection)))

The names of the GUC and the server-level option should be the same,
to make things less confusing?

IMO the server-level option should override GUC. IOW, GUC setting
should 

Re: Faulty HEAP_XMAX_LOCK_ONLY & HEAP_KEYS_UPDATED hintbit combination

2021-02-01 Thread Julien Rouhaud
On Mon, Feb 01, 2021 at 02:00:48PM -0300, Alvaro Herrera wrote:
> On 2021-Jan-24, Julien Rouhaud wrote:
> 
> > While working on pg14 compatibility for an extension relying on an 
> > apparently
> > uncommon combination of FOR UPDATE and stored function calls, I hit some new
> > Asserts introduced in 866e24d47db (Extend amcheck to check heap pages):
> > 
> > +   /*
> > +* Do not allow tuples with invalid combinations of hint bits to be 
> > placed
> > +* on a page.  These combinations are detected as corruption by the
> > +* contrib/amcheck logic, so if you disable one or both of these
> > +* assertions, make corresponding changes there.
> > +*/
> > +   Assert(!((tuple->t_data->t_infomask & HEAP_XMAX_LOCK_ONLY) &&
> > +(tuple->t_data->t_infomask2 & HEAP_KEYS_UPDATED)));
> > 
> > 
> > I attach a simple self contained script to reproduce the problem, the last
> > UPDATE triggering the Assert.
> 
> Maybe we should contest the idea that this is a sensible thing to Assert
> against.  AFAICS this was originally suggested here:
> https://www.postgresql.org/message-id/flat/CAFiTN-syyHc3jZoou51v0SR8z0POoNfktqEO6MaGig4YS8mosA%40mail.gmail.com#ad215d0ee0606b5f67bbc57d011c96b8
> and it appears now to have been a bad idea.  If I recall correctly,
> HEAP_KEYS_UPDATED is supposed to distinguish locks/updates that don't
> modify the key columns from those that do.  Since SELECT FOR UPDATE
> stands for a future update that may modify arbitrary portions of the
> tuple (including "key" columns), then it produces that bit, just as said
> UPDATE or a DELETE; as opposed to SELECT FOR NO KEY UPDATE which stands
> for a future UPDATE that will only change columns that aren't part of
> any keys.

Thanks for the clarification, that makes sense.

> So I think that I misspoke earlier in this thread when I said this is a
> bug, and that the right fix here is to remove the Assert() and change
> amcheck to match.

I'm attaching a patch to do so.

> Separately, maybe it'd also be good to have a test case based on
> Julien's SQL snippet that produces this particular infomask combination
> (and other interesting combinations) and passes them through VACUUM etc
> to see that everything behaves correctly.

I also updated amcheck perl regression tests to generate such a combination,
add added an additional pass of verify_heapam() just after the VACUUM.

> 
> You could also argue the HEAP_KEYS_UPDATED is a misnomer and that we'd
> do well to change its name, and update README.tuplock.

Changing the name may be overkill, but clarifying the hint bit usage in
README.tuplock would definitely be useful, especially since the combination
isn't always produced.  How about adding something like:

 HEAP_KEYS_UPDATED
  This bit lives in t_infomask2.  If set, indicates that the XMAX updated
  this tuple and changed the key values, or it deleted the tuple.
+  It can also be set in combination with HEAP_XMAX_LOCK_ONLY.
  It's set regardless of whether the XMAX is a TransactionId or a MultiXactId.
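
For reference, the sequence used in the attached test (and in my original
report) to produce this combination is roughly the following, with the table
name shortened here (see the attached test for the exact statements):

    BEGIN;
    SAVEPOINT s1;
    SELECT 1 FROM test WHERE a = 42 FOR UPDATE;
    UPDATE test SET b = b WHERE a = 42;
    RELEASE s1;
    SAVEPOINT s1;
    SELECT 1 FROM test WHERE a = 42 FOR UPDATE;
    UPDATE test SET b = b WHERE a = 42;
    COMMIT;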

From 529dff8991d4bd7db661612a450b40bbc53a1429 Mon Sep 17 00:00:00 2001
From: Julien Rouhaud 
Date: Fri, 22 Jan 2021 00:06:34 +0800
Subject: [PATCH v1] Allow HEAP_XMAX_LOCK_ONLY and HEAP_KEYS_UPDATED
 combination.

This combination of hint bits was previously detected as a form of corruption,
but it can be obtained with some mix of SELECT ... FOR UPDATE and UPDATE
queries.

Discussion: https://postgr.es/m/20210124061758.GA11756@nol
---
 contrib/amcheck/t/001_verify_heapam.pl | 14 +-
 contrib/amcheck/verify_heapam.c|  7 ---
 src/backend/access/heap/hio.c  |  8 +++-
 3 files changed, 16 insertions(+), 13 deletions(-)

diff --git a/contrib/amcheck/t/001_verify_heapam.pl 
b/contrib/amcheck/t/001_verify_heapam.pl
index a2f65b826d..6050feb712 100644
--- a/contrib/amcheck/t/001_verify_heapam.pl
+++ b/contrib/amcheck/t/001_verify_heapam.pl
@@ -4,7 +4,7 @@ use warnings;
 use PostgresNode;
 use TestLib;
 
-use Test::More tests => 79;
+use Test::More tests => 80;
 
 my ($node, $result);
 
@@ -47,6 +47,9 @@ detects_heap_corruption(
 #
 fresh_test_table('test');
 $node->safe_psql('postgres', q(VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) test));
+detects_no_corruption(
+   "verify_heapam('test')",
+   "all-frozen not corrupted table");
 corrupt_first_page('test');
 detects_heap_corruption("verify_heapam('test')",
"all-frozen corrupted table");
@@ -92,6 +95,15 @@ sub fresh_test_table
ALTER TABLE $relname ALTER b SET STORAGE external;
INSERT INTO $relname (a, b)
(SELECT gs, repeat('b',gs*10) FROM 
generate_series(1,1000) gs);
+   BEGIN;
+   SAVEPOINT s1;
+   SELECT 1 FROM $relname WHERE a = 42 FOR UPDATE;
+   UPDATE $relname SET b = b WHERE a = 42;
+   RELEASE s1;
+   SAVEPOINT s1;
+   SELECT 1 FROM $relname WHERE a = 42 FOR UPDATE;
+   UPDATE $reln

Re: New IndexAM API controlling index vacuum strategies

2021-02-01 Thread Peter Geoghegan
On Mon, Feb 1, 2021 at 7:17 PM Peter Geoghegan  wrote:
> Here is how the triggering criteria could work: maybe skipping
> accessing all indexes during VACUUM happens when less than 1% or
> 10,000 of the items from the table are to be removed by VACUUM --
> whichever is greater. Of course this is just the first thing I thought
> of. It's a starting point for further discussion.

And now here is the second thing I thought of, which is much better:

Sometimes 1% of the dead tuples in a heap relation will be spread
across 90%+ of the pages. With other workloads 1% of dead tuples might
be highly concentrated, and appear in no more than 1% of all heap
pages. Obviously the distinction between these two cases/workloads
matters a lot. And so the triggering criteria must be quantitative
*and* qualitative. It should not be based on counting dead tuples,
since that alone won't differentiate these two extreme cases - both of
which are probably quite common (in the real world extremes are
actually the normal and common case IME).

I like the idea of basing it on counting *heap blocks*, not dead
tuples. We can count heap blocks that have *at least* one dead tuple
(of course it doesn't matter how they're dead, whether it was this
VACUUM operation or some earlier opportunistic pruning). Note in
particular that it should not matter if it's a heap block that has
only one LP_DEAD line pointer or a heap page that is near the
MaxHeapTuplesPerPage limit for the page -- we count either type of
page towards the heap-page based limit used to decide if index
vacuuming goes ahead for all indexes during VACUUM.

This removes almost all of the issues with not setting visibility map
bits reliably (another concern of mine that I forgot to mention
earlier), but it is still very likely to do the right thing in the
"~99.9% append-only table" case I mentioned in the last email. We can
be relatively aggressive (say by triggering index skipping when less
than 1% of all heap pages have dead tuples). Plus the new nbtree index
tuple deletion stuff is naturally very good at deleting index tuples
in the event of dead tuples becoming concentrated in relatively few
table blocks -- that helps too. This is true of the enhanced simple
deletion mechanism (which has been taught to be clever about table
block dead tuple concentrations/table layout), as well as bottom-up
index deletion.

-- 
Peter Geoghegan




Re: Faulty HEAP_XMAX_LOCK_ONLY & HEAP_KEYS_UPDATED hintbit combination

2021-02-01 Thread Dilip Kumar
On Mon, Feb 1, 2021 at 10:31 PM Alvaro Herrera  wrote:
>
> On 2021-Jan-24, Julien Rouhaud wrote:
>
> > While working on pg14 compatibility for an extension relying on an 
> > apparently
> > uncommon combination of FOR UPDATE and stored function calls, I hit some new
> > Asserts introduced in 866e24d47db (Extend amcheck to check heap pages):
> >
> > + /*
> > +  * Do not allow tuples with invalid combinations of hint bits to be 
> > placed
> > +  * on a page.  These combinations are detected as corruption by the
> > +  * contrib/amcheck logic, so if you disable one or both of these
> > +  * assertions, make corresponding changes there.
> > +  */
> > + Assert(!((tuple->t_data->t_infomask & HEAP_XMAX_LOCK_ONLY) &&
> > +  (tuple->t_data->t_infomask2 & HEAP_KEYS_UPDATED)));
> >
> >
> > I attach a simple self contained script to reproduce the problem, the last
> > UPDATE triggering the Assert.
>
> Maybe we should contest the idea that this is a sensible thing to Assert
> against.  AFAICS this was originally suggested here:
> https://www.postgresql.org/message-id/flat/CAFiTN-syyHc3jZoou51v0SR8z0POoNfktqEO6MaGig4YS8mosA%40mail.gmail.com#ad215d0ee0606b5f67bbc57d011c96b8
> and it appears now to have been a bad idea.

I see, I suggested that :)

  If I recall correctly,
> HEAP_KEYS_UPDATED is supposed to distinguish locks/updates that don't
> modify the key columns from those that do.  Since SELECT FOR UPDATE
> stands for a future update that may modify arbitrary portions of the
> tuple (including "key" columns), then it produces that bit, just as said
> UPDATE or a DELETE; as opposed to SELECT FOR NO KEY UPDATE which stands
> for a future UPDATE that will only change columns that aren't part of
> any keys.

Yeah, that makes sense.

> So I think that I misspoke earlier in this thread when I said this is a
> bug, and that the right fix here is to remove the Assert() and change
> amcheck to match.

+1

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



