Re: [PATCH] - Provide robust alternatives for replace_string

2020-08-05 Thread Asim Praveen

> On 03-Aug-2020, at 8:36 PM, Alvaro Herrera  wrote:
> 
> On 2020-Aug-03, Asim Praveen wrote:
> 
>> Thank you Alvaro for reviewing the patch!
>> 
>>> On 01-Aug-2020, at 7:22 AM, Alvaro Herrera  wrote:
>>> 
>>> What happens if a replacement string happens to be split in the middle
>>> by the fgets buffering?  I think it'll fail to be replaced.  This
>>> applies to both versions.
>> 
>> Can a string to be replaced be split across multiple lines in the source 
>> file?  If I understand correctly, fgets reads one line from the input file at a 
>> time.  If I do not, then in the worst case we will get an un-replaced string in 
>> the output, such as "@abs_dir@", and it should be easily detected by a 
>> failing diff.
> 
> I meant what if the line is longer than 1023 chars and the replace
> marker starts at byte 1021, for example.  Then the first fgets would get
> "@ab" and the second fgets would get "s_dir@" and none would see it as
> replaceable.
> 


Please find attached a StringInfo based solution to this problem.  It uses 
fgetln instead of fgets such that a line is read in full, without ever 
splitting it.
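
The general idea, shown here as a minimal sketch rather than the attached patch itself (the patch uses fgetln, and the function name below is made up), is to accumulate fgets() chunks into a StringInfo until the whole line, including its newline, has been read, so a replacement marker can never be split across buffer boundaries:

#include "postgres_fe.h"

#include <stdio.h>

#include "lib/stringinfo.h"

/*
 * Read one whole line from 'infile' into 'linebuf', however long it is.
 * Returns false at EOF when nothing was read.
 */
static bool
read_whole_line(FILE *infile, StringInfo linebuf)
{
	char		chunk[1024];

	resetStringInfo(linebuf);
	while (fgets(chunk, sizeof(chunk), infile) != NULL)
	{
		appendStringInfoString(linebuf, chunk);
		/* done once the trailing newline has been consumed */
		if (linebuf->data[linebuf->len - 1] == '\n')
			break;
	}
	return linebuf->len > 0;
}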

Asim



0001-Use-a-stringInfo-instead-of-a-char-for-replace_strin.patch
Description:  0001-Use-a-stringInfo-instead-of-a-char-for-replace_strin.patch


Re: LSM tree for Postgres

2020-08-05 Thread Konstantin Knizhnik



On 04.08.2020 20:44, Tomas Vondra wrote:


IMO the 6x difference is rather misleading, as it very much depends on
the duration of the benchmark and how much data it ends up with. I think
it's better to test 'stable states', i.e. with a small data set that does
not exceed RAM during the whole test, and a large one that already starts
larger than RAM. Not sure if it makes sense to distinguish between
cases that fit into shared buffers and those that exceed shared buffers
but still fit into RAM.


I have changed the benchmark scenario.
Now I insert 200 million records with a sequential key: it is fast 
enough and makes the index size about 19GB.

Then I perform 1 million random inserts.

-- init schema
create table t(k bigint, v1 bigint, v2 bigint, v3 bigint, v4 bigint, v5 
bigint, v6 bigint, v7 bigint, v8 bigint);

create index lsm_index on t using lsm3(k) include (v1,v2,v3,v4,v5,v6,v7,v8);
create table t2(k bigint, v1 bigint, v2 bigint, v3 bigint, v4 bigint, v5 
bigint, v6 bigint, v7 bigint, v8 bigint);

create index on t2(k) include (v1,v2,v3,v4,v5,v6,v7,v8);

-- fill with sequential data
insert into t values (generate_series(1,200000000),0,0,0,0,0,0,0,0);
Time: 520655,635 ms (08:40,656)

insert into t2 values (generate_series(1,200000000),0,0,0,0,0,0,0,0);
Time: 372245,093 ms (06:12,245)

-- random inserts
insert into t (v1,k,v2,v3,v4,v5,v6,v7,v8) values 
(generate_series(1,1000000),(random()*200000000)::bigint,0,0,0,0,0,0,0);

Time: 3781,614 ms (00:03,782)

insert into t2 (v1,k,v2,v3,v4,v5,v6,v7,v8) values 
(generate_series(1,1000000),(random()*200000000)::bigint,0,0,0,0,0,0,0);

Time: 39034,574 ms (00:39,035)

Then I perform random selects.

select.sql:
\set k random(1, 200000000)
select * from t where k=:k;

select2.sql:
\set k random(1, 200000000)
select * from t2 where k=:k;

pgbench -n -T 100 -P 10 -M prepared -f select.sql postgres
tps = 11372.821006 (including connections establishing)

pgbench -n -T 100 -P 10 -M prepared -f select2.sql postgres
tps = 10392.729026 (including connections establishing)


So, as you can see, the insert speed of Lsm3 is ten times higher, and 
select speed is the same as that of nbtree.




Re: Reg. Postgres 13

2020-08-05 Thread Magnus Hagander
On Wed, Aug 5, 2020 at 8:08 AM Joel Mariadasan (jomariad) <
jomar...@cisco.com> wrote:

> Hi,
>
>
>
> I see that release 13 is currently in beta.
>
> When will the official production release of 13 be out?
>
>
>
> We need to see if we can include this as part of our product release cycle.
>
>
>

Hello!

You can find this info on https://www.postgresql.org/developer/roadmap/.
The current roadmap is Q3, with a target hopefully in the Sept/Oct
timeframe (but no guarantees, of course).

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: Yet another issue with step generation in partition pruning

2020-08-05 Thread Amit Langote
Fujita-san,

Thanks a lot for your time on fixing these multi-column range
partition pruning issues.  I'm sorry that I failed to notice the
previous two reports on -bugs for which you committed a fix last week.

On Tue, Aug 4, 2020 at 9:46 PM Etsuro Fujita  wrote:
> Commit 13838740f fixed some issues with step generation in partition
> pruning, but as I mentioned in [1], I noticed that there is yet
> another issue: get_steps_using_prefix() assumes that clauses in the
> passed-in prefix list are sorted in ascending order of their partition
> key numbers, but the caller (i.e., gen_prune_steps_from_opexps())
> doesn’t ensure that in the case of range partitioning, leading to an
> assertion failure.  Here is an example causing such a failure, which
> would happen with/without that commit:
>
> create table rp_prefix_test2 (a int, b int, c int) partition by range (a, b, 
> c);
> create table rp_prefix_test2_p1 partition of rp_prefix_test2 for
> values from (1, 1, 0) to (1, 1, 10);
> create table rp_prefix_test2_p2 partition of rp_prefix_test2 for
> values from (2, 2, 0) to (2, 2, 10);
> select * from rp_prefix_test2 where a <= 1 and b <= 1 and b = 1 and c <= 0;
>
> I don't think we write queries like this, but for this query, the
> caller would create the prefix list for the last partition key “c”
> {b=1, a<=1, b<=1} (the clauses are not sorted properly!), then calling
> get_steps_using_prefix(), which leads to an assertion failure.

That analysis is spot on.

> Attached is a patch for fixing this issue.

I have looked at the patch and played around with it using the
regression tests you've added recently. I was not able to find any
results that looked surprising.

Thanks again.

--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com




Re: [PATCH v1] elog.c: Remove special case which avoided %*s format strings..

2020-08-05 Thread Michael Paquier
On Tue, Aug 04, 2020 at 09:06:16PM +1200, David Rowley wrote:
> FWIW, the tests I did to check this when initially working on it are
> in [1].  Justin, it would be good if you could verify you're not making it as
> bad as what's mentioned on that thread again.

Ouch.  Thanks for the reference.  Indeed it looks like it would hurt
even with just a simple PL function.
--
Michael


signature.asc
Description: PGP signature


Re: Add MAIN_RELATION_CLEANUP and SECONDARY_RELATION_CLEANUP options to VACUUM

2020-08-05 Thread Masahiko Sawada
On Mon, 3 Aug 2020 at 15:47, Michael Paquier  wrote:
>
> On Tue, Jul 14, 2020 at 05:34:01AM +, k.jami...@fujitsu.com wrote:
> > I've also confirmed those through regression + tap test in my own env
> > and they've passed. I'll look into deeply again if I find problems.
>
> +   VACOPT_TOAST_CLEANUP = 1 << 6,  /* process TOAST table, if any */
> +   VACOPT_DISABLE_PAGE_SKIPPING = 1 << 7,  /* don't skip any pages */
> +   VACOPT_MAIN_REL_CLEANUP = 1 << 8/* process main relation */
>  } VacuumOption;
>
> Do we actually need this much complication in the option set?  It is
> possible to vacuum a toast table directly by passing its
> relation name, with pg_toast as the schema, so you can already vacuum a
> toast relation without the main part.  And I would guess that users
> caring about the toast table specifically would know already how to do
> that, even if it requires a simple script and a query on pg_class.

Yeah, I also doubt we really need to have this option in core just
for the purpose of easily specifying a toast relation to the VACUUM command.
If the user doesn't know how to find the toast relation, I think we
can provide a script or an SQL function that executes the vacuum() C function
with the toast relation fetched via the main relation. I
personally think a VACUUM option should basically exist to control
internal vacuum behavior granularly in ways that the user cannot control
from outside, although there are some exceptions: FREEZE and ANALYZE.

Regards,


--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: LSM tree for Postgres

2020-08-05 Thread Dmitry Dolgov
> On Tue, Aug 04, 2020 at 11:22:13AM +0300, Konstantin Knizhnik wrote:
>
> Then I thought about implementing the ideas of LSM using the standard
> Postgres nbtree.
>
> We need two indexes: a small one for fast inserts and another, big
> (main) index. The top index is small enough to fit in memory, so
> inserts into this index are very fast.  Periodically we will merge data
> from the top index into the base index and truncate the top index. To prevent
> blocking of inserts into the table while we are merging indexes, we can
> add one more top index, which will be used during the merge.
>
> So the final architecture of Lsm3 is the following: two top indexes used
> in a cyclic way and one main index. When a top index reaches some
> threshold value, we initiate a merge with the main index, done by a bgworker,
> and switch to the other top index.  As merging of the indexes is done in the
> background, it doesn't affect insert speed.  Unfortunately the Postgres
> index AM has no bulk insert operation, so we have to perform normal
> inserts.  But the inserted data is already sorted by key, which should
> improve access locality and partly solve the random-reads problem for the base
> index.
>
> Certainly, to perform a search in Lsm3 we have to make lookups in all
> three indexes and merge the search results.

Thanks for sharing this! In fact this reminds me more of partitioned
b-trees [1] (and the older [2]) than of LSM as such (although it could
be that the former was influenced by the latter). What could be
interesting is that quite often in these and many other whitepapers
(e.g. [3]), to address the lookup overhead, the design includes bloom
filters in one way or another to avoid searching irrelevant parts of an
index. Tomas mentioned them in this thread as well (in a different
context); perhaps the design suggested here could also benefit from them?

[1]: Riegger Christian, Vincon Tobias, Petrov Ilia. Write-optimized
indexing with partitioned b-trees. (2017). 296-300. 10.1145/3151759.3151814.
[2]: Graefe Goetz. Write-Optimized B-Trees. (2004). 672-683.
10.1016/B978-012088469-8/50060-7.
[3]: Huanchen Zhang, David G. Andersen, Andrew Pavlo, Michael Kaminsky,
Lin Ma, and Rui Shen. Reducing the Storage Overhead of Main-Memory OLTP
Databases with Hybrid Indexes. (2016). 1567–1581. 10.1145/2882903.2915222.
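
Just to illustrate the bloom filter idea (this is only a sketch, not part of the posted patch; the lsm3_* names are invented), the in-core lib/bloomfilter.h API could keep one filter per top index so that point lookups can usually skip the small indexes entirely:

#include "postgres.h"
#include "lib/bloomfilter.h"

/* one filter per top index, sized for the expected number of entries */
static bloom_filter *top_filter;

static void
lsm3_filter_init(int64 expected_entries)
{
	/* 4 MB of filter space, fixed seed for reproducibility */
	top_filter = bloom_create(expected_entries, 4096, 0);
}

static void
lsm3_filter_add(int64 key)
{
	bloom_add_element(top_filter, (unsigned char *) &key, sizeof(key));
}

static bool
lsm3_key_maybe_in_top_index(int64 key)
{
	/* false positives are possible, false negatives are not */
	return !bloom_lacks_element(top_filter, (unsigned char *) &key, sizeof(key));
}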




Re: display offset along with block number in vacuum errors

2020-08-05 Thread Robert Haas
On Sun, Aug 2, 2020 at 10:43 PM Masahiko Sawada
 wrote:
> I think that since the user who uses this errcontext information is
> likely to know more or less the internals of PostgreSQL, a 0-based
> block number and a 1-based offset will not be a problem. However, I
> expected these to be documented but couldn't find that. If not yet, I think
> it's a good time to document that.

I agree. That's just how TIDs are.


--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Can a background worker exist without shared memory access for EXEC_BACKEND cases?

2020-08-05 Thread Bharath Rupireddy
On Tue, Aug 4, 2020 at 10:20 PM Robert Haas  wrote:
>
> On Tue, Aug 4, 2020 at 7:27 AM Bharath Rupireddy
>  wrote:
> > I could get these points earlier in my initial analysis. In fact, I
> > could figure out the flow on Windows: how these parameters are shared
> > using a shared file (CreateFileMapping(), MapViewOfFile()), the
> > shared file name being passed as argv[2] to the child process, and
> > the way the child process uses this file name to read the backend
> > parameters in read_backend_variables().
>
> Doesn't that happen even if the background worker isn't declared to
> use BGWORKER_SHMEM_ACCESS? See StartBackgroundWorker(): IIUC, we start
> with shared memory access, then afterwards detach.
>

Yes, the bgworker starts with shared memory access even without
BGWORKER_SHMEM_ACCESS, and it later gets detached in
StartBackgroundWorker() with PGSharedMemoryDetach().

if ((worker->bgw_flags & BGWORKER_SHMEM_ACCESS) == 0)
{
dsm_detach_all();
PGSharedMemoryDetach();
 }

In EXEC_BACKEND cases, right after PGSharedMemoryDetach(), the bg
worker will no longer be able to access the backend parameters, see
below(I tried this on my Ubuntu machine with a bgworker with no
BGWORKER_SHMEM_ACCESS flag and defined EXEC_BACKEND macro in
pg_config_manual.h) :

(gdb) p *MyLatch
Cannot access memory at address 0x7fd60424a6b4
(gdb) p *ShmemVariableCache
Cannot access memory at address 0x7fd58427bf80
(gdb) p ProcStructLock
$10 = (slock_t *) 0x7fd60429bd00 <error: Cannot access memory at address 0x7fd60429bd00>
(gdb) p *AuxiliaryProcs
Cannot access memory at address 0x7fd60424cc60
(gdb) p *ProcGlobal
Cannot access memory at address 0x7fd604232880

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Can a background worker exist without shared memory access for EXEC_BACKEND cases?

2020-08-05 Thread Robert Haas
On Wed, Aug 5, 2020 at 7:24 AM Bharath Rupireddy
 wrote:
> In EXEC_BACKEND cases, right after PGSharedMemoryDetach(), the bg
> worker will no longer be able to access the backend parameters, see
> below(I tried this on my Ubuntu machine with a bgworker with no
> BGWORKER_SHMEM_ACCESS flag and defined EXEC_BACKEND macro in
> pg_config_manual.h) :
>
> (gdb) p *MyLatch
> Cannot access memory at address 0x7fd60424a6b4
> (gdb) p *ShmemVariableCache
> Cannot access memory at address 0x7fd58427bf80
> (gdb) p ProcStructLock
> $10 = (slock_t *) 0x7fd60429bd00 <error: Cannot access memory at address 0x7fd60429bd00>
> (gdb) p *AuxiliaryProcs
> Cannot access memory at address 0x7fd60424cc60
> (gdb) p *ProcGlobal
> Cannot access memory at address 0x7fd604232880

Well all of those are pointers into the main shared memory segment,
which is expected to be inaccessible after it is detached. Hopefully
nobody should be surprised that if you don't specify
BGWORKER_SHMEM_ACCESS, you can't access data stored in shared memory.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Which SET TYPE don't actually require a rewrite

2020-08-05 Thread Magnus Hagander
On Sat, Jul 18, 2020 at 4:57 AM Noah Misch  wrote:

> On Fri, Jul 17, 2020 at 04:08:36PM +0200, Magnus Hagander wrote:
> > On Fri, Jul 17, 2020 at 5:40 AM Noah Misch  wrote:
> > > On Wed, Jul 15, 2020 at 02:54:37PM +0200, Magnus Hagander wrote:
> > > > Our Fine Manual (TM) specifies:
> > > > "As an exception, when changing the type of an existing column, if the
> > > > USING clause does not change the column contents and the old type is either
> > > > binary coercible to the new type or an unconstrained domain over the new
> > > > type, a table rewrite is not needed; but any indexes on the affected
> > > > columns must still be rebuilt."
> > > >
> > > > First of all, how is a non-internals-expert even supposed to know what a
> > > > binary coercible type is?
> > >
> > > The manual defines it at binary coercible.
> >
> > The only way to actually realize that this is a  is to look at
> > the source code though, right?
>
> I see italic typeface for .  This one deserves an ,
> too.
> (I bet many other  uses deserve an .)
>
> > It's definitely not clear that one should go
> > look at the CREATE CAST documentation to find the definition -- certainly
> > not from the ALTER TABLE documentation, which I would argue is the place
> > where most people would go.
>
> Agreed.
>
> > > > We can also for example increase the precision of numeric without a rewrite
> > > > (but not scale). Or we can change between text and varchar. And we can
> > > > increase the length of a varchar but not decrease it.
> > > >
> > > > Surely we can do better than this when it comes to documenting it?  Even if
> > > > it's a pluggable thing so it may or may not be true of external
> > > > datatypes installed later, we should be able to at least be more clear
> > > > about the builtin types, I think?
> > >
> > > I recall reasoning that ATColumnChangeRequiresRewrite() is a DDL analog of
> > > query optimizer logic.  The manual brings up only a minority of planner
> > > optimizations, and comprehensive lists of optimization preconditions are even
> > > rarer.  But I don't mind if $SUBJECT documentation departs from that norm.
> >
> > I can see the argument being made for that, and certainly having been made
> > for it in the future. But I'd say given the very bad consequences of
> > getting it wrong, it's far from minor. And given the number of times I've
> > had to answer the question "can I make this change safely" (which usually
> > amounts to me trying it out to see what happens, if I hadn't done that
> > exact one many times before), it indicates the need for more detailed
> > documentation on it.
>
> Such a doc addition is fine with me.  I agree with Tom that it will be prone
> to staleness, but I don't conclude that the potential for staleness reduces
> its net value below zero.  Having said that, if the consequences of doc
> staleness are "very bad", you may consider documenting the debug1 user
> interface (https://postgr.es/m/20121202020736.gd13...@tornado.leadboat.com)
> instead of documenting the exact rules.  Either way is fine with me.
>

The DEBUG1 method is only after the fact though, isn't it?

That makes it pretty hard for someone to, say, review a migration script and
see "this is going to cause problems". And if it's going to be run in an
env, I personally find it more useful to just stick an event trigger in
there per our documentation and block it (though it might be a good idea to
link to that from the alter table reference page, and not just have it
under event trigger examples).

I agree that documenting the rules would definitely be prone to staleness,
and that having EXPLAIN for DDL would be the *better* solution. But also
that having the docs, even if they go a bit stale, would be better than the
scenario today.

Unfortunately, I'm not sure I know enough of the details of what the rules
actually *are* to explain them in a way that's easy enough to go in the
docs...

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-08-05 Thread Amit Kapila
On Tue, Aug 4, 2020 at 12:42 PM Dilip Kumar  wrote:
>
> On Tue, Aug 4, 2020 at 10:12 AM Amit Kapila  wrote:
> >
>
> > 4.  I think we can explain the problems (like we can see the wrong
> > tuple or see two versions of the same tuple or whatever else wrong can
> > happen, if possible with some example) related to concurrent aborts
> > somewhere in comments.
>
> Done
>

I have slightly modified the comment added for the above point and
apart from that added/modified a few comments at other places.  I have
also slightly edited the commit message.

@@ -2196,6 +2778,7 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb,
TransactionId xid,
  change->lsn = lsn;
  change->txn = txn;
  change->action = REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID;
+ change->txn = txn;

This change is not required as the same information is assigned a few
lines before.  So, I have removed this change as well.  Let me know
what you think of the above changes?

Can we add a test for incomplete changes (probably with toast
insertion but we can do it for spec_insert case as well) in
ReorderBuffer such that it needs to first serialize the changes and
then stream it?  I have manually verified such scenarios but it is
good to have the test for the same.

-- 
With Regards,
Amit Kapila.


v46-0001-Implement-streaming-mode-in-ReorderBuffer.patch
Description: Binary data


Re: Can a background worker exist without shared memory access for EXEC_BACKEND cases?

2020-08-05 Thread Bharath Rupireddy
On Wed, Aug 5, 2020 at 5:16 PM Robert Haas  wrote:
>
> On Wed, Aug 5, 2020 at 7:24 AM Bharath Rupireddy
>  wrote:
> > In EXEC_BACKEND cases, right after PGSharedMemoryDetach(), the bg
> > worker will no longer be able to access the backend parameters, see
> > below(I tried this on my Ubuntu machine with a bgworker with no
> > BGWORKER_SHMEM_ACCESS flag and defined EXEC_BACKEND macro in
> > pg_config_manual.h) :
> >
> > (gdb) p *MyLatch
> > Cannot access memory at address 0x7fd60424a6b4
> > (gdb) p *ShmemVariableCache
> > Cannot access memory at address 0x7fd58427bf80
> > (gdb) p ProcStructLock
> > $10 = (slock_t *) 0x7fd60429bd00 <error: Cannot access memory at address 0x7fd60429bd00>
> > (gdb) p *AuxiliaryProcs
> > Cannot access memory at address 0x7fd60424cc60
> > (gdb) p *ProcGlobal
> > Cannot access memory at address 0x7fd604232880
>
> Well all of those are pointers into the main shared memory segment,
> which is expected to be inaccessible after it is detached. Hopefully
> nobody should be surprised that if you don't specify
> BGWORKER_SHMEM_ACCESS, you can't access data stored in shared memory.
>

Right.

Will the proposed patch (v2), which adds some info to bgworker.sgml and
bgworker.h, ever be useful to users in some way?

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Can a background worker exist without shared memory access for EXEC_BACKEND cases?

2020-08-05 Thread Robert Haas
On Wed, Aug 5, 2020 at 9:02 AM Bharath Rupireddy
 wrote:
> Will the proposed patch (v2), which adds some info to bgworker.sgml and
> bgworker.h, ever be useful to users in some way?

Well, it says things that aren't true, so, no, it's not useful. Your
patch claims that "the worker fails to receive the backend parameters
from the postmaster," but that's not the case. SubPostmasterMain()
first calls read_backend_variables() which calls
restore_backend_variables(); then later it calls
StartBackgroundWorker() which does PGSharedMemoryDetach(). So the
values of the backend variables *are* available in the worker
processes. Your debugger output also shows this: if
restore_backend_variables() weren't running in the child processes,
those variables would all be NULL, but you show them all at different
addresses in the 0x7fd... range, which is presumably where the shared
memory segment was mapped.

The reason you can't print out the results of dereferencing the
variables with * is because the memory to which the variables point is
no longer mapped in the process, not because the variables haven't
been initialized. If you looked at a variable that wasn't a pointer to
shared memory, but rather, say, an integer, like max_safe_fds or
MyCancelKey, I think you'd find that the value was preserved just
fine. I think you're confusing the pointer with the data to which it
points.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [PATCH] - Provide robust alternatives for replace_string

2020-08-05 Thread Alvaro Herrera
On 2020-Aug-05, Asim Praveen wrote:

> Please find attached a StringInfo based solution to this problem.  It
> uses fgetln instead of fgets such that a line is read in full, without
> ever splitting it.

I've never heard of fgetln; my system doesn't have a manpage for it, and we
don't use it anywhere AFAICS.  Are you planning to add something to
src/common for it?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-05 Thread Ashutosh Sharma
Hi Robert,

Thanks for the review. Please find my comments inline:

On Sat, Aug 1, 2020 at 12:18 AM Robert Haas  wrote:
>
> On Fri, Jul 31, 2020 at 8:52 AM Ashutosh Sharma  wrote:
> > Attached is the new version of patch that addresses the comments from 
> > Andrey and Beena.
>
> +PGFILEDESC = "pg_surgery - perform surgery on the damaged heap table"
>
> the -> a
>
> I also suggest: heap table -> relation, because we might want to
> extend this to other cases later.
>

Corrected.

> +-- toast table.
> +begin;
> +create table ttab(a text);
> +insert into ttab select string_agg(chr(floor(random() * 26)::int +
> 65), '') from generate_series(1,1);
> +select * from ttab where xmin = 2;
> + a
> +---
> +(0 rows)
> +
> +select heap_force_freeze('ttab'::regclass, ARRAY['(0, 1)']::tid[]);
> + heap_force_freeze
> +---
> +
> +(1 row)
> +
>
> I don't understand the point of this. You're not testing the function
> on the TOAST table; you're testing it on the main table when there
> happens to be a TOAST table that is probably getting used for
> something. But that's not really relevant to what is being tested
> here, so as written this seems redundant with the previous cases.
>

Yeah, it's being tested on the main table, not on a toast table. I've
removed this test-case and also restricted direct access to the toast
table using heap_force_kill/freeze functions. I think we shouldn't be
using these functions to do any changes in the toast table. We will
only use these functions with the main table and let VACUUM remove the
corresponding data chunks (pointed by the tuple that got removed from
the main table).

Another option would be to identify all the data chunks corresponding
to the tuple (ctid) being killed from the main table and remove them
one by one. We will only do this if the tuple from the main table that
has been marked as killed has an external storage. We will have to add
a bunch of code for this otherwise we can let VACUUM do this for us.
Let me know your thoughts on this.

> +-- test pg_surgery functions with the unsupported relations. Should fail.
>
> Please name the specific functions being tested here in case we add
> more in the future that are tested separately.
>

Done.

> +++ b/contrib/pg_surgery/heap_surgery_funcs.c
>
> I think we could drop _funcs from the file name.
>

Done.

> +#ifdef PG_MODULE_MAGIC
> +PG_MODULE_MAGIC;
> +#endif
>
> The #ifdef here is not required, and if you look at other contrib
> modules you'll see that they don't have it.
>

Okay, done.

> I don't like all the macros at the top of the file much. CHECKARRVALID
> is only used in one place, so it seems to me that you might as well
> just inline it and lose the macro. Likewise for SORT and ARRISEMPTY.
>

Done.

> Once you do that, heap_force_common() can just compute the number of
> array elements once, instead of doing it once inside ARRISEMPTY, then
> again inside SORT, and then a third time to initialize ntids. You can
> just have a local variable in that function that is set once and then
> used as needed. Then you are only doing ARRNELEMS once, so you can get
> rid of that macro too. The same technique can be used to get rid of
> ARRPTR. So then all the macros go away without introducing any code
> duplication.
>

Done.

> +/* Options to forcefully change the state of a heap tuple. */
> +typedef enum HTupleForceOption
> +{
> + FORCE_KILL,
> + FORCE_FREEZE
> +} HTupleForceOption;
>
> I suggest un-abbreviating HTuple to HeapTuple and un-abbreviating the
> enum members to HEAP_FORCE_KILL and HEAP_FORCE_FREE.

Done.

> Also, how about
> option -> operation?
>

I think both look okay to me.

> + return heap_force_common(fcinfo, FORCE_KILL);
>
> I think it might be more idiomatic to use PG_RETURN_DATUM here.  I
> know it's the same thing, though, and perhaps I'm even wrong about the
> prevailing style.
>

Done.

> + Assert(force_opt == FORCE_KILL || force_opt == FORCE_FREEZE);
>
> I think this is unnecessary. It's an enum with 2 values.
>

Removed.

> + if (ARRISEMPTY(ta))
> + ereport(ERROR,
> + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> + errmsg("empty tid array")));
>
> I don't see why this should be an error. Why can't we just continue
> normally and if it does nothing, it does nothing? You'd need to change
> the do..while loop to a while loop so that the end condition is tested
> at the top, but that seems fine.
>

I think it's okay to have this check, so I've just left it as-is. We do
have such checks in other contrib modules as well wherever the array
is being passed as an input to the function.

> + rel = relation_open(relid, AccessShareLock);
>
> Maybe we should take RowExclusiveLock, since we are going to modify
> rows. Not sure how much it matters, though.
>

I don't know how much of a difference it would make, but still, as you said,
I've replaced AccessShare with RowExclusive.

> + if (!superuser() && GetUserId() != rel->rd_rel->relowner)
> + ereport(ERROR,
> + (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
> + errmsg("mu

Re: [HACKERS] [PATCH] Generic type subscripting

2020-08-05 Thread Dmitry Dolgov
> On Sun, Aug 02, 2020 at 12:50:12PM +0200, Pavel Stehule wrote:
> >
> > > Maybe this could be salvaged by flushing 0005 in its current form and
> > > having the jsonb subscript executor do something like "if the current
> > > value-to-be-subscripted is a JSON array, then try to convert the textual
> > > subscript value to an integer".  Not sure about what the error handling
> > > rules ought to be like, though.
> >
> > I'm fine with the idea of separating 0005 patch and potentially prusuing
> > it as an independent item. Just need to rebase 0006, since Pavel
> > mentioned that it's a reasonable change he would like to see in the
> > final result.
> >
>
> +1

Here is what I had in mind. Worth noting that, like the original
patch, the attached implementation keeps the same behaviour for negative
indices. Also, I've removed a strange inconsistency one could notice
with the original implementation, where one extra gap was introduced when
we appended something at the beginning of an array.
From ed8036ffd1fd65f5779f408fd0a4080357b29df2 Mon Sep 17 00:00:00 2001
From: erthalion <9erthali...@gmail.com>
Date: Thu, 31 Jan 2019 22:37:19 +0100
Subject: [PATCH v33 1/5] Base implementation of subscripting mechanism

Introduce all the required machinery for generalizing the subscripting
operation to different data types:

* a subscripting handler procedure, to set up a relation between a data type
and the corresponding subscripting logic.

* subscripting routines that help generalize the subscripting logic,
since it involves a few stages, namely preparation (e.g. to figure out
required types), validation (to check the input and return a meaningful
error message), fetch (executed when we extract a value using
subscripting), and assign (executed when we update a data type with a new
value using subscripting). Without this it would be necessary to
introduce more new fields to pg_type, which would be too invasive.

All ArrayRef-related logic was removed and landed as a separate
subscripting implementation in the following patch of this series. The
rest of the code was rearranged, e.g. to store the type of the assigned value
for an assign operation.

Reviewed-by: Tom Lane, Arthur Zakirov
---
 .../pg_stat_statements/pg_stat_statements.c   |   1 +
 src/backend/catalog/heap.c|   6 +-
 src/backend/catalog/pg_type.c |  15 +-
 src/backend/commands/typecmds.c   |  77 +-
 src/backend/executor/execExpr.c   |  25 +---
 src/backend/executor/execExprInterp.c | 124 +++
 src/backend/nodes/copyfuncs.c |   2 +
 src/backend/nodes/equalfuncs.c|   2 +
 src/backend/nodes/outfuncs.c  |   2 +
 src/backend/nodes/readfuncs.c |   2 +
 src/backend/parser/parse_expr.c   |  54 ---
 src/backend/parser/parse_node.c   | 141 --
 src/backend/parser/parse_target.c |  88 +--
 src/backend/utils/adt/ruleutils.c |  21 +--
 src/backend/utils/cache/lsyscache.c   |  23 +++
 src/include/c.h   |   2 +
 src/include/catalog/pg_type.h |   9 +-
 src/include/executor/execExpr.h   |  13 +-
 src/include/nodes/primnodes.h |   6 +
 src/include/nodes/subscripting.h  |  42 ++
 src/include/parser/parse_node.h   |   6 +-
 src/include/utils/lsyscache.h |   1 +
 22 files changed, 336 insertions(+), 326 deletions(-)
 create mode 100644 src/include/nodes/subscripting.h

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 14cad19afb..bf19507d32 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -2793,6 +2793,7 @@ JumbleExpr(pgssJumbleState *jstate, Node *node)
 JumbleExpr(jstate, (Node *) sbsref->reflowerindexpr);
 JumbleExpr(jstate, (Node *) sbsref->refexpr);
 JumbleExpr(jstate, (Node *) sbsref->refassgnexpr);
+APP_JUMB(sbsref->refnestedfunc);
 			}
 			break;
 		case T_FuncExpr:
diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index 3985326df6..911e2a1ffe 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -1056,7 +1056,8 @@ AddNewRelationType(const char *typeName,
    -1,			/* typmod */
    0,			/* array dimensions for typBaseType */
    false,		/* Type NOT NULL */
-   InvalidOid); /* rowtypes never have a collation */
+   InvalidOid,  /* rowtypes never have a collation */
+   InvalidOid);	/* typsubshandler - none */
 }
 
 /* 
@@ -1335,7 +1336,8 @@ heap_create_with_catalog(const char *relname,
    -1,			/* typmod */
    0,			/* array dimensions for typBaseType */
    false,		/* Type NOT NULL */
-   InvalidOid); /* rowtypes never have a collation */
+   InvalidOid,  /* rowtypes never have a c

Re: new heapcheck contrib module

2020-08-05 Thread Robert Haas
On Tue, Aug 4, 2020 at 9:06 PM Peter Geoghegan  wrote:
> of messed-up with indexes in my experience. The first error really
> does tell you most of what you need to know about any given corrupt
> index. Kind of like how you can bucket the number of cockroaches in
> your home into perhaps three meaningful buckets: 0 cockroaches, at
> least 1 cockroach, and lots of cockroaches. (Even there, if you really
> care about the distinction between the second and third bucket,
> something has gone terribly wrong -- so even three buckets seems like
> a lot to me.)

Not sure I agree with this. As a homeowner, the distinction between 0
and 1 is less significant to me than the distinction between a few
(preferably in places where I'll never see them) and whole lot. I
agree with you to an extent though: all I really care about is whether
I have too few to worry about, enough that I'd better try to take care
of it somehow, or so many that I need a professional exterminator. If,
however, I were a professional exterminator, I would be unhappy with
just knowing that there are few problems or many. I suspect I would
want to know something about where the problems were, and get a more
nuanced indication of just how bad things are in each location.

FWIW, pg_catcheck is an example of an existing tool (designed by me
and written partially by me) that uses the kind of model I'm talking
about. It does a single SELECT * FROM pg_ on each catalog
table - so that it doesn't get confused if your system catalog indexes
are messed up - and then performs a bunch of cross-checks on the
tuples it gets back and tells you about all the messed up stuff. If it
can't get data from all your catalog tables it performs whichever
checks are valid given what data it was able to get. As a professional
exterminator of catalog corruption, I find it quite helpful. If
someone sends me the output from a database cluster, I can tell right
away whether they are just fine, in a little bit of trouble, or in a
whole lot of trouble; I can speculate pretty well about what kind of
thing might've happened to cause the problem; and I can recommend
steps to straighten things out.

> FWIW, current DEBUG1 + DEBUG2 output for amcheck shows you quite a lot
> of details about the tree structure. It's a handy way of getting a
> sense of what's going on at a high level. For example, if index
> corruption is found very early on, that strongly suggests that it's
> pretty pervasive.

Interesting.

> > A fourth type of checking is to verify the index key against the keys
> > in the heap tuples to which they point, but only for index tuples that
> > passed the basic index-heap sanity checking and where the tuples have
> > not been pruned. This can be sensibly done even if the structural
> > checks or index-ordering checks have failed.
>
> That's going to require the equivalent of a merge join, which is
> terribly expensive relative to such a small benefit.

I think it depends on how big your data is. If you've got a 2TB table
and 512GB of RAM, it's pretty impractical no matter the algorithm. But
for small tables even a naive nested loop will suffice.

> Meanwhile, a simple smoke test covering many indexes probably gives
> you a fairly meaningful idea of the extent of the damage, without
> requiring that we do any hard engineering work.

In my experience, when EDB customers complain about corruption-related
problems, the two most common patterns are: (1) my whole system is
messed up and (2) I have one or a few specific objects which are
messed up and everything else is fine. The first category is often
something like inability to start the database, or scary messages in
the log file complaining about, say, checkpoints failing. The second
category is the one I'm worried about here. The people who are in this
category generally already know which things are broken; they've
figured that out through trial and error. Sometimes they miss some
problems, but more frequently, in my experience, their understanding
of what problems they have is accurate. Now that category of users can
be further decomposed into two groups: the people who don't care what
happened and just want to barrel through it, and the people who do
care what happened and want to know what happened, why it happened,
whether it's a bug, etc. The first group are unproblematic: tell them
to REINDEX (or restore from backup, or whatever) and you're done.

The second group is a lot harder. It is in general difficult to
speculate about how something that is now wrong got that way given
knowledge only of the present state of affairs. But good tooling makes
it easier to speculate intelligently. To take a classic example,
there's a great difference between a checksum failure caused by the
checksum being incorrect on an otherwise-valid page; a checksum
failure on a page the first half of which appears valid and the second
half of which looks like it might be some other database page; and a
checksum failure on a page whose contents 

Re: Yet another issue with step generation in partition pruning

2020-08-05 Thread Etsuro Fujita
Amit-san,

On Wed, Aug 5, 2020 at 5:13 PM Amit Langote  wrote:
> Thanks a lot for your time on fixing these multi-column range
> partition pruning issues.  I'm sorry that I failed to notice the
> previous two reports on -bugs for which you committed a fix last week.

No problem.

> On Tue, Aug 4, 2020 at 9:46 PM Etsuro Fujita  wrote:
> > Attached is a patch for fixing this issue.
>
> I have looked at the patch and played around with it using the
> regression tests you've added recently. I was not able to find any
> results that looked surprising.

That's good to hear!  Thanks for reviewing!  Will push the patch tomorrow.

Best regards,
Etsuro Fujita




Re: Allow some recovery parameters to be changed with reload

2020-08-05 Thread Robert Haas
On Sat, Mar 28, 2020 at 7:21 AM Sergei Kornilov  wrote:
> So...
> We call restore_command only when walreceiver is stopped.
> We use restore_command only in startup process - so we have no race condition 
> between processes.
> We have some issues here? Or we can just make restore_command reloadable as 
> attached?

I don't see the problem here, either. Does anyone else see a problem,
or some reason not to press forward with this?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Replace remaining StrNCpy() by strlcpy()

2020-08-05 Thread Tom Lane
Peter Eisentraut  writes:
> Okay, here is a new patch with improved implementations of namecpy() and 
> namestrcpy().  I didn't see any other places that relied on the 
> zero-filling behavior of strncpy().

I've looked through this patch, and I concur with your conclusion that
noplace else is depending on zero-fill, with the exception of the one
place in pgstat.c that David already noted.  But the issue there is only
that valgrind might bitch about send()'ing undefined bytes, and ISTM
that the existing suppressions in our valgrind.supp should already
handle it, since we already have other pgstat messages with pad bytes.

However I do see one remaining nit to pick, in CreateInitDecodingContext:
 
/* register output plugin name with slot */
SpinLockAcquire(&slot->mutex);
-   StrNCpy(NameStr(slot->data.plugin), plugin, NAMEDATALEN);
+   namestrcpy(&slot->data.plugin, plugin);
SpinLockRelease(&slot->mutex);

This is already a pro-forma violation of our rule about "only
straight-line code inside a spinlock".  Now I'm not terribly concerned
about that right now, and the patch as it stands is only changing things
cosmetically.  But if you modify namestrcpy to do pg_mbcliplen then all
of a sudden there is a whole lot of code that could get reached within
the spinlock, and I'm not a bit happy about that prospect.

The least-risk fix would be to namestrcpy() into a local variable
and then just use a plain memcpy() inside the spinlock.  There might
be better ways if we're willing to make assumptions about what the
plugin name can be.  For that matter, do we really need a spinlock
here at all?  Why is the plugin name critical but the rest of the
slot not?
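
Concretely, the least-risk version could look something like this (just a sketch of the idea, not a finished change):

	NameData	plugin_name;

	/* do the (possibly multibyte-aware) copy outside the spinlock */
	namestrcpy(&plugin_name, plugin);

	/* only straight-line code while holding the slot's spinlock */
	SpinLockAcquire(&slot->mutex);
	memcpy(&slot->data.plugin, &plugin_name, sizeof(NameData));
	SpinLockRelease(&slot->mutex);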

BTW, while we're here I think we ought to change namecpy and namestrcpy
to return void (no caller checks their results) and drop their checks
for null-pointer inputs.  AFAICS a null pointer would be a caller bug in
every case, and if it isn't, why is failing to initialize the
destination an OK outcome?  I find the provisions for nulls in namestrcmp
pretty useless too, although it looks like at least some thought has
been spent there.

I think this is committable once these points are addressed.

regards, tom lane




Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)

2020-08-05 Thread Tom Lane
Amit Langote  writes:
> The crash reported here is in the other case where the concurrently
> added partitions cause the execution-time PartitionDesc to have more
> partitions than the one that PartitionedRelPruneInfo is based on.
> I was able to reproduce such a crash as follows:

Yeah, I can repeat the case per these directions.  I concur that the
issue is that ExecCreatePartitionPruneState is failing to cope with
zeroes in the relid_map.

> The attached patch should fix that.

I don't like this patch at all though; I do not think it is being nearly
careful enough to ensure that it's matched the surviving relation OIDs
correctly.  In particular it blithely assumes that a zero in relid_map
*must* match the immediately next entry in partdesc->oids, which is easy
to break if the new partition is adjacent to the one the planner managed
to prune.  So I think we should do it more like the attached.

I'm strongly tempted to convert the trailing Assert to an actual
test-and-elog, too, but didn't do so here.

regards, tom lane

diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index fb6ce49056..221a34e738 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -1673,7 +1673,19 @@ ExecCreatePartitionPruneState(PlanState *planstate,
 pprune->subpart_map = palloc(sizeof(int) * partdesc->nparts);
 for (pp_idx = 0; pp_idx < partdesc->nparts; ++pp_idx)
 {
-	if (pinfo->relid_map[pd_idx] != partdesc->oids[pp_idx])
+	/*
+	 * pinfo->relid_map can contain InvalidOid entries for
+	 * partitions pruned by the planner.  We cannot tell
+	 * exactly which of the partdesc entries these correspond
+	 * to, but we don't have to; just skip over them.  The
+	 * non-pruned relid_map entries, however, had better be a
+	 * subset of the partdesc entries and in the same order.
+	 */
+	while (pd_idx < pinfo->nparts &&
+		   !OidIsValid(pinfo->relid_map[pd_idx]))
+		pd_idx++;
+	if (pd_idx >= pinfo->nparts ||
+		pinfo->relid_map[pd_idx] != partdesc->oids[pp_idx])
 	{
 		pprune->subplan_map[pp_idx] = -1;
 		pprune->subpart_map[pp_idx] = -1;
@@ -1686,6 +1698,14 @@ ExecCreatePartitionPruneState(PlanState *planstate,
 			pinfo->subpart_map[pd_idx++];
 	}
 }
+
+/*
+ * It might seem that we need to skip any trailing InvalidOid
+ * entries in pinfo->relid_map before asserting that we
+ * scanned all of the relid_map.  But we will have skipped
+ * them above, because they must correspond to some
+ * partdesc->oids entries; we just couldn't tell which.
+ */
 Assert(pd_idx == pinfo->nparts);
 			}
 


Re: Hybrid Hash/Nested Loop joins and caching results from subplans

2020-08-05 Thread Andres Freund
Hi,

On 2020-08-04 10:05:25 +1200, David Rowley wrote:
> I'd like to push the 0002 patch quite soon as I think it's an
> improvement to simplehash.h regardless of if we get Result Cache.  It
> reuses the SH_LOOKUP function for deletes. Also, if we ever get around
> to giving up performing a lookup if we get too far away from the
> optimal bucket, then that would only need to appear in one location
> rather than in two.

> Andres, or anyone, any objections to me pushing 0002?

I think it'd be good to add a warning that, unless one is very careful,
no other hashtable modifications are allowed between lookup and
modification. E.g. something like
a = foobar_lookup();foobar_insert();foobar_delete();
will occasionally go wrong...
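
Spelling that out a bit (the foobar_ names are just placeholders for the generated prefix):

	/*
	 * The insert between the lookup and the later delete may grow or rehash
	 * the table, so both the remembered entry pointer and the bucket it was
	 * found in can be stale by the time the delete runs.
	 */
	FoobarEntry *a = foobar_lookup(tb, key);	/* remembers a bucket */

	foobar_insert(tb, otherkey, &found);		/* may resize/move entries */

	foobar_delete_item(tb, a);			/* 'a' may now be dangling */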


> - /* TODO: return false; if distance too big */
> +/*
> + * Perform hash table lookup on 'key', delete the entry belonging to it and
> + * return true.  Returns false if no item could be found relating to 'key'.
> + */
> +SH_SCOPE bool
> +SH_DELETE(SH_TYPE * tb, SH_KEY_TYPE key)
> +{
> + SH_ELEMENT_TYPE *entry = SH_LOOKUP(tb, key);
>  
> - curelem = SH_NEXT(tb, curelem, startelem);
> + if (likely(entry != NULL))
> + {
> + /*
> +  * Perform deletion and also the relocation of subsequent items 
> which
> +  * are not in their optimal position but can now be moved up.
> +  */
> + SH_DELETE_ITEM(tb, entry);
> + return true;
>   }
> +
> + return false;   /* Can't find 'key' */
>  }

You mentioned on IM that there's a slowdown with gcc. I wonder if this
could partially be responsible. Does SH_DELETE inline LOOKUP and
DELETE_ITEM? And does the generated code end up reloading entry-> or
tb-> members?

When the SH_SCOPE isn't static *, then IIRC gcc on unixes can't rely on
the called function actually being the function defined in the same
translation unit (unless -fno-semantic-interposition is specified).


Hm, but you said that this happens in tidbitmap.c, and there all
referenced functions are local statics. So that's not quite the
explanation I was thinking it was...


Hm. Also wonder whether we currently (i.e. in the existing code)
unnecessarily end up reloading tb->data a bunch of times, because we do
the access to ->data as
SH_ELEMENT_TYPE *entry = &tb->data[curelem];

Think we should instead store tb->data in a local variable.

Greetings,

Andres Freund




Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)

2020-08-05 Thread Robert Haas
On Wed, Aug 5, 2020 at 1:30 PM Tom Lane  wrote:
> I don't like this patch at all though; I do not think it is being nearly
> careful enough to ensure that it's matched the surviving relation OIDs
> correctly.  In particular it blithely assumes that a zero in relid_map
> *must* match the immediately next entry in partdesc->oids, which is easy
> to break if the new partition is adjacent to the one the planner managed
> to prune.  So I think we should do it more like the attached.

Ooh, nice catch.

> I'm strongly tempted to convert the trailing Assert to an actual
> test-and-elog, too, but didn't do so here.

I was thinking about that, too. +1 for taking that step.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




pg_rewind is not crash safe

2020-08-05 Thread Heikki Linnakangas
A colleague of mine brought to my attention that pg_rewind is not crash 
safe. If it is interrupted for any reason, it leaves behind a data 
directory with a mix of data from the source and target images. If 
you're "lucky", the server will start up, but it can be in an 
inconsistent state. That's obviously not good. It would be nice to:


1. Detect the situation, and refuse to start up.

Or even better:

2. Make pg_rewind crash safe, so that you could safely restart it if 
it's interrupted.


Has anyone else run into this? How did you work around it?

It doesn't seem hard to detect this. pg_rewind can somehow "poison" the 
data directory just before it starts making irreversible changes. I'm 
thinking of updating the 'state' in the control file to a new 
PG_IN_REWIND value.
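
For instance (a rough sketch only; DB_IN_REWIND is an assumed new DBState value, and the variable names merely follow pg_rewind's existing style):

	/* before the first irreversible change to the target data directory */
	ControlFile_new.state = DB_IN_REWIND;
	update_controlfile(datadir_target, &ControlFile_new, do_sync);

	/* ... copy blocks, truncate/remove files, write the backup label ... */

	/* only once the rewind has fully completed, restore a startable state */
	ControlFile_new.state = DB_IN_ARCHIVE_RECOVERY;
	update_controlfile(datadir_target, &ControlFile_new, do_sync);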


It also doesn't seem too hard to make it restartable. As long as you 
point it to the same source server, it is already almost safe to run 
pg_rewind again. If we re-order the way it writes the control or backup 
files and makes other changes, pg_rewind can verify that you pointed it 
at the same or compatible primary as before.


I think there's one corner case with truncated files, if pg_rewind has 
extended a file by copying missing "tail" from the source system, but 
the system crashes before it's fsynced to disk. But I think we can fix 
that too, by paying attention to SMGR_TRUNCATE records when scanning the 
source WAL.


- Heikki




Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)

2020-08-05 Thread Tom Lane
Robert Haas  writes:
> On Wed, Aug 5, 2020 at 1:30 PM Tom Lane  wrote:
>> I'm strongly tempted to convert the trailing Assert to an actual
>> test-and-elog, too, but didn't do so here.

> I was thinking about that, too. +1 for taking that step.

Will do.

In the longer term, it's annoying that we have no test methodology
for this other than "manually set a breakpoint here".  If we're going
to allow plan-relevant DDL changes to happen with less than full table
lock, I think we need to improve that.  I spent a little bit of time
just now trying to build an isolationtester case for this, and failed
completely.  So I wonder if we can create some sort of test module that
allows capture of a plan tree and then execution of that plan tree later
(even after relcache inval would normally have forced replanning).
Obviously that could not be a normal SQL-accessible feature, because
some types of invals would make the plan completely wrong, but for
testing purposes it'd be mighty helpful to check that a stale plan
still works.

regards, tom lane




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-05 Thread Robert Haas
On Mon, Aug 3, 2020 at 12:13 PM Ashutosh Sharma  wrote:
> If the above path is taken that means none of the items in the page
> got changed.

Oops. I didn't realize that, sorry. Maybe it would be a little more
clear if instead of "int nSkippedItems" you had "bool
did_modify_page"? Then you could initialize it to false and set it to
true just before doing the page modifications.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-05 Thread Robert Haas
On Wed, Aug 5, 2020 at 9:42 AM Ashutosh Sharma  wrote:
> Yeah, it's being tested on the main table, not on a toast table. I've
> removed this test-case and also restricted direct access to the toast
> table using heap_force_kill/freeze functions. I think we shouldn't be
> using these functions to do any changes in the toast table. We will
> only use these functions with the main table and let VACUUM remove the
> corresponding data chunks (pointed by the tuple that got removed from
> the main table).

I agree with removing the test case, but I disagree with restricting
this from being used on the TOAST table. These are tools for experts,
who may use them as they see fit. It's unlikely that it would be
useful to use this on a TOAST table, I think, but not impossible.

> Another option would be to identify all the data chunks corresponding
> to the tuple (ctid) being killed from the main table and remove them
> one by one. We will only do this if the tuple from the main table that
> has been marked as killed has an external storage. We will have to add
> a bunch of code for this otherwise we can let VACUUM do this for us.
> Let me know your thoughts on this.

I don't think VACUUM will do anything for us automatically -- it isn't
going to know that we force-killed the tuple in the main table.
Normally, a tuple delete would have to set xmax on the TOAST tuples
and then VACUUM would do its thing, but in this case that won't
happen. So if you force-kill a tuple in the main table you would end
up with a space leak in the TOAST table.

The problem here is that one reason you might force-kill a tuple in
the main table is because it's full of garbage. If so, trying to
decode the tuple so that you can find the TOAST pointers might crash
or error out, or maybe that part will work but then you'll error out
trying to look up the corresponding TOAST tuples, either because the
values are not valid or because the TOAST table itself is generally
hosed in some way. So I think it is probably best if we keep this tool
as simple as possible, with as few dependencies as we can, and
document the possible negative outcomes of using it. It's not
impossible to recover from a space-leak like this; you can always move
the data into a new table with CTAS and then drop the old one. Not
sure whether CLUSTER or VACUUM FULL would also be sufficient.

Separately, we might want to add a TOAST-checker to amcheck, or
enhance the heap-checker Mark is working on, and one of the things it
could do is check for TOAST entries to which nothing points. Then if
you force-kill tuples in the main table you could also use that tool
to look for things in the TOAST table that ought to also be
force-killed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)

2020-08-05 Thread Robert Haas
On Wed, Aug 5, 2020 at 2:22 PM Tom Lane  wrote:
> In the longer term, it's annoying that we have no test methodology
> for this other than "manually set a breakpoint here".  If we're going
> to allow plan-relevant DDL changes to happen with less than full table
> lock, I think we need to improve that.  I spent a little bit of time
> just now trying to build an isolationtester case for this, and failed
> completely.  So I wonder if we can create some sort of test module that
> allows capture of a plan tree and then execution of that plan tree later
> (even after relcache inval would normally have forced replanning).
> Obviously that could not be a normal SQL-accessible feature, because
> some types of invals would make the plan completely wrong, but for
> testing purposes it'd be mighty helpful to check that a stale plan
> still works.

That's an interesting idea. I don't know exactly how it would work,
but I agree that it would allow useful testing that we can't do today.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Hybrid Hash/Nested Loop joins and caching results from subplans

2020-08-05 Thread Robert Haas
On Wed, May 20, 2020 at 7:44 AM David Rowley  wrote:
> I've attached a patch which implements this.  The new node type is
> called "Result Cache".  I'm not particularly wedded to keeping that
> name, but if I change it, I only want to do it once. I've got a few
> other names in mind, but I don't feel strongly or confident enough in
> them to go and do the renaming.

This is cool work; I am going to bikeshed on the name for a minute. I
don't think Result Cache is terrible, but I have two observations:

1. It might invite confusion with a feature of some other database
systems where they cache the results of entire queries and try to
reuse the entire result set.

2. The functionality reminds me a bit of a Materialize node, except
that instead of overflowing to disk, we throw away cache entries, and
instead of caching just one result, we potentially cache many.

I can't really think of a way to work Materialize into the node name
and I'm not sure it would be the right idea anyway. But I wonder if
maybe a name like "Parameterized Cache" would be better? That would
avoid confusion with any other use of the phrase "result cache"; also,
an experienced PostgreSQL user might be more likely to guess how a
"Parameterized Cache" is different from a "Materialize" node than they
would be if it were called a "Result Cache".

Just my $0.02,

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)

2020-08-05 Thread Tom Lane
Robert Haas  writes:
> On Wed, Aug 5, 2020 at 2:22 PM Tom Lane  wrote:
>> In the longer term, it's annoying that we have no test methodology
>> for this other than "manually set a breakpoint here".  If we're going
>> to allow plan-relevant DDL changes to happen with less than full table
>> lock, I think we need to improve that.  I spent a little bit of time
>> just now trying to build an isolationtester case for this, and failed
>> completely.  So I wonder if we can create some sort of test module that
>> allows capture of a plan tree and then execution of that plan tree later
>> (even after relcache inval would normally have forced replanning).
>> Obviously that could not be a normal SQL-accessible feature, because
>> some types of invals would make the plan completely wrong, but for
>> testing purposes it'd be mighty helpful to check that a stale plan
>> still works.

> That's an interesting idea. I don't know exactly how it would work,
> but I agree that it would allow useful testing that we can't do today.

After thinking about it for a little bit, I'm envisioning a test module
that can be loaded into a session, and then it gets into the planner_hook,
and what it does is after each planner execution, take and release an
advisory lock with some selectable ID.  Then we can construct
isolationtester specs that do something like

session 1   session 2

LOAD test-module;
SET custom_guc_for_lock_id = n;
prepare test tables;

SELECT pg_advisory_lock(n);

SELECT victim-query-here;
... after planning, query blocks on lock

perform DDL changes;
SELECT pg_advisory_unlock(n);

... query executes with now-stale plan
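
For concreteness, the hook part of such a module might look roughly like
this (a sketch only; the module and GUC names are invented here for
illustration, and error handling is elided):

#include "postgres.h"

#include <limits.h>

#include "fmgr.h"
#include "optimizer/planner.h"
#include "utils/fmgrprotos.h"
#include "utils/guc.h"

PG_MODULE_MAGIC;

void _PG_init(void);

/* GUC: if nonzero, wait on this advisory lock ID after each planning cycle */
static int post_planning_lock_id = 0;

static planner_hook_type prev_planner_hook = NULL;

static PlannedStmt *
delay_execution_planner(Query *parse, const char *query_string,
                        int cursorOptions, ParamListInfo boundParams)
{
    PlannedStmt *result;

    if (prev_planner_hook)
        result = prev_planner_hook(parse, query_string,
                                   cursorOptions, boundParams);
    else
        result = standard_planner(parse, query_string,
                                  cursorOptions, boundParams);

    if (post_planning_lock_id != 0)
    {
        /* Block here until the other session releases the lock ... */
        DirectFunctionCall1(pg_advisory_lock_int8,
                            Int64GetDatum((int64) post_planning_lock_id));
        /* ... then release it immediately; we only wanted the wait. */
        DirectFunctionCall1(pg_advisory_unlock_int8,
                            Int64GetDatum((int64) post_planning_lock_id));
    }

    return result;
}

void
_PG_init(void)
{
    DefineCustomIntVariable("delay_execution.post_planning_lock_id",
                            "Advisory lock ID to wait on after planning.",
                            NULL,
                            &post_planning_lock_id,
                            0, 0, INT_MAX,
                            PGC_USERSET, 0,
                            NULL, NULL, NULL);

    prev_planner_hook = planner_hook;
    planner_hook = delay_execution_planner;
}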

regards, tom lane




Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)

2020-08-05 Thread Robert Haas
On Wed, Aug 5, 2020 at 4:19 PM Tom Lane  wrote:
> After thinking about it for a little bit, I'm envisioning a test module
> that can be loaded into a session, and then it gets into the planner_hook,
> and what it does is after each planner execution, take and release an
> advisory lock with some selectable ID.  Then we can construct
> isolationtester specs that do something like
>
> session 1   session 2
>
> LOAD test-module;
> SET custom_guc_for_lock_id = n;
> prepare test tables;
>
> SELECT pg_advisory_lock(n);
>
> SELECT victim-query-here;
> ... after planning, query blocks on lock
>
> perform DDL changes;
> SELECT pg_advisory_unlock(n);
>
> ... query executes with now-stale plan

Very sneaky!

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: new heapcheck contrib module

2020-08-05 Thread Peter Geoghegan
On Wed, Aug 5, 2020 at 7:09 AM Robert Haas  wrote:
> Not sure I agree with this. As a homeowner, the distinction between 0
> and 1 is less significant to me than the distinction between a few
> (preferably in places where I'll never see them) and a whole lot. I
> agree with you to an extent though: all I really care about is whether
> I have too few to worry about, enough that I'd better try to take care
> of it somehow, or so many that I need a professional exterminator. If,
> however, I were a professional exterminator, I would be unhappy with
> just knowing that there are few problems or many. I suspect I would
> want to know something about where the problems were, and get a more
> nuanced indication of just how bad things are in each location.

Right, but the professional exterminator can be expected to use expert
level tools, where a great deal of technical sophistication is
required to interpret what's going on sensibly. An amateur can only
use them to determine if something is wrong at all, which is usually
not how they add value.

(I think that my analogy is slightly flawed in that it hinged upon
everybody hating cockroaches as much as I do, which is more than the
ordinary amount.)

> FWIW, pg_catcheck is an example of an existing tool (designed by me
> and written partially by me) that uses the kind of model I'm talking
> about. It does a single SELECT * FROM pg_ on each catalog
> table - so that it doesn't get confused if your system catalog indexes
> are messed up - and then performs a bunch of cross-checks on the
> tuples it gets back and tells you about all the messed up stuff. If it
> can't get data from all your catalog tables it performs whichever
> checks are valid given what data it was able to get. As a professional
> exterminator of catalog corruption, I find it quite helpful.

I myself seem to have had quite different experiences with corruption,
presumably because it happened at product companies like Heroku. I
tended to find software bugs (e.g. the one fixed by commit 008c4135)
that were rare and novel by casting a wide net over a large number of
relatively homogenous databases. Whereas your experiences tend to
involve large support customers with more opportunity for operator
error. Both perspectives are important.

> The second group is a lot harder. It is in general difficult to
> speculate about how something that is now wrong got that way given
> knowledge only of the present state of affairs. But good tooling makes
> it easier to speculate intelligently. To take a classic example,
> there's a great difference between a checksum failure caused by the
> checksum being incorrect on an otherwise-valid page; a checksum
> failure on a page the first half of which appears valid and the second
> half of which looks like it might be some other database page; and a
> checksum failure on a page whose contents appear to be taken from a
> Microsoft Word document. I'm not saying we ever want a tool which
> tries to figure that sort of thing out in an automated way; there's no
> substitute for human intelligence (yet, anyway).

I wrote my own expert level tool, pg_hexedit. I have to admit that the
level of interest in that tool doesn't seem to be all that great,
though I myself have used it to investigate corruption to great
effect. But I suppose there is no way to know how it's being used.

-- 
Peter Geoghegan




Re: Support for NSS as a libpq TLS backend

2020-08-05 Thread Andrew Dunstan


On 8/4/20 5:42 PM, Daniel Gustafsson wrote:
>> On 3 Aug 2020, at 21:18, Andrew Dunstan  
>> wrote:
>> On 8/3/20 12:46 PM, Andrew Dunstan wrote:
>>> On 7/31/20 4:44 PM, Andrew Dunstan wrote:
>>>> OK, here is an update of your patch that compiles and runs against NSS
>>>> under Windows (VS2019).
> Out of curiosity since I'm not familiar with Windows, how hard/easy is it to
> install NSS for the purpose of a) hacking on postgres+NSS and b) using 
> postgres
> with NSS as the backend?




I've laid out the process at
https://www.2ndquadrant.com/en/blog/nss-on-windows-for-postgresql-development/


>>> OK, this version contains pre-generated nss files, and passes a full
>>> buildfarm run including the ssl test module, with both openssl and NSS.
>>> That should keep the cfbot happy :-)
> Exciting, thanks a lot for helping out on this!  I've started to look at the
> required documentation changes during vacation, will hopefully be able to post
> something soon.
>


Good. Having got the tests running cleanly on Linux, I'm now going back
to work on that for Windows.


After that I'll look at the hook/callback stuff.


cheers


andrew



-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: Hybrid Hash/Nested Loop joins and caching results from subplans

2020-08-05 Thread Peter Geoghegan
On Wed, May 20, 2020 at 4:44 AM David Rowley  wrote:
> Does it seem like something we might want for PG14?

Minor terminology issue: "Hybrid Hash Join" is a specific hash join
algorithm which is unrelated to what you propose to do here. I hope
that confusion can be avoided, possibly by not using the word hybrid
in the name.

-- 
Peter Geoghegan




Re: Hybrid Hash/Nested Loop joins and caching results from subplans

2020-08-05 Thread David Rowley
On Thu, 6 Aug 2020 at 08:13, Robert Haas  wrote:
>
> This is cool work; I am going to bikeshed on the name for a minute. I
> don't think Result Cache is terrible, but I have two observations:

Thanks

> 1. It might invite confusion with a feature of some other database
> systems where they cache the results of entire queries and try to
> reuse the entire result set.

Yeah. I think "Cache" is good to keep, but I'm pretty much in favour
of swapping "Result" for something else. It's a bit too close to the
"Result" node in name, but too distant for everything else.

> 2. The functionality reminds me a bit of a Materialize node, except
> that instead of overflowing to disk, we throw away cache entries, and
> instead of caching just one result, we potentially cache many.
>
> I can't really think of a way to work Materialize into the node name
> and I'm not sure it would be the right idea anyway. But I wonder if
> maybe a name like "Parameterized Cache" would be better?

Yeah, I think that name is better. The only downside as far as I can
see is the length of it.

I'll hold off a bit before doing any renaming though to see what other
people think. I just feel bikeshedding on the name is something that's
going to take up quite a bit of time and effort with this. I plan to
rename it at most once.

Thanks for the comments

David




Re: Replace remaining StrNCpy() by strlcpy()

2020-08-05 Thread David Rowley
On Tue, 4 Aug 2020 at 00:12, Tom Lane  wrote:
>
> I wrote:
> > David Rowley  writes:
> >> Will mean that we'll now no longer zero the full length of the m_xlog
> >> field after the end of the string. Won't that mean we'll start writing
> >> junk bytes to the stats collector?
>
> > StrNCpy doesn't zero-fill the destination today either (except for
> > the very last byte).
>
> Oh, no, I take that back --- didn't read all of the strncpy man
> page :-(.  Yeah, this is a point.  We'd need to check each call
> site to see whether the zero-padding matters.

I just had a thought that even strlcpy() is not really ideal for many
of the purposes we use it for.

Currently we still have cruddy code like:

strlcpy(fullname, pg_TZDIR(), sizeof(fullname));
if (strlen(fullname) + 1 + strlen(name) >= MAXPGPATH)
    return -1; /* not gonna fit */
strcat(fullname, "/");
strcat(fullname, name);

If strlcpy() had been designed differently to take a signed size and
do nothing when the size is <= 0 then we could have had much cleaner
implementations of the above:

size_t len;
len = strlcpy(fullname, pg_TZDIR(), sizeof(fullname));
len += strlcpy(fullname + len, "/", sizeof(fullname) - len);
len += strlcpy(fullname + len, name, sizeof(fullname) - len);
if (len >= sizeof(fullname))
    return -1; /* didn't fit */

This should be much more efficient, in general, due to the lack of
strlen() calls and the concatenation not having to refind the end of
the string again each time.

Now, I'm not saying we should change strlcpy() to take a signed type
instead of size_t to allow this to work. Reusing that name for another
purpose is likely a bad idea that will lead to misuse and confusion.
What I am saying is that perhaps strlcpy() is not all that it's
cracked up to be and it could have been done better.  Perhaps we can
invent our own version that fixes the shortcomings.
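
For what it's worth, a minimal sketch of that sort of function might look
like this (the name and exact contract are placeholders, not a proposal):

#include <string.h>
#include <sys/types.h>

/*
 * Like strlcpy(), but take a signed size and do nothing (beyond returning
 * strlen(src)) when size <= 0, so chained calls remain safe once the
 * buffer has been exhausted.
 */
static size_t
pg_strlcpy_s(char *dst, const char *src, ssize_t size)
{
    size_t srclen = strlen(src);

    if (size > 0)
    {
        size_t copylen;

        if (srclen < (size_t) size)
            copylen = srclen;               /* fits, including the NUL */
        else
            copylen = (size_t) size - 1;    /* truncate */

        memcpy(dst, src, copylen);
        dst[copylen] = '\0';
    }

    /* like strlcpy(), report the length we tried to create */
    return srclen;
}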

Just a thought.

On the other hand, perhaps we're not using the return value of
strlcpy() enough for such a change to make sense. However, a quick
glance shows we certainly could use it more often, e.g:

if (parsed->xinfo & XACT_XINFO_HAS_GID)
{
    strlcpy(parsed->twophase_gid, data, sizeof(parsed->twophase_gid));
    data += strlen(data) + 1;
}

David




Re: Amcheck: do rightlink verification with lock coupling

2020-08-05 Thread Peter Geoghegan
On Tue, Aug 4, 2020 at 9:33 AM Andrey M. Borodin  wrote:
> BTW, reviewing this patch again I cannot understand why we verify link 
> coherence only on leaf level but not for internal pages?
> I do not see any actual problems here.

Well, I thought that it might be a good idea to limit it to the leaf
level, based on the theory that we rarely couple locks on internal
page levels in general. But yeah, that's probably not a good enough
reason to avoid lock coupling on internal pages. It's probably better
to do it everywhere than to explain why we don't do it on the internal
level -- the explanation will probably be confusing. And even if there
was a performance issue, it could only happen when there are
concurrent internal page splits -- but those are supposed to be rare.

Attached is v4, which now checks internal pages (just like leaf
pages). The main other change in this revised version is that we make
the error raised by bt_index_check() match the error used in the old
bt_index_parent_check() case -- we always want to blame the current
target page when amcheck complains (technically the page we blame when
the invariant fails isn't strictly guaranteed to be quite the same
thing as the target, but it's close enough to not really matter in
reality). Other adjustments:

* Added _bt_checkpage() calls for buffers, as is standard practice in nbtree.

* Added protection against locking the same page a second time in the
event of a faulty sibling link -- we should avoid a self-deadlock in
the event of a page that is corrupt in just the wrong way.

* Updated obsolescent comments that claimed that we never couple
buffer locks in amcheck.

I would like to commit something like this in the next day or two.

Thoughts?

--
Peter Geoghegan


v4-0001-Add-amcheck-sibling-link-checks.patch
Description: Binary data


PROC_IN_ANALYZE stillborn 13 years ago

2020-08-05 Thread Alvaro Herrera
Back in the 8.3 cycle (2007) when the autovacuum launcher/worker split
was done, we annoyed people because it blocked DDL.  That led to an
effort to cancel autovac automatically when that was detected, by Simon
Riggs.
https://postgr.es/m/1191526327.4223.204.ca...@ebony.site
https://postgr.es/m/1192129897.4233.433.ca...@ebony.site

I was fixated on only cancelling when it was ANALYZE, to avoid losing
any VACUUM work.
https://postgr.es/m/20071025164150.gf23...@alvh.no-ip.org
That turned into some flags for PGPROC to detect whether a process was
ANALYZE, and cancel only those.
https://postgr.es/m/20071024151328.gg6...@alvh.no-ip.org
Commit:
https://postgr.es/m/20071024205536.cb425754...@cvs.postgresql.org
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=745c1b2c2ab

However, I was outvoted, so we do not limit cancellation to analyze.
Patch and discussion: https://postgr.es/m/20071025164150.gf23...@alvh.no-ip.org
Commit:
https://postgr.es/m/20071026204510.aa02e754...@cvs.postgresql.org
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=acac68b2bca

... which means the flag I had added two days earlier has never been
used for anything.  We've carried the flag forward to this day for
almost 13 years, dutifully turning it on and off ... but never checking
it anywhere.

I propose to remove it, as in the attached patch.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From f96b28defe856e284a44b207c2016c72a48a2ff2 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Wed, 5 Aug 2020 18:57:50 -0400
Subject: [PATCH] Remove PROC_IN_ANALYZE

---
 src/backend/commands/analyze.c  | 13 +
 src/include/storage/proc.h  |  3 +--
 src/include/storage/procarray.h |  7 ---
 3 files changed, 2 insertions(+), 21 deletions(-)

diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 924ef37c81..e0fa73ba79 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -247,11 +247,8 @@ analyze_rel(Oid relid, RangeVar *relation,
 	}
 
 	/*
-	 * OK, let's do it.  First let other backends know I'm in ANALYZE.
+	 * OK, let's do it.  First, initialize progress reporting.
 	 */
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-	MyPgXact->vacuumFlags |= PROC_IN_ANALYZE;
-	LWLockRelease(ProcArrayLock);
 	pgstat_progress_start_command(PROGRESS_COMMAND_ANALYZE,
   RelationGetRelid(onerel));
 
@@ -279,14 +276,6 @@ analyze_rel(Oid relid, RangeVar *relation,
 	relation_close(onerel, NoLock);
 
 	pgstat_progress_end_command();
-
-	/*
-	 * Reset my PGXACT flag.  Note: we need this here, and not in vacuum_rel,
-	 * because the vacuum flag is cleared by the end-of-xact code.
-	 */
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-	MyPgXact->vacuumFlags &= ~PROC_IN_ANALYZE;
-	LWLockRelease(ProcArrayLock);
 }
 
 /*
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index b20e2ad4f6..5ceb2494ba 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -52,7 +52,6 @@ struct XidCache
  */
 #define		PROC_IS_AUTOVACUUM	0x01	/* is it an autovac worker? */
 #define		PROC_IN_VACUUM		0x02	/* currently running lazy vacuum */
-#define		PROC_IN_ANALYZE		0x04	/* currently running analyze */
 #define		PROC_VACUUM_FOR_WRAPAROUND	0x08	/* set by autovac only */
 #define		PROC_IN_LOGICAL_DECODING	0x10	/* currently doing logical
  * decoding outside xact */
@@ -60,7 +59,7 @@ struct XidCache
 
 /* flags reset at EOXact */
 #define		PROC_VACUUM_STATE_MASK \
-	(PROC_IN_VACUUM | PROC_IN_ANALYZE | PROC_VACUUM_FOR_WRAPAROUND)
+	(PROC_IN_VACUUM | PROC_VACUUM_FOR_WRAPAROUND)
 
 /*
  * We allow a small number of "weak" relation locks (AccessShareLock,
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index a5c7d0c064..01040d76e1 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -29,8 +29,6 @@
  */
 #define		PROCARRAY_VACUUM_FLAG			0x02	/* currently running lazy
 	 * vacuum */
-#define		PROCARRAY_ANALYZE_FLAG			0x04	/* currently running
-	 * analyze */
 #define		PROCARRAY_LOGICAL_DECODING_FLAG 0x10	/* currently doing logical
 	 * decoding outside xact */
 
@@ -42,7 +40,6 @@
  * have no corresponding PROC flag equivalent.
  */
 #define		PROCARRAY_PROC_FLAGS_MASK	(PROCARRAY_VACUUM_FLAG | \
-		 PROCARRAY_ANALYZE_FLAG | \
 		 PROCARRAY_LOGICAL_DECODING_FLAG)
 
 /* Use the following flags as an input "flags" to GetOldestXmin function */
@@ -50,10 +47,6 @@
 #define		PROCARRAY_FLAGS_DEFAULT			PROCARRAY_LOGICAL_DECODING_FLAG
 /* Ignore vacuum backends */
 #define		PROCARRAY_FLAGS_VACUUM			PROCARRAY_FLAGS_DEFAULT | PROCARRAY_VACUUM_FLAG
-/* Ignore analyze backends */
-#define		PROCARRAY_FLAGS_ANALYZE			PROCARRAY_FLAGS_DEFAULT | PROCARRAY_ANALYZE_FLAG
-/* Ignore both vacuum and analyze backends */
-#define		PROCARRAY_FLAGS_VACUUM_ANALYZE	PROCAR

Re: Reg. Postgres 13

2020-08-05 Thread Bruce Momjian
On Wed, Aug  5, 2020 at 06:08:15AM +, Joel Mariadasan (jomariad) wrote:
> Hi,
> 
> I see that release 13 is currently in beta.
> 
> When will be the official production release of 13 be out?
> 
> We need to see if we can include this as part of our product release cycle.

Look here:

https://www.postgresql.org/developer/roadmap/

 The next major release of PostgreSQL is planned to be the 13
release. A tentative schedule for this version has a release in the
third quarter of 2020. 

Most likely Sept/Oct.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: PROC_IN_ANALYZE stillborn 13 years ago

2020-08-05 Thread Andres Freund
Hi,

On 2020-08-05 19:55:49 -0400, Alvaro Herrera wrote:
> Back in the 8.3 cycle (2007) when the autovacuum launcher/worker split
> was done, we annoyed people because it blocked DDL.  That led to an
> effort to cancel autovac automatically when that was detected, by Simon
> Riggs.
> https://postgr.es/m/1191526327.4223.204.ca...@ebony.site
> https://postgr.es/m/1192129897.4233.433.ca...@ebony.site
> 
> I was fixated on only cancelling when it was ANALYZE, to avoid losing
> any VACUUM work.
> https://postgr.es/m/20071025164150.gf23...@alvh.no-ip.org
> That turned into some flags for PGPROC to detect whether a process was
> ANALYZE, and cancel only those.
> https://postgr.es/m/20071024151328.gg6...@alvh.no-ip.org
> Commit:
> https://postgr.es/m/20071024205536.cb425754...@cvs.postgresql.org
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=745c1b2c2ab
> 
> However, I was outvoted, so we do not limit cancellation to analyze.
> Patch and discussion: 
> https://postgr.es/m/20071025164150.gf23...@alvh.no-ip.org
> Commit:
> https://postgr.es/m/20071026204510.aa02e754...@cvs.postgresql.org
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=acac68b2bca
> 
> ... which means the flag I had added two days earlier has never been
> used for anything.  We've carried the flag forward to this day for
> almost 13 years, dutifully turning it on and off ... but never checking
> it anywhere.
> 
> I propose to remove it, as in the attached patch.

I'm mildly against that, because I'd really like to start making use of
the flag. Not so much for cancellations, but to avoid the drastic impact
analyze has on bloat.  In OLTP workloads with big tables, and without
disabled cost limiting for analyze (or slow IO), the snapshot that
analyze holds is often by far the transaction with the oldest xmin.

It's not entirely trivial to fix (just ignoring it could lead to
detoasting issues), but also not that hard.

Only mildly against because it'd not be hard to reintroduce once we need
it.

Greetings,

Andres Freund




RE: [Patch] Optimize dropping of relation buffers using dlist

2020-08-05 Thread k.jami...@fujitsu.com
On Saturday, August 1, 2020 5:24 AM, Andres Freund wrote:

Hi,
Thank you for your constructive review and comments.
Sorry for the late reply.

> Hi,
> 
> On 2020-07-31 15:50:04 -0400, Tom Lane wrote:
> > Andres Freund  writes:
> > > Indeed. The buffer mapping hashtable already is visible as a major
> > > bottleneck in a number of workloads. Even in readonly pgbench if s_b
> > > is large enough (so the hashtable is larger than the cache). Not to
> > > speak of things like a cached sequential scan with a cheap qual and wide
> > > rows.
> >
> > To be fair, the added overhead is in buffer allocation not buffer lookup.
> > So it shouldn't add cost to fully-cached cases.  As Tomas noted
> > upthread, the potential trouble spot is where the working set is
> > bigger than shared buffers but still fits in RAM (so there's no actual
> > I/O needed, but we do still have to shuffle buffers a lot).
> 
> Oh, right, not sure what I was thinking.
> 
> 
> > > Wonder if the temporary fix is just to do explicit hashtable probes
> > > for all pages iff the size of the relation is < s_b / 500 or so.
> > > That'll address the case where small tables are frequently dropped -
> > > and dropping large relations is more expensive from the OS and data
> > > loading perspective, so it's not gonna happen as often.
> >
> > Oooh, interesting idea.  We'd need a reliable idea of how long the
> > relation had been (preferably without adding an lseek call), but maybe
> > that's do-able.
> 
> IIRC we already do smgrnblocks nearby, when doing the truncation (to figure 
> out
> which segments we need to remove). Perhaps we can arrange to combine the
> two? The layering probably makes that somewhat ugly :(
> 
> We could also just use pg_class.relpages. It'll probably mostly be accurate
> enough?
> 
> Or we could just cache the result of the last smgrnblocks call...
> 
> 
> One of the cases where this type of strategy is most interesting to me is the 
> partial
> truncations that autovacuum does... There we even know the range of tables
> ahead of time.

Konstantin tested it on various workloads and saw no regression.
But I understand the sentiment on the added overhead on BufferAlloc.
Regarding the case where the patch would potentially affect workloads that fit
into RAM but not into shared buffers, could one of Andres' suggested ideas
above address that, in addition to this patch's possible shared invalidation
fix? Could that settle the added overhead in BufferAlloc() as a temporary fix?
Thomas Munro is also working on caching relation sizes [1], maybe that way we
could get the latest known relation size. Currently, it's possible only during
recovery in smgrnblocks.
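
Just to make sure we are talking about the same thing, here is a rough
sketch of that per-block probing idea (the function name is invented and
details such as the caller-side threshold check are elided; this would sit
next to DropRelFileNodeBuffers() in bufmgr.c, so the existing includes apply):

static void
DropRelFileNodeBuffersByBlock(RelFileNode rnode, ForkNumber forkNum,
                              BlockNumber nblocks)
{
    BlockNumber blk;

    /* caller only uses this when nblocks is small relative to NBuffers */
    for (blk = 0; blk < nblocks; blk++)
    {
        BufferTag   tag;
        uint32      hash;
        LWLock     *partitionLock;
        int         buf_id;
        BufferDesc *bufHdr;
        uint32      buf_state;

        INIT_BUFFERTAG(tag, rnode, forkNum, blk);
        hash = BufTableHashCode(&tag);
        partitionLock = BufMappingPartitionLock(hash);

        LWLockAcquire(partitionLock, LW_SHARED);
        buf_id = BufTableLookup(&tag, hash);
        LWLockRelease(partitionLock);

        if (buf_id < 0)
            continue;           /* block not cached */

        bufHdr = GetBufferDescriptor(buf_id);
        buf_state = LockBufHdr(bufHdr);

        /* Recheck the tag; the buffer could have been recycled meanwhile. */
        if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
            bufHdr->tag.forkNum == forkNum &&
            bufHdr->tag.blockNum == blk)
            InvalidateBuffer(bufHdr);   /* releases the header spinlock */
        else
            UnlockBufHdr(bufHdr, buf_state);
    }
}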

Tom Lane wrote:
> But aside from that, I noted a number of things I didn't like a bit:
> 
> * The amount of new shared memory this needs seems several orders of
> magnitude higher than what I'd call acceptable: according to my measurements
> it's over 10KB per shared buffer!  Most of that is going into the
> CachedBufTableLock data structure, which seems fundamentally misdesigned ---
> how could we be needing a lock per map partition *per buffer*?  For 
> comparison,
> the space used by buf_table.c is about 56 bytes per shared buffer; I think 
> this
> needs to stay at least within hailing distance of there.
> 
> * It is fairly suspicious that the new data structure is manipulated while 
> holding
> per-partition locks for the existing buffer hashtable.
> At best that seems bad for concurrency, and at worst it could result in 
> deadlocks,
> because I doubt we can assume that the new hash table has partition boundaries
> identical to the old one.
> 
> * More generally, it seems like really poor design that this has been written
> completely independently of the existing buffer hash table.
> Can't we get any benefit by merging them somehow?

The original aim is just to shorten the recovery process; the speedups for
vacuum and truncate are an added bonus.
Given that we don't have a shared invalidation mechanism in place yet, such as
a radix-tree buffer mapping (which is complex), I thought a patch like mine
could be an alternative approach to that. So I want to improve the patch
further.
I hope you can help me clarify the direction, so that I can avoid going
farther away from what the community wants.
 1. Both normal operations and recovery process
 2. Improve recovery process only

For 1, the current patch aims to touch on that, but further design improvement 
is needed.
It would be ideal to modify the BufferDesc, but that cannot be expanded anymore 
because
it would exceed the CPU cache line size. So I added new data structures (hash 
table,
dlist, lock) instead of modifying the existing ones.
The new hash table ensures that it's identical to the old one with the use of 
the same
Relfilenode in the key and a lock when inserting and deleting buffers from 
buffer table,
as well as during lookups. As for the partition locking, I added it to reduce 
lock contentio

Re: Add MAIN_RELATION_CLEANUP and SECONDARY_RELATION_CLEANUP options to VACUUM

2020-08-05 Thread Michael Paquier
On Wed, Aug 05, 2020 at 12:56:48AM +, Bossart, Nathan wrote:
> My main motive for adding the MAIN_RELATION_CLEANUP option is to allow
> table owners to easily vacuum only a relation's TOAST table.  Roles do
> not have access to the pg_toast schema by default, so they might be
> restricted from vacuuming their TOAST tables directly.

True that you need an extra GRANT USAGE ON pg_toast to achieve that
for users with no privileges, but that's not impossible now either.  I
am not sure that this use-case justifies a new option and more
complications in the code paths of vacuum though.  So let's see first
if others have an opinion to offer.
--
Michael


signature.asc
Description: PGP signature


Re: WIP: WAL prefetch (another approach)

2020-08-05 Thread Thomas Munro
On Tue, Aug 4, 2020 at 3:47 AM Tomas Vondra
 wrote:
> On Thu, Jul 02, 2020 at 03:09:29PM +1200, Thomas Munro wrote:
> >FYI I am still trying to reproduce and understand the problem Tomas
> >reported; more soon.
>
> Any luck trying to reproduce things? Should I try again and collect some
> additional debug info?

No luck.  I'm working on it now, and also trying to reduce the
overheads so that we're not doing extra work when it doesn't help.

By the way, I also looked into recovery I/O stalls *other* than
relation buffer cache misses, and created
https://commitfest.postgresql.org/29/2669/ to fix what I found.  If
you avoid both kinds of stalls then crash recovery is finally CPU
bound (to go faster after that we'll need parallel replay).




Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)

2020-08-05 Thread Amit Langote
On Thu, Aug 6, 2020 at 2:30 AM Tom Lane  wrote:
> Amit Langote  writes:
> > The attached patch should fix that.
>
> I don't like this patch at all though; I do not think it is being nearly
> careful enough to ensure that it's matched the surviving relation OIDs
> correctly.  In particular it blithely assumes that a zero in relid_map
> *must* match the immediately next entry in partdesc->oids, which is easy
> to break if the new partition is adjacent to the one the planner managed
> to prune.

Indeed, you're right.

>  So I think we should do it more like the attached.

Thanks for pushing that.

-- 
Amit Langote
EnterpriseDB: http://www.enterprisedb.com




Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)

2020-08-05 Thread Andy Fan
On Thu, Aug 6, 2020 at 2:22 AM Tom Lane  wrote:

> Robert Haas  writes:
> > On Wed, Aug 5, 2020 at 1:30 PM Tom Lane  wrote:
> >> I'm strongly tempted to convert the trailing Assert to an actual
> >> test-and-elog, too, but didn't do so here.
>
> > I was thinking about that, too. +1 for taking that step.
>
> Will do.
>
> In the longer term, it's annoying that we have no test methodology
> for this other than "manually set a breakpoint here".


One of the methods I see is that we can just add some GUC variable for
action injection.  Basically it adds some code based on the GUC, like this:

if (shall_delay_planning)
    sleep(10);

AFAIK, MongoDB uses this kind of technology a lot in their test framework.
First it defines the fail point [1], and then does code injection if the
fail point is set [2].  Finally, during the test it can set a fail point
like a GUC, but with more attributes [3].
If that would be useful in PG as well and it is not an urgent task, I would
like to help in this direction.

[1]
https://github.com/mongodb/mongo/search?q=MONGO_FAIL_POINT_DEFINE&unscoped_q=MONGO_FAIL_POINT_DEFINE

[2]
https://github.com/mongodb/mongo/blob/d4e7ea57599b44353b5393afedee8ae5670837b3/src/mongo/db/repl/repl_set_config.cpp#L475
[3]
https://github.com/mongodb/mongo/blob/e07c2d29aded5a30ff08b5ce6a436b6ef6f44014/src/mongo/shell/replsettest.js#L1427



> If we're going
> to allow plan-relevant DDL changes to happen with less than full table
> lock, I think we need to improve that.  I spent a little bit of time
> just now trying to build an isolationtester case for this, and failed
> completely.  So I wonder if we can create some sort of test module that
> allows capture of a plan tree and then execution of that plan tree later
> (even after relcache inval would normally have forced replanning).
> Obviously that could not be a normal SQL-accessible feature, because
> some types of invals would make the plan completely wrong, but for
> testing purposes it'd be mighty helpful to check that a stale plan
> still works.
>
> regards, tom lane
>
>
>

-- 
Best Regards
Andy Fan


Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)

2020-08-05 Thread Tom Lane
Andy Fan  writes:
> On Thu, Aug 6, 2020 at 2:22 AM Tom Lane  wrote:
>> In the longer term, it's annoying that we have no test methodology
>> for this other than "manually set a breakpoint here".

> One of the methods I see is we can just add some GUC variable for some
> action injection.   basically it adds some code based on the GUC like this;

See my straw-man proposal downthread.  I'm not very excited about putting
things like this into the standard build, because it's really hard to be
sure that there are no security-hazard-ish downsides of putting in ways to
get at testing behaviors from standard SQL.  And then there's the question
of whether you're adding noticeable overhead to production builds.  So a
loadable module that can use some existing hook to provide the needed
behavior seems like a better plan to me, whenever we can do it that way.

In general, though, it seems like we've seldom regretted investments in
test tooling.

regards, tom lane




Re: Which SET TYPE don't actually require a rewrite

2020-08-05 Thread Noah Misch
On Wed, Aug 05, 2020 at 02:52:42PM +0200, Magnus Hagander wrote:
> On Sat, Jul 18, 2020 at 4:57 AM Noah Misch  wrote:
> > Such a doc addition is fine with me.  I agree with Tom that it will be prone
> > to staleness, but I don't conclude that the potential for staleness reduces
> > its net value below zero.  Having said that, if the consequences of doc
> > staleness are "very bad", you may consider documenting the debug1 user
> > interface (https://postgr.es/m/20121202020736.gd13...@tornado.leadboat.com)
> > instead of documenting the exact rules.  Either way is fine with me.
> 
> The DEBUG1 method is only after the fact though, isn't it?
> 
> That makes it pretty hard for someone to, say, review a migration script and
> see "this is going to cause problems". And if it's going to be run in an
> env, I personally find it more useful to just stick an event trigger in
> there per our documentation and block it (though it might be a good idea to
> link to that from the alter table reference page, and not just have it
> under event trigger examples).

The "after the fact" aspect is basically the same for the DEBUG1 method and
the event trigger method.  Each fires after lock acquisition and before
rewriting the first tuple.

Event trigger drawbacks include the requirement for superuser cooperation.
DEBUG1/statement_timeout drawbacks include an ambiguity: if it reaches
statement_timeout without printing the DEBUG1, that could mean a lack of
rewrite, or it could mean some other cause of slowness.  I have a weak
preference for promoting the DEBUG1/statement_timeout approach, because cloud
deployments find the superuser obstacle insurmountable.  The ambiguity is
surmountable; one can always remove the statement_timeout and run the command
to completion in a pre-production environment.




Re: Parallel copy

2020-08-05 Thread vignesh C
On Tue, Aug 4, 2020 at 9:51 PM Tomas Vondra
 wrote:
>
> On Mon, Aug 03, 2020 at 12:33:48PM +0530, Bharath Rupireddy wrote:
> >On Sat, Aug 1, 2020 at 9:55 AM vignesh C  wrote:
> >>
> >> The patches were not applying because of the recent commits.
> >> I have rebased the patch over head & attached.
> >>
> >I rebased v2-0006-Parallel-Copy-For-Binary-Format-Files.patch.
> >
> >Putting together all the patches rebased on to the latest commit
> >b8fdee7d0ca8bd2165d46fb1468f75571b706a01. Patches from 0001 to 0005
> >are rebased by Vignesh, that are from the previous mail and the patch
> >0006 is rebased by me.
> >
> >Please consider this patch set for further review.
> >
>
> I'd suggest incrementing the version every time an updated version is
> submitted, even if it's just a rebased version. It makes it clearer
> which version of the code is being discussed, etc.

Sure, we will take care of this when we are sending the next set of patches.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com




Re: Amcheck: do rightlink verification with lock coupling

2020-08-05 Thread Andrey M. Borodin



> On 6 Aug 2020, at 04:25, Peter Geoghegan  wrote:
> 
> * Added _bt_checkpage() calls for buffers, as is standard practice in nbtree.
> 
> * Added protection against locking the same page a second time in the
> event of a faulty sibling link -- we should avoid a self-deadlock in
> the event of a page that is corrupt in just the wrong way.
> 
> * Updated obsolescent comments that claimed that we never couple
> buffer locks in amcheck.
Cool, thanks!
There's a minor typo: missing space in "of_bt_check_unique".

> 
> I would like to commit something like this in the next day or two.
> 
> Thoughts?

Sounds great! Thanks!

Best regards, Andrey Borodin.





Re: pg_rewind is not crash safe

2020-08-05 Thread Andrey M. Borodin



> On 5 Aug 2020, at 23:13, Heikki Linnakangas  wrote:
> 
> A colleague of mine brought to my attention that pg_rewind is not crash safe. 
> If it is interrupted for any reason, it leaves behind a data directory with a 
> mix of data from the source and target images. If you're "lucky", the server 
> will start up, but it can be in an inconsistent state. 

FWIW we routinely encounter cases where, after an unsuccessful pg_rewind, the
database refuses to start with a "contrecord requested" message.
I have not investigated this in detail yet, but I think it is a result of a
wrong redo recptr being written to the control file (due to interruption or
missing WAL segments).

Best regards, Andrey Borodin.



Re: public schema default ACL

2020-08-05 Thread Noah Misch
On Mon, Aug 03, 2020 at 09:46:23AM -0400, Robert Haas wrote:
> On Mon, Aug 3, 2020 at 2:30 AM Noah Misch  wrote:
> > Between (b)(2)(X) and (b)(3)(X), what are folks' preferences?  Does anyone
> > strongly favor some other option (including the option of changing nothing)
> > over both of those two?
> 
> I don't think we have any options here that are secure but do not
> break backward compatibility.

I agree, but compatibility breaks vary in pain caused.  I want to offer a
simple exit to a backward-compatible configuration, and I want a $NEW_DEFAULT
pleasing enough that a decent fraction of deployments keep $NEW_DEFAULT (forgo
the exit).  The move to default standard_conforming_strings=on is an example
to follow (editing postgresql.conf was the simple exit).

> I don't know how to choose between (1), (2), and (3).

One way is to envision deployments you know and think about a couple of
questions in the context of those deployments.  If $EACH_OPTION happened,
would this deployment keep $NEW_DEFAULT, override $NEW_DEFAULT to some other
secure configuration, or exit to $v13_DEFAULT?  Where the answer is "exit",
would those deployments rate the exit recipe easy, medium, or difficult?




Re: public schema default ACL

2020-08-05 Thread Noah Misch
On Mon, Aug 03, 2020 at 07:46:02PM +0200, Peter Eisentraut wrote:
> The important things in my mind are that you keep an easy onboarding
> experience (you can do SQL things without having to create and unlock a
> bunch of things first) and that advanced users can do the things they want
> to do *somehow*.

Makes sense.  How do these options differ in ease of onboarding?  In that
light, what option would you pick?




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-05 Thread Ashutosh Sharma
On Thu, Aug 6, 2020 at 1:04 AM Robert Haas  wrote:
>
> On Mon, Aug 3, 2020 at 12:13 PM Ashutosh Sharma  wrote:
> > If the above path is taken that means none of the items in the page
> > got changed.
>
> Oops. I didn't realize that, sorry. Maybe it would be a little more
> clear if instead of "int nSkippedItems" you had "bool
> did_modify_page"? Then you could initialize it to false and set it to
> true just before doing the page modifications.
>

Okay, np, in that case, as you suggested, I will replace "int
nSkippedItems" with "did_modify_page" to increase the clarity. I will
do this change in the next version of patch. Thanks.

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-05 Thread Ashutosh Sharma
On Thu, Aug 6, 2020 at 1:29 AM Robert Haas  wrote:
>
> On Wed, Aug 5, 2020 at 9:42 AM Ashutosh Sharma  wrote:
> > Yeah, it's being tested on the main table, not on a toast table. I've
> > removed this test-case and also restricted direct access to the toast
> > table using heap_force_kill/freeze functions. I think we shouldn't be
> > using these functions to do any changes in the toast table. We will
> > only use these functions with the main table and let VACUUM remove the
> > corresponding data chunks (pointed by the tuple that got removed from
> > the main table).
>
> I agree with removing the test case, but I disagree with restricting
> this from being used on the TOAST table. These are tools for experts,
> who may use them as they see fit. It's unlikely that it would be
> useful to use this on a TOAST table, I think, but not impossible.
>

Okay, if you want, I can remove the restriction on the toast table, but
then that means a user can directly remove the data chunks from the
toast table without changing anything in the main table. This means we
won't be able to query the main table, as it will fail with an error
like "ERROR:  unexpected chunk number ...". So, we will have to find
some way to identify the pointer to the chunks that got deleted from
the toast table and remove that pointer from the main table. We also
need to make sure that, before we remove a tuple (pointer) from the
main table, we identify all the remaining data chunks pointed to by
this tuple and remove them completely; only then would the table be
considered to be in a good state. Now, I am not sure if we can
currently do all these things.

> > Another option would be to identify all the data chunks corresponding
> > to the tuple (ctid) being killed from the main table and remove them
> > one by one. We will only do this if the tuple from the main table that
> > has been marked as killed has an external storage. We will have to add
> > a bunch of code for this otherwise we can let VACUUM do this for us.
> > Let me know your thoughts on this.
>
> I don't think VACUUM will do anything for us automatically -- it isn't
> going to know that we force-killed the tuple in the main table.
> Normally, a tuple delete would have to set xmax on the TOAST tuples
> and then VACUUM would do its thing, but in this case that won't
> happen. So if you force-kill a tuple in the main table you would end
> up with a space leak in the TOAST table.
>
> The problem here is that one reason you might force-killing a tuple in
> the main table is because it's full of garbage. If so, trying to
> decode the tuple so that you can find the TOAST pointers might crash
> or error out, or maybe that part will work but then you'll error out
> trying to look up the corresponding TOAST tuples, either because the
> values are not valid or because the TOAST table itself is generally
> hosed in some way. So I think it is probably best if we keep this tool
> as simple as possible, with as few dependencies as we can, and
> document the possible negative outcomes of using it.

I completely agree with you.

> It's not
> impossible to recover from a space-leak like this; you can always move
> the data into a new table with CTAS and then drop the old one. Not
> sure whether CLUSTER or VACUUM FULL would also be sufficient.
>

Yeah, I think, we can either use CTAS or VACUUM FULL, both look fine.

> Separately, we might want to add a TOAST-checker to amcheck, or
> enhance the heap-checker Mark is working on, and one of the things it
> could do is check for TOAST entries to which nothing points. Then if
> you force-kill tuples in the main table you could also use that tool
> to look for things in the TOAST table that ought to also be
> force-killed.
>

Okay, good to know that. Thanks for sharing this info.

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com




Range checks of pg_test_fsync --secs-per-test and pg_test_timing --duration

2020-08-05 Thread Michael Paquier
Hi all,

As $subject says, pg_test_fsync and pg_test_timing don't really check
the range of option values specified.  It is possible for example to
make pg_test_fsync run an infinite amount of time, and pg_test_timing
does not handle overflows with --duration at all.

These are far from being critical issues, but let's fix them at least
on HEAD.  So, please see the attached, where I have also added some
basic TAP tests for both tools.
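
As a side note, a check written with strtol() would report overflow
explicitly via ERANGE, whereas atoi() is not required to set errno at all.
A sketch of that style of check (not what the attached patch does):

#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

static int
parse_positive_int(const char *arg, const char *optname)
{
    char   *endptr;
    long    val;

    errno = 0;
    val = strtol(arg, &endptr, 10);
    if (endptr == arg || *endptr != '\0' ||
        errno == ERANGE || val <= 0 || val > INT_MAX)
    {
        fprintf(stderr, "%s must be in range %d..%d\n",
                optname, 1, INT_MAX);
        exit(1);
    }
    return (int) val;
}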

Thanks,
--
Michael
diff --git a/src/bin/pg_test_fsync/.gitignore b/src/bin/pg_test_fsync/.gitignore
index f3b5932498..5eb5085f45 100644
--- a/src/bin/pg_test_fsync/.gitignore
+++ b/src/bin/pg_test_fsync/.gitignore
@@ -1 +1,3 @@
 /pg_test_fsync
+
+/tmp_check/
diff --git a/src/bin/pg_test_fsync/Makefile b/src/bin/pg_test_fsync/Makefile
index 7632c94eb7..c4f9ae0664 100644
--- a/src/bin/pg_test_fsync/Makefile
+++ b/src/bin/pg_test_fsync/Makefile
@@ -22,6 +22,12 @@ install: all installdirs
 installdirs:
 	$(MKDIR_P) '$(DESTDIR)$(bindir)'
 
+check:
+	$(prove_check)
+
+installcheck:
+	$(prove_installcheck)
+
 uninstall:
 	rm -f '$(DESTDIR)$(bindir)/pg_test_fsync$(X)'
 
diff --git a/src/bin/pg_test_fsync/pg_test_fsync.c b/src/bin/pg_test_fsync/pg_test_fsync.c
index 6e47293123..e56b42bcf8 100644
--- a/src/bin/pg_test_fsync/pg_test_fsync.c
+++ b/src/bin/pg_test_fsync/pg_test_fsync.c
@@ -5,6 +5,7 @@
 
 #include "postgres_fe.h"
 
+#include 
 #include 
 #include 
 #include 
@@ -174,6 +175,12 @@ handle_args(int argc, char *argv[])
 
 			case 's':
 secs_per_test = atoi(optarg);
+if (secs_per_test <= 0 || errno == ERANGE)
+{
+	pg_log_error("%s must be in range %d..%d",
+ "--secs-per-test", 1, INT_MAX);
+	exit(1);
+}
 break;
 
 			default:
diff --git a/src/bin/pg_test_fsync/t/001_basic.pl b/src/bin/pg_test_fsync/t/001_basic.pl
new file mode 100644
index 00..cc5517cb06
--- /dev/null
+++ b/src/bin/pg_test_fsync/t/001_basic.pl
@@ -0,0 +1,21 @@
+use strict;
+use warnings;
+
+use Config;
+use TestLib;
+use Test::More tests => 10;
+
+#
+# Basic checks
+
+program_help_ok('pg_test_fsync');
+program_version_ok('pg_test_fsync');
+program_options_handling_ok('pg_test_fsync');
+
+#
+# Test invalid option combinations
+
+command_fails_like(
+	[ 'pg_test_fsync', '--secs-per-test', '0' ],
+	qr/\Qpg_test_fsync: error: --secs-per-test must be in range 1..2147483647\E/,
+	'pg_test_fsync: --secs-per-test must be in range 1..2147483647');
diff --git a/src/bin/pg_test_timing/.gitignore b/src/bin/pg_test_timing/.gitignore
index f6c664c765..e5aac2ab12 100644
--- a/src/bin/pg_test_timing/.gitignore
+++ b/src/bin/pg_test_timing/.gitignore
@@ -1 +1,3 @@
 /pg_test_timing
+
+/tmp_check/
diff --git a/src/bin/pg_test_timing/Makefile b/src/bin/pg_test_timing/Makefile
index 334d6ff5c0..52994b4103 100644
--- a/src/bin/pg_test_timing/Makefile
+++ b/src/bin/pg_test_timing/Makefile
@@ -22,6 +22,12 @@ install: all installdirs
 installdirs:
 	$(MKDIR_P) '$(DESTDIR)$(bindir)'
 
+check:
+	$(prove_check)
+
+installcheck:
+	$(prove_installcheck)
+
 uninstall:
 	rm -f '$(DESTDIR)$(bindir)/pg_test_timing$(X)'
 
diff --git a/src/bin/pg_test_timing/pg_test_timing.c b/src/bin/pg_test_timing/pg_test_timing.c
index e14802372b..d369d32781 100644
--- a/src/bin/pg_test_timing/pg_test_timing.c
+++ b/src/bin/pg_test_timing/pg_test_timing.c
@@ -6,6 +6,8 @@
 
 #include "postgres_fe.h"
 
+#include 
+
 #include "getopt_long.h"
 #include "portability/instr_time.h"
 
@@ -69,6 +71,12 @@ handle_args(int argc, char *argv[])
 		{
 			case 'd':
 test_duration = atoi(optarg);
+if (test_duration <= 0 || errno == ERANGE)
+{
+	fprintf(stderr, _("%s: %s must be in range %d..%d\n"),
+			progname, "--duration", 1, INT_MAX);
+	exit(1);
+}
 break;
 
 			default:
@@ -89,22 +97,11 @@ handle_args(int argc, char *argv[])
 		exit(1);
 	}
 
-	if (test_duration > 0)
-	{
-		printf(ngettext("Testing timing overhead for %d second.\n",
-		"Testing timing overhead for %d seconds.\n",
-		test_duration),
-			   test_duration);
-	}
-	else
-	{
-		fprintf(stderr,
-_("%s: duration must be a positive integer (duration is \"%d\")\n"),
-progname, test_duration);
-		fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
-progname);
-		exit(1);
-	}
+
+	printf(ngettext("Testing timing overhead for %d second.\n",
+	"Testing timing overhead for %d seconds.\n",
+	test_duration),
+		   test_duration);
 }
 
 static uint64
diff --git a/src/bin/pg_test_timing/t/001_basic.pl b/src/bin/pg_test_timing/t/001_basic.pl
new file mode 100644
index 00..32dcfbe572
--- /dev/null
+++ b/src/bin/pg_test_timing/t/001_basic.pl
@@ -0,0 +1,21 @@
+use strict;
+use warnings;
+
+use Config;
+use TestLib;
+use Test::More tests => 10;
+
+#
+# Basic checks
+
+program_help_ok('pg_test_timing');
+program_version_ok('pg_test_timing');
+program_options_handling_ok('pg_test_timing');
+