Bug about drop index concurrently

2019-10-16 Thread 李杰(慎追)

Hi hackers,

I recently discovered a PostgreSQL bug, and I would like to ask you for the
best solution.
The problem is as follows:

postgres=# explain analyze select * from xxx where a=500;
ERROR: could not open relation with OID 25989
The structure of my table is as follows:
postgres=# \d xxx
                Table "public.xxx"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 a      | integer |           |          |
 b      | text    |           |          |

postgres=# select count(*) from xxx;
 count
-------
    80
(1 row)

postgres=# select * from xxx limit 3;
 a |                b
---+----------------------------------
 1 | 203c51477570aa517cfa317155dcc52c
 2 | b58da31baa5c78fee4642fd398bd5909
 3 | c7c475bf0a3ca2fc2afc4812a4d44c58
(3 rows)

I opened the log file and saw that the index on table xxx had been dropped:

postgres=# drop index CONCURRENTLY idx_xxx ;
DROP INDEX

To reproduce this bug, I created and dropped the index again and again on the
master. What is hard to understand is that the bug cannot be reproduced 100%
of the time.
I wrote a script that loops on the master, running the following two
statements:

postgres=# create index idx_xxx on xxx (a);
postgres=# drop index CONCURRENTLY idx_xxx ;
postgres=# create index idx_xxx on xxx (a);
postgres=# drop index CONCURRENTLY idx_xxx ;
...
...
...
At the same time, I started two clients on the standby, each executing the
following SQL against table xxx:

postgres=# explain analyze select * from xxx where a=500;
postgres=# \watch 0.1

After a few minutes, the bug will appear.
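
For reference, the whole reproduction is roughly the following (the loop
script below is a sketch of what I ran; connection options are omitted):

    # on the master: churn the index in a loop
    while true; do
        psql -c "create index idx_xxx on xxx (a);"
        psql -c "drop index CONCURRENTLY idx_xxx;"
    done

    # on the standby, in two separate psql sessions:
    #   explain analyze select * from xxx where a=500;
    #   \watch 0.1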

This confirmed my guess: the standby query was using an index scan while the
index was being dropped on the master at the same time.
Curious, I went to read the PostgreSQL source code and found that a regular
DROP INDEX takes an AccessExclusiveLock on the table, while DROP INDEX
CONCURRENTLY only takes a ShareUpdateExclusiveLock.

As we all know, AccessShareLock (the lock taken by a SELECT) conflicts only
with AccessExclusiveLock, so it neither blocks nor is blocked by
ShareUpdateExclusiveLock.
This is very strange and not desirable.

Of course, the developers thought of this: in the source code, before DROP
INDEX CONCURRENTLY actually drops the index, it waits for all transactions
that might still be using the index to finish.
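
For example, on the master the behaviour is roughly this (a sketch using the
same object names as above):

    -- session 1 (master): an open transaction holding AccessShareLock on xxx
    BEGIN;
    SELECT * FROM xxx WHERE a = 500;   -- may use idx_xxx

    -- session 2 (master):
    DROP INDEX CONCURRENTLY idx_xxx;   -- waits until session 1 commits/aborts
    -- a plain DROP INDEX would instead queue for AccessExclusiveLock and
    -- block new SELECTs on xxx until it completes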

But that wait happens only on the master, while my query runs on the standby.
I used the pg_waldump tool to parse the WAL files and analyzed the startup
process, and found that there is no similar wait on the standby. As a result,
while I am executing a query on the standby, the index can be dropped out
from under it.


I think this is a bug that affects the user experience, and we need to fix it.
I have considered implementing the same logic on the standby, detecting the
running query transactions and waiting for them to end, but this may increase
the WAL-apply delay and therefore the lag between the master and the standby.
This is not desirable when query concurrency is high.

All in all, I hope you can suggest a solution that allows DROP INDEX
CONCURRENTLY to be used without increasing the master-standby replication
delay.

I sincerely look forward to your reply. Thanks.

adger





Re: Understanding TupleQueue impact and overheads?

2019-10-16 Thread Andres Freund
Hi,

On 2019-10-16 01:24:04 +, Tom Mercha wrote:
> What I am having difficulty understanding is what happens to the
> location of the HeapTuple as it moves from one TupleTableSlot to the
> other as described above. Since there most likely is a reference to a
> physical tuple involved, am I incurring a disk-access overhead with each
> copy of a tuple? This would seem like a massive overhead; how can I keep
> such overheads to a minimum?

The tuple is fully "materialized" on the sending side, due to
tuple = ExecFetchSlotHeapTuple(slot, true, &should_free);

so there are no direct references to disk data at that point. But if
there are toasted columns, they may only be accessed on the receiving
side.

Side-note: This very likely should use a minimal tuple rather than a full
heap tuple.
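
For reference, the sending path in tqueue.c is roughly the following
(paraphrased from tqueueReceiveSlot(); see the actual code for the error
handling):

    /* materialize the slot's tuple, then copy its bytes into the shm_mq */
    tuple = ExecFetchSlotHeapTuple(slot, true, &should_free);
    result = shm_mq_send(tqueue->queue, tuple->t_len, tuple->t_data, false);
    if (should_free)
        heap_freetuple(tuple);
    /* the receiver rebuilds a slot from the queued copy, so no buffer pin
     * or direct disk reference travels through the queue */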


> Furthermore, to what extent can I expect other modules to impact a
> queued HeapTuple? If some external process updates this tuple, when will
> I see the change? Would it be a possiblity that the update is not
> reflected on the queued HeapTuple but the external process is not
> blocked/delayed from updating? In other words, like operating on some
> kind of multiple snapshots? When does DBMS logging kick in whilst I am
> transferring a tuple from TupTableStore to another?

I'm not quite sure what you're actually trying to get at. Whether a
tuple is ferried through the queue or not shouldn't have an impact on
visibility / snapshot and locking considerations. For parallel query etc
the snapshots are synchronized between the "leader" and its workers. If
you want to use them for something separate, it's your responsibility to
do so.

Greetings,

Andres Freund




Re: strong memory leak in plpgsql from handled rollback and lazy cast

2019-10-16 Thread Andres Freund
Hi,

On 2019-09-22 20:16:15 -0400, Tom Lane wrote:
> I think you're considering a much smaller slice of the system than
> what seems to me to be at risk here.  As an example, an ExprState
> tree would also include any fn_extra-linked state that individual
> functions might've set up.

> > Hm. I interestingly am working on a patch that merges all the memory
> > allocations done for an ExprState into one or two allocations (by
> > basically doing the traversal twice). Then it'd be feasible to just
> > pfree() the memory, if that helps.
>
> Again, that'd do nothing to clean up subsidiary fn_extra state.
> If we want no leaks, we need a separate context for the tree
> to live in.

As mentioned, as part of some expression evaluation improvements (both
w/ and wo/ jit) I now have a prototype patch that moves nearly all the
dynamically allocated memory, including the FunctionCallInfo, into one
chunk of memory. That's currently allocated together with the ExprState.
With the information collected for that it'd be fairly trivial to reset
things like fn_extra in a reasonably efficient manner, without needing
knowledge about each ExprEvalOp.

Obviously that'd not itself change e.g. anything about the lifetime of
the memory pointed to by fn_extra etc. But it'd not be particularly hard
to have FmgrInfo->fn_mcxt point somewhere else.

As part of the above rework ExecInitExprRec() etc all now pass down a
new ExprStateBuilder * object, containing state needed somewhere down in
the expression tree.  It'd e.g. be quite possible to add an
ExecInitExpr() variant that allows one to specify in more detail what memory
context ought to be used for what.

I'm however not at all sure it's worth investing time into doing so
specifically for this case. But it seems like it might generally be
something we'll need more infrastructure for in other cases too.

Greetings,

Andres Freund




Re: Questions/Observations related to Gist vacuum

2019-10-16 Thread Amit Kapila
On Tue, Oct 15, 2019 at 7:13 PM Heikki Linnakangas  wrote:
>
> On 15/10/2019 09:37, Amit Kapila wrote:
> > 2. Right now, in gistbulkdelete we make a note of empty leaf pages and
> > internals pages and then in the second pass during gistvacuumcleanup,
> > we delete all the empty leaf pages.  I was thinking why unlike nbtree,
> > we have delayed the deletion of empty pages till gistvacuumcleanup.  I
> > don't see any problem if we do this during gistbulkdelete itself
> > similar to nbtree, also I think there is some advantage in marking the
> > pages as deleted as early as possible.  Basically, if the vacuum
> > operation is canceled or errored out between gistbulkdelete and
> > gistvacuumcleanup, then I think the deleted pages could be marked as
> > recyclable very early in next vacuum operation.  The other advantage
> > of doing this during gistbulkdelete is we can avoid sharing
> > information between gistbulkdelete and gistvacuumcleanup which is
> > quite helpful for a parallel vacuum as the information is not trivial
> > (it is internally stored as in-memory Btree).   OTOH, there might be
> > some advantage for delaying the deletion of pages especially in the
> > case of multiple scans during a single VACUUM command.  We can
> > probably delete all empty leaf pages in one go which could in some
> > cases lead to fewer internal page reads.  However, I am not sure if it
> > is really advantageous to postpone the deletion as there seem to be
> > some downsides to it as well. I don't see it documented why unlike
> > nbtree we consider delaying deletion of empty pages till
> > gistvacuumcleanup, but I might be missing something.
>
> Hmm. The thinking is/was that removing the empty pages is somewhat
> expensive, because it has to scan all the internal nodes to find the
> downlinks to the to-be-deleted pages. Furthermore, it needs to scan all
> the internal pages (or at least until it has found all the downlinks),
> regardless of how many empty pages are being deleted. So it makes sense
> to do it only once, for all the empty pages. You're right though, that
> there would be advantages, too, in doing it after each pass.
>

I was thinking more about this, and it seems that there could be more cases
where delaying the delete mark for pages further delays the recycling of
pages.  It is quite possible that immediately after bulk delete the value of
nextFullXid (used as deleteXid) is X, while during vacuum cleanup it is X + N,
and the chances of N being large are higher when there are multiple passes of
the vacuum scan.  Now, if we had set deleteXid to X, there would be a better
chance for the next vacuum to recycle the pages.  I am not sure, but it might
be that in the future we could come up with something (say, if we can
recompute RecentGlobalXmin again) that lets us recycle pages from the first
index scan in the next scan of the index during a single vacuum operation.

This is just to emphasize the point that doing the delete marking of
pages in the same pass has advantages, otherwise, I understand that
there are advantages in delaying it as well.

> All things
> considered, I'm not sure which is better.
>

Yeah, this is a tough call to make, but if we can allow it to delete
the pages in bulkdelete conditionally for parallel vacuum workers,
then it would be better.

I think we have the following options w.r.t. Gist indexes for parallel vacuum:
a. don't allow Gist Index to participate in parallel vacuum
b. allow delete of leaf pages in bulkdelete for parallel worker
c. always allow deleting leaf pages in bulkdelete
d. Invent some mechanism to share all the Gist stats via shared memory

(a) is not a very good option, but it is a safe one, as we can extend it in
the future, and we might decide to go with it especially if we can't decide
among the other options. (b) would serve the need but would add some
additional checks in gistbulkdelete and would look more like a hack.  (c)
would be best, but I think it will be difficult to be sure that it is a good
decision for all types of cases. (d) could require a lot of effort and I am
not sure if it is worth adding that complexity to the proposed patch.

Do you have any thoughts on this?

Just to give you an idea of the current parallel vacuum patch, the
master backend scans the heap and forms the dead tuple array in dsm
and then we launch one worker for each index based on the availability
of workers and share the dead tuple array with each worker.  Each
worker performs bulkdelete for the index.  In the end, we perform
cleanup of all the indexes either via worker or master backend based
on some conditions.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Cache lookup errors with functions manipulation object addresses

2019-10-16 Thread Dmitry Dolgov
> On Tue, Sep 24, 2019 at 1:58 AM Michael Paquier  wrote:
>
> Please feel free to use the updated versions attached.  These can
> apply on top of HEAD at 30d1379.

Thanks for the update. Looking at v17 0003 I have one more question. In all the
places where we have to do systable_endscan, it is followed by heap_close,
although in the rest of the file table_close is used. I guess this logic is not
heap-specific and should also use table_close?
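
Something like the usual idiom, I mean (schematic only, not the patch's
actual code; the catalog OID is a placeholder):

    Relation    rel;
    SysScanDesc scan;
    HeapTuple   tup;

    rel = table_open(catalogId, AccessShareLock);
    scan = systable_beginscan(rel, InvalidOid, false, NULL, 0, NULL);
    while (HeapTupleIsValid(tup = systable_getnext(scan)))
    {
        /* ... inspect tup ... */
    }
    systable_endscan(scan);
    table_close(rel, AccessShareLock);    /* rather than heap_close() */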




Re: fairywren failures

2019-10-16 Thread Andrew Dunstan
On Sat, Oct 12, 2019 at 3:56 PM Peter Eisentraut
 wrote:
>
> On 2019-10-03 16:21, Andrew Dunstan wrote:
> > My new msys2 animal fairywren
>
> Could you please check how this animal is labeled?  AFAICT, this is not
> an msys2 build but a mingw build (x86_64-w64-mingw32).


It is indeed an msys2 system. However, when we set MSYSTEM=MINGW64, as we do
in fairywren's config environment so that the compiler is properly detected
by configure (using Msys2's /etc/config.site), 'uname -a' reports MINGW64...
instead of MSYS...

This is a bit confusing.

The compiler currently being used on the animal is gcc 7.3.0 from the
Mingw64 project, the same one that's being used on jacana (which runs
Msys1). Notwithstanding the "mingw32" in the compiler name, these are
64-bit builds. I think the "32" there is somewhat vestigial.


>
> > has had 3 recent failures when checking
> > pg_upgrade. The failures have been while running the regression tests,
> > specifically the interval test, and they all look like this:
>
> I've also seen this randomly, but only under 64-bit mingw, never 32-bit
> mingw.
>


Since I downgraded the compiler from gcc 9.0 about a week ago these
errors seem to have stopped.

cheers

andrew


-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: configure fails for perl check on CentOS8

2019-10-16 Thread Andrew Dunstan
On Tue, Oct 15, 2019 at 6:45 AM Kyotaro Horiguchi
 wrote:

> >
> > Is this error perhaps occurring with a non-vendor Perl installation?
> > What's the exact error message?  config.log might contain some useful
> > clues, too.
>
> The perl package is official one. I found the cause, that's my
> mistake.
>
> The problematic command line boils down to:
>
> $ ./configure --with-perl CFLAGS=-O0
>
> It is bash-aliased and survived without being found for a long time on
> my Cent7 environment, but CentOS8 doesn't allow that.
>
> By the way, is there any official way to specify options like -O0
> while configure time?
>


CFLAGS=-O0 configure  --with-perl ...

cheers

andrew


-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: maintenance_work_mem used by Vacuum

2019-10-16 Thread Masahiko Sawada
On Wed, Oct 16, 2019 at 3:48 PM Amit Kapila  wrote:
>
> On Wed, Oct 16, 2019 at 7:20 AM Masahiko Sawada  wrote:
> >
> > On Sat, Oct 12, 2019 at 8:45 PM Amit Kapila  wrote:
> > >
> > > On Sat, Oct 12, 2019 at 10:49 AM Masahiko Sawada  
> > > wrote:
> > > >
> > > > On Fri, Oct 11, 2019 at 5:13 PM Amit Kapila  
> > > > wrote:
> > > > >
> > > > > That's right, but OTOH, if the user specifies gin_pending_list_limit
> > > > > as an option during Create Index with a value greater than GUC
> > > > > gin_pending_list_limit, then also we will face the same problem.  It
> > > > > seems to me that we haven't thought enough on memory usage during Gin
> > > > > pending list cleanup and I don't want to independently make a decision
> > > > > to change it.  So unless the original author/committer or some other
> > > > > people who have worked in this area share their opinion, we can leave
> > > > > it as it is and make a parallel vacuum patch adapt to it.
> > > >
> > > > Yeah I totally agreed.
> > > >
> > > > Apart from the GIN problem can we discuss whether need to change the
> > > > current memory usage policy of parallel utility command described in
> > > > the doc? We cannot control the memory usage in index AM after all but
> > > > we need to generically consider how much memory is used during
> > > > parallel vacuum.
> > > >
> > >
> > > Do you mean to say change the docs for a parallel vacuum patch in this
> > > regard?  If so, I think we might want to do something for
> > > maintenance_work_mem for parallel vacuum as described in one of the
> > > emails above [1] and then change the docs accordingly.
> > >
> >
> > Yes agreed. I thought that we can discuss that while waiting for other
> > opinion on the memory usage of gin index's pending list cleanup. For
> > example one of your suggestions[1] is simple and maybe acceptable but
> > I guess that it can deal with only gin indexes but not other index AMs
> > that might consume more memory.
> >
>
> It is not that currently, other indexes don't use any additional
> memory (except for maintainence_work_mem).  For example, Gist index
> can use memory for collecting empty leaf pages and internal pages.  I
> am not sure if we can do anything for such cases.  The case for Gin
> index seems to be clear and it seems to be having the risk of using
> much more memory, so why not try to do something for it?

Yeah, the gin index case is clear now and I agree that we need to do something
for it. But I'm also concerned about third-party index AMs. Similar to the
problem related to the IndexBulkDeleteResult structure that we're discussing
on another thread, I think we have the same problem here.
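
For reference, the per-index override mentioned above is a storage parameter
(value in kB) that takes precedence over the GUC; the table and column names
here are just placeholders:

    CREATE INDEX gin_idx ON some_table USING gin (some_col)
        WITH (gin_pending_list_limit = 4096);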

Regards,

--
Masahiko Sawada




Re: configure fails for perl check on CentOS8

2019-10-16 Thread Tom Lane
Andrew Dunstan  writes:
> On Tue, Oct 15, 2019 at 6:45 AM Kyotaro Horiguchi
>  wrote:
>> The problematic command line boils down to:
>> $ ./configure --with-perl CFLAGS=-O0
>> It is bash-aliased and survived without being found for a long time on
>> my Cent7 environment, but CentOS8 doesn't allow that.

I don't quite understand why that wouldn't work.

>> By the way, is there any official way to specify options like -O0
>> while configure time?

> CFLAGS=-O0 configure  --with-perl ...

The way Horiguchi-san did it has been supported by autoconf for
a good long time now, so I don't think command line syntax is
the issue.  Could that CFLAGS setting be interfering with some
feature test in our configure script?

regards, tom lane




Re: [HACKERS] advanced partition matching algorithm for partition-wise join

2019-10-16 Thread Etsuro Fujita
On Wed, Sep 25, 2019 at 12:59 AM Etsuro Fujita  wrote:
> I will continue to review the rest of the patch.

I've been reviewing the rest of the patch.  Here are my review comments:

* map_and_merge_partitions() checks whether the two partitions from
the outer and inner sides can be merged in two steps: 1) see if the
partitions are mapped to each other (ie, partmap1->from == index2 &&
partmap2->from == index1), and 2) see if the merged partition indexes
assigned are the same (ie, partmap1->to == partmap2->to) (or satisfy
some other conditions), but the first step seems redundant to me
because I think that if the merged partition indexes are the same,
then the partitions would be guaranteed to be mapped to each other.
Also, I noticed that that function can't handle some list-partitioning
cases properly.  Here is an example:

CREATE TABLE plt1 (a int, b int, c text) PARTITION BY LIST(c);
CREATE TABLE plt1_p1 PARTITION OF plt1 FOR VALUES IN ('', '0002');
CREATE TABLE plt1_p2 PARTITION OF plt1 FOR VALUES IN ('0003', '0004');
INSERT INTO plt1 SELECT i, i, to_char(i % 5, 'FM') FROM
generate_series(0, 24) i WHERE i % 5 IN (0, 2, 3, 4);
ANALYZE plt1;
CREATE TABLE plt2 (a int, b int, c text) PARTITION BY LIST(c);
CREATE TABLE plt2_p1 PARTITION OF plt2 FOR VALUES IN ('0001', '0002');
CREATE TABLE plt2_p2 PARTITION OF plt2 FOR VALUES IN ('0003', '0004');
INSERT INTO plt2 SELECT i, i, to_char(i % 5, 'FM') FROM
generate_series(0, 24) i WHERE i % 5 IN (1, 2, 3, 4);
ANALYZE plt2;
EXPLAIN (COSTS OFF)
SELECT t1.a, t1.c, t2.a, t2.c FROM plt1 t1 FULL JOIN plt2 t2 ON (t1.c
= t2.c) WHERE COALESCE(t1.a, 0) % 5 != 3 AND COALESCE(t1.a, 0) % 5 !=
4 ORDER BY t1.c, t1.a, t2.a;
                              QUERY PLAN
------------------------------------------------------------------
 Sort
   Sort Key: t1.c, t1.a, t2.a
   ->  Hash Full Join
 Hash Cond: (t1.c = t2.c)
 Filter: (((COALESCE(t1.a, 0) % 5) <> 3) AND ((COALESCE(t1.a,
0) % 5) <> 4))
 ->  Append
   ->  Seq Scan on plt1_p1 t1
   ->  Seq Scan on plt1_p2 t1_1
 ->  Hash
   ->  Append
 ->  Seq Scan on plt2_p1 t2
 ->  Seq Scan on plt2_p2 t2_1
(12 rows)

This should use partitionwise join by the new partition-matching
algorithm but doesn't.  The reason for that is because plt1_p1 and
plt2_p1 are mapped to different merged partitions and thus considered
not merge-able by map_and_merge_partitions() as-is.  I might be
missing something, but I don't think this is intentional, so I rewrote
that function completely in the attached, which is a WIP patch created
on top of the patch [1].

* In handle_missing_partition(), I noticed this:

+   else if (missing_side_inner)
+   {
+   /*
+* If this partition has already been mapped (say because we
+* found an overlapping range earlier), we know where does it
+* fit in the join result. Nothing to do in that case. Else
+* create a new merged partition.
+*/
+   PartitionMap *extra_map = &maps_with_extra[extra_part];
+   if (extra_map->to < 0)
+   {
+   extra_map->to = *next_index;
+   *next_index = *next_index + 1;
+   *merged_index = extra_map->to;
+   }
+   }

As commented, that function skips setting *merged_index when the
"extra_part" partition is already mapped to a merged partition.  This
would be correct for range partitioning, but not for list
partitioning, I think.  Here is an example:

CREATE TABLE plt1 (a int, b int, c text) PARTITION BY LIST(c);
CREATE TABLE plt1_p1 PARTITION OF plt1 FOR VALUES IN ('', '0001', '0002');
CREATE TABLE plt1_p2 PARTITION OF plt1 FOR VALUES IN ('0003', '0004');
INSERT INTO plt1 SELECT i, i, to_char(i % 5, 'FM') FROM
generate_series(0, 24) i;
ANALYZE plt1;
CREATE TABLE plt2 (a int, b int, c text) PARTITION BY LIST(c);
CREATE TABLE plt2_p1 PARTITION OF plt2 FOR VALUES IN ('0002');
CREATE TABLE plt2_p2 PARTITION OF plt2 FOR VALUES IN ('0003', '0004');
INSERT INTO plt2 SELECT i, i, to_char(i % 5, 'FM') FROM
generate_series(0, 24) i WHERE i % 5 IN (2, 3, 4);
ANALYZE plt2;
CREATE TABLE plt3 (a int, b int, c text) PARTITION BY LIST(c);
CREATE TABLE plt3_p1 PARTITION OF plt3 FOR VALUES IN ('0001');
CREATE TABLE plt3_p2 PARTITION OF plt3 FOR VALUES IN ('0003', '0004');
INSERT INTO plt3 SELECT i, i, to_char(i % 5, 'FM') FROM
generate_series(0, 24) i WHERE i % 5 IN (1, 3, 4);
ANALYZE plt3;
EXPLAIN (COSTS OFF)
SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM (plt1 t1 LEFT JOIN plt2
t2 ON (t1.c = t2.c)) FULL JOIN plt3 t3 ON (t1.c = t3.c) WHERE
COALESCE(t1.a, 0) % 5 != 3 AND coalesce(t1.a, 0) % 5 != 4 ORDER BY
t1.c, t1.a, t2.a, t3.a;
                              QUERY PLAN
------------------------------------------------------------------
 Sort
   Sort Key: t1.c, t1.a, t2.a, t3.a
   ->  Hash Full Join
 Hash Cond: (t1.c = t3.c)
 Filter: (((C

Re: v12.0: segfault in reindex CONCURRENTLY

2019-10-16 Thread Alvaro Herrera
On 2019-Oct-15, Michael Paquier wrote:

> So, Alvaro, your patch looks good to me.  Could you apply it?

Thanks, pushed.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: WIP/PoC for parallel backup

2019-10-16 Thread Asif Rehman
On Mon, Oct 7, 2019 at 6:35 PM Asif Rehman  wrote:

>
>
> On Mon, Oct 7, 2019 at 6:05 PM Robert Haas  wrote:
>
>> On Mon, Oct 7, 2019 at 8:48 AM Asif Rehman 
>> wrote:
>> > Sure. Though the backup manifest patch calculates and includes the
>> checksum of backup files and is done
>> > while the file is being transferred to the frontend-end. The manifest
>> file itself is copied at the
>> > very end of the backup. In parallel backup, I need the list of
>> filenames before file contents are transferred, in
>> > order to divide them into multiple workers. For that, the manifest file
>> has to be available when START_BACKUP
>> >  is called.
>> >
>> > That means, backup manifest should support its creation while excluding
>> the checksum during START_BACKUP().
>> > I also need the directory information as well for two reasons:
>> >
>> > - In plain format, base path has to exist before we can write the file.
>> we can extract the base path from the file
>> > but doing that for all files does not seem a good idea.
>> > - base backup does not include the content of some directories but
>> those directories although empty, are still
>> > expected in PGDATA.
>> >
>> > I can make these changes part of parallel backup (which would be on top
>> of backup manifest patch) or
>> > these changes can be done as part of manifest patch and then parallel
>> can use them.
>> >
>> > Robert what do you suggest?
>>
>> I think we should probably not use backup manifests here, actually. I
>> initially thought that would be a good idea, but after further thought
>> it seems like it just complicates the code to no real benefit.
>
>
> Okay.
>
>
>>   I
>> suggest that the START_BACKUP command just return a result set, like a
>> query, with perhaps four columns: file name, file type ('d' for
>> directory or 'f' for file), file size, file mtime. pg_basebackup will
>> ignore the mtime, but some other tools might find that useful
>> information.
>>
> yes current patch already returns the result set. will add the additional
> information.
>
>
>> I wonder if we should also split START_BACKUP (which should enter
>> non-exclusive backup mode) from GET_FILE_LIST, in case some other
>> client program wants to use one of those but not the other.  I think
>> that's probably a good idea, but not sure.
>>
>
> Currently pg_basebackup does not enter in exclusive backup mode and other
> tools have to
> use pg_start_backup() and pg_stop_backup() functions to achieve that.
> Since we are breaking
> backup into multiple command, I believe it would be a good idea to have
> this option. I will include
> it in next revision of this patch.
>
>
>>
>> I still think that the files should be requested one at a time, not a
>> huge long list in a single command.
>>
> sure, will make the change.
>
>
>

I have refactored the functionality into multiple smaller patches in order
to make the review process easier. I have divided the code into backend
changes and pg_basebackup changes. The
backend replication system now supports the following commands:

- START_BACKUP
- SEND_FILE_LIST
- SEND_FILES_CONTENT
- STOP_BACKUP

START_BACKUP no longer returns the list of files; SEND_FILE_LIST is used for
that instead. START_BACKUP now calls pg_start_backup and returns the starting
WAL position, tablespace header information, and the content of the backup
label file.
Initially I was using temp files to store the backup_label content, but that
turned out to be a bad idea because there can be multiple non-exclusive
backups running. The backup label information is needed by stop_backup, so
pg_basebackup will send it back as part of STOP_BACKUP.

SEND_FILE_LIST returns the list of files as a result set with four columns
(filename, type, size, mtime).
SEND_FILES_CONTENT can return a single file or multiple files as required.
There is not much change required to support both, so I believe it will be
more usable this way if other tools want to utilise it.
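
For illustration, the sequence a client issues over a replication connection
now looks roughly like this (argument syntax is omitted here; the exact
grammar is defined in the attached backend patch):

    START_BACKUP            -- enter non-exclusive backup mode; returns start
                            -- WAL position, tablespace headers, backup_label
    SEND_FILE_LIST          -- result set of (filename, type, size, mtime)
    SEND_FILES_CONTENT ...  -- fetch the content of one file, or several
    STOP_BACKUP             -- the client passes the backup_label content back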

As per Robert's suggestion, I am currently working on changes in pg_basebackup
to fetch files one by one. However, that's not complete yet, and the attached
patch still uses the old method of multi-file fetching to test the backend
commands. I will send an updated patch containing the changes for fetching
files one by one.

I wanted to share the backend patch to get some feedback in the mean time.

Thanks,

--
Asif Rehman
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca


0001-Refactor-some-basebackup-code-to-increase-reusabilit.patch
Description: Binary data


0004-parallel-backup-testcase.patch
Description: Binary data


0003-pg_basebackup-changes-for-parallel-backup.patch
Description: Binary data


0002-backend-changes-for-parallel-backup.patch
Description: Binary data


Re: Questions/Observations related to Gist vacuum

2019-10-16 Thread Heikki Linnakangas
On 16 October 2019 12:57:03 CEST, Amit Kapila  wrote:
>On Tue, Oct 15, 2019 at 7:13 PM Heikki Linnakangas 
>wrote:
>> All things
>> considered, I'm not sure which is better.
>
>Yeah, this is a tough call to make, but if we can allow it to delete
>the pages in bulkdelete conditionally for parallel vacuum workers,
>then it would be better.

Yeah, if it's needed for parallel vacuum, maybe that tips the scale.

Hopefully, multi-pass vacuums are rare in practice. And we should lift the 
current 1 GB limit on the dead TID array, replacing it with something more 
compact and expandable, to make multi-pass vacuums even more rare. So I don't 
think we need to jump through many hoops to optimize the multi-pass case.

- Heikki




Re: configure fails for perl check on CentOS8

2019-10-16 Thread Andrew Dunstan
On Wed, Oct 16, 2019 at 8:32 AM Tom Lane  wrote:
>
> Andrew Dunstan  writes:
> > On Tue, Oct 15, 2019 at 6:45 AM Kyotaro Horiguchi
> >  wrote:
> >> The problematic command line boils down to:
> >> $ ./configure --with-perl CFLAGS=-O0
> >> It is bash-aliased and survived without being found for a long time on
> >> my Cent7 environment, but CentOS8 doesn't allow that.
>
> I don't quite understand why that wouldn't work.
>
> >> By the way, is there any official way to specify options like -O0
> >> while configure time?
>
> > CFLAGS=-O0 configure  --with-perl ...
>
> The way Horiguchi-san did it has been supported by autoconf for
> a good long time now, so I don't think command line syntax is
> the issue.

Ah.

> Could that CFLAGS setting be interfering with some
> feature test in our configure script?
>
>


It looks like you need CFLAGS='-O0 -fPIC' on CentOS 8 when building with perl.
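
That is, something like this (the same flags can also be passed as environment
variables in front of the configure invocation):

    ./configure --with-perl CFLAGS='-O0 -fPIC'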

cheers

andrew



-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: ERROR: multixact X from before cutoff Y found to be still running

2019-10-16 Thread Bossart, Nathan
On 10/15/19, 11:11 PM, "Thomas Munro"  wrote:
> Here's a version with a proposed commit message and a comment.  Please
> let me know if I credited things to the right people!

Looks good to me.  Thanks!

Nathan



Re: v12 and pg_restore -f-

2019-10-16 Thread Stephen Frost
Greetings,

* imai.yoshik...@fujitsu.com (imai.yoshik...@fujitsu.com) wrote:
> On Sun, Oct 6, 2019 at 7:09 PM, Justin Pryzby wrote:
> > I saw this and updated our scripts with pg_restore -f-
> > https://www.postgresql.org/docs/12/release-12.html
> > |In pg_restore, require specification of -f - to send the dump contents to 
> > standard output (Euler Taveira)
> > |Previously, this happened by default if no destination was specified, but 
> > that was deemed to be unfriendly.
> > 
> > What I didn't realize at first is that -f- has no special meaning in v11 - 
> > it
> > just writes a file called ./-  And it's considered untennable to change
> > behavior of v11.
> 
> Ahh... I totally missed thinking about the behavior of "-f -" in v11 when I 
> reviewed this patch.

Clearly you weren't the only one.

> On Wed, Oct 9, 2019 at 0:45 PM, Stephen Frost wrote:
> > * Euler Taveira (eu...@timbira.com.br) wrote:
> > > Em ter, 8 de out de 2019 às 15:08, Stephen Frost  
> > > escreveu:
> > > > * Tom Lane (t...@sss.pgh.pa.us) wrote:
> > > > > Andrew Gierth  writes:
> > > > > > "Tom" == Tom Lane  writes:
> > > > > >  Tom> Perhaps we could change the back branches so that they
> > > > > > interpret  Tom> "-f -" as "write to stdout", but without
> > > > > > enforcing that you use  Tom> that syntax.
> > > > >
> > > > > > We should definitely do that.
> > > >
> > > > I agree that this would be a reasonable course of action.  Really,
> > > > it should have always meant that...
> > > >
> > > Indeed, it was a broken behavior and the idea was to fix it. However,
> > > changing pg_restore in back-branches is worse than do nothing because
> > > it could break existent scripts.
> > 
> > I can certainly respect that argument, in general, but in this specific 
> > case, I've got a really hard time believing
> > that people wrote scripts which use '-f -' with the expectation that a 
> > './-' file was to be created.
> 
> +1.
> 
> If we only consider the problem that we can't use "-f -" to mean "dump to
> stdout" in v11 and earlier, it seems like a bug and we should fix
> it.
> Of course, if we fix it, some people would run into trouble, but only those
> people who wrote scripts which use '-f -' with the expectation that a
> './-' file would be created.
> I don't think there are a lot of such people.

This topic, unfortunately, seems a bit stuck right now.  Maybe there's a
way we can pry it loose and get to a resolution.

On the one hand, we have Yoshikazu Imai, Andrew, and me pushing to change the
back-branches, Euler arguing that we shouldn't do anything, and Tom arguing
to make all versions accept '-f -'.

First, I'd like to clarify what I believe Tom's suggestion is, and then
talk through that, as his vote sways this topic pretty heavily.

Tom, I take it your suggestion is to have '-f -' be accepted to mean
'goes to stdout' in all branches?  That goes against the argument that
we don't want to break existing scripts, as it's possible that there are
existing scripts that depend on '-f -' actually going to a './-' file.

Is your argument here that, with the above, existing scripts could be
updated to use '-f -' explicitly and work with multiple versions, and
that scripts which aren't changed would work as-is?

If so, then I don't agree with it- if we really don't want to break
existing scripts when moving from pre-v12 to v12, then this patch never
should have been accepted at all as that's the only way to avoid
breaking anything, but then, we shouldn't be making a lot of other
changes between major versions either because there's often a good
chance that we'll break things.  Instead, the patch was accepted, as a
good and forward-moving change, with the understanding that it would
require some users to update their scripts when they move to v12, so,
for my part at least, that question was answered when we committed the
change and released with it.  Now, if we wish to adjust back-branches to
make it easier for users to have scripts that work with both versions,
that seems like a worthwhile change and is very unlikely to to cause
breakage- and it's certainly more likely to have users actually change
their scripts to use '-f -' explicitly when they start working with v12,
instead of depending on stdout being the default, which is ultimately
the goal and why the change was made in the first place.

If the concern is that we can expect folks to install v12 and then
refuse or be unable to upgrade back-branches, then I just don't have any
sympathy for that either- minor updates are extremely important, and new
major versions are certainly no cake walk to get installed, so that
argument just doesn't hold water with me- if they can upgrade to v12,
then they can update to the latest minor versions, if they actually need
to work with both concurrently (which strikes me as already at least
relatively uncommon...).

If you meant for all branches to accept '-f -' and have it go to a './-'
file then that's just a revert of this entire change, which I ca

Re: v12 and pg_restore -f-

2019-10-16 Thread Justin Pryzby
On Sun, Oct 06, 2019 at 04:43:13PM -0400, Tom Lane wrote:
> Nobody is going to wish that to mean "write to a file named '-'", so

Probably right, but it occurs to me that someone could make a named pipe called
that, possibly to make "dump to stdout" work with scripts that don't support
dumping to stdout, but with what's arguably a self-documenting syntax.

On Wed, Oct 09, 2019 at 09:07:40AM -0300, Euler Taveira wrote:
> > Maintenance scripts break across major versions.
...
> Yeah, if you check pg_restore version, you could use new syntax for
> 12+. We break scripts every release (mainly with catalog changes) and
> I don't know why this change is different than the other ones. The
> pg_restore changes is more user-friendly and less error-prone.

The issue isn't that scripts broke, but that after I fixed scripts to work with
v12, it's impossible (within pg_restore and without relying on shell or linux
conventions) to write something that works on both v11 and v12, without
conditionalizing on pg_restore --version.

On Wed, Oct 16, 2019 at 01:21:48PM -0400, Stephen Frost wrote:
> [...] if they actually need to work with both concurrently (which strikes me
> as already at least relatively uncommon...).

I doubt it's uncommon.  In our case, we have customers running centos6 and 7.
There's no postgis RPMs provided for centos6 which allow an upgrade path to
v12, so we'll end up supporting at least (centos6, pg11) and (centos7,pg12) for
months, at least.

I have a half dozen maintenance scripts to do things like reindex, vacuum,
cluster, alter tblspace.  In the immediate case, our backup script runs
pg_restore to check if an existing pg_dump backup of an old table is empty when
the table is not itself empty - which has happened before due to logic errors
and mishandled DST...  (We're taking advantage of timeseries partitioning so
daily backups exclude tables older than a certain threshold, which are assumed
to be unchanged, or at least not have data updated).

I'd *like* to be able to deploy our most recent maint scripts during the
interval of time our customers are running different major versions.  The
alternative being to try to remember to avoid deploying updated v12 scripts at
customers still running v11.  I went to the effort to make our vacuum/analyze
script support both versions following the OID change.

I worked around the pg_restore change using /dev/stdout ; possibly the
documentation should mention that workaround for portability to earlier
versions: that would work for maybe 85% of cases.  If need be, one could check
pg_restore --version.  But it's nicer not to need to.
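
For example (the dump file name here is just a placeholder):

    # portable across v11 and v12 on platforms that have /dev/stdout:
    pg_restore -f /dev/stdout backup.dump | head

    # v12 syntax, and older branches too if "-f -" gets backpatched:
    pg_restore -f - backup.dump | head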

Tom's proposed in February to backpatch the -f- behavior, so ISTM that we're
right now exactly where we (or at least he) planned to be, except that the
backpatch ideally should've been included in the minor releases in August,
before v12 was released.

https://www.postgresql.org/message-id/24868.1550106683%40sss.pgh.pa.us

Justin




Re: [Proposal] Global temporary tables

2019-10-16 Thread 曾文旌(义从)


> On Oct 12, 2019, at 1:16 PM, Pavel Stehule wrote:
> 
> 
> 
> On Fri, Oct 11, 2019 at 15:50, Konstantin Knizhnik  wrote:
> 
> 
> On 11.10.2019 15:15, 曾文旌(义从) wrote:
>> Dear Hackers,
>> 
>> This propose a way to develop global temporary tables in PostgreSQL.
>> 
>> I noticed that there is an "Allow temporary tables to exist as empty by 
>> default in all sessions" in the postgresql todolist.
>> https://wiki.postgresql.org/wiki/Todo 
>> 
>> In recent years, PG community had many discussions about global temp table 
>> (GTT) support. Previous discussion covered the following topics: 
>> (1)  The main benefit or function: GTT offers features like “persistent 
>> schema, ephemeral data”, which avoids catalog bloat and reduces catalog 
>> vacuum. 
>> (2)  Whether follows ANSI concept of temporary tables
>> (3)  How to deal with statistics, single copy of schema definition, relcache
>> (4)  More can be seen in 
>> https://www.postgresql.org/message-id/73954ab7-44d3-b37b-81a3-69bdcbb446f7%40postgrespro.ru
>>  
>> 
>> (5)  A recent implementation and design from Konstantin Knizhnik covered 
>> many functions of GTT: 
>> https://www.postgresql.org/message-id/attachment/103265/global_private_temp-1.patch
>>  
>> 
>> 
>> However, as pointed by Konstantin himself, the implementation still needs 
>> functions related to CLOG, vacuum, and MVCC visibility.
>> 
> 
> Just to clarify.
> I have now proposed several different solutions for GTT:
> 
> Shared vs. private buffers for GTT:
> 1. Private buffers. This is least invasive patch, requiring no changes in 
> relfilenodes.
> 2. Shared buffers. Requires changing relfilenode but supports parallel query 
> execution for GTT.
> 
> This is an important argument for using shared buffers. Maybe the best is a
> mix of both - store files in the temp tablespace, but use shared buffers.
> Moreover, it can then be accessible to autovacuum.
> 
> Access to GTT at replica:
> 1. Access is prohibited (as for original temp tables). No changes at all.
> 2. Tuples of temp tables are marked with frozen XID.  Minimal changes, 
> rollbacks are not possible.
> 3. Providing special XIDs for GTT at replica. No changes in CLOG are 
> required, but special MVCC visibility rules are used for GTT. Current 
> limitation: the number of transactions accessing GTT at the replica is limited by 2^32,
> and a bitmap of corresponding size has to be maintained (tuples of GTT are not
> processed by vacuum and not frozen, so the XID horizon never moves).
> 
> So except the limitation mentioned above (which I do not consider as 
> critical) there is only one problem which was not addressed: maintaining 
> statistics for GTT. 
> If all of the following conditions are true:
> 
> 1) GTT are used in joins
> 2) There are indexes defined for GTT
> 3) Size and histogram of GTT in different backends can significantly vary. 
> 4) ANALYZE was explicitly called for GTT
> 
> then query execution plan built in one backend will be also used for other 
> backends where it can be inefficient.
> I also do not consider this problem as "show stopper" for adding GTT to 
> Postgres.
> 
> The last issue is show stopper in my mind. It really depends on usage. There 
> are situation where shared statistics are ok (and maybe good solution), and 
> other situation, where shared statistics are just unusable.
This proposal calculates and stores independent statistics (relpages,
reltuples, and histograms of the GTT) for the GTT data within each session,
ensuring that the optimizer can get accurate statistics.


> Regards
> 
> Pavel
> 
> 
> 
> I still do not understand the opinion of community which functionality of GTT 
> is considered to be most important.
> But the patch with local buffers and no replica support is small enough to 
> become good starting point.
> 
> 
> -- 
> Konstantin Knizhnik
> Postgres Professional: http://www.postgrespro.com 
> 
> The Russian Postgres Company 



Re: v12 and pg_restore -f-

2019-10-16 Thread Stephen Frost
Greetings,

* Justin Pryzby (pry...@telsasoft.com) wrote:
> On Sun, Oct 06, 2019 at 04:43:13PM -0400, Tom Lane wrote:
> > Nobody is going to wish that to mean "write to a file named '-'", so
> 
> Probably right, but it occurs to me that someone could make a named pipe 
> called
> that, possibly to make "dump to stdout" work with scripts that don't support
> dumping to stdout, but with what's arguably a self-documenting syntax.

I'm not super keen to stress a great deal over "someone could" cases.
Yes, we can come up with lots of "what ifs" here to justify why someone
might think to do this but it still seems extremely rare to me.  It'd be
nice to get some actual numbers somehow but I haven't got any great
ideas about how to do that.  Actual research into this would probably be
to go digging through source code that's available to try and figure out
if such a case exists anywhere public.

> On Wed, Oct 09, 2019 at 09:07:40AM -0300, Euler Taveira wrote:
> > > Maintenance scripts break across major versions.
> ...
> > Yeah, if you check pg_restore version, you could use new syntax for
> > 12+. We break scripts every release (mainly with catalog changes) and
> > I don't know why this change is different than the other ones. The
> > pg_restore changes is more user-friendly and less error-prone.
> 
> The issue isn't that scripts broke, but that after I fixed scripts to work 
> with
> v12, it's impossible (within pg_restore and without relying on shell or linux
> conventions) to write something that works on both v11 and v12, without
> conditionalizing on pg_restore --version.

Right- you were happy (more-or-less) to update the scripts to deal with
the v12 changes, but didn't like that those changes then broke when run
against v11, something that would be fixed by correcting those earlier
versions.

> On Wed, Oct 16, 2019 at 01:21:48PM -0400, Stephen Frost wrote:
> > [...] if they actually need to work with both concurrently (which strikes me
> > as already at least relatively uncommon...).
> 
> I doubt it's uncommon.  In our case, we have customers running centos6 and 7.
> There's no postgis RPMs provided for centos6 which allow an upgrade path to
> v12, so we'll end up supporting at least (centos6, pg11) and (centos7,pg12) 
> for
> months, at least.

I suppose the issue here is that you don't want to have different
versions of some scripts for centos6/pg11 vs. centos7/pg12?  I'm a bit
surprised that you don't have to for reasons unrelated to pg_restore.

> I have a half dozen maintenance scripts to do things like reindex, vacuum,
> cluster, alter tblspace.  In the immediate case, our backup script runs
> pg_restore to check if an existing pg_dump backup of an old table is empty 
> when
> the table is not itself empty - which has happened before due to logic errors
> and mishandled DST...  (We're taking advantage of timeseries partitioning so
> daily backups exclude tables older than a certain thershold, which are assumed
> to be unchanged, or at least not have data updated).
> 
> I'd *like* to be able to deploy our most recent maint scripts during the
> interval of time our customers are running different major versions.  The
> alternative being to try to remember to avoid deploying updated v12 scripts at
> customers still running v11.  I went to the effort to make our vacuum/analyze
> script support both versions following the OID change.

And I suppose you don't want to install v12 client tools for the v11
systems..?  I get that there's an argument for that, but it does also
seem like it'd be an alternative solution.

> I worked around the pg_restore change using /dev/stdout ; possibly the
> documentation should mention that workaround for portability to earlier
> versions: that would work for maybe 85% of cases.  If need be, one could check
> pg_restore --version.  But it's nicer not to need to.
> 
> Tom's proposed in February to backpatch the -f- behavior, so ISTM that we're
> right now exactly where we (or at least he) planned to be, except that the
> backpatch ideally should've been included in the minor releases in August,
> before v12 was released.
> 
> https://www.postgresql.org/message-id/24868.1550106683%40sss.pgh.pa.us

That continues to strike me as a good way forward, and I'm guessing you
agree on that?  If so, sorry for not including you in my earlier email.

Thanks,

Stephen


signature.asc
Description: PGP signature


[Patch proposal] libpq portal support

2019-10-16 Thread Sergei Fedorov
Hello everybody,

Our company was in desperate need of portals in async interface of libpq,
so we patched it.

We would be happy to upstream the changes.

The description of changes:

Two functions in libpq-fe.h:
PQsendPortalBindParams for sending a command to bind a portal to a
previously prepared statement;
PQsendPortalExecute for executing a previously bound portal with a given
number of rows.

A patch to pqParseInput3 in fe-protocol3.c to handle the `portal suspended`
message tag.

The patch is ready for review, but it lacks documentation, tests and usage
examples.

There are no functions for sending bind without params and no functions for
sync interface, but they can easily be added to the feature.
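
A rough usage sketch follows; the argument lists shown are only my shorthand
here, the exact prototypes are in the patch:

    /* prepare asynchronously, as today */
    PQsendPrepare(conn, "stmt", "SELECT * FROM t WHERE a > $1", 1, NULL);
    /* ... drain PQgetResult() until it returns NULL ... */

    /* bind a named portal to the prepared statement (new function) */
    PQsendPortalBindParams(conn, "p1", "stmt", /* ... parameters ... */);
    /* ... drain PQgetResult() ... */

    /* execute the portal a batch at a time (new function); when the server
     * answers with "portal suspended", execute it again for the next batch */
    PQsendPortalExecute(conn, "p1", 100);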

-- 
Thank you,
Sergei Fedorov


Re: v12 and pg_restore -f-

2019-10-16 Thread Justin Pryzby
On Wed, Oct 16, 2019 at 03:04:52PM -0400, Stephen Frost wrote:

> > On Wed, Oct 16, 2019 at 01:21:48PM -0400, Stephen Frost wrote:
> > > [...] if they actually need to work with both concurrently (which strikes 
> > > me
> > > as already at least relatively uncommon...).
> > 
> > I doubt it's uncommon.  In our case, we have customers running centos6 and 
> > 7.
> > There's no postgis RPMs provided for centos6 which allow an upgrade path to
> > v12, so we'll end up supporting at least (centos6, pg11) and (centos7,pg12) 
> > for
> > months, at least.
> 
> I suppose the issue here is that you don't want to have different
> versions of some scripts for centos6/pg11 vs. centos7/pg12?  I'm a bit
> surprised that you don't have to for reasons unrelated to pg_restore.

Right, I don't want to "have to" do anything :)

If we really need a separate script, or a conditional, then we'll do that, but
it's nicer to be ABLE to write something (like this sanity check) in one line
and not NEED TO write it in six.  So far these maint scripts have a significant
bit of telsasoft-specific logic, but very little specific to postgres versions.  I
recall the schema-qualification stuff caused some churn, but it was in a minor
release, so everyone was upgraded within 30-40 days, and if they weren't, I
probably knew not to deploy updated scripts, either.

> > I'd *like* to be able to deploy our most recent maint scripts during the
> > interval of time our customers are running different major versions.  The
> > alternative being to try to remember to avoid deploying updated v12 scripts 
> > at
> > customers still running v11.  I went to the effort to make our 
> > vacuum/analyze
> > script support both versions following the OID change.
> 
> And I suppose you don't want to install v12 client tools for the v11
> systems..?  I get that there's an argument for that, but it does also
> seem like it'd be an alternative solution.

Hm, I'd be open to it, except that when I tried this during beta, the PGDG
RPMs were compiled with GSSAPI, which creates lots of warnings... but maybe
that's just in nagios...

> > Tom's proposed in February to backpatch the -f- behavior, so ISTM that we're
> > right now exactly where we (or at least he) planned to be, except that the
> > backpatch ideally should've been included in the minor releases in August,
> > before v12 was released.
> > 
> > https://www.postgresql.org/message-id/24868.1550106683%40sss.pgh.pa.us
> 
> That continues to strike me as a good way forward, and I'm guessing you
> agree on that?  If so, sorry for not including you in my earlier email.

I believe you did include me (?) - I started the thread (on -general).
https://www.postgresql.org/message-id/20191016172148.GH6962%40tamriel.snowman.net

I think it's a good idea to do some combination of backpatching -f-, maybe
documenting the behavior of pre-12 pg_restore in the v12 release notes, and
suggesting /dev/stdout as a likely workaround. Of course, if backpatched, the
behavior of pre-12 will vary, and should be documented as such, which is kind
of a lot, but well.

|In pg_restore, require specification of -f - to send the dump contents to 
standard output (Euler Taveira)
|Previously, this happened by default if no destination was specified, but that 
was deemed to be unfriendly.
|In the latest minor releases of versions v11 and earlier, pg_restore -f - is 
updated for
|consistency with the new behavior of v12, to allow scripts to be written which
|work on both.  But note that earlier releases of v9.3 to v11 don't specially
|handle "-f -", which will cause them to write to a file called "-" and not
|stdout.  If called under most unix shells, -f /dev/stdout will write to stdout 
on all versions of pg_restore.

It's not perfect - someone who wants portable behavior has to apply November's
minor upgrade before installing any v12 server.  And vendors (something like
pgadmin) will end up "having to" write to a file to be portable, or else check
the full version, not just the major version.

Justin




Re: WIP/PoC for parallel backup

2019-10-16 Thread Jeevan Ladhe
I quickly tried to have a look at your 0001-refactor patch.
Here are some comments:

1. The patch fails to compile.

Sorry if I am missing something, but I am not able to understand why, in the
new function collectTablespaces(), you have added an extra NULL parameter when
calling sendTablespace(); it fails to compile:

+ ti->size = infotbssize ? sendTablespace(fullpath, true, NULL) : -1;


gcc -Wall -Wmissing-prototypes -Wpointer-arith
-Wdeclaration-after-statement -Werror=vla -Wendif-labels
-Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv
-Wno-unused-command-line-argument -g -g -O0 -Wall -Werror
-I../../../../src/include-c -o xlog.o xlog.c -MMD -MP -MF .deps/xlog.Po
xlog.c:12253:59: error: too many arguments to function call, expected 2,
have 3
ti->size = infotbssize ? sendTablespace(fullpath, true,
NULL) : -1;
 ~~ ^~~~

2. I think the patch needs to be run through pgindent. It does not follow the
80-column width.
e.g.

+void
+collectTablespaces(List **tablespaces, StringInfo tblspcmapfile, bool
infotbssize, bool needtblspcmapfile)
+{

3.
The comments in the refactored code appear to be redundant. For example, the
following comment:
 /* Setup and activate network throttling, if client requested it */
appears thrice in the code, before calling setup_throttle(), in the
prologue of
the function setup_throttle(), and above the if() in that function.
Similarly - the comment:
/* Collect information about all tablespaces */
in collectTablespaces().

4.
In function include_wal_files(), why is the TimeLineID parameter (endtli)
needed? I don't see it being used in the function at all. I think you can
safely get rid of it.

+include_wal_files(XLogRecPtr endptr, TimeLineID endtli)

Regards,
Jeevan Ladhe

On Wed, Oct 16, 2019 at 6:49 PM Asif Rehman  wrote:

>
>
> On Mon, Oct 7, 2019 at 6:35 PM Asif Rehman  wrote:
>
>>
>>
>> On Mon, Oct 7, 2019 at 6:05 PM Robert Haas  wrote:
>>
>>> On Mon, Oct 7, 2019 at 8:48 AM Asif Rehman 
>>> wrote:
>>> > Sure. Though the backup manifest patch calculates and includes the
>>> checksum of backup files and is done
>>> > while the file is being transferred to the frontend-end. The manifest
>>> file itself is copied at the
>>> > very end of the backup. In parallel backup, I need the list of
>>> filenames before file contents are transferred, in
>>> > order to divide them into multiple workers. For that, the manifest
>>> file has to be available when START_BACKUP
>>> >  is called.
>>> >
>>> > That means, backup manifest should support its creation while
>>> excluding the checksum during START_BACKUP().
>>> > I also need the directory information as well for two reasons:
>>> >
>>> > - In plain format, base path has to exist before we can write the
>>> file. we can extract the base path from the file
>>> > but doing that for all files does not seem a good idea.
>>> > - base backup does not include the content of some directories but
>>> those directories although empty, are still
>>> > expected in PGDATA.
>>> >
>>> > I can make these changes part of parallel backup (which would be on
>>> top of backup manifest patch) or
>>> > these changes can be done as part of manifest patch and then parallel
>>> can use them.
>>> >
>>> > Robert what do you suggest?
>>>
>>> I think we should probably not use backup manifests here, actually. I
>>> initially thought that would be a good idea, but after further thought
>>> it seems like it just complicates the code to no real benefit.
>>
>>
>> Okay.
>>
>>
>>>   I
>>> suggest that the START_BACKUP command just return a result set, like a
>>> query, with perhaps four columns: file name, file type ('d' for
>>> directory or 'f' for file), file size, file mtime. pg_basebackup will
>>> ignore the mtime, but some other tools might find that useful
>>> information.
>>>
>> yes current patch already returns the result set. will add the additional
>> information.
>>
>>
>>> I wonder if we should also split START_BACKUP (which should enter
>>> non-exclusive backup mode) from GET_FILE_LIST, in case some other
>>> client program wants to use one of those but not the other.  I think
>>> that's probably a good idea, but not sure.
>>>
>>
>> Currently pg_basebackup does not enter in exclusive backup mode and other
>> tools have to
>> use pg_start_backup() and pg_stop_backup() functions to achieve that.
>> Since we are breaking
>> backup into multiple command, I believe it would be a good idea to have
>> this option. I will include
>> it in next revision of this patch.
>>
>>
>>>
>>> I still think that the files should be requested one at a time, not a
>>> huge long list in a single command.
>>>
>> sure, will make the change.
>>
>>
>>
>
> I have refactored the functionality into multiple smaller patches in order
> to make the review process easier. I have divided the code into backend
> changes and pg_basebackup changes. The
> backend replication system now supports the following command

Re: v12.0: segfault in reindex CONCURRENTLY

2019-10-16 Thread Justin Pryzby
On Sun, Oct 13, 2019 at 04:18:34PM -0300, Alvaro Herrera wrote:
> (FWIW I expect the crash is possible not just in reindex but also in
> CREATE INDEX CONCURRENTLY.)

FWIW, for sake of list archives, and for anyone running v12 hoping to avoid
crashing, I believe we hit this for DROP INDEX CONCURRENTLY, although I don't
have the backtrace to prove it.

Justin




Re: v12 and pg_restore -f-

2019-10-16 Thread Stephen Frost
Greetings,

* Justin Pryzby (pry...@telsasoft.com) wrote:
> On Wed, Oct 16, 2019 at 03:04:52PM -0400, Stephen Frost wrote:
> 
> > > On Wed, Oct 16, 2019 at 01:21:48PM -0400, Stephen Frost wrote:
> > > > [...] if they actually need to work with both concurrently (which 
> > > > strikes me
> > > > as already at least relatively uncommon...).
> > > 
> > > I doubt it's uncommon.  In our case, we have customers running centos6 
> > > and 7.
> > > There's no postgis RPMs provided for centos6 which allow an upgrade path 
> > > to
> > > v12, so we'll end up supporting at least (centos6, pg11) and 
> > > (centos7,pg12) for
> > > months, at least.
> > 
> > I suppose the issue here is that you don't want to have different
> > versions of some scripts for centos6/pg11 vs. centos7/pg12?  I'm a bit
> > surprised that you don't have to for reasons unrelated to pg_restore.
> 
> Right, I don't want to "have to" do anything :)

Sure, fair enough.

> If we really need a separate script, or a conditional, then we'll do that,
> but it's nicer to be ABLE TO write something (like this sanity check) in one
> line and not NEED TO write it in six.  So far these maint scripts have a
> significant bit of telsasoft-specific logic, but very little specific to
> postgres versions.  I recall the schema-qualification stuff caused some
> churn, but it was in a minor release, so everyone was upgraded within 30-40
> days, and if they weren't, I probably knew not to deploy updated scripts,
> either.

Hmm, that's interesting as a data point, at least.

> > > I'd *like* to be able to deploy our most recent maint scripts during the
> > > interval of time our customers are running different major versions.  The
> > > alternative being to try to remember to avoid deploying updated v12 
> > > scripts at
> > > customers still running v11.  I went to the effort to make our 
> > > vacuum/analyze
> > > script support both versions following the OID change.
> > 
> > And I suppose you don't want to install v12 client tools for the v11
> > systems..?  I get that there's an argument for that, but it does also
> > seem like it'd be an alternative solution.
> 
> Hm, I'd be open to it, except that when I tried this during beta, the PGDG
> RPMs are compiled with GSSAPI, which creates lots of warnings...but maybe
> that's just in nagios...

Warnings in the server log because they attempt to start a GSSAPI
encrypted session first, if you are authenticating with GSSAPI?  If so
then I'm sympathetic, but at least you could address that by setting
PGGSSENCMODE to disable, and that'd work for pre-v12 and v12+.

> > > Tom's proposed in February to backpatch the -f- behavior, so ISTM that 
> > > we're
> > > right now exactly where we (or at least he) planned to be, except that the
> > > backpatch ideally should've been included in the minor releases in August,
> > > before v12 was released.
> > > 
> > > https://www.postgresql.org/message-id/24868.1550106683%40sss.pgh.pa.us
> > 
> > That continues to strike me as a good way forward, and I'm guessing you
> > agree on that?  If so, sorry for not including you in my earlier email.
> 
> I believe you did include me (?) - I started the thread (on -general).
> https://www.postgresql.org/message-id/20191016172148.GH6962%40tamriel.snowman.net

Ah, no, I mean in the list of who was taking what position- I only named
Yoshikazu Imai, Andrew, Eurler, Tom and myself.

> I think it's a good idea to do some combination of backpatching -f-, maybe
> documenting the behavior of pre-12 pg_restore in the v12 release notes, and
> suggesting /dev/stdout as a likely workaround.  Of course, if backpatched,
> the behavior of pre-12 will vary, and should be documented as such, which is
> kind of a lot, but well.
> 
> |In pg_restore, require specification of -f - to send the dump contents to 
> standard output (Euler Taveira)
> |Previously, this happened by default if no destination was specified, but 
> that was deemed to be unfriendly.
> |In the latest minor releases of versions v11 and earlier, pg_restore -f - is 
> updated for
> |consistency with the new behavior of v12, to allow scripts to be written 
> which
> |work on both.  But note that earlier releases of v9.3 to v11 don't specially
> |handle "-f -", which will cause them to write to a file called "-" and not
> |stdout.  If called under most unix shells, -f /dev/stdout will write to 
> stdout on all versions of pg_restore.

We'd probably have to list the specific minor versions instead of just
saying "latest" and if we're suggesting an alternative course of action
then we might want to actually include that in the documentation
somewhere..  I'm not really sure that we want to get into such
platform-specific recommendations though.

> It's not perfect - someone who wants portable behavior has to apply November's
> minor upgrade before installing any v12 server.  And vendors (something like
> pgadmin) will end up "having to" write to a file to be portable, or else check
> the full version, not just the major version.

Re: Connect as multiple users using single client certificate

2019-10-16 Thread Stephen Frost
Greetings,

* Kyle Bateman (k...@batemans.org) wrote:
> What I hope to accomplish is: Establish a secure, encrypted connection to
> Postgresql from a trusted process, possibly running on another machine, whom
> I trust to tell me which user (within a limited set, defined by a role) it
> would like to connect as.  That process does it's own robust authentication
> of users before letting them through to the database by the username they
> claim.  However, it is still useful to connect as different users because my
> views and functions operate differently depending on which user is on the
> other end of the connection.
> 
> Is there a way I can accomplish this using the existing authentication
> methods (other than trust)?

Have you considered just having a regular client-side cert for the
middleware that logs in as a common user to the PG database, and then
performs a SET ROLE to whichever user the middleware has authenticated
the user as?  That seems to match pretty closely what you're looking for
and has the advantage that it'll also allow you to work through
connection poolers.
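
For illustration only, here is a minimal libpq sketch of that approach; the
connection parameters, certificate paths, database and role names below are
placeholders rather than anything from this thread, and the common login role
must of course have been granted membership in the target roles:

/* Sketch: the middleware authenticates with its own client certificate and
 * then switches to the already-authenticated end user with SET ROLE. */
#include <stdio.h>
#include "libpq-fe.h"

int
main(void)
{
	PGconn	   *conn = PQconnectdb("host=db.example.com dbname=app user=middleware "
								   "sslmode=verify-full sslcert=mw.crt sslkey=mw.key");
	PGresult   *res;

	if (PQstatus(conn) != CONNECTION_OK)
	{
		fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
		PQfinish(conn);
		return 1;
	}

	/* "alice" stands for whatever user the middleware has authenticated */
	res = PQexec(conn, "SET ROLE alice");
	if (PQresultStatus(res) != PGRES_COMMAND_OK)
		fprintf(stderr, "SET ROLE failed: %s", PQerrorMessage(conn));
	PQclear(res);

	/* ... run queries as "alice"; issue RESET ROLE before reusing the
	 * connection for a different user ... */

	PQfinish(conn);
	return 0;
}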

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: ERROR: multixact X from before cutoff Y found to be still running

2019-10-16 Thread Thomas Munro
On Thu, Oct 17, 2019 at 6:11 AM Jeremy Schneider  wrote:
> On 10/16/19 10:09, Bossart, Nathan wrote:
> > On 10/15/19, 11:11 PM, "Thomas Munro"  wrote:
> >> Here's a version with a proposed commit message and a comment.  Please
> >> let me know if I credited things to the right people!
> >
> > Looks good to me.  Thanks!
>
> +1

Pushed.




Re: maintenance_work_mem used by Vacuum

2019-10-16 Thread Greg Stark
It's a bit unfortunate that we're doing the pending list flush while the
vacuum memory is allocated at all. Is there any reason other than the way
the callbacks are defined that gin doesn't do the pending list flush before
vacuum does the heap scan and before it allocates any memory using
maintenance_work_mem?

(I'm guessing doing it after vacuum is finished would have different
problems with tuples in the pending queue not getting vacuumed?)


Re: v12.0: segfault in reindex CONCURRENTLY

2019-10-16 Thread Michael Paquier
On Wed, Oct 16, 2019 at 09:53:56AM -0300, Alvaro Herrera wrote:
> Thanks, pushed.

Thanks, Alvaro.
--
Michael


signature.asc
Description: PGP signature


Re: v12.0: segfault in reindex CONCURRENTLY

2019-10-16 Thread Michael Paquier
On Wed, Oct 16, 2019 at 04:11:46PM -0500, Justin Pryzby wrote:
> On Sun, Oct 13, 2019 at 04:18:34PM -0300, Alvaro Herrera wrote:
>> (FWIW I expect the crash is possible not just in reindex but also in
>> CREATE INDEX CONCURRENTLY.)
> 
> FWIW, for sake of list archives, and for anyone running v12 hoping to avoid
> crashing, I believe we hit this for DROP INDEX CONCURRENTLY, although I don't
> have the backtrace to prove it.

You may not have a backtrace, but I think that you are right:
WaitForLockers() gets called in index_drop() with progress reporting
enabled.  index_drop() is also reached by REINDEX CONCURRENTLY
through performMultipleDeletions(), but we cannot know whether it gets used
for REINDEX CONCURRENTLY or for DROP INDEX CONCURRENTLY as it goes
through the central deletion machinery, so we have to mark progress
reporting as true anyway.  Maybe that's worth a comment in index_drop
when calling WaitForLockers(), because it is not actually that obvious,
say like this:
@@ -2157,7 +2157,10 @@ index_drop(Oid indexId, bool concurrent, bool
concurrent_lock_mode)

/*
 * Wait till every transaction that saw the old index state has
-* finished.
+* finished.  Progress reporting is enabled here for REINDEX
+* CONCURRENTLY, but not for DROP INDEX CONCURRENTLY.  Track
+* the progress through WaitForLockers() anyway, the information
+* will not show up if using DROP INDEX CONCURRENTLY.
 */
WaitForLockers(heaplocktag, AccessExclusiveLock, true);

Thoughts?
--
Michael


signature.asc
Description: PGP signature


Re: Cache lookup errors with functions manipulation object addresses

2019-10-16 Thread Michael Paquier
On Wed, Oct 16, 2019 at 01:04:57PM +0200, Dmitry Dolgov wrote:
> Thanks for the update. Looking at v17 0003 I have one more question. In all 
> the
> places where we have to do systable_endscan, it is followed by heap_close,
> although in the rest of the file table_close is used. I guess this logic is 
> not
> heap specific and should also use table_close?

We need to use table_close as we cannot assume that the relation will
always be opened as a heap.  This patch set has been around for so long
that a rotten part like this could still compile, as we have a macro to
cover the gap.  Thanks for noticing that.  I am sending an updated
patch set in a bit; Alvaro had more comments.
--
Michael


signature.asc
Description: PGP signature


Re: Cache lookup errors with functions manipulation object addresses

2019-10-16 Thread Michael Paquier
On Thu, Sep 26, 2019 at 03:52:03PM +0900, Michael Paquier wrote:
> On Wed, Sep 25, 2019 at 09:21:03AM -0300, Alvaro Herrera wrote:
>> On 2019-Sep-24, Michael Paquier wrote:
>>> + * - FORMAT_TYPE_FORCE_NULL
>>> + *   if the type OID is invalid or unknown, return NULL instead of ???
>>> + *   or such
>>
>> I think FORCE_NULL is a strange name for this flag (and its two
>> siblings); I think something like FORMAT_TYPE_INVALID_AS_NULL is
>> clearer.
> 
> Good idea.  I see the point.

Got to think more on this one, and your idea is better.  So changed
this way.

>> I have still to review this comprehensively, but one thing I don't quite
>> like is the shape of the new regression tests, which seem a bit too
>> bulky ... why not use the style elsewhere in that file, with a large
>> VALUES clause to supply input params for a single query? That would be
>> a lot faster, too.
> 
> That makes sense.  Here is how I would write it then:
> WITH objects (classid, objid, objsubid) AS (VALUES
> ('pg_class'::regclass, 0, 0), -- no relation
> [ ... ]
>   )
> SELECT ROW(pg_identify_object(objects.classid, objects.objid, 
> objects.objsubid))
>  AS ident,
>ROW(pg_identify_object_as_address(objects.classid, objects.objid, 
> objects.objsubid))
>  AS addr,
>pg_describe_object(objects.classid, objects.objid, objects.objsubid)
>  AS descr
> FROM objects
> ORDER BY objects.classid, objects.objid, objects.objsubid;

That's actually cleaner, so changed as well.  Attached is an updated
patch set with the heap_close() calls removed as per the recent report
from Dmitry.
--
Michael
From 3d42d9faf77bf6345694aaee8ce5e2317238d362 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Thu, 17 Oct 2019 10:25:24 +0900
Subject: [PATCH v18 1/3] Add flag to format_type_extended to enforce NULL-ness

If a scanned type is undefined, the type format routines have two
behaviors, depending on whether FORMAT_TYPE_ALLOW_INVALID is passed by
the caller:
- Generate an error
- Return the undefined type name "???" or "-".

The current interface is unhelpful for callers that want to format a type
name properly while still making sure that the type is defined, as there
could be actual types matching the undefined strings.  In order to counter
that, add a new flag called FORMAT_TYPE_INVALID_AS_NULL which returns a
NULL result instead of "???" or "-", without generating an error.  This
will be used by future patches to improve the SQL interface for object
addresses.
---
 src/backend/utils/adt/format_type.c | 22 +-
 src/include/utils/builtins.h|  1 +
 2 files changed, 18 insertions(+), 5 deletions(-)

diff --git a/src/backend/utils/adt/format_type.c b/src/backend/utils/adt/format_type.c
index 6ae2a31345..8696b11a9e 100644
--- a/src/backend/utils/adt/format_type.c
+++ b/src/backend/utils/adt/format_type.c
@@ -96,13 +96,16 @@ format_type(PG_FUNCTION_ARGS)
  * - FORMAT_TYPE_ALLOW_INVALID
  *			if the type OID is invalid or unknown, return ??? or such instead
  *			of failing
+ * - FORMAT_TYPE_INVALID_AS_NULL
+ *			if the type OID is invalid or unknown, return NULL instead of ???
+ *			or such
  * - FORMAT_TYPE_FORCE_QUALIFY
  *			always schema-qualify type names, regardless of search_path
  *
  * Note that TYPEMOD_GIVEN is not interchangeable with "typemod == -1";
  * see the comments above for format_type().
  *
- * Returns a palloc'd string.
+ * Returns a palloc'd string, or NULL.
  */
 char *
 format_type_extended(Oid type_oid, int32 typemod, bits16 flags)
@@ -114,13 +117,20 @@ format_type_extended(Oid type_oid, int32 typemod, bits16 flags)
 	char	   *buf;
 	bool		with_typemod;
 
-	if (type_oid == InvalidOid && (flags & FORMAT_TYPE_ALLOW_INVALID) != 0)
-		return pstrdup("-");
+	if (type_oid == InvalidOid)
+	{
+		if ((flags & FORMAT_TYPE_INVALID_AS_NULL) != 0)
+			return NULL;
+		else if ((flags & FORMAT_TYPE_ALLOW_INVALID) != 0)
+			return pstrdup("-");
+	}
 
 	tuple = SearchSysCache1(TYPEOID, ObjectIdGetDatum(type_oid));
 	if (!HeapTupleIsValid(tuple))
 	{
-		if ((flags & FORMAT_TYPE_ALLOW_INVALID) != 0)
+		if ((flags & FORMAT_TYPE_INVALID_AS_NULL) != 0)
+			return NULL;
+		else if ((flags & FORMAT_TYPE_ALLOW_INVALID) != 0)
 			return pstrdup("???");
 		else
 			elog(ERROR, "cache lookup failed for type %u", type_oid);
@@ -143,7 +153,9 @@ format_type_extended(Oid type_oid, int32 typemod, bits16 flags)
 		tuple = SearchSysCache1(TYPEOID, ObjectIdGetDatum(array_base_type));
 		if (!HeapTupleIsValid(tuple))
 		{
-			if ((flags & FORMAT_TYPE_ALLOW_INVALID) != 0)
+			if ((flags & FORMAT_TYPE_INVALID_AS_NULL) != 0)
+return NULL;
+			else if ((flags & FORMAT_TYPE_ALLOW_INVALID) != 0)
 return pstrdup("???[]");
 			else
 elog(ERROR, "cache lookup failed for type %u", type_oid);
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index 937ddb7ef0..44298d7410 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -109,6 +109,7 @@ extern Datum numeric_float8_no_ov
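
For illustration, a minimal caller sketch based only on the signature and the
flag shown in the hunks above (the helper name here is made up):

/* Sketch: format a type name, getting NULL rather than "???" or an error
 * when the type does not exist (e.g. it was dropped concurrently). */
#include "postgres.h"
#include "utils/builtins.h"

static char *
get_type_name_or_null(Oid typid)
{
	/* typemod -1 means "not given"; result is palloc'd, or NULL if undefined */
	return format_type_extended(typid, -1, FORMAT_TYPE_INVALID_AS_NULL);
}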

Remaining calls of heap_close/heap_open in the tree

2019-10-16 Thread Michael Paquier
Hi all,

I have just bumped into $subject, and we now use the table_*
equivalents in the code.  Any objections to the simple patch attached
to clean that up?

Thanks,
--
Michael
diff --git a/src/backend/commands/statscmds.c b/src/backend/commands/statscmds.c
index f51eb7bb64..0ab43deb71 100644
--- a/src/backend/commands/statscmds.c
+++ b/src/backend/commands/statscmds.c
@@ -637,7 +637,7 @@ UpdateStatisticsForTypeChange(Oid statsOid, Oid relationOid, int attnum,
 	replaces[Anum_pg_statistic_ext_data_stxdmcv - 1] = true;
 	nulls[Anum_pg_statistic_ext_data_stxdmcv - 1] = true;
 
-	rel = heap_open(StatisticExtDataRelationId, RowExclusiveLock);
+	rel = table_open(StatisticExtDataRelationId, RowExclusiveLock);
 
 	/* replace the old tuple */
 	stup = heap_modify_tuple(oldtup,
@@ -651,7 +651,7 @@ UpdateStatisticsForTypeChange(Oid statsOid, Oid relationOid, int attnum,
 
 	heap_freetuple(stup);
 
-	heap_close(rel, RowExclusiveLock);
+	table_close(rel, RowExclusiveLock);
 }
 
 /*
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index cf1761401d..e5f9e04d65 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -2103,12 +2103,12 @@ has_stored_generated_columns(PlannerInfo *root, Index rti)
 	bool		result = false;
 
 	/* Assume we already have adequate lock */
-	relation = heap_open(rte->relid, NoLock);
+	relation = table_open(rte->relid, NoLock);
 
 	tupdesc = RelationGetDescr(relation);
 	result = tupdesc->constr && tupdesc->constr->has_generated_stored;
 
-	heap_close(relation, NoLock);
+	table_close(relation, NoLock);
 
 	return result;
 }


signature.asc
Description: PGP signature


RE: New SQL counter statistics view (pg_stat_sql)

2019-10-16 Thread Smith, Peter
Dear Hackers,

We have resurrected this 2-year-old "SQL statements statistics counter"
proposal from Hari.

The attached patch addresses the previous review issues.

The commit-fest entry has been moved forward to the next commit-fest
https://commitfest.postgresql.org/25/790/

Please review again, and consider if this is OK for "Ready for Committer" 
status.

Kind Regards
--
Peter Smith
Fujitsu Australia



-Original Message-
From: pgsql-hackers-ow...@postgresql.org  
On Behalf Of Andres Freund
Sent: Thursday, 6 April 2017 5:18 AM
To: Haribabu Kommi 
Cc: Michael Paquier ; Dilip Kumar 
; vinayak ; 
pgsql-hack...@postgresql.org
Subject: Re: New SQL counter statistics view (pg_stat_sql)

Hi,


I'm somewhat inclined to think that this really would be better done in an 
extension like pg_stat_statements.


On 2017-03-08 14:39:23 +1100, Haribabu Kommi wrote:
> On Wed, Feb 1, 2017 at 3:13 PM, Michael Paquier 
> 

> + 
> +  track_sql (boolean)
> +  
> +   track_sql configuration parameter
> +  
> +  
> +  
> +   
> +Enables collection of different SQL statement statistics that are
> +executed on the instance. This parameter is off by default. Only
> +superusers can change this setting.
> +   
> +  
> + 
> +

I don't like this name much, seems a bit too generic to me. 'track_activities', 
'track_io_timings' are at least a bit clearer.
How about track_statement_statistics + corresponding view/function renaming?


> +  
> +   pg_stat_sql View
> +   
> +
> +
> +  Column
> +  Type
> +  Description
> + 
> +
> +
> +   
> +
> + tag
> + text
> + Name of the SQL statement
> +

It's not really the "name" of a statement. Internally and IIRC elsewhere in the 
docs we describe something like this as tag?


> +/* --
> + * pgstat_send_sqlstmt(void)
> + *
> + *   Send SQL statement statistics to the collector
> + * --
> + */
> +static void
> +pgstat_send_sqlstmt(void)
> +{
> + PgStat_MsgSqlstmt msg;
> + PgStat_SqlstmtEntry *entry;
> + HASH_SEQ_STATUS hstat;
> +
> + if (pgstat_backend_sql == NULL)
> + return;
> +
> + pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_SQLSTMT);
> + msg.m_nentries = 0;
> +
> + hash_seq_init(&hstat, pgstat_backend_sql);
> + while ((entry = (PgStat_SqlstmtEntry *) hash_seq_search(&hstat)) != 
> NULL)
> + {
> + PgStat_SqlstmtEntry *m_ent;
> +
> + /* Skip it if no counts accumulated since last time */
> + if (entry->count == 0)
> + continue;
> +
> + /* need to convert format of time accumulators */
> + m_ent = &msg.m_entry[msg.m_nentries];
> + memcpy(m_ent, entry, sizeof(PgStat_SqlstmtEntry));
> +
> + if (++msg.m_nentries >= PGSTAT_NUM_SQLSTMTS)
> + {
> + pgstat_send(&msg, offsetof(PgStat_MsgSqlstmt, 
> m_entry[0]) +
> + msg.m_nentries * 
> sizeof(PgStat_SqlstmtEntry));
> + msg.m_nentries = 0;
> + }
> +
> + /* reset the entry's count */
> + entry->count = 0;
> + }
> +
> + if (msg.m_nentries > 0)
> + pgstat_send(&msg, offsetof(PgStat_MsgSqlstmt, m_entry[0]) +
> + msg.m_nentries * 
> sizeof(PgStat_SqlstmtEntry));

Hm. So pgstat_backend_sql is never deleted, which'll mean that if a backend 
lives long we'll continually iterate over a lot of 0 entries.  I think 
performance evaluation of this feature should take that into account.


> +}
> +
> +/*
> + * Count SQL statement for pg_stat_sql view  */ void 
> +pgstat_count_sqlstmt(const char *commandTag) {
> + PgStat_SqlstmtEntry *htabent;
> + boolfound;
> +
> + if (!pgstat_backend_sql)
> + {
> + /* First time through - initialize SQL statement stat table */
> + HASHCTL hash_ctl;
> +
> + memset(&hash_ctl, 0, sizeof(hash_ctl));
> + hash_ctl.keysize = NAMEDATALEN;
> + hash_ctl.entrysize = sizeof(PgStat_SqlstmtEntry);
> + pgstat_backend_sql = hash_create("SQL statement stat entries",
> + 
>  PGSTAT_SQLSTMT_HASH_SIZE,
> + 
>  &hash_ctl,
> + 
>  HASH_ELEM | HASH_BLOBS);
> + }
> +
> + /* Get the stats entry for this SQL statement, create if necessary */
> + htabent = hash_search(pgstat_backend_sql, commandTag,
> +   HASH_ENTER, &found);
> + if (!found)
> + htabent->count = 1;
> + else
> + htabent->count++;
> +}


That's not really something in this patch, but a lot of this would be

Re: ICU for global collation

2019-10-16 Thread Thomas Munro
On Wed, Oct 9, 2019 at 12:16 AM Marius Timmer
 wrote:
> like the others before me, we (the university of Münster) are happy to
> see this feature as well. Thank you for this.
>
> When I applied the patch two weeks ago I ran into the issue that initdb
> did not recognize the new parameters (collation-provider and icu-locale),
> but I guess it was caused by my own stupidity.
>
> > When trying databases defined with ICU locales, I see that backends
> > that serve such databases seem to have their LC_CTYPE inherited from
> > the environment (as opposed to a per-database fixed value).
> I am able to recreate the issue described by Daniel on my machine.
>
> Now it works as expected. I just had to update the patch since commit
> 3f6b3be3 had modified two lines, which resulted in conflicts. You will
> find the updated patch attached to this mail.

I rebased this patch, and tweaked get_collation_action_version() very
slightly so that you get collation version change detection (of the
ersatz kind provided by commit d5ac14f9) for the default collation
even when not using ICU.  Please see attached.

+struct pg_locale_struct global_locale;

Why not "default_locale"?  Where is the terminology "global" coming from?

+Specifies the collation provider for the database.

"for the database's default collation"?


v2-0001-Add-option-to-use-ICU-as-global-collation-provide.patch
Description: Binary data


Re: Questions/Observations related to Gist vacuum

2019-10-16 Thread Amit Kapila
On Wed, Oct 16, 2019 at 11:20 AM Dilip Kumar  wrote:
>
> On Tue, Oct 15, 2019 at 7:13 PM Heikki Linnakangas  wrote:
> >
> > On 15/10/2019 09:37, Amit Kapila wrote:
> > > While reviewing a parallel vacuum patch [1], we noticed a few things
> > > about $SUBJECT implemented in commit -
> > > 7df159a620b760e289f1795b13542ed1b3e13b87.
> > >
> > > 1. A new memory context GistBulkDeleteResult->page_set_context has
> > > been introduced, but it doesn't seem to be used.
> >
> > Oops. internal_page_set and empty_leaf_set were supposed to be allocated
> > in that memory context. As things stand, we leak them until end of
> > vacuum, in a multi-pass vacuum.
>
> Here is a patch to fix this issue.
>

The patch looks good to me.  I have slightly modified the comments and
removed unnecessary initialization.

Heikki, are you fine with me committing and backpatching this to 12?  Let
me know if you have a different idea for the fix.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


0001-Fix-memory-leak-introduced-in-commit-7df159a620.patch
Description: Binary data
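
For readers of the archive, a rough sketch of what the fix presumably amounts
to, using only the names mentioned in this thread (this is not the literal
committed hunk):

/* Sketch: allocate the page sets in the dedicated context so that a
 * multi-pass vacuum can release them by resetting page_set_context.
 * Assumes the GistBulkDeleteResult fields discussed above; needs
 * "lib/integerset.h" and "utils/memutils.h". */
MemoryContext oldcxt = MemoryContextSwitchTo(stats->page_set_context);

stats->internal_page_set = intset_create();
stats->empty_leaf_set = intset_create();

MemoryContextSwitchTo(oldcxt);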


Re: configure fails for perl check on CentOS8

2019-10-16 Thread Kyotaro Horiguchi
Hello.

# I'm almost coming back..

At Wed, 16 Oct 2019 10:41:19 -0400, Andrew Dunstan 
 wrote in 
> On Wed, Oct 16, 2019 at 8:32 AM Tom Lane  wrote:
> > >> $ ./configure --with-perl CFLAGS=-O0
> > >> It is bash-aliased and survived without being found for a long time on
> > >> my Cent7 environment, but CentOS8 doesn't allow that.
> >
> > I don't quite understand why that wouldn't work.

I was confusing it with make CFLAGS=..

> It looks like you need CFLAGS='-O0 -fPIC' on CentOS 8 when building with perl.

Yes, that is what my first mail said. I dug into it further.

The problematic command generated by autoconf is:


gcc -o /tmp/conftest -Wall -Wmissing-prototypes -Wpointer-arith \
  -Wdeclaration-after-statement -Werror=vla -Wendif-labels \
  -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing \
  -fwrapv -fexcess-precision=standard -Wno-format-truncation \
  -Wno-stringop-truncation \
  -O0 \
  -D_GNU_SOURCE  -I/usr/lib64/perl5/CORE \
  /tmp/conftest.c \
  -Wl,-z,relro -Wl,-z,now -specs=/usr/lib/rpm/redhat/redhat-hardened-ld \
  -fstack-protector-strong -L/usr/local/lib  -L/usr/lib64/perl5/CORE \
  -lperl -lpthread -lresolv -ldl -lm -lcrypt -lutil -lc

/usr/bin/ld: /tmp/ccGxodNv.o: relocation R_X86_64_32 against symbol 
`PL_memory_wrap' can not be used when making a PIE object; recompile with -fPIC
/usr/bin/ld: final link failed: Nonrepresentable section on output
collect2: error: ld returned 1 exit status

Very interestingly, I don't get the error when the "-O0" is replaced with
"-O2". That is presumably because gcc eliminates the PL_memory_wrap
reference, maybe by inlining.

The *final* problematic command boils down to:

gcc -o /tmp/conftest /tmp/conftest_O0.o \
  -specs=/usr/lib/rpm/redhat/redhat-hardened-ld -lperl

That is, the problem doesn't happen without
"-specs=/usr/lib/rpm/redhat/redhat-hardened-ld".

I found the following bug report, which has never been fixed since it was
filed in 2016 against Fedora 24. I'm not sure about newer versions.

https://bugzilla.redhat.com/show_bug.cgi?id=1343892

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Questions/Observations related to Gist vacuum

2019-10-16 Thread Amit Kapila
On Wed, Oct 16, 2019 at 7:21 PM Heikki Linnakangas  wrote:
>
> On 16 October 2019 12:57:03 CEST, Amit Kapila  wrote:
> >On Tue, Oct 15, 2019 at 7:13 PM Heikki Linnakangas 
> >wrote:
> >> All things
> >> considered, I'm not sure which is better.
> >
> >Yeah, this is a tough call to make, but if we can allow it to delete
> >the pages in bulkdelete conditionally for parallel vacuum workers,
> >then it would be better.
>
> Yeah, if it's needed for parallel vacuum, maybe that tips the scale.
>

Makes sense.  I think we can write a patch for it and prepare the
parallel vacuum patch on top of it.  Once the parallel vacuum is in a
committable shape, we can commit the gist-index related patch first,
followed by the parallel vacuum patch.

> Hopefully, multi-pass vacuums are rare in practice. And we should lift the 
> current 1 GB limit on the dead TID array, replacing it with something more 
> compact and expandable, to make multi-pass vacuums even more rare. So I don't 
> think we need to jump through many hoops to optimize the multi-pass case.
>

Yeah, that will be a good improvement.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Proposal: Make use of C99 designated initialisers for nulls/values arrays

2019-10-16 Thread vignesh C
On Tue, Oct 8, 2019 at 4:43 AM Smith, Peter  wrote:
>
> From: Amit Kapila  Sent: Friday, 4 October 2019 4:50 
> PM
>
> >>How about I just define them both the same?
> >>#define INIT_ALL_ELEMS_ZERO {0}
> >>#define INIT_ALL_ELEMS_FALSE{0}
> >
> >I think using one define would be preferred, but you can wait and see if 
> >others prefer defining different macros for the same thing.
>
> While nowhere near unanimous, it seems the majority favour using a macro (if
> only to protect the unwary and document the behaviour).
> And of those in favour of macros, using INIT_ALL_ELEMS_ZERO even for bool
> arrays is a clear preference.
>
> So, please find attached the updated patch, which now has just 1 macro.
A few thoughts on the patch:
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -770,8 +770,8 @@ pg_prepared_xact(PG_FUNCTION_ARGS)
  GlobalTransaction gxact = &status->array[status->currIdx++];
  PGPROC   *proc = &ProcGlobal->allProcs[gxact->pgprocno];
  PGXACT   *pgxact = &ProcGlobal->allPgXact[gxact->pgprocno];
- Datum values[5];
- bool nulls[5];
+ Datum values[5] = INIT_ALL_ELEMS_ZERO;
+ bool nulls[5] = INIT_ALL_ELEMS_ZERO;
  HeapTuple tuple;
  Datum result;
Initialisation may not be required here, as all the members are getting
populated immediately.

@@ -1314,9 +1314,6 @@ SetDefaultACL(InternalDefaultACL *iacls)
  Oid defAclOid;

  /* Prepare to insert or update pg_default_acl entry */
- MemSet(values, 0, sizeof(values));
- MemSet(nulls, false, sizeof(nulls));
- MemSet(replaces, false, sizeof(replaces));

  if (isNew)
 We can place the comment just before the next block of code for
better readability, like you have done in other places.


@@ -2024,9 +2018,6 @@ ExecGrant_Relation(InternalGrant *istmt)
  nnewmembers = aclmembers(new_acl, &newmembers);

  /* finished building new ACL value, now insert it */
- MemSet(values, 0, sizeof(values));
- MemSet(nulls, false, sizeof(nulls));
- MemSet(replaces, false, sizeof(replaces));

  replaces[Anum_pg_class_relacl - 1] = true;
 We can place the comment just before the next block of code for
better readability, like you have done in other places.
 There are a few more instances like this in the same file; we can
handle those too.

-- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -77,7 +77,7 @@ pg_create_physical_replication_slot(PG_FUNCTION_ARGS)
  bool immediately_reserve = PG_GETARG_BOOL(1);
  bool temporary = PG_GETARG_BOOL(2);
  Datum values[2];
- bool nulls[2];
+ bool nulls[2] = INIT_ALL_ELEMS_ZERO;
  TupleDesc tupdesc;
  HeapTuple tuple;
  Datum result;
@@ -95,12 +95,10 @@ pg_create_physical_replication_slot(PG_FUNCTION_ARGS)
  InvalidXLogRecPtr);

  values[0] = NameGetDatum(&MyReplicationSlot->data.name);
- nulls[0] = false;

  if (immediately_reserve)
  {
  values[1] = LSNGetDatum(MyReplicationSlot->data.restart_lsn);
- nulls[1] = false;
  }
  else
  nulls[1] = true;
 We might not gain much here; maybe this change is not required.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com




Re: Questions/Observations related to Gist vacuum

2019-10-16 Thread Dilip Kumar
On Thu, Oct 17, 2019 at 9:15 AM Amit Kapila  wrote:
>
> On Wed, Oct 16, 2019 at 7:21 PM Heikki Linnakangas  wrote:
> >
> > On 16 October 2019 12:57:03 CEST, Amit Kapila  
> > wrote:
> > >On Tue, Oct 15, 2019 at 7:13 PM Heikki Linnakangas 
> > >wrote:
> > >> All things
> > >> considered, I'm not sure which is better.
> > >
> > >Yeah, this is a tough call to make, but if we can allow it to delete
> > >the pages in bulkdelete conditionally for parallel vacuum workers,
> > >then it would be better.
> >
> > Yeah, if it's needed for parallel vacuum, maybe that tips the scale.
> >
>
> makes sense.  I think we can write a patch for it and prepare the
> parallel vacuum patch on top of it.  Once the parallel vacuum is in a
> committable shape, we can commit the gist-index related patch first
> followed by parallel vacuum patch.

+1
I can write a patch for the same.

> > Hopefully, multi-pass vacuums are rare in practice. And we should lift the 
> > current 1 GB limit on the dead TID array, replacing it with something more 
> > compact and expandable, to make multi-pass vacuums even more rare. So I 
> > don't think we need to jump through many hoops to optimize the multi-pass 
> > case.
> >
>
> Yeah, that will be a good improvement.
+1

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: maintenance_work_mem used by Vacuum

2019-10-16 Thread Amit Kapila
On Thu, Oct 17, 2019 at 6:07 AM Greg Stark  wrote:
>
> It's a bit unfortunate that we're doing the pending list flush while the 
> vacuum memory is allocated at all. Is there any reason other than the way the 
> callbacks are defined that gin doesn't do the pending list flush before 
> vacuum does the heap scan and before it allocates any memory using 
> maintenance_work_mem?
>

I can't think of any other reason.  Can we think of doing it as a
separate phase for indexes?  That can help contain the memory
usage to a maximum of maintenance_work_mem for Gin indexes, but I am
not sure how useful it is for other index AMs.  Another idea could be
that we do something special (cleanup of the pending list) just for Gin
indexes before the heap scan in Vacuum.

> (I'm guessing doing it after vacuum is finished would have different problems 
> with tuples in the pending queue not getting vacuumed?)

Yeah, I also think so.

BTW, good point!

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: WIP/PoC for parallel backup

2019-10-16 Thread Asif Rehman
On Thu, Oct 17, 2019 at 1:33 AM Jeevan Ladhe 
wrote:

> I quickly tried to have a look at your 0001-refactor patch.
> Here are some comments:
>
> 1. The patch fails to compile.
>
> Sorry if I am missing something, but I am not able to understand why, in the
> new function collectTablespaces(), you have added an extra parameter NULL
> while calling sendTablespace(); it fails the compilation:
>
> + ti->size = infotbssize ? sendTablespace(fullpath, true, NULL) : -1;
>
>
> gcc -Wall -Wmissing-prototypes -Wpointer-arith
> -Wdeclaration-after-statement -Werror=vla -Wendif-labels
> -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv
> -Wno-unused-command-line-argument -g -g -O0 -Wall -Werror
> -I../../../../src/include-c -o xlog.o xlog.c -MMD -MP -MF .deps/xlog.Po
> xlog.c:12253:59: error: too many arguments to function call, expected 2,
> have 3
> ti->size = infotbssize ? sendTablespace(fullpath, true,
> NULL) : -1;
>  ~~
> ^~~~
>
> 2. I think the patch needs to be run through pgindent. It does not follow
> the 80-column width.
> e.g.
>
> +void
> +collectTablespaces(List **tablespaces, StringInfo tblspcmapfile, bool
> infotbssize, bool needtblspcmapfile)
> +{
>
> 3.
> The comments in the re-factored code appear to be redundant. For example,
> the following comment:
>  /* Setup and activate network throttling, if client requested it */
> appears thrice in the code: before calling setup_throttle(), in the
> prologue of the function setup_throttle(), and above the if() in that
> function.
> Similarly for the comment:
> /* Collect information about all tablespaces */
> in collectTablespaces().
>
> 4.
> In the function include_wal_files(), why is the TimeLineID parameter, i.e.
> endtli, needed?  I don't see it being used in the function at all. I think
> you can safely get rid of it.
>
> +include_wal_files(XLogRecPtr endptr, TimeLineID endtli)
>
>
Thanks Jeevan. Some changes that should be part of the 2nd patch were left
in the 1st. I have fixed that, and the above-mentioned issues as well.
Attached are the updated patches.

Thanks,

--
Asif Rehman
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca


0001-Refactor-some-basebackup-code-to-increase-reusabilit.patch
Description: Binary data


0002-backend-changes-for-parallel-backup.patch
Description: Binary data


0004-parallel-backup-testcase.patch
Description: Binary data


0003-pg_basebackup-changes-for-parallel-backup.patch
Description: Binary data


Re: SegFault on 9.6.14

2019-10-16 Thread Thomas Munro
On Fri, Sep 13, 2019 at 1:35 AM Robert Haas  wrote:
> On Thu, Sep 12, 2019 at 8:55 AM Amit Kapila  wrote:
> > Robert, Thomas, do you have any more suggestions related to this.  I
> > am planning to commit the above-discussed patch (Forbid Limit node to
> > shutdown resources.) coming Monday, so that at least the reported
> > problem got fixed.
>
> I think that your commit message isn't very clear about what the
> actual issue is.  And the patch itself doesn't add any comments or
> anything to try to clear it up. So I wouldn't favor committing it in
> this form.

Is the proposed commit message at the bottom of this email an improvement?

Do I understand correctly that, with this patch, we can only actually
lose statistics in the case where we rescan?  That is, precisely the
case that crashes (9.6) or spews warnings (10+)?  In a quick
non-rescan test with the ExecShutdownNode() removed, I don't see any
problem with the buffer numbers on my screen:

postgres=# explain (analyze, buffers, timing off, costs off) select
count(*) from t limit 5;
  QUERY PLAN
--
 Limit (actual rows=1 loops=1)
   Buffers: shared hit=16210 read=28038
   ->  Finalize Aggregate (actual rows=1 loops=1)
 Buffers: shared hit=16210 read=28038
 ->  Gather (actual rows=3 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   Buffers: shared hit=16210 read=28038
   ->  Partial Aggregate (actual rows=1 loops=3)
 Buffers: shared hit=16210 read=28038
 ->  Parallel Seq Scan on t (actual rows=333 loops=3)
   Buffers: shared hit=16210 read=28038
 Planning Time: 0.086 ms
 Execution Time: 436.669 ms
(14 rows)

===
Don't shut down Gather[Merge] early under Limit.

Revert part of commit 19df1702f5.

Early shutdown was added by that commit so that we could collect
statistics from workers, but unfortunately it interacted badly with
rescans.  Rescanning a Limit over a Gather node could produce a SEGV
on 9.6 and resource leak warnings on later releases.  By reverting the
early shutdown code, we might lose statistics in some cases of Limit
over Gather, but that will require further study to fix.

Author: Amit Kapila, testcase by Vignesh C
Reported-by: Jerry Sievers
Diagnosed-by: Thomas Munro
Backpatch-through: 9.6
Discussion: https://postgr.es/m/87ims2amh6@jsievers.enova.com
===




Re: [HACKERS] Block level parallel vacuum

2019-10-16 Thread Masahiko Sawada
On Wed, Oct 16, 2019 at 3:02 PM Amit Kapila  wrote:
>
> On Wed, Oct 16, 2019 at 6:50 AM Masahiko Sawada  wrote:
> >
> > On Tue, Oct 15, 2019 at 6:33 PM Amit Kapila  wrote:
> > >
> >
> > Attached updated patch set. The 0001 patch introduces the new index AM
> > field amcanparallelvacuum. All index AMs except gist set it to true for
> > now. The 0002 patch incorporated all the comments I got so far.
> >
>
> I haven't studied the latest patch in detail, but it seems you are
> still assuming that all indexes will have the same amount of shared
> memory for index stats and copying it in the same way.

Yeah, I thought we agreed at least to have canparallelvacuum, so that if an
index AM cannot support parallel index vacuuming, like gist, it returns
false.
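
Just to visualise that, a minimal sketch assuming the boolean field proposed
by the 0001 patch is named amcanparallelvacuum (it does not exist in core;
IndexAmRoutine, makeNode() and gisthandler() are existing backend symbols,
and the other assignments are elided):

/* Sketch only: gist opting out of parallel index vacuum via the proposed
 * flag.  Needs "access/amapi.h" and "fmgr.h". */
Datum
gisthandler(PG_FUNCTION_ARGS)
{
	IndexAmRoutine *amroutine = makeNode(IndexAmRoutine);

	/* ... existing capability flags and support callbacks elided ... */
	amroutine->amcanparallelvacuum = false;	/* proposed field from the 0001 patch */

	PG_RETURN_POINTER(amroutine);
}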

> I thought we
> agreed that each index AM should do this on its own.  The basic
> problem is as of now we see this problem only with the Gist index, but
> some other index AM's could also have a similar problem.

Okay. I'm thinking we're going to have a new callback to ask index AMs for
the size of the structure they use within both ambulkdelete and
amvacuumcleanup. But copying it to DSM can be done by the core, because
it then knows how many bytes need to be copied to DSM. Is that okay?

>
> Another major problem with previous and this patch version is that the
> cost-based vacuum concept seems to be entirely broken.  Basically,
> each parallel vacuum worker operates independently w.r.t vacuum delay
> and cost.  Assume that the overall I/O allowed for vacuum operation is
> X after which it will sleep for some time, reset the balance and
> continue.  In the patch, each worker will be allowed to perform X
> before which it can sleep and also there is no coordination for the
> same with master backend.  This is somewhat similar to memory usage
> problem, but a bit more tricky because here we can't easily split the
> I/O for each of the worker.
>
> One idea could be that we somehow map vacuum costing related
> parameters to the shared memory (dsm) which the vacuum operation is
> using and then allow workers to coordinate.  This way master and
> worker processes will have the same view of balance cost and can act
> accordingly.
>
> The other idea could be that we come up with some smart way to split
> the I/O among workers.  Initially, I thought we could try something as
> we do for autovacuum workers (see autovac_balance_cost), but I think
> that will require much more math.  Before launching workers, we need
> to compute the remaining I/O (heap operation would have used
> something) after which we need to sleep and continue the operation and
> then somehow split it equally across workers.  Once the workers are
> finished, then need to let master backend know how much I/O they have
> consumed and then master backend can add it to it's current I/O
> consumed.
>
> I think this problem matters because the vacuum delay is useful for
> large vacuums and this patch is trying to exactly solve that problem,
> so we can't ignore this problem.  I am not yet sure what is the best
> solution to this problem, but I think we need to do something for it.
>

I guess that the concept of vacuum delay contradicts the concept of
parallel vacuum. The point of parallel vacuum is to use more resources
to make vacuum faster, whereas vacuum delay balances I/O during vacuum
in order to avoid I/O spikes, so parallel vacuum rather concentrates
the I/O into a shorter duration. Since memory is shared across the
entire system we need to deal with the memory issue, but disks are
different.

If we need to deal with this problem, how about just dividing
vacuum_cost_limit by the parallel degree and setting that as each
worker's vacuum_cost_limit?

Regards,

--
Masahiko Sawada




Re: recovery_min_apply_delay in archive recovery causes assertion failure in latch

2019-10-16 Thread Michael Paquier
On Tue, Oct 08, 2019 at 02:18:00AM +0900, Fujii Masao wrote:
> On Fri, Oct 4, 2019 at 9:03 PM Fujii Masao  wrote:
>> So archive recovery with recovery_min_apply_delay enabled would be an
>> intended configuration. My patch that changes archive recovery so that
>> it always ignores the setting might be in the wrong direction. Maybe we
>> should make recovery_min_apply_delay work properly even in archive
>> recovery. Thoughts?
> 
> Patch attached. This patch allows archive recovery with
> recovery_min_apply_delay set, but not crash recovery.

Right.  In short, it makes no sense to wait for the delay when in crash
recovery.  After more testing I have been able to reproduce the
failure myself.

+   /* nothing to do if crash recovery is requested */
+   if (!ArchiveRecoveryRequested && !StandbyModeRequested)
+   return false;

ArchiveRecoveryRequested will be set to true if recovery.signal or
standby.signal is found, so it seems to me that you can make all
those checks simpler by removing StandbyModeRequested from the
equation, no?  StandbyModeRequested is never set to true
if ArchiveRecoveryRequested is not itself true.

It would be nice to test some scenario within 002_archiving.pl.
--
Michael


signature.asc
Description: PGP signature


Re: Proposal: Make use of C99 designated initialisers for nulls/values arrays

2019-10-16 Thread Thomas Munro
On Tue, Oct 8, 2019 at 11:09 PM Amit Kapila  wrote:
> On Sat, Oct 5, 2019 at 3:10 AM Andres Freund  wrote:
> > On 2019-10-04 17:08:29 -0400, Tom Lane wrote:
> > > Andres Freund  writes:
> > > > On 2019-10-04 16:31:29 -0400, Bruce Momjian wrote:
> > > >> Yeah, it is certainly weird that you have to assign the first array
> > > >> element to get the rest to be zeros.  By using a macro, we can document
> > > >> this behavior in one place.
> > >
> > > > IDK, to me this seems like something one just has to learn about C, with
> > > > the macro just obfuscating that already required knowledge. It's not
> > > > like this only applies to stack variables initializes with {0}. It's
> > > > also true of global variables, or function-local static ones, for
> > > > example.
> > >
> > > Huh?  For both those cases, the *default* behavior, going back to K&R C,
> > > is that the variable initializes to all-bits-zero.  There's no need to
> > > write anything extra.
> >
> > What I mean is that if there's any initialization, it's to all zeroes,
> > except for the parts initialized explicitly. And all that the
> > {0} does, is that the rest of the fields are initialized the way other
> > such initialization happens.
> >
>
> You have a point, and I think over time everyone will know this.
> However, the fact that so many people are advocating for a macro with a
> comment to be more explicit about this behavior shows that it is not
> equally obvious to everyone, or at least that they think it will help
> future patch authors.
>
> Now, as the usage ({0}) already exists in the code, I think that if we
> decide to use a macro, then ideally those places should also be changed.
> I am not saying that it must be done in the same patch; we can even do it
> as a separate patch.
>
> I am personally still in the camp of people advocating the use of a
> macro for this purpose.  It is quite possible that after reading your
> points some people might change their opinion, or others might also voice
> an opinion against using a macro, in which case we can drop the idea of
> using a macro.

-1 for these macros.

These are basic facts about the C language.  I hope C eventually
supports {} like C++, so that you don't have to think hard about
whether the first member is another struct, and recursively so … but
since the macros can't help with that problem, what is the point?

I am reminded of an (apocryphal?) complaint from an old C FAQ about
people using #define BEGIN {.




Re: [HACKERS] Block level parallel vacuum

2019-10-16 Thread Mahendra Singh
Hi
I applied all 3 patches and ran the regression tests. I was getting one
regression failure.

diff -U3
> /home/mahendra/postgres_base_rp/postgres/src/test/regress/expected/vacuum.out
> /home/mahendra/postgres_base_rp/postgres/src/test/regress/results/vacuum.out
> ---
> /home/mahendra/postgres_base_rp/postgres/src/test/regress/expected/vacuum.out
> 2019-10-17 10:01:58.138863802 +0530
> +++
> /home/mahendra/postgres_base_rp/postgres/src/test/regress/results/vacuum.out
> 2019-10-17 11:41:20.930699926 +0530
> @@ -105,7 +105,7 @@
>  CREATE TEMPORARY TABLE tmp (a int PRIMARY KEY);
>  CREATE INDEX tmp_idx1 ON tmp (a);
>  VACUUM (PARALLEL 1) tmp; -- error, cannot parallel vacuum temporary tables
> -WARNING:  skipping "tmp" --- cannot parallel vacuum temporary tables
> +WARNING:  skipping vacuum on "tmp" --- cannot vacuum temporary tables in
> parallel
>  -- INDEX_CLEANUP option
>  CREATE TABLE no_index_cleanup (i INT PRIMARY KEY, t TEXT);
>  -- Use uncompressed data stored in toast.
>

It looks like you changed the warning message for temp tables, but haven't
updated the expected output file.

Thanks and Regards
Mahendra Thalor
EnterpriseDB: http://www.enterprisedb.com

On Wed, 16 Oct 2019 at 06:50, Masahiko Sawada  wrote:

> On Tue, Oct 15, 2019 at 6:33 PM Amit Kapila 
> wrote:
> >
> > On Tue, Oct 15, 2019 at 1:26 PM Masahiko Sawada 
> wrote:
> > >
> > > On Tue, Oct 15, 2019 at 4:15 PM Masahiko Sawada 
> wrote:
> > > >
> > > > > > > If we avoid postponing deleting empty pages till the cleanup phase,
> > > > > > > then we don't have the problem for gist indexes.
> > > > > >
> > > > > > Yes. But considering your point, I guess that there might be other
> > > > > > index AMs that use the stats returned from bulkdelete in a similar
> > > > > > way to the gist index (i.e. using a larger structure of which
> > > > > > IndexBulkDeleteResult is just the first field). If we have the same
> > > > > > concern, the parallel vacuum still needs to deal with that, as you
> > > > > > mentioned.
> > > > > >
> > > > >
> > > > > Right, apart from some functions for memory allocation/estimation
> and
> > > > > stats copy, we might need something like amcanparallelvacuum, so
> that
> > > > > index methods can have the option to not participate in parallel
> > > > > vacuum due to reasons similar to gist or something else.  I think
> we
> > > > > can work towards this direction as this anyway seems to be required
> > > > > and till we reach any conclusion for gist indexes, you can mark
> > > > > amcanparallelvacuum for gist indexes as false.
> > > >
> > > > Agreed. I'll create a separate patch to add this callback and change
> > > > parallel vacuum patch so that it checks the participation of indexes
> > > > and then vacuums on un-participated indexes after parallel vacuum.
> > >
> > > amcanparallelvacuum does not need to be a callback; it can be a
> > > boolean field of IndexAmRoutine.
> > >
> >
> > Yes, it will be a boolean.  Note that for parallel-index scans, we
> > already have amcanparallel.
> >
>
> Attached updated patch set. The 0001 patch introduces the new index AM field
> amcanparallelvacuum. All index AMs except gist set it to true for now.
> The 0002 patch incorporated all the comments I got so far.
>
> Regards,
>
> --
> Masahiko Sawada
>


Re: [HACKERS] Block level parallel vacuum

2019-10-16 Thread Masahiko Sawada
On Thu, Oct 17, 2019 at 3:18 PM Mahendra Singh  wrote:
>
> Hi
> I applied all 3 patches and ran regression test. I was getting one regression 
> failure.
>
>> diff -U3 
>> /home/mahendra/postgres_base_rp/postgres/src/test/regress/expected/vacuum.out
>>  /home/mahendra/postgres_base_rp/postgres/src/test/regress/results/vacuum.out
>> --- 
>> /home/mahendra/postgres_base_rp/postgres/src/test/regress/expected/vacuum.out
>>  2019-10-17 10:01:58.138863802 +0530
>> +++ 
>> /home/mahendra/postgres_base_rp/postgres/src/test/regress/results/vacuum.out 
>> 2019-10-17 11:41:20.930699926 +0530
>> @@ -105,7 +105,7 @@
>>  CREATE TEMPORARY TABLE tmp (a int PRIMARY KEY);
>>  CREATE INDEX tmp_idx1 ON tmp (a);
>>  VACUUM (PARALLEL 1) tmp; -- error, cannot parallel vacuum temporary tables
>> -WARNING:  skipping "tmp" --- cannot parallel vacuum temporary tables
>> +WARNING:  skipping vacuum on "tmp" --- cannot vacuum temporary tables in 
>> parallel
>>  -- INDEX_CLEANUP option
>>  CREATE TABLE no_index_cleanup (i INT PRIMARY KEY, t TEXT);
>>  -- Use uncompressed data stored in toast.
>
>
> It looks like you changed the warning message for temp tables, but haven't
> updated the expected output file.
>

Thank you!
I forgot to change the expected file. I'll fix it in the next version of the patch.

Regards,

--
Masahiko Sawada




Re: [HACKERS] Block level parallel vacuum

2019-10-16 Thread Amit Kapila
On Thu, Oct 17, 2019 at 10:56 AM Masahiko Sawada  wrote:
>
> On Wed, Oct 16, 2019 at 3:02 PM Amit Kapila  wrote:
> >
> > On Wed, Oct 16, 2019 at 6:50 AM Masahiko Sawada  
> > wrote:
> > >
> > > On Tue, Oct 15, 2019 at 6:33 PM Amit Kapila  
> > > wrote:
> > > >
> > >
> > > Attached updated patch set. The 0001 patch introduces the new index AM
> > > field amcanparallelvacuum. All index AMs except gist set it to true for
> > > now. The 0002 patch incorporated all the comments I got so far.
> > >
> >
> > I haven't studied the latest patch in detail, but it seems you are
> > still assuming that all indexes will have the same amount of shared
> > memory for index stats and copying it in the same way.
>
> Yeah, I thought we agreed at least to have canparallelvacuum, so that if an
> index AM cannot support parallel index vacuuming, like gist, it returns
> false.
>
> > I thought we
> > agreed that each index AM should do this on its own.  The basic
> > problem is as of now we see this problem only with the Gist index, but
> > some other index AM's could also have a similar problem.
>
> Okay. I'm thinking we're going to have a new callback to ask index AMs for
> the size of the structure they use within both ambulkdelete and
> amvacuumcleanup. But copying it to DSM can be done by the core, because
> it then knows how many bytes need to be copied to DSM. Is that okay?
>

That sounds okay.

> >
> > Another major problem with previous and this patch version is that the
> > cost-based vacuum concept seems to be entirely broken.  Basically,
> > each parallel vacuum worker operates independently w.r.t vacuum delay
> > and cost.  Assume that the overall I/O allowed for vacuum operation is
> > X after which it will sleep for some time, reset the balance and
> > continue.  In the patch, each worker will be allowed to perform X
> > before which it can sleep and also there is no coordination for the
> > same with master backend.  This is somewhat similar to memory usage
> > problem, but a bit more tricky because here we can't easily split the
> > I/O for each of the worker.
> >
> > One idea could be that we somehow map vacuum costing related
> > parameters to the shared memory (dsm) which the vacuum operation is
> > using and then allow workers to coordinate.  This way master and
> > worker processes will have the same view of balance cost and can act
> > accordingly.
> >
> > The other idea could be that we come up with some smart way to split
> > the I/O among workers.  Initially, I thought we could try something as
> > we do for autovacuum workers (see autovac_balance_cost), but I think
> > that will require much more math.  Before launching workers, we need
> > to compute the remaining I/O (heap operation would have used
> > something) after which we need to sleep and continue the operation and
> > then somehow split it equally across workers.  Once the workers are
> > finished, then need to let master backend know how much I/O they have
> > consumed and then master backend can add it to it's current I/O
> > consumed.
> >
> > I think this problem matters because the vacuum delay is useful for
> > large vacuums and this patch is trying to exactly solve that problem,
> > so we can't ignore this problem.  I am not yet sure what is the best
> > solution to this problem, but I think we need to do something for it.
> >
>
> I guess that the concept of vacuum delay contradicts the concept of
> parallel vacuum. The point of parallel vacuum is to use more resources
> to make vacuum faster, whereas vacuum delay balances I/O during vacuum
> in order to avoid I/O spikes, so parallel vacuum rather concentrates
> the I/O into a shorter duration.
>

You have a point, but the way it is currently working in the patch
doesn't make much sense.  Basically, each of the parallel workers will
be allowed to use the complete I/O limit, which is actually the limit for
the entire vacuum operation.  It doesn't give any consideration to the
work done for the heap.

> Since memory is shared across
> the entire system we need to deal with the memory issue, but disks are
> different.
>
> If we need to deal with this problem, how about just dividing
> vacuum_cost_limit by the parallel degree and setting that as each
> worker's vacuum_cost_limit?
>

How will we take the I/O done for the heap into consideration?  The
vacuum_cost_limit is the cost limit for the entire vacuum operation, not
separately for the heap and indexes.  What makes you think that
considering the limit for the heap and indexes separately is not
problematic?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
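
Purely to illustrate the first idea above (mapping the cost parameters into
shared memory so that workers coordinate on one balance), a rough sketch;
none of these structure or function names exist in the tree, and a real
implementation would have to handle races and the heap-phase I/O properly:

/* Hypothetical sketch of workers sharing a single cost balance. */
#include "postgres.h"
#include "miscadmin.h"			/* VacuumCostDelay, VacuumCostLimit */
#include "port/atomics.h"

typedef struct ParallelVacuumCostShared
{
	pg_atomic_uint32 cost_balance;	/* shared equivalent of VacuumCostBalance */
} ParallelVacuumCostShared;

/* Called by each worker where vacuum_delay_point() would normally be. */
static void
parallel_vacuum_delay_point(ParallelVacuumCostShared *shared, int cost_this_call)
{
	uint32		balance;

	balance = pg_atomic_add_fetch_u32(&shared->cost_balance, cost_this_call);

	if (balance >= (uint32) VacuumCostLimit)
	{
		/* Not race-free: two workers may both reset; good enough for a sketch. */
		pg_atomic_write_u32(&shared->cost_balance, 0);
		pg_usleep((long) (VacuumCostDelay * 1000));
	}
}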




Re: Questions/Observations related to Gist vacuum

2019-10-16 Thread Heikki Linnakangas

On 17/10/2019 05:31, Amit Kapila wrote:
> On Wed, Oct 16, 2019 at 11:20 AM Dilip Kumar wrote:
>>
>> On Tue, Oct 15, 2019 at 7:13 PM Heikki Linnakangas wrote:
>>>
>>> On 15/10/2019 09:37, Amit Kapila wrote:
>>>> While reviewing a parallel vacuum patch [1], we noticed a few things
>>>> about $SUBJECT implemented in commit -
>>>> 7df159a620b760e289f1795b13542ed1b3e13b87.
>>>>
>>>> 1. A new memory context GistBulkDeleteResult->page_set_context has
>>>> been introduced, but it doesn't seem to be used.
>>>
>>> Oops. internal_page_set and empty_leaf_set were supposed to be allocated
>>> in that memory context. As things stand, we leak them until end of
>>> vacuum, in a multi-pass vacuum.
>>
>> Here is a patch to fix this issue.
>
> The patch looks good to me.  I have slightly modified the comments and
> removed unnecessary initialization.
>
> Heikki, are you fine with me committing and backpatching this to 12?  Let
> me know if you have a different idea for the fix.


Thanks! Looks good to me. Did either of you test it, though, with a 
multi-pass vacuum?


- Heikki