RE: Implementing Incremental View Maintenance
From: Yugo Nagata
> 1. Create a temporary table only once at the first view maintenance in
> this session. This is possible if we store names or oid of temporary
> tables used for each materialized view in memory. However, users may
> access to these temptables whenever during the session.
>
> 2. Use tuplestores instead of temprary tables. Tuplestores can be
> converted to Ephemeral Name Relation (ENR) and used in queries.
> It doesn't need updating system catalogs, but indexes can not be
> used to access.

How about unlogged tables?  I thought the point of using a temp table is to avoid WAL overhead.

One concern about the temp table is that it precludes the use of distributed transactions (PREPARE TRANSACTION fails if the transaction accessed a temp table.)  This could become a headache when FDW has supported 2PC (which Sawada-san started and Horiguchi-san has taken over.)  In the near future, PostgreSQL may evolve into a shared nothing database with distributed transactions like Postgres-XL.

Regards
Takayuki Tsunakawa
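As a point of reference for option 2 quoted above, here is a minimal sketch of how a tuplestore of delta rows could be exposed to the maintenance query as an Ephemeral Named Relation, using the same SPI machinery that AFTER-trigger transition tables rely on. The relation name "ivm_delta", the query text, and the surrounding function are assumptions made only for illustration, not part of the IVM patch.

#include "postgres.h"
#include "executor/spi.h"
#include "utils/queryenvironment.h"
#include "utils/tuplestore.h"

static void
run_maintenance_query(Tuplestorestate *delta, TupleDesc delta_desc)
{
    EphemeralNamedRelation enr = palloc0(sizeof(EphemeralNamedRelationData));

    enr->md.name = pstrdup("ivm_delta");    /* hypothetical name */
    enr->md.reliddesc = InvalidOid;         /* no catalog-backed relation */
    enr->md.tupdesc = delta_desc;           /* shape of the delta rows */
    enr->md.enrtype = ENR_NAMED_TUPLESTORE;
    enr->md.enrtuples = tuplestore_tuple_count(delta);
    enr->reldata = delta;

    SPI_connect();
    SPI_register_relation(enr);

    /* The tuplestore is now visible by name, with no catalog churn at all. */
    SPI_exec("SELECT count(*) FROM ivm_delta", 0);

    SPI_unregister_relation("ivm_delta");
    SPI_finish();
}

As the quoted mail notes, the trade-off is that such a relation cannot be indexed, so the maintenance query can only scan it sequentially.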
Re: Implementing Incremental View Maintenance
>> But if you want to get always up-to-data you need to pay the cost for
>> REFRESH MATERIALIZED VIEW. IVM gives a choice here.
>
> Thank you, that clarified to some extent. What kind of data do you think of
> as an example?
>
> Materialized view reminds me of the use in a data warehouse. Oracle handles
> the top in its Database Data Warehousing Guide, and Microsoft has just
> started to offer the materialized view feature in its Azure Synapse Analytics
> (formerly SQL Data Warehouse). AWS also has previewed Redshift's
> materialized view feature in re:Invent 2019. Are you targeting the data
> warehouse (analytics) workload?

First of all, we do not think that the current approach is the final one. Instead we want to implement the IVM feature one step at a time: i.e. we start with the "immediate update" approach, because it's simpler and easier to implement. Then we will add a "deferred update" mode later on.

In fact Oracle has both "immediate update" and "deferred update" modes of IVM (actually there are more modes in their implementation). I recommend looking into Oracle's materialized view feature closely. For a fair evaluation, we should probably compare the IVM patch with Oracle's "immediate update" (they call it "on statement") mode.

> IIUC, to put (over) simply, the data warehouse has two kind of tables:

Probably the deferred IVM mode is more suitable for DWH. However, as I said earlier, we hope to implement the immediate mode first and then add the deferred mode. Let's start with the simple one and then add more features.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
Re: smgr vs DropRelFileNodeBuffers() vs filesystem state vs no critical section
Hi, On 2019-12-07 11:07:04 +0530, Amit Kapila wrote: > On Sat, Dec 7, 2019 at 5:42 AM Andres Freund wrote: > > > > Tom, I seem to recall a recent thread of yours discussing a different > > approach to truncation. I wonder if there's some intersection with > > that. But unfortunately my search somehow has come up with nothing so > > far - do you remember enough to find the thread? > > > > IIUC, a similar problem is discussed in the thread [1] where Tom > proposed a few solutions which are close to what you are proposing. > > [1] - https://www.postgresql.org/message-id/1880.1281020817%40sss.pgh.pa.us It does indeed look like basically the same problem. I was actually remembering a different thread, that was more about truncating with not quite as heavyweight locking. Having pondered this and some related problems (truncation, flushing multiple buffers at once using asynchronous IO, PrefetchBuffer() directly into buffers, cleanup lock implementation), I think we basically need the ability to set something like BM_IO_IN_PROGRESS on multiple buffers (perhaps signalling different forms of IO with different flags, but I'm not sure it's needed). I think we basically ought to replace the current IO buffer locking with condition variables (as Robert has suggested at [1]). Instead of having an array of LWLocks (BufferIOLWLockArray), we'd allocate one condition variable for each buffer. I think we have enough space directly in BufferDesc these days, due to the spinlock removal, but that seems like a detail. For truncation, we'd first iterate over all buffers once to mark them as BM_IO_IN_PROGRESS, then we would truncate the underlying file. If the truncation succeeds, we can use a local palloc'd array of IO_IN_PROGRESS buffers to actually evict them. If the truncation fails, the same array would be used to reset IO_IN_PROGRESS (basically AbortBufferIO, except not setting BM_IO_ERROR for the truncation case). This would solve the problem of truncations leading to diverging fs/buffer state, would not require a PANIC, and would allow to have concurrent buffer eviction to efficiently wait for IO to finish. This would pretty directly replace the current cleanup locks, which would just need the existing flagbit to indicate that refcount=0 should cause a condition variable wakeup. Does somebody see a fundamental hole in this approach? Obviously there's lots of different details to fill in, but most of them seem likely to only be found by actually writing the patch... Greetings, Andres Freund [1] https://postgr.es/m/CA%2BTgmoaj2aPti0yho7FeEf2qt-JgQPRWb0gci_o1Hfr%3DC56Xng%40mail.gmail.com
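For readers trying to picture the proposal, here is an illustration-only sketch of the three-step ordering described above (claim the buffers, do the filesystem work, then either evict the buffers or release the claim). Every function and type below is a hypothetical placeholder standing in for buffer-manager internals that would have to be written, including the condition-variable based replacement for the current BufferIOLWLockArray; this is not PostgreSQL code.

typedef int BufferId;

extern int  CollectBuffersBeyond(const char *rel, unsigned nblocks,
                                 BufferId *out, int maxbufs);
extern void MarkBufferIOInProgress(BufferId buf);   /* waits out concurrent IO via the CV */
extern int  TruncateUnderlyingFile(const char *rel, unsigned nblocks);  /* 1 on success */
extern void EvictClaimedBuffer(BufferId buf);   /* drop buffer, clear flag, wake waiters */
extern void ReleaseBufferClaim(BufferId buf);   /* like AbortBufferIO, minus BM_IO_ERROR */

static void
truncate_rel_sketch(const char *rel, unsigned nblocks)
{
    BufferId    claimed[1024];      /* fixed cap only to keep the sketch short */
    int         n,
                i;

    /* Pass 1: claim every to-be-dropped buffer, as if we were writing it out. */
    n = CollectBuffersBeyond(rel, nblocks, claimed, 1024);
    for (i = 0; i < n; i++)
        MarkBufferIOInProgress(claimed[i]);

    /* The filesystem operation runs outside any critical section. */
    if (TruncateUnderlyingFile(rel, nblocks))
    {
        /* Success: nobody can write these buffers back now; just evict them. */
        for (i = 0; i < n; i++)
            EvictClaimedBuffer(claimed[i]);
    }
    else
    {
        /* Failure: file and buffers still agree, so simply release the claims. */
        for (i = 0; i < n; i++)
            ReleaseBufferClaim(claimed[i]);
    }
}

The key property is that a failed truncation leaves both the file and the buffer pool unchanged, which is why no PANIC is needed.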
Re: [HACKERS] Block level parallel vacuum
On Mon, 23 Dec 2019 at 16:24, Mahendra Singh wrote: > > On Fri, 20 Dec 2019 at 17:17, Prabhat Sahu > wrote: > > > > Hi, > > > > While testing this feature with parallel vacuum on "TEMPORARY TABLE", I got > > a server crash on PG Head+V36_patch. > > Changed configuration parameters and Stack trace are as below: > > > > autovacuum = on > > max_worker_processes = 4 > > shared_buffers = 10MB > > max_parallel_workers = 8 > > max_parallel_maintenance_workers = 8 > > vacuum_cost_limit = 2000 > > vacuum_cost_delay = 10 > > min_parallel_table_scan_size = 8MB > > min_parallel_index_scan_size = 0 > > > > -- Stack trace: > > [centos@parallel-vacuum-testing bin]$ gdb -q -c data/core.1399 postgres > > Reading symbols from > > /home/centos/BLP_Vacuum/postgresql/inst/bin/postgres...done. > > [New LWP 1399] > > [Thread debugging using libthread_db enabled] > > Using host libthread_db library "/lib64/libthread_db.so.1". > > Core was generated by `postgres: autovacuum worker postgres > >'. > > Program terminated with signal 6, Aborted. > > #0 0x7f4517d80337 in raise () from /lib64/libc.so.6 > > Missing separate debuginfos, use: debuginfo-install > > glibc-2.17-292.el7.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 > > krb5-libs-1.15.1-37.el7_7.2.x86_64 libcom_err-1.42.9-16.el7.x86_64 > > libgcc-4.8.5-39.el7.x86_64 libselinux-2.5-14.1.el7.x86_64 > > openssl-libs-1.0.2k-19.el7.x86_64 pcre-8.32-17.el7.x86_64 > > zlib-1.2.7-18.el7.x86_64 > > (gdb) bt > > #0 0x7f4517d80337 in raise () from /lib64/libc.so.6 > > #1 0x7f4517d81a28 in abort () from /lib64/libc.so.6 > > #2 0x00a96341 in ExceptionalCondition (conditionName=0xd18efb > > "strvalue != NULL", errorType=0xd18eeb "FailedAssertion", > > fileName=0xd18ee0 "snprintf.c", lineNumber=442) at assert.c:67 > > #3 0x00b02522 in dopr (target=0x7ffdb0e38450, format=0xc5fa95 > > ".%s\"", args=0x7ffdb0e38538) at snprintf.c:442 > > #4 0x00b01ea6 in pg_vsnprintf (str=0x256df50 "autovacuum: dropping > > orphan temp table \"postgres.", '\177' ..., count=1024, > > fmt=0xc5fa68 "autovacuum: dropping orphan temp table \"%s.%s.%s\"", > > args=0x7ffdb0e38538) at snprintf.c:195 > > #5 0x00afbadf in pvsnprintf (buf=0x256df50 "autovacuum: dropping > > orphan temp table \"postgres.", '\177' ..., len=1024, > > fmt=0xc5fa68 "autovacuum: dropping orphan temp table \"%s.%s.%s\"", > > args=0x7ffdb0e38538) at psprintf.c:110 > > #6 0x00afd34b in appendStringInfoVA (str=0x7ffdb0e38550, > > fmt=0xc5fa68 "autovacuum: dropping orphan temp table \"%s.%s.%s\"", > > args=0x7ffdb0e38538) > > at stringinfo.c:149 > > #7 0x00a970fd in errmsg (fmt=0xc5fa68 "autovacuum: dropping orphan > > temp table \"%s.%s.%s\"") at elog.c:832 > > #8 0x008588d2 in do_autovacuum () at autovacuum.c:2249 > > #9 0x00857b29 in AutoVacWorkerMain (argc=0, argv=0x0) at > > autovacuum.c:1689 > > #10 0x0085772f in StartAutoVacWorker () at autovacuum.c:1483 > > #11 0x0086e64f in StartAutovacuumWorker () at postmaster.c:5562 > > #12 0x0086e106 in sigusr1_handler (postgres_signal_arg=10) at > > postmaster.c:5279 > > #13 > > #14 0x7f4517e3f933 in __select_nocancel () from /lib64/libc.so.6 > > #15 0x00869838 in ServerLoop () at postmaster.c:1691 > > #16 0x00869212 in PostmasterMain (argc=3, argv=0x256bd70) at > > postmaster.c:1400 > > #17 0x0077855d in main (argc=3, argv=0x256bd70) at main.c:210 > > (gdb) > > > > I have tried to reproduce the same with all previously executed queries but > > now I am not able to reproduce the same. > > Thanks Prabhat for reporting this issue. > > I am able to reproduce this issue at my end. 
> I tested and verified that this issue is not related to parallel
> vacuum patch. I am able to reproduce this issue on HEAD without
> parallel vacuum patch(v37).
>
> I will report this issue in new thread with reproducible test case.

Thank you so much!

Regards,

--
Masahiko Sawada
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: unsupportable composite type partition keys
On Sun, Dec 22, 2019 at 6:13 AM Tom Lane wrote: > > I wrote: > > As far as point 2 goes, I think this is another outgrowth of the > > fundamental design error that we load a partitioned rel's partitioning > > info immediately when the relcache entry is created, rather than later > > on-demand. If we weren't doing that then it wouldn't be problematic > > to inspect the rel's rowtype while constructing the partitioning info. > > I've bitched about this before, if memory serves, but couldn't light > > a fire under anyone about fixing it. Now I think we have no choice. > > It was never a great idea that minimal construction of a relcache > > entry could result in running arbitrary user-defined code. > > Here's a draft patch for that. Thanks for writing the patch. This also came up recently on another thread [1]. > There are a couple of secondary issues > I didn't do anything about yet: > > * When rebuilding an open relcache entry for a partitioned table, this > coding now always quasi-leaks the old rd_pdcxt, where before that happened > only if the partdesc actually changed. (Even if I'd kept the > equalPartitionDescs call, it would always fail.) I complained about the > quasi-leak behavior before, but this probably pushes it up to the level of > "must fix". What I'm inclined to do is to hack > RelationDecrementReferenceCount so that, when the refcount goes to zero, > we delete any child contexts of rd_pdcxt. That's pretty annoying but in > the big scheme of things it's unlikely to matter. Hacking RelationDecrementReferenceCount() like that sounds OK. -else if (rebuild && newrel->rd_pdcxt != NULL) +if (rebuild && newrel->rd_pdcxt != NULL) Checking rebuild seems unnecessary in this block, although that's true even without the patch. + * To ensure that it's not leaked completely, re-attach it to the + * new reldesc, or make it a child of the new reldesc's rd_pdcxt + * in the unlikely event that there is one already. (See hack in + * RelationBuildPartitionDesc.) ... +if (relation->rd_pdcxt != NULL) /* probably never happens */ +MemoryContextSetParent(newrel->rd_pdcxt, relation->rd_pdcxt); +else +relation->rd_pdcxt = newrel->rd_pdcxt; While I can imagine that RelationBuildPartitionDesc() might encounter an old partition descriptor making the re-parenting hack necessary there, I don't see why it's needed here, because a freshly built relation descriptor would not contain the partition descriptor after this patch. > * It'd be better to declare RelationGetPartitionKey and > RelationGetPartitionDesc in relcache.h and get their callers out of the > business of including rel.h, where possible. Although I agree to declare them in relcache.h, that doesn't reduce the need to include rel.h in their callers much, because anyplace that uses RelationGetPartitionDesc() is also very likely to use RelationGetRelid() which is in rel.h. > * equalPartitionDescs is now dead code, should we remove it? Don't see any problem with doing so. > > Note that the end result of this would be to allow, not prohibit, > > cases like your example. I wonder whether we couldn't also lift > > the restriction against whole-row Vars in partition expressions. > > Doesn't seem like there is much difference between such a Var and > > a row(...)::table_rowtype expression. > > I didn't look into that either. I wouldn't propose back-patching that, > but it'd be interesting to try to fix it in HEAD. Agreed. Thanks, Amit [1] https://www.postgresql.org/message-id/CA%2BHiwqFucUh7hYkfZ6x1MVcs_R24eUfNVuRwdE_FwuwK8XpSZg%40mail.gmail.com
Re: Implementing Incremental View Maintenance
On Mon, 23 Dec 2019 08:08:53 + "tsunakawa.ta...@fujitsu.com" wrote: > From: Yugo Nagata > > 1. Create a temporary table only once at the first view maintenance in > > this session. This is possible if we store names or oid of temporary > > tables used for each materialized view in memory. However, users may > > access to these temptables whenever during the session. > > > > 2. Use tuplestores instead of temprary tables. Tuplestores can be > > converted to Ephemeral Name Relation (ENR) and used in queries. > > It doesn't need updating system catalogs, but indexes can not be > > used to access. > > How about unlogged tables ? I thought the point of using a temp table is to > avoid WAL overhead. Hmm... this might be another option. However, if we use unlogged tables, we will need to create them in a special schema similar to pg_toast to split this from user tables. Otherwise, we need to create and drop unlogged tables repeatedly for each session. > > One concern about the temp table is that it precludes the use of distributed > transactions (PREPARE TRANSACTION fails if the transaction accessed a temp > table.) This could become a headache when FDW has supported 2PC (which > Sawada-san started and Horicuchi-san has taken over.) In the near future, > PostgreSQL may evolve into a shared nothing database with distributed > transactions like Postgres-XL. This makes sense since you mean that PREPARE TRANSACTION can not be used if any base table of incrementally maintainable materialized views is modified in the transaction, at least in the immediate maintenance. Maybe, this issue can be resolved if we implement the deferred maintenance planned in future because materialized views can be updated in other transactions in this way. > > > Regards > Takayuki Tsunakawa > > > -- Yugo Nagata
Re: Implementing Incremental View Maintenance
Hello,

Regarding my initial post:

> For each insert into a base table there are 3 statements:
> - ANALYZE pg_temp_3.pg_temp_81976
> - WITH updt AS ( UPDATE public.mv1 AS mv SET __ivm_count__ = ...
> - DROP TABLE pg_temp_3.pg_temp_81976

For me there were 3 points to discuss:
- create/drop tables may bloat dictionary tables
- create/drop tables prevents "WITH updt ..." from being shared (with some plan caching)
- it generates many lines in pg_stat_statements

In fact I like the idea of a table created per session, but I would even prefer a common "table" shared between all sessions like GLOBAL TEMPORARY TABLE (or something similar) as described here:
https://www.postgresql.org/message-id/flat/157703426606.1198.2452090605041230054.pgcf%40coridan.postgresql.org#331e8344bbae904350af161fb43a0aa6

That would remove the drop/create issue, would permit reducing planning time for "WITH updt ..." statements (as done today in PLpgsql triggers), and would fix the pgss "bloat" issue. That way, the "cost" of the immediate refresh approach would be easier to accept ;o)

Regards
PAscal

--
Sent from: https://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html
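For concreteness, a rough SQL illustration of the "one table per session" idea follows. The delta table, its columns, and the statements that would use it are hypothetical; the point is only that a session-lifetime table emptied per transaction avoids the per-statement CREATE/DROP that bloats the catalogs and pg_stat_statements.

-- One delta table per session, created once and then reused;
-- ON COMMIT DELETE ROWS empties it after every transaction.
CREATE TEMP TABLE IF NOT EXISTS mv1_delta (
    __ivm_count__ bigint,
    a             integer
) ON COMMIT DELETE ROWS;

-- Each base-table change can then reuse the same statement text, e.g.:
--   INSERT INTO mv1_delta ...;
--   WITH updt AS (UPDATE public.mv1 AS mv SET __ivm_count__ = ... ) ...;

A GLOBAL TEMPORARY TABLE, as proposed in the linked thread, would give the same reuse without each session having to run the CREATE at all.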
Re: mdclose() does not cope w/ FileClose() failure
Hello. At Sun, 22 Dec 2019 12:21:00 -0800, Noah Misch wrote in > On Sun, Dec 22, 2019 at 01:19:30AM -0800, Noah Misch wrote: > > I am inclined to fix this by decrementing md_num_open_segs before modifying > > md_seg_fds (second attachment). > > That leaked memory, since _fdvec_resize() assumes md_num_open_segs is also the > allocated array length. The alternative is looking better: I agree that v2 is cleaner in the light of modularity and fixes the memory leak happens at re-open. > > An alternative would be to call > > _fdvec_resize() after every FileClose(), like mdtruncate() does; however, > > the > > repalloc() overhead could be noticeable. (mdclose() is called much more > > frequently than mdtruncate().) > > I can skip repalloc() when the array length decreases, to assuage mdclose()'s > worry. In the mdclose() case, the final _fdvec_resize(reln, fork, 0) will > still pfree() the array. Array elements that mdtruncate() frees today will > instead persist to end of transaction. That is okay, since mdtruncate() > crossing more than one segment boundary is fairly infrequent. For it to > happen, you must either create a >2G relation and then TRUNCATE it in the same > transaction, or VACUUM must find >1-2G of unused space at the end of the > relation. I'm now inclined to do it that way, attached. * It doesn't seem worthwhile complicating the code by having a more * aggressive growth strategy here; the number of segments doesn't * grow that fast, and the memory context internally will sometimes -* avoid doing an actual reallocation. +* avoid doing an actual reallocation. Likewise, since the number of +* segments doesn't shrink that fast, don't shrink at all. During +* mdclose(), we'll pfree the array at nseg==0. If I understand it correctly, it is mentioning the number of the all segment files in a fork, not the length of md_seg_fds arrays at a certain moment. But actually _fdvec_resize is called for every segment opening during mdnblocks (just-after mdopen), and every segment closing during mdclose and mdtruncate as mentioned here. We are going to omit pallocs only in the decreasing case. If we regard repalloc as far faster than FileOpen/FileClose or we care about only increase of segment number of mdopen'ed files and don't care the frequent resize that happens during the functions above, then the comment is right and we may resize the array in the segment-by-segment manner. But if they are comparable each other, or we don't want the array gets resized frequently, we might need to prevent repalloc from happening on every segment increase, too. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
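To make the trade-off concrete, here is a sketch of the shrink-without-repalloc idea being discussed, written as if it lived inside md.c and loosely modeled on _fdvec_resize(); it is an illustration of the approach, not the actual patch. Growth still (re)allocates, shrinking only updates the element count, and the array is freed only when the count reaches zero.

static void
fdvec_resize_sketch(SMgrRelation reln, ForkNumber forknum, int nseg)
{
    if (nseg == 0)
    {
        /* mdclose() path: hand the whole array back. */
        if (reln->md_num_open_segs[forknum] > 0)
        {
            pfree(reln->md_seg_fds[forknum]);
            reln->md_seg_fds[forknum] = NULL;
        }
    }
    else if (nseg > reln->md_num_open_segs[forknum])
    {
        /* Growing: allocate or reallocate, as today. */
        if (reln->md_num_open_segs[forknum] == 0)
            reln->md_seg_fds[forknum] =
                MemoryContextAlloc(MdCxt, sizeof(MdfdVec) * nseg);
        else
            reln->md_seg_fds[forknum] =
                repalloc(reln->md_seg_fds[forknum], sizeof(MdfdVec) * nseg);
    }

    /*
     * Shrinking (the mdtruncate()/mdclose() case): deliberately no repalloc.
     * Trailing elements simply persist until the array is freed at nseg == 0,
     * which the mail above argues is an acceptable trade-off.
     */
    reln->md_num_open_segs[forknum] = nseg;
}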
Re: [PATCH] Increase the maximum value track_activity_query_size
On Sun, Dec 22, 2019 at 09:06:41AM +0100, Julien Rouhaud wrote: On Sun, Dec 22, 2019 at 1:03 AM Tomas Vondra wrote: On Sat, Dec 21, 2019 at 04:25:05PM -0500, Tom Lane wrote: >Nikolay Samokhvalov writes: >> Here is what ORMs do: >> select length('SELECT "column_name_1001", "column_name_1002", >> "column_name_1003", "column_name_1004", "column_name_1005", >> "column_name_1006", "column_name_1007", "column_name_1008", >> "column_name_1009", "column_name_1010", "column_name_1011", >> "column_name_1012", "column_name_1013", "column_name_1014", >> "column_name_1015", "column_name_1016", "column_name_1017", >> "column_name_1018", "column_name_1019", "column_name_1020", >> "column_name_1021", "column_name_1022", "column_name_1023", >> "column_name_1024", "column_name_1025", "column_name_1026", >> "column_name_1027", "column_name_1028", "column_name_1029", >> "column_name_1030", "column_name_1031", "column_name_1032", >> "column_name_1033", "column_name_1034", "column_name_1035", >> "column_name_1036", "column_name_1037", "column_name_1038", >> "column_name_1039", "column_name_1040", "column_name_1041", >> "column_name_1042", "column_name_1043", "column_name_1044", >> "column_name_1045", "column_name_1046", "column_name_1047", >> "column_name_1048", "column_name_1049", "column_name_1050" FROM >> "some_table";'); >> length >> >>1024 >> (1 row) > >> That's it – with default settings, you won't see WHERE clause or >> anything else. > >If that's true, it doesn't offer much of a case for upping the limit >on track_activity_query_size. The longest such a query could reasonably >get is somewhere near NAMEDATALEN times MaxHeapAttributeNumber, which >as it happens is exactly the existing limit on track_activity_query_size. > >> As a result, many queries exceed track_activity_query_size > >How? And if they are, why do you care? Such queries sure seem >pretty content-free. > I believe the example was just a very simplistic example. ORMs can of course generate queries with joins, which can easily exceed the limit you mentioned. >> What is the overhead here except the memory consumption? > >The time to copy those strings out of shared storage, any time >you query pg_stat_activity. > IMO that seems like a reasonable price to pay, if you want to see complete queries and bump the track_activity_query_size value up. Couldn't be pg_stat_statements (or any similar extension) queryid exposure in pg_stat_activity [1] also an alternative? You wouldn't have the parameters but maybe the normalized query would be enough for most analysis. Now, maybe pg_stat_statements jumble overhead for such large statements would be even more problematic. But that would effectively add dependency on pg_stat_statements, no? I don't think we want that. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
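For readers following along, a quick way to see whether the current limit is actually truncating query texts, and to raise it, is sketched below. The value 32768 is only an example; the parameter takes effect only after a server restart because the per-backend activity buffers live in shared memory.

-- Is the current limit truncating query texts?
SHOW track_activity_query_size;

SELECT pid, length(query) AS shown_len
FROM pg_stat_activity
ORDER BY length(query) DESC
LIMIT 10;

-- Raise the limit (example value; requires a restart to take effect).
ALTER SYSTEM SET track_activity_query_size = 32768;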
Re: [PATCH] Increase the maximum value track_activity_query_size
On Mon, Dec 23, 2019 at 1:10 PM Tomas Vondra wrote: > > On Sun, Dec 22, 2019 at 09:06:41AM +0100, Julien Rouhaud wrote: > >On Sun, Dec 22, 2019 at 1:03 AM Tomas Vondra > > wrote: > >> > >> On Sat, Dec 21, 2019 at 04:25:05PM -0500, Tom Lane wrote: > >> >> What is the overhead here except the memory consumption? > >> > > >> >The time to copy those strings out of shared storage, any time > >> >you query pg_stat_activity. > >> > > >> > >> IMO that seems like a reasonable price to pay, if you want to see > >> complete queries and bump the track_activity_query_size value up. > > > >Couldn't be pg_stat_statements (or any similar extension) queryid > >exposure in pg_stat_activity [1] also an alternative? You wouldn't > >have the parameters but maybe the normalized query would be enough for > >most analysis. Now, maybe pg_stat_statements jumble overhead for such > >large statements would be even more problematic. > > > > But that would effectively add dependency on pg_stat_statements, no? I > don't think we want that. The queryid field is part of the core, so no dependency is added. You just get a somewhat useless NULL value returned until you load an extension that compute a queryid, which may be pg_stat_statements but any other one will work too.
Re: [Proposal] Extend TableAM routines for ANALYZE scan
Hello, On Thu, Dec 5, 2019 at 11:14 AM Pengzhou Tang wrote: > > When hacking the Zedstore, we need to get a more accurate statistic for > zedstore and we > faced some restrictions: > 1) acquire_sample_rows() always use RelationGetNumberOfBlocks to generate > sampling block > numbers, this is not friendly for zedstore which wants to use a logical > block number and might also > not friendly to non-block-oriented Table AMs. > 2) columns of zedstore table store separately, so columns in a row have a > different physical position, > tid in a tuple is invalid for zedstore which means the correlation > statistic is incorrect for zedstore. > 3) RelOptInfo->pages is not correct for Zedstore if we only access partial of > the columns which make >the IO cost much higher than the actual cost. > > For 1) and 2), we propose to extend existing ANALYZE-scan table AM routines > in patch > "0001-ANALYZE-tableam-API-change.patch" which add three more APIs: > scan_analyze_beginscan(), scan_analyze_sample_tuple(), > scan_analyze_endscan(). This provides > more convenience and table AMs can take more control of every step of > sampling rows. Meanwhile, > with the new structure named "AcquireSampleContext", we can acquire extra > info (eg: physical position, > physical size) except the real columns values. > > For 3), we hope we can have a similar mechanism with RelOptInfo->rows which > is calculated from > (RelOptInfo->tuples * Selectivity), we can calculate RelOptInfo->pages with > a page selectivity which > is base on the selected zedstore columns. > 0002-Planner-can-estimate-the-pages-based-on-the-columns-.patch > shows one idea that adding the `stadiskfrac` to pg_statistic and planner use > it to estimate the > RelOptInfo->pages. > > 0003-ZedStore-use-extended-ANAlYZE-API.patch is attached to only show how > Zedstore use the > previous patches to achieve: > 1. use logical block id to acquire the sample rows. > 2. can only acquire sample rows from specified column c1, this is used when > user only analyze table > on specified columns eg: "analyze zs (c1)". > 3 when ANALYZE, zedstore table AM provided extra disksize info, then ANALYZE > compute the > physical fraction statistic of each column and planner use it to estimate > the IO cost based on > the selected columns. I couldn't find an entry for that patchset in the next commitfest. Could you register it so that it won't be forgotten?
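Since the patches themselves are not quoted here, the following is a purely illustrative guess at the shape of the three callbacks named in the proposal. The signatures and the opaque AcquireSampleContext are assumptions for this sketch; the real definitions are in 0001-ANALYZE-tableam-API-change.patch and may differ.

#include "postgres.h"
#include "executor/tuptable.h"
#include "utils/rel.h"

typedef struct AcquireSampleContext AcquireSampleContext;  /* opaque here */

/* Set up an AM-specific sampling scan (block-based or not). */
typedef void (*scan_analyze_beginscan_function) (Relation rel,
                                                 AcquireSampleContext *ctx);

/* Produce the next sample tuple, if any, into the slot; false when done. */
typedef bool (*scan_analyze_sample_tuple_function) (Relation rel,
                                                    AcquireSampleContext *ctx,
                                                    TupleTableSlot *slot);

/* Tear down the sampling scan. */
typedef void (*scan_analyze_endscan_function) (Relation rel,
                                               AcquireSampleContext *ctx);

The stated intent is that the context can also carry AM-specific extras such as physical position and on-disk size, which a column store like Zedstore needs for per-column statistics.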
Re: Fetching timeline during recovery
On Mon, 23 Dec 2019 12:36:56 +0900 Michael Paquier wrote: > On Fri, Dec 20, 2019 at 11:14:28AM +0100, Jehan-Guillaume de Rorthais wrote: > > Yes, that would be great but sadly, it would introduce a regression on > > various tools relying on them. At least, the one doing "select *" or most > > probably "select func()". > > > > But anyway, adding 5 funcs is not a big deal neither. Too bad they are so > > close to existing ones though. > > Consistency of the data matters a lot if we want to build reliable > tools on top of them in case someone would like to compare the various > modes, and using different functions for those fields creates locking > issues (somewhat the point of Fujii-san upthread?). To sum up: the original patch was about fetching the current timeline of a standby from memory without relying on the asynchronous controlfile or pg_stat_get_wal_receiver() which only shows data when the wal_receiver is running. Fujii-san was pointing out we must fetch both the received LSN and its timeline with the same lock so they are consistent. Michael is now discussing about fetching multiple LSN and their timeline, while keeping them consistent, eg. received+tli and applied+tli. Thank you for pointing this out. I thought about various way to deal with this concern and would like to discuss/defend a new option based on existing pg_stat_get_wal_receiver() function. The only problem I'm facing with this function is that it returns a full NULL record if no wal receiver is active. My idea would be to return a row from pg_stat_get_wal_receiver() as soon as a wal receiver has been replicating during the uptime of the standby, no matter if there's one currently working or not. If no wal receiver is active, the "pid" field would be NULL and the "status" would reports eg. "inactive". All other fields would report their last known value as they are kept in shared memory WalRcv struct. From the monitoring and HA point of view, we are now able to know that a wal receiver existed, the lsn it has stopped, on what timeline, all consistent with the same lock. That answer my original goal. We could extend this with two more fields about replayed lsn and timeline to address last Michael's concern if we decide it's really needed (and I think it's a valid concern for eg. monitoring tools). There's some more potential discussion about the pg_stat_wal_receiver view which relies on pg_stat_get_wal_receiver(). My proposal do not introduce regression with it as the view already filter out NULL data using "WHERE s.pid IS NOT NULL". But: 1. we could decide to remove this filter to expose the data even when no wal receiver is active. It's the same behavior than pg_stat_subscription view. It could introduce regression from tools point of view, but adds some useful information. I would vote 0 for it. 2. we could extend it with new replayed lsn/tli fields. I would vote +1 for it. On the "dark" side of this proposal, we do not deal with the primary side. We still have no way to fetch various lsn+tli from the WAL Writer. However, I included pg_current_wal_lsn_tl() in my original patch only for homogeneity reason and the discussion slipped on this side while paying attention to the user facing function logic and homogeneity. If this discussion decide this is a useful feature, I think it could be addressed in another patch (and I volunteer to deal with it). Bellow the sum up this 6th proposition with examples. 
When wal receiver never started (same as today):

-[ RECORD 1 ]---------+------------------------------
pid                   | Ø
status                | Ø
receive_start_lsn     | Ø
receive_start_tli     | Ø
received_lsn          | Ø
received_tli          | Ø
last_msg_send_time    | Ø
last_msg_receipt_time | Ø
latest_end_lsn        | Ø
latest_end_time       | Ø
slot_name             | Ø
sender_host           | Ø
sender_port           | Ø
conninfo              | Ø

When wal receiver is active:

$ select * from pg_stat_get_wal_receiver();
-[ RECORD 1 ]---------+------------------------------
pid                   | 8576
status                | streaming
receive_start_lsn     | 0/400
receive_start_tli     | 1
received_lsn          | 0/4000148
received_tli          | 1
last_msg_send_time    | 2019-12-23 12:28:52.588738+01
last_msg_receipt_time | 2019-12-23 12:28:52.588839+01
latest_end_lsn        | 0/4000148
latest_end_time       | 2019-12-23 11:15:43.431657+01
slot_name             | Ø
sender_host           | /tmp
sender_port           | 15441
conninfo              | port=15441 application_name=s

When wal receiver is not running and shared memory WalRcv is reporting past activity:

$ select * from pg_stat_get_wal_receiver();
-[ RECORD 1 ]---------+------------------------------
pid                   | Ø
status                | inactive
receive_start_lsn     | 0/400
receive_start_tli     | 1
received_lsn          | 0/4000148
received_tli
Re: unsupportable composite type partition keys
Amit Langote writes: > On Sun, Dec 22, 2019 at 6:13 AM Tom Lane wrote: > + * To ensure that it's not leaked completely, re-attach it to the > + * new reldesc, or make it a child of the new reldesc's rd_pdcxt > + * in the unlikely event that there is one already. (See hack in > + * RelationBuildPartitionDesc.) > While I can imagine that RelationBuildPartitionDesc() might encounter > an old partition descriptor making the re-parenting hack necessary > there, I don't see why it's needed here, because a freshly built > relation descriptor would not contain the partition descriptor after > this patch. Well, as the comment says, that's probably unreachable today. But I could see it happening in the future, particularly if we ever allow partitioned system catalogs. There are a lot of paths through this code that are not obvious to the naked eye, and some of them can cause relcache entries to get populated behind-your-back. Most of relcache.c is careful about this; I do not see an excuse for the partition-data code to be less so, even if we think it can't happen today. (I notice that RelationBuildPartitionKey is making a similar assumption that the partkey couldn't magically appear while it's working, and I don't like it much there either.) >> * It'd be better to declare RelationGetPartitionKey and >> RelationGetPartitionDesc in relcache.h and get their callers out of the >> business of including rel.h, where possible. > Although I agree to declare them in relcache.h, that doesn't reduce > the need to include rel.h in their callers much, because anyplace that > uses RelationGetPartitionDesc() is also very likely to use > RelationGetRelid() which is in rel.h. Hm. Well, we can try anyway. > [1] > https://www.postgresql.org/message-id/CA%2BHiwqFucUh7hYkfZ6x1MVcs_R24eUfNVuRwdE_FwuwK8XpSZg%40mail.gmail.com Oh, interesting --- I hadn't been paying much attention to that thread. I'll compare your PoC there to what I did. regards, tom lane
reduce size of fmgr_builtins array
Hi all, Currently, we include the function name string in each FmgrBuiltin struct, whose size is 24 bytes on 64 bit platforms. As far as I can tell, the name is usually unused, so the attached (WIP, untested) patch stores it separately, reducing this struct to 16 bytes. We can go one step further and allocate the names as a single character string, reducing the binary size. It doesn't help much to store offsets, since there are ~40k characters, requiring 32-bit offsets. If we instead compute the offset on the fly from stored name lengths, we can use 8-bit values, saving 19kB of space in the binary over using string pointers. -- John Naylorhttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services v1-fmgr-func-names.patch Description: Binary data
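The offset arithmetic described above is easy to picture with a small standalone sketch: store the builtin names as one concatenated, NUL-separated string plus a parallel array of 8-bit lengths, and recover a name by summing lengths up to the index. The sample data below is invented and stands in for the generated tables; the linear scan is assumed to be acceptable because, as noted, the name is rarely looked up.

#include <stdint.h>
#include <stdio.h>

/* Invented sample data standing in for the generated tables. */
static const char fmgr_builtin_names[] =
    "boolin\0boolout\0byteain\0byteaout\0";
static const uint8_t fmgr_builtin_name_lens[] = {
    7, 8, 8, 9          /* length of each name including its trailing NUL */
};

/* Recover the name of builtin function number idx by summing lengths. */
static const char *
fmgr_builtin_name(int idx)
{
    const char *p = fmgr_builtin_names;

    for (int i = 0; i < idx; i++)
        p += fmgr_builtin_name_lens[i];
    return p;
}

int
main(void)
{
    printf("%s\n", fmgr_builtin_name(2));   /* prints "byteain" */
    return 0;
}

With one byte of length per entry instead of an 8-byte pointer (or a 4-byte offset), the per-function overhead for the name drops accordingly, which is where the quoted ~19kB saving comes from.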
Re: unsupportable composite type partition keys
BTW, I forgot to mention: while I think the patch to forbid pseudotypes by using CheckAttributeType() can be back-patched, I'm leaning towards not back-patching the other patch. The situation where we get into infinite recursion seems not very likely in practice, and it's not going to cause any crash or data loss, so I think we can just say "sorry that's not supported before v13". The patch as I'm proposing it seems rather invasive for a back-branch fix. Also, changing RelationGetPartitionKey/Desc from macros to functions is at least a weak ABI break. If there are extensions calling either, they might still work without a recompile --- but if they have code paths that are the first thing to touch either field since a relcache flush, they'd crash. regards, tom lane
Re: unsupportable composite type partition keys
On Mon, Dec 23, 2019 at 23:49 Tom Lane wrote: > Amit Langote writes: > > [1] > https://www.postgresql.org/message-id/CA%2BHiwqFucUh7hYkfZ6x1MVcs_R24eUfNVuRwdE_FwuwK8XpSZg%40mail.gmail.com > > Oh, interesting --- I hadn't been paying much attention to that thread. > I'll compare your PoC there to what I did. Actually, I should’ve said that your patch is much better attempt at getting this in order, so there’s not much to see in my patch really. :) Thanks, Amit >
Re: [HACKERS] Block level parallel vacuum
On Mon, 23 Dec 2019 at 16:11, Amit Kapila wrote:
>
> On Fri, Dec 20, 2019 at 12:13 PM Masahiko Sawada wrote:
> >
> > I've attached the updated version patch that incorporated the all
> > review comments I go so far.
> >
>
> I have further edited the first two patches posted by you. The
> changes include (a) changed tests to reset the guc, (b) removing some
> stuff which is not required in this version, (c) moving some variables
> around to make them in better order, (d) changed comments and few
> other cosmetic things and (e) commit messages for first two patches.
>
> I think the first two patches attached in this email are in good shape
> and we can commit those unless you or someone has more comments on
> them, the main parallel vacuum patch can still be improved by some
> more test/polish/review. I am planning to push the first two patches
> next week after another pass. The first two patches are explained in
> brief as below:
>
> 1. v4-0001-Delete-empty-pages-in-each-pass-during-GIST-VACUUM: It
> allows us to delete empty pages in each pass during GIST VACUUM.
> Earlier, we use to postpone deleting empty pages till the second stage
> of vacuum to amortize the cost of scanning internal pages. However,
> that can sometimes (say vacuum is canceled or errored between first
> and second stage) delay the pages to be recycled. Another thing is
> that to facilitate deleting empty pages in the second stage, we need
> to share the information of internal and empty pages between different
> stages of vacuum. It will be quite tricky to share this information
> via DSM which is required for the main parallel vacuum patch. Also,
> it will bring the logic to reclaim deleted pages closer to nbtree
> where we delete empty pages in each pass. Overall, the advantages of
> deleting empty pages in each pass outweigh the advantages of
> postponing the same. This patch is discussed in detail in a separate
> thread [1].
>
> 2. v39-0001-Introduce-IndexAM-fields-for-parallel-vacuum.patch:
> Introduce new fields amusemaintenanceworkmem and
> amparallelvacuumoptions in IndexAmRoutine for parallel vacuum. The
> amusemaintenanceworkmem tells whether a particular IndexAM uses
> maintenance_work_mem or not. This will help in controlling the memory
> used by individual workers as otherwise, each worker can consume
> memory equal to maintenance_work_mem. This has been discussed in
> detail in a separate thread as well [2]. The amparallelvacuumoptions
> tell whether a particular IndexAM participates in a parallel vacuum
> and if so in which phase (bulkdelete, vacuumcleanup) of vacuum.
>
> [1] - https://www.postgresql.org/message-id/CAA4eK1LGr%2BMN0xHZpJ2dfS8QNQ1a_aROKowZB%2BMPNep8FVtwAA%40mail.gmail.com
> [2] - https://www.postgresql.org/message-id/caa4ek1lmcd5apogzwim5nn58ki+74a6edghx4wd8haskvha...@mail.gmail.com
>

Hi,
I reviewed the v39 patch set. Below are some minor review comments:

1.
+ * memory equal to maitenance_work_mem, the new maitenance_work_mem for

maitenance_work_mem should be replaced by maintenance_work_mem.

2.
+ * The number of workers can vary between and bulkdelete and cleanup

I think the above sentence is grammatically incorrect; "and" is extra in it.

3.
+ /*
+  * Open table. The lock mode is the same as the leader process. It's
+  * okay because The lockmode does not conflict among the parallel workers.
+  */

I think "lock mode" and "lockmode" should be spelled the same way (i.e. the extra space should be removed from "lock mode"). In "The", "T" should be a lower-case letter.

4.
+ /* We don't support parallel vacuum for autovacuum for now */

I think the above sentence should be "As of now, we don't support parallel vacuum for autovacuum".

5.
I am not sure that I am right, but I can see that we are not consistent when ending single-line comments. I think that if a single-line comment starts with an upper-case letter, we should not put a period (dot) at the end, but if it starts with a lower-case letter, we should put a period at the end.

a)
+ /* parallel vacuum must be active */

I think we should end the above comment with a dot, or we should make the "p" of "parallel" an upper-case letter.

b)
+ /* At least count itself */

I think the above is correct.

If my understanding is correct, then please let me know so that I can make these changes on top of the v39 patch set.

6.
+boolamusemaintenanceworkmem;

I think we haven't run pgindent.

Thanks and Regards
Mahendra Thalor
EnterpriseDB: http://www.enterprisedb.com
Re: unsupportable composite type partition keys
Amit Langote writes: > On Mon, Dec 23, 2019 at 23:49 Tom Lane wrote: >> Oh, interesting --- I hadn't been paying much attention to that thread. >> I'll compare your PoC there to what I did. > Actually, I should’ve said that your patch is much better attempt at > getting this in order, so there’s not much to see in my patch really. :) One thing I see is that you chose to relocate RelationGetPartitionDesc's declaration to partdesc.h, whereupon RelationBuildPartitionDesc doesn't have to be exported at all anymore. Perhaps that's a better factorization than what I did. It supposes that any caller of RelationGetPartitionDesc is going to need partdesc.h, but that seems reasonable. We could likewise move RelationGetPartitionKey to partcache.h. regards, tom lane
relpages of btree indexes are not truncating even after deleting all the tuples from table and doing vacuum
Hi All,

While testing the "parallel vacuum" patch, I found that the size of an index relation is not reduced even after deleting all the tuples and firing a vacuum command. I am not sure whether this is expected behavior or not. For reference, below I am giving one example.

postgres=# create table test (a int);
CREATE TABLE
postgres=# create index indx1 on test (a);
CREATE INDEX
postgres=# insert into test (select a from generate_series(1,10) a);
INSERT 0 10
postgres=# analyze ;
ANALYZE
postgres=# select relpages, relname from pg_class where relname = 'indx1';
 relpages | relname
----------+---------
      276 | indx1
(1 row)

-- delete all the tuples from table.
postgres=# delete from test ;
DELETE 10

-- do vacuum to test tables
postgres=# vacuum test ;
VACUUM

-- check relpages in 'indx1' and 'test'
postgres=# select relpages, relname from pg_class where relname = 'indx1';
 relpages | relname
----------+---------
      276 | indx1
(1 row)

-- do vacuum to all the tables and check relpages in 'indx1'
postgres=# vacuum ;
VACUUM
postgres=# select relpages, relname from pg_class where relname = 'indx1';
 relpages | relname
----------+---------
      276 | indx1
(1 row)

-- check relpages in 'test' table
postgres=# select relpages, relname from pg_class where relname = 'test';
 relpages | relname
----------+---------
        0 | test
(1 row)

From the above example, we can see that after deleting all the tuples from the table and firing a vacuum command, the size of the table is reduced but the size of the index relation is the same as before vacuum.

Please let me know your thoughts.

Thanks and Regards
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com
string literal continuations in C
Per a recent thread, these patches remove string literals split with \-escaped newlines. The first is for the message "materialize mode required, but it is not allowed in this context" where it's more prevalent, and we keep perpetuating it; the second is for other messages, whose bulk is in dblink and tablefunc. I think the split is pointless and therefore propose to push both together as a single commit, but maybe somebody would like me to leave those contrib modules alone. There are many other error messages that are split with no \; I would prefer not to have them, but maybe it would be too intrusive to change them all. So let's do this for now and remove this one point of ugliness. -- Álvaro Herrera39°49'30"S 73°17'W >From 307d334046c62607e907a4f7a71e121c5c4b3dba Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Mon, 23 Dec 2019 16:23:57 -0300 Subject: [PATCH 1/2] Fix line continuation --- contrib/pg_stat_statements/pg_stat_statements.c | 3 +-- contrib/tablefunc/tablefunc.c | 12 src/backend/access/transam/xlogfuncs.c | 3 +-- src/backend/commands/extension.c| 9 +++-- src/backend/commands/prepare.c | 3 +-- src/backend/libpq/hba.c | 3 +-- src/backend/replication/logical/launcher.c | 3 +-- src/backend/replication/slotfuncs.c | 3 +-- src/backend/replication/walsender.c | 3 +-- src/backend/utils/adt/pgstatfuncs.c | 6 ++ src/backend/utils/misc/guc.c| 3 +-- src/backend/utils/mmgr/portalmem.c | 3 +-- 12 files changed, 18 insertions(+), 36 deletions(-) diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c index 63b5888ebb..0590a6769b 100644 --- a/contrib/pg_stat_statements/pg_stat_statements.c +++ b/contrib/pg_stat_statements/pg_stat_statements.c @@ -1415,8 +1415,7 @@ pg_stat_statements_internal(FunctionCallInfo fcinfo, if (!(rsinfo->allowedModes & SFRM_Materialize)) ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), - errmsg("materialize mode required, but it is not " \ - "allowed in this context"))); + errmsg("materialize mode required, but it is not allowed in this context"))); /* Switch into long-lived context to construct returned data structures */ per_query_ctx = rsinfo->econtext->ecxt_per_query_memory; diff --git a/contrib/tablefunc/tablefunc.c b/contrib/tablefunc/tablefunc.c index 256d52fc62..30d94f5d90 100644 --- a/contrib/tablefunc/tablefunc.c +++ b/contrib/tablefunc/tablefunc.c @@ -373,8 +373,7 @@ crosstab(PG_FUNCTION_ARGS) if (!(rsinfo->allowedModes & SFRM_Materialize)) ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), - errmsg("materialize mode required, but it is not " \ - "allowed in this context"))); + errmsg("materialize mode required, but it is not allowed in this context"))); per_query_ctx = rsinfo->econtext->ecxt_per_query_memory; @@ -650,8 +649,7 @@ crosstab_hash(PG_FUNCTION_ARGS) rsinfo->expectedDesc == NULL) ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), - errmsg("materialize mode required, but it is not " \ - "allowed in this context"))); + errmsg("materialize mode required, but it is not allowed in this context"))); per_query_ctx = rsinfo->econtext->ecxt_per_query_memory; oldcontext = MemoryContextSwitchTo(per_query_ctx); @@ -1006,8 +1004,7 @@ connectby_text(PG_FUNCTION_ARGS) rsinfo->expectedDesc == NULL) ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), - errmsg("materialize mode required, but it is not " \ - "allowed in this context"))); + errmsg("materialize mode required, but it is not allowed in this context"))); if (fcinfo->nargs == 6) { @@ -1086,8 +1083,7 @@ 
connectby_text_serial(PG_FUNCTION_ARGS) rsinfo->expectedDesc == NULL) ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), - errmsg("materialize mode required, but it is not " \ - "allowed in this context"))); + errmsg("materialize mode required, but it is not allowed in this context"))); if (fcinfo->nargs == 7) { diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c index 5fd12152b2..d7d17efaf9 100644 --- a/src/backend/access/transam/xlogfuncs.c +++ b/src/backend/access/transam/xlogfuncs.c @@ -186,8 +186,7 @@ pg_stop_backup_v2(PG_FUNCTION_ARGS) if (!(rsinfo->allowedModes & SFRM_Materialize)) ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), - errmsg("materialize mode required, but it is not " \ - "allowed in this context"))); + errmsg("materialize mode required, but it is not allowed in this context"))); /* Build a tuple descriptor for our result type */ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE) diff --git a/src/backend/commands/extension.c b/src/backend/c
Re: Should we rename amapi.h and amapi.c?
On Mon, Dec 23, 2019 at 02:34:34PM +0900, Michael Paquier wrote: > Hi all, > > I was working on some stuff for table AMs, and I got to wonder it we > had better rename amapi.h to indexam.h and amapi.c to indexam.c, so as > things are more consistent with table AM. It is a bit annoying to > name the files dedicated to index AMs with what looks like now a too > generic name. That would require switching a couple of header files > for existing module developers, which is always annoying, but the move > makes sense thinking long-term? +1 for being more specific about which AM we're talking about. Best, David. -- David Fetter http://fetter.org/ Phone: +1 415 235 3778 Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
Re: Should we rename amapi.h and amapi.c?
On Sun, Dec 22, 2019 at 9:34 PM Michael Paquier wrote: > Hi all, > > I was working on some stuff for table AMs, and I got to wonder it we > had better rename amapi.h to indexam.h and amapi.c to indexam.c, so as > things are more consistent with table AM. It is a bit annoying to > name the files dedicated to index AMs with what looks like now a too > generic name. That would require switching a couple of header files > for existing module developers, which is always annoying, but the move > makes sense thinking long-term? > > Any thoughts? > I had raised the same earlier and [1] has response from Andres, which was "We probably should rename it, but not in 12..." [1] https://www.postgresql.org/message-id/20190508215135.4eljnhnle5xp3jwb%40alap3.anarazel.de
Re: relpages of btree indexes are not truncating even after deleting all the tuples from table and doing vacuum
On Mon, Dec 23, 2019 at 11:05 AM Mahendra Singh wrote: > From above example, we can see that after deleting all the tuples from table > and firing vacuum command, size of table is reduced but size of index > relation is same as before vacuum. VACUUM is only able to make existing empty pages in indexes recyclable by future page splits within the same index. It is not possible for it to reclaim space for the filesystem. Workload characteristics tend to determine whether or not this limitation is truly important. You can observe which pages are "free" in this sense (i.e. whether they've been placed by the FSM for recycling) by using contrib/pg_freespacemap. -- Peter Geoghegan
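For the example index above, something like the following should show that behaviour; exact avail values depend on how the FSM buckets free space, so treat a nonzero value simply as "recorded as reusable".

CREATE EXTENSION IF NOT EXISTS pg_freespacemap;

-- How many pages of the example index the FSM now records as reusable.
SELECT count(*) AS total_pages,
       count(*) FILTER (WHERE avail > 0) AS reusable_pages
FROM pg_freespace('indx1');

If the goal is to actually give the disk space back to the filesystem, rebuilding the index with REINDEX (or VACUUM FULL on the table) is the usual way to do that.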
Re: unsupportable composite type partition keys
I wrote: > One thing I see is that you chose to relocate RelationGetPartitionDesc's > declaration to partdesc.h, whereupon RelationBuildPartitionDesc doesn't > have to be exported at all anymore. Perhaps that's a better factorization > than what I did. It supposes that any caller of RelationGetPartitionDesc > is going to need partdesc.h, but that seems reasonable. We could likewise > move RelationGetPartitionKey to partcache.h. I concluded that that is indeed a better solution; it does allow removing some rel.h inclusions (though possibly those were just duplicative?), and it also means that relcache.c itself doesn't need any partitioning inclusions at all. Here's a cleaned-up patch that does it like that and also fixes the memory leakage issue. I noticed along the way that with partkeys only being loaded on demand, we no longer need the incredibly-unsafe hack in RelationBuildPartitionKey whereby it just silently ignores failure to read the pg_partitioned_table entry. I also rearranged RelationBuildPartitionDesc so that it uses the same context-reparenting trick as RelationBuildPartitionKey. That doesn't save us anything, but it makes the code considerably more robust, I think; we don't need to assume as much about what partition_bounds_copy does. One other thing worth noting is that I used unlikely() to try to discourage the compiler from inlining RelationBuildPartitionDesc into RelationGetPartitionDesc (and likewise for the Key functions). Not sure how effective that is, but it can't hurt. I haven't removed equalPartitionDescs here; that seems like material for a separate patch (to make it easier to put it back if we need it). regards, tom lane diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c index 452a7f3..8b68fb7 100644 --- a/src/backend/catalog/heap.c +++ b/src/backend/catalog/heap.c @@ -83,7 +83,6 @@ #include "utils/inval.h" #include "utils/lsyscache.h" #include "utils/partcache.h" -#include "utils/rel.h" #include "utils/ruleutils.h" #include "utils/snapmgr.h" #include "utils/syscache.h" diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c index 8f242ae..a143998 100644 --- a/src/backend/commands/indexcmds.c +++ b/src/backend/commands/indexcmds.c @@ -812,7 +812,7 @@ DefineIndex(Oid relationId, */ if (partitioned && (stmt->unique || stmt->primary)) { - PartitionKey key = rel->rd_partkey; + PartitionKey key = RelationGetPartitionKey(rel); int i; /* diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c index d8cbd6a..dc3d792 100644 --- a/src/backend/executor/execPartition.c +++ b/src/backend/executor/execPartition.c @@ -30,7 +30,6 @@ #include "rewrite/rewriteManip.h" #include "utils/lsyscache.h" #include "utils/partcache.h" -#include "utils/rel.h" #include "utils/rls.h" #include "utils/ruleutils.h" diff --git a/src/backend/partitioning/partbounds.c b/src/backend/partitioning/partbounds.c index d160d6f..4c6ca89 100644 --- a/src/backend/partitioning/partbounds.c +++ b/src/backend/partitioning/partbounds.c @@ -35,7 +35,6 @@ #include "utils/hashutils.h" #include "utils/lsyscache.h" #include "utils/partcache.h" -#include "utils/rel.h" #include "utils/ruleutils.h" #include "utils/snapmgr.h" #include "utils/syscache.h" @@ -775,6 +774,11 @@ partition_bounds_equal(int partnatts, int16 *parttyplen, bool *parttypbyval, /* * Return a copy of given PartitionBoundInfo structure. The data types of bounds * are described by given partition key specification. 
+ * + * Note: it's important that this function and its callees not do any catalog + * access, nor anything else that would result in allocating memory other than + * the returned data structure. Since this is called in a long-lived context, + * that would result in unwanted memory leaks. */ PartitionBoundInfo partition_bounds_copy(PartitionBoundInfo src, diff --git a/src/backend/partitioning/partdesc.c b/src/backend/partitioning/partdesc.c index 6ede084..e26e45e 100644 --- a/src/backend/partitioning/partdesc.c +++ b/src/backend/partitioning/partdesc.c @@ -47,17 +47,48 @@ typedef struct PartitionDirectoryEntry PartitionDesc pd; } PartitionDirectoryEntry; +static void RelationBuildPartitionDesc(Relation rel); + + +/* + * RelationGetPartitionDesc -- get partition descriptor, if relation is partitioned + * + * Note: we arrange for partition descriptors to not get freed until the + * relcache entry's refcount goes to zero (see hacks in RelationClose, + * RelationClearRelation, and RelationBuildPartitionDesc). Therefore, even + * though we hand back a direct pointer into the relcache entry, it's safe + * for callers to continue to use that pointer as long as (a) they hold the + * relation open, and (b) they hold a relation lock strong enough to ensure + * that the data doesn't become stale. + */ +PartitionDesc +RelationGetPartitionDesc(Relation rel) +{ + if (rel->rd_rel->relkind != RELKIND_PARTITIONED
Re: unsupportable composite type partition keys
On Tue, Dec 24, 2019 at 12:00 AM Tom Lane wrote:
> BTW, I forgot to mention: while I think the patch to forbid pseudotypes
> by using CheckAttributeType() can be back-patched, I'm leaning towards
> not back-patching the other patch. The situation where we get into
> infinite recursion seems not very likely in practice, and it's not
> going to cause any crash or data loss, so I think we can just say
> "sorry that's not supported before v13". The patch as I'm proposing
> it seems rather invasive for a back-branch fix.

It is indeed. Just to be sure, by going with "unsupported before v13", which one do you mean:

* documenting it as so
* giving an error in such cases, like the patch in the first email on this thread did
* doing nothing really

Thanks,
Amit
Re: error context for vacuum to include block number
On Mon, Dec 16, 2019 at 11:49:56AM +0900, Michael Paquier wrote: > On Sun, Dec 15, 2019 at 10:27:12AM -0600, Justin Pryzby wrote: > > I named it so because it calls both lazy_vacuum_index > > ("PROGRESS_VACUUM_PHASE_VACUUM_INDEX") and > > lazy_vacuum_heap("PROGRESS_VACUUM_PHASE_VACUUM_HEAP") > > > > I suppose you don't think the other way around is better? > > lazy_vacuum_index_heap > > Dunno. Let's see if others have other thoughts on the matter. FWIW, > I have a long history at naming things in a way others don't like :) I renamed. And deduplicated two more hunks into a 2nd function. (I'm also including the changes I mentioned here ... in case anyone cares to comment or review). https://www.postgresql.org/message-id/20191220171132.GB30414%40telsasoft.com >From 5317d9f3cee163762563f2255f1dd26800ea858b Mon Sep 17 00:00:00 2001 From: Justin Pryzby Date: Wed, 27 Nov 2019 20:07:10 -0600 Subject: [PATCH v6 1/6] Rename buf to avoid shadowing buf of another type --- src/backend/access/heap/vacuumlazy.c | 20 ++-- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c index ab09d84..3f52278 100644 --- a/src/backend/access/heap/vacuumlazy.c +++ b/src/backend/access/heap/vacuumlazy.c @@ -517,7 +517,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats, BlockNumber next_unskippable_block; bool skipping_blocks; xl_heap_freeze_tuple *frozen; - StringInfoData buf; + StringInfoData sbuf; const int initprog_index[] = { PROGRESS_VACUUM_PHASE, PROGRESS_VACUUM_TOTAL_HEAP_BLKS, @@ -1479,33 +1479,33 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats, * This is pretty messy, but we split it up so that we can skip emitting * individual parts of the message when not applicable. 
*/ - initStringInfo(&buf); - appendStringInfo(&buf, + initStringInfo(&sbuf); + appendStringInfo(&sbuf, _("%.0f dead row versions cannot be removed yet, oldest xmin: %u\n"), nkeep, OldestXmin); - appendStringInfo(&buf, _("There were %.0f unused item identifiers.\n"), + appendStringInfo(&sbuf, _("There were %.0f unused item identifiers.\n"), nunused); - appendStringInfo(&buf, ngettext("Skipped %u page due to buffer pins, ", + appendStringInfo(&sbuf, ngettext("Skipped %u page due to buffer pins, ", "Skipped %u pages due to buffer pins, ", vacrelstats->pinskipped_pages), vacrelstats->pinskipped_pages); - appendStringInfo(&buf, ngettext("%u frozen page.\n", + appendStringInfo(&sbuf, ngettext("%u frozen page.\n", "%u frozen pages.\n", vacrelstats->frozenskipped_pages), vacrelstats->frozenskipped_pages); - appendStringInfo(&buf, ngettext("%u page is entirely empty.\n", + appendStringInfo(&sbuf, ngettext("%u page is entirely empty.\n", "%u pages are entirely empty.\n", empty_pages), empty_pages); - appendStringInfo(&buf, _("%s."), pg_rusage_show(&ru0)); + appendStringInfo(&sbuf, _("%s."), pg_rusage_show(&ru0)); ereport(elevel, (errmsg("\"%s\": found %.0f removable, %.0f nonremovable row versions in %u out of %u pages", RelationGetRelationName(onerel), tups_vacuumed, num_tuples, vacrelstats->scanned_pages, nblocks), - errdetail_internal("%s", buf.data))); - pfree(buf.data); + errdetail_internal("%s", sbuf.data))); + pfree(sbuf.data); } -- 2.7.4 >From 8fdee2073e083388865f52874a672e8cc47f7c51 Mon Sep 17 00:00:00 2001 From: Justin Pryzby Date: Thu, 12 Dec 2019 19:57:25 -0600 Subject: [PATCH v6 2/6] deduplication --- src/backend/access/heap/vacuumlazy.c | 103 +++ 1 file changed, 43 insertions(+), 60 deletions(-) diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c index 3f52278..c6dc44b 100644 --- a/src/backend/access/heap/vacuumlazy.c +++ b/src/backend/access/heap/vacuumlazy.c @@ -153,6 +153,7 @@ static BufferAccessStrategy vac_strategy; static void lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats, Relation *Irel, int nindexes, bool aggressive); +static void lazy_vacuum_index_and_heap(Relation onerel, LVRelStats *vacrelstats, Relation *Irel, int nindexes, IndexBulkDeleteResult **indstats); static void lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats); static bool lazy_check_needs_freeze(Buffer buf, bool *hastup); static void lazy_vacuum_index(Relation indrel, @@ -740,12 +741,6 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats, if ((vacrelstats->max_dead_tuples - vacrelstats->num_dead_tuples) < MaxHeapTuplesPerPage && vacrelstats->num_dead_tuples > 0) { - const int hvp_index[] = { -PROGRESS_VACUUM_PHASE, -PROGRESS_VACUUM_NUM_INDEX_VACUUMS - }; - int64 hvp_val[2]; - /* * Before beginning index vacuuming, we release any pin we may * hold on
Re: unsupportable composite type partition keys
On Tue, Dec 24, 2019 at 6:33 AM Tom Lane wrote: > I wrote: > > One thing I see is that you chose to relocate RelationGetPartitionDesc's > > declaration to partdesc.h, whereupon RelationBuildPartitionDesc doesn't > > have to be exported at all anymore. Perhaps that's a better factorization > > than what I did. It supposes that any caller of RelationGetPartitionDesc > > is going to need partdesc.h, but that seems reasonable. We could likewise > > move RelationGetPartitionKey to partcache.h. > > I concluded that that is indeed a better solution; it does allow removing > some rel.h inclusions (though possibly those were just duplicative?), and > it also means that relcache.c itself doesn't need any partitioning > inclusions at all. > > Here's a cleaned-up patch that does it like that and also fixes the > memory leakage issue. Thanks for the updated patch. I didn't find anything to complain about. > I haven't removed equalPartitionDescs here; that seems like material > for a separate patch (to make it easier to put it back if we need it). Seems like a good idea. Btw, does the memory leakage fix in this patch address any of the pending concerns that were discussed on the "hyrax vs. RelationBuildPartitionDesc" thread earlier this year[1]? Thanks, Amit [1] https://www.postgresql.org/message-id/flat/3800.1560366716%40sss.pgh.pa.us#092b6b4f6bf75d2f3f90ef6a3b3eab5b
Re: Implementing Incremental View Maintenance
On Mon, 23 Dec 2019 03:41:18 -0700 (MST) legrand legrand wrote: > Hello, > regarding my initial post: > > > For each insert into a base table there are 3 statements: > > - ANALYZE pg_temp_3.pg_temp_81976 > > - WITH updt AS ( UPDATE public.mv1 AS mv SET __ivm_count__ = ... > > - DROP TABLE pg_temp_3.pg_temp_81976 > > For me there were 3 points to discuss: > - create/drop tables may bloat dictionary tables > - create/drop tables prevents "WITH updt ..." from being shared (with some > plan caching) > - generates many lines in pg_stat_statements > > In fact I like the idea of a table created per session, but I would even > prefer a common "table" shared between all sessions like GLOBAL TEMPORARY > TABLE (or something similar) as described here: > https://www.postgresql.org/message-id/flat/157703426606.1198.2452090605041230054.pgcf%40coridan.postgresql.org#331e8344bbae904350af161fb43a0aa6 Although I have not looked into this thread, this may help if it is implemented. However, it would still be necessary to truncate the table before the view maintenance, because such tables always exist and can be accessed and modified by any user. -- Yugo Nagata
Re: archive status ".ready" files may be created too early
At Sat, 21 Dec 2019 01:18:24 +, "Bossart, Nathan" wrote in > On 12/18/19, 8:34 AM, "Bossart, Nathan" wrote: > > On 12/17/19, 2:26 AM, "Kyotaro Horiguchi" wrote: > >> If so, we could amend also that case by marking the last segment as > >> .ready when XLogWrite writes the first bytes of the next segment. (As > >> the further corner-case, it still doesn't work if a continuation record > >> spans over three or more segments.. But I don't (or want not to) think > >> we don't need to consider that case..) > > > > I'm working on a new version of the patch that will actually look at > > the WAL page metadata to determine when it is safe to mark a segment > > as ready for archival. It seems relatively easy to figure out whether > > a page is the last one for the current WAL record. > > I stand corrected. My attempts to add logic to check the WAL records > added quite a bit more complexity than seemed reasonable to maintain > in this code path. For example, I didn't anticipate things like > XLOG_SWITCH records. > > I am still concerned about the corner case you noted, but I have yet > to find a practical way to handle it. You suggested waiting until > writing the first bytes of the next segment before marking a segment > as ready, but I'm not sure that fixes this problem either, and I > wonder if it could result in waiting arbitrarily long before creating > a ".ready" file in some cases. Perhaps I am misunderstanding your > suggestion. > > Another thing I noticed is that any changes in this area could impact > archive_timeout. If we reset the archive_timeout timer when we mark > the segments ready, we could force WAL switches more often. If we do > not move the timer logic, we could be resetting it before the file is > ready for the archiver. However, these differences might be subtle > enough to be okay. You're right. That doesn't seem to work. Another thing I had in mind was that XLogWrite could have an additional flag so that AdvanceXLInsertBuffer can tell it not to mark .ready. The function is called while it *is* writing a complete record. So even if AdvanceXLInsertBuffer inhibits marking .ready, the succeeding bytes come soon and we can mark the old segment as .ready at that time.

..
+ * If record_write == false, we don't mark the last segment as .ready
+ * if the caller requested to write up to segment boundary.
..
  static void
- XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
+ XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool record_write)

When XLogWrite is called with record_write = false, we don't mark .ready and don't advance lastSegSwitchTime/LSN. The next time XLogWrite is called with record_write=true, if lastSegSwitchLSN is behind the latest segment boundary before or equal to LogwrtResult.Write, mark the skipped segments as .ready and update lastSegSwitchTime/LSN. Does the above make sense? regards. -- Kyotaro Horiguchi NTT Open Source Software Center
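To make the catch-up step concrete, here is a minimal sketch of what the record_write=true path could do. This is only an illustration of the idea, not the patch: lastNotifiedSeg and XLogNotifySkippedSegments() are hypothetical names, while XLogArchiveNotifySeg() is the existing backend routine.

    /* Hypothetical bookkeeping in xlog.c: the last segment for which a
     * .ready file was created.  Only advanced when record_write == true. */
    static XLogSegNo lastNotifiedSeg = 0;

    /*
     * Hypothetical helper that XLogWrite() would call once it runs with
     * record_write == true and the caller has verified XLogArchivingActive():
     * mark every segment whose notification was deferred while a
     * continuation record was still being written.
     */
    static void
    XLogNotifySkippedSegments(XLogSegNo latest_boundary_seg)
    {
        while (lastNotifiedSeg < latest_boundary_seg)
        {
            lastNotifiedSeg++;
            XLogArchiveNotifySeg(lastNotifiedSeg);  /* existing routine */
        }
        /* lastSegSwitchTime/LSN would be advanced here as well */
    }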
Re: Condition variables vs interrupts
On Sat, Dec 21, 2019 at 2:10 PM Shawn Debnath wrote: > On Fri, Dec 20, 2019 at 12:05:34PM +1300, Thomas Munro wrote: > > I think we should probably just remove the unusual ResetLatch() call, > > rather than adding a CFI(). See attached. Thoughts? > > I agree: removing the ResetLatch() and having the wait event code deal > with it is a better way to go. I wonder why the reset was done in the > first place? Thanks for the review. I was preparing to commit this today, but I think I've spotted another latch protocol problem in the stable branches only, and I'd like to get some more eyeballs on it. I bet it's really hard to hit, but ConditionVariableSleep()'s return path (i.e. when the CV has been signalled) forgets that the latch is multiplexed and latches are merged: just because it was set by ConditionVariableSignal() doesn't mean it wasn't *also* set by die() or some other interrupt, and yet at the point we return, we've reset the latch and not run CFI(), and there's nothing to say the caller won't then immediately wait on the latch in some other code path, and sleep like a log despite the earlier delivery of (say) SIGTERM. If I'm right about that, I think the solution is to move that CFI down in the stable branches (which you already did for master in commit 1321509f). 0001-Don-t-reset-latch-in-ConditionVariablePrepare-master.patch Description: Binary data 0001-Don-t-reset-latch-in-ConditionVariableP-backbranches.patch Description: Binary data
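For anyone following along, the ordering in question can be pictured with a stripped-down wait loop. This is a sketch of the idea only, not the committed fix; cv_condition_satisfied() is a stand-in for the real "has this process been taken off the CV's wait list?" check inside ConditionVariableSleep().

    extern bool cv_condition_satisfied(void);   /* stand-in declaration */

    static void
    cv_sleep_sketch(uint32 wait_event_info)
    {
        for (;;)
        {
            /* The process latch is multiplexed: ConditionVariableSignal(),
             * die() and other interrupt handlers all set the same latch. */
            (void) WaitLatch(MyLatch, WL_LATCH_SET, -1, wait_event_info);
            ResetLatch(MyLatch);

            /*
             * Check for interrupts *after* resetting the latch, so that a
             * SIGTERM which also set the latch cannot be swallowed silently
             * before we get a chance to react to it.
             */
            CHECK_FOR_INTERRUPTS();

            if (cv_condition_satisfied())   /* stand-in for the real check */
                break;
        }
    }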
Re: Should we rename amapi.h and amapi.c?
On Mon, Dec 23, 2019 at 12:28:36PM -0800, Ashwin Agrawal wrote: > I had raised the same earlier and [1] has response from Andres, which was > "We probably should rename it, but not in 12..." > > [1] > https://www.postgresql.org/message-id/20190508215135.4eljnhnle5xp3jwb%40alap3.anarazel.de Okay, glad to see that this has been mentioned. So let's do some renaming for v13 then. I have studied first if we had better remove amapi.c, then move amvalidate() to amvalidate.c and the handler lookup routine to indexam.c as it already exists, but keeping things ordered as they are makes sense to limit spreading too much dependencies with the syscache mainly, so instead the attached patch does the following changes: - amapi.h -> indexam.h - amapi.c -> indexamapi.c. Here we have an equivalent in access/table/ as tableamapi.c. - amvalidate.c -> indexamvalidate.c - amvalidate.h -> indexamvalidate.h - genam.c -> indexgenam.c Please note that we have also amcmds.c and amcmds.c in the code, but the former could be extended to have utilities for table AMs, and the latter applies to both, so they are better left untouched in my opinion. -- Michael From e4f78403c585d6ead07f1f2c4477942920ef5b7e Mon Sep 17 00:00:00 2001 From: Michael Paquier Date: Tue, 24 Dec 2019 11:55:34 +0900 Subject: [PATCH] Rename files and headers related to index AM The following renaming is done, so as index AM source files are more consistent with table AMs, and so as no more too generic names are used: - amapi.h -> indexam.h - amapi.c -> indexamapi.c. Here we have an equivalent in access/table/ as tableamapi.c. - amvalidate.c -> indexamvalidate.c - amvalidate.h -> indexamvalidate.h - genam.c -> indexgenam.c --- contrib/bloom/blinsert.c | 2 +- contrib/bloom/bloom.h | 2 +- contrib/bloom/blutils.c| 2 +- contrib/bloom/blvacuum.c | 2 +- contrib/bloom/blvalidate.c | 2 +- contrib/sepgsql/database.c | 2 +- contrib/sepgsql/label.c| 2 +- contrib/sepgsql/proc.c | 2 +- contrib/sepgsql/relation.c | 2 +- contrib/sepgsql/schema.c | 2 +- src/backend/access/brin/brin_inclusion.c | 2 +- src/backend/access/brin/brin_minmax.c | 2 +- src/backend/access/brin/brin_validate.c| 2 +- src/backend/access/common/detoast.c| 2 +- src/backend/access/common/toast_internals.c| 2 +- src/backend/access/gin/ginvalidate.c | 2 +- src/backend/access/gist/gistbuild.c| 2 +- src/backend/access/gist/gistbuildbuffers.c | 2 +- src/backend/access/gist/gistget.c | 2 +- src/backend/access/gist/gistvacuum.c | 2 +- src/backend/access/gist/gistvalidate.c | 2 +- src/backend/access/hash/hashvalidate.c | 2 +- src/backend/access/heap/heapam.c | 2 +- src/backend/access/heap/heapam_handler.c | 2 +- src/backend/access/heap/vacuumlazy.c | 2 +- src/backend/access/index/Makefile | 8 src/backend/access/index/indexam.c | 2 +- src/backend/access/index/{amapi.c => indexamapi.c} | 6 +++--- .../index/{amvalidate.c => indexamvalidate.c} | 6 +++--- src/backend/access/index/{genam.c => indexgenam.c} | 8 src/backend/access/nbtree/nbtvalidate.c| 2 +- src/backend/access/spgist/spgdoinsert.c| 2 +- src/backend/access/spgist/spginsert.c | 2 +- src/backend/access/spgist/spgscan.c| 2 +- src/backend/access/spgist/spgutils.c | 2 +- src/backend/access/spgist/spgvacuum.c | 2 +- src/backend/access/spgist/spgvalidate.c| 2 +- src/backend/bootstrap/bootstrap.c | 2 +- src/backend/catalog/aclchk.c | 2 +- src/backend/catalog/catalog.c | 2 +- src/backend/catalog/dependency.c | 2 +- src/backend/catalog/heap.c | 2 +- src/backend/catalog/index.c| 2 +- src/backend/catalog/indexing.c | 2 +- src/backend/catalog/objectaddress.c| 2 
+- src/backend/catalog/partition.c| 2 +- src/backend/catalog/pg_collation.c | 2 +- src/backend/catalog/pg_constraint.c| 2 +- src/backend/catalog/pg_db_role_setting.c | 2 +- src/backend/catalog/pg_depend.c| 2 +- src/backend/catalog/pg_enum.c | 2 +- src/backend/catalog/pg_inherits.c | 2 +- src/backend/catalog/pg_largeobject.c | 2 +- src/backend/catalog/pg_publication.c | 2 +- src/backend/catalog/pg_range.c
Re: Increase footprint of %m and reduce strerror()
On Fri, Dec 06, 2019 at 02:09:05PM +0900, Michael Paquier wrote: > I guess that we should do that at the end of the day. A lookup at the > in-core tools I see three areas which stand out compared to the rest: > - pg_waldump, and attached is a patch for it. Okay, I have committed this one. -- Michael signature.asc Description: PGP signature
Re: Online checksums verification in the backend
On Fri, Dec 6, 2019 at 11:51 PM Julien Rouhaud wrote: > > Hi, > > This topic was discussed several times, with the most recent > discussions found at [1] and [2]. Based on those discussions, my > understanding is that the current approach in BASE_BACKUP has too many > drawbacks and we should instead do this check in the backend. I've > been working using such approach at VMware, and I'm submitting it here > to discuss the approach and rationales, and hopefully have such a > feature integrated. Thank you for working on this! > > First, this was originally developed as an extension. It means that > the check is performed using an SRF. That's maybe not the best > approach, as a transaction has to be kept for the total processing time. > It can be leveraged by checking each relation independently, but > that's still not ideal. Maybe using some utility commands (as part of > VACUUM or a new CHECK command for instance) would be a better > approach. > > This brings the second consideration: how to report the list of corrupted > blocks to end users. As I said this is for now returned via the SRF, > but this is clearly not ideal and should rather be made available more > globally. One usage of this information could be block level > recovery. I'm Cc-ing Sawada-san, as I know he's working on this and > mentioned to me that he had ideas on passing the list of corrupted blocks > using the stat collector. Yes, the list of corrupted pages is necessary for single page recovery. Apart from single page recovery, I think it's helpful for DBAs if they can find the corrupted blocks in the server logs and in a system view. I've also tried to report corrupted pages to the stats collector while researching single page recovery in PostgreSQL, but one problem is that the statistics in the stats collector are cleared during crash recovery. I want the information about block corruption to survive even when the server goes down. And we might want to add checksums to the permanent file holding the information about database corruption. The correctness of this information is important because we can fix a database by restoring some tables from a logical backup or by reindexing etc. as long as we have non-broken information about the database corruption. > > Finally, the read and locking considerations. I tried to cover that > extensively in the comments, but here are some details on how I tried > to make the check safe while trying to keep the overhead as low as > possible. First thing is that this is only doing buffered reads, > without any attempt to discard OS cache. Therefore, any discrepancy > between the OS cache and the disk cannot be detected unless you do > other actions, such as sync / drop_caches on GNU/Linux. > > An access share lock on the currently checked relation is held, > meaning that it can't get deleted/truncated. The total number of > blocks for the given fork is retrieved first, so any new block will be > ignored. Such new blocks are considered out of scope as being written > after the start of the check. > > Each time a buffer is being checked, the target buffer mapping > partition lock is acquired in shared mode, to prevent concurrent > eviction. If the buffer is found in shared buffers, it's pinned and > released immediately, just to get the state. I wonder if there is a possibility that blocks on disk are corrupted even though they are loaded in shared buffers. ISTM the above method cannot detect such corruption.
Reading and checking blocks fast is attractive, but I thought it's also important to check blocks precisely, without overlooking any corruption. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Block level parallel vacuum
On Mon, Dec 23, 2019 at 11:02 PM Mahendra Singh wrote: > > 5. I am not sure that I am right but I can see that we are not consistent > while ending the single line comments. > > I think, if single line comment is started with "upper case letter", then we > should not put period(dot) at the end of comment, but if comment started with > "lower case letter", then we should put period(dot) at the end of comment. > > a) > + /* parallel vacuum must be active */ > I think. we should end above comment with dot or we should make "p" of > parallel as upper case letter. > > b) > + /* At least count itself */ > I think, above is correct. > I have checked a few files in this context and I don't see any consistency, so I would suggest keeping the things matching with the nearby code. Do you have any reason for the above conclusion? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: error context for vacuum to include block number
On Mon, Dec 23, 2019 at 07:24:28PM -0600, Justin Pryzby wrote: > I renamed. Hmm. I have found what was partially itching me for patch 0002, and that's actually the fact that we don't do the error reporting for heap within lazy_vacuum_heap() because the code relies too much on updating two progress parameters at the same time, on top of the fact that you are mixing multiple concepts with this refactoring. One problem is that if this code is refactored in the future, future callers of lazy_vacuum_heap() would miss the update of the progress reporting. Splitting things improves also the readability of the code, so attached is the refactoring I would do for this portion of the set. It is also more natural to increment num_index_scans when the reporting happens on consistency grounds. (Please note that I have not indented yet the patch.) -- Michael diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c index ab09d8408c..597d8b5f92 100644 --- a/src/backend/access/heap/vacuumlazy.c +++ b/src/backend/access/heap/vacuumlazy.c @@ -158,6 +158,9 @@ static bool lazy_check_needs_freeze(Buffer buf, bool *hastup); static void lazy_vacuum_index(Relation indrel, IndexBulkDeleteResult **stats, LVRelStats *vacrelstats); +static void lazy_vacuum_all_indexes(Relation onerel, LVRelStats *vacrelstats, + Relation *Irel, int nindexes, + IndexBulkDeleteResult **indstats); static void lazy_cleanup_index(Relation indrel, IndexBulkDeleteResult *stats, LVRelStats *vacrelstats); @@ -740,12 +743,6 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats, if ((vacrelstats->max_dead_tuples - vacrelstats->num_dead_tuples) < MaxHeapTuplesPerPage && vacrelstats->num_dead_tuples > 0) { - const int hvp_index[] = { -PROGRESS_VACUUM_PHASE, -PROGRESS_VACUUM_NUM_INDEX_VACUUMS - }; - int64 hvp_val[2]; - /* * Before beginning index vacuuming, we release any pin we may * hold on the visibility map page. This isn't necessary for @@ -758,28 +755,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats, vmbuffer = InvalidBuffer; } - /* Log cleanup info before we touch indexes */ - vacuum_log_cleanup_info(onerel, vacrelstats); - - /* Report that we are now vacuuming indexes */ - pgstat_progress_update_param(PROGRESS_VACUUM_PHASE, - PROGRESS_VACUUM_PHASE_VACUUM_INDEX); - - /* Remove index entries */ - for (i = 0; i < nindexes; i++) -lazy_vacuum_index(Irel[i], - &indstats[i], - vacrelstats); - - /* - * Report that we are now vacuuming the heap. We also increase - * the number of index scans here; note that by using - * pgstat_progress_update_multi_param we can update both - * parameters atomically. - */ - hvp_val[0] = PROGRESS_VACUUM_PHASE_VACUUM_HEAP; - hvp_val[1] = vacrelstats->num_index_scans + 1; - pgstat_progress_update_multi_param(2, hvp_index, hvp_val); + /* Work on all the indexes, then the heap */ + lazy_vacuum_all_indexes(onerel, vacrelstats, Irel, + nindexes, indstats); /* Remove tuples from heap */ lazy_vacuum_heap(onerel, vacrelstats); @@ -790,7 +768,6 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats, * valid. */ vacrelstats->num_dead_tuples = 0; - vacrelstats->num_index_scans++; /* * Vacuum the Free Space Map to make newly-freed space visible on @@ -1420,33 +1397,12 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats, /* XXX put a threshold on min number of tuples here? 
*/ if (vacrelstats->num_dead_tuples > 0) { - const int hvp_index[] = { - PROGRESS_VACUUM_PHASE, - PROGRESS_VACUUM_NUM_INDEX_VACUUMS - }; - int64 hvp_val[2]; - - /* Log cleanup info before we touch indexes */ - vacuum_log_cleanup_info(onerel, vacrelstats); - - /* Report that we are now vacuuming indexes */ - pgstat_progress_update_param(PROGRESS_VACUUM_PHASE, - PROGRESS_VACUUM_PHASE_VACUUM_INDEX); - - /* Remove index entries */ - for (i = 0; i < nindexes; i++) - lazy_vacuum_index(Irel[i], - &indstats[i], - vacrelstats); - - /* Report that we are now vacuuming the heap */ - hvp_val[0] = PROGRESS_VACUUM_PHASE_VACUUM_HEAP; - hvp_val[1] = vacrelstats->num_index_scans + 1; - pgstat_progress_update_multi_param(2, hvp_index, hvp_val); + /* Work on all the indexes, and then the heap */ + lazy_vacuum_all_indexes(onerel, vacrelstats, Irel, nindexes, +indstats); /* Remove tuples from heap */ lazy_vacuum_heap(onerel, vacrelstats); - vacrelstats->num_index_scans++; } /* @@ -1508,6 +1464,36 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats, pfree(buf.data); } +/* + * lazy_vacuum_all_indexes() -- vacuum all indexes of relation. + * + * This is a utility wrapper for lazy_vacuum_index(), able to do + * progress rep
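The attachment is cut off above. For readability, the body of the new wrapper, reassembled from the code removed in the two hunks, would look roughly like this (a sketch for illustration, not the exact patch text):

    static void
    lazy_vacuum_all_indexes(Relation onerel, LVRelStats *vacrelstats,
                            Relation *Irel, int nindexes,
                            IndexBulkDeleteResult **indstats)
    {
        int         i;
        const int   hvp_index[] = {
            PROGRESS_VACUUM_PHASE,
            PROGRESS_VACUUM_NUM_INDEX_VACUUMS
        };
        int64       hvp_val[2];

        /* Log cleanup info before we touch indexes */
        vacuum_log_cleanup_info(onerel, vacrelstats);

        /* Report that we are now vacuuming indexes */
        pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
                                     PROGRESS_VACUUM_PHASE_VACUUM_INDEX);

        /* Remove index entries */
        for (i = 0; i < nindexes; i++)
            lazy_vacuum_index(Irel[i], &indstats[i], vacrelstats);

        /*
         * Report that we are now vacuuming the heap and increase the number
         * of index scans here; pgstat_progress_update_multi_param() lets us
         * update both parameters atomically.
         */
        hvp_val[0] = PROGRESS_VACUUM_PHASE_VACUUM_HEAP;
        hvp_val[1] = vacrelstats->num_index_scans + 1;
        pgstat_progress_update_multi_param(2, hvp_index, hvp_val);

        vacrelstats->num_index_scans++;
    }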
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Sun, Dec 22, 2019 at 5:04 PM vignesh C wrote: > > Few comments: > assert variable should be within #ifdef USE_ASSERT_CHECKING in patch > v2-0008-Add-support-for-streaming-to-built-in-replication.patch: > + int64 subidx; > + boolfound = false; > + charpath[MAXPGPATH]; > + > + subidx = -1; > + subxact_info_read(MyLogicalRepWorker->subid, xid); > + > + /* FIXME optimize the search by bsearch on sorted data */ > + for (i = nsubxacts; i > 0; i--) > + { > + if (subxacts[i - 1].xid == subxid) > + { > + subidx = (i - 1); > + found = true; > + break; > + } > + } > + > + /* We should not receive aborts for unknown subtransactions. > */ > + Assert(found); > We can use PG_USED_FOR_ASSERTS_ONLY for that variable. > > Should we include printing of id here like in earlier cases in > v2-0002-Issue-individual-invalidations-with-wal_level-log.patch: > + appendStringInfo(buf, " relcache %u", msg->rc.relId); > + /* not expected, but print something anyway */ > + else if (msg->id == SHAREDINVALSMGR_ID) > + appendStringInfoString(buf, " smgr"); > + /* not expected, but print something anyway */ > + else if (msg->id == SHAREDINVALRELMAP_ID) > + appendStringInfo(buf, " relmap db %u", msg->rm.dbId); > I am not sure if this patch is logging these invalidations, so not sure if it makes sense to add more ids in the cases you are referring to. However, if we change it to logging all invalidations at command end as being discussed in this thread, then it might be better to do what you are suggesting. > > Should we can add function header for AssertChangeLsnOrder in > v2-0006-Implement-streaming-mode-in-ReorderBuffer.patch: > +static void > +AssertChangeLsnOrder(ReorderBuffer *rb, ReorderBufferTXN *txn) > +{ > > This "Assert(txn->first_lsn != InvalidXLogRecPtr)"can be before the > loop, can be checked only once: > + dlist_foreach(iter, &txn->changes) > + { > + ReorderBufferChange *cur_change; > + > + cur_change = dlist_container(ReorderBufferChange, > node, iter.cur); > + > + Assert(txn->first_lsn != InvalidXLogRecPtr); > + Assert(cur_change->lsn != InvalidXLogRecPtr); > + Assert(txn->first_lsn <= cur_change->lsn); > This makes sense to me. Another thing about this function, do we really need "ReorderBuffer *rb" parameter in this function? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
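To illustrate the two review points about this function (hoisting the first_lsn assertion out of the loop, and dropping the rb parameter if it is unused), the assertion helper could end up looking something like the sketch below. This is a sketch under those assumptions, not the actual patch.

    static void
    AssertChangeLsnOrder(ReorderBufferTXN *txn)
    {
    #ifdef USE_ASSERT_CHECKING
        dlist_iter  iter;
        XLogRecPtr  prev_lsn = txn->first_lsn;

        /* Checked once, outside the loop */
        Assert(txn->first_lsn != InvalidXLogRecPtr);

        dlist_foreach(iter, &txn->changes)
        {
            ReorderBufferChange *cur_change;

            cur_change = dlist_container(ReorderBufferChange, node, iter.cur);

            Assert(cur_change->lsn != InvalidXLogRecPtr);
            Assert(prev_lsn <= cur_change->lsn);
            prev_lsn = cur_change->lsn;
        }
    #endif
    }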
Re: Bogus logic in RelationBuildPartitionDesc
On Sat, Dec 21, 2019 at 10:28 AM Tom Lane wrote: > I just had to retrieve my jaw from the floor after reading this > bit in RelationBuildPartitionDesc: > > * The system cache may be out of date; if so, we may find no pg_class > * tuple or an old one where relpartbound is NULL. In that case, try > * the table directly. We can't just AcceptInvalidationMessages() and > * retry the system cache lookup because it's possible that a > * concurrent ATTACH PARTITION operation has removed itself to the > * ProcArray but yet added invalidation messages to the shared queue; > * InvalidateSystemCaches() would work, but seems excessive. > > As far as I can see, this argument is wrong on every count, and if it > were right, the code it is defending would still be wrong. > > In the first place, it's claiming, based on no evidence, that our whole > approach to syscaches is wrong. If this code needs to deal with obsolete > syscache entries then so do probably thousands of other places. No, because ATTACH PARTITION uses only ShareUpdateExclusiveLock, unlike most other DDL. > In the second place, the argument that AcceptInvalidationMessages wouldn't > work is BS, because the way that that *actually* works is that transaction > commit updates clog and sends inval messages before it releases locks. > So if you've acquired enough of a lock to be sure that the data you want > to read is stable, then you do not need to worry about whether you've > received any relevant inval messages. You have --- and you don't even > need to call AcceptInvalidationMessages for yourself, because lock > acquisition already did, see e.g. LockRelationOid. No, because ATTACH PARTITION uses only ShareUpdateExclusiveLock, which does not conflict with the AccessShareLock required to build a cache entry. > In the third place, if this imaginary risk that the syscache was out of > date were real, the code would be completely failing to deal with it, > because all it is testing is whether it found a null relpartbound value. > That wouldn't handle the case where a non-null relpartbound is obsolete, > which is what you'd expect after ATTACH PARTITION. No, that's what you'd expect after DETACH PARTITION, but that takes AccessExclusiveLock, so the problem doesn't occur. > Furthermore, if all of the above can be rebutted, then what's the argument > that reading pg_class directly will produce a better answer? The only way > that any of this could be useful is if you're trying to read data that is > changing under you because you didn't take an adequate lock. In that case > there's no guarantee that what you will read from pg_class is up-to-date > either. The argument is that the only possible change is the concurrent addition of a partition, and therefore the only thing that happen is to go from NULL to non-NULL. It can't go from non-NULL to NULL, nor from one non-NULL value to another. > Unsurprisingly, the code coverage report shows that this code path is > never taken. I think we could dike out partdesc.c lines 113-151 > altogether, and make the code just above there look more like every > other syscache access in the backend. It turns out that I tested this, and that if you do that, it's possible to produce failures. It's very hard to do so in the context of a regression test because they are low-probability, but they do happen. I believe some of the testing details are in the original thread that's probably linked from the commit message that added those lines. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: backup manifests
On Sun, Dec 22, 2019 at 8:32 PM Rushabh Lathia wrote: > Agree, that performance won't be a problem, but that will be bit confusing > to the user. As at the start user providing the manifest-checksum (assume > that user-provided CRC32C) and at the end, user will find the SHA256 > checksum string in the backup_manifest file. I don't think that's particularly confusing. The documentation should say that this is the algorithm to be used for checksumming the files which are backed up. The algorithm to be used for the manifest itself is another matter. To me, it seems far MORE confusing if the algorithm used for the manifest itself is magically inferred from the algorithm used for one of the File lines therein. > Does this also means that irrespective of whether user provided a checksum > option or not, we will be always generating the checksum for the > backup_manifest file? Yes, that is what I am proposing. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [PATCH] Increase the maximum value track_activity_query_size
On Sat, Dec 21, 2019 at 1:25 PM Tom Lane wrote: > > What is the overhead here except the memory consumption? > > The time to copy those strings out of shared storage, any time > you query pg_stat_activity. It seems like you're masterminding this, and I don't know why. It seems unlikely that anyone will raise the value unless they have very long queries, and if those people would rather pay the overhead of copying more data than have their queries truncated, who are we to argue? If increasing the maximum imposed some noticeable cost on installations that kept the default setting, that might well be a good argument for not raising the maximum. But I don't think that's the case. I also suspect that the overhead would be pretty darn small even for people who *do* raise the default setting. It looks to me like both read and write operations on st_activity_raw stop when they hit a NUL byte, so any performance costs on short queries must come from second-order effects (e.g. the main shared memory segment is bigger, so the OS cache is smaller) which are likely irrelevant in practice. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
On Fri, Dec 20, 2019 at 9:31 AM Amit Khandekar wrote: > Attached are the patches from master back up to 94 branch. > > PG 9.4 and 9.5 have a common patch to be applied : > pg94_95_use_vfd_for_logrep.patch > From PG 9.6 onwards, each version has a separate patch. > > For PG 9.6, there is no logical decoding perl test file. So I have > made a new file 006_logical_decoding_spill.pl that has only the > specific testcase. Also, for building the test_decoding.so, I had to > add the EXTRA_INSTALL=contrib/test_decoding line in the > src/test/recovery/Makefile, because this is the first time we are > using the plugin in the 9.6 tap test. > I am not sure if we need to go that far for 9.6 branch. If the other tests for logical decoding are not present, then I don't see why we need to create a new test file for this test only. Also, I think this will make the patch the same for 9.4,9.5 and 9.6. > From PG 10 onwards, pgstat_report_*() calls around read() are removed > in the patch, because FileRead() itself reports the wait events. > Why there are different patches for 10 and 11? We should try to minimize the difference between patches in different branches. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Thu, Dec 12, 2019 at 3:41 AM Amit Kapila wrote: > I don't think we have evaluated it yet, but we should do it. The > point to note is that it is only for the case when wal_level is > 'logical' (see IsSubTransactionAssignmentPending) in which case we > already log more WAL, so this might not impact much. I guess that it > might be better to have that check in XLogRecordAssemble for the sake > of clarity. I don't think that this is really a valid argument. Just because we have some overhead now doesn't mean that adding more won't hurt. Even testing the wal_level costs a little something. > I think the way invalidations work for logical replication is that > normally, we always start a new transaction before decoding each > commit which allows us to accept the invalidations (via > AtStart_Cache). However, if there are catalog changes within the > transaction being decoded, we need to reflect those before trying to > decode the WAL of operation which happened after that catalog change. > As we are not logging the WAL for each invalidation, we need to > execute all the invalidation messages for this transaction at each > catalog change. We are able to do that now as we decode the entire WAL > for a transaction only once we get the commit's WAL which contains all > the invalidation messages. So, we queue them up and execute them for > each catalog change which we identify by WAL record > XLOG_HEAP2_NEW_CID. Thanks for the explanation. That makes sense. But, it's still true, AFAICS, that instead of doing this stuff with logging invalidations you could just InvalidateSystemCaches() in the cases where you are currently applying all of the transaction's invalidations. That approach might be worse than changing the way invalidations are logged, but the two approaches deserve to be compared. One approach has more CPU overhead and the other has more WAL overhead, so it's a little hard to compare them, but it seems worth mulling over. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
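A minimal sketch of the comparison being made here, purely for illustration (ApplyQueuedInvalidations() is a stand-in name for whatever the patch currently does at a catalog change; InvalidateSystemCaches() is the existing routine in inval.c):

    /* Option A: what the patch does today at each catalog change while
     * decoding a transaction (stand-in name, more WAL if invalidations
     * are logged individually) */
    ApplyQueuedInvalidations(txn);

    /* Option B: the alternative being suggested, which needs no extra WAL
     * or queued messages, at the price of rebuilding all cache entries
     * afterwards */
    InvalidateSystemCaches();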
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Fri, 20 Dec 2019 at 22:30, Amit Kapila wrote: > > On Fri, Dec 20, 2019 at 11:47 AM Masahiko Sawada > wrote: > > > > On Mon, 2 Dec 2019 at 17:32, Dilip Kumar wrote: > > > > > > On Sun, Dec 1, 2019 at 7:58 AM Michael Paquier > > > wrote: > > > > > > > > On Fri, Nov 22, 2019 at 01:18:11PM +0530, Dilip Kumar wrote: > > > > > I have rebased the patch on the latest head and also fix the issue of > > > > > "concurrent abort handling of the (sub)transaction." and attached as > > > > > (v1-0013-Extend-handling-of-concurrent-aborts-for-streamin) along with > > > > > the complete patch set. I have added the version number so that we > > > > > can track the changes. > > > > > > > > The patch has rotten a bit and does not apply anymore. Could you > > > > please send a rebased version? I have moved it to next CF, waiting on > > > > author. > > > > > > I have rebased the patch set on the latest head. > > > > Thank you for working on this. > > > > This might have already been discussed but I have a question about the > > changes of logical replication worker. In the current logical > > replication there is a problem that the response time are doubled when > > using synchronous replication because wal senders send changes after > > commit. It's worse especially when a transaction makes a lot of > > changes. So I expected this feature to reduce the response time by > > sending changes even while the transaction is progressing but it > > doesn't seem to be. The logical replication worker writes changes to > > temporary files and applies these changes when the worker received > > commit record (STREAM COMMIT). Since the worker sends the LSN of > > commit record as flush LSN to the publisher after applying all > > changes, the publisher must wait for all changes are applied to the > > subscriber. > > > > The main aim of this feature is to reduce apply lag. Because if we > send all the changes together it can delay there apply because of > network delay, whereas if most of the changes are already sent, then > we will save the effort on sending the entire data at commit time. > This in itself gives us decent benefits. Sure, we can further improve > it by having separate workers (dedicated to apply the changes) as you > are suggesting and in fact, there is a patch for that as well(see the > performance results and bgworker patch at [1]), but if try to shove in > all the things in one go, then it will be difficult to get this patch > committed (there are already enough things and the patch is quite big > that to get it right takes a lot of energy). So, the plan is > something like that first we get the basic feature and then try to > improve by having dedicated workers or things like that. Does this > make sense to you? > Thank you for explanation. The plan makes sense. But I think in the current design it's a problem that logical replication worker doesn't receive changes (and doesn't check interrupts) during applying committed changes even if we don't have a worker dedicated for applying. I think the worker should continue to receive changes and save them to temporary files even during applying changes. Otherwise the buffer would be easily full and replication gets stuck. Regards, -- Masahiko Sawadahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Proposal: Add more compile-time asserts to expose inconsistencies.
On Fri, Dec 20, 2019 at 01:08:47AM +, Smith, Peter wrote: > I updated the most recent patch (_5 from Michael) so it now has your > suggested error message rewording. Hmm. Based on the last messages, and this one in particular: https://www.postgresql.org/message-id/20191202155545.yzbfzuppjriti...@alap3.anarazel.de I am still seeing that a couple of points need an extra look: - Addition of a Decl() in at least one header of the core code. - Perhaps unifying the fallback implementation between C and C++, with a closer look in particular at StaticAssertStmt() and StaticAssertExpr(). Peter, could you look at that? In order to test the C++ portion, you could use this dummy C++ extension I have just published on my github account: https://github.com/michaelpq/pg_plugins/tree/master/blackhole_cplusplus That's a dirty 15-min hack, but it would be enough to check the C++ part with PGXS. For now, I am switching the patch to waiting on author. -- Michael signature.asc Description: PGP signature
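For reference, a declaration-scope assert of the kind being asked for can be written so that it compiles as both C and C++ even without C11/C++11 support, by forcing a negative array bound. This is a sketch of one possible fallback, not necessarily what the final patch will use:

    /* One possible fallback when neither _Static_assert nor static_assert
     * is available; valid at file scope in both C and C++.  The error
     * message argument is unused by this fallback. */
    #define StaticAssertDecl(condition, errmessage) \
        extern void static_assert_func(int static_assert_failure[(condition) ? 1 : -1])

    /* Example use in a core header, per the first point above: */
    StaticAssertDecl(sizeof(int64) == 8, "int64 must be 8 bytes");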
Re: [HACKERS] Block level parallel vacuum
On Mon, 23 Dec 2019 at 19:41, Amit Kapila wrote: > > On Fri, Dec 20, 2019 at 12:13 PM Masahiko Sawada > wrote: > > > > I've attached the updated version patch that incorporated the all > > review comments I go so far. > > > > I have further edited the first two patches posted by you. The > changes include (a) changed tests to reset the guc, (b) removing some > stuff which is not required in this version, (c) moving some variables > around to make them in better order, (d) changed comments and few > other cosmetic things and (e) commit messages for first two patches. > > I think the first two patches attached in this email are in good shape > and we can commit those unless you or someone has more comments on > them, the main parallel vacuum patch can still be improved by some > more test/polish/review. I am planning to push the first two patches > next week after another pass. The first two patches are explained in > brief as below: > > 1. v4-0001-Delete-empty-pages-in-each-pass-during-GIST-VACUUM: It > allows us to delete empty pages in each pass during GIST VACUUM. > Earlier, we use to postpone deleting empty pages till the second stage > of vacuum to amortize the cost of scanning internal pages. However, > that can sometimes (say vacuum is canceled or errored between first > and second stage) delay the pages to be recycled. Another thing is > that to facilitate deleting empty pages in the second stage, we need > to share the information of internal and empty pages between different > stages of vacuum. It will be quite tricky to share this information > via DSM which is required for the main parallel vacuum patch. Also, > it will bring the logic to reclaim deleted pages closer to nbtree > where we delete empty pages in each pass. Overall, the advantages of > deleting empty pages in each pass outweigh the advantages of > postponing the same. This patch is discussed in detail in a separate > thread [1]. > > 2. v39-0001-Introduce-IndexAM-fields-for-parallel-vacuum.patch: > Introduce new fields amusemaintenanceworkmem and > amparallelvacuumoptions in IndexAmRoutine for parallel vacuum. The > amusemaintenanceworkmem tells whether a particular IndexAM uses > maintenance_work_mem or not. This will help in controlling the memory > used by individual workers as otherwise, each worker can consume > memory equal to maintenance_work_mem. This has been discussed in > detail in a separate thread as well [2]. The amparallelvacuumoptions > tell whether a particular IndexAM participates in a parallel vacuum > and if so in which phase (bulkdelete, vacuumcleanup) of vacuum. > > Thank you for updating the patches! The first patches look good to me. I'm reviewing other patches and will post comments if there is. Regards, -- Masahiko Sawadahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
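As an illustration of how the two new IndexAmRoutine fields would be used, an index AM's handler simply declares its capabilities, along the lines of the sketch below. The option flag names follow the patch's terminology at this point in the thread and may differ in the final version.

    Datum
    bthandler(PG_FUNCTION_ARGS)
    {
        IndexAmRoutine *amroutine = makeNode(IndexAmRoutine);

        /* ... all existing fields filled in as today ... */

        /* btree's bulkdelete/cleanup don't need maintenance_work_mem */
        amroutine->amusemaintenanceworkmem = false;

        /* participate in parallel bulkdelete, and in cleanup only when no
         * bulkdelete has been performed (conditional cleanup) */
        amroutine->amparallelvacuumoptions =
            VACUUM_OPTION_PARALLEL_BULKDEL | VACUUM_OPTION_PARALLEL_COND_CLEANUP;

        PG_RETURN_POINTER(amroutine);
    }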
Re: [HACKERS] Block level parallel vacuum
On Tue, Dec 24, 2019 at 12:08 PM Masahiko Sawada wrote: > > > The first patches look good to me. I'm reviewing other patches and > will post comments if there is. > Okay, feel free to address few comments raised by Mahendra along with whatever you find. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Block level parallel vacuum
On Tue, 24 Dec 2019 at 15:44, Amit Kapila wrote: > > On Tue, Dec 24, 2019 at 12:08 PM Masahiko Sawada > wrote: > > > > > > The first patches look good to me. I'm reviewing other patches and > > will post comments if there is. > > Oops I meant first "two" patches look good to me. > > Okay, feel free to address few comments raised by Mahendra along with > whatever you find. Thanks! Regards, -- Masahiko Sawadahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
RE: Implementing Incremental View Maintenance
From: Tatsuo Ishii > First of all, we do not think that current approach is the final > one. Instead we want to implement IVM feature one by one: i.e. we > start with "immediate update" approach, because it's simple and easier > to implement. Then we will add "deferred update" mode later on. I agree about incremental feature introduction. What I'm simply asking is the concrete use case (workload and data), so that I can convince myself to believe that this feature is useful and focus on reviewing and testing (because the patch seems big and difficult...) > In fact Oracle has both "immediate update" and "deferred update" mode > of IVM (actually there are more "mode" with their implementation). > > I recommend you to look into Oracle's materialized view feature > closely. For fair evaluation, probably we should compare the IVM patch > with Oracle's "immediate update" (they call it "on statement") mode. > > Probably deferred IVM mode is more suitable for DWH. However as I said > earlier, we hope to implement the immediate mode first then add the > deferred mode. Let's start with simple one then add more features. Yes, I know Oracle's ON STATEMENT refresh mode (I attached references at the end for others.) Unfortunately, it's not clear to me which of ON STATEMENT or ON COMMIT the user should choose. The benefit of ON STATEMENT is that the user does not have to create and maintain the materialized view log. But I'm not sure if and when the benefit defeats the performance overhead on DML statements. It's not disclosed whether ON STATEMENT uses triggers. Could you give your opinion on the following to better understand the proposed feature and/or Oracle's ON STATEMENT refresh mode? * What use case does the feature fit? If the trigger makes it difficult to use in the data ware house, does the feature target OLTP? What kind of data and query would benefit most from the feature (e.g. join of a large sales table and a small product table, where the data volume and frequency of data loading is ...)? In other words, this is about what kind of example we can recommend as a typical use case of this feature. * Do you think the benefit of ON STATEMENT (i.e. do not have to use materialized view log) outweighs the drawback of ON STATEMENT (i.g. DML overhead)? * Do you think it's important to refresh the materialized view after every statement, or the per-statement refresh is not a requirement but simply the result of implementation? [References] https://docs.oracle.com/en/database/oracle/oracle-database/19/dwhsg/refreshing-materialized-views.html#GUID-C40C225A-8328-44D5-AE90-9078C2C773EA -- 7.1.5 About ON COMMIT Refresh for Materialized Views A materialized view can be refreshed automatically using the ON COMMIT method. Therefore, whenever a transaction commits which has updated the tables on which a materialized view is defined, those changes are automatically reflected in the materialized view. The advantage of using this approach is you never have to remember to refresh the materialized view. The only disadvantage is the time required to complete the commit will be slightly longer because of the extra processing involved. However, in a data warehouse, this should not be an issue because there is unlikely to be concurrent processes trying to update the same table. 7.1.6 About ON STATEMENT Refresh for Materialized Views A materialized view that uses the ON STATEMENT refresh mode is automatically refreshed every time a DML operation is performed on any of the materialized view’s base tables. 
With the ON STATEMENT refresh mode, any changes to the base tables are immediately reflected in the materialized view. There is no need to commit the transaction or maintain materialized view logs on the base tables. If the DML statements are subsequently rolled back, then the corresponding changes made to the materialized view are also rolled back. The advantage of the ON STATEMENT refresh mode is that the materialized view is always synchronized with the data in the base tables, without the overhead of maintaining materialized view logs. However, this mode may increase the time taken to perform a DML operation because the materialized view is being refreshed as part of the DML operation. -- https://docs.oracle.com/en/database/oracle/oracle-database/19/dwhsg/release-changes.html#GUID-2A2D6E3B-A3FD-47A8-82A3-1EF95AEF5993 -- ON STATEMENT refresh mode for materialized views The ON STATEMENT refresh mode refreshes materialized views every time a DML operation is performed on any base table, without the need to commit the transaction. This mode does not require you to maintain materialized view logs on the base tables. -- http://www.oracle.com/us/solutions/sap/matview-refresh-db1
RE: Implementing Incremental View Maintenance
From: Yugo Nagata > On Mon, 23 Dec 2019 08:08:53 + > "tsunakawa.ta...@fujitsu.com" wrote: > > How about unlogged tables ? I thought the point of using a temp table is to > avoid WAL overhead. > > Hmm... this might be another option. However, if we use unlogged tables, > we will need to create them in a special schema similar to pg_toast > to split this from user tables. Otherwise, we need to create and drop > unlogged tables repeatedly for each session. Maybe we can create the work tables in the same schema as the materialized view, along the following lines: * Prefix the table name to indicate that the table is system-managed, thus hinting to the user that manually deleting the table would break something. This is like the system attribute __ivm_count__ you are proposing. * Describe the above in the manual. Columns of serial and bigserial data type similarly create sequences behind the scenes. * Make the work tables depend on the materialized view by recording the dependency in pg_depend, so that dropping the materialized view will also drop its work tables (see the sketch below). Regards Takayuki Tsunakawa
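A rough sketch of the pg_depend part of the third point above, for illustration only; the variable names are made up, and whether DEPENDENCY_INTERNAL or DEPENDENCY_AUTO is the right dependency type is an open question:

    ObjectAddress worktable;
    ObjectAddress matview;

    ObjectAddressSet(worktable, RelationRelationId, worktableOid);
    ObjectAddressSet(matview, RelationRelationId, matviewOid);

    /*
     * With DEPENDENCY_INTERNAL, dropping the materialized view drops its
     * IVM work table automatically, and dropping the work table on its own
     * is refused, matching the "system-managed" intent above.
     */
    recordDependencyOn(&worktable, &matview, DEPENDENCY_INTERNAL);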
Re: Online checksums verification in the backend
On Tue, Dec 24, 2019 at 4:23 AM Masahiko Sawada wrote: > > On Fri, Dec 6, 2019 at 11:51 PM Julien Rouhaud wrote: > > > > This brings the second consideration: how to report the list corrupted > > blocks to end users. As I said this is for now returned via the SRF, > > but this is clearly not ideal and should rather be made available more > > globally. One usage of this information could be block level > > recovery. I'm Cc-ing Sawada-san, as I know he's working on this and > > mentioned me that he had ideas on passing the list of corrupted blocks > > using the stat collector. > > Yes it's necessary the list of corrupted pages for single page > recovery. Apart from single page recovery I think it's helpful for DBA > if they can find the corrupted blocks in the server logs and on a > system view. > > I've also tried to report corrupted pages to the stats collector > during I researching single page recovery in PostgreSQL but one > problem is that the statistics in the stats collector is cleared when > crash recovery. I want the information of block corruption to survive > even when the server down. Yes, having the list of corrupted blocks surviving a crash-and-restart cycle, and also available after a clean shutdown is definitely important. > And we might want to add checksums to the > permanent file having information of database corruption. The > correctness of these information would be important because we can fix > a database by restoring some tables from a logical backup or by doing > reindex etc as long as we have a non-broken information of database > corruption. Agreed > > Finally, the read and locking considerations. I tried to cover that > > extensively in the comments, but here are some details on how I tried > > to make the check safe while trying to keep the overhead as low as > > possible. First thing is that this is only doing buffered reads, > > without any attempt to discard OS cache. Therefore, any discrepancy > > between the OS cache and the disk cannot be detected unless you do > > other actions, such as sync / drop_caches on GNU/Linux. > > > > An access share lock on the currently checked relation is held, > > meaning that it can't get deleted/truncated. The total number of > > blocks for the given fork is retrieved first, so any new block will be > > ignored. Such new blocks are considered out of scope as being written > > after the start of the check. > > > > Each time a buffer is being checked, the target buffer mapping > > partition lock is acquired in shared mode, to prevent concurrent > > eviction. If the buffer is found in shared buffers, it's pinned and > > released immediately, just to get the state. > > I wonder if there is possibility that blocks on disk can be corrupted > even if these are loaded to the shared buffer. ISTM the above method > cannot detect such corruption. Reading and checking blocks fast is > attractive but I thought it's also important to check blocks precisely > without overlooking. It can definitely happen, and it's the usual doomsday scenario: database is working fine for months, then postgres is restarted say for a minor version upgrade and then boom the most populars blocks that are constantly used in read only were corrupted on disk but never evicted from shared buffers, and you have a major outage. I have witnessed that unfortunately too many times. This is especially bad as in this kind of scenario, you typically discover the corruption once all backup only contains the corrupted blocks. 
Note that in the approach I'm suggesting, I do verify blocks that are loaded in shared buffers; I only ignore the dirty blocks, as they'll be written by the checkpointer or recovery process in case of an unclean shutdown. A buffer pin isn't necessary to avoid a torn page read; an IO lock also guarantees that, and causes less overhead. The included TAP test should also detect the corruption of a present-in-shared-buffers-non-dirty block. It could however be improved, e.g. by calling pg_prewarm to make sure that the block is indeed in shared_buffers, and by doing the same test after a clean restart to make sure that it's hitting the not-in-shared-buffers case.
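To make that concrete, the on-disk verification for a block that is either absent from shared buffers or present but clean could look roughly like the sketch below. This is a simplified illustration with made-up names, not the submitted patch.

    #include "postgres.h"
    #include "storage/bufpage.h"
    #include "storage/smgr.h"

    /*
     * Caller has already checked, while holding the buffer mapping partition
     * lock in shared mode, that any copy of this block in shared buffers is
     * not dirty, so the on-disk version is the one worth verifying.
     */
    static bool
    check_block_on_disk(SMgrRelation smgr, ForkNumber forknum, BlockNumber blkno)
    {
        PGAlignedBlock buf;

        smgrread(smgr, forknum, blkno, buf.data);

        /*
         * PageIsVerified() validates the checksum when checksums are enabled,
         * performs basic header sanity checks, and accepts all-zero pages.
         */
        return PageIsVerified((Page) buf.data, blkno);
    }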
Re: [HACKERS] WAL logging problem in 9.4.3?
At Tue, 10 Dec 2019 16:59:25 +0900 (JST), Kyotaro Horiguchi wrote in > shared_buffers=1GB/wal_buffers=16MB(defalut). pgbench -s 20 uses 256MB > of storage so all of them can be loaded on shared memory. > > The attached graph shows larger benefit in TPS drop and latency > increase for HDD. The DDL pages at the corsspoint between commit-FPW > and commit-sync moves from roughly 300 to 200 in TPS and latency, and > 1000 to 600 in DDL runtime. If we can rely on the two graphs, 500 (or > 512) pages seems to be the most promising candidate for the default > value of wal_skip_threshold. > regards. I rebased the patch and changed the default value for the GUC variable wal_skip_threshold to 4096 kilobytes in config.sgml, storage.c and guc.c. 4096kB is choosed as it is the nice round number of 500 pages * 8kB = 4000kB. regards. -- Kyotaro Horiguchi NTT Open Source Software Center >From 64b4f63016c776cd5102b70f5562328dc5d371fa Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Thu, 21 Nov 2019 15:28:06 +0900 Subject: [PATCH] Rework WAL-skipping optimization While wal_level=minimal we omit WAL-logging for certain some operations on relfilenodes that are created in the current transaction. The files are fsynced at commit. The machinery accelerates bulk-insertion operations but it fails in certain sequence of operations and a crash just after commit may leave broken table files. This patch overhauls the machinery so that WAL-loggings on all operations are omitted for such relfilenodes. This patch also introduces a new feature that small files are emitted as a WAL record instead of syncing. The new GUC variable wal_skip_threshold controls the threshold. --- doc/src/sgml/config.sgml | 43 ++-- doc/src/sgml/perform.sgml| 47 + src/backend/access/gist/gistutil.c | 31 ++- src/backend/access/gist/gistxlog.c | 21 ++ src/backend/access/heap/heapam.c | 45 +--- src/backend/access/heap/heapam_handler.c | 22 +- src/backend/access/heap/rewriteheap.c| 21 +- src/backend/access/nbtree/nbtsort.c | 41 +--- src/backend/access/rmgrdesc/gistdesc.c | 5 + src/backend/access/transam/README| 47 - src/backend/access/transam/xact.c| 15 ++ src/backend/access/transam/xloginsert.c | 10 +- src/backend/access/transam/xlogutils.c | 17 +- src/backend/catalog/heap.c | 4 + src/backend/catalog/storage.c| 257 +-- src/backend/commands/cluster.c | 31 +++ src/backend/commands/copy.c | 58 + src/backend/commands/createas.c | 11 +- src/backend/commands/matview.c | 12 +- src/backend/commands/tablecmds.c | 11 +- src/backend/storage/buffer/bufmgr.c | 123 ++- src/backend/storage/smgr/md.c| 35 ++- src/backend/storage/smgr/smgr.c | 37 src/backend/utils/cache/relcache.c | 122 --- src/backend/utils/misc/guc.c | 13 ++ src/bin/psql/input.c | 1 + src/include/access/gist_private.h| 2 + src/include/access/gistxlog.h| 1 + src/include/access/heapam.h | 3 - src/include/access/rewriteheap.h | 2 +- src/include/access/tableam.h | 18 +- src/include/catalog/storage.h| 5 + src/include/storage/bufmgr.h | 4 + src/include/storage/smgr.h | 1 + src/include/utils/rel.h | 57 +++-- src/include/utils/relcache.h | 8 +- src/test/regress/pg_regress.c| 2 + 37 files changed, 839 insertions(+), 344 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 5d1c90282f..d455a42c9d 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -2481,21 +2481,14 @@ include_dir 'conf.d' levels. This parameter can only be set at server start. 
-In minimal level, WAL-logging of some bulk -operations can be safely skipped, which can make those -operations much faster (see ). -Operations in which this optimization can be applied include: - - CREATE TABLE AS - CREATE INDEX - CLUSTER - COPY into tables that were created or truncated in the same - transaction - -But minimal WAL does not contain enough information to reconstruct the -data from a base backup and the WAL logs, so replica or -higher must be used to enable WAL archiving -() and streaming replication. +In minimal level, no information is logged for +tables or indexes for the remainder of a transaction that creates or +truncates them. This can make bulk operations much faster (see +). But minimal WAL does not contain +enough information to reconstruct the data from a base backup and the +WAL logs, so replica or higher must be used
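The rest of the patch is cut off here. The mechanism the commit message describes (sync large newly-created relfilenodes at commit, WAL-log small ones instead) can be pictured with the following simplified sketch; it is illustrative only, as the real code in storage.c works on pending-sync entries and handles all forks:

    /*
     * At commit, each relfilenode created (or truncated) in this transaction
     * under wal_level=minimal must be made durable one way or the other.
     * wal_skip_threshold is the new GUC, in kilobytes.
     */
    static void
    sync_or_log_one_rel(Relation rel)
    {
        BlockNumber nblocks = RelationGetNumberOfBlocks(rel);

        if ((uint64) nblocks * BLCKSZ > (uint64) wal_skip_threshold * 1024)
        {
            /* Large relation: one fsync at commit beats WAL-logging it all */
            RelationOpenSmgr(rel);
            smgrimmedsync(rel->rd_smgr, MAIN_FORKNUM);
        }
        else
        {
            /* Small relation: emitting its pages as WAL is cheaper than an
             * fsync; "true" means standard page layout, allowing hole
             * compression. */
            log_newpage_range(rel, MAIN_FORKNUM, 0, nblocks, true);
        }
    }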
Re: unsupportable composite type partition keys
On Tue, Dec 24, 2019 at 10:59 AM Amit Langote wrote: > Btw, does the memory leakage fix in this patch address any of the > pending concerns that were discussed on the "hyrax vs. > RelationBuildPartitionDesc" thread earlier this year[1]? > > [1] > https://www.postgresql.org/message-id/flat/3800.1560366716%40sss.pgh.pa.us#092b6b4f6bf75d2f3f90ef6a3b3eab5b I thought about this a little and I think it *does* address the main complaint in the above thread. The main complaint as I understand is that receiving repeated invalidations due to partitions being concurrently added while a PartitionDirectory is holding a pointer to PartitionDesc causes many copies of PartitionDesc to pile up due to the parent table being rebuilt upon processing of each invalidation. Now because we don't build the PartitionDesc in the RelationClearRelation path, that can't happen. Although it still seems possible for the piling up to occur if it's RelationBuildPartitionDesc that is run repeatedly via RelationGetParttionDesc while partitions are being concurrently added, but I couldn't find anything in the partitioning code that does that. Thanks, Amit