RE: Implementing Incremental View Maintenance
From: Yugo Nagata
> 1. Create a temporary table only once at the first view maintenance in
> this session. This is possible if we store names or oid of temporary
> tables used for each materialized view in memory. However, users may
> access to these temptables whenever during the session.
>
> 2. Use tuplestores instead of temprary tables. Tuplestores can be
> converted to Ephemeral Name Relation (ENR) and used in queries.
> It doesn't need updating system catalogs, but indexes can not be
> used to access.

How about unlogged tables?  I thought the point of using a temp table is to avoid WAL overhead.

One concern about the temp table is that it precludes the use of distributed transactions (PREPARE TRANSACTION fails if the transaction accessed a temp table.)  This could become a headache when FDW has supported 2PC (which Sawada-san started and Horiguchi-san has taken over.)  In the near future, PostgreSQL may evolve into a shared nothing database with distributed transactions like Postgres-XL.

Regards
Takayuki Tsunakawa
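As a point of reference for option 2 quoted above, here is a minimal sketch of how a tuplestore of delta rows could be exposed to the maintenance query as an Ephemeral Named Relation, using the same SPI machinery that AFTER-trigger transition tables rely on. The relation name "ivm_delta", the query text, and the surrounding function are assumptions made only for illustration, not part of the IVM patch.

#include "postgres.h"
#include "executor/spi.h"
#include "utils/queryenvironment.h"
#include "utils/tuplestore.h"

static void
run_maintenance_query(Tuplestorestate *delta, TupleDesc delta_desc)
{
    EphemeralNamedRelation enr = palloc0(sizeof(EphemeralNamedRelationData));

    enr->md.name = pstrdup("ivm_delta");    /* hypothetical name */
    enr->md.reliddesc = InvalidOid;         /* no catalog-backed relation */
    enr->md.tupdesc = delta_desc;           /* shape of the delta rows */
    enr->md.enrtype = ENR_NAMED_TUPLESTORE;
    enr->md.enrtuples = tuplestore_tuple_count(delta);
    enr->reldata = delta;

    SPI_connect();
    SPI_register_relation(enr);

    /* The tuplestore is now visible by name, with no catalog churn at all. */
    SPI_exec("SELECT count(*) FROM ivm_delta", 0);

    SPI_unregister_relation("ivm_delta");
    SPI_finish();
}

As the quoted mail notes, the trade-off is that such a relation cannot be indexed, so the maintenance query can only scan it sequentially.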
Re: Implementing Incremental View Maintenance
>> But if you want to get always up-to-data you need to pay the cost for
>> REFRESH MATERIALIZED VIEW. IVM gives a choice here.
>
> Thank you, that clarified to some extent. What kind of data do you think of
> as an example?
>
> Materialized view reminds me of the use in a data warehouse. Oracle handles
> the top in its Database Data Warehousing Guide, and Microsoft has just
> started to offer the materialized view feature in its Azure Synapse Analytics
> (formerly SQL Data Warehouse). AWS also has previewed Redshift's
> materialized view feature in re:Invent 2019. Are you targeting the data
> warehouse (analytics) workload?

First of all, we do not think that the current approach is the final one. Instead we want to implement the IVM feature one step at a time: i.e. we start with the "immediate update" approach, because it's simpler and easier to implement. Then we will add a "deferred update" mode later on.

In fact Oracle has both "immediate update" and "deferred update" modes of IVM (actually there are more modes in their implementation). I recommend looking into Oracle's materialized view feature closely. For a fair evaluation, we should probably compare the IVM patch with Oracle's "immediate update" (they call it "on statement") mode.

> IIUC, to put (over) simply, the data warehouse has two kind of tables:

Probably the deferred IVM mode is more suitable for DWH. However, as I said earlier, we hope to implement the immediate mode first and then add the deferred mode. Let's start with the simple one and then add more features.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
Re: smgr vs DropRelFileNodeBuffers() vs filesystem state vs no critical section
Hi, On 2019-12-07 11:07:04 +0530, Amit Kapila wrote: > On Sat, Dec 7, 2019 at 5:42 AM Andres Freund wrote: > > > > Tom, I seem to recall a recent thread of yours discussing a different > > approach to truncation. I wonder if there's some intersection with > > that. But unfortunately my search somehow has come up with nothing so > > far - do you remember enough to find the thread? > > > > IIUC, a similar problem is discussed in the thread [1] where Tom > proposed a few solutions which are close to what you are proposing. > > [1] - https://www.postgresql.org/message-id/1880.1281020817%40sss.pgh.pa.us It does indeed look like basically the same problem. I was actually remembering a different thread, that was more about truncating with not quite as heavyweight locking. Having pondered this and some related problems (truncation, flushing multiple buffers at once using asynchronous IO, PrefetchBuffer() directly into buffers, cleanup lock implementation), I think we basically need the ability to set something like BM_IO_IN_PROGRESS on multiple buffers (perhaps signalling different forms of IO with different flags, but I'm not sure it's needed). I think we basically ought to replace the current IO buffer locking with condition variables (as Robert has suggested at [1]). Instead of having an array of LWLocks (BufferIOLWLockArray), we'd allocate one condition variable for each buffer. I think we have enough space directly in BufferDesc these days, due to the spinlock removal, but that seems like a detail. For truncation, we'd first iterate over all buffers once to mark them as BM_IO_IN_PROGRESS, then we would truncate the underlying file. If the truncation succeeds, we can use a local palloc'd array of IO_IN_PROGRESS buffers to actually evict them. If the truncation fails, the same array would be used to reset IO_IN_PROGRESS (basically AbortBufferIO, except not setting BM_IO_ERROR for the truncation case). This would solve the problem of truncations leading to diverging fs/buffer state, would not require a PANIC, and would allow to have concurrent buffer eviction to efficiently wait for IO to finish. This would pretty directly replace the current cleanup locks, which would just need the existing flagbit to indicate that refcount=0 should cause a condition variable wakeup. Does somebody see a fundamental hole in this approach? Obviously there's lots of different details to fill in, but most of them seem likely to only be found by actually writing the patch... Greetings, Andres Freund [1] https://postgr.es/m/CA%2BTgmoaj2aPti0yho7FeEf2qt-JgQPRWb0gci_o1Hfr%3DC56Xng%40mail.gmail.com
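For readers trying to picture the proposal, here is an illustration-only sketch of the three-step ordering described above (claim the buffers, do the filesystem work, then either evict the buffers or release the claim). Every function and type below is a hypothetical placeholder standing in for buffer-manager internals that would have to be written, including the condition-variable based replacement for the current BufferIOLWLockArray; this is not PostgreSQL code.

typedef int BufferId;

extern int  CollectBuffersBeyond(const char *rel, unsigned nblocks,
                                 BufferId *out, int maxbufs);
extern void MarkBufferIOInProgress(BufferId buf);   /* waits out concurrent IO via the CV */
extern int  TruncateUnderlyingFile(const char *rel, unsigned nblocks);  /* 1 on success */
extern void EvictClaimedBuffer(BufferId buf);   /* drop buffer, clear flag, wake waiters */
extern void ReleaseBufferClaim(BufferId buf);   /* like AbortBufferIO, minus BM_IO_ERROR */

static void
truncate_rel_sketch(const char *rel, unsigned nblocks)
{
    BufferId    claimed[1024];      /* fixed cap only to keep the sketch short */
    int         n,
                i;

    /* Pass 1: claim every to-be-dropped buffer, as if we were writing it out. */
    n = CollectBuffersBeyond(rel, nblocks, claimed, 1024);
    for (i = 0; i < n; i++)
        MarkBufferIOInProgress(claimed[i]);

    /* The filesystem operation runs outside any critical section. */
    if (TruncateUnderlyingFile(rel, nblocks))
    {
        /* Success: nobody can write these buffers back now; just evict them. */
        for (i = 0; i < n; i++)
            EvictClaimedBuffer(claimed[i]);
    }
    else
    {
        /* Failure: file and buffers still agree, so simply release the claims. */
        for (i = 0; i < n; i++)
            ReleaseBufferClaim(claimed[i]);
    }
}

The key property is that a failed truncation leaves both the file and the buffer pool unchanged, which is why no PANIC is needed.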
Re: [HACKERS] Block level parallel vacuum
On Mon, 23 Dec 2019 at 16:24, Mahendra Singh wrote: > > On Fri, 20 Dec 2019 at 17:17, Prabhat Sahu > wrote: > > > > Hi, > > > > While testing this feature with parallel vacuum on "TEMPORARY TABLE", I got > > a server crash on PG Head+V36_patch. > > Changed configuration parameters and Stack trace are as below: > > > > autovacuum = on > > max_worker_processes = 4 > > shared_buffers = 10MB > > max_parallel_workers = 8 > > max_parallel_maintenance_workers = 8 > > vacuum_cost_limit = 2000 > > vacuum_cost_delay = 10 > > min_parallel_table_scan_size = 8MB > > min_parallel_index_scan_size = 0 > > > > -- Stack trace: > > [centos@parallel-vacuum-testing bin]$ gdb -q -c data/core.1399 postgres > > Reading symbols from > > /home/centos/BLP_Vacuum/postgresql/inst/bin/postgres...done. > > [New LWP 1399] > > [Thread debugging using libthread_db enabled] > > Using host libthread_db library "/lib64/libthread_db.so.1". > > Core was generated by `postgres: autovacuum worker postgres > >'. > > Program terminated with signal 6, Aborted. > > #0 0x7f4517d80337 in raise () from /lib64/libc.so.6 > > Missing separate debuginfos, use: debuginfo-install > > glibc-2.17-292.el7.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 > > krb5-libs-1.15.1-37.el7_7.2.x86_64 libcom_err-1.42.9-16.el7.x86_64 > > libgcc-4.8.5-39.el7.x86_64 libselinux-2.5-14.1.el7.x86_64 > > openssl-libs-1.0.2k-19.el7.x86_64 pcre-8.32-17.el7.x86_64 > > zlib-1.2.7-18.el7.x86_64 > > (gdb) bt > > #0 0x7f4517d80337 in raise () from /lib64/libc.so.6 > > #1 0x7f4517d81a28 in abort () from /lib64/libc.so.6 > > #2 0x00a96341 in ExceptionalCondition (conditionName=0xd18efb > > "strvalue != NULL", errorType=0xd18eeb "FailedAssertion", > > fileName=0xd18ee0 "snprintf.c", lineNumber=442) at assert.c:67 > > #3 0x00b02522 in dopr (target=0x7ffdb0e38450, format=0xc5fa95 > > ".%s\"", args=0x7ffdb0e38538) at snprintf.c:442 > > #4 0x00b01ea6 in pg_vsnprintf (str=0x256df50 "autovacuum: dropping > > orphan temp table \"postgres.", '\177' ..., count=1024, > > fmt=0xc5fa68 "autovacuum: dropping orphan temp table \"%s.%s.%s\"", > > args=0x7ffdb0e38538) at snprintf.c:195 > > #5 0x00afbadf in pvsnprintf (buf=0x256df50 "autovacuum: dropping > > orphan temp table \"postgres.", '\177' ..., len=1024, > > fmt=0xc5fa68 "autovacuum: dropping orphan temp table \"%s.%s.%s\"", > > args=0x7ffdb0e38538) at psprintf.c:110 > > #6 0x00afd34b in appendStringInfoVA (str=0x7ffdb0e38550, > > fmt=0xc5fa68 "autovacuum: dropping orphan temp table \"%s.%s.%s\"", > > args=0x7ffdb0e38538) > > at stringinfo.c:149 > > #7 0x00a970fd in errmsg (fmt=0xc5fa68 "autovacuum: dropping orphan > > temp table \"%s.%s.%s\"") at elog.c:832 > > #8 0x008588d2 in do_autovacuum () at autovacuum.c:2249 > > #9 0x00857b29 in AutoVacWorkerMain (argc=0, argv=0x0) at > > autovacuum.c:1689 > > #10 0x0085772f in StartAutoVacWorker () at autovacuum.c:1483 > > #11 0x0086e64f in StartAutovacuumWorker () at postmaster.c:5562 > > #12 0x0086e106 in sigusr1_handler (postgres_signal_arg=10) at > > postmaster.c:5279 > > #13 > > #14 0x7f4517e3f933 in __select_nocancel () from /lib64/libc.so.6 > > #15 0x00869838 in ServerLoop () at postmaster.c:1691 > > #16 0x00869212 in PostmasterMain (argc=3, argv=0x256bd70) at > > postmaster.c:1400 > > #17 0x0077855d in main (argc=3, argv=0x256bd70) at main.c:210 > > (gdb) > > > > I have tried to reproduce the same with all previously executed queries but > > now I am not able to reproduce the same. > > Thanks Prabhat for reporting this issue. > > I am able to reproduce this issue at my end. 
> I tested and verified that this issue is not related to parallel
> vacuum patch. I am able to reproduce this issue on HEAD without
> parallel vacuum patch(v37).
>
> I will report this issue in new thread with reproducible test case.

Thank you so much!

Regards,

--
Masahiko Sawada
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: unsupportable composite type partition keys
On Sun, Dec 22, 2019 at 6:13 AM Tom Lane wrote: > > I wrote: > > As far as point 2 goes, I think this is another outgrowth of the > > fundamental design error that we load a partitioned rel's partitioning > > info immediately when the relcache entry is created, rather than later > > on-demand. If we weren't doing that then it wouldn't be problematic > > to inspect the rel's rowtype while constructing the partitioning info. > > I've bitched about this before, if memory serves, but couldn't light > > a fire under anyone about fixing it. Now I think we have no choice. > > It was never a great idea that minimal construction of a relcache > > entry could result in running arbitrary user-defined code. > > Here's a draft patch for that. Thanks for writing the patch. This also came up recently on another thread [1]. > There are a couple of secondary issues > I didn't do anything about yet: > > * When rebuilding an open relcache entry for a partitioned table, this > coding now always quasi-leaks the old rd_pdcxt, where before that happened > only if the partdesc actually changed. (Even if I'd kept the > equalPartitionDescs call, it would always fail.) I complained about the > quasi-leak behavior before, but this probably pushes it up to the level of > "must fix". What I'm inclined to do is to hack > RelationDecrementReferenceCount so that, when the refcount goes to zero, > we delete any child contexts of rd_pdcxt. That's pretty annoying but in > the big scheme of things it's unlikely to matter. Hacking RelationDecrementReferenceCount() like that sounds OK. -else if (rebuild && newrel->rd_pdcxt != NULL) +if (rebuild && newrel->rd_pdcxt != NULL) Checking rebuild seems unnecessary in this block, although that's true even without the patch. + * To ensure that it's not leaked completely, re-attach it to the + * new reldesc, or make it a child of the new reldesc's rd_pdcxt + * in the unlikely event that there is one already. (See hack in + * RelationBuildPartitionDesc.) ... +if (relation->rd_pdcxt != NULL) /* probably never happens */ +MemoryContextSetParent(newrel->rd_pdcxt, relation->rd_pdcxt); +else +relation->rd_pdcxt = newrel->rd_pdcxt; While I can imagine that RelationBuildPartitionDesc() might encounter an old partition descriptor making the re-parenting hack necessary there, I don't see why it's needed here, because a freshly built relation descriptor would not contain the partition descriptor after this patch. > * It'd be better to declare RelationGetPartitionKey and > RelationGetPartitionDesc in relcache.h and get their callers out of the > business of including rel.h, where possible. Although I agree to declare them in relcache.h, that doesn't reduce the need to include rel.h in their callers much, because anyplace that uses RelationGetPartitionDesc() is also very likely to use RelationGetRelid() which is in rel.h. > * equalPartitionDescs is now dead code, should we remove it? Don't see any problem with doing so. > > Note that the end result of this would be to allow, not prohibit, > > cases like your example. I wonder whether we couldn't also lift > > the restriction against whole-row Vars in partition expressions. > > Doesn't seem like there is much difference between such a Var and > > a row(...)::table_rowtype expression. > > I didn't look into that either. I wouldn't propose back-patching that, > but it'd be interesting to try to fix it in HEAD. Agreed. Thanks, Amit [1] https://www.postgresql.org/message-id/CA%2BHiwqFucUh7hYkfZ6x1MVcs_R24eUfNVuRwdE_FwuwK8XpSZg%40mail.gmail.com
Re: Implementing Incremental View Maintenance
On Mon, 23 Dec 2019 08:08:53 + "tsunakawa.ta...@fujitsu.com" wrote: > From: Yugo Nagata > > 1. Create a temporary table only once at the first view maintenance in > > this session. This is possible if we store names or oid of temporary > > tables used for each materialized view in memory. However, users may > > access to these temptables whenever during the session. > > > > 2. Use tuplestores instead of temprary tables. Tuplestores can be > > converted to Ephemeral Name Relation (ENR) and used in queries. > > It doesn't need updating system catalogs, but indexes can not be > > used to access. > > How about unlogged tables ? I thought the point of using a temp table is to > avoid WAL overhead. Hmm... this might be another option. However, if we use unlogged tables, we will need to create them in a special schema similar to pg_toast to split this from user tables. Otherwise, we need to create and drop unlogged tables repeatedly for each session. > > One concern about the temp table is that it precludes the use of distributed > transactions (PREPARE TRANSACTION fails if the transaction accessed a temp > table.) This could become a headache when FDW has supported 2PC (which > Sawada-san started and Horicuchi-san has taken over.) In the near future, > PostgreSQL may evolve into a shared nothing database with distributed > transactions like Postgres-XL. This makes sense since you mean that PREPARE TRANSACTION can not be used if any base table of incrementally maintainable materialized views is modified in the transaction, at least in the immediate maintenance. Maybe, this issue can be resolved if we implement the deferred maintenance planned in future because materialized views can be updated in other transactions in this way. > > > Regards > Takayuki Tsunakawa > > > -- Yugo Nagata
Re: Implementing Incremental View Maintenance
Hello,

Regarding my initial post:

> For each insert into a base table there are 3 statements:
> - ANALYZE pg_temp_3.pg_temp_81976
> - WITH updt AS ( UPDATE public.mv1 AS mv SET __ivm_count__ = ...
> - DROP TABLE pg_temp_3.pg_temp_81976

For me there were 3 points to discuss:
- create/drop tables may bloat dictionary tables
- create/drop tables prevents "WITH updt ..." from being shared (with some plan caching)
- it generates many lines in pg_stat_statements

In fact I like the idea of a table created per session, but I would even prefer a common "table" shared between all sessions like GLOBAL TEMPORARY TABLE (or something similar) as described here:
https://www.postgresql.org/message-id/flat/157703426606.1198.2452090605041230054.pgcf%40coridan.postgresql.org#331e8344bbae904350af161fb43a0aa6

That would remove the drop/create issue, would permit reducing planning time for "WITH updt ..." statements (as done today in PLpgsql triggers), and would fix the pgss "bloat" issue. That way, the "cost" of the immediate refresh approach would be easier to accept ;o)

Regards
PAscal

--
Sent from: https://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html
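For concreteness, a rough SQL illustration of the "one table per session" idea follows. The delta table, its columns, and the statements that would use it are hypothetical; the point is only that a session-lifetime table emptied per transaction avoids the per-statement CREATE/DROP that bloats the catalogs and pg_stat_statements.

-- One delta table per session, created once and then reused;
-- ON COMMIT DELETE ROWS empties it after every transaction.
CREATE TEMP TABLE IF NOT EXISTS mv1_delta (
    __ivm_count__ bigint,
    a             integer
) ON COMMIT DELETE ROWS;

-- Each base-table change can then reuse the same statement text, e.g.:
--   INSERT INTO mv1_delta ...;
--   WITH updt AS (UPDATE public.mv1 AS mv SET __ivm_count__ = ... ) ...;

A GLOBAL TEMPORARY TABLE, as proposed in the linked thread, would give the same reuse without each session having to run the CREATE at all.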
Re: mdclose() does not cope w/ FileClose() failure
Hello. At Sun, 22 Dec 2019 12:21:00 -0800, Noah Misch wrote in > On Sun, Dec 22, 2019 at 01:19:30AM -0800, Noah Misch wrote: > > I am inclined to fix this by decrementing md_num_open_segs before modifying > > md_seg_fds (second attachment). > > That leaked memory, since _fdvec_resize() assumes md_num_open_segs is also the > allocated array length. The alternative is looking better: I agree that v2 is cleaner in the light of modularity and fixes the memory leak happens at re-open. > > An alternative would be to call > > _fdvec_resize() after every FileClose(), like mdtruncate() does; however, > > the > > repalloc() overhead could be noticeable. (mdclose() is called much more > > frequently than mdtruncate().) > > I can skip repalloc() when the array length decreases, to assuage mdclose()'s > worry. In the mdclose() case, the final _fdvec_resize(reln, fork, 0) will > still pfree() the array. Array elements that mdtruncate() frees today will > instead persist to end of transaction. That is okay, since mdtruncate() > crossing more than one segment boundary is fairly infrequent. For it to > happen, you must either create a >2G relation and then TRUNCATE it in the same > transaction, or VACUUM must find >1-2G of unused space at the end of the > relation. I'm now inclined to do it that way, attached. * It doesn't seem worthwhile complicating the code by having a more * aggressive growth strategy here; the number of segments doesn't * grow that fast, and the memory context internally will sometimes -* avoid doing an actual reallocation. +* avoid doing an actual reallocation. Likewise, since the number of +* segments doesn't shrink that fast, don't shrink at all. During +* mdclose(), we'll pfree the array at nseg==0. If I understand it correctly, it is mentioning the number of the all segment files in a fork, not the length of md_seg_fds arrays at a certain moment. But actually _fdvec_resize is called for every segment opening during mdnblocks (just-after mdopen), and every segment closing during mdclose and mdtruncate as mentioned here. We are going to omit pallocs only in the decreasing case. If we regard repalloc as far faster than FileOpen/FileClose or we care about only increase of segment number of mdopen'ed files and don't care the frequent resize that happens during the functions above, then the comment is right and we may resize the array in the segment-by-segment manner. But if they are comparable each other, or we don't want the array gets resized frequently, we might need to prevent repalloc from happening on every segment increase, too. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
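To make the trade-off concrete, here is a sketch of the shrink-without-repalloc idea being discussed, written as if it lived inside md.c and loosely modeled on _fdvec_resize(); it is an illustration of the approach, not the actual patch. Growth still (re)allocates, shrinking only updates the element count, and the array is freed only when the count reaches zero.

static void
fdvec_resize_sketch(SMgrRelation reln, ForkNumber forknum, int nseg)
{
    if (nseg == 0)
    {
        /* mdclose() path: hand the whole array back. */
        if (reln->md_num_open_segs[forknum] > 0)
        {
            pfree(reln->md_seg_fds[forknum]);
            reln->md_seg_fds[forknum] = NULL;
        }
    }
    else if (nseg > reln->md_num_open_segs[forknum])
    {
        /* Growing: allocate or reallocate, as today. */
        if (reln->md_num_open_segs[forknum] == 0)
            reln->md_seg_fds[forknum] =
                MemoryContextAlloc(MdCxt, sizeof(MdfdVec) * nseg);
        else
            reln->md_seg_fds[forknum] =
                repalloc(reln->md_seg_fds[forknum], sizeof(MdfdVec) * nseg);
    }

    /*
     * Shrinking (the mdtruncate()/mdclose() case): deliberately no repalloc.
     * Trailing elements simply persist until the array is freed at nseg == 0,
     * which the mail above argues is an acceptable trade-off.
     */
    reln->md_num_open_segs[forknum] = nseg;
}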
Re: [PATCH] Increase the maximum value track_activity_query_size
On Sun, Dec 22, 2019 at 09:06:41AM +0100, Julien Rouhaud wrote: On Sun, Dec 22, 2019 at 1:03 AM Tomas Vondra wrote: On Sat, Dec 21, 2019 at 04:25:05PM -0500, Tom Lane wrote: >Nikolay Samokhvalov writes: >> Here is what ORMs do: >> select length('SELECT "column_name_1001", "column_name_1002", >> "column_name_1003", "column_name_1004", "column_name_1005", >> "column_name_1006", "column_name_1007", "column_name_1008", >> "column_name_1009", "column_name_1010", "column_name_1011", >> "column_name_1012", "column_name_1013", "column_name_1014", >> "column_name_1015", "column_name_1016", "column_name_1017", >> "column_name_1018", "column_name_1019", "column_name_1020", >> "column_name_1021", "column_name_1022", "column_name_1023", >> "column_name_1024", "column_name_1025", "column_name_1026", >> "column_name_1027", "column_name_1028", "column_name_1029", >> "column_name_1030", "column_name_1031", "column_name_1032", >> "column_name_1033", "column_name_1034", "column_name_1035", >> "column_name_1036", "column_name_1037", "column_name_1038", >> "column_name_1039", "column_name_1040", "column_name_1041", >> "column_name_1042", "column_name_1043", "column_name_1044", >> "column_name_1045", "column_name_1046", "column_name_1047", >> "column_name_1048", "column_name_1049", "column_name_1050" FROM >> "some_table";'); >> length >> >>1024 >> (1 row) > >> That's it – with default settings, you won't see WHERE clause or >> anything else. > >If that's true, it doesn't offer much of a case for upping the limit >on track_activity_query_size. The longest such a query could reasonably >get is somewhere near NAMEDATALEN times MaxHeapAttributeNumber, which >as it happens is exactly the existing limit on track_activity_query_size. > >> As a result, many queries exceed track_activity_query_size > >How? And if they are, why do you care? Such queries sure seem >pretty content-free. > I believe the example was just a very simplistic example. ORMs can of course generate queries with joins, which can easily exceed the limit you mentioned. >> What is the overhead here except the memory consumption? > >The time to copy those strings out of shared storage, any time >you query pg_stat_activity. > IMO that seems like a reasonable price to pay, if you want to see complete queries and bump the track_activity_query_size value up. Couldn't be pg_stat_statements (or any similar extension) queryid exposure in pg_stat_activity [1] also an alternative? You wouldn't have the parameters but maybe the normalized query would be enough for most analysis. Now, maybe pg_stat_statements jumble overhead for such large statements would be even more problematic. But that would effectively add dependency on pg_stat_statements, no? I don't think we want that. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
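For readers following along, a quick way to see whether the current limit is actually truncating query texts, and to raise it, is sketched below. The value 32768 is only an example; the parameter takes effect only after a server restart because the per-backend activity buffers live in shared memory.

-- Is the current limit truncating query texts?
SHOW track_activity_query_size;

SELECT pid, length(query) AS shown_len
FROM pg_stat_activity
ORDER BY length(query) DESC
LIMIT 10;

-- Raise the limit (example value; requires a restart to take effect).
ALTER SYSTEM SET track_activity_query_size = 32768;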
Re: [PATCH] Increase the maximum value track_activity_query_size
On Mon, Dec 23, 2019 at 1:10 PM Tomas Vondra wrote: > > On Sun, Dec 22, 2019 at 09:06:41AM +0100, Julien Rouhaud wrote: > >On Sun, Dec 22, 2019 at 1:03 AM Tomas Vondra > > wrote: > >> > >> On Sat, Dec 21, 2019 at 04:25:05PM -0500, Tom Lane wrote: > >> >> What is the overhead here except the memory consumption? > >> > > >> >The time to copy those strings out of shared storage, any time > >> >you query pg_stat_activity. > >> > > >> > >> IMO that seems like a reasonable price to pay, if you want to see > >> complete queries and bump the track_activity_query_size value up. > > > >Couldn't be pg_stat_statements (or any similar extension) queryid > >exposure in pg_stat_activity [1] also an alternative? You wouldn't > >have the parameters but maybe the normalized query would be enough for > >most analysis. Now, maybe pg_stat_statements jumble overhead for such > >large statements would be even more problematic. > > > > But that would effectively add dependency on pg_stat_statements, no? I > don't think we want that. The queryid field is part of the core, so no dependency is added. You just get a somewhat useless NULL value returned until you load an extension that compute a queryid, which may be pg_stat_statements but any other one will work too.
Re: [Proposal] Extend TableAM routines for ANALYZE scan
Hello, On Thu, Dec 5, 2019 at 11:14 AM Pengzhou Tang wrote: > > When hacking the Zedstore, we need to get a more accurate statistic for > zedstore and we > faced some restrictions: > 1) acquire_sample_rows() always use RelationGetNumberOfBlocks to generate > sampling block > numbers, this is not friendly for zedstore which wants to use a logical > block number and might also > not friendly to non-block-oriented Table AMs. > 2) columns of zedstore table store separately, so columns in a row have a > different physical position, > tid in a tuple is invalid for zedstore which means the correlation > statistic is incorrect for zedstore. > 3) RelOptInfo->pages is not correct for Zedstore if we only access partial of > the columns which make >the IO cost much higher than the actual cost. > > For 1) and 2), we propose to extend existing ANALYZE-scan table AM routines > in patch > "0001-ANALYZE-tableam-API-change.patch" which add three more APIs: > scan_analyze_beginscan(), scan_analyze_sample_tuple(), > scan_analyze_endscan(). This provides > more convenience and table AMs can take more control of every step of > sampling rows. Meanwhile, > with the new structure named "AcquireSampleContext", we can acquire extra > info (eg: physical position, > physical size) except the real columns values. > > For 3), we hope we can have a similar mechanism with RelOptInfo->rows which > is calculated from > (RelOptInfo->tuples * Selectivity), we can calculate RelOptInfo->pages with > a page selectivity which > is base on the selected zedstore columns. > 0002-Planner-can-estimate-the-pages-based-on-the-columns-.patch > shows one idea that adding the `stadiskfrac` to pg_statistic and planner use > it to estimate the > RelOptInfo->pages. > > 0003-ZedStore-use-extended-ANAlYZE-API.patch is attached to only show how > Zedstore use the > previous patches to achieve: > 1. use logical block id to acquire the sample rows. > 2. can only acquire sample rows from specified column c1, this is used when > user only analyze table > on specified columns eg: "analyze zs (c1)". > 3 when ANALYZE, zedstore table AM provided extra disksize info, then ANALYZE > compute the > physical fraction statistic of each column and planner use it to estimate > the IO cost based on > the selected columns. I couldn't find an entry for that patchset in the next commitfest. Could you register it so that it won't be forgotten?
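Since the patches themselves are not quoted here, the following is a purely illustrative guess at the shape of the three callbacks named in the proposal. The signatures and the opaque AcquireSampleContext are assumptions for this sketch; the real definitions are in 0001-ANALYZE-tableam-API-change.patch and may differ.

#include "postgres.h"
#include "executor/tuptable.h"
#include "utils/rel.h"

typedef struct AcquireSampleContext AcquireSampleContext;  /* opaque here */

/* Set up an AM-specific sampling scan (block-based or not). */
typedef void (*scan_analyze_beginscan_function) (Relation rel,
                                                 AcquireSampleContext *ctx);

/* Produce the next sample tuple, if any, into the slot; false when done. */
typedef bool (*scan_analyze_sample_tuple_function) (Relation rel,
                                                    AcquireSampleContext *ctx,
                                                    TupleTableSlot *slot);

/* Tear down the sampling scan. */
typedef void (*scan_analyze_endscan_function) (Relation rel,
                                               AcquireSampleContext *ctx);

The stated intent is that the context can also carry AM-specific extras such as physical position and on-disk size, which a column store like Zedstore needs for per-column statistics.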
Re: Fetching timeline during recovery
On Mon, 23 Dec 2019 12:36:56 +0900 Michael Paquier wrote: > On Fri, Dec 20, 2019 at 11:14:28AM +0100, Jehan-Guillaume de Rorthais wrote: > > Yes, that would be great but sadly, it would introduce a regression on > > various tools relying on them. At least, the one doing "select *" or most > > probably "select func()". > > > > But anyway, adding 5 funcs is not a big deal neither. Too bad they are so > > close to existing ones though. > > Consistency of the data matters a lot if we want to build reliable > tools on top of them in case someone would like to compare the various > modes, and using different functions for those fields creates locking > issues (somewhat the point of Fujii-san upthread?). To sum up: the original patch was about fetching the current timeline of a standby from memory without relying on the asynchronous controlfile or pg_stat_get_wal_receiver() which only shows data when the wal_receiver is running. Fujii-san was pointing out we must fetch both the received LSN and its timeline with the same lock so they are consistent. Michael is now discussing about fetching multiple LSN and their timeline, while keeping them consistent, eg. received+tli and applied+tli. Thank you for pointing this out. I thought about various way to deal with this concern and would like to discuss/defend a new option based on existing pg_stat_get_wal_receiver() function. The only problem I'm facing with this function is that it returns a full NULL record if no wal receiver is active. My idea would be to return a row from pg_stat_get_wal_receiver() as soon as a wal receiver has been replicating during the uptime of the standby, no matter if there's one currently working or not. If no wal receiver is active, the "pid" field would be NULL and the "status" would reports eg. "inactive". All other fields would report their last known value as they are kept in shared memory WalRcv struct. From the monitoring and HA point of view, we are now able to know that a wal receiver existed, the lsn it has stopped, on what timeline, all consistent with the same lock. That answer my original goal. We could extend this with two more fields about replayed lsn and timeline to address last Michael's concern if we decide it's really needed (and I think it's a valid concern for eg. monitoring tools). There's some more potential discussion about the pg_stat_wal_receiver view which relies on pg_stat_get_wal_receiver(). My proposal do not introduce regression with it as the view already filter out NULL data using "WHERE s.pid IS NOT NULL". But: 1. we could decide to remove this filter to expose the data even when no wal receiver is active. It's the same behavior than pg_stat_subscription view. It could introduce regression from tools point of view, but adds some useful information. I would vote 0 for it. 2. we could extend it with new replayed lsn/tli fields. I would vote +1 for it. On the "dark" side of this proposal, we do not deal with the primary side. We still have no way to fetch various lsn+tli from the WAL Writer. However, I included pg_current_wal_lsn_tl() in my original patch only for homogeneity reason and the discussion slipped on this side while paying attention to the user facing function logic and homogeneity. If this discussion decide this is a useful feature, I think it could be addressed in another patch (and I volunteer to deal with it). Bellow the sum up this 6th proposition with examples. 
When wal receiver never started (same as today):

-[ RECORD 1 ]---------+------------------------------
pid                   | Ø
status                | Ø
receive_start_lsn     | Ø
receive_start_tli     | Ø
received_lsn          | Ø
received_tli          | Ø
last_msg_send_time    | Ø
last_msg_receipt_time | Ø
latest_end_lsn        | Ø
latest_end_time       | Ø
slot_name             | Ø
sender_host           | Ø
sender_port           | Ø
conninfo              | Ø

When wal receiver is active:

$ select * from pg_stat_get_wal_receiver();
-[ RECORD 1 ]---------+------------------------------
pid                   | 8576
status                | streaming
receive_start_lsn     | 0/400
receive_start_tli     | 1
received_lsn          | 0/4000148
received_tli          | 1
last_msg_send_time    | 2019-12-23 12:28:52.588738+01
last_msg_receipt_time | 2019-12-23 12:28:52.588839+01
latest_end_lsn        | 0/4000148
latest_end_time       | 2019-12-23 11:15:43.431657+01
slot_name             | Ø
sender_host           | /tmp
sender_port           | 15441
conninfo              | port=15441 application_name=s

When wal receiver is not running and shared memory WalRcv is reporting past activity:

$ select * from pg_stat_get_wal_receiver();
-[ RECORD 1 ]---------+------------------------------
pid                   | Ø
status                | inactive
receive_start_lsn     | 0/400
receive_start_tli     | 1
received_lsn          | 0/4000148
received_tli
Re: unsupportable composite type partition keys
Amit Langote writes: > On Sun, Dec 22, 2019 at 6:13 AM Tom Lane wrote: > + * To ensure that it's not leaked completely, re-attach it to the > + * new reldesc, or make it a child of the new reldesc's rd_pdcxt > + * in the unlikely event that there is one already. (See hack in > + * RelationBuildPartitionDesc.) > While I can imagine that RelationBuildPartitionDesc() might encounter > an old partition descriptor making the re-parenting hack necessary > there, I don't see why it's needed here, because a freshly built > relation descriptor would not contain the partition descriptor after > this patch. Well, as the comment says, that's probably unreachable today. But I could see it happening in the future, particularly if we ever allow partitioned system catalogs. There are a lot of paths through this code that are not obvious to the naked eye, and some of them can cause relcache entries to get populated behind-your-back. Most of relcache.c is careful about this; I do not see an excuse for the partition-data code to be less so, even if we think it can't happen today. (I notice that RelationBuildPartitionKey is making a similar assumption that the partkey couldn't magically appear while it's working, and I don't like it much there either.) >> * It'd be better to declare RelationGetPartitionKey and >> RelationGetPartitionDesc in relcache.h and get their callers out of the >> business of including rel.h, where possible. > Although I agree to declare them in relcache.h, that doesn't reduce > the need to include rel.h in their callers much, because anyplace that > uses RelationGetPartitionDesc() is also very likely to use > RelationGetRelid() which is in rel.h. Hm. Well, we can try anyway. > [1] > https://www.postgresql.org/message-id/CA%2BHiwqFucUh7hYkfZ6x1MVcs_R24eUfNVuRwdE_FwuwK8XpSZg%40mail.gmail.com Oh, interesting --- I hadn't been paying much attention to that thread. I'll compare your PoC there to what I did. regards, tom lane
reduce size of fmgr_builtins array
Hi all, Currently, we include the function name string in each FmgrBuiltin struct, whose size is 24 bytes on 64 bit platforms. As far as I can tell, the name is usually unused, so the attached (WIP, untested) patch stores it separately, reducing this struct to 16 bytes. We can go one step further and allocate the names as a single character string, reducing the binary size. It doesn't help much to store offsets, since there are ~40k characters, requiring 32-bit offsets. If we instead compute the offset on the fly from stored name lengths, we can use 8-bit values, saving 19kB of space in the binary over using string pointers. -- John Naylorhttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services v1-fmgr-func-names.patch Description: Binary data
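The offset arithmetic described above is easy to picture with a small standalone sketch: store the builtin names as one concatenated, NUL-separated string plus a parallel array of 8-bit lengths, and recover a name by summing lengths up to the index. The sample data below is invented and stands in for the generated tables; the linear scan is assumed to be acceptable because, as noted, the name is rarely looked up.

#include <stdint.h>
#include <stdio.h>

/* Invented sample data standing in for the generated tables. */
static const char fmgr_builtin_names[] =
    "boolin\0boolout\0byteain\0byteaout\0";
static const uint8_t fmgr_builtin_name_lens[] = {
    7, 8, 8, 9          /* length of each name including its trailing NUL */
};

/* Recover the name of builtin function number idx by summing lengths. */
static const char *
fmgr_builtin_name(int idx)
{
    const char *p = fmgr_builtin_names;

    for (int i = 0; i < idx; i++)
        p += fmgr_builtin_name_lens[i];
    return p;
}

int
main(void)
{
    printf("%s\n", fmgr_builtin_name(2));   /* prints "byteain" */
    return 0;
}

With one byte of length per entry instead of an 8-byte pointer (or a 4-byte offset), the per-function overhead for the name drops accordingly, which is where the quoted ~19kB saving comes from.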
Re: unsupportable composite type partition keys
BTW, I forgot to mention: while I think the patch to forbid pseudotypes by using CheckAttributeType() can be back-patched, I'm leaning towards not back-patching the other patch. The situation where we get into infinite recursion seems not very likely in practice, and it's not going to cause any crash or data loss, so I think we can just say "sorry that's not supported before v13". The patch as I'm proposing it seems rather invasive for a back-branch fix. Also, changing RelationGetPartitionKey/Desc from macros to functions is at least a weak ABI break. If there are extensions calling either, they might still work without a recompile --- but if they have code paths that are the first thing to touch either field since a relcache flush, they'd crash. regards, tom lane
Re: unsupportable composite type partition keys
On Mon, Dec 23, 2019 at 23:49 Tom Lane wrote: > Amit Langote writes: > > [1] > https://www.postgresql.org/message-id/CA%2BHiwqFucUh7hYkfZ6x1MVcs_R24eUfNVuRwdE_FwuwK8XpSZg%40mail.gmail.com > > Oh, interesting --- I hadn't been paying much attention to that thread. > I'll compare your PoC there to what I did. Actually, I should’ve said that your patch is much better attempt at getting this in order, so there’s not much to see in my patch really. :) Thanks, Amit >
Re: [HACKERS] Block level parallel vacuum
On Mon, 23 Dec 2019 at 16:11, Amit Kapila wrote:
>
> On Fri, Dec 20, 2019 at 12:13 PM Masahiko Sawada wrote:
> >
> > I've attached the updated version patch that incorporated the all
> > review comments I go so far.
> >
>
> I have further edited the first two patches posted by you. The
> changes include (a) changed tests to reset the guc, (b) removing some
> stuff which is not required in this version, (c) moving some variables
> around to make them in better order, (d) changed comments and few
> other cosmetic things and (e) commit messages for first two patches.
>
> I think the first two patches attached in this email are in good shape
> and we can commit those unless you or someone has more comments on
> them, the main parallel vacuum patch can still be improved by some
> more test/polish/review. I am planning to push the first two patches
> next week after another pass. The first two patches are explained in
> brief as below:
>
> 1. v4-0001-Delete-empty-pages-in-each-pass-during-GIST-VACUUM: It
> allows us to delete empty pages in each pass during GIST VACUUM.
> Earlier, we use to postpone deleting empty pages till the second stage
> of vacuum to amortize the cost of scanning internal pages. However,
> that can sometimes (say vacuum is canceled or errored between first
> and second stage) delay the pages to be recycled. Another thing is
> that to facilitate deleting empty pages in the second stage, we need
> to share the information of internal and empty pages between different
> stages of vacuum. It will be quite tricky to share this information
> via DSM which is required for the main parallel vacuum patch. Also,
> it will bring the logic to reclaim deleted pages closer to nbtree
> where we delete empty pages in each pass. Overall, the advantages of
> deleting empty pages in each pass outweigh the advantages of
> postponing the same. This patch is discussed in detail in a separate
> thread [1].
>
> 2. v39-0001-Introduce-IndexAM-fields-for-parallel-vacuum.patch:
> Introduce new fields amusemaintenanceworkmem and
> amparallelvacuumoptions in IndexAmRoutine for parallel vacuum. The
> amusemaintenanceworkmem tells whether a particular IndexAM uses
> maintenance_work_mem or not. This will help in controlling the memory
> used by individual workers as otherwise, each worker can consume
> memory equal to maintenance_work_mem. This has been discussed in
> detail in a separate thread as well [2]. The amparallelvacuumoptions
> tell whether a particular IndexAM participates in a parallel vacuum
> and if so in which phase (bulkdelete, vacuumcleanup) of vacuum.
>
> [1] - https://www.postgresql.org/message-id/CAA4eK1LGr%2BMN0xHZpJ2dfS8QNQ1a_aROKowZB%2BMPNep8FVtwAA%40mail.gmail.com
> [2] - https://www.postgresql.org/message-id/caa4ek1lmcd5apogzwim5nn58ki+74a6edghx4wd8haskvha...@mail.gmail.com
>

Hi,
I reviewed the v39 patch set. Below are some minor review comments:

1.
+ * memory equal to maitenance_work_mem, the new maitenance_work_mem for

maitenance_work_mem should be replaced by maintenance_work_mem.

2.
+ * The number of workers can vary between and bulkdelete and cleanup

I think the above sentence is grammatically incorrect; "and" is extra in it.

3.
+ /*
+  * Open table. The lock mode is the same as the leader process. It's
+  * okay because The lockmode does not conflict among the parallel workers.
+  */

I think "lock mode" and "lockmode" should be spelled the same way (i.e. the extra space should be removed from "lock mode"). In "The", "T" should be a lower-case letter.

4.
+ /* We don't support parallel vacuum for autovacuum for now */

I think the above sentence should be "As of now, we don't support parallel vacuum for autovacuum".

5.
I am not sure that I am right, but I can see that we are not consistent when ending single-line comments. I think that if a single-line comment starts with an upper-case letter, we should not put a period (dot) at the end, but if it starts with a lower-case letter, we should put a period at the end.

a)
+ /* parallel vacuum must be active */

I think we should end the above comment with a dot, or we should make the "p" of "parallel" an upper-case letter.

b)
+ /* At least count itself */

I think the above is correct.

If my understanding is correct, then please let me know so that I can make these changes on top of the v39 patch set.

6.
+boolamusemaintenanceworkmem;

I think we haven't run pgindent.

Thanks and Regards
Mahendra Thalor
EnterpriseDB: http://www.enterprisedb.com
Re: unsupportable composite type partition keys
Amit Langote writes: > On Mon, Dec 23, 2019 at 23:49 Tom Lane wrote: >> Oh, interesting --- I hadn't been paying much attention to that thread. >> I'll compare your PoC there to what I did. > Actually, I should’ve said that your patch is much better attempt at > getting this in order, so there’s not much to see in my patch really. :) One thing I see is that you chose to relocate RelationGetPartitionDesc's declaration to partdesc.h, whereupon RelationBuildPartitionDesc doesn't have to be exported at all anymore. Perhaps that's a better factorization than what I did. It supposes that any caller of RelationGetPartitionDesc is going to need partdesc.h, but that seems reasonable. We could likewise move RelationGetPartitionKey to partcache.h. regards, tom lane
relpages of btree indexes are not truncating even after deleting all the tuples from table and doing vacuum
Hi All,

While testing the "parallel vacuum" patch, I found that the size of an index relation is not reduced even after deleting all the tuples and firing a vacuum command. I am not sure whether this is expected behavior or not. For reference, below I am giving one example.

postgres=# create table test (a int);
CREATE TABLE
postgres=# create index indx1 on test (a);
CREATE INDEX
postgres=# insert into test (select a from generate_series(1,10) a);
INSERT 0 10
postgres=# analyze ;
ANALYZE
postgres=# select relpages, relname from pg_class where relname = 'indx1';
 relpages | relname
----------+---------
      276 | indx1
(1 row)

-- delete all the tuples from table.
postgres=# delete from test ;
DELETE 10

-- do vacuum to test tables
postgres=# vacuum test ;
VACUUM

-- check relpages in 'indx1' and 'test'
postgres=# select relpages, relname from pg_class where relname = 'indx1';
 relpages | relname
----------+---------
      276 | indx1
(1 row)

-- do vacuum to all the tables and check relpages in 'indx1'
postgres=# vacuum ;
VACUUM
postgres=# select relpages, relname from pg_class where relname = 'indx1';
 relpages | relname
----------+---------
      276 | indx1
(1 row)

-- check relpages in 'test' table
postgres=# select relpages, relname from pg_class where relname = 'test';
 relpages | relname
----------+---------
        0 | test
(1 row)

From the above example, we can see that after deleting all the tuples from the table and firing a vacuum command, the size of the table is reduced but the size of the index relation is the same as before vacuum.

Please let me know your thoughts.

Thanks and Regards
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com
string literal continuations in C
Per a recent thread, these patches remove string literals split with \-escaped newlines. The first is for the message "materialize mode required, but it is not allowed in this context" where it's more prevalent, and we keep perpetuating it; the second is for other messages, whose bulk is in dblink and tablefunc. I think the split is pointless and therefore propose to push both together as a single commit, but maybe somebody would like me to leave those contrib modules alone. There are many other error messages that are split with no \; I would prefer not to have them, but maybe it would be too intrusive to change them all. So let's do this for now and remove this one point of ugliness. -- Álvaro Herrera39°49'30"S 73°17'W >From 307d334046c62607e907a4f7a71e121c5c4b3dba Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Mon, 23 Dec 2019 16:23:57 -0300 Subject: [PATCH 1/2] Fix line continuation --- contrib/pg_stat_statements/pg_stat_statements.c | 3 +-- contrib/tablefunc/tablefunc.c | 12 src/backend/access/transam/xlogfuncs.c | 3 +-- src/backend/commands/extension.c| 9 +++-- src/backend/commands/prepare.c | 3 +-- src/backend/libpq/hba.c | 3 +-- src/backend/replication/logical/launcher.c | 3 +-- src/backend/replication/slotfuncs.c | 3 +-- src/backend/replication/walsender.c | 3 +-- src/backend/utils/adt/pgstatfuncs.c | 6 ++ src/backend/utils/misc/guc.c| 3 +-- src/backend/utils/mmgr/portalmem.c | 3 +-- 12 files changed, 18 insertions(+), 36 deletions(-) diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c index 63b5888ebb..0590a6769b 100644 --- a/contrib/pg_stat_statements/pg_stat_statements.c +++ b/contrib/pg_stat_statements/pg_stat_statements.c @@ -1415,8 +1415,7 @@ pg_stat_statements_internal(FunctionCallInfo fcinfo, if (!(rsinfo->allowedModes & SFRM_Materialize)) ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), - errmsg("materialize mode required, but it is not " \ - "allowed in this context"))); + errmsg("materialize mode required, but it is not allowed in this context"))); /* Switch into long-lived context to construct returned data structures */ per_query_ctx = rsinfo->econtext->ecxt_per_query_memory; diff --git a/contrib/tablefunc/tablefunc.c b/contrib/tablefunc/tablefunc.c index 256d52fc62..30d94f5d90 100644 --- a/contrib/tablefunc/tablefunc.c +++ b/contrib/tablefunc/tablefunc.c @@ -373,8 +373,7 @@ crosstab(PG_FUNCTION_ARGS) if (!(rsinfo->allowedModes & SFRM_Materialize)) ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), - errmsg("materialize mode required, but it is not " \ - "allowed in this context"))); + errmsg("materialize mode required, but it is not allowed in this context"))); per_query_ctx = rsinfo->econtext->ecxt_per_query_memory; @@ -650,8 +649,7 @@ crosstab_hash(PG_FUNCTION_ARGS) rsinfo->expectedDesc == NULL) ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), - errmsg("materialize mode required, but it is not " \ - "allowed in this context"))); + errmsg("materialize mode required, but it is not allowed in this context"))); per_query_ctx = rsinfo->econtext->ecxt_per_query_memory; oldcontext = MemoryContextSwitchTo(per_query_ctx); @@ -1006,8 +1004,7 @@ connectby_text(PG_FUNCTION_ARGS) rsinfo->expectedDesc == NULL) ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), - errmsg("materialize mode required, but it is not " \ - "allowed in this context"))); + errmsg("materialize mode required, but it is not allowed in this context"))); if (fcinfo->nargs == 6) { @@ -1086,8 +1083,7 @@ 
connectby_text_serial(PG_FUNCTION_ARGS) rsinfo->expectedDesc == NULL) ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), - errmsg("materialize mode required, but it is not " \ - "allowed in this context"))); + errmsg("materialize mode required, but it is not allowed in this context"))); if (fcinfo->nargs == 7) { diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c index 5fd12152b2..d7d17efaf9 100644 --- a/src/backend/access/transam/xlogfuncs.c +++ b/src/backend/access/transam/xlogfuncs.c @@ -186,8 +186,7 @@ pg_stop_backup_v2(PG_FUNCTION_ARGS) if (!(rsinfo->allowedModes & SFRM_Materialize)) ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), - errmsg("materialize mode required, but it is not " \ - "allowed in this context"))); + errmsg("materialize mode required, but it is not allowed in this context"))); /* Build a tuple descriptor for our result type */ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE) diff --git a/src/backend/commands/extension.c b/src/backend/c
Re: Should we rename amapi.h and amapi.c?
On Mon, Dec 23, 2019 at 02:34:34PM +0900, Michael Paquier wrote: > Hi all, > > I was working on some stuff for table AMs, and I got to wonder it we > had better rename amapi.h to indexam.h and amapi.c to indexam.c, so as > things are more consistent with table AM. It is a bit annoying to > name the files dedicated to index AMs with what looks like now a too > generic name. That would require switching a couple of header files > for existing module developers, which is always annoying, but the move > makes sense thinking long-term? +1 for being more specific about which AM we're talking about. Best, David. -- David Fetter http://fetter.org/ Phone: +1 415 235 3778 Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
Re: Should we rename amapi.h and amapi.c?
On Sun, Dec 22, 2019 at 9:34 PM Michael Paquier wrote: > Hi all, > > I was working on some stuff for table AMs, and I got to wonder it we > had better rename amapi.h to indexam.h and amapi.c to indexam.c, so as > things are more consistent with table AM. It is a bit annoying to > name the files dedicated to index AMs with what looks like now a too > generic name. That would require switching a couple of header files > for existing module developers, which is always annoying, but the move > makes sense thinking long-term? > > Any thoughts? > I had raised the same earlier and [1] has response from Andres, which was "We probably should rename it, but not in 12..." [1] https://www.postgresql.org/message-id/20190508215135.4eljnhnle5xp3jwb%40alap3.anarazel.de
Re: relpages of btree indexes are not truncating even after deleting all the tuples from table and doing vacuum
On Mon, Dec 23, 2019 at 11:05 AM Mahendra Singh wrote: > From above example, we can see that after deleting all the tuples from table > and firing vacuum command, size of table is reduced but size of index > relation is same as before vacuum. VACUUM is only able to make existing empty pages in indexes recyclable by future page splits within the same index. It is not possible for it to reclaim space for the filesystem. Workload characteristics tend to determine whether or not this limitation is truly important. You can observe which pages are "free" in this sense (i.e. whether they've been placed by the FSM for recycling) by using contrib/pg_freespacemap. -- Peter Geoghegan
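For the example index above, something like the following should show that behaviour; exact avail values depend on how the FSM buckets free space, so treat a nonzero value simply as "recorded as reusable".

CREATE EXTENSION IF NOT EXISTS pg_freespacemap;

-- How many pages of the example index the FSM now records as reusable.
SELECT count(*) AS total_pages,
       count(*) FILTER (WHERE avail > 0) AS reusable_pages
FROM pg_freespace('indx1');

If the goal is to actually give the disk space back to the filesystem, rebuilding the index with REINDEX (or VACUUM FULL on the table) is the usual way to do that.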
Re: unsupportable composite type partition keys
I wrote: > One thing I see is that you chose to relocate RelationGetPartitionDesc's > declaration to partdesc.h, whereupon RelationBuildPartitionDesc doesn't > have to be exported at all anymore. Perhaps that's a better factorization > than what I did. It supposes that any caller of RelationGetPartitionDesc > is going to need partdesc.h, but that seems reasonable. We could likewise > move RelationGetPartitionKey to partcache.h. I concluded that that is indeed a better solution; it does allow removing some rel.h inclusions (though possibly those were just duplicative?), and it also means that relcache.c itself doesn't need any partitioning inclusions at all. Here's a cleaned-up patch that does it like that and also fixes the memory leakage issue. I noticed along the way that with partkeys only being loaded on demand, we no longer need the incredibly-unsafe hack in RelationBuildPartitionKey whereby it just silently ignores failure to read the pg_partitioned_table entry. I also rearranged RelationBuildPartitionDesc so that it uses the same context-reparenting trick as RelationBuildPartitionKey. That doesn't save us anything, but it makes the code considerably more robust, I think; we don't need to assume as much about what partition_bounds_copy does. One other thing worth noting is that I used unlikely() to try to discourage the compiler from inlining RelationBuildPartitionDesc into RelationGetPartitionDesc (and likewise for the Key functions). Not sure how effective that is, but it can't hurt. I haven't removed equalPartitionDescs here; that seems like material for a separate patch (to make it easier to put it back if we need it). regards, tom lane diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c index 452a7f3..8b68fb7 100644 --- a/src/backend/catalog/heap.c +++ b/src/backend/catalog/heap.c @@ -83,7 +83,6 @@ #include "utils/inval.h" #include "utils/lsyscache.h" #include "utils/partcache.h" -#include "utils/rel.h" #include "utils/ruleutils.h" #include "utils/snapmgr.h" #include "utils/syscache.h" diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c index 8f242ae..a143998 100644 --- a/src/backend/commands/indexcmds.c +++ b/src/backend/commands/indexcmds.c @@ -812,7 +812,7 @@ DefineIndex(Oid relationId, */ if (partitioned && (stmt->unique || stmt->primary)) { - PartitionKey key = rel->rd_partkey; + PartitionKey key = RelationGetPartitionKey(rel); int i; /* diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c index d8cbd6a..dc3d792 100644 --- a/src/backend/executor/execPartition.c +++ b/src/backend/executor/execPartition.c @@ -30,7 +30,6 @@ #include "rewrite/rewriteManip.h" #include "utils/lsyscache.h" #include "utils/partcache.h" -#include "utils/rel.h" #include "utils/rls.h" #include "utils/ruleutils.h" diff --git a/src/backend/partitioning/partbounds.c b/src/backend/partitioning/partbounds.c index d160d6f..4c6ca89 100644 --- a/src/backend/partitioning/partbounds.c +++ b/src/backend/partitioning/partbounds.c @@ -35,7 +35,6 @@ #include "utils/hashutils.h" #include "utils/lsyscache.h" #include "utils/partcache.h" -#include "utils/rel.h" #include "utils/ruleutils.h" #include "utils/snapmgr.h" #include "utils/syscache.h" @@ -775,6 +774,11 @@ partition_bounds_equal(int partnatts, int16 *parttyplen, bool *parttypbyval, /* * Return a copy of given PartitionBoundInfo structure. The data types of bounds * are described by given partition key specification. 
+ * + * Note: it's important that this function and its callees not do any catalog + * access, nor anything else that would result in allocating memory other than + * the returned data structure. Since this is called in a long-lived context, + * that would result in unwanted memory leaks. */ PartitionBoundInfo partition_bounds_copy(PartitionBoundInfo src, diff --git a/src/backend/partitioning/partdesc.c b/src/backend/partitioning/partdesc.c index 6ede084..e26e45e 100644 --- a/src/backend/partitioning/partdesc.c +++ b/src/backend/partitioning/partdesc.c @@ -47,17 +47,48 @@ typedef struct PartitionDirectoryEntry PartitionDesc pd; } PartitionDirectoryEntry; +static void RelationBuildPartitionDesc(Relation rel); + + +/* + * RelationGetPartitionDesc -- get partition descriptor, if relation is partitioned + * + * Note: we arrange for partition descriptors to not get freed until the + * relcache entry's refcount goes to zero (see hacks in RelationClose, + * RelationClearRelation, and RelationBuildPartitionDesc). Therefore, even + * though we hand back a direct pointer into the relcache entry, it's safe + * for callers to continue to use that pointer as long as (a) they hold the + * relation open, and (b) they hold a relation lock strong enough to ensure + * that the data doesn't become stale. + */ +PartitionDesc +RelationGetPartitionDesc(Relation rel) +{ + if (rel->rd_rel->relkind != RELKIND_PARTITIONED
Re: unsupportable composite type partition keys
On Tue, Dec 24, 2019 at 12:00 AM Tom Lane wrote:
> BTW, I forgot to mention: while I think the patch to forbid pseudotypes
> by using CheckAttributeType() can be back-patched, I'm leaning towards
> not back-patching the other patch. The situation where we get into
> infinite recursion seems not very likely in practice, and it's not
> going to cause any crash or data loss, so I think we can just say
> "sorry that's not supported before v13". The patch as I'm proposing
> it seems rather invasive for a back-branch fix.

It is indeed. Just to be sure, by going with "unsupported before v13", which one do you mean:

* documenting it as so
* giving an error in such cases, like the patch in the first email on this thread did
* doing nothing really

Thanks,
Amit
Re: error context for vacuum to include block number
On Mon, Dec 16, 2019 at 11:49:56AM +0900, Michael Paquier wrote: > On Sun, Dec 15, 2019 at 10:27:12AM -0600, Justin Pryzby wrote: > > I named it so because it calls both lazy_vacuum_index > > ("PROGRESS_VACUUM_PHASE_VACUUM_INDEX") and > > lazy_vacuum_heap("PROGRESS_VACUUM_PHASE_VACUUM_HEAP") > > > > I suppose you don't think the other way around is better? > > lazy_vacuum_index_heap > > Dunno. Let's see if others have other thoughts on the matter. FWIW, > I have a long history at naming things in a way others don't like :) I renamed. And deduplicated two more hunks into a 2nd function. (I'm also including the changes I mentioned here ... in case anyone cares to comment or review). https://www.postgresql.org/message-id/20191220171132.GB30414%40telsasoft.com >From 5317d9f3cee163762563f2255f1dd26800ea858b Mon Sep 17 00:00:00 2001 From: Justin Pryzby Date: Wed, 27 Nov 2019 20:07:10 -0600 Subject: [PATCH v6 1/6] Rename buf to avoid shadowing buf of another type --- src/backend/access/heap/vacuumlazy.c | 20 ++-- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c index ab09d84..3f52278 100644 --- a/src/backend/access/heap/vacuumlazy.c +++ b/src/backend/access/heap/vacuumlazy.c @@ -517,7 +517,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats, BlockNumber next_unskippable_block; bool skipping_blocks; xl_heap_freeze_tuple *frozen; - StringInfoData buf; + StringInfoData sbuf; const int initprog_index[] = { PROGRESS_VACUUM_PHASE, PROGRESS_VACUUM_TOTAL_HEAP_BLKS, @@ -1479,33 +1479,33 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats, * This is pretty messy, but we split it up so that we can skip emitting * individual parts of the message when not applicable. 
*/ - initStringInfo(&buf); - appendStringInfo(&buf, + initStringInfo(&sbuf); + appendStringInfo(&sbuf, _("%.0f dead row versions cannot be removed yet, oldest xmin: %u\n"), nkeep, OldestXmin); - appendStringInfo(&buf, _("There were %.0f unused item identifiers.\n"), + appendStringInfo(&sbuf, _("There were %.0f unused item identifiers.\n"), nunused); - appendStringInfo(&buf, ngettext("Skipped %u page due to buffer pins, ", + appendStringInfo(&sbuf, ngettext("Skipped %u page due to buffer pins, ", "Skipped %u pages due to buffer pins, ", vacrelstats->pinskipped_pages), vacrelstats->pinskipped_pages); - appendStringInfo(&buf, ngettext("%u frozen page.\n", + appendStringInfo(&sbuf, ngettext("%u frozen page.\n", "%u frozen pages.\n", vacrelstats->frozenskipped_pages), vacrelstats->frozenskipped_pages); - appendStringInfo(&buf, ngettext("%u page is entirely empty.\n", + appendStringInfo(&sbuf, ngettext("%u page is entirely empty.\n", "%u pages are entirely empty.\n", empty_pages), empty_pages); - appendStringInfo(&buf, _("%s."), pg_rusage_show(&ru0)); + appendStringInfo(&sbuf, _("%s."), pg_rusage_show(&ru0)); ereport(elevel, (errmsg("\"%s\": found %.0f removable, %.0f nonremovable row versions in %u out of %u pages", RelationGetRelationName(onerel), tups_vacuumed, num_tuples, vacrelstats->scanned_pages, nblocks), - errdetail_internal("%s", buf.data))); - pfree(buf.data); + errdetail_internal("%s", sbuf.data))); + pfree(sbuf.data); } -- 2.7.4 >From 8fdee2073e083388865f52874a672e8cc47f7c51 Mon Sep 17 00:00:00 2001 From: Justin Pryzby Date: Thu, 12 Dec 2019 19:57:25 -0600 Subject: [PATCH v6 2/6] deduplication --- src/backend/access/heap/vacuumlazy.c | 103 +++ 1 file changed, 43 insertions(+), 60 deletions(-) diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c index 3f52278..c6dc44b 100644 --- a/src/backend/access/heap/vacuumlazy.c +++ b/src/backend/access/heap/vacuumlazy.c @@ -153,6 +153,7 @@ static BufferAccessStrategy vac_strategy; static void lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats, Relation *Irel, int nindexes, bool aggressive); +static void lazy_vacuum_index_and_heap(Relation onerel, LVRelStats *vacrelstats, Relation *Irel, int nindexes, IndexBulkDeleteResult **indstats); static void lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats); static bool lazy_check_needs_freeze(Buffer buf, bool *hastup); static void lazy_vacuum_index(Relation indrel, @@ -740,12 +741,6 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats, if ((vacrelstats->max_dead_tuples - vacrelstats->num_dead_tuples) < MaxHeapTuplesPerPage && vacrelstats->num_dead_tuples > 0) { - const int hvp_index[] = { -PROGRESS_VACUUM_PHASE, -PROGRESS_VACUUM_NUM_INDEX_VACUUMS - }; - int64 hvp_val[2]; - /* * Before beginning index vacuuming, we release any pin we may * hold on
Re: unsupportable composite type partition keys
On Tue, Dec 24, 2019 at 6:33 AM Tom Lane wrote: > I wrote: > > One thing I see is that you chose to relocate RelationGetPartitionDesc's > > declaration to partdesc.h, whereupon RelationBuildPartitionDesc doesn't > > have to be exported at all anymore. Perhaps that's a better factorization > > than what I did. It supposes that any caller of RelationGetPartitionDesc > > is going to need partdesc.h, but that seems reasonable. We could likewise > > move RelationGetPartitionKey to partcache.h. > > I concluded that that is indeed a better solution; it does allow removing > some rel.h inclusions (though possibly those were just duplicative?), and > it also means that relcache.c itself doesn't need any partitioning > inclusions at all. > > Here's a cleaned-up patch that does it like that and also fixes the > memory leakage issue. Thanks for the updated patch. I didn't find anything to complain about. > I haven't removed equalPartitionDescs here; that seems like material > for a separate patch (to make it easier to put it back if we need it). Seems like a good idea. Btw, does the memory leakage fix in this patch address any of the pending concerns that were discussed on the "hyrax vs. RelationBuildPartitionDesc" thread earlier this year[1]? Thanks, Amit [1] https://www.postgresql.org/message-id/flat/3800.1560366716%40sss.pgh.pa.us#092b6b4f6bf75d2f3f90ef6a3b3eab5b
Re: Implementing Incremental View Maintenance
On Mon, 23 Dec 2019 03:41:18 -0700 (MST) legrand legrand wrote: > Hello, > regarding my initial post: > > > For each insert into a base table there are 3 statements: > > - ANALYZE pg_temp_3.pg_temp_81976 > > - WITH updt AS ( UPDATE public.mv1 AS mv SET __ivm_count__ = ... > > - DROP TABLE pg_temp_3.pg_temp_81976 > > For me there were 3 points to discuss: > - create/drop tables may bloat dictionary tables > - create/drop tables prevents "WITH updt ..." from being shared (with some > plan caching) > - generates many lines in pg_stat_statements > > In fact I like the idea of a table created per session, but I would even > prefer a common "table" shared between all sessions like GLOBAL TEMPORARY > TABLE (or something similar) as described here: > https://www.postgresql.org/message-id/flat/157703426606.1198.2452090605041230054.pgcf%40coridan.postgresql.org#331e8344bbae904350af161fb43a0aa6 Although I have not looked into this thread, this may help if it is implemented. However, it would still be necessary to truncate the table before the view maintenance, because such tables always exist and can be accessed and modified by any user. -- Yugo Nagata
Re: archive status ".ready" files may be created too early
At Sat, 21 Dec 2019 01:18:24 +, "Bossart, Nathan" wrote in > On 12/18/19, 8:34 AM, "Bossart, Nathan" wrote: > > On 12/17/19, 2:26 AM, "Kyotaro Horiguchi" wrote: > >> If so, we could amend also that case by marking the last segment as > >> .ready when XLogWrite writes the first bytes of the next segment. (As > >> the further corner-case, it still doesn't work if a continuation record > >> spans over three or more segments.. But I don't (or want not to) think > >> we don't need to consider that case..) > > > > I'm working on a new version of the patch that will actually look at > > the WAL page metadata to determine when it is safe to mark a segment > > as ready for archival. It seems relatively easy to figure out whether > > a page is the last one for the current WAL record. > > I stand corrected. My attempts to add logic to check the WAL records > added quite a bit more complexity than seemed reasonable to maintain > in this code path. For example, I didn't anticipate things like > XLOG_SWITCH records. > > I am still concerned about the corner case you noted, but I have yet > to find a practical way to handle it. You suggested waiting until > writing the first bytes of the next segment before marking a segment > as ready, but I'm not sure that fixes this problem either, and I > wonder if it could result in waiting arbitrarily long before creating > a ".ready" file in some cases. Perhaps I am misunderstanding your > suggestion. > > Another thing I noticed is that any changes in this area could impact > archive_timeout. If we reset the archive_timeout timer when we mark > the segments ready, we could force WAL switches more often. If we do > not move the timer logic, we could be resetting it before the file is > ready for the archiver. However, these differences might be subtle > enough to be okay. You're right. That doesn't seem to work. Another thing I had in mind was that XLogWrite could have an additional flag so that AdvanceXLInsertBuffer can tell it not to mark .ready. The function is called while it *is* writing a complete record. So even if AdvanceXLInsertBuffer inhibits marking .ready, the succeeding bytes come soon and we can mark the old segment as .ready at that time.

..
+ * If record_write == false, we don't mark the last segment as .ready
+ * if the caller requested to write up to segment boundary.
..
  static void
- XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
+ XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool record_write)

When XLogWrite is called with record_write = false, we don't mark .ready and don't advance lastSegSwitchTime/LSN. The next time XLogWrite is called with record_write=true, if lastSegSwitchLSN is behind the latest segment boundary before or equal to LogwrtResult.Write, mark the skipped segments as .ready and update lastSegSwitchTime/LSN. Does the above make sense? regards. -- Kyotaro Horiguchi NTT Open Source Software Center
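To make the catch-up step concrete, here is a minimal sketch of what the record_write=true path could do. This is only an illustration of the idea, not the patch: lastNotifiedSeg and XLogNotifySkippedSegments() are hypothetical names, while XLogArchiveNotifySeg() is the existing backend routine.

    /* Hypothetical bookkeeping in xlog.c: the last segment for which a
     * .ready file was created.  Only advanced when record_write == true. */
    static XLogSegNo lastNotifiedSeg = 0;

    /*
     * Hypothetical helper that XLogWrite() would call once it runs with
     * record_write == true and the caller has verified XLogArchivingActive():
     * mark every segment whose notification was deferred while a
     * continuation record was still being written.
     */
    static void
    XLogNotifySkippedSegments(XLogSegNo latest_boundary_seg)
    {
        while (lastNotifiedSeg < latest_boundary_seg)
        {
            lastNotifiedSeg++;
            XLogArchiveNotifySeg(lastNotifiedSeg);  /* existing routine */
        }
        /* lastSegSwitchTime/LSN would be advanced here as well */
    }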
Re: Condition variables vs interrupts
On Sat, Dec 21, 2019 at 2:10 PM Shawn Debnath wrote: > On Fri, Dec 20, 2019 at 12:05:34PM +1300, Thomas Munro wrote: > > I think we should probably just remove the unusual ResetLatch() call, > > rather than adding a CFI(). See attached. Thoughts? > > I agree: removing the ResetLatch() and having the wait event code deal > with it is a better way to go. I wonder why the reset was done in the > first place? Thanks for the review. I was preparing to commit this today, but I think I've spotted another latch protocol problem in the stable branches only, and I'd like to get some more eyeballs on it. I bet it's really hard to hit, but ConditionVariableSleep()'s return path (i.e. when the CV has been signalled) forgets that the latch is multiplexed and latches are merged: just because it was set by ConditionVariableSignal() doesn't mean it wasn't *also* set by die() or some other interrupt, and yet at the point we return, we've reset the latch and not run CFI(), and there's nothing to say the caller won't then immediately wait on the latch in some other code path, and sleep like a log despite the earlier delivery of (say) SIGTERM. If I'm right about that, I think the solution is to move that CFI down in the stable branches (which you already did for master in commit 1321509f). 0001-Don-t-reset-latch-in-ConditionVariablePrepare-master.patch Description: Binary data 0001-Don-t-reset-latch-in-ConditionVariableP-backbranches.patch Description: Binary data
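For anyone following along, the ordering in question can be pictured with a stripped-down wait loop. This is a sketch of the idea only, not the committed fix; cv_condition_satisfied() is a stand-in for the real "has this process been taken off the CV's wait list?" check inside ConditionVariableSleep().

    extern bool cv_condition_satisfied(void);   /* stand-in declaration */

    static void
    cv_sleep_sketch(uint32 wait_event_info)
    {
        for (;;)
        {
            /* The process latch is multiplexed: ConditionVariableSignal(),
             * die() and other interrupt handlers all set the same latch. */
            (void) WaitLatch(MyLatch, WL_LATCH_SET, -1, wait_event_info);
            ResetLatch(MyLatch);

            /*
             * Check for interrupts *after* resetting the latch, so that a
             * SIGTERM which also set the latch cannot be swallowed silently
             * before we get a chance to react to it.
             */
            CHECK_FOR_INTERRUPTS();

            if (cv_condition_satisfied())   /* stand-in for the real check */
                break;
        }
    }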
Re: Should we rename amapi.h and amapi.c?
On Mon, Dec 23, 2019 at 12:28:36PM -0800, Ashwin Agrawal wrote: > I had raised the same earlier and [1] has response from Andres, which was > "We probably should rename it, but not in 12..." > > [1] > https://www.postgresql.org/message-id/20190508215135.4eljnhnle5xp3jwb%40alap3.anarazel.de Okay, glad to see that this has been mentioned. So let's do some renaming for v13 then. I have studied first if we had better remove amapi.c, then move amvalidate() to amvalidate.c and the handler lookup routine to indexam.c as it already exists, but keeping things ordered as they are makes sense to limit spreading too much dependencies with the syscache mainly, so instead the attached patch does the following changes: - amapi.h -> indexam.h - amapi.c -> indexamapi.c. Here we have an equivalent in access/table/ as tableamapi.c. - amvalidate.c -> indexamvalidate.c - amvalidate.h -> indexamvalidate.h - genam.c -> indexgenam.c Please note that we have also amcmds.c and amcmds.c in the code, but the former could be extended to have utilities for table AMs, and the latter applies to both, so they are better left untouched in my opinion. -- Michael From e4f78403c585d6ead07f1f2c4477942920ef5b7e Mon Sep 17 00:00:00 2001 From: Michael Paquier Date: Tue, 24 Dec 2019 11:55:34 +0900 Subject: [PATCH] Rename files and headers related to index AM The following renaming is done, so as index AM source files are more consistent with table AMs, and so as no more too generic names are used: - amapi.h -> indexam.h - amapi.c -> indexamapi.c. Here we have an equivalent in access/table/ as tableamapi.c. - amvalidate.c -> indexamvalidate.c - amvalidate.h -> indexamvalidate.h - genam.c -> indexgenam.c --- contrib/bloom/blinsert.c | 2 +- contrib/bloom/bloom.h | 2 +- contrib/bloom/blutils.c| 2 +- contrib/bloom/blvacuum.c | 2 +- contrib/bloom/blvalidate.c | 2 +- contrib/sepgsql/database.c | 2 +- contrib/sepgsql/label.c| 2 +- contrib/sepgsql/proc.c | 2 +- contrib/sepgsql/relation.c | 2 +- contrib/sepgsql/schema.c | 2 +- src/backend/access/brin/brin_inclusion.c | 2 +- src/backend/access/brin/brin_minmax.c | 2 +- src/backend/access/brin/brin_validate.c| 2 +- src/backend/access/common/detoast.c| 2 +- src/backend/access/common/toast_internals.c| 2 +- src/backend/access/gin/ginvalidate.c | 2 +- src/backend/access/gist/gistbuild.c| 2 +- src/backend/access/gist/gistbuildbuffers.c | 2 +- src/backend/access/gist/gistget.c | 2 +- src/backend/access/gist/gistvacuum.c | 2 +- src/backend/access/gist/gistvalidate.c | 2 +- src/backend/access/hash/hashvalidate.c | 2 +- src/backend/access/heap/heapam.c | 2 +- src/backend/access/heap/heapam_handler.c | 2 +- src/backend/access/heap/vacuumlazy.c | 2 +- src/backend/access/index/Makefile | 8 src/backend/access/index/indexam.c | 2 +- src/backend/access/index/{amapi.c => indexamapi.c} | 6 +++--- .../index/{amvalidate.c => indexamvalidate.c} | 6 +++--- src/backend/access/index/{genam.c => indexgenam.c} | 8 src/backend/access/nbtree/nbtvalidate.c| 2 +- src/backend/access/spgist/spgdoinsert.c| 2 +- src/backend/access/spgist/spginsert.c | 2 +- src/backend/access/spgist/spgscan.c| 2 +- src/backend/access/spgist/spgutils.c | 2 +- src/backend/access/spgist/spgvacuum.c | 2 +- src/backend/access/spgist/spgvalidate.c| 2 +- src/backend/bootstrap/bootstrap.c | 2 +- src/backend/catalog/aclchk.c | 2 +- src/backend/catalog/catalog.c | 2 +- src/backend/catalog/dependency.c | 2 +- src/backend/catalog/heap.c | 2 +- src/backend/catalog/index.c| 2 +- src/backend/catalog/indexing.c | 2 +- src/backend/catalog/objectaddress.c| 2 
+- src/backend/catalog/partition.c| 2 +- src/backend/catalog/pg_collation.c | 2 +- src/backend/catalog/pg_constraint.c| 2 +- src/backend/catalog/pg_db_role_setting.c | 2 +- src/backend/catalog/pg_depend.c| 2 +- src/backend/catalog/pg_enum.c | 2 +- src/backend/catalog/pg_inherits.c | 2 +- src/backend/catalog/pg_largeobject.c | 2 +- src/backend/catalog/pg_publication.c | 2 +- src/backend/catalog/pg_range.c
Re: Increase footprint of %m and reduce strerror()
On Fri, Dec 06, 2019 at 02:09:05PM +0900, Michael Paquier wrote: > I guess that we should do that at the end of the day. A lookup at the > in-core tools I see three areas which stand out compared to the rest: > - pg_waldump, and attached is a patch for it. Okay, I have committed this one. -- Michael signature.asc Description: PGP signature
Re: Online checksums verification in the backend
On Fri, Dec 6, 2019 at 11:51 PM Julien Rouhaud wrote: > > Hi, > > This topic was discussed several times, with the most recent > discussions found at [1] and [2]. Based on those discussions, my > understanding is that the current approach in BASE_BACKUP has too many > drawbacks and we should instead do this check in the backend. I've > been working using such approach at VMware, and I'm submitting it here > to discuss the approach and rationales, and hopefully have such a > feature integrated. Thank you for working on this! > > First, this was originally developed as an extension. It means that > the check is performed using an SRF. That's maybe not the best > approach, as a transaction has to be kept for the total processing time. > It can be leveraged by checking each relation independently, but > that's still not ideal. Maybe using some utility commands (as part of > VACUUM or a new CHECK command for instance) would be a better > approach. > > This brings the second consideration: how to report the list of corrupted > blocks to end users. As I said this is for now returned via the SRF, > but this is clearly not ideal and should rather be made available more > globally. One usage of this information could be block level > recovery. I'm Cc-ing Sawada-san, as I know he's working on this and > mentioned to me that he had ideas on passing the list of corrupted blocks > using the stat collector. Yes, the list of corrupted pages is necessary for single page recovery. Apart from single page recovery, I think it's helpful for DBAs if they can find the corrupted blocks in the server logs and in a system view. I've also tried to report corrupted pages to the stats collector while researching single page recovery in PostgreSQL, but one problem is that the statistics in the stats collector are cleared during crash recovery. I want the information about block corruption to survive even when the server goes down. And we might want to add checksums to the permanent file holding the information about database corruption. The correctness of this information is important because we can fix a database by restoring some tables from a logical backup or by reindexing etc. as long as we have non-broken information about the database corruption. > > Finally, the read and locking considerations. I tried to cover that > extensively in the comments, but here are some details on how I tried > to make the check safe while trying to keep the overhead as low as > possible. First thing is that this is only doing buffered reads, > without any attempt to discard OS cache. Therefore, any discrepancy > between the OS cache and the disk cannot be detected unless you do > other actions, such as sync / drop_caches on GNU/Linux. > > An access share lock on the currently checked relation is held, > meaning that it can't get deleted/truncated. The total number of > blocks for the given fork is retrieved first, so any new block will be > ignored. Such new blocks are considered out of scope as being written > after the start of the check. > > Each time a buffer is being checked, the target buffer mapping > partition lock is acquired in shared mode, to prevent concurrent > eviction. If the buffer is found in shared buffers, it's pinned and > released immediately, just to get the state. I wonder if there is a possibility that blocks on disk are corrupted even though they are loaded in shared buffers. ISTM the above method cannot detect such corruption.
Reading and checking blocks fast is attractive, but I thought it's also important to check blocks precisely, without overlooking any corruption. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Block level parallel vacuum
On Mon, Dec 23, 2019 at 11:02 PM Mahendra Singh wrote: > > 5. I am not sure that I am right but I can see that we are not consistent > while ending the single line comments. > > I think, if single line comment is started with "upper case letter", then we > should not put period(dot) at the end of comment, but if comment started with > "lower case letter", then we should put period(dot) at the end of comment. > > a) > + /* parallel vacuum must be active */ > I think. we should end above comment with dot or we should make "p" of > parallel as upper case letter. > > b) > + /* At least count itself */ > I think, above is correct. > I have checked a few files in this context and I don't see any consistency, so I would suggest keeping the things matching with the nearby code. Do you have any reason for the above conclusion? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: error context for vacuum to include block number
On Mon, Dec 23, 2019 at 07:24:28PM -0600, Justin Pryzby wrote: > I renamed. Hmm. I have found what was partially itching me for patch 0002, and that's actually the fact that we don't do the error reporting for heap within lazy_vacuum_heap() because the code relies too much on updating two progress parameters at the same time, on top of the fact that you are mixing multiple concepts with this refactoring. One problem is that if this code is refactored in the future, future callers of lazy_vacuum_heap() would miss the update of the progress reporting. Splitting things improves also the readability of the code, so attached is the refactoring I would do for this portion of the set. It is also more natural to increment num_index_scans when the reporting happens on consistency grounds. (Please note that I have not indented yet the patch.) -- Michael diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c index ab09d8408c..597d8b5f92 100644 --- a/src/backend/access/heap/vacuumlazy.c +++ b/src/backend/access/heap/vacuumlazy.c @@ -158,6 +158,9 @@ static bool lazy_check_needs_freeze(Buffer buf, bool *hastup); static void lazy_vacuum_index(Relation indrel, IndexBulkDeleteResult **stats, LVRelStats *vacrelstats); +static void lazy_vacuum_all_indexes(Relation onerel, LVRelStats *vacrelstats, + Relation *Irel, int nindexes, + IndexBulkDeleteResult **indstats); static void lazy_cleanup_index(Relation indrel, IndexBulkDeleteResult *stats, LVRelStats *vacrelstats); @@ -740,12 +743,6 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats, if ((vacrelstats->max_dead_tuples - vacrelstats->num_dead_tuples) < MaxHeapTuplesPerPage && vacrelstats->num_dead_tuples > 0) { - const int hvp_index[] = { -PROGRESS_VACUUM_PHASE, -PROGRESS_VACUUM_NUM_INDEX_VACUUMS - }; - int64 hvp_val[2]; - /* * Before beginning index vacuuming, we release any pin we may * hold on the visibility map page. This isn't necessary for @@ -758,28 +755,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats, vmbuffer = InvalidBuffer; } - /* Log cleanup info before we touch indexes */ - vacuum_log_cleanup_info(onerel, vacrelstats); - - /* Report that we are now vacuuming indexes */ - pgstat_progress_update_param(PROGRESS_VACUUM_PHASE, - PROGRESS_VACUUM_PHASE_VACUUM_INDEX); - - /* Remove index entries */ - for (i = 0; i < nindexes; i++) -lazy_vacuum_index(Irel[i], - &indstats[i], - vacrelstats); - - /* - * Report that we are now vacuuming the heap. We also increase - * the number of index scans here; note that by using - * pgstat_progress_update_multi_param we can update both - * parameters atomically. - */ - hvp_val[0] = PROGRESS_VACUUM_PHASE_VACUUM_HEAP; - hvp_val[1] = vacrelstats->num_index_scans + 1; - pgstat_progress_update_multi_param(2, hvp_index, hvp_val); + /* Work on all the indexes, then the heap */ + lazy_vacuum_all_indexes(onerel, vacrelstats, Irel, + nindexes, indstats); /* Remove tuples from heap */ lazy_vacuum_heap(onerel, vacrelstats); @@ -790,7 +768,6 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats, * valid. */ vacrelstats->num_dead_tuples = 0; - vacrelstats->num_index_scans++; /* * Vacuum the Free Space Map to make newly-freed space visible on @@ -1420,33 +1397,12 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats, /* XXX put a threshold on min number of tuples here? 
*/ if (vacrelstats->num_dead_tuples > 0) { - const int hvp_index[] = { - PROGRESS_VACUUM_PHASE, - PROGRESS_VACUUM_NUM_INDEX_VACUUMS - }; - int64 hvp_val[2]; - - /* Log cleanup info before we touch indexes */ - vacuum_log_cleanup_info(onerel, vacrelstats); - - /* Report that we are now vacuuming indexes */ - pgstat_progress_update_param(PROGRESS_VACUUM_PHASE, - PROGRESS_VACUUM_PHASE_VACUUM_INDEX); - - /* Remove index entries */ - for (i = 0; i < nindexes; i++) - lazy_vacuum_index(Irel[i], - &indstats[i], - vacrelstats); - - /* Report that we are now vacuuming the heap */ - hvp_val[0] = PROGRESS_VACUUM_PHASE_VACUUM_HEAP; - hvp_val[1] = vacrelstats->num_index_scans + 1; - pgstat_progress_update_multi_param(2, hvp_index, hvp_val); + /* Work on all the indexes, and then the heap */ + lazy_vacuum_all_indexes(onerel, vacrelstats, Irel, nindexes, +indstats); /* Remove tuples from heap */ lazy_vacuum_heap(onerel, vacrelstats); - vacrelstats->num_index_scans++; } /* @@ -1508,6 +1464,36 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats, pfree(buf.data); } +/* + * lazy_vacuum_all_indexes() -- vacuum all indexes of relation. + * + * This is a utility wrapper for lazy_vacuum_index(), able to do + * progress rep
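The attachment is cut off above. For readability, the body of the new wrapper, reassembled from the code removed in the two hunks, would look roughly like this (a sketch for illustration, not the exact patch text):

    static void
    lazy_vacuum_all_indexes(Relation onerel, LVRelStats *vacrelstats,
                            Relation *Irel, int nindexes,
                            IndexBulkDeleteResult **indstats)
    {
        int         i;
        const int   hvp_index[] = {
            PROGRESS_VACUUM_PHASE,
            PROGRESS_VACUUM_NUM_INDEX_VACUUMS
        };
        int64       hvp_val[2];

        /* Log cleanup info before we touch indexes */
        vacuum_log_cleanup_info(onerel, vacrelstats);

        /* Report that we are now vacuuming indexes */
        pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
                                     PROGRESS_VACUUM_PHASE_VACUUM_INDEX);

        /* Remove index entries */
        for (i = 0; i < nindexes; i++)
            lazy_vacuum_index(Irel[i], &indstats[i], vacrelstats);

        /*
         * Report that we are now vacuuming the heap and increase the number
         * of index scans here; pgstat_progress_update_multi_param() lets us
         * update both parameters atomically.
         */
        hvp_val[0] = PROGRESS_VACUUM_PHASE_VACUUM_HEAP;
        hvp_val[1] = vacrelstats->num_index_scans + 1;
        pgstat_progress_update_multi_param(2, hvp_index, hvp_val);

        vacrelstats->num_index_scans++;
    }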
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Sun, Dec 22, 2019 at 5:04 PM vignesh C wrote: > > Few comments: > assert variable should be within #ifdef USE_ASSERT_CHECKING in patch > v2-0008-Add-support-for-streaming-to-built-in-replication.patch: > + int64 subidx; > + boolfound = false; > + charpath[MAXPGPATH]; > + > + subidx = -1; > + subxact_info_read(MyLogicalRepWorker->subid, xid); > + > + /* FIXME optimize the search by bsearch on sorted data */ > + for (i = nsubxacts; i > 0; i--) > + { > + if (subxacts[i - 1].xid == subxid) > + { > + subidx = (i - 1); > + found = true; > + break; > + } > + } > + > + /* We should not receive aborts for unknown subtransactions. > */ > + Assert(found); > We can use PG_USED_FOR_ASSERTS_ONLY for that variable. > > Should we include printing of id here like in earlier cases in > v2-0002-Issue-individual-invalidations-with-wal_level-log.patch: > + appendStringInfo(buf, " relcache %u", msg->rc.relId); > + /* not expected, but print something anyway */ > + else if (msg->id == SHAREDINVALSMGR_ID) > + appendStringInfoString(buf, " smgr"); > + /* not expected, but print something anyway */ > + else if (msg->id == SHAREDINVALRELMAP_ID) > + appendStringInfo(buf, " relmap db %u", msg->rm.dbId); > I am not sure if this patch is logging these invalidations, so not sure if it makes sense to add more ids in the cases you are referring to. However, if we change it to logging all invalidations at command end as being discussed in this thread, then it might be better to do what you are suggesting. > > Should we can add function header for AssertChangeLsnOrder in > v2-0006-Implement-streaming-mode-in-ReorderBuffer.patch: > +static void > +AssertChangeLsnOrder(ReorderBuffer *rb, ReorderBufferTXN *txn) > +{ > > This "Assert(txn->first_lsn != InvalidXLogRecPtr)"can be before the > loop, can be checked only once: > + dlist_foreach(iter, &txn->changes) > + { > + ReorderBufferChange *cur_change; > + > + cur_change = dlist_container(ReorderBufferChange, > node, iter.cur); > + > + Assert(txn->first_lsn != InvalidXLogRecPtr); > + Assert(cur_change->lsn != InvalidXLogRecPtr); > + Assert(txn->first_lsn <= cur_change->lsn); > This makes sense to me. Another thing about this function, do we really need "ReorderBuffer *rb" parameter in this function? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
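To illustrate the two review points about this function (hoisting the first_lsn assertion out of the loop, and dropping the rb parameter if it is unused), the assertion helper could end up looking something like the sketch below. This is a sketch under those assumptions, not the actual patch.

    static void
    AssertChangeLsnOrder(ReorderBufferTXN *txn)
    {
    #ifdef USE_ASSERT_CHECKING
        dlist_iter  iter;
        XLogRecPtr  prev_lsn = txn->first_lsn;

        /* Checked once, outside the loop */
        Assert(txn->first_lsn != InvalidXLogRecPtr);

        dlist_foreach(iter, &txn->changes)
        {
            ReorderBufferChange *cur_change;

            cur_change = dlist_container(ReorderBufferChange, node, iter.cur);

            Assert(cur_change->lsn != InvalidXLogRecPtr);
            Assert(prev_lsn <= cur_change->lsn);
            prev_lsn = cur_change->lsn;
        }
    #endif
    }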
Re: Bogus logic in RelationBuildPartitionDesc
On Sat, Dec 21, 2019 at 10:28 AM Tom Lane wrote: > I just had to retrieve my jaw from the floor after reading this > bit in RelationBuildPartitionDesc: > > * The system cache may be out of date; if so, we may find no pg_class > * tuple or an old one where relpartbound is NULL. In that case, try > * the table directly. We can't just AcceptInvalidationMessages() and > * retry the system cache lookup because it's possible that a > * concurrent ATTACH PARTITION operation has removed itself to the > * ProcArray but yet added invalidation messages to the shared queue; > * InvalidateSystemCaches() would work, but seems excessive. > > As far as I can see, this argument is wrong on every count, and if it > were right, the code it is defending would still be wrong. > > In the first place, it's claiming, based on no evidence, that our whole > approach to syscaches is wrong. If this code needs to deal with obsolete > syscache entries then so do probably thousands of other places. No, because ATTACH PARTITION uses only ShareUpdateExclusiveLock, unlike most other DDL. > In the second place, the argument that AcceptInvalidationMessages wouldn't > work is BS, because the way that that *actually* works is that transaction > commit updates clog and sends inval messages before it releases locks. > So if you've acquired enough of a lock to be sure that the data you want > to read is stable, then you do not need to worry about whether you've > received any relevant inval messages. You have --- and you don't even > need to call AcceptInvalidationMessages for yourself, because lock > acquisition already did, see e.g. LockRelationOid. No, because ATTACH PARTITION uses only ShareUpdateExclusiveLock, which does not conflict with the AccessShareLock required to build a cache entry. > In the third place, if this imaginary risk that the syscache was out of > date were real, the code would be completely failing to deal with it, > because all it is testing is whether it found a null relpartbound value. > That wouldn't handle the case where a non-null relpartbound is obsolete, > which is what you'd expect after ATTACH PARTITION. No, that's what you'd expect after DETACH PARTITION, but that takes AccessExclusiveLock, so the problem doesn't occur. > Furthermore, if all of the above can be rebutted, then what's the argument > that reading pg_class directly will produce a better answer? The only way > that any of this could be useful is if you're trying to read data that is > changing under you because you didn't take an adequate lock. In that case > there's no guarantee that what you will read from pg_class is up-to-date > either. The argument is that the only possible change is the concurrent addition of a partition, and therefore the only thing that happen is to go from NULL to non-NULL. It can't go from non-NULL to NULL, nor from one non-NULL value to another. > Unsurprisingly, the code coverage report shows that this code path is > never taken. I think we could dike out partdesc.c lines 113-151 > altogether, and make the code just above there look more like every > other syscache access in the backend. It turns out that I tested this, and that if you do that, it's possible to produce failures. It's very hard to do so in the context of a regression test because they are low-probability, but they do happen. I believe some of the testing details are in the original thread that's probably linked from the commit message that added those lines. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: backup manifests
On Sun, Dec 22, 2019 at 8:32 PM Rushabh Lathia wrote: > Agree, that performance won't be a problem, but that will be bit confusing > to the user. As at the start user providing the manifest-checksum (assume > that user-provided CRC32C) and at the end, user will find the SHA256 > checksum string in the backup_manifest file. I don't think that's particularly confusing. The documentation should say that this is the algorithm to be used for checksumming the files which are backed up. The algorithm to be used for the manifest itself is another matter. To me, it seems far MORE confusing if the algorithm used for the manifest itself is magically inferred from the algorithm used for one of the File lines therein. > Does this also means that irrespective of whether user provided a checksum > option or not, we will be always generating the checksum for the > backup_manifest file? Yes, that is what I am proposing. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [PATCH] Increase the maximum value track_activity_query_size
On Sat, Dec 21, 2019 at 1:25 PM Tom Lane wrote: > > What is the overhead here except the memory consumption? > > The time to copy those strings out of shared storage, any time > you query pg_stat_activity. It seems like you're masterminding this, and I don't know why. It seems unlikely that anyone will raise the value unless they have very long queries, and if those people would rather pay the overhead of copying more data than have their queries truncated, who are we to argue? If increasing the maximum imposed some noticeable cost on installations that kept the default setting, that might well be a good argument for not raising the maximum. But I don't think that's the case. I also suspect that the overhead would be pretty darn small even for people who *do* raise the default setting. It looks to me like both read and write operations on st_activity_raw stop when they hit a NUL byte, so any performance costs on short queries must come from second-order effects (e.g. the main shared memory segment is bigger, so the OS cache is smaller) which are likely irrelevant in practice. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
On Fri, Dec 20, 2019 at 9:31 AM Amit Khandekar wrote: > Attached are the patches from master back up to 94 branch. > > PG 9.4 and 9.5 have a common patch to be applied : > pg94_95_use_vfd_for_logrep.patch > From PG 9.6 onwards, each version has a separate patch. > > For PG 9.6, there is no logical decoding perl test file. So I have > made a new file 006_logical_decoding_spill.pl that has only the > specific testcase. Also, for building the test_decoding.so, I had to > add the EXTRA_INSTALL=contrib/test_decoding line in the > src/test/recovery/Makefile, because this is the first time we are > using the plugin in the 9.6 tap test. > I am not sure if we need to go that far for 9.6 branch. If the other tests for logical decoding are not present, then I don't see why we need to create a new test file for this test only. Also, I think this will make the patch the same for 9.4,9.5 and 9.6. > From PG 10 onwards, pgstat_report_*() calls around read() are removed > in the patch, because FileRead() itself reports the wait events. > Why there are different patches for 10 and 11? We should try to minimize the difference between patches in different branches. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Thu, Dec 12, 2019 at 3:41 AM Amit Kapila wrote: > I don't think we have evaluated it yet, but we should do it. The > point to note is that it is only for the case when wal_level is > 'logical' (see IsSubTransactionAssignmentPending) in which case we > already log more WAL, so this might not impact much. I guess that it > might be better to have that check in XLogRecordAssemble for the sake > of clarity. I don't think that this is really a valid argument. Just because we have some overhead now doesn't mean that adding more won't hurt. Even testing the wal_level costs a little something. > I think the way invalidations work for logical replication is that > normally, we always start a new transaction before decoding each > commit which allows us to accept the invalidations (via > AtStart_Cache). However, if there are catalog changes within the > transaction being decoded, we need to reflect those before trying to > decode the WAL of operation which happened after that catalog change. > As we are not logging the WAL for each invalidation, we need to > execute all the invalidation messages for this transaction at each > catalog change. We are able to do that now as we decode the entire WAL > for a transaction only once we get the commit's WAL which contains all > the invalidation messages. So, we queue them up and execute them for > each catalog change which we identify by WAL record > XLOG_HEAP2_NEW_CID. Thanks for the explanation. That makes sense. But, it's still true, AFAICS, that instead of doing this stuff with logging invalidations you could just InvalidateSystemCaches() in the cases where you are currently applying all of the transaction's invalidations. That approach might be worse than changing the way invalidations are logged, but the two approaches deserve to be compared. One approach has more CPU overhead and the other has more WAL overhead, so it's a little hard to compare them, but it seems worth mulling over. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
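A minimal sketch of the comparison being made here, purely for illustration (ApplyQueuedInvalidations() is a stand-in name for whatever the patch currently does at a catalog change; InvalidateSystemCaches() is the existing routine in inval.c):

    /* Option A: what the patch does today at each catalog change while
     * decoding a transaction (stand-in name, more WAL if invalidations
     * are logged individually) */
    ApplyQueuedInvalidations(txn);

    /* Option B: the alternative being suggested, which needs no extra WAL
     * or queued messages, at the price of rebuilding all cache entries
     * afterwards */
    InvalidateSystemCaches();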
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Fri, 20 Dec 2019 at 22:30, Amit Kapila wrote: > > On Fri, Dec 20, 2019 at 11:47 AM Masahiko Sawada > wrote: > > > > On Mon, 2 Dec 2019 at 17:32, Dilip Kumar wrote: > > > > > > On Sun, Dec 1, 2019 at 7:58 AM Michael Paquier > > > wrote: > > > > > > > > On Fri, Nov 22, 2019 at 01:18:11PM +0530, Dilip Kumar wrote: > > > > > I have rebased the patch on the latest head and also fix the issue of > > > > > "concurrent abort handling of the (sub)transaction." and attached as > > > > > (v1-0013-Extend-handling-of-concurrent-aborts-for-streamin) along with > > > > > the complete patch set. I have added the version number so that we > > > > > can track the changes. > > > > > > > > The patch has rotten a bit and does not apply anymore. Could you > > > > please send a rebased version? I have moved it to next CF, waiting on > > > > author. > > > > > > I have rebased the patch set on the latest head. > > > > Thank you for working on this. > > > > This might have already been discussed but I have a question about the > > changes of logical replication worker. In the current logical > > replication there is a problem that the response time are doubled when > > using synchronous replication because wal senders send changes after > > commit. It's worse especially when a transaction makes a lot of > > changes. So I expected this feature to reduce the response time by > > sending changes even while the transaction is progressing but it > > doesn't seem to be. The logical replication worker writes changes to > > temporary files and applies these changes when the worker received > > commit record (STREAM COMMIT). Since the worker sends the LSN of > > commit record as flush LSN to the publisher after applying all > > changes, the publisher must wait for all changes are applied to the > > subscriber. > > > > The main aim of this feature is to reduce apply lag. Because if we > send all the changes together it can delay there apply because of > network delay, whereas if most of the changes are already sent, then > we will save the effort on sending the entire data at commit time. > This in itself gives us decent benefits. Sure, we can further improve > it by having separate workers (dedicated to apply the changes) as you > are suggesting and in fact, there is a patch for that as well(see the > performance results and bgworker patch at [1]), but if try to shove in > all the things in one go, then it will be difficult to get this patch > committed (there are already enough things and the patch is quite big > that to get it right takes a lot of energy). So, the plan is > something like that first we get the basic feature and then try to > improve by having dedicated workers or things like that. Does this > make sense to you? > Thank you for explanation. The plan makes sense. But I think in the current design it's a problem that logical replication worker doesn't receive changes (and doesn't check interrupts) during applying committed changes even if we don't have a worker dedicated for applying. I think the worker should continue to receive changes and save them to temporary files even during applying changes. Otherwise the buffer would be easily full and replication gets stuck. Regards, -- Masahiko Sawadahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Proposal: Add more compile-time asserts to expose inconsistencies.
On Fri, Dec 20, 2019 at 01:08:47AM +, Smith, Peter wrote: > I updated the most recent patch (_5 from Michael) so it now has your > suggested error message rewording. Hmm. Based on the last messages, and this one in particular: https://www.postgresql.org/message-id/20191202155545.yzbfzuppjriti...@alap3.anarazel.de I am still seeing that a couple of points need an extra look: - Addition of a Decl() in at least one header of the core code. - Perhaps unifying the fallback implementation between C and C++, with a closer look in particular at StaticAssertStmt() and StaticAssertExpr(). Peter, could you look at that? In order to test the C++ portion, you could use this dummy C++ extension I have just published on my github account: https://github.com/michaelpq/pg_plugins/tree/master/blackhole_cplusplus That's a dirty 15-min hack, but it would be enough to check the C++ part with PGXS. For now, I am switching the patch to waiting on author. -- Michael signature.asc Description: PGP signature
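For reference, a declaration-scope assert of the kind being asked for can be written so that it compiles as both C and C++ even without C11/C++11 support, by forcing a negative array bound. This is a sketch of one possible fallback, not necessarily what the final patch will use:

    /* One possible fallback when neither _Static_assert nor static_assert
     * is available; valid at file scope in both C and C++.  The error
     * message argument is unused by this fallback. */
    #define StaticAssertDecl(condition, errmessage) \
        extern void static_assert_func(int static_assert_failure[(condition) ? 1 : -1])

    /* Example use in a core header, per the first point above: */
    StaticAssertDecl(sizeof(int64) == 8, "int64 must be 8 bytes");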
Re: [HACKERS] Block level parallel vacuum
On Mon, 23 Dec 2019 at 19:41, Amit Kapila wrote: > > On Fri, Dec 20, 2019 at 12:13 PM Masahiko Sawada > wrote: > > > > I've attached the updated version patch that incorporated the all > > review comments I go so far. > > > > I have further edited the first two patches posted by you. The > changes include (a) changed tests to reset the guc, (b) removing some > stuff which is not required in this version, (c) moving some variables > around to make them in better order, (d) changed comments and few > other cosmetic things and (e) commit messages for first two patches. > > I think the first two patches attached in this email are in good shape > and we can commit those unless you or someone has more comments on > them, the main parallel vacuum patch can still be improved by some > more test/polish/review. I am planning to push the first two patches > next week after another pass. The first two patches are explained in > brief as below: > > 1. v4-0001-Delete-empty-pages-in-each-pass-during-GIST-VACUUM: It > allows us to delete empty pages in each pass during GIST VACUUM. > Earlier, we use to postpone deleting empty pages till the second stage > of vacuum to amortize the cost of scanning internal pages. However, > that can sometimes (say vacuum is canceled or errored between first > and second stage) delay the pages to be recycled. Another thing is > that to facilitate deleting empty pages in the second stage, we need > to share the information of internal and empty pages between different > stages of vacuum. It will be quite tricky to share this information > via DSM which is required for the main parallel vacuum patch. Also, > it will bring the logic to reclaim deleted pages closer to nbtree > where we delete empty pages in each pass. Overall, the advantages of > deleting empty pages in each pass outweigh the advantages of > postponing the same. This patch is discussed in detail in a separate > thread [1]. > > 2. v39-0001-Introduce-IndexAM-fields-for-parallel-vacuum.patch: > Introduce new fields amusemaintenanceworkmem and > amparallelvacuumoptions in IndexAmRoutine for parallel vacuum. The > amusemaintenanceworkmem tells whether a particular IndexAM uses > maintenance_work_mem or not. This will help in controlling the memory > used by individual workers as otherwise, each worker can consume > memory equal to maintenance_work_mem. This has been discussed in > detail in a separate thread as well [2]. The amparallelvacuumoptions > tell whether a particular IndexAM participates in a parallel vacuum > and if so in which phase (bulkdelete, vacuumcleanup) of vacuum. > > Thank you for updating the patches! The first patches look good to me. I'm reviewing other patches and will post comments if there is. Regards, -- Masahiko Sawadahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
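As an illustration of how the two new IndexAmRoutine fields would be used, an index AM's handler simply declares its capabilities, along the lines of the sketch below. The option flag names follow the patch's terminology at this point in the thread and may differ in the final version.

    Datum
    bthandler(PG_FUNCTION_ARGS)
    {
        IndexAmRoutine *amroutine = makeNode(IndexAmRoutine);

        /* ... all existing fields filled in as today ... */

        /* btree's bulkdelete/cleanup don't need maintenance_work_mem */
        amroutine->amusemaintenanceworkmem = false;

        /* participate in parallel bulkdelete, and in cleanup only when no
         * bulkdelete has been performed (conditional cleanup) */
        amroutine->amparallelvacuumoptions =
            VACUUM_OPTION_PARALLEL_BULKDEL | VACUUM_OPTION_PARALLEL_COND_CLEANUP;

        PG_RETURN_POINTER(amroutine);
    }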
Re: [HACKERS] Block level parallel vacuum
On Tue, Dec 24, 2019 at 12:08 PM Masahiko Sawada wrote: > > > The first patches look good to me. I'm reviewing other patches and > will post comments if there is. > Okay, feel free to address few comments raised by Mahendra along with whatever you find. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Block level parallel vacuum
On Tue, 24 Dec 2019 at 15:44, Amit Kapila wrote: > > On Tue, Dec 24, 2019 at 12:08 PM Masahiko Sawada > wrote: > > > > > > The first patches look good to me. I'm reviewing other patches and > > will post comments if there is. > > Oops I meant first "two" patches look good to me. > > Okay, feel free to address few comments raised by Mahendra along with > whatever you find. Thanks! Regards, -- Masahiko Sawadahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
RE: Implementing Incremental View Maintenance
From: Tatsuo Ishii > First of all, we do not think that current approach is the final > one. Instead we want to implement IVM feature one by one: i.e. we > start with "immediate update" approach, because it's simple and easier > to implement. Then we will add "deferred update" mode later on. I agree about incremental feature introduction. What I'm simply asking is the concrete use case (workload and data), so that I can convince myself to believe that this feature is useful and focus on reviewing and testing (because the patch seems big and difficult...) > In fact Oracle has both "immediate update" and "deferred update" mode > of IVM (actually there are more "mode" with their implementation). > > I recommend you to look into Oracle's materialized view feature > closely. For fair evaluation, probably we should compare the IVM patch > with Oracle's "immediate update" (they call it "on statement") mode. > > Probably deferred IVM mode is more suitable for DWH. However as I said > earlier, we hope to implement the immediate mode first then add the > deferred mode. Let's start with simple one then add more features. Yes, I know Oracle's ON STATEMENT refresh mode (I attached references at the end for others.) Unfortunately, it's not clear to me which of ON STATEMENT or ON COMMIT the user should choose. The benefit of ON STATEMENT is that the user does not have to create and maintain the materialized view log. But I'm not sure if and when the benefit defeats the performance overhead on DML statements. It's not disclosed whether ON STATEMENT uses triggers. Could you give your opinion on the following to better understand the proposed feature and/or Oracle's ON STATEMENT refresh mode? * What use case does the feature fit? If the trigger makes it difficult to use in the data ware house, does the feature target OLTP? What kind of data and query would benefit most from the feature (e.g. join of a large sales table and a small product table, where the data volume and frequency of data loading is ...)? In other words, this is about what kind of example we can recommend as a typical use case of this feature. * Do you think the benefit of ON STATEMENT (i.e. do not have to use materialized view log) outweighs the drawback of ON STATEMENT (i.g. DML overhead)? * Do you think it's important to refresh the materialized view after every statement, or the per-statement refresh is not a requirement but simply the result of implementation? [References] https://docs.oracle.com/en/database/oracle/oracle-database/19/dwhsg/refreshing-materialized-views.html#GUID-C40C225A-8328-44D5-AE90-9078C2C773EA -- 7.1.5 About ON COMMIT Refresh for Materialized Views A materialized view can be refreshed automatically using the ON COMMIT method. Therefore, whenever a transaction commits which has updated the tables on which a materialized view is defined, those changes are automatically reflected in the materialized view. The advantage of using this approach is you never have to remember to refresh the materialized view. The only disadvantage is the time required to complete the commit will be slightly longer because of the extra processing involved. However, in a data warehouse, this should not be an issue because there is unlikely to be concurrent processes trying to update the same table. 7.1.6 About ON STATEMENT Refresh for Materialized Views A materialized view that uses the ON STATEMENT refresh mode is automatically refreshed every time a DML operation is performed on any of the materialized view’s base tables. 
With the ON STATEMENT refresh mode, any changes to the base tables are immediately reflected in the materialized view. There is no need to commit the transaction or maintain materialized view logs on the base tables. If the DML statements are subsequently rolled back, then the corresponding changes made to the materialized view are also rolled back. The advantage of the ON STATEMENT refresh mode is that the materialized view is always synchronized with the data in the base tables, without the overhead of maintaining materialized view logs. However, this mode may increase the time taken to perform a DML operation because the materialized view is being refreshed as part of the DML operation. -- https://docs.oracle.com/en/database/oracle/oracle-database/19/dwhsg/release-changes.html#GUID-2A2D6E3B-A3FD-47A8-82A3-1EF95AEF5993 -- ON STATEMENT refresh mode for materialized views The ON STATEMENT refresh mode refreshes materialized views every time a DML operation is performed on any base table, without the need to commit the transaction. This mode does not require you to maintain materialized view logs on the base tables. -- http://www.oracle.com/us/solutions/sap/matview-refresh-db1
RE: Implementing Incremental View Maintenance
From: Yugo Nagata > On Mon, 23 Dec 2019 08:08:53 + > "tsunakawa.ta...@fujitsu.com" wrote: > > How about unlogged tables ? I thought the point of using a temp table is to > avoid WAL overhead. > > Hmm... this might be another option. However, if we use unlogged tables, > we will need to create them in a special schema similar to pg_toast > to split this from user tables. Otherwise, we need to create and drop > unlogged tables repeatedly for each session. Maybe we can create the work tables in the same schema as the materialized view, along the following lines: * Prefix the table name to indicate that the table is system-managed, thus hinting to the user that manually deleting the table would break something. This is like the system attribute __ivm_count__ you are proposing. * Describe the above in the manual. Columns of serial and bigserial data type similarly create sequences behind the scenes. * Make the work tables depend on the materialized view by recording the dependency in pg_depend, so that dropping the materialized view will also drop its work tables (see the sketch below). Regards Takayuki Tsunakawa
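A rough sketch of the pg_depend part of the third point above, for illustration only; the variable names are made up, and whether DEPENDENCY_INTERNAL or DEPENDENCY_AUTO is the right dependency type is an open question:

    ObjectAddress worktable;
    ObjectAddress matview;

    ObjectAddressSet(worktable, RelationRelationId, worktableOid);
    ObjectAddressSet(matview, RelationRelationId, matviewOid);

    /*
     * With DEPENDENCY_INTERNAL, dropping the materialized view drops its
     * IVM work table automatically, and dropping the work table on its own
     * is refused, matching the "system-managed" intent above.
     */
    recordDependencyOn(&worktable, &matview, DEPENDENCY_INTERNAL);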
Re: Online checksums verification in the backend
On Tue, Dec 24, 2019 at 4:23 AM Masahiko Sawada wrote: > > On Fri, Dec 6, 2019 at 11:51 PM Julien Rouhaud wrote: > > > > This brings the second consideration: how to report the list corrupted > > blocks to end users. As I said this is for now returned via the SRF, > > but this is clearly not ideal and should rather be made available more > > globally. One usage of this information could be block level > > recovery. I'm Cc-ing Sawada-san, as I know he's working on this and > > mentioned me that he had ideas on passing the list of corrupted blocks > > using the stat collector. > > Yes it's necessary the list of corrupted pages for single page > recovery. Apart from single page recovery I think it's helpful for DBA > if they can find the corrupted blocks in the server logs and on a > system view. > > I've also tried to report corrupted pages to the stats collector > during I researching single page recovery in PostgreSQL but one > problem is that the statistics in the stats collector is cleared when > crash recovery. I want the information of block corruption to survive > even when the server down. Yes, having the list of corrupted blocks surviving a crash-and-restart cycle, and also available after a clean shutdown is definitely important. > And we might want to add checksums to the > permanent file having information of database corruption. The > correctness of these information would be important because we can fix > a database by restoring some tables from a logical backup or by doing > reindex etc as long as we have a non-broken information of database > corruption. Agreed > > Finally, the read and locking considerations. I tried to cover that > > extensively in the comments, but here are some details on how I tried > > to make the check safe while trying to keep the overhead as low as > > possible. First thing is that this is only doing buffered reads, > > without any attempt to discard OS cache. Therefore, any discrepancy > > between the OS cache and the disk cannot be detected unless you do > > other actions, such as sync / drop_caches on GNU/Linux. > > > > An access share lock on the currently checked relation is held, > > meaning that it can't get deleted/truncated. The total number of > > blocks for the given fork is retrieved first, so any new block will be > > ignored. Such new blocks are considered out of scope as being written > > after the start of the check. > > > > Each time a buffer is being checked, the target buffer mapping > > partition lock is acquired in shared mode, to prevent concurrent > > eviction. If the buffer is found in shared buffers, it's pinned and > > released immediately, just to get the state. > > I wonder if there is possibility that blocks on disk can be corrupted > even if these are loaded to the shared buffer. ISTM the above method > cannot detect such corruption. Reading and checking blocks fast is > attractive but I thought it's also important to check blocks precisely > without overlooking. It can definitely happen, and it's the usual doomsday scenario: database is working fine for months, then postgres is restarted say for a minor version upgrade and then boom the most populars blocks that are constantly used in read only were corrupted on disk but never evicted from shared buffers, and you have a major outage. I have witnessed that unfortunately too many times. This is especially bad as in this kind of scenario, you typically discover the corruption once all backup only contains the corrupted blocks. 
Note that in the approach I'm suggesting, I do verify blocks that are loaded in shared buffers; I only ignore the dirty blocks, as they'll be written by the checkpointer or recovery process in case of an unclean shutdown. A buffer pin isn't necessary to avoid a torn page read; an IO lock also guarantees that, and causes less overhead. The included TAP test should also detect the corruption of a present-in-shared-buffers-non-dirty block. It could however be improved, e.g. by calling pg_prewarm to make sure that the block is indeed in shared_buffers, and by doing the same test after a clean restart to make sure that it's hitting the not-in-shared-buffers case.
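To make that concrete, the on-disk verification for a block that is either absent from shared buffers or present but clean could look roughly like the sketch below. This is a simplified illustration with made-up names, not the submitted patch.

    #include "postgres.h"
    #include "storage/bufpage.h"
    #include "storage/smgr.h"

    /*
     * Caller has already checked, while holding the buffer mapping partition
     * lock in shared mode, that any copy of this block in shared buffers is
     * not dirty, so the on-disk version is the one worth verifying.
     */
    static bool
    check_block_on_disk(SMgrRelation smgr, ForkNumber forknum, BlockNumber blkno)
    {
        PGAlignedBlock buf;

        smgrread(smgr, forknum, blkno, buf.data);

        /*
         * PageIsVerified() validates the checksum when checksums are enabled,
         * performs basic header sanity checks, and accepts all-zero pages.
         */
        return PageIsVerified((Page) buf.data, blkno);
    }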
Re: [HACKERS] WAL logging problem in 9.4.3?
At Tue, 10 Dec 2019 16:59:25 +0900 (JST), Kyotaro Horiguchi wrote in > shared_buffers=1GB/wal_buffers=16MB(defalut). pgbench -s 20 uses 256MB > of storage so all of them can be loaded on shared memory. > > The attached graph shows larger benefit in TPS drop and latency > increase for HDD. The DDL pages at the corsspoint between commit-FPW > and commit-sync moves from roughly 300 to 200 in TPS and latency, and > 1000 to 600 in DDL runtime. If we can rely on the two graphs, 500 (or > 512) pages seems to be the most promising candidate for the default > value of wal_skip_threshold. > regards. I rebased the patch and changed the default value for the GUC variable wal_skip_threshold to 4096 kilobytes in config.sgml, storage.c and guc.c. 4096kB is choosed as it is the nice round number of 500 pages * 8kB = 4000kB. regards. -- Kyotaro Horiguchi NTT Open Source Software Center >From 64b4f63016c776cd5102b70f5562328dc5d371fa Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Thu, 21 Nov 2019 15:28:06 +0900 Subject: [PATCH] Rework WAL-skipping optimization While wal_level=minimal we omit WAL-logging for certain some operations on relfilenodes that are created in the current transaction. The files are fsynced at commit. The machinery accelerates bulk-insertion operations but it fails in certain sequence of operations and a crash just after commit may leave broken table files. This patch overhauls the machinery so that WAL-loggings on all operations are omitted for such relfilenodes. This patch also introduces a new feature that small files are emitted as a WAL record instead of syncing. The new GUC variable wal_skip_threshold controls the threshold. --- doc/src/sgml/config.sgml | 43 ++-- doc/src/sgml/perform.sgml| 47 + src/backend/access/gist/gistutil.c | 31 ++- src/backend/access/gist/gistxlog.c | 21 ++ src/backend/access/heap/heapam.c | 45 +--- src/backend/access/heap/heapam_handler.c | 22 +- src/backend/access/heap/rewriteheap.c| 21 +- src/backend/access/nbtree/nbtsort.c | 41 +--- src/backend/access/rmgrdesc/gistdesc.c | 5 + src/backend/access/transam/README| 47 - src/backend/access/transam/xact.c| 15 ++ src/backend/access/transam/xloginsert.c | 10 +- src/backend/access/transam/xlogutils.c | 17 +- src/backend/catalog/heap.c | 4 + src/backend/catalog/storage.c| 257 +-- src/backend/commands/cluster.c | 31 +++ src/backend/commands/copy.c | 58 + src/backend/commands/createas.c | 11 +- src/backend/commands/matview.c | 12 +- src/backend/commands/tablecmds.c | 11 +- src/backend/storage/buffer/bufmgr.c | 123 ++- src/backend/storage/smgr/md.c| 35 ++- src/backend/storage/smgr/smgr.c | 37 src/backend/utils/cache/relcache.c | 122 --- src/backend/utils/misc/guc.c | 13 ++ src/bin/psql/input.c | 1 + src/include/access/gist_private.h| 2 + src/include/access/gistxlog.h| 1 + src/include/access/heapam.h | 3 - src/include/access/rewriteheap.h | 2 +- src/include/access/tableam.h | 18 +- src/include/catalog/storage.h| 5 + src/include/storage/bufmgr.h | 4 + src/include/storage/smgr.h | 1 + src/include/utils/rel.h | 57 +++-- src/include/utils/relcache.h | 8 +- src/test/regress/pg_regress.c| 2 + 37 files changed, 839 insertions(+), 344 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 5d1c90282f..d455a42c9d 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -2481,21 +2481,14 @@ include_dir 'conf.d' levels. This parameter can only be set at server start. 
-In minimal level, WAL-logging of some bulk -operations can be safely skipped, which can make those -operations much faster (see ). -Operations in which this optimization can be applied include: - - CREATE TABLE AS - CREATE INDEX - CLUSTER - COPY into tables that were created or truncated in the same - transaction - -But minimal WAL does not contain enough information to reconstruct the -data from a base backup and the WAL logs, so replica or -higher must be used to enable WAL archiving -() and streaming replication. +In minimal level, no information is logged for +tables or indexes for the remainder of a transaction that creates or +truncates them. This can make bulk operations much faster (see +). But minimal WAL does not contain +enough information to reconstruct the data from a base backup and the +WAL logs, so replica or higher must be used
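The rest of the patch is cut off here. The mechanism the commit message describes (sync large newly-created relfilenodes at commit, WAL-log small ones instead) can be pictured with the following simplified sketch; it is illustrative only, as the real code in storage.c works on pending-sync entries and handles all forks:

    /*
     * At commit, each relfilenode created (or truncated) in this transaction
     * under wal_level=minimal must be made durable one way or the other.
     * wal_skip_threshold is the new GUC, in kilobytes.
     */
    static void
    sync_or_log_one_rel(Relation rel)
    {
        BlockNumber nblocks = RelationGetNumberOfBlocks(rel);

        if ((uint64) nblocks * BLCKSZ > (uint64) wal_skip_threshold * 1024)
        {
            /* Large relation: one fsync at commit beats WAL-logging it all */
            RelationOpenSmgr(rel);
            smgrimmedsync(rel->rd_smgr, MAIN_FORKNUM);
        }
        else
        {
            /* Small relation: emitting its pages as WAL is cheaper than an
             * fsync; "true" means standard page layout, allowing hole
             * compression. */
            log_newpage_range(rel, MAIN_FORKNUM, 0, nblocks, true);
        }
    }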
Re: unsupportable composite type partition keys
On Tue, Dec 24, 2019 at 10:59 AM Amit Langote wrote: > Btw, does the memory leakage fix in this patch address any of the > pending concerns that were discussed on the "hyrax vs. > RelationBuildPartitionDesc" thread earlier this year[1]? > > [1] > https://www.postgresql.org/message-id/flat/3800.1560366716%40sss.pgh.pa.us#092b6b4f6bf75d2f3f90ef6a3b3eab5b I thought about this a little and I think it *does* address the main complaint in the above thread. The main complaint as I understand is that receiving repeated invalidations due to partitions being concurrently added while a PartitionDirectory is holding a pointer to PartitionDesc causes many copies of PartitionDesc to pile up due to the parent table being rebuilt upon processing of each invalidation. Now because we don't build the PartitionDesc in the RelationClearRelation path, that can't happen. Although it still seems possible for the piling up to occur if it's RelationBuildPartitionDesc that is run repeatedly via RelationGetParttionDesc while partitions are being concurrently added, but I couldn't find anything in the partitioning code that does that. Thanks, Amit