Re: shared-memory based stats collector
At Tue, 06 Oct 2020 10:06:44 +0900 (JST), Kyotaro Horiguchi wrote in > The previous version failed to flush local database stats for certain > condition. That behavior caused useless retries and finally a forced > flush that leads to contention. I fixed that and will measure > performance with this version. I (we) got some performance numbers. - Fetching 1 tuple from 1 of 100 tables from 100 to 800 clients. - Fetching 1 tuple from 1 of 10 tables from 100 to 800 clients. Those showed a speed of over 400,000 TPS at maximum, and no significant difference is seen between patched and unpatched over the whole range of the test. I tried 5 seconds as PGSTAT_MIN_INTERVAL (10s in the patch) but that made no difference. - Fetching 1 tuple from 1 table from 800 clients. No graph is attached for this, but the test shows a speed of over 42 TPS with or without the v39 patch. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Logical replication CPU-bound with TRUNCATE/DROP/CREATE many tables
On Fri, Oct 2, 2020 at 12:26 PM Keisuke Kuroda wrote: > > Hi Dilip, Amit, > > > > 5. Can you please once repeat the performance test done by Keisuke-San > > > to see if you have similar observations? Additionally, see if you are > > > also seeing the inconsistency related to the Truncate message reported > > > by him and if so why? > > > > > > > Okay, I will set up and do this test early next week. Keisuke-San, > > can you send me your complete test script? > > Yes, I've attached a test script(test_pg_recvlogical.sh) > > Sorry, the issue with TRUNCATE not outputting was due to a build miss > on my part. > Even before the patch, TRUNCATE decodes and outputting correctly. > So, please check the performance only. > > I have tested it again and will share the results with you. > > Also, the argument of palloc was still MemoryContextAlloc, > which prevented me from applying the patch, so I've only fixed that part. > > # test script > > Please set PGHOME and CLUSTER_PUB before run. > > sh test_pg_recvlogical.sh > > # perf command > > perf record --call-graph dwarf -p [walsender pid] > perf report -i perf.data --no-children > > # before patch > > decode + invalidation = 222s > > 2020-10-02 14:55:50 BEGIN 509 > 2020-10-02 14:59:42 table nsp_001.tbl_001, nsp_001.part_0001 ... > nsp_001.part_0999, nsp_001.part_1000: TRUNCATE: (no-flags) > 2020-10-02 14:59:42 COMMIT 509 (at 2020-10-02 14:55:50.349219+09) I could not see this issue even without the patch, it is taking less than 1s even without the patch. See below results 2020-10-08 13:00:49 BEGIN 509 2020-10-08 13:00:49 table nsp_001.part_0001: INSERT:... 2020-10-08 13:00:49 COMMIT 509 (at 2020-10-08 13:00:48.741986+05:30) Am I missing something? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
dynamic result sets support in extended query protocol
I want to progress work on stored procedures returning multiple result sets. Examples of how this could work on the SQL side have previously been shown [0]. We also have ongoing work to make psql show multiple result sets [1]. This appears to work fine in the simple query protocol. But the extended query protocol doesn't support multiple result sets at the moment [2]. This would be desirable to be able to use parameter binding, and also since one of the higher-level goals would be to support the use case of stored procedures returning multiple result sets via JDBC. [0]: https://www.postgresql.org/message-id/flat/4580ff7b-d610-eaeb-e06f-4d686896b93b%402ndquadrant.com [1]: https://commitfest.postgresql.org/29/2096/ [2]: https://www.postgresql.org/message-id/9507.1534370765%40sss.pgh.pa.us (Terminology: I'm calling this project "dynamic result sets", which includes several concepts: 1) multiple result sets, 2) those result sets can have different structures, 3) the structure of the result sets is decided at run time, not declared in the schema/procedure definition/etc.) One possibility I rejected was to invent a third query protocol beside the simple and extended one. This wouldn't really match with the requirements of JDBC and similar APIs because the APIs for sending queries don't indicate whether dynamic result sets are expected or required, you only indicate that later by how you process the result sets. So we really need to use the existing ways of sending off the queries. Also, avoiding a third query protocol is probably desirable in general to avoid extra code and APIs. So here is my sketch on how this functionality could be woven into the extended query protocol. I'll go through how the existing protocol exchange works and then point out the additions that I have in mind. These additions could be enabled by a _pq_ startup parameter sent by the client. Alternatively, it might also work without that because the client would just reject protocol messages it doesn't understand, but that's probably less desirable behavior. So here is how it goes: C: Parse S: ParseComplete At this point, the server would know whether the statement it has parsed can produce dynamic result sets. For a stored procedure, this would be declared with the procedure definition, so when the CALL statement is parsed, this can be noticed. I don't actually plan any other cases, but for the sake of discussion, perhaps some variant of EXPLAIN could also return multiple result sets, and that could also be detected from parsing the EXPLAIN invocation. At this point a client would usually do C: Describe (statement) S: ParameterDescription S: RowDescription New would be that the server would now also respond with a new message, say, S: DynamicResultInfo that indicates that dynamic result sets will follow later. The message would otherwise be empty. (We could perhaps include the number of result sets, but this might not actually be useful, and perhaps it's better not to spent effort on counting things that don't need to be counted.) (If we don't guard this by a _pq_ startup parameter from the client, an old client would now error out because of an unexpected protocol message.) Now the normal bind and execute sequence follows: C: Bind S: BindComplete (C: Describe (portal)) (S: RowDescription) C: Execute S: ... (DataRows) S: CommandComplete In the case of a CALL with output parameters, this "primary" result set contains one row with the output parameters (existing behavior). 
Now, if the client has seen DynamicResultInfo earlier, it should now go into a new subsequence to get the remaining result sets, like this (naming obviously to be refined): C: NextResult S: NextResultReady C: Describe (portal) S: RowDescription C: Execute S: CommandComplete C: NextResult ... C: NextResult S: NoNextResult C: Sync S: ReadyForQuery I think this would all have to use the unnamed portal, but perhaps there could be other uses with named portals. Some details to be worked out. One could perhaps also do without the DynamicResultInfo message and just put extra information into the CommandComplete message indicating "there are more result sets after this one". (Following the model from the simple query protocol, CommandComplete really means one result set complete, not the whole top-level command. ReadyForQuery means the whole command is complete. This is perhaps debatable, and interesting questions could also arise when considering what should happen in the simple query protocol when a query string consists of multiple commands each returning multiple result sets. But it doesn't really seem sensible to cater to that.) One thing that's missing in this sequence is a way to specify the desired output format (text/binary) for each result set. This could be added to the NextResult message, but at that point the client doesn't yet know the number of columns in the
Re: Parallel INSERT (INTO ... SELECT ...)
On Wed, Oct 7, 2020 at 7:25 PM Greg Nancarrow wrote: > > On Wed, Oct 7, 2020 at 12:40 AM Bharath Rupireddy > wrote: > > In parallel, we are not doing anything(due to the same reason > > explained in above comment) to find whether there is a foreign > > partition or not while deciding to go with parallel/non-parallel copy, > > we are just throwing an error during the first tuple insertion into > > the partition. > > > > errmsg("cannot perform PARALLEL COPY if partition has BEFORE/INSTEAD > > OF triggers, or if the partition is foreign partition"), > > errhint("Try COPY without PARALLEL option"))); > > > > I'm wondering whether code similar to the following can safely be used > to detect a foreign partition: > > if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE) > { > int i; > PartitionDesc pd = RelationGetPartitionDesc(rel); > for (i = 0; i < pd->nparts; i++) > { > if (get_rel_relkind(pd->oids[i]) == RELKIND_FOREIGN_TABLE) > { > table_close(rel, NoLock); > return false; > } > } > } > Actually, the addition of this kind of check is still not good enough. Partitions can have their own constraints, triggers, column default expressions etc. and a partition itself can be partitioned. I've written code to recursively walk the partitions and do all the various checks for parallel-insert-safety as before, but it's doing a fair bit of work. Any other idea of dealing with this? Seems it can't be avoided if you want to support partitioned tables and partitions. Regards, Greg Nancarrow Fujitsu Australia
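For illustration, the recursive walk being described might look roughly like the sketch below. It reuses only the catalog helpers already shown in the snippet above (RelationGetPartitionDesc, get_rel_relkind, table_open/table_close, from partitioning/partdesc.h, utils/lsyscache.h and access/table.h); the function name is invented, and the other per-partition checks mentioned (triggers, column defaults, constraints) are only hinted at in a comment.

    /*
     * Sketch only: return false if this relation, or any partition below it,
     * is a foreign table.  A real version would also apply the remaining
     * parallel-insert-safety checks (triggers, column default expressions,
     * constraints) to every partition, as described above.
     */
    static bool
    partition_tree_is_parallel_safe(Relation rel)
    {
        if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
        {
            PartitionDesc pd = RelationGetPartitionDesc(rel);
            int         i;

            for (i = 0; i < pd->nparts; i++)
            {
                char        relkind = get_rel_relkind(pd->oids[i]);

                if (relkind == RELKIND_FOREIGN_TABLE)
                    return false;

                /* Recurse into sub-partitioned tables. */
                if (relkind == RELKIND_PARTITIONED_TABLE)
                {
                    Relation    part = table_open(pd->oids[i], AccessShareLock);
                    bool        safe = partition_tree_is_parallel_safe(part);

                    table_close(part, AccessShareLock);
                    if (!safe)
                        return false;
                }
            }
        }

        return true;
    }

Opening every partition this way is the "fair bit of work" referred to above, which is why the reply below argues for not doing the recursive checking for now.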
Re: Parallel INSERT (INTO ... SELECT ...)
On Thu, Oct 8, 2020 at 1:42 PM Greg Nancarrow wrote: > > On Wed, Oct 7, 2020 at 7:25 PM Greg Nancarrow wrote: > > > > On Wed, Oct 7, 2020 at 12:40 AM Bharath Rupireddy > > wrote: > > > In parallel, we are not doing anything(due to the same reason > > > explained in above comment) to find whether there is a foreign > > > partition or not while deciding to go with parallel/non-parallel copy, > > > we are just throwing an error during the first tuple insertion into > > > the partition. > > > > > > errmsg("cannot perform PARALLEL COPY if partition has BEFORE/INSTEAD > > > OF triggers, or if the partition is foreign partition"), > > > errhint("Try COPY without PARALLEL option"))); > > > > > > > I'm wondering whether code similar to the following can safely be used > > to detect a foreign partition: > > > > if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE) > > { > > int i; > > PartitionDesc pd = RelationGetPartitionDesc(rel); > > for (i = 0; i < pd->nparts; i++) > > { > > if (get_rel_relkind(pd->oids[i]) == RELKIND_FOREIGN_TABLE) > > { > > table_close(rel, NoLock); > > return false; > > } > > } > > } > > > > Actually, the addition of this kind of check is still not good enough. > Partitions can have their own constraints, triggers, column default > expressions etc. and a partition itself can be partitioned. > I've written code to recursively walk the partitions and do all the > various checks for parallel-insert-safety as before, but it's doing a > fair bit of work. > Any other idea of dealing with this? Seems it can't be avoided if you > want to support partitioned tables and partitions. > IMHO, it's good not to do all of this recursive checking right now, which may complicate the code or may restrict the performance gain. Having said that, in future we may have to do something about it. Others may have better opinions on this point. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
Re: dynamic result sets support in extended query protocol
Are you proposing to bump up the protocol version (either major or minor)? I am asking because it seems you are going to introduce some new message types. Best regards, -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese:http://www.sraoss.co.jp > I want to progress work on stored procedures returning multiple result > sets. Examples of how this could work on the SQL side have previously > been shown [0]. We also have ongoing work to make psql show multiple > result sets [1]. This appears to work fine in the simple query > protocol. But the extended query protocol doesn't support multiple > result sets at the moment [2]. This would be desirable to be able to > use parameter binding, and also since one of the higher-level goals > would be to support the use case of stored procedures returning > multiple result sets via JDBC. > > [0]: > https://www.postgresql.org/message-id/flat/4580ff7b-d610-eaeb-e06f-4d686896b93b%402ndquadrant.com > [1]: https://commitfest.postgresql.org/29/2096/ > [2]: > https://www.postgresql.org/message-id/9507.1534370765%40sss.pgh.pa.us > > (Terminology: I'm calling this project "dynamic result sets", which > includes several concepts: 1) multiple result sets, 2) those result > sets can have different structures, 3) the structure of the result > sets is decided at run time, not declared in the schema/procedure > definition/etc.) > > One possibility I rejected was to invent a third query protocol beside > the simple and extended one. This wouldn't really match with the > requirements of JDBC and similar APIs because the APIs for sending > queries don't indicate whether dynamic result sets are expected or > required, you only indicate that later by how you process the result > sets. So we really need to use the existing ways of sending off the > queries. Also, avoiding a third query protocol is probably desirable > in general to avoid extra code and APIs. > > So here is my sketch on how this functionality could be woven into the > extended query protocol. I'll go through how the existing protocol > exchange works and then point out the additions that I have in mind. > > These additions could be enabled by a _pq_ startup parameter sent by > the client. Alternatively, it might also work without that because > the client would just reject protocol messages it doesn't understand, > but that's probably less desirable behavior. > > So here is how it goes: > > C: Parse > S: ParseComplete > > At this point, the server would know whether the statement it has > parsed can produce dynamic result sets. For a stored procedure, this > would be declared with the procedure definition, so when the CALL > statement is parsed, this can be noticed. I don't actually plan any > other cases, but for the sake of discussion, perhaps some variant of > EXPLAIN could also return multiple result sets, and that could also be > detected from parsing the EXPLAIN invocation. > > At this point a client would usually do > > C: Describe (statement) > S: ParameterDescription > S: RowDescription > > New would be that the server would now also respond with a new > message, say, > > S: DynamicResultInfo > > that indicates that dynamic result sets will follow later. The > message would otherwise be empty. (We could perhaps include the > number of result sets, but this might not actually be useful, and > perhaps it's better not to spent effort on counting things that don't > need to be counted.) 
> > (If we don't guard this by a _pq_ startup parameter from the client, > an old client would now error out because of an unexpected protocol > message.) > > Now the normal bind and execute sequence follows: > > C: Bind > S: BindComplete > (C: Describe (portal)) > (S: RowDescription) > C: Execute > S: ... (DataRows) > S: CommandComplete > > In the case of a CALL with output parameters, this "primary" result > set contains one row with the output parameters (existing behavior). > > Now, if the client has seen DynamicResultInfo earlier, it should now > go into a new subsequence to get the remaining result sets, like this > (naming obviously to be refined): > > C: NextResult > S: NextResultReady > C: Describe (portal) > S: RowDescription > C: Execute > > S: CommandComplete > C: NextResult > ... > C: NextResult > S: NoNextResult > C: Sync > S: ReadyForQuery > > I think this would all have to use the unnamed portal, but perhaps > there could be other uses with named portals. Some details to be > worked out. > > One could perhaps also do without the DynamicResultInfo message and > just put extra information into the CommandComplete message indicating > "there are more result sets after this one". > > (Following the model from the simple query protocol, CommandComplete > really means one result set complete, not the whole top-level > command. ReadyForQuery means the whole command is complete. This is > perhaps debatable, and interesting questions could also ari
Re: Resetting spilled txn statistics in pg_stat_replication
On Thu, 8 Oct 2020 at 14:10, Amit Kapila wrote: > > On Thu, Oct 8, 2020 at 7:46 AM Masahiko Sawada > wrote: > > > > On Wed, 7 Oct 2020 at 17:52, Amit Kapila wrote: > > > > > > On Wed, Oct 7, 2020 at 11:24 AM Masahiko Sawada > > > wrote: > > > > > > > > On Tue, 6 Oct 2020 at 17:56, Amit Kapila > > > > wrote: > > > > > > > > > > On Tue, Oct 6, 2020 at 9:34 AM Masahiko Sawada > > > > > wrote: > > > > > > > > > > > > Looking at pgstat_reset_replslot_counter() in the v8 patch, even if > > > > > > we > > > > > > pass a physical slot name to pg_stat_reset_replication_slot() a > > > > > > PgStat_MsgResetreplslotcounter is sent to the stats collector. I’m > > > > > > okay with not raising an error but maybe we can have it not to send > > > > > > the message in that case. > > > > > > > > > > > > > > > > makes sense, so changed accordingly. > > > > > > > > > > > > > Thank you for updating the patch! > > > > > > > > > > Thanks, I will push the first one tomorrow unless I see more comments > > > and test-case one later. > > > > I thought we could have a test case for the reset function, what do you > > think? > > > > We can write if we want but there are few things we need to do for > that like maybe a new function like wait_for_spill_stats which will > check if the counters have become zero. Then probably call a reset > function, call a new wait function, and then again check stats to > ensure they are reset to 0. Yes. > We can't write any advanced test which means reset the existing stats > perform some tests and again check stats because *slot_get_changes() > function can start from the previous WAL for which we have covered the > stats. We might write that if we can somehow track the WAL positions > from the previous test. I am not sure if we want to go there. Can we use pg_logical_slot_peek_changes() instead to decode the same transactions multiple times? Regards, -- Masahiko Sawadahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Logical replication CPU-bound with TRUNCATE/DROP/CREATE many tables
Hi Dilip, > I could not see this issue even without the patch, it is taking less > than 1s even without the patch. See below results > > 2020-10-08 13:00:49 BEGIN 509 > 2020-10-08 13:00:49 table nsp_001.part_0001: INSERT:... > 2020-10-08 13:00:49 COMMIT 509 (at 2020-10-08 13:00:48.741986+05:30) > > Am I missing something? Thanks for running the tests. It is the TRUNCATE decoding that takes time. INSERT decoding is fast, even before the patch is applied. 2020-10-02 14:55:48 BEGIN 508 2020-10-02 14:55:48 table nsp_001.part_0001: INSERT ... 2020-10-02 14:55:49 COMMIT 508 (at 2020-10-02 14:55:48.744019+09) However, TRUNCATE decoding is slow and takes 222s in my environment. 2020-10-02 14:55:50 BEGIN 509 2020-10-02 14:59:42 table nsp_001.tbl_001 ... nsp_001.part_1000: TRUNCATE: (no-flags) 2020-10-02 14:59:42 COMMIT 509 (at 2020-10-02 14:55:50.349219+09) This script will wait 10 seconds after INSERT exits before executing TRUNCATE, please wait for it to run. When TRUNCATE completes, the walsender process should be at 100% CPU. Best Regards, -- Keisuke Kuroda NTT Software Innovation Center keisuke.kuroda.3...@gmail.com
Fix typos in reorderbuffer.c
@@ -1432,7 +1432,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn) ReorderBufferCleanupTXN(rb, subtxn); } - /* cleanup changes in the toplevel txn */ + /* cleanup changes in the txn */ dlist_foreach_modify(iter, &txn->changes) { ReorderBufferChange *change; @@ -1533,7 +1533,7 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn) ReorderBufferTruncateTXN(rb, subtxn); } - /* cleanup changes in the toplevel txn */ + /* cleanup changes in the txn */ dlist_foreach_modify(iter, &txn->changes) { ReorderBufferChange *change; Both the above functions are recursive and will clean the changes for both the top-level transaction and subtransactions. So, I feel the comments should be accordingly updated. -- With Regards, Amit Kapila. v1-0001-Fix-typos-in-reorderbuffer.c.patch Description: Binary data
Re: Logical replication CPU-bound with TRUNCATE/DROP/CREATE many tables
On Thu, Oct 8, 2020 at 2:05 PM Keisuke Kuroda wrote: > > Hi Dilip, > > > I could not see this issue even without the patch, it is taking less > > than 1s even without the patch. See below results > > > > 2020-10-08 13:00:49 BEGIN 509 > > 2020-10-08 13:00:49 table nsp_001.part_0001: INSERT:... > > 2020-10-08 13:00:49 COMMIT 509 (at 2020-10-08 13:00:48.741986+05:30) > > > > Am I missing something? > > Thanks for running the tests. > It is the TRUNCATE decoding that takes time. > INSERT decoding is fast, even before the patch is applied. > > 2020-10-02 14:55:48 BEGIN 508 > 2020-10-02 14:55:48 table nsp_001.part_0001: INSERT ... > 2020-10-02 14:55:49 COMMIT 508 (at 2020-10-02 14:55:48.744019+09) > > However, TRUNCATE decode is slow > and take 222s in my environment. > > 2020-10-02 14:55:50 BEGIN 509 > 2020-10-02 14:59:42 table nsp_001.tbl_001 ... ns p_001.part_1000: > TRUNCATE: (no-flags) > 2020-10-02 14:59:42 COMMIT 509 (at 2020-10-02 14:55:50.349219+09) > > This script will wait 10 seconds after INSERT exits > before executing TRUNCATE, please wait for it to run. > > When TRUNCATE completes, > the walsender process should be at 100% CPU. Okay, thanks for the info, I will run again and see this. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: speed up unicode normalization quick check
On Thu, Oct 8, 2020 at 2:48 AM Michael Paquier wrote: > On Wed, Oct 07, 2020 at 03:18:44PM +0900, Michael Paquier wrote: > I looked at this one again today, and applied it. I looked at what > MSVC compiler was able to do in terms of optimizations with > shift-and-add for multipliers, and it is by far not as good as gcc or > clang, applying imul for basically all the primes we could use for the > perfect hash generation. > Thanks for picking this up! As I recall, godbolt.org also showed MSVC unable to do this optimization. > > I have tested 0002 and 0003, that had better be merged together at the > > end, and I can see performance improvements with MSVC and gcc similar > > to what is being reported upthread, with 20~30% gains for simple > > data sample using IS NFC/NFKC. That's cool. > > For these two, I have merged both together and did some adjustments as > per the attached. Not many tweaks, mainly some more comments for the > unicode header files as the number of structures generated gets > higher. Looks fine overall, but one minor nit: I'm curious why you made a separate section in the pgindent exclusions. The style in that file seems to be one comment per category. -- John Naylor
Re: Resetting spilled txn statistics in pg_stat_replication
On Thu, Oct 8, 2020 at 1:55 PM Masahiko Sawada wrote: > > On Thu, 8 Oct 2020 at 14:10, Amit Kapila wrote: > > > > > > We can write if we want but there are few things we need to do for > > that like maybe a new function like wait_for_spill_stats which will > > check if the counters have become zero. Then probably call a reset > > function, call a new wait function, and then again check stats to > > ensure they are reset to 0. > > Yes. > I am not sure if it is worth but probably it is not a bad idea especially if we extend the existing tests based on your below idea? > > We can't write any advanced test which means reset the existing stats > > perform some tests and again check stats because *slot_get_changes() > > function can start from the previous WAL for which we have covered the > > stats. We might write that if we can somehow track the WAL positions > > from the previous test. I am not sure if we want to go there. > > Can we use pg_logical_slot_peek_changes() instead to decode the same > transactions multiple times? > I think this will do the trick. If we want to go there then I suggest we can have a separate regression test file in test_decoding with name as decoding_stats, stats, or something like that. We can later add the tests related to streaming stats in that file as well. -- With Regards, Amit Kapila.
Re: Logical replication CPU-bound with TRUNCATE/DROP/CREATE many tables
On Thu, Oct 8, 2020 at 2:17 PM Dilip Kumar wrote: > > On Thu, Oct 8, 2020 at 2:05 PM Keisuke Kuroda > wrote: > > > > Hi Dilip, > > > > > I could not see this issue even without the patch, it is taking less > > > than 1s even without the patch. See below results > > > > > > 2020-10-08 13:00:49 BEGIN 509 > > > 2020-10-08 13:00:49 table nsp_001.part_0001: INSERT:... > > > 2020-10-08 13:00:49 COMMIT 509 (at 2020-10-08 13:00:48.741986+05:30) > > > > > > Am I missing something? > > > > Thanks for running the tests. > > It is the TRUNCATE decoding that takes time. > > INSERT decoding is fast, even before the patch is applied. > > > > 2020-10-02 14:55:48 BEGIN 508 > > 2020-10-02 14:55:48 table nsp_001.part_0001: INSERT ... > > 2020-10-02 14:55:49 COMMIT 508 (at 2020-10-02 14:55:48.744019+09) > > > > However, TRUNCATE decode is slow > > and take 222s in my environment. > > > > 2020-10-02 14:55:50 BEGIN 509 > > 2020-10-02 14:59:42 table nsp_001.tbl_001 ... ns p_001.part_1000: > > TRUNCATE: (no-flags) > > 2020-10-02 14:59:42 COMMIT 509 (at 2020-10-02 14:55:50.349219+09) > > > > This script will wait 10 seconds after INSERT exits > > before executing TRUNCATE, please wait for it to run. > > > > When TRUNCATE completes, > > the walsender process should be at 100% CPU. > > Okay, thanks for the info, I will run again and see this. > Now, I can see the truncate time reduced from 5mins to just 1 sec Before patch 2020-10-08 14:18:48 BEGIN 510 2020-10-08 14:23:02 COMMIT 510 (at 2020-10-08 14:18:48.88462+05:30) truncate time: ~5mins After patch : 2020-10-08 14:30:22 BEGIN 510 2020-10-08 14:30:22 COMMIT 510 (at 2020-10-08 14:30:22.766092+05:30) truncate time: < 1s -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Logical replication CPU-bound with TRUNCATE/DROP/CREATE many tables
On Thu, 8 Oct 2020 at 09:47, Dilip Kumar wrote: > > This script will wait 10 seconds after INSERT exits > > before executing TRUNCATE, please wait for it to run. Has this been tested with anything other than the one test case? It would be good to know how the patch handles a transaction that contains many aborted subtransactions that contain invals. -- Simon Riggs http://www.enterprisedb.com/
RE: Transactions involving multiple postgres foreign servers, take 2
Sorry to be late to respond. (My PC is behaving strangely after upgrading Win10 2004) From: Masahiko Sawada > After more thoughts on Tsunakawa-san’s idea it seems to need the > following conditions: > > * At least postgres_fdw is viable to implement these APIs while > guaranteeing not to happen any error. > * A certain number of FDWs (or majority of FDWs) can do that in a > similar way to postgres_fdw by using the guideline and probably > postgres_fdw as a reference. > > These are necessary for FDW implementors to implement APIs while > following the guideline and for the core to trust them. > > As far as postgres_fdw goes, what we need to do when committing a > foreign transaction resolution is to get a connection from the > connection cache or create and connect if not found, construct a SQL > query (COMMIT/ROLLBACK PREPARED with identifier) using a fixed-size > buffer, send the query, and get the result. The possible place to > raise an error is limited. In case of failures such as connection > error FDW can return false to the core along with a flag indicating to > ask the core retry. Then the core will retry to resolve foreign > transactions after some sleep. OTOH if FDW sized up that there is no > hope of resolving the foreign transaction, it also could return false > to the core along with another flag indicating to remove the entry and > not to retry. Also, the transaction resolution by FDW needs to be > cancellable (interruptible) but cannot use CHECK_FOR_INTERRUPTS(). > > Probably, as Tsunakawa-san also suggested, it’s not impossible to > implement these APIs in postgres_fdw while guaranteeing not to happen > any error, although not sure the code complexity. So I think the first > condition may be true but not sure about the second assumption, > particularly about the interruptible part. Yeah, I expect the commit of the second phase should not be difficult for the FDW developer. As for the cancellation during commit retry, I don't think we necessarily have to make the TM responsible for retrying the commits. Many DBMSs have their own timeout functionality such as connection timeout, socket timeout, and statement timeout. Users can set those parameters in the foreign server options based on how long the end user can wait. That is, TM calls FDW's commit routine just once. If the TM makes efforts to retry commits, the duration would be from a few seconds to 30 seconds. Then, we can hold back the cancellation during that period. > I thought we could support both ideas to get their pros; supporting > Tsunakawa-san's idea and then my idea if necessary, and FDW can choose > whether to ask the resolver process to perform 2nd phase of 2PC or > not. But it's not a good idea in terms of complexity. I don't feel the need for leaving the commit to the resolver during normal operation. seems like if failed to resolve, the backend would return an > acknowledgment of COMMIT to the client and the resolver process > resolves foreign prepared transactions in the background. So we can > ensure that the distributed transaction is completed at the time when > the client got an acknowledgment of COMMIT if 2nd phase of 2PC is > successfully completed in the first attempts. OTOH, if it failed for > whatever reason, there is no such guarantee. 
From an optimistic > perspective, e.g., the failures are unlikely to happen, it will work > well but IMO it’s not uncommon to fail to resolve foreign transactions > due to network issue, especially in an unreliable network environment > for example geo-distributed database. So I think it will end up > requiring the client to check if preceding distributed transactions > are completed or not in order to see the results of these > transactions. That issue exists with any method, doesn't it? Regards Takayuki Tsunakawa
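To make the postgres_fdw side of that description concrete, the resolution step reduces to something like the sketch below. It uses only standard libpq calls (PQexec, PQresultStatus, PQstatus from libpq-fe.h); the function name, the gid parameter, and the retry-flag convention are assumptions standing in for the proposed API, not code from the patch, and a real version would escape the identifier (for example with PQescapeLiteral) and remain cancellable.

    #include <stdbool.h>
    #include <stdio.h>
    #include "libpq-fe.h"

    /*
     * Sketch: resolve one prepared foreign transaction on an existing
     * connection.  Returns true on success; on failure, *retry tells the
     * caller (the transaction manager or a resolver process) whether a
     * later retry could help.
     */
    static bool
    resolve_prepared_foreign_xact(PGconn *conn, const char *gid,
                                  bool commit, bool *retry)
    {
        char        sql[256];       /* fixed-size buffer, as described above */
        PGresult   *res;

        snprintf(sql, sizeof(sql), "%s PREPARED '%s'",
                 commit ? "COMMIT" : "ROLLBACK", gid);

        res = PQexec(conn, sql);
        if (PQresultStatus(res) != PGRES_COMMAND_OK)
        {
            /*
             * A broken connection is worth retrying after a sleep; other
             * errors (e.g. the prepared transaction no longer exists) are
             * not, so the caller should drop the entry instead.
             */
            *retry = (PQstatus(conn) == CONNECTION_BAD);
            PQclear(res);
            return false;
        }

        PQclear(res);
        *retry = false;
        return true;
    }

Whether such a routine can really be guaranteed never to raise an error, and how it stays cancellable without CHECK_FOR_INTERRUPTS(), is exactly the open question in the quoted discussion.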
Re: [HACKERS] Custom compression methods
On Wed, Oct 7, 2020 at 5:00 PM Dilip Kumar wrote: > > On Wed, Oct 7, 2020 at 10:26 AM Dilip Kumar wrote: > > > > On Tue, Oct 6, 2020 at 10:21 PM Tomas Vondra > > wrote: > > > > > > On Tue, Oct 06, 2020 at 11:00:55AM +0530, Dilip Kumar wrote: > > > >On Mon, Oct 5, 2020 at 9:34 PM Tomas Vondra > > > > wrote: > > > >> > > > >> On Mon, Oct 05, 2020 at 07:57:41PM +0530, Dilip Kumar wrote: > > > >> >On Mon, Oct 5, 2020 at 5:53 PM Tomas Vondra > > > >> > wrote: > > > >> >> > > > >> >> On Mon, Oct 05, 2020 at 11:17:28AM +0530, Dilip Kumar wrote: > > > >> >> >On Mon, Oct 5, 2020 at 3:37 AM Tomas Vondra > > > >> >> > wrote: > > > >> >> > > > > >> >> >Thanks, Tomas for your feedback. > > > >> >> > > > > >> >> >> 9) attcompression ... > > > >> >> >> > > > >> >> >> The main issue I see is what the patch does with attcompression. > > > >> >> >> Instead > > > >> >> >> of just using it to store a the compression method, it's also > > > >> >> >> used to > > > >> >> >> store the preserved compression methods. And using NameData to > > > >> >> >> store > > > >> >> >> this seems wrong too - if we really want to store this info, the > > > >> >> >> correct > > > >> >> >> way is either using text[] or inventing charvector or similar. > > > >> >> > > > > >> >> >The reason for using the NameData is the get it in the fixed part > > > >> >> >of > > > >> >> >the data structure. > > > >> >> > > > > >> >> > > > >> >> Why do we need that? It's possible to have varlena fields with > > > >> >> direct > > > >> >> access (see pg_index.indkey for example). > > > >> > > > > >> >I see. While making it NameData I was thinking whether we have an > > > >> >option to direct access the varlena. Thanks for pointing me there. I > > > >> >will change this. > > > >> > > > > >> > Adding NameData just to make > > > >> >> it fixed-length means we're always adding 64B even if we just need a > > > >> >> single byte, which means ~30% overhead for the > > > >> >> FormData_pg_attribute. > > > >> >> That seems a bit unnecessary, and might be an issue with many > > > >> >> attributes > > > >> >> (e.g. with many temp tables, etc.). > > > >> > > > > >> >You are right. Even I did not like to keep 64B for this, so I will > > > >> >change it. > > > >> > > > > >> >> > > > >> >> >> But to me this seems very much like a misuse of attcompression > > > >> >> >> to track > > > >> >> >> dependencies on compression methods, necessary because we don't > > > >> >> >> have a > > > >> >> >> separate catalog listing compression methods. If we had that, I > > > >> >> >> think we > > > >> >> >> could simply add dependencies between attributes and that > > > >> >> >> catalog. > > > >> >> > > > > >> >> >Basically, up to this patch, we are having only built-in > > > >> >> >compression > > > >> >> >methods and those can not be dropped so we don't need any > > > >> >> >dependency > > > >> >> >at all. We just want to know what is the current compression > > > >> >> >method > > > >> >> >and what is the preserve compression methods supported for this > > > >> >> >attribute. Maybe we can do it better instead of using the NameData > > > >> >> >but I don't think it makes sense to add a separate catalog? > > > >> >> > > > > >> >> > > > >> >> Sure, I understand what the goal was - all I'm saying is that it > > > >> >> looks > > > >> >> very much like a workaround needed because we don't have the > > > >> >> catalog. 
> > > >> >> > > > >> >> I don't quite understand how could we support custom compression > > > >> >> methods > > > >> >> without listing them in some sort of catalog? > > > >> > > > > >> >Yeah for supporting custom compression we need some catalog. > > > >> > > > > >> >> >> Moreover, having the catalog would allow adding compression > > > >> >> >> methods > > > >> >> >> (from extensions etc) instead of just having a list of hard-coded > > > >> >> >> compression methods. Which seems like a strange limitation, > > > >> >> >> considering > > > >> >> >> this thread is called "custom compression methods". > > > >> >> > > > > >> >> >I think I forgot to mention while submitting the previous patch > > > >> >> >that > > > >> >> >the next patch I am planning to submit is, Support creating the > > > >> >> >custom > > > >> >> >compression methods wherein we can use pg_am catalog to insert the > > > >> >> >new > > > >> >> >compression method. And for dependency handling, we can create an > > > >> >> >attribute dependency on the pg_am row. Basically, we will create > > > >> >> >the > > > >> >> >attribute dependency on the current compression method AM as well > > > >> >> >as > > > >> >> >on the preserved compression methods AM. As part of this, we will > > > >> >> >add two build-in AMs for zlib and pglz, and the attcompression > > > >> >> >field > > > >> >> >will be converted to the oid_vector (first OID will be of the > > > >> >> >current > > > >> >> >compression method, followed by the preserved compression method's > > > >> >> >oids). > > > >> >> > > > >
Re: Fix typos in reorderbuffer.c
On Thu, 8 Oct 2020 at 17:37, Amit Kapila wrote: > > @@ -1432,7 +1432,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, > ReorderBufferTXN *txn) > ReorderBufferCleanupTXN(rb, subtxn); > } > > - /* cleanup changes in the toplevel txn */ > + /* cleanup changes in the txn */ > dlist_foreach_modify(iter, &txn->changes) > { > ReorderBufferChange *change; > @@ -1533,7 +1533,7 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, > ReorderBufferTXN *txn) > ReorderBufferTruncateTXN(rb, subtxn); > } > > - /* cleanup changes in the toplevel txn */ > + /* cleanup changes in the txn */ > dlist_foreach_modify(iter, &txn->changes) > { > ReorderBufferChange *change; > > Both the above functions are recursive and will clean the changes for > both the top-level transaction and subtransactions. Right. > So, I feel the > comments should be accordingly updated. +1 for this change. Regards, -- Masahiko Sawadahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
RE: [Patch] Optimize dropping of relation buffers using dlist
On Thursday, October 8, 2020 3:38 PM, Tsunakawa-san wrote: > Hi Kirk san, Thank you for looking into my patches! > (1) > + * This returns an InvalidBlockNumber when smgr_cached_nblocks is not > + * available and when not in recovery path. > > + /* > + * We cannot believe the result from smgr_nblocks is always accurate > + * because lseek of buggy Linux kernels doesn't account for a recent > + * write. > + */ > + if (!InRecovery && result == InvalidBlockNumber) > + return InvalidBlockNumber; > + > > These are unnecessary, because mdnblocks() never returns > InvalidBlockNumber and conseuently smgrnblocks() doesn't return > InvalidBlockNumber. Yes. Thanks for carefully looking into that. Removed. > (2) > +smgrnblocks(SMgrRelation reln, ForkNumber forknum, bool *isCached) > > I think it's better to make the argument name iscached so that camel case > aligns with forknum, which is not forkNum. This is kinda tricky because of the surrounding code which follows inconsistent coding style too. So I just followed the same like below and retained the change. extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo); extern void smgrextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, char *buffer, bool skipFsync); > (3) > + * This is caused by buggy Linux kernels that might not have > accounted > + * the recent write. If a fork's nblocks is invalid, exit loop. > > "accounted for" is the right English? > I think The second sentence should be described in terms of its meaning, not > the program logic. For example, something like "Give up the optimization if > the block count of any fork cannot be trusted." Fixed. > Likewise, express the following part in semantics. > > + * Do explicit hashtable lookup if the total of nblocks of relation's > forks > + * is not invalid and the nblocks to be invalidated is less than the I revised it like below: "Look up the buffer in the hashtable if the block size is known to be accurate and the total blocks to be invalidated is below the full scan threshold. Otherwise, give up the optimization." > (4) > + if (nForkBlocks[i] == InvalidBlockNumber) > + { > + nTotalBlocks = InvalidBlockNumber; > + break; > + } > > Use isCached in if condition because smgrnblocks() doesn't return > InvalidBlockNumber. Fixed. if (!isCached) > (5) > + nBlocksToInvalidate = nTotalBlocks - firstDelBlock[i]; > > should be > > + nBlocksToInvalidate += (nForkBlocks[i] - firstDelBlock[i]); Fixed. > (6) > + bufHdr->tag.blockNum >= > firstDelBlock[j]) > + InvalidateBuffer(bufHdr); /* > releases spinlock */ > > The right side of >= should be cur_block. Fixed. Attached are the updated patches. Thank you again for the reviews. Regards, Kirk Jamison 0001-v1-Prevent-invalidating-blocks-in-smgrextend-during-recovery.patch Description: 0001-v1-Prevent-invalidating-blocks-in-smgrextend-during-recovery.patch 0002-v2-Add-bool-param-in-smgrnblocks-for-cached-blocks.patch Description: 0002-v2-Add-bool-param-in-smgrnblocks-for-cached-blocks.patch 0003-v23-Optimize-DropRelFileNodeBuffers-during-recovery.patch Description: 0003-v23-Optimize-DropRelFileNodeBuffers-during-recovery.patch
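Putting those review items together, the decision logic being discussed comes down to roughly the following sketch. smgrnblocks() here is the patched variant with the isCached output parameter quoted above; the helper function and the name of the threshold constant are assumptions, and the actual patch does this inline in DropRelFileNodeBuffers.

    /*
     * Sketch: decide whether DropRelFileNodeBuffers can invalidate buffers
     * with targeted buffer-hash lookups instead of scanning all of shared
     * buffers.  Give up if any fork's size is not known to be accurate, or
     * if too many blocks would have to be invalidated.
     */
    static bool
    can_use_targeted_invalidation(SMgrRelation reln,
                                  ForkNumber *forkNum, int nforks,
                                  BlockNumber *firstDelBlock,
                                  BlockNumber *nForkBlocks)
    {
        BlockNumber nBlocksToInvalidate = 0;
        int         i;

        for (i = 0; i < nforks; i++)
        {
            bool        isCached;

            nForkBlocks[i] = smgrnblocks(reln, forkNum[i], &isCached);

            /* Block count cannot be trusted: give up the optimization. */
            if (!isCached)
                return false;

            nBlocksToInvalidate += (nForkBlocks[i] - firstDelBlock[i]);
        }

        /* BUF_DROP_FULL_SCAN_THRESHOLD is an assumed name for the cutoff. */
        return nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD;
    }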
RE: [Patch] Optimize dropping of relation buffers using dlist
Hi, > Attached are the updated patches. Sorry, there was an error in the 3rd patch, so attached is a rebased one. Regards, Kirk Jamison Attachments: 0001-v1-Prevent-invalidating-blocks-in-smgrextend-during-recovery.patch, 0002-v2-Add-bool-param-in-smgrnblocks-for-cached-blocks.patch, 0003-v23-Optimize-DropRelFileNodeBuffers-during-recovery.patch
Re: Asynchronous Append on postgres_fdw nodes.
On 10/5/20 11:35 AM, Etsuro Fujita wrote: Hi, I found a small problem. If we have a mix of async and sync subplans when we catch an assertion on a busy connection. Just for example: PLAN Nested Loop (cost=100.00..174316.95 rows=975 width=8) (actual time=5.191..9.262 rows=9 loops=1) Join Filter: (frgn.a = l.a) Rows Removed by Join Filter: 8991 -> Append (cost=0.00..257.20 rows=11890 width=4) (actual time=0.419..2.773 rows=1000 loops=1) Async subplans: 4 -> Async Foreign Scan on f_1 l_2 (cost=100.00..197.75 rows=2925 width=4) (actual time=0.381..0.585 rows=211 loops=1) -> Async Foreign Scan on f_2 l_3 (cost=100.00..197.75 rows=2925 width=4) (actual time=0.005..0.206 rows=195 loops=1) -> Async Foreign Scan on f_3 l_4 (cost=100.00..197.75 rows=2925 width=4) (actual time=0.003..0.282 rows=187 loops=1) -> Async Foreign Scan on f_4 l_5 (cost=100.00..197.75 rows=2925 width=4) (actual time=0.003..0.316 rows=217 loops=1) -> Seq Scan on l_0 l_1 (cost=0.00..2.90 rows=190 width=4) (actual time=0.017..0.057 rows=190 loops=1) -> Materialize (cost=100.00..170.94 rows=975 width=4) (actual time=0.001..0.002 rows=9 loops=1000) -> Foreign Scan on frgn (cost=100.00..166.06 rows=975 width=4) (actual time=0.766..0.768 rows=9 loops=1) Reproduction script 'test1.sql' see in attachment. Here I force the problem reproduction with setting enable_hashjoin and enable_mergejoin to off. 'asyncmix.patch' contains my solution to this problem. -- regards, Andrey Lepikhov Postgres Professional test1.sql Description: application/sql diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c index 14824368cc..613d406982 100644 --- a/contrib/postgres_fdw/postgres_fdw.c +++ b/contrib/postgres_fdw/postgres_fdw.c @@ -455,7 +455,7 @@ static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel, void *arg); static void create_cursor(ForeignScanState *node); static void request_more_data(ForeignScanState *node); -static void fetch_received_data(ForeignScanState *node); +static void fetch_received_data(ForeignScanState *node, bool vacateconn); static void vacate_connection(PgFdwState *fdwconn, bool clear_queue); static void close_cursor(PGconn *conn, unsigned int cursor_number); static PgFdwModifyState *create_foreign_modify(EState *estate, @@ -1706,15 +1706,19 @@ postgresIterateForeignScan(ForeignScanState *node) { /* * finish the running query before sending the next command for - * this node + * this node. + * When the plan contains both asynchronous subplans and non-async + * subplans backend could request more data in async mode and want to + * get data in sync mode by the same connection. Here it must wait + * for async data before request another. */ - if (!fsstate->s.commonstate->busy) -vacate_connection((PgFdwState *)fsstate, false); + if (fsstate->s.commonstate->busy) +vacate_connection(&fsstate->s, false); request_more_data(node); /* Fetch the result immediately. */ - fetch_received_data(node); + fetch_received_data(node, false); } else if (!fsstate->s.commonstate->busy) { @@ -1749,7 +1753,7 @@ postgresIterateForeignScan(ForeignScanState *node) /* fetch the leader's data and enqueue it for the next request */ if (available) { -fetch_received_data(leader); +fetch_received_data(leader, false); add_async_waiter(leader); } } @@ -3729,7 +3733,7 @@ request_more_data(ForeignScanState *node) * Fetches received data and automatically send requests of the next waiter. 
*/ static void -fetch_received_data(ForeignScanState *node) +fetch_received_data(ForeignScanState *node, bool vacateconn) { PgFdwScanState *fsstate = GetPgFdwScanState(node); PGresult *volatile res = NULL; @@ -3817,7 +3821,8 @@ fetch_received_data(ForeignScanState *node) waiter = move_to_next_waiter(node); /* send the next request if any */ - if (waiter) + if (waiter && (!vacateconn || + GetPgFdwScanState(node)->s.conn != GetPgFdwScanState(waiter)->s.conn)) request_more_data(waiter); MemoryContextSwitchTo(oldcontext); @@ -3843,7 +3848,7 @@ vacate_connection(PgFdwState *fdwstate, bool clear_queue) * query */ leader = commonstate->leader; - fetch_received_data(leader); + fetch_received_data(leader, true); /* let the first waiter be the next leader of this connection */ move_to_next_waiter(leader);
Wired if-statement in gen_partprune_steps_internal
Hi: I found the following code in gen_partprune_steps_internal, where the inner if-statement looks to be always true since list_length(result) > 1. I added an Assert(step_ids != NIL) and all the test cases passed. If the if-statement is always true, shall we remove it to avoid confusion?

gen_partprune_steps_internal(GeneratePruningStepsContext *context,

    if (list_length(result) > 1)
    {
        List       *step_ids = NIL;

        foreach(lc, result)
        {
            PartitionPruneStep *step = lfirst(lc);

            step_ids = lappend_int(step_ids, step->step_id);
        }

        Assert(step_ids != NIL);
        if (step_ids != NIL)    // This should always be true.
        {
            PartitionPruneStep *step;

            step = gen_prune_step_combine(context, step_ids,
                                          PARTPRUNE_COMBINE_INTERSECT);
            result = lappend(result, step);
        }
    }

-- Best Regards Andy Fan
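If the inner check really is redundant, the simplification being asked about would amount to the following sketch, based on the snippet above with the always-true test dropped and the Assert kept:

    if (list_length(result) > 1)
    {
        List       *step_ids = NIL;
        PartitionPruneStep *step;

        foreach(lc, result)
        {
            PartitionPruneStep *s = lfirst(lc);

            step_ids = lappend_int(step_ids, s->step_id);
        }

        /* A list with more than one step yields at least one step_id. */
        Assert(step_ids != NIL);

        step = gen_prune_step_combine(context, step_ids,
                                      PARTPRUNE_COMBINE_INTERSECT);
        result = lappend(result, step);
    }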
Re: [Patch] ALTER SYSTEM READ ONLY
On Wed, Oct 7, 2020 at 11:19 PM Robert Haas wrote: > > On Wed, Sep 16, 2020 at 3:33 PM Robert Haas wrote: > > I don't mind a loop, but that one looks broken. We have to clear the > > bit before we call the function that processes that type of barrier. > > Otherwise, if we succeed in absorbing the barrier but a new instance > > of the same barrier arrives meanwhile, we'll fail to realize that we > > need to absorb the new one. > > Here's a new version of the patch for allowing errors in > barrier-handling functions and/or rejection of barriers by those > functions. I think this responds to all of the previous review > comments from Andres. Also, here is an 0002 which is a handy bit of > test code that I wrote. It's not for commit, but it is useful for > finding bugs. > > In addition to improving 0001 based on the review comments, I also > tried to write a better commit message for it, but it might still be > possible to do better there. It's a bit hard to explain the idea in > the abstract. For ALTER SYSTEM READ ONLY, the idea is that a process > with an XID -- and possibly a bunch of sub-XIDs, and possibly while > idle-in-transaction -- can elect to FATAL rather than absorbing the > barrier. I suspect for other barrier types we might have certain > (hopefully short) stretches of code where a barrier of a particular > type can't be absorbed because we're in the middle of doing something > that relies on the previous value of whatever state is protected by > the barrier. Holding off interrupts in those stretches of code would > prevent the barrier from being absorbed, but would also prevent query > cancel, backend termination, and absorption of other barrier types, so > it seems possible that just allowing the barrier-absorption function > for a barrier of that type to just refuse the barrier until after the > backend exits the critical section of code will work out better. > > Just for kicks, I tried running 'make installcheck-parallel' while > emitting placeholder barriers every 0.05 s after altering the > barrier-absorption function to always return false, just to see how > ugly that was. In round figures, it made it take 24 s vs. 21 s, so > it's actually not that bad. However, it all depends on how many times > you hit CHECK_FOR_INTERRUPTS() how quickly, so it's easy to imagine > that the effect might be very non-uniform. That is, if you can get the > code to be running a tight loop that does little real work but does > CHECK_FOR_INTERRUPTS() while refusing to absorb outstanding type of > barrier, it will probably suck. Therefore, I'm inclined to think that > the fairly strong cautionary logic in the patch is reasonable, but > perhaps it can be better worded somehow. Thoughts welcome. > > I have not rebased the remainder of the patch series over these two. > That I'll do. On a quick look at the latest 0001 patch, the following hunk to reset leftover flags seems to be unnecessary: + /* + * If some barrier types were not successfully absorbed, we will have + * to try again later. + */ + if (!success) + { + ResetProcSignalBarrierBits(flags); + return; + } When the ProcessBarrierPlaceholder() function returns false without an error, that barrier flag gets reset within the while loop. The case when it has an error, the rest of the flags get reset in the catch block. Correct me if I am missing something here. Regards, Amul
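As background for why that ordering matters, the pattern Robert describes (clear the pending bit before running the absorb function, and put it back only if absorption is refused) can be modelled with the much-simplified sketch below. absorb_barrier() is a stand-in for the per-type handlers, and the real code works on ProcSignal state with pg_atomic_* operations rather than a plain pointer.

    extern bool absorb_barrier(int type);   /* stand-in for per-type handlers */

    /*
     * Simplified model, not the actual ProcSignalBarrier code.  Clearing the
     * bit before calling the handler means that a new instance of the same
     * barrier arriving while we absorb this one simply sets the bit again
     * and is not lost; re-setting the bit on refusal makes the next
     * CHECK_FOR_INTERRUPTS() try again.
     */
    static void
    absorb_pending_barriers(uint32 *pending)
    {
        uint32      flags = *pending;   /* snapshot of the work to do now */

        while (flags != 0)
        {
            int         type = pg_rightmost_one_pos32(flags);
            uint32      bit = ((uint32) 1) << type;

            flags &= ~bit;
            *pending &= ~bit;       /* clear before processing */

            if (!absorb_barrier(type))
                *pending |= bit;    /* refused: retry later */
        }
    }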
Re: [PATCH] Keeps tracking the uniqueness with UniqueKey
> On Thu, Oct 08, 2020 at 09:34:51AM +0800, Andy Fan wrote: > > > Other than that I wanted to ask what are the plans to proceed with this > > patch? It's been a while since the question was raised in which format > > to keep unique key expressions, and as far as I can see no detailed > > suggestions or patch changes were proposed as a follow up. Obviously I > > would love to see the first two preparation patches committed to avoid > > dependencies between patches, and want to suggest an incremental > > approach with simple format for start (what we have right now) with the > > idea how to extend it in the future to cover more cases. > > > > I think the hardest part of this series is commit 2, it probably needs > lots of > dedicated time to review which would be the hardest part for the reviewers. > I don't have a good suggestion, however. Sure, and I would review the patch as well. But as far as I understand the main issue is "how to store uniquekey expressions", and as long as it is not decided, no additional review will move the patch forward I guess.
Re: [HACKERS] Runtime Partition Pruning
On Wed, Oct 7, 2020 at 7:00 PM Andy Fan wrote: > > > > On Wed, Oct 7, 2020 at 5:05 PM Andy Fan wrote: >> >> >> >> On Sun, Oct 4, 2020 at 3:10 PM Andy Fan wrote: Now, in my experience, the current system for custom plans vs. generic plans doesn't approach the problem in this way at all, and in my experience that results in some pretty terrible behavior. It will do things like form a custom plan every time because the estimated cost of the custom plan is lower than the estimated cost of the generic plan even though the two plans are structurally identical; only the estimates differ. It will waste gobs of CPU cycles by replanning a primary key lookup 5 times just on the off chance that a lookup on the primary key index isn't the best option. But this patch isn't going to fix any of that. The best we can probably do is try to adjust the costing for Append paths in some way that reflects the costs and benefits of pruning. I'm tentatively in favor of trying to do something modest in that area, but I don't have a detailed proposal. >>> >>> I just realized this issue recently and reported it at [1], then Amit >>> pointed >>> me to this issue being discussed here, so I would like to continue this >>> topic >>> here. >>> >>> I think we can split the issue into 2 issues. One is the partition prune >>> in initial >>> partition prune, which maybe happen in custom plan case only and caused >>> the above issue. The other one happens in the "Run-Time" partition prune, >>> I admit that is an important issue to resolve as well, but looks harder. >>> So I >>> think we can fix the first one at first. >>> >>> ... When we count for the cost of a >>> generic plan, we can reduce the cost based on such information. >> >> >> This way doesn't work since after the initial partition prune, not only the >> cost of the Append node should be reduced, the cost of other plans should >> be reduced as well [1] >> >> However I think if we can use partition prune information from a custom plan >> at the cost_append_path stage, it looks the issue can be fixed. If so, the >> idea >> is similar to David's idea in [2], however Robert didn't agree with this[2]. >> Can anyone elaborate this objection? for a partkey > $1 or BETWEEN cases, >> some real results from the past are probably better than some hard-coded >> assumptions IMO. > > > I can understand Robert's idea now, he intended to resolve both the > "initial-partition-prune" case and "runtime partition prune" case while I > always think > about the former case (Amit reminded me about that at the beginning, but I > just > wake up right now..) > > With the Robert's idea, I think we can do some hack at create_append_path, > we can figure out the pruneinfo (it was done at create_append_plan now) at > that > stage, and then did a fix_append_path with such pruneinfo. We need to fix not > only the cost of AppendPath, but also rows of AppendPath, For example: > SELECT .. FROM t1, t2, p where t1.a = p.partkey and t1.b = t2.b, if the > plan is: > Nest Loop >Nest Loop > t1 > Append (p) >t2 > > Then the rows of Append (p) will be very important. > > For this idea, how to estimate how many partitions/rows can be pruned is a > key > part. Robert said "I was thinking, rather, that if we know for example that > we've doing pruning on partition_column = $1, then we know that only > one partition will match. That's probably a common case. If we've > got partition_column > $1, we could assume that, say, 75% of the > partitions would match. 
partition_column BETWEEN $1 and $2 is > probably a bit more selective, so maybe we assume 50% of the > partitions would match.". I think we can't say the 75% or 50% is a good > number, but the keypoint may be "partition_column = $1" is a common > case. And for the others case, we probably don't make it worse. > > I think we need to do something here, any thoughts? Personally I'm more > like this idea now. Yes, I think we have to do something on those lines. But I am wondering why our stats machinery is failing to do that already. For equality I think it's straight forward. But even for other operators we should use the statistics from the partitioned table to estimate the selectivity and then adjust number of scanned partitions = (number of partitions * fraction of rows scanned). That might be better than 50% or 75%. -- Best Wishes, Ashutosh Bapat
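A rough sketch of that idea, for concreteness (purely illustrative: which quals to feed it and where it would hook into the Append costing are exactly the open questions):

    /*
     * Sketch: estimate how many of nparts partitions survive run-time
     * pruning by treating the fraction of parent rows selected by the
     * pruning quals as the fraction of partitions scanned.
     */
    static double
    estimate_scanned_partitions(PlannerInfo *root, RelOptInfo *parentrel,
                                List *prunequals, int nparts)
    {
        Selectivity sel = clauselist_selectivity(root, prunequals,
                                                 parentrel->relid,
                                                 JOIN_INNER, NULL);

        /* Scan at least one partition and at most all of them. */
        return Min((double) nparts, Max(1.0, sel * (double) nparts));
    }

For partkey = $1 this naturally tends toward a single partition, and for range or BETWEEN quals it follows whatever the parent's statistics say rather than a fixed 75% or 50% guess.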
Re: Asynchronous Append on postgres_fdw nodes.
Hi, I want to suggest one more improvement. Currently the is_async_capable_path() routine allow only ForeignPath nodes as async capable path. But in some cases we can allow SubqueryScanPath as async capable too. For example: SELECT * FROM ((SELECT * FROM foreign_1) UNION ALL (SELECT a FROM foreign_2)) AS b; is async capable, but: SELECT * FROM ((SELECT * FROM foreign_1 LIMIT 10) UNION ALL (SELECT a FROM foreign_2 LIMIT 10)) AS b; doesn't async capable. The patch in attachment tries to improve this situation. -- regards, Andrey Lepikhov Postgres Professional >From fa73b84e8c456c48ef4788304d2ed14f31365aac Mon Sep 17 00:00:00 2001 From: Andrey Lepikhov Date: Thu, 8 Oct 2020 15:46:41 +0500 Subject: [PATCH] 2 --- contrib/postgres_fdw/expected/postgres_fdw.out | 3 ++- src/backend/optimizer/path/allpaths.c | 4 src/backend/optimizer/plan/createplan.c| 2 +- 3 files changed, 7 insertions(+), 2 deletions(-) diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out index eca44a4f40..e5972574b6 100644 --- a/contrib/postgres_fdw/expected/postgres_fdw.out +++ b/contrib/postgres_fdw/expected/postgres_fdw.out @@ -2082,6 +2082,7 @@ SELECT t1c1, avg(t1c1 + t2c1) FROM (SELECT t1.c1, t2.c1 FROM ft1 t1 JOIN ft2 t2 Output: t1.c1, t2.c1 Group Key: t1.c1, t2.c1 -> Append + Async subplans: 2 -> Foreign Scan Output: t1.c1, t2.c1 Relations: (public.ft1 t1) INNER JOIN (public.ft2 t2) @@ -2090,7 +2091,7 @@ SELECT t1c1, avg(t1c1 + t2c1) FROM (SELECT t1.c1, t2.c1 FROM ft1 t1 JOIN ft2 t2 Output: t1_1.c1, t2_1.c1 Relations: (public.ft1 t1_1) INNER JOIN (public.ft2 t2_1) Remote SQL: SELECT r1."C 1", r2."C 1" FROM ("S 1"."T 1" r1 INNER JOIN "S 1"."T 1" r2 ON (((r1."C 1" = r2."C 1" -(20 rows) +(21 rows) SELECT t1c1, avg(t1c1 + t2c1) FROM (SELECT t1.c1, t2.c1 FROM ft1 t1 JOIN ft2 t2 ON (t1.c1 = t2.c1) UNION SELECT t1.c1, t2.c1 FROM ft1 t1 JOIN ft2 t2 ON (t1.c1 = t2.c1)) AS t (t1c1, t2c1) GROUP BY t1c1 ORDER BY t1c1 OFFSET 100 LIMIT 10; t1c1 | avg diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c index 17e9a7a897..5822ba83e0 100644 --- a/src/backend/optimizer/path/allpaths.c +++ b/src/backend/optimizer/path/allpaths.c @@ -3991,6 +3991,10 @@ is_async_capable_path(Path *path) fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path)) return true; } + break; + case T_SubqueryScanPath: + if (is_async_capable_path(((SubqueryScanPath *) path)->subpath)) +return true; default: break; } diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c index 3ae46ed6f1..efb1b0cb4e 100644 --- a/src/backend/optimizer/plan/createplan.c +++ b/src/backend/optimizer/plan/createplan.c @@ -1231,7 +1231,7 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags) * Classify as async-capable or not. If we have decided to run the * children in parallel, we cannot any one of them run asynchronously. * Planner thinks that all subnodes are executed in order if this - * append is orderd. No subpaths cannot be run asynchronously in that + * append is ordered. No subpaths cannot be run asynchronously in that * case. */ if (pathkeys == NIL && -- 2.17.1
Re: speed up unicode normalization quick check
On Thu, Oct 08, 2020 at 04:52:18AM -0400, John Naylor wrote: > Looks fine overall, but one minor nit: I'm curious why you made a separate > section in the pgindent exclusions. The style in that file seems to be one > comment per category. Both parts indeed use PerfectHash.pm, but are generated by different scripts so that looked better as separate items. -- Michael signature.asc Description: PGP signature
Re: partition routing layering in nodeModifyTable.c
On Wed, Oct 7, 2020 at 9:07 PM Heikki Linnakangas wrote: > On 07/10/2020 12:50, Amit Langote wrote: > > On Tue, Oct 6, 2020 at 12:45 AM Heikki Linnakangas wrote: > >> It would be better to set it in make_modifytable(), just > >> after calling PlanDirectModify(). > > > > Actually, that's how it was done in earlier iterations but I think I > > decided to move that into the FDW's functions due to some concern of > > one of the other patches that depended on this patch. Maybe it makes > > sense to bring that back into make_modifytable() and worry about the > > other patch later. > > On second thoughts, I take back my earlier comment. Setting it in > make_modifytable() relies on the assumption that the subplan is a single > ForeignScan node, on the target relation. The documentation for > PlanDirectModify says: > > > To execute the direct modification on the remote server, this > > function must rewrite the target subplan with a ForeignScan plan node > > that executes the direct modification on the remote server. >> > So I guess that assumption is safe. But I'd like to have some wiggle > room here. Wouldn't it be OK to have a Result node on top of the > ForeignScan, for example? If it really must be a simple ForeignScan > node, the PlanDirectModify API seems pretty strange. > > I'm not entirely sure what I would like to do with this now. I could > live with either version, but I'm not totally happy with either. (I like > your suggestion below) Assuming you mean the idea of using RT index to access ResultRelInfos in es_result_relations, we would still need to store the index in the ForeignScan node, so the question of whether to do it in make_modifytable() or in PlanDirectModify() must still be answered. > Looking at this block in postgresBeginDirectModify: > > > /* > >* Identify which user to do the remote access as. This should match > > what > >* ExecCheckRTEPerms() does. > >*/ > > Assert(fsplan->resultRelIndex >= 0); > > dmstate->resultRelIndex = fsplan->resultRelIndex; > > rtindex = list_nth_int(resultRelations, fsplan->resultRelIndex); > > rte = exec_rt_fetch(rtindex, estate); > > userid = rte->checkAsUser ? rte->checkAsUser : GetUserId(); > > That's a complicated way of finding out the target table's RTI. We > should probably store the result RTI in the ForeignScan in the first place. > > >> Another idea is to merge "resultRelIndex" and a "range table index" into > >> one value. Range table entries that are updated would have a > >> ResultRelInfo, others would not. I'm not sure if that would end up being > >> cleaner or messier than what we have now, but might be worth trying. > > > > I have thought about something like this before. An idea I had is to > > make es_result_relations array indexable by plain RT indexes, then we > > don't need to maintain separate indexes that we do today for result > > relations. > > That sounds like a good idea. es_result_relations is currently an array > of ResultRelInfos, so that would leave a lot of unfilled structs in the > array. But in on of your other threads, you proposed turning > es_result_relations into an array of pointers anyway > (https://www.postgresql.org/message-id/CA+HiwqE4k1Q2TLmCAvekw+8_NXepbnfUOamOeX=kphrdtfs...@mail.gmail.com). Okay, I am reorganizing the patches around that idea and will post an update soon. -- Amit Langote EDB: http://www.enterprisedb.com
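To make the es_result_relations idea concrete, lookups could then become as simple as the sketch below, assuming the array turns into per-range-table-entry pointers with NULL for non-result relations; the helper name is illustrative only.

#include "postgres.h"
#include "nodes/execnodes.h"

/*
 * Sketch: fetch the ResultRelInfo for a range table index, relying on
 * es_result_relations being indexable by (rti - 1) and containing NULL
 * entries for relations that are not result relations.
 */
static ResultRelInfo *
lookup_result_rel(EState *estate, Index rti)
{
    Assert(rti > 0 && rti <= estate->es_range_table_size);
    return estate->es_result_relations[rti - 1];
}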
Re: Transactions involving multiple postgres foreign servers, take 2
On Thu, 8 Oct 2020 at 18:05, tsunakawa.ta...@fujitsu.com wrote: > > Sorry to be late to respond. (My PC is behaving strangely after upgrading > Win10 2004) > > From: Masahiko Sawada > > After more thoughts on Tsunakawa-san’s idea it seems to need the > > following conditions: > > > > * At least postgres_fdw is viable to implement these APIs while > > guaranteeing not to happen any error. > > * A certain number of FDWs (or majority of FDWs) can do that in a > > similar way to postgres_fdw by using the guideline and probably > > postgres_fdw as a reference. > > > > These are necessary for FDW implementors to implement APIs while > > following the guideline and for the core to trust them. > > > > As far as postgres_fdw goes, what we need to do when committing a > > foreign transaction resolution is to get a connection from the > > connection cache or create and connect if not found, construct a SQL > > query (COMMIT/ROLLBACK PREPARED with identifier) using a fixed-size > > buffer, send the query, and get the result. The possible place to > > raise an error is limited. In case of failures such as connection > > error FDW can return false to the core along with a flag indicating to > > ask the core retry. Then the core will retry to resolve foreign > > transactions after some sleep. OTOH if FDW sized up that there is no > > hope of resolving the foreign transaction, it also could return false > > to the core along with another flag indicating to remove the entry and > > not to retry. Also, the transaction resolution by FDW needs to be > > cancellable (interruptible) but cannot use CHECK_FOR_INTERRUPTS(). > > > > Probably, as Tsunakawa-san also suggested, it’s not impossible to > > implement these APIs in postgres_fdw while guaranteeing not to happen > > any error, although not sure the code complexity. So I think the first > > condition may be true but not sure about the second assumption, > > particularly about the interruptible part. > > Yeah, I expect the commit of the second phase should not be difficult for the > FDW developer. > > As for the cancellation during commit retry, I don't think we necessarily > have to make the TM responsible for retrying the commits. Many DBMSs have > their own timeout functionality such as connection timeout, socket timeout, > and statement timeout. > Users can set those parameters in the foreign server options based on how > long the end user can wait. That is, TM calls FDW's commit routine just once. What about temporary network failures? I think there are users who don't want to give up resolving foreign transactions failed due to a temporary network failure. Or even they might want to wait for transaction completion until they send a cancel request. If we want to call the commit routine only once and therefore want FDW to retry connecting the foreign server within the call, it means we require all FDW implementors to write a retry loop code that is interruptible and ensures not to raise an error, which increases difficulty. Also, what if the user sets the statement timeout to 60 sec and they want to cancel the waits after 5 sec by pressing ctl-C? You mentioned that client libraries of other DBMSs don't have asynchronous execution functionality. If the SQL execution function is not interruptible, the user will end up waiting for 60 sec, which seems not good. > If the TM makes efforts to retry commits, the duration would be from a few > seconds to 30 seconds. Then, we can hold back the cancellation during that > period. 
> > > > I thought we could support both ideas to get their pros; supporting > > Tsunakawa-san's idea and then my idea if necessary, and FDW can choose > > whether to ask the resolver process to perform 2nd phase of 2PC or > > not. But it's not a good idea in terms of complexity. > > I don't feel the need for leaving the commit to the resolver during normal > operation. I meant it's for FDWs that cannot guarantee not to happen error during resolution. > seems like if failed to resolve, the backend would return an > > acknowledgment of COMMIT to the client and the resolver process > > resolves foreign prepared transactions in the background. So we can > > ensure that the distributed transaction is completed at the time when > > the client got an acknowledgment of COMMIT if 2nd phase of 2PC is > > successfully completed in the first attempts. OTOH, if it failed for > > whatever reason, there is no such guarantee. From an optimistic > > perspective, i.g., the failures are unlikely to happen, it will work > > well but IMO it’s not uncommon to fail to resolve foreign transactions > > due to network issue, especially in an unreliable network environment > > for example geo-distributed database. So I think it will end up > > requiring the client to check if preceding distributed transactions > > are completed or not in order to see the results of these > > transactions. > > That issue exists with any method, doesn't it? Yes, but if we don’t ret
Re: Resetting spilled txn statistics in pg_stat_replication
On Thu, Oct 8, 2020 at 7:46 AM Masahiko Sawada wrote: > > On Wed, 7 Oct 2020 at 17:52, Amit Kapila wrote: > > > > > I think after we are done with this the next > > step would be to finish the streaming stats work [1]. We probably need > > to review and add the test case in that patch. If nobody else shows up > > I will pick it up and complete it. > > +1 > I can review that patch. > I have rebased the stream stats patch and made minor modifications. I haven't done a detailed review but one thing that I think is not correct is: @@ -3496,10 +3499,18 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn) txn->snapshot_now = NULL; } + + rb->streamCount += 1; + rb->streamBytes += txn->total_size; + + /* Don't consider already streamed transaction. */ + rb->streamTxns += (rbtxn_is_streamed(txn)) ? 0 : 1; + /* Process and send the changes to output plugin. */ ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now, command_id, true); I think we should update the stream stats after ReorderBufferProcessTXN rather than before because any error in ReorderBufferProcessTXN can lead to an unnecessary update of stats. But OTOH, the txn flags, and other data can be changed after ReorderBufferProcessTXN so we need to save them in a temporary variable before calling the function. -- With Regards, Amit Kapila. v2-0001-Track-streaming-stats.patch Description: Binary data
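Concretely, the reordering being suggested would look roughly like the sketch below inside ReorderBufferStreamTXN() (a sketch of the proposal, not the final patch): the values that processing may change are captured first, and the counters are bumped only after it returns.

    bool    txn_is_streamed = rbtxn_is_streamed(txn);
    Size    txn_total_size = txn->total_size;

    /* Process and send the changes to output plugin. */
    ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
                            command_id, true);

    /*
     * Update stream statistics only after the transaction has been
     * processed, so that an error thrown above does not inflate them.
     */
    rb->streamCount += 1;
    rb->streamBytes += txn_total_size;
    /* Don't count a transaction that had already been streamed before. */
    rb->streamTxns += txn_is_streamed ? 0 : 1;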
Re: extensible options syntax for replication parser?
On Wed, Jun 24, 2020 at 11:51 AM Robert Haas wrote: > Thanks for the review. v2 attached, hopefully fixing the compilation > issue you mentioned. Tushar Ahuja reported to me off-list that my basebackup refactoring patch set was changing whether or not the following message appeared: NOTICE: WAL archiving is not enabled; you must ensure that all required WAL segments are copied through other means to complete the backup That patch set includes this patch, and the reason for the behavior difference turned out to be that I had gotten an if-test that is part of this patch backwards. Here is v3, fixing that. It is a little disappointing that this mistake didn't cause any existing regression tests to fail. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company v3-0001-Flexible-options-for-BASE_BACKUP-and-CREATE_REPLI.patch Description: Binary data
Re: Wrong example in the bloom documentation
On Thu, Oct 8, 2020 at 06:34:32AM +, Daniel Westermann (DWE) wrote: > Hi, > > as this does not get any attention on the docs-list, once again here. > Any thoughts on this? I was hoping someone more experienced with this would comment, but seeing none, I will apply it in a day or two to all supported versions? Have you tested this output back to 9.5? -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Re: [PATCH] ecpg: fix progname memory leak
On Thu, Oct 8, 2020 at 10:35:41AM +0900, Michael Paquier wrote: > On Wed, Oct 07, 2020 at 02:31:54PM +0300, Maksim Kita wrote: > > Fix progname memory leak in ecpg client. > > Issue taken from todo list https://wiki.postgresql.org/wiki/Todo. > > FWIW, I don't see much point in doing that. For one, we have a > more-or-less established rule that progname remains set until the > application leaves, and there are much more places where we leak > memory like that. As one example, just see the various places where > we use pg_strdup for option parsing. At the end, it does not really > matter as these are applications running for a short amount of time. Agreed, but what does the TODO item mean then? Fix small memory leaks in ecpg Memory leaks in a short running application like ecpg are not really a problem, but make debugging more complicated Should it be removed? -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Re: Probably typo in multixact.c
On Thu, Oct 8, 2020 at 10:26:39AM +0900, Michael Paquier wrote: > On Thu, Oct 08, 2020 at 01:15:35AM +, Hou, Zhijie wrote: > > Hi > > > > In multixact.c I found some comments like the following: > > > > * Similar to AtEOX_MultiXact but for COMMIT PREPARED > > * Discard the local MultiXactId cache like in AtEOX_MultiXact > > > > Since there's no function called "AtEOX_MultiXact" in the code, > > I think the "AtEOX_MultiXact" may be a typo. > > > > AtEOXact_MultiXact seems to be the right function here. > > Yes, that looks like a simple typo to me as well. > AtEOXact_MultiXact() shares portions of the logics in > PostPrepare_MultiXact and multixact_twophase_postcommit. FYI, this patch was applied. -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Re: [PATCH] ecpg: fix progname memory leak
On Thu, 2020-10-08 at 12:27 -0400, Bruce Momjian wrote: > On Thu, Oct 8, 2020 at 10:35:41AM +0900, Michael Paquier wrote: > > On Wed, Oct 07, 2020 at 02:31:54PM +0300, Maksim Kita wrote: > > > Fix progname memory leak in ecpg client. > > > Issue taken from todo list https://wiki.postgresql.org/wiki/Todo. > > > > FWIW, I don't see much point in doing that. For one, we have a > > more-or-less established rule that progname remains set until the > > application leaves, and there are much more places where we leak > > memory like that. As one example, just see the various places > > where > > we use pg_strdup for option parsing. At the end, it does not > > really > > matter as these are applications running for a short amount of > > time. > > Agreed, but what does the TODO item mean then? > > Fix small memory leaks in ecpg > Memory leaks in a short running application like ecpg are > not really > a problem, but make debugging more complicated > > Should it be removed? I'd say yes, let's remove it. Actually I wasn't even aware it's on there. While I agree that it makes debugging of memory handling in ecpg more difficult, I don't see much of a point in it. Michael -- Michael Meskes Michael at Fam-Meskes dot De Michael at Meskes dot (De|Com|Net|Org) Meskes at (Debian|Postgresql) dot Org
Re: [PATCH] ecpg: fix progname memory leak
On Thu, Oct 8, 2020 at 06:58:14PM +0200, Michael Meskes wrote: > On Thu, 2020-10-08 at 12:27 -0400, Bruce Momjian wrote: > > On Thu, Oct 8, 2020 at 10:35:41AM +0900, Michael Paquier wrote: > > > On Wed, Oct 07, 2020 at 02:31:54PM +0300, Maksim Kita wrote: > > > > Fix progname memory leak in ecpg client. > > > > Issue taken from todo list https://wiki.postgresql.org/wiki/Todo. > > > > > > FWIW, I don't see much point in doing that. For one, we have a > > > more-or-less established rule that progname remains set until the > > > application leaves, and there are much more places where we leak > > > memory like that. As one example, just see the various places > > > where > > > we use pg_strdup for option parsing. At the end, it does not > > > really > > > matter as these are applications running for a short amount of > > > time. > > > > Agreed, but what does the TODO item mean then? > > > > Fix small memory leaks in ecpg > > Memory leaks in a short running application like ecpg are > > not really > > a problem, but make debugging more complicated > > > > Should it be removed? > > I'd say yes, let's remove it. Actually I wasn't even aware it's on > there. While I agree that it makes debugging of memory handling in ecpg > more difficult, I don't see much of a point in it. OK, TODO removed. -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Re: [PATCH] ecpg: fix progname memory leak
On Wed, Oct 7, 2020 at 6:35 PM Michael Paquier wrote: > On Wed, Oct 07, 2020 at 02:31:54PM +0300, Maksim Kita wrote: > > Fix progname memory leak in ecpg client. > > Issue taken from todo list https://wiki.postgresql.org/wiki/Todo. > > FWIW, I don't see much point in doing that. > I hope that everyone takes 10 seconds and realizes that this appears to be this person's first submitted patch. I would think a little more respect for the attempt to patch a minor issue would be afforded to such a person. It seems technically sound, and they are not trying to change the world with their first attempt. Maybe a reminder that the TODO list is not always spectacular and that a double check with the list before doing something might be in order (in fact, adding that to the top of the TODO list might be a great option here as well). It's not going to win a Turing award - but I thought this project was a little friendlier than what I've seen in this thread towards a first-time contributor. John
Re: [PATCH] ecpg: fix progname memory leak
On Thu, Oct 8, 2020 at 10:13:53AM -0700, John W Higgins wrote: > > On Wed, Oct 7, 2020 at 6:35 PM Michael Paquier wrote: > > On Wed, Oct 07, 2020 at 02:31:54PM +0300, Maksim Kita wrote: > > Fix progname memory leak in ecpg client. > > Issue taken from todo list https://wiki.postgresql.org/wiki/Todo. > > FWIW, I don't see much point in doing that. > > > I hope that everyone takes 10 seconds and realizes that this appears to be > this > person's first submitted patch. I would think a little more respect for the > attempt to patch a minor issue would be afforded to such a person. Seems > technically sound and they are not trying to change the world with their first > attempt. > > Maybe a reminder that the TODO list is not always spectacular and that a > double > check with the list before doing something might be in order (in fact adding > that to the top of the TODO list might be a great option here as well). > > It's not going to win a Turing award - but I thought this project was a little > more friendly then what I've seen in this thread towards a first time > contributor. You mean like the warning at the top of the TODO list? https://wiki.postgresql.org/wiki/Todo#Development_Process WARNING for Developers: Unfortunately this list does not contain all the information necessary for someone to start coding a feature. Some of these items might have become unnecessary since they were added --- others might be desirable but the implementation might be unclear. When selecting items listed below, be prepared to first discuss the value of the feature. Do not assume that you can select one, code it and then expect it to be committed. Always discuss design on Hackers list before starting to code. The flow should be: Desirability -> Design -> Implement -> Test -> Review -> Commit -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Re: [PATCH] ecpg: fix progname memory leak
Michael Meskes writes: > On Thu, 2020-10-08 at 12:27 -0400, Bruce Momjian wrote: >> Agreed, but what does the TODO item mean then? >> >> Fix small memory leaks in ecpg >> Memory leaks in a short running application like ecpg are not really >> a problem, but make debugging more complicated >> >> Should it be removed? > I'd say yes, let's remove it. Actually I wasn't even aware it's on > there. While I agree that it makes debugging of memory handling in ecpg > more difficult, I don't see much of a point in it. Agreed. In theory there might be some point in removing leaks that happen per-statement, so as to avoid unreasonable memory bloat when processing an extremely long input file. In practice nobody has complained about that, and if somebody did I'd probably question the sanity of putting so much code into one file. (The C compiler would likely bloat even more while processing the output...) Moreover, given the way the ecpg grammar works, it'd be really painful to avoid all such leaks, and it might well introduce bugs not fix them. In any case, if this TODO item is going to lead to ideas as dubious as "let's free progname before exiting", it's not helpful. regards, tom lane
Re: Wrong example in the bloom documentation
Hi Bruce, >On Thu, Oct 8, 2020 at 06:34:32AM +, Daniel Westermann (DWE) wrote: >> Hi, >> >> as this does not get any attention on the docs-list, once again here. >> Any thoughts on this? >I was hoping someone more experienced with this would comment, but >seeing none, I will apply it in a day or two to all supported versions? >Have you tested this output back to 9.5? I hoped that as well. No, I tested down to 9.6 because the change happened in 10. Regards Daniel
Re: Wrong example in the bloom documentation
On Thu, Oct 8, 2020 at 06:12:47PM +, Daniel Westermann (DWE) wrote: > Hi Bruce, > > >On Thu, Oct 8, 2020 at 06:34:32AM +, Daniel Westermann (DWE) wrote: > >> Hi, > >> > >> as this does not get any attention on the docs-list, once again here. > >> Any thoughts on this? > > >I was hoping someone more experienced with this would comment, but > >seeing none, I will apply it in a day or two to all supported versions? > >Have you tested this output back to 9.5? > > I hoped that as well. No, I tested down to 9.6 because the change happened in > 10. OK, so the patch should be applied only to PG 10 and later? -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Assertion failure with LEFT JOINs among >500 relations
Hi, I hit an assertion failure. When asserts disabled, it works fine even with more tables (>5000). Steps to reproduce: CREATE TABLE users_table (user_id int, time timestamp, value_1 int, value_2 int, value_3 float, value_4 bigint); 250 relations work fine, see the query (too long to copy & paste here): https://gist.github.com/onderkalaci/2b40a18d989da389ee4fb631e1ad7c0e#file-steps_to_assert_pg-sql-L41 -- when # relations >500, we hit the assertion (too long to copy & paste here): See the query: https://gist.github.com/onderkalaci/2b40a18d989da389ee4fb631e1ad7c0e#file-steps_to_assert_pg-sql-L45 And, the backtrace: (lldb) bt * thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGABRT * frame #0: 0x7fff639fa2c2 libsystem_kernel.dylib`__pthread_kill + 10 frame #1: 0x7fff63ab5bf1 libsystem_pthread.dylib`pthread_kill + 284 frame #2: 0x7fff639646a6 libsystem_c.dylib`abort + 127 frame #3: 0x000102180a02 postgres`ExceptionalCondition(conditionName=, errorType=, fileName=, lineNumber=) at assert.c:67:2 frame #4: 0x000101ece9b2 postgres`initial_cost_mergejoin(root=0x7ff0, workspace=0x7ffeedf5b528, jointype=JOIN_INNER, mergeclauses=, outer_path=0x00012ebf12d0, inner_path=0x4093d800, outersortkeys=0x, innersortkeys=0x00012ebf68e8, extra=0x7ffeedf5b6f8) at costsize.c:3043:2 frame #5: 0x000101eda01b postgres`try_mergejoin_path(root=0x000104a12618, joinrel=0x00012ebeede0, outer_path=0x00012ebf12d0, inner_path=0x0001283d00e8, pathkeys=0x00012ebf67e0, mergeclauses=0x00012ebf6890, outersortkeys=0x, innersortkeys=0x00012ebf68e8, jointype=JOIN_LEFT, extra=0x7ffeedf5b6f8, is_partial=) at joinpath.c:615:2 frame #6: 0x000101ed9426 postgres`sort_inner_and_outer(root=0x000104a12618, joinrel=0x00012ebeede0, outerrel=, innerrel=, jointype=JOIN_LEFT, extra=0x7ffeedf5b6f8) at joinpath.c:1038:3 frame #7: 0x000101ed8f7a postgres`add_paths_to_joinrel(root=0x000104a12618, joinrel=0x00012ebeede0, outerrel=0x00012ebe7b48, innerrel=0x000127f146e0, jointype=, sjinfo=, restrictlist=0x00012ebf42b0) at joinpath.c:269:3 frame #8: 0x000101edbdc6 postgres`populate_joinrel_with_paths(root=0x000104a12618, rel1=0x00012ebe7b48, rel2=0x000127f146e0, joinrel=0x00012ebeede0, sjinfo=0x00012809edc8, restrictlist=0x00012ebf42b0) at joinrels.c:824:4 frame #9: 0x000101edb57a postgres`make_join_rel(root=0x000104a12618, rel1=0x00012ebe7b48, rel2=0x000127f146e0) at joinrels.c:760:2 frame #10: 0x000101edb1ec postgres`make_rels_by_clause_joins(root=0x000104a12618, old_rel=0x00012ebe7b48, other_rels_list=, other_rels=) at joinrels.c:312:11 frame #11: 0x000101edada3 postgres`join_search_one_level(root=0x000104a12618, level=2) at joinrels.c:123:4 frame #12: 0x000101ec7feb postgres`standard_join_search(root=0x000104a12618, levels_needed=8, initial_rels=0x00012ebf4078) at allpaths.c:3097:3 frame #13: 0x000101ec6b38 postgres`make_rel_from_joinlist(root=0x000104a12618, joinlist=0x0001280a5618) at allpaths.c:2993:14 frame #14: 0x000101ec6b38 postgres`make_rel_from_joinlist(root=0x000104a12618, joinlist=0x0001280ab320) at allpaths.c:2993:14 frame #15: 0x000101ec6b38 postgres`make_rel_from_joinlist(root=0x000104a12618, joinlist=0x0001280b1028) at allpaths.c:2993:14 frame #16: 0x000101ec6b38 postgres`make_rel_from_joinlist(root=0x000104a12618, joinlist=0x0001280b6d30) at allpaths.c:2993:14 frame #17: 0x000101ec6b38 postgres`make_rel_from_joinlist(root=0x000104a12618, joinlist=0x0001280bca38) at allpaths.c:2993:14 frame #18: 0x000101ec6b38 postgres`make_rel_from_joinlist(root=0x000104a12618, joinlist=0x0001280c2740) at allpaths.c:2993:14 frame 
#19: 0x000101ec6b38 postgres`make_rel_from_joinlist(root=0x000104a12618, joinlist=0x0001280c8448) at allpaths.c:2993:14 frame #20: 0x000101ec6b38 postgres`make_rel_from_joinlist(root=0x000104a12618, joinlist=0x0001280ce150) at allpaths.c:2993:14 frame #21: 0x000101ec6b38 postgres`make_rel_from_joinlist(root=0x000104a12618, joinlist=0x0001280d3e58) at allpaths.c:2993:14 frame #22: 0x000101ec6b38 postgres`make_rel_from_joinlist(root=0x000104a12618, joinlist=0x0001280d9b60) at allpaths.c:2993:14 frame #23: 0x000101ec6b38 postgres`make_rel_from_joinlist(root=0x000104a12618, joinlist=0x0001280df868) at allpaths.c:2993:14 frame #24: 0x000101ec6b38 postgres`make_rel_from_joinlist(root=0x000104a12618, joinlist=0x0001280e5570) at allpaths.c:2993:14 frame #25: 0x000101ec6b38 postgres`make_rel_from_joinlist(root=0x000104a12618, joinlist=0x0001280eb278) at allpaths.c:2993:14
Expansion of our checks for connection-loss errors
Over in the thread at [1], we've tentatively determined that the reason buildfarm member lorikeet is currently failing is that its network stack returns ECONNABORTED for (some?) connection failures, whereas our code is only expecting ECONNRESET. Fujii Masao therefore proposes that we treat ECONNABORTED the same as ECONNRESET. I think this is a good idea, but after a bit of research I feel it does not go far enough. I find these POSIX-standard errnos that also seem likely candidates to be returned for a hard loss of connection: ECONNABORTED EHOSTUNREACH ENETDOWN ENETUNREACH All of these have been in POSIX since SUSv2, so it seems unlikely that we need to #ifdef any of them. (It is in any case pretty silly that we have #ifdefs around a very small minority of our references to ECONNRESET :-(.) There are some other related errnos, such as ECONNREFUSED, that don't seem like they'd be returned for a failure of a pre-existing connection, so we don't need to include them in such tests. Accordingly, I propose the attached patch (an expansion of Fujii-san's) that causes us to test for all five errnos anyplace we had been checking for ECONNRESET. I felt that this was getting to the point where we'd better centralize the knowledge of what to check, so the patch does that, via an inline function and an admittedly hacky macro. I also upgraded some places such as strerror.c to have full support for these symbols. All of the machines I have (even as far back as HPUX 10.20) also define ENETRESET and EHOSTDOWN. However, those symbols do not appear in SUSv2. ENETRESET was added at some later point, but EHOSTDOWN is still not in POSIX. For the moment I've left these second-tier symbols out of the patch, but there's a case for adding them. I'm not sure whether there'd be any point in trying to #ifdef them. BTW, I took out the conditional defines of some of these errnos in libpq's win32.h; AFAICS that's been dead code ever since we added #define's for them to win32_port.h. Am I missing something? This seems like a bug fix to me, so I'm inclined to back-patch. regards, tom lane [1] https://www.postgresql.org/message-id/flat/E1kPc9v-0005L4-2l%40gemulon.postgresql.org diff --git a/src/backend/port/win32/socket.c b/src/backend/port/win32/socket.c index 6fbd1ed6fb..0f28c07ed1 100644 --- a/src/backend/port/win32/socket.c +++ b/src/backend/port/win32/socket.c @@ -123,10 +123,14 @@ TranslateSocketError(void) case WSAEHOSTUNREACH: case WSAEHOSTDOWN: case WSAHOST_NOT_FOUND: + errno = EHOSTUNREACH; + break; case WSAENETDOWN: - case WSAENETUNREACH: case WSAENETRESET: - errno = EHOSTUNREACH; + errno = ENETDOWN; + break; + case WSAENETUNREACH: + errno = ENETUNREACH; break; case WSAENOTCONN: case WSAESHUTDOWN: diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c index d0b368530e..8937f223a8 100644 --- a/src/backend/utils/error/elog.c +++ b/src/backend/utils/error/elog.c @@ -712,9 +712,7 @@ errcode_for_socket_access(void) { /* Loss of connection */ case EPIPE: -#ifdef ECONNRESET - case ECONNRESET: -#endif + case ALL_CONNECTION_LOSS_ERRNOS: edata->sqlerrcode = ERRCODE_CONNECTION_FAILURE; break; diff --git a/src/bin/pg_dump/parallel.c b/src/bin/pg_dump/parallel.c index f0587f41e4..b8ab939179 100644 --- a/src/bin/pg_dump/parallel.c +++ b/src/bin/pg_dump/parallel.c @@ -1825,7 +1825,7 @@ piperead(int s, char *buf, int len) { int ret = recv(s, buf, len, 0); - if (ret < 0 && WSAGetLastError() == WSAECONNRESET) + if (ret < 0 && errno_is_connection_loss(WSAGetLastError())) { /* EOF on the pipe! 
*/ ret = 0; diff --git a/src/include/port.h b/src/include/port.h index 84bf2c363f..ffa21e782a 100644 --- a/src/include/port.h +++ b/src/include/port.h @@ -99,6 +99,30 @@ extern void pgfnames_cleanup(char **filenames); ) #endif +/* Test for all errnos that report loss of an established TCP connection */ +static inline bool +errno_is_connection_loss(int e) +{ + if (e == ECONNRESET || + e == ECONNABORTED || + e == EHOSTUNREACH || + e == ENETDOWN || + e == ENETUNREACH) + return true; + return false; +} + +/* + * To test for connection-loss errnos in a switch statement, write + * "case ALL_CONNECTION_LOSS_ERRNOS:". + */ +#define ALL_CONNECTION_LOSS_ERRNOS \ + ECONNRESET: \ + case ECONNABORTED: \ + case EHOSTUNREACH: \ + case ENETDOWN: \ + case ENETUNREACH + /* Portable locale initialization (in exec.c) */ extern void set_pglocale_pgservice(const char *argv0, const char *app); diff --git a/src/include/port/win32_port.h b/src/include/port/win32_port.h index 8b6576b23d..7ce8c87e8d 100644 --- a/src/include/port/win32_port.h +++ b/src/include/port/win32_port.h @@ -351,6 +351,10 @@ extern int pgwin32_safestat(const char *path, struct stat *buf); #define EADDRNOTAVAIL WSAEADDRNOTAVAIL #undef EHOSTUNREACH #define EHOSTUNREACH WSAEHOSTUNREACH +#undef ENETDOWN +#define ENETDOWN WSAENETDO
Re: Wrong example in the bloom documentation
"Daniel Westermann (DWE)" writes: >> I was hoping someone more experienced with this would comment, but >> seeing none, I will apply it in a day or two to all supported versions? >> Have you tested this output back to 9.5? > I hoped that as well. No, I tested down to 9.6 because the change happened in > 10. The patch assumes that parallel query is enabled, which is not true by default before v10, so it should certainly not be applied before v10 (at least not without significant revisions). regards, tom lane
Re: Wrong example in the bloom documentation
On Thu, Oct 8, 2020 at 03:43:32PM -0400, Tom Lane wrote: > "Daniel Westermann (DWE)" writes: > >> I was hoping someone more experienced with this would comment, but > >> seeing none, I will apply it in a day or two to all supported versions? > >> Have you tested this output back to 9.5? > > > I hoped that as well. No, I tested down to 9.6 because the change happened > > in 10. > > The patch assumes that parallel query is enabled, which is not true by > default before v10, so it should certainly not be applied before v10 > (at least not without significant revisions). I think we should just go for simple application cases for this. -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Re: Minor documentation error regarding streaming replication protocol
On Sun, Sep 13, 2020 at 10:25:01PM +0200, Brar Piening wrote: > While implementing streaming replication client functionality for Npgsql > I stumbled upon a minor documentation error at > https://www.postgresql.org/docs/current/protocol-replication.html > > The "content" return value for the TIMELINE_HISTORYcommand is advertised > as bytea while it is in fact raw ASCII bytes. > > How did I find out? > Since the value I get doesn't start with a "\x" and contains ascii text, > although I've bytea_outputset to hex, I first thought that the streaming > replication protocol simply doesn't honor bytea_output, but then I > realized that I also get unencoded tabs and newlines which wouldn't be > possible if the value woud be passed through byteaout. > > This is certainly a minor problem since the timeline history file only > contains generated strings that are ASCII-only, so just using the > unencoded bytes is actually easier than decoding bytea. > OTOH it did cost me a few hours (writing a bytea decoder and figuring > out why it doesn't work by looking at varlena.c and proving the docs > wrong) so I want to point this out here since it is possibly an > unintended behavior or at least a documentation error. > Also I'm wary of taking dependency on an undocumented implementation > detail that could possibly change at any point. I have looked at this. It seems SendTimeLineHistory() is sending raw bytes from the history file, with no encoding conversion, and ReceiveXlogStream() is receiving it, again assuming it is just plain text. I am not sure we really have an SQL data type where we do this. BYTEA doesn't do encoding conversion, but does backslash procesing, and TEXT does encoding conversion. I suppose we either have to document this as BYTEA with no backslash processing, or TEXT with no encoding conversion --- I think I prefer the later. Attached is a patch to update the docs, and the data type passed by SendTimeLineHistory(). Does this look safe to everyone? -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml index f5e3318106..3cc30a480b 100644 --- a/doc/src/sgml/protocol.sgml +++ b/doc/src/sgml/protocol.sgml @@ -1877,11 +1877,11 @@ The commands accepted in replication mode are: - content (bytea) + content (text) - Contents of the timeline history file. + Contents of the timeline history file; no encoding conversion is performed. diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c index 7c9d1b67df..7db8975065 100644 --- a/src/backend/replication/walsender.c +++ b/src/backend/replication/walsender.c @@ -496,7 +496,7 @@ SendTimeLineHistory(TimeLineHistoryCmd *cmd) pq_sendstring(&buf, "content"); /* col name */ pq_sendint32(&buf, 0); /* table oid */ pq_sendint16(&buf, 0); /* attnum */ - pq_sendint32(&buf, BYTEAOID); /* type oid */ + pq_sendint32(&buf, TEXTOID); /* type oid */ pq_sendint16(&buf, -1); /* typlen */ pq_sendint32(&buf, 0); /* typmod */ pq_sendint16(&buf, 0); /* format code */
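For what it's worth, a client that follows the patched documentation can simply take the column value as-is; a minimal sketch (error handling omitted, assuming a connection opened with replication=true or replication=database) might look like this:

#include <stdio.h>
#include <string.h>
#include "libpq-fe.h"

/*
 * Sketch: fetch the history file for a timeline over a replication
 * connection and return a copy of the "content" column verbatim, i.e.
 * without any bytea unescaping or encoding conversion.
 */
static char *
fetch_timeline_history(PGconn *conn, int tli)
{
    char        query[64];
    PGresult   *res;
    char       *content;

    snprintf(query, sizeof(query), "TIMELINE_HISTORY %d", tli);
    res = PQexec(conn, query);
    content = strdup(PQgetvalue(res, 0, 1));    /* raw file contents */
    PQclear(res);
    return content;
}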
Re: [PATCH] ecpg: fix progname memory leak
On Thu, Oct 8, 2020 at 10:13:53AM -0700, John W Higgins wrote: >It's not going to win a Turing award - but I thought this project was a >little friendlier than what I've seen in this thread towards a >first-time contributor. Instead, it is unfriendly. It takes a lot of motivation to "try" to submit a patch. Good luck, Maksim Kita. Thanks for the support, John. regards, Ranier Vilela
Re: Parallel INSERT (INTO ... SELECT ...)
On Tue, Oct 6, 2020 at 10:38 PM Greg Nancarrow wrote: +if (estate->es_plannedstmt->commandType == CMD_INSERT) ... +if ((XactReadOnly || (IsInParallelMode() && queryDesc->plannedstmt->commandType != CMD_INSERT)) && ... +isParallelInsertLeader = nodeModifyTableState->operation == CMD_INSERT; ... One thing I noticed is that you have logic, variable names and assertions all over the tree that assume that we can only do parallel *inserts*. I agree 100% with your plan to make Parallel Insert work first, it is an excellent goal and if we get it in it'll be a headline feature of PG14 (along with COPY etc). That said, I wonder if it would make sense to use more general naming (isParallelModifyLeader?), be more liberal where you really mean "is it DML", and find a way to centralise the logic about which DML commands types are currently allowed (ie insert only for now) for assertions and error checks etc, so that in future we don't have to go around and change all these places and rename things again and again. While contemplating that, I couldn't resist taking a swing at the main (?) show stopper for Parallel Update and Parallel Delete, judging by various clues left in code comments by Robert: combo command IDs created by other processes. Here's a rapid prototype to make that work (though perhaps not as efficiently as we'd want, not sure). With that in place, I wonder what else we'd need to extend your patch to cover all three operations... it can't be much! Of course I don't want to derail your work on Parallel Insert, I'm just providing some motivation for my comments on the (IMHO) shortsightedness of some of the coding. PS Why not use git format-patch to create patches? From ad2b5e07a09603b09859dfcbde6addd51096cbb0 Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Fri, 9 Oct 2020 00:27:07 +1300 Subject: [PATCH] Coordinate combo command IDs with parallel workers. Previously, we would serialize the leader's combo command IDs and restore a read-only copy of them in worker processes, not allowing updates. Instead, migrate them into shared memory, in preparation for parallel update/delete queries where new combo command IDs might need to be created in any process and visible to others. XXX This design causes every backend to maintain its own deduplication hash table, and requires a shared lock to look up any combocid. Both policies could be reconsidered, basically a memory size vs locking tradeoff. Need some experience/profiling of real work to see how much any of this really matters. 
Discussion: https://postgr.es/m/CAJcOf-cXnB5cnMKqWEp2E2z7Mvcd04iLVmV%3DqpFJrR3AcrTS3g%40mail.gmail.com --- src/backend/access/common/session.c| 28 +- src/backend/access/heap/heapam.c | 10 - src/backend/access/transam/README.parallel | 6 - src/backend/access/transam/parallel.c | 14 - src/backend/storage/lmgr/lwlock.c | 2 + src/backend/utils/time/combocid.c | 290 - src/include/access/session.h | 20 +- src/include/storage/lwlock.h | 1 + src/include/utils/combocid.h | 8 +- 9 files changed, 275 insertions(+), 104 deletions(-) diff --git a/src/backend/access/common/session.c b/src/backend/access/common/session.c index 0ec61d48a2..7e1bffb680 100644 --- a/src/backend/access/common/session.c +++ b/src/backend/access/common/session.c @@ -23,6 +23,7 @@ #include "access/session.h" #include "storage/lwlock.h" #include "storage/shm_toc.h" +#include "utils/combocid.h" #include "utils/memutils.h" #include "utils/typcache.h" @@ -43,6 +44,7 @@ */ #define SESSION_KEY_DSA UINT64CONST(0x0001) #define SESSION_KEY_RECORD_TYPMOD_REGISTRY UINT64CONST(0x0002) +#define SESSION_KEY_FIXED UINT64CONST(0x0003) /* This backend's current session. */ Session*CurrentSession = NULL; @@ -74,8 +76,10 @@ GetSessionDsmHandle(void) dsm_segment *seg; size_t typmod_registry_size; size_t size; + void *fixed_space; void *dsa_space; void *typmod_registry_space; + SessionFixed *fixed; dsa_area *dsa; MemoryContext old_context; @@ -91,6 +95,10 @@ GetSessionDsmHandle(void) old_context = MemoryContextSwitchTo(TopMemoryContext); shm_toc_initialize_estimator(&estimator); + /* Estimate size for the fixed-sized per-session state. */ + shm_toc_estimate_keys(&estimator, 1); + shm_toc_estimate_chunk(&estimator, sizeof(SessionFixed)); + /* Estimate space for the per-session DSA area. */ shm_toc_estimate_keys(&estimator, 1); shm_toc_estimate_chunk(&estimator, SESSION_DSA_SIZE); @@ -113,6 +121,14 @@ GetSessionDsmHandle(void) dsm_segment_address(seg), size); + /* Create the simple fixed-sized session state. */ + fixed_space = shm_toc_allocate(toc, sizeof(SessionFixed)); + fixed = (SessionFixed *) fixed_space; + memset(fixed, 0, sizeof(*fixed)); + LWLockInitialize(&fixed->shared_combocid_lock, LWTRANCHE_SHARED_COMBO
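On the point about centralising which DML command types are allowed under parallelism, one way to avoid scattering CMD_INSERT comparisons would be a single helper along the lines of the sketch below; the name is made up, and today it would allow only INSERT, but UPDATE/DELETE could later be added in one place.

#include "postgres.h"
#include "nodes/nodes.h"

/*
 * Sketch: single point of truth for which DML command types may use
 * parallelism.  Currently only INSERT qualifies.
 */
static inline bool
parallel_modify_allowed(CmdType commandType)
{
    return commandType == CMD_INSERT;
}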
Re: [HACKERS] Custom compression methods
On Thu, Oct 08, 2020 at 02:38:27PM +0530, Dilip Kumar wrote: On Wed, Oct 7, 2020 at 5:00 PM Dilip Kumar wrote: On Wed, Oct 7, 2020 at 10:26 AM Dilip Kumar wrote: > > On Tue, Oct 6, 2020 at 10:21 PM Tomas Vondra > wrote: > > > > On Tue, Oct 06, 2020 at 11:00:55AM +0530, Dilip Kumar wrote: > > >On Mon, Oct 5, 2020 at 9:34 PM Tomas Vondra > > > wrote: > > >> > > >> On Mon, Oct 05, 2020 at 07:57:41PM +0530, Dilip Kumar wrote: > > >> >On Mon, Oct 5, 2020 at 5:53 PM Tomas Vondra > > >> > wrote: > > >> >> > > >> >> On Mon, Oct 05, 2020 at 11:17:28AM +0530, Dilip Kumar wrote: > > >> >> >On Mon, Oct 5, 2020 at 3:37 AM Tomas Vondra > > >> >> > wrote: > > >> >> > > > >> >> >Thanks, Tomas for your feedback. > > >> >> > > > >> >> >> 9) attcompression ... > > >> >> >> > > >> >> >> The main issue I see is what the patch does with attcompression. Instead > > >> >> >> of just using it to store a the compression method, it's also used to > > >> >> >> store the preserved compression methods. And using NameData to store > > >> >> >> this seems wrong too - if we really want to store this info, the correct > > >> >> >> way is either using text[] or inventing charvector or similar. > > >> >> > > > >> >> >The reason for using the NameData is the get it in the fixed part of > > >> >> >the data structure. > > >> >> > > > >> >> > > >> >> Why do we need that? It's possible to have varlena fields with direct > > >> >> access (see pg_index.indkey for example). > > >> > > > >> >I see. While making it NameData I was thinking whether we have an > > >> >option to direct access the varlena. Thanks for pointing me there. I > > >> >will change this. > > >> > > > >> > Adding NameData just to make > > >> >> it fixed-length means we're always adding 64B even if we just need a > > >> >> single byte, which means ~30% overhead for the FormData_pg_attribute. > > >> >> That seems a bit unnecessary, and might be an issue with many attributes > > >> >> (e.g. with many temp tables, etc.). > > >> > > > >> >You are right. Even I did not like to keep 64B for this, so I will change it. > > >> > > > >> >> > > >> >> >> But to me this seems very much like a misuse of attcompression to track > > >> >> >> dependencies on compression methods, necessary because we don't have a > > >> >> >> separate catalog listing compression methods. If we had that, I think we > > >> >> >> could simply add dependencies between attributes and that catalog. > > >> >> > > > >> >> >Basically, up to this patch, we are having only built-in compression > > >> >> >methods and those can not be dropped so we don't need any dependency > > >> >> >at all. We just want to know what is the current compression method > > >> >> >and what is the preserve compression methods supported for this > > >> >> >attribute. Maybe we can do it better instead of using the NameData > > >> >> >but I don't think it makes sense to add a separate catalog? > > >> >> > > > >> >> > > >> >> Sure, I understand what the goal was - all I'm saying is that it looks > > >> >> very much like a workaround needed because we don't have the catalog. > > >> >> > > >> >> I don't quite understand how could we support custom compression methods > > >> >> without listing them in some sort of catalog? > > >> > > > >> >Yeah for supporting custom compression we need some catalog. > > >> > > > >> >> >> Moreover, having the catalog would allow adding compression methods > > >> >> >> (from extensions etc) instead of just having a list of hard-coded > > >> >> >> compression methods. 
Which seems like a strange limitation, considering > > >> >> >> this thread is called "custom compression methods". > > >> >> > > > >> >> >I think I forgot to mention while submitting the previous patch that > > >> >> >the next patch I am planning to submit is, Support creating the custom > > >> >> >compression methods wherein we can use pg_am catalog to insert the new > > >> >> >compression method. And for dependency handling, we can create an > > >> >> >attribute dependency on the pg_am row. Basically, we will create the > > >> >> >attribute dependency on the current compression method AM as well as > > >> >> >on the preserved compression methods AM. As part of this, we will > > >> >> >add two build-in AMs for zlib and pglz, and the attcompression field > > >> >> >will be converted to the oid_vector (first OID will be of the current > > >> >> >compression method, followed by the preserved compression method's > > >> >> >oids). > > >> >> > > > >> >> > > >> >> Hmmm, ok. Not sure pg_am is the right place - compression methods don't > > >> >> quite match what I though AMs are about, but maybe it's just my fault. > > >> >> > > >> >> FWIW it seems a bit strange to first do the attcompression magic and > > >> >> then add the catalog later - I think we should start with the catalog > > >> >> right away. The advantage is that if we end up committing only some of > > >> >> the patches in this cycle, we already have all the infrastructure etc
Re: POC: postgres_fdw insert batching
On Thu, Oct 08, 2020 at 02:40:10AM +, tsunakawa.ta...@fujitsu.com wrote: Hello Tomas san, Thank you for picking up this. I'm interested in this topic, too. (As an aside, we'd like to submit a bulk insert patch for ECPG in the near future.) As others referred, Andrey-san's fast COPY to foreign partitions is also promising. But I think your bulk INSERT is a separate feature and offers COPY cannot do -- data transformation during loading with INSERT SELECT and CREATE TABLE AS SELECT. Is there anything that makes you worry and stops development? Could I give it a try to implement this (I'm not sure I can, sorry. I'm worried if we can change the executor's call chain easily.) It's primarily a matter of having too much other stuff on my plate, thus not having time to work on this feature. I was not too worried about any particular issue, but I wanted some feedback before spending more time on extending the API. I'm not sure when I'll have time to work on this again, so if you are interested and willing to work on it, please go ahead. I'll gladly do reviews and help you with it. 1) Extend the FDW API? Yes, I think, because FDWs for other DBMSs will benefit from this. (But it's questionable whether we want users to transfer data in Postgres database to other DBMSs...) I think transferring data to other databases is fine - interoperability is a big advantage for users, I don't see it as something threatening the PostgreSQL project. I doubt this would make it more likely for users to migrate from PostgreSQL - there are many ways to do that already. MySQL and SQL Server has the same bulk insert syntax as Postgres, i.e., INSERT INTO table VALUES(record1), (record2), ... Oracle doesn't have this syntax, but it can use CTE as follows: INSERT INTO table WITH t AS ( SELECT record1 FROM DUAL UNION ALL SELECT record2 FROM DUAL UNION ALL ... ) SELECT * FROM t; And many DBMSs should have CTAS, INSERT SELECT, and INSERT SELECT record1 UNION ALL SELECT record2 ... True. In some cases INSERT may be replaced by COPY, but it has various other features too. The API would simply be: TupleTableSlot ** ExecForeignMultiInsert(EState *estate, ResultRelInfo *rinfo, TupleTableSlot **slot, TupleTableSlot **planSlot, int numSlots); +1, seems quite reasonable 2) What about the insert results? I'm wondering if we can report success or failure of each inserted row, because the remote INSERT will fail entirely. Other FDWs may be able to do it, so the API can be like above. Yeah. I think handling complete failure should not be very difficult, but there are cases that worry me more. For example, what if there's a before trigger (on the remote db) that "skips" inserting some of the rows by returning NULL? For the same reason, support for RETURNING clause will vary from DBMS to DBMS. Yeah. I wonder if the FDW needs to indicate which features are supported by the ExecForeignMultiInsert, e.g. by adding a function that decides whether batch insert is supported (it might also do that internally by calling ExecForeignInsert, of course). 3) What about the other DML operations (DELETE/UPDATE)? I don't think they are necessary for the time being. If we want them, they will be implemented using the libpq batch/pipelining as Andres-san said. I agree. 3) Should we do batching for COPY insteads? I'm thinking of issuing INSERT with multiple records as your patch does, because: * When the user executed INSERT statements, it would look strange to the user if the remote SQL is displayed as COPY. * COPY doesn't invoke rules unlike INSERT. 
(I don't think rules are a feature that users care about, though.) Also, I'm a bit concerned that there might be, or will be, other differences between INSERT and COPY. I agree. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
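To make the batching idea a bit more concrete, an FDW could buffer up to numSlots rows and emit a single remote INSERT with one parameter list per row, along the lines of the sketch below (an illustration of the approach, not postgres_fdw's actual deparser).

#include "postgres.h"
#include "lib/stringinfo.h"

/*
 * Sketch: append " VALUES ($1, $2), ($3, $4), ..." to a remote INSERT
 * statement, one parenthesised list per buffered slot.
 */
static void
append_batch_values(StringInfo sql, int numSlots, int numCols)
{
    int     pindex = 1;

    appendStringInfoString(sql, " VALUES ");
    for (int i = 0; i < numSlots; i++)
    {
        if (i > 0)
            appendStringInfoString(sql, ", ");
        appendStringInfoChar(sql, '(');
        for (int j = 0; j < numCols; j++)
            appendStringInfo(sql, "%s$%d", (j > 0) ? ", " : "", pindex++);
        appendStringInfoChar(sql, ')');
    }
}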
Re: speed up unicode normalization quick check
On Thu, Oct 8, 2020 at 8:29 AM Michael Paquier wrote: > On Thu, Oct 08, 2020 at 04:52:18AM -0400, John Naylor wrote: > > Looks fine overall, but one minor nit: I'm curious why you made a > separate > > section in the pgindent exclusions. The style in that file seems to be > one > > comment per category. > > Both parts indeed use PerfectHash.pm, but are generated by different > scripts so that looked better as separate items. Okay, thanks. -- John Naylor EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Assertion failure with LEFT JOINs among >500 relations
On Fri, 9 Oct 2020 at 08:16, Onder Kalaci wrote: > I hit an assertion failure. When asserts disabled, it works fine even with > more tables (>5000). > > Steps to reproduce: > CREATE TABLE users_table (user_id int, time timestamp, value_1 int, value_2 > int, value_3 float, value_4 bigint); > 250 relations work fine, see the query (too long to copy & paste here): > https://gist.github.com/onderkalaci/2b40a18d989da389ee4fb631e1ad7c0e#file-steps_to_assert_pg-sql-L41 I had a quick look at this and I can recreate it using the following (using psql) select 'explain select count(*) from users_table ' || string_Agg('LEFT JOIN users_table u'|| x::text || ' USING (user_id)',' ') from generate_Series(1,379)x; \gexec That triggers the assert due to the Assert(outer_skip_rows <= outer_rows); failing in initial_cost_mergejoin(). The reason it fails is that outer_path_rows has become infinity due to calc_joinrel_size_estimate continually multiplying in the join selectivity of 0.05 (due to our 200 default num distinct from lack of any stats) which after a number of iterations causes the number to become very large. Instead of running 379 joins from above, try with 378 and you get: Aggregate (cost=NaN..NaN rows=1 width=8) -> Nested Loop Left Join (cost=33329.16..NaN rows=Infinity width=0) Join Filter: (users_table.user_id = u378.user_id) -> Merge Left Join (cost=33329.16.. width=4) Merge Cond: (users_table.user_id = u377.user_id) -> Merge Left Join (cost=33240.99.. width=4) Changing the code in initial_cost_mergejoin() to add: if (outer_path_rows <= 0 || isnan(outer_path_rows)) outer_path_rows = 1; +else if (isinf(outer_path_rows)) +outer_path_rows = DBL_MAX; does seem to fix the problem, but that's certainly not the right fix. Perhaps the right fix is to modify clamp_row_est() with: @@ -193,7 +194,9 @@ clamp_row_est(double nrows) * better and to avoid possible divide-by-zero when interpolating costs. * Make it an integer, too. */ - if (nrows <= 1.0) + if (isinf(nrows)) + nrows = rint(DBL_MAX); + else if (nrows <= 1.0) nrows = 1.0; else nrows = rint(nrows); but the row estimates are getting pretty insane well before then. DBL_MAX is 226 orders of magnitude more than the estimated number of atoms in the observable universe, so it seems pretty unreasonable that someone might figure out a way to store that many tuples on a disk any time soon. Perhaps DBL_MAX is way to big a number to clamp at. I'm just not sure what we should reduce it to so that it is reasonable. David
Re: Assertion failure with LEFT JOINs among >500 relations
David Rowley writes: > The reason it fails is that outer_path_rows has become infinity due to > calc_joinrel_size_estimate continually multiplying in the join > selectivity of 0.05 (due to our 200 default num distinct from lack of > any stats) which after a number of iterations causes the number to > become very large. 0.005, but yeah. We're estimating that each additional join inflates the output size by about 6x (1270 * 0.005), and after a few hundred of those, it'll overflow. > Perhaps the right fix is to modify clamp_row_est() with: I thought of that too, but as you say, if the rowcount has overflowed a double then we've got way worse problems. It'd make more sense to try to keep the count to a saner value in the first place. In the end, (a) this is an Assert, so not a problem for production systems, and (b) it's going to take you longer than you want to wait to join 500+ tables, anyhow, unless maybe they're empty. I'm kind of disinclined to do anything in the way of a band-aid fix. If somebody has an idea for a different way of estimating the join size with no stats, we could talk about that. I notice though that the only way a plan of this sort isn't going to blow up at execution is if the join multiplication factor is at most 1, ie the join key is unique. But guess what, we already know what to do in that case. Adding a unique or pkey constraint to users_table.user_id causes the plan to collapse entirely (if they're left joins) or at least still produce a small rowcount estimate (if plain joins). regards, tom lane
Re: Assertion failure with LEFT JOINs among >500 relations
On Fri, 9 Oct 2020 at 12:16, Tom Lane wrote: > > > Perhaps the right fix is to modify clamp_row_est() with: > > I thought of that too, but as you say, if the rowcount has overflowed a > double then we've got way worse problems. It'd make more sense to try > to keep the count to a saner value in the first place. I wonder if there was something more logical we could do to maintain sane estimates too, but someone could surely still cause it to blow up by writing a long series of clause-less joins. We can't really get away from the fact that we must estimate those as inner_rows * outer_rows. I admit it's annoying to add cycles to clamp_row_est() for such insane cases. David
Re: Minor documentation error regarding streaming replication protocol
On Thu, Oct 08, 2020 at 04:23:06PM -0400, Bruce Momjian wrote: > I have looked at this. It seems SendTimeLineHistory() is sending raw > bytes from the history file, with no encoding conversion, and > ReceiveXlogStream() is receiving it, again assuming it is just plain > text. I am not sure we really have an SQL data type where we do this. > BYTEA doesn't do encoding conversion, but does backslash procesing, and > TEXT does encoding conversion. > > I suppose we either have to document this as BYTEA with no backslash > processing, or TEXT with no encoding conversion --- I think I prefer the > later. As StartupXLOG() shows, the timeline history file can include, as the reason, the recovery target name, which may not consist of only ASCII characters since that's the value the user specified in pg_create_restore_point, so bytea is correct, no? -- Michael signature.asc Description: PGP signature
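For reference, what travels in that column is just the history file's own lines, i.e. one tab-separated record of parent timeline ID, switch point, and reason per line, for example (values here are made up for illustration):

    1	0/16B8F50	no recovery target specified
    2	0/3000158	at restore point "before_upgrade"

so the reason text can be exactly what the user passed to pg_create_restore_point, which is also why unescaped tabs and newlines reach the client.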
Re: Assertion failure with LEFT JOINs among >500 relations
David Rowley writes: > I admit it's annoying to add cycles to clamp_row_est() for such insane cases. I poked at this a bit more closely, and noted that the actual problem is that when we do this: outer_skip_rows = rint(outer_path_rows * outerstartsel); we have outer_path_rows = inf, outerstartsel = 0, and of course inf times zero is NaN. So we end up asserting "NaN <= Inf", not "Inf <= Inf" (which wouldn't have caused a problem). If we did want to do something here, I'd consider something like if (isnan(outer_skip_rows)) outer_skip_rows = 0; if (isnan(inner_skip_rows)) inner_skip_rows = 0; (We shouldn't need that for outer_rows/inner_rows, since the endsel values can't be 0.) Messing with clamp_row_est would be a much more indirect way of fixing it, as well as having more widespread effects. In the end though, I'm still not terribly excited about this. regards, tom lane
Re: [HACKERS] logical decoding of two-phase transactions
On Thu, Oct 8, 2020 at 5:25 PM Amit Kapila wrote: > > COMMENT > > Line 177 > > +logicalrep_read_prepare(StringInfo in, LogicalRepPrepareData * > > prepare_data) > > > > prepare_data->prepare_type = flags; > > This code may be OK but it does seem a bit of an abuse of the flags. > > > > e.g. Are they flags or are they really enum values? > > e.g. And if they are effectively enums (it appears they are) then it > > seemed inconsistent that |= was used when they were previously > > assigned. > > > > ; > > I don't understand this point. As far as I can see at the time of > write (logicalrep_write_prepare()), the patch has used |=, and at the > time of reading (logicalrep_read_prepare()) it has used assignment > which seems correct from the code perspective. Do you have a better > proposal? OK. I will explain my thinking when I wrote that review comment. I agree all is "correct" from a code perspective. But IMO using bit arithmetic implies that different combinations are also possible, whereas in the current code they are not. So the code is kind of having a bet each way - sometimes treating "flags" as bit flags and sometimes as enums. e.g. If these flags are not really bit flags at all then the logicalrep_write_prepare() code might just as well be written as below: BEFORE if (rbtxn_commit_prepared(txn)) flags |= LOGICALREP_IS_COMMIT_PREPARED; else if (rbtxn_rollback_prepared(txn)) flags |= LOGICALREP_IS_ROLLBACK_PREPARED; else flags |= LOGICALREP_IS_PREPARE; /* Make sure exactly one of the expected flags is set. */ if (!PrepareFlagsAreValid(flags)) elog(ERROR, "unrecognized flags %u in prepare message", flags); AFTER if (rbtxn_commit_prepared(txn)) flags = LOGICALREP_IS_COMMIT_PREPARED; else if (rbtxn_rollback_prepared(txn)) flags = LOGICALREP_IS_ROLLBACK_PREPARED; else flags = LOGICALREP_IS_PREPARE; ~ OTOH, if you really do want to anticipate having future flag bit combinations then maybe the PrepareFlagsAreValid() macro ought to be tweaked accordingly, and the logicalrep_read_prepare() code should maybe look more like below: BEFORE /* set the action (reuse the constants used for the flags) */ prepare_data->prepare_type = flags; AFTER /* set the action (reuse the constants used for the flags) */ prepare_data->prepare_type = flags & LOGICALREP_IS_COMMIT_PREPARED ? LOGICALREP_IS_COMMIT_PREPARED : flags & LOGICALREP_IS_ROLLBACK_PREPARED ? LOGICALREP_IS_ROLLBACK_PREPARED : LOGICALREP_IS_PREPARE; Kind Regards. Peter Smith Fujitsu Australia
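To make the "exactly one of" intent concrete, here is a standalone sketch of the kind of validation macro mentioned above; the constant values and the macro body are illustrative and not taken from the patch:

#include <stdio.h>

/* illustrative values only; the patch defines its own constants */
#define LOGICALREP_IS_PREPARE           0x01
#define LOGICALREP_IS_COMMIT_PREPARED   0x02
#define LOGICALREP_IS_ROLLBACK_PREPARED 0x04

/* true iff exactly one of the expected flags is set */
#define PrepareFlagsAreValid(flags) \
    ((flags) == LOGICALREP_IS_PREPARE || \
     (flags) == LOGICALREP_IS_COMMIT_PREPARED || \
     (flags) == LOGICALREP_IS_ROLLBACK_PREPARED)

int
main(void)
{
    unsigned int flags = LOGICALREP_IS_COMMIT_PREPARED;

    if (!PrepareFlagsAreValid(flags))
        printf("unrecognized flags %u in prepare message\n", flags);
    else
        printf("flags %u identify exactly one prepare action\n", flags);
    return 0;
}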
Re: [PATCH] ecpg: fix progname memory leak
On Thu, Oct 08, 2020 at 01:17:25PM -0400, Tom Lane wrote: > Agreed. In theory there might be some point in removing leaks that happen > per-statement, so as to avoid unreasonable memory bloat when processing an > extremely long input file. In practice nobody has complained about that, > and if somebody did I'd probably question the sanity of putting so much > code into one file. (The C compiler would likely bloat even more while > processing the output...) Moreover, given the way the ecpg grammar works, > it'd be really painful to avoid all such leaks, and it might well > introduce bugs not fix them. I got to wondering about that, and I don't quite see how to make all the memory handling clean without having at least the concept of a per-statement memory context to make cleanup easy once a statement is done. That's a lot of infrastructure for a case nobody has really complained about, indeed. > In any case, if this TODO item is going to lead to ideas as dubious > as "let's free progname before exiting", it's not helpful. +1. Agreed to just remove it. -- Michael
Re: Assertion failure with LEFT JOINs among >500 relations
On Fri, 9 Oct 2020 at 12:59, Tom Lane wrote: > If we did want to do something here, I'd consider something like > > if (isnan(outer_skip_rows)) > outer_skip_rows = 0; > if (isnan(inner_skip_rows)) > inner_skip_rows = 0; Are you worried that, with that fix, the costs above the join that triggers it would still come out as NaN? It appears that's the case, and cost comparisons of paths with NaN costs are not going to do anything along the lines of sane. I guess whether or not that matters depends on whether we expect any real queries to hit this, or whether we just want to stop the Assert failure. ... 500 joins. I'm willing to listen to an explanation of the use case, but in the absence of that explanation, I'd be leaning towards "you're doing it wrong". If that turns out to be true, then perhaps your proposed fix is okay. David
RE: [Patch] Optimize dropping of relation buffers using dlist
From: Jamison, Kirk/ジャミソン カーク > > (6) > > + bufHdr->tag.blockNum >= > > firstDelBlock[j]) > > + InvalidateBuffer(bufHdr); /* > > releases spinlock */ > > > > The right side of >= should be cur_block. > > Fixed. >= should be =, shouldn't it? Please measure and let us see just the recovery performance again because the critical part of the patch was modified. If the performance is good as the previous one, and there's no review interaction with others in progress, I'll mark the patch as ready for committer in a few days. Regards Takayuki Tsunakawa
Remove some unnecessary if-condition
Hi I found some likely unnecessary if-conditions in the code. 1. Some checks in else branches seem unnecessary. In (/src/backend/replication/logical/reorderbuffer.c) ① @@ -4068,7 +4068,7 @@ ReorderBufferToastAppendChunk(ReorderBuffer *rb, ReorderBufferTXN *txn, > bool found; > if (!found) > { >... > } > else if (found && chunk_seq != ent->last_chunk_seq + 1) >... The check of "found" in the else if branch seems unnecessary. ② (/src/backend/utils/init/postinit.c) @@ -924,11 +924,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username, > bool bootstrap = IsBootstrapProcessingMode(); > if (bootstrap) > { >... > } > else if(...) > {...} > else > { >if (!bootstrap) >{ >... >} > } The check of "bootstrap" in the else branch seems unnecessary. 2. In (/src/interfaces/ecpg/compatlib/informix.c) @@ -944,7 +944,7 @@ rupshift(char *str) > for (len--; str[len] && str[len] == ' '; len--); The first "str[len]" test seems unnecessary, since "str[len] == ' '" already covers it (the comparison is false at the terminating NUL). Do you think we should remove these if-conditions as a code cleanup? Best regards, houzj 0001-Remove-some-unnecessary-path.patch Description: 0001-Remove-some-unnecessary-path.patch
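On the third point, a tiny standalone check (not the ecpg code itself) of why the extra truth test is redundant: at the terminating NUL, str[len] == ' ' is already false, so the loop stops in the same place either way.

#include <stdio.h>
#include <string.h>

int
main(void)
{
    char    str[] = "abc   ";
    int     len = (int) strlen(str);

    /* skip trailing blanks; also stops at the NUL terminator */
    for (len--; str[len] == ' '; len--)
        ;

    printf("last non-blank character: %c at index %d\n", str[len], len);
    return 0;
}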
Re: Expansion of our checks for connection-loss errors
At Thu, 08 Oct 2020 15:15:54 -0400, Tom Lane wrote in > Over in the thread at [1], we've tentatively determined that the > reason buildfarm member lorikeet is currently failing is that its > network stack returns ECONNABORTED for (some?) connection failures, > whereas our code is only expecting ECONNRESET. Fujii Masao therefore > proposes that we treat ECONNABORTED the same as ECONNRESET. I think > this is a good idea, but after a bit of research I feel it does not > go far enough. I find these POSIX-standard errnos that also seem > likely candidates to be returned for a hard loss of connection: > > ECONNABORTED > EHOSTUNREACH > ENETDOWN > ENETUNREACH > > All of these have been in POSIX since SUSv2, so it seems unlikely > that we need to #ifdef any of them. (It is in any case pretty silly > that we have #ifdefs around a very small minority of our references > to ECONNRESET :-(.) > > There are some other related errnos, such as ECONNREFUSED, that > don't seem like they'd be returned for a failure of a pre-existing > connection, so we don't need to include them in such tests. > > Accordingly, I propose the attached patch (an expansion of > Fujii-san's) that causes us to test for all five errnos anyplace > we had been checking for ECONNRESET. I felt that this was getting to > the point where we'd better centralize the knowledge of what to check, > so the patch does that, via an inline function and an admittedly hacky > macro. I also upgraded some places such as strerror.c to have full > support for these symbols. > > All of the machines I have (even as far back as HPUX 10.20) also > define ENETRESET and EHOSTDOWN. However, those symbols do not appear > in SUSv2. ENETRESET was added at some later point, but EHOSTDOWN is > still not in POSIX. For the moment I've left these second-tier > symbols out of the patch, but there's a case for adding them. I'm > not sure whether there'd be any point in trying to #ifdef them. > > BTW, I took out the conditional defines of some of these errnos in > libpq's win32.h; AFAICS that's been dead code ever since we added > #define's for them to win32_port.h. Am I missing something? > > This seems like a bug fix to me, so I'm inclined to back-patch. > > regards, tom lane > > [1] > https://www.postgresql.org/message-id/flat/E1kPc9v-0005L4-2l%40gemulon.postgresql.org +1 for the direction. In terms of connection errors, connect(2) and bind(2) can return EADDRNOTAVAIL. bind(2) and listen(2) can return EADDRINUSE. FWIW I recetnly saw pgbench getting EADDRNOTAVAIL. (They have mapping from respective WSA errors in TranslateSocketError()) I'm not sure how we should treat EMFILE/ENFILE/ENOBUFS/ENOMEM from accept(2). (select(2) can return ENOMEM.) I'd make errno_is_connection_loss use ALL_CONNECTION_LOSS_ERRNOS to avoid duplication definition of the errno list. - if (ret < 0 && WSAGetLastError() == WSAECONNRESET) + if (ret < 0 && errno_is_connection_loss(WSAGetLastError())) Don't we need to use TranslateSocketError() before? + /* We might get ECONNRESET etc here if using TCP and backend died */ + if (errno_is_connection_loss(SOCK_ERRNO)) Perhaps I'm confused but SOCK_ERROR doesn't seem portable between Windows and Linux. = /* * These macros are needed to let error-handling code be portable between * Unix and Windows. 
(ugh) */ #ifdef WIN32 #define SOCK_ERRNO (WSAGetLastError()) #define SOCK_STRERROR winsock_strerror #define SOCK_ERRNO_SET(e) WSASetLastError(e) #else #define SOCK_ERRNO errno #define SOCK_STRERROR strerror_r #define SOCK_ERRNO_SET(e) (errno = (e)) #endif = AFAICS SOCK_ERRNO is intended to be used idiomatically as: > SOCK_STRERROR(SOCK_ERRNO, ...) The WSAE values from WSAGetLastError() and E values in errno are not compatible and needs translation by TranslateSocketError()? regards. -- Kyotaro Horiguchi NTT Open Source Software Center
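For reference, a minimal self-contained sketch of the centralized check being discussed; the function name follows the thread, but the exact shape in the eventual patch may differ:

#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* the five POSIX errnos proposed above as "hard loss of connection" */
static inline bool
errno_is_connection_loss(int e)
{
    switch (e)
    {
        case ECONNRESET:
        case ECONNABORTED:
        case EHOSTUNREACH:
        case ENETDOWN:
        case ENETUNREACH:
            return true;
        default:
            return false;
    }
}

int
main(void)
{
    printf("%d\n", errno_is_connection_loss(ECONNRESET));   /* 1 */
    printf("%d\n", errno_is_connection_loss(ECONNREFUSED)); /* 0: connection-time error */
    return 0;
}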
Re: Expansion of our checks for connection-loss errors
Kyotaro Horiguchi writes: > At Thu, 08 Oct 2020 15:15:54 -0400, Tom Lane wrote in >> Accordingly, I propose the attached patch (an expansion of >> Fujii-san's) that causes us to test for all five errnos anyplace >> we had been checking for ECONNRESET. > +1 for the direction. > In terms of connection errors, connect(2) and bind(2) can return > EADDRNOTAVAIL. bind(2) and listen(2) can return EADDRINUSE. FWIW I > recetnly saw pgbench getting EADDRNOTAVAIL. (They have mapping from > respective WSA errors in TranslateSocketError()) I do not think we have any issues with connection-time errors; or at least, if we do, the spots being touched here certainly shouldn't need to worry about them. These places are dealing with already-established connections. > I'd make errno_is_connection_loss use ALL_CONNECTION_LOSS_ERRNOS to > avoid duplication definition of the errno list. Hmm, might be worth doing, but I'm not sure. I am worried about whether compilers will generate equally good code that way. > - if (ret < 0 && WSAGetLastError() == WSAECONNRESET) > + if (ret < 0 && errno_is_connection_loss(WSAGetLastError())) > Don't we need to use TranslateSocketError() before? Oh, I missed that. But: > Perhaps I'm confused but SOCK_ERROR doesn't seem portable between > Windows and Linux. In that case, nothing would have worked on Windows for the last ten years, so you're mistaken. I think the actual explanation why this works, and why that test in parallel.c probably still works even with my mistake, is that win32_port.h makes sure that our values of ECONNRESET etc match WSAECONNRESET etc. IOW, we'd not actually need TranslateSocketError at all, except that it maps some not-similarly-named error codes for conditions that don't exist in Unix into ones that do. We probably do want TranslateSocketError in this parallel.c test so that anything that it maps to one of the errno_is_connection_loss codes will be recognized; but the basic cases would work anyway, unless I misunderstand this stuff entirely. regards, tom lane
Re: Assertion failure with LEFT JOINs among >500 relations
David Rowley writes: > Are you worried about the costs above the join that triggers that > coming out as NaN with that fix? It appears that's the case. [ pokes at that... ] Yeah, it looks like nestloop cost estimation also has some issues with inf-times-zero producing NaN; it's just not asserting about it. I notice there are some other ad-hoc isnan() checks scattered about costsize.c, too. Maybe we should indeed consider fixing clamp_row_estimate to get rid of inf (and nan too, I suppose) so that we'd not need those. I don't recall the exact cases that made us introduce those checks, but they were for cases a lot more easily reachable than this one, I believe. regards, tom lane
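A self-contained sketch of the kind of clamping being floated here (this is not the actual clamp_row_est(); the finite cap is an assumption picked purely for illustration):

#include <math.h>
#include <stdio.h>

#define MAXIMUM_ROWCOUNT 1e100  /* illustrative finite cap */

static double
clamp_row_est_sketch(double nrows)
{
    /* map NaN and anything above the cap (including +Inf) to the cap */
    if (isnan(nrows) || nrows > MAXIMUM_ROWCOUNT)
        nrows = MAXIMUM_ROWCOUNT;
    else if (nrows <= 1.0)
        nrows = 1.0;
    else
        nrows = rint(nrows);
    return nrows;
}

int
main(void)
{
    printf("%g\n", clamp_row_est_sketch(INFINITY));       /* 1e+100 */
    printf("%g\n", clamp_row_est_sketch(0.0 * INFINITY)); /* NaN -> 1e+100 */
    printf("%g\n", clamp_row_est_sketch(1234.4));         /* 1234 */
    return 0;
}

With estimates kept finite like this, the inf-times-zero path that produced the NaN above could not arise in the first place.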
Re: [Patch] Optimize dropping of relation buffers using dlist
At Fri, 9 Oct 2020 00:41:24 +, "tsunakawa.ta...@fujitsu.com" wrote in > From: Jamison, Kirk/ジャミソン カーク > > > (6) > > > + bufHdr->tag.blockNum >= > > > firstDelBlock[j]) > > > + InvalidateBuffer(bufHdr); /* > > > releases spinlock */ > > > > > > The right side of >= should be cur_block. > > > > Fixed. > > >= should be =, shouldn't it? > > Please measure and let us see just the recovery performance again because the > critical part of the patch was modified. If the performance is good as the > previous one, and there's no review interaction with others in progress, I'll > mark the patch as ready for committer in a few days. The performance is expected to be kept since smgrnblocks() is called in a non-hot code path and actually it is called at most four times per buffer drop in this patch. But it's better to make sure. I have some comments on the latest patch. @@ -445,6 +445,7 @@ BlockNumber visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks) { BlockNumber newnblocks; + bool cached; All the variables added by 0002 are useless because none of the caller sites are interested in the value. smgrnblocks should accept NULL as isCached. (I agree with Tsunakawa-san that the camel-case name is not common there.) + nForkBlocks[i] = smgrnblocks(smgr_reln, forkNum[i], &isCached); + + if (!isCached) "is cached" is not the property that the code is interested in. No other callers to smgrnblocks are interested in that property. The need for caching is purely internal to smgrnblocks(). On the other hand, we are going to utilize the property of "accuracy" that is a byproduct of reducing fseek calls, and, again, we are not interested in how it is achieved. So I suggest that the name should be "accurate" or something else that does not suggest the mechanism used under the hood. + if (nTotalBlocks != InvalidBlockNumber && + nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD) I don't think nTotalBlocks is useful. What we need here is only the total blocks for every fork (nForkBlocks[]) and the total number of buffers to be invalidated for all forks (nBlocksToInvalidate). > > > The right side of >= should be cur_block. > > > > Fixed. > > >= should be =, shouldn't it? It's just out of paranoia. What we are going to invalidate is blocks whose blockNum >= curBlock. Although actually there's no chance of any other process having replaced the buffer with another page (with a lower blockid) of the same relation after BufTableLookup(), that condition makes sure not to leave behind any blocks that should be invalidated. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: [Patch] Optimize dropping of relation buffers using dlist
Oops! Sorry for the mistake. At Fri, 09 Oct 2020 11:12:01 +0900 (JST), Kyotaro Horiguchi wrote in > At Fri, 9 Oct 2020 00:41:24 +, "tsunakawa.ta...@fujitsu.com" > wrote in > > From: Jamison, Kirk/ジャミソン カーク > > > > (6) > > > > + bufHdr->tag.blockNum >= > > > > firstDelBlock[j]) > > > > + InvalidateBuffer(bufHdr); > > > > /* > > > > releases spinlock */ > > > > > > > > The right side of >= should be cur_block. > > > > > > Fixed. > > > > >= should be =, shouldn't it? > > > > Please measure and let us see just the recovery performance again because > > the critical part of the patch was modified. If the performance is good as > > the previous one, and there's no review interaction with others in > > progress, I'll mark the patch as ready for committer in a few days. > > The performance is expected to be kept since smgrnblocks() is called > in a non-hot code path and actually it is called at most four times > per a buffer drop in this patch. But it's better making it sure. > > I have some comments on the latest patch. > > @@ -445,6 +445,7 @@ BlockNumber > visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks) > { > BlockNumber newnblocks; > + boolcached; > > All the added variables added by 0002 is useless because all the > caller sites are not interested in the value. smgrnblocks should > accept NULL as isCached. (I'm agree with Tsunakawa-san that the > camel-case name is not common there.) > > + nForkBlocks[i] = smgrnblocks(smgr_reln, forkNum[i], &isCached); > + > + if (!isCached) > > "is cached" is not the property that code is interested in. No other callers > to smgrnblocks are interested in that property. The need for caching is > purely internal of smgrnblocks(). > > On the other hand, we are going to utilize the property of "accuracy" > that is a biproduct of reducing fseek calls, and, again, not > interested in how it is achieved. > > So I suggest that the name should be "accurite" or something that is > not suggest the mechanism used under the hood. > > + if (nTotalBlocks != InvalidBlockNumber && > + nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD) > > I don't think nTotalBlocks is useful. What we need here is only total > blocks for every forks (nForkBlocks[]) and the total number of buffers > to be invalidated for all forks (nBlocksToInvalidate). > > > > > > The right side of >= should be cur_block. > > > > > > Fixed. > > > > >= should be =, shouldn't it? > > It's just from a paranoia. What we are going to invalidate is blocks > blockNum of which >= curBlock. Although actually there's no chance of Sorry. What we are going to invalidate is blocks that are blocNum >= firstDelBlock[i]. So what I wanted to suggest was the condition should be + if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) && + bufHdr->tag.forkNum == forkNum[j] && + bufHdr->tag.blockNum >= firstDelBlock[j]) > any other processes having replaced the buffer with another page (with > lower blockid) of the same relation after BugTableLookup(), that > condition makes it sure not to leave blocks to be invalidated left > alone. And I forgot to mention the patch names. I think many of us name the patches using -v option of git-format-patch, and assign the version to a patch-set thus the version number of all files that are posted at once is same. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
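To illustrate the decision being discussed, here is a simplified, self-contained sketch; the threshold name and the idea of an "accuracy" flag come from this thread, but the types and structure are stand-ins, not the patch:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* simplified stand-ins for illustration only */
typedef uint32_t BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)
#define BUF_DROP_FULL_SCAN_THRESHOLD 512    /* illustrative threshold */

/*
 * Use the optimized per-block invalidation only if every fork's size is
 * known accurately and the total number of blocks to drop is small enough;
 * otherwise fall back to scanning the whole buffer pool.
 */
static bool
can_use_targeted_invalidation(const BlockNumber *nForkBlocks,
                              const bool *accurate,
                              const BlockNumber *firstDelBlock,
                              int nforks,
                              BlockNumber *nBlocksToInvalidate)
{
    BlockNumber total = 0;

    for (int i = 0; i < nforks; i++)
    {
        if (!accurate[i] || nForkBlocks[i] == InvalidBlockNumber)
            return false;
        total += nForkBlocks[i] - firstDelBlock[i];
    }
    *nBlocksToInvalidate = total;
    return total < BUF_DROP_FULL_SCAN_THRESHOLD;
}

int
main(void)
{
    BlockNumber nForkBlocks[] = {100, 10, 2};
    bool        accurate[] = {true, true, true};
    BlockNumber firstDelBlock[] = {90, 0, 0};
    BlockNumber n;

    if (can_use_targeted_invalidation(nForkBlocks, accurate, firstDelBlock, 3, &n))
        printf("targeted invalidation of %u blocks\n", (unsigned) n);
    else
        printf("full scan of shared buffers\n");
    return 0;
}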
RE: Transactions involving multiple postgres foreign servers, take 2
From: Masahiko Sawada > What about temporary network failures? I think there are users who > don't want to give up resolving foreign transactions failed due to a > temporary network failure. Or even they might want to wait for > transaction completion until they send a cancel request. If we want to > call the commit routine only once and therefore want FDW to retry > connecting the foreign server within the call, it means we require all > FDW implementors to write a retry loop code that is interruptible and > ensures not to raise an error, which increases difficulty. > > Yes, but if we don’t retry to resolve foreign transactions at all on > an unreliable network environment, the user might end up requiring > every transaction to check the status of foreign transactions of the > previous distributed transaction before starts. If we allow to do > retry, I guess we ease that somewhat. OK. As I said, I'm not against trying to cope with temporary network failure. I just don't think it's mandatory. If the network failure is really temporary and thus recovers soon, then the resolver will be able to commit the transaction soon, too. Then, we can have a commit retry timeout or retry count like the following WebLogic manual says. (I couldn't quickly find the English manual, so below is in Japanese. I quoted some text that got through machine translation, which appears a bit strange.) https://docs.oracle.com/cd/E92951_01/wls/WLJTA/trxcon.htm -- Abandon timeout Specifies the maximum time (in seconds) that the transaction manager attempts to complete the second phase of a two-phase commit transaction. In the second phase of a two-phase commit transaction, the transaction manager attempts to complete the transaction until all resource managers indicate that the transaction is complete. After the abort transaction timer expires, no attempt is made to resolve the transaction. If the transaction enters a ready state before it is destroyed, the transaction manager rolls back the transaction and releases the held lock on behalf of the destroyed transaction. -- > Also, what if the user sets the statement timeout to 60 sec and they > want to cancel the waits after 5 sec by pressing ctl-C? You mentioned > that client libraries of other DBMSs don't have asynchronous execution > functionality. If the SQL execution function is not interruptible, the > user will end up waiting for 60 sec, which seems not good. FDW functions can be uninterruptible in general, aren't they? We experienced that odbc_fdw didn't allow cancellation of SQL execution. Regards Takayuki Tsunakawa
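To make the bounded-retry idea (retry count / abandon timeout) concrete, here is a minimal standalone sketch; the callback, the limits, and the sleep interval are all stand-ins for illustration, not part of any proposed patch:

#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

#define COMMIT_RETRY_MAX    5   /* illustrative retry count */
#define COMMIT_RETRY_SLEEP  1   /* seconds between attempts */

/* stand-in for the FDW's commit-prepared call */
static bool
commit_prepared_on_foreign_server(void)
{
    return false;               /* pretend the network is down */
}

int
main(void)
{
    for (int attempt = 1; attempt <= COMMIT_RETRY_MAX; attempt++)
    {
        if (commit_prepared_on_foreign_server())
        {
            printf("resolved on attempt %d\n", attempt);
            return 0;
        }
        /* a real resolver would also check for interrupts/cancel here */
        sleep(COMMIT_RETRY_SLEEP);
    }
    printf("giving up for now; leave the transaction for later resolution\n");
    return 1;
}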
Re: Parallel INSERT (INTO ... SELECT ...)
On Fri, Oct 9, 2020 at 8:41 AM Thomas Munro wrote: > One thing I noticed is that you have logic, variable names and > assertions all over the tree that assume that we can only do parallel > *inserts*. I agree 100% with your plan to make Parallel Insert work > first, it is an excellent goal and if we get it in it'll be a headline > feature of PG14 (along with COPY etc). That said, I wonder if it > would make sense to use more general naming (isParallelModifyLeader?), > be more liberal where you really mean "is it DML", and find a way to > centralise the logic about which DML commands types are currently > allowed (ie insert only for now) for assertions and error checks etc, > so that in future we don't have to go around and change all these > places and rename things again and again. > Fair points. I agree, it would make more sense to generalise the naming and centralise the DML-command-type checks, rather than everything being insert-specific. It was getting a bit ugly. I'll work on that. > While contemplating that, I couldn't resist taking a swing at the main > (?) show stopper for Parallel Update and Parallel Delete, judging by > various clues left in code comments by Robert: combo command IDs > created by other processes. Here's a rapid prototype to make that > work (though perhaps not as efficiently as we'd want, not sure). With > that in place, I wonder what else we'd need to extend your patch to > cover all three operations... it can't be much! Of course I don't > want to derail your work on Parallel Insert, I'm just providing some > motivation for my comments on the (IMHO) shortsightedness of some of > the coding. > Thanks for your prototype code for coordination of combo command IDs with the workers. It does give me the incentive to look beyond that issue and see whether parallel Update and parallel Delete are indeed possible. I'll be sure to give it a go! > PS Why not use git format-patch to create patches? Guess I was being a bit lazy - will use git format-patch in future. Regards, Greg Nancarrow Fujitsu Australia
Re: Expansion of our checks for connection-loss errors
At Thu, 08 Oct 2020 21:41:55 -0400, Tom Lane wrote in > Kyotaro Horiguchi writes: > > At Thu, 08 Oct 2020 15:15:54 -0400, Tom Lane wrote in > >> Accordingly, I propose the attached patch (an expansion of > >> Fujii-san's) that causes us to test for all five errnos anyplace > >> we had been checking for ECONNRESET. > > > +1 for the direction. > > > In terms of connection errors, connect(2) and bind(2) can return > > EADDRNOTAVAIL. bind(2) and listen(2) can return EADDRINUSE. FWIW I > > recetnly saw pgbench getting EADDRNOTAVAIL. (They have mapping from > > respective WSA errors in TranslateSocketError()) > > I do not think we have any issues with connection-time errors; > or at least, if we do, the spots being touched here certainly > shouldn't need to worry about them. These places are dealing > with already-established connections. errcode_for_socket_access() is called for connect, bind and lesten but I understand we don't consider the case since we don't have an actual issue related to the functions. > > I'd make errno_is_connection_loss use ALL_CONNECTION_LOSS_ERRNOS to > > avoid duplication definition of the errno list. > > Hmm, might be worth doing, but I'm not sure. I am worried about > whether compilers will generate equally good code that way. The two are placed side-by-side so either will do for me. > > - if (ret < 0 && WSAGetLastError() == WSAECONNRESET) > > + if (ret < 0 && errno_is_connection_loss(WSAGetLastError())) > > > Don't we need to use TranslateSocketError() before? > > Oh, I missed that. But: > > > Perhaps I'm confused but SOCK_ERROR doesn't seem portable between > > Windows and Linux. > > In that case, nothing would have worked on Windows for the last > ten years, so you're mistaken. I think the actual explanation > why this works, and why that test in parallel.c probably still > works even with my mistake, is that win32_port.h makes sure that > our values of ECONNRESET etc match WSAECONNRESET etc. Mm. Sure. > IOW, we'd not actually need TranslateSocketError at all, except > that it maps some not-similarly-named error codes for conditions > that don't exist in Unix into ones that do. We probably do want > TranslateSocketError in this parallel.c test so that anything that > it maps to one of the errno_is_connection_loss codes will be > recognized; but the basic cases would work anyway, unless I > misunderstand this stuff entirely. Yeah, that seems to work. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
RE: POC: postgres_fdw insert batching
From: Tomas Vondra > I'm not sure when I'll have time to work on this again, so if you are > interested and willing to work on it, please go ahead. I'll gladly do > reviews and help you with it. Thank you very much. > I think transferring data to other databases is fine - interoperability > is a big advantage for users, I don't see it as something threatening > the PostgreSQL project. I doubt this would make it more likely for users > to migrate from PostgreSQL - there are many ways to do that already. Definitely true. Users may want to use INSERT SELECT to do some data transformation in their OLTP database and load it into a non-Postgres data warehouse. > Yeah. I think handling complete failure should not be very difficult, > but there are cases that worry me more. For example, what if there's a > before trigger (on the remote db) that "skips" inserting some of the > rows by returning NULL? > Yeah. I wonder if the FDW needs to indicate which features are supported > by the ExecForeignMultiInsert, e.g. by adding a function that decides > whether batch insert is supported (it might also do that internally by > calling ExecForeignInsert, of course). Thanks for your advice. I'll try to address them. Regards Takayuki Tsunakawa
Re: Logical replication CPU-bound with TRUNCATE/DROP/CREATE many tables
On Thu, Oct 8, 2020 at 2:34 PM Simon Riggs wrote: > > On Thu, 8 Oct 2020 at 09:47, Dilip Kumar wrote: > > > > This script will wait 10 seconds after INSERT exits > > > before executing TRUNCATE, please wait for it to run. > > Has this been tested with anything other than the one test case? > > It would be good to know how the patch handles a transaction that > contains many aborted subtransactions that contain invals. > Are you thinking from the angle of performance or functionality? I don't see how this patch can impact either of those. Basically, it will not execute any extra invalidations beyond what it executes without the patch for aborted subtransactions. Can you please explain your concern in a bit more detail? Having said that, I think it would be a good idea to test the scenario you mentioned to ensure that we have not broken anything unknowingly. -- With Regards, Amit Kapila.
Re: Fix typos in reorderbuffer.c
On Thu, Oct 8, 2020 at 2:40 PM Masahiko Sawada wrote: > > On Thu, 8 Oct 2020 at 17:37, Amit Kapila wrote: > > > > > So, I feel the > > comments should be accordingly updated. > > +1 for this change. > Thanks, I have pushed this and along with it pushed a typo-fix in logical.c. -- With Regards, Amit Kapila.
Re: Minor documentation error regarding streaming replication protocol
On Fri, Oct 9, 2020 at 08:52:50AM +0900, Michael Paquier wrote: > On Thu, Oct 08, 2020 at 04:23:06PM -0400, Bruce Momjian wrote: > > I have looked at this. It seems SendTimeLineHistory() is sending raw > > bytes from the history file, with no encoding conversion, and > > ReceiveXlogStream() is receiving it, again assuming it is just plain > > text. I am not sure we really have an SQL data type where we do this. > > BYTEA doesn't do encoding conversion, but does backslash processing, and > > TEXT does encoding conversion. > > > > I suppose we either have to document this as BYTEA with no backslash > > processing, or TEXT with no encoding conversion --- I think I prefer the > > latter. > > As StartupXLOG() shows, the timeline history file can include, as the > reason, the recovery target name, which may not consist of only ASCII > characters since that's the value the user specified in pg_create_restore_point, > so bytea is correct, no? Good point. The reporter was assuming the data would come to the client in the bytea output format selected by the GUC, e.g. \x..., but that doesn't happen either. As I said before, it is more like raw bytes, but we don't have an SQL data type for that. -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
RE: extension patch of CREATE OR REPLACE TRIGGER
Hello > > * A lesser point is that I think you're overcomplicating the code by > > applying heap_modify_tuple. You might as well just build the new > > tuple normally in all cases, and then apply either CatalogTupleInsert or > CatalogTupleUpdate. > > > > * Also, the search for an existing trigger tuple is being done the hard way. > > You're using an index on (tgrelid, tgname), so you could include the > > name in the index key and expect that there's at most one match. > While waiting for a new reply, I'll be doing those 2 refactorings. I'm done with those refactorings now. Please have a look at the changes in the latest patch. > > * I don't think that you've fully thought through the implications of > > replacing a trigger for a table that the current transaction has > > already modified. Is it really sufficient, or even useful, to do > > this: > > > > +/* > > + * If this trigger has pending events, throw an error. > > + */ > > +if (trigger_deferrable && > > + AfterTriggerPendingOnRel(RelationGetRelid(rel))) > > > > As an example, if we change a BEFORE trigger to an AFTER trigger, > > that's not going to affect the fact that we *already* fired that > > trigger. Maybe this is okay and we just need to document it, but I'm not > convinced. > > > > * BTW, I don't think a trigger necessarily has to be deferrable in > > order to have pending AFTER events. The existing use of > > AfterTriggerPendingOnRel certainly doesn't assume that. But really, I > > think we probably ought to be applying CheckTableNotInUse which'd > > include that test. (Another way in which there's fuzzy thinking here > > is that AfterTriggerPendingOnRel isn't specific to *this* > > trigger.) > Hmm, actually, when I just put a call to CheckTableNotInUse() in > CreateTrigger(), it throws an error like "cannot CREATE OR REPLACE TRIGGER > because it is being used by active queries in this session". > This causes a break of the protection for the internal cases above and a > contradiction with already-passing test cases. > I thought of adding a condition of in_partition==false to the call to > CheckTableNotInUse(). > But this doesn't work in a corner case where a child trigger generated > internally is > pending and we don't want to allow the replacement of this kind of trigger. > Do you have any good idea to achieve both points at the same time? Still, on this point, I'm waiting for a comment! Regards, Takamichi Osumi CREATE_OR_REPLACE_TRIGGER_v13.patch Description: CREATE_OR_REPLACE_TRIGGER_v13.patch
Re: [HACKERS] logical decoding of two-phase transactions
On Fri, Oct 9, 2020 at 5:45 AM Peter Smith wrote: > > On Thu, Oct 8, 2020 at 5:25 PM Amit Kapila wrote: > > > COMMENT > > > Line 177 > > > +logicalrep_read_prepare(StringInfo in, LogicalRepPrepareData * > > > prepare_data) > > > > > > prepare_data->prepare_type = flags; > > > This code may be OK but it does seem a bit of an abuse of the flags. > > > > > > e.g. Are they flags or are the really enum values? > > > e.g. And if they are effectively enums (it appears they are) then > > > seemed inconsistent that |= was used when they were previously > > > assigned. > > > > > > ; > > > > I don't understand this point. As far as I can see at the time of > > write (logicalrep_write_prepare()), the patch has used |=, and at the > > time of reading (logicalrep_read_prepare()) it has used assignment > > which seems correct from the code perspective. Do you have a better > > proposal? > > OK. I will explain my thinking when I wrote that review comment. > > I agree all is "correct" from a code perspective. > > But IMO using bit arithmetic implies that different combinations are > also possible, whereas in current code they are not. > So code is kind of having a bet each-way - sometimes treating "flags" > as bit flags and sometimes as enums. > > e.g. If these flags are not really bit flags at all then the > logicalrep_write_prepare() code might just as well be written as > below: > > BEFORE > if (rbtxn_commit_prepared(txn)) > flags |= LOGICALREP_IS_COMMIT_PREPARED; > else if (rbtxn_rollback_prepared(txn)) > flags |= LOGICALREP_IS_ROLLBACK_PREPARED; > else > flags |= LOGICALREP_IS_PREPARE; > > /* Make sure exactly one of the expected flags is set. */ > if (!PrepareFlagsAreValid(flags)) > elog(ERROR, "unrecognized flags %u in prepare message", flags); > > > AFTER > if (rbtxn_commit_prepared(txn)) > flags = LOGICALREP_IS_COMMIT_PREPARED; > else if (rbtxn_rollback_prepared(txn)) > flags = LOGICALREP_IS_ROLLBACK_PREPARED; > else > flags = LOGICALREP_IS_PREPARE; > > ~ > > OTOH, if you really do want to anticipate having future flag bit > combinations > I don't anticipate more combinations rather I am not yet sure whether we want to distinguish these operations with flags or have separate messages for each of these operations. I think for now we can go with your proposal above. -- With Regards, Amit Kapila.
Re: Assertion failure with LEFT JOINs among >500 relations
On Fri, 9 Oct 2020 at 15:06, Tom Lane wrote: > I notice there are some other ad-hoc isnan() checks scattered > about costsize.c, too. Maybe we should indeed consider fixing > clamp_row_estimate to get rid of inf (and nan too, I suppose) > so that we'd not need those. I don't recall the exact cases > that made us introduce those checks, but they were for cases > a lot more easily reachable than this one, I believe. Is there actually a case where nrows could be NaN? If not, then it seems like a wasted check. Wouldn't it take one of the input rels having an Inf row estimate (which won't happen after changing clamp_row_estimate()), or the selectivity estimate being NaN? David
Re: Logical replication CPU-bound with TRUNCATE/DROP/CREATE many tables
On Fri, Oct 9, 2020 at 8:40 AM Amit Kapila wrote: > > On Thu, Oct 8, 2020 at 2:34 PM Simon Riggs wrote: > > > > On Thu, 8 Oct 2020 at 09:47, Dilip Kumar wrote: > > > > > > This script will wait 10 seconds after INSERT exits > > > > before executing TRUNCATE, please wait for it to run. > > > > Has this been tested with anything other than the one test case? > > > > It would be good to know how the patch handles a transaction that > > contains many aborted subtransactions that contain invals. > > > > Are you thinking from the angle of performance or functionality? I > don't see how this patch can impact either of those. Basically, it > will not execute any extra invalidations then it is executing without > the patch for aborted subtransactions. Can you please explain in a bit > more detail about your fear? > > Having said that, I think it would be a good idea to test the scenario > you mentioned to ensure that we have not broken anything unknowingly. Yeah, even I feel that nothing should impact in this area because on abort we are anyway executing all the invalidations and we will continue to do so with the patch as well. I will test this scenario to ensure nothing is broken. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Logical replication CPU-bound with TRUNCATE/DROP/CREATE many tables
On Fri, Oct 9, 2020 at 10:06 AM Dilip Kumar wrote: > > On Fri, Oct 9, 2020 at 8:40 AM Amit Kapila wrote: > > > > On Thu, Oct 8, 2020 at 2:34 PM Simon Riggs wrote: > > > > > > On Thu, 8 Oct 2020 at 09:47, Dilip Kumar wrote: > > > > > > > > This script will wait 10 seconds after INSERT exits > > > > > before executing TRUNCATE, please wait for it to run. > > > > > > Has this been tested with anything other than the one test case? > > > > > > It would be good to know how the patch handles a transaction that > > > contains many aborted subtransactions that contain invals. > > > > > > > Are you thinking from the angle of performance or functionality? I > > don't see how this patch can impact either of those. Basically, it > > will not execute any extra invalidations then it is executing without > > the patch for aborted subtransactions. Can you please explain in a bit > > more detail about your fear? > > > > Having said that, I think it would be a good idea to test the scenario > > you mentioned to ensure that we have not broken anything unknowingly. > > Yeah, even I feel that nothing should impact in this area because on > abort we are anyway executing all the invalidations and we will > continue to do so with the patch as well. > True, but I think we execute the invalidations only for the streaming case, otherwise, neither we need to execute invalidations nor we are doing it for abort case. > I will test this scenario > to ensure nothing is broken. > If I have not missed anything then probably you need to prepare a scenario where we need to do streaming. I am fine with the testing of the non-streaming case as well if you want to ensure that we have not broken that case by starting to execute invalidations. -- With Regards, Amit Kapila.
Re: Parallel copy
On Thu, Oct 8, 2020 at 12:14 AM vignesh C wrote: > > On Mon, Sep 28, 2020 at 12:19 PM Amit Kapila wrote: > > > > > > I am convinced by the reason given by Kyotaro-San in that another > > thread [1] and performance data shown by Peter that this can't be an > > independent improvement and rather in some cases it can do harm. Now, > > if you need it for a parallel-copy path then we can change it > > specifically to the parallel-copy code path but I don't understand > > your reason completely. > > > > Whenever we need data to be populated, we will get a new data block & > pass it to CopyGetData to populate the data. In case of file copy, the > server will completely fill the data block. We expect the data to be > filled completely. If data is available it will completely load the > complete data block in case of file copy. There is no scenario where > even if data is present a partial data block will be returned except > for EOF or no data available. But in case of STDIN data copy, even > though there is 8K data available in data block & 8K data available in > STDIN, CopyGetData will return as soon as libpq buffer data is more > than the minread. We will pass new data block every time to load data. > Every time we pass an 8K data block but CopyGetData loads a few bytes > in the new data block & returns. I wanted to keep the same data > population logic for both file copy & STDIN copy i.e copy full 8K data > blocks & then the populated data can be required. There is an > alternative solution I can have some special handling in case of STDIN > wherein the existing data block can be passed with the index from > where the data should be copied. Thoughts? > What you are proposing as an alternative solution, isn't that what we are doing without the patch? IIUC, you require this because of your corresponding changes to handle COPY_NEW_FE in CopyReadLine(), is that right? If so, what is the difficulty in making it behave similar to the non-parallel case? -- With Regards, Amit Kapila.
Re: Logical replication CPU-bound with TRUNCATE/DROP/CREATE many tables
On Fri, Oct 9, 2020 at 10:29 AM Amit Kapila wrote: > > On Fri, Oct 9, 2020 at 10:06 AM Dilip Kumar wrote: > > > > On Fri, Oct 9, 2020 at 8:40 AM Amit Kapila wrote: > > > > > > On Thu, Oct 8, 2020 at 2:34 PM Simon Riggs wrote: > > > > > > > > On Thu, 8 Oct 2020 at 09:47, Dilip Kumar wrote: > > > > > > > > > > This script will wait 10 seconds after INSERT exits > > > > > > before executing TRUNCATE, please wait for it to run. > > > > > > > > Has this been tested with anything other than the one test case? > > > > > > > > It would be good to know how the patch handles a transaction that > > > > contains many aborted subtransactions that contain invals. > > > > > > > > > > Are you thinking from the angle of performance or functionality? I > > > don't see how this patch can impact either of those. Basically, it > > > will not execute any extra invalidations then it is executing without > > > the patch for aborted subtransactions. Can you please explain in a bit > > > more detail about your fear? > > > > > > Having said that, I think it would be a good idea to test the scenario > > > you mentioned to ensure that we have not broken anything unknowingly. > > > > Yeah, even I feel that nothing should impact in this area because on > > abort we are anyway executing all the invalidations and we will > > continue to do so with the patch as well. > > > > True, but I think we execute the invalidations only for the streaming > case, otherwise, neither we need to execute invalidations nor we are > doing it for abort case. Right > > I will test this scenario > > to ensure nothing is broken. > > > > If I have not missed anything then probably you need to prepare a > scenario where we need to do streaming. Yes the test should be like BEGIN Savepoint 1 - DDL - large operation rollback to s1; -DDL large operation rollback to s1; ... So ideally every time it tries to stream the subtransaction there should be concurrent abort detected and it will execute all the invalidations. > I am fine with the testing of > the non-streaming case as well if you want to ensure that we have not > broken that case by starting to execute invalidations. Okay -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Parallel copy
On Thu, Oct 8, 2020 at 12:14 AM vignesh C wrote: > > On Mon, Sep 28, 2020 at 12:19 PM Amit Kapila wrote: > > > > + */ > > > > +typedef struct ParallelCopyLineBoundary > > > > > > > > Are we doing all this state management to avoid using locks while > > > > processing lines? If so, I think we can use either spinlock or LWLock > > > > to keep the main patch simple and then provide a later patch to make > > > > it lock-less. This will allow us to first focus on the main design of > > > > the patch rather than trying to make this datastructure processing > > > > lock-less in the best possible way. > > > > > > > > > > The steps will be more or less same if we use spinlock too. step 1, step > > > 3 & step 4 will be common we have to use lock & unlock instead of step 2 > > > & step 5. I feel we can retain the current implementation. > > > > > > > I'll study this in detail and let you know my opinion on the same but > > in the meantime, I don't follow one part of this comment: "If they > > don't follow this order the worker might process wrong line_size and > > leader might populate the information which worker has not yet > > processed or in the process of processing." > > > > Do you want to say that leader might overwrite some information which > > worker hasn't read yet? If so, it is not clear from the comment. > > Another minor point about this comment: > > > > Here leader and worker must follow these steps to avoid any corruption > or hang issue. Changed it to: > * The leader & worker process access the shared line information by following > * the below steps to avoid any data corruption or hang: > Actually, I wanted more on the lines why such corruption or hang can happen? It might help reviewers to understand why you have followed such a sequence. > > > > How did you ensure that this is fixed? Have you tested it, if so > > please share the test? I see a basic problem with your fix. > > > > + /* Report WAL/buffer usage during parallel execution */ > > + bufferusage = shm_toc_lookup(toc, PARALLEL_COPY_BUFFER_USAGE, false); > > + walusage = shm_toc_lookup(toc, PARALLEL_COPY_WAL_USAGE, false); > > + InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber], > > + &walusage[ParallelWorkerNumber]); > > > > You need to call InstrStartParallelQuery() before the actual operation > > starts, without that stats won't be accurate? Also, after calling > > WaitForParallelWorkersToFinish(), you need to accumulate the stats > > collected from workers which neither you have done nor is possible > > with the current code in your patch because you haven't made any > > provision to capture them in BeginParallelCopy. > > > > I suggest you look into lazy_parallel_vacuum_indexes() and > > begin_parallel_vacuum() to understand how the buffer/wal usage stats > > are accumulated. Also, please test this functionality using > > pg_stat_statements. > > > > Made changes accordingly. 
> I have verified it using:
> postgres=# select * from pg_stat_statements where query like '%copy%';
>  userid              | 10
>  dbid                | 13743
>  queryid             | -6947756673093447609
>  query               | copy hw from '/home/vignesh/postgres/postgres/inst/bin/hw_175000.csv' with(format csv, delimiter ',')
>  plans               | 0
>  total_plan_time     | 0
>  min_plan_time       | 0
>  max_plan_time       | 0
>  mean_plan_time      | 0
>  stddev_plan_time    | 0
>  calls               | 1
>  total_exec_time     | 265.195105
>  min_exec_time       | 265.195105
>  max_exec_time       | 265.195105
>  mean_exec_time      | 265.195105
>  stddev_exec_time    | 0
>  rows                | 175000
>  shared_blks_hit     | 1916
>  shared_blks_read    | 0
>  shared_blks_dirtied | 946
>  shared_blks_written | 946
>  local_blks_hit      | 0
>  local_blks_read     | 0
>  local_blks_dirtied  | 0
>  local_blks_written  | 0
>  temp_blks_read      | 0
>  temp_blks_written   | 0
>  blk_read_time       | 0
>  blk_write_time      | 0
>  wal_records         | 1116
>  wal_fpi             | 0
>  wal_bytes           | 3587203
>
>  userid              | 10
>  dbid                | 13743
>  queryid             | 8570215596364326047
>  query               | copy hw from '/home/vignesh/postgres/postgres/inst/bin/hw_175000.csv' with(format csv, delimiter ',', paral
Re: Wrong example in the bloom documentation
Hi Bruce, Tom, On Thu, Oct 8, 2020 at 03:43:32PM -0400, Tom Lane wrote: >> "Daniel Westermann (DWE)" writes: >> >> I was hoping someone more experienced with this would comment, but >> >> seeing none, I will apply it in a day or two to all supported versions? >> >> Have you tested this output back to 9.5? >> >> > I hoped that as well. No, I tested down to 9.6 because the change happened >> > in 10. >> >> The patch assumes that parallel query is enabled, which is not true by >> default before v10, so it should certainly not be applied before v10 >> (at least not without significant revisions). Yes, the behavior change was in 10. Before 10 the example is fine; I would not apply the patch to any prior version, otherwise the whole example would need to be rewritten. Regards Daniel
Re: Transactions involving multiple postgres foreign servers, take 2
At Fri, 9 Oct 2020 02:33:37 +, "tsunakawa.ta...@fujitsu.com" wrote in > From: Masahiko Sawada > > What about temporary network failures? I think there are users who > > don't want to give up resolving foreign transactions failed due to a > > temporary network failure. Or even they might want to wait for > > transaction completion until they send a cancel request. If we want to > > call the commit routine only once and therefore want FDW to retry > > connecting the foreign server within the call, it means we require all > > FDW implementors to write a retry loop code that is interruptible and > > ensures not to raise an error, which increases difficulty. > > > > Yes, but if we don’t retry to resolve foreign transactions at all on > > an unreliable network environment, the user might end up requiring > > every transaction to check the status of foreign transactions of the > > previous distributed transaction before starts. If we allow to do > > retry, I guess we ease that somewhat. > > OK. As I said, I'm not against trying to cope with temporary network > failure. I just don't think it's mandatory. If the network failure is > really temporary and thus recovers soon, then the resolver will be able to > commit the transaction soon, too. I should missing something, though... I don't understand why we hate ERRORs from fdw-2pc-commit routine so much. I think remote-commits should be performed before local commit passes the point-of-no-return and the v26-0002 actually places AtEOXact_FdwXact() before the critical section. (FWIW, I think remote commits should be performed by backends, not by another process, because backends should wait for all remote-commits to end anyway and it is simpler. If we want to multiple remote-commits in parallel, we could do that by adding some async-waiting interface.) > Then, we can have a commit retry timeout or retry count like the following > WebLogic manual says. (I couldn't quickly find the English manual, so below > is in Japanese. I quoted some text that got through machine translation, > which appears a bit strange.) > > https://docs.oracle.com/cd/E92951_01/wls/WLJTA/trxcon.htm > -- > Abandon timeout > Specifies the maximum time (in seconds) that the transaction manager attempts > to complete the second phase of a two-phase commit transaction. > > In the second phase of a two-phase commit transaction, the transaction > manager attempts to complete the transaction until all resource managers > indicate that the transaction is complete. After the abort transaction timer > expires, no attempt is made to resolve the transaction. If the transaction > enters a ready state before it is destroyed, the transaction manager rolls > back the transaction and releases the held lock on behalf of the destroyed > transaction. > -- That's not a retry timeout but a timeout for total time of all 2nd-phase-commits. But I think it would be sufficient. Even if an fdw could retry 2pc-commit, it's a matter of that fdw and the core has nothing to do with. > > Also, what if the user sets the statement timeout to 60 sec and they > > want to cancel the waits after 5 sec by pressing ctl-C? You mentioned > > that client libraries of other DBMSs don't have asynchronous execution > > functionality. If the SQL execution function is not interruptible, the > > user will end up waiting for 60 sec, which seems not good. I think fdw-2pc-commit can be interruptible safely as far as we run the remote commits before entring critical section of local commit. 
> FDW functions can be uninterruptible in general, aren't they? We experienced > that odbc_fdw didn't allow cancellation of SQL execution. At least postgres_fdw is interruptible while waiting on the remote. create view lt as select 1 as slp from (select pg_sleep(10)) t; create foreign table ft(slp int) server sv1 options (table_name 'lt'); select * from ft; ^CCancel request sent ERROR: canceling statement due to user request regards. -- Kyotaro Horiguchi NTT Open Source Software Center
RE: [PoC] Non-volatile WAL buffer
Hi Takashi, There are some differences between our HW/SW configuration and test steps. I attached postgresql.conf I used for your reference. I would like to try postgresql.conf and steps you provided in the later days to see if I can find cause. I also ran pgbench and postgres server on the same server but on different NUMA node, and ensure server process and PMEM on the same NUMA node. I used similar steps are yours from step 1 to 9. But some difference in later steps, major of them are: In step 10), I created a database and table for test by: #create database: psql -c "create database insert_bench;" #create table: psql -d insert_bench -c "create table test(crt_time timestamp, info text default '75feba6d5ca9ff65d09af35a67fe962a4e3fa5ef279f94df6696bee65f4529a4bbb03ae56c3b5b86c22b447fc48da894740ed1a9d518a9646b3a751a57acaca1142ccfc945b1082b40043e3f83f8b7605b5a55fcd7eb8fc1d0475c7fe465477da47d96957849327731ae76322f440d167725d2e2bbb60313150a4f69d9a8c9e86f9d79a742e7a35bf159f670e54413fb89ff81b8e5e8ab215c3ddfd00bb6aeb4');" in step 15), I did not use pg_prewarm, but just ran pg_bench for 180 seconds to warm up. In step 16), I ran pgbench using command: pgbench -M prepared -n -r -P 10 -f ./test.sql -T 600 -c _ -j _ insert_bench. (test.sql can be found in attachment) For HW/SW conf, the major differences are: CPU: I used Xeon 8268 (24c@2.9Ghz, HT enabled) OS Distro: CentOS 8.2.2004 Kernel: 4.18.0-193.6.3.el8_2.x86_64 GCC: 8.3.1 Best regards Gang -Original Message- From: Takashi Menjo Sent: Tuesday, October 6, 2020 4:49 PM To: Deng, Gang Cc: pgsql-hack...@postgresql.org; 'Takashi Menjo' Subject: RE: [PoC] Non-volatile WAL buffer Hi Gang, I have tried to but yet cannot reproduce performance degrade you reported when inserting 328-byte records. So I think the condition of you and me would be different, such as steps to reproduce, postgresql.conf, installation setup, and so on. My results and condition are as follows. May I have your condition in more detail? Note that I refer to your "Storage over App Direct" as my "Original (PMEM)" and "NVWAL patch" to "Non-volatile WAL buffer." Best regards, Takashi # Results See the attached figure. In short, Non-volatile WAL buffer got better performance than Original (PMEM). # Steps Note that I ran postgres server and pgbench in a single-machine system but separated two NUMA nodes. PMEM and PCI SSD for the server process are on the server-side NUMA node. 01) Create a PMEM namespace (sudo ndctl create-namespace -f -t pmem -m fsdax -M dev -e namespace0.0) 02) Make an ext4 filesystem for PMEM then mount it with DAX option (sudo mkfs.ext4 -q -F /dev/pmem0 ; sudo mount -o dax /dev/pmem0 /mnt/pmem0) 03) Make another ext4 filesystem for PCIe SSD then mount it (sudo mkfs.ext4 -q -F /dev/nvme0n1 ; sudo mount /dev/nvme0n1 /mnt/nvme0n1) 04) Make /mnt/pmem0/pg_wal directory for WAL 05) Make /mnt/nvme0n1/pgdata directory for PGDATA 06) Run initdb (initdb --locale=C --encoding=UTF8 -X /mnt/pmem0/pg_wal ...) 
- Also give -P /mnt/pmem0/pg_wal/nvwal -Q 81920 in the case of Non-volatile WAL buffer 07) Edit postgresql.conf as the attached one - Please remove nvwal_* lines in the case of Original (PMEM) 08) Start postgres server process on NUMA node 0 (numactl -N 0 -m 0 -- pg_ctl -l pg.log start) 09) Create a database (createdb --locale=C --encoding=UTF8) 10) Initialize pgbench tables with s=50 (pgbench -i -s 50) 11) Change # characters of "filler" column of "pgbench_history" table to 300 (ALTER TABLE pgbench_history ALTER filler TYPE character(300);) - This would make the row size of the table 328 bytes 12) Stop the postgres server process (pg_ctl -l pg.log -m smart stop) 13) Remount the PMEM and the PCIe SSD 14) Start postgres server process on NUMA node 0 again (numactl -N 0 -m 0 -- pg_ctl -l pg.log start) 15) Run pg_prewarm for all the four pgbench_* tables 16) Run pgbench on NUMA node 1 for 30 minutes (numactl -N 1 -m 1 -- pgbench -r -M prepared -T 1800 -c __ -j __) - It executes the default tpcb-like transactions I repeated all the steps three times for each (c,j) then got the median "tps = __ (including connections establishing)" of the three as throughput and the "latency average = __ ms " of that time as average latency. # Environment variables export PGHOST=/tmp export PGPORT=5432 export PGDATABASE="$USER" export PGUSER="$USER" export PGDATA=/mnt/nvme0n1/pgdata # Setup - System: HPE ProLiant DL380 Gen10 - CPU: Intel Xeon Gold 6240M x2 sockets (18 cores per socket; HT disabled by BIOS) - DRAM: DDR4 2933MHz 192GiB/socket x2 sockets (32 GiB per channel x 6 channels per socket) - Optane PMem: Apache Pass, AppDirect Mode, DDR4 2666MHz 1.5TiB/socket x2 sockets (256 GiB per channel x 6 channels per socket; interleaving enabled) - PCIe SSD: DC P4800X Series SSDPED1K750GA - Distro: Ubuntu 20.04.1 - C compiler: gcc 9.3.0 - libc: glibc 2.31 - Linux kernel: 5.7 (vanilla) - Filesystem: ext4 (DAX enabled when using Optane PMem) -
Re: Parallel copy
On Thu, Oct 8, 2020 at 8:43 AM Greg Nancarrow wrote:
>
> On Thu, Oct 8, 2020 at 5:44 AM vignesh C wrote:
>
> > Attached v6 patch with the fixes.
> >
>
> Hi Vignesh,
>
> I noticed a couple of issues when scanning the code in the following patch:
>
> v6-0003-Allow-copy-from-command-to-process-data-from-file.patch
>
> In the following code, it will put a junk uint16 value into *destptr
> (and thus may well cause a crash) on a Big Endian architecture
> (Solaris Sparc, s390x, etc.):
> You're storing a (uint16) string length in a uint32 and then pulling
> out the lower two bytes of the uint32 and copying them into the
> location pointed to by destptr.
>
> static void
> +CopyStringToSharedMemory(CopyState cstate, char *srcPtr, char *destptr,
> +                          uint32 *copiedsize)
> +{
> +     uint32 len = srcPtr ? strlen(srcPtr) + 1 : 0;
> +
> +     memcpy(destptr, (uint16 *) &len, sizeof(uint16));
> +     *copiedsize += sizeof(uint16);
> +     if (len)
> +     {
> +         memcpy(destptr + sizeof(uint16), srcPtr, len);
> +         *copiedsize += len;
> +     }
> +}
>
> I suggest you change the code to:
>
> uint16 len = srcPtr ? (uint16)strlen(srcPtr) + 1 : 0;
> memcpy(destptr, &len, sizeof(uint16));
>
> [I assume string length here can't ever exceed (65535 - 1), right?]
>

Your suggestion makes sense to me if the assumption about the string
length is correct. If we can't ensure that, then we probably need to use
a four-byte uint32 to store the length instead. (A minimal sketch along
those lines follows after this message.)

> Looking a bit deeper into this, I'm wondering if in fact your
> EstimateStringSize() and EstimateNodeSize() functions should be using
> BUFFERALIGN() for EACH stored string/node (rather than just calling
> shm_toc_estimate_chunk() once at the end, after the length of packed
> strings and nodes has been estimated), to ensure alignment of the start
> of each string/node. Other Postgres code appears to be aligning each
> stored chunk using shm_toc_estimate_chunk(). See the definition of
> that macro and its current usages.
>

I am not sure whether that is required for correctness. AFAIU, we
store/estimate multiple parameters in the same way at other places; see
EstimateParamListSpace and SerializeParamList. Do you have something
else in mind?

While looking at the latest code, I observed the below issue in patch
v6-0003-Allow-copy-from-command-to-process-data-from-file:

+   /* Estimate the size for shared information for PARALLEL_COPY_KEY_CSTATE */
+   est_cstateshared = MAXALIGN(sizeof(SerializedParallelCopyState));
+   shm_toc_estimate_chunk(&pcxt->estimator, est_cstateshared);
+   shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+   strsize = EstimateCstateSize(pcxt, cstate, attnamelist, &whereClauseStr,
+                                &rangeTableStr, &attnameListStr,
+                                &notnullListStr, &nullListStr,
+                                &convertListStr);

Here, do we need to separately estimate the size of
SerializedParallelCopyState when it is also done in EstimateCstateSize?

--
With Regards,
Amit Kapila.
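For reference, here is a minimal sketch of the endianness-safe helper along the lines Greg suggests above. It assumes the serialized string (including its NUL terminator) always fits in a uint16, and that the surrounding patch's declarations (CopyState, uint16, uint32, PG_UINT16_MAX) are in scope. It is illustrative only, not the committed code:

static void
CopyStringToSharedMemory(CopyState cstate, char *srcPtr, char *destptr,
                         uint32 *copiedsize)
{
    /*
     * Assumes strlen(srcPtr) + 1 never exceeds PG_UINT16_MAX; cstate is
     * kept only to mirror the patch's signature and is unused here.
     */
    uint16      len = srcPtr ? (uint16) (strlen(srcPtr) + 1) : 0;

    /* Store the length as a genuine uint16 so both byte orders read it back correctly. */
    memcpy(destptr, &len, sizeof(uint16));
    *copiedsize += sizeof(uint16);
    if (len)
    {
        /* Copy the NUL-terminated string right after the length word. */
        memcpy(destptr + sizeof(uint16), srcPtr, len);
        *copiedsize += len;
    }
}

If the length cannot be bounded by uint16, the same sketch works with a uint32 length word instead, at the cost of two extra bytes per stored string.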
Re: range_agg
On Thu, Sep 24, 2020 at 2:05 AM Paul A Jungwirth <p...@illuminatedcomputing.com> wrote:

> On Sun, Aug 16, 2020 at 12:55 PM Paul A Jungwirth wrote:
> > This is rebased on the current master, including some changes to doc
> > tables and pg_upgrade handling of type oids.
>
> Here is a rebased version of this patch, including a bunch of cleanup
> from Alvaro. (Thanks Alvaro!)
>

I tested this patch and it looks good; I have no objections:

1. there are no new warnings
2. make check-world passed
3. the documentation builds without problems
4. the documentation is sufficient, and so are the regression tests
5. there was no objection to this feature in the discussion, and I think it is an interesting and useful feature - a good addition to arrays

Regards

Pavel

> Paul
>