Re: [PATCH] Missing links between system catalog documentation pages
Hello Dagfinn,

The attached patch applies cleanly, doc generation is ok. I'm ok with
adding such links systematically.

>>> makes the first mention of another system catalog or view (as well
>>> as pg_hba.conf in pg_hba_file_lines) a link, for easier navigation.
>>
>> Why only the first mention? It seems unlikely that I would ever read
>> such a chapter linearly, and even so that I would want to jump
>> especially on the first occurrence but not on others, so ISTM that it
>> should be done for all mentions?
>
> It's the first mention in the introductory paragraph of _each_ catalog
> table/view page, not the first mention in the entire catalogs.sgml
> file. E.g. https://www.postgresql.org/docs/current/catalog-pg-aggregate.html
> has two mentions of pg_proc one word apart:
>
>   Each entry in pg_aggregate is an extension of an entry in pg_proc.
>   The pg_proc entry carries the aggregate's name, …
>
> I didn't think there was much point in linkifying both in that case,
> and other similar situations.

The point is that the user reads a sentence, attempts to jump but
sometimes can't, because it is not the first occurrence. I'd go for all
mentions of another relation being links.

Also, ISTM you missed some, maybe you could consider adding them? eg
pg_database in the very first paragraph of the file, pg_attrdef in the
pg_attribute description, quite a few in pg_class…

-- 
Fabien.
Re: suggest to rename enable_incrementalsort
On Sun, Jun 21, 2020 at 8:26 AM Peter Eisentraut wrote: > > I suggest to rename enable_incrementalsort to enable_incremental_sort. > This is obviously more readable and also how we have named recently > added multiword planner parameters. > > See attached patch. +1, this is a way better name (and patch LGTM on REL_13_STABLE).
tag typos in "catalog.sgml"
Hello, While reviewing a documentation patch, I noticed that a few tags where wrong in "catalog.sgml". Attached patch fixes them. -- Fabien.diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml index 700271fd40..5a66115df1 100644 --- a/doc/src/sgml/catalogs.sgml +++ b/doc/src/sgml/catalogs.sgml @@ -897,7 +897,7 @@ An entry's amopmethod must match the - opfmethod of its containing operator family (including + opfmethod of its containing operator family (including amopmethod here is an intentional denormalization of the catalog structure for performance reasons). Also, amoplefttype and amoprighttype must match @@ -5064,10 +5064,10 @@ SCRAM-SHA-256$:&l An operator class's opcmethod must match the - opfmethod of its containing operator family. + opfmethod of its containing operator family. Also, there must be no more than one pg_opclass - row having opcdefault true for any given combination of - opcmethod and opcintype. + row having opcdefault true for any given combination of + opcmethod and opcintype.
Missing "Up" navigation link between parts and doc root?
Hello devs,

I've been annoyed that the documentation navigation does not always have
an "Up" link. It has them inside parts, but the link disappears when you
are on the root page of a part, and you have to go for the "Home" link
which is far on the right. Is there a good reason not to have the "Up"
link there as well?

-- 
Fabien.
Raising stop and warn limits
In brief, I'm proposing to raise xidWrapLimit-xidStopLimit to 3M and
xidWrapLimit-xidWarnLimit to 40M. Likewise for mxact counterparts.

PostgreSQL has three "stop limit" values beyond which only single-user
mode will assign new values of a certain counter:

- xidStopLimit protects pg_xact, pg_commit_ts, pg_subtrans, and
  pg_serial. SetTransactionIdLimit() withholds a million XIDs, and
  warnings start ten million before that.
- multiStopLimit protects pg_multixact/offsets. SetMultiXactIdLimit()
  withholds 100 mxacts, and warnings start at ten million.
- offsetStopLimit protects pg_multixact/members. SetOffsetVacuumLimit()
  withholds [1,2) SLRU segments thereof (50k-100k member XIDs). No
  warning phase for this one.

Reasons to like a larger stop limit:

1. VACUUM, to advance a limit, may assign IDs subject to one of the
limits. VACUUM formerly consumed XIDs, not mxacts. It now consumes
mxacts, not XIDs. I think a DBA can suppress VACUUM's mxact consumption
by stopping all transactions older than vacuum_freeze_min_age, including
prepared transactions.

2. We currently have edge-case bugs when assigning values in the last
few dozen pages before the wrap limit
(https://postgr.es/m/20190214072623.ga1139...@rfd.leadboat.com and
https://postgr.es/m/20200525070033.ga1591...@rfd.leadboat.com). A higher
stop limit could make this class of bug unreachable outside of
single-user mode. That's valuable against undiscovered bugs of this
class.

3. Any bug in stop limit enforcement is less likely to have user impact.
For a live example, see the XXX comment that
https://postgr.es/m/attachment/111084/slru-truncate-modulo-v3.patch adds
to CheckPointPredicate().

Raising a stop limit prompts an examination of warn limits, which
represent the time separating the initial torrent of warnings from the
service outage. The current limits appeared in 2005; transaction rates
have grown, while human reaction times have not. I like warnings
starting when an SLRU is 98% consumed (40M XIDs or mxacts remaining).
That doesn't change things enough to make folks reconfigure VACUUM, and
it buys back some of the grace period DBAs had in 2005. I considered
95-97%, but the max_val of autovacuum_freeze_max_age would then start
the warnings before the autovacuum. While that wouldn't rule out a value
lower than 98%, 98% felt fine anyhow.

For the new stop limits, I propose allowing 99.85% SLRU fullness (stop
with 3M XIDs or mxacts remaining). If stopping this early will bother
users, an alternative is 3M for XIDs and 0.2M for others. Either way
leaves at least two completely-idle segments for each SLRU, which I
expect to mitigate present and future edge-case bugs.

Changing this does threaten clusters that experience pg_upgrade when
close to a limit. pg_upgrade can fail or, worse, yield a cluster that
spews warnings shortly after the upgrade. I could implement
countermeasures, but they would take effect only when one upgrades a
cluster having a 98%-full SLRU. I propose not to change pg_upgrade; some
sites may find cause to do a whole-cluster VACUUM before pg_upgrade. Do
you agree or disagree with that choice?

I am attaching a patch (not for commit) that demonstrates the pg_upgrade
behavior that nearly-full-SLRU clusters would see.
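For a sense of scale, the headroom these limits leave can be watched from
SQL. A rough sketch (approximate: it treats the wraparound point as 2^31
transactions past the oldest frozen XID and ignores the reserved special
XIDs):

    SELECT datname,
           age(datfrozenxid) AS xid_age,
           2147483648 - age(datfrozenxid) AS approx_xids_until_wraparound
    FROM pg_database
    ORDER BY xid_age DESC;

Today the stop limit is reached about 1M XIDs before the wrap point, with
warnings starting 10M before that; the proposal moves those distances to
3M and 40M from the wrap point.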
Thanks, nm diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c index f3da40a..c43ebbf 100644 --- a/src/backend/access/transam/clog.c +++ b/src/backend/access/transam/clog.c @@ -795,6 +795,10 @@ TrimCLOG(void) int slotno; char *byteptr; + /* hack for pg_resetwal moving us to the middle of a page */ + if (!SimpleLruDoesPhysicalPageExist(XactCtl, pageno)) + SimpleLruZeroPage(XactCtl, pageno); + slotno = SimpleLruReadPage(XactCtl, pageno, false, xid); byteptr = XactCtl->shared->page_buffer[slotno] + byteno; diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c index ce84dac..ed47ce0 100644 --- a/src/backend/access/transam/multixact.c +++ b/src/backend/access/transam/multixact.c @@ -2042,6 +2042,10 @@ TrimMultiXact(void) int slotno; MultiXactOffset *offptr; + /* hack for pg_resetwal moving us to the middle of a page */ + if (!SimpleLruDoesPhysicalPageExist(MultiXactOffsetCtl, pageno)) + SimpleLruZeroPage(MultiXactOffsetCtl, pageno); + slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, nextMXact); offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno]; offptr += entryno; diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c index 2334418..454ff8b 100644 --- a/src/bin/pg_resetwal/pg_resetwal.c +++ b/src/bin/pg_resetwal/pg_resetwal.c @@ -434,6
Re: pg_regress cleans up tablespace twice.
On Sun, Jun 21, 2020 at 12:08:37PM +1200, Thomas Munro wrote:
> I'm not sure what needs to change, but in the meantime I told it to
> comment out the offending test from the schedule files:
>
> +before_test:
> + - 'perl -p -i.bak -e "s/^test: tablespace/#test: tablespace/"
> src/test/regress/serial_schedule'
> + - 'perl -p -i.bak -e "s/^test: tablespace/#test: tablespace/"
> src/test/regress/parallel_schedule'
>
> Now the results are slowly turning green again.

Thanks, and sorry for the trouble. What actually happened back in 2018?
I can see c2ff3c68 in the git history of the cfbot code, but it does not
give much detail.

-- 
Michael
Re: new heapcheck contrib module
On Sat, Jun 13, 2020 at 2:36 AM Mark Dilger wrote: > > > > > On Jun 11, 2020, at 11:35 PM, Dilip Kumar wrote: > > > > On Fri, Jun 12, 2020 at 12:40 AM Mark Dilger > > wrote: > >> > >> > >> > >>> On Jun 11, 2020, at 9:14 AM, Dilip Kumar wrote: > >>> > >>> I have just browsed through the patch and the idea is quite > >>> interesting. I think we can expand it to check that whether the flags > >>> set in the infomask are sane or not w.r.t other flags and xid status. > >>> Some examples are > >>> > >>> - If HEAP_XMAX_LOCK_ONLY is set in infomask then HEAP_KEYS_UPDATED > >>> should not be set in new_infomask2. > >>> - If HEAP_XMIN(XMAX)_COMMITTED is set in the infomask then can we > >>> actually cross verify the transaction status from the CLOG and check > >>> whether is matching the hint bit or not. > >>> > >>> While browsing through the code I could not find that we are doing > >>> this kind of check, ignore if we are already checking this. > >> > >> Thanks for taking a look! > >> > >> Having both of those bits set simultaneously appears to fall into a > >> different category than what I wrote verify_heapam.c to detect. > > > > Ok > > > > > >> It doesn't violate any assertion in the backend, nor does it cause > >> the code to crash. (At least, I don't immediately see how it does > >> either of those things.) At first glance it appears invalid to have > >> those bits both set simultaneously, but I'm hesitant to enforce that > >> without good reason. If it is a good thing to enforce, should we also > >> change the backend code to Assert? > > > > Yeah, it may not hit assert or crash but it could lead to a wrong > > result. But I agree that it could be an assertion in the backend > > code. > > For v7, I've added an assertion for this. Per heap/README.tuplock, "We > currently never set the HEAP_XMAX_COMMITTED when the HEAP_XMAX_IS_MULTI bit > is set." I added an assertion for that, too. Both new assertions are in > RelationPutHeapTuple(). I'm not sure if that is the best place to put the > assertion, but I am confident that the assertion needs to only check tuples > destined for disk, as in memory tuples can and do violate the assertion. > > Also for v7, I've updated contrib/amcheck to report these two conditions as > corruption. > > > What about the other check, like hint bit is saying the > > transaction is committed but actually as per the clog the status is > > something else. I think in general processing it is hard to check > > such things in backend no? because if the hint bit is set saying that > > the transaction is committed then we will directly check its > > visibility with the snapshot. I think a corruption checker may be a > > good tool for catching such anomalies. > > I already made some design changes to this patch to avoid taking the > CLogTruncationLock too often. I'm happy to incorporate this idea, but > perhaps you could provide a design on how to do it without all the extra > locking? If not, I can try to get this into v8 as an optional check, so > users can turn it on at their discretion. Having the check enabled by > default is probably a non-starter. Okay, even I can't think a way to do it without an extra locking. I have looked into 0001 patch and I have a few comments. 1. + + /* Skip over unused/dead/redirected line pointers */ + if (!ItemIdIsUsed(ctx.itemid) || + ItemIdIsDead(ctx.itemid) || + ItemIdIsRedirected(ctx.itemid)) + continue; Isn't it a good idea to verify the Redirected Itemtid? 
Because we will still access the redirected item id to find the actual tuple from the index scan. Maybe not exactly at this level, but we can verify that the link itemid store in that is within the itemid range of the page or not. 2. + /* Check for tuple header corruption */ + if (ctx->tuphdr->t_hoff < SizeofHeapTupleHeader) + { + confess(ctx, + psprintf("t_hoff < SizeofHeapTupleHeader (%u < %u)", + ctx->tuphdr->t_hoff, + (unsigned) SizeofHeapTupleHeader)); + fatal = true; + } I think we can also check that if there is no NULL attributes (if (!(t_infomask & HEAP_HASNULL)) then ctx->tuphdr->t_hoff should be equal to SizeofHeapTupleHeader. 3. + ctx->offset = 0; + for (ctx->attnum = 0; ctx->attnum < ctx->natts; ctx->attnum++) + { + if (!check_tuple_attribute(ctx)) + break; + } + ctx->offset = -1; + ctx->attnum = -1; So we are first setting ctx->offset to 0, then inside check_tuple_attribute, we will keep updating the offset as we process the attributes and after the loop is over we set ctx->offset to -1, I did not understand that why we need to reset it to -1, do we ever check for that. We don't even initialize the ctx->offset to -1 while initializing the context for the tuple so I do not understand what is the meaning of the random value -1. 4. + if (!VARATT_IS_EXTENDED(chunk)) + { + chunksize = VARSIZE(chunk) - VARHDRSZ; + chunkdata = VARDATA(chunk); + } + else if (VARATT_IS_SHORT(chunk)) + { + /* + * could happen due to heap_form_tuple doing its thing + */
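A convenient way to eyeball the header fields being discussed here
(t_infomask, t_infomask2, t_hoff) on real pages is the pageinspect
extension; a sketch, with 'some_table' as a placeholder name:

    CREATE EXTENSION IF NOT EXISTS pageinspect;

    -- Tuples on block 0 whose infomask bits combine HEAP_XMAX_LOCK_ONLY
    -- (0x0080 in t_infomask) with HEAP_KEYS_UPDATED (0x2000 in
    -- t_infomask2), the combination the new assertion/corruption check
    -- is about.
    SELECT lp, t_hoff, t_infomask, t_infomask2
    FROM heap_page_items(get_raw_page('some_table', 0))
    WHERE (t_infomask & 128) <> 0
      AND (t_infomask2 & 8192) <> 0;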
Re: tag typos in "catalog.sgml"
On Sun, Jun 21, 2020 at 09:10:35AM +0200, Fabien COELHO wrote:
> While reviewing a documentation patch, I noticed that a few tags where
> wrong in "catalog.sgml". Attached patch fixes them.

Good catches, thanks Fabien. I will fix that tomorrow or so.

-- 
Michael
Re: pg_regress cleans up tablespace twice.
On Sun, Jun 21, 2020 at 8:42 PM Michael Paquier wrote: > On Sun, Jun 21, 2020 at 12:08:37PM +1200, Thomas Munro wrote: > > I'm not sure what needs to change, but in the meantime I told it to > > comment out the offending test from the schedule files: > > > > +before_test: > > + - 'perl -p -i.bak -e "s/^test: tablespace/#test: tablespace/" > > src/test/regress/serial_schedule' > > + - 'perl -p -i.bak -e "s/^test: tablespace/#test: tablespace/" > > src/test/regress/parallel_schedule' > > > > Now the results are slowly turning green again. > > Thanks, and sorry for the trouble. What actually happened back in > 2018? I can see c2ff3c68 in the git history of the cfbot code, but it > does not give much details. commit ce5d3424d6411f7a7228fd4463242cb382af3e0c Author: Andrew Dunstan Date: Sat Oct 20 09:02:36 2018 -0400 Lower privilege level of programs calling regression_main On Windows this mean that the regression tests can now safely and successfully run as Administrator, which is useful in situations like Appveyor. Elsewhere it's a no-op. Backpatch to 9.5 - this is harder in earlier branches and not worth the trouble. Discussion: https://postgr.es/m/650b0c29-9578-8571-b1d2-550d7f89f...@2ndquadrant.com
Re: suggest to rename enable_incrementalsort
On Sun, Jun 21, 2020 at 09:05:32AM +0200, Julien Rouhaud wrote:
> On Sun, Jun 21, 2020 at 8:26 AM Peter Eisentraut wrote:
>> I suggest to rename enable_incrementalsort to enable_incremental_sort.
>> This is obviously more readable and also how we have named recently
>> added multiword planner parameters.
>>
>> See attached patch.
>
> +1, this is a way better name (and patch LGTM on REL_13_STABLE).

The reason why I kept the single-word variant is consistency with other
GUCs that affect planning, like enable_indexscan, enable_hashjoin and
many others. That being said, I'm not particularly attached to this
choice, so if you think this is better I'm OK with it.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [PATCH] Initial progress reporting for COPY command
po 15. 6. 2020 v 2:18 odesílatel Michael Paquier napsal: > Hi Josef, > > On Sun, Jun 14, 2020 at 02:32:33PM +0200, Josef Šimánek wrote: > > Hello, as proposed by Pavel Stěhule and discussed on local czech > PostgreSQL > > maillist ( > > > https://groups.google.com/d/msgid/postgresql-cz/CAFj8pRCZ42CBCa1bPHr7htffSV%2BNAcgcHHG0dVqOog4bsu2LFw%40mail.gmail.com?utm_medium=email&utm_source=footer > ), > > I have prepared an initial patch for COPY command progress reporting. > > Sounds like a good idea to me. > Great. I will continue working on this. > > I have not found any tests for progress reporting. Are there any? It > would > > need two backends running (one running COPY, one checking output of > report > > view). Is there any similar test I can inspire at? In theory, it should > be > > possible to use dblink_send_query to run async COPY command in the > > background. > > We don't have any tests in core. I think that making deterministic > test cases is rather tricky here as long as we don't have a more > advanced testing framework that allows is to lock certain code paths > and keep around an expected state until a second session comes around > and looks at the progress catalog (even that would need adding more > code to core to mark the extra point looked at). So I think that it is > fine to not focus on that for this feature. The important parts are > the choice of the progress points and the data sent to MyProc, and > both should be chosen wisely. > Thanks for the info. I'm focusing exactly at looking for right spots to report the progress. I'll attach new patch with better places and supporting more options of reporting (including STDIN, STDOUT) soon and also I'll try to add it to commitfest. > > > My initial (attached) patch also doesn't introduce documentation for this > > system view. I can add that later once this patch is finalized (if that > > happens). > > You may want to add it to the next commit fest: > https://commitfest.postgresql.org/28/ > Documentation is necessary, and having some would ease reviews. > -- > Michael >
Re: [PATCH] Initial progress reporting for COPY command
po 15. 6. 2020 v 6:39 odesílatel Fujii Masao napsal: > > > On 2020/06/14 21:32, Josef Šimánek wrote: > > Hello, as proposed by Pavel Stěhule and discussed on local czech > PostgreSQL maillist ( > https://groups.google.com/d/msgid/postgresql-cz/CAFj8pRCZ42CBCa1bPHr7htffSV%2BNAcgcHHG0dVqOog4bsu2LFw%40mail.gmail.com?utm_medium=email&utm_source=footer), > I have prepared an initial patch for COPY command progress reporting. > > Sounds nice! > > > > file - bool - is file is used? > > program - bool - is program used? > > Are these fields really necessary in a progress view? > What values are reported when STDOUT/STDIN is specified in COPY command? > For STDOUT and STDIN file is true and program is false. > > file_bytes_processed - amount of bytes processed when file is used > (otherwise 0), works for both direction ( > > FROM/TO) when file is used (file = t) > > What value is reported when STDOUT/STDIN is specified in COPY command? For my first patch nothing was reported on STDOUT/STDIN usage. I'll attach new patch soon supporting those as well. > > > Regards, > > > -- > Fujii Masao > Advanced Computing Technology Center > Research and Development Headquarters > NTT DATA CORPORATION >
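For anyone wanting to watch the progress while testing, the check from a
second session would look something like the sketch below. The view name
pg_stat_progress_copy and the file/program/file_bytes_processed columns
are taken from the proposed patch (the pid column is assumed by analogy
with the existing pg_stat_progress_* views), so the details may change
before commit:

    SELECT pid, file, program, file_bytes_processed
    FROM pg_stat_progress_copy;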
Re: [PATCH] Initial progress reporting for COPY command
po 15. 6. 2020 v 2:18 odesílatel Michael Paquier napsal: > Hi Josef, > > On Sun, Jun 14, 2020 at 02:32:33PM +0200, Josef Šimánek wrote: > > Hello, as proposed by Pavel Stěhule and discussed on local czech > PostgreSQL > > maillist ( > > > https://groups.google.com/d/msgid/postgresql-cz/CAFj8pRCZ42CBCa1bPHr7htffSV%2BNAcgcHHG0dVqOog4bsu2LFw%40mail.gmail.com?utm_medium=email&utm_source=footer > ), > > I have prepared an initial patch for COPY command progress reporting. > > Sounds like a good idea to me. > > > I have not found any tests for progress reporting. Are there any? It > would > > need two backends running (one running COPY, one checking output of > report > > view). Is there any similar test I can inspire at? In theory, it should > be > > possible to use dblink_send_query to run async COPY command in the > > background. > > We don't have any tests in core. I think that making deterministic > test cases is rather tricky here as long as we don't have a more > advanced testing framework that allows is to lock certain code paths > and keep around an expected state until a second session comes around > and looks at the progress catalog (even that would need adding more > code to core to mark the extra point looked at). So I think that it is > fine to not focus on that for this feature. The important parts are > the choice of the progress points and the data sent to MyProc, and > both should be chosen wisely. > > > My initial (attached) patch also doesn't introduce documentation for this > > system view. I can add that later once this patch is finalized (if that > > happens). > > You may want to add it to the next commit fest: > https://commitfest.postgresql.org/28/ > Documentation is necessary, and having some would ease reviews. > I have added documentation, more code comments and I'll upload patch to commit fest. > -- > Michael >
Re: [PATCH] Initial progress reporting for COPY command
po 15. 6. 2020 v 7:34 odesílatel Bharath Rupireddy < bharath.rupireddyforpostg...@gmail.com> napsal: > > I'm using ftell to get current position in file to populate > file_bytes_processed without error handling (ftell can return -1L and also > populate errno on problems). > > > > 1. Is that a good way to get progress of file processing? > > IMO, it's better to handle the error cases. One possible case where > ftell can return -1 and set errno is when the total bytes processed is > more than LONG_MAX. > > Will your patch handle file_bytes_processed reporting for COPY FROM > STDIN cases? For this case, ftell can't be used. > > Instead of using ftell and worrying about the errors, a simple > approach could be to have a uint64 variable in CopyStateData to track > the number of bytes read whenever CopyGetData is called. This approach > can also handle the case of COPY FROM STDIN. > Thanks for suggestion. I used this approach and latest patch supports both STDIN and STDOUT now. > With Regards, > Bharath Rupireddy. > EnterpriseDB: http://www.enterprisedb.com >
Re: Possible NULL pointer deferenced (src/interfaces/libpq/fe-exec.c (line 563)
Em dom., 21 de jun. de 2020 às 02:16, Tom Lane escreveu: > Ranier Vilela writes: > > The res->curBlock pointer possibly, can be NULL here (line 563). > > No, it can't. > > To get to that line, nBytes has to be > 0, which means res->spaceLeft > has to be > 0, which cannot happen while res->curBlock is NULL. > Hi Tom, thanks for answer. regards, Ranier Vilela
Re: [PATCH] Missing links between system catalog documentation pages
Hi Fabien, Fabien COELHO writes: >> It's the first mention in the introductory paragraph of _each_ catalog >> table/view page, not the first mention in the entire catalogs.sgml file. >> E.g. https://www.postgresql.org/docs/current/catalog-pg-aggregate.html >> has two mentions of pg_proc one word apart: >> >> Each entry in pg_aggregate is an extension of an entry in pg_proc. The >> pg_proc entry carries the aggregate's name, … >> >> I didn't think there was much point in linkifying both in that case, and >> other similar situations. > > The point is that the user reads a sentence, attempts to jump but > sometimes can't, because the is not the first occurrence. I'd go for all > mentions of another relation should be link. Okay, I'll make them all links, except the pg_aggregate aggfnoid column, which I've changed from "pg_proc OID of the aggregate function" to just "OID of the agregate function", since pg_proc is linked immediately prior in the "references" section, and we generally don't mention the catalog table again in similar cases elsehwere. > Alse, ISTM you missed some, maybe you could consider adding them? eg > pg_database in the very first paragraph of the file, pg_attrdef in > pg_attribute description, quite a few in pg_class… Yes, I only looked at the intro paragraphs of the per-catalog pages, not the overview section nor the text after the column tables. I've gone through them all now and linked them. Updated patch attached. - ilmari -- "A disappointingly low fraction of the human race is, at any given time, on fire." - Stig Sandbeck Mathisen >From b2940b93f5ed17c0e739d6ed0c9411a44bd87428 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Dagfinn=20Ilmari=20Manns=C3=A5ker?= Date: Sun, 24 May 2020 21:48:29 +0100 Subject: [PATCH v2 1/2] Add missing cross-links in system catalog documentation This makes all mentions of other system catalogs and views in the system system catalog and view documentation pages hyperlinks, for easier navigation. Also linkify the first mention of pg_hba.conf in pg_hba_file_rules, as that's more specific and easier to spot than the link to the client authentication chapter at the bottom. --- doc/src/sgml/catalogs.sgml | 142 - 1 file changed, 78 insertions(+), 64 deletions(-) diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml index 700271fd40..051fa7d361 100644 --- a/doc/src/sgml/catalogs.sgml +++ b/doc/src/sgml/catalogs.sgml @@ -16,7 +16,8 @@ Normally, one should not change the system catalogs by hand, there are normally SQL commands to do that. (For example, CREATE DATABASE inserts a row into the - pg_database catalog — and actually + pg_database + catalog — and actually creates the database on disk.) There are some exceptions for particularly esoteric operations, but many of those have been made available as SQL commands over time, and so the need for direct manipulation @@ -381,9 +382,10 @@ sum, count, and max. Each entry in pg_aggregate is an extension of an entry - in pg_proc. The pg_proc - entry carries the aggregate's name, input and output data types, and - other information that is similar to ordinary functions. + in pg_proc. + The pg_proc entry carries the aggregate's name, + input and output data types, and other information that is similar to + ordinary functions. @@ -407,7 +409,7 @@ (references pg_proc.oid) - pg_proc OID of the aggregate function + OID of the aggregate function @@ -902,7 +904,7 @@ catalog structure for performance reasons). 
Also, amoplefttype and amoprighttype must match the oprleft and oprright fields of the - referenced pg_operator entry. + referenced pg_operator entry. @@ -1099,7 +1101,8 @@ table columns. There will be exactly one pg_attribute row for every column in every table in the database. (There will also be attribute entries for - indexes, and indeed all objects that have pg_class + indexes, and indeed all objects that have + pg_class entries.) @@ -1270,7 +1273,7 @@ This column has a default expression or generation expression, in which case there will be a corresponding entry in the - pg_attrdef catalog that actually defines the + pg_attrdef catalog that actually defines the expression. (Check attgenerated to determine whether this is a default or a generation expression.) @@ -1402,9 +1405,9 @@ In a dropped column's pg_attribute entry, atttypid is reset to zero, but attlen and the other fields copied from - pg_type are still valid. This arrangement is needed + pg_type are still valid. This arrangement is needed to cope with the situation where the dropped column's data type was - later dropped, and so there is no pg_type row anymore. + later dropped, and so there is no pg_type row anymore. attlen and
Re: [PATCH] Missing links between system catalog documentation pages
Fabien COELHO writes:
>> I didn't think there was much point in linkifying both in that case, and
>> other similar situations.

> The point is that the user reads a sentence, attempts to jump but
> sometimes can't, because the is not the first occurrence. I'd go for all
> mentions of another relation should be link.

That has not been our practice up to now, eg in comparable cases in
discussions of GUC variables, only the first reference is xref-ified.
I think it could be kind of annoying to make every reference a link,
both for regular readers (the link decoration is too bold in most
browsers) and for users of screen-reader software.

There is a fair question as to how far apart two references should
be before we link both of them. But I think that distance
does need to be more than zero, and probably more than one para.

regards, tom lane
vs for command line tools in the docs
Hi Hackers,

While I was looking at linkifying SQL commands in the system catalog
docs, I noticed that catalog.sgml uses the tag to refer to initdb, while
I'd expected it to use . Looking for patterns, I grepped for the two
tags with contents consisting only of lower-case letters, numbers,
hyphens and underscores, to restrict it to things that look like shell
command names, not SQL commands or full command lines.

When referring to binaries shipped with postgres itself, is by far the
most common, except for initdb, postgres, ecpg, pg_ctl, pg_resetwal, and
pg_config. For external programs it's more of a mix, but overall there
are more than twice as many instances of of this form than of .

I'm not proposing going through all 1500+ instances in one fell swoop,
but some consistency (and a policy going forward) would be nice.

- ilmari
-- 
"A disappointingly low fraction of the human race is, at any given time,
on fire." - Stig Sandbeck Mathisen
Re: [PATCH] Missing links between system catalog documentation pages
On 2020-Jun-21, Tom Lane wrote:

> That has not been our practice up to now, eg in comparable cases in
> discussions of GUC variables, only the first reference is xref-ified.
> I think it could be kind of annoying to make every reference a link,
> both for regular readers (the link decoration is too bold in most
> browsers) and for users of screen-reader software.

In the glossary I also changed things so that only the first reference
of a term in another term's definition is linked; my experience reading
the originals as submitted (which did link them all at some point) is
that the extra links are very distracting, bad for readability. So +1
for not adding links to every single mention.

> There is a fair question as to how far apart two references should
> be before we link both of them. But I think that distance
> does need to be more than zero, and probably more than one para.

Nod.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: vs for command line tools in the docs
ilm...@ilmari.org (Dagfinn Ilmari Mannsåker) writes:
> While I was looking at linkifying SQL commands in the system catalog
> docs, I noticed that catalog.sgml uses the tag to refer to
> initdb, while I'd expected it to use .

I agree that the latter is what we generally use.

regards, tom lane
vs formatting in the docs
Hi Hackers,

While looking at making more SQL commands into links, I noticed that
<xref> loses the monospace formatting of <command>, and can't itself be
wrapped in <command>. This becomes particularly apparent when you have
one link that can be an <xref> next to another that's a <link> ...
because it's actually referring to a specific variant of the command.

By some trial and error I found that putting <command> inside the
<refentrytitle> tag propagates the formatting to the <xref> contents.
We already do this with <application> for (most of) the client and
server applications. Is this something we want to do consistently for
both?

- ilmari
-- 
"The surreality of the universe tends towards a maximum" -- Skud's Law
"Never formulate a law or axiom that you're not prepared to live with
the consequences of." -- Skud's Meta-Law
Re: vs formatting in the docs
On 2020-Jun-21, Dagfinn Ilmari Mannsåker wrote: > While looking at making more SQL into links, I > noticed that loses the monospace formatting of , and > can't itself be wrapped in . Ouch. > By some trial and error I found that putting inside the > tag propagates the formatting to the contents. > We already do this with for (most of) the client and > server applications. Is this something we want to do consistently for > both? Looking at the ones that use , it looks like manpages are not damaged, so +1. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: vs for command line tools in the docs
On 06/21/20 11:16, Tom Lane wrote:
> ilm...@ilmari.org (Dagfinn Ilmari Mannsåker) writes:
>> docs, I noticed that catalog.sgml uses the tag to refer to
>> initdb, while I'd expected it to use .
> I agree that the latter is what we generally use.

'The latter' is in the Subject:, in the body

Regards,
-Chap
Re: vs formatting in the docs
Alvaro Herrera writes: > On 2020-Jun-21, Dagfinn Ilmari Mannsåker wrote: > >> While looking at making more SQL into links, I >> noticed that loses the monospace formatting of , and >> can't itself be wrapped in . > > Ouch. > >> By some trial and error I found that putting inside the >> tag propagates the formatting to the contents. >> We already do this with for (most of) the client and >> server applications. Is this something we want to do consistently for >> both? > > Looking at the ones that use , it looks like manpages are > not damaged, so +1. Attached are two patches: the first adds the missing tags, the second adds to all the SQL commands (specifically anything with 7). I'll add it to the commitfest. - ilmari -- - Twitter seems more influential [than blogs] in the 'gets reported in the mainstream press' sense at least. - Matt McLeod - That'd be because the content of a tweet is easier to condense down to a mainstream media article. - Calle Dybedahl >From 739e80f8e25b193af4fba54bceb1a7b37c47a793 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Dagfinn=20Ilmari=20Manns=C3=A5ker?= Date: Sun, 21 Jun 2020 16:26:45 +0100 Subject: [PATCH 1/2] Add missing tags in application doc s Most of them already have this, but some were missing --- doc/src/sgml/ref/initdb.sgml | 2 +- doc/src/sgml/ref/pg_basebackup.sgml | 2 +- doc/src/sgml/ref/pg_config-ref.sgml | 2 +- doc/src/sgml/ref/pg_dump.sgml | 2 +- doc/src/sgml/ref/pg_receivewal.sgml | 2 +- doc/src/sgml/ref/pg_restore.sgml | 2 +- doc/src/sgml/ref/pg_verifybackup.sgml | 2 +- 7 files changed, 7 insertions(+), 7 deletions(-) diff --git a/doc/src/sgml/ref/initdb.sgml b/doc/src/sgml/ref/initdb.sgml index 1635fcb1fd..b6a55ce105 100644 --- a/doc/src/sgml/ref/initdb.sgml +++ b/doc/src/sgml/ref/initdb.sgml @@ -9,7 +9,7 @@ - initdb + initdb 1 Application diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml index db480be674..ce1653faf7 100644 --- a/doc/src/sgml/ref/pg_basebackup.sgml +++ b/doc/src/sgml/ref/pg_basebackup.sgml @@ -9,7 +9,7 @@ - pg_basebackup + pg_basebackup 1 Application diff --git a/doc/src/sgml/ref/pg_config-ref.sgml b/doc/src/sgml/ref/pg_config-ref.sgml index 6c797746cc..e177769188 100644 --- a/doc/src/sgml/ref/pg_config-ref.sgml +++ b/doc/src/sgml/ref/pg_config-ref.sgml @@ -9,7 +9,7 @@ - pg_config + pg_config 1 Application diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml index 2f0807e912..8cd5875f67 100644 --- a/doc/src/sgml/ref/pg_dump.sgml +++ b/doc/src/sgml/ref/pg_dump.sgml @@ -9,7 +9,7 @@ - pg_dump + pg_dump 1 Application diff --git a/doc/src/sgml/ref/pg_receivewal.sgml b/doc/src/sgml/ref/pg_receivewal.sgml index cad4689ae6..865ec84262 100644 --- a/doc/src/sgml/ref/pg_receivewal.sgml +++ b/doc/src/sgml/ref/pg_receivewal.sgml @@ -9,7 +9,7 @@ - pg_receivewal + pg_receivewal 1 Application diff --git a/doc/src/sgml/ref/pg_restore.sgml b/doc/src/sgml/ref/pg_restore.sgml index 232f88024f..6cb06d4910 100644 --- a/doc/src/sgml/ref/pg_restore.sgml +++ b/doc/src/sgml/ref/pg_restore.sgml @@ -9,7 +9,7 @@ - pg_restore + pg_restore 1 Application diff --git a/doc/src/sgml/ref/pg_verifybackup.sgml b/doc/src/sgml/ref/pg_verifybackup.sgml index 44f4e67d57..527af5ad28 100644 --- a/doc/src/sgml/ref/pg_verifybackup.sgml +++ b/doc/src/sgml/ref/pg_verifybackup.sgml @@ -9,7 +9,7 @@ - pg_verifybackup + pg_verifybackup 1 Application -- 2.26.2 >From e8cd316e87e467cf70b1989c7c68d3110dbe5323 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Dagfinn=20Ilmari=20Manns=C3=A5ker?= Date: Sun, 21 Jun 2020 16:52:47 +0100 
Subject: [PATCH 2/2] Add to SQL command tags This means that links to them will get the proper monospace formatting. Manpage formatting is not affected. --- doc/src/sgml/ref/abort.sgml | 2 +- doc/src/sgml/ref/alter_aggregate.sgml | 2 +- doc/src/sgml/ref/alter_collation.sgml | 2 +- doc/src/sgml/ref/alter_conversion.sgml| 2 +- doc/src/sgml/ref/alter_database.sgml | 2 +- doc/src/sgml/ref/alter_default_privileges.sgml| 2 +- doc/src/sgml/ref/alter_domain.sgml| 2 +- doc/src/sgml/ref/alter_event_trigger.sgml | 2 +- doc/src/sgml/ref/alter_extension.sgml | 2 +- doc/src/sgml/ref/alter_foreign_data_wrapper.sgml | 2 +- doc/src/sgml/ref/alter_foreign_table.sgml | 2 +- doc/src/sgml/ref/alter_function.sgml | 2 +- doc/src/sgml/ref/alter_group.sgml | 2 +- doc/src/sgml/ref/alter_index.sgml | 2 +- doc/src/sgml/ref/alter_language.sgml | 2 +- doc/src/sgml/ref/alter_large_object.sgml | 2 +- doc/src/sgml/ref/alter_materialized_view.sgml | 2 +- doc/src/sgml/ref/alter_opclass.sgml | 2 +- doc/src/sgml
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762
Hi, On 6/5/20 11:51 AM, Alvaro Herrera wrote: > On 2020-Jun-05, Dave Cramer wrote: > >> On Thu, 4 Jun 2020 at 19:46, Alvaro Herrera >> wrote: > >>> Ouch ... so they made IDENT in the replication grammar be a trigger to >>> enter the regular grammar. Crazy. No way to put those worms back in >>> the tin now, I guess. >> >> Is that documented ? > > I don't think it is. > >>> It is still my opinion that we should prohibit a logical replication >>> connection from being used to do physical replication. Horiguchi-san, >>> Sawada-san and Masao-san are all of the same opinion. Dave Cramer (of >>> the JDBC team) is not opposed to the change -- he says they're just >>> using it because they didn't realize they should be doing differently. >> >> I think my exact words were >> >> "I don't see this is a valid reason to keep doing something. If it is >> broken then fix it. >> Clients can deal with the change." >> >> in response to: >> >>> Well, I don't really think that we can just break a behavior that >>> exists since 9.4 as you could break applications relying on the >>> existing behavior, and that's also the point of Vladimir upthread. >> >> Which is different than not being opposed to the change. I don't see this >> as broken, and it's quite possible that some of our users are using >> it. > > Apologies for misinterpreting. > >> It certainly needs to be documented > > I'd rather not. The PG13 RMT had a discussion about this thread, and while the initial crash has been fixed, we decided to re-open the Open Item around whether we should allow physical replication to be initiated in a logical replication session. We anticipate a resolution for PG13, whether it is explicitly disallowing physical replication from occurring on a logical replication slot, maintaining the status quo, or something else such that there is consensus on the approach. Thanks, Jonathan signature.asc Description: OpenPGP digital signature
Re: [PATCH] Missing links between system catalog documentation pages
Alvaro Herrera writes: > On 2020-Jun-21, Tom Lane wrote: > >> That has not been our practice up to now, eg in comparable cases in >> discussions of GUC variables, only the first reference is xref-ified. >> I think it could be kind of annoying to make every reference a link, >> both for regular readers (the link decoration is too bold in most >> browsers) and for users of screen-reader software. > > In the glossary I also changed things so that only the first reference > of a term in another term's definition is linked; my experience reading > the originals as submitted (which did link them all at some point) is > that the extra links are very distracting, bad for readability. So +1 > for not adding links to every single mention. There were only three cases of multiple mentions of the same table in a single paragraph, I've removed them in the attached patch. I've also added a second patch that makes the SQL commands links. There were some cases of the same commands being mentioned in the descriptions of multiple columns in the same table, but I've left those in place, since that feels less disruptive than in prose. >> There is a fair question as to how far apart two references should >> be before we both of them. But I think that distance >> does need to be more than zero, and probably more than one para. > > Nod. I tend to agree. - ilmari -- "A disappointingly low fraction of the human race is, at any given time, on fire." - Stig Sandbeck Mathisen >From 165c4aa4613d572a0547a2f2d01dcf06dfa93a24 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Dagfinn=20Ilmari=20Manns=C3=A5ker?= Date: Sun, 24 May 2020 21:48:29 +0100 Subject: [PATCH v3 1/2] Add missing cross-links in system catalog documentation This makes the first mention of a system catalog or view in each paragraph in the system system catalog and view documentation pages hyperlinks, for easier navigation. Also linkify the first mention of pg_hba.conf in pg_hba_file_rules, as that's more specific and easier to spot than the link to the client authentication chapter at the bottom. --- doc/src/sgml/catalogs.sgml | 140 - 1 file changed, 77 insertions(+), 63 deletions(-) diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml index 700271fd40..47d78d2ff8 100644 --- a/doc/src/sgml/catalogs.sgml +++ b/doc/src/sgml/catalogs.sgml @@ -16,7 +16,8 @@ Normally, one should not change the system catalogs by hand, there are normally SQL commands to do that. (For example, CREATE DATABASE inserts a row into the - pg_database catalog — and actually + pg_database + catalog — and actually creates the database on disk.) There are some exceptions for particularly esoteric operations, but many of those have been made available as SQL commands over time, and so the need for direct manipulation @@ -381,9 +382,10 @@ sum, count, and max. Each entry in pg_aggregate is an extension of an entry - in pg_proc. The pg_proc - entry carries the aggregate's name, input and output data types, and - other information that is similar to ordinary functions. + in pg_proc. + The pg_proc entry carries the aggregate's name, + input and output data types, and other information that is similar to + ordinary functions. @@ -407,7 +409,7 @@ (references pg_proc.oid) - pg_proc OID of the aggregate function + OID of the aggregate function @@ -902,7 +904,7 @@ catalog structure for performance reasons). Also, amoplefttype and amoprighttype must match the oprleft and oprright fields of the - referenced pg_operator entry. + referenced pg_operator entry. 
@@ -1099,7 +1101,8 @@ table columns. There will be exactly one pg_attribute row for every column in every table in the database. (There will also be attribute entries for - indexes, and indeed all objects that have pg_class + indexes, and indeed all objects that have + pg_class entries.) @@ -1270,7 +1273,7 @@ This column has a default expression or generation expression, in which case there will be a corresponding entry in the - pg_attrdef catalog that actually defines the + pg_attrdef catalog that actually defines the expression. (Check attgenerated to determine whether this is a default or a generation expression.) @@ -1402,7 +1405,7 @@ In a dropped column's pg_attribute entry, atttypid is reset to zero, but attlen and the other fields copied from - pg_type are still valid. This arrangement is needed + pg_type are still valid. This arrangement is needed to cope with the situation where the dropped column's data type was later dropped, and so there is no pg_type row anymore. attlen and the other fields can be used @@ -1835,14 +1838,16 @@ The catalog pg_class catalogs tables and most - everything else that has columns or is
Re: [PATCH] Add support for choosing huge page size
> Documentation syntax error "2MB" shows up as: Ops, sorry, should be fixed now. > The build is currently failing on Windows: Ahh, thanks. Looks like the Windows stuff isn't autogenerated, so maybe this new patch works.. > When using huge_pages=on, huge_page_size=1GB, but default shared_buffers, I noticed that the error message reports the wrong (unrounded) size in this message: Ahh, yes, that is correct. Switched to printing the _real_ allocsize now! > 1GB pages are so big that it becomes a little tricky to set shared buffers large enough without wasting RAM. What I mean is, if I want to use shared_buffers=16GB, I need to have at least 17 huge pages available, but the 17th page is nearly entirely wasted! Imagine that on POWER 16GB pages. That makes me wonder if we should actually redefine these GUCs differently so that you state the total, or at least use the rounded memory for buffers... I think we could consider that to be a separate problem with a separate patch though. Yes, that is a good point! But as you say, I guess that fits better in another patch. > Just for fun, I compared 4KB, 2MB and 1GB pages for a hash join of a 3.5GB table against itself. [...] Thanks for the results! Will look into your patch when I get time, but it certainly looks cool! I have a 4-node numa machine with ~100GiB of memory and a single node numa machine, so i'll take some benchmarks when I get time! > I wondered if this was something to do > with NUMA effects on this two node box, so I tried running that again > with postgres under numactl --cpunodebind 0 --membind 0 and I got: [...] Yes, making this "properly" numa aware to avoid/limit cross-numa memory access is kinda tricky. When reserving huge pages they are distributed more or less evenly between the nodes, and they can be found by using `grep -R "" /sys/devices/system/node/node*/hugepages/hugepages-*/nr_hugepages` (can also be written to), so there _may_ be a chance that the huge pages you got was on another node than 0 (due to the fact that there not were enough), but that is just guessing. tor. 18. jun. 2020 kl. 06:01 skrev Thomas Munro : > > Hi Odin, > > Documentation syntax error "2MB" shows up as: > > config.sgml:1605: parser error : Opening and ending tag mismatch: > literal line 1602 and para > > ^ > > Please install the documentation tools > https://www.postgresql.org/docs/devel/docguide-toolsets.html, rerun > configure and "make docs" to see these kinds of errors. > > The build is currently failing on Windows: > > undefined symbol: HAVE_DECL_MAP_HUGE_MASK at src/include/pg_config.h > line 143 at src/tools/msvc/Mkvcbuild.pm line 851. > > I think that's telling us that you need to add this stuff into > src/tools/msvc/Solution.pm, so that we can say it doesn't have it. I > don't have Windows but whenever you post a new version we'll see if > Windows likes it here: > > http://cfbot.cputube.org/odin-ugedal.html > > When using huge_pages=on, huge_page_size=1GB, but default > shared_buffers, I noticed that the error message reports the wrong > (unrounded) size in this message: > > 2020-06-18 02:06:30.407 UTC [73552] HINT: This error usually means > that PostgreSQL's request for a shared memory segment exceeded > available memory, swap space, or huge pages. To reduce the request > size (currently 149069824 bytes), reduce PostgreSQL's shared memory > usage, perhaps by reducing shared_buffers or max_connections. 
> > The request size was actually: > > mmap(NULL, 1073741824, PROT_READ|PROT_WRITE, > MAP_SHARED|MAP_ANONYMOUS|MAP_HUGETLB|30< ENOMEM (Cannot allocate memory) > > 1GB pages are so big that it becomes a little tricky to set shared > buffers large enough without wasting RAM. What I mean is, if I want > to use shared_buffers=16GB, I need to have at least 17 huge pages > available, but the 17th page is nearly entirely wasted! Imagine that > on POWER 16GB pages. That makes me wonder if we should actually > redefine these GUCs differently so that you state the total, or at > least use the rounded memory for buffers... I think we could consider > that to be a separate problem with a separate patch though. > > Just for fun, I compared 4KB, 2MB and 1GB pages for a hash join of a > 3.5GB table against itself. Hash joins are the perfect way to > exercise the TLB because they're very likely to miss. I also applied > my patch[1] to allow parallel queries to use shared memory from the > main shared memory area, so that they benefit from the configured page > size, using pages that are allocated once at start up. (Without that, > you'd have to mess around with /dev/shm mount options, and then hope > that pages were available at query time, and it'd also be slower for > other stupid implementation reasons). > > # echo never > /sys/kernel/mm/transparent_hugepage/enabled > # echo 8500 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages > # echo 17 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages > > shared_buffers=
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762
Hi, On 2020-06-21 13:45:36 -0400, Jonathan S. Katz wrote: > The PG13 RMT had a discussion about this thread, and while the initial > crash has been fixed, we decided to re-open the Open Item around whether > we should allow physical replication to be initiated in a logical > replication session. Since this is a long-time issue, this doesn't quite seem like an issue for the RMT? > We anticipate a resolution for PG13, whether it is explicitly > disallowing physical replication from occurring on a logical replication > slot, maintaining the status quo, or something else such that there is > consensus on the approach. s/logical replication slot/logical replication connection/? I still maintain that adding restrictions here is a bad idea. Even disregarding the discussion of running normal queries interspersed, it's useful to be able to both request WAL and receive logical changes over the same connection. E.g. for creating a logical replica by first doing a physical base backup (vastly faster), or fetching WAL for decoding large transactions onto a standby. And I just don't see any reasons to disallow it. There's basically no reduction in complexity by doing so. Greetings, Andres Freund
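To make the scenario concrete: a walsender connection opened with
replication=database accepts both the replication grammar and plain SQL,
so a single psql session (connected with e.g.
psql "dbname=postgres replication=database") can mix commands like the
following sketch — IDENTIFY_SYSTEM from the replication grammar, an
ordinary SELECT handled by the regular grammar, and the physical
streaming request this thread is about (0/0 is just an arbitrary start
position):

    IDENTIFY_SYSTEM;
    SELECT pg_current_wal_lsn();
    START_REPLICATION 0/0;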
Re: [PATCH] Add support for choosing huge page size
Hi, On 2020-06-18 16:00:49 +1200, Thomas Munro wrote: > Unfortunately I can't access the TLB miss counters on this system due > to virtualisation restrictions, and the systems where I can don't have > 1GB pages. According to cpuid(1) this system has a fairly typical > setup: > >cache and TLB information (2): > 0x63: data TLB: 2M/4M pages, 4-way, 32 entries > data TLB: 1G pages, 4-way, 4 entries > 0x03: data TLB: 4K pages, 4-way, 64 entries Hm. Doesn't that system have a second level of TLB (STLB) with more 1GB entries? I think there's some errata around what intel exposes via cpuid around this :( Guessing that this is a skylake server chip? https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(server)#Memory_Hierarchy > [...] Additionally there is a unified L2 TLB (STLB) > [...] STLB > [...] 1 GiB page translations: > [...] 16 entries; 4-way set associative > This operation is touching about 8GB of data (scanning 3.5GB of table, > building a 4.5GB hash table) so 4 x 1GB is not enough do this without > TLB misses. I assume this uses 7 workers? > Let's try that again, except this time with shared_buffers=4GB, > dynamic_shared_memory_main_size=4GB, and only half as many tuples in > t, so it ought to fit: > > 4KB pages: 6.37 seconds > 2MB pages: 4.96 seconds > 1GB pages: 5.07 seconds > > Well that's disappointing. Hm, I don't actually know the answer to this: If this actually uses multiple workers, won't the fact that each has an independent page table (despite having overlapping contents) lead to there being fewer actually available 1GB entries available? Obviously depends on how processes are scheduled (iirc hyperthreading shares dTLBs). Might be worth looking at whether there are cpu migrations or testing with a single worker. > I wondered if this was something to do > with NUMA effects on this two node box, so I tried running that again > with postgres under numactl --cpunodebind 0 --membind 0 and I got: > 4KB pages: 5.43 seconds > 2MB pages: 4.05 seconds > 1GB pages: 4.00 seconds > > From this I can't really conclude that it's terribly useful to use > larger page sizes, but it's certainly useful to have the ability to do > further testing using the proposed GUC. Due to the low number of 1GB entries they're quite likely to be problematic imo. Especially when there's more concurrent misses than there are page table entries. I'm somewhat doubtful that it's useful to use 1GB entries for all of our shared memory when that's bigger than the maximum covered size. I suspect that it'd better to use 1GB entries for some and smaller entries for the rest of the memory. Greetings, Andres Freund
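For anyone reproducing these runs, the relevant settings can be
double-checked from SQL once the server is up; note that huge_page_size
here is the GUC proposed by this patch, not an existing setting:

    SELECT name, setting, unit
    FROM pg_settings
    WHERE name IN ('huge_pages', 'huge_page_size', 'shared_buffers');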
Re: [PATCH] Missing links between system catalog documentation pages
Hello Tom,

>>> I didn't think there was much point in linkifying both in that case,
>>> and other similar situations.
>>
>> The point is that the user reads a sentence, attempts to jump but
>> sometimes can't, because the is not the first occurrence. I'd go for
>> all mentions of another relation should be link.
>
> That has not been our practice up to now, eg in comparable cases in
> discussions of GUC variables, only the first reference is xref-ified.
> I think it could be kind of annoying to make every reference a link,
> both for regular readers (the link decoration is too bold in most
> browsers)

Hmmm. That looks like an underlying CSS issue, not that links are
intrinsically bad. I find it annoying that the same thing appears
differently from one line to the next. It seems I'm the only one who
likes things to be uniform, though.

> and for users of screen-reader software.

I do not know about those, and what constraints it puts on markup.

-- 
Fabien.
Re: Parallel Seq Scan vs kernel read ahead
On Sat, 20 Jun 2020 at 08:00, Robert Haas wrote:
>
> On Thu, Jun 18, 2020 at 10:10 PM David Rowley wrote:
> > Here's a patch which caps the maximum chunk size to 131072. If
> > someone doubles the page size then that'll be 2GB instead of 1GB. I'm
> > not personally worried about that.
>
> Maybe use RELSEG_SIZE?

I was hoping to keep the guarantees that the chunk size is always a
power of 2. If, for example, someone configured PostgreSQL
--with-segsize=3, then RELSEG_SIZE would be 393216 with the standard
BLCKSZ. Not having it a power of 2 does mean the ramp-down is more
uneven when the sizes become very small:

postgres=# select 393216>>x from generate_Series(0,18)x;
 ?column?
----------
   393216
   196608
    98304
    49152
    24576
    12288
     6144
     3072
     1536
      768
      384
      192
       96
       48
       24
       12
        6
        3
        1
(19 rows)

Perhaps that's not a problem though, but then again, perhaps just
keeping it at 131072 regardless of RELSEG_SIZE and BLCKSZ is also ok.
The benchmarks I did on Windows [1] showed that the returns diminished
once we started making the step size some decent amount so my thoughts
are that I've set PARALLEL_SEQSCAN_MAX_CHUNK_SIZE to something large
enough that it'll make no difference to the performance anyway. So
there's probably not much point in giving it too much thought. Perhaps
pg_nextpower2_32(RELSEG_SIZE) would be okay though.

David

[1] https://www.postgresql.org/message-id/caaphdvoppka+q5y_k_6cuv4u6dphmz771veumuzls3d3mwy...@mail.gmail.com
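For comparison, a power-of-2 cap such as the proposed 131072 halves
evenly at every step of the ramp-down:

postgres=# select 131072>>x from generate_series(0,17)x;
 ?column?
----------
   131072
    65536
    32768
    16384
     8192
     4096
     2048
     1024
      512
      256
      128
       64
       32
       16
        8
        4
        2
        1
(18 rows)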
Re: suggest to rename enable_incrementalsort
On Sun, 21 Jun 2020 at 23:22, Tomas Vondra wrote:
>
> On Sun, Jun 21, 2020 at 09:05:32AM +0200, Julien Rouhaud wrote:
> > On Sun, Jun 21, 2020 at 8:26 AM Peter Eisentraut wrote:
> >>
> >> I suggest to rename enable_incrementalsort to enable_incremental_sort.
> >> This is obviously more readable and also how we have named recently
> >> added multiword planner parameters.
> >>
> >> See attached patch.
> >
> > +1, this is a way better name (and patch LGTM on REL_13_STABLE).
>
> The reason why I kept the single-word variant is consistency with other
> GUCs that affect planning, like enable_indexscan, enable_hashjoin and
> many others.

Looking at the other enable_* GUCs, for all the ones that aim to disable
a certain executor node type, with the exception of enable_hashagg and
enable_bitmapscan, they're all pretty consistent in naming the GUC after
the executor node's .c file:

enable_bitmapscan               nodeBitmapHeapscan.c
enable_gathermerge              nodeGatherMerge.c
enable_hashagg                  nodeAgg.c
enable_hashjoin                 nodeHashjoin.c
enable_incrementalsort          nodeIncrementalSort.c
enable_indexonlyscan            nodeIndexonlyscan.c
enable_indexscan                nodeIndexscan.c
enable_material                 nodeMaterial.c
enable_mergejoin                nodeMergejoin.c
enable_nestloop                 nodeNestloop.c
enable_parallel_append          nodeAppend.c
enable_parallel_hash            nodeHash.c
enable_partition_pruning
enable_partitionwise_aggregate
enable_partitionwise_join
enable_seqscan                  nodeSeqscan.c
enable_sort                     nodeSort.c
enable_tidscan                  nodeTidscan.c

enable_partition_pruning, enable_partitionwise_aggregate,
enable_partitionwise_join are the odd ones out here as they're not
really related to a specific node type.

Going by that, it does seem the current name for enable_incrementalsort
is consistent with the majority. Naming it enable_incremental_sort looks
like it would be more suited if the feature had been added by
overloading nodeSort.c. In that regard, it would be similar to
enable_parallel_append and enable_parallel_hash, where the middle word
becomes a modifier.

David
Get rid of runtime handling of AlternativeSubPlan?
Back in bd3daddaf232d95b0c9ba6f99b0170a0147dd8af, which introduced AlternativeSubPlans, I wrote: There is a lot more that could be done based on this infrastructure: in particular it's interesting to consider switching to the hash plan if we start out using the non-hashed plan but find a lot more upper rows going by than we expected. I have therefore left some minor inefficiencies in place, such as initializing both subplans even though we will currently only use one. That commit will be twelve years old come August, and nobody has either built anything else atop it or shown any interest in making the plan choice switchable mid-run. So it seems like kind of a failed experiment. Therefore, I'm considering the idea of ripping out all executor support for AlternativeSubPlan and instead having the planner replace an AlternativeSubPlan with the desired specific SubPlan somewhere late in planning, possibly setrefs.c. Admittedly, the relevant executor support only amounts to a couple hundred lines, but that's not nothing. A perhaps-more-useful effect is to get rid of the confusing and poorly documented EXPLAIN output that you get for an AlternativeSubPlan. I also noted that the existing subplan-selection logic in ExecInitAlternativeSubPlan is really pretty darn bogus, in that it uses a one-size-fits-all execution count estimate of parent->plan->plan_rows, no matter which subexpression the subplan is in. This is only appropriate for subplans in the plan node's targetlist, and can be either too high or too low elsewhere. It'd be relatively easy for setrefs.c to do better, I think, since it knows which subexpression it's working on at any point. Thoughts? regards, tom lane
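For readers who have not run into the EXPLAIN output being criticized here: an uncorrelated NOT IN is one of the classic ways to end up with an AlternativeSubPlan, since it cannot be converted to an anti-join. The example below is only a sketch — the table names are invented and the quoted plan line is approximate:

-- big_tbl and small_tbl are hypothetical tables.
explain (costs off)
select * from big_tbl b
where b.a not in (select s.a from small_tbl s);
-- On branches where the executor still picks the subplan at run time, the
-- plan's Filter line reads roughly:
--   Filter: (NOT (alternatives: SubPlan 1 or hashed SubPlan 2))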
Re: Get rid of runtime handling of AlternativeSubPlan?
On Mon, 22 Jun 2020 at 12:20, Tom Lane wrote: > > Back in bd3daddaf232d95b0c9ba6f99b0170a0147dd8af, which introduced > AlternativeSubPlans, I wrote: > > There is a lot more that could be done based on this infrastructure: in > particular it's interesting to consider switching to the hash plan if we > start > out using the non-hashed plan but find a lot more upper rows going by than > we > expected. I have therefore left some minor inefficiencies in place, such as > initializing both subplans even though we will currently only use one. > > That commit will be twelve years old come August, and nobody has either > built anything else atop it or shown any interest in making the plan choice > switchable mid-run. So it seems like kind of a failed experiment. > > Therefore, I'm considering the idea of ripping out all executor support > for AlternativeSubPlan and instead having the planner replace an > AlternativeSubPlan with the desired specific SubPlan somewhere late in > planning, possibly setrefs.c. > > Admittedly, the relevant executor support only amounts to a couple hundred > lines, but that's not nothing. A perhaps-more-useful effect is to get rid > of the confusing and poorly documented EXPLAIN output that you get for an > AlternativeSubPlan. > > I also noted that the existing subplan-selection logic in > ExecInitAlternativeSubPlan is really pretty darn bogus, in that it uses a > one-size-fits-all execution count estimate of parent->plan->plan_rows, no > matter which subexpression the subplan is in. This is only appropriate > for subplans in the plan node's targetlist, and can be either too high > or too low elsewhere. It'd be relatively easy for setrefs.c to do > better, I think, since it knows which subexpression it's working on > at any point.

When I was working on [1] a few weeks ago, I did wonder if I'd have to use an AlternativeSubPlan when doing result caching for subqueries. The problem is likely the same as why they were invented in the first place; we basically don't know how many rows the parent will produce when planning the subplan. For my case, I have an interest in both the number of rows in the outer plan, and the ndistinct estimate on the subplan parameters. If the parameters for the subquery are all distinct, then there's not much sense in trying to cache results to use later. We're never going to need them. Right now, if I wanted to use AlternativeSubPlan to delay the choice of this until run-time, then I'd be missing information about the ndistinct estimation since we don't have that information available in the final plan. Perhaps that's an argument for doing this in setrefs.c instead. I could look up the ndistinct estimate there.

As for switching plans on the fly during execution, I can see the sense in that as an idea. For the hashed subplan case, we'd likely want to switch to hashing mode if we discovered that there were many more rows in the outer query than we had thought there would be. However, I'm uncertain whether Result Cache would ever need anything similar, as technically we could just switch off the caching if we discovered our cache hit ratio was either terrible or 0. We would have an additional node to pull tuples through, however. Switching would also require that the tupleslot type was the same between the alternatives.

David

[1] https://www.postgresql.org/message-id/caaphdvrpcqyqdwergywx8j+2dlungxu+fosbq1uscxrunyx...@mail.gmail.com
Re: Improve planner cost estimations for alternative subplans
I wrote: > Nope. The entire reason why we have that kluge is that we don't know > until much later how many times we expect to execute the subplan. > AlternativeSubPlan allows the decision which subplan form to use to be > postponed till runtime; but when we're doing things like estimating the > cost and selectivity of a where-clause, we don't know that. > Maybe there's some way to recast things to avoid that problem, > but I have little clue what it'd look like. Actually ... maybe it's not that bad. Clearly there would be a circularity issue for selectivity estimation, but all the alternatives should have the same selectivity. Cost estimation is a different story: by the time we need to do cost estimation for a subexpression, we do in many cases have an idea how often the subexpression will be executed. I experimented with adding a number-of-evaluations parameter to cost_qual_eval, and found that the majority of call sites do have something realistic they can pass. The attached very-much-WIP patch shows my results so far. There's a lot of loose ends: * Any call site using COST_QUAL_EVAL_DUMMY_NUM_EVALS is a potential spot for future improvement. The only one that seems like it might be fundamentally unsolvable is cost_subplan's usage; but that would only matter if a subplan's testexpr contains an AlternativeSubPlan, which is probably a negligible case in practice. The other ones seem to just require more refactoring than I cared to do on a Sunday afternoon. * I did not do anything for postgres_fdw.c beyond making it compile. We can surely do better there, but it might require some rethinking of the way that plan costs get cached. * The merge and hash join costsize.c functions estimate costs of qpquals (i.e. quals to be applied at the join that are not being used as merge or hash conditions) by computing cost_qual_eval of the whole joinrestrictlist and then subtracting off the cost of the merge or hash quals. This is kind of broken if we want to use different num_eval estimates for the qpquals and the merge/hash quals, which I think we do. This probably just needs some refactoring to fix. We also need to save the relevant rowcounts in the join Path nodes so that createplan.c can do the right thing. * I think it might be possible to improve the situation in get_agg_clause_costs() if we're willing to postpone collection of the actual aggregate costs till later. This'd require two passes over the aggregate expressions, but that seems like it might not be terribly expensive. (I'd be inclined to also look at the idea of merging duplicate agg calls at plan time not run time, if we refactor that.) * I had to increase the number of table rows in one updatable_views.sql test to keep the plans the same. Without that, the planner realized that a seqscan would be cheaper than an indexscan. The result wasn't wrong exactly, but it failed to prove that leakproof quals could be used as indexquals, so I think we need to keep the plan choice the same. Anyway, this is kind of invasive, but I think it shouldn't really add noticeable costs as long as we save relevant rowcounts rather than recomputing them in createplan.c. Is it worth doing? I dunno. AlternativeSubPlan is pretty much a backwater, I think --- if it were interesting performance-wise to a lot of people, more would have been done with it by now. 
regards, tom lane diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c index 9fc53cad68..1149c298ba 100644 --- a/contrib/postgres_fdw/postgres_fdw.c +++ b/contrib/postgres_fdw/postgres_fdw.c @@ -656,7 +656,8 @@ postgresGetForeignRelSize(PlannerInfo *root, JOIN_INNER, NULL); - cost_qual_eval(&fpinfo->local_conds_cost, fpinfo->local_conds, root); + cost_qual_eval(&fpinfo->local_conds_cost, fpinfo->local_conds, + COST_QUAL_EVAL_DUMMY_NUM_EVALS, root); /* * Set # of retrieved rows and cached relation costs to some negative @@ -2766,7 +2767,8 @@ estimate_path_cost_size(PlannerInfo *root, /* Add in the eval cost of the locally-checked quals */ startup_cost += fpinfo->local_conds_cost.startup; total_cost += fpinfo->local_conds_cost.per_tuple * retrieved_rows; - cost_qual_eval(&local_cost, local_param_join_conds, root); + cost_qual_eval(&local_cost, local_param_join_conds, + COST_QUAL_EVAL_DUMMY_NUM_EVALS, root); startup_cost += local_cost.startup; total_cost += local_cost.per_tuple * retrieved_rows; @@ -2782,7 +2784,8 @@ estimate_path_cost_size(PlannerInfo *root, { QualCost tlist_cost; - cost_qual_eval(&tlist_cost, fdw_scan_tlist, root); + cost_qual_eval(&tlist_cost, fdw_scan_tlist, + COST_QUAL_EVAL_DUMMY_NUM_EVALS, root); startup_cost -= tlist_cost.startup; total_cost -= tlist_cost.startup; total_cost -= tlist_cost.per_tuple * rows; @@ -2870,9 +2873,11 @@ estimate_path_cost_size(PlannerInfo *root, /* * Calculate the cost of clauses push
Re: pg_ls_tmpdir to show directories and shared filesets (and pg_ls_*)
On Sun, Jun 07, 2020 at 10:07:19AM +0200, Fabien COELHO wrote: > Hello Justin, > > Rebased onto 7b48f1b490978a8abca61e9a9380f8de2a56f266 and renumbered OIDs. Rebased again on whatever broke func.sgml. > pg_stat_file() and pg_stat_dir_files() now return a char type, as well as > the function which call them, but the documentation does not seem to say > that it is the case. Fixed, thanks > I must admit that I'm not a fan on the argument management of > pg_ls_dir_metadata and pg_ls_dir_metadata_1arg and others. I understand that > it saves a few lines though, so maybe let it be. I think you're saying that you don't like the _1arg functions, but they're needed to allow the regression tests to pass: | * note: this wrapper is necessary to pass the sanity check in opr_sanity, | * which checks that all built-in functions that share the implementing C | * function take the same number of arguments > There is a comment in pg_ls_dir_files which talks about pg_ls_dir. > > Could pg_ls_*dir functions C implementations be dropped in favor of a pure > SQL implementation, like you did with recurse? I'm still waiting to hear feedback from a commiter if this is a good idea to put this into the system catalog. Right now, ts_debug is the only nontrivial function. > If so, ISTM that pg_ls_dir_files() could be significantly simplified by > moving its filtering flag to SQL conditions on "type" and others. That could > allow not to change the existing function output a keep the "isdir" column > defined as "type = 'd'" where it was used previously, if someone complains, > but still have the full capability of "ls". That would also allow to drop > the "*_1arg" hacks. Basically I'm advocating having 1 or 2 actual C > functions, and all other variants managed at the SQL level. You want to get rid of the 1arg stuff and just have one function. I see your point, but I guess the C function would still need to accept a "missing_ok" argument, so we need two functions, so there's not much utility in getting rid of the "include_dot_dirs" argument, which is there for consistency with pg_ls_dir. Conceivably we could 1) get rid of pg_ls_dir, and 2) get rid of the include_dot_dirs argument and 3) maybe make "missing_ok" a required argument; and, 4) get rid of the C wrapper functions, and replace with a bunch of stuff like this: SELECT name, size, access, modification, change, creation, type='d' AS isdir FROM pg_ls_dir_metadata('pg_wal') WHERE substring(name,1,1)!='.' AND type!='d'; Where the defaults I changed in this patchset still remain to be discussed: with or without metadata, hidden files, dotdirs. As I'm still waiting for committer feedback on the first 10 patches, so not intending to add more. -- Justin >From 60593b397f03ec840a97f8004d97177dec463752 Mon Sep 17 00:00:00 2001 From: Justin Pryzby Date: Mon, 16 Mar 2020 14:12:55 -0500 Subject: [PATCH v19 01/10] Document historic behavior of links to directories.. Backpatch to 9.5: pg_stat_file --- doc/src/sgml/func.sgml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index b7c450ea29..9f47745c5a 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -25881,7 +25881,7 @@ SELECT convert_from(pg_read_binary_file('file_in_utf8.txt'), 'UTF8'); Returns a record containing the file's size, last access time stamp, last modification time stamp, last file status change time stamp (Unix platforms only), file creation time stamp (Windows only), and a flag -indicating if it is a directory. 
+indicating if it is a directory (or a symbolic link to a directory). This function is restricted to superusers by default, but other users -- 2.17.0 >From c1acf58316869e176bc2b215e2545f9183f100e3 Mon Sep 17 00:00:00 2001 From: Justin Pryzby Date: Mon, 30 Mar 2020 18:59:16 -0500 Subject: [PATCH v19 02/10] pg_stat_file and pg_ls_dir_* to use lstat().. pg_ls_dir_* will now skip (no longer show) symbolic links, same as other non-regular file types, as we advertize we do since 8b6d94cf6. That seems to be the intented behavior, since irregular file types are 1) less portable; and, 2) we don't currently show a file's type except for "bool is_dir". pg_stat_file will now 1) show metadata of links themselves, rather than their target; and, 2) specifically, show links to directories with "is_dir=false"; and, 3) not error on broken symlinks. --- doc/src/sgml/func.sgml | 2 +- src/backend/utils/adt/genfile.c | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 9f47745c5a..b7c450ea29 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -25881,7 +25881,7 @@ SELECT convert_from(pg_read_binary_file('file_in_utf8.txt'), 'UTF8'); Returns a record containing the file's size, last access time stamp, last modification time stamp, last file status ch
Re: [PATCH] Initial progress reporting for COPY command
On 2020/06/21 20:33, Josef Šimánek wrote:

On Mon, Jun 15, 2020 at 6:39 AM Fujii Masao <masao.fu...@oss.nttdata.com> wrote:

On 2020/06/14 21:32, Josef Šimánek wrote:
> Hello, as proposed by Pavel Stěhule and discussed on local czech PostgreSQL maillist (https://groups.google.com/d/msgid/postgresql-cz/CAFj8pRCZ42CBCa1bPHr7htffSV%2BNAcgcHHG0dVqOog4bsu2LFw%40mail.gmail.com?utm_medium=email&utm_source=footer), I have prepared an initial patch for COPY command progress reporting.

Sounds nice!

> file - bool - is file is used?
> program - bool - is program used?

Are these fields really necessary in a progress view? What values are reported when STDOUT/STDIN is specified in the COPY command?

For STDOUT and STDIN, file is true and program is false.

Could you tell me why these columns are necessary in a *progress* view? If we want to see what COPY command is actually running, we can look at pg_stat_activity instead. For example:

SELECT pc.*, a.query
FROM pg_stat_progress_copy pc, pg_stat_activity a
WHERE pc.pid = a.pid;

> file_bytes_processed - amount of bytes processed when file is used (otherwise 0), works for both directions (FROM/TO) when file is used (file = t)

What value is reported when STDOUT/STDIN is specified in the COPY command?

For my first patch, nothing was reported on STDOUT/STDIN usage. I'll attach a new patch soon supporting those as well.

Thanks for the patch! With the patch, pg_stat_progress_copy seems to report the progress of the processing on file_fdw. Is this intentional?

Regards,

-- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Re: Resetting spilled txn statistics in pg_stat_replication
On Sun, Jun 21, 2020 at 3:27 AM Tomas Vondra wrote: > > On Thu, Jun 18, 2020 at 12:21:17PM +0530, Amit Kapila wrote: > >On Thu, Jun 18, 2020 at 8:01 AM Masahiko Sawada > > wrote: > >> > >> On Wed, 17 Jun 2020 at 20:14, Amit Kapila wrote: > >> > > >> > > >> > I had written above in the context of persisting these stats. I mean > >> > to say if the process has bounced or server has restarted then the > >> > previous stats might not make much sense because we were planning to > >> > use pid [1], so the stats from process that has exited might not make > >> > much sense or do you think that is okay? If we don't want to persist > >> > and the lifetime of these stats is till the process is alive then we > >> > are fine. > >> > > >> > >> Sorry for confusing you. The above my idea is about having the stats > >> per slots. That is, we add spill_txns, spill_count and spill_bytes to > >> pg_replication_slots or a new view pg_stat_logical_replication_slots > >> with some columns: slot_name plus these stats columns and stats_reset. > >> The idea is that the stats values accumulate until either the slot is > >> dropped, the server crashed, the user executes the reset function, or > >> logical decoding is performed with different logical_decoding_work_mem > >> value than the previous time. In other words, the stats values are > >> reset in either case. That way, I think the stats values always > >> correspond to logical decoding using the same slot with the same > >> logical_decoding_work_mem value. > >> > > > >What if the decoding has been performed by multiple backends using the > >same slot? In that case, it will be difficult to make the judgment > >for the value of logical_decoding_work_mem based on stats. It would > >make sense if we provide a way to set logical_decoding_work_mem for a > >slot but not sure if that is better than what we have now. > > > >What problems do we see in displaying these for each process? I think > >users might want to see the stats for the exited processes or after > >server restart but I think both of those are not even possible today. > >I think the stats are available till the corresponding WALSender > >process is active. > > > > I don't quite see what the problem is. We're in this exact position with > many other stats we track and various GUCs. If you decide to tune the > setting for a particular slot, you simply need to be careful which > backends decode the slot and what GUC values they used. > What problem do you if we allow it to display per-process (WALSender or backend)? They are incurred by the WALSender or by backends so displaying them accordingly seems more straightforward and logical to me. As of now, we don't allow it to be set for a slot, so it won't be convenient for the user to tune it per slot. I think we can allow to set it per-slot but not sure if there is any benefit for the same. > I really think we should not be inventing something that automatically > resets the stats when someone happens to change the GUC. > I agree with that. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: Resetting spilled txn statistics in pg_stat_replication
On Mon, Jun 22, 2020 at 8:22 AM Amit Kapila wrote: > > On Sun, Jun 21, 2020 at 3:27 AM Tomas Vondra > wrote: > > > > On Thu, Jun 18, 2020 at 12:21:17PM +0530, Amit Kapila wrote: > > >On Thu, Jun 18, 2020 at 8:01 AM Masahiko Sawada > > > wrote: > > >> > > >> On Wed, 17 Jun 2020 at 20:14, Amit Kapila > > >> wrote: > > >> > > > >> > > > >> > I had written above in the context of persisting these stats. I mean > > >> > to say if the process has bounced or server has restarted then the > > >> > previous stats might not make much sense because we were planning to > > >> > use pid [1], so the stats from process that has exited might not make > > >> > much sense or do you think that is okay? If we don't want to persist > > >> > and the lifetime of these stats is till the process is alive then we > > >> > are fine. > > >> > > > >> > > >> Sorry for confusing you. The above my idea is about having the stats > > >> per slots. That is, we add spill_txns, spill_count and spill_bytes to > > >> pg_replication_slots or a new view pg_stat_logical_replication_slots > > >> with some columns: slot_name plus these stats columns and stats_reset. > > >> The idea is that the stats values accumulate until either the slot is > > >> dropped, the server crashed, the user executes the reset function, or > > >> logical decoding is performed with different logical_decoding_work_mem > > >> value than the previous time. In other words, the stats values are > > >> reset in either case. That way, I think the stats values always > > >> correspond to logical decoding using the same slot with the same > > >> logical_decoding_work_mem value. > > >> > > > > > >What if the decoding has been performed by multiple backends using the > > >same slot? In that case, it will be difficult to make the judgment > > >for the value of logical_decoding_work_mem based on stats. It would > > >make sense if we provide a way to set logical_decoding_work_mem for a > > >slot but not sure if that is better than what we have now. > > > > > >What problems do we see in displaying these for each process? I think > > >users might want to see the stats for the exited processes or after > > >server restart but I think both of those are not even possible today. > > >I think the stats are available till the corresponding WALSender > > >process is active. > > > > > > > I don't quite see what the problem is. We're in this exact position with > > many other stats we track and various GUCs. If you decide to tune the > > setting for a particular slot, you simply need to be careful which > > backends decode the slot and what GUC values they used. > > > > What problem do you if we allow it to display per-process (WALSender > or backend)? They are incurred by the WALSender or by backends so > displaying them accordingly seems more straightforward and logical to > me. > > As of now, we don't allow it to be set for a slot, so it won't be > convenient for the user to tune it per slot. I think we can allow to > set it per-slot but not sure if there is any benefit for the same. > If we display stats as discussed in email [1] (pid, slot_name, spill_txns, spill_count, etc.), then we can even find the stats w.r.t each slot. [1] - https://www.postgresql.org/message-id/CA%2Bfd4k5nqeFdhpnCULpTh9TR%2B15rHZSbz0SDC6sZhr_v99SeKA%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
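Purely as a sketch of the per-slot alternative being debated here: if the view proposed upthread were added with the columns mentioned (pg_stat_logical_replication_slots with slot_name, spill_txns, spill_count, spill_bytes and stats_reset — none of which exist yet), monitoring might look something like:

-- Hypothetical view and columns, taken from the proposal in this thread.
select slot_name, spill_txns, spill_count,
       pg_size_pretty(spill_bytes) as spilled,
       stats_reset
from pg_stat_logical_replication_slots;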
proposal: unescape_text function
Hi,

There is one user request for an unescape function in core:

https://stackoverflow.com/questions/20124393/convert-escaped-unicode-character-back-to-actual-character-in-postgresql/20125412?noredirect=1#comment110502526_20125412

The request is about being able to do through a functional interface what we can currently do only with string literals. I wrote a plpgsql function, but a built-in function could be simpler:

CREATE OR REPLACE FUNCTION public.unescape(text, text)
 RETURNS text
 LANGUAGE plpgsql
AS $function$
DECLARE
  result text;
BEGIN
  EXECUTE format('SELECT U&%s UESCAPE %s',
                 quote_literal(replace($1, '\u', '^')),
                 quote_literal($2))
    INTO result;
  RETURN result;
END;
$function$

postgres=# select unescape('Odpov\u011Bdn\u00E1 osoba','^');
    unescape
-----------------
 Odpovědná osoba
(1 row)

What do you think about this?

Regards

Pavel
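For reference, the dynamic SQL that the EXECUTE above builds is just a Unicode-escaped string literal with a custom escape character; the equivalent hand-written literal for the example call would be something like this (requires standard_conforming_strings, which is on by default):

-- What unescape('Odpov\u011Bdn\u00E1 osoba','^') ends up running:
SELECT U&'Odpov^011Bdn^00E1 osoba' UESCAPE '^';
-- returns: Odpovědná osoba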
Re: Add support for INDEX_CLEANUP and TRUNCATE to vacuumdb
On Fri, Jun 19, 2020 at 10:57:01AM +0900, Michael Paquier wrote: > Thanks. This flavor looks good to me in terms of code, and the test > coverage is what's needed for all the code paths added. This version > is using my suggestion of upthread for the option names: --no-truncate > and --no-index-cleanup. Are people fine with this choice? Okay. I have gone through the patch again, and applied it as of 9550ea3. Thanks. -- Michael signature.asc Description: PGP signature
Re: tag typos in "catalog.sgml"
On Sun, Jun 21, 2020 at 07:31:16PM +0900, Michael Paquier wrote: > Good catches, thanks Fabien. I will fix that tomorrow or so. And applied to HEAD. -- Michael signature.asc Description: PGP signature
Re: I'd like to discuss scaleout at PGCon
Hi, I read through the symfora paper and it is a nice technique. I am not very sure about where Hyder is used commercially but given that it has come out of Microsoft Research so some microsoft products might be using it/some of these concepts already. With Regards, Sumanta Mukherjee. EnterpriseDB: http://www.enterprisedb.com On Wed, Jun 17, 2020 at 9:38 PM tsunakawa.ta...@fujitsu.com < tsunakawa.ta...@fujitsu.com> wrote: > Hello, > > > > It seems you didn't include pgsql-hackers. > > > > > > From: Sumanta Mukherjee > > > I saw the presentation and it is great except that it seems to be > unclear of both SD and SN if the storage and the compute are being > explicitly separated. Separation of storage and compute would have some > cost advantages as per my understanding. The following two work (ref below) > has some information about the usefulness of this technique for scale out > and so it would be an interesting addition to see if in the SN > architecture that is being proposed could be modified to take care of this > phenomenon and reap the gain. > > > > Thanks. Separation of compute and storage is surely to be considered. > Unlike the old days when the shared storage was considered to be a > bottleneck with slow HDDs and FC-SAN, we could now expect high speed shared > storage thanks to flash memory, NVMe-oF, and RDMA. > > > > > 1. Philip A. Bernstein, Colin W. Reid, and Sudipto Das. 2011. Hyder - A > > > Transactional Record Manager for Shared Flash. In CIDR 2011. > > > > This is interesting. I'll go into this. Do you know there's any product > based on Hyder? OTOH, Hyder seems to require drastic changes when adopting > for Postgres -- OCC, log-structured database, etc. I'd like to hear how > feasible those are. However, its scale-out capability without the need for > data or application partitioning appears appealing. > > > > > > To explore another possibility that would have more affinity with the > current Postgres, let me introduce our proprietary product called > Symfoware. It's not based on Postgres. > > > > It has shared nothing scale-out functionality with full ACID based on 2PC, > conventional 2PL locking and distributed deadlock resolution. Despite > being shared nothing, all the database files and transaction logs are > stored on shared storage. > > > > The database is divided into "log groups." Each log group has one > transaction log and multiple tablespaces (it's called "database space" > instead of tablespace.) > > > > Each DB instance in the cluster owns multiple log groups, and handles > reads/writes to the data in its owning log groups. When a DB instance > fails, other surviving DB instances take over the log groups of the failed > DB instance, recover the data using the transaction log of the log group, > and accepts reads/writes to the data in the log group. The DBA configures > which DB instance initially owns which log groups and which DB instances > are candidates to take over which log groups. > > > > This way, no server is idle as a standby. All DB instances work hard to > process read-write transactions. This "no idle server for HA" is one of > the things Oracle RAC users want in terms of cost. > > > > However, it still requires data and application partitioning unlike > Hyder. Does anyone think of a way to eliminate partitioning? Data and > application partitioning is what Oracle RAC users want to avoid or cannot > tolerate. > > > > Ref: Introduction of the Symfoware shared nothing scale-out called "load > share." 
> > > https://pdfs.semanticscholar.org/8b60/163593931cebc58e9f637cfb501500230adc.pdf > > > > > > Regards > > Takayuki Tsunakawa > > > > > > --- below is Sumanta's original mail --- > > *From:* Sumanta Mukherjee > *Sent:* Wednesday, June 17, 2020 5:34 PM > *To:* Tsunakawa, Takayuki/綱川 貴之 > *Cc:* Bruce Momjian ; Merlin Moncure ; > Robert Haas ; maumau...@gmail.com > *Subject:* Re: I'd like to discuss scaleout at PGCon > > > > Hello, > > > > I saw the presentation and it is great except that it seems to be unclear > of both SD and SN if the storage and the compute are being explicitly > separated. Separation of storage and compute would have some cost > advantages as per my understanding. The following two work (ref below) has > some information about the usefulness of this technique for scale out and > so it would be an interesting addition to see if in the SN architecture > that is being proposed could be modified to take care of this phenomenon > and reap the gain. > > > > 1. Philip A. Bernstein, Colin W. Reid, and Sudipto Das. 2011. Hyder - A > Transactional Record Manager for Shared Flash. In CIDR 2011. > > > > 2. Dhruba Borthakur. 2017. The Birth of RocksDB-Cloud. http://rocksdb. > blogspot.com/2017/05/the-birth-of-rocksdb-cloud.html. > > > > With Regards, > > Sumanta Mukherjee. > > EnterpriseDB: http://www.enterprisedb.com > > > > > > > >
Backpatch b61d161c14
I propose to backpatch b61d161c14 [1] (Introduce vacuum errcontext to display additional information.). In the recent past, we have seen an error report similar to "ERROR: found xmin 2157740646 from before relfrozenxid 1197" from multiple EDB customers. A similar report is seen on pgsql-bugs as well [2] which I think has triggered the implementation of this feature for v13. Such reports mostly indicate database corruption rather than any postgres bug which is also indicated by the error-code (from before relfrozenxid) for this message. I think there is a good reason to back-patch this as multiple users are facing similar issues. This patch won't fix this issue but it will help us in detecting the problematic part of the heap/index and then if users wish they can delete the portion of data that appeared to be corrupted and resume the operations on that relation. I have tried to back-patch this for v12 and attached is the result. The attached patch passes make check-world but I have yet to test it manually and also prepare the patch for other branches once we agree on this proposal. Thoughts? [1] - commit b61d161c146328ae6ba9ed937862d66e5c8b035a Author: Amit Kapila Date: Mon Mar 30 07:33:38 2020 +0530 Introduce vacuum errcontext to display additional information. The additional information displayed will be block number for error occurring while processing heap and index name for error occurring while processing the index. [2] - https://www.postgresql.org/message-id/20190807235154.erbmr4o4bo6vgnjv%40alap3.anarazel.de -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com Introduce-vacuum-errcontext-to-display-additional-in.v12.patch Description: Binary data
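To illustrate what the back-patch would buy DBAs in practice: with the errcontext in place, the same corruption error arrives with enough context to locate the damaged block. The session below is only a sketch — the table name and block number are invented and the exact CONTEXT wording is approximate:

-- Hypothetical session against a corrupted table:
VACUUM some_corrupted_table;
-- ERROR:  found xmin 2157740646 from before relfrozenxid 1197
-- CONTEXT: while scanning block 4132 of relation "public.some_corrupted_table"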
Re: Failures with wal_consistency_checking and 13~
On Sat, Jun 20, 2020 at 05:43:19PM +0300, Alexander Korotkov wrote: > I have discovered and fixed the issue in a44dd932ff. spg_mask() > masked unused space only when pagehdr->pd_lower > > SizeOfPageHeaderData. But during the vacuum regression tests, one > page has been erased completely and pagehdr->pd_lower was set to > SizeOfPageHeaderData. Actually, 13 didn't introduce any issue, it > just added a test that spotted the issue. The issue is here since > a507b86900. Thanks Alexander for looking at this issue! My runs with wal_consistency_checking are all clear now. -- Michael signature.asc Description: PGP signature
Re: min_safe_lsn column in pg_replication_slots view
On Sat, Jun 20, 2020 at 03:53:54PM +0900, Michael Paquier wrote: > On Sat, Jun 20, 2020 at 09:45:52AM +0530, Amit Kapila wrote: >> Isn't this information specific to checkpoints, so maybe better to >> display in view pg_stat_bgwriter? > > Not sure that's a good match. If we decide to expose that, a separate > function returning a LSN based on the segment number from > XLogGetLastRemovedSegno() sounds fine to me, like > pg_wal_last_recycled_lsn(). Perhaps somebody has a better name in > mind? I was thinking on this one for the last couple of days, and came up with the name pg_wal_oldest_lsn(), as per the attached, traking the oldest WAL location still available. That's unfortunately too late for beta2, but let's continue the discussion. -- Michael diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h index 7644147cf5..7de4338910 100644 --- a/src/include/catalog/catversion.h +++ b/src/include/catalog/catversion.h @@ -53,6 +53,6 @@ */ /* mmddN */ -#define CATALOG_VERSION_NO 202006151 +#define CATALOG_VERSION_NO 202006221 #endif diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat index 61f2c2f5b4..1a07877086 100644 --- a/src/include/catalog/pg_proc.dat +++ b/src/include/catalog/pg_proc.dat @@ -6090,6 +6090,10 @@ proname => 'pg_current_wal_flush_lsn', provolatile => 'v', prorettype => 'pg_lsn', proargtypes => '', prosrc => 'pg_current_wal_flush_lsn' }, +{ oid => '9054', descr => 'oldest wal location still available', + proname => 'pg_wal_oldest_lsn', provolatile => 'v', + prorettype => 'pg_lsn', proargtypes => '', + prosrc => 'pg_wal_oldest_lsn' }, { oid => '2850', descr => 'wal filename and byte offset, given a wal location', proname => 'pg_walfile_name_offset', prorettype => 'record', @@ -10063,9 +10067,9 @@ proname => 'pg_get_replication_slots', prorows => '10', proisstrict => 'f', proretset => 't', provolatile => 's', prorettype => 'record', proargtypes => '', - proallargtypes => '{name,name,text,oid,bool,bool,int4,xid,xid,pg_lsn,pg_lsn,text,pg_lsn}', - proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o}', - proargnames => '{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn,wal_status,min_safe_lsn}', + proallargtypes => '{name,name,text,oid,bool,bool,int4,xid,xid,pg_lsn,pg_lsn,text}', + proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o}', + proargnames => '{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn,wal_status}', prosrc => 'pg_get_replication_slots' }, { oid => '3786', descr => 'set up a logical replication slot', proname => 'pg_create_logical_replication_slot', provolatile => 'v', diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c index 290658b22c..ccb3b5d5db 100644 --- a/src/backend/access/transam/xlogfuncs.c +++ b/src/backend/access/transam/xlogfuncs.c @@ -387,6 +387,31 @@ pg_current_wal_flush_lsn(PG_FUNCTION_ARGS) PG_RETURN_LSN(current_recptr); } +/* + * Report the oldest WAL location still available after WAL segment removal + * + * This is useful to monitor how much WAL retention is happening with + * replication slots and concurrent checkpoints. NULL means that no WAL + * segments have been removed since startup yet. 
+ */ +Datum +pg_wal_oldest_lsn(PG_FUNCTION_ARGS) +{ + XLogRecPtr oldestptr; + XLogSegNo last_removed_seg; + + last_removed_seg = XLogGetLastRemovedSegno(); + + /* Leave if no segments have been removed since startup */ + if (last_removed_seg == 0) + PG_RETURN_NULL(); + + XLogSegNoOffsetToRecPtr(last_removed_seg + 1, 0, + wal_segment_size, oldestptr); + + PG_RETURN_LSN(oldestptr); +} + /* * Report the last WAL receive location (same format as pg_start_backup etc) * diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index 5314e9348f..507b602a08 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -878,8 +878,7 @@ CREATE VIEW pg_replication_slots AS L.catalog_xmin, L.restart_lsn, L.confirmed_flush_lsn, -L.wal_status, -L.min_safe_lsn +L.wal_status FROM pg_get_replication_slots() AS L LEFT JOIN pg_database D ON (L.datoid = D.oid); diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c index 06e4955de7..590f7054d6 100644 --- a/src/backend/replication/slotfuncs.c +++ b/src/backend/replication/slotfuncs.c @@ -236,7 +236,7 @@ pg_drop_replication_slot(PG_FUNCTION_ARGS) Datum pg_get_replication_slots(PG_FUNCTION_ARGS) { -#define PG_GET_REPLICATION_SLOTS_COLS 13 +#define PG_GET_REPLICATION_SLOTS_COLS 12 ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo; TupleDesc tupdesc; Tuplestorestate *tupstore; @@ -282,7 +282,6 @@ pg_get_replication_slots(PG_FUNCTION_ARGS) Datum values[PG_GET_REPLI
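Assuming the attached patch were applied, the new function could be combined with the existing LSN functions for monitoring; a sketch only (pg_wal_oldest_lsn() exists solely with this patch):

-- How much WAL is still on disk between the oldest retained segment and
-- the current insert position. Returns NULL until at least one segment
-- has been removed since startup.
select pg_wal_oldest_lsn() as oldest_lsn,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(),
                                      pg_wal_oldest_lsn())) as wal_on_disk;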
Re: Parallel Seq Scan vs kernel read ahead
On Mon, 22 Jun 2020 at 16:54, David Rowley wrote:
> I also tested this an AMD machine running Ubuntu 20.04 on kernel
> version 5.4.0-37. I used the same 100GB table I mentioned in [1], but
> with the query "select * from t where a < 0;", which saves having to
> do any aggregate work.

I just wanted to add a note here that Thomas and I just discussed this a bit offline. He recommended I try setting the kernel readahead a bit higher. It was set to 128kB, so I cranked it up to 2MB with:

sudo blockdev --setra 4096 /dev/nvme0n1p2

I didn't want to run the full test again as it took quite a long time, so I just tried with 32 workers. The first two results here are taken from the test results I just posted 1 hour ago.

Master   readahead=128kB = 89921.283 ms
v2 patch readahead=128kB = 36085.642 ms
Master   readahead=2MB   = 60984.905 ms
v2 patch readahead=2MB   = 22611.264 ms

There must be a fairly large element of reading from the page cache there, since 22.6 seconds means 4528MB/sec over the 100GB table. The maximum for a PCIe 3.0 x4 slot is 3940MB/s.

David
[PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit
Hi,

When a query on a foreign table is executed from a local session using postgres_fdw, the local backend opens a connection which, as expected, causes a remote session/backend to be opened on the remote postgres server for query execution. One observation is that, even after the query has finished, the remote session/backend still persists on the remote server. Upon researching, I found that there is a concept of connection caching for the remote connections made using postgres_fdw: each local backend/session can cache up to 8 different connections. This caching is useful as it avoids the cost of re-establishing a new connection for every foreign query.

However, there may be situations where long-lasting local sessions execute very few foreign queries and the rest are local queries; in this scenario, the remote sessions opened by the local sessions/backends are not useful, as they remain idle and eat up the remote server's connection capacity. This problem gets even worse (though this use case is a bit imaginary) if all max_connections local sessions (default 100, with each backend caching 8 remote connections) open remote sessions and cache them.

I propose to have a new session-level GUC called "enable_connectioncache" (the name can be changed if it doesn't convey the purpose correctly) with a default value of true, meaning that all remote connections are cached. If set to false, the connections are not cached, and the remote sessions are closed by the local backend/session at the end of each remote transaction.

Attached are the initial patch (based on commit 9550ea3027aa4f290c998afd8836a927df40b09d) and the test setup. Another approach to solve this problem could be automatic clean-up of cache entries (based on Robert's idea [1]), but letting users decide on caching also seems good. Please note that documentation is still pending.

Thoughts?

Test case:

without patch:
1. Run the query on the foreign table.
2. Look for the backend/session opened on the remote postgres server; it exists as long as the local session remains active.

with patch:
1. SET enable_connectioncache TO false;
2. Run the query on the foreign table.
3. Look for the backend/session opened on the remote postgres server; it should not exist.

[1] - https://www.postgresql.org/message-id/CA%2BTgmob_ksTOgmbXhno%2Bk5XXPOK%2B-JYYLoU3MpXuutP4bH7gzA%40mail.gmail.com

With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com

testcase Description: Binary data

From d4594067b29ab1414e9741a7e6abd91daf078201 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy
Date: Mon, 22 Jun 2020 10:07:53 +0530
Subject: [PATCH v1] enable_connectioncache GUC for postgres_fdw connection caching

postgres_fdw connection caching - cause remote sessions linger till the local session exit.
--- contrib/postgres_fdw/connection.c | 3 ++- src/backend/utils/hash/dynahash.c | 11 +++ src/backend/utils/misc/guc.c | 15 +++ src/backend/utils/misc/postgresql.conf.sample | 2 +- src/include/utils/hsearch.h | 4 5 files changed, 33 insertions(+), 2 deletions(-) diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c index 52d1fe3563..de994f678b 100644 --- a/contrib/postgres_fdw/connection.c +++ b/contrib/postgres_fdw/connection.c @@ -869,7 +869,8 @@ pgfdw_xact_callback(XactEvent event, void *arg) */ if (PQstatus(entry->conn) != CONNECTION_OK || PQtransactionStatus(entry->conn) != PQTRANS_IDLE || - entry->changing_xact_state) + entry->changing_xact_state || + scan.enableconncache == false) { elog(DEBUG3, "discarding connection %p", entry->conn); disconnect_pg_server(entry); diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c index 2688e27726..a5448f4af4 100644 --- a/src/backend/utils/hash/dynahash.c +++ b/src/backend/utils/hash/dynahash.c @@ -276,6 +276,9 @@ static bool has_seq_scans(HTAB *hashp); */ static MemoryContext CurrentDynaHashCxt = NULL; +/* parameter for enabling fdw connection hashing */ +bool enable_connectioncache = true; + static void * DynaHashAlloc(Size size) { @@ -1383,6 +1386,14 @@ hash_seq_init(HASH_SEQ_STATUS *status, HTAB *hashp) status->hashp = hashp; status->curBucket = 0; status->curEntry = NULL; + status->enableconncache = true; + + /* see if the cache was for postgres_fdw connections and + user chose to disable connection caching*/ + if ((strcmp(hashp->tabname,"postgres_fdw connections") == 0) && + !enable_connectioncache) + status->enableconncache = false; + if (!hashp->frozen) register_seq_scan(hashp); } diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 75fc6f11d6..f7a57415fc 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -127,6 +127,7 @@ extern char *temp_tablespaces; extern bool ignore_checksum_failure; extern bool i
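A quick usage sketch of the behaviour described above (the GUC only exists with the proposed patch applied; ft is an invented postgres_fdw foreign table):

-- Disable connection caching for this session, then run a foreign query.
SET enable_connectioncache TO false;
SELECT count(*) FROM ft;
-- The remote backend opened for this query is discarded at the end of the
-- transaction instead of being kept in the local backend's connection cache.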
Re: [POC] Fast COPY FROM command for the table with foreign partitions
19.06.2020 19:58, Etsuro Fujita пишет: On Tue, Jun 2, 2020 at 2:51 PM Andrey Lepikhov wrote: Hiding the COPY code under the buffers management machinery allows us to generalize buffers machinery, execute one COPY operation on each buffer and simplify error handling. I'm not sure that it's really a good idea that the bulk-insert API is designed the way it's tightly coupled with the bulk-insert machinery in the core, because 1) some FDWs might want to send tuples provided by the core to the remote, one by one, without storing them in a buffer, or 2) some other FDWs might want to store the tuples in the buffer and send them in a lump as postgres_fdw in the proposed patch but might want to do so independently of MAX_BUFFERED_TUPLES and/or MAX_BUFFERED_BYTES defined in the bulk-insert machinery. I agree that we would need special handling for cases you mentioned above if we design this API based on something like the idea I proposed in that thread. Agreed As i understand, main idea of the thread, mentioned by you, is to add "COPY FROM" support without changes in FDW API. I don't think so; I think we should introduce new API for this feature to keep the ExecForeignInsert() API simple. Ok All that I can offer in this place now is to introduce one new ExecForeignBulkInsert(buf) routine that will execute single "COPY FROM STDIN" operation, send tuples and close the operation. We can use the ExecForeignInsert() routine for each buffer tuple if ExecForeignBulkInsert() is not supported. Agreed. In the next version (see attachment) of the patch i removed Begin/End fdwapi routines. Now we have only the ExecForeignBulkInsert() routine. -- Andrey Lepikhov Postgres Professional >From 108dc421cec88ab5afd092f40da3fa31b8fcfbc5 Mon Sep 17 00:00:00 2001 From: "Andrey V. Lepikhov" Date: Mon, 22 Jun 2020 10:28:42 +0500 Subject: [PATCH] Fast COPY FROM into the foreign or sharded table. This feature enables bulk COPY into foreign table in the case of multi inserts is possible and foreign table has non-zero number of columns. --- contrib/postgres_fdw/deparse.c| 60 - .../postgres_fdw/expected/postgres_fdw.out| 33 ++- contrib/postgres_fdw/postgres_fdw.c | 87 contrib/postgres_fdw/postgres_fdw.h | 1 + contrib/postgres_fdw/sql/postgres_fdw.sql | 28 +++ src/backend/commands/copy.c | 206 -- src/include/commands/copy.h | 5 + src/include/foreign/fdwapi.h | 8 + 8 files changed, 344 insertions(+), 84 deletions(-) diff --git a/contrib/postgres_fdw/deparse.c b/contrib/postgres_fdw/deparse.c index ad37a74221..a37981ff66 100644 --- a/contrib/postgres_fdw/deparse.c +++ b/contrib/postgres_fdw/deparse.c @@ -184,6 +184,8 @@ static void appendAggOrderBy(List *orderList, List *targetList, static void appendFunctionName(Oid funcid, deparse_expr_cxt *context); static Node *deparseSortGroupClause(Index ref, List *tlist, bool force_colno, deparse_expr_cxt *context); +static List *deparseRelColumnList(StringInfo buf, Relation rel, + bool enclose_in_parens); /* * Helper functions @@ -1758,6 +1760,20 @@ deparseInsertSql(StringInfo buf, RangeTblEntry *rte, withCheckOptionList, returningList, retrieved_attrs); } +/* + * Deparse COPY FROM into given buf. + * We need to use list of parameters at each query. 
+ */ +void +deparseCopyFromSql(StringInfo buf, Relation rel) +{ + appendStringInfoString(buf, "COPY "); + deparseRelation(buf, rel); + (void) deparseRelColumnList(buf, rel, true); + + appendStringInfoString(buf, " FROM STDIN "); +} + /* * deparse remote UPDATE statement * @@ -2061,6 +2077,30 @@ deparseAnalyzeSizeSql(StringInfo buf, Relation rel) */ void deparseAnalyzeSql(StringInfo buf, Relation rel, List **retrieved_attrs) +{ + appendStringInfoString(buf, "SELECT "); + *retrieved_attrs = deparseRelColumnList(buf, rel, false); + + /* Don't generate bad syntax for zero-column relation. */ + if (list_length(*retrieved_attrs) == 0) + appendStringInfoString(buf, "NULL"); + + /* + * Construct FROM clause + */ + appendStringInfoString(buf, " FROM "); + deparseRelation(buf, rel); +} + +/* + * Construct the list of columns of given foreign relation in the order they + * appear in the tuple descriptor of the relation. Ignore any dropped columns. + * Use column names on the foreign server instead of local names. + * + * Optionally enclose the list in parantheses. + */ +static List * +deparseRelColumnList(StringInfo buf, Relation rel, bool enclose_in_parens) { Oid relid = RelationGetRelid(rel); TupleDesc tupdesc = RelationGetDescr(rel); @@ -2069,10 +2109,8 @@ deparseAnalyzeSql(StringInfo buf, Relation rel, List **retrieved_attrs) List *options; ListCell *lc; bool first = true; + List *retrieved_attrs = NIL; - *retrieved_attrs = NIL; - - appendStringInfoString(buf, "SELECT "); for (i = 0; i < tupdesc->natts; i++) { /* Ignore
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Tue, Jun 16, 2020 at 2:37 PM Amit Kapila wrote: > > On Mon, Jun 15, 2020 at 6:29 PM Amit Kapila wrote: > > > > I have few more comments on the patch > > 0013-Change-buffile-interface-required-for-streaming-.patch: > > > > Review comments on 0014-Worker-tempfile-use-the-shared-buffile-infrastru: > 1. > The subxact file is only create if there > + * are any suxact info under this xid. > + */ > +typedef struct StreamXidHash > > Lets slightly reword the part of the comment as "The subxact file is > created iff there is any suxact info under this xid." Done > > 2. > @@ -710,6 +740,9 @@ apply_handle_stream_stop(StringInfo s) > subxact_info_write(MyLogicalRepWorker->subid, stream_xid); > stream_close_file(); > > + /* Commit the per-stream transaction */ > + CommitTransactionCommand(); > > Before calling commit, ensure that we are in a valid transaction. I > think we can have an Assert for IsTransactionState(). Done > 3. > @@ -761,11 +791,13 @@ apply_handle_stream_abort(StringInfo s) > > int64 i; > int64 subidx; > - int fd; > + BufFile*fd; > bool found = false; > char path[MAXPGPATH]; > + StreamXidHash *ent; > > subidx = -1; > + ensure_transaction(); > subxact_info_read(MyLogicalRepWorker->subid, xid); > > Why to call ensure_transaction here? Is there any reason that we > won't have a valid transaction by now? If not, then its better to > have an Assert for IsTransactionState(). We are only starting transaction from stream_start to stream_stop, so at stream_abort we will not have the transaction. > 4. > - if (write(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts)) > + if (BufFileWrite(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts)) > { > - int save_errno = errno; > + int save_errno = errno; > > - CloseTransientFile(fd); > + BufFileClose(fd); > > On error, won't these files be close automatically? If so, why at > this place and before other errors, we need to close this? Yes, that's correct. I have fixed those. > 5. > if ((len > 0) && ((BufFileRead(fd, subxacts, len)) != len)) > { > int save_errno = errno; > > BufFileClose(fd); > errno = save_errno; > ereport(ERROR, > (errcode_for_file_access(), > errmsg("could not read file \"%s\": %m", > > Can we change the error message to "could not read from streaming > transactions file .." or something like that and similarly we can > change the message for failure in reading changes file? Done > 6. > if (BufFileWrite(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts)) > { > int save_errno = errno; > > BufFileClose(fd); > errno = save_errno; > ereport(ERROR, > (errcode_for_file_access(), > errmsg("could not write to file \"%s\": %m", > > Similar to previous, can we change it to "could not write to streaming > transactions file BufFileWrite is not returning failure anymore. > 7. > @@ -2855,17 +2844,32 @@ stream_open_file(Oid subid, TransactionId xid, > bool first_segment) > * for writing, in append mode. > */ > if (first_segment) > - flags = (O_WRONLY | O_CREAT | O_EXCL | PG_BINARY); > - else > - flags = (O_WRONLY | O_APPEND | PG_BINARY); > + { > + /* > + * Shared fileset handle must be allocated in the persistent context. > + */ > + SharedFileSet *fileset = > + MemoryContextAlloc(ApplyContext, sizeof(SharedFileSet)); > > - stream_fd = OpenTransientFile(path, flags); > + PrepareTempTablespaces(); > + SharedFileSetInit(fileset, NULL); > > Why are we calling PrepareTempTablespaces here? It is already called > in SharedFileSetInit. My bad, First I tired using SharedFileSetInit but later it got changed for forgot to remove this part. > 8. 
> + /* > + * Start a transaction on stream start, this transaction will be committed > + * on the stream stop. We need the transaction for handling the buffile, > + * used for serializing the streaming data and subxact info. > + */ > + ensure_transaction(); > > I think we need this for PrepareTempTablespaces to set the > temptablespaces. Also, isn't it required for a cleanup of buffile > resources at the transaction end? Are there any other reasons for it > as well? The comment should be a bit more clear for why we need a > transaction here. I am not sure that will it make sense to add a comment here that why buffile and sharedfileset need a transaction? Do you think that we should add comment in buffile/shared fileset API that it should be called under a transaction? > 9. > * Open a file for streamed changes from a toplevel transaction identified > * by stream_xid (global variable). If it's the first chunk of streamed > * changes for this transaction, perform cleanup by removing existing > * files after a possible previous crash. > .. > stream_open_file(Oid subid, TransactionId xid, bool first_segment) > > The above part comment atop stream_open_file needs to be changed after > new implementation. Done > 10. > * enabled. This context is reeset on each stream stop. > */ > LogicalStreamingContext = AllocSetContextCreate(ApplyContext, > > /reeset/reset Done
Testing big endian code with Travis CI's new s390x support
Hi,

This is something I've wanted several times in the past, so I thought others here could be interested: if you're looking for a way to run your development branch through check-world on a big endian box, the new s390x support[1] on Travis is good for that. Capacity is a bit limited, so I don't think I'll point cfbot.cputube.org at it just yet (maybe I need to invent a separate slow build cycle).

I tried it just now and found that cfbot's .travis.yml file[2] just needed this at the top:

arch:
  - s390x

... and then it needed these lines commented out:

#before_install:
#  - echo '/tmp/%e-%s-%p.core' | sudo tee /proc/sys/kernel/core_pattern

I didn't look into why, but otherwise that fails with a permission error on that environment, so it'd be nice to figure out what's up with that so we can still get back traces from cores.

[1] https://blog.travis-ci.com/2019-11-12-multi-cpu-architecture-ibm-power-ibm-z
[2] https://github.com/macdice/cfbot/blob/master/travis/.travis.yml
pg_resetwal --next-transaction-id may cause database failed to restart.
Hello hackers,

When I try to use the pg_resetwal tool to skip some transaction IDs, I run into a problem: the tool accepts any transaction ID given with the '-x' option, but the database may then fail to restart because it cannot read a file under $PGDATA/pg_xact. For example, if the 'NextXID' in a database is 1000 and you pass '-x 32769', the database fails to restart.

The pg_resetwal documentation tells me to write a 'safe value', but I think pg_resetwal should report the problem and refuse to perform the WAL reset when given an unsafe value, rather than letting it surface only when the user restarts the database.

I wrote an initial patch to limit the input. It now accepts a transaction ID in two cases:
1. The transaction ID is on the same CLOG page as the 'NextXID' in pg_control.
2. The transaction ID is right at the end of a CLOG page.

Limiting the input this way ensures the database can restart successfully. The same situation exists for the multixact and multixact-offset options, and I made the same change for them in the patch.

Do you think this is an issue?

Regards,
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca

pg_resetwal_transaction_limit.patch Description: Binary data
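For readers wondering why 32769 in particular breaks the restart: each pg_xact (CLOG) page stores two status bits per transaction, i.e. 4 transactions per byte, so a page holds BLCKSZ * 4 statuses. The arithmetic below is only an illustration and assumes the default BLCKSZ of 8192:

-- 4 two-bit transaction statuses per byte => 8192 * 4 = 32768 XIDs per page,
-- so XID 1000 and XID 32769 land on different (and possibly missing) pages.
SELECT 1000  / (8192 * 4) AS clog_page_of_nextxid,   -- page 0
       32769 / (8192 * 4) AS clog_page_of_new_xid;   -- page 1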