Re: Another usability issue with our TAP tests
On Mon, Jul 16, 2018 at 01:13:36PM -0400, Tom Lane wrote:
> Since "make check-world" is rather chatty, if you get a failure while
> running it under high parallelism, the location of the failure has often
> scrolled off the terminal window by the time all the other subjobs exit.

Yes, I have been bothered by that myself a bit lately.

> This is not a huge problem for tests using our traditional infrastructure,
> because you can just run "git status" to look for regression.diffs files.
> But a TAP test failure leaves nothing behind that git will consider
> unusual. I've repeatedly had to run check-world with no parallelism
> (wasting many minutes) in order to locate which test actually failed.

Even for tests that are still running, regression.diffs gets created but remains empty.

> I'm not sure about a good way to improve this. One idea that comes
> to mind is to tweak the "make check" rules so that the tmp_check
> subdirectories are automatically deleted on successful completion,
> but not on failure, and then remove tmp_check from the .gitignore lists.
> But the trouble with that is sometimes you want to look at the test logs
> afterwards, even when make thought the test succeeded.

I definitely want to be able to look at TAP test logs even if they succeed, and only wipe them out at the next run, so I think that this should be handled independently. What about an empty on-disk file which simply gets created in the INIT() block of TestLib.pm and removed in the END block only if all tests pass? We know the base directory thanks to $ENV{TESTDIR}, which should just be "." if undefined. We could name it, say, "prove_running" or similar.
-- Michael
Re: missing toast table for pg_policy
On 7/17/18, Michael Paquier wrote: > I was just having a second look at this patch, and did a bit more tests > with pg_upgrade which passed. > > +-- 2. pg_largeobject and pg_largeobject_metadata, to avoid problems > +-- with pg_upgrade > John, what's actually the failure that was seen here? It would be nice > to see this patch committed but the reason here should be more > explicit about why this cannot happen. I'll copy what I wrote upthread last month: On 6/19/18, John Naylor wrote: > On 2/20/18, Michael Paquier wrote: >> Regression tests of pg_upgrade are failing as follows: >> New cluster database "postgres" is not empty >> Failure, exiting >> + rm -rf /tmp/pg_upgrade_check-Xn0ZLe > > I looked into this briefly. The error comes from > check_new_cluster_is_empty() in src/bin/pg_upgrade/check.c, which > contains the comment > > /* pg_largeobject and its index should be skipped */ I didn't dig deeper, since TOAST and the large object facility are mutually exclusive so there shouldn't be a toast table here anyway. Hope this helps. -John Naylor
Re: missing toast table for pg_policy
On Tue, Jul 17, 2018 at 02:55:07PM +0700, John Naylor wrote:
> I didn't dig deeper, since TOAST and the large object facility are
> mutually exclusive so there shouldn't be a toast table here anyway.
> Hope this helps.

[... digging ...]

This comes from get_rel_infos, where large objects are treated as user data. Instead of the comment you added, I would rather use the following: "Large object catalogs and toast tables are mutually exclusive and large object data is handled as user data by pg_upgrade, which would cause failures."
-- Michael
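As a side note for anyone following along, the set of system catalogs that currently lack a TOAST table is easy to inspect from SQL; a minimal sketch (at the time of this thread pg_policy shows up in the result, while pg_largeobject and pg_largeobject_metadata are meant to stay there, per the discussion above):

    SELECT c.relname
    FROM pg_class c
    WHERE c.relnamespace = 'pg_catalog'::regnamespace
      AND c.relkind = 'r'
      AND c.reltoastrelid = 0     -- no toast table attached
    ORDER BY c.relname;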
Re: [HACKERS] PATCH: multivariate histograms and MCV lists
On 16 July 2018 at 21:55, Tomas Vondra wrote:
>
>
> On 07/16/2018 02:54 PM, Dean Rasheed wrote:
>> On 16 July 2018 at 13:23, Tomas Vondra wrote:
> The top-level clauses allow us to make such deductions, with deeper
> clauses it's much more difficult (perhaps impossible). Because for
> example with (a=1 AND b=1) there can be just a single match, so if we
> find it in MCV we're done. With clauses like ((a=1 OR a=2) AND (b=1 OR
> b=2)) it's not that simple, because there may be multiple combinations
> and so a match in MCV does not guarantee anything.

Actually, it guarantees a lower bound on the overall selectivity, and maybe that's the best that we can do in the absence of any other stats.

>>> Hmmm, is that actually true? Let's consider a simple example, with two
>>> columns, each with just 2 values, and a "perfect" MCV list:
>>>
>>>  a | b | frequency
>>> ---+---+-----------
>>>  1 | 1 |       0.5
>>>  2 | 2 |       0.5
>>>
>>> And let's estimate sel(a=1 & b=2).
>>
>> OK. In this case, there are no MCV matches, so there is no lower bound
>> (it's 0).
>>
>> What we could do though is also impose an upper bound, based on the
>> sum of non-matching MCV frequencies. In this case, the upper bound is
>> also 0, so we could actually say the resulting selectivity is 0.
>>
>
> Hmmm, it's not very clear to me how we would decide which of these cases
> applies, because in most cases we don't have an MCV list covering 100% of
> the rows.
>
> Anyways, I've been thinking about how to modify the code to work the way
> you proposed (in a way sufficient for a PoC). But after struggling with
> it for a while it occurred to me it might be useful to do it on paper
> first, to verify how it would work on your examples.
>
> So let's use this data:
>
> create table foo(a int, b int);
> insert into foo select 1,1 from generate_series(1,50000);
> insert into foo select 1,2 from generate_series(1,40000);
> insert into foo select 1,x/10 from generate_series(30,250000) g(x);
> insert into foo select 2,1 from generate_series(1,30000);
> insert into foo select 2,2 from generate_series(1,20000);
> insert into foo select 2,x/10 from generate_series(30,500000) g(x);
> insert into foo select 3,1 from generate_series(1,10000);
> insert into foo select 3,2 from generate_series(1,5000);
> insert into foo select 3,x from generate_series(3,600000) g(x);
> insert into foo select x,x/10 from generate_series(4,750000) g(x);
>
> Assuming we have perfectly exact statistics, we have these MCV lists
> (both univariate and multivariate):
>
> select a, count(*), round(count(*) /2254937.0, 4) AS frequency
> from foo group by a order by 2 desc;
>
>    a   | count  | frequency
> -------+--------+-----------
>      3 | 614998 |    0.2727
>      2 | 549971 |    0.2439
>      1 | 339971 |    0.1508
>   1014 |      1 |    0.0000
>  57220 |      1 |    0.0000
>  ...
>
> select b, count(*), round(count(*) /2254937.0, 4) AS frequency
> from foo group by b order by 2 desc;
>
>    b   | count | frequency
> -------+-------+-----------
>      1 | 90010 |    0.0399
>      2 | 65010 |    0.0288
>      3 |    31 |    0.0000
>      7 |    31 |    0.0000
>  ...
>
> select a, b, count(*), round(count(*) /2254937.0, 4) AS frequency
> from foo group by a, b order by 3 desc;
>
>  a |   b   | count | frequency
> ---+-------+-------+-----------
>  1 |     1 | 50000 |    0.0222
>  1 |     2 | 40000 |    0.0177
>  2 |     1 | 30000 |    0.0133
>  2 |     2 | 20000 |    0.0089
>  3 |     1 | 10000 |    0.0044
>  3 |     2 |  5000 |    0.0022
>  2 | 12445 |    10 |    0.0000
>  ...
>
> And let's estimate the two queries with complex clauses, where the
> multivariate stats gave 2x overestimates.
>
> SELECT * FROM foo WHERE a=1 and (b=1 or b=2);
> -- actual 90000, univariate: 24253, multivariate: 181091
>
> univariate:
>
>   sel(a=1) = 0.1508
>   sel(b=1) = 0.0399
>   sel(b=2) = 0.0288
>   sel(b=1 or b=2) = 0.0673
>
> multivariate:
>   sel(a=1 and (b=1 or b=2)) = 0.0399 (0.0770)
>
> The second multivariate estimate comes from assuming only the first 5
> items make it to the multivariate MCV list (covering 6.87% of the data)
> and extrapolating the selectivity to the non-MCV data too.
>
> (Notice it's about 2x the actual selectivity, so this extrapolation due
> to not realizing the MCV already contains all the matches is pretty much
> responsible for the whole over-estimate).

Agreed. I think the actual MCV stats I got included the first 6 entries, but yes, that's only around 7% of the data.

> So, how would the proposed algorithm work? Let's start with "a=1":
>
>   sel(a=1) = 0.1508
>
> I don't see much point in applying the two "b" clauses independently (or
> how would it be done, as it's effectively a single clause):
>
>   sel(b=1 or b=2) = 0.0673
>
> And we get
>
>   total_sel = sel(a=1) * sel(b=1 or b=2) = 0.0101
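Putting the numbers from this exchange side by side (the same figures reappear in Tomas's follow-up further down), the proposed combination works out as:

   total_sel = sel(a=1) * sel(b=1 or b=2) = 0.1508 * 0.0673 ≈ 0.0101
   mcv_sel   = 0.0222 + 0.0177             = 0.0399
   total_sel = Max(total_sel, mcv_sel)     = 0.0399

which in this data set happens to equal the actual selectivity, 90000 / 2254937 ≈ 0.0399, because the multivariate MCV list already covers every (a,b) combination that can satisfy a=1 AND (b=1 OR b=2).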
Re: [HACKERS] WAL logging problem in 9.4.3?
Hello. At Mon, 16 Jul 2018 16:14:09 -0400, Alvaro Herrera wrote in <20180716201409.2qfcneo4qkdwjvpv@alvherre.pgsql> > On 2018-Jul-12, Heikki Linnakangas wrote: > > > > > Thanks for the pointer. My tap test has been covering two out of > > > > the three scenarios you have in your script. I have been able to > > > > convert the extra as the attached, and I have added as well an > > > > extra test with TRUNCATE triggers. So it seems to me that we want > > > > to disable the optimization if any type of trigger are defined on > > > > the relation copied to as it could be possible that these triggers > > > > work on the blocks copied as well, for any BEFORE/AFTER and > > > > STATEMENT/ROW triggers. What do you think? > > > > > > Yeah, this seems like the only sane approach. > > > > Doesn't have to be a trigger, could be a CHECK constraint, datatype > > input function, etc. Admittedly, having a datatype input function that > > inserts to the table is worth a "huh?", but I'm feeling very confident > > that we can catch all such cases, and some of them might even be > > sensible. > > A counterexample could be a a JSON compresion scheme that uses a catalog > for a dictionary of keys. Hasn't this been described already? Also not > completely out of the question for GIS data, I think (Not sure if > PostGIS does this already.) In the third case, IIUC, disabling bulk-insertion after any WAL-logging insertion happend seems to work. The attached diff to v2 patch makes the three TAP tests pass. It uses relcache to store the last insertion XID but it will not be invalidated during a COPY operation. regards. -- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index 72395a50b8..e5c651b498 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -2509,6 +2509,18 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid, MarkBufferDirty(buffer); + /* + * Bulk insertion is not safe after a WAL-logging insertion in the same + * transaction. We don't start bulk insertion under inhibitin conditions + * but we also need to cancel WAL-skipping in the case where WAL-logging + * insertion happens during a bulk insertion. This happens by anything + * that can insert a tuple during bulk insertion such like triggers, + * constraints or type conversions. We need not worry about relcache flush + * happening while a bulk insertion is running. + */ + if (relation->last_logged_insert_xid == xid) + options &= ~HEAP_INSERT_SKIP_WAL; + /* XLOG stuff */ if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation)) { @@ -2582,6 +2594,12 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid, recptr = XLogInsert(RM_HEAP_ID, info); PageSetLSN(page, recptr); + + /* + * If this happens during a bulk insertion, stop WAL skipping for the + * rest of the current command. 
+ */ + relation->last_logged_insert_xid = xid; } END_CRIT_SECTION(); diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c index 7674369613..7b9a7af2d2 100644 --- a/src/backend/commands/copy.c +++ b/src/backend/commands/copy.c @@ -2416,10 +2416,8 @@ CopyFrom(CopyState cstate) { hi_options |= HEAP_INSERT_SKIP_FSM; - if (!XLogIsNeeded() && - cstate->rel->trigdesc == NULL && - RelationGetNumberOfBlocks(cstate->rel) == 0) - hi_options |= HEAP_INSERT_SKIP_WAL; + if (!XLogIsNeeded() && RelationGetNumberOfBlocks(cstate->rel) == 0) + hi_options |= HEAP_INSERT_SKIP_WAL; } /* diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index 30a956822f..34a692a497 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -1575,6 +1575,9 @@ ExecuteTruncateGuts(List *explicit_rels, List *relids, List *relids_logged, { /* Immediate, non-rollbackable truncation is OK */ heap_truncate_one_rel(rel); + + /* Allow bulk-insert */ + rel->last_logged_insert_xid = InvalidTransactionId; } else { diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c index 6125421d39..99fb7e1dd8 100644 --- a/src/backend/utils/cache/relcache.c +++ b/src/backend/utils/cache/relcache.c @@ -1243,6 +1243,8 @@ RelationBuildDesc(Oid targetRelId, bool insertIt) /* It's fully valid */ relation->rd_isvalid = true; + relation->last_logged_insert_xid = InvalidTransactionId; + return relation; } diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h index c97f9d1b43..6ee575ad14 100644 --- a/src/include/utils/rel.h +++ b/src/include/utils/rel.h @@ -188,6 +188,9 @@ typedef struct RelationData /* use "struct" here to avoid needing to include pgstat.h: */ struct PgStat_TableStatus *pgstat_info; /* statistics collection area */ + + /* XID of the last transaction on which WAL-logged insertion happened */ + TransactionId last_logged_insert_xid; } RelationData;
65279 Invisible ASCII Character
Hi all,

I have more than one procedure saved as separate files in a folder. I concatenated those files into a single file through a shell script, but when I try to run that single file on my server it shows a syntax error. If I remove the first character, it runs. That first character is invisible; I checked its ascii value and it is 65279. How can I overcome this problem? I want to run my single file without this error. This is the command I use to concatenate the individual script files into one:

:> cat *.* > .sql

select ascii('');
select chr(65279);
select length('');

Please give me suggestions, this is a blocker for my work.

Thanking you, -- Best Regards: Ramanna Gunde
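For anyone wanting to confirm this from psql: in a UTF8 database chr() and ascii() work on Unicode code points, so you can check what the invisible character is and which bytes it occupies (a minimal sketch):

    select ascii(chr(65279));              -- 65279, i.e. U+FEFF
    select convert_to(chr(65279), 'UTF8'); -- \xefbbbf, the three-byte UTF-8 byte order mark
    select length(chr(65279));             -- 1 character, even though it is 3 bytes in UTF-8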
foreign key to foreign table
Hi, I have a foreign table created with postgres_fdw and with that, I tried to create a local table to reference the foreign table in order to set a foreign key constraint in my local table but I get an error message saying that the referenced table is not a table. Is there a way I can reference a foreign table? Thank you. -- Kaye Ann Ignacio, Programmer proceedit "the BPaaS company" kaye.igna...@proceedit.com +34 679188011 (mobile)
Re: [HACKERS] Restricting maximum keep segments by repslots
Hello.

At Tue, 17 Jul 2018 13:37:59 +0900, Masahiko Sawada wrote in
> >> The usage of XLogSegNoOffsetToRecPtr is wrong. Since we specify the
> >> destination at the 4th argument, the wal_segment_size will be changed in
> >> the above expression. The regression tests by PostgreSQL Patch Tester
> >
> > I'm not sure I get this correctly, the definition of the macro is
> > as follows.
> >
> > | #define XLogSegNoOffsetToRecPtr(segno, offset, dest, wal_segsz_bytes) \
> > | (dest) = (segno) * (wal_segsz_bytes) + (offset)
> >
> > The destination is the *3rd* parameter and the fourth is the segment
> > size, which is not to be written to.
>
> Please see commit a22445ff0b which flipped input and output arguments.
> So maybe you need to rebase the patches to current HEAD.

Mmm. Thanks. I hadn't realized that such a change had happened, and it was accidentally undone in the latest patch.

regards.
-- Kyotaro Horiguchi NTT Open Source Software Center
Re: 65279 Invisible ASCII Character
## ramsiddu007 (ramsiddu...@gmail.com):
> If I remove the first character, it runs. That first character is
> invisible; I checked its ascii value and it is 65279.

That's not an ASCII value; ASCII fits in 7 bits. What you've got there is a Unicode Byte Order Mark: 65279 is 0xfeff (one of the well-known constants). I'd suggest you get your editor configured to write files without a BOM. Maybe there's a workaround via locale settings, but I have no machine with a UTF-16 locale available. Another approach is running recode on your files before concatenating.

This question isn't really for pgsql-hackers - I'm redirecting to -general.

Regards, Christoph -- Spare Space
Re: file cloning in pg_upgrade and CREATE DATABASE
On 17.07.18 08:58, Michael Paquier wrote: > Hm... I am wondering if we actually want the "auto" mode where we make > the option smarter automatically. I am afraid of users trying to use it > and being surprised that there is no gain while they expected some. I > would rather switch that to an on/off switch, which defaults to "off", > or simply what is available now. huge_pages=try was a bad idea as the > result is not deterministic, so I would not have more of that... Why do you think that was a bad idea? Doing the best possible job by default without requiring explicit configuration in each case seems like an attractive feature. > Putting CloneFile and check_reflink in a location that other frontend > binaries could use would be nice, like pg_rewind. This could be done in subsequent patches, but the previous iteration of this patch for CREATE DATABASE integration already showed that each of those cases needs separate consideration. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: foreign key to foreign table
On Tue, Jul 17, 2018 at 12:13 PM, Kaye Ann Ignacio wrote: > Hi, > > I have a foreign table created with postgres_fdw and with that, I tried to > create a local table to reference the foreign table in order to set a > foreign key constraint in my local table but I get an error message saying > that the referenced table is not a table. Is there a way I can reference a > foreign table? I don't think so. Since the data in a foreign table resides on a foreign server and can be manipulated without local server knowing about it, it's very possible that those foreign key constraints will become invalid without local server knowing about them. This looks like a question for general mailing list to me. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
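To make the failure concrete (hypothetical object names, assuming a postgres_fdw setup along the lines Kaye describes), the constraint is rejected when it is defined:

    -- "items" is a foreign table pointing at the remote server
    CREATE FOREIGN TABLE items (id int, label text)
        SERVER remote_srv OPTIONS (table_name 'items');

    CREATE TABLE orders (
        order_id serial PRIMARY KEY,
        item_id  int REFERENCES items (id)
    );
    -- ERROR:  referenced relation "items" is not a table

The usual substitutes are application-level checks or triggers on the local table, with the caveat Ashutosh mentions: nothing prevents the remote data from changing underneath them.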
Re: Refactor documentation for wait events (Was: pgsql: Add wait event for fsync of WAL segments)
On Tue, Jul 17, 2018 at 12:19 AM, Michael Paquier wrote: > And the patch previously sent removes them, but perhaps I am missing > your point? I was just confused. Sorry for the noise. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Add SKIP LOCKED to VACUUM and ANALYZE
On Tue, Jul 17, 2018 at 2:21 AM, Michael Paquier wrote: > For this part, it seems to me that we can do better than what is in > HEAD: > - Call RangeVarGetRelidExtended without lock. I haven't been following this work closely, but I just want to point out that the reason why RangeVarGetRelidExtended exists is because: (1) we cannot lock a relation without looking up its OID, and should not lock it without doing permissions checks first, and at least as currently designed, those functions expect an OID, but (2) we cannot look up a relation first and only lock it afterwards, because DDL might happen in the middle and leave us locking the wrong relation When RangeVarGetRelidExtended is called with an argument other than NoLock, it effectively makes locking, permissions checking, and the name lookup happen simultaneously, so that it is neither possible to lock a relation for which you don't have permissions, nor to change the identity of the relation after the name lookup has been done and before the lock is acquired. On the other hand, when you call it with NoLock, you don't get those nice behaviors. So I'm inclined to think that any place in the source code that calls RangeVarGetRelidExtended with NoLock is a bug, and we should be trying to get rid of the cases we still have, not adding more. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] [PATCH] kNN for SP-GiST
Attached 7th version of the patches: * renamed recheck fields and variables * fixed formatting of the one if-statement On 15.07.2018 14:54, Andrey Borodin wrote: 14.07.2018, 1:31, Nikita Glukhov wrote: Attached 6th version of the patches. ... 2. The patch leaves contribs intact. Do extensions with sp-gist opclasses need to update it's behavior somehow to be used as-is? Or to support new functionality? There are no SP-GiST opclasses in contrib/, so there is nothing to change in contrib/. Moreover, existing extensions need to be updated only for support of new distance-ordered searches -- they need to implement distances[][] array calculation in their spgInnerConsistent() and spgLeafConsistent() support functions. So, if extension will not update anything - it will keep all preexisting functionality, right? Yes, of course. I have some more nitpicks about naming + recheckQual = out.recheck; + recheckDist = out.recheckDistances; Can we call this things somehow in one fashion? I would prefer to rename "spgLeafConsitentOut.recheck" to "recheckQual" but it is not good to rename user-visible fields, so I decided to rename all fields and variables to "recheck"/"recheckDistances". Also, this my be a stupid question, but do we really need to differ this two recheck cases? It is certain that we cannot merge them? This recheck cases are absolutely different. In the first case, opclass support functions can not accurately determine whether the leaf value satisfies the boolean search operator (compressed values can be stored in leaves). In the second case, opclass returns only a approximate value (the lower bound) of the leaf distances. In the example below operator 'polygon >> polygon' does not need recheck because bounding box (they are stored in the index instead of polygons) test gives exact results, but the ordering operator 'polygon <-> point' needs recheck: SELECT * FROM polygons WHERE poly >> polygon(box '((0,0),(1,1))') ORDER BY poly <-> point '(0,0)'; This seems strange if-formatting + /* If allTheSame, they should all or none of 'em match */ + if (innerTuple->allTheSame) + if (out.nNodes != 0 && out.nNodes != nNodes) + elog(ERROR, "inconsistent inner_consistent results for allTheSame inner tuple"); + + if (out.nNodes) // few lines before you compare with 0 Fixed. 
-- Nikita Glukhov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company diff --git a/src/backend/utils/adt/geo_ops.c b/src/backend/utils/adt/geo_ops.c index f57380a..a4345ef 100644 --- a/src/backend/utils/adt/geo_ops.c +++ b/src/backend/utils/adt/geo_ops.c @@ -41,7 +41,6 @@ enum path_delim static int point_inside(Point *p, int npts, Point *plist); static int lseg_crossing(double x, double y, double px, double py); static BOX *box_construct(double x1, double x2, double y1, double y2); -static BOX *box_fill(BOX *result, double x1, double x2, double y1, double y2); static bool box_ov(BOX *box1, BOX *box2); static double box_ht(BOX *box); static double box_wd(BOX *box); @@ -451,7 +450,7 @@ box_construct(double x1, double x2, double y1, double y2) /* box_fill - fill in a given box struct */ -static BOX * +BOX * box_fill(BOX *result, double x1, double x2, double y1, double y2) { if (x1 > x2) @@ -482,7 +481,7 @@ box_fill(BOX *result, double x1, double x2, double y1, double y2) /* box_copy - copy a box */ BOX * -box_copy(BOX *box) +box_copy(const BOX *box) { BOX *result = (BOX *) palloc(sizeof(BOX)); diff --git a/src/include/utils/geo_decls.h b/src/include/utils/geo_decls.h index a589e42..94806c2 100644 --- a/src/include/utils/geo_decls.h +++ b/src/include/utils/geo_decls.h @@ -182,6 +182,7 @@ typedef struct extern double point_dt(Point *pt1, Point *pt2); extern double point_sl(Point *pt1, Point *pt2); extern double pg_hypot(double x, double y); -extern BOX *box_copy(BOX *box); +extern BOX *box_copy(const BOX *box); +extern BOX *box_fill(BOX *box, double xlo, double xhi, double ylo, double yhi); #endif /* GEO_DECLS_H */ diff --git a/src/backend/access/gist/gistget.c b/src/backend/access/gist/gistget.c index c4e8a3b..9279756 100644 --- a/src/backend/access/gist/gistget.c +++ b/src/backend/access/gist/gistget.c @@ -14,9 +14,9 @@ */ #include "postgres.h" +#include "access/genam.h" #include "access/gist_private.h" #include "access/relscan.h" -#include "catalog/pg_type.h" #include "miscadmin.h" #include "storage/lmgr.h" #include "storage/predicate.h" @@ -543,7 +543,6 @@ getNextNearest(IndexScanDesc scan) { GISTScanOpaque so = (GISTScanOpaque) scan->opaque; bool res = false; - int i; if (scan->xs_hitup) { @@ -564,45 +563,10 @@ getNextNearest(IndexScanDesc scan) /* found a heap item at currently minimal distance */ scan->xs_ctup.t_self = item->data.heap.heapPtr; scan->xs_recheck = item->data.heap.recheck; - scan->xs_recheckorderby = item->data.
Re[2]: Alter index rename concurrently to
> Monday, 16 July 2018, 22:40 +03:00, from Victor Yegorov:
>
> Mon, 16 Jul 2018 at 21:58, Andrey Klychkov <aaklych...@mail.ru>:
>> I made a patch to solve this issue (see the attachment).
>> It allows avoiding locks with a query like this:
>> "alter index ... rename CONCURRENTLY to ...".
>
> Please, have a look at previous discussions on the subject:
> - 2012 https://www.postgresql.org/message-id/cab7npqtys6juqdxuczbjb0bnw0kprw8wdzuk11kaxqq6o98...@mail.gmail.com
> - 2013 https://www.postgresql.org/message-id/cab7npqstfkuc7dzxcdx4hotu63yxhrronv2aouzyd-zz_8z...@mail.gmail.com
> - https://commitfest.postgresql.org/16/1276/

Hello, thank you for this information!

In the first discussion the concurrent alter was mentioned. In the second link and the commitfest entry I only saw "Reindex concurrently 2.0". It sounds great! If this feature gets added to core it will certainly facilitate index rebuilding.

What about "alter index ... rename to" in concurrent mode? Does "Reindex concurrently 2.0" add it? From time to time we just need to rename some indexes. Without a concurrent mode this "alter index" causes queries to queue up behind it, which can be a problem on heavily loaded databases.

-- Kind regards, Andrey Klychkov
Re: Make foo=null a warning by default.
Hello,

> Yeah, I was wondering about that too. But Fabien brought up a completely
> new use-case for this: people learning SQL.

Indeed. This year, about 30% of my students wrote "= NULL" in a query at least once, and each time either I or a colleague was probably called to the rescue. So this warning would save me time, which explains why I reviewed the patch with such enthusiasm :-) I'm fine with keeping the current behavior as a default.

-- Fabien.
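For readers who haven't hit this trap: a comparison with NULL never yields true, so the beginner's "WHERE col = NULL" silently matches nothing. A quick illustration (assuming transform_null_equals is off, its default):

    select 1 = null as result;      -- NULL (unknown), not true
    select null = null as result;   -- still NULL
    select 1 is null as result;     -- false
    select null is null as result;  -- true
    -- so WHERE col = NULL filters out every row; IS NULL is what is usually meant,
    -- which is exactly the mistake the proposed warning is intended to flag.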
Re: Libpq support to connect to standby server as priority
On Tue, Jul 17, 2018 at 12:42 AM Laurenz Albe wrote: > Haribabu Kommi wrote: > > > On Wed, Jul 4, 2018 at 11:14 PM Laurenz Albe > wrote: > > > > Haribabu Kommi wrote: > > > > > > > > - I think the construction with "read_write_host_index" makes the > code even more > > > > complicated than it already is. > > > > > > > > What about keeping the first successful connection open and > storing it in a > > > > variable if we are in "prefer-read" mode. > > > > If we get the read-only connection we desire, close that cached > connection, > > > > otherwise use it. > > > > > > Even if we add a variable to cache the connection, I don't think the > logic of checking > > > the next host for the read-only host logic may not change, but the > extra connection > > > request to the read-write host again will be removed. > > > > I evaluated your suggestion of caching the connection and reuse it when > there is no > > read only server doesn't find, but I am thinking that it will add more > complexity and also > > the connection to the other servers delays, the cached connection may be > closed by > > the server also because of timeout. > > > > I feel the extra time during connection may be fine, if user is > preferring the prefer-read > > mode, instead of adding more complexity in handling the cached > connection? > > > > comments? > > I tested the new patch, and it works as expected. > Thanks for the confirmation. > I don't think that time-out of the cached session is a valid concern, > because that > would have to be a really short timeout. > On the other hand, establishing the connection twice (first to check if it > is read-only, > then again because no read-only connection is found) can be quite costly. > > But that is a matter of debate, as is the readability of the code. > Thanks for your opinion, let's wait for opinion from others also. I can go for the modification, if others also find it useful. Regards, Haribabu Kommi Fujitsu Australia
Re: [HACKERS] WAL logging problem in 9.4.3?
On 07/16/2018 08:01 PM, Michael Paquier wrote:
> I doubt as well that we'd be able to catch all the holes, as the conditions
> under which the optimization can run safely are basically impossible to pin
> down beforehand. I'd like to vote for getting rid of this optimization for
> COPY; it can hurt more than it helps. Per the lack of complaints, this could
> happen only in HEAD?

Well, we'd be getting rid of it because of a danger of data loss which we can't otherwise mitigate. Maybe it does need to be backpatched, even if we haven't had complaints.

cheers

andrew -- Andrew Dunstan  https://www.2ndQuadrant.com  PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
PG 10: could not generate random cancel key
Last week I upgraded 15 servers from various pre-10 versions to 10.4. At first everything looked OK, but then (around 4 days later) one of them failed with this in the logs: 2018-07-14 01:53:35.840 BST LOG: could not generate random cancel key 2018-07-14 01:53:37.233 BST LOG: could not generate random cancel key 2018-07-14 01:53:37.245 BST LOG: could not generate random cancel key 2018-07-14 01:53:38.553 BST LOG: could not generate random cancel key 2018-07-14 01:53:38.581 BST LOG: could not generate random cancel key 2018-07-14 01:54:43.851 BST WARNING: worker took too long to start; canceled 2018-07-14 01:54:43.862 BST LOG: could not generate random cancel key 2018-07-14 01:55:09.861 BST LOG: could not generate random cancel key 2018-07-14 01:55:09.874 BST LOG: could not generate random cancel key ... After that it would not accept any new connections until I restarted postmaster a few hours later. Since then, it has been OK. It was built using --with-openssl and strong random support enabled, so it was OpenSSL's RAND_bytes() that failed for some reason. I attempted to reproduce it with a small C program directly calling RAND_bytes(), but it refused to fail, even if I disabled haveged and ran my tests in an @reboot cron job. So this failure is evidently quite rare, but the documentation for RAND_bytes() says it *can* fail (returning 0) if it isn't seeded with enough entropy, in which case more must be added, which we're not doing. In addition, once it does fail, repeated calls to RAND_bytes() will continue to fail if it isn't seeded with more data -- hence the inability to start any new backends until after a postmaster restart, which is not a very friendly failure mode. The OpenSSL documentation suggests that we should use RAND_status() [1] to check that the generator has been seeded with enough data: RAND_status() indicates whether or not the CSPRNG has been sufficiently seeded. If not, functions such as RAND_bytes(3) will fail. and if not, RAND_poll() can be used to fix that: RAND_poll() uses the system's capabilities to seed the CSPRNG using random input obtained from polling various trusted entropy sources. The default choice of the entropy source can be modified at build time using the --with-rand-seed configure option, see also the NOTES section. A summary of the configure options can be displayed with the OpenSSL version(1) command. Looking for precedents elsewhere, I found [2] which does exactly that, although I'm slightly dubious about the need for the for-loop there. I also found a thread [3], which recommends simply doing if (RAND_status() == 0) RAND_poll(); which seems preferable. Attached is a patch to do this in pg_strong_random(). Thoughts? Regards, Dean [1] https://www.openssl.org/docs/man1.1.1/man3/RAND_status.html [2] https://github.com/nodejs/node/blob/master/src/node_crypto.cc [3] https://github.com/openssl/openssl/issues/4148 diff --git a/src/port/pg_strong_random.c b/src/port/pg_strong_random.c new file mode 100644 index bc7a8aa..fd5ad92 --- a/src/port/pg_strong_random.c +++ b/src/port/pg_strong_random.c @@ -103,6 +103,9 @@ pg_strong_random(void *buf, size_t len) * When built with OpenSSL, use OpenSSL's RAND_bytes function. */ #if defined(USE_OPENSSL_RANDOM) + /* Make sure that OpenSSL's CSPRNG has been sufficiently seeded */ + if (RAND_status() == 0) + (void) RAND_poll(); if (RAND_bytes(buf, len) == 1) return true; return false;
Re: Make foo=null a warning by default.
Hello David,

A few comments about this v2.

ISTM that there is quite strong opposition to having "warn" as a default, so probably you should set it to "off"?

>> transform_null_equals_options[] = { [...]
>> I do not really understand the sort order of this array. Maybe it could be
>> alphabetical, or per value? Or maybe it is standard values and then synonyms,
>> in which case maybe a comment at the end of the list.
> I've changed it to per value. Is this better?

At least, I can see the logic.

>> Or anything in better English.
> I've changed this, and hope it's clearer.

Yep, and more elegant than my proposal.

>> * Test
>> +select 1=null; + ?column?
>> Maybe use AS to describe the expected behavior, so that it is easier to check,
>> and I think that "?column?" looks silly, eg:
>> select 1=null as expect_{true,false}[_and_a_warning/error];
> Changed to more descriptive headers.

Cannot see it in the patch update?

>> create table cnullchild (check (f1 = 1 or f1 = null)) inherits(cnullparent);
>> +WARNING: = NULL can only produce a NULL
>> I'm not sure of this test case. Should it be turned into "is null"?
> This was just adjusting the existing test output to account for the new
> warning. I presume it was phrased that way for a reason.

Hmmm. Probably you are right. I think that the test case deserves better comments, as it is most peculiar. It was added by Tom for testing the simplification of constant NULL expressions in CHECK constraints. TRUE OR NULL is TRUE, but FALSE OR NULL is NULL. Why not. Combined with the fact that NULL is considered ok for a CHECK, this leads to strange but intentional behavior, such as:

-- this CHECK is always ignored in practice
CREATE TABLE foo (i INT CHECK(i = 1 OR NULL));
INSERT INTO foo(i) VALUES(2); # ACCEPTED

-- but this one is not
CREATE TABLE foo (i INT CHECK(i = 1 OR FALSE));
INSERT INTO foo(i) VALUES(2); # REFUSED

Can't say I'm thrilled about that, and the added warning looks appropriate.

-- Fabien.
Re[2]: Alter index rename concurrently to
> Monday, 16 July 2018, 22:19 +03:00, from Andrey Borodin:
>
> Hi!
>
>> On 16 July 2018 at 22:58, Andrey Klychkov <aaklych...@mail.ru> wrote:
>> Dear hackers!
>>
>> I have an idea to facilitate work with index rebuilding.
>>
>> "ALTER INDEX ... RENAME CONCURRENTLY TO ..."
>
> The idea seems useful. I'm not an expert in CIC, but your post does not cover
> one very important topic. When we use CREATE INDEX CONCURRENTLY we pay for
> the less intrusive lock by scanning the data twice. What is the price of
> RENAME CONCURRENTLY? Should this mode be the default?

Hello, and thank you for the answer!

I don't think "alter index ... rename concurrently to ..." has large overheads, because it just locks the table and copies the new name over the old one in the relform->relname field of the FormData_pg_class struct.

./src/include/catalog/pg_class.h: typedef of FormData_pg_class
./src/backend/commands/tablecmds.c: RenameRelation() and RenameRelationInternal()

As I wrote before, in my patch I've added the option to change the locking (AccessShareLock -> ShareUpdateExclusiveLock) by passing "concurrently" to "alter"; it's similar to what the index_drop()/index_create() functions do. It changes only that one name field and nothing more, so I believe it is safe.

Also, I ran the tests again: select/insert queries in loops in several sessions, while changing the index name concurrently in parallel many times. After that the index was still valid and its index scan count had increased. However, I don't have much experience here and maybe I am wrong.

-- Kind regards, Andrey Klychkov
Re: Pluggable Storage - Andres's take
On Mon, Jul 16, 2018 at 11:35 PM Andres Freund wrote: > On 2018-07-04 20:11:21 +1000, Haribabu Kommi wrote: > > On Tue, Jul 3, 2018 at 5:06 PM Andres Freund wrote: > > > > The most fundamental issues I had with Haribabu's last version from [2] > > > are the following: > > > > > > - The use of TableTuple, a typedef from void *, is bad from multiple > > > fronts. For one it reduces just about all type safety. There were > > > numerous bugs in the patch where things were just cast from HeapTuple > > > to TableTuple to HeapTuple (and even to TupleTableSlot). I think > it's > > > a really, really bad idea to introduce a vague type like this for > > > development purposes alone, it makes it way too hard to refactor - > > > essentially throwing the biggest benefit of type safe languages out > of > > > the window. > > > > > > > My earlier intention was to remove the HeapTuple usage entirely and > replace > > it with slot everywhere outside the tableam. But it ended up with > TableTuple > > before it reach to the stage because of heavy use of HeapTuple. > > I don't think that's necessary - a lot of the system catalog accesses > are going to continue to be heap tuple accesses. And the conversions you > did significantly continue to access TableTuples as heap tuples - it was > just that the compiler didn't warn about it anymore. > > A prime example of that is the way the rewriteheap / cluster > integreation was done. Cluster continued to sort tuples as heap tuples - > even though that's likely incompatible with other tuple formats which > need different state. > OK. Understood. > > > - the heap tableam did a heap_copytuple() nearly everywhere. Leading to > > > a higher memory usage, because the resulting tuples weren't freed or > > > anything. There might be a reason for doing such a change - we've > > > certainly discussed that before - but I'm *vehemently* against doing > > > that at the same time we introduce pluggable storage. Analyzing the > > > performance effects will be hard enough without changes like this. > > > > > > > How about using of slot instead of tuple and reusing of it? I don't know > > yet whether it is possible everywhere. > > Not quite sure what you mean? > I thought of using slot everywhere can reduce the use of heap_copytuple, I understand that you already did those changes from you reply from the other mail. > > Tasks / Questions: > > > > > > - split up patch > > > > > > > How about generating refactoring changes as patches first based on > > the code in your repo as discussed here[1]? > > Sure - it was just at the moment too much work ;) > Yes, it is too much work. How about doing this once most of the open items are finished? > > > > - Change heap table AM to not allocate handler function for each table, > > > instead allocate it statically. Avoids a significant amount of data > > > duplication, and allows for a few more compiler optimizations. > > > > > > > Some kind of static variable handlers for each tableam, but need to check > > how can we access that static handler from the relation. > > I'm not sure what you mean by "how can we access"? We can just return a > pointer from the constant data from the current handler? Except that > adding a bunch of consts would be good, the external interface wouldn't > need to change? > I mean we may need to store some tableam ID in each table, so that based on that ID we get the static tableam handler, because at a time there may be tables from different tableam methods. 
> > > > - change scan level slot creation to use tableam function for doing so > > > - get rid of slot->tts_tid, tts_tupleOid and potentially tts_tableOid > > > > > > > so with this there shouldn't be a way from slot to tid mapping or there > > should be some other way. > > I'm not sure I follow? > Replacing of heaptuple with slot, currently with slot only the tid is passed to the tableam methods, To get rid of the tid from slot, we may need any other method of passing? > > - bitmap index scans probably need a new tableam.h callback, abstracting > > > bitgetpage() > > > > > > > OK. > > Any chance you could try to tackle this? I'm going to be mostly out > this week, so we'd probably not run across each others feet... > OK, I will take care of the above point. Regards, Haribabu Kommi Fujitsu Australia
Re: [HACKERS] PATCH: multivariate histograms and MCV lists
On 07/17/2018 11:09 AM, Dean Rasheed wrote: On 16 July 2018 at 21:55, Tomas Vondra wrote: ... >> So, how would the proposed algorithm work? Let's start with "a=1": sel(a=1) = 0.1508 I don't see much point in applying the two "b" clauses independently (or how would it be done, as it's effectively a single clause): sel(b=1 or b=2) = 0.0673 And we get total_sel = sel(a=1) * sel(b=1 or b=2) = 0.0101 From the multivariate MCV we get mcv_sel = 0.0399 And finally total_sel = Max(total_sel, mcv_sel) = 0.0399 Which is great, but I don't see how that actually helped anything? We still only have the estimate for the ~7% covered by the MCV list, and not the remaining non-MCV part. Right. If these are the only stats available, and there are just 2 top-level clauses like this, it either returns the MCV estimate, or the old univariate estimate (whichever is larger). It avoids over-estimating, but will almost certainly under-estimate when the MCV matches are incomplete. Yeah :-( In my experience under-estimates tend to have much worse consequences (say a nested loop chosen by under-estimate vs. hash join chosen by over-estimate). This certainly influenced some of the choices I've made in this patch (extrapolation to non-MCV part is an example of that), but I agree it's not particularly scientific approach and I'd very much want something better. I could do the same thing for the second query, but the problem there is actually exactly the same - extrapolation from MCV to non-MCV part roughly doubles the estimate. So unless I'm applying your algorithm incorrectly, this does not seem like a very promising direction :-( You could be right. Actually it's the order dependence with more than 2 top-level clauses that bothers me most about this algorithm. It's also not entirely obvious how to include histogram stats in this scheme. I think for inequalities that's fairly simple - histograms work pretty well for that, and I have a hunch that replacing the 0.5 estimate for partially-matching buckets with conver_to_scalar-like logic and the geometric mean (as you proposed) will work well enough. For equalities it's going to be hard. The only thing I can think of at the moment is checking if there are any matching buckets at all, and using that to decide whether to extrapolate the MCV selectivity to the non-MCV part or not (or perhaps to what part of the non-MCV part). A different approach that I have been thinking about is, in mcv_update_match_bitmap(), attempt to work out the probability of all the clauses matching and it not being an MCV value. For example, given a clause like a=1 whose univariate estimate we know (0.1508 in the above example), knowing that the MCV values account for 0.0222+0.0177 of that, the remainder must be from non-MCV values. So we could say that the probability that a=1 and it not being and MCV is 0.1508-0.0222-0.0177 = 0.1109. So then the question is could we combine these non-MCV probabilities to give an estimate of how many non-MCV values we need to worry about? I've not fully thought that through, but it might be useful. Could we use it to derive some upper boundaries on the non-MCV part? The problem is, it's still likely to over-estimate when the MCV matches fully cover all possibilities, and I still don't see a way to reliably detect that case. I guess we can use a histogram to limit the over-estimates, as explained above. It may not be 100% reliable as it depends on the sample and how exactly the buckets are formed, but it might help. 
But perhaps these over-estimates are not a particularly serious issue? When you already get matches in an MCV, the number of matching rows is going to be pretty significant. If you over-estimate 2x, so what? IMHO that's still a pretty accurate estimate. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
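For reference, the quantity sketched above (the probability of a clause matching while the value is not in the MCV list) works out like this with the numbers from the running example; nothing new here, just the figures from the thread put side by side:

   P(a=1)                     = 0.1508            (univariate estimate)
   P(a=1, value in MCV list)  = 0.0222 + 0.0177   = 0.0399
   P(a=1, value not in MCV)   = 0.1508 - 0.0399   = 0.1109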
Re: PG 10: could not generate random cancel key
On Tue, Jul 17, 2018 at 01:33:11PM +0100, Dean Rasheed wrote:
> Looking for precedents elsewhere, I found [2] which does exactly that,
> although I'm slightly dubious about the need for the for-loop there. I
> also found a thread [3], which recommends simply doing
>
> if (RAND_status() == 0)
> RAND_poll();
>
> which seems preferable. Attached is a patch to do this in pg_strong_random().

Checking the return value of RAND_poll() would also be a good thing to do. From what I read of the OpenSSL code it can fail as well, and we could combine that with a loop that attempts to feed the machinery a fixed number of times, giving up after repeated successive failures.
-- Michael
Re: [bug fix] Produce a crash dump before main() on Windows
On Thu, Mar 1, 2018 at 4:14 PM Tsunakawa, Takayuki <tsunakawa.ta...@jp.fujitsu.com> wrote:
> From: Tsunakawa, Takayuki [mailto:tsunakawa.ta...@jp.fujitsu.com]
> > Another idea to add to the current patch is to move the call to SetErrorMode()
> > to the below function, which is called first in main(). How about this?
> >
> > void
> > pgwin32_install_crashdump_handler(void)
> > {
> > SetUnhandledExceptionFilter(crashDumpHandler);
> > }
>
> I moved SetErrorMode() to the beginning of main(). It should be placed
> before any code which could crash. The current location is a bit late: in
> fact, write_stderr() crashed when WSAStartup() failed.

I reviewed the patch and it works as described in the subject, but I was not able to verify the actual bug reported upthread. Moving the SetErrorMode() call to the start of main() handles all the cases that can lead to a crash without a popup. Leaving the error mode at SetErrorMode(0) before a process starts would let an issue hit before main() produce a core dump, but couldn't that also stop the postmaster from restarting the crashed process if nobody is present to deal with the popup? And couldn't this change cause backward-compatibility problems for some users?

I don't have any other comments on the current patch.

Regards, Haribabu Kommi Fujitsu Australia
Re: PG 10: could not generate random cancel key
On 17 July 2018 at 14:04, Michael Paquier wrote: > On Tue, Jul 17, 2018 at 01:33:11PM +0100, Dean Rasheed wrote: >> Looking for precedents elsewhere, I found [2] which does exactly that, >> although I'm slightly dubious about the need for the for-loop there. I >> also found a thread [3], which recommends simply doing >> >> if (RAND_status() == 0) >> RAND_poll(); >> >> which seems preferable. Attached is a patch to do this in pg_strong_random(). > > Checking for the return result of RAND_poll() would also be good thing > to do. From what I read in OpenSSL code it could fail as well, and > we could combine that with a loop attempted to feed the machinery a > decided amount of times, just failing after successive failures. >From what I understand from here [1], some parts of OpenSSL call RAND_poll() once on initialisation, and that's enough to get the PRNG going. It's not obvious that calling it multiple times would have any benefit. They also don't appear to bother checking the return code from RAND_poll() [2]. If it did fail, there'd not be much you could do anyway, so you might as well just let it continue and let RAND_bytes() fail. In fact it may even be possible for RAND_poll() to fail, but just do enough to cause RAND_bytes() to succeed. Regards, Dean [1] https://wiki.openssl.org/index.php/Random_Numbers [2] https://github.com/benvanik/openssl/blob/master/openssl/crypto/rand/md_rand.c
Re: [HACKERS] Support for Secure Transport SSL library on macOS as OpenSSL alternative
On 27/06/18 21:57, Daniel Gustafsson wrote: On 27 Jun 2018, at 14:32, Daniel Gustafsson wrote: Attached is an updated patch for supporting the native macOS Secure Transport library, rebased on top of current master. Courtesy of the ever-present Murphy I managed to forget some testcode in src/backend/Makefile which broke compilation for builds without secure transport, attached v8 patch fixes that. I've read through this patch now in more detail. Looks pretty good, but I have a laundry list of little things below. The big missing item is documentation. --- laundry list begins --- What exactly does 'ssl_is_server_start' mean? I don't understand that mechanism. ISTM it's only checked in load_key(), via be_tls_open_server(), which should only be called after be_tls_init(), in which case it's always 'true' when it's checked. Should there be an Assertion on it or something? The "-framework" option, being added to CFLAGS, is clang specific. I think we need some more autoconf magic, to make this work with gcc. In be_tls_open_server(), I believe there are several cases of using variables uninitialized. For example, load_identity_keychain() doesn't always set the 'identity' output parameter, but there's an "if (identity == NULL)" check after the call to it. And 'certificates' is left uninitialized, if load_identity_keychain() is used. Also, 'dh_len' is used uninitialized here in the "if (!dh_buf || dh_len == 0)" line, if the 'ssl_dh_params_file' option is not set. Did clang not give warnings about these? + /* +* Certificate Revocation List are not supported in the Secure Transport +* API +*/ The corresponding client code has a longer comment on that: + /* +* There is no API to load CRL lists in Secure Transport, they can however +* be imported into a Keychain with the commandline application "certtool". +* For libpq to use them, the certificate/key and root certificate needs to +* be using an identity in a Keychain into which the CRL have been +* imported. That needs to be documented. +*/ Is it possible to programmatically create a temporary keychain, in memory, and load the CRL to it? (I googled a bit, and couldn't find any suitable API for it, so I guess not..) + if (ssl_crl_file[0]) + ereport(FATAL, + (errmsg("CRL files not supported with Secure Transport"))); I think this should be COMMERROR, like other errors around this. We don't want to pass that error message to the client. Although I guess it won't reach the client as we haven't negotiated TLS yet. + /* +* We use kTryAuthenticate here since we don't know which sslmode the +* client is using. If we were to use kAlwaysAuthenticate then sslmode +* require won't work as intended. +*/ + if (ssl_loaded_verify_locations) + SSLSetClientSideAuthenticate((SSLContextRef) port->ssl, kTryAuthenticate); That comment sounds wrong. This is in backend code, and SSLSetClientSideAuthenticate() is all about checking the client's certificate in the server, while libpq 'sslmode' is about checking the server's certificate in the client. + for (;;) + { + status = SSLHandshake((SSLContextRef) port->ssl); + if (status == noErr) + break; + + if (status == errSSLWouldBlock) + continue; + ... + } Does this busy-wait, if there's not enough data from the client? That seems bad. In the corresponding client-side code, there's a comment on that too, but it's still bad. In be_tls_get_version and PQsslAttribute, can we add support for kTLSProtocol13? Perhaps with an autoconf test, if the constant is not available on all macOS versions. 
In be_tls_get_peerdn_name, wouldn't SecCertificateCopyCommonName() be more appropriate than SecCertificateCopyLongDescription()? In be_tls_get_cipher, returning "unknown" would seem better than erroring out, if the cipher isn't recognized. Check error message style. eg.: + ereport(COMMERROR, + (errmsg("could not load server certificate \"%s\": \"%s\"", + ssl_cert_file, pg_SSLerrmessage(status; Why is the result of pg_SSLerrmessage() in quotes? Maybe errdetail? + * Any consumers of the Keychain array should always call this to ensure that + * it is set up in a manner that reflect the configuration. If it not, then s/reflect/reflects/ + else if (keychain_count == 2) + { + if (ssl_keychain_file[0] == '\0' || !ssl_keychain_use_default) + goto error; + + return; + } + else + /* Fallthrough to erroring out */ + +error: + ereport(FATAL, + (
Another fun fact about temp tables and wraparound
Hello, hackers!

Recently I was investigating a case of the 'stuck in wraparound' problem. The PostgreSQL instance in question (9.6.9) reached the 'million-before-wraparound' threshold and switched to read-only mode. Running vacuum in single-user mode gave no results; datfrozenxid was not advancing:

backend> vacuum freeze;
2018-07-13 16:43:58 MSK [3666-3] WARNING: database "database_name" must be vacuumed within 991565 transactions
2018-07-13 16:43:58 MSK [3666-4] HINT: To avoid a database shutdown, execute a database-wide VACUUM in that database. You might also need to commit or roll back old prepared transactions.
backend>

pg_prepared_xacts was empty. After some poking around it became clear that some old temp table was holding the oldest relfrozenxid! vacuum ignores temp tables during get_rel_oids(), but not when it comes to calculating the oldest relfrozenxid. Dropping all temp schemas helped. A crude way to reproduce:

postgres=# create temp table t1();

gdb: set ShmemVariableCache->nextXid = ShmemVariableCache->xidStopLimit + 100

pg_ctl stop -D PGDATA

with open('path_to_clog_file', 'w') as f:
    x = 0
    while x < 20:
        f.write(chr(1))
        x = x + 1

postgres --single -D $PGDATA
PostgreSQL stand-alone backend 9.6.9
backend> vacuum freeze;
WARNING: database "postgres" must be vacuumed within 47 transactions
HINT: To avoid a database shutdown, execute a database-wide VACUUM in that database. You might also need to commit or roll back old prepared transactions.
backend>
backend>
backend> vacuum freeze;
backend>

I think the root cause of all temp table problems is that they are present in the catalogs; I think they should not be. And vacuum should probably ignore them during the datfrozenxid calculation, at least in single-user mode. Or just drop them in single-user mode. It would also be good to have the advice 'drop temp schemas' in the HINT message.

-- Grigory Smolkin Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
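For anyone hitting the same wall, a quick way to see which relations (temp ones included) are pinning datfrozenxid is something like this minimal sketch; temp tables show up under pg_temp_N schemas with relpersistence = 't':

    SELECT n.nspname, c.relname, c.relpersistence,
           age(c.relfrozenxid) AS xid_age
    FROM pg_class c
    JOIN pg_namespace n ON n.oid = c.relnamespace
    WHERE c.relkind IN ('r', 'm', 't')   -- plain tables, matviews, toast tables
    ORDER BY age(c.relfrozenxid) DESC
    LIMIT 20;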
Re: Another fun fact about temp tables and wraparound
On 2018-Jul-17, Grigory Smolkin wrote: > Hello, hackers! > > Recently I was investigating the case of 'stuck in wraparaound' problem. > PostgreSQL instance(9.6.9) in question reached 'million-before-wraparound' > threshold and switched to read-only mode. > Running vacuum in single-mode gives not results, datfrozenxid was not > advancing: > > backend> vacuum freeze; > 2018-07-13 16:43:58 MSK [3666-3] WARNING: database "database_name" must be > vacuumed within 991565 transactions > 2018-07-13 16:43:58 MSK [3666-4] HINT: To avoid a database shutdown, > execute a database-wide VACUUM in that database. > You might also need to commit or roll back old prepared > transactions. > backend> > > pg_prepared_xacts was empty. > After some poking around it became clear that some old temp table was > holding the oldest relfrozenxid! Hmm, autovacuum is supposed to drop temp tables that are above the wraparound xid age to avoid this problem -- see autovacuum lines 2046ff. (Except it doesn't do anything if the owning backend is active. I guess this could be a problem if the owning backend fails to do anything about those tables. Maybe this part is a mistake.) Obviously, during single-user mode autovacuum doesn't run anyway. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Allow auto_explain to log to NOTICE
> On 4 Jul 2018, at 15:34, Andrew Dunstan > wrote: > > On Wed, Jun 20, 2018 at 2:06 PM, Daniel Gustafsson wrote: >>> On 27 Apr 2018, at 04:24, Andres Freund wrote: >>> >>> On 2018-04-27 11:52:18 +0930, Tom Dunstan wrote: > I'd argue this should contain the non-error cases. It's just as > reasonable to want to add this as a debug level or such. So all of warning, info, debug and debug1-5? >>> >>> Yea. Likely nobody will ever use debug5, but I don't see a point making >>> that judgement call here. >> >> Did you have a chance to hack up a new version of the patch with Andres’ >> review >> comments? I’m marking this patch as Waiting on Author for now based on the >> feedback so far in this thread. >> > > Here is an updated version of the patch. Setting this to "needs review”. Thanks! Looking at the code, and documentation this seems a worthwhile addition. Manual testing proves that it works as expected, and the patch follows the expected style. A few small comments: Since DEBUG is not a defined loglevel, it seems superfluous to include it here. It’s also omitted from the documentation so it should probably be omitted from here. + {"debug", DEBUG2, true}, The tag should be closed with a matching . + auto_explain.log_level configuration parameter With those fixed it’s ready for committer. > I haven't added tests for auto_explain - I think that would be useful > but it's a separate project. Agreeing that this would be beneficial, the attached patch (to apply on top of the patch in question) takes a stab at adding tests for this new functionality. In order to test plan output we need to support COSTS in the explain output, so a new GUC auto_explain.log_costs is added. We also need to not print the duration, so as a hack this patch omits the duration if auto_explain.log_timing is set to off and auto_explain.log_analyze is set off. This is a hack and not a nice overloading, but it seems overkill to add a separate GUC just to turn off the duration, any better ideas on how support omitting the duration? cheers ./daniel auto_explain_tests.diff Description: Binary data
Re: Allow auto_explain to log to NOTICE
On 07/17/2018 12:04 PM, Daniel Gustafsson wrote: On 4 Jul 2018, at 15:34, Andrew Dunstan wrote: On Wed, Jun 20, 2018 at 2:06 PM, Daniel Gustafsson wrote: On 27 Apr 2018, at 04:24, Andres Freund wrote: On 2018-04-27 11:52:18 +0930, Tom Dunstan wrote: I'd argue this should contain the non-error cases. It's just as reasonable to want to add this as a debug level or such. So all of warning, info, debug and debug1-5? Yea. Likely nobody will ever use debug5, but I don't see a point making that judgement call here. Did you have a chance to hack up a new version of the patch with Andres’ review comments? I’m marking this patch as Waiting on Author for now based on the feedback so far in this thread. Here is an updated version of the patch. Setting this to "needs review”. Thanks for the review Thanks! Looking at the code, and documentation this seems a worthwhile addition. Manual testing proves that it works as expected, and the patch follows the expected style. A few small comments: Since DEBUG is not a defined loglevel, it seems superfluous to include it here. It’s also omitted from the documentation so it should probably be omitted from here. + {"debug", DEBUG2, true}, I treated this like we do for client_min_messages and log_min_messages - the alias is there but I don;t think we document it either. I don't mind removing it, was just trying to be consistent. It seems odd that we would accept it in one place but not another. The tag should be closed with a matching . + auto_explain.log_level configuration parameter With those fixed it’s ready for committer. Thanks, will fix. I haven't added tests for auto_explain - I think that would be useful but it's a separate project. Agreeing that this would be beneficial, the attached patch (to apply on top of the patch in question) takes a stab at adding tests for this new functionality. In order to test plan output we need to support COSTS in the explain output, so a new GUC auto_explain.log_costs is added. We also need to not print the duration, so as a hack this patch omits the duration if auto_explain.log_timing is set to off and auto_explain.log_analyze is set off. This is a hack and not a nice overloading, but it seems overkill to add a separate GUC just to turn off the duration, any better ideas on how support omitting the duration? Great, I'll check it out. cheers andrew -- Andrew Dunstanhttps://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: PG 10: could not generate random cancel key
On Tue, Jul 17, 2018 at 8:33 AM, Dean Rasheed wrote: > if (RAND_status() == 0) > RAND_poll(); Looks like a recipe for an infinite loop. At least, I think we ought to have a CHECK_FOR_INTERRUPTS() in that loop. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: chained transactions
On 01/03/18 05:35, Peter Eisentraut wrote: The SQL standard offers the "chained transactions" feature to address this. The new command variants COMMIT AND CHAIN and ROLLBACK AND CHAIN immediately start a new transaction with the characteristics (isolation level, read/write, deferrable) of the previous one. So code that has particular requirements regarding transaction isolation and such can use this to simplify code management. Oh, is that all it does? That's disappointing, because that's a lot less powerful than how I understand chained transactions. And at the same time relieving, because that's a lot simpler to implement :-). In Gray & Reuter's classic book, Transaction Processing, they describe chained transactions such that you also keep locks and cursors. Unfortunately I don't have a copy at hand, but that's my recollection, at least. I guess the SQL standard committee had a different idea. - Heikki
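To make the proposed syntax concrete, a session using the feature as described in the patch would look roughly like this (the table names are only illustrative):

  BEGIN ISOLATION LEVEL REPEATABLE READ READ ONLY;
  SELECT count(*) FROM orders;      -- first transaction
  COMMIT AND CHAIN;                 -- commit, then immediately start a new transaction
                                    -- that keeps REPEATABLE READ and READ ONLY
  SELECT count(*) FROM order_lines; -- runs with the same characteristics
  COMMIT;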
Re: PG 10: could not generate random cancel key
On 2018-Jul-17, Robert Haas wrote: > On Tue, Jul 17, 2018 at 8:33 AM, Dean Rasheed > wrote: > > if (RAND_status() == 0) > > RAND_poll(); > > Looks like a recipe for an infinite loop. At least, I think we ought > to have a CHECK_FOR_INTERRUPTS() in that loop. What loop? -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Vacuum: allow usage of more than 1GB of work mem
On Fri, Apr 6, 2018 at 4:25 PM, Claudio Freire wrote: >> True all that. My point is that the multi-segmented array isn't all that >> simple and proven, compared to an also straightforward B-tree. It's pretty >> similar to a B-tree, actually, except that it has exactly two levels, and >> the node (= segment) sizes grow exponentially. I'd rather go with a true >> B-tree, than something homegrown that resembles a B-tree, but not quite. > > I disagree. Yeah, me too. I think a segmented array is a lot simpler than a home-grown btree. I wrote a home-grown btree that ended up becoming src/backend/utils/mmgr/freepage.c and it took me a long time to get rid of all the bugs. Heikki is almost certainly better at coding up a bug-free btree than I am, but a segmented array is a dead simple data structure, or should be if done properly, and a btree is not. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: PG 10: could not generate random cancel key
On Tue, Jul 17, 2018 at 1:27 PM, Alvaro Herrera wrote: > On 2018-Jul-17, Robert Haas wrote: > >> On Tue, Jul 17, 2018 at 8:33 AM, Dean Rasheed >> wrote: >> > if (RAND_status() == 0) >> > RAND_poll(); >> >> Looks like a recipe for an infinite loop. At least, I think we ought >> to have a CHECK_FOR_INTERRUPTS() in that loop. > > What loop? Ugh, I'm not doing very well today, am I? I read that as while() but it says if(). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Internal error XX000 with enable_partition_pruning=on, pg 11 beta1 on Debian
On 2018-Jul-16, David Rowley wrote: > On 16 July 2018 at 12:55, David Rowley wrote: > > Thinking about this some more, I don't quite see any reason that the > > partitioned_rels for a single hierarchy couldn't just be a Bitmapset > > instead of an IntList. > > Of course, this is not possible since we can't pass a List of > Bitmapsets to the executor due to Bitmapset not being a node type. Maybe we can just add a new node type that wraps a lone bitmapset. The naive implementation (just print out individual members) should be pretty straightforward; a more sophisticated one (print out the "words") may end up more compact or not depending on density, but much harder for debugging, and probably means a catversion bump when BITS_PER_BITMAPWORD is changed, so probably not a great idea anyway. I suppose the only reason we haven't done this yet is nobody has needed it. Sounds like its time has come. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Allow auto_explain to log to NOTICE
> On 17 Jul 2018, at 19:11, Andrew Dunstan > wrote: > > On 07/17/2018 12:04 PM, Daniel Gustafsson wrote: >> Since DEBUG is not a defined loglevel, it seems superfluous to include it >> here. >> It’s also omitted from the documentation so it should probably be omitted >> from >> here. >> >> + {"debug", DEBUG2, true}, > > I treated this like we do for client_min_messages and log_min_messages - the > alias is there but I don;t think we document it either. > > I don't mind removing it, was just trying to be consistent. It seems odd that > we would accept it in one place but not another. Ooh.. I didn’t know that alias existed and didn’t find it when poking at the code. In that case I agree with you, the alias should stay so I withdraw that comment. I learned something new today =) >>> I haven't added tests for auto_explain - I think that would be useful >>> but it's a separate project. >> Agreeing that this would be beneficial, the attached patch (to apply on top >> of >> the patch in question) takes a stab at adding tests for this new >> functionality. >> >> In order to test plan output we need to support COSTS in the explain output, >> so >> a new GUC auto_explain.log_costs is added. We also need to not print the >> duration, so as a hack this patch omits the duration if >> auto_explain.log_timing >> is set to off and auto_explain.log_analyze is set off. This is a hack and >> not >> a nice overloading, but it seems overkill to add a separate GUC just to turn >> off the duration, any better ideas on how support omitting the duration? > > Great, I'll check it out. I’m not sure it’s worth adding this much to the code just to be able to test it, but it seemed like a good excercise to write to have something to reason about. cheers ./daniel
Re: [HACKERS] logical decoding of two-phase transactions
On Mon, Jul 16, 2018 at 3:25 PM, Tomas Vondra wrote: > Oh, right, I forgot the patch also adds the leader into the group, for > some reason (I agree it's unclear why that would be necessary, as you > pointed out later). > > But all this is happening while holding the partition lock (in exclusive > mode). And the decoding backends do synchronize on the lock correctly > (although, man, the rechecks are confusing ...). > > But now I see ProcKill accesses decodeGroupLeader in multiple places, > and only the first one is protected by the lock, for some reason > (interestingly enough the one in lockGroupLeader block). Is that what > you mean? I haven't traced out the control flow completely, but it sure looks to me like there are places where decodeGroupLeader is checked without holding any LWLock at all. Also, it looks to me like some places (like where we're trying to find a PGPROC by XID) we use ProcArrayLock and in others -- I guess where we're checking the decodeGroupBlah stuff -- we are using the lock manager locks. I don't know how safe that is, and there are not a lot of comments justifying it. I also wonder why we're using the lock manager locks to protect the decodeGroup stuff rather than backendLock. > FWIW I suspect the ProcKill part is borked due to incorrectly resolved > merge conflict or something, per my initial response from today. Yeah I wasn't seeing the code the way I thought you were describing it in that response, but I'm dumb this week so maybe I just misunderstood. >> I think that's probably not going to work out, but of course it's up >> to you how you want to spend your time! > > Well, yeah. I'm sure I could think of more fun things to do, but OTOH I > also have patches that require the capability to decode in-progress > transactions. It's not a matter of fun; it's a matter of whether it can be made to work. Don't get me wrong -- I want the ability to decode in-progress transactions. I complained about that aspect of the design to Andres when I was reviewing and committing logical slots & logical decoding, and I complained about it probably more than I complained about any other aspect of it, largely because it instantaneously generates a large lag when a bulk load commits. But not liking something about the way things are is not the same as knowing how to make them better. I believe there is a way to make it work because I believe there's a way to make anything work. But I suspect that it's at least one order of magnitude more complex than this patch currently is, and likely an altogether different design. > But the way I understand it, it pretty much *is* a list of waiters, > along with a couple of flags to allow the processes to notify the other > side about lock/unlock/abort. It does resemble the lock groups, but I > don't think it has the same goals. So the parts that aren't relevant shouldn't be copied over. >> That having been said, I still don't see how that's really going to >> work. Just to take one example, suppose that the leader is trying to >> ERROR out, and the decoding workers are blocked waiting for a lock >> held by the leader. The system has no way of detecting this deadlock >> and resolving it automatically, which certainly seems unacceptable. >> The only way that's going to work is if the leader waits for the >> worker by trying to acquire a lock held by the worker. Then the >> deadlock detector would know to abort some transaction. 
But that >> doesn't really work either - the deadlock was created by the >> foreground process trying to abort, and if the deadlock detector >> chooses that process as its victim, what then? We're already trying >> to abort, and the abort code isn't supposed to throw further errors, >> or fail in any way, lest we break all kinds of other things. Not to >> mention the fact that running the deadlock detector in the abort path >> isn't really safe to begin with, again because we can't throw errors >> when we're already in an abort path. > > Fair point, not sure. I'll leave this up to Nikhil. That's fine, but please understand that I think there's a basic design flaw here that just can't be overcome with any amount of hacking on the details here. I think we need a much higher-level consideration of the problem here and probably a lot of new infrastructure to support it. One idea might be to initially support decoding of in-progress transactions only if they don't modify any catalog state. That would leave out a bunch of cases we'd probably like to support, such as CREATE TABLE + COPY in the same transaction, but it would likely dodge a lot of really hard problems, too, and we could improve things later. One approach to the problem of catalog changes would be to prevent catalog tuples from being removed even after transaction abort until such time as there's no decoding in progress that might care about them. That is not by itself sufficient because a transaction can abort after inserting a heap tuple but before ins
Re: GiST VACUUM
Hi! > On 16 July 2018, at 21:24, Andrey Borodin wrote: > I was checking WAL replay of the new scheme to log page deletes and found a bug there (incorrect value of the deleted downlink in the WAL record). Here's the fixed patch v10. I've also added support for WAL identification of the new record, and made some improvements to comments and naming in the data structures. Thanks! Best regards, Andrey Borodin. 0002-Physical-GiST-scan-during-VACUUM-v10.patch Description: Binary data 0001-Delete-pages-during-GiST-VACUUM-v10.patch Description: Binary data
Re: Another usability issue with our TAP tests
On 16.07.18 19:13, Tom Lane wrote: > But a TAP test failure leaves nothing behind that git will consider > unusual. I've repeatedly had to run check-world with no parallelism > (wasting many minutes) in order to locate which test actually failed. How about something like this: diff --git a/src/Makefile.global.in b/src/Makefile.global.in index 95d090e72d..12114d8427 100644 --- a/src/Makefile.global.in +++ b/src/Makefile.global.in @@ -396,7 +396,7 @@ endif PROVE = @PROVE@ # There are common routines in src/test/perl, and some test suites have # extra perl modules in their own directory. -PG_PROVE_FLAGS = -I $(top_srcdir)/src/test/perl/ -I $(srcdir) +PG_PROVE_FLAGS = -I $(top_srcdir)/src/test/perl/ -I $(srcdir) --state=save # User-supplied prove flags such as --verbose can be provided in PROVE_FLAGS. PROVE_FLAGS = @@ -420,12 +420,14 @@ define prove_installcheck rm -rf '$(CURDIR)'/tmp_check $(MKDIR_P) '$(CURDIR)'/tmp_check cd $(srcdir) && TESTDIR='$(CURDIR)' PATH="$(bindir):$$PATH" PGPORT='6$(DEF_PGPORT)' top_builddir='$(CURDIR)/$(top_builddir)' PG_REGRESS='$(CURDIR)/$(top_builddir)/src/test/regress/pg_regress' $(PROVE) $(PG_PROVE_FLAGS) $(PROVE_FLAGS) $(if $(PROVE_TESTS),$(PROVE_TESTS),t/*.pl) +@rm -f .prove endef define prove_check rm -rf '$(CURDIR)'/tmp_check $(MKDIR_P) '$(CURDIR)'/tmp_check cd $(srcdir) && TESTDIR='$(CURDIR)' $(with_temp_install) PGPORT='6$(DEF_PGPORT)' PG_REGRESS='$(CURDIR)/$(top_builddir)/src/test/regress/pg_regress' $(PROVE) $(PG_PROVE_FLAGS) $(PROVE_FLAGS) $(if $(PROVE_TESTS),$(PROVE_TESTS),t/*.pl) +@rm -f .prove endef else -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Allowing multiple DDL commands to run simultaneously
On Mon, Jul 9, 2018 at 6:00 AM, Simon Riggs wrote: > Proposal would be to add a new lock mode "ShareUpdate", which does not > conflict with itself and yet conflicts with "ShareUpdateExclusive" or > higher. (Hence, it is a strong lock type). DDL would take a > ShareUpdateLock on the table, then during critical portions of > commands it would take a ShareUpdateExclusiveLock and then release it > again before commit. I think this would be quite prone to deadlocks. Suppose someone tries to grab an AccessExclusiveLock on the table during a window in which we hold only ShareUpdateLock. The next attempt to upgrade to ShareUpdateExclusiveLock will cause a simple deadlock. In general, any approach that involves upgrading our lock strength is likely to have this problem. You might be able to work around this by inventing a whole new lock type, say "Relation Maintenance". Make a rule that you can only take the "Relation Maintenance" lock while holding a Relation lock with strength >= ShareUpdateLock and that you do not need to bother acquiring it if you hold a self-exclusive lock that conflicts with ShareUpdateLock. I think that works out to about the same thing as what you're proposing, except without the deadlock hazard. In general, though, +1 for trying to do something about this. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: untrusted PLs should be GRANTable
On 17.07.18 07:20, Craig Ringer wrote: > A user has raised the point that our refusal to GRANT rights to > untrusted PLs is counterproductive and inconsistent with how we behave > elsewhere. Previous discussion: https://www.postgresql.org/message-id/flat/1357905627.24219.6.camel%40vanquo.pezone.net What caused that to die was "What might actually be a problem in this area is that, AFAICT, pg_dump does not save privileges granted to objects in extensions." But I think that is fixed/fixable now with pg_init_privs. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
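For context, the behavior being questioned is the current refusal to grant rights on an untrusted language; with an illustrative role name, something like:

  GRANT USAGE ON LANGUAGE plperlu TO maintenance_role;
  -- currently rejected with an error along the lines of:
  -- ERROR:  language "plperlu" is not trusted
  -- the proposal is to allow such GRANTs, as for other object types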
Re: Another usability issue with our TAP tests
Peter Eisentraut writes: > How about something like this: > -PG_PROVE_FLAGS = -I $(top_srcdir)/src/test/perl/ -I $(srcdir) > +PG_PROVE_FLAGS = -I $(top_srcdir)/src/test/perl/ -I $(srcdir) --state=save Cute idea, but it seems not to work with older versions of prove: $ which prove /usr/local/perl5.8.3/bin/prove $ prove --state=save Unknown option: s We could just do a "touch .prove_running" beforehand and an "rm" after, perhaps (I think Michael suggested something like that already). regards, tom lane
Re: patch to allow disable of WAL recycling
On 17.07.18 00:04, Jerry Jelinek wrote: > There have been quite a few comments since last week, so at this point I > am uncertain how to proceed with this change. I don't think I saw > anything concrete in the recent emails that I can act upon. The outcome of this could be multiple orthogonal patches that affect the WAL file allocation behavior somehow. I think your original idea of skipping recycling on a COW file system is sound. But I would rather frame the option as "preallocating files is obviously useless on a COW file system" rather than "this will make things mysteriously faster with uncertain trade-offs". The actual implementation could use another round of consideration. I wonder how this should interact with min_wal_size. Wouldn't min_wal_size = 0 already do what we need (if you could set it to 0, which is currently not possible)? Should the new setting be something like min_wal_size = -1? Or even if it's a new setting, it might be better to act on it in XLOGfileslop(), so these things are kept closer together. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: "Write amplification" is made worse by "getting tired" while inserting into nbtree secondary indexes (Was: Why B-Tree suffix truncation matters)
On Sun, Jul 8, 2018 at 7:59 PM, Peter Geoghegan wrote: > The whole "getting tired" thing is the root of the problem here, which > is why the pending v3 of my patch will remove that code completely > (_bt_findinsertloc() is streamlined). Peter, This seems like really interesting and important work. I wouldn't have foreseen that the "getting tired" code would have led to this kind of bloat (even if I had known about it at all). I wonder, though, whether it's possible that the reverse could happen in some other scenario. It seems to me that with the existing code, if you reinsert a value many copies of which have been deleted, you'll probably find partially-empty pages whose free space can be reused, but if there's one specific place where each tuple needs to go, you might end up having to split pages if the new TIDs are all larger or smaller than the old TIDs. I'm really glad that you are working on this. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: "Write amplification" is made worse by "getting tired" while inserting into nbtree secondary indexes (Was: Why B-Tree suffix truncation matters)
Robert Haas writes: > On Sun, Jul 8, 2018 at 7:59 PM, Peter Geoghegan wrote: >> The whole "getting tired" thing is the root of the problem here, which >> is why the pending v3 of my patch will remove that code completely >> (_bt_findinsertloc() is streamlined). > This seems like really interesting and important work. I wouldn't > have foreseen that the "getting tired" code would have led to this > kind of bloat (even if I had known about it at all). I wonder, > though, whether it's possible that the reverse could happen in some > other scenario. It seems to me that with the existing code, if you > reinsert a value many copies of which have been deleted, you'll > probably find partially-empty pages whose free space can be reused, > but if there's one specific place where each tuple needs to go, you > might end up having to split pages if the new TIDs are all larger or > smaller than the old TIDs. Yeah ... if memory serves, there were specific usage patterns where that hack made things way better than they'd been before. (I do not recall if the hack itself was mine, but I think I can be blamed for the "getting tired" comment ...) I'd suggest git blaming your way to the commit that put that in, and then checking the hackers archives around that date for more info. regards, tom lane
Re: patch to allow disable of WAL recycling
On 07/17/2018 09:12 PM, Peter Eisentraut wrote: > On 17.07.18 00:04, Jerry Jelinek wrote: >> There have been quite a few comments since last week, so at this point I >> am uncertain how to proceed with this change. I don't think I saw >> anything concrete in the recent emails that I can act upon. > > The outcome of this could be multiple orthogonal patches that affect the > WAL file allocation behavior somehow. I think your original idea of > skipping recycling on a COW file system is sound. But I would rather > frame the option as "preallocating files is obviously useless on a COW > file system" rather than "this will make things mysteriously faster with > uncertain trade-offs". > Makes sense, I guess. But I think many claims made in this thread are mostly just assumptions at this point, based on our beliefs about how CoW or non-CoW filesystems work. The results from ZFS (showing positive impact) are an exception, but that's about it. I'm sure those claims are based on real-world experience and are likely true, but it'd be good to have data from a wider range of filesystems / configurations etc. so that we can give better recommendations to users, for example. That's something I can help with, assuming we agree on what tests we want to do. I'd say the usual battery of write-only pgbench tests with different scales (fits into s_b, fits into RAM, larger than RAM) on common Linux filesystems (ext4, xfs, btrfs) and zfsonlinux, and different types of storage would be enough. I don't have any FreeBSD box available, unfortunately. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: "Write amplification" is made worse by "getting tired" while inserting into nbtree secondary indexes (Was: Why B-Tree suffix truncation matters)
On Tue, Jul 17, 2018 at 1:12 PM, Robert Haas wrote: > This seems like really interesting and important work. I wouldn't > have foreseen that the "getting tired" code would have led to this > kind of bloat (even if I had known about it at all). Thanks! I'm glad that I can come up with concrete, motivating examples like this, because it's really hard to talk about the big picture here. With something like a pgbench workload, there are undoubtedly many different factors in play, since temporal locality influences many different things all at once. I don't think that I understand it all just yet. Taking a holistic view of the problem seems very helpful, but it's also very frustrating at times. > I wonder, > though, whether it's possible that the reverse could happen in some > other scenario. It seems to me that with the existing code, if you > reinsert a value many copies of which have been deleted, you'll > probably find partially-empty pages whose free space can be reused, > but if there's one specific place where each tuple needs to go, you > might end up having to split pages if the new TIDs are all larger or > smaller than the old TIDs. That's a legitimate concern. After all, what I've done boils down to adding a restriction on space utilization that wasn't there before. This clearly helps because it makes it practical to rip out the "getting tired" thing, but that's not everything. There are good reasons for that hack, but if latency magically didn't matter then we could just tear the hack out without doing anything else. That would make groveling through pages full of duplicates at least as discerning about space utilization as my patch manages to be, without any of the complexity. There is actually a flipside to that downside, though (i.e. the downside is also an upside): While not filling up leaf pages that have free space on them is bad, it's only bad when it doesn't leave the pages completely empty. Leaving the pages completely empty is actually good, because then VACUUM is in a position to delete entire pages, removing their downlinks from parent pages. That's a variety of bloat that we can reverse completely. I suspect that you'll see far more of that favorable case in the real world with my patch. It's pretty much impossible to do page deletions with pages full of duplicates today, because the roughly-uniform distribution of still-live tuples among leaf pages fails to exhibit any temporal locality. So, maybe my patch would still come out ahead of simply ripping out "getting tired" in this parallel universe where latency doesn't matter, and space utilization is everything. I made one small mistake with my test case: It actually *is* perfectly efficient at recycling space even at the end, since I don't delete all the duplicates (just 90% of them). Getting tired might have been a contributing factor there, too. -- Peter Geoghegan
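As a rough sketch of the kind of workload being discussed (not the exact test case from the thread, and assuming the pgstattuple extension is available), duplicate-heavy index bloat can be observed like this:

  CREATE EXTENSION IF NOT EXISTS pgstattuple;
  CREATE TABLE dups (k integer);
  CREATE INDEX dups_k_idx ON dups (k);
  INSERT INTO dups SELECT g % 100 FROM generate_series(1, 1000000) g;  -- heavy duplication
  DELETE FROM dups WHERE random() < 0.9;                               -- remove ~90% of the rows
  VACUUM dups;
  INSERT INTO dups SELECT g % 100 FROM generate_series(1, 900000) g;   -- reinsert duplicates
  SELECT index_size, leaf_pages, avg_leaf_density FROM pgstatindex('dups_k_idx');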
Re: "Write amplification" is made worse by "getting tired" while inserting into nbtree secondary indexes (Was: Why B-Tree suffix truncation matters)
On Tue, Jul 17, 2018 at 1:29 PM, Tom Lane wrote: > Yeah ... if memory serves, there were specific usage patterns where > that hack made things way better than they'd been before. (I do not > recall if the hack itself was mine, but I think I can be blamed for > the "getting tired" comment ...) I'd suggest git blaming your way > to the commit that put that in, and then checking the hackers archives > around that date for more info. I've done plenty of research into the history of this hack. It was your work, but it does actually make sense in the context of today's nbtree code. It is essential with scankey-wise duplicates, since groveling through hundreds or even thousands of pages full of duplicates to find free space (and avoid a page split) is going to have a very serious downside for latency. Vadim wanted to fix the problem by more or less doing what I propose [1], though he never got into figuring out how to make that practical (i.e. how to make it not bloat up internal pages, which must represent heap TID as just another part of the key space). Having unique keys isn't just assumed by Lehman & Yao; I think that it's assumed by most or all real-world B-Tree implementations. [1] https://www.postgresql.org/message-id/18788.963953...@sss.pgh.pa.us -- Peter Geoghegan
Re: Allowing multiple DDL commands to run simultaneously
On 17 July 2018 at 19:47, Robert Haas wrote: > On Mon, Jul 9, 2018 at 6:00 AM, Simon Riggs wrote: >> Proposal would be to add a new lock mode "ShareUpdate", which does not >> conflict with itself and yet conflicts with "ShareUpdateExclusive" or >> higher. (Hence, it is a strong lock type). DDL would take a >> ShareUpdateLock on the table, then during critical portions of >> commands it would take a ShareUpdateExclusiveLock and then release it >> again before commit. > > I think this would be quite prone to deadlocks. Suppose someone tries > to grab an AccessExclusiveLock on the table during a window in which > we hold only ShareUpdateLock. The next attempt to upgrade to > ShareUpdateExclusiveLock will cause a simple deadlock. In general, > any approach that involves upgrading our lock strength is likely to > have this problem. > > You might be able to work around this by inventing a whole new lock > type, say "Relation Maintenance". Make a rule that you can only take > the "Relation Maintenance" lock while holding a Relation lock with > strength >= ShareUpdateLock and that you do not need to bother > acquiring it if you hold a self-exclusive lock that conflicts with > ShareUpdateLock. I think that works out to about the same thing as > what you're proposing, except without the deadlock hazard. Yes, it seems better to invent a new orthogonal lock type than have a new lock level. Thanks. Seems more like a critical section than a lock. I'd make code take that lock, even if they have a self-exclusive lock, just to avoid later problems when the lock level changes. -- Simon Riggshttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: "Write amplification" is made worse by "getting tired" while inserting into nbtree secondary indexes (Was: Why B-Tree suffix truncation matters)
Peter Geoghegan writes: > I've done plenty of research into the history of this hack. It was > your work, but it does actually make sense in the context of today's > nbtree code. It is essential with scankey-wise duplicates, since > groveling through hundreds or even thousands of pages full of > duplicates to find free space (and avoid a page split) is going to > have a very serious downside for latency. Well, the actual problem was O(N^2) behavior: https://www.postgresql.org/message-id/2378.967216388%40sss.pgh.pa.us https://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=40549e9cb5abd2986603883e4ab567dab34723c6 I certainly have no objection to improving matters, but let's be sure we don't re-introduce any two-decade-old problems. regards, tom lane
Re: [HACKERS] PATCH: multivariate histograms and MCV lists
On 16.07.2018 23:55, Tomas Vondra wrote: On 07/16/2018 02:54 PM, Dean Rasheed wrote: On 16 July 2018 at 13:23, Tomas Vondra wrote: The top-level clauses allow us to make such deductions, with deeper clauses it's much more difficult (perhaps impossible). Because for example with (a=1 AND b=1) there can be just a single match, so if we find it in MCV we're done. With clauses like ((a=1 OR a=2) AND (b=1 OR b=2)) it's not that simple, because there may be multiple combinations and so a match in MCV does not guarantee anything. Actually, it guarantees a lower bound on the overall selectivity, and maybe that's the best that we can do in the absence of any other stats. Hmmm, is that actually true? Let's consider a simple example, with two columns, each with just 2 values, and a "perfect" MCV list: a | b | frequency --- 1 | 1 | 0.5 2 | 2 | 0.5 And let's estimate sel(a=1 & b=2). OK.In this case, there are no MCV matches, so there is no lower bound (it's 0). What we could do though is also impose an upper bound, based on the sum of non-matching MCV frequencies. In this case, the upper bound is also 0, so we could actually say the resulting selectivity is 0. Hmmm, it's not very clear to me how would we decide which of these cases applies, because in most cases we don't have MCV covering 100% rows. Anyways, I've been thinking about how to modify the code to wort the way you proposed (in a way sufficient for a PoC). But after struggling with it for a while it occurred to me it might be useful to do it on paper first, to verify how would it work on your examples. So let's use this data create table foo(a int, b int); insert into foo select 1,1 from generate_series(1,5); insert into foo select 1,2 from generate_series(1,4); insert into foo select 1,x/10 from generate_series(30,25) g(x); insert into foo select 2,1 from generate_series(1,3); insert into foo select 2,2 from generate_series(1,2); insert into foo select 2,x/10 from generate_series(30,50) g(x); insert into foo select 3,1 from generate_series(1,1); insert into foo select 3,2 from generate_series(1,5000); insert into foo select 3,x from generate_series(3,60) g(x); insert into foo select x,x/10 from generate_series(4,75) g(x); Assuming we have perfectly exact statistics, we have these MCV lists (both univariate and multivariate): select a, count(*), round(count(*) /2254937.0, 4) AS frequency from foo group by a order by 2 desc; a| count | frequency ++--- 3 | 614998 |0.2727 2 | 549971 |0.2439 1 | 339971 |0.1508 1014 | 1 |0. 57220 | 1 |0. ... select b, count(*), round(count(*) /2254937.0, 4) AS frequency from foo group by b order by 2 desc; b| count | frequency +---+--- 1 | 90010 |0.0399 2 | 65010 |0.0288 3 |31 |0. 7 |31 |0. ... select a, b, count(*), round(count(*) /2254937.0, 4) AS frequency from foo group by a, b order by 3 desc; a| b| count | frequency ++---+--- 1 | 1 | 5 |0.0222 1 | 2 | 4 |0.0177 2 | 1 | 3 |0.0133 2 | 2 | 2 |0.0089 3 | 1 | 1 |0.0044 3 | 2 | 5000 |0.0022 2 | 12445 |10 |0. ... And let's estimate the two queries with complex clauses, where the multivariate stats gave 2x overestimates. SELECT * FROM foo WHERE a=1 and (b=1 or b=2); -- actual 9, univariate: 24253, multivariate: 181091 univariate: sel(a=1) = 0.1508 sel(b=1) = 0.0399 sel(b=2) = 0.0288 sel(b=1 or b=2) = 0.0673 multivariate: sel(a=1 and (b=1 or b=2)) = 0.0399 (0.0770) The second multivariate estimate comes from assuming only the first 5 items make it to the multivariate MCV list (covering 6.87% of the data) and extrapolating the selectivity to the non-MCV data too. 
(Notice it's about 2x the actual selectivity, so this extrapolation due to not realizing the MCV already contains all the matches is pretty much responsible for the whole over-estimate). So, how would the proposed algorithm work? Let's start with "a=1": sel(a=1) = 0.1508 I don't see much point in applying the two "b" clauses independently (or how would it be done, as it's effectively a single clause): sel(b=1 or b=2) = 0.0673 And we get total_sel = sel(a=1) * sel(b=1 or b=2) = 0.0101 From the multivariate MCV we get mcv_sel = 0.0399 And finally total_sel = Max(total_sel, mcv_sel) = 0.0399 Which is great, but I don't see how that actually helped anything? We still only have the estimate for the ~7% covered by the MCV list, and not the remaining non-MCV part. I could do the same thing for the second
Re: [HACKERS] Re: [COMMITTERS] pgsql: Remove pgbench "progress" test pending solution of its timing is (fwd)
Hello Heikki, [...] Let's keep it that way. I think the only change we need to make in the logic is to check at the end, if *any* progress reports at all have been printed, and print one if not. Ok, this simplifies the condition. And do that only when the -P option is smaller than the -T option, I suppose. Yep, why not. Oh. I'm a bit surprised we don't support decimals, i.e. -P 0.5. Actually, it seems to be acceptd, but it's truncated down to the nearest integer. Indeed, "atoi" does not detect errors, and it is true of the many uses in pgbench: clients, threads, scale, duration, fillfactor... That's not very nice :-(. But it's a separate issue. Yep. For human consumption, seconds seem okay. The more reasonable alternative could be to always last 2 seconds under -T 2, even if the execution can be shorten because there is nothing to do at all, i.e. remove the environment-based condition but keep the sleep. That sounds reasonable. It's a bit silly to wait when there's nothing to do, but it's also weird if the test exits before the specified time is up. Seems less surprising to always sleep. I did that in the attached version: no more environment variable hack, and no execution shortcut even if there is nothing to do. I also had to reproduce the progress logic to keep on printing report of (no) progress in this tailing phase. I'm not very happy because it is a change of behavior. I suggest that I could add a special "--strict-time-compliance" option to do this only when required... and it would only be required by tap tests. -- Fabien.diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c index 41b756c089..02aff58a48 100644 --- a/src/bin/pgbench/pgbench.c +++ b/src/bin/pgbench/pgbench.c @@ -5597,6 +5597,96 @@ main(int argc, char **argv) return 0; } +/* display the progress report */ +static void +doProgressReport(TState *thread, StatsData *plast, int64 *plast_report, +int64 now, int64 thread_start) +{ + StatsData cur; + int64 run = now - *plast_report, + ntx; + double tps, + total_run, + latency, + sqlat, + lag, + stdev; + chartbuf[64]; + int i; + + /* +* Add up the statistics of all threads. +* +* XXX: No locking. There is no guarantee that we get an +* atomic snapshot of the transaction count and latencies, so +* these figures can well be off by a small amount. The +* progress report's purpose is to give a quick overview of +* how the test is going, so that shouldn't matter too much. +* (If a read from a 64-bit integer is not atomic, you might +* get a "torn" read and completely bogus latencies though!) +*/ + initStats(&cur, 0); + for (i = 0; i < nthreads; i++) + { + mergeSimpleStats(&cur.latency, &thread[i].stats.latency); + mergeSimpleStats(&cur.lag, &thread[i].stats.lag); + cur.cnt += thread[i].stats.cnt; + cur.skipped += thread[i].stats.skipped; + } + + /* we count only actually executed transactions */ + ntx = (cur.cnt - cur.skipped) - (plast->cnt - plast->skipped); + total_run = (now - thread_start) / 100.0; + tps = 100.0 * ntx / run; + if (ntx > 0) + { + latency = 0.001 * (cur.latency.sum - plast->latency.sum) / ntx; + sqlat = 1.0 * (cur.latency.sum2 - plast->latency.sum2) / ntx; + stdev = 0.001 * sqrt(sqlat - 100.0 * latency * latency); + lag = 0.001 * (cur.lag.sum - plast->lag.sum) / ntx; + } + else + { + latency = sqlat = stdev = lag = 0; + } + + if (progress_timestamp) + { + /* +* On some platforms the current system timestamp is +* available in now_time, but rather than get entangled +* with that, we just eat the cost of an extra syscall in +* all cases. 
+*/ + struct timeval tv; + + gettimeofday(&tv, NULL); + snprintf(tbuf, sizeof(tbuf), "%ld.%03ld s", +(long) tv.tv_sec, (long) (tv.tv_usec / 1000)); + } + else + { + /* round seconds are expected, but the thread may be late */ + snprintf(tbuf, sizeof(tbuf), "%.1f s", total_run); + } + + fprintf(stderr, + "progress: %s, %.1f tps, lat %.3f ms stddev %.3f", + tbuf, tps, latency, stdev); + + if (throttle_delay) + { + fprintf(stderr, ", lag %.3f ms", lag); + if (latency_limit) +
Re: Fix some error handling for read() and errno
On Mon, Jul 16, 2018 at 04:04:12PM -0400, Alvaro Herrera wrote: > No objection here -- incremental progress is better than none. Thanks. I have pushed 0001 now. I have found some more work which could be done, for which I'll spawn a new thread. -- Michael signature.asc Description: PGP signature
Re: "Write amplification" is made worse by "getting tired" while inserting into nbtree secondary indexes (Was: Why B-Tree suffix truncation matters)
On Tue, Jul 17, 2018 at 3:10 PM, Tom Lane wrote: > Well, the actual problem was O(N^2) behavior: > > https://www.postgresql.org/message-id/2378.967216388%40sss.pgh.pa.us > > https://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=40549e9cb5abd2986603883e4ab567dab34723c6 Oh, yeah. We still put the new tuple to the left of all existing duplicates on the leaf that we decide to insert on, because "we'll insert it before any existing equal keys because of the way _bt_binsrch() works". I actually plan on mostly fixing that aspect of _bt_binsrch() while I'm at it, which is why I didn't think to frame it that way. It is claimed that we need to do this _bt_binsrch() go-left-on-equal thing for internal pages because we allow duplicates, contrary to L&Y. I find it much easier to think of it as being necessary due to index scans where only some columns are specified in the initial insertion scan key that finds the initial leaf page (i.e. the _bt_first() stuff). I'm not going to touch the _bt_binsrch() go-left-on-equal thing, because it's actually necessary for any real world implementation that has multiple columns in tuples. Fortunately, we can usually avoid the go-left-on-equal thing for internal pages by avoiding equality -- a sentinel heap scan TID value may be used for this in many cases. The Lehman and Yao paper is badly written. They should have talked about the _bt_binsrch() go-left-on-equal thing, rather than leaving it to the reader to figure it out. It's not why L&Y require unique keys, though. Sorry if all this detail isn't useful right now. The important point is that I actually think that this code gets most things right already. > I certainly have no objection to improving matters, but let's be sure > we don't re-introduce any two-decade-old problems. A mountain of work remains to validate the performance of the patch. -- Peter Geoghegan
Re: [HACKERS] PATCH: multivariate histograms and MCV lists
On 07/18/2018 12:41 AM, Konstantin Knizhnik wrote: > ... > > Teodor Sigaev has proposed an alternative approach for calculating > selectivity of multicolumn join or compound index search. > Usually DBA creates compound indexes which can be used by optimizer to > build efficient query execution plan based on index search. > We can stores statistic for compound keys of such indexes and (as it is > done now for functional indexes) and use it to estimate selectivity > of clauses. I have implemented this idea and will publish patch in > separate thread soon. > Now I just want to share some results for the Tomas examples. > > So for Vanilla Postges without extra statistic estimated number of rows > is about 4 times smaller than real. > Can you please post plans with parallelism disabled, and perhaps without the aggregate? Both makes reading the plans unnecessarily difficult ... > postgres=# explain analyze SELECT count(*) FROM foo WHERE a=1 and (b=1 > or b=2); > QUERY PLAN > > > > Aggregate (cost=10964.76..10964.77 rows=1 width=8) (actual > time=49.251..49.251 rows=1 loops=1) > -> Bitmap Heap Scan on foo (cost=513.60..10906.48 rows=23310 > width=0) (actual time=17.368..39.928 rows=9 loops=1) ok, 23k vs. 90k > > If we add statistic for a and b columns: > > create statistics ab on a,b from foo; > analyze foo; > > then expected results is about 30% larger then real: 120k vs 90k: > Eh? The plan however says 9k vs. 30k ... > -> Parallel Bitmap Heap Scan on foo > (cost=2561.33..13424.24 rows=9063 width=0) (actual time=15.551..26.057 > rows=3 loops=3) ... > With compound index statistic estimation is almost equal to real value: > > -> Bitmap Heap Scan on foo (cost=1880.20..13411.66 rows=23310 > width=0) (actual time=38.776..61.050 rows=9 loops=1) Well, this says 23k vs. 90k. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
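For anyone reproducing these numbers, plans without parallelism (and without the aggregate on top) can be obtained with something like:

  SET max_parallel_workers_per_gather = 0;   -- disable parallel plans in this session
  EXPLAIN ANALYZE SELECT * FROM foo WHERE a = 1 AND (b = 1 OR b = 2);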
Re: ENOSPC FailedAssertion("!(RefCountErrors == 0)"
I wrote: >> So I said I didn't want to do extra work on this, but I am looking into >> fixing it by having these aux process types run a ResourceOwner that can >> be told to clean up any open buffer pins at exit. > That turned out to be a larger can of worms than I'd anticipated, as it > soon emerged that we'd acquired a whole lot of cargo-cult programming > around ResourceOwners. ... > I'm mostly pretty happy with this, but I think there are a couple of > loose ends in logicalfuncs.c and slotfuncs.c: those are creating > non-standalone ResourceOwners (children of whatever the active > ResourceOwner is) and doing nothing much to clean them up. That seems > pretty bogus. Further investigation showed that the part of that code that was actually needed was not the creation of a child ResourceOwner, but rather restoration of the old CurrentResourceOwner setting after exiting the logical decoding loop. Apparently we can come out of that with the TopTransaction resowner being active. This still seems a bit bogus; maybe there should be a save-and-restore happening somewhere else? But I'm not really excited about doing more than commenting it. Also, most of the other random creations of ResourceOwners seem to just not be necessary at all, even with the new rule that you must have a resowner to acquire buffer pins. So the attached revised patch just gets rid of them, and improves some misleading/wrong comments on the topic. It'd still be easy to use CreateAuxProcessResourceOwner in any process where we discover we need one, but I don't see the value in adding useless overhead. At this point I'm leaning to just applying this in HEAD and calling it good. The potential for assertion failures isn't relevant to production builds, and my best guess at this point is that there isn't really any other user-visible bug. The resowners that were being created and not adequately cleaned up all seem to have been dead code anyway (ie they'd never acquire any resources). We could have a problem if, say, the bgwriter exited via the FATAL path while holding a pin, but I don't think there's a reason for it to do that except in a database-wide shutdown, where the consequences of a leaked pin seem pretty minimal. Any objections? Anyone want to do further review? regards, tom lane diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c index 4e32cff..30ddf94 100644 --- a/src/backend/access/transam/parallel.c +++ b/src/backend/access/transam/parallel.c @@ -37,7 +37,6 @@ #include "utils/guc.h" #include "utils/inval.h" #include "utils/memutils.h" -#include "utils/resowner.h" #include "utils/snapmgr.h" #include "utils/typcache.h" @@ -1220,16 +1219,19 @@ ParallelWorkerMain(Datum main_arg) Assert(ParallelWorkerNumber == -1); memcpy(&ParallelWorkerNumber, MyBgworkerEntry->bgw_extra, sizeof(int)); - /* Set up a memory context and resource owner. */ - Assert(CurrentResourceOwner == NULL); - CurrentResourceOwner = ResourceOwnerCreate(NULL, "parallel toplevel"); + /* Set up a memory context to work in, just for cleanliness. */ CurrentMemoryContext = AllocSetContextCreate(TopMemoryContext, "Parallel worker", ALLOCSET_DEFAULT_SIZES); /* - * Now that we have a resource owner, we can attach to the dynamic shared - * memory segment and read the table of contents. + * Attach to the dynamic shared memory segment for the parallel query, and + * find its table of contents. + * + * Note: at this point, we have not created any ResourceOwner in this + * process. 
This will result in our DSM mapping surviving until process + * exit, which is fine. If there were a ResourceOwner, it would acquire + * ownership of the mapping, but we have no need for that. */ seg = dsm_attach(DatumGetUInt32(main_arg)); if (seg == NULL) @@ -1393,7 +1395,7 @@ ParallelWorkerMain(Datum main_arg) /* Must exit parallel mode to pop active snapshot. */ ExitParallelMode(); - /* Must pop active snapshot so resowner.c doesn't complain. */ + /* Must pop active snapshot so snapmgr.c doesn't complain. */ PopActiveSnapshot(); /* Shut down the parallel-worker transaction. */ diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 4049deb..5d01edd 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -6356,6 +6356,15 @@ StartupXLOG(void) struct stat st; /* + * We should have an aux process resource owner to use, and we should not + * be in a transaction that's installed some other resowner. + */ + Assert(AuxProcessResourceOwner != NULL); + Assert(CurrentResourceOwner == NULL || + CurrentResourceOwner == AuxProcessResourceOwner); + CurrentResourceOwner = AuxProcessResourceOwner; + + /* * Verify XLOG status looks valid. */ if (ControlFile->state < DB_SHUTDOWNED || @@ -8467,6 +8476,15 @@ GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch) void ShutdownXLOG(in
Re: Another usability issue with our TAP tests
On Tue, Jul 17, 2018 at 03:02:32PM -0400, Tom Lane wrote: > Cute idea, but it seems not to work with older versions of prove: > > $ which prove > /usr/local/perl5.8.3/bin/prove > $ prove --state=save > Unknown option: s I didn't know this one, and that's actually nice, but I cannot get easily a way to track down which tests failed during a parallel run... And I am used as well to hunt those with "git status". > We could just do a "touch .prove_running" beforehand and an "rm" after, > perhaps (I think Michael suggested something like that already). Yes, except that in the idea of upthread TestLib.pm would be in control of it, so as there is less code churn in prove_check and prove_installcheck. But I am having second-thoughts about putting the test state directly in the module as that's a four-liner in Makefile.global.in, and that seems to work even if you put a die() to fail hard or even if only a subset of tests is failing. Thoughts? -- Michael diff --git a/src/Makefile.global.in b/src/Makefile.global.in index 95d090e72d..fc81380ad2 100644 --- a/src/Makefile.global.in +++ b/src/Makefile.global.in @@ -417,15 +417,19 @@ endef ifeq ($(enable_tap_tests),yes) define prove_installcheck +touch '$(CURDIR)'/tap_running rm -rf '$(CURDIR)'/tmp_check $(MKDIR_P) '$(CURDIR)'/tmp_check cd $(srcdir) && TESTDIR='$(CURDIR)' PATH="$(bindir):$$PATH" PGPORT='6$(DEF_PGPORT)' top_builddir='$(CURDIR)/$(top_builddir)' PG_REGRESS='$(CURDIR)/$(top_builddir)/src/test/regress/pg_regress' $(PROVE) $(PG_PROVE_FLAGS) $(PROVE_FLAGS) $(if $(PROVE_TESTS),$(PROVE_TESTS),t/*.pl) +rm '$(CURDIR)'/tap_running endef define prove_check +touch '$(CURDIR)'/tap_running rm -rf '$(CURDIR)'/tmp_check $(MKDIR_P) '$(CURDIR)'/tmp_check cd $(srcdir) && TESTDIR='$(CURDIR)' $(with_temp_install) PGPORT='6$(DEF_PGPORT)' PG_REGRESS='$(CURDIR)/$(top_builddir)/src/test/regress/pg_regress' $(PROVE) $(PG_PROVE_FLAGS) $(PROVE_FLAGS) $(if $(PROVE_TESTS),$(PROVE_TESTS),t/*.pl) +rm '$(CURDIR)'/tap_running endef else signature.asc Description: PGP signature
Re: "Write amplification" is made worse by "getting tired" while inserting into nbtree secondary indexes (Was: Why B-Tree suffix truncation matters)
On Tue, Jul 17, 2018 at 5:16 PM, Peter Geoghegan wrote: > There is actually a flipside to that downside, though (i.e. the > downside is also an upside): While not filling up leaf pages that have > free space on them is bad, it's only bad when it doesn't leave the > pages completely empty. Leaving the pages completely empty is actually > good, because then VACUUM is in a position to delete entire pages, > removing their downlinks from parent pages. That's a variety of bloat > that we can reverse completely. I suspect that you'll see far more of > that favorable case in the real world with my patch. It's pretty much > impossible to do page deletions with pages full of duplicates today, > because the roughly-uniform distribution of still-live tuples among > leaf pages fails to exhibit any temporal locality. So, maybe my patch > would still come out ahead of simply ripping out "getting tired" in > this parallel universe where latency doesn't matter, and space > utilization is everything. Yes, that's a good point. Also, and I think pretty importantly, this seems essential if we want to allow retail index tuple deletion, which has its own set of advantages. I don't think you're going to be able to rip out the getting-tired code, though, because presumably we'll have to continue support existing btree indexes that don't include TIDs for some probably-not-small number of future releases, even if we decide that all future btree indexes (and any that are reindexed) should have TIDs. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: ENOSPC FailedAssertion("!(RefCountErrors == 0)"
On Tue, Jul 17, 2018 at 8:01 PM, Tom Lane wrote: > Any objections? Anyone want to do further review? LGTM. I think this is an improvement. However, it seems like it might be a good idea for ResourceOwnerRememberBuffer and ResourceOwnerForgetBuffer to Assert(buffer != NULL), so that if somebody messes up it will trip an assertion rather than just seg faulting. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: ENOSPC FailedAssertion("!(RefCountErrors == 0)"
Robert Haas writes: > On Tue, Jul 17, 2018 at 8:01 PM, Tom Lane wrote: >> Any objections? Anyone want to do further review? > LGTM. I think this is an improvement. However, it seems like it > might be a good idea for ResourceOwnerRememberBuffer and > ResourceOwnerForgetBuffer to Assert(buffer != NULL), so that if > somebody messes up it will trip an assertion rather than just seg > faulting. Uh, what? There are only a few callers of those, and they'd all have crashed already if they were somehow dealing with an invalid buffer. regards, tom lane
Re: ENOSPC FailedAssertion("!(RefCountErrors == 0)"
On Tue, Jul 17, 2018 at 8:55 PM, Tom Lane wrote: > Robert Haas writes: >> On Tue, Jul 17, 2018 at 8:01 PM, Tom Lane wrote: >>> Any objections? Anyone want to do further review? > >> LGTM. I think this is an improvement. However, it seems like it >> might be a good idea for ResourceOwnerRememberBuffer and >> ResourceOwnerForgetBuffer to Assert(buffer != NULL), so that if >> somebody messes up it will trip an assertion rather than just seg >> faulting. > > Uh, what? There are only a few callers of those, and they'd all have > crashed already if they were somehow dealing with an invalid buffer. Sorry, I meant Assert(owner != NULL). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: patch to allow disable of WAL recycling
On Tue, Jul 17, 2018 at 3:12 PM, Peter Eisentraut wrote: > The actual implementation could use another round of consideration. I > wonder how this should interact with min_wal_size. Wouldn't > min_wal_size = 0 already do what we need (if you could set it to 0, > which is currently not possible)? Hmm, would that actually disable recycling, or just make it happen only rarely? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: ENOSPC FailedAssertion("!(RefCountErrors == 0)"
Robert Haas writes: > On Tue, Jul 17, 2018 at 8:55 PM, Tom Lane wrote: >> Uh, what? There are only a few callers of those, and they'd all have >> crashed already if they were somehow dealing with an invalid buffer. > Sorry, I meant Assert(owner != NULL). Oh, gotcha: so that if an external developer hits it, he can more easily see that this is from an (effective) API change and not some mysterious bug. Makes sense, especially if we include a comment: /* We used to allow pinning buffers without a resowner, but no more */ Assert(CurrentResourceOwner != NULL); I think it's sufficient to do this in ResourceOwnerEnlargeBuffers, though. The other two should be unreachable without having gone through that. regards, tom lane
Re: PG 10: could not generate random cancel key
On Tue, Jul 17, 2018 at 01:31:01PM -0400, Robert Haas wrote: > On Tue, Jul 17, 2018 at 1:27 PM, Alvaro Herrera > wrote: >> On 2018-Jul-17, Robert Haas wrote: >>> On Tue, Jul 17, 2018 at 8:33 AM, Dean Rasheed >>> wrote: if (RAND_status() == 0) RAND_poll(); >>> >>> Looks like a recipe for an infinite loop. At least, I think we ought >>> to have a CHECK_FOR_INTERRUPTS() in that loop. >> >> What loop? Adding a CHECK_FOR_INTERRUPTS() could make sense if the number of retries got high enough, but that just does not apply here. > Ugh, I'm not doing very well today, am I? I read that as while() but > it says if(). Time for a vacation :) -- Michael signature.asc Description: PGP signature
Re: [HACKERS] PATCH: Keep one postmaster monitoring pipe per process
On Sat, Jul 14, 2018 at 5:34 AM, Heikki Linnakangas wrote: > On 18/04/18 09:55, Thomas Munro wrote: >> Here's a draft patch that does that. One contentious question is: >> should you have to opt *in* to auto-exit-on-postmaster death? Andres >> opined that you should. I actually think it's not so bad if you don't >> have to do that, and instead have to opt out. I think of it as a kind >> of 'process cancellation point' or a quiet PANIC that you can opt out >> of. It's nice to remove the old boilerplate code without having to >> add a new boilerplate event that you have to remember every time. Any >> other opinions? > > Hmm. Exiting on postmaster death by default does feel a bit too magical to > me. But as your patch points out, there are currently no places where you > *don't* want to exit on postmaster death, some callers just prefer to handle > it themselves. And I like having a default or something to make sure that > all call sites in the future will also exit quickly. > > I'd suggest that we add a new WL_EXIT_ON_POSTMASTER_DEATH flag, making it > opt-on. Ok, it's now 2 against 1. So here's a version that uses an explicit WL_EXIT_ON_PM_DEATH value. I like that name because it's shorter and more visually distinctive (harder to confuse with WL_POSTMASTER_DEATH). > But add an assertion in WaitLatchOrSocket: > > Assert ((wakeEvents & (WL_EXIT_POSTMASTER_DEATH | WL_POSTMASTER_DEATH)) != > 0); Ok. Done for the WaitLatchXXX() interface. The WaitEventSet interface (rarely used directly) is less amenable to that change. Here are some of the places I had to add WL_EXIT_ON_PM_DEATH: gather_readnext(), shm_mq_send_bytes(), shm_mq_receive_bytes(), shm_mq_wait_internal(), ProcSleep(), ProcWaitForSignal(), pg_sleep(), pgfdw_get_result(). Was it intentional that any of those places don't currently exit on postmaster vaporisation? -- Thomas Munro http://www.enterprisedb.com 0001-Add-WL_EXIT_ON_PM_DEATH-pseudo-event-v3.patch Description: Binary data
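[Editorial note: to illustrate the opt-in flag at a call site -- a hypothetical one, not taken from the attached patch -- the usual wait-and-check boilerplate would reduce to something like the following. The wait-event name is illustrative.]

    /*
     * With WL_EXIT_ON_PM_DEATH in wakeEvents, the latch code itself exits
     * on postmaster death, so the caller no longer needs its own
     * "if (rc & WL_POSTMASTER_DEATH) proc_exit(1);" check.
     */
    (void) WaitLatch(MyLatch,
                     WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                     1000L,                 /* timeout in milliseconds */
                     WAIT_EVENT_PG_SLEEP);
    ResetLatch(MyLatch);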
Re: PG 10: could not generate random cancel key
On Tue, Jul 17, 2018 at 02:28:14PM +0100, Dean Rasheed wrote: > From what I understand from here [1], some parts of OpenSSL call > RAND_poll() once on initialisation, and that's enough to get the PRNG > going. It's not obvious that calling it multiple times would have any > benefit. > > They also don't appear to bother checking the return code from > RAND_poll() [2]. If it did fail, there'd not be much you could do > anyway, so you might as well just let it continue and let RAND_bytes() > fail. In fact it may even be possible for RAND_poll() to fail, but > just do enough to cause RAND_bytes() to succeed. > > [1] https://wiki.openssl.org/index.php/Random_Numbers This quote from the wiki is scary so that's not quite clean either for Windows: "Be careful when deferring to RAND_poll on some Unix systems because it does not seed the generator. See the code guarded with OPENSSL_SYS_VXWORKS in rand_unix.c. Additionally, RAND_poll can have negative interactions on newer Windows platforms, so your program could hang or crash depending on the potential issue. See Windows Issues below." > [2] > https://github.com/benvanik/openssl/blob/master/openssl/crypto/rand/md_rand.c This repository is outdated, on OpenSSL HEAD I am seeing this used only in rand_win.c. And this commit is sort of interesting because there was a retry loop done with RAND_poll(). Please see this one: commit: c16de9d8329d41a2433d0f273c080d9d06ad7a87 author: Dr. Matthias St. Pierre date: Thu, 31 Aug 2017 23:16:22 +0200 committer: Ben Kaduk date: Wed, 18 Oct 2017 08:39:20 -0500 Fix reseeding issues of the public RAND_DRBG apps/ocsp.c also has the wisdom to check for a failure on RAND_poll(). -- Michael signature.asc Description: PGP signature
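[Editorial note: to make the trade-off concrete, a bounded retry avoids both the single-shot call and an unbounded loop. The sketch below uses only OpenSSL's public RAND API (RAND_status() and RAND_poll()); the retry bound is an arbitrary illustrative choice, not something taken from the patch under discussion.]

    #include <stdbool.h>
    #include <openssl/rand.h>

    #define RAND_POLL_RETRIES 8     /* arbitrary bound, for illustration only */

    static bool
    seed_openssl_prng(void)
    {
        int i;

        /* RAND_status() returns 1 once the PRNG is sufficiently seeded. */
        for (i = 0; i < RAND_POLL_RETRIES && RAND_status() != 1; i++)
            RAND_poll();        /* ask OpenSSL to gather more entropy */

        return RAND_status() == 1;
    }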
Re: "Write amplification" is made worse by "getting tired" while inserting into nbtree secondary indexes (Was: Why B-Tree suffix truncation matters)
On Tue, Jul 17, 2018 at 5:45 PM, Robert Haas wrote: > Yes, that's a good point. Also, and I think pretty importantly, this > seems essential if we want to allow retail index tuple deletion, which > has its own set of advantages. Retail index tuple deletion is clearly something we should have in some form, and is the top reason to pursue the project. I could probably come up with dozens of other reasons if I had to, though. Here are a few of the less obvious ones: * This improves usage_count of heap page buffers automatically, by recognizing correlated references when there are many duplicates. [1] * How else is VACUUM/deletion supposed to work with indirect indexes? I think that it's impossible without either retail deletion/retail updates, unless an AEL is okay. * Global indexes (indexes that simultaneously cover multiple partitions) are just a generalization of this idea, with a partition number attribute before the heap TID attribute. * The sort order of duplicates actually matters quite a lot to the executor, and to the planner, which this lets us reason about in many cases. [2][3] * There is nothing to stop us from exploiting this within tidbitmap.c/bitmap index scans. We can build disjunct lists of sorted TIDs, one per value. Merging lists like this together is typically very cheap, even when there are relatively many distinct lists. We could probably do that on-the-fly. * When building a bitmap for a low cardinality value from a secondary index, why bother even visiting the leaf pages? We can make a lossy bitmap from a subset of the internal pages. * Techniques like bitmap semi-join, which are important for star-schema queries, are dependent on the merging and intersection of fact table bitmaps being cheap (basically, you build several bitmaps through nestloop joins with some dimension table on the outer side, and some fact table index on the inner side -- that's relatively easy to parallelize well). Tom's "Improve OR conditions on joined columns" patch sorts a list of TIDs [4], and I suspect that there is some high-level connection there. * Heap prefetching for nbtree index scans is a lot more effective than it might otherwise be. I really could go on, but you get the idea. The merits of each of these ideas are debatable, but the fact that there are so many ideas that are at least plausible-sounding seems significant to me. > I don't think you're going to be able to rip out the getting-tired > code, though, because presumably we'll have to continue support > existing btree indexes that don't include TIDs for some > probably-not-small number of future releases, even if we decide that > all future btree indexes (and any that are reindexed) should have > TIDs. That's definitely true. There is no way that it's okay to ignore this issue with pg_upgrade. [1] https://postgr.es/m/cah2-wzkrkkw88yq4-bln5n3drfv93p8r+ti-kfbter08p8c...@mail.gmail.com [2] https://postgr.es/m/20170707234119.gn17...@telsasoft.com [3] https://postgr.es/m/520D6610.8040907%40emulex.com#520d6610.8040...@emulex.com [4] https://postgr.es/m/22002.1487099934%40sss.pgh.pa.us -- Peter Geoghegan
Re: [bug fix] Produce a crash dump before main() on Windows
On 26 February 2018 at 12:06, Tsunakawa, Takayuki < tsunakawa.ta...@jp.fujitsu.com> wrote: > From: Craig Ringer [mailto:cr...@2ndquadrant.com] > > The patch proposed here means that early crashes will invoke WER. If > we're > > going to allow WER we should probably just do so unconditionally. > > > > I'd be in favour of leaving WER on when we find out we're in a > noninteractive > > service too, but that'd be a separate patch for pg11+ only. > > As for PG11+, I agree that we want to always leave WER on. That is, call > SetErrorMode(SEM_FAILCRITICALERRORS) but not specify > SEM_NOGPFAULTERRORBOX. The problem with the current specification of > PostgreSQL is that the user can only get crash dumps in a fixed folder > $PGDATA\crashdumps. That location is bad because the crash dumps will be > backed up together with the database cluster without the user noticing it. > What's worse, the crash dumps are large. With WER, the user can control > the location and size of crash dumps. > Yeah, that's quite old and dates back to when Windows didn't offer much if any control over WER in services. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
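[Editorial note: the behaviour described in the quoted text maps onto a single Win32 call; the sketch below only restates that idea, and the wrapper function and its placement are hypothetical.]

    #ifdef WIN32
    #include <windows.h>

    /*
     * Fail fast on critical errors, but deliberately do NOT pass
     * SEM_NOGPFAULTERRORBOX, so Windows Error Reporting stays enabled and
     * can write a crash dump to whatever location/size policy the
     * administrator has configured.
     */
    static void
    set_windows_error_mode(void)
    {
        SetErrorMode(SEM_FAILCRITICALERRORS);
    }
    #endif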
Re: ENOSPC FailedAssertion("!(RefCountErrors == 0)"
Hello. I confirmed that this patch fixes the crash. At Tue, 17 Jul 2018 20:01:05 -0400, Tom Lane wrote in <14892.1531872...@sss.pgh.pa.us> > I wrote: > >> So I said I didn't want to do extra work on this, but I am looking into > >> fixing it by having these aux process types run a ResourceOwner that can > >> be told to clean up any open buffer pins at exit. > > > That turned out to be a larger can of worms than I'd anticipated, as it > > soon emerged that we'd acquired a whole lot of cargo-cult programming > > around ResourceOwners. ... > > I'm mostly pretty happy with this, but I think there are a couple of > > loose ends in logicalfuncs.c and slotfuncs.c: those are creating > > non-standalone ResourceOwners (children of whatever the active > > ResourceOwner is) and doing nothing much to clean them up. That seems > > pretty bogus. > > Further investigation showed that the part of that code that was > actually needed was not the creation of a child ResourceOwner, but rather > restoration of the old CurrentResourceOwner setting after exiting the > logical decoding loop. Apparently we can come out of that with the > TopTransaction resowner being active. This still seems a bit bogus; > maybe there should be a save-and-restore happening somewhere else? > But I'm not really excited about doing more than commenting it. CurrentResourceOwner doesn't seem to be changed. It is just saved during snapshot export and used only as a flag for checking for duplicate snapshot exports. > Also, most of the other random creations of ResourceOwners seem to just > not be necessary at all, even with the new rule that you must have a > resowner to acquire buffer pins. So the attached revised patch just > gets rid of them, and improves some misleading/wrong comments on the > topic. It'd still be easy to use CreateAuxProcessResourceOwner in any > process where we discover we need one, but I don't see the value in adding > useless overhead. +1 for unifying the auxiliary processes on a common resowner. I found the comment below just before the ending cleanup in the auxiliary process main functions. | * These operations are really just a minimal subset of | * AbortTransaction(). We don't have very many resources to worry | * about in checkpointer, but we do have LWLocks, buffers, and temp | * files. So couldn't we use TopTransactionResourceOwner instead of AuxProcessResourceOwner? I feel a bit uneasy that bootstrap and standalone-backend have *AuxProcess*ResourceOwner. It's not about this patch, but while looking at this more closely, I found the following comment on ShutdownXLOG(). | /* | * This must be called ONCE during postmaster or standalone-backend shutdown | */ Is "postmaster" a typo for "bootstrap process"? It is also called from the checkpointer in non-standalone mode. > At this point I'm leaning to just applying this in HEAD and calling it > good. The potential for assertion failures isn't relevant to production > builds, and my best guess at this point is that there isn't really any > other user-visible bug. The resowners that were being created and not > adequately cleaned up all seem to have been dead code anyway (ie they'd > never acquire any resources). Agreed. >We could have a problem if, say, the > bgwriter exited via the FATAL path while holding a pin, but I don't think > there's a reason for it to do that except in a database-wide shutdown, > where the consequences of a leaked pin seem pretty minimal. > > Any objections? Anyone want to do further review? FWIW, I think we won't be concerned about leaked pins after FATAL.
regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: missing toast table for pg_policy
On Tue, Jul 17, 2018 at 06:03:26PM +0900, Michael Paquier wrote: > On Tue, Jul 17, 2018 at 02:55:07PM +0700, John Naylor wrote: >> I didn't dig deeper, since TOAST and the large object facility are >> mutually exclusive so there shouldn't be a toast table here anyway. >> Hope this helps. > > [... digging ...] > This comes from get_rel_infos where large objects are treated as user > data. Rather than the comment you added, I would rather do the > following: > "Large object catalogs and toast tables are mutually exclusive and large > object data is handled as user data by pg_upgrade, which would cause > failures." So, I have been playing with the previous patch and tortured it through a couple of upgrade scenarios, where it proves to work. Attached is an updated patch which adds toast tables except for pg_class, pg_attribute, pg_index and pg_largeobject* with a proposal of commit message. Any objections for the commit of this stuff? -- Michael From 816dc770bc6bd2914b00ded5f1523265cf6c6d48 Mon Sep 17 00:00:00 2001 From: Michael Paquier Date: Wed, 18 Jul 2018 12:56:21 +0900 Subject: [PATCH] Add toast tables to most system catalogs It has been project policy to create toast tables only for those catalogs that might reasonably need one. Since this judgment call can change over time, just create one for every catalog, as this can be useful when creating rather-long entries in catalogs, with recent examples being in the shape of policy expressions or customly-formatted SCRAM verifiers. To prevent circular dependencies and to avoid adding complexity to VACUUM FULL logic, exclude pg_class, pg_attribute, and pg_index. Also, to prevent pg_upgrade from seeing a non-empty new cluster, exclude pg_largeobject and pg_largeobject_metadata from the set as large object data is handled as user data. Those relations have no reason to use a toast table anyway. 
Author: Joe Conway, John Naylor Reviewed-by: Michael Paquier, Tom Lane Discussion: https://postgr.es/m/84ddff04-f122-784b-b6c5-353680449...@joeconway.com --- src/backend/catalog/catalog.c | 18 - src/include/catalog/toasting.h| 40 +++- src/test/regress/expected/misc_sanity.out | 80 +++ src/test/regress/sql/misc_sanity.sql | 12 ++-- 4 files changed, 82 insertions(+), 68 deletions(-) diff --git a/src/backend/catalog/catalog.c b/src/backend/catalog/catalog.c index a42155eeea..6061428bcc 100644 --- a/src/backend/catalog/catalog.c +++ b/src/backend/catalog/catalog.c @@ -253,12 +253,24 @@ IsSharedRelation(Oid relationId) relationId == SubscriptionNameIndexId) return true; /* These are their toast tables and toast indexes (see toasting.h) */ - if (relationId == PgShdescriptionToastTable || - relationId == PgShdescriptionToastIndex || + if (relationId == PgAuthidToastTable || + relationId == PgAuthidToastIndex || + relationId == PgDatabaseToastTable || + relationId == PgDatabaseToastIndex || relationId == PgDbRoleSettingToastTable || relationId == PgDbRoleSettingToastIndex || + relationId == PgPlTemplateToastTable || + relationId == PgPlTemplateToastIndex || + relationId == PgReplicationOriginToastTable || + relationId == PgReplicationOriginToastIndex || + relationId == PgShdescriptionToastTable || + relationId == PgShdescriptionToastIndex || relationId == PgShseclabelToastTable || - relationId == PgShseclabelToastIndex) + relationId == PgShseclabelToastIndex || + relationId == PgSubscriptionToastTable || + relationId == PgSubscriptionToastIndex || + relationId == PgTablespaceToastTable || + relationId == PgTablespaceToastIndex) return true; return false; } diff --git a/src/include/catalog/toasting.h b/src/include/catalog/toasting.h index 3db39b8f86..f259890e43 100644 --- a/src/include/catalog/toasting.h +++ b/src/include/catalog/toasting.h @@ -46,25 +46,59 @@ extern void BootstrapToastTable(char *relName, */ /* normal catalogs */ +DECLARE_TOAST(pg_aggregate, 4159, 4160); DECLARE_TOAST(pg_attrdef, 2830, 2831); +DECLARE_TOAST(pg_collation, 4161, 4162); DECLARE_TOAST(pg_constraint, 2832, 2833); +DECLARE_TOAST(pg_default_acl, 4143, 4144); DECLARE_TOAST(pg_description, 2834, 2835); +DECLARE_TOAST(pg_event_trigger, 4145, 4146); +DECLARE_TOAST(pg_extension, 4147, 4148); +DECLARE_TOAST(pg_foreign_data_wrapper, 4149, 4150); +DECLARE_TOAST(pg_foreign_server, 4151, 4152); +DECLARE_TOAST(pg_foreign_table, 4153, 4154); +DECLARE_TOAST(pg_init_privs, 4155, 4156); +DECLARE_TOAST(pg_language, 4157, 4158); +DECLARE_TOAST(pg_namespace, 4163, 4164); +DECLARE_TOAST(pg_partitioned_table, 4165, 4166); +DECLARE_TOAST(pg_policy, 4167, 4168); DECLARE_TOAST(pg_proc, 2836, 2837); DECLARE_TOAST(pg_rewrite, 2838, 2839); DECLARE_TOAST(pg_seclabel, 3598, 3599); DECLARE_TOAST(pg_statistic, 2840, 2841); DECLARE_TOAST(pg_statistic_ext, 3439, 3440); DECLARE_TOAST(pg_trigger, 2336, 2337); +DECLARE_TOAST(pg_ts_dict, 4169, 4170); +DECLARE_TOAST(pg_type, 4171, 4172); +DECLARE_TOAST(pg_user_
Re: AtEOXact_ApplyLauncher() and subtransactions
On 17 July 2018 at 03:29, Robert Haas wrote: > On Mon, Jul 16, 2018 at 2:36 AM, Amit Khandekar > wrote: >> 0001 patch contains the main fix. In this patch I have used some >> naming conventions and some comments that you used in your patch, >> plus, I used your method of lazily allocating new stack level. The >> stack is initially Null. > > Committed and back-patched to v11 and v10. Thanks Robert.
Re: [bug fix] Produce a crash dump before main() on Windows
At Wed, 18 Jul 2018 11:12:06 +0800, Craig Ringer wrote in > On 26 February 2018 at 12:06, Tsunakawa, Takayuki < > tsunakawa.ta...@jp.fujitsu.com> wrote: > > > From: Craig Ringer [mailto:cr...@2ndquadrant.com] > > > The patch proposed here means that early crashes will invoke WER. If > > we're > > > going to allow WER we should probably just do so unconditionally. > > > > > > I'd be in favour of leaving WER on when we find out we're in a > > noninteractive > > > service too, but that'd be a separate patch for pg11+ only. > > > > As for PG11+, I agree that we want to always leave WER on. That is, call > > SetErrorMode(SEM_FAILCRITICALERRORS) but not specify > > SEM_NOGPFAULTERRORBOX. The problem with the current specification of > > PostgreSQL is that the user can only get crash dumps in a fixed folder > > $PGDATA\crashdumps. That location is bad because the crash dumps will be > > backed up together with the database cluster without the user noticing it. > > What's worse, the crash dumps are large. With WER, the user can control > > the location and size of crash dumps. > > > > Yeah, that's quite old and dates back to when Windows didn't offer much if > any control over WER in services. Yeah. If we want to take a crash dump, we cannot have auto-restart. Since that is unavoidable, what we could do here is add a new knob for it, one that cannot be turned on together with restart_after_crash...? regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Make foo=null a warning by default.
On Tue, Jul 17, 2018 at 08:34:17AM -0400, Fabien COELHO wrote: > > Hello David, > > A few comments about this v2. > > ISTM that there is quite strong opposition to having "warn" as a default, so > probably you should set if "off"? Done. > >>I do not really understand the sort order of this array. Maybe it could be > >>alphabetical, or per value? Or maybe it is standard values and then > >>synonyms, in which is case maybe a comment on the end of the list. > > > >I've changed it to per value. Is this better? > > At least, I can see the logic. I've now ordered it in what seems like a reasonable way, namely all the "off" options, then the "on" option. > >>Or anything in better English. > > > >I've changed this, and hope it's clearer. > > Yep, and more elegant than my proposal. I assure you that you expression yourself in English a good deal better than I do in Portuguese. > >>* Test > >> > >> +select 1=null; > >> + ?column? > >> > >>Maybe use as to describe the expected behavior, so that it is easier to > >>check, and I think that "?column?" looks silly eg: > >> > >> select 1=null as expect_{true,false}[_and_a_warning/error]; > > > >Changed to more descriptive headers. > > Cannot see it in the patch update? Oops. They should be there now. > >> create table cnullchild (check (f1 = 1 or f1 = null)) > >> inherits(cnullparent); > >> +WARNING: = NULL can only produce a NULL > >> > >>I'm not sure of this test case. Should it be turned into "is null"? > > > >This was just adjusting the existing test output to account for the > >new warning. I presume it was phrased that way for a reason. > > Hmmm. Probably you are right. I think that the test case deserves better > comments, as it is most peculiar. It was added by Tom for testing CHECK > constant NULL expressions simplications. > > TRUE OR NULL is TRUE, but FALSE OR NULL is NULL. Why not. Combined with the > fact that NULL is considered ok for a CHECK, this lead to strange but > intentional behavior, such as: > > -- this CHECK is always nignored in practice > CREATE TABLE foo (i INT CHECK(i = 1 OR NULL)); > INSERT INTO foo(i) VALUES(2); # ACCEPTED > > -- but this one is not > CREATE TABLE foo (i INT CHECK(i = 1 OR FALSE)); > INSERT INTO foo(i) VALUES(2); # REFUSED > > Can't say I'm thrilled about that, and the added warning looks appropriate. Per request, the warning is off by default, so the warning went away. Best, David. -- David Fetter http://fetter.org/ Phone: +1 415 235 3778 Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate >From 87d004f2a054a0c0999145aa9b44d97d7951b19c Mon Sep 17 00:00:00 2001 From: David Fetter Date: Sun, 15 Jul 2018 16:11:08 -0700 Subject: [PATCH] Make transform_null_equals into an enum To: pgsql-hack...@postgresql.org The transform_null_equals GUC can now be off, warn, error, or on. 
--- doc/src/sgml/config.sgml | 20 ++--- src/backend/parser/parse_expr.c | 30 ++--- src/backend/utils/misc/guc.c | 44 +-- src/backend/utils/misc/postgresql.conf.sample | 2 +- src/include/parser/parse_expr.h | 11 - src/test/regress/expected/guc.out | 30 + src/test/regress/sql/guc.sql | 18 7 files changed, 128 insertions(+), 27 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 4d48d93305..b0d951190f 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -8066,7 +8066,7 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir' - transform_null_equals (boolean) + transform_null_equals (enum) IS NULL transform_null_equals configuration parameter @@ -8074,15 +8074,23 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir' -When on, expressions of the form expr = +When enabled, expressions of the form expr = NULL (or NULL = expr) are treated as expr IS NULL, that is, they return true if expr evaluates to the null value, -and false otherwise. The correct SQL-spec-compliant behavior of -expr = NULL is to always -return null (unknown). Therefore this parameter defaults to -off. +and false otherwise. Valid values are off, warn, error, and on. + + + +The correct SQL-spec-compliant behavior of +expr = null_expression +is to always return null (unknown); PostgreSQL allows NULL +as a null-valued expression of unknown type, which is an extension to the spec. +The transformation of expr = NULL +to expr IS NULL is inconsistent +with the normal semantics of the SQL specification, and is therefore set to "off" +by default. diff --git a/src/backend/parser/parse_expr.c b/src/backend/parser/parse_expr.c ind
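[Editorial note: for context, enum GUCs in guc.c are driven by a config_enum_entry array. A hypothetical version of the one this patch presumably adds is sketched below; the symbol names are guesses based on the thread (the real enum lives in parse_expr.h in the patch) and the ordering follows the "all the off spellings first, then on" convention mentioned above.]

    #include "utils/guc.h"              /* struct config_enum_entry */

    /* Hypothetical enum -- member names are assumptions, not from the patch. */
    typedef enum
    {
        TRANSFORM_NULL_EQUALS_OFF,
        TRANSFORM_NULL_EQUALS_WARN,
        TRANSFORM_NULL_EQUALS_ERROR,
        TRANSFORM_NULL_EQUALS_ON
    } TransformNullEqualsType;

    static const struct config_enum_entry transform_null_equals_options[] = {
        {"off", TRANSFORM_NULL_EQUALS_OFF, false},
        {"false", TRANSFORM_NULL_EQUALS_OFF, true},   /* hidden synonym */
        {"no", TRANSFORM_NULL_EQUALS_OFF, true},      /* hidden synonym */
        {"warn", TRANSFORM_NULL_EQUALS_WARN, false},
        {"error", TRANSFORM_NULL_EQUALS_ERROR, false},
        {"on", TRANSFORM_NULL_EQUALS_ON, false},
        {NULL, 0, false}
    };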
More consistency for some file-related error message
Hi all, While looking at the source code for more consistency work with error messages, I have bumped into a couple of messages which could be simplified, as those include in the name of the file manipulated basically the same information as the context added. I have finished with the attached. Note that for most of the messages, I think that the context can be useful, like for example the stats temporary file which could be user-defined, so those are left out. There are also other cases, say the reorder buffer spill file, where we may not know the path worked on, so the messages are kept consistent as per HEAD. That's a bit less work to do for translators, particularly with pg_stat_statements. Thoughts? -- Michael diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c index cc9efab243..33f9a79f54 100644 --- a/contrib/pg_stat_statements/pg_stat_statements.c +++ b/contrib/pg_stat_statements/pg_stat_statements.c @@ -642,19 +642,19 @@ pgss_shmem_startup(void) read_error: ereport(LOG, (errcode_for_file_access(), - errmsg("could not read pg_stat_statement file \"%s\": %m", + errmsg("could not read file \"%s\": %m", PGSS_DUMP_FILE))); goto fail; data_error: ereport(LOG, (errcode(ERRCODE_INVALID_PARAMETER_VALUE), - errmsg("ignoring invalid data in pg_stat_statement file \"%s\"", + errmsg("ignoring invalid data in file \"%s\"", PGSS_DUMP_FILE))); goto fail; write_error: ereport(LOG, (errcode_for_file_access(), - errmsg("could not write pg_stat_statement file \"%s\": %m", + errmsg("could not write file \"%s\": %m", PGSS_TEXT_FILE))); fail: if (buffer) @@ -761,7 +761,7 @@ pgss_shmem_shutdown(int code, Datum arg) error: ereport(LOG, (errcode_for_file_access(), - errmsg("could not write pg_stat_statement file \"%s\": %m", + errmsg("could not write file \"%s\": %m", PGSS_DUMP_FILE ".tmp"))); if (qbuffer) free(qbuffer); @@ -1871,7 +1871,7 @@ qtext_store(const char *query, int query_len, error: ereport(LOG, (errcode_for_file_access(), - errmsg("could not write pg_stat_statement file \"%s\": %m", + errmsg("could not write file \"%s\": %m", PGSS_TEXT_FILE))); if (fd >= 0) @@ -1913,7 +1913,7 @@ qtext_load_file(Size *buffer_size) if (errno != ENOENT) ereport(LOG, (errcode_for_file_access(), - errmsg("could not read pg_stat_statement file \"%s\": %m", + errmsg("could not read file \"%s\": %m", PGSS_TEXT_FILE))); return NULL; } @@ -1923,7 +1923,7 @@ qtext_load_file(Size *buffer_size) { ereport(LOG, (errcode_for_file_access(), - errmsg("could not stat pg_stat_statement file \"%s\": %m", + errmsg("could not stat file \"%s\": %m", PGSS_TEXT_FILE))); CloseTransientFile(fd); return NULL; @@ -1939,7 +1939,7 @@ qtext_load_file(Size *buffer_size) ereport(LOG, (errcode(ERRCODE_OUT_OF_MEMORY), errmsg("out of memory"), - errdetail("Could not allocate enough memory to read pg_stat_statement file \"%s\".", + errdetail("Could not allocate enough memory to read file \"%s\".", PGSS_TEXT_FILE))); CloseTransientFile(fd); return NULL; @@ -1958,7 +1958,7 @@ qtext_load_file(Size *buffer_size) if (errno) ereport(LOG, (errcode_for_file_access(), - errmsg("could not read pg_stat_statement file \"%s\": %m", + errmsg("could not read file \"%s\": %m", PGSS_TEXT_FILE))); free(buf); CloseTransientFile(fd); @@ -2088,7 +2088,7 @@ gc_qtexts(void) { ereport(LOG, (errcode_for_file_access(), - errmsg("could not write pg_stat_statement file \"%s\": %m", + errmsg("could not write file \"%s\": %m", PGSS_TEXT_FILE))); goto gc_fail; } @@ -2118,7 +2118,7 @@ gc_qtexts(void) { ereport(LOG, 
(errcode_for_file_access(), - errmsg("could not write pg_stat_statement file \"%s\": %m", + errmsg("could not write file \"%s\": %m", PGSS_TEXT_FILE))); hash_seq_term(&hash_seq); goto gc_fail; @@ -2136,14 +2136,14 @@ gc_qtexts(void) if (ftruncate(fileno(qfile), extent) != 0) ereport(LOG, (errcode_for_file_access(), - errmsg("could not truncate pg_stat_statement file \"%s\": %m", + errmsg("could not truncate file \"%s\": %m", PGSS_TEXT_FILE))); if (FreeFile(qfile)) { ereport(LOG, (errcode_for_file_access(), - errmsg("could not write pg_stat_statement file \"%s\": %m", + errmsg("could not write file \"%s\": %m", PGSS_TEXT_FILE))); qfile = NULL; goto gc_fail; @@ -2203,7 +2203,7 @@ gc_fail: if (qfile == NULL) ereport(LOG, (errcode_for_file_access(), - errmsg("could not write new pg_stat_statement file \"%s\": %m", + errmsg("could not recreate file \"%s\": %m", PGSS_TEXT_FILE))); else FreeFile(qfile); @@ -2255,7 +2255,7 @@ entry_reset(void) { ereport(LOG, (errcode_for_file_access(), - errmsg("could not create p
Re: GSOC 2018 Project - A New Sorting Routine
Hi, Tomas! > 15 июля 2018 г., в 1:20, Tomas Vondra > написал(а): > > So I doubt it's this, but I've tweaked the scripts to also set this GUC > and restarted the tests on both machines. Let's see what that does. Do you observe any different results? Thanks! Best regards, Andrey Borodin.
print_path is missing GatherMerge and CustomScan support
Hi, While debugging the planner I realized that the print_path() function is not aware of either GatherMerge paths or CustomScan paths. The attached small patch fixes that. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center fix_print_path.patch Description: Binary data
Re: "Write amplification" is made worse by "getting tired" while inserting into nbtree secondary indexes (Was: Why B-Tree suffix truncation matters)
On 17 July 2018 at 23:10, Tom Lane wrote: > Peter Geoghegan writes: >> I've done plenty of research into the history of this hack. It was >> your work, but it does actually make sense in the context of today's >> nbtree code. It is essential with scankey-wise duplicates, since >> groveling through hundreds or even thousands of pages full of >> duplicates to find free space (and avoid a page split) is going to >> have a very serious downside for latency. > > Well, the actual problem was O(N^2) behavior: > > https://www.postgresql.org/message-id/2378.967216388%40sss.pgh.pa.us > > https://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=40549e9cb5abd2986603883e4ab567dab34723c6 > > I certainly have no objection to improving matters, but let's be sure > we don't re-introduce any two-decade-old problems. Looking at this in terms of CPU cost for a single insert, insertion at the end of the list of duplicates was O(N), which was much improved by the O(k) behavior. Keeping the index entries in order is O(log N), so there is a higher cost to keeping the entries in order, though that becomes a win because of the reduced bloat in cases where we have a high number of deletes. That is a win in insertion time as well, since by packing the index more tightly we cause fewer costly splits and less I/O. If we knew that we were never going to do deletes/non-HOT updates from the table we could continue to use the existing mechanism, otherwise we will be better off using sorted index entries. However, it does appear as if keeping entries in sorted order would be a win on concurrency from reduced block contention on the first few blocks of the index key, so it may also be a win in cases where there are heavy concurrent inserts but no deletes. Bulk loading will improve because adding a new data block via COPY will cause all of the resulting index entries to be added with more adjacency, so a block full of duplicate entries could be sorted and then merged efficiently into the index. I hope we can see a patch that just adds the sorting-by-TID property so we can evaluate that aspect before we try to add other more advanced index ideas. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
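[Editorial note: to make the "sorting-by-TID property" concrete, the core idea is to treat the heap TID as a trailing tiebreaker column when otherwise-equal index keys are compared. The sketch below is purely illustrative and is not from any posted patch; ItemPointerCompare() and IndexTuple are existing core primitives, everything else is made up for the example.]

    #include "access/itup.h"
    #include "storage/itemptr.h"

    /*
     * Hypothetical helper: called only after all user-visible key columns
     * have compared equal, so the heap TID decides the final ordering and
     * duplicates end up physically sorted by heap location.
     */
    static int
    compare_duplicates_by_heap_tid(IndexTuple a, IndexTuple b)
    {
        return ItemPointerCompare(&a->t_tid, &b->t_tid);
    }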
Re: [Patch] Checksums for SLRU files
Hi, Tomas! > > I think we'd want pg_upgrade tests showing an example of each SLRU > growing past one segment, and then being upgraded, and then being > accessed in various different pages and segment files, so that we can > see that we're able to shift the data to the right place successfully. > For example I think I'd want to see that a single aborted transaction > surrounded by many committed transactions shows up in the right place > after an upgrade. Can you elaborate a bit on how to implement this test. I've searched for some automated pg_upgrade tests but didn't found one. Should it be one-time test script or something "make check-world"-able? Best regards, Andrey Borodin.
Re: [Patch] Checksums for SLRU files
On Wed, Jul 18, 2018 at 5:41 PM, Andrey Borodin wrote: >> I think we'd want pg_upgrade tests showing an example of each SLRU >> growing past one segment, and then being upgraded, and then being >> accessed in various different pages and segment files, so that we can >> see that we're able to shift the data to the right place successfully. >> For example I think I'd want to see that a single aborted transaction >> surrounded by many committed transactions shows up in the right place >> after an upgrade. > > Can you elaborate a bit on how to implement this test. I've searched for some > automated pg_upgrade tests but didn't found one. > Should it be one-time test script or something "make check-world"-able? Hi Andrey, Like this (also reached by check-world): $ cd src/bin/pg_upgrade $ make check It looks like the interesting bits are in test.sh. -- Thomas Munro http://www.enterprisedb.com
RE: [bug fix] Produce a crash dump before main() on Windows
From: Haribabu Kommi [mailto:kommi.harib...@gmail.com] > I reviewed patch and it works as per the subject, but I am not able to verify > the actual > bug that is reported in the upthread. The moving of setErrorMode() call > to the start > of the main function can handle all the cases that can lead to a crash with > no popup. Thank you for reviewing the patch. Yeah, I don't know how to reproduce the problem either, because some bug in Symantec AntiVirus randomly caused the crash before main()... > The reset of setErrorMode(0) before start of any process can generate the > coredump > because of any issue before it reaches the main() function, but this change > can lead > to stop the postmaster restart process, if no one present to observe the > scenario? > Doesn't this change can generate backward compatibility problems to > particular users? Sorry, I couldn't get it. Could you elaborate on it? I couldn't understand how this change has an adverse effect on the postmaster restart process (restart_after_crash = on) and backward compatibility. Regards Takayuki Tsunakawa
Re: print_path is missing GatherMerge and CustomScan support
On Wed, Jul 18, 2018 at 02:35:23PM +0900, Masahiko Sawada wrote: > Hi, > > While debugging planner I realized that print_path() function is not > aware of both GatherMerge path and CustomScan path. Attached small > patch fixes it. Good catch. Those should be backpatched. While I am looking at this stuff, I have noticed that pathnode.c/reparameterize_path_by_child uses T_MergeAppend and not T_MergeAppendPath. --- a/src/backend/optimizer/util/pathnode.c +++ b/src/backend/optimizer/util/pathnode.c @@ -3817,7 +3817,7 @@ do { \ } break; - case T_MergeAppend: + case T_MergeAppendPath: { MergeAppendPath *mapath This is new as of f49842d1 in v11. Robert, Ashutosh, am I missing something? -- Michael signature.asc Description: PGP signature
Re: [bug fix] Produce a crash dump before main() on Windows
On Wed, Jul 18, 2018 at 06:09:57AM +, Tsunakawa, Takayuki wrote: > From: Haribabu Kommi [mailto:kommi.harib...@gmail.com] >> I reviewed patch and it works as per the subject, but I am not able to verify >> the actual >> bug that is reported in the upthread. The moving of setErrorMode() call >> to the start >> of the main function can handle all the cases that can lead to a crash with >> no popup. > > Thank you for reviewing the patch. Yeah, I don't know reproduce the > problem too, because some bug of Symantec AntiVirus randomly caused > the crash before main()... A suggestion I have here would be to manually patch the postmaster with any kind of code that makes it crash before main() is called. Anything is fine as long as a core dump is produced, right? -- Michael signature.asc Description: PGP signature
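[Editorial note: one throwaway way to do that in a gcc/MinGW build is to crash from a constructor function, which runs before main(). This assumes __attribute__((constructor)) is available in the toolchain; it is a test hack only, never meant to be committed.]

    /* Deliberate pre-main() crash, for testing the crash-dump behaviour only. */
    static void __attribute__((constructor))
    crash_before_main(void)
    {
        volatile int *p = NULL;

        *p = 42;                /* null-pointer write -> access violation */
    }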
Re: print_path is missing GatherMerge and CustomScan support
On Wed, Jul 18, 2018 at 11:52 AM, Michael Paquier wrote: > On Wed, Jul 18, 2018 at 02:35:23PM +0900, Masahiko Sawada wrote: >> Hi, >> >> While debugging planner I realized that print_path() function is not >> aware of both GatherMerge path and CustomScan path. Attached small >> patch fixes it. > > Good catch. Those should be backpatched. While I am looking at this > stuff, I have noticed that pathnode.c/reparameterize_path_by_child uses > T_MergeAppend and not T_MergeAppendPath. > > --- a/src/backend/optimizer/util/pathnode.c > +++ b/src/backend/optimizer/util/pathnode.c > @@ -3817,7 +3817,7 @@ do { \ > } > break; > > - case T_MergeAppend: > + case T_MergeAppendPath: > { > MergeAppendPath *mapath > > This is new as of f49842d1 in v11. Yes, that's right. Thanks for taking care of it. > Robert, Ashutosh, am I missing > something? You used my personal email id by mistake, I think. I have removed it and added my EDB email address. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: print_path is missing GatherMerge and CustomScan support
On Wed, Jul 18, 2018 at 12:15:25PM +0530, Ashutosh Bapat wrote: > On Wed, Jul 18, 2018 at 11:52 AM, Michael Paquier wrote: >> On Wed, Jul 18, 2018 at 02:35:23PM +0900, Masahiko Sawada wrote: >>> Hi, >>> >>> While debugging planner I realized that print_path() function is not >>> aware of both GatherMerge path and CustomScan path. Attached small >>> patch fixes it. >> >> Good catch. Those should be backpatched. While I am looking at this >> stuff, I have noticed that pathnode.c/reparameterize_path_by_child uses >> T_MergeAppend and not T_MergeAppendPath. >> >> This is new as of f49842d1 in v11. > > Yes that's right. Thanks for taking care of it. Thanks for the confirmation. Robert, do you want to take care of this issue or should I? >> Robert, Ashutosh, am I missing >> something? > > You used my personal email id by mistake, I think. I have removed it > and added by EDB email address. My sincere apologies. I have not been careful. -- Michael signature.asc Description: PGP signature