Re: Tab complete for CREATE OR REPLACE TRIGGER statement
On 2020-11-18 06:06, Michael Paquier wrote: On Mon, Nov 16, 2020 at 10:14:10PM -0500, Tom Lane wrote: Michael Paquier writes: I don't think that this is a requirement for this thread, though. Agreed, I'm not trying to block this patch. Just wishing there were a better way. Okay. I have tweaked a couple of things in the patch and applied it. I am wondering if libreadline gives the possibility to implement an optional grouping of words to complete, but diving into its documentation I have not found something that we could use. To me the code looks like a prime candidate for "data-driven" refactoring. It should be redone as generic code that reads a table of rules with params and then checks and applies each. Thus the repetitive code would be replaced by a bit more generic code and a lot of code-free data. -- Best regards, Tels
Re: Tab complete for CREATE OR REPLACE TRIGGER statement
Hello Tom, On 2020-11-18 16:49, Tom Lane wrote: Tels writes: On 2020-11-18 06:06, Michael Paquier wrote: On Mon, Nov 16, 2020 at 10:14:10PM -0500, Tom Lane wrote: Agreed, I'm not trying to block this patch. Just wishing there were a better way. To me the code looks like a prime candidate for "data-driven" refactoring. It should be redone as generic code that reads a table of rules with params and then checks and applies each. Thus the repetitive code would be replaced by a bit more generic code and a lot of code-free data. In the past I've looked into whether the rules could be autogenerated from the backend's grammar. It did not look very promising though. The grammar isn't really factorized appropriately -- for instance, tab-complete has a lot of knowledge about which kinds of objects can be named in particular places, while the Bison productions only know that's a "name". Still, the precedent of ECPG suggests it might be possible to process the grammar rules somehow to get to a useful result. Hm, that would be even better, for now I was just thinking that code like this: IF RULE_A_MATCHES THEN DO STUFF A ELSE IF RULE_B_MATCHES THEN DO_STUFF_B ELSE IF RULE_C_MATCHES THEN DO_STUFF_C ... should be replaced by RULE_A MATCH STUFF RULE_B MATCH STUFF RULE_C MATCH STUFF ... FOREACH RULE DO IF RULE.MATCH THEN DO RULE.STUFF END FOREACH Even if the rules would be manually created (converted from the current code), that would be more compact and probably less error-prone. Creating the rule automatically turns this into a completely different story. -- Best regards, Tels
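PS: To make "a bit more generic code and a lot of code-free data" a little more concrete, here is a rough, compilable C sketch; the structure and helper names are invented for illustration and are not psql's actual tab-complete.c API:

#include <stdio.h>
#include <string.h>

/* One rule: if the previous words match, offer these completions.
 * NULL in a prev slot means "don't care". */
typedef struct
{
    const char *prev3;
    const char *prev2;
    const char *prev1;
    const char *completions;
} CompletionRule;

static const CompletionRule rules[] = {
    {"CREATE", "OR", "REPLACE", "FUNCTION PROCEDURE TRIGGER VIEW"},
    {NULL, "CREATE", "TRIGGER", "<trigger name>"},
    {NULL, "DROP", "TABLE", "<table name>"},
};

static const char *
lookup_completions(const char *p3, const char *p2, const char *p1)
{
    for (size_t i = 0; i < sizeof(rules) / sizeof(rules[0]); i++)
    {
        const CompletionRule *r = &rules[i];

        if ((r->prev3 == NULL || strcmp(r->prev3, p3) == 0) &&
            (r->prev2 == NULL || strcmp(r->prev2, p2) == 0) &&
            (r->prev1 == NULL || strcmp(r->prev1, p1) == 0))
            return r->completions;
    }
    return NULL;
}

int
main(void)
{
    const char *c = lookup_completions("CREATE", "OR", "REPLACE");

    printf("%s\n", c ? c : "(no rule matched)");
    return 0;
}

The repetitive if/else chain then collapses into the single loop, and adding a new completion becomes one more line of data.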
Re: Parallel Aggregates for string_agg and array_agg
Moin, On Wed, April 4, 2018 11:41 pm, David Rowley wrote: > Hi Tomas, > > Thanks for taking another look. > > On 5 April 2018 at 07:12, Tomas Vondra > wrote: >> Other than that, the patch seems fine to me, and it's already marked as >> RFC so I'll leave it at that. > > Thanks. I have one more comment - sorry for not writing sooner, the flu got to me ... Somewhere in the code there is a new allocation of memory when the string grows beyond the current size - and that doubles the size. This can lead to a lot of wasted space (think: constructing a string that is a bit over 1 GByte, which would presumably allocate 2 GByte). The same issue happens when each worker allocates 512 MByte for a 256 MByte + 1 result. IMHO a factor of like 1.4 or 1.2 would work better here - not sure what the current standard in PG is for situations like this. >> The last obstacle seems to be the argument about the risks of the patch >> breaking queries of people relying on the ordering. Not sure what's are >> the right next steps in this regard ... > > yeah, seems like a bit of a stalemate. > > Personally, think if the group of people Tom mentions do exist, then > they've already been through some troubled times since Parallel Query > was born. I'd hope they've already put up their defenses due to the > advent of that feature. I stand by my thoughts that it's pretty > bizarre to draw the line here when we've probably been causing these > people issues for many years already. I said my piece on this already > so likely not much point in going on about it. These people are also > perfectly capable of sidestepping the whole issue with setting > max_parallel_workers_per_gather TO 0. Coming from the Perl side, this issue has popped up there a lot with the ordering of hash keys (e.g. "for my $key (keys %$hash) { ... }") and even tho there was never a guarantee, it often caught people by surprise. Mostly in testsuites, tho, because real output often needs some ordering, anyway. However, changes were nevertheless made, and code had to be adapted. But the world didn't end, so to speak. For PG, I think the benefits greatly outweigh the problems with the sorting order - after all, if you have code that relies on an implicit order, you've already got a problem, no? So my favourite would be something along these lines: * add the string_agg * document it in the release notes, and document workarounds/solutions (add ORDER BY, disable workers etc.) * if not already done, stress in the documentation that if you don't ORDER things, things might not come out in the order you expect - especially with parallel queries, new processing nodes etc. Best wishes, Tels PS: We use string_agg() in a case where we first agg each row, then string_agg() all rows, and the resulting string is really huge. We did run into the "out of memory" problem, so we now use a LIMIT and assemble the resulting parts on the client side. My wish here would be to better know how large the LIMIT can be; I found it quite difficult to predict at how many rows PG runs out of memory for the string buffer, even tho all rows have almost the same length as text. But that aside, getting the parts faster with parallel agg would be very cool, too.
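PPS: On the growth-factor remark above, a tiny, purely illustrative C sketch of a gentler growth policy (this is not how PostgreSQL's aggregate or StringInfo code is actually written):

#include <stddef.h>

/* Grow by roughly 1.5x instead of doubling: a result just over 1 GByte
 * then ends up in a ~1.5 GByte buffer rather than a 2 GByte one. */
static size_t
next_buffer_size(size_t current, size_t needed)
{
    size_t newsize = current;

    while (newsize < needed)
        newsize += newsize / 2 + 1024;  /* +1024 so tiny buffers still grow */

    return newsize;
}

The trade-off is more (and more expensive) reallocations as the string grows, so the right factor is a balance between wasted memory and copy overhead.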
Re: perltidy version
Moin, On Wed, April 25, 2018 12:35 pm, Tom Lane wrote: > Michael Paquier writes: >> On Mon, Apr 23, 2018 at 12:40:00PM -0400, Tom Lane wrote: >>> Alvaro Herrera writes: >>>> I still vote we use 20170521 which is recent enough that we won't have >>>> to change it in a few years. > >> That's the version available on Debian sid, so from the prospective of >> any Debian user this is a handy version to use when sending patches: >> perltidy/unstable,now 20170521-1 all [installed] > > I'm not hearing any objections or counterproposals, so let's go with > that version. > >>>> I further vote that we change the URL in pgindent/README from >>>> sourceforge to metacpan.org, >>>> https://metacpan.org/pod/distribution/Perl-Tidy/lib/Perl/Tidy.pod > > Agreed on pointing to cpan, but that page is pretty confusing if you're > looking for a non-bleeding-edge version. I suggest linking to > > https://cpan.metacpan.org/authors/id/S/SH/SHANCOCK/ > > which presents a handy directory listing. Linking to the author directory can be pretty confusing, because if a new (co-)maintainer releases something, it will not end up in this directory. But it is possible to point to the specific version PG needs like so: http://search.cpan.org/~shancock/Perl-Tidy-20170521/ That way visitors see the right version, but also have all the other data and can see all other (still existing) versions if they want. Whether that page is "more confusing" than a directory listing where you have to pick the right file is of course debatable. Regards, Tels
Re: Efficient output for integer types
Moin, On 2019-09-22 23:58, David Fetter wrote: On Sat, Sep 21, 2019 at 07:29:25AM +0100, Andrew Gierth wrote: >>>>> "David" == David Fetter writes: Fixed. Good work, more performance is sure nice :) Noticed one more thing in the patch: - *start++ = *a; - *a-- = swap; + memcpy(pos - 2, DIGIT_TABLE + c, 2); + i += 2; } + else + *a = (char) ('0' + value2); + + return olength; } The line "i += 2;" modifies i, but i is never used again nor returned. Best regards, Tels
Re: Transparent Data Encryption (TDE) and encrypted files
Moin, On 2019-09-30 23:26, Bruce Momjian wrote: For full-cluster Transparent Data Encryption (TDE), the current plan is to encrypt all heap and index files, WAL, and all pgsql_tmp (work_mem overflow). The plan is: https://wiki.postgresql.org/wiki/Transparent_Data_Encryption#TODO_for_Full-Cluster_Encryption We don't see much value to encrypting vm, fsm, pg_xact, pg_multixact, or other files. Is that correct? Do any other PGDATA files contain user data? IMHO the general rule in crypto is: encrypt everything, or don't bother. If you don't encrypt some things, somebody is going to find loopholes and sidechannels and partial-plaintext attacks. Just a silly example: If you trick the DB into putting only one row per page, any "bit-per-page" map suddenly reveals information about a single encrypted row that it shouldn't reveal. Many people with a lot of free time on their hands will sit around, drink a nice cup of tea and come up with all sorts of attacks on these things that you didn't (and couldn't) anticipate now. So IMHO it would be much better to err on the side of caution and encrypt everything possible. Best regards, Tels
Re: Declaring a strict function returns not null / eval speed
Moin, On 2019-10-20 13:30, Andreas Karlsson wrote: On 10/1/19 9:38 AM, Andres Freund wrote: We spend a surprising amount of time during expression evaluation to reevaluate whether input to a strict function (or similar) is not null, even though the value either comes from a strict function, or a column declared not null. Now you can rightfully say that a strict function still can return NULL, even when called with non-NULL input. But practically that's quite rare. Most of the common byvalue type operators are strict, and approximately none of those return NULL when actually called. That makes me wonder if it's worthwhile to invent a function property declaring strict strictness or such. It'd allow for some quite noticable improvements for e.g. queries aggregating a lot of rows, we spend a fair time checking whether the transition value has "turned" not null. I'm about to submit a patch making that less expensive, but it's still expensive. I can also imagine that being able to propagate NOT NULL further up the parse-analysis tree could be beneficial for planning, but I've not looked at it in any detail. Agreed, this sounds like something useful to do since virtually all strict functions cannot return NULL, especially the ones which are used in tight loops. The main design issue seems to be to think up a name for this new level of strictness which is not too confusing for end users. STRICT NONULL? That way you could do CREATE FUNCTION f1 ... STRICT; CREATE FUNCTION f2 ... STRICT NONULL; CREATE FUNCTION f3 ... NONULL; and the last would throw "not implemented yet"? "NEVER RETURNS NULL" would also rhyme with the existing "RETURNS NULL ON NULL INPUT", but I find the verbosity too high. Best regards, Tels
Re: Declaring a strict function returns not null / eval speed
Moin, On 2019-10-20 16:27, Tom Lane wrote: Tels writes: On 2019-10-20 13:30, Andreas Karlsson wrote: Agreed, this sounds like something useful to do since virtually all strict functions cannot return NULL, especially the ones which are used in tight loops. The main design issue seems to be to think up a name for this new level of strictness which is not too confusing for end users. STRICT NONULL? That way you could do CREATE FUNCTION f1 ... STRICT; CREATE FUNCTION f2 ... STRICT NONULL; CREATE FUNCTION f3 ... NONULL; and the last would throw "not implemented yet"? "NEVER RETURNS NULL" would also rhyme with the existing "RETURNS NULL ON NULL INPUT", but I find the verbosity too high. "RETURNS NOT NULL", perhaps? That'd have the advantage of not requiring any new keyword. Hm, yes, that would be a good compromise on verbosity and align even better with the other "RETURNS ..." variants. I'm a little bit skeptical of the actual value of adding this additional level of complexity, but I suppose we can't determine that reliably without doing most of the work :-( Maybe it would be possible to simulate the effect somehow? Or at least we could try to find practical queries where the additional information results in a much better plan if RETURNS NOT NULL was set. Best regards, Tels
Re: What does Time.MAX_VALUE actually represent?
Moin, On Sat, December 30, 2017 4:25 pm, Gavin Flower wrote: > On 12/31/2017 03:07 AM, Dave Cramer wrote: >> We are having a discussion on the jdbc project about dealing with >> 24:00:00. >> >> https://github.com/pgjdbc/pgjdbc/pull/992#issuecomment-354507612 >> >> Dave Cramer > > In Dublin (I was there 2001 to 2004), Time tables show buses just after > midnight, such as 1:20am as running at the time 2520 - so there are > visible close to the end of the day. If you are looking for buses > around midnight this is very user friendly - better than looking at the > other end of the time table for 0120. > > I think logically that 24:00:00 is exactly one day later than 00:00:00 - > but I see from following the URL, that there are other complications... Careful here, if "24:00:00" always means literally "00:00:00 one day later", that could work, but you can't just have it meaning "add 24 hours to the clock". For instance, during daylight saving time changes, days can be 23 hours or 25 hours long... Best wishes, Tels
Re: What does Time.MAX_VALUE actually represent?
Moin, On Sun, December 31, 2017 12:50 pm, Tom Lane wrote: > Peter Eisentraut writes: >> select timestamp '2017-12-30 24:00:00'; >> returns >> 2017-12-31 00:00:00 >> which makes some sense. > >> I don't know why we accept that and not '24:00:01' and beyond, but it's >> probably historical. > > We also accept > > regression=# select timestamp '2017-12-30 23:59:60'; > timestamp > - > 2017-12-31 00:00:00 > (1 row) > > which is maybe a little bit more defensible as a common behavior > to allow for leap seconds. (It's far from a correct implementation of > leap seconds, of course, but I think most versions of mktime() and > friends do likewise.) > > Digging into the git history, it looks like this choice dates to > > commit a93bf4503ffc6d7cd6243a6324fb2ef206b10adf > Author: Bruce Momjian > Date: Fri Oct 14 11:47:57 2005 + > > Allow times of 24:00:00 to match rounding behavior: > > regression=# select '23:59:59.9'::time(0); >time > -- > 24:00:00 > (1 row) > > This is bad because: > > regression=# select '24:00:00'::time(0); > ERROR: date/time field value out of range: "24:00:00" > > The last example now works. > > > There may at the time have been some argument about imprecision of > float timestamps involved, but it works the same way today once > you exceed the precision of our integer timestamps: > > regression=# select time '23:59:59.99'; > time > - > 23:59:59.99 > (1 row) > > regression=# select time '23:59:59.999'; >time > -- > 24:00:00 > (1 row) > > If we didn't allow '24:00:00' as a valid value then we'd need to > throw an error for '23:59:59.999', which doesn't seem nice. Hm, but shouldn't the result then be "00:00:00" instead of "24:00:00"? With addition it seems to work different: postgres=# select time '23:59:59.99' + interval '0.01 seconds'; ?column? -- 00:00:00 (1 row) Best regards, Tels
Re: Faster inserts with mostly-monotonically increasing values
Moin, On Tue, January 2, 2018 7:51 am, Pavan Deolasee wrote: > On Sun, Dec 31, 2017 at 4:36 PM, Peter Geoghegan wrote: > >> On Sun, Dec 31, 2017 at 6:44 AM, Pavan Deolasee >> wrote: >> > Here is a patch that implements the idea. If the last insert happens >> to >> be >> > in the rightmost block of an index, then we cache the block and check >> that >> > first for the next insert. For the cached block to be usable for the >> insert, >> > it should still be the rightmost, the to-be-inserted key should go >> into >> the >> > cached block and there is enough free space in the block. If any of >> these >> > conditions do not meet, we fall back to the regular code path, i.e. >> descent >> > down the index from the top. [snip] >> > So as the size of the index increases, the benefits of the patch also >> tend >> > to increase. This is most likely because as the index becomes taller >> and >> > taller, the costs associated with index descent becomes higher. >> >> FWIW, I think that int4 indexes have to be extremely large before they >> exceed 4 levels (where the leaf level counts as a level). My >> conservative back-of-an-envelope estimate is 9 billion index tuples. >> >> > Hmm Ok. I am not entirely sure whether it's the depth or just purely > binary > searching through 3-4 index pages and/or pinning/unpinning buffers result > in much overhead. I will run some more tests and collect evidences. Just a question trying to understand how btree indexes work: If one inserts ever-increasing value, is the tree a balanced tree with a minimum (or at least not-as-high) number of levels, or does it increase in height every insert and creates a "tall stack"? @Peter: Could you please share your back-of-the-envelope calculation, I'd love to get some insights into the innards. All the best, Tels
Re: [HACKERS] [PATCH] Incremental sort
Hello Alexander, On Thu, January 4, 2018 4:36 pm, Alexander Korotkov wrote: > On Fri, Dec 8, 2017 at 4:06 PM, Alexander Korotkov < > a.korot...@postgrespro.ru> wrote: > >> Thank you for pointing that. Sure, both cases are better. I've added >> second case as well as comments. Patch is attached. I had a quick look, this isn't a full review, but a few things struck me on a read through the diff: There are quite a few places where lines are broken like so: + ExecIncrementalSortInitializeWorker((IncrementalSortState *) planstate, + pwcxt); Or like this: + result = (PlanState *) ExecInitIncrementalSort( + (IncrementalSort *) node, estate, eflags); e.g. a param is on the next line, but aligned to the very same place where it would be w/o the linebreak. Or is this just some sort of artefact because I viewed the diff with tabspacing = 8? I'd fix the grammar here: + * Incremental sort is specially optimized kind of multikey sort when + * input is already presorted by prefix of required keys list. Like so: "Incremental sort is a specially optimized kind of multikey sort used when the input is already presorted by a prefix of the required keys list." + * Consider following example. We have input tuples consisting from "Consider the following example: We have ..." +* In incremental sort case we also have to cost to detect sort groups. "we also have to cost the detection of sort groups." "+ * It turns out into extra copy and comparison for each tuple." "This turns out to be one extra copy and comparison per tuple." + "Portions Copyright (c) 1996-2017" Should probably be 2018 now - time flies fast :) return_value = _readMaterial(); else if (MATCH("SORT", 4)) return_value = _readSort(); + else if (MATCH("INCREMENTALSORT", 7)) + return_value = _readIncrementalSort(); else if (MATCH("GROUP", 5)) return_value = _readGroup(); I think the ", 7" here is left-over from when it was named "INCSORT", and it should be MATCH("INCREMENTALSORT", 15)), shouldn't it? + space, fase when it's value for in-memory typo: "space, false when ..." + boolcmp; + cmp = cmpSortSkipCols(node, node->sampleSlot, slot); + + if (cmp) In the above, the variable cmp could be optimized away with: + if (cmpSortSkipCols(node, node->sampleSlot, slot)) (not sure if modern compilers won't do this, anway, though) +typedef struct IncrementalSortState +{ + ScanState ss; /* its first field is NodeTag */ + boolbounded;/* is the result set bounded? */ + int64 bound; /* if bounded, how many tuples are needed */ If I'm not wrong, the layout of the struct will include quite a bit of padding on 64 bit due to the mixing of bool and int64, maybe it would be better to sort the fields differently, e.g. pack 4 or 8 bools together? Not sure if that makes much of a difference, though. That's all for now :) Thank you for your work, Tels
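PS: To illustrate the padding remark with a self-contained example (typical 64-bit layout; the field names are made up and this is of course not the actual IncrementalSortState):

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

/* bools interleaved with int64s: each bool drags 7 bytes of padding along */
typedef struct
{
    bool    bounded;
    int64_t bound;
    bool    bounded_Done;
    int64_t bound_Done;
} Interleaved;              /* typically 32 bytes */

/* same fields, bools grouped at the end: less padding */
typedef struct
{
    int64_t bound;
    int64_t bound_Done;
    bool    bounded;
    bool    bounded_Done;
} Grouped;                  /* typically 24 bytes */

int
main(void)
{
    printf("interleaved: %zu bytes, grouped: %zu bytes\n",
           sizeof(Interleaved), sizeof(Grouped));
    return 0;
}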
Re: Faster inserts with mostly-monotonically increasing values
Hello Alvaro, On Thu, January 4, 2018 7:35 am, Alvaro Herrera wrote: > Pavan Deolasee wrote: >> On Tue, Jan 2, 2018 at 7:15 PM, Tels >> wrote: >> >> > Just a question trying to understand how btree indexes work: >> > >> > If one inserts ever-increasing value, is the tree a balanced tree with >> a >> > minimum (or at least not-as-high) number of levels, or does it >> increase in >> > height every insert and creates a "tall stack"? >> >> AFAIK all pages will be half-filled in this case. Imagine if one index >> page >> can accommodate N entries, the page will be split when (N+1)th entry is >> added. The left page will have half the entries and the right page will >> get >> the remaining. The next split will happen when the right page is full >> i.e. >> when another N/2 entries are added. Thus there will be a split at every >> N/2 >> inserts, creating an index with half filled pages. > > Not really -- quoth _bt_findsplitloc(): > > * If the page is the rightmost page on its level, we instead try to > arrange > * to leave the left split page fillfactor% full. In this way, when we > are > * inserting successively increasing keys (consider sequences, timestamps, > * etc) we will end up with a tree whose pages are about fillfactor% full, > * instead of the 50% full result that we'd get without this special case. > > > To answer Tels question: the tree is balanced by splitting pages and > propagating the splits upwards to the parent pages, all the way up to > the root. When the root page gets full, it is also split and a new root > is created on top of the old root plus its new sibling page, which is > the only point at which the tree grows in depth. Thank you for the explanation! You learn something new every time :) All the best, Tels
Re: [Patch] Make block and file size for WAL and relations defined at cluster creation
On Wed, January 3, 2018 4:04 pm, Robert Haas wrote: > On Wed, Jan 3, 2018 at 3:43 PM, Remi Colinet > wrote: >> Justifications are: > > I think this is all missing the point. If varying the block size (or > whatever) is beneficial, then having it configurable at initdb is > clearly useful. But, as Andres says, you haven't submitted any > evidence that this is the case. You need to describe scenarios in > which (1) a non-default blocksize performs better and (2) there's no > reasonable way to obtain the same performance improvement without > changing the block size. (Note: I gather that "block size" here is the page size, but I'm not entirely sure. So this detail might make my arguments moot :) First, I do think there is a chicken-and-egg problem here. If you can't vary the values except by re-compiling, almost no-one does. So you don't get any benchmarks, or data. And even if the author of the patch does this, he can only provide very spotty data, which might not be enough to draw the right conclusions. Plus, if the goal is to "select the optimal size" (as opposed to Remi's goal, which seems to me to be "make it easier to select the optimal size for one system at hand"), you might not be able to do this, as the "optimal" size is different for different conditions, but you can't try different values if the value is compiled in... Plus, isn't almost all advancement in computing that you replace "1 + 1" with "x + y"? :) So, I do think the patch has some merits, because at least it allows much easier benchmarking. And if the value was configurable at initdb time, I think more people would experiment and settle on their optimum. So for me the question here isn't "does it benefit everyone", but "does it benefit some, and what is the cost for others". Plus, I stumbled just by accident over a blog post from Tomas Vondra[0] from 2015, who seems to agree that different page sizes can be useful: https://blog.pgaddict.com/posts/postgresql-on-ssd-4kb-or-8kB-pages Methinks the next steps would be that more benchmarks are done on more distinct hardware to see what benefits we can see, and weigh this against the complexity the patch introduces into the code base. Hope that does make sense, Tels [0]: Tomas, great blog, btw!
Re: proposal: alternative psql commands quit and exit
On Mon, January 15, 2018 11:10 am, Robert Haas wrote: > On Mon, Jan 15, 2018 at 10:57 AM, Tom Lane wrote: >> Robert Haas writes: >>> I've discovered one thing about this design that is not so good, which >>> is that if you open a single, double, or dollar quote, then the >>> instructions that are provided under that design do not work: >> >> I was kind of imagining that we could make the hint text vary depending >> on the parsing state. Maybe that's too hard to get to --- but if the >> prompt-creating code knows what it is, perhaps this can too. > > prompt_status does seem to be available in psql's MainLoop(), so I > think that could be done, but the problem is that I don't know exactly > what message would be useful to print when we're in a "in quotes" > state. If I had a good message text for that state, I might just > choose to use it always rather than multiplying the number of > messages. > > More broadly, I think what is needed here is less C-fu than > English-fu. If we come up with something good, we can make it print > that thing. Maybe that's a Linux-specific thing, but I always use Ctrl-D to exit a console, and this works with psql, too, even when in the middle of a query typed. So maybe this could be suggested? Best wishes, Tels
Re: proposal: alternative psql commands quit and exit
Moin, On Mon, January 15, 2018 2:34 pm, Tels wrote: > Maybe that's a Linux-specific thing, but I always use Ctrl-D to exit a > console, and this works with psql, too, even when in the middle of a query > typed. > > So maybe this could be suggested? Sorry, should have really read the thread until the very end, please ignore my noise. Best wishes, Tels
Re: STATISTICS retained in CREATE TABLE ... LIKE (INCLUDING ALL)?
Moin, On Fri, January 26, 2018 2:30 am, David Rowley wrote: > On 21 January 2018 at 19:21, David Rowley > wrote: >> On 20 January 2018 at 18:50, Tom Lane wrote: >>> Stephen Froehlich writes: >>>> Are custom statistics in PG10 retained in LIKE (INCLUDING ALL) or do I >>>> need to recreate the statistics by hand each time I create a new >>>> table? >>> >>> LIKE doesn't know about those, no. Perhaps it should. >> >> Agreed. ALL does not mean MOST. > > (Moving to -hackers) > > The attached implements this. > > Looking at "LIKE ALL" in more detail in the docs it claims to be > equivalent to "INCLUDING DEFAULTS INCLUDING IDENTITY INCLUDING > CONSTRAINTS INCLUDING INDEXES INCLUDING STORAGE INCLUDING COMMENTS", > so it didn't seem right to magically have LIKE ALL include something > that there's no option to include individually, so I added INCLUDING > STATISTICS. Note sure if someone want's to exclude statistics, but it is good that one can. So +1 from me. Looking at the patch, at first I thought the order was sorted and you swapped STORAGE and STATISTICS by accident. But then, it seems the order is semi-random. Should that list be sorted or is it already sorted by some criteria that I don't see? - INCLUDING DEFAULTS INCLUDING IDENTITY INCLUDING CONSTRAINTS INCLUDING INDEXES INCLUDING STORAGE INCLUDING COMMENTS. + INCLUDING DEFAULTS INCLUDING IDENTITY INCLUDING CONSTRAINTS INCLUDING INDEXES INCLUDING STORAGE INCLUDING STATISTICS INCLUDING COMMENTS. Best wishes, Tels
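PS: Assuming the patch goes in as posted, usage would presumably look something like this (untested sketch, table and statistics names invented):

CREATE TABLE orders (a int, b int);
CREATE STATISTICS orders_ab (dependencies) ON a, b FROM orders;

-- with the patch, the extended statistics object is copied as well:
CREATE TABLE orders_copy  (LIKE orders INCLUDING STATISTICS);
-- or implicitly, since INCLUDING ALL would now cover it:
CREATE TABLE orders_copy2 (LIKE orders INCLUDING ALL);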
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Mon, February 5, 2018 4:27 pm, Robert Haas wrote: > On Mon, Feb 5, 2018 at 1:03 PM, Peter Geoghegan wrote: >> It certainly is common. In the case of logtape.c, we almost always >> write out some garbage bytes, even with serial sorts. The only >> difference here is the *sense* in which they're garbage: they're >> uninitialized bytes, which Valgrind cares about, rather than byte from >> previous writes that are left behind in the buffer, which Valgrind >> does not care about. > > /me face-palms. > > So, I guess another option might be to call VALGRIND_MAKE_MEM_DEFINED > on the buffer. "We know what we're doing, trust us!" > > In some ways, that seems better than inserting a suppression, because > it only affects the memory in the buffer. > > Anybody else want to express an opinion here? Are the uninitialized bytes that are written out "whatever was in the memory previously" or just some "0x00 bytes from the allocation but not yet overwritten by the PG code"? Because the first sounds like it could be a security problem - if random junk bytes go out to the disk, and stay there, information could inadvertently leak to permanent storage. Best regards, Tels
Re: Using scalar function as set-returning: bug or feature?
Moin Tom, On Mon, February 12, 2018 5:03 pm, Tom Lane wrote: > Pavel Stehule writes: >> 2018-02-09 12:02 GMT+01:00 Marko Tiikkaja : >>> This is quite short-sighted. The better way to do this is to complain >>> if >>> the number of expressions is different from the number of target >>> variables >>> (and the target variable is not a record-ish type). There's been at >>> least >>> two patches for this earlier (one my me, and one by, I think Pavel >>> Stehule). I urge you to dig around in the archives to avoid wasting >>> your >>> time. [snip a bit] > As things stand today, we would have a hard time tightening that up > without producing unwanted complaints about the cases mentioned in > this comment, because the DTYPE_ROW logic is used for both "INTO a,b,c" > and composite-type variables. However, my pending patch at > https://commitfest.postgresql.org/17/1439/ > gets rid of the use of DTYPE_ROW for composite types, and once that > is in it might well be reasonable to just throw a flat-out error for > wrong number of source values for a DTYPE_ROW target. I can't > immediately think of any good reason why you'd want to allow for > the number of INTO items not matching what the query produces. Perl lets you set a fixed number of multiple variables from an array and discard the rest like so: my ($a, $b) = (1,2,3); and the right hand side can also be a function: my ($c, $d) = somefunc( 123 ); Where somefunc() returns more than 2 params. This is quite handy if you sometimes need more ore less return values, but don't want to create almost-identical functions for each case. I'm not sure if you mean exactly the scenario as in the attached test case, but this works in plpgsql, too, and would be a shame to lose. OTOH, one could also write: SELECT INTO ba, bb a,b FROM foo(1); and it would still work, or wouldn't it? Best regards, Tels test.psql Description: Binary data test.pl Description: Perl program
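PS: For reference, a minimal sketch of the scenario I have in mind (function and variable names are invented; this only roughly mirrors the attached test case):

CREATE FUNCTION somefunc(i int) RETURNS TABLE (a int, b int, c int)
  LANGUAGE sql AS $$ SELECT $1, $1 + 1, $1 + 2 $$;

DO $body$
DECLARE
  ba int;
  bb int;
BEGIN
  -- two targets, three source columns: today the extra column is
  -- silently discarded, which is exactly the behaviour under discussion
  SELECT * INTO ba, bb FROM somefunc(123);
  RAISE NOTICE 'ba=%, bb=%', ba, bb;
END;
$body$;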
Re: pglz performance
Hello Andrey, On 2019-11-02 12:30, Andrey Borodin wrote: On 1 Nov 2019, at 18:48, Alvaro Herrera wrote: PFA two patches: v4-0001-Use-memcpy-in-pglz-decompression.patch (known as 'hacked' in test_pglz extension) v4-0001-Use-memcpy-in-pglz-decompression-for-long-matches.patch (known as 'hacked8') Looking at the patches, it seems only the case of a match is changed. But when we observe a literal byte, this is copied byte-by-byte with:

else
{
    /*
     * An unset control bit means LITERAL BYTE. So we just
     * copy one from INPUT to OUTPUT.
     */
    *dp++ = *sp++;
}

Maybe we can optimize this, too. For instance, you could just increase a counter:

else
{
    /*
     * An unset control bit means LITERAL BYTE. We count
     * these and copy them later.
     */
    literal_bytes++;
}

and in the case of a match:

if (ctrl & 1)
{
    /* First copy all the pending literal bytes */
    if (literal_bytes > 0)
    {
        memcpy(dp, sp, literal_bytes);
        sp += literal_bytes;
        dp += literal_bytes;
        literal_bytes = 0;
    }

(Code untested!) The same would need to be done at the very end, if the input ends without any new CTRL byte. Whether that gains us anything depends on how common literal bytes are. It might be that highly compressible input has almost none, while input that is a mix of incompressible strings and compressible ones might have longer stretches. One example would be something like an SHA-256 hash that is repeated twice: the first instance would be incompressible, the second one would be just a copy. This might not happen that often in practical inputs, though. I wonder if you agree and what would happen if you try this variant on your corpus tests. Best regards, Tels
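PS: Here is a slightly more complete (but still rough and untested) C sketch of the idea, NOT the real pglz_decompress(): bounds checks and the match case are elided, and the names are made up. A literal run has to be flushed before the next control byte or match descriptor is read, because those follow the pending literal bytes in the source stream:

#include <stddef.h>
#include <string.h>

static void
decompress_sketch(const unsigned char *sp, const unsigned char *srcend,
                  unsigned char *dp)
{
    size_t      literal_bytes = 0;

    while (sp + literal_bytes < srcend)
    {
        unsigned char ctrl;

        /* pending literals precede the next control byte: flush them */
        if (literal_bytes > 0)
        {
            memcpy(dp, sp, literal_bytes);
            dp += literal_bytes;
            sp += literal_bytes;
            literal_bytes = 0;
        }

        ctrl = *sp++;

        for (int bit = 0; bit < 8 && sp + literal_bytes < srcend;
             bit++, ctrl >>= 1)
        {
            if (ctrl & 1)
            {
                /* a match follows the pending literals: flush them first */
                if (literal_bytes > 0)
                {
                    memcpy(dp, sp, literal_bytes);
                    dp += literal_bytes;
                    sp += literal_bytes;
                    literal_bytes = 0;
                }
                /* ... decode the match from *sp and copy it, as today ... */
            }
            else
                literal_bytes++;    /* just remember it for now */
        }
    }

    /* input may end in the middle of a literal run */
    if (literal_bytes > 0)
        memcpy(dp, sp, literal_bytes);
}

With this structure a single memcpy() covers at most 8 literal bytes (one control byte's worth); batching across control bytes would require peeking past the pending run for the next control byte, which is probably not worth the complexity.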
Re: backup manifests
y the backup. A few percentage points is probably not a big deal, but a user who has an 8-hour window to get the backup done overnight will not be happy if it's taking 6 hours now and we tack 40%-50% on to that. So I think that we either have to disable backup checksums by default, or figure out a way to get the overhead down to something a lot smaller than what current tests are showing -- which we could possibly do without changing the algorithm if we can somehow make it a lot cheaper, but otherwise I think the choice is between disabling the functionality altogether by default and adopting a less-expensive algorithm. Maybe someday when delta restore is in core and widely used and CPUs are faster, it'll make sense to revise the default, and that's cool, but I can't see imposing a big overhead by default to enable a feature core doesn't have yet... Modern algorithms are amazingly fast on modern hardware, some are even implemented in hardware nowadays: https://software.intel.com/en-us/articles/intel-sha-extensions Quote from: https://neosmart.net/blog/2017/will-amds-ryzen-finally-bring-sha-extensions-to-intels-cpus/ "Despite the extremely limited availability of SHA extension support in modern desktop and mobile processors, crypto libraries have already upstreamed support to great effect. Botan’s SHA extension patches show a significant 3x to 5x performance boost when taking advantage of the hardware extensions, and the Linux kernel itself shipped with hardware SHA support with version 4.4, bringing a very respectable 3.6x performance upgrade over the already hardware-assisted SSE3-enabled code." If you need to load the data from disk and shove it over a network, the hashing will certainly be very little overhead; it might even be completely invisible, since it can run in parallel to all the other things. Sure, there is the thing called zero-copy networking, but if you have to compress the data before sending it to the network, you have to put it through the CPU anyway. And if you have more than one core, the second one can do the hashing in parallel to the first one doing the compression. To get a feeling, one can use: openssl speed md5 sha1 sha256 sha512 On my really-not-fast desktop CPU (i5-4690T CPU @ 2.50GHz) it says:

The 'numbers' are in 1000s of bytes per second processed.
type          16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
md5         122638.55k   277023.96k   487725.57k   630806.19k   683892.74k   688553.98k
sha1        127226.45k   313891.52k   632510.55k   865753.43k   960995.33k   977215.19k
sha256       77611.02k   173368.15k   325460.99k   412633.43k   447022.92k   448020.48k
sha512       51164.77k   205189.87k   361345.79k   543883.26k   638372.52k   645933.74k

Or in other words, it can hash nearly 931 MByte/s with SHA-1 and about 427 MByte/s with SHA-256 (if I haven't miscalculated something). You'd need a pretty fast disk (aka M.2 SSD) and network (aka > 1 Gbit) to top these speeds, and then you'd use a real CPU for your server, not some poor Intel powersaving surfing thingy-majingy :)
Re: backup manifests
Moin, On 2019-11-22 23:30, David Steele wrote: On 11/22/19 5:15 PM, Tels wrote: On 2019-11-22 20:01, Robert Haas wrote: On Fri, Nov 22, 2019 at 1:10 PM David Steele wrote: > Phrased more positively, if you want a cryptographic hash > at all, you should probably use one that isn't widely viewed as too > weak. Sure. There's another advantage to picking an algorithm with lower collision rates, though. CRCs are fine for catching transmission errors (as caveated above) but not as great for comparing two files for equality. With strong hashes you can confidently compare local files against the path, size, and hash stored in the manifest and save yourself a round-trip to the remote storage to grab the file if it has not changed locally. I agree in part. I think there are two reasons why a cryptographically strong hash is desirable for delta restore. First, since the checksums are longer, the probability of a false match happening randomly is lower, which is important. Even if the above analysis is correct and the chance of a false match is just 2^-32 with a 32-bit CRC, if you back up ten million files every day, you'll likely get a false match within a few years or less, and once is too often. Second, unlike what I supposed above, the contents of a PostgreSQL data file are not chosen at random, unlike transmission errors, which probably are more or less random. It seems somewhat possible that there is an adversary who is trying to choose the data that gets stored in some particular record so as to create a false checksum match. A CRC is a lot easier to fool than a crytographic hash, so I think that using a CRC of *any* length for this kind of use case would be extremely dangerous no matter the probability of an accidental match. Agreed. See above. However, if you choose a hash, please do not go below SHA-256. Both MD5 and SHA-1 already had collision attacks, and these only got to be bound to be worse. I don't think collision attacks are a big consideration in the general case. The manifest is generally stored with the backup files so if a file is modified it is then trivial to modify the manifest as well. That is true. However, a simple way around this is to sign the manifest with a public key l(GPG or similiar). And if the manifest contains strong, hard-to-forge hashes, we got a mure more secure backup, where (almost) nobody else can alter the manifest, nor can he mount easy collision attacks against the single files. Without the strong hashes it would be pointless to sign the manifest. Of course, you could store the manifest separately or even just know the hash of the manifest and store that separately. In that case SHA-256 might be useful and it would be good to have the option, which I believe is the plan. I do wonder if you could construct a successful collision attack (even in MD5) that would also result in a valid relation file. Probably, at least eventually. With MD5, certainly. One way is to have two block of 512 bits that hash to the different MD5s. It is trivial to re-use one already existing from the known examples. Here is one, where the researchers constructed 12 PDFs that all have the same MD5 hash: https://www.win.tue.nl/hashclash/Nostradamus/ If you insert one of these blocks into a relation and dump it, you could swap it (probably?) out on disk for the other block. I'm not sure this is of practical usage as an attack, tho. It would, however, cast doubt on the integrity of the backup and prove that MD5 is useless. 
OTOH, finding a full collision with MD5 should also be in reach with todays hardware. It is hard find exact numbers but this: https://www.win.tue.nl/hashclash/SingleBlock/ gives the following numbers for 2008/2009: "Finding the birthday bits took 47 hours (expected was 3 days) on the cluster of 215 Playstation 3 game consoles at LACAL, EPFL. This is roughly equivalent to 400,000 hours on a single PC core. The single near-collision block construction took 18 hours and 20 minutes on a single PC core." Today one can probably compute it on a single GPU in mere hours. And you can rent massive amounts of them in the cloud for real cheap. Here are a few, now a bit dated, references: https://blog.codinghorror.com/speed-hashing/ http://codahale.com/how-to-safely-store-a-password/ Best regards, Tels
Re: backup manifests
On 2019-11-24 15:38, David Steele wrote: On 11/23/19 4:34 PM, Andrew Dunstan wrote: On 11/23/19 3:13 AM, Tels wrote: Without the strong hashes it would be pointless to sign the manifest. I guess I must have missed where we are planning to add a cryptographic signature. I don't think we were planning to, but the user could do so if they wished. That was what I meant. Best regards, Tels
Re: backup manifests
Moin, sorry for the very late reply. There was a discussion about the specific format of the backup manifests, and maybe that was already discussed and I just overlooked it: 1) Why invent your own format, and not just use a machine-readable format that already exists? It doesn't have to be full-blown XML, or even JSON; something as simple as YAML would already be better. That way not everyone has to write their own parser. Or maybe it is already YAML and just the different keywords were under discussion? 2) It would be very wise to add a version number to the format. That will make extending it later much easier and avoid the "we need to add X, but that breaks compatibility with all software out there" situations that often arise a few years down the line. Best regards, and a happy New Year 2020 Tels
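PS: Purely as an illustration of both points (an existing machine-readable format plus an explicit version field) - this is NOT the format being proposed in the thread, just a sketch of what a YAML manifest could look like:

manifest-version: 1
files:
  - path: base/13593/16384
    size: 8192
    last-modified: "2020-01-01 12:00:00 GMT"
    checksum-algorithm: sha256
    checksum: "<hex digest>"

Any standard parser can read that, and the version field lets later releases add fields without breaking existing tools.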
Re: Translations contributions urgently needed
On Thu, February 22, 2018 11:04 pm, Tom Lane wrote: > Alvaro Herrera writes: >> Please join pgsql-translat...@postgresql.org. > > What surprises me about this thread is that apparently the sad state > of the v10 translations wasn't already discussed on that list? The list is more or less dead - over the past 6 months there where only between 1 and 5 messages per month. > I have no objection to calling for more translation volunteers on > this list --- in fact, probably it'd be a good idea to call for > more help on pgsql-general, too. But it doesn't seem like it's > quite on-topic for -hackers otherwise. I do think it would be a good idea to at least send a translation summary once a quarter to the translators list and maybe include hackers and general - many might not be aware of the 80% rule. Also an automated summary one ore two months before a major release as a "call for action" would be good to avoid such mishaps. Last but not least I'd be able to help with the german translation, but I have no clear idea how, or what the actual status is. Are German translators actually needed? And while I have no problem entering some texts on a website or edit a file with an editor, the process described here: https://www.postgresql.org/docs/10/static/nls-translator.html sounds complicated and leaves me unclear on quite a few things, for instance how would I submit whatever work I've done? Do I need git? An account? Where? And how does it all work? I guess a lot of potential translators who aren't programmers would be left baffled, too. Best regards, Tels
Re: [HACKERS] [POC] Faster processing at Gather node
Hello Robert, On Fri, March 2, 2018 12:22 pm, Robert Haas wrote: > On Wed, Feb 28, 2018 at 10:06 AM, Robert Haas > wrote: >> [ latest patches ] > > Committed. Thanks for the review. Cool :) There is a typo, tho: + /* +* If the counterpary is known to have attached, we can read mq_receiver +* without acquiring the spinlock and assume it isn't NULL. Otherwise, +* more caution is needed. +*/ s/counterpary/counterparty/; Sorry, only noticed while re-reading the thread. Also, either a double space is missing, or one is too many: + /* +* Separate prior reads of mq_ring from the increment of mq_bytes_read +* which follows. Pairs with the full barrier in shm_mq_send_bytes(). We +* only need a read barrier here because the increment of mq_bytes_read is +* actually a read followed by a dependent write. +*/ (" Pairs ..." vs. ". We only ...") Best regards, Tels
Re: 2018-03 Commitfest Summary (Andres #1)
Moin, On Fri, March 2, 2018 2:48 pm, Andres Freund wrote: > Hi, > > On 2018-03-02 11:06:12 +0100, Fabien COELHO wrote: >> A lot of patches do not even get a review: no immediate interest or more >> often no ressources currently available, patch stays put, I'm fine with >> that. > > Well, even if that's the case that's not free of cost to transport them > from fest to fest. Going through them continually isn't free. At some > point we just gotta decide it's an undesirable feature. Erm, I don't think that from A: "There are not enough reviewer capacities" necessarily follows B: "This is an undesirable feature." That PG doesn't have enough reviewers doesn't mean users/authors want or need features - it just means there are not enough people who are capable of doing a review, or they haven't been found yet, or worse, turned away by the current process. Sure, sometimes a lack of interest just means a lack of interest - but not every user of PG reads or follows the -hackers list or does (or can) even contribute to PG. And, the most strongest point to me is: Somebody showed soo much interest they got up and wrote a patch. Even if nobody reviewed it, that is already a high hurdle cleared, isn't it? >> Now, some happy patches actually get reviews and are switched to ready, >> which shows that somebody saw enough interest in them to spend some time >> to >> improve them. >> >> If committers ditch these reviewed patches on weak ground (eg "I do not >> need >> this feature so nobody should need it"), it is both in contradiction >> with >> the fact that someone took the time to review it, and is highly >> demotivating >> for people who do participate to the reviewing process and contribute to >> hopefully improve these patches, because the reviewing time just goes to >> the >> drain in the end even when the patch is okay. > > The consequence of this appears to be that we should integrate > everything that anybody decided worthy enough to review. And likewise, I don't think the reverse follows, either. [snipabit] >> So for me killing ready patches in the end of the process and on weak >> ground >> can only make the process worse. The project is shooting itself in the >> foot, >> and you cannot complain later that there is not enough reviewers. That would also be my opinion. > How would you want it to work otherwise? Merge everything that somebody > found review worthy? Have everything immediately reviewed by committers? > Neither of those seem realistic. True, but the process could probably be streamlined quite a bit. As an outsider, I do wonder why sometimes long time periods go by without any action - but when then a deadline comes, the action is taken suddenly without the involved parties having had the time to respond first (I mean "again", of course they could have responded earlier. But you know how it is in a busy world, and with PG being a "side-project"). That is unsatisfactory for the authors (who either rebase patches needlessly, or find their patch not getting a review because it has bitrotted again) or reviewers (because they find patches have bitrotted) and everyone else (because even the simplest features or problem-solving fixes can take easily a year or two, if they don't get rejected outright). Maybe an idea would be to send automatic notifications about patches that need rebasing, or upcoming deadlines like starting commit fests? In addition, clear rules and well-formulated project goals would help a lot. Also, the discussion about "needs of the project" vs. 
the "needs of the users" [0] should be separate from the "what do we do about the lack of manpower" discussion. Because if you argue what you want with the what you have, you usually end up with nothing in both departments. Best regards, Tels [0]: E.g. how much is it worth to have a clean, pristine state of source code vs. having some features and fixes for the most pressing items in a somewhat timely manner? I do think that there should be at least different levels between PG core and utilities like pgbench.
Re: [WIP PATCH] Index scan offset optimisation using visibility map
Hello Michail, On Tue, March 6, 2018 4:03 pm, Michail Nikolaev wrote: > Hello, Andrey. > > Thanks for review. > > I have updated comments according your review also renamed some fields for > consistency. > Additional some notes added to documentation. > > Updated patch in attach, github updated too. That is a cool idea, and can come in very handy if you regularly need large offsets. Cannot comment on the code, but there is at least one recurring typo:

+ * Decrement counter for remaning skipped tuples.
+ * If last tuple skipped - release the buffer.
+ */
+ if (node->iss_TuplesSkippedRemaning > 0)
+     node->iss_TuplesSkippedRemaning--;

The English word is "remaining", with "ai" instead of "a" :) Also, the variable name "iss_TuplesSkipped" has its grammar at odds with what the code actually does. It could be named "SkipTuples" (e.g. this is the number of tuples we need to skip, not the number we have skipped), and the other one then "iss_SkipTuplesRemaining", so they are consistent with each other. Others might have a better name for these two, of course. Best wishes, Tels
Re: [HACKERS] plpgsql - additional extra checks
Hello Pavel and Tomas, On Tue, March 20, 2018 12:36 am, Pavel Stehule wrote: > 2018-03-19 21:47 GMT+01:00 Tomas Vondra : > >> Hi, >> >> I'm looking at the updated patch (plpgsql-extra-check-180316.patch), and >> this time it applies and builds OK. The one thing I noticed is that the >> documentation still uses the old wording for strict_multi_assignement: >> >> WARNING: Number of evaluated fields does not match expected. >> HINT: strict_multi_assignement check of extra_warnings is active. >> WARNING: Number of evaluated fields does not match expected. >> HINT: strict_multi_assignement check of extra_warnings is active. >> >> This was reworded to "Number of source and target fields in assignment >> does not match." >> I believe the correct wording should be: "Number of source and target fields in assignment do not match." ecause comparing one number to the other means "the number A and B _do_ not match", not "the number A does not match number B". Also there is an inconsistent quoting here: + +Setting plpgsql.extra_warnings, or +plpgsql.extra_errors, as appropriate, to all no quotes for 'all'. +is encouraged in development and/or testing environments. + + + +These additional checks are enabled through the configuration variables +plpgsql.extra_warnings for warnings and +plpgsql.extra_errors for errors. Both can be set either to +a comma-separated list of checks, "none" or +"all". quotes here around '"all"'. I think it should be one or the other in both cases. Also: + Currently +the list of available checks includes only one: but then it lists more than one check? Best wishes, Tels
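PS: For anyone wanting to try the check while reviewing, something along these lines should exercise it, assuming the checks are committed under the names used in the patch (untested sketch):

SET plpgsql.extra_warnings TO 'strict_multi_assignment';

DO $$
DECLARE
  x int;
  y int;
BEGIN
  -- three source fields, two targets: should raise the new warning
  SELECT 1, 2, 3 INTO x, y;
END;
$$;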
Re: truncating timestamps on arbitrary intervals
Hello John, this looks like a nice feature. I'm wondering how it relates to the following use-case: When drawing charts, the user can select pre-defined time widths (like "15 min", "1 hour"). The data for these slots is always fitted to intervals that start at the slot boundary, e.g. if the user selects "15 min", the interval always starts at xx:00, xx:15, xx:30 or xx:45. This is to aid caching of the resulting data, and so that requesting the same chart at xx:00 and xx:01 actually draws the same chart, and not some bar with only one minute of data at the end. In SQL, this works out to using this as GROUP BY and then summing up all values: SELECT FLOOR(EXTRACT(EPOCH FROM thetime) / 3600) * 3600, SUM(events) FROM mytable ... GROUP BY 1; where 3600 means "hourly". It is of course easy for things like "1 hour", but for yearly I had to come up with things like: EXTRACT(YEAR FROM thetime) * 12 + EXTRACT(MONTH FROM thetime) and it gets even more involved for weeks, weekdays or odd things like fortnights. So would that mean one could replace this by: date_trunc_interval('3600 seconds'::interval, thetime) and it would be easier, and (hopefully) faster? Best regards, Tels
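PS: To spell out the two variants side by side (using the function name and argument order from the examples in this thread, which may of course still change):

-- today: hourly buckets via epoch arithmetic
SELECT FLOOR(EXTRACT(EPOCH FROM thetime) / 3600) * 3600 AS bucket, SUM(events)
  FROM mytable
 GROUP BY 1;

-- with the proposed function: any width works, e.g. 15-minute buckets
SELECT date_trunc_interval('15 minutes'::interval, thetime) AS bucket, SUM(events)
  FROM mytable
 GROUP BY 1;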
Re: Speed up Hash Join by teaching ExprState about hashing
Hello David, you wrote: v4 patch attached. If nobody else wants to look at this then I'm planning on pushing it soon. Had a very brief look at this, and one bit caught my attention:

+ EEO_CASE(EEOP_HASHDATUM_NEXT32_STRICT)
+ {
+     FunctionCallInfo fcinfo = op->d.hashdatum.fcinfo_data;
+     uint32 existing_hash = DatumGetUInt32(*op->resvalue);
+     uint32 hashvalue;
+
+     /* combine successive hash values by rotating */
+     existing_hash = pg_rotate_left32(existing_hash, 1);
+
+     if (fcinfo->args[0].isnull)
+     {

Is it necessary to rotate existing_hash here before checking for isnull? Because in case of isnull, isn't the result of the rotate thrown away? Or in other words, maybe this bit here can be moved to after the isnull check:

+     /* combine successive hash values by rotating */
+     existing_hash = pg_rotate_left32(existing_hash, 1);

-- Best regards, Tels
Re: Tighten up a few overly lax regexes in pg_dump's tap tests
Moin, On Wed, February 6, 2019 8:10 pm, Michael Paquier wrote: > On Wed, Feb 06, 2019 at 03:41:20PM +0100, Daniel Gustafsson wrote: >> Correct. One could argue that the regex is still suboptimal since >> “COMMENT ON >> DATABASE postgres IS ;” will be matched as well, but there I think the >> tradeoff >> for readability wins. > > Okay, that looks like an improvement anyway, so committed after going > over the tests for similar problems, and there was one for CREATE > DATABASE and DROP ROLE. It is possible to have a regex which tells as > at least one non-whitespace character, but from what I can see these > don't really improve the readability. Sorry for being late to the party, but it just occured to me that while ".+" is definitely an improvement over ".*", it isn't foolproof either, as it also matches ";". Thus: regexp => qr/^COMMENT ON DATABASE postgres IS .+;/m, matches things like: 'COMMENT ON DATABASE postgres IS ;;' 'COMMENT ON DATABASE postgres IS ;' 'COMMENT ON DATABASE postgres IS --;' etc. I'm not sure it is really necessary to deal with these cases, but one possibility would be to pre-compute regexps: $QR_COMMENT = qr/[^ ;]+/; $QR_IDENTIFIER = qr/[^ ;]+/; # etc or whataver is the thing that should actually be matched here and use them like so: regexp => qr/^COMMENT ON DATABASE postgres IS $QR_COMMENT;/m, That way it is easily changable and quite readable. Oh, one more question. Shouldn't these regexps that start with "^" also end with "$"? Or can there be output like: 'COMMENT ON DATABASE postgres IS $QR_IDENTIFIER; SELECT 1;' ? Best regards, Tels
Re: In PG12, query with float calculations is slower than PG11
Moin, On 2020-02-07 15:42, Emre Hasegeli wrote: > The patch looks unduly invasive to me, but I think that it might be > right that we should go back to a macro-based implementation, because > otherwise we don't have a good way to be certain that the function > parameter won't get evaluated first. I'd first like to see some actual evidence of this being a problem, rather than just the order of the checks. There seem to be enough evidence of this being the problem. We are better off going back to the macro-based implementation. I polished Keisuke Kuroda's patch commenting about the performance issue, removed the check_float*_val() functions completely, and added unlikely() as Tom Lane suggested. It is attached. I confirmed with different compilers that the macro, and unlikely() makes this noticeably faster. Hm, the diff has the macro tests as: + if (unlikely(isinf(val) && !(inf_is_valid))) ... + if (unlikely((val) == 0.0 && !(zero_is_valid))) But the comment does not explain that this test has to be in that order, or the compiler will for non-constant arguments evalute the (now) right-side first. E.g. if I understand this correctly: + if (!(zero_is_valid) && unlikely((val) == 0.0) would have the same problem of evaluating "zero_is_valid" (which might be an isinf(arg1) || isinf(arg2)) first and so be the same thing we try to avoid with the macro? Maybe adding this bit of info to the comment makes it clearer? Also, a few places use the macro as: + CHECKFLOATVAL(result, true, true); which evaluates to a complete NOP in both cases. IMHO this could be replaced with a comment like: + // No CHECKFLOATVAL() needed, as both inf and 0.0 are valid (or something along the lines of "no error can occur"), as otherwise CHECKFLOATVAL() implies to the casual reader that there are some checks done, while in reality no real checks are done at all (and hopefully the compiler optimizes everything away, which might not be true for debug builds). -- Best regards, TelsFrom e869373ad093e668872f08833de2c5c614aab673 Mon Sep 17 00:00:00 2001 From: Emre Hasegeli Date: Fri, 7 Feb 2020 10:27:25 + Subject: [PATCH] Bring back CHECKFLOATVAL() macro The inline functions added by 6bf0bc842b caused the conditions of overflow/underflow checks to be evaluated when no overflow/underflow happen. This slowed down floating point operations. This commit brings back the macro that was in use before 6bf0bc842b to fix the performace regression. 
Reported-by: Keisuke Kuroda Author: Keisuke Kuroda Discussion: https://www.postgresql.org/message-id/CANDwggLe1Gc1OrRqvPfGE%3DkM9K0FSfia0hbeFCEmwabhLz95AA%40mail.gmail.com --- src/backend/utils/adt/float.c | 66 ++--- src/backend/utils/adt/geo_ops.c | 2 +- src/include/utils/float.h | 75 ++--- 3 files changed, 66 insertions(+), 77 deletions(-) diff --git a/src/backend/utils/adt/float.c b/src/backend/utils/adt/float.c index a90d4db215..5885719850 100644 --- a/src/backend/utils/adt/float.c +++ b/src/backend/utils/adt/float.c @@ -1184,21 +1184,21 @@ ftod(PG_FUNCTION_ARGS) /* * dtof - converts a float8 number to a float4 number */ Datum dtof(PG_FUNCTION_ARGS) { float8 num = PG_GETARG_FLOAT8(0); - check_float4_val((float4) num, isinf(num), num == 0); + CHECKFLOATVAL((float4) num, isinf(num), num == 0); PG_RETURN_FLOAT4((float4) num); } /* * dtoi4 - converts a float8 number to an int4 number */ Datum dtoi4(PG_FUNCTION_ARGS) @@ -1438,36 +1438,36 @@ dsqrt(PG_FUNCTION_ARGS) float8 arg1 = PG_GETARG_FLOAT8(0); float8 result; if (arg1 < 0) ereport(ERROR, (errcode(ERRCODE_INVALID_ARGUMENT_FOR_POWER_FUNCTION), errmsg("cannot take square root of a negative number"))); result = sqrt(arg1); - check_float8_val(result, isinf(arg1), arg1 == 0); + CHECKFLOATVAL(result, isinf(arg1), arg1 == 0); PG_RETURN_FLOAT8(result); } /* * dcbrt - returns cube root of arg1 */ Datum dcbrt(PG_FUNCTION_ARGS) { float8 arg1 = PG_GETARG_FLOAT8(0); float8 result; result = cbrt(arg1); - check_float8_val(result, isinf(arg1), arg1 == 0); + CHECKFLOATVAL(result, isinf(arg1), arg1 == 0); PG_RETURN_FLOAT8(result); } /* * dpow - returns pow(arg1,arg2) */ Datum dpow(PG_FUNCTION_ARGS) { @@ -1525,40 +1525,40 @@ dpow(PG_FUNCTION_ARGS) /* The sign of Inf is not significant in this case. */ result = get_float8_infinity(); else if (fabs(arg1) != 1) result = 0; else result = 1; } else if (errno == ERANGE && result != 0 && !isinf(result)) result = get_float8_infinity(); - check_float8_val(result, isinf(arg1) || isinf(arg2), arg1 == 0); + CHECKFLOATVAL(result, isinf(arg1) || isinf(arg2), arg1 == 0); PG_RETURN_FLOAT8(result); } /* * dexp - returns the exponential function of arg1 */ Datum dexp(PG_FUNCTION_ARGS) { float8 arg1 = PG_GETARG_FLOAT8(0); float8 result; errno = 0;
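PS, coming back to the evaluation-order point above: here is a stripped-down, compilable illustration of why the macro form is cheaper in the common case; the report_*() stubs stand in for ereport() and unlikely() is left out:

#include <math.h>
#include <stdbool.h>

static void report_overflow(void)  { /* stand-in for ereport(ERROR, ...) */ }
static void report_underflow(void) { /* stand-in for ereport(ERROR, ...) */ }

/* function form: the inf_is_valid/zero_is_valid *arguments* are computed
 * by the caller on every call, even when val is perfectly finite */
static inline void
check_float8_val_fn(double val, bool inf_is_valid, bool zero_is_valid)
{
    if (isinf(val) && !inf_is_valid)
        report_overflow();
    if (val == 0.0 && !zero_is_valid)
        report_underflow();
}

/* macro form: the *_is_valid expressions sit to the right of && and are
 * short-circuited away whenever the cheap test on the left is false */
#define CHECKFLOATVAL_SKETCH(val, inf_is_valid, zero_is_valid) \
    do { \
        if (isinf(val) && !(inf_is_valid)) \
            report_overflow(); \
        if ((val) == 0.0 && !(zero_is_valid)) \
            report_underflow(); \
    } while (0)

static double
dmul_demo(double arg1, double arg2)
{
    double result = arg1 * arg2;

    /* here isinf(arg1) || isinf(arg2) is always evaluated ... */
    check_float8_val_fn(result, isinf(arg1) || isinf(arg2),
                        arg1 == 0.0 || arg2 == 0.0);
    /* ... while here it is only evaluated if result itself is infinite */
    CHECKFLOATVAL_SKETCH(result, isinf(arg1) || isinf(arg2),
                         arg1 == 0.0 || arg2 == 0.0);
    return result;
}

In principle the compiler could inline the function and sink those argument computations as well, but the benchmarks in this thread suggest it does not do so reliably, which is the whole point of going back to the macro.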
Re: Some improvements to numeric sqrt() and ln()
Dear Dean, On 2020-03-01 20:47, Dean Rasheed wrote: On Fri, 28 Feb 2020 at 08:15, Dean Rasheed wrote: It's possible that there are further gains to be had in the sqrt() algorithm on platforms that support 128-bit integers, but I haven't had a chance to investigate that yet. Rebased patch attached, now using 128-bit integers for part of sqrt_var() on platforms that support them. This turned out to be well worth it (1.5 to 2 times faster than the previous version if the result has less than 30 or 40 digits). Thank you for these patches, these sound like really nice improvements. One thing can to my mind while reading the patch: +* If r < 0 Then +* Let r = r + 2*s - 1 +* Let s = s - 1 + /* s is too large by 1; let r = r + 2*s - 1 and s = s - 1 */ + r_int64 += 2 * s_int64 - 1; + s_int64--; This can be reformulated as: +* If r < 0 Then +* Let r = r + s +* Let s = s - 1 +* Let r = r + s + /* s is too large by 1; let r = r + 2*s - 1 and s = s - 1 */ + r_int64 += s_int64; + s_int64--; + r_int64 += s_int64; which would remove one mul/shift and the temp. variable. Mind you, I have not benchmarked this, so it might make little difference, but maybe it is worth trying it. Best regards, Telsdiff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c new file mode 100644 index 10229eb..9e6bb80 --- a/src/backend/utils/adt/numeric.c +++ b/src/backend/utils/adt/numeric.c @@ -393,16 +393,6 @@ static const NumericVar const_ten = #endif #if DEC_DIGITS == 4 -static const NumericDigit const_zero_point_five_data[1] = {5000}; -#elif DEC_DIGITS == 2 -static const NumericDigit const_zero_point_five_data[1] = {50}; -#elif DEC_DIGITS == 1 -static const NumericDigit const_zero_point_five_data[1] = {5}; -#endif -static const NumericVar const_zero_point_five = -{1, -1, NUMERIC_POS, 1, NULL, (NumericDigit *) const_zero_point_five_data}; - -#if DEC_DIGITS == 4 static const NumericDigit const_zero_point_nine_data[1] = {9000}; #elif DEC_DIGITS == 2 static const NumericDigit const_zero_point_nine_data[1] = {90}; @@ -518,6 +508,8 @@ static void div_var_fast(const NumericVa static int select_div_scale(const NumericVar *var1, const NumericVar *var2); static void mod_var(const NumericVar *var1, const NumericVar *var2, NumericVar *result); +static void div_mod_var(const NumericVar *var1, const NumericVar *var2, + NumericVar *quot, NumericVar *rem); static void ceil_var(const NumericVar *var, NumericVar *result); static void floor_var(const NumericVar *var, NumericVar *result); @@ -7712,6 +7704,7 @@ div_var_fast(const NumericVar *var1, con NumericVar *result, int rscale, bool round) { int div_ndigits; + int load_ndigits; int res_sign; int res_weight; int *div; @@ -7766,9 +7759,6 @@ div_var_fast(const NumericVar *var1, con div_ndigits += DIV_GUARD_DIGITS; if (div_ndigits < DIV_GUARD_DIGITS) div_ndigits = DIV_GUARD_DIGITS; - /* Must be at least var1ndigits, too, to simplify data-loading loop */ - if (div_ndigits < var1ndigits) - div_ndigits = var1ndigits; /* * We do the arithmetic in an array "div[]" of signed int's. Since @@ -7781,9 +7771,16 @@ div_var_fast(const NumericVar *var1, con * (approximate) quotient digit and stores it into div[], removing one * position of dividend space. A final pass of carry propagation takes * care of any mistaken quotient digits. + * + * Note that div[] doesn't necessarily contain all of the digits from the + * dividend --- the desired precision plus guard digits might be less than + * the dividend's precision. 
This happens, for example, in the square + * root algorithm, where we typically divide a 2N-digit number by an + * N-digit number, and only require a result with N digits of precision. */ div = (int *) palloc0((div_ndigits + 1) * sizeof(int)); - for (i = 0; i < var1ndigits; i++) + load_ndigits = Min(div_ndigits, var1ndigits); + for (i = 0; i < load_ndigits; i++) div[i + 1] = var1digits[i]; /* @@ -7844,9 +7841,15 @@ div_var_fast(const NumericVar *var1, con maxdiv += Abs(qdigit); if (maxdiv > (INT_MAX - INT_MAX / NBASE - 1) / (NBASE - 1)) { -/* Yes, do it */ +/* + * Yes, do it. Note that if var2ndigits is much smaller than + * div_ndigits, we can save a significant amount of effort + * here by noting that we only need to normalise those div[] + * entries touched where prior iterations subtracted multiples + * of the divisor. + */ carry = 0; -for (i = div_ndigits; i > qi; i--) +for (i = Min(qi + var2ndigits - 2, div_ndigits); i > qi; i--) { newdig = div[i] + carry; if (
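A quick standalone check of the reformulation suggested in the message above (a verification sketch only, not PostgreSQL code): starting from the same r and s with r < 0, both the original correction step and the two-addition variant must end with r equal to r + 2*s - 1 and s equal to s - 1:

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
    for (int64_t s = 1; s < 1000; s++)
    {
        for (int64_t r = -2 * s; r < 0; r++)    /* r < 0 triggers the fixup */
        {
            /* original: one multiplication (or shift) plus a constant */
            int64_t r1 = r + 2 * s - 1;
            int64_t s1 = s - 1;

            /* reformulated: two additions, no multiplication, no temp */
            int64_t r2 = r;
            int64_t s2 = s;

            r2 += s2;
            s2--;
            r2 += s2;

            assert(r1 == r2 && s1 == s2);
        }
    }
    printf("both formulations agree\n");
    return 0;
}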
Re: Non-portable shell code in pg_upgrade tap tests
Moin, On Fri, July 20, 2018 10:55 am, Victor Wagner wrote: > On Fri, 20 Jul 2018 10:25:47 -0400 > Tom Lane wrote: > >> Victor Wagner writes: >> > I've discovered that in the branch REL_11_STABLE there is shell >> > script src/bin/pg_upgrade/test.sh which doesn't work under Solaris >> > 10. (it uses $(command) syntax which is not compatible with original >> > Solaris /bin/sh) > >> >> Please send a patch. Most of us do not have access to old shells > > Here it goes. Previous letter was written before fixed tests were > completed, because this old machine is slow. + *) if [ `find ${PGDATA} -type f ! -perm 640 | wc -l` -ne 0 ]; then Shouldn't ${PGDATA} in the above, as an argument to find, be quoted? Otherwise the shell would get confused if it contains spaces or other special characters. Regards, Tels
Re: Have an encrypted pgpass file
Moin, On Wed, July 18, 2018 7:25 pm, Tom Lane wrote: > Alvaro Herrera writes: >> Seems to me that passing %-specifiers to the command would make it more >> useful (%u for "user", "host" etc) -- your command could refuse to give >> you a password for the superuser account for instance but grant one for >> a read-only user. > > It would also provide a *very* fertile source of shell-script-injection > vulnerabilities. (Whaddya mean, you tried to use a user name with a > quote mark in it?) Little Bobby Tables, we call him. :) I'm also concerned that this would let anybody who can alter the environment run arbitrary code as user postgres. Is this something that poses a risk in addition to the current situation? Best regards, Tels
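For what it's worth, the usual way to keep user-controlled values like %u out of shell syntax entirely is to pass them as separate argv entries to exec, never as a single string handed to a shell. Below is a minimal sketch; the /bin/echo command and the argument layout are made up for illustration and are not a proposal for how libpq would actually invoke such a command:

#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Run a user-configured command with the expanded values passed as
 * separate argv entries. No shell is involved, so quotes, spaces or
 * semicolons in "user" cannot change what gets executed.
 */
static int
run_password_command(const char *cmd, const char *user, const char *host)
{
    pid_t pid = fork();

    if (pid == 0)
    {
        execlp(cmd, cmd, user, host, (char *) NULL);
        _exit(127);             /* exec failed */
    }
    else if (pid > 0)
    {
        int status;

        waitpid(pid, &status, 0);
        return status;
    }
    return -1;
}

int
main(void)
{
    /* the troublesome user name is just one argument string here */
    return run_password_command("/bin/echo",
                                "Robert'); DROP TABLE Students;--",
                                "db.example.com");
}

Anything the shell would normally interpret (quotes, semicolons, backticks) is then just bytes inside one argument.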
Re: Non-portable shell code in pg_upgrade tap tests
Moin, On Sat, July 21, 2018 12:47 pm, Dagfinn Ilmari Mannsåker wrote: > Tom Lane writes: > >> "Tels" writes: >>> + *) if [ `find ${PGDATA} -type f ! -perm 640 | wc -l` -ne 0 ]; then >> >>> Shouldn't ${PGDATA} in the above as argument to find be quoted, >>> otherwise >>> the shell would get confused if it contains spaces or other special >>> characters? >> >> Hmm. Yeah, probably. I don't think this script is meant to be run with >> arbitrary values of PGDATA, but most of the other uses are quoted, so >> for consistency's sake this should be too. > > PGDATA is built from `pwd`, so it breaks if the build directory has a > space in it. Because it's testing for the absence of files, it doesn't > actually break the test, but would fail to detect the bugs it's trying > to. Thanx for the fix. But perhaps I should have been a bit more specific in my first message: spaces are not the only characters this can break on. Looking at your new patch, I notice you used "" for quoting, not ''. (Not sure which variant Tom used when pushing a patch.) I'm not a shell expert, but I think '' are safer, as "" still has some interpolation from the shell (at least on the systems I use regularly): te@pc:~$ mkdir 'test$test' te@pc:~$ touch 'test$test/test' te@pc:~$ find 'test$test/test' -type f test$test/test te@pc:~$ find "test$test/test" -type f find: ‘test/test’: No such file or directory So I've grown the habit of always using '' instead of "". Not sure if this is specific to bash vs. sh or PC vs Mac, tho. Best wishes, Tels
Re: perlcritic script
Moin, On Tue, May 8, 2018 5:03 pm, Peter Eisentraut wrote: > On 5/8/18 16:51, Tom Lane wrote: >> Peter Eisentraut writes: >>> On 5/8/18 13:57, Andrew Dunstan wrote: >>>> + # take executable files that file(1) thinks are perl files >>>> + find . -type f -perm -100 -exec file {} \; -print | >>>> + egrep -i ':.*perl[0-9]*\>' | >> >>> How portable is that? >> >> Well, it's the same code that's in pgperltidy ... but I agree that >> it's making a lot of assumptions about the behavior of file(1). > > OK, but then it's not a problem for this thread. If I'm not mistaken, the first line in the "find" code could be more compact like so: find . -type f -iname '*.p[lm]' (-print is the default, and the -name argument is a shell glob pattern, anyway. And IMHO it could be "-iname" so we catch "test.PM", too?). Also, "-print" does not handle filenames with newlines well, so "-print0" should be used; however, this can be tricky when the next step isn't xargs, but sort. Looking at the man page, on my system this would be: find . -type f -name '*.p[lm]' -print0 | sort -u -z | xargs -0 ... Not sure if that is more, or less, portable than the original -print variant, tho. Best regards, Tels
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
Hello Rushabh, On Fri, December 8, 2017 2:28 am, Rushabh Lathia wrote: > Thanks for review. > > On Fri, Dec 8, 2017 at 6:27 AM, Peter Geoghegan wrote: > >> On Thu, Dec 7, 2017 at 12:25 AM, Rushabh Lathia >> wrote: >> > 0001-Add-parallel-B-tree-index-build-sorting_v14.patch I've looked only at patch 0002, here are some comments. > + * leaderasworker indicates whether leader going to participate as worker or > + * not. The grammar is a bit off, and the "or not" seems obvious. IMHO this could be: + * leaderasworker indicates whether the leader is going to participate as worker The argument leaderasworker is only used once and for one temp. variable that is only used once, too. So the temp. variable could maybe go. And not sure what the verdict was from the const-discussion threads, I did not follow it through. If "const" is what should be done generally, then the argument could be consted, as to not create more "unconsted" code. E.g. so: +plan_create_index_workers(Oid tableOid, Oid indexOid, const bool leaderasworker) and later: - sort_mem_blocks / (parallel_workers + 1) < min_sort_mem_blocks) + sort_mem_blocks / (parallel_workers + (leaderasworker ? 1 : 0)) < min_sort_mem_blocks) Thank you for working on this patch! All the best, Tels
Re: Bitmap table scan cost per page formula
Moin, On Wed, December 20, 2017 11:51 pm, Jeff Janes wrote: > On Wed, Dec 20, 2017 at 2:18 PM, Robert Haas > wrote: > >> On Wed, Dec 20, 2017 at 4:20 PM, Jeff Janes >> wrote: >>> >>> It is not obvious to me that the parabola is wrong. I've certainly >>> seen >>> cases where reading every 2nd or 3rd block (either stochastically, or >>> modulus) actually does take longer than reading every block, because it >>> defeats read-ahead. But it depends on a lot on your kernel version and >>> your kernel settings and your file system and probably other things as >>> well. >>> >> >> Well, that's an interesting point, too. Maybe we need another graph >> that >> also shows the actual runtime of a bitmap scan and a sequential scan. >> > > I've did some low level IO benchmarking, and I actually get 13 times > slower > to read every 3rd block than every block using CentOS6.9 with ext4 and the > setting: > blockdev --setra 8192 /dev/sdb1 > On some virtualized storage which I don't know the details of, but it > behaves as if it were RAID/JBOD with around 6 independent spindles.. Repeated this here on my desktop, linux-image-4.10.0-42 with a Samsung SSD 850 EVO 500 Gbyte, on an encrypted / EXT4 partition: $ dd if=/dev/zero of=zero.dat count=130 bs=8192 130+0 records in 130+0 records out 1064960 bytes (11 GB, 9,9 GiB) copied, 22,1993 s, 480 MB/s All blocks: $ sudo sh -c "echo 3 > /proc/sys/vm/drop_caches" $ time perl -le 'open my $fh, "rand" or die; foreach (1..130) {$x="";next if $_%3>5; sysseek $fh,$_*8*1024,0 or die $!; sysread $fh, $x,8*1024; print length $x} ' | uniq -c 129 8192 real0m20,841s user0m0,960s sys 0m2,516s Every 3rd block: $ sudo sh -c "echo 3 > /proc/sys/vm/drop_caches" $ time perl -le 'open my $fh, "rand" or die; foreach (1..130) {$x=""; next if $_%3>0; sysseek $fh,$_*8*1024,0 or die $!; sysread $fh, $x,8*1024; print length $x} '|uniq -c 43 8192 real0m50,504s user0m0,532s sys 0m2,972s Every 3rd block random: $ sudo sh -c "echo 3 > /proc/sys/vm/drop_caches" $ time perl -le 'open my $fh, "rand" or die; foreach (1..130) {$x=""; next if rand()> 0.; sysseek $fh,$_*8*1024,0 or die $!; sysread $fh, $x,8*1024; print length $x} ' | uniq -c 432810 8192 real0m26,575s user0m0,540s sys 0m2,200s So it does get slower, but only about 2.5 times respectively about 30%. Hope this helps, Tels
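For reference, the same kind of strided-read microbenchmark can be written in C using pread(). This is only a sketch under the same assumptions as the Perl one-liners above (a pre-created test file, read in 8 kB blocks); it is not PostgreSQL code:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BLCKSZ 8192

int
main(int argc, char **argv)
{
    char        buf[BLCKSZ];
    int         fd;
    long        nblocks,
                stride,
                i;

    if (argc != 4)
    {
        fprintf(stderr, "usage: %s file nblocks stride\n", argv[0]);
        return 1;
    }
    nblocks = atol(argv[2]);
    stride = atol(argv[3]);

    fd = open(argv[1], O_RDONLY);
    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    /* read every stride-th 8 kB block; stride == 1 reads the whole file */
    for (i = 0; i < nblocks; i += stride)
    {
        if (pread(fd, buf, BLCKSZ, (off_t) i * BLCKSZ) < 0)
        {
            perror("pread");
            return 1;
        }
    }
    close(fd);
    return 0;
}

Timing it with a stride of 1 versus 3 (after dropping the page cache, as above) shows the same readahead effect.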
Re: [HACKERS] Replication status in logical replication
Moin, On Mon, December 25, 2017 7:26 pm, Masahiko Sawada wrote: > On Tue, Dec 26, 2017 at 1:10 AM, Petr Jelinek > wrote: >> On 21/11/17 22:06, Masahiko Sawada wrote: >>> >>> After investigation, I found out that my previous patch was wrong >>> direction. I should have changed XLogSendLogical() so that we can >>> check the read LSN and set WalSndCaughtUp = true even after read a >>> record without wait. Attached updated patch passed 'make check-world'. >>> Please review it. >>> >> >> Hi, >> >> This version looks good to me and seems to be in line with what we do in >> physical replication. >> >> Marking as ready for committer. (Sorry Masahiko, you'll get this twice, as I fumbled the reply button.) I have not verified that comment and/or code are correct, just a grammar fix: +/* + * If we've sent a record is at or beyond the flushed point, then + * we're caught up. That should read more like this: "If we've sent a record that is at or beyond the flushed point, we have caught up." All the best, Tels
Re: [HACKERS] Replication status in logical replication
Moin, On Tue, December 26, 2017 5:26 am, Masahiko Sawada wrote: > On Tue, Dec 26, 2017 at 6:19 PM, Tels > wrote: >> Moin, >> >> On Mon, December 25, 2017 7:26 pm, Masahiko Sawada wrote: >>> On Tue, Dec 26, 2017 at 1:10 AM, Petr Jelinek >>> wrote: >>>> On 21/11/17 22:06, Masahiko Sawada wrote: >>>>> >>>>> After investigation, I found out that my previous patch was wrong >>>>> direction. I should have changed XLogSendLogical() so that we can >>>>> check the read LSN and set WalSndCaughtUp = true even after read a >>>>> record without wait. Attached updated patch passed 'make >>>>> check-world'. >>>>> Please review it. >>>>> >>>> >>>> Hi, >>>> >>>> This version looks good to me and seems to be in line with what we do >>>> in >>>> physical replication. >>>> >>>> Marking as ready for committer. >> >> (Sorry Masahiko, you'll get this twice, as fumbled the reply button.) >> >> I have not verifed that comment and/or code are correct, just a grammar >> fix: >> >> +/* >> + * If we've sent a record is at or beyond the flushed >> point, then >> + * we're caught up. >> >> That should read more like this: >> >> "If we've sent a record that is at or beyond the flushed point, we have >> caught up." >> > > Thank you for reviewing the patch! > Actually, that comment is inspired by the comment just below comment. > ISTM it's better to fix both if grammar of them is not appropriate. Oh yes. Your attached version reads fine to me. All the best, Tels
Re: [HACKERS] [PATCH] Tap test support for backup with tablespace mapping
Dear Vaishnavi, On Tue, December 26, 2017 8:58 pm, Vaishnavi Prabakaran wrote: > Hi All, > > I have added support in Postgres TAP test framework to backup a data > directory with tablespace mapping. Also added support to move the backup > directory contents to standby node, because current option to init the > standby from backup does not support copying softlinks, which is needed > when tablespace mapping is involved in backup. > > Added a new test to existing streaming replication tap test to demonstrate > the usage of these new APIs. > > Attached the patch, Hope this enhancement is useful. > > Thanks & Regards, > Vaishnavi, > Fujitsu Australia. > Thank you for the patch; I saw these things: * backup_withtablespace() does not have any documentation? * The mkdir calls do not set a mask for the created dir, defaulting to 0777 - is this what is wanted here? * none of the mkdir, chdir, etc. calls check any error code, e.g. what happens if one of them fails? * different indentation between rmdir and move (tab vs. spaces): +rmdir($data_path); + move("$backup_path", "$self->{_basedir}/pgdata") Best regards, Tels
Re: plpgsql function startup-time improvements
Hello Tom, On Wed, December 27, 2017 3:38 pm, Tom Lane wrote: > Attached are patches for two performance-improvement ideas that came > to me while working on [snip] > > Also, I changed PLpgSQL_var.isconst and PLpgSQL_var.notnull from "int" > to "bool", which is what they should have been all along, and relocated > them in the PLpgSQL_var struct. There are two motivations for this. > It saves a whole 8 bytes per PLpgSQL_var, at least on 64-bit machines, > because the fields now fit into what had been wasted padding space; > reducing the size of what we have to copy during copy_plpgsql_datums > has to be worth something. Second, those fields are now adjacent to > the common PLpgSQL_variable fields, which will simplify migrating them > into PLpgSQL_variable, as I anticipate we'll want to do at some point > when we allow composite variables to be marked CONSTANT and maybe NOT > NULL. More performance, especially in plpgsql, is always a good thing :) After a few experiments I've got a question or two, though (and please excuse me if these are stupid questions :) My C is a bit rusty, so I embarked on a mission to learn more. With a short test program printing out the size of PLpgSQL_var to check, I always saw 72 bytes, regardless of bool vs. int. So a bool would be 4 bytes here. Hmm. Googling around, this patch comment from Peter Eisentraut says that "bool" can be more than one byte, and only "bool8" is really one byte. https://www.postgresql.org/message-id/attachment/54267/0006-Add-bool8-typedef-for-system-catalog-structs.patch I used 64-bit Kubuntu, the gcc that came with the system, and installed Postgres 10 to see what MAXIMUM_ALIGNOF should be (it says 8 here). Since I could not get the test program to compile against all the PG header files (which is probably just me being dense...), I used a stripped-down test.c (attached). However, I probably did not have the real compiler settings, which might influence the size of bool or the enum definition? Maybe someone could shed some light on these questions: * Does bool vs. int really save space, or is this maybe system/compiler dependent? * Or should the structs use "bool8" to ensure this? * I dimly remember that access to 1-byte fields might be slower than to 4-byte fields on X86-64. Would the size savings still be worth it? And maybe folding all four bool fields into an "int flags" field with bits would save space and not be much slower (depending on how often the different flags are accessed, due to the ANDing and ORing ops)? 
Best regards, Tels

#include <stdio.h>
#include <stdint.h>

/*
 * #include "access/xact.h"
 * #include "commands/event_trigger.h"
 * #include "commands/trigger.h"
 * #include "executor/spi.h"
 */

#define MAXIMUM_ALIGNOF 8

#define TYPEALIGN(ALIGNVAL,LEN) \
    (((uintptr_t) (LEN) + ((ALIGNVAL) - 1)) & ~((uintptr_t) ((ALIGNVAL) - 1)))
#define MAXALIGN(LEN) TYPEALIGN(MAXIMUM_ALIGNOF, (LEN))

/*
 * Datum array node types
 */
typedef enum PLpgSQL_datum_type
{
    PLPGSQL_DTYPE_VAR,
    PLPGSQL_DTYPE_ROW,
    PLPGSQL_DTYPE_REC,
    PLPGSQL_DTYPE_RECFIELD,
    PLPGSQL_DTYPE_ARRAYELEM,
    PLPGSQL_DTYPE_EXPR
} PLpgSQL_datum_type;

/*
 * Scalar variable
 */
typedef struct PLpgSQL_var
{
    PLpgSQL_datum_type dtype;
    int         dno;
    char       *refname;
    int         lineno;

    /* PLpgSQL_type *datatype; */
    char       *datatype;
    bool        isconst;
    bool        notnull;

    /* PLpgSQL_expr *default_val; PLpgSQL_expr *cursor_explicit_expr; */
    char       *default_val;
    char       *cursor_explicit_expr;
    int         cursor_explicit_argrow;
    int         cursor_options;

    bool        isnull;
    bool        freeval;
} PLpgSQL_var;

int main(void)
{
    int i = MAXALIGN(sizeof(PLpgSQL_var));

    printf("MAXALIGN(sizeof(PLpgSQL_var))=%i\n", i);
    return 0;
}
Re: plpgsql function startup-time improvements
Moin, On Thu, December 28, 2017 5:43 pm, Tom Lane wrote: > "Tels" writes: >> On Wed, December 27, 2017 3:38 pm, Tom Lane wrote: >>> Also, I changed PLpgSQL_var.isconst and PLpgSQL_var.notnull from "int" >>> to "bool", which is what they should have been all along, and relocated >>> them in the PLpgSQL_var struct. > >> With a short test program printing out the size of PLpgSQL_var to check, >> I >> always saw 72 bytes, regardless of bool vs. int. So a bool would be 4 >> bytes here. Hmm. > > Seems fairly unlikely, especially given that we typedef bool as char ;-). Hmn, yes, I can see my confusion. And a test shows, that sizeof(bool) is 1 here. and *char etc are 8. > Which field order were you checking? Are you accounting for alignment > padding? > > By my count, with the existing field order on typical 64-bit hardware, > we ought to have > > PLpgSQL_datum_type dtype; -- 4 bytes [1] > int dno;-- 4 bytes > char *refname;-- 8 bytes > int lineno; -- 4 bytes > -- 4 bytes wasted due to padding here > PLpgSQL_type *datatype; -- 8 bytes > int isconst;-- 4 bytes > int notnull;-- 4 bytes > PLpgSQL_expr *default_val; -- 8 bytes > PLpgSQL_expr *cursor_explicit_expr; -- 8 bytes > int cursor_explicit_argrow; -- 4 bytes > int cursor_options; -- 4 bytes > > Datum value; -- 8 bytes > boolisnull; -- 1 byte > boolfreeval;-- 1 byte > > so at this point we've consumed 74 bytes, but the whole struct has > to be aligned on 8-byte boundaries because of the pointers, so > sizeof(PLpgSQL_var) ought to be 80 --- and that is what I see here. > > With the proposed redesign, > > PLpgSQL_datum_type dtype; -- 4 bytes [1] > int dno;-- 4 bytes > char *refname;-- 8 bytes > int lineno; -- 4 bytes > > boolisconst;-- 1 byte > boolnotnull;-- 1 byte > -- 2 bytes wasted due to padding here > PLpgSQL_type *datatype; -- 8 bytes > PLpgSQL_expr *default_val; -- 8 bytes > PLpgSQL_expr *cursor_explicit_expr; -- 8 bytes > int cursor_explicit_argrow; -- 4 bytes > int cursor_options; -- 4 bytes > > Datum value; -- 8 bytes > boolisnull; -- 1 byte > boolfreeval;-- 1 byte > > so we've consumed 66 bytes, which rounds up to 72 with the addition > of trailing padding. Sounds logical, thanx for the detailed explanation. In my test the first 4 padding bytes are probably not there, because I probably miss something that aligns ptrs on 8-byte boundaries. At least that would explain the sizes seen here. So, if you moved "isnull" and "freeval" right behind "isconst" and "notnull", you could save another 2 byte, land at 64, and thus no extra padding would keep it at 64? >> And maybe folding all four bool fields into an "int flags" field with >> bits >> would save space, and not much slower (depending on often how the >> different flags are accessed due to the ANDing and ORing ops)? > > That'd be pretty messy/invasive in terms of the code changes needed, > and I don't think it'd save any space once you account for alignment > and the fact that my other patch proposes to add another enum at > the end of the struct. Also, I'm not exactly convinced that > replacing byte sets and tests with bitflag operations would be > cheap time-wise. (It would particularly be a mess for isnull, > since then there'd be an impedance mismatch with a whole lotta PG > APIs that expect null flags to be bools.) Already had a hunch the idea wouldn't be popular, and this are all pretty solid arguments against it. Nevermind, then :) Best wishes, Tels
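To make the padding arithmetic above concrete, here is a small standalone program. It is a sketch only: plain pointers stand in for the PLpgSQL_type * and PLpgSQL_expr * members, and the numbers assume a typical LP64 build with a 4-byte enum and a 1-byte bool. Under those assumptions it prints 80 for the old field order and 72 for the reordered one, with isconst landing at offset 20, inside what used to be padding:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef uintptr_t Datum;                /* 8 bytes on 64-bit platforms */
typedef enum { DTYPE_VAR } datum_type;  /* usually 4 bytes */

/* old order: isconst/notnull as int, after the datatype pointer */
typedef struct VarOld
{
    datum_type  dtype;
    int         dno;
    char       *refname;
    int         lineno;         /* 4 bytes of padding follow here */
    void       *datatype;
    int         isconst;
    int         notnull;
    void       *default_val;
    void       *cursor_explicit_expr;
    int         cursor_explicit_argrow;
    int         cursor_options;
    Datum       value;
    bool        isnull;
    bool        freeval;        /* 6 bytes of trailing padding */
} VarOld;

/* new order: the two bools slot into the padding after lineno */
typedef struct VarNew
{
    datum_type  dtype;
    int         dno;
    char       *refname;
    int         lineno;
    bool        isconst;
    bool        notnull;        /* only 2 bytes of padding follow here */
    void       *datatype;
    void       *default_val;
    void       *cursor_explicit_expr;
    int         cursor_explicit_argrow;
    int         cursor_options;
    Datum       value;
    bool        isnull;
    bool        freeval;        /* 6 bytes of trailing padding */
} VarNew;

int
main(void)
{
    printf("old order: sizeof = %zu\n", sizeof(VarOld));
    printf("new order: sizeof = %zu (isconst at offset %zu, inside former padding)\n",
           sizeof(VarNew), offsetof(VarNew, isconst));
    return 0;
}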
Re: TAP test module - PostgresClient
On Thu, December 28, 2017 10:14 pm, Tom Lane wrote: > Craig Ringer writes: >> Another option might be to teach the TAP infrastructure and the >> buildfarm >> client how to fetch cpanminus and build DBD::Pg against our build-tree, >> so >> we could use Perl DBI. > > As a buildfarm owner, I'd take a *really* dim view of the buildfarm > trying to auto-fetch code off the internet. As a developer, the > idea that "make check-world" would try to do that is even scarier. > Buildfarm owners are likely to have taken at least some steps to > sandbox their builds, but developers not so much. > > I do not think we should even think about going there. Well, while I couldn't agree more on the "running code from the internet is dangerous" thing, there are a few points for it, tho: * if you use Perl modules on your system, you are likely doing that already, anyway, as the Perl modules come, you guessed it, from the internet :) Just because a random $PackageMaintainer signed it does not mean it is really safe. * And a lot of Perl modules are not in, say, Debian repositories, so you need to use CPAN (or re-invent everything). Unfortunately, the trend for other languages seems to go in the same direction, with Ruby gems, the python package manager, and almost everybody else re-inventing their own packaging system, often poorly. So you might already have fallen into the trap of "use random code from the internet". (Of course, that is not really an argument for doing it, either...) * the other option seems to be "re-invent the wheel, again, locally", which isn't always the best, either. I do agree, tho, that having "setup" or "make check" auto-fetch stuff from the internet is not a good idea. Mostly because it suddenly becomes much harder to run in closed networks w/o access, and such side-loading installations can bypass your system's packaging system, which doesn't sound good, either. OTOH, it is possible to set up local repositories, or maybe even pre-bundle modules into some sort of "approved TAP bundle" hosted on an official server. The last resort would be to pre-bundle the wanted modules, but this has the risk of them becoming outdated fast. Plus, pre-bundled modules are no more security-vetted than the ones from the internet, so you might as well use the CPAN version directly. The best course seems to me to have dependencies on the OS packages for the Perl modules you want to use. Not sure, however, if the build farm client has "proper" Debian etc. packages and if it is even possible to add these dependencies in this way. Best wishes, Tels
Re: Undocumented(?) limits on regexp functions
Moin Andrew, On Tue, August 14, 2018 9:16 am, Andrew Gierth wrote: >>>>>> "Tom" == Tom Lane writes: > > >> Should these limits: > > >> a) be removed > > Tom> Doubt it --- we could use the "huge" request variants, maybe, but > Tom> I wonder whether the engine could run fast enough that you'd want > Tom> to. > > I do wonder (albeit without evidence) whether the quadratic slowdown > problem I posted a patch for earlier was ignored for so long because > people just went "meh, regexps are slow" rather than wondering why a > trivial splitting of a 40kbyte string was taking more than a second. Pretty much this. :) First of all, thank you for working in this area; this is very welcome. We do use UTF-8, and we did notice that regexps are not actually the fastest around, albeit we did not (yet) run into the memory limit. Mostly because the regexp_match* stuff we use is only used in places where the performance is not key and the input/output is small (albeit, now that I mention it, the quadratic behaviour might explain a few slowdowns in other cases I need to investigate). Anyway, in a few places we have functions that use a lot (> a dozen) of regexps that are also moderately complex (e.g. span multiple lines). In these cases the performance was not really up to par, so I experimented and in the end rewrote the functions in plperl, which fixed the performance, so we no longer had this issue. All the best, Tels
Re: JIT compiling with LLVM v12
Moin, On Sat, August 25, 2018 9:34 pm, Robert Haas wrote: > On Wed, Aug 22, 2018 at 6:43 PM, Andres Freund wrote: >> Now you can say that'd be solved by bumping the cost up, sure. But >> obviously the row / cost model is pretty much out of whack here, I don't >> see how we can make reasonable decisions in a trivial query that has a >> misestimation by five orders of magnitude. > > Before JIT, it didn't matter whether the costing was wrong, provided > that the path with the lowest cost was the cheapest path (or at least > close enough to the cheapest path not to bother anyone). Now it does. > If the intended path is chosen but the costing is higher than it > should be, JIT will erroneously activate. If you had designed this in > such a way that we added separate paths for the JIT and non-JIT > versions and the JIT version had a bigger startup cost but a reduced > runtime cost, then you probably would not have run into this issue, or > at least not to the same degree. But as it is, JIT activates when the > plan looks expensive, regardless of whether activating JIT will do > anything to make it cheaper. As a blindingly obvious example, turning > on JIT to mitigate the effects of disable_cost is senseless, but as > you point out, that's exactly what happens right now. > > I'd guess that, as you read this, you're thinking, well, but if I'd > added JIT and non-JIT paths for every option, it would have doubled > the number of paths, and that would have slowed the planner down way > too much. That's certainly true, but my point is just that the > problem is probably not as simple as "the defaults are too low". I > think the problem is more fundamentally that the model you've chosen > is kinda broken. I'm not saying I know how you could have done any > better, but I do think we're going to have to try to figure out > something to do about it, because saying, "check-pg_upgrade is 4x > slower, but that's just because of all those bad estimates" is not > going to fly. Those bad estimates were harmlessly bad before, and now > they are harmfully bad, and similar bad estimates are going to exist > in real-world queries, and those are going to be harmful now too. > > Blaming the bad costing is a red herring. The problem is that you've > made the costing matter in a way that it previously didn't. Hm, no, I don't quite follow this argument. Isn't trying to avoid "bad costing having bad consequences" just hiding the symptoms instead of curing them? It would have a high development cost, and still bad estimates could ruin your day in other places. Wouldn't it be much smarter to look at why and how the bad costing appears and try to fix this? If a query that returns 12 rows was estimated to return about 4 million, something is wrong on a ridiculous scale. If the costing didn't produce so many "to the moon" values, then it wouldn't matter so much what later decisions do depending on it. I mean, JIT is not the only thing here; even choosing the wrong plan can lead to large runtime differences (think of a sort that spills to disk etc.) So, is there a limit on how many rows can be estimated? Maybe based on things like: * how big the table is? E.g. a table with 2 pages can't have a million rows. * what the column types are? E.g. if you do: SELECT * FROM table WHERE id >= 100 AND id < 200; you cannot have more than 100 rows as a result if "id" is a unique integer column. * Index size: You can't pull out more rows from an index than it contains; maybe this helps limit the "worst estimate"? 
These things might also be cheaper to implement than rewriting the entire JIT model. Also, why does PG allow the stats to be that outdated - or missing; I'm not sure which case it is in this example. Shouldn't the system aim to have at least some basic stats, even if the user never runs ANALYZE? Or is this on purpose for these tests, to see what happens? Best regards, Tels
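As a back-of-the-envelope illustration of the first bullet above, a sanity clamp could cap a row estimate by what the table's physical size can possibly hold. This is made-up sketch code, not PostgreSQL's costing code; the 32-byte minimum tuple footprint and the function name are assumptions for illustration only:

#include <stdio.h>

#define BLCKSZ              8192
#define MIN_TUPLE_FOOTPRINT 32      /* assumed lower bound on bytes per heap tuple */

static double
clamp_row_estimate(double est_rows, double rel_pages)
{
    /* a table with rel_pages pages cannot hold more tuples than this */
    double max_possible = rel_pages * (BLCKSZ / MIN_TUPLE_FOOTPRINT);

    if (est_rows > max_possible)
        est_rows = max_possible;
    if (est_rows < 1.0)
        est_rows = 1.0;
    return est_rows;
}

int
main(void)
{
    /* e.g. an estimate of 4 million rows for a 2-page table gets clamped */
    printf("%.0f\n", clamp_row_estimate(4e6, 2.0));
    return 0;
}

With 2 pages, anything above 512 rows is physically impossible under these assumptions, so the 4-million-row estimate collapses to 512 and can no longer push later decisions (JIT, plan choice) off a cliff.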