Re: [HACKERS] proposal: get oldest LSN - function
On 2016-02-26 18:45:14 +0300, Kartyshov Ivan wrote: > Function pg_oldest_xlog_location gets us the oldest LSN (Log Sequence > Number) in xlog. > > It is useful additional tool for DBA (we can get replicationSlotMinLSN, so > why not in master), it can show us, if xlog replication or wal-sender is > working properly or indicate if replication on startup can get up to date > with master, or after long turnoff must be recovered from archive. How does it help with any of that? - Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Relation cache invalidation on replica
On 2016-02-26 18:05:55 +0300, Konstantin Knizhnik wrote: > The reason of the problem is that invalidation messages are not delivered to > replica after the end of concurrent create index. > Invalidation messages are included in xlog as part of transaction commit > record. > Concurrent index create is split into three transaction, last of which is > just performing inplace update of index tuple, marking it as valid and > invalidating cache. But as far as this transaction is not assigned XID, no > transaction record is created in WAL and send to replicas. As a result, > replica doesn't receive this invalidation messages. Ugh, that's a fairly ugly bug. > To fix the problem it is just enough to assign XID to transaction. > It can be done by adding GetCurrentTransactionId() call to the end of > DefineIdnex function: I think it'd be a better idea to always create a commit record if there's pending invalidation messages. It looks to me that that'd be a simple addition to RecordTransactionCommit. Otherwise we'd end up with something rather fragile, relying on people to recognize such problems. E.g. Vacuum and analyze also have this problem. Tom, everyone, do you see any reason not to go with such an approach? Not sending invalidations we've queued up seems like it could cause rather serious problems with HS and/or logical decoding. Basically I'm thinking about assigning an xid if (nrels != 0 || nmsgs != 0 || RelcacheInitFileInval) since we really don't ever want to miss sending out WAL in those cases. Strictly speaking we don't need an xid, but it seems rather fragile to suddenly have commit records without one. Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Relation cache invalidation on replica
On 2016-02-27 01:16:34 +, Simon Riggs wrote: > If the above is true, then the proposed fix wouldn't work either. > > No point in sending a cache invalidation message on the standby if you > haven't also written WAL, since the catalog re-read would just see the old > row. > > heap_inplace_update() does write WAL, which blows away the starting premise. I'm not following here. heap_inplace_update() indeed writes WAL, but it does *NOT* (and may not) assign an xid. Thus we're not emitting the relcache invalidation queued in DefineIndex(), as RecordTransactionCommit() currently skips emitting a commit record if there's no xid. > So I'm not seeing this as an extant bug in an open source version of > PostgreSQL, in my current understanding. But the first message in the thread demonstrated a reproducible problem? Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Relation cache invalidation on replica
On 2016-02-27 01:45:57 +, Simon Riggs wrote: > Surely then the fix is to make heap_inplace_update() assign an xid? That > way any catalog change will always generate a commit record containing the > invalidation that goes with the change. No need to fix up the breakage > later. Well, we could, but it'd also break things where we rely on heap_inplace_update not assigning an xid. I'm not seing why that's better than my proposal of doing this (http://archives.postgresql.org/message-id/20160227002958.peftvmcx4dxwe244%40alap3.anarazel.de), by emitting a commit record in RecordTransactionCommmit() if nrels != 0 || nmsgs != 0 || RelcacheInitFileInval . > The other heap_insert|update|delete functions (and similar) all assign xid, > so it is consistent for us to do that for inplace_update also. I don't think that follows. Inplace updates an absolute special case, where are *not allowed* to include an xid in the tuple. Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Performance degradation in commit 6150a1b0
On February 26, 2016 7:55:18 PM PST, Amit Kapila wrote: >On Sat, Feb 27, 2016 at 12:41 AM, Andres Freund >wrote: >> >> Hi, >> >> On 2016-02-25 12:56:39 +0530, Amit Kapila wrote: >> > From past few weeks, we were facing some performance degradation in >the >> > read-only performance bench marks in high-end machines. My >colleague >> > Mithun, has tried by reverting commit ac1d794 which seems to >degrade the >> > performance in HEAD on high-end m/c's as reported previously[1], >but >still >> > we were getting degradation, then we have done some profiling to >see >what >> > has caused it and we found that it's mainly caused by spin lock >when >> > called via pin/unpin buffer and then we tried by reverting commit >6150a1b0 >> > which has recently changed the structures in that area and it turns >out >> > that reverting that patch, we don't see any degradation in >performance. >> > The important point to note is that the performance degradation >doesn't >> > occur every time, but if the tests are repeated twice or thrice, it >> > is easily visible. >> >> > m/c details >> > IBM POWER-8 >> > 24 cores,192 hardware threads >> > RAM - 492GB >> > >> > Non-default postgresql.conf settings- >> > shared_buffers=16GB >> > max_connections=200 >> > min_wal_size=15GB >> > max_wal_size=20GB >> > checkpoint_timeout=900 >> > maintenance_work_mem=1GB >> > checkpoint_completion_target=0.9 >> > >> > scale_factor - 300 >> > >> > Performance at commit 43cd468cf01007f39312af05c4c92ceb6de8afd8 is >469002 at >> > 64-client count and then at >6150a1b08a9fe7ead2b25240be46dddeae9d98e1, it >> > went down to 200807. This performance numbers are median of 3 >15-min >> > pgbench read-only tests. The similar data is seen even when we >revert >the >> > patch on latest commit. We have yet to perform detail analysis as >to >why >> > the commit 6150a1b08a9fe7ead2b25240be46dddeae9d98e1 lead to >degradation, >> > but any ideas are welcome. >> >> Ugh. Especially the varying performance is odd. Does it vary between >> restarts, or is it just happenstance? If it's the former, we might >be >> dealing with some alignment issues. >> > >It varies between restarts. > >> >> If not, I wonder if the issue is massive buffer header contention. As >a >> LL/SC architecture acquiring the content lock might interrupt buffer >> spinlock acquisition and vice versa. >> >> Does applying the patch from >http://archives.postgresql.org/message-id/CAPpHfdu77FUi5eiNb%2BjRPFh5S%2B1U%2B8ax4Zw%3DAUYgt%2BCPsKiyWw%40mail.gmail.com >> change the picture? >> > >Not tried, but if this is alignment issue as you are suspecting above, >then >does it make sense to try this out? It's the other theory I had. And it's additionally useful testing regardless of this regression... --- Please excuse brevity and formatting - I am writing this on my mobile phone. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Sanity checking for ./configure options?
On 2016-02-27 14:15:45 -0600, Jim Nasby wrote: > Yeah, and I don't see any reasonable way to do that... we don't require sed > or the like, do we? We actually do. Check the bottom of configure.in. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: Generic WAL logical messages
Hi, On 2016-02-28 22:44:12 +0100, Petr Jelinek wrote: > On 27/02/16 01:05, Andres Freund wrote: > >I'm not really convinced by RegisterStandbyMsgPrefix() et al. There's > >not much documentation about what it actually is supposed to > >acomplish. Afaics you're basically forced to use > >shared_preload_libraries with it right now? Also, iterating through a > >linked list everytime something is logged doesn't seem very satisfying? > > > > Well, my reasoning there was to stop multiple plugins from using same prefix > and for that you need some kind of registry. Making this a shared catalog > seemed like huge overkill given the potentially transient nature of output > plugins and this was the best I could come up with. And yes it requires you > to load your plugin before you can log a message for it. I think right now that's a solution that's worse than the problem. I'm inclined to define the problem away with something like "The prefix should be unique across different users of the messaging facility. Using the extension name often is a good choice.". > >>+void > >>+logicalmsg_desc(StringInfo buf, XLogReaderState *record) > >>+{ > >>+ char *rec = XLogRecGetData(record); > >>+ xl_logical_message *xlrec = (xl_logical_message *) rec; > >>+ > >>+ appendStringInfo(buf, "%s message size %zu bytes", > >>+xlrec->transactional ? "transactional" > >>: "nontransactional", > >>+xlrec->message_size); > >>+} > > > >Shouldn't we check > > uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK; > > if XLogRecGetInfo(record) == XLOG_LOGICAL_MESSAGE > >here? > > > > I thought it's kinda pointless, but we seem to be doing it in other places > so will add. It leads to a segfault or something similar when adding further message types, without a compiler warning. So it seems to be a good idea to be slightly careful. > > > >>+void > >>+ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid, XLogRecPtr > >>lsn, > >>+ bool transactional, const > >>char *prefix, Size msg_sz, > >>+ const char *msg) > >>+{ > >>+ ReorderBufferTXN *txn = NULL; > >>+ > >>+ if (transactional) > >>+ { > >>+ ReorderBufferChange *change; > >>+ > >>+ txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true); > >>+ > >>+ Assert(xid != InvalidTransactionId); > >>+ Assert(txn != NULL); > >>+ > >>+ change = ReorderBufferGetChange(rb); > >>+ change->action = REORDER_BUFFER_CHANGE_MESSAGE; > >>+ change->data.msg.transactional = true; > >>+ change->data.msg.prefix = pstrdup(prefix); > >>+ change->data.msg.message_size = msg_sz; > >>+ change->data.msg.message = palloc(msg_sz); > >>+ memcpy(change->data.msg.message, msg, msg_sz); > >>+ > >>+ ReorderBufferQueueChange(rb, xid, lsn, change); > >>+ } > >>+ else > >>+ { > >>+ rb->message(rb, txn, lsn, transactional, prefix, msg_sz, msg); > >>+ } > >>+} > > > > > >This approach prohibts catalog access when processing a nontransaction > >message as there's no snapshot set up. > > > > Hmm I do see usefulness in having snapshot, although I wonder if that does > not kill the point of non-transactional messages. I don't see how it would? It'd obviously have to be the catalog/historic snapshot a transaction would have had if it started in that moment in the original WAL stream? > Question is then though which snapshot should the message see, > base_snapshot of transaction? Well, there'll not be a transaction, but something like snapbuild.c's ->snapshot ought to do the trick. > That would mean we'd have to call SnapBuildProcessChange for > non-transactional messages which we currently avoid. 
Alternatively we > could probably invent lighter version of that interface that would > just make sure builder->snapshot is valid and if not then build it I think the latter is probably the direction we should go in. > I am honestly sure if that's a win or not. I think it'll be confusing (bug inducing) if there's no snapshot for non-transactional messages but for transactional ones, and it'll severely limit the usefulness of the interface. Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: Upper planner pathification
Hi, On 2016-02-28 15:03:28 -0500, Tom Lane wrote: > Those with long memories will recall that I've been waving my arms about > $SUBJECT for more than five years. I started to work seriously on a patch > last summer, and here is a version that I feel comfortable exposing to > public scrutiny (which is not to call it "done"; more below). Yay! > So, where to go from here? I'm acutely aware that we're hard up against > the final 9.6 commitfest, and that we discourage major patches arriving > so late in a devel cycle. But I simply couldn't get this done any faster. > I don't really want to hold it over for the 9.7 devel cycle. It's been > enough trouble maintaining this patch in the face of conflicting commits > over the last year or so (it's probably still got bugs related to parallel > query...), and there definitely are conflicting patches in the upcoming > 'fest. And the lack of this infrastructure is blocking progress on FDWs > and some other things. > > So I'd really like to get this into 9.6. I'm happy to put it into the > March commitfest if someone will volunteer to review it. Hard. This is likely to cause/trigger a number of bugs, and we don't have much time to let this mature. It's a change that we're unlikely to be able to back-out if we discover that it wasn't the right thing to integrate shortly before the release. On the other hand, this is a major architectural step forward; one that unblocks a number of nice features. There's also an argument to be made that integrating this now is beneficial, because it'll cause less churn for patches being developed while 9.6 is stabilizing. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] snapshot too old, configured by time
On 2016-02-29 18:30:27 -0600, Kevin Grittner wrote: > Basically, a connection needs to remain open and interleave > commands with other connections, which the isolation tester does > just fine; but it needs to do that using a custom postgresql.conf > file, which TAP does just fine. I haven't been able to see the > right way to get a TAP test to set up a customized installation to > run isolation tests against. If I can get that working, I have > additional tests I can drop into that. Check contrib/test_decoding's makefile. It does just that with isolationtester. Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Confusing with commit time usage in logical decoding
Hi, On 2016-03-01 18:09:28 +0100, Petr Jelinek wrote: > On 01/03/16 17:57, Alvaro Herrera wrote: > >Artur Zakirov wrote: > >>Hello, Andres > >> > >>You have introduced a large replication progress tracking infrastructure > >>last year. And there is a problem described at the link in the quote below. > >> > >>Attached patch fix this issue. Is this patch correct? I will be grateful if > >>it is and if it will be committed. > > > >AFAICS this is clearly a bug introduced in 5aa235042: > > > > /* replay actions of all transaction + subtransactions in order */ > > ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr, > >- parsed->xact_time); > >+ commit_time, origin_id, origin_lsn); > > } > > > > Well yeah but the commit_time is set few lines above as Artur pointed out, I > think the proposed fix is correct one. I'd rather just initialize commit_time to parsed->xact_time. This indeed is clearly a bug. I do wonder if anybody has a good idea about how to add regression tests for this? It's rather annoying that we have to suppress timestamps in the test_decoding tests, because they're obviously not reproducible... Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Move PinBuffer and UnpinBuffer to atomics
On March 1, 2016 8:41:33 PM PST, Dilip Kumar wrote: >On Tue, Mar 1, 2016 at 10:19 AM, Dilip Kumar >wrote: > >> >> OK, I will test it, sometime in this week. >> > >I have tested this patch in my laptop, and there i did not see any >regression at 1 client > >Shared buffer 10GB, 5 mins run with pgbench, read-only test > >basepatch >run122187 24334 >run226288 27440 >run326306 27411 > > >May be in a day or 2 I will test it in the same machine where I >reported >the regression, and if regression is there I will check the instruction >using Call grind. Sounds like a ppc vs. x86 issue. The regression was on the former, right? Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Confusing with commit time usage in logical decoding
On 2016-03-01 18:31:42 +0100, Petr Jelinek wrote: > On 01/03/16 18:18, Andres Freund wrote: > >I'd rather just initialize commit_time to parsed->xact_time. > > > >This indeed is clearly a bug. I do wonder if anybody has a good idea > >about how to add regression tests for this? It's rather annoying that > >we have to suppress timestamps in the test_decoding tests, because > >they're obviously not reproducible... > > > > The test for commit timestamps checks that the timestamps are within > reasonable time frame (for example, bigger than value of a timestamp column > in the table since that's assigned before commit obviously) , it's not > perfect but similar approach should catch issues like this one. Fixed, including such a test. Thanks for the report; and for the idea of the fix! Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_basebackup compression TODO item
On 2016-03-03 18:31:03 +0100, Magnus Hagander wrote: > I think we want it at protocol level rather than pg_basebackup level. I think we may want both eventually, but I do agree that protocol level has a lot higher "priority" than that. Something like protocol level compression has a bit of different tradeofs than compressing base backups, and it's nice not to compress, uncompress, compress again. > If SSL compression is busted on base backups, it's equally busted on > regular connection and replication streams. People do ask for > compression on that (in particular I've had a lot of requests when it > comes to replication), and our response there has traditionally been > "ssl compression"... Agreed. I think our answer there was always a bit of a cop out... Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_basebackup compression TODO item
On 2016-03-03 18:44:24 +0100, Magnus Hagander wrote: > On Thu, Mar 3, 2016 at 6:34 PM, Andres Freund wrote: > > > On 2016-03-03 18:31:03 +0100, Magnus Hagander wrote: > > > I think we want it at protocol level rather than pg_basebackup level. > > > > I think we may want both eventually, but I do agree that protocol level > > has a lot higher "priority" than that. Something like protocol level > > compression has a bit of different tradeofs than compressing base > > backups, and it's nice not to compress, uncompress, compress again. > Yeah, good point, we definitely want both. Based on the field experience > I've had (which might differ from others), having it protocol level would > help more people tough, so should be higher prio. Agreed. But then our priorities are not necessary the implementers, and I don't think there's strong enough architectural reasons to only accept protocol level for now... Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] silent data loss with ext4 / all current versions
Hi, > +/* > + * rename_safe -- rename of a file, making it on-disk persistent > + * > + * This routine ensures that a rename file persists in case of a crash by > using > + * fsync on the old and new files before and after performing the rename so > as > + * this categorizes as an all-or-nothing operation. > + */ > +int > +rename_safe(const char *oldfile, const char *newfile) > +{ > + struct stat filestats; > + > + /* > + * First fsync the old entry and new entry, it this one exists, to > ensure > + * that they are properly persistent on disk. Calling this routine with > + * an existing new target file is fine, rename() will first remove it > + * before performing its operation. > + */ > + fsync_fname(oldfile, false); > + if (stat(newfile, &filestats) == 0) > + fsync_fname(newfile, false); I don't think we want any stat()s here. I'd much, much rather check open for ENOENT. > +/* > + * link_safe -- make a file hard link, making it on-disk persistent > + * > + * This routine ensures that a hard link created on a file persists on system > + * in case of a crash by using fsync where on the link generated as well as > on > + * its parent directory. > + */ > +int > +link_safe(const char *oldfile, const char *newfile) > +{ If we go for a new abstraction here, I'd rather make it 'replace_file_safe' or something, and move the link/rename code #ifdef into it. > + if (link(oldfile, newfile) < 0) > + return -1; > + > + /* > + * Make the link persistent in case of an OS crash, the new entry > + * generated as well as its parent directory need to be flushed. > + */ > + fsync_fname(newfile, false); > + > + /* > + * Same for parent directory. This routine is never called to rename > + * files across directories, but if this proves to become the case, > + * flushing the parent directory if the old file would be necessary. > + */ > + fsync_parent_path(newfile); > + return 0; I think it's a seriously bad idea to encode that knowledge in such a general sounding routine. We could however argue that this is about safely replacing the *target* file; not about safely removing the old file. Currently I'm inclined to apply this to master soon. But I think we might want to wait a while with backpatching. The recent fsync upgrade disaster kinda makes me a bit careful... Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: CustomScan in a larger structure (RE: [HACKERS] CustomScan support on readfuncs.c)
On 2016-02-12 15:56:45 +0100, Andres Freund wrote: > Hi, > > > On 2016-02-10 23:26:20 -0500, Robert Haas wrote: > > I think the part about whacking around the FDW API is a little more > > potentially objectionable to others, so I want to hold off doing that > > unless a few more people chime in with +1. Perhaps we could start a > > new thread to talk about that specific idea. This is useful even > > without that, though. > > FWIW, I can delete a couple hundred lines of code from citusdb thanks to > this... And I'm now working on doing that. > why exactly did you expose read/writeBitmapset(), and nothing else? > Afaics a lot of the other read routines are also pretty necessary from > the outside? I'd like to also expose at least outDatum()/readDatum() - they're not entirely trivial, so it'd be sad to copy them. What I'm wondering about right now is how an extensible node should implement the equivalent of #define WRITE_NODE_FIELD(fldname) \ (appendStringInfo(str, " :" CppAsString(fldname) " "), \ _outNode(str, node->fldname)) given that _outNode isn't public, that seems to imply having to do something like #define WRITE_NODE_FIELD(fldname) \ (appendStringInfo(str, " :" CppAsString(fldname) " "), \ appendStringInfo(str, nodeToString(node->fldname))) i.e. essentially doubling memory overhead. Istm we should make outNode() externally visible? Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] silent data loss with ext4 / all current versions
On 2016-03-04 14:51:50 +0900, Michael Paquier wrote: > On Fri, Mar 4, 2016 at 4:06 AM, Andres Freund wrote: > > Hi, > > Thanks for the review. > > >> +/* > >> + * rename_safe -- rename of a file, making it on-disk persistent > >> + * > >> + * This routine ensures that a rename file persists in case of a crash by > >> using > >> + * fsync on the old and new files before and after performing the rename > >> so as > >> + * this categorizes as an all-or-nothing operation. > >> + */ > >> +int > >> +rename_safe(const char *oldfile, const char *newfile) > >> +{ > >> + struct stat filestats; > >> + > >> + /* > >> + * First fsync the old entry and new entry, it this one exists, to > >> ensure > >> + * that they are properly persistent on disk. Calling this routine > >> with > >> + * an existing new target file is fine, rename() will first remove it > >> + * before performing its operation. > >> + */ > >> + fsync_fname(oldfile, false); > >> + if (stat(newfile, &filestats) == 0) > >> + fsync_fname(newfile, false); > > > > I don't think we want any stat()s here. I'd much, much rather check open > > for ENOENT. > > OK. So you mean more or less that, right? > int fd; > fd = OpenTransientFile(newfile, PG_BINARY | O_RDONLY, 0); > if (fd < 0) > { > if (errno != ENOENT) > return -1; > } > else > { > pg_fsync(fd); > CloseTransientFile(fd); > } Yes. Otherwise the check is racy: The file could be gone by the time you do the fsync; leading to a spurious ERROR (which often would get promoted to a PANIC). > >> +/* > >> + * link_safe -- make a file hard link, making it on-disk persistent > >> + * > >> + * This routine ensures that a hard link created on a file persists on > >> system > >> + * in case of a crash by using fsync where on the link generated as well > >> as on > >> + * its parent directory. > >> + */ > >> +int > >> +link_safe(const char *oldfile, const char *newfile) > >> +{ > > > > If we go for a new abstraction here, I'd rather make it > > 'replace_file_safe' or something, and move the link/rename code #ifdef > > into it. > > Hm. OK. I don't see any reason why switching to link() even in code > paths like KeepFileRestoredFromArchive() or pgarch_archiveDone() would > be a problem thinking about it. Should HAVE_WORKING_LINK be available > on a platform we can combine it with unlink. Is that in line with what > you think? I wasn't trying to suggest we should replace all rename codepaths with the link wrapper, just the ones that already have a HAVE_WORKING_LINK check. The name of the routine I suggested is bad though... > >> + if (link(oldfile, newfile) < 0) > >> + return -1; > >> + > >> + /* > >> + * Make the link persistent in case of an OS crash, the new entry > >> + * generated as well as its parent directory need to be flushed. > >> + */ > >> + fsync_fname(newfile, false); > >> + > >> + /* > >> + * Same for parent directory. This routine is never called to rename > >> + * files across directories, but if this proves to become the case, > >> + * flushing the parent directory if the old file would be necessary. > >> + */ > >> + fsync_parent_path(newfile); > >> + return 0; > > > > I think it's a seriously bad idea to encode that knowledge in such a > > general sounding routine. We could however argue that this is about > > safely replacing the *target* file; not about safely removing the old > > file. > > Not sure I am following here. Are you referring to the fact that if > the new file and old file are on different directories would make this > routine unreliable? Yes. 
> Because yes that's the case if we want to make both of them > persistent, and I think we want to do so. That's one way. > Do you suggest to correct this comment to remove the mention to the > old file's parent directory because we just care about having the new > file as being persistent? That's one approach, yes. Combined with the fact that you can't actually reliably rename across directories, the two could be on different filesystems after all, that'd be a suitable defense. It just needs to be properly documented in the function header, not at the bottom. Regards, Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] silent data loss with ext4 / all current versions
On 2016-03-05 07:29:35 +0900, Michael Paquier wrote: > OK. I could produce that by tonight my time, not before unfortunately. I'm switching to this patch, after pushing the pending logical decoding fixes. Probably not today, but tomorrow PST afternoon should work. > And FWIW, per the comments of Andres, it is not clear to me what we > gain by having a common routine for link() and rename() knowing that > half the code paths performing a rename do not rely on link(). I'm not talking about replacing all renames with this. Just the ones that currently use link(). There's not much point in introducing link_safe(), when all the callers have the same duplicated code, with a fallback to rename(). Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] silent data loss with ext4 / all current versions
On 2016-03-05 07:43:00 +0900, Michael Paquier wrote: > On Sat, Mar 5, 2016 at 7:35 AM, Andres Freund wrote: > > On 2016-03-04 14:51:50 +0900, Michael Paquier wrote: > >> On Fri, Mar 4, 2016 at 4:06 AM, Andres Freund wrote: > >> Hm. OK. I don't see any reason why switching to link() even in code > >> paths like KeepFileRestoredFromArchive() or pgarch_archiveDone() would > >> be a problem thinking about it. Should HAVE_WORKING_LINK be available > >> on a platform we can combine it with unlink. Is that in line with what > >> you think? > > > > I wasn't trying to suggest we should replace all rename codepaths with > > the link wrapper, just the ones that already have a HAVE_WORKING_LINK > > check. The name of the routine I suggested is bad though... > > So we'd introduce a first routine rename_or_link_safe(), say replace_safe(). Or actually maybe just link_safe(), which falls back to access() && rename() if !HAVE_WORKING_LINK. > > That's one approach, yes. Combined with the fact that you can't actually > > reliably rename across directories, the two could be on different > > filesystems after all, that'd be a suitable defense. It just needs to be > > properly documented in the function header, not at the bottom. > > OK. Got it. Or the two could be on the same filesystem. > Still, link() and rename() do not support doing their stuff on > different filesystems (EXDEV). That's my point ... -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] silent data loss with ext4 / all current versions
On 2016-03-05 22:25:36 +0900, Michael Paquier wrote: > OK, I hacked a v7: > - Move the link()/rename() group with HAVE_WORKING_LINK into a single > routine, making the previous link_safe renamed to replace_safe. This > is sharing a lot of things with rename_safe. I am not sure it is worth > complicating the code more this way by having a common single routine > for whole. Thoughts welcome. Honestly, I kind of liked the separation > with link_safe/rename_safe of previous patches because link_safe could > have been directly used by extensions and plugins btw. > - Remove the call of stat() in rename_safe() and implement a logic > depending on OpenTransientFile()/pg_fsync() to flush any existing > target file before performing the rename. > Andres, feel free to use this patch as a base, perhaps that will help. I started working on this; delayed by taking longer than planned on the logical decoding stuff (quite a bit complicated by e1a11d93111ff3fba7a91f3f2ac0b0aca16909a8). I'm not very happy with the error handling as it is right now. For one, you have rename_safe return error codes, and act on them in the callers, but on the other hand you call fsync_fname which always errors out in case of failure. I also don't like the new messages much. Will continue working on this tomorrow. Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Badly designed error reporting code in controldata_utils.c
On 2016-03-06 17:16:42 -0800, Joe Conway wrote: > - #ifndef FRONTEND > - /* NOTE: caller must provide gettext call around the format string */ > - #define log_error(...) \ > - elog(ERROR, __VA_ARGS__) > - #else > - #define log_error(...) \ > - do { \ > - char *buf = psprintf(__VA_ARGS__); \ > - fprintf(stderr, "%s: %s\n", progname, buf); \ > - exit(2); \ > - } while (0) > - #endif > - FWIW I'm considering implementing elog/ereport properly for the frontend. We've grown several hacks around that not being present, and I think those by now have a higher aggregate complexity than a proper implementation would have. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] silent data loss with ext4 / all current versions
Hi, On 2016-03-05 19:54:05 -0800, Andres Freund wrote: > I started working on this; delayed by taking longer than planned on the > logical decoding stuff (quite a bit complicated by > e1a11d93111ff3fba7a91f3f2ac0b0aca16909a8). I'm not very happy with the > error handling as it is right now. For one, you have rename_safe return > error codes, and act on them in the callers, but on the other hand you > call fsync_fname which always errors out in case of failure. I also > don't like the new messages much. > > Will continue working on this tomorrow. So, here's my current version of this. I've not performed any testing yet, and it's hot of the press. There's some comment smithing needed. But otherwise I'm starting to like this. Changes: * renamed rename_safe to durable_rename * renamed replace_safe to durable_link_or_rename (there was never any replacing going on) * pass through elevel to the underlying routines, otherwise we could error out with ERROR when we don't want to. That's particularly important in case of things like InstallXLogFileSegment(). * made fsync_fname use fsync_fname_ext, add 'accept permission errors' param * have walkdir call a wrapper, to add ignore_perms param What do you think? I sure wish we had the recovery testing et al. in all the back branches... - Andres >From e60caf094f68496658e969cdd4df919fd66e9d29 Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Sun, 6 Mar 2016 22:20:17 -0800 Subject: [PATCH 1/2] durable-rename-andres-v8 --- src/backend/access/transam/timeline.c| 40 + src/backend/access/transam/xlog.c| 64 ++-- src/backend/access/transam/xlogarchive.c | 21 +-- src/backend/postmaster/pgarch.c | 6 +- src/backend/replication/logical/origin.c | 23 +-- src/backend/replication/slot.c | 2 +- src/backend/storage/file/fd.c| 267 +++ src/include/storage/fd.h | 4 +- 8 files changed, 232 insertions(+), 195 deletions(-) diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c index f6da673..bd91573 100644 --- a/src/backend/access/transam/timeline.c +++ b/src/backend/access/transam/timeline.c @@ -418,24 +418,10 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI, TLHistoryFilePath(path, newTLI); /* - * Prefer link() to rename() here just to be really sure that we don't - * overwrite an existing file. However, there shouldn't be one, so - * rename() is an acceptable substitute except for the truly paranoid. + * Perform the rename using link if available, paranoidly trying to avoid + * overwriting an existing file (there shouldn't be one). */ -#if HAVE_WORKING_LINK - if (link(tmppath, path) < 0) - ereport(ERROR, -(errcode_for_file_access(), - errmsg("could not link file \"%s\" to \"%s\": %m", - tmppath, path))); - unlink(tmppath); -#else - if (rename(tmppath, path) < 0) - ereport(ERROR, -(errcode_for_file_access(), - errmsg("could not rename file \"%s\" to \"%s\": %m", - tmppath, path))); -#endif + durable_link_or_rename(tmppath, path, ERROR); /* The history file can be archived immediately. */ if (XLogArchivingActive()) @@ -508,24 +494,10 @@ writeTimeLineHistoryFile(TimeLineID tli, char *content, int size) TLHistoryFilePath(path, tli); /* - * Prefer link() to rename() here just to be really sure that we don't - * overwrite an existing logfile. However, there shouldn't be one, so - * rename() is an acceptable substitute except for the truly paranoid. + * Perform the rename using link if available, paranoidly trying to avoid + * overwriting an existing file (there shouldn't be one). 
*/ -#if HAVE_WORKING_LINK - if (link(tmppath, path) < 0) - ereport(ERROR, -(errcode_for_file_access(), - errmsg("could not link file \"%s\" to \"%s\": %m", - tmppath, path))); - unlink(tmppath); -#else - if (rename(tmppath, path) < 0) - ereport(ERROR, -(errcode_for_file_access(), - errmsg("could not rename file \"%s\" to \"%s\": %m", - tmppath, path))); -#endif + durable_link_or_rename(tmppath, path, ERROR); } /* diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 00f139a..2d63a54 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -3299,34 +3299,16 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath, } /* - * Prefer link() to rename() here just to be really sure that we don't - * overwrite an existing logfile. However, there shouldn't be one, so - * rename() is an acceptable substitute except for the truly paranoid. + * Perform the rename using link if available, paranoidly tryin
Re: [HACKERS] checkpointer continuous flushing - V16
On 2016-03-01 16:06:47 +0100, Tomas Vondra wrote: > 1) HP DL380 G5 (old rack server) > - 2x Xeon E5450, 16GB RAM (8 cores) > - 4x 10k SAS drives in RAID-10 on H400 controller (with BBWC) > - RedHat 6 > - shared_buffers = 4GB > - min_wal_size = 2GB > - max_wal_size = 6GB > > 2) workstation with i5 CPU > - 1x i5-2500k, 8GB RAM > - 6x Intel S3700 100GB (in RAID0 for this benchmark) > - Gentoo > - shared_buffers = 2GB > - min_wal_size = 1GB > - max_wal_size = 8GB Thinking about with that hardware I'm not suprised if you're only seing small benefits. The amount of ram limits the amount of dirty data; and you have plenty have on-storage buffering in comparison to that. > Both machines were using the same kernel version 4.4.2 and default io > scheduler (cfq). The > > The test procedure was quite simple - pgbench with three different scales, > for each scale three runs, 1h per run (and 30 minutes of warmup before each > run). > > Due to the difference in amount of RAM, each machine used different scales - > the goal is to have small, ~50% RAM, >200% RAM sizes: > > 1) Xeon: 100, 400, 6000 > 2) i5: 50, 200, 3000 > > The commits actually tested are > >cfafd8be (right before the first patch) >7975c5e0 Allow the WAL writer to flush WAL at a reduced rate. >db76b1ef Allow SetHintBits() to succeed if the buffer's LSN ... Huh, now I'm a bit confused. These are the commits you tested? Those aren't the ones doing sorting and flushing? > Also, I really wonder what will happen with non-default io schedulers. I > believe all the testing so far was done with cfq, so what happens on > machines that use e.g. "deadline" (as many DB machines actually do)? deadline and noop showed slightly bigger benefits in my testing. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] checkpointer continuous flushing - V16
On 2016-03-07 09:41:51 -0800, Andres Freund wrote: > > Due to the difference in amount of RAM, each machine used different scales - > > the goal is to have small, ~50% RAM, >200% RAM sizes: > > > > 1) Xeon: 100, 400, 6000 > > 2) i5: 50, 200, 3000 > > > > The commits actually tested are > > > >cfafd8be (right before the first patch) > >7975c5e0 Allow the WAL writer to flush WAL at a reduced rate. > >db76b1ef Allow SetHintBits() to succeed if the buffer's LSN ... > > Huh, now I'm a bit confused. These are the commits you tested? Those > aren't the ones doing sorting and flushing? To clarify: The reason we'd not expect to see much difference here is that the above commits really only have any affect above noise if you use synchronous_commit=off. Without async commit it's just one additional gettimeofday() call and a few additional branches in the wal writer every wal_writer_delay. Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] checkpointer continuous flushing - V18
On 2016-02-22 20:44:35 +0100, Fabien COELHO wrote: > > >>Random updates on 16 tables which total to 1.1GB of data, so this is in > >>buffer, no significant "read" traffic. > >> > >>(1) with 16 tablespaces (1 per table) on 1 disk : 680.0 tps > >>per second avg, stddev [ min q1 median d3 max ] <=300tps > >>679.6 ± 750.4 [0.0, 317.0, 371.0, 438.5, 2724.0] 19.5% > >> > >>(2) with 1 tablespace on 1 disk : 956.0 tps > >>per second avg, stddev [ min q1 median d3 max ] <=300tps > >>956.2 ± 796.5 [3.0, 488.0, 583.0, 742.0, 2774.0] 2.1% > > > >Interesting. That doesn't reflect my own tests, even on rotating media, > >at all. I wonder if it's related to: > >https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=23d0127096cb91cb6d354bdc71bd88a7bae3a1d5 > > > >If you use your 12.04 kernel, that'd not be fixed. Which might be a > >reason to do it as you suggest. > > > >Could you share the exact details of that workload? > > See attached scripts (sh to create the 16 tables in the default or 16 table > spaces, small sql bench script, stat computation script). > > The per-second stats were computed with: > > grep progress: pgbench.out | cut -d' ' -f4 | avg.py --length=1000 > --limit=300 > > Host is 8 cpu 16 GB, 2 HDD in RAID 1. Well, that's not a particularly meaningful workload. You increased the number of flushed to the same number of disks considerably. For a meaningful comparison you'd have to compare using one writeback context for N tablespaces on N separate disks/raids, and using N writeback contexts for the same. Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] checkpointer continuous flushing - V18
On 2016-03-07 21:10:19 +0100, Fabien COELHO wrote: > Now I cannot see how having one context per table space would have a > significant negative performance impact. The 'dirty data' etc. limits are global, not per block device. By having several contexts with unflushed dirty data the total amount of dirty data in the kernel increases. Thus you're more likely to see stalls by the kernel moving pages into writeback. Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Allowing to run a buildfarm animal under valgrind
Hi, I'm setting up a buildfarm animal that runs under valgrind. Unfortunately there's not really any good solution to force make check et al. to start postgres wrapped in valgrind. For now I've resorted to adding something like sub replace_postgres { my $srcdir=$use_vpath ? "../pgsql/" : "."; my $builddir=abs_path("$pgsql"); $srcdir=abs_path("$pgsql/$srcdir"); chdir "$pgsql/src/backend/"; rename "postgres", "postgres.orig"; sysopen my $fh, "postgres", O_CREAT|O_TRUNC|O_RDWR, 0700 or die "Could not create postgres wrapper"; print $fh <<"END"; #!/bin/bash ~/src/valgrind/vg-in-place \\ --quiet \\ --error-exitcode=128 \\ --suppressions=$srcdir/src/tools/valgrind.supp \\ --trace-children=yes --track-origins=yes --read-var-info=yes \\ --leak-check=no \\ $builddir/src/backend/postgres.orig \\ "\$@" END close $fh; chdir $branch_root; } to the buildfarm client. i.e. a script that replaces the postgres binary with a wrapper that invokes postgres via valgrind. That's obviously not a very good approach though. It's buildfarm specific and thus can't be invoked by developers and it doesn't really support being installed somewhere. Does anybody have a better idea about how to do this? Regards, Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] silent data loss with ext4 / all current versions
Hi, On 2016-03-08 12:01:18 +0900, Michael Paquier wrote: > I have spent a couple of hours looking at that in details, and the > patch is neat. Cool. Doing some more polishing right now. Will be back with an updated version soonish. Did you do some testing? > + * This routine ensures that, after returning, the effect of renaming file > + * persists in case of a crash. A crash while this routine is running will > + * leave you with either the old, or the new file. > "you" is not really Postgres-like, "the server" or "the backend" perhaps? Hm. I think your alternative proposals are more awkward. > + /* XXX: perform close() before? might be outside a > transaction. Consider errno! */ > ereport(elevel, > (errcode_for_file_access(), > errmsg("could not fsync file \"%s\": %m", fname))); > + (void) CloseTransientFile(fd); > + return -1; > close() should be called before. slot.c for example does so and we > don't want to link an fd here in case of elevel >= ERROR. Note that the transient file machinery will normally prevent fd leakage - but it can only do so if called in a transaction context. I've added int save_errno; /* close file upon error, might not be in transaction context */ save_errno = errno; CloseTransientFile(fd); errno = save_errno; stanzas. > + * It does so by using fsync on the sourcefile before the rename, and the > + * target file and directory after. > fsync is issued as well on the target file if it exists. I think > that's worth mentioning in the header. Ok. > + /* XXX: Add racy file existence check? */ > + if (rename(oldfile, newfile) < 0) > I am not sure we should worry about that, what do you think could > cause the old file from going missing all of a sudden. Other backend > processes are not playing with it in the code paths where this routine > is called. Perhaps adding a comment in the header to let users know > that would help? What I'm thinking of is adding a check whether the *target* file already exists, and error out in that case. Just like the link() based path normally does. > Instead of "durable" I think that "persistent" makes more sense. I find durable a lot more descriptive. persistent could refer to retrying the rename or something. > We > want to make those renames persistent on disk on case of a crash. So I > would suggest the following routine names: > - rename_persistent > - rename_or_link_persistent > Having the verb first also helps identifying that this is a > system-level, rename()-like, routine. I prefer the current names. > > I sure wish we had the recovery testing et al. in all the back > > branches... > > Sure, what we have now is something that should really be backpatched, > I was just waiting to have all the existing stability issues > addressed, the last one on my agenda being the failure of hamster for > test 005 I mentioned in another thread before sending patches for > other branches. I proposed a couple of potential regarding that > actually, see here: > http://www.postgresql.org/message-id/cab7npqsaz9hnucmoua30jo2wj8mnrem18p2a7mcra-zrjxj...@mail.gmail.com Yea. Will be an interesting discussion... Anyway, I did run the patch through the existing checks, after enabling fsync in PostgresNode.pm. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Badly designed error reporting code in controldata_utils.c
On 2016-03-08 13:45:25 +0900, Michael Paquier wrote: > On Mon, Mar 7, 2016 at 10:26 AM, Andres Freund wrote: > > FWIW I'm considering implementing elog/ereport properly for the > > frontend. We've grown several hacks around that not being present, and > > I think those by now have a higher aggregate complexity than a proper > > implementation would have. > > That would be handy. And this is are going to need something like > callbacks to allow frontend applications how to apply elevel. Take for > example pg_rewind, it has an interface with DEBUG and PROGRESS level > directly embedded with FATAL controlled by user-defined options. What does "directly embedded with FATAL" mean? I don't really want to go further than what our system already is capable of; once we have that we can look for the next steps. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] silent data loss with ext4 / all current versions
On 2016-03-08 12:26:34 +0900, Michael Paquier wrote: > On Tue, Mar 8, 2016 at 12:18 PM, Andres Freund wrote: > > On 2016-03-08 12:01:18 +0900, Michael Paquier wrote: > >> I have spent a couple of hours looking at that in details, and the > >> patch is neat. > > > > Cool. Doing some more polishing right now. Will be back with an updated > > version soonish. > > > > Did you do some testing? > > Not much in details yet, I just ran a check-world with fsync enabled > for the recovery tests, plus quick manual tests with a cluster > manually set up. I'll do more with your new version now that I know > there will be one. Here's my updated version. Note that I've split the patch into two. One for the infrastructure, and one for the callsites. > >> + /* XXX: Add racy file existence check? */ > >> + if (rename(oldfile, newfile) < 0) > > > >> I am not sure we should worry about that, what do you think could > >> cause the old file from going missing all of a sudden. Other backend > >> processes are not playing with it in the code paths where this routine > >> is called. Perhaps adding a comment in the header to let users know > >> that would help? > > > > What I'm thinking of is adding a check whether the *target* file already > > exists, and error out in that case. Just like the link() based path > > normally does. > > Ah, OK. Well, why not. I'd rather have an assertion instead of an error > though. I think it should definitely be an error if anything. But I'd rather only add it in master... Andres >From 9dc71e059cc50d57e7f4f42c68b1c4afa07279a3 Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Mon, 7 Mar 2016 15:04:17 -0800 Subject: [PATCH 1/2] Introduce durable_rename() and durable_link_or_rename(). Renaming a file using rename(2) is not guaranteed to be durable in face of crashes. To be certain that a rename() atomically replaces the previous file contents in the face of crashes and different filesystems, one has to fsync the old filename, rename the file, fsync the new filename, fsync the containing directory. This sequence is not correctly adhered to currently; which exposes us to data loss risks. To avoid having to repeat this arduous sequence, introduce durable_rename(), which wraps all that. Also add durable_link_or_rename(). Several places use link() (with a fallback to rename()) to rename a file, trying to avoid replacing the target file out of paranoia. Some of those rename sequences need to be durable as well. This commit does not yet make use of the new functions; they're used in a followup commit. 
Author: Michael Paquier, Andres Freund Discussion: 56583bdd.9060...@2ndquadrant.com Backpatch: All supported branches --- src/backend/replication/slot.c | 2 +- src/backend/storage/file/fd.c | 287 - src/include/storage/fd.h | 4 +- 3 files changed, 228 insertions(+), 65 deletions(-) diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c index affa9b9..ead221d 100644 --- a/src/backend/replication/slot.c +++ b/src/backend/replication/slot.c @@ -1095,7 +1095,7 @@ SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel) START_CRIT_SECTION(); fsync_fname(path, false); - fsync_fname((char *) dir, true); + fsync_fname(dir, true); fsync_fname("pg_replslot", true); END_CRIT_SECTION(); diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c index 1b30100..c9f9b7d 100644 --- a/src/backend/storage/file/fd.c +++ b/src/backend/storage/file/fd.c @@ -306,7 +306,10 @@ static void walkdir(const char *path, #ifdef PG_FLUSH_DATA_WORKS static void pre_sync_fname(const char *fname, bool isdir, int elevel); #endif -static void fsync_fname_ext(const char *fname, bool isdir, int elevel); +static void datadir_fsync_fname(const char *fname, bool isdir, int elevel); + +static int fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel); +static int fsync_parent_path(const char *fname, int elevel); /* @@ -413,54 +416,158 @@ pg_flush_data(int fd, off_t offset, off_t amount) * indicate the OS just doesn't allow/require fsyncing directories. */ void -fsync_fname(char *fname, bool isdir) +fsync_fname(const char *fname, bool isdir) { - int fd; - int returncode; - - /* - * Some OSs require directories to be opened read-only whereas other - * systems don't allow us to fsync files opened read-only; so we need both - * cases here - */ - if (!isdir) - fd = OpenTransientFile(fname, - O_RDWR | PG_BINARY, - S_IRUSR | S_IWUSR); - else - fd = OpenTransientFile(fname, - O_RDONLY | PG_BINARY, - S_IRUSR | S_IWUSR); - - /* - * Some OSs don't allow us to open directories at all (Windows returns - * EACCE
Re: [HACKERS] Allowing to run a buildfarm animal under valgrind
Hi, On 2016-03-07 17:39:30 -0800, Andres Freund wrote: > I'm setting up a buildfarm animal that runs under valgrind. Which is now running as 'skink'. The first failed due to a missing trick in the wrapper script, but the second one looks like it had a legit issue: http://pgbuildfarm.org/cgi-bin/show_log.pl?nm=skink&dt=2016-03-08%2004%3A22%3A00 ==1827== Invalid write of size 4 ==1827==at 0x14E35DD1: plperl_ref_from_pg_array (plperl.c:1459) ==1827==by 0x14E3608C: plperl_call_perl_func (plperl.c:2135) ==1827==by 0x14E3C24F: plperl_func_handler (plperl.c:2357) ==1827==by 0x14E3C24F: plperl_call_handler (plperl.c:1778) ==1827==by 0x5E0531: ExecMakeFunctionResultNoSets (execQual.c:2041) ==1827==by 0x5E641C: ExecTargetList (execQual.c:5367) ==1827==by 0x5E641C: ExecProject (execQual.c:5582) ==1827==by 0x5FC1C1: ExecResult (nodeResult.c:155) ==1827==by 0x5DF577: ExecProcNode (execProcnode.c:392) ==1827==by 0x5DB675: ExecutePlan (execMain.c:1566) ==1827==by 0x5DB675: standard_ExecutorRun (execMain.c:338) ==1827==by 0x6F4DD7: PortalRunSelect (pquery.c:942) ==1827==by 0x6F631D: PortalRun (pquery.c:786) ==1827==by 0x6F2FFA: exec_simple_query (postgres.c:1094) ==1827==by 0x6F2FFA: PostgresMain (postgres.c:4021) ==1827==by 0x46D33F: BackendRun (postmaster.c:4258) ==1827==by 0x46D33F: BackendStartup (postmaster.c:3932) ==1827==by 0x46D33F: ServerLoop (postmaster.c:1690) ==1827== Address 0x15803220 is 304 bytes inside a block of size 8,192 alloc'd ==1827==at 0x4C28BB5: malloc (vg_replace_malloc.c:299) ==1827==by 0x808A37: AllocSetAlloc (aset.c:864) ==1827==by 0x80A3B3: palloc (mcxt.c:907) ==1827==by 0x7E0802: get_func_signature (lsyscache.c:1483) ==1827==by 0x14E367D6: plperl_call_perl_func (plperl.c:2116) ==1827==by 0x14E3C24F: plperl_func_handler (plperl.c:2357) ==1827==by 0x14E3C24F: plperl_call_handler (plperl.c:1778) ==1827==by 0x5E0531: ExecMakeFunctionResultNoSets (execQual.c:2041) ==1827==by 0x5E641C: ExecTargetList (execQual.c:5367) ==1827==by 0x5E641C: ExecProject (execQual.c:5582) ==1827==by 0x5FC1C1: ExecResult (nodeResult.c:155) ==1827==by 0x5DF577: ExecProcNode (execProcnode.c:392) ==1827==by 0x5DB675: ExecutePlan (execMain.c:1566) ==1827==by 0x5DB675: standard_ExecutorRun (execMain.c:338) ==1827==by 0x6F4DD7: PortalRunSelect (pquery.c:942) ==1827== I've now added configuration to valgrind so it wraps VALGRINDERROR-BEGIN / VALGRINDERROR-END around errors, to make the logs easier to search. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] empty array case in plperl_ref_from_pg_array not handled correctly
Hi, Per the new valgrind animal we get: http://pgbuildfarm.org/cgi-bin/show_log.pl?nm=skink&dt=2016-03-08%2004%3A22%3A00 2016-03-08 05:56:05.566 UTC [56de6971.723:5] LOG: statement: select plperl_sum_array('{}'); ==1827== Invalid write of size 4 ==1827==at 0x14E35DD1: plperl_ref_from_pg_array (plperl.c:1459) ==1827==by 0x14E3608C: plperl_call_perl_func (plperl.c:2135) ==1827==by 0x14E3C24F: plperl_func_handler (plperl.c:2357) ==1827==by 0x14E3C24F: plperl_call_handler (plperl.c:1778) ==1827==by 0x5E0531: ExecMakeFunctionResultNoSets (execQual.c:2041) ==1827==by 0x5E641C: ExecTargetList (execQual.c:5367) ==1827==by 0x5E641C: ExecProject (execQual.c:5582) ==1827==by 0x5FC1C1: ExecResult (nodeResult.c:155) ==1827==by 0x5DF577: ExecProcNode (execProcnode.c:392) ==1827==by 0x5DB675: ExecutePlan (execMain.c:1566) ==1827==by 0x5DB675: standard_ExecutorRun (execMain.c:338) ==1827==by 0x6F4DD7: PortalRunSelect (pquery.c:942) ==1827==by 0x6F631D: PortalRun (pquery.c:786) ==1827==by 0x6F2FFA: exec_simple_query (postgres.c:1094) ==1827==by 0x6F2FFA: PostgresMain (postgres.c:4021) ==1827==by 0x46D33F: BackendRun (postmaster.c:4258) ==1827==by 0x46D33F: BackendStartup (postmaster.c:3932) ==1827==by 0x46D33F: ServerLoop (postmaster.c:1690) ==1827== Address 0x15803220 is 304 bytes inside a block of size 8,192 alloc'd ==1827==at 0x4C28BB5: malloc (vg_replace_malloc.c:299) ==1827==by 0x808A37: AllocSetAlloc (aset.c:864) ==1827==by 0x80A3B3: palloc (mcxt.c:907) ==1827==by 0x7E0802: get_func_signature (lsyscache.c:1483) ==1827==by 0x14E367D6: plperl_call_perl_func (plperl.c:2116) ==1827==by 0x14E3C24F: plperl_func_handler (plperl.c:2357) ==1827==by 0x14E3C24F: plperl_call_handler (plperl.c:1778) ==1827==by 0x5E0531: ExecMakeFunctionResultNoSets (execQual.c:2041) ==1827==by 0x5E641C: ExecTargetList (execQual.c:5367) ==1827==by 0x5E641C: ExecProject (execQual.c:5582) ==1827==by 0x5FC1C1: ExecResult (nodeResult.c:155) ==1827==by 0x5DF577: ExecProcNode (execProcnode.c:392) ==1827==by 0x5DB675: ExecutePlan (execMain.c:1566) ==1827==by 0x5DB675: standard_ExecutorRun (execMain.c:338) ==1827==by 0x6F4DD7: PortalRunSelect (pquery.c:942) ==1827== looking at the code static SV * plperl_ref_from_pg_array(Datum arg, Oid typid) { ... /* Get total number of elements in each dimension */ info->nelems = palloc(sizeof(int) * info->ndims); info->nelems[0] = nitems; that's not suprising. If ndims is zero nelemes will be a zero-length array. Adding Assert(info->ndims > 0); makes it die reliably, without gdb. ISTM the assumption that an array always has a dimension is a bit more widespread... E.g. split_array() looks like it'd not work nicely with a zero dimensional array... Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
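As a standalone illustration of the bug class valgrind is flagging here (this is not plperl code, just a minimal sketch): allocating space for zero dimensions and then unconditionally writing the first element is an out-of-bounds write, so the zero-dimensional case has to be special-cased before nelems[0] is touched.

#include <stdlib.h>

int
main(void)
{
	int		ndims = 0;		/* a zero-dimensional array, e.g. '{}' */
	int		nitems = 0;
	int	   *nelems;

	nelems = malloc(sizeof(int) * ndims);	/* zero-length allocation */

	if (ndims > 0)
		nelems[0] = nitems;	/* only safe when there's at least one dimension */

	free(nelems);
	return 0;
}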
Re: [HACKERS] silent data loss with ext4 / all current versions
Hi, On 2016-03-08 16:21:45 +0900, Michael Paquier wrote: > + durable_link_or_rename(tmppath, path, ERROR); > + durable_rename(path, xlogfpath, ERROR); > You may want to add a (void) cast in front of those calls for correctness. "correctness"? This is neatnikism, not correctness. I've actually added (void)'s to the sites that return on error (i.e. pass LOG or something), but not the ones where we pass ERROR. > - ereport(LOG, > - (errcode_for_file_access(), > -errmsg("could not link file \"%s\" to \"%s\" > (initialization of log file): %m", > - tmppath, path))); > We lose a portion of the error message here, but with the file name > that's easy to guess where that is happening. I am not complaining > (that's fine to me as-is), just mentioning for the archive's sake. Yea, I think that's fine too. - Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
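For readers following along, a heavily simplified sketch of what a "durable rename" has to do on POSIX systems - an illustration of the idea only, not the code from fd.c (the real version also deals with Windows, the link()-based variant, opening directories read-only vs. files read-write, and proper error reporting):

#include <fcntl.h>
#include <libgen.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* fsync the directory containing "fname", so the directory entry is durable */
static int
fsync_parent_dir(const char *fname)
{
	char	path[1024];
	int		fd;

	strncpy(path, fname, sizeof(path) - 1);
	path[sizeof(path) - 1] = '\0';

	fd = open(dirname(path), O_RDONLY);
	if (fd < 0)
		return -1;
	if (fsync(fd) != 0)
	{
		close(fd);
		return -1;
	}
	return close(fd);
}

/* rename oldfile to newfile, and make the rename itself survive a crash */
static int
durable_rename_sketch(const char *oldfile, const char *newfile)
{
	int		fd = open(oldfile, O_RDWR);

	if (fd < 0)
		return -1;
	if (fsync(fd) != 0)			/* flush the file's contents first */
	{
		close(fd);
		return -1;
	}
	close(fd);

	if (rename(oldfile, newfile) != 0)
		return -1;

	return fsync_parent_dir(newfile);	/* otherwise ext4 & co. may lose the entry */
}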
Re: [HACKERS] Allowing to run a buildfarm animal under valgrind
On 2016-03-08 08:58:22 -0500, Andrew Dunstan wrote: > On 03/07/2016 08:39 PM, Andres Freund wrote: > >Does anybody have a better idea about how to do this? > > Why not just create a make target which does this? It could be run after > 'make' and before 'make check'. I would make it assume valgrind was > installed and in the path rather than using vg-in-place. I don't really see how that'd work. make check et al. start postgres via pg_ctl, so we need to invoke valgrind from there. Additionally I don't want to just be able to run make check via valgrind, but all contrib modules et al too, including eventually the replication regression tests and such. > If that doesn't work and you want to do something with the buildfarm, > probably a buildfarm module would be the way to go. We might need to add a > new module hook to support it, to run right after make(), but that would be > a one line addition to run_build.pl. Yea, I think that's what it probably has to be. What I'm decidedly unhappy with right now is that this seems to require moving make install up, or manually adding a new file for installation. The reason for that is that if we replace the postgres binary with the wrapper, the original file obviously doesn't get installed anymore. So it's invoked at its original location; which only works if the share files are installed in the correct location. I do wonder if adding a PGCTLPOSTGRESWRAPPER or something to pg_ctl would be ok. That'd just supplant calling the postgres binary, making all this easier. Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Allowing to run a buildfarm animal under valgrind
On 2016-03-08 18:24:23 -0300, Alvaro Herrera wrote: > Andres Freund wrote: > > > I do wonder if adding a PGCTLPOSTGRESWRAPPER or something to pg_ctl > > would be ok. That'd just supplant calling the postgres binary, making > > all this easier. > > This seems a reasonably principled way to go about this. Eventually we > might plug other things in it ... It's not that easy to write such a wrapper though - you *have* to exec the final binary, because -w assumes that the child pid is going to appear in postmaster.pid... To be actually useful we kinda would have to backpatch this to 9.4 (where the valgrind hardening stuff started)... Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] empty array case in plperl_ref_from_pg_array not handled correctly
On 2016-03-08 02:11:03 -0700, Alex Hunsaker wrote: > On Mon, Mar 7, 2016 at 11:32 PM, Andres Freund wrote: > > > Hi, > > > > Per the new valgrind animal we get: > > > > > > http://pgbuildfarm.org/cgi-bin/show_log.pl?nm=skink&dt=2016-03-08%2004%3A22%3A00 > > 2016-03-08 05:56:05.566 UTC [56de6971.723:5] LOG: statement: select > > plperl_sum_array('{}'); > > ==1827== Invalid write of size 4 > > ==1827==at 0x14E35DD1: plperl_ref_from_pg_array (plperl.c:1459) > > > > > [ I think you may have meant to CC me not Alexey K. I'm probably the person > responsible :D. ] I just took the first person mentioned in the commit message http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=87bb2ade2ce646083f39d5ab3e3307490211ad04 sorry for leaving you out ;) > Recursive calls to split_array() should be fine as nested 0 dim arrays get > collapsed down to single 0 dim array (I hand tested it nonetheless): > # select ARRAY[ARRAY[ARRAY[]]]::int[]; > array > ─── > {} > (1 row) > > Looking around, everything that makes an array seems to pass through > plperl_ref_from_pg_array(), so I think once that is fixed all is good. I > also looked for dimensions and ARR_NDIM() just to make sure (didn't find > anything fishy). Thanks for looking. backpatched all the way. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] silent data loss with ext4 / all current versions
On 2016-03-08 23:47:48 +0100, Tomas Vondra wrote: > I've repeated the power-loss testing today. With the patches applied I'm > no longer able to reproduce the issue (despite trying about 10x), while > without them I've hit it on the first try. This is on kernel 4.4.2. Yay, thanks for testing! Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Idle In Transaction Session Timeout, revived
On 2016-03-08 16:42:37 -0500, Robert Haas wrote: > - I really wonder if the decision to ignore sessions that are idle in > transaction (aborted) should be revisited. Consider this: > > rhaas=# begin; > BEGIN > rhaas=# lock table pg_class; > LOCK TABLE > rhaas=# savepoint a; > SAVEPOINT > rhaas=# select 1/0; > ERROR: division by zero Probably only if the toplevel transaction is also aborted. Might not be entirely trivial to determine. > - What's the right order of events in PostgresMain? Right now the > patch disables the timeout after checking for interrupts and clearing > DoingCommandRead, but maybe it would be better to disable the timeout > first, so that we can't have it happen that we start to execute the > command and then, in medias res, bomb out because of the idle timeout. > Then again, maybe you had some compelling reason for doing it this > way, in which case we should document that in the comments. Hm, we should never bomb out of the middle of anything with this, right? I mean all the timeout handler does is set a volatile var and set a latch; the rest is done in the interrupt handler, which is not called from the signal handler. I'm not arguing for the current order, just making that one argument ;). FWIW, I think Vik wrote this after discussing it with me, and IIRC there was not a lot of reasoning going into the specific location for doing this. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
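To illustrate the pattern being described, here is a rough sketch of how such a timeout can be wired up - the names and error details are approximations rather than the patch's exact code. The point is that the signal-time handler only records the event and sets the latch, while the actual FATAL exit happens later, when interrupts are processed at a safe spot:

static volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false;

/* timeout handler: runs in signal context, so do as little as possible */
static void
IdleInTransactionSessionTimeoutHandler(void)
{
	IdleInTransactionSessionTimeoutPending = true;
	InterruptPending = true;
	SetLatch(MyLatch);		/* wake the backend if it is sleeping */
}

/* later, from ProcessInterrupts(), i.e. at a CHECK_FOR_INTERRUPTS() call,
 * never from within the signal handler itself: */
if (IdleInTransactionSessionTimeoutPending)
{
	IdleInTransactionSessionTimeoutPending = false;
	ereport(FATAL,
			(errmsg("terminating connection due to idle-in-transaction timeout")));
}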
Re: [HACKERS] fun with "Ready for Committer" patches
On 2016-03-09 08:18:09 +0900, Tatsuo Ishii wrote: > > It's hard to miss the fact that there are an absolutely breathtaking > > number of patches in this CommitFest - 80! - that are in the "needs > > review" state. We really, really, really need more review to happen - > > Many of "needs review" state patches already have reviewer(s). Do you > mean we want more reviewers in addition to them for such patches? Any review performed is worthwhile. Somebody having signed up to review a patch shouldn't stop you. Especially if that signup was more than a couple hours ago. Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] the include-timestamp data returned by pg_logical_slot_peek_changes() is always 2000-01-01 in 9.5.1
On March 9, 2016 4:26:01 AM PST, Craig Ringer wrote: >On 9 March 2016 at 18:13, 李海龙 wrote: > >> >> >> HI, pgsql-hackers >> >> The include-timestamp data returned by >pg_logical_slot_peek_changes() is >> always 2000-01-01 in 9.5.1, is it not normal? It's a bug in 9.5 that has been fixed. You can either update to the version from git, wait for the next minor release, or use 9.4, where the bug is not present. >Did you enable track_commit_timestamps in the server? That's unrelated; commit ts is about an xid-to-timestamp mapping. Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fun with "Ready for Committer" patches
On 2016-03-10 06:14:25 +0900, Michael Paquier wrote: > IMO, during a review one needs to think of himself as a committer. > Once the reviewer switches the patch to "Ready for committer", it > means that the last version of the patch present would have been the > version that gained the right to be pushed. And one consideration there is whether you, as the committer, would be ok with maintaining this feature going forward. But I think for less experienced reviewers that's hard to judge; and I think asking everyone to do that raises the barrier to doing reviews considerably. So I think we should somehow document that it's ok to mark the patch as such, but that you're not forced to do so if you don't want to. Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] silent data loss with ext4 / all current versions
On 2016-03-07 21:55:52 -0800, Andres Freund wrote: > Here's my updated version. > > Note that I've split the patch into two. One for the infrastructure, and > one for the callsites. I've finally pushed these, after making a number of mostly cosmetic fixes. The only one of real consequence is that I've removed the durable_* call from the renames to .deleted in xlog[archive].c - these don't need to be durable, and are Windows-only. Oh, and there was a typo in the !HAVE_WORKING_LINK case. There's a *lot* of version skew here: not-present functionality, moved files, different APIs - we got it all. I've tried to check in each version whether we're missing fsyncs for renames and everything. Michael, *please* double check the diffs for the different branches. Note that we currently have some frontend programs with the equivalent problem. Most importantly, receivelog.c (pg_basebackup/pg_receivexlog) is missing pretty much the same directory fsyncs. And at least for pg_receivexlog it's critical, especially now that pg_receivexlog supports syncrep. I've not done anything about that; there's pretty much no chance to share backend code with the frontend in the back-branches. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] pg_rewind just doesn't fsync *anything*?
Hi, how come the only comment in pg_rewind about fsyncing is this: void close_target_file(void) { ... /* fsync? */ } Isn't that a bit, uh, minimal for a utility that's likely to be used in failover scenarios? I think we might actually be "saved" due to http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=2ce439f33 because pg_rewind appears to leave the cluster in ControlFile_new.state = DB_IN_ARCHIVE_RECOVERY; updateControlFile(&ControlFile_new); a state that StartupXLOG will treat as needing recovery: if (ControlFile->state != DB_SHUTDOWNED && ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY) SyncDataDirectory(); but that code went in after pg_rewind, so this certainly can't be an intentional save. I also don't think it's ok that you need to start the cluster to make it safe against a crash. I guess the easiest fix would be to shell out to initdb -S? Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: Upper planner pathification
Hi Tom, On 2016-02-28 15:03:28 -0500, Tom Lane wrote: > diff --git a/src/include/optimizer/planmain.h > b/src/include/optimizer/planmain.h > index eaa642b..cd7338a 100644 > *** a/src/include/optimizer/planmain.h > --- b/src/include/optimizer/planmain.h > *** extern RelOptInfo *query_planner(Planner > *** 43,102 >* prototypes for plan/planagg.c >*/ > extern void preprocess_minmax_aggregates(PlannerInfo *root, List *tlist); > - extern Plan *optimize_minmax_aggregates(PlannerInfo *root, List *tlist, > -const AggClauseCosts > *aggcosts, Path *best_path); > > /* >* prototypes for plan/createplan.c >*/ > extern Plan *create_plan(PlannerInfo *root, Path *best_path); > - extern SubqueryScan *make_subqueryscan(List *qptlist, List *qpqual, > - Index scanrelid, Plan *subplan); > extern ForeignScan *make_foreignscan(List *qptlist, List *qpqual, >Index scanrelid, List *fdw_exprs, List > *fdw_private, >List *fdw_scan_tlist, List *fdw_recheck_quals, >Plan *outer_plan); > - extern Append *make_append(List *appendplans, List *tlist); > - extern RecursiveUnion *make_recursive_union(List *tlist, > - Plan *lefttree, Plan *righttree, int > wtParam, > - List *distinctList, long numGroups); > - extern Sort *make_sort_from_pathkeys(PlannerInfo *root, Plan *lefttree, > - List *pathkeys, double > limit_tuples); > - extern Sort *make_sort_from_sortclauses(PlannerInfo *root, List *sortcls, > -Plan *lefttree); > - extern Sort *make_sort_from_groupcols(PlannerInfo *root, List *groupcls, > - AttrNumber *grpColIdx, Plan > *lefttree); > - extern Agg *make_agg(PlannerInfo *root, List *tlist, List *qual, > - AggStrategy aggstrategy, const AggClauseCosts *aggcosts, > - int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators, > - List *groupingSets, long numGroups, bool combineStates, > - bool finalizeAggs, Plan *lefttree); > - extern WindowAgg *make_windowagg(PlannerInfo *root, List *tlist, > -List *windowFuncs, Index winref, > -int partNumCols, AttrNumber *partColIdx, Oid > *partOperators, > -int ordNumCols, AttrNumber *ordColIdx, Oid > *ordOperators, > -int frameOptions, Node *startOffset, Node *endOffset, > -Plan *lefttree); > - extern Group *make_group(PlannerInfo *root, List *tlist, List *qual, > -int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators, > -double numGroups, > -Plan *lefttree); > extern Plan *materialize_finished_plan(Plan *subplan); > ! extern Unique *make_unique(Plan *lefttree, List *distinctList); > ! extern LockRows *make_lockrows(Plan *lefttree, List *rowMarks, int > epqParam); > ! extern Limit *make_limit(Plan *lefttree, Node *limitOffset, Node > *limitCount, > !int64 offset_est, int64 count_est); > ! extern SetOp *make_setop(SetOpCmd cmd, SetOpStrategy strategy, Plan > *lefttree, > !List *distinctList, AttrNumber flagColIdx, int firstFlag, > !long numGroups, double outputRows); > ! extern Result *make_result(PlannerInfo *root, List *tlist, > ! Node *resconstantqual, Plan *subplan); > ! extern ModifyTable *make_modifytable(PlannerInfo *root, > ! CmdType operation, bool canSetTag, > ! Index nominalRelation, > ! List *resultRelations, List *subplans, > ! List *withCheckOptionLists, List > *returningLists, > ! List *rowMarks, OnConflictExpr *onconflict, > int epqParam); > extern bool is_projection_capable_plan(Plan *plan); I see that you made a lot of formerly externally visible make_* routines static. The Citus extension (which will be open source in a few days) uses several of these (make_agg, make_sort_from_sortclauses, make_limit at least). Can we please re-consider making these static? 
It's fine if their parameter list changes from release to release, that's easy enough to adjust to, and it's easily visible. But having to copy all of these into extension code is more than a bit painful; especially make_sort_from_sortclauses. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_rewind just doesn't fsync *anything*?
Hi, On 2016-03-10 08:47:16 +0100, Michael Paquier wrote: > Still, I think that we had better fsync only entries that are modified > by pg_rewind, and files that got updated, and not the whole directory Why? If any files in there are dirty, they need to be fsynced. If they're not dirty, fsync's free. > a target data folder should be stopped properly to be able to rewind, > and it is better to avoid dependencies between utilities if that's not > strictly necessary. initdb is likely to be installed side-by-side > with pg_rewind in any distribution though. It's not like we don't have any other such dependencies, in other binaries. I'm not concerned. Having to backpatch a single system() invocation + find_other_exec() call, and backporting a more general FRONTEND version of initdb's fsync_pgdata() are wildly differing in complexity. - Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
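For illustration, a back-branch fix along the lines discussed here could look roughly like the following sketch (function name, version-check string, and error handling are simplifications, not the actual pg_rewind code):

#include "postgres_fe.h"

/*
 * Sketch: after rewinding, make the whole target data directory durable by
 * shelling out to initdb's sync-only mode instead of teaching pg_rewind to
 * fsync every file itself.
 */
static void
sync_target_dir(const char *argv0, const char *datadir)
{
	char		exec_path[MAXPGPATH];
	char		cmd[MAXPGPATH * 2];

	/* locate initdb next to our own binary and check that it matches our version */
	if (find_other_exec(argv0, "initdb",
						"initdb (PostgreSQL) " PG_VERSION "\n",
						exec_path) < 0)
		pg_fatal("could not find initdb program executable\n");

	snprintf(cmd, sizeof(cmd), "\"%s\" -D \"%s\" --sync-only",
			 exec_path, datadir);

	if (system(cmd) != 0)
		pg_fatal("fsyncing target data directory failed\n");
}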
Re: [HACKERS] WIP: Upper planner pathification
On 2016-03-10 13:48:31 -0500, Tom Lane wrote: > Andres Freund writes: > > I see that you made a lot of formerly externally visible make_* routines > > static. The Citus extension (which will be open source in a few days) > > uses several of these (make_agg, make_sort_from_sortclauses, make_limit > > at least). > > > Can we please re-consider making these static? > > That was intentional: in my opinion, nothing outside createplan.c ought > to be making Plan nodes anymore. The expectation is that you make a > Path describing what you want. Can you explain why, in the new planner > structure, it would be sane to have external callers of these functions? In Citus' case a full PlannedStmt is generated on the master node, to combine the data generated on worker nodes (where the bog standard postgres planner is used). It's not the only way to do things, but I don't see why the approach would be entirely invalidated by the pathification work. We do provide the planner_hook, so restricting the toolbox available for doing something useful with it doesn't seem like a particularly good idea. - Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_rewind just doesn't fsync *anything*?
On 2016-03-10 20:31:55 +0100, Michael Paquier wrote: > On Thu, Mar 10, 2016 at 7:52 PM, Andres Freund wrote: > > Having to backpatch a single system() invocation + find_other_exec() > > call, and backporting a more general FRONTEND version of initdb's > > fsync_pgdata() are wildly differing in complexity. > > Talking about HEAD, wouldn't the dependency tree be cleaner if there > is a common facility in src/common? Yea, probably; but let's sort out the backbranches first. Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: Upper planner pathification
On 2016-03-10 14:16:03 -0500, Tom Lane wrote: > Andres Freund writes: > > In Citus' case a full PlannedStmt is generated on the master node, to > > combine the data generated on worker nodes (where the bog standard > > postgres planner is used). It's not the only way to do things, but I > > don't see why the approach would be entirely invalidated by the > > pathification work. > > I don't deny that you *could* continue to do things that way, but > I dispute that it's a good idea. Why can't you generate a Path tree > and then ask create_plan() to convert it? Primarily because create_plan(), and/or its children, have to know about what you're doing; you can hide some, but not all, things below CustomScan nodes. Secondarily, as an extension you will often have to support several major versions. ISTM, that there's good enough reasons to go either way; I don't see what we're gaining by making these private. That just encourages copy-paste coding. Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] checkpointer continuous flushing - V18
On 2016-03-08 09:28:15 +0100, Fabien COELHO wrote: > > >>>Now I cannot see how having one context per table space would have a > >>>significant negative performance impact. > >> > >>The 'dirty data' etc. limits are global, not per block device. By having > >>several contexts with unflushed dirty data the total amount of dirty > >>data in the kernel increases. > > > >Possibly, but how much? Do you have experimental data to back up that > >this is really an issue? > > > >We are talking about 32 (context size) * #table spaces * 8KB buffers = 4MB > >of dirty buffers to manage for 16 table spaces, I do not see that as a > >major issue for the kernel. We flush in those increments, that doesn't mean there's only that much dirty data. I regularly see one order of magnitude more being dirty. I had originally kept it with one context per tablespace after refactoring this, but found that it gave worse results in rate limited loads even over only two tablespaces. That's on SSDs though. > To complete the argument, the 4MB is just a worst case scenario, in reality > flushing the different context would be randomized over time, so the > frequency of flushing a context would be exactly the same in both cases > (shared or per table space context) if the checkpoints are the same size, > just that with shared table space each flushing potentially targets all > tablespace with a few pages, while with the other version each flushing > targets one table space only. The number of pages still in writeback (i.e. for which sync_file_range has been issued, but which haven't finished running yet) at the end of the checkpoint matters for the latency hit incurred by the fsync()s from smgrsync(); at least by my measurement. My current plan is to commit this with the current behaviour (as in this week[end]), and then do some actual benchmarking on this specific part. It's imo a relatively minor detail. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
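To make the mechanism being discussed concrete, here is a heavily simplified sketch of how issuing writeback for a batch of recently written blocks looks on Linux - an illustration only, not the code from the patch (which tracks pending ranges per file in a writeback context and merges neighboring requests):

#define _GNU_SOURCE
#include <fcntl.h>

#define BLCKSZ 8192

/*
 * After writing "count" consecutive blocks starting at "start_block" to the
 * file underlying "fd", ask the kernel to start writing them back without
 * waiting for completion.  This bounds how much dirty data the kernel
 * accumulates, so the fsync() at the end of the checkpoint has less to do
 * in one big burst.
 */
static void
issue_pending_writeback(int fd, long start_block, int count)
{
#if defined(__linux__)
	off_t	offset = (off_t) start_block * BLCKSZ;
	off_t	nbytes = (off_t) count * BLCKSZ;

	/* initiate writeback only; don't wait, don't flush file metadata */
	(void) sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
#endif
}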
Re: [HACKERS] checkpointer continuous flushing - V18
On 2016-02-21 09:49:53 +0530, Robert Haas wrote: > I think there might be a semantic distinction between these two terms. > Doesn't writeback mean writing pages to disk, and flushing mean making > sure that they are durably on disk? So for example when the Linux > kernel thinks there is too much dirty data, it initiates writeback, > not a flush; on the other hand, at transaction commit, we initiate a > flush, not writeback. I don't think terminology is sufficiently clear to make such a distinction. Take e.g. our FlushBuffer()... -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] checkpointer continuous flushing - V18
On 2016-03-10 17:33:33 -0500, Robert Haas wrote: > On Thu, Mar 10, 2016 at 5:24 PM, Andres Freund wrote: > > On 2016-02-21 09:49:53 +0530, Robert Haas wrote: > >> I think there might be a semantic distinction between these two terms. > >> Doesn't writeback mean writing pages to disk, and flushing mean making > >> sure that they are durably on disk? So for example when the Linux > >> kernel thinks there is too much dirty data, it initiates writeback, > >> not a flush; on the other hand, at transaction commit, we initiate a > >> flush, not writeback. > > > > I don't think terminology is sufficiently clear to make such a > > distinction. Take e.g. our FlushBuffer()... > > Well then we should clarify it! Trying that as we speak, err, write. How about: Whenever more than bgwriter_flush_after bytes have been written by the bgwriter, attempt to force the OS to issue these writes to the underlying storage. Doing so will limit the amount of dirty data in the kernel's page cache, reducing the likelihood of stalls when an fsync is issued at the end of a checkpoint, or when the OS writes data back in larger batches in the background. Often that will result in greatly reduced transaction latency, but there also are some cases, especially with workloads that are bigger than shared_buffers, but smaller than the OS's page cache, where performance might degrade. This setting may have no effect on some platforms. 0 disables controlled writeback. The default is 256kB on Linux, 0 otherwise. This parameter can only be set in the postgresql.conf file or on the server command line. (plus adjustments for the other gucs) -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] checkpointer continuous flushing - V18
On 2016-03-10 23:38:38 +0100, Fabien COELHO wrote: > I'm not sure I've seen these performance... If you have hard evidence, > please feel free to share it. Man, are you intentionally trying to be hard to work with? To quote the email you responded to: > My current plan is to commit this with the current behaviour (as in this > week[end]), and then do some actual benchmarking on this specific > part. It's imo a relatively minor detail. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] checkpointer continuous flushing - V18
On 2016-03-10 23:43:46 +0100, Fabien COELHO wrote: > > > > >Whenever more than bgwriter_flush_after bytes have > >been written by the bgwriter, attempt to force the OS to issue these > >writes to the underlying storage. Doing so will limit the amount of > >dirty data in the kernel's page cache, reducing the likelihood of > >stalls when an fsync is issued at the end of a checkpoint, or when > >the OS writes data back in larger batches in the background. Often > >that will result in greatly reduced transaction latency, but there > >also are some cases, especially with workloads that are bigger than > >shared_buffers, but smaller than the OS's page > >cache, where performance might degrade. This setting may have no > >effect on some platforms. 0 disables controlled > >writeback. The default is 256kB on Linux, 0 > >otherwise. This parameter can only be set in the > >postgresql.conf file or on the server command line. > > > > > >(plus adjustments for the other gucs) > What about the maximum value? Added. bgwriter_flush_after (int) bgwriter_flush_after configuration parameter Whenever more than bgwriter_flush_after bytes have been written by the bgwriter, attempt to force the OS to issue these writes to the underlying storage. Doing so will limit the amount of dirty data in the kernel's page cache, reducing the likelihood of stalls when an fsync is issued at the end of a checkpoint, or when the OS writes data back in larger batches in the background. Often that will result in greatly reduced transaction latency, but there also are some cases, especially with workloads that are bigger than shared_buffers, but smaller than the OS's page cache, where performance might degrade. This setting may have no effect on some platforms. The valid range is between 0, which disables controlled writeback, and 2MB. The default is 256kB on Linux, 0 elsewhere. (Non-default values of BLCKSZ change the default and maximum.) This parameter can only be set in the postgresql.conf file or on the server command line. > If the default is in pages, maybe you could state it and afterwards > translate it in size. Hm, I think that's more complicated for users than it's worth. > The text could say something about sequential writes performance because > pages are sorted.., but that it is lost for large bases and/or short > checkpoints ? I think that's an implementation detail. - Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] checkpointer continuous flushing - V18
On 2016-03-11 00:23:56 +0100, Fabien COELHO wrote: > As you wish. I thought that understanding the underlying performance model > with sequential writes written in chunks is important for the admin, and as > this guc would have an impact on performance it should be hinted about, > including the limits of its effect where large bases will converge to random > io performance. But maybe that is not the right place. I do agree that that's something interesting to document somewhere. But I don't think any of the current places in the documentation are a good fit, and it's a topic much more general than the feature we're debating here. I'm not volunteering, but a good discussion of storage and the interactions with postgres surely would be a significant improvement to the postgres docs. - Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] checkpointer continuous flushing
Hi, I just pushed the two major remaining patches in this thread. Let's see what the buildfarm has to say; I'd not be surprised if there's some lingering portability problem in the flushing code. There's one remaining issue we definitely want to resolve before the next release: Right now we always use one writeback context across all tablespaces in a checkpoint, but Fabien's testing shows that that's likely to hurt in a number of cases. I've some data suggesting the contrary in others. Things that'd be good: * Some benchmarking. Right now controlled flushing is enabled by default on Linux, but disabled by default on other operating systems. Somebody running benchmarks on e.g. FreeBSD or OS X might be good. * If somebody has the energy to provide a Windows implementation for flush control, that might be worthwhile. There are several places that could benefit from that. * The default values are basically based on benchmarking by me and Fabien. Regards, Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] GinPageIs* don't actually return a boolean
On 2016-03-02 21:48:16 -0500, Peter Eisentraut wrote: > After reviewing this thread and relevant internet lore, I think this > might be the wrong way to address this problem. It is in general not > guaranteed in C that a Boolean-sounding function or macro returns 0 or > 1. Prime examples are ctype.h functions such as isupper(). This is > normally not a problem because built-in conditionals such as if, &&, || > handle this just fine. So code like > > - Assert(!create || !!txn); > + Assert(!create || txn != NULL); > > is arguably silly either way. There is no risk in writing just > > Assert(!create || txn); Yes, obviously. I doubt that anyone misunderstood that. I personally find the !! easier to read when contrasting to a negated value (as in the above assert). > The problem only happens if you compare two "Boolean" values directly > with each other; That's not correct. It also happens if you compare an expression with a stored version, i.e. only one side is a 'proper bool'. > A quick look through the code based on the provided patch shows that > approximately the only place affected by this is > > if (isLeaf != GinPageIsLeaf(page) || isData != GinPageIsData(page)) > elog(ERROR, "right sibling of GIN page is of different type"); > > and that's not actually a problem because isLeaf and isData are earlier > populated by the same macros. It would still be worth fixing, but a > localized fix seems better. That's maybe the only place where we compare stored booleans to expressions, but it's definitely not the only place where some expression is stored in a boolean variable. And in all those cases there's an int32 (or whatever type the expression has) to bool (usually a 1-byte char) conversion. That's definitely problematic. > But the lore on the internet casts some doubt on that: There is no > guarantee that bool is 1 byte, that bool can be passed around like char, > or even that bool arrays are laid out like char arrays. Maybe this all > works out okay, just like it has worked out so far that int is 4 bytes, > but we don't know enough about it. We could probably add some configure > tests around that. So? > My proposal on this particular patch is to do nothing. The stdbool > issues should be looked into, for the sake of Windows and other > future-proofness. But that will likely be an entirely different patch. That seems to entirely miss the part of this discussion dealing with high bits set and such? Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
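A self-contained illustration of the "high bits set" problem with a char-typed bool (the flag value is hypothetical; with a real C99 _Bool the first assignment would be converted to 1 instead of truncated):

#include <stdio.h>

typedef char pseudo_bool;			/* like the historical "typedef char bool" */

#define SOME_HIGH_FLAG 0x0100		/* a flag bit above the low byte */

int
main(void)
{
	unsigned int flags = SOME_HIGH_FLAG;

	/* storing the raw masked value truncates to the low byte: 0, i.e. "false" */
	pseudo_bool broken = (flags & SOME_HIGH_FLAG);

	/* comparing against zero yields a proper 0/1 value */
	pseudo_bool fixed = (flags & SOME_HIGH_FLAG) != 0;

	printf("broken=%d fixed=%d\n", broken, fixed);	/* typically prints broken=0 fixed=1 */
	return 0;
}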
Re: [HACKERS] GinPageIs* don't actually return a boolean
On 2016-03-11 04:50:45 +0100, Michael Paquier wrote: > On Thu, Mar 10, 2016 at 11:40 PM, Robert Haas wrote: > > We need to decide what to do about this. I disagree with Peter: I > > think that regardless of stdbool, what we've got right now is sloppy > > coding - bad style if nothing else. Furthermore, I think that while C > > lets you use any non-zero value to represent true, our bool type is > > supposed to contain only one of those two values. Therefore, I think > > we should commit the full patch, back-patch it as far as somebody has > > the energy for, and move on. But regardless, this patch can't keep > > sitting in the CommitFest - we either have to take it or reject it, > > and soon. I plan to commit something like this, unless there's very loud protest from Peter's side. > +1, I would suggest to move ahead, !! is not really Postgres-like anyway. The !! bit is a minor sideshow to this, afaics. It just came up when discussing the specifics of the fixed macros and some people expressed a clear preference for not using !!, so I fixed the occurrences I introduced. Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: Upper planner pathification
Hi, On 2016-03-10 15:03:41 -0500, Tom Lane wrote: > Andres Freund writes: > > Primarily because create_plan(), and/or its children, have to know about > > what you're doing; you can hide some, but not all, things below > > CustomScan nodes. > > And which of those things does e.g. setrefs.c not need to know about? For CustomScans? You can mostly get by with ->custom_exprs. > > ISTM, that there's good enough reasons to go either way; I don't see > > what we're gaining by making these private. That just encourages > > copy-paste coding. > > What it encourages is having module boundaries that actually mean > something, as well as code that can be refactored without having > to worry about which extensions will complain about it. I personally think it's entirely fine to break extensions if it's adding or removing a few parameters or somesuch. That's easy enough to fix. > I will yield on this point because it's not worth my time to argue about > it, but I continue to say that it's a bad decision you will regret. FWIW, I do agree that it'd be much nicer to use the new API; the biggest user in Citus should be able to work with that. But it's not that easy to do that and still support postgres 9.4/9.5. > Which functions did you need exactly? I'm not exporting more than > I have to. I'll try to do the port tomorrow, to make sure I have the definitive list. Afaics it's just make_seqscan, make_sort_from_sortclauses, make_limit, make_agg. I'd not, however, be surprised if other extensions also use some of these. Would you rather add back the exports or should I? Greetings, Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Background Processes and reporting
Hi, We now have "Provide much better wait information in pg_stat_activity" and "Add a generic command progress reporting facility" making it easier to provide insight into the system. While working on the writeback control / checkpoint sorting patch I'd the following statement in BufferSync()'s main loop: fprintf(stderr, "\33[2K\rto_scan: %d, scanned: %d, %%processed: %.2f, %%writeouts: %.2f", num_to_scan, num_processed, (((double) num_processed) / num_to_scan) * 100, ((double) num_written / num_processed) * 100); which basically printed the progress of a checkpoint, and some additional detail to stderr. Quite helpful to see whether progress is "unsteady". Obviously that's not something that could be committed. So I'm wondering how we can make it possible to use the aforementioned "progress reporting facility" to monitor checkpoint progress. To which Robert replied on IM: "it wouldn't quite help with that because the checkpointer doesn't show up as a regular backend" It seems rather worthwhile to think about how we can expand the coverage of progress tracking to other types of background processes. Similarly for the wait event stuff - checkpointer, wal writer, background writer are in many cases processes that very often are blocked on locks, IO and such. Thus restricting the facility to database connected processes seems like a loss. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Background Processes and reporting
On 2016-03-11 23:53:15 +0300, Vladimir Borodin wrote: > It was many times stated in threads about waits monitoring [0, 1, 2] > and supported by different people, but ultimately waits information > was stored in PgBackendStatus. Only that it isn't. It's stored in PGPROC. This criticism is true of the progress reporting patch, but a quick scan of the thread doesn't show authors of the wait events patch participating there. > Can’t we think one more time about implementation provided by Ildus > and Alexander here [3]? I don't think so. Afaics the proposed patch tried to do too many things at once, and its authors didn't listen well to criticism. Trying to go back to that seems like a surefire way to have nothing in 9.6. > Seems that current implementation doesn’t give reasonable ways to > implement all that features and it is really sad. Why is that? Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Background Processes and reporting
Hi, On 2016-03-11 11:16:32 -0800, Andres Freund wrote: > It seems rather worthwhile to think about how we can expand the coverage > of progress tracking to other types of background processes. WRT the progress reporting patch, I think we should split off (as afaics was discussed in the thread for a while) the new part of PgBackendStatus into its own structure. That'd not just allow using this from non-backend processes, but would also have the advantage that the normal PgBackendStatus' changecount doesn't advance quite so rapidly. E.g. when reporting progress of a vacuum, the changecount will probably change at quite a rapid rate, but that's uninteresting for e.g. pg_stat_activity. > Similarly for the wait event stuff - checkpointer, wal writer, > background writer are in many cases processes that very often are > blocked on locks, IO and such. Thus restricting the facility to > database-connected processes seems like a loss. I think one way to address this would be to not report only PgBackendStatus-type processes in pg_stat_activity. While that'd obviously be a compatibility break, I think it'd be an improvement. Regards, Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
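For context, a generic sketch of the changecount (seqlock-style) protocol that such a split-out progress structure would presumably keep using - this illustrates the idea only and is not the actual pgstat code; the struct fields and variable names are made up, and the real code accesses the shared struct through volatile pointers:

/* writer: bump the counter to an odd value, update, bump it back to even */
st->changecount++;
pg_write_barrier();
st->param[0] = blocks_done;
st->param[1] = blocks_total;
pg_write_barrier();
st->changecount++;

/* reader: retry until the counter is even and unchanged across the copy,
 * which means we read a consistent snapshot despite concurrent writers */
for (;;)
{
	int			before = st->changecount;

	pg_read_barrier();
	memcpy(&snapshot, st, sizeof(snapshot));
	pg_read_barrier();

	if (before == st->changecount && (before & 1) == 0)
		break;
	/* a writer was active; try again */
}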
Re: [HACKERS] Background Processes and reporting
On 2016-03-12 01:05:43 +0300, Vladimir Borodin wrote: > > On 12 March 2016, at 0:22, Andres Freund wrote: > > Only that it isn't. It's stored in PGPROC. > > Sorry, I missed that. So monitoring of wait events for auxiliary processes > still could be implemented? It's basically a question of where to report the information. > >> Seems that current implementation doesn’t give reasonable ways to > >> implement all that features and it is really sad. > > > > Why is that? > > Storing information about wait event in 4 bytes gives an ability to > store only wait type and event. No way to store duration or extra > information (i.e. buffer number for I/O events or buffer manager > LWLocks). Maybe I’m missing something... Sure, but that's just incrementally building features? Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: Upper planner pathification
On 2016-03-10 23:38:14 -0500, Tom Lane wrote: > > Would you rather add back the exports or should I? > > I'll do it ... just send me the list. After exporting make_agg, make_limit, make_sort_from_sortclauses and making some trivial adjustments due to the pull_var_clause changes, Citus' tests pass on 9.6, bar some noise. Pathification made some plans switch from hash-agg to sort-agg, and the other way round; but that's obviously ok. Thanks, Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Background Processes and reporting
On 2016-03-12 02:24:33 +0300, Alexander Korotkov wrote: > Idea of individual time measurement of every wait event met criticism > because it might have high overhead [1]. Right. And that's actually one of the points I meant with "didn't listen to criticism". There've been a lot of examples, on and off list, where taking timings triggers significant slowdowns. Yes, in some bare-metal environments with a coherent tsc, the overhead can be low. But that doesn't make it ok to have a high overhead on a lot of other systems. Just claiming that that's not a problem will only lead to your position not being taken seriously. > This is really so at least for Windows [2]. Measuring timing overhead for a simplistic workload on a single system doesn't mean that. Try doing such a test on a VMware ESX virtualized Windows machine, on a multi-socket server; in a lot of instances you'll see two to three orders of magnitude longer average times, with peaks going into 4-5 orders of magnitude. And, as sad as it is, realistically most postgres instances will run in virtualized environments. > But accessing only current values wouldn't be very useful. We > anyway need to gather some statistics. Gathering it by sampling would be > both more expensive and less accurate for majority of systems. This is why > I proposed hooks to make possible platform dependent extensions. Robert > rejects hook because he is "not a big fan of hooks as a way of resolving > disagreements about the design" [3]. I think I agree with Robert here. Providing hooks into very low level places tends to lead to problems in my experience; tight control over what happens is often important - I certainly don't want any external code to run while we're waiting for an lwlock. > Besides that is actually not design issues but platform issues... I don't see how that's the case. > Another question is wait parameters. We want to expose wait event with > some parameters. Robert rejects that because it *might* add additional > overhead [3]. When I proposed to fit something useful into hard-won > 4-bytes, Roberts claims that it is "too clever" [4]. I think no longer treating this as "Robert/EDB vs. pgpro" would be a good first step toward making progress here. It seems entirely possible to extend the current API in an incremental fashion, either allowing the individual pieces to be disabled, or providing sufficient measurements that it's not needed. > So, situation looks like dead-end. I have no idea how to convince Robert > about any kind of advanced functionality of wait monitoring to PostgreSQL. > I'm thinking about implementing sampling extension over current > infrastructure just to make community see that it sucks. Andres, it would > be very nice if you have any idea how to move this situation forward. I've had my share of conflicts with Robert. But if I were in his shoes, targeted by this kind of rhetoric, I'd be very tempted to just ignore any further arguments from the origin. So I think the way forward is for everyone to cool off, and to see how we can incrementally make progress from here on. > Another aspect is that EnterpriseDB offers waits monitoring in proprietary > fork [5]. So? Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Re: [COMMITTERS] pgsql: Only try to push down foreign joins if the user mapping OIDs mat
Hi, On 2016-01-28 19:09:01 +0000, Robert Haas wrote: > Only try to push down foreign joins if the user mapping OIDs match. > > Previously, the foreign join pushdown infrastructure left the question > of security entirely up to individual FDWs, but it would be easy for > a foreign data wrapper to inadvertently open up subtle security holes > that way. So, make it the core code's job to determine which user > mapping OID is relevant, and don't attempt join pushdown unless it's > the same for all relevant relations. > > Per a suggestion from Tom Lane. Shigeru Hanada and Ashutosh Bapat, > reviewed by Etsuro Fujita and KaiGai Kohei, with some further > changes by me. I noticed that this breaks some Citus regression tests in a minor manner. Namely, previously file_fdw worked without a user mapping; now it doesn't appear to anymore. This is easy enough to fix, and it's perfectly ok for us to fix this, but I do wonder if that's not going to cause trouble for others. Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Re: [COMMITTERS] pgsql: Only try to push down foreign joins if the user mapping OIDs mat
Hi, On 2016-03-12 11:56:24 -0500, Robert Haas wrote: > On Fri, Mar 11, 2016 at 10:15 PM, Andres Freund wrote: > > On 2016-01-28 19:09:01 +0000, Robert Haas wrote: > >> Only try to push down foreign joins if the user mapping OIDs match. > >> > >> Previously, the foreign join pushdown infrastructure left the question > >> of security entirely up to individual FDWs, but it would be easy for > >> a foreign data wrapper to inadvertently open up subtle security holes > >> that way. So, make it the core code's job to determine which user > >> mapping OID is relevant, and don't attempt join pushdown unless it's > >> the same for all relevant relations. > >> > >> Per a suggestion from Tom Lane. Shigeru Hanada and Ashutosh Bapat, > >> reviewed by Etsuro Fujita and KaiGai Kohei, with some further > >> changes by me. > > > > I noticed that this breaks some Citus regression tests in a minor > > manner. Namely, previously file_fdw worked without a user mapping; now it > > doesn't appear to anymore. > > > > This is easy enough to fix, and it's perfectly ok for us to fix this, > > but I do wonder if that's not going to cause trouble for others. > > Hmm, I didn't intend to change that. If that commit broke something, > there's obviously a hole in our regression test coverage. CREATE EXTENSION file_fdw; CREATE SERVER file_server FOREIGN DATA WRAPPER file_fdw; CREATE FOREIGN TABLE agg_csv ( a int2, b float4 ) SERVER file_server OPTIONS (format 'csv', filename '/home/andres/src/postgresql/contrib/file_fdw/data/agg.csv', header 'true', delimiter ';', quote '@', escape '"', null ''); SELECT * FROM agg_csv; worked in 9.5, but doesn't in master. The difference appears to be the check that's now in build_simple_rel() - there was nothing hitting the user mapping code before for file_fdw. - Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: Upper planner pathification
On 2016-03-12 12:22:01 -0500, Tom Lane wrote: > Andres Freund writes: > > On 2016-03-10 23:38:14 -0500, Tom Lane wrote: > >> I'll do it ... just send me the list. > > > After exporting make_agg, make_limit, make_sort_from_sortclauses and > > making some trivial adjustments due to the pull_var_clause changes > > change, Citus' tests pass on 9.6, bar some noise. > > OK, done. Thanks. > > Pathification made > > some plans switch from hash-agg to sort-agg, and the other way round; > > but that's obviously ok. > > I wonder whether that's pathification per se. If you're interested enough, I've uploaded a dump of the schema relevant table to http://anarazel.de/t/lineitem_95_96_plan.dump.gz the affected query is (after ANALYZE lineitem_102009) EXPLAIN SELECT l_quantity, count(*) AS count, sum(l_extendedprice) AS avg, count(l_extendedprice) AS avg, array_agg(l_orderkey) AS array_agg FROM lineitem_102009 lineitem WHERE ((l_quantity < '5'::numeric) AND (l_orderkey > 5500) AND (l_orderkey < 9500)) GROUP BY l_quantity; =# SELECT version(); ┌──┐ │ version │ ├──┤ │ PostgreSQL 9.5.1 on x86_64-pc-linux-gnu, compiled by gcc-5.real (Debian 5.3.1-10) 5.3.1 20160224, 64-bit │ └──┘ (1 row) =# SELECT l_quantity, count(*) AS count, sum(l_extendedprice) AS avg, count(l_extendedprice) AS avg, array_agg(l_orderkey) AS array_agg FROM lineitem_102009 lineitem WHERE ((l_quantity < '5'::numeric) AND (l_orderkey > 5500) AND (l_orderkey < 9500)) GROUP BY l_quantity; ┌┬───┬──┬─┬┐ │ l_quantity │ count │ avg│ avg │ array_agg │ ├┼───┼──┼─┼┤ │ 1.00 │ 9 │ 13044.06 │ 9 │ {8997,9026,9158,9184,9220,9222,9348,9383,9476} │ │ 4.00 │ 7 │ 40868.84 │ 7 │ {9091,9120,9281,9347,9382,9440,9473} │ │ 2.00 │ 8 │ 26072.02 │ 8 │ {9030,9058,9123,9124,9188,9344,9441,9476} │ │ 3.00 │ 9 │ 39925.32 │ 9 │ {9124,9157,9184,9223,9254,9349,9414,9475,9477} │ └┴───┴──┴─┴┘ (4 rows) Time: 0.906 ms =# EXPLAIN SELECT l_quantity, count(*) AS count, sum(l_extendedprice) AS avg, count(l_extendedprice) AS avg, array_agg(l_orderkey) AS array_agg FROM lineitem_102009 lineitem WHERE ((l_quantity < '5'::numeric) AND (l_orderkey > 5500) AND (l_orderkey < 9500)) GROUP BY l_quantity; ┌┐ │ QUERY PLAN │ ├┤ │ HashAggregate (cost=137.91..137.93 rows=1 width=21) │ │ Group Key: l_quantity │ │ -> Bitmap Heap Scan on lineitem_102009 lineitem (cost=13.07..137.44 rows=38 width=21) │ │ Recheck Cond: ((l_orderkey > 5500) AND (l_orderkey < 9500)) │ │ Filter: (l_quantity < '5'::numeric) │ │ -> Bitmap Index Scan on lineitem_pkey_102009 (cost=0.00..13.06 rows=478 width=0) │ │ Index Cond: ((l_orderkey > 5500) AND (l_orderkey < 9500)) │ └ vs. =# SELECT version(); ┌─── │ version ├─── │ PostgreSQL 9.6devel on x86_64-pc-linux-gnu, compiled by gcc-6.real (Debian 6-20160228-1) 6.0.0 20160228 (experimental) [trunk revision └─── (1 row) =# SELECT l_quantity, count(*) AS count, sum(l_extendedprice) AS avg, count(l_extend
Re: [HACKERS] Background Processes and reporting
On 2016-03-12 16:29:11 +0530, Amit Kapila wrote: > On Sat, Mar 12, 2016 at 3:10 AM, Andres Freund wrote: > > > > > > > Similarly for the wait event stuff - checkpointer, wal writer, > > > background writer are in many cases processes that very often are > > > blocked on locks, IO and such. Thus restricting the facility to > > > database connected processes seems like a loss. > > > > I think one way to address this would be to not only report > > PgBackendStatus type processes in pg_stat_activity. While that'd > > obviously be a compatibility break, I think it'd be an improvement. > > > > I think here another point which needs more thoughts is that many of the > pg_stat_activity fields are not relevant for background processes, ofcourse > one can say that we can keep those fields as NULL, but still I think that > indicates it is not the most suitable way to expose such information. But neither are all of them relevant for autovacuum workers, and we already show them. pg_stat_activity as a name imo doesn't really imply that it's about plain queries. ISTM we should add a 'backend_type' column that is one of backend|checkpointer|autovacuum|autovacuum-worker|wal writer| bgwriter| bgworker (or similar). That makes querying easier. And then display all shmem connected processes. > Another way could be to have new view like pg_stat_background_activity with > only relevant fields or try expose via individual views like > pg_stat_bgwriter. I think the second is a pretty bad alternative; it'll force us to add new views with very similar information; and it'll be hard to get information about the whole system. I mean if you want to know which locks are causing problems, you don't primarily care whether it's a background process or a backend that has contention issues. > Do you intend to get this done for 9.6 considering an add-on patch for wait > event information displayed in pg_stat_activity? I think we should fix this for 9.6; it's a weakness in a new interface. Let's not yank people around more than we need to. I'm willing to do some work on that, if we can agree upon a course. Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Background Processes and reporting
Hi,

On 2016-03-14 16:16:43 -0400, Robert Haas wrote:
> > I have already shown [0, 1] the overhead of measuring timings in linux on
> > representative workload. AFAIK, these tests were the only ones that showed
> > any numbers. All other statements about terrible performance have been and
> > remain unconfirmed.
>
> Of course, those numbers are substantial regressions which would
> likely make it impractical to turn this on on a heavily-loaded
> production system.

A lot of people operating production systems are fine with trading a <= 10%
impact for more insight into the system; especially if that configuration can
be changed without a restart. I know a lot of systems that use
pg_stat_statements, track_io_timing = on, etc; just to get that. In fact
there are people running perf more or less continuously in production
environments, just to get more insight.

I think it's important to get as much information out there without
performance overhead, so it can be enabled by default. But I don't think it
makes sense to not allow in features that cannot be enabled by default, *if*
we tried to make them cheap enough beforehand.

> > Ok, doing it in short steps seems to be a good plan. Any objections against
> > giving people an ability to turn some feature (i.e. notorious measuring
> > timings) even if it makes some performance degradation? Of course, it should
> > be turned off by default.
>
> I am not totally opposed to that, but I think a feature that causes a
> 10% performance hit when you turn it on will be mostly useless. The
> people who need it won't be able to risk turning it on.

That's not my experience.

> > If anything, I’m not from PostgresPro and I’m not «accusing you». But to be
> > honest current committed implementation has been tested exactly on one
> > machine with two workloads. And I think, it is somehow unfair to demand more
> > from others. Although it doesn’t mean that testing on exactly one machine
> > with only one OS is enough, of course. I suppose, you should ask the authors
> > to test it on some representative hardware and workload but if authors don’t
> > have them, it would be nice to help them with that.
>
> I'm not necessarily opposed to that, but this thread has a lot more
> heat than light

Indeed.

> , and some of the other threads on this topic have had
> the same problem. There seems to be tremendous resistance to the idea
> that recording timestamps is going to be expensive even though there
> are many previous threads on pgsql-hackers about many different
> features showing that this is true. Somehow, I've got to justify a
> position which has been taken by many people many times before on this
> very same mailing list. That strikes me as 100% backwards.

Agreed; I find that pretty baffling. It's especially weird that pointing out
problems like timestamp overhead generates a remarkable amount of hostility.

> > Also it would be really interesting to hear your opinion about the initial
> > Andres's question. Any thoughts about changing current committed
> > implementation?
>
> I'm a little vague on specifically what Andres has in mind.

That makes two of us.

> I tend to think that there's not much point in allowing
> pg_stat_get_progress_info('checkpointer') because we can just have a
> dedicated view for that sort of thing, cf. pg_stat_bgwriter, which
> seems better.

But that infrastructure isn't really suitable for exposing quickly changing
counters imo.
And given that we now have a relatively generic framework, it seems like a
pain to add a custom implementation just for the checkpointer. Also, using
custom infrastructure means it's not extensible to custom bgworkers, which
doesn't seem like a good idea. E.g. it'd be very neat to show the progress of
a logical replication catchup process that way, no?

> Exposing the wait events from background processes
> might be worth doing, but I don't think we want to add a bunch of
> dummy lines to pg_stat_activity.

Why are those dummy lines? It's activity in the cluster. We already show
autovacuum workers in there. And walsenders, if you query the underlying
function instead of pg_stat_activity (due to a join to pg_database).

Andres

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
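As a rough illustration of the timing overhead discussed above, here is a
standalone micro-benchmark (not PostgreSQL code) measuring the raw cost of
the clock readings that per-event timing would add; results vary a lot
between clocksources (tsc vs. hpet vs. acpi_pm), which is one reason the
overhead is hard to pin down with a single number:

/*
 * Standalone sketch: measure the cost of one clock_gettime() call.
 */
#include <stdio.h>
#include <time.h>

int
main(void)
{
    struct timespec start, end, tmp;
    const long  iterations = 10 * 1000 * 1000;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < iterations; i++)
        clock_gettime(CLOCK_MONOTONIC, &tmp);
    clock_gettime(CLOCK_MONOTONIC, &end);

    double      elapsed = (end.tv_sec - start.tv_sec) +
        (end.tv_nsec - start.tv_nsec) / 1e9;

    printf("%.1f ns per clock_gettime()\n", elapsed / iterations * 1e9);
    return 0;
}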
Re: [HACKERS] pglogical_output - a general purpose logical decoding output plugin
On 2016-01-31 05:09:33 +0800, Craig Ringer wrote: > On 29 January 2016 at 18:16, Andres Freund wrote: > > > Hi, > > > > so, I'm reviewing the output of: > > > > Thankyou very much for the review. Afaics you've not posted an updated version of this? Any chance you could? Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Timeline following for logical slots
On 2016-03-14 20:10:58 -0300, Alvaro Herrera wrote: > Great, thanks. I've studied this to the point where I'm confident that > it makes sense, so I'm about to push it. I didn't change any logic, > only updated comments here and there, both in the patch and in some > preexisting code. I also changed the "List *timelineHistory" to be > #ifndef FRONTEND, which seems cleaner than having it and insist that it > couldn't be used. Could you perhaps wait till tomorrow? I'd like to do a pass over it. Thanks Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Refactoring speculative insertion with unique indexes a little
On 2016-03-14 17:17:02 -0700, Peter Geoghegan wrote: > On Mon, Mar 14, 2016 at 5:04 PM, Robert Haas wrote: > > There hasn't been a new version of this patch in 9 months, you're > > clearly not in a hurry to produce one, and nobody else seems to feel > > strongly that this is something that needs to be done at all. I think > > we could just let this go and be just fine, but instead of doing that > > and moving onto patches that people do feel strongly about, we're > > arguing about this. Bummer. > > I'm busy working on fixing an OpenSSL bug affecting all released > versions right at the moment, but have a number of complex 9.6 patches > to review, most of which are in need of support. I'm very busy. So? You're not the only one. I don't see why we shouldn't move this to 'returned with feedback' until there's a new version. Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Timeline following for logical slots
contains a > + * timeline switch, since its xlog segment will have > been copied > + * from the prior timeline. This is pretty harmless > though, as > + * nothing cares so long as the timeline doesn't go > backwards. We > + * should read the page header instead; FIXME someday. > + */ > + *pageTLI = state->currTLI; > + > + /* No need to wait on a historical timeline */ > break; > - > - CHECK_FOR_INTERRUPTS(); > - pg_usleep(1000L); > + } > } > > - /* more than one block available */ > - if (targetPagePtr + XLOG_BLCKSZ <= flushptr) > + if (targetPagePtr + XLOG_BLCKSZ <= read_upto) > + { > + /* > + * more than one block available; read only that block, have > caller > + * come back if they need more. > + */ > count = XLOG_BLCKSZ; > - /* not enough data there */ > - else if (targetPagePtr + reqLen > flushptr) > + } > + else if (targetPagePtr + reqLen > read_upto) > + { > + /* not enough data there */ > return -1; > - /* part of the page available */ > + } > else > - count = flushptr - targetPagePtr; > + { > + /* enough bytes available to satisfy the request */ > + count = read_upto - targetPagePtr; > + } > > - XLogRead(cur_page, *pageTLI, targetPagePtr, XLOG_BLCKSZ); > + XLogRead(cur_page, *pageTLI, targetPagePtr, count); When are we reading less than a page? That should afaik never be required. > + /* > + * We start reading xlog from the restart lsn, even though in > + * CreateDecodingContext we set the snapshot builder up using > the > + * slot's candidate_restart_lsn. This means we might read xlog > we > + * don't actually decode rows from, but the snapshot builder > might > + * need it to get to a consistent point. The point we start > returning > + * data to *users* at is the candidate restart lsn from the > decoding > + * context. > + */ Uh? Where are we using candidate_restart_lsn that way? I seriously doubt it is - candidate_restart_lsn is about a potential future restart_lsn, which we can set once we get reception confirmation from the client. > @@ -299,6 +312,18 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo > fcinfo, bool confirm, bool bin > CHECK_FOR_INTERRUPTS(); > } > > + /* Make sure timeline lookups use the start of the next record > */ > + startptr = ctx->reader->EndRecPtr; Huh? startptr isn't used after this, so I'm not sure what this even means? > + /* > + * The XLogReader will read a page past the valid end of WAL > because > + * it doesn't know about timelines. When we switch timelines > and ask > + * it for the first page on the new timeline it will think it > has it > + * cached, but it'll have the old partial page and say it can't > find > + * the next record. So flush the cache. > + */ > + XLogReaderInvalCache(ctx->reader); > + dito. > diff --git a/src/test/modules/decoding_failover/decoding_failover.c > b/src/test/modules/decoding_failover/decoding_failover.c > new file mode 100644 > index 000..669e6c4 > --- /dev/null > +++ b/src/test/modules/decoding_failover/decoding_failover.c > + > +/* > + * Create a new logical slot, with invalid LSN and xid, directly. This does > not > + * use the snapshot builder or logical decoding machinery. It's only intended > + * for creating a slot on a replica that mirrors the state of a slot on an > + * upstream master. > + * > + * You should immediately decoding_failover_advance_logical_slot(...) it > + * after creation. > + */ Uh. I doubt we want this, even if it's formally located in src/test/modules. 
These comments make it appear not to be only intended for that, and I have serious doubts about the validity of the concept as is. This seems to need some more polishing. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] silent data loss with ext4 / all current versions
On 2016-03-15 15:39:50 +0100, Michael Paquier wrote: > I have finally been able to spend some time reviewing what you pushed > on back-branches, and things are in correct shape I think. One small > issue that I have is that for EXEC_BACKEND builds, in > write_nondefault_variables we still use one instance of rename(). Correctly so afaics, because write_nondefault_variables is by definition non-durable. We write that stuff at every start / SIGHUP. Adding an fsync there would be an unnecessary slowdown. I don't think it's good policy adding fsync for stuff that definitely doesn't need it. > Yeah, true. We definitely need to do something for that, even for HEAD > it seems like an overkill to have something in for example src/common > to allow frontends to have something if the fix is localized > (pg_rewind may use something else), and it would be nice to finish > wrapping that for the next minor release, so I propose the attached > patches. At the same time, I think that adminpack had better be fixed > as well, so there are actually three patches in this series, things > that I shaped thinking about a backpatch btw, particularly for 0002. I'm doubtful about "fixing" adminpack. We don't know how it's used, and this could *seriously* increase its overhead for something potentially used at a high rate. > /* > + * Wrapper of rename() similar to the backend version with the same function > + * name aimed at making the renaming durable on disk. Note that this version > + * does not fsync the old file before the rename as all the code paths > leading > + * to this function are already doing this operation. The new file is also > + * normally not present on disk before the renaming so there is no need to > + * bother about it. > + */ > +static int > +durable_rename(const char *oldfile, const char *newfile) > +{ > + int fd; > + charparentpath[MAXPGPATH]; > + > + if (rename(oldfile, newfile) != 0) > + { > + /* durable_rename produced a log entry */ > + fprintf(stderr, _("%s: could not rename file \"%s\": %s\n"), > + progname, current_walfile_name, > strerror(errno)); > + return -1; > + } > + > + /* > + * To guarantee renaming of the file is persistent, fsync the file with > its > + * new name, and its containing directory. > + */ > + fd = open(newfile, O_RDWR | PG_BINARY, S_IRUSR | S_IWUSR); Why is S_IRUSR | S_IWUSR specified here? Are you working on a fix for pg_rewind? Let's go with initdb -S in a first iteration, then we can, if somebody is interest enough, work on making this nicer in master. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
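For reference, the generic shape of the pattern being reviewed here, i.e.
what has to happen after the rename() for it to survive a crash, looks
roughly like the following. This is a POSIX-level sketch assuming a platform
(such as Linux) where a directory can be opened read-only and fsynced; it is
not the actual patch:

/*
 * Sketch: after rename(), fsync the file under its new name (so its
 * contents are durable under that name), then fsync the containing
 * directory (so the rename itself is durable).
 */
#include <fcntl.h>
#include <libgen.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int
fsync_path(const char *path)
{
    int     fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    if (fsync(fd) != 0)
    {
        close(fd);
        return -1;
    }
    return close(fd);
}

static int
durable_rename_sketch(const char *oldfile, const char *newfile)
{
    char    parent[1024];

    if (rename(oldfile, newfile) != 0)
        return -1;

    /* make the file's contents durable under the new name ... */
    if (fsync_path(newfile) != 0)
        return -1;

    /* ... and the directory entry itself (dirname() needs a copy) */
    strncpy(parent, newfile, sizeof(parent) - 1);
    parent[sizeof(parent) - 1] = '\0';
    return fsync_path(dirname(parent));
}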
Re: [HACKERS] Idle In Transaction Session Timeout, revived
On 2016-03-15 14:21:34 -0400, Robert Haas wrote: > On Tue, Mar 8, 2016 at 6:08 PM, Andres Freund wrote: > > On 2016-03-08 16:42:37 -0500, Robert Haas wrote: > >> - I really wonder if the decision to ignore sessions that are idle in > >> transaction (aborted) should revisited. Consider this: > >> > >> rhaas=# begin; > >> BEGIN > >> rhaas=# lock table pg_class; > >> LOCK TABLE > >> rhaas=# savepoint a; > >> SAVEPOINT > >> rhaas=# select 1/0; > >> ERROR: division by zero > > > > Probably only if the toplevel transaction is also aborted. Might not be > > entirely trivial to determine. > > Yes, that would be one way to do it - or just ignore whether it's > aborted or not and make the timeout always apply. That seems pretty > reasonable, too, because a transaction that's idle in transaction and > aborted could easily be one that the client has forgotten about, even > if it's not hanging onto any resources other than a connection slot. > And, if it turns out that the client didn't forget about it, well, > they don't lose anything by retrying the transaction on a new > connection anyway. I'm fine with both. Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Performance degradation in commit ac1d794
On 2016-03-18 15:49:51 -0300, Alvaro Herrera wrote:
> This seems a reasonable change, but I think that the use of WIN32 vs.
> LATCH_USE_WIN32 is pretty confusing. In particular, LATCH_USE_WIN32
> isn't actually used for anything ... I suppose we don't care since this
> is a temporary state of affairs only?

Hm, I guess we could make some more of those use LATCH_USE_WIN32. There are
essentially two axes here:
a) the latch notification method (self-pipe vs. windows events)
b) the readiness notification (epoll vs. poll vs. select vs.
   WaitForMultipleObjects)

> In 0005: In latch.c you typedef WaitEventSet, but the typedef already
> appears in latch.h. You need only declare the struct in latch.c,
> without typedef'ing.

Good catch. It's even important, some compilers choke on that.

> Haven't really reviewed anything here yet, just skimming ATM. Having so
> many #ifdefs all over the place in this file looks really bad, but I
> guess there's no way around that because this is very platform-specific.

I think it's hard to further reduce the number of ifdefs, but if you have
ideas...

> I hope pgindent doesn't choke on it.

The patch is pgindented (I'd personally never decrease indentation just to
fit a line into 79 chars, as pgindent does...).

Thanks for looking!

Greetings,

Andres Freund

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logger process infinite loop
Hi,

On 2016-03-18 21:59:01 -0700, Jeff Janes wrote:
> While testing some patches on my laptop, I noticed my knee getting
> uncomfortably warm. It turns out I was accumulating deranged logging
> processes, needing kill -9 to get rid of them.
>
> The culprit is:
>
> commit c4901a1e03a7730e4471fd1143f1caf79695493d
> Author: Andres Freund
> Date:   Fri Mar 18 11:43:59 2016 -0700
>
>     Only clear latch self-pipe/event if there is a pending notification.

God, what an awfully stupid mistake/typo. Thanks for the report!

- Andres

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logger process infinite loop
On 2016-03-18 22:39:25 -0700, Andres Freund wrote:
> Hi,
>
> On 2016-03-18 21:59:01 -0700, Jeff Janes wrote:
> > While testing some patches on my laptop, I noticed my knee getting
> > uncomfortably warm. It turns out I was accumulating deranged logging
> > processes, needing kill -9 to get rid of them.
> >
> > The culprit is:
> >
> > commit c4901a1e03a7730e4471fd1143f1caf79695493d
> > Author: Andres Freund
> > Date:   Fri Mar 18 11:43:59 2016 -0700
> >
> >     Only clear latch self-pipe/event if there is a pending notification.
>
> God, what an awfully stupid mistake/typo. Thanks for the report!

(fixed, if that wasn't clear)

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Performance degradation in commit ac1d794
On March 18, 2016 11:32:53 PM PDT, Amit Kapila wrote:
> On Sat, Mar 19, 2016 at 12:00 AM, Andres Freund wrote:
> >
> > On 2016-03-18 20:14:07 +0530, Amit Kapila wrote:
> >
> > > I have done some
> > > tests on Windows with 0003 patch which includes running the regressions
> > > (vcregress check) and it passes. Will look into it tomorrow once again and
> > > share if I find anything wrong with it, but feel to proceed if you want.
> >
> > Thanks for the testing thus far! Let's see what the buildfarm has to
> > say.
> >
>
> Won't the new code needs to ensure that ResetEvent(latchevent) should get
> called in case WaitForMultipleObjects() comes out when both
> pgwin32_signal_event and latchevent are signalled at the same time?

WaitForMultiple only reports the readiness of one event at a time, no?

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Performance degradation in commit ac1d794
On March 18, 2016 11:52:08 PM PDT, Amit Kapila wrote:
> On Sat, Mar 19, 2016 at 12:14 PM, Andres Freund wrote:
> >
> > On March 18, 2016 11:32:53 PM PDT, Amit Kapila wrote:
> > > On Sat, Mar 19, 2016 at 12:00 AM, Andres Freund wrote:
> > > >
> > > > On 2016-03-18 20:14:07 +0530, Amit Kapila wrote:
> > > >
> > > > > I have done some
> > > > > tests on Windows with 0003 patch which includes running the regressions
> > > > > (vcregress check) and it passes. Will look into it tomorrow once again and
> > > > > share if I find anything wrong with it, but feel to proceed if you want.
> > > >
> > > > Thanks for the testing thus far! Let's see what the buildfarm has to
> > > > say.
> > > >
> > >
> > > Won't the new code needs to ensure that ResetEvent(latchevent) should get
> > > called in case WaitForMultipleObjects() comes out when both
> > > pgwin32_signal_event and latchevent are signalled at the same time?
> >
> > WaitForMultiple only reports the readiness of one event at a time, no?
> >
>
> I don't think so, please read link [1] with a focus on below paragraph
> which states how it reports the readiness or signaled state when multiple
> objects become signaled.
>
> "When *bWaitAll* is *FALSE*, this function checks the handles in the array
> in order starting with index 0, until one of the objects is signaled. If
> multiple objects become signaled, the function returns the index of the
> first handle in the array whose object was signaled."

I think that's OK. We'll just get the next event the next time we call
waitfor*. It's also not different to the way the routine is currently
handling normal socket and postmaster events, no?

It'd be absurdly broken if it handled edge-triggered events like FD_CLOSE in
a way you can't discover.

Andres

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
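A minimal Win32 sketch of the behaviour described above: each
WaitForMultipleObjects() call reports the lowest signalled handle, and an
event that was signalled at the same time is simply picked up by the next
call around the loop. The handle names here are placeholders, not the
backend's actual variables:

/*
 * Sketch: one signalled handle is handled per iteration; a simultaneously
 * signalled handle is returned by the following call.
 */
#include <windows.h>

static void
wait_loop(HANDLE signal_event, HANDLE latch_event)
{
    HANDLE      events[2] = {signal_event, latch_event};

    for (;;)
    {
        DWORD       rc = WaitForMultipleObjects(2, events, FALSE, INFINITE);

        if (rc == WAIT_OBJECT_0)
        {
            /*
             * Handle the signal event.  If latch_event was signalled at
             * the same time, the next call simply returns its index.
             */
        }
        else if (rc == WAIT_OBJECT_0 + 1)
        {
            ResetEvent(latch_event);    /* then process the latch */
        }
        else
            break;              /* WAIT_FAILED, WAIT_TIMEOUT, etc. */
    }
}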
Re: [HACKERS] Performance degradation in commit ac1d794
On 2016-03-18 14:00:58 -0400, Robert Haas wrote: > No, I mean it should be quite common for a particular fd to have no > events reported. If we're polling on 100 fds and 1 of them is active > and the other 99 are just sitting there, we want to skip over the > other 99 as quickly as possible. cur_events points to the event that's registered with the set, not the one that's "returned" or "modified" by poll/select, that's where the confusion is originating from. I'll add a fastpath. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
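The fastpath in question could look roughly like this; the structure name is
a stand-in for the patch's WaitEvent, not the real definition:

/*
 * Sketch: after poll() returns, descriptors whose pollfd reported nothing
 * are skipped immediately, so idle entries cost only this one check.
 */
#include <poll.h>

typedef struct SketchEvent      /* stand-in for the patch's WaitEvent */
{
    int     pos;
    int     fd;
    short   revents;
} SketchEvent;

static int
collect_occurred(const struct pollfd *pfds, int nfds,
                 SketchEvent *occurred, int nevents)
{
    int     found = 0;

    for (int i = 0; i < nfds && found < nevents; i++)
    {
        if (pfds[i].revents == 0)
            continue;           /* fastpath: nothing happened on this fd */

        occurred[found].pos = i;
        occurred[found].fd = pfds[i].fd;
        occurred[found].revents = pfds[i].revents;
        found++;
    }
    return found;
}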
Re: [HACKERS] Performance degradation in commit ac1d794
On 2016-03-16 15:41:28 -0400, Robert Haas wrote: > On Wed, Mar 16, 2016 at 3:25 PM, Andres Freund wrote: > >> > - Given current users we don't need a large amount of events, so having > >> > to iterate through the registered events doesn't seem bothersome. We > >> > could however change the api to be something like > >> > > >> > int WaitLatchEventSet(LatchEventSet *set, OccurredEvents *, int > >> > nevents, long timeout); > >> > > >> > which would return the number of events that happened, and would > >> > basically "fill" one of the (usually stack allocated) OccurredEvent > >> > structures with what happened. > >> > >> I definitely think something along these lines is useful. I want to > >> be able to have an Append node with 100 ForeignScans under it and kick > >> off all the scans asynchronously and wait for all of the FDs at once. > > > > So you'd like to get only an event for the FD with data back? Or are you > > ok with iterating through hundred elements in an array, to see which are > > ready? > > I'd like to get an event back for the FD with data. Iterating sounds > like it could be really slow. Say you get lots of little packets back > from the same connection, while the others are idle. Now you've got > to keep iterating through them all over and over again. Blech. Well, that's what poll() and select() require you to do internally anyway, even if we abstract it away. But most platforms have better implementations (epoll, kqueue, ...), so it seems fair to design for those. Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
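For comparison, a minimal Linux-only epoll sketch (not the patch itself)
showing what the "better implementations" buy: epoll_wait() hands back only
the descriptors with activity, so idle connections never have to be scanned.
Error handling is mostly omitted:

/*
 * Sketch: register fds once, then wait; only ready fds come back.
 */
#include <sys/epoll.h>
#include <unistd.h>

static int
wait_for_ready(const int *fds, int nfds,
               struct epoll_event *ready, int maxready)
{
    int     epfd = epoll_create1(0);

    if (epfd < 0)
        return -1;

    for (int i = 0; i < nfds; i++)
    {
        struct epoll_event ev = {0};

        ev.events = EPOLLIN;
        ev.data.fd = fds[i];
        epoll_ctl(epfd, EPOLL_CTL_ADD, fds[i], &ev);
    }

    /* blocks until at least one fd is readable; 'ready' holds only those */
    int     n = epoll_wait(epfd, ready, maxready, -1);

    close(epfd);
    return n;
}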
Re: [HACKERS] Performance degradation in commit ac1d794
) and documentation. I'd greatly appreciate a look. Amit, you offered testing on windows; could you check whether 3/4/5 work? It's quite likely that I've screwed up something. Robert you'd mentioned on IM that you've a use-case for this somewhere around multiple FDWs. If somebody has started working on that, could you ask that person to check whether the API makes sense? Greetings, Andres Freund >From 916b95e211aa017643088ba7cbb239545ac8d944 Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Fri, 18 Mar 2016 00:52:07 -0700 Subject: [PATCH 1/5] Make it easier to choose the used waiting primitive in unix_latch.c. This allows for easier testing of the different primitives; in preparation for adding a new primitive. Discussion: 20160114143931.gg10...@awork2.anarazel.de Reviewed-By: Robert Haas --- src/backend/port/unix_latch.c | 50 +-- 1 file changed, 34 insertions(+), 16 deletions(-) diff --git a/src/backend/port/unix_latch.c b/src/backend/port/unix_latch.c index 2ad609c..f52704b 100644 --- a/src/backend/port/unix_latch.c +++ b/src/backend/port/unix_latch.c @@ -56,6 +56,22 @@ #include "storage/pmsignal.h" #include "storage/shmem.h" +/* + * Select the fd readiness primitive to use. Normally the "most modern" + * primitive supported by the OS will be used, but for testing it can be + * useful to manually specify the used primitive. If desired, just add a + * define somewhere before this block. + */ +#if defined(LATCH_USE_POLL) || defined(LATCH_USE_SELECT) +/* don't overwrite manual choice */ +#elif defined(HAVE_POLL) +#define LATCH_USE_POLL +#elif HAVE_SYS_SELECT_H +#define LATCH_USE_SELECT +#else +#error "no latch implementation available" +#endif + /* Are we currently in WaitLatch? The signal handler would like to know. */ static volatile sig_atomic_t waiting = false; @@ -215,10 +231,10 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock, cur_time; long cur_timeout; -#ifdef HAVE_POLL +#if defined(LATCH_USE_POLL) struct pollfd pfds[3]; int nfds; -#else +#elif defined(LATCH_USE_SELECT) struct timeval tv, *tvp; fd_set input_mask; @@ -247,7 +263,7 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock, Assert(timeout >= 0 && timeout <= INT_MAX); cur_timeout = timeout; -#ifndef HAVE_POLL +#ifdef LATCH_USE_SELECT tv.tv_sec = cur_timeout / 1000L; tv.tv_usec = (cur_timeout % 1000L) * 1000L; tvp = &tv; @@ -257,7 +273,7 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock, { cur_timeout = -1; -#ifndef HAVE_POLL +#ifdef LATCH_USE_SELECT tvp = NULL; #endif } @@ -291,16 +307,10 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock, } /* - * Must wait ... we use poll(2) if available, otherwise select(2). - * - * On at least older linux kernels select(), in violation of POSIX, - * doesn't reliably return a socket as writable if closed - but we - * rely on that. So far all the known cases of this problem are on - * platforms that also provide a poll() implementation without that - * bug. If we find one where that's not the case, we'll need to add a - * workaround. + * Must wait ... we use the polling interface determined at the top of + * this file to do so. 
*/ -#ifdef HAVE_POLL +#if defined(LATCH_USE_POLL) nfds = 0; if (wakeEvents & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE)) { @@ -396,8 +406,16 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock, result |= WL_POSTMASTER_DEATH; } } -#else /* !HAVE_POLL */ +#elif defined(LATCH_USE_SELECT) + /* + * On at least older linux kernels select(), in violation of POSIX, + * doesn't reliably return a socket as writable if closed - but we + * rely on that. So far all the known cases of this problem are on + * platforms that also provide a poll() implementation without that + * bug. If we find one where that's not the case, we'll need to add a + * workaround. + */ FD_ZERO(&input_mask); FD_ZERO(&output_mask); @@ -477,7 +495,7 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock, result |= WL_POSTMASTER_DEATH; } } -#endif /* HAVE_POLL */ +#endif /* LATCH_USE_SELECT */ /* If we're not done, update cur_timeout for next iteration */ if (result == 0 && (wakeEvents & WL_TIMEOUT)) @@ -490,7 +508,7 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock, /* Timeout has expired, no need to continue looping */ result |= WL_TIMEOUT; } -#ifndef HAVE_POLL +#ifdef LATCH_USE_SELECT else { tv.tv_sec = cur_timeout / 1000L; -- 2.7.0.229.g701fa7f >From 2eeb7dd4f6401a4f2d45293cddd505018aa4431e Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Fri, 18 M
Re: [HACKERS] fd.c doesn't remove files on a crash-restart
On 2016-03-16 10:53:42 -0700, Joshua D. Drake wrote:
> Hello,
>
> fd.c[1] will remove files from pgsql_tmp on a restart but not a
> crash-restart per this comment:
>
> /*
>  * NOTE: we could, but don't, call this during a post-backend-crash restart
>  * cycle. The argument for not doing it is that someone might want to examine
>  * the temp files for debugging purposes. This does however mean that
>  * OpenTemporaryFile had better allow for collision with an existing temp
>  * file name.
>  */
>
> I understand that this is designed this way. I think it is a bad idea
> because:
>
> 1. The majority of crash-restarts in the wild are going to be diagnosed
> rather easily within the OS itself. They fall into things like OOM killer
> and out of disk space.

I don't buy 1), like at all. I've seen numerous production instances with
crashes outside of OS-triggered things.

> 2. It can cause significant issues, we ran into this yesterday:
>
> -bash-4.1$ ls pgsql_tmp31227*|du -sh
> 250G
>
> There is no active process/backend with PID 31227. The database itself is
> only 55G, but we are taking up 5x that with dead files.
>
> 3. The problem can get worse over time. If you have a very long running
> instance, any time the backend crash-restarts you have the potential to
> increase disk space used for no purpose.

But I think these outweigh the debugging benefit.

Andres

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] silent data loss with ext4 / all current versions
On 2016-03-17 23:05:42 +0900, Michael Paquier wrote:
> > Are you working on a fix for pg_rewind? Let's go with initdb -S in a
> > first iteration, then we can, if somebody is interest enough, work on
> > making this nicer in master.
>
> I am really -1 for this approach. Wrapping initdb -S with
> find_other_exec is intrusive in back-branches knowing that all the I/O
> write operations manipulating file descriptors go through file_ops.c,
> and we actually just need to fsync the target file in
> close_target_file(), making the fix being a 7-line patch, and there is
> no need to depend on another binary at all. The backup label file, as
> well as the control file are using the same code path in file_ops.c...
> And I like simple things :)

This is a *much* more expensive approach though: doing the fsync directly
after modifying the file, one file at a time, will usually result in each
fsync blocking for a while. By comparison, doing a flush pass and then an
fsync pass over the whole directory will usually block only rarely, because
the flushes for all files can be combined into very few barrier operations.

Besides that, you're not syncing the directories, despite open_target_file()
potentially creating the directory.

Greetings,

Andres Freund

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
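A Linux-specific sketch of the two-pass scheme described above (not pg_rewind
code): the first pass only initiates writeback, the second pass does the
blocking fsync()s, which by then have little left to write:

/*
 * Sketch: flush everything first, then make it durable.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

static void
flush_then_fsync(const int *fds, int nfds)
{
    /* pass 1: ask the kernel to start writeback, without waiting */
    for (int i = 0; i < nfds; i++)
        sync_file_range(fds[i], 0, 0, SYNC_FILE_RANGE_WRITE);

    /* pass 2: wait for durability */
    for (int i = 0; i < nfds; i++)
        fsync(fds[i]);
}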
Re: [HACKERS] Performance degradation in commit ac1d794
On 2016-03-18 20:14:07 +0530, Amit Kapila wrote: > + * from inside a signal handler in latch_sigusr1_handler(). > > * > > * Note: we assume that the kernel calls involved in drainSelfPipe() > > * and SetLatch() will provide adequate synchronization on machines > > * with weak memory ordering, so that we cannot miss seeing is_set if > > * the signal byte is already in the pipe when we drain it. > > */ > > - drainSelfPipe(); > > - > > > Above part of comment looks redundant after this patch. Don't think so. Moving it closer to the drainSelfPipe() call might be neat, but since there's several callsites... > I have done some > tests on Windows with 0003 patch which includes running the regressions > (vcregress check) and it passes. Will look into it tomorrow once again and > share if I find anything wrong with it, but feel to proceed if you want. Thanks for the testing thus far! Let's see what the buildfarm has to say. Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive
On 2016-03-18 11:01:04 -0400, Robert Haas wrote: > On Thu, Mar 17, 2016 at 9:22 PM, Michael Paquier > wrote: > > FWIW, my instinctive thought on the matter is to report the event > > directly in WaitLatch() via a name of the event caller provided > > directly in it. The category of the event is then defined > > automatically as we would know its origin. The code path defining the > > origin point from where a event type comes from is the critical thing > > I think to define an event category. The LWLock events are doing that > > in lwlock.c. > > I'm very skeptical of grouping everything that waits using latches as > a latch wait, but maybe it's OK to do it that way. I was thinking > more of adding categories like "client wait" with events like "client > read" and "client write". +1. I think categorizing latch waits together will be pretty much meaningless. We use the same latch for a lot of different things, and the context in which we're waiting is the important bit. Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fd.c doesn't remove files on a crash-restart
On 2016-03-16 11:06:23 -0700, Joshua D. Drake wrote: > On 03/16/2016 11:04 AM, Andres Freund wrote: > >On 2016-03-16 11:02:09 -0700, Joshua D. Drake wrote: > >>>>3. The problem can get worse over time. If you have a very long running > >>>>instance, any time the backend crash-restarts you have to potential to > >>>>increase disk space used for no purpose. > >>> > >>>But I think these outweigh the debugging benefit. > >> > >>Well as Andrew said, we could also create postmaster start option that > >>defaults to don't save. > > > >I think these days you'd simply use restart_after_crash = false. For > >debugging I found that to be rather valuable. > > That would have created an extended outage for this installation. I am not > sure we can force that for something that should not happen in the first > place (filling up the hard drive with dead files). Nah, I meant that if you want to debug something you'd set that (i.e. in the case you need the temp files), not you should set that. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Performance degradation in commit ac1d794
Hi, On 2016-03-16 15:41:28 -0400, Robert Haas wrote: > On Wed, Mar 16, 2016 at 3:25 PM, Andres Freund wrote: > >> > - Given current users we don't need a large amount of events, so having > >> > to iterate through the registered events doesn't seem bothersome. We > >> > could however change the api to be something like > >> > > >> > int WaitLatchEventSet(LatchEventSet *set, OccurredEvents *, int > >> > nevents, long timeout); > >> > > >> > which would return the number of events that happened, and would > >> > basically "fill" one of the (usually stack allocated) OccurredEvent > >> > structures with what happened. > >> > >> I definitely think something along these lines is useful. I want to > >> be able to have an Append node with 100 ForeignScans under it and kick > >> off all the scans asynchronously and wait for all of the FDs at once. > > > > So you'd like to get only an event for the FD with data back? Or are you > > ok with iterating through hundred elements in an array, to see which are > > ready? > > I'd like to get an event back for the FD with data. Iterating sounds > like it could be really slow. Say you get lots of little packets back > from the same connection, while the others are idle. Now you've got > to keep iterating through them all over and over again. Blech. I've a (working) WIP version that works like I think you want. It's implemented in patch 05 (rework with just poll() support) and (add epoll suppport). It's based on patches posted here earlier, but these aren't interesting for the discussion. The API is now: typedef struct WaitEventSet WaitEventSet; typedef struct WaitEvent { int pos;/* position in the event data structure */ uint32 events; /* tripped events */ int fd; /* fd associated with event */ } WaitEvent; /* * Create a WaitEventSet with space for nevents different events to wait for. * * latch may be NULL. */ extern WaitEventSet *CreateWaitEventSet(int nevents); /* --- * Add an event to the set. Possible events are: * - WL_LATCH_SET: Wait for the latch to be set * - WL_POSTMASTER_DEATH: Wait for postmaster to die * - WL_SOCKET_READABLE: Wait for socket to become readable * can be combined in one event with WL_SOCKET_WRITEABLE * - WL_SOCKET_WRITABLE: Wait for socket to become readable * can be combined with WL_SOCKET_READABLE * * Returns the offset in WaitEventSet->events (starting from 0), which can be * used to modify previously added wait events. */ extern int AddWaitEventToSet(WaitEventSet *set, uint32 events, int fd, Latch *latch); /* * Change the event mask and, if applicable, the associated latch of of a * WaitEvent. */ extern void ModifyWaitEvent(WaitEventSet *set, int pos, uint32 events, Latch *latch); /* * Wait for events added to the set to happen, or until the timeout is * reached. At most nevents occurrent events are returned. * * Returns the number of events occurred, or 0 if the timeout was reached. */ extern int WaitEventSetWait(WaitEventSet *set, long timeout, WaitEvent* occurred_events, int nevents); I've for now left the old latch API in place, and only converted be-secure.c to the new style. I'd appreciate some feedback before I go around and convert and polish everything. Questions: * I'm kinda inclined to merge the win32 and unix latch implementations. There's already a fair bit in common, and this is just going to increase that amount. * Right now the caller has to allocate the WaitEvents he's waiting for locally (likely on the stack), but we also could allocate them as part of the WaitEventSet. Not sure if that'd be a benefit. 
* I can do a blind rewrite of the windows implementation, but I'm obviously
  not going to get that entirely right. So I need some help from a windows
  person to test this.

* This approach, with a 'stateful' wait event data structure, will actually
  allow us to fix a couple lingering bugs we have on the windows port. C.f.
  http://www.postgresql.org/message-id/4351.1336927...@sss.pgh.pa.us

- Andres

>From a692a0bd6a8af7427d491adfecff10df3953a8ae Mon Sep 17 00:00:00 2001
From: Andres Freund
Date: Wed, 16 Mar 2016 10:44:04 -0700
Subject: [PATCH 1/6] Access the correct pollfd when checking for socket
 errors in the latch code.

Previously, if waiting for a latch, but not a socket, we checked
pfds[0].revents for socket errors, even though pfds[0] wasn't actually
associated with the socket in that case. This is currently harmless, because
we check wakeEvents after the aforementioned check. But it's a bug waiting to
happen.
---
 src/backend/port/uni
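To make the API sketched in this mail concrete, a hypothetical caller could
look like the following. The exact signatures, the fd value used for the
latch entry, and the meaning of a -1 timeout are assumptions based on the
sketch above, not a committed interface:

/*
 * Sketch: one latch plus one socket, with occurred events filled into a
 * stack-allocated array.
 */
#define MAX_EVENTS 2

static void
serve_socket(Latch *my_latch, int sock)
{
    WaitEvent       occurred[MAX_EVENTS];
    WaitEventSet   *set = CreateWaitEventSet(MAX_EVENTS);
    int             sockpos;

    AddWaitEventToSet(set, WL_LATCH_SET, -1, my_latch);
    sockpos = AddWaitEventToSet(set, WL_SOCKET_READABLE, sock, NULL);

    for (;;)
    {
        int     n = WaitEventSetWait(set, -1 /* assumed: no timeout */,
                                     occurred, MAX_EVENTS);

        for (int i = 0; i < n; i++)
        {
            if (occurred[i].events & WL_LATCH_SET)
                ResetLatch(my_latch);

            if (occurred[i].events & WL_SOCKET_READABLE)
            {
                /* read from sock; if we now also need to write: */
                ModifyWaitEvent(set, sockpos,
                                WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE,
                                NULL);
            }
        }
    }
}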
Re: [HACKERS] Performance degradation in commit ac1d794
On 2016-01-14 12:07:23 -0500, Robert Haas wrote: > On Thu, Jan 14, 2016 at 12:06 PM, Andres Freund wrote: > > On 2016-01-14 11:31:03 -0500, Robert Haas wrote: > >> On Thu, Jan 14, 2016 at 10:56 AM, Andres Freund wrote: > >> I think your idea of a data structure the encapsulates a set of events > >> for which to wait is probably a good one. WaitLatch doesn't seem like > >> a great name. Maybe WaitEventSet, and then we can have > >> WaitLatch(&latch) and WaitEvents(&eventset). > > > > Hm, I'd like to have latch in the name. It seems far from improbably to > > have another wait data structure. LatchEventSet maybe? The wait would be > > implied by WaitLatch. > > I can live with that. How about the following sketch of an API typedef struct LatchEvent { uint32 events; /* interesting events */ uint32 revents; /* returned events */ int fd; /* fd associated with event */ } LatchEvent; typedef struct LatchEventSet { int nevents; LatchEvent events[FLEXIBLE_ARRAY_MEMBER]; } LatchEventSet; /* * Create a LatchEventSet with space for nevents different events to wait for. * * latch may be NULL. */ extern LatchEventSet *CreateLatchEventSet(int nevents, Latch *latch); /* --- * Add an event to the set. Possible events are: * - WL_LATCH_SET: Wait for the latch to be set * - WL_SOCKET_READABLE: Wait for socket to become readable * can be combined in one event with WL_SOCKET_WRITEABLE * - WL_SOCKET_WRITABLE: Wait for socket to become readable * can be combined with WL_SOCKET_READABLE * - WL_POSTMASTER_DEATH: Wait for postmaster to die */ extern void AddLatchEventToSet(LatchEventSet *set, uint32 events, int fd); /* * Wait for any events added to the set to happen, or until the timeout is * reached. * * The return value is the union of all the events that were detected. This * allows to avoid having to look into the associated events[i].revents * fields. */ extern uint32 WaitLatchEventSet(LatchEventSet *set, long timeout); I've two questions: - Is there any benefit of being able to wait for more than one latch? I'm inclined to not allow that for now, that'd make the patch bigger, and I don't see a use-case right now. - Given current users we don't need a large amount of events, so having to iterate through the registered events doesn't seem bothersome. We could however change the api to be something like int WaitLatchEventSet(LatchEventSet *set, OccurredEvents *, int nevents, long timeout); which would return the number of events that happened, and would basically "fill" one of the (usually stack allocated) OccurredEvent structures with what happened. Comments? -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] POC: Cache data in GetSnapshotData()
On 2016-03-16 13:29:22 -0400, Robert Haas wrote: > Whoa. At 64 clients, we're hardly getting any benefit, but then by 88 > clients, we're getting a huge benefit. I wonder why there's that sharp > change there. What's the specifics of the machine tested? I wonder if it either correlates with the number of hardware threads, real cores, or cache sizes. - Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Performance degradation in commit ac1d794
Hi, On 2016-03-17 09:01:36 -0400, Robert Haas wrote: > 0001: Looking at this again, I'm no longer sure this is a bug. > Doesn't your patch just check the same conditions in the opposite > order? Yes, that's what's required > 0004: > > + * drain it everytime WaitLatchOrSocket() is used. Should the > + * pipe-buffer fill up in some scenarios - widly unlikely - we're > > every time > wildly > > Why is it wildly (or widly) unlikely? > > The rejiggering this does between what is on which element of pfds[] > appears to be unrelated to the ostensible purpose of the patch. Well, not really. We need to know when to do drainSelfPipe(); Which gets more complicated if pfds[0] is registered optionally. I'm actually considering to drop this entirely, given the much heavier rework in the WaitEvent set patch; making these details a bit obsolete. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fd.c: flush data problems on osx
Hi, On 2016-03-17 20:09:53 +0300, Stas Kelvich wrote: > Current implementation of pg_flush_data when called with zero offset and zero > nbytes is assumed to flush all file. In osx it uses mmap with these > arguments, but according to man: > > "[EINVAL] The len argument was negative or zero. Historically, the > system call would not return an > error if the argument was zero. See other potential > additional restrictions in the COMPAT- > IBILITY section below." > > so it is generate a lot of warnings: > > "WARNING: could not mmap while flushing dirty data: Invalid argument" Hm, yea, that's buggy. > One possible solution for that is just fallback to pg_fdatasync in case when > offset = nbytes = 0. Hm, that's a bit heavyweight. I'd rather do an lseek(SEEK_END) to get the file size. Could you test that? > Also there are no default ifdef inside this function, is there any > check that will guarantee that pg_flush_data will not end up with > empty body on some platform? There doesn't need to be - it's purely "advisory", i.e. just an optimization. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
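The lseek() suggestion could look roughly like the following sketch
(simplified relative to what fd.c actually does): determine the file's real
length first, so the mmap()/msync() path is never invoked with a zero length:

/*
 * Sketch: flush "the whole file" by mapping its actual length.
 */
#include <sys/mman.h>
#include <unistd.h>

static int
flush_whole_file(int fd)
{
    off_t   len = lseek(fd, 0, SEEK_END);

    if (len <= 0)
        return 0;               /* empty file (or error): nothing to flush */

    void   *p = mmap(NULL, (size_t) len, PROT_READ, MAP_SHARED, fd, 0);

    if (p == MAP_FAILED)
        return -1;

    int     rc = msync(p, (size_t) len, MS_ASYNC);

    munmap(p, (size_t) len);
    return rc;
}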
Re: [HACKERS] Performance degradation in commit ac1d794
On 2016-02-08 17:49:18 +0900, Kyotaro HORIGUCHI wrote: > How about allowing registration of a callback for every waiting > socket. The signature of the callback function would be like I don't think a a callback based API is going to serve us well. Most of the current latch callers would get noticeably more complex that way. And a number of them will benefit from latches using epoll internally. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Performance degradation in commit ac1d794
On 2016-03-19 15:43:27 +0530, Amit Kapila wrote: > On Sat, Mar 19, 2016 at 12:40 PM, Andres Freund wrote: > > > > On March 18, 2016 11:52:08 PM PDT, Amit Kapila > wrote: > > >> >Won't the new code needs to ensure that ResetEvent(latchevent) > > >should > > >> >get > > >> >called in case WaitForMultipleObjects() comes out when both > > >> >pgwin32_signal_event and latchevent are signalled at the same time? > > >> WaitForMultiple only reports the readiness of on event at a time, no? > > >> > > > > > >I don't think so, please read link [1] with a focus on below paragraph > > >which states how it reports the readiness or signaled state when > > >multiple > > >objects become signaled. > > > > > >"When *bWaitAll* is *FALSE*, this function checks the handles in the > > >array > > >in order starting with index 0, until one of the objects is signaled. > > >If > > >multiple objects become signaled, the function returns the index of the > > >first handle in the array whose object was signaled." I think this is just incredibly bad documentation. See https://blogs.msdn.microsoft.com/oldnewthing/20150409-00/?p=44273 (Raymond Chen can be considered an authority here imo). Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Performance degradation in commit ac1d794
On 2016-03-19 16:44:49 +0530, Amit Kapila wrote: > On Fri, Mar 18, 2016 at 1:34 PM, Andres Freund wrote: > > > > > > Attached is a significantly revised version of the earlier series. Most > > importantly I have: > > * Unified the window/unix latch implementation into one file (0004) > > > > After applying patch 0004* on HEAD, using command patch -p1 < > , I am getting build failure: > > c1 : fatal error C1083: Cannot open source file: > 'src/backend/storage/ipc/latch.c': No such file or directory > > I think it could not rename port/unix_latch.c => storage/ipc/latch.c. I > have tried with git apply, but no success. Am I doing something wrong? You skipped applying 0003. I'll send an updated version - with all the docs and such - in the next hours. > One minor suggestion about patch: > > +#ifndef WIN32 > void > latch_sigusr1_handler(void) > { > if (waiting) > sendSelfPipeByte(); > } > +#endif /* !WIN32 */ > > /* Send one byte to the self-pipe, to wake up WaitLatch */ > +#ifndef WIN32 > static void > sendSelfPipeByte(void) > > > Instead of individually defining these functions under #ifndef WIN32, isn't > it better to combine them all as they are at end of file. They're all at the end of the file. I just found it easier to reason about if both #if and #endif are visible in one screen full of code. Don't feel super strong about it tho. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Performance degradation in commit ac1d794
On 2016-03-16 15:08:07 -0400, Robert Haas wrote: > On Wed, Mar 16, 2016 at 2:52 PM, Andres Freund wrote: > > How about the following sketch of an API > > > > typedef struct LatchEvent > > { > > uint32 events; /* interesting events */ > > uint32 revents; /* returned events */ > > int fd; /* fd associated with event */ > > } LatchEvent; > > > > typedef struct LatchEventSet > > { > > int nevents; > > LatchEvent events[FLEXIBLE_ARRAY_MEMBER]; > > } LatchEventSet; > > > > /* > > * Create a LatchEventSet with space for nevents different events to wait > > for. > > * > > * latch may be NULL. > > */ > > extern LatchEventSet *CreateLatchEventSet(int nevents, Latch *latch); > > We might be able to rejigger this so that it didn't require palloc, if > we got rid of FLEXIBLE_ARRAY_MEMBER and passed int nevents and > LatchEvent * separately to WaitLatchThingy(). But I guess maybe this > will be infrequent enough not to matter. I think we'll basically end up allocating them once for the frequent callsites. > > - Given current users we don't need a large amount of events, so having > > to iterate through the registered events doesn't seem bothersome. We > > could however change the api to be something like > > > > int WaitLatchEventSet(LatchEventSet *set, OccurredEvents *, int nevents, > > long timeout); > > > > which would return the number of events that happened, and would > > basically "fill" one of the (usually stack allocated) OccurredEvent > > structures with what happened. > > I definitely think something along these lines is useful. I want to > be able to have an Append node with 100 ForeignScans under it and kick > off all the scans asynchronously and wait for all of the FDs at once. So you'd like to get only an event for the FD with data back? Or are you ok with iterating through hundred elements in an array, to see which are ready? Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Performance degradation in commit ac1d794
On 2016-03-19 18:45:36 -0700, Andres Freund wrote: > On 2016-03-19 16:44:49 +0530, Amit Kapila wrote: > > On Fri, Mar 18, 2016 at 1:34 PM, Andres Freund wrote: > > > > > > > > > Attached is a significantly revised version of the earlier series. Most > > > importantly I have: > > > * Unified the window/unix latch implementation into one file (0004) > > > > > > > After applying patch 0004* on HEAD, using command patch -p1 < > > , I am getting build failure: > > > > c1 : fatal error C1083: Cannot open source file: > > 'src/backend/storage/ipc/latch.c': No such file or directory > > > > I think it could not rename port/unix_latch.c => storage/ipc/latch.c. I > > have tried with git apply, but no success. Am I doing something wrong? > > You skipped applying 0003. > > I'll send an updated version - with all the docs and such - in the next > hours. Here we go. I think this is getting pretty clos eto being committable, minus a bit of testing edge cases on unix (postmaster death, disconnecting clients in various ways (especially with COPY)) and windows (uh, does it even work at all?). There's no large code changes in this revision, mainly some code polishing and a good bit more comment improvements. Regards, Andres >From 9195a0136c44eca4d798d1de478a477ab0ac4724 Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Fri, 18 Mar 2016 00:52:07 -0700 Subject: [PATCH 1/2] Combine win32 and unix latch implementations. Previously latches for windows and unix had been implemented in different files. The next patch in this series will introduce an expanded wait infrastructure, keeping the implementation separate would introduce too much duplication. This basically just moves the functions, without too much change. The reason to keep this separate is that it allows blame to continue working a little less badly; and to make review a tiny bit easier. --- configure | 10 +- configure.in | 8 - src/backend/Makefile | 3 +- src/backend/port/.gitignore| 1 - src/backend/port/Makefile | 2 +- src/backend/port/win32_latch.c | 349 - src/backend/storage/ipc/Makefile | 5 +- .../{port/unix_latch.c => storage/ipc/latch.c} | 280 - src/include/storage/latch.h| 2 +- src/tools/msvc/Mkvcbuild.pm| 2 - 10 files changed, 277 insertions(+), 385 deletions(-) delete mode 100644 src/backend/port/win32_latch.c rename src/backend/{port/unix_latch.c => storage/ipc/latch.c} (74%) diff --git a/configure b/configure index a45be67..c10d954 100755 --- a/configure +++ b/configure @@ -14786,13 +14786,6 @@ $as_echo "#define USE_WIN32_SHARED_MEMORY 1" >>confdefs.h SHMEM_IMPLEMENTATION="src/backend/port/win32_shmem.c" fi -# Select latch implementation type. 
-if test "$PORTNAME" != "win32"; then - LATCH_IMPLEMENTATION="src/backend/port/unix_latch.c" -else - LATCH_IMPLEMENTATION="src/backend/port/win32_latch.c" -fi - # If not set in template file, set bytes to use libc memset() if test x"$MEMSET_LOOP_LIMIT" = x"" ; then MEMSET_LOOP_LIMIT=1024 @@ -15868,7 +15861,7 @@ fi ac_config_files="$ac_config_files GNUmakefile src/Makefile.global" -ac_config_links="$ac_config_links src/backend/port/dynloader.c:src/backend/port/dynloader/${template}.c src/backend/port/pg_sema.c:${SEMA_IMPLEMENTATION} src/backend/port/pg_shmem.c:${SHMEM_IMPLEMENTATION} src/backend/port/pg_latch.c:${LATCH_IMPLEMENTATION} src/include/dynloader.h:src/backend/port/dynloader/${template}.h src/include/pg_config_os.h:src/include/port/${template}.h src/Makefile.port:src/makefiles/Makefile.${template}" +ac_config_links="$ac_config_links src/backend/port/dynloader.c:src/backend/port/dynloader/${template}.c src/backend/port/pg_sema.c:${SEMA_IMPLEMENTATION} src/backend/port/pg_shmem.c:${SHMEM_IMPLEMENTATION} src/include/dynloader.h:src/backend/port/dynloader/${template}.h src/include/pg_config_os.h:src/include/port/${template}.h src/Makefile.port:src/makefiles/Makefile.${template}" if test "$PORTNAME" = "win32"; then @@ -16592,7 +16585,6 @@ do "src/backend/port/dynloader.c") CONFIG_LINKS="$CONFIG_LINKS src/backend/port/dynloader.c:src/backend/port/dynloader/${template}.c" ;; "src/backend/port/pg_sema.c") CONFIG_LINKS="$CONFIG_LINKS src/backend/port/pg_sema.c:${SEMA_IMPLEMENTATION}" ;; "src/backend/port/pg_shmem.c") CONFIG_LINKS="$CONFIG_LINKS src/backend/port/pg_shmem.c:${SHMEM_IMPLEMENTATION}" ;; -
Re: [HACKERS] fd.c doesn't remove files on a crash-restart
On 2016-03-16 11:02:09 -0700, Joshua D. Drake wrote: > >>3. The problem can get worse over time. If you have a very long running > >>instance, any time the backend crash-restarts you have to potential to > >>increase disk space used for no purpose. > > > >But I think these outweigh the debugging benefit. > > Well as Andrew said, we could also create postmaster start option that > defaults to don't save. I think these days you'd simply use restart_after_crash = false. For debugging I found that to be rather valuable. Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Performance degradation in commit ac1d794
On 2016-03-17 09:40:08 -0400, Robert Haas wrote:
> On Thu, Mar 17, 2016 at 9:01 AM, Robert Haas wrote:
> > I'll look at 0005 next, but thought I would send these comments along first.
>
> 0005: This is obviously very much WIP, but I think the overall
> direction of it is good.
> 0006: Same.
>
> I think you should use PGINVALID_SOCKET rather than -1 in various
> places in various patches in this series, especially if you are going
> to try to merge the Windows code path.

Sure.

> I wonder if CreateEventSet should accept a MemoryContext argument. It
> seems like callers will usually want TopMemoryContext, and just being
> able to pass that might be more convenient than having to switch back
> and forth in the calling code.

Makes sense.

> I wonder if there's a way to refactor this code to avoid having so
> much cut-and-paste duplication.

I guess you mean WaitEventSetWait() and WaitEventAdjust*? I've tried, and my
attempt ended up looking nearly unreadable, because of the number of ifdefs.
I've not found a good approach. Which is sad, because adding back select
support is going to increase the duplication further :( - but it's also
further away from poll etc. (different type of timestamp, entirely different
way of returning events).

> When iterating over the returned events, maybe check whether events is
> 0 at the top of the loop and skip it forthwith if so.

You mean in WaitEventSetWait()? There's

    else if (rc == 0)
    {
        break;
    }

which is the timeout case. There should never be any other case of returning
0 elements?

> That's all I've got for now.

Thanks for looking.

Greetings,

Andres Freund

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
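The MemoryContext suggestion might end up looking something like this; the
member names (events, nevents_space) are invented for the sketch, since the
real WaitEventSet layout is private to latch.c:

/*
 * Sketch only: allocate the set, including its event storage, in a
 * caller-supplied context.
 */
#include "postgres.h"
#include "storage/latch.h"
#include "utils/memutils.h"

WaitEventSet *
CreateWaitEventSet(MemoryContext context, int nevents)
{
    WaitEventSet *set = MemoryContextAllocZero(context, sizeof(WaitEventSet));

    /* keep the event storage in the same, caller-chosen, context */
    set->events = MemoryContextAllocZero(context,
                                         sizeof(WaitEvent) * nevents);
    set->nevents_space = nevents;

    return set;
}

A caller wanting the set to outlive the current query would then simply pass
TopMemoryContext instead of switching contexts around the call.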