RE: Log a sample of transactions
Dear Adrien, I applied your file and ran into a possible bug. (I have not read your source yet. Sorry.) When I tried to use tab-completion, some unexpected messages appeared. The messages I saw are attached to this mail. Best Regards, Hayato Kuroda Fujitsu LIMITED postgres=# SET client_min_messages to LOG; SET postgres=# SET log_transaction_sample_rate to 1; SET postgres=# select * from LOG: duration: 4.906 ms statement: SELECT pg_catalog.quote_ident(c.relname) FROM pg_catalog.pg_class c WHERE c.relkind IN ('r', 'S', 'v', 'm', 'f', 'p') AND substring(pg_catalog.quote_ident(c.relname),1,0)='' AND pg_catalog.pg_table_is_visible(c.oid) AND c.relnamespace <> (SELECT oid FROM pg_catalog.pg_namespace WHERE nspname = 'pg_catalog') UNION SELECT pg_catalog.quote_ident(n.nspname) || '.' FROM pg_catalog.pg_namespace n WHERE substring(pg_catalog.quote_ident(n.nspname) || '.',1,0)='' AND (SELECT pg_catalog.count(*) FROM pg_catalog.pg_namespace WHERE substring(pg_catalog.quote_ident(nspname) || '.',1,0) = substring('',1,pg_catalog.length(pg_catalog.quote_ident(nspname))+1)) > 1 UNION SELECT pg_catalog.quote_ident(n.nspname) || '.' || pg_catalog.quote_ident(c.relname) FROM pg_catalog.pg_class c, pg_catalog.pg_namespace n WHERE c.relnamespace = n.oid AND c.relkind IN ('r', 'S', 'v', 'm', 'f', 'p') AND substring(pg_catalog.quote_ident(n.nspname) || '.' || pg_catalog.quote_ident(c.relname),1,0)='' AND substring(pg_catalog.quote_ident(n.nspname) || '.',1,0) = substring('',1,pg_catalog.length(pg_catalog.quote_ident(n.nspname))+1) AND (SELECT pg_catalog.count(*) FROM pg_catalog.pg_namespace WHERE substring(pg_catalog.quote_ident(nspname) || '.',1,0) = substring('',1,pg_catalog.length(pg_catalog.quote_ident(nspname))+1)) = 1 LIMIT 1000 LOG: duration: 1.871 ms statement: SELECT pg_catalog.quote_ident(c.relname) FROM pg_catalog.pg_class c WHERE c.relkind IN ('r', 'S', 'v', 'm', 'f', 'p') AND substring(pg_catalog.quote_ident(c.relname),1,0)='' AND pg_catalog.pg_table_is_visible(c.oid) AND c.relnamespace <> (SELECT oid FROM pg_catalog.pg_namespace WHERE nspname = 'pg_catalog') UNION SELECT pg_catalog.quote_ident(n.nspname) || '.' FROM pg_catalog.pg_namespace n WHERE substring(pg_catalog.quote_ident(n.nspname) || '.',1,0)='' AND (SELECT pg_catalog.count(*) FROM pg_catalog.pg_namespace WHERE substring(pg_catalog.quote_ident(nspname) || '.',1,0) = substring('',1,pg_catalog.length(pg_catalog.quote_ident(nspname))+1)) > 1 UNION SELECT pg_catalog.quote_ident(n.nspname) || '.' || pg_catalog.quote_ident(c.relname) FROM pg_catalog.pg_class c, pg_catalog.pg_namespace n WHERE c.relnamespace = n.oid AND c.relkind IN ('r', 'S', 'v', 'm', 'f', 'p') AND substring(pg_catalog.quote_ident(n.nspname) || '.' || pg_catalog.quote_ident(c.relname),1,0)='' AND substring(pg_catalog.quote_ident(n.nspname) || '.',1,0) = substring('',1,pg_catalog.length(pg_catalog.quote_ident(n.nspname))+1) AND (SELECT pg_catalog.count(*) FROM pg_catalog.pg_namespace WHERE substring(pg_catalog.quote_ident(nspname) || '.',1,0) = substring('',1,pg_catalog.length(pg_catalog.quote_ident(nspname))+1)) = 1 LIMIT 1000 foo pg_catalog. pg_toast.public. information_schema. pg_temp_1. pg_toast_temp_1.
Re: O_DIRECT for relations and SLRUs (Prototype)
> On 12 Jan 2019, at 9:46, Michael Paquier wrote: > > Note that pg_attribute_aligned() cannot be used as that's not an > option with clang and a couple of other compilers as far as I know, so > the patch uses a simple set of placeholder buffers large enough to be > aligned with the OS pages, which should be 4k for Linux by the way, > and not set to BLCKSZ, but for WAL's O_DIRECT we don't really care > much with such details. Is it possible to avoid those memcpy's by aligning available buffers instead? I couldn't understand this from the patch and this thread. Best regards, Andrey Borodin.
Re: O_DIRECT for relations and SLRUs (Prototype)
On Tue, Jan 15, 2019 at 11:19:48AM +0500, Andrey Borodin wrote: > Is it possible to avoid those memcpy's by aligning available buffers > instead? I couldn't understand this from the patch and this thread. Sure, it had better do that. That's just a lazy implementation. -- Michael signature.asc Description: PGP signature
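For readers wondering what "aligning available buffers" could look like in practice, here is a minimal sketch of the usual trick: over-allocate a static buffer and round the pointer up to the I/O alignment boundary. The 4096-byte alignment value and the helper name are assumptions for illustration, not code from the actual prototype patch.

#include "postgres.h"

/* Assumed OS page / O_DIRECT alignment; Linux typically wants 4 kB. */
#define OS_IO_ALIGN 4096

static char raw_io_buf[BLCKSZ + OS_IO_ALIGN];

static char *
aligned_io_buffer(void)
{
	/* TYPEALIGN (from c.h) rounds the address up to the requested boundary */
	return (char *) TYPEALIGN(OS_IO_ALIGN, raw_io_buf);
}

Doing the read/write directly into such an aligned buffer would remove the extra memcpy the prototype currently does into its placeholder buffers.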
Re: current_logfiles not following group access and instead follows log_file_mode permissions
On Tue, Jan 15, 2019 at 4:15 PM Michael Paquier wrote: > On Tue, Jan 15, 2019 at 03:08:41PM +1100, Haribabu Kommi wrote: > > current_logfiles is a meta data file, that stores the current log writing > > file, and this file presents in the data directory. This file > > doesn't follow the group access mode set at the initdb time, but it > > follows the log_file_mode permissions. > > > > Without group access permissions, backup with group access can lead to > > failure. Attached patch fix the problem. > > initdb enforces log_file_mode to 0640 when using the group mode, still > if one enforces the parameter value then current_logfiles would just > stick with it. This is not really user-friendly. This impacts also > normal log files as these get included in base backups if the log path > is within the data folder (not everybody uses an absolute path out of > the data folder for the logs). > We hit this problem when log_file_mode is set to 0600 but the database files have group access permissions. In our scenario, the log files are outside the data folder, so we faced the problem only with the current_logfiles file. > One way to think about this is that we may want to worry also about > normal log files and document that one had better be careful with the > setting of log_file_mode? Still, as we are talking about a file > aiming at storing meta-data for log files, something like what you > suggest can make sense. > Yes, with a log_file_mode more restrictive than 0640, keeping the log files inside the data directory can lead to backup failure. And yes, we could add extra information about group access to the documentation for log_file_mode. Another option: how about not letting the user choose a mode more restrictive than 0640 when group access is enabled? > When discussing about pg_current_logfile(), I raised the point about > not including as well in base backups which would also address the > problem reported here. However we decided to keep it because it can > be helpful to know what's the last log file associated to a base > backup for debugging purposes: > > https://www.postgresql.org/message-id/50b58f25-ab07-f6bd-7a68-68f29f214...@dalibo.com > > Instead of what you are proposing, why not revisiting that and just > exclude the file from base backups. I would be in favor of just doing > that instead of switching the file's permission from log_file_mode to > pg_file_create_mode. > I am not sure how useful it is to have the details of the log file in the backup. It may be useful when there is a problem with the backup. Excluding the file from the backup can solve the problem of a backup by an unprivileged user. Are there any scenarios where it can cause problems if it doesn't follow the group access mode? Regards, Haribabu Kommi Fujitsu Australia
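For anyone trying to reproduce the scenario, it boils down to something like the following (paths and values are made up for illustration; the failing step is a file-level backup run by a group-only OS user, not pg_basebackup through the server):

initdb -D /pgdata --allow-group-access       # data files 0640, directories 0750

# postgresql.conf
logging_collector = on
log_directory = '/var/log/postgresql'        # log files kept outside the data dir
log_file_mode = 0600                         # current_logfiles in /pgdata inherits this

# A tar/rsync backup of /pgdata run as an OS user that only has group access
# can read everything except current_logfiles, so the backup fails on that file.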
Re: Pluggable Storage - Andres's take
On Sat, 12 Jan 2019 at 18:11, Dmitry Dolgov <9erthali...@gmail.com> wrote: > > On Sat, Jan 12, 2019 at 1:44 AM Andres Freund wrote: > > > /* other fields were zeroed above */ > > > > > > @@ -9355,7 +9370,7 @@ dumpComment(Archive *fout, const char *type, const > > > char *name, > > >* post-data. > > >*/ > > > ArchiveEntry(fout, nilCatalogId, createDumpId(), > > > - tag->data, namespace, NULL, owner, > > > + tag->data, namespace, NULL, owner, > > > NULL, > > >"COMMENT", SECTION_NONE, > > >query->data, "", NULL, > > >&(dumpId), 1, > > > > We really ought to move the arguments to a struct, so we don't generate > > quite as much useless diffs whenever we do a change around one of > > these... > > That's what I thought too. Maybe then I'll suggest a mini-patch to the master > to > refactor these arguments out into a separate struct, so we can leverage it > here. Then for each of the calls, we would need to declare that structure variable (with = {0}) and assign the required fields before passing it to ArchiveEntry(). But a major reason for ArchiveEntry() is to avoid doing this and instead conveniently pass those fields as parameters. This would add unnecessary lines of code. I think a better way is to have an ArchiveEntry() function with a limited number of parameters, and an ArchiveEntryEx() with those extra parameters which are not needed in usual cases. E.g. we can have tablespace, tableam, dumpFn and dumpArg as the extra arguments of ArchiveEntryEx(), because in most places these are passed as NULL. All future arguments would go in ArchiveEntryEx(). Comments? -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
regress tests fails
Hi make check-world fails t/008_diff_schema.pl .. ok t/009_matviews.pl . ok t/010_truncate.pl . ok t/011_generated.pl # Looks like your test exited with 29 before it could output anything. t/011_generated.pl Dubious, test returned 29 (wstat 7424, 0x1d00) Failed 2/2 subtests Test Summary Report --- t/011_generated.pl (Wstat: 7424 Tests: 0 Failed: 0) Non-zero exit status: 29 Parse errors: Bad plan. You planned 2 tests but ran 0. Files=11, Tests=50, 62 wallclock secs ( 0.09 usr 0.03 sys + 21.82 cusr 7.63 csys = 29.57 CPU) Result: FAIL make[2]: *** [Makefile:19: check] Chyba 1 make[2]: Opouští se adresář „/home/pavel/src/postgresql.master/src/test/subscription“ make[1]: *** [Makefile:48: check-subscription-recurse] Chyba 2 make[1]: Opouští se adresář „/home/pavel/src/postgresql.master/src/test“ make: *** [GNUmakefile:70: check-world-src/test-recurse] Chyba 2 Regards Pavel
Re: regress tests fails
On Tue, 15 Jan 2019 at 10:47, Pavel Stehule wrote: > Hi > > make check-world fails > > t/008_diff_schema.pl .. ok > t/009_matviews.pl . ok > t/010_truncate.pl . ok > t/011_generated.pl # Looks like your test exited with 29 before it > could output anything. > t/011_generated.pl Dubious, test returned 29 (wstat 7424, 0x1d00) > Failed 2/2 subtests > > Test Summary Report > --- > t/011_generated.pl (Wstat: 7424 Tests: 0 Failed: 0) > Non-zero exit status: 29 > Parse errors: Bad plan. You planned 2 tests but ran 0. > Files=11, Tests=50, 62 wallclock secs ( 0.09 usr 0.03 sys + 21.82 cusr > 7.63 csys = 29.57 CPU) > Result: FAIL > make[2]: *** [Makefile:19: check] Chyba 1 > make[2]: Opouští se adresář > „/home/pavel/src/postgresql.master/src/test/subscription“ > make[1]: *** [Makefile:48: check-subscription-recurse] Chyba 2 > make[1]: Opouští se adresář „/home/pavel/src/postgresql.master/src/test“ > make: *** [GNUmakefile:70: check-world-src/test-recurse] Chyba 2 > > I am sorry for the noise. Looks like garbage in my repository. Pavel > Regards > > Pavel >
Re: Pluggable Storage - Andres's take
On Tue, 15 Jan 2019 at 12:27, Dmitry Dolgov <9erthali...@gmail.com> wrote: > > > On Mon, Jan 14, 2019 at 2:07 PM Amit Khandekar > > wrote: > > > > createPQExpBuffer() should be moved after the below statement, so that > > it does not leak memory > > Thanks for noticing, fixed. Looks good. > > > So how about bumping up the archive version and doing these checks ? > > Yeah, you're right, I've added this check. Need to bump K_VERS_MINOR as well. On Mon, 14 Jan 2019 at 18:36, Amit Khandekar wrote: > +static void _selectTableAccessMethod(ArchiveHandle *AH, const char > *tablespace); > tablespace => tableam This is yet to be addressed. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: Log a sample of transactions
On 1/15/19 9:00 AM, Kuroda, Hayato wrote: Dear Adrien, I applied your file and faced a bug-candidate. (I have not read your source yet. Sorry.) When I tried to use tab-complition, some dirty messages were appeared. Messages I faced are attached in this mail. Best Regards, Hayato Kuroda Fujitsu LIMITED Hello, Thanks for your review ;) Your messages looks normal to me, you will have same messages if you put log_min_duration to 0. However, my compiler warn me of multiple definition of xact_is_sampled. I fix it in this 2nd patch. Regards, diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index b6f5822b84..3eb418cf99 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -5780,6 +5780,24 @@ local0.*/var/log/postgresql + + log_transaction_sample_rate (real) + + log_transaction_sample_rate configuration parameter + + + + + Set the fraction of transactions whose statements are logged. It applies + to each new transaction regardless of their duration. The default is + 0, meaning do not log statements from this transaction. + Setting this to 1 logs all statements for all transactions. + log_transaction_sample_rate is helpful to track a + sample of transaction. + + + + diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c index f665e38ecf..19306a9547 100644 --- a/src/backend/access/transam/xact.c +++ b/src/backend/access/transam/xact.c @@ -1821,6 +1821,9 @@ StartTransaction(void) s->state = TRANS_START; s->transactionId = InvalidTransactionId; /* until assigned */ + /* Determine if we log statements in this transaction */ + xact_is_sampled = random() <= log_xact_sample_rate * MAX_RANDOM_VALUE; + /* * initialize current transaction state fields * diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c index 0c0891b33e..faa0485796 100644 --- a/src/backend/tcop/postgres.c +++ b/src/backend/tcop/postgres.c @@ -100,7 +100,8 @@ int max_stack_depth = 100; /* wait N seconds to allow attach from a debugger */ int PostAuthDelay = 0; - +/* flag for logging statements in this transaction */ +bool xact_is_sampled = false; /* * private variables @@ -2203,6 +2204,8 @@ check_log_statement(List *stmt_list) * check_log_duration * Determine whether current command's duration should be logged. * If log_statement_sample_rate < 1.0, log only a sample. + * We also check if this statement in this transaction must be logged + * (regardless of its duration). 
* * Returns: * 0 if no logging is needed @@ -2218,7 +2221,7 @@ check_log_statement(List *stmt_list) int check_log_duration(char *msec_str, bool was_logged) { - if (log_duration || log_min_duration_statement >= 0) + if (log_duration || log_min_duration_statement >= 0 || xact_is_sampled) { long secs; int usecs; @@ -2252,11 +2255,11 @@ check_log_duration(char *msec_str, bool was_logged) (log_statement_sample_rate == 1 || random() <= log_statement_sample_rate * MAX_RANDOM_VALUE); - if ((exceeded && in_sample) || log_duration) + if ((exceeded && in_sample) || log_duration || xact_is_sampled) { snprintf(msec_str, 32, "%ld.%03d", secs * 1000 + msecs, usecs % 1000); - if (exceeded && !was_logged) + if ((exceeded || xact_is_sampled) && !was_logged) return 2; else return 1; diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index c216ed0922..06a5e668aa 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -494,6 +494,7 @@ int client_min_messages = NOTICE; int log_min_duration_statement = -1; int log_temp_files = -1; double log_statement_sample_rate = 1.0; +double log_xact_sample_rate = 0; int trace_recovery_messages = LOG; int temp_file_limit = -1; @@ -3347,13 +3348,25 @@ static struct config_real ConfigureNamesReal[] = {"log_statement_sample_rate", PGC_SUSET, LOGGING_WHEN, gettext_noop("Fraction of statements over log_min_duration_statement to log."), gettext_noop("If you only want a sample, use a value between 0 (never " - "log) and 1.0 (always log).") + "log) and 1 (always log).") }, &log_statement_sample_rate, 1.0, 0.0, 1.0, NULL, NULL, NULL }, + { + {"log_transaction_sample_rate", PGC_SUSET, LOGGING_WHEN, + gettext_noop("Set the fraction of transactions to log for new transactions."), + gettext_noop("Logs all statements from a fraction of transactions. " + "Use a value between 0 (never log) and 1 (log all " + "statements for all transactions).") + }, + &log_xact_sample_rate, + 0.0, 0.0, 1.0, + NULL, NULL, NULL + }, + /* End-of-list marker */ { {NULL, 0, 0, NULL, NULL}, NULL, 0.0, 0.0, 0.0, NULL, NULL, NULL diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresq
Re: FETCH FIRST clause WITH TIES option
On Wed, Jan 2, 2019 at 6:19 PM Tomas Vondra wrote: > After looking at the "FETCH FIRST ... PERCENT" patch, I wonder if this > patch should tweak estimates in some way. Currently, the cardinality > estimate is the same as for plain LIMIT, using the requested number of > rows. But let's say there are very few large groups - that will > naturally increase the number of rows produced. > > As an example, let's say the subplan produces 1M rows, and there are > 1000 groups (when split according to the ORDER BY clause). > Can we reliably use the ORDER BY column's raw statistics in the limit node? It seems to me they can be affected by other operations in the subplan, like filter conditions. regards Surafel
Re: Synchronous replay take III
On Sat, Dec 1, 2018 at 10:49 AM Thomas Munro wrote: > > On Sat, Dec 1, 2018 at 9:06 AM Dmitry Dolgov <9erthali...@gmail.com> wrote: > > Unfortunately, cfbot says that patch can't be applied without conflicts, > > could > > you please post a rebased version and address commentaries from Masahiko? > > Right, it conflicted with 4c703369 and cfdf4dc4. While rebasing on > top of those, I found myself wondering why syncrep.c thinks it needs > special treatment for postmaster death. I don't see any reason why we > shouldn't just use WL_EXIT_ON_PM_DEATH, so I've done it like that in > this new version. If you kill -9 the postmaster, I don't see any > reason to think that the existing coding is more correct than simply > exiting immediately. > > On Thu, Nov 15, 2018 at 6:34 PM Masahiko Sawada wrote: > > On Thu, Mar 1, 2018 at 10:40 AM Thomas Munro > > wrote: > > > I was pinged off-list by a fellow -hackers denizen interested in the > > > synchronous replay feature and wanting a rebased patch to test. Here > > > it goes, just in time for a Commitfest. Please skip to the bottom of > > > this message for testing notes. > > > > Thank you for working on this. The overview and your summary was > > helpful for me to understand this feature, thank you. I've started to > > review this patch for PostgreSQL 12. I've tested this patch and found > > some issue but let me ask you questions about the high-level design > > first. Sorry if these have been already discussed. > > Thanks for your interest in this work! > > > > This is a design choice favouring read-mostly workloads at the expense > > > of write transactions. Hot standbys' whole raison for existing is to > > > move *some* read-only workloads off the primary server. This proposal > > > is for users who are prepared to trade increased primary commit > > > latency for a guarantee about visibility on the standbys, so that > > > *all* read-only work could be moved to hot standbys. > > > > To be clear what did you mean read-mostly workloads? > > I mean workloads where only a small percentage of transactions perform > a write. If you need write-scalability, then hot_standby is not the > solution for you (with or without this patch). > > The kind of user who would be interested in this feature is someone > who already uses some kind of heuristics to move some queries to > read-only standbys. For example, some people send transaction for > logged-in users to the primary database (because only logged-in users > generate write queries), and all the rest to standby servers (for > example "public" users who can only read content). Another technique > I have seen is to keep user sessions "pinned" on the primary server > for N minutes after they perform a write transaction. These types of > load balancing policies are primitive ways of achieving > read-your-writes consistency, but they are conservative and > pessimistic: they probably send too many queries to the primary node. > > This proposal is much more precise, allowing you to run the minimum > number of transactions on the primary node (ie transactions that > actually need to perform a write), and the maximum number of > transactions on the hot standbys. > > As discussed, making reads wait for a token would be a useful > alternative (and I am willing to help make that work too), but: > > 1. For users that do more many more reads than writes, would you > rather make (say) 80% of transactions slower or 20%? (Or 99% vs 1% as > the case may be, depending on your application.) > > 2. 
If you are also using synchronous_commit = on for increased > durability, then you are already making writers wait, and you might be > able to tolerate a small increase. > > Peter Eisentraut expressed an interesting point of view against this > general line of thinking: > > https://www.postgresql.org/message-id/5643933F.4010701%40gmx.net > > My questions are: Why do we have hot_standby mode? Is load balancing > a style of usage we want to support? Do we want a technology that > lets you do more of it? > > > I think there are two kind of reads on standbys: a read happend after > > writes and a directly read (e.g. reporting). The former usually > > requires the causal reads as you mentioned in order to read its own > > writes but the latter might be different: it often wants to read the > > latest data on the master at the time. IIUC even if we send a > > read-only query directly to a synchronous replay server we could get a > > stale result if the standby delayed for less than > > synchronous_replay_max_lag. So this synchronous replay feature would > > be helpful for the former case(i.e. a few writes and many reads wants > > to see them) whereas for the latter case perhaps the keeping the reads > > waiting on standby seems a reasonable solution. > > I agree 100% that this is not a solution for all users. But I also > suspect a token system would be quite complicated, and can't be done > in a way that is transparent to applications w
Re: Ryu floating point output patch
> "Andres" == Andres Freund writes: >> In particular, how does it know how every strtod() on the planet >> will react to specific input? Andres> strtod()'s job ought to computationally be significantly easier Andres> than the other way round, no? And if there's buggy strtod() Andres> implementations out there, why would they be guaranteed to do Andres> the correct thing with our current output, but not with ryu's? Funny thing: I've been devoting considerable effort to testing this, and the one failure mode I've found is very interesting; it's not a problem with strtod(), in fact it's a bug in our float4in caused by _misuse_ of strtod(). In particular, float4in thinks it's ok to do strtod() on the input, and then round the result to a float. It is wrong to think that, and here's why: Consider the (float4) bit pattern x15ae43fd. The true decimal value of this, and its adjacent values are: x15ae43fc = 7.0385300755536269169437150392273653469292493678466371420654468238353729248046875e-26 midpoint 7.0385303837024180189014515281838361605176203339429008565275580622255802154541015625e-28 x15ae43fd = 7.038530691851209120859188017140306974105991300039164570989669300615787506103515625e-26 midpoint 7.0385310222816924506096876943622661354282854517805390059947967529296875e-26 x15ae43fe = 7.0385313081487913247746609950532486012827332322316911389177739620208740234375e-26 Now look at what happens if you pass '7.038531e-26' as input for float4. >From the above values, the correct result is x15ae43fd, because any other value would be strictly further away. But if you input the value as a double first, here are the adjacent representations: x3ab5c87fafff = 7.0385309074873222531206633286975087635036133540...e-26 midpoint 7.03853096488450735186517055373347249505857809130...e-28 x3ab5c87fb000 = 7.03853102228169245060968769436226613542828545...e-28 midpoint 7.038531079678877549354185003805399958168507565784...e-28 x3ab5c87fb001 = 7.038531137076062648098692228841363689723472303024...e-28 So the double value is x3ab5c87fb000, which has a mantissa of (1).0101110010110... which when rounded to 23 bits (excluding the implied (1)), is 010 1110 0100 0011 1110 = 2e43fe So the rounded result is incorrectly x15ae43fe, rather than the expected correct value of x15ae43fd. strtof() exists as part of the relevant standards, so float4in should be using that in place of strtod, and that seems to fix this case for me. -- Andrew (irc:RhodiumToad)
Re: Log a sample of transactions
On Sat, Jan 5, 2019 at 12:57 AM Adrien Mobile wrote: > > Le 4 janvier 2019 13:20:09 GMT+01:00, Peter Eisentraut > a écrit : > >On 12/12/2018 22:32, Adrien Nayrat wrote: > >> Per idea of Nikolay Samokhvalov[1] I propose this patch to add the > >possibility > >> to log all statements from a fraction of transactions. > >> > >> I have several questions: > >> * Do we want this feature? > > > >It's not clear to me what the use is. The per-statement sampling > >allows > >you to capture slow queries without overwhelming the logs. We don't > >have any logging of slow transactions or any other transaction scope > >logging, so what will this sample? > > > >-- > >Peter Eisentraut http://www.2ndQuadrant.com/ > >PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services > > Hello, > When you troubleshoot applicative issues with multi-statements transaction, > you may have to log all queries to find all statements of one transaction. > With high throughput, it could be hard to log all queries without causing > troubles. Hm, can we use log_min_duration_statement to find slow queries of a transaction instead? Could you please elaborate on the use-case? Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: SPI Interface to Call Procedure with Transaction Control Statements?
> "Jack" == Jack LIU writes: Jack> Hi Andrew, Jack> This is my code to call the procedure with Jack> SPI_connect_ext(SPI_OPT_NONATOMIC). Ah. You need to take a look at exec_stmt_call in plpgsql, and do the same things it does with snapshot management (specifically, setting the no_snapshot flag on the plan that you're going to execute). SPI forces atomic mode if the normal snapshot management is in use, because otherwise a commit inside the procedure would warn about still having a snapshot open. -- Andrew (irc:RhodiumToad)
Re: SPI Interface to Call Procedure with Transaction Control Statements?
On 15/01/2019 11:49, Andrew Gierth wrote: >> "Jack" == Jack LIU writes: > > Jack> Hi Andrew, > Jack> This is my code to call the procedure with > Jack> SPI_connect_ext(SPI_OPT_NONATOMIC). > > Ah. You need to take a look at exec_stmt_call in plpgsql, and do the > same things it does with snapshot management (specifically, setting the > no_snapshot flag on the plan that you're going to execute). SPI forces > atomic mode if the normal snapshot management is in use, because > otherwise a commit inside the procedure would warn about still having a > snapshot open. Yeah, eventually we might want to add a new SPI function to do non-atomic calls, but right now you need to go the manual route. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: FETCH FIRST clause WITH TIES option
On 1/15/19 11:07 AM, Surafel Temesgen wrote: > > > On Wed, Jan 2, 2019 at 6:19 PM Tomas Vondra wrote: > > After looking at the "FETCH FIRST ... PERCENT" patch, I wonder if this > patch should tweak estimates in some way. Currently, the cardinality > estimate is the same as for plain LIMIT, using the requested number of > rows. But let's say there are very few large groups - that will > naturally increase the number of rows produced. > > As an example, let's say the subplan produces 1M rows, and there are > 1000 groups (when split according to the ORDER BY clause). > > > > > can we use ORDER BY column raw statistic in limit node reliably? > because it seems to me it can be affected by other operation in a > subplan like filter condition > What do you mean by "raw statistic"? Using stats from the underlying table is not quite possible, because you might be operating on top of a join or something like that. IMHO the best thing you can do is call estimate_num_groups() and combine that with the number of input rows. That shall benefit from ndistinct coefficients when available, etc. I've been thinking that considering the unreliability of grouping estimates we should use a multiple of the average group size (because there may be much larger groups), but I think that's quite unprincipled and I'd much rather try without it first. But maybe we can do better when there really is a single table to consider, in which case we might look at MCV lists and estimate the largest group. That would give us a much better idea of the worst case. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
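A rough sketch of the adjustment being suggested (nothing here is committed code; the function and the surrounding names are illustrative only):

#include "postgres.h"
#include "utils/selfuncs.h"

/*
 * Estimate output rows of LIMIT n ... WITH TIES, given the subplan's row
 * count and the ORDER BY expressions.  Sketch only.
 */
static double
estimate_with_ties_rows(PlannerInfo *root, List *orderby_exprs,
						double input_rows, double limit_rows)
{
	double		ngroups;
	double		avg_group_size;

	/* benefits from extended statistics / ndistinct coefficients */
	ngroups = estimate_num_groups(root, orderby_exprs, input_rows, NULL);
	avg_group_size = input_rows / Max(ngroups, 1.0);

	/*
	 * WITH TIES returns the requested rows plus the remainder of the last
	 * peer group, so pad the plain-LIMIT estimate by roughly one group.
	 */
	return Min(input_rows, limit_rows + avg_group_size);
}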
Re: Pluggable Storage - Andres's take
> On Tue, Jan 15, 2019 at 10:52 AM Amit Khandekar > wrote: > > Need to bump K_VERS_MINOR as well. I've bumped it up, but somehow this change escaped the previous version. Now it should be there, thanks! > On Mon, 14 Jan 2019 at 18:36, Amit Khandekar wrote: > > +static void _selectTableAccessMethod(ArchiveHandle *AH, const char > > *tablespace); > > tablespace => tableam > > This is yet to be addressed. Fixed. Also, I guess the other attached patch should address the psql part, namely displaying a table access method with \d+ and the possibility to hide it with a psql variable (HIDE_TABLEAM, but I'm open to suggestions about the name). pg_dump_access_method_v4.patch Description: Binary data psql_describe_am.patch Description: Binary data
Re: Connection slots reserved for replication
On Wed, Jan 2, 2019, at 12:36, Alexander Kukushkin wrote: > Hi, > > On Wed, 2 Jan 2019 at 12:17, Oleksii Kliukin wrote: > > > Thank you. Attached the new version(called it v8) with the following > > changes to the documentation: > > Thank you for jumping on it. Your changes look good to me. > > > I was also thinking about changing the value in PG_CONTROL_VERSION, > because we added the new field into the control file, but decided to > leave this change to committer. Sounds reasonable, not sure what the 'official policy' is on this. As no further reviews or issues have been found for almost 2 weeks since the last one, should we move it to RFC? Cheers, Oleksii
Re: Reviving the "Stopping logical replication protocol" patch from Vladimir Gordichuk
Dave Cramer On Sun, 13 Jan 2019 at 23:19, Craig Ringer wrote: > On Mon, 3 Dec 2018 at 19:38, Dave Cramer wrote: > >> Dmitry, >> >> Please see attached rebased patches >> > > I'm fine with patch 0001, though I find this comment a bit hard to follow: > > + * The send_data callback must enqueue complete CopyData messages to libpq > + * using pq_putmessage_noblock or similar, since the walsender loop may > send > + * CopyDone then exit and return to command mode in response to a client > + * CopyDone between calls to send_data. > > I think it needs splitting up into a couple of sentences. > > Fair point, remember it was originally written by a non-english speaker > > In patch 0002, stopping during a txn. It's important that once we skip any > action, we continue skipping. In patch 0002 I'd like it to be clearer that > we will *always* skip the rb->commit callback if rb->continue_decoding_cb() > returned false and aborted the while loop. I suggest storing the result of > (rb->continue_decoding_cb == NULL || rb->continue_decoding_cb()) in an > assignment within the while loop, and testing that result later. > > e.g. > > (continue_decoding = (rb->continue_decoding_cb == NULL || > rb->continue_decoding_cb())) > > and later > > if (continue_decoding) { > rb->commit(rb, txn, commit_lsn); > } > > Will do > I don't actually think it's necessary to re-test the continue_decoding > callback and skip commit here. If we've sent all the of the txn > except the commit, we might as well send the commit, it's tiny and might > save some hassle later. > > > I think a comment on the skipped commit would be good too: > > /* > * If we skipped any changes due to a client CopyDone we must not send a > commit > * otherwise the client would incorrectly think it received a complete > transaction. > */ > > I notice that the fast-path logic in WalSndWriteData is removed by this > patch. It isn't moved; there's no comparison of last_reply_timestamp > and wal_sender_timeout now. What's the rationale there? You want to ensure > that we reach ProcessRepliesIfAny() ? Can we cheaply test for a readable > client socket then still take the fast-path if it's not readable? > This may have been a mistake as I am taking this over from a very old patch that I did not originally write. I'll have a look at this I > > -- > Craig Ringer http://www.2ndQuadrant.com/ > 2ndQuadrant - PostgreSQL Solutions for the Enterprise >
Re: Libpq support to connect to standby server as priority
On Mon, 14 Jan 2019 at 21:19, Tsunakawa, Takayuki < tsunakawa.ta...@jp.fujitsu.com> wrote: > From: Tom Lane [mailto:t...@sss.pgh.pa.us] > > The problem here of course is that whoever invented target_session_attrs > > was unconcerned with following that precedent, so what we have is > > "target_session_attrs=(any | read-write)". > > Are we prepared to add some aliases in service of unifying these names? > > I think "yes". > Agreed. There's no downside to aliasing and I'd really like to see consistency. > > > 2. Whether or not you want to follow pgJDBC's naming, it seems like we > > ought to have both "require read only" and "prefer read only" behaviors > > in this patch, and maybe likewise "require read write" versus "prefer > > read write". > > Agreed, although I don't see a use case for "prefer read write". I don't > think there's an app like "I want to write, but I'm OK if I cannot." > > > 3. We ought to sync this up with whatever's going to happen in > > https://commitfest.postgresql.org/21/1090/ > > at least to the extent of agreeing on what GUCs we'd like to see > > the server start reporting. > > Yes. > > > 4. Given that other discussion, it's not quite clear what we should > > even be checking. The existing logic devolves to checking that > > transaction_read_only is true, but that's not really the same thing as > > "is a master server", eg you might have connected to a master server > > under a role that has SET ROLE default_transaction_read_only = false. > > (I wonder what pgJDBC is really checking, under the hood.) > > Do we want to have modes that are checking hot-standby state in some > > fashion, rather than the transaction_read_only state? > > PgJDBC uses transaction_read_only like this: > > [core/v3/ConnectionFactoryImpl.java] > private boolean isMaster(QueryExecutor queryExecutor) throws > SQLException, IOException { > byte[][] results = SetupQueryRunner.run(queryExecutor, "show > transaction_read_only", true); > String value = queryExecutor.getEncoding().decode(results[0]); > return value.equalsIgnoreCase("off"); > } > > But as some people said, I don't think this is the right way. I suppose > what's leading to the current somewhat complicated situation is that there > was no easy way for the client to know whether the server is the master. > That ended up in using "SHOW transaction_read_only" instead, and people > supported that compromise by saying "read only status is more useful than > whether the server is standby or not," I'm afraid. > > The original desire should have been the ability to connect to a primary > or a standby. So, I think we should go back to the original thinking (and > not complicate the feature), and create a read only GUC_REPORT variable, > say, server_role, that identifies whether the server is a primary or a > standby. > > I'm confused as to how this would work. Who or what determines if the server is a primary or standby? Dave Cramer da...@postgresintl.com www.postgresintl.com > >
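As a concrete reference point for the naming discussion: only "any" and "read-write" exist in released libpq today, and anything like "prefer-read" below is a value proposed in this thread, not something you can use yet.

# Works today (libpq 10+): try the hosts in order and require one that
# accepts writes.
psql "host=node1,node2,node3 port=5432 target_session_attrs=read-write"

# Proposed in this thread (illustrative only, not yet implemented):
# psql "host=node1,node2,node3 port=5432 target_session_attrs=prefer-read"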
Re: using expression syntax for partition bounds
On 15/01/2019 07:31, Amit Langote wrote: >> Is "partition bound" the right term? For list partitioning, it's not >> really a bound. Maybe it's OK. > > Hmm, maybe "partition bound value"? Just want to stress that the > expression specifies "bounding" value of a partition. I was more concerned about the term "bound", which it is not outside of range partitioning. But I can't think of a better term. I wouldn't change expr to value as you have done between v7 and v8, since the point of this patch is to make expressions valid where previously only values were. >> I don't see any treatment of non-immutable expressions. There is a test >> case with a non-immutable cast that was removed, but that's all. There >> was considerable discussion earlier in the thread on whether >> non-immutable expressions should be allowed. I'm not sure what the >> resolution was, but in any case there should be documentation and tests >> of the outcome. > > The partition bound expression is evaluated only once, that is, when the > partition creation command is executed, and what gets stored in the > catalog is a Const expression that contains the value that was computed, > not the original non-immutable expression. So, we don't need to do > anything special in terms of code to handle users possibly specifying a > non-immutable expression as partition bound. Tom said something in that > thread which seems to support such a position: > > https://www.postgresql.org/message-id/22534.1523374457%40sss.pgh.pa.us OK, if we are going with that approach, then it needs to be documented and there should be test cases. >> The collation handling might need another look. The following is >> allowed without any diagnostic: >> >> CREATE TABLE t2 ( >> a text COLLATE "POSIX" >> ) PARTITION BY RANGE (a); >> CREATE TABLE t2a PARTITION OF t2 FOR VALUES FROM ('foo' COLLATE "C") TO >> ('xyz'); >> >> I think the correct behavior would be that an explicit collation in the >> partition bound expression is an error. > But that's essentially ignoring the collation specified by the user for > the partition bound value without providing any information about that to > the user. I've fixed that by explicitly checking if the collate clause in > the partition bound value expression contains the same collation as the > parent partition key collation and give an error otherwise. I think that needs more refinement. In v8, the following errors CREATE TABLE t2 ( a name COLLATE "POSIX" ) PARTITION BY RANGE (a); CREATE TABLE t2a PARTITION OF t2 FOR VALUES FROM (name 'foo') TO (name 'xyz'); ERROR: collation of partition bound value for column "a" does not match partition key collation "POSIX" The problem here is that the "name" type has a collation that is neither the one of the column nor the default collation. We can allow that. What we don't want is someone writing an explicit COLLATE clause. I think we just need to check that there is no top-level COLLATE clause. This would then sort of match the logic in parse_collate.c for combining collations. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Proving IS NOT NULL inference for ScalarArrayOpExpr's
On Tue, Jan 15, 2019 at 12:47 AM David Rowley wrote: > 1. In: > > + if (IsA(clause, ScalarArrayOpExpr)) > + { > + ScalarArrayOpExpr *saop = (ScalarArrayOpExpr *) clause; > + Node *subexpr = (Node *) ((NullTest *) predicate)->arg; > + if (op_strict(saop->opno) && > + clause_is_strict_for((Node *) linitial(saop->args), subexpr)) > + return true; > + } > + > /* foo IS NOT NULL refutes foo IS NULL */ > if (clause && IsA(clause, NullTest) && > > Your IsA(clause, ScalarArrayOpExpr) condition should also be checking > that "clause" can't be NULL. The NullTest condition one below does > this. Fixed in my local copy. I also did the same in predicate_implied_by_simple_clause since it seems to have the same potential issue. After sleeping on it and looking at it again this morning, I also realized I need to exclude weak implication in predicate_refuted_by_simple_clause. > 2. I was also staring predicate_implied_by_simple_clause() a bit at > the use of clause_is_strict_for() to ensure that the IS NOT NULL's > operand matches the ScalarArrayOpExpr left operand. Since > clause_is_strict_for() = "Can we prove that "clause" returns NULL if > "subexpr" does?", in this case, your clause is the ScalarArrayOpExpr's > left operand and subexpr is the IS NOT NULL's operand. This means > that a partial index with "WHERE a IS NOT NULL" should also be fine to > use for WHERE strict_func(a) IN (1,2,..., 101); since strict_func(a) > must be NULL if a is NULL. Also also works for WHERE a+a > IN(1,2,...,101); I wonder if it's worth adding a test for that, or > even just modify one of the existing tests to ensure you get the same > result from it. Perhaps it's worth an additional test to ensure that x > IN(1,2,...,101) does not imply x+x IS NOT NULL and maybe that x+x IS > NULL does not refute x IN(1,2,...,101), as a strict function is free > to return NULL even if it's input are not NULL. Are you suggesting a different test than clause_is_strict_for to verify the saop LHS is the same as the null test's arg? I suppose we could use "equal()" instead? I've added several tests, and noticed a few things: 1. The current code doesn't result in "strongly_implied_by = t" for the "(x + x) is not null" case, but it does result in "s_i_holds = t". This doesn't change if I switch to using "equal()" as mentioned above. 2. The tests I have for refuting "x is null" show that w_r_holds as well as s_r_holds. I'd only expected the latter, since only "strongly_refuted_by = t". 3. The tests I have for refuting "(x + x) is null" show that both s_r_holds and w_r_holds. I'd expected these to be false. I'm attaching the current version of the patch with the new tests, but I'm not sure I understand the *_holds results mentioned above. Are they an artifact of the data under test? Or do they suggest a remaining bug? I'm a bit fuzzy on what to expect for *_holds though I understand the requirements for strongly/weakly_implied/refuted_by. James Coleman saop_is_not_null-v6.patch Description: Binary data
Re: current_logfiles not following group access and instead follows log_file_mode permissions
Haribabu Kommi writes: > Excluding the file in the backup can solve the problem of backup by an > unprivileged user. Is there any scenarios it can cause problems if it > doesn't follow the group access mode? The point of this file, as I understood it, was to allow someone who's allowed to read the log files to find out which one is the latest. It makes zero sense for it to have different permissions from the log files, because doing that would break its only use-case. I am wondering what is the use-case for a backup arrangement that's so fragile it can't cope with varying permissions in the data directory. regards, tom lane
Re: insensitive collations
On 14/01/2019 13:23, Daniel Verite wrote: > On a table with pre-existing contents, the creation of a unique index > does not seem to detect the duplicates that are equal per the > collation and different binary-wise. Fixed in the attached updated patch. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services >From dcd081ba5fdc3a29adddcd52deac5fb67d18fcc1 Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Tue, 15 Jan 2019 15:54:33 +0100 Subject: [PATCH v3] Collations with nondeterministic comparison This adds a flag "deterministic" to collations. If that is false, such a collation disables various optimizations that assume that strings are equal only if they are byte-wise equal. That then allows use cases such as case-insensitive or accent-insensitive comparisons or handling of strings with different Unicode normal forms. The term "deterministic comparison" in this context is from Unicode Technical Standard #10 (https://unicode.org/reports/tr10/#Deterministic_Comparison). --- contrib/bloom/bloom.h | 1 + contrib/bloom/blutils.c | 3 +- doc/src/sgml/catalogs.sgml| 7 + doc/src/sgml/charset.sgml | 11 +- doc/src/sgml/ref/create_collation.sgml| 22 ++ src/backend/access/hash/hashfunc.c| 86 + src/backend/catalog/pg_collation.c| 2 + src/backend/commands/collationcmds.c | 25 +- src/backend/executor/execExpr.c | 4 +- src/backend/executor/execGrouping.c | 12 +- src/backend/executor/execPartition.c | 1 + src/backend/executor/nodeAgg.c| 4 + src/backend/executor/nodeGroup.c | 1 + src/backend/executor/nodeHash.c | 14 +- src/backend/executor/nodeHashjoin.c | 5 + src/backend/executor/nodeRecursiveunion.c | 1 + src/backend/executor/nodeSetOp.c | 2 + src/backend/executor/nodeSubplan.c| 8 + src/backend/executor/nodeUnique.c | 1 + src/backend/executor/nodeWindowAgg.c | 2 + src/backend/nodes/copyfuncs.c | 7 + src/backend/nodes/outfuncs.c | 7 + src/backend/nodes/readfuncs.c | 7 + src/backend/optimizer/plan/createplan.c | 54 ++- src/backend/optimizer/util/tlist.c| 25 ++ src/backend/partitioning/partbounds.c | 4 +- src/backend/partitioning/partprune.c | 3 +- src/backend/utils/adt/arrayfuncs.c| 2 +- src/backend/utils/adt/name.c | 32 +- src/backend/utils/adt/orderedsetaggs.c| 3 +- src/backend/utils/adt/pg_locale.c | 1 + src/backend/utils/adt/ri_triggers.c | 20 + src/backend/utils/adt/varchar.c | 20 + src/backend/utils/adt/varlena.c | 51 ++- src/backend/utils/cache/catcache.c| 2 +- src/backend/utils/cache/lsyscache.c | 16 + src/bin/initdb/initdb.c | 4 +- src/bin/pg_dump/pg_dump.c | 39 +- src/bin/psql/describe.c | 17 +- src/include/catalog/pg_collation.h| 2 + src/include/executor/executor.h | 3 + src/include/executor/hashjoin.h | 1 + src/include/executor/nodeHash.h | 2 +- src/include/nodes/execnodes.h | 3 + src/include/nodes/plannodes.h | 7 + src/include/optimizer/planmain.h | 2 +- src/include/optimizer/tlist.h | 1 + src/include/partitioning/partbounds.h | 1 + src/include/utils/lsyscache.h | 1 + src/include/utils/pg_locale.h | 1 + .../regress/expected/collate.icu.utf8.out | 341 +- .../regress/expected/collate.linux.utf8.out | 25 +- src/test/regress/sql/collate.icu.utf8.sql | 115 ++ src/test/regress/sql/collate.linux.utf8.sql | 8 + 54 files changed, 928 insertions(+), 111 deletions(-) diff --git a/contrib/bloom/bloom.h b/contrib/bloom/bloom.h index 24200fb5fa..c4672f6853 100644 --- a/contrib/bloom/bloom.h +++ b/contrib/bloom/bloom.h @@ -137,6 +137,7 @@ typedef struct BloomMetaPageData typedef struct BloomState { FmgrInfohashFn[INDEX_MAX_KEYS]; + Oid 
collations[INDEX_MAX_KEYS]; BloomOptions opts; /* copy of options on index's metapage */ int32 nColumns; diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c index 6458376578..d078dfbd46 100644 --- a/contrib/bloom/blutils.c +++ b/contrib/bloom/blutils.c @@ -163,6 +163,7 @@ initBloomState(BloomState *state, Relation index) fmgr_info_copy(&(state->hashFn[i]), index_getprocinfo(index, i + 1, BLOOM_HASH_PROC),
Re: [HACKERS] COPY FREEZE and PD_ALL_VISIBLE
Hello, Today I bumped into the need to limit the time of the first VACUUM after a data import. I'm using a utility called osmium together with COPY FREEZE to import OpenStreetMap data into the database. osmium export -c osmium.config -f pg belarus-latest.osm.pbf -v --progress | psql -1 -c 'create table byosm(geom geometry, osm_type text, osm_id bigint, tags jsonb);copy byosm from stdin freeze;' However, the first pass of VACUUM rewrites the whole table. Here are two logs of VACUUM VERBOSE in a row: https://gist.github.com/Komzpa/e765c1c5e04623d83a6263d4833cf3a5 In the Russian Postgres Telegram group I was pointed to this thread. Can the patch be revived? What is needed to get it into 12? On Sun, Aug 14, 2016 at 10:37 PM Jeff Janes wrote: > On Tue, Nov 3, 2015 at 6:37 AM, Simon Riggs wrote: > > On 3 November 2015 at 15:23, Amit Kapila > wrote: > >> > >> On Fri, Oct 23, 2015 at 6:29 AM, Simon Riggs > >> wrote: > >>> > >>> On 21 October 2015 at 13:31, Jeff Janes wrote: > >>> > Index-only scans will visit the heap for each tuple until the first > VACUUM is done. > > The first vacuum will read the entire table, but not need to write it > anymore. And will create the _vm file. > > I think we really want to create _vm file as well as set > PD_ALL_VISIBLE, > but I don't know the best way to do that. Set a flag somewhere and > then > create it in bulk at the end of the transaction? Set it bit by bit > as the > pages are extended and initialized? > >>> > >>> > >>> Easy enough to do it at the end of the COPY FREEZE in one step. > >> > >> > >> Here, we might want to consider that setting bit in visibility map > >> will generate WAL log whereas Copy Freeze otherwise skip WAL > >> when wal_level is less than archive. This can lead to extra disk > >> writes which can slow down Copy Freeze, but OTOH that might > >> be acceptable. > > > > > > I'm building the map as I go, in the latest version of this patch I'm > > working on. > > Hi Simon, > > Is this still on your radar? If you would like someone else to pick > it up, can you post the WIP patch you have? > > Thanks, > > Jeff > > > -- > Sent via pgsql-hackers mailing list (pgsql-hack...@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers > -- Darafei Praliaskouski Support me: http://patreon.com/komzpa
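A quick way to see the effect described above is the contrib pg_visibility extension (the table name below matches the osmium example; the exact numbers will of course vary):

CREATE EXTENSION IF NOT EXISTS pg_visibility;

SELECT count(*) AS pages,
       count(*) FILTER (WHERE all_visible) AS all_visible,
       count(*) FILTER (WHERE all_frozen)  AS all_frozen
FROM pg_visibility_map('byosm');

-- After COPY ... FREEZE, all_visible/all_frozen stay at 0 until the first
-- VACUUM, which therefore has to visit and dirty every page of the freshly
-- loaded table just to set PD_ALL_VISIBLE and the visibility map bits.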
Re: current_logfiles not following group access and instead follows log_file_mode permissions
I wrote: > Haribabu Kommi writes: >> Excluding the file in the backup can solve the problem of backup by an >> unprivileged user. Is there any scenarios it can cause problems if it >> doesn't follow the group access mode? > The point of this file, as I understood it, was to allow someone who's > allowed to read the log files to find out which one is the latest. It > makes zero sense for it to have different permissions from the log files, > because doing that would break its only use-case. On reflection, maybe the problem is not that we're giving the file the wrong permissions, but that we're putting it in the wrong place? That is, seems like it should be in the logfile directory not the data directory. That would certainly simplify the intended use-case, and it would fix this complaint too. regards, tom lane
Re: O_DIRECT for relations and SLRUs (Prototype)
On 1/15/19 11:28 AM, Michael Paquier wrote: On Tue, Jan 15, 2019 at 11:19:48AM +0500, Andrey Borodin wrote: Is it possible to avoid those memcpy's by aligning available buffers instead? I couldn't understand this from the patch and this thread. Sure, it had better do that. That's just a lazy implementation. Hi! Could you specify all the cases in which buffers will not be aligned with BLCKSZ? AFAICS shared and temp buffers are aligned. And which ones are not? -- Regards, Maksim Milyutin
Re: Pluggable Storage - Andres's take
Hi, On 2019-01-15 14:37:36 +0530, Amit Khandekar wrote: > Then for each of the calls, we would need to declare that structure > variable (with = {0}) and assign required fields in that structure > before passing it to ArchiveEntry(). But a major reason of > ArchiveEntry() is to avoid doing this and instead conveniently pass > those fields as parameters. This will cause unnecessary more lines of > code. I think better way is to have an ArchiveEntry() function with > limited number of parameters, and have an ArchiveEntryEx() with those > extra parameters which are not needed in usual cases. I don't think that'll really solve the problem. I think it might be more reasonable to rely on structs. Now that we can rely on designated initializers for structs we can do something like ArchiveEntry((ArchiveArgs){.tablespace = 3, .dumpFn = somefunc, ...}); and unused arguments will automatically initialized to zero. Or we could pass the struct as a pointer, might be more efficient (although I doubt it matters here): ArchiveEntry(&(ArchiveArgs){.tablespace = 3, .dumpFn = somefunc, ...}); What do others think? It'd probably be a good idea to start a new thread about this. Greetings, Andres Freund
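For completeness, here is a tiny self-contained illustration of the designated-initializer pattern; the ArchiveArgs struct and its fields are hypothetical stand-ins, not the real pg_dump definitions:

#include <stdio.h>

typedef struct ArchiveArgs
{
	const char *tag;
	const char *namespace;
	const char *tablespace;		/* rarely used; left NULL by default */
	const char *owner;
	int			nDeps;			/* left 0 by default */
} ArchiveArgs;

static void
ArchiveEntry(ArchiveArgs *args)
{
	printf("tag=%s tablespace=%s nDeps=%d\n",
		   args->tag,
		   args->tablespace ? args->tablespace : "(none)",
		   args->nDeps);
}

int
main(void)
{
	/* fields that are not named are implicitly zero/NULL */
	ArchiveEntry(&(ArchiveArgs) {.tag = "COMMENT", .owner = "postgres"});
	ArchiveEntry(&(ArchiveArgs) {.tag = "TABLE", .tablespace = "ts1", .nDeps = 2});
	return 0;
}

Call sites only spell out the arguments they actually need, which is what keeps future additions to the struct from touching every caller.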
Re: Log a sample of transactions
On 1/15/19 11:42 AM, Masahiko Sawada wrote: When you troubleshoot applicative issues with multi-statements transaction, you may have to log all queries to find all statements of one transaction. With high throughput, it could be hard to log all queries without causing troubles. Hm, can we use log_min_duration_statement to find slow queries of a transaction instead? Could you please elaborate on the use-case? Hello, The goal is not to find slow queries in a transaction, but to troubleshoot application issues when the queries are short. Sometimes you want to understand what happens in a transaction: either you know your application perfectly, or you have to log all queries and find the ones with the same transaction ID (%x). That can be problematic if you have heavy traffic with fast queries. Thanks,
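In postgresql.conf terms, the intended setup looks roughly like this (log_transaction_sample_rate is the parameter added by the patch in this thread, so it only exists with the patch applied):

log_transaction_sample_rate = 0.01      # log every statement of ~1% of transactions
log_line_prefix = '%m [%p] %x '         # %x = transaction ID, used to group the statements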
Re: Proving IS NOT NULL inference for ScalarArrayOpExpr's
James Coleman writes: > I'm attaching the current version of the patch with the new tests, but > I'm not sure I understand the *_holds results mentioned above. Are > they an artifact of the data under test? Or do they suggest a > remaining bug? I'm a bit fuzzy on what to expect for *_holds though I > understand the requirements for strongly/weakly_implied/refuted_by. IIRC, the "foo_holds" outputs mean "the foo relationship appears to hold for these expressions across all tested inputs", for instance s_i_holds means "there were no tested cases where A is true and B is not true". The implied_by/refuted_by outputs mean "predtest.c claims it can prove this". Obviously, a claimed proof for a relationship that fails experimentally would be big trouble. The other way around just means that either the proof rules are inadequate to prove this particular case, or the set of test values for the expressions don't expose the fact that the relationship doesn't hold. Now, if we *expected* that predtest.c should be able to prove something, failure to do so would be a bug, but it's not a bug if we know it's incapable of making that proof. The second explanation might represent a bug in the test case. I added the foo_holds columns just as a sort of cross-check on the test cases themselves, they don't test anything about the predtest code. regards, tom lane
Re: [PATCH] Allow UNLISTEN during recovery
Mitar, thanks for giving this your attention! So patch looks good to me, but documentation changes are missing from it. > UNLISTEN should be removed from the list of commands not allowed and moved > to the list of those which are allowed [1]. Moreover, > src/test/regress/sql/hs_standby_allowed.sql and > src/test/regress/sql/hs_standby_disallowed.sql tests should be updated > based on these changes in the patch. What is surprising to me is that make > check-world does not fail with this change, but there is an explicit check > for UNLISTEN *. So does this mean those tests are not run or does it mean > that this patch does not fix the problem? > I've made the requested changes to the docs and to the regression tests. I think that specifically the standby regression tests do not get executed by check-world - see section https://www.postgresql.org/docs/current/regress-run.html#id-1.6.20.5.8. I'm guessing this should be executed as part of the build verification pipeline for patches, but I don't know anything about the PostgreSQL build infrastructure. Unfortunately I'm extremely tight for time at the moment and don't have time to do the appropriate hot standby setup to test this... As the patch is pretty straightforward, and since I'm hoping you guys execute the tests on your end, I hope that's acceptable. If it's absolutely necessary for me to test the patch locally, let me know and I'll do so. From 8c816354e820bf3d0be69d55dbf0052b1d27feeb Mon Sep 17 00:00:00 2001 From: Shay Rojansky Date: Tue, 15 Jan 2019 18:49:40 +0100 Subject: [PATCH] Allow unlisten when in hot standby:wq --- doc/src/sgml/high-availability.sgml| 16 ++-- src/backend/tcop/utility.c | 2 +- src/test/regress/sql/hs_standby_allowed.sql| 4 src/test/regress/sql/hs_standby_disallowed.sql | 2 -- 4 files changed, 15 insertions(+), 9 deletions(-) diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml index 4882b20828..bb1c86f73e 100644 --- a/doc/src/sgml/high-availability.sgml +++ b/doc/src/sgml/high-availability.sgml @@ -1767,6 +1767,11 @@ if (!triggered) Plugins and extensions - LOAD + + + UNLISTEN + + @@ -1856,18 +1861,17 @@ if (!triggered) - LISTEN, UNLISTEN, NOTIFY + LISTEN, NOTIFY -In normal operation, read-only transactions are allowed to -use LISTEN, UNLISTEN, and -NOTIFY, so Hot Standby sessions operate under slightly tighter -restrictions than ordinary read-only sessions. It is possible that some -of these restrictions might be loosened in a future release. +In normal operation, read-only transactions are allowed to use +LISTEN, and NOTIFY, so Hot Standby sessions +operate under slightly tighter restrictions than ordinary read-only sessions. +It is possible that some of these restrictions might be loosened in a future release. 
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 27ae6be751..44136060d6 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -629,7 +629,7 @@ standard_ProcessUtility(PlannedStmt *pstmt,
 			{
 				UnlistenStmt *stmt = (UnlistenStmt *) parsetree;
 
-				PreventCommandDuringRecovery("UNLISTEN");
+				/* allow UNLISTEN during recovery, which is a noop */
 				CheckRestrictedOperation("UNLISTEN");
 				if (stmt->conditionname)
 					Async_Unlisten(stmt->conditionname);
diff --git a/src/test/regress/sql/hs_standby_allowed.sql b/src/test/regress/sql/hs_standby_allowed.sql
index a33199dbbd..1f1cdf5a00 100644
--- a/src/test/regress/sql/hs_standby_allowed.sql
+++ b/src/test/regress/sql/hs_standby_allowed.sql
@@ -110,6 +110,10 @@ LOCK hs1 IN ROW SHARE MODE;
 LOCK hs1 IN ROW EXCLUSIVE MODE;
 COMMIT;
 
+-- UNLISTEN
+unlisten a;
+unlisten *;
+
 -- LOAD
 -- should work, easier if there is no test for that...
 
diff --git a/src/test/regress/sql/hs_standby_disallowed.sql b/src/test/regress/sql/hs_standby_disallowed.sql
index 21bbf526b7..a470600eec 100644
--- a/src/test/regress/sql/hs_standby_disallowed.sql
+++ b/src/test/regress/sql/hs_standby_disallowed.sql
@@ -88,8 +88,6 @@ COMMIT;
 -- Listen
 listen a;
 notify a;
-unlisten a;
-unlisten *;
 
 -- disallowed commands
-- 
2.19.1
Re: [PATCH] Allow UNLISTEN during recovery
Hi! On Tue, Jan 15, 2019 at 10:17 AM Shay Rojansky wrote: > Unfortunately I'm extremely tight for time at the moment and don't have time > to do the appropriate hot standby setup to test this... As the patch is > pretty straightforward, and since I'm hoping you guys execute the tests on > your end, I hope that's acceptable. If it's absolutely necessary for me to > test the patch locally, let me know and I'll do so. I would definitely prefer it if you could run the tests. You might discover that your patch does not sufficiently address the tests. Please do so and confirm here that it works for you, and then I can do another review. Mitar -- http://mitar.tnode.com/ https://twitter.com/mitar_m
Re: Protect syscache from bloating with negative cache entries
"Tsunakawa, Takayuki" writes: > But the syscache/relcache bloat still remains a problem, when there are many > live tables and application connections. Would you agree to solve this in > some way? I thought Horiguchi-san's latest patches would solve this and the > negative entries. Can we consider that his patch and yours are orthogonal, > i.e., we can pursue Horiguchi-san's patch after yours is committed? Certainly, what I've done here doesn't preclude adding some wider solution to the issue of extremely large catcaches. I think it takes the pressure off for one rather narrow problem case, and the mechanism could be used to fix other ones. But if you've got an application that just plain accesses a huge number of objects, this isn't going to make your life better. > (As you said, some parts of Horiguchi-san's patches may be made simpler. For > example, the ability to change another session's GUC variable can be > discussed in a separate thread.) Yeah, that idea seems just bad from here ... > I think we need some limit to the size of the relcache, syscache, and > plancache. Oracle and MySQL both have it, using LRU to evict less frequently > used entries. You seem to be concerned about the LRU management based on > your experience, but would it really cost so much as long as each postgres > process can change the LRU list without coordination with other backends now? > Could you share your experience? Well, we *had* an LRU mechanism for the catcaches way back when. We got rid of it --- see commit 8b9bc234a --- because (a) maintaining the LRU info was expensive and (b) performance fell off a cliff in scenarios where the cache size limit was exceeded. You could probably find some more info about that by scanning the mail list archives from around the time of that commit, but I'm too lazy to do so right now. That was a dozen years ago, and it's possible that machine performance has moved so much since then that the problems are gone or mitigated. In particular I'm sure that any limit we would want to impose today will be far more than the 5000-entries-across-all-caches limit that was in use back then. But I'm not convinced that a workload that would create 100K cache entries in the first place wouldn't have severe problems if you tried to constrain it to use only 80K entries. I fear it's just wishful thinking to imagine that the behavior of a larger cache won't be just like a smaller one. Also, IIRC some of the problem with the LRU code was that it resulted in lots of touches of unrelated data, leading to CPU cache miss problems. It's hard to see how that doesn't get even worse with a bigger cache. As far as the relcache goes, we've never had a limit on that, but there are enough routine causes of relcache flushes --- autovacuum for instance --- that I'm not really convinced relcache bloat can be a big problem in production. The plancache has never had a limit either, which is a design choice that was strongly influenced by our experience with catcaches. Again, I'm concerned about the costs of adding a management layer, and the likelihood that cache flushes will simply remove entries we'll soon have to rebuild. > FYI, Oracle provides one parameter, shared_pool_size, that determine the > size of a memory area that contains SQL plans and various dictionary > objects. Oracle decides how to divide the area among constituents. So > it could be possible that one component (e.g. table/index metadata) is > short of space, and another (e.g. SQL plans) has free space. 
Oracle > provides a system view to see the free space and hit/miss of each > component. If one component suffers from memory shortage, the user > increases shared_pool_size. This is similar to what Horiguchi-san is > proposing. Oracle seldom impresses me as having designs we ought to follow. They have a well-earned reputation for requiring a lot of expertise to operate, which is not the direction this project should be going in. In particular, I don't want to "solve" cache size issues by exposing a bunch of knobs that most users won't know how to twiddle. regards, tom lane
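Purely for illustration, here is a minimal sketch of the per-hit bookkeeping that an LRU'd cache implies, using the dlist primitives from lib/ilist.h; the MyCache structures and the removal helper are made up. The touch on every hit, and the pointer chasing it causes, is roughly where the "maintaining the LRU info was expensive" cost comes from:

    /* every cache hit must move the entry to the head of a recency list */
    static inline void
    cache_touch(MyCache *cache, MyCacheEntry *entry)
    {
        dlist_delete(&entry->lru_node);
        dlist_push_head(&cache->lru_list, &entry->lru_node);
    }

    /* over the limit?  evict from the tail, i.e. the least recently used entry */
    static void
    cache_enforce_limit(MyCache *cache)
    {
        while (cache->nentries > cache->size_limit)
        {
            MyCacheEntry *victim = dlist_tail_element(MyCacheEntry, lru_node,
                                                      &cache->lru_list);

            cache_remove_entry(cache, victim);      /* hypothetical removal helper */
        }
    }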
Re: Protect syscache from bloating with negative cache entries
Hi, On 2019-01-15 13:32:36 -0500, Tom Lane wrote: > Well, we *had* an LRU mechanism for the catcaches way back when. We got > rid of it --- see commit 8b9bc234a --- because (a) maintaining the LRU > info was expensive and (b) performance fell off a cliff in scenarios where > the cache size limit was exceeded. You could probably find some more info > about that by scanning the mail list archives from around the time of that > commit, but I'm too lazy to do so right now. > > That was a dozen years ago, and it's possible that machine performance > has moved so much since then that the problems are gone or mitigated. > In particular I'm sure that any limit we would want to impose today will > be far more than the 5000-entries-across-all-caches limit that was in use > back then. But I'm not convinced that a workload that would create 100K > cache entries in the first place wouldn't have severe problems if you > tried to constrain it to use only 80K entries. I think that'd be true if the accesses were truly randomly distributed - but that's not the case in the cases where I've seen huge caches. It's usually workloads that have tons of functions, partitions, ... and a lot of them are not that frequently accessed, but because we have no cache purging mechanism they stay around for a long time. This is often exacerbated by using a pooler to keep connections around for longer (which you have to, to cope with other limits of PG). > As far as the relcache goes, we've never had a limit on that, but there > are enough routine causes of relcache flushes --- autovacuum for instance > --- that I'm not really convinced relcache bloat can be a big problem in > production. It definitely is. > The plancache has never had a limit either, which is a design choice that > was strongly influenced by our experience with catcaches. This sounds a lot like having learned lessons from one bad implementation and applying them far outside of that situation. Greetings, Andres Freund
Re: Protect syscache from bloating with negative cache entries
On Tue, Jan 15, 2019 at 01:32:36PM -0500, Tom Lane wrote: > ... > > FYI, Oracle provides one parameter, shared_pool_size, that determine the > > size of a memory area that contains SQL plans and various dictionary > > objects. Oracle decides how to divide the area among constituents. So > > it could be possible that one component (e.g. table/index metadata) is > > short of space, and another (e.g. SQL plans) has free space. Oracle > > provides a system view to see the free space and hit/miss of each > > component. If one component suffers from memory shortage, the user > > increases shared_pool_size. This is similar to what Horiguchi-san is > > proposing. > > Oracle seldom impresses me as having designs we ought to follow. > They have a well-earned reputation for requiring a lot of expertise to > operate, which is not the direction this project should be going in. > In particular, I don't want to "solve" cache size issues by exposing > a bunch of knobs that most users won't know how to twiddle. > > regards, tom lane +1 Regards, Ken
Re: MERGE SQL statement for PG12
On Mon, Jan 14, 2019 at 1:37 AM Pavan Deolasee wrote: > Can you please help me understand what's fundamentally wrong with the > approach and more importantly, can you please explain what would the the > architecturally sound way to do this? The same also applies to the executor > side where the current approach is deemed wrong, but very little is said on > what's the correct way. I think the basic problem is that optimization should happen in the optimizer, not the parser. Deciding that MERGE looks like a RIGHT JOIN is not a parsing task and therefore shouldn't happen in the parser. The guys who were (are?) proposing support for ALIGN and NORMALIZE for temporal joins got the same complaint. Here's Tom talking about that: http://postgr.es/m/32265.1510681...@sss.pgh.pa.us I realize that you may have worked around some of the specific issues that Tom mentions in that email, but I believe the general point remains valid. What parse analysis should do is construct a representation of the query the user entered, not some other query derived from it. And this code clearly does the latter. You can see by the fact that it uses QSRC_PARSER, for example, a constant that is currently unused (and for good reason). And you can see that it results in hacks like the cross-check to make sure that resultRelRTE->relid == mergeRelRTE->relid. You need that because you are manipulating and transforming the query too early in the pipeline. If you do the transformation in the optimizer, you won't need stuff like this. There are other hints in this function that you're really doing planning tasks: + * The MERGE query's join can be tuned in some cases, see below for these + * special case tweaks. Tuning the query is otherwise known as optimization. Do it in the optimizer. + * Simplify the MERGE query as much as possible + * + * These seem like things that could go into Optimizer, but they are + * semantic simplifications rather than optimizations, per se. Actually, they are optimizations. The fact that an outer join may be more expensive than an inner join is a costing consideration, not something that the parser needs to know about. + /* + * This query should just provide the source relation columns. Later, in + * preprocess_targetlist(), we shall also add "ctid" attribute of the + * target relation to ensure that the target tuple can be fetched + * correctly. + */ This is not a problem unique to MERGE. It comes up for UPDATE and DELETE as well. The handling is not all one place, but it's all in the optimizer. The place which actually adds the CTID column is in preprocess_targetlist(), but note that it's much smarter than the code in the MERGE patch, because it can do different things depending on which type of rowmark is requested. The code as it exists in the MERGE patch can't possibly work for anything that doesn't have a CTID table, which means it won't work for FDWs and, in the future, any new heap types added via pluggable storage unless they happen to support CTID. Correctly written, MERGE will leverage what the optimizer already knows about rowmarks rather than reinventing them in the parse analysis phase. You should redesign this whole representation so that you just pass the whole query through to the optimizer without any structural change. 
Just as we do for other statements, you need to do the basic transformation stuff like looking up relation OIDs: that has to do with what the query MEANS and the external SQL objects on which it DEPENDS; those things have to be figured out before planning so that things like the plan-cache work correctly. But also just as we do for other statements, you should avoid having any code in this function that has to do with the way in which the MERGE should ultimately be executed, and you should not have any code here that builds a new query or does heavy restructuring of the parse tree. Since it looks like we will initially have only one strategy for executing MERGE, it might be OK to move the transformation that you're currently doing in transformMergeStmt() to some early phase of the optimizer. Here's me making approximately the same point about ALIGN and NORMALIZE: https://www.postgresql.org/message-id/CA+TgmoZ=UkRzpisqK5Qox_ekLG+SQP=xBeFiDkXTgLF_=1f...@mail.gmail.com However, I'm not sure that's actually the right way to go. Something, possibly this transformation, is causing us to end up with two copies of the target relation. This comment is a pretty good clue that this is nothing but an ugly kludge: + * While executing MERGE, the target relation is processed twice; once + * as a target relation and once to run a join between the target and the + * source. We generate two different RTEs for these two purposes, one with + * rte->inh set to false and other with rte->inh set to true. + * + * Since the plan re-evaluated by EvalPlanQual uses the join RTE, we must + * install the updated tuple in the scan corresponding to that RTE. The + * following me
Re: MERGE SQL statement for PG12
Hi, On 2019-01-15 14:05:25 -0500, Robert Haas wrote: > On Mon, Jan 14, 2019 at 1:37 AM Pavan Deolasee > wrote: > > Can you please help me understand what's fundamentally wrong with > > the approach and more importantly, can you please explain what would > > the the architecturally sound way to do this? The same also applies > > to the executor side where the current approach is deemed wrong, but > > very little is said on what's the correct way. > > [ Long and good explanation by Robert ] > I want to point out that it is not as if nobody has reviewed this > patch previously. Here is Andres making basically the same point > about parse analysis that I'm making here -- FWIW, I didn't find his > reply until after I'd written the above: > > https://www.postgresql.org/message-id/20180403021800.b5nsgiclzanobiup%40alap3.anarazel.de > > Here he is explaining these points some more: > > https://www.postgresql.org/message-id/2018040523.gar3j26gsk32gqpe%40alap3.anarazel.de > > And here's Peter Geoghegan not only explaining the same problem but > having a go at fixing it: > > https://www.postgresql.org/message-id/CAH2-Wz%3DZwNQvp11XjeHo-dBLHr9GDRi1vao8w1j7LQ8mOsUkzw%40mail.gmail.com > > Actually, I see that Peter Geoghegan not just the emails above but a > blizzard of other emails explaining the structural problems with the > patch, which I now see include not only the parse analysis concerns > but also the business of multiple RTEs which I mentioned above. + many. Pavan, I think the reason you're not getting much further feedback is that you and Simon have gotten a lot and only incorporated feedback only very grudgingly, if at all. You've not even attempted to sketch out a move of the main merge handling from parse-analysis to planner, as far as I can tell, despite this being one of the main criticisms for about a year. Given that not much besides rebasing has happened since v11, I don't find it surprising that people don't want to invest more energy in this patch. Greetings, Andres Freund
Re: Ryu floating point output patch
Andrew Gierth writes: > Funny thing: I've been devoting considerable effort to testing this, and > the one failure mode I've found is very interesting; it's not a problem > with strtod(), in fact it's a bug in our float4in caused by _misuse_ of > strtod(). > In particular, float4in thinks it's ok to do strtod() on the input, and > then round the result to a float. Hm. I'm not sure how much weight to put on this example, since we don't guarantee that float4 values will round-trip with less than 9 digits of printed precision. This example corresponds to extra_float_digits = 1 which doesn't meet that minimum. However, I agree that it's rounding the presented input in the less desirable direction. > strtof() exists as part of the relevant standards, so float4in should be > using that in place of strtod, and that seems to fix this case for me. To clarify that: strtof() is defined in C99 but not older standards (C89 and SUSv2 lack it). While we've agreed that we'll require C99 *compilers* going forward, I'm less enthused about supposing that the platform's libc is up to date, not least because replacing libc on an old box is orders of magnitude harder than installing a nondefault version of gcc. As a concrete example, gaur lacks strtof(), and I'd be pretty much forced to retire it if we start requiring that function. Putting a more modern gcc on that machine wasn't too painful, but I'm not messing with its libc. But I'd not object to adding a configure check and teaching float4in to use strtof() if available. regards, tom lane
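For what it's worth, a minimal sketch of that shape; the HAVE_STRTOF symbol and the fallback macro are assumptions about how a configure probe might be wired up, not an actual patch:

    #include <stdlib.h>

    /* assumed to be defined by a configure probe for strtof() */
    #ifndef HAVE_STRTOF
    /*
     * Pre-C99 libc: fall back to rounding through double, accepting the
     * double-rounding artifact discussed above.
     */
    #define strtof(str, endptr) ((float) strtod((str), (endptr)))
    #endif

    /* float4in could then parse in single precision directly, for example: */
    static float
    parse_float4(const char *num, char **endptr)
    {
        return strtof(num, endptr);
    }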
Re: Libpq support to connect to standby server as priority
On Mon, Jan 14, 2019 at 5:17 PM Tom Lane wrote: > The problem here of course is that whoever invented target_session_attrs > was unconcerned with following that precedent, so what we have is > "target_session_attrs=(any | read-write)". > Are we prepared to add some aliases in service of unifying these names? I wasn't unconcerned about the problem, but I wasn't prepared to to be the first person who added a connection parameter that used namesLikeThis instead of names_like_this, especially if the semantics weren't exactly the same. That seemed to be a recipe for somebody yelling at me, and I try to avoid that when I can. > 4. Given that other discussion, it's not quite clear what we should > even be checking. The existing logic devolves to checking that > transaction_read_only is true, but that's not really the same thing as > "is a master server", eg you might have connected to a master server > under a role that has SET ROLE default_transaction_read_only = false. > (I wonder what pgJDBC is really checking, under the hood.) > Do we want to have modes that are checking hot-standby state in some > fashion, rather than the transaction_read_only state? Well, this has been discussed before, too, I'm pretty sure, but I'm too lazy to go find the old discussion right now. The upshot is that default_transaction_read_only lets an administrator make a server look read-only even if it technically isn't, which somebody might find useful. Otherwise what do you do if, for example, you are using logical replication? None of your servers are in recovery, but you can make some of them report default_transaction_read_only = true if you like. To me, that kind of configurability is a feature, not a bug. That being said, I don't object to having even more values for target_session_attrs that check other things. You could have: read_only: default_transaction_read_only => true read_write: default_transaction_read_only => false master: pg_is_in_recovery => false standby: pg_is_in_recovery => true But what I think would be a Very Bad Plan is to use confused naming that looks for something different than what it purports to do. For example, if you were to change things so that read_write checks pg_is_in_recovery(), then you might ask for a "read-write" server and get one where only read-only transactions are permitted. We need not assume that "read-write master" and "read-only standby" are the only two kinds of things that can ever exist, as long as we're careful about the names we choose. Choosing the names carefully also helps to avoid POLA violations. Another point I'd like to mention is that target_session_attrs could be extended to care about other kinds of properties which someone might want a server to have, quite apart from master/standby/read-only/read-write. I don't know exactly what sort of thing somebody might care about, but the name is such that we can decide to care about other properties in the future without having to add a whole new parameter. You can imagine a day when someone can say target_session_attrs=read-write,v42+,ftl to get a server connection that is read-write on a server running PostgreSQL 42 or greater that also has a built-in hyperdrive. Or whatever. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
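For reference, the existing parameter already composes with libpq's multiple-host syntax; the additional values floated above (read_only, master, standby, or attribute lists such as "read-write,v42+,ftl") are hypothetical. A minimal example using today's syntax only:

    #include <stdio.h>
    #include <libpq-fe.h>

    int
    main(void)
    {
        /* try each listed host until one that allows read-write sessions is found */
        PGconn *conn = PQconnectdb("host=pg1,pg2 port=5432 dbname=app "
                                   "target_session_attrs=read-write");

        if (PQstatus(conn) != CONNECTION_OK)
            fprintf(stderr, "no read-write server available: %s", PQerrorMessage(conn));

        PQfinish(conn);
        return 0;
    }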
Re: Connection slots reserved for replication
On Wed, Jan 2, 2019 at 6:36 AM Alexander Kukushkin wrote: > I was also thinking about changing the value in PG_CONTROL_VERSION, > because we added the new field into the control file, but decided to > leave this change to committer. We typically omit catversion bumps from submitted patches to avoid unnecessary conflicts, but I think PG_CONTROL_VERSION doesn't change enough to cause a problem. Also, it's not date-dependent the way catversion is. So I would include the bump in the patch, if it were me. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: New vacuum option to do only freezing
On Mon, Jan 14, 2019 at 9:24 PM Masahiko Sawada wrote: > I think that because the tuples that got dead after heap_page_prune() > looked are recorded but not removed without lazy_vacuum_page() we need > to process them in lazy_vacuum_page(). For decision about whether to > truncate we should not change it, so I will fix it. It should be an > another option to control whether to truncate if we want. I think that heap_page_prune() is going to truncate dead tuples to dead line pointers. At that point the tuple is gone. The only remaining step is to mark the dead line pointer as unused, which can't be done without scanning the indexes. So I don't understand what lazy_vacuum_page() would need to do in this case. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
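To illustrate the states involved, a hypothetical sketch (PageGetItemId(), ItemIdIsDead() and ItemIdSetUnused() are the real macros from bufpage.h/itemid.h; the surrounding code is not the patch's code):

    /* after heap_page_prune(), a pruned tuple survives only as a dead line pointer */
    ItemId      itemid = PageGetItemId(page, offnum);

    if (ItemIdIsDead(itemid))
    {
        /*
         * The tuple storage is already gone.  The only remaining step is to
         * mark the pointer unused, which is safe only after the indexes have
         * been scanned, so that no index entries still reference it.
         */
        ItemIdSetUnused(itemid);
    }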
Re: MERGE SQL statement for PG12
Robert Haas writes: > You should redesign this whole representation so that you just pass > the whole query through to the optimizer without any structural > change. Just as we do for other statements, you need to do the basic > transformation stuff like looking up relation OIDs: that has to do > with what the query MEANS and the external SQL objects on which it > DEPENDS; those things have to be figured out before planning so that > things like the plan-cache work correctly. But also just as we do for > other statements, you should avoid having any code in this function > that has to do with the way in which the MERGE should ultimately be > executed, and you should not have any code here that builds a new > query or does heavy restructuring of the parse tree. Just to comment on that: IMV, the primary task of parse analysis is exactly to figure out which database objects the query references or depends on. We must know that much so that we can store proper dependency information if the query goes into a view or rule. But as soon as you add more processing than that, you are going to run into problems of at least two kinds: * The further the parsetree diverges from just representing what was entered, the harder the task of ruleutils.c to reconstruct something that looks like what was entered. * The more information you absorb about the referenced objects, the more likely it is that a stored view/rule will be broken by subsequent ALTER commands. This connects to Robert's complaint about the parse transformation depending on CTID, though I don't think it's quite the same thing. We want to minimize the dependency surface of stored rules -- for example, it's okay for a stored view to depend on the signature (argument and return data types) of a function, but not for it to depend on, say, the provolatile property. If we allowed that then we'd have to forbid ALTER FUNCTION from changing the volatility. I've not looked at this patch closely enough to know whether it moves the goalposts in terms of what a dependency-on-a- table means, but judging from Robert's and Andres' reviews, I'm quite worried that it does. Delaying transformations to the rewrite or plan phases avoids these problems because now you are past the point where the query could be stored to or fetched from a rule. regards, tom lane
Re: Why are we PageInit'ing buffers in RelationAddExtraBlocks()?
Hi, On 2018-12-20 15:04:11 -0800, Andres Freund wrote: > On 2018-12-19 16:56:36 -0800, Andres Freund wrote: > > On 2018-12-19 19:39:33 -0500, Tom Lane wrote: > > > Robert Haas writes: > > > > On Wed, Dec 19, 2018 at 5:37 PM Andres Freund > > > > wrote: > > > >> What's gained by the logic of emitting that warning in VACUUM after a > > > >> crash? I don't really see any robustness advantages in it. > > > > > > > I don't know. I am just normally reluctant to change things > > > > precipitously that are of long tenure. > > > > > > Me too, but I think Andres has a point here. Those warnings in VACUUM > > > are ancient, probably predating the introduction of WAL :-(. At the > > > time there was good reason to be suspicious of zeroed pages in tables. > > > Now, though, we have (what we think is) a bulletproof crash recovery > > > procedure in which possibly-zeroed pages are to be expected; so we're > > > just causing users unnecessary alarm by warning about them. I think > > > it's reasonable to, if not remove the messages entirely, at least > > > downgrade them to a low DEBUG level. > > > > Yea, I think that'd be reasonable. > > > > I think we ought to add them, as new/zero pages, to the FSM. If we did > > so both during vacuum and RelationAddExtraBlocks() we'd avoid the > > redundant writes, and avoid depending on heap layout in > > RelationAddExtraBlocks(). > > > > I wonder if it'd make sense to only log a DEBUG if there's no > > corresponding FSM entry. That'd obviously not be bulltproof, but it'd > > reduce the spammyness considerably. > > Here's a prototype patch implementing this change. I'm not sure the new > DEBUG message is really worth it? > > > Looking at the surrounding code made me wonder about the wisdom of > entering empty pages as all-visible and all-frozen into the VM. That'll > mean we'll never re-discover them on a primary, after promotion. There's > no mechanism to add such pages to the FSM on a standby (in contrast to > e.g. pages where tuples are modified), as vacuum will never visit that > page again. Now obviously it has the advantage of avoiding > re-processing the page in the next vacuum, but is that really an > important goal? If there's frequent vacuums, there got to be a fair > amount of modifications, which ought to lead to re-use of such pages at > a not too far away point? Any comments on the approach in this patch? Greetings, Andres Freund
Re: Ryu floating point output patch
> "Tom" == Tom Lane writes: On a slightly unrelated note, is the small-is-zero variant output file for the float tests still required? -- Andrew (irc:RhodiumToad)
Re: Proving IS NOT NULL inference for ScalarArrayOpExpr's
James Coleman writes: > [ saop_is_not_null-v6.patch ] I quite dislike taking the responsibility out of clause_is_strict_for and putting it in the callers. Aside from requiring duplicative code, it means that this fails to prove anything for recursive situations (i.e., where the ScalarArrayOp appears one or more levels down in a clause examined by clause_is_strict_for). If the behavior needs to vary depending on proof rule, which it looks like it does, the way to handle that is to add flag argument(s) to clause_is_strict_for. I'm also not happy about the lack of comments justifying the proof rules -- eg, it's far from obvious why you need to restrict one case to !weak but not the other. regards, tom lane
Re: Proving IS NOT NULL inference for ScalarArrayOpExpr's
On Wed, 16 Jan 2019 at 03:33, James Coleman wrote: > > 2. I was also staring predicate_implied_by_simple_clause() a bit at > > the use of clause_is_strict_for() to ensure that the IS NOT NULL's > > operand matches the ScalarArrayOpExpr left operand. Since > > clause_is_strict_for() = "Can we prove that "clause" returns NULL if > > "subexpr" does?", in this case, your clause is the ScalarArrayOpExpr's > > left operand and subexpr is the IS NOT NULL's operand. This means > > that a partial index with "WHERE a IS NOT NULL" should also be fine to > > use for WHERE strict_func(a) IN (1,2,..., 101); since strict_func(a) > > must be NULL if a is NULL. Also also works for WHERE a+a > > IN(1,2,...,101); I wonder if it's worth adding a test for that, or > > even just modify one of the existing tests to ensure you get the same > > result from it. Perhaps it's worth an additional test to ensure that x > > IN(1,2,...,101) does not imply x+x IS NOT NULL and maybe that x+x IS > > NULL does not refute x IN(1,2,...,101), as a strict function is free > > to return NULL even if it's input are not NULL. > > Are you suggesting a different test than clause_is_strict_for to > verify the saop LHS is the same as the null test's arg? I suppose we > could use "equal()" instead? I wasn't suggesting any code changes. I just thought the code was sufficiently hard to understand to warrant some additional tests that ensure we don't assume that if the int4 column x is not null that also x+x is not null. Only the reverse can be implied since int4pl is strict. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: pgsql: Update ssl test certificates and keys
On Fri, Jan 4, 2019 at 10:08 AM Thomas Munro wrote: > On Fri, Jan 4, 2019 at 3:36 AM Peter Eisentraut > wrote: > > On 23/12/2018 09:04, Michael Paquier wrote: > > > On Tue, Nov 27, 2018 at 02:21:39PM +, Peter Eisentraut wrote: > > >> Update ssl test certificates and keys > > >> > > >> Debian testing and newer now require that RSA and DHE keys are at > > >> least 2048 bit long and no longer allow SHA-1 for signatures in > > >> certificates. This is currently causing the ssl tests to fail there > > >> because the test certificates and keys have been created in violation > > >> of those conditions. > > >> > > >> Update the parameters to create the test files and create a new set of > > >> test files. > > > > > > Peter, would it make sense to back-patch this commit down to where the > > > SSL tests have been introduced? If /etc/ssl/ is not correctly > > > configured, this results in failures across branches on Debian if the > > > default is used. > > > > done > > Thanks. FWIW I've just updated eelpout (a Debian testing BF animal > that runs all the extra tests including SSL) to use libssl-dev > (instead of libssl1.0-dev), and cleared its accache files. Let's see > if that works... Since that upgrade (to libssl 1.1.1a-1), there are have been a few intermittent failures in the SSL tests on that animal (thanks Tom for pointing that out off-list). In a quick check, I was able to reproduce the failure after about 8 successful runs of "make check" under src/test/ssl on that machine. I couldn't immediately see what the problem was and I'm away from computers and work this week, so I'll have to investigate properly early next week. The main unusual thing about that animal is that it's an ARM CPU. FWIW I run that test by having this in build-farm.conf (I mention this in case someone wants to do the same on a Debian buster/testing x86 system to see if it has a similar problem, if there isn't one like that already): $ENV{PG_TEST_EXTRA} = "ssl ldap kerberos"; -- Thomas Munro http://www.enterprisedb.com
PSA: "tenk1" and other similar regression test tables are from the Wisconsin Benchmark
I recently purchased a copy of "The Benchmark Handbook", a book from the early 1990s that was edited by Jim Gray. It features analysis of the Wisconsin Benchmark in chapter 3 -- that's a single client benchmark that famously showed real limitations in the optimizers that were current in the early to mid 1980s. The book describes various limitations of Wisconsin as a general purpose benchmark, but it's still interesting in other ways, then and now. The book goes on to say that it is still often used in regression tests. I see that we even had a full copy of the benchmark until it was torn out by commit a05a4b47 in 2009. I don't think that anybody will be interested in the Benchmark itself, but the design of the benchmark may provide useful context. I could imagine somebody with an interest in the optimizer finding the book useful. I paid about $5 for a second hand copy of the first edition, so it isn't a hard purchase for me to justify. -- Peter Geoghegan
Re: Ryu floating point output patch
Andrew Gierth writes: > On a slightly unrelated note, is the small-is-zero variant output file > for the float tests still required? Hm, good question. I had been about to respond that it must be, but looking more closely at pg_regress.c I see that the default expected-file will be tried if the one fingered by resultmap isn't an exact match. So if everybody is matching float8.out these days, we wouldn't know it. The only one of the resultmap entries I'm in a position to try right now is x86 freebsd, and I can report that FreeBSD/x86 matches the default float8.out, and has done at least back to FreeBSD 8.4. It wouldn't be an unreasonable suspicion to guess that the other BSDen have likewise been brought up to speed to produce ERANGE on underflow, but I can't check it ATM. On the other hand, cygwin is probably pretty frozen-in-time, so it might still have the old behavior. Can anyone check that? regards, tom lane
Re: Why are we PageInit'ing buffers in RelationAddExtraBlocks()?
Andres Freund writes: >> Looking at the surrounding code made me wonder about the wisdom of >> entering empty pages as all-visible and all-frozen into the VM. That'll >> mean we'll never re-discover them on a primary, after promotion. There's >> no mechanism to add such pages to the FSM on a standby (in contrast to >> e.g. pages where tuples are modified), as vacuum will never visit that >> page again. Now obviously it has the advantage of avoiding >> re-processing the page in the next vacuum, but is that really an >> important goal? If there's frequent vacuums, there got to be a fair >> amount of modifications, which ought to lead to re-use of such pages at >> a not too far away point?

> Any comments on the approach in this patch?

I agree with the concept of postponing page init till we're actually going to do something with the page. However, the patch seems to need some polish:

* There's a comment in RelationAddExtraBlocks, just above where you changed, that

     * Extend by one page.  This should generally match the main-line
     * extension code in RelationGetBufferForTuple, except that we hold
     * the relation extension lock throughout.

This seems to be falsified by this patch, in that one of the two paths does PageInit and the other doesn't.

* s/unitialized pages/uninitialized pages/

* This bit in vacuumlazy seems unnecessarily confusing:

    +        Size        freespace = 0;
    ...
    +        if (GetRecordedFreeSpace(onerel, blkno) == 0)
    +            freespace = BufferGetPageSize(buf) - SizeOfPageHeaderData;
    +
    +        if (freespace > 0)
    +        {
    +            RecordPageWithFreeSpace(onerel, blkno, freespace);

I'd write that as just

    +        if (GetRecordedFreeSpace(onerel, blkno) == 0)
    +        {
    +            Size        freespace = BufferGetPageSize(buf) - SizeOfPageHeaderData;
    +
    +            RecordPageWithFreeSpace(onerel, blkno, freespace);

I tend to agree that the DEBUG message isn't very necessary, or at least could be lower than DEBUG1.

regards, tom lane
Re: [PROPOSAL] Shared Ispell dictionaries
Hello Arthur, I've looked at the patch today, and in general is seems quite solid to me. I do have a couple of minor points 1) I think the comments need more work. Instead of describing all the individual changes here, I've outlined those improvements in attached patches (see the attached "tweaks" patches). Some of it is formatting, minor rewording or larger changes. Some comments are rather redundant (e.g. the one before calls to release the DSM segment). 2) It's not quite clear to me why we need DictInitData, which simply combines DictPointerData and list of options. It seems as if the only point is to pass a single parameter to the init function, but is it worth it? Why not to get rid of DictInitData entirely and pass two parameters instead? 3) I find it a bit cumbersome that before each ts_dict_shmem_release call we construct a dummy DickPointerData value. Why not to pass individual parameters and construct the struct in the function? 4) The reference to max_shared_dictionaries_size is obsolete, because there's no such limit anymore. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services >From e76d34bcb1a84127f9b4402f0147642d77505cc2 Mon Sep 17 00:00:00 2001 From: Tomas Vondra Date: Tue, 15 Jan 2019 22:16:35 +0100 Subject: [PATCH 1/7] Fix ispell memory handling --- src/backend/tsearch/spell.c | 9 +++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/src/backend/tsearch/spell.c b/src/backend/tsearch/spell.c index eb39466b22..eb8416ce7f 100644 --- a/src/backend/tsearch/spell.c +++ b/src/backend/tsearch/spell.c @@ -78,6 +78,8 @@ #define tmpalloc(sz) MemoryContextAlloc(Conf->buildCxt, (sz)) #define tmpalloc0(sz) MemoryContextAllocZero(Conf->buildCxt, (sz)) +#define tmpstrdup(str) MemoryContextStrdup(Conf->buildCxt, (str)) + /* * Prepare for constructing an ISpell dictionary. * @@ -498,7 +500,7 @@ NIAddSpell(IspellDict *Conf, const char *word, const char *flag) Conf->Spell[Conf->nspell] = (SPELL *) tmpalloc(SPELLHDRSZ + strlen(word) + 1); strcpy(Conf->Spell[Conf->nspell]->word, word); Conf->Spell[Conf->nspell]->p.flag = (*flag != '\0') - ? cpstrdup(Conf, flag) : VoidString; + ? 
tmpstrdup(flag) : VoidString; Conf->nspell++; } @@ -1040,7 +1042,7 @@ setCompoundAffixFlagValue(IspellDict *Conf, CompoundAffixFlag *entry, entry->flag.i = i; } else - entry->flag.s = cpstrdup(Conf, s); + entry->flag.s = tmpstrdup(s); entry->flagMode = Conf->flagMode; entry->value = val; @@ -1541,6 +1543,9 @@ nextline: return; isnewformat: + pfree(recoded); + pfree(pstr); + if (oldformat) ereport(ERROR, (errcode(ERRCODE_CONFIG_FILE_ERROR), -- 2.17.2 >From dbb3cc3b7e7c560472cfa5efa77598f4992e0b70 Mon Sep 17 00:00:00 2001 From: Tomas Vondra Date: Tue, 15 Jan 2019 22:17:32 +0100 Subject: [PATCH 2/7] Change tmplinit argument --- contrib/dict_int/dict_int.c | 4 +- contrib/dict_xsyn/dict_xsyn.c| 4 +- contrib/unaccent/unaccent.c | 4 +- src/backend/commands/tsearchcmds.c | 10 +++- src/backend/snowball/dict_snowball.c | 4 +- src/backend/tsearch/dict_ispell.c| 4 +- src/backend/tsearch/dict_simple.c| 4 +- src/backend/tsearch/dict_synonym.c | 4 +- src/backend/tsearch/dict_thesaurus.c | 4 +- src/backend/utils/cache/ts_cache.c | 19 +++- src/include/tsearch/ts_cache.h | 4 ++ src/include/tsearch/ts_public.h | 69 ++-- 12 files changed, 113 insertions(+), 21 deletions(-) diff --git a/contrib/dict_int/dict_int.c b/contrib/dict_int/dict_int.c index 628b9769c3..ddde55eee4 100644 --- a/contrib/dict_int/dict_int.c +++ b/contrib/dict_int/dict_int.c @@ -30,7 +30,7 @@ PG_FUNCTION_INFO_V1(dintdict_lexize); Datum dintdict_init(PG_FUNCTION_ARGS) { - List *dictoptions = (List *) PG_GETARG_POINTER(0); + DictInitData *init_data = (DictInitData *) PG_GETARG_POINTER(0); DictInt*d; ListCell *l; @@ -38,7 +38,7 @@ dintdict_init(PG_FUNCTION_ARGS) d->maxlen = 6; d->rejectlong = false; - foreach(l, dictoptions) + foreach(l, init_data->dict_options) { DefElem*defel = (DefElem *) lfirst(l); diff --git a/contrib/dict_xsyn/dict_xsyn.c b/contrib/dict_xsyn/dict_xsyn.c index 509e14aee0..15b1a0033a 100644 --- a/contrib/dict_xsyn/dict_xsyn.c +++ b/contrib/dict_xsyn/dict_xsyn.c @@ -140,7 +140,7 @@ read_dictionary(DictSyn *d, const char *filename) Datum dxsyn_init(PG_FUNCTION_ARGS) { - List *dictoptions = (List *) PG_GETARG_POINTER(0); + DictInitData *init_data = (DictInitData *) PG_GETARG_POINTER(0); DictSyn*d; ListCell *l; char *filename = NULL; @@ -153,7 +153,7 @@ dxsyn_init(PG_FUNCTION_ARGS) d->matchsynonyms = false; d->keepsynonyms = true; - foreach(l, dictoptions) + foreach(l, init_data->dict_options) { DefElem*defel = (DefElem *) lfirst(l); diff --git a/contrib/unaccent/unaccent.c b/contrib/unaccent/unaccent.c index fc5176e338..f3663cef
RE: Log a sample of transactions
Dear Adrien, > Your messages looks normal to me, you will have same messages if you put > log_min_duration to 0. That was my lack of understanding. Sorry. By the way, I think the name of this parameter should be changed. In the Cambridge dictionary, the word "rate" means as follows: the speed at which something happens or changes, or the amount or number of times it happens or changes in a particular period. This parameter represents the "probability" that a transaction is logged, therefore this name is inappropriate. You should use "probability" or "ratio", I think. Next week I will give you some comments about your source code. Best Regards, Hayato Kuroda Fujitsu LIMITED
Re: Proving IS NOT NULL inference for ScalarArrayOpExpr's
On Tue, Jan 15, 2019 at 3:53 PM Tom Lane wrote: > > James Coleman writes: > > [ saop_is_not_null-v6.patch ] > > I quite dislike taking the responsibility out of clause_is_strict_for > and putting it in the callers. Aside from requiring duplicative code, > it means that this fails to prove anything for recursive situations > (i.e., where the ScalarArrayOp appears one or more levels down in a > clause examined by clause_is_strict_for). > > If the behavior needs to vary depending on proof rule, which it > looks like it does, the way to handle that is to add flag argument(s) > to clause_is_strict_for. The reason I moved it was that we're no longer just proving strictness, so it seemed odd to put it in a function specifically named to be about proving strictness. If you'd prefer an argument like "bool allow_false" or "bool check_saop_implies_is_null" I'm happy to make that change, but it seems generally unhelpful to change the function implementation to contradict the name (the empty array case returns false even when the lhs is null). > I'm also not happy about the lack of comments justifying the proof > rules -- eg, it's far from obvious why you need to restrict one > case to !weak but not the other. Both implication and refutation have the !weak check; it's just that the check in the implication function was already present. I'll add more comments inline for the proof rules, but from what I can tell only strong implication holds for implication of IS NOT NULL. A sample case disproving weak implication is "select null is not null, null::int = any(array[null]::int[])", which results in "false, null", so non-falsity of the ScalarArrayOpExpr does not imply non-falsity of the IS NOT NULL. However I'm currently unable to come up with a case for refutation such that truth of a ScalarArrayOp refutes IS NULL, so I'll update code and tests for that. That also takes care of my earlier point: James Coleman writes: > 2. The tests I have for refuting "x is null" show that w_r_holds as > well as s_r_holds. I'd only expected the latter, since only > "strongly_refuted_by = t". But it still leaves my questions (1) and (3). James Coleman
Re: Proving IS NOT NULL inference for ScalarArrayOpExpr's
On Tue, Jan 15, 2019 at 4:36 PM David Rowley wrote: > I wasn't suggesting any code changes. I just thought the code was > sufficiently hard to understand to warrant some additional tests that > ensure we don't assume that if the int4 column x is not null that also > x+x is not null. Only the reverse can be implied since int4pl is > strict. At the risk of missing something obvious, I'm not sure I see a case where "x is not null" does not imply "(x + x) is not null", at least for integers. Since an integer + an integer results in an integer, then it must imply the addition of itself is not null also? This is the root of the questions I had: James Coleman writes: > 1. The current code doesn't result in "strongly_implied_by = t" for > the "(x + x) is not null" case, but it does result in "s_i_holds = t". > This doesn't change if I switch to using "equal()" as mentioned above. > 3. The tests I have for refuting "(x + x) is null" show that both > s_r_holds and w_r_holds. I'd expected these to be false. James Coleman
Re: Proving IS NOT NULL inference for ScalarArrayOpExpr's
On Wed, 16 Jan 2019 at 14:05, James Coleman wrote: > At the risk of missing something obvious, I'm not sure I see a case > where "x is not null" does not imply "(x + x) is not null", at least > for integers. Since an integer + an integer results in an integer, > then it must imply the addition of itself is not null also? A strict function guarantees it will return NULL on any NULL input. This does not mean it can't return NULL on a non-NULL input. While int4pl might do what you want, some other strict function might not. A simple example would be a strict function that decided to return NULL when the two ints combined overflowed int. The docs [1] define STRICT as: "RETURNS NULL ON NULL INPUT or STRICT indicates that the function always returns null whenever any of its arguments are null. If this parameter is specified, the function is not executed when there are null arguments; instead a null result is assumed automatically." [1] https://www.postgresql.org/docs/devel/sql-createfunction.html -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
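For example, a hypothetical C-language function marked STRICT could legally look like this; pg_add_s32_overflow() is the real helper from common/int.h, while the function itself is made up for illustration. This is why "a IS NOT NULL" cannot be taken to imply "f(a) IS NOT NULL" for a strict f, even though the reverse direction is safe:

    #include "postgres.h"
    #include "fmgr.h"
    #include "common/int.h"

    PG_FUNCTION_INFO_V1(int4pl_or_null);

    /* declared STRICT at the SQL level, yet may return NULL for non-NULL inputs */
    Datum
    int4pl_or_null(PG_FUNCTION_ARGS)
    {
        int32       a = PG_GETARG_INT32(0);     /* never NULL here: fmgr short-circuits */
        int32       b = PG_GETARG_INT32(1);
        int32       result;

        if (pg_add_s32_overflow(a, b, &result))
            PG_RETURN_NULL();                   /* NULL despite non-NULL inputs */

        PG_RETURN_INT32(result);
    }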
Re: Proving IS NOT NULL inference for ScalarArrayOpExpr's
On Tue, Jan 15, 2019 at 8:14 PM David Rowley wrote: > > On Wed, 16 Jan 2019 at 14:05, James Coleman wrote: > > At the risk of missing something obvious, I'm not sure I see a case > > where "x is not null" does not imply "(x + x) is not null", at least > > for integers. Since an integer + an integer results in an integer, > > then it must imply the addition of itself is not null also? > > A strict function guarantees it will return NULL on any NULL input. > This does not mean it can't return NULL on a non-NULL input. > > While int4pl might do what you want, some other strict function might > not. A simple example would be a strict function that decided to > return NULL when the two ints combined overflowed int. Yes, the claim about not all strict functions guarantee this (like int4pl) made sense. Is your preference in this kind of case to comment the code and/or tests but stick with int4pl even though it doesn't demonstrate the "problem", or try to engineer a different test case such that the *_holds results in the tests don't seem to imply we could prove more things than we do? James Coleman
Re: Proving IS NOT NULL inference for ScalarArrayOpExpr's
On Wed, 16 Jan 2019 at 14:29, James Coleman wrote: > > On Tue, Jan 15, 2019 at 8:14 PM David Rowley > > While int4pl might do what you want, some other strict function might > > not. A simple example would be a strict function that decided to > > return NULL when the two ints combined overflowed int. > > Yes, the claim about not all strict functions guarantee this (like > int4pl) made sense. > > Is your preference in this kind of case to comment the code and/or > tests but stick with int4pl even though it doesn't demonstrate the > "problem", or try to engineer a different test case such that the > *_holds results in the tests don't seem to imply we could prove more > things than we do? I think using x+x or whatever is fine. I doubt there's a need to invent some new function that returns NULL on non-NULL input. The code you're adding has no idea about the difference between the two. It has no way to know that anyway. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Log a sample of transactions
On Tue, Jan 15, 2019 at 06:03:36PM +0100, Adrien NAYRAT wrote: > The goal is not to find slow queries in a transaction, but troubleshoot > applicative issue when you have short queries. > > Sometimes you want to understand what happens in a transaction, either you > perfectly know your application, either you have to log all queries and find > ones with the same transaction ID (%x). It could be problematic if you have > a huge traffic with fast queries. Looks like a sensible argument to me. High log throughput can cause Postgres performance to go down by a couple of percent, which defeats the purpose of tracking down performance problems, as the logging itself could directly impact the application. -- Michael signature.asc Description: PGP signature
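For context, the mechanism being discussed boils down to a single per-transaction coin flip; the GUC and flag names below are assumptions based on this thread, not necessarily what the final patch uses:

    /* decided once at transaction start: the whole transaction is either logged or not */
    xact_is_sampled = log_transaction_sample_rate != 0 &&
        (log_transaction_sample_rate == 1 ||
         random() <= log_transaction_sample_rate * MAX_RANDOM_VALUE);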
Re: Making WAL receiver startup rely on GUC context for primary_conninfo and primary_slot_name
On Sat, Jan 12, 2019 at 08:10:07AM +0900, Michael Paquier wrote: > On Fri, Jan 11, 2019 at 12:52:08PM -0500, Robert Haas wrote: >> With the patch, the PrimaryConnInfo and PrimarySlotName arguments are >> removed from RequestXLogStreaming. That means that the new >> walreceiver could come to a different conclusion about the values of >> those arguments than the startup process. In particular, it could end >> up thinking that primary_conninfo is empty when, if the startup >> process had thought that, the walreceiver would never have been >> launched in the first place. But it's not obvious that you've added >> any logic in WALReceiverMain or elsewhere to compensate for this >> possibility -- what would happen in that case? Would we crash? >> Connect to the wrong server? > > If I contemplate the patch this morning there is this bit: > @@ -291,32 +295,40 @@ WalReceiverMain(void) > /* Unblock signals (they were blocked when the postmaster forked > us) */ > PG_SETMASK(&UnBlockSig); > > + /* > +* Fail immediately if primary_conninfo goes missing, better safe than > +* sorry. > +*/ > + if (PrimaryConnInfo == NULL || strcmp(PrimaryConnInfo, "") == 0) > + ereport(ERROR, > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > +errmsg("cannot connect to the primary server as > \"primary_conninfo\" is not defined"))); > > So the answer to your question is: the WAL receiver fails to start. Robert, does this answer your question? I agree that depending on the timing, we could finish by having the startup process spawning a WAL receiver with a given connection string, and that the WAL receiver could use a different one, but as long as we fail the WAL receiver startup this does not seem like an issue to me, so I disagree with the upthread statement that the patch author has not thought about such cases :) It seems to me that making the WAL receiver relying more on the GUC context makes it more robust when it comes to reloading the parameters which would be an upcoming change as we need to rely on the WAL receiver check the GUC context itself and FATAL (or we would need to have the startup process check for a change in primary_conninfo/primary_slot_name, and then have the startup process kill the WAL receiver by itself). In short, handling all that in the WAL receiver would be more robust in the long term in my opinion as we reduce interactions between the WAL receiver and the startup process. And on top of it we get rid of ready_to_display, which is something I am unhappy with since we had to introduce it. -- Michael signature.asc Description: PGP signature
Re: O_DIRECT for relations and SLRUs (Prototype)
On Tue, Jan 15, 2019 at 07:40:12PM +0300, Maksim Milyutin wrote: > Could you specify all cases when buffers will not be aligned with BLCKSZ? > > AFAIC shared and temp buffers are aligned. And what ones are not? SLRU buffers are not aligned with the OS pages (aka alignment with 4096 at least). There are also a bunch of code paths where the callers of mdread() or mdwrite() don't do that, which makes a correct patch more invasive. -- Michael signature.asc Description: PGP signature
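To make the alignment issue concrete, a rough sketch of aligning a staging buffer to the OS page size by hand; TYPEALIGN is the real macro from c.h, while the constant and names are assumptions:

    #define DIRECT_IO_ALIGNMENT     4096    /* OS page size on typical Linux */

    /* over-allocate, then carve out an address aligned to the OS page size */
    static char rawbuf[BLCKSZ + DIRECT_IO_ALIGNMENT];

    static char *
    get_aligned_buffer(void)
    {
        return (char *) TYPEALIGN(DIRECT_IO_ALIGNMENT, rawbuf);
    }

    /*
     * O_DIRECT reads and writes must go through the aligned pointer; callers
     * that supply arbitrary, unaligned buffers force an extra copy into such
     * a staging area first.
     */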
Bump up PG_CONTROL_VERSION on HEAD
Hi all, f3db7f16 has proved that it can be a bad idea to run pg_resetwal on a data folder which does not match the version it has been compiled with. As of HEAD, PG_CONTROL_VERSION is still 1100: $ pg_controldata | grep "pg_control version" pg_control version number:1100 Wouldn't it be better to bump it up to 1200? Thanks, -- Michael signature.asc Description: PGP signature
Re: Bump up PG_CONTROL_VERSION on HEAD
Hi, On 2019-01-16 11:02:08 +0900, Michael Paquier wrote: > f3db7f16 has proved that it can be a bad idea to run pg_resetwal on a > data folder which does not match the version it has been compiled > with. > > As of HEAD, PG_CONTROL_VERSION is still 1100: > $ pg_controldata | grep "pg_control version" > pg_control version number:1100 > > Wouldn't it be better to bump it up to 1200? We don't commonly bump that without corresponding changes to the contents of the control file. I don't see what we'd gain by the bump? Greetings, Andres Freund
Re: current_logfiles not following group access and instead follows log_file_mode permissions
On Tue, Jan 15, 2019 at 10:53:30AM -0500, Tom Lane wrote: > On reflection, maybe the problem is not that we're giving the file > the wrong permissions, but that we're putting it in the wrong place? > That is, seems like it should be in the logfile directory not the > data directory. That would certainly simplify the intended use-case, > and it would fix this complaint too. Yeah, thinking more on this one, using different permissions for this file than for the log files makes little sense, so what you propose here seems like a sensible thing to do. Even if we exclude the file from native BASE_BACKUP, this would not solve the case of custom backup solutions doing their own copy of things when they rely on group-read permissions. This would not completely solve the problem anyway if the log files are in the data folder, but it would address the case where the log files are in an absolute path outside the data folder. I am adding in CC Gilles, who implemented current_logfiles, for his input. -- Michael signature.asc Description: PGP signature
RE: Libpq support to connect to standby server as priority
From: Tom Lane [mailto:t...@sss.pgh.pa.us] > Michael Paquier writes: > > On Tue, Jan 15, 2019 at 02:00:57AM +, Tsunakawa, Takayuki wrote: > >> The original desire should have been the ability to connect to a > >> primary or a standby. So, I think we should go back to the original > >> thinking (and not complicate the feature), and create a read only > >> GUC_REPORT variable, say, server_role, that identifies whether the > >> server is a primary or a standby. > > > From the point of view of making sure that a client is really > > connected to a primary or a standby, this is the best idea around. > > There are a couple of issues here: > 1. Are you sure there are no use-cases for testing transaction_read_only > as such? I don't find any practical use case, but I won't object to leaving the current target_session_attrs as-is. Aside from that, I think a parameter like PgJDBC's is necessary, e.g., target_server_type = {primary | standby | prefer_standby}, which acts based on just the server role (primary or standby). > 2. What will the fallback implementation be, when connecting to a server > too old to have the variable you want? One of the following: 1. An "unsupported" error. I'll take this. 2. libpq issues "SELECT pg_is_in_recovery()". 3. Blindly accept the first successful connection. Regards Takayuki Tsunakawa
Re: What to name the current heap after pluggable storage / what to rename?
On Tue, Jan 15, 2019 at 1:43 PM Andres Freund wrote: > Hi, > > On 2019-01-11 12:01:36 -0500, Robert Haas wrote: > > On Thu, Jan 10, 2019 at 7:05 PM Andres Freund > wrote: > > > ISTM that the first block would best belong into new files like > > > access/rel[ation].h and access/common/rel[ation].h. > > > > +1. > > > > > I think the second > > > set should be renamed to be table_open() (with backward compat > > > redirects, there's way way too many references) and should go into > > > access/table.h access/table/table.c alongside tableam.[ch], > > > > Sounds reasonable. > > > > > but I could > > > also see just putting them into relation.[ch]. > > > > I would view that as a less-preferred alternative, but not crazy. > > Here's a set of patches. Not commit quality, but enough to discuss. > > > The first patch, the only really interesting one, splits out > relation_(open|openrv|openrv_extended|close) into access/relation.h and > access/common/relation.c > and > heap_(open|openrv|openrv_extended|close) into access/table.h and > access/table/table.c > > It's worthwhile to note that there's another header named > nodes/relation.h. But there's also utils/rel.h, so I couldn't think of a > another good name. > The access/relation.[c|h] name is fine. Or how about access/rel.[c|h], since nodes/relation.h is related to the planner, and utils/rel.h is somehow related to the relation caches? > I'm basically thinking that table.h, even in the post pluggable storage > world, should not contain lower level functionality like dispatching > into table-am (that'll reside in tableam.h). But e.g. a > simple_table_(insert|update|delete) could live there, as well as > potentially some other heap_ functionality strewn around the backend. > > I made table.h not include relation.h, which means that a few files > might need both. I'm not sure that's the right choice, but it seems > easier to extend that later if shows to be painful, than to do the > reverse. > The need for both relation.h and table.h in a single file comes from the use of both the relation_open and table_open functions. When I checked one of the files where both headers are included, I found that it simply used the relation_open function even though the type of the relation is known. I didn't check whether it is possible or not, but if the kind of relation is known at every caller of relation_open, such calls could be replaced with table_open, index_open, or composite_type_open, so that none of the relation functions would need to be exposed. > I've left the following in table.h: > /* > * heap_ used to be the prefix for these routines, and a lot of code will just > * continue to work without adaptions after the introduction of pluggable > * storage, therefore just map these names. > */ > #define heap_open(r, l) table_open(r, l) > #define heap_openrv(r, l) table_openrv(r, l) > #define heap_openrv_extended(r, l, m) table_openrv_extended(r, l, m) > #define heap_close(r, l) table_close(r, l) > > and I think we should leave that in there for the forseeable future. > The above macros are good. They are useful to avoid breaking any third-party extensions. Regards, Haribabu Kommi Fujitsu Australia
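As a usage sketch of the renamed entry points (the function, OID and lock level are placeholders; table_open()/table_close() are the names being introduced by this patch set):

    #include "postgres.h"
    #include "access/table.h"
    #include "utils/rel.h"

    void
    do_something_with_table(Oid relid)
    {
        Relation    rel = table_open(relid, AccessShareLock);

        /* ... operate on the table ... */

        table_close(rel, AccessShareLock);
    }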
Re: WIP: Avoid creation of the free space map for small tables
On Fri, Jan 11, 2019 at 3:54 AM John Naylor wrote: > > On Wed, Jan 9, 2019 at 10:50 PM Amit Kapila wrote: > > Thanks, Mithun for performance testing, it really helps us to choose > > the right strategy here. Once John provides next version, it would be > > good to see the results of regular pgbench (read-write) runs (say at > > 50 and 300 scale factor) and the results of large copy. I don't think > > there will be any problem, but we should just double check that. > > Attached is v12 using the alternating-page strategy. I've updated the > comments and README as needed. In addition, I've

Below are my performance tests and numbers.

Machine: cthulhu

Tests and setups

Server settings: max_connections = 200, shared_buffers = 8GB, checkpoint_timeout = 15min, maintenance_work_mem = 1GB, checkpoint_completion_target = 0.9, min_wal_size = 15GB and max_wal_size = 20GB.

pgbench settings: read-write settings (TPCB-like tests)
./pgbench -c $threads -j $threads -T $time_for_reading -M prepared postgres

scale factor 50 -- median of 3 TPS
clients   v12-patch      base patch     % diff
1         826.081588     834.328238     -0.9884179421
16        10805.807081   10800.662805   0.0476292621
32        19722.277019   19641.546628   0.4110185034
64        30232.681889   30263.616073   -0.1022157561

scale factor 300 -- median of 3 TPS
clients   v12-patch      base patch     % diff
1         813.646062     822.18648      -1.038744641
16        11379.028702   11277.05586    0.9042505709
32        21688.084093   21613.044463   0.3471960192
64        36288.85711    36348.6178     -0.1644098005

Copy command Test: setup
./psql -d postgres -c "COPY pgbench_accounts TO '/mnt/data-mag/mithun.cy/fsmbin/bin/dump.out' WITH csv"
./psql -d postgres -c "CREATE UNLOGGED TABLE pgbench_accounts_ulg (LIKE pgbench_accounts) WITH (fillfactor = 100);"

Test run:
TRUNCATE TABLE pgbench_accounts_ulg;
\timing
COPY pgbench_accounts_ulg FROM '/mnt/data-mag/mithun.cy/fsmbin/bin/dump.out' WITH csv;
\timing

execution time in ms (scale factor indicates size of pgbench_accounts)
scale factor   v12-patch   base patch   % diff
300            77166.407   77862.041    -0.8934186557
50             13329.233   13284.583    0.3361038882

So the large-table tests do not show any considerable performance variance from the base code!

On Fri, Jan 11, 2019 at 3:54 AM John Naylor wrote: > On Wed, Jan 9, 2019 at 10:50 PM Amit Kapila > wrote: > > Thanks, Mithun for performance testing, it really helps us to choose > > the right strategy here. Once John provides next version, it would be > > good to see the results of regular pgbench (read-write) runs (say at > > 50 and 300 scale factor) and the results of large copy. I don't think > > there will be any problem, but we should just double check that. > > Attached is v12 using the alternating-page strategy. I've updated the > comments and README as needed. In addition, I've > > -handled a possible stat() call failure during pg_upgrade > -added one more assertion > -moved the new README material into a separate paragraph > -added a comment to FSMClearLocalMap() about transaction abort > -corrected an outdated comment that erroneously referred to extension > rather than creation > -fleshed out the draft commit messages > -- Thanks and Regards Mithun Chicklore Yogendra EnterpriseDB: http://www.enterprisedb.com
RE: Libpq support to connect to standby server as priority
From: Dave Cramer [mailto:p...@fastcrypt.com] > The original desire should have been the ability to connect to a > primary or a standby. So, I think we should go back to the original thinking > (and not complicate the feature), and create a read only GUC_REPORT variable, > say, server_role, that identifies whether the server is a primary or a > standby. > > > > I'm confused as to how this would work. Who or what determines if the server > is a primary or standby? Overall, the server determines the server role (primary or standby) using the same mechanism as pg_is_in_recovery(), and sets the server_role GUC parameter. As the parameter is GUC_REPORT, the change is reported to the clients using the ParameterStatus ('S') message. The clients also get the value at connection. Regards Takayuki Tsunakawa
Re: Proving IS NOT NULL inference for ScalarArrayOpExpr's
James Coleman writes: > On Tue, Jan 15, 2019 at 3:53 PM Tom Lane wrote: >> I quite dislike taking the responsibility out of clause_is_strict_for >> and putting it in the callers. > The reason I moved it was that we're no longer just proving > strictness, so it seemed odd to put it in a function specifically > named to be about proving strictness. Well, as I said upthread, it seems like we need to think a bit more carefully about what it is that clause_is_strict_for is testing --- and if that ends up finding that some other name is more apposite, I'd not have any hesitation about renaming it. But we're really missing a bet if the ScalarArrayOp-specific check isn't inside that recursive check of an "atomic" expression's properties. The routines above there only recurse through things that act like AND or OR, but clause_is_strict_for is capable of descending through other stuff as long as it's strict in some sense. What we need to be clear about here is exactly what that sense is. regards, tom lane
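A minimal SQL illustration of the "strict in some sense" question: a ScalarArrayOpExpr is not strict in the usual NULL-in/NULL-out sense, because an empty array yields false rather than NULL, but it can never yield true for a NULL input, and that weaker property is what the proposed proof relies on.

  SELECT NULL::int = ANY (ARRAY[1, 2])  AS nonempty_case,   -- NULL
         NULL::int = ANY ('{}'::int[])  AS empty_case;      -- false, not NULL
  -- Neither case can be true, so "x = ANY(...)" evaluating to true implies x IS NOT NULL.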
Re: Libpq support to connect to standby server as priority
>> I'm confused as to how this would work. Who or what determines if the server >> is a primary or standby? > > Overall, the server determines the server role (primary or standby) using the > same mechanism as pg_is_in_recovery(), and sets the server_role GUC parameter. > As the parameter is GUC_REPORT, the change is reported to the clients using > the ParameterStatus ('S') message. The clients also get the value at > connection. But pg_is_in_recovery() returns true even for a promoting standby. So you have to wait and retry to send pg_is_in_recovery() until it finishes the promotion to find out it is now a primary. I am not sure if the backend ought to be responsible for this process. If not, libpq would need to handle it, but I doubt it would be possible. Best regards, -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese:http://www.sraoss.co.jp
Re: Proving IS NOT NULL inference for ScalarArrayOpExpr's
David Rowley writes: > On Wed, 16 Jan 2019 at 14:29, James Coleman wrote: >> Is your preference in this kind of case to comment the code and/or >> tests but stick with int4pl even though it doesn't demonstrate the >> "problem", or try to engineer a different test case such that the >> *_holds results in the tests don't seem to imply we could prove more >> things than we do? > I think using x+x or whatever is fine. I doubt there's a need to > invent some new function that returns NULL on non-NULL input. The > code you're adding has no idea about the difference between the two. > It has no way to know that anyway. Yeah, I never intended that the *_holds results would be "exact" in every test case. They're mostly there to catch egregious errors in test construction. Since the code we're actually testing doesn't get to see the input or output values of the test cases, it's not really useful to insist that the test cases exercise every possible combination of input/output values for the given expressions. regards, tom lane
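For completeness, the strictness property the x + x test case leans on is easy to check directly (a quick sketch, not part of the regression tests under discussion): int4pl returns NULL for a NULL input, which is exactly what clause_is_strict_for needs to know about the expression.

  SELECT (1 + NULL::int) IS NULL AS int4pl_is_strict;   -- true: a NULL input yields a NULL result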
Re: Bump up PG_CONTROL_VERSION on HEAD
Andres Freund writes: > On 2019-01-16 11:02:08 +0900, Michael Paquier wrote: >> f3db7f16 has proved that it can be a bad idea to run pg_resetwal on a >> data folder which does not match the version it has been compiled >> with. >> >> As of HEAD, PG_CONTROL_VERSION is still 1100: >> $ pg_controldata | grep "pg_control version" >> pg_control version number:1100 >> >> Wouldn't it be better to bump it up to 1200? > We don't commonly bump that without corresponding control version > changes. I don't see what we'd gain by the bump? Yeah, it has not been our practice to bump PG_CONTROL_VERSION unless the contents of pg_control actually change. The whole point of f3db7f16 was to ensure that we didn't have to do that just because of a major version change. regards, tom lane
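As an aside, the same number can also be read from a running server via pg_control_system(), for anyone who wants to cross-check what their cluster reports against the pg_controldata output quoted above:

  SELECT pg_control_version FROM pg_control_system();
  -- 1100 on current HEAD, matching the pg_controldata output shown above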
Re: pgbench doc fix
> On 2018-Dec-03, Tatsuo Ishii wrote: > >> To: >> --- >> -M querymode >> --protocol=querymode >> >> Protocol to use for submitting queries to the server: >> >> simple: use simple query protocol. >> >> extended: use extended query protocol. >> >> prepared: use extended query protocol with prepared statements. >> >> Because in "prepared" mode pgbench reuses the parse analysis >> result for the second and subsequent query iteration, pgbench runs >> faster in the prepared mode than in other modes. > > Looks good to me. Thanks. I'm going to commit this if there's no objection. Best regards, -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese:http://www.sraoss.co.jp
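A rough SQL-level analogue of what -M prepared buys (the account_balance statement name here is just for illustration): parse analysis happens once at PREPARE time and each EXECUTE reuses it, which is why the prepared mode is the fastest of the three.

  PREPARE account_balance (int) AS
      SELECT abalance FROM pgbench_accounts WHERE aid = $1;
  EXECUTE account_balance(1);
  EXECUTE account_balance(2);   -- reuses the parse analysis done at PREPARE time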
RE: Libpq support to connect to standby server as priority
From: Tatsuo Ishii [mailto:is...@sraoss.co.jp] > But pg_is_in_recovery() returns true even for a promoting standby. So > you have to wait and retry to send pg_is_in_recovery() until it > finishes the promotion to find out it is now a primary. I am not sure > if the backend ought to be responsible for this process. If not, libpq would > need to handle it, but I doubt it would be possible. Yes, the application needs to retry connection attempts until success. That's no different from PgJDBC and other DBMSs. Regards Takayuki Tsunakawa
Re: Query with high planning time at version 11.1 compared versions 10.5 and 11.0
Amit-san, (2019/01/15 10:46), Amit Langote wrote: On 2019/01/11 20:04, Etsuro Fujita wrote: (2019/01/11 13:46), Amit Langote wrote: However, what you proposed here as-is would not keep that behavior, because rel->part_scheme is created for those tables as well (even though there would be no need IIUC). Partition pruning uses part_scheme and pruning can occur even if a partitioned table is under UNION ALL, so it *is* needed in that case. Ah, you are right. Thanks for pointing that out! I think enable_partitionwise_join should only be checked in relnode.c or joinrels.c. Sorry, I don't understand this. What I was trying to say is that we should check the GUC close to where partitionwise join is actually implemented even though there is no such hard and fast rule. Or maybe I'm just a bit uncomfortable with setting consider_partitionwise_join based on the GUC. I didn't think so. Consider the consider_parallel flag. I think the way of setting it deviates from that rule already; it is set essentially based on a GUC and is set in set_base_rel_sizes() (ie, before implementing parallel paths). When adding the consider_partitionwise_join flag, I thought it would be a good idea to set consider_partitionwise_join in a similar way to consider_parallel, keeping build_simple_rel() simple. I've attached a patch to show what I mean. Can you please take a look? Thanks for the patch! Maybe I'm missing something, but I don't have a strong opinion about that change. I'd rather think to modify build_simple_rel so that it doesn't create rel->part_scheme if unnecessary (ie, partitioned tables contained in inheritance trees where the top parent is a UNION ALL subquery). As I said above, partition pruning can occur even if a partitioned table happens to be under UNION ALL. However, we *can* avoid creating part_scheme and setting other partitioning properties if *all* of enable_partition_pruning, enable_partitionwise_join, and enable_partitionwise_aggregate are turned off. Yeah, I think so. If you think that this patch is a good idea, then you'll need to explicitly set consider_partitionwise_join to false for a dummy partition rel in set_append_rel_size(), because the assumption of your patch that such partition's rel's consider_partitionwise_join would be false (as initialized with the current code) would be broken by my patch. But that might be a good thing to do anyway as it will document the special case usage of consider_partitionwise_join variable more explicitly, assuming you'll be adding a comment describing why it's being set to false explicitly. I'm not sure we need this as part of a fix for the issue reported on this thread. I don't object to what you proposed here, but that would be rather an improvement, so I think we should leave that for another patch. Sure, no problem with committing it separately if at all. OK Thanks! Best regards, Etsuro Fujita
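A small sketch of the UNION ALL point above, using hypothetical tables: even with partitionwise join disabled, plan-time pruning still needs part_scheme for a partitioned table referenced under UNION ALL.

  SET enable_partitionwise_join = off;
  CREATE TABLE meas (a int) PARTITION BY RANGE (a);
  CREATE TABLE meas_lo PARTITION OF meas FOR VALUES FROM (0) TO (100);
  CREATE TABLE meas_hi PARTITION OF meas FOR VALUES FROM (100) TO (200);
  EXPLAIN (COSTS OFF)
  SELECT * FROM (SELECT a FROM meas UNION ALL SELECT a FROM meas) u WHERE a = 42;
  -- Only meas_lo should be scanned in each UNION ALL arm, i.e. pruning applies
  -- here too, which is why part_scheme cannot be skipped based on the
  -- enable_partitionwise_join setting alone.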
Re: Query with high planning time at version 11.1 compared versions 10.5 and 11.0
(2019/01/15 11:42), Amit Langote wrote: On 2019/01/11 21:50, Etsuro Fujita wrote: (2019/01/10 10:41), Amit Langote wrote: That's a loaded meaning and abusing it to mean something else can be challenged, but we can live with that if properly documented. Speaking of which: /* used by partitionwise joins: */ boolconsider_partitionwise_join;/* consider partitionwise join * paths? (if partitioned rel) */ Maybe, mention here how it will be abused in back-branches for non-partitioned relations? I know we don't yet reach a consensus on what to do in details to address this issue, but for the above, how about adding comments like this to set_append_rel_size(), instead of the header file: /* * If we consider partitionwise joins with the parent rel, do the same * for partitioned child rels. * * Note: here we abuse the consider_partitionwise_join flag for child * rels that are not partitioned, to tell try_partitionwise_join() * that their targetlists and EC entries have been generated. */ if (rel->consider_partitionwise_join) childrel->consider_partitionwise_join = true; ISTM that that would be more clearer than the header file. Thanks for updating the patch. I tend to agree that it might be better to add such details here than in the header as it's better to keep the latter more stable. About the comment you added, I think we could clarify the note further as: Note: here we abuse the consider_partitionwise_join flag by setting it *even* for child rels that are not partitioned. In that case, we set it to tell try_partitionwise_join() that it doesn't need to generate their targetlists and EC entries as they have already been generated here, as opposed to the dummy child rels for which the flag is left set to false so that it will generate them. Maybe it's a bit wordy, but it helps get the intention across more clearly. I think that is well-worded, so +1 from me. Thanks again! Best regards, Etsuro Fujita
Re: Query with high planning time at version 11.1 compared versions 10.5 and 11.0
Hi Ashutosh, (2019/01/15 13:29), Ashutosh Bapat wrote: I think, there's something better possible. Two partitioned relations won't use partition-wise join, if their partition schemes do not match. Partitioned relations with same partitioning scheme share PartitionScheme pointer. PartitionScheme structure should get an extra counter, maintaining a count of number of partitioned relations sharing that structure. When this counter is 1, that relation is certainly not going to participate in PWJ and thus need not have all the structure required by PWJ set up. If we use this counter coupled with enable_partitionwise_join flag, we can get rid of consider_partitionwise_join flag altogether, I think. Interesting! That flag was introduced to disable PWJs when whole-row Vars are involved, as you know, so I think we need to first eliminate that limitation, to remove that flag. Thanks! Best regards, Etsuro Fujita
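To make that limitation concrete (reusing the hypothetical meas table from the sketch earlier in this section): per the explanation above, a join whose target list needs whole-row Vars of the partitioned relations is currently planned without partitionwise join even when the GUC is on, which is what consider_partitionwise_join guards against.

  SET enable_partitionwise_join = on;
  EXPLAIN (COSTS OFF)
  SELECT m1, m2 FROM meas m1 JOIN meas m2 USING (a);
  -- Whole-row Vars of m1/m2: expect a single join above Append nodes rather
  -- than per-partition joins, reflecting the limitation described above.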
Re: de-deduplicate code in DML execution hooks in postgres_fdw
(2019/01/07 20:26), Etsuro Fujita wrote: (2018/11/30 19:55), Etsuro Fujita wrote: One thing we would need to discuss more about this is the name of a new function added by the patch. IIRC, we have three options so far [1]: 1) perform_one_foreign_dml proposed by Ashutosh 2) execute_dml_single_row proposed by Michael 3) execute_parameterized_dml_stmt proposed by me I'll withdraw my proposal; I think #1 and #2 would describe the functionality better than #3. My vote would go to #2 or execute_dml_stmt_single_row, which moves the function much more closer to execute_dml_stmt. I'd like to hear the opinions of others. On second thought I'd like to propose another option: execute_foreign_modify because I think this would match the existing helper functions for updating foreign tables in postgres_fdw.c better, such as create_foreign_modify, prepare_foreign_modify and finish_foreign_modify. I don't really think the function name should contain "one" or "single_row". Like the names of the calling APIs (ie, ExecForeignInsert, ExecForeignUpdate and ExecForeignDelete), I think it's OK to omit such words from the function name. Here is an updated version of the patch. In addition to the naming, I tweaked the function a little bit to match other functions (mainly variable names), moved it to the place where the helper functions are defined, fiddled with some comments, and removed an extra include file that the original patch added. If there are no objections, I'll commit that version of the patch. Best regards, Etsuro Fujita
Re: Libpq support to connect to standby server as priority
> From: Tatsuo Ishii [mailto:is...@sraoss.co.jp] >> But pg_is_in_recovery() returns true even for a promoting standby. So >> you have to wait and retry to send pg_is_in_recovery() until it >> finishes the promotion to find out it is now a primary. I am not sure >> if the backend ought to be responsible for this process. If not, libpq would >> need to handle it, but I doubt it would be possible. > > Yes, the application needs to retry connection attempts until success. > That's no different from PgJDBC and other DBMSs. I don't know what PgJDBC is doing; however, I think libpq needs to do more than just retrying. 1) Try to find a node on which pg_is_in_recovery() returns false. If found, then we assume that is the primary. We also assume that other nodes are standbys. Done. 2) If there's no node on which pg_is_in_recovery() returns false, then we need to retry until we find one. To not retry forever, there should be a timeout counter parameter. Best regards, -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese:http://www.sraoss.co.jp
RE: Protect syscache from bloating with negative cache entries
From: Tom Lane [mailto:t...@sss.pgh.pa.us] > Certainly, what I've done here doesn't preclude adding some wider solution > to > the issue of extremely large catcaches. I'm relieved to hear that. > I think it takes the pressure off > for one rather narrow problem case, and the mechanism could be used to fix > other ones. But if you've got an application that just plain accesses a > huge number of objects, this isn't going to make your life better. I understand you're trying to solve the problem caused by negative cache entries as soon as possible, because the user is really suffering from it. I feel sympathy with that attitude, because you seem to be always addressing issues that others are reluctant to take. That's one of the reasons I respect you. > Well, we *had* an LRU mechanism for the catcaches way back when. We got > rid of it --- see commit 8b9bc234a --- because (a) maintaining the LRU > info was expensive and (b) performance fell off a cliff in scenarios where > the cache size limit was exceeded. You could probably find some more info > about that by scanning the mail list archives from around the time of that > commit, but I'm too lazy to do so right now. Oh, in 2006... I'll examine the patch and the discussion to see how the LRU management was done. > That was a dozen years ago, and it's possible that machine performance > has moved so much since then that the problems are gone or mitigated. I really, really hope so. Even if we see some visible impact by the LRU management, I think that's the debt PostgreSQL had to pay for but doesn't now. Even the single-process MySQL, which doesn't suffer from cache bloat for many server processes, has the ability to limit the cache. And PostgreSQL has many parameters for various memory components such as shared_buffers, wal_buffers, work_mem, etc, so it would be reasonable to also have the limit for the catalog caches. That said, we can avoid the penalty and retain the current performance by disabling the limit (some_size_param = 0). I think we'll evaluate the impact of LRU management by adding prev and next members to catcache and relcache structures, and putting the entry at the front (or back) of the LRU chain every time the entry is obtained. I think pgbench's select-only mode is enough for evaluation. I'd like to hear if any other workload is more appropriate to see the CPU cache effect. > In particular I'm sure that any limit we would want to impose today will > be far more than the 5000-entries-across-all-caches limit that was in use > back then. But I'm not convinced that a workload that would create 100K > cache entries in the first place wouldn't have severe problems if you > tried to constrain it to use only 80K entries. I fear it's just wishful > thinking to imagine that the behavior of a larger cache won't be just > like a smaller one. Also, IIRC some of the problem with the LRU code > was that it resulted in lots of touches of unrelated data, leading to > CPU cache miss problems. It's hard to see how that doesn't get even > worse with a bigger cache. > > As far as the relcache goes, we've never had a limit on that, but there > are enough routine causes of relcache flushes --- autovacuum for instance > --- that I'm not really convinced relcache bloat can be a big problem in > production. As Andres and Robert mentioned, we want to free less frequently used cache entries. Otherwise, we're now suffering from the bloat to TBs of memory. This is a real, not hypothetical issue... 
> The plancache has never had a limit either, which is a design choice that > was strongly influenced by our experience with catcaches. Again, I'm > concerned about the costs of adding a management layer, and the likelihood > that cache flushes will simply remove entries we'll soon have to rebuild. Fortunately, we're not bothered with the plan cache. But I remember you said you were annoyed by PL/pgSQL's plan cache use at Salesforce. Were you able to overcome it somehow? > Oracle seldom impresses me as having designs we ought to follow. > They have a well-earned reputation for requiring a lot of expertise to > operate, which is not the direction this project should be going in. > In particular, I don't want to "solve" cache size issues by exposing > a bunch of knobs that most users won't know how to twiddle. Oracle certainly seems to be difficult to use. But they seem to be studying other DBMSs to make it simpler to use. I'm sure they also have a lot we should learn from, and the cache limit is one of them (although MySQL's per-cache tuning may be better). And having limits for various components would be the first step toward the autonomous database: tunable -> auto tuning -> autonomous. Regards Takayuki Tsunakawa
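For anyone who wants to reproduce the negative-entry side of this, a quick sketch (assuming name lookups of nonexistent relations leave negative syscache entries behind, as described earlier in this thread):

  -- Each distinct miss in the relation-name lookup leaves a negative syscache
  -- entry in this backend, and nothing evicts those entries today.
  SELECT count(to_regclass(format('no_such_table_%s', i)))
  FROM generate_series(1, 100000) AS i;
  -- The backend's memory footprint grows and stays grown afterwards; this is
  -- the bloat a size limit / LRU eviction is meant to bound.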
Re: de-deduplicate code in DML execution hooks in postgres_fdw
On Wed, Jan 16, 2019 at 02:59:15PM +0900, Etsuro Fujita wrote: > (2019/01/07 20:26), Etsuro Fujita wrote: >> On second thought I'd like to propose another option: >> execute_foreign_modify because I think this would match the existing >> helper functions for updating foreign tables in postgres_fdw.c better, >> such as create_foreign_modify, prepare_foreign_modify and >> finish_foreign_modify. I don't really think the function name should >> contain "one" or "single_row". Like the names of the calling APIs (ie, >> ExecForeignInsert, ExecForeignUpdate and ExecForeignDelete), I think >> it's OK to omit such words from the function name. Here is an updated >> version of the patch. In addition to the naming, I tweaked the function >> a little bit to match other functions (mainly variable names), moved it >> to the place where the helper functions are defined, fiddled with some >> comments, and removed an extra include file that the original patch added. > > If there are no objections, I'll commit that version of the patch. I think that you could use PgFdwModifyState for the second argument of execute_foreign_modify instead of ResultRelInfo. Except for that nit, it looks fine to me, thanks for taking care of it. -- Michael
Re: using expression syntax for partition bounds
Thanks for the review. On 2019/01/15 22:24, Peter Eisentraut wrote: > On 15/01/2019 07:31, Amit Langote wrote: >>> Is "partition bound" the right term? For list partitioning, it's not >>> really a bound. Maybe it's OK. >> >> Hmm, maybe "partition bound value"? Just want to stress that the >> expression specifies "bounding" value of a partition. > > I was more concerned about the term "bound", which it is not range > partitioning. But I can't think of a better term. > > I wouldn't change expr to value as you have done between v7 and v8, > since the point of this patch is to make expressions valid where > previously only values were. OK, will change it back to partition_bound_expr. Removing "bound" from it makes the term ambiguous? >>> I don't see any treatment of non-immutable expressions. There is a test >>> case with a non-immutable cast that was removed, but that's all. There >>> was considerable discussion earlier in the thread on whether >>> non-immutable expressions should be allowed. I'm not sure what the >>> resolution was, but in any case there should be documentation and tests >>> of the outcome. >> >> The partition bound expression is evaluated only once, that is, when the >> partition creation command is executed, and what gets stored in the >> catalog is a Const expression that contains the value that was computed, >> not the original non-immutable expression. So, we don't need to do >> anything special in terms of code to handle users possibly specifying a >> non-immutable expression as partition bound. Tom said something in that >> thread which seems to support such a position: >> >> https://www.postgresql.org/message-id/22534.1523374457%40sss.pgh.pa.us > > OK, if we are going with that approach, then it needs to be documented > and there should be test cases. How about the following note in the documentation: + Although volatile expressions such as + CURRENT_TIMESTAMP can be used + for this, be careful when using them, because + PostgreSQL will evaluate them only once + when adding the partition. Sorry but I'm not sure how or what I would test about this. Maybe, just add a test in create_table.sql/alter_table.sql that shows that using volatile expression doesn't cause an error? >>> The collation handling might need another look. The following is >>> allowed without any diagnostic: >>> >>> CREATE TABLE t2 ( >>> a text COLLATE "POSIX" >>> ) PARTITION BY RANGE (a); >>> CREATE TABLE t2a PARTITION OF t2 FOR VALUES FROM ('foo' COLLATE "C") TO >>> ('xyz'); >>> >>> I think the correct behavior would be that an explicit collation in the >>> partition bound expression is an error. > >> But that's essentially ignoring the collation specified by the user for >> the partition bound value without providing any information about that to >> the user. I've fixed that by explicitly checking if the collate clause in >> the partition bound value expression contains the same collation as the >> parent partition key collation and give an error otherwise. > > I think that needs more refinement. In v8, the following errors > > CREATE TABLE t2 ( a name COLLATE "POSIX" ) PARTITION BY RANGE (a); > CREATE TABLE t2a PARTITION OF t2 FOR VALUES FROM (name 'foo') TO (name > 'xyz'); > ERROR: collation of partition bound value for column "a" does not match > partition key collation "POSIX" > > The problem here is that the "name" type has a collation that is neither > the one of the column nor the default collation. We can allow that. 
So, should the "name" type's collation should simply be discarded in favor of "POSIX" that's being used for partitioning? > What we don't want is someone writing an explicit COLLATE clause. I > think we just need to check that there is no top-level COLLATE clause. > This would then sort of match the logic parse_collate.c for combining > collations. Maybe I'm missing something, but isn't it OK to allow the COLLATE clause as long as it specifies the matching collation as the parent? So: CREATE TABLE t2b PARTITION OF t2 FOR VALUES FROM (name 'bar' collate "C") TO (name 'foo'); ERROR: collation of partition bound value for column "a" does not match partition key collation "POSIX" -- either of the following is ok CREATE TABLE t2b PARTITION OF t2 FOR VALUES FROM (name 'bar' collate "POSIX") TO (name 'foo'); CREATE TABLE t2b PARTITION OF t2 FOR VALUES FROM (name 'bar') TO (name 'foo'); Thanks, Amit