Re: [PATCH v1] Add \echo_stderr to psql
Hello Corey,

> > \warn ...
> > \warning ...
>
> These two seem about the best to me, drawing from the perl warn command.

Yep, I was thinking of perl & gmake. Maybe the 4-letter option is better because it's the same length as "echo".

> I suppose we could go the bash &2 route here, but I don't want to.

I agree on this one.

-- 
Fabien.
Re: standby recovery fails (tablespace related) (tentative patch and discussion)
Hello. At Mon, 22 Apr 2019 12:36:43 +0800, Paul Guo wrote in > Please see my replies inline. Thanks. > > On Fri, Apr 19, 2019 at 12:38 PM Asim R P wrote: > > > On Wed, Apr 17, 2019 at 1:27 PM Paul Guo wrote: > > > > > > create db with tablespace > > > drop database > > > drop tablespace. > > > > Essentially, that sequence of operations causes crash recovery to fail > > if the "drop tablespace" transaction was committed before crashing. > > This is a bug in crash recovery in general and should be reproducible > > without configuring a standby. Is that right? > > > > No. In general, checkpoint is done for drop_db/create_db/drop_tablespace on > master. > That makes the file/directory update-to-date if I understand the related > code correctly. > For standby, checkpoint redo does not ensure that. That's right partly. As you must have seen, fast shutdown forces restartpoint for the last checkpoint and it prevents the problem from happening. Anyway it seems to be a problem. > > Your patch creates missing directories in the destination. Don't we > > need to create the tablespace symlink under pg_tblspc/? I would > > > > 'create db with tablespace' redo log does not include the tablespace real > directory information. > Yes, we could add in it into the xlog, but that seems to be an overdesign. But I don't think creating directory that is to be removed just after is a wanted solution. The directory most likely to be be removed just after. > > prefer extending the invalid page mechanism to deal with this, as > > suggested by Ashwin off-list. It will allow us to avoid creating > > directories and files only to remove them shortly afterwards when the > > drop database and drop tablespace records are replayed. > > > > > 'invalid page' mechanism seems to be more proper for missing pages of a > file. For > missing directories, we could, of course, hack to use that (e.g. reading > any page of > a relfile in that database) to make sure the tablespace create code > (without symlink) > safer (It assumes those directories will be deleted soon). > > More feedback about all of the previous discussed solutions is welcome. It doesn't seem to me that the invalid page mechanism is applicable in straightforward way, because it doesn't consider simple file copy. Drop failure is ignored any time. I suppose we can ignore the error to continue recovering as far as recovery have not reached consistency. The attached would work *at least* your case, but I haven't checked this covers all places where need the same treatment. regards. -- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c index 10a663bae6..0bc63f48da 100644 --- a/src/backend/access/transam/xlogutils.c +++ b/src/backend/access/transam/xlogutils.c @@ -522,6 +522,44 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum, return buffer; } +/* + * XLogMakePGDirectory + * + * There is a possibility that WAL replay causes a creation of the same + * directory left by the previous crash. Issuing ERROR prevents the caller + * from continuing recovery. + * + * To prevent that case, this function issues WARNING instead of ERROR on + * error if consistency is not reached yet. + */ +int +XLogMakePGDirectory(const char *directoryName) +{ + int ret; + + ret = MakePGDirectory(directoryName); + + if (ret != 0) + { + int elevel = ERROR; + + /* + * We might get error trying to create existing directory that is to + * be removed just after. Don't issue ERROR in the case. 
Recovery + * will stop if we again failed after reaching consistency. + */ + if (InRecovery && !reachedConsistency) + elevel = WARNING; + + ereport(elevel, +(errcode_for_file_access(), + errmsg("could not create directory \"%s\": %m", directoryName))); + return ret; + } + + return 0; +} + /* * Struct actually returned by XLogFakeRelcacheEntry, though the declared * return type is Relation. diff --git a/src/backend/storage/file/copydir.c b/src/backend/storage/file/copydir.c index 30f6200a86..0216270dd3 100644 --- a/src/backend/storage/file/copydir.c +++ b/src/backend/storage/file/copydir.c @@ -22,11 +22,11 @@ #include #include +#include "access/xlogutils.h" #include "storage/copydir.h" #include "storage/fd.h" #include "miscadmin.h" #include "pgstat.h" - /* * copydir: copy a directory * @@ -41,10 +41,12 @@ copydir(char *fromdir, char *todir, bool recurse) char fromfile[MAXPGPATH * 2]; char tofile[MAXPGPATH * 2]; - if (MakePGDirectory(todir) != 0) - ereport(ERROR, -(errcode_for_file_access(), - errmsg("could not create directory \"%s\": %m", todir))); + /* + * We might have to skip copydir to continue recovery. See the function + * for details. + */ + if (XLogMakePGDirectory(todir) != 0) + return; xldir = AllocateDir(fromdir); diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h index 0ab5ba62f5..46a7596315 1
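For reference, the failing sequence discussed in this thread boils down to roughly the following sketch (the tablespace path is just an example, and this is not the attached test):

    -- on the primary
    CREATE TABLESPACE ts LOCATION '/path/to/ts';
    -- ... the standby replays this and later takes a restartpoint ...
    CREATE DATABASE db TABLESPACE ts;
    DROP DATABASE db;
    DROP TABLESPACE ts;
    -- If the standby replays all of the above (removing the tablespace
    -- directory) and then crashes, recovery restarts from the restartpoint
    -- and must re-replay the CREATE DATABASE, whose directory copy now
    -- fails because the tablespace directory is gone, stopping recovery.
    -- A forced restartpoint (e.g. a fast shutdown) hides the problem, as
    -- noted above.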
Re: block-level incremental backup
On 22.04.2019 2:02, Robert Haas wrote: On Sat, Apr 20, 2019 at 4:32 PM Stephen Frost wrote: Having been around for a while working on backup-related things, if I was to implement the protocol for pg_basebackup today, I'd definitely implement "give me a list" and "give me this file" rather than the tar-based approach, because I've learned that people want to be able to do parallel backups and that's a decent way to do that. I wouldn't set out and implement something new that's there's just no hope of making parallel. Maybe the first write of pg_basebackup would still be simple and serial since it's certainly more work to make a frontend tool like that work in parallel, but at least the protocol would be ready to support a parallel option being added alter without being rewritten. And that's really what I was trying to get at here- if we've got the choice now to decide what this is going to look like from a protocol level, it'd be great if we could make it able to support being used in a parallel fashion, even if pg_basebackup is still single-threaded. I think we're getting closer to a meeting of the minds here, but I don't think it's intrinsically necessary to rewrite the whole method of operation of pg_basebackup to implement incremental backup in a sensible way. One could instead just do a straightforward extension to the existing BASE_BACKUP command to enable incremental backup. Then, to enable parallel full backup and all sorts of out-of-core hacking, one could expand the command language to allow tools to access individual steps: START_BACKUP, SEND_FILE_LIST, SEND_FILE_CONTENTS, STOP_BACKUP, or whatever. The second thing makes for an appealing project, but I do not think there is a technical reason why it has to be done first. Or for that matter why it has to be done second. As I keep saying, incremental backup and full backup are separate projects and I believe it's completely reasonable for whoever is doing the work to decide on the order in which they would like to do the work. Having said that, I'm curious what people other than Stephen (and other pgbackrest hackers) think about the relative value of parallel backup vs. incremental backup. Stephen appears quite convinced that parallel backup is full of win and incremental backup is a bit of a yawn by comparison, and while I certainly would not want to discount the value of his experience in this area, it sometimes happens on this mailing list that [ drum roll please ] not everybody agrees about everything. So, what do other people think? Based on the experience of pg_probackup users I can say that there is no 100% winer and depending on use case either parallel either incremental backups are preferable. - If size of database is not so larger and intensity of updates is high enough, then parallel backup within one data center is definitely more efficient solution. - If size of database is very large and data is rarely updated or database is mostly append-only, then incremental backup is preferable. - Some customers need to collect at central server backups of databases installed at many nodes with slow and unreliable connection (assume DBMS installed at locomotives). Definitely parallelism can not help here, unlike support of incremental backup. - Parallel backup more aggressively consumes resources of the system, interfering with normal work of application. So performing parallel backup may cause significant degradation of application speed. 
pg_probackup supports both features, parallel and incremental backup, and it is up to the user to combine them in whatever way is most efficient for a particular configuration.

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: standby recovery fails (tablespace related) (tentative patch and discussion)
Oops! The comment in the previous patch is wrong. At Mon, 22 Apr 2019 16:15:13 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI wrote in <20190422.161513.258021727.horiguchi.kyot...@lab.ntt.co.jp> > At Mon, 22 Apr 2019 12:36:43 +0800, Paul Guo wrote in > > > Please see my replies inline. Thanks. > > > > On Fri, Apr 19, 2019 at 12:38 PM Asim R P wrote: > > > > > On Wed, Apr 17, 2019 at 1:27 PM Paul Guo wrote: > > > > > > > > create db with tablespace > > > > drop database > > > > drop tablespace. > > > > > > Essentially, that sequence of operations causes crash recovery to fail > > > if the "drop tablespace" transaction was committed before crashing. > > > This is a bug in crash recovery in general and should be reproducible > > > without configuring a standby. Is that right? > > > > > > > No. In general, checkpoint is done for drop_db/create_db/drop_tablespace on > > master. > > That makes the file/directory update-to-date if I understand the related > > code correctly. > > For standby, checkpoint redo does not ensure that. > > That's right partly. As you must have seen, fast shutdown forces > restartpoint for the last checkpoint and it prevents the problem > from happening. Anyway it seems to be a problem. > > > > Your patch creates missing directories in the destination. Don't we > > > need to create the tablespace symlink under pg_tblspc/? I would > > > > > > > 'create db with tablespace' redo log does not include the tablespace real > > directory information. > > Yes, we could add in it into the xlog, but that seems to be an overdesign. > > But I don't think creating directory that is to be removed just > after is a wanted solution. The directory most likely to be be > removed just after. > > > > prefer extending the invalid page mechanism to deal with this, as > > > suggested by Ashwin off-list. It will allow us to avoid creating > > > > directories and files only to remove them shortly afterwards when the > > > drop database and drop tablespace records are replayed. > > > > > > > > 'invalid page' mechanism seems to be more proper for missing pages of a > > file. For > > missing directories, we could, of course, hack to use that (e.g. reading > > any page of > > a relfile in that database) to make sure the tablespace create code > > (without symlink) > > safer (It assumes those directories will be deleted soon). > > > > More feedback about all of the previous discussed solutions is welcome. > > It doesn't seem to me that the invalid page mechanism is > applicable in straightforward way, because it doesn't consider > simple file copy. > > Drop failure is ignored any time. I suppose we can ignore the > error to continue recovering as far as recovery have not reached > consistency. The attached would work *at least* your case, but I > haven't checked this covers all places where need the same > treatment. The comment for the new function XLogMakePGDirectory is wrong: + * There is a possibility that WAL replay causes a creation of the same + * directory left by the previous crash. Issuing ERROR prevents the caller + * from continuing recovery. The correct one is: + * There is a possibility that WAL replay causes an error by creation of a + * directory under a directory removed before the previous crash. Issuing + * ERROR prevents the caller from continuing recovery. It is fixed in the attached. regards. 
-- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c index 10a663bae6..01331f0da9 100644 --- a/src/backend/access/transam/xlogutils.c +++ b/src/backend/access/transam/xlogutils.c @@ -522,6 +522,44 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum, return buffer; } +/* + * XLogMakePGDirectory + * + * There is a possibility that WAL replay causes an error by creation of a + * directory under a directory removed before the previous crash. Issuing + * ERROR prevents the caller from continuing recovery. + * + * To prevent that case, this function issues WARNING instead of ERROR on + * error if consistency is not reached yet. + */ +int +XLogMakePGDirectory(const char *directoryName) +{ + int ret; + + ret = MakePGDirectory(directoryName); + + if (ret != 0) + { + int elevel = ERROR; + + /* + * We might get error trying to create existing directory that is to + * be removed just after. Don't issue ERROR in the case. Recovery + * will stop if we again failed after reaching consistency. + */ + if (InRecovery && !reachedConsistency) + elevel = WARNING; + + ereport(elevel, +(errcode_for_file_access(), + errmsg("could not create directory \"%s\": %m", directoryName))); + return ret; + } + + return 0; +} + /* * Struct actually returned by XLogFakeRelcacheEntry, though the declared * return type is Relation. diff --git a/src/backend/storage/file/copydir.c b/src/backend/storage/file/copydir.c index 30f6200a86..0216270dd3 100644 --- a/src/backend/storag
display of variables in EXPLAIN VERBOSE
Hi,

ISTM show_plan_tlist()'s rule for whether to show the range table prefix with displayed variables contradicts the description of the VERBOSE option in the EXPLAIN documentation, which is as follows:

===
VERBOSE

Display additional information regarding the plan. Specifically, include the output column list for each node in the plan tree, schema-qualify table and function names, always label variables in expressions with their range table alias, and always print the name of each trigger for which statistics are displayed. This parameter defaults to FALSE.
===

Specifically, the current behavior contradicts the part of the sentence that says "always label variables in expressions with their range table alias". See this example:

create table foo (a int);
create table foo1 () inherits (foo);

-- "a" is not labeled here
explain verbose select * from only foo order by 1;
                           QUERY PLAN
-----------------------------------------------------------------
 Sort  (cost=0.01..0.02 rows=1 width=4)
   Output: a
   Sort Key: foo.a
   ->  Seq Scan on public.foo  (cost=0.00..0.00 rows=1 width=4)
         Output: a
(5 rows)

-- it's labeled in this case
explain verbose select * from foo order by 1;
                              QUERY PLAN
───────────────────────────────────────────────────────────────────────
 Sort  (cost=192.60..198.98 rows=2551 width=4)
   Output: foo.a
   Sort Key: foo.a
   ->  Append  (cost=0.00..48.26 rows=2551 width=4)
         ->  Seq Scan on public.foo  (cost=0.00..0.00 rows=1 width=4)
               Output: foo.a
         ->  Seq Scan on public.foo1  (cost=0.00..35.50 rows=2550 width=4)
               Output: foo1.a
(8 rows)

Seeing that "Sort Key" is always displayed with the range table alias, I checked explain.c to see why the discrepancy exists, and it seems that show_plan_tlist() (and show_tablesample()) use the following condition for whether or not to use the range table prefix:

    useprefix = list_length(es->rtable) > 1;

whereas other functions, including show_sort_group_keys() that prints the "Sort Key", use the following condition:

    useprefix = (list_length(es->rtable) > 1 || es->verbose);

I can think of two ways we could go:

1. Change show_plan_tlist() and show_tablesample() to use the same rule as the others

2. Change the other functions to use the same rule as show_plan_tlist(), also updating the documentation to note the exceptional case when column names are not prefixed

Thoughts?

Thanks,
Amit
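To make option 1 concrete, presumably it is just a matter of making show_plan_tlist() and show_tablesample() compute useprefix the same way as the other show_* functions quoted above (a sketch, not a tested patch):

    -   useprefix = list_length(es->rtable) > 1;
    +   useprefix = (list_length(es->rtable) > 1 || es->verbose);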
Allow any[] as input arguments for sql/plpgsql functions to mimic format()
Hey everyone, I am writing a plpgsql function that (to greatly simplify) raises an exception with a formatted* message. Ideally, I should be able to call it with raise_exception('The person %I has only %I bananas.', 'Fred', 8), which mimics the format(text, any[]) calling convention. Here is where I have encountered a limitation of PostgreSQL's design: https://www.postgresql.org/docs/11/datatype-pseudo.html mentions explicitly that, "At present most procedural languages forbid use of a pseudo-type as an argument type". My reasoning is that I should be able to accept a value of some type if all I do is passing it to a function that accepts exactly that type, such as format(text, any[]). Given the technical reality, I assume that I wouldn't be able to do anything else with that value, but that is fine, since I don't have to do anything with it regardless. BR Michał "phoe" Herda *I do not want to use the obvious solution of raise_exception(format(...)) because the argument to that function is the error ID that is then looked up in a table from which the error message and sqlstate are retrieved. My full code is in the attached SQL file. Once it is executed: SELECT gateway_error('user_does_not_exist', '2'); -- works but is unnatural, SELECT gateway_error('user_does_not_exist', 2); -- is natural but doesn't work. install-errors.sql Description: application/sql
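For illustration, the restriction being referred to can be seen by trying to declare the variadic "any" signature directly in plpgsql (function and argument names are just examples, and the exact error wording may differ):

    -- format(text, VARIADIC "any") is a C function, so it may use the
    -- pseudo-type; the equivalent plpgsql declaration is rejected:
    CREATE FUNCTION gateway_error(error_id text, VARIADIC args "any")
    RETURNS void
    LANGUAGE plpgsql
    AS $$
    BEGIN
      RAISE EXCEPTION 'unreachable';
    END;
    $$;
    -- ERROR:  PL/pgSQL functions cannot accept type "any"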
Re: Allow any[] as input arguments for sql/plpgsql functions to mimic format()
Hi po 22. 4. 2019 v 11:27 odesílatel Michał "phoe" Herda napsal: > Hey everyone, > > I am writing a plpgsql function that (to greatly simplify) raises an > exception with a formatted* message. Ideally, I should be able to call > it with raise_exception('The person %I has only %I bananas.', 'Fred', > 8), which mimics the format(text, any[]) calling convention. > > Here is where I have encountered a limitation of PostgreSQL's design: > https://www.postgresql.org/docs/11/datatype-pseudo.html mentions > explicitly that, "At present most procedural languages forbid use of a > pseudo-type as an argument type". > > My reasoning is that I should be able to accept a value of some type if > all I do is passing it to a function that accepts exactly that type, > such as format(text, any[]). Given the technical reality, I assume that > I wouldn't be able to do anything else with that value, but that is > fine, since I don't have to do anything with it regardless. > > BR > Michał "phoe" Herda > > *I do not want to use the obvious solution of > raise_exception(format(...)) because the argument to that function is > the error ID that is then looked up in a table from which the error > message and sqlstate are retrieved. My full code is in the attached SQL > file. Once it is executed: > > SELECT gateway_error('user_does_not_exist', '2'); -- works but is > unnatural, > SELECT gateway_error('user_does_not_exist', 2); -- is natural but > doesn't work. > It is known problem, and fix is not easy. Any expressions inside plpgsql are simple queries like SELECT expr, and they are executed same pipeline like queries. The plans of these queries are stored and reused. Originally these plans disallow any changes, now some changes are supported, but parameters should be same all time. This is ensured by disallowing "any" type. Other polymorphic types are very static, so there is not described risk. Probably some enhancement can be in this are. The plan can be re-planed after some change - but it can has lot of performance impacts. It is long open topic. Some changes in direction to dynamic languages can increase cost of some future optimization to higher performance :-(. Regards Pavel
Regression test PANICs with master-standby setup on same machine
Hello hackers,

With a master-standby setup configured on the same machine, I'm getting a panic in the tablespace test while running make installcheck.

1. CREATE TABLESPACE regress_tblspacewith LOCATION 'blah';
2. DROP TABLESPACE regress_tblspacewith;
3. CREATE TABLESPACE regress_tblspace LOCATION 'blah';
-- do some operations in this tablespace
4. DROP TABLESPACE regress_tblspace;

The master panics at the last statement when the standby has completed applying the WAL up to step 2 but hasn't started step 3.

PANIC: could not fsync file "pg_tblspc/16387/PG_12_201904072/16384/16446": No such file or directory

The reason is that both tablespaces point to the same location. When the master tries to delete the new tablespace (and requests a checkpoint), the corresponding folder has already been deleted by the standby while applying the WAL to delete the old tablespace. I'm able to reproduce the issue with the attached script.

sh standby-server-setup.sh
make installcheck

I accept that configuring master-standby on the same machine for this test is not okay. But, can we avoid the PANIC somehow? Or, is this intentional and I should not include testtablespace in this case?

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

echo "Cleaning.."
pkill -9 postgres
rm -rf *_logfile
rm /tmp/failover.log
rm -rf /tmp/archive_dir
mkdir /tmp/archive_dir

export PGPORT=5432 #MASTER PORT

PGSQL_DIR=$(pwd)
PGSQL_BIN=$PGSQL_DIR/install/bin
PGSQL_MASTER=$PGSQL_BIN/master #DATA FOLDER FOR PRIMARY/MASTER SERVER
PGSQL_STANDBY=$PGSQL_BIN/standby #DATA FOLDER FOR BACKUP/STANDBY SERVER

#Cleanup the master and slave data directory and create a new one.
rm -rf $PGSQL_MASTER $PGSQL_STANDBY
mkdir $PGSQL_MASTER $PGSQL_STANDBY
chmod 700 $PGSQL_MASTER
chmod 700 $PGSQL_STANDBY

#Initialize MASTER
$PGSQL_BIN/initdb -D $PGSQL_MASTER
echo "wal_level = hot_standby" >> $PGSQL_MASTER/postgresql.conf
echo "max_wal_senders = 3" >> $PGSQL_MASTER/postgresql.conf
echo "wal_keep_segments = 50" >> $PGSQL_MASTER/postgresql.conf
echo "hot_standby = on" >> $PGSQL_MASTER/postgresql.conf
#echo "max_standby_streaming_delay= -1" >> $PGSQL_MASTER/postgresql.conf
#echo "wal_consistency_checking='all'" >> $PGSQL_MASTER/postgresql.conf
echo "max_wal_size= 10GB" >> $PGSQL_MASTER/postgresql.conf
#echo "log_min_messages= debug2" >> $PGSQL_MASTER/postgresql.conf

#Setup replication settings
#Start Master
export PGPORT=5432
echo "Starting Master.."
$PGSQL_BIN/pg_ctl -D $PGSQL_MASTER -c -w -l master_logfile start

#Perform Backup in the Standy Server
$PGSQL_BIN/pg_basebackup --pgdata=$PGSQL_STANDBY -P
echo "primary_conninfo= 'port=5432'" >> $PGSQL_STANDBY/postgresql.conf
touch $PGSQL_STANDBY/standby.signal
#echo "standby_mode = on" >> $PGSQL_STANDBY/postgresql.conf

#Start STANDBY
export PGPORT=5433
echo "Starting Slave.."
$PGSQL_BIN/pg_ctl -D $PGSQL_STANDBY -c -w -l slave_logfile start
Re: bug in update tuple routing with foreign partitions
(2019/04/19 14:39), Etsuro Fujita wrote: (2019/04/19 13:00), Amit Langote wrote: On 2019/04/18 22:10, Etsuro Fujita wrote: On 2019/04/18 14:06, Amit Langote wrote: On 2019/04/17 21:49, Etsuro Fujita wrote: So what I'm thinking is to throw an error for cases like this. (Though, I think we should keep to allow rows to be moved from local partitions to a foreign-table subplan targetrel that has been updated already.) Hmm, how would you distinguish (totally inside postgred_fdw I presume) the two cases? One thing I came up with to do that is this: @@ -1917,6 +1937,23 @@ postgresBeginForeignInsert(ModifyTableState *mtstate, initStringInfo(&sql); + /* + * If the foreign table is an UPDATE subplan resultrel that hasn't yet + * been updated, routing tuples to the table might yield incorrect + * results, because if routing tuples, routed tuples will be mistakenly + * read from the table and updated twice when updating the table --- it + * would be nice if we could handle this case; but for now, throw an error + * for safety. + */ + if (plan&& plan->operation == CMD_UPDATE&& + (resultRelInfo->ri_usesFdwDirectModify || + resultRelInfo->ri_FdwState)&& + resultRelInfo> mtstate->resultRelInfo + mtstate->mt_whichplan) + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("cannot route tuples into foreign table to be updated \"%s\"", + RelationGetRelationName(rel; Oh, I see. So IIUC, you're making postgresBeginForeignInsert() check two things: 1. Whether the foreign table it's about to begin inserting (moved) rows into is a subplan result rel, checked by (resultRelInfo->ri_usesFdwDirectModify || resultRelInfo->ri_FdwState) 2. Whether the foreign table it's about to begin inserting (moved) rows into will be updated later, checked by (resultRelInfo> mtstate->resultRelInfo + mtstate->mt_whichplan) This still allows a foreign table to receive rows moved from the local partitions if it has already finished the UPDATE operation. Seems reasonable to me. Forgot to say that since this is a separate issue from the original bug report, maybe we can first finish fixing the latter. What to do you think? Yeah, I think we can do that, but my favorite would be to fix the two issues in a single shot, because they seem to me rather close to each other. So I updated a new version in a single patch, which I'm attaching. I agree that we can move to fix the other issue right away as the fix you outlined above seems reasonable, but I wonder if it wouldn't be better to commit the two fixes separately? The two fixes, although small, are somewhat complicated and combining them in a single commit might be confusing. Also, a combined commit might make it harder for the release note author to list down the exact set of problems being fixed. OK, I'll separate the patch into two. After I tried to separate the patch a bit, I changed my mind: I think we should commit the two issues in a single patch, because the original issue that overriding fmstate for the UPDATE operation mistakenly by fmstate for the INSERT operation caused backend crash is fixed by what I proposed above. So I add the commit message to the previous single patch, as you suggested. 
Some mostly cosmetic comments on the code changes: * In the following comment: + /* + * If the foreign table is an UPDATE subplan resultrel that hasn't yet + * been updated, routing tuples to the table might yield incorrect + * results, because if routing tuples, routed tuples will be mistakenly + * read from the table and updated twice when updating the table --- it + * would be nice if we could handle this case; but for now, throw an error + * for safety. + */ the part that start with "because if routing tuples..." reads a bit unclear to me. How about writing this as: /* * If the foreign table we are about to insert routed rows into is * also an UPDATE result rel and the UPDATE hasn't been performed yet, * proceeding with the INSERT will result in the later UPDATE * incorrectly modifying those routed rows, so prevent the INSERT --- * it would be nice if we could handle this case, but for now, throw * an error for safety. */ I think that's better than mine; will use that wording. Done. I simplified your wording a little bit, though. I see that in all the hunks involving some manipulation of aux_fmstate, there's a comment explaining what it is, which seems a bit repetitive. I can see more or less the same explanation in postgresExecForeignInsert(), postgresBeginForeignInsert(), and postgresEndForeignInsert(). Maybe just keep the description in postgresBeginForeignInsert as follows: @@ -1983,7 +2020,19 @@ postgresBeginForeignInsert(ModifyTableState *mtstate, retrieved_attrs != NIL, retrieved_attrs); - resultRelInfo->ri_FdwState = fmstate; + /* + * If the given resultRelInfo already has PgFdwModifyState set, it means + * the foreign table is an UPDATE subplan resultrel; in which case, store + * the resulting state into the aux_fmstate of th
Re: standby recovery fails (tablespace related) (tentative patch and discussion)
At Mon, 22 Apr 2019 16:40:27 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI wrote in <20190422.164027.33866403.horiguchi.kyot...@lab.ntt.co.jp> > At Mon, 22 Apr 2019 16:15:13 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI > wrote in > <20190422.161513.258021727.horiguchi.kyot...@lab.ntt.co.jp> > > At Mon, 22 Apr 2019 12:36:43 +0800, Paul Guo wrote in > > > > > Please see my replies inline. Thanks. > > > > > > On Fri, Apr 19, 2019 at 12:38 PM Asim R P wrote: > > > > > > > On Wed, Apr 17, 2019 at 1:27 PM Paul Guo wrote: > > > > > > > > > > create db with tablespace > > > > > drop database > > > > > drop tablespace. > > > > > > > > Essentially, that sequence of operations causes crash recovery to fail > > > > if the "drop tablespace" transaction was committed before crashing. > > > > This is a bug in crash recovery in general and should be reproducible > > > > without configuring a standby. Is that right? > > > > > > > > > > No. In general, checkpoint is done for drop_db/create_db/drop_tablespace > > > on > > > master. > > > That makes the file/directory update-to-date if I understand the related > > > code correctly. > > > For standby, checkpoint redo does not ensure that. The attached exercises this sequence, needing some changes in PostgresNode.pm and RecursiveCopy.pm to allow tablespaces. regards. -- Kyotaro Horiguchi NTT Open Source Software Center >From dbe6306a730f94a5bd8beaf0e534c28ebdd815d4 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Mon, 22 Apr 2019 20:10:20 +0900 Subject: [PATCH 1/2] Allow TAP test to excecise tablespace. To perform tablespace related checks, this patch lets PostgresNode::backup have a new parameter "tablespace_mapping", and make init_from_backup handle capable to handle a backup created using tablespace_mapping. --- src/test/perl/PostgresNode.pm | 10 -- src/test/perl/RecursiveCopy.pm | 33 + 2 files changed, 37 insertions(+), 6 deletions(-) diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm index 76874141c5..59a939821d 100644 --- a/src/test/perl/PostgresNode.pm +++ b/src/test/perl/PostgresNode.pm @@ -540,13 +540,19 @@ target server since it isn't done by default. sub backup { - my ($self, $backup_name) = @_; + my ($self, $backup_name, %params) = @_; my $backup_path = $self->backup_dir . '/' . $backup_name; my $name= $self->name; + my @rest = (); + + if (defined $params{tablespace_mapping}) + { + push(@rest, "--tablespace-mapping=$params{tablespace_mapping}"); + } print "# Taking pg_basebackup $backup_name from node \"$name\"\n"; TestLib::system_or_bail('pg_basebackup', '-D', $backup_path, '-h', - $self->host, '-p', $self->port, '--no-sync'); + $self->host, '-p', $self->port, '--no-sync', @rest); print "# Backup finished\n"; return; } diff --git a/src/test/perl/RecursiveCopy.pm b/src/test/perl/RecursiveCopy.pm index baf5d0ac63..c912ce412d 100644 --- a/src/test/perl/RecursiveCopy.pm +++ b/src/test/perl/RecursiveCopy.pm @@ -22,6 +22,7 @@ use warnings; use Carp; use File::Basename; use File::Copy; +use TestLib; =pod @@ -97,14 +98,38 @@ sub _copypath_recurse # invoke the filter and skip all further operation if it returns false return 1 unless &$filterfn($curr_path); - # Check for symlink -- needed only on source dir - # (note: this will fall through quietly if file is already gone) - croak "Cannot operate on symlink \"$srcpath\"" if -l $srcpath; - # Abort if destination path already exists. Should we allow directories # to exist already? 
croak "Destination path \"$destpath\" already exists" if -e $destpath; + # Check for symlink -- needed only on source dir + # (note: this will fall through quietly if file is already gone) + if (-l $srcpath) + { + croak "Cannot operate on symlink \"$srcpath\"" + if ($srcpath !~ /\/(pg_tblspc\/[0-9]+)$/); + + # We have mapped tablespaces. Copy them individually + my $linkname = $1; + my $tmpdir = TestLib::tempdir; + my $dstrealdir = TestLib::real_dir($tmpdir); + my $srcrealdir = readlink($srcpath); + + opendir(my $dh, $srcrealdir); + while (readdir $dh) + { + next if (/^\.\.?$/); + my $spath = "$srcrealdir/$_"; + my $dpath = "$dstrealdir/$_"; + + copypath($spath, $dpath); + } + closedir $dh; + + symlink $dstrealdir, $destpath; + return 1; + } + # If this source path is a file, simply copy it to destination with the # same name and we're done. if (-f $srcpath) -- 2.16.3 >From 382910fbe3738c9098c0568cdc992928f471c7c5 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Mon, 22 Apr 2019 20:10:25 +0900 Subject: [PATCH 2/2] Add check for recovery failure caused by tablespace. Removal of a tablespace on master can cause recovery failure on standby. This patch adds the check for the case. --- src/test/recovery/t/011_crash_recovery.pl | 52 ++- 1 file changed, 51 insertions(+), 1 deletion(-) diff --git a/src/test/recovery/t/011_crash_recovery.p
Re: Regression test PANICs with master-standby setup on same machine
Hello. At Mon, 22 Apr 2019 15:52:59 +0530, Kuntal Ghosh wrote in > Hello hackers, > > With a master-standby setup configured on the same machine, I'm > getting a panic in tablespace test while running make installcheck. > > 1. CREATE TABLESPACE regress_tblspacewith LOCATION 'blah'; > 2. DROP TABLESPACE regress_tblspacewith; > 3. CREATE TABLESPACE regress_tblspace LOCATION 'blah'; > -- do some operations in this tablespace > 4. DROP TABLESPACE regress_tblspace; > > The master panics at the last statement when standby has completed > applying the WAL up to step 2 but hasn't started step 3. > PANIC: could not fsync file > "pg_tblspc/16387/PG_12_201904072/16384/16446": No such file or > directory > > The reason is both the tablespace points to the same location. When > master tries to delete the new tablespace (and requests a checkpoint), > the corresponding folder is already deleted by the standby while > applying WAL to delete the old tablespace. I'm able to reproduce the > issue with the attached script. > > sh standby-server-setup.sh > make installcheck > > I accept that configuring master-standby on the same machine for this > test is not okay. But, can we avoid the PANIC somehow? Or, is this > intentional and I should not include testtablespace in this case? If you don't have a problem using TAP test suite, tablespace is allowed with a bit restricted steps using the first patch in my just posted patchset[1]. This will work for you if you are okay with creating a standby after creating a tablespace. See the second patch in the patchset. If you stick on shell script, the following steps allow tablespaces. 1. Create tablespace directories for both master and standby. 2. Create a master then start. 3. Create tablespaces on the master. 4. Create a standby using pg_basebackup --tablespace_mapping== 5. Start the standby. [1] https://www.postgresql.org/message-id/20190422.211933.156769089.horiguchi.kyot...@lab.ntt.co.jp regards. -- Kyotaro Horiguchi NTT Open Source Software Center
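Spelled out as commands, those steps might look roughly like this (directories and ports are examples; the pg_basebackup option is --tablespace-mapping):

    # 1. tablespace directories for both master and standby
    mkdir -p /tmp/ts_master /tmp/ts_standby

    # 2. create a master, then start it (port 5432)
    initdb -D master
    pg_ctl -D master -l master.log start

    # 3. create the tablespace on the master
    psql -p 5432 -c "CREATE TABLESPACE ts LOCATION '/tmp/ts_master'"

    # 4. create a standby, remapping the tablespace directory
    pg_basebackup -p 5432 -D standby -R \
        --tablespace-mapping=/tmp/ts_master=/tmp/ts_standby

    # 5. start the standby on another port
    echo "port = 5433" >> standby/postgresql.conf
    pg_ctl -D standby -l standby.log start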
Re: pgsql: Allow insert and update tuple routing and COPY for foreign table
(2019/04/20 20:53), Laurenz Albe wrote: On Fri, 2018-04-06 at 23:24 +, Robert Haas wrote: Allow insert and update tuple routing and COPY for foreign tables. Also enable this for postgres_fdw. Etsuro Fujita, based on an earlier patch by Amit Langote. The larger patch series of which this is a part has been reviewed by Amit Langote, David Fetter, Maksim Milyutin, Álvaro Herrera, Stephen Frost, and me. Minor documentation changes to the final version by me. Discussion: http://postgr.es/m/29906a26-da12-8c86-4fb9-d8f88442f...@lab.ntt.co.jp This commit makes life hard for foreign data wrappers that support data modifications. If a FDW implements ExecForeignInsert, this commit automatically assumes that it also supports COPY FROM. It will call ExecForeignInsert without calling PlanForeignModify and BeginForeignModify, and a FDW that does not expect that will probably fail. This is not 100% correct; the FDW documentation says: Tuples inserted into a partitioned table by INSERT or COPY FROM are routed to partitions. If an FDW supports routable foreign-table partitions, it should also provide the following callback functions. These functions are also called when COPY FROM is executed on a foreign table. maybe there are FDWs that support INSERT but don't want to support COPY for some reason. I agree on that point. I propose that PostgreSQL only allows COPY FROM on a foreign table if the FDW implements BeginForeignInsert. The attached patch implements that. I don't think that is a good idea, because there might be some FDWs that want to support COPY FROM on foreign tables without providing BeginForeignInsert. (As for INSERT into foreign tables, we actually allow FDWs to support it without providing PlanForeignModify, BeginForeignModify, or EndForeignModify.) It's permissible to throw an error in BeginForeignInsert, so what I was thinking for FDWs that don't want to support COPY FROM and INSERT/UPDATE/COPY FROM tuple routing was to provide BeginForeignInsert implementing something like this: static void fooBeginForeignInsert(ModifyTableState *mtstate, ResultRelInfo *resultRelInfo) { Relationrel = resultRelInfo->ri_RelationDesc; if (mtstate->ps.plan == NULL) ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("cannot copy to foreign table \"%s\"", RelationGetRelationName(rel; else ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("cannot route tuples into foreign table \"%s\"", RelationGetRelationName(rel; } Best regards, Etsuro Fujita
Re: Regression test PANICs with master-standby setup on same machine
On Mon, Apr 22, 2019 at 6:07 PM Kyotaro HORIGUCHI wrote: > If you don't have a problem using TAP test suite, tablespace is > allowed with a bit restricted steps using the first patch in my > just posted patchset[1]. This will work for you if you are okay > with creating a standby after creating a tablespace. See the > second patch in the patchset. > > If you stick on shell script, the following steps allow tablespaces. > > 1. Create tablespace directories for both master and standby. > 2. Create a master then start. > 3. Create tablespaces on the master. > 4. Create a standby using pg_basebackup --tablespace_mapping== > 5. Start the standby. > > [1] > https://www.postgresql.org/message-id/20190422.211933.156769089.horiguchi.kyot...@lab.ntt.co.jp > Thank you for the info. I'll try the same. -- Thanks & Regards, Kuntal Ghosh EnterpriseDB: http://www.enterprisedb.com
Re: Runtime pruning problem
On Fri, 19 Apr 2019 at 05:25, Tom Lane wrote: > So what I'm thinking is that I made a bad decision in 1cc29fe7c, > which did this: > > ... In passing, simplify the EXPLAIN code by > having it deal primarily in the PlanState tree rather than separately > searching Plan and PlanState trees. This is noticeably cleaner for > subplans, and about a wash elsewhere. > > It was definitely silly to have the recursion in explain.c passing down > both Plan and PlanState nodes, when the former is always easily accessible > from the latter. So that was an OK change, but at the same time I changed > ruleutils.c to accept PlanState pointers not Plan pointers from explain.c, > and that is now looking like a bad idea. If we were to revert that > decision, then instead of assuming that an AppendState always has at least > one live child, we'd only have to assume that an Append has at least one > live child. Which is true. > > I don't recall that there was any really strong reason for switching > ruleutils' API like that, although maybe if we look harder we'll find one. > I think it was mainly just for consistency with the way that explain.c > now looks at the world; which is not a negligible consideration, but > it's certainly something we could overrule. I started working on this today and I've attached what I have so far. For a plan like the following, as shown by master's EXPLAIN, we get: postgres=# explain verbose select *,(select * from t1 where t1.a=listp.a) z from listp where a = three() order by z; QUERY PLAN -- Sort (cost=1386.90..1386.95 rows=22 width=12) Output: listp1.a, listp1.b, ((SubPlan 1)) Sort Key: ((SubPlan 1)) -> Append (cost=0.00..1386.40 rows=22 width=12) Subplans Removed: 1 -> Seq Scan on public.listp1 (cost=0.00..693.15 rows=11 width=12) Output: listp1.a, listp1.b, (SubPlan 1) Filter: (listp1.a = three()) SubPlan 1 -> Index Only Scan using t1_pkey on public.t1 (cost=0.15..8.17 rows=1 width=4) Output: t1.a Index Cond: (t1.a = listp1.a) (12 rows) With the attached we end up with: postgres=# explain verbose select *,(select * from t1 where t1.a=listp.a) z from listp where a = three() order by z; QUERY PLAN - Sort (cost=1386.90..1386.95 rows=22 width=12) Output: listp1.a, listp1.b, ((SubPlan 1)) Sort Key: ((SubPlan 1)) -> Append (cost=0.00..1386.40 rows=22 width=12) Subplans Removed: 2 (5 rows) notice the reference to SubPlan 1, but no definition of what Subplan 1 actually is. I don't think this is particularly good, but not all that sure what to do about it. The code turned a bit more complex than I'd have hoped. In order to still properly resolve the parameters in find_param_referent() I had to keep the ancestor list, but also had to add an ancestor_plan list so that we can properly keep track of the Plan node parents too. Remember that a Plan node may not have a corresponding PlanState node if the state was never initialized. For Append and MergeAppend I ended up always using the first of the initialized subnodes if at least one is present and only resorted to using the first planned subnode if there are no subnodes in the AppendState/MergeAppendState. Without this, the Vars shown in the MergeAppend sort keys lost their alias prefix if the first subplan happened to have been pruned, but magically would gain it again if it was some other node that was pruned. This was just a bit too weird, so I ended up making a special case for this. 
-- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services wip_improve_runtime_pruning_explain_output.patch Description: Binary data
Re: Unhappy about API changes in the no-fsm-for-small-rels patch
On Fri, Apr 19, 2019 at 2:46 PM Amit Kapila wrote: > > > > > I am thinking that we should at least give it a try to move the map to > > > rel cache level to see how easy or difficult it is and also let's wait > > > for a day or two to see if Andres/Tom has to say anything about this > > > or on the response by me above to improve the current patch. > > > > Since we have a definite timeline, I'm okay with that, although I'm > > afraid I'm not quite knowledgeable enough to help much with the > > relcache piece. > > > > Okay, I can try to help. I think you can start by looking at > RelationData members (for ex. see how we cache index's metapage in > rd_amcache) and study a bit about routines in relcache.h. > Attached is a hacky and work-in-progress patch to move fsm map to relcache. This will give you some idea. I think we need to see if there is a need to invalidate the relcache due to this patch. I think some other comments of Andres also need to be addressed, see if you can attempt to fix some of them. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com store_local_map_relcache_v1.patch Description: Binary data
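For reference, the rd_amcache pattern Amit points to (nbtree's metapage cache) looks roughly like the sketch below; a relation-level FSM map could presumably be cached the same way, though a plain table has no rd_indexcxt, so it would need another suitable memory context. This is a sketch of the general pattern, not code from the attached patch:

    /*
     * Cache data in the relcache entry: allocate it in a context tied to
     * the entry so that it goes away automatically when the entry is
     * rebuilt or dropped.  "metad" here is assumed to point at metapage
     * data just read from the buffer.
     */
    if (rel->rd_amcache == NULL)
    {
        rel->rd_amcache = MemoryContextAlloc(rel->rd_indexcxt,
                                             sizeof(BTMetaPageData));
        memcpy(rel->rd_amcache, metad, sizeof(BTMetaPageData));
    }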
Re: display of variables in EXPLAIN VERBOSE
On Mon, 22 Apr 2019 at 19:49, Amit Langote wrote: > Seeing that "Sort Key" is always displayed with the range table alias, I > checked explain.c to see why the discrepancy exists and it seems that > show_plan_tlist() (and show_tablesample()) use the following condition for > whether or not to use the range table prefix: > > useprefix = list_length(es->rtable) > 1; > > whereas other functions, including show_sort_group_keys() that prints the > "Sort Key", use the following condition: > > useprefix = (list_length(es->rtable) > 1 || es->verbose); > > I can think of two ways we could do: > > 1. Change show_plan_tlist() and show_tablesample() to use the same rule as > others > > 2. Change other functions to use the same rule as show_plan_tlist(), also > updating the documentation to note the exceptional case when column names > are not prefixed I'd vote to make the code match the documentation, but probably implement it by adding a new field to ExplainState and just calculate what to do once in ExplainQuery() instead of calculating what to do in various random places. I don't think we should backpatch this change, likely it would be better to keep the explain output as stable as possible in the back branches, so that might mean a documentation tweak should be done for them. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
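A sketch of that suggestion, with an invented field name, could be as small as this:

    /* in struct ExplainState (explain.h) */
    bool    useprefix;      /* label Vars with their range table alias? */

    /* computed once, somewhere after es->rtable has been set */
    es->useprefix = (list_length(es->rtable) > 1 || es->verbose);

and then the various show_* functions would simply test es->useprefix instead of recomputing the condition locally.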
Re: Do CustomScan as not projection capable node
On Fri, Apr 19, 2019 at 12:45 AM Tom Lane wrote: > I don't buy this for a minute. Where do you think projection is > going to happen? There isn't any existing node type that *couldn't* > support projection if we insisted that that be done across-the-board. > I think it's mostly just a legacy thing that some don't. I think there may actually be some good reasons for that. If something like an Append or Material node projects, it seems to me that this means that we built the wrong tlist for its input(s). That justification doesn't apply to custom scans, though. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [PATCH v1] Add \echo_stderr to psql
On Mon, Apr 22, 2019 at 09:04:08AM +0200, Fabien COELHO wrote: > > Hello Corey, > > > >\warn ... > > >\warning ... > > > > These two seem about the best to me, drawing from the perl warn command. > > Yep, I was thinking of perl & gmake. Maybe the 4 letter option is better > because its the same length as "echo". > > > I suppose we could go the bash &2 route here, but I don't want to. > > I agree on this one. Please find attached v2, name is now \warn. How might we test this portably? Best, David. -- David Fetter http://fetter.org/ Phone: +1 415 235 3778 Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate >From db987e21d0e25ce306d5482fa78d71de8454f1aa Mon Sep 17 00:00:00 2001 From: David Fetter Date: Sun, 21 Apr 2019 11:08:40 -0700 Subject: [PATCH v2] Add \warn to psql To: hackers MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="2.20.1" This is a multi-part message in MIME format. --2.20.1 Content-Type: text/plain; charset=UTF-8; format=fixed Content-Transfer-Encoding: 8bit - Does what it says on the label - Do we have any way to test this? diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml index b86764003d..4edf8e67a1 100644 --- a/doc/src/sgml/ref/psql-ref.sgml +++ b/doc/src/sgml/ref/psql-ref.sgml @@ -1879,6 +1879,26 @@ Tue Oct 26 21:40:57 CEST 1999 + +\warn text [ ... ] + + +Prints the arguments to the standard error, separated by one +space and followed by a newline. This can be useful to +intersperse information in the output of scripts when not polluting +standard output is desired. For example: + + +=> \echo :variable +=> \warn `date` +Sun Apr 21 10:48:11 PDT 2019 + +If the first argument is an unquoted -n the trailing +newline is not written. + + + + \ef function_description line_number diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c index 8254d61099..9425f02005 100644 --- a/src/bin/psql/command.c +++ b/src/bin/psql/command.c @@ -319,7 +319,7 @@ exec_command(const char *cmd, status = exec_command_ef_ev(scan_state, active_branch, query_buf, true); else if (strcmp(cmd, "ev") == 0) status = exec_command_ef_ev(scan_state, active_branch, query_buf, false); - else if (strcmp(cmd, "echo") == 0 || strcmp(cmd, "qecho") == 0) + else if (strcmp(cmd, "echo") == 0 || strcmp(cmd, "warn") == 0 || strcmp(cmd, "qecho") == 0) status = exec_command_echo(scan_state, active_branch, cmd); else if (strcmp(cmd, "elif") == 0) status = exec_command_elif(scan_state, cstack, query_buf); @@ -1114,7 +1114,7 @@ exec_command_ef_ev(PsqlScanState scan_state, bool active_branch, } /* - * \echo and \qecho -- echo arguments to stdout or query output + * \echo, \warn and \qecho -- echo arguments to stdout or query output */ static backslashResult exec_command_echo(PsqlScanState scan_state, bool active_branch, const char *cmd) @@ -1129,6 +1129,8 @@ exec_command_echo(PsqlScanState scan_state, bool active_branch, const char *cmd) if (strcmp(cmd, "qecho") == 0) fout = pset.queryFout; + else if (strcmp(cmd, "warn") == 0) + fout = stderr; else fout = stdout; diff --git a/src/bin/psql/help.c b/src/bin/psql/help.c index d6d41b51d5..2ecb8b73c1 100644 --- a/src/bin/psql/help.c +++ b/src/bin/psql/help.c @@ -207,6 +207,7 @@ slashUsage(unsigned short int pager) fprintf(output, _("Input/Output\n")); fprintf(output, _(" \\copy ... 
perform SQL COPY with data stream to the client host\n")); fprintf(output, _(" \\echo [STRING] write string to standard output\n")); + fprintf(output, _(" \\warn [STRING] write string to standard error\n")); fprintf(output, _(" \\i FILEexecute commands from file\n")); fprintf(output, _(" \\ir FILE as \\i, but relative to location of current script\n")); fprintf(output, _(" \\o [FILE] send all query results to file or |pipe\n")); diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c index bcddc7601e..383a17a7f4 100644 --- a/src/bin/psql/tab-complete.c +++ b/src/bin/psql/tab-complete.c @@ -1421,7 +1421,7 @@ psql_completion(const char *text, int start, int end) "\\t", "\\T", "\\timing", "\\unset", "\\x", - "\\w", "\\watch", + "\\w","\\warn", "\\watch", "\\z", "\\!", "\\?", NULL --2.20.1--
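With the patch applied, a quick manual check (not a portable regression test, but it shows the behavior) could be:

    # stdout is discarded, so only stderr output survives; \warn should
    # still print "hello", while the same command with \echo prints nothing
    psql -X -c '\warn hello' > /dev/null
    psql -X -c '\echo hello' > /dev/null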
Re: proposal: psql PSQL_TABULAR_PAGER variable
On Thu, Apr 18, 2019 at 11:46 AM Pavel Stehule wrote: > My idea is following - pseudocode > > if view is a table > { > if is_defined PSQL_TABULAR_PAGER > { > pager = PSQL_TABULAR_PAGER > } > else if is_defined PSQL_PAGER > { > pager = PSQL_PAGER > } > else > { > pager = PAGER > } > } > else /* for \h xxx */ > { > if is_defined PSQL_PAGER > { > pager = PSQL_PAGER > } > else > { > pager = PAGER > } > Seems like pspg could just hand off to the regular pager if it discovers that the input is not in a format it finds suitable. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: proposal: psql PSQL_TABULAR_PAGER variable
Robert Haas writes: > Seems like pspg could just hand off to the regular pager if it > discovers that the input is not in a format it finds suitable. It might be slightly tricky to do that after having already consumed some of the input :-(. Still, I've got to say that I find this proposal pretty horrid. I already thought that PSQL_PAGER was a dubious idea: what other program do you know anywhere that isn't satisfied with PAGER? Inventing still more variables of the same ilk is making it even messier, and more obviously poorly designed, and more obviously likely to end up with forty-nine different variables for slightly different purposes. I think that the general problem here is "we need psql to be able to give some context info to pspg", and the obvious way to handle that is to make a provision for arguments on pspg's command line. That is, instead of just calling "pspg", call "pspg table" or "pspg help" etc etc, with the understanding that the set of context words could be extended over time. We could shoehorn this into what we already have by saying that PSQL_PAGER is interpreted as a format, and if it contains say "%c" then replace that with a context word (and again, there's room for more format codes over time). Probably best *not* to apply such an interpretation to PAGER, though. Whether the whole problem is really worth this much infrastructure is a fair question. But if we're going to do something, I'd rather go down a path like this than inventing a new environment variable every month. regards, tom lane
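To make that concrete, the setting under such a scheme might look like this (the %c code and the context words are hypothetical, per the proposal above):

    # psql would replace %c with a context word before invoking the pager,
    # so a table result runs "pspg table" and \h output runs "pspg help"
    export PSQL_PAGER='pspg %c'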
Re: proposal: psql PSQL_TABULAR_PAGER variable
po 22. 4. 2019 v 15:46 odesílatel Robert Haas napsal: > On Thu, Apr 18, 2019 at 11:46 AM Pavel Stehule > wrote: > > My idea is following - pseudocode > > > > if view is a table > > { > > if is_defined PSQL_TABULAR_PAGER > > { > > pager = PSQL_TABULAR_PAGER > > } > > else if is_defined PSQL_PAGER > > { > > pager = PSQL_PAGER > > } > > else > > { > > pager = PAGER > > } > > } > > else /* for \h xxx */ > > { > > if is_defined PSQL_PAGER > > { > > pager = PSQL_PAGER > > } > > else > > { > > pager = PAGER > > } > >
> Seems like pspg could just hand off to the regular pager if it
> discovers that the input is not in a format it finds suitable.

This is possible, and I wrote it. But it is a "little bit" strange to start another pager from a pager. I think task-oriented pagers can enhance the user experience of TUI applications - and there is a big space for enhancement. Currently pspg has to reparse the data, and there are some heuristics to detect the format. It would be nice if psql could send some additional info about the data. Maybe psql could send raw data, and the printing/formatting could be done on the parser side.

Pavel

> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
Re: proposal: psql PSQL_TABULAR_PAGER variable
po 22. 4. 2019 v 16:21 odesílatel Tom Lane napsal: > Robert Haas writes: > > Seems like pspg could just hand off to the regular pager if it > > discovers that the input is not in a format it finds suitable. > > It might be slightly tricky to do that after having already consumed > some of the input :-(. > pspg supports both direction scrolling, so data are in buffer, and can be displayed again. > Still, I've got to say that I find this proposal pretty horrid. > I already thought that PSQL_PAGER was a dubious idea: what other > program do you know anywhere that isn't satisfied with PAGER? > Inventing still more variables of the same ilk is making it even > messier, and more obviously poorly designed, and more obviously > likely to end up with forty-nine different variables for slightly > different purposes. > The programs with some complex output usually doesn't use a pagers - or use pagers only for part of their output. Initially I would to teach "less" to support tabular data - but the after some initial research I found so I am not able to modify "less". > I think that the general problem here is "we need psql to be able to > give some context info to pspg", and the obvious way to handle that > is to make a provision for arguments on pspg's command line. That > is, instead of just calling "pspg", call "pspg table" or "pspg help" > etc etc, with the understanding that the set of context words could > be extended over time. We could shoehorn this into what we already > have by saying that PSQL_PAGER is interpreted as a format, and if > it contains say "%c" then replace that with a context word (and > again, there's room for more format codes over time). Probably > best *not* to apply such an interpretation to PAGER, though. > It can be a way. There are some issues unfixable on pager side - like dynamic column resizing when FETCH_COUNT > 0 and some others. I can imagine a situation, when psql send just raw data in some easy machine readable format (like CSV), and specialized pager can format these data, and can support some interactive work (hiding columns, columns switch, ..) Regards Pavel > > Whether the whole problem is really worth this much infrastructure > is a fair question. But if we're going to do something, I'd rather > go down a path like this than inventing a new environment variable > every month. > > regards, tom lane >
Re: pg_dump is broken for partition tablespaces
On Wed, Apr 17, 2019 at 6:06 PM Alvaro Herrera wrote: > > > 3. Partitioned relations can have the database tablespace in > > >pg_class.reltablespace, as opposed to storage-bearing relations which > > >cannot. This is useful to be able to put partitions in the database > > >tablespace even if the default_tablespace is set to something else. > > > > I still feel that this is a darn bad idea. It goes against the rule > > that's existed for pg_class.reltablespace since its beginning, and > > I think it's inevitable that that's going to break something. > > Yes, this deviates from current practice, and while I tested this in as > many ways as I could think of, I cannot deny that it might break > something unexpectedly. Like Tom, I think this has got to be broken. Suppose that you have a partitioned table which has reltablespace = dattablespace. It has a bunch of children with reltablespace = 0. So far so good: new children of the partitioned table go into the database tablespace regardless of default_tablespace. Now somebody changes the default tablespace using ALTER DATABASE .. SET TABLESPACE. All the existing children end up in the new default tablespace, but new children of the partitioned table end up going into the tablespace that used to be the default but is no longer. That's pretty odd, because the whole point of setting a tablespace on the children was to get all of the children into the same tablespace. The same thing happens if you clone the database using CREATE DATABASE .. TEMPLATE .. TABLESPACE, as Tom mentioned. PostgreSQL has historically and very deliberately *not made a distinction* between "this object is in the default tablespace" and "this object is in tablespace X which happens to be the default." I think that it's too late to invent such a distinction for reasons of backward compatibility -- and if we were going to do it, surely it would need to exist for both partitioned tables and the partitions themselves. Otherwise it just produces more strange inconsistencies. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
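A sketch of the sequence being described (all names invented; "db_ts" is the tablespace the database was created in, "new_ts" is the one it is moved to):

    -- partitioned parent explicitly pinned to the database tablespace;
    -- its partitions are created with reltablespace = 0
    CREATE TABLE parent (a int) PARTITION BY LIST (a) TABLESPACE db_ts;
    CREATE TABLE part1 PARTITION OF parent FOR VALUES IN (1);

    -- run from another database, with no connections to this one
    ALTER DATABASE mydb SET TABLESPACE new_ts;

    -- part1 has moved to new_ts along with the rest of the database, but a
    -- new partition still lands in db_ts, because the parent remembered the
    -- old default explicitly
    CREATE TABLE part2 PARTITION OF parent FOR VALUES IN (2);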
Re: Thoughts on nbtree with logical/varwidth table identifiers, v12 on-disk representation
Greetings, * Peter Geoghegan (p...@bowt.ie) wrote: > Andres has suggested that I work on teaching nbtree to accommodate > variable-width, logical table identifiers, such as those required for > indirect indexes, or clustered indexes, where secondary indexes must > use a logical primary key value instead of a heap TID. I'm not > currently committed to working on this as a project, but I definitely > don't want to make it any harder. This has caused me to think about > the problem as it relates to the new on-disk representation for v4 > nbtree indexes in Postgres 12. I do have a minor misgiving about one > particular aspect of what I came up with: The precise details of how > we represent heap TID in pivot tuples seems like it might make things > harder than they need to be for a future logical/varwidth table > identifier project. This probably isn't worth doing anything about > now, but it seems worth discussing now, just in case. This seems like it would be helpful for global indexes as well, wouldn't it? > This also results in an immediate though small benefit for v4 nbtree > indexes: _bt_truncate() produces smaller pivot tuples in a few cases. > For example, indexes with one or two boolean fields will have pivot > tuples that are 15 bytes and 16 bytes in length respectively, > occupying 16 bytes of tuple space on internal pages. The saving comes > because we can use the alignment padding hole, that was empty in the > original non-pivot index tuple that the new pivot tuple is to be > formed from. Currently, the size of these pivot tuples would be 24 > bytes, so we're occasionally saving a MAXALIGN() quantum in space this > way. It is unlikely that anyone would actually care very much about > these kinds of space savings, but at the same time it feels more > elegant to me. The heap TID may not have a pg_attribute entry, but > ISTM that the on-disk representation should not have padding "in the > wrong place", on general principle. > > Thoughts? I agree with trying to avoid having padding 'in the wrong place' and if it makes some indexes smaller, great, even if they're unlikely to be interesting in the vast majority of cases, they may still exist out there. Of course, this is provided that it doesn't overly complicate the code, but it sounds like it wouldn't be too bad in this case. Thanks! Stephen signature.asc Description: PGP signature
Re: finding changed blocks using WAL scanning
On Sun, Apr 21, 2019 at 06:24:50PM -0400, Robert Haas wrote: > On Sat, Apr 20, 2019 at 5:54 PM Bruce Momjian wrote: > > Good point. You mentioned: > > > > It seems better to me to give the files names like > > ${TLI}.${STARTLSN}.${ENDLSN}.modblock, e.g. > > 0001.00016858.0001687DBBB8.modblock, so that you can > > see exactly which *records* are covered by that segment. > > > > but it seems like it should be ${TLI}.${ENDLSN}... (END first) because > > you would not want to delete the modblock file until you are about to > > delete the final WAL, not the first WAL, but as you mentioned, it might > > be ENDLSN-1. > > Hmm. It seems to me that it is almost universally the convention to > put the starting point prior to the ending point. If you are taking a > biology class, the teacher will not tell you to study chapters six > through three.

My point is that most WAL archive tools will order and remove files based on their lexical ordering, so if you put the start first, the file will normally be removed when it should be kept, e.g., if you have WAL files like:

    00010001
    00010002
    00010003
    00010004
    00010005

putting the start first and archiving some WAL would lead to:

    00010001-00010004.modblock
    00010003
    00010004
    00010005

We removed 1 and 2, but kept the modblock file, which looks out of order. Having the end at the start would give:

    00010003
    00010004
    00010004-00010001.modblock
    00010005

My point is that you would normally only remove the modblock file when 4 is removed, because this modblock file is useful for incremental backups from base backups that happened between 1 and 4.

-- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
make \d pg_toast.foo show its indices
It's deliberate that \dt doesn't show toast tables. \d shows them, but doesn't show their indices. It seems to me that their indices should be shown, without having to think and know to query pg_index.

postgres=# \d pg_toast.pg_toast_2600
TOAST table "pg_toast.pg_toast_2600"
   Column   |  Type
------------+---------
 chunk_id   | oid
 chunk_seq  | integer
 chunk_data | bytea
Indexes:
    "pg_toast_2600_index" PRIMARY KEY, btree (chunk_id, chunk_seq)

Justin

diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index d7390d5..d26d986 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -2274,6 +2274,7 @@ describeOneTableDetails(const char *schemaname,
 	else if (tableinfo.relkind == RELKIND_RELATION ||
 			 tableinfo.relkind == RELKIND_MATVIEW ||
 			 tableinfo.relkind == RELKIND_FOREIGN_TABLE ||
+			 tableinfo.relkind == RELKIND_TOASTVALUE ||
 			 tableinfo.relkind == RELKIND_PARTITIONED_TABLE)
 	{
 		/* Footer information about a table */
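Until something like this is committed, the index can still be dug out of the catalogs by hand - which is exactly the "think and know to query pg_index" step the patch is meant to remove. A possible query, reusing the toast table from the example above:

  SELECT i.relname AS index_name,
         pg_get_indexdef(x.indexrelid) AS definition
  FROM pg_class c
  JOIN pg_index x ON x.indrelid = c.oid
  JOIN pg_class i ON i.oid = x.indexrelid
  WHERE c.relnamespace = 'pg_toast'::regnamespace
    AND c.relname = 'pg_toast_2600';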
Re: finding changed blocks using WAL scanning
On Mon, Apr 22, 2019 at 11:20:43AM +0900, Michael Paquier wrote: > On Sat, Apr 20, 2019 at 12:21:36AM -0400, Robert Haas wrote: > > The segment size doesn't have much to do with it. If you make > > segments bigger, you'll have to scan fewer larger ones; if you make > > them smaller, you'll have more smaller ones. The only thing that > > really matters is the amount of I/O and CPU required, and that doesn't > > change very much as you vary the segment size. > > If you create the extra file when a segment is finished and we switch > to a new one, then the extra work would happen for a random backend, > and it is going to be more costly to scan a 1GB segment than a 16MB > segment as a one-time operation, and less backends would see a > slowdown at equal WAL data generated. From what I can see, you are > not planning to do such operations when a segment finishes being > written, which would be much better. I think your point is that the 16MB is more likely to be in memory, while the 1GB is less likely. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: display of variables in EXPLAIN VERBOSE
David Rowley writes: > I'd vote to make the code match the documentation, but probably > implement it by adding a new field to ExplainState and just calculate > what to do once in ExplainQuery() instead of calculating what to do in > various random places.

Yeah, this is none too consistent:

$ grep -n 'useprefix =' explain.c
2081: useprefix = list_length(es->rtable) > 1;
2151: useprefix = (IsA(planstate->plan, SubqueryScan) || es->verbose);
2165: useprefix = (list_length(es->rtable) > 1 || es->verbose);
2238: useprefix = (list_length(es->rtable) > 1 || es->verbose);
2377: useprefix = (list_length(es->rtable) > 1 || es->verbose);
2485: useprefix = list_length(es->rtable) > 1;

If we're going to mess with this, I'd also suggest that we not depend on list_length(es->rtable) per se, as that counts RTEs that may have nothing to do with the plan. For instance, I've never been very happy about this behavior:

regression=# create table tt (f1 int, f2 int);
CREATE TABLE
regression=# explain verbose select * from tt;
                          QUERY PLAN
--------------------------------------------------------------
 Seq Scan on public.tt  (cost=0.00..32.60 rows=2260 width=8)
   Output: f1, f2
(2 rows)

regression=# create view vv as select * from tt;
CREATE VIEW
regression=# explain verbose select * from vv;
                          QUERY PLAN
--------------------------------------------------------------
 Seq Scan on public.tt  (cost=0.00..32.60 rows=2260 width=8)
   Output: tt.f1, tt.f2
(2 rows)

The reason for the difference is the presence of the view's RTE in the plan, but why should that affect the printout? Maybe we could make it depend on the number of RTE names assigned by select_rtable_names_for_explain, instead.

BTW, now that I look at this, I think the reason why I didn't make tlist printouts pay attention to VERBOSE for this purpose is that you don't get them at all if not verbose:

regression=# explain select * from tt;
                      QUERY PLAN
--------------------------------------------------------
 Seq Scan on tt  (cost=0.00..32.60 rows=2260 width=8)
(1 row)

So if we were to be rigidly consistent with this point of the docs, there would be no way to see a tlist without variable qualification, which doesn't really seem that nice.

Alternatively, we could just leave this as-is. I do not think the quoted doc paragraph was ever meant as an exact specification of what EXPLAIN VERBOSE does, nor do I believe that making it so would be helpful.

regards, tom lane
Re: finding changed blocks using WAL scanning
On Sun, Apr 21, 2019 at 10:21 PM Michael Paquier wrote: > If you create the extra file when a segment is finished and we switch > to a new one, then the extra work would happen for a random backend, > and it is going to be more costly to scan a 1GB segment than a 16MB > segment as a one-time operation, and less backends would see a > slowdown at equal WAL data generated. From what I can see, you are > not planning to do such operations when a segment finishes being > written, which would be much better. Well, my plan was to do it all from a background worker, so I do not think a random backend would ever have to do any extra work. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: clean up docs for v12
Hi, On 2019-04-22 14:48:26 +0900, Michael Paquier wrote: > On Fri, Apr 19, 2019 at 09:43:01AM -0500, Justin Pryzby wrote: > > Thanks for committing those portions. > > I have done an extra pass on your patch set to make sure that I am > missing nothing, and the last two remaining places which need some > tweaks are the comments from the JIT code you pointed out. Attached > is a patch with these adjustments. > -- > Michael > diff --git a/src/backend/jit/llvm/llvmjit_deform.c > b/src/backend/jit/llvm/llvmjit_deform.c > index 94b4635218..e7aa92e274 100644 > --- a/src/backend/jit/llvm/llvmjit_deform.c > +++ b/src/backend/jit/llvm/llvmjit_deform.c > @@ -298,9 +298,9 @@ slot_compile_deform(LLVMJitContext *context, TupleDesc > desc, > } > > /* > - * Check if's guaranteed the all the desired attributes are available in > - * tuple. If so, we can start deforming. If not, need to make sure to > - * fetch the missing columns. > + * Check if all the desired attributes are available in the tuple. If > so, > + * we can start deforming. If not, we need to make sure to fetch the > + * missing columns. >*/ That's imo not an improvement. The guaranteed bit is actually relevant. What this block is doing is eliding the check against the tuple header for the number of attributes, if NOT NULL attributes for later columns guarantee that the desired columns are present in the NULL bitmap. But the rephrasing makes it sound like we're actually checking against the tuple. I think it'd be better just to fix s/the all/that all/. > if ((natts - 1) <= guaranteed_column_number) > { > @@ -383,7 +383,7 @@ slot_compile_deform(LLVMJitContext *context, TupleDesc > desc, > > /* >* If this is the first attribute, slot->tts_nvalid was 0. > Therefore > - * reset offset to 0 to, it be from a previous execution. > + * reset offset to 0 too, as it may be from a previous > execution. >*/ > if (attnum == 0) > { That obviously makes sense. Greetings, Andres Freund
Re: finding changed blocks using WAL scanning
On Mon, Apr 22, 2019 at 11:48 AM Bruce Momjian wrote: > My point is that most WAL archive tools will order and remove files > based on their lexical ordering, so if you put the start first, the file > will normally be removed when it should be kept, e.g., if you have WAL > files like: > > 00010001 > 00010002 > 00010003 > 00010004 > 00010005 > > putting the start first and archiving some wal would lead to: > > 00010001-00010004.modblock > 00010003 > 00010004 > 00010005 > > We removed 1 and 2, but kept the modblock file, which looks out of > order. Having the end at the start would have: > > 00010003 > 00010004 > 00010004-00010001.modblock > 00010005 > > My point is that you would normally only remove the modblock file when 4 > is removed because this modblock files is useful for incremental backups > from base backups that happened between 1 and 4. That's an interesting point. On the other hand, I think it would be typical to want the master to retain .modblock files for much longer than it retains WAL segments, and in my design, the WAL archive wouldn't see those files at all; they'd be stored on the master. I was actually thinking that they should possibly be stored in a separate directory to avoid confusion. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: clean up docs for v12
On 2019-Apr-22, Andres Freund wrote: > On 2019-04-22 14:48:26 +0900, Michael Paquier wrote: > > /* > > -* Check if's guaranteed the all the desired attributes are available in > > -* tuple. If so, we can start deforming. If not, need to make sure to > > -* fetch the missing columns. > > +* Check if all the desired attributes are available in the tuple. If > > so, > > +* we can start deforming. If not, we need to make sure to fetch the > > +* missing columns. > > */ > > That's imo not an improvement. The guaranteed bit is actually > relevant. What this block is doing is eliding the check against the > tuple header for the number of attributes, if NOT NULL attributes for > later columns guarantee that the desired columns are present in the NULL > bitmap. But the rephrasing makes it sound like we're actually checking > against the tuple. > > I think it'd be better just to fix s/the all/that all/. (and s/if's/if it's/) > > > if ((natts - 1) <= guaranteed_column_number) > > { > > @@ -383,7 +383,7 @@ slot_compile_deform(LLVMJitContext *context, TupleDesc > > desc, > > > > /* > > * If this is the first attribute, slot->tts_nvalid was 0. > > Therefore > > -* reset offset to 0 to, it be from a previous execution. > > +* reset offset to 0 too, as it may be from a previous > > execution. > > */ > > if (attnum == 0) > > { > > That obviously makes sense. Hmm, I think "as it *is*", not "as it *may be*", right? -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: clean up docs for v12
Alvaro Herrera writes: > On 2019-Apr-22, Andres Freund wrote: >> On 2019-04-22 14:48:26 +0900, Michael Paquier wrote: >>> /* >>> -* Check if's guaranteed the all the desired attributes are available in >>> -* tuple. If so, we can start deforming. If not, need to make sure to >>> -* fetch the missing columns. >>> +* Check if all the desired attributes are available in the tuple. If >>> so, >>> +* we can start deforming. If not, we need to make sure to fetch the >>> +* missing columns. >>> */ >> That's imo not an improvement. The guaranteed bit is actually >> relevant. What this block is doing is eliding the check against the >> tuple header for the number of attributes, if NOT NULL attributes for >> later columns guarantee that the desired columns are present in the NULL >> bitmap. But the rephrasing makes it sound like we're actually checking >> against the tuple. >> >> I think it'd be better just to fix s/the all/that all/. > (and s/if's/if it's/) ISTM that Michael's proposed wording change shows that the existing comment is easily misinterpreted. I don't think these minor grammatical fixes will avoid the misinterpretation problem, and so some more-extensive rewording is called for. But TBH, now that I look at the code, I think the entire optimization is a bad idea and should be removed. Am I right in thinking that the presence of a wrong attnotnull marker could cause the generated code to actually crash, thanks to not checking the tuple's natts field? I don't have enough faith in our enforcement of those constraints to want to see JIT taking that risk to save a nanosecond or two. (Possibly I'd not think this if I weren't fresh off a couple of days with my nose in the ALTER TABLE SET NOT NULL code. But right now, I think that believing that that code does not and never will have any bugs is just damfool.) regards, tom lane
Re: Thoughts on nbtree with logical/varwidth table identifiers, v12 on-disk representation
Hi, On 2019-04-21 17:46:09 -0700, Peter Geoghegan wrote: > Andres has suggested that I work on teaching nbtree to accommodate > variable-width, logical table identifiers, such as those required for > indirect indexes, or clustered indexes, where secondary indexes must > use a logical primary key value instead of a heap TID. I think it's two more cases: - table AMs that want to support tables that are bigger than 32TB. That used to be unrealistic, but it's not anymore. Especially when the need to VACUUM etc is largely removed / reduced. - global indexes (for cross-partition unique constraints and such), which need a partition identifier as part of the tid (or as part of the index key, but I think that actually makes interaction with indexam from other layers more complicated - the inside of the index maybe may want to represent it as a column, but to the outside that ought not to be visible) > Thoughts? Seems reasonable to me. I, more generally, wonder if there's not a case to squeeze out more padding than "just" what you describe (since we IIRC don't frequently keep pointers into such tuples anyway, and definitely don't for byval attrs). But that's very likely better done separately. Greetings, Andres Freund
Re: finding changed blocks using WAL scanning
On Mon, Apr 22, 2019 at 12:15:32PM -0400, Robert Haas wrote: > On Mon, Apr 22, 2019 at 11:48 AM Bruce Momjian wrote: > > My point is that you would normally only remove the modblock file when 4 > > is removed because this modblock files is useful for incremental backups > > from base backups that happened between 1 and 4. > > That's an interesting point. On the other hand, I think it would be > typical to want the master to retain .modblock files for much longer > than it retains WAL segments, and in my design, the WAL archive > wouldn't see those files at all; they'd be stored on the master. I > was actually thinking that they should possibly be stored in a > separate directory to avoid confusion. I assumed the modblock files would be stored in the WAL archive so some external tools could generate incremental backups using just the WAL files. I assumed they would also be sent to standby servers so incremental backups could be done on standby servers too. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: clean up docs for v12
Hi,

On 2019-04-22 12:33:24 -0400, Tom Lane wrote: > ISTM that Michael's proposed wording change shows that the existing > comment is easily misinterpreted. I don't think these minor grammatical > fixes will avoid the misinterpretation problem, and so some more-extensive > rewording is called for.

Fair enough.

> But TBH, now that I look at the code, I think the entire optimization > is a bad idea and should be removed. Am I right in thinking that the > presence of a wrong attnotnull marker could cause the generated code to > actually crash, thanks to not checking the tuple's natts field? I don't > have enough faith in our enforcement of those constraints to want to see > JIT taking that risk to save a nanosecond or two.

It's not a minor optimization, it's very measurable. Without the check there's no pipeline stall when the memory for the tuple header is not in the CPU cache (very common, especially for seqscans and such, due to the "backward" memory location ordering of tuples in seqscans, which CPUs don't predict). Server grade CPUs of the last ~5 years just march on and start the work to fetch the first attributes (especially if they're NOT NULL) - but can't do that if natts has to be checked. And starting to check the NULL bitmap for NOT NULL attributes would make that even worse - and would be required if we don't trust attnotnull.

> (Possibly I'd not think this if I weren't fresh off a couple of days > with my nose in the ALTER TABLE SET NOT NULL code. But right now, > I think that believing that that code does not and never will have > any bugs is just damfool.)

But there's plenty places where we rely on NOT NULL actually working? We'll return wrong query results, and even crash in non-JIT places because we thought there was guaranteed to be a datum?

Greetings, Andres Freund
Re: clean up docs for v12
Hi,

On 2019-04-22 09:43:56 -0700, Andres Freund wrote: > On 2019-04-22 12:33:24 -0400, Tom Lane wrote: > > ISTM that Michael's proposed wording change shows that the existing > > comment is easily misinterpreted. I don't think these minor grammatical > > fixes will avoid the misinterpretation problem, and so some more-extensive > > rewording is called for. > > Fair enough.

The computation of that variable above has:

		/*
		 * If the column is possibly missing, we can't rely on its (or
		 * subsequent) NOT NULL constraints to indicate minimum attributes in
		 * the tuple, so stop here.
		 */
		if (att->atthasmissing)
			break;

		/*
		 * Column is NOT NULL and there've been no preceding missing columns,
		 * it's guaranteed that all columns up to here exist at least in the
		 * NULL bitmap.
		 */
		if (att->attnotnull)
			guaranteed_column_number = attnum;

and only then the comment referenced in the discussion here follows:

	/*
	 * Check if's guaranteed the all the desired attributes are available in
	 * tuple. If so, we can start deforming. If not, need to make sure to
	 * fetch the missing columns.
	 */

I think just reformulating that to something like

	/*
	 * Check if it's guaranteed that all the desired attributes are available
	 * in the tuple (but still possibly NULL), by dint of either the last
	 * to-be-deformed column being NOT NULL, or subsequent ones not accessed
	 * here being NOT NULL. If that's not guaranteed, the tuple header's natts
	 * has to be checked, and missing attributes potentially have to be
	 * fetched (using slot_getmissingattrs()).
	 */

should make that clearer?

Greetings, Andres Freund
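The attnotnull/atthasmissing interaction that guaranteed_column_number depends on can also be seen from SQL, which may help when reading the comment; the table and column names below are made up purely for illustration. A column added with a non-null default is marked NOT NULL, yet tuples written before the ALTER physically lack it (their natts stays small), so only columns without atthasmissing can anchor the guarantee:

  CREATE TABLE t (a int NOT NULL);
  INSERT INTO t VALUES (1);                            -- this tuple keeps natts = 1 on disk
  ALTER TABLE t ADD COLUMN b int NOT NULL DEFAULT 42;  -- no table rewrite; atthasmissing becomes true
  SELECT attname, attnotnull, atthasmissing
  FROM pg_attribute
  WHERE attrelid = 't'::regclass AND attnum > 0;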
Optimizer items in the release notes
We had a discussion in October about adding more optimizer items to the release notes: https://www.postgresql.org/message-id/flat/20181010220601.GA7807%40momjian.us#11d805ea0b0fcd0552dfa99251417cc1 There was no agreement on a change, but if people want to propose a change, please post here and we can discuss it. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Trouble with FETCH_COUNT and combined queries in psql
Hi,

When FETCH_COUNT is set, queries combined in a single request don't work as expected:

  \set FETCH_COUNT 10
  select pg_sleep(2) \; select 1;

No result is displayed, the pg_sleep(2) is not run, and no error is shown. That's disconcerting. The sequence that is sent under the hood is:

  #1 BEGIN
  #2 DECLARE _psql_cursor NO SCROLL CURSOR FOR
     select pg_sleep(2) ; select 1;
  #3 CLOSE _psql_cursor
  #4 ROLLBACK

The root problem is in deciding that a statement can be run through a cursor if the query text starts with "select" or "values" (in is_select_command() in common.c), but not knowing about multiple queries in the buffer, which are not compatible with the cursor approach.

When sending #2, psql expects the PQexec("DECLARE...") to yield a PGRES_COMMAND_OK, but it gets a PGRES_TUPLES_OK instead. Given that, it abandons the cursor, rolls back the transaction (if it opened it), and clears out the results of the second select without displaying them. If there was already a transaction open, the problem is worse because it doesn't roll back and we're silently missing an SQL statement that was possibly meant to change the state of the data, as in:

  BEGIN;
  SELECT compute_something() \;
  select get_results();
  END;

Does anyone have thoughts about how to fix this? ATM I don't see a plausible fix that does not involve the parser storing the information that it's a multiple-query command and passing it down somehow to is_select_command(). Or a more modern approach could be to give up on the cursor-based method in favor of PQsetSingleRowMode(). That might be too big a change for a bug fix though.

Best regards,

-- Daniel Vérité PostgreSQL-powered mailer: http://www.manitou-mail.org Twitter: @DanielVerite
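The mismatch can also be reproduced by hand, outside the FETCH_COUNT machinery, which shows where it comes from; this is just an illustrative session with FETCH_COUNT left unset:

  BEGIN;
  DECLARE _psql_cursor NO SCROLL CURSOR FOR SELECT pg_sleep(2) \; SELECT 1;
  -- The server parses two statements here: the DECLARE (wrapping only the
  -- pg_sleep query) and the bare SELECT 1, so the single PQexec() reports the
  -- status of the last one, PGRES_TUPLES_OK, rather than the PGRES_COMMAND_OK
  -- that the cursor-based code path expects.
  FETCH FORWARD 10 FROM _psql_cursor;  -- the cursor itself does exist, and reads from pg_sleep(2)
  ROLLBACK;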
Re: Unhappy about API changes in the no-fsm-for-small-rels patch
Hi, On 2019-04-22 18:49:44 +0530, Amit Kapila wrote: > Attached is a hacky and work-in-progress patch to move fsm map to > relcache. This will give you some idea. I think we need to see if > there is a need to invalidate the relcache due to this patch. I think > some other comments of Andres also need to be addressed, see if you > can attempt to fix some of them. > /* > @@ -1132,9 +1110,6 @@ fsm_local_set(Relation rel, BlockNumber cur_nblocks) > BlockNumber blkno, > cached_target_block; > > - /* The local map must not be set already. */ > - Assert(!FSM_LOCAL_MAP_EXISTS); > - > /* >* Starting at the current last block in the relation and working >* backwards, mark alternating blocks as available. > @@ -1142,7 +1117,7 @@ fsm_local_set(Relation rel, BlockNumber cur_nblocks) > blkno = cur_nblocks - 1; > while (true) > { > - fsm_local_map.map[blkno] = FSM_LOCAL_AVAIL; > + rel->fsm_local_map->map[blkno] = FSM_LOCAL_AVAIL; > if (blkno >= 2) > blkno -= 2; > else > @@ -1150,13 +1125,13 @@ fsm_local_set(Relation rel, BlockNumber cur_nblocks) > } > > /* Cache the number of blocks. */ > - fsm_local_map.nblocks = cur_nblocks; > + rel->fsm_local_map->nblocks = cur_nblocks; > > /* Set the status of the cached target block to 'unavailable'. */ > cached_target_block = RelationGetTargetBlock(rel); > if (cached_target_block != InvalidBlockNumber && > cached_target_block < cur_nblocks) > - fsm_local_map.map[cached_target_block] = FSM_LOCAL_NOT_AVAIL; > + rel->fsm_local_map->map[cached_target_block] = > FSM_LOCAL_NOT_AVAIL; > } I think there shouldn't be any need for this anymore. After this change we shouldn't invalidate the map until there's no space on it - thereby addressing the cost overhead, and greatly reducing the likelihood that the local FSM can lead to increased bloat. > /* > @@ -1168,18 +1143,18 @@ fsm_local_set(Relation rel, BlockNumber cur_nblocks) > * This function is used when there is no FSM. > */ > static BlockNumber > -fsm_local_search(void) > +fsm_local_search(Relation rel) > { > BlockNumber target_block; > > /* Local map must be set by now. */ > - Assert(FSM_LOCAL_MAP_EXISTS); > + Assert(FSM_LOCAL_MAP_EXISTS(rel)); > > - target_block = fsm_local_map.nblocks; > + target_block = rel->fsm_local_map->nblocks; > do > { > target_block--; > - if (fsm_local_map.map[target_block] == FSM_LOCAL_AVAIL) > + if (rel->fsm_local_map->map[target_block] == FSM_LOCAL_AVAIL) > return target_block; > } while (target_block > 0); > > @@ -1189,7 +1164,22 @@ fsm_local_search(void) >* first, which would otherwise lead to the same conclusion again and >* again. >*/ > - FSMClearLocalMap(); > + fsm_clear_local_map(rel); I'm not sure I like this. My inclination would be that we should be able to check the local fsm repeatedly even if there's no space in the in-memory representation - otherwise the use of the local FSM increases bloat. Greetings, Andres Freund
Re: block-level incremental backup
Greetings, * Robert Haas (robertmh...@gmail.com) wrote: > On Sat, Apr 20, 2019 at 4:32 PM Stephen Frost wrote: > > Having been around for a while working on backup-related things, if I > > was to implement the protocol for pg_basebackup today, I'd definitely > > implement "give me a list" and "give me this file" rather than the > > tar-based approach, because I've learned that people want to be > > able to do parallel backups and that's a decent way to do that. I > > wouldn't set out and implement something new that there's just no hope > > of making parallel. Maybe the first write of pg_basebackup would still > > be simple and serial since it's certainly more work to make a frontend > > tool like that work in parallel, but at least the protocol would be > > ready to support a parallel option being added later without being > > rewritten. > > > > And that's really what I was trying to get at here- if we've got the > > choice now to decide what this is going to look like from a protocol > > level, it'd be great if we could make it able to support being used in a > > parallel fashion, even if pg_basebackup is still single-threaded. > > I think we're getting closer to a meeting of the minds here, but I > don't think it's intrinsically necessary to rewrite the whole method > of operation of pg_basebackup to implement incremental backup in a > sensible way. It wasn't my intent to imply that the whole method of operation of pg_basebackup would have to change for this. > One could instead just do a straightforward extension > to the existing BASE_BACKUP command to enable incremental backup. Ok, how do you envision that? As I mentioned up-thread, I am concerned that we're talking too high-level here and it's making the discussion more difficult than it would be if we were to put together specific ideas and then discuss them. One way I can imagine to extend BASE_BACKUP is by adding LSN as an optional parameter and then having the database server scan the entire cluster and send a tarball which contains essentially a 'diff' file of some kind for each file where we can construct a diff based on the LSN, and then the complete contents of the file for everything else that needs to be in the backup. So, sure, that would work, but it wouldn't be able to be parallelized and I don't think it'd end up being very exciting for the external tools because of that, but it would be fine for pg_basebackup. On the other hand, if you added new commands for 'list of files changed since this LSN' and 'give me this file' and 'give me this file with the changes in it since this LSN', then pg_basebackup could work with that pretty easily in a single-threaded model (maybe with two connections to the backend, but still in a single process, or maybe just by slurping up the file list and then asking for each one) and the external tools could leverage those new capabilities too for their backups, both full backups and incremental ones. This also wouldn't have to change how pg_basebackup does full backups today one bit, so what we're really talking about here is the direction to take the new code that's being written, not about rewriting existing code. I agree that it'd be a bit more work...
> Then, to enable parallel full backup and all sorts of out-of-core > hacking, one could expand the command language to allow tools to > access individual steps: START_BACKUP, SEND_FILE_LIST, > SEND_FILE_CONTENTS, STOP_BACKUP, or whatever. The second thing makes > for an appealing project, but I do not think there is a technical > reason why it has to be done first. Or for that matter why it has to > be done second. As I keep saying, incremental backup and full backup > are separate projects and I believe it's completely reasonable for > whoever is doing the work to decide on the order in which they would > like to do the work. I didn't mean to imply that one had to be done before the other from a technical standpoint. I agree that they don't depend on each other. You're certainly welcome to do what you would like, I simply wanted to share my experiences and try to help move this in a direction that would involve less code rewrite in the future and to have a feature that would be more appealing to the external tools. > Having said that, I'm curious what people other than Stephen (and > other pgbackrest hackers) While David and I do talk, we haven't really discussed this proposal all that much, so please don't assume that he shares my thoughts here. I'd also like to hear what others think, particularly those who have been working in this area. > think about the relative value of parallel > backup vs. incremental backup. Stephen appears quite convinced that > parallel backup is full of win and incremental backup is a bit of a > yawn by comparison, and while I certainly would not want to di
Re: finding changed blocks using WAL scanning
On Mon, Apr 22, 2019 at 12:35 PM Bruce Momjian wrote: > I assumed the modblock files would be stored in the WAL archive so some > external tools could generate incremental backups using just the WAL > files. I assumed they would also be sent to standby servers so > incremental backups could be done on standby servers too. Yeah, that's another possible approach. I am not sure what is best. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: finding changed blocks using WAL scanning
On Mon, Apr 22, 2019 at 01:11:22PM -0400, Robert Haas wrote: > On Mon, Apr 22, 2019 at 12:35 PM Bruce Momjian wrote: > > I assumed the modblock files would be stored in the WAL archive so some > > external tools could generate incremental backups using just the WAL > > files. I assumed they would also be sent to standby servers so > > incremental backups could be done on standby servers too. > > Yeah, that's another possible approach. I am not sure what is best. I am thinking you need to allow any of these, and putting the WAL files in pg_wal and having them streamed and archived gives that flexibility. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: Thoughts on nbtree with logical/varwidth table identifiers, v12 on-disk representation
On Mon, Apr 22, 2019 at 8:36 AM Stephen Frost wrote: > This seems like it would be helpful for global indexes as well, wouldn't > it?

Yes, though that should probably work by reusing what we already do with heap TID (use standard IndexTuple fields on the leaf level for heap TID), plus an additional identifier for the partition number that is located at the physical end of the tuple. IOW, I think that this might benefit from a design that is half way between what we already do with heap TIDs and what we would be required to do to make varwidth logical row identifiers in tables work -- the partition number is varwidth, though often only a single byte.

> I agree with trying to avoid having padding 'in the wrong place' and if > it makes some indexes smaller, great, even if they're unlikely to be > interesting in the vast majority of cases, they may still exist out > there. Of course, this is provided that it doesn't overly complicate > the code, but it sounds like it wouldn't be too bad in this case.

Here is what it took:

* Removed the "conservative" MAXALIGN() within index_form_tuple(), bringing it in line with heap_form_tuple(), which only MAXALIGN()s so that the first attribute in the tuple's data area can safely be accessed on alignment-picky platforms, but doesn't do the same with data_len.

* Removed most of the MAXALIGN()s from nbtinsert.c, except one that considers if a page split is required.

* Didn't change the nbtsplitloc.c code, because we need to assume MAXALIGN()'d space quantities there. We continue to not trust the reported tuple length to be MAXALIGN()'d, which is now essential rather than just defensive.

* Removed MAXALIGN()s within _bt_truncate(), and SHORTALIGN()'d the whole tuple size in the case where the new pivot tuple requires a heap TID representation. We access TIDs as three 2-byte integers, so this is necessary for alignment-picky platforms.

I will pursue this as a project for PostgreSQL 13. It doesn't affect on-disk compatibility, because BTreeTupleGetHeapTID() works just as well with either the existing scheme, or this new one. Having the "real" tuple length available will make it easier to implement "true" suffix truncation, where we truncate *within* a text attribute (i.e. generate a new, shorter value using new opclass infrastructure).

-- Peter Geoghegan
Re: clean up docs for v12
Andres Freund writes: > On 2019-04-22 12:33:24 -0400, Tom Lane wrote: >> But TBH, now that I look at the code, I think the entire optimization >> is a bad idea and should be removed. Am I right in thinking that the >> presence of a wrong attnotnull marker could cause the generated code to >> actually crash, thanks to not checking the tuple's natts field? I don't >> have enough faith in our enforcement of those constraints to want to see >> JIT taking that risk to save a nanosecond or two. > It's not a minor optimization, it's very measurable.

Doesn't matter, if it's unsafe.

>> (Possibly I'd not think this if I weren't fresh off a couple of days >> with my nose in the ALTER TABLE SET NOT NULL code. But right now, >> I think that believing that that code does not and never will have >> any bugs is just damfool.) > But there's plenty places where we rely on NOT NULL actually working?

I do not think there are any other places where we make this particular assumption. Given the number of ways in which we rely on there being natts checks to avoid rewriting tables, I'm very afraid of the idea that JIT is making more assumptions than the mainline code does.

In hopes of putting some fear into you too, I exhibit the following behavior, which is not a bug according to our current definitions:

regression=# create table pp(f1 int);
CREATE TABLE
regression=# create table cc() inherits (pp);
CREATE TABLE
regression=# insert into cc values(1);
INSERT 0 1
regression=# insert into cc values(2);
INSERT 0 1
regression=# insert into cc values(null);
INSERT 0 1
regression=# alter table pp add column f2 text;
ALTER TABLE
regression=# alter table pp add column f3 text;
ALTER TABLE
regression=# alter table only pp alter f3 set not null;
ALTER TABLE
regression=# select * from pp;
 f1 | f2 | f3
----+----+----
  1 |    |
  2 |    |
    |    |
(3 rows)

The tuples coming out of cc will still have natts = 1, I believe. If they were deformed according to pp's tupdesc, there'd be a problem. Now, we shouldn't do that, because this is not the only possible discrepancy between parent and child tupdescs --- but I think this example shows that attnotnull is a lot spongier than you are assuming, even without considering the possibility of outright bugs.

regards, tom lane
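A catalog query along these lines (shown only for illustration) makes the discrepancy explicit: after the ALTER TABLE ONLY, the parent's f3 is marked attnotnull while the child's is not, even though a scan of pp returns cc's tuples:

  SELECT attrelid::regclass AS rel, attname, attnotnull
  FROM pg_attribute
  WHERE attrelid IN ('pp'::regclass, 'cc'::regclass)
    AND attnum > 0
  ORDER BY attrelid, attnum;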
Re: Do CustomScan as not projection capable node
On 22/04/2019 18:40, Robert Haas wrote:
> On Fri, Apr 19, 2019 at 12:45 AM Tom Lane wrote:
> > I don't buy this for a minute. Where do you think projection is
> > going to happen? There isn't any existing node type that *couldn't*
> > support projection if we insisted that that be done across-the-board.
> > I think it's mostly just a legacy thing that some don't.
>
> I think there may actually be some good reasons for that. If something
> like an Append or Material node projects, it seems to me that this means
> that we built the wrong tlist for its input(s). That justification
> doesn't apply to custom scans, though.

The main reason for my question was the incomplete information about the custom_scan_tlist / fdw_scan_tlist parameter. While testing my custom node, I ran into an error in setrefs.c caused by optimization of the projection operation. To understand reliably how to use custom_scan_tlist properly, I had to study the mechanics of the FDW plan generation in detail, and now the problem is solved.

There are only three references to this parameter on the hackers mailing list, a brief mention on postgresql.org, and limited comments in two patches: 1a8a4e5 and e7cb7ee. It may be that custom_scan_tlist is more intricate than it needs to be, and in any case it could use some comments describing in more detail how to use it.

-- Andrey Lepikhov Postgres Professional https://postgrespro.com The Russian Postgres Company
Re: Thoughts on nbtree with logical/varwidth table identifiers, v12 on-disk representation
On Mon, Apr 22, 2019 at 9:35 AM Andres Freund wrote: > I, more generally, wonder if there's not a case to squeeze out more > padding than "just" what you describe (since we IIRC don't frequently > keep pointers into such tuples anyway, and definitely don't for byval > attrs). But that's very likely better done separately. There is one way that that is kind of relevant here. The current requirement seems to be that *any* sort of tuple header be MAXALIGN()'d, because in the worst case the first attribute needs to be accessed at a MAXALIGN()'d boundary on an alignment-picky platform. That isn't so bad today, since we usually find a reasonably good way to put those 8 bytes (or 23/24 bytes in the case of heap tuples) to use. However, with varwidth table identifiers, the only use of those 8 bytes that I can think of is the offset to the identifier (or perhaps its length), plus the usual t_info stuff. We'd almost invariably waste 4 or 5 bytes, which seems like a problem to me. -- Peter Geoghegan
Re: clean up docs for v12
Andres Freund writes: > The computation of that variable above has: >* If the column is possibly missing, we can't rely on its (or >* subsequent) NOT NULL constraints to indicate minimum > attributes in >* the tuple, so stop here. >*/ > if (att->atthasmissing) > break; BTW, why do we have to stop? ISTM that a not-null column without atthasmissing is enough to prove this, regardless of the state of prior columns. (This is assuming that you trust attnotnull for this, which as I said I don't, but that's not relevant to this question.) I wonder also if it wouldn't be smart to explicitly check that the "guaranteeing" column is not attisdropped. > I think just reformulating that to something like > /* >* Check if it's guaranteed that all the desired attributes are > available >* in the tuple (but still possibly NULL), by dint of either the last >* to-be-deformed column being NOT NULL, or subsequent ones not accessed >* here being NOT NULL. If that's not guaranteed the tuple headers > natt's >* has to be checked, and missing attributes potentially have to be >* fetched (using slot_getmissingattrs(). > */ > should make that clearer? OK by me. regards, tom lane
Re: Thoughts on nbtree with logical/varwidth table identifiers, v12 on-disk representation
Greetings, * Peter Geoghegan (p...@bowt.ie) wrote: > On Mon, Apr 22, 2019 at 8:36 AM Stephen Frost wrote: > > This seems like it would be helpful for global indexes as well, wouldn't > > it? > > Yes, though that should probably work by reusing what we already do > with heap TID (use standard IndexTuple fields on the leaf level for > heap TID), plus an additional identifier for the partition number that > is located at the physical end of the tuple. IOW, I think that this > might benefit from a design that is half way between what we already > do with heap TIDs and what we would be required to do to make varwidth > logical row identifiers in tables work -- the partition number is > varwidth, though often only a single byte. Yes, global indexes for partitioned tables could potentially be simpler than the logical row identifiers, but maybe it'd be useful to just have one implementation based around logical row identifiers which ends up working for global indexes as well as the other types of indexes and table access methods. If we thought that the 'single-byte' partition number covered enough use-cases then we could possibly consider supporting them for partitions by just 'stealing' a byte from BlockIdData and having the per-partition size be limited to 4TB when a global index exists on the partitioned table. That's certainly not an ideal limitation but it might be appealing to some users who really would like global indexes and could possibly require less to implement, though there's a lot of other things that would have to be done to have global indexes. Anyhow, just some random thoughts that I figured I'd share in case there might be something there worth thinking about. > > I agree with trying to avoid having padding 'in the wrong place' and if > > it makes some indexes smaller, great, even if they're unlikely to be > > interesting in the vast majority of cases, they may still exist out > > there. Of course, this is provided that it doesn't overly complicate > > the code, but it sounds like it wouldn't be too bad in this case. > > Here is what it took: > > * Removed the "conservative" MAXALIGN() within index_form_tuple(), > bringing it in line with heap_form_tuple(), which only MAXALIGN()s so > that the first attribute in tuple's data area can safely be accessed > on alignment-picky platforms, but doesn't do the same with data_len. > > * Removed most of the MAXALIGN()s from nbtinsert.c, except one that > considers if a page split is required. > > * Didn't change the nbtsplitloc.c code, because we need to assume > MAXALIGN()'d space quantities there. We continue to not trust the > reported tuple length to be MAXALIGN()'d, which is now essentially > rather than just defensive. > > * Removed MAXALIGN()s within _bt_truncate(), and SHORTALIGN()'d the > whole tuple size in the case where new pivot tuple requires a heap TID > representation. We access TIDs as 3 2 byte integers, so this is > necessary for alignment-picky platforms. > > I will pursue this as a project for PostgreSQL 13. It doesn't affect > on-disk compatibility, because BTreeTupleGetHeapTID() works just as > well with either the existing scheme, or this new one. Having the > "real" tuple length available will make it easier to implement "true" > suffix truncation, where we truncate *within* a text attribute (i.e. > generate a new, shorter value using new opclass infrastructure). This sounds pretty good to me, though I'm not nearly as close to the code there as you are. Thanks! Stephen signature.asc Description: PGP signature
Re: block-level incremental backup
Hi, On 2019-04-19 20:04:41 -0400, Stephen Frost wrote: > I agree that we don't want another implementation and that there's a lot > that we want to do to improve replay performance. We've already got > frontend tools which work with multiple execution threads, so I'm not > sure I get the "not easily feasible" bit, and the argument about the > checkpointer seems largely related to that (as in- if we didn't have > multiple threads/processes then things would perform quite badly... but > we can and do have multiple threads/processes in frontend tools today, > even in pg_basebackup). You need not just multiple execution threads, but basically a new implementation of shared buffers, locking, process monitoring, with most of the related infrastructure. You're literally talking about reimplementing a very substantial portion of the backend. I'm not sure I can transport in written words - via a public medium - how bad an idea it would be to go there. > You certainly bring up some good concerns though and they make me think > of other bits that would seem like they'd possibly be larger issues for > a frontend tool- like having a large pool of memory for cacheing (aka > shared buffers) the changes. If what we're talking about here is *just* > replay though, without having the system available for reads, I wonder > if we might want a different solution there. No. > > Which I think is entirely reasonable. With the 'consistent' and LSN > > recovery targets one already can get most of what's needed from such a > > tool, anyway. I'd argue the biggest issue there is that there's no > > equivalent to starting postgres with a private socket directory on > > windows, and perhaps an option or two making it easier to start postgres > > in a "private" mode for things like this. > > This would mean building in a way to do parallel WAL replay into the > server binary though, as discussed above, and it seems like making that > work in a way that allows us to still be available as a read-only > standby would be quite a bit more difficult. We could possibly support > parallel WAL replay only when we aren't a replica but from the same > binary. I'm doubtful that we should try to implement parallel WAL apply that can't support HS - a substantial portion of the the logic to avoid issues around relfilenode reuse, consistency etc is going to be to be necessary for non-HS aware apply anyway. But if somebody had a concrete proposal for something that's fundamentally only doable without HS, I could be convinced. > The concerns mentioned about making it easier to start PG in a > private mode don't seem too bad but I am not entirely sure that the > tools which want to leverage that kind of capability would want to have > to exec out to the PG binary to use it. Tough luck. But even leaving infeasability aside, it seems like a quite bad idea to do this in-process inside a tool that manages backup & recovery. Creating threads / sub-processes with complicated needs (like any pared down version of pg to do just recovery would have) from within a library has substantial complications. So you'd not want to do this in-process anyway. > A lot of this part of the discussion feels like a tangent though, unless > I'm missing something. I'm replying to: On 2019-04-17 18:43:10 -0400, Stephen Frost wrote: > Wow. I have to admit that I feel completely opposite of that- I'd > *love* to have an independent tool (which ideally uses the same code > through the common library, or similar) that can be run to apply WAL. 
And I'm basically saying that anything that starts from this premise is fatally flawed (in the ex falso quodlibet kind of sense ;)). > The "WAL compression" tool contemplated > previously would be much simpler and not the full-blown WAL replay > capability, which would be left to the server, unless you're suggesting > that even that should be exclusively the purview of the backend? Though > that ship's already sailed, given that external projects have > implemented it. I'm extremely doubtful of such tools (but it's not what I was responding too, see above). I'd be extremely surprised if even one of them came close to being correct. The old FPI removal tool had data corrupting bugs left and right. > Having a library to provide that which external > projects could leverage would be nicer than having everyone write their > own version. No, I don't think that's necessarily true. Something complicated that's hard to get right doesn't have to be provided by core. Even if other projects decide that their risk/reward assesment is different than core postgres'. We don't have to take on all kind of work and complexity for external tools. Greetings, Andres Freund
Re: Trouble with FETCH_COUNT and combined queries in psql
Bonjour Daniel, When FETCH_COUNT is set, queries combined in a single request don't work as expected: \set FETCH_COUNT 10 select pg_sleep(2) \; select 1; No result is displayed, the pg_sleep(2) is not run, and no error is shown. That's disconcerting. Indeed. Does anyone have thoughts about how to fix this? ATM I don't see a plausible fix that does not involve the parser storing the information that it's a multiple-query command and passing it down somehow to is_select_command(). The lexer (not parser) is called by psql to know where the query stops (i.e. waiting for ";"), so it could indeed know whether there are several queries. I added some stuff to extract embedded "\;" for pgbench "\cset", which has been removed though, but it is easy to add back a detection of "\;", and also to detect the select. If the position of the last select is known, the cursor can be declared in the right place, which would also solve the problem. Or a more modern approach could be to give up on the cursor-based method in favor of PQsetSingleRowMode(). Hmmm. I'm not sure that the row count is available under this mode? ISTM that the FETCH_COUNT stuff should really batch the fetching of results by this amount. I'm not sure about the 1-by-1 row approach. -- Fabien.
Re: block-level incremental backup
On Mon, Apr 22, 2019 at 1:08 PM Stephen Frost wrote: > > I think we're getting closer to a meeting of the minds here, but I > > don't think it's intrinsically necessary to rewrite the whole method > > of operation of pg_basebackup to implement incremental backup in a > > sensible way. > > It wasn't my intent to imply that the whole method of operation of > pg_basebackup would have to change for this. Cool. > > One could instead just do a straightforward extension > > to the existing BASE_BACKUP command to enable incremental backup. > > Ok, how do you envision that? As I mentioned up-thread, I am concerned > that we're talking too high-level here and it's making the discussion > more difficult than it would be if we were to put together specific > ideas and then discuss them. > > One way I can imagine to extend BASE_BACKUP is by adding LSN as an > optional parameter and then having the database server scan the entire > cluster and send a tarball which contains essentially a 'diff' file of > some kind for each file where we can construct a diff based on the LSN, > and then the complete contents of the file for everything else that > needs to be in the backup. /me scratches head. Isn't that pretty much what I described in my original post? I even described what that "'diff' file of some kind" would look like in some detail in the paragraph of that email numbered "2.", and I described the reasons for that choice at length in http://postgr.es/m/ca+tgmozrqdv-tb8ny9p+1pqlqkxp5f1afghuohh5qt6ewdk...@mail.gmail.com I can't figure out how I'm managing to be so unclear about things about which I thought I'd been rather explicit. > So, sure, that would work, but it wouldn't be able to be parallelized > and I don't think it'd end up being very exciting for the external tools > because of that, but it would be fine for pg_basebackup. Stop being such a pessimist. Yes, if we only add the option to the BASE_BACKUP command, it won't directly be very exciting for external tools, but a lot of the work that is needed to do things that ARE exciting for external tools will have been done. For instance, if the work to figure out which blocks have been modified via WAL-scanning gets done, and initially that's only exposed via BASE_BACKUP, it won't be much work for somebody to write code that exposes that information directly through some new replication command. There's a difference between something that's going in the wrong direction and something that's going in the right direction but not as far or as fast as you'd like. And I'm 99% sure that everything I'm proposing here falls in the latter category rather than the former. > On the other hand, if you added new commands for 'list of files changed > since this LSN' and 'give me this file' and 'give me this file with the > changes in it since this LSN', then pg_basebackup could work with that > pretty easily in a single-threaded model (maybe with two connections to > the backend, but still in a single process, or maybe just by slurping up > the file list and then asking for each one) and the external tools could > leverage those new capabilities too for their backups, both full backups > and incremental ones. This also wouldn't have to change how > pg_basebackup does full backups today one bit, so what we're really > talking about here is the direction to take the new code that's being > written, not about rewriting existing code. I agree that it'd be a bit > more work...
> but hopefully not *that* much more, and it would mean we > could later add parallel backup to pg_basebackup more easily too, if we > wanted to. For purposes of implementing parallel pg_basebackup, it would probably be better if the server rather than the client decided which files to send via which connection. If the client decides, then every time the server finishes sending a file, the client has to request another file, and that introduces some latency: after the server finishes sending each file, it has to wait for the client to finish receiving the data, and it has to wait for the client to tell it what file to send next. If the server decides, then it can just send data at top speed without a break. So the ideal interface for pg_basebackup would really be something like: START_PARALLEL_BACKUP blah blah PARTICIPANTS 4; ...returning a cookie that can then be used by each participant as an argument to a new command: JOIN_PARALLEL_BACKUP 'cookie'; However, that is obviously extremely inconvenient for third-party tools. It's possible we need both an interface like this -- for use by parallel pg_basebackup -- and a START_BACKUP/SEND_FILE_LIST/SEND_FILE_CONTENTS/STOP_BACKUP type interface for use by external tools. On the other hand, maybe the additional overhead caused by managing the list of files to be fetched on the client side is negligible. It'd be interesting to see, though, how busy the server is when running an incremental backup managed by an external
Re: Thoughts on nbtree with logical/varwidth table identifiers, v12 on-disk representation
On Mon, Apr 22, 2019 at 10:32 AM Stephen Frost wrote: > Yes, global indexes for partitioned tables could potentially be simpler > than the logical row identifiers, but maybe it'd be useful to just have > one implementation based around logical row identifiers which ends up > working for global indexes as well as the other types of indexes and > table access methods. Maybe so. I think that we'd have to actually try it out to know for sure. > If we thought that the 'single-byte' partition number covered enough > use-cases then we could possibly consider supporting them for partitions > by just 'stealing' a byte from BlockIdData and having the per-partition > size be limited to 4TB when a global index exists on the partitioned > table. I don't think that that would make it any easier to implement. > This sounds pretty good to me, though I'm not nearly as close to the > code there as you are. I'm slightly concerned that I may have broken an index_form_tuple() caller from some other access method, but they all seem to not trust index_form_tuple() to have MAXALIGN()'d on their behalf, just like nbtree (nbtree won't when I'm done, though, because it will actively try to preserve the "real" tuple size). It's convenient to me that no caller seems to rely on the index_form_tuple() MAXALIGN() that I want to remove. -- Peter Geoghegan
Re: Allow any[] as input arguments for sql/plpgsql functions to mimic format()
Hey! OK - thank you for the update and the explanation. My reasoning in this case is - if we allow the any[] type to only be passed to other functions that accept any[], and disallow any kind of other operations on this array (such as retrieving its elements or modifying it), I do not yet see any places where it might introduce a performance regression. These arguments will literally be pass-only, and since we are unable to interact with them in any other way, there will be no possibility of type mismatches and therefore for performance penalties. This approach puts all the heavy work on the plpgsql compiler - it will need to ensure that, if there is a any[] or VARIADIC any variable in a function arglist, it must NOT be accessed in any way, and can only be passed to other functions which accept any[] or VARIADIC any. BR ~phoe On 22.04.2019 12:09, Pavel Stehule wrote: > Hi > > po 22. 4. 2019 v 11:27 odesílatel Michał "phoe" Herda > mailto:p...@disroot.org>> napsal: > > Hey everyone, > > I am writing a plpgsql function that (to greatly simplify) raises an > exception with a formatted* message. Ideally, I should be able to call > it with raise_exception('The person %I has only %I bananas.', 'Fred', > 8), which mimics the format(text, any[]) calling convention. > > Here is where I have encountered a limitation of PostgreSQL's design: > https://www.postgresql.org/docs/11/datatype-pseudo.html mentions > explicitly that, "At present most procedural languages forbid use of a > pseudo-type as an argument type". > > My reasoning is that I should be able to accept a value of some > type if > all I do is passing it to a function that accepts exactly that type, > such as format(text, any[]). Given the technical reality, I assume > that > I wouldn't be able to do anything else with that value, but that is > fine, since I don't have to do anything with it regardless. > > BR > Michał "phoe" Herda > > *I do not want to use the obvious solution of > raise_exception(format(...)) because the argument to that function is > the error ID that is then looked up in a table from which the error > message and sqlstate are retrieved. My full code is in the > attached SQL > file. Once it is executed: > > SELECT gateway_error('user_does_not_exist', '2'); -- works but is > unnatural, > SELECT gateway_error('user_does_not_exist', 2); -- is natural but > doesn't work. > > > It is known problem, and fix is not easy. > > Any expressions inside plpgsql are simple queries like SELECT expr, > and they are executed same pipeline like queries. > > The plans of these queries are stored and reused. Originally these > plans disallow any changes, now some changes are supported, but > parameters should be same all time. This is ensured by disallowing > "any" type. > > Other polymorphic types are very static, so there is not described risk. > > Probably some enhancement can be in this are. The plan can be > re-planed after some change - but it can has lot of performance > impacts. It is long open topic. Some changes in direction to dynamic > languages can increase cost of some future optimization to higher > performance :-(. > > Regards > > Pavel > > > >
Re: [PATCH v1] Add \echo_stderr to psql
\warn ... \warning ... These two seem about the best to me, drawing from the perl warn command. Yep, I was thinking of perl & gmake. Maybe the 4-letter option is better because it's the same length as "echo". I suppose we could go the bash &2 route here, but I don't want to. I agree on this one. Please find attached v2; the name is now \warn. How might we test this portably? TAP testing? See pgbench, which has a TAP test that can check stdout & stderr by calling the utility command_checks_all; the same could be done with psql. -- Fabien.
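For readers following along, here is a minimal sketch of what such a TAP test would need to exercise, assuming the v2 patch's \warn behaves like \echo except that it writes to stderr (only the proposed meta-command and existing psql features are used):

    \echo this line should appear on stdout
    \warn this line should appear on stderr

Running the script as, say, psql -X -f script.sql 1>out.txt 2>err.txt and checking that the \warn text shows up only on stderr is exactly the kind of stdout/stderr assertion that command_checks_all already makes in the pgbench TAP tests.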
Re: Allow any[] as input arguments for sql/plpgsql functions to mimic format()
Hi po 22. 4. 2019 v 19:20 odesílatel Michał "phoe" Herda napsal: > Hey! > > OK - thank you for the update and the explanation. > > My reasoning in this case is - if we allow the any[] type to only be > passed to other functions that accept any[], and disallow any kind of other > operations on this array (such as retrieving its elements or modifying it), > I do not yet see any places where it might introduce a performance > regression. These arguments will literally be pass-only, and since we are > unable to interact with them in any other way, there will be no possibility > of type mismatches and therefore for performance penalties. > > This approach puts all the heavy work on the plpgsql compiler - it will > need to ensure that, if there is a any[] or VARIADIC any variable in a > function arglist, it must NOT be accessed in any way, and can only be > passed to other functions which accept any[] or VARIADIC any. > PLpgSQL compiler knows nothing about a expressions - the compiler process only plpgsql statements. Expressions are processed at runtime only by SQL parser and executor. It is good to start with plpgsql codes - https://github.com/postgres/postgres/tree/master/src/pl/plpgsql/src you can see there, so plpgsql is very different from other compilers. It just glue of SQL expressions or queries, that are black box for PLpgSQL compiler and executor. Just any[] is not plpgsql way. For your case you should to use a overloading create or replace function fx(fmt text, par text) returns void as $$ begin raise notice '%', format(fmt, par); end; $$ language plpgsql; create or replace function fx(fmt text, par numeric) returns void as $$ begin raise notice '%', format(fmt, par); end; $$ language plpgsql; There is another limit, you cannot to declare function parameter type that enforce explicit casting can be nice (but it is strange idea) to have some other flags for arguments CREATE OR REPLACE FUNCTION gateway_error(fmt text, par text FORCE EXPLICIT CAST) ... Regards Pavel > BR > ~phoe > On 22.04.2019 12:09, Pavel Stehule wrote: > > Hi > > po 22. 4. 2019 v 11:27 odesílatel Michał "phoe" Herda > napsal: > >> Hey everyone, >> >> I am writing a plpgsql function that (to greatly simplify) raises an >> exception with a formatted* message. Ideally, I should be able to call >> it with raise_exception('The person %I has only %I bananas.', 'Fred', >> 8), which mimics the format(text, any[]) calling convention. >> >> Here is where I have encountered a limitation of PostgreSQL's design: >> https://www.postgresql.org/docs/11/datatype-pseudo.html mentions >> explicitly that, "At present most procedural languages forbid use of a >> pseudo-type as an argument type". >> >> My reasoning is that I should be able to accept a value of some type if >> all I do is passing it to a function that accepts exactly that type, >> such as format(text, any[]). Given the technical reality, I assume that >> I wouldn't be able to do anything else with that value, but that is >> fine, since I don't have to do anything with it regardless. >> >> BR >> Michał "phoe" Herda >> >> *I do not want to use the obvious solution of >> raise_exception(format(...)) because the argument to that function is >> the error ID that is then looked up in a table from which the error >> message and sqlstate are retrieved. My full code is in the attached SQL >> file. 
Once it is executed: >> >> SELECT gateway_error('user_does_not_exist', '2'); -- works but is >> unnatural, >> SELECT gateway_error('user_does_not_exist', 2); -- is natural but >> doesn't work. >> > > It is known problem, and fix is not easy. > > Any expressions inside plpgsql are simple queries like SELECT expr, and > they are executed same pipeline like queries. > > The plans of these queries are stored and reused. Originally these plans > disallow any changes, now some changes are supported, but parameters should > be same all time. This is ensured by disallowing "any" type. > > Other polymorphic types are very static, so there is not described risk. > > Probably some enhancement can be in this are. The plan can be re-planed > after some change - but it can has lot of performance impacts. It is long > open topic. Some changes in direction to dynamic languages can increase > cost of some future optimization to higher performance :-(. > > Regards > > Pavel > > > > > >
Re: clean up docs for v12
Hi, On 2019-04-22 13:27:17 -0400, Tom Lane wrote: > Andres Freund writes: > > The computation of that variable above has: > > > * If the column is possibly missing, we can't rely on its (or > > * subsequent) NOT NULL constraints to indicate minimum > > attributes in > > * the tuple, so stop here. > > */ > > if (att->atthasmissing) > > break; > > BTW, why do we have to stop? ISTM that a not-null column without > atthasmissing is enough to prove this, regardless of the state of prior > columns. (This is assuming that you trust attnotnull for this, which > as I said I don't, but that's not relevant to this question.) Are you wondering if we could also use this kind of logic to infer the length of the null bitmap if there's preceding columns with atthasmissing true as long as there's a later !hasmissing column that's NOT NULL? Right. The logic could be made more powerful - I implemented the above after Andrew's commit of fast-not-null broke JIT (not because of that logic, but because it simply didn't look up the missing columns). I assume it doesn't terribly matter to be fast once attributes after a previously missing one are accessed - it's likely not going to be the hotly accessed data? > I wonder > also if it wouldn't be smart to explicitly check that the "guaranteeing" > column is not attisdropped. Yea, that probably would be smart. I don't think there's an active problem, because we remove NOT NULL when deleting an attribute, but it seems good to be doubly sure / explain why that's safe: /* Remove any NOT NULL constraint the column may have */ attStruct->attnotnull = false; I'm a bit unsure whether to make it an assert, elog(ERROR) or just not assume column presence? Greetings, Andres Freund
Re: Patch: doc for pg_logical_emit_message()
On Fri, Apr 19, 2019 at 3:21 PM Matsumura, Ryo wrote: > > Hi, Hackers > > pg_logical_emit_message() can be used by any user, > but the following document says that it can be used by only superuser. > > > Table 9.88. Replication SQL Functions > > Use of these functions is restricted to superusers. > > I think that pg_logicl_emit_message() should be used by any user. > Therefore, I attach the document patch. Thanks for the patch! Use of not only pg_logical_emit_message() but also other replication functions in Table 9.88 is not restricted to superuser. For example, replication role can execute pg_create_physical_replication_slot(). So I think that the patch should fix also the description for those replication functions. Thought? Regards, -- Fujii Masao
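As a quick illustration of the behaviour being documented (a sketch only; the role and slot names are made up), neither function call below needs superuser:

    -- any user can emit a logical decoding message:
    SELECT pg_logical_emit_message(true, 'my_prefix', 'some payload');

    -- a role with REPLICATION, but not superuser, can create a slot:
    CREATE ROLE backup_user LOGIN REPLICATION;
    -- then, in a session connected as backup_user:
    SELECT pg_create_physical_replication_slot('backup_slot');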
Re: block-level incremental backup
Greetings, * Andres Freund (and...@anarazel.de) wrote: > On 2019-04-19 20:04:41 -0400, Stephen Frost wrote: > > I agree that we don't want another implementation and that there's a lot > > that we want to do to improve replay performance. We've already got > > frontend tools which work with multiple execution threads, so I'm not > > sure I get the "not easily feasible" bit, and the argument about the > > checkpointer seems largely related to that (as in- if we didn't have > > multiple threads/processes then things would perform quite badly... but > > we can and do have multiple threads/processes in frontend tools today, > > even in pg_basebackup). > > You need not just multiple execution threads, but basically a new > implementation of shared buffers, locking, process monitoring, with most > of the related infrastructure. You're literally talking about > reimplementing a very substantial portion of the backend. I'm not sure > I can transport in written words - via a public medium - how bad an idea > it would be to go there. Yes, there'd be some need for locking and process monitoring, though if we aren't supporting ongoing read queries at the same time, there's a whole bunch of things that we don't need from the existing backend. > > > Which I think is entirely reasonable. With the 'consistent' and LSN > > > recovery targets one already can get most of what's needed from such a > > > tool, anyway. I'd argue the biggest issue there is that there's no > > > equivalent to starting postgres with a private socket directory on > > > windows, and perhaps an option or two making it easier to start postgres > > > in a "private" mode for things like this. > > > > This would mean building in a way to do parallel WAL replay into the > > server binary though, as discussed above, and it seems like making that > > work in a way that allows us to still be available as a read-only > > standby would be quite a bit more difficult. We could possibly support > > parallel WAL replay only when we aren't a replica but from the same > > binary. > > I'm doubtful that we should try to implement parallel WAL apply that > can't support HS - a substantial portion of the the logic to avoid > issues around relfilenode reuse, consistency etc is going to be to be > necessary for non-HS aware apply anyway. But if somebody had a concrete > proposal for something that's fundamentally only doable without HS, I > could be convinced. I'd certainly prefer that we support parallel WAL replay *with* HS, that just seems like a much larger problem, but I'd be quite happy to be told that it wouldn't be that much harder. > > A lot of this part of the discussion feels like a tangent though, unless > > I'm missing something. > > I'm replying to: > > On 2019-04-17 18:43:10 -0400, Stephen Frost wrote: > > Wow. I have to admit that I feel completely opposite of that- I'd > > *love* to have an independent tool (which ideally uses the same code > > through the common library, or similar) that can be run to apply WAL. > > And I'm basically saying that anything that starts from this premise is > fatally flawed (in the ex falso quodlibet kind of sense ;)). I'd just say that it'd be... difficult. :) > > The "WAL compression" tool contemplated > > previously would be much simpler and not the full-blown WAL replay > > capability, which would be left to the server, unless you're suggesting > > that even that should be exclusively the purview of the backend? Though > > that ship's already sailed, given that external projects have > > implemented it. 
> > I'm extremely doubtful of such tools (but it's not what I was responding > too, see above). I'd be extremely surprised if even one of them came > close to being correct. The old FPI removal tool had data corrupting > bugs left and right. I have concerns about it myself, which is why I'd actually really like to see something in core that does it, and does it the right way, that other projects could then leverage (ideally by just linking into the library without having to rewrite what's in core, though that might not be an option for things like WAL-G that are in Go and possibly don't want to link in some C library). > > Having a library to provide that which external > > projects could leverage would be nicer than having everyone write their > > own version. > > No, I don't think that's necessarily true. Something complicated that's > hard to get right doesn't have to be provided by core. Even if other > projects decide that their risk/reward assesment is different than core > postgres'. We don't have to take on all kind of work and complexity for > external tools. No, it doesn't have to be provided by core, but I sure would like it to be and I'd be much more comfortable if it was because then we'd also take care to not break whatever assumptions are made (or to do so in a way that can be detected and/or handled) as new code is written. As discussed above, as long as it isn't provided by core
Re: Do CustomScan as not projection capable node
Andrey Lepikhov writes: > It is possible that custom_scan_tlist is designed too nontrivially, and > it is possible that it needs some comments describing in more detail how > to use it. I totally buy the argument that the custom scan stuff is underdocumented :-(. FWIW, if we did have a use-case showing that somebody would like to make custom scans that can't project, the way to do it would be to add a flag bit showing whether a particular CustomPath/CustomScan could project or not. Not to assume that they all can't. This wouldn't be that much code really, but I'd still like to see a plausible use-case before adding it, because it'd be a small API break for existing CustomPath providers. regards, tom lane
Re: pg_dump is broken for partition tablespaces
On 2019-Apr-22, Robert Haas wrote: > PostgreSQL has historically and very deliberately *not made a > distinction* between "this object is in the default tablespace" and > "this object is in tablespace X which happens to be the default." I > think that it's too late to invent such a distinction for reasons of > backward compatibility -- and if we were going to do it, surely it > would need to exist for both partitioned tables and the partitions > themselves. Otherwise it just produces more strange inconsistencies. Yeah, this is probably right. (I don't think it's the same thing that Tom was saying, though, or at least I didn't understand his argument this way.) I think we can get out of this whole class of problems by forbidding the TABLESPACE clause for partitioned rels from mentioning the database tablespace -- that is, users either mention some *other* tablespace, or partitions follow default_tablespace like everybody else. AFAICS with that restriction this whole problem does not arise, and the patch may become simpler. I'll give it a spin. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
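To make that concrete, a sketch of how the proposed restriction would look from SQL (hypothetical behaviour under the proposal, not what any released version does; ts_other is a made-up tablespace and pg_default is assumed to be the database's default):

    CREATE TABLE p1 (a int) PARTITION BY LIST (a) TABLESPACE ts_other;   -- ok: some *other* tablespace
    CREATE TABLE p2 (a int) PARTITION BY LIST (a) TABLESPACE pg_default; -- would be rejected under the proposal
    CREATE TABLE p3 (a int) PARTITION BY LIST (a);                       -- partitions follow default_tablespace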
Re: clean up docs for v12
Andres Freund writes: > On 2019-04-22 13:27:17 -0400, Tom Lane wrote: >> I wonder >> also if it wouldn't be smart to explicitly check that the "guaranteeing" >> column is not attisdropped. > Yea, that probably would be smart. I don't think there's an active > problem, because we remove NOT NULL when deleting an attribute, but it > seems good to be doubly sure / explain why that's safe: > /* Remove any NOT NULL constraint the column may have */ > attStruct->attnotnull = false; > I'm a bit unsure whether to make it an assert, elog(ERROR) or just not > assume column presence? I'd just make the code look like

    /*
     * If it's NOT NULL then it must be present in every tuple,
     * unless there's a "missing" entry that could provide a non-null
     * value for it.  Out of paranoia, also check !attisdropped.
     */
    if (att->attnotnull &&
        !att->atthasmissing &&
        !att->attisdropped)
        guaranteed_column_number = attnum;

I don't think the extra check is so expensive as to be worth obsessing over. regards, tom lane
Re: pg_dump is broken for partition tablespaces
Hi, On 2019-04-22 14:16:28 -0400, Alvaro Herrera wrote: > On 2019-Apr-22, Robert Haas wrote: > > > PostgreSQL has historically and very deliberately *not made a > > distinction* between "this object is in the default tablespace" and > > "this object is in tablespace X which happens to be the default." I > > think that it's too late to invent such a distinction for reasons of > > backward compatibility -- and if we were going to do it, surely it > > would need to exist for both partitioned tables and the partitions > > themselves. Otherwise it just produces more strange inconsistencies. > > Yeah, this is probably right. (I don't think it's the same thing that > Tom was saying, though, or at least I didn't understand his argument > this way.) > > I think we can get out of this whole class of problems by forbidding the > TABLESPACE clause for partitioned rels from mentioning the database > tablespace -- that is, users either mention some *other* tablespace, or > partitions follow default_tablespace like everybody else. AFAICS with > that restriction this whole problem does not arise, and the patch may > become simpler. I'll give it a spin. Why is the obvious answer not to just remove the whole tablespace inheritance behaviour? It's obviously ambiguous and hard to get right. I still don't see any use case that even comes close to making the inheritance useful enough to justify the amount of work (code, tests, bugfixes) and docs that are required. Greetings, Andres Freund
Re: clean up docs for v12
Hi, On 2019-04-22 14:17:48 -0400, Tom Lane wrote: > Andres Freund writes: > > On 2019-04-22 13:27:17 -0400, Tom Lane wrote: > >> I wonder > >> also if it wouldn't be smart to explicitly check that the "guaranteeing" > >> column is not attisdropped. > > > Yea, that probably would be smart. I don't think there's an active > > problem, because we remove NOT NULL when deleting an attribute, but it > > seems good to be doubly sure / explain why that's safe: > > /* Remove any NOT NULL constraint the column may have */ > > attStruct->attnotnull = false; > > I'm a bit unsure whether to make it an assert, elog(ERROR) or just not > > assume column presence? > > I'd just make the code look like > > /* > * If it's NOT NULL then it must be present in every tuple, > * unless there's a "missing" entry that could provide a non-null > * value for it. Out of paranoia, also check !attisdropped. > */ > if (att->attnotnull && > !att->atthasmissing && > !att->attisdropped) > guaranteed_column_number = attnum; > > I don't think the extra check is so expensive as to be worth obsessing > over. Oh, yea, the cost is irrelevant here - it's one-off work basically, and pales in comparison to the cost of JITing. I was more thinking about whether it's worth "escalating" the violation of assumptions. Greetings, Andres Freund
Re: clean up docs for v12
Hi, On 2019-04-22 13:18:18 -0400, Tom Lane wrote: > Andres Freund writes: > >> (Possibly I'd not think this if I weren't fresh off a couple of days > >> with my nose in the ALTER TABLE SET NOT NULL code. But right now, > >> I think that believing that that code does not and never will have > >> any bugs is just damfool.) > > > But there's plenty places where we rely on NOT NULL actually working? > > I do not think there are any other places where we make this particular > assumption. Sure, not exactly the assumption that JITed deforming benefits from, but as far as I can tell, plenty of things would be broken just as well if we allowed NOT NULL columns to not be present (whether "physically" present or present via atthasmissing) for tuples in a table. Fast defaults wouldn't work, Assert(!isnull) checks would fire, primary keys would be broken etc. > In hopes of putting some fear into you too, I exhibit the following > behavior, which is not a bug according to our current definitions:
>
> regression=# create table pp(f1 int);
> CREATE TABLE
> regression=# create table cc() inherits (pp);
> CREATE TABLE
> regression=# insert into cc values(1);
> INSERT 0 1
> regression=# insert into cc values(2);
> INSERT 0 1
> regression=# insert into cc values(null);
> INSERT 0 1
> regression=# alter table pp add column f2 text;
> ALTER TABLE
> regression=# alter table pp add column f3 text;
> ALTER TABLE
> regression=# alter table only pp alter f3 set not null;
> ALTER TABLE
> regression=# select * from pp;
>  f1 | f2 | f3
> ----+----+----
>   1 |    |
>   2 |    |
>     |    |
> (3 rows)
>
> The tuples coming out of cc will still have natts = 1, I believe. > If they were deformed according to pp's tupdesc, there'd be a > problem. Now, we shouldn't do that, because this is not the only > possible discrepancy between parent and child tupdescs --- but > I think this example shows that attnotnull is a lot spongier > than you are assuming, even without considering the possibility > of outright bugs. Unfortunately it doesn't really put the fear into me - given that attribute numbers don't even have to match between inheritance children, making inferences about the length of the NULL bitmap seems peanuts compared to the breakage of using the wrong tupdesc to deform. Greetings, Andres Freund
Re: block-level incremental backup
Greetings, * Robert Haas (robertmh...@gmail.com) wrote: > On Mon, Apr 22, 2019 at 1:08 PM Stephen Frost wrote: > > > One could instead just do a straightforward extension > > > to the existing BASE_BACKUP command to enable incremental backup. > > > > Ok, how do you envision that? As I mentioned up-thread, I am concerned > > that we're talking too high-level here and it's making the discussion > > more difficult than it would be if we were to put together specific > > ideas and then discuss them. > > > > One way I can imagine to extend BASE_BACKUP is by adding LSN as an > > optional parameter and then having the database server scan the entire > > cluster and send a tarball which contains essentially a 'diff' file of > > some kind for each file where we can construct a diff based on the LSN, > > and then the complete contents of the file for everything else that > > needs to be in the backup. > > /me scratches head. Isn't that pretty much what I described in my > original post? I even described what that "'diff' file of some kind" > would look like in some detail in the paragraph of that emailed > numbered "2.", and I described the reasons for that choice at length > in > http://postgr.es/m/ca+tgmozrqdv-tb8ny9p+1pqlqkxp5f1afghuohh5qt6ewdk...@mail.gmail.com > > I can't figure out how I'm managing to be so unclear about things > about which I thought I'd been rather explicit. There was basically zero discussion about what things would look like at a protocol level (I went back and skimmed over the thread before sending my last email to specifically see if I was going to get this response back..). I get the idea behind the diff file, the contents of which I wasn't getting into above. > > So, sure, that would work, but it wouldn't be able to be parallelized > > and I don't think it'd end up being very exciting for the external tools > > because of that, but it would be fine for pg_basebackup. > > Stop being such a pessimist. Yes, if we only add the option to the > BASE_BACKUP command, it won't directly be very exciting for external > tools, but a lot of the work that is needed to do things that ARE > exciting for external tools will have been done. For instance, if the > work to figure out which blocks have been modified via WAL-scanning > gets done, and initially that's only exposed via BASE_BACKUP, it won't > be much work for somebody to write code for a new code that exposes > that information directly through some new replication command. > There's a difference between something that's going in the wrong > direction and something that's going in the right direction but not as > far or as fast as you'd like. And I'm 99% sure that everything I'm > proposing here falls in the latter category rather than the former. I didn't mean to imply that you're doing in the wrong direction here and I thought I said somewhere in my last email more-or-less exactly the same, that a great deal of the work needed for block-level incremental backup would be done, but specifically that this proposal wouldn't allow external tools to leverage that. It sounds like what you're suggesting now is that you're happy to implement the backend code, expose it in a way that works just for pg_basebackup, and that if someone else wants to add things to the protocol to make it easier for external tools to leverage, great. 
All I can say is that that's basically how we ended up in the situation we're in today where pg_basebackup doesn't support parallel backup but a bunch of external tools do and they don't go through the backend to get there, even though they'd probably prefer to. > > On the other hand, if you added new commands for 'list of files changed > > since this LSN' and 'give me this file' and 'give me this file with the > > changes in it since this LSN', then pg_basebackup could work with that > > pretty easily in a single-threaded model (maybe with two connections to > > the backend, but still in a single process, or maybe just by slurping up > > the file list and then asking for each one) and the external tools could > > leverage those new capabilities too for their backups, both full backups > > and incremental ones. This also wouldn't have to change how > > pg_basebackup does full backups today one bit, so what we're really > > talking about here is the direction to take the new code that's being > > written, not about rewriting existing code. I agree that it'd be a bit > > more work... but hopefully not *that* much more, and it would mean we > > could later add parallel backup to pg_basebackup more easily too, if we > > wanted to. > > For purposes of implementing parallel pg_basebackup, it would probably > be better if the server rather than the client decided which files to > send via which connection. If the client decides, then every time the > server finishes sending a file, the client has to request another > file, and that introduces some latency: after the server finishes > sending each file, it has
Re: pg_dump is broken for partition tablespaces
Andres Freund writes: > On 2019-04-22 14:16:28 -0400, Alvaro Herrera wrote: >> I think we can get out of this whole class of problems by forbidding the >> TABLESPACE clause for partitioned rels from mentioning the database >> tablespace -- that is, users either mention some *other* tablespace, or >> partitions follow default_tablespace like everybody else. AFAICS with >> that restriction this whole problem does not arise, and the patch may >> become simpler. I'll give it a spin. > Why is the obvious answer is to not just remove the whole tablespace > inheritance behaviour? It's obviously ambiguous and hard to get right. > I still don't see any usecase that even comes close to making the > inheritance useful enough to justify the amount of work (code, tests, > bugfixes) and docs that are required. Yeah, that's where I'm at as well. Alvaro's proposal could be made to work perhaps, but I think it would still end up with some odd corner-case behaviors. One example is that "TABLESPACE X" would be allowed if the database's default tablespace is Y, but if you try to dump and restore into a database whose default is X, it'd be rejected (?). The results after ALTER DATABASE ... SET TABLESPACE X are unclear too. regards, tom lane
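Spelled out under Alvaro's proposed restriction (hypothetical, with made-up tablespace names), the corner case Tom describes would go roughly like this:

    -- in a database whose default tablespace is ts_y:
    CREATE TABLE p (a int) PARTITION BY LIST (a) TABLESPACE ts_x;   -- accepted: ts_x is not the default
    -- pg_dump then emits "... TABLESPACE ts_x"; restoring that dump into a
    -- database whose default tablespace *is* ts_x would be rejected by the
    -- same rule, even though the source database accepted the definition.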
Re: block-level incremental backup
Hi, On 2019-04-22 14:26:40 -0400, Stephen Frost wrote: > I'm disappointed that the concerns about the trouble that end users are > likely to have with this didn't garner more discussion. My impression is that endusers are having a lot more trouble due to important backup/restore features not being in core/pg_basebackup, than due to external tools having a harder time to implement certain features. Focusing on external tools being able to provide all those features, because core hasn't yet, is imo entirely the wrong thing to concentrate upon. And it's not like things largely haven't been implemented in pg_basebackup for fundamental architectural reasons. It's because we've built like 5 different external tools with randomly differing featureset and licenses. Greetings, Andres Freund
Re: TM format can mix encodings in to_char()
On Sun, Apr 21, 2019 at 6:26 AM Andrew Dunstan wrote: > How does one do that? Just set a Turkish locale? tr_TR is, in a sense, special among locales: http://blog.thetaphi.de/2012/07/default-locales-default-charsets-and.html The Turkish dotless i has apparently been implicated in all kinds of bugs in quite a variety of contexts. -- Peter Geoghegan
Re: Allow any[] as input arguments for sql/plpgsql functions to mimic format()
Michał "phoe" Herda writes: > My reasoning in this case is - if we allow the any[] type to only be > passed to other functions that accept any[], and disallow any kind of > other operations on this array (such as retrieving its elements or > modifying it), I do not yet see any places where it might introduce a > performance regression.

Performance regressions are not the question here --- or at least, there are a lot of other questions to get past first.

* plpgsql doesn't have any mechanism for restricting the use of a parameter in the way you suggest. It's not clear if it'd be practical to add one, given the arms-length way in which plpgsql does expression evaluation, and it seems likely that any such thing would be messy and bug-prone.

* There's not actually any such type as any[]. There's anyarray, which is not what you're wishing for here because it'd constrain all the actual arguments to be the same type (or at least coercible to the same array element type). This is related to the next point...

* format() isn't declared as taking any[]. It's really

regression=# \df format
                              List of functions
   Schema   |  Name  | Result data type | Argument data types  | Type
------------+--------+------------------+----------------------+------
 pg_catalog | format | text             | text                 | func
 pg_catalog | format | text             | text, VARIADIC "any" | func
(2 rows)

"VARIADIC any" is a very special hack, because unlike other VARIADIC cases, it doesn't result in collapsing the actual arguments into an array. (Again, it can't because they might not have a common type.) The called function has to have special logic for examining its arguments to find out how many there are and what their types are. format() can do that because it's written in C, but a plpgsql function, not so much.

* We could imagine allowing a plpgsql function to be declared "VARIADIC any", and treating it as having N polymorphic arguments of independent types, but then what? plpgsql has no notation for accessing such arguments (they wouldn't have names, to start with), nor for finding out how many there are, and it certainly has no notation for passing the whole group of them on to some other function.

I think the closest you'll be able to get here is to declare the plpgsql function as taking "variadic text[]" and then passing the text array to format() with a VARIADIC marker. That should work mostly okay, although calls might need explicit casts to text in some cases. FWIW, it'd likely be easier to get around these problems in plperl or pltcl than in plpgsql, as those are both much less concerned with the exact data types of their arguments than plpgsql, and more accustomed to dealing with functions with variable argument lists. I don't know Python well enough to say whether the same is true of plpython. regards, tom lane
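To make that suggestion concrete, here is a minimal plpgsql sketch of that shape; the function name and the hard-coded format string are stand-ins for the original poster's table-driven error lookup, which is omitted:

    CREATE OR REPLACE FUNCTION gateway_error(err_id text, VARIADIC args text[])
    RETURNS void
    LANGUAGE plpgsql AS
    $$
    DECLARE
        fmt text;
    BEGIN
        -- in the real function, fmt (and the SQLSTATE) would be looked up
        -- from a table keyed by err_id; hard-coded here for the sketch
        fmt := 'The person %s has only %s bananas.';
        RAISE EXCEPTION '%', format(fmt, VARIADIC args);
    END;
    $$;

    SELECT gateway_error('err_bananas', 'Fred', '8');
    -- non-text arguments need an explicit cast, as noted above, e.g.:
    SELECT gateway_error('err_bananas', 'Fred', 8::text);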
Re: pg_dump is broken for partition tablespaces
On 2019-Apr-22, Andres Freund wrote: > Why is the obvious answer is to not just remove the whole tablespace > inheritance behaviour? Because it was requested by many, and there were plenty of people surprised that things didn't work that way. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: block-level incremental backup
Greetings, * Andres Freund (and...@anarazel.de) wrote: > On 2019-04-22 14:26:40 -0400, Stephen Frost wrote: > > I'm disappointed that the concerns about the trouble that end users are > > likely to have with this didn't garner more discussion. > > My impression is that endusers are having a lot more trouble due to > important backup/restore features not being in core/pg_basebackup, than > due to external tools having a harder time to implement certain > features. I had been referring specifically to the concern I raised about incremental block-level backups being added to pg_basebackup and how that'll make using pg_basebackup more complicated and therefore more difficult for end-users to get right, particularly if the end user is having to handle management of the association between the full backup and the incremental backups. I wasn't referring to anything regarding external tools. > Focusing on external tools being able to provide all those > features, because core hasn't yet, is imo entirely the wrong thing to > concentrate upon. And it's not like things largely haven't been > implemented in pg_basebackup for fundamental architectural reasons. > It's because we've built like 5 different external tools with randomly > differing featureset and licenses. There's a few challenges when it comes to adding backup features to core. One of the reasons is that core naturally moves slower when it comes to development than external projects do, as was discusssed earlier on this thread. Another is that, when it comes to backup, specifically, people want to back up their *existing* systems, which means that they need a backup tool that's going to work with whatever version of PG they've currently got deployed and that's often a few years old already. Certainly when I've thought about features that we'd like to see and considered if there's something that could be implemented in core vs. implemented outside of core, the answer often ends up being "well, if we do it ourselves then we can make it work for PG 9.2 and above, and have it working for existing users, but if we work it in as part of core, it won't be available until next year and only for version 12 and above, and users can only use it once they've upgraded.." Thanks, Stephen signature.asc Description: PGP signature
Re: pg_dump is broken for partition tablespaces
On 2019-Apr-22, Tom Lane wrote: > Yeah, that's where I'm at as well. Alvaro's proposal could be made > to work perhaps, but I think it would still end up with some odd > corner-case behaviors. One example is that "TABLESPACE X" would > be allowed if the database's default tablespace is Y, but if you > try to dump and restore into a database whose default is X, it'd be > rejected (?). Hmm, I don't think so, because dump uses default_tablespace on a plain table instead of TABLESPACE clauses, and the table is attached afterwards. > The results after ALTER DATABASE ... SET TABLESPACE X > are unclear too. Currently we disallow SET TABLESPACE X if you have any table in that tablespace, and we do that by searching for files. A partitioned table would not have a file that would cause it to fail, so this is something to study. (BTW I think these tablespace behaviors are not tested very much. The tests we have are intra-database operations only, and there's only a single non-default tablespace.) -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
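For reference, a sketch of the existing behaviour being described (database and tablespace names are made up):

    -- run from another database; fails today if any relation of mydb
    -- already has storage (i.e. files) in ts_x:
    ALTER DATABASE mydb SET TABLESPACE ts_x;

Since a partitioned table placed in ts_x exists only in the catalogs and has no file there, the current file-based check would not notice it, which is the open question above.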
Re: TM format can mix encodings in to_char()
Peter Geoghegan writes: > On Sun, Apr 21, 2019 at 6:26 AM Andrew Dunstan > wrote: >> How does one do that? Just set a Turkish locale? > tr_TR is, in a sense, special among locales: > http://blog.thetaphi.de/2012/07/default-locales-default-charsets-and.html > The Turkish dotless i has apparently been implicated in all kinds of > bugs in quite a variety of contexts. Yeah, we've had our share of those :-(. But the dotless i is not the problem here --- it happens to not trigger an encoding conversion issue, it seems. Amusingly, the existing test case for lc_time = tr_TR in collate.linux.utf8.sql is specifically coded to check what happens with dotted/dotless i, and yet it manages to not trip over this problem. (I suspect the reason is that what comes out of strftime is "Nis" which is ASCII, and the non-ASCII characters only arise from subsequent case conversion within PG.) regards, tom lane
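For anyone who wants to poke at this, a minimal sketch of the kind of query involved, assuming a tr_TR UTF-8 locale is installed and the database encoding is UTF8:

    SET lc_time = 'tr_TR.utf8';
    SELECT to_char(date '2019-04-01', 'TMMon');    -- 'Nis': pure ASCII, so no conversion is exercised
    SELECT to_char(date '2019-02-01', 'TMMonth');  -- 'Şubat': contains a non-ASCII character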
Re: pg_dump is broken for partition tablespaces
Alvaro Herrera writes: > On 2019-Apr-22, Andres Freund wrote: >> Why is the obvious answer is to not just remove the whole tablespace >> inheritance behaviour? > Because it was requested by many, and there were plenty of people > surprised that things didn't work that way. There are lots of things in SQL that people find surprising. In this particular case, "we can't do it because it conflicts with ancient decisions about how PG tablespaces work" seems like a defensible answer, even without getting into the question of whether "partitions inherit their tablespace from the parent" is really any less surprising than "partitions work exactly like normal tables as far as tablespace selection goes". regards, tom lane
Re: pgsql: Allow insert and update tuple routing and COPY for foreign table
On Mon, 2019-04-22 at 21:45 +0900, Etsuro Fujita wrote: Thanks for looking into this! > (2019/04/20 20:53), Laurenz Albe wrote: > > On Fri, 2018-04-06 at 23:24 +, Robert Haas wrote: > > > Allow insert and update tuple routing and COPY for foreign tables. > > > > > > Also enable this for postgres_fdw. > > > > > > Etsuro Fujita, based on an earlier patch by Amit Langote. The larger > > > patch series of which this is a part has been reviewed by Amit > > > Langote, David Fetter, Maksim Milyutin, Álvaro Herrera, Stephen Frost, > > > and me. Minor documentation changes to the final version by me. > > > > > > Discussion: > > > http://postgr.es/m/29906a26-da12-8c86-4fb9-d8f88442f...@lab.ntt.co.jp > > > > This commit makes life hard for foreign data wrappers that support > > data modifications. > > > > If a FDW implements ExecForeignInsert, this commit automatically assumes > > that it also supports COPY FROM. It will call ExecForeignInsert without > > calling PlanForeignModify and BeginForeignModify, and a FDW that does not > > expect that will probably fail. > > This is not 100% correct; the FDW documentation says: > > > Tuples inserted into a partitioned table by > INSERT or > COPY FROM are routed to partitions. If an FDW > supports routable foreign-table partitions, it should also provide the > following callback functions. These functions are also called when > COPY FROM is executed on a foreign table. > I don't see the difference between the documentation and what I wrote above. Before v11, a FDW could expect that ExecForeignInsert is only called if BeginForeignModify was called earlier. That has silently changed with v11. > > maybe there are FDWs that support INSERT but don't want to support COPY > > for some reason. > > I agree on that point. > > > I propose that PostgreSQL only allows COPY FROM on a foreign table if the > > FDW > > implements BeginForeignInsert. The attached patch implements that. > > I don't think that is a good idea, because there might be some FDWs that > want to support COPY FROM on foreign tables without providing > BeginForeignInsert. (As for INSERT into foreign tables, we actually > allow FDWs to support it without providing PlanForeignModify, > BeginForeignModify, or EndForeignModify.) > > It's permissible to throw an error in BeginForeignInsert, so what I was > thinking for FDWs that don't want to support COPY FROM and > INSERT/UPDATE/COPY FROM tuple routing was to provide BeginForeignInsert > implementing something like this: > > static void > fooBeginForeignInsert(ModifyTableState *mtstate, >ResultRelInfo *resultRelInfo) > { > Relationrel = resultRelInfo->ri_RelationDesc; > > if (mtstate->ps.plan == NULL) > ereport(ERROR, > (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > errmsg("cannot copy to foreign table \"%s\"", > RelationGetRelationName(rel; > else > ereport(ERROR, > (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > errmsg("cannot route tuples into foreign table \"%s\"", > RelationGetRelationName(rel; > } Sure, it is not hard to modify a FDW to continue working with v11. My point is that this should not be necessary. If a FDW worked well with v10, it should continue to work with v11 unless there is a necessity for a compatibility-breaking change. On the other hand, if a FDW wants to support COPY in v11 and has no need for BeginForeignInsert to support that, it is a simple exercise for it to provide an empty BeginForeignInsert just to signal that it wants to support COPY. I realized that my previous patch forgot to check for tuple routing, updated patch attached. 
Yours, Laurenz Albe From 545ac9d567ea6ca610ce08355451923cc787e13d Mon Sep 17 00:00:00 2001 From: Laurenz Albe Date: Mon, 22 Apr 2019 21:35:06 +0200 Subject: [PATCH] Foreign table COPY FROM and tuple routing requires BeginForeignInsert Commit 3d956d956a introduced support for foreign tables as partitions and COPY FROM on foreign tables. If a foreign data wrapper supports data modifications, but either has not adapted to this change yet or doesn't want to support it for other reasons, it probably got broken by the above commit, because COPY will just call ExecForeignInsert anyway, which might not work because neither PlanForeignModify nor BeginForeignModify have been called. To avoid breaking third-party foreign data wrappers in that way, allow COPY FROM and tuple routing for foreign tables only if the foreign data wrapper implements BeginForeignInsert. --- doc/src/sgml/fdwhandler.sgml | 2 ++ src/backend/commands/copy.c | 5 + src/backend/executor/execPartition.c | 17 - 3 files changed, 19 insertions(+), 5 deletions(-) diff --git a/doc/src/sgml/fdwhandler.sgml b/doc/src/sgml/fdwhandler.sgml index 2c07a86637..84f
Re: block-level incremental backup
On Mon, Apr 22, 2019 at 2:26 PM Stephen Frost wrote: > There was basically zero discussion about what things would look like at > a protocol level (I went back and skimmed over the thread before sending > my last email to specifically see if I was going to get this response > back..). I get the idea behind the diff file, the contents of which I > wasn't getting into above. Well, I wrote: "There should be a way to tell pg_basebackup to request from the server only those blocks where LSN >= threshold_value." I guess I assumed that people would interested in the details take that to mean "and therefore the protocol would grow an option for this type of request in whatever way is the most straightforward possible extension of the current functionality is," which is indeed how you eventually interpreted it when you said we could "extend BASE_BACKUP is by adding LSN as an optional parameter." I could have been more explicit, but sometimes people tell me that my emails are too long. > external tools to leverage that. It sounds like what you're suggesting > now is that you're happy to implement the backend code, expose it in a > way that works just for pg_basebackup, and that if someone else wants to > add things to the protocol to make it easier for external tools to > leverage, great. Yep, that's more or less it, although I am potentially willing to do some modest amount of that other work along the way. I just don't want to prioritize it higher than getting the actual thing I want to build built, which I think is a pretty fair position for me to take. > All I can say is that that's basically how we ended up > in the situation we're in today where pg_basebackup doesn't support > parallel backup but a bunch of external tools do and they don't go > through the backend to get there, even though they'd probably prefer to. I certainly agree that core should try to do things in a way that is useful to external tools when that can be done without undue effort, but only if it can actually be done without undo effort. Let's see whether that's the case here: - Anastasia wants a command added that dumps out whatever the server knows about what files have changed, which I already agreed was a reasonable extension of my initial proposal. - You said that for this to be useful to pgbackrest, it'd have to use a whole different mechanism that includes commands to request individual files and blocks within those files, which would be a significant rewrite of pg_basebackup that you agreed is more closely related to parallel backup than to the project under discussion on this thread. And that even then pgbackrest probably wouldn't use it because it also does server-side compression and encryption which are not included in this proposal. It seems to me that the first one falls into the category a reasonable additional effort and the second one falls into the category of lots of extra and unrelated work that wouldn't even get used. > Thanks for sharing your thoughts on that, certainly having the backend > able to be more intelligent about streaming files to avoid latency is > good and possibly the best approach. Another alternative to reducing > the latency would be to have a way for the client to request a set of > files, but I don't know that it'd be better. I don't know either. This is an area that needs more thought, I think, although as discussed, it's more related to parallel backup than $SUBJECT. 
> I'm not really sure why the above is extremely inconvenient for > third-party tools, beyond just that they've already been written to work > with an assumption that the server-side of things isn't as intelligent > as PG is. Well, one thing you might want to do is have a tool that connects to the server, enters backup mode, requests information on what blocks have changed, copies those blocks via direct filesystem access, and then exits backup mode. Such a tool would really benefit from a START_BACKUP / SEND_FILE_LIST / SEND_FILE_CONTENTS / STOP_BACKUP command language, because it would just skip ever issuing the SEND_FILE_CONTENTS command in favor of doing that part of the work via other means. On the other hand, a START_PARALLEL_BACKUP LSN '1/234' command is useless to such a tool. Contrariwise, a tool that has its own magic - perhaps based on WAL-scanning or something like ptrack - to know which files currently exist and which blocks are modified could use SEND_FILE_CONTENTS but not SEND_FILE_LIST. And a filesystem-snapshot based technique might use START_BACKUP and STOP_BACKUP but nothing else. In short, providing granular commands like this lets the client be really intelligent even if the server isn't, and lets the client have fine-grained control of the process. This is very good if you're an out-of-core tool maintainer and your tool is trying to be smarter than - or even just differently-designed than - core. But if what you really want is just a maximally-efficient parallel backup, you don't nee
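To keep the alternatives straight, here is a rough sketch of the kind of exchange being discussed, using only the placeholder command names from this thread (none of these commands exist in any released server):

    START_BACKUP;                  -- enter backup mode, as a snapshot-based tool would
    SEND_FILE_LIST;                -- returns the files the backup must cover
    SEND_FILE_CONTENTS 'base/16384/16385' LSN '1/234';
                                   -- full file, or only blocks changed since the given LSN
    STOP_BACKUP;                   -- leave backup mode
    -- a tool with its own change tracking might skip SEND_FILE_LIST, and a
    -- filesystem-snapshot tool might use only START_BACKUP/STOP_BACKUP.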
translatability tweaks
Here's a small number of translatability tweaks to the error messages in the backend. I found these while updating the Spanish translation over the past weekend, a task I had neglected for two or three years, so they might involve some older messages. However, I won't backpatch this -- only apply to pg12 tomorrow or so. (I haven't yet gone over the full backend message catalog yet, so I might propose more fixes later on.) -- Álvaro Herrerahttp://www.twitter.com/alvherre >From 091d4043aca3fa4936f1020b635a78e8b6d9010d Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Mon, 22 Apr 2019 15:45:06 -0400 Subject: [PATCH] Unify error messages ... for translatability purposes. --- src/backend/storage/ipc/latch.c | 12 +--- src/backend/storage/ipc/signalfuncs.c | 4 +++- src/backend/utils/adt/formatting.c| 9 ++--- src/backend/utils/adt/genfile.c | 4 +++- src/backend/utils/adt/json.c | 4 +++- src/backend/utils/adt/jsonb.c | 4 +++- 6 files changed, 27 insertions(+), 10 deletions(-) diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c index 59fa917ae04..e0712f906a1 100644 --- a/src/backend/storage/ipc/latch.c +++ b/src/backend/storage/ipc/latch.c @@ -856,7 +856,9 @@ WaitEventAdjustEpoll(WaitEventSet *set, WaitEvent *event, int action) if (rc < 0) ereport(ERROR, (errcode_for_socket_access(), - errmsg("epoll_ctl() failed: %m"))); + /* translator: %s is a syscall name, such as "poll()" */ + errmsg("%s failed: %m", + "epoll_ctl()"))); } #endif @@ -1087,7 +1089,9 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout, waiting = false; ereport(ERROR, (errcode_for_socket_access(), - errmsg("epoll_wait() failed: %m"))); + /* translator: %s is a syscall name, such as "poll()" */ + errmsg("%s failed: %m", + "epoll_wait()"))); } return 0; } @@ -1211,7 +1215,9 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout, waiting = false; ereport(ERROR, (errcode_for_socket_access(), - errmsg("poll() failed: %m"))); + /* translator: %s is a syscall name, such as "poll()" */ + errmsg("%s failed: %m", + "poll()"))); } return 0; } diff --git a/src/backend/storage/ipc/signalfuncs.c b/src/backend/storage/ipc/signalfuncs.c index 4769b1b51eb..4bfbd57464c 100644 --- a/src/backend/storage/ipc/signalfuncs.c +++ b/src/backend/storage/ipc/signalfuncs.c @@ -181,7 +181,9 @@ pg_rotate_logfile(PG_FUNCTION_ARGS) ereport(ERROR, (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), (errmsg("must be superuser to rotate log files with adminpack 1.0"), - errhint("Consider using pg_logfile_rotate(), which is part of core, instead."; + /* translator: %s is a SQL function name */ + errhint("Consider using %s, which is part of core, instead.", + "pg_logfile_rotate()"; if (!Logging_collector) { diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c index df1db7bc9f1..69a691f18e7 100644 --- a/src/backend/utils/adt/formatting.c +++ b/src/backend/utils/adt/formatting.c @@ -1566,7 +1566,8 @@ str_tolower(const char *buff, size_t nbytes, Oid collid) */ ereport(ERROR, (errcode(ERRCODE_INDETERMINATE_COLLATION), - errmsg("could not determine which collation to use for lower() function"), + errmsg("could not determine which collation to use for %s function", +"lower()"), errhint("Use the COLLATE clause to set the collation explicitly."))); } mylocale = pg_newlocale_from_collation(collid); @@ -1688,7 +1689,8 @@ str_toupper(const char *buff, size_t nbytes, Oid collid) */ ereport(ERROR, (errcode(ERRCODE_INDETERMINATE_COLLATION), - errmsg("could not determine which collation to use for upper() function"), + 
errmsg("could not determine which collation to use for %s function", +"upper()"), errhint("Use the COLLATE clause to set the collation explicitly."))); } mylocale = pg_newlocale_from_collation(collid); @@ -1811,7 +1813,8 @@ str_initcap(const char *buff, size_t nbytes, Oid collid) */ ereport(ERROR, (errcode(ERRCODE_INDETERMINATE_COLLATION), - errmsg("could not determine which collation to use for initcap() function"), + errmsg("could not determine which collation to use for %s function", +"initcap()"), errhint("Use the COLLATE clause to set the collation explicitly."))); } mylocale = pg_newlocale_from_collation(collid); diff --git a/src/backend/utils/adt/genfile.c b/src/backend/utils/adt/genfile.c index d6976609968..a3c6adaf640 100644 --- a/src/backend/utils/adt/genfile.c +++ b/src/backend/utils/adt/genfile.c @@ -219,7 +219,9 @@ pg_read_file(PG_FUNCTION_ARGS) ereport(ERROR, (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), (errmsg("must be superuser to read files with adminpack 1.0"), - errhint("
Re: [HACKERS] WIP: long transactions on hot standby feedback replica / proof of concept
On Thu, Nov 29, 2018 at 3:44 PM Dmitry Dolgov <9erthali...@gmail.com> wrote: > > > On Fri, Aug 24, 2018 at 5:53 PM Alexander Korotkov > > wrote: > > > > Given I've no feedback on this idea yet, I'll try to implement a PoC > > patch for that. It doesn't look to be difficult. And we'll see how > > does it work. > > Unfortunately, current version of the patch doesn't pass the tests and fails on > initdb. But maybe you already have this PoC you were talking about, that will > also incorporate the feedback from this thread? For now I'll move it to the > next CF.

Finally, I managed to write a PoC. If we look at the list of problems I've enumerated in [1], this PoC is aimed at 1 and 3.

> 1) Data corruption on file truncation error (explained in [1]).
> 2) Expensive scanning of the whole shared buffers before file truncation.
> 3) Cancel of read-only queries on standby even if hot_standby_feedback > is on, caused by replication of AccessExclusiveLock.

2 is a pretty independent problem and could be addressed later. Basically, this patch does the following:

1. Introduces a new flag, BM_DIRTY_BARRIER, which prevents a dirty buffer from being written out.
2. Implements two-phase truncation of relation buffers. The first phase runs before the file truncation and marks dirty buffers past the truncation point as BM_DIRTY_BARRIER. The second phase runs after the file truncation and actually wipes out the buffers past the truncation point.
3. If an exception happens during file truncation, the BM_DIRTY_BARRIER flag is released from the buffers. Thus, no data corruption should happen here. If the file truncation was partially complete, the file might be re-extended by a write of a dirty buffer. I'm not sure how likely that is, but such an extension could lead to the errors again. Still, this shouldn't cause data corruption.
4. Having too many buffers marked as BM_DIRTY_BARRIER would paralyze the buffer manager. This is why we keep no more than NBuffers/2 buffers marked as BM_DIRTY_BARRIER. If the limit is exceeded, the remaining dirty buffers are simply written out in the first phase.
5. lazy_truncate_heap() now takes ExclusiveLock instead of AccessExclusiveLock. This part is not really complete. At least, we need to ensure that reads past the truncation point, caused by read-only queries concurrent with the truncation, don't lead to real errors.

Any thoughts?

1. https://www.postgresql.org/message-id/CAPpHfdtD3U2DpGZQJNe21s9s1s-Va7NRNcP1isvdCuJxzYypcg%40mail.gmail.com -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company 0001-POC-fix-relation-truncation-1.patch Description: Binary data
Re: pg_dump is broken for partition tablespaces
On Mon, Apr 22, 2019 at 3:08 PM Alvaro Herrera wrote: > On 2019-Apr-22, Andres Freund wrote: > > Why is the obvious answer is to not just remove the whole tablespace > > inheritance behaviour? > > Because it was requested by many, and there were plenty of people > surprised that things didn't work that way. On the other hand, that behavior worked correctly on its own terms, and this behavior seems to be broken, and you've been through a whole series of possible designs trying to figure out how to fix it, and it's still not clear that you've got a working solution. I don't know whether that shows that it is impossible to make this idea work sensibly, but at the very least it proves that the whole area needed a lot more thought than it got before this code was committed (a complaint that I also made at the time, if you recall). "Surprising" is not great, but it is clearly superior to "broken." -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: pgsql: Allow insert and update tuple routing and COPY for foreign table
On Mon, Apr 22, 2019 at 3:37 PM Laurenz Albe wrote: > Sure, it is not hard to modify a FDW to continue working with v11. > > My point is that this should not be necessary. I'm not sure whether this proposal is a good idea or a bad idea, but I think that it's inevitable that FDWs are going to need some updating for new releases as the API evolves. That has happened before and will continue to happen. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: pgsql: Allow insert and update tuple routing and COPY for foreign table
Hi, On 2019-04-22 21:37:25 +0200, Laurenz Albe wrote: > Subject: [PATCH] Foreign table COPY FROM and tuple routing requires > BeginForeignInsert > > Commit 3d956d956a introduced support for foreign tables as partitions > and COPY FROM on foreign tables. > > If a foreign data wrapper supports data modifications, but either has > not adapted to this change yet or doesn't want to support it > for other reasons, it probably got broken by the above commit, > because COPY will just call ExecForeignInsert anyway, which might not > work because neither PlanForeignModify nor BeginForeignModify have > been called. > > To avoid breaking third-party foreign data wrappers in that way, allow > COPY FROM and tuple routing for foreign tables only if the foreign data > wrapper implements BeginForeignInsert. Isn't this worse though? Before this it's an API change between major versions. With this it's an API change in a *minor* version. Sure, it's one that doesn't crash, but it's still a pretty substantial function regression, no? Greetings, Andres Freund
Re: pg_dump is broken for partition tablespaces
On 2019-Apr-22, Robert Haas wrote: > On Mon, Apr 22, 2019 at 3:08 PM Alvaro Herrera > wrote: > > On 2019-Apr-22, Andres Freund wrote: > > > Why is the obvious answer is to not just remove the whole tablespace > > > inheritance behaviour? > > > > Because it was requested by many, and there were plenty of people > > surprised that things didn't work that way. > > On the other hand, that behavior worked correctly on its own terms, > and this behavior seems to be broken, and you've been through a whole > series of possible designs trying to figure out how to fix it, and > it's still not clear that you've got a working solution. Well, frequently when people discuss ideas on this list, others discuss and provide further ideas to try to help find a working solution, rather than throw every roadblock they can think of (though roadblocks are indeed thrown now and then). If I've taken a long time to find a working solution, maybe it's because I have no shoulders of giants to stand on, and I'm a pretty short guy, so I need to build me a ladder. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: pgsql: Allow insert and update tuple routing and COPY for foreign table
On Mon, 2019-04-22 at 16:24 -0400, Robert Haas wrote: > On Mon, Apr 22, 2019 at 3:37 PM Laurenz Albe wrote: > > Sure, it is not hard to modify a FDW to continue working with v11. > > > > My point is that this should not be necessary. > > I'm not sure whether this proposal is a good idea or a bad idea, but I > think that it's inevitable that FDWs are going to need some updating > for new releases as the API evolves. That has happened before and > will continue to happen. Absolutely. I am just unhappy that this change caused unnecessary breakage. If you developed a read-only FDW for 9.2, it didn't break with the write support introduced in 9.3, because that used different API functions. That's how it should be IMHO. If you developed a FDW for 9.1, it got broken in 9.2, because the API had to change to allow returning multiple paths. That was unfortunate but necessary, so it is ok. Nothing in this patch required an incompatible change. Yours, Laurenz Albe
Re: pgsql: Allow insert and update tuple routing and COPY for foreign table
On Mon, 2019-04-22 at 13:27 -0700, Andres Freund wrote: > On 2019-04-22 21:37:25 +0200, Laurenz Albe wrote: > > Commit 3d956d956a introduced support for foreign tables as partitions > > and COPY FROM on foreign tables. > > > > If a foreign data wrapper supports data modifications, but either has > > not adapted to this change yet or doesn't want to support it > > for other reasons, it probably got broken by the above commit, > > because COPY will just call ExecForeignInsert anyway, which might not > > work because neither PlanForeignModify nor BeginForeignModify have > > been called. > > > > To avoid breaking third-party foreign data wrappers in that way, allow > > COPY FROM and tuple routing for foreign tables only if the foreign data > > wrapper implements BeginForeignInsert. > > Isn't this worse though? Before this it's an API change between major > versions. With this it's an API change in a *minor* version. Sure, it's > one that doesn't crash, but it's still a pretty substantial function > regression, no? Hm, that's a good point. You could say that this patch is too late, because an FDW might already be relying on COPY FROM to work without BeginForeignInsert in v11. How about just applying the patch from v12 on? Then it is a behavior change in a major release, which is acceptable. It requires the imaginary FDW above to add an empty BeginForeignInsert callback function, but will unbreak FDWs that slept through the change that added COPY support. Yours, Laurenz Albe
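[Editor's sketch] For a third-party FDW author, the fix Laurenz describes amounts to registering a no-op callback. A rough sketch under assumed names (my_fdw_handler, my_begin_foreign_insert); the rest of the FdwRoutine, which an existing FDW would already fill in, is elided:

#include "postgres.h"
#include "fmgr.h"
#include "foreign/fdwapi.h"
#include "nodes/execnodes.h"

PG_MODULE_MAGIC;

/* No-op: this FDW's ExecForeignInsert needs no per-COPY/tuple-routing setup. */
static void
my_begin_foreign_insert(ModifyTableState *mtstate, ResultRelInfo *resultRelInfo)
{
}

PG_FUNCTION_INFO_V1(my_fdw_handler);

Datum
my_fdw_handler(PG_FUNCTION_ARGS)
{
    FdwRoutine *routine = makeNode(FdwRoutine);

    /* ... the scan and modify callbacks the FDW already provides ... */
    routine->BeginForeignInsert = my_begin_foreign_insert;

    PG_RETURN_POINTER(routine);
}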
Re: pgsql: Allow insert and update tuple routing and COPY for foreign table
Hi, On 2019-04-22 22:56:20 +0200, Laurenz Albe wrote: > On Mon, 2019-04-22 at 13:27 -0700, Andres Freund wrote: > > On 2019-04-22 21:37:25 +0200, Laurenz Albe wrote: > > > Commit 3d956d956a introduced support for foreign tables as partitions > > > and COPY FROM on foreign tables. > > > > > > If a foreign data wrapper supports data modifications, but either has > > > not adapted to this change yet or doesn't want to support it > > > for other reasons, it probably got broken by the above commit, > > > because COPY will just call ExecForeignInsert anyway, which might not > > > work because neither PlanForeignModify nor BeginForeignModify have > > > been called. > > > > > > To avoid breaking third-party foreign data wrappers in that way, allow > > > COPY FROM and tuple routing for foreign tables only if the foreign data > > > wrapper implements BeginForeignInsert. > > > > Isn't this worse though? Before this it's an API change between major > > versions. With this it's an API change in a *minor* version. Sure, it's > > one that doesn't crash, but it's still a pretty substantial function > > regression, no? > > Hm, that's a good point. > You could say that this patch is too late, because a FDW might already be > relying on COPY FROM to work without BeginForeignInsert in v11. I think that's the case. > How about just applying the patch from v12 on? > Then it is a behavior change in a major release, which is acceptable. > It requires the imaginary FDW above to add an empty BeginForeignInsert > callback function, but will unbreak FDW that slept through the change > that added COPY support. I fail to see the advantage. It'll still require FDWs to be fixed to work correctly for v11, but additionally adds another set of API differences that needs to be fixed by another set of FDWs. I think this ship simply has sailed. Greetings, Andres Freund
Re: TM format can mix encodings in to_char()
Actually, I tried to show my findings with the tr_TR regression test, but you can reproduce the same issue with other locales and non-ASCII characters, as Tom has pointed out. For example: de_DE ISO-8859-1: March, es_ES ISO-8859-1: Wednesday, fr_FR ISO-8859-1: February. Regards, Juan José Santamaría Flecha
Re: speeding up planning with partitions
Amit Langote writes: > On 2019/04/02 2:34, Tom Lane wrote: >> I'm not at all clear on what we think the interaction between >> enable_partition_pruning and constraint_exclusion ought to be, >> so I'm not sure what the appropriate resolution is here. Thoughts? > Prior to 428b260f87 (that is, in PG 11), partition pruning for UPDATE and > DELETE queries is realized by applying constraint exclusion to the > partition constraint of the target partition. The conclusion of the > discussion when adding the enable_partition_pruning GUC [1] was that > whether or not constraint exclusion is carried out (to facilitate > partition pruning) should be governed by the new GUC, not > constraint_exclusion, if only to avoid confusing users. I got back to thinking about how this ought to work. It appears to me that we've got half a dozen different behaviors that depend on one or both of these settings: 1. Use of ordinary table constraints (CHECK, NOT NULL) in baserel pruning, that is relation_excluded_by_constraints for baserels. This is enabled by constraint_exclusion = on. 2. Use of partition constraints in baserel pruning (applicable only when a partition is accessed directly). This is currently partly broken, and it's what your patch wants to change. 3. Use of ordinary table constraints in appendrel pruning, that is relation_excluded_by_constraints for appendrel members. This is enabled by constraint_exclusion >= partition. 4. Use of partition constraints in appendrel pruning. This is enabled by the combination of enable_partition_pruning AND constraint_exclusion >= partition. However, it looks to me like this is now nearly if not completely useless because of #5. 5. Use of partition constraints in expand_partitioned_rtentry. Enabled by enable_partition_pruning (see prune_append_rel_partitions). 6. Use of partition constraints in run-time partition pruning. This is also enabled by enable_partition_pruning, cf create_append_plan, create_merge_append_plan. Now in addition to what I mention above, there are assorted random differences in behavior depending on whether we are in an inherited UPDATE/DELETE or not. I consider these differences to be so bogus that I'm not even going to include them in this taxonomy; they should not exist. The UPDATE/DELETE target ought to act the same as a baserel. I think this is ridiculously overcomplicated even without said random differences. I propose that we do the following: * Get rid of point 4 by not considering partition constraints for appendrel members in relation_excluded_by_constraints. It's just useless cycles in view of point 5, or nearly so. (Possibly there are corner cases where we could prove contradictions between a relation's partition constraints and regular constraints ... but is it really worth spending planner cycles to look for that?) * Make point 2 like point 1 by treating partition constraints for baserels like ordinary table constraints, ie, they are considered only when constraint_exclusion = on (independently of whether enable_partition_pruning is on). * Treat an inherited UPDATE/DELETE target table as if it were an appendrel member for the purposes of relation_excluded_by_constraints, thus removing the behavioral differences between SELECT and UPDATE/DELETE. With this, constraint_exclusion would act pretty much as it traditionally has, and in most cases would not have any special impact on partitions compared to old-style inheritance. 
The behaviors that enable_partition_pruning would control are expand_partitioned_rtentry pruning and run-time pruning, neither of which has any applicability to old-style inheritance. Thoughts? regards, tom lane
Calling PrepareTempTablespaces in BufFileCreateTemp
Hi, PrepareTempTablespaces is called by most callers of BufFileCreateTemp, so I was wondering if there is a reason not to call it inside BufFileCreateTemp. As a developer using BufFileCreateTemp to write code that creates spill files, I found it easy to forget the extra step of checking the temp_tablespaces GUC to ensure the spill files are created there when it is set. Thanks, Melanie Plageman
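[Editor's sketch] A minimal sketch of the calling pattern in question, assuming a hypothetical caller build_spill_file(); PrepareTempTablespaces() and BufFileCreateTemp() are the real functions being discussed:

#include "postgres.h"
#include "commands/tablespace.h"    /* PrepareTempTablespaces() */
#include "storage/buffile.h"        /* BufFileCreateTemp() */

static BufFile *
build_spill_file(void)
{
    /*
     * The easy-to-forget step: without this, the spill file silently ends
     * up in the default tablespace even when temp_tablespaces is set.
     */
    PrepareTempTablespaces();

    return BufFileCreateTemp(false);    /* false: file does not outlive the transaction */
}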
Re: pg_dump is broken for partition tablespaces
On Mon, Apr 22, 2019 at 4:43 PM Alvaro Herrera wrote: > Well, frequently when people discuss ideas on this list, others discuss > and provide further ideas to try help to find a working solution, rather > than throw every roadblock they can think of (though roadblocks are > indeed thrown now and then). If I've taken a long time to find a > working solution, maybe it's because I have no shoulders of giants to > stand on, and I'm a pretty short guy, so I need to build me a ladder. What exactly do you mean by throwing up roadblocks? I don't have a basket full of great ideas for how to solve this problem that I'm failing to suggest out of some sort of perverse desire to see you fail. I'm not quite as convinced as Tom and Andres that this whole idea is fatally flawed and can't ever be made to work correctly, but I think it's quite possible that they are right, both because their objections sound to me like they are on target and because they are pretty smart people. But that's also because I haven't spent a lot of time on this issue, which I think is pretty fair, because it seems like it would be unfair to complain that I am not spending enough time helping fix code that I advised against committing in the first place. How much time should I spend giving you advice if my previous advice was ignored? But FWIW, it seems to me that a good place to start solving this problem would be to think hard about what Andres said here: http://postgr.es/m/20190306161744.22jdkg37fyi2z...@alap3.anarazel.de Specifically, this complaint: "I still think the feature as is doesn't seem to have very well defined behaviour." If we know what the feature is supposed to do, then it should be possible to look at each relevant piece of code and decide whether it implements the specification correctly or not. But if we don't have a specification -- that is, we don't know precisely what the feature is supposed to do -- then we'll just end up whacking the behavior around trying to get some combination that makes sense, and the whole effort is probably doomed. I think that the large quote block from David Rowley in the middle of the above-linked email gets at the definitional problem pretty clearly: the documentation seems to be intending to say -- although it is not 100% clear -- that if TABLESPACE is specified it has effect on all future children, and if not specified then those children get the tablespace they would have gotten anyway. But that behavior is impossible to implement correctly unless there is a way to distinguish between a partitioned table for which TABLESPACE was not specified and one where it was specified to be the same as the default tablespace for the database. And we know, per previous conversation, that the catalog representation admits of no way to make that distinction. Using reltablespace = dattablespace is clearly the wrong answer, because that breaks stuff. Tom earlier suggested, I believe, that some fixed OID could be used, e.g. reltablespace = 1 means "tablespace not specified" and reltablespace = 0 means dattablespace. That should be safe enough, since I don't think OID 1 can ever be the OID of a real tablespace. I think this is probably the only way forward if this definition captures the desired behavior. The other option is to decide that we want some other behavior. In that case, the way forward depends on what we want the behavior to be. 
Your proposal upthread is to disallow the case where the user provides an explicit TABLESPACE setting whose value matches the default tablespace for the database. But that definition seems to have an obvious problem: just because that's not true when the partitioned table is defined doesn't mean it won't become true later, because the default tablespace for the database can be changed, or the database can be copied and the copy be assigned a different default tablespace. Even if there were no issue, that definition doesn't sound very clean, because it means that reltablespace = 0 for a regular relation means dattablespace and for a partitioned relation it means none. Now that's not the only way we could go either, I suppose. There must be a variety of other possible behaviors. But the one Tom and Andres are proposing -- go back to the way it worked in v10 -- is definitely not crazy. You are right that a lot of people didn't like that, but the definition was absolutely clear, because it was the exact same definition we use for normal tables, with the straightforward extension that for relations with no storage it meant nothing. Going back to the proposal of making OID = 0 mean TABLESPACE dattablespace, OID = 1 mean no TABLESPACE clause specified for this partitioned table, and OID = whatever mean that particular non-default tablespace, I do see a couple of problems with that idea too: - I don't have any idea how to salvage things in v11, where the catalog representation is irredeemably ambiguous. We'd probably have to be satisfied with fairly goofy
Re: Calling PrepareTempTablespaces in BufFileCreateTemp
On Mon, Apr 22, 2019 at 3:12 PM Melanie Plageman wrote: > PrepareTempTablespaces is called by most callers of BufFileCreateTemp, so I > was > wondering if there is a reason not to call it inside BufFileCreateTemp. The best answer I can think of is that a BufFileCreateTemp() caller might not want to do catalog access. Perhaps the contortions within assign_temp_tablespaces() are something that callers ought to opt in to explicitly. That doesn't seem like a particularly good or complete answer, though. Perhaps it should simply be called within BufFileCreateTemp(). The BufFile/fd.c layering is confusing in a number of ways IMV. -- Peter Geoghegan
Re: pg_dump is broken for partition tablespaces
On 2019-Mar-06, Tom Lane wrote: > David Rowley writes: > > As far as I can see, the biggest fundamental difference with doing > > things this way will be that the column order of partitions will be > > preserved, where before it would inherit the order of the partitioned > > table. I'm a little unsure if doing this column reordering was an > > intended side-effect or not. > > Well, if the normal behavior results in changing the column order, > it'd be necessary to do things differently in --binary-upgrade mode > anyway, because there we *must* preserve column order. I don't know > if what you're describing represents a separate bug for pg_upgrade runs, > but it might. Is there any test case for the situation left behind by > the core regression tests? Now that I re-read this complaint once again, I wonder if a mismatching column order in partitions isn't a thing we ought to preserve anyhow. Robert, Amit -- is it by design that pg_dump loses the original column order for partitions, when not in binary-upgrade mode? To me, it sounds unintuitive to accept partitions that don't exactly match the order of the parent table; but it's been supported all along. In the status quo, if users dump and restore such a database, the restored partition ends up with the column order of the parent instead of its own column order (by virtue of being created as CREATE TABLE PARTITION OF). Isn't that wrong? It'll cause an INSERT/COPY directly to the partition that worked prior to the restore to fail after the restore, if the column list isn't specified. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: finding changed blocks using WAL scanning
On Thu, Apr 18, 2019 at 04:25:24PM -0400, Robert Haas wrote: On Thu, Apr 18, 2019 at 3:51 PM Bruce Momjian wrote: How would you choose the STARTLSN/ENDLSN? If you could do it per checkpoint, rather than per-WAL, I think that would be great. I thought of that too. It seems appealing, because you probably only really care whether a particular block was modified between one checkpoint and the next, not exactly when during that interval it was modified. That's probably true for incremental backups, but there are other use cases that could leverage this information. Some time ago there was a discussion about prefetching blocks during recovery on a standby, and that's a great example of a use case that benefits from this - look at which blocks were modified in the next chunk of WAL and prefetch them. But that requires fairly detailed information about which blocks were modified in the next few megabytes of WAL. So just doing it once per checkpoint (or even anything above a single WAL segment) and removing all the detailed LSN locations makes it useless for this use case. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
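[Editor's sketch] To illustrate the prefetching use case, here is a very rough sketch that assumes an LSN-ordered stream of block references is available; WalBlockRef and next_block_ref() are hypothetical, while smgropen() and smgrprefetch() are existing APIs. Nothing here reflects an agreed-upon format.

#include "postgres.h"
#include "access/xlogdefs.h"
#include "storage/backendid.h"
#include "storage/smgr.h"

/* Hypothetical record read from a raw, LSN-ordered block-reference file. */
typedef struct WalBlockRef
{
    XLogRecPtr  lsn;
    RelFileNode rnode;
    ForkNumber  forknum;
    BlockNumber blkno;
} WalBlockRef;

/* Hypothetical reader; returns false when no more references are available. */
extern bool next_block_ref(WalBlockRef *ref);

/* Ask the kernel to prefetch the blocks that WAL up to target_lsn will touch. */
static void
prefetch_until(XLogRecPtr target_lsn)
{
    WalBlockRef ref;

    while (next_block_ref(&ref) && ref.lsn < target_lsn)
    {
        SMgrRelation reln = smgropen(ref.rnode, InvalidBackendId);

        smgrprefetch(reln, ref.forknum, ref.blkno);
    }
}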
Re: finding changed blocks using WAL scanning
On Mon, Apr 22, 2019 at 01:15:49PM -0400, Bruce Momjian wrote: On Mon, Apr 22, 2019 at 01:11:22PM -0400, Robert Haas wrote: On Mon, Apr 22, 2019 at 12:35 PM Bruce Momjian wrote: > I assumed the modblock files would be stored in the WAL archive so some > external tools could generate incremental backups using just the WAL > files. I assumed they would also be sent to standby servers so > incremental backups could be done on standby servers too. Yeah, that's another possible approach. I am not sure what is best. I am thinking you need to allow any of these, and putting the WAL files in pg_wal and having them streamed and archived gives that flexibility. I agree - this would be quite useful for the prefetching use case I've already mentioned in my previous message. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Calling PrepareTempTablespaces in BufFileCreateTemp
Peter Geoghegan writes: > On Mon, Apr 22, 2019 at 3:12 PM Melanie Plageman > wrote: >> PrepareTempTablespaces is called by most callers of BufFileCreateTemp, so I >> was >> wondering if there is a reason not to call it inside BufFileCreateTemp. > The best answer I can think of is that a BufFileCreateTemp() caller > might not want to do catalog access. Perhaps the contortions within > assign_temp_tablespaces() are something that callers ought to opt in > to explicitly. It's kind of hard to see a reason to call it outside a transaction, and even if we did, there are provisions for it not to go boom. > That doesn't seem like a particularly good or complete answer, though. > Perhaps it should simply be called within BufFileCreateTemp(). The > BufFile/fd.c layering is confusing in a number of ways IMV. I don't actually see why BufFileCreateTemp should do it; if we're to add a call, seems like OpenTemporaryFile is the place, as that's what is really concerned with the temp tablespace(s). I'm in favor of doing this, I think, as it sure looks to me like gistInitBuildBuffers() is calling BufFileCreateTemp without any closely preceding PrepareTempTablespaces. So we already have an instance of Melanie's bug in core. It'd be difficult to notice because of the silent-fallback-to-default-tablespace behavior. regards, tom lane
Re: pg_dump is broken for partition tablespaces
Alvaro Herrera writes: > Now that I re-read this complaint once again, I wonder if a mismatching > column order in partitions isn't a thing we ought to preserve anyhow. > Robert, Amit -- is it by design that pg_dump loses the original column > order for partitions, when not in binary-upgrade mode? I haven't looked at the partitioning code, but I am quite sure that that's always happened for old-style inheritance children, and I imagine pg_dump is just duplicating that old behavior. Wasn't there already a patch on the table to change this, though, by removing the code path that uses inheritance rather than the binary-upgrade-like solution? regards, tom lane
Re: finding changed blocks using WAL scanning
On Sat, Apr 20, 2019 at 04:21:52PM -0400, Robert Haas wrote: On Sat, Apr 20, 2019 at 12:42 AM Stephen Frost wrote: > Oh. Well, I already explained my algorithm for doing that upthread, > which I believe would be quite cheap. > > 1. When you generate the .modblock files, stick all the block > references into a buffer. qsort(). Dedup. Write out in sorted > order. Having all of the block references in a sorted order does seem like it would help, but would also make those potentially quite a bit larger than necessary (I had some thoughts about making them smaller elsewhere in this discussion). That might be worth it though. I suppose it might also be possible to line up the bitmaps suggested elsewhere to do essentially a BitmapOr of them to identify the blocks changed (while effectively de-duping at the same time). I don't see why this would make them bigger than necessary. If you sort by relfilenode/fork/blocknumber and dedup, then references to nearby blocks will be adjacent in the file. You can then decide what format will represent that most efficiently on output. Whether or not a bitmap is better idea than a list of block numbers or something else depends on what percentage of blocks are modified and how clustered they are. Not sure I understand correctly - do you suggest deduplicating and sorting the data before writing them into the .modblock files? Because then the sorting would make this information mostly useless for the recovery prefetching use case I mentioned elsewhere. For that to work we need information about both the LSN and the block, in LSN order. So if we want to allow that use case to leverage this infrastructure, we need to write the .modblock files kinda "raw" and do this processing in some later step. Now, maybe the incremental backup use case is so much more important that the right thing to do is to ignore this other use case, and I'm OK with that - as long as it's a conscious choice. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: finding changed blocks using WAL scanning
On Tue, Apr 23, 2019 at 01:21:27AM +0200, Tomas Vondra wrote: > On Sat, Apr 20, 2019 at 04:21:52PM -0400, Robert Haas wrote: > > On Sat, Apr 20, 2019 at 12:42 AM Stephen Frost wrote: > > > > Oh. Well, I already explained my algorithm for doing that upthread, > > > > which I believe would be quite cheap. > > > > > > > > 1. When you generate the .modblock files, stick all the block > > > > references into a buffer. qsort(). Dedup. Write out in sorted > > > > order. > > > > > > Having all of the block references in a sorted order does seem like it > > > would help, but would also make those potentially quite a bit larger > > > than necessary (I had some thoughts about making them smaller elsewhere > > > in this discussion). That might be worth it though. I suppose it might > > > also be possible to line up the bitmaps suggested elsewhere to do > > > essentially a BitmapOr of them to identify the blocks changed (while > > > effectively de-duping at the same time). > > > > I don't see why this would make them bigger than necessary. If you > > sort by relfilenode/fork/blocknumber and dedup, then references to > > nearby blocks will be adjacent in the file. You can then decide what > > format will represent that most efficiently on output. Whether or not > > a bitmap is better idea than a list of block numbers or something else > > depends on what percentage of blocks are modified and how clustered > > they are. > > > > Not sure I understand correctly - do you suggest to deduplicate and sort > the data before writing them into the .modblock files? Because that the > the sorting would make this information mostly useless for the recovery > prefetching use case I mentioned elsewhere. For that to work we need > information about both the LSN and block, in the LSN order. > > So if we want to allow that use case to leverage this infrastructure, we > need to write the .modfiles kinda "raw" and do this processing in some > later step. > > Now, maybe the incremental backup use case is so much more important the > right thing to do is ignore this other use case, and I'm OK with that - > as long as it's a conscious choice. I think the concern is that the more granular the modblock files are (with less de-duping), the larger they will be. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
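[Editor's sketch] For concreteness, a minimal sketch of the sort-and-dedup step quoted above, using an assumed in-memory BlockRef layout; the actual .modblock record format is still an open question in this thread.

#include <stdlib.h>

typedef struct BlockRef
{
    unsigned int relfilenode;
    int          forknum;
    unsigned int blocknum;
} BlockRef;

/* Order by relfilenode, then fork, then block number. */
static int
blockref_cmp(const void *a, const void *b)
{
    const BlockRef *x = (const BlockRef *) a;
    const BlockRef *y = (const BlockRef *) b;

    if (x->relfilenode != y->relfilenode)
        return (x->relfilenode < y->relfilenode) ? -1 : 1;
    if (x->forknum != y->forknum)
        return (x->forknum < y->forknum) ? -1 : 1;
    if (x->blocknum != y->blocknum)
        return (x->blocknum < y->blocknum) ? -1 : 1;
    return 0;
}

/* Sort block references and squeeze out duplicates in place; returns the deduplicated count. */
static size_t
sort_and_dedup(BlockRef *refs, size_t n)
{
    size_t out = 0;

    if (n == 0)
        return 0;

    qsort(refs, n, sizeof(BlockRef), blockref_cmp);

    for (size_t i = 1; i < n; i++)
    {
        if (blockref_cmp(&refs[i], &refs[out]) != 0)
            refs[++out] = refs[i];
    }
    return out + 1;
}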
memory leak checking
Hello PostgreSQL Hackers, What is the standard memory leak checking policy for the PostgreSQL codebase? I know there is some support for valgrind -- is the test suite being run continuously with valgrind on the build farm? Is there any plan to support clang's AddressSanitizer? I've seen a thread that memory leaks are allowed in initdb, because it is a short-lived process. Obviously they are not allowed in the database server. Are memory leaks allowed in the psql tool? Regards, Mikhail
Re: memory leak checking
Hi, On 2019-04-22 16:50:25 -0700, Mikhail Bautin wrote: > What is the standard memory leak checking policy for the PostgreSQL > codebase? I know there is some support for valgrind -- is the test suite > being run continuously with valgrind on the build farm? There's continuous use of valgrind on the buildfarm - but those animals have leak checking disabled. Postgres uses memory contexts for nearly all server allocations, which makes it possible to free memory in bulk. There's also plenty of memory that's intentionally allocated until the end of the backend lifetime (e.g. the various caches over the system catalogs). Because of that, checks like valgrind's leak detection are not particularly meaningful. > Is there any plan to support clang's AddressSanitizer? Not for the leak portion. I use asan against the backend, after disabling the leak check, and that's useful. Should probably set up a proper buildfarm animal for that. > I've seen a thread that memory leaks are allowed in initdb, because it is a > short-lived process. Obviously they are not allowed in the database server. > Are memory leaks allowed in the psql tool? Leaks are allowed if they are once-per-backend type things. There's no point in e.g. freeing information for timezone metadata, given that it'll be used for the whole server lifetime. And there are such things in psql too, IIRC. Greetings, Andres Freund
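[Editor's sketch] As a concrete illustration of the memory-context pattern Andres mentions: allocations are made in a dedicated context and released in bulk, so individual pfree() calls (and leak-checker noise about them) are unnecessary. do_one_batch() is a hypothetical caller; the context APIs shown are the real ones.

#include "postgres.h"
#include "utils/memutils.h"

static void
do_one_batch(void)
{
    MemoryContext batchcxt;
    MemoryContext oldcxt;

    /* All allocations for this unit of work live in one child context. */
    batchcxt = AllocSetContextCreate(CurrentMemoryContext,
                                     "batch work context",
                                     ALLOCSET_DEFAULT_SIZES);
    oldcxt = MemoryContextSwitchTo(batchcxt);

    /* ... palloc() freely here; nothing needs an individual pfree() ... */
    (void) palloc(1024);

    MemoryContextSwitchTo(oldcxt);
    MemoryContextDelete(batchcxt);      /* bulk-free everything at once */
}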