Re: Checkpoint start logging is done inside critical section
On Thu, Oct 18, 2018 at 9:02 AM Amit Kapila wrote:
> On Thu, Oct 18, 2018 at 10:27 AM Andres Freund wrote:
> > (that's why we mark the ctx as being ok with that).
>
> Yeah, as the palloc for the log message would be called in an ErrorContext
> where it is safe to do the allocation, ideally this shouldn't be a
> problem. So it seems to me that this is not a problem. Ants, do you
> see any problem in any particular scenario, or was this based on
> theoretical analysis?

This was purely theoretical, as also evidenced by the lack of complaints even
though the code has been like that for a very long time. I was actually mostly
worried about extension code run by a logging hook causing the panic.

Regards,
Ants Aasma
Re: DSM robustness failure (was Re: Peripatus/failures)
On Thu, Oct 18, 2018 at 6:02 PM Larry Rosenman wrote:
> Let me know soon(ish) if any of you want to poke at this machine, as I'm
> likely to forget and reboot it.

Hi Larry,

Thanks for the offer, but it looks like there is no way to get our hands on
that memory for forensics now. I'll see if there is some way to improve that
situation for the next time something like this happens.

--
Thomas Munro
http://www.enterprisedb.com
Re: DSM robustness failure (was Re: Peripatus/failures)
On Thu, Oct 18, 2018 at 5:00 PM Amit Kapila wrote:
> The below code seems to be problematic:
>
> dsm_cleanup_using_control_segment()
> {
> ..
>     if (!dsm_control_segment_sane(old_control, mapped_size))
>     {
>         dsm_impl_op(DSM_OP_DETACH, old_control_handle, 0, &impl_private,
>                     &mapped_address, &mapped_size, LOG);
> ..
> }
>
> Here, don't we need to use dsm_control_* variables instead of the local
> mapped_* variables?

I was a little fuzzy on when exactly dsm_cleanup_using_control_segment() and
dsm_postmaster_shutdown() run, but after some more testing I think I have this
straight now. You can test by setting dsm_control->magic to 42 in a debugger
and trying three cases:

1. Happy shutdown: dsm_postmaster_shutdown() complains on shutdown.
2. kill -9 a non-postmaster process: dsm_postmaster_shutdown() complains
   during auto-restart.
3. kill -9 the postmaster, manually start up again:
   dsm_cleanup_using_control_segment() runs. It quietly ignores the old
   segment if it doesn't pass the sanity test.

So to answer your question: no. dsm_cleanup_using_control_segment() is case 3.
This entirely new postmaster process has never had the segment mapped in, so
the dsm_control_* variables are not relevant here.

Hmm, but if you're running N other independent clusters on the same machine
that started up after this cluster crashed in case 3, I think there is an
N-in-four-billion chance that the segment with that ID now belongs to another
cluster and happens to be its DSM control segment, and therefore passes the
magic-number sanity test, and then we'll nuke it and all the segments it
references. Am I missing something? Could we fix that by putting some kind of
cluster ID in the control segment?

So on second thoughts, I think I should remove the
dsm_cleanup_using_control_segment() change from the patch, because it widens
the risk of case 3 nuking someone else's stuff to include even segments that
don't pass the sanity test. That's not good. Hunk removed.

> Do we need to reset the dsm_control_* variables in dsm_cleanup_for_mmap?

I don't think so -- that unlinks the files in file-backed DSM, but we also
call the other code in that case.

> @@ -336,13 +333,14 @@ dsm_postmaster_shutdown(int code, Datum arg)
>   * stray shared memory segments, but there's not much we can do about that
>   * if the metadata is gone.
>   */
> - nitems = dsm_control->nitems;
>   if (!dsm_control_segment_sane(dsm_control, dsm_control_mapped_size))
>   {
>       ereport(LOG,
>               (errmsg("dynamic shared memory control segment is corrupt")));
> -     return;
> +     nitems = 0;
>
> The comment above this code says:
>
> /*
>  * If some other backend exited uncleanly, it might have corrupted the
>  * control segment while it was dying. In that case, we warn and ignore
>  * the contents of the control segment. This may end up leaving behind
>  * stray shared memory segments, but there's not much we can do about that
>  * if the metadata is gone.
>  */
>
> Don't we need to adjust this comment, if we want to go with what you
> have proposed?

The comment seems correct: we ignore the *contents* of the control segment.
But to be clearer I'll add a sentence to say that we still attempt to destroy
the control segment itself.

> BTW, it doesn't appear insane to me that if the control segment got
> corrupted, then we leave it as it is rather than trying to destroy it.
> I am not sure, but if it is corrupted, then is it guaranteed that destroy
> will always succeed?

You're right that it might fail. If the shm_unlink() fails it doesn't matter,
we just produce a LOG message, but if the unmap() fails we still consider the
segment to be mapped (because it is!), so ... hmm, we'd better forget about it
if we plan to continue running, by clearing those variables explicitly. That's
actually another pre-existing way for the assertion to fail in current master.

(Realistically, that unmap() could only fail if our global variables are
trashed so that we try to unmap() a junk address; AFAIK our model here is that
the postmaster is sane, even if everything else is insane, so that's sort of
outside the model. But this new patch addresses it anyway.)

Thanks for the review!

--
Thomas Munro
http://www.enterprisedb.com

0001-Fix-thinko-in-handling-of-corrupted-DSM-control-a-v2.patch
Description: Binary data
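The "cluster ID in the control segment" idea floated in this thread could look
roughly like the sketch below. This is only an illustration, not PostgreSQL
code: the struct layout, the field name cluster_id, and the constant
PG_DSM_MAGIC_SKETCH are all hypothetical stand-ins (the real layout lives in
dsm.c, and a real cluster identifier might be the pg_control system
identifier). The point is just that the sanity test would then reject a
same-ID segment belonging to a different cluster.

```c
#include <assert.h>
#include <stdint.h>

#define PG_DSM_MAGIC_SKETCH 0x517261F4	/* hypothetical magic constant */

/*
 * Hypothetical control-segment header: carries a cluster_id in addition to
 * the magic number, so a freshly started postmaster (case 3 above) can check
 * that an old segment really belongs to *this* cluster before destroying it
 * and every segment it references.
 */
typedef struct dsm_control_sketch
{
	uint32_t	magic;			/* fixed magic number */
	uint64_t	cluster_id;		/* e.g. pg_control's system identifier */
	uint32_t	nitems;			/* number of referenced segments */
} dsm_control_sketch;

/* Sanity check: both the magic and the cluster identity must match. */
static int
dsm_segment_is_ours(const dsm_control_sketch *ctl, uint64_t my_cluster_id)
{
	return ctl->magic == PG_DSM_MAGIC_SKETCH &&
		ctl->cluster_id == my_cluster_id;
}
```

With only the magic check (as in the code under discussion), the second case
below would wrongly pass; adding the cluster comparison closes that hole.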
lowering pg_regress privileges on Windows
From the "scratch a long-running itch" department.

The attached ridiculously tiny patch solves the problem whereby, while we can
run Postgres on Windows safely from an Administrator account, we can't run the
regression tests from the same account, since they fail on the tablespace
test, the tablespace directory having been set up without first having lowered
privileges. The solution is to lower pg_regress' privileges in the same way
that we do with other binaries.

This is useful in setups like Appveyor, where running under any other account
is ... difficult. For the cfbot, Thomas has had to make the script hack the
schedule file to omit the tablespace test. This would make that redundant.

I propose to backpatch this. It's close enough to a bug, and the risk is
almost infinitely small.

cheers

andrew

--
Andrew Dunstan
https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

diff --git a/src/test/regress/pg_regress.c b/src/test/regress/pg_regress.c
index 6890678..3248603 100644
--- a/src/test/regress/pg_regress.c
+++ b/src/test/regress/pg_regress.c
@@ -2081,6 +2081,8 @@ regression_main(int argc, char *argv[], init_function ifunc, test_function tfunc
 	progname = get_progname(argv[0]);
 	set_pglocale_pgservice(argv[0], PG_TEXTDOMAIN("pg_regress"));
 
+	get_restricted_token(progname);
+
 	atexit(stop_postmaster);
 
 #ifndef HAVE_UNIX_SOCKETS
Re: Removing unneeded self joins
Here is a version that compiles.

--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index 3bb91c9..62d46a1 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -166,10 +166,13 @@ _equalVar(const Var *a, const Var *b)
 	COMPARE_SCALAR_FIELD(vartypmod);
 	COMPARE_SCALAR_FIELD(varcollid);
 	COMPARE_SCALAR_FIELD(varlevelsup);
-	COMPARE_SCALAR_FIELD(varnoold);
-	COMPARE_SCALAR_FIELD(varoattno);
 	COMPARE_LOCATION_FIELD(location);
 
+	/*
+	 * varnoold/varoattno are used only for debugging and may differ even
+	 * when the variables are logically the same.
+	 */
 	return true;
 }
 
diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c
index f295558..fb07ca0 100644
--- a/src/backend/optimizer/path/indxpath.c
+++ b/src/backend/optimizer/path/indxpath.c
@@ -2964,7 +2964,8 @@ ec_member_matches_indexcol(PlannerInfo *root, RelOptInfo *rel,
  * relation_has_unique_index_for
  *	  Determine whether the relation provably has at most one row satisfying
  *	  a set of equality conditions, because the conditions constrain all
- *	  columns of some unique index.
+ *	  columns of some unique index. If index_info is not null, it is set to
+ *	  point to a new UniqueIndexInfo containing the index and conditions.
  *
  * The conditions can be represented in either or both of two ways:
  * 1. A list of RestrictInfo nodes, where the caller has already determined
@@ -2985,7 +2986,8 @@ ec_member_matches_indexcol(PlannerInfo *root, RelOptInfo *rel,
 bool
 relation_has_unique_index_for(PlannerInfo *root, RelOptInfo *rel,
 							  List *restrictlist,
-							  List *exprlist, List *oprlist)
+							  List *exprlist, List *oprlist,
+							  UniqueIndexInfo **index_info)
 {
 	ListCell   *ic;
 
@@ -3041,6 +3043,7 @@ relation_has_unique_index_for(PlannerInfo *root, RelOptInfo *rel,
 	{
 		IndexOptInfo *ind = (IndexOptInfo *) lfirst(ic);
 		int			c;
+		List	   *matched_restrictlist = NIL;
 
 		/*
		 * If the index is not unique, or not immediately enforced, or if it's
@@ -3089,6 +3092,7 @@ relation_has_unique_index_for(PlannerInfo *root, RelOptInfo *rel,
 				if (match_index_to_operand(rexpr, c, ind))
 				{
 					matched = true;	/* column is unique */
+					matched_restrictlist = lappend(matched_restrictlist, rinfo);
 					break;
 				}
 			}
@@ -3131,7 +3135,22 @@ relation_has_unique_index_for(PlannerInfo *root, RelOptInfo *rel,
 
 		/* Matched all columns of this index? */
 		if (c == ind->ncolumns)
+		{
+			if (index_info != NULL)
+			{
+				/* This may be called in GEQO memory context. */
+				MemoryContext oldContext = MemoryContextSwitchTo(root->planner_cxt);
+
+				*index_info = palloc(sizeof(UniqueIndexInfo));
+				(*index_info)->index = ind;
+				(*index_info)->clauses = list_copy(matched_restrictlist);
+				MemoryContextSwitchTo(oldContext);
+			}
+			if (matched_restrictlist)
+				list_free(matched_restrictlist);
 			return true;
+		}
+		if (matched_restrictlist)
+			list_free(matched_restrictlist);
 	}
 
 	return false;

diff --git a/src/backend/optimizer/path/joinpath.c b/src/backend/optimizer/path/joinpath.c
index 642f951..2edcf22 100644
--- a/src/backend/optimizer/path/joinpath.c
+++ b/src/backend/optimizer/path/joinpath.c
@@ -176,7 +176,8 @@ add_paths_to_joinrel(PlannerInfo *root,
 												   innerrel,
 												   JOIN_INNER,
 												   restrictlist,
-												   false);
+												   false,
+												   NULL /*index_info*/);
 			break;
 		default:
 			extra.inner_unique = innerrel_is_unique(root,
@@ -185,7 +186,8 @@ add_paths_to_joinrel(PlannerInfo *root,
 													innerrel,
 													jointype,
 													restrictlist,
-													false);
+													false,
+													NULL /*index_info*/);
 			break;
 	}

diff --git a/src/backend/optimizer/plan/analyzejoins.c b/src/backend/optimizer/plan/analyzejoins.c
index 0e73f9c..2bb3a10 100644
--- a/src/backend/optimizer/plan/analyzejoins.c
+++ b/src/backend/optimizer/plan/analyzejoins.c
@@ -22,12 +22,15 @@
  */
 #include "postgres.h"
 
+#include "catalog/pg_class.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/clauses.h"
 #include "optimizer/joininfo.h"
 #include "optimizer/pathnode.h"
 #include "optimizer/paths.h"
 #include "optimizer/planmain.h"
+#include "optimizer/predtest.h"
+#include "optimizer/restrictinfo.h"
 #include "optimizer/tlist.h"
 #include "optimizer/var.h"
 #include "utils/lsyscache.h"
@@ -39,14 +42,15 @@ static void remove_rel_from_query(PlannerInfo *root, int relid,
 static List *remove_rel_from_joinlist(List *joinlist, int relid, int *nremoved);
 static bool rel_supports_distinctness(PlannerInfo *root, RelOptInfo *rel);
 static bool rel_is_distinct_for(PlannerInfo *root, RelOptInfo *rel,
-				List *clause_list);
+				List *clause_list, UniqueIndexInfo **info);
 static Oid	distinct_col_search(int colno, List *colnos, List *opids);
 stat
Possibly redundant context switch in postgres_fdw
Hi hackers,

ISTM that the context switch in create_cursor():

	if (numParams > 0)
	{
		MemoryContext oldcontext;

		oldcontext = MemoryContextSwitchTo(econtext->ecxt_per_tuple_memory);

		process_query_params(econtext,
							 fsstate->param_flinfo,
							 fsstate->param_exprs,
							 values);

		MemoryContextSwitchTo(oldcontext);
	}

is redundant, since we should already be in the ecxt_per_tuple_memory context
according to ForeignNext(). Am I missing some hidden purpose? If not, here is
a patch that removes it.

Regards,
Ildar Musin

diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index fd20aa96aa..65962ea657 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -3083,16 +3083,10 @@ create_cursor(ForeignScanState *node)
 	 */
 	if (numParams > 0)
 	{
-		MemoryContext oldcontext;
-
-		oldcontext = MemoryContextSwitchTo(econtext->ecxt_per_tuple_memory);
-
 		process_query_params(econtext,
 							 fsstate->param_flinfo,
 							 fsstate->param_exprs,
 							 values);
-
-		MemoryContextSwitchTo(oldcontext);
 	}
 
 	/* Construct the DECLARE CURSOR command */
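The redundancy claim can be illustrated with a self-contained toy model (this
is not PostgreSQL code; memory-context handling is reduced to a single global
pointer, and ToyContext/ToyContextSwitchTo are made-up stand-ins for
MemoryContext/MemoryContextSwitchTo): if the caller has already switched to
the per-tuple context before invoking the callback, a second switch to that
same context inside the callback is a no-op pair.

```c
#include <assert.h>

/* Toy stand-in for PostgreSQL's notion of the current memory context. */
typedef struct ToyContext
{
	const char *name;
} ToyContext;

static ToyContext *CurrentToyContext;

/* Mirrors the shape of MemoryContextSwitchTo(): returns the old context. */
static ToyContext *
ToyContextSwitchTo(ToyContext *ctx)
{
	ToyContext *old = CurrentToyContext;

	CurrentToyContext = ctx;
	return old;
}
```

In this model, a caller that is already in the per-tuple context can wrap a
callee in switch/restore without the current context ever changing -- which
is exactly why removing the inner pair is safe if (and only if) the caller's
guarantee holds.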
Re: PG vs macOS Mojave
I wrote:
> Jakob Egger writes:
>> I would assume that clang sets -isysroot automatically, but I have no idea
>> why that didn't work for you previously.

> [ experiments further ... ] It looks like clang does default to assuming
> -isysroot with the correct sysroot for its Xcode version. The missing bit
> of info is that -I /System/... is taken as an absolute path that's not
> affected by the sysroot. You have to write -iwithsysroot /System/...
> to have something that is taken relative to the sysroot.

> So as far as pltcl is concerned, we don't really need the configure
> hack to insert PG_SYSROOT into TCL_INCLUDE_SPEC; we could leave Apple's
> value alone and it should work. (Not that that gets pltcl out of
> dependency on the sysroot entirely, because AFAICS there's still no way
> to find tclConfig.sh without explicitly accounting for sysroot. And
> I'm not exactly convinced that using explicit sysroot to do that and
> then leaving it out later is a hot idea.)

> However, on the Perl side, we'd have to change -I to -iwithsysroot if
> we wanted to avoid explicitly mentioning the sysroot path, and that
> seems way more invasive than it's worth.

> I'm inclined to leave things as they stand now; it makes for reasonably
> consistent build behavior between pltcl and plperl.

After sleeping on it, I'm not really happy with that answer. I think we'd be
better advised to remove the hack on TCL_INCLUDE_SPEC, and change what we've
done for Perl to look more like what Tcl does. That is, instead of writing
"-I$(perl_includedir)/CORE" in Makefiles, we should write
"$(perl_includespec)" so that the switch spelling is included in the
variable. Then we'd make configure set that variable to either
"-I$perl_archlibexp/CORE" or "-iwithsysroot $perl_archlibexp/CORE" depending
on what it finds to be needed. In this way, while we can't avoid knowing the
sysroot path explicitly during configure, we aren't wiring it into anything
that configure emits.

It would be a good idea to change this now, before the results of commit
5e2217131 spread too far in the wild, so that plperl-related extensions only
have to deal with one Makefile change, not two.

(BTW, plpython is already doing it like this, meaning that it's plperl that's
the odd man out, not pltcl. It appears that Apple have not yet sysroot-ified
their Python installation, but it is not hard to see where that train is
going.)

			regards, tom lane
Re: Implementation of Flashback Query
Hello,

I'll take a look at some of the features that the zheap undo work is
considering, with a view to flashback based on undo. Flashback Drop is more
complex, and at the moment that work doesn't do anything about it.

Yang Jie
Highgo Software
http://www.highgo.com/

On 10/18/2018 08:16, Tsunakawa, Takayuki wrote:
> From: Yang Jie [mailto:yang...@highgo.com]
>> Delayed cleanup, resulting in performance degradation -- what are the
>> recommended solutions? What do you suggest for the flashback feature?
>> Although postgres has excellent backup and restore capabilities, have
>> you considered adding flashbacks?
>
> Great proposal, I appreciate it.
>
> Regarding the bloat, check the recent mail threads titled "undo xxx". A
> new table storage format is being developed to use undo logs like Oracle
> and MySQL instead of generating dead tuples.
>
> How about the possibility of adding a feature like Flashback Drop to
> restore dropped and truncated tables? I sometimes saw customers who
> mistakenly dropped or truncated tables but didn't have any backup.
>
> Regards
> Takayuki Tsunakawa
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On Wed, Oct 3, 2018 at 4:39 PM Peter Geoghegan wrote:
> I did find a pretty clear regression, though only with writes to
> unique indexes. Attached is v6, which fixes the issue. More on that
> below.

I've been benchmarking my patch using oltpbench's TPC-C benchmark these past
few weeks, which has been very frustrating -- the picture is very mixed. I'm
testing a patch that has evolved from v6, but isn't too different.

In one way, the patch does exactly what it's supposed to do when these
benchmarks are run: it leaves indexes *significantly* smaller than the master
branch will on the same (rate-limited) workload, without affecting the size
of tables in any noticeable way. The numbers that I got from my much earlier
synthetic single-client benchmark mostly hold up. For example, the stock
table's primary key is about 35% smaller, and the order line index is only
about 20% smaller relative to master, which isn't quite as good as in the
synthetic case, but I'll take it (this is all because of the
v6-0003-Add-split-at-new-tuple-page-split-optimization.patch stuff).

However, despite significant effort, and despite the fact that the index
shrinking is reliable, I cannot yet consistently show an improvement in
either transaction throughput or transaction latency. I can show a nice
improvement in latency on a slightly-rate-limited TPC-C workload when
backend_flush_after=0 (something like a 40% reduction on average), but that
doesn't hold up when oltpbench isn't rate-limited and/or has
backend_flush_after set. Usually, there is a 1% - 2% regression, despite the
big improvements in index size, and despite the big reduction in the amount
of buffers that backends must write out themselves.

The obvious explanation is that throughput is decreased due to our doing
extra work (truncation) while under an exclusive buffer lock. However, I've
worked hard on that, and, as I said, I can sometimes observe a nice
improvement in latency. This makes me doubt the obvious explanation.

My working theory is that this has something to do with shared_buffers
eviction. Maybe we're making worse decisions about which buffer to evict, or
maybe the scalability of eviction is hurt. Perhaps both.

You can download results from a recent benchmark to get some sense of this.
It includes latency and throughput graphs, plus detailed statistics collector
stats:

https://drive.google.com/file/d/1oIjJ3YpSPiyRV_KF6cAfAi4gSm7JdPK1/view?usp=sharing

I would welcome any theories as to what could be the problem here. I think
that this is fixable, since the picture for the patch is very positive,
provided you only focus on bgwriter/checkpoint activity and on-disk sizes. It
seems likely that there is a very specific gap in my understanding of how the
patch affects buffer cleaning.

--
Peter Geoghegan
removing unnecessary get_att*() lsyscache functions
I noticed that get_attidentity() isn't really necessary because the
information can be obtained from an existing tuple descriptor in each case.
Also, get_atttypmod() hasn't been used since 2004. I propose the attached
patches to remove these two functions.

--
Peter Eisentraut
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

From 2003c5b4fdad1cd444ddb1fad2bdc9ab09db2fd2 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut
Date: Thu, 18 Oct 2018 19:27:18 +0200
Subject: [PATCH 1/2] Remove get_attidentity()

All existing uses can get this information more easily from the relation
descriptor, so the detour through the syscache is not necessary.
---
 src/backend/commands/tablecmds.c    |  4 ++--
 src/backend/parser/parse_utilcmd.c  |  3 ++-
 src/backend/utils/cache/lsyscache.c | 32 -
 src/include/utils/lsyscache.h       |  1 -
 4 files changed, 4 insertions(+), 36 deletions(-)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 3e112b4ef4..7ff25a9eff 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -5945,7 +5945,7 @@ ATExecDropNotNull(Relation rel, const char *colName, LOCKMODE lockmode)
 				 errmsg("cannot alter system column \"%s\"",
 						colName)));
 
-	if (get_attidentity(RelationGetRelid(rel), attnum))
+	if (TupleDescAttr(RelationGetDescr(rel), attnum - 1)->attidentity)
 		ereport(ERROR,
 				(errcode(ERRCODE_SYNTAX_ERROR),
 				 errmsg("column \"%s\" of relation \"%s\" is an identity column",
@@ -6148,7 +6148,7 @@ ATExecColumnDefault(Relation rel, const char *colName,
 				 errmsg("cannot alter system column \"%s\"",
 						colName)));
 
-	if (get_attidentity(RelationGetRelid(rel), attnum))
+	if (TupleDescAttr(RelationGetDescr(rel), attnum - 1)->attidentity)
 		ereport(ERROR,
 				(errcode(ERRCODE_SYNTAX_ERROR),
 				 errmsg("column \"%s\" of relation \"%s\" is an identity column",

diff --git a/src/backend/parser/parse_utilcmd.c b/src/backend/parser/parse_utilcmd.c
index d8387d4356..212eedfa87 100644
--- a/src/backend/parser/parse_utilcmd.c
+++ b/src/backend/parser/parse_utilcmd.c
@@ -3067,7 +3067,8 @@ transformAlterTableStmt(Oid relid, AlterTableStmt *stmt,
 				 * if attribute not found, something will error about it
 				 * later
 				 */
-				if (attnum != InvalidAttrNumber && get_attidentity(relid, attnum))
+				if (attnum != InvalidAttrNumber &&
+					TupleDescAttr(RelationGetDescr(rel), attnum - 1)->attidentity)
 				{
 					Oid			seq_relid = getOwnedSequence(relid, attnum);
 					Oid			typeOid = typenameTypeId(pstate, def->typeName);

diff --git a/src/backend/utils/cache/lsyscache.c b/src/backend/utils/cache/lsyscache.c
index 12b2532d95..c5fcaa9542 100644
--- a/src/backend/utils/cache/lsyscache.c
+++ b/src/backend/utils/cache/lsyscache.c
@@ -821,38 +821,6 @@ get_attnum(Oid relid, const char *attname)
 	return InvalidAttrNumber;
 }
 
-/*
- * get_attidentity
- *
- *		Given the relation id and the attribute name,
- *		return the "attidentity" field from the attribute relation.
- *
- *		Returns '\0' if not found.
- *
- *		Since no identity is represented by '\0', this can also be used as a
- *		Boolean test.
- */
-char
-get_attidentity(Oid relid, AttrNumber attnum)
-{
-	HeapTuple	tp;
-
-	tp = SearchSysCache2(ATTNUM,
-						 ObjectIdGetDatum(relid),
-						 Int16GetDatum(attnum));
-	if (HeapTupleIsValid(tp))
-	{
-		Form_pg_attribute att_tup = (Form_pg_attribute) GETSTRUCT(tp);
-		char		result;
-
-		result = att_tup->attidentity;
-		ReleaseSysCache(tp);
-		return result;
-	}
-	else
-		return '\0';
-}
-
 /*
  * get_atttype
  *

diff --git a/src/include/utils/lsyscache.h b/src/include/utils/lsyscache.h
index 8c57de77c0..151081c693 100644
--- a/src/include/utils/lsyscache.h
+++ b/src/include/utils/lsyscache.h
@@ -85,7 +85,6 @@ extern Oid	get_opfamily_proc(Oid opfamily, Oid lefttype, Oid righttype,
 				  int16 procnum);
 extern char *get_attname(Oid relid, AttrNumber attnum, bo
Re: Large writable variables
On 17/10/2018 23:51, Andres Freund wrote:
>> __builtin_types_compatible_p(const char *, char *) returns false (0) for me.
>
> Right, that's why I added a const, inside the macro, to the type
> specified in the unconstify argument. So typeof() yields a const char *,
> and the return type is specified as char *, and adding a const in the
> argument also yields a const char *.

Yeah, that works. The C++-inspired version also allowed casting from
not-const to const, which we don't really need.

I'd perhaps change the signature to

    #define unconstify(underlying_type, var)

because the "var" doesn't actually have to be a variable.

Attached is my previous patch adapted to your macro.

I'm tempted to get rid of the stuff in dfmgr.c and just let the "older
platforms" get the warning.

--
Peter Eisentraut
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

From 54693f7201aa4768509c5d6939961b50a9021bcf Mon Sep 17 00:00:00 2001
From: Peter Eisentraut
Date: Thu, 18 Oct 2018 22:11:44 +0200
Subject: [PATCH v2] Apply unconstify() in more places

---
 src/backend/utils/adt/json.c    | 4 ++--
 src/backend/utils/adt/misc.c    | 2 +-
 src/backend/utils/adt/varlena.c | 4 ++--
 src/backend/utils/fmgr/dfmgr.c  | 4 ++--
 src/port/win32setlocale.c       | 2 +-
 5 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/src/backend/utils/adt/json.c b/src/backend/utils/adt/json.c
index 6f0fe94d63..f47a498228 100644
--- a/src/backend/utils/adt/json.c
+++ b/src/backend/utils/adt/json.c
@@ -207,12 +207,12 @@ IsValidJsonNumber(const char *str, int len)
 	 */
 	if (*str == '-')
 	{
-		dummy_lex.input = (char *) str + 1;
+		dummy_lex.input = unconstify(char *, str) + 1;
 		dummy_lex.input_length = len - 1;
 	}
 	else
 	{
-		dummy_lex.input = (char *) str;
+		dummy_lex.input = unconstify(char *, str);
 		dummy_lex.input_length = len;
 	}

diff --git a/src/backend/utils/adt/misc.c b/src/backend/utils/adt/misc.c
index 6ea3679835..309eb2935c 100644
--- a/src/backend/utils/adt/misc.c
+++ b/src/backend/utils/adt/misc.c
@@ -423,7 +423,7 @@ pg_get_keywords(PG_FUNCTION_ARGS)
 		HeapTuple	tuple;
 
 		/* cast-away-const is ugly but alternatives aren't much better */
-		values[0] = (char *) ScanKeywords[funcctx->call_cntr].name;
+		values[0] = unconstify(char *, ScanKeywords[funcctx->call_cntr].name);
 
 		switch (ScanKeywords[funcctx->call_cntr].category)
 		{

diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index a5e812d026..0fd3b15748 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -182,7 +182,7 @@ char *
 text_to_cstring(const text *t)
 {
 	/* must cast away the const, unfortunately */
-	text	   *tunpacked = pg_detoast_datum_packed((struct varlena *) t);
+	text	   *tunpacked = pg_detoast_datum_packed(unconstify(text *, t));
 	int			len = VARSIZE_ANY_EXHDR(tunpacked);
 	char	   *result;
 
@@ -213,7 +213,7 @@ void
 text_to_cstring_buffer(const text *src, char *dst, size_t dst_len)
 {
 	/* must cast away the const, unfortunately */
-	text	   *srcunpacked = pg_detoast_datum_packed((struct varlena *) src);
+	text	   *srcunpacked = pg_detoast_datum_packed(unconstify(text *, src));
 	size_t		src_len = VARSIZE_ANY_EXHDR(srcunpacked);
 
 	if (dst_len > 0)

diff --git a/src/backend/utils/fmgr/dfmgr.c b/src/backend/utils/fmgr/dfmgr.c
index 4a5cc7cfc7..d200b960d2 100644
--- a/src/backend/utils/fmgr/dfmgr.c
+++ b/src/backend/utils/fmgr/dfmgr.c
@@ -126,7 +126,7 @@ load_external_function(const char *filename, const char *funcname,
 	 * should declare its second argument as "const char *", but older
 	 * platforms might not, so for the time being we just cast away const.
 	 */
-	retval = (PGFunction) dlsym(lib_handle, (char *) funcname);
+	retval = (PGFunction) dlsym(lib_handle, unconstify(char *, funcname));
 
 	if (retval == NULL && signalNotFound)
 		ereport(ERROR,
@@ -175,7 +175,7 @@ PGFunction
 lookup_external_function(void *filehandle, const char *funcname)
 {
 	/* as above, cast away const for the time being */
-	return (PGFunction) dlsym(filehandle, (char *) funcname);
+	return (PGFunction) dlsym(filehandle, unconstify(char *, funcname));
 }

diff --git a/src/port/win32setlocale.c b/src/port/win32setlocale.c
index 0597c2afca..a8cf170dd1 100644
--- a/src/port/win32setlocale.c
+++ b/src/port/win32setlocale.c
@@ -183,7 +183,7 @@ pgwin32_setlocale(int category, const char *locale)
 	 * forbidden to modify, so casting away the "const" is innocuous.
 	 */
 	if (result)
-		result = (char *) map_locale(locale_map_result, result
Re: Large writable variables
On 18/10/2018 22:17, Peter Eisentraut wrote:
> On 17/10/2018 23:51, Andres Freund wrote:
>>> __builtin_types_compatible_p(const char *, char *) returns false (0) for me.
>>
>> Right, that's why I added a const, inside the macro, to the type
>> specified in the unconstify argument. So typeof() yields a const char *,
>> and the return type is specified as char *, and adding a const in the
>> argument also yields a const char *.
>
> Yeah, that works. The C++-inspired version also allowed casting from
> not-const to const, which we don't really need.
>
> Attached is my previous patch adapted to your macro.

Oh, I forgot to mention: your version doesn't work for this code in
pqexpbuffer.c:

    str->data = (char *) oom_buffer;

That's probably not a big deal though.

--
Peter Eisentraut
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Large writable variables
On 2018-10-18 22:20:29 +0200, Peter Eisentraut wrote:
> On 18/10/2018 22:17, Peter Eisentraut wrote:
> > On 17/10/2018 23:51, Andres Freund wrote:
> >>> __builtin_types_compatible_p(const char *, char *) returns false (0)
> >>> for me.
> >>
> >> Right, that's why I added a const, inside the macro, to the type
> >> specified in the unconstify argument. So typeof() yields a const char *,
> >> and the return type is specified as char *, and adding a const in the
> >> argument also yields a const char *.
> >
> > Yeah, that works. The C++-inspired version also allowed casting from
> > not-const to const, which we don't really need.
> >
> > Attached is my previous patch adapted to your macro.
>
> Oh, I forgot to mention, your version doesn't work for this code in
> pqexpbuffer.c:
>
>     str->data = (char *) oom_buffer;

That kind of seems correct ;). Can easily be done via
unconstify(char *, &oom_buffer[0]), if necessary.

> That's probably not a big deal though.

Greetings,

Andres Freund
Re: Large writable variables
On 2018-10-18 22:17:55 +0200, Peter Eisentraut wrote:
> I'd perhaps change the signature to
>
>     #define unconstify(underlying_type, var)
>
> because the "var" doesn't actually have to be a variable.

Hm, so "expr", or what would you use?

> Attached is my previous patch adapted to your macro.
>
> I'm tempted to get rid of the stuff in dfmgr.c and just let the "older
> platforms" get the warning.

Yea, that works for me. Are you planning to apply this?

Greetings,

Andres Freund
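For reference, the macro shape being discussed in this thread can be sketched
in plain GNU C as below. This is a sketch only: it assumes GCC/Clang's
__builtin_types_compatible_p and __typeof extensions, and the bit-field trick
in STATIC_ASSERT_EXPR stands in for PostgreSQL's expression-level static
assertion. As Andres describes above, adding const to the target type inside
the compatibility check means the cast compiles only when the argument really
is "const underlying_type", so accidental type changes are caught at compile
time.

```c
#include <assert.h>
#include <string.h>

/* Expression-level static assertion via a negative bit-field width:
 * sizeof() of a struct whose bit-field width is -1 fails to compile. */
#define STATIC_ASSERT_EXPR(cond) \
	((void) sizeof(struct { int assert_failed : (cond) ? 1 : -1 }))

/* Cast away const, but only from exactly "const underlying_type":
 * any other type (wrong pointee type, wrong indirection level) makes
 * the compatibility check false and compilation fails. */
#define unconstify(underlying_type, expr) \
	(STATIC_ASSERT_EXPR(__builtin_types_compatible_p(__typeof(expr), \
													 const underlying_type)), \
	 (underlying_type) (expr))
```

For example, unconstify(char *, msg) compiles when msg is a const char *,
while unconstify(char *, 42) is rejected at compile time because int is not
compatible with const char *.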
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
Hi,

On 2018-10-18 12:54:27 -0700, Peter Geoghegan wrote:
> I can show a nice improvement in latency on a slightly-rate-limited
> TPC-C workload when backend_flush_after=0 (something like a 40%
> reduction on average), but that doesn't hold up when oltpbench isn't
> rate-limited and/or has backend_flush_after set. Usually, there is a
> 1% - 2% regression, despite the big improvements in index size, and
> despite the big reduction in the amount of buffers that backends must
> write out themselves.

What kind of backend_flush_after values were you trying?
backend_flush_after=0 obviously is the default, so I'm not clear on that. How
large is the database here, and how high is shared_buffers?

> The obvious explanation is that throughput is decreased due to our
> doing extra work (truncation) while under an exclusive buffer lock.
> However, I've worked hard on that, and, as I said, I can sometimes
> observe a nice improvement in latency. This makes me doubt the obvious
> explanation. My working theory is that this has something to do with
> shared_buffers eviction. Maybe we're making worse decisions about
> which buffer to evict, or maybe the scalability of eviction is hurt.
> Perhaps both.

Is it possible that there are new / prolonged cases where a buffer is read
from disk after the patch? Because that might require doing *write* IO when
evicting the previous contents of the victim buffer, and obviously that can
take longer if you're running with backend_flush_after > 0.

I wonder if it'd make sense to hack up a patch that logs when evicting a
buffer while already holding another lwlock. That shouldn't be too hard.

> You can download results from a recent benchmark to get some sense of
> this. It includes latency and throughput graphs, plus detailed
> statistics collector stats:
>
> https://drive.google.com/file/d/1oIjJ3YpSPiyRV_KF6cAfAi4gSm7JdPK1/view?usp=sharing

I'm unclear on which runs are what here. I assume "public" is your patchset,
and "master" is master? Did you reset the stats in between runs?

Greetings,

Andres Freund
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
Shared_buffers is 10gb iirc. The server has 32gb of memory. Yes, 'public' is the patch case. Sorry for not mentioning it initially. -- Peter Geoghegan (Sent from my phone)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On Thu, Oct 18, 2018 at 1:44 PM Andres Freund wrote:
> What kind of backend_flush_after values were you trying?
> backend_flush_after=0 obviously is the default, so I'm not clear on
> that. How large is the database here, and how high is shared_buffers?

I *was* trying backend_flush_after=512kB, but it's backend_flush_after=0
in the benchmark I posted. See the "postgres*settings" files.

On the master branch, things looked like this after the last run:

pg@tpcc_oltpbench[15547]=# \dt+
                      List of relations
 Schema │    Name    │ Type  │ Owner │   Size   │ Description
────────┼────────────┼───────┼───────┼──────────┼─────────────
 public │ customer   │ table │ pg    │ 4757 MB  │
 public │ district   │ table │ pg    │ 5240 kB  │
 public │ history    │ table │ pg    │ 1442 MB  │
 public │ item       │ table │ pg    │ 10192 kB │
 public │ new_order  │ table │ pg    │ 140 MB   │
 public │ oorder     │ table │ pg    │ 1185 MB  │
 public │ order_line │ table │ pg    │ 19 GB    │
 public │ stock      │ table │ pg    │ 9008 MB  │
 public │ warehouse  │ table │ pg    │ 4216 kB  │
(9 rows)

pg@tpcc_oltpbench[15547]=# \di+
                                      List of relations
 Schema │                 Name                 │ Type  │ Owner │   Table    │  Size   │ Description
────────┼──────────────────────────────────────┼───────┼───────┼────────────┼─────────┼─────────────
 public │ customer_pkey                        │ index │ pg    │ customer   │ 367 MB  │
 public │ district_pkey                        │ index │ pg    │ district   │ 600 kB  │
 public │ idx_customer_name                    │ index │ pg    │ customer   │ 564 MB  │
 public │ idx_order                            │ index │ pg    │ oorder     │ 715 MB  │
 public │ item_pkey                            │ index │ pg    │ item       │ 2208 kB │
 public │ new_order_pkey                       │ index │ pg    │ new_order  │ 188 MB  │
 public │ oorder_o_w_id_o_d_id_o_c_id_o_id_key │ index │ pg    │ oorder     │ 715 MB  │
 public │ oorder_pkey                          │ index │ pg    │ oorder     │ 958 MB  │
 public │ order_line_pkey                      │ index │ pg    │ order_line │ 9624 MB │
 public │ stock_pkey                           │ index │ pg    │ stock      │ 904 MB  │
 public │ warehouse_pkey                       │ index │ pg    │ warehouse  │ 56 kB   │
(11 rows)

> Is it possible that there's new / prolonged cases where a buffer is read
> from disk after the patch? Because that might require doing *write* IO
> when evicting the previous contents of the victim buffer, and obviously
> that can take longer if you're running with backend_flush_after > 0.

Yes, I suppose that that's possible, because the buffer
popularity/usage_count will be affected in ways that cannot easily be
predicted. However, I'm not running with "backend_flush_after > 0" here
-- that was before.

> I wonder if it'd make sense to hack up a patch that logs when evicting a
> buffer while already holding another lwlock. That shouldn't be too hard.

I'll look into this. Thanks
--
Peter Geoghegan
Re: file cloning in pg_upgrade and CREATE DATABASE
On 11/10/2018 16:50, Peter Eisentraut wrote: > On 10/10/2018 21:50, Bruce Momjian wrote: >>> I see. Peter is proposing to have a fourth mode, essentially >>> --transfer-mode=clone-or-copy. >> >> Uh, if you use --link, and the two data directories are on different >> file systems, it fails. No one has ever asked for link-or-copy, so why >> are we considering clone-or-copy? Are we going to need >> link-or-clone-or-copy too? I do realize that clone and copy have >> non-destructive behavior on the old cluster once started, so it does >> make some sense to merge them, unlike link. > > I'm OK to get rid of the clone-or-copy mode. I can see the confusion. New patch that removes all the various reflink modes and adds a new option --clone that works similar to --link. I think it's much cleaner now. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services From 8b7285e61e425fad4d07a011efe2ccd7fd280d54 Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Thu, 18 Oct 2018 23:09:59 +0200 Subject: [PATCH v6] pg_upgrade: Allow use of file cloning Add another transfer mode --clone to pg_upgrade (besides the existing --link and the default copy), using special file cloning calls. This makes the file transfer faster and more space efficient, achieving speed similar to --link mode without the associated drawbacks. On Linux, file cloning is supported on Btrfs and XFS (if formatted with reflink support). On macOS, file cloning is supported on APFS. 
---
 configure                        |  2 +-
 configure.in                     |  1 +
 doc/src/sgml/ref/pgupgrade.sgml  | 37 ++---
 src/bin/pg_upgrade/check.c       | 13 -
 src/bin/pg_upgrade/file.c        | 90
 src/bin/pg_upgrade/option.c      |  8 +++
 src/bin/pg_upgrade/pg_upgrade.h  |  6 ++-
 src/bin/pg_upgrade/relfilenode.c | 44 ++--
 src/include/pg_config.h.in       |  3 ++
 9 files changed, 179 insertions(+), 25 deletions(-)

diff --git a/configure b/configure
index 43ae8c869d..5a7aa02b27 100755
--- a/configure
+++ b/configure
@@ -15130,7 +15130,7 @@ fi
 LIBS_including_readline="$LIBS"
 LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
-for ac_func in cbrt clock_gettime fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate ppoll pstat pthread_is_threaded_np readlink setproctitle setproctitle_fast setsid shm_open strchrnul symlink sync_file_range utime utimes wcstombs_l
+for ac_func in cbrt clock_gettime copyfile fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate ppoll pstat pthread_is_threaded_np readlink setproctitle setproctitle_fast setsid shm_open strchrnul symlink sync_file_range utime utimes wcstombs_l
 do :
   as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh`
   ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var"
diff --git a/configure.in b/configure.in
index 519ecd5e1e..7f22bcee75 100644
--- a/configure.in
+++ b/configure.in
@@ -1602,6 +1602,7 @@ LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
 AC_CHECK_FUNCS(m4_normalize([
 	cbrt
 	clock_gettime
+	copyfile
 	fdatasync
 	getifaddrs
 	getpeerucred
diff --git a/doc/src/sgml/ref/pgupgrade.sgml b/doc/src/sgml/ref/pgupgrade.sgml
index d51146d641..e616ea9dbc 100644
--- a/doc/src/sgml/ref/pgupgrade.sgml
+++ b/doc/src/sgml/ref/pgupgrade.sgml
@@ -182,6 +182,25 @@ Options
      display version information, then exit
+
+   --clone
+
+
+    Use efficient file cloning (also known as reflinks)
+    instead of copying files to the new cluster.  This can result in
+    near-instantaneous copying of the data files, giving the speed
+    advantages of -k/--link while
+    leaving the old cluster untouched.
+
+
+    At present, reflinks are supported on Linux (kernel 4.5 or later) with
+    Btrfs and XFS (on file systems created with reflink support, which is
+    not the default for XFS at this writing), and on macOS with APFS.
+
+
+
    -? --help
@@ -340,7 +359,7 @@ Run pg_upgrade
    Always run the pg_upgrade binary of the new server, not the old one.
    pg_upgrade requires the specification of the old and new cluster's
    data and executable (bin) directories. You can also specify
-   user and port values, and whether you want the data files linked
+   user and port values, and whether you want the data files linked or cloned
    instead of the default copy behavior.
@@ -351,8 +370,12 @@ Run pg_upgrade
    once you start the new cluster after the upgrade.  Link mode also
    requires that the old and new cluster data directories be in the same
    file system.  (Tablespaces and pg_wal can be on
-   different file systems.)  See pg_upgrade --help for a full
-   list of options.
+   different file sy
Re: Postgres, fsync, and OSs (specifically linux)
Hello hackers, Let's try to get this issue resolved. Here is my position on the course of action we should take in back-branches: 1. I am -1 on back-patching the fd-transfer code. It's a significant change, and even when sufficiently debugged (I don't think it's there yet), we have no idea what will happen on all the kernels we support under extreme workloads. IMHO there is no way we can spring this on users in a point release. 2. I am +1 on back-patching Craig's PANIC-on-failure logic. Doing nothing is not an option I like. I have some feedback and changes to propose though; see attached. Responses to a review from Robert: On Thu, Jul 19, 2018 at 7:23 AM Robert Haas wrote: > 2. I don't like promote_ioerr_to_panic() very much, partly because the > same pattern gets repeated over and over, and partly because it would > be awkwardly-named if we discovered that another 2 or 3 errors needed > similar handling (or some other variant handling). I suggest instead > having a function like report_critical_fsync_failure(char *path) that > does something like this: > > int elevel = ERROR; > if (errno == EIO) > elevel = PANIC; > ereport(elevel, > (errcode_for_file_access(), > errmsg("could not fsync file \"%s\": %m", path); > > And similarly I'd add report_critical_close_failure. In some cases, > this would remove wording variations (e.g. in twophase.c) but I think > that's fine, and maybe an improvement, as discussed on another recent > thread. I changed it to look like data_sync_elevel(ERROR) and made it treat all errnos the same. ENOSPC, EIO, EWOK, EIEIO, it makes no difference to the level of faith I have that my data still exists. > 3. slru.c calls pg_fsync() but isn't changed by the patch. That looks wrong. Fixed. > 4. The comment changes in snapbuild.c interact with the TODO that > immediately follows. I think more adjustment is needed here. I don't understand this. > 5. 
> It seems odd that you adjusted the comment for
> pg_fsync_no_writethrough() but not pg_fsync_writethrough() or
> pg_fsync(). Either pg_fsync_writethrough() doesn't have the same
> problem, in which case, awesome, but let's add a comment, or it does,
> in which case it should refer to the other one. And I think
> pg_fsync() itself needs a comment saying that every caller must be
> careful to use promote_ioerr_to_panic() or
> report_critical_fsync_failure() or whatever we end up calling it
> unless the fsync is not critical for data integrity.

I removed these comments and many others; I don't see the point in
scattering descriptions of this problem and references to specific
versions of Linux and -hackers archive links all over the place. I
added a comment in one place, and also added some user documentation
of the problem.

> 6. In md.c, there's a stray blank line added. But more importantly,
> the code just above that looks like this:
>
>     if (!FILE_POSSIBLY_DELETED(errno) ||
>         failures > 0)
> -       ereport(ERROR,
> +       ereport(promote_ioerr_to_panic(ERROR),
>                 (errcode_for_file_access(),
>                  errmsg("could not fsync file \"%s\": %m",
>                         path)));
>     else
>         ereport(DEBUG1,
>                 (errcode_for_file_access(),
>                  errmsg("could not fsync file \"%s\" but retrying: %m",
>                         path)));
>
> I might be all wet here, but it seems like if we enter the bottom
> branch, we still need the promote-to-panic behavior.

That case is only reached if FILE_POSSIBLY_DELETED() on the first time
through the loop, and it detects an errno value not actually from
fsync(). It's from FileSync(), when it tries to reopen a virtual fd
and gets ENOENT, before calling fsync(). Code further down then
absorbs incoming requests before checking if that was expected,
closing a race. The comments could make that clearer, admittedly.

> 7. The comment adjustment for SyncDataDirectory mentions an
> "important" fact about fsync behavior, but then doesn't seem to change
> any logic on that basis.
I think in general a number of these > comments need a little more thought, but in this particular case, I > think we also need to consider what the behavior should be (and the > comment should reflect our considered judgement on that point, and the > implementation should match). I updated the comment. I don't think this is too relevant to the fsync() failure case, because we'll be rewriting all changes from the WAL again during recovery; I think this function is mostly useful for switching from fsync = off to fsync = on and restarting, not coping with previous fsync() failures by retrying (which we know to be useless anyway). Someone could argue that if you restarted after ch
Re: file cloning in pg_upgrade and CREATE DATABASE
On Thu, Oct 18, 2018 at 11:59:00PM +0200, Peter Eisentraut wrote:
> New patch that removes all the various reflink modes and adds a new
> option --clone that works similar to --link. I think it's much cleaner now.

Thanks Peter for the new version.

+
+	{"clone", no_argument, NULL, 1},
+	{NULL, 0, NULL, 0}

Why newlines here?

@@ -293,6 +300,7 @@ usage(void)
 	printf(_("  -U, --username=NAME           cluster superuser (default \"%s\")\n"), os_info.user);
 	printf(_("  -v, --verbose                 enable verbose internal logging\n"));
 	printf(_("  -V, --version                 display version information, then exit\n"));
+	printf(_("      --clone                   clone instead of copying files to new cluster\n"));
 	printf(_("  -?, --help                    show this help, then exit\n"));
 	printf(_("\n"

An idea for a one-letter option could be -n. This patch can live without.

+	pg_fatal("error while cloning relation \"%s.%s\": could not open file \"%s\": %s\n",
+			 schemaName, relName, src, strerror(errno));

The tail of those error messages "could not open file" and "could not
create file" are already available as translatable error messages. Would
it be better to split those messages in two for translators? One is
generated with PG_VERBOSE to mention that cloning checks are done, and
the second one is fatal with the actual errors.

Those are all minor points, the patch looks good to me.
--
Michael

signature.asc Description: PGP signature
Re: Function to promote standby servers
On Tue, Oct 16, 2018 at 09:49:20AM +0200, Laurenz Albe wrote: > Only for hot standby - how do you want to execute a function on a standby > server that doesn't allow connections? In most deployments hot standby will be enabled, and wal_level uses the same value for archive and hot_standby for some time now, so that does not sound like a huge issue to me. -- Michael signature.asc Description: PGP signature
Re: Function to promote standby servers
On Mon, Oct 15, 2018 at 04:16:02PM +0200, Laurenz Albe wrote:
> Michael Paquier wrote:
>> Let's remove this restriction at code level, and instead use
>> function-level ACLs so as it is possible to allow non-superusers to
>> trigger a promotion as well, say a dedicated role. We could add a
>> system role for this purpose, but I am not sure that it is worth the
>> addition yet.
>
> Agreed, done.
>
>> As of now, this function returns true if promotion has been triggered,
>> and false if not. However it seems to me that we should have something
>> more consistent with "pg_ctl promote", so there are actually three
>> possible states:
>> 1) Node is in recovery, with promotion triggered. pg_promote() returns
>> true in case of success in creating the trigger file.
>> 2) Node is in recovery, with promotion triggered. pg_promote() returns
>> false in case of failure in creating the trigger file.
>> 3) Node is not in recovery, where I think a call to pg_promote should
>> trigger an error. This is not handled in the patch.
>
> So far I had returned "false" in the last case, but I am fine with
> throwing an error in that case. Done.

Thanks, that looks correct to me. I think that consistency with pg_ctl
is quite important, as this is a feature present for many releases now.

+	/* wait for up to a minute for promotion */
+	for (i = 0; i < WAITS_PER_SECOND * WAIT_SECONDS; ++i)
+	{
+		if (!RecoveryInProgress())
+			PG_RETURN_BOOL(true);
+
+		pg_usleep(1000000L / WAITS_PER_SECOND);
+	}

I would recommend to avoid pg_usleep and instead use a WaitLatch() or
similar to generate a wait event. The wait can then also be seen in
pg_stat_activity, which is useful for monitoring purposes. Using
RecoveryInProgress is indeed doable, and that's more simple than what I
thought first.

Something I missed to mention in the previous review: the timeout
should be manually enforceable, with a default at 60s.
Is the function marked as strict? Per the code it should be, I am not able to test now though. You are missing REVOKE EXECUTE ON FUNCTION pg_promote() in system_views.sql, or any users could trigger a promotion, no? -- Michael signature.asc Description: PGP signature
Re: lowering pg_regress privileges on Windows
On Thu, Oct 18, 2018 at 08:31:11AM -0400, Andrew Dunstan wrote:
> The attached ridiculously tiny patch solves the problem whereby while we
> can run Postgres on Windows safely from an Administrator account, we can't
> run the regression tests from the same account, since it fails on the
> tablespace test, the tablespace directory having been set up without first
> having lowered privileges. The solution is to lower pg_regress' privileges
> in the same way that we do with other binaries. This is useful in setups
> like Appveyor where running under any other account is ... difficult. For
> the cfbot Thomas has had to make the script hack the schedule file to omit
> the tablespace test. This would make that redundant.
>
> I propose to backpatch this. It's close enough to a bug and the risk is
> almost infinitely small.

+1. get_restricted_token() refactoring has been done down to
REL9_5_STABLE. With 9.4 and older you would need to copy again this
full routine into pg_regress.c, which is in my opinion not worth
worrying about.
--
Michael

signature.asc Description: PGP signature
Re: lowering pg_regress privileges on Windows
On Fri, Oct 19, 2018 at 1:13 PM Michael Paquier wrote: > On Thu, Oct 18, 2018 at 08:31:11AM -0400, Andrew Dunstan wrote: > > The attached ridiculously tiny patch solves the problem whereby while we can > > run Postgres on Windows safely from an Administrator account, we can't run > > run the regression tests from the same account, since it fails on the > > tablespace test, the tablespace directory having been set up without first > > having lowered privileges. The solution is to lower pg_regress' privileges > > in the same way that we do with other binaries. This is useful in setups > > like Appveyor where running under any other account is ... difficult. For > > the cfbot Thomas has had to make the script hack the schedule file to omit > > the tablespace test. This would make that redundant. > > > > I propose to backpatch this. It's close enough to a bug and the risk is > > almost infinitely small. > > +1. get_restricted_token() refactoring has been done down to > REL9_5_STABLE. With 9.4 and older you would need to copy again this > full routine into pg_regress.c, which is in my opinion not worth > worrying about. FWIW here is a successful Appveyor build including the full test schedule (CI patch attached in case anyone is interested). Woohoo! Thanks for figuring that out Andrew. I will be very happy to remove that wart from my workflows. https://ci.appveyor.com/project/macdice/postgres/builds/19626669 -- Thomas Munro http://www.enterprisedb.com 0001-CI-configuration-for-Appveyor.patch Description: Binary data
Re: DSM segment handle generation in background workers
On Sat, Oct 13, 2018 at 11:45 PM Thomas Munro wrote: > On Thu, Oct 11, 2018 at 5:51 PM Tom Lane wrote: > > The comment adjacent to the change in InitStandaloneProcess bothers me. > > In particular, it points out that what BackendRun() is currently doing > > creates more entropy in the process's random seed than what you have > > here: MyStartTime is only good to the nearest second, whereas the code > > you removed from BackendRun() factors in fractional seconds to whatever > > the precision of GetCurrentTimestamp is. This does not seem like an > > improvement. > > True. > > > I'm inclined to think we should refactor a bit more aggressively so > > that, rather than dumbing backends down to the standard of other > > processes, we make them all use the better method. A reasonable way > > to approach this would be to invent a global variable MyStartTimestamp > > beside MyStartTime, and initialize both of them in the places that > > currently initialize the latter, using code like that in > > BackendInitialize: > > > > /* save process start time */ > > port->SessionStartTime = GetCurrentTimestamp(); > > MyStartTime = timestamptz_to_time_t(port->SessionStartTime); > > > > and then this new InitRandomSeeds function could adopt BackendRun's > > seed initialization method instead of the stupider way. > > Ok, done. > > With MyStartTimestamp in the picture, port->SessionStartTime is > probably redundant... and in fact the comment in struct > Port says it shouldn't even be there. So... I removed it, and used > the new MyStartTimestamp in the couple of places that wanted it. > Thoughts? > > > Possibly it'd be sane to merge the MyStartTime* initializations > > and the srandom resets into one function, not sure. > > +1 > > The main difficulty was coming up with a name for the function that > does those things. I went with InitProcessGlobals(). Better > suggestions welcome. Pushed to master only. -- Thomas Munro http://www.enterprisedb.com
Re: partition tree inspection functions
On 2018/10/10 13:01, Michael Paquier wrote:
> On Wed, Oct 10, 2018 at 11:54:48AM +0900, Amit Langote wrote:
>> I was partly wrong in saying that we wouldn't need any changes to support
>> partitioned indexes here. Actually, the core function
>> find_inheritance_children wouldn't scan pg_inherits for us if we pass a
>> (partitioned) index to it, even if the index may have children.
>>
>> It appears that we don't set relhassubclass for partitioned indexes (that
>> is, on parent indexes), so the above test is useless for indexes.
>
> Aha, so that was it.
>
>> Maybe, we need to start another thread to discuss whether we should be
>> manipulating relhassubclass for index partitioning (I'd brought this up in
>> the context of relispartition, btw [1]). I'm not immediately sure if
>> setting relhassubclass correctly for indexes will have applications beyond
>> this one, but it might be something we should let others know so that we
>> can hear more opinions.
>
> This reminds of https://postgr.es/m/20180226053613.gd6...@paquier.xyz,
> which has resulted in relhaspkey being removed from pg_class with
> f66e8bf8. I would much prefer if we actually removed it... Now
> relhassubclass is more tricky than relhaspkey as it gets used as a
> fast-exit path for a couple of things:
> 1) prepunion.c when planning for child relations.
> 2) plancat.c
> 3) analyze.c
> 4) rewriteHandler.c

I'm concerned about the 1st one. Currently in expand_inherited_rtentry(),
checking relhassubclass allows us to short-circuit scanning pg_inherits
to find out if a table has children. Note that expand_inherited_rtentry()
is called for *all* queries that contain a table so that we correctly
identify tables that would require inheritance planning. So, removing
relhassubclass means a slight (?) degradation for *all* queries.
>> For time being, I've modified that code as follows: >> >> @@ -68,9 +70,11 @@ find_inheritance_children(Oid parentrelId, LOCKMODE >> lockmode) >> >> /* >> * Can skip the scan if pg_class shows the relation has never had a >> - * subclass. >> + * subclass. Since partitioned indexes never have their >> + * relhassubclass set, we cannot skip the scan based on that. >> */ >> -if (!has_subclass(parentrelId)) >> +if (get_rel_relkind(parentrelId) != RELKIND_PARTITIONED_INDEX && >> +!has_subclass(parentrelId)) >> return NIL; > > I don't like living with this hack. What if we simply nuked this > fast-path exit all together? The only code path where this could > represent a performance issue is RelationBuildPartitionKey(), but we > expect a partitioned table to have children anyway, and there will be > few cases where partitioned tables have no partitions. As I said above, the price of removing relhassubclass might be a bit steep. So, the other alternative I mentioned before is to set relhassubclass correctly even for indexes if only for pg_partition_tree to be able to use find_inheritance_children unchanged. Thanks, Amit
relhassubclass and partitioned indexes
Hi,

$subject came up in [1]. Should relhassubclass be set/reset for
partitioned indexes? The only use case being sought here is to use
find_inheritance_children() for getting an index's partitions, but it
uses the relhassubclass test to short-circuit scanning pg_inherits.
That means it cannot be used to get an index's children, because
relhassubclass is never set for indexes.

Michael suggested on the linked thread to get rid of relhassubclass
altogether, like we did for relhaspkey recently, but I'm not sure
whether it would be a good idea right yet.

Thoughts?

Thanks,
Amit

[1] https://www.postgresql.org/message-id/3acdcbf4-6a62-fb83-3bfd-5f275602ca7d%40lab.ntt.co.jp
Re: Postgres, fsync, and OSs (specifically linux)
On Fri, 19 Oct 2018 at 07:27, Thomas Munro wrote:
>
> 2. I am +1 on back-patching Craig's PANIC-on-failure logic. Doing
> nothing is not an option I like. I have some feedback and changes to
> propose though; see attached.

Thanks very much for the work on reviewing and revising this.

> I don't see why sync_file_range(SYNC_FILE_RANGE_WRITE) should get a
> pass here. Inspection of some version of the kernel might tell us it
> can't advance the error counter and report failure, but what do we
> gain by relying on that? Changed.

I was sure it made sense at the time, but I can't explain that decision
now, and it looks like we should treat it as a failure.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: relhassubclass and partitioned indexes
Amit Langote writes: > Should relhassubclass be set/reset for partitioned indexes? Seems like a reasonable idea to me, at least the "set" end of it. We don't ever clear relhassubclass for tables, so maybe that's not necessary for indexes either. > Michael suggested on the linked thread to get rid of relhassubclass > altogether, like we did for relhaspkey recently, but I'm not sure whether > it would be a good idea right yet. We got rid of relhaspkey mostly because it was of no use to the backend. That's far from true for relhassubclass. regards, tom lane
Re: relhassubclass and partitioned indexes
On Fri, Oct 19, 2018 at 01:45:03AM -0400, Tom Lane wrote:
> Amit Langote writes:
>> Should relhassubclass be set/reset for partitioned indexes?
>
> Seems like a reasonable idea to me, at least the "set" end of it.
> We don't ever clear relhassubclass for tables, so maybe that's
> not necessary for indexes either.

No objections to the proposal. Allowing find_inheritance_children to
find index trees for partitioned indexes could be actually useful for
extensions like pg_partman.

>> Michael suggested on the linked thread to get rid of relhassubclass
>> altogether, like we did for relhaspkey recently, but I'm not sure whether
>> it would be a good idea right yet.
>
> We got rid of relhaspkey mostly because it was of no use to the backend.
> That's far from true for relhassubclass.

Partitioned tables are expected to have partitions, so the optimizations
related to relhassubclass don't seem much worth worrying about. However,
relations not having inherited tables may take a performance hit. If
this flag removal were done, we'd need to be careful about the
performance impact and the cost of extra lookups at pg_inherits.
--
Michael

signature.asc Description: PGP signature