Hi, While reviewing the thread about issues with auto-analyze on partitioned tables [1] I remembered that the way we build statistics on the parent is somewhat expensive, because it samples rows from the children again.
It's true we sample much smaller amounts of rows from each partition (proportional to it's size), so it's not as expensive as simply running ANALYZE on all partitions individually. But it's not free either, and in some cases (e.g. with partitioning by time) it may touch old data that is not touched by regular queries, so likely not cached etc. That's a bit against the idea to use partitioning to "segregate" old data. One reason why the ANALYZE costs are not a huge issue in practice is that we're not actually triggering that - changes_since_analyze is not updated for the parent, so autoanalyze does not realize it needs to do anything, and we never rebuild the statistics :-( But as shown in [2], that may lead to poor estimates (and bad plans) in cases when we rely on the parent's stats (probably due to not expanding the list of children early enough). (Note: I wonder if we might simply not rely on the parent stats at all, and always consult directly the children stats - but with many children that's likely problematic/expensive, and it does have the same issues as the merging.) The other thread [1] attempts to fix that by incrementing the counters for the parent too, but that'll make this much worse. firstly, with multi-level partitioning we'll have to sample the children repeatedly, essentially once for each level. Secondly, it's tricky to control when exactly are the counters propagated to the parent, making the parent analyzes more frequent. Not great. Even if we do a great job in [1] and come up with smart heuristics, there will always be cases where it does not work too well and we either analyze too late or too often. Note: Propagating changes_since_analyze is only part of the story, as it does not do anything about stats after attach/detach of a partition. This attempts to approach the problem from the other end - instead of tightly controlling when to analyze the parent, it makes the analyze much cheaper. That means we don't need to worry too much about doing the analyze too often, and we can update the stats more aggressively. So, how does it work? Well, it simply fetches the statistics for all children, and merges them together. For most statistics that's fairly simple for most statistics types. 1) stawidth, stanullfrac - Those are trivial to merge. 2) stacorrelation - Not sure, I've used weighted average for now. Perhaps we could store the "internal" counters (sumX, sumX2) which would allow us to calculate regular estimate for the parent. 3) stadistinct - This is quite problematic. We only have the per-child estimates, and it's not clear if there's any overlap. For now I've just summed it up, because that's safer / similar to what we do for gather merge paths etc. Maybe we could improve this by estimating the overlap somehow (e.g. from MCV lists / histograms). But honestly, I doubt the estimates based on tiny sample of each child are any better. I suppose we could introduce a column option, determining how to combine ndistinct (similar to how we can override n_distinct itself). 4) MCV - It's trivial to build a new "parent" MCV list, although it may be too large (in which case we cut it at statistics target, and copy the remaining bits to the histogram) 5) histograms - A bit more complicated, because it requires dealing with overlapping bins, so we may have to "cut" and combine them in some way. If we assume that cutting a bin in K parts means each part has 1/K tuples (no matter where exactly we cut it), then it's trivial and it works just fine in practice. That's because with N children, each bin actually represents 1.0/(target*N) of the tuples, so the errors are quite limited. The attached patch is a PoC - it should work, but there's plenty of room for improvement. It only deals with "regular" per-column statistics, not with extended stats (but I don't see why it wouldn't work for them too). It adds a new analyze option "MERGE" which does not sample the children but instead just combines the statistics. So the example from [2] looks like this: ====================================================================== create table p (i integer, j integer) partition by list (i); create table p0 partition of p for values in (0); create table p1 partition of p for values in (1); insert into p select 0,generate_series(1,1000); insert into p select 1,generate_series(1,1000); analyze p0; analyze p1; create table q (i integer); insert into q values (0); analyze q; test=# explain select * from q join p using (i) where j between 1 and 500; QUERY PLAN --------------------------------------------------------------------- Hash Join (cost=1.02..49.82 rows=5 width=8) Hash Cond: (p.i = q.i) -> Append (cost=0.00..45.00 rows=1000 width=8) -> Seq Scan on p0 p_1 (cost=0.00..20.00 rows=500 width=8) Filter: ((j >= 1) AND (j <= 500)) -> Seq Scan on p1 p_2 (cost=0.00..20.00 rows=500 width=8) Filter: ((j >= 1) AND (j <= 500)) -> Hash (cost=1.01..1.01 rows=1 width=4) -> Seq Scan on q (cost=0.00..1.01 rows=1 width=4) (9 rows) test=# analyze (merge) p; ANALYZE test=# explain select * from q join p using (i) where j between 1 and 500; QUERY PLAN --------------------------------------------------------------------- Hash Join (cost=1.02..54.77 rows=500 width=8) Hash Cond: (p.i = q.i) -> Append (cost=0.00..45.00 rows=1000 width=8) -> Seq Scan on p0 p_1 (cost=0.00..20.00 rows=500 width=8) Filter: ((j >= 1) AND (j <= 500)) -> Seq Scan on p1 p_2 (cost=0.00..20.00 rows=500 width=8) Filter: ((j >= 1) AND (j <= 500)) -> Hash (cost=1.01..1.01 rows=1 width=4) -> Seq Scan on q (cost=0.00..1.01 rows=1 width=4) (9 rows) ====================================================================== FWIW I'm not sure we need the separate MERGE mode, but it's an easy way to allow both the regular and "merge" approach, so that it's possible to experiment and compare the behavior. One issue is that this would require some coordination between the parent/child analyzes. Essentially, any time a child is analyzed, the parent should rebuild the stats (to merge from the new child stats). This is similar to the issue of analyzing the parent too often because we don't know when exactly the counters get updated, but it's much cheaper to merge the stats, so it's much less problematic. regards [1] https://commitfest.postgresql.org/32/2492/ [2] https://www.postgresql.org/message-id/CAM-w4HO9hUHvJDVwQ8%3DFgm-znF9WNvQiWsfyBjCr-5FD7gWKGA%40mail.gmail.com -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
>From 423f639cdd6e5d3d1fef67588e230ff7663d4e0f Mon Sep 17 00:00:00 2001 From: Tomas Vondra <tomas.von...@postgresql.org> Date: Sun, 28 Mar 2021 17:43:11 +0200 Subject: [PATCH] Combine statistics from child relations ANALYZE (MERGE) t; --- src/backend/commands/analyze.c | 830 ++++++++++++++++++++++++++++++++- src/backend/commands/vacuum.c | 6 +- src/include/commands/vacuum.h | 1 + 3 files changed, 834 insertions(+), 3 deletions(-) diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c index f84616d3d2..dbffabae9f 100644 --- a/src/backend/commands/analyze.c +++ b/src/backend/commands/analyze.c @@ -66,6 +66,7 @@ #include "utils/spccache.h" #include "utils/syscache.h" #include "utils/timestamp.h" +#include "utils/typcache.h" /* Per-index data for ANALYZE */ @@ -90,6 +91,10 @@ static void do_analyze_rel(Relation onerel, VacuumParams *params, List *va_cols, AcquireSampleRowsFunc acquirefunc, BlockNumber relpages, bool inh, bool in_outer_xact, int elevel); +static void do_analyze_merge_rel(Relation onerel, + VacuumParams *params, List *va_cols, + AcquireSampleRowsFunc acquirefunc, BlockNumber relpages, + bool inh, bool in_outer_xact, int elevel); static void compute_index_stats(Relation onerel, double totalrows, AnlIndexData *indexdata, int nindexes, HeapTuple *rows, int numrows, @@ -125,6 +130,7 @@ analyze_rel(Oid relid, RangeVar *relation, int elevel; AcquireSampleRowsFunc acquirefunc = NULL; BlockNumber relpages = 0; + bool merge_stats; /* Select logging level */ if (params->options & VACOPT_VERBOSE) @@ -132,6 +138,9 @@ analyze_rel(Oid relid, RangeVar *relation, else elevel = DEBUG2; + /* requested to merge child statistics instead of sampling */ + merge_stats = (params->options & VACOPT_MERGE_STATS); + /* Set up static variables */ vac_strategy = bstrategy; @@ -263,10 +272,19 @@ analyze_rel(Oid relid, RangeVar *relation, /* * If there are child tables, do recursive ANALYZE. + * + * If ANALYZE (MERGE) was requested, simply fetch and merge statistics + * from the children, instead of sampling the data. */ if (onerel->rd_rel->relhassubclass) - do_analyze_rel(onerel, params, va_cols, acquirefunc, relpages, - true, in_outer_xact, elevel); + { + if (merge_stats) + do_analyze_merge_rel(onerel, params, va_cols, acquirefunc, relpages, + true, in_outer_xact, elevel); + else + do_analyze_rel(onerel, params, va_cols, acquirefunc, relpages, + true, in_outer_xact, elevel); + } /* * Close source relation now, but keep lock so that no one deletes it @@ -802,6 +820,814 @@ do_analyze_rel(Relation onerel, VacuumParams *params, anl_context = NULL; } +/* merging of statistics from children tables (MCV, histogram) */ + +typedef struct MCVMergeItem { + Datum value; + double nrows; +} MCVMergeItem; + +typedef struct HistogramMergeItem { + Datum minval; + Datum maxval; + double nrows; +} HistogramMergeItem; + +typedef struct merge_context_t { + FmgrInfo opproc; + Oid collation; +} merge_context_t; + +/* compare MCV items by the values */ +static int +compare_mcv_merge_item(const void *a, const void *b, void *arg) +{ + bool ltcmp; + MCVMergeItem *ma = (MCVMergeItem *) a; + MCVMergeItem *mb = (MCVMergeItem *) b; + merge_context_t *cxt = (merge_context_t *) arg; + + ltcmp = DatumGetBool(FunctionCall2Coll(&cxt->opproc, cxt->collation, + ma->value, mb->value)); + + if (ltcmp) + return -1; + + ltcmp = DatumGetBool(FunctionCall2Coll(&cxt->opproc, cxt->collation, + mb->value, ma->value)); + + if (ltcmp) + return 1; + + return 0; +} + +/* sort the MCV items by tuples in descending order */ +static int +compare_mcv_by_tuples(const void *a, const void *b) +{ + MCVMergeItem *ma = (MCVMergeItem *) a; + MCVMergeItem *mb = (MCVMergeItem *) b; + + if (ma->nrows > mb->nrows) + return -1; + else if (ma->nrows > mb->nrows) + return 1; + + return 0; +} + +/* + * Combine duplicate items from multiple MCV lists, build a new list. + * + * XXX In principle the size of the MCV list should be determined the same + * way analyze_mcv_list() does it for individual tables, but we have only + * very unreliable ndistinct estimate. So we simply assume that whatever + * got into the per-table MCV list, is worth keeping in the MCV list for + * the parent too. + */ +static int +merge_mcv(MCVMergeItem *mcv, int *mcv_items, merge_context_t *cxt) +{ + int i; + int n; + + if (*mcv_items == 0) + return 0; + + qsort_arg(mcv, *mcv_items, sizeof(MCVMergeItem), + compare_mcv_merge_item, cxt); + + n = 1; + for (i = 1; i < *mcv_items; i++) + { + if (compare_mcv_merge_item(&mcv[i-1], &mcv[i], cxt) == 0) + continue; + + mcv[n] = mcv[i]; + n++; + } + + *mcv_items = n; + + pg_qsort(mcv, n, sizeof(MCVMergeItem), compare_mcv_by_tuples); + + if (*mcv_items > default_statistics_target) + n = default_statistics_target; + + return n; +} + +/* compare datum using the comparator / collation */ +static int +compare_values(const void *a, const void *b, void *arg) +{ + bool ltcmp; + merge_context_t *cxt = (merge_context_t *) arg; + + Datum da = * (Datum *) a; + Datum db = * (Datum *) b; + + ltcmp = DatumGetBool(FunctionCall2Coll(&cxt->opproc, cxt->collation, + da, db)); + + if (ltcmp) + return -1; + + ltcmp = DatumGetBool(FunctionCall2Coll(&cxt->opproc, cxt->collation, + db, da)); + + if (ltcmp) + return 1; + + return 0; +} + +/* merge ranges from multiple histograms, build a new histogram */ +static Datum * +merge_histogram(HistogramMergeItem *hist, int hist_items, int *nitems, + merge_context_t *cxt) +{ + int i; + int n; + double nrows_total; + Datum *values; + int nvalues; + int new_hist_items; + HistogramMergeItem *new_hist; + double step; + double next; + double curr; + + if (hist_items == 0) + { + *nitems = 0; + return NULL; + } + + /* total number of rows in the histogram */ + nrows_total = 0; + for (i = 0; i < hist_items; i++) + nrows_total += hist[i].nrows; + + /* construct non-overlapping intervals - we don't split */ + nvalues = 0; + values = (Datum *) palloc(2 * hist_items * sizeof(Datum)); + for (i = 0; i < hist_items; i++) + { + values[nvalues++] = hist[i].minval; + values[nvalues++] = hist[i].maxval; + } + + qsort_arg(values, nvalues, sizeof(Datum), compare_values, cxt); + + n = 1; + for (i = 1; i < nvalues; i++) + { + if (compare_values(&values[i-1], &values[i], cxt) == 0) + continue; + + values[n++] = values[i]; + } + nvalues = n; + + /* transform the values into histogram */ + new_hist_items = (nvalues - 1); + new_hist = (HistogramMergeItem *) palloc0(sizeof(HistogramMergeItem) * new_hist_items); + for (i = 0; i < new_hist_items; i++) + { + new_hist[i].minval = values[i]; + new_hist[i].maxval = values[i+1]; + } + + /* redistribute the rows from the old into the new histogram */ + for (i = 0; i < hist_items; i++) + { + int startidx = 0; + int endidx = 0; + double nrows; + int j; + + /* find the first matching bin in the new histogram - loop until we + * find the first bin after it, then go one step back */ + while (true) + { + if ((startidx == new_hist_items) || + compare_values(&hist[i].minval, &new_hist[startidx].minval, cxt) < 0) + { + startidx--; + break; + } + + startidx++; + } + + Assert(startidx >= 0); + Assert(startidx < new_hist_items); + + /* find the last matching bin in the new histogram - loop until we + * find the first bin after it, then go one step back */ + while (true) + { + if ((endidx == new_hist_items) || + compare_values(&hist[i].maxval, &new_hist[endidx].minval, cxt) < 0) + { + endidx--; + break; + } + + endidx++; + } + + Assert(endidx >= startidx); + Assert(startidx < new_hist_items); + + /* + * Redistribute the counts between the matching ranges equally. + * + * XXX We could do better for data types with "distance", maybe? + */ + nrows = hist[i].nrows / (endidx - startidx + 1); + + for (j = startidx; j <= endidx; j++) + new_hist[j].nrows += nrows; + } + + nrows_total = 0; + for (i = 0; i < new_hist_items; i++) + nrows_total += new_hist[i].nrows; + + step = (nrows_total / default_statistics_target); + next = step; + curr = 0; + + values[0] = new_hist[0].minval; + n = 1; + + for (i = 0; i < new_hist_items; i++) + { + curr += new_hist[i].nrows; + + /* not reached the next threshold */ + if (curr < next) + continue; + + while (curr >= next) + { + values[n++] = new_hist[i].maxval; + next += step; + } + } + + *nitems = n; + + return values; +} + +/* + * do_analyze_merge_rel() -- analyze partitioned relation by merging stats + * from child tables + */ +static void +do_analyze_merge_rel(Relation onerel, VacuumParams *params, + List *va_cols, AcquireSampleRowsFunc acquirefunc, + BlockNumber relpages, bool inh, bool in_outer_xact, + int elevel) +{ + int attr_cnt, + tcnt, + i; + VacAttrStats **vacattrstats; + PGRUsage ru0; + TimestampTz starttime = 0; + MemoryContext caller_context; + Oid save_userid; + int save_sec_context; + int save_nestlevel; + int64 AnalyzePageHit = VacuumPageHit; + int64 AnalyzePageMiss = VacuumPageMiss; + int64 AnalyzePageDirty = VacuumPageDirty; + PgStat_Counter startreadtime = 0; + PgStat_Counter startwritetime = 0; + List *children; + Relation *rels; + int nrels; + ListCell *lc; + + double *relblocks; + double *reltuples; + double totalblocks; + double totaltuples; + + /* pointers to pg_statistic tuples */ + HeapTuple *statistic_tuples; + + /* partitioned tables / inheritance trees only */ + Assert(inh); + + /* + * Set up a working context so that we can easily free whatever junk gets + * created. + */ + anl_context = AllocSetContextCreate(CurrentMemoryContext, + "Analyze", + ALLOCSET_DEFAULT_SIZES); + caller_context = MemoryContextSwitchTo(anl_context); + + /* + * Switch to the table owner's userid, so that any index functions are run + * as that user. Also lock down security-restricted operations and + * arrange to make GUC variable changes local to this command. + * + * XXX Probably not needed, we're not going to run any functions. + */ + GetUserIdAndSecContext(&save_userid, &save_sec_context); + SetUserIdAndSecContext(onerel->rd_rel->relowner, + save_sec_context | SECURITY_RESTRICTED_OPERATION); + save_nestlevel = NewGUCNestLevel(); + + /* measure elapsed time iff autovacuum logging requires it */ + if (IsAutoVacuumWorkerProcess() && params->log_min_duration >= 0) + { + if (track_io_timing) + { + startreadtime = pgStatBlockReadTime; + startwritetime = pgStatBlockWriteTime; + } + + pg_rusage_init(&ru0); + if (params->log_min_duration >= 0) + starttime = GetCurrentTimestamp(); + } + + /* + * Determine which columns to analyze + * + * Note that system attributes are never analyzed, so we just reject them + * at the lookup stage. We also reject duplicate column mentions. (We + * could alternatively ignore duplicates, but analyzing a column twice + * won't work; we'd end up making a conflicting update in pg_statistic.) + */ + if (va_cols != NIL) + { + Bitmapset *unique_cols = NULL; + ListCell *le; + + vacattrstats = (VacAttrStats **) palloc(list_length(va_cols) * + sizeof(VacAttrStats *)); + tcnt = 0; + foreach(le, va_cols) + { + char *col = strVal(lfirst(le)); + + i = attnameAttNum(onerel, col, false); + if (i == InvalidAttrNumber) + ereport(ERROR, + (errcode(ERRCODE_UNDEFINED_COLUMN), + errmsg("column \"%s\" of relation \"%s\" does not exist", + col, RelationGetRelationName(onerel)))); + if (bms_is_member(i, unique_cols)) + ereport(ERROR, + (errcode(ERRCODE_DUPLICATE_COLUMN), + errmsg("column \"%s\" of relation \"%s\" appears more than once", + col, RelationGetRelationName(onerel)))); + unique_cols = bms_add_member(unique_cols, i); + + vacattrstats[tcnt] = examine_attribute(onerel, i, NULL); + if (vacattrstats[tcnt] != NULL) + tcnt++; + } + attr_cnt = tcnt; + } + else + { + attr_cnt = onerel->rd_att->natts; + vacattrstats = (VacAttrStats **) + palloc(attr_cnt * sizeof(VacAttrStats *)); + tcnt = 0; + for (i = 1; i <= attr_cnt; i++) + { + vacattrstats[tcnt] = examine_attribute(onerel, i, NULL); + if (vacattrstats[tcnt] != NULL) + tcnt++; + } + attr_cnt = tcnt; + } + + /* + * XXX We don't want to touch the parent's indexes at all. + */ + + /**/ + children = find_all_inheritors(RelationGetRelid(onerel), AccessShareLock, NULL); + + /* filter interesting relations (leafs only) */ + rels = (Relation *) palloc(list_length(children) * sizeof(Relation)); + relblocks = (double *) palloc(list_length(children) * sizeof(double)); + reltuples = (double *) palloc(list_length(children) * sizeof(double)); + totalblocks = 0; + totaltuples = 0; + + nrels = 0; + + foreach(lc, children) + { + Oid childoid = lfirst_oid(lc); + Relation childrel; + + /* We already got the needed lock */ + childrel = table_open(childoid, NoLock); + + /* Ignore if temp table of another backend */ + if (RELATION_IS_OTHER_TEMP(childrel)) + { + /* ... but release the lock on it */ + Assert(childrel != onerel); + table_close(childrel, AccessShareLock); + continue; + } + + /* + * ignore anything but valid leaf relatiins with data, but release + * the lock on it. don't try to unlock the passed-in relation + */ + if (childrel->rd_rel->relkind != RELKIND_RELATION && + childrel->rd_rel->relkind != RELKIND_MATVIEW && + childrel->rd_rel->relkind != RELKIND_FOREIGN_TABLE) + { + Assert(childrel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE); + if (childrel != onerel) + table_close(childrel, AccessShareLock); + else + table_close(childrel, NoLock); + continue; + } + + /* OK, we'll process this child */ + rels[nrels] = childrel; + relblocks[nrels] = (double) RelationGetNumberOfBlocks(childrel); + reltuples[nrels] = (double) childrel->rd_rel->reltuples; + totalblocks += relblocks[nrels]; + totaltuples += reltuples[nrels]; + nrels++; + } + + statistic_tuples = (HeapTuple *) palloc(list_length(children) * sizeof(HeapTuple)); + + /* FIXME do the merge here */ + for (i = 0; i < attr_cnt; i++) + { + int j; + float4 r_correlation, + r_nullfrac; + int slot_idx; + int64 r_stawidth = 0; + double r_ndistinct = 0; + + MCVMergeItem *mcv; + int max_mcv_items = 64; + int mcv_items = 0; + int mcv_size; + + HistogramMergeItem *hist; + int max_hist_items = 64; + int hist_items = 0; + + Datum *values; + int nvalues; + + TypeCacheEntry *typcache; + merge_context_t cxt; + + VacAttrStats *stats = vacattrstats[i]; + + Oid ltopr, + eqopr; + + /* Look for default "<" and "=" operators for column's type */ + get_sort_group_operators(stats->attrtypid, + false, false, false, + <opr, &eqopr, NULL, + NULL); + + typcache = lookup_type_cache(vacattrstats[i]->attr->atttypid, + TYPECACHE_LT_OPR); + + mcv = (MCVMergeItem *) palloc(sizeof(MCVMergeItem) * max_mcv_items); + hist = (HistogramMergeItem *) palloc(sizeof(HistogramMergeItem) * max_hist_items); + + r_correlation = 0; + r_nullfrac = 0; + + /* fetch stats for the attribute from all children */ + for (j = 0; j < nrels; j++) + { + HeapTuple htup; + Form_pg_statistic stats; + double nullfrac; + double mcvfrac; + + statistic_tuples[j] = NULL; + + htup = SearchSysCache3(STATRELATTINH, + ObjectIdGetDatum(RelationGetRelid(rels[j])), + Int16GetDatum(vacattrstats[i]->attr->attnum), + BoolGetDatum(false)); + + if (!HeapTupleIsValid(htup)) + { + elog(WARNING, "stats for %d %d not found", + RelationGetRelid(rels[j]), vacattrstats[i]->attr->attnum); + continue; + } + + stats = (Form_pg_statistic) GETSTRUCT(htup); + + nullfrac = stats->stanullfrac; + mcvfrac = 0; + + /* increment the result null fraction */ + r_nullfrac += (nullfrac * reltuples[j]); + + /* increment the average width */ + r_stawidth += (stats->stawidth * reltuples[j]); + + /* + * XXX We sum the ndistinct for child tables. This is probably + * overkill, because the child tables likely share at least some + * values (except for the partition key case). But it's similar + * to what we do to estimate number of groups before calling + * create_gather_merge_path, etc. + */ + if (stats->stadistinct < 0) + r_ndistinct += -(stats->stadistinct * reltuples[j]); + else + r_ndistinct += stats->stadistinct; + + /* FIXME do ACL checks here */ + + /* extract stats about the attribute */ + { + AttStatsSlot sslot; + + /* expand the MCV list */ + if (get_attstatsslot(&sslot, htup, + STATISTIC_KIND_MCV, InvalidOid, + ATTSTATSSLOT_VALUES | ATTSTATSSLOT_NUMBERS)) + { + int k; + + for (k = 0; k < sslot.nvalues; k++) + { + if (max_mcv_items == mcv_items) + { + max_mcv_items *= 2; + mcv = repalloc(mcv, sizeof(MCVMergeItem) * max_mcv_items); + } + + mcvfrac += sslot.numbers[k]; + + mcv[mcv_items].value = sslot.values[k]; + mcv[mcv_items].nrows = (sslot.numbers[k] * reltuples[j]); + mcv_items++; + } + } + + /* expand the histogram */ + if (get_attstatsslot(&sslot, htup, + STATISTIC_KIND_HISTOGRAM, InvalidOid, + ATTSTATSSLOT_VALUES)) + { + int k; + /* one histogram bin represents this fraction of rows */ + double frac = (1 - mcvfrac - nullfrac) / (sslot.nvalues - 1); + + for (k = 0; k < sslot.nvalues - 1; k++) + { + if (max_hist_items == hist_items) + { + max_hist_items *= 2; + hist = repalloc(hist, sizeof(HistogramMergeItem) * max_hist_items); + } + + hist[hist_items].minval = sslot.values[k]; + hist[hist_items].maxval = sslot.values[k+1]; + hist[hist_items].nrows = (frac * reltuples[j]); + hist_items++; + } + } + + /* expand the correlation */ + if (get_attstatsslot(&sslot, htup, + STATISTIC_KIND_CORRELATION, InvalidOid, + ATTSTATSSLOT_NUMBERS)) + r_correlation += sslot.numbers[0] * (reltuples[j] / totaltuples); + + /* FIXME handle the other stakind types (array elements etc.) */ + } + + statistic_tuples[j] = htup; + } + + fmgr_info(get_opcode(typcache->lt_opr), &cxt.opproc); + cxt.collation = stats->attrcollid; + + /* merge values in the MCV lists (if any) */ + mcv_size = merge_mcv(mcv, &mcv_items, &cxt); + + /* + * If there are MCV items that did not make it to the combined MCV list, + * add them to the histogram as ranges. + */ + for (j = mcv_size; j < mcv_items; j++) + { + if (max_hist_items == hist_items) + { + max_hist_items *= 2; + hist = repalloc(hist, sizeof(HistogramMergeItem) * max_hist_items); + } + + hist[hist_items].minval = mcv[j].value; + hist[hist_items].maxval = mcv[j].value; + hist[hist_items].nrows = mcv[j].nrows; + hist_items++; + } + + /* merge histograms and remaining parts of MCV */ + values = merge_histogram(hist, hist_items, &nvalues, &cxt); + + slot_idx = 0; + + if (mcv_size > 0) + { + Datum *mcv_values = (Datum *) palloc(sizeof(Datum) * mcv_size); + float4 *mcv_freqs = (float4 *) palloc(sizeof(float4) * mcv_size); + + for (j = 0; j < mcv_size; j++) + { + mcv_values[j] = mcv[j].value; + mcv_freqs[j] = (mcv[j].nrows / totaltuples); + } + + stats->stakind[slot_idx] = STATISTIC_KIND_MCV; + stats->staop[slot_idx] = eqopr; + stats->stacoll[slot_idx] = stats->attrcollid; + stats->stanumbers[slot_idx] = mcv_freqs; + stats->numnumbers[slot_idx] = mcv_size; + stats->stavalues[slot_idx] = mcv_values; + stats->numvalues[slot_idx] = mcv_size; + slot_idx++; + } + + if (nvalues > 0) + { + stats->stakind[slot_idx] = STATISTIC_KIND_HISTOGRAM; + stats->staop[slot_idx] = ltopr; + stats->stacoll[slot_idx] = stats->attrcollid; + stats->stavalues[slot_idx] = values; + stats->numvalues[slot_idx] = nvalues; + slot_idx++; + } + + { + float4 *corrs = (float4 *) palloc(sizeof(float4)); + corrs[0] = r_correlation; + + stats->stakind[slot_idx] = STATISTIC_KIND_CORRELATION; + stats->staop[slot_idx] = ltopr; + stats->stacoll[slot_idx] = stats->attrcollid; + stats->stanumbers[slot_idx] = corrs; + stats->numnumbers[slot_idx] = 1; + slot_idx++; + } + + stats->stanullfrac = r_nullfrac / totaltuples; + stats->stawidth = r_stawidth / totaltuples; + + if (r_ndistinct > 0.1 * totaltuples) + stats->stadistinct = -(r_ndistinct / totaltuples); + else + stats->stadistinct = r_ndistinct; + + stats->stats_valid = true; + + /* update the statistics */ + update_attstats(RelationGetRelid(onerel), inh, + 1, &stats); + + /* release the stat tuples */ + for (j = 0; j < nrels; j++) + ReleaseSysCache(statistic_tuples[j]); + } + + /* close the remaining rels */ + for (i = 0; i < nrels; i++) + table_close(rels[i], NoLock); + + pgstat_progress_update_param(PROGRESS_ANALYZE_PHASE, + PROGRESS_ANALYZE_PHASE_FINALIZE_ANALYZE); + + /* Log the action if appropriate */ + if (IsAutoVacuumWorkerProcess() && params->log_min_duration >= 0) + { + TimestampTz endtime = GetCurrentTimestamp(); + + if (params->log_min_duration == 0 || + TimestampDifferenceExceeds(starttime, endtime, + params->log_min_duration)) + { + long delay_in_ms; + double read_rate = 0; + double write_rate = 0; + StringInfoData buf; + + /* + * Calculate the difference in the Page Hit/Miss/Dirty that + * happened as part of the analyze by subtracting out the + * pre-analyze values which we saved above. + */ + AnalyzePageHit = VacuumPageHit - AnalyzePageHit; + AnalyzePageMiss = VacuumPageMiss - AnalyzePageMiss; + AnalyzePageDirty = VacuumPageDirty - AnalyzePageDirty; + + /* + * We do not expect an analyze to take > 25 days and it simplifies + * things a bit to use TimestampDifferenceMilliseconds. + */ + delay_in_ms = TimestampDifferenceMilliseconds(starttime, endtime); + + /* + * Note that we are reporting these read/write rates in the same + * manner as VACUUM does, which means that while the 'average read + * rate' here actually corresponds to page misses and resulting + * reads which are also picked up by track_io_timing, if enabled, + * the 'average write rate' is actually talking about the rate of + * pages being dirtied, not being written out, so it's typical to + * have a non-zero 'avg write rate' while I/O Timings only reports + * reads. + * + * It's not clear that an ANALYZE will ever result in + * FlushBuffer() being called, but we track and support reporting + * on I/O write time in case that changes as it's practically free + * to do so anyway. + */ + + if (delay_in_ms > 0) + { + read_rate = (double) BLCKSZ * AnalyzePageMiss / (1024 * 1024) / + (delay_in_ms / 1000.0); + write_rate = (double) BLCKSZ * AnalyzePageDirty / (1024 * 1024) / + (delay_in_ms / 1000.0); + } + + /* + * We split this up so we don't emit empty I/O timing values when + * track_io_timing isn't enabled. + */ + + initStringInfo(&buf); + appendStringInfo(&buf, _("automatic analyze of table \"%s.%s.%s\"\n"), + get_database_name(MyDatabaseId), + get_namespace_name(RelationGetNamespace(onerel)), + RelationGetRelationName(onerel)); + appendStringInfo(&buf, _("buffer usage: %lld hits, %lld misses, %lld dirtied\n"), + (long long) AnalyzePageHit, + (long long) AnalyzePageMiss, + (long long) AnalyzePageDirty); + appendStringInfo(&buf, _("avg read rate: %.3f MB/s, avg write rate: %.3f MB/s\n"), + read_rate, write_rate); + if (track_io_timing) + { + appendStringInfoString(&buf, _("I/O Timings:")); + if (pgStatBlockReadTime - startreadtime > 0) + appendStringInfo(&buf, _(" read=%.3f"), + (double) (pgStatBlockReadTime - startreadtime) / 1000); + if (pgStatBlockWriteTime - startwritetime > 0) + appendStringInfo(&buf, _(" write=%.3f"), + (double) (pgStatBlockWriteTime - startwritetime) / 1000); + appendStringInfoChar(&buf, '\n'); + } + appendStringInfo(&buf, _("system usage: %s"), pg_rusage_show(&ru0)); + + ereport(LOG, + (errmsg_internal("%s", buf.data))); + + pfree(buf.data); + } + } + + /* Roll back any GUC changes executed by index functions */ + AtEOXact_GUC(false, save_nestlevel); + + /* Restore userid and security context */ + SetUserIdAndSecContext(save_userid, save_sec_context); + + /* Restore current context and release memory */ + MemoryContextSwitchTo(caller_context); + MemoryContextDelete(anl_context); + anl_context = NULL; +} + /* * Compute statistics about indexes of a relation */ diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c index c064352e23..e2cd2ee317 100644 --- a/src/backend/commands/vacuum.c +++ b/src/backend/commands/vacuum.c @@ -103,6 +103,7 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel) bool analyze = false; bool freeze = false; bool full = false; + bool merge_stats = false; bool disable_page_skipping = false; bool process_toast = true; ListCell *lc; @@ -124,6 +125,8 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel) verbose = defGetBoolean(opt); else if (strcmp(opt->defname, "skip_locked") == 0) skip_locked = defGetBoolean(opt); + else if (strcmp(opt->defname, "merge") == 0) + merge_stats = defGetBoolean(opt); else if (!vacstmt->is_vacuumcmd) ereport(ERROR, (errcode(ERRCODE_SYNTAX_ERROR), @@ -193,7 +196,8 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel) (freeze ? VACOPT_FREEZE : 0) | (full ? VACOPT_FULL : 0) | (disable_page_skipping ? VACOPT_DISABLE_PAGE_SKIPPING : 0) | - (process_toast ? VACOPT_PROCESS_TOAST : 0); + (process_toast ? VACOPT_PROCESS_TOAST : 0) | + (merge_stats ? VACOPT_MERGE_STATS : 0); /* sanity checks on options */ Assert(params.options & (VACOPT_VACUUM | VACOPT_ANALYZE)); diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h index d029da5ac0..688f475162 100644 --- a/src/include/commands/vacuum.h +++ b/src/include/commands/vacuum.h @@ -183,6 +183,7 @@ typedef struct VacAttrStats #define VACOPT_SKIP_LOCKED 0x20 /* skip if cannot get lock */ #define VACOPT_PROCESS_TOAST 0x40 /* process the TOAST table, if any */ #define VACOPT_DISABLE_PAGE_SKIPPING 0x80 /* don't skip any pages */ +#define VACOPT_MERGE_STATS 0x0100 /* merge statistics from children */ /* * A ternary value used by vacuum parameters. -- 2.30.2