On Wed, Jul 20, 2022 at 8:19 PM Önder Kalacı <onderkal...@gmail.com> wrote:
>
>> > - Add a new class of invalidation callbacks: Today, if we do ALTER TABLE
>> > or CREATE INDEX on a table, the CacheRegisterRelcacheCallback helps us to
>> > re-create the cache entries. In this case, as far as I can see, we need a
>> > callback that is called when the table is "ANALYZE"d, because that is when
>> > the statistics change. That is the time when picking a new index makes
>> > sense. However, that seems like adding another dimension to this patch,
>> > which I can try, but I also see that committing becomes even harder.
>>
>> This idea sounds worth investigating. I see that this will require
>> more work but OTOH, we can't allow the existing system to regress,
>> especially because depending on the workload it might regress badly.
>
> Just to note, in case that is not clear: this patch avoids (or at least
> aims to avoid, assuming no bugs) changing the behavior of existing systems
> with a PRIMARY KEY or UNIQUE index. In those cases, we still use the
> relevant indexes.
>
>> We can create a patch for this atop the base patch for easier review/test
>> but I feel we need some way to address this point.
>
> Another idea could be to re-calculate the index, say, after N
> updates/deletes on the table. We may consider using a subscription
> parameter for getting N -- with a good default -- or even hard-code it
> into the code. I think the cost of re-calculating should really be pretty
> small compared to the other things happening during logical replication.
> So, a sane default might work?
One difficulty in deciding the value of N for the user, or choosing a
default, would be that we probably need to consider local DML operations on
the table as well.

> If you think the above doesn't work, I can try to work on a separate patch
> which adds something like an "analyze invalidation callback".

I suggest we give this a try, and if it turns out to be problematic or
complex then we can think of using some heuristic as you are suggesting
above.

>> Now, your point related to planner code in the patch bothers me as
>> well but I haven't studied the patch in detail to provide any
>> alternatives at this stage. Do you have any other ideas to make it
>> simpler or solve this problem in some other way?
>
> One idea I tried earlier was to go over the existing indexes on the
> table, then get the IndexInfo via BuildIndexInfo(), and then try to find a
> good heuristic to pick an index. In the end, I felt that it does a
> sub-optimal job, requires a similar amount of code to the current patch,
> and still uses similar infrastructure.
>
> My conclusion for that route was that I should either use a very simple
> heuristic (like picking the index with the most columns) and accept a
> suboptimal index pick,

Not only that, but say all the indexes have the same number of columns;
then we probably need to either pick the first such index or use some other
heuristic.

> OR use a complex heuristic with a reasonable index pick. And the latter
> approach converged to the planner code in the patch. Do you think the
> former approach is acceptable?

In this regard, I was thinking about the cases in which a sequential scan
can be better than an index scan (considering one is available). I think if
a certain column has a lot of duplicates (for example, a column holding a
boolean value) then doing a sequential scan is probably better. Considering
this, even though your other approach sounds simpler, it could lead to
unpredictable results.
So, I think the latter approach is preferable.

BTW, do we want to consider partial indexes for the scan in this context? I
mean, a partial index may not have data for all rows, so how would it be
usable?

Few comments:
===============
1.
 static List *
+CreateReplicaIdentityFullPaths(Relation localrel)
 {
...
+	/*
+	 * Rather than doing all the pushups that would be needed to use
+	 * set_baserel_size_estimates, just do a quick hack for rows and width.
+	 */
+	rel->rows = rel->tuples;

Is it a good idea to set rows without any selectivity estimation? Won't
this always set rows to the entire number of rows in the relation? Also, if
we don't want to use set_baserel_size_estimates(), how will we compute
baserestrictcost, which is later used in the costing of paths (for example,
the costing of the seqscan path (cost_seqscan) uses it)?

In general, I think it will be better to consider calling some top-level
planner functions even for paths. Can we consider using make_one_rel()
instead of building individual paths? On similar lines, in the function
PickCheapestIndexPathIfExists(), can we use set_cheapest()?

2.
@@ -57,9 +60,6 @@ build_replindex_scan_key(ScanKey skey, Relation rel, Relation idxrel,
 	int2vector *indkey = &idxrel->rd_index->indkey;
 	bool		hasnulls = false;

-	Assert(RelationGetReplicaIndex(rel) == RelationGetRelid(idxrel) ||
-		   RelationGetPrimaryKeyIndex(rel) == RelationGetRelid(idxrel));

You have removed this assertion, but there is a comment ("This is not
generic routine, it expects the idxrel to be replication identity of a rel
and meet all limitations associated with that.") atop this function which
either needs to be changed/removed, and we should probably also think about
whether the function needs some change after removing that restriction.

--
With Regards,
Amit Kapila.