On Wed, Jul 20, 2022 at 8:19 PM Önder Kalacı <onderkal...@gmail.com> wrote:
>
>> > - Add a new class of invalidation callbacks: Today, if we do ALTER TABLE
>> > or CREATE INDEX on a table, the CacheRegisterRelcacheCallback helps us to
>> > re-create the cache entries. In this case, as far as I can see, we need a
>> > callback that is called when the table is "ANALYZE"d, because that is when
>> > the statistics change. That is the time when picking a new index makes
>> > sense. However, that seems like adding another dimension to this patch,
>> > which I can try, but I also see that committing becomes even harder.
>>
>> This idea sounds worth investigating. I see that this will require
>> more work but OTOH, we can't allow the existing system to regress,
>> especially because depending on the workload it might regress badly.
>
> Just to note, in case that is not clear: this patch avoids (or at least
> aims to avoid, assuming no bugs) changing the behavior of existing systems
> with a PRIMARY KEY or UNIQUE index. In those cases, we still use the
> relevant indexes.
>
>> We can create a patch for this atop the base patch for easier review/test
>> but I feel we need some way to address this point.
>
> Another idea could be to re-calculate the index, say, after N
> updates/deletes on the table. We may consider using a subscription
> parameter for getting N -- with a good default -- or even hard-code it
> into the code. I think the cost of re-calculating should really be pretty
> small compared to the other things happening during logical replication.
> So, a sane default might work?
One difficulty in deciding the value of N for the user, or choosing a
default, would be that we probably need to consider local DML operations on
the table as well.

> If you think the above doesn't work, I can try to work on a separate patch
> which adds something like an "analyze invalidation callback".

I suggest we give this a try, and if it turns out to be problematic or
complex then we can think of using some heuristic as you are suggesting
above.

>> Now, your point related to planner code in the patch bothers me as
>> well but I haven't studied the patch in detail to provide any
>> alternatives at this stage. Do you have any other ideas to make it
>> simpler or solve this problem in some other way?
>
> One idea I tried earlier was to go over the existing indexes on the
> table, then get the IndexInfo via BuildIndexInfo(), and then try to find a
> good heuristic to pick an index. In the end, I felt that it does a
> sub-optimal job, requires a similar amount of code to the current patch,
> and still uses similar infrastructure.
>
> My conclusion for that route was that I should either use a very simple
> heuristic (like picking the index with the most columns) and accept a
> suboptimal index pick,

Not only that, but say all the indexes have the same number of columns;
then we probably need to either pick the first such index or use some other
heuristic.

> OR use a complex heuristic with a reasonable index pick. And the latter
> approach converged to the planner code in the patch. Do you think the
> former approach is acceptable?

In this regard, I was thinking about the cases in which a sequential scan
can be better than an index scan (considering one is available). I think if
a certain column has a lot of duplicates (for example, a column holding a
boolean value) then doing a sequential scan is probably better. Considering
this, even though your other approach sounds simpler, it could lead to
unpredictable results.
So, I think the latter approach is preferable.

BTW, do we want to consider partial indexes for the scan in this context? I
mean, a partial index may not have data for all rows, so how would it be
usable?

Few comments:
===============
1.
 static List *
+CreateReplicaIdentityFullPaths(Relation localrel)
 {
...
+	/*
+	 * Rather than doing all the pushups that would be needed to use
+	 * set_baserel_size_estimates, just do a quick hack for rows and width.
+	 */
+	rel->rows = rel->tuples;

Is it a good idea to set rows without any selectivity estimation? Won't
this always set rows to the entire number of rows in the relation? Also, if
we don't want to use set_baserel_size_estimates(), how will we compute
baserestrictcost, which is later used in the costing of paths (for example,
the costing of the seqscan path (cost_seqscan) uses it)?

In general, I think it will be better to consider calling some top-level
planner functions even for paths. Can we consider using make_one_rel()
instead of building individual paths? On similar lines, in the function
PickCheapestIndexPathIfExists(), can we use set_cheapest()?

2.
@@ -57,9 +60,6 @@ build_replindex_scan_key(ScanKey skey, Relation rel, Relation idxrel,
 	int2vector *indkey = &idxrel->rd_index->indkey;
 	bool		hasnulls = false;

-	Assert(RelationGetReplicaIndex(rel) == RelationGetRelid(idxrel) ||
-		   RelationGetPrimaryKeyIndex(rel) == RelationGetRelid(idxrel));

You have removed this assertion, but there is a comment ("This is not
generic routine, it expects the idxrel to be replication identity of a rel
and meet all limitations associated with that.") atop this function which
either needs to be changed/removed, and we should probably also think about
whether the function needs some change after removing that restriction.

--
With Regards,
Amit Kapila.