On Wed, Aug 26, 2020 at 8:14 AM David Rowley <dgrowle...@gmail.com> wrote:

> On Wed, 26 Aug 2020 at 05:18, Andy Fan <zhihui.fan1...@gmail.com> wrote:
> >
> >
> > On Tue, Aug 25, 2020 at 11:53 PM Andres Freund <and...@anarazel.de> wrote:
> >>
> >> On 2020-08-25 20:48:37 +1200, David Rowley wrote:
> >> > Also, just in case anyone is misunderstanding Andres' argument
> >> > here: it's entirely based on the performance impact of having an
> >> > additional node.
> >>
> >> Not entirely, no.  It's also just that it doesn't make sense to have
> >> two nodes setting parameters that are then half-magically picked up
> >> by a special subsidiary node type and used as a cache key.  This is
> >> pseudo modularity, not real modularity.  It also makes it harder to
> >> display useful information in explain etc., and harder to e.g. clear
> >> the cache in cases where we know there's no further use of the
> >> current cache, at least without piercing the abstraction veil.
> >>
> >>
> >> > However, given the correct planner choice, there will never be
> >> > a gross slowdown due to having the extra node.
> >>
> >> There'll be a significant reduction in the increase in performance.
> >
> >
> > If this is a key blocking factor for this topic, I'd like to do a
> > simple hack to put the cache function into the SubPlan node and then
> > run some tests to show the real difference.  But it would be better
> > to decide up front how much difference counts as a big difference.
> > Also, for my own education, I'd like to understand where the
> > difference comes from; to my current knowledge, the main saving is
> > some function calls?
>
> If testing this, the cache hit ratio will be pretty key to the
> results. You'd notice the overhead much less with a larger cache hit
> ratio since you're not pulling the tuple from as deeply nested a node.
> I'm unsure how you'd determine what is a good cache hit ratio to
> test it with.


I wanted to test the worst case, where the cache hit ratio is 0, and
then compare the difference between implementing the cache as a
dedicated node and inside the SubPlan node (see the sketch below).
However, there is a better way to test the difference based on your
message below.
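
A hypothetical worst-case setup for that (the table and column names
are mine): the correlated subplan is probed with a distinct value for
every outer row, so a cache keyed on o.a can never hit.

create table outer_t (a int);
create table inner_t (a int primary key);
insert into outer_t select i from generate_series(1, 100000) i;
insert into inner_t select i from generate_series(1, 100000) i;
analyze outer_t;
analyze inner_t;
-- every probe uses a new key, so the cache hit ratio is 0
select (select i.a from inner_t i where i.a = o.a) from outer_t o;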


> The lower the expected cache hit ratio, the higher the cost of the
> Result Cache node will be, so the planner has less chance of choosing
> to use it.

IIRC, we add the ResultCache for SubPlan nodes unconditionally now.
The main reason is that we lack ndistinct estimation during subquery
planning.  Tom suggested converting the AlternativeSubPlan to SubPlan
in setrefs.c [1], and I also ran into a case that could be resolved if
we did such a conversion even earlier [2]; the basic idea is that we
can do the conversion as soon as we can get the actual values for the
subplan.

Something like:

/* per my PoC patch in [2]: once every relid the subplan depends on is
 * available at this level, pick the cheaper alternative subplan */
if (bms_is_subset(subplan->deps_relids, rel->relids))
{
    convert_alternativesubplans_to_subplan(rel);
}

You could see whether that would be helpful for ResultCache in this use
case.  My patch in [2] is still at a very PoC stage, so it only takes
care of subplans in rel->reltarget.


> Say you find a case with a hit ratio of 90%.  Going by [1] I found
> pulling a tuple through an additional node to cost about 12
> nanoseconds on an Intel 4712HQ CPU.  With a hit ratio of 90% we'll
> only pull 10% of tuples through the additional node, so that's about
> 1.2 nanoseconds per tuple, or 1.2 milliseconds per million tuples.  It
> might become hard to measure above the noise.  More costly inner scans
> will have the planner choose a Result Cache at lower estimated hit
> ratios, but in that case, pulling the tuple through the additional
> node during a cache miss will be less noticeable due to the more
> costly inner side of the join.
>
> Likely you could test just the overhead without going to the trouble
> of adapting the code to make SubPlan and Nested Loop do the caching
> internally.  If you just modify ExecResultCache() to have it simply
> return tuples from its subnode, then measure the performance with and
> without enable_resultcache, you should get an idea of the per-tuple
> overhead of pulling the tuple through the additional node on your CPU.
>
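
If I read that correctly, the hack would be something like the minimal
sketch below (I'm assuming the patch's ExecResultCache() follows the
usual ExecProcNode convention; the exact type and field names may
differ):

/* Pass-through version of ExecResultCache(): skip all cache lookup and
 * storage logic and just pull the next tuple from the inner node.  Any
 * timing difference with enable_resultcache on/off is then purely the
 * cost of pulling tuples through the extra node. */
static TupleTableSlot *
ExecResultCache(PlanState *pstate)
{
    ResultCacheState *node = castNode(ResultCacheState, pstate);

    return ExecProcNode(outerPlanState(node));
}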

Thanks for the hints.  I think we can test it even more easily with a
Limit node:

create table test_pull_tuples(a int);
insert into test_pull_tuples select i from generate_series(1, 100000)i;
-- test with pgbench.
select * from test_pull_tuples;                           18.850 ms
select * from test_pull_tuples limit 100000;       20.500 ms
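
(The numbers are the average latency reported by pgbench; each
statement was run along these lines, where the exact flags are my
usual habit rather than anything special:)

$ echo 'select * from test_pull_tuples;' > pull.sql
$ pgbench -n -f pull.sql -T 30 postgres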

That is (20.500 - 18.850) ms / 100,000 tuples, i.e. about 16
nanoseconds per tuple on my Intel(R) Xeon(R) CPU E5-2650.  Personally
I'd say the performance difference is negligible unless I see some
different numbers.

[1]
https://www.postgresql.org/message-id/1992952.1592785225%40sss.pgh.pa.us
[2]
https://www.postgresql.org/message-id/CAKU4AWoMRzZKk1vPstKTjS7sYeN43j8WtsAZy2pv73vm_E_6dA%40mail.gmail.com


-- 
Best Regards
Andy Fan
