On Wed, Aug 26, 2020 at 8:14 AM David Rowley <dgrowle...@gmail.com> wrote:
> On Wed, 26 Aug 2020 at 05:18, Andy Fan <zhihui.fan1...@gmail.com> wrote:
> > On Tue, Aug 25, 2020 at 11:53 PM Andres Freund <and...@anarazel.de> wrote:
> >> On 2020-08-25 20:48:37 +1200, David Rowley wrote:
> >> > Also, just in case anyone is misunderstanding this Andres' argument.
> >> > It's entirely based on the performance impact of having an additional
> >> > node.
> >>
> >> Not entirely, no. It's also just that it doesn't make sense to have two
> >> nodes setting parameters that are then half magically picked up by a
> >> special subsidiary node type and used as a cache key. This is pseudo
> >> modularity, not real modularity. And it makes it harder to display
> >> useful information in explain etc. And makes it harder to e.g. clear
> >> the cache in cases where we know that there's no further use of the
> >> current cache. At least without piercing the abstraction veil.
> >>
> >> > However, given the correct planner choice, there will never be
> >> > a gross slowdown due to having the extra node.
> >>
> >> There'll be a significant reduction in the increase in performance.
> >
> > If this is a key blocking factor for this topic, I'd like to do a simple
> > hack to put the cache function into the subplan node, then do some tests
> > to show the real difference. But it would be better to decide first how
> > much difference counts as a big difference. And for education purposes,
> > I'd like to understand where these differences come from. To my current
> > knowledge, my basic idea is that it saves some function calls?
>
> If testing this, the cache hit ratio will be pretty key to the
> results. You'd notice the overhead much less with a larger cache hit
> ratio since you're not pulling the tuple from as deeply a nested node.
> I'm unsure how you'd determine what is a good cache hit ratio to
> test it with.

I wanted to test the worst case, where the cache hit ratio is 0,
and then compare the difference between putting the cache in a dedicated
node and in a SubPlan node. However, we have a better way to test the
difference, based on your message below.

> The lower the expected cache hit ratio, the higher
> the cost of the Result Cache node will be, so the planner has less
> chance of choosing to use it.

IIRC, we add the ResultCache for subplan nodes unconditionally now. The
main reason is that we lack ndistinct estimation during subquery planning.
Tom suggested converting the AlternativeSubPlan to a SubPlan in setrefs.c
[1], and I also ran into a case that can be resolved if we do such a
conversion even earlier [2]. The basic idea is that we can do the
conversion once we can get the actual values for the subplan, something
like:

    if (bms_is_subset(subplan->deps_relids, rel->relids))
    {
        convert_alternativesubplans_to_subplan(rel);
    }

You can see if that would be helpful for ResultCache in this use case. My
patch in [2] is still at a very PoC stage, so it only takes care of
subplans in rel->reltarget.

> Say you find a case with the hit ratio of 90%. Going by [1] I found
> pulling a tuple through an additional node to cost about 12
> nanoseconds on an intel 4712HQ CPU. With a hit ratio of 90% we'll
> only pull 10% of tuples through the additional node, so that's about
> 1.2 nanoseconds per tuple, or 1.2 milliseconds per million tuples. It
> might become hard to measure above the noise. More costly inner scans
> will have the planner choose to Result Cache with lower estimated hit
> ratios, but in that case, pulling the tuple through the additional
> node during a cache miss will be less noticeable due to the more
> costly inner side of the join.
>
> Likely you could test the overhead only in theory without going to the
> trouble of adapting the code to make SubPlan and Nested Loop do the
> caching internally.
> If you just modify ExecResultCache() to have it
> simply return its subnode, then measure the performance with and
> without enable_resultcache, you should get an idea of the per-tuple
> overhead of pulling the tuple through the additional node on your CPU.

Thanks for the hints. I think we can test it even more easily with a
Limit node:

    create table test_pull_tuples(a int);
    insert into test_pull_tuples select i from generate_series(1, 100000) i;

    -- test with pgbench.
    select * from test_pull_tuples;                 18.850 ms
    select * from test_pull_tuples limit 100000;    20.500 ms

Basically it is about 16 nanoseconds per tuple on my Intel(R) Xeon(R) CPU
E5-2650. Personally I'd say the performance difference is negligible
unless I see some different numbers.

[1] https://www.postgresql.org/message-id/1992952.1592785225%40sss.pgh.pa.us
[2] https://www.postgresql.org/message-id/CAKU4AWoMRzZKk1vPstKTjS7sYeN43j8WtsAZy2pv73vm_E_6dA%40mail.gmail.com

--
Best Regards
Andy Fan
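For readers who want to check the arithmetic, both per-tuple figures in
this thread can be reproduced in a few lines. This is an illustrative
sketch only: the 12 ns figure, the 90% hit ratio, and the two pgbench
latencies are the numbers quoted above, not new measurements.

```python
# Sanity-check the two per-tuple overhead figures quoted in this thread.

# David's estimate: ~12 ns per tuple pulled through an extra executor node;
# with a 90% cache hit ratio, only 10% of tuples pay that cost.
ns_per_extra_node = 12.0
hit_ratio = 0.90
extra_ns_per_tuple = ns_per_extra_node * (1 - hit_ratio)
# Over a million tuples: total extra nanoseconds, converted to milliseconds.
extra_ms_per_million = extra_ns_per_tuple * 1_000_000 / 1e6
print(f"{extra_ns_per_tuple:.1f} ns/tuple, "
      f"{extra_ms_per_million:.1f} ms per million tuples")  # 1.2 / 1.2

# Andy's Limit-node experiment: the latency difference between the plain
# scan and the Limit query, spread over the 100,000 tuples returned.
plain_ms, limit_ms, rows = 18.850, 20.500, 100_000
overhead_ns = (limit_ms - plain_ms) * 1e6 / rows
print(f"~{overhead_ns:.0f} ns per tuple")  # ~16 ns, matching the email
```

The second calculation comes out to about 16.5 ns, consistent with the
"16 nanoseconds per tuple" reported above for the E5-2650.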