On Sat, May 28, 2022 at 09:35, Tomas Vondra <tomas.von...@enterprisedb.com> wrote:
> On 5/28/22 02:36, Ranier Vilela wrote:
> > On Fri, May 27, 2022 at 18:22, Andres Freund <and...@anarazel.de> wrote:
> >
> > > Hi,
> > >
> > > On 2022-05-27 10:35:08 -0300, Ranier Vilela wrote:
> > > > On Thu, May 26, 2022 at 22:30, Tomas Vondra
> > > > <tomas.von...@enterprisedb.com> wrote:
> > > >
> > > > > On 5/27/22 02:11, Ranier Vilela wrote:
> > > > > >
> > > > > > ...
> > > > > >
> > > > > > Here the results with -T 60:
> > > > >
> > > > > Might be a good idea to share your analysis / interpretation of the
> > > > > results, not just the raw data. After all, the change is being
> > > > > proposed by you, so do you think this shows the change is beneficial?
> > > >
> > > > I think so, but the expectation has diminished.
> > > > I expected that the more connections, the better the performance.
> > > > And for both patch and head, this doesn't happen in tests.
> > > > Performance degrades with a greater number of connections.
> > >
> > > Your system has four CPUs. Once they're all busy, adding more
> > > connections won't improve performance. It'll just add more and more
> > > context switching, cache misses, and make the OS scheduler do more work.
> >
> > conns    tps head
> > 10       82365.634750
> > 50       74593.714180
> > 80       69219.756038
> > 90       67419.574189
> > 100      66613.771701
> >
> > Yes it is quite disappointing that with 100 connections, tps loses to
> > 10 connections.
>
> IMO that's entirely expected on a system with only 4 cores. Increasing
> the number of connections inevitably means more overhead (you have to
> track/manage more stuff). And at some point the backends start competing
> for L2/L3 caches, context switches are not free either, etc. So once you
> cross ~2-3x the number of cores, you should expect this.
>
> This behavior is natural/inherent, it's unlikely to go away, and it's
> one of the reasons why we recommend not to use too many connections.
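As an aside, the falloff Andres and Tomas describe is easy to make concrete by expressing the quoted head numbers relative to the 10-connection baseline. A throwaway sketch (just re-deriving percentages from the table above, not part of the patch):

```python
# tps figures for head, as quoted in the thread above
tps_head = {
    10: 82365.634750,
    50: 74593.714180,
    80: 69219.756038,
    90: 67419.574189,
    100: 66613.771701,
}

baseline = tps_head[10]
for conns, tps in sorted(tps_head.items()):
    # percentage change relative to the 10-connection run
    change = 100.0 * (tps - baseline) / baseline
    print(f"{conns:>4} conns: {tps:12.2f} tps ({change:+.1f}% vs 10 conns)")
```

On a 4-core box this shows roughly a 19% drop from 10 to 100 connections, which is the over-subscription effect being discussed, not anything the patch could change.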
> If you try to maximize throughput, just don't do that. Or just use a
> machine with more cores.
>
> > > > GetSnapshotData() isn't a bottleneck?
> > >
> > > I'd be surprised if it showed up in a profile on your machine with
> > > that workload in any sort of meaningful way. The snapshot reuse logic
> > > will always work - because there are no writes - and thus the only
> > > work that needs to be done is to acquire the ProcArrayLock briefly.
> > > And because there is only a small number of cores, contention on the
> > > cacheline for that isn't a problem.
> >
> > Thanks for sharing this.
> >
> > > > > These results look much saner, but IMHO it also does not show any
> > > > > clear benefit of the patch. Or are you still claiming there is a
> > > > > benefit?
> > > >
> > > > We agree that they are micro-optimizations. However, I think they
> > > > should be considered micro-optimizations in inner loops, because
> > > > everything in procarray.c is a hotpath.
> > >
> > > As explained earlier, I don't agree that they optimize anything -
> > > you're making some of the scalability behaviour *worse*, if it's
> > > changed at all.
> > >
> > > > The first objective, I believe, was achieved, with no performance
> > > > regression.
> > > > I agree, the gains are small, by the tests done.
> > >
> > > There are no gains.
> >
> > IMHO, I must disagree.
>
> You don't have to, really. What you should do is show results
> demonstrating the claimed gains, and so far you have not done that.
>
> I don't want to be rude, but so far you've shown results from a
> benchmark testing fork(), due to only running 10 transactions per
> client, and then results from a single run for each client count (which
> doesn't really show any gains either, and is too noisy).
>
> As mentioned, GetSnapshotData() is not even in the perf profile, so why
> would the patch even make a difference?
> You've also claimed it helps generate better code on older compilers,
> but you've never supported that with any evidence.
>
> Maybe there is an improvement - show us. Do a benchmark with more runs,
> to average out the noise. Calculate VAR/STDEV to show how variable the
> results are. Use that to compare results and decide if there is an
> improvement. Also, keep in mind binary layout matters [1].

I redid the benchmark with a better machine:
Intel i7-10510U
RAM 8GB
SSD 512GB
Linux Ubuntu 64 bits

All files are attached, including the raw data of the results.

I did the calculations as requested.
A quick average over the 10 benchmark runs came out about 10,000 tps
higher. Not bad for a simple patch made entirely of micro-optimizations.

Results attached.

regards,
Ranier Vilela
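P.S. For reference, the mean/STDEV calculation Tomas asked for can be sketched like this. The tps values below are placeholders, sized only to illustrate the ~10,000 tps average difference claimed above; the real per-run numbers are in the attached raw data:

```python
import statistics

# Placeholder per-run pgbench tps results (NOT the real measurements --
# see the attached raw data for those).
runs_head = [82000.0, 81500.0, 82300.0, 81900.0, 82100.0]
runs_patch = [92100.0, 91800.0, 92400.0, 91950.0, 92250.0]

for name, runs in (("head", runs_head), ("patch", runs_patch)):
    mean = statistics.mean(runs)
    stdev = statistics.stdev(runs)  # sample standard deviation
    print(f"{name}: mean={mean:.1f} tps, stdev={stdev:.1f} tps")

# The difference is only meaningful if it clearly exceeds the
# run-to-run noise (i.e. several standard deviations):
diff = statistics.mean(runs_patch) - statistics.mean(runs_head)
print(f"difference: {diff:+.1f} tps")
```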
procarray_bench.tar.xz
Description: application/xz