Hi, On 2019-06-11 08:36:55 +0200, Fabien COELHO wrote: > > I noticed that pgbench's -R influences not just the computation of lag, > > but also of latency. That doesn't look right to me, but maybe I'm just > > missing something? > > > > It's quite easy to demonstrate when running pgbench with/without > > progress report at a transaction rate that's around the limit of what > > the server can do: > > > > andres@alap4:~/src/postgresql$ pgbench -n -M prepared -c 1 -j 1 -T 100000 > > -P1 -r -S pgbench_10 > > progress: 1.0 s, 37416.3 tps, lat 0.027 ms stddev 0.013 > > progress: 2.0 s, 37345.1 tps, lat 0.027 ms stddev 0.011 > > progress: 3.0 s, 38787.8 tps, lat 0.026 ms stddev 0.009 > > ... > > > > andres@alap4:~/src/postgresql$ pgbench -n -M prepared -c 1 -j 1 -T 100000 > > -P1 -r -S -R 37000 pgbench_10 > > progress: 1.0 s, 32792.8 tps, lat 81.795 ms stddev 35.552, lag 81.765 ms > > progress: 2.0 s, 37770.6 tps, lat 113.194 ms stddev 4.651, lag 113.168 ms > > progress: 3.0 s, 37006.3 tps, lat 113.905 ms stddev 5.007, lag 113.878 ms > > [...] > > > Fabien, is this a bug, or am I misunderstanding something? > > This behavior under -R is fully voluntary, and the result above just show > that the database cannot really keep up with the load, which is simply the > case, so for me it is okay to show bad figures.
I mean, you just turned one named value, into a different one, without renaming it. And the new meaning under -R, is basically the same as one that's already there (lag). Also note that it also can actually keep up in the above example. > The idea under throttling is to model a client which would want the result > of a query at a certain point in time, say a query for a web page which is > being generated, which is the scheduled time. It is the when the client > knows it wants an answer. If it is not processed immediately, that is bad > for its client perceived latency. > Whether this is due to lag (i.e. the server is loaded and cannot start to > process the answer) or because the server is slow to answer is irrelevant, > the client is waiting, the web page is not generated, the system is slow. So > latency under -R is really "client latency", not only query latency, as it > is documented. What does that have to do with incorporating the same data into both lag and latency? I just fail to see what the point is, except to make it unnecessarily harder to compare postgres' behaviour under both a throttled and push-it-to-the-breaking point loads. How long individual transactions take, and how much variance there is in that, is something *crucial* to optimize for. *Especially* when the machine/load is provisioned in a way to not overload the machine. How is e.g. progress: 1.6 s, 0.0 tps, lat 0.000 ms stddev 0.000, lag 0.000 ms progress: 2.0 s, 103546.5 tps, lat 1584.161 ms stddev 35.589, lag 1582.043 ms progress: 3.0 s, 108535.2 tps, lat 1347.619 ms stddev 101.782, lag 1346.170 ms progress: 4.0 s, 108528.8 tps, lat 996.603 ms stddev 106.052, lag 995.159 ms progress: 5.0 s, 109468.8 tps, lat 633.464 ms stddev 108.483, lag 632.030 ms progress: 6.0 s, 110606.7 tps, lat 252.923 ms stddev 110.391, lag 251.505 ms progress: 7.0 s, 84253.3 tps, lat 6.829 ms stddev 15.067, lag 6.423 ms progress: 8.0 s, 80470.7 tps, lat 0.142 ms stddev 0.079, lag 0.017 ms progress: 9.0 s, 80104.2 tps, lat 0.142 ms stddev 0.081, lag 0.017 ms progress: 10.0 s, 80277.0 tps, lat 0.152 ms stddev 0.150, lag 0.017 ms the lat column adds basically nothing over the lag column here. more useful than: progress: 1.3 s, 0.0 tps, lat 0.000 ms stddev 0.000, lag 0.000 ms progress: 2.0 s, 116315.6 tps, lat 1.425 ms stddev 1.440, lag 1087.076 ms progress: 3.0 s, 113526.2 tps, lat 1.390 ms stddev 0.408, lag 709.908 ms progress: 4.0 s, 111816.4 tps, lat 1.407 ms stddev 0.399, lag 302.866 ms progress: 5.0 s, 88061.9 tps, lat 0.543 ms stddev 0.652, lag 16.526 ms progress: 6.0 s, 80045.4 tps, lat 0.128 ms stddev 0.079, lag 0.017 ms progress: 7.0 s, 79636.3 tps, lat 0.124 ms stddev 0.073, lag 0.016 ms progress: 8.0 s, 80535.3 tps, lat 0.125 ms stddev 0.073, lag 0.016 ms where I can see that the transactions are now actually fast enough. Obviously this is a toy example, but this really make -R close to useless to me. I often want to switch from a unthrottled to a 90% load, and improve the outlier beheaviour - but that outlier behaviour is hidden due to this redefinition of lat (as the issue is now reported over a much longer period of time, as it includes lag). I think we should just restore lat to a sane behaviour under -R, and if you want to have lat + lag as a separate column in -R mode, then let's do that. Greetings, Andres Freund