On 2019-02-19 20:02:32 +0100, Tomas Vondra wrote:
> Let's do a short example. Assume the default vacuum costing parameters
>
>   vacuum_cost_limit = 200
>   vacuum_cost_delay = 20ms
>   cost_page_dirty = 20
>
> and for simplicity we only do writes. So vacuum can do ~8MB/s of writes.
>
> Now, let's also throttle based on WAL - once in a while, after producing
> some amount of WAL we sleep for a while. Again, for simplicity let's
> assume the sleeps perfectly interleave and are also 20ms. So we have
> something like:
>   sleep(20ms); -- vacuum
>   sleep(20ms); -- WAL
>   sleep(20ms); -- vacuum
>   sleep(20ms); -- WAL
>   sleep(20ms); -- vacuum
>   sleep(20ms); -- WAL
>   sleep(20ms); -- vacuum
>   sleep(20ms); -- WAL
>
> Suddenly, we only reach 4MB/s of writes from vacuum. But we also reach
> only 1/2 the WAL throughput, because it's affected exactly the same way
> by the sleeps from vacuum throttling.
>
> We've not reached either of the limits. How exactly is this "lower limit
> takes effect"?

Because, as I said upthread, that's not how I think a sane implementation
of WAL throttling would work. I think the whole cost budgeting approach is
BAD, and it'd be a serious mistake to copy it for a WAL rate limit (it
disregards the time actually taken to execute the IO, CPU costs etc, and
in this case the cost of other bandwidth limitations).

What I'm saying is that we ought to instead specify a WAL rate in
bytes/sec and *only* sleep once we've exceeded it for a time period (with
some optimizations, so we don't call gettimeofday() after every
XLogInsert(), but instead compute after how many bytes we need to
re-determine the time, to see whether we're still in the same 'granule').

Now, a non-toy implementation would probably want to have a sliding window
to avoid being overly bursty, and to reduce the number of gettimeofday()
calls as mentioned above, but for explanation's sake basically imagine
that the "main loop" of a bulk xlog emitting command would invoke a helper
doing a computation like the following pseudocode:

current_time = gettimeofday();

if (same_second(current_time, last_time))
{
    wal_written_in_second += new_wal_written;

    if (wal_written_in_second >= wal_write_limit_per_second)
    {
        double too_much = (wal_written_in_second - wal_write_limit_per_second);

        sleep_fractional_seconds(too_much / wal_written_in_second);

        last_time = current_time;
    }
}
else
{
    /* new second, start counting from scratch */
    wal_written_in_second = new_wal_written;
    last_time = current_time;
}

which'd mean that, in contrast to your example, we'd not continually sleep
for WAL; we'd only do so if we actually exceeded (or, in a smarter
implementation, are projected to exceed) the specified WAL write rate. As
the 20ms sleeps from vacuum effectively reduce the WAL write rate, we'd
correspondingly sleep less.

And my main point is that even if you implement a proper bytes/sec limit
ONLY for WAL, the behaviour of VACUUM rate limiting doesn't get
meaningfully more confusing than right now.

Greetings,

Andres Freund
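For concreteness, here is a minimal standalone C sketch of the bytes/sec
approach described above. Every name in it (wal_rate_limit_bytes_per_sec,
report_wal_written(), RECHECK_EVERY_BYTES, and so on) is hypothetical and
made up for illustration; this is not existing PostgreSQL code, just one
way the idea could be written down. It only sleeps once the budget for the
current one-second granule is exhausted, and it only calls gettimeofday()
again after a chunk of bytes has been written:

#include <stdint.h>
#include <stdio.h>
#include <sys/time.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical knob: allowed WAL write rate, in bytes per second. */
static uint64_t wal_rate_limit_bytes_per_sec = 8 * 1024 * 1024;

/* Bytes accounted against the current one-second granule. */
static uint64_t wal_written_in_second = 0;

/* Start (whole seconds from gettimeofday()) of the current granule. */
static time_t granule_start = 0;

/*
 * Only look at the clock again after this many bytes, so we don't pay for
 * a gettimeofday() on every small WAL record.
 */
#define RECHECK_EVERY_BYTES (256 * 1024)
static uint64_t bytes_since_clock_check = 0;

/*
 * Report that 'new_wal_written' bytes of WAL were just emitted.  Sleeps
 * only if the configured rate for the current one-second granule has been
 * exceeded; otherwise returns immediately.
 */
static void
report_wal_written(uint64_t new_wal_written)
{
    struct timeval now;

    wal_written_in_second += new_wal_written;
    bytes_since_clock_check += new_wal_written;

    /* Cheap path: clearly under budget, don't even look at the clock. */
    if (bytes_since_clock_check < RECHECK_EVERY_BYTES &&
        wal_written_in_second < wal_rate_limit_bytes_per_sec)
        return;

    bytes_since_clock_check = 0;
    gettimeofday(&now, NULL);

    if (now.tv_sec != granule_start)
    {
        /* New granule: start a fresh budget. */
        granule_start = now.tv_sec;
        wal_written_in_second = new_wal_written;
        return;
    }

    if (wal_written_in_second >= wal_rate_limit_bytes_per_sec)
    {
        /*
         * Over budget within this granule.  Sleep long enough that the
         * excess bytes are "paid back" at the configured rate.
         */
        uint64_t excess = wal_written_in_second - wal_rate_limit_bytes_per_sec;
        double sleep_sec = (double) excess / (double) wal_rate_limit_bytes_per_sec;

        usleep((useconds_t) (sleep_sec * 1000000.0));

        /* The sleep moved us into a later granule; reset the budget. */
        gettimeofday(&now, NULL);
        granule_start = now.tv_sec;
        wal_written_in_second = 0;
    }
}

int
main(void)
{
    /* Pretend a bulk command emits 64kB of WAL per iteration. */
    for (int i = 0; i < 200; i++)
        report_wal_written(64 * 1024);

    printf("done\n");
    return 0;
}

Note that the sleep length in this sketch divides the excess bytes by the
configured rate, so the writer sleeps for exactly as long as the excess
would have taken at the allowed rate, and that a less toy-like version
would presumably replace the single one-second granule with a sliding
window of sub-second buckets to avoid burstiness at granule boundaries.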