On Friday 13 November 2020 at 21:58:25 +0000, Mike Crowe via Libstdc++ wrote:
> On Thursday 12 November 2020 at 23:07:47 +0000, Jonathan Wakely wrote:
> > On 29/05/20 07:17 +0100, Mike Crowe via Libstdc++ wrote:
> > > The futex system call supports waiting for an absolute time if
> > > FUTEX_WAIT_BITSET is used rather than FUTEX_WAIT. Doing so provides two
> > > benefits:
> > >
> > > 1. The call to gettimeofday is not required in order to calculate a
> > >    relative timeout.
> > >
> > > 2. If someone changes the system clock during the wait then the futex
> > >    timeout will correctly expire earlier or later. Currently that only
> > >    happens if the clock is changed prior to the call to gettimeofday.
> > >
> > > According to futex(2), support for FUTEX_CLOCK_REALTIME was added in the
> > > v2.6.28 Linux kernel and FUTEX_WAIT_BITSET was added in v2.6.25. To ensure
> > > that the code still works correctly with earlier kernel versions, an ENOSYS
> > > error from futex[1] results in the futex_clock_realtime_unavailable flag
> > > being set. This flag is used to avoid the unnecessary unsupported futex
> > > call in the future and to fall back to the previous gettimeofday and
> > > relative time implementation.
> > >
> > > glibc applied an equivalent switch in pthread_cond_timedwait to use
> > > FUTEX_CLOCK_REALTIME and FUTEX_WAIT_BITSET rather than FUTEX_WAIT for
> > > glibc-2.10 back in 2009. See
> > > glibc:cbd8aeb836c8061c23a5e00419e0fb25a34abee7
> > >
> > > The futex_clock_realtime_unavailable flag is accessed using
> > > std::memory_order_relaxed to stop it becoming a bottleneck. If the first
> > > two calls to _M_futex_wait_until happen to happen simultaneously then the
> > > only consequence is that both will try to use FUTEX_CLOCK_REALTIME, both
> > > risk discovering that it doesn't work and, if so, both set the flag.
> > >
> > > [1] This is how glibc's nptl-init.c determines whether these flags are
> > > supported.
> > >
> > > * libstdc++-v3/src/c++11/futex.cc: Add new constants for required
> > > futex flags. Add futex_clock_realtime_unavailable flag to store
> > > result of trying to use FUTEX_CLOCK_REALTIME.
> > > (__atomic_futex_unsigned_base::_M_futex_wait_until):
> > > Try to use FUTEX_WAIT_BITSET with FUTEX_CLOCK_REALTIME and only
> > > fall back to using gettimeofday and FUTEX_WAIT if that's not
> > > supported.
> >
> > Mike,
> >
> > I've been doing some performance comparisons and this patch seems to
> > make quite a big difference to code that polls a future by calling
> > fut.wait_until(t) using any t < now() as the timeout. For example,
> > fut.wait_until(chrono::system_clock::time_point{}) to wait until the
> > UNIX epoch.
> >
> > With GCC 10 (or with the if (!futex_clock_realtime_unavailable.load(...)
> > commented out) I see that polling take < 100ns. With the change, it
> > takes 3000ns or more.
> >
> > Now this is still far better than polling using fut.wait_for(0s) which
> > takes around 50000ns due to the clock_gettime call, but I'm about to
> > fix that.
> >
> > I'm not sure how important it is for wait_until(past) to be fast, but
> > the difference from 100ns to 3000ns seems significant. Do you see the
> > same kind of numbers? Is this just a property of the futex wait with
> > an absolute time?
> >
> > N.B. using wait_until(system_clock::time_point::min()) or any other
> > time before the epoch doesn't work. The futex syscall returns EINVAL
> > which we don't check for. I'm about to fix that too.
>
> I see similar behaviour. I suppose this is because the
> gettimeofday/clock_gettime system calls are in the VDSO and therefore
> usually much cheaper to call than the real system call SYS_futex.
>
> If rather than bailing out early when the relative timeout is negative, I
> call the relative SYS_futex with rt.tv_sec = rt.tv_nsec = 0 then the
> wait_until call takes about ten times longer than when using the absolute
> SYS_futex. I can't really explain that.
>
> Calling these functions with a time in the past is probably quite common if
> you calculate a single timeout for several operations in sequence. What's
> less clear is whether the performance matters that much when the return
> value indicates a timeout anyway.
>
> If gettimeofday/clock_gettime are cheap enough then I suppose we can call
> them even in the absolute timeout case (losing benefit 1 above, which
> appears to not really exist) to get the improved performance for timeouts
> in the past whilst retaining the correct behaviour if the clock is warped
> that this patch addressed (benefit 2 above.)
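
To make that last idea concrete, here is a rough sketch of checking the
clock before issuing the absolute futex wait. It is only an illustration
and not the actual futex.cc code: the helper name wait_until_realtime and
its return convention are made up for this example, and it assumes a Linux
kernel that supports FUTEX_WAIT_BITSET with FUTEX_CLOCK_REALTIME (no ENOSYS
fallback is shown).

```cpp
#include <linux/futex.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>
#include <cassert>
#include <cerrno>

// Hypothetical helper (not part of libstdc++). Returns false if the
// deadline has passed, true if the wait ended for any other reason
// (woken, *addr != expected, or interrupted by a signal).
bool wait_until_realtime(int *addr, int expected, const timespec &deadline)
{
  assert(addr != nullptr);

  // Cheap check first: clock_gettime is normally served from the vDSO,
  // so a deadline in the past never reaches the futex syscall.
  timespec now;
  clock_gettime(CLOCK_REALTIME, &now);
  if (now.tv_sec > deadline.tv_sec
      || (now.tv_sec == deadline.tv_sec && now.tv_nsec >= deadline.tv_nsec))
    return false;

  // Still pass the absolute deadline to the kernel so that a clock change
  // during the wait is honoured.
  long rc = syscall(SYS_futex, addr,
                    FUTEX_WAIT_BITSET | FUTEX_CLOCK_REALTIME,
                    expected, &deadline, nullptr, FUTEX_BITSET_MATCH_ANY);
  if (rc == -1 && errno == ETIMEDOUT)
    return false;
  return true;
}
```

Called with a deadline of {0, 0} this returns false without entering the
kernel, while a deadline in the future still goes through the absolute
wait, so the clock-warp behaviour (benefit 2) should be unaffected.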

I wrote the attached standalone program to measure the relative
performance of wait operations with a timeout in the past (or with zero
timeout in the relative case) and ran it on a variety of machines. The
results below are mean durations in nanoseconds:

|--------------------+---------+----------+-----------+----------+---------|
|                    | Kernel  | futex    | futex     | futex    | clock   |
| CPU                | version | realtime | monotonic | relative | gettime |
|--------------------+---------+----------+-----------+----------+---------|
| x86_64 E5-2690 v2  | 4.19    |     6942 |      6675 |    61175 |      85 |
| x86_64 i7-10510U   | 5.4     |    27950 |     36650 |    69969 |     433 |
| x86_64 i7-3770K    | 5.9     |    17152 |     17232 |    59827 |     322 |
| x86_64 i7-4790K    | 4.19    |    13280 |     12219 |    58225 |     413 |
| x86 Celeron G1610T | 4.9     |    18245 |     18626 |    58445 |     407 |
| Raspberry Pi 3     | 5.9     |    30765 |     30851 |    72776 |     300 |
| Raspberry Pi 2     | 5.4     |    23830 |     24104 |    91539 |    1062 |
| mips64 gcc24       | 4.19    |    23102 |     23503 |    69343 |    1236 |
| sparc32 gcc102     | 5.9     |    42657 |     39306 |    87568 |     688 |
|--------------------+---------+----------+-----------+----------+---------|

The first machine is virtual on ESXi (it appears that Xeons really are much
faster at this stuff!) The last two machines are .fsffrance.org GCC farm
machines.

The pthread_cond_timedwait durations for CLOCK_REALTIME were generally a
little bit better than the futex realtime durations.

The pthread_cond_timedwait durations for CLOCK_MONOTONIC differed greatly
by glibc version. With glibc v2.28 (from Debian 10) it completed very
quickly, but with glibc v2.31 (from Ubuntu 20.04) it took slightly longer
than the realtime version. This is presumably because glibc v2.28 would use
a relative timeout in this case and realise that there was no point in
calling futex - I changed that in
glibc:99d01ffcc386d1bfb681fb0684fcf6a6a996beb3.

So, it's clear to me that my changes have caused an absolute wait on a time
in the past to take longer in both libstdc++ and glibc. The question now is
whether that matters to anyone. I'll take my findings to the glibc list and
see what they think.

Thanks.

Mike.

#include <linux/futex.h>
#include <sys/syscall.h>
#include <sys/time.h>
#include <pthread.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#include <cerrno>
#include <chrono>
#include <cinttypes>
#include <cstdio>
#include <cstring>
#include <thread>

const unsigned futex_wait_op = 0;
const unsigned futex_wait_bitset_op = 9;
const unsigned futex_clock_monotonic_flag = 0;
const unsigned futex_clock_realtime_flag = 256;
const unsigned futex_bitset_match_any = ~0;
const unsigned futex_wake_op = 1;

// Thin wrapper around the raw futex system call.
static int futex(int *uaddr, int futex_op, int val,
                 const struct timespec *timeout, int *uaddr2, int val3)
{
  return syscall(SYS_futex, uaddr, futex_op, val, timeout, uaddr2, val3);
}

// Absolute wait against CLOCK_REALTIME with a deadline at the epoch.
std::chrono::nanoseconds test_futex_realtime()
{
  int word = 1;
  struct timespec timeout{0,0};
  const auto start = std::chrono::steady_clock::now();
  int rc = futex(&word, futex_wait_bitset_op | futex_clock_realtime_flag, 1,
                 &timeout, nullptr, futex_bitset_match_any);
  if (rc != -1) {
    fprintf(stderr, "Unexpected return value from futex: %d\n", rc);
    exit(1);
  }
  if (errno != ETIMEDOUT) {
    fprintf(stderr, "Unexpected error from futex: %m\n");
    exit(1);
  }
  const auto duration = std::chrono::steady_clock::now() - start;
  return duration;
}

// Absolute wait against CLOCK_MONOTONIC with a deadline of zero.
std::chrono::nanoseconds test_futex_monotonic()
{
  int word = 1;
  struct timespec timeout{0,0};
  const auto start = std::chrono::steady_clock::now();
  int rc = futex(&word, futex_wait_bitset_op | futex_clock_monotonic_flag, 1,
                 &timeout, nullptr, futex_bitset_match_any);
  if (rc != -1) {
    fprintf(stderr, "Unexpected return value from futex: %d\n", rc);
    exit(1);
  }
  if (errno != ETIMEDOUT) {
    fprintf(stderr, "Unexpected error from futex: %m\n");
    exit(1);
  }
  const auto duration = std::chrono::steady_clock::now() - start;
  return duration;
}

// Relative wait with a zero timeout.
std::chrono::nanoseconds test_futex_relative()
{
  int word = 1;
  struct timespec timeout{0,0};
  const auto start = std::chrono::steady_clock::now();
  int rc = futex(&word, futex_wait_op, 1, &timeout, nullptr, 0);
  if (rc != -1) {
    fprintf(stderr, "Unexpected return value from futex: %d\n", rc);
    exit(1);
  }
  if (errno != ETIMEDOUT) {
    fprintf(stderr, "Unexpected error from futex: %m\n");
    exit(1);
  }
  const auto duration = std::chrono::steady_clock::now() - start;
  return duration;
}

// Baseline: cost of reading the realtime clock.
std::chrono::nanoseconds test_clock_gettime()
{
  const auto start = std::chrono::steady_clock::now();
  struct timespec timeout;
  clock_gettime(CLOCK_REALTIME, &timeout);
  const auto duration = std::chrono::steady_clock::now() - start;
  return duration;
}

pthread_mutex_t mut = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t cond_realtime = PTHREAD_COND_INITIALIZER;

std::chrono::nanoseconds test_cond_realtime()
{
  int rc = pthread_mutex_lock(&mut);
  if (rc != 0) {
    // pthread functions return the error code rather than setting errno.
    fprintf(stderr, "pthread_mutex_lock: %s\n", strerror(rc));
    exit(1);
  }
  const auto start = std::chrono::steady_clock::now();
  struct timespec timeout{0,0};
  rc = pthread_cond_timedwait(&cond_realtime, &mut, &timeout);
  if (rc != ETIMEDOUT) {
    fprintf(stderr, "Unexpected return value from cond wait: %d\n", rc);
    exit(1);
  }
  const auto duration = std::chrono::steady_clock::now() - start;
  pthread_mutex_unlock(&mut);
  return duration;
}

// Initialised in main() with a CLOCK_MONOTONIC condattr.
pthread_cond_t cond_monotonic;

std::chrono::nanoseconds test_cond_monotonic()
{
  int rc = pthread_mutex_lock(&mut);
  if (rc != 0) {
    fprintf(stderr, "pthread_mutex_lock: %s\n", strerror(rc));
    exit(1);
  }
  const auto start = std::chrono::steady_clock::now();
  struct timespec timeout{0,0};
  rc = pthread_cond_timedwait(&cond_monotonic, &mut, &timeout);
  if (rc != ETIMEDOUT) {
    fprintf(stderr, "Unexpected return value from cond wait: %d\n", rc);
    exit(1);
  }
  const auto duration = std::chrono::steady_clock::now() - start;
  pthread_mutex_unlock(&mut);
  return duration;
}

void show_mean(const char *name, std::chrono::nanoseconds (*f)())
{
  // warm up
  for(int i = 0; i < 10; ++i)
    f();

  // calculate mean
  const int count = 5000;
  std::chrono::nanoseconds total{0};
  for(int i = 0; i < count; ++i) {
    total += f();
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
  }
  printf("%s mean duration %" PRId64 "ns\n", name, (total/count).count());
}

int main()
{
  pthread_condattr_t condattr_monotonic;
  pthread_condattr_init(&condattr_monotonic);
  pthread_condattr_setclock(&condattr_monotonic, CLOCK_MONOTONIC);
  pthread_cond_init(&cond_monotonic, &condattr_monotonic);

  show_mean("futex_realtime", test_futex_realtime);
  show_mean("futex_monotonic", test_futex_monotonic);
  show_mean("futex_relative", test_futex_relative);
  show_mean("clock_gettime", test_clock_gettime);
  show_mean("cond_realtime", test_cond_realtime);
  show_mean("cond_monotonic", test_cond_monotonic);
}