Some years ago I worked on code for a high-throughput network server (not BIND). We found that multi-socket NUMA machines could have similar contention problems, and it was quite important to make sure that threads needing access to the same memory areas weren't split across sockets. Luckily, the various services being run were sufficiently separate that we could assign the service processes to different sockets and avoid a lot of contention.
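On Linux, that kind of assignment can be done with numactl; a minimal sketch, assuming the numactl package is installed (the node number and service path are illustrative):

    # Confine a service's CPU scheduling and memory allocation
    # to NUMA node 0
    numactl --cpunodebind=0 --membind=0 /usr/sbin/myservice

    # Afterwards, check which node the process's memory actually
    # landed on
    numastat -p $(pidof myservice)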
With BIND, it's basically all one service, so this is not directly possible. It might be possible, however, to run two (or more) *separate* instances of BIND and do some strictly internal routing of the IP traffic to those separate instances, or even to have separate NICs feeding the separate processes. In other words, have several BIND servers in one chassis, each with its own NUMA memory area. (A rough sketch of this appears at the end of this message.)

On Fri, 2 Jun 2017 07:12:09 +0000
"Browne, Stuart" <stuart.bro...@neustar.biz> wrote:

> Just some interesting investigation results. One of the URLs Matthew
> Ian Eis linked to talked about using a tool called 'perf'. For the
> hell of it, I gave it a shot.
>
> Sure enough, it tells some very interesting things.
>
> When BIND was restricted to using a single NUMA node, the biggest
> call (to _raw_spin_lock) showed 7.05% overhead.
>
> When BIND was allowed to use both NUMA nodes, the same call showed
> 49.74% overhead; an astonishing difference.
>
> As it was running unrestricted, memory from both nodes was in use:
>
> [root@kr20s2601 ~]# numastat -p 22441
>
> Per-node process memory usage (in MBs) for PID 22441 (named)
>                           Node 0          Node 1           Total
>                  --------------- --------------- ---------------
> Huge                        0.00            0.00            0.00
> Heap                        0.45            0.12            0.57
> Stack                       0.71            0.64            1.35
> Private                     5.28         9415.30         9420.57
> ---------------- --------------- --------------- ---------------
> Total                       6.43         9416.07         9422.50
>
> Given the numbers here, you wouldn't think it should make much of a
> difference.
>
> Sadly, I didn't get which CPU the UDP listener was attached to.
>
> Anyway, what I've changed so far:
>
> vm.swappiness = 0
> vm.dirty_ratio = 1
> vm.dirty_background_ratio = 1
> kernel.sched_min_granularity_ns = 10000000
> kernel.sched_migration_cost_ns = 5000000
>
> Query rate thus far reached (on 24 cores, NUMA node restricted): 426k qps
> Query rate thus far reached (on 48 cores, NUMA nodes unrestricted): 321k qps
>
> Stuart
>
> 'perf' data collected during a 3-minute test run:
>
> [root@kr20s2601 ~]# ls -al perf.data*
> -rw-------. 1 root root  717350012 Jun  2 08:36 perf.data.24
> -rw-------. 1 root root 1366620296 Jun  2 08:53 perf.data.48
>
> 'perf' top 5 (24 cores, NUMA restricted):
>
> Overhead  Command  Shared Object       Symbol
>    7.05%  named    [kernel.kallsyms]   [k] _raw_spin_lock
>    6.96%  named    libpthread-2.17.so  [.] pthread_mutex_lock
>    3.84%  named    libc-2.17.so        [.] vfprintf
>    2.36%  named    libdns.so.165.0.7   [.] dns_name_fullcompare
>    2.02%  named    libisc.so.160.1.2   [.] isc_log_wouldlog
>
> 'perf' top 5 (48 cores):
>
> Overhead  Command  Shared Object       Symbol
>   49.74%  named    [kernel.kallsyms]   [k] _raw_spin_lock
>    4.52%  named    libpthread-2.17.so  [.] pthread_mutex_lock
>    3.09%  named    libisc.so.160.1.2   [.] isc_log_wouldlog
>    1.84%  named    [kernel.kallsyms]   [k] _raw_spin_lock_bh
>    1.56%  named    libc-2.17.so        [.] vfprintf
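For what it's worth, a rough sketch of the two-instance approach mentioned above (untested; the config paths, node numbers and thread counts are illustrative, and each named.conf would need its own listen-on address plus whatever internal routing splits the query traffic between them):

    # Instance pinned to NUMA node 0
    numactl --cpunodebind=0 --membind=0 \
        /usr/sbin/named -c /etc/named-node0.conf -n 24

    # Instance pinned to NUMA node 1
    numactl --cpunodebind=1 --membind=1 \
        /usr/sbin/named -c /etc/named-node1.conf -n 24

That way each instance allocates only node-local memory, which should sidestep the cross-node _raw_spin_lock contention visible in the 48-core perf figures.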