From: Parth Shah
> Sent: 30 September 2019 11:44
...
> 5> Separating AVX512 tasks and latency sensitive tasks on separate cores
> ( -Tim Chen )
> ===========================================================================
> Another usecase we are considering is to segregate those workload that will
> pull down core cpu frequency (e.g. AVX512) from workload that are latency
> sensitive. There are certain tasks that need to provide a fast response
> time (latency sensitive) and they are best scheduled on cpu that has a
> lighter load and not have other tasks running on the sibling cpu that could
> pull down the cpu core frequency.
>
> Some users are running machine learning batch tasks with AVX512, and have
> observed that these tasks affect the tasks needing a fast response. They
> have to rely on manual CPU affinity to separate these tasks. With
> appropriate latency hint on task, the scheduler can be taught to separate
> them.

Has this been diagnosed properly?
I can't really see how the frequency drop from AVX512 significantly affects
latency.
Most tasks that require low latency probably don't do a lot of work.

It is much more likely that the latency issues happen because the AVX512
tasks make very few system calls, so they can't be pre-empted even by a
high priority task.

This 'feature' is hinted at by this:

> 2> TurboSched
> ( -Parth Shah )
> ====================
> TurboSched [2] tries to minimize the number of active cores in a socket by
> packing an un-important and low-utilization (named jitter) task on an
> already active core and thus refrains from waking up of a new core if
> possible.

Consider this example of a process that requires low latency (sub 1ms would
be good):
- A hardware interrupt (or timer interrupt) wakes up one thread.
- When that thread wakes it wakes up other threads that are sleeping.
- All the threads 'beaver away' for a few ms (processing RTP and other audio).
- They all sleep for the rest of a 10ms period.

The affinities are set so each thread runs on a separate cpu, and all are
SCHED_RR.

Now busy-loop all the cpus in userspace (run: while :; do :; done on each
one) and see what happens to the latencies.
You really want the SCHED_RR threads to immediately pre-empt the running
processes.
But I suspect nothing happens until a timer interrupt on the target cpu.

	David
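
A minimal sketch of how the experiment above could be measured, not the
original test setup. It covers only the first step (a timer waking one
thread on a 10ms period) and records how late each wakeup is. The function
names (try_realtime, measure_wakeup_latency, summarize) are hypothetical,
and Python is used only for brevity. Switching to SCHED_RR normally needs
root or CAP_SYS_NICE, so the sketch falls back to the default policy when
that is refused.

```python
import os
import time

PERIOD_NS = 10_000_000  # 10 ms period, as in the example above


def try_realtime(cpu=None, priority=1):
    """Best effort: pin to one cpu and switch to SCHED_RR.

    Returns True on success, False if the platform or privileges
    do not allow it (assumption: Linux-only os.sched_* calls).
    """
    try:
        if cpu is not None:
            os.sched_setaffinity(0, {cpu})
        os.sched_setscheduler(0, os.SCHED_RR, os.sched_param(priority))
        return True
    except (AttributeError, OSError):
        return False


def measure_wakeup_latency(periods=100, period_ns=PERIOD_NS):
    """Sleep to each absolute deadline; return how late each wakeup was (ns)."""
    late = []
    deadline = time.monotonic_ns() + period_ns
    for _ in range(periods):
        remaining = deadline - time.monotonic_ns()
        if remaining > 0:
            time.sleep(remaining / 1e9)
        late.append(time.monotonic_ns() - deadline)
        deadline += period_ns  # absolute deadlines, so lateness does not drift
    return late


def summarize(late_ns):
    """Return (worst, mean) wakeup lateness in microseconds."""
    return max(late_ns) / 1000, sum(late_ns) / len(late_ns) / 1000


if __name__ == "__main__":
    rt = try_realtime(cpu=0)
    worst, mean = summarize(measure_wakeup_latency())
    print(f"SCHED_RR={rt} worst={worst:.1f}us mean={mean:.1f}us")
```

Run it once on an idle machine and once with the busy loops running on
every cpu, then compare the worst-case figures. Interpreter overhead adds
noise, so a real measurement would be better done in C or with cyclictest.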