On Thu, 2014-07-03 at 12:32 +1000, Dave Chinner wrote:
> Hi folks,
>
> I've got a workload that hammers the mmap_sem via multi-threaded
> memory allocation and page faults: it's called xfs_repair.
Another reason for concurrent address space operations :/

> I've been debugging problems with the latest release, and in the
> process of tracking down some recent regressions I noticed that
> turning off all the cross-ag IO concurrency resulted in the repair
> running about 4-5x faster.
>
> i.e.:
>
> # time xfs_repair -f -vv fs.img
> .....
>
>         XFS_REPAIR Summary    Thu Jul  3 11:51:41 2014
>
> Phase           Start           End             Duration
> Phase 1:        07/03 11:47:15  07/03 11:47:15
> Phase 2:        07/03 11:47:15  07/03 11:47:35  20 seconds
> Phase 3:        07/03 11:47:35  07/03 11:51:26  3 minutes, 51 seconds
> Phase 4:        07/03 11:51:26  07/03 11:51:31  5 seconds
> Phase 5:        07/03 11:51:31  07/03 11:51:31
> Phase 6:        07/03 11:51:31  07/03 11:51:39  8 seconds
> Phase 7:        07/03 11:51:39  07/03 11:51:39
>
> Total run time: 4 minutes, 24 seconds
> done
>
> real    4m26.399s
> user    1m6.023s
> sys     27m26.707s
> $
>
> And turning off AG striding:
>
> # time xfs_repair -f -vv -o ag_stride=-1 fs.img
> .....
>         XFS_REPAIR Summary    Thu Jul  3 11:41:28 2014
>
> Phase           Start           End             Duration
> Phase 1:        07/03 11:40:27  07/03 11:40:27
> Phase 2:        07/03 11:40:27  07/03 11:40:36  9 seconds
> Phase 3:        07/03 11:40:36  07/03 11:41:12  36 seconds

The *real* degradation is here then.

> Phase 4:        07/03 11:41:12  07/03 11:41:17  5 seconds
> Phase 5:        07/03 11:41:17  07/03 11:41:18  1 second
> Phase 6:        07/03 11:41:18  07/03 11:41:25  7 seconds
> Phase 7:        07/03 11:41:25  07/03 11:41:25
>
> Total run time: 58 seconds
> done
>
> real    0m59.893s
> user    0m41.935s
> sys     4m55.867s
> $
>
> The difference is in phase 2 and 3, which is where all the memory
> allocation and IO that populates the userspace buffer cache takes
> place. The remainder of the phases run from the cache. All IO is
> direct IO, so there is no kernel caching at all. The filesystem
> image has a lot of metadata in it - it has 336 AGs and the buffer
> cache grows to about 6GB in size during phase 3.
>
> The difference in performance is in the system CPU time, and it
> results directly in lower IO dispatch rates - about 2,000 IOPS
> instead of ~12,000.
>
> This is what the kernel profile looks like on the strided run:
>
> -  83.06%  [kernel]  [k] osq_lock
>    - osq_lock
>       - 100.00% rwsem_down_write_failed
>          - call_rwsem_down_write_failed
>             - 99.55% sys_mprotect
>                  tracesys
>                  __GI___mprotect
> -  12.02%  [kernel]  [k] rwsem_down_write_failed
>       rwsem_down_write_failed
>       call_rwsem_down_write_failed
> +   1.09%  [kernel]  [k] _raw_spin_unlock_irqrestore
> +   0.92%  [kernel]  [k] _raw_spin_unlock_irq
> +   0.68%  [kernel]  [k] __do_softirq
> +   0.33%  [kernel]  [k] default_send_IPI_mask_sequence_phys
> +   0.10%  [kernel]  [k] __do_page_fault
>
> Basically, all the kernel time is spent processing lock contention
> rather than doing real work.

While before it just blocked.

> I haven't tested back on 3.15 yet, but historically the strided AG
> repair for such filesystems (which I test all the time on 100+500TB
> filesystem images) is about 20-25% faster on this storage subsystem.
> Yes, the old code also burnt a lot of CPU due to lock contention,
> but it didn't go 5x slower than having no threading at all.
>
> So this looks like a significant performance regression due to:
>
> 4fc828e locking/rwsem: Support optimistic spinning
>
> which changed the rwsem behaviour in 3.16-rc1.

So the mmap_sem is held long enough in this workload that the cost of
blocking is actually significantly smaller than just spinning --
particularly MCS. How many threads are you running when you see the
issue?
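For context, the behavioural change from that commit is roughly the
following. Before 4fc828e a contended down_write() queued the task and
went to sleep; with optimistic spinning the writer first busy-waits
while the current lock owner is still running on a CPU, on the theory
that the lock will be released soon and two context switches can be
avoided. The real code also serializes the spinners on an MCS/OSQ
queue, which is the osq_lock eating 83% in the profile above. A heavily
simplified userspace sketch of just the spin-vs-block decision (the
struct and helpers here are invented for illustration, this is not the
kernel code):

        /*
         * Heavily simplified sketch of the optimistic-spinning idea from
         * commit 4fc828e -- illustrative only, NOT the kernel implementation.
         * struct fake_rwsem and its helpers are invented for this example.
         */
        #include <sched.h>
        #include <stdatomic.h>
        #include <stdbool.h>

        struct fake_rwsem {
                atomic_long owner;        /* 0 == unlocked, else id of the writing task */
                atomic_bool owner_on_cpu; /* the kernel checks whether the owner task is
                                             currently running on a CPU; faked here */
        };

        static bool try_write_lock(struct fake_rwsem *sem, long me)
        {
                long expected = 0;
                return atomic_compare_exchange_strong(&sem->owner, &expected, me);
        }

        /*
         * Old behaviour: a contended writer queued itself and slept immediately.
         * New behaviour: spin while the owner is still on a CPU, since the lock
         * is then likely to be released soon and sleeping costs two context
         * switches.  If the owner is not running, give up and block as before.
         */
        static bool optimistic_spin(struct fake_rwsem *sem, long me)
        {
                while (atomic_load(&sem->owner_on_cpu)) {
                        if (try_write_lock(sem, me))
                                return true;   /* acquired without sleeping */
                        sched_yield();         /* stand-in for cpu_relax() */
                }
                return false;                  /* fall back to the blocking slow path */
        }

That trade-off wins when hold times are short; when mmap_sem is held
for long stretches, as in your profile, the spinners just burn the CPU
that would otherwise be issuing IO.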
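Also, just to make sure I'm picturing the same access pattern: per the
profile it is mprotect() taking mmap_sem for write racing with page
faults taking it for read, across many threads. Something like the
trivial standalone pattern below (a hypothetical sketch, obviously not
xfs_repair's actual code; thread count and sizes picked arbitrarily) is
what I'd use to poke at it on my end:

        /*
         * Hypothetical reproducer sketch: each thread repeatedly faults pages
         * in (mmap_sem taken for read) and flips protections with mprotect()
         * (mmap_sem taken for write).  Build with: gcc -O2 -pthread repro.c
         */
        #define _GNU_SOURCE
        #include <pthread.h>
        #include <stdio.h>
        #include <string.h>
        #include <sys/mman.h>

        #define NTHREADS  32
        #define REGION_SZ (16UL << 20)   /* 16MB per thread */
        #define ITERS     500

        static void *hammer(void *arg)
        {
                (void)arg;
                char *buf = mmap(NULL, REGION_SZ, PROT_READ | PROT_WRITE,
                                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                if (buf == MAP_FAILED) {
                        perror("mmap");
                        return NULL;
                }

                for (int i = 0; i < ITERS; i++) {
                        memset(buf, i, REGION_SZ);            /* fault pages in */
                        mprotect(buf, REGION_SZ, PROT_READ);  /* mmap_sem for write */
                        mprotect(buf, REGION_SZ, PROT_READ | PROT_WRITE);
                }
                munmap(buf, REGION_SZ);
                return NULL;
        }

        int main(void)
        {
                pthread_t tid[NTHREADS];

                for (int i = 0; i < NTHREADS; i++)
                        pthread_create(&tid[i], NULL, hammer, NULL);
                for (int i = 0; i < NTHREADS; i++)
                        pthread_join(tid[i], NULL);
                return 0;
        }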
Thanks,
Davidlohr