On Fri, Aug 4, 2017 at 11:21 AM, Ross Zwisler <ross.zwis...@linux.intel.com> wrote:
> On Fri, Aug 04, 2017 at 11:01:08AM -0700, Dan Williams wrote:
>> [ adding Dave who is working on a blk-mq + dma offload version of the
>> pmem driver ]
>>
>> On Fri, Aug 4, 2017 at 1:17 AM, Minchan Kim <minc...@kernel.org> wrote:
>> > On Fri, Aug 04, 2017 at 12:54:41PM +0900, Minchan Kim wrote:
>> [..]
>> >> Thanks for the testing. Are your test numbers within the noise level?
>> >>
>> >> I cannot understand why PMEM doesn't show enough gain while BTT is a
>> >> significant win (8%). I guess the no-rw_page BTT test had more chances
>> >> to wait on dynamic bio allocation, and both my series and the rw_page
>> >> test reduced that significantly. However, with pmem and no rw_page,
>> >> there weren't many cases waiting on bio allocation because the device
>> >> is so fast, so the difference comes purely from the number of
>> >> instructions executed. At a quick glance, bio init/submit is not
>> >> trivial, so I understand where the 12% enhancement comes from, but I'm
>> >> not sure it's a big difference in real practice given the maintenance
>> >> burden.
>> >
>> > I tested pmbench 10 times on my local machine (4 cores) with zram-swap.
>> > On my machine, the on-stack bio is even faster than rw_page. Unbelievable.
>> >
>> > I guess it's really hard to get a stable result under severe memory
>> > pressure. The result is probably within the noise level (see the stddev
>> > below), so I think it's hard to conclude that rw_page is far faster
>> > than the on-stack bio.
>> >
>> > rw_page
>> > avg    5.54us
>> > stddev 8.89%
>> > max    6.02us
>> > min    4.20us
>> >
>> > onstack bio
>> > avg    5.27us
>> > stddev 13.03%
>> > max    5.96us
>> > min    3.55us
>>
>> The maintenance burden of having alternative submission paths is
>> significant, especially as we consider the pmem driver using more
>> services of the core block layer. Ideally, I'd want to complete the
>> rw_page removal work before we look at the blk-mq + dma offload
>> reworks.
>>
>> The change to introduce BDI_CAP_SYNC is interesting because we might
>> have use for switching between dma offload and cpu copy based on
>> whether the I/O is synchronous or otherwise hinted to be a low-latency
>> request. Right now the dma offload patches are using "bio_segments() >
>> 1" as the gate for selecting offload vs cpu copy, which seems
>> inadequate.
>
> Okay, so based on the feedback above and from Jens [1], it sounds like we
> want to go forward with removing the rw_page() interface, and instead
> optimize the regular I/O path via on-stack bios and dma offload, correct?
>
> If so, I'll prepare patches that fully remove the rw_page() code, and let
> Minchan and Dave work on their optimizations.
I think the conversion to on-stack bios should be done in the same patchset that removes rw_page; we don't want to leave a known performance regression in place while the on-stack-bio work is in flight.
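
For concreteness, here is a minimal sketch of the on-stack-bio pattern under
discussion: a synchronous single-page read submitted without allocating a bio
from a mempool. It assumes the ~4.13-era block-layer API (bio_init() with an
inline bio_vec table, bi_bdev, submit_bio_wait()); the actual zram/pmem
conversion may of course look different.

	#include <linux/bio.h>
	#include <linux/blkdev.h>

	/*
	 * Illustrative only: read one page synchronously using a bio that
	 * lives on the caller's stack, so the fast path avoids the
	 * bio_alloc()/mempool round trip that rw_page() was added to skip.
	 */
	static int read_page_with_onstack_bio(struct block_device *bdev,
					      sector_t sector,
					      struct page *page)
	{
		struct bio_vec bvec;
		struct bio bio;

		/* Initialize the stack bio with a single inline bio_vec slot. */
		bio_init(&bio, &bvec, 1);
		bio.bi_bdev = bdev;
		bio.bi_iter.bi_sector = sector;
		bio_add_page(&bio, page, PAGE_SIZE, 0);
		bio_set_op_attrs(&bio, REQ_OP_READ, REQ_SYNC);

		/* Submit and wait in the caller's context; no dynamic allocation. */
		return submit_bio_wait(&bio);
	}

The write side is the same shape with REQ_OP_WRITE; the point is simply that
once this is in place, removing rw_page() doesn't reintroduce a per-page bio
allocation on the synchronous swap path.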