Hi,

On 2020-05-03 09:12:59 -0400, Robert Haas wrote:
> On Sat, May 2, 2020 at 10:36 PM Andres Freund <and...@anarazel.de> wrote:
> > I changed Robert's test program to optionally fallocate,
> > sync_file_range(WRITE) and posix_fadvise(DONTNEED), to avoid a large
> > footprint in the page cache. The performance differences are quite
> > substantial:
> >
> > gcc -Wall -ggdb ~/tmp/write_and_fsync.c -o /tmp/write_and_fsync && \
> >   rm -f /srv/dev/bench/test* && echo 3 | sudo tee /proc/sys/vm/drop_caches && \
> >   /tmp/write_and_fsync --sync_file_range=0 --fallocate=0 --fadvise=0 \
> >     --filesize=$((400*1024*1024*1024)) /srv/dev/bench/test1
> >
> > running test with: numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=0 fadvise=0
> > [/srv/dev/bench/test1][11450] open: 0, fallocate: 0 write: 214, fsync: 6, close: 0, total: 220
> >
> > comparing that with --sync_file_range=1 --fallocate=1 --fadvise=1:
> >
> > running test with: numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=1 fadvise=1
> > [/srv/dev/bench/test1][14098] open: 0, fallocate: 0 write: 161, fsync: 0, close: 0, total: 161
>
> Ah, nice.
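For anyone following along without the test program at hand, the timed loop is, very roughly, the sketch below. Error handling, option parsing and the multi-process setup are elided, and the 32MB flush/drop window is a placeholder, not necessarily what the real program uses:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define WINDOW (32 * 1024 * 1024)	/* writeback/drop granularity; made up */

static void
write_one_file(const char *path, size_t filesize, size_t blocksize,
			   bool do_fallocate, bool do_sfr, bool do_fadvise)
{
	int		fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	char   *buf = malloc(blocksize);

	memset(buf, 'x', blocksize);

	/* optionally reserve all the space upfront */
	if (do_fallocate)
		fallocate(fd, 0, 0, filesize);

	for (size_t off = 0; off < filesize; off += blocksize)
	{
		write(fd, buf, blocksize);

		if (off > 0 && (off % WINDOW) == 0)
		{
			/* kick off writeback of the range we just filled ... */
			if (do_sfr)
				sync_file_range(fd, off - WINDOW, WINDOW,
								SYNC_FILE_RANGE_WRITE);

			/* ... and drop already written-back data from the page cache */
			if (do_fadvise)
				posix_fadvise(fd, 0, off - WINDOW, POSIX_FADV_DONTNEED);
		}
	}

	fsync(fd);
	close(fd);
	free(buf);
}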
Btw, I forgot to include the result for 0 / 0 / 0 in the results (off-by-one error in a script :)):

numprocs=1 filesize=429496729600 blocksize=8192   fallocate=0 sfr=0 fadvise=0  220.210155081  1,569,524,602,961  178.188M/sec    1,363,686,761,705  0.155GHz      833,345,334,408  0.68

> > The run-to-run variations between the runs without cache control are
> > pretty large. So these are probably not the end-all-be-all numbers. But I
> > think the trends are pretty clear.
>
> Could you be explicit about what you think those clear trends are?

Largely that concurrency can help a bit, but can also hurt tremendously. Below is some more detailed analysis; it'll be a bit long...

Taking the no-concurrency / no-cache-management case as a baseline:

> test                                                                            time           ref_cycles_tot     ref_cycles_sec  cycles_tot         cycles_sec  instructions_tot   ipc
> numprocs=1 filesize=429496729600 blocksize=8192   fallocate=0 sfr=0 fadvise=0  220.210155081  1,569,524,602,961  178.188M/sec    1,363,686,761,705  0.155GHz      833,345,334,408  0.68

and comparing cache management against using some concurrency:

> test                                                                            time           ref_cycles_tot     ref_cycles_sec  cycles_tot         cycles_sec  instructions_tot   ipc
> numprocs=1 filesize=429496729600 blocksize=8192   fallocate=0 sfr=1 fadvise=1  164.175492485    913,991,290,231  139.183M/sec      762,359,320,428  0.116GHz      678,451,556,273  0.84
> numprocs=2 filesize=214748364800 blocksize=8192   fallocate=0 sfr=0 fadvise=0  164.052510134  1,561,521,537,336  237.972M/sec    1,404,761,167,120  0.214GHz      715,274,337,015  0.51

we can see very similar timing. Which makes sense, because that's roughly the device's max speed. But going to higher concurrency, there's clearly a regression:

> test                                                                            time           ref_cycles_tot     ref_cycles_sec  cycles_tot         cycles_sec  instructions_tot   ipc
> numprocs=4 filesize=107374182400 blocksize=8192   fallocate=0 sfr=0 fadvise=0  455.096916715  2,808,715,616,077  154.293M/sec    1,366,660,063,053  0.075GHz      888,512,073,477  0.66

And I think it is instructive to look at ref_cycles_tot/cycles_tot/instructions_tot vs ref_cycles_sec/cycles_sec/ipc. The units are confusing because they are summed across all cores, and most cores are idle. But it's pretty obvious that numprocs=1 sfr=1 fadvise=1 has the cores running for a much shorter time (reference cycles basically count the time cores were running, on an absolute time scale). Compared to numprocs=2 sfr=0 fadvise=0, which has the same resulting performance, it's clear that the cores were busier, but less efficient (lower ipc).
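A note on the units, since they're easy to misread: the counts come from running each test under a system-wide perf stat, the same way as in the direct IO comparison at the end of this mail, i.e. something like

/usr/bin/time perf stat -a -e cpu-clock,ref-cycles,cycles,instructions <test command>

The _tot columns are raw counts summed across all 40 CPUs; the per-second / GHz columns are those counts divided by wall-clock time and the number of CPUs. E.g. for the baseline row: 1,569,524,602,961 ref-cycles / 220.21 s / 40 CPUs ~= 178.2M/sec.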
With cache management there's very little benefit, and some risk (see the 1 -> 2 regression), from increasing concurrency in this workload:

> test                                                                            time           ref_cycles_tot     ref_cycles_sec  cycles_tot         cycles_sec  instructions_tot   ipc
> numprocs=1 filesize=429496729600 blocksize=8192   fallocate=0 sfr=1 fadvise=1  164.175492485    913,991,290,231  139.183M/sec      762,359,320,428  0.116GHz      678,451,556,273  0.84
> numprocs=2 filesize=214748364800 blocksize=8192   fallocate=0 sfr=1 fadvise=1  188.772193248  1,418,274,870,697  187.803M/sec      923,133,958,500  0.122GHz      799,212,291,243  0.92
> numprocs=4 filesize=107374182400 blocksize=8192   fallocate=0 sfr=1 fadvise=1  158.262790654  1,720,443,307,097  271.769M/sec    1,004,079,045,479  0.159GHz      826,905,592,751  0.84

And there's good benefit, but tremendous risk, from concurrency in the no-cache-control case:

> test                                                                            time           ref_cycles_tot     ref_cycles_sec  cycles_tot         cycles_sec  instructions_tot   ipc
> numprocs=1 filesize=429496729600 blocksize=8192   fallocate=0 sfr=0 fadvise=0  220.210155081  1,569,524,602,961  178.188M/sec    1,363,686,761,705  0.155GHz      833,345,334,408  0.68
> numprocs=2 filesize=214748364800 blocksize=8192   fallocate=0 sfr=0 fadvise=0  164.052510134  1,561,521,537,336  237.972M/sec    1,404,761,167,120  0.214GHz      715,274,337,015  0.51
> numprocs=4 filesize=107374182400 blocksize=8192   fallocate=0 sfr=0 fadvise=0  455.096916715  2,808,715,616,077  154.293M/sec    1,366,660,063,053  0.075GHz      888,512,073,477  0.66

sync_file_range without fadvise isn't a benefit at low concurrency, but prevents bad regressions at high concurrency:

> test                                                                            time           ref_cycles_tot     ref_cycles_sec  cycles_tot         cycles_sec  instructions_tot   ipc
> numprocs=1 filesize=429496729600 blocksize=8192   fallocate=0 sfr=0 fadvise=0  220.210155081  1,569,524,602,961  178.188M/sec    1,363,686,761,705  0.155GHz      833,345,334,408  0.68
> numprocs=1 filesize=429496729600 blocksize=8192   fallocate=0 sfr=1 fadvise=0  248.430736196  1,497,048,950,014  150.653M/sec    1,226,822,167,960  0.123GHz      705,950,461,166  0.54
> numprocs=2 filesize=214748364800 blocksize=8192   fallocate=0 sfr=0 fadvise=0  164.052510134  1,561,521,537,336  237.972M/sec    1,404,761,167,120  0.214GHz      715,274,337,015  0.51
> numprocs=2 filesize=214748364800 blocksize=8192   fallocate=0 sfr=1 fadvise=0  192.151682414  1,526,440,715,456  198.603M/sec    1,037,135,756,007  0.135GHz      802,754,964,096  0.76
> numprocs=4 filesize=107374182400 blocksize=8192   fallocate=0 sfr=0 fadvise=0  455.096916715  2,808,715,616,077  154.293M/sec    1,366,660,063,053  0.075GHz      888,512,073,477  0.66
> numprocs=4 filesize=107374182400 blocksize=8192   fallocate=0 sfr=1 fadvise=0  256.156100686  2,407,922,637,215  235.003M/sec    1,133,311,037,956  0.111GHz      748,666,206,805  0.65

fadvise alone is similar:

> test                                                                            time           ref_cycles_tot     ref_cycles_sec  cycles_tot         cycles_sec  instructions_tot   ipc
> numprocs=1 filesize=429496729600 blocksize=8192   fallocate=0 sfr=0 fadvise=0  220.210155081  1,569,524,602,961  178.188M/sec    1,363,686,761,705  0.155GHz      833,345,334,408  0.68
> numprocs=1 filesize=429496729600 blocksize=8192   fallocate=0 sfr=0 fadvise=1  310.275952938  1,921,817,571,226  154.849M/sec    1,499,581,687,133  0.121GHz      944,243,167,053  0.59
> numprocs=2 filesize=214748364800 blocksize=8192   fallocate=0 sfr=0 fadvise=0  164.052510134  1,561,521,537,336  237.972M/sec    1,404,761,167,120  0.214GHz      715,274,337,015  0.51
> numprocs=2 filesize=214748364800 blocksize=8192   fallocate=0 sfr=0 fadvise=1  242.648245159  1,782,637,416,163  183.629M/sec    1,463,696,313,881  0.151GHz    1,000,100,694,932  0.69
> numprocs=4 filesize=107374182400 blocksize=8192   fallocate=0 sfr=0 fadvise=0  455.096916715  2,808,715,616,077  154.293M/sec    1,366,660,063,053  0.075GHz      888,512,073,477  0.66
> numprocs=4 filesize=107374182400 blocksize=8192   fallocate=0 sfr=0 fadvise=1  215.255015340  1,977,578,120,924  229.676M/sec    1,461,504,758,029  0.170GHz    1,005,270,838,642  0.68

There does not appear to be a huge benefit from fallocate in this workload; the OS's delayed allocation works well. Compare:

numprocs=1
> test                                                                            time           ref_cycles_tot     ref_cycles_sec  cycles_tot         cycles_sec  instructions_tot   ipc
> numprocs=1 filesize=429496729600 blocksize=8192   fallocate=0 sfr=0 fadvise=0  220.210155081  1,569,524,602,961  178.188M/sec    1,363,686,761,705  0.155GHz      833,345,334,408  0.68
> numprocs=1 filesize=429496729600 blocksize=8192   fallocate=1 sfr=0 fadvise=0  243.609959554  1,802,385,405,203  184.970M/sec    1,449,560,513,247  0.149GHz      855,426,288,031  0.56
> numprocs=1 filesize=429496729600 blocksize=8192   fallocate=0 sfr=1 fadvise=0  248.430736196  1,497,048,950,014  150.653M/sec    1,226,822,167,960  0.123GHz      705,950,461,166  0.54
> numprocs=1 filesize=429496729600 blocksize=8192   fallocate=1 sfr=1 fadvise=0  230.880100449  1,328,417,418,799  143.846M/sec    1,148,924,667,393  0.124GHz      723,158,246,628  0.63
> numprocs=1 filesize=429496729600 blocksize=8192   fallocate=0 sfr=0 fadvise=1  310.275952938  1,921,817,571,226  154.849M/sec    1,499,581,687,133  0.121GHz      944,243,167,053  0.59
> numprocs=1 filesize=429496729600 blocksize=8192   fallocate=1 sfr=0 fadvise=1  253.591234992  1,548,485,571,798  152.658M/sec    1,229,926,994,613  0.121GHz    1,117,352,436,324  0.95
> numprocs=1 filesize=429496729600 blocksize=8192   fallocate=0 sfr=1 fadvise=1  164.175492485    913,991,290,231  139.183M/sec      762,359,320,428  0.116GHz      678,451,556,273  0.84
> numprocs=1 filesize=429496729600 blocksize=8192   fallocate=1 sfr=1 fadvise=1  164.488835158    911,974,902,254  138.611M/sec      760,756,011,483  0.116GHz      672,105,046,261  0.84

numprocs=2
> test                                                                            time           ref_cycles_tot     ref_cycles_sec  cycles_tot         cycles_sec  instructions_tot   ipc
> numprocs=2 filesize=214748364800 blocksize=8192   fallocate=0 sfr=0 fadvise=0  164.052510134  1,561,521,537,336  237.972M/sec    1,404,761,167,120  0.214GHz      715,274,337,015  0.51
> numprocs=2 filesize=214748364800 blocksize=8192   fallocate=1 sfr=0 fadvise=0  421.580487642  2,756,486,952,728  163.449M/sec    1,387,708,033,752  0.082GHz      990,478,650,874  0.72
> numprocs=2 filesize=214748364800 blocksize=8192   fallocate=0 sfr=1 fadvise=0  192.151682414  1,526,440,715,456  198.603M/sec    1,037,135,756,007  0.135GHz      802,754,964,096  0.76
> numprocs=2 filesize=214748364800 blocksize=8192   fallocate=1 sfr=1 fadvise=0  169.854206542  1,333,619,626,854  196.282M/sec    1,036,261,531,134  0.153GHz      666,052,333,591  0.64
> numprocs=2 filesize=214748364800 blocksize=8192   fallocate=0 sfr=0 fadvise=1  242.648245159  1,782,637,416,163  183.629M/sec    1,463,696,313,881  0.151GHz    1,000,100,694,932  0.69
> numprocs=2 filesize=214748364800 blocksize=8192   fallocate=1 sfr=0 fadvise=1  305.078100578  1,970,042,289,192  161.445M/sec    1,505,706,462,812  0.123GHz      954,963,240,648  0.62
> numprocs=2 filesize=214748364800 blocksize=8192   fallocate=0 sfr=1 fadvise=1  188.772193248  1,418,274,870,697  187.803M/sec      923,133,958,500  0.122GHz      799,212,291,243  0.92
> numprocs=2 filesize=214748364800 blocksize=8192   fallocate=1 sfr=1 fadvise=1  166.295223626  1,290,699,256,763  194.044M/sec      857,873,391,283  0.129GHz      761,338,026,415  0.89

numprocs=4
> test                                                                            time           ref_cycles_tot     ref_cycles_sec  cycles_tot         cycles_sec  instructions_tot   ipc
> numprocs=4 filesize=107374182400 blocksize=8192   fallocate=0 sfr=0 fadvise=0  455.096916715  2,808,715,616,077  154.293M/sec    1,366,660,063,053  0.075GHz      888,512,073,477  0.66
> numprocs=4 filesize=107374182400 blocksize=8192   fallocate=1 sfr=0 fadvise=0  334.932246893  2,366,388,662,460  176.628M/sec    1,216,049,589,993  0.091GHz      796,698,831,717  0.68
> numprocs=4 filesize=107374182400 blocksize=8192   fallocate=0 sfr=1 fadvise=0  256.156100686  2,407,922,637,215  235.003M/sec    1,133,311,037,956  0.111GHz      748,666,206,805  0.65
> numprocs=4 filesize=107374182400 blocksize=8192   fallocate=1 sfr=1 fadvise=0  161.697270285  1,866,036,713,483  288.576M/sec    1,068,181,502,433  0.165GHz      739,559,279,008  0.70
> numprocs=4 filesize=107374182400 blocksize=8192   fallocate=0 sfr=0 fadvise=1  215.255015340  1,977,578,120,924  229.676M/sec    1,461,504,758,029  0.170GHz    1,005,270,838,642  0.68
> numprocs=4 filesize=107374182400 blocksize=8192   fallocate=1 sfr=0 fadvise=1  231.440889430  1,965,389,749,057  212.391M/sec    1,407,927,406,358  0.152GHz      997,199,361,968  0.72
> numprocs=4 filesize=107374182400 blocksize=8192   fallocate=0 sfr=1 fadvise=1  158.262790654  1,720,443,307,097  271.769M/sec    1,004,079,045,479  0.159GHz      826,905,592,751  0.84
> numprocs=4 filesize=107374182400 blocksize=8192   fallocate=1 sfr=1 fadvise=1  214.433248700  2,232,198,239,769  260.300M/sec    1,073,334,918,389  0.125GHz      861,540,079,120  0.80

I would say that fallocate seems to help the concurrent cases without cache control, but not particularly reliably so. At higher concurrency it seems to hurt with cache control; I'm not sure I understand why.

I was at first confused why 128kb write sizes hurt (128kb is probably on the higher end of useful, but I wanted to see a more extreme difference):

> test                                                                            time           ref_cycles_tot     ref_cycles_sec  cycles_tot         cycles_sec  instructions_tot   ipc
> numprocs=1 filesize=429496729600 blocksize=8192   fallocate=0 sfr=0 fadvise=0  220.210155081  1,569,524,602,961  178.188M/sec    1,363,686,761,705  0.155GHz      833,345,334,408  0.68
> numprocs=1 filesize=429496729600 blocksize=131072 fallocate=0 sfr=0 fadvise=0  644.521613661  3,688,449,404,537  143.079M/sec    2,020,128,131,309  0.078GHz      961,486,630,359  0.48
> numprocs=1 filesize=429496729600 blocksize=8192   fallocate=0 sfr=1 fadvise=0  248.430736196  1,497,048,950,014  150.653M/sec    1,226,822,167,960  0.123GHz      705,950,461,166  0.54
> numprocs=1 filesize=429496729600 blocksize=131072 fallocate=0 sfr=1 fadvise=0  243.830464632  1,499,608,983,445  153.756M/sec    1,227,468,439,403  0.126GHz      691,534,661,654  0.59
> numprocs=1 filesize=429496729600 blocksize=8192   fallocate=0 sfr=0 fadvise=1  310.275952938  1,921,817,571,226  154.849M/sec    1,499,581,687,133  0.121GHz      944,243,167,053  0.59
> numprocs=1 filesize=429496729600 blocksize=131072 fallocate=0 sfr=0 fadvise=1  292.866419420  1,753,376,415,877  149.677M/sec    1,483,169,463,392  0.127GHz      860,035,914,148  0.56
> numprocs=1 filesize=429496729600 blocksize=8192   fallocate=0 sfr=1 fadvise=1  164.175492485    913,991,290,231  139.183M/sec      762,359,320,428  0.116GHz      678,451,556,273  0.84
> numprocs=1 filesize=429496729600 blocksize=131072 fallocate=0 sfr=1 fadvise=1  162.152397194    925,643,754,128  142.719M/sec      743,208,501,601  0.115GHz      554,462,585,110  0.70
> numprocs=1 filesize=429496729600 blocksize=8192   fallocate=1 sfr=0 fadvise=0  243.609959554  1,802,385,405,203  184.970M/sec    1,449,560,513,247  0.149GHz      855,426,288,031  0.56
> numprocs=1 filesize=429496729600 blocksize=131072 fallocate=1 sfr=0 fadvise=0  211.369510165  1,558,996,898,599  184.401M/sec    1,359,343,408,200  0.161GHz      766,769,036,524  0.57
> numprocs=1 filesize=429496729600 blocksize=8192   fallocate=1 sfr=1 fadvise=0  230.880100449  1,328,417,418,799  143.846M/sec    1,148,924,667,393  0.124GHz      723,158,246,628  0.63
> numprocs=1 filesize=429496729600 blocksize=131072 fallocate=1 sfr=1 fadvise=0  233.315094908  1,427,133,080,540  152.927M/sec    1,166,000,868,597  0.125GHz      743,027,329,074  0.64
> numprocs=1 filesize=429496729600 blocksize=8192   fallocate=1 sfr=0 fadvise=1  253.591234992  1,548,485,571,798  152.658M/sec    1,229,926,994,613  0.121GHz    1,117,352,436,324  0.95
> numprocs=1 filesize=429496729600 blocksize=131072 fallocate=1 sfr=0 fadvise=1  290.698155820  1,732,849,079,701  149.032M/sec    1,441,508,612,326  0.124GHz      835,039,426,282  0.57
> numprocs=1 filesize=429496729600 blocksize=8192   fallocate=1 sfr=1 fadvise=1  164.488835158    911,974,902,254  138.611M/sec      760,756,011,483  0.116GHz      672,105,046,261  0.84
> numprocs=1 filesize=429496729600 blocksize=131072 fallocate=1 sfr=1 fadvise=1  159.945462440    850,162,390,626  132.892M/sec      724,286,281,548  0.113GHz      670,069,573,150  0.90
> numprocs=2 filesize=214748364800 blocksize=8192   fallocate=0 sfr=0 fadvise=0  164.052510134  1,561,521,537,336  237.972M/sec    1,404,761,167,120  0.214GHz      715,274,337,015  0.51
> numprocs=2 filesize=214748364800 blocksize=131072 fallocate=0 sfr=0 fadvise=0  163.244592275  1,524,807,507,173  233.531M/sec    1,398,319,581,978  0.214GHz      689,514,058,243  0.46
> numprocs=2 filesize=214748364800 blocksize=8192   fallocate=0 sfr=1 fadvise=0  192.151682414  1,526,440,715,456  198.603M/sec    1,037,135,756,007  0.135GHz      802,754,964,096  0.76
> numprocs=2 filesize=214748364800 blocksize=131072 fallocate=0 sfr=1 fadvise=0  231.795934322  1,731,030,267,153  186.686M/sec    1,124,935,745,020  0.121GHz      736,084,922,669  0.70
> numprocs=2 filesize=214748364800 blocksize=8192   fallocate=0 sfr=0 fadvise=1  242.648245159  1,782,637,416,163  183.629M/sec    1,463,696,313,881  0.151GHz    1,000,100,694,932  0.69
> numprocs=2 filesize=214748364800 blocksize=131072 fallocate=0 sfr=0 fadvise=1  315.564163702  1,958,199,733,216  155.128M/sec    1,405,115,546,716  0.111GHz    1,000,595,890,394  0.73
> numprocs=2 filesize=214748364800 blocksize=8192   fallocate=0 sfr=1 fadvise=1  188.772193248  1,418,274,870,697  187.803M/sec      923,133,958,500  0.122GHz      799,212,291,243  0.92
> numprocs=2 filesize=214748364800 blocksize=131072 fallocate=0 sfr=1 fadvise=1  210.945487961  1,527,169,148,899  180.990M/sec      906,023,518,692  0.107GHz      700,166,552,207  0.80
> numprocs=2 filesize=214748364800 blocksize=8192   fallocate=1 sfr=0 fadvise=0  421.580487642  2,756,486,952,728  163.449M/sec    1,387,708,033,752  0.082GHz      990,478,650,874  0.72
> numprocs=2 filesize=214748364800 blocksize=131072 fallocate=1 sfr=0 fadvise=0  161.759094088  1,468,321,054,671  226.934M/sec    1,221,167,105,510  0.189GHz      735,855,415,612  0.59
> numprocs=2 filesize=214748364800 blocksize=8192   fallocate=1 sfr=1 fadvise=0  169.854206542  1,333,619,626,854  196.282M/sec    1,036,261,531,134  0.153GHz      666,052,333,591  0.64
> numprocs=2 filesize=214748364800 blocksize=131072 fallocate=1 sfr=1 fadvise=0  158.578248952  1,354,770,825,277  213.586M/sec      936,436,363,752  0.148GHz      654,823,079,884  0.68
> numprocs=2 filesize=214748364800 blocksize=8192   fallocate=1 sfr=0 fadvise=1  305.078100578  1,970,042,289,192  161.445M/sec    1,505,706,462,812  0.123GHz      954,963,240,648  0.62
> numprocs=2 filesize=214748364800 blocksize=131072 fallocate=1 sfr=0 fadvise=1  274.628500801  1,792,841,068,080  163.209M/sec    1,343,398,055,199  0.122GHz      996,073,874,051  0.73
> numprocs=2 filesize=214748364800 blocksize=8192   fallocate=1 sfr=1 fadvise=1  166.295223626  1,290,699,256,763  194.044M/sec      857,873,391,283  0.129GHz      761,338,026,415  0.89
> numprocs=2 filesize=214748364800 blocksize=131072 fallocate=1 sfr=1 fadvise=1  179.140070123  1,383,595,004,328  193.095M/sec      850,299,722,091  0.119GHz      706,959,617,654  0.83
> numprocs=4 filesize=107374182400 blocksize=8192   fallocate=0 sfr=0 fadvise=0  455.096916715  2,808,715,616,077  154.293M/sec    1,366,660,063,053  0.075GHz      888,512,073,477  0.66
> numprocs=4 filesize=107374182400 blocksize=131072 fallocate=0 sfr=0 fadvise=0  445.496787199  2,663,914,572,687  149.495M/sec    1,267,340,496,930  0.071GHz      787,469,552,454  0.62
> numprocs=4 filesize=107374182400 blocksize=8192   fallocate=0 sfr=1 fadvise=0  256.156100686  2,407,922,637,215  235.003M/sec    1,133,311,037,956  0.111GHz      748,666,206,805  0.65
> numprocs=4 filesize=107374182400 blocksize=131072 fallocate=0 sfr=1 fadvise=0  261.866083604  2,325,884,820,091  222.043M/sec    1,094,814,208,219  0.105GHz      649,479,233,453  0.57
> numprocs=4 filesize=107374182400 blocksize=8192   fallocate=0 sfr=0 fadvise=1  215.255015340  1,977,578,120,924  229.676M/sec    1,461,504,758,029  0.170GHz    1,005,270,838,642  0.68
> numprocs=4 filesize=107374182400 blocksize=131072 fallocate=0 sfr=0 fadvise=1  172.963505544  1,717,387,683,260  248.228M/sec    1,356,381,335,831  0.196GHz      822,256,638,370  0.58
> numprocs=4 filesize=107374182400 blocksize=8192   fallocate=0 sfr=1 fadvise=1  158.262790654  1,720,443,307,097  271.769M/sec    1,004,079,045,479  0.159GHz      826,905,592,751  0.84
> numprocs=4 filesize=107374182400 blocksize=131072 fallocate=0 sfr=1 fadvise=1  157.934678897  1,650,503,807,778  261.266M/sec      970,705,561,971  0.154GHz      637,953,927,131  0.66
> numprocs=4 filesize=107374182400 blocksize=8192   fallocate=1 sfr=0 fadvise=0  334.932246893  2,366,388,662,460  176.628M/sec    1,216,049,589,993  0.091GHz      796,698,831,717  0.68
> numprocs=4 filesize=107374182400 blocksize=131072 fallocate=1 sfr=0 fadvise=0  225.623143601  1,804,402,820,599  199.938M/sec    1,086,394,788,362  0.120GHz      656,392,112,807  0.62
> numprocs=4 filesize=107374182400 blocksize=8192   fallocate=1 sfr=1 fadvise=0  161.697270285  1,866,036,713,483  288.576M/sec    1,068,181,502,433  0.165GHz      739,559,279,008  0.70
> numprocs=4 filesize=107374182400 blocksize=131072 fallocate=1 sfr=1 fadvise=0  157.930900998  1,797,506,082,342  284.548M/sec    1,001,509,813,741  0.159GHz      644,107,150,289  0.66
> numprocs=4 filesize=107374182400 blocksize=8192   fallocate=1 sfr=0 fadvise=1  231.440889430  1,965,389,749,057  212.391M/sec    1,407,927,406,358  0.152GHz      997,199,361,968  0.72
> numprocs=4 filesize=107374182400 blocksize=131072 fallocate=1 sfr=0 fadvise=1  165.772265335  1,805,895,001,689  272.353M/sec    1,514,173,918,970  0.228GHz      823,435,044,810  0.54
> numprocs=4 filesize=107374182400 blocksize=8192   fallocate=1 sfr=1 fadvise=1  214.433248700  2,232,198,239,769  260.300M/sec    1,073,334,918,389  0.125GHz      861,540,079,120  0.80
> numprocs=4 filesize=107374182400 blocksize=131072 fallocate=1 sfr=1 fadvise=1  187.664764448  1,964,118,348,429  261.660M/sec      978,060,510,880  0.130GHz      668,316,194,988  0.67

It's pretty clear that the larger write block size can hurt quite badly. I was somewhat confused by this at first, but after thinking about it for a while longer it actually makes sense: for the OS to finish an 8k write it needs to find two free pagecache pages; for a 128k write it needs to find 32.
Which means that it's much more likely that the kernel writeback threads and the writing process are going to fight over locks / cachelines: in the 8k case it's quite likely that the kernel threads will often do so while the memcpy() from userland is happening, but that's less the case when 32 pages need to be acquired before the memcpy() can happen.

With cache control that problem doesn't exist, which is why the larger block size is beneficial there:

> test                                                                            time           ref_cycles_tot     ref_cycles_sec  cycles_tot         cycles_sec  instructions_tot   ipc
> numprocs=1 filesize=429496729600 blocksize=8192   fallocate=0 sfr=1 fadvise=1  164.175492485    913,991,290,231  139.183M/sec      762,359,320,428  0.116GHz      678,451,556,273  0.84
> numprocs=1 filesize=429496729600 blocksize=131072 fallocate=0 sfr=1 fadvise=1  162.152397194    925,643,754,128  142.719M/sec      743,208,501,601  0.115GHz      554,462,585,110  0.70
> numprocs=1 filesize=429496729600 blocksize=8192   fallocate=1 sfr=1 fadvise=1  164.488835158    911,974,902,254  138.611M/sec      760,756,011,483  0.116GHz      672,105,046,261  0.84
> numprocs=1 filesize=429496729600 blocksize=131072 fallocate=1 sfr=1 fadvise=1  159.945462440    850,162,390,626  132.892M/sec      724,286,281,548  0.113GHz      670,069,573,150  0.90
> numprocs=2 filesize=214748364800 blocksize=8192   fallocate=0 sfr=1 fadvise=1  188.772193248  1,418,274,870,697  187.803M/sec      923,133,958,500  0.122GHz      799,212,291,243  0.92
> numprocs=2 filesize=214748364800 blocksize=131072 fallocate=0 sfr=1 fadvise=1  210.945487961  1,527,169,148,899  180.990M/sec      906,023,518,692  0.107GHz      700,166,552,207  0.80
> numprocs=2 filesize=214748364800 blocksize=8192   fallocate=1 sfr=1 fadvise=1  166.295223626  1,290,699,256,763  194.044M/sec      857,873,391,283  0.129GHz      761,338,026,415  0.89
> numprocs=2 filesize=214748364800 blocksize=131072 fallocate=1 sfr=1 fadvise=1  179.140070123  1,383,595,004,328  193.095M/sec      850,299,722,091  0.119GHz      706,959,617,654  0.83
> numprocs=4 filesize=107374182400 blocksize=8192   fallocate=0 sfr=1 fadvise=1  158.262790654  1,720,443,307,097  271.769M/sec    1,004,079,045,479  0.159GHz      826,905,592,751  0.84
> numprocs=4 filesize=107374182400 blocksize=131072 fallocate=0 sfr=1 fadvise=1  157.934678897  1,650,503,807,778  261.266M/sec      970,705,561,971  0.154GHz      637,953,927,131  0.66
> numprocs=4 filesize=107374182400 blocksize=8192   fallocate=1 sfr=1 fadvise=1  214.433248700  2,232,198,239,769  260.300M/sec    1,073,334,918,389  0.125GHz      861,540,079,120  0.80
> numprocs=4 filesize=107374182400 blocksize=131072 fallocate=1 sfr=1 fadvise=1  187.664764448  1,964,118,348,429  261.660M/sec      978,060,510,880  0.130GHz      668,316,194,988  0.67

Note how, especially in the first few cases, the total number of instructions required is improved (although, due to the way I did the perf stat, the sampling error is pretty large).

I haven't run that test yet, but after looking at all this I would bet that reducing the block size to 4kb (i.e. a single os/hw page) would help the no-cache-control case significantly, in particular in the concurrent case. And conversely, I'd expect that CPU efficiency will be improved by a larger block size in the cache-control case for just about any realistic block size.

I'd love to have faster storage available (faster NVMes, or multiple ones I can use for benchmarking) to see what the cutoff point for actually benefiting from concurrency is.

Also worthwhile to note that even the "best case" here from a CPU usage point of view absolutely *pales* against using direct IO.
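To spell out what the direct IO side below actually does at the syscall level: fio opens the file with O_DIRECT and submits suitably aligned writes (batched via libaio, which this sketch doesn't show). A minimal synchronous version - the 4096 byte alignment is an assumption, the real requirement depends on the device and filesystem:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	size_t	blocksize = 128 * 1024;
	int		fd = open("/srv/dev/bench/test1",
					  O_WRONLY | O_CREAT | O_DIRECT, 0644);
	void   *buf;

	/* O_DIRECT requires aligned buffers (and aligned offsets/lengths) */
	if (posix_memalign(&buf, 4096, blocksize) != 0)
		return 1;
	memset(buf, 'x', blocksize);

	/*
	 * The data goes to the device without ever dirtying the page cache,
	 * which is where the large CPU savings below come from.
	 */
	if (write(fd, buf, blocksize) != (ssize_t) blocksize)
		return 1;

	fsync(fd);					/* still needed for metadata / device cache */
	close(fd);
	free(buf);
	return 0;
}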
It's not an apples/apples comparison, but comparing buffered IO using write_and_fsync with unbuffered IO using fio, at a 128KiB blocksize:

write_and_fsync:

echo 3 | sudo tee /proc/sys/vm/drop_caches && \
  /usr/bin/time perf stat -a -e cpu-clock,ref-cycles,cycles,instructions \
    /tmp/write_and_fsync --blocksize $((128*1024)) \
      --sync_file_range=1 --fallocate=1 --fadvise=1 --sequential=0 \
      --filesize=$((400*1024*1024*1024)) /srv/dev/bench/test1

 Performance counter stats for 'system wide':

      6,377,903.65 msec cpu-clock                 #   39.999 CPUs utilized
   628,014,590,200      ref-cycles                #   98.467 M/sec
   634,468,623,514      cycles                    #    0.099 GHz
   795,771,756,320      instructions              #    1.25  insn per cycle

     159.451492209 seconds time elapsed

fio:

rm -f /srv/dev/bench/test* && echo 3 | sudo tee /proc/sys/vm/drop_caches && \
  /usr/bin/time perf stat -a -e cpu-clock,ref-cycles,cycles,instructions \
    fio --name=test --iodepth=512 --iodepth_low=8 --iodepth_batch_submit=8 \
      --iodepth_batch_complete_min=8 --iodepth_batch_complete_max=128 \
      --ioengine=libaio --rw=write --bs=128k \
      --filesize=$((400*1024*1024*1024)) --direct=1 --numjobs=1

 Performance counter stats for 'system wide':

      6,313,522.71 msec cpu-clock                 #   39.999 CPUs utilized
   458,476,185,800      ref-cycles                #   72.618 M/sec
   196,148,015,054      cycles                    #    0.031 GHz
   158,921,457,853      instructions              #    0.81  insn per cycle

     157.842080440 seconds time elapsed

CPU usage most of the time was around 98% for write_and_fsync and 40% for fio. I.e. in the DIO case the system-wide CPUs were active 0.73x the time, and only 0.2x as many instructions had to be executed.

Greetings,

Andres Freund