Hi Arnold,

If AWKBUFSIZE translates to the disk IO request size, then it is already what is needed. However, it's a little annoying.
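For reference, a per-invocation override is straightforward; a quick sketch (the file name `big.log` is just a placeholder):

```shell
# Ask gawk for a 1 MiB read buffer instead of the st_blksize default.
AWKBUFSIZE=1048576 gawk '{ n++ } END { print n }' big.log

# AWKBUFSIZE=exact makes gawk use each input file's size as the buffer size.
AWKBUFSIZE=exact gawk '{ n++ } END { print n }' big.log
```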
Regarding optimal settings, the benchmark itself tells you what is optimal. Let's assume grep or any other tool can process 3 GB/s in memory. If your device can serve 5 GB/s, then you can saturate the CPU. If, however, the device needs at least block size X to reach its maximum throughput, then that is the size you have to use. Plain and simple. And as I said, when going into GB/s territory, reads at the application level have to be asynchronous. If you look at benchmarking tools like ATTO, you see the graphs clearly and see the scaling for SSDs. It just happens that the value that is good for SSDs (minimum 512 KB) also benefits HDD RAID arrays with strip sizes smaller than 512 KB.

With HDD RAID arrays it unfortunately does get complicated, because you have to know the number of disks and the strip size. I, for example, always use tune2fs and set those parameters when formatting the partition. This could just as well be a configurable OS parameter per drive, and based on the location of the file, the right value could be used. But I have to admit that this would add exponential complexity with diminishing returns versus just setting a buffer size of 1 MB (which will cover both current and future SSDs).

Also, I'm not too fond of heuristics or any other smartness at the IO level in the Linux IO stack. I work with large databases (as a user) and have discussed the Linux IO stack with database developers. The common opinion is that the Linux IO stack got out of control and nobody actually has a good overview of it anymore, and I tend to agree. Linux needs an IO stack that is as lean as possible, letting applications decide what to do, since at the application level you know your usage pattern. I already had to fine-tune the database because of it.

On Wed, 1 Jan 2020 at 21:24, <arn...@skeeve.com> wrote:
> Hi.
>
> Sergiu Hlihor <s...@discovergy.com> wrote:
>
> > Arnold, there is no need to write user code, it is already done in
> > benchmarks. One of the standard benchmarks when testing HDDs and SSDs
> > is read throughput vs block size at different queue depths.
>
> I think you're misunderstanding me, or I am misunderstanding you.
>
> As the gawk maintainer, I can choose the buffer size to use every time
> I issue a read(2) system call for any given input file. Gawk currently
> uses the smaller of (a) the file's size or (b) the st_blksize member of
> the struct stat array.
>
> If I understand you correctly, this is "not enough"; gawk (grep,
> cp, etc.) should all use an optimal buffer size that depends upon the
> underlying storage hardware where the file is located.
>
> So far, so good, except for: How do I determine what that number is?
> I cannot run a benchmark before opening each and every file. I don't
> know of a system call that will give me that number. (If there is,
> please point me to it.)
>
> Do you just want a command line option or environment variable
> that you, as the application user, can set?
>
> If the latter, it happens that gawk will let you set AWKBUFSIZE and
> it will use whatever number you supply for doing reads. (This is
> even documented.)
>
> HTH,
>
> Arnold
>
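PS: for reference, the tune2fs/strip-size tuning I mentioned looks roughly like this for ext4. The geometry below is a made-up example (RAID-5 over 4 disks, so 3 data disks, with a 128 KiB strip; `/dev/md0` is a placeholder device):

```shell
# stride       = strip size / filesystem block size = 131072 / 4096 = 32
# stripe-width = stride * number of data disks      = 32 * 3       = 96
mkfs.ext4 -b 4096 -E stride=32,stripe-width=96 /dev/md0

# tune2fs -E accepts the same stride/stripe-width extended options,
# so an existing filesystem can be adjusted after the fact as well.
```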