Hi Arnold,

If AWKBUFSIZE translates to the disk IO request size, then it is already what is needed. However, it's a little annoying.
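For reference, a per-invocation override is straightforward; a quick sketch (the file name `big.log` is just a placeholder):

```shell
# Ask gawk for a 1 MiB read buffer instead of the st_blksize default.
AWKBUFSIZE=1048576 gawk '{ n++ } END { print n }' big.log

# AWKBUFSIZE=exact makes gawk use each input file's size as the buffer size.
AWKBUFSIZE=exact gawk '{ n++ } END { print n }' big.log
```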
Regarding optimal settings, the benchmark itself tells you what is optimal. Let's assume grep or any other tool can process 3 GB/s in memory. If your device can serve 5 GB/s, then you can saturate the CPU. If, however, the device needs at least block size X to reach its maximum throughput, then that is the size you have to use. Plain and simple. And as I said, when going into GB/s territory, reads at the application level have to be asynchronous. If you look at benchmarking tools like ATTO, you see the graphs clearly and see the scaling for SSDs. It just happens that the value that is good for SSDs (minimum 512 KB) also benefits HDD RAID arrays with strip sizes smaller than 512 KB.

With HDD RAID arrays it unfortunately does get complicated, because you have to know the number of disks and the strip size. I, for example, always use tune2fs and set those parameters when formatting the partition. This could just as well be a configurable OS parameter per drive, and based on the location of the file, the right value could be used. But I have to admit that this would add exponential complexity with diminishing returns versus just setting a buffer size of 1 MB (which will cover both current and future SSDs).

Also, I'm not too fond of heuristics or any other smartness at the IO level in the Linux IO stack. I work with large databases (as a user) and have discussed the Linux IO stack with database developers. The common opinion is that the Linux IO stack got out of control and nobody actually has a good overview of it anymore, and I tend to agree. Linux needs an IO stack that is as lean as possible, letting applications decide what to do, since at the application level you know your usage pattern. I already had to fine-tune the database because of it.

On Wed, 1 Jan 2020 at 21:24, <arn...@skeeve.com> wrote:
> Hi.
>
> Sergiu Hlihor <s...@discovergy.com> wrote:
>
> > Arnold, there is no need to write user code, it is already done in
> > benchmarks. One of the standard benchmarks when testing HDDs and SSDs
> > is read throughput vs block size at different queue depths.
>
> I think you're misunderstanding me, or I am misunderstanding you.
>
> As the gawk maintainer, I can choose the buffer size to use every time
> I issue a read(2) system call for any given input file. Gawk currently
> uses the smaller of (a) the file's size or (b) the st_blksize member of
> the struct stat array.
>
> If I understand you correctly, this is "not enough"; gawk (grep,
> cp, etc.) should all use an optimal buffer size that depends upon the
> underlying storage hardware where the file is located.
>
> So far, so good, except for: How do I determine what that number is?
> I cannot run a benchmark before opening each and every file. I don't
> know of a system call that will give me that number. (If there is,
> please point me to it.)
>
> Do you just want a command line option or environment variable
> that you, as the application user, can set?
>
> If the latter, it happens that gawk will let you set AWKBUFSIZE and
> it will use whatever number you supply for doing reads. (This is
> even documented.)
>
> HTH,
>
> Arnold
>
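PS: for reference, the tune2fs/strip-size tuning I mentioned looks roughly like this for ext4. The geometry below is a made-up example (RAID-5 over 4 disks, so 3 data disks, with a 128 KiB strip; `/dev/md0` is a placeholder device):

```shell
# stride       = strip size / filesystem block size = 131072 / 4096 = 32
# stripe-width = stride * number of data disks      = 32 * 3       = 96
mkfs.ext4 -b 4096 -E stride=32,stripe-width=96 /dev/md0

# tune2fs -E accepts the same stride/stripe-width extended options,
# so an existing filesystem can be adjusted after the fact as well.
```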