Hello,

On 11/05/2025 23:49, Tim Kientzle wrote:
> On May 11, 2025, at 6:50 AM, Klaus Kusche <klaus.kus...@computerix.info> 
> wrote:
>>
>> I regularly backup hundreds of thousands of very small files with tar.
>> Currently, this results in many very small sequential read requests.
> 
> Are these small read requests occurring because the files are small?
> Or is tar deliberately making small read requests?

The files are very small.
As quoted further down, tar's speed increases significantly for large files;
in fact, it already improves noticeably for medium-sized files.

> A few possible experiments:
>  * Use a larger request size when reading file data.  64k or 128k, perhaps?
> 
>  * Try mmap-ing the input files, either relying on the kernel’s read-ahead
>     logic or having a background thread that reads a single byte every 4k or
>     so to prompt page-ins ahead of the main thread.

As I said, the files are very small,
almost all of them less than 4K.
So each file is a single page and a single I/O request anyway,
independent of the request size.
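
What would help instead is getting several of those single-page reads
in flight at the same time.  One cheap way to do that, as a minimal
sketch only (not libarchive code; prefetch_files and the window size
are made up):

    #include <fcntl.h>
    #include <unistd.h>

    /* Hint the kernel to start paging in the next few files while the
     * current one is still being archived.  POSIX_FADV_WILLNEED starts
     * asynchronous readahead, and the pages it brings in stay in the
     * page cache after the descriptor is closed again. */
    static void prefetch_files(char **paths, int count)
    {
        for (int i = 0; i < count; i++) {
            int fd = open(paths[i], O_RDONLY);
            if (fd < 0)
                continue;               /* best effort only */
            posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);
            close(fd);
        }
    }

The archiving loop would call something like
prefetch_files(&paths[i + 1], 32) before handling paths[i].
This only pays off when the open() itself is cheap, i.e. when the
metadata is already cached, which matches what I describe further down.
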
>  * Compare how `star` performs; it uses a very different buffering
>    architecture which may uncover other possibilities.

My distribution no longer packages star.
And I think star uses a large buffer between reading and writing:
it reads ahead into the buffer while reading is faster than writing,
and keeps writing from the buffer when reading is slow for short periods.
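
Roughly the kind of decoupling I mean, as a sketch only (a bounded FIFO
of blocks between a reader thread and a writer thread; this is not
star's actual code, and all the names are made up):

    #include <pthread.h>
    #include <string.h>

    #define BLOCKSIZE (64 * 1024)
    #define NBLOCKS   1024                   /* 64 MB of buffering */

    /* Bounded FIFO of archive blocks.  The reader thread fills it as
     * fast as the source allows, the writer thread drains it; a full
     * buffer blocks the reader, an empty one blocks the writer.
     * (Too big for the stack: allocate it statically or on the heap.) */
    struct fifo {
        char            blocks[NBLOCKS][BLOCKSIZE];
        size_t          len[NBLOCKS];        /* bytes used per block */
        int             head, tail, count;
        int             eof;                 /* reader is finished   */
        pthread_mutex_t lock;
        pthread_cond_t  not_full, not_empty;
    };

    static void fifo_init(struct fifo *f)
    {
        memset(f, 0, sizeof(*f));
        pthread_mutex_init(&f->lock, NULL);
        pthread_cond_init(&f->not_full, NULL);
        pthread_cond_init(&f->not_empty, NULL);
    }

    /* Reader side: append one block (n <= BLOCKSIZE), waiting while
     * the FIFO is full. */
    static void fifo_put(struct fifo *f, const char *data, size_t n)
    {
        pthread_mutex_lock(&f->lock);
        while (f->count == NBLOCKS)
            pthread_cond_wait(&f->not_full, &f->lock);
        memcpy(f->blocks[f->tail], data, n);
        f->len[f->tail] = n;
        f->tail = (f->tail + 1) % NBLOCKS;
        f->count++;
        pthread_cond_signal(&f->not_empty);
        pthread_mutex_unlock(&f->lock);
    }

    /* Reader side: signal end of input. */
    static void fifo_done(struct fifo *f)
    {
        pthread_mutex_lock(&f->lock);
        f->eof = 1;
        pthread_cond_broadcast(&f->not_empty);
        pthread_mutex_unlock(&f->lock);
    }

    /* Writer side: take one block; returns 0 once the reader is done
     * and the FIFO has drained. */
    static size_t fifo_get(struct fifo *f, char *data)
    {
        size_t n = 0;
        pthread_mutex_lock(&f->lock);
        while (f->count == 0 && !f->eof)
            pthread_cond_wait(&f->not_empty, &f->lock);
        if (f->count > 0) {
            n = f->len[f->head];
            memcpy(data, f->blocks[f->head], n);
            f->head = (f->head + 1) % NBLOCKS;
            f->count--;
            pthread_cond_signal(&f->not_full);
        }
        pthread_mutex_unlock(&f->lock);
        return n;
    }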

But as far as I know, star does not read in parallel?

>> (my tar reads and archives up to 2 GB/s when the input files 
>> are GB-sized, including on-the-fly compression).
> 
> This suggests the real issue may be opening the files rather than
> reading them.  That is, you may be seeing small read requests from
> the filesystem code (reading directory pages and stat-ing files)
> rather than from reading the file contents.  That’s a very different
> problem.

File opening might be part of the problem,
but it is most likely not the main one.
The small files are spread across small directories,
3 to 8 files per directory.
So in theory there is one 4K read per file for the data,
one 4K read per roughly four files for the directory blocks,
and one 4K read per 16 files for the inode blocks.

So, for a perfect solution, directory reading and file opening
should also be done in parallel.
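
By "in parallel" I mean something roughly like this on Linux (an
untested sketch using liburing; read_batch and BATCH are made-up names,
error handling and the hand-off to the archiver are left out, and stat
could be batched the same way with io_uring_prep_statx):

    #include <liburing.h>
    #include <fcntl.h>

    #define BATCH 64

    /* Queue a batch of opens in one io_uring submission, then a batch
     * of reads in a second one, so the SSD sees many outstanding
     * requests instead of one at a time.  Assumes n <= BATCH; the
     * resulting fds still have to be closed by the caller. */
    static void read_batch(const char **paths, int n)
    {
        struct io_uring ring;
        struct io_uring_cqe *cqe;
        int fds[BATCH];
        static char bufs[BATCH][4096];

        io_uring_queue_init(BATCH, &ring, 0);

        /* 1. submit all the opens at once */
        for (int i = 0; i < n; i++) {
            struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
            io_uring_prep_openat(sqe, AT_FDCWD, paths[i], O_RDONLY, 0);
            io_uring_sqe_set_data(sqe, (void *)(long)i);
        }
        io_uring_submit(&ring);
        for (int i = 0; i < n; i++) {
            io_uring_wait_cqe(&ring, &cqe);
            fds[(long)io_uring_cqe_get_data(cqe)] = cqe->res;
            io_uring_cqe_seen(&ring, cqe);
        }

        /* 2. submit all the data reads at once */
        for (int i = 0; i < n; i++) {
            struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
            io_uring_prep_read(sqe, fds[i], bufs[i], sizeof(bufs[i]), 0);
            io_uring_sqe_set_data(sqe, (void *)(long)i);
        }
        io_uring_submit(&ring);
        for (int i = 0; i < n; i++) {
            io_uring_wait_cqe(&ring, &cqe);
            /* bufs[...] now holds cqe->res bytes of file data */
            io_uring_cqe_seen(&ring, cqe);
        }

        io_uring_queue_exit(&ring);
    }

The two-phase batching avoids tying each read to its own open;
io_uring can also chain the two with IOSQE_IO_LINK and fixed file
descriptors, but that is more involved.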

However, in practice, immediately before I back up my system,
I rebuild the system-wide file index.
So all the directory and inode info should already be cached in RAM
when tar starts, and all SSD activity should be for data reads only
(at least that's what I'd expect / hope).

