On Sep 30, 2024, at 13:26, Apostolis Stamatis 
<el18...@mail.ntua.gr> wrote:


Thank you very much Andreas.

Your explanation was very insightful.

I do have the following questions/thoughts:

Let's say I have 2 available OSTs and 4 MB of data. The stripe size is 1 MB. 
(Sizes are small for discussion purposes; I am trying to understand which 
solution, if any, would perform better in general.)

I would like to compare the following two strategies of writing/reading the 
data:

A) I can store all the data in 1 single big lustre file, striped across the 2 
OSTs.

B) I can create (e.g.) 4 smaller lustre files, each consisting of 1 MB of data. 
Suppose I place them manually on the OSTs in the same way that they would be 
striped in strategy A.

So the only difference between the 2 strategies is whether data is in a single 
lustre file or not (meaning I make sure each OST has a similar load in both 
cases).

Then:

Q1. Suppose I have 4 simultaneous processes, each wanting to read 1MB of data. 
In strategy A, each process opens the file (via llapi_file_open) and then reads 
the corresponding data by calculating the offset from the start. In strategy B, 
each process simply opens the corresponding file and reads its data. Would 
there be any difference in performance between the two strategies?

For reading it is unlikely that there would be a significant difference in 
performance.  For writing, option A would be somewhat slower than B for large 
amounts of data, because there would be some lock contention between parallel 
writers to the same file.

However, if this behavior is expanded to a large scale, then having millions or 
billions of 1MB files would add a different kind of overhead: opening/closing 
each file separately and having to manage so many files vs. having 
fewer/larger files.  Given that a single client can read/write GB/s, it makes
sense to aggregate enough data per file to amortize the overhead of the 
lookup/open/stat/close.

Large-scale HPC applications try to pick a middle ground, for example having 1 
file per checkpoint timestep written in parallel (instead of 1M separate 
per-CPU files), but each timestep (hourly) has a different file.  Alternatively, 
each timestep could write individual files into a separate directory, if they 
are reasonably large (e.g. GB).


Q2. Suppose I have 1 process wanting to read (e.g.) the 3rd MB of data. Would 
strategy B be better, since it avoids the overhead of "skipping" to the offset 
that is required in strategy A?

Seeking the offset pointer within a file has no cost.  That is just changing a 
number in the open file descriptor on the client, so it doesn't involve the 
servers or any kind of locking.
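
As a minimal sketch (the path and sizes here are only placeholders), the offset 
is simply a parameter to pread(2), so there is no separate "skip" step at all:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
        const size_t chunk = 1024 * 1024;       /* 1 MiB */
        const off_t offset = 2 * 1024 * 1024;   /* start of the 3rd MiB */
        char *buf = malloc(chunk);
        int fd = open("/mnt/lustre/bigfile", O_RDONLY);  /* placeholder path */
        ssize_t rc;

        if (fd < 0 || buf == NULL)
                return 1;

        /* pread() reads at the given offset without changing the file
         * position; no lseek() call and no server round trip is needed
         * just to "move" to the 3rd MiB. */
        rc = pread(fd, buf, chunk, offset);
        printf("read %zd bytes at offset %lld\n", rc, (long long)offset);

        close(fd);
        free(buf);
        return rc == (ssize_t)chunk ? 0 : 1;
}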


Q3. For question 2, would the answer be different if the read is not aligned to 
the stripe-size? Meaning that in both strategies I would have to skip to an 
offset (compared to Q2 where I could just read the whole file in strategy B 
from the start), but in strategy A the skip is bigger.

Same answer as 2 - the seeking itself has no cost.  The *read* of unaligned 
data in this case is likely to be somewhat slower than reading aligned data (it 
may send RPCs to two OSTs, needing two separate locks, etc).  However, with any 
large-sized read (e.g. 8 MB+) it is unlikely to make a significant difference.


Q4. One concern I have regarding strategy A is that all the stripes of the file 
that are in the same OST are seen -internally- as one object (as per 
"Understanding Lustre Internals"). Does this affect performance when different, 
but not overlapping, parts of the file (that are on the same OST) are being 
accessed (for example due to locking)? Does it matter if the parts being 
accessed are in different "chunks", e.g. the 1st and 3rd MB in the above example?

No, Lustre can allow concurrent read access to a single object from multiple 
threads/clients.  When writing the file, there can also be concurrent write 
access to a single object, but only with non-overlapping regions.  That would 
also be true if writing to separate files in option B (contention if two 
processes tried to write the same small file).
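
As a concrete (made-up) illustration of the non-overlapping case, each process 
below writes only its own 1 MiB region of the shared file, so the byte ranges 
never conflict:

#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Sketch: process "rank" (0, 1, 2, ...) writes its own 1 MiB region of
 * one shared file; the path and rank handling are placeholders. */
int write_my_region(int rank)
{
        const size_t chunk = 1024 * 1024;
        off_t offset = (off_t)rank * chunk;     /* disjoint region per rank */
        char *buf = malloc(chunk);
        int fd, ret = -1;

        if (buf == NULL)
                return -1;
        memset(buf, 'A' + rank, chunk);

        fd = open("/mnt/lustre/shared_file", O_WRONLY | O_CREAT, 0644);
        if (fd >= 0) {
                /* Each writer touches a different byte range, so the
                 * writes do not contend for the same region of the file. */
                if (pwrite(fd, buf, chunk, offset) == (ssize_t)chunk)
                        ret = 0;
                close(fd);
        }
        free(buf);
        return ret;
}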


Also if there are any additional docs I can read on those topics (apart from 
"Understanding Lustre Internals") to get a better understanding, please do 
point them out.

Patrick Farrell has presented at LAD and LUG a few times about optimizations to 
the IO pipeline, which may be interesting:
https://wiki.lustre.org/Lustre_User_Group_2022
- https://wiki.lustre.org/images/a/a3/LUG2022-Future_IO_Path-Farrell.pdf
https://www.eofs.eu/index.php/events/lad-23/
- https://www.eofs.eu/wp-content/uploads/2024/02/04-LAD-2023-Unaligned-DIO.pdf
https://wiki.lustre.org/Lustre_User_Group_2024
- https://wiki.lustre.org/images/a/a0/LUG2024-Hybrid_IO_Path_Update-Farrell.pdf


Thanks again for your help,

Apostolis


On 9/23/24 00:42, Andreas Dilger wrote:
On Sep 18, 2024, at 10:47, Apostolis Stamatis 
<el18...@mail.ntua.gr> wrote:
I am trying to read/write a specific stripe for files striped across multiple 
OSTs. I've been looking around the C API but with no success so far.


Let's say I have a big file which is striped across multiple OSTs. I have a 
cluster of compute nodes which perform some computation on the data of the 
file. Each node needs only a subset of that data.

I want each node to be able to read/write only the needed information, so that 
all reads/writes can happen in parallel. The desired data may or may not be 
aligned with the stripes (this is secondary).

It is my understanding that stripes are just parts of the file. Meaning that if 
I have an array of 100 rows and stripe A contains the first half, then it would 
contain the first 50 rows, is this correct?

This is not totally correct.  The location of the data depends on the size of 
the data and the stripe size.

For a 1-stripe file (the default unless otherwise specified), all of the 
data would be in a single object, regardless of the size of the data.

For a 2-stripe file with stripe_size=1MiB, the first MB of data [0-1MB) is 
on object 0, the second MB of data [1-2MB) is on object 1, and the third MB of 
data [2-3MB) is back on object 0, etc.

See https://wiki.lustre.org/Understanding_Lustre_Internals#Lustre_File_Layouts 
for example.
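
As a small sketch of that mapping (mine, using the same 2-stripe / 1 MiB 
example), the stripe number and the offset within the corresponding object can 
be computed directly from the stripe size and stripe count:

#include <stdint.h>
#include <stdio.h>

/* Map a logical file offset to (stripe number, offset within that
 * stripe's object) for a plain RAID-0 style layout, restating the
 * round-robin rule above. */
static void map_offset(uint64_t file_off, uint64_t stripe_size,
                       uint32_t stripe_count,
                       uint32_t *stripe_nr, uint64_t *obj_off)
{
        uint64_t chunk = file_off / stripe_size;   /* which stripe-sized chunk */

        *stripe_nr = chunk % stripe_count;         /* round-robin over objects */
        *obj_off   = (chunk / stripe_count) * stripe_size +
                     file_off % stripe_size;       /* offset inside that object */
}

int main(void)
{
        uint32_t nr;
        uint64_t off;

        /* 2-stripe file, 1 MiB stripe size: the third MB [2-3MB) lands
         * back on object 0, at offset 1 MiB within that object. */
        map_offset(2 * 1048576ULL, 1048576ULL, 2, &nr, &off);
        printf("stripe %u, object offset %llu\n", nr, (unsigned long long)off);
        return 0;
}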

To sum up my questions are:

1) Can I read/write a specific stripe of a file via the C API to achieve better 
performance/locality?

There is no Lustre llapi_* interface that provides this functionality, but you 
can of course read the file with regular read() or preferably pread() or 
readv() calls with the right file offsets.
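
For example, a rough sketch (names and layout parameters are placeholders; the 
stripe size and count would come from the layout, e.g. via lfs getstripe) of 
reading all the chunks that belong to one stripe of a file using only pread():

#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* Sketch: read every chunk of "path" that belongs to stripe number
 * stripe_nr, given the file's stripe_size and stripe_count. */
int read_one_stripe(const char *path, uint32_t stripe_nr,
                    uint64_t stripe_size, uint32_t stripe_count)
{
        char *buf = malloc(stripe_size);
        int fd = open(path, O_RDONLY);
        ssize_t rc = 1;

        if (fd < 0 || buf == NULL) {
                rc = -1;
                goto out;
        }

        /* The k-th chunk of stripe "stripe_nr" starts at file offset
         * (k * stripe_count + stripe_nr) * stripe_size. */
        for (uint64_t k = 0; rc > 0; k++) {
                off_t off = (off_t)((k * stripe_count + stripe_nr) * stripe_size);

                rc = pread(fd, buf, stripe_size, off);
                /* ... process the rc bytes read here ... */
        }
out:
        if (fd >= 0)
                close(fd);
        free(buf);
        return rc < 0 ? -1 : 0;
}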

2) Is it correct that stripes include parts of the file, meaning the raw data? 
If not, can the raw data be extracted from any additional information stored in 
the stripe?

For example, if you have a 4-stripe file, then the application should read 
every 4th MB of the file to stay on the same OST object. Note that the *OST* 
index is not necessarily the same as the *stripe* number of the file.  To read 
the file from the local OST, the application should look up the local OST 
index, find which stripe of the file maps to that OST, and use that stripe 
number to compute the offset from the start of the file 
= stripe_size * stripe_number.
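
A rough sketch of that lookup from C, assuming llapi_file_get_stripe() and the 
lov_user_md/lov_user_ost_data structures behave as in current lustreapi headers 
(treat the exact struct fields as an assumption to check against your Lustre 
version); "lfs getstripe FILE" shows the same stripe-to-OST mapping from the 
command line:

#include <lustre/lustreapi.h>
#include <stdio.h>
#include <stdlib.h>

/* Sketch (assumes llapi_file_get_stripe() fills a lov_user_md with one
 * lov_user_ost_data entry per stripe): print which OST index backs each
 * stripe, so the caller can pick the stripe local to this node. */
int print_stripe_osts(const char *path)
{
        size_t lum_size = sizeof(struct lov_user_md) +
                LOV_MAX_STRIPE_COUNT * sizeof(struct lov_user_ost_data);
        struct lov_user_md *lum = malloc(lum_size);
        int rc;

        if (lum == NULL)
                return -1;

        rc = llapi_file_get_stripe(path, lum);
        if (rc == 0) {
                for (int i = 0; i < lum->lmm_stripe_count; i++)
                        printf("stripe %d -> OST index %u\n",
                               i, lum->lmm_objects[i].l_ost_idx);
        }

        free(lum);
        return rc;
}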

However, you could also do this more easily by having a bunch of 1-stripe files 
and doing the reads directly on the local OSTs.  You would run "lfs find DIR -i 
LOCAL_OST_IDX" to get a list of the files on each OST, and then process them 
directly.

3) If each compute node is run on top of a different OST where stripes of the 
file are stored, would it be better in terms of performance to have the node 
read the stripe of its OST? (because e.g. it avoids data transfer over the 
network)

This is not necessarily needed if you have a good network, but it depends on 
the workload.  Local PCI storage access is about the same speed as remote PCI 
network access, because both are limited by the PCI bus bandwidth.  You would 
notice a difference if you have a large number of completely IO-bound clients 
that overwhelm the storage.

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud
