Re: [lustre-discuss] Read/Write on specific stripe of file via C api

Apostolis Stamatis Mon, 30 Sep 2024 13:17:51 -0700

Thank you very much Andreas.

Your explanation was very insightful.


I do have the following questions/thoughts:

Let's say I have 2 available OSTs, and 4MB of data. The stripe-size is 1MB. (Sizes are small for discussion purposes, I am trying to understand what solution -if any- would perform better in general)

I would like to compare the following two strategies of writing/reading the data:

A) I can store all the data in 1 single big lustre file, striped across the 2 OSTs.

B) I can create (e.g.) 4 smaller lustre files, each consisting of 1MB of data. Suppose I place them manually in the same way that they would be striped on strategy A.

So the only difference between the 2 strategies is whether data is in a single lustre file or not (meaning I make sure each OST has a similar load in both cases).


Then:

Q1. Suppose I have 4 simultaneous processes, each wanting to read 1MB of data. On strategy A, each process opens the file (via llapi_file_open) and then reads the corresponding data by calculating the offset from the start. On strategy B each process simply opens the corresponding file and reads its data. Would there be any difference in performance between the two strategies?

Q2. Suppose I have 1 process, wanting to read the (e.g.) 3rd MB of data. Would strategy B be better, since it avoids the overhead of "skipping" to the offset that is required in strategy A?

Q3. For question 2, would the answer be different if the read is not aligned to the stripe-size? Meaning that in both strategies I would have to skip to an offset (compared to Q2 where I could just read the whole file in strategy B from the start), but in strategy A the skip is bigger.

Q4. One concern I have regarding strategy A is that all the stripes of the file that are in the same OST are seen -internally- as one object (as per "Understanding Lustre Internals"). Does this affect performance when different, but not overlapping, parts of the file (that are on the same OST) are being accessed (for example due to locking)? Does it matter if the parts being accessed are in different "chunks", e.g. the 1st and 3rd MB in the above example?

Also if there are any additional docs I can read on those topics (apart from "Understanding Lustre Internals") to get a better understanding, please do point them out.


Thanks again for your help,

Apostolis


On 9/23/24 00:42, Andreas Dilger wrote:

On Sep 18, 2024, at 10:47, Apostolis Stamatis <el18...@mail.ntua.gr> wrote:
I am trying to read/write a specific stripe for files striped across multiple OSTs. I've been looking around the C api but with no success so far.
Let's say I have a big file which is striped across multiple OSTs. I have a cluster of compute nodes which perform some computation on the data of the file. Each node needs only a subset of that data.
I want each node to be able to read/write only the needed information, so that all reads/writes can happen in parallel. The desired data may or may not be aligned with the stripes (this is secondary).
It is my understanding that stripes are just parts of the file. Meaning that if I have an array of 100 rows and stripe A contains the first half, then it would contain the first 50 rows, is this correct?
This is not totally correct. The location of the data depends on the size of the data and the stripe size.
For a 1-stripe file (the default unless otherwise specified) then all of the data would be in a single object, regardless of the size of the data.
For a 2-stripe file with stripe_size=1MiB, then the first MB of data [0-1MB) is on object 0, the second MB of data [1-2MB) is on object 1, and the third MB of data [2-3MB) is back on object 0, etc.
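The round-robin placement described above can be sketched in C as plain arithmetic. This is only an illustration, assuming a simple RAID-0 style layout with a fixed stripe size and stripe count; the struct and function names (stripe_loc, locate_offset) are made up for this example and are not part of any Lustre API.

```c
#include <assert.h>
#include <stdint.h>

/* Where a file offset lands in a plain round-robin (RAID-0 style) layout.
 * Hypothetical helper type, not a Lustre structure. */
struct stripe_loc {
    uint64_t stripe_index;  /* stripe number (object) that holds the byte */
    uint64_t object_offset; /* byte offset inside that object */
};

/* Map a file offset to (stripe number, offset within the object). */
static struct stripe_loc locate_offset(uint64_t file_offset,
                                       uint64_t stripe_size,
                                       uint64_t stripe_count)
{
    assert(stripe_size > 0 && stripe_count > 0);

    /* Index of the stripe_size-sized chunk containing the offset. */
    uint64_t chunk = file_offset / stripe_size;

    struct stripe_loc loc;
    /* Chunks are dealt out to the objects in turn. */
    loc.stripe_index = chunk % stripe_count;
    /* Each object stores its chunks back to back. */
    loc.object_offset = (chunk / stripe_count) * stripe_size
                        + file_offset % stripe_size;
    return loc;
}
```

With stripe_size = 1 MiB and stripe_count = 2, offset 0 maps to stripe 0, offset 1 MiB maps to stripe 1, and offset 2 MiB maps back to stripe 0, at object offset 1 MiB.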
See https://wiki.lustre.org/Understanding_Lustre_Internals#Lustre_File_Layouts for example.
To sum up my questions are:
1) Can I read/write a specific stripe of a file via the C api to achieve better performance/locality?
There is no Lustre llapi_* interface that provides this functionality, but you can of course read the file with regular read() or preferably pread() or readv() calls with the right file offsets.
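A minimal sketch of the pread() approach, using only POSIX calls and nothing Lustre-specific. The helper name read_chunk and its parameters are assumptions for illustration; it reads the chunk_index-th block of chunk_size bytes from a file, where chunk_size would typically be chosen equal to the stripe size.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Read chunk number chunk_index (0-based) of chunk_size bytes from path.
 * Returns the number of bytes read (short or 0 at end of file), -1 on error.
 * Hypothetical helper, not part of the Lustre API. */
static ssize_t read_chunk(const char *path, uint64_t chunk_index,
                          size_t chunk_size, void *buf)
{
    assert(path != NULL && buf != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    off_t offset = (off_t)(chunk_index * chunk_size);
    size_t done = 0;

    /* pread() may return fewer bytes than requested, so loop until the
     * chunk is complete or end of file is reached. */
    while (done < chunk_size) {
        ssize_t n = pread(fd, (char *)buf + done, chunk_size - done,
                          offset + (off_t)done);
        if (n < 0) {
            if (errno == EINTR)
                continue; /* interrupted by a signal, retry */
            close(fd);
            return -1;
        }
        if (n == 0)
            break; /* end of file */
        done += (size_t)n;
    }

    close(fd);
    return (ssize_t)done;
}
```

Because pread() takes the offset as an argument and does not move the file position, several threads or processes can read different chunks without an lseek() per read.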
2) Is it correct that stripes include parts of the file, meaning the raw data? If not, can the raw data be extracted from any additional information stored in the stripe?
For example, if you have a 4-stripe file, then the application should read every 4th MB of the file to stay on the same OST object. Note that the *OST* index is not necessarily the same as the *stripe* number of the file. To read the file from the local OST then it should check the local OST index and select that OST index from the file to determine the offset from the start of the file = stripe_size * stripe_number.
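The "every 4th MB" pattern can be sketched as follows. The function name stripe_chunk_offsets is hypothetical. The sketch assumes the stripe number for the local OST has already been determined (for example from the layout reported by lfs getstripe) and that the file uses a simple fixed stripe size and stripe count. The first offset is stripe_size * stripe_number, and each later one advances by stripe_size * stripe_count.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fill out[] with the file offsets of every chunk that belongs to
 * stripe_number, for a file of file_size bytes. Returns how many offsets
 * were written (at most max_out). Hypothetical helper, not a Lustre API. */
static size_t stripe_chunk_offsets(uint64_t file_size, uint64_t stripe_size,
                                   uint64_t stripe_count,
                                   uint64_t stripe_number,
                                   uint64_t *out, size_t max_out)
{
    assert(stripe_size > 0 && stripe_count > 0);
    assert(stripe_number < stripe_count);

    /* Distance between two consecutive chunks of the same stripe. */
    uint64_t step = stripe_size * stripe_count;
    size_t n = 0;

    for (uint64_t off = stripe_number * stripe_size;
         off < file_size && n < max_out; off += step)
        out[n++] = off;

    return n;
}
```

Each returned offset can then be passed to pread() with a length of stripe_size (the last chunk may be shorter if the file size is not a multiple of the stripe size).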
However, you could also do this more easily by having a bunch of 1-stripe files and doing the reads directly on the local OSTs. You would run "lfs find DIR -i LOCAL_OST_IDX" to get a list of the files on each OST, and then process them directly.
3) If each compute node is run on top of a different OST where stripes of the file are stored, would it be better in terms of performance to have the node read the stripe of its OST? (because e.g. it avoids data transfer over the network)
This is not necessarily needed, if you have a good network, but it depends on the workload. Local PCI storage access is about the same speed as remote PCI network access because they are limited by the PCI bus bandwidth. You would notice a difference if you have a large number of clients that are completely IO-bound and overwhelm the storage.
Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud

_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
