On Thu, Jan 30, 2025 at 1:21 PM Indu Bhagat <indu.bha...@oracle.com> wrote: > > On 1/27/25 5:10 PM, Andrii Nakryiko wrote: > >>>>> +struct sframe_preamble { > >>>>> + u16 magic; > >>>>> + u8 version; > >>>>> + u8 flags; > >>>>> +} __packed; > >>>>> + > >>>>> +struct sframe_header { > >>>>> + struct sframe_preamble preamble; > >>>>> + u8 abi_arch; > >>>>> + s8 cfa_fixed_fp_offset; > >>>>> + s8 cfa_fixed_ra_offset; > >>>>> + u8 auxhdr_len; > >>>>> + u32 num_fdes; > >>>>> + u32 num_fres; > >>>>> + u32 fre_len; > >>>>> + u32 fdes_off; > >>>>> + u32 fres_off; > >>>>> +} __packed; > >>>>> + > >>>>> +struct sframe_fde { > >>>>> + s32 start_addr; > >>>>> + u32 func_size; > >>>>> + u32 fres_off; > >>>>> + u32 fres_num; > >>>>> + u8 info; > >>>>> + u8 rep_size; > >>>>> + u16 padding; > >>>>> +} __packed; > >>>> I couldn't understand from SFrame itself, but why do sframe_header, > >>>> sframe_preamble, and sframe_fde have to be marked __packed, if it's > >>>> all naturally aligned (intentionally and by design)?.. > >>> Right, but the spec says they're all packed. Maybe the point is that > >>> some future sframe version is free to introduce unaligned fields. > >>> > >> SFrame specification aims to keep SFrame header and SFrame FDE members > >> at aligned boundaries in future versions. > >> > >> Only SFrame FRE related accesses may have unaligned accesses. > > Yeah, and it's actually bothering me quite a lot 🙂 I have a tentative > > proposal, maybe we can discuss this for SFrame v3? Let me briefly > > outline the idea. > > > > I looked at the idea below. It could work wrt unaligned accesses. > > Speaking of unaligned accesses, I will ask away: Is the reason to avoid > unaligned accesses performance hit or are there other practical reasons > to it ?
Performance hit on architectures like x86-64 that do support unaligned, but it's actually a CPU error for some other architectures, so you'd need to code with that in mind, making local aligned copies, etc. In general, I'd say it's a bit of a red flag that a format that is meant to be memory-mapped (effectively) and used without pre-processing requires dealing with unaligned accesses. So if we can fix that, that would be a win. > > > So, currently in v2, FREs within FDEs use an array-of-structs layout. > > If we use preudo-C type definitions, it would be something like this > > for FDE + its FREs: > > > > struct FDE_and_FREs { > > struct sframe_func_desc_entry fde_metadata; > > > > union FRE { > > struct FRE8 { > > u8 sfre_start_address; > > u8 sfre_info; > > u8|u16|u32 offsets[M]; > > } > > struct FRE16 { > > u16 sfre_start_address; > > u16 sfre_info; > > u8|u16|u32 offsets[M]; > > } > > struct FRE32 { > > u32 sfre_start_address; > > u32 sfre_info; > > u8|u16|u32 offsets[M]; > > } > > } fres[N] __packed; > > }; > > > > where all fres[i]s are one of those FRE8/FRE16/FRE32, so start > > addresses have the same size, but each FRE has potentially different > > offsets sizing, so there is no common alignment, and so everything has > > to be packed and unaligned. > > > > But what if we take a struct-of-arrays approach and represent it more like: > > > > struct FDE_and_FREs { > > struct sframe_func_desc_entry fde_metadata; > > u8|u16|u32 start_addrs[N]; /* can extend to u64 as well */ > > u8 sfre_infos[N]; > > u8 offsets8[M8]; > > u16 offsets16[M16] __aligned(2); > > u32 offsets32[M32] __aligned(4); > > /* we can naturally extend to support also u64 offsets */ > > }; > > > > i.e., we split all FRE records into their three constituents: start > > addresses, info bytes, and then each FRE can fall into either 8-, 16-, > > or 32-bit offsets "bucket". We collect all the offsets, depending on > > their size, into these aligned offsets{8,16,32} arrays (with natural > > extension to 64 bits, if necessary), with at most wasting 1-3 bytes to > > ensure proper alignment everywhere. > > > > Note, at this point we need to decide if we want to make FREs binary > > searchable or not. > > > > If not, we don't really need anything extra. As we process each > > start_addrs[i] and sfre_infos[i] to find matching FRE, we keep track > > of how many 8-, 16-, and 32-bit offsets already processed FREs > > consumed, and when we find the right one, we know exactly the starting > > index within offset{8,16,32}. Done. > > > > But if we were to make FREs binary searchable, we need to basically > > have an index of offset pointers to quickly find offsetsX[j] position > > corresponding to FRE #i. For that, we can have an extra array right > > next to start_addrs, "semantically parallel" to it: > > > > u8|u16|u32 start_addrs[N]; > > u8|u16|u32 offset_idxs[N]; > > > > where start_addrs[i] corresponds to offset_idxs[i], and offset_idxs[i] > > points to the first offset corresponding to FRE #i in offsetX[] array > > (depending on FRE's "bitness"). This is a bit more storage for this > > offset index, but for FDEs with lots of FREs this might be a > > worthwhile tradeoff. > > > > Few points: > > a) we can decide this "binary searchability" per-FDE, and for FDEs > > with 1-2-3 FREs not bother, while those with more FREs would be > > searchable ones with index. So we can combine both fast lookups, > > natural alignment of on-disk format, and compactness. The presence of > > index is just another bit in FDE metadata. > > I have been going back and forth on this one. So there seem to be the > following options here: > #1. Make "binary searchability" a per-FDE decision. > #2. Make "binary searchability" a per-section decision (I expect > aarch64 to have very low number of FREs per FDE). > #3. Bake "binary searchability" into the SFrame FRE specification. > So its always ON for all FDEs. The advantage is that it makes stack > tracers simpler to implement with less code. > > I do think #2, #3 appear simpler in concept. Whichever makes it easier across the entire stack (compiler, linker, kernel/unwinder). As long as binary searchability is possible, especially for FDEs with lots of FREs. Making it per-FDE just allows to pick most compact (but still with good performance) representation. > > > b) bitness of offset_idxs[] can be coupled with bitness of > > start_addrs (for simplicity), or could be completely independent and > > identified by FDE's metadata (2 more bits to define this just like > > start_addr bitness is defined). Independent probably would be my > > preference, with linker (or whoever will be producing .sframe data) > > can pick the smallest bitness that is sufficient to represent > > everything. > > > > ATM, GAS does apply special logic to decide the bitness of start_addrs > per function, and ld just uses that info. Coupling the bitness of > offset_idx with bitness of start_addrs will be easy (or _easier_ I > think), but for now, I leave it as "should be doable" :) Those offsets are relative to fde's start_addr, right? So, generally speaking, should be usually small? I my understanding is correct, then yeah, coupling is probably ok. > > > Yes, it's a bit more complicated to draw and explain, but everything > > will be nicely aligned, extensible to 64 bits, and (optionally at > > least) binary searchable. Implementation-wise on the kernel side it > > shouldn't be significantly more involved. Maybe the compiler would > > need to be a bit smarter when producing FDE data, but it's no rocket > > science. > > > > Thoughts? > > Combining the requirements from your email and Josh's follow up: > - No unaligned accesses > - Sorted FREs > > I would put compaction as a "good to have" requirement. It appears to > me that any compaction will mean a sort of post-processing which will > interfere with JIT usecase. > sgtm