It was suggested that I post an RFC here. With advance apologies for its length, I am looking for comments, remarks, and/or advice.

I'm currently trying to code a C language structure for a complex block of 0x200 bytes in size from a paper document. This particular declaration is rife with bitfields, mis-aligned base datatypes, and reserved fields. What I need is to verify my coding against the paper document before I start testing code which uses it. I've looked through gcc documents and find nothing which fits this need: no obvious -d pass which would give me such a map, nor any other option or standalone program. I'm proposing to create a standalone program, as I don't know enough about gcc internals to force such maps to be generated during compilation ... I'm not certain this would be appropriate, anyway. OTOH, I can't be the first to propose solving this problem.
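For concreteness, here's an invented fragment (not my actual layout) showing the kinds of features involved:

    struct dev_regs {                 /* invented example, not my real layout */
        unsigned int   flags  : 3;    /* bitfields sharing a storage unit     */
        unsigned int   mode   : 5;
        unsigned char  rsvd0;         /* reserved byte                        */
        unsigned short count;         /* 2-byte field between bitfield runs   */
        unsigned int   status : 12;   /* bitfield crossing a byte boundary    */
        unsigned int          : 4;    /* unnamed reserved bits                */
        unsigned long long base;      /* alignment of this varies by ABI      */
    };

What I want is a trustworthy map of where the compiler actually put each of these, so I can check it against the paper document.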

AFAICT, the only accurate source data for such an aggregate map would be DWARF DIE entries. This would mean that I'd be post-processing ELF objects generated with the -g switch. What I'd like to end up with is a program which is portable among ELF types and outputs accurate data when run on foreign CPU architectures; much like readelf attempts to. This means I'd need to generically handle Elf64 and Elf32 formats, as well as respect endianism. I'd like this program to be able to properly extract aggregate maps from the three basic ELF types: relocatable, absolute, and DSO.
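As a sketch of the identification step, using nothing but <elf.h> -- error handling elided, and 'fd' assumed to be an already-open descriptor:

    #include <elf.h>
    #include <string.h>
    #include <unistd.h>

    /* Returns 0 on success; fills *is64 and *bigend from e_ident. */
    static int identify_elf(int fd, int *is64, int *bigend)
    {
        unsigned char ident[EI_NIDENT];

        if (read(fd, ident, EI_NIDENT) != EI_NIDENT)
            return -1;
        if (memcmp(ident, ELFMAG, SELFMAG) != 0)
            return -1;                          /* not an ELF file */
        *is64   = (ident[EI_CLASS] == ELFCLASS64);
        *bigend = (ident[EI_DATA]  == ELFDATA2MSB);
        return 0;
    }

Everything after this point has to branch (or better, abstract) on those two answers.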

I've done some reading of the DWARF-2 and DWARF-3 standards. What I see emitted as DWARF in gcc-4.1.x is marked as DWARF-2; but I'm told that gcc properly emits DWARF-3. Should I even check the version? Will I need to vary processing features by gcc level?

Since I'm targeting all three ELF object types, I'm thinking I have to start reading the sections .debug_info and .debug_abbrev at the DWARF CU level, and build a tree of CU-related data. I've actually started this bit of code and am somewhat happy with test results so far.
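For reference, the CU header I'm decoding at the top of each unit looks like this (32-bit DWARF, per the DWARF-2/3 specs; the packed struct is illustrative only -- the real code reads field-by-field so it can byte-swap for foreign endianism):

    #include <stdint.h>

    /* 32-bit DWARF compilation-unit header at the start of each CU in
       .debug_info (DWARF-2/3).  The next CU header sits at this one's
       offset + 4 + unit_length, which is how I count the CUs. */
    struct cu_header {
        uint32_t unit_length;          /* CU size, excluding this field    */
        uint16_t version;              /* 2 or 3                           */
        uint32_t debug_abbrev_offset;  /* this CU's start in .debug_abbrev */
        uint8_t  address_size;         /* target pointer size in bytes     */
    } __attribute__((packed));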

DWARF isn't exactly user-friendly for my purposes. I've built, read about, and attempted to grok libdwarf ... I have to confess that I don't quite get it; but I think I have a functional understanding of how DWARF data is laid out and related within gcc-emitted objects. Therefore, I'm planning not to use libdwarf except for the DW_* preprocessor constants in libdwarf.h. I expect to hear that I should be using gcc/dwarf.h instead; but I'd need to be convinced that its enum forms are versatile enough to use in decoding.
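Whichever header supplies the DW_* names, the decoding work itself is dominated by LEB128. Here's the sort of minimal unsigned reader I have in mind (sketch; no bounds checking shown):

    #include <stdint.h>

    /* Decode one unsigned LEB128 value; advances *p past it. */
    static uint64_t read_uleb128(const unsigned char **p)
    {
        uint64_t result = 0;
        unsigned shift = 0;
        unsigned char byte;

        do {
            byte = *(*p)++;
            result |= (uint64_t)(byte & 0x7f) << shift;
            shift += 7;
        } while (byte & 0x80);
        return result;
    }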

Nor am I planning to use libelf: reading the ELF header and finding the sections I need is straightforward enough that libelf seems top-heavy, although elf.h is very valuable to me.
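For example, here's roughly what the section lookup amounts to in the easy case -- Elf64 with host-matching byte order, the whole file already read into memory; the real code has to abstract this over both classes and both endianisms:

    #include <elf.h>
    #include <string.h>

    /* Find a section header by name via .shstrtab.  'image' is the
       entire object file in memory; assumes Elf64 and host byte order. */
    static const Elf64_Shdr *find_section(const unsigned char *image,
                                          const char *name)
    {
        const Elf64_Ehdr *eh = (const Elf64_Ehdr *)image;
        const Elf64_Shdr *sh = (const Elf64_Shdr *)(image + eh->e_shoff);
        const char *strtab = (const char *)image
                             + sh[eh->e_shstrndx].sh_offset;
        int i;

        for (i = 0; i < eh->e_shnum; i++)
            if (strcmp(strtab + sh[i].sh_name, name) == 0)
                return &sh[i];
        return NULL;
    }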

I don't know enough about Fortran, Ada, or Java to even be conversant in how these languages aggregate; if this program ends up supporting them, it will be by accident. I'm starting with the C language dialects (DW_LANG_C, DW_LANG_C89, DW_LANG_C99) and intend to add support for C++ once testing against C is complete -- C++ classes add complexity that I really don't want to deal with during initial development.

Here's the high-level functional design I've got so far:
1. Read the ELF object file; determine its type, endianism, and bitness as written. Find the .shstrtab section and point at the section headers. Through the section headers, find the .debug_info and .debug_abbrev sections. Read them into memory and keep these copies for the rest of the program's lifecycle.

2. Index through the DWARF CU headers (in .debug_info) and determine a count of CUs.

3. For each CU:
   (3a) Create an array of pointers into .debug_abbrev, reading the DW_TAGs as we go. For each DW_TAG_{structure_type, union_type}, mark the abbrev index item as "of interest".
   (3b) Iterate through the .debug_info DIEs as delimited for the current CU, looking for abbreviation numbers we've marked as "of interest". For each interesting DIE:
   (3c) Preserve the source filename/path of the declaration, its starting line number, the aggregate type (struct/union), its tag (if any), and its total length. Then walk its child DIEs (the members). For each member:
   (3d) Report byte offset, byte length, bit offset, and bit length as available. [I don't particularly like the way bit offsets are reported within multi-byte bitfields, so I expect to do some division by 8 and adjust the byte offset as appropriate -- see the sketch after this list. If I can't fix this to the point where it makes sense for both endianisms, then I'll abandon this tweak.] Report the base type and identifier of the member.
   (3e) If there's a typedef associated with this aggregate type, identify its name and save it.

4. When all CUs have been extracted, display all extracted information about the aggregates and their members, appropriately grouped.
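To illustrate the (3d) tweak: in DWARF-2/3, DW_AT_bit_offset counts from the MSB of the containing storage unit, so on little-endian targets I'd convert roughly as below. This is a sketch of my current understanding, and it's exactly the part I'm least sure holds up for both endianisms:

    #include <stdint.h>

    /* Normalize a DWARF-2/3 bitfield member to (byte, bit-within-byte),
       counting bits up from the LSB.  Raw attribute inputs:
         member_off = DW_AT_data_member_location (byte offset of the
                      containing storage unit within the aggregate)
         byte_size  = DW_AT_byte_size of that storage unit
         bit_offset = DW_AT_bit_offset (counted from the unit's MSB)
         bit_size   = DW_AT_bit_size
       Little-endian case only; big-endian targets can use bit_offset
       more or less directly.  Sketch, not tested. */
    static void normalize_bitfield_le(uint64_t member_off,
                                      uint64_t byte_size,
                                      uint64_t bit_offset,
                                      uint64_t bit_size,
                                      uint64_t *out_byte,
                                      unsigned *out_bit)
    {
        uint64_t from_lsb = byte_size * 8 - bit_offset - bit_size;

        *out_byte = member_off + from_lsb / 8;
        *out_bit  = (unsigned)(from_lsb % 8);
    }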

Where I've run into trouble so far is attempting to predict what kind of storage I'll need to save all this intermediate summary data. DWARF sections can get rather large, and there's no way to predict the contents until they're read. I expect this program -- if I follow the design -- to use large amounts of memory until I get to the point where I can start releasing blocks by displaying what they represent. Would I be making things easier on myself (or orders of magnitude less versatile for the user) by processing ELF relocatables only? [This seems a dead end to me, but I thought I'd ask anyway.]
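One middle ground I'm weighing for the memory question: extract and display one CU at a time, freeing each intermediate tree before reading the next, at the cost of grouping the output per-CU rather than globally. All the helper names below are hypothetical, just to show the shape:

    #include <stddef.h>

    struct cu_data;                        /* opaque intermediate tree */

    /* Hypothetical helpers -- names are mine, not from any library. */
    size_t next_cu_offset(const unsigned char *info, size_t off);
    struct cu_data *extract_cu(const unsigned char *info,
                               const unsigned char *abbrev, size_t off);
    void display_aggregates(const struct cu_data *cu);
    void free_cu(struct cu_data *cu);

    /* Bound peak memory to one CU's worth of intermediate data. */
    static void walk_cus(const unsigned char *info, size_t info_len,
                         const unsigned char *abbrev)
    {
        size_t off;

        for (off = 0; off < info_len; off = next_cu_offset(info, off)) {
            struct cu_data *cu = extract_cu(info, abbrev, off);
            display_aggregates(cu);
            free_cu(cu);
        }
    }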

I don't intend to use language syntax features in the display -- I don't plan on "recoding" structure declarations in C ... I just plan on displaying a few lines about each overall aggregate, and then listing the important characteristics of each of its members in row/column format ... then inject a few linefeeds and proceed to the next aggregate. I'm hoping this will increase the chances of "accidental" language support.
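Taking the invented dev_regs example from earlier, on an x86-64 target I'm imagining output roughly like this (layout invented to match that example; filename and line are made up):

    struct dev_regs, 16 bytes, declared at regs.h:3

      byte  len  bit  bits  type                    member
      ----  ---  ---  ----  ----------------------  ------
         0    4    0     3  unsigned int            flags
         0    4    3     5  unsigned int            mode
         1    1    -     -  unsigned char           rsvd0
         2    2    -     -  short unsigned int      count
         4    4    0    12  unsigned int            status
         8    8    -     -  long long unsigned int  base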

I'd appreciate comments. If anybody has a handle on the future of gcc & DWARF (am I wasting my time?), additional features (have I got the right scope?), or just general "been there, done that ... watch out for ..." kind of remarks, then I'm listening. Am I trying to hit a moving target wrt DWARF content? I'd also like to hear if anybody has a better idea other than DWARF data as the source for this kind of process ... is there a gcc phase I've missed? What features would *you* like to see in an aggregate mapper?

/Of course/ I'll contribute this, assuming my employer's Legal staff will permit me to do so, you guys want it, and the code doesn't end up a maintenance or licensing nightmare.

Please post replies to the list.

TIA,
---Jim--
