It was suggested that I post an RFC here. With advance apologies for its
length, I am looking for comments, remarks, and/or advice.
I'm currently trying to code a C language structure for a complex block,
0x200 bytes in size, described in a paper document. This particular
declaration is rife with bitfields, mis-aligned base datatypes, and
reserved fields. What I need is to verify my coding against the paper
document before I start testing code which uses it. I've looked through
the gcc documentation and found nothing which fits this need: no obvious
-d pass which would give me such a map, nor any other option or standalone
program. I'm proposing to create a standalone program, as I don't know
enough about gcc internals to force such maps to be generated during
compilation ... I'm not certain this would be appropriate, anyway. OTOH,
I can't be the first to propose solving this problem.
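To make that concrete, here's a small invented fragment with the same
flavor as the real declaration (none of these names come from the
actual document):

    #include <stdint.h>

    struct dev_block {
        uint8_t  rsvd0[3];      /* reserved */
        uint32_t status;        /* mis-aligned: lands at offset 3 */
        unsigned mode    : 3;   /* bitfields sharing one unit */
        unsigned rsvd1   : 5;   /* reserved */
        unsigned chan    : 11;
        unsigned enable  : 1;
        /* ... and so on, out to 0x200 bytes ... */
    } __attribute__((packed));

What I need to verify is the offset, length, and bit position the
compiler actually assigns to each member.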
AFAICT, the only accurate source data for such an aggregate map would be
DWARF DIE entries. This would mean that I'd be post-processing ELF
objects generated with the -g switch. What I'd like to end up with is a
program which is portable among ELF types and outputs accurate data when
run on foreign CPU architectures; much like readelf attempts to. This
means I'd need to generically handle Elf64 and Elf32 formats, as well as
respect endianness. I'd like this program to be able to properly extract
aggregate maps from the three basic ELF object types: relocatable,
executable, and shared object (DSO).
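As a sketch of the front end I have in mind (nothing beyond elf.h and
libc; everything past e_ident would go through byte-swapping accessors
chosen from EI_DATA):

    #include <elf.h>
    #include <stdio.h>
    #include <string.h>

    /* Classify an ELF image from its identification bytes alone. */
    static int sniff_elf(const unsigned char *image)
    {
        if (memcmp(image, ELFMAG, SELFMAG) != 0)
            return -1;                          /* not ELF at all */
        int is64  = (image[EI_CLASS] == ELFCLASS64);
        int isbig = (image[EI_DATA]  == ELFDATA2MSB);
        printf("%s-bit, %s-endian\n", is64 ? "64" : "32",
               isbig ? "big" : "little");
        /* e_type (ET_REL / ET_EXEC / ET_DYN) follows e_ident and is
           itself byte-order sensitive, so it gets read through the
           swapping accessors. */
        return 0;
    }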
I've done some reading of the DWARF-2 and DWARF-3 standards. What I see
emitted as DWARF in gcc-4.1.x is marked as DWARF-2; but I'm told that
gcc properly emits DWARF-3. Should I even check the version? Will I need
to vary processing features by gcc level?
Since I'm targeting all three ELF object types, I'm thinking I have to
start reading the sections .debug_info and .debug_abbrev at the DWARF CU
level, and build a tree of CU-related data. I've actually started this
bit of code and am somewhat happy with test results so far.
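In skeleton form, that CU walk comes to something like this (a sketch,
not the code itself: 32-bit DWARF format only, with the 0xffffffff
initial-length escape for 64-bit DWARF treated as unsupported):

    #include <stddef.h>
    #include <stdint.h>

    /* Byte-order-aware readers; 'big' comes from e_ident[EI_DATA]. */
    static uint32_t rd32(const unsigned char *p, int big)
    {
        return big ? (uint32_t)p[0] << 24 | p[1] << 16 | p[2] << 8 | p[3]
                   : (uint32_t)p[3] << 24 | p[2] << 16 | p[1] << 8 | p[0];
    }

    static uint16_t rd16(const unsigned char *p, int big)
    {
        return (uint16_t)(big ? p[0] << 8 | p[1] : p[1] << 8 | p[0]);
    }

    /* Walk the CU headers in .debug_info: count them and sanity-check
       each one's DWARF version along the way. */
    static unsigned count_cus(const unsigned char *info, size_t len,
                              int big)
    {
        unsigned ncu = 0;
        for (size_t off = 0; off + 11 <= len; ) { /* 11-byte CU header */
            uint32_t unit_length = rd32(info + off, big);
            uint16_t version     = rd16(info + off + 4, big);
            if (unit_length == 0xffffffffu)   /* 64-bit DWARF: punt */
                break;
            if (version < 2 || version > 3)   /* expect DWARF-2/-3 */
                break;
            ncu++;
            off += 4 + unit_length;           /* length excludes itself */
        }
        return ncu;
    }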
DWARF isn't exactly user-friendly for my purposes. I've built, read
about, and attempted to grok libdwarf ... I have to confess that I don't
quite get it; but I think I have a functional understanding of how DWARF
data is laid out and related within gcc-emitted objects. Therefore, I'm
planning not to use libdwarf except for its DW_* preprocessor constants
in libdwarf.h.
I expect to hear that I should be using gcc/dwarf.h; but I'd need to be
convinced that its enum definitions are versatile enough to use in decoding.
Nor am I planning to use libelf: finding the ELF sections I need and
reading the ELF header is straightforward enough that libelf seems
top-heavy, although elf.h is very valuable to me.
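To illustrate, here's roughly what the section lookup comes to for the
Elf64 case (the Elf32 path is parallel; native byte order is assumed
here for brevity, where the real program would swap per
e_ident[EI_DATA]):

    #include <elf.h>
    #include <string.h>

    /* Find a section by name in a memory-mapped Elf64 image. */
    static const Elf64_Shdr *find_section(const unsigned char *image,
                                          const char *want)
    {
        const Elf64_Ehdr *eh = (const Elf64_Ehdr *)image;
        const Elf64_Shdr *sh = (const Elf64_Shdr *)(image + eh->e_shoff);
        const char *names = (const char *)image
                            + sh[eh->e_shstrndx].sh_offset;
        for (unsigned i = 0; i < eh->e_shnum; i++)
            if (strcmp(names + sh[i].sh_name, want) == 0)
                return &sh[i];          /* e.g. ".debug_info" */
        return NULL;    /* absent: object wasn't compiled with -g? */
    }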
I don't know enough about Fortran, Ada, or Java to even be conversant in
how these languages lay out aggregates; if this program supports those
languages, it will be by accident. I'm starting with the C language
dialects (DW_LANG_C, DW_LANG_C89, DW_LANG_C99) and intend to add support
for C++ once testing of the C support is complete -- C++ classes add
complexity that I really don't want to deal with during initial
development.
Here's the high-level functional design I've got so far:
1. Read the ELF object file; determine its type, endianness, and
bitness as written. Point at the section headers, find the .shstrtab
section, and through the section headers locate the .debug_info and
.debug_abbrev sections. Read them into memory and make these copies
persistent through the rest of this program's lifecycle.
2. Index through the DWARF CU headers (in .debug_info) and determine a
count of CUs.
3. For each CU,
(3a) Create an array of pointers into .debug_abbrev, reading the
DW_TAGs as we go (abbreviation codes and tags are ULEB128-encoded; see
the decoder sketch after this list). For each DW_TAG_{structure_type,
union_type}, mark the abbrev index item as "of interest".
(3b) Iterate through .debug_info DIEs as delimited for the current
CU, looking for abbreviation numbers we've marked as "of interest". For
each interesting DIE,
(3c) Preserve source filename/path of declaration, starting line
number, aggregate type (struct/union), tag (if any), and total length.
Now, work its child DIEs (members). For each:
(3d) Report byte offset, byte length, bit offset, and bit length as
available. [I don't particularly like the way bit offsets are reported
within multi-byte bitfields, so I expect to do some division by 8 and
adjust the byte offset as appropriate; see the bit-position sketch
after this list. If I can't fix this to the point where it makes sense
for both byte orders, then I'll abandon this tweak.] Report the base
type and identifier of the member.
(3e) If there's a typedef associated with this aggregate type,
identify its name and save it.
4. When all CUs have been extracted, display all extracted information
about the aggregates and their members, appropriately grouped.
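Since the abbreviation table and the DIE attribute values in steps 3a
and 3b lean heavily on LEB128 encoding, a decoder like the following is
needed throughout (a minimal sketch; the signed variant adds a
sign-extension step at the end):

    #include <stdint.h>

    /* Decode one unsigned LEB128 value and advance *p past it. */
    static uint64_t uleb128(const unsigned char **p)
    {
        uint64_t result = 0;
        unsigned shift = 0;
        unsigned char byte;
        do {
            byte = *(*p)++;
            result |= (uint64_t)(byte & 0x7f) << shift;
            shift += 7;
        } while (byte & 0x80);
        return result;
    }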
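And here is the bit-position normalization I alluded to in step 3d, as
I currently read the DWARF-2/3 spec: DW_AT_bit_offset counts from the
most significant bit of the containing unit, so the little-endian case
needs flipping. This is exactly the part I'd want someone to
sanity-check:

    /* Given a bitfield member's DW_AT_byte_size (the containing
       unit), DW_AT_bit_offset, and DW_AT_bit_size, return the field's
       position in bits relative to the start of that unit: for a
       big-endian target, DW_AT_bit_offset can be used as-is; for
       little-endian, flip it to count from the unit's low-order end.
       The caller adds DW_AT_data_member_location * 8, then divides by
       8 to get a normalized byte offset plus a residual bit offset. */
    static unsigned bit_pos(unsigned byte_size, unsigned bit_offset,
                            unsigned bit_size, int big_endian)
    {
        if (big_endian)
            return bit_offset;
        return byte_size * 8 - bit_offset - bit_size;
    }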
Where I've run into trouble so far is in predicting what kind of
storage I'll need for all this intermediate summary data. DWARF
sections can get rather large, and there's no way to predict the
contents until they're read. I expect this program -- if I follow the
design -- to use large amounts of memory until I get to the point where
I can start releasing blocks by displaying what they represent. Would I
be making things easier on myself (or orders of magnitude less versatile
for the user) by processing ELF relocatables only? [This seems a dead
end to me, but I thought I'd ask anyway.]
I don't intend to use language syntax features in the display -- I don't
plan on "recoding" structure declarations in C ... I just plan on
displaying a few lines about each overall aggregate, and then listing
the important characteristics of each of its members in row/column
format ... then inject a few linefeeds and proceed to the next
aggregate. I'm hoping this will increase the chances of "accidental"
language support.
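For the record, the shape of display I'm picturing is something like
this -- contents entirely invented, reusing the hypothetical struct
from earlier:

    struct dev_block  (widget.c:12)  size 0x200  typedef devblk_t
      offset  len  bit off/len  type          member
      0x000     3       -       uint8_t [3]   rsvd0
      0x003     4       -       uint32_t      status
      0x007     4      0/3      unsigned int  mode
      0x007     4      3/5      unsigned int  rsvd1
      ...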
I'd appreciate comments. If anybody has a handle on the future of gcc &
DWARF (am I wasting my time?), additional features (have I got the right
scope?), or just general "been there, done that ... watch out for ..."
kind of remarks, then I'm listening. Am I trying to hit a moving target
wrt DWARF content? I'd also like to hear if anybody has a better source
than DWARF data for this kind of process ... is
there a gcc phase I've missed? What features would *you* like to see in
an aggregate mapper?
/Of course/ I'll contribute this, assuming my employer's Legal staff
will permit me to do so, you guys want it, and the code doesn't end up a
maintenance or licensing nightmare.
Please post replies to the list.
TIA,
---Jim--