Re: Question about past research in detecting compiler used to create executable binary
On Wed, 2008-01-23 at 16:48 -0600, Stephen Torri wrote:
> GCC Community,
>
> I am a PhD candidate at Auburn University in Alabama investigating
> automated compiler detection for reverse engineering. The reason I am
> contacting this mailing list is to see if anyone knows of research done
> to discover the compiler used to create a binary executable.
>
> Sincerely,
>
> Stephen Torri
> PhD Candidate
> Auburn University
> Department of Computer Science and Software Engineering
> [EMAIL PROTECTED]

If GCC is any guide, this will often be trivial. GCC embeds lots of data
about the source system and compiler in the executable.

> file temp.x
temp.x: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), for
GNU/Linux 2.6.0, dynamically linked (uses shared libs), not stripped

Also in the same file:

GCC: (GNU) 4.1.2 (Ubuntu 4.1.2-0ubuntu4)

If this is a reverse engineering project, your adversary will probably
have stripped as much of this kind of thing as possible, though.

Tim Josling
Re: Rant about ChangeLog entries and commit messages - better to do something than just complain
On the principle that it's better to do something than just complain...

I monitored the time I spent looking for the emails associated with a
given patch, and found it takes high single-digit minutes to find them.
Sometimes you can't find them at all (which takes a lot longer). I do
this a lot.

So I wrote a little proof-of-concept script that takes the mailing list
archives and the ChangeLog files, and annotates the ChangeLog files with
the URLs of the probable email containing each patch.

Sample output is here (annotation of the current ChangeLog file):

http://cobolforgcc.cvs.sourceforge.net/cobolforgcc/gcc/gcc/gcb/gcc_ChangeLog.txt?revision=1.1&view=markup
Or http://tinyurl.com/2v824o
Or http://preview.tinyurl.com/2v824o

The program is here (not much internal documentation at all). Testing
has been limited; in any case, when processing text written by people,
perfection is not possible.

http://cobolforgcc.cvs.sourceforge.net/cobolforgcc/gcc/gcc/gcb/gcc_mailscan.rb?revision=1.1&view=markup
Or http://tinyurl.com/2yem2u
Or http://preview.tinyurl.com/2yem2u

It runs in about 25 minutes on my system and uses a few hundred MB of
storage.

Things I learned:

1. There is a lot of data. It's a good thing Ruby 1.9 is a lot faster
than Ruby 1.8. There are over 100 ChangeLog files in the GCC source,
with over 600,000 lines in total. The gcc-patches mailing list archives
are over 2 GB in size, and take a considerable time to download.

2. Most patches to ChangeLog have an identifiable email in the archive.
Things get spotty with branches in some cases, also as you go back in
time, and there is also a large gap in the email archives from a while
back.

3. I think this may be a useful thing. If a place could be found to put
the 30MB of files I would be happy to maintain them on a weekly basis or
so. Alternatively I could update the ChangeLog files themselves, but I
have reason to suspect that may not be popular. If nothing else happens
I will keep it up to date for my own use.
Tim Josling

On Tue, 2007-12-04 at 08:05 -0500, Richard Kenner wrote:
> > I didn't say you cannot or should not use these tools. But a good comment
> > on a piece of code sure beats a good commit message, which must be looked
> > at separately, and can be fragmented over multiple commits, etc.
>
> I don't see one as "beating" the other because they have very different
> purposes. Sometimes you need one and sometimes you need the other.
>
> The purpose of COMMENTS is to help somebody understand the code as it
> stands at some point in time. In most cases, that means saying WHAT the
> code does and WHY (at some level) it does what it does. Once in a while,
> it also means saying why it DOESN'T do something, for example, if it might
> appear that there's a simpler way of doing what the code is doing now but
> it doesn't work for some subtle reason. But it's NOT appropriate to put
> into comments the historical remark that this code used to have a typo
> which caused a miscompilation at some specific place. However, the commit
> log IS the place for that sort of note.
>
> My view is that, in general, the comments are usually the most appropriate
> place to put information about how the code currently works and the commit
> log is generally the best place for information that contrasts how the code
> currently works with how it used to work and provides the motivation for
> making the change. But there are exceptions to both of those generalizations.
Getting host and target size and alignment information at build time?
I need to find out the alignment and size information for the standard
integral types and pointer types at GCC build time. The information is
needed to work out the sizes of data structures so that warnings about
size mismatches can be produced.

The information is needed at build time because the parser and validator
do not have access to the gcc back end code when the compiler runs. So
this information needs to be worked out earlier and generated as Lisp
code, i.e. in the build phase.

I have found tm.h, and also bconfig.h, config.h and tconfig.h. The sizes
are more or less OK, as there are macros for sizes, apart from pointer
sizes in some cases. The alignment is the main problem; the alignments
for i386 are not constants but function calls, and vary in certain
scenarios.

My current attempt at doing this is below. I fully acknowledge that it
is not correct. That is the reason for this posting. Does anyone have
any suggestions about how to get this information at build time?

Apart from some simple solution which I hope someone will come up with,
I have two other possible avenues to solve the problem:

1. Have some sort of install process which compiles and runs a program
that extracts the information out of the target compiler. E.g. I could
write a small program which prints out the __alignof__ values and sizeof
values for various data items. This information then gets stored
somewhere that it can be used by the Lisp code.

2. Turn GCC into a libgccbackend and call it from my Lisp code at run
time using a foreign function interface. This would make it unnecessary
to get the information at build time, because the Lisp code could get it
from the compiler back end when compiling the program. This would be a
last resort at this stage, due to the possibilities for misuse of a
libgccbackend and also the foreign function interface overheads.

Tim Josling

/* -*- C -*- */
/* Copyright ...
   FILE Generate target-info file - data item attributes for target time.
   Output goes to standard output. */

#include
#include
#include

#define IN_GCC
#include "tconfig.h"
#include "system.h"
#include "coretypes.h"
#include "tm.h"

/* We don't want fancy_abort */
#undef abort

#ifndef BIGGEST_FIELD_ALIGNMENT
#define BIGGEST_FIELD_ALIGNMENT 32
#endif

#ifndef BITS_PER_UNIT
#define BITS_PER_UNIT 8
#endif

#ifndef POINTER_SIZE
#ifdef TARGET_64BIT
#define POINTER_SIZE 64
#else
#define POINTER_SIZE 32
#endif
#endif

/* Fake because some macro needs it. */
int ix86_isa_flags = 0;

static int
maxint (int a, int b)
{
  return (a > b ? a : b);
}

static void
print_one_item (char *name, char *actual_usage, char *basic_type,
                int size_bits)
{
  printf ("(defconstant %s-attributes\n"
          "  (gcb:make-usage-attributes\n"
          "   :usage cbt:%s\n"
          "   :basic-type cbt:%s\n"
          "   :size %d\n"
          "   :default-alignment %d\n"
          "   :sync-alignment %d))\n",
          name, actual_usage, basic_type,
          size_bits / BITS_PER_UNIT,
          1,
          maxint (size_bits / BITS_PER_UNIT,
                  /* This alignment is all wrong but there doesn't seem
                     to be any way to get the true figure out of GCC
                     short of doing a cross-build and then running a
                     program on the target machine. */
                  BIGGEST_FIELD_ALIGNMENT / BITS_PER_UNIT));
}

int
main (int argc, char **argv)
{
  fprintf (stderr, "TARGET_64BIT %d\n", TARGET_64BIT);
  fprintf (stderr, "POINTER_SIZE %d\n", POINTER_SIZE);
  if (argc != 1)
    {
      fprintf (stderr, "Unexpected number of parameters - should be none\n");
      abort ();
    }
  printf ("...file header stuff");
  print_one_item ("char", "binary-char", "binary", BITS_PER_UNIT);
  print_one_item ("short", "binary-short", "binary", SHORT_TYPE_SIZE);
  print_one_item ("int", "binary-int", "binary", INT_TYPE_SIZE);
  print_one_item ("long", "binary-long", "binary", LONG_TYPE_SIZE);
  print_one_item ("long-long", "binary-long-long", "binary",
                  LONG_LONG_TYPE_SIZE);
  print_one_item ("sizet", "binary-size", "binary", POINTER_SIZE);
  print_one_item ("ptr", "binary-ptr", "binary", POINTER_SIZE);
  print_one_item ("ptr-diff", "binary-ptr-diff", "binary", POINTER_SIZE);
  print_one_item ("display", "display", "display", BITS_PER_UNIT);
  print_one_item ("binary", "binary", "binary", INT_TYPE_SIZE);
  print_one_item ("binary1", "binary1", "binary", 1 * BITS_PER_UNIT);
  print_one_item ("binary2", "binary2", "binary", 2 * BITS_PER_UNIT);
  print_one_item ("binary4", "binary4",
Re: Getting host and target size and alignment information at build time?
On Fri, 2008-04-11 at 09:07 -0400, Daniel Jacobowitz wrote:
> Please don't reply to an existing thread to post a new question.

Sorry, I didn't realize that would cause a problem.

> Simply put, you can't do this. All of these things can depend on
> command line options.

It does seem you can only get this information in the context of an
actual compile on the target machine.

> Why not get it out of GCC later? You don't need to hack up GCC to do
> that.

Later is too late. I need to make decisions before the GCC back end gets
involved (the back end is in a separate OS process). For example, "Is
this literal too long for this group data item?" Or "Is a redefine
larger than the original (which is not allowed)?" If the literal is too
long I need to truncate it and give an error message; if a redefine is
too large I need to extend the original and give an error message.

While this can all be done, it means I am duplicating more logic into
the C code, which has a 4X negative productivity impact versus Lisp. It
also makes it extremely difficult to output error messages in
line-number sorted order, because they are issued by different
processes.

Still, if that's how GCC operates, I will need to find some way to deal
with it. Maybe a cut-down libgccbackend that doesn't generate code, it
just gives me the information I want.

Tim Josling
Re: Getting host and target size and alignment information at build time?
On Fri, 2008-04-11 at 17:05 -0400, Daniel Jacobowitz wrote:
> On Sat, Apr 12, 2008 at 06:59:28AM +1000, Tim Josling wrote:
> > > Why not get it out of GCC later? You don't need to hack up GCC to do
> > > that.
>
> That's not what I meant. You don't need it _during the GCC build
> process_. You can fork GCC and run it and have it tell you the answer
> based on the current command line arguments, read its output, and
> go on with what you were doing. Which presumably involves further
> compilation.

You're right... That's more or less what I think I will do. I'm working
on a proof of concept at the moment.

> (You didn't say what you are trying to do, so I'm guessing at the
> context a bit.)

Here is some more explanation of what I am trying to do. My COBOL
compiler was going to look like this:

(1) The gcc driver calls a Lisp program which does the preprocessing
(lex/parse/copy-includes/replaces), lexing, parsing, cross-checking and
simplification, and creates an output file in a simple binary format.
This Lisp program does not have direct access to any GCC code, including
headers.

(2) The gcc driver passes the output file to another program (cob1)
which would be similar to cc1, except that its input file is a simple
binary format that does not need to be lexed and parsed. This program
will be driven by toplev.c and will generate, via the gcc back end, the
assembler output to be assembled and linked by subsequent programs
called from the gcc driver. It will have access to all the gcc middle
and back end code.

My initial intention was that program (1) should know as little about
gcc as possible. I then realised that it would need some target
information from gcc, such as type sizes and alignment information. I
thought I could get this information by writing a small program that
would pull in some headers and could then output a Lisp program that
could be compiled into program (1).
This didn't work out very well because the information is only available
within the compiler at run time on the target system, and it is dynamic
and option-dependent.

So I will add an option to the compiler, "-fget-types". This will
trigger the output on standard output of all the information I need. So
the flow will be:

(0) cob1 -fget-types -> stdout, passed as a parameter to (1)
(1) Lisp pgm -> binary file
(2) cob1: main toplev.c compilation taking the binary file as input.

For various reasons I have to run the Lisp program via a shell script. I
can readily include the -fget-types run in that, something like this:

lisp --user-parms -o -ftypes=`cob1 -fget-types`

The stdout from "cob1 -fget-types" will get passed to the Lisp program
via the shell back-quotes facility, which incorporates the stdout of a
command into the command line where the back quotes appear. This
technique is used elsewhere in gcc.

Regards,
Tim Josling
Re: Getting host and target size and alignment information at build time?
On Sat, 2008-04-12 at 18:16 +1000, Tim Josling wrote:
> On Fri, 2008-04-11 at 17:05 -0400, Daniel Jacobowitz wrote:
> > On Sat, Apr 12, 2008 at 06:59:28AM +1000, Tim Josling wrote:
> > > > Why not get it out of GCC later? You don't need to hack up GCC to do
> > > > that.
>
> You're right... That's more or less what I think I will do. I'm working
> on a proof of concept at the moment.

Here is the proof of concept for getting the type information out of the
gcc back end. It was not as hard as I expected in the end.

cob2.c:
http://cobolforgcc.cvs.sourceforge.net/cobolforgcc/gcc/gcb/cob2.c?revision=1.1&view=markup
See get_build_types () and get_target_types ().

Called from script cob1.sh:
http://cobolforgcc.cvs.sourceforge.net/cobolforgcc/gcc/gcb/cob1.sh?revision=1.1

Used by type-info.lisp:
http://cobolforgcc.cvs.sourceforge.net/cobolforgcc/gcc/gcb/type-info.lisp?revision=1.1&view=markup
See defun init-type-info.

Any comments or suggestions welcome. Thanks for your ideas, Daniel.

Tim Josling
Some questions about writing a front end
BACKGROUND (optional)

I've now reached the point of writing the GCC middle/back end interface
for my Cobol compiler. See http://cobolforgcc.sourceforge.net/

Previously I wrote two front ends, but that was a while ago. These were
the original iteration of cobolforgcc (1998-2003), and the now defunct
treelang of similar vintage. I also translated and updated the "how to
write a front end" document, now sadly out of date:
http://cobolforgcc.sourceforge.net/cobol_14.html

But that was all a while ago and a lot has happened. I read the GCC
Summit papers and the GCC Wiki, but a few questions remain and there are
some things I'm not quite sure about.

QUESTIONS

1. Sample front end: Given treelang no longer exists and "is not a good
example anyway", what would be the best front end to use as a model and
to plagiarize code from? I have found that the Ada front end, while
large, is quite easy to follow, and I've been using that. C/C++ seem to
have the back end interface very enmeshed in the hand-coded parsers. The
Java front end is reasonably small (only handles class files?) but the
back end (BE) interface is spread among 30 files. The Fortran front end
(FE) has 58 files with BE interfaces. Objective C/++ are mostly just
add-ons to C. What I don't know is how up to date the various front ends
are and how good an example they are.

2. Most-gimplified front end: Allied to Q1, which front ends have been
most thoroughly converted to GIMPLE?

3. LANG_HOOKS: There has been some discussion about LANG_HOOKS being
removed in the future. From memory this was in the context of the
"optimization in the linker (LTO)" projects. Is there a replacement I
should use now, or is there anything I should do to prepare for the
replacement?

4. What does GIMPLE cover: What is the scope of GIMPLE? Most of the
discussion is about procedural code. Does it also cover variable
definitions, function prototype definitions, etc.?

5. What is deprecated: Is there any time-effective way to identify
constructs, header files, macros, variables and functions that are
"deprecated"?

6. Tuples: I am a bit confused about tuples. Tuples seem to be really
just structs by another name, unless I have missed the point. The idea
is not a bad one - I went through the same process in the Lisp code in
the front end, where initially I stored everything in arrays and later
switched to structs/tuples. In Lisp this provided the advantages of
run-time type-checking and the ability to use mnemonic names. The first
email about tuples that I can find seems to assume a reasonable amount
of background on the part of the reader:
http://www.mailinglistarchive.com/gcc@gcc.gnu.org/msg01669.html
Some clarification about what the tuples project is trying to do, and in
particular how I should position for the advent of tuples, would be very
useful. I have read the material in the Wiki and from the GCC Summit.

7. Should I target GENERIC, high GIMPLE or low GIMPLE? Would I miss
optimizations if I went straight to a GIMPLE representation? Is one
interface more likely to change radically in the future? The assumption
here is that the front end will be using an entirely different
representation, so there is no question of using one of these in the
front end. It is just a question of which format to convert into.

Thank you all for any help you can provide,
Tim Josling
Re: [tuples] New requirement for new patches
> - The C front end is bootstrapping. The failure rate in the testsuites
>   is in the 2-4% range.

I've been trying to do a C-only bootstrap of the tuples branch for a
couple of days on "Linux tim-gcc 2.6.20-15-generic #2 SMP Sun Apr 15
06:17:24 UTC 2007 x86_64 GNU/Linux" and I get:

/../libdecnumber -I/home2/gcc-gimple-tuples-branch/gcc/gcc/../libdecnumber/bid -I../libdecnumber /home2/gcc-gimple-tuples-branch/gcc/gcc/tree-optimize.c -o tree-optimize.o
/home2/gcc-gimple-tuples-branch/gcc/gcc/tree-data-ref.c: In function 'compute_all_dependences':
/home2/gcc-gimple-tuples-branch/gcc/gcc/tree-data-ref.c:3930: internal compiler error: in avail_expr_eq, at tree-ssa-dom.c:2482
Please submit a full bug report, with preprocessed source if appropriate.
See <http://gcc.gnu.org/bugs.html> for instructions.
make[3]: *** [tree-data-ref.o] Error 1
make[3]: *** Waiting for unfinished jobs

Also, this one is looping at the time the other one crashes. From
"ps aux | grep cc1":

tim 4500 71.9 1.6 82840 67128 pts/0 RN+ 11:08 3:02 /home2/gcc-gimple-tuples-branch/objdir/./prev-gcc/cc1 -quiet -I. -I. -I/home2/gcc-gimple-tuples-branch/gcc/gcc -I/home2/gcc-gimple-tuples-branch/gcc/gcc/. -I/home2/gcc-gimple-tuples-branch/gcc/gcc/../include -I/home2/gcc-gimple-tuples-branch/gcc/gcc/../libcpp/include -I/usr/local/include -I/home2/gcc-gimple-tuples-branch/gcc/gcc/../libdecnumber -I/home2/gcc-gimple-tuples-branch/gcc/gcc/../libdecnumber/bid -I../libdecnumber -iprefix /home2/gcc-gimple-tuples-branch/objdir/prev-gcc/../lib/gcc/x86_64-unknown-linux-gnu/4.4.0/ -isystem /home2/gcc-gimple-tuples-branch/objdir/./prev-gcc/include -isystem /home2/gcc-gimple-tuples-branch/objdir/./prev-gcc/include-fixed -DIN_GCC -DHAVE_CONFIG_H insn-attrtab.c -quiet -dumpbase insn-attrtab.c -mtune=generic -auxbase-strip insn-attrtab.o -g -O2 -W -Wall -Wwrite-strings -Wstrict-prototypes -Wmissing-prototypes -Wold-style-definition -pedantic -Wno-long-long -Wno-variadic-macros -Wno-overlength-strings -Werror -Wno-return-type -Wno-format -Wno-missing-format-attribute -o /tmp/ccHig7nY.s

Tim Josling

On Thu, 2008-04-17 at 17:19 -0700, Diego Novillo wrote:
> Please notice that the wiki page for tuples has new rules for patches.
> From now on, every patch needs to have been tested with a C-only
> bootstrap.
>
> Thanks. Diego.
Re: Some questions about writing a front end
On Thu, 2008-04-17 at 10:24 -0700, Ian Lance Taylor wrote:
> Tim Josling <[EMAIL PROTECTED]> writes:
> > 5. What is deprecated: Is there any time-effective way to identify
> > constructs, header files, macros, variable and functions that are
> > "deprecated".
>
> Not really. We try not to leave deprecated stuff around for too long.

Good, but I was wondering then what I should avoid so that my front end
avoids being "not a very good example" (like treelang).

Tim Josling
Re: Rant about ChangeLog entries and commit messages
On Mon, 2007-12-03 at 13:58 -0500, Diego Novillo wrote:
> On 12/03/07 13:50, Richard Kenner wrote:
> >> I guess that could work, but that wouldn't give a way into the history
> >> for the change. Several times there is a post-mortem discussion on the
> >> patch, leading to more patches.
> >
> > How about both?
>
> Sure.
>
> Diego.

Quite a few people are worried about verbose descriptions of changes
cluttering up the ChangeLog. Others (like me) would like a way to easily
find the discussions about a change, and would like a brief indication
in the ChangeLog of the context of the change. The FSF also has good
reasons for keeping solid records of who made what change. So, how about
this:

1. For a PR fix, continue to record the PR number and category. Like
this:

	PR tree-optimization/32694

2. For all changes, a one-line record giving the context, plus the URL
of a key message in the email message trail, unless the intent is
plainly obvious, such as bumping the version number. Like this:

	Gimplification of Fortran front end.
	http://gcc.gnu.org/ml/gcc-patches/2007-12/msg00072.html

3. Continue to record "who made what change". Like this:

	* config/xtensa/xtensa.c (xtensa_expand_prologue): Put a
	REG_FRAME_RELATED_EXPR note on the last insn that sets up the
	stack pointer or frame pointer.

This should satisfy everyone's needs. It would by no means be the
largest divergence from the FSF standards by the GCC project: the use of
languages other than C in the Ada front end is non-compliant by my
reading, and the compliance of the rest of the code with the FSF
standards is spotty at times, e.g. the garbage collection code.

While this is a divergence from the FSF standards, it is a positive
change and no information is being lost. It would be interesting to ask
someone who was around at the time why the guidelines were written as
they were. The rationale may no longer be relevant.

Tim Josling
Re: [RFC] WHOPR - A whole program optimizer framework for GCC
On Wed, 2007-12-12 at 15:06 -0500, Diego Novillo wrote:
> Over the last few weeks we (Google) have been discussing ideas on how to
> leverage the LTO work to implement a whole program optimizer that is
> both fast and scalable.
>
> While we do not have everything thought out in detail, we think we have
> enough to start doing some implementation work. I tried attaching the
> document, but the mailing list rejected it. I've uploaded it to
> http://airs.com/dnovillo/pub/whopr.pdf

A few questions:

Do you have any thoughts on how this approach would be able to use
profiling information, which is a very powerful source of information
for producing good optimisations?

Would there be much duplication of code between this and normal GCC
processing, or would it be possible to share a common code base?

A few years back there were various suggestions about having files
containing intermediate representations, and this was criticised because
it could make it possible for people to subvert the GPL by connecting to
the optimisation phases via such an intermediate file. Arguably the
language front end is then a different program and not covered by the
GPL. It might be worth thinking about this aspect.

This also triggers the thought that if you have this intermediate
representation, and it is somewhat robust to GCC patchlevels, you do not
actually need the source code of proprietary libraries to optimize into
them. You only need the intermediate files, which may be easier to get
than source code.

Tim Josling
Re: [RFC] WHOPR - A whole program optimizer framework for GCC
On Thu, 2007-12-13 at 08:27 -0500, Diego Novillo wrote:
> On 12/13/07 2:39 AM, Ollie Wild wrote:
> > The lto branch is already doing this, so presumably that discussion
> > was resolved (Maybe someone in the know should pipe up.).
>
> Yes, streaming the IL to/from disk is a resolved issue.
> ...
> Diego.

I found this thread: http://gcc.gnu.org/ml/gcc/2005-11/msg00735.html

>> From: Mark Mitchell
>> To: gcc mailing list
>> Date: Wed, 16 Nov 2005 14:26:28 -0800
>> Subject: Link-time optimization
>>
>> The GCC community has talked about link-time optimization for some time.
>> ...
>> We would prefer not to have this thread devolve into a discussion about
>> legal and "political" issues relating to reading and writing GCC's
>> internal representation. I've said publicly for a couple of years that
>> GCC would need to have this ability, and, more constructively, David
>> Edelsohn has talked with the FSF (both RMS and Eben Moglen) about it.
>> The FSF has indicated that GCC now can explore adding this feature,
>> although there are still some legal details to resolve.
>> ...
>> http://gcc.gnu.org/projects/lto/lto.pdf
>> ...

Was there any more about this? I have restarted work on my COBOL front
end. Based on my previous experiences writing a GCC front end, I want to
have as little code as possible in the same process as the GCC back end.
This means passing over a file. So I would like to understand how to
avoid getting into political/legal trouble when doing this.

Thanks,
Tim Josling
Re: Rant about ChangeLog entries and commit messages
On Sat, 2007-12-15 at 20:54 -0200, Alexandre Oliva wrote:
> On Dec 3, 2007, [EMAIL PROTECTED] (Richard Kenner) wrote:
> > In my view, ChangeLog is mostly "write-only" from a developer's
> > perspective. It's a document that the GNU project requires us to
> > produce for ...
>
> ... a good example of compliance with the GPL:
>
>   5. Conveying Modified Source Versions.
>     a) The work must carry prominent notices stating that you modified
>     it, and giving a relevant date.

(Minor quibble) As copyright owner of GCC, the FSF is not bound by the
conditions of the licence it grants in the same way as licensees are
bound. So I don't think this provision in itself would mandate that
those who have copyright assignments to the FSF record their changes.

I don't hear anyone arguing that people should not record what they
changed and when. The question is whether that is sufficient.

I just started using git locally, and I keep thinking it would be really
great to have something like "git blame" for gcc. The command "git
blame" gives you a listing of who changed each line of the file and
when, and also gives the commit id. From that all can be revealed.

> FWIW, I've used ChangeLogs to find problems a number of times in my 14
> years of work in GCC, and I find them very useful. When I need more
> details, web-searching for the author of the patch and some relevant
> keywords in the ChangeLog will often point at the relevant e-mail, so
> burdening people with adding a direct URL seems pointless to me. It's
> pessimizing the common case for a small optimization in far less
> common cases.

This may possibly work when the mailing list entries exist and are
accessible. However, they are only available AFAIK from 1998, and GCC
has been going for 2-3 times as long as that. And there is at least one
significant gap: February 2004 up to and including this message:
http://gcc.gnu.org/ml/gcc-patches/2004-02/msg02288.html

In my experience, when documentation is not stored with the source code,
it often gets lost. And when a person is offline, the mailing list HTML
pages are not available.

I have an idea to resolve this that I am working on... more in due
course if it comes to anything.

Tim Josling