Hi,

I'd like to write a small language. This would be my first language, and I'm faced with several implementation options. I have a hard time ranking them, so I'd like to ask for advice from the GCC users and developers.
The language will feature dynamic typing, lexical closures and garbage collection. Most languages falling into this category are implemented with a bytecode interpreter, but there are also a few notable GCC front ends.

GCC front ends are usually patches to the GCC tree. The front end tends to be self-contained in a single language directory, with only a few modifications to the build system. This architecture is nice and clean, but it presents problems for the distribution of the front end: either a fully patched GCC source tree is distributed, or instructions on how to patch GCC must be supplied.

Alternatively, a front end can convert its language to C. GCC supports many annotations and extensions to C that enable efficient translation to machine code of programming constructs common in languages quite different from C.

My options for code generation are more or less:

1) Code an interpreter
2) Build the parse tree in GCC's native format and let GCC generate the code
3) Generate annotated C and call GCC on that

I think that option 1 would represent the least work, but I doubt that it can be made efficient without major contortions. I will probably go that way for the first prototype, but I'm afraid that I will need something else for a production release.

Option 2 sounds like a good deal. The parse tree needs to be built anyway, so building it in GCC's native format in the first place makes a lot of sense. The only problem seems to be with the distribution of the resulting front end. Is it possible to build such a front end by only linking to libgcc or something like that?

Finally, option 3 solves the distribution problems of option 2, but generating good C code doesn't sound like a trivial problem. Compilation under that option is probably slow, since each file must be parsed twice: once by my front end and once by GCC. Is it possible to produce machine code as efficient as with option 2 when generating C code?

Did I miss anything? What are the relative advantages of each solution?
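To make option 3 more concrete, here is a rough sketch of the kind of C a front end might emit for a one-line closure in a dynamically typed language. Everything in it is an assumption made for illustration: the source program in the comment, the `Value` and `Closure` layouts, and all function names are hypothetical. It uses plain C only, with none of the GCC-specific annotations or extensions, and `malloc` stands in for a garbage-collected allocator.

```c
/* Hypothetical output of a front end for the source program
 *     make_adder = fn(n) { return fn(x) { return x + n; }; };
 * All names below (Value, Closure, make_adder, ...) are made up. */
#include <assert.h>
#include <stdlib.h>

typedef enum { T_INT, T_CLOSURE } Tag;

struct Closure;

/* Dynamically typed value: a type tag plus a payload. */
typedef struct Value {
    Tag tag;
    union {
        long i;
        struct Closure *fn;
    } as;
} Value;

/* A closure pairs a code pointer with its captured environment. */
typedef struct Closure {
    Value (*code)(struct Closure *self, Value arg);
    Value env[1];   /* captured variables; one slot is enough here */
} Closure;

Value make_int(long i)
{
    Value v;
    v.tag = T_INT;
    v.as.i = i;
    return v;
}

/* Body of the inner function: x + n, with n read from the environment. */
Value adder_code(Closure *self, Value x)
{
    /* Run-time type check that dynamic typing requires. */
    assert(x.tag == T_INT && self->env[0].tag == T_INT);
    return make_int(x.as.i + self->env[0].as.i);
}

/* make_adder(n): allocate a closure that captures n. A real runtime
 * would allocate from a garbage-collected heap instead of malloc. */
Value make_adder(Value n)
{
    Closure *c = malloc(sizeof *c);
    Value v;
    assert(c != NULL);
    c->code = adder_code;
    c->env[0] = n;
    v.tag = T_CLOSURE;
    v.as.fn = c;
    return v;
}

/* Generic call: check the tag, then jump through the code pointer. */
Value call(Value f, Value arg)
{
    assert(f.tag == T_CLOSURE);
    return f.as.fn->code(f.as.fn, arg);
}
```

Calling `call(make_adder(make_int(5)), make_int(37))` yields a `Value` tagged `T_INT`. The sketch shows where the costs of this approach sit: every call goes through a function pointer, and every arithmetic operation carries a tag check.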
Do you think that I overlooked other options? Would using an existing virtual machine be a good option? Except for Nice, this option doesn't seem to be popular; there must be a catch.

Regarding parsing, I would like to support Unicode literals and identifiers. I would not mind if the input encoding was restricted to UTF-8. I think that Flex has very limited Unicode support. Is there a good lexer out there with good Unicode support? Would Unicode break anything if I use GCC as my back end?

Thanks for your time,

--
Yannick Gingras