"Kaveh R. Ghazi" <[EMAIL PROTECTED]> writes: [ Moved from gcc-patches to gcc ]
> At this point, I don't do any parsing of the "format-checking-data",
> this is where I would expect Ian's state machine language to appear.

To make this kind of thing useful, I see two paths that we can follow.

The first is to simply not try to implement all of printf in a special language. Most printf extensions are not nearly as complex as printf itself. In fact, most simply add a few additional % conversions, with no modifiers. So we can get pretty good mileage out of a mechanism which simply says "like printf, plus these conversions". For example,

    #pragma GCC format "bfd" "inherit printf; %A: asection; %B: bfd"

Here the "inherit" could be simply "printf" for whatever is appropriate for the current compilation, or it could be a specific standard name.

Unfortunately it turns out that this doesn't currently describe the BFD formatting. The BFD additional conversion specifiers %A and %B can only appear at the beginning of the string, before any other conversion specifiers. But still this would be good enough for many uses.

The second approach is of course to write a little language which is powerful enough to describe printf. The state machine language I described earlier is too simple and perhaps overly cryptic. It would be easier to understand a language based on regular expressions (which are of course equivalent to state machines). The main issues I see are:

* There is duplicated information in printf flags. For example, in many cases we want to accept a width flag, and warn if it is zero. We don't want to have to duplicate that warning for each conversion specifier which uses a width flag.

* The use of dollar to specify a particular argument (e.g., "%1$d") must be represented cleanly. The dollar style must be used with every conversion or with none. "%1$*2$d" is a particularly annoying construct.

* It might be convenient to be able to access the current standard level. It might also be convenient to access specific warning options.
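To make the first approach ("like printf, plus these conversions") concrete, here is a minimal sketch in Python of how the string from the pragma example might be parsed and then used to list the argument types a format string expects. Everything here is an assumption for illustration: the names parse_format_spec, expected_args and BASE_PRINTF are hypothetical, the base table is a tiny subset of printf, and modifiers (flags, width, precision, length) are not handled at all.

```python
# Hypothetical sketch: parse "inherit printf; %A: asection; %B: bfd"
# and use the result to list the argument types a format string expects.

# Tiny stand-in for the inherited printf conversions (assumption, not complete).
BASE_PRINTF = {"d": "int", "i": "int", "s": "char *", "c": "int"}


def parse_format_spec(spec):
    """Return (base, extras): the inherited format name and the added conversions."""
    base = None
    extras = {}
    for clause in spec.split(";"):
        clause = clause.strip()
        if not clause:
            continue
        if clause.startswith("inherit "):
            base = clause[len("inherit "):].strip()
        else:
            conv, _, ctype = clause.partition(":")
            conv = conv.strip()
            if len(conv) != 2 or conv[0] != "%":
                raise ValueError(f"bad conversion clause: {clause!r}")
            extras[conv[1]] = ctype.strip()
    return base, extras


def expected_args(fmt, extras, base=BASE_PRINTF):
    """List the argument types that fmt consumes, in order."""
    table = dict(base)
    table.update(extras)
    out = []
    i = 0
    while i < len(fmt):
        if fmt[i] == "%" and i + 1 < len(fmt):
            c = fmt[i + 1]
            if c == "%":
                pass  # "%%" is a literal percent sign and consumes no argument
            elif c in table:
                out.append(table[c])
            else:
                raise ValueError(f"unknown conversion %{c}")
            i += 2
        else:
            i += 1
    return out
```

Note that this sketch has the same limitation described above: it cannot express that %A and %B must come before any other conversion specifier.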
So let's consider a little language in which each conversion specifier is described by a simplified regular expression. In order to handle commonalities, we permit the regular expressions to use subroutines--a subroutine is itself a named regular expression. Each regular expression may have a conditional indicating whether it is valid, which may reference the standard level and warning options. Each regular expression may have an action which is executed if it matches. The action is a sequence of zero or more predefined functions. Dollar specifiers are handled specially by the framework, not by the little language itself.

In this proposal, we can't use standard regular expressions, because they have no provision for the subroutines which I think we need. So here is the regular expression syntax:

    c          where c is not a meta-character, matches c
    \c         matches c
    [abc...]   matches any of the characters abc...
    r1r2       matches r1, then r2
    (r)        grouping; matches r
    r*         matches zero or more of r
    r+         matches one or more of r
    r?         matches zero or one of r
    {NAME}     matches whatever the regular expression named NAME matches
    {$}        matches [0123456789]+$, and handles dollar specifier

The meta-characters are "\*+?[](){}/:" (colon is used for labels).

Typical regular expression items which are not present (but which we could add if we want them): ^ and $ anchors, . to match any character, | for alternation, [^abc...] for a negated match.

The grammar of the little language is:

    rules:      rule
              | rule rules

    rule:       optcond optlabel '/' REGEXP '/' optactions

    optcond:    /* empty */
              | '?' cond '?'

    cond:       VAR
              | cond '||' cond
              | cond '&&' cond
              | '(' cond ')'

    optlabel:   /* empty */
              | ':' NAME ':'

    optactions: /* empty */
              | '{' actions '}'

    actions:    action
              | action actions

    action:     FNNAME '(' args ')' ';'

    args:       arg
              | arg ',' args

    arg:        STRING

We permit multiple regular expressions to have the same name. In this case we try to match each one in order, and apply the actions of the first one to match.
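The regular expression syntax above can be sketched as a translation into an ordinary regex engine, here Python's re module. This is only an illustration under stated assumptions: the function name translate and the subs dictionary are hypothetical, {$} is translated literally as "digits followed by a dollar sign" with none of the framework's argument bookkeeping, several definitions of the same name become an ordered alternation (which backtracks, so it is only an approximation of "first one to match"), and recursive subroutines are not detected.

```python
import re


def translate(pattern, subs):
    """Translate a little-language regex into Python re syntax.

    subs maps a NAME to the list of little-language regexes defined
    under that name, in definition order.
    """
    out = []
    i = 0
    while i < len(pattern):
        c = pattern[i]
        if c == "\\":
            # \c matches c literally
            if i + 1 >= len(pattern):
                raise ValueError("trailing backslash")
            out.append(re.escape(pattern[i + 1]))
            i += 2
        elif c == "[":
            # [abc...] matches any listed character; no ranges, no negation
            end = pattern.index("]", i + 1)
            chars = pattern[i + 1:end]
            out.append("[" + "".join(re.escape(ch) for ch in chars) + "]")
            i = end + 1
        elif c == "{":
            end = pattern.index("}", i + 1)
            name = pattern[i + 1:end]
            if name == "$":
                out.append(r"(?:[0-9]+\$)")
            else:
                # Same-named definitions are tried in order.
                alts = [translate(p, subs) for p in subs[name]]
                out.append("(?:" + "|".join(alts) + ")")
            i = end + 1
        elif c in "()*+?":
            out.append(c)  # same meaning in both syntaxes
            i += 1
        else:
            # Everything else is literal, including '.', '|', '^' and '$'.
            out.append(re.escape(c))
            i += 1
    return "".join(out)


# Subroutine definitions taken from the printf example in this message.
SUBS = {
    "FLAGS": ["[#0- +]*"],
    "WIDTH": ["[123456789][0123456789]*", "\\*{$}"],
    "PREC": [".[0123456789]*", "\\*{$}"],
    "ADJ": ["{FLAGS}{WIDTH}?{PREC}?"],
}

LONG_INT = re.compile(translate("%{ADJ}l[di]", SUBS))
```

With these definitions, LONG_INT.fullmatch("%-10.3ld") succeeds and LONG_INT.fullmatch("%lx") does not. Because {$} is taken literally here, a width written as a plain "*" with no dollar specifier is not matched by the second WIDTH rule; whether {$} should be optional is left open.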
Conditionals are expressions based on variables, like flag_isoc99. I'm not sure yet what the permitted variables should be.

The string is processed by matching unnamed regular expressions in order, anchored at the start of the string. If nothing matches, the character is ignored and we start matching at the next character.

So, for printf:

    :FLAGS: /[#0- +]*/
    :WIDTH: /[123456789][0123456789]*/
    :WIDTH: /\*{$}/ { match (int); }
    :PREC: /.[0123456789]*/
    :PREC: /\*{$}/ { match (int); }
    :ADJ: /{FLAGS}{WIDTH}?{PREC}?/
    ?flag_isoc99? /%{ADJ}hh[di]/ { match (char); }
    /%{ADJ}h[di]/ { match (short); }
    ?flag_isoc99? /%{ADJ}ll[di]/ { match (long long); }
    /%{ADJ}l[di]/ { match (long); }
    /%{ADJ}[di]/ { match (int); }

etc.

Note that the use of {$} in the regexp controls which argument is tested by the next call to "match".

As an example of checking for a zero field width in scanf:

    :WIDTH: /[0123456789]*[123456789][0123456789]*/
    :WIDTH: /0*/ { warn ("zero width"); }

For BFD:

    /%A/ { match (asection); }
    /%B/ { match (bfd); }
    /{PRINTF}/

I still don't have a way to say that %A and %B must appear first. I'm a bit leery of introducing C style variables and if statements, although that would be one approach. We could also do it like this:

    /%A/ { warn_if_flag (0, "%A must not follow a printf specifier");
           match (asection); }
    /%B/ { warn_if_flag (0, "%B must not follow a printf specifier");
           match (bfd); }
    /{PRINTF}/ { set_flag (0); }

I haven't tried to flesh this out any further. I'd be curious to hear how people react to it.

Ian
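As a rough illustration of the processing loop and the flag-based BFD rules above, here is a minimal sketch in Python. It is not a proposed implementation: the rules are written directly as Python regexes rather than in the little language, PRINTF is a deliberately small stand-in for {PRINTF}, the "%%" rule is an addition for the sketch, and the names RULES and check are hypothetical. The action names match, set_flag and warn_if_flag follow the example, with warn_if_flag assumed to warn when the numbered flag is already set.

```python
import re

# Stand-in for {PRINTF}: a small subset of printf conversions (assumption).
PRINTF = r"%[#0\- +]*[0-9]*(?:\.[0-9]*)?(?:hh|h|ll|l)?[diouxXcs]"

# Unnamed rules, tried in order at each position; the first match wins.
RULES = [
    (re.compile(r"%%"), []),  # literal percent sign (added for this sketch)
    (re.compile(r"%A"), [
        ("warn_if_flag", 0, "%A must not follow a printf specifier"),
        ("match", "asection"),
    ]),
    (re.compile(r"%B"), [
        ("warn_if_flag", 0, "%B must not follow a printf specifier"),
        ("match", "bfd"),
    ]),
    (re.compile(PRINTF), [("set_flag", 0)]),
]


def check(fmt, rules=RULES):
    """Return (matched argument types, warnings) for a format string."""
    flags = set()
    matched = []
    warnings = []
    pos = 0
    while pos < len(fmt):
        for rx, actions in rules:
            m = rx.match(fmt, pos)  # anchored at the current position
            if m is None or m.end() == pos:
                continue
            for name, *args in actions:
                if name == "match":
                    matched.append(args[0])
                elif name == "set_flag":
                    flags.add(args[0])
                elif name == "warn_if_flag" and args[0] in flags:
                    warnings.append(args[1])
            pos = m.end()
            break
        else:
            # Nothing matched here: ignore this character and move on.
            pos += 1
    return matched, warnings
```

For example, check("%B: %A has %d items") produces no warnings, while check("%d sections in %B") warns that %B must not follow a printf specifier.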