Re: [PATCH]: Proof-of-concept for dynamic format checking

Ian Lance Taylor Wed, 17 Aug 2005 12:15:45 -0700

"Kaveh R. Ghazi" <[EMAIL PROTECTED]> writes:

[ Moved from gcc-patches to gcc ]


> At this point, I don't do any parsing of the "format-checking-data",
> this is where I would expect Ian's state machine language to appear.

To make this kind of thing useful, I see two paths that we can follow.


The first is to simply not try to implement all of printf in a special
language.  Most printf extensions are not nearly as complex as printf
itself.  In fact, most simply add a few additional % conversions, with
no modifiers.  So we can get pretty good mileage out of a mechanism
which simply says "like printf, plus these conversions".

For example,

#pragma GCC format "bfd" "inherit printf; %A: asection; %B: bfd"

Here the name after "inherit" could be simply "printf", meaning
whatever printf is appropriate for the current compilation, or it
could be a specific standard name.

Unfortunately it turns out that this doesn't currently describe the
BFD formatting.  The BFD additional conversion specifiers %A and %B
can only appear at the beginning of the string, before any other
conversion specifiers.  But still this would be good enough for many
uses.
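
A minimal sketch of how the "like printf, plus these conversions"
string might be split into its parts. The function name
parse_format_spec and the returned shape are assumptions made for
illustration; nothing here is existing GCC code, and the clause syntax
is only what the pragma example above shows.

```python
def parse_format_spec(spec):
    """Parse a spec such as 'inherit printf; %A: asection; %B: bfd'.

    Returns (base, conversions): base is the inherited format name (or
    None), conversions maps a conversion character to a type name.
    """
    base = None
    conversions = {}
    for clause in spec.split(";"):
        clause = clause.strip()
        if not clause:
            continue
        if clause.startswith("inherit "):
            base = clause[len("inherit "):].strip()
        elif clause.startswith("%") and ":" in clause:
            conv, type_name = clause.split(":", 1)
            conv = conv.strip()
            if len(conv) != 2:
                raise ValueError("expected one conversion character: " + clause)
            conversions[conv[1]] = type_name.strip()
        else:
            raise ValueError("unrecognized clause: " + clause)
    return base, conversions


base, extra = parse_format_spec("inherit printf; %A: asection; %B: bfd")
print(base, extra)
```

A checker built on this would fall back to the inherited printf rules
for any conversion character not found in the extra table.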


The second approach is of course to write a little language which is
powerful enough to describe printf.  The state machine language I
described earlier is too simple and perhaps overly cryptic.  It would
be easier to understand a language based on regular expressions (which
are of course equivalent to state machines).  The main issues I see
are:

* There is duplicated information in printf flags.  For example, in
  many cases we want to accept a width flag, and warn if it is
  zero.  We don't want to have to duplicate that warning for each
  conversion specifier which uses a width flag.

* The use of dollar to specify a particular argument (e.g., "%1$d")
  must be represented cleanly.  The dollar style must be used with
  every conversion or with none.  "%1$*2$d" is a particularly annoying
  construct.

* It might be convenient to be able to access the current standard
  level.  It might also be convenient to access specific warning
  options.
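
To make the dollar rule concrete, here is a rough sketch of the
all-or-nothing check, including the "%1$*2$d" case.  The pattern CONV
is a deliberately simplified stand-in for printf (a handful of
conversions and length modifiers), and dollar_style_consistent is a
hypothetical name; neither comes from GCC.

```python
import re

# Simplified printf conversion pattern (an assumption for illustration;
# it does not cover every standard conversion or extension).
CONV = re.compile(
    r"%(?:%|"                                   # literal %%, or ...
    r"(?P<arg>\d+\$)?"                          # argument position, e.g. 1$
    r"[#0\- +]*"                                # flags
    r"(?:(?P<wstar>\*(?:\d+\$)?)|\d+)?"         # width: *, *2$, or digits
    r"(?:\.(?:(?P<pstar>\*(?:\d+\$)?)|\d*))?"   # precision
    r"(?:hh|h|ll|l)?"                           # length modifier
    r"[diouxXcs])"                              # conversion character
)


def dollar_style_consistent(fmt):
    """True if every argument reference uses N$ or none of them does."""
    uses = []
    for m in CONV.finditer(fmt):
        if m.group(0) == "%%":
            continue
        uses.append(m.group("arg") is not None)
        # A '*' width or precision consumes an argument too, so it must
        # follow the same style as the conversion itself.
        for star in (m.group("wstar"), m.group("pstar")):
            if star is not None:
                uses.append(star.endswith("$"))
    return all(uses) or not any(uses)
```

Because the check needs to see every conversion in the string, it sits
more naturally in the surrounding framework than in any single
per-conversion rule.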

So let's consider a little language in which each conversion specifier
is described by a simplified regular expression.  In order to handle
commonalities, we permit the regular expressions to use subroutines--a
subroutine is itself a named regular expression.  Each regular
expression may have a conditional indicating whether it is valid,
which may reference the standard level and warning options.  Each
regular expression may have an action which is executed if it matches.
The action is a sequence of zero or more predefined functions.  Dollar
specifiers are handled specially by the framework, not by the little
language itself.

In this proposal, we can't use standard regular expressions, because
they have no provision for the subroutines which I think we need.  So
here is the regular expression syntax:

c          where c is not a meta-character, matches c
\c         matches c
[abc...]   matches any of the characters abc....
r1r2       matches r1, then r2
(r)        grouping; matches r
r*         matches zero or more of r
r+         matches one or more of r
r?         matches zero or one r
{NAME}     matches any regular expression labeled NAME
{$}        matches [0123456789]+$, and handles dollar specifier

The meta-characters are "\*+?[](){}/:" (colon is used for labels).

Typical regular expression items which are not present (but which we
could add if we want them): ^ and $ anchors, . to match any character,
| for alternation, [^abc...] for a negated match.
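
One way to prototype this syntax is to translate it into an existing
regular expression engine.  The sketch below targets Python's re
module; translate and the named table are hypothetical names, and
error handling for malformed patterns is omitted.  Subroutines become
ordered alternations, so the first labeled expression that matches
wins, and "." stays a literal dot since it is not a meta-character
here.

```python
import re


def translate(pattern, named):
    """Translate the simplified syntax into a Python `re` pattern string.

    `named` maps a label to the ordered list of simplified patterns
    that share that label.
    """
    out = []
    i = 0
    while i < len(pattern):
        c = pattern[i]
        if c == "\\":                     # \c matches c literally
            out.append(re.escape(pattern[i + 1]))
            i += 2
        elif c == "[":                    # [abc...] matches any listed char
            end = pattern.index("]", i)
            chars = pattern[i + 1:end]
            out.append("[" + "".join(re.escape(ch) for ch in chars) + "]")
            i = end + 1
        elif c == "{":                    # {NAME} or {$}
            end = pattern.index("}", i)
            name = pattern[i + 1:end]
            if name == "$":
                out.append(r"(?:[0-9]+\$)")
            else:
                alts = [translate(p, named) for p in named[name]]
                out.append("(?:" + "|".join(alts) + ")")
            i = end + 1
        elif c in "()*+?":                # same meaning in both syntaxes
            out.append(c)
            i += 1
        else:                             # ordinary character, e.g. '.'
            out.append(re.escape(c))
            i += 1
    return "".join(out)


named = {
    "WIDTH": ["[123456789][0123456789]*", r"\*"],
    "PREC": [".[0123456789]*"],
}
pat = translate("%{WIDTH}?{PREC}?d", named)
print(pat)
```

This only covers matching; the side effect of {$} (selecting which
argument is checked) would still have to be handled by the framework.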

The grammar of the little language is:

rules: rule | rule rules
rule: optcond optlabel '/' REGEXP '/' optactions
optcond: /* empty */ | '?' cond '?'
cond: VAR | cond '||' cond | cond '&&' cond | '(' cond ')'
optlabel: /* empty */ | ':' NAME ':'
optactions: /* empty */ | '{' actions '}'
actions: action | action actions
action: FNNAME '(' args ')' ';'
args: arg | arg ',' args
arg: STRING

We permit multiple regular expressions to have the same name.  In this
case we try to match each one in order, and apply the actions of the
first one to match.

Conditionals are expressions based on variables, like flag_isoc99.
I'm not sure yet what the permitted variables should be.

The string is processed by matching unnamed regular expressions in
order, anchored at the current position in the string.  If nothing
matches, the character is ignored and we start matching at the next
character.
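
As a rough sketch of that scanning loop, assuming the rules have
already been compiled to Python regular expressions (scan and the
three sample rules are made up for illustration):

```python
import re


def scan(fmt, rules):
    """Return the results of the actions triggered while scanning `fmt`.

    `rules` is an ordered list of (compiled_regex, action) pairs; each
    action is called with the match object.
    """
    results = []
    pos = 0
    while pos < len(fmt):
        for regex, action in rules:
            m = regex.match(fmt, pos)      # anchored at the current position
            if m and m.end() > pos:        # ignore empty matches
                results.append(action(m))
                pos = m.end()
                break
        else:
            pos += 1                       # nothing matched: skip one char
    return results


rules = [
    (re.compile(r"%l[di]"), lambda m: "long"),
    (re.compile(r"%[di]"), lambda m: "int"),
    (re.compile(r"%s"), lambda m: "char *"),
]
print(scan("x=%d y=%ld name=%s", rules))   # ['int', 'long', 'char *']
```

Rule order matters only when two expressions can match at the same
position; the first one listed takes precedence.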

So, for printf:

:FLAGS: /[#0- +]*/
:WIDTH: /[123456789][0123456789]*/
:WIDTH: /\*{$}/ { match (int); }
:PREC:  /.[0123456789]*/
:PREC:  /.\*{$}/ { match (int); }
:ADJ:   /{FLAGS}{WIDTH}?{PREC}?/
?flag_isoc99? /%{ADJ}hh[di]/  { match (char); }
/%{ADJ}h[di]/   { match (short); }
?flag_isoc99? /%{ADJ}ll[di]/ { match (long long); }
/%{ADJ}l[di]/   { match (long); }
/%{ADJ}[di]/    { match (int); }

etc.  Note that the use of {$} in the regexp controls which argument
is tested by the next call to "match".

As an example of checking for a zero field width in scanf:

:WIDTH: /[0123456789]*[123456789][0123456789]*/
:WIDTH: /0+/ { warn ("zero width"); }

For BFD:

/%A/ { match (asection); }
/%B/ { match (bfd); }
/{PRINTF}/

I still don't have a way to say that %A and %B must appear first.  I'm
a bit leery of introducing C style variables and if statements,
although that would be one approach.  We could also do it like this:

/%A/ { warn_if_flag (0, "%A must not follow a printf specifier");
       match (asection); }
/%B/ { warn_if_flag (0, "%B must not follow a printf specifier");
       match (bfd); }
/{PRINTF}/ { set_flag (0); }
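
A small sketch of how the flag idea could behave in practice.  The
names check_bfd_format, bfd_conv and printf_conv are hypothetical, and
the pattern %[disx] is only a stand-in for the full {PRINTF} rule.

```python
import re


def check_bfd_format(fmt):
    """Return (expected argument types, warnings) for a BFD-style format."""
    flags = set()
    expected = []
    warnings = []

    def bfd_conv(letter, type_name):
        def action():
            if 0 in flags:                       # warn_if_flag (0, ...)
                warnings.append(
                    "%" + letter + " must not follow a printf specifier")
            expected.append(type_name)           # match (type_name)
        return action

    def printf_conv():
        flags.add(0)                             # set_flag (0)

    rules = [
        (re.compile(r"%A"), bfd_conv("A", "asection")),
        (re.compile(r"%B"), bfd_conv("B", "bfd")),
        (re.compile(r"%[disx]"), printf_conv),   # stand-in for {PRINTF}
    ]
    pos = 0
    while pos < len(fmt):
        for regex, action in rules:
            m = regex.match(fmt, pos)
            if m:
                action()
                pos = m.end()
                break
        else:
            pos += 1
    return expected, warnings


print(check_bfd_format("%B: section %A at %d"))
print(check_bfd_format("%d: %A"))
```

The flag is the only state carried between rules, which keeps the
language free of general variables and if statements.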

I haven't tried to flesh this out any further.  I'd be curious to hear
how people react to it.

Ian
