Basic... wow. Start by fixing your regexes:


[0-9]*[0-9.][0-9]*([Ee][-+]?[0-9]+)? {
             yylval.d = strtod(yytext, NULL);
             return NUMBER;
           }

This matches a single '.', '.E0', and so on, because every digit in it is optional. Presumably you want something which looks more like:

dec_digit [0-9]
suffix    ...whatever
numA      {dec_digit}+\.{dec_digit}+{suffix}?
numB      {dec_digit}+\.{suffix}
numC      \.{dec_digit}+{suffix}?
numD      {dec_digit}+{suffix}?
number    {numA}|{numB}|{numC}|{numD}

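
If you want to convince yourself before touching the scanner, here is a rough sketch that runs both patterns through POSIX regcomp() instead of flex. The definitions are expanded by hand and the patterns are anchored to the whole input, which flex does not do for you. Since {suffix} is left open above, I have assumed the exponent from your original rule. The function names are made up for the example.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Assumed stand-in for {suffix}: the exponent from the original rule. */
#define SUFFIX "([Ee][-+]?[0-9]+)"

/* Return 1 if the POSIX extended regex matches text, else 0. */
int regex_matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc = regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB);

    assert(rc == 0);            /* the patterns below are fixed literals */
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}

/* The original rule, anchored so it has to cover the whole input. */
int old_rule_matches(const char *text)
{
    return regex_matches("^[0-9]*[0-9.][0-9]*" SUFFIX "?$", text);
}

/* numA|numB|numC|numD, expanded by hand and anchored the same way. */
int new_rule_matches(const char *text)
{
    return regex_matches(
        "^([0-9]+\\.[0-9]+" SUFFIX "?"   /* numA */
        "|[0-9]+\\." SUFFIX              /* numB */
        "|\\.[0-9]+" SUFFIX "?"          /* numC */
        "|[0-9]+" SUFFIX "?)$",          /* numD */
        text);
}
```

With this, old_rule_matches(".") and old_rule_matches(".E0") both return 1, while new_rule_matches() rejects both and still accepts "1.5", ".5", "42" and "1.E5". Note that "1." on its own is rejected, since numB requires the suffix.
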
\"[^"^\n]*[\"\n] {
             yytext[strlen(yytext) - 1] = '\0';
             yylval.s = str_new(yytext + 1);
             return STRING;
           }

What does a string actually look like? As for 'yytext+1': that skips the leading quote, and the assignment before it overwrites the last character, which is the closing quote only if the string actually had one.

This matches, among other things, a string which starts with a double quote, terminated by a newline, with no closing quote. Not even Basic can be that bad. This bit with 'zero or more chars which aren't a newline' is also redundant.

And note that you only need one caret (^), which must be at the start of the character class ([]). Anywhere else in the class it is a literal, so your regexp also refuses to accept a caret inside a string.
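
A quick way to see what the second caret does is to test the two character classes on their own. This is only a sketch using POSIX regcomp() as a stand-in for flex, with function names invented for the example. Both classes are anchored so they must cover the whole input.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Return 1 if the POSIX extended regex matches text, else 0. */
int regex_matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc = regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB);

    assert(rc == 0);            /* the patterns below are fixed literals */
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}

/* The class as posted: excludes '"', '^' and newline. */
int posted_class_accepts(const char *text)
{
    return regex_matches("^[^\"^\n]*$", text);
}

/* One caret, at the start: excludes only '"' and newline. */
int fixed_class_accepts(const char *text)
{
    return regex_matches("^[^\"\n]*$", text);
}
```

posted_class_accepts("a^b") returns 0 and fixed_class_accepts("a^b") returns 1. Both still reject text containing a double quote or a newline.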


I am looking for ways to attack this. I tried this in my scanner:

[\,\:\n].*[\,\:\n] {
            yytext[strlen(yytext) - 1] = '\0';
            yylval.s = str_new(yytext + 1);
            return STRING;
          }

This matches all sorts of stuff which isn't a string. The basic unquoted string is presumably any alphanumeric sequence, starting with a letter. The comma isn't really relevant, since it's not alphanumeric. Maybe something like:

quoted_string    \"[^"\n]*\"
unquoted_string  [a-zA-Z][a-zA-Z0-9]*
string           {quoted_string}|{unquoted_string}

...but this could interfere with variable names, and so on, which will need more work. This will probably require you to take into account the current context; see Hans's reply.
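
To see the overlap concretely, here is a rough sketch that checks whole tokens against the two patterns with POSIX regcomp() rather than flex. The function names are invented for the example, and PRINT and X1 are just assumed to be the sort of keyword and variable name your Basic has.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Return 1 if the POSIX extended regex matches text, else 0. */
int regex_matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc = regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB);

    assert(rc == 0);            /* the patterns below are fixed literals */
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}

/* quoted_string: opening quote, no quote or newline inside, closing quote. */
int is_quoted_string(const char *text)
{
    return regex_matches("^\"[^\"\n]*\"$", text);
}

/* unquoted_string: a letter followed by letters or digits. */
int is_unquoted_string(const char *text)
{
    return regex_matches("^[a-zA-Z][a-zA-Z0-9]*$", text);
}
```

Here is_unquoted_string("HELLO"), is_unquoted_string("PRINT") and is_unquoted_string("X1") all return 1, so the pattern alone cannot tell data from a keyword or a variable. is_quoted_string() accepts "\"hello\"" but rejects a string that ends in a newline instead of a closing quote.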


