Basic... wow. Start by fixing your regexes:
[0-9]*[0-9.][0-9]*([Ee][-+]?[0-9]+)? {
    yylval.d = strtod(yytext, NULL);
    return NUMBER;
}
This matches a lone '.', '.E0', and so on, because the only character
it insists on is [0-9.], which can be a bare point. Presumably you want
something which looks more like
dec_digit [0-9]
suffix ...whatever
numA {dec_digit}+\.{dec_digit}+{suffix}?
numB {dec_digit}+\.{suffix}?
numC \.{dec_digit}+{suffix}?
numD {dec_digit}+{suffix}?
number {numA}|{numB}|{numC}|{numD}
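To make the intent concrete, here is a minimal C sketch of the same grammar as a hand-written check. It assumes the suffix is an exponent of the form [Ee][-+]?[0-9]+, which is a guess on my part, and it treats that suffix as optional in every form, so '1.' is accepted. The names is_number and skip_digits are made up for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Skip a run of decimal digits; return how many were consumed. */
static size_t skip_digits(const char **p)
{
    size_t n = 0;
    while (isdigit((unsigned char)**p)) {
        (*p)++;
        n++;
    }
    return n;
}

/* Return 1 if the whole of s is a number in the sense of
 * numA|numB|numC|numD, else 0.
 * ASSUMPTION: suffix is an optional exponent, [Ee][-+]?[0-9]+. */
int is_number(const char *s)
{
    assert(s != NULL);

    size_t int_digits = skip_digits(&s);
    size_t frac_digits = 0;

    if (*s == '.') {
        s++;
        frac_digits = skip_digits(&s);
    }

    /* Require a digit on at least one side of the point;
     * this is what rules out "." and ".E0". */
    if (int_digits + frac_digits == 0)
        return 0;

    if (*s == 'E' || *s == 'e') {
        s++;
        if (*s == '+' || *s == '-')
            s++;
        if (skip_digits(&s) == 0)
            return 0;   /* exponent marker with no digits */
    }

    return *s == '\0';
}
```

With this, "1.5", ".5e-3" and "42" are accepted, while "." and ".E0" are rejected, which is the behaviour the definitions above are aiming for.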
\"[^"^\n]*[\"\n] {
    yytext[strlen(yytext) - 1] = '\0';
    yylval.s = str_new(yytext + 1);
    return STRING;
}
What does a string actually look like? Note that 'yytext + 1' only
skips the leading quote; the closing one is removed by overwriting the
last character, which assumes that character really is a quote.
This matches, among other things, a string which starts with a double
quote, terminated by a newline, with no closing quote. Not even Basic
can be that bad. This bit with 'zero or more chars which aren't a
newline' is also redundant.
And note that you only need one caret (^), which must be at the start of
the character class ([]). The second caret is taken literally, so your
class also refuses to match a caret inside a string.
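If you'd rather not scribble on yytext, something along these lines copies out the body instead. This is only a sketch: unquote is a hypothetical helper, not part of lex or of your str_new, and it assumes you pass it yytext and yyleng from inside the rule.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Return a freshly malloc'd copy of the text between the quotes of
 * a lexeme such as "hello", or NULL if the lexeme does not both
 * start and end with a double quote (or if malloc fails).
 * The caller frees the result. */
char *unquote(const char *lexeme, size_t len)
{
    assert(lexeme != NULL);

    /* Need at least the two quotes, and both must really be quotes. */
    if (len < 2 || lexeme[0] != '"' || lexeme[len - 1] != '"')
        return NULL;

    size_t body = len - 2;
    char *out = malloc(body + 1);
    if (out == NULL)
        return NULL;

    memcpy(out, lexeme + 1, body);
    out[body] = '\0';
    return out;
}
```

The point of the check is that a lexeme ending in a newline is reported as an error rather than silently having its last character chopped off.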
I am looking for ways to attack this. I tried this in my scanner:
[\,\:\n].*[\,\:\n] {
    yytext[strlen(yytext) - 1] = '\0';
    yylval.s = str_new(yytext + 1);
    return STRING;
}
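This matches all sorts of stuff which isn't a string: '.*' is greedy,
so the match runs from the first delimiter to the last one it can
reach, swallowing everything in between. The basic unquoted string is
presumably any alphanumeric sequence, starting with a letter. The comma
isn't really relevant, since it's not alphanumeric. Maybe something
like: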
quoted_string \"[^"\n]*\"
unquoted_string [a-zA-Z][a-zA-Z0-9]*
string {quoted_string}|{unquoted_string}
...but this could interfere with variable names, and so on, which will
need more work. This will probably require you to take into account the
current context; see Hans's reply.
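To see the interference for yourself, here is a rough C rendering of the two string patterns as prefix matchers. The function names match_quoted and match_unquoted are invented for this sketch, and the isalpha/isalnum calls assume the default "C" locale, where they agree with [a-zA-Z] and [a-zA-Z0-9].

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Length of the prefix of s matching  \"[^"\n]*\"  or 0 if none.
 * A newline or end of input before the closing quote is a failure. */
size_t match_quoted(const char *s)
{
    assert(s != NULL);
    if (s[0] != '"')
        return 0;

    size_t i = 1;
    while (s[i] != '\0' && s[i] != '"' && s[i] != '\n')
        i++;

    return s[i] == '"' ? i + 1 : 0;
}

/* Length of the prefix of s matching  [a-zA-Z][a-zA-Z0-9]*  or 0. */
size_t match_unquoted(const char *s)
{
    assert(s != NULL);
    if (!isalpha((unsigned char)s[0]))
        return 0;

    size_t i = 1;
    while (isalnum((unsigned char)s[i]))
        i++;

    return i;
}
```

Feeding match_unquoted the text "X1 = 2" yields a match of length 2 for "X1", which is presumably a variable name rather than a string. The pattern alone cannot tell the two apart, hence the need for context.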