> On 10 Nov 2018, at 12:50, Akim Demaille <a...@lrde.epita.fr> wrote:
>
>> On 10 Nov 2018, at 10:38, Hans Åberg <haber...@telia.com> wrote:
>>
>>> Also, see if using %param does not already
>>> give you what you need to pass information from the scanner to the
>>> parser’s yyerror.
>>
>> How would that get into the yyerror function?
>
> In C, arguments of %parse-param are passed to yyerror. That’s why I mentioned
> %param, not %lex-param. And in the C++ case, these are members.

Actually, I was thinking about the token error. But for the yyerror
function, I use C++, and compute the string from the data in the semantic
value. The prototype is:

  void yyparser::error(const location_type& loc, const std::string& errstr)

Then I use it for both errors and warnings, the latter we discussed long
ago. For errors:

  throw syntax_error(@x, str); // Suitably computed string

For warnings:

  parser::error(@y, "warning: " + str); // Suitably computed string

Then the error function above has:

  std::string s = "error: ";
  if (errstr.substr(0, 7) == "warning")
    s.clear();

This way, the "error: " prefix is not shown in the case of a warning.

>>>>> I believe that the right approach is rather the one we have in compilers
>>>>> and in bison: caret errors.
>>>>>
>>>>> $ cat /tmp/foo.y
>>>>> %token FOO 0xff 0xff
>>>>> %%
>>>>> exp:;
>>>>> $ LC_ALL=C bison /tmp/foo.y
>>>>> /tmp/foo.y:1.17-20: error: syntax error, unexpected integer
>>>>>  %token FOO 0xff 0xff
>>>>>                  ^^^^
>>>>> I would have been bothered by « unexpected 255 ».
>>>>
>>>> Currently, that’s for those still using only ASCII.
>>>
>>> No, it’s not, it works with UTF-8. Bison’s count of characters is mostly
>>> correct. I’m talking about Bison’s own location, used to parse grammars,
>>> which is improved compared to what we ship in generated parsers.
>>
>> Ah. I thought of errors for the generated parser only. Then I only report
>> byte count, but using character count will probably not help much for caret
>> errors, as they vary in width. Then the problem is that caret errors use two
>> lines which are hard to synchronize in Unicode. So perhaps some kind of one
>> line markup instead might do the trick.
>
> Two things:
>
> One is that the semantics of Bison’s location’s column is not specified:
> it is up to the user to track characters or bytes. As a matter of fact, Bison
> is hardly concerned by this choice; rather it’s the scanner that has to
> deal with that.
>
> The other one is: once you have the location, you can decide how to display
> it. In the case of Bison, I think the caret errors are fine, but you
> could decide to do something different, say use colors or delimiters, to
> be robust to varying width.

Yes, actually I thought about the token errors. But it is interesting to
see what you say about it.

>>>> I am using Unicode characters and LC_CTYPE=UTF-8, so it will not display
>>>> properly. In fact, I am using special code to even write out Unicode
>>>> characters in the error strings, since Bison assumes all strings are
>>>> ASCII, the bytes with the high bit set being translated into escape
>>>> sequences.
>>>
>>> Yes, I’m aware of this issue, and we have to address it.
>>
>> For what I could see, the function that converts it to escapes is sometimes
>> applied once and sometimes twice, relying on it being idempotent.
>
> It’s a bit more tricky than this. I’m looking into it, and I’d like
> to address this in 3.3.

I realized one needs to know a lot about Bison's innards to fix this. A
thing that made me curious is why the function it uses zeroes out the high
bit: it looks like it has something to do with the POSIX C locale, but I
could not find anything that requires it to be set to zero in that locale.
Right now, I use a function that translates the escape sequences back to
bytes.

>>> We also have to provide support for internationalization of
>>> the token names.
>>
>> Personally, I don't have any need for that. I use strings, like
>>   %token logical_not_key "¬"
>>   %token logical_and_key "∧"
>>   %token logical_or_key "∨"
>> and in the case there are names, they typically match what the lexer
>> identifies.
>
> Yes, not all the strings should be translated. I was thinking of
> something like
>
> %token NUM _("number")
> %token ID _("identifier")
> %token PLUS "+"
>
> This way, we can even point xgettext to looking at the grammar file
> rather than the generated parser.

It might be good if one wants error messages in another language.
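
For reference, the error/warning convention described earlier in this message can be sketched in isolation. This is only a minimal standalone sketch, not the actual parser code: format_diagnostic is a hypothetical free function standing in for the body of the parser's error member, and the location is reduced to a plain string instead of location_type.

```cpp
#include <string>

// Hypothetical stand-in for the body of the parser's error() member.
// The "error: " prefix is prepended unless the message already starts
// with "warning", so the same function can report both kinds.
// Assumption: the location is passed as preformatted text.
std::string format_diagnostic(const std::string& loc, const std::string& errstr)
{
  std::string s = "error: ";
  if (errstr.substr(0, 7) == "warning")
    s.clear();
  return loc + ": " + s + errstr;
}
```

With this, a call with the message "syntax error, unexpected integer" gets the "error: " prefix, while a message that was built as "warning: " + str is passed through without it.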