improving error message (was: bison for nlp)
Hi Hans,

> On 9 Nov 2018, at 14:45, Hans Åberg wrote:
>
>> On 9 Nov 2018, at 12:11, Akim Demaille wrote:
>>
>>> On 9 Nov 2018, at 09:58, Hans Åberg wrote:
>>>
>>> On 9 Nov 2018, at 05:59, Akim Demaille wrote:

> By the way, I'll still get the error message as a string I guess, right?

Yes.  Some day we will work on improving error message generation,
there is much demand.

>>> One thing I'd like to have is if there is an error with, say, an identifier,
>>> also writing out its name.
>>
>> Yes, that's a common desire.  However, I don't think it's really
>> what people need, because the way you print the semantic value
>> might differ from what you actually wrote.  For instance, if I have
>> a syntax error involving an integer literal written in binary,
>> say 0b101010, then I will be surprised to read that I have an error
>> involving 42.
>>
>> So you would need to carry the exact string from the scanner to the
>> parser, and I think that's too much to ask for.
>
> That is what I do.  So I merely want an extra argument in the error reporting
> function where it can be put.

Please, be clearer: what extra argument, and show how the parser
can provide it.  Also, see if using %param does not already
give you what you need to pass information from the scanner to the
parser's yyerror.

>> I believe that the right approach is rather the one we have in compilers
>> and in bison: caret errors.
>>
>> $ cat /tmp/foo.y
>> %token FOO 0xff 0xff
>> %%
>> exp:;
>> $ LC_ALL=C bison /tmp/foo.y
>> /tmp/foo.y:1.17-20: error: syntax error, unexpected integer
>>  %token FOO 0xff 0xff
>>
>> I would have been bothered by « unexpected 255 ».
>
> Currently, that's for those still using only ASCII.

No, it's not, it works with UTF-8.  Bison's count of characters is mostly
correct.  I'm talking about Bison's own location, used to parse grammars,
which is improved compared to what we ship in generated parsers.

$ bison /tmp/foo.y
/tmp/foo.y:2.6: erreur: caractères invalides: « 💩 »
exp: 💩 💩 💩 💩;
     ^
/tmp/foo.y:2.8: erreur: caractères invalides: « 💩 »
exp: 💩 💩 💩 💩;
       ^
/tmp/foo.y:2.10: erreur: caractères invalides: « 💩 »
exp: 💩 💩 💩 💩;
         ^
/tmp/foo.y:2.12: erreur: caractères invalides: « 💩 »
exp: 💩 💩 💩 💩;
           ^

It will fail when there are composed characters, granted.  Don't try with
the attached grammar.

> I am using Unicode characters and LC_CTYPE=UTF-8, so it will not display
> properly.  In fact, I am using special code to even write out Unicode
> characters in the error strings, since Bison assumes all strings are ASCII,
> the bytes with the high bit set being translated into escape sequences.

Yes, I'm aware of this issue, and we have to address it.

We also have to provide support for internationalization of
the token names.
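A minimal sketch (not from the thread) of one way to "carry the exact string
from the scanner to the parser": make the semantic value hold both the
computed value and the original spelling, so a diagnostic can show what was
actually written (e.g. "0b101010" rather than "42").  It assumes a lalr1.cc
parser with variant values; all names here are illustrative.

    %skeleton "lalr1.cc"
    %locations
    %define api.value.type variant
    %define api.token.constructor
    %code requires {
      #include <string>
      // value plus the exact characters the user wrote, e.g. "0b101010"
      struct int_literal { long value; std::string spelling; };
    }

    %token <int_literal> INT "integer"

    %%
    exp: INT  { /* $1.spelling is available for diagnostics here */ };
    %%

    // In the scanner, something along these lines:
    //   return yy::parser::make_INT({to_long(text), text}, loc);
    // where to_long is a hypothetical conversion helper and text is the
    // matched lexeme.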
Re: improving error message (was: bison for nlp)
> On 10 Nov 2018, at 09:02, Akim Demaille wrote:
>
> Hi Hans,

Hello Akim,

> Yes.  Some day we will work on improving error message generation,
> there is much demand.

>>>> One thing I'd like to have is if there is an error with, say, an identifier,
>>>> also writing out its name.
>>>
>>> Yes, that's a common desire.  However, I don't think it's really
>>> what people need, because the way you print the semantic value
>>> might differ from what you actually wrote.  For instance, if I have
>>> a syntax error involving an integer literal written in binary,
>>> say 0b101010, then I will be surprised to read that I have an error
>>> involving 42.
>>>
>>> So you would need to carry the exact string from the scanner to the
>>> parser, and I think that's too much to ask for.
>>
>> That is what I do.  So I merely want an extra argument in the error reporting
>> function where it can be put.
>
> Please, be clearer: what extra argument, and show how the parser
> can provide it.

Yes, I need to analyze it and get back.

> Also, see if using %param does not already
> give you what you need to pass information from the scanner to the
> parser's yyerror.

How would that get into the yyerror function?

>>> I believe that the right approach is rather the one we have in compilers
>>> and in bison: caret errors.
>>>
>>> $ cat /tmp/foo.y
>>> %token FOO 0xff 0xff
>>> %%
>>> exp:;
>>> $ LC_ALL=C bison /tmp/foo.y
>>> /tmp/foo.y:1.17-20: error: syntax error, unexpected integer
>>>  %token FOO 0xff 0xff
>>>
>>> I would have been bothered by « unexpected 255 ».
>>
>> Currently, that's for those still using only ASCII.
>
> No, it's not, it works with UTF-8.  Bison's count of characters is mostly
> correct.  I'm talking about Bison's own location, used to parse grammars,
> which is improved compared to what we ship in generated parsers.

Ah.  I thought of errors for the generated parser only.  There I only report
the byte count, but using the character count will probably not help much for
caret errors, as characters vary in width.  The problem is that caret errors
use two lines, which are hard to synchronize in Unicode.  So perhaps some kind
of one-line markup instead might do the trick.

>> I am using Unicode characters and LC_CTYPE=UTF-8, so it will not display
>> properly.  In fact, I am using special code to even write out Unicode
>> characters in the error strings, since Bison assumes all strings are ASCII,
>> the bytes with the high bit set being translated into escape sequences.
>
> Yes, I'm aware of this issue, and we have to address it.

For what I could see, the function that converts them to escapes is sometimes
applied once and sometimes twice, relying on it being idempotent.

> We also have to provide support for internationalization of
> the token names.

Personally, I don't have any need for that.  I use strings, like

    %token logical_not_key "¬"
    %token logical_and_key "∧"
    %token logical_or_key "∨"

and in the case there are names, they typically match what the lexer
identifies.
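A small sketch (not from the thread) of the byte-count versus character-count
choice being discussed: advancing a Bison location by code points rather than
bytes, from the scanner's token action.  It assumes a lalr1.cc parser with the
standard yy::location; names are illustrative, and newline handling
(loc.lines) is omitted.

    #include <string>
    #include "location.hh"   // generated by Bison for lalr1.cc parsers

    // Count UTF-8 code points: bytes that are not continuation bytes.
    static int count_code_points(const std::string& s)
    {
      int n = 0;
      for (unsigned char c : s)
        if ((c & 0xC0) != 0x80)
          ++n;
      return n;
    }

    // Called after each token is matched; 'text' is the matched lexeme.
    static void advance(yy::location& loc, const std::string& text)
    {
      loc.step();
      // loc.columns(text.size());            // byte-based columns
      loc.columns(count_code_points(text));   // character-based columns
    }

As noted in the thread, neither choice by itself aligns a caret under a
double-width character; it only fixes what the reported column counts.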
Re: improving error message (was: bison for nlp)
> On 10 Nov 2018, at 10:38, Hans Åberg wrote:
>
>> Also, see if using %param does not already
>> give you what you need to pass information from the scanner to the
>> parser's yyerror.
>
> How would that get into the yyerror function?

In C, arguments of %parse-param are passed to yyerror.  That's why I mentioned
%param, not %lex-param.  And in the C++ case, these are members.

>>>> I believe that the right approach is rather the one we have in compilers
>>>> and in bison: caret errors.
>>>>
>>>> $ cat /tmp/foo.y
>>>> %token FOO 0xff 0xff
>>>> %%
>>>> exp:;
>>>> $ LC_ALL=C bison /tmp/foo.y
>>>> /tmp/foo.y:1.17-20: error: syntax error, unexpected integer
>>>>  %token FOO 0xff 0xff
>>>>
>>>> I would have been bothered by « unexpected 255 ».
>>>
>>> Currently, that's for those still using only ASCII.
>>
>> No, it's not, it works with UTF-8.  Bison's count of characters is mostly
>> correct.  I'm talking about Bison's own location, used to parse grammars,
>> which is improved compared to what we ship in generated parsers.
>
> Ah.  I thought of errors for the generated parser only.  There I only report
> the byte count, but using the character count will probably not help much
> for caret errors, as characters vary in width.  The problem is that caret
> errors use two lines, which are hard to synchronize in Unicode.  So perhaps
> some kind of one-line markup instead might do the trick.

Two things:

One is that the semantics of Bison's location's column is not specified:
it is up to the user to track characters or bytes.  As a matter of fact, Bison
is hardly concerned by this choice; rather, it's the scanner that has to
deal with that.

The other one is: once you have the location, you can decide how to display
it.  In the case of Bison, I think the caret errors are fine, but you
could decide to do something different, say use colors or delimiters, to
be robust to varying width.

>>> I am using Unicode characters and LC_CTYPE=UTF-8, so it will not display
>>> properly.  In fact, I am using special code to even write out Unicode
>>> characters in the error strings, since Bison assumes all strings are ASCII,
>>> the bytes with the high bit set being translated into escape sequences.
>>
>> Yes, I'm aware of this issue, and we have to address it.
>
> For what I could see, the function that converts them to escapes is sometimes
> applied once and sometimes twice, relying on it being idempotent.

It's a bit more tricky than this.  I'm looking into it, and I'd like
to address this in 3.3.

>> We also have to provide support for internationalization of
>> the token names.
>
> Personally, I don't have any need for that.  I use strings, like
>
>     %token logical_not_key "¬"
>     %token logical_and_key "∧"
>     %token logical_or_key "∨"
>
> and in the case there are names, they typically match what the lexer
> identifies.

Yes, not all the strings should be translated.  I was thinking of
something like

    %token NUM _("number")
    %token ID _("identifier")
    %token PLUS "+"

This way, we can even point xgettext to looking at the grammar file
rather than the generated parser.
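A hedged sketch (not from the thread) of the %param point above, for the C++
case: with %parse-param (which %param implies), the argument becomes a member
of the generated parser, so the error member function can use it; in C, the
same arguments are instead passed to yyerror.  The 'context' struct and its
field are purely illustrative.

    %skeleton "lalr1.cc"
    %locations
    %code requires {
      #include <string>
      struct context { std::string last_token_text; };   // hypothetical
    }
    %code { #include <iostream> }
    %param { context& ctx }   /* serves as both %lex-param and %parse-param */

    %%
    exp: %empty;
    %%

    /* With %parse-param, ctx is a member of yy::parser, so the error
       member function can reach it directly: */
    void yy::parser::error(const location_type& loc, const std::string& msg)
    {
      std::cerr << loc << ": " << msg
                << " (near '" << ctx.last_token_text << "')\n";
    }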
Re: are there user defined infix operators?
Alright, but

> On 8 Nov 2018, at 23:37, Hans Åberg wrote:
>
>> On 8 Nov 2018, at 22:34, Uxio Prego wrote:
>>
>>> [...]
>>
>> The example and explanation are worth a thousand words,
>> thank you very much.  So I use a simple grammar like that, and
>> the stack data structures, and if necessary feed the lexer back
>> with data from the parser once the user requests some infix
>> operators.
>
> It is only if you want to have a prefix and an infix or postfix operator with
> the same name, like operator- or operator++ in C++, that there is a need for
> a handshake between the lexer and the parser, and a single boolean value
> suffices: one that tells whether the token last seen is a prefix operator.
> Initially set to false, the prefix operators set it to true in the parser,
> and all other expression tokens set it to false.  Then, when the lexer sees
> an operator that can be both a prefix and an infix or postfix, it uses this
> value to disambiguate.  I leave it to you to figure out the cases, it is not
> that hard, just a bit fiddly.  :-)

Yeah, but e.g. I don't plan to define ++ as an operator at all, even
though I would want any users wanting it to be able to configure it.
I guess this would require either predefining it, even with no actual
core semantics, or providing the parser-to-lexer feedback, and
eventually replacing a currently vanilla and clean flex lexer with
something else, and/or writing a lot of ugly hacks in it.

Now consider that the ++ operator has a completely different meaning in
C++ than in Haskell.  Repeat for the ** operator, which exists in Python
and Haskell but not (or, if it does exist, is not very popular) in
languages like C++ or Java.  Some languages provide a // operator, etc.
So predefining is not a good solution, I would say.

Anyway, this is just thinking about the ultimate possibilities that in
my opinion some abstract extensible spec should try to provide, or at
least foresee, but that I don't prioritize fully implementing.
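A hedged sketch (mine, not from the thread) of a slight variant of the
lexer-parser handshake described in the quoted paragraph, for a hand-written
lexer; the token names and the flag are illustrative.

    #include <string>

    // Distinct token kinds for the two readings of the same spelling.
    enum class tok { NUMBER, MINUS_INFIX, MINUS_PREFIX /* , ... */ };

    // Shared between parser and lexer.  True when the previous token can
    // end an operand (number, identifier, ')', a postfix operator): a
    // following '-' must then be infix.  False otherwise (start of an
    // expression, after '(' or after any prefix/infix operator): a
    // following '-' must then be prefix.
    bool prev_ends_operand = false;

    // In the lexer, an ambiguous spelling such as "-" is resolved with it:
    tok classify_minus()
    {
      return prev_ends_operand ? tok::MINUS_INFIX : tok::MINUS_PREFIX;
    }

    // The flag is updated as tokens are produced: set it to true after
    // operand-ending tokens, to false after operators and '('.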
Re: improving error message
> On 10 Nov 2018, at 12:50, Akim Demaille wrote:
>
>> On 10 Nov 2018, at 10:38, Hans Åberg wrote:
>>
>>> Also, see if using %param does not already
>>> give you what you need to pass information from the scanner to the
>>> parser's yyerror.
>>
>> How would that get into the yyerror function?
>
> In C, arguments of %parse-param are passed to yyerror.  That's why I mentioned
> %param, not %lex-param.  And in the C++ case, these are members.

Actually, I was thinking about the token error.  But for the yyerror function,
I use C++, and compute the string from data in the semantic value; the
prototype is:

    void yyparser::error(const location_type& loc, const std::string& errstr)

Then I use it for both errors and warnings, the latter we discussed long ago.

For errors:
    throw syntax_error(@x, str);              // Suitably computed string

For warnings:
    parser::error(@y, "warning: " + str);     // Suitably computed string

Then the error function above has:
    std::string s = "error: ";
    if (errstr.substr(0, 7) == "warning")
      s.clear();

This way, the string beginning with "error: " is not shown in the case of
a warning.

>>>>> I believe that the right approach is rather the one we have in compilers
>>>>> and in bison: caret errors.
>>>>>
>>>>> $ cat /tmp/foo.y
>>>>> %token FOO 0xff 0xff
>>>>> %%
>>>>> exp:;
>>>>> $ LC_ALL=C bison /tmp/foo.y
>>>>> /tmp/foo.y:1.17-20: error: syntax error, unexpected integer
>>>>>  %token FOO 0xff 0xff
>>>>>
>>>>> I would have been bothered by « unexpected 255 ».
>>>>
>>>> Currently, that's for those still using only ASCII.
>>>
>>> No, it's not, it works with UTF-8.  Bison's count of characters is mostly
>>> correct.  I'm talking about Bison's own location, used to parse grammars,
>>> which is improved compared to what we ship in generated parsers.
>>
>> Ah.  I thought of errors for the generated parser only.  There I only report
>> the byte count, but using the character count will probably not help much
>> for caret errors, as characters vary in width.  The problem is that caret
>> errors use two lines, which are hard to synchronize in Unicode.  So perhaps
>> some kind of one-line markup instead might do the trick.
>
> Two things:
>
> One is that the semantics of Bison's location's column is not specified:
> it is up to the user to track characters or bytes.  As a matter of fact,
> Bison is hardly concerned by this choice; rather, it's the scanner that has
> to deal with that.
>
> The other one is: once you have the location, you can decide how to display
> it.  In the case of Bison, I think the caret errors are fine, but you
> could decide to do something different, say use colors or delimiters, to
> be robust to varying width.

Yes, actually I thought about the token errors.  But it is interesting to see
what you say about it.

>>>> I am using Unicode characters and LC_CTYPE=UTF-8, so it will not display
>>>> properly.  In fact, I am using special code to even write out Unicode
>>>> characters in the error strings, since Bison assumes all strings are ASCII,
>>>> the bytes with the high bit set being translated into escape sequences.
>>>
>>> Yes, I'm aware of this issue, and we have to address it.
>>
>> For what I could see, the function that converts them to escapes is
>> sometimes applied once and sometimes twice, relying on it being idempotent.
>
> It's a bit more tricky than this.  I'm looking into it, and I'd like
> to address this in 3.3.

I realized one needs to know a lot about Bison's innards to fix this.

A thing that made me curious is why the function it uses zeroes out the high
bit: it looks like it has something to do with the POSIX C locale, but I could
not find anything requiring it to be set to zero in that locale.  Right now,
I use a function that translates the escape sequences back to bytes.

>>> We also have to provide support for internationalization of
>>> the token names.
>>
>> Personally, I don't have any need for that.  I use strings, like
>>
>>     %token logical_not_key "¬"
>>     %token logical_and_key "∧"
>>     %token logical_or_key "∨"
>>
>> and in the case there are names, they typically match what the lexer
>> identifies.
>
> Yes, not all the strings should be translated.  I was thinking of
> something like
>
>     %token NUM _("number")
>     %token ID _("identifier")
>     %token PLUS "+"
>
> This way, we can even point xgettext to looking at the grammar file
> rather than the generated parser.

It might be good if one wants error messages in another language.
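Piecing the fragments quoted above together, a sketch of the error/warning
scheme Hans describes, written as it might appear in the grammar's epilogue of
a lalr1.cc parser with %locations.  'yy::parser' is the default name of the
generated class (Hans writes it as yyparser); the "warning: " prefix is the
convention from the message above.

    #include <iostream>
    #include <string>

    void yy::parser::error(const location_type& loc, const std::string& errstr)
    {
      // Warnings arrive here as parser::error(loc, "warning: " + text),
      // so do not prepend "error: " to them.
      std::string s = "error: ";
      if (errstr.substr(0, 7) == "warning")
        s.clear();
      std::cerr << loc << ": " << s << errstr << '\n';
    }

    // Inside a rule action:
    //   errors:    throw syntax_error(@1, msg);          // reported via
    //              // error() and triggers error recovery
    //   warnings:  parser::error(@1, "warning: " + msg); // parsing continues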
Re: are there user defined infix operators?
> On 10 Nov 2018, at 13:51, Uxio Prego wrote:
>
> Alright, but

OK, let's hear!

>> On 8 Nov 2018, at 23:37, Hans Åberg wrote:
>>
>>> On 8 Nov 2018, at 22:34, Uxio Prego wrote:
>>>
>>> [...]
>>>
>>> The example and explanation are worth a thousand words,
>>> thank you very much.  So I use a simple grammar like that, and
>>> the stack data structures, and if necessary feed the lexer back
>>> with data from the parser once the user requests some infix
>>> operators.
>>
>> It is only if you want to have a prefix and an infix or postfix operator
>> with the same name, like operator- or operator++ in C++, that there is a
>> need for a handshake between the lexer and the parser, and a single boolean
>> value suffices: one that tells whether the token last seen is a prefix
>> operator.  Initially set to false, the prefix operators set it to true in
>> the parser, and all other expression tokens set it to false.  Then, when
>> the lexer sees an operator that can be both a prefix and an infix or
>> postfix, it uses this value to disambiguate.  I leave it to you to figure
>> out the cases, it is not that hard, just a bit fiddly.  :-)
>
> Yeah, but e.g. I don't plan to define ++ as an operator at all, even
> though I would want any users wanting it to be able to configure it.

An implementation detail to be aware of is that if negative numbers are
allowed as tokens, then 3-2 will parse as 3 followed by -2, not as a
subtraction.  Therefore, it may be better to have only positive numbers, not
negative ones, and to implement unary operator- and operator+, which is why
C++ has them.  So you may not be able to escape having some name overloading.

> I guess this would require either predefining it, even with no actual
> core semantics, or providing the parser-to-lexer feedback, and
> eventually replacing a currently vanilla and clean flex lexer with
> something else, and/or writing a lot of ugly hacks in it.

Have a look at the C++ operator precedence table [1].  You might try to
squeeze in the user defined operators at some point in the middle.

1. https://en.cppreference.com/w/cpp/language/operator_precedence

> Now consider that the ++ operator has a completely different meaning in
> C++ than in Haskell.  Repeat for the ** operator, which exists in Python
> and Haskell but not (or, if it does exist, is not very popular) in
> languages like C++ or Java.  Some languages provide a // operator, etc.
> So predefining is not a good solution, I would say.

In Haskell, it is a Monad operator, which C++ does not have.  :-)  The Haskell
interpreter Hugs has a file Prelude.hs which defines a lot of prelude
functions in Haskell code.  But Haskell has only 10 precedence levels, which
is a bit too little.

> Anyway, this is just thinking about the ultimate possibilities that in
> my opinion some abstract extensible spec should try to provide, or at
> least foresee, but that I don't prioritize fully implementing.

It is good to think it through before implementing it.  Bison makes it easy to
define a compile-time grammar, making it easy to test it out.
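A hedged sketch (not from the thread) of the 3-2 point above, using the usual
textbook Bison idiom: the scanner returns only non-negative NUMBER tokens, and
unary minus is a grammar rule with its own precedence, so "3-2" is a
subtraction while "-2" still parses.  The names (NUMBER, UMINUS) are
illustrative, not from the thread.

    %define api.value.type {int}
    %token NUMBER              /* scanner never returns negative numbers */
    %left '+' '-'
    %left '*' '/'
    %precedence UMINUS         /* precedence-only symbol for unary minus */

    %%
    exp:
      NUMBER
    | exp '+' exp              { $$ = $1 + $3; }
    | exp '-' exp              { $$ = $1 - $3; }
    | '-' exp %prec UMINUS     { $$ = -$2; }
    ;

With this arrangement the grammar itself disambiguates the two readings of
'-', so no lexer-parser handshake is needed for that particular operator.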