improving error message (was: bison for nlp)

2018-11-10 Thread Akim Demaille
Hi Hans,

> Le 9 nov. 2018 à 14:45, Hans Åberg  a écrit :
> 
>> On 9 Nov 2018, at 12:11, Akim Demaille  wrote:
>> 
>>> Le 9 nov. 2018 à 09:58, Hans Åberg  a écrit :
>>> 
>>> 
 On 9 Nov 2018, at 05:59, Akim Demaille  wrote:
 
> By the way, I’ll still get the error message as a string I guess, right?
 
 Yes.  Some day we will work on improving error message generation,
 there is much demand.
>>> 
>>> One thing I’d like to have is if there is an error with, say, an identifier, 
>>> also writing out its name.
>> 
>> Yes, that’s a common desire.  However, I don’t think it’s really
>> what people need, because the way you print the semantic value
>> might differ from what you actually wrote.  For instance, if I have
>> a syntax error involving an integer literal written in binary,
>> say 0b101010, then I will be surprised to read that I have an error
>> involving 42.
>> 
>> So you would need to carry the exact string from the scanner to the
>> parser, and I think that’s too much to ask for.
> 
> That is what I do. So I merely want an extra argument in the error reporting 
> function where it can be put.

Please, be clearer: what extra argument, and show how the parser
can provide it.  Also, see if using %param does not already
give you what you need to pass information from the scanner to the
parser’s yyerror.
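As a sketch of the kind of plumbing under discussion (hypothetical names, not Hans’s actual code): the scanner could store the exact lexeme text next to the converted value in the semantic value.

```yacc
/* Hypothetical sketch: keep the exact spelling next to the value. */
%union {
  struct {
    long value;         /* e.g. 42 */
    char spelling[32];  /* e.g. "0b101010", as actually written */
  } num;
}
%token <num> NUM
```

The scanner would then fill both fields, e.g. `yylval.num.value = strtol (yytext, NULL, 0);` followed by an `strncpy` of `yytext` into `spelling`, so that error reporting can show the original text.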

>> I believe that the right approach is rather the one we have in compilers
>> and in bison: caret errors.
>> 
>> $ cat /tmp/foo.y
>> %token FOO 0xff 0xff
>> %%
>> exp:;
>> $ LC_ALL=C bison /tmp/foo.y
>> /tmp/foo.y:1.17-20: error: syntax error, unexpected integer
>> %token FOO 0xff 0xff
>>                 ^^^^
>> I would have been bothered by « unexpected 255 ».
> 
> Currently, that’s for those still using only ASCII.

No, it’s not, it works with UTF-8.  Bison’s count of characters is mostly
correct.  I’m talking about Bison’s own location, used to parse grammars,
which is improved compared to what we ship in generated parsers.

$ bison /tmp/foo.y
/tmp/foo.y:2.6: error: invalid characters: « 💩 »
 exp: 💩 💩 💩 💩;
      ^
/tmp/foo.y:2.8: error: invalid characters: « 💩 »
 exp: 💩 💩 💩 💩;
        ^
/tmp/foo.y:2.10: error: invalid characters: « 💩 »
 exp: 💩 💩 💩 💩;
          ^
/tmp/foo.y:2.12: error: invalid characters: « 💩 »
 exp: 💩 💩 💩 💩;
            ^

It will fail when there are composed characters, granted.  Don’t try
with the attached grammar.



[Attachment: foo.y]



> I am using Unicode characters and LC_CTYPE=UTF-8, so it will not display 
> properly. In fact, I am using special code to even write out Unicode 
> characters in the error strings, since Bison assumes all strings are ASCII, 
> the bytes with the high bit set being translated into escape sequences.

Yes, I’m aware of this issue, and we have to address it.
We also have to provide support for internationalization of
the token names.

___
help-bison@gnu.org https://lists.gnu.org/mailman/listinfo/help-bison

Re: improving error message (was: bison for nlp)

2018-11-10 Thread Hans Åberg

> On 10 Nov 2018, at 09:02, Akim Demaille  wrote:
> 
> Hi Hans,

Hello Akim,

> Yes.  Some day we will work on improving error message generation,
> there is much demand.
 
 One thing I’d like to have is if there is an error with, say, an identifier, 
 also writing out its name.
>>> 
>>> Yes, that’s a common desire.  However, I don’t think it’s really
>>> what people need, because the way you print the semantic value
>>> might differ from what you actually wrote.  For instance, if I have
>>> a syntax error involving an integer literal written in binary,
>>> say 0b101010, then I will be surprised to read that I have an error
>>> involving 42.
>>> 
>>> So you would need to carry the exact string from the scanner to the
>>> parser, and I think that’s too much to ask for.
>> 
>> That is what I do. So I merely want an extra argument in the error reporting 
>> function where it can be put.
> 
> Please, be clearer: what extra argument, and show how the parser
> can provide it.  

Yes, I need to analyze it and get back.

> Also, see if using %param does not already
> give you what you need to pass information from the scanner to the
> parser’s yyerror.

How would that get into the yyerror function?

>>> I believe that the right approach is rather the one we have in compilers
>>> and in bison: caret errors.
>>> 
>>> $ cat /tmp/foo.y
>>> %token FOO 0xff 0xff
>>> %%
>>> exp:;
>>> $ LC_ALL=C bison /tmp/foo.y
>>> /tmp/foo.y:1.17-20: error: syntax error, unexpected integer
>>> %token FOO 0xff 0xff
>>>
>>> I would have been bothered by « unexpected 255 ».
>> 
>> Currently, that’s for those still using only ASCII.
> 
> No, it’s not, it works with UTF-8.  Bison’s count of characters is mostly
> correct.  I’m talking about Bison’s own location, used to parse grammars,
> which is improved compared to what we ship in generated parsers.

Ah. I thought of errors for the generated parser only. There I only report the 
byte count, but using a character count will probably not help much for caret 
errors, as characters vary in width. The problem is that caret errors use two 
lines, which are hard to synchronize in Unicode. So perhaps some kind of 
one-line markup might do the trick instead.

>> I am using Unicode characters and LC_CTYPE=UTF-8, so it will not display 
>> properly. In fact, I am using special code to even write out Unicode 
>> characters in the error strings, since Bison assumes all strings are ASCII, 
>> the bytes with the high bit set being translated into escape sequences.
> 
> Yes, I’m aware of this issue, and we have to address it.

From what I could see, the function that converts to escapes is sometimes 
applied once and sometimes twice, relying on it being idempotent.

> We also have to provide support for internationalization of
> the token names.

Personally, I don't have any need for that. I use strings, like
  %token logical_not_key "¬"
  %token logical_and_key "∧"
  %token logical_or_key "∨"
and in the cases where there are names, they typically match what the lexer identifies.




Re: improving error message (was: bison for nlp)

2018-11-10 Thread Akim Demaille


> Le 10 nov. 2018 à 10:38, Hans Åberg  a écrit :
> 
>> Also, see if using %param does not already
>> give you what you need to pass information from the scanner to the
>> parser’s yyerror.
> 
> How would that get into the yyerror function?

In C, arguments of %parse-param are passed to yyerror.  That’s why I mentioned
%param, not %lex-param.  And in the C++ case, these are members.
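As a concrete sketch (hypothetical names, not code from the thread): in C, a %parse-param appears in yyerror’s signature as well, so scanner-filled state can reach the error message.

```yacc
/* Hypothetical sketch: %parse-param arguments reach yyerror in C. */
%parse-param { scanner_state *st }

%%
exp: /* grammar rules ... */;
%%

/* With the declaration above, Bison calls yyerror with the extra
   argument, so the exact lexeme the scanner recorded can be shown.  */
void yyerror (scanner_state *st, const char *msg)
{
  fprintf (stderr, "%s near '%s'\n", msg, st->last_lexeme);
}
```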


 I believe that the right approach is rather the one we have in compilers
 and in bison: caret errors.
 
 $ cat /tmp/foo.y
 %token FOO 0xff 0xff
 %%
 exp:;
 $ LC_ALL=C bison /tmp/foo.y
 /tmp/foo.y:1.17-20: error: syntax error, unexpected integer
 %token FOO 0xff 0xff
   
 I would have been bothered by « unexpected 255 ».
>>> 
>>> Currently, that’s for those still using only ASCII.
>> 
>> No, it’s not, it works with UTF-8.  Bison’s count of characters is mostly
>> correct.  I’m talking about Bison’s own location, used to parse grammars,
>> which is improved compared to what we ship in generated parsers.
> 
> Ah. I thought of errors for the generated parser only. There I only report 
> the byte count, but using a character count will probably not help much for 
> caret errors, as characters vary in width. The problem is that caret errors 
> use two lines, which are hard to synchronize in Unicode. So perhaps some 
> kind of one-line markup might do the trick instead.

Two things:

One is that the semantics of Bison’s location’s column is not specified:
it is up to the user to track characters or bytes.  As a matter of fact, Bison
is hardly concerned by this choice; rather, it’s the scanner that has to
deal with it.

The other one is: once you have the location, you can decide how to display
it.  In the case of Bison, I think the caret errors are fine, but you
could decide to do something different, say use colors or delimiters, to
be robust to varying width.



>>> I am using Unicode characters and LC_CTYPE=UTF-8, so it will not display 
>>> properly. In fact, I am using special code to even write out Unicode 
>>> characters in the error strings, since Bison assumes all strings are ASCII, 
>>> the bytes with the high bit set being translated into escape sequences.
>> 
>> Yes, I’m aware of this issue, and we have to address it.
> 
> From what I could see, the function that converts to escapes is sometimes 
> applied once and sometimes twice, relying on it being idempotent.

It’s a bit more tricky than this.  I’m looking into it, and I’d like
to address this in 3.3.


>> We also have to provide support for internationalization of
>> the token names.
> 
> Personally, I don't have any need for that. I use strings, like
>  %token logical_not_key "¬"
>  %token logical_and_key "∧"
>  %token logical_or_key "∨"
> and in the cases where there are names, they typically match what the lexer 
> identifies.

Yes, not all the strings should be translated.  I was thinking of
something like

%token NUM _("number")
%token ID _("identifier")
%token PLUS "+"

This way, we can even point xgettext to looking at the grammar file
rather than the generated parser.

Re: are there user defined infix operators?

2018-11-10 Thread Uxio Prego
Alright, but

> On 8 Nov 2018, at 23:37, Hans Åberg  wrote:
> 
>> On 8 Nov 2018, at 22:34, Uxio Prego  wrote:
>> 
>>> [...]
>> 
>> The example and explanation are worth a thousand words,
>> thank you very much. So I use a simple grammar like that, and
>> the stack data structures, and if necessary feed the lexer back
>> with data from the parser once the user requests some infix
>> operators.
> 
> It is only if you want to have a prefix and an infix or postfix operator with 
> the same name, like operator- or operator++ in C++, that there is a need for a 
> handshake between the lexer and the parser, and a boolean value that tells 
> whether the last token seen is a prefix operator suffices. Initially set to 
> false, the prefix operators set it to true in the parser, and all other 
> expression tokens set it to false. Then, when the lexer sees an operator that 
> can be both a prefix and an infix or postfix, it uses this value to 
> disambiguate. I leave it to you to figure out the cases; it is not that hard, 
> just a bit fiddly. :-)
> 

Yeah, but e.g. I don't plan to define ++ as an operator at all, even
though I would want any users wanting it to be able to configure it.

I guess this would require either predefining it, even with no actual core
semantics, or providing the parser-to-lexer feedback, and eventually replacing
the current vanilla, clean flex lexer with something else, and/or writing a
lot of ugly hacks into it.

Now consider that the ++ operator has a completely different meaning
from a C++ perspective than from a Haskell perspective. Repeat
for the ** operator, which exists in Python and Haskell but not (or,
if it does exist, it is surely not very popular) in languages
like C++ or Java. Some languages provide a // operator, etc. So
predefining is not a good solution, I would say.

Anyway, this is just thinking about the ultimate possibilities that, in
my opinion, some abstract extensible spec should try to provide,
or at least foresee, but that I don't prioritize fully implementing.



Re: improving error message

2018-11-10 Thread Hans Åberg

> On 10 Nov 2018, at 12:50, Akim Demaille  wrote:
> 
>> Le 10 nov. 2018 à 10:38, Hans Åberg  a écrit :
>> 
>>> Also, see if using %param does not already
>>> give you what you need to pass information from the scanner to the
>>> parser’s yyerror.
>> 
>> How would that get into the yyerror function?
> 
> In C, arguments of %parse-param are passed to yyerror.  That’s why I mentioned
> %param, not %lex-param.  And in the C++ case, these are members.

Actually, I was thinking about the token error. But for the yyerror function, I 
use C++ and compute the string from data in the semantic value; the prototype 
is:
  void yyparser::error(const location_type& loc, const std::string& errstr)

Then I use it for both errors and warnings, the latter we discussed long ago. 
For errors:
  throw syntax_error(@x, str); // Suitably computed string

For warnings:
  parser::error(@y, "warning: " + str);  // Suitably computed string

Then the error function above has:
  std::string s = "error: ";
  if (errstr.substr(0, 7) == "warning")
    s.clear();

This way, the string beginning with "error: " is not shown in the case of a 
warning.

> I believe that the right approach is rather the one we have in compilers
> and in bison: caret errors.
> 
> $ cat /tmp/foo.y
> %token FOO 0xff 0xff
> %%
> exp:;
> $ LC_ALL=C bison /tmp/foo.y
> /tmp/foo.y:1.17-20: error: syntax error, unexpected integer
> %token FOO 0xff 0xff
>  
> I would have been bothered by « unexpected 255 ».
 
 Currently, that’s for those still using only ASCII.
>>> 
>>> No, it’s not, it works with UTF-8.  Bison’s count of characters is mostly
>>> correct.  I’m talking about Bison’s own location, used to parse grammars,
>>> which is improved compared to what we ship in generated parsers.
>> 
>> Ah. I thought of errors for the generated parser only. There I only report 
>> the byte count, but using a character count will probably not help much for 
>> caret errors, as characters vary in width. The problem is that caret errors 
>> use two lines, which are hard to synchronize in Unicode. So perhaps some 
>> kind of one-line markup might do the trick instead.
> 
> Two things:
> 
> One is that the semantics of Bison’s location’s column is not specified:
> it is up the user to track characters or bytes.  As a matter of fact, Bison
> is hardly concerned by this choice; rather it’s the scanner that has to
> deal with that.
> 
> The other one is: once you have the location, you can decide how to display
> it.  In the case of Bison, I think the caret errors are fine, but you
> could decide to do something different, say use colors or delimiters, to
> be robust to varying width.

Yes, actually I thought about the token errors. But it is interesting to see 
what you say about it.

 I am using Unicode characters and LC_CTYPE=UTF-8, so it will not display 
 properly. In fact, I am using special code to even write out Unicode 
 characters in the error strings, since Bison assumes all strings are 
 ASCII, the bytes with the high bit set being translated into escape 
 sequences.
>>> 
>>> Yes, I’m aware of this issue, and we have to address it.
>> 
>> From what I could see, the function that converts to escapes is sometimes 
>> applied once and sometimes twice, relying on it being idempotent.
> 
> It’s a bit more tricky than this.  I’m looking into it, and I’d like
> to address this in 3.3.

I realized one needs to know a lot about Bison's innards to fix this. A thing 
that made me curious is why the function it uses zeroes out the high bit: it 
looks like it has something to do with the POSIX C locale, but I could not find 
anything requiring it to be set to zero in that locale.

Right now, I use a function that translates the escape sequences back to bytes.

>>> We also have to provide support for internationalization of
>>> the token names.
>> 
>> Personally, I don't have any need for that. I use strings, like
>> %token logical_not_key "¬"
>> %token logical_and_key "∧"
>> %token logical_or_key "∨"
>> and in the cases where there are names, they typically match what the lexer 
>> identifies.
> 
> Yes, not all the strings should be translated.  I was thinking of
> something like
> 
> %token NUM _("number")
> %token ID _("identifier")
> %token PLUS "+"
> 
> This way, we can even point xgettext to looking at the grammar file
> rather than the generated parser.

It might be good if one wants error messages in another language.




Re: are there user defined infix operators?

2018-11-10 Thread Hans Åberg

> On 10 Nov 2018, at 13:51, Uxio Prego  wrote:
> 
> Alright, but

OK, let's hear!

>> On 8 Nov 2018, at 23:37, Hans Åberg  wrote:
>> 
>>> On 8 Nov 2018, at 22:34, Uxio Prego  wrote:
>>> 
 [...]
>>> 
>>> The example and explanation are worth a thousand words,
>>> thank you very much. So I use a simple grammar like that, and
>>> the stack data structures, and if necessary feed the lexer back
>>> with data from the parser once the user requests some infix
>>> operators.
>> 
>> It is only if you want to have a prefix and an infix or postfix operator 
>> with the same name, like operator- or operator++ in C++, that there is a 
>> need for a handshake between the lexer and the parser, and a boolean value 
>> that tells whether the last token seen is a prefix operator suffices. 
>> Initially set to false, the prefix operators set it to true in the parser, 
>> and all other expression tokens set it to false. Then, when the lexer sees 
>> an operator that can be both a prefix and an infix or postfix, it uses this 
>> value to disambiguate. I leave it to you to figure out the cases; it is not 
>> that hard, just a bit fiddly. :-)
>> 
> 
> Yeah, but e.g. I don't plan to define ++ as an operator at all, even
> though I would want any users wanting it to be able to configure
> it.

An implementation detail to be aware of is that if negative numbers are allowed 
as tokens, then 3-2 will parse as 3 followed by -2, not as a subtraction. 
Therefore, it may be better to have only positive number tokens and implement 
unary operator- and operator+, which is why C++ has them.

So you may not be able to escape having some name overloading.

> I guess this would require either predefining it, even with no actual core
> semantics, or providing the parser-to-lexer feedback, and eventually
> replacing the current vanilla, clean flex lexer with something else, and/or
> writing a lot of ugly hacks into it.

Have a look at the C++ operator precedence table [1]. You might try to squeeze 
in the user-defined operators at some point in the middle.

1. https://en.cppreference.com/w/cpp/language/operator_precedence

> Now consider that the ++ operator has a completely different meaning
> from a C++ perspective than from a Haskell perspective. Repeat
> for the ** operator, which exists in Python and Haskell but not (or,
> if it does exist, it is surely not very popular) in languages
> like C++ or Java. Some languages provide a // operator, etc. So
> predefining is not a good solution, I would say.

In Haskell, it is a Monad operator, C++ does not have that. :-) The Haskell 
interpreter Hugs has a file Prelude.hs which defines a lot of prelude functions 
in Haskell code.

But Haskell has only 10 precedence levels, which is a bit too few.

> Anyway this is just thinking about the ultimate possibilities that in
> my opinion some abstract extensible spec should try to provide,
> or at least foresee, but I don't prioritize to fully implement.

It is good to think it through before implementing it. Bison makes it easy to 
define a grammar at compile time, which makes it easy to test things out.


