Re: Which lexer do people use?

2020-07-04 Thread Hans Åberg


> On 3 Jul 2020, at 23:15, Daniele Nicolodi wrote:
> 
> Which other scanners do people use?

You might ask this question in the Usenet newsgroup comp.compilers.





Re: Which lexer do people use?

2020-07-04 Thread Christian Schoenebeck
On Saturday, 4 July 2020 08:14:46 CEST Akim Demaille wrote:
> Hi Daniele,
> 
> > On 3 Jul 2020, at 23:15, Daniele Nicolodi wrote:
> > 
> > Hello,
> > 
> > the historical pairing is using Flex with Bison. However, while Bison is
> > under active development and seems to be a very solid code base, there
> > isn't much activity on the Flex side https://github.com/westes/flex and
> > Flex codebase and capabilities show their age.
> 
> Yes.  I have a couple of issues open over there, and it takes ages
> to get them processed.  When they are.
> 
> When I tried to modernize the Flex doc about Bison, they even managed to
> turn this into a lecture about software maintenance.  And not install
> my changes.
> 
> https://github.com/westes/flex/pull/420

Well, that is just a difference in philosophies. It does look somewhat
awkward, though, that they kept criticising the Bison documentation while
not responding to the actual Flex issue at all.

In a perfect world, yes, it might have been desirable to keep old Bison
constructs in the Bison docs today, clearly marked in red as
'removed in version x, replaced by y in version z', but that's IMO a purely
theoretical issue, as everybody can clearly see that Akim always patiently
answers anybody's questions over here, for instance.

For me, the exaggerated 'divide and conquer' philosophy applied decades ago by
splitting scanner and parser was a much more painful decision, with clearly
perceivable negative consequences in the real world for all users.

> > I recently became aware of RE/flex https://www.genivia.com/reflex.html
> > which seems very promising. However, it only generates a C++ scanner
> > which may be (I haven't tried) to retro-fit into existing C projects to,
> > for example, gain full unicode (in its utf8 encoded form) support.
> 
> It seems amazing.  Featurewise and performancewise.  I did not know it
> (nor did I know ugrep).
> 
> I've seen projects use ragel (http://www.colm.net/open-source/ragel/)
> and re2c (https://re2c.org).  But, sadly, I have first-hand experience
> with Flex only, I can't comment about the others.
> 
> > Has anyone tried to hammer a C++ scanner peg generated by RE/flex into a
> > C grammar hole generated by Bison?
> > 
> > Which other scanners do people use?
> 
> Fine question.  I'm eager to read the answers!

AFAICS almost nobody is using anything other than Flex. Probably because its
designated task of handling type-3 (regular) grammars is already fully covered
by a correct RegEx implementation, and because most of the examples, howtos,
books and docs out there are based on Flex.

The only thing people miss once in a while on the scanner side is Unicode
support, but there are ways to work around that, as you rarely need Unicode
in the actual RegEx patterns. Unicode characters usually sit somewhere
between a (non-Unicode) start pattern and end pattern.
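
To make that workaround concrete, here is a minimal sketch in C. It assumes
UTF-8 input and uses a double-quoted string literal as the delimited
construct; both are illustrative assumptions, not something stated in this
thread, and the function name scan_string_literal is hypothetical. Every byte
of a multi-byte UTF-8 sequence is 0x80 or higher, so it can never be mistaken
for the ASCII quote or backslash, and a byte-oriented scanner can pass such
text through untouched.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns the length in bytes of the string literal
   starting at `s`, including both quotes, or 0 if `s` does not start a
   terminated literal.  Bytes >= 0x80 (UTF-8 lead and continuation bytes)
   are treated as ordinary content. */
static size_t scan_string_literal(const char *s)
{
    assert(s != NULL);
    if (s[0] != '"')
        return 0;                 /* ASCII start delimiter is required */
    size_t i = 1;
    while (s[i] != '\0') {
        if (s[i] == '\\' && s[i + 1] != '\0') {
            i += 2;               /* skip a backslash escape such as \" */
            continue;
        }
        if (s[i] == '"')
            return i + 1;         /* ASCII end delimiter found */
        i++;                      /* any other byte, ASCII or not */
    }
    return 0;                     /* input ended before the closing quote */
}
```

The scanner never decodes code points; it only needs the two ASCII
delimiters to be unambiguous, which UTF-8 guarantees.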

The obvious real improvement in the future will be to finally get rid of the
separate scanner for good, combining the two things which belonged together
from day one: having the scanner functionality directly in Bison, and saying
goodbye to all those scanner state stack hacks, which often end up in a huge
mess that people can hardly read and often lead to severe misbehaviour on
edge-case inputs.

Akim, was there any progress in the IP discussion for that to become possible 
one day or is that previously discussed merge off the table?

Best regards,
Christian Schoenebeck






Re: Which lexer do people use?

2020-07-04 Thread Derek Clegg
On Jul 3, 2020, at 11:14 PM, Akim Demaille wrote:
> 
> Hi Daniele,
> 
>> On 3 Jul 2020, at 23:15, Daniele Nicolodi wrote:
>> 
>> Hello,
>> 
>> the historical pairing is using Flex with Bison. However, while Bison is
>> under active development and seems to be a very solid code base, there
>> isn't much activity on the Flex side https://github.com/westes/flex and
>> Flex codebase and capabilities show their age.
> 
> Yes.  I have a couple of issues open over there, and it takes ages
> to get them processed.  When they are.
> 
> When I tried to modernize the Flex doc about Bison, they even managed to
> turn this into a lecture about software maintenance.  And not install
> my changes.
> 
> https://github.com/westes/flex/pull/420

My experience as well. I have found them to be unhelpful, slow, and
argumentative. At times, I’ve thought about forking flex and cleaning up the
obvious problems, but since it’s really not that hard to write a lexer, I
just roll my own. This works much better and integrates well with bison,
unlike the hoops I have to jump through when I use flex.
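
For readers wondering what rolling your own involves, below is a minimal
sketch of a hand-written scanner in C. It is an illustration under
assumptions, not the code referred to above: the token codes, struct lexer
and lexer_next are hypothetical names, and the token set (integers,
identifiers, single-character operators) is chosen only for the example.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

/* Hypothetical token codes, chosen above the range of plain characters.
   In a real project they would come from the header Bison generates. */
enum { TOK_EOF = 0, TOK_NUMBER = 258, TOK_IDENT = 259 };

struct lexer {
    const char *cursor;   /* next unread byte of a NUL-terminated input */
    long number;          /* value of the most recent TOK_NUMBER */
    const char *text;     /* start of the most recent TOK_IDENT */
    size_t length;        /* length in bytes of the most recent TOK_IDENT */
};

/* Returns the next token code and advances the cursor past it. */
static int lexer_next(struct lexer *lx)
{
    assert(lx != NULL && lx->cursor != NULL);
    while (isspace((unsigned char)*lx->cursor))
        lx->cursor++;                         /* skip whitespace */

    unsigned char c = (unsigned char)*lx->cursor;
    if (c == '\0')
        return TOK_EOF;
    if (isdigit(c)) {                         /* decimal integer */
        char *end;
        lx->number = strtol(lx->cursor, &end, 10);
        lx->cursor = end;
        return TOK_NUMBER;
    }
    if (isalpha(c) || c == '_') {             /* identifier */
        lx->text = lx->cursor;
        while (isalnum((unsigned char)*lx->cursor) || *lx->cursor == '_')
            lx->cursor++;
        lx->length = (size_t)(lx->cursor - lx->text);
        return TOK_IDENT;
    }
    lx->cursor++;
    return c;             /* single-character token such as '+' */
}
```

In a Bison project, a thin yylex wrapper would call a function like this and
copy the number or identifier into yylval before returning the token code.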

> 
>> I recently became aware of RE/flex https://www.genivia.com/reflex.html
>> which seems very promising. However, it only generates a C++ scanner
>> which may be (I haven't tried) to retro-fit into existing C projects to,
>> for example, gain full unicode (in its utf8 encoded form) support.
> 
> It seems amazing.  Featurewise and performancewise.  I did not know it
> (nor did I know ugrep).
> 
> I've seen projects use ragel (http://www.colm.net/open-source/ragel/)
> and re2c (https://re2c.org).  But, sadly, I have first-hand experience
> with Flex only, I can't comment about the others.

This is very interesting. It’s good for me since I pretty much only use C++.

>> Has anyone tried to hammer a C++ scanner peg generated by RE/flex into a
>> C grammar hole generated by Bison?

Let us know how it works out if you try!

Derek


Re: Which lexer do people use?

2020-07-04 Thread John P. Hartmann
For the scanner and parser I maintain on UNIX and then transport to the
EBCDIC world of the mainframe, I had to write my own scanner, but I can
get by with Bison as long as I don't use character constants in rules
(IBM 360 assembler in rules does work).  There were a few other hoops,
such as Bison no longer offering a tables-only option.


For other work, which is all in C, flex seems satisfactory.  Call me
old-fashioned.




Re: Which lexer do people use?

2020-07-04 Thread Adrian Vogelsgesang
Hi Daniele,

> Which other scanners do people use?
For what it’s worth, we are using a hand-rolled scanner. It seemed the
fastest way to get rolling and the easiest to maintain.

Also, it allowed us to embed a few hacks directly inside the scanner. E.g.,
in a few places our grammar is not actually LR(1), though only in very few
edge cases, so we don’t want to use GLR. Hence, our scanner does a lookahead
and, e.g., upon encountering the token “WITH” looks at the following token. If
the next token is “TIMESTAMP”, it produces “WITH_LA” instead of just “WITH”.
Thereby, we get one token of lookahead from the scanner. Combined with the one
token of lookahead provided by bison, we can now parse our LR(2) grammar.
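
The lookahead trick described above can be sketched roughly as follows. This
is not the actual scanner, only a minimal C illustration: the token names,
raw_scanner, la_filter and filter_next are all hypothetical, and the raw
scanner is a stand-in that replays a fixed token array.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical token codes; a real project would take these from the
   header Bison generates. */
enum token { TOK_EOF = 0, TOK_WITH, TOK_WITH_LA, TOK_TIMESTAMP, TOK_IDENT };

/* Stand-in for the raw scanner: replays a fixed array of tokens. */
struct raw_scanner {
    const enum token *tokens;
    size_t count;
    size_t pos;
};

static enum token raw_next(struct raw_scanner *s)
{
    return s->pos < s->count ? s->tokens[s->pos++] : TOK_EOF;
}

/* Filter between the raw scanner and the parser.  It buffers at most one
   token, which is the extra token of lookahead. */
struct la_filter {
    struct raw_scanner *raw;
    int have_pending;        /* nonzero if `pending` holds a buffered token */
    enum token pending;
};

static enum token filter_next(struct la_filter *f)
{
    assert(f != NULL && f->raw != NULL);
    enum token cur;
    if (f->have_pending) {   /* hand out the token peeked at earlier */
        f->have_pending = 0;
        cur = f->pending;
    } else {
        cur = raw_next(f->raw);
    }
    if (cur != TOK_WITH)
        return cur;
    /* Peek at the token after WITH and keep it for the next call. */
    f->pending = raw_next(f->raw);
    f->have_pending = 1;
    return f->pending == TOK_TIMESTAMP ? TOK_WITH_LA : TOK_WITH;
}
```

The parser then calls filter_next wherever it would have called the raw
scanner, and the grammar uses WITH_LA in the rules that need the
disambiguation.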

Not sure if this would also have been possible with flex, but given we have a
hand-rolled scanner it was straightforward.

You can find a similar hack in
https://github.com/postgres/postgres/blob/master/src/backend/parser/gram.y#L721
if you look for the WITH_LA keyword. Postgres uses a flex scanner and then
stacks a custom layer between flex and bison, which introduces additional
maintenance overhead.

Cheers,
Adrian


From: help-bison on behalf of Daniele Nicolodi
Date: Friday, 3 July 2020 at 23:15
To: Bison Help 
Subject: Which lexer do people use?

Hello,

the historical pairing is using Flex with Bison. However, while Bison is
under active development and seems to be a very solid code base, there
isn't much activity on the Flex side 
https://github.com/westes/flex and
Flex codebase and capabilities show their age.

I recently became aware of RE/flex 
https://www.genivia.com/reflex.html
which seems very promising. However, it only generates a C++ scanner
which may be (I haven't tried) to retro-fit into existing C projects to,
for example, gain full unicode (in its utf8 encoded form) support.

Has anyone tried to hammer a C++ scanner peg generated by RE/flex into a
C grammar hole generated by Bison?

Which other scanners do people use?

Thank you.

Cheers,
Dan