Matt Wilmas wrote:

I don't have much time right now, but looked at it quick, and see that
you're actually trying to work around the re2c issues in general. :-) I
was only thinking of putting a "band-aid" on the comment symptom(s),
since those are about the only ones that occur with valid code (is the
tokenizer ext. *supposed to* handle all tokens in code that wouldn't
really compile?).

Yeah, I figured I should try to fix as much as I could; specifically, YYLIMIT
not enforcing availability of 'n' chars makes me nervous. ;-)
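
To make the concern concrete, here is a minimal sketch of the kind of check I have in mind. It assumes the limit pointer sits one past the last valid byte; have_bytes() is a made-up name for illustration, not something that exists in the scanner.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if at least n bytes can be read starting
 * at cursor without running past limit, 0 otherwise.
 * Assumption: limit points one past the last valid byte of the input. */
static int have_bytes(const unsigned char *cursor,
                      const unsigned char *limit, size_t n)
{
    assert(cursor <= limit);               /* cursor must never pass limit */
    return (size_t)(limit - cursor) >= n;  /* compare remaining bytes to n */
}
```

A rule that needs to look at 'n' bytes would call something like this first and bail out (or report EOF) when it returns 0, rather than reading ahead blindly.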

I would expect the tokenizer to handle all tokens in code as long as they
pass the scanner phase (not the parser phase), but I'm not sure what the
intention is here.


And yeah, about excluding \x00 from ANY_CHAR, it could
change things, since it's always been allowed, although it seems strange
that code would have literal NULLs in it (generated eval()'d code?).
That was part of the reason I couldn't come up with a generic fix while
keeping all behavior. If re2c would just remember the last matching
state it was in at EOF like Flex!


It seems to me like the crux of the problem here is that we can't integrate an EOF check
(such as checking the length of data) within the regular expression.  While flex allows the
<<EOF>> rule, with re2c we are expected to provide a unique identifier/token to match on.  This
assumes that we have a unique character, or that the data is in good form so that we can
detect a token, etc.  Perhaps a good feature to add to re2c would be the ability to include a
special regex/token match that identifies special conditions programmatically, such as
(YYCURSOR == YYLIMIT).

In defense of re2c, I think having to handle EOF explicitly could be useful
in some situations, as it allows you more freedom for processing different
data types.
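
As a rough, hand-written illustration of handling EOF explicitly (this is not re2c output and not the actual scanner code; scan_comment() and the enum are hypothetical names), a comment rule could compare the cursor against the limit instead of relying on a sentinel character:

```c
#include <assert.h>
#include <stddef.h>

enum comment_status { COMMENT_CLOSED, COMMENT_UNTERMINATED };

/* Hypothetical sketch: scan the body of a block comment.
 * cursor points just past the comment opener; limit points one past the
 * last valid byte. On return, *end points just past the closing
 * star-slash pair, or equals limit if the input ran out first. */
static enum comment_status scan_comment(const unsigned char *cursor,
                                        const unsigned char *limit,
                                        const unsigned char **end)
{
    assert(cursor <= limit);
    while (cursor < limit) {              /* explicit EOF check, no sentinel */
        if (cursor[0] == '*' && limit - cursor >= 2 && cursor[1] == '/') {
            *end = cursor + 2;
            return COMMENT_CLOSED;
        }
        cursor++;                         /* any byte, including \x00, is skipped */
    }
    *end = limit;                         /* hit EOF inside the comment */
    return COMMENT_UNTERMINATED;
}
```

Because end of input is detected by position, a literal \x00 inside the comment is just another byte, and the caller can decide what to do with COMMENT_UNTERMINATED (for example, emit the warning).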

I'll have to look closer at the multi-byte processing as well.  I don't see a 
lot of cases where we would run into \x00 values in code.  (Perhaps someone can 
provide a suggested use case that we need to watch out for?)  Perhaps if
someone is including binary data strings within code?


Otherwise, I don't know what to do. :-/ I'm going to do something else
before trying to implement what I was going to do, so there's no patch
yet...


OK, I guess I'll keep working on this then, as there are a couple more tests I
want to run and some things to fix before I commit (like ensuring that YYLIMIT
actually ensures there are 'n' bytes available to read, etc.).

As far as the Warning, with "<?php /* blah " do you get "Unterminated
comment ..." ? Of course your patch would restore it, because it's
missing last I checked (not able to right now).


I didn't see this in the current, un-patched, php-5.3 build but I'll double 
check to make sure I wasn't still using my new binaries.

And that applies to the case Lukas gave in the bug report: WHITESPACE
pattern is variable length.

Didn't see/find this; is there a bug # or link?

I meant the "could be related if not the same problem" comment added the
other day in Bug #46817.


Ah, I see.  Yes, it was actually my friend who brought this to my attention
and got me looking into a fix ;-)


-shire

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
