Re: [PHP-DEV] Re: [PATCH] Scanner "diet" with fixes, etc.

Richard Quadling Thu, 30 Apr 2009 05:16:51 -0700

2009/4/30 Scott MacVicar <scott...@php.net>:
> [^] is a special case in re2c: a portable way to write "match any character".
>
> Scott
>
> Dmitry Stogov wrote:
>> Hi Matt,
>>
>> Does this patch fix the EOF handling issues related to mmap() (e.g. parsing
>> of files with size 4096, 8192, ...)? We currently have two dirty fixes to
>> handle them correctly.
>>
>> The patch is too big to understand quickly. I'll probably take a
>> look over the weekend.
>>
>> -ANY_CHAR [^\x00]
>> +ANY_CHAR [^]
>>
>> Is [^] a correct regular expression?
>>
>> Thanks. Dmitry.
>>
>> Matt Wilmas wrote:
>>> Hi Dmitry, Brian, all,
>>>
>>> Here's a scanner patch that I mentioned a while ago, with a possible
>>> way to work around the re2c EOF handling issues.
>>>
>>> The primary change is to do a "manual scan" like I talked about in
>>> areas that match large amounts and can contain NULL bytes
>>> (strings/comments, which are now scanned faster too), as is done for
>>> inline HTML.  I called it a "diet" :-) because it removes my
>>> complicated string regex patterns from a couple years ago, which
>>> doesn't make the .l file much smaller after adding the manual scan
>>> code (easier to understand...?), but it does result in a ~34k
>>> reduction of 5.3's generated .c file...
>>>
>>> This fixes Bug #46817 and provides a better, more proper fix for the
>>> older Bug #42767, both related to ending comments.
>>>
>>> Now inline HTML chunks aren't broken up when a tag starting with "s"
>>> is encountered (<script> for JS, <span>, etc.), since it's unlikely to
>>> be a long PHP <script> tag.
>>>
>>> If an opening PHP <SCRIPT> tag was used with a capital "S", it was
>>> missed if it wasn't the first thing scanned:
>>>
>>> var_dump(token_get_all("HTML... <SCRIPT language=php>"));
>>>
>>> Single-line comments with a Windows newline didn't include the full \r\n:
>>>
>>> var_dump(token_get_all("<?php // Comment\r\n?>"));
>>>
>>> Finally, part of the optimized scanning is that, for double quoted
>>> strings, when the first variable is encountered (making it
>>> non-constant), the amount that's been scanned up to that point is
>>> remembered, which can then be skipped over (up to the variable) after
>>> returning the quote token. Previously that initial part of the string
>>> was rescanned -- the cost dependent on how far "into" the string the
>>> first var is.
>>>
>>>
>>> I think that's about all --  I'll send another message if I forgot to
>>> mention anything...  Just wanted to send this along quickly for you
>>> guys to look at or whatever.  It was basically done last week, I just
>>> had to do a couple finishing touches and verify that everything was OK.
>>>
>>> http://realplain.com/php/scanner_diet.diff (Merged changes, but didn't
>>> test yet.)
>>> http://realplain.com/php/scanner_diet_5_3.diff
>>>
>>>
>>> Thanks,
>>> Matt
>>
>
> --
> PHP Internals - PHP Runtime Development Mailing List
> To unsubscribe, visit: http://www.php.net/unsub.php
>
>


Aha, [^] is covered at the bottom of the section at http://re2c.org/manual.html#lbAJ
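
To make the difference concrete: [^\x00] excludes the NUL byte, while [^]
matches any character, NUL included. The following is only a minimal sketch
in C of a length-bounded scan in that spirit; it is not taken from Matt's
patch, and the function name and the cursor/limit parameters are
hypothetical. It looks for the end of a block comment and treats a NUL byte
like any other byte, stopping at the buffer limit instead of at a sentinel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not from the patch: return a pointer just past the
 * closing star-slash of a block comment, or NULL if the input ends first.
 * The scan is bounded by `limit`, so a '\0' byte inside the comment is
 * skipped like any other byte (the behaviour [^] gives in a re2c rule). */
static const unsigned char *scan_comment_end(const unsigned char *cursor,
                                             const unsigned char *limit)
{
    assert(cursor <= limit);
    while (limit - cursor >= 2) {
        if (cursor[0] == '*' && cursor[1] == '/') {
            return cursor + 2;
        }
        cursor++; /* any byte, including '\0' */
    }
    return NULL; /* unterminated comment */
}
```

Returning NULL for an unterminated comment leaves it to the caller to decide
how to report that case, rather than relying on a NUL byte to mark EOF.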



-- 
-----
Richard Quadling
Zend Certified Engineer : http://zend.com/zce.php?c=ZEND002498&r=213474731
"Standing on the shoulders of some very clever giants!"
