Edit report at https://bugs.php.net/bug.php?id=61018&edit=1
ID: 61018
Comment by: danielklein at airpost dot net
Reported by: dey101+php at gmail dot com
Summary: Unexplained bool(false) returned from preg_match
Status: Open
Type: Bug
Package: PCRE related
PHP Version: 5.3.10
New Comment:
I have simplified the error to the following:
<?php
$string = 'ABCDEFGHIJ12345678.';
var_dump(preg_match('/^(?:\w*)*$/i', $string));
$string = 'ABCDEFGHIJ1234567.';
var_dump(preg_match('/^(?:\w*)*$/i', $string));
?>
Outputs:
boolean false
int 0
A pattern like /(\w*)*/ is VERY inefficient, as the engine must try every way
of splitting the word characters before failing, i.e. matching:
'ABCDEFGHIJ12345678', ''
'ABCDEFGHIJ1234567', '8', ''
'ABCDEFGHIJ1234567', '', '8', ''
'ABCDEFGHIJ123456', '78', ''
'ABCDEFGHIJ123456', '7', '8', ''
'ABCDEFGHIJ123456', '7', '', '8', ''
'ABCDEFGHIJ123456', '', '78', ''
...
'', 'A', '', 'B', '', 'C', '', 'D', '', 'E', '', 'F', '', 'G', '', 'H', '',
'I', '', 'J', '', '1', '', '2', '', '3', '', '4', '', '5', '', '6', '', '7',
'', '8', ''
It is most likely running out of memory before it completes. I would suggest
that this is not a bug as it will use exponentially more memory the longer the
input string gets.
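To get a feel for that growth, here is a minimal sketch that counts how many ways a run of n word characters can be cut into consecutive non-empty pieces. The function name count_splits is made up for this illustration, and the count ignores the empty pieces listed above, so it is at best a lower bound on what the regex engine may have to try.

```php
<?php
// Illustrative helper (hypothetical name, not part of PHP or PCRE).
// Counts the ways to cut a string of $n characters into consecutive
// non-empty pieces, e.g. 'AB' -> ['AB'] or ['A', 'B'].
function count_splits($n)
{
    $ways = array(0 => 1); // the empty string has exactly one split: no pieces
    for ($i = 1; $i <= $n; $i++) {
        $ways[$i] = 0;
        // The last piece covers characters $j..$i-1; whatever precedes it
        // can be split in $ways[$j] ways.
        for ($j = 0; $j < $i; $j++) {
            $ways[$i] += $ways[$j];
        }
    }
    return $ways[$n];
}

foreach (array(1, 2, 3, 17, 18) as $n) {
    echo $n, ' => ', count_splits($n), "\n";
}
```

Each extra character doubles the count (2^(n-1) in total), which is consistent with a 17-character run finishing while an 18-character run does not.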
You should try something like '/^(?:(?>\w*))*$/i' instead to avoid undesired
backtracking.
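As a rough sketch of that suggestion, the snippet below runs the atomic-group pattern next to two variants that are not from this report and are only assumptions: a possessive quantifier (\w*+) and the plain /^\w*$/i, which should accept the same strings because (?:\w*)* can only ever match a run of word characters. The helper name try_patterns is hypothetical.

```php
<?php
// Hypothetical helper: run each pattern against $subject and collect the
// raw preg_match() results keyed by a label.
function try_patterns($subject)
{
    $patterns = array(
        'atomic'     => '/^(?:(?>\w*))*$/i', // pattern suggested in the comment
        'possessive' => '/^(?:\w*+)*$/i',    // assumption: possessive form of the same idea
        'flat'       => '/^\w*$/i',          // assumption: no nested quantifier at all
    );
    $results = array();
    foreach ($patterns as $label => $pattern) {
        $results[$label] = preg_match($pattern, $subject);
    }
    return $results;
}

var_dump(try_patterns('ABCDEFGHIJ12345678.')); // trailing period: no match expected
var_dump(try_patterns('ABCDEFGHIJ12345678'));  // word characters only: match expected
```

None of the three variants can give back characters one at a time inside a repeated group, so a failing subject should be rejected quickly with int(0) rather than bool(false).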
Previous Comments:
------------------------------------------------------------------------
[2012-02-15 18:39:27] [email protected]
I have verified that the output from this repro script is the same on both
Windows and Linux (Both using 5.3.10), so this is not a Windows specific bug
report anymore.
------------------------------------------------------------------------
[2012-02-14 13:42:21] dey101+php at gmail dot com
I did not have access to a Linux platform to test on. If you have verified
that the bug exists on multiple platforms, please feel free to re-classify it as a
general bug.
------------------------------------------------------------------------
[2012-02-13 23:40:06] [email protected]
Thank you for your report and for helping to make PHP better.
When I ran your script on Windows 2008 and Linux (using a TS build of PHP 5.3.10),
the output was the same on both OSes. I don't think this is a PHP
on Windows bug.
If you would like, I can reclassify this bug as a general bug, not specific to
Windows.
Or, am I missing something? Is this really a PHP on Windows problem?
win2008 sp1 x64 output(TS Build):
Regex: /^[[:alnum:]](?:[[:alnum:]\-]*[[:alnum:]])*$/
Host: ABCDEFGHIJ1234567890.
Result: (error) bool(false)
Host: ABCDEFGHI234567890.
Result: (no match) int(0)
Host: ABCDEFGHIJ1234567890
Result: (match) int(1)
Host: ABCDEFGHI1234567890
Result: (match) int(1)
Host: ABCDEFGHI123456789
Result: (match) int(1)
Host: ABCDEFGHIJ-1234567890
Result: (match) int(1)
Host: ABCDEFGHIJ-123456789
Result: (match) int(1)
Host: ABCDEFGHI-123456789
Result: (match) int(1)
Host: WWW.ABCDEFGHIJ-1234567890.COM
Result: (no match) int(0)
Host: WWW.SUB-SUBDOMAIN.SUBDOMAIN.ABCD-EFGH-IJKL-MNOP-QRST-UVWX-YZ-12345-67890-abcd-efgh-hijk.COM
Result: (no match) int(0)
Regex: /^(?:[[:alnum:]](?:[[:alnum:]\-]*[[:alnum:]])*\.)*$/
Host: ABCDEFGHIJ1234567890.
Result: (match) int(1)
Host: ABCDEFGHI234567890.
Result: (match) int(1)
Host: ABCDEFGHIJ1234567890
Result: (error) bool(false)
Host: ABCDEFGHI1234567890
Result: (error) bool(false)
Host: ABCDEFGHI123456789
Result: (no match) int(0)
Host: ABCDEFGHIJ-1234567890
Result: (error) bool(false)
Host: ABCDEFGHIJ-123456789
Result: (error) bool(false)
Host: ABCDEFGHI-123456789
Result: (no match) int(0)
Host: WWW.ABCDEFGHIJ-1234567890.COM
Result: (error) bool(false)
Host: WWW.SUB-SUBDOMAIN.SUBDOMAIN.ABCD-EFGH-IJKL-MNOP-QRST-UVWX-YZ-12345-67890-abcd-efgh-hijk.COM
Result: (error) bool(false)
Regex: /^(?:[[:alnum:]](?:[[:alnum:]\-]*[[:alnum:]])*\.)+$/
Host: ABCDEFGHIJ1234567890.
Result: (match) int(1)
Host: ABCDEFGHI234567890.
Result: (match) int(1)
Host: ABCDEFGHIJ1234567890
Result: (no match) int(0)
Host: ABCDEFGHI1234567890
Result: (no match) int(0)
Host: ABCDEFGHI123456789
Result: (no match) int(0)
Host: ABCDEFGHIJ-1234567890
Result: (no match) int(0)
Host: ABCDEFGHIJ-123456789
Result: (no match) int(0)
Host: ABCDEFGHI-123456789
Result: (no match) int(0)
Host: WWW.ABCDEFGHIJ-1234567890.COM
Result: (error) bool(false)
Host: WWW.SUB-SUBDOMAIN.SUBDOMAIN.ABCD-EFGH-IJKL-MNOP-QRST-UVWX-YZ-12345-67890-abcd-efgh-hijk.COM
Result: (error) bool(false)
Regex: /^(?:[[:alnum:]](?:[[:alnum:]\-]*[[:alnum:]])*\.)*[[:alnum:]](?:[[:alnum:]\-]*[[:alnum:]])*$/
Host: ABCDEFGHIJ1234567890.
Result: (error) bool(false)
Host: ABCDEFGHI234567890.
Result: (error) bool(false)
Host: ABCDEFGHIJ1234567890
Result: (error) bool(false)
Host: ABCDEFGHI1234567890
Result: (error) bool(false)
Host: ABCDEFGHI123456789
Result: (match) int(1)
Host: ABCDEFGHIJ-1234567890
Result: (error) bool(false)
Host: ABCDEFGHIJ-123456789
Result: (error) bool(false)
Host: ABCDEFGHI-123456789
Result: (match) int(1)
Host: WWW.ABCDEFGHIJ-1234567890.COM
Result: (match) int(1)
Host: WWW.SUB-SUBDOMAIN.SUBDOMAIN.ABCD-EFGH-IJKL-MNOP-QRST-UVWX-YZ-12345-67890-abcd-efgh-hijk.COM
Result: (match) int(1)
Linux-x64-gentoo output:
Regex: /^[[:alnum:]](?:[[:alnum:]\-]*[[:alnum:]])*$/
Host: ABCDEFGHIJ1234567890.
Result: (error) bool(false)
Host: ABCDEFGHI234567890.
Result: (no match) int(0)
Host: ABCDEFGHIJ1234567890
Result: (match) int(1)
Host: ABCDEFGHI1234567890
Result: (match) int(1)
Host: ABCDEFGHI123456789
Result: (match) int(1)
Host: ABCDEFGHIJ-1234567890
Result: (match) int(1)
Host: ABCDEFGHIJ-123456789
Result: (match) int(1)
Host: ABCDEFGHI-123456789
Result: (match) int(1)
Host: WWW.ABCDEFGHIJ-1234567890.COM
Result: (no match) int(0)
Host: WWW.SUB-SUBDOMAIN.SUBDOMAIN.ABCD-EFGH-IJKL-MNOP-QRST-UVWX-YZ-12345-67890-abcd-efgh-hijk.COM
Result: (no match) int(0)
Regex: /^(?:[[:alnum:]](?:[[:alnum:]\-]*[[:alnum:]])*\.)*$/
Host: ABCDEFGHIJ1234567890.
Result: (match) int(1)
Host: ABCDEFGHI234567890.
Result: (match) int(1)
Host: ABCDEFGHIJ1234567890
Result: (error) bool(false)
Host: ABCDEFGHI1234567890
Result: (error) bool(false)
Host: ABCDEFGHI123456789
Result: (no match) int(0)
Host: ABCDEFGHIJ-1234567890
Result: (error) bool(false)
Host: ABCDEFGHIJ-123456789
Result: (error) bool(false)
Host: ABCDEFGHI-123456789
Result: (no match) int(0)
Host: WWW.ABCDEFGHIJ-1234567890.COM
Result: (error) bool(false)
Host: WWW.SUB-SUBDOMAIN.SUBDOMAIN.ABCD-EFGH-IJKL-MNOP-QRST-UVWX-YZ-12345-67890-abcd-efgh-hijk.COM
Result: (error) bool(false)
Regex: /^(?:[[:alnum:]](?:[[:alnum:]\-]*[[:alnum:]])*\.)+$/
Host: ABCDEFGHIJ1234567890.
Result: (match) int(1)
Host: ABCDEFGHI234567890.
Result: (match) int(1)
Host: ABCDEFGHIJ1234567890
Result: (no match) int(0)
Host: ABCDEFGHI1234567890
Result: (no match) int(0)
Host: ABCDEFGHI123456789
Result: (no match) int(0)
Host: ABCDEFGHIJ-1234567890
Result: (no match) int(0)
Host: ABCDEFGHIJ-123456789
Result: (no match) int(0)
Host: ABCDEFGHI-123456789
Result: (no match) int(0)
Host: WWW.ABCDEFGHIJ-1234567890.COM
Result: (error) bool(false)
Host: WWW.SUB-SUBDOMAIN.SUBDOMAIN.ABCD-EFGH-IJKL-MNOP-QRST-UVWX-YZ-12345-67890-abcd-efgh-hijk.COM
Result: (error) bool(false)
Regex: /^(?:[[:alnum:]](?:[[:alnum:]\-]*[[:alnum:]])*\.)*[[:alnum:]](?:[[:alnum:]\-]*[[:alnum:]])*$/
Host: ABCDEFGHIJ1234567890.
Result: (error) bool(false)
Host: ABCDEFGHI234567890.
Result: (error) bool(false)
Host: ABCDEFGHIJ1234567890
Result: (error) bool(false)
Host: ABCDEFGHI1234567890
Result: (error) bool(false)
Host: ABCDEFGHI123456789
Result: (match) int(1)
Host: ABCDEFGHIJ-1234567890
Result: (error) bool(false)
Host: ABCDEFGHIJ-123456789
Result: (error) bool(false)
Host: ABCDEFGHI-123456789
Result: (match) int(1)
Host: WWW.ABCDEFGHIJ-1234567890.COM
Result: (match) int(1)
Host: WWW.SUB-SUBDOMAIN.SUBDOMAIN.ABCD-EFGH-IJKL-MNOP-QRST-UVWX-YZ-12345-67890-abcd-efgh-hijk.COM
Result: (match) int(1)
------------------------------------------------------------------------
[2012-02-08 18:43:42] dey101+php at gmail dot com
Description:
------------
PHP VC9 x86 Thread Safe (from http://windows.php.net/download/)
Using a regex to validate if a string is a valid hostname (host or FQDN).
It seems that, for strings of certain lengths, trying to match a literal period
at the end causes preg_match to return false if the string does not contain a
period. It also returns false if the string has a period at the end and the
regex does not try to match it.
The regex uses subpatterns () to apply the zero-or-more repetition
quantifier *. I tried both capturing () and non-capturing (?:) groups; both
yield the same result. However, if I use the one-or-more quantifier +, it does
not return bool(false). Using {0,} instead of * does not change the outcome.
The cutoff length for the string seems to be about 20 characters. Below
that, the result is int(0) or int(1) depending on whether the regex matches;
above that, bool(false) is returned.
If the subpattern is part of a longer string, it does work as anticipated.
Matching a literal period at the beginning of the pattern does not yield an
error.
Substituting a-zA-Z0-9 for the [:alnum:] character class does not affect the
results.
error_get_last() does not return anything, and nothing shows up in the logs
with error_reporting(-1) set either.
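error_get_last() only sees PHP-level errors; failures inside PCRE are reported separately through preg_last_error(). Below is a minimal sketch of how the result could be inspected. The helper name describe_match is hypothetical, and whether the bool(false) seen here is a backtrack-limit error is an assumption until the code is run against the failing inputs.

```php
<?php
// Hypothetical helper (not part of the original report): run preg_match()
// and describe the outcome, including the PCRE error code when it fails.
function describe_match($regex, $subject)
{
    $result = preg_match($regex, $subject);
    if ($result === false) {
        // Map the documented PREG_* error constants to readable names.
        $names = array(
            PREG_INTERNAL_ERROR        => 'PREG_INTERNAL_ERROR',
            PREG_BACKTRACK_LIMIT_ERROR => 'PREG_BACKTRACK_LIMIT_ERROR',
            PREG_RECURSION_LIMIT_ERROR => 'PREG_RECURSION_LIMIT_ERROR',
            PREG_BAD_UTF8_ERROR        => 'PREG_BAD_UTF8_ERROR',
            PREG_BAD_UTF8_OFFSET_ERROR => 'PREG_BAD_UTF8_OFFSET_ERROR',
        );
        $code = preg_last_error();
        return '(error) ' . (isset($names[$code]) ? $names[$code] : "code $code");
    }
    return $result ? '(match)' : '(no match)';
}

// What this prints depends on the PHP/PCRE version and ini settings.
echo describe_match('/^(?:\w*)*$/i', 'ABCDEFGHIJ12345678.'), "\n";
```

If the reported code turns out to be PREG_BACKTRACK_LIMIT_ERROR, the limit involved is the pcre.backtrack_limit ini setting.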
Test script:
---------------
$regexs = array
(
'/^[[:alnum:]](?:[[:alnum:]\-]*[[:alnum:]])*$/',
'/^(?:[[:alnum:]](?:[[:alnum:]\-]*[[:alnum:]])*\.)*$/',
'/^(?:[[:alnum:]](?:[[:alnum:]\-]*[[:alnum:]])*\.)+$/',
'/^(?:[[:alnum:]](?:[[:alnum:]\-]*[[:alnum:]])*\.)*[[:alnum:]](?:[[:alnum:]\-]*[[:alnum:]])*$/'
);
$hosts = array
(
'ABCDEFGHIJ1234567890.', // long string with period at end
'ABCDEFGHI234567890.', // slightly shorter string with period at end
'ABCDEFGHIJ1234567890', // long string no period
'ABCDEFGHI1234567890', // a little shorter
'ABCDEFGHI123456789', // even shorter
'ABCDEFGHIJ-1234567890', // long with hyphen
'ABCDEFGHIJ-123456789', // shorter with hyphen
'ABCDEFGHI-123456789', // even shorter with hyphen
'WWW.ABCDEFGHIJ-1234567890.COM', // a FQDN with long string and hyphen
'WWW.SUB-SUBDOMAIN.SUBDOMAIN.ABCD-EFGH-IJKL-MNOP-QRST-UVWX-YZ-12345-67890-abcd-efgh-hijk.COM'
// a really long FQDN
);
foreach ($regexs as $regex)
{
echo "\nRegex: $regex\n";
foreach ($hosts as $host)
{
echo " Host: $host\n";
$result = preg_match($regex, $host);
echo ' Result: ';
if ($result === false)
{
echo '(error) ';
print_r(error_get_last()); // never prints anything?
}
else
{
echo ($result) ? '(match) ' : '(no match) ';
}
var_dump($result);
}
}
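One possible way to sidestep the nested repetition in the patterns above, offered as a sketch rather than a fix for this report: the inner group (?:[[:alnum:]\-]*[[:alnum:]]) already absorbs any run of alphanumerics and hyphens that ends in an alphanumeric, so repeating it with * adds no new matches, only more ways to split the same text. Making it optional with ? should accept the same strings. The function name match_hostname is hypothetical.

```php
<?php
// Hypothetical helper, not from the report: same label grammar as the fourth
// test pattern, but the inner group is optional (?) instead of repeated (*),
// so each label can be matched in far fewer ways.
function match_hostname($host)
{
    $label = '[[:alnum:]](?:[[:alnum:]\-]*[[:alnum:]])?';
    $regex = '/^(?:' . $label . '\.)*' . $label . '$/';
    // Return the raw preg_match() result so that int(1), int(0) and
    // bool(false) stay distinguishable for the caller.
    return preg_match($regex, $host);
}

$hosts = array(
    'ABCDEFGHIJ1234567890.',
    'ABCDEFGHIJ1234567890',
    'WWW.ABCDEFGHIJ-1234567890.COM',
    '.ABCDEFGHIJ1234567890',
);
foreach ($hosts as $host) {
    echo $host, ' => ';
    var_dump(match_hostname($host));
}
```

The period is not part of the label character class, so the outer (?:label\.)* has only one way to divide a host into labels, and the amount of backtracking stays small even for long inputs.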
Expected result:
----------------
none of the results should yield bool(false)
Actual result:
--------------
// just the output from the last regex, but others yield bool(false)
Regex: /^(?:[[:alnum:]](?:[[:alnum:]\-]*[[:alnum:]])*\.)*[[:alnum:]](?:[[:alnum:]\-]*[[:alnum:]])*$/
Host: ABCDEFGHIJ1234567890.
Result: (error) bool(false)
Host: ABCDEFGHI234567890.
Result: (error) bool(false)
Host: .ABCDEFGHIJ1234567890
Result: (no match) int(0)
Host: ABCDEFGHIJ1234567890
Result: (error) bool(false)
Host: ABCDEFGHI1234567890
Result: (error) bool(false)
Host: ABCDEFGHI123456789
Result: (match) int(1)
Host: ABCDEFGHIJ-1234567890
Result: (error) bool(false)
Host: ABCDEFGHIJ-123456789
Result: (error) bool(false)
Host: ABCDEFGHI-123456789
Result: (match) int(1)
Host: WWW.ABCDEFGHIJ-1234567890.COM
Result: (match) int(1)
Host: WWW.SUB-SUBDOMAIN.SUBDOMAIN.ABCD-EFGH-IJKL-MNOP-QRST-UVWX-YZ-12345-67890-abcd-efgh-hijk.COM
Result: (match) int(1)
------------------------------------------------------------------------