Branch: refs/heads/blead
Home: https://github.com/Perl/perl5
Commit: d8c30f31076f4ffd2816d68d5b36a54d80a5aa85
https://github.com/Perl/perl5/commit/d8c30f31076f4ffd2816d68d5b36a54d80a5aa85
Author: Karl Williamson <[email protected]>
Date: 2025-10-19 (Sun, 19 Oct 2025)
Changed paths:
M pod/perldelta.pod
M t/op/sub_lval.t
M toke.c
Log Message:
-----------
toke.c: Fix inconsistency under 'use utf8'
This code can't work properly:
if (UTF ? isIDFIRST_utf8((U8*)s+1) : isWORDCHAR_A(s[1]))
Suppose you have a string composed entirely of ASCII characters
beginning with a digit. If the string isn't encoded in UTF-8, the
condition is true, but it is false if the string happens to have the
UTF-8 flag set for whatever reason. One such reason is simply that
the Perl program is being compiled under 'use utf8'.
The UTF-8 flag should not change the behavior of ASCII strings.
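The inconsistency can be sketched in standalone C. This is a simplified
model, not the toke.c code: the old_model_* helpers are hypothetical
ASCII-only stand-ins for isWORDCHAR_A and for what isIDFIRST_utf8 is
assumed to return on an ASCII byte, and utf8_flag stands in for UTF.

```c
#include <stdbool.h>

/* Hypothetical stand-in for isWORDCHAR_A: ASCII [A-Za-z0-9_].
 * Assumes an ASCII platform (contiguous letter ranges). */
static bool old_model_is_wordchar_a(unsigned char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
        || (c >= '0' && c <= '9') || c == '_';
}

/* Hypothetical stand-in for the IDFirst test applied to an ASCII byte:
 * [A-Za-z_], that is, word characters minus the digits. */
static bool old_model_is_idfirst_ascii(unsigned char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
}

/* Model of the old condition, where next plays the role of s[1]. */
static bool old_model_condition(unsigned char next, bool utf8_flag)
{
    return utf8_flag ? old_model_is_idfirst_ascii(next)
                     : old_model_is_wordchar_a(next);
}
```

In this model, old_model_condition('1', false) is true while
old_model_condition('1', true) is false, even though the input is the
same ASCII digit in both calls.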
The code was introduced in 9d58dbc453a86c9cbb3a131adcd1559fe0445a08 in
2015, to fix [perl #123963] "@<fullwidth digit>". The line it replaced
was
if (isWORDCHAR_lazy_if(s+1,UTF))
(The code was modified in 2016 by
fac0f7a38edc4e50a7250b738699165079b852d8 as part of a global
substitution to use isIDFIRST_utf8_safe() so as to have no possibility
of going off the end of the buffer, but that did not affect the logic.)
The problem the original commit was trying to solve was that fullwidth
digits (U+FF10 etc) were accepted when they shouldn't be, whereas [0-9]
should continue to be accepted. The defect is that [0-9] stopped being
accepted when the UTF-8 flag is on. The solution is to change the
condition to
if (isDIGIT_A(s[1]) || isIDFIRST_lazy_if_safe(s+1, send, UTF))
This causes [0-9] to remain accepted regardless of the UTF-8 flag.
So when it is on, the only difference between before this commit and
after is that [0-9] are accepted.
In the ASCII range, the only difference between \w and IDFirst is that
the former includes the digits 0-9, so when the UTF-8 flag is off the
new condition is equivalent to isWORDCHAR_A, as before.
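The ASCII-range equivalence can be checked exhaustively with a small
standalone sketch. The sketch_* names are hypothetical stand-ins, not
Perl's macros, and the model covers ASCII input only, where the IDFirst
answer is assumed not to depend on the UTF-8 flag.

```c
#include <stdbool.h>

/* Hypothetical ASCII-only stand-ins; assumes an ASCII platform. */
static bool sketch_is_digit(unsigned char c)
{
    return c >= '0' && c <= '9';
}

static bool sketch_is_alpha(unsigned char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

/* IDFirst in the ASCII range: letters and underscore. */
static bool sketch_is_idfirst(unsigned char c)
{
    return sketch_is_alpha(c) || c == '_';
}

/* \w in the ASCII range: letters, digits, and underscore. */
static bool sketch_is_wordchar(unsigned char c)
{
    return sketch_is_alpha(c) || sketch_is_digit(c) || c == '_';
}

/* Model of the new condition for an ASCII byte (next plays s[1]). */
static bool sketch_new_condition(unsigned char next)
{
    return sketch_is_digit(next) || sketch_is_idfirst(next);
}

/* Count ASCII code points where the new condition and \w disagree. */
static int sketch_count_mismatches(void)
{
    int mismatches = 0;
    for (int c = 0; c < 128; c++) {
        if (sketch_new_condition((unsigned char)c)
            != sketch_is_wordchar((unsigned char)c))
            mismatches++;
    }
    return mismatches;
}
```

Under these assumptions sketch_count_mismatches() returns 0: adding the
digits back to IDFirst reproduces \w for every ASCII code point.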
(Changing to isIDFIRST from isWORDCHAR in the original commit did solve
a bunch of other cases where a \w is not supposed to be the first
character in a name. There are about 4K such characters currently in
Unicode.)