Synopsis 5 says that "C<\n> now matches a logical (platform independent)
newline not just C<\012>". But the devil is in the details, and I'd
like confirmation (or discussion) of the details of C<\n> so I can
implement it in PGE.
Quick summary: I'm thinking that C<\n> should be defined as
the equivalent of

    rule nl { [ \015\012 | <[\015\012\f\x85\x{2028}\x{2029}]> ]: }

Note the colon (:) at the end of the pattern, which means that the
CRLF sequence (\x0d\x0a) will always be treated as a single newline
for purposes of matching C<\n>.
Discussion: The common newline sequences in use today are
LF (\x0a), CRLF (\x0d\x0a), and CR (\x0d), depending on the
operating system involved. CRLF is the tricky one when
it comes to quantification. In particular, consider the
following:
    "\012\012\012\012" ~~ / \n**{4} /    # matches (4 LFs)
    "\015\015\015\015" ~~ / \n**{4} /    # matches (4 CRs)
    "\015\012\015\012" ~~ / \n**{4} /    # ???
I'm of the opinion that the sequence "\015\012" should always
be treated as a single newline, in which case the last
expression above would not match because the target string contains
only two newlines. But I want to check if others' interpretations
square with mine on this point (and if there's no consensus on it,
we may need to pose the question to p6l for an official ruling).
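
To make the counting concrete, here is a minimal Python sketch of the
interpretation I'm proposing. It is only an illustration and not PGE
code: the names NEWLINE_CHARS, match_newline, and matches_n_newlines
are made up for this example, and the match is anchored at the start
of the string, which is enough for the four-character cases above.

```python
# Hypothetical illustration of the proposed semantics; not PGE code.
# Single-character newlines from the proposed definition.
NEWLINE_CHARS = frozenset("\n\r\f\x85\u2028\u2029")


def match_newline(text, pos):
    """Return the position just past one logical newline at pos, or None."""
    if text.startswith("\r\n", pos):
        # CRLF is consumed as one unit and is never split afterwards,
        # which is the effect the trailing colon has in the rule.
        return pos + 2
    if pos < len(text) and text[pos] in NEWLINE_CHARS:
        return pos + 1
    return None


def matches_n_newlines(text, count):
    """True if `count` consecutive logical newlines match at the start of text."""
    pos = 0
    for _ in range(count):
        pos = match_newline(text, pos)
        if pos is None:
            return False
    return True


print(matches_n_newlines("\n\n\n\n", 4))    # 4 LFs
print(matches_n_newlines("\r\r\r\r", 4))    # 4 CRs
print(matches_n_newlines("\r\n\r\n", 4))    # 2 CRLFs, asked for 4
print(matches_n_newlines("\r\n\r\n", 2))    # 2 CRLFs, asked for 2
```

Under this reading the third call fails and the fourth succeeds. An
engine that was allowed to backtrack into the CRLF alternative could
instead split each pair into CR and LF and report four newlines.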
The other characters in the definition of C<\n> above come from
Unicode, which gives the following as line terminators:
    LF    - line feed           - U+000A
    CR    - carriage return     - U+000D
    CR+LF - CR followed by LF
    FF    - form feed           - U+000C
    NL    - next line           - U+0085
    LS    - line separator      - U+2028
    PS    - paragraph separator - U+2029
With this, the definition of \N is simply any character that
is not in the set [\012\015\x0c\x85\x{2028}\x{2029}].
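
As a quick sanity check on that set, a small Python sketch (again just
an illustration, with the made-up names NEWLINE_CHARS and
is_non_newline) of the membership test:

```python
# Hypothetical illustration of the proposed \N; not PGE code.
# The six single characters excluded by the set above.
NEWLINE_CHARS = frozenset("\n\r\f\x85\u2028\u2029")


def is_non_newline(ch):
    """True if the single character ch is outside the newline set."""
    return len(ch) == 1 and ch not in NEWLINE_CHARS


print(is_non_newline("a"))                          # ordinary character
print(is_non_newline("\r"))                         # CR is in the set
print(any(is_non_newline(c) for c in "\r\n"))       # neither half of CRLF
```

Because both CR and LF are in the set, neither half of a CRLF pair can
match \N, so no separate case for the two-character sequence is needed
here.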
Comments and feedback welcomed.
Pm