Patrick wrote at 12:15pm CST on Wednesday, 10 November 2010:
>> Sorry if this is the wrong forum. I was wondering if there was a way to
>> specify unicode
>> categories <http://www.fileformat.info/info/unicode/category/index.htm> in
>> a regular expression (and hence a grammar), or if there would be any
>> consideration for adding support for that (requiring some kind of special
>> syntax).
> Unicode categories are done using assertion syntax with "is" followed by
> the category name. Thus <isLu> (uppercase letter), <isNd> (decimal digit),
> <isZs> (space separator), etc.
> This even works in Rakudo today:
> $ ./perl6
> > say 'abcdEFG' ~~ / <isLu> /
> E
> They can also be combined, as in +isLu+isLt (uppercase+titlecase).
> The relevant section of the spec is in Synopsis 5; search for "Unicode
> properties are always available with a prefix".
> Hope this helps!
Actually, that quote from the Synopsis raises more questions than it answers.
Below I've annotated the three output groups with (letters):
% uniprops -a A
U+0041 ‹A› \N{ LATIN CAPITAL LETTER A }:
(A) \w \pL \p{LC} \p{L_} \p{L&} \p{Lu}
(B) AHex ASCII_Hex_Digit All Any Alnum Alpha Alphabetic ASCII Assigned
Cased Cased_Letter LC Changes_When_Casefolded CWCF
Changes_When_Casemapped CWCM Changes_When_Lowercased CWL
Changes_When_NFKC_Casefolded CWKCF Lu L Gr_Base Grapheme_Base Graph
GrBase Hex XDigit Hex_Digit ID_Continue IDC ID_Start IDS Letter L_
Latin Latn Uppercase_Letter PerlWord PosixAlnum PosixAlpha
PosixGraph PosixPrint PosixUpper Print Upper Uppercase Word
XID_Continue XIDC XID_Start XIDS
(C) Age:1.1 Block=Basic_Latin Bidi_Class:L Bidi_Class=Left_To_Right
Bidi_Class:Left_To_Right Bc=L Block:ASCII Block:Basic_Latin
Blk=ASCII Canonical_Combining_Class:0
Canonical_Combining_Class=Not_Reordered
Canonical_Combining_Class:Not_Reordered Ccc=NR
Canonical_Combining_Class:NR Decomposition_Type:None Dt=None
East_Asian_Width:Na East_Asian_Width=Narrow East_Asian_Width:Narrow
Ea=Na Grapheme_Cluster_Break:Other GCB=XX Grapheme_Cluster_Break:XX
Grapheme_Cluster_Break=Other Hangul_Syllable_Type:NA
Hangul_Syllable_Type=Not_Applicable
Hangul_Syllable_Type:Not_Applicable Hst=NA
Joining_Group:No_Joining_Group Jg=NoJoiningGroup
Joining_Type:Non_Joining Jt=U Joining_Type:U
Joining_Type=Non_Joining Script=Latin Line_Break:AL
Line_Break=Alphabetic Line_Break:Alphabetic Lb=AL Numeric_Type:None
Nt=None Numeric_Value:NaN Nv=NaN Present_In:1.1 Age=1.1 In=1.1
Present_In:2.0 In=2.0 Present_In:2.1 In=2.1 Present_In:3.0 In=3.0
Present_In:3.1 In=3.1 Present_In:3.2 In=3.2 Present_In:4.0 In=4.0
Present_In:4.1 In=4.1 Present_In:5.0 In=5.0 Present_In:5.1 In=5.1
Present_In:5.2 In=5.2 Script:Latin Sc=Latn Script:Latn
Sentence_Break:UP Sentence_Break=Upper Sentence_Break:Upper SB=UP
Word_Break:ALetter WB=LE Word_Break:LE Word_Break=ALetter
What that means is that the "B" properties are properties from
the *General* category. They may all be referred to as \p{X}
or \p{IsX}, \p{General_Category=X} or \p{General_Category:X},
and \p{GC=X} or \p{GC:X}.
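To make those spellings concrete, here is a minimal Perl 5 sketch using one General_Category value, Lu. It assumes perl 5.12 or later; the helper name spellings_matching is made up for illustration and is not part of uniprops or any module.

```perl
use strict;
use warnings;

# Hypothetical helper: return the spellings whose pattern matches $char.
# Every pattern below is a different way of writing General_Category=Lu.
sub spellings_matching {
    my ($char) = @_;
    my @spellings = (
        [ '\p{Lu}',                  qr/\p{Lu}/ ],
        [ '\p{IsLu}',                qr/\p{IsLu}/ ],
        [ '\p{General_Category=Lu}', qr/\p{General_Category=Lu}/ ],
        [ '\p{General_Category:Lu}', qr/\p{General_Category:Lu}/ ],
        [ '\p{GC=Lu}',               qr/\p{GC=Lu}/ ],
        [ '\p{GC:Lu}',               qr/\p{GC:Lu}/ ],
    );
    return map { $_->[0] } grep { $char =~ $_->[1] } @spellings;
}

# Print each spelling that accepts LATIN CAPITAL LETTER A.
print "$_\n" for spellings_matching('A');
```

For 'A' I would expect all six spellings to be listed, and none of them for 'a'.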
I have a feeling that your synopsis quote is referring only to
type B properties. It is not talking about type C properties,
which must also be accounted for.
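For comparison, here is a rough Perl 5 sketch of what type C lookups look like: each one names both a property and a value. It assumes a perl new enough to accept the compound \p{Property=Value} form for these properties (5.12 or later should do); the helper name type_c_matches is hypothetical, and the property/value pairs are taken from the (C) group of the uniprops output above.

```perl
use strict;
use warnings;

# Hypothetical helper: return the "Property=Value" pairs that hold for $char.
sub type_c_matches {
    my ($char) = @_;
    my %patterns = (
        'Script=Latin'        => qr/\p{Script=Latin}/,
        'Block=Basic_Latin'   => qr/\p{Block=Basic_Latin}/,
        'Bidi_Class=L'        => qr/\p{Bidi_Class=L}/,
        'Line_Break=AL'       => qr/\p{Line_Break=AL}/,
        'East_Asian_Width=Na' => qr/\p{East_Asian_Width=Na}/,
    );
    # Sort the names first so the result order is stable.
    my @hits = grep { $char =~ $patterns{$_} } sort keys %patterns;
    return @hits;
}

# Print each property=value pair that holds for LATIN CAPITAL LETTER A.
print "$_\n" for type_c_matches('A');
```

Whatever syntax Perl 6 settles on would need some equivalent of this property-plus-value form, since a bare "is" prefix followed by a single name has nowhere to put the value.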
--tom