On Fri, Jan 30, 2009 at 03:30:02AM -0800, Darren Duncan wrote:
> pugs-comm...@feather.perl6.nl wrote:
>>  In the abstract, Perl is written in Unicode, and has consistent Unicode
>> -semantics regardless of the underlying text representations.
>> +semantics regardless of the underlying text representations.  By default
>> +Perl presents Unicode in "NFG" formation, where each grapheme counts as
>> +one character.  A grapheme is what the novice user would think of as a
>> +character in their normal everyday life, including any diacritics.
>
> What's with this NFG / Normal Form G that you refer to?  I don't see any
> mention of that in http://unicode.org/reports/tr15/ ... did you mean NFC?

Nope, this is a Perl/Parrot idea.  It started out with a notion of mine
a year ago.  Search for 'grapheme' in

    http://use.perl.org/~chromatic/journal/35461

We named it NFG about the time Simon Cozens wrote a PDD for it for
parrot.  At the moment it's much better specced in Parrotland than in
P6land.  See

    http://www.parrotcode.org/docs/pdd/pdd28_strings.html

NFG stands for Normalization Form G, where the G is short for
"grapheme".  And before anyone asks, yes, we were aware of the other
gloss for NFG when we picked it.  :)

> For that matter, is it possible for all realistic combinations of
> diacritics and base letters to be represented by a single Unicode
> codepoint, including all language-dependent graphemes?

No, that is the vision of NFC, but there are potentially an infinite
number of graphemes that can be composed in Unicode.  NFG aims to
represent each of those locally as a single integer, and translate back
out to a more standard normalization form on output.

> I thought NFC sort of did one codepoint per grapheme but there were a few
> exceptions ... I could be wrong on that point.

You are correct, NFC doesn't do all that we want.

By the way, we could use someone to write the Perl 6 Unicode synopsis,
based on PDD 28.

Larry
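
For readers who want something concrete, here is a rough Python sketch of
the "one integer per grapheme" idea described above.  It is an
illustration only, not the PDD 28 design: the class name GraphemeTable,
the use of negative integers for locally assigned grapheme ids, and the
simplified clustering rule (a base character followed by combining
marks, rather than full Unicode grapheme segmentation) are all
assumptions made for the example.

```python
import unicodedata


class GraphemeTable:
    """Hypothetical NFG-style table: every grapheme becomes one integer."""

    def __init__(self):
        self._ids = {}        # multi-codepoint cluster -> synthetic id
        self._clusters = []   # position (-id - 1) -> cluster string

    def _intern(self, cluster):
        # A grapheme that is a single code point keeps its own number.
        if len(cluster) == 1:
            return ord(cluster)
        # Anything longer gets a locally assigned negative id (assumption).
        if cluster not in self._ids:
            self._clusters.append(cluster)
            self._ids[cluster] = -len(self._clusters)
        return self._ids[cluster]

    def encode(self, text):
        # Compose first, so graphemes that NFC can express stay as-is.
        text = unicodedata.normalize("NFC", text)
        codes, cluster = [], ""
        for ch in text:
            # Simplification: a non-combining character starts a new cluster.
            if cluster and unicodedata.combining(ch) == 0:
                codes.append(self._intern(cluster))
                cluster = ""
            cluster += ch
        if cluster:
            codes.append(self._intern(cluster))
        return codes

    def decode(self, codes):
        # Translate back out to ordinary NFC text on output.
        return "".join(
            chr(c) if c >= 0 else self._clusters[-c - 1] for c in codes
        )


table = GraphemeTable()
# "e" + acute has a precomposed form; "q" + tilde does not.
codes = table.encode("e\u0301q\u0303z")
print(codes)                 # [233, -1, 122]
print(len(codes))            # 3 graphemes, 3 integers
```

The "q" with a tilde is the interesting case: NFC leaves it as two code
points, while the table hands back a single (made-up) integer for it and
restores the two code points in decode().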