On Thu, Jul 08, 2004 at 01:17:48PM -0400, Houghton,Andrew wrote:
> Unicode specifies four normalization methods, NFC, NFD, NFKC,
> and NFKD. While RDF could have just accepted characters in
> unnormalized form, it decided to mandate that all data content
> be provided in NFC normalization form. This has consequences
> when taking MARC-XML data to RDF. MARC-XML uses NFD, loosely,
> as Michael Doran pointed out there are exceptions.

Ouch, is this documented somewhere? I imagine it must be, but I never
seem to have run across it before. It's probably in bold letters in the
spec :]

> I'm not announcing the availability of the code yet, but you can
> take a peek at http://staff.oclc.org/~houghtoa/repository/perl/utf-nf.pl

Thanks Andrew. I'll check it out.

> I'm not sure how the internals of MARC::Charset work, but if it keeps
> the data in Perl's internal Unicode representation then all that you
> would need to do is call the normalize function in the Unicode::Normalize
> package. If not then you would need to convert it first into Perl's
> internal Unicode representation, probably with the Encode package, which
> is also built into all 5.8.0 Perl distributions.

All MARC::Charset does is provide hash lookups for the LC mapping
tables [1], plus a fairly simple algorithm for reading the MARC-8 escapes
and translating to UTF-8 appropriately. One somewhat nice bonus is that
the big East Asian mapping is stored in a Berkeley DB to save on
memory--but I guess memory is cheap these days.

Thanks for the tip about Unicode::Normalize. MARC::Charset already
requires Perl 5.8, so I think adding this normalization would be a good
idea at some point.

//Ed

[1] http://www.loc.gov/marc/specifications/specchartables.html
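
For reference, the normalization step described in the quoted text could
be sketched roughly as follows. This is only an illustration, not code
from MARC::Charset or from utf-nf.pl. The helper name utf8_bytes_to_nfc
is hypothetical, and the sketch assumes the input is already valid UTF-8
bytes (for example, the result of a MARC-8 to UTF-8 conversion). It
relies only on decode and encode from Encode and NFC from
Unicode::Normalize, both of which ship with Perl 5.8.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);
use Unicode::Normalize qw(NFC);

# Hypothetical helper (not part of MARC::Charset): take UTF-8 encoded
# bytes and return UTF-8 encoded bytes in normalization form NFC.
sub utf8_bytes_to_nfc {
    my ($bytes) = @_;

    # Decode the raw bytes into Perl's internal character string.
    my $chars = decode('UTF-8', $bytes);

    # Compose base letters with their combining marks where possible.
    my $nfc = NFC($chars);

    # Encode back to UTF-8 bytes for output.
    return encode('UTF-8', $nfc);
}

# "e" followed by U+0301 COMBINING ACUTE ACCENT, given as UTF-8 bytes
# (the decomposed form an NFD-style record would carry).
my $decomposed = "e\xCC\x81";
my $composed   = utf8_bytes_to_nfc($decomposed);

# The composed form is the single character U+00E9, two bytes in UTF-8.
printf "%d bytes in, %d bytes out\n", length($decomposed), length($composed);
```

If the data is already held as a decoded Perl character string, the
decode and encode calls are unnecessary and calling NFC on the string
should be enough.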