On Thu, Jul 08, 2004 at 01:17:48PM -0400, Houghton,Andrew wrote:
> Unicode specifies four normalization methods, NFC, NFD, NFKC,
> and NFKD.  While RDF could have just accepted characters in
> unnormalized form, it decided to mandate that all data content
> be provided in NFC normalization form.  This has consequences
> when taking MARC-XML data to RDF.  MARC-XML uses NFD, loosely,
> as Michael Doran pointed out there are exceptions.

Ouch, is this documented somewhere? I imagine it must be, but I never
seem to have run across it before. It's probably in bold letters in the
spec :]

> I'm not announcing the availability of the code yet, but you can
> take a peek at http://staff.oclc.org/~houghtoa/repository/perl/utf-nf.pl

Thanks Andrew. I'll check it out. 

> I'm not sure how the internals of MARC::Charset work, but if it keeps
> the data in Perl's internal Unicode representation then all that you
> would need to do is call the normalize function in the Unicode::Normalize
> package.  If not then you would need to convert it first into Perl's
> internal Unicode representation, probably with the Encode package, which
> is also built into all 5.8.0 Perl distributions.

All MARC::Charset does is provide hash lookup for the LC mapping tables [1],
and a fairly simple algorithm for reading the MARC-8 escapes and
translating to UTF-8 appropriately. One somewhat nice bonus is that the
big East Asian mapping is stored in a Berkeley DB to save on memory--but I
guess memory is cheap these days.

Thanks for the tip about Unicode::Normalize.  MARC::Charset already requires
Perl 5.8, so I think adding this normalization would be a good idea at
some point.
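
For reference, a minimal sketch of the step Andrew describes, assuming the
converted data comes out as UTF-8 encoded bytes: decode into Perl's internal
representation with Encode, call NFC from Unicode::Normalize, and encode
again. The helper name utf8_to_nfc is hypothetical and is not part of
MARC::Charset; how this would hook into the module is left open.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use Unicode::Normalize qw(NFC);

# Hypothetical helper (not part of MARC::Charset): take UTF-8 encoded
# bytes, normalize the text to NFC, and return UTF-8 encoded bytes.
sub utf8_to_nfc {
    my ($bytes) = @_;
    my $chars = decode('UTF-8', $bytes);    # bytes -> Perl's internal characters
    return encode('UTF-8', NFC($chars));    # compose, then back to bytes
}

# "e" followed by U+0301 COMBINING ACUTE ACCENT, i.e. the decomposed form
my $decomposed = "e\xCC\x81";
my $composed   = utf8_to_nfc($decomposed);
# $composed now holds the bytes 0xC3 0xA9, which is U+00E9 in UTF-8
```

Both Encode and Unicode::Normalize ship with Perl 5.8, so this would not add
a dependency.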

//Ed

[1] http://www.loc.gov/marc/specifications/specchartables.html
