Hi Ashley,

Thanks for the info! Trying to keep up with i18n and/or character set
stuff is almost a full-time job.

> > How are you testing for UTF-8?
>
> There's a handy perl regexp on the W3C web site at:
>
> http://www.w3.org/International/questions/qa-forms-utf-8
>
> You'll need to change the ASCII part of the regexp to something like:
>
> [\x01-\x7e]
>
> This will more than accommodate for the various control
> characters you can find in MARC records (don't forget Esc as
> the lead in to Greek, Cyrillic, etc.)

In a MARC UCS/Unicode UTF-8 environment, the Esc (0x1B) character
doesn't serve any purpose, since it is not necessary to escape to the
alternate MARC-8 character sets (the aforementioned Greek, Cyrillic,
etc.). My understanding is that a proper conversion from MARC-8 to
UTF-8 should remove any escape sequences.

I believe that the only other 'C0' control characters allowed in MARC
records are these [1]:

hex   MARC control name     ASCII control name  Unicode control name
----  --------------------  ------------------  -----------------------------
0x1D  [RECORD TERMINATOR]   [GROUP SEPARATOR]   [INFORMATION SEPARATOR THREE]
0x1E  [FIELD TERMINATOR]    [RECORD SEPARATOR]  [INFORMATION SEPARATOR TWO]
0x1F  [SUBFIELD DELIMITER]  [UNIT SEPARATOR]    [INFORMATION SEPARATOR ONE]

So, I'm wondering if for MARC record testing, it would make sense to
tighten up the ASCII part of the regexp a bit to this:

[\x1D-\x7E]

-- Michael

[1] MARC21 > Code Table Basic Latin (ASCII)
    http://lcweb2.loc.gov/cocoon/codetables/42.html

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/

> -----Original Message-----
> From: Ashley Sanders [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, March 14, 2007 10:52 AM
> To: Doran, Michael D
> Cc: perl4lib
> Subject: Re: MARC::Charset
>
> Michael,
>
> >> So, basically, you either need prior knowledge about the actual
> >> character encoding used, or you have to test. Testing for UTF-8 is
> >> fairly straightforward...
> >
> > How are you testing for UTF-8?
>
> There's a handy perl regexp on the W3C web site at:
>
> http://www.w3.org/International/questions/qa-forms-utf-8
>
> You'll need to change the ASCII part of the regexp to something like:
>
> [\x01-\x7e]
>
> This will more than accommodate for the various control
> characters you can find in MARC records (don't forget Esc as
> the lead in to Greek, Cyrillic, etc.)
>
> The W3C regexp tests the whole string -- which may be
> inefficient if you are testing lots of data. Depending on
> what sort of accuracy you want and whether or not overlong
> UTF-8 sequences are a concern, you could just test for the following:
>
> [\xc2-\xf4][\x80-\xbf]
>
> The Wikipedia page on UTF-8 is worth a read.
>
> >> Distinguishing Latin-1 from MARC-8 is a bit more like guess work.
> >> As a test for MARC-8 I look for the common combining diacritics
> >> followed by a vowel.
> >
> > Do you have a programmatic way to do that test, or are you
> > "eye-balling" the records.
>
> I use a simple regexp:
>
> ([\xe1-\xe3][aeiouAEIOU]|\xf0[cC])
>
> which may be rather too simple. For a critical application
> I'd come up with something a bit better (after first
> eye-balling a load of records.)
>
> Just as an aside, I'm not using perl -- I'm using the Boost
> Regexp library for C++ (which is a good implementation of
> perl regexps.)
>
> Regards,
>
> Ashley.
> --
> Ashley Sanders [EMAIL PROTECTED]
> Copac http://copac.ac.uk A MIMAS Service funded by JISC
>
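
For anyone who wants to try the tests discussed in this thread, here is a minimal sketch in Python (neither perl nor Boost, purely for illustration). It assumes the byte ranges from the W3C qa-forms-utf-8 pattern with only the ASCII branch swapped out, so that the loose [\x01-\x7e] and the tighter [\x1D-\x7E] can be compared side by side. The names utf8_validator, LOOSE, TIGHT, QUICK_HINT, MARC8_HINT and looks_like_utf8 are made up for this example, and nothing here is taken from MARC::Charset.

```python
import re

# Multi-byte branches of the W3C UTF-8 pattern (everything except ASCII).
_MULTIBYTE = (
    rb"[\xC2-\xDF][\x80-\xBF]"              # non-overlong 2-byte
    rb"|\xE0[\xA0-\xBF][\x80-\xBF]"         # 3-byte, excluding overlongs
    rb"|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}"  # straight 3-byte
    rb"|\xED[\x80-\x9F][\x80-\xBF]"         # 3-byte, excluding surrogates
    rb"|\xF0[\x90-\xBF][\x80-\xBF]{2}"      # planes 1-3
    rb"|[\xF1-\xF3][\x80-\xBF]{3}"          # planes 4-15
    rb"|\xF4[\x80-\x8F][\x80-\xBF]{2}"      # plane 16
)


def utf8_validator(ascii_class):
    """Build a whole-string UTF-8 test with a caller-supplied ASCII class.

    ascii_class is a bytes regexp character class, e.g. rb"[\\x1D-\\x7E]".
    """
    return re.compile(rb"(?:" + ascii_class + rb"|" + _MULTIBYTE + rb")*")


# Loose version: any C0 control except NUL, so Esc (0x1B) is accepted.
LOOSE = utf8_validator(rb"[\x01-\x7E]")

# Tighter version: only the three MARC delimiters (0x1D-0x1F) plus
# printable ASCII, so a leftover MARC-8 escape sequence is rejected.
TIGHT = utf8_validator(rb"[\x1D-\x7E]")

# Cheap partial test: a lead byte followed by one continuation byte.
# It does not validate the whole string and does not catch every
# malformed sequence.
QUICK_HINT = re.compile(rb"[\xC2-\xF4][\x80-\xBF]")

# MARC-8 heuristic from the thread: a combining diacritic (0xE1-0xE3)
# before a vowel, or a cedilla (0xF0) before c/C.
MARC8_HINT = re.compile(rb"[\xE1-\xE3][aeiouAEIOU]|\xF0[cC]")


def looks_like_utf8(record, validator=TIGHT):
    """Return True if every byte of record fits the validator's pattern."""
    return validator.fullmatch(record) is not None
```

With these definitions, looks_like_utf8(data) applies the tighter test to a bytes object, and looks_like_utf8(data, LOOSE) applies the looser one. A record that is pure ASCII passes both, so a positive result only means "not inconsistent with UTF-8". MARC8_HINT.search(data) is a guess at best, as noted above.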