Character set tests [was MARC::Charset]

Doran, Michael D Wed, 14 Mar 2007 11:46:45 -0800

Hi Ashley,

Thanks for the info!  Trying to keep up with i18n and character set issues is 
almost a full-time job.


> > How are you testing for UTF-8?
> 
> There's a handy perl regexp on the W3C web site at:
> 
>     http://www.w3.org/International/questions/qa-forms-utf-8
> 
> You'll need to change the ASCII part of the regexp to something like:
> 
>     [\x01-\x7e]
> 
> This will more than accommodate for the various control 
> characters you can find in MARC records (don't forget Esc as 
> the lead in to Greek, Cyrillic, etc.)

In a MARC UCS/Unicode UTF-8 environment, the Esc (0x1B) character doesn't serve 
any purpose, since it is not necessary to escape to the alternate MARC-8 
character sets (the aforementioned Greek, Cyrillic, etc.).  My understanding is 
that a proper conversion from MARC-8 to UTF-8 should remove any escape 
sequences.  I believe that the only other 'C0' control characters allowed in 
MARC records are these [1]:

 hex   MARC control name     ASCII control name   Unicode control name
 ----  --------------------  -------------------  -----------------------------
 0x1D  [RECORD TERMINATOR]   [GROUP SEPARATOR]    [INFORMATION SEPARATOR THREE]
 0x1E  [FIELD TERMINATOR]    [RECORD SEPARATOR]   [INFORMATION SEPARATOR TWO]
 0x1F  [SUBFIELD DELIMITER]  [UNIT SEPARATOR]     [INFORMATION SEPARATOR ONE]
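
As a rough illustration of the table above, here is a minimal Python sketch (Python only for illustration; the names MARC_CONTROLS, UNEXPECTED_C0 and unexpected_controls are made up for this example) that reports any control byte below 0x1D in a record, such as an Esc left behind by an incomplete MARC-8 to UTF-8 conversion.  It assumes the three delimiters listed are the only control bytes a UTF-8 record should contain:

```python
import re

# The three control characters the table lists as allowed in MARC records.
MARC_CONTROLS = {
    0x1D: "RECORD TERMINATOR",
    0x1E: "FIELD TERMINATOR",
    0x1F: "SUBFIELD DELIMITER",
}

# Assumption: every other byte below 0x1D (Esc, 0x1B, included) is unexpected
# once a record has been converted to UTF-8.
UNEXPECTED_C0 = re.compile(rb"[\x00-\x1C]")


def unexpected_controls(record: bytes) -> list:
    """Return (offset, byte value) pairs for control bytes below 0x1D."""
    return [(m.start(), m.group()[0]) for m in UNEXPECTED_C0.finditer(record)]
```

An empty result means only the three MARC delimiters (if any) are present; anything else points at the offset worth inspecting.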

So, I'm wondering if for MARC record testing, it would make sense to tighten up 
the ASCII part of the regexp a bit to this:

        [\x1D-\x7E]
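
To make the suggestion concrete, here is a hedged sketch of the W3C pattern with only its ASCII branch swapped for the range above.  It is written in Python for illustration; the multi-byte branches are reproduced from the W3C page as I recall it, and the names MARC_UTF8 and is_marc_utf8 are my own, so please check it against the original before relying on it:

```python
import re

# W3C UTF-8 pattern with the ASCII branch narrowed to the proposed range.
MARC_UTF8 = re.compile(
    rb"""
    (?:
        [\x1D-\x7E]                        # MARC delimiters plus printable ASCII
      | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
      | \xE0[\xA0-\xBF][\x80-\xBF]         # 3-byte, excluding overlongs
      | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
      | \xED[\x80-\x9F][\x80-\xBF]         # 3-byte, excluding surrogates
      | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
      | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
      | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )*
    """,
    re.VERBOSE,
)


def is_marc_utf8(data: bytes) -> bool:
    """True if every byte of data fits the tightened UTF-8 pattern."""
    return MARC_UTF8.fullmatch(data) is not None
```

With this range, a leftover Esc (0x1B), a tab or a newline makes the test fail, which is the intended tightening but may be too strict for data that has passed through other tools.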

-- Michael

[1] MARC21 > Code Table Basic Latin (ASCII)
    http://lcweb2.loc.gov/cocoon/codetables/42.html

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/
 

> -----Original Message-----
> From: Ashley Sanders [mailto:[EMAIL PROTECTED] 
> Sent: Wednesday, March 14, 2007 10:52 AM
> To: Doran, Michael D
> Cc: perl4lib
> Subject: Re: MARC::Charset
> 
> Michael,
> 
> >> So, basically, you either need prior knowledge about the actual 
> >> character encoding used, or you have to test. Testing for UTF-8 is 
> >> fairly straightforward...
> > 
> > How are you testing for UTF-8?
> 
> There's a handy perl regexp on the W3C web site at:
> 
>     http://www.w3.org/International/questions/qa-forms-utf-8
> 
> You'll need to change the ASCII part of the regexp to something like:
> 
>     [\x01-\x7e]
> 
> This will more than accommodate for the various control 
> characters you can find in MARC records (don't forget Esc as 
> the lead in to Greek, Cyrillic, etc.)
> 
> The W3C regexp tests the whole string -- which may be 
> inefficient if you are testing lots of data. Depending on 
> what sort of accuracy you want and whether or not overlong 
> UTF-8 sequences are a concern, you could just test for the following:
> 
>     [\xc2-\xf4][\x80-\xbf]
> 
> The Wikipedia page on UTF-8 is worth a read.
> 
> >> Distinguishing Latin-1 from MARC-8 is a bit more like guess work.
> >> As a test for MARC-8 I look for the common combining diacritics 
> >> followed by a vowel.
> > 
> > Do you have a programmatic way to do that test, or are you 
> > "eye-balling" the records?
> 
> I use a simple regexp:
> 
>    ([\xe1-\xe3][aeiouAEIOU]|\xf0[cC])
> 
> which may be rather too simple. For a critical application 
> I'd come up with something a bit better (after first 
> eye-balling a load of records.)
> 
> Just as an aside, I'm not using perl -- I'm using the Boost 
> Regexp library for C++ (which is a good implementation of 
> perl regexps.)
> 
> Regards,
> 
> Ashley.
> -- 
> Ashley Sanders               [EMAIL PROTECTED]
> Copac http://copac.ac.uk A MIMAS Service funded by JISC
>
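
For what it's worth, the two quick tests in the quoted message could be combined roughly as follows.  This is only a sketch in Python (Ashley's actual code uses Boost in C++, which I have not seen); the function name guess_encoding and the "unknown" fallback are my own assumptions, and neither pattern rules out false positives:

```python
import re

# Quick UTF-8 test from the quoted message: a lead byte followed by one
# continuation byte.  It does not validate the rest of the sequence.
QUICK_UTF8 = re.compile(rb"[\xc2-\xf4][\x80-\xbf]")

# MARC-8 heuristic from the quoted message: a combining grave, acute or
# circumflex (0xE1-0xE3) before a vowel, or a cedilla (0xF0) before c/C.
MARC8_DIACRITIC = re.compile(rb"[\xe1-\xe3][aeiouAEIOU]|\xf0[cC]")


def guess_encoding(data: bytes) -> str:
    """Return 'utf-8', 'marc-8' or 'unknown' based on the two heuristics."""
    if QUICK_UTF8.search(data):
        return "utf-8"
    if MARC8_DIACRITIC.search(data):
        return "marc-8"
    # Could be plain ASCII, Latin-1, or something else entirely.
    return "unknown"
```

The UTF-8 test runs first because a MARC-8 diacritic is normally followed by an ASCII letter, which cannot match the first pattern, although a diacritic followed by another high byte could still fool it.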
