RE: Regular Expression for non-Roman characters

Doran, Michael D Thu, 25 Sep 2008 14:54:06 -0700

Hi Jane,

In a MARC-8 character set environment, I would assume that the key to detecting 
non-Latin characters would be the presence of an escape sequence to indicate a 
switch to an alternate character set (e.g. Arabic, Greek, or Cyrillic) [1].  
Everything from that point on would be non-Latin until there was an escape 
sequence back to Latin.
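
As a rough sketch of that idea, the function below scans raw MARC-8 bytes for escape sequences that designate a non-Latin set. The function name, the hash, and the table of final bytes are my own illustration; the byte values are as I read them in the specification at [1] and should be checked against it before use. The special sequences for Greek symbols, subscripts and superscripts are deliberately ignored.

```perl
use strict;
use warnings;

# Final byte of the escape sequence => character set.
# ASSUMPTION: values transcribed from the MARC-8 spec cited as [1];
# verify them there. 'B' (Basic Latin) and '!E' (Extended Latin) are
# left out on purpose, so a switch back to Latin is never reported.
my %NON_LATIN_SET = (
    '1' => 'CJK',
    '2' => 'Basic Hebrew',
    '3' => 'Basic Arabic',
    '4' => 'Extended Arabic',
    'N' => 'Basic Cyrillic',
    'Q' => 'Extended Cyrillic',
    'S' => 'Basic Greek',
);

# Hypothetical helper: returns the sorted names of the non-Latin sets
# that the byte string switches into (empty list if there are none).
sub marc8_non_latin_sets {
    my ($bytes) = @_;
    my %seen;
    # ESC, then either '$' (multi-byte set) with an optional intermediate
    # byte, or a single intermediate byte, then the final byte.
    while ($bytes =~ /\x1B(?:\$[(),\-]?|[(),\-])([1-4NQS])/g) {
        $seen{ $NON_LATIN_SET{$1} } = 1;
    }
    return sort keys %seen;
}
```

Call it in list context on a raw field or subfield value, for example `my @sets = marc8_non_latin_sets($field_bytes);`. A non-empty result means the data switches into at least one non-Latin set.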


In a MARC Unicode character set environment, if you are using Perl for your 
regular expression matching, you can probably take advantage of the Unicode 
\p{} constructs [2].  Something along the lines of...

        \P{Latin}

...which matches any character that does not belong to the Latin script 
(lowercase 'p' = belongs to, uppercase 'P' = does not belong to).
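
One thing to watch for: spaces, digits and punctuation are not in the Latin script either (Unicode assigns them to the Common script), so a bare \P{Latin} will match almost any heading. A minimal sketch of one way around that, assuming the string has already been decoded to Perl characters; the function name is made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: true if the text contains a character whose
# script is something other than Latin, Common (spaces, digits,
# punctuation) or Inherited (combining marks).
sub has_non_latin_script {
    my ($text) = @_;
    return $text =~ /[^\p{Latin}\p{Common}\p{Inherited}]/ ? 1 : 0;
}

my $latin    = "Tolstoy, Leo, 1828-1910";
my $cyrillic = "\x{422}\x{43E}\x{43B}\x{441}\x{442}\x{43E}\x{439}";  # Tolstoy in Cyrillic

# The bare form fires on the comma and space in $latin;
# the bracketed class fires only on $cyrillic.
my $bare_hit     = $latin =~ /\P{Latin}/ ? 1 : 0;
my $latin_hit    = has_non_latin_script($latin);
my $cyrillic_hit = has_non_latin_script($cyrillic);
```

Whether Common and Inherited should both be excluded depends on the data, so treat the character class as a starting point rather than a finished test.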

For more info on the regular expression Unicode scripts/blocks see this 
tutorial:
http://www.regular-expressions.info/unicode.html

I'll point out that when I've used Unicode \p{} constructs in a program, it was 
necessary to explicitly label strings as being Unicode (assuming they are, 
natch) before regex matching, using...

        use Encode qw(decode);
        $string_tobe_matched = decode('UTF-8', $string_tobe_matched);
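
To show why the decode step matters, here is a small sketch using the core Encode module. The sample bytes and the helper name are made up for illustration. Before decoding, Perl sees six separate bytes and the script test fails; after decoding it sees three Cyrillic characters.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: true if the string contains a Cyrillic character.
sub matches_cyrillic {
    my ($s) = @_;
    return $s =~ /\p{Cyrillic}/ ? 1 : 0;
}

# Raw UTF-8 bytes for three Cyrillic letters (U+0422 U+043E U+043B),
# as they would arrive from a file or socket read without a decoding layer.
my $octets = "\xD0\xA2\xD0\xBE\xD0\xBB";

# decode() returns a new character string; keep the return value.
my $chars = decode('UTF-8', $octets);

my $before = matches_cyrillic($octets);   # bytes: no Cyrillic seen
my $after  = matches_cyrillic($chars);    # characters: Cyrillic seen
```

If the data is read from a file, opening the handle with an encoding layer such as `open(my $fh, '<:encoding(UTF-8)', $path)` does the same decoding at read time.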

I know that's not exactly what you asked for, but (assuming I didn't 
misunderstand your question) it may suggest some approaches should you end up 
tackling it yourself.

-- Michael

[1] MARC 21 Specification > ACCESSING ALTERNATE GRAPHIC CHARACTER SETS
    http://www.loc.gov/marc/specifications/speccharmarc8.html#alternative

[2] Perl > Unicode Regular Expression Support Level
    http://perldoc.perl.org/perlunicode.html#Unicode-Regular-Expression-Support-Level

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/
  

> -----Original Message-----
> From: Jacobs, Jane W [mailto:[EMAIL PROTECTED] 
> Sent: Thursday, September 25, 2008 1:24 PM
> To: perl4lib@perl.org
> Subject: Regular Expression for non-Roman characters
> 
> Hi folks,
> 
> I'm wondering if anyone has codified a regular expression that would
> indicate the presence of non-Latin characters.  I want to detect the
> presence of non-Roman letters in authority records.  Currently,
> authorities with non-Roman forms of name place these in the 4XX fields.
> Our system can't handle that, so I want to flip them to 5XX and possibly
> add a subfield to note what they are, but first I need something to
> detect them.
> 
> I had in mind something like \xE0-\xFE, which detects diacritics nicely.
> I'd prefer not to figure it out for myself if someone else has already
> done it.
> Thanks in advance.
> JJ 
> 
> **Views expressed by the author do not necessarily represent those of
> the Queens Library.**
> 
> Jane Jacobs
> Asst. Coord., Catalog Division
> Queens Borough Public Library
> 89-11 Merrick Blvd.
> Jamaica, NY 11432
> tel.: (718) 990-0804
> e-mail: [EMAIL PROTECTED]
> FAX. (718) 990-8566
