Hi Jane,

In a MARC-8 character set environment, I would assume that the key to detecting non-Latin characters would be the presence of an escape sequence to indicate a switch to an alternate character set (e.g. Arabic, Greek, Cyrillic, etc.) [1]. Everything from that point on would be non-Latin until there was an escape sequence back to Latin.
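To make the escape-sequence idea concrete, here is a rough Perl sketch. It assumes you are looking at raw MARC-8 bytes, and that the final bytes I list for the non-Latin sets (1 = CJK, 2 = Hebrew, 3 and 4 = Arabic, N and Q = Cyrillic, S = Greek) match the specification in [1]; please check them there before relying on this. The helper name has_non_latin_marc8 is made up for illustration, and the sketch ignores the short "ESC g" form used for the Greek symbol set.

```perl
use strict;
use warnings;

# Final bytes that (per my reading of the MARC 21 spec) name a non-Latin set:
# 1 = CJK, 2 = Hebrew, 3/4 = Arabic, N/Q = Cyrillic, S = Greek.
my $NON_LATIN_FINAL = qr/[1-4NQS]/;

# ESC, then either "$" plus an optional intermediate (multi-byte sets)
# or a single intermediate byte (single-byte sets), then the final byte.
my $NON_LATIN_ESCAPE = qr/\x1B(?:\$[,)\-]?|[(),\-])$NON_LATIN_FINAL/;

# Hypothetical helper: true if the raw MARC-8 bytes switch to a non-Latin set.
sub has_non_latin_marc8 {
    my ($bytes) = @_;
    return $bytes =~ $NON_LATIN_ESCAPE ? 1 : 0;
}

my $latin    = "Tolstoy, Leo";
# ESC ( N switches G0 to Basic Cyrillic, ESC ( B switches back to ASCII;
# the bytes in between are placeholders, not real Cyrillic data.
my $cyrillic = "\x1B(N" . "TOLSTOJ" . "\x1B(B";

print has_non_latin_marc8($latin)    ? "non-Latin\n" : "Latin only\n";  # Latin only
print has_non_latin_marc8($cyrillic) ? "non-Latin\n" : "Latin only\n";  # non-Latin
```

Matching only on the escape sequence means you never have to interpret the bytes that follow it, which is usually enough if all you need is a yes/no flag per field.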
In a MARC Unicode character set environment, if you are using Perl for your regular expression matching, you can probably take advantage of the Unicode \p{} constructs [2]. Something along the lines of...

    \P{Latin}

...which means "does not belong to the Latin script" (lowercase 'p' = belongs to, uppercase 'P' = does not belong to). For more info on the regular expression Unicode scripts/blocks see this tutorial:

    http://www.regular-expressions.info/unicode.html

I'll point out that when I've used Unicode \p{} constructs in a program, it was necessary to explicitly label strings as being Unicode (assuming they are, natch) before regex matching, using the Encode module's decode function...

    use Encode;
    $string_tobe_matched = decode('UTF-8', $string_tobe_matched);

I know that's not exactly what you asked for, but (assuming I didn't misunderstand your question) it may suggest some approaches should you end up tackling it yourself.

-- Michael

[1] MARC 21 Specification > ACCESSING ALTERNATE GRAPHIC CHARACTER SETS
    http://www.loc.gov/marc/specifications/speccharmarc8.html#alternative

[2] Perl > Unicode Regular Expression Support Level
    http://perldoc.perl.org/perlunicode.html#Unicode-Regular-Expression-Support-Level

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/

> -----Original Message-----
> From: Jacobs, Jane W [mailto:[EMAIL PROTECTED]
> Sent: Thursday, September 25, 2008 1:24 PM
> To: perl4lib@perl.org
> Subject: Regular Expression for non-Roman characters
>
> Hi folks,
>
> I'm wondering if anyone has codified a regular expression that would
> indicate the presence of non-Latin characters. I want to detect the
> presence of non-Roman letters in authority records. Currently
> Authorities with non-Roman forms of name place these in the
> 4XX fields.
> Our system can't handle that so I want to flip them to 5XX
> and possibly add a subfield to note what they are, but first I
> need something to detect them.
>
> I had in mind something like \xE0-\xFE which detects
> diacritics nicely.
> I'd prefer not to figure it out for myself if someone else has already
> done it.
> Thanks in advance.
> JJ
>
> **Views expressed by the author do not necessarily represent those of
> the Queens Library.**
>
> Jane Jacobs
> Asst. Coord., Catalog Division
> Queens Borough Public Library
> 89-11 Merrick Blvd.
> Jamaica, NY 11432
> tel.: (718) 990-0804
> e-mail: [EMAIL PROTECTED]
> FAX. (718) 990-8566
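
P.S. For the Unicode case, a minimal sketch of the \P{Latin} approach might look like the following. It assumes the field data arrives as UTF-8 bytes, and the helper name has_non_latin_letter is made up for illustration. One caveat: a bare \P{Latin} also matches digits, spaces and punctuation, since those belong to the "Common" script rather than Latin, so the sketch restricts the test to letters with a \p{L} lookahead.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: takes raw UTF-8 bytes and returns 1 if the text
# contains at least one letter that is not in the Latin script.
sub has_non_latin_letter {
    my ($utf8_bytes) = @_;

    # Decode bytes into a Perl character string so \p{} sees code points.
    my $text = decode('UTF-8', $utf8_bytes);

    # (?=\p{L}) limits the match to letters, so digits, spaces and
    # punctuation are not reported as non-Latin.
    return $text =~ /(?=\p{L})\P{Latin}/ ? 1 : 0;
}

my $latin    = "Caf\xC3\xA9, 1821-1881";   # accented Latin letter, digits, punctuation
my $cyrillic = "Do \xD0\x94\xD0\xBE";      # Latin "Do" followed by two Cyrillic letters

print has_non_latin_letter($latin)    ? "non-Latin\n" : "Latin only\n";  # Latin only
print has_non_latin_letter($cyrillic) ? "non-Latin\n" : "Latin only\n";  # non-Latin
```

Accented Latin letters still count as Latin here, so this should not trip on ordinary diacritics in Roman-script headings.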