Firstly, thanks to Michael for solving my problem of locating non-Roman characters in MARC texts. It works like a charm!
Our ILS wonks out if it gets the new Authority Records with cross-references from non-Roman scripts. For example:

010    n 99034155 ǂz no2005062324
040    DLC ǂb eng ǂc DLC ǂd OCoLC ǂd DLC ǂd OCoLC
100 1  Guo, Fucheng, ǂd 1965-
400 1  Kuo, Fu-chʻeng, ǂc singer ǂw nne
400 1  Kwok, Arron, ǂd 1965-
400 1  Kwok, Aaron, ǂd 1965-
400 1  Kwok, Fu Shing, ǂd 1965-
400 1  Gwok, Fu Sing, ǂd 1965-
400 1  郭富城, ǂd 1965-

Our vendor sensibly recommends that we change 4XX to 5XX. This works fine for a manual load, but in a batch load context one would anticipate problems. Hence the program.

Michael has encouraged me to share my result with the group, even though I have my doubts about whether any of you couldn't crank out something better off the top of your heads. However, I'm offering it up with the caveats that it is somewhat inelegant and contains at least a few steps I'm sure are not necessary, except that I can't figure out the right way to pull them out. It is slavishly derivative of the MARC::Doc::Tutorial section "Updating subject subfield x to subfield v". As far as I can tell it does the job, but any bugs are no doubt mine. I added a $9 to my non-Latin headings to distinguish them from other "legitimate" 5XX tags, and that, of course, can be knocked out.

That said, here it is for anyone who wants it:

    use strict;
    use MARC::Batch;
    use MARC::Field;

    my $batch = MARC::Batch->new( 'USMARC', 'VerAuthtest.mrc' );
    open( OUT, '>VerAuthtest.dat' ) or die $!;

    while ( my $record = $batch->next() ) {

        my $leader  = $record->leader();
        my $Charset = substr( $leader, 9, 1 );   # leader/09: 'a' = Unicode, blank = MARC-8
        my $Type    = substr( $leader, 6, 1 );   # leader/06: 'z' = authority record

        # go through all 4XX fields in the record.
        foreach my $Xref ( $record->field( '4..' ) ) {

            # extract subfields as an array of array refs.
            my @subfields = $Xref->subfields();

            # setup an array to store our new field.
            my @newSubfields = ();

            # the replacement tag: 4XX becomes 5XX.
            my $newtag = $Xref->tag();
            $newtag =~ s/^4/5/;

            # use pop() to read the subfields backwards.
            while ( my $subfield = pop( @subfields ) ) {

                # for convenience, pull out the subfield
                # code and data from the array ref.
                my ( $code, $data ) = @$subfield;
                unshift( @newSubfields, $code, $data );
            }

            my $Xrefstring = $Xref->as_string( 'abcq' );

            # Unicode authority record: flip the field if the heading
            # contains no Latin-script characters at all.
            if ( !( $Xrefstring =~ m/\p{Latin}/ ) && ( ( $Type eq 'z' ) && ( $Charset eq 'a' ) ) ) {
                my $newXref = MARC::Field->new(
                    $newtag,
                    $Xref->indicator(1),
                    $Xref->indicator(2),
                    @newSubfields,
                    9 => 'Non-Latin'
                );
                $Xref->replace_with( $newXref );
                ##my $escape = '_(';
            }
            # MARC-8 authority record: flip the field if the heading
            # contains the escape pattern.
            elsif ( ( $Xrefstring =~ m/_\(/ ) && ( ( $Type eq 'z' ) && ( $Charset eq ' ' ) ) ) {
                my $newXref = MARC::Field->new(
                    $newtag,
                    $Xref->indicator(1),
                    $Xref->indicator(2),
                    @newSubfields,
                    9 => 'Non-Latin'
                );
                $Xref->replace_with( $newXref );
            }
        }

        # output the potentially changed record as MARC.
        print OUT $record->as_usmarc();
    }

    close( OUT );

**Views expressed by the author do not necessarily represent those of the Queens Library.**

Jane Jacobs
Asst. Coord., Catalog Division
Queens Borough Public Library
89-11 Merrick Blvd.
Jamaica, NY 11432
tel.: (718) 990-0804
e-mail: [EMAIL PROTECTED]
FAX. (718) 990-8566

The information contained in this message may be privileged and confidential and protected from disclosure. If the reader of this message is not the intended recipient, or an employee or agent responsible for delivering this message to the intended recipient, you are hereby notified that any dissemination, distribution or copying of this communication is strictly prohibited.
If you have received this communication in error, please notify us immediately by replying to the message and deleting it from your computer.

-----Original Message-----
From: Doran, Michael D [mailto:[EMAIL PROTECTED]
Sent: Thursday, September 25, 2008 5:54 PM
To: Jacobs, Jane W
Cc: perl4lib@perl.org
Subject: RE: Regular Expression for non-Roman characters

Hi Jane,

In a MARC-8 character set environment, I would assume that the key to detecting non-Latin characters would be the presence of an escape sequence to indicate a switch to an alternate character set (e.g. Arabic, Greek, Cyrillic, etc) [1]. Everything from that point on would be non-Latin until there was an escape sequence back to Latin.

In a MARC Unicode character set environment, if you are using Perl for your regular expression matching, you can probably take advantage of the Unicode \p{} constructs [2]. Something along the lines of...

    \P{Latin}

..which means doesn't belong to the Latin script (lowercase 'p' = belongs to, uppercase 'P' = does not belong to). For more info on the regular expression Unicode scripts/blocks see this tutorial:

    http://www.regular-expressions.info/unicode.html

I'll point out that when I've used Unicode \p{} constructs in a program, it was necessary to explicitly label strings as being Unicode (assuming they are, natch) before regex matching, using...

    decode('UTF-8',$string_tobe_matched);

I know that's not exactly what you asked for, but (assuming I didn't misunderstand your question) it may suggest some approaches should you end up tackling it yourself.
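The Unicode approach described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the helper names has_non_latin_script and lacks_latin are hypothetical (they do not come from the thread or from any MARC module), and the input is assumed to be raw UTF-8 bytes. One point worth noting is that \P{Latin} on its own also matches spaces, digits and punctuation, since those belong to the Common script, so the first helper excludes Common and Inherited as well.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not from the thread): true when the heading contains
# a character from a script other than Latin. Spaces, digits and punctuation
# (script Common) and combining marks (script Inherited) are ignored.
sub has_non_latin_script {
    my ($bytes) = @_;
    # Label the raw UTF-8 bytes as a Perl character string before matching.
    my $text = decode( 'UTF-8', $bytes );
    return ( $text =~ /[^\p{Latin}\p{Common}\p{Inherited}]/ ) ? 1 : 0;
}

# Hypothetical helper: true when the heading contains no Latin-script
# character at all. A heading made only of digits also passes this test.
sub lacks_latin {
    my ($bytes) = @_;
    my $text = decode( 'UTF-8', $bytes );
    return ( $text !~ /\p{Latin}/ ) ? 1 : 0;
}

# Example call; the byte escapes are the UTF-8 encoding of the three Han
# characters from the sample 400 field.
my $flag = has_non_latin_script("\xE9\x83\xAD\xE5\xAF\x8C\xE5\x9F\x8E, 1965-");
```

Which of the two tests is appropriate depends on whether mixed-script headings should be flagged; that is a policy choice the thread does not settle.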
-- Michael

[1] MARC 21 Specification > ACCESSING ALTERNATE GRAPHIC CHARACTER SETS
    http://www.loc.gov/marc/specifications/speccharmarc8.html#alternative

[2] Perl > Unicode Regular Expression Support Level
    http://perldoc.perl.org/perlunicode.html#Unicode-Regular-Expression-Support-Level

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/

> -----Original Message-----
> From: Jacobs, Jane W [mailto:[EMAIL PROTECTED]
> Sent: Thursday, September 25, 2008 1:24 PM
> To: perl4lib@perl.org
> Subject: Regular Expression for non-Roman characters
>
> Hi folks,
>
> I'm wondering if anyone has codified a regular expression that would
> indicate the presence of non-Latin characters. I want to detect the
> presence of non-Roman letters in authority records. Currently,
> Authorities with non-Roman forms of name place these in the 4XX fields.
> Our system can't handle that, so I want to flip them to 5XX and possibly
> add a subfield to note what they are, but first I need something to
> detect them.
>
> I had in mind something like \xE0-\xFE, which detects diacritics nicely.
> I'd prefer not to figure it out for myself if someone else has already
> done it.
> Thanks in advance.
> JJ
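For the MARC-8 side of the thread, the escape-sequence idea Michael describes might be sketched as below. This is a hedged illustration, not a tested implementation: the helper name marc8_switches_to_non_latin is hypothetical, and the byte patterns (ESC as 0x1B, the intermediate characters, and the final characters "B" for Basic Latin and "!E" for Extended Latin) are assumptions based on a reading of the MARC 21 character set specification cited as [1], so they should be checked against it. Note also that the bytes \xE0-\xFE mentioned in the original question are combining diacritics within Extended Latin, so on their own they would not indicate a non-Latin script.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): true when a MARC-8 byte string
# contains an escape sequence designating a graphic character set other
# than Basic Latin or Extended Latin.
sub marc8_switches_to_non_latin {
    my ($bytes) = @_;

    # Assumed shape of a designation: ESC, then an intermediate
    # ( "(" "," ")" "-" for single-byte sets, "$" optionally followed by
    # "," ")" or "-" for multibyte sets), then the final character(s).
    while ( $bytes =~ /\x1B(?:\$[,)\-]?|[(,)\-])(!?[0-9A-Za-z])/g ) {
        my $final = $1;

        # Assumed finals: "B" = Basic Latin (ASCII), "!E" = Extended Latin.
        return 1 unless $final eq 'B' || $final eq '!E';
    }
    return 0;
}
```

A check like this could stand in for the literal pattern test in the MARC-8 branch of a script such as the one posted above, assuming the field data still contains the raw escape bytes at that point.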