RE: Regular Expression for non-Roman characters

Jacobs, Jane W Thu, 02 Oct 2008 05:24:24 -0700

Firstly, thanks to Michael for solving my problem of locating non-Roman 
characters in MARC records.  It works like a charm!


 

Our ILS wonks out if it gets the new Authority Records with cross-references 
from non-Roman Scripts.  For example:

010  n  99034155  ǂz no2005062324
040  DLC ǂb eng ǂc DLC ǂd OCoLC ǂd DLC ǂd OCoLC
100 1 Guo, Fucheng, ǂd 1965-
400 1 Kuo, Fu-chʻeng, ǂc singer ǂw nne
400 1 Kwok, Arron, ǂd 1965-
400 1 Kwok, Aaron, ǂd 1965-
400 1 Kwok, Fu Shing, ǂd 1965-
400 1 Gwok, Fu Sing, ǂd 1965-
400 1 郭富城, ǂd 1965-

 


Our vendor sensibly recommends that we change 4XX to 5XX.  This works fine for 
a manual load, but in a batch load context one would anticipate problems.  
Hence the program.  Michael has encouraged me to share my result with the 
group, even though I doubt there is anyone here who couldn’t crank out 
something better off the top of their head.  However, I’m offering it up with 
the caveats that it is somewhat inelegant and contains at least a few steps 
that I’m sure are unnecessary but that I can’t figure out how to remove.  It 
is slavishly derivative of the MARC::Doc::Tutorial example “Updating subject 
subfield x to subfield v”.  As far as I can tell it does the job, but any bugs 
are no doubt mine.  I added a $9 to my non-Latin headings to distinguish them 
from other “legitimate” 5XX tags; that, of course, can be knocked out.  That 
said, here it is for anyone who wants it:


 


  use strict;
  use warnings;
  use MARC::Batch;

  my $batch = MARC::Batch->new( 'USMARC', 'VerAuthtest.mrc' );
  open( OUT, '>', 'VerAuthtest.dat' ) or die $!;

  while ( my $record = $batch->next() ) {

    # Leader/09 is the character coding scheme ('a' = Unicode,
    # blank = MARC-8); Leader/06 is the record type ('z' = authority).
    my $leader  = $record->leader();
    my $Charset = substr( $leader, 9, 1 );
    my $Type    = substr( $leader, 6, 1 );

    # go through all 4XX fields in the record.
    foreach my $Xref ( $record->field('4..') ) {

      # extract subfields as an array of array refs.
      my @subfields = $Xref->subfields();

      # setup an array to store our new field.
      my @newSubfields = ();

      # the new tag is the old tag with the leading 4 changed to 5.
      my $newtag = $Xref->tag();
      $newtag =~ s/^4/5/;

      # use pop() to read the subfields backwards.
      while ( my $subfield = pop(@subfields) ) {

        # for convenience, pull out the subfield
        # code and data from the array ref.
        my ( $code, $data ) = @$subfield;
        unshift( @newSubfields, $code, $data );
      }

      my $Xrefstring = $Xref->as_string('abcq');

      # Flip the field when the record is an authority record and either
      #  - it is Unicode and the heading has no Latin-script characters, or
      #  - it is MARC-8 and the heading contains the string "_(".
      if ( $Type eq 'z'
        && ( ( $Charset eq 'a' && $Xrefstring !~ m/\p{Latin}/ )
          || ( $Charset eq ' ' && $Xrefstring =~ m/_\(/ ) ) )
      {
        my $newXref = MARC::Field->new(
          $newtag,
          $Xref->indicator(1),
          $Xref->indicator(2),
          @newSubfields,
          9 => 'Non-Latin'
        );
        $Xref->replace_with($newXref);
      }
    }

    # output the potentially changed record as MARC.
    print OUT $record->as_usmarc();
  }

  close(OUT);
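
The pop()/unshift() loop in the script only copies the subfields, in their 
original order, into a flat list of code/data pairs.  A minimal sketch of the 
same flattening with a single map follows.  The hand-made array refs stand in 
for what $Xref->subfields() returns, and the sample values are illustrative 
only, borrowed from one of the 400 fields above:

```perl
use strict;
use warnings;

# Stand-in for the list of [code, data] array refs returned by subfields().
my @subfields = ( [ 'a', 'Kwok, Aaron,' ], [ 'd', '1965-' ] );

# Dereference each [code, data] pair in order, yielding one flat list
# suitable for passing to MARC::Field->new().
my @newSubfields = map { @$_ } @subfields;

print join( '|', @newSubfields ), "\n";
```

The resulting list has the same shape as the one the loop builds, so it could 
be passed to MARC::Field->new() in the same position.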

**Views expressed by the author do not necessarily represent those of the 
Queens Library.**

 

Jane Jacobs

Asst. Coord., Catalog Division

Queens Borough Public Library

89-11 Merrick Blvd.

Jamaica, NY 11432

tel.: (718) 990-0804

e-mail: [EMAIL PROTECTED]

FAX. (718) 990-8566

 



-----Original Message-----
From: Doran, Michael D [mailto:[EMAIL PROTECTED] 
Sent: Thursday, September 25, 2008 5:54 PM
To: Jacobs, Jane W
Cc: perl4lib@perl.org
Subject: RE: Regular Expression for non-Roman characters

 

Hi Jane,

 

In a MARC-8 character set environment, I would assume that the key to detecting 
non-Latin characters would be the presence of an escape sequence to indicate a 
switch to an alternate character set (e.g. Arabic, Greek, Cyrillic, etc) [1].  
Everything from that point on would be non-Latin until there was an escape 
sequence back to Latin.
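
A rough sketch of that idea, assuming the field data is available as raw 
MARC-8 bytes.  The helper name marc8_scripts and the lookup table are 
hypothetical, and the table lists only a few final characters as I understand 
them from the specification [1]; it is not exhaustive and should be checked 
against the spec before being relied on:

```perl
use strict;
use warnings;

# Final characters of some MARC-8 alternate character sets.
# Assumption: a partial list for illustration, to be verified against [1].
my %script_for_final = (
    '1' => 'CJK',
    '2' => 'Hebrew',
    '3' => 'Arabic',
    '4' => 'Arabic',
    'N' => 'Cyrillic',
    'Q' => 'Cyrillic',
    'S' => 'Greek',
);

# Hypothetical helper: list the non-Latin scripts that a MARC-8 byte
# string switches into.
sub marc8_scripts {
    my ($bytes) = @_;
    my %seen;

    # ESC (0x1B), one or two intermediate characters ( $ ( ) , - ),
    # then the final character that names the character set.
    while ( $bytes =~ /\x1B[\x24\x28\x29\x2C\x2D]{1,2}([0-9A-Za-z!])/g ) {
        my $script = $script_for_final{$1};
        $seen{$script} = 1 if defined $script;
    }

    my @found = sort keys %seen;
    return @found;
}

# Made-up sample: a switch to Cyrillic and then back to ASCII (final "B").
my @scripts = marc8_scripts("Name \x1B(Nabc\x1B(B, 1965-");
print "@scripts\n";
```

An empty list means no known alternate set was requested, so the string can be 
treated as Latin for this purpose.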

 

In a MARC Unicode character set environment, if you are using Perl for your 
regular expression matching, you can probably take advantage of the Unicode 
\p{} constructs [2].  Something along the lines of...

 

      \P{Latin}

 

..which means doesn't belong to the Latin script (lowercase 'p' = belongs to, 
uppercase 'P' = does not belong to).

 

For more info on the regular expression Unicode scripts/blocks see this 
tutorial:

http://www.regular-expressions.info/unicode.html

 

I'll point out that when I've used Unicode \p{} constructs in a program, it was 
necessary to explicitly label strings as being Unicode (assuming they are, 
natch) before regex matching, using...

 

      decode('UTF-8',$string_tobe_matched);
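
Putting the two pieces together, here is a minimal sketch, assuming the input 
is a UTF-8 byte string.  The helper name has_non_latin is hypothetical.  Note 
that a bare \P{Latin} also matches spaces, digits, and punctuation, so this 
version additionally excludes the Common and Inherited scripts; that 
refinement is my own assumption about what is wanted, not part of the 
suggestion above:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: true if the UTF-8 byte string contains a character
# whose script is not Latin, Common (spaces, digits, punctuation), or
# Inherited (combining diacritics).
sub has_non_latin {
    my ($bytes) = @_;

    # Label the string as Unicode before matching with \p{} constructs.
    my $text = decode( 'UTF-8', $bytes );

    return $text =~ /[^\p{Latin}\p{Common}\p{Inherited}]/ ? 1 : 0;
}

# A romanized heading, then the bytes of a heading in Han characters.
print has_non_latin("Kwok, Aaron, 1965-"), "\n";
print has_non_latin("\xE9\x83\xAD\xE5\xAF\x8C\xE5\x9F\x8E, 1965-"), "\n";
```

The decode() call returns the decoded string rather than changing its argument 
in place, so the result has to be captured and matched.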

 

I know that's not exactly what you asked for, but (assuming I didn't 
misunderstand your question) it may suggest some approaches should you end up 
tackling it yourself.

 

-- Michael

 

[1] MARC 21 Specification > ACCESSING ALTERNATE GRAPHIC CHARACTER SETS
    http://www.loc.gov/marc/specifications/speccharmarc8.html#alternative

[2] Perl > Unicode Regular Expression Support Level
    http://perldoc.perl.org/perlunicode.html#Unicode-Regular-Expression-Support-Level

 

# Michael Doran, Systems Librarian

# University of Texas at Arlington

# 817-272-5326 office

# 817-688-1926 mobile

# [EMAIL PROTECTED]

# http://rocky.uta.edu/doran/

  

 

> -----Original Message-----
> From: Jacobs, Jane W [mailto:[EMAIL PROTECTED] 
> Sent: Thursday, September 25, 2008 1:24 PM
> To: perl4lib@perl.org
> Subject: Regular Expression for non-Roman characters
> 
> Hi folks,
> 
> I'm wondering if anyone has codified a regular expression that would
> indicate the presence of non-Latin characters.  I want to detect the
> presence of non-Roman letters in authority records.  Currently,
> authorities with non-Roman forms of name place these in the 4XX fields.
> Our system can't handle that, so I want to flip them to 5XX and possibly
> add a subfield to note what they are, but first I need something to
> detect them.
> 
> I had in mind something like \xE0-\xFE, which detects diacritics nicely.
> I'd prefer not to figure it out for myself if someone else has already
> done it.
> Thanks in advance.
> JJ

