RE: Net::Z3950 and diacritics

2003-12-16 Thread Michael D Doran
First, we probably want to figure out what character set the records are
encoded in as received from LOC.  Since only the non-ASCII characters will
give us a clue, we can look at the umlauted-u ("ü") in Dürer.

Charset     Hex byte(s) used to represent "ü"
---------   ---------------------------------
MARC-8      0xE8 0x75 (combining umlaut/diaeresis preceding latin small
            letter u)
MARC-UCS*   0x75 0xCC 0x88 (latin small letter u followed by combining
            umlaut/diaeresis; in this case the combining character is
            represented by two bytes)
Latin-1     0xFC (precomposed latin small letter u with umlaut/diaeresis)

* MARC-UCS/Unicode is UTF-8 encoded, therefore U+0075 becomes 0x75 and
U+0308 becomes 0xCC 0x88.  The MARC-21 specification does not allow the use
of the precomposed Unicode character for an umlauted-u.
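
The byte values in the table can be reproduced with a short Perl sketch.
It assumes only the core Encode and Unicode::Normalize modules; the
hex_bytes() helper and the variable names are made up for illustration, and
the MARC-8 bytes are typed in by hand because MARC-8 is not one of the
encodings that ship with Perl.

```perl
use strict;
use warnings;
use Encode qw(encode);
use Unicode::Normalize qw(NFD);

# Format a byte string as space-separated hex values, e.g. "0xCC 0x88".
# (hex_bytes is a hypothetical helper, not part of any module.)
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf('0x%02X', ord($_)) } split //, $bytes;
}

my $u_umlaut = "\x{FC}";    # U+00FC, precomposed u with diaeresis

my %representation = (
    # Latin-1 stores the precomposed character in a single byte.
    'Latin-1'  => hex_bytes(encode('ISO-8859-1', $u_umlaut)),
    # MARC-UCS: decompose (NFD) to base + combining mark, then UTF-8 encode.
    'MARC-UCS' => hex_bytes(encode('UTF-8', NFD($u_umlaut))),
    # MARC-8: combining diaeresis (0xE8) before the base letter, by hand.
    'MARC-8'   => hex_bytes("\xE8\x75"),
);

printf "%-9s %s\n", $_, $representation{$_} for sort keys %representation;
```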

>   aRussell, Francis,d1910-14aThe world of D^®urer,

Since you are getting the base character OK (latin small letter u), we
should probably assume a base-plus-combining character scheme, and since the
combining character(s) come *before* the base character, we can probably
assume MARC-8.  Once we actually *verify* the hex encoding, we can go on to
what is happening to the records after the Z39.50 download... and what to do
with MARC-8, since it is not a character set used outside of
library-specific software applications.  ;-)
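
One way to verify the hex encoding is to dump each run of non-ASCII bytes
together with its immediate neighbours.  The following is only a sketch:
non_ascii_context() is a hypothetical helper, and the two sample strings are
hand-built stand-ins for the raw record bytes, which would need to be read
without any decoding layer.

```perl
use strict;
use warnings;

# Return one hex string per run of non-ASCII bytes, including the ASCII byte
# on either side (if any), so the position of the base letter is visible.
sub non_ascii_context {
    my ($raw) = @_;
    my @hits;
    while ($raw =~ /([\x00-\x7F]?)([\x80-\xFF]+)([\x00-\x7F]?)/g) {
        push @hits, join ' ', map { sprintf('%02X', ord($_)) } split //, "$1$2$3";
    }
    return @hits;
}

# Hand-built samples: combining byte before "u" (MARC-8 style) versus
# "u" followed by a two-byte combining character (UTF-8 style).
my @marc8 = non_ascii_context("The world of D\xE8urer,");
my @utf8  = non_ascii_context("The world of Du\xCC\x88rer,");

print "MARC-8 style: @marc8\n";
print "UTF-8 style:  @utf8\n";
```

If the dump shows E8 immediately before 75, that is consistent with MARC-8;
75 followed by CC 88 is consistent with MARC-UCS.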

BTW, the character set should also agree with the value in character
position 9 in the leader of the MARC record:
09 - Character coding scheme
Identifies the character coding scheme used in the record. 
# - MARC-8 (the pound symbol "#" represents a blank in this case)
a - UCS/Unicode 
[from http://www.loc.gov/marc/bibliographic/ecbdldrd.html#mrcblea ]
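
Checking leader position 9 takes only a few lines.  In this sketch
leader_charset() is a hypothetical helper and both 24-character leaders are
invented examples, not taken from the record under discussion.

```perl
use strict;
use warnings;

# Map the value in leader position 9 (zero-based) to a character coding scheme.
sub leader_charset {
    my ($leader) = @_;
    die "Leader must be 24 characters\n" unless length($leader) == 24;
    my $code = substr($leader, 9, 1);
    return $code eq ' ' ? 'MARC-8'
         : $code eq 'a' ? 'UCS/Unicode'
         :                "unknown ($code)";
}

# Invented sample leaders; only position 9 differs between them.
my $marc8_leader   = '00714cam  2200205 a 4500';
my $unicode_leader = '00714cam a2200205 a 4500';

print leader_charset($marc8_leader),   "\n";
print leader_charset($unicode_leader), "\n";
```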

> From: Tajoli Zeno [mailto:[EMAIL PROTECTED]
> The best option, now, is to use charset where 1 character
> is always 1 byte, for example ISO 8859_1

Be aware that converting MARC-8 to Latin-1 has the potential for data loss,
since MARC-8 can represent many more characters than Latin-1 can.  The
better bet is to convert to Unicode UTF-8 (or to get the records in that
character set to begin with, if that is an option).
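
The data loss is easy to demonstrate with the core Encode module.  This
sketch uses O with horn (U+01A0) as an example of a character that MARC-8
can represent but Latin-1 cannot; the variable names are illustrative, and
the conversion from MARC-8 itself is assumed to have happened already.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $o_horn = "\x{1A0}";    # LATIN CAPITAL LETTER O WITH HORN

# With the default CHECK value, Encode substitutes "?" for characters
# that the target encoding cannot represent.
my $as_latin1 = encode('ISO-8859-1', $o_horn);

# UTF-8 can represent the character, so it survives a round trip.
my $as_utf8    = encode('UTF-8', $o_horn);
my $round_trip = decode('UTF-8', $as_utf8);

printf "Latin-1 bytes: %s\n", unpack('H*', $as_latin1);
printf "UTF-8 bytes:   %s\n", unpack('H*', $as_utf8);
print  "UTF-8 round trip ",
       ($round_trip eq $o_horn ? "preserved" : "lost"), " the character\n";
```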

> > From: Tajoli Zeno [mailto:[EMAIL PROTECTED]
> > 3)In the leader, position 0-4 you have the number of 
> > character, NOT the number of bytes. 
>
> From: Colin Campbell [mailto:[EMAIL PROTECTED]
> No it should be number of bytes (LOC has clarified this in 
> their spec by saying "number of octets".) It has always
> been the length in bytes.

From the MARC 21 specifications...

  UCS/Unicode Markers and the MARC 21 Record Leader
  
  In MARC 21 records, Leader character position 9 contains value
  a if the data is encoded using UCS/Unicode characters. If any
  UCS/Unicode characters are to be included in the MARC 21 record,
  the entire MARC record must be encoded using UCS/Unicode characters.
  The record length contained in Leader positions 0-4 is a count of
  the number of octets in the record, not characters. The Leader
  position 9 value is not dependent on the character encoding used.
  This rule applies to MARC 21 records encoded using both the MARC-8
  and UCS/Unicode character sets.
  [from http://www.loc.gov/marc/specifications/speccharucs.html]
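
The difference between counting characters and counting octets shows up as
soon as a combining character is UTF-8 encoded.  The sketch below is not a
MARC parser: declared_length_matches() is a hypothetical helper, and the
"records" are just a five-digit length prefix glued onto a title string.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $title  = "Du\x{308}rer";               # u followed by combining diaeresis
my $body   = encode('UTF-8', $title);      # the octets as they would be stored
my $chars  = length($title);
my $octets = length($body);

# True if the first five digits equal the total length in octets.
sub declared_length_matches {
    my ($raw) = @_;
    return 0 unless $raw =~ /\A(\d{5})/;
    return $1 == length($raw) ? 1 : 0;
}

# Length prefix counted in octets (what the spec requires) ...
my $counted_in_octets = sprintf('%05d', 5 + $octets) . $body;
# ... versus a prefix counted in characters, which comes up one short.
my $counted_in_chars  = sprintf('%05d', 5 + $chars) . $body;

print "$chars characters, $octets octets\n";
print "octet count matches:     ", declared_length_matches($counted_in_octets), "\n";
print "character count matches: ", declared_length_matches($counted_in_chars), "\n";
```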

-- Michael

#  Michael Doran, Systems Librarian
#  University of Texas at Arlington
#  817-272-5326 office 
#  817-239-5368 cell
#  [EMAIL PROTECTED]
#  http://rocky.uta.edu/doran/



> -----Original Message-----
> From: Eric Lease Morgan [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, December 16, 2003 7:57 AM
> To: Perl4Lib
> Subject: Net::Z3950 and diacritics
> 
> 
> On 12/15/03 8:54 AM, Eric Lease Morgan <[EMAIL PROTECTED]> wrote:
> 
> > In order to get the MARC records for my "catalog" I have been searching
> > the LOC catalog, identifying the record I desire, and using Net::Z3950
> > to download the desired record via the MARC 001 tag. Tastes great. Less
> > filling.
> >
> > When I loop through my MARC records MARC::Batch sometimes warns that the
> > MARC leader is incorrect. This happens when the record contains a
> > diacritic. Specifically, my MARC::Batch object returns "Invalid record
> > length..." I have discovered that I can plow right through the record
> > anyway by turning on strict_off, but my resulting records get really
> > ugly at the point of the diacritic:
> >
> >   http://infomotions.com/books/?cmd=search&query=id=russell-world-107149566
> 
> Upon further investigation, it seems that MARC::Batch is not necessarily
> causing my problem with diacritics; instead, the problem may lie in the
> way I am downloading my records using Net::Z3950.
>
> How do I tell Net::Z3950 to download a specific MARC record using a
> specific character set?
>
> To download my MARC records from the LOC I feed a locally developed Perl
> script, using Net::Z3950, the value from a LOC MARC record, field 001.
> This retrieves one and only one record. I then suc

RE: Displaying diacritics in a terminal vs. a browser

2004-07-06 Thread Michael D Doran
Hi Andy,

> From: Houghton,Andrew [mailto:[EMAIL PROTECTED] 
>
> It just so happens that I have recently been converting 
> MARC-XML to RDF.  The RDF specification mandates Unicode 
> Normal form C, which means that the base character and the 
> diacritic are combined.

That's rather unfortunate, since Unicode includes the precomposed characters
largely for backward compatibility and the preferred 

> So I hacked together some Perl scripts to convert 
> Unicode NFD <-> Unicode NFC.
> 
> I was talking with a colleague, just yesterday, about whether 
> we should unleash these on the Net...  They need to be 
> cleaned up a little and need some basic documentation on how 
> to run the Perl scripts.

The W3C provides a Perl app that (I think) purports to do that [1].  I don't
know how much overlap there may be with your script, but just in case you
were not already aware of the W3C script, you may want to see if there is a
duplication of effort.

[1] "Charlint - A Character Normalization Tool" 
http://www.w3.org/International/charlint/.
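
For what it's worth, the core Unicode::Normalize module also converts
between the two forms.  This is a minimal sketch with illustrative variable
names; it says nothing about how Andy's scripts or Charlint actually work.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $decomposed  = "Du\x{308}rer";    # base letter + combining diaeresis (NFD)
my $precomposed = "D\x{FC}rer";      # single precomposed character (NFC)

my $to_nfc = NFC($decomposed);       # compose: u + U+0308 becomes U+00FC
my $to_nfd = NFD($precomposed);      # decompose: U+00FC becomes u + U+0308

printf "NFD -> NFC: %d -> %d characters\n", length($decomposed),  length($to_nfc);
printf "NFC -> NFD: %d -> %d characters\n", length($precomposed), length($to_nfd);
```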

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-239-5368 cell
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/ 



RE: Displaying diacritics in a terminal vs. a browser

2004-07-06 Thread Michael D Doran
> MARC-XML uses Unicode Normal form D, which means that the base
> character is separate from the diacritic.

I am not familiar with the MARC-XML specifications, so at the risk of
embarrassing myself: would it be correct to posit that it is not so much
that MARC-XML uses Unicode Normal form D as that the MARC 21 UCS/Unicode
environment is essentially the MARC-8 character repertoire translated into
the equivalent Unicode code points [1]?  Since the MARC-8 character
repertoire relies largely on combining characters, the end result will
mostly be Unicode Normal form D.  However, there *are* exceptions.  One
example is UPPERCASE O-HOOK, which is a single character in MARC-8 (hex AC),
and therefore a precomposed character in MARC UCS/Unicode (hex 01A0) [and
therefore, I assume, MARC-XML], even though there is a decomposed (i.e.
Normal Form D) Unicode version (hex 004F 031B) of that character.
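
Whether a given string is in Normal Form D can be tested by comparing it
with its own NFD form.  In this sketch is_nfd() is a hypothetical helper,
and U+01A0 for uppercase O-hook is my reading of the Unicode charts rather
than something taken from the MARC documentation.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# A string is in NFD if decomposing it changes nothing.
sub is_nfd {
    my ($text) = @_;
    return NFD($text) eq $text ? 1 : 0;
}

my $u_diaeresis = "u\x{308}";    # base + combining mark, as MARC-UCS stores it
my $o_hook      = "\x{1A0}";     # precomposed uppercase O-hook (assumed U+01A0)

# Code points of the decomposed O-hook, as four-digit hex values.
my $o_hook_nfd = join ' ', map { sprintf('%04X', ord($_)) } split //, NFD($o_hook);

printf "u + combining diaeresis is NFD: %s\n", is_nfd($u_diaeresis) ? 'yes' : 'no';
printf "precomposed O-hook is NFD:      %s\n", is_nfd($o_hook)      ? 'yes' : 'no';
print  "NFD of O-hook: $o_hook_nfd\n";
```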

I have been trying to learn about character sets, especially with regard to
MARC and library environments, and have put some (hopefully) useful
information on the web [2].  Included is a technical primer for librarians
as well as extensive code charts/matrices for MARC character sets.  There is
a fairly decent list of web resources [3].  Note that the PowerPoint slide
show is of limited use without the original commentary and is a huge file
because it includes embedded fonts.

[1] Coded Character Sets > A Technical Primer for Librarians > MARC Unicode
http://rocky.uta.edu/doran/charsets/unicode.html

[2] Coded Character Sets
http://rocky.uta.edu/doran/charsets/

[3] Resources on the Web: With an emphasis on library automation and the
internet
http://rocky.uta.edu/doran/charsets/resources.html

BTW, the earlier message I sent to the list had an unfinished sentence.  I
should have proofread before sending and I apologize.

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-239-5368 cell
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/ 




RE: Undelivered Mail Returned to Sender

2004-08-05 Thread Michael D Doran
Hi Linh,

Perl4lib is no longer hosted at listserv.rice.edu.  Try one of the options
below.

To unsubscribe, send a message to:

<[EMAIL PROTECTED]>

...or...

 To remove your address from the list, just send a message to
 the address in the ``List-Unsubscribe'' header of any list
 message. If you haven't changed addresses since subscribing,
 you can also send a message to:

   <[EMAIL PROTECTED]>

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 cell
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/ 

> -----Original Message-----
> From: Linh Le [mailto:[EMAIL PROTECTED] 
> Sent: Tuesday, August 03, 2004 3:51 PM
> To: [EMAIL PROTECTED]
> Subject: Fwd: Undelivered Mail Returned to Sender
> 
> Help, 
>  
> I tried to send a signoff command to 
> [EMAIL PROTECTED] but got the error.
>  
> Linh
> 


Net::Z3950, OPAC record syntax & multiple MFHD 866

2004-08-18 Thread Michael D Doran
Please excuse the cross-posting (perl4lib & Net-z3950).

I am working with a Perl script designed to query our catalog via Net::Z3950
and retrieve a journal record.  The OPAC record syntax is specified because
the ultimate point of the script [1] is to parse the journal holdings to
determine if a particular year is owned by our library.  Our holdings (MFHD)
records often contain multiple 866 fields (which contain the actual holdings
info); however, Net::Z3950 only returns the *last* 866 from a MFHD record,
thereby giving an incomplete list of holdings.  

Below is the relevant code:
 
  use strict;
  use warnings;
  use Net::Z3950;

  my $issn         = '0028-0836';
  my $query        = '@attr 1=8 ' . $issn;   # Bib-1 use attribute 8 = ISSN
  my $target       = 'pulse.uta.edu';
  my $port         = 7099;
  my $database     = 'pulse';
  my $recordSyntax = 'OPAC';

  my $conn = Net::Z3950::Connection->new($target, $port,
                                         databaseName => $database)
      or die "Cannot connect to $target:$port: $!";
  my $rs = $conn->search(-prefix => $query)
      or die "Search failed: " . $conn->errmsg();
  $rs->option(preferredRecordSyntax => $recordSyntax);

  # Z39.50 result sets are numbered from 1.
  for my $i (1 .. $rs->size()) {
      my $rec  = $rs->record($i);
      my $marc = $rec->render();
      print $marc;
  }

If I search for the journal Nature (ISSN 0028-0836) which in our catalog has
these multiple 866s in the first holdings record:

  866  0 _av.253(1975)-v.344(1990:Apr.),
  866  0 _av.345(1990)-v.426(2003:Nov.20),
  866  0 _av.426(2003:Dec.)-v.429(2004:May)
  866  0 _aINDEXES v.277(1979)-v.348(1990),v.403-408(2000),v.415(2002)-v.426(2003)

...I get this MARC data returned by Net::Z3950.  Note the "enumAndChron"
line which contains the 866 info.

* Bibliographic record:

245  00  $aNature.
260  $a[London, etc.,$bMacmillan Journals ltd.]

* Holdings record 1 of 4:
typeOfRecord: y
encodingLevel: 4
receiptAcqStatus: 4
generalRetention: 8
completeness: 4
dateOfReport: 00
nucCode: sel,per
localLocation: Science & Engineering Library: Periodicals
callNumber: Q 1
enumAndChron: ^_aINDEXES v.277(1979)-v.348(1990),v.403-408(2000),v.415(2002)-v.426(2003)


As you can see, Net::Z3950 only returns the last 866 field.
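
To show why the missing fields matter for the year check, here is a rough
sketch of that check run against the 866 statements listed above.  It is
not David Walker's code: year_ranges() and year_is_held() are hypothetical
helpers, and the pattern only understands the "v.N(YYYY...)-v.N(YYYY...)"
form seen in these examples.

```perl
use strict;
use warnings;

# Extract [start_year, end_year] pairs from one 866 $a holdings statement.
sub year_ranges {
    my ($statement) = @_;
    my @ranges;
    while ($statement =~ /\((\d{4})[^)]*\)\s*-\s*v\.\d+\((\d{4})[^)]*\)/g) {
        push @ranges, [$1, $2];
    }
    return @ranges;
}

# True if any range in any of the statements covers the given year.
sub year_is_held {
    my ($year, @statements) = @_;
    for my $statement (@statements) {
        for my $range (year_ranges($statement)) {
            return 1 if $year >= $range->[0] && $year <= $range->[1];
        }
    }
    return 0;
}

# The four 866 $a values from the first holdings record for Nature.
my @all_866 = (
    'v.253(1975)-v.344(1990:Apr.),',
    'v.345(1990)-v.426(2003:Nov.20),',
    'v.426(2003:Dec.)-v.429(2004:May)',
    'INDEXES v.277(1979)-v.348(1990),v.403-408(2000),v.415(2002)-v.426(2003)',
);
# What the OPAC record actually delivers: only the last 866.
my @last_866_only = ($all_866[-1]);

printf "2004 held (all 866s):      %s\n", year_is_held(2004, @all_866)       ? 'yes' : 'no';
printf "2004 held (last 866 only): %s\n", year_is_held(2004, @last_866_only) ? 'yes' : 'no';
```

A real script would probably also want to skip the INDEXES statement, since
index coverage is not the same thing as holdings.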

So my questions are:
1) Has anyone else noticed/experienced this behavior (i.e. only getting the
last 866)?  I'm trying to determine if this behavior is unique to how I am
implementing/configuring Net::Z3950 and/or if it is ILMS specific.  This is
my first time using Net::Z3950, so if I'm doing something wrong, please
correct me.

2) Is this behavior by design or is it a bug?  According to the MARC
standard, the MFHD 866 is repeatable [2].  Please disregard the fact that we
have Index holdings in the 866 rather than the 868 ...or why we are using
multiple 866 even for regular holdings.  Those issues are not under my
control.

3) If it is a bug, is it in Net::Z3950, in the Z39.50 protocol, or in the
Voyager Z39.50 implementation/API?  (I have limited experience with Z39.50
and the only other client I have, BookWhere, does not appear to offer the
"OPAC" record syntax.)  If it is in the Net::Z3950 module, can it be
fixed?  :-)

I have browsed the Net-z3950 listserv archive back to September 2003 (when
version 0.36, which added support for the OPAC record syntax, was released)
and didn't see any mention of this behavior.

Our software and versions:
  Net::Z3950 version 0.39 (on Solaris)
  Our ILMS is Endeavor's Voyager, version 2001.2

Thanks!

-- Michael

[1] The script is designed as an SFX plug-in and was written by David Walker
of Cal State San Marcos
http://library.csusm.edu/csu/sfx/local_holding_chameleon.asp
[2] MARC 21 Concise Holdings: Textual Holdings Statement Fields (866-868)
http://www.loc.gov/marc/holdings/echdtext.html

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 cell
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/


RE: Net::Z3950, OPAC record syntax & multiple MFHD 866 - SOLVED

2004-08-18 Thread Michael D Doran
I have been informed that this is a Voyager ILMS Z39.50 server bug.  (Thanks
Sandy!)

Sorry for the false alarm... didn't mean to cast any aspersions on
Net::Z3950!

-- Michael

> -----Original Message-----
> From: Michael D Doran 
> Sent: Wednesday, August 18, 2004 11:23 AM
> To: '[EMAIL PROTECTED]'; [EMAIL PROTECTED]
> Subject: Net::Z3950, OPAC record syntax & multiple MFHD 866