Net::Z3950 and diacritics

2003-12-16 Thread Eric Lease Morgan
On 12/15/03 8:54 AM, Eric Lease Morgan <[EMAIL PROTECTED]> wrote:

> In order to get the MARC records for my "catalog" I have been searching the
> LOC catalog, identifying the record I desire, and using Net::Z3950 to download
> the desired record via the MARC 001 tag. Tastes great. Less filling.
> 
> When I loop through my MARC records MARC::Batch sometimes warns that the MARC
> leader is incorrect. This happens when the record contains a diacritic.
> Specifically, my MARC::Batch object returns "Invalid record length..." I have
> discovered that I can plow right through the record anyway by turning on
> strict_off, but my resulting records get really ugly at the point of the
> diacritic:
> 
>  http://infomotions.com/books/?cmd=search&query=id=russell-world-107149566

Upon further investigation, it seems that MARC::Batch is not necessarily
causing my problem with diacritics; instead, the problem may lie in the way
I am downloading my records using Net::Z3950.

How do I tell Net::Z3950 to download a specific MARC record using a specific
character set?

To download my MARC records from the LOC I feed a locally developed Perl
script, using Net::Z3950, the value from a LOC MARC record, field 001. This
retrieves one and only one record. I then suck up the found record and put it
into a MARC::Record object. It is all done like this:


  # define some constants
  my $DATABASE = 'voyager';
  my $SERVER   = 'z3950.loc.gov';
  my $PORT = '7090';
  
  # create a LOC (Voyager) 001 query
  my $query = '@attr 1=7 3118006';
  
  # create a z39.50 object
  my $z3950 = Net::Z3950::Manager->new(databaseName => $DATABASE);
  
  # assign the object some z39.50 characteristics
  $z3950->option(elementSetName => "f");
  $z3950->option(preferredRecordSyntax => Net::Z3950::RecordSyntax::USMARC);
  
  # connect to the server and check for success
  my $connection = $z3950->connect($SERVER, $PORT)
    or die "can't connect to $SERVER:$PORT: $!";
  
  # search
  my $results = $connection->search($query);
  
  # get the found record and turn it into a MARC::Record object
  my $record = $results->record(1);
  $record = MARC::Record->new_from_usmarc($record->rawdata());

  # create a file name
  my $id = time;

  # write the record
  open MARC, "> $id.marc" or die "can't open $id.marc: $!";
  print MARC $record->as_usmarc;
  close MARC;


This process works just fine for records that contain no diacritics, but
when diacritics are in the records extra characters end up in my saved
files, like this:

  00901nam  22002651
^^^
  45100080005001780080041000250350021000669060045000870
  1000170013204000180014905000180016708200100018512900195245009
  200224260003400316347003504900029003975040026004266340045
  27100021004869910044005079910055005510990029006063118006
  1974041700.0731207s1967nyuabf   b000 0beng  
  9(DLC)   67029856  a7bcbccorignewdueocipf19gy-gencatlg
  a   67029856   aDLCcDLCdDLC00aND588.D9bR8500a759.31
  aRussell, Francis,d1910-14aThe world of Dˆ®urer,
  ^^^
  1471-1528,cby Francis Russell and the editors of Time-Life
  Books.  aNew York,bTime, inc.c[1967]  a183 p.billus.,
  maps, col. plates.c32 cm.0 aTime-Life library of art
  aBibliography: p. 177.10aDˆ®urer, Albrecht,d1471-1528.2
  ^^^
  aTime-Life Books.  bc-GenCollhND588.D9iR85tCopy 1wBOOKS
  bc-GenCollhND588.D9iR85p00034015107tCopy 2wCCF
  arussell-world-1071495663

Notice how Dürer got munged into Dˆ®urer, twice, and consequently the record
length is not 901 but 903 instead.

Some people say I must be sure to request a specific character set from the
LOC when downloading my MARC records, specifically MARC-8 or MARC-UCS. Which
one of these character sets do I want and how do I tell the remote database
which one I want?

-- 
Eric "The Ugly American Who Doesn't Understand Diacritics" Morgan
University Libraries of Notre Dame

(574) 631-8604




Re: Net::Z3950 and diacritics

2003-12-16 Thread Tajoli Zeno
Hi,

in fact the question is quite complex to explain, and I'm not sure that I
can explain it well.

At 14.57 16/12/03, you wrote:

This process works just fine for records that contain no diacritics, but
when diacritics are in the records extra characters end up in my saved
files, like this:
  00901nam  22002651
^^^
  45100080005001780080041000250350021000669060045000870
  1000170013204000180014905000180016708200100018512900195245009
  200224260003400316347003504900029003975040026004266340045
  27100021004869910044005079910055005510990029006063118006
  1974041700.0731207s1967nyuabf   b000 0beng
  9(DLC)   67029856  a7bcbccorignewdueocipf19gy-gencatlg
  a   67029856   aDLCcDLCdDLC00aND588.D9bR8500a759.31
  aRussell, Francis,d1910-14aThe world of Dˆ®urer,
  ^^^
  1471-1528,cby Francis Russell and the editors of Time-Life
  Books.  aNew York,bTime, inc.c[1967]  a183 p.billus.,
  maps, col. plates.c32 cm.0 aTime-Life library of art
  aBibliography: p. 177.10aDˆ®urer, Albrecht,d1471-1528.2
  ^^^
  aTime-Life Books.  bc-GenCollhND588.D9iR85tCopy 1wBOOKS
  bc-GenCollhND588.D9iR85p00034015107tCopy 2wCCF
  arussell-world-1071495663
Notice how Dürer got munged into Dˆ®urer, twice, and consequently the record
length is not 901 but 903 instead.
Some people say I must be sure to request a specific character set from the
LOC when downloading my MARC records, specifically MARC-8 or MARC-UCS. Which
one of these character sets do I want and how do I tell the remote database
which one I want?
1) When you query LOC without requesting a specific character set, you
receive data in the MARC-8 character set.

2) In the MARC-8 character set, a letter like "è" [e grave] is encoded with
TWO bytes: one for the sign [the grave accent] and one for the letter [the
letter e].

3) In the leader, positions 0-4, you have the number of characters, NOT the
number of bytes. In your record there are 901 characters and 903 bytes.

In fact, Perl's "length" function reads the number of bytes. The best
option, for now, is to use a charset where 1 character is always 1 byte, for
example ISO 8859-1.
A good place to understand character sets is http://www.gymel.com/charsets/
[in German]
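For example, a tiny Perl demonstration of the byte-versus-character difference (a sketch; the byte values are the MARC-8 encoding described above):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# In MARC-8 the umlaut is a separate combining byte (0xE8) that comes
# *before* the base letter, so "Dürer" arrives as six bytes:
my $marc8 = "D\xE8urer";

# With no decoding layer, Perl's length() counts bytes:
print length($marc8), "\n";   # prints 6, though only 5 characters display
```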

Bye

Zeno Tajoli
[EMAIL PROTECTED]
CILEA - Segrate (MI)
02 / 26995321


Re: Net::Z3950 and diacritics

2003-12-16 Thread Ed Summers
On Tue, Dec 16, 2003 at 03:52:56PM +0100, Tajoli Zeno wrote:
> 1)When you call LOC without a specific character you recive data in MARC-8 
> character set.
> 
> 2) In MARC-8 character set a letter like "è"  [e grave] is done with TWO 
> bytes one for the sign [the grave accent] and one for the letter [the 
> letter e].
> 
> 3)In the leader, position 0-4 you have the number of character, NOT the 
> number of bytes. In your record there are 901 characters and 903 bytes.
> 
> In fact the "length" function of perl read the number of bytes. The best 
> option, now, is to use charset where 1 character is always 1 byte, for 
> example ISO 8859_1

While this is certainly part of the answer, we still don't know why the 
record length is off. The way I see it, there are two possible options: 

1. Net::Z3950 is doing on-the-fly conversion of MARC-8 to Latin1
2. LC's Z39.50 server is emitting the records that way, and not updating the 
   record length.

I guess one way to test which one is true would be to query another Z39.50 
server for the same record, and see if the same problem exists... in which
case 1 is probably the case. 

//Ed


Re: Net::Z3950 and diacritics

2003-12-16 Thread Colin Campbell
On Tue, Dec 16, 2003 at 03:52:56PM +0100, Tajoli Zeno <[EMAIL PROTECTED]> wrote:
> 
> 2) In MARC-8 character set a letter like "è"  [e grave] is done with TWO 
> bytes one for the sign [the grave accent] and one for the letter [the 
> letter e].
> 
> 3)In the leader, position 0-4 you have the number of character, NOT the 
> number of bytes. In your record there are 901 characters and 903 bytes.
> 
No, it should be the number of bytes (LOC has clarified this in their spec
by saying "number of octets"). It has always been the length in bytes. In
the example it looks like the non-spacing diacritic has been converted
to two bytes (which sounds almost like it was assumed to be Latin-1 and
got incorrectly MARC-8'd somewhere along the line).
  But the weird characters may result from the screen display; you need
to ascertain what the actual values are there. (Dumping the record in
hex may reveal something.)
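A minimal sketch of such a hex dump in Perl ($raw here is sample data standing in for the record's raw bytes, e.g. from rawdata()):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Print each byte of the raw record as a two-digit hex value so the
# actual encoding of the diacritics can be inspected.
my $raw = "D\xE8urer";   # sample bytes for illustration
my $hex = join ' ', map { sprintf '%02X', ord } split //, $raw;
print "$hex\n";   # 44 E8 75 72 65 72
```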
Cheers
  Colin

-- 
  Colin Campbell 
  Technical Services Consultant
  Sirsi Ltd
  [EMAIL PROTECTED]


RE: Net::Z3950 and diacritics

2003-12-16 Thread Michael D Doran
First, we probably want to figure out what character set the records are
encoded in as received from LOC.  Since only the non-ASCII characters will
give us a clue, we can look at the umlauted-u ("ü") in Dürer.

  Charset     hex character(s) used to represent "ü"
  -------     --------------------------------------
  MARC-8      0xE8 0x75 (combining umlaut/diaeresis preceding latin small
              letter u)
  MARC-UCS*   0x75 0xCC 0x88 (latin small letter u followed by combining
              umlaut/diaeresis -- in this case the combining character is
              represented by two bytes)
  Latin-1     0xFC (precomposed latin small letter u with umlaut/diaeresis)

* MARC-UCS/Unicode is UTF-8 encoded, therefore U+0075 becomes 0x75 and the
U+0308 becomes 0xCC 0x88.  The MARC-21 specification does not allow the use
of the precomposed Unicode character for an umlauted-u.
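Those byte patterns can be checked mechanically; here is a rough Perl sketch (the patterns are only the ones from the table above, applied to the "ü" case):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Rough heuristic: guess which of the three charsets produced a given
# byte string containing "ü", using exactly the byte patterns listed
# above for MARC-8, MARC-UCS (UTF-8), and Latin-1.
sub guess_charset {
    my ($bytes) = @_;
    return 'MARC-8'   if $bytes =~ /\xE8u/;      # combining mark, then base letter
    return 'MARC-UCS' if $bytes =~ /u\xCC\x88/;  # base letter, then UTF-8 combining mark
    return 'Latin-1'  if $bytes =~ /\xFC/;       # precomposed character
    return 'unknown';
}

print guess_charset("D\xE8urer"), "\n";      # MARC-8
print guess_charset("Du\xCC\x88rer"), "\n";  # MARC-UCS
print guess_charset("D\xFCrer"), "\n";       # Latin-1
```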

>   aRussell, Francis,d1910-14aThe world of D^®urer,

Since you are getting the base character OK (latin small letter u), we
should probably assume a base-plus-combining character scheme, and since the
combining character(s) come *before* the base character, we can probably
assume MARC-8.  If we could actually *verify* the hex encoding, we can go on
to what is happening to the records subsequent to the Z39.50 download... and
what to do with MARC-8, since it is not a character set used outside of
library-specific software applications.  ;-)

BTW, the character set should also agree with the value in character
position 9 in the leader of the MARC record:
09 - Character coding scheme
Identifies the character coding scheme used in the record. 
# - MARC-8 (the pound symbol "#" represents a blank in this case)
a - UCS/Unicode 
[from http://www.loc.gov/marc/bibliographic/ecbdldrd.html#mrcblea ]
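In Perl that check is just a substr on the leader (a sketch; with MARC::Record the leader would come from $record->leader()):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Leader position 9 is the character coding scheme: blank = MARC-8,
# 'a' = UCS/Unicode.  Sample leader modeled on the record above.
my $leader = '00901nam  22002651  4500';
my $scheme = substr($leader, 9, 1);
print $scheme eq 'a' ? "UCS/Unicode\n"
    : $scheme eq ' ' ? "MARC-8\n"
    :                  "unknown ($scheme)\n";   # prints "MARC-8"
```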

> From: Tajoli Zeno [mailto:[EMAIL PROTECTED]
> The best option, now, is to use charset where 1 character
> is always 1 byte, for example ISO 8859_1

Be aware that converting MARC-8 to Latin-1 has the potential for data loss,
since there are many more characters that can be represented in MARC-8, than
can be represented in Latin-1.  The better bet is to convert to Unicode
UTF-8 (or get the records in that character set to begin with, if that is an
option).

> > From: Tajoli Zeno [mailto:[EMAIL PROTECTED]
> > 3)In the leader, position 0-4 you have the number of 
> > character, NOT the number of bytes. 
>
> From: Colin Campbell [mailto:[EMAIL PROTECTED]
> No it should be number of bytes (LOC has clarified this in 
> their spec by saying "number of octets".) It has always
> been the length in bytes.

From the MARC 21 specifications...

  UCS/Unicode Markers and the MARC 21 Record Leader
  
  In MARC 21 records, Leader character position 9 contains value
  a if the data is encoded using UCS/Unicode characters. If any
  UCS/Unicode characters are to be included in the MARC 21 record,
  the entire MARC record must be encoded using UCS/Unicode characters.
  The record length contained in Leader positions 0-4 is a count of
  the number of octets in the record, not characters. The Leader
  position 9 value is not dependent on the character encoding used.
  This rule applies to MARC 21 records encoded using both the MARC-8
  and UCS/Unicode character sets.
  [from http://www.loc.gov/marc/specifications/speccharucs.html]

-- Michael

#  Michael Doran, Systems Librarian
#  University of Texas at Arlington
#  817-272-5326 office 
#  817-239-5368 cell
#  [EMAIL PROTECTED]
#  http://rocky.uta.edu/doran/




MARC::Record v1.34

2003-12-16 Thread Andy Lester
The uploaded file

MARC-Record-1.34.tar.gz

has entered CPAN as

  file: $CPAN/authors/id/P/PE/PETDANCE/MARC-Record-1.34.tar.gz

The big change in this release is the ability to read from a pipe, or
an output stream.  In our case (at Follett Library Resources), we have
thousands of MARC files that we've gzipped to save space (they save
about 85-90% gzipped), but we still need to be able to use the data on
the fly in our new TitleWise service.  Rather than decompressing and
then opening the file, Ed Summers worked it out so that MARC::File::*
can read from a pipe.

We were considering letting the marcdump and marclint programs read from
standard input, but you can do that with "-" as a filename anyway.
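For instance, reading directly from a decompression pipe might look like this (a sketch; records.marc.gz is a hypothetical file of gzipped USMARC records):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use MARC::Batch;

# Open a gunzip pipe and hand the filehandle to MARC::Batch instead of
# a filename -- no temporary decompressed file needed.
open my $fh, '-|', 'gzip', '-dc', 'records.marc.gz'
    or die "can't open pipe: $!";

my $batch = MARC::Batch->new('USMARC', $fh);
while (my $record = $batch->next) {
    print $record->title, "\n";
}
close $fh;
```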


1.34December 16th, 2003
[ENHANCEMENTS]
- modified MARC::File::in() to allow passing in filehandles instead
  of a filename. Useful in situations where you might have data
  compressed on disk, and want to read from a decompression pipe.
  This affects MARC::Batch of course as well, which has had its
  documentation updated.
- added t/85.fh.t to test new filehandle passing
- Incorrect filetypes passed in to the MARC::Batch constructor
  now croak instead of die, so you can see where in your code it
  was called.

[FIXES]
- etc/specs modified to correctly parse LC's docs to get the 250
  $b properly. Thanks Bryan Baldus at Quality Books.
- new Lint.pm with 250 $b.
- MARC::Field now leaves alphabetic indicators as they are instead
  of squashing to a space.  Thanks Leif Andersson from Stockholms
  Universitet.
- MARC::File::USMARC no longer checks the validity of indicators
  but leaves that up to MARC::Field (instead of having the check twice).
- In MARC::Batch, the 'warn' elements weren't quoted.
- warnings_on and strict_on should now be respected.

Have fun!

xoa

-- 
Andy Lester => [EMAIL PROTECTED] => www.petdance.com => AIM:petdance


Re: Net::Z3950 and diacritics

2003-12-16 Thread Timothy Prettyman


I don't see how you can get a result for your search if you're using @attr 
1=7.  7 is the USE attribute for an ISBN search, and your term is the local 
system number, I think (use attribute 12).

When I do that search (@attr 1=12 3118006) against the LC bib file, using 
Net::Z3950 in a program essentially the same as yours, the USMARC record 
returned is 860 bytes long.  Here's a formatted dump of the record:

LDR 00860nam  22002531  4500
0013118006
0051974041700.0
008731207s1967nyuabf   b000 0beng
035|9(DLC)   67029856
906|a7|bcbc|corignew|du|eocip|f19|gy-gencatlg
010|a   67029856
040|aDLC|cDLC|dDLC
050 00 |aND588.D9|bR85
082 00 |a759.3
100 1  |aRussell, Francis,|d1910-
245 14 |aThe world of D?urer, 1471-1528,|cby Francis Russell and the 
editors of Time-Life Books.
260|aNew York,|bTime, inc.|c[1967]
300|a183 p.|billus., maps, col. plates.|c32 cm.
490 0  |aTime-Life library of art
504|aBibliography: p. 177.
600 10 |aD?urer, Albrecht,|d1471-1528.
710 2  |aTime-Life Books.
991|bc-GenColl|hND588.D9|iR85|tCopy 1|wBOOKS
991|bc-GenColl|hND588.D9|iR85|p00034015107|tCopy 2|wCCF

The "?" in the 245 and 600 fields are 0xE8, the MARC-8 code for combining 
umlaut/diaeresis.

It's puzzling that the record you got has a different length--not sure 
what's going on there.

Tim Prettyman
University of Michigan Library


Re: Net::Z3950 and diacritics

2003-12-16 Thread Timothy Prettyman
(I'm sending this again, because I think my formatted record may have 
gotten messed up in the process of being cut/pasted.  My apologies.)

I don't see how you can get a result for your search if you're using @attr 
1=7.  7 is the USE attribute for an ISBN search, and your term is the local 
system number, I think (use attribute 12).

When I do that search (@attr 1=12 3118006) against the LC bib file, using 
Net::Z3950 in a program essentially the same as yours, the USMARC record 
returned is 860 bytes long.  Here's a formatted dump of the record:

LDR 00860nam  22002531  4500
0013118006
0051974041700.0
008731207s1967nyuabf   b000 0beng
035|9(DLC)   67029856
906|a7|bcbc|corignew|du|eocip|f19|gy-gencatlg
010|a   67029856
040|aDLC|cDLC|dDLC
050 00 |aND588.D9|bR85
082 00 |a759.3
100 1  |aRussell, Francis,|d1910-
245 14 |aThe world of D?urer, 1471-1528,|cby Francis Russell and the 
editors of Time-Life Books.
260|aNew York,|bTime, inc.|c[1967]
300|a183 p.|billus., maps, col. plates.|c32 cm.
490 0  |aTime-Life library of art
504|aBibliography: p. 177.
600 10 |aD?urer, Albrecht,|d1471-1528.
710 2  |aTime-Life Books.
991|bc-GenColl|hND588.D9|iR85|tCopy 1|wBOOKS
991|bc-GenColl|hND588.D9|iR85|p00034015107|tCopy 2|wCCF

The "?" in the 245 and 600 fields are 0xE8, the MARC-8 code for combining 
umlaut/diaeresis.

It's puzzling that the record you got has a different length--not sure 
what's going on there.

Tim Prettyman
University of Michigan Library