Net::Z3950 and diacritics
On 12/15/03 8:54 AM, Eric Lease Morgan <[EMAIL PROTECTED]> wrote:

> In order to get the MARC records for my "catalog" I have been searching the
> LOC catalog, identifying the record I desire, and using Net::Z3950 to
> download the desired record via the MARC 001 tag. Tastes great. Less
> filling.
>
> When I loop through my MARC records MARC::Batch sometimes warns that the
> MARC leader is incorrect. This happens when the record contains a
> diacritic. Specifically, my MARC::Batch object returns "Invalid record
> length..." I have discovered that I can plow right through the record
> anyway by turning on strict_off, but my resulting records get really ugly
> at the point of the diacritic:
>
> http://infomotions.com/books/?cmd=search&query=id=russell-world-107149566

Upon further investigation, it seems that MARC::Batch is not necessarily
causing my problem with diacritics; instead, the problem may lie in the way I
am downloading my records using Net::Z3950.

How do I tell Net::Z3950 to download a specific MARC record using a specific
character set?

To download my MARC records from the LOC I feed a locally developed Perl
script, using Net::Z3950, the value from a LOC MARC record, field 001. This
retrieves one and only one record. I then suck up the found record and put it
into a MARC::Record object. It is all done like this:

  # define some constants
  my $DATABASE = 'voyager';
  my $SERVER   = 'z3950.loc.gov';
  my $PORT     = '7090';

  # create a LOC (Voyager) 001 query
  my $query = '@attr 1=7 3118006';

  # create a z39.50 object
  my $z3950 = Net::Z3950::Manager->new(databaseName => $DATABASE);

  # assign the object some z39.50 characteristics
  $z3950->option(elementSetName => "f");
  $z3950->option(preferredRecordSyntax => Net::Z3950::RecordSyntax::USMARC);

  # connect to the server and check for success
  my $connection = $z3950->connect($SERVER, $PORT);

  # search
  my $results = $connection->search($query);

  # get the found record and turn it into a MARC::Record object
  my $record = $results->record(1);
  $record = MARC::Record->new_from_usmarc($record->rawdata());

  # create a file name
  my $id = time;

  # write the record
  open MARC, "> $id.marc";
  print MARC $record->as_usmarc;
  close MARC;

This process works just fine for records that contain no diacritics, but when
diacritics are in the records extra characters end up in my saved files, like
this:

  00901nam 22002651
  ^^^
  45100080005001780080041000250350021000669060045000870
  1000170013204000180014905000180016708200100018512900195245009
  200224260003400316347003504900029003975040026004266340045
  27100021004869910044005079910055005510990029006063118006
  1974041700.0731207s1967nyuabf b000 0beng 9(DLC) 67029856
  a7bcbccorignewdueocipf19gy-gencatlg a 67029856
  aDLCcDLCdDLC00aND588.D9bR8500a759.31 aRussell, Francis,d1910-14aThe
  world of D®urer,
  ^^^
  1471-1528,cby Francis Russell and the editors of Time-Life Books.
  aNew York,bTime, inc.c[1967] a183 p.billus., maps, col. plates.c32
  cm.0 aTime-Life library of art aBibliography: p. 177.10aD®urer,
  Albrecht,d1471-1528.2
  ^^^
  aTime-Life Books. bc-GenCollhND588.D9iR85tCopy 1wBOOKS
  bc-GenCollhND588.D9iR85p00034015107tCopy 2wCCF
  arussell-world-1071495663

Notice how Dürer got munged into D®urer, twice, and consequently the record
length is not 901 but 903 instead.

Some people say I must be sure to request a specific character set from the
LOC when downloading my MARC records, specifically MARC-8 or MARC-UCS. Which
one of these character sets do I want, and how do I tell the remote database
which one I want?
--
Eric "The Ugly American Who Doesn't Understand Diacritics" Morgan
University Libraries of Notre Dame
(574) 631-8604
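[A minimal, self-contained sketch, not part of Eric's script: one quick way to
see the 901-versus-903 mismatch is to compare what the leader claims against
what is actually on disk. This assumes the saved file is a single USMARC
record terminated by the usual 0x1D record terminator.]

  #!/usr/bin/perl -w
  # Sketch: compare the record length stored in leader positions 0-4 with the
  # actual number of bytes in a saved USMARC record.
  use strict;

  my $file = shift or die "usage: $0 file.marc\n";
  open MARC, "< $file" or die "can't open $file: $!";
  binmode MARC;
  $/ = "\x1D";                               # MARC record terminator
  my $raw = <MARC>;
  close MARC;

  my $leader_length = substr($raw, 0, 5);    # what the leader claims
  my $byte_length   = length($raw);          # what the file actually holds
  print "leader says $leader_length, file holds $byte_length bytes\n";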
Re: Net::Z3950 and diacritics
Hi,

in fact the question is quite complex to explain, and I'm not sure that I can
explain it well.

At 14.57 16/12/03, you wrote:

> This process works just fine for records that contain no diacritics, but
> when diacritics are in the records extra characters end up in my saved
> files [...]
>
> Notice how Dürer got munged into D®urer, twice, and consequently the
> record length is not 901 but 903 instead. Some people say I must be sure
> to request a specific character set from the LOC when downloading my MARC
> records, specifically MARC-8 or MARC-UCS. Which one of these character
> sets do I want and how do I tell the remote database which one I want?

1) When you call LOC without asking for a specific character set, you receive
data in the MARC-8 character set.

2) In the MARC-8 character set a letter like "è" [e grave] is done with TWO
bytes, one for the sign [the grave accent] and one for the letter [the
letter e].

3) In the leader, positions 0-4, you have the number of characters, NOT the
number of bytes. In your record there are 901 characters and 903 bytes.

In fact, Perl's "length" function reads the number of bytes. The best option,
now, is to use a character set where 1 character is always 1 byte, for
example ISO 8859-1.

A good place to understand character sets is http://www.gymel.com/charsets/
[in German].

Bye
Zeno Tajoli
[EMAIL PROTECTED]
CILEA - Segrate (MI)
02 / 26995321
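[A tiny sketch, not from Zeno's message, to make point 2 concrete: the same
accented letter occupies a different number of bytes depending on the
character set. The byte values used are the u-umlaut representations that come
up later in this thread (MARC-8: 0xE8 0x75; Latin-1: 0xFC); Perl's length() on
a plain byte string counts bytes.]

  #!/usr/bin/perl -w
  # Sketch: MARC-8 uses a combining character plus a base letter (two bytes),
  # Latin-1 uses one precomposed byte; length() counts bytes.
  use strict;

  my $marc8  = "\xE8\x75";   # combining diaeresis followed by 'u' (MARC-8)
  my $latin1 = "\xFC";       # precomposed u-umlaut (Latin-1)

  print "MARC-8:  ", length($marc8),  " bytes\n";   # prints 2
  print "Latin-1: ", length($latin1), " bytes\n";   # prints 1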
Re: Net::Z3950 and diacritics
On Tue, Dec 16, 2003 at 03:52:56PM +0100, Tajoli Zeno wrote:

> 1) When you call LOC without asking for a specific character set, you
> receive data in the MARC-8 character set.
>
> 2) In the MARC-8 character set a letter like "è" [e grave] is done with TWO
> bytes, one for the sign [the grave accent] and one for the letter [the
> letter e].
>
> 3) In the leader, positions 0-4, you have the number of characters, NOT the
> number of bytes. In your record there are 901 characters and 903 bytes.
>
> In fact, Perl's "length" function reads the number of bytes. The best
> option, now, is to use a character set where 1 character is always 1 byte,
> for example ISO 8859-1.

While this is certainly part of the answer, we still don't know why the
record length is off. The way I see it, there are two possibilities:

1. Net::Z3950 is doing on-the-fly conversion of MARC-8 to Latin-1.

2. LC's Z39.50 server is emitting the records that way, and not updating the
record length.

I guess one way to test which one is true would be to query another Z39.50
server for the same record, and see if the same problem exists, in which case
1 is probably the case.

//Ed
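[A rough sketch of the test Ed describes, not part of his message: fetch the
record from LC and from a second Z39.50 target, and compare the leader length
with the number of bytes actually received from each. The second host, port,
and database name are hypothetical placeholders; the options and query are
taken from Eric's script, and error handling is simplified.]

  #!/usr/bin/perl -w
  # Sketch: ask two Z39.50 servers for the same record and compare, for each,
  # the length claimed in the leader with the number of bytes received.
  use strict;
  use Net::Z3950;

  my @targets = (
      [ 'z3950.loc.gov',     7090, 'voyager' ],  # LC, as in the original script
      [ 'z3950.example.org',  210, 'somedb'  ],  # hypothetical second target
  );

  my $query = '@attr 1=7 3118006';               # query from the original script

  for my $target (@targets) {
      my ($server, $port, $db) = @$target;
      my $manager = Net::Z3950::Manager->new(databaseName => $db);
      $manager->option(elementSetName => 'f');
      $manager->option(preferredRecordSyntax => Net::Z3950::RecordSyntax::USMARC);

      my $connection = $manager->connect($server, $port);
      unless ($connection) { warn "can't connect to $server\n"; next; }

      my $results = $connection->search($query);
      unless ($results and $results->size()) { warn "no hits on $server\n"; next; }

      my $raw = $results->record(1)->rawdata();
      print "$server: leader says ", substr($raw, 0, 5),
            ", received ", length($raw), " bytes\n";
  }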
Re: Net::Z3950 and diacritics
On Tue, Dec 16, 2003 at 03:52:56PM +0100, Tajoli Zeno <[EMAIL PROTECTED]> wrote:

> 2) In the MARC-8 character set a letter like "è" [e grave] is done with TWO
> bytes, one for the sign [the grave accent] and one for the letter [the
> letter e].
>
> 3) In the leader, positions 0-4, you have the number of characters, NOT the
> number of bytes. In your record there are 901 characters and 903 bytes.

No, it should be the number of bytes (LOC has clarified this in their spec by
saying "number of octets"). It has always been the length in bytes.

In the example it looks like the non-spacing diacritic has been converted to
two bytes (which sounds almost as if it was assumed to be Latin-1 and got
incorrectly MARC-8'd somewhere along the line). But the weird characters may
result from the screen display; you need to ascertain what the actual values
are there. (Dumping the record in hex may reveal something.)

Cheers
Colin

--
Colin Campbell
Technical Services Consultant
Sirsi Ltd
[EMAIL PROTECTED]
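[Along the lines of Colin's suggestion, a small self-contained hex dump
sketch, not from his message: it prints a saved record sixteen bytes per line,
hex values alongside a printable-ASCII rendering, so the exact bytes around
the diacritic are visible.]

  #!/usr/bin/perl -w
  # Sketch: hex dump of a saved MARC record, 16 bytes per line, so the exact
  # byte values around a diacritic can be inspected.
  use strict;

  my $file = shift or die "usage: $0 file.marc\n";
  open MARC, "< $file" or die "can't open $file: $!";
  binmode MARC;
  local $/;                  # slurp the whole file
  my $raw = <MARC>;
  close MARC;

  for (my $offset = 0; $offset < length $raw; $offset += 16) {
      my $chunk = substr($raw, $offset, 16);
      my $hex   = join ' ', map { sprintf '%02X', ord } split //, $chunk;
      (my $text = $chunk) =~ s/[^\x20-\x7E]/./g;    # non-printing bytes as '.'
      printf "%08X  %-47s  %s\n", $offset, $hex, $text;
  }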
RE: Net::Z3950 and diacritics
First, we probably want to figure out what character set the records are
encoded in as received from LOC. Since only the non-ASCII characters will
give us a clue, we can look at the umlauted-u ("ü") in Dürer.

  Charset     hex character(s) used to represent "ü"
  ---------   -------------------------------------------------------------
  MARC-8      0xE8 0x75       (combining umlaut/diaeresis preceding latin
                               small letter u)
  MARC-UCS*   0x75 0xCC 0x88  (latin small letter u followed by combining
                               umlaut/diaeresis -- in this case the combining
                               character is represented by two bytes)
  Latin-1     0xFC            (precomposed latin small letter u with
                               umlaut/diaeresis)

* MARC-UCS/Unicode is UTF-8 encoded, therefore U+0075 becomes 0x75 and U+0308
becomes 0xCC 0x88. The MARC-21 specification does not allow the use of the
precomposed Unicode character for an umlauted-u.

> aRussell, Francis,d1910-14aThe world of D^®urer,

Since you are getting the base character OK (latin small letter u), we should
probably assume a base-plus-combining character scheme, and since the
combining character(s) come *before* the base character, we can probably
assume MARC-8. If we could actually *verify* the hex encoding, we could go on
to what is happening to the records subsequent to the Z39.50 download... and
what to do with MARC-8, since it is not a character set used outside of
library-specific software applications. ;-)

BTW, the character set should also agree with the value in character position
9 in the leader of the MARC record:

  09 - Character coding scheme
       Identifies the character coding scheme used in the record.
       # - MARC-8 (the pound symbol "#" represents a blank in this case)
       a - UCS/Unicode

[from http://www.loc.gov/marc/bibliographic/ecbdldrd.html#mrcblea]

> From: Tajoli Zeno [mailto:[EMAIL PROTECTED]
> The best option, now, is to use a character set where 1 character is always
> 1 byte, for example ISO 8859-1.

Be aware that converting MARC-8 to Latin-1 has the potential for data loss,
since there are many more characters that can be represented in MARC-8 than
can be represented in Latin-1. The better bet is to convert to Unicode UTF-8
(or get the records in that character set to begin with, if that is an
option).

> > From: Tajoli Zeno [mailto:[EMAIL PROTECTED]
> > 3) In the leader, positions 0-4, you have the number of characters, NOT
> > the number of bytes.
>
> From: Colin Campbell [mailto:[EMAIL PROTECTED]
> No, it should be the number of bytes (LOC has clarified this in their spec
> by saying "number of octets"). It has always been the length in bytes.

From the MARC 21 specifications...

  UCS/Unicode Markers and the MARC 21 Record Leader

  In MARC 21 records, Leader character position 9 contains value a if the
  data is encoded using UCS/Unicode characters. If any UCS/Unicode characters
  are to be included in the MARC 21 record, the entire MARC record must be
  encoded using UCS/Unicode characters.

  The record length contained in Leader positions 0-4 is a count of the
  number of octets in the record, not characters. The Leader position 9 value
  is not dependent on the character encoding used. This rule applies to MARC
  21 records encoded using both the MARC-8 and UCS/Unicode character sets.
[from http://www.loc.gov/marc/specifications/speccharucs.html]

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-239-5368 cell
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/

> -----Original Message-----
> From: Eric Lease Morgan [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, December 16, 2003 7:57 AM
> To: Perl4Lib
> Subject: Net::Z3950 and diacritics
>
> [...]
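[A rough sketch of the two checks Michael describes, not part of his message:
report leader position 09, then grep the raw bytes for the u-umlaut patterns
from the table above as a quick-and-dirty guess at which character set a saved
record actually uses. The 0xFC test in particular is only a hint, since that
byte can occur in other contexts.]

  #!/usr/bin/perl -w
  # Sketch: guess the character set of a saved record by looking at leader
  # position 09 and at the byte patterns used for u-umlaut (see table above).
  use strict;

  my $file = shift or die "usage: $0 file.marc\n";
  open MARC, "< $file" or die "can't open $file: $!";
  binmode MARC;
  local $/;                  # slurp the whole file
  my $raw = <MARC>;
  close MARC;

  my $ldr09 = substr($raw, 9, 1);
  print "leader/09 is '$ldr09' (blank = MARC-8, 'a' = UCS/Unicode)\n";

  print "found MARC-8 style u-umlaut  (0xE8 0x75)\n"      if $raw =~ /\xE8\x75/;
  print "found UTF-8 style u-umlaut   (0x75 0xCC 0x88)\n" if $raw =~ /\x75\xCC\x88/;
  print "found Latin-1 style u-umlaut (0xFC)\n"           if $raw =~ /\xFC/;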
MARC::Record v1.34
The uploaded file MARC-Record-1.34.tar.gz has entered CPAN as

  file: $CPAN/authors/id/P/PE/PETDANCE/MARC-Record-1.34.tar.gz

The big change in this release is the ability to read from a pipe, or an
output stream. In our case (at Follett Library Resources), we have thousands
of MARC files that we've gzipped to save space (they save about 85-90%
gzipped), but we still need to be able to use the data on the fly in our new
TitleWise service. Rather than decompressing and then opening the file, Ed
Summers worked it out so that MARC::File::* can read from a pipe.

We were considering letting the marcdump and marclint programs read from
standard input, but you can do that with "-" as a filename anyway.

1.34    December 16th, 2003

    [ENHANCEMENTS]
    - Modified MARC::File::in() to allow passing in filehandles instead of a
      filename. Useful in situations where you might have data compressed on
      disk, and want to read from a decompression pipe. This affects
      MARC::Batch of course as well, which has had its documentation updated.
    - Added t/85.fh.t to test new filehandle passing.
    - Incorrect filetypes passed in to the MARC::Batch constructor now croak
      instead of die, so you can see where in your code it was called.

    [FIXES]
    - etc/specs modified to correctly parse LC's docs to get the 250 $b
      properly. Thanks Bryan Baldus at Quality Books.
    - New Lint.pm with 250 $b.
    - MARC::Field now leaves alphabetic indicators as they are instead of
      squashing to a space. Thanks Leif Andersson from Stockholms
      Universitet.
    - MARC::File::USMARC no longer checks the validity of indicators but
      leaves that up to MARC::Field (instead of having the check twice).
    - In MARC::Batch, the 'warn' elements weren't quoted.
    - warnings_on and strict_on should now be respected.

Have fun!
xoa

--
Andy Lester => [EMAIL PROTECTED] => www.petdance.com => AIM:petdance
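[A minimal sketch of the new filehandle support described above, assuming
MARC::Record/MARC::Batch 1.34 or later accepts a filehandle in place of a
filename as the changelog says, and that a gzip binary is on the path: read
gzipped USMARC straight through a decompression pipe and print each record's
title.]

  #!/usr/bin/perl -w
  # Sketch: read gzipped MARC data through a decompression pipe, without
  # unpacking the file to disk first, using the filehandle support in 1.34.
  use strict;
  use MARC::Batch;

  my $file = shift or die "usage: $0 file.marc.gz\n";
  open(my $fh, "gzip -dc $file |") or die "can't open pipe: $!";

  my $batch = MARC::Batch->new('USMARC', $fh);
  while (my $record = $batch->next()) {
      print $record->title(), "\n";
  }
  close $fh;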
Re: Net::Z3950 and diacritics
I don't see how you can get a result for your search if you're using @attr
1=7. 7 is the USE attribute for an ISBN search, and your term is the local
system number, I think (use attribute = 12).

When I do that search (@attr 1=12 3118006) against the LC bib file, using
Net::Z3950 in a program essentially the same as yours, the USMARC record
returned is 860 bytes long. Here's a formatted dump of the record:

  LDR  00860nam  22002531  4500
  001     3118006
  005     19740417000000.0
  008     731207s1967    nyuabf   b    000 0beng
  035     |9(DLC)   67029856
  906     |a7|bcbc|corignew|du|eocip|f19|gy-gencatlg
  010     |a   67029856
  040     |aDLC|cDLC|dDLC
  050 00  |aND588.D9|bR85
  082 00  |a759.3
  100 1   |aRussell, Francis,|d1910-
  245 14  |aThe world of D?urer, 1471-1528,|cby Francis Russell and the
          editors of Time-Life Books.
  260     |aNew York,|bTime, inc.|c[1967]
  300     |a183 p.|billus., maps, col. plates.|c32 cm.
  490 0   |aTime-Life library of art
  504     |aBibliography: p. 177.
  600 10  |aD?urer, Albrecht,|d1471-1528.
  710 2   |aTime-Life Books.
  991     |bc-GenColl|hND588.D9|iR85|tCopy 1|wBOOKS
  991     |bc-GenColl|hND588.D9|iR85|p00034015107|tCopy 2|wCCF

The "?" in the 245 and 600 fields are 0xE8, the MARC-8 code for combining
umlaut/diaeresis.

It's puzzling that the record you got has a different length--not sure what's
going on there.
Tim Prettyman
University of Michigan Library

> # create a LOC (Voyager) 001 query
> my $query = '@attr 1=7 3118006';
>
> [...]
>
> Notice how Dürer got munged into D®urer, twice, and consequently the
> record length is not 901 but 903 instead. Some people say I must be sure
> to request a specific character set from the LOC when downloading my MARC
> records, specifically MARC-8 or MARC-UCS. Which one of these character
> sets do I want and how do I tell the remote database which one I want?
>
> --
> Eric "The Ugly American Who Doesn't Understand Diacritics" Morgan
> University Libraries of Notre Dame
> (574) 631-8604
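[A small sketch of the change Tim suggests, applied to the query portion of
Eric's script: Bib-1 use attribute 12 (local/system number) instead of 7
(ISBN). The surrounding $connection object is assumed to be set up as in the
original script, and the size() check is just a sanity test before pulling
the first record.]

  # create a LOC (Voyager) 001 query -- Bib-1 use attribute 12 (local/system
  # number) rather than 7 (ISBN), as suggested above
  my $query = '@attr 1=12 3118006';

  # search, and make sure something actually came back
  my $results = $connection->search($query);
  die "no hits for $query\n" unless $results && $results->size();

  # get the found record
  my $record = $results->record(1);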
Re: Net::Z3950 and diacritics
(I'm sending this again, because I think my formatted record may have gotten
messed up in the process of being cut/pasted. My apologies.)

I don't see how you can get a result for your search if you're using @attr
1=7. 7 is the USE attribute for an ISBN search, and your term is the local
system number, I think (use attribute = 12).

When I do that search (@attr 1=12 3118006) against the LC bib file, using
Net::Z3950 in a program essentially the same as yours, the USMARC record
returned is 860 bytes long. Here's a formatted dump of the record:

  LDR  00860nam  22002531  4500
  001     3118006
  005     19740417000000.0
  008     731207s1967    nyuabf   b    000 0beng
  035     |9(DLC)   67029856
  906     |a7|bcbc|corignew|du|eocip|f19|gy-gencatlg
  010     |a   67029856
  040     |aDLC|cDLC|dDLC
  050 00  |aND588.D9|bR85
  082 00  |a759.3
  100 1   |aRussell, Francis,|d1910-
  245 14  |aThe world of D?urer, 1471-1528,|cby Francis Russell and the
          editors of Time-Life Books.
  260     |aNew York,|bTime, inc.|c[1967]
  300     |a183 p.|billus., maps, col. plates.|c32 cm.
  490 0   |aTime-Life library of art
  504     |aBibliography: p. 177.
  600 10  |aD?urer, Albrecht,|d1471-1528.
  710 2   |aTime-Life Books.
  991     |bc-GenColl|hND588.D9|iR85|tCopy 1|wBOOKS
  991     |bc-GenColl|hND588.D9|iR85|p00034015107|tCopy 2|wCCF

The "?" in the 245 and 600 fields are 0xE8, the MARC-8 code for combining
umlaut/diaeresis.

It's puzzling that the record you got has a different length--not sure what's
going on there.

Tim Prettyman
University of Michigan Library

> [...]
>
> Some people say I must be sure to request a specific character set from
> the LOC when downloading my MARC records, specifically MARC-8 or MARC-UCS.
> Which one of these character sets do I want and how do I tell the remote
> database which one I want?
>
> --
> Eric "The Ugly American Who Doesn't Understand Diacritics" Morgan
> University Libraries of Notre Dame
> (574) 631-8604