Re: Character sets - kind of solved
On Tue, Dec 07, 2004 at 12:53:44PM -0600, John Hammer wrote:
> Attached are the two files. The Marc file seems to be using a Windows font
> (1251?). As for the program, the same changes occur if I just read the Marc
> file and write it back out with no changes. The Perl I am using is 5.8.3

Ok, I've confirmed that simply reading this record in and writing it out
will yield a different file. The unix diff program confirms this, but
does not isolate the difference, since MARC records are not multiline
documents.

Using diff with hexdump provides some more concrete data. First hexdump the
original file and the processed file like so:

  % hexdump -C original.dat > original.dump
  % hexdump -C processed.dat > processed.dump

Then compare these two files with diff:

  % diff original.dump processed.dump

You should see this:

  148,149c148,149
  < 0930 73 20 1e 1d 0a 0a |s |
  < 0936
  ---
  > 0930 73 20 1e 1d |s ..|
  > 0934

What this shows is that the original file has two trailing 0a bytes at
the end of the record, and that the processed file does not. This makes
sense because MARC::Record was adjusted back in v1.24 (Apr 2003) to
remove certain illegal characters between records that some library
systems place there. See line 58 in MARC::File::USMARC in the latest
version of the MARC-Record distribution if you are curious :-)

So unless you are unable to reproduce this I think this mystery is solved.

//Ed
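The same comparison can be made on the raw bytes without going through hexdump and diff. The following is a minimal sketch in Python, not something from the thread: the helper name first_difference is made up for illustration, and the two byte strings are only the record tails copied from the dump above, not the full files.

```python
def first_difference(a: bytes, b: bytes):
    """Return the offset of the first differing byte, or None if identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # One input is a prefix of the other: they differ where the shorter ends.
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


# Tail of the record as shown in the dump: "s ", field terminator 0x1e,
# record terminator 0x1d, then two newline bytes in the original only.
original_tail = bytes.fromhex("73 20 1e 1d 0a 0a")
processed_tail = bytes.fromhex("73 20 1e 1d")

offset = first_difference(original_tail, processed_tail)
extra = original_tail[offset:]
print(offset, extra)
```

Applied to the two complete files read in binary mode, the same function would report the first offset at which they diverge, which is the information diff alone could not isolate.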
Re: Character sets - kind of solved
That's different from what I get. What I get is:

  1c1
  < 30 32 33 35 36 63 61 6d 20 20 32 32 30 30 34 38 |02356cam 220048|
  ---
  > 30 32 33 36 34 63 61 6d 20 20 32 32 30 30 34 38 |02364cam 220048|
  21,30c21,30
  105,149c105,149
  < 0680 20 1f 61 42 69 73 e5 61 f2 74 e5 69 2c 20 4d 75 | .aBis_, Mu|
  < 0690 f2 68 61 6d 6d 61 64 2e 1f 74 43 6f 6e 76 65 72 |___ammad..tConver|
  < ... not shown>
  < 0930 73 20 1e 1d 0a 0a |s |
  < 0936
  ---
  > 0680 20 1f 61 42 69 73 ef bf bd 61 ef bf bd 74 ef bf | .aBis___a___t___
  > 0690 bd 69 2c 20 4d 75 ef bf bd 68 61 6d 6d 61 64 2e |i, Mu___hammad.|
  < ... not shown>
  > 0930 69 61 20 47 61 6c 65 27 73 20 1e 1d |ia Gale's ..|
  > 093c

How would deleting the illegal characters cause changes to the characters in
lines 680 and 690 above?

John

On Wed, 8 Dec 2004 10:23:38 -0600 Ed Summers wrote:
> On Tue, Dec 07, 2004 at 12:53:44PM -0600, John Hammer wrote:
> > Attached are the two files. The Marc file seems to be using a Windows font
> > (1251?). As for the program, the same changes occur if I just read the Marc
> > file and write it back out with no changes. The Perl I am using is 5.8.3
>
> Ok, I've confirmed that simply reading this record in and writing it out
> will yield a different file. The unix diff program confirms this, but
> does not isolate the difference, since MARC records are not multiline
> documents.
>
> Using diff with hexdump provides some more concrete data. First hexdump the
> original file and the processed file like so:
>
>   % hexdump -C original.dat > original.dump
>   % hexdump -C processed.dat > processed.dump
>
> Then compare these two files with diff:
>
>   % diff original.dump processed.dump
>
> You should see this:
>
>   148,149c148,149
>   < 0930 73 20 1e 1d 0a 0a |s |
>   < 0936
>   ---
>   > 0930 73 20 1e 1d |s ..|
>   > 0934
>
> What this shows is that the original file has two trailing 0a bytes at
> the end of the record, and that the processed file does not. This makes
> sense because MARC::Record was adjusted back in v1.24 (Apr 2003) to
> remove certain illegal characters between records that some library
> systems place there. See line 58 in MARC::File::USMARC in the latest
> version of the MARC-Record distribution if you are curious :-)
>
> So unless you are unable to reproduce this I think this mystery is solved.
>
> //Ed
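The byte pattern in John's dump (each single e5 or f2 turning into ef bf bd) is what a lenient UTF-8 decode followed by a re-encode produces: ef bf bd is the UTF-8 encoding of U+FFFD, the replacement character. The sketch below reproduces that effect in Python on the bytes copied from offsets 0683-068b of the original dump. It is an illustration only, written on the assumption that something in the read/write path treated the data as UTF-8; it is not the code MARC::Record runs, and the function name roundtrip_as_utf8 is made up here.

```python
def roundtrip_as_utf8(raw: bytes) -> bytes:
    """Decode as UTF-8, replacing invalid bytes with U+FFFD, then re-encode."""
    return raw.decode("utf-8", errors="replace").encode("utf-8")


# "Bis", e5, "a", f2, "t", e5, "i": bytes 0683-068b of the original dump.
original = bytes.fromhex("42 69 73 e5 61 f2 74 e5 69")
processed = roundtrip_as_utf8(original)

print(original.hex(" "))
print(processed.hex(" "))

# Each lone high byte becomes the three bytes ef bf bd: a net gain of 2 bytes.
replaced = processed.count("\ufffd".encode("utf-8"))
growth = len(processed) - len(original)
print(replaced, growth)

# Four such bytes are visible in the dump (the three above plus the f2 at 0690).
# 4 * 2 = 8, the same amount by which the record length in the leader grows
# (02356 -> 02364 in the first hunk of the diff).
leader_before, leader_after = 2356, 2364
print(leader_after - leader_before)
```

The arithmetic is consistent with the rest of the dump as well: the original file ends at offset 0936 (2356 bytes plus the two trailing 0a bytes) and the processed file ends at 093c (2364 bytes). Parts of the diff are not shown in the message, so this is a plausibility check rather than a proof that only four bytes were affected.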
Re: Character sets - kind of solved
On Wed, Dec 08, 2004 at 03:31:18PM -0600, John Hammer wrote:
> How would deleting the illegal characters cause changes to the characters in
> lines 680 and 690 above?

It doesn't explain it :) What version of MARC::Record are you using? What
happens when you use perl to read in the data and write it out, without
MARC::Record in the mix?

//Ed
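The control experiment suggested here is a byte-for-byte copy with no MARC parsing and no character decoding in between. The thread is about doing this in Perl; the sketch below shows the same idea in Python purely as an illustration. The file names, the sample bytes, and the helper copy_bytes are all hypothetical.

```python
import os
import tempfile


def copy_bytes(src_path: str, dst_path: str) -> None:
    """Copy a file in binary mode so that no decoding or encoding takes place."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        dst.write(src.read())


# Sample data with high bytes, MARC terminators (1e, 1d) and trailing newlines.
sample = bytes.fromhex("42 69 73 e5 61 20 1e 1d 0a 0a")

with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "original.dat")
    dst_path = os.path.join(tmp, "copy.dat")
    with open(src_path, "wb") as f:
        f.write(sample)
    copy_bytes(src_path, dst_path)
    with open(dst_path, "rb") as f:
        copied = f.read()

print(copied == sample)
```

If a raw copy like this leaves the file unchanged while the MARC::Record round trip does not, the change is being introduced by the library's handling of the data rather than by plain file I/O.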
Re: Character sets - kind of solved
MARC::Record version 1.39_01. Using diff there is no difference in the
files when using Perl to read in and write out the data.

John

On Wed, 8 Dec 2004 15:43:29 -0600 Ed Summers wrote:
> On Wed, Dec 08, 2004 at 03:31:18PM -0600, John Hammer wrote:
> > How would deleting the illegal characters cause changes to the characters
> > in lines 680 and 690 above?
>
> It doesn't explain it :) What version of MARC::Record are you using? What
> happens when you use perl to read in the data and write it out, without
> MARC::Record in the mix?
>
> //Ed
Re: Character sets - kind of solved
On Wed, Dec 08, 2004 at 05:47:23PM -0600, John Hammer wrote:
> MARC::Record version 1.39_01. Using diff there is no difference in the
> files when using Perl to read in and write out the data.

Can you try downgrading to v1.38? v1.39_01 has some experimental utf8
handling code in it which was released as a beta to CPAN.

//Ed