Re: Character sets - kind of solved

2004-12-08 Thread Ed Summers
On Tue, Dec 07, 2004 at 12:53:44PM -0600, John Hammer wrote:
> Attached are the two files. The Marc file seems to be using a Windows font 
> (1251?). As for the program, the same changes occur if I just read the Marc 
> file and write it back out with no changes. The Perl I am using is 5.8.3

OK, I've confirmed that simply reading this record in and writing it out
yields a different file. The Unix diff program confirms this but
does not isolate the difference, since MARC records are not multiline
documents.
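
Because a MARC record is effectively one long line, a byte-level comparison
is what is really needed. As a rough illustration only (Python rather than
Perl, and the helper name first_difference is hypothetical), the offset where
two byte strings diverge could be found like this:

```python
def first_difference(a: bytes, b: bytes):
    """Return the offset of the first differing byte, or None if identical."""
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return offset
    # One input is a prefix of the other: they diverge where the shorter ends.
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


# Made-up sample data: the same record tail with and without trailing newlines.
original = b"s \x1e\x1d\n\n"
processed = b"s \x1e\x1d"
print(first_difference(original, processed))
```

For real files the two arguments would be the contents of original.dat and
processed.dat read in binary mode.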

Using diff with hexdump provides some more concrete data. First hexdump the
original file and the processed file like so:

% hexdump -C original.dat > original.dump
% hexdump -C processed.dat > processed.dump

Then compare these two files with diff:

% diff original.dump processed.dump

You should see this:

148,149c148,149
< 0930  73 20 1e 1d 0a 0a |s ....|
< 0936
---
> 0930  73 20 1e 1d   |s ..|
> 0934

What this shows is that the original file has two trailing 0a bytes at
the end of the record, and that the processed file does not. This makes
sense because MARC::Record was adjusted back in v1.24 (Apr 2003) to
remove certain illegal characters between records that some library
systems place there. See line 58 in MARC::File::USMARC in the latest
version of the MARC-Record distribution if you are curious :-)
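
To make the effect concrete, here is a minimal sketch of that kind of
cleanup. It is written in Python for illustration and is not the code in
MARC::File::USMARC; the function name split_records is hypothetical, and it
assumes the stray bytes are only carriage returns and line feeds (the set
the module actually removes may differ):

```python
RECORD_TERMINATOR = b"\x1d"  # MARC end-of-record byte


def split_records(data: bytes):
    """Split a byte stream into MARC records, dropping stray CR/LF bytes
    that sit between one record terminator and the next record."""
    records = []
    for chunk in data.split(RECORD_TERMINATOR):
        chunk = chunk.lstrip(b"\r\n")  # assumption: only CR/LF are stripped
        if chunk:
            records.append(chunk + RECORD_TERMINATOR)
    return records


# Made-up input: two tiny "records", each followed by two newline bytes.
raw = b"rec one\x1e\x1d\n\nrec two\x1e\x1d\n\n"
cleaned = b"".join(split_records(raw))
print(len(raw) - len(cleaned))  # number of stray bytes dropped
```

Writing the cleaned records back out gives a file that is shorter than the
input by exactly the number of stray bytes, which is the kind of difference
the hexdump above shows.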

So unless you are unable to reproduce this, I think this mystery is solved.

//Ed


Re: Character sets - kind of solved

2004-12-08 Thread John Hammer
That's different from what I get. What I get is:

1c1
<   30 32 33 35 36 63 61 6d  20 20 32 32 30 30 34 38  |02356cam  220048|
---
>   30 32 33 36 34 63 61 6d  20 20 32 32 30 30 34 38  |02364cam  220048|
21,30c21,30

105,149c105,149
< 0680  20 1f 61 42 69 73 e5 61  f2 74 e5 69 2c 20 4d 75  | .aBis.a.t.i, Mu|
< 0690  f2 68 61 6d 6d 61 64 2e  1f 74 43 6f 6e 76 65 72  |.hammad..tConver|
< ... not shown>
< 0930  73 20 1e 1d 0a 0a |s ....|
< 0936
---
> 0680  20 1f 61 42 69 73 ef bf  bd 61 ef bf bd 74 ef bf  | .aBis...a...t..|
> 0690  bd 69 2c 20 4d 75 ef bf  bd 68 61 6d 6d 61 64 2e  |.i, Mu...hammad.|
< ... not shown>
> 0930  69 61 20 47 61 6c 65 27  73 20 1e 1d  |ia Gale's ..|
> 093c

How would deleting the illegal characters cause changes to the characters in 
lines 680 and 690 above?

John



Re: Character sets - kind of solved

2004-12-08 Thread Ed Summers
On Wed, Dec 08, 2004 at 03:31:18PM -0600, John Hammer wrote:
> How would deleting the illegal characters cause changes to the characters in 
> lines 680 and 690 above?

It doesn't explain it :) What version of MARC::Record are you using? What
happens when you use perl to read in the data and write it out, without
MARC::Record in the mix?

//Ed


Re: Character sets - kind of solved

2004-12-08 Thread John Hammer
MARC::Record version 1.39_01. Using diff there is no difference in the files 
when using Perl to read in and write out the data.

John



Re: Character sets - kind of solved

2004-12-08 Thread Ed Summers
On Wed, Dec 08, 2004 at 05:47:23PM -0600, John Hammer wrote:
> MARC::Record version 1.39_01. Using diff there is no difference in the 
> files when using Perl to read in and write out the data.

Can you try downgrading to v1.38? v1.39_01 has some experimental utf8
handling code in it which was released as a beta to CPAN.

//Ed
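
For reference, the byte sequence ef bf bd in the processed dump is the UTF-8
encoding of U+FFFD, the Unicode replacement character, which decoders
commonly substitute for bytes that are not valid UTF-8. The sketch below
(Python, for illustration only; it is not MARC::Record's code, and whether
v1.39_01 takes exactly this path is an assumption) reproduces the pattern
using the bytes from offsets 0680 and 0690 of the original dump:

```python
# Bytes from the original record, starting at "Bis" (taken from the dump above).
original = bytes.fromhex(
    "42 69 73 e5 61 f2 74 e5 69 2c 20 4d 75 f2 68 61 6d 6d 61 64"
)

# Treat the data as UTF-8, substituting U+FFFD for anything undecodable,
# then write it back out as UTF-8.
decoded = original.decode("utf-8", errors="replace")
processed = decoded.encode("utf-8")

print(processed.hex())
print(len(processed) - len(original))  # growth in bytes
```

Each single byte that is not valid UTF-8 (e5 or f2 here) becomes the three
bytes ef bf bd, a growth of two bytes per character. Four such bytes are
visible in the dump, so the record grows by eight bytes, which is consistent
with the record length in the leader changing from 02356 to 02364.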