RE: Opening & writing to UTF-8 files; copyright symbol again -- solution

PHILLIPS M.E. Mon, 16 Nov 2015 03:02:13 -0800

> However, combining Jon Gorman's recommendation with some Googling, I get:
> 
> my $outfile='4788022.edited.bib';
> open (my $output_marc, '>', $outfile) or die "Couldn't open file $!" ;
> binmode($output_marc, ':utf8');
> 
> The open statement may not be quite correct, as I am not familiar with the
> more current techniques for opening file handles that John mentioned.
> However, when I use those instructions to open the output file rather than
> what I had before, the copyright symbol does indeed come across as C2 A9 as it
> was in the original record. I didn't want to use the utf8, because I've tried
> that before and ended up with double-encoding (and a real mess). But I'll
> continue testing.


I think I understand how your original problem came about, but I may not be 
able to explain it!  It is important to understand that inside Perl a string 
can be encoded in one of two ways:

1) stored in UTF-8, in which case all ASCII-range characters (roughly space, 
A-Z, a-z, 0-9 and most of the punctuation you see on a keyboard) will be stored 
in a single byte per character, and other characters will be stored in 2, 3, or 
4 bytes

2) stored in an eight-bit character set such as ISO Latin 1. In this situation 
all characters are stored as a single byte, but characters outside the Western 
European repertoire will be unavailable.

Perl tries to store strings in the second form by preference, as it saves 
memory and processing time, but it does this in a way which is transparent to 
the user, so if you have the string "abc" it will be in the second form.  If 
you append a copyright symbol it will still be in the second form as that 
symbol is present in ISO Latin 1, but if you append a w-circumflex (as used in 
Welsh, and not available in ISO Latin 1) or any Chinese, Greek, Cyrillic 
character, then the string will be re-encoded in UTF-8 and Perl will flag it to 
remember that is how it has been stored.  You as a user do not (generally) need 
to worry.
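
A minimal sketch of the behaviour described above, using the core 
utf8::is_utf8() function to peek at the internal flag.  The helper name 
stored_as_utf8 is my own invention for illustration, and you would not normally 
need to inspect the flag in real code:

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if Perl currently holds the string in its
# internal UTF-8 form, 0 if it is held one byte per character.
sub stored_as_utf8 {
    my ($string) = @_;
    return utf8::is_utf8($string) ? 1 : 0;
}

my $s = "abc";
printf "abc:                 flag=%d length=%d\n", stored_as_utf8($s), length($s);

$s .= "\x{A9}";     # copyright symbol, present in ISO Latin 1
printf "plus copyright:      flag=%d length=%d\n", stored_as_utf8($s), length($s);

$s .= "\x{175}";    # w-circumflex, not in ISO Latin 1
printf "plus w-circumflex:   flag=%d length=%d\n", stored_as_utf8($s), length($s);
```

The flag only changes on the last step, while length() counts characters the 
same way throughout.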

The complication is what to do when reading stuff from files or writing them 
out again, because then Perl has to decide how to represent stuff for the 
outside world.  To be successful, you have to tell Perl what encoding is used 
for anything you are reading in, so that it can be stored appropriately.  If 
you read in a copyright symbol from a UTF-8 encoded file but fail to tell Perl 
it was in UTF-8, Perl will think it is two characters, C2 followed by A9.  Now 
A9 happens to be the copyright symbol in ISO Latin 1, but C2 is A-circumflex.  
If you write it out again, Perl will operate in ISO Latin 1 unless instructed 
otherwise, and you will get C2 A9 in the file, which is probably fine.  But 
Perl did not know that it was meant to be a single character, so any processing 
you might have done, such as regular expression matches or finding the length 
of the string, would not have worked as expected.
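
To illustrate the reading side, here is a small sketch that reads the same 
UTF-8 bytes twice from an in-memory "file", once without telling Perl the 
encoding and once with it.  The helper read_with_mode and the sample text are 
made up for the example; with a real file you would pass a file name instead of 
a scalar reference:

```perl
use strict;
use warnings;

# Hypothetical helper: read the first line of an in-memory file
# using the given open mode.
sub read_with_mode {
    my ($mode, $bytes) = @_;
    open(my $fh, $mode, \$bytes) or die "Couldn't open in-memory file: $!";
    my $text = <$fh>;
    close $fh;
    return $text;
}

my $file_bytes = "\xC2\xA9 2015";    # copyright symbol as UTF-8, then " 2015"

my $undecoded = read_with_mode('<', $file_bytes);
my $decoded   = read_with_mode('<:encoding(UTF-8)', $file_bytes);

# Without a layer Perl sees C2 and A9 as two separate characters.
printf "no layer:    length=%d first=U+%04X\n",
    length($undecoded), ord($undecoded);   # length=7 first=U+00C2

# With the layer the two bytes become one character.
printf "with layer:  length=%d first=U+%04X\n",
    length($decoded), ord($decoded);       # length=6 first=U+00A9
```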

In your case, if the input was MARC records encoded in UTF-8, the Perl MARC 
modules will have picked this up and will correctly flag all the data as UTF-8. 
But Perl is then at liberty to store it in memory as ISO Latin 1 to save space. 
When you use the as_usmarc() function the MARC::File::USMARC module will 
build a single string containing the whole record, but as far as I can tell 
from the source code, it does not do anything special about the character set. 
If the record had UTF-8 encoding when read in, the as_usmarc() value will be 
flagged as being in UTF-8.  If you have not specified UTF-8 during the open 
command or via binmode, then when writing the string to the file it would be 
converted to your local 8-bit encoding (e.g. ISO-Latin-1).  This would result 
in a record which is a bit of a mess, to say the least, because the LDR will 
indicate Unicode and the content may not be.  You might also get the warning 
"wide character in print" if any characters outside ISO Latin 1 were included, 
but a copyright symbol would silently be converted to the wrong representation.
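
The effect on output can be sketched without the MARC modules at all.  The 
example below writes a copyright symbol to an in-memory buffer, first with no 
layer and then with the UTF-8 layer, and shows the bytes produced.  The helper 
write_with_mode is hypothetical, and I am assuming a plain Perl with no default 
layers set via the "open" pragma or PERL_UNICODE:

```perl
use strict;
use warnings;

# Hypothetical helper: write $text to an in-memory file using the given
# open mode and return the resulting bytes as hex, e.g. "C2 A9".
sub write_with_mode {
    my ($mode, $text) = @_;
    my $buffer = '';
    open(my $fh, $mode, \$buffer) or die "Couldn't open in-memory file: $!";
    print {$fh} $text;
    close $fh;
    return join ' ', map { sprintf('%02X', ord($_)) } split //, $buffer;
}

my $copyright = "\x{A9}";

# The same character, but forced into Perl's internal UTF-8 form.
my $upgraded = $copyright;
utf8::upgrade($upgraded);

print "no layer:           ", write_with_mode('>', $copyright), "\n";  # A9
print "no layer, upgraded: ", write_with_mode('>', $upgraded), "\n";   # A9
print "UTF-8 layer:        ",
    write_with_mode('>:encoding(UTF-8)', $copyright), "\n";            # C2 A9
```

Only the last form gives the bytes a record flagged as Unicode should contain.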

Any record in MARC8, however, will be read in as such and will not be mucked 
about with by Perl: it will assume it is all in the local 8-bit encoding, and 
to output it successfully you should avoid opening the output file with UTF-8 
encoding.

In summary:

1. If reading UTF-8 encoded records via the MARC modules, make sure any file 
you write is opened with '>:encoding(UTF-8)'

2. If handling records encoded in MARC8, use '>:raw' when outputting.

3. Do not use '>:raw' with UTF-8 encoded records as any characters in the range 
U+0080 to U+00FF are at risk of being mangled because Perl's internal encoding 
of the string may not be what you expect, being dependent on whether characters 
from U+0100 upwards are included.
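
As a rough sketch of points 1 and 2, the open mode could be chosen from the 
record leader, where position 09 is 'a' for Unicode records and blank for 
MARC-8.  The helper output_mode_for_leader and the sample leaders are invented 
for illustration, and this assumes every record going to the file shares one 
encoding:

```perl
use strict;
use warnings;

# Hypothetical helper: pick an open mode from a 24-character MARC leader.
# Leader position 09 is 'a' for UCS/Unicode and blank for MARC-8.
sub output_mode_for_leader {
    my ($leader) = @_;
    return substr($leader, 9, 1) eq 'a' ? '>:encoding(UTF-8)' : '>:raw';
}

# Made-up leaders that differ only in position 09.
my $unicode_leader = '00000nam a2200000 a 4500';
my $marc8_leader   = '00000nam  2200000 a 4500';

print "Unicode record: ", output_mode_for_leader($unicode_leader), "\n";
print "MARC-8 record:  ", output_mode_for_leader($marc8_leader), "\n";
```

The returned string can then be passed as the second argument of a 
three-argument open, in place of the plain '>' in the quoted code.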

It *is* possible to read and write records in a mixture of encodings, but you 
will need to keep your head!!  If you are modifying records you need to ensure 
any additional text you introduce is supplied in the appropriate encoding as 
the MARC modules are not clever enough to handle automatically creating the 
Field objects in an encoding that matches the Record's encoding.

It might be argued that all MARC files should be read and written with ":raw", 
in which case the as_usmarc function would need to be modified so that if the 
record has Unicode encoding it gets converted into a byte-stream of UTF-8 
before being returned.  But there are so many other design decisions that could 
be questioned, around whether it is more helpful to be like Perl and make the 
actual encoding transparent, or whether users should be made to grapple with 
the MARC-8 versus UTF-8 issues for their own good!  As they stand, the MARC 
modules allow you to do what you need to do, without hiding the complications.
 
Matthew
