> However, combining Jon Gorman's recommendation with some Googling, I get:
>
>     my $outfile = '4788022.edited.bib';
>     open (my $output_marc, '>', $outfile) or die "Couldn't open file $!";
>     binmode($output_marc, ':utf8');
>
> The open statement may not be quite correct, as I am not familiar with the
> more current techniques for opening file handles that Jon mentioned.
> However, when I use those instructions to open the output file rather than
> what I had before, the copyright symbol does indeed come across as C2 A9 as
> it was in the original record. I didn't want to use the utf8 layer, because
> I've tried that before and ended up with double-encoding (and a real mess).
> But I'll continue testing.
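As an aside, the open and binmode pair quoted above can be collapsed into a single three-argument open with the encoding layer given in the mode; a minimal sketch, reusing the quoted file name:

```perl
use strict;
use warnings;

# The encoding layer is supplied directly in the three-argument open,
# so no separate binmode call is needed. ':encoding(UTF-8)' is the
# strict layer; ':utf8' merely marks the handle as UTF-8 without
# validating the bytes, which is one route to double-encoding messes.
my $outfile = '4788022.edited.bib';
open my $output_marc, '>:encoding(UTF-8)', $outfile
    or die "Couldn't open file '$outfile': $!";
```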
I think I understand how your original problem came about, but I may not be able to explain it! It is important to understand that inside Perl a string can be encoded in one of two ways:

1) stored in UTF-8, in which case all ASCII-range characters (roughly space, A-Z, a-z, 0-9 and most of the punctuation you see on a keyboard) will be stored in a single byte per character, and other characters will be stored in 2, 3, or 4 bytes;

2) stored in an eight-bit character set such as ISO Latin 1. In this situation all characters are stored as a single byte, but non-western-European characters will be unavailable.

Perl tries to store strings in the second form by preference, as it saves memory and processing time, but it does this in a way which is transparent to the user. So if you have the string "abc", it will be in the second form. If you append a copyright symbol, it will still be in the second form, as that symbol is present in ISO Latin 1; but if you append a w-circumflex (as used in Welsh, and not available in ISO Latin 1), or any Chinese, Greek, or Cyrillic character, then the string will be re-encoded in UTF-8 and Perl will flag it to remember that is how it has been stored. You as a user do not (generally) need to worry.

The complication is what to do when reading stuff from files or writing it out again, because then Perl has to decide how to represent stuff for the outside world. To be successful, you have to tell Perl what encoding is used for anything you are reading in, so that it can be stored appropriately. If you read in a copyright symbol from a UTF-8 encoded file but fail to tell Perl it was in UTF-8, Perl will think it is character C2 followed by A9. Now A9 happens to be the copyright symbol in ISO Latin 1, but C2 is A-circumflex.
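The mis-read described in the last paragraph can be seen directly with the core Encode module; a small self-contained sketch:

```perl
use strict;
use warnings;
use Encode qw(decode);

# C2 A9 is the UTF-8 byte sequence for the copyright symbol, U+00A9.
my $bytes = "\xC2\xA9";

# Without decoding, Perl treats each byte as a character: two of them,
# which in ISO Latin 1 terms are A-circumflex (C2) and copyright (A9).
print length($bytes), "\n";                 # 2

# After telling Perl the bytes were UTF-8, it is a single character.
my $decoded = decode('UTF-8', $bytes);
print length($decoded), "\n";               # 1
printf "U+%04X\n", ord($decoded);           # U+00A9
```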
If you write it out again, Perl will operate in ISO Latin 1 unless instructed otherwise, and you will get C2 A9 in the file, which is probably fine; but because Perl did not know it was meant to be a single character, any processing you might have done, such as regular expression matches or finding the length of the string, would not have worked as expected.

In your case, if the input was MARC records encoded in UTF-8, the Perl MARC modules will have picked this up and will correctly flag all the data as UTF-8. But Perl is then at liberty to store it in memory as ISO Latin 1 to save space. When you use the as_usmarc() function, the MARC::File::USMARC module builds a single string containing the whole record, but as far as I can tell from the source code, it does nothing special about the character set. If the record had UTF-8 encoding when read in, the as_usmarc() value will be flagged as being in UTF-8. If you have not specified UTF-8 during the open call or via binmode, then when writing the string to the file it will be converted to your local 8-bit encoding (e.g. ISO Latin 1). This results in a record which is a bit of a mess, to say the least, because the LDR will indicate Unicode while the content may not be. You might also get the warning "Wide character in print" if any characters outside ISO Latin 1 were included, but a copyright symbol would silently be converted to the wrong representation.

Any record in MARC-8, however, will be read in as such and will not be mucked about with by Perl: it assumes the data is all in the local 8-bit encoding, and to output it successfully you should avoid opening the output file with UTF-8 encoding.

In summary:

1. If reading UTF-8 encoded records via the MARC modules, make sure any file you write is opened with '>:encoding(UTF-8)'.

2. If handling records encoded in MARC-8, use '>:raw' when outputting.

3. Do not use '>:raw' with UTF-8 encoded records: any characters in the range U+0080 to U+00FF are at risk of being mangled, because Perl's internal encoding of the string may not be what you expect, depending on whether characters from U+0100 upwards are included.

It *is* possible to read and write records in a mixture of encodings, but you will need to keep your head! If you are modifying records, you need to ensure any additional text you introduce is supplied in the appropriate encoding, as the MARC modules are not clever enough to create Field objects in an encoding that automatically matches the Record's encoding.

It might be argued that all MARC files should be read and written with ':raw', in which case the as_usmarc() function would need to be modified so that a record with Unicode encoding is converted into a byte stream of UTF-8 before being returned. But there are so many other design decisions that could be questioned, around whether it is more helpful to be like Perl and make the actual encoding transparent, or whether users should be made to grapple with the MARC-8 versus UTF-8 issues for their own good! As they stand, the MARC modules allow you to do what you need to do, without hiding the complications.

Matthew
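To make points 1 and 3 of the summary concrete, here is a minimal sketch (not using the MARC modules, and with a hypothetical file name) showing that a decoded copyright symbol written through the ':encoding(UTF-8)' layer lands on disk as the two bytes C2 A9:

```perl
use strict;
use warnings;

# Hypothetical file name, for illustration only.
my $file = 'demo-utf8.bin';

# One decoded character: U+00A9, the copyright symbol.
my $copyright = "\x{A9}";

# Summary point 1: write through the strict UTF-8 layer.
open my $out, '>:encoding(UTF-8)', $file or die "open: $!";
print {$out} $copyright;
close $out;

# Read the raw bytes back to see what actually hit the disk.
open my $in, '<:raw', $file or die "open: $!";
my $bytes = do { local $/; <$in> };
close $in;
unlink $file;

print join(' ', map { sprintf '%02X', ord } split //, $bytes), "\n";
# Prints: C2 A9
```

Had the file been opened with '>:raw' instead (summary point 3), the same print would have emitted the single byte A9, since this string happens to be stored internally in the eight-bit form, producing exactly the LDR-says-Unicode-but-content-is-not mess described above.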