>>Hi,

>>Thanks for the answers on my last question. I have since then dug a bit 
>>further into the UTF-8-related error message I got, and after some reading 
>>have a few questions with regard to UTF-8 handling in Perl:

>>(Please bear in mind that I am not an IT guy)

> Worry not -- Basically no IT person gets this right anyway : )
 
>>1a) My use statements are the following:

>>use warnings;
>>use strict;
>>use utf8;
>>use open ':encoding(utf8)';

>I would add

>use feature qw(unicode_strings);

>or even

>use if $^V ge v5.12, feature => qw(unicode_strings);

>and replace :encoding(utf8) with :encoding(UTF-8), but see below.
 
Thanks. That looks very useful. Would it also be a good idea to upgrade perl to 
5.14 instead of 5.12?
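
Just so I have it straight, the whole preamble would then look roughly like 
this (untested on my side, and assuming 5.12 or newer is available):

use warnings;
use strict;
use utf8;                        # only because my scripts contain UTF-8 literals
use open ':encoding(UTF-8)';     # strict UTF-8 on file handles opened with open()
use if $^V ge v5.12, feature => qw(unicode_strings);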

>>Now if I understand it correctly, there are two ways of encoding UTF-8 in Perl: 
>>One liberal (utf8) and one strict 
>>(UTF-8). For my purpose, I need correctly encoded UTF-8 files. However, I 
>>cannot be sure whether the files I 
>>start with are properly encoded in UTF-8.

>That's primarily right, but I think that you are mistaken in the usage of the 
>lax version, utf8. The latter is only 
>useful when reading something produced by another Perl process that used the 
>lax encoding and outputted illegal UTF-8.

>For example:

>use Devel::Peek;
>use warnings;
>use feature 'say';    # say() is used below

>open my $out_fh, ">:utf8", "invalid_UTF-8.txt" or die $!;
>say { $out_fh } "This here: [\x{FFFF_FFFF}] is illegal UTF-8, but valid in Perl's lax internal encoding";
>close $out_fh or die $!;

>for my $encoding ( qw< utf8 encoding(UTF-8) > ) {

>    say "Encoding: [$encoding]";
>    open my $in_fh, "<:$encoding", "invalid_UTF-8.txt" or die $!;
>    my $line = <$in_fh>;
>    Dump $line;
>    close $in_fh;
>}

>What you get depends on whether $encoding is utf8 or encoding(UTF-8), though 
>the difference is a bit hard to 
>spot. For the former, you'll get back the string that you originally printed, 
>but for the latter, Encode will complain 
>about \x{FFFF_FFFF} not being in Unicode, and give you a string with a literal 
>\x{FFFFFFFF}, as if you had 
>written it in single quotes!

>The bottom line is that you scarcely ever want the lax, internal form. More so 
>because it's subject to change in upcoming Perl versions, since what it 
>currently does is whack.
 
>>So is it possible to open a file using the liberal interpretation, and write 
>>to a new file using the strict interpretation? Are there any issues 
>>regarding this, like characters that might not be re-encoded properly?

>See the above example. Should be entirely fine as long as the contents of the 
>file are all legal UTF-8.
 
So basically I could use the strict version of UTF-8 encoding for all of my 
scripts, as long as the original file is valid UTF-8 and I use valid UTF-8 
characters in my scripts when I need them.
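
And if I ever do need to read something that may contain the lax form and 
write it back out as strict UTF-8, I picture it roughly like this (untested 
sketch, and the file names are just placeholders):

#!/usr/bin/perl
use warnings;
use strict;

my ($source, $destination) = @ARGV;

open my $in_fh,  '<:utf8',            $source      or die "Can't read $source: $!";
open my $out_fh, '>:encoding(UTF-8)', $destination or die "Can't write $destination: $!";

while (my $line = <$in_fh>) {
    # The strict output layer should warn if a character that came in
    # through the lax layer is not valid Unicode:
    print {$out_fh} $line;
}

close $in_fh;
close $out_fh;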


>>2b) Do scripts themselves have to be encoded in UTF-8 to be able to process 
>>UTF-8-files?

>Nope.
 
>>If not, when should you encode the scripts in UTF-8 and when not?

>When you are using UTF-8 literals in your code, for example

>say "In katakana, [ni] is [ニ]";

>or

>my $león = "Simba";

>In which case the file needs to have a "use utf8;" on top, as well as being 
>properly encoded in UTF-8.
 
Alright. I had the "use utf8;" in the scripts, but they weren't encoded in 
UTF-8.

>>Most of my scripts add text to UTF-8 encoded text files. I've noticed that 
>>this sometimes seems to change the
>> encoding or give error messages when e.g. accented characters are involved. 
>> Am I right in assuming that only
>> scripts that remove text or extract certain parts do not need to be encoded 
>> in UTF-8?

>The encoding of the source has basically no relevance whatsoever [*], unless 
>you are using "use encoding", 
>which you shouldn't. Errors with accented characters are probably due to using 
>latin-1 and mistakenly assuming 
>that you are using UTF-8, or the reverse.
> The likely culprits for this sort of thing are that you forgot to "use 
> utf8", or your editor isn't outputting UTF-8 
>(maybe latin-1?), or you are using the wrong encoding for reading/writing.
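
>For instance, if a file is actually latin-1 but gets read as UTF-8 (or the 
>other way round), accented characters come out mangled. The cure is to read 
>it through a layer for the encoding it really has; roughly like this (the 
>file name is made up):
>
>open my $in_fh, '<:encoding(ISO-8859-1)', 'latin1_input.txt' or die $!;
>while (my $line = <$in_fh>) {
>    # $line is now a proper character string, whatever the on-disk
>    # encoding was, and can be written back out through a UTF-8 layer.
>}
>close $in_fh;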

>[*] Nitpick: Unless you are reading things from a __DATA__ section, which 
>inherits the UTF8-ness of the file in 
>which it was found.

See later for the script I am having problems with.


>>2c) Not really a perl question: Does anyone know of a monospaced font for 
>>Windows that handles most UTF-8 
>>characters gracefully? I would like one for use in Notepad++ to make it 
>>easier to write scripts containing 
>>special characters not normally displayable in Windows.

>Symbola. It's awesome. \N{DROMEDARY CAMEL}

Thanks! Looks like it has the characters I need, too.
 
>>3) Windows uses UTF-8 with BOM, Unix and Unix-likes UTF-8 without BOM.

>Nope. Windows uses UTF-16, which requires a BOM to distinguish between 
>UTF-16LE and UTF-16BE. Most 
>Unices use UTF-8, which doesn't require a BOM and, in fact, using one is 
>against Unicode's recommendation. If you 
>spot a file with a UTF-8 BOM, quickly s/// it away!
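
>Concretely, once the file has been decoded, the BOM is just a single U+FEFF 
>character at the very start, so something like this on the first line read 
>(variable name purely for illustration) does the trick:
>
>$first_line =~ s/\A\x{FEFF}//;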
 
>>A particular script of mine prepends a piece of text to UTF-8 encoded text 
>>files created with MS Word on 
>>Windows (saved as .txt with UTF-8 encoding). Unfortunately, this appears to 
>>break the encoding, which 
>>changes from "UTF-8 with BOM" to "UTF-8 without BOM", probably because the 
>>text is inserted *before* the 
>>BOM at the start of the file. How do I prevent this? How can my script 
>>recognize the BOM at the start of the 
>>file?

>Been a while since I used Word, but I've got a hunch that "UTF-8 with BOM" is 
>actually marked as "Unicode", 
>which in Windowspeak is UTF-16, and see the note about the BOM above.

It shows up as "Unicode (UTF-8)" in the file conversion box in Word. Notepad++ 
tells me this is UTF-8 encoded. After running the script Notepad++ shows it as 
"ANSI as UTF-8", so apparently something happens to the file that kinda breaks 
the encoding (see also 
http://stackoverflow.com/questions/1380690/what-is-ansi-as-utf-8-and-how-can-i-make-fputcsv-generate-utf-8-w-bom).

>Like mentioned above, you generally -don't- want to read the file and start 
>guessing encodings. That road leads 
>to madness. It would be helpful if you posted some snippets of code that 
>showed what and where the problem 
>lies; that way we could give you a bit more accurate advice. 
>However, if you absolutely must go on 
>guessing, check out File::BOM and/or Encode::Guess, or try manually decoding 
>as shown above.

Okay, what I am doing is converting Word files with OCR'ed text from *old* 
books to XML using a custom XML schema for database import.

The workflow is effectively this:
MS Word file with OCR'ed text -> Convert file to .txt with UTF-8 encoding using 
Word -> Run bunch of scripts that insert most of the XML -> clean up and fix 
remaining problems by manually going through the file with XML Spy.

I am not using an XML-specific module because the original text has so many 
quirks and other little problems that the end result needs to be thoroughly 
checked by hand anyways (well-formedness is not an issue). If I can get 90% of 
the XML in there using substitutions I am happy.

I have to assume Word's file conversion produces valid UTF-8. Unfortunately, it 
seems XML Spy is one of those programs that doesn't recognize UTF-8 if the file 
isn't proper UTF-8 with BOM (all accented characters etc. are shown as 
garbage). So one of the scripts breaks the encoding, and I think it is this one:

#!/usr/bin/perl
# pubtags.plx
# Pre- and appends the first and last XML tags to a file.
use warnings;
use strict;
use utf8;
use open ':encoding(utf8)';

my $source = shift @ARGV;
my $destination = shift @ARGV;

open IN, $source or die "Can't read source file $source: $!\n";
open OUT, "> $destination" or die "can't write on file $destination: $!\n";

# Prepends the first XML tags:
print OUT "<opening XML tags go here...>\n";

while (<IN>) {
        # Prints all non-empty lines to file:
        if (!/^\s*$/) {
                print OUT $_;
        }
}

# Appends the accompanying closing tags:
print OUT "\n\t\t<closing XML tags go here>";

close IN;
close OUT;

After running this script Notepad++ tells me the encoding has changed as 
described above.
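
Would something along these lines keep the BOM where XML Spy expects it? (Just 
my guess based on the above, not tested yet: write a BOM first, and strip the 
one Word put at the start of the first input line.)

#!/usr/bin/perl
# pubtags_bom.plx -- like pubtags.plx, but tries to keep "UTF-8 with BOM"
use warnings;
use strict;
use utf8;
use open ':encoding(UTF-8)';

my $source = shift @ARGV;
my $destination = shift @ARGV;

open IN, $source or die "Can't read source file $source: $!\n";
open OUT, "> $destination" or die "can't write on file $destination: $!\n";

# Writes the BOM, then prepends the first XML tags:
print OUT "\x{FEFF}";
print OUT "<opening XML tags go here...>\n";

while (<IN>) {
        # Drops the BOM that Word put at the start of the first line:
        s/\A\x{FEFF}// if $. == 1;
        # Prints all non-empty lines to file:
        if (!/^\s*$/) {
                print OUT $_;
        }
}

# Appends the accompanying closing tags:
print OUT "\n\t\t<closing XML tags go here>";

close IN;
close OUT;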

Regards,
Thomas

