>>Hi,
>>Thanks for the answers to my last question. I have since dug a bit further
>>into the UTF-8-related error message I got, and after some reading I have a
>>few questions with regard to UTF-8 handling in Perl:
>>(Please bear in mind that I am not an IT guy)
> Worry not -- Basically no IT person gets this right anyway : )
>>1a) My use statements are the following:
>>use warnings;
>>use strict;
>>use utf8;
>>use open ':encoding(utf8)';
>I would add
>use feature qw(unicode_strings);
>or even
>use if $^V ge v5.12, feature => qw(unicode_strings);
>and replace :encoding(utf8) with :encoding(UTF-8), but see below.
Thanks. That looks very useful. Would it also be a good idea to upgrade perl to
5.14 instead of 5.12?
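If I follow correctly, the top of my scripts would then look something like
this (untested):
use warnings;
use strict;
use utf8;                          # the script file itself is saved as UTF-8
use feature qw(unicode_strings);   # or the "use if" variant for older perls
use open ':encoding(UTF-8)';       # strict UTF-8 for file handles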
>>Now if I understand it correctly, there are two ways of encoding UTF-8 in
>>Perl: one liberal (utf8) and one strict (UTF-8). For my purpose, I need
>>correctly encoded UTF-8 files. However, I cannot be sure whether the files I
>>start with are properly encoded in UTF-8.
>That's primarily right, but I think that you are mistaken in the usage of the
>lax version, utf8. The latter is only useful when reading something produced
>by another Perl process that used the lax encoding and output illegal UTF-8.
>For example:
>use Devel::Peek;
>use warnings;
>use feature qw(say);    # say() is used below
>open my $out_fh, ">:utf8", "invalid_UTF-8.txt" or die $!;
>say { $out_fh } "This here: [\x{FFFFFFFF}] is illegal UTF-8, but valid in Perl's lax internal encoding";
>close $out_fh or die $!;
>for my $encoding ( qw< utf8 encoding(UTF-8) > ) {
> say "Encoding: [$encoding]";
> open my $in_fh, "<:$encoding", "invalid_UTF-8.txt" or die $!;
> my $line = <$in_fh>;
> Dump $line;
> close $in_fh;
>}
>What you get depends on whether $encoding is utf8 or encoding(UTF-8), though
>the difference is a bit hard to spot. For the former, you'll get back the
>string that you originally printed, but for the latter, Encode will complain
>about \x{FFFFFFFF} not being in Unicode, and give you a string containing a
>literal \x{FFFFFFFF}, as if you had written it in single quotes!
>The bottom line is that you scarcely ever want the lax, internal form. More so
>because it's subject to change in upcoming Perl versions, since what it
>currently does is whack.
>>So is it possible to open a file using the liberal interpretation, and write
>>to a new file using the strict interpretation? Are there any issues regarding
>>this, like characters that might not be re-encoded properly?
>See the above example. Should be entirely fine as long as the contents of the
>file are all legal UTF-8.
So basically I could use the strict version of UTF-8 encoding for all of my
scripts, as long as the original file is valid UTF-8 and I use valid UTF-8
characters in my scripts when I need them.
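So, for example, something like this ought to be safe for copying a file, as
long as the input really is valid UTF-8 (just a sketch, untested):
my ($source, $destination) = @ARGV;
open my $in,  '<:encoding(UTF-8)', $source      or die "Can't read $source: $!";
open my $out, '>:encoding(UTF-8)', $destination or die "Can't write $destination: $!";
print {$out} $_ while <$in>;    # Encode will complain if the input isn't valid UTF-8
close $in;
close $out or die $!;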
>>2b) Do scripts themselves have to be encoded in UTF-8 to be able to process
>>UTF-8-files?
>Nope.
>>If not, when should you encode the scripts in UTF-8 and when not?
>When you are using UTF-8 literals in your code, for example
>say "In katakana, [ni] is [ニ]";
>or
>my $león = "Simba";
>In which case the file needs to have a "use utf8;" on top, as well as being
>properly encoded in UTF-8.
Alright. I had the "use utf8;" in the scripts, but they weren't encoded in
UTF-8.
>>Most of my scripts add text to UTF-8 encoded text files. I've noticed that
>>this sometimes seems to change the
>> encoding or give error messages when e.g. accented characters are involved.
>> Am I right in assuming that only
>> scripts that remove text or extract certain parts do not need to be encoded
>> in UTF-8?
>The encoding of the source has basically no relevance whatsoever [*], unless
>you are using "use encoding", which you shouldn't. Errors with accented
>characters are probably due to using latin-1 and mistakenly assuming that you
>are using UTF-8, or the reverse.
>The likely culprits for this sort of thing are that you forgot to "use utf8",
>or your editor isn't outputting UTF-8 (maybe latin-1?), or you are using the
>wrong encoding for reading/writing.
>[*] Nitpick: Unless you are reading things from a __DATA__ section, which
>inherits the UTF8-ness of the file in
>which it was found.
See later for the script I am having problems with.
>>2c) Not really a perl question: Does anyone know of a monospaced font for
>>Windows that handles most UTF-8
>>characters gracefully? I would like one for use in Notepad++ to make it
>>easier to write scripts containing
>>special characters not normally displayable in Windows.
>Symbola. It's awesome. \N{DROMEDARY CAMEL}
Thanks! Looks like it has the characters I need, too.
>>3) Windows uses UTF-8 with BOM, Unix and Unix-likes UTF-8 without BOM.
>Nope. Windows uses UTF-16, which requires a BOM to distinguish between
>UTF-16LE and UTF-16BE. Most Unices use UTF-8, which doesn't require a BOM
>and, in fact, using one is against Unicode's recommendation. If you spot a
>file with a UTF-8 BOM, quickly s/// it away!
>>A particular script of mine prepends a piece of text to UTF-8 encoded text
>>files created with MS Word on
>>Windows (saved as .txt with UTF-8 encoding). Unfortunately, this appears to
>>break the encoding, which
>>changes from "UTF-8 with BOM" to "UTF-8 without BOM", probably because the
>>text is inserted *before* the
>>BOM at the start of the file. How do I prevent this? How can my script
>>recognize the BOM at the start of the
>>file?
>Been a while since I used Word, but I've got a hunch that "UTF-8 with BOM" is
>actually marked as "Unicode",
>which in Windowspeak is UTF-16, and see the note about the BOM above.
It shows up as "Unicode (UTF-8)" in the file conversion box in Word. Notepad++
tells me this is UTF-8 encoded. After running the script Notepad++ shows it as
"ANSI as UTF-8", so apparently something happens to the file that kinda breaks
the encoding (see also this link
"http://stackoverflow.com/questions/1380690/what-is-ansi-as-utf-8-and-how-can-i-make-fputcsv-generate-utf-8-w-bom").
>As mentioned above, you generally -don't- want to read the file and start
>guessing encodings. That road leads to madness. It would be helpful if you
>posted some snippets of code that showed what and where the problem lies;
>that way we could give you a more accurate piece of advice. However, if you
>absolutely must go on guessing, check out File::BOM and/or Encode::Guess, or
>try manually decoding as shown above.
Okay, what I am doing is converting Word files with OCR'ed text from *old*
books to XML using a custom XML schema for database import.
The workflow is effectively this:
MS Word file with OCR'ed text -> Convert file to .txt with UTF-8 encoding using
Word -> Run a bunch of scripts that insert most of the XML -> Clean up and fix
remaining problems by manually going through the file with XML Spy.
I am not using an XML-specific module because the original text has so many
quirks and other little problems that the end result needs to be thoroughly
checked by hand anyways (well-formedness is not an issue). If I can get 90% of
the XML in there using substitutions I am happy.
I have to assume Word's file conversion produces valid UTF-8. Unfortunately, it
seems XML Spy is one of those programs that doesn't recognize UTF-8 if the file
isn't proper UTF-8 with BOM (all accented characters etc. are shown as
garbage). So one of the scripts breaks the encoding, and I think it is this one:
#!/usr/bin/perl
# pubtags.plx
# Pre- and appends the first and last XML tags to a file.
use warnings;
use strict;
use utf8;
use open ':encoding(utf8)';
my $source = shift @ARGV;
my $destination = shift @ARGV;
open IN, $source or die "Can't read source file $source: $!\n";
open OUT, "> $destination" or die "can't write on file $destination: $!\n";
# Prepends the first XML tags:
print OUT "<opening XML tags go here...>\n";
while (<IN>) {
    # Prints all non-empty lines to file:
    if (!/^\s*$/) {
        print OUT $_;
    }
}
# Appends the accompanying closing tags:
print OUT "\n\t\t<closing XML tags go here>";
close IN;
close OUT;
After running this script Notepad++ tells me the encoding has changed as
described above.
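Would something along these lines be the right way to keep the BOM at the
front of the output (just a sketch, I have not tested it yet)? Right after the
two open statements, instead of printing the opening tags straight away:
my $first = <IN>;                           # first line may start with a BOM
my $had_bom = defined $first && $first =~ s/^\x{FEFF}//;
print OUT "\x{FEFF}" if $had_bom;           # put the BOM back at the very start
print OUT "<opening XML tags go here...>\n";
print OUT $first if defined $first && $first !~ /^\s*$/;
The while loop would then handle the remaining lines as before.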
Regards,
Thomas