>>Hi,
>>
>>Thanks for the answers to my last question. I have since dug a bit further into the UTF-8-related error message I got and, after some reading, have a few questions about UTF-8 handling in Perl:
>>(Please bear in mind that I am not an IT guy)

>Worry not -- basically no IT person gets this right anyway : )

>>1a) My use statements are the following:
>>use warnings;
>>use strict;
>>use utf8;
>>use open ':encoding(utf8)';

>I would add
>use feature qw(unicode_strings);
>or even
>use if $^V ge v5.12, feature => qw(unicode_strings);
>and replace :encoding(utf8) with :encoding(UTF-8), but see below.

Thanks. That looks very useful. Would it also be a good idea to upgrade Perl to 5.14 instead of 5.12?

>>Now if I understand it correctly, there are two ways of encoding UTF-8 in Perl: one liberal (utf8) and one strict (UTF-8). For my purpose, I need correctly encoded UTF-8 files. However, I cannot be sure whether the files I start with are properly encoded in UTF-8.

>That's primarily right, but I think that you are mistaken in the usage of the lax version, utf8. The latter is only useful when reading something produced by another Perl process that used the lax encoding and output illegal UTF-8.
>For example:
>
>use Devel::Peek;
>use warnings;
>use feature qw(say);
>
>open my $out_fh, ">:utf8", "invalid_UTF-8.txt" or die $!;
>say { $out_fh } "This here: [\x{FFFF_FFFF}] is illegal UTF-8, but valid in Perl's lax internal encoding";
>close $out_fh or die $!;
>
>for my $encoding ( qw< utf8 encoding(UTF-8) > ) {
>    say "Encoding: [$encoding]";
>    open my $in_fh, "<:$encoding", "invalid_UTF-8.txt" or die $!;
>    my $line = <$in_fh>;
>    Dump $line;
>    close $in_fh;
>}
>
>What you get depends on whether $encoding is utf8 or encoding(UTF-8), though the difference is a bit hard to spot. For the former, you'll get back the string that you originally printed, but for the latter, Encode will complain about \x{FFFF_FFFF} not being in Unicode and give you a string with a literal \x{FFFFFFFF}, as if you had written it in single quotes!
>The bottom line is that you scarcely ever want the lax, internal form. More so because it's subject to change in upcoming Perl versions, since what it currently does is whack.

>>So is it possible to open a file using the liberal interpretation, and write to a new file using the strict interpretation? Are there any issues regarding this, like characters that might not be re-encoded properly?

>See the above example. Should be entirely fine as long as the contents of the file are all legal UTF-8.

So basically I could use the strict version of UTF-8 encoding for all of my scripts, as long as the original file is valid UTF-8 and I use valid UTF-8 characters in my scripts when I need them.
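To be sure about the "original file is valid UTF-8" part, I suppose I could check each input file up front before running the other scripts. Here is a rough, untested sketch of what I have in mind (nothing Word-specific, just plain Encode; the file name is whatever gets passed on the command line):

#!/usr/bin/perl
# Rough sketch: refuse to continue if a file is not valid, strict UTF-8.
use warnings;
use strict;
use Encode qw(decode);

my $file = shift @ARGV or die "Usage: $0 FILE\n";

open my $fh, '<:raw', $file or die "Can't read $file: $!\n";
my $octets = do { local $/; <$fh> };    # slurp the raw bytes
close $fh;

# Encode::FB_CROAK makes decode() die on the first malformed sequence.
my $text = eval { decode('UTF-8', $octets, Encode::FB_CROAK) };
die "$file is not valid UTF-8: $@" if $@;

print "$file looks like valid UTF-8\n";

If that ever dies on one of the Word-converted files, at least I will know the problem is upstream of my scripts.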
>>2b) Do scripts themselves have to be encoded in UTF-8 to be able to process UTF-8 files?

>Nope.

>>If not, when should you encode the scripts in UTF-8 and when not?

>When you are using UTF-8 literals in your code, for example
>say "In katakana, [ni] is [ニ]";
>or
>my $león = "Simba";
>In which case the file needs to have a "use utf8;" on top, as well as being properly encoded in UTF-8.

Alright. I had the "use utf8;" in the scripts, but they weren't encoded in UTF-8.

>>Most of my scripts add text to UTF-8 encoded text files. I've noticed that this sometimes seems to change the encoding or give error messages when e.g. accented characters are involved. Am I right in assuming that only scripts that remove text or extract certain parts do not need to be encoded in UTF-8?

>The encoding of the source has basically no relevance whatsoever [*], unless you are using "use encoding", which you shouldn't. Errors with accented characters are probably due to using latin-1 while mistakenly assuming that you are using UTF-8, or the reverse. The likely culprits for this sort of thing are that you forgot to "use utf8", your editor isn't outputting UTF-8 (maybe latin-1?), or you are using the wrong encoding for reading/writing.
>[*] Nitpick: unless you are reading things from a __DATA__ section, which inherits the UTF8-ness of the file in which it was found.

See later for the script I am having problems with.

>>2c) Not really a Perl question: Does anyone know of a monospaced font for Windows that handles most UTF-8 characters gracefully? I would like one for use in Notepad++ to make it easier to write scripts containing special characters not normally displayable in Windows.

>Symbola. It's awesome. \N{DROMEDARY CAMEL}

Thanks! Looks like it has the characters I need, too.

>>3) Windows uses UTF-8 with BOM, Unix and Unix-likes UTF-8 without BOM.

>Nope. Windows uses UTF-16, which requires a BOM to distinguish between UTF-16LE and UTF-16BE. Most Unices use UTF-8, which doesn't require a BOM; in fact, using one is against Unicode's recommendation. If you spot a file with a UTF-8 BOM, quickly s/// it away!

>>A particular script of mine prepends a piece of text to UTF-8 encoded text files created with MS Word on Windows (saved as .txt with UTF-8 encoding). Unfortunately, this appears to break the encoding, which changes from "UTF-8 with BOM" to "UTF-8 without BOM", probably because the text is inserted *before* the BOM at the start of the file. How do I prevent this? How can my script recognize the BOM at the start of the file?

>Been a while since I used Word, but I've got a hunch that "UTF-8 with BOM" is actually marked as "Unicode", which in Windows-speak is UTF-16; see the note about the BOM above.

It shows up as "Unicode (UTF-8)" in the file conversion box in Word. Notepad++ tells me this is UTF-8 encoded. After running the script, Notepad++ shows it as "ANSI as UTF-8", so apparently something happens to the file that kinda breaks the encoding (see also "http://stackoverflow.com/questions/1380690/what-is-ansi-as-utf-8-and-how-can-i-make-fputcsv-generate-utf-8-w-bom").

>As mentioned above, you generally -don't- want to read the file and start guessing encodings. That road leads to madness. It would be helpful if you posted some snippets of code that showed what and where the problem lies; that way we could give you a more accurate piece of advice. However, if you absolutely must go on guessing, check out File::BOM and/or Encode::Guess, or try manually decoding as shown above.

Okay, what I am doing is converting Word files with OCR'ed text from *old* books to XML, using a custom XML schema for database import. The workflow is effectively this: MS Word file with OCR'ed text -> convert the file to .txt with UTF-8 encoding using Word -> run a bunch of scripts that insert most of the XML -> clean up and fix the remaining problems by manually going through the file with XML Spy. I am not using an XML-specific module because the original text has so many quirks and other little problems that the end result needs to be thoroughly checked by hand anyway (well-formedness is not an issue). If I can get 90% of the XML in there using substitutions, I am happy. I have to assume Word's file conversion produces valid UTF-8.
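If I do end up having to handle the BOM myself (XML Spy, as I describe below, seems to insist on it), I gather the manual approach would be to strip a leading U+FEFF on input and write one back out before anything else. A rough, untested sketch, with placeholder tag text:

#!/usr/bin/perl
# Rough sketch: prepend text to a UTF-8 file while keeping a BOM at the very start.
use warnings;
use strict;

my ($source, $destination) = @ARGV;

open my $in,  '<:encoding(UTF-8)', $source      or die "Can't read $source: $!\n";
open my $out, '>:encoding(UTF-8)', $destination or die "Can't write to $destination: $!\n";

my $first_line = <$in>;
$first_line =~ s/\A\x{FEFF}// if defined $first_line;   # strip an existing BOM, if any

print {$out} "\x{FEFF}";                         # put the BOM back at the very start
print {$out} "<opening XML tags go here...>\n";  # then the prepended text
print {$out} $first_line if defined $first_line;

while (my $line = <$in>) {
    print {$out} $line;
}

close $in;
close $out or die "Can't close $destination: $!\n";

Reading and writing with :encoding(UTF-8) and printing the \x{FEFF} first should keep the BOM at the very start of the output, which is what seems to be getting lost.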
Unfortunately, it seems XML Spy is one of those programs that doesn't recognize UTF-8 if the file isn't proper UTF-8 with BOM (all accented characters etc. are shown as garbage). So one of the scripts breaks the encoding, and I think it is this one:

#!/usr/bin/perl
# pubtags.plx
# Pre- and appends the first and last XML tags to a file.

use warnings;
use strict;
use utf8;
use open ':encoding(utf8)';

my $source      = shift @ARGV;
my $destination = shift @ARGV;

open IN,  $source          or die "Can't read source file $source: $!\n";
open OUT, "> $destination" or die "can't write on file $destination: $!\n";

# Prepends the first XML tags:
print OUT "<opening XML tags go here...>\n";

while (<IN>) {
    # Prints all non-empty lines to file:
    if (!/^\s*$/) {
        print OUT $_;
    }
}

# Appends the accompanying closing tags:
print OUT "\n\t\t<closing XML tags go here>";

close IN;
close OUT;

After running this script Notepad++ tells me the encoding has changed as described above.

Regards,

Thomas