Some UTF-8-related questions

Hamann, T.D. (Thomas) Wed, 11 Jan 2012 02:59:59 -0800

Hi,

Thanks for the answers to my last question. Since then I have dug a bit further
into the UTF-8-related error message I got, and after some reading I have a few
questions regarding UTF-8 handling in Perl:


(Please bear in mind that I am not an IT guy)

1a) My use statements are the following:

use warnings;
use strict;
use utf8;
use open ':encoding(utf8)';

Now if I understand it correctly, there are two ways of encoding UTF-8 in Perl:
a liberal one (utf8) and a strict one (UTF-8). For my purposes, I need correctly
encoded UTF-8 files. However, I cannot be sure whether the files I start with
are properly encoded in UTF-8.
So is it possible to open a file using the liberal interpretation and write to
a new file using the strict interpretation? Are there any issues with this,
such as characters that might not be re-encoded properly?
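
To make the question concrete, here is a minimal sketch of the idea: read through
the lenient utf8 layer and write through the strict UTF-8 layer. The helper name
copy_lax_to_strict and the file names are hypothetical, chosen only for
illustration. Characters that strict UTF-8 does not allow (for example surrogate
code points) would presumably not be written through unchanged.

```perl
use strict;
use warnings;

# Hypothetical helper: copy $in_name to $out_name, decoding with the
# lenient "utf8" layer and encoding with the strict "UTF-8" layer.
sub copy_lax_to_strict {
    my ($in_name, $out_name) = @_;

    open(my $in,  '<:encoding(utf8)',  $in_name)
        or die "Cannot read $in_name: $!";
    open(my $out, '>:encoding(UTF-8)', $out_name)
        or die "Cannot write $out_name: $!";

    # Copy line by line; each line is a string of decoded characters.
    while (my $line = <$in>) {
        print {$out} $line;
    }

    close($in);
    close($out) or die "Cannot close $out_name: $!";
    return;
}
```

It would be called as copy_lax_to_strict('input.txt', 'output.txt'), where both
file names are placeholders.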

1b) How can I check whether a file is properly encoded in UTF-8?
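
One possible way to check this, sketched under the assumption that the core
Encode module is available: read the file as raw bytes and try a strict decode
that dies on malformed input. The sub names is_valid_utf8 and
file_is_valid_utf8 are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Returns 1 if the byte string decodes as strict UTF-8, otherwise 0.
sub is_valid_utf8 {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed input;
    # LEAVE_SRC keeps decode() from modifying $bytes.
    my $ok = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC); 1 };
    return $ok ? 1 : 0;
}

# Hypothetical helper: slurp a file as raw bytes and validate it.
sub file_is_valid_utf8 {
    my ($name) = @_;
    open(my $fh, '<:raw', $name) or die "Cannot read $name: $!";
    my $bytes = do { local $/; <$fh> };   # slurp the whole file
    close($fh);
    $bytes = '' unless defined $bytes;
    return is_valid_utf8($bytes);
}
```

This reads the whole file into memory, so it is only a reasonable approach for
files of moderate size.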


2a) As I understand it, Windows has a somewhat limited ability to display
certain Unicode characters, although some fonts can display more of them. The
characters do exist in the file, even if Windows can't display them (and shows
a square instead). Is this correct? If not, does that affect Perl's ability to
handle Unicode?

2b) Do scripts themselves have to be encoded in UTF-8 to be able to process
UTF-8 files? If not, when should you encode the scripts in UTF-8 and when not?
Most of my scripts add text to UTF-8 encoded text files. I've noticed that this
sometimes seems to change the encoding or produce error messages when, for
example, accented characters are involved. Am I right in assuming that only
scripts that remove text or extract certain parts do not need to be encoded in
UTF-8?

2c) Not really a Perl question: Does anyone know of a monospaced font for
Windows that handles most Unicode characters gracefully? I would like one for
use in Notepad++ to make it easier to write scripts containing special
characters not normally displayable in Windows.


3) Windows programs often write UTF-8 with a BOM, whereas Unix and Unix-like
systems usually use UTF-8 without one. A particular script of mine prepends a
piece of text to UTF-8 encoded text files created with MS Word on Windows
(saved as .txt with UTF-8 encoding). Unfortunately, this appears to break the
encoding, which changes from "UTF-8 with BOM" to "UTF-8 without BOM", probably
because the text is inserted *before* the BOM at the start of the file. How do
I prevent this? How can my script recognize the BOM at the start of the file?
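
A minimal sketch of one way this might be handled, working on raw bytes: a UTF-8
BOM is the three bytes EF BB BF, so the script can test for them, strip them,
and write them back in front of the prepended text. The sub name
prepend_keeping_bom is hypothetical, and the sketch assumes the file contents
have been read with a :raw layer.

```perl
use strict;
use warnings;
use Encode qw(encode);

# The UTF-8 encoding of U+FEFF (the BOM), as raw bytes.
my $BOM = "\xEF\xBB\xBF";

# Hypothetical helper: takes the raw bytes of a file and a character
# string to prepend; returns new raw bytes. If the original started
# with a BOM, the BOM stays at the very start of the result.
sub prepend_keeping_bom {
    my ($bytes, $text) = @_;

    my $had_bom = substr($bytes, 0, 3) eq $BOM;
    $bytes = substr($bytes, 3) if $had_bom;

    # Encode the new text to UTF-8 bytes before joining it with the
    # existing bytes, so characters and bytes are never mixed.
    my $prefix = encode('UTF-8', $text);

    return ($had_bom ? $BOM : '') . $prefix . $bytes;
}
```

The result should then be written back through a :raw filehandle, since it is
already encoded.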

Thanks for reading.

Regards,
Thomas









