Hello,

As a quick reminder, the problem I encountered arose when trying to compile source files that are NOT encoded in the same encoding as the system header files. My current Linux machine uses UTF-8, but I am trying to compile files that were created with Windows "Unicode" (UTF-16).

To make a reproducible test, I created a simple hello world program that includes stdio.h. The file hi-utf16.c, created with Notepad and saved as "Unicode", contains a BOM, which is in essence a small header at the beginning of the file that indicates its encoding.

nicolas:~> gcc -finput-charset=UTF-16 hi-utf16.c
hi-utf16.c:1:19: failure to convert UTF-16 to UTF-8

It appears that CPP tells libiconv to convert the source file from UTF-16 to UTF-8, which works fine for hi-utf16.c itself, but as soon as it hits the include file, it fails. Of course, stdio.h is stored in UTF-8 on the system, so trying to convert it from UTF-16 fails right away.
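
For what it is worth, here is a small stand-alone sketch of the kind of conversion step CPP delegates to libiconv. This is NOT GCC's actual code: the helper name, buffer size and messages are made up, and real code would loop and grow the output buffer. It only shows that pushing plain ASCII/UTF-8 bytes (what stdio.h really contains) through a UTF-16 converter either errors out or produces garbage:

#include <stdio.h>
#include <string.h>
#include <iconv.h>

/* Convert 'len' bytes of 'in' from 'from_charset' to UTF-8.
   Returns 0 on success, -1 if the input is not valid in that charset. */
static int convert_to_utf8(const char *from_charset, char *in, size_t len)
{
    iconv_t cd = iconv_open("UTF-8", from_charset);
    if (cd == (iconv_t)-1) {
        perror("iconv_open");
        return -1;
    }

    char out[4096];
    char *inp = in, *outp = out;
    size_t inleft = len, outleft = sizeof(out);

    /* One call is enough for this tiny buffer; iconv() returns (size_t)-1
       on an invalid (EILSEQ) or truncated (EINVAL) input sequence. */
    size_t r = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);

    if (r == (size_t)-1) {
        perror("iconv");
        return -1;
    }
    return 0;
}

int main(void)
{
    /* Plain ASCII/UTF-8 text, i.e. what a system header actually contains,
       interpreted as UTF-16 the way -finput-charset=UTF-16 forces CPP to
       interpret every file it opens. */
    char header_text[] = "#include <stdio.h>\n";

    if (convert_to_utf8("UTF-16", header_text, strlen(header_text)) != 0)
        fprintf(stderr, "could not convert the text from UTF-16\n");
    else
        printf("converted, but the result is mojibake, not the original text\n");
    return 0;
}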

Now, it would be nice if every file used the same Unicode encoding, but that is not always possible, especially when source control is involved. This issue touches on interoperability between Windows and UNIX, and on "legacy" (i.e. pre-UTF-8) source files in general. My suggestion is to have CPP open a file and read up to the first 4 bytes to figure out whether there is a BOM. If so, determine the encoding from it and pass it to libiconv. I believe that's what vim does, btw. In short, we would detect the encoding in the following order (a rough sketch of the BOM probe follows the list):

1. BOM
2. -finput-charset option
3. LC_CTYPE
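
To make the idea concrete, here is a rough sketch of step 1 in C. It is purely illustrative, not a patch against cpplib, and the names (charset_from_bom, the bomprobe.c file mentioned below) are made up:

#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Returns the charset name to hand to libiconv if the buffer starts with
   a known BOM, or NULL so that the caller falls back to -finput-charset
   and then LC_CTYPE. */
static const char *charset_from_bom(const unsigned char *buf, size_t len)
{
    /* UTF-32 must be tested before UTF-16: the UTF-32LE BOM starts with
       the same FF FE bytes as the UTF-16LE BOM. */
    if (len >= 4 && !memcmp(buf, "\x00\x00\xFE\xFF", 4))
        return "UTF-32BE";
    if (len >= 4 && !memcmp(buf, "\xFF\xFE\x00\x00", 4))
        return "UTF-32LE";
    if (len >= 3 && !memcmp(buf, "\xEF\xBB\xBF", 3))
        return "UTF-8";
    if (len >= 2 && !memcmp(buf, "\xFE\xFF", 2))
        return "UTF-16BE";
    if (len >= 2 && !memcmp(buf, "\xFF\xFE", 2))
        return "UTF-16LE";
    return NULL;   /* no BOM: use steps 2 and 3 */
}

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s file\n", argv[0]);
        return 1;
    }
    FILE *f = fopen(argv[1], "rb");
    if (!f) {
        perror(argv[1]);
        return 1;
    }

    unsigned char head[4];
    size_t n = fread(head, 1, sizeof head, f);
    fclose(f);

    const char *cs = charset_from_bom(head, n);
    printf("%s: %s\n", argv[1],
           cs ? cs : "no BOM, fall back to -finput-charset / LC_CTYPE");
    return 0;
}

Compiled as bomprobe.c and run on the Notepad file, this should print UTF-16LE, since Notepad's "Unicode" is UTF-16LE with an FF FE BOM.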

I am even thinking of how sweet it would be to specify the encoding per file or per directory, to take care of encodings that are not auto-detectable (Latin-1, for example), but that is probably another project of its own!

Best regards,

Nicolas