Hello and thank you for the reply.

I created 3 files (very simple hello world program):

hi.c: UTF-8 without BOM
hi-8.c: UTF-8 with BOM
hi-16.c: UTF-16 with BOM

I ran iconv twice for each file: once with the -f option, which explicitly names the source encoding, and once without it, to see whether libiconv can detect the encoding from the BOM. In all cases I told iconv to produce a UTF-8 file, and I used od (octal dump) to inspect the result.

My results:
1: without -f option
2: with -f option

hi.c (1):    UTF-8, without BOM
hi.c (2):    UTF-8, without BOM
hi-8.c (1):  UTF-8, with BOM *
hi-8.c (2):  UTF-8, with BOM *
hi-16.c (1): illegal character error. Does not use BOM automatically!
hi-16.c (2): UTF-8, without BOM
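
For reference, the hi-16.c (2) case can be approximated through the library API instead of the command-line tool. This is only a minimal sketch, assuming a POSIX iconv(3) implementation (libiconv or the one in libc). The helper name utf16_to_utf8 is made up for illustration, and the source encoding is named explicitly, just as -f does:

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper (not part of libiconv): convert inlen bytes of
   BOM-marked UTF-16 to UTF-8.  Returns the number of bytes written to
   out, or (size_t)-1 on any failure. */
size_t utf16_to_utf8(const char *in, size_t inlen, char *out, size_t outcap)
{
    iconv_t cd;
    char *inp = (char *)in;   /* iconv() takes char **, but only reads the input */
    char *outp = out;
    size_t inleft = inlen;
    size_t outleft = outcap;
    size_t rc;

    assert(in != NULL && out != NULL);

    /* Argument order is (to, from): the same roles as "-t UTF-8 -f UTF-16". */
    cd = iconv_open("UTF-8", "UTF-16");
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);

    if (rc == (size_t)-1)
        return (size_t)-1;    /* illegal sequence, truncated input, or out too small */
    return outcap - outleft;
}
```

If the library behaves like the command-line run above, the output buffer should hold plain UTF-8 with no BOM at the front. On systems where iconv is a separate library, linking with -liconv may be needed.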

Considering those results, it looks a bit like I'll have to bug the libiconv crew!

Presumably, cpp wants everything from libiconv in UTF-8 with no BOM.


Nick


* Did libiconv really consider the BOM or did it just copy the file??? I have to investigate. libiconv may just not support the BOM at all!




Eric Christopher wrote:

It seems that BOM is a Unicode UTF facility that MS thought was a great thing to implement, and I certainly agree with that assessment. BOM tells even more than its name implies. A program can detect if a file is encoded in UTF-8, 16LE, 16BE, 32LE and 32BE in a very easy way.
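
To illustrate how little work that detection takes, here is a rough sketch in C. The names bom_kind and detect_bom are invented for this example and do not come from gcc, cpp or libiconv. The byte sequences are the standard Unicode BOM signatures:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Encodings that can be recognised from a leading byte order mark. */
enum bom_kind {
    BOM_NONE, BOM_UTF8, BOM_UTF16LE, BOM_UTF16BE, BOM_UTF32LE, BOM_UTF32BE
};

/* Hypothetical helper: classify the first bytes of a buffer.  If bom_len
   is not NULL it receives the number of BOM bytes to skip (0 if none). */
enum bom_kind detect_bom(const unsigned char *buf, size_t len, size_t *bom_len)
{
    static const struct {
        enum bom_kind kind;
        size_t n;
        unsigned char bytes[4];
    } table[] = {
        /* UTF-32LE must be tried before UTF-16LE: both start with FF FE.
           (FF FE 00 00 could also be UTF-16LE followed by U+0000; this
           sketch assumes that does not occur in source files.) */
        { BOM_UTF32LE, 4, { 0xFF, 0xFE, 0x00, 0x00 } },
        { BOM_UTF32BE, 4, { 0x00, 0x00, 0xFE, 0xFF } },
        { BOM_UTF8,    3, { 0xEF, 0xBB, 0xBF } },
        { BOM_UTF16LE, 2, { 0xFF, 0xFE } },
        { BOM_UTF16BE, 2, { 0xFE, 0xFF } },
    };
    size_t i;

    assert(buf != NULL || len == 0);

    for (i = 0; i < sizeof table / sizeof table[0]; i++) {
        if (len >= table[i].n && memcmp(buf, table[i].bytes, table[i].n) == 0) {
            if (bom_len != NULL)
                *bom_len = table[i].n;
            return table[i].kind;
        }
    }
    if (bom_len != NULL)
        *bom_len = 0;
    return BOM_NONE;
}
```

A caller would read the first four bytes of a file, pass them to detect_bom, and then skip bom_len bytes before handing the rest to the converter with the matching encoding name.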

I think that it would be good for gcc (or cpp) to support this because it would make for better interoperability with Visual C++, and it would allow each file to indicate how it is encoded without having to rely on some setting that may or may not provide the correct information in every case.

cpp relies on libiconv for almost all of its translation support. Try preprocessing a file with iconv and see if you can compile it afterwards. If you can, then it's a gcc bug; otherwise you'll need to bug the libiconv folks about implementing support.

-eric

