On Fri, Oct 7, 2011 at 12:19 PM, Brandon McCaig <bamcc...@gmail.com> wrote:

>
> I know next to nothing about Unicode programming (in any
> language), but it seems to always be the same prefix. Printing
> this out in Windows' cmd shell seems to yield the same prefix
> that I see in UTF-8 files with a BOM (byte-order mark). Oddly,
> your data seems to have two of them, which I can't explain, but I
> digress. Could you not just remove those two characters with a
> s///?
>
>
Well, it's a replacement character (U+FFFD), not a BOM (U+FEFF). The two
look alike in a raw dump because both encode to three bytes starting with
0xEF in UTF-8. UTF-8 files aren't supposed to have BOMs (they can, but it's
a no-op -- byte order only matters for UTF-16 and UTF-32), and in any case a
BOM belongs at the very start of a file, not in every line.
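
To make the difference concrete, here is a minimal sketch, assuming the
data has been read as raw bytes and not yet decoded. The sample line and
the helper names strip_bom and count_replacements are made up for
illustration; Encode's decode is the only library call involved.

```perl
use strict;
use warnings;
use Encode qw(decode);

# UTF-8 byte sequences for the two characters under discussion.
my $bom_bytes  = "\xEF\xBB\xBF";   # U+FEFF, byte-order mark
my $repl_bytes = "\xEF\xBF\xBD";   # U+FFFD, replacement character

# Hypothetical sample: a line that starts with two replacement characters.
my $line = decode('UTF-8', $repl_bytes . $repl_bytes . "some text");

# Remove a BOM only when it is the very first character of the input.
sub strip_bom {
    my ($text) = @_;
    $text =~ s/\A\x{FEFF}//;
    return $text;
}

# Count replacement characters; each one marks data that was already
# damaged somewhere upstream, so stripping them hides the real problem.
sub count_replacements {
    my ($text) = @_;
    my $count = () = $text =~ /\x{FFFD}/g;
    return $count;
}

printf "%d replacement character(s) found\n", count_replacements($line);
```

Stripping a leading BOM is harmless, but a s/// that deletes U+FFFD only
hides the symptom. The characters that were there originally are already
gone by the time U+FFFD shows up.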

I have no idea how they got there, though. But I'll gladly just blame use
encoding ...; until proven otherwise : )



>
> Maybe look here for some possibly better advice:
>
> http://ahinea.com/en/tech/perl-unicode-struggle.html
>
>
That's an alright introduction, but Unicode is so much more complex than
that.

I'm only beginning to grasp this stuff myself, but basically any search
result that contains "Unicode" and "Tom Christiansen" is a must-read these
days. These two links should get you started:

http://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default
and
http://98.245.80.27/tcpc/OSCON2011/index.html

And since he is one of the authors of the new Camel book (coming out in
December!), I assume its Unicode chapter will be worth reading as well.
