On Fri, Oct 7, 2011 at 12:19 PM, Brandon McCaig <bamcc...@gmail.com> wrote:
> I know next to nothing about Unicode programming (in any
> language), but it seems to always be the same prefix. Printing
> this out in Windows' cmd shell seems to yield the same prefix
> that I see in UTF-8 files with a BOM (byte-order mark). Oddly,
> your data seems to have two of them, which I can't explain, but I
> digress. Could you not just remove those two characters with a
> s///?

Well, it's a replacement character, not a BOM. UTF-8 files aren't
supposed to have BOMs (they can, but it's a no-op -- byte order only
matters for UTF-16 and UTF-32), and in any case a BOM is supposed to be
the first few bytes of a file, not something that shows up in every
line. I have no idea how those came to be, though. But I'll gladly just
blame use encoding ...; until proven otherwise : )

> Maybe look here for some possibly better advice:
>
> http://ahinea.com/en/tech/perl-unicode-struggle.html

That's an alright introduction, but Unicode is so much more complex
than that. I'm only beginning to grasp this stuff myself, but basically
any search result that contains "Unicode" and "Tom Christiansen" is a
must-read these days. These two links should get you started:

http://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default
http://98.245.80.27/tcpc/OSCON2011/index.html

And since he is the author of the new camel (coming out in December!),
I'm assuming that the Unicode chapter there should also be kept in
mind.
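
For what it's worth, the difference between a BOM and a replacement character is easy to check by hand. What follows is only a minimal sketch, not something run against the original data: the helper names (strip_leading_bom, count_replacement_chars, utf8_hex) and the sample bytes are made up for illustration, and it assumes the input has been read as raw bytes. It uses nothing beyond the core Encode module.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: remove one BOM (U+FEFF), but only at the very
# start of already-decoded text.
sub strip_leading_bom {
    my ($text) = @_;
    $text =~ s/\A\x{FEFF}//;
    return $text;
}

# Hypothetical helper: count U+FFFD replacement characters in decoded text.
sub count_replacement_chars {
    my ($text) = @_;
    my $count = () = $text =~ /\x{FFFD}/g;
    return $count;
}

# Hypothetical helper: show the UTF-8 encoding of a character string as hex.
sub utf8_hex {
    my ($chars) = @_;
    my $bytes = encode('UTF-8', $chars);
    return join ' ', map { sprintf('%02X', ord($_)) } split //, $bytes;
}

print 'BOM (U+FEFF) as UTF-8:         ', utf8_hex("\x{FEFF}"), "\n";  # EF BB BF
print 'Replacement (U+FFFD) as UTF-8: ', utf8_hex("\x{FFFD}"), "\n";  # EF BF BD

# Made-up sample: a UTF-8 BOM, then "caf", then a lone Latin-1 byte
# (0xE9), which is not valid UTF-8 on its own.
my $bytes = "\xEF\xBB\xBFcaf\xE9\n";

# With no CHECK argument, decode() substitutes U+FFFD for malformed input.
my $text = decode('UTF-8', $bytes);
$text = strip_leading_bom($text);

printf "replacement characters: %d\n", count_replacement_chars($text);  # 1
```

The \A anchor is the point here: a BOM only means something at the very start of the input, so the substitution leaves any later U+FEFF alone. Replacement characters are a different matter, since each one marks a spot where decoding already lost information, and deleting them with s/// hides the symptom without recovering the original character.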