OK, I understand: what I thought was UTF-8 is in fact Latin-1. Thanks, that makes sense.
-Steve

> On 5 Jan 2015, at 22:05, erik quanstrom <quans...@quanstro.net> wrote:
>
>> On Mon Jan 5 13:48:47 PST 2015, st...@quintile.net wrote:
>>
>> I am trying to parse a stream from a TCP connection.
>>
>> I think the data is UTF-8; here is a sample:
>>
>> 20 2d 20 c8 65 73 6b fd 20 72 6f 7a 68 6c 61 73
>>
>> which, when I print it, gives:
>>
>>  -  e s k   r o z h l a s
>>    ^       ^
>>    missing missing
>>
>> There are two missing characters. OK, bad UTF-8 perhaps?
>> But when I try unicode(1) I see:
>>
>> unicode c8 fd
>> È
>> ý
>>
>> Is this 8-bit runes? (!)
>> Is there a name for such a thing?
>> Is this common?
>> Is it just MS code pages, where the >0x7f values happen (by design) to map
>> onto the same letters as UTF-8?
>
> latin1 has the property that if you embed each byte in a rune-sized chunk,
> it's a valid Rune. but latin1 is not valid utf-8.
>
> the reason unicode(1) failed to meet expectations is that the desire was
> to decode the supposed utf-8 byte 0xc8 to a codepoint, but what unicode(1)
> actually did was encode the codepoint 0xc8 (È) into utf-8.
>
> - erik