On Mon Jan 5 13:48:47 PST 2015, st...@quintile.net wrote:
> I am trying to parse a stream from a tcp connection.
>
> I think the data is utf8, here is a sample
>
> 20 2d 20 c8 65 73 6b fd 20 72 6f 7a 68 6c 61 73
>
> which when I print it I get:
>
>    -        e  s  k        r  o  z  h  l  a  s
>          ^           ^
>       missing     missing
>
> there are two missing characters. Ok, bad UTF8 perhaps?
> but when I try unicode(1) I see:
>
> unicode c8 fd
> È
> ý
>
> Is this 8 bit runes? (!)
> Is there a name for such a thing?
> Is this common?
> Is it just MS code pages but the >0x7f values happen (designed to) to map
> onto the same letters as utf8?

latin1 has the property that if you embed each byte in a rune-sized
chunk, the result is a valid Rune, since the first 256 unicode
codepoints are latin1. but latin1 bytes above 0x7f are not valid
utf-8 on their own, which would explain the two missing characters
when the sample is printed as utf-8.

unicode(1) failed to meet expectations because the desire was to
convert the supposed utf-8 byte 0xc8 to a codepoint, but what unicode
did was convert the codepoint 0xc8 into utf-8.

- erik