On Mon Jan  5 13:48:47 PST 2015, st...@quintile.net wrote:
> I am trying to parse a stream from a tcp connection.
> 
> I think the data is utf8, here is a sample
> 
>        20 2d 20 c8 65 73 6b fd 20 72 6f 7a 68 6c 61 73
> 
> which when I print it I get:
> 
>      -       e  s  k       r  o  z  h  l  a  s           
>            ^          ^
>         missing    missing
> 
> there are two missing characters. Ok, bad UTF8 perhaps?
> but when I try unicode(1) I see:
> 
>       unicode c8 fd
>       È
>       ý
> 
> Is this 8 bit runes? (!)
> Is there a name for such a thing?
> Is this common?
> Is it just MS code pages but the >0x7f values happen (designed to) to map 
> onto the same letters as utf8?

latin1 has the property that if you embed each byte in a rune-sized chunk,
it's a valid Rune.  but latin1 bytes above 0x7f are not valid utf-8 on their own.
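
a rough sketch of that property in python, using the sample bytes from the
question. the helper name is_valid_utf8 is made up for illustration, and
"rune-sized chunk" is modelled here by chr() on each byte, which is an
assumption about what widening means:

```python
# Sample bytes quoted in the original message.
data = bytes.fromhex("20 2d 20 c8 65 73 6b fd 20 72 6f 7a 68 6c 61 73")

def is_valid_utf8(b: bytes) -> bool:
    """Hypothetical helper: True if b decodes cleanly as UTF-8."""
    try:
        b.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Widen every byte to a code point of the same value.
# This is exactly what decoding as latin1 (ISO 8859-1) does.
as_runes = "".join(chr(x) for x in data)

print(is_valid_utf8(data))   # 0xc8 starts a 2-byte sequence, but 0x65 is not a continuation byte
print(as_runes)
```

the bytes are rejected as utf-8, yet every byte widened on its own is a
perfectly good code point.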

the reason that unicode(1) failed to meet expectations is that the desire was
to convert the supposed utf-8 0xc8 to a codepoint, but what unicode did was
convert the codepoint 0xc8 into utf-8.
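
the two directions can be sketched in python. this only mimics the behaviour
described above and is not how unicode(1) is implemented:

```python
# Direction unicode(1) took: code point -> UTF-8 bytes.
encoded_c8 = chr(0xC8).encode("utf-8")   # U+00C8 becomes two bytes
encoded_fd = chr(0xFD).encode("utf-8")   # U+00FD becomes two bytes

# Direction that was wanted: treat the single byte 0xc8 as UTF-8 and decode it.
# A lone 0xc8 is an incomplete sequence, so a strict decode raises an error.
try:
    bytes([0xC8]).decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# With errors="replace" the bad byte turns into U+FFFD, the replacement character.
replaced = bytes([0xC8]).decode("utf-8", errors="replace")

print(encoded_c8.hex(), encoded_fd.hex(), decoded_ok, hex(ord(replaced)))
```

so the glyphs printed for c8 and fd came from the code points U+00C8 and
U+00FD, not from decoding the bytes on the wire.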

- erik
