Ben Finney <[EMAIL PROTECTED]> writes:

> glacier <[EMAIL PROTECTED]> writes:
>
> > I use chinese charactors as an example here.
> >
> > >>>s1='你好吗'
> > >>>repr(s1)
> > "'\\xc4\\xe3\\xba\\xc3\\xc2\\xf0'"
> > >>>b1=s1.decode('GBK')
> >
> > My first question is : what strategy does 'decode' use to tell the
> > way to seperate the words. I mean since s1 is an multi-bytes-char
> > string, how did it determine to seperate the string every 2bytes
> > or 1byte?
>
> The codec you specified ("GBK") is, like any character-encoding
> codec, a precise mapping between characters and bytes. It's almost
> certainly not aware of "words", only character-to-byte mappings.
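To make that concrete, here is a rough sketch (Python 2, matching the
session quoted above) of why the codec never has to guess where one
character ends and the next begins. The lead-byte test in the loop is a
simplification of GBK's actual byte-range rules, not the real codec's
implementation:

    # Bytes copied from the repr() in glacier's session above; this is
    # '你好吗' encoded as GBK (three characters, two bytes each).
    s1 = '\xc4\xe3\xba\xc3\xc2\xf0'

    u1 = s1.decode('gbk')
    print len(s1), len(u1)        # -> 6 3  (six bytes in, three characters out)

    # Simplified version of the rule the codec applies: a byte in the
    # ASCII range stands alone, while a byte of 0x81 or above is the
    # lead byte of a two-byte GBK character.  (The real codec also
    # validates the trailing byte and handles error cases.)
    i = 0
    while i < len(s1):
        if ord(s1[i]) >= 0x81:
            pair = s1[i:i+2]
            print repr(pair), '->', repr(pair.decode('gbk'))
            i += 2
        else:
            print repr(s1[i]), '->', repr(s1[i].decode('gbk'))
            i += 1

The point is that the decision is made byte by byte from the encoding's
own rules; no knowledge of "words" is needed or used.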
To be clear, I should point out that I didn't mean to imply static
tabular mappings only. The mappings in a character encoding are often
more complex and algorithmic. That doesn't make them any less precise,
of course; and the core point is that a character-mapping codec is
*only* about getting between characters and bytes, nothing else.

-- 
 \        "He who laughs last, thinks slowest."  -- Anonymous |
  `\                                                          |
_o__)                                                         |
Ben Finney
-- 
http://mail.python.org/mailman/listinfo/python-list