[EMAIL PROTECTED] wrote:

> I was going to submit to sourceforge, but my unicode skills are weak.
> I was trying to strip characters from a string that contained values
> outside of ASCII.  I thought I could just encode as 'ascii' in 'replace'
> mode but it threw an error.  Strangely enough, if I decode via the
> ascii codec and then encode via the ascii codec, I get what I want.
> That being said, this may be operating correctly.

encode on 8-bit strings and decode on unicode strings aren't exactly
obvious operations...

> >>> print 'aaa\xae'
> aaa®
> >>> 'aaa\xae'.encode('ascii','replace') #should return 'aaa?'

encode("ascii") is a unicode operation, so when you do this, Python
first attempts to turn your string into a unicode string, using the
default encoding.  that implicit decoding step is what fails, which is
why you get a UnicodeDecodeError from a call to encode:

> Traceback (most recent call last):
>   File "<interactive input>", line 1, in ?
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xae in position 3: ordinal not in range(128)

> >>> 'aaa\xae'.decode('ascii','replace') #but this doesn't throw an error?
> u'aaa\ufffd'

this converts the encoded stream to Unicode, using a "suitable
replacement character" for characters that cannot be converted.
U+FFFD is 'REPLACEMENT CHARACTER', which, I assume, is about as
suitable as you can get.

> >>> 'aaa\xae'.decode('ascii','replace').encode('ascii','replace') #this does what I wanted
> 'aaa?'

this converts the unicode string from the previous step back to ascii,
using a "suitable replacement character" for characters that cannot be
converted.  for 8-bit strings, "?" is a suitable character.

instead of playing codec games, you could use translate or a simple
regular expression:

    outstring = re.sub("[\x80-\xff]", "?", instring)

</F>
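
for comparison, here is a minimal sketch of the same three approaches (codec round trip, regular expression, translate) written for Python 3, which is an assumption on my part: the session quoted above is Python 2. in Python 3 the 8-bit string becomes a bytes object, which has no encode method at all, so the implicit decoding step described above cannot happen. the names raw, round_trip, via_regex, table and via_translate are made up for illustration.

```python
import re

# A byte string with one byte outside the ASCII range (0xae).
raw = b"aaa\xae"

# Codec round trip: decoding with 'replace' turns the bad byte into
# U+FFFD, and encoding with 'replace' turns U+FFFD into "?".
decoded = raw.decode("ascii", "replace")
round_trip = decoded.encode("ascii", "replace")

# Regular expression: replace every byte in the range 0x80-0xff.
via_regex = re.sub(rb"[\x80-\xff]", b"?", raw)

# translate: a 256-entry table that keeps bytes 0-127 as they are
# and maps bytes 128-255 to "?".
table = bytes(range(128)) + b"?" * 128
via_translate = raw.translate(table)

print(repr(decoded), round_trip, via_regex, via_translate)
```

the regular expression and translate variants never build an intermediate unicode string, which is the point of avoiding the codec games.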

--
http://mail.python.org/mailman/listinfo/python-list