In article <515941d8$0$29967$c3e8da3$54964...@news.astraweb.com>, Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info> wrote:
> [...]
>
>> OK, that leads to the next question. Is there any way I can (in Python
>> 2.7) detect when a string is not entirely in the BMP? If I could find
>> all the non-BMP characters, I could replace them with U+FFFD
>> (REPLACEMENT CHARACTER) and life would be good (enough).
>
> Of course you can do this, but you should not. If your input data
> includes character C, you should deal with character C and not just throw
> it away unnecessarily. That would be rude, and in Python 3.3 it should be
> unnecessary.

The import job isn't done yet, but so far we've processed 116 million
records and had to clean up four of them. I can live with that.
Sometimes practicality trumps correctness.

It turns out the problem is that the version of MySQL we're using
doesn't support non-BMP characters. Newer versions do (but you have to
declare the column to use the utf8mb4 character set). I could upgrade
to a newer MySQL version, but it's just not worth it.

Actually, I did try spinning up a 5.5 instance (one of the nice things
about being in the cloud) and experimented with that, but couldn't get
it to work there either. I'll admit that I didn't invest a huge amount
of effort to make that work before just writing this:

    def bmp_filter(self, s):
        """Filter a unicode string to remove all non-BMP (basic
        multilingual plane) characters.  All such characters are
        replaced with U+FFFD (Unicode REPLACEMENT CHARACTER).

        """
        # Fast path: nearly every record is already BMP-only.
        if all(ord(c) <= 0xffff for c in s):
            return s
        else:
            self.logger.warning("making %r BMP-clean", s)
            # Swap each character above U+FFFF for U+FFFD.
            bmp_chars = [(c if ord(c) <= 0xffff else u'\ufffd') for c in s]
            return u''.join(bmp_chars)
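
For comparison, the same replacement can be sketched with a regular
expression. This is only an illustration, not part of the import job
above: the names bmp_clean and _NON_BMP are made up here, and the sketch
assumes Python 3.3 or later, where every element of a str is a full code
point. On a narrow Python 2.7 build, non-BMP characters are stored as
surrogate pairs, so neither this pattern nor a per-character ord() test
would behave as intended there.

```python
import re

# Hypothetical name: matches any single code point outside the BMP
# (U+10000 through U+10FFFF). Assumes Python 3.3+.
_NON_BMP = re.compile('[\U00010000-\U0010FFFF]')


def bmp_clean(s):
    """Return s with every non-BMP character replaced by U+FFFD."""
    return _NON_BMP.sub('\ufffd', s)
```

A string that is already BMP-only comes back unchanged, so no separate
fast path is needed; logging the affected records, as bmp_filter does,
would have to be added separately.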