On Wed, Jan 7, 2015 at 11:02 PM, Ned Batchelder <n...@nedbatchelder.com> wrote:
>> Any thoughts on a sort of generic method/means to handle any/all
>> characters that might be out of range when having pulled them out of
>> something like these MS access databases?
>
> The best thing is to know what encoding was used to produce these byte
> values. Then you can manipulate them as Unicode if you need to. The second
> best thing is to simply pass them through as bytes.

If you can't know for sure, you could hazard a guess. There's a good
chance that an eight-bit encoding from a Microsoft product is CP-1252.
In fact, when I interoperate with Unicode-unaware Windows programs, I
usually attempt a UTF-8 decode, and if that fails, I simply assume
CP-1252; this generally gives correct results for data coming from
US-English Windows users.

Jacob, have a look at your data. Contextually, would the '\xa3' be
likely to be a pound sign, £? Would '\x85' make sense as an ellipsis?
Would '\x91', '\x92', '\x93', and '\x94' seem to be used for quote
marks? If so, CP-1252 would be the encoding to use.

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
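
The "try UTF-8, fall back to CP-1252" approach described above can be sketched roughly as follows. This is only an illustration, assuming Python 3 and that the data arrives as bytes; the function name decode_guess is made up for this example, and the choice to replace undefined bytes rather than raise is an assumption, not something stated in the thread.

```python
def decode_guess(raw):
    """Decode bytes as UTF-8 if possible, otherwise assume CP-1252.

    Hypothetical helper for illustration only.
    """
    try:
        # Valid UTF-8 is unlikely to occur by accident in CP-1252 text
        # that contains non-ASCII bytes, so try the strict decode first.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Python's cp1252 codec leaves a few byte values undefined
        # (0x81, 0x8D, 0x8F, 0x90, 0x9D). errors="replace" turns those
        # into U+FFFD instead of raising; that is an assumption here.
        return raw.decode("cp1252", errors="replace")


if __name__ == "__main__":
    print(decode_guess(b"caf\xc3\xa9"))        # valid UTF-8
    print(decode_guess(b"\xa3100"))            # not UTF-8, decoded as CP-1252
    print(decode_guess(b"\x93quoted\x94\x85"))  # CP-1252 quotes and ellipsis
```

Pure-ASCII input decodes the same way under either encoding, so the guess only matters for bytes at or above 0x80. If the source encoding is actually something else (for example another Windows code page), this will silently produce the wrong characters, so it is worth checking a sample of the output by eye.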