Jacob Kruger wrote:

> I'm busy using something like pyodbc to pull data out of MS access .mdb
> files, and then generate .sql script files to execute against MySQL
> databases using MySQLdb module, but, issue is forms of characters in
> string values that don't fit inside the 0-127 range - current one seems to
> be something like \xa3, and if I pass it through ord() function, it comes
> out as character number 163.
>
> Now issue is, yes, could just run through the hundreds of thousands of
> characters in these resulting strings, and strip out any that are not
> within the basic 0-127 range, but, that could result in corrupting data -
> think so anyway.
>
> Anyway, issue is, for example, if I try something like
> str('\xa3').encode('utf-8') or str('\xa3').encode('ascii'), or

"\xa3" already is a str; str("\xa3") is as redundant as
str(str(str("\xa3"))) ;)

> str('\xa3').encode('latin7') - that last one is actually our preferred
> encoding for the MySQL database - they all just tell me they can't work
> with a character out of range.

encode() goes from unicode to bytes; you want to convert bytes to unicode
and thus need decode(). In this context it is important that you tell us the
Python version. In Python 2 str.encode(encoding) is basically

    str.decode("ascii").encode(encoding)

which is why you probably got a UnicodeDecodeError in the traceback:

>>> "\xa3".encode("latin7")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/encodings/iso8859_13.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa3 in position 0: ordinal not in range(128)

>>> "\xa3".decode("latin7")
u'\xa3'
>>> print "\xa3".decode("latin7")
£

Aside: always include the traceback in your posts -- and always read it
carefully. The fact that "latin7" is not mentioned in the error message
might have given you a hint that the problem was not what you thought it was.

> Any thoughts on a sort of generic method/means to handle any/all
> characters that might be out of range when having pulled them out of
> something like these MS access databases?

Assuming the data in Access is not broken and that you know its encoding,
decode() will work.
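
To make the decode-then-encode round trip concrete, here is a minimal sketch. It assumes the bytes coming out of Access are cp1252; that is only a guess on my part, so substitute whatever encoding your data really uses. The helper name recode() is made up for illustration.

```python
# Sketch only: SOURCE_ENCODING is an assumption, not something known
# about the actual Access database.
SOURCE_ENCODING = "cp1252"
TARGET_ENCODING = "latin7"  # the encoding preferred for the MySQL side


def recode(raw, source=SOURCE_ENCODING, target=TARGET_ENCODING):
    """Decode a bytestring to unicode, then encode it for the target."""
    text = raw.decode(source)   # bytes -> unicode
    return text.encode(target)  # unicode -> bytes


raw = b"\xa3"
text = raw.decode(SOURCE_ENCODING)        # the pound sign, U+00A3
as_latin7 = recode(raw)                   # bytes in the target encoding
as_utf8 = recode(raw, target="utf-8")     # same character, two bytes
```

Both steps raise an exception (UnicodeDecodeError or UnicodeEncodeError) when a byte or character has no mapping, so bad data fails loudly instead of being silently stripped.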

> Another side note is for binary values that might store binary values, I
> use something like the following to generate hex-based strings that work
> alright when then inserting said same binary values into longblob fields,
> but, don't think this would really help for what are really just most
> likely badly chosen copy/pasted strings from documents, with strange
> encoding, or something:
> #sample code line for binary encoding into string output
> s_values += "0x" + str(l_data[J][I]).encode("hex").replace("\\", "\\\\") +
> ", "

I would expect that you can feed bytestrings directly into blobs, without
any preparatory step. Try it, and if you get failures show us the failing
code and the corresponding traceback.
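
As a rough illustration of handing a bytestring to the driver unchanged, here is a sketch that uses the standard-library sqlite3 module as a stand-in, since it runs without a server. It is written for Python 3, and the table and column names are invented. MySQLdb follows the same DB-API pattern, but with %s placeholders instead of ?.

```python
import sqlite3

# Arbitrary binary payload, including a backslash and non-ASCII bytes.
payload = b"\x00\xa3\\\xff binary data"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (id INTEGER PRIMARY KEY, content BLOB)")

# The driver binds the parameter itself: no hex encoding, no manual escaping.
conn.execute("INSERT INTO files (content) VALUES (?)", (payload,))

stored = conn.execute("SELECT content FROM files").fetchone()[0]
conn.close()
```

Passing values as parameters rather than building the SQL text by hand also removes the need for the replace("\\", "\\\\") step.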