Hi. This was labelled off-topic on python-ideas, so I edited it and forwarded it here. Please CC me, as I am not subscribed.
In short: I need a bulletproof way to convert from anything to unicode. This requires some kind of escaping to go forward and back, with helper functions like u2b() (unicode to binary) and b2u() (binary to unicode; the reverse direction also removes the escaping). So far I can't find any code that does just that.

Background story. I need to print the SCons graph. SCons is a build tool, so it has a graph of nodes: what depends on what. I have no idea what a node object could be. I know only that it can have a human-readable representation. Sometimes a node is a filename in some encoding that is not utf-8, and without knowing the encoding, converting it to unicode is not possible without losing information about that filename.

So, here is what Python proposes:

https://docs.python.org/2.7/library/functions.html?highlight=unicode#unicode

The unicode() type constructor doesn't allow you to do the conversion without losing data. It offers three error-handling modes, which come down to two basic strategies, crash or corrupt:

1. ignore - meaning skip and corrupt the data
2. replace - just corrupt the data
3. strict - just crash

Python's design leaves the decision of how to implement safe interoperability to you, and that's basically the reason why Python 3 fails. Without a safe approach (get my binary data back from that unicode) people just can't wrap their heads around that.

Python's design assumes that people know the encoding of the data they are processing, but that's not true in many cases. The data may also be just broken or invalid. So, the real-world coding assumptions are:

1. external data encoding is unknown or varies
2. external data has binary chunks that are invalid for conversion to unicode

In the real world, crashing with UnicodeDecodeError is not an option for dealing with unknown, broken or invalid input (such as when I need to print the human-readable representation of a Node to the screen). In many (most?) situations lossless garbage is more welcome than a crash or data loss, and that should be the default behaviour.
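To make the "crash or corrupt" choice concrete, here is a small sketch of the three error handlers applied to one byte string. The sample filename `caf\xe9.txt` is an assumption made up for illustration (a Latin-1 encoded name that is not valid utf-8); the snippet uses bytes.decode(), which takes the same error arguments as unicode() and should run on both Python 2.7 and Python 3.

```python
# Hypothetical sample: a Latin-1 encoded filename, which is not valid utf-8.
raw = b"caf\xe9.txt"

# strict: the conversion crashes with UnicodeDecodeError.
try:
    raw.decode("utf-8", "strict")
    strict_result = "decoded"
except UnicodeDecodeError:
    strict_result = "crashed"

# ignore: the offending byte is silently dropped.
ignored = raw.decode("utf-8", "ignore")

# replace: the offending byte becomes U+FFFD (the replacement character).
replaced = raw.decode("utf-8", "replace")

print(strict_result)
print(repr(ignored))
print(repr(replaced))

# Neither "successful" result can be turned back into the original bytes.
print(ignored.encode("utf-8") == raw)
print(replaced.encode("utf-8") == raw)
```

In both non-crashing cases the byte 0xE9 is gone for good, so the original filename cannot be recovered from the unicode result.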
The solution is to have a filter preprocess the binary string and escape all bytes that cannot be converted to unicode, so that the following lossless transformation becomes possible:

    binary -> escaped utf-8 string -> unicode -> binary

I want to know if that is realistic. I need to accomplish it with Python 2.x, but the use case is probably valid for Python 3 as well. This is critical for porting SCons to Python 3.x, and I expect the same is true for other similar tools that have to deal with ASCII/binary strings of unknown encoding.

--
anatoly t.
--
https://mail.python.org/mailman/listinfo/python-list
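For illustration only, the round trip described above could look roughly like this on Python 3, assuming the built-in "surrogateescape" error handler (PEP 383) is an acceptable form of escaping. The helper names b2u() and u2b() are the ones proposed earlier in this message, not an existing API, and the sample filename is made up. This sketch does not cover Python 2.x, where that handler is not part of the standard codecs as far as I know.

```python
def b2u(data, encoding="utf-8"):
    """Bytes to text. Each undecodable byte 0xNN is escaped as the lone
    surrogate U+DCNN instead of raising or being dropped."""
    return data.decode(encoding, "surrogateescape")


def u2b(text, encoding="utf-8"):
    """Text back to bytes. Escaped surrogates turn back into the
    original bytes, so the round trip is lossless."""
    return text.encode(encoding, "surrogateescape")


# Hypothetical sample: a Latin-1 encoded filename, which is not valid utf-8.
raw = b"caf\xe9.txt"

text = b2u(raw)
print(ascii(text))       # ascii() is used because lone surrogates may not print
print(u2b(text) == raw)  # the original bytes come back unchanged
```

One caveat with this approach: the resulting string contains lone surrogates, so encoding it with the default strict handler (for example when writing it to a utf-8 terminal) still raises an error. Display code would need its own escaping, such as ascii() or the "backslashreplace" handler.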