[issue35883] Change invalid unicode characters to replacement characters in argv

2020-05-24 Thread Johannes Berg
Johannes Berg added the comment: I've also filed https://sourceware.org/bugzilla/show_bug.cgi?id=26034 for glibc, because that's really where the issue seems to be. But perhaps Python should be forgiving of glibc errors here.

[issue35883] Change invalid unicode characters to replacement characters in argv

2020-05-24 Thread Johannes Berg
Johannes Berg added the comment: Like I said above, it could be argued that the bug is in glibc, and then https://p.sipsolutions.net/6a4e9fce82dbbfa0.txt could be used as a simple LD_PRELOAD wrapper to work around this, just to illustrate the problem from that side. Arguably, that makes

[issue35883] Change invalid unicode characters to replacement characters in argv

2020-05-24 Thread Johannes Berg
Johannes Berg added the comment: And wrt. _Py_DecodeUTF8Ex() - it doesn't seem to help. But that's probably because I'm on neither __ANDROID__ nor __APPLE__, so regardless of whether current_locale is non-zero, we end up in decode_current_locale() where the impedance m

[issue35883] Change invalid unicode characters to replacement characters in argv

2020-05-24 Thread Johannes Berg
Johannes Berg added the comment: In fact that python one-liner works with just about everything else that you can throw at it, just not something that "looks like utf-8 but isn't". And of course adding LC_CTYPE=ascii or something like that fixes it, as you'

[issue35883] Change invalid unicode characters to replacement characters in argv

2020-05-24 Thread Johannes Berg
Johannes Berg added the comment: A simple test case is something like

    ./python -c 'import sys; print(sys.argv[1].encode(sys.getfilesystemencoding(), "surrogateescape"))' "$(echo -ne '\xfa\xbd\x83\x96\x80')"

Which you'd probably expect to pr
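For illustration, here is a minimal sketch of what Python's own UTF-8 codec does with the bytes from the test case above, independent of the glibc decode path. The helper lenient_decode_5byte() is hypothetical and not part of CPython or glibc: it only shows the value a decoder would compute if it followed the obsolete 5-byte UTF-8 bit layout without validating, and makes no claim about glibc's exact output.

```python
# The argument bytes from the test case above.
raw = b"\xfa\xbd\x83\x96\x80"

# Strict UTF-8 rejects them: 0xFA is not a valid lead byte under RFC 3629.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape maps each undecodable byte 0xNN to the lone surrogate
# U+DCNN, so encoding with the same handler restores the original bytes.
escaped = raw.decode("utf-8", "surrogateescape")
round_tripped = escaped.encode("utf-8", "surrogateescape")


def lenient_decode_5byte(data: bytes) -> int:
    """Hypothetical helper: treat data as one obsolete 5-byte sequence
    (lead byte 111110xx plus four 10xxxxxx bytes) with no validation."""
    cp = data[0] & 0x03              # 2 payload bits from the lead byte
    for b in data[1:5]:
        cp = (cp << 6) | (b & 0x3F)  # 6 payload bits per continuation byte
    return cp


lenient_cp = lenient_decode_5byte(raw)

print(strict_ok)
print([hex(ord(c)) for c in escaped])
print(round_tripped == raw)
print(hex(lenient_cp), lenient_cp > 0x10FFFF)
```

The last line shows why such input "looks like utf-8 but isn't": read leniently, the sequence yields a value beyond the Unicode range (maximum U+10FFFF), which a str cannot represent, whereas the surrogateescape path keeps every byte recoverable.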

[issue35883] Change invalid unicode characters to replacement characters in argv

2020-05-24 Thread Johannes Berg
Johannes Berg added the comment: Pretty sure this is still an issue; I see it on current git master. This seems to work around it: https://p.sipsolutions.net/603927f1537226b3.txt Basically, it seems that mbstowcs() and mbrtowc() on glibc with utf-8 just blindly decode even invalid UTF-8 to