On Tue, Aug 30, 2016 at 7:36 PM, Johannes Bauer <dfnsonfsdu...@gmx.de> wrote:
> On 29.08.2016 17:59, Chris Angelico wrote:
>
>> Fair enough. If this were something that a lot of programs wanted,
>> then yeah, there'd be good value in stdlibbing it. Character encodings
>> ARE hard to get right, and this kind of thing does warrant some help.
>> But I think it's best not done in core - at least, not until we see a
>> lot more people doing the same :)
>
> I hope this kind of botchery never makes it in the stdlib. It directly
> contradicts "In the face of ambiguity, refuse the temptation to guess."
>
> If you don't know what the charset is, don't guess. It'll introduce
> subtle ambiguities and ugly corner cases and will make the life for the
> rest of us -- who are trying to get their charsets straight and correct
> -- a living hell.
>
> Having such silly "magic" guessing stuff is actually detrimental to the
> whole concept of properly identifying and using character sets.
> Everything about the thought makes me shiver.

In the clinical purity of theoretical work, I absolutely agree with
you, and for that reason, this definitely doesn't belong in the stdlib.
But designers need to leave their wonderlands - the real world is not
so wonderful. (Nan Sharpe, to Alice Liddell.)

If every program in the world understood character encodings and
correctly decoded bytes using a known encoding and encoded text using
the same encoding (preferably UTF-8), then sure, it'd be easy. But when
your program has to cope with other people's
bytes-that-ought-to-represent-text, sometimes guessing IS better than
choking.

This example is a perfect one; a naive byte-oriented server accepts
ASCII-compatible text from a variety of clients, and sends it out to
all clients. (Since all the parts that the server actually parses are
ASCII, this works.) Very commonly, naive Windows clients send text in
the native encoding, e.g. CP-1252, but smarter clients generally send
UTF-8. I want my client to interoperate perfectly with other UTF-8
clients, which is generally easy (the only breakage is if the server
attempts to letter-wrap a massively long word, and ends up breaking a
UTF-8 sequence across lines), but I also want to have a decent fallback
for the eight-bit clients. Obviously I can't *know* the encoding used -
if they were smart enough to send encoding info, they'd most likely use
UTF-8 - so it's either guess, or choke on any non-ASCII bytes.

Another place where guessing is VERY useful is when I'm leafing through
300 subtitle files for "Tangled" and want to know whether they're
accurate transcriptions or not. (Not hypothetical. Been doing exactly
that for a lot of this weekend. It seemed logical, since I've done the
same for "Frozen", and both movies are excellent.) All I have is a
file - a sequence of bytes. I know it's an ASCII-compatible encoding
because the numeric positioning info looks correct.
If my program "avoided the temptation to guess", I would have to
manually test a dozen encodings until one of them looked right to me,
the human; but instead, I use chardet plus some other heuristics, and
generally the program's right on either the first or second guess. That
means just two encodings for me to look at, often just one, and only
going to the full dozen or so if it gets it completely wrong.

The principle "refuse the temptation to guess" applies to core data
types and such (and not even universally there), but NOT to
applications, where you need domain knowledge to make that kind of
call.

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
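
For concreteness, here is a minimal sketch of the guess-rather-than-choke
fallback discussed in this message. It is an illustration only, not the
actual client or subtitle code: the function names (decode_guess,
candidate_decodings) and the candidate lists are hypothetical, and it
uses plain try-in-order decoding from the standard library rather than
chardet or the other heuristics mentioned above.

```python
def decode_guess(data, fallbacks=("cp1252", "latin-1")):
    """Decode bytes as UTF-8 when valid; otherwise guess an eight-bit encoding.

    Returns (text, encoding_used). The fallback order is an assumption:
    CP-1252 first (common for naive Windows clients), then Latin-1,
    which maps every byte value and therefore never fails.
    """
    try:
        # Non-ASCII text in an eight-bit encoding is rarely valid UTF-8,
        # so a clean strict decode is strong evidence of UTF-8.
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass
    for encoding in fallbacks:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Only reached if the caller supplied fallbacks that all failed.
    return data.decode("utf-8", errors="replace"), "utf-8 (lossy)"


def candidate_decodings(data, candidates=("utf-8", "cp1252", "iso-8859-15",
                                           "cp1250", "koi8-r")):
    """List (encoding, text) for every candidate that decodes without error.

    The order of `candidates` is the order a human would review them in;
    a real tool might reorder them using a detector's confidence scores.
    """
    results = []
    for encoding in candidates:
        try:
            results.append((encoding, data.decode(encoding)))
        except UnicodeDecodeError:
            continue  # This encoding cannot represent these bytes.
    return results


if __name__ == "__main__":
    print(decode_guess("naïve café".encode("utf-8")))
    print(decode_guess("naïve café".encode("cp1252")))
    for encoding, text in candidate_decodings("café".encode("cp1252")):
        print(encoding, repr(text))
```

Trying strict UTF-8 first keeps interoperability with UTF-8 clients
intact, and the eight-bit guess only applies to bytes that would
otherwise be rejected. The guess can still be wrong (for example,
CP-1250 text decoded as CP-1252), which is why the second function
returns every clean decoding for a human to check.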