On Fri, Nov 25, 2016 at 7:29 PM, Mark Summerfield <l...@qtrac.plus.com> wrote:
> The article has a section called:
>
> "Statically Typed Strings"
>
> The title is wrong of course because Python uses dynamic typing. But his
> chief complaint seems to be that you can't mix strings and bytes in Python 3.
> That's a deliberate design choice that several Python core developers have
> explained. Essentially they are saying that you can't concatenate a bunch of
> raw bytes with a string in the same way that you can't add a set to a list --
> and this makes perfect sense because raw bytes could be just bytes, or they
> could be a representation of text in which case by specifying the encoding
> (i.e., converting them to a string) the concatenation can take place. And
> this is in keeping with Python's core philosophy of being explicit.
>
It's worse than that. Look at his comparison of Py3 and Py2. I've
shortened them some to highlight the part I'm looking at:

x = bytes("hello", 'utf-8')
y = "hello"
def addstring(a, b):
    return a + b
addstring(x, y) # TypeError

==========

def addstring(a, b):
    return a + b
x = "hello"
y = bytes("hello")
addstring(x, y) # 'hellohello'

==========

He clearly does not understand the difference between bytes and text,
as has been proven earlier, but this demonstrates that he doesn't even
understand the difference between Python's data types. The first example
is trying to add a bytestring to a Unicode string; the second is
actually adding two byte strings. He could have given a demo of how
Python 2 lets you join str and unicode, but it would have spoiled his
Py2 code by putting u'hellohello' into his output, and making Py3
actually look better. Can't have that.

Then he says:

> If they're going to require beginners to struggle with the difference between
> bytes and Unicode the least they could do is tell people what variables are
> bytes and what variables are strings.
>

The trouble is, by the time you're adding bytes and text, you're not
looking at variables any more. You're looking at objects. I don't think
he's properly understood Python's object model.

Here, have some FUD:

> Strings are also most frequently received from an external source, such as a
> network socket, file, or similar input. This means that Python 3's statically
> typed strings and lack of static type safety will cause Python 3 applications
> to crash more often and have more security problems when compared with Python
> 2.
>

What security problems? Any evidence of that? On the face of it, without
any actual specific examples, which of these would you expect to be more
security-problem-prone: mixing data types, or throwing exceptions? In a
web application, an exception can be caught at a high level, logged, and
handled by kicking a 500 back to the client.
In other applications, there may be an equivalent, or you just terminate
the server (client gets disconnected) and start up again. At worst, this
means that someone can exploit the whole "crash and restart" thing as a
way to DOS you.

Here, let me walk you through some different numeric types, and you tell
me which ones are equal and which aren't - and the security implications
of that:

1) 1e2 == 100 ?
2) 1e2 == "100" ?
3) "1e2" == 100 ?
4) "1e2" == "100" ?

#1 makes perfect sense. Python says, yes, this is the case. (Not all
languages will; 1e2 is a floating-point literal, 100 is an integer, and
it's conceivable to keep them separate.)

#2 is acceptable to languages with "sloppy comparison" and "strict
comparison" operators, like ECMAScript/JavaScript. The number 100 is
(non-strictly) equal to the string "100".

#3 depends on whether sloppy comparisons are done by converting to
string or converting to number. ECMAScript treats them as equal, but I'm
just as happy with that being false (actually, probably slightly
happier).

#4 makes no sense to any sane programmer [1], which must be why PHP
chose to have that one be true. Security implications of two different
hexadecimal strings comparing equal.... that can't have any bearing on
passwords now, can it...

Is b"hello" == u"hello" ever a security consideration? If it is, my
money is on the exception being the *more* secure option.

Straight-up false:

> The point being that character encoding detection and negotiation is a solved
> problem.

Nope, nope it isn't. One of my hobbies is collecting movie subtitles in
various languages [2]. They generally come to me in eight-bit encodings
with no declaration. Using only internal evidence, chardet has about a
66% hit rate at a pass mark of "readable enough that I can figure out
the language", and a much lower hit rate at "actually the correct
encoding".
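
To make the ambiguity concrete, here's a minimal sketch using only the
standard codecs (no chardet involved). The sample word and the list of
candidate encodings are my own picks for illustration, not anything from
a real subtitle file. The same six bytes decode without error under
every candidate, and only one of the results is the text that was
actually meant:

```python
# Encode a Russian word in KOI8-R, then pretend we were never told the encoding.
raw = "Привет".encode("koi8_r")

# Four eight-bit encodings that all accept these byte values.
candidates = ["koi8_r", "cp1251", "latin-1", "iso8859_5"]

# Every decode succeeds; nothing in the bytes says which result is right.
decoded = {enc: raw.decode(enc) for enc in candidates}

for enc, text in decoded.items():
    print(enc, repr(text))
```

A detector can only rank the candidates by how plausible the output
looks; it can't know which one the author used.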
With better heuristics (maybe a set of rules specifically aimed at
reading subtitle files), that could probably get as far as 100% readable
and 75% correct, but it is *never* going to be perfect, because *the
input is ambiguous*. If other languages appear to have gotten this
right, it's probably because they either enforce a single encoding (eg
UTF-8), or just ignore the whole problem, assuming that someone else
will have to deal with it.

Mark says:

> He's right! The % formatting was kept to help port old code, the new
> .format() which is far more versatile is a bit verbose, so finally they've
> settled on f-strings. So, you do need to know that all three exist (e.g., for
> maintaining code), but you can easily choose the style that you prefer and
> just use that.
>

Not strictly true. An f-string is a special construct that isn't as
flexible as the other formatting types. You can read a bracey or
percent-marked string from a file, then interpolate it with values at
run time. You can't do that with an f-string, short of messing around
with eval (which is not as pretty as just msg.format(...), plus you have
to trust your external file as if it were code). This has strong
implications for i18n/l10n, and some other situations as well. So the
other string formatting facilities aren't ever going to die, and you
really should learn one of them.

I do agree that you're welcome to teach just one of them, though, and
worry about the other if and when it ever comes up. Personally, I quite
like percent-formatting, because it's the same as can be used in a lot
of other languages (including shell scripting, via GNU printf), but
brace-formatting lets you reorder the parameters, so it has flexibility
that can be important for i18n. So my conclusion would probably be: Use
f-strings for the simple cases where you'd be using a literal, and then
have a glance at each of the other two, so you know they're there when
the time comes.
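
Here's a rough sketch of the run-time interpolation I mean. The
templates dict stands in for strings loaded from a translation file, and
the names (templates, render, count, dest) are made up for the example.
The point is that the template is ordinary data when the program runs,
which an f-string literal can never be:

```python
# Hypothetical message catalogue; in real code these strings would be
# read from a file at run time rather than written into the source.
templates = {
    "en": "{count} files copied to {dest}",
    "de": "Nach {dest} wurden {count} Dateien kopiert",
}

def render(lang, **values):
    """Interpolate values into whichever template the language selects."""
    return templates[lang].format(**values)

english = render("en", count=3, dest="/tmp")
german = render("de", count=3, dest="/tmp")  # parameters appear in a different order

# Percent-formatting can do the same job with a mapping on the right-hand side.
percent = "%(count)d files copied to %(dest)s" % {"count": 3, "dest": "/tmp"}

print(english)
print(german)
print(percent)
```

Named fields are what let the German template put the destination
first without the calling code changing at all.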
He concludes that Py3 is still unusable because he keeps trying to port
code and failing. That says, to me, that he needs to take a step back
and learn about the fundamental difference between text and bytes, and
that might mean learning a bit of a language like Russian or Japanese,
where text obviously can't be squeezed into ASCII or into a typical
US-English eight-bit character set.

For my part, though, I can attest that *not one* of my students has had
a problem with Py3, ever since the course switched over. And that
includes three (so far) who, after a month and a half of learning
JavaScript, are given five days to learn Python and do something useful
with it. Five days. They start on Monday, and by Friday
close-of-business, they demonstrate what they've learned (in a group
where all the students have learned something in a week - eg Angular.js,
Socket.io, React Native, Ruby, mobile app design, etc), and it goes into
the portfolio.

Now, if Python 3 were impossible to learn, you would expect these people
to struggle. They don't. In fact, as I was mentoring one of them, I kept
telling him "scope it back, scope it back, you have only X days to
finish this" - but he charged ahead and did everything anyway. Python is
pretty easy to learn; all three of my flex week students had moved
beyond messing with the language before the end of Monday, and were onto
actual productive work on the project.

And it's probably even easier for a perfectly new programmer to
understand. You ignore bytes altogether until you start working with
networks (even with disks, you can read and write in text mode) or
actual binary data (graphic file formats or something). Text behaves the
way you'd expect text to. You can use English variable names - but you
can also use French, or Swedish, or Russian, or Japanese, because Python
doesn't restrict you to ASCII. It does exactly what you'd expect. You
just have to expect based on a human's outlook, rather than a C
programmer's.
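
A small sketch of what that looks like in practice, assuming Python 3.
The file name, the greeting, and the Russian variable name are all
invented for the example. Text goes out and comes back as text, and
bytes only show up when you explicitly ask for them:

```python
import os
import tempfile

# A Russian identifier ("greeting") holding Russian text.
приветствие = "Привет, мир"

# Hypothetical scratch file in a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "greeting.txt")

# Text mode: write a str, read a str. The encoding is stated once, at the edge.
with open(path, "w", encoding="utf-8") as f:
    f.write(приветствие)
with open(path, encoding="utf-8") as f:
    text = f.read()

# Binary mode is where bytes appear, and only because we asked for them.
with open(path, "rb") as f:
    raw = f.read()

print(text == приветствие)
print(len(text), "characters,", len(raw), "bytes")
```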
ChrisA

[1] I am, however, open to the argument that computer programmers are
by definition not sane.
[2] eg for Disney's "Frozen":
https://github.com/Rosuav/LetItTrans/tree/master/entire
-- 
https://mail.python.org/mailman/listinfo/python-list