On Mon, Jan 6, 2014 at 12:22 AM, Ned Batchelder <n...@nedbatchelder.com> wrote:
> If anyone wants Python 3 uptake improved, the best thing would be to either
> explain to Armin how he missed the easy way to do what he wants (seems
> unlikely), or advocate to the core devs why they should change things to
> improve this situation.

I'm not sure that there is an "easy way". See, here's the deal. If all your data is ASCII, you can shut your eyes to the difference between bytes and text and Python 2 will work perfectly for you. Then some day you'll get a non-ASCII character come up (or maybe you'll get all of Latin-1 "for free" and it's when you get a non-Latin-1 character - same difference), and you start throwing in encode() and decode() calls in places. But you feel like you're fixing little problems with little solutions, so it's no big deal.

Making the switch to Python 3 forces you to distinguish bytes from text, even when that text is all ASCII. Suddenly that's a huge job, a huge change through all your code, and it's all because of this switch to Python 3. The fact that you then get the entire Unicode range "for free" doesn't comfort people who are dealing with URLs and are confident they'll never see anything else (if they *do* see anything else, it's a bug at the far end).

Maybe it's the better way, but like trying to get people to switch from MS Word onto an open system, it's far easier to push for Open Office than for LaTeX. Getting your head around a whole new way of thinking about your data is work, and people want to be lazy. (That's not a bad thing, by the way. Laziness means schedules get met.)

So what can be done about it? Would it be useful to have a type that represents an ASCII string? (Either 'bytes' or something else, it doesn't matter what.) I'm inclined to say no, because as of the current versions, encoding/decoding UTF-8 has (if I understand correctly) been extremely optimized in the specific case of an all-ASCII string; so the complaint that there's no "string formatting for bytes" could be resolved by simply decoding to str, then encoding to bytes. I'd look on that as having two costs, a run-time performance cost and a code readability cost, and then look at reducing each of them - but without blurring the bytes/text distinction.

Yes, that distinction is a cost.
It's like any other mental cost, and it just has to be paid. The only way to explain it is that Py2 has the "cost gap" between ASCII (or Latin-1) and the rest of Unicode, but Py3 puts that cost gap before ASCII, and then gives you all of Unicode for the same low price (just $19.99 a month, you won't even notice the payments!).

Question, to people who have large Py2 codebases that manipulate mostly-ASCII text. How bad would it be to your code to do this:

# Py2: build a URL
url = "http://my.server.name/%s/%s" % (path, fn)

# Py3: build a URL as bytes
def B(s):
    if isinstance(s, str):
        return s.encode()
    return s.decode()

url = B(B(b"http://my.server.name/%s/%s") % (path, fn))

?

This little utility function lets you do the formatting as text (let's assume the URL pattern comes from somewhere else, or you'd just strip off the b'' prefix), while still mostly working with bytes. Is it an unacceptable level of code clutter?

ChrisA
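
A note on the B() sketch above: as written it assumes path and fn are already str. If they are bytes, Python 3's str formatting embeds their repr (b'...') in the result instead of failing. Here is a minimal sketch of a variant that accepts either type, assuming strict ASCII so that anything else fails loudly. The name bformat and its encoding parameter are hypothetical, made up for illustration, and this is only one possible shape for such a helper:

```python
# Hypothetical helper (not from the original post): format a bytes
# template by round-tripping through str, accepting bytes or str args.
def bformat(template, *args, encoding="ascii"):
    """Return template % args as bytes, decoding/encoding with `encoding`."""
    def to_text(value):
        # Decode bytes-like values; leave everything else for % to handle.
        if isinstance(value, (bytes, bytearray)):
            return value.decode(encoding)
        return value

    text = to_text(template) % tuple(to_text(a) for a in args)
    return text.encode(encoding)


# Example values are placeholders, mixing a bytes and a str argument.
url = bformat(b"http://my.server.name/%s/%s", b"some/path", "file.txt")
```

With the default strict ASCII codec, a non-ASCII byte or character raises UnicodeDecodeError or UnicodeEncodeError rather than being guessed at, which fits the "it's a bug at the far end" assumption. The run-time cost of the two conversions is still there; this only hides the clutter in one place.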