On Sat, Mar 7, 2015 at 1:03 AM, <random...@fastmail.us> wrote:
> On Fri, Mar 6, 2015, at 08:39, Chris Angelico wrote:
>> Number of code points is the most logical way to length-limit
>> something. If you want to allow users to set their display names but
>> not to make arbitrarily long ones, limiting them to X code points is
>> the safest way (and preferably do an NFC or NFD normalization before
>> counting, for consistency);
>
> Why are you length-limiting it? Storage space? Limit it in whatever
> encoding they're stored in. Why are combining marks "pathological" but
> surrogate characters not? Display space? Limit it by columns. If you're
> going to allow a Japanese user's name to be twice as wide, you've got a
> problem when you go to display it.

To prevent people from putting three paragraphs of lipsum in and
calling it a username.

>> this means you disallow pathological cases
>> where every base character has innumerable combining marks added.
>
> No it doesn't. If you limit it to, say, fifty, someone can still post
> two base characters with twenty combining marks each. If you actually
> want to disallow this, you've got to do more work. You've disallowed
> some of the pathological cases, some of the time, by coincidence. And
> limiting the number of UTF-8 bytes, or the number of UTF-16 code points,
> will accomplish this just as well.

They can, but then they're limited to two base characters. They can't
have fifty base characters with twenty combining marks each. That's
the point.

> Now, if you intend to _silently truncate_ it to the desired length, you
> certainly don't want to leave half a character in, of course. But who's
> to say the base character plus first few combining marks aren't also
> "half a character"? If you're _splitting_ a string, rather than merely
> truncating it, you probably don't want those combining marks at the
> beginning of part two.

So you truncate to the desired length, then if the first character of
the trimmed-off section is a combining mark (based on its Unicode
character types), you keep trimming until you've removed a character
which isn't. Then, if you no longer have any content whatsoever,
reject the name. Simple.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list
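
For illustration, the truncation rule described above could be sketched
roughly like this in Python 3, where str is indexed by code point. The
names limit_name, is_combining and max_code_points are made up for this
example, and treating "combining mark" as any general category starting
with "M" (Mn, Mc, Me) is an assumption, not something the thread pins down.

```python
import unicodedata


def is_combining(ch):
    # Assumption: a "combining mark" is any character whose Unicode
    # general category is Mn, Mc or Me.
    return unicodedata.category(ch).startswith("M")


def limit_name(name, max_code_points):
    """Return the name cut to at most max_code_points code points,
    or None if nothing usable is left (i.e. reject the name)."""
    # Normalize first so equivalent inputs are counted consistently.
    normalized = unicodedata.normalize("NFC", name)
    if len(normalized) <= max_code_points:
        kept = normalized
    else:
        cut = max_code_points
        # If the first trimmed-off character is a combining mark, the
        # cut landed inside a base+marks cluster. Keep moving the cut
        # left until the first trimmed-off character is not a mark, so
        # the whole cluster is removed.
        while cut > 0 and is_combining(normalized[cut]):
            cut -= 1
        kept = normalized[:cut]
    # An empty result means there is no content left: reject.
    return kept or None
```

With this sketch, a short plain name passes through unchanged, a name
whose cut point falls inside a run of combining marks loses that whole
cluster, and a name that is a single base character buried under marks
comes back as None.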