Hello,

On Thursday, 18 February 2021 at 20:54 +0100, Jérémy Korwin-Zmijowski wrote:
> I happily managed to find some time to write a new chapter for the
> Guile Hacker Handbook !
>
> https://jeko.frama.io/en/char-sets.html
>
> It deals with char-sets, something new to me. The exercise was fun, I
> liked how convenient it is to play with these data type.

The use of Unicode makes it tempting to think that each thing you can index in a string is a character. This works most of the time, but breaks down for some languages. The remark is general and applies in many situations, including the previous chapter about characters. I suggest reading:

https://manishearth.github.io/blog/2017/01/14/stop-ascribing-meaning-to-unicode-code-points/

Fortunately, very few problems in international software need to look at the individual characters of a string. Your password rules example is arguably one of them, although it may make non-Latin users angry (the upper case / lower case distinction does not exist in Chinese, as far as I know). The other example I am aware of is limiting the size of a message so that the reader does not get bored (so, not for storage reasons). One website famously limits the number of Unicode code points in a message, although the actual rule is much more complex and opinionated than expected (https://developer.twitter.com/en/docs/counting-characters).

I think that demonstrating general code that works with Latin text except for "special characters" is rude to the rest of the world, and should not be put in such a strategic place as the Guile Hacker Handbook. For your example, I suggest switching to something that has more structure and is purposely Latin, for instance checking the validity of IBAN account numbers, car license plates in a given country, maybe your grocery store's customer ID... You can also invent your own.

The previous chapter about characters gives a lot of weight to letter intervals, which are even more difficult: the locale order would put 'é' after 'e' and before 'f', but the char>=? predicate compares code points and puts it after every unaccented letter. So this does not even work for all Latin text. And if you use the locale order, you no longer have meaningful character ranges at all.

Unicode is a very complex beast, with very few general use cases. Don't let that discourage you.

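For what it's worth, here is a minimal sketch of the two pitfalls described above, as I understand Guile's behavior (strings are indexed by code point, and char>? compares code points). The variable names are mine and purely illustrative.

```scheme
;; "é" written two ways: as one precomposed code point (U+00E9), or as
;; a plain "e" followed by U+0301 COMBINING ACUTE ACCENT.
(define precomposed "\u00e9")
(define decomposed  "e\u0301")
(define e-acute (integer->char #xE9))

;; What you can index in a string is a code point, not a "character"
;; as a reader would perceive it.
(display (string-length precomposed)) (newline)        ; prints 1
(display (string-length decomposed)) (newline)         ; prints 2
(display (string=? precomposed decomposed)) (newline)  ; prints #f

;; Normalizing to NFC makes the two spellings compare equal again.
(display (string=? precomposed (string-normalize-nfc decomposed)))
(newline)                                              ; prints #t

;; Code point order puts 'é' after 'f', and even after 'z', so the
;; interval a..z does not contain it.
(display (char>? e-acute #\f)) (newline)               ; prints #t
(display (char>? e-acute #\z)) (newline)               ; prints #t
```

Locale-aware comparison is available through string-locale<? and char-locale<? in (ice-9 i18n), but the result then depends on which locales are installed on the machine, which is another reason character intervals are a shaky foundation.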
Fortunately, most everyday computing tasks can be solved without going down to Unicode character semantics. As a general idea, I would suggest staying away from characters and starting with strings.
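
To make the IBAN suggestion more concrete, here is a rough sketch of what such an example could look like with char-sets. It only implements the generic length bounds and the mod 97 checksum, not the per-country formats, and every name in it (iban-digits, iban-letters, iban-chars, char->iban-value, iban-valid?) is hypothetical and mine, not something from the handbook.

```scheme
(use-modules (srfi srfi-14))  ; char-sets (already in Guile's core)

;; Explicit ASCII sets on purpose: char-set:digit and
;; char-set:upper-case would also accept non-Latin digits and letters.
(define iban-digits  (string->char-set "0123456789"))
(define iban-letters (string->char-set "ABCDEFGHIJKLMNOPQRSTUVWXYZ"))
(define iban-chars   (char-set-union iban-digits iban-letters))

;; Digits keep their value; letters map to A=10, B=11, ..., Z=35.
(define (char->iban-value c)
  (if (char-set-contains? iban-digits c)
      (- (char->integer c) (char->integer #\0))
      (+ 10 (- (char->integer c) (char->integer #\A)))))

(define (iban-valid? str)
  ;; Spaces are only a printing convenience, so drop them first.
  (let* ((compact (string-delete #\space str))
         (len     (string-length compact)))
    (and (<= 15 len 34)                    ; generic IBAN length bounds
         (string-every iban-chars compact) ; nothing but A-Z and 0-9
         ;; Move the country code and check digits to the end, then
         ;; read the whole thing as one big number modulo 97.
         (let ((rearranged (string-append (substring compact 4)
                                          (substring compact 0 4))))
           (= 1 (string-fold
                 (lambda (c acc)
                   (let ((v (char->iban-value c)))
                     ;; A letter contributes two decimal digits.
                     (modulo (+ (* acc (if (< v 10) 10 100)) v) 97)))
                 0
                 rearranged))))))

(display (iban-valid? "GB82 WEST 1234 5698 7654 32")) (newline) ; prints #t
(display (iban-valid? "GB82 WEST 1234 5698 7654 33")) (newline) ; prints #f
```

I deliberately left out string-upcase: case mapping is itself Unicode-aware, so a lenient version could let some non-ASCII letters slip through as if they were Latin ones. Requiring upper case ASCII on input keeps the example honest about being Latin-only.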