Re: Guile Hacker Handbook - Character sets

divoplade Thu, 18 Feb 2021 15:16:06 -0800

Hello,

On Thursday, 18 February 2021 at 20:54 +0100, Jérémy Korwin-Zmijowski
wrote:
> I happily managed to find some time to write a new chapter for the
> Guile Hacker Handbook !
> 
> https://jeko.frama.io/en/char-sets.html
> 
> It deals with char-sets, something new to me. The exercise was fun, I
> liked how convenient it is to play with these data type.

Unicode makes it tempting to think that each thing you can index in a
string is a character. What you index is a code point, and that matches
what a reader sees as one character most of the time, but not always:
a letter followed by a combining accent, for instance, takes more than
one code point. This remark is general, and applies in many situations,
including the previous chapter about characters. I suggest reading:
https://manishearth.github.io/blog/2017/01/14/stop-ascribing-meaning-to-unicode-code-points/
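
To make the point concrete, here is a minimal sketch, written in Python
rather than Guile purely for illustration. It assumes nothing beyond the
standard unicodedata module, and the variable names are mine. It shows
the same visible letter taking one or two indexable positions depending
on how it is encoded:

```python
import unicodedata

composed = "\u00e9"      # 'é' as a single precomposed code point
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

# Both render as the same letter, but they index differently.
print(len(composed))             # 1
print(len(decomposed))           # 2
print(composed == decomposed)    # False

# Normalizing to NFC recombines the pair, but not every sequence
# has a precomposed form, so this is not a general fix.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```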

Fortunately, very few problems in internationalized software need to
look at the individual characters of a string. Your password rules
example is arguably one of them, although it may make non-Latin users
angry (the upper case / lower case distinction does not exist in
Chinese, as far as I know). The other example that I'm aware of is
limiting the length of a message so that the reader does not get bored
(so, not for storage reasons). One website famously limits the number
of Unicode code points in a message, although the rule is in fact much
more complex and opinionated than expected
(https://developer.twitter.com/en/docs/counting-characters).
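
As a rough illustration of the password problem (again in Python, not
Guile, with a hypothetical rule named has_upper_and_lower that I made
up for the occasion), a "one upper case and one lower case letter" rule
can never be satisfied by a password written only in Chinese
characters:

```python
def has_upper_and_lower(password: str) -> bool:
    """Hypothetical rule: at least one upper case and one lower case letter."""
    return (any(c.isupper() for c in password)
            and any(c.islower() for c in password))

print(has_upper_and_lower("Correct horse"))   # True
print(has_upper_and_lower("密码很长很安全"))     # False

# These characters have no case at all, so case conversion is a no-op.
print("密".upper() == "密" and "密".lower() == "密")  # True
```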

I think that demonstrating general-purpose code that works with Latin
text except for "special characters" is rude to the rest of the world,
and should not be put in such a strategic place as the Guile Hacker
Handbook.

For your example, I suggest switching to something that has more
structure and is deliberately Latin, for instance checking the validity
of IBAN accounts, car license plates in an applicable country, maybe
your grocery store's customer ID... You can also invent your own.
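
To show what such a deliberately ASCII example could look like, here is
a sketch of the standard IBAN mod-97 checksum, in Python rather than
Guile. The names ALLOWED and iban_checksum_ok are my own, and the
sketch assumes upper case input and skips the per-country length
rules, so it is not a complete validator:

```python
import string

# IBANs are defined over upper case ASCII letters and digits only,
# so an explicit character set is legitimate here.
ALLOWED = set(string.ascii_uppercase + string.digits)

def iban_checksum_ok(iban: str) -> bool:
    """Check the mod-97 checksum only (no per-country length check)."""
    compact = iban.replace(" ", "")
    if len(compact) < 5 or not set(compact) <= ALLOWED:
        return False
    # Move the country code and check digits to the end.
    rearranged = compact[4:] + compact[:4]
    # Replace each letter by a number: A -> 10, ..., Z -> 35.
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1

print(iban_checksum_ok("GB82 WEST 1234 5698 7654 32"))  # True
print(iban_checksum_ok("GB82 WEST 1234 5698 7654 33"))  # False
```

Here the character set is part of the specification, so restricting
the input to it is not a statement about anyone's language.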

The previous chapter about characters gives a lot of weight to letter
intervals, which are even more difficult: the locale order would put
'é' after 'e' and before 'f', but the char>=? predicate, which compares
code points, would put it after every unaccented letter. So this does
not even work for all Latin text. And if you use the locale order, then
you won't even have meaningful character ranges anymore.
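
The ordering mismatch can be sketched as follows, in Python rather than
Guile. The helper sorted_in_locale is hypothetical, and the locale name
you pass to it is an assumption about what is installed on your system,
which is why it returns None when the locale is missing:

```python
import locale

words = ["f", "é", "e", "z"]

# Plain comparison works on code points: 'é' (U+00E9) sorts after 'z'.
print(sorted(words))   # ['e', 'f', 'z', 'é']
print("é" > "z")       # True

def sorted_in_locale(words, locale_name):
    """Sort with the collation of locale_name, or return None if unavailable."""
    previous = locale.setlocale(locale.LC_COLLATE)
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None
    try:
        return sorted(words, key=locale.strxfrm)
    finally:
        # Restore the collation setting we found on entry.
        locale.setlocale(locale.LC_COLLATE, previous)
```

Calling sorted_in_locale(words, "fr_FR.UTF-8") on a system that has
that locale should give the dictionary-like order, but the exact result
depends on the platform's collation tables.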

Unicode is a very complex beast, with very few general use cases. Don't
let that discourage you. Fortunately, most everyday computing tasks
can be solved without going down to Unicode character semantics. As
a general idea, I would suggest staying away from characters and
starting with strings.
