Greetings,

during my current quest to get more G-expressions working with UTF-8 input, I have read Guile's documentation, in particular '(guile)Encoding', and I think a change in the default behavior is warranted.
Currently the initial value of %default-port-conversion-strategy is 'substitute. I would like to propose changing it to 'error on the grounds of preventing subtle bugs and data corruption.

Just a reminder: when 'substitute is used, any non-representable character is replaced with #\?. No error is signaled and the user has no way to detect that it even happened. I just do not believe that to be a reasonable default.

Let us take a look, for example, at test-suite/standalone/test-mb-regexp. It contains this code:

    (regexp-exec (make-regexp "(.)(.)(.)")
                 (string (integer->char 200) #\x (integer->char 202)))

That might look sensible until you realize that the following regexp *also* matches:

    (make-regexp "(\\?)(.)(\\?)")

This is just asking for potential bugs (possibly security related) and data corruption. The 'substitute strategy should of course stay (if someone actually needs it), but the default should really be changed to 'error.

Work-wise it is very feasible: the change is minimal (a single line both in ports.c and in the documentation) and just a few tests break:

* test-mb-regexp: But this just demonstrates code that should not have worked in the first place, IMO.
* test-bad-identifiers: Requires calling setlocale with a UTF-8 locale and converting one source file (guardians.c) from latin1 to UTF-8.
* ports.test: This explicitly tests the default value, so it needs to be adjusted.

Real-world impact should be limited, since most people are likely to run with LANG set to *some* UTF-8 locale. And if you do not have that, I (and, I expect, the majority of engineers) would prefer correctness over convenience.

I strongly believe the current default is wrong and dangerous, but I am obviously interested in what other people think, hence this message. Please let me know what you think. Should I put this into an actual patch? Does it have a chance of being accepted and merged into master?
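For reference, the difference between the two strategies can be sketched as follows. This is only an illustration, assuming the (ice-9 iconv) module, whose string->bytevector takes the conversion strategy as an optional third argument; the names text, substituted and raised? are made up for the example and are not part of the test suite.

```scheme
(use-modules (ice-9 iconv))

;; Same string as in test-mb-regexp: U+00C8, #\x, U+00CA.
;; Neither U+00C8 nor U+00CA can be represented in ASCII.
(define text
  (string (integer->char 200) #\x (integer->char 202)))

;; 'substitute: the conversion succeeds, and the unrepresentable
;; characters are replaced without any error being signaled.
(define substituted
  (bytevector->string (string->bytevector text "ASCII" 'substitute)
                      "ASCII"))

;; 'error: the same conversion raises an exception instead, which we
;; catch here only to record that it happened.
(define raised?
  (catch #t
    (lambda ()
      (string->bytevector text "ASCII" 'error)
      #f)
    (lambda (key . args) #t)))

;; Opting in today, without any change to Guile itself:
;; %default-port-conversion-strategy is a fluid, so it can be set
;; for ports created from this point on.
(fluid-set! %default-port-conversion-strategy 'error)
```

With 'substitute the original characters are gone from the result and nothing indicates that a replacement took place, while with 'error the caller at least gets a chance to react.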
Thank you for reading and have a nice day,
Tomas Volf

--
There are only two hard things in Computer Science: cache invalidation, naming things and off-by-one errors.