On Tue, Nov 16, 2010 at 01:16:38PM +0100, Vincent van Ravesteijn wrote:
> >> This will work too I guess.
> >
> > In the sense of "avoid the crash"...
> >
> > The purpose of hasDigit() is to test for occurrences of digits to avoid
> > spell check of words with digits.
> > A docstring may very well contain digits coded outside the range of 0x00 ..
> > 0x7F (ascii 0-9).
> > Unicode contains more numeral in different encodings.
> >
> > Stephan
>
> Are you sure that the numeric characters in other parts of the
> spectrum cannot occur in real words that need to be spellchecked. An
> example to prove that this can be the case is in Chinese:
>
> 三 means '3', but 三角 means triangle.
>
> Ok, I don't know what iswdigit() returns for 三, and I guess that
> spellchecking for Chinese makes no sense, but you get the idea.
>
> It would be worse if there is some language in which such a numeric
> character occurs for example in 10% of all words (as some common
> ending or something), then 10% of the words is not spellchecked.
>
> It feels like we are trying to be smart, but I'd feel better if we
> then exactly know what we do and which words are not spellchecked and
> why.
>
> Besides, I read on this
> website: http://linux.about.com/library/cmd/blcmdl3_iswdigit.htm
> "The wide character class "digit" always contains exactly the digits
> '0' to '9'.", so I'm not sure whether it has any added value.

I experimented a bit on Solaris. Using the attached isdigit.c program I
get the output in the (also attached) isdigit.out. As you can see, the
output is incorrect outside the ASCII range and the program segfaults,
too.

However, if I stick an "#undef isdigit" right after "#include <ctype.h>",
I get no crash and the correct result:

$ ./isdigit
 48 0x30
 49 0x31
 50 0x32
 51 0x33
 52 0x34
 53 0x35
 54 0x36
 55 0x37
 56 0x38
 57 0x39

which is exactly the same as the output of the attached iswdigit.c
program.

So, using the macro version of isdigit() produces wrong results if the
argument is not in the ASCII range, and also a crash. Using iswdigit()
produces the same result as the function version of isdigit().

Moral: either we stick an "#undef isdigit" in our code or we switch to
iswdigit(). However, in this case, some locale expert should clarify
under what conditions the output of iswdigit() differs from that of
isdigit().

-- 
Enrico

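#include <stdio.h>
#include <ctype.h>

/* Print every value in 0..0xFFFF that isdigit() reports as a digit.
   Values above 0xFF are passed on purpose, to probe the macro version. */
int main(void)
{
	int wc;

	for (wc = 0; wc <= 0xFFFF; wc++) {
		if (isdigit(wc)) {
			printf("%3d", wc);
			printf(" %#4x\n", wc);
		}
	}
	return 0;
}
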
#include <stdio.h>
#include <wctype.h>

/* Same loop as isdigit.c, but using the wide-character iswdigit(). */
int main(void)
{
	int wc;

	for (wc = 0; wc <= 0xFFFF; wc++) {
		if (iswdigit(wc)) {
			printf("%3d", wc);
			printf(" %#4x\n", wc);
		}
	}
	return 0;
}

 48 0x30
 49 0x31
 50 0x32
 51 0x33
 52 0x34
 53 0x35
 54 0x36
 55 0x37
 56 0x38
 57 0x39
261 0x105
262 0x106
263 0x107
264 0x108
269 0x10d
270 0x10e
271 0x10f
272 0x110
277 0x115
278 0x116
279 0x117
280 0x118
285 0x11d
286 0x11e
287 0x11f
288 0x120
293 0x125
294 0x126
295 0x127
296 0x128
301 0x12d
302 0x12e
303 0x12f
304 0x130
309 0x135
310 0x136
311 0x137
312 0x138
317 0x13d
318 0x13e
319 0x13f
320 0x140
325 0x145
326 0x146
327 0x147
328 0x148
333 0x14d
334 0x14e
335 0x14f
336 0x150
341 0x155
342 0x156
343 0x157
344 0x158
349 0x15d
350 0x15e
351 0x15f
352 0x160
357 0x165
358 0x166
359 0x167
360 0x168
365 0x16d
366 0x16e
367 0x16f
368 0x170
373 0x175
374 0x176
375 0x177
376 0x178
381 0x17d
382 0x17e
383 0x17f
384 0x180
523 0x20b
524 0x20c
525 0x20d
526 0x20e
Segmentation fault
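
For what it's worth, a third option besides "#undef isdigit" and iswdigit() would be to range-check the value before it ever reaches isdigit(). The C standard only defines isdigit() for arguments representable as unsigned char (or EOF), which is consistent with the garbage and the crash seen above. The following is a minimal sketch under the assumption that only ASCII '0' to '9' should count as digits; the names is_ascii_digit() and count_ascii_digits() are made up for illustration and are not taken from the LyX sources.

```c
#include <ctype.h>

/* Hypothetical helper (not LyX code): nonzero only for ASCII '0'..'9'.
   Assumption: digits outside the ASCII range should NOT be reported. */
static int is_ascii_digit(int wc)
{
    /* isdigit() is only defined for values that fit in an unsigned char
       (or EOF), so reject everything outside 0x00..0x7F first. */
    if (wc < 0 || wc > 0x7F)
        return 0;
    return isdigit(wc) != 0;
}

/* Hypothetical helper: count the values in 0..limit that
   is_ascii_digit() accepts, mirroring the loop in isdigit.c. */
static int count_ascii_digits(int limit)
{
    int wc;
    int n = 0;

    for (wc = 0; wc <= limit; wc++) {
        if (is_ascii_digit(wc))
            n++;
    }
    return n;
}
```

Scanning 0 to 0xFFFF with this guard should yield only the ten values 48 to 57, whether isdigit() is a macro or a function. It does not answer the open question of what iswdigit() returns in other locales; it merely sidesteps it.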