On 16.11.2010 at 18:00, Enrico Forestieri wrote:

> On Tue, Nov 16, 2010 at 01:16:38PM +0100, Vincent van Ravesteijn wrote:
>>>> This will work too I guess.
>>>
>>> In the sense of "avoid the crash"...
>>>
>>> The purpose of hasDigit() is to test for occurrences of digits to avoid
>>> spell check of words with digits.
>>> A docstring may very well contain digits coded outside the range of 0x00 ..
>>> 0x7F (ascii 0-9).
>>> Unicode contains more numeral in different encodings.
>>>
>>> Stephan
>>
>> Are you sure that the numeric characters in other parts of the
>> spectrum cannot occur in real words that need to be spellchecked. An
>> example to prove that this can be the case is in Chinese:
>>
>> ??? means '3', but ?????? means triangle.
>>
>> Ok, I don't know what iswdigit() returns for ???, and I guess that
>> spellchecking for Chinese makes no sense, but you get the idea.
>>
>> It would be worse if there is some language in which such a numeric
>> character occurs for example in 10% of all words (as some common
>> ending or something), then 10% of the words is not spellchecked.
>>
>> It feels like we are trying to be smart, but I'd feel better if we
>> then exactly know what we do and which words are not spellchecked and
>> why.
>>
>> Besides, I read on this
>> website: http://linux.about.com/library/cmd/blcmdl3_iswdigit.htm
>> "The wide character class "digit" always contains exactly the digits
>> '0' to '9'.", so I'm not sure whether it has any added value.
>
> I experimented a bit on solaris. Using the attached isdigit.c program
> I get the output in (the also attached) isdigit.out. As you can see,
> the output is incorrect outside the ascii range and the program
> segfaults, too.
>
> However, if I stick an "#undef isdigit" right after "#include <ctype.h>",
> I get no crash and the correct result:
>
> $ ./isdigit
> 48 0x30
> 49 0x31
> 50 0x32
> 51 0x33
> 52 0x34
> 53 0x35
> 54 0x36
> 55 0x37
> 56 0x38
> 57 0x39
>
> which is exactly the same as the output of the attached iswdgit.c program.
> So, using the macro version of isdigit() produces wrong results if the
> argument is not in the ascii range and also a crash.
> Using iswdigit() produces the same result as the function version of
> isdigit().
>
> Moral: either we stick an "#undef isdigit" in our code or we switch
> to iswdigit(). However, in this case, some locale expert should clarify
> under what conditions the output of iswdigit() differs from that of
> isdigit().

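For reference: the C standard only defines isdigit() for arguments that are representable as unsigned char (or equal to EOF), so passing a wide code point is undefined behaviour regardless of whether the macro or the function is used. Below is a minimal sketch of the "range check" variant discussed in this thread. The helper name is_ascii_digit() is made up for illustration and is not LyX code; the argument type is an assumption standing in for char_type.

```c
#include <ctype.h>

/* Hypothetical helper (not LyX code): ASCII-only digit test that is
   safe to call with any code point, including values above 0xFF. */
static int is_ascii_digit(unsigned long c)
{
    /* isdigit() is only defined for values representable as
       unsigned char (or EOF), so the range is checked first and
       isdigit() is never reached for wide values. */
    return c < 0x80 && isdigit((int)c) != 0;
}
```

This avoids the crash, but by construction it never reports a digit outside the ASCII range.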
I had to think about this further...

1. isdigit() is only one example of the problems with wchar_t.
   The same problem exists with isalnum() et al.:
   * isalnum() in output_xhtml.cpp, function cleanAttr(docstring const & str)
   * isalpha() in InsetRef::getFormattedCmd()

2. iswdigit() et al. depend on the current locale.
   Shouldn't the locale depend on the document language?
   Then it would be iswdigit_l() etc...

3. In lstrings.cpp there is an isDigit(char_type) implementation...
   There is some use of ucs4_to_qchar() there, and this again has a comment
   saying it's a hack.
   If it's correct to use isDigit(), then hasDigit() can use that.

4. Other numerals like 1/3 or "roman numeral one thousand" M etc.
   should be classified as digits as well.

I don't know what the correct solution would be...
I'd use neither the "#undef isdigit" nor the range check with "less than 0x80".
I'd use some iswctype() or iswctype_l() solution.
And all uses of non-wchar_t ctype functions with char_type arguments should
be verified.

Stephan
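
P.S. A minimal sketch of what an iswctype() based check could look like. This is only an illustration under assumptions: has_digit() is a made-up stand-in for hasDigit(), it takes a plain wchar_t string instead of a docstring, and it assumes wchar_t is wide enough to hold the code points in question. Whether the "digit" class matches anything beyond '0' to '9' depends on the platform's locale data, which I have not verified.

```c
#include <wchar.h>
#include <wctype.h>

/* Hypothetical stand-in for hasDigit() (not LyX code): returns 1 if
   any character of the wide string belongs to the "digit" class of
   the current locale, 0 otherwise. */
static int has_digit(const wchar_t *s)
{
    /* Look up the character class once, outside the loop. */
    wctype_t digit_class = wctype("digit");

    for (; *s != L'\0'; ++s) {
        if (iswctype((wint_t)*s, digit_class))
            return 1;
    }
    return 0;
}
```

A locale-specific variant would pass a locale object to iswctype_l() where that function is available, so that the result follows the document language rather than the process locale.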