Re: PATCH for ticket 7026

Stephan Witt Wed, 17 Nov 2010 11:24:19 -0800

On 16.11.2010 at 18:00, Enrico Forestieri wrote:

> On Tue, Nov 16, 2010 at 01:16:38PM +0100, Vincent van Ravesteijn wrote:
>>>> This will work too I guess.
>>> 
>>> In the sense of "avoid the crash"...
>>> 
>>> The purpose of hasDigit() is to test for occurrences of digits to avoid 
>>> spell check of words with digits.
>>> A docstring may very well contain digits coded outside the range of 0x00 .. 
>>> 0x7F (ascii 0-9).
>>> Unicode contains more numerals in different encodings.
>>> 
>>> Stephan
>> 
>> Are you sure that the numeric characters in other parts of the
>> spectrum cannot occur in real words that need to be spellchecked? An
>> example to prove that this can be the case is in Chinese:
>> 
>> 三 means '3', but 三角 means triangle.
>> 
>> Ok, I don't know what iswdigit() returns for 三, and I guess that
>> spellchecking for Chinese makes no sense, but you get the idea.
>> 
>> It would be worse if there is some language in which such a numeric
>> character occurs for example in 10% of all words (as some common
>> ending or something), then 10% of the words is not spellchecked.
>> 
>> It feels like we are trying to be smart, but I'd feel better if we
>> then exactly know what we do and which words are not spellchecked and
>> why.
>> 
>> Besides, I read on this
>> website:http://linux.about.com/library/cmd/blcmdl3_iswdigit.htm
>> "The wide character class "digit" always contains exactly the digits
>> '0' to '9'.", so I'm not sure whether it has any added value.
> 
> I experimented a bit on solaris. Using the attached isdigit.c program
> I get the output in (the also attached) isdigit.out. As you can see,
> the output is incorrect outside the ascii range and the program
> segfaults, too.
> 
> However, if I stick an "#undef isdigit" right after "#include <ctype.h>",
> I get no crash and the correct result:
> 
> $ ./isdigit
> 48 0x30
> 49 0x31
> 50 0x32
> 51 0x33
> 52 0x34
> 53 0x35
> 54 0x36
> 55 0x37
> 56 0x38
> 57 0x39
> 
> which is exactly the same as the output of the attached iswdgit.c program.
> So, using the macro version of isdigit() produces wrong results if the
> argument is not in the ascii range and also a crash.
> Using iswdigit() produces the same result as the function version of
> isdigit().
> 
> Moral: either we stick an "#undef isdigit" in our code or we switch
> to iswdigit(). However, in this case, some locale expert should clarify
> under what conditions the output of iswdigit() differs from that of
> isdigit().
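
To make the two options above concrete, here is a minimal sketch. It is not
LyX code: the helper names are hypothetical, and wchar_t merely stands in for
char_type. It relies only on the fact that isdigit() is defined solely for
arguments representable as unsigned char (or EOF), which is why an unguarded
call with a wide value can misbehave.

```cpp
#include <cctype>
#include <cwctype>

// Hypothetical helper (assumption, not LyX code): wchar_t stands in for
// LyX's char_type. Values outside the ASCII range are rejected before
// isdigit() is ever called, so the call stays within its defined domain.
inline bool isAsciiDigitGuarded(wchar_t c)
{
    return static_cast<unsigned long>(c) < 0x80
        && std::isdigit(static_cast<unsigned char>(c)) != 0;
}

// Hypothetical helper: the wide-character variant accepts any wint_t value,
// so no range guard is needed before the call.
inline bool isWideDigit(wchar_t c)
{
    return std::iswdigit(static_cast<std::wint_t>(c)) != 0;
}
```

Both helpers agree on '0' to '9'; whether iswdigit() can ever accept more than
that is exactly the open locale question.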


I had to think about this further...

1. isdigit() is only one example of the problems with wchar_t.
The same problem exists with isalnum() et al.:
* isalnum() in output_xhtml.cpp, function cleanAttr(docstring const & str)
* isalpha() in InsetRef::getFormattedCmd()

2. iswdigit() et al. depend on the current locale.
Shouldn't the locale depend on the document language? 
Then it would be iswdigit_l() etc...

3. lstrings.cpp has an isDigit(char_type) implementation...
It makes some use of ucs4_to_qchar(), which in turn has a comment saying it's a 
hack.
If it's correct to use isDigit(), then hasDigit() can use that.

4. Other numerals like 1/3 or "roman numeral one thousand" M etc. should be
classified as digits as well.

I don't know what the correct solution would be...
I'd use neither the "#undef isdigit" nor the range check with "less than 0x80".
I'd use some iswctype() or iswctype_l() solution. 
And all uses of non-wchar_t ctype functions for char_type arguments should be 
verified.
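
A rough sketch of what I mean by an iswctype() solution follows. It is an
illustration only: hasDigitSketch is a made-up name, std::wstring stands in
for docstring, and the class is looked up in whatever locale is current, so
point 2 above remains unsolved here.

```cpp
#include <cstddef>
#include <cwctype>
#include <string>

// Hypothetical stand-in for hasDigit(docstring const &); std::wstring is
// used here instead of LyX's docstring (assumption for illustration).
inline bool hasDigitSketch(std::wstring const & word)
{
    // Look up the "digit" character class once, in the current locale.
    static const std::wctype_t digit_class = std::wctype("digit");
    for (std::size_t i = 0; i < word.size(); ++i) {
        // iswctype() takes a wint_t, so wide values are passed safely.
        if (std::iswctype(static_cast<std::wint_t>(word[i]), digit_class))
            return true;
    }
    return false;
}
```

With this, hasDigitSketch(L"abc1") is true and hasDigitSketch(L"abc") is
false. Other classes ("alnum", "alpha") could be looked up the same way for
the isalnum() and isalpha() cases.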

Stephan
