Usually given names are not in a language dictionary, although many
(translation) services have separate dictionaries for proper/given names.
We have two problems here:
(1) Language: I think most users are OK with proper names not being
accepted by the spell checker (before learning them). However, other
options such as "Ignore" should work, too.
(2) Encoding: Words containing characters that are not part of the usual
character set of a given language should behave the same way as words
that contain only such characters. This includes "István", "Vološinov",
etc. So we have to use UTF-8 to look up words.
When down-converting text to the character set of the target language, we
can ignore non-convertible characters silently, but

    echo 'István' | iconv -c -f utf-8 -t ascii

yields "Istvn", which is not very useful.
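The same loss can be reproduced outside the shell. This is only a minimal
sketch using the Python standard library; the function name `down_convert`
is made up for illustration and is not part of any spell checker.

```python
def down_convert(word: str, encoding: str = "ascii") -> str:
    """Drop every character that `encoding` cannot represent (like iconv -c)."""
    return word.encode(encoding, errors="ignore").decode(encoding)


print(down_convert("István"))                   # Istvn
print(down_convert("Vološinov"))                # Voloinov
print(down_convert("István", "iso-8859-1"))     # István (á exists in Latin-1)
print(down_convert("Vološinov", "iso-8859-1"))  # Voloinov (š does not)
```

Whether a word survives therefore depends on the dictionary encoding, not
on the language the word is marked with.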
I think we have to use Unicode for all of these operations and either
(a) risk a mismatch for each word that is not learned/ignored, or (b)
up-convert the words in the dictionary before they are matched. The latter
solution requires support from the dictionary tool; does anyone know
if that is the case (for at least one tool)?
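Option (b) could look roughly like the following sketch. It assumes the
dictionary is available as a plain word list in a known legacy encoding;
the word list, the encoding, and the names `load_dictionary` and
`is_known` are invented for illustration and do not describe any existing
dictionary tool.

```python
import unicodedata


def load_dictionary(raw_words: list[bytes], encoding: str) -> set[str]:
    """Up-convert legacy-encoded dictionary entries to normalized Unicode."""
    return {unicodedata.normalize("NFC", w.decode(encoding)) for w in raw_words}


def is_known(word: str, dictionary: set[str], ignored: set[str]) -> bool:
    """Match in Unicode, so learned/ignored words need no down-conversion."""
    word = unicodedata.normalize("NFC", word)
    return word in dictionary or word in ignored


# Hypothetical Latin-1 word list, standing in for a real dictionary file.
raw = ["Haus".encode("iso-8859-1"), "Straße".encode("iso-8859-1")]
dictionary = load_dictionary(raw, "iso-8859-1")
ignored = {"István"}  # would be filled by "Ignore all" or the personal dictionary

print(is_known("Straße", dictionary, ignored))     # True
print(is_known("István", dictionary, ignored))     # True
print(is_known("Vološinov", dictionary, ignored))  # False
```

With this arrangement a name such as "Vološinov" is still reported until
it is learned or ignored, but it is never mangled on the way to the
lookup.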
This is mixing languages with writing systems, IMHO. In fact, language
sometimes affects the spelling of names (when it comes to
transliteration), but with rather surprising effects. For instance, the
Russian name Воло́шинов is usually written Vološinov in German, but
Voloshinov in English. Is "š" a "German" character?
I'm not a linguist and my knowledge about these things is limited.
Changing the language is the only way I know of to get out of the
"broken" dictionary encoding scenario.
Also, I think that marking István as "Hungarian" reduces the language
concept to absurdity.
More technically, I think it will be irritating for users that they
can add "István" to the personal dictionary, while "Ignore" and
"Ignore all" just won't work.
Yes, I agree.
With the given example "István", and with á present in the dictionary
encoding, the word is most probably marked as misspelled. But is it then
possible to ignore it? Isn't there an option to silently discard the
characters that cannot be converted, or to replace them with something
similar, for the dictionary lookup? Not quite correct, I know, but perhaps
the better strategy for the user?
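The "replace them with something similar" idea could be sketched as
follows. This is an assumption about how such a fallback might work, not a
description of what any spell checker does; `strip_diacritics` and
`lookup_form` are hypothetical names.

```python
import unicodedata


def strip_diacritics(word: str) -> str:
    """Replace accented letters by their base letters via NFKD decomposition."""
    decomposed = unicodedata.normalize("NFKD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def lookup_form(word: str, encoding: str) -> str:
    """Return the word unchanged if `encoding` can represent it, else a fallback."""
    try:
        word.encode(encoding)
        return word
    except UnicodeEncodeError:
        return strip_diacritics(word)


print(lookup_form("István", "ascii"))          # Istvan
print(lookup_form("Vološinov", "iso-8859-1"))  # Volosinov
print(lookup_form("István", "iso-8859-1"))     # István
```

Letters that have no Unicode decomposition (for example "ø" or "ł") pass
through this fallback unchanged, and the stripped form can collide with a
different, correctly spelled word, so the result is indeed "not quite
correct".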
Stephan
--
Regards,
Cyrille Artho - http://artho.com/
Perilous to all of us are the devices of an art deeper than we
ourselves possess.
-- Gandalf the Grey [Tolkien, "Lord of the Rings"]