On 29/12/2015 21:54, Ehsan Akhgari wrote:
They are not Mozilla special words. They are words that we want to add
to our spell checking dictionary that don't exist in the upstream SCOWL
word list.
In bug 1235506 I suggest maintaining three lists:
1) Proper names, of which we have about 12,000.
2) Special Mozilla words, like "XUL", of which we have exactly 37.
3) A mixed bag of 1000 extra words, mostly internet-related terms.
There are many errors in those. Many of those words should be
requested upstream and removed from the Mozilla-maintained part
in due course. Example: datasheet
http://app.aspell.net/lookup?dict=en_US-large&words=datasheet
is likely to be added one day.
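
To illustrate the "removed in due course" step, here is a rough Python
sketch of how the mixed bag could be pruned whenever a new upstream
release arrives. The function name and the toy word lists are made up
for this example; nothing like this exists in the tree as far as I know.

```python
# Hypothetical helper, for illustration only.
def split_extra_words(extra_words, upstream_words):
    """Split the 'mixed bag' into words we still need and words now upstream."""
    upstream = set(upstream_words)
    still_needed = [w for w in extra_words if w not in upstream]
    now_upstream = [w for w in extra_words if w in upstream]
    return still_needed, now_upstream


# Toy data, not the real lists. We pretend a future SCOWL release
# has picked up "datasheet".
extra = ["datasheet", "Fukushima", "webmail"]
upstream_next = ["datasheet", "feasible", "remind"]

keep, drop = split_extra_words(extra, upstream_next)
print("keep in Mozilla list:", keep)
print("drop, now upstream:  ", drop)
```

The point is only that the Mozilla-maintained list shrinks automatically
once upstream accepts a word, without anyone having to file a bug.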
IOW, our en-US dictionary is a super-set of the SCOWL en-US dictionary.
Yes, minus two exceptions:
https://dxr.mozilla.org/mozilla-central/source/extensions/spellcheck/locales/en-US/hunspell/dictionary-sources/5-mozilla-removed
(which IMHO make no sense at all).
AFAIK the SCOWL project recommends against using the large word list for
spell checking. If you find evidence to the contrary, I would like to
know more about that.
I NI'ed Kevin Atkinson, as you had requested. The maintainer of the GB
dictionary states: Bigger is better:
http://marcoagpinto.cidadevirtual.pt/faq.html
(About the words that disappeared, please file a bug and attach the list
of the words.)
No. That is the part that makes no sense. These words were in the SCOWL
data before the merge at the end of April 2015. Now they are no longer
in the "small" dictionary. It makes absolutely no sense to administer
that part of the dictionary. We either take the SCOWL data or we don't.
We don't want to be in the business of adding words that SCOWL have
removed. Therefore my suggestion is to use the "large" dataset, which has
these words. Example:
http://app.aspell.net/lookup?dict=en_US-large&words=relict%0D%0Aresiduary%0D%0Aenforceability%0D%0Aadvisor%0D%0Ainfeasible%0D%0Aclich%E9%0D%0ABogot%E1%0D%0Ainfeasible%0D%0Aunfeasible
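
In case the link is hard to read: the words are separated by CR LF
(%0D%0A) and the accented letters are single ISO-8859-1 bytes (%E9 is
"é", %E1 is "á"). A small Python sketch, using only the standard
library, that unpacks the query and builds it again from a word list:

```python
from urllib.parse import parse_qs, urlencode, urlsplit

url = ("http://app.aspell.net/lookup?dict=en_US-large&words="
       "relict%0D%0Aresiduary%0D%0Aenforceability%0D%0Aadvisor%0D%0A"
       "infeasible%0D%0Aclich%E9%0D%0ABogot%E1%0D%0Ainfeasible%0D%0Aunfeasible")

# Decode the percent-escapes as ISO-8859-1 rather than the UTF-8 default.
query = parse_qs(urlsplit(url).query, encoding="iso-8859-1")
words = query["words"][0].split("\r\n")
print(words)

# Going the other way: build the query string for a list of words.
rebuilt = urlencode(
    {"dict": "en_US-large", "words": "\r\n".join(words)},
    encoding="iso-8859-1",
)
print(rebuilt)
```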
I welcome someone going through the list and reviewing the additions. No
matter what process we use for maintaining our word list, that will be
needed and appreciated.
Yes.
I'm really lost on what problem in the process you are talking about.
Looks like you have found some issues in the word list, which is great.
But I don't see any of these having anything to do with the process for
updating the word list.
I'll try again. The Mozilla dictionary consists of two sources: SCOWL
and Mozilla's words, which should be maintained separately. We want to
be in a position to replace the SCOWL data easily. Mozilla should
administer its own additions, not general English terms. A recent
Mozilla addition, Fukushima, should for example be added to the third
list, the mixed bag of words that we wish were in the SCOWL data but aren't
(http://app.aspell.net/lookup?dict=en_US-large&words=Fukushima).
Another example: If SCOWL decide to change feasible/U to feasible/UI and
then back to feasible/U, Mozilla should not hang on to the /I part as we
currently do. Mozilla should not administer the plain English
dictionary; it should administer its specific, well-chosen additions.
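
To make the feasible/U versus feasible/UI case concrete, here is a
rough Python sketch that reports entries whose affix flags differ
between our copy and upstream. It assumes entries of the form
word/FLAGS with single-character flags and ignores anything else a .dic
line may carry. The function names and the "remind/SGD" entry are
invented for the example.

```python
def parse_entry(line):
    """Split a .dic entry like 'feasible/UI' into (word, set of flag characters)."""
    word, _, flags = line.strip().partition("/")
    return word, set(flags)


def flag_differences(mozilla_lines, scowl_lines):
    """Map each shared word to (flags only in ours, flags only upstream)."""
    scowl = dict(parse_entry(line) for line in scowl_lines)
    diffs = {}
    for line in mozilla_lines:
        word, flags = parse_entry(line)
        if word in scowl and flags != scowl[word]:
            diffs[word] = (flags - scowl[word], scowl[word] - flags)
    return diffs


# Toy data: we kept the /I that upstream has since dropped.
mozilla = ["feasible/UI", "remind/SGD"]
scowl = ["feasible/U", "remind/SGD"]

diffs = flag_differences(mozilla, scowl)
for word, (ours_only, upstream_only) in diffs.items():
    print(word, "ours only:", sorted(ours_only), "upstream only:", sorted(upstream_only))
```

If we only administered our own additions, a report like this would
always be empty for plain English stems.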
The faulty process has led to the unfortunate situation we're in. The
current process accumulates all SCOWL errors forever unless someone files
a bug. For example: somehow "remind's" got into the Mozilla data, and the
only way to get it out with the current process is to file a bug.
If we were only to maintain carefully chosen additions, then
mind/remind/reminds/etc. would not be part of the Mozilla maintained
list. Mozilla would just follow SCOWL on this word/stem.
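
A minimal Python sketch of the process I have in mind, with toy data
and a made-up function name: the shipped dictionary is rebuilt from the
current upstream list plus our additions every time, so an upstream
correction arrives without any action on our side.

```python
def build_dictionary(scowl_words, mozilla_additions):
    """Union of the current upstream list and Mozilla's own additions, sorted."""
    return sorted(set(scowl_words) | set(mozilla_additions))


# Mozilla-maintained part: stays the same across upstream updates.
additions = ["XUL", "Fukushima"]

# Toy upstream snapshots: pretend the older one carried the bad entry
# and the newer one no longer does.
scowl_old = ["remind", "remind's", "reminds"]
scowl_new = ["remind", "reminds"]

old_build = build_dictionary(scowl_old, additions)
new_build = build_dictionary(scowl_new, additions)

print("remind's" in old_build)
print("remind's" in new_build)
```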
If you're suggesting that we should not
maintain any additions on top of SCOWL, that is effectively asking for a
regression to the quality of our word list, and as such is unacceptable.
As I said many times in the thread: We should carefully maintain any
Mozilla additions on top of the SCOWL data. We should leave it to SCOWL
to manage the plain English dictionary and only manage the Mozilla
additions (for which I see three classes, see above).
Let me try a comparison: The SCOWL data is a holiday rental place and
Mozilla is the holiday maker. It moves in with thongs, sunscreen and
shorts. It keeps track of its belongings. Of course Mozillians in the
flat take pictures of each other which feature the things which belong
to the flat. After a week Mozilla returns home. It takes its thongs,
sunscreen and shorts with it. It does not hang on to the flat's carpet
or couch. Neither does it take an inventory of the holiday flat. Next
year Mozilla visits the holiday flat again. The owner has changed the
carpet, removed a picture from the hallway but added a statue in the
living room. The holiday snaps will look different to the ones from the
previous year, but Mozilla doesn't have to guarantee that the same items
of furniture appear on the photos.
If you find a way to encode these accented characters properly, we can
add them to the word list that we maintain.
They are already in the "large" SCOWL dataset. Adding accented
characters to en-US.dic works today. Try it: Add "naïve" or "résumé" to
the data and make sure the file gets saved as ANSI/ISO8859-1 and not
UTF-8. Then type/paste those words into a text field with spell
checking. Works!
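
If your editor makes it hard to control the encoding, the difference is
easy to see in Python. This is only a sketch that writes a scratch file
in a temporary directory (the file name is made up); it does not touch
en-US.dic.

```python
import os
import tempfile

words = ["naïve", "résumé"]
text = "\n".join(words) + "\n"

# One byte per accented letter in ISO-8859-1, two bytes in UTF-8.
print(text.encode("iso-8859-1"))
print(text.encode("utf-8"))

# Write a scratch file with an explicit encoding instead of relying on
# whatever the platform default happens to be.
path = os.path.join(tempfile.mkdtemp(), "scratch.dic")
with open(path, "w", encoding="iso-8859-1", newline="\n") as f:
    f.write(text)

# Read the raw bytes back to confirm what ended up on disk.
with open(path, "rb") as f:
    raw = f.read()
print(raw)
```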
Jorg K.
_______________________________________________
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform