Re: Maintaining the en-US dictionary that ships with Mozilla products

Ehsan Akhgari Tue, 29 Dec 2015 16:50:17 -0800

On 2015-12-29 6:54 PM, Jörg Knobloch wrote:

On 29/12/2015 21:54, Ehsan Akhgari wrote:

They are not Mozilla special words.  They are words that we want to add
to our spell checking dictionary that don't exist in the upstream SCOWL
word list.

In bug 1235506 I suggest to maintain three lists:
1) proper names of which we have about 12,000.
2) Special Mozilla words, like "XUL" of which we have exactly 37.
3) A mixed bag of 1000 extra words, mostly internet related terms.
    There are many errors in those. Many of those words should be
    requested upstream and removed from the Mozilla maintained part
    in due course, example: datasheet:
    http://app.aspell.net/lookup?dict=en_US-large&words=datasheet
    has a likeliness to be added one day.

First things first, let's correct something here. We do _not_ maintain
three word lists. We maintain one list: the list of words that the
Firefox spellchecker accepts. The
extensions/spellcheck/locales/en-US/hunspell/dictionary-sources/5-mozilla-*
files are there purely for historical reasons, and should only be used
in order to triage the diff of our dictionary against the SCOWL upstream.

FWIW,
<https://dxr.mozilla.org/mozilla-central/source/extensions/spellcheck/locales/en-US/hunspell/dictionary-sources/README>
explains thoroughly how these files are used and generated.

IOW, our en-US dictionary is a super-set of the SCOWL en-US dictionary.

Yes, minus two exceptions:
https://dxr.mozilla.org/mozilla-central/source/extensions/spellcheck/locales/en-US/hunspell/dictionary-sources/5-mozilla-removed

(which IMHO make no sense at all).

Yes. At the risk of sounding like a broken record, please file a bug?
I think it makes sense to add these two words back. :-)

AFAIK the SCOWL project recommends against using the large word list for
spell checking.  If you find evidence to the contrary, I would like to
know more about that.

I NI'ed Kevin Atkinson, as you had requested. The maintainer of the GB
dictionary states: Bigger is better:
http://marcoagpinto.cidadevirtual.pt/faq.html

Cool. FWIW I'm happy to trust Kevin's judgement here, unless the
larger wordlist significantly increases the size of the code we ship.

(About the words that disappeared, please file a bug and attach the list
of the words.)

No. That is the part that makes no sense. These words were in the SCOWL
data before the merge at the end of April 2015. Now they are no longer
in the "small" dictionary. It makes absolutely no sense to administer
that part of the dictionary. We either take the SCOWL data or we don't.
We don't want to be in the business of adding words that SCOWL have
removed. Therefore my suggestion is to use the "large" dataset which has
these words: Example:
http://app.aspell.net/lookup?dict=en_US-large&words=relict%0D%0Aresiduary%0D%0Aenforceability%0D%0Aadvisor%0D%0Ainfeasible%0D%0Aclich%E9%0D%0ABogot%E1%0D%0Ainfeasible%0D%0Aunfeasible

I'm afraid you're misunderstanding what's happened here. We only
maintain one word list, and our process of merging upstream changes is
purely additive. As a result, it doesn't handle the case where a word
disappears from SCOWL.

This is clearly a bug, and should be fixed. We may still decide to keep
individual words that SCOWL drops if we decide that we want the Firefox
spell checker to accept them, but as a general rule we should probably
follow upstream.


I believe this should be relatively simple to fix in make-new-dict.
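
For illustration, here is a minimal sketch of a merge that is no longer
purely additive. It assumes three inputs are available: the entries
currently shipped, the previous SCOWL snapshot, and the new SCOWL
snapshot. The function name, the "keep" parameter and the sample data
are hypothetical; this is not what make-new-dict does today.

```python
def merge_upstream(current, old_upstream, new_upstream, keep=()):
    """Three-way merge of dictionary entries (strings such as "feasible/U").

    Hypothetical helper: entries that upstream dropped between the two
    snapshots are removed from the shipped list unless they are listed
    in `keep`; entries that upstream added are taken as-is.
    """
    current = set(current)
    old_upstream = set(old_upstream)
    new_upstream = set(new_upstream)

    dropped = old_upstream - new_upstream   # upstream removed or changed these
    added = new_upstream - old_upstream     # upstream introduced these

    # Local additions (never in upstream) are untouched, because they
    # cannot appear in `dropped`.
    return sorted((current - (dropped - set(keep))) | added)


# Made-up sample data, for illustration only.
old_scowl = {"feasible/UI", "relict", "cat/S"}
new_scowl = {"feasible/U", "cat/S"}
shipped = {"feasible/UI", "relict", "cat/S", "XUL", "Fukushima"}

merged = merge_upstream(shipped, old_scowl, new_scowl, keep={"relict"})
print(merged)
```

In this sample, "feasible/UI" follows upstream back to "feasible/U",
"relict" survives only because it is explicitly kept, and the local
additions "XUL" and "Fukushima" are unaffected.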

I'm really lost on what problem in the process you are talking about.
Looks like you have found some issues in the word list, which is great.
But I don't see any of these having anything to do with the process for
updating the word list.

I'll try again. The Mozilla dictionary consists of two sources: SCOWL
and Mozilla's words, which should be maintained separately. We want to
be in a position to replace the SCOWL data easily. Mozilla should
administer its own additions, not general English terms. A recent
Mozilla addition, Fukushima, should for example be added to the third
list, the mixed bag that we wish were in the SCOWL data but aren't
(http://app.aspell.net/lookup?dict=en_US-large&words=Fukushima).


Please see the above.  I believe fixing the above bug will make you happy!

Another example: If SCOWL decide to change feasible/U to feasible/UI and
then back to feasible/U, Mozilla should not hang on to the /I part as we
currently do. Mozilla should not administer the plain English
dictionary, it should administer its specific well chosen additions.


Again, please see the above.

The faulty process has led to the unfortunate situation we're in. The
current process accumulates all SCOWL errors forever unless someone
files a bug. For example: Somehow "remind's" got into the Mozilla data;
the only way to get it out with the current process is to file a bug.

FWIW please realize that even with the above bug fixed and us not
magically holding zombie SCOWL entries alive, there will still be
examples of embarrassing things that our spell checker gets wrong. The
right thing to do for that is _always_ to file a bug. So, note that
there are two orthogonal issues here.

If you're suggesting that we should not
maintain any additions on top of SCOWL, that is effectively asking for a
regression to the quality of our word list, and as such is unacceptable.

As I said many times in the thread: We should carefully maintain any
Mozilla additions on top of the SCOWL data. We should leave it to SCOWL
to manage the plain English dictionary and only manage the Mozilla
additions (for which I see three classes, see above).

I disagree. I think we should accept the words that we want, and then
try to upstream them to SCOWL, without holding Firefox back until that
happens. I experimented with this once
<https://github.com/kevina/wordlist/issues/117> but unfortunately I
haven't had the time to go through all of the list. (As a non-native
speaker this task requires me to spend weeks looking things up in
dictionaries!)

Let me try a comparison: The SCOWL data is a holiday rental place and
Mozilla is the holiday maker. It moves on with thongs, sunscreen and
shorts. It keeps track of its belongings. Of course Mozillians in the
flat take pictures of each other which feature the things which belong
to the flat. After a week Mozilla returns home. It takes its thongs,
sunscreen and shorts with it. It does not hang on to the flat's carpet
or couch. Neither does it take an inventory of the holiday flat. Next
year Mozilla visits the holiday flat again. The owner has changed the
carpet, removed a picture from the hallway but added a statue in the
living room. The holiday snaps will look different to the ones from the
previous year, but Mozilla doesn't have to guarantee that the same items
of furniture appear on the photos.

I'm well capable of understanding technical arguments, and I'm not sure
if I appreciate this kind of simplification. Let's please stick to
technical terms. :-)

If you find a way to encode these accented characters properly, we can
add them to the word list that we maintain.

They are already in the "large" SCOWL dataset. Adding accented
characters to en-US.dic works today. Try it: Add "naïve" or "résumé" to
the data and make sure the file gets saved as ANSI/ISO8859-1 and not
UTF-8. Then type/paste those words into a text field with spell
checking. Works!

Wonderful! If you have a list of words using these types of characters
that we need to add, please file a bug, and let's do that!
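
As a rough illustration of the encoding point, the sketch below writes
a tiny word list as ISO-8859-1 rather than UTF-8 and reads it back. The
helper names and the temporary file are hypothetical, and it assumes
the matching .aff file declares the same encoding. The leading count
line follows the usual Hunspell .dic layout.

```python
import os
import tempfile


def write_dic(path, words):
    """Write entries one per line, preceded by the entry count, as ISO-8859-1."""
    with open(path, "w", encoding="iso-8859-1", newline="\n") as f:
        f.write(f"{len(words)}\n")
        for word in words:
            f.write(word + "\n")


def read_dic(path):
    """Read the entries back, skipping the leading count line."""
    with open(path, encoding="iso-8859-1") as f:
        return f.read().splitlines()[1:]


# Hypothetical sample entries written to a throwaway file.
sample_words = ["naïve", "résumé"]
tmp_dir = tempfile.mkdtemp()
dic_path = os.path.join(tmp_dir, "sample.dic")
write_dic(dic_path, sample_words)

with open(dic_path, "rb") as f:
    raw_bytes = f.read()

# Each accented character occupies a single byte in ISO-8859-1.
print(raw_bytes)
```

Saving the same words as UTF-8 would produce two bytes per accented
character, which is why the file encoding matters here.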


_______________________________________________
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform
