Re: Maintaining the en-US dictionary that ships with Mozilla products

Jörg Knobloch Tue, 29 Dec 2015 15:56:27 -0800

On 29/12/2015 21:54, Ehsan Akhgari wrote:

They are not Mozilla special words.  They are words that we want to add
to our spell checking dictionary that don't exist in the upstream SCOWL
word list.

In bug 1235506 I suggest maintaining three lists:
1) Proper names, of which we have about 12,000.
2) Special Mozilla words, like "XUL", of which we have exactly 37.
3) A mixed bag of 1000 extra words, mostly internet-related terms.
   There are many errors in those. Many of those words should be
   requested upstream and removed from the Mozilla-maintained part
   in due course. Example: datasheet,
   http://app.aspell.net/lookup?dict=en_US-large&words=datasheet
   which is likely to be added one day.

IOW, our en-US dictionary is a super-set of the SCOWL en-US dictionary.

Yes, minus two exceptions:
https://dxr.mozilla.org/mozilla-central/source/extensions/spellcheck/locales/en-US/hunspell/dictionary-sources/5-mozilla-removed
(which IMHO make no sense at all).
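The layering discussed so far (SCOWL as the base, the three Mozilla lists on top, minus the removals) can be sketched roughly as below. This is only an illustration, not the actual build script in mozilla-central: the function name `build_wordlist` and all sample words are hypothetical stand-ins.

```python
def build_wordlist(scowl, additions, removed):
    """Return the shipped word list: SCOWL plus Mozilla additions, minus removals.

    scowl     -- iterable of upstream words (taken as-is, never edited)
    additions -- list of Mozilla-maintained lists (names, Mozilla words, extras)
    removed   -- iterable of words Mozilla deliberately drops
    """
    words = set(scowl)
    for extra in additions:
        words |= set(extra)
    words -= set(removed)
    return sorted(words)


# Tiny illustrative inputs; the real lists are far larger.
scowl = ["feasible", "remind", "placeholder"]
proper_names = ["Alice"]                  # list 1: proper names
mozilla_words = ["XUL"]                   # list 2: special Mozilla words
extra_words = ["datasheet", "Fukushima"]  # list 3: the mixed bag
removed = ["placeholder"]                 # stand-in for 5-mozilla-removed

shipped = build_wordlist(scowl, [proper_names, mozilla_words, extra_words], removed)
print(shipped)
```

Because the upstream list is only ever read, replacing it with a newer SCOWL release means re-running the merge with a different first argument.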

AFAIK the SCOWL project recommends against using the large word list for
spell checking.  If you find evidence to the contrary, I would like to
know more about that.

I NI'ed Kevin Atkinson, as you had requested. The maintainer of the GB
dictionary states that bigger is better:

http://marcoagpinto.cidadevirtual.pt/faq.html

(About the words that disappeared, please file a bug and attach the list
of the words.)

No. That is the part that makes no sense. These words were in the SCOWL
data before the merge at the end of April 2015. Now they are no longer
in the "small" dictionary. It makes absolutely no sense to administer
that part of the dictionary. We either take the SCOWL data or we don't.
We don't want to be in the business of adding words that SCOWL have
removed. Therefore my suggestion is to use the "large" dataset, which has
these words. Example:

http://app.aspell.net/lookup?dict=en_US-large&words=relict%0D%0Aresiduary%0D%0Aenforceability%0D%0Aadvisor%0D%0Ainfeasible%0D%0Aclich%E9%0D%0ABogot%E1%0D%0Ainfeasible%0D%0Aunfeasible
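A check like the one behind that lookup could be sketched as follows, assuming two snapshots of the shipped list and a copy of the "large" list are available as plain word sequences. The helper name `dropped_words` is hypothetical, and the sample lists are tiny stand-ins built from words mentioned in the lookup URL.

```python
def dropped_words(old_words, new_words, large_words):
    """Map each word that vanished between two snapshots to whether
    the larger upstream list still contains it."""
    missing = set(old_words) - set(new_words)
    large = set(large_words)
    return {word: (word in large) for word in sorted(missing)}


old_snapshot = ["advisor", "feasible", "infeasible"]
new_snapshot = ["feasible"]
large_list = ["advisor", "feasible", "infeasible", "unfeasible"]

for word, in_large in dropped_words(old_snapshot, new_snapshot, large_list).items():
    print(word, "still in large list" if in_large else "gone upstream too")
```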

I welcome someone going through the list and reviewing the additions. No
matter what process we use for maintaining our word list, that will be
needed and appreciated.

Yes.

I'm really lost on what problem in the process you are talking about.
Looks like you have found some issues in the word list, which is great.
But I don't see any of these having anything to do with the process for
updating the word list.

I'll try again. The Mozilla dictionary consists of two sources: SCOWL
and Mozilla's words, which should be maintained separately. We want to
be in a position to replace the SCOWL data easily. Mozilla should
administer its own additions, not general English terms. A recent
Mozilla addition, Fukushima, should for example be added to the third
list, the mixed bag that we wish were in the SCOWL data but aren't
(http://app.aspell.net/lookup?dict=en_US-large&words=Fukushima).

Another example: If SCOWL decide to change feasible/U to feasible/UI and
then back to feasible/U, Mozilla should not hang on to the /I part as we
currently do. Mozilla should not administer the plain English
dictionary; it should administer its specific, well-chosen additions.
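To make the feasible/U example concrete, here is a minimal sketch of rebuilding the entries from scratch on every update, so that upstream flag changes are followed in both directions. It assumes single-character affix flags after the slash and assumes upstream wins whenever a stem appears in both sources; `parse_entry` and `rebuild` are hypothetical names, not functions from the existing scripts.

```python
def parse_entry(line):
    """Split a .dic entry such as 'feasible/UI' into (stem, set of flags)."""
    stem, _, flags = line.strip().partition("/")
    return stem, set(flags)


def rebuild(scowl_lines, mozilla_lines):
    """Rebuild the entry table: upstream entries verbatim, then Mozilla-only stems."""
    entries = dict(parse_entry(line) for line in scowl_lines)
    for line in mozilla_lines:
        stem, flags = parse_entry(line)
        # Assumption: Mozilla never overrides a stem that upstream manages.
        entries.setdefault(stem, flags)
    return entries


mozilla = ["XUL"]
for upstream in (["feasible/U"], ["feasible/UI"], ["feasible/U"]):
    flags = rebuild(upstream, mozilla)["feasible"]
    print(upstream[0], "->", "".join(sorted(flags)))
```

Nothing is carried over from the previous build, so the /I flag disappears as soon as upstream drops it.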

The faulty process has led to the unfortunate situation we're in. The
current process accumulates all SCOWL errors forever unless someone files
a bug. For example: somehow "remind's" got into the Mozilla data, and the
only way to get it out with the current process is to file a bug.

If we were only to maintain carefully chosen additions, then
mind/remind/reminds/etc. would not be part of the Mozilla-maintained
list. Mozilla would just follow SCOWL on this word/stem.

If you're suggesting that we should not
maintain any additions on top of SCOWL, that is effectively asking for a
regression to the quality of our word list, and as such is unacceptable.

As I said many times in the thread: We should carefully maintain any
Mozilla additions on top of the SCOWL data. We should leave it to SCOWL
to manage the plain English dictionary and only manage the Mozilla
additions (for which I see three classes, see above).

Let me try a comparison: The SCOWL data is a holiday rental place and
Mozilla is the holiday maker. It moves in with thongs, sunscreen and
shorts. It keeps track of its belongings. Of course Mozillians in the
flat take pictures of each other which feature the things which belong
to the flat. After a week Mozilla returns home. It takes its thongs,
sunscreen and shorts with it. It does not hang on to the flat's carpet
or couch. Neither does it take an inventory of the holiday flat. Next
year Mozilla visits the holiday flat again. The owner has changed the
carpet, removed a picture from the hallway but added a statue in the
living room. The holiday snaps will look different to the ones from the
previous year, but Mozilla doesn't have to guarantee that the same items
of furniture appear on the photos.

If you find a way to encode these accented characters properly, we can
add them to the word list that we maintain.

They are already in the "large" SCOWL dataset. Adding accented
characters to en-US.dic works today. Try it: Add "naïve" or "résumé" to
the data and make sure the file gets saved as ANSI/ISO8859-1 and not
UTF-8. Then type/paste those words into a text field with spell
checking. Works!
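The encoding point can be illustrated with a short sketch. It only shows how the two example words differ when encoded as ISO-8859-1 versus UTF-8; it does not touch en-US.dic itself, and the helper name `encode_entry` is hypothetical.

```python
def encode_entry(word):
    """Encode a word as ISO-8859-1 bytes.

    Raises UnicodeEncodeError if the word contains a character
    that ISO-8859-1 cannot represent.
    """
    return word.encode("iso-8859-1")


# Escapes are used so the example does not depend on this file's own encoding.
words = ["na\u00efve", "r\u00e9sum\u00e9"]

for word in words:
    latin1 = encode_entry(word)
    utf8 = word.encode("utf-8")
    # Each accented letter is one byte in ISO-8859-1 but two bytes in UTF-8.
    print(word, len(latin1), "bytes as ISO-8859-1,", len(utf8), "bytes as UTF-8")
```

An editor that silently saves the file as UTF-8 would write the two-byte sequences, which is why the save encoding matters.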


Jorg K.
_______________________________________________
dev-platform mailing list
[email protected]
https://lists.mozilla.org/listinfo/dev-platform
