On Tue, Jul 12, 2016 at 5:47 AM, Brett Lymn <bl...@internode.on.net> wrote:
> On Mon, Jul 11, 2016 at 08:59:05PM +0530, Abhinav Upadhyay wrote:
>>
>> Thanks, that would be a good starting point too. I guess we will still
>> have to add few words to the list manually later, but it should be
>> good to begin with.
>>
>
> How about checking the length of the word - technical abbreviations tend
> to be short (<= 4 characters predominantly). According to grep there
> are 155 two letter words, 1358 three letter words and 5124 four letter
> words (assuming my driving of grep is correct) in /usr/share/dict/words.
> So it could be feasible to hash just the short words in the dictionary
> and then stem if you find a match otherwise assume it is a technical
> abbreviation and don't stem.
>

Yes, but there are other keywords that are probably not abbreviations
and are longer than 3 or 4 letters, so a length check alone would not
protect them from stemming. For example: drmkms, usbdevs, scan_ffs, etc. :)

- Abhinav
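
For illustration, here is a rough sketch of the length-based heuristic
discussed in this thread, and of why longer keywords slip past it. It is
not code from any existing tool: the names MAX_SHORT_LEN, load_short_words
and should_stem are hypothetical, the 4-character cutoff is taken from the
suggestion quoted in this thread, and a small inline word list stands in
for /usr/share/dict/words.

```python
# Hypothetical sketch of the "hash only the short dictionary words" idea.
# All names here are made up for illustration.

MAX_SHORT_LEN = 4  # assumed cutoff: abbreviations are mostly <= 4 characters


def load_short_words(lines, max_len=MAX_SHORT_LEN):
    """Collect dictionary words of at most max_len characters into a set."""
    short = set()
    for line in lines:
        word = line.strip().lower()
        if word and len(word) <= max_len:
            short.add(word)
    return short


def should_stem(token, short_words, max_len=MAX_SHORT_LEN):
    """Decide whether a token would be handed to the stemmer."""
    token = token.lower()
    if len(token) > max_len:
        # Long tokens are never looked up, so they are always stemmed.
        return True
    # Short tokens are stemmed only if they are ordinary dictionary words;
    # otherwise they are treated as technical abbreviations.
    return token in short_words


# Stand-in for the lines of /usr/share/dict/words.
dictionary_lines = ["cat\n", "runs\n", "disk\n", "running\n"]
short_words = load_short_words(dictionary_lines)

for token in ["runs", "ffs", "drmkms", "usbdevs", "scan_ffs"]:
    print(token, should_stem(token, short_words))
```

With this sketch, a short non-dictionary token such as "ffs" is left
alone, but drmkms, usbdevs and scan_ffs are all longer than the cutoff
and would still be stemmed, so some manually maintained list of such
keywords would still be needed alongside the length check.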