On Tue, Jul 12, 2016 at 11:41:21AM +0530, Abhinav Upadhyay wrote:
 > >> But the downside is that technical keywords (e.g. kms, lfs, ffs), are
 > >> also stemmed down and stored (e.g. km, lf, ff) in the index. So if you
 > >> search for kms, you will see results for both kms and km.
 > >
 > > Interesting problem.
 > >
 > > I expect the set of documents that contain a word ("directories") and
 > > the set of documents containing its true stem ("directory") to overlap
 > > widely. I also expect the set of documents that contain a word ("kms")
 > > and an incorrect stem ("km") to scarcely overlap. Do the manual pages
 > > meet these expectations? If so, then maybe you can decide whether or not
 > > to keep a stem by looking at the document-set overlap?
 >
 > Yes, usually when the stem is incorrect, the overlap is not that much.
 > But the only way to figure out such cases is manually comparing the
 > output of apropos, unless we have a pre-built list of expected
 > document-set and we can compare those. :)
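
For illustration, the overlap test discussed above could be sketched
roughly as follows. This is only a sketch under stated assumptions: the
index is modeled as a plain dict from term to a set of page names, the
page names are made up, and the function names and the 0.5 threshold
are hypothetical rather than anything apropos currently does.

```python
def overlap_ratio(docs_word, docs_stem):
    """Fraction of the smaller document set that also appears in the other."""
    if not docs_word or not docs_stem:
        return 0.0
    common = docs_word & docs_stem
    return len(common) / min(len(docs_word), len(docs_stem))


def keep_stem(index, word, stem, threshold=0.5):
    """Accept the stem only if its documents overlap widely with the word's.

    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    ratio = overlap_ratio(index.get(word, set()), index.get(stem, set()))
    return ratio >= threshold


# Toy index: term -> set of pages containing that exact term.
# All page names here are invented for the example.
index = {
    "directories": {"page_a.1", "page_b.1", "page_c.1", "page_d.1"},
    "directory": {"page_a.1", "page_b.1", "page_c.1", "page_e.2"},
    "kms": {"page_x.4", "page_y.4"},
    "km": {"page_z.1"},
}

print(keep_stem(index, "directories", "directory"))  # True
print(keep_stem(index, "kms", "km"))  # False
```

With this toy data, "directories" and "directory" share three of four
pages, so the stem is kept, while "kms" and "km" share none, so the
stem is rejected.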

You could build such a list from the current set of man pages, and
refresh it once in a while, and that would probably work well enough.

I'm wondering though if there's some characteristic of the document
sets you can use to automatically reject wrong stemmings without
having to precompute.

What comes to mind though is some kind of diameter or breadth metric
on the image of the document set on the crossreference graph. Or maybe
something like the average crossreference pagerank of the document
set, which if it's too high means you aren't retrieving useful
information. But I guess these notions aren't much use because I'm
sure we don't currently build the crossreference graph.

(Also, as far as longer vs. shorter words, there's not much harm
besides performance in searching for nonsense words like "resize_ff"
as they generally won't match anything.)

-- 
David A. Holland
dholl...@netbsd.org