On 7 May 2016 at 20:56, Abhinav Upadhyay <er.abhinav.upadh...@gmail.com> wrote:
> Hi All,
>
> From man-k.org I was able to create a small dataset of queries,
> results and their relevance scores. I am working on trying out some
> machine learning models to improve the ranking algorithm of
> apropos(1).
>
> Currently apropos has a weight for each of the sections such as NAME,
> DESCRIPTION, etc., and it multiplies a match in a section by this
> weight. This is required because a match in one section, for example,
> NAME is more relevant than a match in some other section, such as
> DESCRIPTION. These weights were put arbitrarily by me as I didn't have
> any way to learn their optimum value.
>
> I am trying out some machine learning techniques to learn these
> weights. The results till now have not been any drastic but they are
> definitely an improvement. Hopefully I will be able to get more
> concrete results soon. A small comparison of results between old
> weights and the weights learned from machine learning is below.
>
> apropos -n 10 -C fork #old weights
> fork (2)            create a new process
> perlfork (1)        Perls fork() emulation
> cpu_lwp_fork (9)    finish a fork operation
> pthread_atfork (3)  register handlers to be called when process forks
> rlogind (8)         remote login server
> rshd (8)            remote shell server
> rexecd (8)          remote execution server
> script (1)          make typescript of terminal session
> moncontrol (3)      control execution profile
> vfork (2)           spawn new process in a virtual memory efficient way
>
> apropos -n 10 -C fork #new weights
> fork (2)            create a new process
> perlfork (1)        Perls fork() emulation
> cpu_lwp_fork (9)    finish a fork operation
> pthread_atfork (3)  register handlers to be called when process forks
> vfork (2)           spawn new process in a virtual memory efficient way
> clone (2)           spawn new process with options <-- clone(2) appears in top 10
> daemon (3)          run in the background
> script (1)          make typescript of terminal session
> openpty (3)         tty utility functions
> rlogind (8)         remote login server
>
> clone(2) shows up, rshd(8) and rexecd(8) go away, rlogind(8) moves down.
>
>
> apropos -n 10 -C create new process
> init (8)            process control initialization
> fork (2)            create a new process
> fork1 (9)           create a new process
> timer_create (2)    create a per-process timer
> getpgrp (2)         get process group
> supfilesrv (8)      sup server processes
> posix_spawn (3)     spawn a process
> master (8)          Postfix master process
> popen (3)           process I/O
> _lwp_create (2)     create a new light-weight process
>
> apropos -n 10 -C create new process #new weights
> fork (2)            create a new process <-- fork(2) is number 1
> fork1 (9)           create a new process
> _lwp_create (2)     create a new light-weight process
> pthread_create (3)  create a new thread
> clone (2)           spawn new process with options
> timer_create (2)    create a per-process timer
> UI_new (3)          New User Interface
> init (8)            process control initialization
> posix_spawn (3)     spawn a process
> master (8)          Postfix master process
>
> fork(2) moves to number 1, init(8) moves to 7, clone(2) appears etc.
>
> I wrote a blog about it:
> http://abhinav-upadhyay.blogspot.in/2016/05/teaching-apropos-to-rank-work-in.html
>
> The data is available here:
> https://github.com/abhinav-upadhyay/man-nlp-experiments/tree/master/data
>
> Let me know your thoughts or concerns :)

Very cool - definitely looking forward to seeing the final result land back in apropos(1) :)

As a possible future option, are you planning any special handling of multi-word searches, e.g. heavier weighting when the query words appear consecutively in the data?
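
To make the multi-word idea concrete, here is a rough Python sketch of what I have in mind. It is not how apropos(1) is implemented: the weight values, the ADJACENCY_BONUS constant, the score() helper and the two sample pages are all made up for illustration. It only combines the per-section weighting described in the quoted mail with an extra credit whenever two neighbouring query words also appear next to each other in a section.

```python
import re

# Hypothetical per-section weights; the real apropos(1) values are not
# given in this thread.
SECTION_WEIGHTS = {"NAME": 2.0, "DESCRIPTION": 1.0}

# Assumed extra credit for each pair of query words found side by side.
ADJACENCY_BONUS = 0.5


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def score(query, page):
    """Score a page (a dict of section name -> section text) for a query."""
    terms = tokenize(query)
    total = 0.0
    for section, text in page.items():
        weight = SECTION_WEIGHTS.get(section, 0.0)
        words = tokenize(text)
        # Plain matches: how often each query word occurs in this section.
        matches = sum(words.count(term) for term in terms)
        # Adjacent matches: neighbouring query words that are also
        # neighbours, in the same order, in the section text.
        word_pairs = list(zip(words, words[1:]))
        adjacent = sum(word_pairs.count(pair) for pair in zip(terms, terms[1:]))
        total += weight * (matches + ADJACENCY_BONUS * adjacent)
    return total


# Made-up page texts, only loosely modelled on the listings above.
fork_page = {
    "NAME": "fork create a new process",
    "DESCRIPTION": "fork causes creation of a new process",
}
init_page = {
    "NAME": "init process control initialization",
    "DESCRIPTION": "init is the last stage of the boot process "
                   "and it can create a new session",
}

if __name__ == "__main__":
    for name, page in (("fork", fork_page), ("init", init_page)):
        print(name, score("create new process", page))
```

With these toy inputs the fork page gets credit for "new process" appearing as a phrase in both sections, while the init page only collects scattered single-word matches. Whether such a bonus should be a fixed constant or one more weight learned from the man-k.org data is an open question.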