Hi, this is the first bi-weekly report on my Summer of Code project 'Semantic Package Review Interface for mentors.debian.net'.
My project aims to extract metadata from packages submitted to mentors.d.n[1], and use this data to match a package with a potential sponsor. Since a lot of packages get stuck in the mentoring process because their maintainers have difficulty finding a sponsor, this should ease their entry into Debian.

The initial plan was, very roughly, as follows:

- automatically or semi-automatically assign debtags to new packages
- match new packages with potential sponsors using the debtags in the latter's upload histories
- glue it all together in a nice web UI

Before my proposal got accepted, I started working on a small patch for debexpo, asking maintainers to accept the Debian Machine Usage Policies before they can upload packages[2]. To do this properly, I needed to add features to debexpo's GnuPG wrapper. I suggested using a third-party GPG library, but after some discussion with Nicolas, Arno and others on #debexpo, we realized none was satisfactory: they were either very old Python code, buggy, or as low-level as the underlying C libraries... Thus, during the community bonding period, I started working on a new wrapper. While this wasn't in the scope of my actual Summer of Code project, it allowed me to familiarize myself with debexpo's codebase. I set up an alioth account and pushed a first version into a new branch[3], but I did not have the time to finish it, because of end-of-term projects and upcoming exams. I'll polish it and integrate it with debexpo sometime later, once I have made some progress with the actual GSoC project.

Now to the actual project. My initial plan was to start by extracting tags from new packages. One of my ideas was to use a Bayesian or statistical classifier which would learn from the packages in Debian's archive and predict tags for a new package.
I knew from the beginning that it might be too difficult or too big a project, and might be out of the scope of the SoC, but my other ideas seemed very dumb and I did not think they would get anywhere. So I started working on a classifier, hoping that I would at least gather some data that would be useful later, even if I had to give up the classifier plan entirely. Also, a simpler classifier might be a viable idea for the next step of my project (matching packages with sponsors), so finding out how to use a machine learning API would not be a waste of time.

I decided to first work with package descriptions. Before I started, I had already picked the Python libraries NLTK (Natural Language Toolkit)[4] and scikit-learn[5] as the most interesting candidates. I experimented a bit with both, and quickly saw that NLTK was way more powerful than I needed, while scikit-learn's text processing features were sufficient. I put NLTK aside and got to work.

The first step was to gather data on packages in Debian's archive. After some playing around and frustration with Debian Data Export[6], I finally realized that I already had all the information I wanted on my Debian system; all I needed was to access it with python-apt (for descriptions in apt's cache) and python-debian (for debtags).

The second, and hardest, step was to figure out how to process these package descriptions and debtags to make them usable in scikit-learn. This took some googling, reading documentation, going through Stack Overflow archives and hundreds of tests in ipython. With sklearn's feature extraction and text pre-processing tools, I built a vector space model[7] from the description words (with tf*idf weights[8]) and binarized the tags for use with a multi-label classifier.

Eventually, I got to the point where I could feed a Naive Bayes classifier with package descriptions and tags. The results were, let's say... weird. A few packages in my test set would get accurate tags, and most of them none at all.
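
To make the preprocessing described above a bit more concrete, here is a stdlib-only sketch of the two transformations: tf*idf weighting of description words and binarization of tags. It is only an illustration, not the actual project code. The package names, descriptions and tags are invented, and the idf formula is the plain log(N/df) variant. As far as I know, scikit-learn's own tools do more than this, for instance smoothing and normalization.

```python
import math
import re
from collections import Counter

# Toy data standing in for what python-apt and python-debian would provide.
# These package names, descriptions and tags are made up for illustration.
PACKAGES = {
    "foo-editor": ("simple text editor for the terminal",
                   {"role::program", "use::editing"}),
    "bar-player": ("music player for the terminal",
                   {"role::program", "use::playing"}),
    "libbaz": ("shared library for text parsing",
               {"role::shared-lib"}),
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_matrix(docs):
    """Return (vocabulary, rows); each row holds one tf*idf weight per word."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    n_docs = len(docs)
    # Document frequency: in how many descriptions each word appears.
    df = Counter(w for toks in tokenized for w in set(toks))
    idf = {w: math.log(n_docs / df[w]) for w in vocab}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)  # missing words count as 0
        rows.append([tf[w] * idf[w] for w in vocab])
    return vocab, rows


def binarize_tags(tag_sets):
    """Return (tag list, rows) with one 0/1 column per known tag."""
    all_tags = sorted(set().union(*tag_sets))
    rows = [[1 if t in tags else 0 for t in all_tags] for tags in tag_sets]
    return all_tags, rows


names = sorted(PACKAGES)
vocab, X = tfidf_matrix([PACKAGES[n][0] for n in names])
tags, Y = binarize_tags([PACKAGES[n][1] for n in names])
```

Here X has one row per package and one column per word, and Y has one row per package and one column per tag. A word that appears in every description (like "for" in this toy data) gets a weight of zero, which is the point of the idf factor.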
I managed to tweak it a bit to get more results: tags were assigned to 2% of the packages, this time with very low accuracy (except for a few that got exactly the tags they were supposed to have). I didn't bother writing a real performance evaluator for this classifier: it seems clear enough that developing a completely automatic classifier for debtags is too big a task for this project. I might try again once the Summer of Code is over. For the record, I committed this code to a branch, 'metadata-extract', but I don't think it will be of much future use.

This is not much in terms of lines of code; I spent a lot of time researching, and still had a few exams (which are now over).

At the suggestion of my mentors, Arno and Nicolas, I discussed my problem with Enrico Zini[9]. He was very helpful and gave me a few hints towards a much simpler strategy that 'might just work'. He also advised me to forget about real classifiers, and told me that someone else had tried to develop one in a previous GSoC and got nowhere.

I will use debtags' existing heuristics[10] to suggest a first set of tags for a new package, and ask the maintainer to check and complete it. Then, I can construct a Xapian[11] query from these tags and from tokens extracted from the description to find similar packages, keeping only the packages whose maintainers are available sponsors. Later this summer, I will contribute some new heuristics, which should benefit debtags itself as well as debexpo.

Thanks to Enrico, I now have a more realistic plan for the next few weeks. I should even have a working prototype integrated into debexpo's current UI before the next report.
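
As a rough illustration of the query-building idea, here is a small stdlib-only sketch that turns a description and a set of tags into a list of search terms, and filters candidates by sponsor. Nothing here is existing debexpo code: the function names, the stopword list and the sample data are hypothetical, and the "XT" prefix for tag terms is an assumption based on how apt-xapian-index appears to store debtags.

```python
import re

# Assumption: a tiny stopword list; a real one would be much longer.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "the", "to", "with"}

# Assumption: tag terms carry an "XT" prefix in the index, as
# apt-xapian-index seems to do; check the index before relying on this.
TAG_PREFIX = "XT"


def description_tokens(description):
    """Lowercase words from the description, minus stopwords and repeats."""
    tokens = []
    for word in re.findall(r"[a-z0-9]+", description.lower()):
        if word not in STOPWORDS and word not in tokens:
            tokens.append(word)
    return tokens


def query_terms(description, tags):
    """Combine description tokens and prefixed tags into one term list."""
    return description_tokens(description) + [TAG_PREFIX + t for t in sorted(tags)]


def keep_sponsors(candidates, sponsors):
    """Keep (package, maintainer) pairs whose maintainer is a known sponsor."""
    return [(pkg, who) for pkg, who in candidates if who in sponsors]


terms = query_terms("A simple editor for the terminal",
                    {"use::editing", "role::program"})
matches = keep_sponsors(
    [("foo-editor", "alice@example.org"), ("bar-player", "bob@example.org")],
    {"alice@example.org"},
)
```

With the Xapian Python bindings installed, a term list like this could presumably be passed to something like xapian.Query(xapian.Query.OP_OR, terms). I left that call out of the sketch so that it runs without an index.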
To help myself stay focused and avoid losing time on too much theory or over-complicated ideas, I divided my near-future work into small tasks:

- apply debtags' heuristics to a package
- tokenize a package's description and build a Xapian query with the resulting tokens and the tags above
- make the above work with packages uploaded to mentors.d.n
- ask the maintainer to check/complete the tags assigned to the package
- present the result of the query in debexpo's web UI

That's it for today, and I'll keep in mind this valuable lesson: I should have talked more with my mentors :)

Footnotes:
[1] http://mentors.debian.net/
[2] http://wiki.debian.org/Debexpo/Development#Open_tasks-1
[3] http://anonscm.debian.org/gitweb/?p=debexpo/debexpo.git;a=blob;f=debexpo/lib/gnupg2.py;h=4add6f2a810f2892b99411729f449ecc60be12b1;hb=refs/heads/gpg-rewrite
[4] http://nltk.org/
[5] http://scikit-learn.org/stable/
[6] http://dde.debian.net/dde/
[7] http://en.wikipedia.org/wiki/Vector_space_model
[8] http://en.wikipedia.org/wiki/Tf*idf
[9] http://enricozini.org/
[10] http://anonscm.debian.org/gitweb/?p=debtags/debtagsd.git;a=tree;f=debdata;hb=master
[11] http://www.enricozini.org/2007/debtags/apt-xapian-index/

_______________________________________________
Soc-coordination mailing list
http://lists.alioth.debian.org/cgi-bin/mailman/listinfo/soc-coordination
