[debian-qa in CC because here we are discussing UDD issues.]

On Thu, Oct 22, 2009 at 12:30:06AM +0900, Charles Plessy wrote:

> First of all, let's summarise the situation. We want to integrate some
> metadata in our 'web sentinels', like
> 'http://debian-med.alioth.debian.org/tasks/bio'.
I would like to add that most probably other use cases for this kind of
data will evolve as well. Keeping this in mind, we might consider moving
the topic to debian-devel in the next stage of development.

> What I propose is to have a special file in the source packages for
> gathering all possible useful information,
> debian/upstream-metadata.yaml.

I have noticed this and I really like this effort very much (even if I
did not actively support it by adding such a file to the packages I
touched recently).

> In contrast to debian/control, this file would not contribute data to
> the Packages.gz files of the Debian archive. I think that there are
> enough source packages managed in version control systems that we can
> use them as the main source of our data.

I'm not really happy about this "we ignore packages which are not
maintained in a VCS" attitude, but it sounds reasonable to assume that
in practice all those packages that potentially contain this kind of
information are actually maintained in a VCS.

An alternative way to gather the information popped up in my mind: there
is some code that checks the translation status of upstream sources by
unpacking all source packages and checking for <lang>.po files. So there
is actually some code which handles the complete unpacking of Debian
source packages, and it could be used to fetch
debian/upstream-metadata.yaml as well. The pro is that it covers all
packages - the con is that it only looks at already uploaded packages.

> This makes debian/upstream-metadata.yaml available independently of the
> Debian archive, and more importantly, will allow updating the metadata
> without uploading the package, but in a way that only the maintainers
> can do the update, which keeps things under control.

This has a certain advantage of flexibility over the method I suggested
above. I'm not sure which way I would prefer.
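To make the file format concrete, here is what a minimal
debian/upstream-metadata.yaml might look like, plus a few lines of
Python that read it. This is only a sketch: the PMID value for samtools
(19505943) is taken from this thread, the other field names are invented
examples, and the hand-rolled parser only handles flat "Key: value"
lines - real tooling would of course use a proper YAML library.

```python
# Sketch: reading a flat debian/upstream-metadata.yaml.
# The samtools PMID (19505943) comes from this discussion;
# Name and Homepage are hypothetical example fields.
SAMPLE = """\
Name: SAMtools
Homepage: http://samtools.sourceforge.net
PMID: 19505943
"""

def parse_flat_yaml(text):
    """Parse 'Key: value' lines into a dict (flat YAML subset only)."""
    record = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            record[key.strip()] = value.strip()
    return record

metadata = parse_flat_yaml(SAMPLE)
print(metadata["PMID"])   # -> 19505943
```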
Implementation-wise the VCS method is probably much easier to implement
- so we should probably stick to your decision - but I wanted to mention
an alternative which IMHO might have slightly better chances of being
accepted on debian-devel for general purposes, because people there
might be interested in completeness.

> The missing piece of the puzzle is then an aggregator that would
> collect the information from the source packages and prepare tables
> for the UDD. I am drafting such a program at
> http://upstream-metadata.debian.net/. Currently, it does not do much:
>
> http://upstream-metadata.debian.net/<package>/ALL gets
> debian/upstream-metadata.yaml if the package is in a subversion server
> that is available to 'debcheckout'. Luckily, most of our packages are.
>
> http://upstream-metadata.debian.net/<package>/<key> gives the content
> of the metadata for one key.

This sounds really good.

> For instance, http://upstream-metadata.debian.net/samtools/PMID gives
> the PubMed identification number for the article describing SamTools,
> 19505943.
>
> This is the proof of principle for data retrieval. Then, we need to
> construct the tables. I plan to have the program store the results in
> a BerkeleyDB database, and to make it output tables at constant
> intervals, for instance daily. The update of the internal database
> would be done in two ways.

If you plan to propagate this data to UDD this might not be an optimal
solution. UDD imports are usually a two-step process:

 1. Fetch text data from whatever source as clear text.
 2. Delete the table, read the text data and put it into the table.

If we want to follow this scheme for our specific case, IMHO the best
idea would be to just drop a <package>.yaml file into a directory where
rsync or wget can fetch these files. The second step, reading the YAML
files, is quite simple.

> First, updates could be pushed with commit hooks when package
> maintainers commit changes to debian/upstream-metadata.yaml.
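Such a commit hook could stay tiny: check whether
debian/upstream-metadata.yaml was among the committed paths and, if so,
ping the aggregator. This is only a sketch - the trigger URL is made up,
and how the changed paths reach the script (e.g. via 'svnlook changed'
in a Subversion post-commit hook) is left out:

```python
# Sketch of a VCS post-commit hook: if the metadata file changed,
# notify the aggregator.  TRIGGER_URL is a hypothetical example.
from urllib.request import urlopen

TRIGGER_URL = "http://upstream-metadata.debian.net/update?package="

def metadata_changed(changed_paths):
    """True if any committed path is a debian/upstream-metadata.yaml."""
    return any(p.endswith("debian/upstream-metadata.yaml")
               for p in changed_paths)

def notify(package, changed_paths):
    if metadata_changed(changed_paths):
        # The wget/curl equivalent: a plain GET triggers the update.
        urlopen(TRIGGER_URL + package)

if __name__ == "__main__":
    # In a real hook the changed paths would come from the VCS;
    # here we just exercise the check itself.
    print(metadata_changed(["samtools/debian/upstream-metadata.yaml"]))
```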
> It could be as simple as having an URL that triggers an update, and
> using wget or curl to activate the aggregator.
>
> Second, normal read access could trigger an update if the record is
> getting old.

Currently UDD updates are time based (per cron job) and not event based
(per commit of some data). If you gather the data by any means at
upstream-metadata.debian.net this is not really relevant for the UDD
import (OK, it makes sense to synchronise the cron jobs to make sure
that the upstream-metadata cron job runs before the UDD cron job fetches
the data). So I would vote for the option which is safer to implement.

In this respect I would prefer the second method and run the job once a
day. The reason is that, if I'm not completely wrong, the VCS push would
require configuring *every* VCS which *potentially* might contain
upstream-metadata.yaml files. This is a weak approach because you do not
have control over all VCSes, chances are very high that this will not
happen on all VCSes, and it sounds quite hard to propagate changes to
the commit hooks (imagine upstream-metadata.debian.net becomes
upstream-metadata.debian.org or whatever). In this sense I would vote
for relying on the Vcs fields in the packaging information and fetching
the information via a cron job using the Vcs specified in
debian/control.

> In summary, I propose to store metadata in YAML format in the source
> packages, retrieve and store it in a central place using a web agent
> through the VCS in which the source packages are stored, and
> periodically output tables for the UDD, which keeps a central role for
> the generation of our web sentinel pages.

I like this approach. But there is one thing I'm not really sure about:
how should we design the UDD table? There are two obvious options:

  CREATE TABLE upstream_metadata (
    package text,
    key1    text,
    key2    text,
    ...
    keyN    text,
    PRIMARY KEY (package)
  );

(note that a hyphen as in "upstream-metadata" would need quoting in SQL,
hence the underscore) with a defined set of keys allowed in
upstream-metadata.yaml and exactly one row per package. Every unknown
key will be ignored.
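That first option might look like this in practice - a sketch using
sqlite3 as a stand-in for UDD's PostgreSQL, with an invented key set
(Name, Homepage, PMID):

```python
import sqlite3

# Sketch: option 1, a fixed-column table with one row per package.
# The key set (Name, Homepage, PMID) is an invented example.
KNOWN_KEYS = ("Name", "Homepage", "PMID")

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE upstream_metadata (
                  package  text PRIMARY KEY,
                  Name     text,
                  Homepage text,
                  PMID     text)""")

def import_package(package, metadata):
    """Insert one row; keys outside KNOWN_KEYS are silently ignored."""
    values = [metadata.get(k) for k in KNOWN_KEYS]
    conn.execute("INSERT INTO upstream_metadata VALUES (?, ?, ?, ?)",
                 [package] + values)

import_package("samtools", {"Name": "SAMtools",
                            "PMID": "19505943",
                            "Some-Unknown-Key": "dropped"})

row = conn.execute("SELECT PMID FROM upstream_metadata"
                   " WHERE package = 'samtools'").fetchone()
print(row[0])   # -> 19505943
```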
The advantage of this approach is that tools *know* what keys to expect
and can just rely on how to handle them. Alternatively we could do

  CREATE TABLE upstream_metadata (
    package text,
    key     text,
    value   text,
    PRIMARY KEY (package, key)
  );

with an arbitrary number of rows per package but no duplicated keys for
one package. This is more flexible: in case you need some new kind of
data you do not need to touch the UDD table structure. But it restricts
each key to only one value per package. The third option is to leave out
the PRIMARY KEY constraint altogether, which allows maximum flexibility
(for instance there might be more than one citation record).

BTW, I'm a bit concerned about mixing different data formats: on one
hand you are using YAML, on the other hand BibTeX. Well, for sure having
a BibTeX record is very valuable. But on the other hand the tools
working with this data will need a BibTeX parser. I did not dive into
this and for sure it is doable - but I just wanted to raise this topic
here to hear opinions.

> The proof of principle presented above is only a few lines of code,
> but I would prefer to discuss the idea further before putting more
> time into it.

Thanks for pushing this forward!

> Lastly, I have accumulated a dozen debian/upstream-metadata.yaml files
> in the packages I maintain, so that meaningful tests are doable for
> table generation later. I do not remember the list by heart, but it
> contains seaview, bwa, clustalw, clustalx, perlprimer, samtools, and
> most of the packages I have updated recently.
>
> Since I am quite inexperienced in programming, help is of course most
> welcome.

As I said above: IMHO most of the work is done if you can provide a set
of <package>.yaml files at a freely accessible place.

Kind regards

       Andreas.

-- 
http://fam-tille.de