On Tue, Jun 05, 2007 at 09:47:37PM +0100, Roger Leigh wrote:
> Anthony Towns <[EMAIL PROTECTED]> writes:
> >
> > Are either of you going to debconf, or able to point out some example
> > large (free?) data sets that should be packaged like this as a test case
> > for playing with over debconf?
>
> The NCBI non-redundant database (nr).  Having this packaged and
> frequently updated (maybe in volatile) would be fantastic.  There are
> also quite a number of other significant (popular) databases used for
> bioinformatics, genomics, proteomics and other biological fields which
> would be really nice to have in Debian.  Here's a selection:
>
> ftp://ftp.ncbi.nih.gov/blast/db/
> ftp://ftp.ncbi.nih.gov/refseq/
> ftp://ftp.ncbi.nih.gov/repository/
> ftp://ftp.ncbi.nih.gov/pub/taxonomy/

Hi all,

Thanks to Roger, I do not need to give more examples of big datasets.

I recently tried to explore the issues of packaging biological sequence
databases with a small one, miRbase:

ITP: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=420938

This example shows why a packaging system is useful for processing the
data, and why it is space-hungry.

Let us examine the contents of miRbase:
ftp://ftp.sanger.ac.uk/pub/mirbase/sequences/CURRENT/ (no tarball available)

Directory: database_files                   25.05.2007 15:41:00
Directory: genomes                          25.05.2007 15:40:00
File:      README                    5 KB   25.05.2007 15:43:00
File:      THIS_IS_RELEASE_9_2       1 KB   25.05.2007 15:47:00
File:      hairpin.fa.gz           171 KB   25.05.2007 15:40:00
File:      mature.fa.gz             62 KB   25.05.2007 15:40:00
File:      miFam.dat.gz             24 KB   25.05.2007 15:40:00
File:      miRNA.dat.gz            507 KB   25.05.2007 15:40:00
File:      miRNA.dead.gz             3 KB   25.05.2007 15:40:00
File:      miRNA.diff.gz             3 KB   25.05.2007 15:40:00
File:      miRNA.str.gz            362 KB   25.05.2007 15:40:00
File:      miRNA.xls              1193 KB   25.05.2007 15:40:00

The core data is miRNA.dat.gz. Its compression ratio is almost 90%: DNA
sequences are mostly A, C, T, G, N and headers.

To use the database, we can search it either by sequence name (researchers
give names to known sequences), or by similarity ("find anything more than
85% similar to AACTGAATTCGAT"). Debian has tools for this, but they cannot
work directly on miRNA.dat.

Here is a longer summary of the many ways to search miRbase (you may skip
it if you are busy):

* Finding by name: using the (experimental) package emboss, the build
  rules of a mirbase binary package should create an index in EMBOSS
  format. The latest format produces indexes which are bigger than the
  database itself. And it was written specifically to index files bigger
  than 2 GB!

* Finding by similarity: again, emboss is needed, this time to convert
  miRNA.dat to an intermediate format, which can be used to create a
  database in the NCBI blast (package blast2) format.
  EMBOSS can also index these databases by name, but it sacrifices some
  information. NCBI blast users may nevertheless opt for these indexes,
  because this saves a lot of space (the blast databases are binary, not
  flat files).

* Finding through a warehouse: most databases are interconnectable. A gene
  in the "nr" database (see above) can signal that it is the target of a
  miRNA of miRbase, which lists other targets, which code for proteins,
  which have domains, which have structure, which bind drugs, which cure
  diseases, which are caused by mutations in genes, which are the target
  of miRNA... The most famous warehouse is SRS, but it is proprietary.
  Luckily, there is an alternative being developed, MRS
  (http://mrs.cmbi.ru.nl/mrs-3/status.do).

* Finding by SQL: miRbase is one of the rare cases in which MySQL dumps
  are provided (in the directory database_files), so that people can set
  up an SQL database indexing all fields in detail.

* Finding at the office: did you notice the extension of the biggest file?
  Ironically, it is the only one that is not compressed. miRbase is small
  enough to fit in a spreadsheet (which OpenOffice can open).

* Finding by chance: the 'genomes' directory contains the coordinates of
  the entries in the different genomes. When displaying a portion of the
  sequence of a human chromosome, for instance, the provided files can be
  used to flag the places from which the sequences of miRbase originate
  (à la Google Maps).

Consequences from the packaging point of view:

In order to provide data packages which take advantage of the dependency
relationships with binary packages, we need some build mechanisms, mostly
to reformat the data and create the indexes in a format compatible with
the current versions of the packages in Debian, such as emboss and blast2.
This could be done:

- in buildds,
- on the users' computers,
- in "data buildds".

Once processed, the data is more or less duplicated.
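
As a rough illustration of such a build step, and of why the processed data ends up partly duplicated, here is a minimal Python sketch that builds a name index over FASTA-formatted text (the format of files such as hairpin.fa). It is not the EMBOSS index format and not how emboss or blast2 work internally; the function names and the two sample entries are made up for illustration.

```python
import io


def build_name_index(fasta_bytes):
    """Map each sequence name to the byte offset of its '>' header line."""
    index = {}
    offset = 0
    for line in io.BytesIO(fasta_bytes):
        if line.startswith(b">"):
            # The name is the first whitespace-delimited token after '>'.
            tokens = line[1:].split()
            if tokens:
                index[tokens[0].decode("ascii")] = offset
        offset += len(line)
    return index


def fetch(fasta_bytes, index, name):
    """Return the sequence stored under `name`, starting from its offset."""
    stream = io.BytesIO(fasta_bytes)
    stream.seek(index[name])
    stream.readline()  # skip the header line
    parts = []
    for line in stream:
        if line.startswith(b">"):
            break  # next entry begins here
        parts.append(line.strip())
    return b"".join(parts).decode("ascii")


# Two hypothetical entries, for illustration only (not real miRbase records).
SAMPLE = (
    b">example-1 hypothetical entry\n"
    b"AACUGAAUUCGAU\n"
    b"GGCUA\n"
    b">example-2 hypothetical entry\n"
    b"UUGACCA\n"
)

index = build_name_index(SAMPLE)
print(fetch(SAMPLE, index, "example-2"))
```

The index is derived entirely from the source file, yet it has to be shipped or regenerated alongside it, and each tool that expects a different layout needs its own derived copy.
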
In the example of mirbase, we would have:

- The source.
- mirbase-embl, the original database indexed for emboss.
- mirbase-blast, the database reformatted for blast, maybe indexed for
  emboss.
- mirbase-sql, the original database loaded into an SQL server.
- mirbase-common, with the Excel file, the genome goodies, and the
  accessory files which summarise changes from previous versions.

Obviously, this strongly increases the space taken on the mirrors. Also,
in the (mid-term) future, Debian may have many more mainstream tools, and
I am quite sure that they will not all use the same format. So there is a
risk of package proliferation in addition to the inflation of disk space.

Maybe a solution to this would be to rely on dpkg triggers (when
implemented), so that adding new databases would install only what is
requested according to the available tools, and adding new tools would
trigger the reformatting of databases if necessary?

Have a nice day,

-- 
Charles Plessy
http://charles.plessy.org
Wako, Saitama, Japan