Hi guys,

I was hoping those that are interested might offer some constructive
criticism on my application:

Abstract:

Bioinformatics research requires the processing of large amounts of
biological data. Because of the sheer quantity of data analysed, most
researchers must run local mirrors of the databases that they use.
Unfortunately, local mirrors can be intimidating to set up and tedious to
maintain. Researchers may choose to use older versions of the datasets
involved out of laziness or fear of breaking their current scripts, or they
may choose to forego large-scale analyses altogether, especially if they
have less experience with systems administration.

I propose to solve this problem by creating a tool that will automate the
process of finding, installing, updating, and indexing mirrors of biological
databases. It will resolve dependencies, such as datasets that are mapped to
other datasets and programs that are required for indexing. The tool should
allow users to maintain multiple versions of the databases, as some analyses
may be linked to specific revisions of the data. As well, it should automate
migration of the datasets from one directory or volume to another, for cases
where hard disk space is limited.
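
A minimal sketch of what the dependency resolution step might look like, in Python purely for illustration. The dataset names and the dependency map below are hypothetical, and `install_order` is an assumed helper, not part of any existing tool. It returns the requested datasets together with everything they depend on, dependencies first.

```python
# Hypothetical dependency map: each dataset lists what must be installed first.
DEPENDS_ON = {
    "ensembl": set(),
    "gene_ontology": set(),
    "go_annotations": {"gene_ontology", "ensembl"},
}


def install_order(requested, depends_on):
    """Return requested datasets plus their dependencies, dependencies first."""
    order, done, in_progress = [], set(), set()

    def visit(name):
        if name in done:
            return
        if name in in_progress:
            raise ValueError(f"circular dependency involving {name}")
        if name not in depends_on:
            raise KeyError(f"unknown dataset: {name}")
        in_progress.add(name)
        # Sorted only so that the resulting order is deterministic.
        for dep in sorted(depends_on[name]):
            visit(dep)
        in_progress.discard(name)
        done.add(name)
        order.append(name)

    for name in requested:
        visit(name)
    return order


order = install_order(["go_annotations"], DEPENDS_ON)
```

Program dependencies (for example, an indexing tool needed by a dataset) could be added to the same map as extra nodes, under the same assumptions.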

Ideally, biological database mirroring will be made easy enough that it can
be used by anyone familiar with Debian's existing tools. Not only will
current researchers be more likely to use the most up-to-date
biological data, but others who were previously deterred by the inherent
difficulties of maintaining such mirrors may be encouraged to pursue
large-scale data analyses.

Debian is one of the most popular and stable GNU/Linux distributions, and
already provides the base for popular bioinformatics-targeted distributions
such as Debian-Med, DNALinux, and Bio-Linux. Debian currently leads in both
the quality and quantity of bioinformatics packages. It represents the ideal
platform on which to build such a tool. Conversely, such a tool would also
help to solidify Debian as the standard bioinformatics platform.

Theoretically, the application is not limited to biological databases. It
could be readily extended to any situation that requires local mirrors of
large data sets, such as those used in astronomy. Other future development
might also add a GUI to make it more user-friendly.

Detailed Description:

Introduction

Advances in the automation of biological experimentation and data collection
have led to an explosion in the size and number of biological databases.
Although data clearinghouses such as GenBank, EMBL, and DDBJ facilitate the
dissemination of such data, any large-scale bioinformatics analysis requires
local mirrors of the relevant databases. The extreme size and volatility of
the data sets involved have prevented them from being integrated into the
standard Debian package management system. Manually finding, installing,
updating, and indexing such databases is a daunting task for any system
administrator, much less a researcher with limited time and computer
training.


Proposed Project

The project is the creation of a tool to automate the life cycle of
biological databases, from installation to removal. It should be usable by
those with limited technical experience. Its various proposed uses are as
follows:

Select:
    Database selection from a list
    Version selection, if appropriate
    Dependency checking for other databases and/or database versions
    Dependency checking for installed programs (especially important for the
        "processing" step below)
Install:
    Download
    Extract
    Process: load into MySQL, index for BLAST, etc.
    Clean up: remove any remaining downloaded files
Update:
    Check for new versions of installed datasets
    Install updated sets without removing old versions
Remove:
    Remove data that resulted from processing: drop MySQL tables, delete
        indices, etc.
    Remove extracted files
Reinstall:
    Remove and install again
Relocate:
    Either a simple "mv" or a reinstall into a new location
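
The life cycle above could be sketched roughly as follows, in Python purely for illustration. The class names, method names, and the `example_proteome` dataset are all hypothetical. The steps only record what they would do, so the sketch runs without any network or database access. Each database overrides only the steps that differ, which is one way to contain the per-database hand-coding.

```python
class Dataset:
    """Hypothetical base class; each database would subclass it."""

    name = "generic"

    def __init__(self, version):
        self.version = version
        self.log = []  # records the steps taken, in order

    # Install steps. Real versions would fetch and unpack files.
    def download(self):
        self.log.append("download")

    def extract(self):
        self.log.append("extract")

    def process(self):
        # Default: nothing needs doing after extraction.
        self.log.append("process: none")

    def clean_up(self):
        self.log.append("clean up")

    # Remove steps.
    def unprocess(self):
        self.log.append("unprocess: none")

    def delete_files(self):
        self.log.append("delete files")

    def install(self):
        for step in (self.download, self.extract, self.process, self.clean_up):
            step()

    def remove(self):
        self.unprocess()
        self.delete_files()

    def reinstall(self):
        self.remove()
        self.install()


class BlastDataset(Dataset):
    """Hypothetical dataset that is indexed for BLAST after extraction."""

    name = "example_proteome"

    def process(self):
        self.log.append("process: build BLAST index")

    def unprocess(self):
        self.log.append("unprocess: delete BLAST index")


db = BlastDataset("v1")
db.install()
db.reinstall()
```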


Other considerations: Because analyses may be linked to specific versions,
each version will have its own separate installation, e.g. both ensembl.v38
and ensembl.v39. As well, each database will have very different
post-extraction processing, with some being indexed for BLAST, some being
loaded into a local SQL database, and others having nothing done at all.
This problem is compounded by the lack of common data storage formats. A
significant amount of hand-coding may be required for each of the different
databases' installation steps.
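
As a rough illustration of side-by-side versions and relocation, here is a small Python sketch. The `root/name/version` directory layout and the helper names `install_dir` and `relocate` are assumptions made for this example. It uses temporary directories in place of real volumes and handles only the simple "mv" case, not a reinstall.

```python
import shutil
import tempfile
from pathlib import Path


def install_dir(root, name, version):
    """Each version gets its own directory, e.g. <root>/ensembl/v38."""
    return Path(root) / name / version


def relocate(root, new_root, name, version):
    """Move one installed version to another root, leaving others in place."""
    src = install_dir(root, name, version)
    dst = install_dir(new_root, name, version)
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(src), str(dst))
    return dst


with tempfile.TemporaryDirectory() as tmp:
    old_root, new_root = Path(tmp) / "disk1", Path(tmp) / "disk2"
    # Pretend two versions of the same database are installed.
    for version in ("v38", "v39"):
        d = install_dir(old_root, "ensembl", version)
        d.mkdir(parents=True)
        (d / "data.txt").write_text(version)

    moved = relocate(old_root, new_root, "ensembl", "v38")
    v38_moved = (moved / "data.txt").read_text() == "v38"
    v38_gone_from_old = not install_dir(old_root, "ensembl", "v38").exists()
    v39_untouched = install_dir(old_root, "ensembl", "v39").is_dir()
```

Databases loaded into MySQL or indexed in place would likely need the reinstall route instead, since a plain move would not update anything that records the old path.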

Timeline

May: Community bonding period
June: Basic download/version functionality with dependency database
July: Installation functionality for select datasets
August: Updating and relocation functionality
    Add as many other datasets as possible


Personal Background

My name is Aidan Findlater ([EMAIL PROTECTED]). I will be graduating
this May with two degrees, a BSc in Computing and a BSc (Honours) in
Biochemistry. I've spent almost ten years using Linux (converting to Debian
early on--its package management rules all), but have never contributed to
any of the open source projects that I use so often. That's something I
would like to change.

Not only will I have degrees in both applicable fields of study, but I also have
direct experience with the intersection of the two. In 2006, I won an NSERC
Undergraduate Student Research Award to pursue bioinformatics research in
the department of biology. I used BioPerl to do an analysis of N-terminal
acetylation (a post-translational protein modification) where I compared
orthologues in 16 species using the Inparanoid database. I also analysed the
same data set using the Gene Ontology database to determine if there were
terms that were either more or less common in the set. I had to download and
install Inparanoid, HomoloGene, Gene Ontology, and a variety of proteomic
datasets. Keeping them all up to date was frustrating.

This past summer (2007) I worked for the same supervisor in my capacity as a
biochemist. However, I was bored one week and decided to port my original
BioPerl script to BioRuby. I then wrote a Ruby script to automate the
download, extraction, and updating of the biological databases that I was
using. If I had had longer than a week (say, a whole summer) and a mentor to
help guide me, I like to think that it would have been a solid tool. Now I
have the opportunity to build exactly that. While my script was written by me and
for me, the tool that I would like to write would be useful to all
bioinformatics researchers.

When I was in high school, I taught myself QBasic (of course), then C++,
HTML, PHP, SQL, and XHTML. (I'd like to point out that I'm not restricting
the list to programming languages, strictly speaking.) In university, we
were taught Java, Haskell, assembly, C, yet more SQL, and some less
interesting languages. I taught myself Perl for the above-mentioned project,
and Ruby when I got bored of Perl. The point of this is to show that I have
at least passing familiarity with the languages that might be required for
the project, and the ability to quickly learn new ones.

Beyond programming, I also have experience with Debian administration,
having administered my personal Debian servers for around eight years and my
friend's for about as long. I have a TFTP server installed on my (Apple)
laptop so that I can do PXE netinsts of Debian and Ubuntu whenever I need to
(if I have access to the DHCP server). Debian package management is about
the best thing since sliced bread, especially now with Aptitude.

Something that I feel I should explain is my reason for switching
from computing to biochemistry in third year. I was getting frustrated with
computing. The subject matter was often boring and usually erred on the side
of academic. I felt like I wasn't learning anything. My friends who
graduated last year still know very little about computers in any practical
sense. When I took a biochemistry course in third year, it was something new
and fresh. Professors were telling me things I didn't already know. I was
actually learning! I had always had an interest in genetics and molecular
biology, so the change made sense to me.


Thanks,

-Aidan

On 04/04/2008, Aidan Findlater <[EMAIL PROTECTED]> wrote:
>
> If you guys are bored, I threw my old Ruby bio DB updater in a git repo:
> http://archive.aidanfindlater.com/cgi-bin/gitweb.cgi?p=biodbman.git;a=summary
>
> I'd be writing it from scratch, presumably in Perl, but it might give you
> an idea of the way my brain works. Most of the logic was coded separately
> for each DB because they're so very different.
>
> And yes, I really did monkeypatch Hash. I'm sorry.
>
> -Aidan
>
> On 02/04/2008, Aidan Findlater <[EMAIL PROTECTED]> wrote:
> >
> > Dear Charles,
> >
> > I was looking at the Debian website and didn't realize that the deadline
> > was extended, so I ended up filling out the Google application stuff
> > yesterday.
> >
> > I'm not sure what you needed from the SoC website, but the screenshot
> > would require manual stitching of images. Here's the most relevant-seeming
> > information:
> >
> > City: Kingston, Canada
> > University: Queen's University (http://www.queensu.ca/)
> > Degree: BScH Biochemistry and BSc Computing (two degrees)
> > Expected Graduation: May 2008
> > Home Page: http://www.aidanfindlater.com/
> > IM Contact: aidanfindlater (AIM), @gmail.com (Google Talk),
> > @jabber.org (Jabber), @hotmail.com (MSN), @yahoo.ca (Yahoo IM);
> > 12192596 (ICQ)
> >
> > I'm not really sure how it works from here on in. Did you guys have
> > specific ideas about the tool? I was hoping that it would be modeled after
> > apt-get or aptitude so that it would be already familiar to those using it.
> > The proposal posted to the Debian site already looks pretty good to me.
> >
> > How does one get ahold of the preliminary draft? I'd be interested to
> > see what kind of choices he's made with it.
> >
> > -Aidan
> >
> > On 31/03/2008, Charles Plessy <[EMAIL PROTECTED]> wrote:
> > >
> > > On Mon, Mar 31, 2008 at 10:45:55PM +0200, Andreas Tille wrote:
> > >
> > > > On Mon, 31 Mar 2008, Aidan Findlater wrote:
> > > >
> > > > >Oh, I forgot to mention that Debian is the best distro, and Vi is the
> > > > >best editor. Hopefully that last one doesn't get me disqualified...
> > > >
> > > > Well, the first statement increases your chances, but the second pushes
> > > > you out! ;-)  Wait, last chance: What is better, Gnome or KDE? ;-)))
> > > >
> > > > >>Are applications for the SoC biological database manager project
> > > > >>still open? I know it's last-minute, but I thought it can't hurt to
> > > > >>ask. Please let me know if it is possible to still apply, and what I
> > > > >>would need to do.
> > > >
> > > > Just go to the GSoC page and apply as student.  I think it is not too
> > > > late but I personally have no idea how to apply as student - I just
> > > > found out how to apply as mentor. ;-)
> > >
> > >
> > > Dear Aidan,
> > >
> > > Google extended the application period for one week, so yes, our project
> > > is still open :)
> > >
> > > Are you using Jabber ? My ID is [EMAIL PROTECTED] I am in the
> > > Tokyo timezone. I am also mentor for the biological database manager
> > > project. Do not hesitate to contact me for your application, privately
> > > or on this list. I am not usually on IRC, so if you want to see me
> > > there, please ping me beforehand !
> > >
> > > Do not hesitate to send us a screenshot of
> > > http://code.google.com/soc/2008/student.html : I just get a message like
> > > "Sorry mentors can not sign in as a student" !
> > >
> > > Also, we can start a discussion to get to know each other a bit better.
> > > Just start on this list if having the discussion public is not a problem
> > > for you. Otherwise, I think that we could have it with me and Andreas, who
> > > are registered mentors, and with Steffen Möller who made a preliminary
> > > draft in Perl, if he is available at the moment.
> > >
> > > Have a nice day,
> > >
> > >
> > > --
> > > Charles Plessy
> > > http://charles.plessy.org
> > > Wakō, Saitama, Japan
> > >
> >
> >
>
