Hello,

over the last few days I surprised myself with how much fun I had following Charles' download and post-processing instructions for the complete genomes.
To keep the getData Perl script clean, Charles came up with the idea of moving a good part of the functionality into shared Makefiles. Modularisation. Here is the beast that knows how to retrieve full genomes from Ensembl:

$ more getData.conf.d/Ensembl_genome.mk
SHARED_WGET_OPTIONS=$(shell getData --getWgetOptions)
ENSEMBLVERSION=75
MIRROR = ftp://ftp.ensembl.org/pub/release-$(ENSEMBLVERSION)/fasta

get:
	echo "I: Retrieving data for Ensembl version $(ENSEMBLVERSION) species $(ORGANISM_L)"
	wget $(SHARED_WGET_OPTIONS) $(MIRROR)/$(ORGANISM_L)/dna/$(ORGANISM).*.$(ENSEMBLVERSION).dna.chromosome.*.fa.gz

unpack:
	find . -maxdepth 1 -name "*.fa" -delete
	for file in *chromosome.*.fa.gz ; do zcat $$file > `basename $$file .gz` ; done

blast:
	if [ -x /usr/bin/makeblastdb ]; then \
		echo "I: Found BLAST+ (preferred) for indexing"; \
		cat *fa | makeblastdb -title $(NICKNAME) -dbtype nucl -out $(NICKNAME); \
	elif [ -x /usr/bin/formatdb ]; then \
		echo "I: Found legacy BLAST for indexing"; \
		cat *fa | formatdb -i /dev/stdin -t $(NICKNAME) -n $(NICKNAME) -p F ; \
	fi

The part that calls this Makefile is

$ more getData.conf.d/human.getData
print STDERR "Reading Homo sapiens configuration file\n" if $verbose;

$toBeMirrored{"human.hg18.ncbi36.genome"}={
  "name" => "hg18/NCBI36 – Genome Reference Consortium from Ensembl",
  "tags" => ["human","genome"],
  "source" => "make ORGANISM=Homo_sapiens ORGANISM_L=homo_sapiens ENSEMBLVERSION=54 NICKNAME=hg18 -f /etc/getData.conf.d/Ensembl_genome.mk get unpack",
  "post-download" => "make NICKNAME=hg18 -f /etc/getData.conf.d/Ensembl_genome.mk blast",
  "depends" => "make",
  "recommends" => "ncbi-blast+",
  "size" => "39G"
};

$toBeMirrored{"human.hg19.grch37.genome"}={
  "name" => "hg19/GRCh37 – Genome Reference Consortium from Ensembl",
  "tags" => ["human","genome"],
  "source" => "make ORGANISM=Homo_sapiens ORGANISM_L=homo_sapiens ENSEMBLVERSION=75 NICKNAME=hg19 -f /etc/getData.conf.d/Ensembl_genome.mk get unpack",
  "post-download" => "make NICKNAME=hg19 -f /etc/getData.conf.d/Ensembl_genome.mk blast",
  "depends" => "make",
  "recommends" => "ncbi-blast+",
  "size" => "39G"
};

1;

The variables given on the make command line (ENSEMBLVERSION, NICKNAME, ...) override the defaults set in the Makefile, so the same Ensembl_genome.mk serves both releases.

The size attribute is not used at the moment. At some point getData should warn when there is too little disk space.

The downside of this arrangement, for now, is that the BLAST indices are dispersed across the different data directories. I would very much like to have them found without having to set environment variables. Ideas are welcome.

Best,

Steffen
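
P.S. On the disk-space warning: below is a minimal sketch of how such a check could look, written in shell since the Makefile recipes are shell anyway. The function name has_space_for and the convention of passing the size as a whole number of gigabytes (39 for "39G") are assumptions made for this sketch; getData does nothing of the kind today.

```shell
#!/bin/sh
# has_space_for DIR SIZE_GB   (hypothetical helper, not part of getData)
# Succeeds if the filesystem holding DIR has at least SIZE_GB gigabytes free.
has_space_for() {
    dir=$1
    need_kb=$(( $2 * 1024 * 1024 ))
    # df -P prints one header line, then one line per filesystem;
    # -k reports 1024-byte blocks; column 4 is the available space.
    # (A filesystem name containing spaces would shift the columns.)
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    # An empty result (df failed, e.g. DIR does not exist) counts as "no space".
    [ -n "$avail_kb" ] && [ "$avail_kb" -ge "$need_kb" ]
}

# Example: warn before fetching a 39G data set into the current directory.
if ! has_space_for . 39; then
    echo "W: less than 39G free in $(pwd), download may fail" >&2
fi
```

The same test could sit at the top of the "get" target, or getData could run it itself before calling make, once the "size" attribute is parsed.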