Hi,

Most BSgenome data packages have been regenerated to use the UCSC 2bit format to store the sequences on disk. The new packages are currently being pushed to the BioC devel repo and should become available in the next hour or so (they'll have version 1.4.0).
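
For anyone unfamiliar with the 2bit idea, here is a minimal sketch of the packing scheme, written in Python purely for illustration. It is not the rtracklayer or BSgenome implementation, and the function names `pack` and `unpack` are made up for this example. It assumes the UCSC convention of 2 bits per base (T=00, C=01, A=10, G=11, first base in the most significant bits) with runs of Ns recorded separately as (start, length) blocks. The real file format also has a header, a sequence index, and soft-mask blocks, all omitted here.

```python
# Illustrative sketch only: not the rtracklayer/BSgenome implementation.
# Base codes follow the UCSC 2bit convention: T=00, C=01, A=10, G=11.
CODES = {"T": 0, "C": 1, "A": 2, "G": 3}
BASES = "TCAG"


def pack(seq):
    """Pack a DNA string into bytes (4 bases per byte) plus N-run blocks."""
    seq = seq.upper()
    unsupported = set(seq) - set("ACGTN")
    if unsupported:
        # Any other letter (e.g. an IUPAC ambiguity code) has no 2-bit code.
        raise ValueError(f"cannot encode letters: {sorted(unsupported)}")
    n_blocks = []  # (start, length) of each run of Ns, stored beside the data
    packed = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        if base == "N":
            if n_blocks and n_blocks[-1][0] + n_blocks[-1][1] == i:
                # Extend the current run of Ns by one position.
                n_blocks[-1] = (n_blocks[-1][0], n_blocks[-1][1] + 1)
            else:
                n_blocks.append((i, 1))
            code = 0  # placeholder bits, overridden by the N block on decode
        else:
            code = CODES[base]
        # First base of each byte goes into the two most significant bits.
        packed[i // 4] |= code << (6 - 2 * (i % 4))
    return bytes(packed), n_blocks, len(seq)


def unpack(packed, n_blocks, length, start=0, end=None):
    """Decode seq[start:end], reading only the bytes that cover that range."""
    end = length if end is None else end
    out = []
    for i in range(start, end):
        code = (packed[i // 4] >> (6 - 2 * (i % 4))) & 3
        out.append(BASES[code])
    # Restore Ns wherever an N block overlaps the requested range.
    for n_start, n_len in n_blocks:
        for i in range(max(n_start, start), min(n_start + n_len, end)):
            out[i - start] = "N"
    return "".join(out)
```

With this sketch, `pack("ACGTNNAC")` yields 2 bytes for 8 bases, and `unpack(..., 2, 6)` decodes a subrange directly from its offsets. A letter outside A, C, G, T, and N raises an error because there is no spare 2-bit code for it.
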

Some basic testing indicates that this new storage outperforms the old storage format (1 .rda file per chromosome) and the more recent storage format (1 big RAzip'ed compressed FASTA file for all chromosomes) in every aspect: for random access with getSeq(), for working one chromosome at a time (e.g. with [[, $, or bsapply), and also for the size of the package tarball. Many thanks to Michael for supporting the 2bit format in rtracklayer.

For genomes that contain letters other than As, Cs, Gs, Ts, or Ns (e.g. hg17, hg18, GRCh38, Ecoli, TAIR.04232008, and TAIR.TAIR9), the 2bit format cannot be used out-of-the-box (not impossible, but would require some workarounds). So for these genomes, I regenerated the BSgenome data packages using the old storage format (1 .rda file per chromosome). They are also currently being pushed to the BioC devel repo (they'll have version 1.3.1000).

Note that, after being deprecated in BioC 2.14, the upstream sequences (i.e. the sequences 1000/2000/5000 bases upstream of annotated transcription starts) are not included in these new packages. Most packages now contain a man page showing how to extract the upstream sequences from the full genome sequences using a gene model.

Please let me know if you have questions or concerns about this.

Thanks,
H.

--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpa...@fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel