[Bioc-devel] Random access to sequences in fasta files

Thomas Lin Pedersen Thu, 29 Jan 2015 06:42:38 -0800

Hi

I’m wondering whether there are any plans to support random-access reading of 
fasta files, in the sense that the indexes of the sequences to be read in can 
be specified up front.


I’m working on a package for comparative microbial genomics, and it would be a 
huge speed improvement if it were possible to quickly read in thousands of 
sequences distributed across as many files. Currently the proper, vectorised 
approach requires all files to be read in at once and then subsetted, but this 
can result in XStringSets in the GB range just to access a few sequences. The 
slow, un-R way would be to loop through each file (or each sequence, using skip 
and nrec to read in only the relevant sequences). Ideally I’m looking for an 
interface like:

readXStringSet(files, rec)

Where rec is either a vector that would index into the XStringSet as if 
everything from files had been read in, or a list with the same length as 
files, containing the indexes of interest for each file.
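
To make the two forms of rec concrete, here is a rough base-R sketch of how a 
global index vector could be translated into the per-file list. It assumes the 
number of records in each file is already known; the helper name split_rec and 
the argument n_per_file are made up for illustration and are not part of any 
existing package.

```r
# Hypothetical helper: map a global record index onto per-file indexes.
# rec:        indexes into the concatenation of all files (1-based).
# n_per_file: number of records in each file, in the same order as files
#             (assumed to be known up front).
split_rec <- function(rec, n_per_file) {
  stopifnot(all(rec >= 1), all(rec <= sum(n_per_file)))
  # For every global record: which file it lives in ...
  file_of_record <- rep(seq_along(n_per_file), n_per_file)
  # ... and its position within that file.
  pos_in_file <- sequence(n_per_file)
  # Group the within-file positions by file; files without any requested
  # record get an empty integer vector.
  groups <- factor(file_of_record[rec], levels = seq_along(n_per_file))
  unname(split(pos_in_file[rec], groups))
}

# Three files holding 3, 5 and 2 records; ask for global records 2, 4, 5, 9.
per_file <- split_rec(c(2, 4, 5, 9), n_per_file = c(3, 5, 2))
```

Here per_file has one element per file: record 2 of the first file, records 1 
and 2 of the second, and record 1 of the third. Either form of rec therefore 
carries the same information once the record counts are available.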

with best wishes

Thomas
_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
