Re: [Bioc-devel] Random access to sequences in fasta files

Hervé Pagès Fri, 20 Feb 2015 15:42:06 -0800

Hi Thomas,

I've added support for direct access to an arbitrary subset of sequences
in FASTA files. It works in 3 steps:


Step 1: Call fasta.index() to get an index of what's in your files:

    library(Biostrings)
    filepaths <- c(path1, path2, path3, etc...)
    fai <- fasta.index(filepaths, seqtype="DNA")

  This new fasta.index() utility collects information of what records
  are in the file and where they are. It returns the information in a
  data frame with 1 row per record. Its memory footprint is very small
  because it loads very little information in memory.

Step 2: Select the sequences you want to read by subsetting the FASTA
index:

    i <- create a subscript that reflects what you want to read (integer
         vector, or logical vector, or anything that can be used to
         subset 'fai' by row)

    fai[i, ]  # just for testing

Step 3: Pass fai[i, ] to readDNAStringSet():

    readDNAStringSet(fai[i, ])

See ?fasta.index for more information and examples.

Note that you could even save 'fai' in the directory where you have all
your FASTA files and re-use it later.

This is in Biostrings 2.35.11, which should become available via
biocLite() in the next 24 hours or so (if everything goes well).

Let me know if you have any questions about this.

Thanks,
H.

On 01/29/2015 06:45 AM, Martin Morgan wrote:

On 01/29/2015 06:41 AM, Thomas Lin Pedersen wrote:

Hi

I’m querying on whether there are any plans on supporting random
access reading of fasta files in the sense that it is possible to
upfront specify the indexes of sequences that should be read in.

I’m working on a package for comparative microbial genomics and it
would be a huge speed improvement if it was possible to quickly read
in 1000’s of sequences distributed on as many files. Currently the
proper, vectorised approach requires all files to be read in at once
and then subsetted, but this can result in XStringSet’s in the Gb
range, just to access some sequences. The slow, un-R way would be to
loop through each file (or each sequence using skip and nrec to only
read in relevant sequences). I’m preferentially looking for an
interface like:

readXStringSet(files, rec)

Where rec is either a vector that would index into the XStringSet as
if everything from files had been read in, or a list with the same
length as files, containing the indexes of interest for each file.


Hi Thomas -- this should really be posted to support.bioconductor.org,
but see Rsamtools::FaFile and rtracklayer::TwoBitFile access through
getSeq.

Martin

with best wishes

Thomas
_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Re: [Bioc-devel] Random access to sequences in fasta files

Reply via email to