This question is better asked on the Bioconductor support site
https://support.bioconductor.org
please re-post the question (and great answer from Ulrich!) there. I think
there's a straight-forward Biostrings solution, too, and I'm sure it will be
posted when the question appears in the appropriate forum.
Martin
On 11/28/2014 01:24 AM, Ulrich Bodenhofer wrote:
Dear Rodrigo,
We recently released our 'kebabs' package as part of the Bioconductor 3.0
release. Though this package is aimed at providing sequence kernels for
classification, regression, and other tasks such as similarity-based clustering,
the package includes functionality that exactly matches your needs. See the
following example:
s1 <- DNAString("ATCGATCGATCGATCGATCGATCGACTGACTAGCTAGCTACGATCGACTG")
s1
s2 <- DNAString(paste0(rep(s, 200), collate=""))
s2
library(kebabs)
sk13 <- spectrumKernel(k=13, normalized=FALSE)
system.time(kmerFreq <- drop(getExRep(s1, sk13)))
kmerFreq
system.time(kmerFreq <- drop(getExRep(s2, sk13)))
kmerFreq
So you see that the k-mer frequencies are obtained as the explicit feature
vector of the standard (unnormalized) spectrum kernel with k=13. This function
is implemented in highly efficient C++ code that builds up a prefix tree and
only considers k-mers that actually occur in the sequence (as you requested).
You see that even for k=13 and a sequence with tens of thousands of bases, the
computations only take fractions of a second (19 msecs on our 5-year-old Dell
server). The above function also works for DNAStringSets, but, in this case, you
should remove the drop() to get a matrix of k-mer frequencies. The matrix is by
default sparse (class 'dgRMatrix'), but you can also enforce the result to be in
standard dense matrix format (however, still omitting k-mers that do not occur
at all in any of the sequences):
sv <- c(DNAStringSet(s), DNAStringSet(s2))
system.time(kmerFreq <- getExRep(sv, sk13))
kmerFreq
system.time(kmerFreq <- getExRep(sv, sk13, sparse=FALSE))
kmerFreq
How long the k-mers may be, may depend on your system. On our system, the limit
seems to be k=22 for DNA sequences. The same works for RNA and amino acid
sequences. For the latter, however, the limits in terms of k are significantly
lower, since the feature space is obviously much larger for the same k.
I hope that helps. If you have any further questions, please let us know.
Best regards,
Ulrich
Message-ID: <bay179-w424d25c87634bdebad7393c8...@phx.gbl>
From: Rodrigo Bertollo de Alexandre <rodrigodealexan...@hotmail.com>
To: "bioc-devel@r-project.org" <bioc-devel@r-project.org>
Date: Thu, 27 Nov 2014 21:38:39 -0200
Subject: [Bioc-devel] Regarding the function: oligonucleotideFrequency for
k-mers > 11 bps
I've seen that it is almost impossible to work with k-mers as big as 13 with
this function. This is mainly because this function doesn't create a list of
k-mers from the sequence but from all possible combinations.
This is basically a bug, since in a big sequence of 1000 bps the maximum number
of 13-mers is L-k+1 = 988. While the number of possible 13-mers is 4^k =
28561.This means that the code is basically analyzing 27573 nonexistent k-mers.
I'm wondering if there could have a modification in the package regarding this
issue...
I did my own function for this (which it runs ok). However, having all you need
in a unique package would be even better...(I posted my code on the
stackoverflow: http://stackoverflow.com/a/27178731/4004499)
Sincerely,Rodrigo
[[alternative HTML version deleted]]
_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793
_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel