Hi Dario,

On 02/05/2013 05:00 PM, Dario Strbenac wrote:
> Hello,
>
> Would it be possible to include an option that first goes through all of the
> strings and runs a sliding window along them, to find all the unique k-mers
> present in the dataset?

Finding the unique k-mers in the dataset can easily be done with:

  library(Biostrings)

  uniqueOligonucleotides <- function(x, width)
  {
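    # Count every possible k-mer over all sequences combined, then keep
    # the names of those that occur at least once.
    collapsed_freq <- oligonucleotideFrequency(x, width, simplify.as="collapsed")
    names(collapsed_freq)[which(collapsed_freq != 0L)]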
  }

> This would avoid having a sparse matrix with many columns of all zero counts,
> when a larger value of width is specified.

Sounds like a useful addition. Maybe we could support this through
a 'drop' argument. When 'drop' is TRUE, it would do something like
this (building on top of uniqueOligonucleotides() and vcountPDict()):

  oligonucleotideFrequency2 <- function(x, width)
  {
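    # Only the k-mers that actually occur somewhere in 'x'.
    kmers <- uniqueOligonucleotides(x, width)
    pdict <- PDict(kmers)
    # vcountPDict() returns one row per pattern and one column per
    # sequence, so transpose to get one row per sequence.
    ans <- t(vcountPDict(pdict, x))
    colnames(ans) <- kmers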
    ans
  }

Then:

  > library(hgu95av2probe)
  > probes <- DNAStringSet(hgu95av2probe)

  > dim(freq1 <- oligonucleotideFrequency(head(probes), 5))
  [1]    6 1024

  > dim(freq2 <- oligonucleotideFrequency2(head(probes), 5))
  [1]  6 99

  > identical(freq2, freq1[ , colnames(freq2)])
  [1] TRUE

  > all(freq1[ , setdiff(colnames(freq1), colnames(freq2))] == 0L)
  [1] TRUE
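
For anyone following along without Biostrings installed, the sliding-window
idea itself can be sketched in plain base R. This is only an illustration
under assumptions: count_observed_kmers() is a hypothetical helper name, not
part of Biostrings, it works on an ordinary character vector, and it makes no
attempt to match the speed of oligonucleotideFrequency() or vcountPDict().

```r
# Hypothetical base-R helper (not a Biostrings function): count only the
# k-mers that actually occur in a character vector of sequences.
count_observed_kmers <- function(x, width)
{
    # Slide a window of size 'width' along each string.
    kmers_by_seq <- lapply(x, function(s) {
        n <- nchar(s) - width + 1L
        if (n < 1L)
            return(character(0))
        starts <- seq_len(n)
        substring(s, starts, starts + width - 1L)
    })
    # The columns are the sorted unique k-mers seen anywhere in the dataset.
    kmers <- sort(as.character(unlist(kmers_by_seq, use.names=FALSE)))
    kmers <- unique(kmers)
    # One vector of counts per sequence, aligned on 'kmers'.
    counts <- lapply(kmers_by_seq, function(k)
        tabulate(match(k, kmers), nbins=length(kmers)))
    matrix(as.integer(unlist(counts)),
           nrow=length(x), ncol=length(kmers), byrow=TRUE,
           dimnames=list(names(x), kmers))
}

x <- c("ACGTAC", "TTACGT")
freq <- count_observed_kmers(x, 3)
```

Here 'freq' has 5 columns (ACG, CGT, GTA, TAC, TTA) rather than the
4^3 = 64 columns that a full 3-mer table would have.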

Added to my TODO list.

Thanks,
H.


--------------------------------------
Dario Strbenac
PhD Student
University of Sydney
Camperdown NSW 2050
Australia


_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319
