You could eventually point your student to MaskedXString and oligonucleotideFrequency(). You can mask the repeats and then just run the latter to count the N-mers. Comparing their original code to the code based on existing high-level utilities might be a useful exercise.
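Here is a rough sketch of what I mean, on a toy sequence. The sequence and the repeat coordinates are made up for illustration; in the package, the mask would come from whatever ranges the repeat search produces. As far as I understand the MaskedXString method, only the unmasked regions are counted.

```r
library(Biostrings)

# Toy sequence; both it and the repeat coordinates are hypothetical.
x <- DNAString("ACGTACGTTTTTTTTTACGTACGA")

# Pretend the repeat search flagged positions 9-16 (the poly-T run).
repeats <- Mask(mask.width = length(x), start = 9, end = 16)

# Attaching the mask turns the DNAString into a MaskedDNAString.
masks(x) <- repeats

# Count k-mers (k = 3 here) over the unmasked regions.
kmer_counts <- oligonucleotideFrequency(x, width = 3)

# Show only the k-mers that actually occur.
kmer_counts[kmer_counts > 0]
```

Each unmasked region should be counted on its own, so no k-mer spans a masked gap. That is worth checking against the student's loop-based results.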
Michael

On Wed, Apr 12, 2017 at 8:24 PM, Pariksheet Nanda <pariksheet.na...@uconn.edu> wrote:
> On Fri, Apr 7, 2017 at 1:13 AM, Hervé Pagès <hpa...@fredhutch.org> wrote:
>>
>> This is the expected behavior.
>>
>> Some background: BSgenomeViews are list-like objects where the *list
>> elements* (i.e. the elements one extracts with [[) are the DNA
>> sequences from the views
> --snip--
>> The important difference is that with [[ I get a DNAString object
>> (the content of the view) and with [ I get a BSgenomeViews object
>> of length 1.
>
> Thank you, Hervé!
>
> I was failing to make the connection with the `[[` accessor.
>
>
> On Fri, Apr 7, 2017 at 1:16 AM, Michael Lawrence <lawrence.mich...@gene.com>
> wrote:
>>
>> I'm curious as to why you are looping over the views in the first
>> place. Maybe we could arrive at a vectorized solution, which is often
>> but not always simpler and faster.
>
> Hi Michael!
>
> Broad background is I'm acculturating an undergraduate student to writing a
> Bioconductor package and applying software engineering practices of version
> control, unit testing, documenting, dependency setup and validation in a
> different environment on our university HPC cluster, etc. The student also
> came along to LibrePlanet to better understand the culture of software
> freedom :o) The package goal is to use Biostrings to look for repeating
> DNA sequences of a fixed kmer size and subset to portions of the genome
> without repeats (an aligner can do this ofc, but the goal is to teach R and
> engineering practices).
>
> I appreciate your thoughtfulness for vectorizing the code to best use
> BSgenomeViews, but please don't spend more than 10 minutes as I have to
> balance changes to the code with the student's learning and coding "voice"
> and may not do proper justice to more of your effort. My slowness to
> reply was from getting the project further along to be more understandable.
>
> Here is the line which I've updated as Hervé suggested to use seq_along():
> https://github.com/coregenomics/kmap/blob/4adaed6b8007e8ea39f39ff57a42a821445d3d46/R/BiostringsProjectNEW.R#L185
> (I'm having a hard time thinking of how to summarize a small example out
> of context).
>
> Although in that line ranges_hits() is only operating on single indices,
> ranges_hits() was written to process groups of indices to reduce
> multi-processor communication. Generating such sets of indices would
> involve applying width() to the views inside mappable() to break it into
> chunks of, say, a million bases for matchPDict(). Again, I'm linking to
> the code in case anything stands out to you, but I will feel bad if you
> spend a lot of time on it.
>
>
>> H.
>
>> Michael
>
> Pariksheet

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel