Oh hey, one last thing: if all you want is to get nucleotide counts per region 
of interest, just use pileup() in Rsamtools. For each BAM, build a 
ScanBamParam whose `which` argument is a GRanges (genomic ranges) of your 
regions, i.e. ScanBamParam(which=yourRegions), and pass it to pileup() as 
scanBamParam. It sounds awkward, but in practice it is super fast and will 
give you all the nucleotide- and read-level information you could want. One of 
my interns implemented this for mitochondrial variant calling in MTseeker when 
we got sick of using gmapR and being flagged for errors on non-Linux 
platforms. (We gutted the entire package recently and have new, insanely deep 
examples from Oxford Nanopore direct RNA sequencing and from large single-cell 
datasets; I need to add those and get the package back out of purgatory.) 
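
For what it's worth, the pileup() call described above might look roughly like 
the sketch below. It assumes the example BAM that ships with Rsamtools 
(ex1.bam, with contigs seq1 and seq2); the regions, the quality cutoffs and 
the helper name count_nucleotides() are made up for illustration, so swap in 
your own hotspot sites and indexed BAM files.

```r
library(Rsamtools)
library(GenomicRanges)

# Example BAM shipped with Rsamtools (its .bai index sits next to it);
# replace with your own indexed BAM files.
bams <- c(ex1 = system.file("extdata", "ex1.bam", package = "Rsamtools"))

# Hypothetical regions of interest, e.g. hotspot sites.
regions <- GRanges(c("seq1", "seq2"),
                   IRanges(start = c(100, 500), end = c(110, 510)))

# Illustrative cutoffs only; tune them for your assay.
pp <- PileupParam(max_depth = 10000,
                  min_base_quality = 13,
                  min_mapq = 10,
                  distinguish_strands = FALSE)

# Hypothetical helper: per-position nucleotide counts within `regions`.
count_nucleotides <- function(bam, regions, pileupParam) {
  pileup(bam,
         scanBamParam = ScanBamParam(which = regions),
         pileupParam = pileupParam)
}

# One data.frame of counts per BAM file.
counts <- lapply(bams, count_nucleotides,
                 regions = regions, pileupParam = pp)
head(counts[["ex1"]])
```

Each element of the list is a plain data.frame with one row per position and 
nucleotide (seqnames, pos, nucleotide, count), so stacking the results from 
many BAMs into one overview table is just an rbind() away.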

That said, in the end you will want a LOT of validation material so this is 
very much just a starting point. But still, it’s your starting point, in R at 
least. And truthfully I much prefer R/Bioconductor idioms to (say) pysam or the 
like. hts-nim is nice, but then you’ll be implementing the ML bits from 
scratch, so I think your instinct to try R first is sensible. 

Good luck! Even if you end up using this for something other than MRD, I 
think it will be a useful exercise.

--t

> On Mar 5, 2020, at 4:36 PM, Tim Triche, Jr. <tim.tri...@gmail.com> wrote:
> 
> 
> a few thoughts: 
> 
> 1) look into Shearwater 
> (https://bioconductor.org/packages/release/bioc/html/deepSNV.html), then 
> 
> 2) talk to Todd Druley @ WashU, Elli Papaemmanuil @ MSKCC, Ruud & Bob @ 
> Erasmus, the usual suspects
> 
> 3) plan to validate w/ddPCR (at the absolute very least) and be aware that 
> most MRD in leukemia is done by a combination of BCR/TCR + breakpoint PCR 
> (lymphoid/fusion-driven) or DFN flow (myeloid + normal cyto)
> 
> not saying that ML-based methods might not help, but if you've got a 30x-100x 
> genome (or even 1000x FM1) and are trying to compete with existing standard 
> approaches that can detect molecules at 1e-6, it'll be rough.  An alternative 
> approach (that has been used repeatedly) is to throw caution to the wind, 
> generate primers for numerous subject-specific somatic variants, and use the 
> ensemble to try and model MRD (speaking of ML). On the one hand, that could 
> give the clinic a "customer for life"; on the other hand, it's not conducive 
> to large-scale automation & deployment. As far as I know, it never got much 
> traction, in leukemia or anywhere else.  (Consider that flow cytometry is 
> capable of detecting 1-in-10K to 1-in-a-million cells in most clinical flow 
> labs.)
> 
> Best of luck! (and if you're not already working with UMI-tagged reads, 
> please talk to the people in #2 above; the reason most people won't go below 
> 5% VAF is that you get thwacked by error rates at that level, and the reason 
> most NGS-based MRD is based on UMIs is that existing PCR-based methods have 6 
> logs sensitivity.)
> 
> --t
> 
> 
> On Thu, Mar 5, 2020 at 4:08 PM Mulder, R <r.mulde...@umcg.nl> wrote:
>> Hi,
>> 
>> 
>> I was wondering if anyone could help me with a script and support for the 
>> above mentioned goal.
>> 
>> For this I have several BAM files for which I want to determine the 
>> nucleotide count per region of interest. The latter could be several hotspot 
>> mutation sites. I would like to get an overview of all the BAM files. Next, 
>> I want to use these counts to determine, for any new BAM file, whether the 
>> count for a particular genomic position is higher than the allowable range 
>> and hence could indicate that a mutation is present. For this I would like 
>> to use the modified Thompson Tau test. I think machine learning could be 
>> used for this. So, why do I want to do all this? Well, normal NGS pipelines 
>> only deal with variants at a frequency of 5% or higher. Mutations below 
>> this frequency are often missed. To know whether a mutation is present 
>> below this level, you have to dive into the alignment and, most often, 
>> manually investigate the base calls. I know that this raises some questions 
>> regarding call qualities, but then again, our conventional assays have 
>> actually confirmed some of these low-frequency mutations. In addition, NGS 
>> can be used to determine the presence of low-frequency mutations, which is 
>> of great importance for determining measurable residual disease after 
>> treatment.
>> 
>> 
>> I am new to R and Bioconductor, so I would be very thankful if someone 
>> could help me set up a script for this purpose.
>> 
>> 
>> Thanks,
>> 
>> 
>> Rene Mulder
>> 
>> Laboratory Medicine
>> 
>> University Medical Center Groningen
>> 
>> The Netherlands
>> 
>> _______________________________________________
>> Bioc-devel@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
