something like this perhaps? library(Rsamtools) BamFiles <- BamFileList(lapply(list.files(pattern=".bam$"), BamFile)) names(BamFiles) <- sapply(strsplit(list.files(pattern=".bam$"), "\\."), `[`, 1) show(BamFiles) #BamFileList of length 14 #names(14): Normal_0813_S19_R1_001 Normal_2013_S18_R1_001 ... REMC_CD19 REMC_CD3
library(Homo.sapiens) gxs <- genes(Homo.sapiens) names(gxs) <- mapIds(Homo.sapiens, names(gxs), "SYMBOL", "ENTREZID") SomeGenes <- gxs[ c("HRAS", "WT1", "IGF2") ] sbparam <- ScanBamParam(which=SomeGenes) puparam <- PileupParam(max_depth=1e3, min_nucleotide_depth=5, include_insertions=TRUE, min_minor_allele_depth=1) piles <- lapply(BamFiles, pileup, ScanBamParam=sbparam, PileupParam=puparam) filterInvariant <- function(x) { positions <- subset(x, duplicated(pos))$pos subset(x, pos %in% positions) } show(do.call(rbind, lapply(lapply(piles, filterInvariant), head, 2))) # seqnames pos strand nucleotide count # Normal_0813_S19_R1_001.43 chr11 1979973 - C 1 # Normal_0813_S19_R1_001.44 chr11 1979973 + T 1 # Normal_2013_S18_R1_001.16 chr11 1979889 + T 3 # Normal_2013_S18_R1_001.17 chr11 1979889 - T 1 # Normal_2714_S20_R1_001.4 chr11 1979933 + A 1 # Normal_2714_S20_R1_001.5 chr11 1979933 - A 1 # P01-00020-T06-P-cfDNA.21 chr11 1980045 + G 1 # P01-00020-T06-P-cfDNA.22 chr11 1980045 - G 1 # P01-00020-T06-P-cfDNA2.11 chr11 1979884 + T 1 # P01-00020-T06-P-cfDNA2.12 chr11 1979884 - T 1 # P01-00021-T06-P-cfDNA.5 chr11 1979879 + C 1 # P01-00021-T06-P-cfDNA.6 chr11 1979879 - C 1 # P01-00024-T06-P-cfDNA.9 chr11 1979990 + C 1 # P01-00024-T06-P-cfDNA.10 chr11 1979990 + T 2 # P01-00028-T06-N-P01.3 chr11 1979851 + A 1 # P01-00028-T06-N-P01.4 chr11 1979851 + G 1 # P01-00028-T06-P-cfDNA.10 chr11 1979948 + A 1 # P01-00028-T06-P-cfDNA.11 chr11 1979948 + G 1 # P01-00028-T06-T-P01.2 chr11 1979850 + C 3 # P01-00028-T06-T-P01.3 chr11 1979850 - C 2 # P01-00028-T06-T-P02.3 chr11 1979851 + A 1 # P01-00028-T06-T-P02.4 chr11 1979851 + C 1 Or whatever you like (GAlignments would get you the reads, GAlignmentPairs the read pairs, and so on) --t On Sat, Mar 7, 2020 at 3:51 PM Mulder, R <r.mulde...@umcg.nl> wrote: > Dear Tim, > > > I installed Rsamtools and ran it on a bamfile using the following command > to get nucleotide count for 2 hotspot regions. > > > library(Rsamtools) > bamfile <- "test.bam" > bf <- BamFile(bamfile) > param <- ScanBamParam(which=GRanges(c("4", "9"), IRanges(start=c(55599321, > 5073770), end=c(55599321, 5073770)))) > > Now, I want to perform this on more than 1 bam file at once and on more > region. > > Do you perhaps have ideas on how I could do this? > > Best, > > Rene > > ------------------------------ > *Van:* Tim Triche, Jr. <tim.tri...@gmail.com> > *Verzonden:* vrijdag 6 maart 2020 13:57:10 > *Aan:* Mulder, R > *CC:* bioc-devel@r-project.org > *Onderwerp:* Re: [Bioc-devel] MRD measurements in Leukemic patients using > NGS data in r > > Oh hey, one last thing — if all you want is to get nucleotide counts per > region of interest, just use pileup() in Rsamtools, with bamWhich(GRanges) > holding a GRanges (Genomic Ranges) of your regions added to scanBamParams > for each BAM. It sounds awkward but in practice it is super fast and will > give you all the nucleotide and read level information you could want. One > of my interns implemented this for mitochondrial variant calling in > MTseeker when we got sick of using gmapR and being flagged for errors on > not-Linux. (We gutted the entire package recently and have new, insanely > deep examples from Oxford Nanopore direct RNA sequencing and from large > single cell datasets; I need to add those and get the package back out of > purgatory). > > That said, in the end you will want a LOT of validation material so this > is very much just a starting point. But still, it’s your starting point, in > R at least. And truthfully I much prefer R/Bioconductor idioms to (say) > pysam or the like. htsnim is nice but then you’ll be implementing the ML > bits from scratch, so I think your instincts to try R first are sensible. > > Good luck! Even if you use this for something else besides MRD, I think it > will become a useful exercise. > > --t > > On Mar 5, 2020, at 4:36 PM, Tim Triche, Jr. <tim.tri...@gmail.com> wrote: > > > a few thoughts: > > 1) look into Shearwater ( > https://bioconductor.org/packages/release/bioc/html/deepSNV.html > <https://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbioconductor.org%2Fpackages%2Frelease%2Fbioc%2Fhtml%2FdeepSNV.html&data=02%7C01%7Cr.mulder01%40umcg.nl%7C1d7ff8bff12b40c27b7a08d7c1cde470%7C335122f9d4f44d67a2fccd6dc20dde70%7C0%7C0%7C637190962347080053&sdata=LpUeflml9u2XLENpmDAEuDBRANXZpjkFbIEiJAeP2pw%3D&reserved=0>), > then > > 2) talk to Todd Druley @ WashU, Elli Pappaemanuil @ MSKCC, Ruud & Bob @ > Erasmus, the usual suspects > > 3) plan to validate w/ddPCR (at the absolute very least) and be aware that > most MRD in leukemia is done by a combination of BCR/TCR + breakpoint PCR > (lymphoid/fusion-driven) or DFN flow (myeloid + normal cyto) > > not saying that ML-based methods might not help, but if you've got a > 30x-100x genome (or even 1000x FM1) and are trying to compete with existing > standard approaches that can detect molecules at 1e-6, it'll be rough. An > alternative approach (that has been used repeatedly) is to throw caution to > the wind, generate primers for numerous subject-specific somatic variants, > and use the ensemble to try and model MRD (speaking of ML). On the one > hand, that could give the clinic a "customer for life"; on the other hand, > it's not conducive to large-scale automation & deployment. As far as I > know, it never got much traction, in leukemia or anywhere else. (Consider > that flow cytometry is capable of detecting 1-in-10K to 1-in-a-million > cells in most clinical flow labs.) > > Best of luck! (and if you're not already working with UMI-tagged reads, > please talk to the people in #2 above; the reason most people won't go > below 5% VAF is that you get thwacked by error rates at that level, and the > reason most NGS-based MRD is based on UMIs is that existing PCR-based > methods have 6 logs sensitivity.) > > --t > > > On Thu, Mar 5, 2020 at 4:08 PM Mulder, R <r.mulde...@umcg.nl> wrote: > >> Hi, >> >> >> I was wondering if anyone could help me with a script and support for the >> above mentioned goal. >> >> For this I have several BAM files for which I want to determine de >> nucleotide count per region of interest. The latter could be several >> hotspot mutation sites. I would like to get an overall overview of all the >> BAM files. Next I want to use these counts to determine for any new BAM >> file if the count for a particular genomic position is higher than the >> allowable range, hence could indicate if a mutation is present. For this I >> would like to use the modified Thompson Tau test. I think machine learning >> could be used for this. So, why do I want to do all this? Well, normal NGS >> pipelines only deal with variants at a frequency of 5%. Mutatios below this >> frequency are often missed. To know if a mutation is present below this >> level, you showed dive into the alignment and most often manually >> investigate the base calls. I know that this races some questions regarding >> call qualities, but then again our conventional assays have actually >> confirmed some of these low mutations. In addition, NGS can >> be used to determine the presence of low frequent mutation which is of >> great importance for determining the measurable residual disease after >> treatment. >> >> >> I am new to r and bioconductor so I would be very thankful if someone >> could help me in my mission to setting up a script for this purpose. >> >> >> Thanks, >> >> >> Rene Mulder >> >> Laboratory Medicine >> >> University Medical Center Groningen >> >> The Netherlands >> >> ________________________________ >> De inhoud van dit bericht is vertrouwelijk en alleen bes...{{dropped:15}} >> >> _______________________________________________ >> Bioc-devel@r-project.org mailing list >> https://stat.ethz.ch/mailman/listinfo/bioc-devel >> <https://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fstat.ethz.ch%2Fmailman%2Flistinfo%2Fbioc-devel&data=02%7C01%7Cr.mulder01%40umcg.nl%7C1d7ff8bff12b40c27b7a08d7c1cde470%7C335122f9d4f44d67a2fccd6dc20dde70%7C0%7C0%7C637190962347090042&sdata=HXNdz6iiEUQ3l%2BVwwDIGX0RAdO7Tdd7r7f4FcgW1k9M%3D&reserved=0> >> > ------------------------------ > De inhoud van dit bericht is vertrouwelijk en alleen bestemd voor de > geadresseerde(n). Anderen dan de geadresseerde(n) mogen geen gebruik maken > van dit bericht, het niet openbaar maken of op enige wijze verspreiden of > vermenigvuldigen. Het UMCG kan niet aansprakelijk gesteld worden voor een > incomplete aankomst of vertraging van dit verzonden bericht. > > The contents of this message are confidential and only intended for the > eyes of the addressee(s). Others than the addressee(s) are not allowed to > use this message, to make it public or to distribute or multiply this > message in any way. The UMCG cannot be held responsible for incomplete > reception or delay of this transferred message. > [[alternative HTML version deleted]] _______________________________________________ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel