Re: [R] Global curve fitting/shared parameters with nls() alternatives
Dear Bert

Thanks for getting back to me. Yes, that is exactly the sort of problem I am trying to solve. I am aware of the option of hard coding the experimental groups as you suggested, but was hoping for an easy out-of-the-box approach, as I have many groups!

Thanks
James

On Tue, 5 Nov 2019 at 20:28, Bert Gunter wrote:
> A simplified example of what you wish to do might help to clarify here.
>
> Here's my guess. Feel free to dismiss if I'm off base.
>
> Suppose your model is:
>
>     y = exp(a*x) + b
>
> and you wish the b to be constant but the a to vary across expts. Then can
> you not combine the data from both into single x, y vectors, add a variable
> expt that takes the value 1 for expt 1 and 2 for expt 2, and fit the single
> model:
>
>     y = (expt == 1) * (exp(a1*x) + b) + (expt == 2) * (exp(a2*x) + b)
>
> This would obtain separate estimates of a1 and a2 but a single estimate of b.
>
> There are probably better ways to do this, but I've done hardly any
> nonlinear model fitting (so warning!) and can only offer this brute force
> approach; so wait for someone to suggest something better before trying it.
>
> Cheers,
> Bert
>
> On Tue, Nov 5, 2019 at 9:12 AM James Wagstaff wrote:
>
>> Hello
>> I am trying to determine least-squares estimates of the parameters of a
>> nonlinear model, where I expect some parameters to remain constant across
>> experiments, and for others to vary. I believe this is typically referred
>> to as global curve fitting, or the presence of shared/nested parameters.
>> The "[]" syntax in the stats::nls() function is an extremely convenient
>> solution (
>> https://r.789695.n4.nabble.com/How-to-do-global-curve-fitting-in-R-td4712052.html
>> ), but in my case I seem to need the Levenberg-Marquardt/Marquardt solvers
>> such as nlsr::nlxb() and minpack.lm::nlsLM. I can not find any
>> examples/documentation explaining a similar syntax for these tools. Is
>> anyone aware of a nls-like tool with this functionality, or an alternative
>> approach?
>> Best wishes
>> James Wagstaff

--
James Wagstaff

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
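For readers following the thread, here is a minimal sketch of the two formulations discussed above, using stats::nls() only. The data are simulated and the variable names (d, fit_ind, fit_idx) are made up for illustration; whether nlsLM() or nlxb() accept the indexed form is not established in this thread.

```r
# Simulated data (hypothetical): two experiments that share the offset b
set.seed(42)
x      <- rep(seq(0, 2, length.out = 25), times = 2)
expt   <- rep(1:2, each = 25)
a_true <- c(0.5, 1.0)
y      <- exp(a_true[expt] * x) + 3 + rnorm(length(x), sd = 0.05)
d      <- data.frame(x = x, y = y, expt = expt)

# Indicator form: one hard-coded term per experiment, b appears in both
fit_ind <- nls(y ~ (expt == 1) * (exp(a1 * x) + b) +
                   (expt == 2) * (exp(a2 * x) + b),
               data = d, start = list(a1 = 0.4, a2 = 0.9, b = 2.5))

# Indexed form: a[expt] gives one 'a' per experiment, b is shared
fit_idx <- nls(y ~ exp(a[expt] * x) + b,
               data = d, start = list(a = c(0.4, 0.9), b = 2.5))

coef(fit_ind)
coef(fit_idx)
```

Both calls describe the same model, so the estimates should agree; the indexed form scales to many groups by lengthening the start vector for a.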
Re: [R] Global curve fitting/shared parameters with nls() alternatives
>>>>> James Wagstaff
>>>>>     on Fri, 8 Nov 2019 13:20:41 + writes:

    > Dear Bert Thanks for getting back to me. Yes that is
    > exactly the sort of problem I am trying to solve. I am
    > aware of the option of hard coding the experimental groups
    > as you suggested, but was hoping for an easy out of the
    > box approach as I have many groups! Thanks James

If I understand correctly, nlme::nlsList() is exactly what you want.

No need to install anything, as 'nlme' is among the formally 'Recommended' packages and hence is part of every (non-handicapped) R installation.

Best,
Martin Maechler
ETH Zurich and R Core Team
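A minimal sketch of what an nlsList() call could look like for the example model y = exp(a*x) + b, on simulated data (the data and object names are made up). As far as I can tell, nlsList() fits a separate nls() model within each level of the grouping factor, so every parameter varies by group and none is shared; nlme::gnls(), with its params argument, may be closer to a shared-parameter fit, but that is not tested here.

```r
library(nlme)  # a Recommended package, normally installed with R

# Simulated data (hypothetical): two experiments, true b = 3 in both
set.seed(42)
d <- data.frame(x    = rep(seq(0, 2, length.out = 25), times = 2),
                expt = factor(rep(1:2, each = 25)))
d$y <- exp(c(0.5, 1.0)[as.integer(d$expt)] * d$x) + 3 +
  rnorm(nrow(d), sd = 0.05)

# One nls() fit per level of expt; the same start values are used for each
fits <- nlsList(y ~ exp(a * x) + b | expt, data = d,
                start = list(a = 0.7, b = 2.5))

est <- coef(fits)  # one row per experiment, columns a and b
est
```

Each row of est carries its own estimate of b, which is the difference from the shared-b model asked about in the original question.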
[R] how to find number of unique rows for combination of r columns
Hello,

I have a data frame like this:

> head(dt,20)
     chr    pos     gene_id pval_nominal  pval_ret       wl      wr
 1: chr1  54490 ENSG0227232    0.6084950 0.7837780 31.62278 21.2838
 2: chr1  58814 ENSG0227232    0.2952110 0.8975820 31.62278 21.2838
 3: chr1  60351 ENSG0227232    0.4397880 0.8679590 31.62278 21.2838
 4: chr1  61920 ENSG0227232    0.3195280 0.6018090 31.62278 21.2838
 5: chr1  63671 ENSG0227232    0.2377390 0.9880390 31.62278 21.2838
 6: chr1  64931 ENSG0227232    0.2766790 0.9070370 31.62278 21.2838
 7: chr1  81587 ENSG0227232    0.6057930 0.6167630 31.62278 21.2838
 8: chr1 115746 ENSG0227232    0.4078770 0.7799110 31.62278 21.2838
 9: chr1 135203 ENSG0227232    0.4078770 0.9299130 31.62278 21.2838
10: chr1 138593 ENSG0227232    0.8464560 0.5696060 31.62278 21.2838

It is very big:

> dim(dt)
[1] 737191228

To count the number of unique rows for the 3 columns chr, pos and gene_id, I could just join those 3 columns and then count. But how would I find the number of unique rows for these 3 columns without joining them?

Thanks
Ana
Re: [R] how to find number of unique rows for combination of r columns
Hi, Ana,

doesn't

    udt <- unique(dt[c("chr", "pos", "gene_id")])
    nrow(udt)

get close to what you want?

Hth -- Gerrit

---------------------------------------------------------------------
Dr. Gerrit Eichner                   Mathematical Institute, Room 212
gerrit.eich...@math.uni-giessen.de   Justus-Liebig-University Giessen
Tel: +49-(0)641-99-32104          Arndtstr. 2, 35392 Giessen, Germany
http://www.uni-giessen.de/eichner
---------------------------------------------------------------------
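A small sketch of the suggestion above on a plain base-R data frame, next to the "join the columns and count" route from the question. The toy data frame and its values are made up; only the column names chr, pos and gene_id come from the thread.

```r
# Toy stand-in for the poster's data (values are invented)
df <- data.frame(chr     = c("chr1", "chr1", "chr1", "chr2"),
                 pos     = c(54490, 54490, 58814, 54490),
                 gene_id = c("G1", "G1", "G1", "G1"),
                 pval    = c(0.61, 0.30, 0.44, 0.32))

key <- c("chr", "pos", "gene_id")

# Count distinct combinations of the key columns without pasting them
n_unique <- nrow(unique(df[key]))

# Equivalent count via duplicated(): rows that are not repeats of an earlier key
n_first <- sum(!duplicated(df[key]))

# The "join the columns, then count" route, for comparison
n_paste <- length(unique(paste(df$chr, df$pos, df$gene_id, sep = "_")))

c(n_unique, n_first, n_paste)
```

All three counts agree on this toy input; the first two avoid building a pasted character vector as long as the table.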
Re: [R] how to find number of unique rows for combination of r columns
I tried it but I got this error:

> udt <- unique(dt[c("chr", "pos", "gene_id")])
Error in `[.data.table`(dt, c("chr", "pos", "gene_id")) :
  When i is a data.table (or character vector), the columns to join by
  must be specified using 'on=' argument (see ?data.table), by keying x
  (i.e. sorted, and, marked as sorted, see ?setkey), or by sharing
  column names between x and i (i.e., a natural join). Keyed joins might
  have further speed benefits on very large data due to x being sorted
  in RAM.
Re: [R] how to find number of unique rows for combination of r columns
It seems as if dt is not a (base R) data frame but a data table. I assume you will have to transform dt into a data frame (maybe with as.data.frame) to be able to apply unique in the suggested way. However, I am not familiar with data tables. Perhaps somebody else can provide a better-informed suggestion.

Regards -- Gerrit
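For completeness, a hedged sketch of how the same count might be obtained without converting, assuming the data.table package is installed (the block does nothing otherwise). The toy table is made up. The error above arises because a data.table treats a character vector in the first position of `[` as a join, so columns are selected in the second position instead.

```r
if (requireNamespace("data.table", quietly = TRUE)) {
  library(data.table)

  # Toy stand-in for the poster's table (values are invented)
  dt <- data.table(chr     = c("chr1", "chr1", "chr1", "chr2"),
                   pos     = c(54490, 54490, 58814, 54490),
                   gene_id = c("G1", "G1", "G1", "G1"),
                   pval    = c(0.61, 0.30, 0.44, 0.32))
  key <- c("chr", "pos", "gene_id")

  # Number of distinct key combinations, no column subset materialised
  n1 <- uniqueN(dt, by = key)

  # Select columns in j (second position), then de-duplicate
  n2 <- nrow(unique(dt[, key, with = FALSE]))

  # The conversion route suggested above, for comparison
  n3 <- nrow(unique(as.data.frame(dt)[key]))

  print(c(n1, n2, n3))
}
```

On a table with tens of millions of rows, as.data.frame() makes a full copy, so the by= variants are likely the cheaper choice, though that is not measured here.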
Re: [R] how to find number of unique rows for combination of r columns
Thank you so much! Converting it to a data frame resolved the issue!
Re: [R] how to find number of unique rows for combination of r columns
Would you know how I would extract just these unique rows from my original data frame? This gives me only those 3 columns, and I want all columns from the original data frame.

> head(udt)
   chr   pos     gene_id
1 chr1 54490 ENSG0227232
2 chr1 58814 ENSG0227232
3 chr1 60351 ENSG0227232
4 chr1 61920 ENSG0227232
5 chr1 63671 ENSG0227232
6 chr1 64931 ENSG0227232

> head(dt)
    chr   pos     gene_id pval_nominal pval_ret       wl      wr      META
1: chr1 54490 ENSG0227232     0.608495 0.783778 31.62278 21.2838 0.7475480
2: chr1 58814 ENSG0227232     0.295211 0.897582 31.62278 21.2838 0.6031214
3: chr1 60351 ENSG0227232     0.439788 0.867959 31.62278 21.2838 0.6907182
4: chr1 61920 ENSG0227232     0.319528 0.601809 31.62278 21.2838 0.4032200
5: chr1 63671 ENSG0227232     0.237739 0.988039 31.62278 21.2838 0.7482519
6: chr1 64931 ENSG0227232     0.276679 0.907037 31.62278 21.2838 0.5974800
Re: [R] how to find number of unique rows for combination of r columns
Are you trying to eliminate duplicated rows from your data frame? Because that would be better achieved with duplicated().

B.
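A minimal sketch of the duplicated() route on a base-R data frame, keeping all columns while de-duplicating on the three key columns. The toy data and the META values are invented; only the column names follow the thread.

```r
# Toy stand-in for the poster's data (values are invented)
df <- data.frame(chr     = c("chr1", "chr1", "chr1", "chr2"),
                 pos     = c(54490, 54490, 58814, 54490),
                 gene_id = c("G1", "G1", "G1", "G1"),
                 META    = c(0.75, 0.60, 0.69, 0.40))

key <- c("chr", "pos", "gene_id")

# duplicated() flags rows whose key repeats an EARLIER row;
# negating it keeps the first row of each key, with every column intact
first_rows <- df[!duplicated(df[key]), ]
first_rows
```

Note that this keeps whichever row happens to come first for each key, so the surviving META value depends on row order.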
Re: [R] how to find number of unique rows for combination of r columns
Sorry, but you ask basic questions. You really need to spend some more time with an R tutorial or two. This list is not meant to replace your own learning efforts.

You also do not seem to be reading the docs carefully. Under ?unique, it links ?duplicated and tells you that it returns a logical vector flagging the duplicated rows of a data frame. This can then be used by subscripting to remove those rows from the data frame. Here is a reproducible example:

    df <- data.frame(a = 1:3, b = letters[c(1,1,2)], d = LETTERS[c(1,1,2)])
    df[!duplicated(df[, 2:3]), ]  ## Note the ! (logical negation)

If you prefer, the "Tidyverse" world has what are purported to be more user-friendly versions of such data handling functionality that you can use instead.

Bert
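A short sketch of the subscripting step, using the same three-row data frame as the example in the message. Because duplicated() returns a logical vector, rows are selected with `!`; a unary minus would coerce the logicals to 0 and -1 and drop row 1 instead. The object names dup, keep and wrong are made up.

```r
df <- data.frame(a = 1:3, b = letters[c(1, 1, 2)], d = LETTERS[c(1, 1, 2)])

dup <- duplicated(df[, 2:3])  # FALSE TRUE FALSE: row 2 repeats row 1 in b and d

keep  <- df[!dup, ]   # logical negation: keeps rows 1 and 3
wrong <- df[-dup, ]   # -dup is c(0, -1, 0): removes row 1, keeps rows 2 and 3

keep
wrong
```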
Re: [R] how to find number of unique rows for combination of r columns
I am first trying to identify how many duplicate rows there are, as determined by the values in the first 3 columns. I now know that about 2 rows are non-unique. But I would like to extract all 8 columns for those non-unique rows and see what is going on with the META values in them.

I know about the duplicated() function, as well as unique().

On Fri, 8 Nov 2019 at 10:08, Boris Steipe wrote:

> Are you trying to eliminate duplicated rows from your dataframe? Because
> that would be better achieved with duplicated().
>
> B.
>
> > On 2019-11-08, at 10:32, Ana Marija wrote:
> >
> > would you know how would I extract from my original data frame, just
> > these unique rows?
> > because this gives me only those 3 columns, and I want all columns
> > from the original data frame
> >
> > > head(udt)
> >    chr   pos     gene_id
> > 1 chr1 54490 ENSG0227232
> > 2 chr1 58814 ENSG0227232
> > 3 chr1 60351 ENSG0227232
> > 4 chr1 61920 ENSG0227232
> > 5 chr1 63671 ENSG0227232
> > 6 chr1 64931 ENSG0227232
> >
> > > head(dt)
> >     chr   pos     gene_id pval_nominal pval_ret       wl      wr      META
> > 1: chr1 54490 ENSG0227232     0.608495 0.783778 31.62278 21.2838 0.7475480
> > 2: chr1 58814 ENSG0227232     0.295211 0.897582 31.62278 21.2838 0.6031214
> > 3: chr1 60351 ENSG0227232     0.439788 0.867959 31.62278 21.2838 0.6907182
> > 4: chr1 61920 ENSG0227232     0.319528 0.601809 31.62278 21.2838 0.4032200
> > 5: chr1 63671 ENSG0227232     0.237739 0.988039 31.62278 21.2838 0.7482519
> > 6: chr1 64931 ENSG0227232     0.276679 0.907037 31.62278 21.2838 0.5974800
> >
> > On Fri, Nov 8, 2019 at 9:30 AM Ana Marija wrote:
> >>
> >> Thank you so much! Converting it to data frame resolved the issue!
> >>
> >> On Fri, Nov 8, 2019 at 9:19 AM Gerrit Eichner wrote:
> >>>
> >>> It seems as if dt is not a (base R) data frame but a
> >>> data table. I assume, you will have to transform dt
> >>> into a data frame (maybe with as.data.frame) to be
> >>> able to apply unique in the suggested way. However,
> >>> I am not familiar with data tables. Perhaps somebody
> >>> else can provide a more profound guess.
> >>>
> >>> Regards -- Gerrit
> >>>
> >>> Dr. Gerrit Eichner                   Mathematical Institute, Room 212
> >>> gerrit.eich...@math.uni-giessen.de   Justus-Liebig-University Giessen
> >>> Tel: +49-(0)641-99-32104             Arndtstr. 2, 35392 Giessen, Germany
> >>> http://www.uni-giessen.de/eichner
> >>>
> >>> On 08.11.2019 at 16:02, Ana Marija wrote:
> >>>> I tried it but I got this error:
> >>>> > udt <- unique(dt[c("chr", "pos", "gene_id")])
> >>>> Error in `[.data.table`(dt, c("chr", "pos", "gene_id")) :
> >>>>   When i is a data.table (or character vector), the columns to join by
> >>>> must be specified using 'on=' argument (see ?data.table), by keying x
> >>>> (i.e. sorted, and, marked as sorted, see ?setkey), or by sharing
> >>>> column names between x and i (i.e., a natural join). Keyed joins might
> >>>> have further speed benefits on very large data due to x being sorted
> >>>> in RAM.
> >>>>
> >>>> On Fri, Nov 8, 2019 at 8:58 AM Gerrit Eichner wrote:
> >>>>>
> >>>>> Hi, Ana,
> >>>>>
> >>>>> doesn't
> >>>>>
> >>>>> udt <- unique(dt[c("chr", "pos", "gene_id")])
> >>>>> nrow(udt)
> >>>>>
> >>>>> get close to what you want?
> >>>>>
> >>>>> Hth -- Gerrit
> >>>>>
> >>>>> On 08.11.2019 at 15:38, Ana Marija wrote:
> >>>>>> Hello,
> >>>>>>
> >>>>>> I have a data frame like this:
> >>>>>>
> >>>>>> > head(dt,20)
> >>>>>>     chr    pos     gene_id pval_nominal  pval_ret       wl      wr
> >>>>>> 1: chr1  54490 ENSG0227232    0.6084950 0.7837780 31.62278 21.2838
> >>>>>> 2: chr1  58814 ENSG0227232    0.2952110 0.8975820 31.62278 21.2838
> >>>>>> 3: chr1  60351 ENSG0227232    0.4397880 0.8679590 31.62278 21.2838
> >>>>>> 4: chr1  61920 ENSG0227232    0.3195280 0.6018090 31.62278 21.2838
> >>>>>> 5: chr1  63671 ENSG0227232    0.2377390 0.9880390 31.62278 21.2838
> >>>>>> 6: chr1  64931 ENSG0227232    0.2766790 0.9070370 31.62278 21.2838
> >>>>>> 7: chr1  81587 ENSG0227232    0.6057930 0.6167630 31.62278 21.2838
> >>>>>> 8: chr1 115746 ENSG0227232    0.4078770 0.7799110 31.62278 21.2838
> >>>>>> 9: chr1 135203 ENSG0227232    0.4078770 0.9299130 31.62278 21.2838
Re: [R] how to find number of unique rows for combination of r columns
Good. duplicated() returns a logical (boolean) index vector that you can use to extract the non-unique rows.

B.

> On 2019-11-08, at 11:30, Ana Marija wrote:
>
> I am trying to first identify how many duplicate rows are there determined by
> the unique values in the first 3 columns. Now I know that is about 2 rows
> which are non unique. But I would like to extract all 8 columns for those non
> unique rows and see what is going on with META value I have in them.
>
> About duplicated() function I know as well as about unique
>
> On Fri, 8 Nov 2019 at 10:08, Boris Steipe wrote:
> > Are you trying to eliminate duplicated rows from your dataframe? Because that
> > would be better achieved with duplicated().
> >
> > B.
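A minimal base R sketch of that idea, on made-up data (the data frame `d` and all its values are invented for illustration; only the column names follow the thread). One detail worth noting: duplicated() on its own flags only the second and later copies of a key, so to see every row that shares a key it is usually combined with a second pass using fromLast = TRUE.

```r
# Toy data (invented): rows 1 and 3 share the same (chr, pos, gene_id) key.
d <- data.frame(
  chr     = c("chr1", "chr1", "chr1", "chr1"),
  pos     = c(54490, 58814, 54490, 60351),
  gene_id = c("g1", "g1", "g1", "g1"),
  META    = c(0.75, 0.60, 0.41, 0.69),
  stringsAsFactors = FALSE
)
key <- d[, c("chr", "pos", "gene_id")]

# TRUE only for the second and later occurrences of a key
later_copies <- duplicated(key)

# TRUE for every row whose key occurs more than once
all_copies <- later_copies | duplicated(key, fromLast = TRUE)

n_extra   <- sum(later_copies)   # how many rows are redundant copies
dup_rows  <- d[all_copies, ]     # all columns, every copy of a repeated key
uniq_rows <- d[!later_copies, ]  # all columns, first occurrence of each key
```

Here `dup_rows` is the piece that lets one compare the META values of rows that agree on the first three columns.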
Re: [R] how to find number of unique rows for combination of r columns
Correction:

df <- data.frame(a = 1:3, b = letters[c(1,1,2)], d = LETTERS[c(1,1,2)])
df[!duplicated(df[,2:3]), ]  ## Note the ! sign

Bert Gunter

"The trouble with having an open mind is that people keep coming along and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )

On Fri, Nov 8, 2019 at 7:59 AM Bert Gunter wrote:

> Sorry, but you ask basic questions. You really need to spend some more time
> with an R tutorial or two. This list is not meant to replace your own
> learning efforts.
>
> You also do not seem to be reading the docs carefully. Under ?unique, it
> links ?duplicated and tells you that it gives indices of duplicated rows of
> a data frame. These then can be used by subscripting to remove those rows
> from the data frame. Here is a reproducible example:
>
> df <- data.frame(a = 1:3, b = letters[c(1,1,2)], d = LETTERS[c(1,1,2)])
> df[-duplicated(df[,2:3]), ]  ## Note the - sign
>
> If you prefer, the "Tidyverse" world has what are purported to be more
> user-friendly versions of such data handling functionality that you can use
> instead.
>
> Bert
>
> On Fri, Nov 8, 2019 at 7:38 AM Ana Marija wrote:
>
>> would you know how would I extract from my original data frame, just
>> these unique rows?
>> because this gives me only those 3 columns, and I want all columns
>> from the original data frame
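For anyone wondering why the sign matters, here is a short sketch using the same toy df as in the correction (the names `dup`, `wrong` and `right` are introduced here only for illustration). duplicated() returns a logical vector, so unary minus turns it into numbers and the result is used as a numeric row index rather than a logical mask.

```r
df <- data.frame(a = 1:3, b = letters[c(1, 1, 2)], d = LETTERS[c(1, 1, 2)])

# Row 2 repeats the (b, d) combination of row 1
dup <- duplicated(df[, 2:3])

# Arithmetic negation gives c(0, -1, 0): as a row index the zeros are
# ignored and -1 drops row 1, so the duplicate (row 2) is kept.
wrong <- df[-dup, ]

# Logical negation keeps exactly the rows not flagged as duplicates.
right <- df[!dup, ]
```

With `-`, the first row is dropped and the repeated row survives; with `!`, the first occurrence of each (b, d) pair is kept, which is what was intended.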
[R] About separate train and test data
Hi,

Suppose we have separate train and test data files (we do not want to do k-fold cross-validation). Does that mean we will not use the function trainControl? In that case, if we have to tune the parameters, do we need to specify search = "grid" in the train function?

My second question is how we can compute the MCC classification measure. Is it specified the same way as metric = "ROC" in the train function?
Re: [R] how to find number of unique rows for combination of r columns
Thank you so much!!!

On Fri, Nov 8, 2019 at 11:40 AM Bert Gunter wrote:
>
> Correction:
> df <- data.frame(a = 1:3, b = letters[c(1,1,2)], d = LETTERS[c(1,1,2)])
> df[!duplicated(df[,2:3]), ]  ## Note the ! sign
>
> Bert Gunter
Re: [R] how to find number of unique rows for combination of r columns
With this example

> df = data.frame(a = c(1, 1, 2, 2), b = c(1, 1, 2, 3), value = 1:4)
> df
  a b value
1 1 1     1
2 1 1     2
3 2 2     3
4 2 3     4

The approach to drop duplicates in the first and second columns has as a consequence the arbitrary choice of 'value' for the duplicate entries -- why choose a value of '1' rather than '2' (or the average of 1 and 2, or a list containing all possible values, or...) for the rows duplicated in columns a and b?

> df[!duplicated(df[,1:2]),]
  a b value
1 1 1     1
3 2 2     3
4 2 3     4

In base R one might

> aggregate(value ~ a + b, df, mean)
  a b value
1 1 1   1.5
2 2 2   3.0
3 2 3   4.0
> aggregate(value ~ a + b, df, list)
  a b value
1 1 1  1, 2
2 2 2     3
3 2 3     4

but handling several value-like columns would be hard(?)

Using library(dplyr), I have

> group_by(df, a, b) %>% summarize(mean_value = mean(value))
# A tibble: 3 x 3
# Groups:   a [2]
      a     b mean_value
  <dbl> <dbl>      <dbl>
1     1     1        1.5
2     2     2        3
3     2     3        4

or

> group_by(df, a, b) %>% summarize(values = list(value))
# A tibble: 3 x 3
# Groups:   a [2]
      a     b values
  <dbl> <dbl> <list>
1     1     1 <int [2]>
2     2     2 <int [1]>
3     2     3 <int [1]>

summarizing multiple columns with dplyr

> df$v1 = 1:4
> df$v2 = 4:1
> group_by(df, a, b) %>% summarize(v1_mean = mean(v1), v2_median = median(v2))
# A tibble: 3 x 4
# Groups:   a [2]
      a     b v1_mean v2_median
  <dbl> <dbl>   <dbl>     <dbl>
1     1     1     1.5       3.5
2     2     2     3         2
3     2     3     4         1

I do not know how performant this would be with data of your size.

Martin Morgan

On 11/8/19, 1:39 PM, "R-help on behalf of Ana Marija" wrote:

    Thank you so much!!!
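On the "hard(?)" point: when every value column gets the same summary function, base R's aggregate() accepts several columns on the left-hand side of the formula via cbind(). A small sketch reusing the toy df from the message above (the name `res` is introduced here for illustration). Applying a different function to each column, as in the dplyr call, would still take more work in base R, for example separate aggregate() calls that are then merged.

```r
df <- data.frame(a = c(1, 1, 2, 2), b = c(1, 1, 2, 3), value = 1:4)
df$v1 <- 1:4
df$v2 <- 4:1

# One row per (a, b) combination, with the mean of each value column
res <- aggregate(cbind(v1, v2) ~ a + b, data = df, FUN = mean)
res
```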
Re: [R] how to find number of unique rows for combination of r columns
Hello,

If performance is important, and with 73M rows it probably is, take a look at this StackOverflow post [1].

[1] https://stackoverflow.com/a/36058634/8245406

Hope this helps,

Rui Barradas

At 21:33 on 08/11/19, Martin Morgan wrote:
> I do not know how performant this would be with data of your size.
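Since dt in this thread is already a data.table, one option is to stay in data.table and use the `by` argument of its unique(), duplicated() and uniqueN() methods instead of converting to a data frame first. This is only a sketch on made-up data, and it is not necessarily what the linked post does; `cols` and all the toy values are assumptions for illustration.

```r
library(data.table)

# Toy data (invented): rows 1 and 3 share the same (chr, pos, gene_id) key.
dt <- data.table(
  chr     = c("chr1", "chr1", "chr1", "chr1"),
  pos     = c(54490, 58814, 54490, 60351),
  gene_id = c("g1", "g1", "g1", "g1"),
  META    = c(0.75, 0.60, 0.41, 0.69)
)
cols <- c("chr", "pos", "gene_id")

n_unique <- uniqueN(dt, by = cols)  # number of distinct key combinations
first    <- unique(dt, by = cols)   # first row per key, all columns kept

# Every row whose key occurs more than once, all columns kept
is_dup <- duplicated(dt, by = cols) | duplicated(dt, by = cols, fromLast = TRUE)
dups   <- dt[is_dup]

# Selecting columns by name: dt[c("chr", "pos", "gene_id")] is read as a
# join on rows (hence the error earlier in the thread); this form selects columns.
key_only <- dt[, ..cols]
```

Whether this is fast enough at 73M rows would need to be measured on the real data.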