Hi Bert, hello R-experts,

I am close to a solution but I still need one hint w.r.t. the following procedure (available also from https://www.dropbox.com/s/qhmpkkrejjkpbkx/sample_code.txt?dl=0):

rm(list=ls())

# this is (an extract of) the INPUT file I have:
f1 <- structure(list(v1 = c("A", "A", "A", "A", "A", "A", "B", "B",
"B", "B", "B", "B"), v2 = c("A", "B", "C", "A", "B", "C", "A",
"B", "C", "A", "B", "C"), v3 = c("B", "B", "B", "C", "C", "C",
"B", "B", "B", "C", "C", "C"), v4 = c(18.18530, 3.43806, 0.00273,
1.42917, 1.05786, 0.00042, 2.37232, 3.01835, 0, 1.13430, 0.92872,
0)), .Names = c("v1", "v2", "v3", "v4"), class = "data.frame",
row.names = c(2L, 9L, 11L, 41L, 48L, 50L, 158L, 165L, 167L, 197L,
204L, 206L))

# this is the procedure that Bert suggested (slightly adjusted):
z <- rnorm(nrow(f1)) ## or anything you want
z1 <- round(with(f1,v4 + z -ave(z,v1,v2,FUN=mean)), digits=5)
aggregate(v4~v1*v2,f1,sum)
aggregate(z1~v1*v2,f1,sum)
aggregate(v4~v3,f1,sum)
aggregate(z1~v3,f1,sum)

My question to you is: how can I set z so that I can obtain specific values for z1-v4 in the v3 aggregation?

In other words, how can I configure the procedure so that e.g. B=29 and C=2.56723 after running the procedure:

aggregate(z1~v3,f1,sum)

Thank you,

Luca

PS: to avoid any doubts you might have about who I am the following is my web page: http://lucameyer.wordpress.com/

2015-03-21 18:13 GMT+01:00 Bert Gunter <gunter.ber...@gene.com>:

> ... or cleaner:
>
> z1 <- with(f1,v4 + z -ave(z,v1,v2,FUN=mean))
>
>
> Just for curiosity, was this homework? (in which case I should
> probably have not provided you an answer -- that is, assuming that I
> HAVE provided an answer).
>
> Cheers,
> Bert
>
> Bert Gunter
> Genentech Nonclinical Biostatistics
> (650) 467-7374
>
> "Data is not information. Information is not knowledge. And knowledge
> is certainly not wisdom."
> Clifford Stoll
>
>
> On Sat, Mar 21, 2015 at 7:53 AM, Bert Gunter <bgun...@gene.com> wrote:
> > z <- rnorm(nrow(f1)) ## or anything you want
> > z1 <- f1$v4 + z - with(f1,ave(z,v1,v2,FUN=mean))
> >
> >
> > aggregate(v4~v1,f1,sum)
> > aggregate(z1~v1,f1,sum)
> > aggregate(v4~v2,f1,sum)
> > aggregate(z1~v2,f1,sum)
> > aggregate(v4~v3,f1,sum)
> > aggregate(z1~v3,f1,sum)
> >
> >
> > Cheers,
> > Bert
> >
> >
> > On Sat, Mar 21, 2015 at 6:49 AM, Luca Meyer <lucam1...@gmail.com> wrote:
> >> Hi Bert,
> >>
> >> Thank you for your message. I am looking into ave() and tapply() as you
> >> suggested but at the same time I have prepared an example of input and output
> >> files, just in case you or someone else would like to make an attempt to
> >> generate code that goes from input to output.
> >>
> >> Please see below or download it from
> >> https://www.dropbox.com/s/qhmpkkrejjkpbkx/sample_code.txt?dl=0
> >>
> >> # this is (an extract of) the INPUT file I have:
> >> f1 <- structure(list(v1 = c("A", "A", "A", "A", "A", "A", "B", "B",
> >> "B", "B", "B", "B"), v2 = c("A", "B", "C", "A", "B", "C", "A",
> >> "B", "C", "A", "B", "C"), v3 = c("B", "B", "B", "C", "C", "C",
> >> "B", "B", "B", "C", "C", "C"), v4 = c(18.18530, 3.43806, 0.00273, 1.42917,
> >> 1.05786, 0.00042, 2.37232, 3.01835, 0, 1.13430, 0.92872,
> >> 0)), .Names = c("v1", "v2", "v3", "v4"), class = "data.frame", row.names =
> >> c(2L, 9L, 11L, 41L, 48L, 50L, 158L, 165L, 167L, 197L, 204L, 206L))
> >>
> >> # this is (an extract of) the OUTPUT file I would like to obtain:
> >> f2 <- structure(list(v1 = c("A", "A", "A", "A", "A", "A", "B", "B",
> >> "B", "B", "B", "B"), v2 = c("A", "B", "C", "A", "B", "C", "A",
> >> "B", "C", "A", "B", "C"), v3 = c("B", "B", "B", "C", "C", "C",
> >> "B", "B", "B", "C", "C", "C"), v4 = c(17.83529, 3.43806, 0.00295, 1.77918,
> >> 1.05786, 0.0002, 2.37232, 3.01835, 0, 1.13430, 0.92872,
> >> 0)), .Names = c("v1", "v2", "v3", "v4"), class = "data.frame", row.names =
> >> c(2L, 9L, 11L, 41L, 48L, 50L, 158L, 165L, 167L, 197L, 204L, 206L))
> >>
> >> # please notice that while the aggregated v4 on v3 has changed …
> >> aggregate(f1[,c("v4")],list(f1$v3),sum)
> >> aggregate(f2[,c("v4")],list(f2$v3),sum)
> >>
> >> # … the aggregated v4 over v1xv2 has remained unchanged:
> >> aggregate(f1[,c("v4")],list(f1$v1,f1$v2),sum)
> >> aggregate(f2[,c("v4")],list(f2$v1,f2$v2),sum)
> >>
> >> Thank you very much in advance for your assistance.
> >>
> >> Luca
> >>
> >> 2015-03-21 13:18 GMT+01:00 Bert Gunter <gunter.ber...@gene.com>:
> >>>
> >>> 1. Still not sure what you mean, but maybe look at ?ave and ?tapply,
> >>> for which ave() is a wrapper.
> >>>
> >>> 2. You still need to heed the rest of Jeff's advice.
> >>>
> >>> Cheers,
> >>> Bert
> >>>
> >>>
> >>> On Sat, Mar 21, 2015 at 4:53 AM, Luca Meyer <lucam1...@gmail.com> wrote:
> >>> > Hi Jeff & other R-experts,
> >>> >
> >>> > Thank you for your note. I have tried myself to solve the issue without
> >>> > success.
> >>> >
> >>> > Following your suggestion, I am providing a sample of the dataset I am
> >>> > using below (also downloadable in plain text from
> >>> > https://www.dropbox.com/s/qhmpkkrejjkpbkx/sample_code.txt?dl=0):
> >>> >
> >>> > #this is an extract of the overall dataset (n=1200 cases)
> >>> > f1 <- structure(list(v1 = c("A", "A", "A", "A", "A", "A", "B", "B",
> >>> > "B", "B", "B", "B"), v2 = c("A", "B", "C", "A", "B", "C", "A",
> >>> > "B", "C", "A", "B", "C"), v3 = c("B", "B", "B", "C", "C", "C",
> >>> > "B", "B", "B", "C", "C", "C"), v4 = c(18.1853007621835,
> >>> > 3.43806581506388,
> >>> > 0.002733567617055, 1.42917483425029, 1.05786640463504,
> >>> > 0.000420548864162308,
> >>> > 2.37232740842861, 3.01835841813241, 0, 1.13430282139936,
> >>> > 0.928725667117666,
> >>> > 0)), .Names = c("v1", "v2", "v3", "v4"), class = "data.frame", row.names =
> >>> > c(2L, 9L, 11L, 41L, 48L, 50L, 158L, 165L, 167L, 197L, 204L, 206L))
> >>> >
> >>> > I need to find an automated procedure that allows me to adjust v3 marginals
> >>> > while maintaining v1xv2 marginals unchanged.
> >>> >
> >>> > That is: modify the v4 values you can find by running:
> >>> >
> >>> > aggregate(f1[,c("v4")],list(f1$v3),sum)
> >>> >
> >>> > while keeping constant the values you can find by running:
> >>> >
> >>> > aggregate(f1[,c("v4")],list(f1$v1,f1$v2),sum)
> >>> >
> >>> > Now does it make sense?
> >>> >
> >>> > Please notice I have tried to build some syntax that tries to modify values
> >>> > within each v1xv2 combination by computing sum of v4, row percentage in
> >>> > terms of v4, and that is where my effort is blocked. Not really sure how I
> >>> > should proceed. Any suggestion?
> >>> >
> >>> > Thanks,
> >>> >
> >>> > Luca
> >>> >
> >>> >
> >>> > 2015-03-19 2:38 GMT+01:00 Jeff Newmiller <jdnew...@dcn.davis.ca.us>:
> >>> >
> >>> >> I don't understand your description. The standard practice on this list is
> >>> >> to provide a reproducible R example [1] of the kind of data you are working
> >>> >> with (and any code you have tried) to go along with your description. In
> >>> >> this case, that would be two dputs of your input data frames and a dput of
> >>> >> an output data frame (generated by hand from your input data frame).
> >>> >> (Probably best to not use the full number of input values just to keep the
> >>> >> size down.) We could then make an attempt to generate code that goes from
> >>> >> input to output.
> >>> >>
> >>> >> Of course, if you post that hard work using HTML then it will get
> >>> >> corrupted (much like the text below from your earlier emails) and we won't
> >>> >> be able to use it. Please learn to post from your email software using
> >>> >> plain text when corresponding with this mailing list.
> >>> >>
> >>> >> [1] http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example
> >>> >>
> >>> >> ---------------------------------------------------------------------------
> >>> >> Jeff Newmiller                        The     .....       .....  Go Live...
> >>> >> DCN:<jdnew...@dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
> >>> >>                                       Live:   OO#.. Dead: OO#..  Playing
> >>> >> Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
> >>> >> /Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
> >>> >> ---------------------------------------------------------------------------
> >>> >> Sent from my phone. Please excuse my brevity.
> >>> >>
> >>> >> On March 18, 2015 9:05:37 AM PDT, Luca Meyer <lucam1...@gmail.com> wrote:
> >>> >> >Thanks for your input Michael,
> >>> >> >
> >>> >> >The continuous variable I have measures quantities (down to the 3rd
> >>> >> >decimal level) so unfortunately they are not frequencies.
> >>> >> >
> >>> >> >Any more specific suggestions on how that could be tackled?
> >>> >> >
> >>> >> >Thanks & kind regards,
> >>> >> >
> >>> >> >Luca
> >>> >> >
> >>> >> >
> >>> >> >===
> >>> >> >
> >>> >> >Michael Friendly wrote:
> >>> >> >I'm not sure I understand completely what you want to do, but
> >>> >> >if the data were frequencies, it sounds like a task for fitting a
> >>> >> >loglinear model with the model formula
> >>> >> >
> >>> >> >~ V1*V2 + V3
> >>> >> >
> >>> >> >On 3/18/2015 2:17 AM, Luca Meyer wrote:
> >>> >> >> Hello,
> >>> >> >>
> >>> >> >> I am facing a quite challenging task (at least to me) and I was wondering
> >>> >> >> if someone could advise how R could assist me to speed the task up.
> >>> >> >>
> >>> >> >> I am dealing with a dataset with 3 discrete variables and one continuous
> >>> >> >> variable. The discrete variables are:
> >>> >> >>
> >>> >> >> V1: 8 modalities
> >>> >> >> V2: 13 modalities
> >>> >> >> V3: 13 modalities
> >>> >> >>
> >>> >> >> The continuous variable V4 is a decimal number always greater than zero in
> >>> >> >> the marginals of each of the 3 variables but it is sometimes equal to zero
> >>> >> >> (and sometimes negative) in the joint tables.
> >>> >> >>
> >>> >> >> I have got 2 files:
> >>> >> >>
> >>> >> >> => one with distribution of all possible combinations of V1xV2 (some of
> >>> >> >> which are zero or negative) and
> >>> >> >> => one with the marginal distribution of V3.
> >>> >> >>
> >>> >> >> I am trying to build the long and narrow dataset V1xV2xV3 in such a way
> >>> >> >> that each V1xV2 cell does not get modified and V3 fits as closely as
> >>> >> >> possible to its marginal distribution. Does it make sense?
> >>> >> >>
> >>> >> >> To be even more specific, my 2 input files look like the following.
> >>> >> >>
> >>> >> >> FILE 1
> >>> >> >> V1,V2,V4
> >>> >> >> A, A, 24.251
> >>> >> >> A, B, 1.065
> >>> >> >> (...)
> >>> >> >> B, C, 0.294
> >>> >> >> B, D, 2.731
> >>> >> >> (...)
> >>> >> >> H, L, 0.345
> >>> >> >> H, M, 0.000
> >>> >> >>
> >>> >> >> FILE 2
> >>> >> >> V3, V4
> >>> >> >> A, 1.575
> >>> >> >> B, 4.294
> >>> >> >> C, 10.044
> >>> >> >> (...)
> >>> >> >> L, 5.123
> >>> >> >> M, 3.334
> >>> >> >>
> >>> >> >> What I need to achieve is a file such as the following
> >>> >> >>
> >>> >> >> FILE 3
> >>> >> >> V1, V2, V3, V4
> >>> >> >> A, A, A, ???
> >>> >> >> A, A, B, ???
> >>> >> >> (...)
> >>> >> >> D, D, E, ???
> >>> >> >> D, D, F, ???
> >>> >> >> (...)
> >>> >> >> H, M, L, ???
> >>> >> >> H, M, M, ???
> >>> >> >>
> >>> >> >> Please notice that FILE 3 needs to be such that if I aggregate on V1+V2 I
> >>> >> >> recover exactly FILE 1 and that if I aggregate on V3 I can recover a file
> >>> >> >> as close as possible to FILE 2 (ideally the same file).
> >>> >> >>
> >>> >> >> Can anyone suggest how I could do that with R?
> >>> >> >>
> >>> >> >> Thank you very much indeed for any assistance you are able to provide.
> >>> >> >>
> >>> >> >> Kind regards,
> >>> >> >>
> >>> >> >> Luca
> >>> >> >
> >>> >> > [[alternative HTML version deleted]]
> >>> >> >
> >>> >> >______________________________________________
> >>> >> >R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >>> >> >https://stat.ethz.ch/mailman/listinfo/r-help
> >>> >> >PLEASE do read the posting guide
> >>> >> >http://www.R-project.org/posting-guide.html
> >>> >> >and provide commented, minimal, self-contained, reproducible code.
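
One possible way to choose z for the question at the top of this thread (prescribed v3 totals, unchanged v1xv2 totals) is sketched below. It is only an illustration under two assumptions that hold for the sample f1 but may not hold for the full dataset: every v1 x v2 x v3 combination occurs exactly once, and the requested v3 totals add up to the current grand total of v4 (29 + 2.56723 = 31.56723 here). The helper name make_z and the vector target are hypothetical, not something from the earlier replies. The idea is to spread the required change in each v3 total over the v1xv2 cells in proportion to each cell's share of the grand total, so the changes inside any one cell cancel out.

```r
# Sample data from the thread (same v1..v4 values as f1; row names omitted).
f1 <- data.frame(
  v1 = rep(c("A", "B"), each = 6),
  v2 = rep(c("A", "B", "C"), times = 4),
  v3 = rep(rep(c("B", "C"), each = 3), times = 2),
  v4 = c(18.18530, 3.43806, 0.00273, 1.42917, 1.05786, 0.00042,
         2.37232, 3.01835, 0, 1.13430, 0.92872, 0),
  stringsAsFactors = FALSE
)

# Hypothetical helper: returns a z that moves the v3 totals to `target`
# while adding exactly zero to every v1 x v2 cell.
make_z <- function(df, target) {
  # Assumption 1: every v1 x v2 x v3 combination occurs exactly once.
  stopifnot(all(table(df$v1, df$v2, df$v3) == 1))
  # Assumption 2: targets are named by v3 level and keep the grand total.
  stopifnot(setequal(names(target), unique(df$v3)))
  stopifnot(isTRUE(all.equal(sum(target), sum(df$v4))))

  # Current v3 totals and the change needed in each (changes sum to zero).
  current <- vapply(split(df$v4, df$v3), sum, numeric(1))
  delta <- target[names(current)] - current

  # Share of each v1 x v2 cell in the grand total; the same value for every
  # v3 row of a cell, and the shares add up to 1 within each v3 level.
  share <- ave(df$v4, df$v1, df$v2, FUN = sum) / sum(df$v4)

  unname(share * delta[df$v3])
}

target <- c(B = 29, C = 2.56723)
z  <- make_z(f1, target)
z1 <- with(f1, v4 + z - ave(z, v1, v2, FUN = mean))  # Bert's formula

aggregate(z1 ~ v3, f1, sum)        # v3 totals match `target` up to rounding
aggregate(z1 ~ v1 * v2, f1, sum)   # same totals as aggregate(v4 ~ v1 * v2, f1, sum)
```

Because this z already averages to zero within each v1xv2 cell, the ave() term in Bert's formula leaves it untouched. Nothing in the construction keeps individual z1 values non-negative, and other choices of share (any weights that are constant within a cell and sum to 1 over the cells) would satisfy the same two constraints, so this is one solution among many rather than the answer.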