[See at end]

On 15-Sep-2012 20:36:49 Niklas Fischer wrote:
> Dear R users,
>
> I have a reproducible data and try to create new variable "clo" is 1 if
> know variable is equal to "very well" or "fairly well" and getalong is
> 4 or 5, otherwise it is 0.
>
>[A]
> rep_data<- read.table(header=TRUE, text="
> id1 id2 know getalong
> 100000016_a1 100000016_a2 very well 4
> 100000035_a1 100000035_a2 fairly well NA
> 100000036_a1 100000036_a2 very well 3
> 100000039_a1 100000039_a2 very well 5
> 100000067_a1 100000067_a2 very well 5
> 100000076_a1 100000076_a2 fairly well 5
> ")
>
> rep_data$clo<- ifelse((rep_data$know==c("fairly well","very well")
>                        & rep_data$getalong==c(4,5)),1,0)
>
> For sure, something must be wrong, I couldn't find it out.
>
> rep_data
>          id1          id2        know getalong clo
> 100000016_a1 100000016_a2   very well    4      0
> 100000035_a1 100000035_a2 fairly well   NA      0
> 100000036_a1 100000036_a2   very well    3      0
> 100000039_a1 100000039_a2   very well    5      0
> 100000067_a1 100000067_a2   very well    5      0
> 100000076_a1 100000076_a2 fairly well    5      0
>
> Any help is appreciated..
> Bests,
> Niklas

There are several things wrong with the way you are trying to do it,
and indeed it is a bit complicated!

First: if the above table (at >[A] above) is the format in which you
input the data, then you should either comma-separate your data fields
(and use sep="," in read.table(), or else just use read.csv()), or else
enclose the two-word fields within "...", i.e. EITHER:

>[B]
  id1, id2, know, getalong
  100000016_a1, 100000016_a2, very well, 4
  100000035_a1, 100000035_a2, fairly well, NA
  100000036_a1, 100000036_a2, very well, 3
  100000039_a1, 100000039_a2, very well, 5
  100000067_a1, 100000067_a2, very well, 5
  100000076_a1, 100000076_a2, fairly well, 5

OR:

>[C]
  id1 id2 know getalong
  100000016_a1 100000016_a2 "very well" 4
  100000035_a1 100000035_a2 "fairly well" NA
  100000036_a1 100000036_a2 "very well" 3
  100000039_a1 100000039_a2 "very well" 5
  100000067_a1 100000067_a2 "very well" 5
  100000076_a1 100000076_a2 "fairly well" 5

Otherwise, in your original format, read.table() will read in FIVE
fields, since it will treat "very" and "well" as separate, and will
treat "fairly" and "well" as separate.
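
For what it is worth, here is a minimal sketch of how layouts [B] and [C] might be read in, using only the first three rows of the posted data. The object names (txt_quoted, txt_csv, txt_orig and the dat_* data frames) are made up for illustration. One assumption to flag: since [B] has a blank after each comma, the sketch passes strip.white=TRUE so that the know values should come out as "very well" and not " very well" with a leading blank.

```r
# Layout [C]: whitespace-separated, two-word values enclosed in "...".
txt_quoted <- 'id1 id2 know getalong
100000016_a1 100000016_a2 "very well" 4
100000035_a1 100000035_a2 "fairly well" NA
100000036_a1 100000036_a2 "very well" 3
'
dat_quoted <- read.table(header = TRUE, text = txt_quoted,
                         stringsAsFactors = FALSE)

# Layout [B]: comma-separated. strip.white = TRUE removes the blank that
# follows each comma, so the strings compare equal to "very well" etc.
txt_csv <- 'id1, id2, know, getalong
100000016_a1, 100000016_a2, very well, 4
100000035_a1, 100000035_a2, fairly well, NA
100000036_a1, 100000036_a2, very well, 3
'
dat_csv <- read.csv(text = txt_csv, strip.white = TRUE,
                    stringsAsFactors = FALSE)

# Original layout [A]: unquoted two-word values, so each data row has
# five whitespace-separated fields under a four-field header.
txt_orig <- 'id1 id2 know getalong
100000016_a1 100000016_a2 very well 4
100000035_a1 100000035_a2 fairly well NA
100000036_a1 100000036_a2 very well 3
'
dat_orig <- read.table(header = TRUE, text = txt_orig,
                       stringsAsFactors = FALSE)

str(dat_quoted)  # four columns, know holds the full two-word strings
str(dat_orig)    # columns shifted; first field has become the row names
```

The last call shows the original unquoted layout as read.table() sees it.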

Furthermore, in your original format read.table() will match the header
"getalong" with the 5th field (4,NA,etc), the header "know" with the
4th field ("well","well",...,"well"), header "id2" with the 3rd field
("very","fairly","very",...,"fairly"), and header "id1" with the 2nd
field ("100000016_a2"). And, further still, the first field will become
the row-names of the dataframe and will no longer be data!

Second: Use of "==" to compare $know with "very well" and "fairly well"
will not work as you expect. In your comparison

  rep_data$know==c("fairly well","very well")

you will get the result:

  # [1] FALSE FALSE FALSE  TRUE FALSE FALSE

rather than your expected

  # [1] TRUE TRUE TRUE TRUE TRUE TRUE

This is because "==" compares each element of $know with just ONE
ELEMENT of c("fairly well","very well"), recycling these elements as
needed, so it will compare $know successively with

  "fairly well","very well","fairly well","very well","fairly well","very well"

and since $know is

  "very well","fairly well","very well","very well","very well","fairly well"

the only match is in the 4th instance, which is why you get

  # [1] FALSE FALSE FALSE  TRUE FALSE FALSE

A better comparison is to use the "%in%" operator, as in:

  rep_data$know %in% c("fairly well","very well")
  # [1] TRUE TRUE TRUE TRUE TRUE TRUE

so you can in the end do:

  rep_data$clo<- ifelse((rep_data$know %in% c("fairly well","very well"))
                        & (rep_data$getalong %in% c(4,5)),1,0)

which results in:

  rep_data
  #            id1          id2        know getalong clo
  # 1 100000016_a1 100000016_a2   very well        4   1
  # 2 100000035_a1 100000035_a2 fairly well       NA   0
  # 3 100000036_a1 100000036_a2   very well        3   0
  # 4 100000039_a1 100000039_a2   very well        5   1
  # 5 100000067_a1 100000067_a2   very well        5   1
  # 6 100000076_a1 100000076_a2 fairly well        5   1

Finally, I suppose it is a happy coincidence that NA %in% c(4,5) yields
FALSE rather than what R might have been written to yield, i.e.
NA. Since NA is basically a synonym for "something that we do not know
the value of", strictly speaking we do not know the value of
NA %in% c(4,5). It is possible that the "something that we do not know
the value of" could be either 4 or 5, in which case NA %in% c(4,5)
would be TRUE; but it is also possible that it could be neither 4 nor
5, in which case NA %in% c(4,5) would be FALSE. Since we do not know
which of these possibilities is the case, we do not know whether the
result should be TRUE or FALSE, so one can argue that the result
should itself be NA. But, as it happens,

  3 %in% c(4,5)
  # [1] FALSE
  4 %in% c(4,5)
  # [1] TRUE
  5 %in% c(4,5)
  # [1] TRUE
  NA %in% c(4,5)
  # [1] FALSE

so all is well!

Hoping this helps,
Ted.

-------------------------------------------------
E-Mail: (Ted Harding) <ted.hard...@wlandres.net>
Date: 15-Sep-2012  Time: 23:02:14
This message was sent by XFMail
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.