I'm new to R (previously used SAS primarily) and I have a genetics data frame consisting of genotypes for each of 300+ subjects (ID1, ID2, ID3, ...) at 3000+ genetic locations (SNP1, SNP2, SNP3...). A small subset of the data is shown below: SNP_ID SNP1 SNP2 SNP3 SNP4 Maj_Allele C G C A Min_Allele T A T G ID1 CC GG CT AA ID2 CC GG CC AA ID3 CC GG nc AA ID4 _ _ _ _ ID5 CC GG CC AA ID6 CC GG CC AA ID7 CC GG CT AA ID8 _ _ _ _ ID9 CT GG CC AG ID10 CC GG CC AA ID11 CC GG CT AA ID12 _ _ _ _ ID13 CC GG CC AA The name of the data file is Kgeno. What I would like to do is recode all of the genotype values to standard integer notation, based on their values relative to the reference rows (Maj_Allele and Min_Allele). Standard notation sums the total of minor alleles in the genotype, so values can be 0, 1 or 2.
Here are the changes I want to make: 1. If the genotype= "nc" or '_" then set equal to NA. 2. If genotype value = a character string comprised of two consecutive major allele values -- c(Maj_Allele, Maj_Allele) -- then set equal to 0. 3. If genotype value= c(Maj_Allele, Min_Allele) then set equal to 1. 4. If genotype value = c(Min_Allele, Min_Allele) then set equal to 2. I've tried the following ifelse processing but get error (Warning: Executed script did not end with R session at the top-level prompt. Top-level state will be restored) and can't seem to fix the code properly. I've counted the parentheses. Also, not sure if it would execute properly if I could fix it. # change 'nc' and '_' to NA, else leave as is: Kgeno[,2] <- ifelse(Kgeno[,2] == "nc", "NA", Kgeno[,2]) Kgeno[,2] <- ifelse(Kgeno[,2] == "_", "NA", Kgeno[,2]) #convert genotype strings in the first data column to numeric values #(two major alleles=0, 1 minor and 1 major=1, 2 minor alleles=2), else #leave as is (to preserve NA values). Kgeno[,2] <- ifelse(Kgeno[,2] == noquote(paste(as.character(Kgeno[1,2]), as.character( Kgeno[1,2]), sep=""), 0, ifelse(Kgeno[,2] == noquote(paste(as.character(Kgeno[1,2]), as.character( Kgeno[2,2]), sep=""), 1, ifelse(Kgeno[,2] == noquote(paste(as.character(Kgeno[2,2]), as.character( Kgeno[2,2]), sep=""), 2, Kgeno[,2]))) Finally, if above code were corrected, this would only change the first column of data, but I would like to change all 3000+ columns in the same way. I would greatly appreciate some suggestions on how to proceed. Thank you, Kathleen --- Kathleen Askland, MD Assistant Professor Department of Psychiatry & Human Behavior The Warren Alpert School of Medicine Brown University/Butler Hospital [[alternative HTML version deleted]] ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.