Hi, May be this helps: Kgeno<- read.table(text=" SNP_ID SNP1 SNP2 SNP3 SNP4 Maj_Allele C G C A Min_Allele T A T G ID1 CC GG CT AA ID2 CC GG CC AA ID3 CC GG nc AA ID4 _ _ _ _ ID5 CC GG CC AA ID6 CC GG CC AA ID7 CC GG CT AA ID8 _ _ _ _ ID9 CT GG CC AG ID10 CC GG CC AA ID11 CC GG CT AA ID12 _ _ _ _ ID13 CC GG CC AA ",sep="",header=TRUE,stringsAsFactors=FALSE)
library(stringr) library(car) fun1<- function(x){ MajMin<- paste0(x[1],x[2]) MajMaj<-str_dup(x[1],2) MinMin<-str_dup(x[2],2) recode(x,"'nc'=NA;'_'=NA;MajMaj=0;MajMin=1;MinMin=2")} sapply(Kgeno[,-1],fun1) #or mat1<-sapply(Kgeno[1:2,-1],function(x) {c(str_dup(x,2),paste(x,collapse=""))})[c(1,3,2),] sapply(seq_len(ncol(Kgeno[,-1])),function(i) {x<-Kgeno[-c(1:2),-1][,i];as.numeric(factor(x,levels=mat1[,i]))-1}) #Speed comparison KgenoNew<- rbind(Kgeno[c(1:2),-1],sapply(Kgeno[-c(1:2),-1],rep,1e4)) system.time(res1<- sapply(KgenoNew,fun1)) # user system elapsed #0.672 0.000 0.674 system.time({ mat1<-sapply(Kgeno[1:2,-1],function(x) {c(str_dup(x,2),paste(x,collapse=""))})[c(1,3,2),] res2<- sapply(seq_len(ncol(KgenoNew)),function(i){ x<- KgenoNew[-c(1:2),][,i];as.numeric(factor(x,levels=mat1[,i]))-1}) }) #user system elapsed # 0.212 0.000 0.214 res1New<- res1[-c(1:2),] res1New1<- as.numeric(res1New) dim(res1New1)<- dim(res1New) identical(res1New1,res2) #[1] TRUE A.K. ----- Original Message ----- From: kathleen askland <k.askl...@gmail.com> To: r-help@r-project.org Cc: Sent: Tuesday, July 2, 2013 4:46 PM Subject: [R] Recoding variables based on reference values in data frame I'm new to R (previously used SAS primarily) and I have a genetics data frame consisting of genotypes for each of 300+ subjects (ID1, ID2, ID3, ...) at 3000+ genetic locations (SNP1, SNP2, SNP3...). A small subset of the data is shown below: SNP_ID SNP1 SNP2 SNP3 SNP4 Maj_Allele C G C A Min_Allele T A T G ID1 CC GG CT AA ID2 CC GG CC AA ID3 CC GG nc AA ID4 _ _ _ _ ID5 CC GG CC AA ID6 CC GG CC AA ID7 CC GG CT AA ID8 _ _ _ _ ID9 CT GG CC AG ID10 CC GG CC AA ID11 CC GG CT AA ID12 _ _ _ _ ID13 CC GG CC AA The name of the data file is Kgeno. What I would like to do is recode all of the genotype values to standard integer notation, based on their values relative to the reference rows (Maj_Allele and Min_Allele). Standard notation sums the total of minor alleles in the genotype, so values can be 0, 1 or 2. Here are the changes I want to make: 1. If the genotype= "nc" or '_" then set equal to NA. 2. If genotype value = a character string comprised of two consecutive major allele values -- c(Maj_Allele, Maj_Allele) -- then set equal to 0. 3. If genotype value= c(Maj_Allele, Min_Allele) then set equal to 1. 4. If genotype value = c(Min_Allele, Min_Allele) then set equal to 2. I've tried the following ifelse processing but get error (Warning: Executed script did not end with R session at the top-level prompt. Top-level state will be restored) and can't seem to fix the code properly. I've counted the parentheses. Also, not sure if it would execute properly if I could fix it. # change 'nc' and '_' to NA, else leave as is: Kgeno[,2] <- ifelse(Kgeno[,2] == "nc", "NA", Kgeno[,2]) Kgeno[,2] <- ifelse(Kgeno[,2] == "_", "NA", Kgeno[,2]) #convert genotype strings in the first data column to numeric values #(two major alleles=0, 1 minor and 1 major=1, 2 minor alleles=2), else #leave as is (to preserve NA values). Kgeno[,2] <- ifelse(Kgeno[,2] == noquote(paste(as.character(Kgeno[1,2]), as.character( Kgeno[1,2]), sep=""), 0, ifelse(Kgeno[,2] == noquote(paste(as.character(Kgeno[1,2]), as.character( Kgeno[2,2]), sep=""), 1, ifelse(Kgeno[,2] == noquote(paste(as.character(Kgeno[2,2]), as.character( Kgeno[2,2]), sep=""), 2, Kgeno[,2]))) Finally, if above code were corrected, this would only change the first column of data, but I would like to change all 3000+ columns in the same way. I would greatly appreciate some suggestions on how to proceed. Thank you, Kathleen --- Kathleen Askland, MD Assistant Professor Department of Psychiatry & Human Behavior The Warren Alpert School of Medicine Brown University/Butler Hospital [[alternative HTML version deleted]] ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.