I'm new to R (previously used SAS primarily) and I have a genetics data
frame consisting of genotypes for each of 300+ subjects (ID1, ID2, ID3,
...) at 3000+ genetic locations (SNP1, SNP2, SNP3...). A small subset of
the data is shown below:
SNP_ID      SNP1  SNP2  SNP3  SNP4
Maj_Allele  C     G     C     A
Min_Allele  T     A     T     G
ID1         CC    GG    CT    AA
ID2         CC    GG    CC    AA
ID3         CC    GG    nc    AA
ID4         _     _     _     _
ID5         CC    GG    CC    AA
ID6         CC    GG    CC    AA
ID7         CC    GG    CT    AA
ID8         _     _     _     _
ID9         CT    GG    CC    AG
ID10        CC    GG    CC    AA
ID11        CC    GG    CT    AA
ID12        _     _     _     _
ID13        CC    GG    CC    AA
The data frame is named Kgeno.
What I would like to do is recode all of the genotype values to standard
integer notation, based on their values relative to the reference rows
(Maj_Allele and Min_Allele). Standard notation sums the total of minor
alleles in the genotype, so values can be 0, 1 or 2.
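As a minimal sketch of that definition, the minor-allele count for a single valid two-letter genotype can be computed by splitting the string into its alleles. The helper name count_minor is hypothetical (it is not part of the original code), and the sketch assumes the input is a real genotype; missing codes such as "nc" or "_" would need to be handled separately.

```r
# Count the minor alleles in one two-letter genotype string.
# `count_minor` is a hypothetical helper name used only for illustration.
count_minor <- function(genotype, minor) {
  alleles <- strsplit(genotype, "")[[1]]  # e.g. "CT" -> c("C", "T")
  sum(alleles == minor)                   # number of alleles equal to the minor allele
}

count_minor("CC", minor = "T")  # 0: two major alleles
count_minor("CT", minor = "T")  # 1: one major, one minor
count_minor("TT", minor = "T")  # 2: two minor alleles
```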
Here are the changes I want to make:
1. If the genotype is "nc" or "_", set it to NA.
2. If the genotype is a character string made up of two consecutive
major allele values, i.e. Maj_Allele followed by Maj_Allele, set it to 0.
3. If the genotype is Maj_Allele followed by Min_Allele, set it to 1.
4. If the genotype is Min_Allele followed by Min_Allele, set it to 2.
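The four rules above can be sketched for a single column as follows. This is only an illustration under stated assumptions: recode_genotype is a hypothetical helper name, the genotypes are assumed to be character strings (not factors), and treating a reversed heterozygote (minor allele first, e.g. "TC") as 1 is an assumption that the rules do not state.

```r
# Hypothetical helper (not from the original post): recode one SNP column.
# geno  : character vector of genotypes, e.g. c("CC", "CT", "nc")
# major : major allele for this SNP, e.g. "C"
# minor : minor allele for this SNP, e.g. "T"
recode_genotype <- function(geno, major, minor) {
  geno <- as.character(geno)
  # Start with NA everywhere, so "nc", "_" and unrecognised values stay NA (rule 1)
  out <- rep(NA_integer_, length(geno))
  out[geno %in% paste0(major, major)] <- 0L   # rule 2: two major alleles
  # rule 3: one of each; accepting the reversed order is an assumption
  out[geno %in% c(paste0(major, minor), paste0(minor, major))] <- 1L
  out[geno %in% paste0(minor, minor)] <- 2L   # rule 4: two minor alleles
  out
}

snp3 <- c("CT", "CC", "nc", "_", "TT")
recode_genotype(snp3, major = "C", minor = "T")
# [1]  1  0 NA NA  2
```

Note that the result uses a real missing value (NA) rather than the character string "NA", so functions such as is.na() will recognise it.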
I've tried the following ifelse processing but get an error (Warning:
Executed script did not end with R session at the top-level prompt.
Top-level state will be restored) and can't seem to fix the code
properly. I've counted the parentheses. Also, I'm not sure it would
execute properly even if I could fix it.
# change 'nc' and '_' to NA, else leave as is:
Kgeno[,2] <- ifelse(Kgeno[,2] == "nc", "NA", Kgeno[,2])
Kgeno[,2] <- ifelse(Kgeno[,2] == "_", "NA", Kgeno[,2])
# convert genotype strings in the first data column to numeric values
# (two major alleles=0, 1 minor and 1 major=1, 2 minor alleles=2), else
# leave as is (to preserve NA values).
Kgeno[,2] <-
ifelse(Kgeno[,2] == noquote(paste(as.character(Kgeno[1,2]), as.character(Kgeno[1,2]), sep=""), 0,
ifelse(Kgeno[,2] == noquote(paste(as.character(Kgeno[1,2]), as.character(Kgeno[2,2]), sep=""), 1,
ifelse(Kgeno[,2] == noquote(paste(as.character(Kgeno[2,2]), as.character(Kgeno[2,2]), sep=""), 2,
Kgeno[,2])))
Finally, even if the above code were corrected, it would only change the
first column of data, but I would like to change all 3000+ columns in the
same way.
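One possible way to handle every column at once is to write the recoding for a single column and apply it with lapply(). The sketch below is hedged and not a definitive solution: it rebuilds only the first few rows of the sample data, assumes that rows 1 and 2 hold the major and minor alleles and that genotypes are stored as character strings, and uses the hypothetical names recode_column and Kcoded. Counting a reversed heterozygote (e.g. "TC") as 1 is also an assumption.

```r
# First few rows of the sample data; stringsAsFactors = FALSE keeps strings as character
Kgeno <- data.frame(
  SNP_ID = c("Maj_Allele", "Min_Allele", "ID1", "ID2", "ID3", "ID4"),
  SNP1   = c("C", "T", "CC", "CC", "CC", "_"),
  SNP2   = c("G", "A", "GG", "GG", "GG", "_"),
  SNP3   = c("C", "T", "CT", "CC", "nc", "_"),
  SNP4   = c("A", "G", "AA", "AA", "AA", "_"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: recode one column whose first two entries are
# the major and minor allele, followed by the subjects' genotypes.
recode_column <- function(col) {
  col   <- as.character(col)
  major <- col[1]
  minor <- col[2]
  geno  <- col[-(1:2)]
  # The four recognised genotypes and their minor-allele counts
  lookup <- c(paste0(major, major), paste0(major, minor),
              paste0(minor, major), paste0(minor, minor))
  # match() returns NA for "nc", "_" or anything else not in lookup
  c(0L, 1L, 1L, 2L)[match(geno, lookup)]
}

# Drop the two reference rows, then overwrite every SNP column with its codes
Kcoded <- Kgeno[-(1:2), ]
Kcoded[-1] <- lapply(Kgeno[-1], recode_column)
Kcoded
```

Because the function is applied column by column, the same call should work unchanged for 3000+ SNP columns, as long as the first column holds the IDs and the first two rows hold the reference alleles.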
I would greatly appreciate some suggestions on how to proceed.
Thank you,
Kathleen
---
Kathleen Askland, MD
Assistant Professor
Department of Psychiatry & Human Behavior
The Warren Alpert School of Medicine
Brown University/Butler Hospital
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.