Hi,

In case you wanted to treat "BC" and "CB" as the same:

library(stringr)

dat1<- read.table(text="
Seq,Output
A B B C D A C,Yes
B C A C B D A C,Yes
C D A A C D,No
",sep=",",header=TRUE,stringsAsFactors=FALSE)

# For each "Yes" row: build all 2-letter combinations, sort the letters
# within each pair so "CB" becomes "BC", then tabulate.
lapply(str_split(str_trim(dat1$Seq)," ")[dat1$Output=="Yes"],function(x) {
  x1<-t(combn(x,2))
  x2<-sapply(strsplit(apply(x1,1,paste0,collapse=""),""),function(x) paste(x[order(x)],collapse=""))
  table(x2)})
#[[1]]
#x2
#AA AB AC AD BB BC BD CC CD
# 1  4  4  2  1  4  2  1  2
#
#[[2]]
#x2
#AA AB AC AD BB BC BD CC CD
# 1  4  6  2  1  6  2  3  3

dat1$MaxCombn<- NA
# Keep every pair that ties for the maximum count in each "Yes" row.
res1<-sapply(str_split(str_trim(dat1$Seq)," ")[dat1$Output=="Yes"],function(x) {
  x1<-t(combn(x,2))
  x2<-sapply(strsplit(apply(x1,1,paste0,collapse=""),""),function(x) paste(x[order(x)],collapse=""))
  x3<-table(x2)
  x3[x3%in% max(x3)]})
dat1$MaxCombn[dat1$Output=="Yes"]<-lapply(res1,names)
dat1
#              Seq Output   MaxCombn
#1   A B B C D A C    Yes AB, AC, BC
#2 B C A C B D A C    Yes     AC, BC
#3     C D A A C D     No         NA

A.K.

----- Original Message -----
From: arun <smartpink...@yahoo.com>
To: R help <r-help@r-project.org>
Cc:
Sent: Friday, April 12, 2013 4:37 PM
Subject: Re: Search for common character strings within a column

Hi,

Maybe this helps. I am not sure how you wanted to select those two letters.

dat1<- read.table(text="
Seq,Output
A B B C D A C,Yes
B C A C B D A C,Yes
C D A A C D,No
",sep=",",header=TRUE,stringsAsFactors=FALSE)

library(stringr)

# All 2-letter combinations (order preserved) for each "Yes" row.
lapply(str_split(str_trim(dat1$Seq)," ")[dat1$Output=="Yes"],function(x) {
  x1<-t(combn(x,2))
  apply(x1,1,paste0,collapse="")})
#[[1]]
# [1] "AB" "AB" "AC" "AD" "AA" "AC" "BB" "BC" "BD" "BA" "BC" "BC" "BD" "BA" "BC"
#[16] "CD" "CA" "CC" "DA" "DC" "AC"
#
#[[2]]
# [1] "BC" "BA" "BC" "BB" "BD" "BA" "BC" "CA" "CC" "CB" "CD" "CA" "CC" "AC" "AB"
#[16] "AD" "AA" "AC" "CB" "CD" "CA" "CC" "BD" "BA" "BC" "DA" "DC" "AC"

# Most frequent combination per "Yes" row (which.max returns only the first maximum).
res<- sapply(str_split(str_trim(dat1$Seq)," ")[dat1$Output=="Yes"],function(x) {
  x1<-t(combn(x,2))
  x2<-table(apply(x1,1,paste0,collapse=""))
  x2[which.max(x2)]})
res
#BC BC
# 4  4

dat1$MaxCombn<-NA
dat1$MaxCombn[dat1$Output=="Yes"]<- names(res)
dat1
#              Seq Output MaxCombn
#1   A B B C D A C    Yes       BC
#2 B C A C B D A C    Yes       BC
#3     C D A A C D     No     <NA>

A.K.

>I have a dataset (data) that consists of two columns: Seq and output. Each entry in Seq is a combination of As, Bs, Cs and Ds and ranges from 5 – 30 characters in length.
>Each sequence is associated with an output of either yes or no such that:
>
>    Seq              Output
>(1) A B B C D A C    Yes
>(2) B C A C B D A C  Yes
>(3) C D A A C D      No
>
>etc, etc.
>
>I want to find which 2 letter (A B, A C, A D, etc) strings are most associated with each output. Essentially I want to find which 2 letter combinations occur most frequently in the column Seq, when the output is Yes. I’m new to R and can’t figure out a solution to this problem.
>
>Any help greatly appreciated!
>
>Cheers,
>
>AB

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
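
The replies in this thread report the top pair separately for each "Yes" row. If the intent was instead to pool all "Yes" rows and find the most frequent pair across the whole column, a minimal base-R sketch could look like the following. It assumes the same combn() reading of "2 letter combinations" used in the thread (every pair of positions, not only adjacent letters) and treats "BC" and "CB" as the same. The helper name sorted_pairs is made up for illustration and does not come from the thread.

```r
# Same toy data as in the thread.
dat1 <- data.frame(
  Seq = c("A B B C D A C", "B C A C B D A C", "C D A A C D"),
  Output = c("Yes", "Yes", "No"),
  stringsAsFactors = FALSE
)

# sorted_pairs() is a hypothetical helper (not from the thread).
# It returns every 2-element combination of one sequence as a string,
# with the two letters sorted so that "CB" is recorded as "BC".
sorted_pairs <- function(seq_string) {
  letters_vec <- strsplit(trimws(seq_string), " +")[[1]]
  combn(letters_vec, 2, FUN = function(p) paste(sort(p), collapse = ""))
}

# Pool the pairs from all "Yes" rows into a single frequency table.
yes_seqs <- dat1$Seq[dat1$Output == "Yes"]
pooled <- table(unlist(lapply(yes_seqs, sorted_pairs)))

# Keep every pair that ties for the maximum pooled count.
top_pairs <- names(pooled)[as.vector(pooled) == max(pooled)]
top_pairs
# [1] "AC" "BC"
```

On the three example rows, AC and BC each occur 10 times across the two "Yes" sequences (4 + 6), so both are returned. Replacing "Yes" with "No" in the filter gives the corresponding answer for the other output.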