This will give you the unique gene values: > colA <- c("M-012847-00" ,"M-012847-00" ,"M-015544-00", "M-015544-00", > "M-024202-00","M-024202-00") > colB <- c("NM_057175", "NM_057175", "NM_153008", "NM_153027", "NM_207330", > "NM_207330") > colC <- c( 'NARG1', 'TBDN100', 'FLJ30277', 'FLJ31659', 'NIPAL1', 'NPAL1') > > dupes <- data.frame(ProteinID=colA, GeneID=colB, GeneName=colC) > require(sqldf) > sqldf(" + select ProteinID + , group_concat(distinct GeneID) as GeneID + , group_concat(distinct GeneName) as GeneName + from dupes + group by ProteinID + ", method = 'raw') ProteinID GeneID GeneName 1 M-012847-00 NM_057175 NARG1,TBDN100 2 M-015544-00 NM_153008,NM_153027 FLJ30277,FLJ31659 3 M-024202-00 NM_207330 NIPAL1,NPAL1 >
On Fri, Jun 10, 2011 at 4:35 AM, hi Berven <thebe...@hotmail.com> wrote: > > Hello all! > > I am currently trying to sort a data frame in a particular way, but I am > having some difficulties with this. Specifically I want to sort the below > dataset in such a way that there is only one line per ProteinID and if there > are multiple GeneID or GeneName entries for a single proteinID, that they be > concatenated with a comma separating them. The way I have done it earlier > worked fine for small datasets, but as I am working with around 30,000 > entries, it proved too slow and I'm not sure how to do it in another way. > Here is an example of the input. > > > > > > ProteinID > > GeneID > > GeneName > > > M-012847-00 > > NM_057175 > > NARG1 > > > M-012847-00 > > NM_057175 > > TBDN100 > > > M-015544-00 > > NM_153008 > > FLJ30277 > > > M-015544-00 > > NM_153027 > > FLJ31659 > > > M-024202-00 > > NM_207330 > > NIPAL1 > > > M-024202-00 > > NM_207330 > > NPAL1 > > Here is an example showing what I want: > > > > > ProteinID > > GeneID > > GeneName > > > M-012847-00 > > NM_057175 > > NARG1, TBDN100 > > > M-015544-00 > > NM_153008, NM_153027 > > FLJ30277, FLJ31659 > > > M-024202-00 > > NM_207330 > > NIPAL1, NPAL1 > > Here is the code I have been using so far. I have only managed to get this to > work by using a loop, which I know is not the best way, but at the moment I'm > stuck. > > colA <- c("M-012847-00" ,"M-012847-00" ,"M-015544-00", "M-015544-00", > "M-024202-00","M-024202-00") > colB <- c("NM_057175", "NM_057175", "NM_153008", "NM_153027", "NM_207330", > "NM_207330") > colC <- c( 'NARG1', 'TBDN100', 'FLJ30277', 'FLJ31659', 'NIPAL1', 'NPAL1') > > dupes <- data.frame(ProteinID=colA, GeneID=colB, GeneName=colC) > > idVec <- character() > geneIDVec <- character() > geneNameVec <- character() > dataType <- "ProteinID" > annotation <- data.frame() > > for (id in unique(dupes[[dataType]])) { > print (id) > idVec <- c(idVec, id) > geneIDVec <- c(geneIDVec, paste(unique(dupes$GeneID[dupes[[dataType]] > == id]), collapse=", ")) > geneNameVec <- c(geneNameVec, > paste(unique(dupes$GeneName[dupes[[dataType]] == id]), collapse=", ")) > annotation[[dataType]][annotation[[dataType]] == id] <- NA > } > filtered <- data.frame(ProteinID=idVec, GeneID=geneIDVec, > GeneName=geneNameVec) > > > Thanks! > [[alternative HTML version deleted]] > > ______________________________________________ > R-help@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. > -- Jim Holtman Data Munger Guru What is the problem that you are trying to solve? ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.