Hello all!
 
I am currently trying to sort a data frame in a particular way, but I am having 
some difficulties with this. Specifically I want to sort the below dataset in 
such a way that there is only one line per ProteinID and if there are multiple 
GeneID or GeneName entries for a single proteinID, that they be concatenated 
with a comma separating them. The way I have done it earlier worked fine for 
small datasets, but as I am working with around 30,000 entries, it proved too 
slow and I'm not sure how to do it in another way. 
Here is an example of the input.  
 




ProteinID

GeneID

GeneName


M-012847-00

NM_057175

NARG1


M-012847-00

NM_057175

TBDN100


M-015544-00

NM_153008

FLJ30277


M-015544-00

NM_153027

FLJ31659


M-024202-00

NM_207330

NIPAL1


M-024202-00

NM_207330

NPAL1
 
Here is an example showing what I want: 




ProteinID

GeneID

GeneName


M-012847-00

NM_057175

NARG1, TBDN100


M-015544-00

NM_153008, NM_153027

FLJ30277, FLJ31659 


M-024202-00

NM_207330

NIPAL1, NPAL1
 
Here is the code I have been using so far. I have only managed to get this to 
work by using a loop, which I know is not the best way, but at the moment I'm 
stuck.
 
colA <- c("M-012847-00" ,"M-012847-00" ,"M-015544-00", "M-015544-00", 
"M-024202-00","M-024202-00")
colB <- c("NM_057175", "NM_057175", "NM_153008", "NM_153027", "NM_207330", 
"NM_207330")
colC <- c( 'NARG1', 'TBDN100', 'FLJ30277', 'FLJ31659', 'NIPAL1', 'NPAL1')
 
dupes <- data.frame(ProteinID=colA, GeneID=colB, GeneName=colC)
 
idVec <- character()
geneIDVec <- character()
geneNameVec <- character()
dataType <- "ProteinID"
annotation <- data.frame()
 
for (id in unique(dupes[[dataType]])) {
       print (id)
       idVec <- c(idVec, id)
       geneIDVec <- c(geneIDVec, paste(unique(dupes$GeneID[dupes[[dataType]] == 
id]), collapse=", "))
       geneNameVec <- c(geneNameVec, 
paste(unique(dupes$GeneName[dupes[[dataType]] == id]), collapse=", "))
       annotation[[dataType]][annotation[[dataType]] == id] <- NA
}
filtered <- data.frame(ProteinID=idVec, GeneID=geneIDVec, GeneName=geneNameVec)
 
 
Thanks!                                           
        [[alternative HTML version deleted]]

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Reply via email to