Dear all,

I am new to R and have run into the following problem, which is slowing me down a lot. I am out of ideas for working around it, so any help would be highly appreciated.
I have two dataframes, x and y. x is very big (70 million observations), whereas y is smaller (300,000 observations). All the observations of y are present in x, but y has one additional variable that I would like to add to the dataframe x. For instance, imagine they have the following variable names:

colnames(x) <- c("V1", "V2", "V3", "V4")
colnames(y) <- c("V1", "V2", "V5")

Since the observations of y are present in x, my strategy was to merge x and y so that the dataframe x would get the values of the variable V5 for the observations that are in both x and y. So, I did the following:

dat <- merge(x, y, all=TRUE)

On a small example, it works fine. The only problem is that when I apply it to my big dataframe x, it takes forever (it has been running for several days and is still not done), even though I have a very fast computer. So, I don't know whether I should stop it now or keep waiting.

Does anyone have an idea for performing this operation more efficiently (in terms of computation time)? In addition, does anyone know how to incorporate some sort of counter into a program to check how much work has been done at a given point in time?

Any comments are very welcome.

Thanks,
Best,
Aurelien

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
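
For reference, here is a minimal, self-contained version of the merge described above. It is only a sketch on toy data: the values, the sizes, and the choice of V1 and V2 as the shared key columns are assumptions made for illustration; the real x has 70 million rows and the real y has 300,000.

```r
# Toy data standing in for the real dataframes (all values are made up).
set.seed(1)
x <- data.frame(V1 = rep(1:5, each = 4),
                V2 = rep(1:4, times = 5),
                V3 = rnorm(20),
                V4 = runif(20))

# y holds a subset of the (V1, V2) combinations found in x,
# plus the additional variable V5.
y <- data.frame(V1 = c(1L, 3L, 5L),
                V2 = c(2L, 4L, 1L),
                V5 = c("a", "b", "c"))

# merge() joins on the column names the two dataframes share (here V1 and V2).
# all = TRUE keeps unmatched rows from both sides, so rows of x that have
# no counterpart in y get NA in V5.
dat <- merge(x, y, all = TRUE)

print(dim(dat))                 # dimensions of the merged dataframe
print(dat[!is.na(dat$V5), ])    # the rows that picked up a value of V5
```

On this toy example the result has one row per row of x, with V5 filled in only where y had a matching (V1, V2) pair, which is the behaviour I am after on the full data.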