Re: [R] Improving data processing efficiency

Don MacQueen Fri, 06 Jun 2008 15:49:58 -0700

In a case like this, if you can possibly work with matrices instead of data frames, you might get significant speedup. (More accurately, I have had situations where I obtained speed up by working with matrices instead of data frames.)

Even if you have to code character columns as numeric, it can be worth it.
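As a minimal sketch of that coding step (the column names and values below are made up for illustration, not taken from the original data), a character column can be mapped to integer codes, with the lookup vector kept so the codes can be translated back later:

```r
# Hypothetical data: an industry label stored as character.
df <- data.frame(industry = c("tech", "bank", "tech", "oil"),
                 cap = c(10, 20, 30, 40),
                 stringsAsFactors = FALSE)

# Map each distinct label to an integer code; keep the lookup for decoding.
industry_levels <- sort(unique(df$industry))
industry_code <- match(df$industry, industry_levels)

# A purely numeric matrix can now hold both columns.
m <- cbind(industry = industry_code, cap = df$cap)

# Decode back to the original labels when needed.
decoded <- industry_levels[m[, "industry"]]
```

The lookup vector has to be carried along with the matrix, since the codes mean nothing without it.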

Data frames have overhead that matrices do not. (Here's where profiling might have given a clue.) Granted, there has been recent work in reducing the overhead associated with data frames, but I think it's worth a try. Carrying along extra columns and doing row subsetting, rbinding, etc., means a lot more things happening in memory.
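One rough way to check this on a given machine is to time the same row extraction on both structures. The sketch below uses made-up data and sizes; the actual difference will depend on the R version and on the data, so treat it as a way to measure rather than as a promised result:

```r
# Hypothetical data: 3 numeric columns, 10,000 rows.
n <- 10000
df <- data.frame(a = runif(n), b = runif(n), c = runif(n))
m <- as.matrix(df)

# Time repeated single-row extraction from each structure.
idx <- sample(n, 2000)
t_df <- system.time(for (i in idx) r <- df[i, ])
t_m  <- system.time(for (i in idx) r <- m[i, ])

print(t_df["elapsed"])
print(t_m["elapsed"])

# For the real function, Rprof() can show where the time goes, e.g.
# (my_function and mydata are placeholders):
#   Rprof("prof.out"); result <- my_function(mydata); Rprof(NULL)
#   summaryRprof("prof.out")
```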

So, for example, if all of your matching is based just on a few columns, extract those columns, convert them to a matrix, do all the matching, and then based on some sort of row index retrieve all of the associated columns.
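Here is a minimal sketch of that idea for a single quarter. It borrows the column names from the posted function, but the data are invented, and the matching rule (closest bigger peer in the same industry, otherwise the biggest peer) is my reading of the description in the original post, so adjust as needed:

```r
# Hypothetical single-quarter data using the column names from the post.
tfdata <- data.frame(
  PERMNO         = 1:6,
  HSICIG         = c(10, 10, 10, 20, 20, 10),
  Market.Cap.13f = c(50, 40, 90, 30, 70, 60),
  IPO.Flag       = c(1, 0, 0, 1, 0, 0),
  Name           = c("A", "B", "C", "D", "E", "F"),
  stringsAsFactors = FALSE
)

# Keep only the numeric columns needed for matching, as a matrix.
m <- as.matrix(tfdata[, c("HSICIG", "Market.Cap.13f", "IPO.Flag")])

ipo_rows  <- which(m[, "IPO.Flag"] == 1)
peer_rows <- which(m[, "IPO.Flag"] == 0)

# For each IPO row, find the row index of its matched peer (NA if none).
match_idx <- vapply(ipo_rows, function(i) {
  cand <- peer_rows[m[peer_rows, "HSICIG"] == m[i, "HSICIG"]]
  if (length(cand) == 0) return(NA_integer_)
  caps <- m[cand, "Market.Cap.13f"]
  bigger <- caps >= m[i, "Market.Cap.13f"]
  if (any(bigger)) {
    cand[bigger][which.min(caps[bigger])]   # closest bigger peer
  } else {
    cand[which.max(caps)]                   # otherwise the biggest peer
  }
}, integer(1))

# Use the row indices to pull every column from the full data frame.
matched <- tfdata[match_idx[!is.na(match_idx)], ]
```

Only row indices are collected during the matching, so the full data frame is subset once at the end instead of once per IPO.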


-Don

At 2:09 PM -0400 6/5/08, Daniel Folkinshteyn wrote:

Hi everyone!

I have a question about data processing efficiency.
My data are as follows: I have a data set on quarterly institutional ownership of equities; some of them have had recent IPOs, some have not (I have a binary flag set). The total dataset size is 700k+ rows.
My goal is this: For every quarter since issue for each IPO, I need to find a "matched" firm in the same industry, and close in market cap. So, e.g., for firm X, which had an IPO, I need to find a matched non-issuing firm in quarter 1 since IPO, then a (possibly different) non-issuing firm in quarter 2 since IPO, etc. Repeat for each issuing firm (there are about 8300 of these).
Thus it seems to me that I need to be doing a lot of data selection and subsetting, and looping (yikes!), but the result appears to be highly inefficient and takes ages (well, many hours). What I am doing, in pseudocode, is this:
1. for each quarter of data, getting out all the IPOs and all the eligible non-issuing firms.
2. for each IPO in a quarter, grab all the non-issuers in the same industry, sort them by size, and finally grab a matching firm closest in size (the exact procedure is to grab the closest bigger firm if one exists, and just the biggest available if all are smaller)
3. assign the matched firm-observation the same "quarters since issue" as the IPO being matched
4. rbind them all into the "matching" dataset.
The function I currently have is pasted below, for your reference. Is there any way to make it produce the same result but much faster? Specifically, I am guessing eliminating some loops would be very good, but I don't see how, since I need to do some fancy footwork for each IPO in each quarter to find the matching firm. I'll be doing a few things similar to this, so it's somewhat important to up the efficiency of this. Maybe some of you R-fu masters can clue me in? :)
I would appreciate any help, tips, tricks, tweaks, you name it! :)

========== my function below ===========
fcn_create_nonissuing_match_by_quarterssinceissue = function(tfdata,quarters_since_issue=40) {
    result = matrix(nrow=0, ncol=ncol(tfdata)) # rbind for matrix is cheaper, so typecast the result to matrix
    colnames = names(tfdata)

    quarterends = sort(unique(tfdata$DATE))

    for (aquarter in quarterends) {
        tfdata_quarter = tfdata[tfdata$DATE == aquarter, ]
        tfdata_quarter_fitting_nonissuers = tfdata_quarter[(tfdata_quarter$Quarters.Since.Latest.Issue > quarters_since_issue) & (tfdata_quarter$IPO.Flag == 0), ]
        tfdata_quarter_ipoissuers = tfdata_quarter[tfdata_quarter$IPO.Flag == 1, ]
        for (i in 1:nrow(tfdata_quarter_ipoissuers)) {
            arow = tfdata_quarter_ipoissuers[i,]
            industrypeers = tfdata_quarter_fitting_nonissuers[tfdata_quarter_fitting_nonissuers$HSICIG == arow$HSICIG, ]
            industrypeers = industrypeers[order(industrypeers$Market.Cap.13f), ]
            if ( nrow(industrypeers) > 0 ) {
                if (nrow(industrypeers[industrypeers$Market.Cap.13f >= arow$Market.Cap.13f, ]) > 0 ) {
                    bestpeer = industrypeers[industrypeers$Market.Cap.13f >= arow$Market.Cap.13f, ][1,]
                }
                else {
                    bestpeer = industrypeers[nrow(industrypeers),]
                }
                bestpeer$Quarters.Since.IPO.Issue = arow$Quarters.Since.IPO.Issue
                #tfdata_quarter$Match.Dummy.By.Quarter[tfdata_quarter$PERMNO == bestpeer$PERMNO] = 1
                result = rbind(result, as.matrix(bestpeer))
            }
        }
        #result = rbind(result, tfdata_quarter)
        print (aquarter)
    }

    result = as.data.frame(result)
    names(result) = colnames
    return(result)

}

========= end of my function =============

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



--
--------------------------------------
Don MacQueen
Environmental Protection Department
Lawrence Livermore National Laboratory
Livermore, CA, USA
925-423-1062
