Re: [R] Improving data processing efficiency

Daniel Folkinshteyn Fri, 06 Jun 2008 10:46:11 -0700

just in case, i uploaded it to the server; you can get the zip file i mentioned here:

http://astro.temple.edu/~dfolkins/helplistfiles.zip


on 06/06/2008 01:25 PM Daniel Folkinshteyn said the following:

i thought since the function code (which i provided in full) was pretty short, it would be reasonably easy to just read the code and see what it's doing.

but ok, so... i am attaching a zip file with a small sample of the dataset (tab delimited) and the function code (posting guidelines claim that "some archive formats" are allowed, i assume zip is one of them...)


would appreciate your comments! :)

on 06/06/2008 12:05 PM Gabor Grothendieck said the following:

It's summarized in the last line of each message to r-help.  Note
"reproducible" and "minimal".

On Fri, Jun 6, 2008 at 12:03 PM, Daniel Folkinshteyn<[EMAIL PROTECTED]> wrote:

i did! what did i miss?

on 06/06/2008 11:45 AM Gabor Grothendieck said the following:

Try reading the posting guide before posting.

On Fri, Jun 6, 2008 at 11:12 AM, Daniel Folkinshteyn<[EMAIL PROTECTED]> wrote:

Anybody have any thoughts on this? Please? :)

on 06/05/2008 02:09 PM Daniel Folkinshteyn said the following:

Hi everyone!

I have a question about data processing efficiency.

My data are as follows: I have a data set on quarterly institutional
ownership of equities; some of them have had recent IPOs, some have not
(I have a binary flag set). The total dataset size is 700k+ rows.

My goal is this: For every quarter since issue for each IPO, I need to
find a "matched" firm in the same industry, and close in market cap. So,
e.g., for firm X, which had an IPO, I need to find a matched non-issuing
firm in quarter 1 since IPO, then a (possibly different) non-issuing
firm in quarter 2 since IPO, etc. Repeat for each issuing firm (there
are about 8300 of these).

Thus it seems to me that I need to be doing a lot of data selection and
subsetting, and looping (yikes!), but the result appears to be highly
inefficient and takes ages (well, many hours). What I am doing, in
pseudocode, is this:

1. for each quarter of data, getting out all the IPOs and all the
   eligible non-issuing firms.
2. for each IPO in a quarter, grab all the non-issuers in the same
   industry, sort them by size, and finally grab a matching firm closest
   in size (the exact procedure is to grab the closest bigger firm if
   one exists, and just the biggest available if all are smaller)
3. assign the matched firm-observation the same "quarters since issue"
   as the IPO being matched
4. rbind them all into the "matching" dataset.
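
To make the matching rule in step 2 concrete, here is a minimal sketch
for a single IPO. The helper name pick_match() and the toy market caps
are hypothetical, made up for illustration only; it returns the position
of the chosen peer within the vector of peer market caps, and NA when
there are no industry peers at all.

```r
# Hypothetical helper, not part of the function posted in this thread.
# ipo_cap:   market cap of the IPO firm
# peer_caps: market caps of the eligible non-issuers in the same industry
pick_match <- function(ipo_cap, peer_caps) {
  if (length(peer_caps) == 0) return(NA_integer_)  # no peers: nothing to match
  bigger <- which(peer_caps >= ipo_cap)            # peers at least as large as the IPO
  if (length(bigger) > 0) {
    bigger[which.min(peer_caps[bigger])]           # closest bigger peer
  } else {
    which.max(peer_caps)                           # all peers smaller: take the biggest
  }
}

peer_caps <- c(120, 80, 300, 95)  # made-up values
pick_match(100, peer_caps)  # 1 (the peer with cap 120)
pick_match(500, peer_caps)  # 3 (the peer with cap 300, the biggest available)
```

Working on positions rather than on sorted data frames avoids re-sorting
and re-subsetting the peers for every IPO.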

The function I currently have is pasted below, for your reference. Is
there any way to make it produce the same result but much faster?

Specifically, I am guessing eliminating some loops would be very good,
but I don't see how, since I need to do some fancy footwork for each IPO
in each quarter to find the matching firm. I'll be doing a few things
similar to this, so it's somewhat important to up the efficiency of
this. Maybe some of you R-fu masters can clue me in? :)

I would appreciate any help, tips, tricks, tweaks, you name it! :)

========== my function below ===========

fcn_create_nonissuing_match_by_quarterssinceissue = function(tfdata, quarters_since_issue=40) {

  # rbind for matrix is cheaper, so typecast the result to matrix
  result = matrix(nrow=0, ncol=ncol(tfdata))
  colnames = names(tfdata)
  quarterends = sort(unique(tfdata$DATE))

  for (aquarter in quarterends) {
      tfdata_quarter = tfdata[tfdata$DATE == aquarter, ]

      # eligible non-issuers: not an IPO, and more than
      # quarters_since_issue quarters since their latest issue
      tfdata_quarter_fitting_nonissuers = tfdata_quarter[
          (tfdata_quarter$Quarters.Since.Latest.Issue > quarters_since_issue) &
          (tfdata_quarter$IPO.Flag == 0), ]
      tfdata_quarter_ipoissuers = tfdata_quarter[tfdata_quarter$IPO.Flag == 1, ]

      # seq_len() instead of 1:nrow(), so a quarter without IPOs is skipped
      for (i in seq_len(nrow(tfdata_quarter_ipoissuers))) {
          arow = tfdata_quarter_ipoissuers[i,]
          # same-industry non-issuers, sorted by market cap (ascending)
          industrypeers = tfdata_quarter_fitting_nonissuers[
              tfdata_quarter_fitting_nonissuers$HSICIG == arow$HSICIG, ]
          industrypeers = industrypeers[order(industrypeers$Market.Cap.13f), ]
          if ( nrow(industrypeers) > 0 ) {
              biggerpeers = industrypeers[
                  industrypeers$Market.Cap.13f >= arow$Market.Cap.13f, ]
              if ( nrow(biggerpeers) > 0 ) {
                  # closest bigger firm
                  bestpeer = biggerpeers[1,]
              }
              else {
                  # all peers are smaller: take the biggest available
                  bestpeer = industrypeers[nrow(industrypeers),]
              }
              bestpeer$Quarters.Since.IPO.Issue = arow$Quarters.Since.IPO.Issue
              #tfdata_quarter$Match.Dummy.By.Quarter[tfdata_quarter$PERMNO == bestpeer$PERMNO] = 1
              result = rbind(result, as.matrix(bestpeer))
          }
      }
      #result = rbind(result, tfdata_quarter)
      print (aquarter)
  }

  result = as.data.frame(result)
  names(result) = colnames
  return(result)

}

========= end of my function =============
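
One possible direction for speeding this up, offered as a sketch under
stated assumptions rather than a tested replacement: split the eligible
non-issuers by industry once per quarter, collect the matches in a
preallocated list, and call rbind a single time at the end instead of
growing the result row by row. The function name match_one_quarter() and
the toy data frame are hypothetical; only the column names are taken
from the function above.

```r
# Hypothetical sketch, not the function posted above.
# ipos:  IPO rows for one quarter
# peers: eligible non-issuer rows for the same quarter
match_one_quarter <- function(ipos, peers) {
  matched <- vector("list", nrow(ipos))            # preallocate, fill by index
  peers_by_industry <- split(peers, peers$HSICIG)  # subset by industry only once
  for (i in seq_len(nrow(ipos))) {
    cand <- peers_by_industry[[as.character(ipos$HSICIG[i])]]
    if (is.null(cand) || nrow(cand) == 0) next     # no peers in this industry
    bigger <- which(cand$Market.Cap.13f >= ipos$Market.Cap.13f[i])
    pick <- if (length(bigger) > 0) {
      bigger[which.min(cand$Market.Cap.13f[bigger])]  # closest bigger firm
    } else {
      which.max(cand$Market.Cap.13f)                  # biggest available
    }
    best <- cand[pick, ]
    best$Quarters.Since.IPO.Issue <- ipos$Quarters.Since.IPO.Issue[i]
    matched[[i]] <- best
  }
  matched <- Filter(Negate(is.null), matched)      # drop IPOs without a match
  do.call(rbind, matched)                          # one rbind instead of many
}

# Made-up data for one quarter
toy <- data.frame(
  PERMNO = 1:6,
  HSICIG = c(10, 10, 10, 10, 20, 20),
  Market.Cap.13f = c(100, 80, 120, 300, 50, 40),
  IPO.Flag = c(1, 0, 0, 0, 1, 0),
  Quarters.Since.IPO.Issue = c(3, NA, NA, NA, 1, NA)
)
matches <- match_one_quarter(toy[toy$IPO.Flag == 1, ], toy[toy$IPO.Flag == 0, ])
matches$PERMNO  # 3 6
```

The per-quarter results could be collected in a list in the same way and
combined once at the end. Whether this is fast enough on 700k+ rows is
not something this sketch establishes; it would need timing on the real
data.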

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

