I think we need some data and code.

Reproducibility:
https://github.com/hadley/devtools/wiki/Reproducibility
http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example
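As a rough sketch of what such a sample could look like: a few rows passed through dput() give text that anyone on the list can paste back into R. The data frame `dat` and all of its values are invented for illustration; the column names are only guessed from the description in the quoted message.

```r
# Invented two-row sample; the real data would come from the CSV
# described in the quoted message.
dat <- data.frame(
  hostname         = c("host_A", "host_G"),
  da1_dev_type     = c("hdd", "ssd"),
  da1_busy_pct     = c(40, 5),
  stringsAsFactors = FALSE
)

# dput() prints an R expression that rebuilds the object exactly,
# so it can be pasted straight into a reply.
dput(dat)
```

For a large data frame, dput(head(dat, 20)) keeps the post short.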
John Kane
Kingston ON Canada

> -----Original Message-----
> From: r...@catwhisker.org
> Sent: Mon, 30 Mar 2015 06:50:59 -0700
> To: r-help@r-project.org
> Subject: [R] data.frame: data-driven column selections that vary by row??
>
> Sorry if that's confusing: I'm probably confused. :-(
>
> I am collecting and trying to analyze data regarding performance of
> computer systems.
>
> After extracting the data from its repository, I have created and
> used a Perl script to generate a (relatively) simple CSV, each
> record of which contains:
> * a POSIXct timestamp
> * a hostname
> * a collection of metrics for the interval identified by the timestamp,
>   and specific to the host in question, as well as some factors to
>   group the hosts (e.g., whether it's in a "control" vs. a "test"
>   group; a broad categorization of how the host is provisioned; which
>   version of the software it was running at the time...). (Each
>   metric and factor is in a uniquely-named column.)
>
> As extracted from the repository, there were several records for each
> such hostname/timestamp pair -- e.g., there would be separate records
> for:
> * Input bandwidth utilization for network interface 1
> * Output bandwidth utilization for network interface 1
> * Input bandwidth utilization for network interface 2
> * Output bandwidth utilization for network interface 2
>
> (And the same field would be used for each of these -- the
> interpretation being driven by the content of other fields in the
> record.)
>
> Working with the data as described (immediately) above directly in R
> seemed... daunting, at best: thus the excursion into Perl.
>
> And for some of the data, what I have works well enough.
>
> But now I also want to analyze information from disk drives, and things
> get messy (as far as I can see).
>
> First, each disk drive has a collection of 17 metrics (such as
> "busy_pct", "kb_per_transfer_read", and "transfers_per_second_write"),
> as well as a factor ("dev_type").
> Each also has a device name that is unique within the host where it
> resides (e.g. "da1", "da2", "da3"....).
> (The "dev_type" factor identifies whether the drive is a solid-state
> device or a spinning disk.)
>
> I have thus made the corresponding columns unique by pasting the drive
> name and the name of the metric (or factor), separating the two with
> "_" (e.g. "da7_busy_pct"; "ada0_mb_per_second_write";
> "ada4_queue_length"). I am not certain that's the best thing I could
> have done -- and I'm open to changing the approach.
>
> The challenge for me is that different (classes of) machines are
> provisioned differently; some consequences of that:
> * While da1 may be a spinning disk on host A, that has no bearing on
>   whether or not the "da1" on host B is a spinning disk or an SSD.
> * Host C may not even have a "da1" device.
> * Host D may be of a type that normally has a "da1," but in this case,
>   the drive has failed and has been disabled (so host D won't report
>   anything about "da1").
>
> (I'm not too bothered about the "non-reporting" case, but cite it so we
> all know about it.)
>
> I expect I will want to be using groupings:
> * All disk devices -- this one is easy.
> * All SSD devices (excluding spinning disks).
> * All spinning disks (excluding SSDs).
>
> I'm having trouble with the latter two (though, certainly, if I solve
> one, the other is also solved).
>
> Also, for some of the metrics, I will want to sum them; for others,
> I will want to do other things -- find minima or maxima, or average
> them. So pre-calculating such aggregates in the Perl script isn't
> something that appeals to me.
>
> Finally (as far as complications go), I'm trying to write the code in
> such a way that if we deploy a new configuration of machine that has
> (say) twice as many drives as the biggest one we presently deploy, the
> code Just Works -- I shouldn't need to update the code merely to adapt
> to another hardware configuration.
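One possible direction, sketched here without having seen the real data: reshape the wide "<device>_<metric>" columns into a long layout with one row per host and drive, so that "dev_type" becomes an ordinary column and new drives need no code changes. Everything in the toy data frame `wide` is invented, the column name `hostname` is a guess, and the sketch assumes that device names contain no underscore and that every drive reports the same set of columns.

```r
# Toy stand-in for the CSV described above; all values are invented.
wide <- data.frame(
  hostname         = c("host_A", "host_G"),
  da1_dev_type     = c("hdd", "ssd"),
  da1_busy_pct     = c(40, 5),
  ada0_dev_type    = c("ssd", NA),   # host_G reports no ada0
  ada0_busy_pct    = c(10, NA),
  stringsAsFactors = FALSE
)

# Drive columns look like "<device>_<metric>"; the device part is
# assumed to contain no underscore.
drive_cols <- grep("^a?da[0-9]+_", names(wide), value = TRUE)
devices    <- unique(sub("_.*$", "", drive_cols))

# Build one block of rows per device, strip the device prefix from the
# column names, then stack the blocks.
long <- do.call(rbind, lapply(devices, function(dev) {
  prefix <- paste0("^", dev, "_")
  cols   <- grep(prefix, names(wide), value = TRUE)
  block  <- wide[, cols, drop = FALSE]
  names(block) <- sub(prefix, "", cols)
  data.frame(hostname = wide$hostname, device = dev, block,
             stringsAsFactors = FALSE)
}))

# Drop drives that a host does not report.
long <- long[!is.na(long$dev_type), ]
rownames(long) <- NULL
```

Because the device list is derived from the column names, a machine with twice as many drives simply yields more rows in `long`.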
>
> I have been able to write a function that takes the data.frame obtained
> by reading the above-cited CSV, and generates a data.frame with a row
> for each host, and depicts the "dev_type" for each device for that host;
> here's an abbreviated (and slightly redacted) copy of its output to
> illustrate some of the above:
>
> ada0 ada1 ada2 ada3 ada4 ada5 da30 da31 da32 da33 da34 da35 da36
> da3
> host_A ssd ssd hdd hdd hdd hdd hdd hdd hdd hdd hdd hdd hdd
> hdd
> host_B ssd ssd hdd hdd hdd hdd hdd hdd hdd hdd hdd hdd hdd
> hdd
> host_G ssd ssd ssd ssd ssd ssd
> ssd
> host_H ssd ssd ssd ssd ssd ssd
> ssd
> host_M ssd ssd ssd ssd ssd ssd
> ssd
> host_N ssd ssd ssd ssd ssd ssd
> ssd
>
> (That function is written with the explicit assumption(!) that for the
> period covered by a given set of input data, a given host's
> configuration remains static: we won't have drives changing type
> mid-stream.)
>
> So the point of this lengthy(!) note is to ask if there's a
> somewhat-sane way to be able to group the metrics for the "ssd" devices
> (for example), given the above.
>
> (So far, the least obnoxious way that comes to mind is to actually
> create 2 columns for each device metric: one for the device if it's an
> "ssd"; the other for "hdd" -- so instead of columns such as:
> * da3_busy_pct
> * da3_dev_type
> * da3_kb_per_transfer_read
> * da36_cam_timeouts
> * da36_dev_type
> * da36_mb_per_second_read
>
> I would have:
> * da3_hdd_busy_pct
> * da3_ssd_busy_pct
> * da3_hdd_dev_type
> * da3_ssd_dev_type
> * da3_hdd_kb_per_transfer_read
> * da3_ssd_kb_per_transfer_read
> * da36_hdd_cam_timeouts
> * da36_ssd_cam_timeouts
> * da36_hdd_dev_type
> * da36_ssd_dev_type
> * da36_hdd_mb_per_second_read
> * da36_ssd_mb_per_second_read
>
> and no more than half of those would actually be populated (depending on
> the content of "dev_type" when the Perl script is creating the CSV).
>
> That seems rather hackish, though.
>
> Thank you in advance for any insight.
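If the data were held in a long layout (one row per host and drive, with "dev_type" as a regular column), grouping the "ssd" metrics would not need doubled columns at all. A minimal sketch with base R's aggregate(), using an invented data frame `long`; the timestamp is left out for brevity, the column name `hostname` is a guess, and the choice of mean for "busy_pct" and sum for "mb_per_second_read" is only an example of mixing aggregates.

```r
# Invented long-format sample: one row per host and drive.
long <- data.frame(
  hostname           = c("host_A", "host_A", "host_A", "host_G", "host_G"),
  device             = c("ada0", "da30", "da31", "ada0", "ada1"),
  dev_type           = c("ssd", "hdd", "hdd", "ssd", "ssd"),
  busy_pct           = c(10, 40, 60, 5, 15),
  mb_per_second_read = c(100, 20, 30, 80, 120),
  stringsAsFactors   = FALSE
)

# Different metrics can use different summaries: average the busy
# percentage, but add up the read throughput.
busy <- aggregate(busy_pct ~ hostname + dev_type, data = long, FUN = mean)
rd   <- aggregate(mb_per_second_read ~ hostname + dev_type, data = long,
                  FUN = sum)

# Combine the summaries on their shared keys (hostname, dev_type).
by_type <- merge(busy, rd)

# The "all SSD devices" grouping is then a plain row filter.
ssd_only <- by_type[by_type$dev_type == "ssd", ]
```

The "all spinning disks" grouping is the same filter with "hdd", and "all disk devices" just drops dev_type from the formula.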
>
> Peace,
> david
> --
> David H. Wolfskill r...@catwhisker.org
> Those who murder in the name of God or prophet are blasphemous cowards.
>
> See http://www.catwhisker.org/~david/publickey.gpg for my public key.

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.