Dear folks--
I always seem to find that I spend more than half my time making sure my
input data is in the right form, properly aligned, with no bizarre features. 
You know the drill: five kinds of missing values, three of them documented.
An alpha mistype in one numeric field turns 30,000 numbers into factor
levels.  SPSS conversion turns 250 factors nicely into R factors, except 3
have levels instead of labels. A few columns in some years of a survey have
undocumented differences in units.  Halfway through a 20-year annual survey,
they add two more allowable answers to a question. And so on.
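To make the "alpha mistype in a numeric field" case concrete, here is a minimal sketch of one way to locate the offending entries. The vector `raw` and its values are made up for illustration; a column that has already become a factor would need `as.character()` first.

```r
# Hypothetical raw column: "3O" contains the letter O, "n/a" is an
# undocumented missing-value code, and NA is a genuine missing value.
raw <- c("12", "7.5", "3O", NA, "41", "n/a")

# Coerce to numeric; entries that cannot be parsed become NA (with a warning,
# suppressed here because the NAs are exactly what we are looking for).
num <- suppressWarnings(as.numeric(raw))

# Entries that were present in the raw data but failed to convert.
bad <- which(is.na(num) & !is.na(raw))

print(raw[bad])   # the strings that need a closer look
```

This only separates "failed to parse" from "was already missing"; deciding what each bad string should become is still a manual step.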

I'm looking for things to make my data auditing go faster.  One of them is a
dopey little function, testX(), bundling together a variety of R tools to
tell me what is in an object.  Here it is:

testX <- function(objectX, bar=TRUE) {    # A useful diagnostic function
    # Expression that was passed as objectX. The variable name objName is
    # assumed; the original name is missing from this copy of the post.
    objName <- deparse(substitute(objectX))
    if (bar) cat("########################\n")  # visual separation between consecutive objects
    cat("testX(", objName, "): ");  cat("Class=", class(objectX))
    cat("  Mode=", mode(objectX), "\n")
    cat("Summary:\n"); print(summary(objectX))
    cat("Structure:\n");  str(objectX)
    if (is.factor(objectX)) {
        cat("Levels: ", levels(objectX), "\n")
        cat("Length: ", length(objectX), "\n")
    }
}

This works well when I give it the name of a single object. My problem is
when I try to produce descriptions of a bunch of variables in a row, such as
the variables in a list of variables, or all the variables that I have
clomped together in a data frame.  The output is all side effects. Some ways
of passing multiple variables get the name wrong, but the rest right. For
example, if I have a list of variables, and do:

> lapply(varList, testX)

I get an output like this:

testX( X[[1L]] ): Class= factor  Mode= numeric 
1994 1997 1999 2002 2003 2007 2009 
1009 1165  985 2502 2528 2007 3013 
 Factor w/ 7 levels "1994","1997",..: 1 1 1 1 1 1 1 1 1 1 ...
Levels:  1994 1997 1999 2002 2003 2007 2009 
Length:  13209 
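The `X[[1L]]` label is what `deparse(substitute())` sees when the function is called from inside `lapply()`, which passes each element as an indexing expression on its own internal variable `X`. A minimal sketch of that behaviour follows; `showArg` and `demoList` are hypothetical names used only here, and the exact index shown in the label may differ between R versions.

```r
# Hypothetical helper: returns the expression its argument was called with.
showArg <- function(z) deparse(substitute(z))

# Made-up list standing in for a list of variables.
demoList <- list(year = c(1994, 1997), party = c("D", "R"))

# Called directly, the label is the expression typed by the user.
direct <- showArg(demoList$year)

# Called through lapply(), the label is lapply()'s own internal expression,
# not the name of the list element.
viaLapply <- unlist(lapply(demoList, showArg))

print(direct)
print(viaLapply)
```

So the name cannot be recovered from inside the function in this case; it has to be handed in separately.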

If instead I loop through the variable names in a data.frame, I get the
name wrong _and_ it does not evaluate all the way to an object:

> names(var.df)
 [1] "year"      "YEAR"      "AGE"       "COHORT.5"  "COHORT.10" "ETHNIC"    "EDUC"      "INCOME"    "INTERNET"  "PARTY"     "IDEOL" 

> for (sel in 1:length(names(var.df))) testX(names(var.df)[sel]) 

Gives an output like this:

testX( names(var.df)[sel] ): Class= character  Mode= character 
   Length     Class      Mode 
        1 character character 
 chr "year"

Or I can select the column itself instead of its name. This gives me the
right object description, but not the right name, thus:

> for (sel in 1:length(names(var.df))) testX(var.df[[sel]])

testX( var.df[[sel]] ): Class= integer  Mode= numeric 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1994    2002    2003    2003    2007    2009 
 int [1:13209] 1994 1994 1994 1994 1994 1994 1994 1994 1994 1994 ...

I've tried doing various things to names(var.df)[sel] to get it closer to
the object -- as.symbol, eval(substitute() ), several others, but I just get
variations on the output above. 
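For reference, a minimal sketch of standard ways to get from a name held in a character string to the object it names. The data frame `demo.df` and the vector `standalone` are made up for illustration.

```r
# Made-up data frame standing in for var.df.
demo.df <- data.frame(YEAR = c(1994L, 1997L), PARTY = factor(c("D", "R")))

nm <- names(demo.df)[1]   # just the string "YEAR", not the column

# Index the data frame by the string to get the column itself.
col1 <- demo.df[[nm]]

# Turn the string into a symbol and evaluate it with the data frame
# serving as the environment.
col2 <- eval(as.symbol(nm), demo.df)

# For a free-standing object (not a column), get() looks it up by name.
standalone <- 1:3
obj <- get("standalone")

str(col1)
```

In each case the lookup needs to know where the name lives (a data frame or an environment); the string alone does not carry that information.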

So there are actually two questions here:
1.  How can I write this function so that it works when I just give it an
object, but I can also use it with an apply-family function and a  list (or
vector, or whatever)  of objects, and still have it both treat the object as
an object and print its name correctly?  

2.  How can I write the function, or write a loop, or use an apply-family
function, to use this function to go through the columns of a data.frame,
correctly naming and correctly describing each?
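One possible approach to both questions, offered as a sketch rather than a definitive answer: let the function take the label as an optional argument that defaults to the deparsed call expression, and pass the column together with its name when iterating. `describeX`, its `label` argument, and `demo.df` are hypothetical names introduced for this example.

```r
# Hypothetical variant of testX(): the label defaults to the expression the
# caller typed, but can be supplied explicitly when that expression is unhelpful.
describeX <- function(objectX, label = deparse(substitute(objectX)), bar = TRUE) {
    if (bar) cat("########################\n")
    cat("describeX(", label, "): Class=", class(objectX),
        " Mode=", mode(objectX), "\n")
    cat("Summary:\n"); print(summary(objectX))
    cat("Structure:\n"); str(objectX)
    if (is.factor(objectX)) cat("Levels: ", levels(objectX), "\n")
    invisible(label)   # return the label so callers can check it
}

# Made-up data frame standing in for var.df.
demo.df <- data.frame(YEAR  = c(1994L, 1997L, 1999L),
                      PARTY = factor(c("D", "R", "D")))

# Single object: the label comes from the call itself.
single <- describeX(demo.df$YEAR)

# Loop over column names: pass the column and its name together.
for (nm in names(demo.df)) describeX(demo.df[[nm]], label = nm)

# Apply-family version: Map() walks the columns and the names in parallel.
res <- Map(describeX, demo.df, names(demo.df))
```

The trade-off is that the caller supplies the name whenever the function is used indirectly, since inside `lapply()` or a loop the call expression no longer contains it.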

Another way of asking this same question is this: I want to be able to give
testX the name of an object, or a reference to a named object, via
apply-family function, indexing, or whatever.  (A) How can I get the name I
print be the name of the object in both cases? And, (B),
how can I make sure that objectX is the actual object that the name refers
to, and not the name or the reference, in both cases?

Finally, and this should maybe be another post, I'd love to hear if others
have thought through the whole question of efficient data auditing.  Is
there a suite of tools, or a standard set of recommendations, that you use
and like? I'd love to hear any useful advice about how to accelerate this
stage of a project, and get more quickly to its statistical heart.

Most sincerely, andrewH
