On Aug 23, 2012, at 2:02 AM, Ingmar Schuster wrote:

Thanks Rui!

Anybody with ideas regarding filling _while_ binding data frames instead of
afterwards?

Not sure what you mean by "_while_ binding data frames", but the original question seems to be answered by this sentence from the help file for factor:

"For a numeric x, set exclude=NULL to make NA an extra level (prints as <NA>); by default, this is the last level."

fac <- factor(fac, exclude=NULL) # would skip all that `is.na()`, `level=` gymnastics
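
A minimal sketch of what exclude=NULL does, on a made-up vector (the names x, f_default and f_na are illustrative, and the "NOT_REALIZED" label is borrowed from the original question). The last step, renaming the NA level, is an assumption about what a package that rejects NA would need, not something the help file prescribes:

```r
x <- c("used", NA, "of")

f_default <- factor(x)                  # NA stays a missing value, not a level
f_na      <- factor(x, exclude = NULL)  # NA becomes an extra (last) level

levels(f_default)
levels(f_na)

# Give the NA level an ordinary name so that no NA remains in the data.
levels(f_na)[is.na(levels(f_na))] <- "NOT_REALIZED"
as.character(f_na)
```

Without the renaming step the values are still displayed as <NA>, even though they now count as a level.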


If you want to loop over factor dataframe columns:

facidx <- sapply(d, is.factor)
# index as a list (no comma) so that a single factor column is not dropped to a vector
d[facidx] <- lapply(d[facidx], factor, exclude=NULL)
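
The column loop can be wrapped in a small function, roughly as follows. This is only a sketch: na_to_level and its label argument are hypothetical names, and the example data frame is invented to mirror the one in the question. Note that calling factor() again also drops unused levels.

```r
# Hypothetical helper: make NA an explicit, named level in every factor column.
na_to_level <- function(d, label = "NOT_REALIZED") {
  facidx <- vapply(d, is.factor, logical(1))
  d[facidx] <- lapply(d[facidx], function(f) {
    f <- factor(f, exclude = NULL)               # NA becomes a level
    levels(f)[is.na(levels(f))] <- label         # rename it (no-op if absent)
    f
  })
  d
}

d_demo <- data.frame(trg_type     = factor(c("Scientists", "of")),
                     child_type_1 = factor(c(NA, "used")),
                     n            = c(1, NA))
r_demo <- na_to_level(d_demo)
str(r_demo)
```

Non-factor columns such as n are left untouched, so their NA values still have to be handled separately.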

I see no parameters to data.frame or read.table that would allow specifying anything other than the default behavior of factor().

--
David


Ingmar

2012/8/22 Rui Barradas <ruipbarra...@sapo.pt>

Hello,

Your function doesn't seem to be very difficult to generalize.

d <- read.table(text="

  trg_type child_type_1
1 Scientists NA
2        of         used
", header=TRUE, stringsAsFactors=TRUE)  # factors were the default before R 4.0
str(d)

subs_na <- function(tok, na_factor_level = "NOT_REALIZED", na_num = 99999)
{
   ifac <- which(sapply(tok, is.factor))
   inum <- which(sapply(tok, is.numeric))
   for(i in ifac) {
       levels(tok[, i]) <- c(levels(tok[, i]), na_factor_level)
       tok[is.na(tok[, i]), i] <- as.factor(na_factor_level)
   }
   for(i in inum)
       tok[is.na(tok[, i]), i] <- na_num
   tok
}

r1 <- substitute_na(d)
r2 <- subs_na(d)
str(r1)
str(r2)
identical(r1, r2)  # TRUE

You could use the same coding for characters, Dates, etc.
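
As a rough illustration of that remark, the same pattern might be extended to character and Date columns like this. The function name subs_na_more and both sentinel values are assumptions made for the example, not part of the code above:

```r
# Hypothetical extension: replace NA in character and Date columns
# with sentinel values (both defaults are arbitrary choices).
subs_na_more <- function(tok, na_chr = "NOT_REALIZED",
                         na_date = as.Date("9999-12-31")) {
  for (i in seq_along(tok)) {
    col <- tok[[i]]
    if (is.character(col)) {
      col[is.na(col)] <- na_chr
    } else if (inherits(col, "Date")) {
      col[is.na(col)] <- na_date
    }
    tok[[i]] <- col
  }
  tok
}

d2 <- data.frame(s    = c("a", NA),
                 when = as.Date(c(NA, "2012-08-22")),
                 stringsAsFactors = FALSE)
r3 <- subs_na_more(d2)
str(r3)
```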

Hope this helps,

Rui Barradas

Em 22-08-2012 20:16, Ingmar Schuster escreveu:

Hi,

I have a data set with variables that are _not_ missing at random. I use a package for learning a Bayesian network that won't accept NA as a value. From a database I query data frames with k, k+n, k+2n, ... variables (there are always at least k variables as the leftmost columns). Using rbind.fill from the reshape package on two data frames, I get a data frame like

   trg_type child_type_1
1 Scientists NA
2        of         used

Now to get rid of NA values I use the following function, which works for
data frames with only factor values:

  substitute_na <- function(tok, na_factor_level = "NOT_REALIZED") {
    # add the replacement level to every column's levels
    for (i in seq_along(tok)) {
      levels(tok[, i]) <- c(levels(tok[, i]), na_factor_level)
    }
    tok[is.na(tok)] <- as.factor(na_factor_level)
    return(tok)
  }

Is there a better/faster way to do it? It would also be great to be able to distinguish factor columns from numeric columns and use a special numeric value there. The current version of rbind.fill makes no direct reference to the fill value that would let me change its implementation for my purpose.
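
One way to picture "filling while binding" in base R is sketched below. bind_fill is a hypothetical helper, not a function from reshape or plyr; it only fills columns that are entirely absent from one data frame, uses a single fill value for all of them, and is shown on character columns to keep the types simple:

```r
# Hypothetical helper: bind two data frames by column name, filling columns
# that are absent from one of them with `fill` instead of NA.
bind_fill <- function(a, b, fill = "NOT_REALIZED") {
  all_cols <- union(names(a), names(b))
  pad <- function(d) {
    for (nm in setdiff(all_cols, names(d))) {
      d[[nm]] <- rep(fill, nrow(d))   # add the missing column
    }
    d[all_cols]                       # same column order in both frames
  }
  rbind(pad(a), pad(b))
}

a <- data.frame(trg_type = "Scientists", stringsAsFactors = FALSE)
b <- data.frame(trg_type = "of", child_type_1 = "used",
                stringsAsFactors = FALSE)
out <- bind_fill(a, b)
out
```

A type-aware version (a numeric sentinel for numeric columns, a factor level for factors) would need to inspect the column in the other data frame before choosing the fill value.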


Thanks!

Ingmar





--
Ingmar Schuster
Natural Language Processing Group
Department of Computer Science
University of Leipzig
Johannisgasse 26
04103 Leipzig, Germany

Tel. +49 341 9732205

http://asv.informatik.uni-leipzig.de/en/staff/Ingmar_Schuster

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

David Winsemius, MD
Alameda, CA, USA
