On Sun, Jun 14, 2009 at 02:56:01PM -0400, Gabor Grothendieck wrote:
> If read.csv's colClasses= argument is NOT used then read.csv accepts
> double quoted numerics:
>
> 1: > read.csv(stdin())
> 0: A,B
> 1: "1",1
> 2: "2",2
> 3:
>   A B
> 1 1 1
> 2 2 2
>
> However, if colClasses is used then it seems that it does not:
>
> > read.csv(stdin(), colClasses = "numeric")
> 0: A,B
> 1: "1",1
> 2: "2",2
> 3:
> Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
>   scan() expected 'a real', got '"1"'
>
> Is this really intended? I would have expected that a csv file in which
> each field is surrounded with double quotes is acceptable in both
> cases. This may be documented as is yet seems undesirable from
> both a consistency viewpoint and the viewpoint that it should be
> possible to double quote fields in a csv file.
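The quoted report can be reproduced non-interactively as follows. This
is my own sketch, not part of the original report: textConnection()
stands in for the interactive stdin() input.

```r
# Non-interactive reproduction of the reported inconsistency.
csv <- 'A,B\n"1",1\n"2",2'

# Without colClasses the quoted numerics are accepted:
d <- read.csv(textConnection(csv))

# With colClasses = "numeric" the same input is handed to scan(),
# which rejects the quoted field in R versions showing this behaviour:
r <- tryCatch(read.csv(textConnection(csv), colClasses = "numeric"),
              error = function(e) conditionMessage(e))
```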
The problem is not specific to read.csv(). The same difference appears
for read.table():

  read.table(stdin())
  "1" 1
  2 "2"

  #   V1 V2
  # 1  1  1
  # 2  2  2

but

  read.table(stdin(), colClasses = "numeric")
  "1" 1
  2 "2"

  Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
    scan() expected 'a real', got '"1"'

The error occurs in the call of scan() at line 152 in
src/library/utils/R/readtable.R, which is

  data <- scan(file = file, what = what, sep = sep, quote = quote, ...

(This is the third call of scan() in the source code of read.table().)

In this call, scan() receives the column types in the "what" argument.
If a type is specified, scan() performs the conversion itself and fails
if a numeric field is quoted. If the type is not specified, the output
of scan() is of type character, but with any quotes in the input file
eliminated. Columns of unknown type are then converted using
type.convert(), which receives the data already without quotes. The
call of type.convert() is contained in a loop

    for (i in (1L:cols)[do]) {
        data[[i]] <-
            if (is.na(colClasses[i]))
                type.convert(data[[i]], as.is = as.is[i], dec = dec,
                             na.strings = character(0L))
            ## as na.strings have already been converted to <NA>
            else if (colClasses[i] == "factor") as.factor(data[[i]])
            else if (colClasses[i] == "Date") as.Date(data[[i]])
            else if (colClasses[i] == "POSIXct") as.POSIXct(data[[i]])
            else methods::as(data[[i]], colClasses[i])
    }

which also contains branches that could perform the conversion for
columns with a specified type, but these branches are never reached,
since the vector "do" is defined as

  do <- keep & !known

where "known" marks the columns for which the type is known.
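The two code paths described above can be seen directly with scan() and
type.convert() in isolation. This is an illustrative sketch using
textConnection(), not the code from readtable.R itself:

```r
txt <- '"1" 1\n2 "2"'

# Type specified in 'what': scan() performs the conversion itself and
# fails on the quoted numeric field (in affected R versions).
res <- tryCatch(scan(textConnection(txt), what = numeric(), quiet = TRUE),
                error = function(e) conditionMessage(e))

# Type unspecified: scan() returns character data with the quotes
# already stripped, so type.convert() succeeds afterwards.
chars <- scan(textConnection(txt), what = character(), quiet = TRUE)
nums  <- type.convert(chars, as.is = TRUE)   # integer vector 1, 1, 2, 2
```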
It is possible to modify the code so that scan() is called with all
types unspecified and to leave the conversion to the lines

  else if (colClasses[i] == "factor") as.factor(data[[i]])
  else if (colClasses[i] == "Date") as.Date(data[[i]])
  else if (colClasses[i] == "POSIXct") as.POSIXct(data[[i]])
  else methods::as(data[[i]], colClasses[i])

above. Since this solution is already prepared in the code, the patch
is very simple:

--- R-devel/src/library/utils/R/readtable.R   2009-05-18 17:53:08.000000000 +0200
+++ R-devel-readtable/src/library/utils/R/readtable.R 2009-06-25 10:20:06.000000000 +0200
@@ -143,9 +143,6 @@
     names(what) <- col.names
     colClasses[colClasses %in% c("real", "double")] <- "numeric"
 
-    known <- colClasses %in%
-        c("logical", "integer", "numeric", "complex", "character")
-    what[known] <- sapply(colClasses[known], do.call, list(0))
     what[colClasses %in% "NULL"] <- list(NULL)
     keep <- !sapply(what, is.null)
 
@@ -189,7 +186,7 @@
         stop(gettextf("'as.is' has the wrong length %d != cols = %d",
              length(as.is), cols), domain = NA)
 
-    do <- keep & !known # & !as.is
+    do <- keep & !as.is
     if(rlabp) do[1L] <- FALSE # don't convert "row.names"
     for (i in (1L:cols)[do]) {
         data[[i]] <-

(Also in attachment.)

I did a test as follows:

  d1 <- read.table(stdin())
  "1" TRUE 3.5
  2 NA "0.1"
  NA FALSE 0.1
  3 "TRUE" NA

  sapply(d1, typeof)
  #        V1        V2        V3
  # "integer" "logical"  "double"

  is.na(d1)
  #         V1    V2    V3
  # [1,] FALSE FALSE FALSE
  # [2,] FALSE  TRUE FALSE
  # [3,]  TRUE FALSE FALSE
  # [4,] FALSE FALSE  TRUE

  d2 <- read.table(stdin(), colClasses = c("integer", "logical", "double"))
  "1" TRUE 3.5
  2 NA "0.1"
  NA FALSE 0.1
  3 "TRUE" NA

  sapply(d2, typeof)
  #        V1        V2        V3
  # "integer" "logical"  "double"

  is.na(d2)
  #         V1    V2    V3
  # [1,] FALSE FALSE FALSE
  # [2,] FALSE  TRUE FALSE
  # [3,]  TRUE FALSE FALSE
  # [4,] FALSE FALSE  TRUE

I think there was a reason to let scan() perform the type conversion;
for example, it may be more efficient. So, if correct, the above patch
is a possible solution, but some other may be more appropriate.
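For users who hit this error with an unpatched read.table(), the same
idea as in the patch can be applied at the call site: declare the
affected columns as "character" (so scan() strips the quotes) and run
the conversion afterwards with type.convert(). This is my sketch of a
workaround, not code from the patch:

```r
# Workaround sketch for an unpatched read.table(): read everything as
# character, then convert; type.convert() infers the column types.
txt <- '"1" TRUE 3.5\n2 NA "0.1"'
d <- read.table(textConnection(txt), colClasses = "character")
d[] <- lapply(d, type.convert, as.is = TRUE)
sapply(d, typeof)
# "integer" "logical" "double"
```

The cost is that the type is inferred rather than enforced; an explicit
methods::as() per column could be used instead when the declared type
must be kept.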
In particular, function scan() may be modified to remove quotes also
from fields specified as numeric.

Petr.

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel