Re: [R] Can file size affect how na.strings operates in a read.table call?

Jeff Newmiller Thu, 14 Nov 2019 08:36:12 -0800

Consider the following sample:

#####
s <- "A,B,C
0,0,0
1,-99,-99
2,-99 ,-99
3, -99, -99
"


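# Only the exact string "-99" is listed, so the padded fields
# "-99 " and " -99" are not converted to NA.
dta_notok <- read.csv( text = s
                     , header=TRUE
                     , na.strings = c( "-99", "" )
                     )

# Listing the padded variants explicitly catches every case.
dta_ok <- read.csv( text = s
                  , header=TRUE
                  , na.strings = c( "-99", " -99"
                                  , "-99 ", ""
                                  )
                  )

library(data.table)

# fread strips the padding itself; convert back to a data.frame afterwards.
fdt_ok <- fread( text = s, na.strings=c( "-99", "" ) )
fdta_ok <- as.data.frame( fdt_ok )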
#####

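Leading and trailing spaces cause problems, because each field is compared with the na.strings values exactly as written: "-99 " and " -99" do not match "-99". The data.table::fread function has a strip.white argument that defaults to TRUE, but the resulting object is a data.table, which has different semantics than a data.frame.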
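
As a possible alternative to listing every padded variant, read.csv (like read.table) also accepts a strip.white argument, which defaults to FALSE there. The following is a minimal sketch, assuming the same sample as above and that the -99 fields are unquoted; the name dta_strip is only illustrative.

```r
# Same sample as above: some -99 fields carry leading or trailing spaces.
s <- "A,B,C
0,0,0
1,-99,-99
2,-99 ,-99
3, -99, -99
"

# With strip.white = TRUE, unquoted fields are trimmed before they are
# compared against na.strings, so the padded variants need not be listed.
dta_strip <- read.csv( text = s
                     , header = TRUE
                     , na.strings = c( "-99", "" )
                     , strip.white = TRUE
                     )

# Count the NA values per column.
colSums( is.na( dta_strip ) )
```

This keeps the result a plain data.frame, at the cost of also trimming whitespace from any unquoted text fields in the file.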


On Thu, 14 Nov 2019, Sebastien Bihorel wrote:

The data file is a csv file. Some text variables contain spaces.

"Check for extraneous spaces"
Are there specific locations that would be more critical than others?


____________________________________________________________________________
From: Jeff Newmiller <jdnew...@dcn.davis.ca.us>
Sent: Thursday, November 14, 2019 10:52
To: Sebastien Bihorel <sebastien.biho...@cognigencorp.com>; Sebastien
Bihorel via R-help <r-help@r-project.org>; r-help@r-project.org
<r-help@r-project.org>
Subject: Re: [R] Can file size affect how na.strings operates in a
read.table call?

Check for extraneous spaces. You may need more variations of the na.strings.
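
One way to locate such fields is to read every column as text and count the values that equal the missing-value code only after trimming. This is a rough sketch, assuming -99 is the code of interest; the names raw and padded are illustrative, and the small sample string stands in for the real file.

```r
# Small stand-in for the real file; some -99 fields are padded with spaces.
s <- "A,B,C
0,0,0
1,-99,-99
2,-99 ,-99
3, -99, -99
"

# Read every column as character so the padding is preserved as-is.
raw <- read.csv( text = s
               , header = TRUE
               , colClasses = "character"
               , strip.white = FALSE
               )

# Per column, count fields that equal "-99" only after trimming whitespace.
padded <- sapply( raw
                , function( x ) sum( trimws( x ) == "-99" & x != "-99"
                                   , na.rm = TRUE
                                   )
                )
padded
```

Columns with a non-zero count are the ones where the plain "-99" entry in na.strings would miss some values.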

On November 14, 2019 7:40:42 AM PST, Sebastien Bihorel via R-help
<r-help@r-project.org> wrote:
>Hi,
>
>I have this generic function to read ASCII data files. It is
>essentially a wrapper around the read.table function. My function is
>used in a large variety of situations and has no a priori knowledge
>about the data file it is asked to read. Nothing is known about file
>size, variable types, variable names, or data table dimensions.
>
>One argument of my function is na.strings which is passed down to
>read.table.
>
>Recently, a user tried to read a data file of ~ 80 MB (~ 93000 rows by
>~ 160 columns) using na.strings = c('-99', '.') with the intention of
>interpreting '.' and '-99'
>strings as the internal missing data NA. Dots were converted to NA
>appropriately. However, not all -99 values in the data were interpreted
>as NA. In some variables, -99 were converted to NA, while in others -99
>was read as a number. More surprisingly, when the data file was cut in
>smaller chunks (ie, by dropping either rows or columns) saved in
>multiple files, the function calls applied on the new data files
>resulted in the correct conversion of the -99 values into NAs.
>
>In all cases, the data frames produced by read.table contained the
>expected number of records.
>
>While, on face value, it appears that file size affects how the
>na.strings argument operates, I am wondering if there is something else
>at play here.
>
>Unfortunately, I cannot share the data file for confidentiality reasons
>but was wondering if you could suggest some checks I could perform to
>get to the bottom of this issue.
>
>Thank you in advance for your help and sorry for the lack of
>reproducible example.
>
>
>______________________________________________
>R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>https://stat.ethz.ch/mailman/listinfo/r-help
>PLEASE do read the posting guide
>http://www.R-project.org/posting-guide.html
>and provide commented, minimal, self-contained, reproducible code.

--
Sent from my phone. Please excuse my brevity.


---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnew...@dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                      Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
---------------------------------------------------------------------------
