Consider the following sample:
#####
s <- "A,B,C
0,0,0
1,-99,-99
2,-99 ,-99
3, -99, -99
"
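# "-99 " and " -99" do not match "-99", so those cells are read as -99
dta_notok <- read.csv( text = s
                     , header=TRUE
                     , na.strings = c( "-99", "" )
                     )
# listing every padded variant turns all of them into NA
dta_ok <- read.csv( text = s
                  , header=TRUE
                  , na.strings = c( "-99", " -99"
                                  , "-99 ", ""
                                  )
                  )
# fread strips white space by default, so "-99" alone is enough
library(data.table)
fdt_ok <- fread( text = s, na.strings=c( "-99", "" ) )
fdta_ok <- as.data.frame( fdt_ok )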
#####
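Leading and trailing spaces cause the problem: read.csv does not strip
white space from fields by default, so "-99 " and " -99" do not match the
na.strings entry "-99" and are read as the number -99. The
data.table::fread function has a strip.white argument that defaults to
TRUE, but the resulting object is a data.table, which has different
semantics than a data.frame.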
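
If staying with read.csv is preferred, it has a strip.white argument of its
own (FALSE by default). The following is only a minimal sketch on the same
sample, assuming the padding occurs in unquoted fields; the name dta_strip
is made up for illustration.

```r
# Same sample as above: "-99" appears bare, with a trailing space,
# and with a leading space.
s <- "A,B,C
0,0,0
1,-99,-99
2,-99 ,-99
3, -99, -99
"

# strip.white = TRUE trims unquoted fields before they are compared
# against na.strings, so a single "-99" entry should be enough.
dta_strip <- read.csv( text = s
                     , header = TRUE
                     , strip.white = TRUE
                     , na.strings = c( "-99", "" )
                     )

# Inspect the result and count the NA values per column.
print( dta_strip )
print( colSums( is.na( dta_strip ) ) )
```

Note that strip.white = TRUE also trims leading and trailing white space
from unquoted character fields, which may or may not be wanted when text
variables contain spaces.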
On Thu, 14 Nov 2019, Sebastien Bihorel wrote:
The data file is a csv file. Some text variables contain spaces.
"Check for extraneous spaces"
Are there specific locations that would be more critical than others?
____________________________________________________________________________
From: Jeff Newmiller <jdnew...@dcn.davis.ca.us>
Sent: Thursday, November 14, 2019 10:52
To: Sebastien Bihorel <sebastien.biho...@cognigencorp.com>; Sebastien
Bihorel via R-help <r-help@r-project.org>; r-help@r-project.org
<r-help@r-project.org>
Subject: Re: [R] Can file size affect how na.strings operates in a
read.table call?
Check for extraneous spaces. You may need more variations of the na.strings.
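
One possible way to check for extraneous spaces is to read every column as
character and compare each value with its trimmed version. This is a rough
sketch on a small made-up sample, not on the real file; the names raw and
padded are hypothetical.

```r
# Made-up sample with a trailing space in row 3 and leading spaces in row 4.
s <- "A,B,C
0,0,0
1,-99,-99
2,-99 ,-99
3, -99, -99
"

# Read everything as character so no white space is lost in type conversion.
raw <- read.csv( text = s
               , header = TRUE
               , colClasses = "character"
               , strip.white = FALSE
               )

# Number of values per column that carry leading or trailing white space.
padded <- vapply( raw
                , function( x ) sum( x != trimws( x ), na.rm = TRUE )
                , integer( 1 )
                )
print( padded )

# Row positions of the padded values, column by column.
print( lapply( raw, function( x ) which( x != trimws( x ) ) ) )
```

Columns with a nonzero count are the ones where a bare "-99" entry in
na.strings would miss some values.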
On November 14, 2019 7:40:42 AM PST, Sebastien Bihorel via R-help
<r-help@r-project.org> wrote:
>Hi,
>
>I have this generic function to read ASCII data files. It is
>essentially a wrapper around the read.table function. My function is
>used in a large variety of situations and has no a priori knowledge
>about the data file it is asked to read. Nothing is known about file
>size, variable types, variable names, or data table dimensions.
>
>One argument of my function is na.strings which is passed down to
>read.table.
>
>Recently, a user tried to read a data file of ~ 80 MB (~ 93000 rows by
>~ 160 columns) using na.strings = c('-99', '.') with the intention of
>interpreting '.' and '-99' strings as the internal missing data NA.
>Dots were converted to NA
>appropriately. However, not all -99 values in the data were interpreted
>as NA. In some variables, -99 were converted to NA, while in others -99
>was read as a number. More surprisingly, when the data file was cut into
>smaller chunks (i.e., by dropping either rows or columns) saved in
>multiple files, the function calls applied on the new data files
>resulted in the correct conversion of the -99 values into NAs.
>
>In all cases, the data frames produced by read.table contained the
>expected number of records.
>
>While, on face value, it appears that file size affects how the
>na.strings argument operates, I am wondering if there is something else at
>play here.
>
>Unfortunately, I cannot share the data file for confidentiality reasons
>but was wondering if you could suggest some checks I could perform to
>get to the bottom of this issue.
>
>Thank you in advance for your help and sorry for the lack of
>reproducible example.
>
>
>______________________________________________
>R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>https://stat.ethz.ch/mailman/listinfo/r-help
>PLEASE do read the posting guide
>http://www.R-project.org/posting-guide.html
>and provide commented, minimal, self-contained, reproducible code.
--
Sent from my phone. Please excuse my brevity.
---------------------------------------------------------------------------
Jeff Newmiller The ..... ..... Go Live...
DCN:<jdnew...@dcn.davis.ca.us> Basics: ##.#. ##.#. Live Go...
Live: OO#.. Dead: OO#.. Playing
Research Engineer (Solar/Batteries O.O#. #.O#. with
/Software/Embedded Controllers) .OO#. .OO#. rocks...1k
---------------------------------------------------------------------------