I tend to keep data in Excel, because that way I can keep the data and the 
analysis output in one file. Part of this is that I tend to use SAS, which 
produces abundant output.
One way this type of result happens is junk in the file. Someone might put 
a space or a period in a cell, and such characters are hard to find. I 
select entire columns and rows and delete everything for several dozen rows 
past where the data end in the worksheet. For all I know someone made a few 
calculations and then tried to "clean up the data" but did not remove 
everything, or maybe the cat walked across the keyboard and left presents. 
Another issue is when someone is not consistent about how they enter missing 
data. Sometimes you get a blend of "na", ".", and "  " along with empty 
cells. Global replace can be your friend. One indication of these sorts of 
problems is a numeric column that reads in as character: if there is even 
one non-numeric value, the whole variable becomes character rather than 
numeric.
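
In R you can often head this off at read time. A minimal sketch, assuming 
the file is named "KurtzData.csv" (the name is just for illustration):

     ## treat the common junk markers as missing at read time;
     ## strip.white trims stray leading/trailing spaces from fields
     KurtzData <- read.csv("KurtzData.csv",
                           na.strings = c("NA", "na", ".", ""),
                           strip.white = TRUE)

     ## quick check: which columns came in as character when you
     ## expected numeric?
     sapply(KurtzData, class)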

Hope that helps.

Regards,
Tim

-----Original Message-----
From: R-help <r-help-boun...@r-project.org> On Behalf Of Duncan Murdoch
Sent: Sunday, September 24, 2023 6:17 AM
To: Parkhurst, David <parkh...@indiana.edu>; r-help@r-project.org
Subject: Re: [R] Odd result

On 23/09/2023 6:55 p.m., Parkhurst, David wrote:
> With help from several people, I used file.choose() to get my file name, and 
> read.csv() to read in the file as KurtzData.  Then when I print KurtzData, 
> the last several lines look like this:
> 39   5/31/22              16.0      341    1.75525 0.0201 0.0214   7.00
> 40   6/28/22  2:00 PM      0.0      215    0.67950 0.0156 0.0294     NA
> 41   7/25/22 11:00 AM      11.9   1943.5        NA     NA 0.0500   7.80
> 42   8/31/22                  0    220.5        NA     NA 0.0700  30.50
> 43   9/28/22              0.067     10.9        NA     NA 0.0700  10.20
> 44  10/26/22              0.086      237        NA     NA 0.1550  45.00
> 45   1/12/23  1:00 PM     36.26    24196        NA     NA 0.7500 283.50
> 46   2/14/23  1:00 PM     20.71       55        NA     NA 0.0500   2.40
> 47                                              NA     NA     NA     NA
> 48                                              NA     NA     NA     NA
> 49                                              NA     NA     NA     NA
>
> Then the NAs go down to one numbered 973. Where did those extras likely 
> come from, and how do I get rid of them? I assume I need to get rid of 
> all the lines after #46 to do calculations and graphics, no?

Many Excel spreadsheets have a lot of garbage outside the range of the data. 
Sometimes it is visible if you know where to look; sometimes it is blank 
cells. Perhaps at some point you (or the file creator) accidentally entered 
a number in line 973. Then Excel will think the sheet has 973 lines. I don't 
know the best way to tell Excel that those lines are pure garbage.
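
If you want to see how far the junk extends once the file is in R, 
something along these lines should work (assuming the junk rows read in 
as NA or empty strings):

     ## eyeball the tail end of the data frame
     tail(KurtzData)

     ## count rows where every field is NA or blank
     sum(apply(KurtzData, 1, function(r) all(is.na(r) | trimws(r) == "")))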

That's why old fogies like me recommend that you do as little as possible in 
Excel.  Get the data into a reliable form as soon as possible.

Once it is an R data frame, you can delete rows using negative indices.
In this case use

     fixed <- KurtzData[-(47:nrow(KurtzData)), ]

which will create a new data frame with only rows 1 to 46.
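
If you'd rather not hardcode row 47, a variant (again assuming the junk 
rows are entirely NA or blank) is to keep only rows with at least one 
real value:

     ## TRUE for rows where every field is NA or blank
     empty <- apply(KurtzData, 1,
                    function(r) all(is.na(r) | trimws(r) == ""))
     fixed <- KurtzData[!empty, ]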

Duncan Murdoch

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.r-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.