Re: [R] reading data from web data sources

Phil Spector Sat, 27 Feb 2010 14:54:50 -0800

Sorry, I forgot to cc the group:

Tim -
   Here's a way to read the data into a list, with one entry per year:


x = read.table('http://climate.arm.ac.uk/calibrated/soil/dsoil100_cal_1910-1919.dat',
               header=FALSE,fill=TRUE,skip=13)
cts = apply(x,1,function(x)sum(is.na(x)))
wh = which(cts == 12)
start = wh+1
end = c(wh[-1] - 1,nrow(x))
ans = mapply(function(i,j)x[i:j,],start,end,SIMPLIFY=FALSE)
names(ans) = x[wh,1]
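
For anyone who wants to try the idea without hitting the network, here is a minimal sketch of the same splitting logic on a made-up miniature sample. The name `sample_txt`, the invented readings, and the use of three reading columns instead of twelve are all assumptions for illustration; this is not the real file.

```r
# Made-up miniature of the file layout: a year on a line of its own,
# followed by daily rows (day number plus 3 readings instead of 12).
sample_txt <- c("1910",
                " 1 5.1 5.3 6.0",
                " 2 5.2 5.4 6.1",
                "1911",
                " 1 4.9 5.0 5.8",
                " 2 5.0 5.1 5.9")
x <- read.table(text = sample_txt, header = FALSE, fill = TRUE)

n_readings <- ncol(x) - 1                      # 3 here, 12 in the real file
cts <- apply(x, 1, function(r) sum(is.na(r)))  # NA count per row
wh <- which(cts == n_readings)                 # rows that hold only a year
start <- wh + 1                                # first data row of each year
end <- c(wh[-1] - 1, nrow(x))                  # last data row of each year
ans <- mapply(function(i, j) x[i:j, ], start, end, SIMPLIFY = FALSE)
names(ans) <- x[wh, 1]                         # label each block by its year

str(ans, max.level = 1)
```

Each list element holds only the daily rows for that year, so the year-only rows are no longer interleaved with the data.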

Hope this helps.
                                        - Phil Spector



On Sat, 27 Feb 2010, Gabor Grothendieck wrote:

No one else posted so the other post you are referring to must have
been an email to you, not a post.  We did not see it.

By one off I think you are referring to the row names, which are
meaningless, rather than the day numbers.  The data for day 1 is
present, not missing.  The example code did replace the day number
column with the year since the days were just sequential and therefore
derivable, but it's trivial to keep them if that is important to you, and
we have made that change below.

The previous code used grep to kick out lines that had any character
not in the set: space, digit and period, but in this version we add the
minus sign to that set.   We also corrected the year column and added
column names and converted all -999 strings to NA.  Due to this last
point we cannot use na.omit any more but we now have iy available that
distinguishes between year rows and other rows.

Every line here has been indented so anything that starts at the left
column must have been word wrapped in transmission.

 myURL <- "http://climate.arm.ac.uk/calibrated/soil/dsoil100_cal_1910-1919.dat";
 raw.lines <- readLines(myURL)
 DF <- read.table(textConnection(raw.lines[!grepl("[^- 0-9.]", raw.lines)]),
   fill = TRUE, col.names = c("day", month.abb), na.strings = "-999")

 iy <- is.na(DF[[2]]) # is year row
 DF$year <- DF[iy, 1][cumsum(iy)]
 DF <- DF[!iy, ]

 DF
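
One thing to watch with the code above: `iy` is computed after the `-999` strings have become NA, so a day row whose January reading is missing would look like a year row. I have not checked whether that ever happens in the real file. A possible variation, sketched here on made-up data (`sample_txt`, the readings, and the two-month layout are assumptions), is to find the year rows by counting fields on the raw lines instead:

```r
# Made-up sample: day 2 of 1910 has a missing January reading.
sample_txt <- c("1910",
                " 1  5.1  5.3",
                " 2 -999  5.4",
                "1911",
                " 1  4.9 -999")

keep <- sample_txt[!grepl("[^- 0-9.]", sample_txt)]  # drop comment lines
keep <- keep[nzchar(trimws(keep))]                   # drop blank lines
iy <- lengths(strsplit(trimws(keep), " +")) == 1     # one field => year row

DF <- read.table(text = keep, fill = TRUE,
                 col.names = c("day", "Jan", "Feb"), na.strings = "-999")
DF$year <- DF[iy, 1][cumsum(iy)]   # repeat each year down its block
DF <- DF[!iy, ]                    # then drop the year-only rows
DF
```

Because `iy` comes from the raw text, an NA in any month column no longer affects which rows count as year rows.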


On Sat, Feb 27, 2010 at 3:28 PM, Tim Coote <tim+r-project....@coote.org> wrote:

Thanks, Gabor. My takeaway from this and Phil's post is that I'm going to
have to construct some code to do the parsing, rather than use a standard
function. I'm afraid that neither approach works, yet:

Gabor's has an off-by-one error (days start on the 2nd, not the 1st),
and the years get messed up around the 29th day.  I think that the na.omit(DF)
line is throwing out the baby with the bathwater.  It's interesting that
this approach is based on read.table; I'd assumed that I'd need read.ftable,
whose documentation I couldn't understand.  What is it that's
removing the -999 and -888 values in this code? They seem to be gone, but I
cannot see why.

Phil's reads in the data, but interleaves rows with just a year and all
other values as NA.

Tim

On 27 Feb 2010, at 17:33, Gabor Grothendieck wrote:

Mark Leeds pointed out to me that the code wrapped around in the post
so it may not be obvious that the regular expression in the grep is
(i.e. it contains a space):
"[^ 0-9.]"
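
A small sketch of what that character class does, on made-up lines (the vector `lines` and its contents are invented for illustration). It also shows why the -999 and -888 readings disappear with this pattern: any line containing a minus sign is rejected whole, whereas the pattern with a minus sign added to the class keeps it.

```r
lines <- c("Some header comment text",  # made-up comment line
           "1910",
           "  1  5.1  5.3",
           "  2 -999  5.4",
           "")

# Keep a line only if it has no character outside the allowed set.
old_keep <- !grepl("[^ 0-9.]", lines)   # space, digits, period
new_keep <- !grepl("[^- 0-9.]", lines)  # same, plus the minus sign

data.frame(line = lines, old_keep, new_keep)
```

Note that the empty line passes both filters; read.table skips blank lines by default, so that is harmless there.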


On Sat, Feb 27, 2010 at 7:15 AM, Gabor Grothendieck
<ggrothendi...@gmail.com> wrote:


Try this.  First we read the raw lines into R using grep to remove any
lines containing a character that is not a number or space.  Then we
look for the year lines and repeat them down V1 using cumsum.  Finally
we omit the year lines.

myURL <- "http://climate.arm.ac.uk/calibrated/soil/dsoil100_cal_1910-1919.dat";
raw.lines <- readLines(myURL)
DF <- read.table(textConnection(raw.lines[!grepl("[^ 0-9.]", raw.lines)]),
                 fill = TRUE)
DF$V1 <- DF[cumsum(is.na(DF[[2]])), 1]
DF <- na.omit(DF)
head(DF)


On Sat, Feb 27, 2010 at 6:32 AM, Tim Coote <tim+r-project....@coote.org>
wrote:


Hullo
I'm trying to read some time series data of meteorological records that are
available on the web (eg
http://climate.arm.ac.uk/calibrated/soil/dsoil100_cal_1910-1919.dat). I'd
like to be able to read in the digital data directly into R. However, I
cannot work out the right function and set of parameters to use.  It could
be that the only practical route is to write a parser, possibly in some
other language, reformat the files and then read these into R. As far as I
can tell, the informal grammar of the file is:

<comments terminated by a blank line>
[<year number on a line on its own>
<daily readings lines> ]+

and the <daily readings> are of the form:
<whitespace> <day number> [<whitespace> <reading on day of month>] 12

Readings for days in months where a day does not exist have special values.
Missing values have a different special value.
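
One way such special values might be handled once the data is in a wide table is sketched below. The data frame `wide`, its readings, and the two-month layout are made up. The codes -999 and -888 are the ones mentioned elsewhere in this thread; which of them means "missing" and which means "no such day" is not stated, so treating both as NA is an assumption.

```r
# Made-up wide table: one row per day of month, one column per month.
wide <- data.frame(year = 1910L,
                   day  = c(28L, 29L, 30L),
                   Jan  = c(5.1, -999, 5.3),
                   Feb  = c(4.0, -888, -888))
months <- c("Jan", "Feb")

# Stack the month columns into one long column.
long <- data.frame(
  year  = rep(wide$year, times = length(months)),
  month = rep(match(months, month.abb), each = nrow(wide)),
  day   = rep(wide$day, times = length(months)),
  value = unlist(wide[months], use.names = FALSE)
)

long$value[long$value %in% c(-999, -888)] <- NA   # assumed sentinel codes

# With an explicit format, as.Date() gives NA for impossible dates
# such as 30 February, so those rows can simply be dropped.
long$date <- as.Date(sprintf("%d-%02d-%02d", long$year, long$month, long$day),
                     format = "%Y-%m-%d")
long <- long[!is.na(long$date), ]
long
```

Dropping rows by date rather than by sentinel value keeps genuinely missing readings in the series as NA while removing days that never existed.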

And then I've got the problem of iterating over all relevant files to get a
whole timeseries.
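
A rough sketch of how the iteration might look, assuming each file can be parsed by one helper and the results stacked with rbind. The helper name `read_decade` and the in-memory stand-in files are hypothetical; with real data each list element would come from readLines() on one decade's URL, and only the 1910-1919 URL is given in this thread.

```r
# Hypothetical helper: parse the lines of one file into a data frame.
read_decade <- function(lines, months = month.abb) {
  lines <- lines[!grepl("[^- 0-9.]", lines)]         # drop comment lines
  lines <- lines[nzchar(trimws(lines))]              # drop blank lines
  iy <- lengths(strsplit(trimws(lines), " +")) == 1  # year-only lines
  DF <- read.table(text = lines, fill = TRUE, col.names = c("day", months))
  DF$year <- DF[iy, 1][cumsum(iy)]                   # carry the year down
  DF[!iy, ]
}

# Made-up stand-ins for two downloaded files (2 month columns, not 12).
files <- list(
  c("comment text", "", "1910", " 1 5.1 5.3", " 2 5.2 5.4"),
  c("comment text", "", "1920", " 1 4.9 5.0")
)

all_years <- do.call(rbind,
                     lapply(files, read_decade, months = c("Jan", "Feb")))
all_years
```

Keeping the parsing in one function means a fix to the year handling or the special values only has to be made in one place.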

Is there a way to read in this type of file into R? I've read all of the
examples that I can find, but cannot work out how to do it. I don't think
that read.table can handle the separate sections of data representing each
year. read.ftable maybe can be coerced to parse the data, but I cannot see
how after reading the documentation and experimenting with the parameters.

I'm using R 2.10.1 on osx 10.5.8 and 2.10.0 on Fedora 10.

Any help/suggestions would be greatly appreciated. I can see that this type
of issue is likely to grow in importance, and I'd also like to give the data
owners suggestions on how to reformat their data so that it is easier to
consume by machines, while being easy to read for humans.

The early records are a serious machine parsing challenge as they are tiff
images of old notebooks ;-)

tia

Tim
Tim Coote
t...@coote.org
vincit veritas

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

