[R] How to download and unzip data in a loop
Hi All,

I need to loop through and download the past 10 years of met data to a temporary directory. I then need to unzip it and place it into another directory.

    year = (2005:2015)

    for (i in year)
      tmpdir = tempdir()
      file[i] = file.path(tmpdir, sprintf('724927-23285-%4i.gz', i))
      url = sprintf('ftp://ftp.ncdc.noaa.gov/pub/data/noaa/%4i/724927-23285-%4i.gz', i, i)
      #file = basename(url)
      download.file(url, file[i])
      files = dir(tmpdir, '*.gz', full.names=FALSE)
      read.table(gzfile('files'))

'file' returns 2015 indices, with "/tmp/RtmpKvB4Wz/724927-23285-2015.gz" next to 2015, and 'files' returns 724927-23285-2015.gz. However, when I try to unzip the gz file using the last line, it says it cannot open the connection and the probable reason is that there is no such file or directory.

Thanks,
Alexandra

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: [R] How to download and unzip data in a loop
Thank you guys for the response.

I'm trying to download the last ten years of meteorology data from a weather station in Livermore from the URL:

ftp://ftp.ncdc.noaa.gov/pub/data/noaa/2015/724927-23285-2015.gz

The Livermore station code is 724927-23285. If I wanted to download data from 2005, the URL would be:

ftp://ftp.ncdc.noaa.gov/pub/data/noaa/2005/724927-23285-2005.gz

Once I download the data into a temporary file, I want to unzip it and store it in another directory where I can access it.

Also, why are there 2015 indices instead of just 10 when I'm only looping through 2005:2015?

Thanks,
Alexandra

On Thu, Feb 5, 2015 at 3:11 AM, Jon Skoien wrote:
> In addition to following Jim's suggestion, you should probably also use
> full.names = TRUE, otherwise you will try to open a connection to files in
> your current directory, not in tmpdir.
> Another thing is that the unzipped files appear irregular with respect to
> columns, so read.table might not work too well.
>
> Jon
>
> On 2/5/2015 11:30 AM, jim holtman wrote:
>> try taking the quotes off of 'files'
>>
>> Jim Holtman
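The points raised in this thread can be put together in a rough sketch. It assumes the station code and URL pattern quoted in the posts; the helper names (`build_jobs`, `gunzip_to`, `fetch_all`) and the destination directory `met_data` are hypothetical. Two details may explain the symptoms described. A `for` loop without braces repeats only the first statement after it, so the remaining lines run once, with `i` left at 2015. And assigning to element 2015 of a vector extends it to length 2015 (for example, `x <- character(0); x[2015] <- "a"` gives a vector of length 2015), so indexing by position rather than by year keeps it at 11 elements.

```r
# Hypothetical helpers; only base R functions are used.
years <- 2005:2015
station <- "724927-23285"   # Livermore station code from the post

# Build one URL and one local path per year, indexed 1..11 rather than by year.
build_jobs <- function(years, station, tmpdir = tempdir()) {
  name <- sprintf("%s-%4i.gz", station, years)
  data.frame(
    year = years,
    url  = sprintf("ftp://ftp.ncdc.noaa.gov/pub/data/noaa/%4i/%s", years, name),
    gz   = file.path(tmpdir, name),
    stringsAsFactors = FALSE
  )
}

# Decompress one .gz file into a plain-text file in destdir.
gunzip_to <- function(gz, destdir) {
  dir.create(destdir, showWarnings = FALSE, recursive = TRUE)
  out <- file.path(destdir, sub("\\.gz$", ".txt", basename(gz)))
  con <- gzfile(gz, open = "rt")
  on.exit(close(con))
  writeLines(readLines(con), out)
  out
}

# Download and unpack every year; note the braces around the loop body.
fetch_all <- function(jobs, destdir) {
  for (k in seq_len(nrow(jobs))) {
    download.file(jobs$url[k], jobs$gz[k], mode = "wb")
    gunzip_to(jobs$gz[k], destdir)
  }
  invisible(list.files(destdir, full.names = TRUE))
}

jobs <- build_jobs(years, station)
# fetch_all(jobs, destdir = "met_data")   # needs network access; not run here
```

Since Jon notes that the unzipped files look irregular with respect to columns, the sketch copies the text with readLines/writeLines instead of parsing it with read.table.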
[R] How to unzip a .gz file
Hello,

Can someone help me with unzipping a .gz file? I used:

    readLines(gzfile('/home/file.gz'))

I also found that I could use gunzip, but after trying to install it, it says:

    "package ‘gunzip’ is not available (for R version 2.15.1)"

Thanks,
Alexandra
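A small self-contained sketch of reading a .gz file with base R follows. The file path is made up for the example, and the file is created first so the code runs anywhere. As far as I know, gunzip is a function in the R.utils package rather than a package of its own, which would explain the install message; the sketch below does not need it.

```r
# Write a small gzip-compressed file so the example is self-contained.
gz_path <- file.path(tempdir(), "example.txt.gz")   # hypothetical path
con <- gzfile(gz_path, open = "wt")
writeLines(c("line one", "line two"), con)
close(con)

# Read the compressed file; gzfile() decompresses transparently.
con <- gzfile(gz_path, open = "rt")
lines <- readLines(con)
close(con)

# Save the decompressed text next to the archive.
txt_path <- sub("\\.gz$", "", gz_path)
writeLines(lines, txt_path)
```

Opening and closing the connection explicitly avoids leaving an unused connection behind for R to warn about later.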
[R] Help with looping
Hi,

I need help with a for loop and printing data. I want to loop through a few years and print the data from each year stacked on top of each other. For example:

    for (i in 2000:2003){
      #script for downloading each year
      Data = readLines(sprintf('file/%4i', i))
    }

It only prints out the data from the last year. Also, I tried

    Data[i] = readLines(sprintf('file/%4i', i))

but it says: "number of items to replace is not a multiple of replacement length"

How do I get it to not replace each year of data? I have R version 2.15.1.

Thanks,
Alexandra
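One common way to keep every year is to collect the results in a list and combine them after the loop. The sketch below is only an illustration: `read_year` is a hypothetical stand-in for the readLines call, so that it runs without the files. `Data[i]` is a single slot of a vector while readLines returns many lines, which is what the "not a multiple of replacement length" message is complaining about; a list element (`[[k]]`) can hold a result of any length.

```r
years <- 2000:2003

# Stand-in for readLines(sprintf('file/%4i', i)): returns a few lines per year.
read_year <- function(year) sprintf("%4i record %d", year, 1:3)

# A list holds one element per year, whatever its length.
yearly <- vector("list", length(years))
names(yearly) <- years
for (k in seq_along(years)) {
  yearly[[k]] <- read_year(years[k])
}

# Stack the years on top of each other in one character vector.
Data <- unlist(yearly, use.names = FALSE)
```

Looping over `seq_along(years)` instead of the years themselves also avoids creating a list with 2003 slots.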
[R] Replacing 9999 and 999 values with NA
Hello All,

I have a data frame of two columns for wind. The first column is for wind speed and the second wind direction. I'm trying to replace the 9999 values in the first column and the 999 values in the second column with NA. I tried to use the function ltdl.fix.df but it doesn't seem to do anything.

    > ltdl.fix.df(windMV, zero2na = FALSE, coded = 999)
    n = 9432 by p = 4 matrix checked, 0 NA(s) present
    0 factor variable(s) present
    5675 value(s) coded 999 set to NA
    0 -ve value(s) set to +ve half the negative value

I have R version 3.1.1.

Thanks,
Alexandra
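A minimal sketch of the replacement with plain logical indexing, using made-up values and assuming the columns are named `ws` and `wd` (the post does not give the names). As a general point, R functions return a modified copy rather than changing their argument, so a call whose result is not assigned back will appear to do nothing; that may be what happened with the call above, though I have not checked that function's documentation.

```r
# Toy data using the missing-value codes from the post.
windMV <- data.frame(ws = c(36, 9999, 72, 46),
                     wd = c(290, 280, 999, 290))

# Logical indexing: assign NA wherever the code appears in that column.
windMV$ws[windMV$ws == 9999] <- NA
windMV$wd[windMV$wd == 999]  <- NA
```

Handling each column separately keeps a legitimate wind speed of 999 from being wiped out by the direction code.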
[R] Error with using windRose function from the openair package
Hello All,

I have a data frame called windSFO of four columns: wind speed, wind direction, station number, and date (mmdd). I downloaded the gz data from a site online and then unzipped it using readLines. I then concatenated these four columns from the unzipped data into a dataframe using cbind.

    windSFO = data.frame(cbind(ws,wd,stn,yearSite))

Here are the first four rows as an example:

      ws  wd          stn yearSite
    1 36 290 724940-23234 20090101
    2 77 280 724940-23234 20090101
    3 72 290 724940-23234 20090101
    4 46 290 724940-23234 20090101

I'm trying to make a wind rose using the windRose function but I keep getting an error that I don't understand. I type in:

    windRose(windSFO,ws='ws',wd='wd')

I then get the error:

    Error in Summary.factor(c(27L, 35L, 34L, 29L, 28L, 25L, 25L, 24L, 24L, :
      max not meaningful for factors
    In addition: Warning messages:
    1: In Ops.factor(mydata[[wd]], 10) : %% not meaningful for factors
    2: In Ops.factor(mydata[[wd]], angle) : / not meaningful for factors

Can anyone tell me what this means/what I'm doing wrong? Also, I have R version 3.1.1.

Thank you!
Alexandra
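The "not meaningful for factors" messages suggest that ws and wd are stored as factors rather than numbers. One likely cause, sketched below with the four rows from the post, is that cbind() on a mix of numbers and text produces a character matrix, which data.frame() then converted to factors by default in R versions before 4.0. `stringsAsFactors = TRUE` is set explicitly here to reproduce that older default; the windRose call itself is left out so the sketch runs without the package.

```r
ws <- c(36, 77, 72, 46)
wd <- c(290, 280, 290, 290)
stn <- rep("724940-23234", 4)
yearSite <- rep("20090101", 4)

# cbind() on mixed types yields a character matrix, so every column is text.
as_text <- data.frame(cbind(ws, wd, stn, yearSite), stringsAsFactors = TRUE)

# Building the data frame directly keeps ws and wd numeric.
windSFO <- data.frame(ws, wd, stn, yearSite, stringsAsFactors = FALSE)

# Repairing an existing factor column: go through character, not straight to numeric.
fixed_wd <- as.numeric(as.character(as_text$wd))
```

Calling as.numeric() directly on a factor returns the internal level codes (the 27L, 35L, ... seen in the error) instead of the original values, hence the detour through as.character().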
[R] Plotting using tapply function output
Hello,

I am trying to plot the hourly standard deviation of wind speeds from 13 different measured locations over many years. I imported the data using readLines and into a dataframe called finalData. Using tapply, I determined the standard deviation of the wind speed (ws) for each hour (hour) from every location (stn) using this command line:

    statHour = tapply(finalData$ws,list(finalData$stn,finalData$hour),sd)

I want to plot the standard deviation for each hour of the day, with hours as the x-axis and the standard deviation for the y-axis, and each station as a different color. I've managed to get a boxplot of this, but ideally, I'd like a scatter plot to determine the variations between each instrument throughout the day. The boxplot command is this:

    boxplot(statHour, names=colnames(statHour),xlab='Hour of the Day',ylab='Standard Deviation of Wind Speed')

I also tried to make a dataframe of the tapply output but it ends up using the hours as the column names instead of putting it into the dataframe.

Please help!! I have R version 3.1.1.

Thanks a lot,
Alexandra
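A possible approach, sketched with simulated data standing in for finalData (three made-up stations instead of 13): the tapply result is a matrix with stations as rows and hours as columns, so matplot() can draw one colored series per station once the matrix is transposed. The last lines show one way to turn the same matrix into a long data frame with the hour as a column rather than as column names.

```r
# Toy stand-in for finalData: 3 stations, 24 hours, 10 readings each.
set.seed(1)
finalData <- expand.grid(stn = c("A", "B", "C"), hour = 0:23, rep = 1:10)
finalData$ws <- rnorm(nrow(finalData), mean = 5, sd = 2)

# Rows are stations, columns are hours, as in the post.
statHour <- tapply(finalData$ws, list(finalData$stn, finalData$hour), sd)

# matplot() draws one series per column, so transpose to get one per station.
hours <- as.numeric(colnames(statHour))
matplot(hours, t(statHour), type = "b", pch = 1, lty = 1,
        col = seq_len(nrow(statHour)),
        xlab = "Hour of the Day", ylab = "Standard Deviation of Wind Speed")
legend("topright", legend = rownames(statHour),
       col = seq_len(nrow(statHour)), pch = 1, lty = 1)

# Long format: one row per station-hour pair, usable by other plotting tools.
statLong <- as.data.frame(as.table(statHour), stringsAsFactors = FALSE)
names(statLong) <- c("stn", "hour", "sd")
```

Using `type = "p"` instead of `"b"` gives points only, if a pure scatter plot is preferred.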
[R] Finding values in a dataframe at a specified hour
Hello,

I have a large dataframe (windHW) of wind speeds (ws) at each hour from many days over a set of years. Some of these values are obviously wrong (600 m/s) and I want to get rid of all the values that are larger than 5*sigma for each hour. The 5*sigma values (variable name sigma5) are located in different dataframes for each season, with each dataframe named after a season. For example, in the dataframe spring, the 5*sigma value is 79.6 m/s for hour 1.

So my question is as follows: how can I get the code to find all the wind speed values in the dataframe windHW, at a specific hour, that are higher than the 5*sigma value for that hour? For example, I would like to find if any of the wind speed values at hour 1 are higher than 79.6 m/s, and if so, then replace that value with NA.

I have something like this but I can't seem to figure out how to get it for specific hours:

    windHW$ws[windHW$ws>=spring$sigma5] <- NA

I imported the data using readLines and into the dataframe windHW. I also have R version 3.1.1.

Any help would be appreciated!

Thanks,
Alexandra
Re: [R] Finding values in a dataframe at a specified hour
Update:

I have this so far.* The first column of windHW is the wind speed. The 5th column of the dataframe, spring, is the 5*sigma value of every hour. hourRow gives out all the rows of wind speed at a given hour.

    for (i in 0:23){
      hourRow = which(windHW$hour==i,arr.ind=TRUE)
      for (h in hourRow){
        if (windHW[h,1]>=spring[spring$hour==i,5]){
          windHW[h,1]<-NA}
      }
    }

This then gives the error:

    Error in if (windHW[h, 1] >= spring[spring$hour == i, 5]) { : argument is of length zero

*Note: The dataframe for each of the seasons has 24 rows corresponding to each hour of the day, 0:23.

Thanks,
Alexandra
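"argument is of length zero" means the condition handed to if() had no elements at all. One way that can happen, shown below on made-up data, is a threshold lookup that matches no row, for instance when the table has no entry for a given hour or no column named `hour` (in which case `spring$hour` is NULL). Whether that is what happened here is only a guess; the `exceeds` helper is hypothetical.

```r
# Toy threshold table that lacks hour 0, to reproduce a failed lookup.
spring <- data.frame(hour = 1:23, sigma5 = 80)

threshold <- spring$sigma5[spring$hour == 0]
length(threshold)            # 0: no row matched

# Comparing against a zero-length value gives a zero-length condition,
# which is what if() rejects.
cond <- 100 >= threshold
length(cond)                 # 0

# A defensive check that only tests when exactly one usable threshold was found.
exceeds <- function(ws, threshold) {
  length(threshold) == 1 && !is.na(ws) && !is.na(threshold) && ws >= threshold
}
```

Printing `spring[spring$hour == i, 5]` for the failing value of `i` would show directly whether the lookup comes back empty.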
Re: [R] Finding values in a dataframe at a specified hour
Hi Jim,

Thanks for the response, but unfortunately it results in the same error. I think something is wrong with the if statement. I tried it out manually for the first row and hour that it's testing and indeed, the wind speed is not higher than the 5*sigma value. Since it is not higher than the 5*sigma value, I would think it would just pass to the next iteration, yet it doesn't. I will keep trying!

Thanks,
Alexandra

On Fri, Apr 10, 2015 at 3:43 PM, Jim Lemon wrote:
> Hi Alexandra,
> The error probably comes from the first iteration of i in 0:23. As indexing
> in R begins at 1, there is no element 0. Try using:
>
> for(i in 1:24) {
> ...
>
> and see what happens.
>
> Jim
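The per-hour comparison can also be done without loops. In the sketch below, match() looks up each observation's threshold by hour, so every wind speed is compared with the value for its own hour. The data are made up, and the column names `ws`, `hour` and `sigma5` are taken from the posts and assumed to exist in both data frames. Because hours are matched by value rather than used as positions, an hour of 0 needs no special handling.

```r
# Toy data: column names follow the posts; the values are made up.
windHW <- data.frame(ws   = c(10, 600, 12, 85, NA, 7),
                     hour = c(0, 0, 1, 1, 2, 2))
spring <- data.frame(hour = 0:23, sigma5 = 79.6)

# Look up each observation's threshold by matching on hour.
limit <- spring$sigma5[match(windHW$hour, spring$hour)]

# Flag values at or above their hour's threshold; which() skips NA comparisons.
too_high <- which(windHW$ws >= limit)
windHW$ws[too_high] <- NA
```

The same lines could be repeated for each seasonal data frame after subsetting windHW to the matching dates.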