file= is the input data file. filter= is just a command string that specifies a program to run (it is not a data file). When filter= is given, read.csv.sql does roughly the following:

1. If Filename.tmp is the name of a temporary file (which read.csv.sql creates), it runs a batch command similar to this: paste("cmd /c", filter, "<", file, ">", Filename.tmp)
2. Then it reads Filename.tmp into the database (which it also creates for you), without involving R, and
3. finally it reads the table that was created in the database into R, as an R data frame, and destroys the database.
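To make the mechanics concrete, here is a rough sketch of those three steps in R. It is only an illustration, not the actual sqldf source: the data file and perl script names are the ones used later in this thread, and shell() is Windows-only, matching the "cmd /c" call above.

library(sqldf)

## Step 1, done by hand: run the filter through cmd.exe, feeding the data file
## to its stdin and capturing its stdout in a temporary file.
tmp <- tempfile()
shell(paste("cmd /c", "perl parse_3wkout.pl", "<", "3wkoutstatfcst_small.dat", ">", tmp))
readLines(tmp, n = 2)   # the comma-separated text that SQLite will actually load

## Steps 2 and 3 happen inside read.csv.sql: the filtered file is bulk-loaded into
## a temporary SQLite database, the table is read back into R as a data frame,
## and the database is removed. From the user's side the whole thing is just:
test_df <- read.csv.sql("3wkoutstatfcst_small.dat", filter = "perl parse_3wkout.pl")

So file= still matters even when filter= is given: it names the file that gets piped into the filter command.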
On Sat, Feb 6, 2010 at 7:53 PM, Vadlamani, Satish {FLNA} <satish.vadlam...@fritolay.com> wrote:
> Gabor:
> It did suppress the message now and I was able to load the data. Question:
>
> 1. test_df <- read.csv.sql(file="3wkoutstatfcst_small.dat", filter="perl parse_3wkout.pl")
>
> In the statement above, should the filename in file= and the file name that the perl script uses through the filter= command be the same? I would think not. I would say that if filter= is passed to the statement, then the filename should be ignored. Is this how it works?
>
> Thanks.
> Satish
>
>
> -----Original Message-----
> From: Gabor Grothendieck [mailto:ggrothendi...@gmail.com]
> Sent: Saturday, February 06, 2010 4:58 PM
> To: Vadlamani, Satish {FLNA}
> Cc: r-help@r-project.org
> Subject: Re: [R] Reading large files
>
> I have uploaded another version which suppresses display of the error message but otherwise works the same. Omitting the redundant arguments we have:
>
> library(sqldf)
> # next line is only needed once per session to read in devel version
> source("http://sqldf.googlecode.com/svn/trunk/R/sqldf.R")
>
> test_df <- read.csv.sql(file="3wkoutstatfcst_small.dat", filter="perl parse_3wkout.pl")
>
>
> On Sat, Feb 6, 2010 at 5:48 PM, Vadlamani, Satish {FLNA} <satish.vadlam...@fritolay.com> wrote:
>> Gabor:
>> Please see the results below. Sourcing your new R script worked (although with the same error message). If I put the eol="\n" option, it adds a "\r" to the last column, so I took out the eol option below. This is just some more feedback to you.
>>
>> I am thinking that I will just do an inline edit in Perl (that is, create the csv file through Perl by overwriting the current file) and then use read.csv.sql without the filter= option. This seems to be more tried and tested. If you have any suggestions, please let me know. Thanks.
>> Satish
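As an aside, a minimal sketch of the two-step route described above: convert with Perl first, then call read.csv.sql on the resulting plain CSV without filter=. The file and script names are the ones used in this thread, and the assumption that parse_3wkout.pl reads the fixed-width data on stdin and writes CSV to stdout is mine.

library(sqldf)

## Run the converter once (Windows shell, so that "<" and ">" redirection works),
## writing a real .csv file next to the original data.
shell("perl parse_3wkout.pl < 3wkoutstatfcst_small.dat > 3wkoutstatfcst_small.csv")

## The file is now ordinary CSV, so no filter= is needed.
test_df <- read.csv.sql("3wkoutstatfcst_small.csv")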
>>
>> BEFORE SOURCING YOUR NEW R SCRIPT
>>> test_df <- read.csv.sql(file="3wkoutstatfcst_small.dat", sql = "select * from file", header = TRUE, sep = ",", filter="perl parse_3wkout.pl")
>> Error in readRegistry(key, maxdepth = 3) :
>>   Registry key 'SOFTWARE\R-core' not found
>>> test_df
>> Error: object 'test_df' not found
>>
>> AFTER SOURCING YOUR NEW R SCRIPT
>>> source("f:/dp_modeling_team/downloads/R/sqldf.R")
>>> test_df <- read.csv.sql(file="3wkoutstatfcst_small.dat", sql = "select * from file", header = TRUE, sep = ",", filter="perl parse_3wkout.pl")
>> Error in readRegistry(key, maxdepth = 3) :
>>   Registry key 'SOFTWARE\R-core' not found
>> In addition: Warning messages:
>> 1: closing unused connection 5 (3wkoutstatfcst_small.dat)
>> 2: closing unused connection 4 (3wkoutstatfcst_small.dat)
>> 3: closing unused connection 3 (3wkoutstatfcst_small.dat)
>>> test_df
>>   allgeo area1 zone dist ccust1 whse bindc ccust2 account area2 ccust3
>> 1      A     4    1   37     99 4925  4925     99      99     4     99
>> 2      A     4    1   37     99 4925  4925     99      99     4     99
>>
>> -----Original Message-----
>> From: Gabor Grothendieck [mailto:ggrothendi...@gmail.com]
>> Sent: Saturday, February 06, 2010 4:28 PM
>> To: Vadlamani, Satish {FLNA}
>> Cc: r-help@r-project.org
>> Subject: Re: [R] Reading large files
>>
>> The software attempts to read the registry and temporarily augment the path in case you have Rtools installed, so that the filter can access all the tools that Rtools provides. I am not sure why it's failing on your system, but there are evidently some differences between systems here, and I have added some code to trap and bypass that portion in case it fails. I have added the new version to the svn repository, so try this:
>>
>> library(sqldf)
>> # overwrite with development version
>> source("http://sqldf.googlecode.com/svn/trunk/R/sqldf.R")
>> # your code to call read.csv.sql
>>
>>
>> On Sat, Feb 6, 2010 at 5:18 PM, Vadlamani, Satish {FLNA} <satish.vadlam...@fritolay.com> wrote:
>>>
>>> Gabor:
>>> Here is the update. As you can see, I got the same error as below in 1.
>>>
>>> 1. Error
>>> test_df <- read.csv.sql(file="out_small.txt", sql = "select * from file", header = TRUE, sep = ",", filter="perl parse_3wkout.pl", eol="\n")
>>> Error in readRegistry(key, maxdepth = 3) :
>>>   Registry key 'SOFTWARE\R-core' not found
>>>
>>> 2. But the loading of the bigger file was successful, as you can see below: 857 MB, 333,250 rows, 227 columns. This is good.
>>>
>>> I will have to just do an inline edit in Perl and change the file to csv from within R and then call read.csv.sql.
>>>
>>> If you have any suggestions to fix 1, I would like to try them.
>>>
>>> system.time(test_df <- read.csv.sql(file="out.txt"))
>>>    user  system elapsed
>>>  192.53   15.50  213.68
>>> Warning message:
>>> closing unused connection 3 (out.txt)
>>>
>>> Thanks again.
>>>
>>> Satish
>>>
>>> -----Original Message-----
>>> From: Gabor Grothendieck [mailto:ggrothendi...@gmail.com]
>>> Sent: Saturday, February 06, 2010 3:02 PM
>>> To: Vadlamani, Satish {FLNA}
>>> Cc: r-help@r-project.org
>>> Subject: Re: [R] Reading large files
>>>
>>> Note that you can shorten #1 to read.csv.sql("out.txt") since your other arguments are the default values.
>>>
>>> For the second one, use read.csv.sql, eliminate the arguments that are defaults anyway (they should not cause a problem, but it's error prone) and add an explicit eol= argument, since SQLite can have problems with end of line in some cases.
>>> Also test out your perl script separately from R first to ensure that it works:
>>>
>>> test_df <- read.csv.sql(file="3wkoutstatfcst_small.dat", filter="perl parse_3wkout.pl", eol = "\n")
>>>
>>> SQLite has some known problems with end of line, so try it with and without the eol= argument just in case. When I just made up the following gawk example I noticed that I did need to specify the eol= argument.
>>>
>>> Also, I have added a complete example using gawk as Example 13c on the home page just now:
>>> http://code.google.com/p/sqldf/#Example_13._read.csv.sql_and_read.csv2.sql
>>>
>>>
>>> On Sat, Feb 6, 2010 at 3:52 PM, Vadlamani, Satish {FLNA} <satish.vadlam...@fritolay.com> wrote:
>>>> Gabor:
>>>>
>>>> I had success with the following.
>>>> 1. I created a csv file, called "out.txt", with a perl script. Then I ran the following successfully:
>>>> library("sqldf")
>>>> test_df <- read.csv.sql(file="out.txt", sql = "select * from file", header = TRUE, sep = ",", dbname = tempfile())
>>>>
>>>> 2. I did not have success with the following. Could you tell me what I may be doing wrong? I could paste the perl script if necessary. From the perl script, I am reading the file, creating the csv record and printing each record one by one and then exiting.
>>>>
>>>> Thanks.
>>>>
>>>> Not had success with below:
>>>> #test_df <- read.csv2.sql(file="3wkoutstatfcst_small.dat", sql = "select * from file", header = TRUE, sep = ",", filter="perl parse_3wkout.pl", dbname = tempfile())
>>>> test_df
>>>>
>>>> Error message below:
>>>> test_df <- read.csv2.sql(file="3wkoutstatfcst_small.dat", sql = "select * from file", header = TRUE, sep = ",", filter="perl parse_3wkout.pl", dbname = tempfile())
>>>> Error in readRegistry(key, maxdepth = 3) :
>>>>   Registry key 'SOFTWARE\R-core' not found
>>>> In addition: Warning messages:
>>>> 1: closing unused connection 14 (3wkoutstatfcst_small.dat)
>>>> 2: closing unused connection 13 (3wkoutstatfcst_small.dat)
>>>> 3: closing unused connection 11 (3wkoutstatfcst_small.dat)
>>>> 4: closing unused connection 9 (3wkoutstatfcst_small.dat)
>>>> 5: closing unused connection 3 (3wkoutstatfcst_small.dat)
>>>>> test_df <- read.csv2.sql(file="3wkoutstatfcst_small.dat", sql = "select * from file", header = TRUE, sep = ",", filter="perl parse_3wkout.pl", dbname = tempfile())
>>>> Error in readRegistry(key, maxdepth = 3) :
>>>>   Registry key 'SOFTWARE\R-core' not found
>>>>
>>>> -----Original Message-----
>>>> From: Gabor Grothendieck [mailto:ggrothendi...@gmail.com]
>>>> Sent: Saturday, February 06, 2010 12:14 PM
>>>> To: Vadlamani, Satish {FLNA}
>>>> Cc: r-help@r-project.org
>>>> Subject: Re: [R] Reading large files
>>>>
>>>> No.
>>>>
>>>> On Sat, Feb 6, 2010 at 1:01 PM, Vadlamani, Satish {FLNA} <satish.vadlam...@fritolay.com> wrote:
>>>>> Gabor:
>>>>> Can I pass colClasses as a vector to read.csv.sql? Thanks.
>>>>> Satish
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Gabor Grothendieck [mailto:ggrothendi...@gmail.com]
>>>>> Sent: Saturday, February 06, 2010 9:41 AM
>>>>> To: Vadlamani, Satish {FLNA}
>>>>> Cc: r-help@r-project.org
>>>>> Subject: Re: [R] Reading large files
>>>>>
>>>>> It's just any Windows batch command string that filters stdin to stdout. What the command consists of should not be important. An invocation of perl that runs a perl script that filters stdin to stdout might look like this:
>>>>> read.csv.sql("myfile.dat", filter = "perl myprog.pl")
>>>>>
>>>>> For an actual example, see the source of read.csv2.sql, which defaults to using a Windows vbscript program as a filter.
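One way to check that a candidate filter string really behaves as a stdin-to-stdout filter, before handing it to read.csv.sql, is to run it from R and capture its output. This is only a sketch; myprog.pl and myfile.dat are the placeholder names from the message above, and shell() is Windows-specific.

## Capture the filter's stdout as a character vector and eyeball it.
out <- shell("perl myprog.pl < myfile.dat", intern = TRUE)
head(out, 3)   # should look like plain comma-separated records

If this prints clean CSV lines, the same command string should work as the filter= argument.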
>>>>>
>>>>> On Sat, Feb 6, 2010 at 10:16 AM, Vadlamani, Satish {FLNA} <satish.vadlam...@fritolay.com> wrote:
>>>>>> Jim, Gabor:
>>>>>> Thanks so much for the suggestions where I can use read.csv.sql and embed Perl (or gawk). I just want to mention that I am running on Windows. I am going to read the documentation on the filter argument and see if it can take a decent sized Perl script and then use its output as input.
>>>>>>
>>>>>> Suppose that I write a Perl script that parses this fwf file and creates a CSV file. Can I embed this within the read.csv.sql call? Or can it only be a statement or something? If you know the answer, please let me know. Otherwise, I will try a few things and report back the results.
>>>>>>
>>>>>> Thanks again.
>>>>>> Satish
>>>>>>
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: jim holtman [mailto:jholt...@gmail.com]
>>>>>> Sent: Saturday, February 06, 2010 6:16 AM
>>>>>> To: Gabor Grothendieck
>>>>>> Cc: Vadlamani, Satish {FLNA}; r-help@r-project.org
>>>>>> Subject: Re: [R] Reading large files
>>>>>>
>>>>>> In perl, the 'unpack' command makes it very easy to parse fixed-field data.
>>>>>>
>>>>>> On Fri, Feb 5, 2010 at 9:09 PM, Gabor Grothendieck <ggrothendi...@gmail.com> wrote:
>>>>>>> Note that the filter= argument on read.csv.sql can be used to pass the input through a filter written in perl, [g]awk or another language.
>>>>>>> For example: read.csv.sql(..., filter = "gawk -f myfilter.awk")
>>>>>>>
>>>>>>> gawk has the FIELDWIDTHS variable for automatically parsing fixed width fields, e.g.
>>>>>>> http://www.delorie.com/gnu/docs/gawk/gawk_44.html
>>>>>>> making this very easy, but perl or whatever you are most used to would be fine too.
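To make the gawk route concrete, here is a hedged sketch of a FIELDWIDTHS-based filter. The field widths, the file names and the assumption that the data file has no header row are all made up for illustration; a real script would use the widths of the actual fixed-width layout. Keeping the awk program in its own file avoids cmd.exe quoting problems.

## Contents of a hypothetical fixed2csv.awk:
##   BEGIN { FIELDWIDTHS = "5 10 8"; OFS = "," }
##   { $1 = $1; print }    # touching a field rebuilds the record with OFS

library(sqldf)
test_df <- read.csv.sql("myfile.dat",
                        filter = "gawk -f fixed2csv.awk",
                        header = FALSE,   # assumed: no header line in the data
                        eol = "\n")       # per the eol= advice earlier in the thread

Note that the split fields keep any blank padding from the fixed-width layout; a gsub() inside the awk action could strip it if needed.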
>>>>>>>
>>>>>>> On Fri, Feb 5, 2010 at 8:50 PM, Vadlamani, Satish {FLNA} <satish.vadlam...@fritolay.com> wrote:
>>>>>>>> Hi Gabor:
>>>>>>>> Thanks. My files are all in fixed width format. There are a lot of them. It would take me some effort to convert them to CSV. I guess this cannot be avoided? I can write some Perl scripts to convert fixed width format to CSV format and then start with your suggestion. Could you let me know your thoughts on the approach?
>>>>>>>> Satish
>>>>>>>>
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Gabor Grothendieck [mailto:ggrothendi...@gmail.com]
>>>>>>>> Sent: Friday, February 05, 2010 5:16 PM
>>>>>>>> To: Vadlamani, Satish {FLNA}
>>>>>>>> Cc: r-help@r-project.org
>>>>>>>> Subject: Re: [R] Reading large files
>>>>>>>>
>>>>>>>> If your problem is just how long it takes to load the file into R, try read.csv.sql in the sqldf package. A single read.csv.sql call can create an SQLite database and table layout for you, read the file into the database (without going through R, so R can't slow this down), extract all or a portion into R based on the sql argument you give it, and then remove the database. See the examples on the home page:
>>>>>>>> http://code.google.com/p/sqldf/#Example_13._read.csv.sql_and_read.csv2.sql
>>>>>>>>
>>>>>>>> On Fri, Feb 5, 2010 at 2:11 PM, Satish Vadlamani <satish.vadlam...@fritolay.com> wrote:
>>>>>>>>>
>>>>>>>>> Matthew:
>>>>>>>>> If it is going to help, here is the explanation. I have an end state in mind. It is given below under the "End State" header. In order to get there, I need to start somewhere, right? I started with an 850 MB file and could not load it in what I think is a reasonable time (I waited for an hour).
>>>>>>>>>
>>>>>>>>> There are references to 64 bit. How will that help? It is a 4GB RAM machine and there is no paging activity when loading the 850 MB file.
>>>>>>>>>
>>>>>>>>> I have seen other threads on the same types of questions. I did not see any clear cut answers or errors that I could have been making in the process. If I am missing something, please let me know. Thanks.
>>>>>>>>> Satish
>>>>>>>>>
>>>>>>>>> End State
>>>>>>>>>> Satish wrote: "at one time I will need to load say 15GB into R"
>>>>>>>>>
>>>>>>>>> -----
>>>>>>>>> Satish Vadlamani
>>>>>>
>>>>>> --
>>>>>> Jim Holtman
>>>>>> Cincinnati, OH
>>>>>> +1 513 646 9390
>>>>>>
>>>>>> What is the problem that you are trying to solve?