OK, here is a stripped-down variant of my code. I can run it here unchanged (apart from the credentials for connecting to my DB).
Sys.setenv(MYSQL_HOME='C:/Program Files/MySQL/MySQL Server 5.0')
library(TSMySQL)
library(plyr)
library(fitdistrplus)
con <- dbConnect(MySQL(), user="rejbyers", password="jesakos",
                 dbname="merchants2")
# The random variate is the elapsed time between sale and return,
# grouped by year and week in SQL.
x <- sprintf("SELECT m_id, sale_date, YEAR(sale_date) AS sale_year,
              WEEK(sale_date) AS sale_week, return_type,
              0.0001 + DATEDIFF(return_date, sale_date) AS elapsed_time
              FROM `risk_input`
              WHERE DATEDIFF(return_date, sale_date) IS NOT NULL")
x            # show the SQL that will be sent
moreinfo <- dbGetQuery(con, x)
str(moreinfo)
#moreinfo
#print(moreinfo)
dbDisconnect(con)
f1 <- fitdist(moreinfo$elapsed_time, "exp")
summary(f1)
lapply(split(moreinfo,
             list(moreinfo$m_id, moreinfo$sale_year, moreinfo$sale_week),
             drop = TRUE),
       function(df) fitdist(df$elapsed_time, "exp"))

I guess that for others to run this script, it would just be necessary to create some sample data, consisting of two or more m_id values (I have several hundred) and temporally ordered data for each. I am not familiar enough with R to know how to do that in R; usually, if I need dummy data, I make it with my favourite RNG using either C++ or Perl. I am still getting used to R. (I have sketched an attempt below, just before the error output.) Each record in my data has one random variate and a MySQL TIMESTAMP (nnnn-nn-nn nn:nn:nn), anywhere from hundreds to thousands of records each week, for anywhere from a few months to several years. My SQL actually produces the random variate by taking the difference between the sale date and the return date; it is structured as it is because I know how to group by year and week from a timestamp field using SQL, but didn't know how to accomplish the same thing in R.

The statement 'x' by itself always shows me the correct SQL statement for getting the data (I can execute it unchanged in the mysql command-line client), and 'str(moreinfo)' always gives me the data structure I expect. E.g.:

> str(moreinfo)
'data.frame':   177837 obs. of  6 variables:
 $ m_id        : num  171 206 206 206 206 206 206 218 224 224 ...
 $ sale_date   : chr  "2008-04-25 07:41:09" "2008-05-09 20:58:12" "2008-09-06 19:51:52" "2008-05-01 21:26:40" ...
 $ sale_year   : int  2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 ...
 $ sale_week   : int  16 18 35 17 31 21 19 52 44 35 ...
 $ return_type : num  1 1 1 1 1 1 1 1 1 1 ...
 $ elapsed_time: num  0.0001 0.0001 3.0001 4.0001 21.0001 ...

'summary(f1)' shows me the results I expect from the aggregate data. E.g.:

> summary(f1)
FITTING OF THE DISTRIBUTION ' exp ' BY MAXIMUM LIKELIHOOD
PARAMETERS
      estimate   Std. Error
rate 0.0652917 0.0001547907
Loglikelihood: -663134.7   AIC: 1326271   BIC: 1326281
------
GOODNESS-OF-FIT STATISTICS
_____________ Chi-squared _____________
Chi-squared statistic: 400277239
Degree of freedom of the Chi-squared distribution: 56
Chi-squared p-value: 0
!!! the p-value may be wrong with some theoretical counts < 5 !!!
!!! For continuous distributions, Kolmogorov-Smirnov and Anderson-Darling statistics should be preferred !!!
_____________ Kolmogorov-Smirnov _____________
Kolmogorov-Smirnov statistic: 0.1660987
Kolmogorov-Smirnov test: rejected
!!! The result of this test may be too conservative as it assumes that the distribution parameters are known !!!
_____________ Anderson-Darling _____________
Anderson-Darling statistic: Inf
Anderson-Darling test: rejected

NB: In this variant, I added drop = TRUE as Jim suggested.
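As for the sample data question above, here is my attempt at generating dummy data of the same shape entirely in R. This is only a sketch: the m_id values, the date range, and the exponential rate are invented for illustration, and I am guessing that strftime's %U week numbering is close enough to MySQL's default WEEK().

# Sketch of dummy data shaped like my query result (all values invented).
set.seed(42)
n <- 5000
when <- as.POSIXct("2008-01-01 00:00:00", tz = "UTC") +
  runif(n, 0, 365 * 24 * 60 * 60)               # random timestamps over one year
moreinfo <- data.frame(
  m_id         = sample(c(171, 206, 218, 224), n, replace = TRUE),
  sale_date    = format(when, "%Y-%m-%d %H:%M:%S"),
  sale_year    = as.integer(format(when, "%Y")),
  sale_week    = as.integer(format(when, "%U")), # Sunday-first week of year
  return_type  = 1,
  elapsed_time = 0.0001 + rexp(n, rate = 0.065), # rate close to my fitted one
  stringsAsFactors = FALSE
)
str(moreinfo)

If that is roughly right, then the same format() calls would presumably also be the way to derive the year and week grouping columns directly in R rather than in SQL.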
And at the end, I get the error I mentioned:

> lapply(split(all_samples,
+              list(all_samples$m_id, all_samples$sale_year, all_samples$sale_week),
+              drop = TRUE),
+        function(df) fitdist(df$elapsed_time, "exp"))
Error in fitdist(df$elapsed_time, "exp") :
  data must be a numeric vector of length greater than 1

If "drop = TRUE" results in all empty combinations of m_id, year, and week being excluded, then (noticing that the requirement is actually that the sample size be greater than 1) I can only conclude that at least one of the subsamples has only one record. But that is too small. Is there a way to allow the above code to apply fitdist only if the sample size of a given subsample is greater than, say, 100? Even better, is there a way to make the split more dynamic, so that it groups a given m_id's data by month if the average weekly subsample size is less than 100, or by day if the average weekly subsample size is greater than 1000?
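To make that concrete, here is the sort of thing I have in mind. I don't know whether it is idiomatic R; base R's Filter() is my guess at how to drop the undersized groups, and the thresholds are the ones mentioned above.

# Sketch: fit only the subsamples whose size exceeds a minimum (here 100).
groups <- split(moreinfo,
                list(moreinfo$m_id, moreinfo$sale_year, moreinfo$sale_week),
                drop = TRUE)
big  <- Filter(function(df) nrow(df) > 100, groups)
fits <- lapply(big, function(df) fitdist(df$elapsed_time, "exp"))

# Sketch of the "dynamic" variant: per m_id, choose month, day, or week
# bins from the average weekly subsample size, then fit each bin.
fits2 <- lapply(split(moreinfo, moreinfo$m_id), function(d) {
  when <- as.POSIXct(d$sale_date)
  avg  <- nrow(d) / length(unique(format(when, "%Y-%U")))   # mean weekly size
  fmt  <- if (avg < 100) "%Y-%m" else if (avg > 1000) "%Y-%m-%d" else "%Y-%U"
  bins <- Filter(function(df) nrow(df) > 1, split(d, format(when, fmt)))
  lapply(bins, function(df) fitdist(df$elapsed_time, "exp"))
})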
Thanks,

Ted

On Mon, Jul 12, 2010 at 3:20 PM, Erik Iverson <er...@ccbr.umn.edu> wrote:

> Your code is not reproducible. Can you come up with a small example
> showing the crux of your data structures/problem, that we can all run in
> our R sessions? You're likely to get much higher quality responses this
> way.
>
> Ted Byers wrote:
>
>> From the documentation I have found, it seems that one of the functions
>> from package plyr, or a combination of functions like split and lapply,
>> would allow me to have a really short R script to analyze all my data
>> (I have reduced it to a couple hundred thousand records with about half
>> a dozen fields).
>>
>> I get the same result from ddply and split/lapply:
>>
>>> ddply(moreinfo, c("m_id", "sale_year", "sale_week"),
>>> +   function(df) data.frame(res = fitdist(df$elapsed_time, "exp"),
>>> +                           est = res$estimate, sd = res$sd))
>>> Error in fitdist(df$elapsed_time, "exp") :
>>>   data must be a numeric vector of length greater than 1
>>
>> and
>>
>>> lapply(split(moreinfo,
>>> +            list(moreinfo$m_id, moreinfo$sale_year, moreinfo$sale_week)),
>>> +   function(df) fitdist(df$elapsed_time, "exp"))
>>> Error in fitdist(df$elapsed_time, "exp") :
>>>   data must be a numeric vector of length greater than 1
>>
>> Now, in retrospect, unless I misunderstood the properties of a
>> data.frame, I suppose a data.frame might not have been entirely
>> appropriate, as the m_id samples start and end on very different dates,
>> but I would have thought a list data structure should have been able to
>> handle that. It would seem that split is making groups that have the
>> same start and end dates (or that if, for example, I have sale data for
>> precisely the last year, split would insist on both 2009 and 2010 having
>> weeks from 0 through 52, instead of just the weeks in each year that
>> actually have data: 26 through 52 for last year and 1 through 25 for
>> this year). I don't see how else the data passed to fitdist could have
>> a sample size of 0.
>>
>> I'd appreciate understanding how to resolve this. However, it isn't a
>> show stopper, as it now seems trivial to just break it out into a loop
>> (followed by a lapply/split combo using only sale year and sale month).
>>
>> While I am asking, is there a better way to split such temporally
>> ordered data into weekly samples that respects the year in which the
>> sample is taken as well as the week in which it is taken?
>>
>> Thanks
>>
>> Ted
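P.S. On the last question in my quoted message above (weekly samples that respect the year): would building a single year-week key in R, instead of separate year and week columns, be a reasonable way to do it? A sketch of what I mean (the size filter above would still be needed for tiny groups):

# Sketch: a single "YYYY-WW" key, so each year only ever contributes the
# weeks that actually occur in the data.
moreinfo$yw <- format(as.POSIXct(moreinfo$sale_date), "%Y-%U")
fits <- lapply(split(moreinfo, list(moreinfo$m_id, moreinfo$yw), drop = TRUE),
               function(df) fitdist(df$elapsed_time, "exp"))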