OK, here is a stripped-down variant of my code.  I can run it here
unchanged (apart from the credentials for connecting to my DB).

Sys.setenv(MYSQL_HOME = 'C:/Program Files/MySQL/MySQL Server 5.0')
library(TSMySQL)
library(plyr)
library(fitdistrplus)

# One row per sale: the year/week of the sale, and the elapsed time
# between sale and return (offset by 0.0001 to keep it positive).
con <- dbConnect(MySQL(), user = "rejbyers", password = "jesakos",
                 dbname = "merchants2")
x <- sprintf("SELECT m_id, sale_date, YEAR(sale_date) AS sale_year,
  WEEK(sale_date) AS sale_week, return_type,
  0.0001 + DATEDIFF(return_date, sale_date) AS elapsed_time
  FROM `risk_input`
  WHERE DATEDIFF(return_date, sale_date) IS NOT NULL")
x                     # echo the SQL for a visual check
moreinfo <- dbGetQuery(con, x)
str(moreinfo)
dbDisconnect(con)

# Fit an exponential to the aggregate data ...
f1 <- fitdist(moreinfo$elapsed_time, "exp")
summary(f1)

# ... and then to each (m_id, sale_year, sale_week) subsample.
lapply(split(moreinfo,
             list(moreinfo$m_id, moreinfo$sale_year, moreinfo$sale_week),
             drop = TRUE),
       function(df) fitdist(df$elapsed_time, "exp"))

I guess that for others to run this script, it is just necessary to create
some sample data, consisting of two or more m_id values (I have several
hundred) and temporally ordered data for each.  I am not familiar enough
with R to know how to do that in R.  Usually, if I need dummy data, I make
it with my favourite RNG in either C++ or Perl; I am still getting used to
R.
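My best guess at doing it in R, untested, would be something like the
following; the merchant ids, date range, and exponential rate are all made
up for illustration:

set.seed(42)                       # reproducible dummy data
n <- 5000
d <- as.POSIXct("2008-01-01") + runif(n, 0, 365 * 24 * 3600)
moreinfo <- data.frame(
  m_id         = sample(c(171, 206, 218), n, replace = TRUE),
  sale_date    = format(d, "%Y-%m-%d %H:%M:%S"),  # chr, like the query result
  sale_year    = as.integer(format(d, "%Y")),
  sale_week    = as.integer(format(d, "%U")),
  return_type  = 1,
  elapsed_time = 0.0001 + rexp(n, rate = 0.065),  # the random variate
  stringsAsFactors = FALSE
)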

Each record in my data has one random variate and a MySQL TIMESTAMP
(nn-nn-nnnn nn:nn:nn); there are anywhere from hundreds to thousands of
records each week, over anywhere from a few months to several years.  My
SQL actually produces the random variate by taking the difference between
the sale date and the return date, and it is structured as it is because I
know how to group by year and week from a timestamp field using SQL but
didn't know how to accomplish the same thing in R.
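(I gather the same grouping columns could be derived on the R side along
these lines, assuming the sale_date strings parse with as.POSIXct; note
that %U counts weeks starting on Sunday, which is close to, but not
guaranteed identical to, MySQL's default WEEK() at year boundaries:

ts <- as.POSIXct(moreinfo$sale_date)                # parse the TIMESTAMPs
moreinfo$sale_year <- as.integer(format(ts, "%Y"))  # calendar year
moreinfo$sale_week <- as.integer(format(ts, "%U"))  # week of year, 00-53
)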

The statement 'x' by itself always shows me the correct SQL statement to
get the data (I can execute it unchanged in the mysql command-line client),
and 'str(moreinfo)' always gives me the data structure I expect.  E.g.:

> str(moreinfo)
'data.frame':   177837 obs. of  6 variables:
 $ m_id        : num  171 206 206 206 206 206 206 218 224 224 ...
 $ sale_date   : chr  "2008-04-25 07:41:09" "2008-05-09 20:58:12" "2008-09-06 19:51:52" "2008-05-01 21:26:40" ...
 $ sale_year   : int  2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 ...
 $ sale_week   : int  16 18 35 17 31 21 19 52 44 35 ...
 $ return_type : num  1 1 1 1 1 1 1 1 1 1 ...
 $ elapsed_time: num  0.0001 0.0001 3.0001 4.0001 21.0001 ...

'summary(f1)' shows me the results I expect from the aggregate data.  E.g.:

> summary(f1)
FITTING OF THE DISTRIBUTION ' exp ' BY MAXIMUM LIKELIHOOD
PARAMETERS
      estimate   Std. Error
rate 0.0652917 0.0001547907
Loglikelihood:  -663134.7   AIC:  1326271   BIC:  1326281
------
GOODNESS-OF-FIT STATISTICS

_____________ Chi-squared_____________
Chi-squared statistic:  400277239
Degree of freedom of the Chi-squared distribution:  56
Chi-squared p-value:  0
!!! the p-value may be wrong with some theoretical counts < 5 !!!

!!! For continuous distributions, Kolmogorov-Smirnov and
      Anderson-Darling statistics should be preferred !!!

_____________ Kolmogorov-Smirnov_____________
Kolmogorov-Smirnov statistic:  0.1660987
Kolmogorov-Smirnov test:  rejected
!!! The result of this test may be too conservative as it
     assumes that the distribution parameters are known !!!

_____________ Anderson-Darling_____________
Anderson-Darling statistic:  Inf
Anderson-Darling test:  rejected


And at the end, I get the error I mentioned.  NB: in this variant, I added
drop = TRUE, as Jim suggested.

> lapply(split(all_samples,
+              list(all_samples$m_id, all_samples$sale_year,
+                   all_samples$sale_week),
+              drop = TRUE),
+        function(df) fitdist(df$elapsed_time, "exp"))
Error in fitdist(df$elapsed_time, "exp") :
  data must be a numeric vector of length greater than 1

If "drop = TRUE" results in all empty combinations of m_id, year and week
being excluded, then (noticing that the requirement is actually that the
sample size be greater than 1) I can only conclude that at least one of the
samples has only one record.
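A quick, untested way to confirm that would be to count the rows in each
group before fitting:

sizes <- sapply(split(moreinfo,
                      list(moreinfo$m_id, moreinfo$sale_year,
                           moreinfo$sale_week),
                      drop = TRUE),
                nrow)
head(sort(sizes), 10)   # the smallest groups; any 1 explains the error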

But that is too small a sample.  Is there a way to make the above code
apply fitdist only if the sample size of a given subsample is greater than,
say, 100?
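Would something along these lines be the idiomatic way?  (The 100-row
cutoff is purely illustrative.)

groups <- split(moreinfo,
                list(moreinfo$m_id, moreinfo$sale_year, moreinfo$sale_week),
                drop = TRUE)
# keep only the subsamples large enough to be worth fitting
fits <- lapply(Filter(function(df) nrow(df) > 100, groups),
               function(df) fitdist(df$elapsed_time, "exp"))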

Even better, is there a way to make the split more dynamic, so that it
groups a given m_id's data by month if the average weekly subsample size is
less than 100, or by day if the average weekly subsample size is greater
than 1000?
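My untested guess, again assuming sale_date parses with as.POSIXct (the
fit_by_mid name, date formats, and thresholds are all just illustrative):

fit_by_mid <- function(df) {
  ts <- as.POSIXct(df$sale_date)
  # average rows per calendar week for this m_id
  weekly <- nrow(df) / length(unique(format(ts, "%Y-%U")))
  fmt <- if (weekly < 100) "%Y-%m"            # sparse: group by month
         else if (weekly > 1000) "%Y-%m-%d"   # dense: group by day
         else "%Y-%U"                         # otherwise: group by week
  lapply(Filter(function(d) nrow(d) > 1, split(df, format(ts, fmt))),
         function(d) fitdist(d$elapsed_time, "exp"))
}
results <- lapply(split(moreinfo, moreinfo$m_id), fit_by_mid)

Is there a cleaner way?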

Thanks

Ted


On Mon, Jul 12, 2010 at 3:20 PM, Erik Iverson <er...@ccbr.umn.edu> wrote:

> Your code is not reproducible.  Can you come up with a small example
> showing the crux of your data structures/problem, that we can all run in
> our R sessions?  You're likely to get much higher quality responses this
> way.
>
> Ted Byers wrote:
>
>> From the documentation I have found, it seems that one of the functions
>> from package plyr, or a combination of functions like split and lapply,
>> would allow me to have a really short R script to analyze all my data (I
>> have reduced it to a couple hundred thousand records with about half a
>> dozen fields each).
>>
>> I get the same result from ddply and split/lapply:
>>
>>> ddply(moreinfo, c("m_id","sale_year","sale_week"),
>>> +       function(df) data.frame(res = fitdist(df$elapsed_time,"exp"),
>>> +                               est = res$estimate, sd = res$sd))
>>> Error in fitdist(df$elapsed_time, "exp") :
>>>   data must be a numeric vector of length greater than 1
>> and
>>
>>> lapply(split(moreinfo,
>>> +              list(moreinfo$m_id, moreinfo$sale_year, moreinfo$sale_week)),
>>> +        function(df) fitdist(df$elapsed_time,"exp"))
>>> Error in fitdist(df$elapsed_time, "exp") :
>>>   data must be a numeric vector of length greater than 1
>> Now, in retrospect, unless I misunderstood the properties of a
>> data.frame, I suppose a data.frame might not have been entirely
>> appropriate, as the m_id samples start and end on very different dates,
>> but I would have thought a list data structure should have been able to
>> handle that.  It would seem that split is making groups that have the
>> same start and end dates (or that, if, for example, I have sale data for
>> precisely the last year, split would insist on both 2009 and 2010 having
>> weeks from 0 through 52, instead of just the weeks in each year that
>> actually have data: 26 through 52 for last year and 1 through 25 for
>> this year).  I don't see how else the data passed to fitdist could have
>> a sample size of 0.
>>
>> I'd appreciate understanding how to resolve this.  However, it isn't a
>> show stopper, as it now seems trivial to just break it out into a loop
>> (followed by an lapply/split combo using only sale year and sale month).
>>
>> While I am asking, is there a better way to split such temporally
>> ordered data into weekly samples that respect the year in which the
>> sample is taken as well as the week in which it is taken?
>>
>> Thanks
>>
>> Ted
>>
