on 07/31/2008 04:29 PM Alan Cox wrote:
Hello.  I am hoping someone will be willing to help me understand
something about hazard plots created with muhaz(...).  I have some
background in statistics (minor in grad school), but I haven't been
able to figure out one thing about hazard plots.  I am using hazard plots
to track customer cancellations.  I figure I can treat a cancellation
as a "death", and if someone is still a customer today, they're right
censored.  I know that a hazard plot shows the probability that
someone will cancel in month n given that they're a customer in
month n-1.


If a customer signs up on January 1st and cancels on January 2nd,
we've had what I thought was an intellectual but pointless debate
about whether we count that as being a customer for 1 month or 0
months.  I thought the two plots would be identical, except for a
different X axis.


However, when I create the two plots, they are very different ...
very, very different.  I've posted the two plots to Flickr:


http://flickr.com/photos/alancox/2720915878/in/photostream/ shows the
plot where the lifetime of a customer who signs up on Jan 1 and
cancels on Jan 2 is 0.

http://flickr.com/photos/alancox/2720915904/in/photostream/ shows the
plot where the lifetime of a customer who signs up on Jan 1 and
cancels on Jan 2 is 1.

My question is: Why are these two so different?  How do I know which
is right?

The call that I'm making to produce the model is:

hazardV08 <- muhaz(nmc,s,max.time=max(nmc))


I suspect that there is more here than meets the eye.

Lacking your data and the actual code that you are using to generate the two different curves, this could be anything from the way in which you have coded/collapsed/truncated the event intervals, to the way in which muhaz() is fitting the smoothed curve to each of the two data sets.

The "correct" way to track the intervals would be to use a resolution of days, which could be transformed into months and fractions thereof (e.g. by dividing days by 30.44, the average number of days in a month) if you prefer. The day of sign-up would be Day 0 and each subsequent calendar day would increment the interval by one day.

So based upon your example above (sign-up on Jan 1, cancel on Jan 2), the customer would have an "event" on day 1, or at 0.03285151 months (1 / 30.44).
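
As a minimal sketch of that arithmetic in R (the variable names signup and cancel are made up for illustration, and 30.44 is simply the average month length mentioned above):

```r
# Hypothetical sign-up and cancellation dates for one customer
signup <- as.Date("2008-01-01")
cancel <- as.Date("2008-01-02")

# Sign-up day is Day 0; each later calendar day adds one to the interval
days <- as.numeric(cancel - signup)

# Convert days to months using the average month length
months <- days / 30.44

days
months
```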

All of your censored observations (clients that have not yet canceled) should have their intervals measured from their own Time 0 (sign-up day) to whatever date you are using as your end point. I am guessing that you might have some form of paid membership, such that as long as the customer is paying, they are considered active, as opposed to a customer who simply stops doing business with you without your knowing it. If the customer is paying some type of monthly fee, for example, then you should really censor them at their last payment date, not today's date, since the last payment date is the last time you know that they were still a paying client.
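
A rough sketch of how the time and status columns might be built under that rule follows. The data frame cust and its column names (signup, cancel, last_payment) are assumptions for illustration only, not your actual data layout:

```r
# Hypothetical customer table; cancel is NA for customers who have not cancelled
cust <- data.frame(
  signup       = as.Date(c("2008-01-01", "2008-02-15", "2008-03-10")),
  cancel       = as.Date(c("2008-01-02", NA, "2008-06-01")),
  last_payment = as.Date(c("2008-01-01", "2008-07-15", "2008-05-10"))
)

# Event indicator: 1 = cancelled, 0 = censored (still a customer)
cust$status <- as.integer(!is.na(cust$cancel))

# End of follow-up: cancellation date if known, otherwise last payment date
end_date <- cust$cancel
end_date[is.na(end_date)] <- cust$last_payment[is.na(end_date)]

# Interval in days from each customer's own Time 0 (sign-up day)
cust$days <- as.numeric(end_date - cust$signup)

cust
```

The days and status columns could then be passed to muhaz() or Surv() in place of nmc and s.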

This would be akin to a patient coming in for a follow-up contact, at which point you know they are still alive. Once they leave the office, you don't know if they are alive until the next actual contact, as they might be hit by a car walking to the parking lot.

Based upon your comments above, you appear to have information on a daily basis. If you are collapsing time into integer months, you are losing information. The kernel-based approach used by muhaz(), as I understand it, is highly sensitive to small datasets and to the granularity of the data, among other things.
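
To illustrate the loss of information, here is a small sketch using made-up day counts (not your data). It compares exact fractional months with the two integer conventions you describe, rounding down (Jan 1 to Jan 2 is 0 months) and rounding up (Jan 1 to Jan 2 is 1 month):

```r
# Made-up intervals in days for seven hypothetical customers
days <- c(1, 12, 29, 31, 45, 62, 200)

months_exact <- days / 30.44            # fractional months, nothing lost
months_floor <- floor(months_exact)     # Jan 1 -> Jan 2 counts as 0 months
months_ceil  <- ceiling(months_exact)   # Jan 1 -> Jan 2 counts as 1 month

# Distinct time points before and after collapsing to integer months
length(unique(months_exact))
length(unique(months_floor))

# Customers who end up with a lifetime of exactly 0 under rounding down
sum(months_floor == 0)
```

With integer months, several distinct lifetimes become tied, and under the rounding-down convention some of them sit exactly at time 0, the boundary of the time axis. Whether that accounts for the difference between your two plots I cannot say without the data, but it is the kind of thing a smoothed estimate could react to.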

You might want to review the online complement to MASS4 by Venables and Ripley here:

  http://www.stats.ox.ac.uk/pub/MASS4/#Complements

and review the section on survival analysis, which covers smoothing functions for survival.

You might also want to simply consider using a standard Kaplan-Meier non-parametric estimator using survfit() in the survival package. The function calls for your data should be something like the following (current versions of survfit() require a formula, hence the ~ 1):

  library(survival)

  summary(survfit(Surv(nmc, s) ~ 1))

and

  plot(survfit(Surv(nmc, s) ~ 1))
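
For a version that can be run without your data, here is a minimal sketch. The values of nmc and s are invented for illustration; s is assumed to be coded 1 for a cancellation and 0 for a censored customer:

```r
library(survival)

# Made-up data: time in months and event indicator (1 = cancelled, 0 = censored)
nmc <- c(0.03, 0.5, 1.2, 2.0, 2.0, 3.5, 4.1, 6.0, 6.0, 7.3)
s   <- c(1,    1,   0,   1,   1,   0,   1,   0,   1,   0)

# The formula with ~ 1 fits a single overall Kaplan-Meier curve
fit <- survfit(Surv(nmc, s) ~ 1)

# Table of event times, numbers at risk and survival estimates
summary(fit)

# Step plot of the estimated proportion still subscribed over time
plot(fit, xlab = "Months since sign-up", ylab = "Proportion still customers")
```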


HTH,

Marc Schwartz
