Thank you for your help. Yes, my problem is one of non-response. We try to hand a survey form to everyone who boards at each stop, but we're getting only ~10% usable responses. One reason is that the "full survey" is long and requires geo-locating two points: Trip Origin and Destination.
My hope is to perform a second survey to establish the temporal distribution. Unfortunately, it appears that this will need to be nearly as extensive (expensive) as the original survey. If this (raking "time on board") can be shown to work, we can then generalize the process to other variables.

Thank you.

PS Why is 'ByEBOn' a list and not a DataFrame?

> OnLabels <- c( "Warner Center", "De Soto", "Pierce College", "Tampa", "Reseda", "Balboa", "Woodley", "Sepulveda", "Van Nuys", "Woodman", "Valley College", "Laurel Canyon", "North Hollywood")
> EBOnNewTots <- c( 1000, 600, 1200, 500, 1000, 500, 200, 250, 1000, 300, 100, 50, 73.65 )
> EBNumStn <- c(673.65, 800, 1000, 1000, 800, 700, 600, 500, 400, 200, 50, 50 )
> ByEBOn <- data.frame(OnLabels,EBOnNewTots)
> ByEBNum <- data.frame(c(1:12),EBNumStn)
> RakedEBSurvey <- rake(EBDesign, list(~ByEBOn, ~ByEBNum), list(EBOnNewTots, EBNumStn ) )
Error in model.frame.default(margin, data = design$variables) :
  invalid type (list) for variable 'ByEBOn'
>

Robert Farley
Metro
www.Metro.net

-----Original Message-----
From: Stas Kolenikov [mailto:[EMAIL PROTECTED]
Sent: Wednesday, August 20, 2008 07:13
To: Farley, Robert
Cc: r-help@r-project.org
Subject: Re: [R] Survey Design / Rake questions

On Mon, Aug 18, 2008 at 6:18 PM, Farley, Robert <[EMAIL PROTECTED]> wrote:
> My motivation is to try to correct for a "time on board" bias we see in
> our surveys. Not surprisingly, riders who are only on board a short
> time don't attempt/finish our survey forms. We're able to weight our
> survey to the "bus stop-on by bus run" level.

So is it the problem of catching the short rides in your sample, or the problem of having those short rides complete the survey? If the former, then all you have to do is to weight by inverse probability of selection (Horvitz-Thompson estimator). This probability is probably roughly proportional to time on bus, which in turn might be proportional to the number of stops in their ride.
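
As a rough illustration of the inverse-probability weighting mentioned above (the "former" case), here is a minimal R sketch. It is not taken from the thread: the data frame `riders`, its columns `stops_ridden` and `answer`, and all the numbers are made up, and the selection probability is assumed to be exactly proportional to the number of stops ridden, which the message only offers as a rough guess.

```r
# Hypothetical sample: stops ridden and one numeric survey answer per rider.
# These values are invented purely for illustration.
riders <- data.frame(
  stops_ridden = c(1, 2, 4, 8),
  answer       = c(10, 20, 40, 80)
)

# Assumption: probability of selection is proportional to stops ridden.
# The constant of proportionality cancels in a weighted mean, so any
# positive scaling gives the same estimate.
p_select <- riders$stops_ridden / max(riders$stops_ridden)

# Inverse-probability (Horvitz-Thompson style) weights.
riders$w <- 1 / p_select

# Compare the naive mean with the weighted mean.
unweighted_mean <- mean(riders$answer)
weighted_mean   <- weighted.mean(riders$answer, riders$w)
```

Short rides get the largest weights, so the weighted mean is pulled toward their answers and away from the long rides that the sample over-represents.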
You may not need any raking for that, just do some algebra computing those probabilities of selection.

If the latter is the problem, then it is the problem of non-response. If you think that the only thing that matters in whether a person chooses to respond or not is the length of the ride, then your data are "missing at random" (MAR), one of several standard concepts in missing-data statistics (http://www.citeulike.org/user/ctacmo/article/553290). You can get around that -- in survey statistics, that is again done with weights. Here, you would need to boost the weight by the inverse of the fraction of those who did complete the survey.

In a more difficult situation, your response probability might depend on other factors, say demographics of the passengers, time of the day, etc. I would imagine you would still have MAR data, unless you have some weird questions like "Do you carry firearms on the bus?", which the people who did have guns at the time of their ride would probably decline to answer, making the data informatively missing, i.e. not missing at random (NMAR).

-- 
Stas Kolenikov, also found at http://stas.kolenikov.name
Small print: I use this email account for mailing lists only.

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
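
The non-response adjustment described above (boosting each weight by the inverse of the fraction who completed the survey) might look like the following in R. This is only a sketch under assumptions: the ride-length classes, the counts of forms handed out and returned, and every column name are invented, and response is assumed to depend on ride length alone (the MAR case).

```r
# Hypothetical counts per ride-length class: forms handed out versus usable
# forms returned. None of these numbers come from the actual survey.
classes <- data.frame(
  ride_length = c("short", "medium", "long"),
  handed_out  = c(500, 300, 200),
  returned    = c(20, 30, 50),
  stringsAsFactors = FALSE
)

# Response fraction within each class, and its inverse as the adjustment.
classes$resp_rate <- classes$returned / classes$handed_out
classes$adj       <- 1 / classes$resp_rate

# Hypothetical respondent file, each record starting with a base weight of 1.
resp <- data.frame(
  ride_length = rep(classes$ride_length, times = classes$returned),
  base_w      = 1,
  stringsAsFactors = FALSE
)

# Look up each respondent's class and apply the non-response adjustment.
resp$w <- resp$base_w * classes$adj[match(resp$ride_length, classes$ride_length)]

# The adjusted weights should add back up to the forms handed out per class.
adjusted_totals <- tapply(resp$w, resp$ride_length, sum)
```

With these made-up counts, each short-ride respondent stands in for 25 riders and each long-ride respondent for only 4, which is the "boost" the message refers to.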
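
On the PS and the `rake()` error earlier in the thread: a data frame is stored as a list underneath, so `~ByEBOn` asks `model.frame()` to treat a whole data frame as one variable, which is likely what "invalid type (list)" is reporting. As far as the `survey` package documentation describes it, `sample.margins` should be formulas naming variables that exist inside the design, and `population.margins` should be data frames (or tables) holding those same variables plus a `Freq` column. The sketch below assumes hypothetical design variables `OnStop` and `NumStn` and an invented respondent file, because the contents of `EBDesign` are not shown in the thread.

```r
library(survey)

OnLabels <- c("Warner Center", "De Soto", "Pierce College", "Tampa", "Reseda",
              "Balboa", "Woodley", "Sepulveda", "Van Nuys", "Woodman",
              "Valley College", "Laurel Canyon", "North Hollywood")
EBOnNewTots <- c(1000, 600, 1200, 500, 1000, 500, 200, 250, 1000, 300, 100, 50, 73.65)
EBNumStn    <- c(673.65, 800, 1000, 1000, 800, 700, 600, 500, 400, 200, 50, 50)

# A data frame is a list underneath, hence the wording of the error.
is.list(data.frame(OnLabels, EBOnNewTots))  # TRUE

# Population margins: the variable name must match a column in the design,
# and the control totals go in a column called Freq.
# OnStop and NumStn are assumed names, not taken from EBDesign.
pop.on  <- data.frame(OnStop = factor(OnLabels, levels = OnLabels),
                      Freq   = EBOnNewTots)
pop.num <- data.frame(NumStn = factor(1:12), Freq = EBNumStn)

# Invented respondent file: one record for every stop-by-length combination,
# so that no margin category is empty in the sample.
resp <- expand.grid(OnStop = pop.on$OnStop, NumStn = pop.num$NumStn)
resp$w0 <- 1  # starting weight

des <- svydesign(ids = ~1, weights = ~w0, data = resp)

# Margins are given as formulas on design variables, totals as data frames.
raked <- rake(des,
              sample.margins     = list(~OnStop, ~NumStn),
              population.margins = list(pop.on, pop.num))

# Weighted margins after raking, for comparison with the control totals.
on_totals  <- as.numeric(svytable(~OnStop, raked))
num_totals <- as.numeric(svytable(~NumStn, raked))
```

With a real respondent file, every category that has a non-zero control total needs at least one respondent, and the factor levels in the design have to match those in the margin data frames.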