On Fri, September 1, 2006 22:14, Gregory Stark wrote:
> I think the slow part is trying to figure out whether to count the current
> call as a hit or a miss. How do you determine whether the plan you're
> running is the best plan without replanning the query?
The question of knowing which plan is best _based on what's in the actual tables_ would remain just as unsolved as it always was. The scheme addresses only the opportunity to optimize for pseudo-constant parameters, and it treats the existing planner as a black box. If you find a solution to the problem of inaccurate statistics, it'll probably be more or less orthogonal to what I'm describing: you could have one or the other, but combining them shouldn't be much harder.

I don't think telling hits from misses would be all that hard. Let's say a prepared statement of yours is being invoked, and you're evaluating a candidate plan. Each parameter is in one of two sets: those "predicted" by the plan to have certain values (let's call them P), and those "not predicted" by the plan because their confidence counters were below the threshold (I'm tempted to call this set NP, but let's make it Q instead). Whether a parameter is in P or in Q can be derived from its confidence counter; in my previous example, you just take its most-significant bit.

* For any parameter in P, if the actual value does not match the plan's prediction, you have a miss. You can't use this plan. Use another if you have one that applies (such as your regular old non-specialized plan--that always applies), or if not, write a new one!

If you get through this without finding a mismatch, congratulations: you have a hit. The plan you're looking at is applicable to your call. But now we see if we can do better:

* For any parameter in Q, if its value would have been predicted correctly but its counter was below the confidence threshold, you increment the counter. If that lifts the counter above the threshold, you have room for improving on this plan: there's a good chance you can re-plan for the case that this parameter is also a pseudo-constant, without the effort being wasted. Of course you could also set a minimum number of invocations between re-plannings to get a more long-term view (e.g. if different parameters get recognized as pseudo-constants in subsequent calls, you may not want to re-plan for each of those calls).

So which plan do you execute if you have more than one applicable candidate? We can experiment to see what works well. As a starter I would definitely pick the one with the larger P (smaller Q), breaking ties in favour of the most recently generated plan. I'm assuming we only want one plan for a given P.

We'd probably want to limit the number of candidate plans per statement to some very small, fixed number--somewhere between one and four, I'd say, or maybe one generalized plan plus up to two specialized ones. With numbers like that, none of this should be very expensive; a simple linear match against 1-4 candidates may be more effective than any attempt to be clever.

I must admit I haven't thought through all the consequences of caching more than one specialized plan per statement. For example, we could give every cached plan its own set of confidence counters and match an incoming invocation against each of them; or we could keep just one "most likely" plan with its associated predictor state, and only consider previously generated plans if we either miss or find room for improvement in the predictor.

Jeroen
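In case it helps to make the matching step more concrete, here is a toy sketch in C of what the per-call test could look like. It is purely illustrative, not backend code: the type and function names (CandidatePlan, plan_matches, choose_plan), the 3-bit counters, and the four candidate slots are all made up for the example, parameters are reduced to plain ints, and a "plan" is reduced to nothing but its predictor state.

/* Toy model of the predictor sketched above -- an illustration only, not
 * backend code; the names, sizes and types are all invented for the example. */
#include <stdbool.h>
#include <stdio.h>

#define NPARAMS        3   /* parameters of our imaginary prepared statement */
#define MAX_CANDIDATES 4   /* small fixed number of cached plans */
#define CONF_MAX       7   /* 3-bit saturating counter */
#define CONF_THRESHOLD 4   /* MSB set => parameter is in P, otherwise in Q */

typedef struct {
    bool valid;
    int  confidence[NPARAMS];  /* per-parameter confidence counter */
    int  predicted[NPARAMS];   /* tracked value; what a P parameter assumed */
} CandidatePlan;

/* A parameter is in P when the most-significant bit of its counter is set. */
static bool in_p(const CandidatePlan *plan, int i) {
    return plan->confidence[i] >= CONF_THRESHOLD;
}

/* Match one candidate against the actual values of a call.  Returns false on
 * a miss.  On a hit, bump the counters of Q parameters that repeated their
 * previous value; *improvable is set when a counter crosses the threshold. */
static bool plan_matches(CandidatePlan *plan, const int *actual, bool *improvable) {
    *improvable = false;

    /* Every parameter in P must match, or this plan is unusable. */
    for (int i = 0; i < NPARAMS; i++)
        if (in_p(plan, i) && actual[i] != plan->predicted[i])
            return false;                       /* miss */

    /* It's a hit; see whether any Q parameter is becoming stable. */
    for (int i = 0; i < NPARAMS; i++) {
        if (in_p(plan, i))
            continue;
        if (actual[i] == plan->predicted[i]) {
            if (plan->confidence[i] < CONF_MAX &&
                ++plan->confidence[i] == CONF_THRESHOLD)
                *improvable = true;             /* room for a specialized plan */
        } else {
            plan->confidence[i] = 0;            /* start over on this one */
            plan->predicted[i] = actual[i];
        }
    }
    return true;
}

/* Linear scan over the candidates: keep the applicable plan with the largest
 * P, breaking ties in favour of the most recently added slot.  NULL means
 * every candidate missed and we have to plan from scratch. */
static CandidatePlan *choose_plan(CandidatePlan *cands, const int *actual,
                                  bool *improvable) {
    CandidatePlan *best = NULL;
    int best_psize = -1;

    for (int c = 0; c < MAX_CANDIDATES; c++) {
        bool improve = false;
        int  psize = 0;

        if (!cands[c].valid || !plan_matches(&cands[c], actual, &improve))
            continue;
        for (int i = 0; i < NPARAMS; i++)
            psize += in_p(&cands[c], i);
        if (psize >= best_psize) {
            best = &cands[c];
            best_psize = psize;
            *improvable = improve;
        }
    }
    return best;
}

int main(void) {
    /* One generalized plan (empty P) plus one specialized on parameter 0. */
    CandidatePlan cands[MAX_CANDIDATES] = {
        { .valid = true },
        { .valid = true, .confidence = {CONF_MAX}, .predicted = {42} },
    };
    int call[NPARAMS] = {42, 7, 7};
    bool improvable = false;
    CandidatePlan *plan = choose_plan(cands, call, &improvable);

    printf("picked plan %d, improvable=%d\n",
           plan ? (int) (plan - cands) : -1, (int) improvable);
    return 0;
}

In the real thing the value comparison would of course be on datums rather than plain ints, and resetting a Q parameter's counter to zero on a mismatch is just one possible policy--a decrement would give the predictor a longer memory.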