Hi Tomas,

> The thing I was proposing is that it should be possible to build
> histograms with bins adapted to density in the given region. With
> smaller buckets in areas with high density. So that queries intersect
> with fewer buckets in low-density parts of the histogram.

This is how an Equi-Depth Histogram works. Postgres already has smaller
buckets in areas with high density, because the bucket boundaries are
taken at equally spaced positions in the sorted sample:
values[(i * (nvals - 1)) / (num_hist - 1)].
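To illustrate the point, here is a minimal standalone sketch of that
boundary selection (this is not the actual analyze.c code; the function
name and the sample values are made up for the example):

#include <stdio.h>

/*
 * Minimal sketch of equi-depth boundary selection: pick num_hist
 * boundaries at equally spaced ranks of a sorted sample of nvals
 * values (assumes num_hist >= 2).
 */
static void
pick_equi_depth_bounds(const double *values, int nvals,
                       double *bounds, int num_hist)
{
    for (int i = 0; i < num_hist; i++)
        bounds[i] = values[(i * (nvals - 1)) / (num_hist - 1)];
}

int
main(void)
{
    /* sorted sample: dense around 1.x, sparse above 10 */
    double values[] = {1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7,
                       10.0, 20.0, 40.0, 80.0};
    double bounds[5];

    pick_equi_depth_bounds(values, 12, bounds, 5);

    /* prints 1, 1.2, 1.5, 10, 80 */
    for (int i = 0; i < 5; i++)
        printf("bound[%d] = %g\n", i, bounds[i]);
    return 0;
}

Because the boundaries sit at equally spaced ranks rather than equally
spaced values, the dense region around 1.x is covered by narrow buckets
and the sparse tail by wide ones.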
> I don't recall the details of the MHIST-2 scheme, but it's true
> calculating "perfect" V-optimal histogram would be quadratic O(N^2*B).

In M-dimensional space a "perfect" V-Optimal Histogram is an NP-hard
problem. In other words, it cannot be built in polynomial time (unless
P = NP). How did you come up with that estimate?

> But that's exactly why greedy/approximate algorithms exist. Yes, it's
> not going to produce the optimal V-optimal histogram, but so what?

A greedy/approximate algorithm has time complexity O(M*N*B), where M is
the number of dimensions. MHIST-2 is such a greedy/approximate
algorithm.
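To make the O(M*N*B) estimate concrete, below is a much-simplified
one-dimensional sketch of the greedy idea: repeatedly split the bucket
whose best split point gives the largest SSE reduction. It is only an
illustration under simplifying assumptions, not MHIST-2 itself, and the
names and frequency data are made up:

#include <stdio.h>

#define NVALS       12
#define MAX_BUCKETS 4

typedef struct { int lo, hi; } Bucket;      /* half-open range [lo, hi) */

/* prefix sums over the frequencies, zero-initialized */
static double sum[NVALS + 1], sumsq[NVALS + 1];

/* SSE of the frequencies in [lo, hi): sum(f^2) - (sum f)^2 / n */
static double
range_sse(int lo, int hi)
{
    double s = sum[hi] - sum[lo];
    double s2 = sumsq[hi] - sumsq[lo];
    int n = hi - lo;

    return (n > 0) ? s2 - (s * s) / n : 0.0;
}

int
main(void)
{
    /* freqs[i] is the frequency of the i-th distinct value, sorted */
    double freqs[NVALS] = {1, 1, 2, 1, 50, 55, 60, 52, 2, 1, 1, 90};
    Bucket buckets[MAX_BUCKETS] = {{0, NVALS}};
    int nbuckets = 1;

    for (int i = 0; i < NVALS; i++)
    {
        sum[i + 1] = sum[i] + freqs[i];
        sumsq[i + 1] = sumsq[i] + freqs[i] * freqs[i];
    }

    while (nbuckets < MAX_BUCKETS)
    {
        int best_b = -1, best_split = -1;
        double best_gain = 0.0;

        /* pick the bucket and split point with the largest SSE reduction */
        for (int b = 0; b < nbuckets; b++)
        {
            double parent = range_sse(buckets[b].lo, buckets[b].hi);

            for (int s = buckets[b].lo + 1; s < buckets[b].hi; s++)
            {
                double gain = parent - range_sse(buckets[b].lo, s)
                                     - range_sse(s, buckets[b].hi);

                if (gain > best_gain)
                {
                    best_gain = gain;
                    best_b = b;
                    best_split = s;
                }
            }
        }

        if (best_b < 0)
            break;                  /* no split reduces SSE any further */

        /* split the chosen bucket in two */
        buckets[nbuckets].lo = best_split;
        buckets[nbuckets].hi = buckets[best_b].hi;
        buckets[best_b].hi = best_split;
        nbuckets++;
    }

    for (int b = 0; b < nbuckets; b++)
        printf("bucket [%d, %d)\n", buckets[b].lo, buckets[b].hi);
    return 0;
}

With the prefix sums each pass of the while loop scans roughly N
candidate split points, and there are at most B-1 passes, which is
where the per-dimension N*B factor comes from.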
> And how does this compare to the approximate/greedy algorithms, both in
> terms of construction time and accuracy?

The time complexity of building an Equi-Depth Histogram has no
dependence on B; as noted below, it is O(M*N*logN).

> But all of this can apply to histograms in general, no? It's not somehow
> special to equi-depth histograms, a v-optimal histogram could be stored
> as an r-tree etc.

I agree.

Regards,
Alexander Cheshev

On Sun, 7 Jan 2024 at 00:54, Tomas Vondra
<tomas.von...@enterprisedb.com> wrote:
>
> On 1/6/24 01:00, Alexander Cheshev wrote:
> > Hi Tomas,
> >
> >> Another reason was that the algorithm described in the two papers you
> >> reference (1988 paper by DeWitt and the evaluation by Carlson and
> >> Sutherland from ~2010) is simple but has pretty obvious weaknesses. It
> >> processes the columns one by one - first build bucket on column "a",
> >> then splits each bucket into buckets on "b". So it's not symmetrical,
> >> and it results in lower accuracy compared to an "ideal" histogram with
> >> the same number of buckets (especially for the dimensions split early).
> >
> > As stated in [1] Sum Square Error (SSE) is one of the most natural
> > error metrics. The Equi-Depth Histogram is not ideal because it doesn't
> > minimize SSE. On the other hand the V-Optimal Histogram indeed
> > minimizes SSE and from this point of view can be considered an ideal
> > solution.
> >
> >> This does indeed produce an equi-depth histogram, which seems nice, but
> >> the mesh is regular in such a way that any value of the first dimension
> >> intersects all buckets on the second dimension. So for example with a
> >> histogram of 100x100 buckets on [a,b], any value "a=X" intersects with
> >> 100 buckets on "b", each representing 1/10000 of tuples. But we don't
> >> know how the tuples are distributed in the bucket - so we end up using
> >> 0.5 of the bucket as unbiased estimate, but that can easily add-up in
> >> the wrong dimension.
> >
> > Suppose that we have a query box a=X and we know how data is
> > distributed in buckets. Let's consider only the buckets which are
> > intersected by the query box a=X. As we know how data is distributed
> > in buckets we can exclude the buckets which have no tuples which
> > intersect the query box a=X.
> >
> > As we store buckets with no information about data distribution we
> > have to make reasonable assumptions. If we keep buckets relatively
> > small then we can assume that buckets have uniform distribution.
> >
> > What I am trying to say is that the problem which you pointed out
> > comes from the fact that we store buckets with no information about
> > data distribution. Even in the one-dimensional case we have to assume
> > how data is distributed in buckets. By the way, Postgres assumes that
> > data has uniform distribution in buckets.
>
> It's not just what Postgres assumes, the assumption of bucket uniformity
> is somewhat inherent to the whole concept of a histogram. Yes, maybe we
> could keep some "distribution" info about each bucket, but then maybe we
> could simply build a histogram with more bins occupying the same space?
>
> The thing I was proposing is that it should be possible to build
> histograms with bins adapted to density in the given region. With
> smaller buckets in areas with high density. So that queries intersect
> with fewer buckets in low-density parts of the histogram.
>
> >> However, this is not the only way to build an equi-depth histogram,
> >> there are ways to make it more symmetrical. Also, it's not clear
> >> equi-depth histograms are ideal with multiple columns. There are papers
> >> proposing various other types of histograms (using different criteria to
> >> build buckets optimizing a different metric). The most interesting one
> >> seems to be V-Optimal histograms - see for example papers [1], [2], [3],
> >> [4], [5] and [6]. I'm sure there are more. The drawback of course is
> >> that it's more expensive to build such histograms.
> >
> > Tomas, thank you for sharing your ideas regarding the V-Optimal
> > Histogram with me. I read through the papers which you gave me and came
> > up with the following conclusion.
> >
> > The problem can be formulated like this. We have N tuples in
> > M-dimensional space. We need to partition space into buckets
> > iteratively until SSE is less than E or we reach the limit of buckets B.
>
> Yes. Although v-optimal histograms minimize variance of frequencies. Not
> sure if that's what you mean by SSE.
>
> > In the case of M-dimensional space it seems to me like an NP-hard
> > problem. A greedy heuristic MHIST-2 is proposed in [2]. Preliminarily,
> > we sort the N tuples in ascending order. Then we iteratively select the
> > bucket which leads to the largest SSE reduction and split it into two
> > parts. We repeat the process until SSE is less than E or we reach the
> > limit of buckets B.
>
> I don't recall all the details of the MHIST-2 algorithm, but this sounds
> about right. Yes, building the optimal histogram would be NP-hard, so
> we'd have to use some approximate / greedy algorithm.
>
> > If we assume that B is significantly less than N then the time
> > complexity of MHIST-2 can be estimated as O(M*N*B). Suppose that M
> > equals 3, B equals 1000 and N equals 300*B, then it will take slightly
> > over 0.9*10^9 iterations to build a V-Optimal Histogram.
> >
> > You can see that we have to keep B as low as possible in order to
> > build a V-Optimal Histogram in feasible time. And here is a paradox.
> > On the one hand we use a V-Optimal Histogram in order to minimize SSE,
> > but on the other hand we have to keep B as low as possible, which
> > eventually leads to an increase in SSE.
>
> I don't recall the details of the MHIST-2 scheme, but it's true
> calculating "perfect" V-optimal histogram would be quadratic O(N^2*B).
>
> But that's exactly why greedy/approximate algorithms exist. Yes, it's
> not going to produce the optimal V-optimal histogram, but so what?
>
> > On the other hand the time complexity required to build an Equi-Depth
> > Histogram doesn't depend on B and can be estimated as O(M*N*logN). SSE
> > can be arbitrarily reduced by increasing B, which in turn is only
> > limited by the storage limit. Experimental results show a low error
> > metric [3].
>
> And how does this compare to the approximate/greedy algorithms, both in
> terms of construction time and accuracy?
>
> The [3] paper does not compare those, ofc, and I'm somewhat skeptical
> about the results in that paper. Not that they'd be wrong, but AFAICS
> the assumptions are quite simplistic and well-suited for that particular
> type of histogram.
>
> It's far from clear how it would perform for less "smooth" data,
> non-numeric data, etc.
>
> > In an Equi-Depth Histogram a bucket is represented by two vectors. The
> > first vector points to the left bottom corner of the bucket and the
> > other one points to the right top corner of the bucket. Thus the space
> > complexity of the Equi-Depth Histogram can be estimated as
> > 2*integer_size*M*B. Assume that M equals 3, B equals 1000 and
> > integer_size equals 4 bytes, then the Equi-Depth Histogram will occupy
> > 24000 bytes.
> >
> > If a bucket is partially intersected by a query box then we assume
> > that data has uniform distribution inside of the bucket. It is a
> > reasonable assumption if B is relatively large.
> >
> > In order to efficiently find the buckets which intersect a query box we
> > can store the Equi-Depth Histogram in an R-tree as proposed in [3]. On
> > average it takes O(logB) iterations to find the buckets which intersect
> > a query box. As storage requirements are dominated by leaf nodes we can
> > assume that it takes slightly more than 2*integer_size*M*B.
>
> But all of this can apply to histograms in general, no? It's not somehow
> special to equi-depth histograms, a v-optimal histogram could be stored
> as an r-tree etc.
>
> >> IIRC the patch tried to do something like V-optimal histograms, but
> >> using a greedy algorithm and not the exhaustive stuff described in the
> >> various papers.
> >
> > We should only consider computationally tractable solutions. In the
> > one-dimensional case a V-Optimal Histogram is probably a good solution,
> > but in the multi-dimensional case I would only consider Equi-Width or
> > Equi-Depth Histograms. As stated in multiple papers the Equi-Depth
> > Histogram proves to be more accurate than the Equi-Width Histogram. By
> > the way, Postgres uses something like an Equi-Width Histogram.
>
> Sure, but AFAIK the greedy/approximate algorithms are not intractable.
>
> And at some point it becomes a tradeoff between accuracy of estimates
> and resources to build the histogram. Papers from ~1988 are great, but
> maybe not sufficient to make such decisions now.
>
> >> FWIW I did not intend to reject the idea of adding multi-dimensional
> >> histograms, but rather to provide some insight into the history of the
> >> past attempt, and also point out some weaknesses of the algorithm
> >> described in the 1988 paper. If you choose to work on this, I can do a
> >> review etc.
> >
> > Thank you very much Tomas. I am new in the community and I definitely
> > didn't expect to have such a warm welcome.
> >
> > As I indicated above the Equi-Depth Histogram proves to be more
> > accurate than the Equi-Width Histogram and both have the same time and
> > space requirements. Postgres uses some sort of Equi-Width Histogram.
> > Suppose that:
> > * I will create a patch which will replace the Equi-Width Histogram
> >   with an Equi-Depth Histogram but only in the 1-dimensional case.
> > * I will show experimental results which will demonstrate improvement
> >   of selectivity estimation.
> > Then will the patch be accepted by the community?
> >
> > If the above patch is accepted by the community then I will proceed
> > further with the M-dimensional Equi-Depth Histogram...
>
> I find it very unlikely we want (or even need) to significantly rework
> the 1-D histograms we already have. AFAICS the current histograms work
> pretty well, and if better accuracy is needed, it's easy/efficient to
> just increase the statistics target. I'm sure there are various
> estimation issues, but I'm not aware of any that'd be solved simply by
> using a different 1-D histogram.
>
> I'm not going to reject that outright - but I think the bar you'd need
> to justify such a change is damn high. Pretty much everyone is using the
> current histograms, which makes it a very sensitive area. You'll need to
> show that it helps in practice, without causing harm ...
>
> It's an interesting area for experiments, no doubt about it. And if you
> choose to explore it, that's fine. But it's better to be aware it may
> not end with a commit.
>
> For the multi-dimensional case, I propose we first try to experiment
> with the various algorithms, and figure out what works etc. Maybe
> implementing them in python or something would be easier than C.
>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company