On Sun, Oct 16, 2022 at 6:51 AM Tomas Vondra <tomas.von...@enterprisedb.com> wrote:
> On 10/15/22 14:33, Tomas Vondra wrote:
> > Hi,
> >
> > ...
> >
> > There's a bunch of issues with this initial version of the patch,
> > usually described in XXX comments in the relevant places.
> >
> > 6) ...
>
> I forgot to mention one important issue in my list yesterday, and that's
> memory consumption. The way the patch is coded now, the new BRIN support
> function (brin_minmax_ranges) produces information about *all* ranges in
> one go, which may be an issue. The worst case is a 32TB table with 1-page
> BRIN ranges, which means ~4 billion ranges. The info is an array of ~32B
> structs, so this would require ~128GB of RAM. With the default 128-page
> ranges it'd still be ~1GB, which is quite a lot.
>
> We could have a discussion about what's a reasonable size of BRIN ranges
> on such large tables (e.g. building a bitmap on 4 billion ranges is going
> to be "not cheap", so this is likely pretty rare). But we should not
> introduce new nodes that ignore work_mem, so we need a way to deal with
> such cases somehow.
>
> The easiest solution likely is to check this while planning - we can check
> the table size, calculate the number of BRIN ranges, check that the range
> info fits into work_mem, and just not create the path when it gets too
> large. That's what we did for HashAgg, although that decision was
> unreliable because estimating GROUP BY cardinality is hard.
>
> The wrinkle here is that counting just the range info (BrinRange struct)
> does not include the values for by-reference types. We could use the
> average width - that's just an estimate, though.
>
> A more comprehensive solution seems to be to allow requesting chunks of
> the BRIN ranges, so that we'd get "slices" of ranges and process those.
> For example, if you have 1000 ranges and can only handle 100 at a time,
> we'd do 10 loops, each requesting 100 ranges.
>
> This has another problem - we do care about "overlaps", and we can't
> easily know whether the overlapping ranges will be in the same "slice".
> The chunks would be sorted (for example) by maxval, but there can be a
> range with a much higher maxval (thus in some future slice) but a very
> low minval (thus intersecting with ranges in the current slice).
>
> Imagine ranges with these minval/maxval values, sorted by maxval:
>
> [101,200]
> [201,300]
> [301,400]
> [150,500]
>
> and let's say we can only process 2-range slices. So we'll get the first
> two, but both of them intersect with the very last range.
>
> We could always include all the intersecting ranges into the slice, but
> what if there are too many very "wide" ranges?
>
> So I think this will need to switch to iterative communication with the
> BRIN index - instead of asking "give me info about all the ranges", we'll
> need a way to
>
> - request the next range (sorted by maxval)
> - request the intersecting ranges one by one (sorted by minval)
>
> Of course, the BRIN side will have some of the same challenges with
> tracking the info without breaking the work_mem limit, but I suppose it
> can store the info into a tuplestore/tuplesort, and use that instead of a
> plain in-memory array. Alternatively, it could just return those, and
> BrinSort would use that. OTOH it seems cleaner to have some sort of API,
> especially if we want to support e.g. minmax-multi opclasses, that have a
> more complicated concept of "intersection".
>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

Hi,

In your example involving [150,500], can this range be broken down into 4
ranges, ending in 200, 300, 400 and 500, respectively? That way, there is
no intersection among the ranges.

bq. can store the info into a tuplestore/tuplesort

Wouldn't this involve disk accesses, which may reduce the effectiveness of
BRIN sort?

Cheers
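
P.S. To make sure I understand the iterative interface you're proposing,
here is a rough sketch of its shape. The struct and function names below
are mine, purely for illustration - they are not from the patch:

/*
 * Hypothetical iterative BRIN support interface, only to illustrate the
 * two requests described above (next range by maxval, intersecting ranges
 * by minval). None of these names exist in the patch.
 */
typedef struct BrinRangeInfo
{
    BlockNumber blkno_start;    /* first heap block of the range */
    Datum       minval;         /* summarized minimum */
    Datum       maxval;         /* summarized maximum */
} BrinRangeInfo;

typedef struct BrinRangeScan BrinRangeScan;     /* opaque iterator state */

/* start iterating over summarized ranges, in maxval order */
extern BrinRangeScan *brin_minmax_begin_ranges(Relation idxRel);

/* fetch the next range in maxval order; false when there are no more */
extern bool brin_minmax_next_range(BrinRangeScan *scan, BrinRangeInfo *range);

/*
 * fetch ranges whose [minval,maxval] overlaps [lo,hi], one at a time in
 * minval order; false when no further range intersects the interval
 */
extern bool brin_minmax_next_intersecting(BrinRangeScan *scan,
                                          Datum lo, Datum hi,
                                          BrinRangeInfo *range);

BrinSort could then consume ranges incrementally, something like:

BrinRangeInfo r;
BrinRangeScan *scan = brin_minmax_begin_ranges(idxRel);

while (brin_minmax_next_range(scan, &r))
{
    BrinRangeInfo overlap;

    /* pull in the ranges overlapping the current one, in minval order */
    while (brin_minmax_next_intersecting(scan, r.minval, r.maxval, &overlap))
    {
        /* ... also read tuples from the overlapping range ... */
    }

    /* ... sort the collected tuples and emit those that are safe to output ... */
}

and the BRIN side could back the iterator with a tuplesort, which is what
prompted my question about disk accesses above.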