On 10/15/22 14:33, Tomas Vondra wrote:
> Hi,
>
> ...
>
> There's a bunch of issues with this initial version of the patch,
> usually described in XXX comments in the relevant places.
>
> ...
I forgot to mention one important issue in my list yesterday, and that's memory consumption. The way the patch is coded now, the new BRIN support function (brin_minmax_ranges) produces information about *all* ranges in one go, which may be an issue. The worst case is a 32TB table with 1-page BRIN ranges, which means ~4 billion ranges. The info is an array of ~32B structs, so this would require ~128GB of RAM. With the default 128-page ranges it's still ~1GB, which is quite a lot.

We could have a discussion about what a reasonable BRIN range size is on such large tables (e.g. building a bitmap on 4 billion ranges is going to be "not cheap", so this is likely pretty rare). But we should not introduce new nodes that ignore work_mem, so we need a way to deal with such cases somehow.

The easiest solution is likely to check this while planning - we can check the table size, calculate the number of BRIN ranges, check that the range info fits into work_mem, and simply not create the path when it gets too large. That's what we did for HashAgg, although that decision was unreliable because estimating GROUP BY cardinality is hard. The wrinkle here is that counting just the range info (BrinRange struct) does not include the values for by-reference types. We could use the average width - that's just an estimate, though.

A more comprehensive solution seems to be to allow requesting chunks of the BRIN ranges, so that we'd get "slices" of ranges and process those. For example, if you have 1000 ranges and can only handle 100 at a time, we'd do 10 loops, each requesting 100 ranges.

This has another problem - we do care about "overlaps", and we can't easily know whether the overlapping ranges will end up in the same "slice". The chunks would be sorted (for example) by maxval, but there can be a range with a much higher maxval (thus in some future slice) but a very low minval (thus intersecting with ranges in the current slice). Imagine ranges with these minval/maxval values, sorted by maxval:

[101,200]
[201,300]
[301,400]
[150,500]

and let's say we can only process 2-range slices. So we'll get the first two, but both of them intersect with the very last range. We could always include all the intersecting ranges in the slice, but what if there are too many very "wide" ranges?

So I think this will need to switch to an iterative communication with the BRIN index - instead of asking "give me info about all the ranges", we'll need a way to

- request the next range (sorted by maxval)
- request the intersecting ranges one by one (sorted by minval)

(a rough mock-up of what that could look like is at the end of this message)

Of course, the BRIN side will have some of the same challenges with tracking the info without breaking the work_mem limit, but I suppose it can store the info in a tuplestore/tuplesort and use that instead of a plain in-memory array. Alternatively, it could just return those, and BrinSort would use that. OTOH it seems cleaner to have some sort of API, especially if we want to support e.g. minmax-multi opclasses, which have a more complicated concept of "intersection".

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
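
To make the iterative API idea more concrete, here's a standalone mock-up of the communication pattern, using the example ranges from above. Everything in it is made up for illustration - plain ints stand in for Datums, two sorted copies of an in-memory array stand in for the BRIN index, and the function names (brin_next_range, brin_next_intersect) don't exist in the patch:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdbool.h>

/* mock-up of a BRIN minmax range summary (ints instead of Datums) */
typedef struct BrinRange
{
    int     minval;
    int     maxval;
} BrinRange;

static int
cmp_maxval(const void *a, const void *b)
{
    const BrinRange *ra = (const BrinRange *) a;
    const BrinRange *rb = (const BrinRange *) b;

    return (ra->maxval > rb->maxval) - (ra->maxval < rb->maxval);
}

static int
cmp_minval(const void *a, const void *b)
{
    const BrinRange *ra = (const BrinRange *) a;
    const BrinRange *rb = (const BrinRange *) b;

    return (ra->minval > rb->minval) - (ra->minval < rb->minval);
}

/* "request the next range (sorted by maxval)" */
static bool
brin_next_range(BrinRange *by_maxval, int nranges, int *pos, BrinRange *out)
{
    if (*pos >= nranges)
        return false;
    *out = by_maxval[(*pos)++];
    return true;
}

/*
 * "request the intersecting ranges one by one (sorted by minval)"
 *
 * Walks the minval ordering; once minval exceeds cur->maxval no later
 * range can intersect, so the scan stops early.
 */
static bool
brin_next_intersect(BrinRange *by_minval, int nranges, int *pos,
                    const BrinRange *cur, BrinRange *out)
{
    while (*pos < nranges && by_minval[*pos].minval <= cur->maxval)
    {
        BrinRange *r = &by_minval[(*pos)++];

        /* skip ranges entirely below cur, and cur itself */
        if (r->maxval < cur->minval ||
            (r->minval == cur->minval && r->maxval == cur->maxval))
            continue;

        *out = *r;
        return true;
    }
    return false;
}

int
main(void)
{
    /* the example ranges from above */
    BrinRange   by_maxval[] = {{101, 200}, {201, 300}, {301, 400}, {150, 500}};
    BrinRange   by_minval[4];
    int         nranges = 4;
    int         mpos = 0;
    BrinRange   cur;

    memcpy(by_minval, by_maxval, sizeof(by_maxval));
    qsort(by_maxval, nranges, sizeof(BrinRange), cmp_maxval);
    qsort(by_minval, nranges, sizeof(BrinRange), cmp_minval);

    while (brin_next_range(by_maxval, nranges, &mpos, &cur))
    {
        int         ipos = 0;   /* restart the minval iterator */
        BrinRange   other;

        printf("range [%d,%d] intersects:", cur.minval, cur.maxval);
        while (brin_next_intersect(by_minval, nranges, &ipos, &cur, &other))
            printf(" [%d,%d]", other.minval, other.maxval);
        printf("\n");
    }

    return 0;
}

The point of the shape is that the consumer only ever holds the current range plus one intersecting range at a time, so the memory question moves entirely to how the BRIN side maintains the two orderings (e.g. a tuplesort per ordering, per the suggestion above).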