On 02/22/2018 09:52 PM, Claudio Freire wrote:
> On Thu, Feb 22, 2018 at 5:11 PM, Tomas Vondra
> <tomas.von...@2ndquadrant.com> wrote:
>> On 02/22/2018 08:33 PM, Claudio Freire wrote:
>>> That's kinda slow to do per-item. I tried to "count" distinct items by
>>> checking the BF before adding (don't add redundantly), but that's less
>>> precise than an HLL in my experience.
>>
>> But you don't need to do that for each item you try to add - you know
>> that with M bits and K hash functions you can't reach 50% before at
>> least M/(2*K) additions. And you don't really need to track the number
>> of bits exactly - if it gets 55% full, it's not going to explode.
>>
>> So just wait until M/(2*K) additions, see how many bits remain until the
>> threshold - rinse and repeat. And use some reasonable minimum distance
>> (say, 5% of the capacity), not to do the check too often.
>
> Ok, that works too.
>
> Point is, SBFs help to keep the BF size as small as possible, keep it
> below work_mem, and to avoid caring about the total number of distinct
> items.
>
> You may want the planner to try and estimate that to figure out
> whether it's worth trying BFs or not, but once you decide to try,
> SBFs remove any dependency on the accuracy of that estimate.
>
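For reference, the fill-check heuristic described above might look
roughly like this in C. This is only a sketch - the struct fields,
constants and function names are made up for illustration, not taken
from any actual patch:

#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative Bloom filter state -- only the fields needed for the
 * fill-ratio heuristic are shown.
 */
typedef struct BloomFilter
{
	uint64_t	nbits;			/* M: total number of bits */
	int			nhashes;		/* K: number of hash functions */
	uint64_t	nbits_set;		/* bits currently set to 1 */
	uint64_t	nadded;			/* additions since the last check */
	uint64_t	check_after;	/* additions allowed before next check */
	/* ... bit array, hash seeds, etc. ... */
} BloomFilter;

#define FILL_THRESHOLD		0.5		/* consider the filter "full" at 50% bits set */
#define MIN_CHECK_DISTANCE	0.05	/* check no more often than 5% of capacity */

/*
 * Each addition sets at most K bits, so at least
 * (FILL_THRESHOLD * M - nbits_set) / K further additions are needed
 * before the threshold can be crossed (M/(2*K) for an empty filter).
 * Schedule the next check accordingly, but keep a minimum distance of
 * 5% of the capacity so we don't check too often.
 */
static void
bloom_schedule_check(BloomFilter *bf)
{
	uint64_t	bits_left = (uint64_t) (FILL_THRESHOLD * bf->nbits) - bf->nbits_set;
	uint64_t	interval = bits_left / bf->nhashes;
	uint64_t	min_interval = (uint64_t) (MIN_CHECK_DISTANCE * bf->nbits) / bf->nhashes;

	bf->check_after = (interval > min_interval) ? interval : min_interval;
	bf->nadded = 0;
}

/*
 * Call after each insertion.  Returns true once the filter has crossed
 * the fill threshold -- the caller can then give up on it (or, with a
 * scalable Bloom filter, freeze it and start a new, larger one).
 */
static bool
bloom_too_full(BloomFilter *bf)
{
	if (++bf->nadded < bf->check_after)
		return false;			/* cheap path, no check yet */

	if ((double) bf->nbits_set / bf->nbits >= FILL_THRESHOLD)
		return true;

	bloom_schedule_check(bf);	/* still below the threshold, reschedule */
	return false;
}

The insert path would have to maintain nbits_set incrementally (count
only bits that actually flip from 0 to 1), and bloom_schedule_check()
would be called once when the filter is created; after that, most
additions only pay for a single counter increment.
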
OK, thanks for reminding me about SBFs, and for the discussion. At this
point I'll probably focus on the other parts, though - determining the
selectivity of the join, etc., which I think is crucial; we need to get
that right even for accurate estimates. It's good to know that we have a
solution for this part, though.

regards

--
Tomas Vondra                      http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services