I wrote:

> Looks good. I'll run some tests of my own this week, trying to find
> corner cases different from yours.

Over the weekend I tried out a test to measure:
-The size of the MCV list
-The ratio between actual and estimated cardinality of the least
common value in the MCV list
-The ratio between actual and estimated cardinality of the most common
value not in the MCV list

The script, results, and some starter queries are attached. I ran a
bunch more queries to drill down in certain areas, but they're all
based on the ones here.

I ran against master, the RSE patch, and three variations of the V2
patch (with stddev cutoffs of 2.0, 2.5, and 3.0). I loaded each data
file into each DB, ran ANALYZE 10 times, and took the average of each
of the three metrics above. I only tested with the default stats
target.

For this test, I used the same distributions as Dean's original
script, but I whacked around the script to enable inserting results
into a table.

--
Summary for non-uniform distributions:
Note: There's a lot of variation in the estimation of the most common
non-MCV. About 1% of all ANALYZE runs apparently failed to sample one
of the good candidates for the MCV list, resulting in huge
underestimates, regardless of which patch was applied. For future
tests, I'll do more runs and measure a truncated mean (sketched after
the list below). The number of MCVs is a much more stable metric, but
unlike the uniform distribution case, we don't have an intuitive feel
for what that number should be. That said,

-The V2 patch is somewhat better than RSE and much better than master
at estimating the most common value not in the MCV list.
-Bumping the sample stddev threshold to 3.0 seems to behave similarly
to the RSE patch, maybe slightly better.
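
For concreteness, here's the sort of truncated mean I have in mind for
future runs. This is only a sketch (reusing numpy, as in the attached
script); the 5% cut fraction is an arbitrary assumption, and with only
10 ANALYZE runs per file it trims nothing, which is part of why more
runs are needed:

import numpy

def truncated_mean(ratios, cut=0.05):
    # Sort the per-run ratios and drop the lowest and highest 5%
    # before averaging, so the occasional huge underestimate from a
    # bad sample doesn't dominate the reported average.
    r = numpy.sort(numpy.asarray(ratios, dtype=float))
    k = int(len(r) * cut)
    return r[k:len(r) - k].mean() if k > 0 else r.mean()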

--
Summary for uniform distributions:
Note 1: Using the average is not really adequate for testing
under/overestimation of values in my test setup, since I set the
estimation ratio to zero if there was either an empty MCV list or a
completely full list. In future runs, I'll drill down a bit more
precisely.
That said, there didn't seem to be a huge amount of variation between
the patches in either of the ratios I was testing. The number of MCVs
is a much better metric anyway, since we know we want either none or
everything.

Note 2: All patches fully populated the list when all the values fit
in the MCV list. For the rest of the cases:

-The RSE patch is not much better than master at excluding MCVs.
-The V2 patch is much better than either of those, but still not
ideal (up to 30 MCVs).
-The V2 patch with a cutoff of 2.5 severely cut down the MCVs (at most 3).
-With a cutoff of 3.0, all lists are empty.

--
Here's one interesting case: a normal distribution, but highly
dispersed (N=500000, sd=1000), so it resembles a uniform one. The
number of MCVs jumps around a lot between patches, but the estimates
don't vary much:

Master: 100
RSE: 1
V2: 100
V2 with 2.5 cutoff: 100
V2 with 3.0 cutoff: 32


--
Thoughts and future testing:
I think the V2 patch strikes a good balance.
I'm partial to the variant with the 2.5 cutoff in sample standard
deviation, since it's much more aggressive about excluding values for
uniform distributions. It's almost as good as plain V2 at including
values in non-uniform distributions, but possibly enough worse that
it's not the best choice.

Time is running out, but some things I'd like to look at:

-As mentioned above, better noise reduction in my data analysis.
-The continuity-corrected Wald interval Dean mentioned [1] (see the
sketch after this list).
-I wonder if we can actually untie the max size of the MCV list from
the stats target, and use a hard-coded maximum like 1000. That might
allow us to bump the max stats target without hurting planning time.
Perhaps next cycle.
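
For reference, my reading of the continuity-corrected Wald interval
for a sample proportion p = x/n is

    p +/- (z * sqrt(p*(1 - p)/n) + 1/(2n))

Here's a minimal sketch; the function name and the z value are mine
(z=1.96 is just the usual ~95% placeholder, not anything from Dean's
message):

import math

def wald_cc_interval(x, n, z=1.96):
    # Continuity-corrected Wald interval for the proportion x/n,
    # clamped to [0, 1].
    p = x / float(n)
    half_width = z * math.sqrt(p * (1.0 - p) / n) + 1.0 / (2 * n)
    return (max(0.0, p - half_width), min(1.0, p + half_width))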


-John Naylor

[1] 
https://www.postgresql.org/message-id/CAEZATCVHEEg%2BCrP%2B-0JUsVeNPu5rs_S23oJVeH4VF%3DfgDwhfLQ%40mail.gmail.com
#!/usr/bin/python

import datetime
import numpy
import psycopg2
import random
import re

HEAD_PORT = 5430
RSE_PORT = 5431
SSD_20_PORT = 5432
SSD_25_PORT = 5425
SSD_30_PORT = 5433

TEST_DATA_FILE = '/tmp/mcv-test-data'
NUMBER_OF_ANALYSE_TESTS = 10

def connect(port):
    # One server per branch under test; each gets identical data.
    conn = psycopg2.connect('port=%d dbname=john user=john' % port)
    conn.set_session(autocommit = True)
    return conn

branches = {}
branches['master'] = connect(HEAD_PORT).cursor()
branches['RSE'] = connect(RSE_PORT).cursor()
branches['V2_20'] = connect(SSD_20_PORT).cursor()
branches['V2_25'] = connect(SSD_25_PORT).cursor()
branches['V2_30'] = connect(SSD_30_PORT).cursor()

# Store test results in the master DB.
result_store = branches['master']

# Create one results table each time the script is run.
timestamp = datetime.datetime.now().strftime('%m%d%H%M')
results_table = 'mcv_results_' + timestamp
result_store.execute('DROP TABLE IF EXISTS %s' % (results_table,))
result_store.execute('CREATE TABLE %s ' \
                        '(branch text,' \
                        'title text,' \
                        'count int,' \
                        'dcount int,' \
                        'num_mcvs int,' \
                        'ratio_least_common_mcv float4,' \
                        'ratio_most_common_non_mcv float4)' % (results_table,))

def test_stats(cur, x):
    # Compute the ratio of actual to estimated rows for an equality
    # predicate, taken from the top plan line of the EXPLAIN ANALYSE
    # output: the first rows= is the planner's estimate, the second
    # is the actual row count.
    cur.execute('EXPLAIN ANALYSE'\
                ' SELECT * FROM mcv_test WHERE x = %s', (x,))
    plan = cur.fetchone()[0]
    match = re.search('rows=([0-9]+).*rows=([0-9]+)', plan)
    estimate = float(match.group(1))
    actual = float(match.group(2))
    return actual / estimate

def test_analyse(title, branch, cur):
    for i in range(NUMBER_OF_ANALYSE_TESTS):
        cur.execute('ANALYSE mcv_test')
        cur.execute('SELECT coalesce(cardinality(most_common_vals), 0)'\
                    '  FROM pg_stats WHERE tablename = \'mcv_test\'')
        num_mcvs = cur.fetchone()[0]

        cur.execute('SELECT count(x), sum(count)'\
                    '  FROM mcv_test_counts')
        (dcount, count) = cur.fetchone()

        # Returns the MCV list entries as integer rows.
        # most_common_vals is anyarray, so round-trip through text
        # to get at the individual elements.
        mcv_query = 'SELECT unnest('\
                    '         string_to_array('\
                    '           array_to_string('\
                    '             most_common_vals, \',\'), \',\'))::int'\
                    '  FROM pg_stats WHERE tablename = \'mcv_test\''

        # Ratio between actual and estimated cardinality of the most
        # common value not in the MCV list
        cur.execute('SELECT x' \
                    '  FROM mcv_test_counts' \
                    '  WHERE x NOT IN ( %s )' \
                    '  ORDER BY count DESC LIMIT 1' % (mcv_query,))

        most_common_non_mcv = cur.fetchone()
        if most_common_non_mcv is None:
            ratio_most_common_non_mcv = 0
        else:
            ratio_most_common_non_mcv = test_stats(cur, most_common_non_mcv[0])

        # Ratio between actual and estimated cardinality of the least
        # common value in the MCV list
        cur.execute('SELECT x' \
                    '  FROM mcv_test_counts' \
                    '  WHERE x IN ( %s )' \
                    '  ORDER BY count ASC LIMIT 1' % (mcv_query,))

        least_common_mcv = cur.fetchone()
        if least_common_mcv is None:
            ratio_least_common_mcv = 0
        else:
            ratio_least_common_mcv = test_stats(cur, least_common_mcv[0])

        # Save results, letting psycopg2 quote the values.
        result_store.execute(
            'INSERT INTO %s' \
            '        (branch, title, count, dcount, num_mcvs,' \
            '         ratio_most_common_non_mcv, ratio_least_common_mcv)' \
            ' VALUES (%%s, %%s, %%s, %%s, %%s, %%s, %%s)' % (results_table,),
            (branch, title, count, dcount, num_mcvs,
             ratio_most_common_non_mcv, ratio_least_common_mcv))

        print
        print branch
        print title
        print 'Table size:', count
        print 'Distinct values:', dcount
        print 'Number of MCVs:', num_mcvs
        print 'Ratio MC non-MCV:', ratio_most_common_non_mcv
        print 'Ratio LC MCV:', ratio_least_common_mcv

def test(dist_fn, *args):
    with open(TEST_DATA_FILE, 'w') as fd:
        title = dist_fn(fd, *args)

    for branch, cur in branches.iteritems():
        # Recreate table
        cur.execute('DROP TABLE IF EXISTS mcv_test')
        cur.execute('CREATE TABLE mcv_test(x int)')

        # Load data
        cur.execute('COPY mcv_test FROM \'%s\'' % TEST_DATA_FILE)
        #~ cur.execute('CREATE INDEX mcv_test_x_idx ON mcv_test(x)')

        # Populate counts table
        cur.execute('DROP TABLE IF EXISTS mcv_test_counts')
        cur.execute('CREATE TABLE mcv_test_counts(x int, count int)')
        cur.execute('INSERT INTO mcv_test_counts (x, count)' \
                    'SELECT x, count(*)' \
                    '  FROM mcv_test' \
                    '  GROUP BY 1' \
                    '  ORDER BY 2 DESC')

        test_analyse(title, branch, cur)

def uniform_correlated(fd, N, Nd):
    # N rows over Nd distinct values, written in sorted order so the
    # physical order is perfectly correlated with the values.
    for x in range(N):
        print >>fd, int(x*float(Nd)/N)
    return 'Correlated uniform distribution (N=%d, Nd=%d)' % (N, Nd)

def uniform_random(fd, N, Nd):
    # Same values as uniform_correlated, but in shuffled physical order.
    vals = range(0, N)
    random.shuffle(vals)
    for x in range(N):
        print >>fd, int(vals[x]*float(Nd)/N)
    return 'Random uniform distribution (N=%d, Nd=%d)' % (N, Nd)

def normal(fd, N, sd):
    for val in numpy.random.normal(0, sd, N):
        print >>fd, int(round(val))
    return 'Normal distribution (N=%d, sd=%d)' % (N, sd)
    
def binomial(fd, N, n, p):
    for val in numpy.random.binomial(n, p, N):
        print >>fd, int(round(val))
    return 'Binomial distribution (N=%d, n=%d, p=%f)' % (N, n, p)

def exponential(fd, N, sd):
    for val in numpy.random.exponential(sd, N):
        print >>fd, int(round(val))
    return 'Exponential distribution (N=%d, sd=%f (beta))' % (N, sd)

def laplace(fd, N, b):
    for val in numpy.random.laplace(0, b, N):
        print >>fd, int(round(val))
    return 'Laplace distribution (N=%d, b=%f (lambda))' % (N, b)

def pareto(fd, N, a):
    for val in numpy.random.pareto(a, N):
        print >>fd, int(round(val))
    return 'Pareto distribution (N=%d, a=%f)' % (N, a)

def multimodal(fd, N, peaks, sep, sd):
    for x in range(0, N):
        mean = random.randrange(0, peaks)*sep
        print >>fd, int(round(numpy.random.normal(mean, sd)))
    return 'Multimodal normal mix (N=%d, peaks=%d, sep=%d, sd=%d)' % (N, peaks, sep, sd)

def uniform_correlated_tests():
    test(uniform_correlated, 30000, 100)
    test(uniform_correlated, 30000, 150)

    test(uniform_correlated, 40000, 100)
    test(uniform_correlated, 40000, 150)
    test(uniform_correlated, 40000, 200)
    test(uniform_correlated, 40000, 500)
    test(uniform_correlated, 40000, 1000)
    test(uniform_correlated, 40000, 2000)
    test(uniform_correlated, 40000, 3000)
    test(uniform_correlated, 40000, 4000)
    test(uniform_correlated, 40000, 5000)
    test(uniform_correlated, 40000, 10000)
    test(uniform_correlated, 40000, 20000)

    test(uniform_correlated, 50000, 100)
    test(uniform_correlated, 50000, 150)
    test(uniform_correlated, 50000, 200)
    test(uniform_correlated, 50000, 500)
    test(uniform_correlated, 50000, 1000)
    test(uniform_correlated, 50000, 2000)
    test(uniform_correlated, 50000, 3000)
    test(uniform_correlated, 50000, 6000)
    test(uniform_correlated, 50000, 12000)
    test(uniform_correlated, 50000, 25000)

    test(uniform_correlated, 60000, 100)
    test(uniform_correlated, 60000, 150)
    test(uniform_correlated, 60000, 200)
    test(uniform_correlated, 60000, 500)
    test(uniform_correlated, 60000, 1000)
    test(uniform_correlated, 60000, 2000)
    test(uniform_correlated, 60000, 3000)
    test(uniform_correlated, 60000, 5000)
    test(uniform_correlated, 60000, 7000)
    test(uniform_correlated, 60000, 15000)
    test(uniform_correlated, 60000, 30000)

    test(uniform_correlated, 100000, 100)
    test(uniform_correlated, 100000, 150)
    test(uniform_correlated, 100000, 200)
    test(uniform_correlated, 100000, 300)
    test(uniform_correlated, 100000, 400)
    test(uniform_correlated, 100000, 500)
    test(uniform_correlated, 100000, 1000)
    test(uniform_correlated, 100000, 2000)
    test(uniform_correlated, 100000, 3000)
    test(uniform_correlated, 100000, 4000)
    test(uniform_correlated, 100000, 5000)
    test(uniform_correlated, 100000, 10000)
    test(uniform_correlated, 100000, 20000)
    test(uniform_correlated, 100000, 50000)

    test(uniform_correlated, 1000000, 100)
    test(uniform_correlated, 1000000, 200)
    test(uniform_correlated, 1000000, 500)
    test(uniform_correlated, 1000000, 750)
    test(uniform_correlated, 1000000, 1000)
    test(uniform_correlated, 1000000, 1500)
    test(uniform_correlated, 1000000, 2000)
    test(uniform_correlated, 1000000, 3000)
    test(uniform_correlated, 1000000, 4000)
    test(uniform_correlated, 1000000, 5000)
    test(uniform_correlated, 1000000, 10000)
    test(uniform_correlated, 1000000, 100000)
    test(uniform_correlated, 1000000, 500000)

def uniform_random_tests():
    test(uniform_random, 30000, 100)
    test(uniform_random, 30000, 150)

    test(uniform_random, 40000, 100)
    test(uniform_random, 40000, 150)
    test(uniform_random, 40000, 200)
    test(uniform_random, 40000, 500)
    test(uniform_random, 40000, 1000)
    test(uniform_random, 40000, 2000)
    test(uniform_random, 40000, 3000)
    test(uniform_random, 40000, 4000)
    test(uniform_random, 40000, 5000)
    test(uniform_random, 40000, 10000)
    test(uniform_random, 40000, 20000)

    test(uniform_random, 50000, 100)
    test(uniform_random, 50000, 150)
    test(uniform_random, 50000, 200)
    test(uniform_random, 50000, 500)
    test(uniform_random, 50000, 1000)
    test(uniform_random, 50000, 2000)
    test(uniform_random, 50000, 3000)
    test(uniform_random, 50000, 6000)
    test(uniform_random, 50000, 12000)
    test(uniform_random, 50000, 25000)

    test(uniform_random, 60000, 100)
    test(uniform_random, 60000, 150)
    test(uniform_random, 60000, 200)
    test(uniform_random, 60000, 500)
    test(uniform_random, 60000, 1000)
    test(uniform_random, 60000, 2000)
    test(uniform_random, 60000, 3000)
    test(uniform_random, 60000, 5000)
    test(uniform_random, 60000, 7000)
    test(uniform_random, 60000, 15000)
    test(uniform_random, 60000, 30000)

    test(uniform_random, 100000, 100)
    test(uniform_random, 100000, 150)
    test(uniform_random, 100000, 200)
    test(uniform_random, 100000, 300)
    test(uniform_random, 100000, 400)
    test(uniform_random, 100000, 500)
    test(uniform_random, 100000, 1000)
    test(uniform_random, 100000, 2000)
    test(uniform_random, 100000, 3000)
    test(uniform_random, 100000, 4000)
    test(uniform_random, 100000, 5000)
    test(uniform_random, 100000, 10000)
    test(uniform_random, 100000, 20000)
    test(uniform_random, 100000, 50000)

    test(uniform_random, 1000000, 100)
    test(uniform_random, 1000000, 200)
    test(uniform_random, 1000000, 500)
    test(uniform_random, 1000000, 750)
    test(uniform_random, 1000000, 1000)
    test(uniform_random, 1000000, 1500)
    test(uniform_random, 1000000, 2000)
    test(uniform_random, 1000000, 3000)
    test(uniform_random, 1000000, 4000)
    test(uniform_random, 1000000, 5000)
    test(uniform_random, 1000000, 10000)
    test(uniform_random, 1000000, 100000)
    test(uniform_random, 1000000, 500000)

def normal_tests():
    test(normal, 500000, 1)
    test(normal, 500000, 2)
    test(normal, 500000, 3)
    test(normal, 500000, 4)
    test(normal, 500000, 5)
    test(normal, 500000, 6)
    test(normal, 500000, 7)
    test(normal, 500000, 8)
    test(normal, 500000, 9)
    test(normal, 500000, 10)
    test(normal, 500000, 15)
    test(normal, 500000, 20)
    test(normal, 500000, 30)
    test(normal, 500000, 40)
    test(normal, 500000, 50)
    test(normal, 500000, 75)
    test(normal, 500000, 100)
    test(normal, 500000, 150)
    test(normal, 500000, 200)
    test(normal, 500000, 300)
    test(normal, 500000, 400)
    test(normal, 500000, 500)
    test(normal, 500000, 600)
    test(normal, 500000, 800)
    test(normal, 500000, 1000)

def binomial_tests():
    test(binomial, 500000, 100, 0.05)
    test(binomial, 500000, 100, 0.1)
    test(binomial, 500000, 100, 0.2)
    test(binomial, 500000, 100, 0.3)
    test(binomial, 500000, 100, 0.4)
    test(binomial, 500000, 100, 0.5)

    test(binomial, 500000, 1000, 0.1)
    test(binomial, 500000, 1000, 0.2)
    test(binomial, 500000, 1000, 0.3)
    test(binomial, 500000, 1000, 0.4)
    test(binomial, 500000, 1000, 0.5)

    test(binomial, 500000, 2000, 0.5)
    test(binomial, 500000, 3000, 0.5)
    test(binomial, 500000, 4000, 0.5)
    test(binomial, 500000, 5000, 0.5)

def exponential_tests():
    test(exponential, 500000, 0.25)
    test(exponential, 500000, 0.5)
    test(exponential, 500000, 1)
    test(exponential, 500000, 2)
    test(exponential, 500000, 3)
    test(exponential, 500000, 4)
    test(exponential, 500000, 5)
    test(exponential, 500000, 10)
    test(exponential, 500000, 20)
    test(exponential, 500000, 30)

def laplace_tests():
    test(laplace, 500000, 0.1)
    test(laplace, 500000, 0.25)
    test(laplace, 500000, 0.5)
    test(laplace, 500000, 1)
    test(laplace, 500000, 2)
    test(laplace, 500000, 3)
    test(laplace, 500000, 4)
    test(laplace, 500000, 5)
    test(laplace, 500000, 10)
    test(laplace, 500000, 25)
    test(laplace, 500000, 50)
    test(laplace, 500000, 100)

def pareto_tests():
    test(pareto, 500000, 1)
    test(pareto, 500000, 2)
    test(pareto, 500000, 3)
    test(pareto, 500000, 4)
    test(pareto, 500000, 5)

def multimodal_tests():
    test(multimodal, 500000, 2, 100, 10)
    test(multimodal, 500000, 3, 50, 10)
    test(multimodal, 500000, 5, 30, 10)
    test(multimodal, 1000000, 10, 20, 10)
    test(multimodal, 2000000, 20, 20, 15)

uniform_correlated_tests()
uniform_random_tests()
normal_tests()
binomial_tests()
exponential_tests()
laplace_tests()
pareto_tests()
multimodal_tests()

Attachment: MCV_queries.sql
Attachment: mcv_results_JCN_0317_full.csv.gz