On 12.8.24 19:25, Tomas Vondra wrote:
> Is TPC-B really interesting/useful for this patch? The queries are super simple, with only a single clause (so it may not even get to the code handling extended statistics). Did you create any extended stats?
No, I did not create any extended statistics there; I simply wanted to verify that other queries are not slowed down after applying my patch.
> I think you'll need to construct a custom test, with queries that have multiple (var op var) clauses, extended stats created, etc. And benchmark that.
I used the test generator from a previous thread [1] and ran it with default_statistics_target = 1000 to get more accurate estimates for 3000 rows. It would also be useful to run the tests with 10,000 and 100,000 rows for a broader perspective. I've attached the Python test. Here's a breakdown of the issues:
1. (A op A) clauses: Before my patch, estimates for expressions like (A op A) were poor. Currently we only get a correct estimate for the (A = A) clause, which is transformed into A IS NOT NULL. Should I address this in this thread? I believe we should extend the same correction to clauses like (A != A), (A < A), and similar conditions; however, that issue is not for the current thread.

2. AND clauses: Estimates for AND clauses were inaccurate before my patch. I noticed code segments where I could add handling specific to the (Var op Var) clause, but I'm unsure whether I'm missing anything crucial. If my understanding is incorrect, I'd appreciate any guidance or corrections.
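For illustration, a small sketch like the following can show both cases side by side. It is only an illustration, not part of the attached generator: it assumes a populated table t(a, b, c, d) and the check_estimated_rows() helper the attached script relies on, and connects to one of the instances from the script.

    import psycopg2

    # Two example clauses: an (A op A) clause, which can never be true but
    # typically falls back to the default inequality selectivity (~33%),
    # and an AND of two (Var op Var) clauses.
    queries = [
        "select * from t where a < a",
        "select * from t where a < b and b < c",
    ]

    conn = psycopg2.connect('host=localhost port=5001 user=postgres dbname=postgres')
    cur = conn.cursor()
    for q in queries:
        cur.execute("select * from check_estimated_rows(%s)", (q,))
        estimated, actual = cur.fetchone()
        print('%-45s estimated=%s actual=%s' % (q, estimated, actual))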
> FWIW I don't think it makes sense to benchmark the query execution - if the estimate improves, it's possible to get arbitrary speedup, but that's expected and mostly irrelevant I think. What I'd focus on is benchmarking just the query planning - we need the overhead to be negligible (or at least small) so that it does not hurt people with already good plans.
>
> BTW can you elaborate why you are interested in this patch? Do you just think it's interesting/useful, or do you have a workload where it would actually help? I'm asking because me being uncertain how beneficial this is in practice (not just being nice in theory) was one of the reasons why I didn't do more work on this in 2021.
>
> regards
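For the planning side, a sketch like the following could measure just the planner overhead (it is only a sketch: it reads the "Planning Time" line printed by EXPLAIN (SUMMARY ON) and assumes the same table t and one of the instances from the attached script):

    import statistics
    import psycopg2

    # Repeatedly plan (but do not execute) a query with several (Var op Var)
    # clauses and collect the "Planning Time" reported by EXPLAIN (SUMMARY ON).
    query = "select * from t where a < b and b < c and not (c = d)"

    conn = psycopg2.connect('host=localhost port=5001 user=postgres dbname=postgres')
    cur = conn.cursor()

    timings = []
    for _ in range(1000):
        cur.execute("explain (summary on) " + query)
        for (line,) in cur.fetchall():
            if line.startswith('Planning Time:'):
                timings.append(float(line.split()[2]))   # value is in milliseconds

    print('median planning time: %.3f ms' % statistics.median(timings))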
I have two reasons for pursuing this. Firstly, I've encountered some of these queries in practice, although they are quite rare. While it might be easy to dismiss these cases due to their infrequency, I believe that we shouldn't overlook the opportunity to develop better handling for them, regardless of how seldom they occur.
Secondly, I see that you're working on improving estimates for JOIN clauses in thread [2]. I believe that enhancing estimates for these rare cases could also benefit future work on JOIN queries, particularly those with multiple ON (T1.column = T2.column) conditions, which are essentially (Var op Var) clauses. My idea is to start with non-JOIN queries, and then apply the same approach to improve JOIN estimates. Of course, I might be wrong, but I think this approach has potential.
P.S. Sorry if you received this mail twice. I wanted to send the test results as well, but the message did not reach -hackers because of the size of the attached file, so now I'm sending only the test.
[1]: https://www.postgresql.org/message-id/ecc0b08a-518d-7ad6-17ed-a5e962fc4f5f%40enterprisedb.com
[2]: https://www.postgresql.org/message-id/flat/c8c0ff31-3a8a-7562-bbd3-78b2ec65f16c%40enterprisedb.com
--
Regards,
Ilia Evdokimov,
Tantor Labs LCC.
#!/usr/bin/python3

import psycopg2
import random
import select
import hashlib


def generate_conditions(nclauses, attributes = ['a', 'b', 'c', 'd'], operators = ['<', '<=', '=', '!=', '>=', '>']):
    if nclauses == 1:
        cols = [random.choice(attributes), random.choice(attributes)]
        oper = ' ' + random.choice(operators) + ' '
        clause = oper.join(cols)

        if random.randint(0,100) < 50:
            clause = 'NOT ' + clause

        return clause

    nparts = random.randint(2, nclauses)

    # distribute the clauses between query parts
    nclauses_parts = [1 for p in range(0, nparts)]
    for x in range(0, nclauses - nparts):
        nclauses_parts[random.randint(0, nparts) - 1] += 1

    parts = []
    for p in range(0, nparts):
        parts.append('(' + generate_conditions(nclauses_parts[p], attributes, operators) + ')')

    c = random.choice([' AND ', ' OR '])

    return c.join(parts)


def generate_data(nrows, attributes = ['a', 'b', 'c', 'd']):
    sql = 'insert into t (' + ','.join(attributes) + ') select '

    attrs = []
    for attr in attributes:
        x = random.choice([-1, 2, 5, 10, 20, 30])
        if x == -1:
            x = random.randint(5, 20)
            expr = '(random() * ' + str(x) + ')::int'
        else:
            expr = 'mod(i,' + str(x) + ')'

        if random.randint(0,100) < 50:
            x = random.choice([2, 5, 10, 20, 30])
            attrs.append('case when mod(i,' + str(x) + ') = 0 then null else ' + expr + ' end')
        else:
            attrs.append(expr)

    sql += ', '.join(attrs) + ' from generate_series(1,' + str(nrows) + ') s(i)'

    return sql


def wait(conn):
    while True:
        state = conn.poll()
        if state == psycopg2.extensions.POLL_OK:
            break
        elif state == psycopg2.extensions.POLL_WRITE:
            select.select([], [conn.fileno()], [])
        elif state == psycopg2.extensions.POLL_READ:
            select.select([conn.fileno()], [], [])
        else:
            raise psycopg2.OperationalError("poll() returned %s" % state)


def run_everywhere(conns, queries):
    curs = [conn.cursor() for conn in conns]
    for q in queries:
        [cur.execute(q) for cur in curs]
        [wait(conn) for conn in conns]
    [cur.close() for cur in curs]


if __name__ == '__main__':
    conns = [
        psycopg2.connect('host=localhost port=5001 user=postgres dbname=postgres', async_=True),
        psycopg2.connect('host=localhost port=5002 user=postgres dbname=postgres', async_=True),
        psycopg2.connect('host=localhost port=5003 user=postgres dbname=postgres', async_=True),
        psycopg2.connect('host=localhost port=5004 user=postgres dbname=postgres', async_=True)]

    [wait(conn) for conn in conns]

    curs = [conn.cursor() for conn in conns]

    # 100 data sets
    for cnt in [30000, 100000, 1000000]:
        for d in range(1,100):
            # generate the data on all versions
            data_sql = generate_data(cnt)
            data_md5 = hashlib.md5(data_sql.encode('utf-8')).hexdigest()

            with open('data.csv', 'a') as f:
                f.write('%s\t%s\n' % (data_md5, data_sql))

            run_everywhere(conns, ['drop table if exists t',
                                   'create table t (a int, b int, c int, d int)',
                                   data_sql, 'commit', 'analyze'])

            # generate the clauses
            conditions = []
            for c in range(1, 6):
                for q in range(1,100):
                    conditions.append({'clauses' : c, 'conditions' : generate_conditions(c)})

            with open('results.csv', 'a') as f:
                for conds in conditions:
                    sql = "select * from check_estimated_rows('select * from t where " + conds['conditions'] + "')"
                    [cur.execute(sql) for cur in curs]
                    [wait(conn) for conn in conns]
                    r = [cur.fetchone() for cur in curs]
                    actual_rows = r[0][1]
                    estimated_rows = [str(x[0]) for x in r]
                    f.write(('%s\t%s\t%s\t%s\t%s\t%s\t%s\n') % (data_md5, cnt, conds['clauses'], conds['conditions'], 'no', actual_rows, '\t'.join(estimated_rows)))

            run_everywhere(conns, ['create statistics s (mcv) on a, b, c, d from t', 'commit', 'analyze'])

            with open('results.csv', 'a') as f:
                for conds in conditions:
                    sql = "select * from check_estimated_rows('select * from t where " + conds['conditions'] + "')"
                    [cur.execute(sql) for cur in curs]
                    [wait(conn) for conn in conns]
                    r = [cur.fetchone() for cur in curs]
                    actual_rows = r[0][1]
                    estimated_rows = [str(x[0]) for x in r]
                    f.write(('%s\t%s\t%s\t%s\t%s\t%s\t%s\n') % (data_md5, cnt, conds['clauses'], conds['conditions'], 'yes', actual_rows, '\t'.join(estimated_rows)))
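One note on the script: it assumes a check_estimated_rows() helper already exists in every database it connects to, similar to the one defined in PostgreSQL's stats_ext regression test. A minimal way to install such a helper on the four instances the script expects could look like this (a sketch only; the function body mirrors the regression-test helper, and ports are taken from the script above):

    import psycopg2

    # Helper that runs EXPLAIN ANALYZE and extracts the estimated and actual
    # row counts from the first plan line that contains them.
    helper_sql = r"""
    create or replace function check_estimated_rows(text)
    returns table (estimated int, actual int)
    language plpgsql as $$
    declare
        ln  text;
        tmp text[];
    begin
        for ln in execute format('explain analyze %s', $1) loop
            tmp := regexp_match(ln, 'rows=(\d*) .* rows=(\d*)');
            if tmp is not null then
                return query select tmp[1]::int, tmp[2]::int;
                return;
            end if;
        end loop;
    end;
    $$;
    """

    for port in (5001, 5002, 5003, 5004):
        conn = psycopg2.connect('host=localhost port=%d user=postgres dbname=postgres' % port)
        conn.autocommit = True
        with conn.cursor() as cur:
            cur.execute(helper_sql)
        conn.close()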