On 12.8.24 19:25, Tomas Vondra wrote:

> Is TPC-B really interesting/useful for this patch? The queries are super
> simple, with only a single clause (so it may not even get to the code
> handling extended statistics). Did you create any extended stats?

No, I didn't; I simply wanted to verify that other queries are not slowed down after applying my patch.

> I think you'll need to construct a custom test, with queries that have
> multiple (var op var) clauses, extended stats created, etc. And
> benchmark that.


I used the test generator from a previous thread [1] and ran it with default_statistics_target = 1000 to achieve more accurate estimates for 3000 rows. It would also be beneficial to run tests with 10,000 and 100,000 rows for a broader perspective. I've attached the Python test (a short sketch of what one test round does is included below, just before the references). Here’s a breakdown of the issues:

1. (A op A) clauses: Before applying my patch, there were poor estimates
   for expressions like (A op A). Currently, we only have correct
   estimates for the (A = A) clause, which is transformed into A IS NOT
   NULL. Should I address this in this thread? I believe we should
   extend the same correction to clauses like (A != A), (A < A), and
   similar conditions. However, this issue is not for the current
   thread. (See the first example below.)
2. AND clauses: The estimates for AND clauses were inaccurate before my
   patch. I noticed code segments where I could add something specific
   for the (Var op Var) clause, but I'm unsure whether I'm missing
   anything crucial. If my understanding is incorrect, I'd appreciate
   any guidance or corrections. (See the second example below.)
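
To make the first point concrete, a sketch on the test table t (a int, b int, c int, d int) created by the attached script:

    -- estimated well: (A = A) is transformed into A IS NOT NULL
    EXPLAIN SELECT * FROM t WHERE a = a;

    -- estimated poorly: these clauses can never be true, but no similar
    -- correction is applied to them at the moment
    EXPLAIN SELECT * FROM t WHERE a != a;
    EXPLAIN SELECT * FROM t WHERE a < a;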

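And for the second point, the AND case the attached test exercises, in its minimal form (again only a sketch):

    -- multi-column MCV statistics, as created by the attached script
    CREATE STATISTICS s (mcv) ON a, b, c, d FROM t;
    ANALYZE t;

    -- an AND of (Var op Var) clauses; this is where I would expect the
    -- extended-statistics code to help, but I may be missing code paths
    EXPLAIN SELECT * FROM t WHERE a < b AND c = d;
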
> FWIW I don't think it makes sense to benchmark the query execution - if
> the estimate improves, it's possible to get arbitrary speedup, but
> that's expected and mostly irrelevant I think.

> What I'd focus on is benchmarking just the query planning - we need the
> overhead to be negligible (or at least small) so that it does not hurt
> people with already good plans.
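
For benchmarking just the planning overhead, something along these lines should work (only a sketch):

    -- reports "Planning Time" without executing the query
    EXPLAIN (SUMMARY) SELECT * FROM t WHERE a < b AND c = d;
    -- or run a file of such EXPLAIN statements repeatedly, e.g. with
    -- "pgbench -n -f explain_only.sql" (the file name is just an example)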

> BTW can you elaborate why you are interested in this patch? Do you just
> think it's interesting/useful, or do you have a workload where it would
> actually help? I'm asking because me being uncertain how beneficial this
> is in practice (not just being nice in theory) was one of the reasons
> why I didn't do more work on this in 2021.

> regards


I have two reasons for pursuing this. Firstly, I've encountered some of these queries in practice, although they are quite rare. It might be easy to dismiss these cases because of their infrequency, but I believe we shouldn't pass up the opportunity to handle them better.

Secondly, I see that you're working on improving estimates for JOIN clauses in thread [2]. I believe that improving estimates for these rare cases could also benefit that future work on JOINs, particularly for queries with multiple ON (T1.column = T2.column) conditions, which are essentially (Var op Var) clauses. My idea is to start with non-JOIN queries and then apply the same approach to improve JOIN estimates. Of course, I might be wrong, but I think this approach has potential.
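
For illustration, the join shape I mean (with hypothetical tables t1 and t2); after the join, each ON condition is essentially a (Var op Var) clause:

    SELECT *
    FROM t1
    JOIN t2 ON t1.a = t2.a AND t1.b < t2.b;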


P.S. I'm sorry if I sent this mail twice. I originally wanted to send the test results as well, but that message did not reach -hackers because the attached file was too large. This time I'm sending only the test.
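
For reference, one round of the attached test boils down to roughly the following (a sketch; check_estimated_rows() is the helper the script calls, and the table matches what the script creates):

    SET default_statistics_target = 1000;

    CREATE TABLE t (a int, b int, c int, d int);
    -- the script fills t with random, partly correlated integer data
    ANALYZE t;

    -- a typical generated condition: several (Var op Var) clauses combined
    SELECT * FROM check_estimated_rows(
        'SELECT * FROM t WHERE (a < b AND NOT c = d) OR b >= c');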

[1]: https://www.postgresql.org/message-id/ecc0b08a-518d-7ad6-17ed-a5e962fc4f5f%40enterprisedb.com

[2]: https://www.postgresql.org/message-id/flat/c8c0ff31-3a8a-7562-bbd3-78b2ec65f16c%40enterprisedb.com

--
Regards,
Ilia Evdokimov,
Tantor Labs LCC.
#!/usr/bin/python3

import psycopg2
import random
import select
import hashlib

def generate_conditions(nclauses, attributes = ['a', 'b', 'c', 'd'], operators = ['<', '<=', '=', '!=', '>=', '>']):
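	# Build a random WHERE condition containing roughly 'nclauses' simple
	# (Var op Var) clauses, generated recursively and combined with AND/OR;
	# individual clauses may be negated with NOT.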

	if nclauses == 1:
		cols = [random.choice(attributes), random.choice(attributes)]
		oper = ' ' + random.choice(operators) + ' '

		clause = oper.join(cols)

		if random.randint(0,100) < 50:
			clause = 'NOT ' + clause

		return clause


	nparts = random.randint(2, nclauses)

	# distribute the clauses between query parts
	nclauses_parts = [1 for p in range(0, nparts)]

	for x in range(0, nclauses - nparts):
		# pick a random part (index 0 .. nparts-1) and give it one more clause
		nclauses_parts[random.randint(0, nparts - 1)] += 1

	parts = []
	for p in range(0, nparts):
		parts.append('(' + generate_conditions(nclauses_parts[p], attributes, operators) + ')')

	c = random.choice([' AND ', ' OR '])

	return c.join(parts)

def generate_data(nrows, attributes = ['a', 'b', 'c', 'd']):
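	# Build an INSERT ... SELECT statement that fills table t with nrows rows:
	# each column is either mod(i, x) (correlated with the row number) or a
	# random value, and about half of the columns get NULLs mixed in via CASE.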

	sql = 'insert into t (' + ','.join(attributes) + ') select '

	attrs = []
	for attr in attributes:

		x = random.choice([-1, 2, 5, 10, 20, 30])

		if x == -1:
			x = random.randint(5, 20)
			expr = '(random() * ' + str(x) + ')::int'
		else:
			expr = 'mod(i,' + str(x) + ')'

		if random.randint(0,100) < 50:
			x = random.choice([2, 5, 10, 20, 30])
			attrs.append('case when mod(i,' + str(x) + ') = 0 then null else ' + expr + ' end')
		else:
			attrs.append(expr)

	sql += ', '.join(attrs) + ' from generate_series(1,' + str(nrows) + ') s(i)'

	return sql

def wait(conn):
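	# Wait for an asynchronous psycopg2 connection to finish executing the
	# current command.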

	while True:
		state = conn.poll()
		if state == psycopg2.extensions.POLL_OK:
			break
		elif state == psycopg2.extensions.POLL_WRITE:
			select.select([], [conn.fileno()], [])
		elif state == psycopg2.extensions.POLL_READ:
			select.select([conn.fileno()], [], [])
		else:
			raise psycopg2.OperationalError("poll() returned %s" % state)

def run_everywhere(conns, queries):
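	# Execute the same list of queries on every connection (one per tested
	# server), waiting for each query to complete everywhere before moving on.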

	curs = [conn.cursor() for conn in conns]

	for q in queries:
		[cur.execute(q) for cur in curs]
		[wait(conn) for conn in conns]

	[cur.close() for cur in curs]

if __name__ == '__main__':

	conns = [
		psycopg2.connect('host=localhost port=5001 user=postgres dbname=postgres', async_=True),
		psycopg2.connect('host=localhost port=5002 user=postgres dbname=postgres', async_=True),
		psycopg2.connect('host=localhost port=5003 user=postgres dbname=postgres', async_=True),
		psycopg2.connect('host=localhost port=5004 user=postgres dbname=postgres', async_=True)]

	[wait(conn) for conn in conns]

	curs = [conn.cursor() for conn in conns]

	# for each row count, generate ~100 random data sets
	for cnt in [30000, 100000, 1000000]:

		for d in range(1,100):

			# generate the data on all versions
			data_sql = generate_data(cnt)
			data_md5 = hashlib.md5(data_sql.encode('utf-8')).hexdigest()

			with open('data.csv', 'a') as f:
				f.write('%s\t%s\n' % (data_md5, data_sql))

			run_everywhere(conns, ['drop table if exists t', 'create table t (a int, b int, c int, d int)', data_sql, 'commit', 'analyze'])

			# generate the clauses
			conditions = []
			for c in range(1, 6):
				for q in range(1,100):
					conditions.append({'clauses' : c, 'conditions' : generate_conditions(c)})

			with open('results.csv', 'a') as f:
				for conds in conditions:
					sql = "select * from check_estimated_rows('select * from t where " + conds['conditions'] + "')"

					[cur.execute(sql) for cur in curs]
					[wait(conn) for conn in conns]
					r = [cur.fetchone() for cur in curs]

					actual_rows = r[0][1]
					estimated_rows = [str(x[0]) for x in r]

					f.write(('%s\t%s\t%s\t%s\t%s\t%s\t%s\n') % (data_md5, cnt, conds['clauses'], conds['conditions'], 'no', actual_rows, '\t'.join(estimated_rows)))

			run_everywhere(conns, ['create statistics s (mcv) on a, b, c, d from t', 'commit', 'analyze'])

			with open('results.csv', 'a') as f:
				for conds in conditions:
					sql = "select * from check_estimated_rows('select * from t where " + conds['conditions'] + "')"

					[cur.execute(sql) for cur in curs]
					[wait(conn) for conn in conns]
					r = [cur.fetchone() for cur in curs]

					actual_rows = r[0][1]
					estimated_rows = [str(x[0]) for x in r]

					f.write(('%s\t%s\t%s\t%s\t%s\t%s\t%s\n') % (data_md5, cnt, conds['clauses'], conds['conditions'], 'yes', actual_rows, '\t'.join(estimated_rows)))
