Duplicate WHERE condition changes performance and plan
Hi, We have an odd issue where specifying the same where clause twice causes PG to pick a much more efficent plan. We would like to know why. Query A (this is the 'slow' query): UPDATE problem_instance SET processed = false FROM problem WHERE problem.id = problem_instance.problem_id AND problem.status != 2 AND processed = true; Query B (this is the 'fast' query): UPDATE problem_instance SET processed = false FROM problem WHERE problem.id = problem_instance.problem_id AND problem.status != 2 AND problem.status != 2 AND processed = true; The EXPLAIN ANALYZE for both queries can be found here:- Query A: https://explain.depesz.com/s/lFuy Query B: https://explain.depesz.com/s/Jqmv The table definitions (including the indexes) can be found here:- public.problem: https://gist.github.com/indy-singh/e90ee6d23d053d32c2564501720353df public.problem_instance: https://gist.github.com/indy-singh/3c77096b91c89428752cf314d8e20286 Data stats:- public.problem has around 10,000 rows and once the condition status != 2 is applied there are around 800 rows left. public.problem_instance has around 592,000 rows and once the condition processed = true is applied there are around 370,000 rows left. PG version: PostgreSQL 9.5.19 on x86_64-pc-linux-gnu (Debian 9.5.19-1.pgdg90+1), compiled by gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170516, 64-bit -- SELECT relname, relpages, reltuples, relallvisible, relkind, relnatts, relhassubclass, reloptions, pg_table_size(oid) FROM pg_class WHERE relname='TABLE_NAME' Table metadata:- public.problem: https://gist.github.com/indy-singh/ff34a3b6e45432ea4be2bf0b5038e0be public.problem_instance: https://gist.github.com/indy-singh/a09fe66c8a8840b7661ce9726ebcab71 Last Vacuum:- public.problem: 2020-04-14 23:11:47.51056+01 public.problem_instance: 2020-04-14 20:11:04.187138+01 Last Analyze: public.problem: 2020-04-14 23:11:47.592878+01 public.problem_instance: 2020-04-14 20:11:04.508432+01 Server Configuration: https://gist.github.com/indy-singh/8386d59206af042d365e5cd49fbae68f I tried my best getting all the information up front, please let me know if I missed anything. Thanks, Indy
Re: Duplicate WHERE condition changes performance and plan
> Starting with PostgreSQL 9.6, foreign keys are also used to help with > join selectivity estimations. I see you have a suitable foreign key > from the schema you posted. You might want to add that to the list of > reasons to upgrade. Apologies for the delay in response. I've had "PostgreSQL 9.6.3, compiled by Visual C++ build 1800, 64-bit" setup at home for a while and after importing the data across I'm still seeing the same behaviour. Even after upgrading my local install of PG to "PostgreSQL 12.2, compiled by Visual C++ build 1914, 64-bit" and I'm still seeing the same behaviour. Plans for PG12:- Query A: https://explain.depesz.com/s/zrVD Query B: https://explain.depesz.com/s/ZLWe The settings for my home setup are left at default, nothing special. Indy
Re: Duplicate WHERE condition changes performance and plan
> If you're using SSD storage, or if the DB is small compared with > shared_buffers or RAM, then random_page_cost should be closer to > seq_page_cost. I don't *think* we are using SSDs but I'll need to confirm that though. > How large are the indexes? problem_id_idx1 ? Using the query from here: https://wiki.postgresql.org/wiki/Index_Maintenance#Index_size.2Fusage_statistics Output here: https://gist.github.com/indy-singh/e33eabe5cc937043c93b42a8783b3bfb I've setup a repo here where it is possible to reproduce the weird behaviour I'm getting:- https://github.com/indy-singh/postgres-duplicate-where-conditon That contains the data (amended to remove any private information) as well as the statements need to recreate tables, indices, and constraints, I think after some trial and error this is something to do with the size of the table and statistics. I've been trying to put together a Short, Self Contained, Correct example (http://sscce.org/) and the problem only appears when fill problem_instance.message with junk, but I have to do it in two steps as outlined in the README in repo. Indy
Re: Duplicate WHERE condition changes performance and plan
> I don't *think* we are using SSDs but I'll need to confirm that though. Confirmed we are not using SSDs but '10K RPM SAS in RAID-10.' I've also been hunt for other queries that show this behaviour too, and I've found one. The PG settings/versions will be different in this example due to the earlier example being for our internal CI/CD tool which is hosted a on local instance of PG. This example is directly from our production servers. Query C (slow):- SELECT COUNT(1) FROM proposal.proposal INNER JOIN proposal.note ON proposal.note.proposal_reference = proposal.proposal.reference WHERE 1 = 1 AND proposal.proposal.system_id = 11 AND proposal.proposal.legacy_organisation_id IN (6, 7, 11, 16, 18, 44, 200, 218, 233, 237, 259, 47) AND proposal.proposal.has_been_anonymised = false AND proposal.note.legacy_read_by IS NULL AND proposal.note.type_id IN (1, 4, 9) AND proposal.note.entry_time > '2020-04-01' AND proposal.note.entry_time < '2020-05-01'; Query D (fast):- SELECT COUNT(1) FROM proposal.proposal INNER JOIN proposal.note ON proposal.note.proposal_reference = proposal.proposal.reference WHERE 1 = 1 AND proposal.proposal.system_id = 11 AND proposal.proposal.legacy_organisation_id IN (6, 7, 11, 16, 18, 44, 200, 218, 233, 237, 259, 47) AND proposal.proposal.has_been_anonymised = false AND proposal.proposal.has_been_anonymised = false AND proposal.note.legacy_read_by IS NULL AND proposal.note.type_id IN (1, 4, 9) AND proposal.note.entry_time > '2020-04-01' AND proposal.note.entry_time < '2020-05-01'; The EXPLAIN ANALYZE for both queries can be found here:- Query C: https://explain.depesz.com/s/5Mbu Query D: https://explain.depesz.com/s/jVnH The table definitions (including the indexes) can be found here:- proposal.proposal: https://gist.github.com/indy-singh/6ccd86ff859e7cdad2ec1bf73a61445c proposal.note: https://gist.github.com/indy-singh/6c1f85ad15cb92e138447a91d8cf3ecb Data stats:- proposal.proposal has 10,324,779 rows and once the table specific conditions are applied there are 39,223 rows left. proposal.note has 28,97,698 rows and once the table specific conditions are applied there are 54,359 rows left. PG version: PostgreSQL 9.5.17 on x86_64-pc-linux-gnu (Debian 9.5.17-1.pgdg90+1), compiled by gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170516, 64-bit -- SELECT relname, relpages, reltuples, relallvisible, relkind, relnatts, relhassubclass, reloptions, pg_table_size(oid) FROM pg_class WHERE relname='TABLE_NAME' Table metadata:- proposal.proposal: https://gist.github.com/indy-singh/24e7ec8f3d4e2c3ac73f724cea52f9de proposal.note: https://gist.github.com/indy-singh/104d6ec7ef8179461eb4f91c121615e0 Index Stats:- proposal.proposal: https://gist.github.com/indy-singh/1d41d15addb543bcdafc8641b9d7f036 proposal.note: https://gist.github.com/indy-singh/7a698dec98dd8ef2808345d1802e6b6a Last Vacuum:- proposal.proposal: Never proposal.note: 2020-04-17 15:10:57.256013+01 Last Analyze: proposal.proposal: Never proposal.note: 2020-04-07 11:48:49.689622+01 Server Configuration: https://gist.github.com/indy-singh/b19134873f266ee6ce2b9815504d130c Indy