[PERFORM] IN surpasses NOT EXISTS in 7.4RC2 ??
Hi,

NOT EXISTS is taking almost double the time of NOT IN. I know IN has been optimised in 7.4, but is anything wrong with NOT EXISTS? I have vacuumed, analyzed, and run the query many times; NOT IN is still faster than NOT EXISTS. :>

Regds
Mallah.

NOT IN PLAN

tradein_clients=# explain analyze SELECT count(*) from general.profile_master
                  where profile_id not in (select profile_id from general.account_profiles);
                                   QUERY PLAN
---------------------------------------------------------------------------------------------
 Aggregate  (cost=32238.19..32238.19 rows=1 width=0) (actual time=5329.206..5329.207 rows=1 loops=1)
   ->  Seq Scan on profile_master  (cost=4458.25..31340.38 rows=359125 width=0) (actual time=1055.496..4637.908 rows=470386 loops=1)
         Filter: (NOT (hashed subplan))
         SubPlan
           ->  Seq Scan on account_profiles  (cost=0.00..3817.80 rows=256180 width=4) (actual time=0.061..507.811 rows=256180 loops=1)
 Total runtime: 5337.591 ms
(6 rows)

NOT EXISTS PLAN

tradein_clients=# explain analyze SELECT count(*) from general.profile_master
                  where not exists (select profile_id from general.account_profiles
                                    where profile_id=general.profile_master.profile_id);
                                   QUERY PLAN
---------------------------------------------------------------------------------------------
 Aggregate  (cost=1674981.97..1674981.97 rows=1 width=0) (actual time=14600.386..14600.387 rows=1 loops=1)
   ->  Seq Scan on profile_master  (cost=0.00..1674084.16 rows=359125 width=0) (actual time=13.687..13815.798 rows=470386 loops=1)
         Filter: (NOT (subplan))
         SubPlan
           ->  Index Scan using account_profiles_profile_id on account_profiles  (cost=0.00..4.59 rows=2 width=4) (actual time=0.013..0.013 rows=0 loops=718250)
               Index Cond: (profile_id = $0)
 Total runtime: 14600.531 ms

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])
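A third, equivalent way to write this anti-join, sometimes useful for comparison on pre-7.4 servers, is an outer join filtered on NULL. This is only a sketch reusing the table names from the post, not something run against the schema in question:

```sql
-- Count profiles with no matching row in account_profiles.
-- Gives the same result as the NOT IN / NOT EXISTS forms above,
-- assuming profile_id is NOT NULL on both sides.
SELECT count(*)
  FROM general.profile_master pm
  LEFT JOIN general.account_profiles ap
         ON ap.profile_id = pm.profile_id
 WHERE ap.profile_id IS NULL;
```

In 7.4 the planner can often turn this into a merge or hash anti-join style plan, which is why it is worth comparing against the subquery forms.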
Re: [PERFORM] performance optimizations
> > Most clients will be interested in say the last 7 days worth of data?
> > Great.. Start out with 4GB ram on a good Dual CPU -- Opterons seem to
> > work quite well -- and make sure the motherboard can hold double that in
> > memory for an upgrade sometime next year when you've become popular.
>
> Unfortunately, the hardware available is pretty much fixed in regards to
> the system. I can play around with the raid configurations and have
> some limited choice in regards to the raid controller and number of
> drives, but that's about all in terms of hardware.

Good luck then. Unless the configuration takes into account incremental additions in RAM and disk, sustained growth could get very expensive. I guess that depends on the business plan expectations. This just puts more emphasis on offloading everything you can onto machines that can multiply.

> The current file system holding the user and email information indicates
> the current data has about 64GB (70K accounts; I'm not sure how many are
> active, but 50% might be a good guess). This seems to be somewhat of a
> steady state, however.

35k clients checking their mail daily isn't so bad. Around 10 pages per second peak load?
Re: [PERFORM] Value of Quad vs. Dual Processor machine
On Wed, 2003-11-12 at 09:28, Jeff wrote:
> On Tue, 11 Nov 2003 21:13:19 -0500, "Chris Field" <[EMAIL PROTECTED]> wrote:
> > we are looking at Xeon. We are currently running it on a quad Sun v880
> > compiled to be 64bit and have been getting dreadful performance. I
> > don't think we really have much to gain from going 64bit.
>
> By chance, are you running 7.3.4 on that Sun? If so, try this:
>
>     export CFLAGS=-O2
>     ./configure
>
> and rebuild PG.
>
> Before 7.4, PG was built with _no_ optimization on Solaris.
> Recompiling gives __HUGE__ (notice the underscores) performance gains.
>
> And on to dual vs. quad: PG will only use 1 CPU / connection / query.
> So if your machine will have 1-2 queries running at a time, those other 2
> procs will sit around idling. However, if you are going to have a bunch
> going, 4 CPUs will be most useful. One of the nicest things to do for
> PG is more RAM and fast I/O. It really loves those things.

We've just started kicking around the idea of moving one of our boxes to a quad-proc machine from a dual. Under normal circumstances the 2 processors handle maybe 200 transactions per second with 90% system idle. However, we have people who occasionally run historical reports on our data, and those reports are fairly CPU intensive. Usually that is not a problem for the main web system, but pg_dump is also CPU intensive, so when it is running we end up with two highly CPU-intensive jobs on the machine, and we start to notice issues on the main web system.

Robert Treat
--
Build A Brighter Lamp :: Linux Apache {middleware} PostgreSQL
Re: [PERFORM] IN surpasses NOT EXISTS in 7.4RC2 ??
Rajesh Kumar Mallah <[EMAIL PROTECTED]> writes:
> NOT EXISTS is taking almost double time than NOT IN .
> I know IN has been optimised in 7.4 but is anything
> wrong with the NOT EXISTS?

That's the expected behavior in 7.4. EXISTS in the style you are using it effectively forces a nestloop-with-inner-indexscan implementation. As of 7.4, IN can do that, but it can do several other things too, including the hash-type plan you have here. So assuming that the planner makes the right plan choice (not always a given ;-)), IN should be as fast or faster than EXISTS in all cases.

regards, tom lane
Re: [PERFORM] IN surpasses NOT EXISTS in 7.4RC2 ??
Rajesh Kumar Mallah wrote:
> NOT EXISTS is taking almost double time than NOT IN .
> I know IN has been optimised in 7.4 but is anything
> wrong with the NOT EXISTS?
>
> I have vacuumed, analyzed and run the query many times;
> still not in is faster than exists :>

Seems fine. In 7.4, NOT IN will often be faster than NOT EXISTS. NOT EXISTS didn't change --- there are restrictions on how far we can optimize NOT EXISTS. NOT IN has just become much faster in 7.4.

--
Bruce Momjian                      | http://candle.pha.pa.us
[EMAIL PROTECTED]                  | (610) 359-1001
+ If your life is a hard drive,    | 13 Roberts Road
+ Christ can be your backup.       | Newtown Square, Pennsylvania 19073
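One caveat worth keeping in mind when rewriting NOT EXISTS as NOT IN: the two are not equivalent if the subquery can produce NULLs. A NULL in the NOT IN list makes the whole predicate evaluate to NULL, so no rows qualify, while NOT EXISTS is unaffected. A minimal sketch with hypothetical tables, just to illustrate the pitfall:

```sql
-- Hypothetical data: t(a) = {1, 2};  s(b) = {1, NULL}

-- NOT EXISTS: returns the row a=2, since no s.b equals 2.
SELECT a FROM t
 WHERE NOT EXISTS (SELECT 1 FROM s WHERE s.b = t.a);

-- NOT IN: returns no rows at all, because "2 NOT IN (1, NULL)"
-- evaluates to NULL rather than true.
SELECT a FROM t
 WHERE a NOT IN (SELECT b FROM s);
```

This is why a NOT IN rewrite over a nullable column generally needs an `IS NOT NULL` filter inside the subquery, as in the email example later in this thread.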
Re: [PERFORM] IN surpasses NOT EXISTS in 7.4RC2 ??
Robert Treat wrote:
> It is believed that the IN optimization can lead to faster IN times than
> EXISTS times on some queries, the extent of which is still a bit of an
> unknown. (Incidentally, is there an FAQ item on this that needs updating?)

Thanks everyone for clarifying. It's really a nice thing to see IN working so well, because SQL using IN is easier to read. It looks like NOT IN is indifferent to indexes whereas IN uses indexes -- is that true? Do indexes affect the new manner in which IN works in 7.4?

> Does the not exists query produce worse results in 7.4 than it did in 7.3?

Will surely post the observation sometime.

Regards
Mallah.

> [original message and query plans snipped]
Re: [PERFORM] IN surpasses NOT EXISTS in 7.4RC2 ??
It is believed that the IN optimization can lead to faster IN times than EXISTS times on some queries, the extent of which is still a bit of an unknown. (Incidentally, is there an FAQ item on this that needs updating?)

Does the not exists query produce worse results in 7.4 than it did in 7.3?

Robert Treat

On Thu, 2003-11-13 at 02:53, Rajesh Kumar Mallah wrote:
> NOT EXISTS is taking almost double time than NOT IN .
> I know IN has been optimised in 7.4 but is anything
> wrong with the NOT EXISTS?
>
> I have vaccumed , analyze and run the query many times
> still not in is faster than exists :>
>
> [query plans snipped]

--
Build A Brighter Lamp :: Linux Apache {middleware} PostgreSQL
Re: [PERFORM] IN surpasses NOT EXISTS in 7.4RC2 ??
Robert Treat <[EMAIL PROTECTED]> writes:
> Does the not exist query produce worse results in 7.4 than it did in
> 7.3?

EXISTS should work the same as before.

regards, tom lane
Re: [PERFORM] IN surpasses NOT EXISTS in 7.4RC2 ??
On Thu, 2003-11-13 at 12:00, Tom Lane wrote:
> Robert Treat <[EMAIL PROTECTED]> writes:
> > Does the not exist query produce worse results in 7.4 than it did in
> > 7.3?
>
> EXISTS should work the same as before.

Right. The original poster is asking if there is "something wrong with exists" based on the comparison to IN; he needs to compare it vs. 7.3 EXISTS to determine if something is wrong.

Robert Treat
--
Build A Brighter Lamp :: Linux Apache {middleware} PostgreSQL
[PERFORM] strange estimate for number of rows
Hi all,

I've got one here that I cannot fathom. Any suggestions?

We have a table, call it tablename, where we're selecting by a range of dates and an identifier. (This is redacted, obviously):

\d tablename
      Column      |           Type           | Modifiers
------------------+--------------------------+-----------
 id               | integer                  | not null
 transaction_date | timestamp with time zone | not null
 product_id       | integer                  | not null
Indexes:
    "trans_posted_trans_date_idx" btree (transaction_date, product_id)

The statistics on transaction_date and product_id are set to 1000. Everything is all analysed nicely. But I'm getting a poor plan, because of an estimate that the number of rows to be returned is about double how many actually are:

explain analyse select * from transactions_posted where transaction_date >= '2003-9-1' and transaction_date < '2003-10-1' and product_id = 2;

 Seq Scan on transactions_posted  (cost=0.00..376630.33 rows=700923 width=91) (actual time=8422.253..36176.078 rows=316029 loops=1)
   Filter: ((transaction_date >= '2003-09-01 00:00:00-04'::timestamp with time zone) AND (transaction_date < '2003-10-01 00:00:00-04'::timestamp with time zone) AND (product_id = 2))
 Total runtime: 36357.630 ms
(3 rows)

SET enable_seqscan = off;
explain analyse select * from transactions_posted where transaction_date >= '2003-9-1' and transaction_date < '2003-10-1' and product_id = 2;

 Index Scan using trans_posted_trans_date_idx on transactions_posted  (cost=0.00..1088862.56 rows=700923 width=91) (actual time=35.214..14816.257 rows=316029 loops=1)
   Index Cond: ((transaction_date >= '2003-09-01 00:00:00-04'::timestamp with time zone) AND (transaction_date < '2003-10-01 00:00:00-04'::timestamp with time zone) AND (product_id = 2))
 Total runtime: 15009.816 ms
(3 rows)

SELECT attname,null_frac,avg_width,n_distinct,correlation FROM pg_stats
 WHERE tablename = 'transactions_posted' AND attname in ('transaction_date','product_id');

     attname      | null_frac | avg_width | n_distinct | correlation
------------------+-----------+-----------+------------+-------------
 product_id       |         0 |         4 |          2 |    0.200956
 transaction_date |         0 |         8 |  -0.200791 |    0.289248

Any ideas? I'm loath to recommend CLUSTER, since the data will not stay clustered.

A
--
Andrew Sullivan                  | 204-4141 Yonge Street
Afilias Canada                   | Toronto, Ontario Canada
<[EMAIL PROTECTED]>              | M2P 2A8
                                 | +1 416 646 3304 x110
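Since the slow query filters a date range for one specific product_id, one workaround that avoids CLUSTER is a partial index, which keeps the per-product index small and cheap to scan. This is only a sketch against the redacted schema in the post, with a hypothetical index name, and would need testing:

```sql
-- Hypothetical partial index: covers only rows for product_id = 2,
-- so date-range scans for that product walk a much smaller index.
CREATE INDEX trans_posted_date_p2_idx
    ON transactions_posted (transaction_date)
 WHERE product_id = 2;

ANALYZE transactions_posted;
```

Note the query must repeat the `product_id = 2` condition verbatim for the planner to consider the partial index, and with only 2 distinct product_ids this only pays off if the target product is a minority of the table.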
[PERFORM] Union+group by planner estimates way off?
On 7.4 RC2, I'm seeing a case where the query planner estimates are way out of line after grouping the result of a union. I've tried adjusting the statistics targets up to 200, and it made no difference in the planner's estimates. The point of the full query this came from is that it also has an aggregate function that produces a space-delimited list of commodity & fak for each id. Does anyone have any suggestions on tweaks to apply or ways to rewrite this? Is this one of those ugly corners where the query planner doesn't have a clue how to estimate this (seeing the nice round 200 estimate makes me suspicious)?

EXPLAIN ANALYZE
SELECT id FROM
   (SELECT id, commodity FROM commodities WHERE commodity IS NOT NULL
    UNION
    SELECT id, fak FROM commodities WHERE fak IS NOT NULL
   ) all_commodities
GROUP BY id;

 HashAggregate  (cost=15939.16..15939.16 rows=200 width=4) (actual time=3537.281..3680.418 rows=83306 loops=1)
   ->  Subquery Scan all_commodities  (cost=14002.00..15697.02 rows=96858 width=4) (actual time=2268.052..3214.996 rows=95715 loops=1)
         ->  Unique  (cost=14002.00..14728.44 rows=96858 width=15) (actual time=2268.043..2881.688 rows=95715 loops=1)
               ->  Sort  (cost=14002.00..14244.15 rows=96858 width=15) (actual time=2268.037..2527.083 rows=18 loops=1)
                     Sort Key: id, commodity
                     ->  Append  (cost=0.00..5034.42 rows=96858 width=15) (actual time=7.402..1220.320 rows=18 loops=1)
                           ->  Subquery Scan "*SELECT* 1"  (cost=0.00..2401.23 rows=36831 width=15) (actual time=7.398..590.004 rows=39772 loops=1)
                                 ->  Seq Scan on commodities  (cost=0.00..2032.92 rows=36831 width=15) (actual time=7.388..468.415 rows=39772 loops=1)
                                       Filter: (commodity IS NOT NULL)
                           ->  Subquery Scan "*SELECT* 2"  (cost=0.00..2633.19 rows=60027 width=14) (actual time=0.016..408.160 rows=60236 loops=1)
                                 ->  Seq Scan on commodities  (cost=0.00..2032.92 rows=60027 width=14) (actual time=0.010..221.635 rows=60236 loops=1)
                                       Filter: (fak IS NOT NULL)
 Total runtime: 3783.009 ms
(13 rows)
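Since the reduced example above only needs the distinct ids (the GROUP BY throws the commodity values away), one way to sidestep the union entirely is a single pass over the table. This is an untested sketch against the poster's schema, and it only applies when the surrounding query does not actually need the per-commodity rows:

```sql
-- Distinct ids having at least one non-null commodity or fak,
-- in one scan instead of UNION + sort + unique.
SELECT id
  FROM commodities
 WHERE commodity IS NOT NULL
    OR fak IS NOT NULL
 GROUP BY id;
```

Because the aggregate now sits directly on commodities rather than on a sub-select, the planner can use the table's own statistics for the group-count estimate instead of falling back to a default.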
Re: [PERFORM] Union+group by planner estimates way off?
"Arthur Ward" <[EMAIL PROTECTED]> writes:
> EXPLAIN ANALYZE SELECT id FROM
> (SELECT id, commodity FROM commodities WHERE commodity IS NOT NULL
>  UNION
>  SELECT id, fak FROM commodities WHERE fak IS NOT NULL
> ) all_commodities GROUP BY id;
>
> HashAggregate  (cost=15939.16..15939.16 rows=200 width=4) (actual
> time=3537.281..3680.418 rows=83306 loops=1)
>   ->  Subquery Scan all_commodities  (cost=14002.00..15697.02 rows=96858
>       width=4) (actual time=2268.052..3214.996 rows=95715 loops=1)

It's falling back to a default estimate because it doesn't know how to find any statistics for the output of a sub-select. I have a TODO somewhere about burrowing down into sub-selects to see if the output maps directly to a column that we'd have stats for ... but it's not done yet.

In this particular case the inaccurate estimate doesn't matter too much, I think, although it might be encouraging the system to select hash aggregation since it thinks the hashtable will be pretty small. If the estimate were getting used to plan higher-up plan steps then it could be a bigger problem.

regards, tom lane
Re: [PERFORM] strange estimate for number of rows
Andrew Sullivan <[EMAIL PROTECTED]> writes:
> The statistics on transaction_date and product_id are set to 1000.
> Everything is all analysed nicely. But I'm getting a poor plan,
> because of an estimate that the number of rows to be returned is
> about double how many actually are:

> explain analyse select * from transactions_posted where
> transaction_date >= '2003-9-1' and transaction_date < '2003-10-1' and
> product_id = 2;

Are the estimates accurate for queries on the two columns individually, ie

    ... where transaction_date >= '2003-9-1' and transaction_date < '2003-10-1'

    ... where product_id = 2

If so, the problem is that there's a correlation between transaction_date and product_id, which the system cannot model because it has no multi-column statistics.

However, given that the estimate is only off by about a factor of 2, you'd still be getting the wrong plan even if the estimate were perfect, because the estimated costs differ by nearly a factor of 3. Given the actual runtimes, I'm thinking maybe you want to reduce random_page_cost. What are you using for that now?

regards, tom lane
Re: [PERFORM] Union+group by planner estimates way off?
> In this particular case the inaccurate estimate doesn't matter too much,
> I think, although it might be encouraging the system to select hash
> aggregation since it thinks the hashtable will be pretty small. If the
> estimate were getting used to plan higher-up plan steps then it could
> be a bigger problem.

That's my problem; this is a subselect feeding into a larger query. That wrong estimate causes the planner to select a nested loop at the next step up. At 83,000 rows, the word is "ouch!"

At any rate, I discovered this while dissecting a giant & slow query. Hence, while disabling nested-loop joins avoids this particular pitfall, it's not good for the bigger picture. I think I'm going to end up splitting this larger query into smaller parts and reassembling the pieces in the application so I can push some smarts past other subselect boundaries. For my purposes, that should skirt the issue of union+group estimates not being calculated.

As always, thanks for the fast answers!
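When one misestimate drives the planner into a bad nested loop, the planner override can at least be scoped to the single problem statement instead of the whole session. A sketch of the pattern, with the offending query left as a placeholder:

```sql
-- Discourage nested loops for one statement only; SET LOCAL
-- (available since 7.3) reverts automatically at transaction end.
BEGIN;
SET LOCAL enable_nestloop = off;
-- ... run the offending union+group query here ...
COMMIT;
```

This is a blunt instrument, as the poster notes, but confining it to a transaction keeps the rest of the workload planning normally.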
Re: [PERFORM] IN surpasses NOT EXISTS in 7.4RC2 ??
Tom Lane wrote:
> That's the expected behavior in 7.4. EXISTS in the style you are using it
> effectively forces a nestloop-with-inner-indexscan implementation. As of
> 7.4, IN can do that, but it can do several other things too, including
> the hash-type plan you have here. So assuming that the planner chooses
> the right plan choice (not always a given ;-)) IN should be as fast or
> faster than EXISTS in all cases.

Not in this case :) -- did I miss something silly?

tradein_clients=# explain SELECT count(*) from user_accounts where email is not null
                  and email not in (select email from profile_master where email is not null);
                                   QUERY PLAN
--------------------------------------------------------------------------------------
 Aggregate  (cost=9587726326.93..9587726326.93 rows=1 width=0)
   ->  Seq Scan on user_accounts  (cost=0.00..9587725473.40 rows=341412 width=0)
         Filter: ((email IS NOT NULL) AND (NOT (subplan)))
         SubPlan
           ->  Seq Scan on profile_master  (cost=0.00..25132.24 rows=674633 width=25)
                 Filter: (email IS NOT NULL)
(6 rows)

The query above does not return.

tradein_clients=# explain analyze SELECT count(*) from user_accounts where email is not null
                  and not exists (select email from profile_master where email=user_accounts.email);
                                   QUERY PLAN
--------------------------------------------------------------------------------------
 Aggregate  (cost=2850847.55..2850847.55 rows=1 width=0) (actual time=34075.100..34075.101 rows=1 loops=1)
   ->  Seq Scan on user_accounts  (cost=0.00..2849994.02 rows=341412 width=0) (actual time=8.066..34066.329 rows=3882 loops=1)
         Filter: ((email IS NOT NULL) AND (NOT (subplan)))
         SubPlan
           ->  Index Scan using profile_master_email on profile_master  (cost=0.00..35.60 rows=9 width=25) (actual time=0.044..0.044 rows=1 loops=686716)
                 Index Cond: ((email)::text = ($0)::text)
 Total runtime: 34075.213 ms
(7 rows)

--
Rajesh Kumar Mallah,
Infocom Network Limited, New Delhi
phone: +91(11)6152172 (221) (L), 9811255597 (M)
Visit http://www.trade-india.com , India's Leading B2B eMarketplace.
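Worth noting about the slow NOT IN plan above: it shows a plain "subplan", not the "hashed subplan" from the earlier profile_id example. In 7.4 the planner only hashes the subquery result when it estimates the hash table will fit within sort_mem, and 674633 rows of ~25-byte emails may have been judged too big. If that is the cause (an assumption; the poster's settings aren't shown), raising sort_mem for the session is one experiment:

```sql
-- Hypothetical experiment: give the subquery hash table more room.
-- sort_mem is in KB; 64000 (~64MB) is an arbitrary illustrative value.
SET sort_mem = 64000;

EXPLAIN SELECT count(*) FROM user_accounts
 WHERE email IS NOT NULL
   AND email NOT IN (SELECT email FROM profile_master
                      WHERE email IS NOT NULL);
-- Success looks like "Filter: (NOT (hashed subplan))" in the output.

RESET sort_mem;
```

Without the hash, each outer row rescans the subquery result, which matches the astronomical cost estimate and the query "not returning".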
[PERFORM] Storage space, RAM for NUMERIC
Folks,

How would I calculate storage space/required RAM for a 50-digit NUMERIC?

And the docs state that NUMERIC is slow. Is this just slow for calculations (due to the conversion to float & back) or slow for index lookups as well?

--
-Josh Berkus
Aglio Database Solutions
San Francisco
Re: [PERFORM] strange estimate for number of rows
Hi, I'm the lead developer on the project this concerns (forgive my newbieness on this list).

We tried a couple of scenarios with effective_cache_size=6, cpu_index_tuple_cost=0.0001, and random_page_cost=2, with no change in the plan.

explain analyse select * from tablename where transaction_date >= '2003-9-1' and transaction_date < '2003-10-1';

 Seq Scan on tablename  (cost=0.00..348199.14 rows=1180724 width=91) (actual time=7727.668..36286.898 rows=579238 loops=1)
   Filter: ((transaction_date >= '2003-09-01 00:00:00+00'::timestamp with time zone) AND (transaction_date < '2003-10-01 00:00:00+00'::timestamp with time zone))
 Total runtime: 36625.351 ms

explain analyse select * from transactions_posted where product_id = 2;

 Seq Scan on transactions_posted  (cost=0.00..319767.95 rows=6785237 width=91) (actual time=0.091..35596.328 rows=5713877 loops=1)
   Filter: (product_id = 2)
 Total runtime: 38685.373 ms

The product_id alone gives a difference of a million rows from estimate to actual, vs. the factor of 2 from the transaction_date.

Dan Manley

Tom Lane writes:
> Are the estimates accurate for queries on the two columns individually, ie
>
>     ... where transaction_date >= '2003-9-1' and transaction_date < '2003-10-1'
>
>     ... where product_id = 2
>
> If so, the problem is that there's a correlation between transaction_date
> and product_id, which the system cannot model because it has no
> multi-column statistics.
>
> However, given that the estimate is only off by about a factor of 2, you'd
> still be getting the wrong plan even if the estimate were perfect, because
> the estimated costs differ by nearly a factor of 3. Given the actual
> runtimes, I'm thinking maybe you want to reduce random_page_cost. What
> are you using for that now?
Re: [PERFORM] strange estimate for number of rows
Daniel Manley <[EMAIL PROTECTED]> writes:
> The product_id alone gives a difference of a millions rows from estimate
> to actual, vs. the factor of 2 from the transaction_date.

You should be thinking in terms of ratios, not absolute difference. The rows estimate for product_id doesn't look too bad to me; the one for transaction_date is much further off (a factor of 2). Which is odd, because the system can usually do all right on range estimates if you've let it run an ANALYZE with enough histogram bins. Could we see the pg_stats row for transaction_date?

> We tried a couple of scenarios with effective_cache_size=6,
> cpu_index_tuple_cost=0.0001 and random_page_cost=2 with no change in the
> plan.

Since you need about a factor of 3 change in the cost estimate to get it to switch plans, changing random_page_cost by a factor of 2 ain't gonna do it (the other two numbers are second-order adjustments unlikely to have much effect, I think). Try 1.5, or even less ... of course, you have to keep an eye on your other queries and make sure they don't go nuts, but from what I've heard about your hardware setup a low random_page_cost isn't out of line with the physical realities.

regards, tom lane
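The factor-of-3 reasoning above can be tried per session without touching postgresql.conf. A sketch of the experiment, using the 1.5 value suggested above and the query from earlier in the thread:

```sql
-- Try a lower random_page_cost for this session only and see
-- whether the planner switches from the seqscan to the index scan.
SHOW random_page_cost;          -- the built-in default is 4
SET random_page_cost = 1.5;

EXPLAIN ANALYZE
SELECT * FROM transactions_posted
 WHERE transaction_date >= '2003-9-1'
   AND transaction_date <  '2003-10-1'
   AND product_id = 2;

RESET random_page_cost;         -- back to the configured default
```

Comparing the estimated costs at each setting shows directly how far random_page_cost has to fall before the index plan wins.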
Re: [PERFORM] strange estimate for number of rows
On Thu, Nov 13, 2003 at 03:19:08PM -0500, Tom Lane wrote:
> because the system can usually do all right on range estimates if you've
> let it run an ANALYZE with enough histogram bins. Could we see the
> pg_stats row for transaction_date?

Do you want the whole thing? I left out the really verbose bits when I posted this in the original:

SELECT attname,null_frac,avg_width,n_distinct,correlation FROM pg_stats
 WHERE tablename = 'transactions_posted' AND attname in ('transaction_date','product_id');

     attname      | null_frac | avg_width | n_distinct | correlation
------------------+-----------+-----------+------------+-------------
 product_id       |         0 |         4 |          2 |    0.200956
 transaction_date |         0 |         8 |  -0.200791 |    0.289248

> Since you need about a factor of 3 change in the cost estimate to get it to
> switch plans, changing random_page_cost by a factor of 2 ain't gonna do
> it (the other two numbers are second-order adjustments unlikely to have
> much effect, I think). Try 1.5, or even less ... of course, you have to
> keep an eye on your other queries and make sure they don't go nuts, but
> from what I've heard about your hardware setup a low random_page_cost
> isn't out of line with the physical realities.

Actually, this one's on an internal box, and I think 1.5 is too low -- it's really just a pair of mirrored SCSI disks on a PCI controller in this case. That does the trick, though, so maybe I'm just being too conservative.

A
--
Andrew Sullivan                  | 204-4141 Yonge Street
Afilias Canada                   | Toronto, Ontario Canada
<[EMAIL PROTECTED]>              | M2P 2A8
                                 | +1 416 646 3304 x110
Re: [PERFORM] strange estimate for number of rows
On Thu, Nov 13, 2003 at 04:37:03PM -0500, Andrew Sullivan wrote:
> Actually, this one's on an internal box, and I think 1.5 is too low
> -- it's really just a pair of mirrored SCSI disks on a PCI controller
> in this case. That does the trick, though, so maybe I'm just being
> too conservative.

I spoke too soon. I'd left enable_seqscan=off set. It doesn't actually prefer an indexscan until I set the random_page_cost to .5. I think that's a little unrealistic ;-)

A
--
Andrew Sullivan                  | 204-4141 Yonge Street
Afilias Canada                   | Toronto, Ontario Canada
<[EMAIL PROTECTED]>              | M2P 2A8
                                 | +1 416 646 3304 x110