Re: [PERFORM] Slow query with a lot of data
On 18.08.2008, at 18:05, Matthew Wakeling wrote:

> On Mon, 18 Aug 2008, Moritz Onken wrote:
>> Running the query for more than one user is indeed not much slower.
>> That's what I need. I'm clustering the results table on domain right
>> now. But why is this better than clustering it on "user"?
>
> The reason is the way that the merge join algorithm works. What it does
> is take two tables and sort them both by the join fields. Then it can
> stream through both tables, producing results as it goes. It's the best
> join algorithm, but it does require both tables to be sorted by the same
> thing, which is domain in this case. The aggregating on user happens
> after the join has been done, and the hash aggregate can accept the
> users in random order.
>
> If you look at your last EXPLAIN, see that it has to sort the result
> table on domain, although it can read the domain_categories in domain
> order due to the clustered index.

explain select
  a."user", b.category, sum(1.0/b.cat_count)::float
from result a, domain_categories b
where a."domain" = b."domain"
group by a."user", b.category;

"GroupAggregate  (cost=21400443313.69..22050401897.13 rows=35049240 width=12)"
"  ->  Sort  (cost=21400443313.69..21562757713.35 rows=64925759864 width=12)"
"        Sort Key: a."user", b.category"
"        ->  Merge Join  (cost=4000210.40..863834009.08 rows=64925759864 width=12)"
"              Merge Cond: (b.domain = a.domain)"
"              ->  Index Scan using domain_categories_domain on domain_categories b  (cost=0.00..391453.79 rows=12105014 width=12)"
"              ->  Materialize  (cost=331.73..4253766.93 rows=20306816 width=8)"
"                    ->  Sort  (cost=331.73..4050698.77 rows=20306816 width=8)"
"                          Sort Key: a.domain"
"                          ->  Seq Scan on result a  (cost=0.00..424609.16 rows=20306816 width=8)"

Both results and domain_categories are clustered on domain and analyzed.
It took 50 minutes to run this query for 280 users ("and "user" IN ([280
ids])"); 78,000 rows were returned and stored in a table. Is this
reasonable?

Why is it still sorting on domain? I thought the clustering should
prevent the planner from doing this?

moritz
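For reference, a minimal sketch of the clustering step under discussion,
using the index names that appear in the plans in this thread
(result_domain_idx, domain_categories_domain); the CLUSTER ... USING form
is the 8.3 syntax, older releases spell it CLUSTER indexname ON tablename:

    -- cluster both tables on their domain indexes, then refresh statistics
    CLUSTER result USING result_domain_idx;
    CLUSTER domain_categories USING domain_categories_domain;
    ANALYZE result;
    ANALYZE domain_categories;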
Re: [PERFORM] Slow query with a lot of data
On Tue, 19 Aug 2008, Moritz Onken wrote:

> explain select
>   a."user", b.category, sum(1.0/b.cat_count)::float
> from result a, domain_categories b
> where a."domain" = b."domain"
> group by a."user", b.category;
>
> Both results and domain_categories are clustered on domain and analyzed.
>
> Why is it still sorting on domain? I thought the clustering should
> prevent the planner from doing this?

As far as I can tell, it should. If it is clustered on an index on domain,
and then analysed, it should no longer have to sort on domain. Could you
post here the results of running:

select * from pg_stats where attname = 'domain';

> It took 50 minutes to run this query for 280 users ("and "user" IN
> ([280 ids])"); 78,000 rows were returned and stored in a table. Is this
> reasonable?

Sounds like an awfully long time to me. Also, I think restricting it to
280 users is probably not making it any faster.

Matthew

--
It's one of those irregular verbs - "I have an independent mind," "You are
an eccentric," "He is round the twist."
                                    -- Bernard Woolly, Yes Prime Minister
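A slightly narrower variant of the same check, pulling only the columns
that end up mattering later in the thread (same pg_stats view, just a
shorter column list):

    SELECT tablename, attname, n_distinct, correlation
      FROM pg_stats
     WHERE attname = 'domain';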
Re: [PERFORM] Slow query with a lot of data
> As far as I can tell, it should. If it is clustered on an index on
> domain, and then analysed, it should no longer have to sort on domain.
> Could you post here the results of running:
>
> select * from pg_stats where attname = 'domain';

schemaname        | public
tablename         | result
attname           | domain
null_frac         | 0
avg_width         | 4
n_distinct        | 1642
most_common_vals  | {3491378,3213829,3316634,3013831,3062500,3242775,3290846,3171997,3412018,3454092}
most_common_freqs | {0.352333,0.021,0.01,0.0077,0.0057,0.0053,0.0053,0.005,0.0027,0.0027}
histogram_bounds  | {3001780,3031753,3075043,3129688,3176566,3230067,3286784,3341445,3386233,3444374,3491203}
correlation       | 1

No idea what that means :)

> Sounds like an awfully long time to me. Also, I think restricting it to
> 280 users is probably not making it any faster.

If I hadn't restricted it to 280 users it would have run ~350 days...

Thanks for your help!

moritz
Re: [PERFORM] Slow query with a lot of data
On Tue, 19 Aug 2008, Moritz Onken wrote:

> tablename | attname | n_distinct | correlation
> result    | domain  |       1642 |           1

Well, the important thing is the correlation, which is 1, indicating that
Postgres knows that the table is clustered. So I have no idea why it is
sorting the entire table.

What happens when you run EXPLAIN SELECT * FROM result ORDER BY domain?

>> Sounds like an awfully long time to me. Also, I think restricting it
>> to 280 users is probably not making it any faster.
>
> If I hadn't restricted it to 280 users it would have run ~350 days...

What makes you say that? Perhaps you could post EXPLAINs of both of the
queries.

Matthew

--
What goes up must come down. Ask any system administrator.
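For concreteness, a sketch of the two EXPLAINs being asked for here,
assuming the query shape from earlier in the thread (the restricted
variant with the "user" IN (...) list is left out, since the id list was
never posted):

    EXPLAIN SELECT * FROM result ORDER BY domain;

    EXPLAIN
    SELECT a."user", b.category, sum(1.0/b.cat_count)::float
      FROM result a, domain_categories b
     WHERE a."domain" = b."domain"
     GROUP BY a."user", b.category;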
Re: [PERFORM] Slow query with a lot of data
On 19.08.2008, at 14:17, Matthew Wakeling wrote:

> On Tue, 19 Aug 2008, Moritz Onken wrote:
>
>> tablename | attname | n_distinct | correlation
>> result    | domain  |       1642 |           1
>
> Well, the important thing is the correlation, which is 1, indicating
> that Postgres knows that the table is clustered. So I have no idea why
> it is sorting the entire table.
>
> What happens when you run EXPLAIN SELECT * FROM result ORDER BY domain?

"Index Scan using result_domain_idx on result  (cost=0.00..748720.72 rows=20306816 width=49)"

... as it should be.

>>> Sounds like an awfully long time to me. Also, I think restricting it
>>> to 280 users is probably not making it any faster.
>>
>> If I hadn't restricted it to 280 users it would have run ~350 days...
>
> What makes you say that? Perhaps you could post EXPLAINs of both of the
> queries.
>
> Matthew

That was just a guess. The query needs to retrieve the data for about
50,000 users. But it should be fast if I don't retrieve the data for
specific users but let it run through all rows.

explain insert into setup1 (select
  a."user", b.category, sum(1.0/b.cat_count)::float
  from result a, domain_categories b
  where a."domain" = b."domain"
  and b.depth < 4
  and a.results > 100
  and a."user" < 3
  group by a."user", b.category);

"GroupAggregate  (cost=11745105.66..12277396.81 rows=28704 width=12)"
"  ->  Sort  (cost=11745105.66..11878034.93 rows=53171707 width=12)"
"        Sort Key: a."user", b.category"
"        ->  Merge Join  (cost=149241.25..1287278.89 rows=53171707 width=12)"
"              Merge Cond: (b.domain = a.domain)"
"              ->  Index Scan using domain_categories_domain on domain_categories b  (cost=0.00..421716.32 rows=5112568 width=12)"
"                    Filter: (depth < 4)"
"              ->  Materialize  (cost=148954.16..149446.36 rows=39376 width=8)"
"                    ->  Sort  (cost=148954.16..149052.60 rows=39376 width=8)"
"                          Sort Key: a.domain"
"                          ->  Bitmap Heap Scan on result a  (cost=1249.93..145409.79 rows=39376 width=8)"
"                                Recheck Cond: ("user" < 3)"
"                                Filter: (results > 100)"
"                                ->  Bitmap Index Scan on result_user_idx  (cost=0.00..1240.08 rows=66881 width=0)"
"                                      Index Cond: ("user" < 3)"

This query limits the number of users to 215, and it took about 50
minutes.

I could create two temp tables which have only those records which I need
for this query. Would this be a good idea?

moritz
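A minimal sketch of the two-temp-table idea mentioned above (the temp
table names are made up; the filters simply mirror the WHERE clause of
the query):

    CREATE TEMP TABLE result_small AS
      SELECT "user", "domain"
        FROM result
       WHERE results > 100 AND "user" < 3;

    CREATE TEMP TABLE domcat_small AS
      SELECT "domain", category, cat_count
        FROM domain_categories
       WHERE depth < 4;

    ANALYZE result_small;
    ANALYZE domcat_small;

    -- the aggregate would then join the two smaller tables instead
    SELECT a."user", b.category, sum(1.0/b.cat_count)::float
      FROM result_small a, domcat_small b
     WHERE a."domain" = b."domain"
     GROUP BY a."user", b.category;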
Re: [PERFORM] Cross Join Problem
Thanks a lot... it solved my problem.

On Mon, Aug 18, 2008 at 8:50 PM, Tom Lane <[EMAIL PROTECTED]> wrote:
> [ please keep the list cc'd for the archives' sake ]
>
> "Gauri Kanekar" <[EMAIL PROTECTED]> writes:
> > On Mon, Aug 18, 2008 at 7:32 PM, Tom Lane <[EMAIL PROTECTED]> wrote:
> >> No PG release since 7.3 would have voluntarily planned that query
> >> that way. Maybe you were using join_collapse_limit = 1 to force the
> >> join order?
>
> > Yes, we have set join_collapse_limit to 1.
>
> Ah, so really your question is why join_collapse_limit isn't working as
> you expect. That code changed quite a bit in 8.2, and the way it works
> now is that the critical decision occurs while deciding whether to fold
> the cross-join (a sub-problem of size 2) into the top-level join
> problem. Which is a decision that's going to be driven by
> from_collapse_limit, not join_collapse_limit.
>
> So one way you could make it work is to reduce from_collapse_limit to
> less than 3, but I suspect you'd find that that has too many bad
> consequences for other queries. What's probably best is to write the
> problem query like this:
>
>    FROM table1 a cross join ( table2 b cross join table3 c )
>
> which will cause join_collapse_limit to be the relevant number at both
> steps.
>
>    regards, tom lane

--
Regards
Gauri
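A hedged sketch of the two alternatives Tom describes, using his
placeholder table names (table1/table2/table3); the SELECT list and any
join conditions from the real query are not shown in the thread:

    -- Option 1: lower from_collapse_limit for this session only
    -- (Tom warns this may have bad consequences for other queries)
    SET from_collapse_limit = 2;

    -- Option 2: parenthesize the cross join so join_collapse_limit
    -- governs both steps of the join order decision
    SELECT a.*, b.*, c.*
      FROM table1 a CROSS JOIN (table2 b CROSS JOIN table3 c);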
Re: [PERFORM] Slow query with a lot of data
What is your work_mem set to? The default?

Try increasing it significantly if you have the RAM and see if that
affects the explain plan. You may even want to set it to a number larger
than the RAM you have, just to see what happens. In all honesty, it may
be faster to overflow to OS swap space than to sort too many rows, but
ONLY if it changes the plan to a significantly more efficient one.

Simply type SET work_mem = '500MB'; before running your explain. Set it
to even more RAM if you have the space for this experiment.

In my experience the performance of aggregates on large tables is
significantly affected by work_mem, and the optimizer will choose poorly
without enough of it. It will rule out plans that may be fast enough when
overflowing to disk, in preference to colossal sorts (which likely also
overflow to disk but take hours or days).

On Tue, Aug 19, 2008 at 5:47 AM, Moritz Onken <[EMAIL PROTECTED]> wrote:

> On 19.08.2008, at 14:17, Matthew Wakeling wrote:
>
>> On Tue, 19 Aug 2008, Moritz Onken wrote:
>>
>>> tablename | attname | n_distinct | correlation
>>> result    | domain  |       1642 |           1
>>
>> Well, the important thing is the correlation, which is 1, indicating
>> that Postgres knows that the table is clustered. So I have no idea why
>> it is sorting the entire table.
>>
>> What happens when you run EXPLAIN SELECT * FROM result ORDER BY domain?
>
> "Index Scan using result_domain_idx on result  (cost=0.00..748720.72
> rows=20306816 width=49)"
>
> ... as it should be.
>
>>>> Sounds like an awfully long time to me. Also, I think restricting it
>>>> to 280 users is probably not making it any faster.
>>>
>>> If I hadn't restricted it to 280 users it would have run ~350 days...
>>
>> What makes you say that? Perhaps you could post EXPLAINs of both of
>> the queries.
>>
>> Matthew
>
> That was just a guess. The query needs to retrieve the data for about
> 50,000 users. But it should be fast if I don't retrieve the data for
> specific users but let it run through all rows.
>
> explain insert into setup1 (select
>   a."user", b.category, sum(1.0/b.cat_count)::float
>   from result a, domain_categories b
>   where a."domain" = b."domain"
>   and b.depth < 4
>   and a.results > 100
>   and a."user" < 3
>   group by a."user", b.category);
>
> "GroupAggregate  (cost=11745105.66..12277396.81 rows=28704 width=12)"
> "  ->  Sort  (cost=11745105.66..11878034.93 rows=53171707 width=12)"
> "        Sort Key: a."user", b.category"
> "        ->  Merge Join  (cost=149241.25..1287278.89 rows=53171707 width=12)"
> "              Merge Cond: (b.domain = a.domain)"
> "              ->  Index Scan using domain_categories_domain on domain_categories b  (cost=0.00..421716.32 rows=5112568 width=12)"
> "                    Filter: (depth < 4)"
> "              ->  Materialize  (cost=148954.16..149446.36 rows=39376 width=8)"
> "                    ->  Sort  (cost=148954.16..149052.60 rows=39376 width=8)"
> "                          Sort Key: a.domain"
> "                          ->  Bitmap Heap Scan on result a  (cost=1249.93..145409.79 rows=39376 width=8)"
> "                                Recheck Cond: ("user" < 3)"
> "                                Filter: (results > 100)"
> "                                ->  Bitmap Index Scan on result_user_idx  (cost=0.00..1240.08 rows=66881 width=0)"
> "                                      Index Cond: ("user" < 3)"
>
> This query limits the number of users to 215, and it took about 50
> minutes.
>
> I could create two temp tables which have only those records which I
> need for this query. Would this be a good idea?
>
> moritz
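A sketch of the experiment being suggested, run in one session (the
500MB figure is just the example value from above and assumes the
machine has that much RAM to spare):

    SET work_mem = '500MB';

    EXPLAIN
    SELECT a."user", b.category, sum(1.0/b.cat_count)::float
      FROM result a, domain_categories b
     WHERE a."domain" = b."domain"
     GROUP BY a."user", b.category;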
Re: [PERFORM] Slow query with a lot of data
On 19.08.2008, at 16:49, Scott Carey wrote:

> What is your work_mem set to? The default?
>
> Try increasing it significantly if you have the RAM and see if that
> affects the explain plan. You may even want to set it to a number
> larger than the RAM you have, just to see what happens. In all honesty,
> it may be faster to overflow to OS swap space than to sort too many
> rows, but ONLY if it changes the plan to a significantly more efficient
> one.
>
> Simply type SET work_mem = '500MB'; before running your explain. Set it
> to even more RAM if you have the space for this experiment.
>
> In my experience the performance of aggregates on large tables is
> significantly affected by work_mem, and the optimizer will choose
> poorly without enough of it. It will rule out plans that may be fast
> enough when overflowing to disk, in preference to colossal sorts (which
> likely also overflow to disk but take hours or days).

Thanks for that advice, but the explain is not different :-(

moritz
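Since the plan did not change, one quick sanity check (an assumption, not
something confirmed in the thread) is to verify the setting actually took
effect in the same session that ran the EXPLAIN:

    SHOW work_mem;  -- should report 500MB (or whatever was SET) here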
[PERFORM] Software vs. Hardware RAID Data
Hi all,

We started an attempt to slice the data we've been collecting in another
way, to show the results of software vs. hardware RAID:

http://wiki.postgresql.org/wiki/HP_ProLiant_DL380_G5_Tuning_Guide#Hardware_vs._Software_Raid

The angle we're trying to show here is the processor utilization and I/O
throughput for a given file system and RAID configuration. I wasn't sure
about the best way to present it, so this is how it looks so far. Click
on the results for a chart of the aggregate processor utilization for the
test.

Comments, suggestions, criticisms, et al. welcome.

Regards,
Mark
Re: [PERFORM] Software vs. Hardware RAID Data
On Tue, 19 Aug 2008, Mark Wong wrote:

> Hi all,
>
> We started an attempt to slice the data we've been collecting in
> another way, to show the results of software vs. hardware RAID:
>
> http://wiki.postgresql.org/wiki/HP_ProLiant_DL380_G5_Tuning_Guide#Hardware_vs._Software_Raid
>
> The angle we're trying to show here is the processor utilization and
> I/O throughput for a given file system and RAID configuration. I wasn't
> sure about the best way to present it, so this is how it looks so far.
> Click on the results for a chart of the aggregate processor utilization
> for the test.
>
> Comments, suggestions, criticisms, et al. welcome.

It's really good to show CPU utilization as well as throughput, but how
about showing the CPU utilization as %CPU per MB/s (possibly with a flag
to indicate any entries that look like they may have hit CPU limits)?

Why did you use a 4M stripe size on the software RAID? Especially on
RAID 5 this seems like a lot of data to have to touch when making an
update.

David Lang