Re: [PERFORM] Slow query with a lot of data

2008-08-19 Thread Moritz Onken


On 18.08.2008 at 18:05, Matthew Wakeling wrote:


On Mon, 18 Aug 2008, Moritz Onken wrote:
Running the query for more than one user is indeed not much slower.  
That's what I need. I'm clustering the results table on domain  
right now. But why is this better than clustering it on "user"?


The reason is the way that the merge join algorithm works. What it
does is take two tables and sort them both by the join fields.
Then it can stream through both tables, producing results as it goes.
It's the best join algorithm, but it does require both tables to be
sorted by the same thing, which is domain in this case. The
aggregation on user happens after the join has been done, and the
hash aggregate can accept the users in random order.


If you look at your last EXPLAIN, you'll see that it has to sort the result
table on domain, although it can read domain_categories in domain order
thanks to the clustered index.
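
(Schematically, the plan shape being aimed for is something like the following;
this is only an illustration, not real EXPLAIN output:

  HashAggregate on ("user", category)
    ->  Merge Join on domain
          ->  Index Scan on domain_categories  (rows already in domain order)
          ->  Index Scan / clustered scan on result  (rows already in domain order)

With both join inputs arriving in domain order, no explicit Sort on domain
should be needed.)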


explain select
  a."user", b.category, sum(1.0/b.cat_count)::float
  from result a, domain_categories b
  where a."domain" = b."domain"
  group by a."user", b.category;

"GroupAggregate  (cost=21400443313.69..22050401897.13 rows=35049240  
width=12)"
"  ->  Sort  (cost=21400443313.69..21562757713.35 rows=64925759864  
width=12)"

"Sort Key: a."user", b.category"
"->  Merge Join  (cost=4000210.40..863834009.08  
rows=64925759864 width=12)"

"  Merge Cond: (b.domain = a.domain)"
"  ->  Index Scan using domain_categories_domain on  
domain_categories b  (cost=0.00..391453.79 rows=12105014 width=12)"
"  ->  Materialize  (cost=331.73..4253766.93  
rows=20306816 width=8)"
"->  Sort  (cost=331.73..4050698.77  
rows=20306816 width=8)"

"  Sort Key: a.domain"
"  ->  Seq Scan on result a   
(cost=0.00..424609.16 rows=20306816 width=8)"


Both results and domain_categories are clustered on domain and analyzed.
It took 50 minutes to run this query for 280 users ("and "user" IN
([280 ids])"); 78000 rows were returned and stored in a table. Is this
reasonable?
Why is it still sorting on domain? I thought the clustering would prevent
the planner from doing this.
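
(For the record, "clustered on domain and analyzed" amounts to something like
the following, using the index names that appear in the plans in this thread;
the exact commands aren't shown here, so treat this only as a sketch:

  CLUSTER domain_categories_domain ON domain_categories;
  CLUSTER result_domain_idx ON result;
  ANALYZE domain_categories;
  ANALYZE result;
)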


moritz



Re: [PERFORM] Slow query with a lot of data

2008-08-19 Thread Matthew Wakeling

On Tue, 19 Aug 2008, Moritz Onken wrote:

explain select
a."user", b.category, sum(1.0/b.cat_count)::float
from result a, domain_categories b
where a."domain" = b."domain"
group by a."user", b.category;



Both results and domain_categories are clustered on domain and analyzed.
Why is it still sorting on domain? I thought the clustering should prevent 
the planner from doing this?


As far as I can tell, it should. If it is clustered on an index on domain, 
and then analysed, it should no longer have to sort on domain.


Could you post here the results of running:

select * from pg_stats where attname = 'domain';

It took 50 minutes to run this query for 280 users ("and "user" IN ([280 
ids])"), 78000 rows were returned and stored in a table. Is this reasonable?


Sounds like an awfully long time to me. Also, I think restricting it to 
280 users is probably not making it any faster.


Matthew

--
It's one of those irregular verbs - "I have an independent mind," "You are
an eccentric," "He is round the twist."
 -- Bernard Woolley, Yes Prime Minister



Re: [PERFORM] Slow query with a lot of data

2008-08-19 Thread Moritz Onken


As far as I can tell, it should. If it is clustered on an index on  
domain, and then analysed, it should no longer have to sort on domain.


Could you post here the results of running:

select * from pg_stats where attname = 'domain';




 schemaname        | public
 tablename         | result
 attname           | domain
 null_frac         | 0
 avg_width         | 4
 n_distinct        | 1642
 most_common_vals  | {3491378,3213829,3316634,3013831,3062500,3242775,3290846,3171997,3412018,3454092}
 most_common_freqs | {0.352333,0.021,0.01,0.0077,0.0057,0.0053,0.0053,0.005,0.0027,0.0027}
 histogram_bounds  | {3001780,3031753,3075043,3129688,3176566,3230067,3286784,3341445,3386233,3444374,3491203}
 correlation       | 1



No idea what that means :)




Sounds like an awfully long time to me. Also, I think restricting it  
to 280 users is probably not making it any faster.


If I hadn't restricted it to 280 users it would have run ~350 days...

Thanks for your help!

moritz



Re: [PERFORM] Slow query with a lot of data

2008-08-19 Thread Matthew Wakeling

On Tue, 19 Aug 2008, Moritz Onken wrote:

 tablename | attname | n_distinct | correlation
 result    | domain  |       1642 |           1


Well, the important thing is the correlation, which is 1, indicating that 
Postgres knows that the table is clustered. So I have no idea why it is 
sorting the entire table.


What happens when you run EXPLAIN SELECT * FROM result ORDER BY domain?

Sounds like an awfully long time to me. Also, I think restricting it to 280 
users is probably not making it any faster.


If I hadn't restricted it to 280 users it would have run ~350 days...


What makes you say that? Perhaps you could post EXPLAINs of both of the 
queries.


Matthew

--
What goes up must come down. Ask any system administrator.



Re: [PERFORM] Slow query with a lot of data

2008-08-19 Thread Moritz Onken


On 19.08.2008 at 14:17, Matthew Wakeling wrote:


On Tue, 19 Aug 2008, Moritz Onken wrote:

 tablename | attname | n_distinct | correlation
 result    | domain  |       1642 |           1


Well, the important thing is the correlation, which is 1, indicating  
that Postgres knows that the table is clustered. So I have no idea  
why it is sorting the entire table.


What happens when you run EXPLAIN SELECT * FROM result ORDER BY  
domain?




"Index Scan using result_domain_idx on result  (cost=0.00..748720.72  
rows=20306816 width=49)"

... as it should be.

Sounds like an awfully long time to me. Also, I think restricting  
it to 280 users is probably not making it any faster.


If I hadn't restricted it to 280 users it would have run ~350 days...


What makes you say that? Perhaps you could post EXPLAINs of both of  
the queries.


Matthew


That was just a guess. The query needs to retrieve the data for about
50,000 users. But it should be fast if I don't retrieve the data for
specific users but let it run through all rows.


explain insert into setup1 (select
  a."user", b.category, sum(1.0/b.cat_count)::float
  from result a, domain_categories b
  where a."domain" = b."domain"
  and b.depth < 4
  and a.results > 100
  and a."user" < 3
  group by a."user", b.category);


"GroupAggregate  (cost=11745105.66..12277396.81 rows=28704 width=12)"
"  ->  Sort  (cost=11745105.66..11878034.93 rows=53171707 width=12)"
"Sort Key: a."user", b.category"
"->  Merge Join  (cost=149241.25..1287278.89 rows=53171707  
width=12)"

"  Merge Cond: (b.domain = a.domain)"
"  ->  Index Scan using domain_categories_domain on  
domain_categories b  (cost=0.00..421716.32 rows=5112568 width=12)"

"Filter: (depth < 4)"
"  ->  Materialize  (cost=148954.16..149446.36 rows=39376  
width=8)"
"->  Sort  (cost=148954.16..149052.60 rows=39376  
width=8)"

"  Sort Key: a.domain"
"  ->  Bitmap Heap Scan on result a   
(cost=1249.93..145409.79 rows=39376 width=8)"

"Recheck Cond: ("user" < 3)"
"Filter: (results > 100)"
"->  Bitmap Index Scan on  
result_user_idx  (cost=0.00..1240.08 rows=66881 width=0)"

"  Index Cond: ("user" < 3)"


This query limits the number of users to 215, and it took about
50 minutes.
I could create two temp tables which contain only those records I
need for this query. Would this be a good idea?
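
(For illustration, such pre-filtered temp tables might look roughly like the
following; the table names are made up and the filters simply mirror the ones
in the query above:

  create temporary table result_small as
    select "user", "domain" from result where results > 100;
  create temporary table domcat_small as
    select "domain", category, cat_count from domain_categories where depth < 4;
  analyze result_small;
  analyze domcat_small;
)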



moritz




Re: [PERFORM] Cross Join Problem

2008-08-19 Thread Gauri Kanekar
Thanks a lot... it solved my problem

On Mon, Aug 18, 2008 at 8:50 PM, Tom Lane <[EMAIL PROTECTED]> wrote:

> [ please keep the list cc'd for the archives' sake ]
>
> "Gauri Kanekar" <[EMAIL PROTECTED]> writes:
> > On Mon, Aug 18, 2008 at 7:32 PM, Tom Lane <[EMAIL PROTECTED]> wrote:
> >> No PG release since 7.3 would have voluntarily planned that query that
> >> way.  Maybe you were using join_collapse_limit = 1 to force the join
> >> order?
>
> > Yes, We have set join_collapse_limit set to 1.
>
> Ah, so really your question is why join_collapse_limit isn't working as
> you expect.  That code changed quite a bit in 8.2, and the way it works
> now is that the critical decision occurs while deciding whether to fold
> the cross-join (a sub-problem of size 2) into the top-level join
> problem.  Which is a decision that's going to be driven by
> from_collapse_limit not join_collapse_limit.
>
> So one way you could make it work is to reduce from_collapse_limit to
> less than 3, but I suspect you'd find that that has too many bad
> consequences for other queries.  What's probably best is to write the
> problem query like this:
>
>FROM table1 a cross join ( table2 b cross join table3 c )
>
> which will cause join_collapse_limit to be the relevant number at both
> steps.
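>
> (For illustration, the first option would just be a session-level setting
> along the lines of
>
>SET from_collapse_limit = 2;
>
> shown here only as a sketch; as noted above, it probably has too many bad
> side effects on other queries.)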
>
>regards, tom lane
>



-- 
Regards
Gauri


Re: [PERFORM] Slow query with a lot of data

2008-08-19 Thread Scott Carey
What is your work_mem set to?  The default?

Try increasing it significantly if you have the RAM, and see if that
affects the explain plan.  You may even want to set it to a number larger
than the RAM you have, just to see what happens.  In all honesty, it may be
faster to overflow to OS swap space than to sort too many rows, but ONLY if it
changes the plan to a significantly more efficient one.

Simply type

SET work_mem = '500MB';

before running your explain.  Set it to even more RAM if you have the space
for this experiment.
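
(Put together, the experiment might look like this in a psql session; the
500MB figure is only an example, size it to the RAM you actually have:

  SHOW work_mem;            -- check what the session is currently using
  SET work_mem = '500MB';   -- affects this session only, not postgresql.conf
  EXPLAIN <the insert ... select from above>;
  RESET work_mem;           -- back to the configured default
)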

In my experience the performance of aggregates on large tables is
significantly affected by work_mem, and the optimizer will choose poorly
without enough of it.  It will rule out plans that may be fast enough when
overflowing to disk in preference to colossal sorts (which likely also
overflow to disk but take hours or days).

On Tue, Aug 19, 2008 at 5:47 AM, Moritz Onken <[EMAIL PROTECTED]> wrote:

>
> On 19.08.2008 at 14:17, Matthew Wakeling wrote:
>
>  On Tue, 19 Aug 2008, Moritz Onken wrote:
>>
>>>  tablename| attname | n_distinct | correlation
>>> result | domain  |   1642 |   1
>>>
>>
>> Well, the important thing is the correlation, which is 1, indicating that
>> Postgres knows that the table is clustered. So I have no idea why it is
>> sorting the entire table.
>>
>> What happens when you run EXPLAIN SELECT * FROM result ORDER BY domain?
>>
>>
> "Index Scan using result_domain_idx on result  (cost=0.00..748720.72
> rows=20306816 width=49)"
> ... as it should be.
>
>>>> Sounds like an awfully long time to me. Also, I think restricting it to
>>>> 280 users is probably not making it any faster.

>>>
>>> If I hadn't restricted it to 280 users it would have run ~350 days...
>>>
>>
>> What makes you say that? Perhaps you could post EXPLAINs of both of the
>> queries.
>>
>> Matthew
>>
>
> That was just a guess. The query needs to retrieve the data for about
> 50,000 users. But it should be fast if I don't retrieve the data for
> specific users but let it run through all rows.
>
> explain insert into setup1 (select
>  a."user", b.category, sum(1.0/b.cat_count)::float
>  from result a, domain_categories b
>  where a."domain" = b."domain"
>  and b.depth < 4
>  and a.results > 100
>  and a."user" < 3
>  group by a."user", b.category);
>
>
> "GroupAggregate  (cost=11745105.66..12277396.81 rows=28704 width=12)"
> "  ->  Sort  (cost=11745105.66..11878034.93 rows=53171707 width=12)"
> "Sort Key: a."user", b.category"
> "->  Merge Join  (cost=149241.25..1287278.89 rows=53171707
> width=12)"
> "  Merge Cond: (b.domain = a.domain)"
> "  ->  Index Scan using domain_categories_domain on
> domain_categories b  (cost=0.00..421716.32 rows=5112568 width=12)"
> "Filter: (depth < 4)"
> "  ->  Materialize  (cost=148954.16..149446.36 rows=39376
> width=8)"
> "->  Sort  (cost=148954.16..149052.60 rows=39376
> width=8)"
> "  Sort Key: a.domain"
> "  ->  Bitmap Heap Scan on result a
>  (cost=1249.93..145409.79 rows=39376 width=8)"
> "Recheck Cond: ("user" < 3)"
> "Filter: (results > 100)"
> "->  Bitmap Index Scan on result_user_idx
>  (cost=0.00..1240.08 rows=66881 width=0)"
> "  Index Cond: ("user" < 3)"
>
>
> This query limits the number of users to 215, and it took about 50
> minutes.
> I could create two temp tables which contain only those records I need
> for this query. Would this be a good idea?
>
>
> moritz
>
>
>


Re: [PERFORM] Slow query with a lot of data

2008-08-19 Thread Moritz Onken


On 19.08.2008 at 16:49, Scott Carey wrote:


What is your work_mem set to?  The default?

Try increasing it significantly if you have the RAM, and see if
that affects the explain plan.  You may even want to set it to a
number larger than the RAM you have, just to see what happens.  In
all honesty, it may be faster to overflow to OS swap space than to
sort too many rows, but ONLY if it changes the plan to a significantly
more efficient one.


Simply type

SET work_mem = '500MB';

before running your explain.  Set it to even more RAM if you have
the space for this experiment.


In my experience the performance of aggregates on large tables is
significantly affected by work_mem, and the optimizer will choose
poorly without enough of it.  It will rule out plans that may be
fast enough when overflowing to disk in preference to colossal
sorts (which likely also overflow to disk but take hours or days).


Thanks for that advice, but the explain output is not different :-(

moritz



[PERFORM] Software vs. Hardware RAID Data

2008-08-19 Thread Mark Wong
Hi all,

We started an attempt to slice the data we've been collecting in
another way, to show the results of software vs. hardware RAID:

http://wiki.postgresql.org/wiki/HP_ProLiant_DL380_G5_Tuning_Guide#Hardware_vs._Software_Raid

The angle we're trying to show here is the processor utilization and
i/o throughput for a given file system and raid configuration.  I
wasn't sure about the best way to present it, so this is how it looks
so far.  Click on the results for a chart of the aggregate processor
utilization for the test.

Comments, suggestions, criticisms, et al. welcome.

Regards,
Mark



Re: [PERFORM] Software vs. Hardware RAID Data

2008-08-19 Thread david

On Tue, 19 Aug 2008, Mark Wong wrote:


Hi all,

We started an attempt to slice the data we've been collecting in
another way, to show the results of software vs. hardware RAID:

http://wiki.postgresql.org/wiki/HP_ProLiant_DL380_G5_Tuning_Guide#Hardware_vs._Software_Raid

The angle we're trying to show here is the processor utilization and
i/o throughput for a given file system and raid configuration.  I
wasn't sure about the best way to present it, so this is how it looks
so far.  Click on the results for a chart of the aggregate processor
utilization for the test.

Comments, suggestions, criticisms, et al. welcome.


It's really good to show CPU utilization as well as throughput, but how
about showing the CPU utilization as %CPU per MB/s (possibly with a flag
to indicate any entries that look like they may have hit CPU limits)?
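
(For instance, a run at 60% CPU while sustaining 300 MB/s would come out at
0.2 %CPU per MB/s; the numbers are made up purely to illustrate the metric.)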


Why did you use a 4M stripe size on the software RAID? Especially on RAID 5
this seems like a lot of data to have to touch when making an update.


David Lang
