Re: [PERFORM] Partitioning / Clustering

2005-05-12 Thread Alex Stapleton
On 11 May 2005, at 23:35, PFC wrote:


However, memcached (and for us, pg_memcached) is an excellent way  
to improve
horizontal scalability by taking disposable data (like session  
information)
out of the database and putting it in protected RAM.

So, what is the advantage of such a system versus, say, a  
"sticky sessions" system where each session is assigned to ONE  
application server (not PHP then) which keeps it in RAM as native  
objects instead of serializing and deserializing it on each request ?
I'd say the sticky sessions should perform a lot better, and if  
one machine dies, only the sessions on this one are lost.
But of course you can't do it with PHP as you need an app  
server which can manage sessions. Potentially the savings are huge,  
though.
There's no reason it couldn't be done with PHP, to be fair, as long as you
could ensure that the client was always routed back to the same machines,
which has its own set of issues entirely. I am not entirely sure that
memcached actually serializes data when it's committed into memcached either,
although I could be wrong; I have not looked at the source. Certainly if you
can ensure that a client always goes back to the same machine you can
simplify the whole thing hugely. It's generally not that easy though: you
need a proxy server of some description capable of understanding the HTTP
traffic and maintaining a central session lookup table to redirect with,
which isn't really solving the problem so much as moving it somewhere else.
Instead of needing huge memcached pools, you need hardcore load balancers,
and load balancers tend to cost $ in comparison. Distributed sticky sessions
are a rather nice idea; I would like to hear a way of implementing them
cheaply (and in PHP) as well. I may have to give that some thought, in fact.
Oh yeah, and load balancer software often sucks in annoying (if not always
important) ways.

On Google, their distributed system spans a huge number of PCs  
and it has redundancy, ie. individual PC failure is a normal thing  
and is a part of the system, it is handled gracefully. I read a  
paper on this matter, it's pretty impressive. The google filesystem  
has nothing to do with databases though, it's more a massive data  
store / streaming storage.

Since when did Massive Data stores have nothing to do with DBs? Isn't  
Oracle Cluster entirely based on forming an enormous scalable disk  
array to store your DB on?



[PERFORM] AND OR combination: index not being used

2005-05-12 Thread David Teran
Hi,
Postgres 8.0.1, Mac OS X 10.3.9
I have a select with multiple ORs combined with one AND:
explain analyze SELECT t0.ATTRIBUTE_TYPE FROM ATTRIBUTE_VALUE t0 WHERE 
(((t0.ATTRIBUTE_TYPE = 'pb'::varchar(10) OR t0.ATTRIBUTE_TYPE = 
'po'::varchar(10) OR t0.ATTRIBUTE_TYPE = 'pn'::varchar(10) OR 
t0.ATTRIBUTE_TYPE = 'ps'::varchar(10))) AND t0.ID_ATTRIBUTE = 
17::int8);

The result is the following. It shows that postgres does not use an 
index which makes the select pretty slow.

Seq Scan on attribute_value t0  (cost=0.00..529.13 rows=208 width=5) 
(actual time=66.591..66.591 rows=0 loops=1)
   Filter: ((((attribute_type)::text = 'pb'::text) OR
((attribute_type)::text = 'po'::text) OR ((attribute_type)::text =
'pn'::text) OR ((attribute_type)::text = 'ps'::text)) AND (id_attribute
= 17::bigint))
 Total runtime: 66.664 ms
(3 rows)

When I remove one OR qualifier, one can see that an index is now used.
explain analyze SELECT t0.ATTRIBUTE_TYPE FROM ATTRIBUTE_VALUE t0 WHERE 
(((t0.ATTRIBUTE_TYPE = 'pb'::varchar(10) OR t0.ATTRIBUTE_TYPE = 
'po'::varchar(10) OR t0.ATTRIBUTE_TYPE = 'pn'::varchar(10))) AND 
t0.ID_ATTRIBUTE = 17::int8);

Index Scan using attribute_value__attribute_type__id_attribute, 
attribute_value__attribute_type__id_attribute, 
attribute_value__attribute_type__id_attribute on attribute_value t0  
(cost=0.00..451.82 rows=137 width=5) (actual time=0.301..0.301 rows=0 
loops=1)
   Index Cond: ((((attribute_type)::text = 'pb'::text) AND
(id_attribute = 17::bigint)) OR (((attribute_type)::text = 'po'::text)
AND (id_attribute = 17::bigint)) OR (((attribute_type)::text =
'pn'::text) AND (id_attribute = 17::bigint)))
   Filter: ((((attribute_type)::text = 'pb'::text) OR
((attribute_type)::text = 'po'::text) OR ((attribute_type)::text =
'pn'::text)) AND (id_attribute = 17::bigint))
 Total runtime: 0.414 ms
(4 rows)

When I do 'set enable_seqscan=no' the index is used, of course.
Unfortunately the SQL is generated on the fly and it's not easy, more or
less impossible, to selectively enable/disable seqscan. Any hint on how
to force Postgres to use the index even with more OR parts?
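For readers hitting the same planner behaviour, a sketch of two things worth
trying (the index and the rewrite below are suggestions, not part of the
original report): a two-column index that leads with the selective equality
column, plus the equivalent IN spelling of the OR chain.

CREATE INDEX attribute_value_idattr_type_idx
    ON attribute_value (id_attribute, attribute_type);
ANALYZE attribute_value;

explain analyze SELECT t0.ATTRIBUTE_TYPE FROM ATTRIBUTE_VALUE t0 WHERE
t0.ID_ATTRIBUTE = 17::int8
AND t0.ATTRIBUTE_TYPE IN ('pb', 'po', 'pn', 'ps');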

regards, David


Re: [PERFORM] BLOB's bypassing the OS Filesystem for better Image loading speed?

2005-05-12 Thread PFC

Filesystems with many filesystem objects can slow down performance
when opening and reading data.
	On my laptop, lighttpd takes up to 15,000 hits PER SECOND on static 2-3 KB
files (tested with ApacheBench 2).
	Apache is slower, of course: 3,000-4,000 hits per second, which is not
that bad.
	Using a dynamic script with images in the database, you should account
for query and transmission overhead, dynamic page overhead... mmm, I'd say
using a fast application server you could maybe get 200-300 images served
per second from the database, and that's very optimistic. And then the
database will crawl, it will be disintegrated by the incoming flow of
useless requests... scalability will be awful...
	Not to mention that browsers ask the server "has this image changed since
last time?" (HEAD request) and then don't download it if it hasn't. The
server just stat()'s the file. Stat'ing a file on any decent filesystem
(i.e. XFS, Reiser, JFS, etc.) should take less than 10 microseconds if the
information is in the cache. You'd have to look in the database to check
the date... more queries!

	If you want to control download rights on files, you can still put the  
files on the filesystem (which is the right choice IMHO) and use a dynamic  
script to serve them. Even better, you could use lighttpd's authorized  
file download feature.

	The only case where I see putting files in a database as interesting is if
you want them to be part of a transaction. In that case, why not...
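A minimal sketch of that transactional case (the images table and the tiny
base64 literal are made up for illustration): the metadata row and the image
bytes commit or roll back together, which is the one property the filesystem
cannot give you.

CREATE TABLE images (
    id    serial PRIMARY KEY,
    name  text NOT NULL,
    data  bytea NOT NULL                      -- image bytes stored in the database
);

BEGIN;
INSERT INTO images (name, data)
VALUES ('pixel.png', decode('iVBORw0KGgo=', 'base64'));  -- bytes would come from the client
-- ... any other statements that must succeed or fail with the image ...
COMMIT;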



Re: [PERFORM] Partitioning / Clustering

2005-05-12 Thread PFC

machines. Which has it's own set of issues entirely. I am not entirely  
sure that memcached actually does serialize data when it's comitted into
	I think it does, ie. it's a simple mapping of [string key] => [string  
value].

memcached either, although I could be wrong, I have not looked at the  
source. Certainly if you can ensure that a client always goes back to  
the same machine you can simplify the whole thing hugely. It's generally  
not that easy though, you need a proxy server of some description  
capable of understanding the HTTP traffic and maintaining a central
	Yes...
	You could implement it by mapping servers to the hash of the user session
id.
	Statistically, the servers would get about the same number of sessions
each, but you have to trust statistics...
	It does eliminate the lookup table, though.
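To make the idea concrete, a toy illustration (written in SQL only because
that is this list's lingua franca; any language would do, and the hash and
the server count of four are arbitrary): a stable hash of the session id
picks one app server, and every machine can recompute the mapping without a
shared lookup table.

SELECT ('x' || substr(md5('session-4711'), 1, 8))::bit(32)::int & 3 AS server_no;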

idea, I would like to hear a way of implementing them cheaply (and on  
PHP) as well. I may have to give that some thought in fact. Oh yeah, and  
load balancers software often sucks in annoying (if not always  
important) ways.
	You can use lighttpd as a load balancer; I believe it has a sticky
sessions plugin (or you could code one, it's open source after all). It
definitely supports simple round-robin load balancing, acting as a proxy to
any number of independent servers.


matter, it's pretty impressive. The google filesystem has nothing to do  
with databases though, it's more a massive data store / streaming  
storage.
Since when did Massive Data stores have nothing to do with DBs? Isn't  
Oracle Cluster entirely based on forming an enormous scalable disk array  
to store your DB on?
	Um, well, the Google Filesystem is (as its name implies) a filesystem
designed to store huge files in a distributed and redundant manner. Files
are structured as a stream of records (which are themselves big) and it's
designed to support appending records to these stream files efficiently and
without worrying about locking.

	It has no querying features, however, which is why I said it was not a
database.

	I wish I could find the whitepaper; I think the URL was on this list some
day, maybe it's on Google's site?



[PERFORM] tuning Postgres for large data import (using Copy from)

2005-05-12 Thread Marc Mamin
Hello,


I'd like to tune Postgres for large data import (using Copy from).


here are a few steps already done:



1) use 3 different disks for:

-1: source data
-2: index tablespaces
-3: data tablespaces


2) define all foreign keys as initially deferred


3) tune some parameters:



max_connections =20
shared_buffers =3
work_mem = 8192  
maintenance_work_mem = 32768
checkpoint_segments = 12

(I also modified the kernel accordingly)




4) run VACUUM regularly


The server runs RedHat and has 1GB RAM

In production (which may run on a better server), I plan to:

- import a few million rows per day,
- keep up to ca. 100 million rows in the DB,
- delete older data.




I've seen a few postings on hash/btree indexes which say that hash indexes
do not work very well in Postgres; currently I only use btree indexes. Could
I gain performance by using hash indexes as well?

How does Postgres handle concurrent COPY FROM on the same table / different
tables?


I'd be glad of any further suggestions on how to increase performance.




Marc







Re: [PERFORM] Partitioning / Clustering

2005-05-12 Thread Alex Turner
Having local sessions is unnecessary, and here is my logic:

Generally most people have less than 100Mb of bandwidth to the internet.

If you assume that you are transferring equal or less session data
between your session server (let's say an RDBMS) and the app server than
you are between the app server and the client, an out-of-band 100Mb
network for session information is plenty of bandwidth. This also
represents OLTP-style traffic, which PostgreSQL is pretty good at. You
should easily be able to get over 100 TPS. 100 hits per second is an
awful lot of traffic, more than any website I've managed will ever see.

Why solve the complicated clustered sessions problem when you don't
really need to?
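A rough sketch of what that looks like on the database side (the table layout
below is an assumption, not something specified in the thread): each page hit
becomes one small primary-key UPDATE, which is exactly the OLTP shape being
described.

CREATE TABLE sessions (
    session_id  text PRIMARY KEY,
    data        bytea,                            -- serialized session payload
    last_seen   timestamptz NOT NULL DEFAULT now()
);

-- per-request touch: a single-row, index-driven write
UPDATE sessions
SET data = decode('c2Vzc2lvbi1zdGF0ZQ==', 'base64'),  -- placeholder payload
    last_seen = now()
WHERE session_id = 'abc123';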

Alex Turner
netEconomist

On 5/11/05, PFC <[EMAIL PROTECTED]> wrote:
> 
> 
> > However, memcached (and for us, pg_memcached) is an excellent way to
> > improve
> > horizontal scalability by taking disposable data (like session
> > information)
> > out of the database and putting it in protected RAM.
> 
> So, what is the advantage of such a system versus, say, a "sticky
> sessions" system where each session is assigned to ONE application server
> (not PHP then) which keeps it in RAM as native objects instead of
> serializing and deserializing it on each request ?
> I'd say the sticky sessions should perform a lot better, and if one
> machine dies, only the sessions on this one are lost.
> But of course you can't do it with PHP as you need an app server which
> can manage sessions. Potentially the savings are huge, though.
> 
> On Google, their distributed system spans a huge number of PCs and it 
> has
> redundancy, ie. individual PC failure is a normal thing and is a part of
> the system, it is handled gracefully. I read a paper on this matter, it's
> pretty impressive. The google filesystem has nothing to do with databases
> though, it's more a massive data store / streaming storage.
> 



Re: [PERFORM] AND OR combination: index not being used

2005-05-12 Thread Tom Lane
David Teran <[EMAIL PROTECTED]> writes:
> Any hint how 
> to force postgres to use the index even with more OR parts?

More up-to-date statistics would evidently help; the thing is estimating
hundreds of rows returned and actually finding none.
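For anyone wanting to act on that, a sketch of the usual knobs (the target of
100 is only an example; the table and column names are from the original
post):

ALTER TABLE attribute_value ALTER COLUMN attribute_type SET STATISTICS 100;
ALTER TABLE attribute_value ALTER COLUMN id_attribute   SET STATISTICS 100;
ANALYZE attribute_value;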

regards, tom lane



Re: [PERFORM] tuning Postgres for large data import (using Copy from)

2005-05-12 Thread Tom Lane
"Marc Mamin" <[EMAIL PROTECTED]> writes:
> 1) use 3 different disks for:

>   -1: source data
>   -2: index tablespaces
>   -3: data tablespaces

It's probably much more important to know where you put the WAL.

regards, tom lane



Re: [PERFORM] tuning Postgres for large data import (using Copy from)

2005-05-12 Thread John A Meinel
Marc Mamin wrote:
Hello,
I'm not an expert, but I'll give some suggestions.
I'd like to tune Postgres for large data import (using Copy from).
I believe that COPY FROM a file is supposed to be faster than COPY FROM
STDIN, but the file must be available to the backend process. If you can
do it, you should think about it, as it eliminates the communication
between the client and the backend.
here are a few steps already done:

1) use 3 different disks for:
-1: source data
-2: index tablespaces
-3: data tablespaces

Make sure pg_xlog is on its own filesystem. It contains the
write-ahead log, and putting it by itself keeps the number of seeks
down. If you are constrained, I think pg_xlog is more important than
moving the index tablespaces.

2) define all foreign keys as initially deferred
3) tune some parameters:

max_connections =20
shared_buffers =3
work_mem = 8192
maintenance_work_mem = 32768
checkpoint_segments = 12
(I also modified the kernel accordingly)
Don't forget to increase your free space map if you are going to be
doing deletes frequently.

4) runs VACUUM regulary
The server runs RedHat and has 1GB RAM
In the production (which may run on a better server), I plan to:
- import a few millions rows per day,
- keep up to ca 100 millions rows in the db
- delete older data

I've seen a few posting on hash/btree indexes, which say that hash index do
not work very well in Postgres;
currently, I only use btree indexes. Could I gain performances whole using
hash indexes as well ?
I doubt it.
How does Postgres handle concurrent copy from on: same table / different
tables ?
I think it is better with different tables. If using the same table, and
there are indexes, it has to grab a lock for updating the index, which
causes contention between 2 processes writing to the same table.
I'd be glad on any further suggestion on how to further increase my
performances.
Since you are deleting data often, and copying often, I might recommend
using a partition scheme with a view to bind everything together. That
way you can just drop the old table rather than doing a delete. I don't
know how this would affect foreign key references.
But basically you can create a new table, and do a copy without having
any indexes, then build the indexes, analyze, update the view.
And when deleting you can update the view, and drop the old table.
Something like this:
-- placeholder names ("master_table", column "blah", view "all_days") stand in for the real schema
CREATE TABLE table_2005_05_11 (LIKE master_table);
COPY table_2005_05_11 FROM '/path/to/2005_05_11.data';
CREATE INDEX table_2005_05_11_blah_idx ON table_2005_05_11 (blah);
CREATE OR REPLACE VIEW all_days AS
    SELECT * FROM table_2005_05_10
    UNION ALL
    SELECT * FROM table_2005_05_11;
VACUUM ANALYZE table_2005_05_11;
...
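A sketch of the corresponding expiry step John describes, with the same
placeholder names as above: replace the view so it no longer exposes the old
day, then drop that table outright instead of running a long DELETE.

CREATE OR REPLACE VIEW all_days AS
    SELECT * FROM table_2005_05_11;
DROP TABLE table_2005_05_10;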
John
=:->

Marc







Re: [PERFORM] Partitioning / Clustering

2005-05-12 Thread Alex Stapleton
On 12 May 2005, at 15:08, Alex Turner wrote:
Having local sessions is unnesesary, and here is my logic:
Generaly most people have less than 100Mb of bandwidth to the  
internet.

If you make the assertion that you are transferring equal or less
session data between your session server (lets say an RDBMS) and the
app server than you are between the app server and the client, an out
of band 100Mb network for session information is plenty of bandwidth.
This also represents OLTP style traffic, which postgresql is pretty
good at.  You should easily be able to get over 100Tps.  100 hits per
second is an awful lot of traffic, more than any website I've managed
will ever see.
Why solve the complicated clustered sessions problem, when you don't
really need to?
100 hits a second = 8,640,000 hits a day. I work on a site which does more
than 100 million dynamic pages a day. In comparison, Yahoo probably does
more than 100,000,000,000 (100 billion) views a day, if I am interpreting
Alexa's charts correctly, which is about 1,150,000 a second.

Now considering the site I work on is not even in the top 1000 on Alexa,
there's a lot of sites out there which need to solve this problem, I would
assume.

There are also only so many hash table lookups a single machine can do,
even if it's a Quad Opteron behemoth.


Alex Turner
netEconomist
On 5/11/05, PFC <[EMAIL PROTECTED]> wrote:


However, memcached (and for us, pg_memcached) is an excellent way to
improve
horizontal scalability by taking disposable data (like session
information)
out of the database and putting it in protected RAM.
So, what is the advantage of such a system versus, say, a  
"sticky
sessions" system where each session is assigned to ONE application  
server
(not PHP then) which keeps it in RAM as native objects instead of
serializing and deserializing it on each request ?
I'd say the sticky sessions should perform a lot better,  
and if one
machine dies, only the sessions on this one are lost.
But of course you can't do it with PHP as you need an app  
server which
can manage sessions. Potentially the savings are huge, though.

On Google, their distributed system spans a huge number of  
PCs and it has
redundancy, ie. individual PC failure is a normal thing and is a  
part of
the system, it is handled gracefully. I read a paper on this  
matter, it's
pretty impressive. The google filesystem has nothing to do with  
databases
though, it's more a massive data store / streaming storage.







Re: [PERFORM] Partitioning / Clustering

2005-05-12 Thread Alex Turner
Ok - my common sense alarm is going off here...

There are only 6.446 billion people worldwide.  100 billion page views
would require every person in the world to view about 15 pages of Yahoo
every day.  Not very likely.

http://www.internetworldstats.com/stats.htm
suggests that there are around 1 billion people actually on the internet.

That means each and every person on the internet has to view 100 pages of
Yahoo per day.

Pretty unlikely, IMHO.  I for one don't even use Yahoo ;)

100 million page views per day suggests that 1 in 100 people on the
internet each viewed 10 pages of a site.  That's a pretty high
percentage if you ask me.

If I visit 20 web sites in a day and see an average of 10 pages per site,
that means only about 2,000 or so sites generate 100 million page views a
day or better.

100 million pageviews averages to 1157/sec, which we'll double for
peak load to 2314.

I can easily see a system doing 2314 hash lookups per second.  Hell, I
wrote a system that could do a thousand times that four years ago on a
single 1GHz Athlon.  Heck, you can get 2314 lookups/sec on a 486 ;)

Given that session information doesn't _have_ to persist to storage and can
be kept in RAM, a single server could readily manage session information
for even very large sites (of course, over a million concurrent users could
really start chewing into RAM, but if you are Yahoo, you can probably
afford a box with 100GB of RAM ;).

We get over 1000 TPS on a dual Opteron with a couple of mid-size RAID
arrays on 10k disks with fsync on for small transactions.  I'm sure that
could easily be bettered with a few more dollars.

Maybe my numbers are off, but somehow it doesn't seem like that many people
need a highly complex session solution to me.

Alex Turner
netEconomist

On 5/12/05, Alex Stapleton <[EMAIL PROTECTED]> wrote:
> 
> On 12 May 2005, at 15:08, Alex Turner wrote:
> 
> > Having local sessions is unnesesary, and here is my logic:
> >
> > Generaly most people have less than 100Mb of bandwidth to the
> > internet.
> >
> > If you make the assertion that you are transferring equal or less
> > session data between your session server (lets say an RDBMS) and the
> > app server than you are between the app server and the client, an out
> > of band 100Mb network for session information is plenty of bandwidth.
> > This also represents OLTP style traffic, which postgresql is pretty
> > good at.  You should easily be able to get over 100Tps.  100 hits per
> > second is an awful lot of traffic, more than any website I've managed
> > will ever see.
> >
> > Why solve the complicated clustered sessions problem, when you don't
> > really need to?
> 
> 100 hits a second = 8,640,000 hits a day. I work on a site which does
>  > 100 million dynamic pages a day. In comparison Yahoo probably does
>  > 100,000,000,000 (100 billion) views a day
>   if I am interpreting Alexa's charts correctly. Which is about
> 1,150,000 a second.
> 
> Now considering the site I work on is not even in the top 1000 on
> Alexa, theres a lot of sites out there which need to solve this
> problem I would assume.
> 
> There are also only so many hash table lookups a single machine can
> do, even if its a Quad Opteron behemoth.
> 
> 
> > Alex Turner
> > netEconomist
> >
> > On 5/11/05, PFC <[EMAIL PROTECTED]> wrote:
> >
> >>
> >>
> >>
> >>> However, memcached (and for us, pg_memcached) is an excellent way to
> >>> improve
> >>> horizontal scalability by taking disposable data (like session
> >>> information)
> >>> out of the database and putting it in protected RAM.
> >>>
> >>
> >> So, what is the advantage of such a system versus, say, a
> >> "sticky
> >> sessions" system where each session is assigned to ONE application
> >> server
> >> (not PHP then) which keeps it in RAM as native objects instead of
> >> serializing and deserializing it on each request ?
> >> I'd say the sticky sessions should perform a lot better,
> >> and if one
> >> machine dies, only the sessions on this one are lost.
> >> But of course you can't do it with PHP as you need an app
> >> server which
> >> can manage sessions. Potentially the savings are huge, though.
> >>
> >> On Google, their distributed system spans a huge number of
> >> PCs and it has
> >> redundancy, ie. individual PC failure is a normal thing and is a
> >> part of
> >> the system, it is handled gracefully. I read a paper on this
> >> matter, it's
> >> pretty impressive. The google filesystem has nothing to do with
> >> databases
> >> though, it's more a massive data store / streaming storage.
> >>
> >>
> >
> >
> 
>



Re: [PERFORM] Partitioning / Clustering

2005-05-12 Thread John A Meinel
Alex Turner wrote:
Ok - my common sense alarm is going off here...
There are only 6.446 billion people worldwide.  100 Billion page views
would require every person in the world to view 18 pages of yahoo
every day.  Not very likely.
http://www.internetworldstats.com/stats.htm
suggests that there are around 1 billion people actualy on the internet.
That means each and every person on the internet has to view 100 pages
per day of yahoo.
pretty unlikely IMHO.  I for one don't even use Yahoo ;)
100 million page views per day suggests that 1 in 100 people on the
internet each viewed 10 pages of a site.  Thats a pretty high
percentage if you ask me.
In general I think your point is valid. Just remember that it probably
also matters how you count page views. Because technically images are a
separate page (and this thread did discuss serving up images). So if
there are 20 graphics on a specific page, that is 20 server hits just
for that one page.
I could easily see an image heavy site getting 100 hits / page. Which
starts meaning that if 1M users hit 10 pages, then you get 1M*10*100 = 1G.
I still think 100G views on a single website is a lot, but 100M is
certainly possible.
John
=:->




Re: [PERFORM] Partitioning / Clustering

2005-05-12 Thread PFC

100 hits a second = 8,640,000 hits a day. I work on a site which does more
than 100 million dynamic pages a day. In comparison Yahoo probably does
more than 100,000,000,000 (100 billion) views a day, if I am interpreting
Alexa's charts correctly. Which is about 1,150,000 a second.

Read the help on Alexa's site... ;)


Re: [PERFORM] tuning Postgres for large data import (using Copy from)

2005-05-12 Thread Josh Berkus
Marc,

> 1) use 3 different disks for:
>
>   -1: source data
>   -2: index tablespaces
>   -3: data tablespaces

Others have already told you about the importance of relocating WAL.  If you 
are going to be building indexes on the imported data, you might find it 
beneficial to relocate pgsql_tmp for the database in question as well.
Also, I generally find it more beneficial to separate the few largest tables 
onto their own disk resources than to put all tables on one resource and all 
indexes on another.  For example, for TPCH-like tests, I do
array0: OS and pgsql_tmp
array1: LINEITEM
array2: LINEITEM Indexes
array3: all other tables and indexes
array4: pg_xlog
array5: source data

This allows me to load a 100G (actually 270G) TPCH-like database in < 2 hours, 
not counting index-building.

> 2) define all foreign keys as initially deferred

It would be better to drop them before import and recreate them afterwards.  
Same for indexes unless those indexes are over 2G in size.

>   max_connections =20
>   shared_buffers =3
>   work_mem = 8192

Not high enough, unless you have very little RAM.  On an 8G machine I'm using 
256MB.  You might want to use 64MB or 128MB.

>   maintenance_work_mem = 32768

REALLY not high enough.  You're going to need to build big indexes and 
possibly vacuum large tables.  I use the maximum of 1.98GB.  Use up to 1/3 of 
your RAM for this.

>   checkpoint_segments = 12

Also way too low.  Put pg_xlog on its own disk, and give it 128 to 512 
segments (up to 8G).

> The server runs RedHat and has 1GB RAM

Make sure you're running a 2.6.10+ kernel.  Make sure ext3 is set noatime, 
data=writeback.  Buy more RAM.  Etc.

> How does Postgres handle concurrent copy from on: same table / different
> tables ?

Same table is useless; the imports will effectively serialize  (unless you use 
pseudo-partitioning).  You can parallel load on multiple tables up to the 
lower of your number of disk channels or number of processors.
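To make the advice concrete, a hedged sketch of a load session following
these points (all object names and paths are placeholders; on 8.0 the memory
setting below is in kilobytes):

SET maintenance_work_mem = 524288;        -- ~512MB for index builds and VACUUM
-- shared_buffers and checkpoint_segments live in postgresql.conf,
-- they cannot be raised per session.

ALTER TABLE big_table DROP CONSTRAINT big_table_customer_fk;
DROP INDEX big_table_customer_idx;

COPY big_table FROM '/data/import/big_table.copy';

CREATE INDEX big_table_customer_idx ON big_table (customer_id);
ALTER TABLE big_table ADD CONSTRAINT big_table_customer_fk
    FOREIGN KEY (customer_id) REFERENCES customers (id);
ANALYZE big_table;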

-- 
Josh Berkus
Aglio Database Solutions
San Francisco



Re: [PERFORM] Partitioning / Clustering

2005-05-12 Thread Josh Berkus
People,

> In general I think your point is valid. Just remember that it probably
> also matters how you count page views. Because technically images are a
> separate page (and this thread did discuss serving up images). So if
> there are 20 graphics on a specific page, that is 20 server hits just
> for that one page.

Also, there's bots and screen-scrapers and RSS, web e-mails, and web services 
and many other things which create hits but are not "people".  I'm currently 
working on clickstream for a site which is nowhere in the top 100, and is 
getting 3 million real hits a day ... and we know for a fact that at least 
1/4 of that is bots.

Regardless, the strategy you should be employing for a high traffic site is 
that if your users hit the database for anything other than direct 
interaction (like filling out a webform) then you're lost.  Use memcached, 
squid, lighttpd caching, ASP.NET caching, pools, etc.  Keep the load off the 
database except for the stuff that only the database can do.

-- 
Josh Berkus
Aglio Database Solutions
San Francisco



Re: [PERFORM] Partitioning / Clustering

2005-05-12 Thread Alex Stapleton
On 12 May 2005, at 18:33, Josh Berkus wrote:
People,

In general I think your point is valid. Just remember that it  
probably
also matters how you count page views. Because technically images  
are a
separate page (and this thread did discuss serving up images). So if
there are 20 graphics on a specific page, that is 20 server hits just
for that one page.

Also, there's bots and screen-scrapers and RSS, web e-mails, and  
web services
and many other things which create hits but are not "people".  I'm  
currently
working on clickstream for a site which is nowhere in the top 100,  
and is
getting 3 million real hits a day ... and we know for a fact that  
at least
1/4 of that is bots.
I doubt bots are generally Alexa toolbar enabled.
Regardless, the strategy you should be employing for a high traffic  
site is
that if your users hit the database for anything other than direct
interaction (like filling out a webform) then you're lost.Use  
memcached,
squid, lighttpd caching, ASP.NET caching, pools, etc.   Keep the  
load off the
database except for the stuff that only the database can do.
This is the approach I would take as well. There is no point storing stuff
in a DB if you're only doing direct lookups on it and it isn't the sort of
data that you care so much about the integrity of.


--
Josh Berkus
Aglio Database Solutions
San Francisco





Re: [PERFORM] Partitioning / Clustering

2005-05-12 Thread Josh Berkus
Ross,

> Memcached is a PG memory store, I gather,

Nope.  It's a hyperfast resident-in-memory hash that allows you to stash stuff 
like user session information and even materialized query set results.  
Thanks to SeanC, we even have a plugin, pgmemcached.

> but...what is squid, lighttpd? 
> anything directly PG-related?

No.   These are all related to making the web server do more.   The idea is 
NOT to hit the database every time you have to serve up a web page, and 
possibly not to hit the web server either.  For example, you can use squid 3 
for "reverse" caching in front of your web server, and serve far more page 
views than you could with Apache alone.

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco



Re: [PERFORM] Sort and index

2005-05-12 Thread Manfred Koizar
On Wed, 11 May 2005 16:15:16 -0500, "Jim C. Nasby" <[EMAIL PROTECTED]>
wrote:
>> This is divided by the number of index columns, so the index correlation
>> is estimated to be 0.219.
>
>That seems like a pretty bad assumption to make.

Any assumption we make without looking at entire index tuples has to be
bad.  A new GUC variable secondary_correlation introduced by my patch at
least gives you a chance to manually control the effects of additional
index columns.

>> In my tests I got much more plausible results with
>> 
>>  1 - (1 - abs(correlation))^2
>
>What's the theory behind that?

The same as for csquared -- pure intuition.  But the numbers presented
in http://archives.postgresql.org/pgsql-hackers/2002-10/msg00072.php
seem to imply that in this case my intuition is better ;-)

Actually, the above formula was not proposed in that mail.  AFAIR it gives
results between p2 and p3.
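For the index correlation of 0.219 mentioned above, the two formulas give
quite different numbers; the comparison is plain arithmetic and can be
checked in psql:

SELECT power(0.219, 2)               AS csquared_estimate,   -- ~0.048
       1 - power(1 - abs(0.219), 2)  AS proposed_estimate;   -- ~0.390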

>And I'd still like to know why correlation squared is used.

On Wed, 02 Oct 2002 18:48:49 -0400, Tom Lane <[EMAIL PROTECTED]> wrote:
|The indexCorrelation^2 algorithm was only a quick hack with no theory
|behind it :-(.

>It depends on the patches, since this is a production machine. Currently
>it's running 7.4.*mumble*,

The patch referenced in
http://archives.postgresql.org/pgsql-hackers/2003-08/msg00931.php is
still available.  It doesn't touch too many places and should be easy to
review.  I have been using it and its predecessors in production for more
than two years.  Let me know if the 74b1 version does not apply cleanly to
your source tree.

Servus
 Manfred



[PERFORM] Optimize complex join to use where condition before join

2005-05-12 Thread Sebastian Hennebrueder
Hello,
I am facing a problem in optimizing the query shown below.
Most queries in the application only find about 20 to 100 matching rows.
The query joins the table taufgaben_mitarbeiter to taufgaben, on which a
condition like the following "where clause" is frequently used:

where
am.fmitarbeiter_id = 54
then there is a nested join to taufgaben -> tprojekt -> tkunden_kst -> 
tkunden.

What I would like to achieve is that, before joining all the tables, the
join of taufgaben_mitarbeiter

(... from
taufgaben left join taufgaben_mitarbeiter am
on taufgaben.fid = am.faufgaben_id)

is done and the where condition is evaluated. Then an index scan to join
the other data is run.
What is happening at the moment (if I understood the explain analyze
correctly) is that the full join is done and the where condition is applied
at the end.
The query with seqscan and nestloop enabled takes about 3 seconds.
The query with both disabled takes 0.52 seconds.
The query with only nestloop disabled takes 0.6 seconds,
and with only seqscan disabled it takes about 3 seconds.
Below you can find the explain analyze from "seqscan and nestloop enabled" and 
from both disabled. The problem seems to be right at the beginning when the rows are 
badly estimated.
...
Merge Cond: ("outer".fid = "inner".faufgaben_id)"
"  ->  Nested Loop  (cost=1621.51..1729.28 rows=6 width=2541) (actual 
time=328.000..3125.000 rows=1118 loops=1)"
...
I am using PostgreSQL 8.0 on Windows
Thank you for any idea

--
Kind Regards / Viele Grüße
Sebastian Hennebrueder
-
http://www.laliluna.de/tutorials.html
Tutorials for Java, Struts, JavaServer Faces, JSP, Hibernate, EJB and more.
enabled seqscan and nested_loop
explain analyze
SELECT taufgaben.fid AS taufgaben_fid, taufgaben.fprojekt_id AS
   taufgaben_fprojekt_id, taufgaben.fnummer AS taufgaben_fnummer,
   taufgaben.fbudget AS taufgaben_fbudget, taufgaben.ftyp AS taufgaben_ftyp,
   taufgaben.fberechnungsart AS taufgaben_fberechnungsart,
   taufgaben.fverrechnung_extern AS taufgaben_fverrechnung_extern,
   taufgaben.fverrechnungsbasis AS taufgaben_fverrechnungsbasis,
   taufgaben.fstatus AS taufgaben_fstatus, taufgaben.fkurzbeschreibung AS
   taufgaben_fkurzbeschreibung, taufgaben.fansprechpartner AS
   taufgaben_fansprechpartner, taufgaben.fanforderer AS taufgaben_fanforderer,
   taufgaben.fstandort_id AS taufgaben_fstandort_id, taufgaben.fwunschtermin
   AS taufgaben_fwunschtermin, taufgaben.fstarttermin AS
   taufgaben_fstarttermin, taufgaben.fgesamtaufwand AS
   taufgaben_fgesamtaufwand, taufgaben.fistaufwand AS taufgaben_fistaufwand,
   taufgaben.fprio AS taufgaben_fprio, taufgaben.ftester AS taufgaben_ftester,
   taufgaben.ffaellig AS taufgaben_ffaellig, taufgaben.flevel AS
   taufgaben_flevel, taufgaben.fkategorie AS taufgaben_fkategorie,
   taufgaben.feintragbearbeitung AS taufgaben_feintragbearbeitung,
   taufgaben.fbearbeitungsstatus AS taufgaben_fbearbeitungsstatus,
   taufgaben.fsolllimit AS taufgaben_fsolllimit, taufgaben.fistlimit AS
   taufgaben_fistlimit, taufgaben.fpauschalbetrag AS
   taufgaben_fpauschalbetrag, taufgaben.frechnungslaeufe_id AS
   taufgaben_frechnungslaeufe_id, taufgaben.fzuberechnen AS
   taufgaben_fzuberechnen, tprojekte.fid AS tprojekte_fid,
   tprojekte.fbezeichnung AS tprojekte_fbezeichnung, tprojekte.fprojektnummer
   AS tprojekte_fprojektnummer, tprojekte.fbudget AS tprojekte_fbudget,
   tprojekte.fverrechnung_extern AS tprojekte_fverrechnung_extern,
   tprojekte.fstatus AS tprojekte_fstatus, tprojekte.fkunden_kst_id AS
   tprojekte_fkunden_kst_id, tprojekte.fverrechnungsbasis AS
   tprojekte_fverrechnungsbasis, tprojekte.fberechnungsart AS
   tprojekte_fberechnungsart, tprojekte.fprojekttyp AS tprojekte_fprojekttyp,
   tprojekte.fkostentraeger_id AS tprojekte_fkostentraeger_id,
   tprojekte.fprojektleiter_id AS tprojekte_fprojektleiter_id,
   tprojekte.fpauschalsatz AS tprojekte_fpauschalsatz,
   tprojekte.frechnungslaeufe_id AS tprojekte_frechnungslaeufe_id,
   tprojekte.fzuberechnen AS tprojekte_fzuberechnen, tprojekte.faufschlagrel
   AS tprojekte_faufschlagrel, tprojekte.faufschlagabs AS
   tprojekte_faufschlagabs, tprojekte.fbearbeitungsstatus AS
   tprojekte_fbearbeitungsstatus, tuser.fusername AS tuser_fusername,
   tuser.fpassword AS tuser_fpassword, tuser.fvorname AS tuser_fvorname,
   tuser.fnachname AS tuser_fnachname, tuser.fismitarbeiter AS
   tuser_fismitarbeiter, tuser.flevel AS tuser_flevel, tuser.fkuerzel AS
   tuser_fkuerzel, taufgaben.floesungsbeschreibung AS
   taufgaben_floesungsbeschreibung, taufgaben.ffehlerbeschreibung AS
   taufgaben_ffehlerbeschreibung, taufgaben.faufgabenstellung AS
   taufgaben_faufgabenstellung, taufgaben.fkritischeaenderungen AS
   taufgaben_fkritischeaenderungen, taufgaben.fbdeaufgabenersteller_id AS
   taufgaben_fbdeaufgabenersteller_id, taufgaben.fzufaktorieren AS
   taufgaben_fzufaktorieren, tprojekte.fzufaktorieren AS
   tprojekte_fzufaktorieren, taufgaben.fisdirty AS taufgaben_fisdirty,
   taufgaben.fnf_kunde_stunden

Re: [PERFORM] AND OR combination: index not being used

2005-05-12 Thread David Teran
On 12.05.2005, at 16:15, Tom Lane wrote:
David Teran <[EMAIL PROTECTED]> writes:
Any hint how
to force postgres to use the index even with more OR parts?
More up-to-date statistics would evidently help; the thing is 
estimating
hundreds of rows returned and actually finding none.

I always do a 'vacuum analyze' if something does not work as expected. 
But this did not help. Any other tip?

regards, David


Re: [PERFORM] Optimize complex join to use where condition before

2005-05-12 Thread Sebastian Hennebrueder
Solution to my problem:
I added indexes on each foreign key (some had been missing). I will try
tomorrow by daylight what influence this actually had. The indexes alone
did not change anything, though! Even with lower random_page_cost and
higher shared memory.
The big change was the following: I created a view which holds a part of
the query; the part is the nested join I am doing from tprojekt,
tkunden_kst, ... (see below).
Then I changed my query to include the view, which improved the
performance from 3000 to 450 ms, which is quite good now.
But I have two more questions.

a) ###
I estimated the theoretical speed a little bit higher.
The query without joining the view takes about 220 ms. A query to the
view with a condition projekt_id in (x, y, z), where x, y, z are all the
projects I got with the first query, takes 32 ms.
So my calculation is:
query a: 220 ms
query b to the view with projekt_id in (...): 32 ms
= 252 ms
plus some time to add the adequate row from query b to one of the 62 rows
from query a. This extra time seems to be quite high at about 200 ms.
Or, alternatively:
query a: 220 ms
for each of the 62 rows, a query to the view with projekt_id = x:
62 * 2 ms = 124 ms
= 344 ms, plus some time to assemble all this
=> about 100 ms for assembling. This is quite a lot, or am I wrong?
b) ###
My query takes about 200 ms. Most of the time is taken by the following
part:
LEFT JOIN (
   SELECT DISTINCT taufgaben_patches.faufgaben_id
   FROM taufgaben_patches
   ORDER BY taufgaben_patches.faufgaben_id
   ) patchdaten ON taufgaben.fid = patchdaten.faufgaben_id
What I want to achieve is one column in my query being null or not null,
indicating whether there is a patch which includes the aufgabe (engl.: task).
Is there a better way?
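One possible alternative, offered only as a guess rather than a tested fix:
expose the "has a patch" flag with EXISTS instead of the DISTINCT subselect,
so the planner can probe an index on taufgaben_patches.faufgaben_id per row
instead of materializing the whole DISTINCT set. It yields true/false rather
than null/not null, but it carries the same information.

SELECT taufgaben.fid,
       EXISTS (SELECT 1
               FROM taufgaben_patches
               WHERE taufgaben_patches.faufgaben_id = taufgaben.fid) AS has_patch
FROM taufgaben;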
--
Kind Regards / Viele Grüße
Sebastian Hennebrueder
-
http://www.laliluna.de/tutorials.html
Tutorials for Java, Struts, JavaServer Faces, JSP, Hibernate, EJB and more.
##
Below you can find
query solution I found
explain analyze of the complete query (my solution)
explain analyze of query a
explain analyze of view with one project_id as condition
explain analyze
SELECT taufgaben.fid AS taufgaben_fid,
taufgaben.fprojekt_id AStaufgaben_fprojekt_id,
taufgaben.fnummer AS taufgaben_fnummer,
 taufgaben.fbudget AS taufgaben_fbudget,
 taufgaben.ftyp AS taufgaben_ftyp,
   taufgaben.fberechnungsart AS taufgaben_fberechnungsart,
   taufgaben.fverrechnung_extern AS taufgaben_fverrechnung_extern,
   taufgaben.fverrechnungsbasis AS taufgaben_fverrechnungsbasis,
   taufgaben.fstatus AS taufgaben_fstatus, taufgaben.fkurzbeschreibung AS
   taufgaben_fkurzbeschreibung, taufgaben.fansprechpartner AS
   taufgaben_fansprechpartner, taufgaben.fanforderer AS
taufgaben_fanforderer,
   taufgaben.fstandort_id AS taufgaben_fstandort_id,
taufgaben.fwunschtermin
   AS taufgaben_fwunschtermin, taufgaben.fstarttermin AS
   taufgaben_fstarttermin, taufgaben.fgesamtaufwand AS
   taufgaben_fgesamtaufwand, taufgaben.fistaufwand AS
taufgaben_fistaufwand,
   taufgaben.fprio AS taufgaben_fprio, taufgaben.ftester AS
taufgaben_ftester,
   taufgaben.ffaellig AS taufgaben_ffaellig, taufgaben.flevel AS
   taufgaben_flevel, taufgaben.fkategorie AS taufgaben_fkategorie,
   taufgaben.feintragbearbeitung AS taufgaben_feintragbearbeitung,
   taufgaben.fbearbeitungsstatus AS taufgaben_fbearbeitungsstatus,
   taufgaben.fsolllimit AS taufgaben_fsolllimit, taufgaben.fistlimit AS
   taufgaben_fistlimit, taufgaben.fpauschalbetrag AS
   taufgaben_fpauschalbetrag, taufgaben.frechnungslaeufe_id AS
   taufgaben_frechnungslaeufe_id, taufgaben.fzuberechnen AS
   taufgaben_fzuberechnen,
taufgaben.floesungsbeschreibung AS
   taufgaben_floesungsbeschreibung, taufgaben.ffehlerbeschreibung AS
   taufgaben_ffehlerbeschreibung, taufgaben.faufgabenstellung AS
   taufgaben_faufgabenstellung, taufgaben.fkritischeaenderungen AS
   taufgaben_fkritischeaenderungen, taufgaben.fbdeaufgabenersteller_id AS
   taufgaben_fbdeaufgabenersteller_id, taufgaben.fzufaktorieren AS
   taufgaben_fzufaktorieren,
   taufgaben.fisdirty AS taufgaben_fisdirty,
   taufgaben.fnf_kunde_stunden AS taufgaben_fnf_kunde_stunden,
   taufgaben.fzf_kunde_stunden AS taufgaben_fzf_kunde_stunden,
   taufgaben.fbf_kunde_stunden AS taufgaben_fbf_kunde_stunden,
   taufgaben.fnf_kunde_betrag AS taufgaben_fnf_kunde_betrag,
   taufgaben.fzf_kunde_betrag AS taufgaben_fzf_kunde_betrag,
   taufgaben.fbf_kunde_betrag AS taufgaben_fbf_kunde_betrag,
   taufgaben.fgesamt_brutto_stunden AS taufgaben_fgesamt_brutto_stunden,
   taufgaben.fgesamt_brutto_betrag AS taufgaben_fgesamt_brutto_betrag,
   taufgaben.fhinweisgesendet AS taufgaben_fhinweisgesendet,
   taufgaben.fwarnunggesendet AS taufgaben_fwarnunggesendet,
   taufgaben.fnfgesamtaufwand AS
   taufgaben_fnfgesamtaufwand, taufgaben.fnf_netto_stunden AS
   taufgaben_fnf_netto_stunden, taufgaben.fnf_brutto_stunden AS
   taufgaben_fnf_brutto_stunden, taufgaben.fnfhinweisgesendet AS
   taufgaben_fnfhinweisgesendet, taufgaben.fnfwarnunggesendet AS
   taufgaben_f

Re: [PERFORM] Optimize complex join to use where condition before

2005-05-12 Thread Sebastian Hennebrueder
Not solved after all, as I had thought. I integrated the query into a view
and the query plan became very bad once again.
The reason is that when I use the view, the joins end up in a different
order.

Does anyone have an idea how to solve this?
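One knob that may be worth trying here, hedged as a guess: the planner's
collapse limits control how freely joins are reordered when views and
subqueries are flattened, so pinning them low keeps the order closer to the
one written in the view and the outer query.

SET join_collapse_limit = 1;   -- keep explicit JOIN syntax in the written order
SET from_collapse_limit = 1;   -- do not fold subqueries/views into the outer query's join search
-- then re-run the EXPLAIN ANALYZE below and compare plans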
Sebastian
a) bad order but the one I have in my application
explain analyze
SELECT taufgaben.fid AS taufgaben_fid, taufgaben.fprojekt_id AS
   taufgaben_fprojekt_id, taufgaben.fnummer AS taufgaben_fnummer,
   taufgaben.fbudget AS taufgaben_fbudget,
   taufgaben.ftyp AS taufgaben_ftyp,
   taufgaben.fberechnungsart AS taufgaben_fberechnungsart,
   taufgaben.fverrechnung_extern AS taufgaben_fverrechnung_extern,
   taufgaben.fverrechnungsbasis AS taufgaben_fverrechnungsbasis,
   taufgaben.fstatus AS taufgaben_fstatus,
   taufgaben.fkurzbeschreibung AS
   taufgaben_fkurzbeschreibung,
   taufgaben.fansprechpartner AS
   taufgaben_fansprechpartner,
   taufgaben.fanforderer AS taufgaben_fanforderer,
   taufgaben.fstandort_id AS taufgaben_fstandort_id,
   taufgaben.fwunschtermin  AS taufgaben_fwunschtermin,
   taufgaben.fstarttermin AS taufgaben_fstarttermin,
   taufgaben.fgesamtaufwand AS taufgaben_fgesamtaufwand,
   taufgaben.fistaufwand AS taufgaben_fistaufwand,
   taufgaben.fprio AS taufgaben_fprio,
   taufgaben.ftester AS taufgaben_ftester,
   taufgaben.ffaellig AS taufgaben_ffaellig,
   taufgaben.flevel AS   taufgaben_flevel,
   taufgaben.fkategorie AS taufgaben_fkategorie,
   taufgaben.feintragbearbeitung AS taufgaben_feintragbearbeitung,
   taufgaben.fbearbeitungsstatus AS taufgaben_fbearbeitungsstatus,
   taufgaben.fsolllimit AS taufgaben_fsolllimit,
   taufgaben.fistlimit AStaufgaben_fistlimit,
   taufgaben.fpauschalbetrag AS   taufgaben_fpauschalbetrag,
   taufgaben.frechnungslaeufe_id AS   taufgaben_frechnungslaeufe_id,
   taufgaben.fzuberechnen AS   taufgaben_fzuberechnen,
   taufgaben.floesungsbeschreibung AS   taufgaben_floesungsbeschreibung,
   taufgaben.ffehlerbeschreibung AStaufgaben_ffehlerbeschreibung,
   taufgaben.faufgabenstellung AS   taufgaben_faufgabenstellung,
   taufgaben.fkritischeaenderungen AStaufgaben_fkritischeaenderungen,
   taufgaben.fbdeaufgabenersteller_id AS
taufgaben_fbdeaufgabenersteller_id,
   taufgaben.fzufaktorieren AStaufgaben_fzufaktorieren,
   taufgaben.fisdirty AS taufgaben_fisdirty,
   taufgaben.fnf_kunde_stunden AS taufgaben_fnf_kunde_stunden,
   taufgaben.fzf_kunde_stunden AS taufgaben_fzf_kunde_stunden,
   taufgaben.fbf_kunde_stunden AS taufgaben_fbf_kunde_stunden,
   taufgaben.fnf_kunde_betrag AS taufgaben_fnf_kunde_betrag,
   taufgaben.fzf_kunde_betrag AS taufgaben_fzf_kunde_betrag,
   taufgaben.fbf_kunde_betrag AS taufgaben_fbf_kunde_betrag,
   taufgaben.fgesamt_brutto_stunden AS taufgaben_fgesamt_brutto_stunden,
   taufgaben.fgesamt_brutto_betrag AS taufgaben_fgesamt_brutto_betrag,
   taufgaben.fhinweisgesendet AS taufgaben_fhinweisgesendet,
   taufgaben.fwarnunggesendet AS taufgaben_fwarnunggesendet,
   taufgaben.fnfgesamtaufwand AS   taufgaben_fnfgesamtaufwand,
   taufgaben.fnf_netto_stunden AStaufgaben_fnf_netto_stunden,
   taufgaben.fnf_brutto_stunden AS   taufgaben_fnf_brutto_stunden,
   taufgaben.fnfhinweisgesendet AS   taufgaben_fnfhinweisgesendet,
   taufgaben.fnfwarnunggesendet AStaufgaben_fnfwarnunggesendet,
   taufgaben.fhatzeiten AS taufgaben_fhatzeiten,
   taufgaben.fnichtpublicrechnungsfaehig AS
taufgaben_fnichtpublicrechnungsfaehig,
   taufgaben.fnichtpublicrechnungsfaehigbetrag AS
taufgaben_fnichtpublicrechnungsfaehigbetrag,
   taufgaben.fnichtberechenbar AStaufgaben_fnichtberechenbar,
   taufgaben.fnichtberechenbarbetrag AS
taufgaben_fnichtberechenbarbetrag,
   taufgaben.finternertester AS   taufgaben_finternertester,
   taufgaben.finterngetestet AS   taufgaben_finterngetestet,
   taufgaben.fanzahlbearbeiter AS taufgaben_fanzahlbearbeiter,
   patchdaten.faufgaben_id AS pataid
   ,vprojekt.*
FROM taufgaben LEFT JOIN (
   SELECT DISTINCT taufgaben_patches.faufgaben_id
   FROM taufgaben_patches
   ) patchdaten ON taufgaben.fid = patchdaten.faufgaben_id
JOIN vprojekt ON taufgaben.fprojekt_id = vprojekt.tprojekte_fid

join taufgaben_mitarbeiter am on taufgaben.fid = am.faufgaben_id
where
am.fmitarbeiter_id = 54
and
taufgaben.fbearbeitungsstatus <> 2
"Nested Loop  (cost=1349.13..1435.29 rows=1 width=2541) (actual 
time=1640.000..3687.000 rows=62 loops=1)"
"  Join Filter: ("inner".fid = "outer".faufgaben_id)"
"  ->  Index Scan using idx_tauf_mit_mitid on taufgaben_mitarbeiter am  
(cost=0.00..80.65 rows=35 width=4) (actual time=0.000..0.000 rows=765 
loops=1)"
"Index Cond: (fmitarbeiter_id = 54)"
"  ->  Materialize  (cost=1349.13..1349.20 rows=7 width=2541) (actual 
time=0.531..1.570 rows=1120 loops=765)"
"->  Merge Join  (cost=1343.42..1349.13 rows=7 width=2541) 
(actual time=406.000..515.000 rows=1120 loops=1)"
"  Merge Cond: ("outer".fid = "inner".fprojekt_id)"
"  ->  Sort  (cost=130.89..130.90 rows=6 width=1494) (actual 
time=203

[PERFORM] Recommendations for set statistics

2005-05-12 Thread Sebastian Hennebrueder
Hello,
I could not find any recommendations for the level of SET STATISTICS and
what a specific level actually means.
What is the difference between 1, 50 and 100? What is recommended for a
table or column?

--
Kind Regards / Viele Grüße
Sebastian Hennebrueder
-
http://www.laliluna.de/tutorials.html
Tutorials for Java, Struts, JavaServer Faces, JSP, Hibernate, EJB and more.


Re: [PERFORM] Recommendations for set statistics

2005-05-12 Thread John A Meinel
Sebastian Hennebrueder wrote:
Hello,
I could not find any recommendations for the level of SET STATISTICS and
what a specific level actually means.
What is the difference between 1, 50 and 100? What is recommended for a
table or column?
Default I believe is 10. The higher the number, the more statistics are
kept, with a maximum of 1000.
The default is a little bit low for columns used in foreign keys, though
frequently it is okay.
When problems start, try setting them to 100 or 200. Higher is more
accurate, but takes longer to compute, *and* takes longer when planning
the optimal query method. It can be worth it, though.
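For reference, a small sketch of how that is applied and checked (the table
and column names here are invented):

ALTER TABLE orders ALTER COLUMN customer_id SET STATISTICS 100;
ANALYZE orders;

-- the gathered statistics can then be inspected:
SELECT attname, n_distinct, most_common_vals
FROM pg_stats
WHERE tablename = 'orders' AND attname = 'customer_id';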
John
=:->




Re: [PERFORM] Recommendations for set statistics

2005-05-12 Thread Christopher Browne
After a long battle with technology, [EMAIL PROTECTED] (Sebastian 
Hennebrueder), an earthling, wrote:
> I could not find any recommandations for the level of set statistics
> and what a specific level does actually mean.
> What is the difference between 1, 50 and 100? What is recommanded for
> a table or column?

The numbers represent the number of "bins" used to establish
histograms that estimate how the data looks.

The default is to have 10 bins, and 300 items are sampled at ANALYZE
time per bin.

1 would probably be rather bad, having very little ability to express
the distribution of data.  100 bins would be 10x as expensive to store
as 10, but would provide a much better picture of the distribution.

It is widely believed that a somewhat larger default than 10 would be
a "good thing," as it seems to be fairly common for 10 to be too small
to allow statistics to be stable.  But nobody has done any formal
evaluation as to whether it would make sense to jump from 10 to:

 - 15?
 - 20?
 - 50?
 - 100?
 - More than that?

If we could show that 90% of the present "wrong results" that come
from the default of 10 could be addressed by an increase to 20 bins,
and the remainder could be left to individual tuning, well, getting
rid of 90% of the "query plan errors" would seem worthwhile.

I'd hope that a moderate (e.g. - from 10 to 20) increase, which would
be pretty cheap, would help a fair bit, but there is no evidence one
way or the other.  Unfortunately, nobody has come up with a decent way
of evaluating how much good a change to the default would actually do.

If you can discover an evaluation scheme, your results are likely to
get an ear.
-- 
"cbbrowne","@","gmail.com"
http://linuxdatabases.info/info/lsf.html
"In 1555,  Nostradamus wrote: 'Come  the millennium, month 12,  in the
home  of greatest  power,  the village  idiot  will come  forth to  be
acclaimed the leader.'"


Re: [PERFORM] Optimize complex join to use where condition before

2005-05-12 Thread Greg Stark

Sebastian Hennebrueder <[EMAIL PROTECTED]> writes:

> User-Agent: Mozilla Thunderbird 1.0 (Windows/20041206)
> ...
> 
> "Nested Loop  (cost=1349.13..1435.29 rows=1 width=2541) (actual 
> time=1640.000..3687.000 rows=62 loops=1)"
> "  Join Filter: ("inner".fid = "outer".faufgaben_id)"
> "  ->  Index Scan using idx_tauf_mit_mitid on taufgaben_mitarbeiter am  
> (cost=0.00..80.65 rows=35 width=4) (actual time=0.000..0.000 rows=765 
> loops=1)"

Is it really Mozilla Thunderbird that's causing this new craptastic mangling
of plans in people's mails? I was assuming it was some new idea of how to mess
up people's mail coming out of Exchange or Lotus or some other such "corporate
messaging" software that only handled SMTP mail as an afterthought. This is,
uh, disappointing.

-- 
greg

