Re: [PERFORM] PostgreSQL vs. MySQL

2003-07-05 Thread Grega Bremec
...and on Sat, Jul 05, 2003 at 12:24:18AM +0200, Bjoern Metzdorf used the keyboard:
> >> Afaik, your original posting said postgresql was 3 times slower than
> >> mysql and that you are going to leave this list now. This implied
> >> that you have made your decision between postgresql and mysql,
> >> taking mysql because it is faster.
> >
> > Well, that shows what you get for making implications.  The client is
> > sticking with postgres and we are coding around the issue in other
> > ways.
> 
> As many other guys here pointed out, there are numerous ways to tune
> postgresql for maximum performance. If you are willing to share more
> information about your particular project, we might be able to help you out
> and optimize your application, without the need to code around the issue as
> much as you may be doing right now.
> Even if it is not possible for you to share enough information, there are a
> lot of places where you can read about performance tuning (if not in the
> docs then in the archives).
> 

Also, I should think the clients would not be too offended if Brian posted
some hint about the actual quantity of data involved here, both the total
expected database size and some info about the estimated "working set" size,
such as a sum of sizes of tables most commonly used in JOIN queries and the
percentage of data being shuffled around in those. Are indexes big? Are
there any multicolumn indexes in use? Lots of sorting expected? Lots of
UPDATEs/INSERTs/DELETEs?
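
Something along these lines (a rough sketch only; it assumes the default
8kB block size and a reasonably recent VACUUM/ANALYZE, so that relpages
and reltuples are current) would already give us a feel for the table
and index sizes:

    -- biggest user tables and indexes, in approximate MB, with row estimates
    SELECT relname, relkind, relpages * 8 / 1024 AS approx_mb, reltuples
      FROM pg_class
     WHERE relkind IN ('r', 'i') AND relname NOT LIKE 'pg_%'
     ORDER BY relpages DESC
     LIMIT 20;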

Also, it would be helpful to know just how normalized the database is, to
provide some advice about possible query optimization, which could again
prove helpful in speeding the machinery up.

Another useful piece of information would be the amount of memory consumed
by other applications vs. the amount of memory reserved by the OS for cache,
and the nature of those other applications running - are they big cache
consumers, such as Apache with static content and a large load would be,
or do they keep a low profile?

I think this would, in combination with the information already posted, such
as the amount of memory and I/O subsystem info, at least enable us to advise
about the recommended shared_buffers, effective_cache_size, sort_mem,
vacuum_mem, and others, without compromising the intellectual property of
Brian's clients.
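
Purely as an illustration of the kind of settings I mean - the numbers
below are placeholders that assume something on the order of 1GB of RAM,
not recommendations (in 7.3, shared_buffers and effective_cache_size are
counted in 8kB pages, sort_mem and vacuum_mem in kB):

    shared_buffers = 8192           # 64 MB of shared buffers
    effective_cache_size = 65536    # ~512 MB expected to live in the OS cache
    sort_mem = 8192                 # 8 MB per sort
    vacuum_mem = 32768              # 32 MB for VACUUM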

> > over and out.

I CC'd this post over to you, Brian, 'cause this signoff made me rather
unsure as to whether or not you're still on the list. Hope you don't mind.

Sincerely,
-- 
Grega Bremec
System Administration & Development Support
grega.bremec-at-noviforum.si
http://najdi.si/
http://www.noviforum.si/



Re: [PERFORM] Strange result: UNIX vs. TCP/IP sockets

2003-07-05 Thread Andrew Sullivan
On Fri, Jul 04, 2003 at 05:47:27PM -0400, Tom Lane wrote:
> Andrew Sullivan <[EMAIL PROTECTED]> writes:
> > How could it be the transport affects the time for the query as
> > reported by the back end?
> 
> How much data is being sent back by the query?

In this case, it's an all-aggregate query:

select count(*), min(id) from sometable where owner = int4;

(Yeah, yeah, I know.  I didn't write it.)

But it's the EXPLAIN ANALYSE that's reporting different times
depending on the transport.  That's what I find so strange.
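
For reference, the comparison is literally the same statement run over
the two transports (the 32 below is only a stand-in for the actual int4
literal):

    -- connected over the Unix-domain socket (psql without -h):
    EXPLAIN ANALYZE SELECT count(*), min(id) FROM sometable WHERE owner = 32;
    -- connected over TCP/IP (psql -h 127.0.0.1):
    EXPLAIN ANALYZE SELECT count(*), min(id) FROM sometable WHERE owner = 32;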

> Do you have SSL enabled?  SSL encryption overhead is nontrivial,
> especially if any renegotiations happen.

No.

-- 

Andrew Sullivan 204-4141 Yonge Street
Liberty RMS   Toronto, Ontario Canada
<[EMAIL PROTECTED]>  M2P 2A8
 +1 416 646 3304 x110




Re: [PERFORM] Moving postgresql.conf tunables into 2003...

2003-07-05 Thread Sean Chittenden
> Sean Chittenden <[EMAIL PROTECTED]> writes:
> > Getting the planner to pick
> > using the index to filter out data inserted in the last 3 days over
> > doing a seq scan...  well, I don't know how you could do that without
> > changing the random_page_cost.
> 
> This sounds a *whole* lot like a correlation issue.  If the data in
> question were scattered randomly in the table, it's likely that an
> indexscan would be a loser.  The recently-inserted data is probably
> clustered near the end of the table (especially if they're doing
> VACUUM FULL after data purges; are they?).  But the planner's
> correlation stats are much too crude to recognize that situation, if
> the rest of the table is not well-ordered.

Data isn't scattered randomly from what I can tell and is basically
already clustered just because the data is inserted linearly and
based off of time.  I don't think they're doing a VACUUM FULL after a
purge, but I'll double check on that on Monday when they get in.  Is
there an easy way of determining or setting a planner stat to suggest
that data is ordered around a column in a permanent way?  CLUSTER has
always been a one-shot deal and its effects wear off quickly depending
on the way that data is inserted.  It seems as though this would be a
circumstance in which preallocated disk space would be a win (that way
data wouldn't always be appended to the heap and could be inserted in
order; of most use for non-time-related data, e.g. some non-unique ID).
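
For concreteness, this is the sort of thing I mean (the index name below
is made up; mss_fwevent is the table from the EXPLAIN output further
down):

    -- how well-ordered does ANALYZE think each column is? (1.0 = perfectly ordered)
    SELECT attname, correlation FROM pg_stats WHERE tablename = 'mss_fwevent';

    -- 7.3 syntax; a one-shot reordering that has to be repeated as new
    -- rows arrive out of physical order
    CLUSTER mss_fwevent_time_idx ON mss_fwevent;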

> If their typical process involves a periodic data purge and then a
> VACUUM FULL, it might be worth experimenting with doing a CLUSTER on
> the timestamp index instead of the VACUUM FULL.  The CLUSTER would
> reclaim space as effectively as VACUUM FULL + REINDEX, and it would
> leave the table with an unmistakable 1.0 correlation ... which
> should tilt the planner towards an indexscan without needing a
> physically impossible random_page_cost to do it.  I think CLUSTER
> would probably be a little slower than VACUUM FULL but it's hard to
> be sure without trying.

Hrm, I understand what clustering does, I'm just not convinced that
it'll "fix" this performance problem unless CLUSTER sets some kind of
hint that ANALYZE uses to modify the way in which it collects
statistics.  Like I said, I'll let you know on Monday when they're
back in the shop, but I'm not holding my breath.  I know
random_page_cost is set to something physically impossible, but in
terms of performance, it's always been the biggest win for me to set
this puppy quite low.  Whether it's a bug in the planner or in the
documentation surrounding what this knob does, I'm not sure, but
setting this to a low value consistently yields good results for me.
The faster the drive, the lower the random_page_cost value.  *shrug*
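
(A minimal sketch of what "setting this puppy quite low" looks like for
a single session - the 2 is only an example value, not a recommendation:)

    SET random_page_cost = 2;   -- session-local override for testing
    SHOW random_page_cost;      -- confirm the active value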

> That's one heck of a poor estimate for the number of rows returned.
> 
> >   ->  Seq Scan on mss_fwevent  (cost=0.00..223312.60 rows=168478 width=12)
> >       (actual time=24253.66..24319.87 rows=320 loops=1)

The stats for the columns are already set to 1000 to aid with
this... don't know what else I can do here.  Having the planner off by
as much as even half the actual size isn't uncommon in my experience.
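
(That is, something along these lines, with the column name being
whatever the WHERE clause actually filters on - "evtime" here is just a
stand-in:)

    ALTER TABLE mss_fwevent ALTER COLUMN evtime SET STATISTICS 1000;
    ANALYZE mss_fwevent;    -- rebuild the stats with the larger target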

-sc

-- 
Sean Chittenden



Re: [PERFORM] Moving postgresql.conf tunables into 2003...

2003-07-05 Thread Sean Chittenden
> > # The default values for PostgreSQL are extremely conservative and
> > # are likely far from ideal for a site's needs.  Included in this
> > # configuration, however, are _suggested_ values to help aid in
> > # ...
> 
> This sort of narrative belongs in the SGML docs, not in a CONF file.
> In fact, one could argue that we should take *all* commentary out of
> the CONF file in order to force people to read the docs.

The SGML docs aren't in the DBA's face and are way out of the way for
DBAs rolling out a new system or who are tuning the system.  SGML ==
Developer, conf == DBA.

> Database performance tuning will always be a "black art," as it
> necessitates a broad knowledge of PostgreSQL, OS architecture, and
> computer hardware.  So I doubt that we can post docs that would
> allow any 10% time DBA to make PostgreSQL "fly", but hopefully over
> the next year we can make enough knowledge public to allow anyone to
> make PostgreSQL "sprint".

I'm highly resistant to/disappointed in this attitude and firmly
believe that there are well-understood algorithms that DBAs use to
diagnose and solve performance problems.  It's only a black art
because it hasn't been documented.  Performance tuning isn't voodoo:
it's adjusting constraints to align with the execution of applications,
and since we know what the applications do, the database can be molded
to the applications' needs.  Some of those parameters are based on
hardware constraints and should be pooled and organized as such.

random_page_cost ==
avg cost of a random disk seek/read (eg: disk seek time) ==
constant integer for a given piece of hardware
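
As a back-of-the-envelope illustration (the drive figures below are
assumptions, not measurements): a disk with ~10 ms average random access
and ~40 MB/s sequential throughput reads a sequential 8kB page in
roughly 0.2 ms, so

    random_page_cost ~= random page fetch time / sequential page fetch time
                     ~= 10 ms / 0.2 ms  ~=  50  (raw spindle figure)

with OS and shared-buffer caching pulling the effective value well below
that raw number in practice.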

There are other settings that are RAM-based as well, which should be
formulaic and derived, though a formula hasn't been defined to date.
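
For example, one assumed shape such a formula might take (not an agreed
one):

    effective_cache_size ~= (RAM left to the OS buffer/page cache) / 8kB
                            e.g. ~700 MB of cache  =>  89600 (8kB pages)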

-sc

-- 
Sean Chittenden



Re: [PERFORM] Moving postgresql.conf tunables into 2003...

2003-07-05 Thread Josh Berkus
Sean,

> The SGML docs aren't in the DBA's face and are way out of the way for
> DBAs rolling out a new system or who are tuning the system.  SGML ==
> Developer, conf == DBA.

That's exactly my point.  We cannot provide enough documentation in the CONF 
file without septupling its length.  IF we remove all commentary, and instead 
provide a pointer to the documentation, more DBAs will read it.

>   Some of those parameters are based on
> hardware constraints and should be pooled and organized as such.
> 
> random_page_cost ==
>   avg cost of a random disk seek/read (eg: disk seek time) ==
>   constant integer for a given piece of hardware

But, you see, this is exactly what I'm talking about.   random_page_cost isn't 
static to a specific piece of hardware ... it depends as well on what else is 
on the disk/array, concurrent disk activity, disk controller settings, 
filesystem, OS, distribution of records and tables, and arrangement of the 
partitions on disk.   One can certainly get a "good enough" value by 
benchmarking the disk's random seek and calculating based on that ... but to 
get an "ideal" value requires a long interactive session by someone with 
experience and in-depth knowledge of the machine and database.

> There are other settings that are RAM-based as well, which should be
> formulaic and derived, though a formula hasn't been defined to date.

You seem pretty passionate about this ... how about you help me and Kevin 
define a benchmarking suite when I get back into the country (July 17)?   If 
we're going to define formulas, it requires that we have a near-comprehensive 
and consistent test database and test battery that we can run on a variety of 
machines and platforms.

-- 
-Josh Berkus

__AGLIO DATABASE SOLUTIONS___
Josh Berkus
   Complete information technology  [EMAIL PROTECTED]
and data management solutions   (415) 565-7293
   for law firms, small businesses   fax 621-2533
and non-profit organizations.   San Francisco




Re: [PERFORM] PostgreSQL vs. MySQL

2003-07-05 Thread Hilmar Lapp
On Friday, July 4, 2003, at 07:07  AM, Brian Tarbox wrote:

> We had about 40 tables in the db, with joined queries on about 8-12
> tables.

A while ago I tested a moderately complex schema on MySQL, Pg, and 
Oracle. I usually heavily normalize schemas and then define views as a 
denormalized API, which sends MySQL to the book of toys already. The 
views more often than not would join anywhere from 6-12 tables, using 
plain (as opposed to compound) foreign-key-to-primary-key straight 
joins.

I noticed that Pg was more than an order of magnitude slower than 
Oracle for joins of more than 8 tables. I won't claim that none of this 
can have been due to lack of tuning. My point is the following though. 
After I dug in, it turned out that of the 4 secs Pg needed to execute 
the query, it spent 3.9 secs in the planner. The execution plan Pg came 
up with was pretty good - it just needed an extraordinary amount of 
time to arrive at it, spoiling its own results.

Asking this list I then learned how to tweak GEQO such that it would 
pick up the planning and do it faster than it would otherwise. I was 
able to get the planner time down to a quarter - still a multiple of 
the actual execution time.
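
(For the archives, the tweak amounts to lowering the point at which GEQO
takes over from the exhaustive planner - the 8 below is just an example
value matching the join sizes mentioned above:)

    SET geqo_threshold = 8;   -- hand joins of 8 or more tables to GEQO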

I was told on this list that query planning suffers from combinatorial 
explosion very quickly - and I completely buy that. It's just - Oracle 
planned the same query in a fraction of a second, using the cost-based 
optimizer, on a slower machine. I've seen it plan 15-table joins in 
much less than a second, and I have no idea how it would do that. In 
addition, once you've prepared a query in Oracle, the execution plan is 
pre-compiled.

If I were a CS student I'd offer myself to the hall of humiliation and 
set out to write a fast query planner for Pg ...

-hilmar
--
-
Hilmar Lapp                email: lapp at gnf.org
GNF, San Diego, Ca. 92121  phone: +1-858-812-1757
-


Re: [PERFORM] PostgreSQL vs. MySQL

2003-07-05 Thread Hannu Krosing
Brian Tarbox wrote on Fri, 04.07.2003 at 15:27:
> I recently took a system from MySQL to Postgres.  Same HW, SW, same data.
> The major operations where moderately complex queries (joins on 8 tables).
> The results we got was that Postgres was fully 3 times slower than MySql.

For each and every query ??

> We were on this  list a fair bit looking for answers and tried all the
> standard answers. 

Could you post the list of "standard answers" you tried ?

>  It was still much  much much slower.

Was this with InnoDB ?

What kind of joins were they (i.e.
"FROM a JOIN b ON a.i=b.i"
or "FROM a,b WHERE a.i = b.i")?

What was the ratio of planning time to actual execution time in pgsql?

Were the queries originally optimized for MySQL?


Hannu



Re: [PERFORM] Moving postgresql.conf tunables into 2003...

2003-07-05 Thread Sean Chittenden
> > The SGML docs aren't in the DBA's face and are way out of the way
> > for DBAs rolling out a new system or who are tuning the system.
> > SGML == Developer, conf == DBA.
> 
> That's exactly my point.  We cannot provide enough documentation in
> the CONF file without septupling its length.  IF we remove all
> commentary, and instead provide a pointer to the documentation, more
> DBAs will read it.

Which I don't think would happen, and why I think the terse bits that
are included are worthwhile.  :)

> > Some of those parameters are based on hardware constraints and
> > should be pooled and organized as such.
> > 
> > random_page_cost ==
> > avg cost of a random disk seek/read (eg: disk seek time) ==
> > constant integer for a given piece of hardware
> 
> But, you see, this is exactly what I'm talking about.
> random_page_cost isn't static to a specific piece of hardware ... it
> depends as well on what else is on:

*) the disk/array

translation: how fast data is accessed and over how many drives.

*) concurrent disk activity

A disk/database activity metric is different from the cost of a seek
on the platters.  :) The fact that PostgreSQL doesn't currently support
such a disk concurrency metric doesn't mean that its definition should
get rolled into a different number in an attempt to compensate for the
lack thereof.

*) disk controller settings

This class of settings falls into the same category as the settings
that affect random seeks on the platters/disk array(s).

*) filesystem

Again, this influences avg seek time.

*) OS

Again, avg seek time.

*) distribution of records and tables

This has nothing to do with PostgreSQL's random_page_cost setting
other than that if data is fragmented on the platter, the disk is
going to have to do a lot of seeking.  This is a stat that should get
set by ANALYZE, not by a human.

*) arrangement of the partitions on disk

Again, avg seek time.

> One can certainly get a "good enough" value by benchmarking the
> disk's random seek and calculating based on that ... but to get an
> "ideal" value requires a long interactive session by someone with
> experience and in-depth knowledge of the machine and database.

An "ideal" value isn't obtained via guess and check.  Checking is only
the verification of some calculable set of settingsthough right now
those calculated settings are guessed, unfortunately.

> > There are other settings that are RAM-based as well, which should
> > be formulaic and derived, though a formula hasn't been defined to
> > date.
> 
> You seem pretty passionate about this ... how about you help me and
> Kevin define a benchmarking suite when I get back into the country
> (July 17)?  If we're going to define formulas, it requires that we
> have a near-comprehensive and consistent test database and test
> battery that we can run on a variety of machines and platforms.

Works for me, though a benchmark will be less valuable than adding a
disk concurrency stat, improving data trend/distribution analysis, and
using numbers that are concrete and obtainable through the OS kernel
API or an admin manually plunking numbers in.  I'm still recovering
from my move from Cali to WA so with any luck, I'll be settled in by
then.

-sc

-- 
Sean Chittenden



Re: [PERFORM] Moving postgresql.conf tunables into 2003...

2003-07-05 Thread Josh Berkus
Sean,

> > That's exactly my point.  We cannot provide enough documentation in
> > the CONF file without septupling its length.  IF we remove all
> > commentary, and instead provide a pointer to the documentation, more
> > DBAs will read it.
>
> Which I don't think would happen and why I think the terse bits that
> are included are worth while.  :)

Depressingly enough, you are probably correct, unless we assemble a more 
user-friendly "getting started" guide.

> *) concurrent disk activity
>
> A disk/database activity metric is different than the cost of a seek
> on the platters.  :) Because PostgreSQL doesn't currently support such
> a disk concurrency metric doesn't mean that its definition should get
> rolled into a different number in an attempt to accommodate for a lack
> thereof.

I was talking about concurrent activity by *other* applications.  For example, 
if a DBA has a Java app that is accessing XML on the same array as postgres 
500 times/minute, then you'd need to adjust random_page_cost upwards to allow 
for the resource contention.

> An "ideal" value isn't obtained via guess and check.  Checking is only
> the verification of some calculable set of settings... though right now
> those calculated settings are guessed, unfortunately.

> Works for me, though a benchmark will be less valuable than adding a
> disk concurrency stat, improving data trend/distribution analysis, and
> using numbers that are concrete and obtainable through the OS kernel
> API or an admin manually plunking numbers in.  I'm still recovering
> from my move from Cali to WA so with any luck, I'll be settled in by
> then.

The idea is that for a lot of statistics, we're only going to be able to 
obtain valid numbers if we have something constant to check them against.

Talk to you later this month!

-- 
Josh Berkus
Aglio Database Solutions
San Francisco



[PERFORM] Extreme high load averages

2003-07-05 Thread Martin Foster
The only time that I have ever seen load averages of 30 or more under 
OpenBSD is when one of my scripts goes wild.  However, I can say that 
I am also seeing these load averages under PostgreSQL 7.3.2 after a 
migration to it from MySQL.

MySQL Statistics:
Uptime: 1055352  Threads: 178  Questions: 75161710  Slow queries: 46 
Opens: 1084  Flush tables: 1  Open tables: 206 Queries per second avg: 
71.220

The above are statistics from the older generation of scripts that made 
use of MySQL, to give an idea of what's going on.  That generation of 
scripts would handle the referential integrity themselves, since foreign 
key constraints are not enforced under that system.  However, the system 
handled 250 concurrent users without a single problem, while under 
Postgres with new scripts using functions, referential integrity, 
transactions and lighter code, the system starts to buckle at even fewer 
than 70 users.

What I would like to know is: why?  The kernel has been compiled to 
handle the number of concurrent connections, and the server may not be 
the best, but it should be able to handle the requests: PIII 1 GHz, 1 GB 
SDRAM, 2 IDE 20 GB drives.

I have changed settings to take advantage of the memory.  So the 
following settings are of interest:
	shared_buffers = 16384
	wal_buffers = 256
	sort_mem = 16384
	vacuum_mem = 32768
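
To put those in real units (7.3 counts shared_buffers and wal_buffers in 
8kB pages, sort_mem and vacuum_mem in kB):

	shared_buffers = 16384  ->  128 MB of shared buffers
	wal_buffers    =   256  ->    2 MB of WAL buffers
	sort_mem       = 16384  ->   16 MB per sort, per backend
	vacuum_mem     = 32768  ->   32 MB for VACUUM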

Statistics gathering has now been disabled, and logging is done through 
syslog.  I do not expect those settings to cripple system performance, 
however.

The scripts are heavy SELECTs with a fair dose of UPDATEs and INSERTs. 
To get a concept of what these scripts do, you can look at Ethereal 
Realms (http://www.ethereal-realms.org), which is running the PostgreSQL 
script variants, or consider that this is a chat site.

Anyone have ideas?  Is the use of connection pooling considered bad? 
Should flush be run more than once a day?  I have no intention of going 
back to MySQL, and would like to make this new solution work.

Martin Foster
Creator/Designer Ethereal Realms
[EMAIL PROTECTED]

