Re: [PERFORM] Reading data in bulk - help?

2003-09-10 Thread Chris Huston
Thanks, Josh, that helped. I had gone looking for some kind of cluster
option but was looking under create database, create index and
initlocation - I didn't see CLUSTER indexname ON tablename.

I ran the CLUSTER, which took about 2 1/2 hours to complete. That
improved the query performance by about 6x - which is great - but it
still takes 26 minutes to do what a serial read does in about 2 1/2
minutes.
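For reference, a minimal sketch of the commands involved (the index name
here is made up; "datatable" and "group_id" are the names from the query
further down):

CLUSTER datatable_group_id_idx ON datatable;  -- 7.3 syntax: CLUSTER indexname ON tablename
ANALYZE datatable;                            -- refresh planner statistics after the reorder
-- CLUSTER does not maintain the ordering, so it has to be re-run after large
-- batches of changes, as Josh noted.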

At this point I'm ok because each fetch is taking around 200 
milliseconds from call to the time the data is ready. The processing 
takes 300-600ms per batch. I've got the fetch and the processing 
running in separate threads so even if postgres was running faster it 
wouldn't help this implementation.

However, "iostat" is still reporting average size per transfer of about 
10kB and total thru-put of about 1MB/s. The transfers per second went 
from >200/s to about 80/s. It still seems like it ought to be faster.

The system is currently running on a single processor 500MHz G4. We're
likely to move to a two processor 2GHz G5 in the next few months. Then
each block may take only 30-60 milliseconds to complete, and there can
be two concurrent blocks processing at once.

Sometime before then I need to figure out how to cut the fetch times
from the current 200ms to something like 10ms. There are currently
1,628,800 records in the single data table, representing 6817 groups.
Each group has 2 to 284 records - with 79% having the max of 284 (the
max grows by 1 every day - although the value may change throughout the
day). Each record is maybe 1 or 2k, so ideally each batch/group should
require 284-568k - at 10MB/s that'd be roughly 30-60ms per batch.

RELATED QUESTION: Now, how do I speed up the following query: "select
distinct group_id from datatable"? It results in a sequential scan of
the table. Why doesn't it use the group_id index? I only do this once
per run, so it's not as critical as the fetch speed, which is done 6817
times.

Thanks for the help!
- Chris
On Tuesday, Sep 9, 2003, at 18:11 America/Denver, Josh Berkus wrote:

Chris,

I've got an application that needs to chunk through ~2GB of data. The
data is ~7000 different sets of 300 records each. I put all of the 
data
into a postgres database but that doesn't look like it's going to work
because of how the data lives on the disk.

Your problem is curable through 4 steps:

1) adjust your postgresql.conf to appropriate levels for memory usage.

2) if those sets of 300 are blocks in some contiguous order, then 
cluster them
to force their physical ordering on disk to be the same order you want 
to
read them in.   This will require you to re-cluster whenever you 
change a
significant number of records, but from the sound of it that happens in
batches.

3) Get better disks, preferably a RAID array, or just very fast SCSI if
the database is small.  If you're budget-constrained, Linux software RAID
(or BSD RAID) on IDE disks is cheap.  What kind of RAID depends on what
else you'll be doing with the app; RAID 5 is better for read-only access,
RAID 1+0 is better for read-write.

4) Make sure that you aren't dumping the data to the same disk 
postgreSQL
lives on!   Preferably, make sure that your swap partition is on a 
different
disk/array from postgresql.   If the computing app is complex and 
requires
disk reads aside from postgres data, you should make sure that it 
lives on
yet another disk.   Or you can simplify this with a good, really large
multi-channel RAID array.

--
-Josh Berkus
 Aglio Database Solutions
 San Francisco


Re: [PERFORM] Reading data in bulk - help?

2003-09-10 Thread Dennis Bjorklund
On Wed, 10 Sep 2003, Chris Huston wrote:

> Sometime before then I need to figure out how to cut the fetch times 
> from the now 200ms to something like 10ms.

You didn't say anything about Josh's first point of adjusting
postgresql.conf to match your machine. Settings like effective_cache_size
almost always need to be increased from the default, and so does shared
memory (shared_buffers).

-- 
/Dennis




Re: [PERFORM] Need advice about triggers

2003-09-10 Thread Mindaugas Riauba

> >> Well, try it without the trigger. If performance improves markedly, it
> >> might be worth rewriting in C.
>
> >   Nope. Execution time is practically the same without trigger.
>
> >> If not, you're probably saturating the disk I/O -
>
> >   Bottleneck in this case is CPU. postmaster process uses almost 100% of
> > CPU.
>
> That seems very odd.  Updates should be I/O intensive, not CPU
> intensive.  I wouldn't have been surprised to hear of a plpgsql trigger
> consuming lots of CPU, but without it, I'm not sure where the time is
> going.  Can you show us an EXPLAIN ANALYZE result for a typical update
> command?

  Two EXPLAIN ANALYZE outputs are below; one is before and the other is
after REINDEX. It seems that REINDEX before the updates helps - time went
down to ~17s, and CPU is no longer at 100%. vmstat output is below (the
machine has 2 CPUs, so 40% load means 80% of one CPU).

  So the solution would be REINDEX before the updates and VACUUM at the
same time? Without REINDEX, performance slowly degrades.
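  For reference, that maintenance pair amounts to something like this,
using the table from the EXPLAIN output below (REINDEX before the update
run; a plain VACUUM can run alongside the updates):

REINDEX TABLE ifdata;      -- rebuild the index that bloats during the update runs
VACUUM ANALYZE ifdata;     -- reclaim dead tuples and refresh statistics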

  Mindaugas


router_db=# explain analyze update ifdata set ifspeed=256000,
ifreason='12121', iflastupdate=CURRENT_TIMESTAMP WHERE clientid='#0003904#';
 QUERY PLAN


 Index Scan using ifdata_clientid_key on ifdata  (cost=0.00..5.64 rows=1
width=116) (actual time=0.17..0.36 rows=1 loops=1)
   Index Cond: (clientid = '#0003904#'::character varying)
 Total runtime: 1.70 msec
(3 rows)

router_db=# reindex table ifdata;
REINDEX
router_db=# explain analyze update ifdata set ifspeed=256000,
ifreason='12121', iflastupdate=CURRENT_TIMESTAMP WHERE clientid='#0003904#';
 QUERY PLAN


 Index Scan using ifdata_clientid_key on ifdata  (cost=0.00..5.65 rows=1
width=116) (actual time=0.06..0.07 rows=1 loops=1)
   Index Cond: (clientid = '#0003904#'::character varying)
 Total runtime: 0.47 msec
(3 rows)


---

   procs                      memory      swap          io     system      cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo    in    cs  us  sy  id
 0  0  0   5048  20616 273556 1614692   0   0     4    32     0     0   1   3
 0  0  0   5048  20612 273556 1614692   0   0     0     0   109     8   0   0  100
 0  0  0   5048  20612 273556 1614692   0   0     0   168  14420    0   0  100
 1  0  0   5048  19420 273556 1614612   0   0     0   192   123  4120  35   2  63
 0  1  1   5048  19420 273572 1614652   0   0     0   672   144  4139  32   2  66
 1  0  0   5048  19420 273580 1614660   0   0     0   360   125  4279  33  12  55
 1  0  0   5048  19420 273580 1614724   0   0     0   272   119  5887  41   2  57
 1  0  0   5048  19420 273580 1614716   0   0     0   488   124  4871  40   1  59




Re: [PERFORM] Hardware recommendations to scale to silly load

2003-09-10 Thread Gregory S. Williamson
Nitpicking --

Perhaps the 4th data line is meant to be:
Inserts in separate transactions 2500 inserts/second
   ^^^
??


Greg Williamson

-Original Message-
From:   Bruce Momjian [mailto:[EMAIL PROTECTED]
Sent:   Tue 9/9/2003 8:25 PM
To: Matt Clark
Cc: Ron Johnson; PgSQL Performance ML
Subject:Re: [PERFORM] Hardware recommendations to scale to silly load

Matt Clark wrote:
> > Just a data point, but on my Dual Xeon 2.4Gig machine with a 10k SCSI
> > drive I can do 4k inserts/second if I turn fsync off.  If you have a
> > battery-backed controller, you should be able to do the same.  (You will
> > not need to turn fsync off --- fsync will just be fast because of the
> > disk drive RAM).
> >
> > Am I missing something?
> 
> I think Ron asked this, but I will too, is that 4k inserts in
> one transaction or 4k transactions each with one insert?
> 
> fsync is very much faster (as are all random writes) with the
> write-back cache, but I'd hazard a guess that it's still not
> nearly as fast as turning fsync off altogether.  I'll do a test
> perhaps...

Sorry to be replying late.  Here is what I found.

fsync on
Inserts all in one transaction 3700 inserts/second
Inserts in separate transactions    870 inserts/second

fsync off
Inserts all in one transaction 3700 inserts/second
Inserts all in one transaction 2500 inserts/second

ECPG test program attached.

--

  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073






Re: [PERFORM] Hardware recommendations to scale to silly load

2003-09-10 Thread Bruce Momjian
Gregory S. Williamson wrote:
> Nitpicking --
> 
> Perhaps the 4th data line is meant to be:
> Inserts in separate transactions 2500 inserts/second
>^^^


Oh, yes, sorry.  It is:

> Sorry to be replying late.  Here is what I found.
> 
> fsync on
> Inserts all in one transaction 3700 inserts/second
> Inserts in separate transactions    870 inserts/second
> 
> fsync off
> Inserts all in one transaction 3700 inserts/second
> Inserts in separate transactions   2500 inserts/second

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073



Re: [PERFORM] Reading data in bulk - help?

2003-09-10 Thread Josh Berkus
Chris,

> The system is currently running on a single processor 500Mhz G4. We're
> likely to move to a two processor 2Ghz G5 in the next few months. Then
> each block may take only a 30-60 milliseconds to complete and their can
> be two concurrent blocks processing at once.

What about explaining your disk setup?   Or mentioning postgresql.conf?   For 
somebody who wants help, you're ignoring a lot of advice and questions.

Personally, I'm not going to be of any further help until you report back on 
the other 3 of 4 options.

> RELATED QUESTION: How now do I speed up the following query: "select
> distinct group_id from datatable"? Which results in a sequential scan
> of the db. Why doesn't it use the group_id index? I only do this once
> per run so it's not as critical as the fetch speed which is done 6817
> times.

Because it can't until PostgreSQL 7.4, which has hash aggregates.  Up to
7.3, we have to use seq scans for all group bys.  I'd suggest that you
keep a table of group_ids instead.

-- 
Josh Berkus
Aglio Database Solutions
San Francisco



Re: [PERFORM] Reading data in bulk - help?

2003-09-10 Thread Magnus Naeslund(w)
Chris Huston said:
> Thanks Josh that helped. I had gone looking for some kind of cluster
> option but was looking under create database, create index and
> initlocation - didn't see the CLUSTER index ON table.
>
> I ran the CLUSTER which took about 2 1/2 hours to complete. That
> improved the query performance about 6x - which is great - but is still
> taking 26 minutes to do what a serial read does in about 2 1/2 minutes.
>
> At this point I'm ok because each fetch is taking around 200
> milliseconds from call to the time the data is ready. The processing
> takes 300-600ms per batch. I've got the fetch and the processing
> running in separate threads so even if postgres was running faster it
> wouldn't help this implementation.
>
> However, "iostat" is still reporting average size per transfer of about
> 10kB and total thru-put of about 1MB/s. The transfers per second went
> from >200/s to about 80/s. It still seems like it ought to be faster.
>
> The system is currently running on a single processor 500Mhz G4. We're
> likely to move to a two processor 2Ghz G5 in the next few months. Then
> each block may take only 30-60 milliseconds to complete and there can
> be two concurrent blocks processing at once.
>
> Sometime before then I need to figure out how to cut the fetch times
> from the now 200ms to something like 10ms. There are currently
> 1,628,800 records in the single data table representing 6817 groups.
> Each group has 2 to 284 records - with 79% having the max 284 (max
> grows by 1 every day - although the value may change throughout the
> day). Each record is maybe 1 or 2k so ideally each batch/group should
> require 284-568k - at 10MB/s - that'd be roughly 30-60ms per batch.
>
> RELATED QUESTION: How now do I speed up the following query: "select
> distinct group_id from datatable"? Which results in a sequential scan
> of the db. Why doesn't it use the group_id index? I only do this once
> per run so it's not as critical as the fetch speed which is done 6817
> times.
>
> Thanks for the help!
> - Chris
>

How are you fetching the data?
If you are using cursors, be sure to fetch a substantial bit at a time so
that you're not punished by latency.
I got a big speedup when I changed my original clueless code to fetch 64
rows in one go instead of only one.
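In psql terms the pattern looks something like this (the cursor name and
group value are just illustrative):

BEGIN;
DECLARE batch CURSOR FOR SELECT * FROM datatable WHERE group_id = 42;
FETCH 64 FROM batch;   -- repeat until it returns no rows, instead of one FETCH per row
CLOSE batch;
COMMIT;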

Magnus





Re: [PERFORM] [GENERAL] how to get accurate values in pg_statistic (continued)

2003-09-10 Thread Bruce Momjian
Tom Lane wrote:
> Mary Edie Meredith <[EMAIL PROTECTED]> writes:
> > Stephan Szabo kindly responded to our earlier queries suggesting we look
> > at default_statistics_target and ALTER TABLE ALTER COLUMN SET
> > STATISTICS.  
> 
> > These determine the number of bins in the histogram for a given column. 
> > But for a large number of rows (for example 6 million) the maximum value
> > (1000) does not guarantee that ANALYZE will do a full scan of the table.
> > We do not see a way to guarantee the same statistics run to run without
> > forcing ANALYZE to examine every row of every table.  
> 
> Do you actually still have a problem with the plans changing when the
> stats target is above 100 or so?  I think the notion of "force ANALYZE
> to do a full scan" is inherently wrongheaded ... it certainly would not
> produce numbers that have anything to do with ordinary practice.
> 
> If you have data statistics that are so bizarre that the planner still
> gets things wrong with a target of 1000, then I'd like to know more
> about why.

Has there been any progress in determining if the number of default
buckets (10) is the best value?

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073



Re: [PERFORM] Reading data in bulk - help?

2003-09-10 Thread Chris Huston
On Wednesday, Sep 10, 2003, at 11:16 America/Denver, Josh Berkus wrote:

What about explaining your disk setup?   Or mentioning postgresql.conf?
For somebody who wants help, you're ignoring a lot of advice and questions.

Personally, I'm not going to be of any further help until you report back
on the other 3 of 4 options.

EEEK! Peace. Sorry I didn't include that info in the response.

1) Memory - clumsily adjusted shared_buffer - tried three values: 64, 
128, 256 with no discernible change in performance. Also adjusted, 
clumsily, effective_cache_size to 1000, 2000, 4000 - with no 
discernible change in performance. I looked at the Admin manual and 
googled around for how to set these values and I confess I'm clueless 
here. I have no idea how many kernel disk page buffers are used nor do 
I understand what the "shared memory buffers" are used for (although 
the postgresql.conf file hints that it's for communication between 
multiple connections). Any advice or pointers to articles/docs is 
appreciated.

2) Clustering - tried it - definite improvement - thanks for the tip

3) RAID - haven't tried it - but I'm guessing that the speed 
improvement from a RAID 5 may be on the order of 10x - which I can 
likely get from using something like HDF. Since the data is unlikely to 
grow beyond 10-20gig, a fast drive and firewire ought to give me the 
performance I need. I know experimentally that the current machine can 
sustain a 20MB/s transfer rate which is 20-30x the speed of these 
queries. (If there's any concern about my enthusiasm for postgres - no 
worries - I've been very happy with it on several projects - it might 
not be the right tool for this kind of job - but I haven't come to that 
conclusion yet.)

4) I'd previously commented out the output/writing steps from the app - 
to isolate read performance.

On Wednesday, Sep 10, 2003, at 05:47 America/Denver, Magnus Naeslund(w) 
wrote:
How are you fetching the data?
If you are using cursors, be sure to fetch a substantial bit at a time so
that you're not punished by latency.
I got a big speedup when I changed my original clueless code to fetch 64
rows in one go instead of only one.

That's an excellent question... I hadn't thought about it. I'm using a
JDBC connection... I have no idea (yet) how the results are moving 
between postgres and the client app. I'm testing once with the app and 
the DB on the same machine (to remove network latency) and once with 
db/app on separate machines. However, I wonder if postgres is blocking 
on network io (even if it's the loopback interface) and not on disk?!

I'll definitely look into it. Maybe I'll try a loop in psql and see 
what the performance looks like. Thanks Magnus.

On Wednesday, Sep 10, 2003, at 07:05 America/Denver, Sean McCorkle 
wrote:

I ended up solving the problem by going "retro" and using the
quasi-database functions of unix and flat files:  grep, sort,
uniq and awk.

That's a cool KISS approach. If I end up moving out of postgres I'll
speed test this approach against HDF. Thanks.

This is a very helpful list,
- Chris


Re: [PERFORM] Reading data in bulk - help?

2003-09-10 Thread William Yu
1) Memory - clumsily adjusted shared_buffer - tried three values: 64, 
128, 256 with no discernible change in performance. Also adjusted, 
clumsily, effective_cache_size to 1000, 2000, 4000 - with no discernible 
change in performance. I looked at the Admin manual and googled around 
for how to set these values and I confess I'm clueless here. I have no 
idea how many kernel disk page buffers are used nor do I understand what 
the "shared memory buffers" are used for (although the postgresql.conf 
file hints that it's for communication between multiple connections). 
Any advice or pointers to articles/docs is appreciated.

The standard procedure is 1/4 of your memory for shared_buffers. Easiest 
way to calculate would be ###MB / 32 * 1000. E.g. if you have 256MB of 
memory, your shared_buffers should be 256 / 32 * 1000 = 8000.

The memory you have left over should be "marked" as OS cache via the
effective_cache_size setting. I usually just multiply the shared_buffers
value by 3 on systems with a lot of memory. With less memory,
OS/Postgres/etc takes up a larger percentage of memory, so values of 2 or
2.5 would be more accurate.



Re: [PERFORM] Query too slow

2003-09-10 Thread Bruce Momjian
Bruce Momjian wrote:
> Tom Lane wrote:
> > Stephan Szabo <[EMAIL PROTECTED]> writes:
> > > On Tue, 26 Aug 2003, Ang Chin Han wrote:
> > >> Veering aside a bit, since we usually pinpoint performance problems by
> > >> looking at EXPLAIN ANALYZE's differences between the planner's
> > >> estimation and actual execution's stats, what's involved in parsing the
> > >> EXPLAIN ANALYZE results, and highlighting the places where they are way
> > >> different? Bold, underline, or put some asterisks in front of those steps.
> > 
> > > The hardest part is determining where it matters I think.  You can use the
> > > row counts as the base for that, but going from 1 row to 50 is not
> > > necessarily going to be an issue, but it might be if a nested loop is
> > > chosen.
> > 
> > We've been chatting about this idea among the Red Hat group.  The RHDB
> > Visual Explain tool (get it at http://sources.redhat.com/rhdb/) already
> > computes the percent of total runtime represented by each plan node.
> > It seems like we could highlight nodes based on a large difference
> > between estimated and actual percentage, or just highlight the nodes
> > that are more than X percent of the runtime.
> 
> Is there a TODO here?  Perhaps:
> 
>   o Have EXPLAIN ANALYZE highlight poor optimizer estimates

No one commented, so I had to guess --- I added it.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073



Re: [PERFORM] Upgrade Woes

2003-09-10 Thread Christopher Browne
[EMAIL PROTECTED] writes:
> P.S. Sorry this is so long, but I wanted to include as much info as possible.

Throw in the non-commented lines in postgresql.conf; that would more
than likely make numeric answers possible, for some of it.  If the
config is "out-of-the-box," then it's pretty likely that some
significant improvements can be gotten from modifying a few of the
config parameters.  Increasing buffers would probably help query
speed, and if you're getting too many dead tuples, increasing the free
space map would make it possible for more to vacuum out.

Beyond that, you might want to grab the code for pg_autovacuum, and
drop that into place, as that would do periodic ANALYZEs that would
probably improve the quality of your selects somewhat.  (It's in the
7.4 code "contrib" base, but works fine with 7.3.)

I think you might also get some significant improvements out of
changing the way you load the properties.  If you set up a schema that
is suitably "permissive," and write a script that massages it a
little, COPY should do the trick to load the data in, which should be
helpful to the load process.  If the data comes in a little more
intelligently (which might well involve some parts of the process
"dumbing down" :-)), you might take advantage of COPY and perhaps
other things (we see through the glass darkly).
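As a sketch of that idea (table names, columns and the file path are all
hypothetical - the point is a permissive, all-text staging table):

CREATE TABLE prop_staging (prop_id text, feature text, val text);
COPY prop_staging FROM '/tmp/feed.txt' DELIMITER '|';
-- then move only the rows that pass sanity checks into the real table
INSERT INTO prop_features (prop_id, feature, val)
SELECT prop_id::integer, feature, val
FROM prop_staging
WHERE prop_id ~ '^[0-9]+$';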

I would think it also begs the question of whether or not you _truly_
need the "vacuum full."  Are you _certain_ you need that?  I would
think it likely that running "vacuum analyze" (and perhaps doing it a
little bit, continuously, during the load, via pg_autovacuum) would
likely suffice.  Have you special reason to think otherwise?
-- 
output = ("cbbrowne" "@" "libertyrms.info")

Christopher Browne
(416) 646 3304 x124 (land)



Re: [osdldbt-general] Re: [PERFORM] [GENERAL] how to get accurate

2003-09-10 Thread Mary Edie Meredith
We tried 1000 as the default and found that the plans were good plans
and were consistent, but the pg_statistic data was not exactly the same.

We took Tom's advice and tried SET SEED=0 (actually select setseed(0)).

We did runs last night on our project machine which produced consistent
pg_statistic data and (of course) the same plans.

We will next try runs where we vary the default buckets.  Other than 10
and 1000, what numbers would you like us to try?  Previously the number
100 was mentioned.  Are there others?
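For the record, one run of that experiment boils down to something like
this (the per-column form at the end is just an example with made-up
names):

SELECT setseed(0);                     -- fix the sampling order, per Tom's suggestion
SET default_statistics_target = 100;   -- the bucket count under test (10, 100, 1000, ...)
ANALYZE;                               -- repopulate pg_statistic with that target
-- or per column:
-- ALTER TABLE orders ALTER COLUMN o_custkey SET STATISTICS 100; ANALYZE orders;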

On Wed, 2003-09-10 at 12:44, Bruce Momjian wrote:
> Tom Lane wrote:
> > Mary Edie Meredith <[EMAIL PROTECTED]> writes:
> > > Stephan Szabo kindly responded to our earlier queries suggesting we look
> > > at default_statistics_target and ALTER TABLE ALTER COLUMN SET
> > > STATISTICS.  
> > 
> > > These determine the number of bins in the histogram for a given column. 
> > > But for a large number of rows (for example 6 million) the maximum value
> > > (1000) does not guarantee that ANALYZE will do a full scan of the table.
> > > We do not see a way to guarantee the same statistics run to run without
> > > forcing ANALYZE to examine every row of every table.  
> > 
> > Do you actually still have a problem with the plans changing when the
> > stats target is above 100 or so?  I think the notion of "force ANALYZE
> > to do a full scan" is inherently wrongheaded ... it certainly would not
> > produce numbers that have anything to do with ordinary practice.
> > 
> > If you have data statistics that are so bizarre that the planner still
> > gets things wrong with a target of 1000, then I'd like to know more
> > about why.
> 
> Has there been any progress in determining if the number of default
> buckets (10) is the best value?
-- 
Mary Edie Meredith <[EMAIL PROTECTED]>
Open Source Development Lab




[PERFORM] Upgrade Woes

2003-09-10 Thread aturner
Hi,

My name is Alex Turner and I work for a small Tech company in Pottstown PA.  We run 
Postgresql on a number of systems for a variety of different applications, and it has 
been a joy to deal with all around, working fast and reliably for over 2 years.

We recently upgraded from RedHat 7.2 to RedHat 9.0, and we are running Postgres 7.3.2 
on our Proliant ML370 (Raid 1 2x18 10k, and Raid 5 3x36 10k, 2x866 PIII, 2GB RAM).

We seem to have had a serious performance drop after the upgrade.  The database is a database of 
properties that is updated on a daily basis, and when I say updated I mean that I 
insert/update the whole data download because the data provider doesn't tell us what 
changed, just gives us a complete dump. The integrity of the dump isn't great so I 
can't process it as a COPY or a single block transaction because some of the data is often bad.  
Each and every row is a separate insert or update.  
Data insert performance used to degrade in a linear fashion as time progressed, I'm 
guessing as the transaction logs filled up.  About once every six weeks I would dump 
the database, destroy and recreate the db and reload the dump.  This 'reset' the whole 
thing, and brought insert/vacuum times back down.  Since the upgrade, performance has 
degraded very rapidly over the first week, and then more slowly later, but enough that 
we now have to reload the db every 2-3 weeks.  The insert procedure triggers a stored 
procedure that updates a timestamp on the record so that we can figure out what 
records have been touched, and which have not so that we can determine which 
properties have been removed from the feed as the record was not touched in the last 
two days.

I have noticed that whilst inserts seem to be slower than before, the vacuum full 
doesn't seem to take as long overall.

postgresql.conf is pretty virgin, and we run postmaster with -B512 -N256 -i.  
/var/lib/pgsql/data is a symlink to /eda/data, /eda being the mount point for the Raid 
5 array.

the database isn't huge, storing about 3 properties, and the largest table is 2.1 
Million rows for property features.  The dump file is only 221MB. Alas, I did not 
design the schema, but I have made several 'tweaks' to it to greatly improve read 
performance allowing us to be the fastest provider in the Tristate area.  
Unfortunately the job starts at 01:05 (that's the earliest the dump is available) and 
runs until completion finishing with a vacuum full.  The vacuum full locks areas of 
the database long enough that our service is temporarily down.  At the worst point, 
the vacuum full was starting after 09:00, which our customers didn't appreciate.

I'm wondering if there is anything I can do with postgres to alleviate this problem.  
Either upgrading to 7.3.4 (although I tried 7.3.3 for another app, and we had to roll 
back to 7.3.2 because of performance problems), or working with the postgresql.conf to 
enhance performance.  I really don't want to roll back the OS version if possible, but 
I'm not ruling it out at this point, as that seems to be the biggest thing that has 
changed.  All the drive lights are showing green, so I don't believe the array is 
running in degraded mode.  I keep logs of all the insert jobs, and plotting average 
insert times on a graph revealed that this started at the time of the upgrade.

Any help/suggestions would be greatly appreciated,

Thanks,

Alex Turner
NetEconomist

P.S. Sorry this is so long, but I wanted to include as much info as possible.



[PERFORM] Attempt at work around of int4 query won't touch int8 index ...

2003-09-10 Thread James Robinson
Hello,

	I'm trying a work-around on the "index on int8 column gets ignored by  
planner when queried by literal numbers lacking the explicit '::int8'"  
issue, and had hoped that perhaps I could create a functional index on  
the result of casting the pk field to int4, and maybe with a little  
luck the planner would consider the functional index instead. Here's  
what I'm playing with on 7.3.4:

social=# create table foo (id int8 primary key, stuff text);
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index  
'foo_pkey' for table 'foo'
CREATE TABLE
social=# create index foo_pkey_int4 on foo(int4(id));
CREATE INDEX

social=# explain analyze select id from foo where id = 42;
 QUERY PLAN
 

 Seq Scan on foo  (cost=0.00..22.50 rows=1 width=8) (actual  
time=0.01..0.01 rows=0 loops=1)
   Filter: (id = 42)
 Total runtime: 0.15 msec
(3 rows)

social=# explain analyze select id from foo where id = 42::int8;
 QUERY PLAN
 

 Index Scan using foo_pkey on foo  (cost=0.00..4.82 rows=1 width=8)  
(actual time=0.02..0.02 rows=0 loops=1)
   Index Cond: (id = 42::bigint)
 Total runtime: 0.09 msec
(3 rows)

social=# explain analyze select id from foo where id = int4(33);
 QUERY PLAN
 

 Seq Scan on foo  (cost=0.00..22.50 rows=1 width=8) (actual  
time=0.01..0.01 rows=0 loops=1)
   Filter: (id = 33)
 Total runtime: 0.07 msec
(3 rows)
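(For comparison, the usual client-side workaround is to quote or cast the
literal rather than index a cast of the column; a quoted literal starts out
as type "unknown" and resolves to int8, so foo_pkey is usable:)

explain analyze select id from foo where id = '42';
-- should produce the same Index Scan using foo_pkey as the 42::int8 form above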

Is this just a dead end, or is there some variation of this that might  
possibly work, so that ultimately an undoctored literal number, when  
applied to an int8 column, could find an index?

Thanks,
James


Re: [PERFORM] Need advice about triggers

2003-09-10 Thread Hannu Krosing
Mindaugas Riauba wrote on Wed, 10.09.2003 at 13:21:

> 
> router_db=# explain analyze update ifdata set ifspeed=256000,
> ifreason='12121', iflastupdate=CURRENT_TIMESTAMP WHERE clientid='#0003904#';
>  QUERY PLAN
> 
> 
>  Index Scan using ifdata_clientid_key on ifdata  (cost=0.00..5.64 rows=1
> width=116) (actual time=0.17..0.36 rows=1 loops=1)
>Index Cond: (clientid = '#0003904#'::character varying)
>  Total runtime: 1.70 msec
> (3 rows)

could you try the same query on a similar table where clientid is int4?

is it faster?

does the performance degrade at a slower rate?

---
Hannu




Re: [PERFORM] Attempt at work around of int4 query won't touch int8 index ...

2003-09-10 Thread Tom Lane
James Robinson <[EMAIL PROTECTED]> writes:
> Is this just a dead end, or is there some variation of this that might  
> possibly work, so that ultimately an undoctored literal number, when  
> applied to an int8 column, could find an index?

I think it's a dead end.  What I was playing with this afternoon was
removing the int8-and-int4 comparison operators from pg_operator.
It works as far as making "int8col = 42" do the right thing, but I'm
not sure yet about side-effects.

regards, tom lane



Re: [PERFORM] Reading data in bulk - help?

2003-09-10 Thread Josh Berkus
Chris,

> 1) Memory - clumsily adjusted shared_buffer - tried three values: 64,
> 128, 256 with no discernible change in performance. Also adjusted,
> clumsily, effective_cache_size to 1000, 2000, 4000 - with no
> discernible change in performance. I looked at the Admin manual and
> googled around for how to set these values and I confess I'm clueless
> here. I have no idea how many kernel disk page buffers are used nor do
> I understand what the "shared memory buffers" are used for (although
> the postgresql.conf file hints that it's for communication between
> multiple connections). Any advice or pointers to articles/docs is
> appreciated.

You want values *much* higher than that.   How much RAM do you have?  See:
http://www.varlena.com/varlena/GeneralBits/Tidbits/perf.html
http://www.varlena.com/varlena/GeneralBits/Tidbits/annotated_conf_e.html

For example, if you have 512mb RAM, I'd crank up the shared buffers to 8000,
the sort_mem to 8mb, and the effective_cache_size to 24,000.
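In 7.3 terms those numbers look something like this (shared_buffers can only
be changed in postgresql.conf plus a postmaster restart; the other two can be
tried per-session first):

-- postgresql.conf:  shared_buffers = 8000      (8kB pages, about 64MB)
SET sort_mem = 8192;                 -- in kB, i.e. 8MB
SET effective_cache_size = 24000;    -- in 8kB pages, roughly 190MB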

> 3) RAID - haven't tried it - but I'm guessing that the speed
> improvement from a RAID 5 may be on the order of 10x

Probably not ... more like 1.5x - 2.0x, but that's still a significant help, 
yes?  Also, the advantage will get better the more your data grows.

>  - which I can
> likely get from using something like HDF. 

HDF sucks for I/O speed.  XServe will become a much more significant option
in the market when Apple can bring themselves to abandon HDF, and adopt XFS
or something.  This is part of your problem.

> Since the data is unlikely to
> grow beyond 10-20gig, a fast drive and firewire ought to give me the
> performance I need.

Not sure about that.   Is Firewire really faster for I/O than modern SCSI or
233MHz ATA?  I don't do much Mac anymore, but I had the impression that
Firewire was mainly for peripherals...

What is important for your app in terms of speed is to get the data coming 
from multiple drives over multiple channels.   Were it a PC, I'd recommend a 
motherboard with 4 IDE channels or Serial ATA, and spreading the data over 4 
drives via RAID 0 or RAID 5, and adding dual processors.  Then you could use 
multiple postgres connections to read different parts of the table 
simultaneously.
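A sketch of that last idea, reusing the names from Chris's queries (the
split point is arbitrary - the point is disjoint slices per connection):

-- connection 1
SELECT * FROM datatable WHERE group_id <= 3400;
-- connection 2, running at the same time
SELECT * FROM datatable WHERE group_id > 3400;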

>  I know experimentally that the current machine can
> sustain a 20MB/s transfer rate which is 20-30x the speed of these
> queries.

That is interesting.   Adjust your PostgreSQL.conf and see what results you 
get.   It's possible that PostgreSQL is convinced that you have little or no 
RAM because of your .conf settings, and is swapping stuff to temp file on 
disk.

> 4) I'd previously commented out the output/writing steps from the app -
> to isolate read performance.

OK.

-- 
Josh Berkus
Aglio Database Solutions
San Francisco



Re: [PERFORM] Upgrade Woes

2003-09-10 Thread Richard Huxton
On Wednesday 10 September 2003 18:53, [EMAIL PROTECTED] wrote:
> Hi,
>
> My name is Alex Turner and I work for a small Tech company in Pottstown PA.
>  We run Postgresql on a number of systems for a variety of different
> applications, and it has been a joy to deal with all around, working fast
> and reliably for over 2 years.
>
> We recently upgraded from RedHat 7.2 to RedHat 9.0, and we are running
> Postgres 7.3.2 on our Proliant ML370 (Raid 1 2x18 10k, and Raid 5 3x36 10k,
> 2x866 PIII, 2GB RAM).
[snip]
> I have noticed that whilst inserts seem to be slower than before, the
> vacuum full doesn't seem to take as long overall.
>
> postgresql.conf is pretty virgin, and we run postmaster with -B512 -N256
> -i.  /var/lib/pgsql/data is a symlink to /eda/data, /eda being the mount
> point for the Raid 5 array.

First things first then, go to:
http://www.varlena.com/varlena/GeneralBits/Tidbits/index.php
and read the item on Performance Tuning and the commented postgresql.conf

> the database isn't huge, storing about 3 properties, and the largest
> table is 2.1 Million rows for property features.  The dump file is only
> 221MB. Alas, I did not design the schema, but I have made several 'tweaks'
> to it to greatly improve read performance allowing us to be the fastest
> provider in the Tristate area.  Unfortunately the Job starts at 01:05
> (thats the earliest the dump is available) and runs until completion
> finishing with a vacuum full.  The vacuum full locks areas of the database
> long enough that our service is temporarily down.  At the worst point, the
> vacuum full was starting after 09:00, which our customers didn't
> appreciate.

You might be able to avoid a vacuum full by tweaking the *fsm* settings to be 
able to cope with activity.

> I'm wondering if there is anything I can do with postgres to alleviate
> this problem.  Either upgrading to 7.3.4 (although I tried 7.3.3 for
> another app, and we had to roll back to 7.3.2 because of performance
> problems), 

Hmm - can't think what would have changed radically between 7.3.2 and 7.3.3, 
upgrading to .4 is probably sensible.

[snip]
> Any help/suggestions would be greatly appreciated,

You say that each insert/update is a separate transaction. I don't know how 
much "bad" data you get in the dump, but you might be able to do something 
like:

1. Set batch size to 128 items
2. Read batch-size rows from the dump
3. Try to insert/update the batch in one transaction. If it works, move along
by the size of the batch and go back to #1
4. If it fails and batch-size=1, record the error, move along one row and go
back to #1
5. If it fails and batch-size>1, halve batch-size and go back to #3

Your initial batch-size will depend on how many errors there are (but 
obviously use a power of 2).

You could also run an ordinary vacuum every 1000 rows or so (number depends on 
your *fsm* settings as mentioned above).

You might also want to try a REINDEX once a night/week too.
-- 
  Richard Huxton
  Archonet Ltd



Re: [PERFORM] [osdldbt-general] Re: [GENERAL] how to get accurate

2003-09-10 Thread Christopher Browne
The world rejoiced as [EMAIL PROTECTED] (Mary Edie Meredith) wrote:
> We tried 1000 as the default and found that the plans were good
> plans and were consistent, but the pg_statistics was not exactly the
> same.
>
> We took Tom's' advice and tried SET SEED=0 (actually select setseed
> (0) ).

When you're trying to get strict replicability of results, setting the
seed to some specific value is necessary.

Some useful results could be attained by varying the seed, and seeing
how the plans change.

> We did runs last night on our project machine which produced
> consistent pg_statistics data and (of course) the same plans.

> We will next try runs where we vary the default buckets.  Other than
> 10 and 1000, what numbers would you like us to try besides.
> Previously the number 100 was mentioned.  Are there others?

That presumably depends on what your goal is.

A useful experiment would be to see at what point (e.g. - at what
bucket size) plans tend to "settle down" to the right values.

It might well be that defaulting to 23 buckets (I'm picking that out
of thin air) would cause the plans to typically be stable whatever
seed got used.

A test for this would be to, for each bucket size value, repeatedly
ANALYZE and check query plans.

At bucket size 10, you have seen the query plans vary quite a bit.

At 1000, they seem to stabilize very well.

The geometric mean of 10 and 1000 is 100, so it would surely be useful
to see if query plans are stable at that bucket size.

The most interesting number to know would be the lowest number of
buckets at which query plans are nearly always stable.  Supposing that
number was 23 (the number I earlier pulled out of the air), then that
can be used as evidence that the default value for SET STATISTICS
should be changed from 10 to 23.
-- 
wm(X,Y):-write(X),write('@'),write(Y). wm('aa454','freenet.carleton.ca').
http://www3.sympatico.ca/cbbrowne/nonrdbms.html
Sturgeon's Law: 90% of *EVERYTHING* is crud.
