[PERFORM] Suggestions for a data-warehouse migration routine

2005-04-27 Thread Richard Rowell
I've ported enough of my companies database to Postgres to make
warehousing on PG a real possibility.  I thought I would toss my data
migration architecture ideas out for the list to shoot apart..

1.  Script on production server dumps the production database (MSSQL) to
a set of delimited text files.  
2.  Script on production server moves files via FTP to a Postgres
database server.  
3.  File Alteration Monitor trigger on PG server executes script when
last file is transferred.
4.  Script on PG server drops the target database (by issuing a "dropdb"
command).
5.  Script on PG server re-creates target database. (createdb command)
6.  Script on PG server re-creates the tables.
7.  Script on PG server issues COPY commands to import data.
8.  Script on PG server indexes tables.
9.  Script on PG server builds de-normalized reporting tables.
10. Script on PG server indexes the reporting tables.
11. Script on PG server creates needed reporting functions.
12. Vacuum analyze?

My question revolves around the drop/create for the database.  Is their
significant downside to this approach?  I'm taking this approach because
it is simpler from a scripting point of view to simply start from
scratch on each warehouse update.  If I do not drop the database I would
need to delete the contents of each table and drop all indexes prior to
the COPY/data import.  My assumption is all the table deletes and index
drops would be more expensive then just droping/re-creating the entire
database.

Also, is the Vacuum analyze step needed on a freshly minted database
where the indexes have all been newly created?

Thanks in advance for all feedback.


-- 


---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [PERFORM] Whence the Opterons?

2005-06-10 Thread Richard Rowell
I will second the nod to Penguin computing.  We have a bit of Penguin
hardware here (though the majority is Dell).  We did have issues with
one machine a couple of years ago, but Penguin was very pro-active in
addressing that.

We recently picked up a Dual Opteron system from them and have been very
pleased with it so far.  

I would be careful of the RHES that it ships with though.  We had
machine lockups immediately after the suggested  kernel update (had to
down grade manually).  Also, the RH supplied Postgres binary has issues,
so you would need to compile Postgres yourself until the next RH update.

On Fri, 2005-05-06 at 14:39 -0700, Mischa Sandberg wrote:
> After reading the comparisons between Opteron and Xeon processors for Linux,
> I'd like to add an Opteron box to our stable of Dells and Sparcs, for 
> comparison.
> 
> IBM, Sun and HP have their fairly pricey Opteron systems.
> The IT people are not swell about unsupported purchases off ebay.
> Anyone care to suggest any other vendors/distributors?
> Looking for names with national support, so that we can recommend as much to 
> our
> customers.
> 
> Many thanks in advance.
-- 
--
Richard Rowell
[EMAIL PROTECTED]
Bowman Systems
(318) 213-8780


---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])


[PERFORM] Strange behaviour

2006-07-30 Thread Richard Rowell
We are using a BI tool that generates some rather ugly queries.  One of
the ugly queries is taking much longer to return thin I think it
should.  

The select expression when run alone returns in 2 seconds with 35k rows
(http://www.bowmansystems.com/~richard/explain_select.analyze) 

The "where clause" when run alone returns 5200 rows in 10 seconds
(http://www.bowmansystems.com/~richard/explain_where.analyze)

However when I put to two together it takes much, much longer to run.
(http://www.bowmansystems.com/~richard/full.analyze)

Can anyone shed any light on what is going on here?  Why does the
optimizer choose such a slow plan in the combined query when the only
real difference between the full query and the "where only" query is the
number of rows in the result set on the "outside" of the "IN" clause?

A few pertinent observations/facts below

1.  The query is generated by a BI tool, I know it is ugly and stupid in
many cases. However, please try to see the larger issue, that if the
select and where portions are run separately they are both fast but
together it is insanely slow.

2.  The database has vacuumdb -f -z run on it nightly.

3.  Modifications to the stock postgresql.conf:
shared_buffers = 15000 
work_mem = 131072   
default_statistics_target = 100

4.  Dual Dual core Opterons, 4 gigs of ram, 6 disk Serial ATA hardware
RAID 10 running Postgres 8.03 compiled from source running on Debian
stable.

5. The tables being queried are only 200 megs or so combined on disk,
the whole DB is ~ 4 gigs
SELECT sum(relpages*8/1024) AS size_M FROM pg_class;
 size_m

   4178

Thanks!


---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


[PERFORM] Improve performance of query

2004-12-16 Thread Richard Rowell
I'm trying to port our application from MS-SQL to Postgres.  We have
implemented all of our rather complicated application security in the
database.  The query that follows takes a half of a second or less on
MS-SQL server and around 5 seconds on Postgres.  My concern is that this
data set is rather "small" by our applications standards.  It is not
unusual for the da_answer table to have 2-4 million records.  I'm
worried that if this very small data set is taking 5 seconds, then a
"regular sized" data set will take far too long.

I originally thought the NOT EXISTS on the
"da_data_restrict_except_open" table was killing performance, but the
query took the exact same amount of time after I deleted all rows from
this table.  Note that the hard-coded 9.0, and 4000 parameters,
as well as the parameter to svp_getparentproviders are the three
variables that change from one run of this query to the next.

I'm using Postgres 7.4.5 as packaged in Debian.  shared_buffers is set
to 57344 and sort_mem=4096.

The machine has an AMD 1.8+ and ` gig of RAM.  Here are some relevant
performance statistics:
richard:/usr/share/cups/model# cat /proc/sys/kernel/shmmax
536870912
richard:/usr/share/cups/model# cat /proc/sys/kernel/shmall
536870912
richard:/home/richard# hdparm -tT /dev/hda
 Timing cached reads:   1112 MB in  2.00 seconds = 556.00 MB/sec
 Timing buffered disk reads:  176 MB in  3.02 seconds =  58.28 MB/sec

I have included an EXPLAIN ANALYZE, relevant table counts, and relevant
indexing information.  If anyone has any suggestions on how to improve
performance  TIA!

SELECT tab.answer_id, client_id, question_id, recordset_id,
date_effective, virt_field_name
FROM
(
SELECT a.uid AS answer_id, a.client_id, a.question_id, recordset_id,
date_effective
FROM da_answer a
WHERE a.date_effective <= 99.0
AND a.inactive != 1
AND
(
5000 = 4000 
OR
(EXISTS (SELECT * FROM svp_getparentproviderids(1) WHERE
svp_getparentproviderids = a.provider_id))
)
UNION
SELECT a.uid AS answer_id, a.client_id, a.question_id, recordset_id,
date_effective
FROM da_answer a,
(
SELECT main_id 
FROM da_data_restrict
WHERE type_id = 2
AND (provider_id IN (SELECT * FROM svp_getparentproviderids(1)))
   
UNION

SELECT sa.uid AS main_id 
FROM da_answer sa
JOIN da_data_restrict_except_closed dr ON dr.main_id =
sa.uid AND dr.type_id = 2 AND dr.except_provider_id = 1
WHERE (restricted = 1) 
AND (restricted_closed_except = 1)
AND sa.covered_by_roi = 1
UNION
SELECT sa.uid AS main_id 
FROM da_answer sa
WHERE (restricted = 0) 
AND (restricted_open_except = 1)
AND (NOT EXISTS (SELECT dr.main_id FROM
da_data_restrict_except_open dr WHERE (dr.main_id = sa.uid) AND
(dr.type_id = 2) AND (dr.except_provider_id in (select * from
svp_getparentproviderids(1)
AND sa.covered_by_roi = 1
UNION
SELECT sa.uid AS main_id FROM da_answer sa WHERE (restricted
= 0) AND (restricted_open_except = 0)
AND sa.covered_by_roi = 1
) sec
WHERE a.covered_by_roi = 1
AND a.date_effective <= 99.0
AND a.inactive != 1
AND a.uid = sec.main_id
AND 5000 > 4000
) tab, da_question q
WHERE tab.question_id = q.uid AND (min_access_level <= 4000 OR
min_access_level IS NULL)

Table counts from relevant tables
da_question  1095
da_answer   21117
da_question  1095
da_data_restrict_except_closed   3087
da_data_restrict_except_open13391
svp_getparentproviderids(1) 1

Relevant Index
create index in_da_data_restrict_provider_id on
da_data_restrict(provider_id);
create index in_da_data_restrict_main_id on da_data_restrict(main_id);
create index in_da_data_restrict_type_id on da_data_restrict(type_id);
create index in_da_data_restrict_client_id on
da_data_restrict(client_id);
create index in_da_dr_type_provider on
da_data_restrict(type_id,provider_id);

create index in_da_data_rec_provider_id ON
da_data_restrict_except_closed(provider_id);
create index in_da_data_rec_type_id ON
da_data_restrict_except_closed(type_id);
create index in_da_data_rec_main_id ON
da_data_restrict_except_closed(main_id);
create index in_da_data_rec_except_provider_id ON
da_data_restrict_except_closed(except_provider_id);

create index in_da_data_reo_provider_id ON
da_data_restrict_except_open(provider_id);
create index in_da_data_reo_type_id ON
da_data_restrict_except_open(type_id);
create index in_da_data_reo_main_id ON
da_data_restrict_except_open(main_id);
create index in_da_data_reo_except_provider_id ON
da_data_restrict_except_open(except_provider_id);

create index in_da_answer_client_id ON