Re: [PERFORM] oracle to psql migration - slow query in postgres

2010-10-15 Thread Samuel Gendler
On Thu, Oct 14, 2010 at 8:59 PM, Mladen Gogala wrote:

>  If working with partitioning, be very aware that PostgreSQL optimizer has
> certain problems with partitions, especially with group functions. If you
> want speed, everything must be prefixed with partitioning column: indexes,
> expressions, joins. There is no explicit star schema and creating hash
> indexes will not buy you much, as a matter of fact, Postgres community is
> extremely suspicious of the hash indexes and I don't see them widely used.
> Having said that, I was able to solve the problems with my speed and
> partitioning.
>
>
Could you elaborate on this, please? What do you mean by 'everything must be
prefixed with the partitioning column'?

--sam


Re: [PERFORM] Slow count(*) again...

2010-10-15 Thread Devrim GÜNDÜZ
On Wed, 2010-10-13 at 09:02 -0400, Greg Smith wrote:

> XFS support is available as an optional module starting in RHEL 5.5.
> In CentOS, you just grab it, so that's what I've been doing.  My 
> understanding is that you may have to ask your sales rep to enable 
> access to it under the official RedHat Network channels if you're
> using a subscription from them.  I'm not sure exactly what the support
> situation is with it, but it's definitely available as an RPM from
> RedHat.

Right. It is called "Red Hat Scalable File System", and once paid for, it is
available via RHN.
-- 
Devrim GÜNDÜZ
PostgreSQL Danışmanı/Consultant, Red Hat Certified Engineer
PostgreSQL RPM Repository: http://yum.pgrpms.org
Community: devrim~PostgreSQL.org, devrim.gunduz~linux.org.tr
http://www.gunduz.org  Twitter: http://twitter.com/devrimgunduz




Re: [PERFORM] oracle to psql migration - slow query in postgres

2010-10-15 Thread Mladen Gogala

Samuel Gendler wrote:



On Thu, Oct 14, 2010 at 8:59 PM, Mladen Gogala
<mladen.gog...@vmsinfo.com> wrote:


 If working with partitioning, be very aware that PostgreSQL
optimizer has certain problems with partitions, especially with
group functions. If you want speed, everything must be prefixed
with partitioning column: indexes, expressions, joins. There is no
explicit star schema and creating hash indexes will not buy you
much, as a matter of fact, Postgres community is extremely
suspicious of the hash indexes and I don't see them widely used.
Having said that, I was able to solve the problems with my speed
and partitioning.


Could you elaborate on this, please? What do you mean by 'everything 
must be prefixed with partitioning column?'


--sam
If you have a partitioned table part_tab, partitioned on the column 
item_date, and there is a global primary key in Oracle, let's call it 
item_id, then queries like "select * from part_tab where item_id=12345" 
will perform worse than queries that also include item_date:


select * from part_tab where item_id=12345 and item_date='2010-10-15'

This also applies to inserts and updates. Strictly speaking, the 
item_date column in the query above is not necessary; after all, the 
item_id column is the primary key. However, with range scans you will 
get much better results if you include the item_date column than if you 
use a combination of columns without it. The term "prefixed indexes" is 
borrowed from Oracle RDBMS and means that the leading column in the 
index is the column on which the table is partitioned. Oracle, as 
opposed to Postgres, has global indexes: indexes that span all 
partitions. PostgreSQL only maintains indexes on each of the partitions 
separately.  Oracle calls such indexes "local indexes" and defines them 
on the partitioned table level. Here is a brief and rather succinct 
explanation of the terminology:


http://www.oracle-base.com/articles/8i/PartitionedTablesAndIndexes.php
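The point about including the partitioning column can be sketched with the inheritance-style partitioning PostgreSQL used at the time (all names here are illustrative, not taken from Mladen's schema):

```sql
-- A parent table partitioned by item_date, one child per month:
create table part_tab (
    item_id   bigint not null,
    item_date date   not null
);
create table part_tab_2010_10 (
    check (item_date >= date '2010-10-01' and item_date < date '2010-11-01')
) inherits (part_tab);
create index part_tab_2010_10_idx on part_tab_2010_10 (item_date, item_id);

-- With constraint exclusion on, the planner can prove from the CHECK
-- constraints which children cannot contain the requested item_date
-- and skip them entirely:
set constraint_exclusion = on;
explain select * from part_tab
 where item_id = 12345
   and item_date = date '2010-10-15';
```

Without the item_date predicate, the planner has no way to rule any child out, so an index on every partition must be probed.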


Of course, there are other differences between Oracle partitioning and 
PostgreSQL partitioning. The main difference is $1/CPU.

I am talking from experience:

news=> \d moreover_documents
            Table "moreover.moreover_documents"
        Column        |            Type             | Modifiers
----------------------+-----------------------------+-----------
 document_id          | bigint                      | not null
 dre_reference        | bigint                      | not null
 headline             | character varying(4000)     |
 author               | character varying(200)      |
 url                  | character varying(1000)     |
 rank                 | bigint                      |
 content              | text                        |
 stories_like_this    | character varying(1000)     |
 internet_web_site_id | bigint                      | not null
 harvest_time         | timestamp without time zone |
 valid_time           | timestamp without time zone |
 keyword              | character varying(200)      |
 article_id           | bigint                      | not null
 media_type           | character varying(20)       |
 source_type          | character varying(20)       |
 created_at           | timestamp without time zone |
 autonomy_fed_at      | timestamp without time zone |
 language             | character varying(150)      |
Indexes:
    "moreover_documents_pkey" PRIMARY KEY, btree (document_id)
Triggers:
    insert_moreover_trigger BEFORE INSERT ON moreover_documents
    FOR EACH ROW EXECUTE PROCEDURE moreover_insert_trgfn()
Number of child tables: 8 (Use \d+ to list them.)

The child tables are, of course, partitions.

Here is the original:


Connected to:
Oracle Database 10g Enterprise Edition Release 10.2.0.5.0 - Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining
and Real Application Testing options

SQL> desc moreover_documents
 Name                    Null?    Type
 ----------------------- -------- ----------------
 DOCUMENT#               NOT NULL NUMBER
 DRE_REFERENCE           NOT NULL NUMBER
 HEADLINE                         VARCHAR2(4000)
 AUTHOR                           VARCHAR2(200)
 URL                              VARCHAR2(1000)
 RANK                             NUMBER
 CONTENT                          CLOB
 STORIES_LIKE_THIS                VARCHAR2(1000)
 INTERNET_WEB_SITE#      NOT NULL NUMBER
 HARVEST_TIME                     DATE
 VALID_TIME                       DATE
 KEYWORD                          VARCHAR2(200)
 ARTICLE_ID              NOT NULL NUMBER
 MEDIA_TYPE                       VARCHAR2(20)
 CREATED_AT                       DATE
 SOURCE_TYPE                      VARCHAR2(50)
 PUBLISH_DATE                     DATE
 AUTONOMY_FED_AT                  DATE
 LANGUAGE                         VARCHAR2(150)

SQL>



I must say that it took me some time to get things right.

--

Mladen Gogala 
Sr. Oracle DBA

1500 Broadway
New York, NY 10036

Re: [PERFORM] oracle to psql migration - slow query in postgres

2010-10-15 Thread Merlin Moncure
On Thu, Oct 14, 2010 at 3:43 PM, Tony Capobianco wrote:
> explain analyze create table tmp_srcmem_emws1
> as
> select emailaddress, websiteid
>  from members
>  where emailok = 1
>   and emailbounced = 0;

*) as others have noted, none of your indexes will back this
expression.  For an index to match properly, the index must have all
the fields matched in the 'where' clause in left-to-right order.  You
could rearrange the indexes you already have and probably get things to
work properly.
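The left-to-right rule above can be sketched like this (column names are from the table under discussion; the index name is made up):

```sql
-- An index whose leading columns are constrained by the WHERE clause
-- can be used:
create index members_ok_bounced_idx on members (emailok, emailbounced);

-- Served by the index above (both leading columns constrained):
select emailaddress, websiteid
  from members
 where emailok = 1
   and emailbounced = 0;

-- An index such as (websiteid, emailok, emailbounced) would NOT help
-- here, because its leading column is not constrained by the query.
```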

*) If you want things to go really fast, and the combination of
emailok, emailbounced is a small percentage of the table (say, less
than 5%), and you are not interested in the schema-level changes your
table is screaming for, and the (1,0) combination is what you
frequently want to match, then you should consider:

create function email_interesting(ok numeric, bounced numeric) returns bool as
$$
  select $1 = 1 and $2 = 0;
$$ language sql immutable;

create index members_email_interesting_idx on
  members(email_interesting(emailok, emailbounced))
  where email_interesting(emailok, emailbounced);

This will build a partial index which you can query via:
select emailaddress, websiteid
 from members
 where email_interesting(emailok, emailbounced);

merlin

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] oracle to psql migration - slow query in postgres

2010-10-15 Thread Tony Capobianco
The recommendations on the numeric columns are fantastic.  Thank you
very much.  We will revisit our methods of assigning datatypes when we
migrate our data over from Oracle.
Regarding the full table scans: it appears inevitable that full table
scans are necessary for the volume of data involved and the present
design of our indexes.  Over time, indexes were added and removed to
satisfy particular functionality.  Considering this is our most
important table, I will research exactly how this table is queried to
better optimize/reorganize our indexes.

Thanks for your help.
Tony


On Thu, 2010-10-14 at 23:59 -0400, Mladen Gogala wrote:
> On 10/14/2010 4:10 PM, Jon Nelson wrote:
> > The first thing I'd do is think real hard about whether you really
> > really want 'numeric' instead of boolean, smallint, or integer.  The
> > second thing is that none of your indices (which specify a whole bunch
> > of fields, by the way) have only just emailok, emailbounced, or only
> > the pair of them. Without knowing the needs of your app, I would
> > reconsider your index choices and go with fewer columns per index.
> >
> Also, make sure that the statistics are good, that histograms are large 
> enough and that Geico (the genetic query optimizer) will really work 
> hard to save you 15% or more on the query execution time. You can also 
> make sure that any existing index is used, by disabling the 
> sequential scan and then activating and de-activating indexes with 
> dummy expressions, just as it was done with Oracle's rule-based optimizer.
> I agree that a good data model is even more crucial for Postgres than is 
> the case with Oracle. Oracle, because of its rich assortment of tweaking 
> & hacking tools and parameters, can be made to perform, even if the 
> model is designed by someone who didn't apply the rules of good design. 
> Postgres is much more susceptible to bad models and it is much harder to 
> work around a badly designed model in Postgres than in Oracle. What 
> people do not understand is that every application in the world will 
> become badly designed after years of maintenance, adding columns, 
> creating additional indexes, views, tables and triggers and then 
> deploying various tools to design applications.  As noted by Murphy, 
> things develop from bad to worse. Keep Postgres models simple and 
> separated, because it's much easier to keep clearly defined models 
> simple and effective than to keep models with 700 tables and 350 views, 
> frequently with conflicting names, different columns named the same and 
> same columns named differently. And monitor, monitor, monitor. Use 
> strace, ltrace,  pgstatspack, auto_explain, pgfouine, pgadmin, top, sar, 
> iostat and all tools you can get hold of. Without the event interface, 
> it's frequently a guessing game.  It is, however, possible to manage 
> things.  If working with partitioning, be very aware that PostgreSQL 
> optimizer has certain problems with partitions, especially with group 
> functions. If you want speed, everything must be prefixed with 
> partitioning column: indexes, expressions, joins. There is no explicit 
> star schema and creating hash indexes will not buy you much, as a matter 
> of fact, Postgres community is extremely suspicious of the hash indexes 
> and I don't see them widely used.
> Having said that, I was able to solve the problems with my speed and 
> partitioning.
> 
> -- 
> Mladen Gogala
> Sr. Oracle DBA
> 1500 Broadway
> New York, NY 10036
> (212) 329-5251
> www.vmsinfo.com
> 
> 





Re: [PERFORM] oracle to psql migration - slow query in postgres

2010-10-15 Thread Maciek Sakrejda
>> This table has approximately 300million rows.
>
> and your query grab rows=236 660 930 of them. An index might be
> useless in this situation.

I want to point out that this is probably the most important comment
here. A couple of people have noted that the index won't work for
this query, but more importantly, an index is (probably) not desirable
for this query. As an analogy (since everyone loves half-baked
programming analogies), if you want to find a couple of bakeries to
sponsor your MySQL Data Integrity Issues Awareness Walk by donating
scones, you use the yellow pages. If you want to hit up every business
in the area to donate whatever they can, you're better off canvassing
the neighborhood.
---
Maciek Sakrejda | System Architect | Truviso

1065 E. Hillsdale Blvd., Suite 215
Foster City, CA 94404
(650) 242-3500 Main
www.truviso.com



[PERFORM] Stored procedure declared as VOLATILE => no good optimization is done

2010-10-15 Thread Damon Snyder
Hello,
I have heard it said that if a stored procedure is declared as VOLATILE,
then no good optimizations can be done on queries within the stored
procedure or queries that use the stored procedure (say as the column in a
view). I have seen this in practice, recommended on the irc channel, and in
the archives (
http://archives.postgresql.org/pgsql-performance/2008-01/msg00283.php). Can
someone help me understand or point me to some documentation explaining why
this is so?

Any insights would be appreciated. I'm new to pgsql and would like to know a
little more about what is going on under the hood.

Thanks,
Damon
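The effect being asked about can be reproduced with a minimal sketch (the function and table names here are invented for illustration, not from the thread):

```sql
-- Two SQL functions computing the same expression, declared differently.
-- IMMUTABLE promises the result depends only on the arguments, so the
-- planner may inline the call and fold it to a constant at plan time;
-- VOLATILE forces a fresh call per row and blocks such optimizations.
create function plus_one_imm(i integer) returns integer as
$$ select $1 + 1 $$ language sql immutable;

create function plus_one_vol(i integer) returns integer as
$$ select $1 + 1 $$ language sql volatile;

-- With an index on t(x), compare the plans: the immutable call can be
-- folded to the constant 42, allowing an index scan; the volatile call
-- stays in the plan and typically forces a per-row filter.
explain select * from t where x = plus_one_imm(41);
explain select * from t where x = plus_one_vol(41);
```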


Re: [PERFORM] Slow count(*) again...

2010-10-15 Thread Mladen Gogala

Jon Nelson wrote:


Well, I didn't quite mean that - having no familiarity with Oracle I
don't know what the alter system statement does, but I was talking
specifically about the linux buffer and page cache. 
  


Those are not utilized by Oracle.  This is a RAC instance, running on 
top of ASM, which is an Oracle volume manager, using raw devices. There 
is no file system on those disks:


SQL> select file_name from dba_data_files
 2  where tablespace_name='ADBASE_DATA';

FILE_NAME

+DGDATA/stag3/datafile/adbase_data.262.727278257
+DGDATA/stag3/datafile/adbase_data.263.727278741
+DGDATA/stag3/datafile/adbase_data.264.727280145
+DGDATA/stag3/datafile/adbase_data.265.727280683

[ora...@lpo-oracle-30 ~]$ $ORA_CRS_HOME/bin/crs_stat -l
NAME=ora.STAG3.STAG31.inst
TYPE=application
TARGET=ONLINE
STATE=ONLINE on lpo-oracle-30

NAME=ora.STAG3.STAG32.inst
TYPE=application
TARGET=ONLINE
STATE=ONLINE on lpo-oracle-31

NAME=ora.STAG3.db
TYPE=application
TARGET=ONLINE
STATE=ONLINE on lpo-oracle-31

NAME=ora.lpo-oracle-30.ASM1.asm
TYPE=application
TARGET=ONLINE
STATE=ONLINE on lpo-oracle-30

NAME=ora.lpo-oracle-30.LISTENER_LPO-ORACLE-30.lsnr
TYPE=application
TARGET=ONLINE
STATE=ONLINE on lpo-oracle-30

NAME=ora.lpo-oracle-30.gsd
TYPE=application
TARGET=ONLINE
STATE=ONLINE on lpo-oracle-30

NAME=ora.lpo-oracle-30.ons
TYPE=application
TARGET=ONLINE
STATE=ONLINE on lpo-oracle-30

NAME=ora.lpo-oracle-30.vip
TYPE=application
TARGET=ONLINE
STATE=ONLINE on lpo-oracle-30

NAME=ora.lpo-oracle-31.ASM2.asm
TYPE=application
TARGET=ONLINE
STATE=ONLINE on lpo-oracle-31

NAME=ora.lpo-oracle-31.LISTENER_LPO-ORACLE-31.lsnr
TYPE=application
TARGET=ONLINE
STATE=ONLINE on lpo-oracle-31

NAME=ora.lpo-oracle-31.gsd
TYPE=application
TARGET=ONLINE
STATE=ONLINE on lpo-oracle-31

NAME=ora.lpo-oracle-31.ons
TYPE=application
TARGET=ONLINE
STATE=ONLINE on lpo-oracle-31

NAME=ora.lpo-oracle-31.vip
TYPE=application
TARGET=ONLINE
STATE=ONLINE on lpo-oracle-31



The only way to flush the cache is the aforementioned "alter system" 
command. AFAIK, Postgres doesn't have anything like that. Oracle uses 
raw devices precisely to avoid double buffering.
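Since Postgres relies on the OS page cache rather than raw devices, the closest equivalent for cold-cache benchmarking on Linux is to restart the server and drop the kernel caches. This is a sketch only; it requires root, and service names vary by distro:

```shell
# Stop postgres so shared_buffers are discarded, drop the OS page cache
# ('echo 3' drops page cache plus dentries and inodes), then restart.
sudo service postgresql stop
sync
echo 3 | sudo tee /proc/sys/vm/drop_caches
sudo service postgresql start
```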


--

Mladen Gogala 
Sr. Oracle DBA

1500 Broadway
New York, NY 10036
(212) 329-5251
http://www.vmsinfo.com 
The Leader in Integrated Media Intelligence Solutions







[PERFORM] help with understanding EXPLAIN and boosting performance

2010-10-15 Thread Brandon Casci
Hello

I have an application hosted on Heroku. They use postgres. It's more or less
abstracted away, but I can get some performance data from New Relic. For the
most part, performance is ok, but now and then some queries take a few
seconds, and spike up to 15 or even 16 seconds! Ouch!

This is the most detailed information I could get form New Relic. Do you
have any suggestions how I could improve the performance?

QUERY PLAN
Limit  (cost=26.49..1893.46 rows=1 width=4)
  ->  Unique  (cost=26.49..44833.82 rows=24 width=4)
        ->  Nested Loop  (cost=26.49..44833.81 rows=24 width=4)
              ->  Merge Join  (cost=26.49..44532.99 rows=4773 width=8)
                    Merge Cond: (songs.id = artists_songs.song_id)
                    ->  Index Scan using songs_pkey on songs  (cost=0.00..25219.30 rows=4773 width=4)
                          Filter: (lower((name)::text) = 'thirteen'::text)
                    ->  Index Scan using index_artists_songs_on_song_id on artists_songs  (cost=0.00..18822.04 rows=960465 width=8)
              ->  Index Scan using artists_pkey on artists  (cost=0.00..0.06 rows=1 width=4)
                    Index Cond: (artists.id = artists_songs.artist_id)
                    Filter: (lower((artists.name)::text) = 'red mountain church'::text)

Thanks!

-- 
=
Brandon Casci
Loudcaster
http://loudcaster.com
=


[PERFORM] Index scan / Index cond limitation or ?

2010-10-15 Thread Nikolai Zhubr

Hello people,

I'm having trouble persuading an index scan to check all of the conditions 
I specify _inside_ the index cond. That is, _some_ condition always gets 
pushed out of the index cond and applied later (which will often result, for 
my real table contents, in too many unwanted rows initially hit by the index 
scan, and hence randomly slow queries).

An index with all relevant columns does exist of course.

Here goes an example.

create table foo (
  id serial primary key,
  rec_time timestamp with time zone DEFAULT now(),
  some_value integer,
  some_data text
);
CREATE INDEX foo_test ON foo (id, rec_time, some_value);
set enable_seqscan = false;
set enable_bitmapscan = true;

explain select id from foo where true
  and rec_time > '2010-01-01 22:00:06'
  --and rec_time < '2010-10-14 23:59'
  and some_value in (1, 2)
  and id > 123

This one works perfectly, as I want it (note the "and rec_time < ..." 
condition is commented out):


Bitmap Heap Scan on foo  (cost=13.18..17.19 rows=1 width=4)
  Recheck Cond: ((id > 123) AND (rec_time > '2010-01-01 22:00:06+03'::timestamp with time zone) AND (some_value = ANY ('{1,2}'::integer[])))
  ->  Bitmap Index Scan on foo_test  (cost=0.00..13.18 rows=1 width=0)
        Index Cond: ((id > 123) AND (rec_time > '2010-01-01 22:00:06+03'::timestamp with time zone) AND (some_value = ANY ('{1,2}'::integer[])))


Now, as soon as I enable the "and rec_time < ..." condition, I get the
following:

explain select id from foo where true
  and rec_time > '2010-01-01 22:00:06'
  and rec_time < '2010-10-14 23:59'
  and some_value in (1, 2)
  and id > 123

Bitmap Heap Scan on foo  (cost=8.59..13.94 rows=1 width=4)
  Recheck Cond: ((id > 123) AND (rec_time > '2010-01-01 22:00:06+03'::timestamp with time zone) AND (rec_time < '2010-10-14 23:59:00+04'::timestamp with time zone))
  Filter: (some_value = ANY ('{1,2}'::integer[]))
  ->  Bitmap Index Scan on foo_test  (cost=0.00..8.59 rows=2 width=0)
        Index Cond: ((id > 123) AND (rec_time > '2010-01-01 22:00:06+03'::timestamp with time zone) AND (rec_time < '2010-10-14 23:59:00+04'::timestamp with time zone))


So, "in (1, 2)" condition is not in Index Cond anymore! Why is that? How 
can I push it back?


SELECT version();
PostgreSQL 8.3.1, compiled by Visual C++ build 1400
but the behaviour seems exactly the same in 9.0 (just checked it briefly).

Thank you!
Please CC me, I'm not on the list.

Nikolai



Re: [PERFORM] Auto ANALYZE criteria

2010-10-15 Thread Joe Miller
Thanks for fixing the docs, but if that's the case, I shouldn't be
seeing the behavior that I'm seeing.

Should I flesh out this test case a little better and file a bug?

Thanks,

Joe


On Tue, Sep 21, 2010 at 4:44 PM, Tom Lane  wrote:
> Joe Miller  writes:
>> I was looking at the autovacuum documentation:
>> http://www.postgresql.org/docs/9.0/interactive/routine-vacuuming.html#AUTOVACUUM
>
>>    For analyze, a similar condition is used: the threshold, defined as:
>>    analyze threshold = analyze base threshold + analyze scale factor *
>> number of tuples
>>    is compared to the total number of tuples inserted or updated since
>> the last ANALYZE.
>
>> I guess that should be updated to read "insert, updated or deleted".
>
> Mph.  We caught the other places where the docs explain what the analyze
> threshold is, but missed that one.  Fixed, thanks for pointing it out.
>
>                        regards, tom lane
>



[PERFORM] help with understanding EXPLAIN

2010-10-15 Thread Brandon Casci
Hello

I have an application hosted on Heroku. They use postgres. It's more or less
abstracted away, but I can get some performance data from New Relic. For the
most part, performance is ok, but now and then some queries take a few
seconds, and spike up to 15 or even 16 seconds! Ouch!

This is the most detailed information I could get form New Relic. Do you
have any suggestions how I could improve the performance?

QUERY PLAN
Limit  (cost=26.49..1893.46 rows=1 width=4)
  ->  Unique  (cost=26.49..44833.82 rows=24 width=4)
        ->  Nested Loop  (cost=26.49..44833.81 rows=24 width=4)
              ->  Merge Join  (cost=26.49..44532.99 rows=4773 width=8)
                    Merge Cond: (songs.id = artists_songs.song_id)
                    ->  Index Scan using songs_pkey on songs  (cost=0.00..25219.30 rows=4773 width=4)
                          Filter: (lower((name)::text) = 'thirteen'::text)
                    ->  Index Scan using index_artists_songs_on_song_id on artists_songs  (cost=0.00..18822.04 rows=960465 width=8)
              ->  Index Scan using artists_pkey on artists  (cost=0.00..0.06 rows=1 width=4)
                    Index Cond: (artists.id = artists_songs.artist_id)
                    Filter: (lower((artists.name)::text) = 'red mountain church'::text)


-- 
=
Brandon Casci
Loudcaster
http://loudcaster.com
=


Re: [PERFORM] Slow count(*) again...

2010-10-15 Thread Mladen Gogala

Neil Whelchel wrote:



That is why I suggested an estimate(*) that works like (a faster) count(*) 
except that it may be off a bit. I think that is what he was talking about 
when he wrote this.


  
The main problem with "select count(*)" is that it gets seriously 
misused.  Using "select count(*)" to establish existence is bad for 
performance and for code readability.
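A common concrete instance of that misuse, as a sketch (the table name is invented): when code only needs to know whether any matching row exists, EXISTS lets the executor stop at the first hit instead of counting every match.

```sql
-- Counts every matching row just to compare the total with zero:
select count(*) > 0 from orders where customer_id = 42;

-- Stops as soon as one matching row is found:
select exists (select 1 from orders where customer_id = 42);
```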


--

Mladen Gogala 
Sr. Oracle DBA

1500 Broadway
New York, NY 10036
(212) 329-5251
http://www.vmsinfo.com 
The Leader in Integrated Media Intelligence Solutions







Re: [PERFORM] oracle to psql migration - slow query in postgres

2010-10-15 Thread Igor Neyman
 

> -Original Message-
> From: Tony Capobianco [mailto:tcapobia...@prospectiv.com] 
> Sent: Thursday, October 14, 2010 3:43 PM
> To: pgsql-performance@postgresql.org
> Subject: oracle to psql migration - slow query in postgres
> 
> We are in the process of testing migration of our oracle data 
> warehouse over to postgres.  A potential showstopper are full 
> table scans on our members table.  We can't function on 
> postgres effectively unless index scans are employed.  I'm 
> thinking I don't have something set correctly in my 
> postgresql.conf file, but I'm not sure what.
> 
> This table has approximately 300million rows.
> 
> Version:
> SELECT version();
> 
> version  
> --
> 
>  PostgreSQL 8.4.2 on x86_64-redhat-linux-gnu, compiled by GCC 
> gcc (GCC)
> 4.1.2 20071124 (Red Hat 4.1.2-42), 64-bit
> 
> We have 4 quad-core processors and 32GB of RAM.  The below 
> query uses the members_sorted_idx_001 index in oracle, but in 
> postgres, the optimizer chooses a sequential scan.
> 
> explain analyze create table tmp_srcmem_emws1 as select 
> emailaddress, websiteid
>   from members
>  where emailok = 1
>and emailbounced = 0;
>   QUERY
> PLAN  
> --
> 
>  Seq Scan on members  (cost=0.00..14137154.64 rows=238177981 
> width=29) (actual time=0.052..685834.785 rows=236660930 loops=1)
>Filter: ((emailok = 1::numeric) AND (emailbounced = 
> 0::numeric))  Total runtime: 850306.220 ms
> (3 rows)
> 
> show shared_buffers ;
>  shared_buffers
> 
>  7680MB
> (1 row)
> 
> show effective_cache_size ;
>  effective_cache_size
> --
>  22GB
> (1 row)
> 
> show work_mem ;
>  work_mem
> --
>  768MB
> (1 row)
> 
> show enable_seqscan ;
>  enable_seqscan
> 
>  on
> (1 row)
> 
> Below are the data definitions for the table/indexes in question:
> 
> \d members
>  Table "members"
>Column|Type | Modifiers 
> -+-+---
>  memberid| numeric | not null
>  firstname   | character varying(50)   | 
>  lastname| character varying(50)   | 
>  emailaddress| character varying(50)   | 
>  password| character varying(50)   | 
>  address1| character varying(50)   | 
>  address2| character varying(50)   | 
>  city| character varying(50)   | 
>  statecode   | character varying(50)   | 
>  zipcode | character varying(50)   | 
>  birthdate   | date| 
>  emailok | numeric(2,0)| 
>  gender  | character varying(1)| 
>  addeddate   | timestamp without time zone | 
>  emailbounced| numeric(2,0)| 
>  changedate  | timestamp without time zone | 
>  optoutsource| character varying(100)  | 
>  websiteid   | numeric | 
>  promotionid | numeric | 
>  sourceid| numeric | 
>  siteid  | character varying(64)   | 
>  srcwebsiteid| numeric | 
>  homephone   | character varying(20)   | 
>  homeareacode| character varying(10)   | 
>  campaignid  | numeric | 
>  srcmemberid | numeric | 
>  optoutdate  | date| 
>  regcomplete | numeric(1,0)| 
>  regcompletesourceid | numeric | 
>  ipaddress   | character varying(25)   | 
>  pageid  | numeric | 
>  streetaddressstatus | numeric(1,0)| 
>  middlename  | character varying(50)   | 
>  optinprechecked | numeric(1,0)| 
>  optinposition   | numeric | 
>  homephonestatus | numeric | 
>  addeddate_id| numeric | 
>  changedate_id   | numeric | 
>  rpmindex| numeric | 
>  optmode | numeric(1,0)| 
>  countryid   | numeric | 
>  confirmoptin| numeric(2,0)| 
>  bouncedate  | date| 
>  memberageid | numeric | 
>  sourceid2   | numeric | 
>  remoteuserid| character varying(50)   | 

Re: [PERFORM] oracle to psql migration - slow query in postgres

2010-10-15 Thread Tony Capobianco
Thanks for all your responses. What's interesting is that an index is
used when this query is executed in Oracle.  It appears to do some
parallel processing:

SQL> set line 200
delete from plan_table;
explain plan for
select websiteid, emailaddress
  from members
 where emailok = 1
   and emailbounced = 0;

SELECT PLAN_TABLE_OUTPUT FROM TABLE(DBMS_XPLAN.DISPLAY());
SQL> 
3 rows deleted.

SQL>   2345  
Explained.

SQL> SQL> 
PLAN_TABLE_OUTPUT

Plan hash value: 4247959398

---
| Id  | Operation   | Name   | Rows  | Bytes
| Cost (%CPU)| Time |TQ  |IN-OUT| PQ Distrib |
---
|   0 | SELECT STATEMENT||   237M|
7248M|   469K  (2)| 01:49:33 ||  ||
|   1 |  PX COORDINATOR ||   |
||  ||  ||
|   2 |   PX SEND QC (RANDOM)   | :TQ1   |   237M|
7248M|   469K  (2)| 01:49:33 |  Q1,00 | P->S | QC (RAND)  |
|   3 |PX BLOCK ITERATOR||   237M|
7248M|   469K  (2)| 01:49:33 |  Q1,00 | PCWC ||
|*  4 | INDEX FAST FULL SCAN| MEMBERS_SORTED_IDX_001 |   237M|
7248M|   469K  (2)| 01:49:33 |  Q1,00 | PCWP ||
---

PLAN_TABLE_OUTPUT


Predicate Information (identified by operation id):
---

   4 - filter("EMAILBOUNCED"=0 AND "EMAILOK"=1)

16 rows selected.


On Fri, 2010-10-15 at 13:43 -0400, Igor Neyman wrote:
> 
> > -Original Message-
> > From: Tony Capobianco [mailto:tcapobia...@prospectiv.com] 
> > Sent: Thursday, October 14, 2010 3:43 PM
> > To: pgsql-performance@postgresql.org
> > Subject: oracle to psql migration - slow query in postgres
> > 
> > We are in the process of testing migration of our oracle data 
> > warehouse over to postgres.  A potential showstopper are full 
> > table scans on our members table.  We can't function on 
> > postgres effectively unless index scans are employed.  I'm 
> > thinking I don't have something set correctly in my 
> > postgresql.conf file, but I'm not sure what.
> > 
> > This table has approximately 300million rows.
> > 
> > Version:
> > SELECT version();
> > 
> > version  
> > --
> > 
> >  PostgreSQL 8.4.2 on x86_64-redhat-linux-gnu, compiled by GCC 
> > gcc (GCC)
> > 4.1.2 20071124 (Red Hat 4.1.2-42), 64-bit
> > 
> > We have 4 quad-core processors and 32GB of RAM.  The below 
> > query uses the members_sorted_idx_001 index in oracle, but in 
> > postgres, the optimizer chooses a sequential scan.
> > 
> > explain analyze create table tmp_srcmem_emws1 as select 
> > emailaddress, websiteid
> >   from members
> >  where emailok = 1
> >and emailbounced = 0;
> >   QUERY
> > PLAN  
> > --
> > 
> >  Seq Scan on members  (cost=0.00..14137154.64 rows=238177981 
> > width=29) (actual time=0.052..685834.785 rows=236660930 loops=1)
> >Filter: ((emailok = 1::numeric) AND (emailbounced = 
> > 0::numeric))  Total runtime: 850306.220 ms
> > (3 rows)
> > 
> > show shared_buffers ;
> >  shared_buffers
> > 
> >  7680MB
> > (1 row)
> > 
> > show effective_cache_size ;
> >  effective_cache_size
> > --
> >  22GB
> > (1 row)
> > 
> > show work_mem ;
> >  work_mem
> > --
> >  768MB
> > (1 row)
> > 
> > show enable_seqscan ;
> >  enable_seqscan
> > 
> >  on
> > (1 row)
> > 
> > Below are the data definitions for the table/indexes in question:
> > 
> > \d members
> >  Table "members"
> >Column|Type | Modifiers 
> > -+-+---
> >  memberid| numeric | not null
> >  firstname   | character varying(50)   | 
> >  lastname| chara

Re: [PERFORM] Index scan / Index cond limitation or ?

2010-10-15 Thread Tom Lane
Nikolai Zhubr  writes:
> So, "in (1, 2)" condition is not in Index Cond anymore! Why is that? How 
> can I push it back?

It thinks the indexscan condition is sufficiently selective already.
An = ANY condition like that will force multiple index searches,
one for each of the OR'd possibilities, so it's far from "free" to add
it to the index condition.  The planner doesn't think it's worth it.
Perhaps on your real query it is, but there's not much point in
debating about the behavior on this toy table; without realistic
table sizes and up-to-date stats it's impossible to say whether that
choice is correct or not.

> SELECT version();
> PostgreSQL 8.3.1, compiled by Visual C++ build 1400

You really, really, really ought to be running 8.3.something-newer.
We didn't put out the last 11 8.3.x bugfix updates just because
we didn't have anything better to do.

regards, tom lane



Re: [PERFORM] help with understanding EXPLAIN and boosting performance

2010-10-15 Thread Tom Lane
Brandon Casci  writes:
> I have an application hosted on Heroku. They use postgres. It's more or less
> abstracted away, but I can get some performance data from New Relic. For the
> most part, performance is ok, but now and then some queries take a few
> seconds, and spike up to 15 or even 16 seconds! Ouch!

The particular query you're showing here doesn't look terribly
expensive.  Are you sure this is one that took that long?

If you're seeing identical queries take significantly different times,
I'd wonder about what else is happening on the server.  The most obvious
explanation for that type of behavior is that everything is swapped into
RAM when it's fast, but has to be read from disk when it's slow.  If
that's what's happening you should consider buying more RAM or
offloading some activities to other machines, so that there's not so
much competition for memory space.

If specific queries are consistently slow then EXPLAIN might give some
useful info about those.  It's unlikely to tell you much about
non-reproducible slowdowns, though.

regards, tom lane



Re: [PERFORM] oracle to psql migration - slow query in postgres

2010-10-15 Thread Igor Neyman

> -Original Message-
> From: Tony Capobianco [mailto:tcapobia...@prospectiv.com] 
> Sent: Friday, October 15, 2010 2:14 PM
> To: pgsql-performance@postgresql.org
> Subject: Re: oracle to psql migration - slow query in postgres
> 
> Thanks for all your responses. What's interesting is that an 
> index is used when this query is executed in Oracle.  It 
> appears to do some parallel processing:
> 
> SQL> set line 200
> delete from plan_table;
> explain plan for
> select websiteid, emailaddress
>   from members
>  where emailok = 1
>and emailbounced = 0;
> 
> SELECT PLAN_TABLE_OUTPUT FROM TABLE(DBMS_XPLAN.DISPLAY());
> SQL> 
> 3 rows deleted.
> 
> SQL>   2345  
> Explained.
> 
> SQL> SQL> 
> PLAN_TABLE_OUTPUT
> ------------------------------------------------------------------------------------------------------------------------------
> Plan hash value: 4247959398
> 
> ------------------------------------------------------------------------------------------------------------------------------
> | Id  | Operation               | Name                   | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
> ------------------------------------------------------------------------------------------------------------------------------
> |   0 | SELECT STATEMENT        |                        |   237M|  7248M|   469K  (2)| 01:49:33 |        |      |            |
> |   1 |  PX COORDINATOR         |                        |       |       |            |          |        |      |            |
> |   2 |   PX SEND QC (RANDOM)   | :TQ1                   |   237M|  7248M|   469K  (2)| 01:49:33 |  Q1,00 | P->S | QC (RAND)  |
> |   3 |    PX BLOCK ITERATOR    |                        |   237M|  7248M|   469K  (2)| 01:49:33 |  Q1,00 | PCWC |            |
> |*  4 |     INDEX FAST FULL SCAN| MEMBERS_SORTED_IDX_001 |   237M|  7248M|   469K  (2)| 01:49:33 |  Q1,00 | PCWP |            |
> ------------------------------------------------------------------------------------------------------------------------------
> 
> Predicate Information (identified by operation id):
> ---------------------------------------------------
> 
>    4 - filter("EMAILBOUNCED"=0 AND "EMAILOK"=1)
> 
> 16 rows selected.
> 
> 

1. Postgres doesn't have "FAST FULL SCAN", because even if all the info
is in the index, it still needs to visit the row in the table (the
"visibility" issue).

2. Postgres doesn't have parallel query execution.

BUT, it's free and has great community support, as you already saw.

Regards,
Igor Neyman



Re: [PERFORM] oracle to psql migration - slow query in postgres

2010-10-15 Thread Tony Capobianco
Very true Igor!  Free is my favorite price.  
I'll figure a way around this issue.

Thanks for your help.
Tony

On Fri, 2010-10-15 at 14:54 -0400, Igor Neyman wrote:
> 1. Postgres doesn't have "FAST FULL SCAN", because even if all the info
> is in the index, it still needs to visit the row in the table (the
> "visibility" issue).
> 
> 2. Postgres doesn't have parallel query execution.
> 
> BUT, it's free and has great community support, as you already saw.
> 
> Regards,
> Igor Neyman
> 





Re: [PERFORM] Stored procedure declared as VOLATILE => no good optimization is done

2010-10-15 Thread Kevin Grittner
Damon Snyder  wrote:
 
> I have heard it said that if a stored procedure is declared as
> VOLATILE, then no good optimizations can be done on queries within
> the stored procedure or queries that use the stored procedure (say
> as the column in a view). I have seen this in practice, recommended
> on the irc channel, and in the archives (
> http://archives.postgresql.org/pgsql-performance/2008-01/msg00283.php
> ). Can
> someone help me understand or point me to some documentation
> explaining why this is so?
 
Here's the documentation:
 
http://www.postgresql.org/docs/current/interactive/sql-createfunction.html
 
http://www.postgresql.org/docs/current/interactive/xfunc-volatility.html
 
-Kevin



Re: [PERFORM] Stored procedure declared as VOLATILE => no good optimization is done

2010-10-15 Thread Merlin Moncure
On Mon, Oct 11, 2010 at 7:10 PM, Damon Snyder  wrote:
> Hello,
> I have heard it said that if a stored procedure is declared as VOLATILE,
> then no good optimizations can be done on queries within the stored
> procedure or queries that use the stored procedure (say as the column in a
> view). I have seen this in practice, recommended on the irc channel, and in
> the archives
> (http://archives.postgresql.org/pgsql-performance/2008-01/msg00283.php). Can
> someone help me understand or point me to some documentation explaining why
> this is so?
> Any insights would be appreciated. I'm new to pgsql and would like to know a
> little more about what is going on under the hood.
> Thanks,
> Damon

The theory behind 'volatile' is pretty simple -- each execution of the
function, regardless of the inputs, can be expected to produce a
completely independent result, or to modify the database.  In the case
of immutable, which is at the other end, a particular set of inputs will
produce one and only one result, and the function doesn't modify anything.

In the immutable case, the planner can shuffle the function call
around in the query, calling it less often, simplifying joins, etc.  There
are lots of theoretical optimizations that can be done since the
inputs (principally table column values and literal values) can be
assumed static for the duration of the query.

'stable' is almost like immutable, but is only guaranteed static for
the duration of the query.  most functions that read from but don't
write to the database will fit in this category.  Most optimizations
still apply here, but stable functions can't be used in indexes and
can't be executed and saved off at plan time, where that might be helpful
(prepared statements and pl/pgsql plans).

broadly speaking:
*) function generates same output from inputs regardless of what's
going on in the database, and has no side effects: IMMUTABLE
*) function reads (only) from tables, or is an immutable function in
most senses but influenced from the GUC (or any other out of scope
thing): STABLE
*) all other cases: VOLATILE (which is btw the default)
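
A toy sketch (plain Python, not PostgreSQL internals) of why the
distinction matters to a planner: a call whose result depends only on its
inputs can be folded and evaluated once, instead of once per row.

```python
# Count how often a function is called under two "planner" strategies.
calls = {"n": 0}

def expensive(x):
    calls["n"] += 1
    return x * 2

rows = list(range(8))

# Treated as VOLATILE: re-evaluated for every row scanned.
calls["n"] = 0
volatile_result = [r for r in rows if r > expensive(3)]
volatile_calls = calls["n"]

# Treated as IMMUTABLE: folded to a constant before the "scan".
calls["n"] = 0
folded = expensive(3)
immutable_result = [r for r in rows if r > folded]
immutable_calls = calls["n"]

assert volatile_result == immutable_result == [7]
print(volatile_calls, immutable_calls)  # 8 1
```

Same answer either way; the immutable assumption just licenses doing far
less work to get it.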

merlin



Re: [PERFORM] Stored procedure declared as VOLATILE => no good optimization is done

2010-10-15 Thread Tatsuo Ishii
> broadly speaking:
> *) function generates same output from inputs regardless of what's
> going on in the database, and has no side effects: IMMUTABLE

So can I say "if a function is marked IMMUTABLE, then it should never
modify the database"? Is there any counter example?

> *) function reads (only) from tables, or is an immutable function in
> most senses but influenced from the GUC (or any other out of scope
> thing): STABLE

It seems that, if the above is correct, I can say STABLE functions should
never modify the database either.

> *) all other cases: VOLATILE (which is btw the default)
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp



[PERFORM] UUID performance as primary key

2010-10-15 Thread Navkirat Singh
Hi Guys,

I am interested in finding out the pros/cons of using UUID as a primary key 
field. My requirement states that UUID would be perfect in my case as I will be 
having many small databases which will link up to a global database using the 
UUID. Hence, the need for a unique key across all databases. It would be 
extremely helpful if someone could help me figure this out, as it is critical 
for my project.

Thanks in advance,

Nav 


Re: [PERFORM] Stored procedure declared as VOLATILE => no good optimization is done

2010-10-15 Thread Tom Lane
Tatsuo Ishii  writes:
> So can I say "if a function is marked IMMUTABLE, then it should never
> modify database"? Is there any counter example?
> It seems if above is correct, I can say STABLE functions should never
> modify databases as well.

Both of those things are explicitly stated here:
http://developer.postgresql.org/pgdocs/postgres/xfunc-volatility.html

regards, tom lane



Re: [PERFORM] UUID performance as primary key

2010-10-15 Thread Craig Ringer

On 16/10/2010 9:58 AM, Navkirat Singh wrote:

Hi Guys,

I am interested in finding out the pros/cons of using UUID as a primary key 
field. My requirement states that UUID would be perfect in my case as I will be 
having many small databases which will link up to a global database using the 
UUID. Hence, the need for a unique key across all databases. It would be 
extremely helpful if someone could help me figure this out, as it is critical 
for my project.


Pro: No need for (serverid,serverseq) pair primary keys or hacks with 
modulus based key generation. Doesn't set any pre-determined limit on 
how many servers/databases may be in a cluster.


Con: Slower than modulo key generation approach, uses more storage. 
Foreign key relationships may be slower too.


Overall, UUIDs seem to be a favoured approach. The other way people seem 
to do this is by assigning a unique instance id to each server/database 
out of a maximum "n" instances decided at setup time. Every key 
generation sequence increments by "n" whenever it generates a key, with 
an offset of the server/database id. That way, if n=100, server 1 will 
generate primary keys 001, 101, 201, 301, ..., server 2 will generate 
primary keys 002, 102, 202, 302, ... and so on.
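
As a sketch, the increment-by-n scheme described above might look like this
(the server ids and n=100 are illustrative values, not anything
PostgreSQL-specific):

```python
# Each of n servers draws keys from its own residue class mod n:
# server i generates i, i+n, i+2n, ...  so key sets never collide.
def make_key_gen(server_id, n=100):
    """Yield cluster-unique primary keys for one of n servers."""
    assert 1 <= server_id <= n
    key = server_id
    while True:
        yield key
        key += n

gen1, gen2 = make_key_gen(1), make_key_gen(2)
print([next(gen1) for _ in range(4)])  # [1, 101, 201, 301]
print([next(gen2) for _ in range(4)])  # [2, 102, 202, 302]
```

In PostgreSQL itself this maps naturally onto a per-instance sequence, e.g.
CREATE SEQUENCE ... START WITH i INCREMENT BY n.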


That works great until you need more than 100 instances, at which point 
you're really, REALLY boned. In really busy systems it also limits the 
total amount of primary key space - but with BIGINT primary keys, that's 
unlikely to be something you need to worry about.


The composite primary key (serverid,sequenceid) approach avoids the need 
for a pre-defined maximum number of servers, but can be slow to index 
and can require more storage, especially because of tuple headers.


I have no firsthand experience with any of these approaches so I can't 
offer you a considered opinion. I know that the MS-SQL crowd at least 
strongly prefer UUIDs, but they have very strong in-database UUID 
support. MySQL folks seem to mostly favour the modulo primary key 
generation approach. I don't see much discussion of the issue here - I 
get the impression Pg doesn't see heavy use in sharded environments.


--
Craig Ringer

Tech-related writing at http://soapyfrogs.blogspot.com/



Re: [PERFORM] Slow count(*) again...

2010-10-15 Thread Greg Smith

Jesper Krogh wrote:

To be honest, whether it is EDB, Redpill, Command Prompt, 2nd Quadrant or
whoever that ends up doing the job is, seen from this perspective, not
important; it just needs to end up in the hands of someone "capable" of
doing it. ... although Heikki has done some work on this task already.


Now you're closing in on why this is a touchy subject.  Heikki has 
already done work here funded by EDB.  As such, the idea of anyone else 
being put in charge of fund raising and allocation for this particular 
feature would be a political mess.  While it would be nice if there was 
a completely fair sponsorship model for developing community PostgreSQL 
features, overseen by a benevolent, free, and completely unaffiliated 
overlord, we're not quite there yet.  In cases like these, where there's 
evidence a company with a track record of delivering features is already 
involved, you're probably better off contacting someone from there 
directly--rather than trying to fit that into the public bounty model 
some PostgreSQL work is getting done via lately.  The visibility map is 
a particularly troublesome one, because the list of "capable" people who 
could work on that, but who aren't already working at a company having 
some relations with EDB, is rather slim.


I know that's kind of frustrating to hear, for people who would like to 
get a feature done but can't finance the whole thing themselves.  But 
look on the bright side--the base price is free, and when you give most 
PostgreSQL companies money to work on something it's at least possible 
to get what you want done.  You'd have to pay a whole lot more than the 
$15K number you threw out there before any of the commercial database 
vendors would pay any attention to your particular feature request.


--
Greg Smith, 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD
PostgreSQL Training, Services and Support  www.2ndQuadrant.us
Author, "PostgreSQL 9.0 High Performance"Pre-ordering at:
https://www.packtpub.com/postgresql-9-0-high-performance/book




Re: [PERFORM] UUID performance as primary key

2010-10-15 Thread Andy
Wouldn't a UUID PK cause a significant drop in insert performance? Every 
insert is now out of order, which leads to constant re-arranging of the B+ 
tree, and the amount of random I/O that generates would just kill 
performance.
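
That locality argument can be sketched with a toy model (the page size and
the "one sorted array = one B-tree level" simplification are invented for
illustration, not real index internals):

```python
# Map each key to the leaf "page" it would end up on, then count how
# often consecutive inserts land on different pages.
import uuid

PAGE_SIZE = 100  # keys per leaf page (made-up number)

def page_switches(keys):
    """Count transitions between leaf pages as keys arrive in insert order."""
    ordered = sorted(keys)
    page_of = {k: i // PAGE_SIZE for i, k in enumerate(ordered)}
    switches, last = 0, None
    for k in keys:
        page = page_of[k]
        if page != last:
            switches += 1
            last = page
    return switches

seq_keys = list(range(10_000))
uuid_keys = [uuid.uuid4().hex for _ in range(10_000)]
print(page_switches(seq_keys))   # 100: fills one page at a time, in order
print(page_switches(uuid_keys))  # near 10_000: almost every insert hits
                                 # a different page
```

Sequential keys keep hammering the same rightmost page; random UUIDs
scatter the writes, which is the random-I/O cost being described.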

--- On Fri, 10/15/10, Craig Ringer  wrote:

From: Craig Ringer 
Subject: Re: [PERFORM] UUID performance as primary key
To: "Navkirat Singh" 
Cc: pgsql-performance@postgresql.org
Date: Friday, October 15, 2010, 10:59 PM

On 16/10/2010 9:58 AM, Navkirat Singh wrote:
> Hi Guys,
> 
> I am interested in finding out the pros/cons of using UUID as a primary key 
> field. My requirement states that UUID would be perfect in my case as I will 
> be having many small databases which will link up to a global database using 
> the UUID. Hence, the need for a unique key across all databases. It would be 
> extremely helpful if someone could help me figure this out, as it is critical 
> for my project.

Pro: No need for (serverid,serverseq) pair primary keys or hacks with modulus 
based key generation. Doesn't set any pre-determined limit on how many 
servers/databases may be in a cluster.

Con: Slower than modulo key generation approach, uses more storage. Foreign key 
relationships may be slower too.

Overall, UUIDs seem to be a favoured approach. The other way people seem to do 
this is by assigning a unique instance id to each server/database out of a 
maximum "n" instances decided at setup time. Every key generation sequence 
increments by "n" whenever it generates a key, with an offset of the 
server/database id. That way, if n=100, server 1 will generate primary keys 
001, 101, 201, 301, ..., server 2 will generate primary keys 002, 102, 202, 
302, ... and so on.

That works great until you need more than 100 instances, at which point you're 
really, REALLY boned. In really busy systems it also limits the total amount of 
primary key space - but with BIGINT primary keys, that's unlikely to be 
something you need to worry about.

The composite primary key (serverid,sequenceid) approach avoids the need for a 
pre-defined maximum number of servers, but can be slow to index and can require 
more storage, especially because of tuple headers.

I have no firsthand experience with any of these approaches so I can't offer 
you a considered opinion. I know that the MS-SQL crowd at least strongly prefer 
UUIDs, but they have very strong in-database UUID support. MySQL folks seem to 
mostly favour the modulo primary key generation approach. I don't see much 
discussion of the issue here - I get the impression Pg doesn't see heavy use in 
sharded environments.

-- Craig Ringer

Tech-related writing at http://soapyfrogs.blogspot.com/


Re: [PERFORM] No hash join across partitioned tables?

2010-10-15 Thread Alvaro Herrera
Excerpts from Robert Haas's message of Wed Jun 09 15:47:55 -0400 2010:

> In going back through emails I had marked as possibly needing another
> look before 9.0 is released, I came across this issue again.  As I
> understand it, analyze (or analyse) now collects statistics for both
> the parent individually, and for the parent and its children together.
>  However, as I further understand it, autovacuum won't actually fire
> off an analyze unless there's enough activity on the parent table
> considered individually to warrant it.  So if you have an empty parent
> and a bunch of children with data in it, your stats will still stink,
> unless you analyze by hand.

So, is there something we could now do about this, while there's still
time before 9.1?

I haven't followed this issue very closely, but it seems to me that what
we want is for an ANALYZE of a child table to be mutated into an analyze
of its parent table, if the conditions are right; and for an ANALYZE of a
parent to remove the child tables from being analyzed on the same run.

If we analyze the parent, do we also update the children stats, or is it
just that we keep two stats for the parent, one with children and one
without, both being updated when the parent is analyzed?

If the latter's the case, maybe we should modify ANALYZE a bit more, so
that we can analyze the whole hierarchy in one go, and store the lot of
stats with a single pass (each child alone, the parent alone, the parent
plus children).  However, it's not really clear how this would work with
multiple inheritance levels.

-- 
Álvaro Herrera 
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [PERFORM] No hash join across partitioned tables?

2010-10-15 Thread Tom Lane
Alvaro Herrera  writes:
> If we analyze the parent, do we also update the children stats, or is it
> just that we keep two stats for the parent, one with children and one
> without, both being updated when the parent is analyzed?

The latter.

The trick here is that we need to fire an analyze on the parent even
though only its children may have had any updates.

> If the latter's the case, maybe we should modify ANALYZE a bit more, so
> that we can analyze the whole hierarchy in one go, and store the lot of
> stats with a single pass (each child alone, the parent alone, the parent
> plus children).  However it's not real clear how would this work with
> multiple inheritance levels.

It's also not clear how it works without blowing out memory...

regards, tom lane



Re: [PERFORM] No hash join across partitioned tables?

2010-10-15 Thread Samuel Gendler
On Fri, Oct 15, 2010 at 10:22 PM, Tom Lane  wrote:

> Alvaro Herrera  writes:
> > If we analyze the parent, do we also update the children stats, or is it
> > just that we keep two stats for the parent, one with children and one
> > without, both being updated when the parent is analyzed?
>
> The latter.
>
> The trick here is that we need to fire an analyze on the parent even
> though only its children may have had any updates.
>
> > If the latter's the case, maybe we should modify ANALYZE a bit more, so
> > that we can analyze the whole hierarchy in one go, and store the lot of
> > stats with a single pass (each child alone, the parent alone, the parent
> > plus children).  However it's not real clear how would this work with
> > multiple inheritance levels.
>

An issue with automatically analyzing the entire hierarchy is 'abstract'
table definitions.  I've got a set of tables for storing the same data at
different granularities of aggregation.  Within each granularity, I've got
partitions, but because the set of columns is identical for each
granularity, I've got an abstract table definition that is inherited by
everything.  I don't need or want statistics kept on that table because I
never query across the abstract table, only the parent table of each
aggregation granularity.

create table abstract_fact_table (
time timestamp,
measure1 bigint,
measure2 bigint,
measure3 bigint,
fk1 bigint,
fk2 bigint
);

create table minute_scale_fact_table (
) inherits (abstract_fact_table);

-- Then there are several partitions for minute scale data

create table hour_scale_fact_table (
) inherits (abstract_fact_table);

-- then several partitions for hour scale data

etc.  I do run queries on the minute_scale_fact_table and
hour_scale_fact_table but never do so on abstract_fact_table.  I could
certainly modify my schema such that the abstract table goes away entirely
easily enough, but I find this easier for new developers to come in and
comprehend, since the similarity between the table definitions is explicit.

I'm glad this topic came up, as I was unaware that I need to run analyze on
the parent partitions separately - and no data is ever inserted directly
into the top level of each granularity hierarchy, so it will never fire by
itself.

If I am using ORM and I've got functionality in a common baseclass in the
source code, I'll often implement its mapping in the database via a parent
table that the table for any subclass mapping can inherit from.  Again, I
have no interest in maintaining statistics on the parent table, since I
never query against it directly.