Re: [PERFORM] Tsearch2 Initial Search Speed

2008-06-18 Thread Howard Cole




Actually, the index returns page numbers in the table on disc which 
may contain one or more rows that are relevant. Postgres has to fetch 
the whole row to find out the email_id and any other information, 
including whether the row is visible in your current transaction 
(concurrency control complicates it all). Just having a page number 
isn't much use to you!


Matthew

Out of interest, if I could create a multicolumn index with both the 
primary key and the fts key (I don't think I can create a multi-column 
index using GIST with both the email_id and the fts field), would this 
reduce access to the table due to the primary key being part of the index?


More importantly, are there other ways that I can improve performance on 
this? I am guessing that a lot of the problem is that the email table is 
so big. If I cut out some of the text fields that are not needed in the 
search and put them in another table, presumably the table will shrink 
enough to reduce the number of disk hits and speed the query up.


So I could split the table into two parts, creating email_part1 first since 
email_part2 references it:

create table email_part1 (
email_id serial8 primary key,
cc text,
bcc text,
...
);

create table email_part2 (
email_id int8 references email_part1 (email_id),
fts ...,
email_directory_id ...
);

and the query would be:

select email_id from email_part2
where to_tsquery('default', 'howard') @@ fts;
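
For the split to pay off, email_part2 would still need its own full-text 
index. A minimal sketch, reusing the hypothetical names above (tsearch2-era 
GiST, relying on the module's default opclass for the fts column):

-- hypothetical index name; gist is the index method discussed in this thread
create index email_part2_fts_idx on email_part2 using gist (fts);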




Re: [PERFORM] Tsearch2 Initial Search Speed

2008-06-18 Thread Matthew Wakeling

On Wed, 18 Jun 2008, Howard Cole wrote:
Out of interest, if I could create a multicolumn index with both the primary 
key and the fts key (I don't think I can create a multi-column index using 
GIST with both the email_id and the fts field), would this reduce access to 
the table due to the primary key being part of the index?


Unfortunately not, since the indexes do not contain information on whether 
a particular row is visible in your current transaction. Like I said, 
concurrency control really complicates things!


More importantly, are there other ways that I can improve performance on 
this? I am guessing that a lot of the problem is that the email table is 
so big. If I cut out some of the text fields that are not needed in the 
search and put them in another table, presumably the table will shrink 
enough to reduce the number of disk hits and speed the query up.


Good idea. Note that Postgres is already doing this to some extent with 
TOAST - read 
http://www.postgresql.org/docs/8.3/interactive/storage-toast.html - 
unfortunately, there doesn't seem to be an option to always move 
particular columns out to TOAST. Your idea will produce an even smaller 
table. However, are email_ids all that you want from the query?
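
One per-column knob that does exist is the storage strategy: ALTER TABLE ... 
SET STORAGE changes how TOAST treats a column, although values small enough 
to fit in the row still stay in-line, so it does not force a column out of 
the main table. A sketch, assuming the table is simply named email:

-- EXTERNAL stores oversized values out of line and uncompressed;
-- small values still remain in the main row.
ALTER TABLE email ALTER COLUMN cc SET STORAGE EXTERNAL;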


Matthew

--
Okay, I'm weird! But I'm saving up to be eccentric.



[PERFORM] Partial backup of linked tables

2008-06-18 Thread Cisko

Hi, I'm new to this ML; I'll try to explain my issue.

I have two tables, defined as follows (PostgreSQL 8.1):

CREATE TABLE table1
(
   _id  serial,
   num1 int4 not null,
   num2 int4 not null,

   primary key(_id)
);

CREATE INDEX table1IDX1 ON table1(num1);

CREATE TABLE table2
 (
   _id serial,
   _table1_id  int  not null,
   num3 int4 not null,
   num4 int4 not null,

   primary key(_id),

   foreign key(_table1_id) references table1(_id) on delete CASCADE
 );

CREATE INDEX table2IDX1 ON table2(_table1_id);


I need to select only a subset of table1/table2 records and back them up 
(to disk).


I proceed as follows:

1. Create equivalent tables with a _tmp suffix, with the same indexes and cascade behaviour:

CREATE TABLE table1_tmp
(
   _id  serial,
   num1 int4 not null,
   num2 int4 not null,

   primary key(_id)
);

CREATE INDEX table1_tmpIDX1 ON table1_tmp(num1);

CREATE TABLE table2_tmp
 (
   _id serial,
   _table1_id  int  not null,
   num3 int4 not null,
   num4 int4 not null,

   primary key(_id),

   foreign key(_table1_id) references table1_tmp(_id) on delete CASCADE
 );

CREATE INDEX table2_tmpIDX1 ON table2_tmp(_table1_id);


2. Select and insert into table1_tmp a subset of table1 based on a  
query (num1 < 10):


INSERT INTO table1_tmp SELECT * from table1 WHERE num1 < 10;


3. Populate the other tables via the foreign key:

INSERT INTO table2_tmp SELECT table2.* from table2, table1_tmp WHERE  
table2._table1_id = table1_tmp._id;



4. Copy each table into a file (I don't have 8.2, so I can't execute 
pg_dump with several -t options):


COPY table1_tmp TO '/tmp/table1_tmp.data';
COPY table2_tmp TO '/tmp/table2_tmp.data';
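
(For completeness, restoring on the other side would be the mirror 
COPY ... FROM, assuming the same file names as above:)

COPY table1_tmp FROM '/tmp/table1_tmp.data';
COPY table2_tmp FROM '/tmp/table2_tmp.data';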


This is only an example; I have more complex tables, but the schema is 
equivalent to the one above.


My question is: are there any optimizations or tips I can apply to achieve 
better performance? When the query on table1 returns many rows (10^6 or 
more), the process starts to hog time and CPU.


According to EXPLAIN, all the join queries use indexes.
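
One optimization worth trying (a sketch, not something attempted above): 
create the _tmp tables without their indexes, do the bulk INSERTs, and build 
the indexes afterwards, so each inserted row avoids incremental index 
maintenance. For table1_tmp:

-- load first, then index: one bulk index build over 10^6 rows is cheaper
-- than 10^6 incremental index updates
INSERT INTO table1_tmp SELECT * FROM table1 WHERE num1 < 10;
CREATE INDEX table1_tmpIDX1 ON table1_tmp(num1);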

Thanks in advance,
Cisko



[PERFORM] WAL questions

2008-06-18 Thread Antonio Perez
I have a question; this is the scenario I am working with:

1. A PostgreSQL instance is created.
2. A directory $PGDATA/walback is created, where the old WAL files will be stored.
3. A variable $PGDATA2 is exported, giving the location of the backup of the contents of $PGDATA.
4. WAL archiving is activated.
5. A database and a table are created.
6. In psql, pg_start_backup('etiqueta') is executed.
7. Everything in $PGDATA is copied to another directory ($PGDATA2).
8. In psql, pg_stop_backup() is executed.
9. The value of a record in the table that was created is updated.
10. The instance is shut down.
11. The entire contents of $PGDATA/pg_xlog and $PGDATA/walback are copied into $PGDATA2/pg_xlog and $PGDATA2/walback.
12. The instance is started with pg_ctl -D $PGDATA2 --log $PGDATA2/log.log start.
13. psql is run.
14. The table is queried and no records exist.

If anyone knows why this happens, please let me know. Thanks.
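
(A minimal sketch of steps 6-8 as run in psql, using the same label as 
above; the filesystem copy happens outside psql:)

SELECT pg_start_backup('etiqueta');
-- copy $PGDATA to $PGDATA2 at the filesystem level here
SELECT pg_stop_backup();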


[PERFORM] An "obvious" index not being used

2008-06-18 Thread Daniele Varrazzo

Hello,

I am experiencing a query for which a useful index is not being used by 
PostgreSQL. The query is of the form:


select count(*) from foo
where foo.account_id in (
select id from accounts where system = 'abc');

and the size of the tables it works on is:

  - 270 records in "accounts", 22 of which match the condition 'abc';
  - 5.3M records in "foo", 92K of which match the query condition.

There is an index on the field "foo.account_id" but it is not used. The 
resulting query plan is:


 Aggregate  (cost=300940.70..300940.71 rows=1 width=0) (actual time=13412.088..13412.089 rows=1 loops=1)
   ->  Hash IN Join  (cost=11.97..299858.32 rows=432953 width=0) (actual time=0.678..13307.074 rows=92790 loops=1)
         Hash Cond: (foo.account_id = accounts.id)
         ->  Seq Scan on foo  (cost=0.00..275591.14 rows=5313514 width=4) (actual time=0.014..7163.538 rows=5313514 loops=1)
         ->  Hash  (cost=11.70..11.70 rows=22 width=4) (actual time=0.199..0.199 rows=22 loops=1)
               ->  Bitmap Heap Scan on accounts  (cost=1.42..11.70 rows=22 width=4) (actual time=0.092..0.160 rows=22 loops=1)
                     Recheck Cond: (("system")::text = 'abc'::text)
                     ->  Bitmap Index Scan on iaccounts_x1  (cost=0.00..1.42 rows=22 width=0) (actual time=0.077..0.077 rows=22 loops=1)
                           Index Cond: (("system")::text = 'abc'::text)
 Total runtime: 13412.226 ms


There is a seqscan on the large table. If seqscans are disabled, the plan 
becomes much more acceptable:


 Aggregate  (cost=2471979.99..2471980.00 rows=1 width=0) (actual time=630.977..630.978 rows=1 loops=1)
   ->  Nested Loop  (cost=1258.12..2470897.61 rows=432953 width=0) (actual time=0.164..526.174 rows=92790 loops=1)
         ->  HashAggregate  (cost=12.75..12.97 rows=22 width=4) (actual time=0.131..0.169 rows=22 loops=1)
               ->  Bitmap Heap Scan on accounts  (cost=2.42..12.70 rows=22 width=4) (actual time=0.047..0.091 rows=22 loops=1)
                     Recheck Cond: (("system")::text = 'abc'::text)
                     ->  Bitmap Index Scan on iaccounts_x1  (cost=0.00..2.42 rows=22 width=0) (actual time=0.036..0.036 rows=22 loops=1)
                           Index Cond: (("system")::text = 'abc'::text)
         ->  Bitmap Heap Scan on foo  (cost=1245.37..111275.14 rows=83024 width=4) (actual time=3.086..14.391 rows=4218 loops=22)
               Recheck Cond: (foo.account_id = accounts.id)
               ->  Bitmap Index Scan on ifoo_x1  (cost=0.00..1224.61 rows=83024 width=0) (actual time=2.962..2.962 rows=4218 loops=22)
                     Index Cond: (foo.account_id = accounts.id)
 Total runtime: 631.121 ms

where the index "ifoo_x1" is used.


A similar query plan can also be obtained by performing the internal query 
first and hardcoding the result in a new query:


explain analyze select count(*) from foo
where account_id in
(70,33,190,21,191,223,203,202,148,246,85,281,280,319,234,67,245,310,318,279,320,9);


I have tried to:

  - rewrite the query with a JOIN instead of an IN (no change in the plan; a sketch of this rewrite follows below),
  - rewrite the query using EXISTS (it gets worse),
  - raise the statistics target for the foo.account_id field to 100 and to 1000,
  - decrease random_page_cost down to 1,
  - vacuum-analyze the tables after each change,

none of which has changed the situation.
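
For reference, a sketch of the JOIN rewrite mentioned in the first item 
above (equivalent to the IN form as long as accounts.id is unique):

select count(*) from foo
join accounts on foo.account_id = accounts.id
where accounts.system = 'abc';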

The system is a 64-bit Ubuntu Hardy running PG 8.3. The issue has been 
confirmed on Mac OS X 10.5/PG 8.3. Although I ran fewer tests on the PG 8.2 
we recently switched from, I think the issue is present in that version too.


This is the first time I have seen the query planner miss such an obvious 
plan: is there any other setting to tweak to make it behave? (A sensible 
tweak, that is: setting random_page_cost to 1 was just an attempt to get the 
index used, nothing to really put in production.)


If you want to reproduce the issue, an anonymized dataset is available at 
http://piro.develer.com/test.sql.bz2 . The file size is 46MB (1.5GB 
uncompressed). Chris Mair, who tested it on Mac OS, also noticed that PG 
behaved correctly with the freshly imported data: as soon as he VACUUMed the 
database he started experiencing the described issue.


Thank you very much.

--
Daniele Varrazzo - Develer S.r.l.
http://www.develer.com



Re: [PERFORM] An "obvious" index not being used

2008-06-18 Thread Tom Lane
Daniele Varrazzo <[EMAIL PROTECTED]> writes:
> There is an index on the field "foo.account_id" but it is not used. The
> resulting query plan is:

>  Aggregate  (cost=300940.70..300940.71 rows=1 width=0) (actual time=13412.088..13412.089 rows=1 loops=1)
>    ->  Hash IN Join  (cost=11.97..299858.32 rows=432953 width=0) (actual time=0.678..13307.074 rows=92790 loops=1)
>          Hash Cond: (foo.account_id = accounts.id)
>          ->  Seq Scan on foo  (cost=0.00..275591.14 rows=5313514 width=4) (actual time=0.014..7163.538 rows=5313514 loops=1)

Well, if the estimate of 432953 rows selected were correct, it'd be
right not to use the index.  Fetching one row in ten is not a chore
for an indexscan.  (I'm not sure it'd prefer an indexscan even with an
accurate 92K-row estimate, but at least you'd be in the realm where
tweaking random_page_cost would make a difference.)

I'm not sure why that estimate is so bad, given that you said you
increased the stats target on the table.  Is there anything particularly
skewed about the distribution of the account IDs?

regards, tom lane



Re: [PERFORM] An "obvious" index not being used

2008-06-18 Thread Daniele Varrazzo

Tom Lane wrote:

Daniele Varrazzo <[EMAIL PROTECTED]> writes:
There is an index on the field "foo.account_id" but it is not used. The 
resulting query plan is:



 Aggregate  (cost=300940.70..300940.71 rows=1 width=0) (actual time=13412.088..13412.089 rows=1 loops=1)
   ->  Hash IN Join  (cost=11.97..299858.32 rows=432953 width=0) (actual time=0.678..13307.074 rows=92790 loops=1)
         Hash Cond: (foo.account_id = accounts.id)
         ->  Seq Scan on foo  (cost=0.00..275591.14 rows=5313514 width=4) (actual time=0.014..7163.538 rows=5313514 loops=1)


Well, if the estimate of 432953 rows selected were correct, it'd be
right not to use the index.  Fetching one row in ten is not a chore
for an indexscan.  (I'm not sure it'd prefer an indexscan even with an
accurate 92K-row estimate, but at least you'd be in the realm where
tweaking random_page_cost would make a difference.)


Let me guess: because the accounts table has an estimated (and correct) count 
of 22 records fetched out of 270, i.e. ~8%, it assumes that it will also need 
to fetch 8% of the 5.3M records in foo (which... yes, matches the estimate of 
433K). Well, this assumption is terribly wrong for this data set :(



I'm not sure why that estimate is so bad, given that you said you
increased the stats target on the table.  Is there anything particularly
skewed about the distribution of the account IDs?


Probably there is, in the sense that the relatively many accounts of 'abc' 
type are referenced by relatively few records. In the plan for the hardcoded 
query the estimate is:


   ->  Bitmap Index Scan on ifoo_x1  (cost=0.00..4115.67 rows=178308 width=0) (actual time=89.766..89.766 rows=92790 loops=1)

which is actually more accurate.

I suspect the foo.account_id statistics are not used at all in this query: 
the query planner can only estimate the number of accounts to look for, not 
how they are distributed across the referencing table. It seems the only way 
to get the proper plan would be to add a load of fake accounts! I'd rather 
have the query executed in two steps (sketched below), so that the stats are 
used correctly: this is the first time that has happened to me.
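
A sketch of that two-step execution (run the inner query first, then 
interpolate the returned ids into the second query, as in the hardcoded 
example earlier):

select id from accounts where system = 'abc';
-- then, with the returned ids substituted in:
select count(*) from foo where account_id in (70, 33, 190, ...);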


--
Daniele Varrazzo - Develer S.r.l.
http://www.develer.com
