Re: [PERFORM] Tsearch2 Initial Search Speed
> Actually, the index returns page numbers in the table on disc which may
> contain one or more rows that are relevant. Postgres has to fetch the
> whole row to find out the email_id and any other information, including
> whether the row is visible in your current transaction (concurrency
> control complicates it all). Just having a page number isn't much use
> to you! Matthew

Out of interest, if I could create a multicolumn index with both the primary key and the fts key (I don't think I can create a multi-column index using GIST with both the email_id and the fts field), would this reduce access to the table, due to the primary key being part of the index?

More importantly, are there other ways that I can improve performance on this? I am guessing that a lot of the problem is that the email table is so big. If I cut out some of the text fields that are not needed in the search and put them in another table, presumably the size of the table will be reduced to a point where it will reduce the number of disk hits and speed the query up.

So I could split the table into two parts:

    create table email_part1 (
        email_id serial8 primary key,
        cc text,
        bcc text,
        ...
    );

    create table email_part2 (
        email_id int8 references email_part1 (email_id),
        fts ...,
        email_directory_id ...
    );

and the query would be:

    select email_id from email_part2 where to_tsquery('default', 'howard') @@ fts;
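(Thinking about it a little more: if a mixed index is possible at all, I imagine it would need the contrib/btree_gist module, which supplies GiST operator classes for plain scalar types such as int8, so that both columns could live in one GiST index. A sketch, entirely unverified on this schema and with a made-up index name:

    -- requires contrib/btree_gist to be installed in the database
    CREATE INDEX email_id_fts_idx ON email USING gist (email_id, fts);

But even if that is syntactically acceptable, I don't know whether it buys anything.)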
Re: [PERFORM] Tsearch2 Initial Search Speed
On Wed, 18 Jun 2008, Howard Cole wrote:
> Out of interest, if I could create a multicolumn index with both the
> primary key and the fts key (I don't think I can create a multi-column
> index using GIST with both the email_id and the fts field), would this
> reduce access to the table, due to the primary key being part of the
> index?

Unfortunately not, since the indexes do not contain information on whether a particular row is visible in your current transaction. Like I said, concurrency control really complicates things!

> More importantly, are there other ways that I can improve performance
> on this? I am guessing that a lot of the problem is that the email
> table is so big. If I cut out some of the text fields that are not
> needed in the search and put them in another table, presumably the
> size of the table will be reduced to a point where it will reduce the
> number of disk hits and speed the query up.

Good idea. Note that Postgres is already doing this to some extent with TOAST - read http://www.postgresql.org/docs/8.3/interactive/storage-toast.html - unfortunately, there doesn't seem to be an option to always move particular columns out to TOAST. Your idea will produce an even smaller table.

However, are email_ids all that you want from the query?

Matthew

-- 
Okay, I'm weird! But I'm saving up to be eccentric.
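(The closest knob I'm aware of is the per-column storage strategy, but as far as I understand it only changes what the toaster does once a row crosses the toast threshold; it does not unconditionally move a column out of line, which is why splitting the table should still win. A sketch, using one of the columns from your example:

    -- hint the toaster to store this column out-of-line, uncompressed;
    -- small values still stay in the main heap regardless
    ALTER TABLE email ALTER COLUMN cc SET STORAGE EXTERNAL;
)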
[PERFORM] Partial backup of linked tables
Hi, I'm new to this ML; I'll try to explain my issue. I have two tables defined as follows (PostgreSQL 8.1):

    CREATE TABLE table1 (
        _id serial,
        num1 int4 not null,
        num2 int4 not null,
        primary key(_id)
    );
    CREATE INDEX table1IDX1 ON table1(num1);

    CREATE TABLE table2 (
        _id serial,
        _table1_id int not null,
        num3 int4 not null,
        num4 int4 not null,
        primary key(_id),
        foreign key(_table1_id) references table1(_id) on delete CASCADE
    );
    CREATE INDEX table2IDX1 ON table2(_table1_id);

I need to select only a subset of the table1/table2 records and back them up (to disk). I proceed as follows:

1. Create equivalent tables with a _tmp name, with indexes and cascade:

    CREATE TABLE table1_tmp (
        _id serial,
        num1 int4 not null,
        num2 int4 not null,
        primary key(_id)
    );
    CREATE INDEX table1_tmpIDX1 ON table1_tmp(num1);

    CREATE TABLE table2_tmp (
        _id serial,
        _table1_id int not null,
        num3 int4 not null,
        num4 int4 not null,
        primary key(_id),
        foreign key(_table1_id) references table1_tmp(_id) on delete CASCADE
    );
    CREATE INDEX table2_tmpIDX1 ON table2_tmp(_table1_id);

2. Select and insert into table1_tmp a subset of table1 based on a query (num1 < 10):

    INSERT INTO table1_tmp SELECT * FROM table1 WHERE num1 < 10;

3. Populate the other table through the foreign key:

    INSERT INTO table2_tmp
        SELECT table2.* FROM table2, table1_tmp
        WHERE table2._table1_id = table1_tmp._id;

4. Copy each table into a file (I don't have 8.2, so I can't execute pg_dump with several -t options):

    COPY table1_tmp TO '/tmp/table1_tmp.data';
    COPY table2_tmp TO '/tmp/table2_tmp.data';

This is only an example; my real tables are more complex, but the schema is equivalent. My question is: are there any optimizations/tips I can apply to achieve better performance? When the query into table1 returns many rows (10^6 or more), this starts to hog time and CPU. An EXPLAIN shows that all the join queries are performed using indexes.

Thanks in advance,
Cisko
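One direction I am considering, if it makes sense: since the _tmp tables exist only to be dumped, perhaps I can skip their indexes and the foreign key (which are pure overhead during the bulk INSERT) and use temp tables instead. A sketch of what I mean:

    BEGIN;
    -- unindexed temp tables: no index maintenance or FK checks during
    -- the bulk insert, and they disappear at the end of the session
    CREATE TEMP TABLE table1_tmp AS
        SELECT * FROM table1 WHERE num1 < 10;
    CREATE TEMP TABLE table2_tmp AS
        SELECT t2.* FROM table2 t2
          JOIN table1_tmp t1 ON t2._table1_id = t1._id;
    COPY table1_tmp TO '/tmp/table1_tmp.data';
    COPY table2_tmp TO '/tmp/table2_tmp.data';
    COMMIT;

Would that be a reasonable approach?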
[PERFORM] WAL QUESTIONS
I have a question; this is my scenario:

- A PostgreSQL instance is created.
- A directory $PGDATA/walback is created, where the old WAL files will be stored.
- A variable $PGDATA2 is exported, pointing to the location of the backup of the contents of $PGDATA.
- WAL archiving is enabled.
- A database and a table are created.
- In psql, pg_start_backup('etiqueta') is executed.
- Everything in $PGDATA is copied to another directory ($PGDATA2).
- In psql, pg_stop_backup() is executed.
- The value of a record in the table that was created is updated.
- The instance is shut down.
- The whole contents of $PGDATA/pg_xlog and $PGDATA/walback are copied into $PGDATA2/pg_xlog and $PGDATA2/walback.
- The instance is started with pg_ctl -D $PGDATA2 --log $PGDATA2/log.log start.
- psql is run.
- The table is queried, and there are no records.

If anyone knows why this happens, please let me know. Thanks.
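For reference, the documented procedure pairs pg_start_backup()/pg_stop_backup() with WAL archiving and a recovery.conf in the restored data directory, rather than copying pg_xlog by hand; a minimal sketch (the paths are placeholders):

    # postgresql.conf: archive each filled WAL segment into walback
    archive_command = 'cp %p /path/to/walback/%f'

    # $PGDATA2/recovery.conf: make the copy replay the archived WAL
    restore_command = 'cp /path/to/walback/%f %p'

Without a restore_command, the copied cluster may not replay the WAL generated around the backup, which could explain why the changes do not show up.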
[PERFORM] An "obvious" index not being used
Hello,

I am experiencing a query for which a useful index is not being used by PostgreSQL. The query is of the form:

    select count(*) from foo
     where foo.account_id in (
           select id from accounts where system = 'abc');

and the size of the tables it works on is:

- 270 records in "accounts", 22 of which match the condition system = 'abc';
- 5.3M records in "foo", 92K of which match the query condition.

There is an index on the field "foo.account_id" but it is not used. The resulting query plan is:

 Aggregate  (cost=300940.70..300940.71 rows=1 width=0) (actual time=13412.088..13412.089 rows=1 loops=1)
   ->  Hash IN Join  (cost=11.97..299858.32 rows=432953 width=0) (actual time=0.678..13307.074 rows=92790 loops=1)
         Hash Cond: (foo.account_id = accounts.id)
         ->  Seq Scan on foo  (cost=0.00..275591.14 rows=5313514 width=4) (actual time=0.014..7163.538 rows=5313514 loops=1)
         ->  Hash  (cost=11.70..11.70 rows=22 width=4) (actual time=0.199..0.199 rows=22 loops=1)
               ->  Bitmap Heap Scan on accounts  (cost=1.42..11.70 rows=22 width=4) (actual time=0.092..0.160 rows=22 loops=1)
                     Recheck Cond: (("system")::text = 'abc'::text)
                     ->  Bitmap Index Scan on iaccounts_x1  (cost=0.00..1.42 rows=22 width=0) (actual time=0.077..0.077 rows=22 loops=1)
                           Index Cond: (("system")::text = 'abc'::text)
 Total runtime: 13412.226 ms

There is a seqscan on the large table. If seqscans are disabled, the plan becomes the more acceptable:

 Aggregate  (cost=2471979.99..2471980.00 rows=1 width=0) (actual time=630.977..630.978 rows=1 loops=1)
   ->  Nested Loop  (cost=1258.12..2470897.61 rows=432953 width=0) (actual time=0.164..526.174 rows=92790 loops=1)
         ->  HashAggregate  (cost=12.75..12.97 rows=22 width=4) (actual time=0.131..0.169 rows=22 loops=1)
               ->  Bitmap Heap Scan on accounts  (cost=2.42..12.70 rows=22 width=4) (actual time=0.047..0.091 rows=22 loops=1)
                     Recheck Cond: (("system")::text = 'abc'::text)
                     ->  Bitmap Index Scan on iaccounts_x1  (cost=0.00..2.42 rows=22 width=0) (actual time=0.036..0.036 rows=22 loops=1)
                           Index Cond: (("system")::text = 'abc'::text)
         ->  Bitmap Heap Scan on foo  (cost=1245.37..111275.14 rows=83024 width=4) (actual time=3.086..14.391 rows=4218 loops=22)
               Recheck Cond: (foo.account_id = accounts.id)
               ->  Bitmap Index Scan on ifoo_x1  (cost=0.00..1224.61 rows=83024 width=0) (actual time=2.962..2.962 rows=4218 loops=22)
                     Index Cond: (foo.account_id = accounts.id)
 Total runtime: 631.121 ms

where the index "ifoo_x1" is used. A similar query plan can also be obtained by performing the internal query first and hardcoding its result in a new query:

    explain analyze select count(*) from foo where account_id in
    (70,33,190,21,191,223,203,202,148,246,85,281,280,319,234,67,245,310,318,279,320,9);

I have tried to:

- rewrite the query with a JOIN instead of an IN (no change in the plan),
- rewrite the query using EXISTS (it gets worse),
- raise the statistics for the foo.account_id field to 100 and to 1000,
- decrease random_page_cost down to 1,
- vacuum-analyze the tables after each change,

none of which has changed the situation.

The system is an Ubuntu Hardy 64-bit running PG 8.3. The issue has been confirmed on Mac OS X 10.5/PG 8.3. Although I made fewer tests on the PG 8.2 we recently switched from, I think the issue is present on that version too.

This is the first time I have seen the query planner fail on such an obvious plan: is there any other setting to tweak to force it to do the right thing?
(But sensible tweaking only: the random_page_cost of 1 was just an attempt to get the index used, nothing to be really put in production.)

If you want to reproduce the issue, an anonymized dataset is available at http://piro.develer.com/test.sql.bz2 . The file size is 46MB (1.5GB uncompressed). Chris Mair, who tested it on Mac OS, also noticed that PG behaved correctly with the freshly imported data: as soon as he VACUUMed the database he started experiencing the described issue.

Thank you very much.

-- 
Daniele Varrazzo - Develer S.r.l.
http://www.develer.com
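For completeness, the statistics change mentioned above was along these lines, with a fresh ANALYZE afterwards:

    ALTER TABLE foo ALTER COLUMN account_id SET STATISTICS 1000;
    ANALYZE foo;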
Re: [PERFORM] An "obvious" index not being used
Daniele Varrazzo <[EMAIL PROTECTED]> writes:
> There is an index on the field "foo.account_id" but it is not used.
> The resulting query plan is:

>  Aggregate  (cost=300940.70..300940.71 rows=1 width=0) (actual time=13412.088..13412.089 rows=1 loops=1)
>    ->  Hash IN Join  (cost=11.97..299858.32 rows=432953 width=0) (actual time=0.678..13307.074 rows=92790 loops=1)
>          Hash Cond: (foo.account_id = accounts.id)
>          ->  Seq Scan on foo  (cost=0.00..275591.14 rows=5313514 width=4) (actual time=0.014..7163.538 rows=5313514 loops=1)

Well, if the estimate of 432953 rows selected were correct, it'd be right not to use the index. Fetching one row in ten is not a chore for an indexscan. (I'm not sure it'd prefer an indexscan even with an accurate 92K-row estimate, but at least you'd be in the realm where tweaking random_page_cost would make a difference.)

I'm not sure why that estimate is so bad, given that you said you increased the stats target on the table. Is there anything particularly skewed about the distribution of the account IDs?

			regards, tom lane
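(If you want to check, a query against pg_stats along these lines should show the skew, if any:

    SELECT n_distinct, most_common_vals, most_common_freqs
      FROM pg_stats
     WHERE tablename = 'foo' AND attname = 'account_id';
)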
Re: [PERFORM] An "obvious" index not being used
Tom Lane wrote:
> Well, if the estimate of 432953 rows selected were correct, it'd be
> right not to use the index. Fetching one row in ten is not a chore
> for an indexscan. (I'm not sure it'd prefer an indexscan even with an
> accurate 92K-row estimate, but at least you'd be in the realm where
> tweaking random_page_cost would make a difference.)

Let me guess: because the estimate (a correct one) for the accounts table is 22 records fetched out of 270, roughly 8%, it assumes that it will need to fetch 8% of the 5.3M records (which... yes, matches the estimate of 433K). Well, this seems terribly wrong for this data set. :(

> I'm not sure why that estimate is so bad, given that you said you
> increased the stats target on the table. Is there anything
> particularly skewed about the distribution of the account IDs?

Probably there is, in the sense that the relatively many accounts of the 'abc' type are referenced by relatively few records. In the plan for the hardcoded query the estimate is:

   ->  Bitmap Index Scan on ifoo_x1  (cost=0.00..4115.67 rows=178308 width=0) (actual time=89.766..89.766 rows=92790 loops=1)

which is actually more accurate. I suspect the foo.account_id statistics are not used at all in this query: the query planner can only estimate the number of accounts to look for, not how they are distributed in the referencing tables. It seems the only way to get the proper plan would be to add a load of fake accounts! Well, I'd rather have the query executed in two steps, in order to have the stats used correctly: this is the first time that has happened to me.

-- 
Daniele Varrazzo - Develer S.r.l.
http://www.develer.com
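By "two steps" I mean something like this, driven from the application, instead of the single nested query:

    -- step 1: resolve the matching accounts
    SELECT id FROM accounts WHERE system = 'abc';

    -- step 2: interpolate the resulting literal list (as in the
    -- hardcoded query above), so the planner sees the actual ids
    SELECT count(*) FROM foo WHERE account_id IN (70, 33, 190, ...);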