[PERFORM] join vs exists
Which one is better - a join between tables, or using EXISTS in the WHERE condition?

Query 1:

    SELECT a.*
    FROM a
    WHERE EXISTS (SELECT 1
                  FROM b
                  INNER JOIN c ON b.id1 = c.id
                  WHERE a.id = b.id);

Query 2:

    SELECT a.*
    FROM a
    INNER JOIN (SELECT b.id
                FROM b
                INNER JOIN c ON b.id1 = c.id) AS q ON a.id = q.id;

Any suggestion please.
[PERFORM] Obtaining the exact size of the database.
Hi All,

I am using Postgres 8.1.9 for my application. My application also has a clean-up module which cleans up a specified percentage of total database size at regular intervals. The problem is that I use *pg_database_size* to obtain the size of the database. After deleting the records, we run *Vacuum Analyze* to reorder the indexes.

The problem here is that even though some records are cleared, it still shows the original DB size. Is there any way to find out the actual DB size? Or, it would be even more useful if I could get the size of each table. I can't run *Vacuum Full* because the application should run 24*7 without downtime.

Can someone please help me in solving this? Please let me know if you need any clarifications.

Thank you,
Venu.
[PERFORM] HashAggregate slower than sort?
Hi,

I've noticed something that I find strange with the hash-aggregate feature of Postgres. I'm currently running Postgres v8.4.1 on Debian Linux 64-bit.

I have a simple query that, when planned, uses either hash-aggregation or a sort depending on the amount of working memory available. The problem is that when it uses hash-aggregation, the query runs 25% slower than when using the sort method.

The table in question contains about 60 columns, many of which are boolean or 32-bit integers, and some of which are 64-bit integers. Many fields are text - and some of these can be quite long (eg 32Kb).

The SQL is as follows:

    explain analyse
    select distinct T1.*
    from role T1
    where T1.endDate is null
      and T1.latest=true
      and T1.active=true
      and T1.deceased=false
      and T1.desk in (BIG LIST OF INTEGERS);

First run:

    select version() --> "PostgreSQL 8.4.1 on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 3.4.6 20060404 (Red Hat 3.4.6-10), 64-bit"
    show enable_hashagg --> "on"
    set work_mem='8MB'
    show work_mem --> "8MB"

Explain analyse of the SQL above:

    Unique (cost=47033.71..48410.27 rows=8881 width=1057) (actual time=18.803..38.969 rows=6449 loops=1)
      -> Sort (cost=47033.71..47055.91 rows=8881 width=1057) (actual time=18.801..20.560 rows=6449 loops=1)
           Sort Key: id, version, latest, active, deceased, person, formalnotes, informalnotes, description, desk, rolelevel, roletype, promotiondate, primaryrole, headofplace, careergrading, startdate, enddate, percentsalary, deskf, rolelevelf, roletypef, promotiondatef, primaryrolef, headofplacef, careergradingf, startdatef, enddatef, percentsalaryf, descriptionf, deskmv, rolelevelmv, roletypemv, promotiondatemv, primaryrolemv, headofplacemv, careergradingmv, startdatemv, enddatemv, percentsalarymv, descriptionmv, hasattachments, hasrelationships, hasprojects, audwho, audwhen, audcreated, costcentre, reportsto, manages, startdateest, enddateest, hasstarperformers, projectnames, sourcefrom, sourceto, checkedwho, checkedwhen, checkednotes, hasqueries, querytitles
           Sort Method: quicksort Memory: 2001kB
           -> Bitmap Heap Scan on role t1 (cost=4888.59..42321.27 rows=8881 width=1057) (actual time=7.041..12.504 rows=6449 loops=1)
                Recheck Cond: (desk = ANY ('BIG LIST OF INTEGERS'::bigint[]))
                Filter: ((enddate IS NULL) AND latest AND active AND (NOT deceased))
                -> Bitmap Index Scan on role_ix2 (cost=0.00..4886.37 rows=10984 width=0) (actual time=6.948..6.948 rows=9296 loops=1)
                     Index Cond: ((latest = true) AND (active = true) AND (deceased = false) AND (desk = ANY ('BIG LIST OF INTEGERS'::bigint[])))
    Total runtime: 40.777 ms

This execution of the query used a sort to perform the "distinct". Now for the second run:

    select version() --> "PostgreSQL 8.4.1 on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 3.4.6 20060404 (Red Hat 3.4.6-10), 64-bit"
    show enable_hashagg --> "on"
    set work_mem='64MB'
    show work_mem --> "64MB"

Explain analyse of the SQL above:

    HashAggregate (cost=43675.63..43764.44 rows=8881 width=1057) (actual time=46.556..55.694 rows=6449 loops=1)
      -> Bitmap Heap Scan on role t1 (cost=4888.59..42321.27 rows=8881 width=1057) (actual time=7.179..13.023 rows=6449 loops=1)
           Recheck Cond: (desk = ANY ('BIG LIST OF INTEGERS'::bigint[]))
           Filter: ((enddate IS NULL) AND latest AND active AND (NOT deceased))
           -> Bitmap Index Scan on role_ix2 (cost=0.00..4886.37 rows=10984 width=0) (actual time=7.086..7.086 rows=9296 loops=1)
                Index Cond: ((latest = true) AND (active = true) AND (deceased = false) AND (desk = ANY ('BIG LIST OF INTEGERS'::bigint[])))
    Total runtime: 57.536 ms

I've tested this with v8.4.4 as well, with the same results. I also tested the same query with our previous production version of Postgres (v8.3.8), and that version only appears to use sorting, not hash-aggregation.

Obviously, I can rewrite the query to use a "distinct on (...)" clause to improve performance - which is what I've done - but my question is: why is the hash-aggregate slower than the sort? Is it something to do with the number of columns? i.e. when sorting, the first few columns defined on the table (id, version) make the row unique - but when using the hash-aggregate feature, presumably every column needs to be hashed, which takes longer, especially for long text fields?

Thanks,
--Jatinder

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
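The "distinct on (...)" rewrite mentioned above would look roughly like this - a sketch only, since it assumes (as the poster suggests) that (id, version) is enough to identify a unique row:

```sql
-- Sketch of the DISTINCT ON rewrite; column names are taken from the
-- sort key in the plan above. Assumes (id, version) uniquely identifies
-- a row, so only those two columns need comparing instead of all ~60.
SELECT DISTINCT ON (T1.id, T1.version) T1.*
FROM role T1
WHERE T1.endDate IS NULL
  AND T1.latest = true
  AND T1.active = true
  AND T1.deceased = false
  AND T1.desk IN (/* BIG LIST OF INTEGERS */)
ORDER BY T1.id, T1.version;   -- DISTINCT ON requires a matching ORDER BY prefix
```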
[PERFORM] Slow function in queries SELECT clause.
I think I have read what is to be read about queries being prepared in plpgsql functions, but I still cannot explain the following, so I thought to post it here:

Suppose 2 functions: factor(int,int) and offset(int,int).
Suppose a third function: convert(float,int,int) which simply returns $1*factor($2,$3)+offset($2,$3).
All three functions are IMMUTABLE.

Very simple, right? Now I have very fast AND very slow executing queries on some 150k records:

VERY FAST (half a second):

    SELECT data*factor(1,2)+offset(1,2) FROM tbl_data;

VERY SLOW (a minute):

    SELECT convert(data, 1, 2) FROM tbl_data;

The slowness cannot be due to calling a function 150k times. If I define convert2(float,int,int) to return a constant value, then it executes in about a second (still only half as fast as the VERY FAST query).

I assume that factor and offset are cached in the VERY FAST query, and not in the slow one? If so, why not, and how can I "force" it? Currently I need only one function for conversions.

Regards,
Davor
Re: [PERFORM] Slow function in queries SELECT clause.
2010/6/19 Davor J.:
> I think I have read what is to be read about queries being prepared in
> plpgsql functions, but I still can not explain the following, so I thought
> to post it here:
>
> Suppose 2 functions: factor(int,int) and offset(int, int).
> Suppose a third function: convert(float,int,int) which simply returns
> $1*factor($2,$3)+offset($2,$3)
> All three functions are IMMUTABLE.
>
> Very simple, right? Now I have very fast AND very slow executing queries on
> some 150k records:
>
> VERY FAST (half a second):
>
> SELECT data*factor(1,2)+offset(1,2) FROM tbl_data;
>
> VERY SLOW (a minute):
>
> SELECT convert(data, 1, 2) FROM tbl_data;
>
> The slowness cannot be due to calling a function 150k times. If I define
> convert2(float,int,int) to return a constant value, then it executes in
> about a second. (still half as slow as the VERY FAST query).
>
> I assume that factor and offset are cached in the VERY FAST query, and not
> in the slow one? If so, why not and how can I "force" it? Currently I need
> only one function for conversions.
>
> Regards,
> Davor

Hi,
show us the code of those two functions and the explain analyze of those queries.

regards
Szymon Guz
Re: [PERFORM] Slow function in queries SELECT clause.
I didn't consider them to be important as they showed the same, only the execution time was different. Also, they are a bit more complex than the ones put in the previous post. But here they are:

Definitions:
---
    CREATE OR REPLACE FUNCTION appfunctions.fnc_unit_conversion_factor(_tree_id integer, _unit_to_id integer)
      RETURNS real AS
    $BODY$
    DECLARE
    BEGIN
      RETURN (SELECT unit_conv_factor AS factor
              FROM vew_unit_conversions AS c
              INNER JOIN tbl_sensors AS s ON (s.unit_id = c.unit_id_from)
              INNER JOIN tbl_trees USING (sens_id)
              WHERE tree_id = _tree_id AND unit_id_to = _unit_to_id)::real;
    END;
    $BODY$
      LANGUAGE 'plpgsql' IMMUTABLE

    CREATE OR REPLACE FUNCTION appfunctions.fnc_unit_conversion_offset(_tree_id integer, _unit_to_id integer)
      RETURNS real AS
    $BODY$
    DECLARE
    BEGIN
      RETURN (SELECT unit_conv_offset AS offset
              FROM vew_unit_conversions AS c
              INNER JOIN tbl_sensors AS s ON (s.unit_id = c.unit_id_from)
              INNER JOIN tbl_trees USING (sens_id)
              WHERE tree_id = _tree_id AND unit_id_to = _unit_to_id)::real;
    END;
    $BODY$
      LANGUAGE 'plpgsql' IMMUTABLE

    CREATE OR REPLACE FUNCTION appfunctions.fnc_unit_convert(_rawdata real, _tree_id integer, _unit_to_id integer)
      RETURNS real AS
    $BODY$
    DECLARE
    BEGIN
      RETURN _rawdata * fnc_unit_conversion_factor(_tree_id, _unit_to_id)
             + fnc_unit_conversion_offset(_tree_id, _unit_to_id);
    END;
    $BODY$
      LANGUAGE 'plpgsql' IMMUTABLE

Executions:
---
    EXPLAIN ANALYSE
    SELECT timestamp, data_from_tree_id_70 AS "flow_11"
    FROM (SELECT sens_chan_data_timestamp AS timestamp,
                 sens_chan_data_data AS data_from_tree_id_70
          FROM tbl_sensor_channel_data
          WHERE tree_id = 70
            AND sens_chan_data_timestamp >= '2008-06-11T00:00:00'
            AND sens_chan_data_timestamp <= '2008-06-18T00:00:00') AS "70"
    ORDER BY timestamp;

    "Sort (cost=175531.00..175794.64 rows=105456 width=12) (actual time=598.454..638.400 rows=150678 loops=1)"
    "  Sort Key: tbl_sensor_channel_data.sens_chan_data_timestamp"
    "  Sort Method: external sort Disk: 3240kB"
    "  -> Bitmap Heap Scan on tbl_sensor_channel_data (cost=3005.29..166732.66 rows=105456 width=12) (actual time=34.810..371.099 rows=150678 loops=1)"
    "       Recheck Cond: ((tree_id = 70) AND (sens_chan_data_timestamp >= '2008-06-11 00:00:00'::timestamp without time zone) AND (sens_chan_data_timestamp <= '2008-06-18 00:00:00'::timestamp without time zone))"
    "       -> Bitmap Index Scan on tbl_sensor_channel_data_pkey (cost=0.00..2978.92 rows=105456 width=0) (actual time=28.008..28.008 rows=150678 loops=1)"
    "            Index Cond: ((tree_id = 70) AND (sens_chan_data_timestamp >= '2008-06-11 00:00:00'::timestamp without time zone) AND (sens_chan_data_timestamp <= '2008-06-18 00:00:00'::timestamp without time zone))"
    "Total runtime: 663.478 ms"
---
    EXPLAIN ANALYSE
    SELECT timestamp, fnc_unit_convert(data_from_tree_id_70, 70, 7) AS "flow_11"
    FROM (SELECT sens_chan_data_timestamp AS timestamp,
                 sens_chan_data_data AS data_from_tree_id_70
          FROM tbl_sensor_channel_data
          WHERE tree_id = 70
            AND sens_chan_data_timestamp >= '2008-06-11T00:00:00'
            AND sens_chan_data_timestamp <= '2008-06-18T00:00:00') AS "70"
    ORDER BY timestamp;

    "Sort (cost=201895.00..202158.64 rows=105456 width=12) (actual time=35334.017..35372.977 rows=150678 loops=1)"
    "  Sort Key: tbl_sensor_channel_data.sens_chan_data_timestamp"
    "  Sort Method: external sort Disk: 3240kB"
    "  -> Bitmap Heap Scan on tbl_sensor_channel_data (cost=3005.29..193096.66 rows=105456 width=12) (actual time=60.012..35037.129 rows=150678 loops=1)"
    "       Recheck Cond: ((tree_id = 70) AND (sens_chan_data_timestamp >= '2008-06-11 00:00:00'::timestamp without time zone) AND (sens_chan_data_timestamp <= '2008-06-18 00:00:00'::timestamp without time zone))"
    "       -> Bitmap Index Scan on tbl_sensor_channel_data_pkey (cost=0.00..2978.92 rows=105456 width=0) (actual time=21.884..21.884 rows=150678 loops=1)"
    "            Index Cond: ((tree_id = 70) AND (sens_chan_data_timestamp >= '2008-06-11 00:00:00'::timestamp without time zone) AND (sens_chan_data_timestamp <= '2008-06-18 00:00:00'::timestamp without time zone))"
    "Total runtime: 35397.841 ms"
---
    EXPLAIN ANALYSE
    SELECT timestamp,
           data_from_tree_id_70*fnc_unit_conversion_factor(70, 7) + fnc_unit_conversion_offset(70, 7) AS "flow_11"
    FROM (SELECT sens_chan_data_timestamp AS timestamp,
                 sens_chan_data_data AS data_from_tree_id_70
          FROM tbl_sensor_channel_data
          WHERE tree_id = 70
            AND sens_chan_data_timestamp >= '2008-06-11T00:00:00'
            AND sens_chan_data_timestamp <= '2008-06-18T00:00:00') AS "70"
    ORDER BY timestamp;

    "Sort (cost=176058.28..176321.92 rows=105456 width=12) (actual time=630.350..669.843 rows=1506
Re: [PERFORM] Obtaining the exact size of the database.
Hi there

1. PG 8.1.9 is ancient ... you should upgrade.

2. The database gross size on disk is not affected by VACUUM ANALYZE ... all this does is return space used by deleted row-versions to PG for re-use. The only way to reduce it, and thus return disk space to the OS, is to do a VACUUM FULL, or to delete the entire table.

3. If you can suspend writes for a while, you can pull off an "online" VACUUM FULL, or copy and delete the table in order to repack it. Check out the CLUSTER command.

4. If you're trying to figure out the net size of the table, i.e. how much free space is inside the table files for reuse by PG, then you need the pg_stat_tuple function ... this is built in to PG 8.4, and a plug-in activated by a script for PG 8.3; I don't know if it exists in 8.1 or not. Like SELECT COUNT(*), this requires a full table scan.

Cheers
Dave

sent from my Android phone

On Jun 20, 2010 6:18 AM, "venu madhav" wrote:
> Hi All,
> I am using Postgres 8.1.9 for my application. My application also has a
> clean up module which cleans up specified percentage of total database size
> at regular intervals. Now the problem is I use *pg_database_size* to obtain
> the size of the database. After deleting the records, we run *Vacuum
> Analyze* to reorder the indexes. The problem here is even though some
> records are cleared, it still shows the original DB Size. Is there any way
> to find out the actual DB Size or it would be more useful, if I can get the
> size of each table. I can't run *Vacuum Full* because the application should
> be run 24*7 without downtime. Can someone please help me in solving this.
> Please let me know if you need any clarifications.
> Thank you,
> Venu.
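For reference, the function Dave describes ships (in 8.4) as the pgstattuple contrib module - note the exact name differs slightly from "pg_stat_tuple". A sketch of what inspecting a table's free space with it looks like; the table name is a placeholder:

```sql
-- Assumes PostgreSQL 8.4 with the pgstattuple contrib module installed.
-- pgstattuple() scans the whole table and reports how much of the file
-- is live data, dead tuples, and reusable free space.
SELECT table_len,        -- physical size of the table in bytes
       tuple_len,        -- bytes occupied by live tuples
       dead_tuple_len,   -- bytes occupied by dead (deleted, unreclaimed) tuples
       free_space        -- bytes available for reuse by new rows
FROM pgstattuple('public.my_table');  -- 'my_table' is a placeholder name
```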
Re: [PERFORM] join vs exists
AI Rumman wrote:
> Which one is good - join between table or using exists in where
> condition?

Your example wouldn't return the same results unless there was at most one matching row in b and one matching row in c, at least without resorting to DISTINCT (which you don't show). So, be careful of not getting the wrong results in an attempt to optimize.

You don't say which version of PostgreSQL you're using, but if it's a fairly recent major version, I would expect nearly identical performance if the queries returned the same results without DISTINCT, and would usually expect better results for the EXISTS than the JOIN with DISTINCT.

-Kevin
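To make the comparison concrete, the two semantically equivalent shapes Kevin is contrasting would look something like this (a sketch using the table names from the original post):

```sql
-- EXISTS form: each row of a appears at most once, no matter how many
-- matching (b, c) pairs exist.
SELECT a.*
FROM a
WHERE EXISTS (SELECT 1
              FROM b
              JOIN c ON b.id1 = c.id
              WHERE b.id = a.id);

-- JOIN form: needs DISTINCT, otherwise a row of a is duplicated once
-- per matching (b, c) pair.
SELECT DISTINCT a.*
FROM a
JOIN b ON b.id = a.id
JOIN c ON c.id = b.id1;
```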
Re: [PERFORM] Slow function in queries SELECT clause.
"Davor J." writes: > Suppose 2 functions: factor(int,int) and offset(int, int). > Suppose a third function: convert(float,int,int) which simply returns > $1*factor($2,$3)+offset($2,$3) > All three functions are IMMUTABLE. You should write the third function as a SQL function, which'd allow it to be inlined. > VERY FAST (half a second): > > SELECT data*factor(1,2)+offset(1,2) FROM tbl_data; In this case both factor() calls are folded to constants, hence executed only once. > VERY SLOW (a minute): > > SELECT convert(data, 1, 2) FROM tbl_data; Without inlining, there's no hope of any constant-folding here. The optimizer just sees the plpgsql function as a black box and can't do anything with it. BTW, your later mail shows that the factor() functions are not really IMMUTABLE, since they select from tables that presumably are subject to change. The "correct" declaration would be STABLE. If you're relying on constant-folding to get reasonable application performance, you're going to have to continue to mislabel them as IMMUTABLE; but be aware that you're likely to have issues any time you do change the table contents. The changes won't get reflected into existing query plans. regards, tom lane -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Obtaining the exact size of the database.
Dave Crooke writes:
> 4. If you're trying to figure out the net size of the table, i.e. how much
> free space is inside the table files for reuse by PG, then you need the
> pg_stat_tuple function ... this is built in to PG 8.4, and a plug-in
> activated by a script for PG 8.3, don't know if it exists in 8.1 or not.
> Like SELECT COUNT(*) this requires a full table scan.

I think what the OP actually wants is the number of live rows, so plain old SELECT COUNT(*) would do it. If that's too slow, a good alternative is to ANALYZE the table and then look at its pg_class.reltuples entry --- of course that will only be an approximate count.

regards, tom lane
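Tom's reltuples suggestion, as a sketch (the table name is a placeholder):

```sql
-- Refresh the planner statistics, then read the approximate live-row
-- count from the catalog; no full table scan at query time.
ANALYZE my_table;                    -- 'my_table' is a placeholder name

SELECT reltuples::bigint AS approx_live_rows
FROM pg_class
WHERE relname = 'my_table';
```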
[PERFORM] Aggressive autovacuuming ?
Hi.

I have been wondering if anyone has been experimenting with "really aggressive" autovacuuming. The database I'm administrating rarely has "long running" transactions (over several minutes), and there is a fair amount of buffer cache plus an OS cache of (at best) 64GB. A lot of the OS cache is being used for read-caching.

My thought was that if I tuned autovacuum to be "really aggressive", then I could get autovacuum to actually vacuum the tuples before they get evicted from the OS cache, thus effectively "saving" the IO-overhead of vacuuming.

The largest consequence I can see at the moment is that when I get a full vacuum (for preventing transaction-id wraparound), it would be run with the same aggressive settings, thus giving a real performance hit in that situation.

Has anyone tried to do similar? What is your experience? Is the idea totally bogus?

Jesper
--
Jesper Krogh
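For concreteness, "really aggressive" settings of the kind being discussed would look something like the following postgresql.conf fragment. The values are illustrative assumptions, not recommendations - as the replies note, how far you can push this depends entirely on your IO capacity:

```ini
# Sketch of aggressive autovacuum settings (illustrative values only)
autovacuum = on
autovacuum_max_workers = 6              # more concurrent workers (8.3+)
autovacuum_naptime = 10s                # wake up and check tables often
autovacuum_vacuum_scale_factor = 0.01   # vacuum after ~1% dead tuples
autovacuum_vacuum_threshold = 50
autovacuum_vacuum_cost_delay = 1ms      # near-zero cost-based throttling
autovacuum_vacuum_cost_limit = 2000
```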
Re: [PERFORM] B-Heaps
Robert Haas wrote:
> This is drifting a bit off-topic for this thread, but it's not so easy to
> figure out from looking at the TODO which things are actually important.
> Performance-related improvements are mixed in with non-performance related
> improvements, which are mixed in with things that are probably not
> improvements at all. And even to the extent that you can identify the
> stuff that's performance-related, it's far from obvious which things are
> most important. Any thoughts on that?

I don't think it's off topic at all actually, and as usual I'll be happy to argue why. Reorganizing the TODO so that it's easier for newcomers to consume is certainly a worthwhile but hard to "fund" (find time to do relative to more important things) effort in itself.

My point was more that, statistically, *anything* on that list is likely a better candidate for something to work on usefully than one of the random theoretical performance improvements from research that pop up on the lists from time to time. People get excited about these papers and blog posts sometimes, but the odds of those actually being in the critical path, where they represent a solution to a current PostgreSQL bottleneck, are dramatically lower than the odds that you'll find one reading the list of *known* issues. Want to improve PostgreSQL performance? Spend more time reading the TODO, less looking around elsewhere for problems the database may or may not have.

I have a major time sink I'm due to free myself from this week, and the idea of providing some guidance for a "low hanging performance fruit" section of the TODO is a good one I should take a look at. I have a personal list of that sort already that I should probably just make public, since the ideas for improving things are not the valuable part I should worry about keeping private anyway.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com   www.2ndQuadrant.us
Re: [PERFORM] Obtaining the exact size of the database.
venu madhav wrote:
> The problem here is even though some records are cleared, it still shows
> the original DB Size. Is there any way to find out the actual DB Size or
> it would be more useful, if I can get the size of each table.

One of the queries at http://wiki.postgresql.org/wiki/Disk_Usage should give you the breakdown per table.

Regular VACUUM doesn't ever shrink the database from the operating system perspective unless you hit a very unusual situation (all of the free space is at the end). There is no way to do that without system downtime of sorts, in the form of a potentially long database lock, such as VACUUM FULL (the main option on 8.1; the alternative of using CLUSTER isn't a good idea until 8.3). The best you can do is make sure you VACUUM often enough that space is regularly reused.

It's hard to run a 24x7 environment on 8.1. Much easier on 8.4, where the major things that regularly left people with quite bad VACUUM cleanup situations are all less likely to occur than on any previous version.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com   www.2ndQuadrant.us
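A per-table breakdown similar in spirit to the wiki queries Greg mentions can be sketched like this (the pg_relation_size/pg_total_relation_size functions are in core as of 8.1, so this should apply to the OP's version):

```sql
-- Sketch: largest tables first, with and without indexes/TOAST data.
SELECT relname,
       pg_relation_size(oid)       AS table_bytes,   -- heap only
       pg_total_relation_size(oid) AS total_bytes    -- heap + indexes + TOAST
FROM pg_class
WHERE relkind = 'r'                                  -- ordinary tables only
ORDER BY pg_total_relation_size(oid) DESC
LIMIT 20;
```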
Re: [PERFORM] Aggressive autovacuuming ?
On Sun, Jun 20, 2010 at 11:44 AM, Jesper Krogh wrote:
> Hi.
>
> I have been wondering if anyone has been experimenting with "really
> agressive" autovacuuming.

I have been using moderately aggressive autovac, with 6 or more threads running with 1ms sleep, then keeping track of them to see if they're being too aggressive. Basically, as long as IO utilization doesn't hit 100%, it doesn't seem to have any negative or even noticeable effect.

I head more in the direction of running a few more threads than I absolutely need to keep up with bloat. If I'm set for 5 threads and I always have five threads running, I go to 6, 7, 8 or wherever they're never all active. But you need the IO capability to use aggressive vacuuming.

> The database I'm adminstrating rarely have "long running" transactions
> (over several minutes). And a fair amount of buffercache and an OS cache of
> (at best 64GB). A lot of the OS cache is being used for read-caching.
>
> My thought was that if I tuned autovacuum to be "really aggressive" then
> I could get autovacuum to actually vacuum the tuples before they
> get evicted from the OS cache thus effectively "saving" the IO-overhead
> of vacuuming.

But vacuuming by design has to write out, and that's the real resource you're likely to use up first.

> The largest consequence I can see at the moment is that when I get a
> full vacuum (for preventing transaction-id wraparound) it would be

I assume you mean the automatic database-wide vacuum. I don't think 8.4 and above need that anymore. I think 8.3 does that too, but I'm not 100% sure.

> run with the same aggressive settings, thus giving a real performance
> hit in that situation.
>
> Has anyone tried to do similar? What is your experience?
> Is the idea totally bogus?

Cranking up autovacuum is not a bogus idea, but it directly impacts your IO subsystem, and if you get too aggressive (zero naptime is way aggressive) you have to back off on the number of threads to keep things sane. If your IO subsystem is one 7200RPM SATA drive with write cache disabled / fsync properly working, you're not gonna be able to get very aggressive before you make your IO subsystem bog down.
Re: [PERFORM] Obtaining the exact size of the database.
On Sun, Jun 20, 2010 at 2:04 PM, Greg Smith wrote:
> It's hard to run a 24x7 environment on 8.1. Much easier on 8.4, where the
> major things that regularly left people with quite bad VACUUM cleanup
> situations are all less likely to occur than on any previous version.

Hear, hear. Keeping anything before 8.2 fed and happy is pretty difficult in 24/7 environments. 8.2 and 8.3 are OK if you keep a close eye on them. And it just gets better from there.