Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?
Hi, Josh,

Josh Berkus wrote:
> Yes, actually. We need 3 different estimation methods:
> 1 for tables where we can sample a large % of pages (say, >= 0.1)
> 1 for tables where we sample a small % of pages but are "easily estimated"
> 1 for tables which are not easily estimated but we can't afford to
> sample a large % of pages.
>
> If we're doing sampling-based estimation, I really don't want people to
> lose sight of the fact that page-based random sampling is much less
> expensive than row-based random sampling. We should really be focusing
> on methods which are page-based.

Would it make sense to have a sample method that scans indices? I think
that, at least for tree-based indices (btree, gist), rather good
estimates could be derived. And the presence of a unique index should
lead to a 100% distinct-values estimate without any scan at all.

Markus
Re: [PERFORM] batch inserts are "slow"
Hi, all,

David Parker wrote:
> We ran into the need to use COPY, but our application is also in Java.
> We wrote a JNI bridge to a C++ routine that uses the libpq library to
> do the COPY. The coding is a little bit weird, but not too complicated -
> the biggest pain in the neck is probably getting it into your build
> system.

There are several hacks floating around that add COPY capabilities to
the pgjdbc driver. As they are all rather simple hacks, they have not
been included in CVS yet, but they tend to work fine.

Markus
[PERFORM] Foreign key constraints compile faster in 7.4
Hello Everybody,

We recently upgraded to Postgres 7.4 from 7.3.9 and noticed that the
foreign key constraints compile noticeably faster. In 7.3 the
constraints would typically take more than an hour to run on our
production data. Now they take a minute or two.

Can anybody explain such a major performance improvement?

Thanks

-- Ashish Arte
Open Sky Software
Re: [PERFORM] batch inserts are "slow"
People,

> There are several hacks floating around that add COPY capabilities to
> the pgjdbc driver. As they are all rather simple hacks, they have not
> been included in CVS yet, but they tend to work fine.

FWIW, Dave Cramer just added beta COPY capability to JDBC. Contact him
on the JDBC list for details; I think he needs testers.

-- 
Josh Berkus
Aglio Database Solutions
San Francisco
[PERFORM]
In our application we have tables that we regularly load with 5-10
million records daily. We *were* using INSERT (I know... still kicking
ourselves for *that* design decision), and we are now converting over
to COPY. For the sake of robustness, we are planning on breaking the
entire load into chunks of a couple hundred thousand records each.
This is to constrain the amount of data we'd have to re-process if one
of the COPYs fails.

My question is: are there any advantages, drawbacks, or outright
restrictions to using multiple simultaneous COPY commands to load data
into the same table? One issue that comes to mind is the loss of data
sequencing if we have multiple chunks interleaving records in the table
at the same time. But from a purely technical point of view, is there
any reason why the backend would not be happy with two or more COPY
commands trying to insert data into the same table at the same time?
Does COPY take out any locks on a table?

Thanks in advance,
--- Steve
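To make the chunking scheme above concrete: each chunk becomes one COPY
statement and therefore one transaction, so a failure only rolls back
the chunk being loaded. A minimal sketch (table and file names are
hypothetical):

    -- each COPY runs in its own transaction; a failure in chunk 2
    -- leaves chunk 1 committed and reloadable work limited to chunk 2
    COPY daily_load FROM '/data/chunks/chunk_0001.dat';
    COPY daily_load FROM '/data/chunks/chunk_0002.dat';
    -- ... one statement per chunk of a couple hundred thousand rows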
Re: [PERFORM] batch inserts are "slow"
On Tue, 3 May 2005, Josh Berkus wrote:

> > There are several hacks floating around that add COPY capabilities to
> > the pgjdbc driver. As they are all rather simple hacks, they have not
> > been included in CVS yet, but they tend to work fine.
>
> FWIW, Dave Cramer just added beta COPY capability to JDBC. Contact him
> on the JDBC list for details; I think he needs testers.

I believe Dave has remerged a patch for COPY I posted over a year ago,
but he has not yet published it. I would guess it has the same bugs as
the original (transaction + error handling) and will meet the same
objections that kept the original patch out of the driver in the first
place (we want a friendlier API than just a data stream).

Kris Jurka
Re: [PERFORM] Foreign key constraints compile faster in 7.4
Ashish Arte <[EMAIL PROTECTED]> writes:
> We recently upgraded to Postgres 7.4 from 7.3.9 and noticed that the
> foreign key constraints compile noticeably faster. In 7.3 the
> constraints would typically take more than an hour to run on our
> production data. Now they take a minute or two.

> Can anybody explain such a major performance improvement?

Hey, we do do some work on this thing from time to time ;-)

Probably you are talking about this:

2003-10-06 12:38  tgl

	* src/: backend/commands/tablecmds.c,
	backend/utils/adt/ri_triggers.c, include/commands/trigger.h:
	During ALTER TABLE ADD FOREIGN KEY, try to check the existing rows
	using a single LEFT JOIN query instead of firing the check trigger
	for each row individually. Stephan Szabo, with some kibitzing from
	Tom Lane and Jan Wieck.

			regards, tom lane
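To illustrate the change the commit message describes: instead of
firing the RI check trigger once per existing row, 7.4 validates all
existing rows with one set-oriented query of roughly the following
shape (table and column names are hypothetical; this sketches the idea,
not the exact internal query):

    -- find child rows whose key has no match in the parent table
    SELECT fk.parent_id
    FROM   child_tbl fk
           LEFT JOIN parent_tbl pk ON fk.parent_id = pk.id
    WHERE  pk.id IS NULL
      AND  fk.parent_id IS NOT NULL;
    -- any row returned would violate the new constraint

One join over the whole table lets the planner pick a hash or merge
join, which is why an hour of per-row trigger calls can collapse to a
minute or two.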
Re: [PERFORM]
Steven Rosenstein <[EMAIL PROTECTED]> writes:
> My question is: are there any advantages, drawbacks, or outright
> restrictions to using multiple simultaneous COPY commands to load data
> into the same table?

It will work; not sure about whether there is any performance benefit.
I vaguely recall someone having posted about doing this, so you might
check the archives.

			regards, tom lane
[PERFORM] Testing list access
Testing list access
[PERFORM] Bad choice of query plan from PG 7.3.6 to PG 7.3.9 part 2
Please refer to part 1 for the question and query 1.

Cheers
Jona

---

Query 2:

EXPLAIN ANALYZE
SELECT DISTINCT CatType_Tbl.id, CatType_Tbl.url, Category_Tbl.name,
       Min(SubCatType_Tbl.id) AS subcatid
FROM (CatType_Tbl
      INNER JOIN Category_Tbl ON CatType_Tbl.id = Category_Tbl.cattpid
            AND Category_Tbl.enabled = true
      INNER JOIN Language_Tbl ON Language_Tbl.id = Category_Tbl.langid
            AND Language_Tbl.sysnm = UPPER('us')
            AND Language_Tbl.enabled = true
      INNER JOIN SubCatType_Tbl ON CatType_Tbl.id = SubCatType_Tbl.cattpid
            AND SubCatType_Tbl.enabled = true
      INNER JOIN SCT2SubCatType_Tbl
            ON SubCatType_Tbl.id = SCT2SubCatType_Tbl.subcattpid
      INNER JOIN Price_Tbl ON SCT2SubCatType_Tbl.sctid = Price_Tbl.sctid
            AND Price_Tbl.affid = 8)
WHERE CatType_Tbl.spcattpid = 1
  AND CatType_Tbl.enabled = true
GROUP BY CatType_Tbl.id, CatType_Tbl.url, Category_Tbl.name
ORDER BY CatType_Tbl.id ASC

Plan on PostgreSQL 7.3.6 on Red Hat Linux 3.2.3-39:

Unique  (cost=94.57..94.58 rows=1 width=147) (actual time=134.85..134.86 rows=4 loops=1)
  ->  Sort  (cost=94.57..94.57 rows=1 width=147) (actual time=134.85..134.85 rows=4 loops=1)
        Sort Key: cattype_tbl.id, cattype_tbl.url, category_tbl.name, min(subcattype_tbl.id)
        ->  Aggregate  (cost=94.54..94.56 rows=1 width=147) (actual time=127.49..134.77 rows=4 loops=1)
              ->  Group  (cost=94.54..94.55 rows=1 width=147) (actual time=114.85..132.44 rows=2117 loops=1)
                    ->  Sort  (cost=94.54..94.55 rows=1 width=147) (actual time=114.84..116.10 rows=2117 loops=1)
                          Sort Key: cattype_tbl.id, cattype_tbl.url, category_tbl.name
                          ->  Nested Loop  (cost=4.54..94.53 rows=1 width=147) (actual time=0.64..52.65 rows=2117 loops=1)
                                ->  Nested Loop  (cost=4.54..88.51 rows=1 width=143) (actual time=0.55..18.23 rows=2838 loops=1)
                                      ->  Hash Join  (cost=4.54..8.93 rows=1 width=135) (actual time=0.44..1.34 rows=48 loops=1)
                                            Hash Cond: ("outer".langid = "inner".id)
                                            ->  Hash Join  (cost=3.47..7.84 rows=1 width=131) (actual time=0.35..1.05 rows=96 loops=1)
                                                  Hash Cond: ("outer".cattpid = "inner".id)
                                                  ->  Seq Scan on subcattype_tbl  (cost=0.00..3.98 rows=79 width=8) (actual time=0.03..0.37 rows=156 loops=1)
                                                        Filter: (enabled = true)
                                                  ->  Hash  (cost=3.46..3.46 rows=1 width=123) (actual time=0.30..0.30 rows=0 loops=1)
                                                        ->  Hash Join  (cost=1.50..3.46 rows=1 width=123) (actual time=0.12..0.29 rows=10 loops=1)
                                                              Hash Cond: ("outer".cattpid = "inner".id)
                                                              ->  Seq Scan on category_tbl  (cost=0.00..1.80 rows=32 width=51) (actual time=0.03..0.13 rows=64 loops=1)
                                                                    Filter: (enabled = true)
                                                              ->  Hash  (cost=1.50..1.50 rows=1 width=72) (actual time=0.07..0.07 rows=0 loops=1)
                                                                    ->  Seq Scan on cattype_tbl  (cost=0.00..1.50 rows=1 width=72) (actual time=0.04..0.06 rows=5 loops=1)
                                                                          Filter: ((spcattpid = 1) AND (enabled = true))
                                            ->  Hash  (cost=1.07..1.07 rows=1 width=4) (actual time=0.05..0.05 rows=0 loops=1)
                                                  ->  Seq Scan on language_tbl  (cost=0.00..1.07 rows=1 width=4) (actual time=0.05..0.05 rows=1 loops=1)
                                                        Filter: (((sysnm)::text = 'US'::text) AND (enabled = true))
                                      ->  Index Scan using subcat_uq on sct2subcattype_tbl  (cost=0.00..79.26 rows=26 width=8) (actual time=0.01..0.17 rows=59 loops=48)
                                            Index Cond: ("outer".id = sct2subcattype_tbl.subcattpid)
                                ->  Index Scan using aff_price_uq on price_tbl  (cost=0.00..6.01 rows=1 width=4) (actual time=0.01..0.01 rows=1 loops=2838)
                                      Index Cond: ((price_tbl.affid = 8) AND ("outer".sctid = price_tbl.sctid))
Total runtime: 135.39 msec

Plan on PostgreSQL 7.3.9 on Red Hat Linux 3.2.3-49:

Unique  (cost=1046.36..1046.54 rows=1 width=75) (actual time=279.67..279.69 rows=4 loops=1)
  ->  Sort  (cost=1046.36..1046.40 rows=15 w
Re: [PERFORM] batch inserts are "slow"
> > I'm converting an application to use postgresql instead of oracle.
> > There seems to be only one issue left: batch inserts in postgresql
> > seem significantly slower than in oracle. I have about 200 batch
> > jobs, each consisting of about 14,000 inserts. Each job takes 1.3
> > seconds in postgresql and 0.25 seconds in oracle. With 200 jobs this
> > means several more minutes to complete the task. By fixing this I
> > think the application using postgresql would be faster overall than
> > when using oracle.
>
> Just as on Oracle you would use SQL*Loader for this application, you
> should use the COPY syntax for PostgreSQL. You will find it a lot
> faster. I have used it by building the input files and executing
> 'psql' with a COPY command, and also by using it with a subprocess;
> both are quite effective.

I tried this now. Now it's down to 0.45 seconds. It feels a bit hacky
to run /usr/bin/psql from java, but it sure works. Thanks for the hint!

Tim
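For reference, the stream such a psql invocation consumes is just a
COPY command followed by the data rows (columns separated by literal
tab characters) and a terminating backslash-period line. A minimal
sketch against a hypothetical table:

    COPY accounts (id, name, balance) FROM STDIN;
    1	alice	100.00
    2	bob	250.50
    \.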
[PERFORM] Bad choice of query plan from PG 7.3.6 to PG 7.3.9 part 1b
Please refer to part 1a for the questions and part 2 for more queries
and query plans. Why won't this list accept my questions and sample
data in one mail???

/Jona

Query 1:

EXPLAIN ANALYZE
SELECT DISTINCT StatConTrans_Tbl.id, Code_Tbl.sysnm AS code,
       PriceCat_Tbl.amount AS price, Country_Tbl.currency,
       CreditsCat_Tbl.amount AS credits, Info_Tbl.title,
       Info_Tbl.description
FROM (SCT2SubCatType_Tbl
      INNER JOIN SCT2Lang_Tbl ON SCT2SubCatType_Tbl.sctid = SCT2Lang_Tbl.sctid
      INNER JOIN Language_Tbl ON SCT2Lang_Tbl.langid = Language_Tbl.id
            AND Language_Tbl.sysnm = UPPER('us')
            AND Language_Tbl.enabled = true
      INNER JOIN Info_Tbl ON SCT2SubCatType_Tbl.sctid = Info_Tbl.sctid
            AND Language_Tbl.id = Info_Tbl.langid
      INNER JOIN SubCatType_Tbl ON SCT2SubCatType_Tbl.subcattpid = SubCatType_Tbl.id
            AND SubCatType_Tbl.enabled = true
      INNER JOIN CatType_Tbl ON SubCatType_Tbl.cattpid = CatType_Tbl.id
            AND CatType_Tbl.enabled = true
      INNER JOIN SuperCatType_Tbl ON CatType_Tbl.spcattpid = SuperCatType_Tbl.id
            AND SuperCatType_Tbl.enabled = true
      INNER JOIN StatConTrans_Tbl ON SCT2SubCatType_Tbl.sctid = StatConTrans_Tbl.id
            AND StatConTrans_Tbl.enabled = true
      INNER JOIN Price_Tbl ON StatConTrans_Tbl.id = Price_Tbl.sctid
            AND Price_Tbl.affid = 8
      INNER JOIN PriceCat_Tbl ON Price_Tbl.prccatid = PriceCat_Tbl.id
            AND PriceCat_Tbl.enabled = true
      INNER JOIN Country_Tbl ON PriceCat_Tbl.cntid = Country_Tbl.id
            AND Country_Tbl.enabled = true
      INNER JOIN CreditsCat_Tbl ON Price_Tbl.crdcatid = CreditsCat_Tbl.id
            AND CreditsCat_Tbl.enabled = true
      INNER JOIN StatCon_Tbl ON StatConTrans_Tbl.id = StatCon_Tbl.sctid
            AND StatCon_Tbl.ctpid = 1
      INNER JOIN Code_Tbl ON SuperCatType_Tbl.id = Code_Tbl.spcattpid
            AND Code_Tbl.affid = 8 AND Code_Tbl.cdtpid = 1)
WHERE SCT2SubCatType_Tbl.subcattpid = 79
ORDER BY StatConTrans_Tbl.id DESC
LIMIT 8 OFFSET 0

Plan on PostgreSQL 7.3.6 on Red Hat Linux 3.2.3-39:

Limit  (cost=178.59..178.61 rows=1 width=330) (actual time=22.77..28.51 rows=4 loops=1)
  ->  Unique  (cost=178.59..178.61 rows=1 width=330) (actual time=22.77..28.50 rows=4 loops=1)
        ->  Sort  (cost=178.59..178.60 rows=1 width=330) (actual time=22.76..22.85 rows=156 loops=1)
              Sort Key: statcontrans_tbl.id, code_tbl.sysnm, pricecat_tbl.amount, country_tbl.currency, creditscat_tbl.amount, info_tbl.title, info_tbl.description
              ->  Hash Join  (cost=171.19..178.58 rows=1 width=330) (actual time=3.39..6.55 rows=156 loops=1)
                    Hash Cond: ("outer".cntid = "inner".id)
                    ->  Nested Loop  (cost=170.13..177.51 rows=1 width=312) (actual time=3.27..5.75 rows=156 loops=1)
                          Join Filter: ("inner".sctid = "outer".sctid)
                          ->  Hash Join  (cost=170.13..171.48 rows=1 width=308) (actual time=3.12..3.26 rows=4 loops=1)
                                Hash Cond: ("outer".crdcatid = "inner".id)
                                ->  Hash Join  (cost=169.03..170.38 rows=1 width=300) (actual time=3.00..3.11 rows=4 loops=1)
                                      Hash Cond: ("outer".spcattpid = "inner".spcattpid)
                                      ->  Hash Join  (cost=167.22..168.56 rows=1 width=253) (actual time=2.88..2.97 rows=4 loops=1)
                                            Hash Cond: ("outer".id = "inner".prccatid)
                                            ->  Seq Scan on pricecat_tbl  (cost=0.00..1.29 rows=12 width=12) (actual time=0.04..0.08 rows=23 loops=1)
                                                  Filter: (enabled = true)
                                            ->  Hash  (cost=167.21..167.21 rows=1 width=241) (actual time=2.80..2.80 rows=0 loops=1)
                                                  ->  Nested Loop  (cost=3.77..167.21 rows=1 width=241) (actual time=1.31..2.79 rows=4 loops=1)
                                                        Join Filter: ("inner".sctid = "outer".sctid)
                                                        ->  Nested Loop  (cost=3.77..161.19 rows=1 width=229) (actual time=1.19..2.60 rows=4 loops=1)
                                                              Join Filter: ("outer".sctid = "inner".sctid)
                                                              ->  Hash Join  (cost=3.77..155.17 rows=1 width=44) (actual time=1.07..2.37 rows=4 loops=1)
                                                                    Hash Cond: ("outer".langid = "inner".id)
                                                                    ->  Nested Loop  (cost=2.69..154.06 rows=7 width=40) (actual time=0.90..2.18 rows=8 loops=1)
                                                                          Join Filter: ("outer".sctid = "inner".sctid)
                                                                          ->  Nested Loop  (cost=2
Re: [PERFORM] batch inserts are "slow"
Tim Terlegård wrote:
>> Just as on Oracle you would use SQL*Loader for this application, you
>> should use the COPY syntax for PostgreSQL. You will find it a lot
>> faster. I have used it by building the input files and executing
>> 'psql' with a COPY command, and also by using it with a subprocess;
>> both are quite effective.
>
> I tried this now. Now it's down to 0.45 seconds. It feels a bit hacky
> to run /usr/bin/psql from java, but it sure works. Thanks for the hint!

There was a patch against 7.4 that provided direct JDBC access to
PostgreSQL's COPY. (I have it installed here and *love* it - it gives
outstanding performance.) However, it hasn't made it into an official
release yet. I don't know why; perhaps there's a problem yet to be
solved with it ('works for me', though)?

Is this still on the board? I won't upgrade past 7.4 without it.

-- 
Steve Wampler -- [EMAIL PROTECTED]
The gods that smiled on your birth are now laughing out loud.
Re: [PERFORM] COPY vs INSERT
> Steven Rosenstein <[EMAIL PROTECTED]> writes:
> > My question is: are there any advantages, drawbacks, or outright
> > restrictions to using multiple simultaneous COPY commands to load
> > data into the same table?

Do you mean multiple COPY commands (connections) being putline'd from
the same thread (process)? I have indirect evidence that this may hurt.

Two COPY commands from different threads/processes are fine, and can
help, if they alternate contention on some other resource (disk/CPU).

I'm basing this on being at the third generation of a COPY
implementation. The app loads about 1M objects/hour from 6 servers.
Each object is split across four tables. The batch load opens four
connections and firehoses records down each. A batch is 10K objects.

COPY invokes all the same logic as INSERT on the server side
(row-exclusive locking, transaction log, updating indexes, rules). The
difference is that all the rows are inserted as a single transaction.
This reduces the number of fsync's on the xlog, which may be a limiting
factor for you. You'll want to crank WAL_BUFFERS and
CHECKPOINT_SEGMENTS to match, though. One of my streams has 6K records;
I run with WB=1000, CS=128.

The downside I found with multiple clients inserting large blocks of
rows was that they serialized. I THINK that's because at some point
they all needed to lock the same portions of the same indexes. I'm
still working on how to avoid that, tuning the batch size and inserting
into a "queue" table with fewer indexes.

COPY (via putline) didn't do measurably better than INSERT until I
batched 40 newline-separated rows into one putline call, which improved
it 2-3:1. The suspected problem was stalling on the TCP stream; the
driver was flushing small packets. This may or may not be relevant to
you; it depends on how much processing (waiting) your app does between
posting of rows. In such a case, writing alternately to two TCP streams
from the same process increases the likelihood of a stall. I've never
tested that set-up; it would have been heading AWAY from the solution
in my case.

Hope that helps.

-- 
Engineers think equations approximate reality.
Physicists think reality approximates the equations.
Mathematicians never make the connection.
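For reference, the WAL_BUFFERS and CHECKPOINT_SEGMENTS knobs mentioned
above are the postgresql.conf parameters below; the values shown are
the ones quoted for this particular workload (WB=1000, CS=128), not
general recommendations:

    wal_buffers = 1000
    checkpoint_segments = 128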
Re: [PERFORM] batch inserts are "slow"
On 5/3/05, Tim Terlegård <[EMAIL PROTECTED]> wrote:
> > Just as on Oracle you would use SQL*Loader for this application, you
> > should use the COPY syntax for PostgreSQL. You will find it a lot
> > faster. I have used it by building the input files and executing
> > 'psql' with a COPY command, and also by using it with a subprocess;
> > both are quite effective.
>
> I tried this now. Now it's down to 0.45 seconds. It feels a bit hacky
> to run /usr/bin/psql from java, but it sure works. Thanks for the hint!

It may feel hacky, but I think if you want to use SQL*Loader on Oracle,
you have to do the same thing. I know a C++ app that I use that runs
SQL*Loader about once per second to deal with a HUGE volume (10K/sec).
In fact, moving the load files onto ramdisk has helped a lot.

Chris
-- 
| Christopher Petrilli
| [EMAIL PROTECTED]
Re: [PERFORM] batch inserts are "slow"
Kris is correct. This code was not added or even submitted to CVS. The
purpose of this was to work out the bugs with people who are actually
using COPY. The API is a separate issue, however; there's no reason
that COPY can't support more than one API.

Dave

Kris Jurka wrote:
> On Tue, 3 May 2005, Josh Berkus wrote:
>
> > > There are several hacks floating around that add COPY capabilities
> > > to the pgjdbc driver. As they are all rather simple hacks, they
> > > have not been included in CVS yet, but they tend to work fine.
> >
> > FWIW, Dave Cramer just added beta COPY capability to JDBC. Contact
> > him on the JDBC list for details; I think he needs testers.
>
> I believe Dave has remerged a patch for COPY I posted over a year ago,
> but he has not yet published it. I would guess it has the same bugs as
> the original (transaction + error handling) and will meet the same
> objections that kept the original patch out of the driver in the first
> place (we want a friendlier API than just a data stream).
>
> Kris Jurka

-- 
Dave Cramer
http://www.postgresintl.com
519 939 0336
ICQ#14675561
Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?
Quoting Markus Schaber <[EMAIL PROTECTED]>:

> Hi, Josh,
>
> Josh Berkus wrote:
>
> > Yes, actually. We need 3 different estimation methods:
> > 1 for tables where we can sample a large % of pages (say, >= 0.1)
> > 1 for tables where we sample a small % of pages but are "easily estimated"
> > 1 for tables which are not easily estimated but we can't afford to
> > sample a large % of pages.
> >
> > If we're doing sampling-based estimation, I really don't want people
> > to lose sight of the fact that page-based random sampling is much
> > less expensive than row-based random sampling. We should really be
> > focusing on methods which are page-based.

Okay, although given the track record of page-based sampling for
n-distinct, it's a bit like looking for your keys under the
streetlight, rather than in the alley where you dropped them :-)

How about applying the distinct-sampling filter on a small extra data
stream to the stats collector?

-- 
Engineers think equations approximate reality.
Physicists think reality approximates the equations.
Mathematicians never make the connection.
Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?
Mischa,

> Okay, although given the track record of page-based sampling for
> n-distinct, it's a bit like looking for your keys under the
> streetlight, rather than in the alley where you dropped them :-)

Bad analogy, but funny.

The issue with page-based vs. pure random sampling is that sampling,
for example, 10% of rows purely randomly would actually mean loading
50% of pages. With 20% of rows, you might as well scan the whole table.

Unless, of course, we use indexes for sampling, which seems like a
*really good* idea to me ...

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco
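A rough model of that claim, assuming rows are scattered uniformly and
independently across pages (the rows-per-page figure r below is
illustrative, not from the thread): sampling a fraction f of rows
touches about 1 - (1 - f)^r of the pages.

    f = 0.10, r = 7:  1 - 0.90^7 ≈ 0.52  (10% of rows reads ~half the pages)
    f = 0.20, r = 7:  1 - 0.80^7 ≈ 0.79  (20% of rows reads most of the table)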
Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?
Josh Berkus wrote:
> Mischa,
>
> > Okay, although given the track record of page-based sampling for
> > n-distinct, it's a bit like looking for your keys under the
> > streetlight, rather than in the alley where you dropped them :-)
>
> Bad analogy, but funny.
>
> The issue with page-based vs. pure random sampling is that sampling,
> for example, 10% of rows purely randomly would actually mean loading
> 50% of pages. With 20% of rows, you might as well scan the whole table.
>
> Unless, of course, we use indexes for sampling, which seems like a
> *really good* idea to me ...

But doesn't an index only sample one column at a time, whereas with
page-based sampling you can sample all of the columns at once? And not
all columns have indexes, though it could be assumed that if a column
doesn't have an index, then it doesn't matter as much for calculations
such as n_distinct.

But if you had 5 indexed columns in your table, doing it index-wise
means you would have to make 5 passes instead of just one. Though I
agree that page-based sampling is important for performance reasons.

John
=:->
Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?
John,

> But doesn't an index only sample one column at a time, whereas with
> page-based sampling you can sample all of the columns at once?

Hmmm. Yeah, we're not currently doing that, though. Another good
idea ...

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco
[PERFORM] Kernel Resources and max_connections
Hi,

I have postgres 8.0.2 installed on FreeBSD 4.11-RELEASE with 2GB of
RAM. When trying to set max_connections=256 I get the following error
message:

FATAL: could not create semaphores: No space left on device
DETAIL: Failed system call was semget(5432017, 17, 03600).
HINT: This error does *not* mean that you have run out of disk space.
It occurs when either the system limit for the maximum number of
semaphore sets (SEMMNI), or the system wide maximum number of
semaphores (SEMMNS), would be exceeded. You need to raise the
respective kernel parameter. Alternatively, reduce PostgreSQL's
consumption of semaphores by reducing its max_connections parameter
(currently 256). The PostgreSQL documentation contains more
information about configuring your system for PostgreSQL.

I have read through the kernel resources documentation for postgres 8
and set values accordingly. Some settings are not staying after a
reboot, even if I place them in /boot/loader.conf. So far I'm able to
get max_connections to 250.

Here is a dump of the kern.ipc values:

kern.ipc.maxsockbuf: 262144
kern.ipc.sockbuf_waste_factor: 8
kern.ipc.somaxconn: 128
kern.ipc.max_linkhdr: 16
kern.ipc.max_protohdr: 60
kern.ipc.max_hdr: 76
kern.ipc.max_datalen: 136
kern.ipc.nmbclusters: 65536
kern.ipc.msgmax: 16384
kern.ipc.msgmni: 40
kern.ipc.msgmnb: 2048
kern.ipc.msgtql: 40
kern.ipc.msgssz: 8
kern.ipc.msgseg: 2048
kern.ipc.semmap: 30
kern.ipc.semmni: 256
kern.ipc.semmns: 272
kern.ipc.semmnu: 30
kern.ipc.semmsl: 60
kern.ipc.semopm: 100
kern.ipc.semume: 10
kern.ipc.semusz: 92
kern.ipc.semvmx: 32767
kern.ipc.semaem: 16384
kern.ipc.shmmax: 33554432
kern.ipc.shmmin: 1
kern.ipc.shmmni: 192
kern.ipc.shmseg: 128
kern.ipc.shmall: 8192
kern.ipc.shm_use_phys: 0
kern.ipc.shm_allow_removed: 0
kern.ipc.mbuf_wait: 32
kern.ipc.mbtypes: 38 551 3 0 0 0 0 0 0 0 0 0 0 0 0 0
kern.ipc.nmbufs: 262144
kern.ipc.nsfbufs: 8704
kern.ipc.nsfbufspeak: 7
kern.ipc.nsfbufsused: 0
kern.ipc.m_clreflimithits: 0
kern.ipc.mcl_pool_max: 0
kern.ipc.mcl_pool_now: 0
kern.ipc.maxsockets: 65536

And /boot/loader.conf:

userconfig_script_load="YES"
kern.ipc.nmbclusters="65536"
kern.maxfiles="65536"
kern.maxfilesperproc="65536"
net.inet.tcp.mssdflt="1460"
kern.somaxconn="4096"
kern.ipc.semmns="272"
kern.ipc.semmni="256"
kern.ipc.shmmax="66099200"

kern.ipc.shmmax and kern.ipc.shmmin will not stay at what I set them
to. What am I doing wrong or not doing at all?

Your help is greatly appreciated.

Regards,
Chris.
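The figures in the HINT can be cross-checked against the formula in
the PostgreSQL kernel-resources documentation; the 17-semaphores-per-
set count appears in the failed semget() call itself:

    SEMMNI >= ceil(max_connections / 16)  = ceil(256 / 16) = 16 sets
    SEMMNS >= 16 sets * 17 semaphores/set = 272

plus headroom for anything else on the system that uses SysV
semaphores, which is presumably why the bare minimums of 256/272 above
do not quite suffice for max_connections = 256.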
Re: [PERFORM] Kernel Resources and max_connections
Chris Hebrard wrote:
> kern.ipc.shmmax and kern.ipc.shmmin will not stay at what I set them
> to. What am I doing wrong or not doing at all?

These need to go in /etc/sysctl.conf. You might need to set shmall as
well.

(This not-very-clear distinction between what is sysctl'able and what
is a kernel tunable is a bit of a downer).

cheers

Mark
Re: [PERFORM] Kernel Resources and max_connections
Mark Kirkwood wrote:
> Chris Hebrard wrote:
> > kern.ipc.shmmax and kern.ipc.shmmin will not stay at what I set them
> > to. What am I doing wrong or not doing at all?
>
> These need to go in /etc/sysctl.conf. You might need to set shmall as
> well.

Thanks for your reply. I set the values in /etc/sysctl.conf:

# $FreeBSD: src/etc/sysctl.conf,v 1.1.2.3 2002/04/15 00:44:13 dougb Exp $
#
# This file is read when going to multi-user and its contents piped thru
# ``sysctl'' to adjust kernel values. ``man 5 sysctl.conf'' for details.
#
# Added by IMP 2005-05-04
net.inet.tcp.rfc1323=1
kern.ipc.somaxconn=1024
kern.ipc.maxsockbuf=8388608
net.inet.tcp.sendspace=3217968
net.inet.tcp.recvspace=3217968
kern.ipc.semmns="272"
kern.ipc.semmni="256"
kern.ipc.shmmax="66099200"
kern.ipc.shmmin="256"

After a restart both shmmax and shmmin are now 0 and postgres failed
to start:

kern.ipc.maxsockbuf: 8388608
kern.ipc.sockbuf_waste_factor: 8
kern.ipc.somaxconn: 1024
kern.ipc.max_linkhdr: 16
kern.ipc.max_protohdr: 60
kern.ipc.max_hdr: 76
kern.ipc.max_datalen: 136
kern.ipc.nmbclusters: 65536
kern.ipc.msgmax: 16384
kern.ipc.msgmni: 40
kern.ipc.msgmnb: 2048
kern.ipc.msgtql: 40
kern.ipc.msgssz: 8
kern.ipc.msgseg: 2048
kern.ipc.semmap: 30
kern.ipc.semmni: 10
kern.ipc.semmns: 60
kern.ipc.semmnu: 30
kern.ipc.semmsl: 60
kern.ipc.semopm: 100
kern.ipc.semume: 10
kern.ipc.semusz: 92
kern.ipc.semvmx: 32767
kern.ipc.semaem: 16384
kern.ipc.shmmax: 0
kern.ipc.shmmin: 0
kern.ipc.shmmni: 192
kern.ipc.shmseg: 128
kern.ipc.shmall: 8192
kern.ipc.shm_use_phys: 0
kern.ipc.shm_allow_removed: 0
kern.ipc.mbuf_wait: 32
kern.ipc.mbtypes: 24 550 2 0 0 0 0 0 0 0 0 0 0 0 0 0
kern.ipc.nmbufs: 262144
kern.ipc.nsfbufs: 8704
kern.ipc.nsfbufspeak: 0
kern.ipc.nsfbufsused: 0
kern.ipc.m_clreflimithits: 0
kern.ipc.mcl_pool_max: 0
kern.ipc.mcl_pool_now: 0
kern.ipc.maxsockets: 65536

I'm lost here.

Chris.
[PERFORM] Kernel Resources Solved
Problem solved by setting:

kern.ipc.semmni: 280
kern.ipc.semmns: 300

Chris.

Mark Kirkwood wrote:
> Chris Hebrard wrote:
> > kern.ipc.shmmax and kern.ipc.shmmin will not stay at what I set them
> > to. What am I doing wrong or not doing at all?
>
> These need to go in /etc/sysctl.conf. You might need to set shmall as
> well.
>
> (This not-very-clear distinction between what is sysctl'able and what
> is a kernel tunable is a bit of a downer).
>
> cheers
>
> Mark

kern.ipc.maxsockbuf: 8388608
kern.ipc.sockbuf_waste_factor: 8
kern.ipc.somaxconn: 1024
kern.ipc.max_linkhdr: 16
kern.ipc.max_protohdr: 60
kern.ipc.max_hdr: 76
kern.ipc.max_datalen: 136
kern.ipc.nmbclusters: 65536
kern.ipc.msgmax: 16384
kern.ipc.msgmni: 40
kern.ipc.msgmnb: 2048
kern.ipc.msgtql: 40
kern.ipc.msgssz: 8
kern.ipc.msgseg: 2048
kern.ipc.semmap: 30
kern.ipc.semmni: 256
kern.ipc.semmns: 272
kern.ipc.semmnu: 30
kern.ipc.semmsl: 60
kern.ipc.semopm: 100
kern.ipc.semume: 10
kern.ipc.semusz: 92
kern.ipc.semvmx: 32767
kern.ipc.semaem: 16384
kern.ipc.shmmax: 66099200
kern.ipc.shmmin: 256
kern.ipc.shmmni: 192
kern.ipc.shmseg: 128
kern.ipc.shmall: 8192
kern.ipc.shm_use_phys: 0
kern.ipc.shm_allow_removed: 0
kern.ipc.mbuf_wait: 32
kern.ipc.mbtypes: 37 552 3 0 0 0 0 0 0 0 0 0 0 0 0 0
kern.ipc.nmbufs: 262144
kern.ipc.nsfbufs: 8704
kern.ipc.nsfbufspeak: 4
kern.ipc.nsfbufsused: 0
kern.ipc.m_clreflimithits: 0
kern.ipc.mcl_pool_max: 0
kern.ipc.mcl_pool_now: 0
kern.ipc.maxsockets: 65536

I've got the values to what I want them to be now, after loading some
values in loader.conf and others in sysctl.conf.

loader.conf:

userconfig_script_load="YES"
kern.ipc.nmbclusters="65536"
kern.maxfiles="65536"
kern.maxfilesperproc="65536"
net.inet.tcp.mssdflt="1460"
kern.somaxconn="4096"
kern.ipc.semmns="272"
kern.ipc.semmni="256"

sysctl.conf:

net.inet.tcp.rfc1323=1
kern.ipc.somaxconn=1024
kern.ipc.maxsockbuf=8388608
net.inet.tcp.sendspace=3217968
net.inet.tcp.recvspace=3217968
kern.ipc.shmmax=66099200
kern.ipc.shmmin=256
kern.ipc.shmall=16138

and the kern.ipc values are now:

kern.ipc.maxsockbuf: 8388608
kern.ipc.sockbuf_waste_factor: 8
kern.ipc.somaxconn: 1024
kern.ipc.max_linkhdr: 16
kern.ipc.max_protohdr: 60
kern.ipc.max_hdr: 76
kern.ipc.max_datalen: 136
kern.ipc.nmbclusters: 65536
kern.ipc.msgmax: 16384
kern.ipc.msgmni: 40
kern.ipc.msgmnb: 2048
kern.ipc.msgtql: 40
kern.ipc.msgssz: 8
kern.ipc.msgseg: 2048
kern.ipc.semmap: 30
kern.ipc.semmni: 256
kern.ipc.semmns: 272
kern.ipc.semmnu: 30
kern.ipc.semmsl: 60
kern.ipc.semopm: 100
kern.ipc.semume: 10
kern.ipc.semusz: 92
kern.ipc.semvmx: 32767
kern.ipc.semaem: 16384
kern.ipc.shmmax: 66099200
kern.ipc.shmmin: 256
kern.ipc.shmmni: 192
kern.ipc.shmseg: 128
kern.ipc.shmall: 16138
kern.ipc.shm_use_phys: 0
kern.ipc.shm_allow_removed: 0
kern.ipc.mbuf_wait: 32
kern.ipc.mbtypes: 7 550 3 0 0 0 0 0 0 0 0 0 0 0 0 0
kern.ipc.nmbufs: 262144
kern.ipc.nsfbufs: 8704
kern.ipc.nsfbufspeak: 6
kern.ipc.nsfbufsused: 0
kern.ipc.m_clreflimithits: 0
kern.ipc.mcl_pool_max: 0
kern.ipc.mcl_pool_now: 0
kern.ipc.maxsockets: 65536

Postgres still refuses to start with 256 max_connections.

Chris.
Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?
Quoting Josh Berkus:

> Mischa,
>
> > Okay, although given the track record of page-based sampling for
> > n-distinct, it's a bit like looking for your keys under the
> > streetlight, rather than in the alley where you dropped them :-)
>
> Bad analogy, but funny.

Bad analogy? Page-sampling effort versus row-sampling effort, c'est
moot. It's not good enough for stats to produce good behaviour on the
average. Straight random sampling, page or row, is going to cause
enough untrustworthy engine behaviour, for any %ages small enough to
allow sampling from scratch at any time.

I'm curious what the problem is with relying on a start-up plus
incremental method, when the method in the distinct-sampling paper
doesn't degenerate: you can start when the table is still empty.
Constructing an index requires an initial full scan plus incremental
update; what's the diff?

> Unless, of course, we use indexes for sampling, which seems like a
> *really good* idea to me

"distinct-sampling" applies for indexes, too. I started tracking the
discussion of this a bit late. A smart method for this is in VLDB'92:
Gennady Antoshenkov, "Random Sampling from Pseudo-ranked B+-trees". I
don't think this is online anywhere, except if you have a DBLP
membership. Does anybody else know better?

Antoshenkov was the brains behind some of the really cool stuff in DEC
Rdb (what eventually became Oracle Rdb): compressed bitmap indices,
parallel competing query plans, and smart handling of keys with
hyperbolic distributions.

-- 
Engineers think equations approximate reality.
Physicists think reality approximates the equations.
Mathematicians never make the connection.
Re: [PERFORM] Kernel Resources and max_connections
On Wed, May 04, 2005 at 01:46:34PM +1200, Mark Kirkwood wrote:
> (This not-very-clear distinction between what is sysctl'able and what
> is a kernel tunable is a bit of a downer).

I think this is documented somewhere, though I can't think of where
right now. Also, note that some sysctls can only be set in
/boot/loader.conf. hw.ata.wc=0 is an example (which you want to set on
any box with IDE drives if you want fsync to actually do what it thinks
it's doing).

-- 
Jim C. Nasby, Database Consultant [EMAIL PROTECTED]
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"
Re: [PERFORM] Kernel Resources and max_connections
Chris Hebrard wrote:
> I set the values in /etc/sysctl.conf:
>
> net.inet.tcp.rfc1323=1
> kern.ipc.somaxconn=1024
> kern.ipc.maxsockbuf=8388608
> net.inet.tcp.sendspace=3217968
> net.inet.tcp.recvspace=3217968
> kern.ipc.semmns="272"
> kern.ipc.semmni="256"
> kern.ipc.shmmax="66099200"
> kern.ipc.shmmin="256"
>
> After a restart both shmmax and shmmin are now 0 and postgres failed
> to start.

Hmmm - puzzling. One point to check: did you take them out of
/boot/loader.conf?

Assuming so, maybe don't quote 'em (see below). Finally, you need to
set shmall, otherwise it will over(under)ride the shmmax setting. So
try:

net.inet.tcp.rfc1323=1
kern.ipc.somaxconn=1024
kern.ipc.maxsockbuf=8388608
net.inet.tcp.sendspace=3217968
net.inet.tcp.recvspace=3217968
kern.ipc.semmns=272
kern.ipc.semmni=256
kern.ipc.shmmax=66099200
kern.ipc.shmmin=256
kern.ipc.shmall=32768