Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-05-03 Thread Markus Schaber
Hi, Josh,

Josh Berkus wrote:

> Yes, actually.   We need 3 different estimation methods:
> 1 for tables where we can sample a large % of pages (say, >= 0.1)
> 1 for tables where we sample a small % of pages but are "easily estimated"
> 1 for tables which are not easily estimated but we can't afford to sample a 
> large % of pages.
> 
> If we're doing sampling-based estimation, I really don't want people to lose 
> sight of the fact that page-based random sampling is much less expensive than 
> row-based random sampling.   We should really be focusing on methods which 
> are page-based.

Would it make sense to have a sample method that scans indices? I think
that, at least for tree-based indices (btree, gist), rather good
estimates could be derived.

And the presence of a unique index should lead to 100% distinct values
estimation without any scan at all.
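
(For what it's worth, "every value distinct" is already expressible in the
statistics: pg_stats records it as n_distinct = -1, so a unique index could
pin that value without any sampling. A purely illustrative query showing the
convention:)

    SELECT tablename, attname, n_distinct
    FROM pg_stats
    WHERE n_distinct = -1;  -- -1 means "as many distinct values as rows"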

Markus


---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [PERFORM] batch inserts are "slow"

2005-05-03 Thread Markus Schaber
Hi, all,

David Parker wrote:
> We ran into the need to use COPY, but our application is also in Java.
> We wrote a JNI bridge to a C++ routine that uses the libpq library to do
> the COPY. The coding is a little bit weird, but not too complicated -
> the biggest pain in the neck is probably getting it into your build
> system.

There are several hacks floating around that add COPY capabilities to
the pgjdbc driver. As they all are rather simple hacks, they have not
been included in CVS yet, but they tend to work fine.

Markus


---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


[PERFORM] Foreign key constraints compile faster in 7.4

2005-05-03 Thread Ashish Arte
Hello Everybody,

We recently upgraded to Postgres 7.4 from 7.3.9 and noticed that the
foreign key constraints compile noticeably faster. In 7.3 the
constraints would typically take more than an hour to run on our
production data. Now they take a minute or two.

Can anybody explain such a major performance improvement ?

Thanks

-- 
Ashish Arte
Open Sky Software


---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PERFORM] batch inserts are "slow"

2005-05-03 Thread Josh Berkus
People,

> There are several hacks floating around that add COPY capabilities to
> the pgjdbc driver. As they all are rather simple hacks, they have not
> been included in CVS yet, but they tend to work fine.

FWIW, Dave Cramer just added beta COPY capability to JDBC.   Contact him on 
the JDBC list for details; I think he needs testers.

-- 
Josh Berkus
Aglio Database Solutions
San Francisco

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


[PERFORM]

2005-05-03 Thread Steven Rosenstein

In our application we have tables that we regularly load with 5-10 million
records daily.  We *were* using INSERT (I know...  Still kicking ourselves
for *that* design decision), and we are now converting over to COPY.  For the
sake of robustness, we are planning on breaking the entire load into chunks
of a couple hundred thousand records each. This is to constrain the amount
of data we'd have to re-process if one of the COPYs fails.
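
(A sketch of that chunked pattern, with invented file paths; each COPY is a
single transaction, so a failed chunk rolls back only its own rows:)

    COPY daily_tbl FROM '/loads/2005-05-03/chunk_001.dat';
    COPY daily_tbl FROM '/loads/2005-05-03/chunk_002.dat';
    -- ... one COPY per chunk of a couple hundred thousand records ...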

My question is, are there any advantages, drawbacks, or outright
restrictions to using multiple simultaneous COPY commands to load data into
the same table?   One issue that comes to mind is the loss of data
sequencing if we have multiple chunks interleaving records in the table at
the same time.  But from a purely technical point of view, is there any
reason why the backend would not be happy with two or more COPY commands
trying to insert data into the same table at the same time?  Does COPY take
out any locks on a table?

Thanks in advance,
--- Steve


---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [PERFORM] batch inserts are "slow"

2005-05-03 Thread Kris Jurka


On Tue, 3 May 2005, Josh Berkus wrote:

> > There are several hacks floating around that add COPY capabilities to
> > the pgjdbc driver. As they all are rather simple hacks, they have not
> > been included in CVS yet, but they tend to work fine.
> 
> FWIW, Dave Cramer just added beta COPY capability to JDBC.   Contact him on 
> the JDBC list for details; I think he needs testers.
> 

I believe Dave has remerged a patch for COPY I posted over a year ago, but 
he has not yet published it.  I would guess it has the same bugs as the 
original (transaction + error handling) and will meet the same objections 
that kept the original patch out of the driver in the first place (we want 
a friendlier API than just a data stream).

Kris Jurka

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [PERFORM] Foreign key constraints compile faster in 7.4

2005-05-03 Thread Tom Lane
Ashish Arte <[EMAIL PROTECTED]> writes:
> We recently upgraded to Postgres 7.4 from 7.3.9 and noticed that the
> foreign key constraints compile noticeably faster. In 7.3 the
> constraints would typically take more than an hour to run on our
> production data. Now they take a minute or two.

> Can anybody explain such a major performance improvement ?

Hey, we do do some work on this thing from time to time ;-)

Probably you are talking about this:

2003-10-06 12:38  tgl

* src/: backend/commands/tablecmds.c,
backend/utils/adt/ri_triggers.c, include/commands/trigger.h: During
ALTER TABLE ADD FOREIGN KEY, try to check the existing rows using a
single LEFT JOIN query instead of firing the check trigger for each
row individually.  Stephan Szabo, with some kibitzing from Tom Lane
and Jan Wieck.
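
(To illustrate the changelog entry above, the set-oriented check is
conceptually an anti-join of this shape; the table and column names are
invented here, and the server builds the real query internally:)

    SELECT fk.parent_id
    FROM ONLY child_tbl fk
    LEFT JOIN ONLY parent_tbl pk ON fk.parent_id = pk.id
    WHERE pk.id IS NULL AND fk.parent_id IS NOT NULL;
    -- any row returned violates the proposed constraint; an empty
    -- result lets ALTER TABLE ... ADD FOREIGN KEY succeed

One such join over the whole table replaces one trigger invocation per row,
which is where "more than an hour" shrinking to "a minute or two" comes from.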


regards, tom lane

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [PERFORM]

2005-05-03 Thread Tom Lane
Steven Rosenstein <[EMAIL PROTECTED]> writes:
> My question is, are there any advantages, drawbacks, or outright
> restrictions to using multiple simultaneous COPY commands to load data into
> the same table?

It will work; not sure about whether there is any performance benefit.
I vaguely recall someone having posted about doing this, so you might
check the archives.

regards, tom lane

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


[PERFORM] Testing list access

2005-05-03 Thread Jona
Testing list access
---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
 joining column's datatypes do not match


[PERFORM] Bad choice of query plan from PG 7.3.6 to PG 7.3.9 part 2

2005-05-03 Thread Jona
Please refer to part 1 for question and query 1
Cheers
Jona
--- 

Query 2:
EXPLAIN ANALYZE
SELECT DISTINCT CatType_Tbl.id, CatType_Tbl.url, Category_Tbl.name, 
Min(SubCatType_Tbl.id) AS subcatid
FROM (CatType_Tbl
INNER JOIN Category_Tbl ON CatType_Tbl.id = Category_Tbl.cattpid AND 
Category_Tbl.enabled = true
INNER JOIN Language_Tbl ON Language_Tbl.id = Category_Tbl.langid AND 
Language_Tbl.sysnm = UPPER('us') AND Language_Tbl.enabled = true
INNER JOIN SubCatType_Tbl ON CatType_Tbl.id = SubCatType_Tbl.cattpid AND 
SubCatType_Tbl.enabled = true
INNER JOIN SCT2SubCatType_Tbl ON SubCatType_Tbl.id = 
SCT2SubCatType_Tbl.subcattpid
INNER JOIN Price_Tbl ON SCT2SubCatType_Tbl.sctid = Price_Tbl.sctid AND 
Price_Tbl.affid = 8)
WHERE CatType_Tbl.spcattpid = 1 AND CatType_Tbl.enabled = true
GROUP BY CatType_Tbl.id, CatType_Tbl.url, Category_Tbl.name
ORDER BY CatType_Tbl.id ASC

Plan on PostgreSQL 7.3.6 on Red Hat Linux 3.2.3-39
"Unique  (cost=94.57..94.58 rows=1 width=147) (actual 
time=134.85..134.86 rows=4 loops=1)"
"  ->  Sort  (cost=94.57..94.57 rows=1 width=147) (actual 
time=134.85..134.85 rows=4 loops=1)"
"Sort Key: cattype_tbl.id, cattype_tbl.url, category_tbl.name, 
min(subcattype_tbl.id)"
"->  Aggregate  (cost=94.54..94.56 rows=1 width=147) (actual 
time=127.49..134.77 rows=4 loops=1)"
"  ->  Group  (cost=94.54..94.55 rows=1 width=147) (actual 
time=114.85..132.44 rows=2117 loops=1)"
"->  Sort  (cost=94.54..94.55 rows=1 width=147) 
(actual time=114.84..116.10 rows=2117 loops=1)"
"  Sort Key: cattype_tbl.id, cattype_tbl.url, 
category_tbl.name"
"  ->  Nested Loop  (cost=4.54..94.53 rows=1 
width=147) (actual time=0.64..52.65 rows=2117 loops=1)"
"->  Nested Loop  (cost=4.54..88.51 
rows=1 width=143) (actual time=0.55..18.23 rows=2838 loops=1)"
"  ->  Hash Join  (cost=4.54..8.93 
rows=1 width=135) (actual time=0.44..1.34 rows=48 loops=1)"
"Hash Cond: ("outer".langid 
= "inner".id)"
"->  Hash Join  
(cost=3.47..7.84 rows=1 width=131) (actual time=0.35..1.05 rows=96 
loops=1)"
"  Hash Cond: 
("outer".cattpid = "inner".id)"
"  ->  Seq Scan on 
subcattype_tbl  (cost=0.00..3.98 rows=79 width=8) (actual 
time=0.03..0.37 rows=156 loops=1)"
"Filter: 
(enabled = true)"
"  ->  Hash  
(cost=3.46..3.46 rows=1 width=123) (actual time=0.30..0.30 rows=0 loops=1)"
"->  Hash Join  
(cost=1.50..3.46 rows=1 width=123) (actual time=0.12..0.29 rows=10 
loops=1)"
"  Hash 
Cond: ("outer".cattpid = "inner".id)"
"  ->  Seq 
Scan on category_tbl  (cost=0.00..1.80 rows=32 width=51) (actual 
time=0.03..0.13 rows=64 loops=1)"
"
Filter: (enabled = true)"
"  ->  Hash  
(cost=1.50..1.50 rows=1 width=72) (actual time=0.07..0.07 rows=0 loops=1)"
"->  
Seq Scan on cattype_tbl  (cost=0.00..1.50 rows=1 width=72) (actual 
time=0.04..0.06 rows=5 loops=1)"
"  
Filter: ((spcattpid = 1) AND (enabled = true))"
"->  Hash  (cost=1.07..1.07 
rows=1 width=4) (actual time=0.05..0.05 rows=0 loops=1)"
"  ->  Seq Scan on 
language_tbl  (cost=0.00..1.07 rows=1 width=4) (actual time=0.05..0.05 
rows=1 loops=1)"
"Filter: 
(((sysnm)::text = 'US'::text) AND (enabled = true))"
"  ->  Index Scan using subcat_uq on 
sct2subcattype_tbl  (cost=0.00..79.26 rows=26 width=8) (actual 
time=0.01..0.17 rows=59 loops=48)"
"Index Cond: ("outer".id = 
sct2subcattype_tbl.subcattpid)"
"->  Index Scan using aff_price_uq on 
price_tbl  (cost=0.00..6.01 rows=1 width=4) (actual time=0.01..0.01 
rows=1 loops=2838)"
"  Index Cond: ((price_tbl.affid = 
8) AND ("outer".sctid = price_tbl.sctid))"
"Total runtime: 135.39 msec"

Plan on PostgreSQL 7.3.9 on Red Hat Linux 3.2.3-49
"Unique  (cost=1046.36..1046.54 rows=1 width=75) (actual 
time=279.67..279.69 rows=4 loops=1)"
"  ->  Sort  (cost=1046.36..1046.40 rows=15 w

Re: [PERFORM] batch inserts are "slow"

2005-05-03 Thread Tim Terlegård
> > I'm converting an application to be using postgresql instead of oracle.
> > There seems to be only one issue left, batch inserts in postgresql seem
> > significantly slower than in oracle. I have about 200 batch jobs, each
> > consisting of about 14 000 inserts. Each job takes 1.3 seconds in
> > postgresql and 0.25 seconds in oracle. With 200 jobs this means several
> > more minutes to complete the task. By fixing this I think the
> > application using postgresql over all would be faster than when using
> > oracle.
>
> Just as on Oracle you would use SQL*Loader for this application, you
> should use the COPY syntax for PostgreSQL.  You will find it a lot
> faster.  I have used it by building the input files and executing
> 'psql' with a COPY command, and also by using it with a subprocess,
> both are quite effective.

I tried this now. Now it's down to 0.45 seconds. It feels a bit hacky to
run /usr/bin/psql from java, but it sure works. Thanks for the hint!
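
(For anyone reproducing this: the stream the Java process writes to psql's
stdin is just a COPY command, tab-separated rows, and the end-of-data marker.
Table name and values here are invented for illustration:)

    COPY batch_tbl (id, val) FROM STDIN;
    1	first row
    2	second row
    \.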

Tim


---(end of broadcast)---
TIP 8: explain analyze is your friend


[PERFORM] Bad choice of query plan from PG 7.3.6 to PG 7.3.9 part 1b

2005-05-03 Thread Jona
Please refer to part 1a for questions and part 2 for more queries and 
query plans.
Why won't this list accept my questions and sample data in one mail???

/Jona
 

Query 1:
EXPLAIN ANALYZE
SELECT DISTINCT StatConTrans_Tbl.id, Code_Tbl.sysnm AS code, 
PriceCat_Tbl.amount AS price, Country_Tbl.currency,
 CreditsCat_Tbl.amount AS credits, Info_Tbl.title, Info_Tbl.description
FROM (SCT2SubCatType_Tbl
INNER JOIN SCT2Lang_Tbl ON SCT2SubCatType_Tbl.sctid = SCT2Lang_Tbl.sctid
INNER JOIN Language_Tbl ON SCT2Lang_Tbl.langid = Language_Tbl.id AND 
Language_Tbl.sysnm = UPPER('us') AND Language_Tbl.enabled = true
INNER JOIN Info_Tbl ON SCT2SubCatType_Tbl.sctid = Info_Tbl.sctid AND 
Language_Tbl.id = Info_Tbl.langid
INNER JOIN SubCatType_Tbl ON SCT2SubCatType_Tbl.subcattpid = 
SubCatType_Tbl.id AND SubCatType_Tbl.enabled = true
INNER JOIN CatType_Tbl ON SubCatType_Tbl.cattpid = CatType_Tbl.id AND 
CatType_Tbl.enabled = true
INNER JOIN SuperCatType_Tbl ON CatType_Tbl.spcattpid = 
SuperCatType_Tbl.id AND SuperCatType_Tbl.enabled = true
INNER JOIN StatConTrans_Tbl ON SCT2SubCatType_Tbl.sctid = 
StatConTrans_Tbl.id AND StatConTrans_Tbl.enabled = true
INNER JOIN Price_Tbl ON StatConTrans_Tbl.id = Price_Tbl.sctid AND 
Price_Tbl.affid = 8
INNER JOIN PriceCat_Tbl ON Price_Tbl.prccatid = PriceCat_Tbl.id AND 
PriceCat_Tbl.enabled = true
INNER JOIN Country_Tbl ON PriceCat_Tbl.cntid = Country_Tbl.id AND 
Country_Tbl.enabled = true
INNER JOIN CreditsCat_Tbl ON Price_Tbl.crdcatid = CreditsCat_Tbl.id AND 
CreditsCat_Tbl.enabled = true
INNER JOIN StatCon_Tbl ON StatConTrans_Tbl.id = StatCon_Tbl.sctid AND 
StatCon_Tbl.ctpid = 1
INNER JOIN Code_Tbl ON SuperCatType_Tbl.id = Code_Tbl.spcattpid AND 
Code_Tbl.affid = 8 AND Code_Tbl.cdtpid = 1)
WHERE SCT2SubCatType_Tbl.subcattpid = 79
ORDER BY StatConTrans_Tbl.id DESC
LIMIT 8 OFFSET 0

Plan on PostgreSQL 7.3.6 on Red Hat Linux 3.2.3-39
"Limit  (cost=178.59..178.61 rows=1 width=330) (actual time=22.77..28.51 
rows=4 loops=1)"
"  ->  Unique  (cost=178.59..178.61 rows=1 width=330) (actual 
time=22.77..28.50 rows=4 loops=1)"
"->  Sort  (cost=178.59..178.60 rows=1 width=330) (actual 
time=22.76..22.85 rows=156 loops=1)"
"  Sort Key: statcontrans_tbl.id, code_tbl.sysnm, 
pricecat_tbl.amount, country_tbl.currency, creditscat_tbl.amount, 
info_tbl.title, info_tbl.description"
"  ->  Hash Join  (cost=171.19..178.58 rows=1 width=330) 
(actual time=3.39..6.55 rows=156 loops=1)"
"Hash Cond: ("outer".cntid = "inner".id)"
"->  Nested Loop  (cost=170.13..177.51 rows=1 
width=312) (actual time=3.27..5.75 rows=156 loops=1)"
"  Join Filter: ("inner".sctid = "outer".sctid)"
"  ->  Hash Join  (cost=170.13..171.48 rows=1 
width=308) (actual time=3.12..3.26 rows=4 loops=1)"
"Hash Cond: ("outer".crdcatid = 
"inner".id)"
"->  Hash Join  (cost=169.03..170.38 
rows=1 width=300) (actual time=3.00..3.11 rows=4 loops=1)"
"  Hash Cond: ("outer".spcattpid = 
"inner".spcattpid)"
"  ->  Hash Join  
(cost=167.22..168.56 rows=1 width=253) (actual time=2.88..2.97 rows=4 
loops=1)"
"Hash Cond: ("outer".id = 
"inner".prccatid)"
"->  Seq Scan on 
pricecat_tbl  (cost=0.00..1.29 rows=12 width=12) (actual time=0.04..0.08 
rows=23 loops=1)"
"  Filter: (enabled = 
true)"
"->  Hash  
(cost=167.21..167.21 rows=1 width=241) (actual time=2.80..2.80 rows=0 
loops=1)"
"  ->  Nested Loop  
(cost=3.77..167.21 rows=1 width=241) (actual time=1.31..2.79 rows=4 
loops=1)"
"Join Filter: 
("inner".sctid = "outer".sctid)"
"->  Nested 
Loop  (cost=3.77..161.19 rows=1 width=229) (actual time=1.19..2.60 
rows=4 loops=1)"
"  Join 
Filter: ("outer".sctid = "inner".sctid)"
"  ->  Hash 
Join  (cost=3.77..155.17 rows=1 width=44) (actual time=1.07..2.37 rows=4 
loops=1)"
"
Hash Cond: ("outer".langid = "inner".id)"
"->  
Nested Loop  (cost=2.69..154.06 rows=7 width=40) (actual time=0.90..2.18 
rows=8 loops=1)"
"  
Join Filter: ("outer".sctid = "inner".sctid)"
"  
->  Nested Loop  (cost=2

Re: [PERFORM] batch inserts are "slow"

2005-05-03 Thread Steve Wampler
Tim Terlegård wrote:
>>
>>Just as on Oracle you would use SQL*Loader for this application, you
>>should use the COPY syntax for PostgreSQL.  You will find it a lot
>>faster.  I have used it by building the input files and executing
>>'psql' with a COPY command, and also by using it with a subprocess,
>>both are quite effective.
> 
> 
> I tried this now. Now it's down to 0.45 seconds. It feels a bit hacky to
> run /usr/bin/psql from java, but it sure works. Thanks for the hint!

There was a patch against 7.4 that provided direct JDBC access to
PostgreSQL's COPY.  (I have it installed here and *love* it - it
gives outstanding performance.)  However, it hasn't made it into an
official release yet.  I don't know why; perhaps there's
a problem yet to be solved with it ('works for me', though)?

Is this still on the board?  I won't upgrade past 7.4 without it.

-- 
Steve Wampler -- [EMAIL PROTECTED]
The gods that smiled on your birth are now laughing out loud.

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [PERFORM] COPY vs INSERT

2005-05-03 Thread Mischa Sandberg
> Steven Rosenstein <[EMAIL PROTECTED]> writes:
> > My question is, are there any advantages, drawbacks, or outright
> > restrictions to using multiple simultaneous COPY commands to load
> > data into the same table?

Do you mean, multiple COPY commands (connections) being putline'd from
the same thread (process)? I have indirect evidence that this may hurt.

Two copy commands from different threads/processes are fine, and can
help, if they alternate contention on some other resource (disk/CPU).

I'm basing this on being at the third generation of a COPY
implementation. The app loads about 1M objects/hour from 6 servers. 
Each object is split across four tables.
The batch load opens four connections and firehoses records down each.
A batch is 10K objects.

COPY invokes all the same logic as INSERT on the server side
(row-exclusive locking, transaction log, updating indexes, rules). 
The difference is that all the rows are inserted as a single
transaction. This reduces the number of fsync's on the xlog,
which may be a limiting factor for you. You'll want to crank 
WAL_BUFFERS and CHECKPOINT_SEGMENTS to match, though. 
One of my streams has 6K records; I run with WB=1000, CS=128.
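
(In postgresql.conf terms, the two knobs just mentioned; the numbers are
simply the WB/CS values above, not general recommendations:)

    wal_buffers = 1000          # "WB=1000"
    checkpoint_segments = 128   # "CS=128"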

The downside I found with multiple clients inserting large blocks of
rows was that they serialized. I THINK that's because at some point
they all needed to lock the same portions of the same indexes. I'm still
working on how to avoid that, tuning the batch size and inserting into a
 "queue" table with fewer indexes.

COPY (via putline) didn't do measurably better than INSERT until I
batched 40 newline-separated rows into one putline call, which improved
it 2-3:1. The suspect problem was stalling on the TCP stream; the driver
was flushing small packets. This may or may not be relevant to you;
depends on how much processing (waiting) your app does between posting
of rows.

In such a case, writing alternately to two TCP streams from the same
process increases the likelihood of a stall. I've never tested that
set-up; it would have been heading AWAY from the solution in my case.

Hope that helps.
-- 
Engineers think equations approximate reality.
Physicists think reality approximates the equations.
Mathematicians never make the connection.


---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [PERFORM] batch inserts are "slow"

2005-05-03 Thread Christopher Petrilli
On 5/3/05, Tim Terlegård <[EMAIL PROTECTED]> wrote:
> > Just as on Oracle you would use SQL*Loader for this application, you
> > should use the COPY syntax for PostgreSQL.  You will find it a lot
> > faster.  I have used it by building the input files and executing
> > 'psql' with a COPY command, and also by using it with a subprocess,
> > both are quite effective.
> 
> I tried this now. Now it's down to 0.45 seconds. It feels a bit hacky to
> run /usr/bin/psql from java, but it sure works. Thanks for the hint!

It may feel hacky, but I think if you want to use SQL*Loader on
Oracle, you have to do the same thing.  I know a C++ app that I use
that runs SQL*Loader about once per second to deal with a HUGE volume
(10K/sec).  In fact, moving the load files onto ramdisk has helped a
lot.

Chris
-- 
| Christopher Petrilli
| [EMAIL PROTECTED]

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [PERFORM] batch inserts are "slow"

2005-05-03 Thread Dave Cramer
Kris is correct,

This code was not added or even submitted to CVS. The purpose of this
was to work out the bugs with people who are actually using COPY.

The API is a separate issue, however. There's no reason that COPY can't
support more than one API.

Dave
Kris Jurka wrote:
> On Tue, 3 May 2005, Josh Berkus wrote:
>
>>> There are several hacks floating around that add COPY capabilities to
>>> the pgjdbc driver. As they all are rather simple hacks, they have not
>>> been included in CVS yet, but they tend to work fine.
>>
>> FWIW, Dave Cramer just added beta COPY capability to JDBC.  Contact him on
>> the JDBC list for details; I think he needs testers.
>
> I believe Dave has remerged a patch for COPY I posted over a year ago, but
> he has not yet published it.  I would guess it has the same bugs as the
> original (transaction + error handling) and will meet the same objections
> that kept the original patch out of the driver in the first place (we want
> a friendlier API than just a data stream).
>
> Kris Jurka

--
Dave Cramer
http://www.postgresintl.com
519 939 0336
ICQ#14675561
---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
 subscribe-nomail command to [EMAIL PROTECTED] so that your
 message can get through to the mailing list cleanly


Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-05-03 Thread Mischa Sandberg
Quoting Markus Schaber <[EMAIL PROTECTED]>:

> Hi, Josh,
> 
> Josh Berkus wrote:
> 
> > Yes, actually.   We need 3 different estimation methods:
> > 1 for tables where we can sample a large % of pages (say, >= 0.1)
> > 1 for tables where we sample a small % of pages but are "easily estimated"
> > 1 for tables which are not easily estimated but we can't afford to sample a
> > large % of pages.
> >
> > If we're doing sampling-based estimation, I really don't want people to lose
> > sight of the fact that page-based random sampling is much less expensive than
> > row-based random sampling.   We should really be focusing on methods which
> > are page-based.

Okay, although given the track record of page-based sampling for
n-distinct, it's a bit like looking for your keys under the streetlight,
rather than in the alley where you dropped them :-)

How about applying the distinct-sampling filter on a small extra data
stream to the stats collector? 

-- 
Engineers think equations approximate reality.
Physicists think reality approximates the equations.
Mathematicians never make the connection.


---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-05-03 Thread Josh Berkus
Mischa,

> Okay, although given the track record of page-based sampling for
> n-distinct, it's a bit like looking for your keys under the streetlight,
> rather than in the alley where you dropped them :-)

Bad analogy, but funny.

The issue with page-based vs. pure random sampling is that to do, for example, 
10% of rows purely randomly would actually mean loading 50% of pages.  With 
20% of rows, you might as well scan the whole table.
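
(As a rough illustration of why: assuming around seven rows per page, a
uniform 10% row sample touches about 1 - 0.9^7 ≈ 52% of pages, and a 20%
sample about 1 - 0.8^7 ≈ 79%.)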

Unless, of course, we use indexes for sampling, which seems like a *really 
good* idea to me 

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-05-03 Thread John A Meinel
Josh Berkus wrote:
> Mischa,
>
>> Okay, although given the track record of page-based sampling for
>> n-distinct, it's a bit like looking for your keys under the streetlight,
>> rather than in the alley where you dropped them :-)
>
> Bad analogy, but funny.
>
> The issue with page-based vs. pure random sampling is that to do, for example,
> 10% of rows purely randomly would actually mean loading 50% of pages.  With
> 20% of rows, you might as well scan the whole table.
>
> Unless, of course, we use indexes for sampling, which seems like a *really
> good* idea to me

But doesn't an index only sample one column at a time, whereas with
page-based sampling you can sample all of the columns at once? And not
all columns have indexes, though it could be assumed that if a column
doesn't have an index, then it doesn't matter as much for calculations
such as n_distinct.

But if you had 5 indexed columns in your table, then doing it index-wise
means you would have to make 5 passes instead of just one.

Though I agree that page-based sampling is important for performance
reasons.

John
=:->




Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-05-03 Thread Josh Berkus
John,

> But doesn't an index only sample one column at a time, whereas with
> page-based sampling, you can sample all of the columns at once. 

Hmmm.  Yeah, we're not currently doing that though.  Another good idea ...

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


[PERFORM] Kernel Resources and max_connections

2005-05-03 Thread Chris Hebrard
Hi,
I have postgres 8.0.2 installed on FreeBSD 4.11-RELEASE with 2GB 
of RAM.

When trying to set max_connections=256 I get the following error message:
FATAL:  could not create semaphores: No space left on device
DETAIL:  Failed system call was semget(5432017, 17, 03600).
HINT:  This error does *not* mean that you have run out of disk space.
   It occurs when either the system limit for the maximum number of 
semaphore sets (SEMMNI), or the system wide maximum number of semaphores 
(SEMMNS), would be exceeded.  You need to raise the respective kernel 
parameter.  Alternatively, reduce PostgreSQL's consumption of semaphores 
by reducing its max_connections parameter (currently 256).
   The PostgreSQL documentation contains more information about 
configuring your system for PostgreSQL.
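
(For reference, the 8.0 documentation sizes these as SEMMNI >=
ceil(max_connections / 16) and SEMMNS >= ceil(max_connections / 16) * 17,
plus room for other applications. The arithmetic for 256 connections can
even be checked from psql:)

    SELECT ceil(256 / 16.0)      AS sem_sets,   -- 16, the SEMMNI floor
           ceil(256 / 16.0) * 17 AS semaphores; -- 272, the SEMMNS floor

Note that kern.ipc.semmns=272 below meets that floor exactly, leaving
nothing for any other process on the system that uses semaphores.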

I have read through the kernel resources documentation for postgres 8 
and set values accordingly. 
Some settings are not staying after a reboot, even if I place them in 
/boot/loader.conf.

So far I'm able to get max_connections to 250.
Here is a dump of kern.ipc values:
kern.ipc.maxsockbuf: 262144
kern.ipc.sockbuf_waste_factor: 8
kern.ipc.somaxconn: 128
kern.ipc.max_linkhdr: 16
kern.ipc.max_protohdr: 60
kern.ipc.max_hdr: 76
kern.ipc.max_datalen: 136
kern.ipc.nmbclusters: 65536
kern.ipc.msgmax: 16384
kern.ipc.msgmni: 40
kern.ipc.msgmnb: 2048
kern.ipc.msgtql: 40
kern.ipc.msgssz: 8
kern.ipc.msgseg: 2048
kern.ipc.semmap: 30
kern.ipc.semmni: 256
kern.ipc.semmns: 272
kern.ipc.semmnu: 30
kern.ipc.semmsl: 60
kern.ipc.semopm: 100
kern.ipc.semume: 10
kern.ipc.semusz: 92
kern.ipc.semvmx: 32767
kern.ipc.semaem: 16384
kern.ipc.shmmax: 33554432
kern.ipc.shmmin: 1
kern.ipc.shmmni: 192
kern.ipc.shmseg: 128
kern.ipc.shmall: 8192
kern.ipc.shm_use_phys: 0
kern.ipc.shm_allow_removed: 0
kern.ipc.mbuf_wait: 32
kern.ipc.mbtypes: 38 551 3 0 0 0 0 0 0 0 0 0 0 0 0 0
kern.ipc.nmbufs: 262144
kern.ipc.nsfbufs: 8704
kern.ipc.nsfbufspeak: 7
kern.ipc.nsfbufsused: 0
kern.ipc.m_clreflimithits: 0
kern.ipc.mcl_pool_max: 0
kern.ipc.mcl_pool_now: 0
kern.ipc.maxsockets: 65536
And /boot/loader.conf:
userconfig_script_load="YES"
kern.ipc.nmbclusters="65536"
kern.maxfiles="65536"
kern.maxfilesperproc="65536"
net.inet.tcp.mssdflt="1460"
kern.somaxconn="4096"
kern.ipc.semmns="272"
kern.ipc.semmni="256"
kern.ipc.shmmax="66099200"
kern.ipc.shmmax and kern.ipc.shmmin will not stay to what I set them to.
What am I doing wrong or not doing at all?
Your help is greatly appreciated.
Regards,
Chris.



---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [PERFORM] Kernel Resources and max_connections

2005-05-03 Thread Mark Kirkwood
Chris Hebrard wrote:
> kern.ipc.shmmax and kern.ipc.shmmin will not stay to what I set them to.
>
> What am I doing wrong or not doing at all?

These need to go in /etc/sysctl.conf. You might need to set shmall as well.

(This not-very-clear distinction between what is sysctl'able and what is
a kernel tunable is a bit of a downer.)

cheers
Mark

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [PERFORM] Kernel Resources and max_connections

2005-05-03 Thread Chris Hebrard
Mark Kirkwood wrote:
> Chris Hebrard wrote:
>> kern.ipc.shmmax and kern.ipc.shmmin will not stay to what I set them to.
>>
>> What am I doing wrong or not doing at all?
>
> These need to go in /etc/sysctl.conf. You might need to set shmall as
> well.
>
> (This not-very-clear distinction between what is sysctl'able and what
> is a kernel tunable is a bit of a downer.)

Thanks for your reply,

I set the values in /etc/sysctl.conf:
# $FreeBSD: src/etc/sysctl.conf,v 1.1.2.3 2002/04/15 00:44:13 dougb Exp $
#
#  This file is read when going to multi-user and its contents piped thru
#  ``sysctl'' to adjust kernel values.  ``man 5 sysctl.conf'' for details.
#
# Added by IMP 2005-05-04
net.inet.tcp.rfc1323=1
kern.ipc.somaxconn=1024
kern.ipc.maxsockbuf=8388608
net.inet.tcp.sendspace=3217968
net.inet.tcp.recvspace=3217968
kern.ipc.semmns="272"
kern.ipc.semmni="256"
kern.ipc.shmmax="66099200"
kern.ipc.shmmin="256"
After a restart both shmmax and shmmin are now 0 and postgres failed to 
start.

kern.ipc.maxsockbuf: 8388608
kern.ipc.sockbuf_waste_factor: 8
kern.ipc.somaxconn: 1024
kern.ipc.max_linkhdr: 16
kern.ipc.max_protohdr: 60
kern.ipc.max_hdr: 76
kern.ipc.max_datalen: 136
kern.ipc.nmbclusters: 65536
kern.ipc.msgmax: 16384
kern.ipc.msgmni: 40
kern.ipc.msgmnb: 2048
kern.ipc.msgtql: 40
kern.ipc.msgssz: 8
kern.ipc.msgseg: 2048
kern.ipc.semmap: 30
kern.ipc.semmni: 10
kern.ipc.semmns: 60
kern.ipc.semmnu: 30
kern.ipc.semmsl: 60
kern.ipc.semopm: 100
kern.ipc.semume: 10
kern.ipc.semusz: 92
kern.ipc.semvmx: 32767
kern.ipc.semaem: 16384
kern.ipc.shmmax: 0
kern.ipc.shmmin: 0
kern.ipc.shmmni: 192
kern.ipc.shmseg: 128
kern.ipc.shmall: 8192
kern.ipc.shm_use_phys: 0
kern.ipc.shm_allow_removed: 0
kern.ipc.mbuf_wait: 32
kern.ipc.mbtypes: 24 550 2 0 0 0 0 0 0 0 0 0 0 0 0 0
kern.ipc.nmbufs: 262144
kern.ipc.nsfbufs: 8704
kern.ipc.nsfbufspeak: 0
kern.ipc.nsfbufsused: 0
kern.ipc.m_clreflimithits: 0
kern.ipc.mcl_pool_max: 0
kern.ipc.mcl_pool_now: 0
kern.ipc.maxsockets: 65536
I'm lost here.
Chris.
---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


[PERFORM] Kernel Resources Solved

2005-05-03 Thread Chris Hebrard
Problem solved by setting:

kern.ipc.semmni: 280
kern.ipc.semmns: 300

Chris.

Mark Kirkwood wrote:
> Chris Hebrard wrote:
>> kern.ipc.shmmax and kern.ipc.shmmin will not stay to what I set them
>> to.
>>
>> What am I doing wrong or not doing at all?
>
> These need to go in /etc/sysctl.conf. You might need to set shmall as
> well.
>
> (This not-very-clear distinction between what is sysctl'able and what
> is a kernel tunable is a bit of a downer.)

kern.ipc.maxsockbuf: 8388608
kern.ipc.sockbuf_waste_factor: 8
kern.ipc.somaxconn: 1024
kern.ipc.max_linkhdr: 16
kern.ipc.max_protohdr: 60
kern.ipc.max_hdr: 76
kern.ipc.max_datalen: 136
kern.ipc.nmbclusters: 65536
kern.ipc.msgmax: 16384
kern.ipc.msgmni: 40
kern.ipc.msgmnb: 2048
kern.ipc.msgtql: 40
kern.ipc.msgssz: 8
kern.ipc.msgseg: 2048
kern.ipc.semmap: 30
kern.ipc.semmni: 256
kern.ipc.semmns: 272
kern.ipc.semmnu: 30
kern.ipc.semmsl: 60
kern.ipc.semopm: 100
kern.ipc.semume: 10
kern.ipc.semusz: 92
kern.ipc.semvmx: 32767
kern.ipc.semaem: 16384
kern.ipc.shmmax: 66099200
kern.ipc.shmmin: 256
kern.ipc.shmmni: 192
kern.ipc.shmseg: 128
kern.ipc.shmall: 8192
kern.ipc.shm_use_phys: 0
kern.ipc.shm_allow_removed: 0
kern.ipc.mbuf_wait: 32
kern.ipc.mbtypes: 37 552 3 0 0 0 0 0 0 0 0 0 0 0 0 0
kern.ipc.nmbufs: 262144
kern.ipc.nsfbufs: 8704
kern.ipc.nsfbufspeak: 4
kern.ipc.nsfbufsused: 0
kern.ipc.m_clreflimithits: 0
kern.ipc.mcl_pool_max: 0
kern.ipc.mcl_pool_now: 0
kern.ipc.maxsockets: 65536

I've got the values to what I want them to be now, after loading some 
values in loader.conf and others in sysctl.conf.

loader.conf:
userconfig_script_load="YES"
kern.ipc.nmbclusters="65536"
kern.maxfiles="65536"
kern.maxfilesperproc="65536"
net.inet.tcp.mssdflt="1460"
kern.somaxconn="4096"
kern.ipc.semmns="272"
kern.ipc.semmni="256"
sysctl.conf:
net.inet.tcp.rfc1323=1
kern.ipc.somaxconn=1024
kern.ipc.maxsockbuf=8388608
net.inet.tcp.sendspace=3217968
net.inet.tcp.recvspace=3217968
kern.ipc.shmmax=66099200
kern.ipc.shmmin=256
kern.ipc.shmall=16138
and kern.ipc values are now:
kern.ipc.maxsockbuf: 8388608
kern.ipc.sockbuf_waste_factor: 8
kern.ipc.somaxconn: 1024
kern.ipc.max_linkhdr: 16
kern.ipc.max_protohdr: 60
kern.ipc.max_hdr: 76
kern.ipc.max_datalen: 136
kern.ipc.nmbclusters: 65536
kern.ipc.msgmax: 16384
kern.ipc.msgmni: 40
kern.ipc.msgmnb: 2048
kern.ipc.msgtql: 40
kern.ipc.msgssz: 8
kern.ipc.msgseg: 2048
kern.ipc.semmap: 30
kern.ipc.semmni: 256
kern.ipc.semmns: 272
kern.ipc.semmnu: 30
kern.ipc.semmsl: 60
kern.ipc.semopm: 100
kern.ipc.semume: 10
kern.ipc.semusz: 92
kern.ipc.semvmx: 32767
kern.ipc.semaem: 16384
kern.ipc.shmmax: 66099200
kern.ipc.shmmin: 256
kern.ipc.shmmni: 192
kern.ipc.shmseg: 128
kern.ipc.shmall: 16138
kern.ipc.shm_use_phys: 0
kern.ipc.shm_allow_removed: 0
kern.ipc.mbuf_wait: 32
kern.ipc.mbtypes: 7 550 3 0 0 0 0 0 0 0 0 0 0 0 0 0
kern.ipc.nmbufs: 262144
kern.ipc.nsfbufs: 8704
kern.ipc.nsfbufspeak: 6
kern.ipc.nsfbufsused: 0
kern.ipc.m_clreflimithits: 0
kern.ipc.mcl_pool_max: 0
kern.ipc.mcl_pool_now: 0
kern.ipc.maxsockets: 65536
Postgres still refuses to start with 256 max_connections.
Chris.



---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-05-03 Thread Mischa Sandberg
Quoting Josh Berkus:
 
> Mischa, 
>  
> > Okay, although given the track record of page-based sampling for 
> > n-distinct, it's a bit like looking for your keys under the 
> streetlight, 
> > rather than in the alley where you dropped them :-) 
>  
> Bad analogy, but funny. 
 
Bad analogy? Page-sampling effort versus row-sampling effort, c'est 
moot. It's not good enough for stats to produce good behaviour on the 
average. Straight random sampling, page or row, is going to cause 
enough untrustworthy engine behaviour, for any %ages small enough to 
allow sampling from scratch at any time. 
 
I'm curious what the problem is with relying on a start-up plus 
incremental method, when the method in the distinct-sampling paper 
doesn't degenerate: you can start when the table is still empty. 
Constructing an index requires an initial full scan plus incremental 
update; what's the diff? 
 
> Unless, of course, we use indexes for sampling, which seems like a 
> *really  
> good* idea to me  
 
"distinct-sampling" applies for indexes, too. I started tracking the 
discussion of this a bit late.  Smart method for this is in VLDB'92: 
Gennady Antoshenkov, "Random Sampling from Pseudo-ranked B+-trees". I 
don't think this is online anywhere, except if you have a DBLP 
membership. Does anybody else know better? 
Antoshenkov was the brains behind some of the really cool stuff in DEC 
Rdb (what eventually became Oracle). Compressed bitmap indices, 
parallel competing query plans, and smart handling of keys with 
hyperbolic distributions.  
--  
Engineers think equations approximate reality. 
Physicists think reality approximates the equations. 
Mathematicians never make the connection. 


---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [PERFORM] Kernel Resources and max_connections

2005-05-03 Thread Jim C. Nasby
On Wed, May 04, 2005 at 01:46:34PM +1200, Mark Kirkwood wrote:
> (This not-very-clear distinction between what is sysctl'able and what is
> a kernel tunable is a bit of a downer).

I think this is documented somewhere, though I can't think of where
right now.

Also, note that some sysctl's can only be set in /boot/loader.conf.
hw.ata.wc=0 is an example (which you want to set on any box with IDE
drives if you want fsync to actually do what it thinks it's doing).
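
(That is, a single line in /boot/loader.conf, following the loader.conf
quoting convention used elsewhere in this thread:)

    hw.ata.wc="0"   # disable IDE write caching so fsync actually reaches the platters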
-- 
Jim C. Nasby, Database Consultant   [EMAIL PROTECTED] 
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [PERFORM] Kernel Resources and max_connections

2005-05-03 Thread Mark Kirkwood
Chris Hebrard wrote:
> I set the values in /etc/sysctl.conf:
>
> # $FreeBSD: src/etc/sysctl.conf,v 1.1.2.3 2002/04/15 00:44:13 dougb Exp $
> #
> #  This file is read when going to multi-user and its contents piped thru
> #  ``sysctl'' to adjust kernel values.  ``man 5 sysctl.conf'' for details.
> #
> # Added by IMP 2005-05-04
> net.inet.tcp.rfc1323=1
> kern.ipc.somaxconn=1024
> kern.ipc.maxsockbuf=8388608
> net.inet.tcp.sendspace=3217968
> net.inet.tcp.recvspace=3217968
> kern.ipc.semmns="272"
> kern.ipc.semmni="256"
> kern.ipc.shmmax="66099200"
> kern.ipc.shmmin="256"
>
> After a restart both shmmax and shmmin are now 0 and postgres failed to
> start.


Hmmm - puzzling. One point to check, did you take them out of 
/boot/loader.conf ?

Assuming so, maybe don't quote 'em (see below).
Finally, you need to set shmall, otherwise it will over(under)ride the
shmmax setting. So try:

net.inet.tcp.rfc1323=1
kern.ipc.somaxconn=1024
kern.ipc.maxsockbuf=8388608
net.inet.tcp.sendspace=3217968
net.inet.tcp.recvspace=3217968
kern.ipc.semmns=272
kern.ipc.semmni=256
kern.ipc.shmmax=66099200
kern.ipc.shmmin=256
kern.ipc.shmall=32768



---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]