Re: [GENERAL] FUNCTION problem

2004-04-02 Thread Greg Stark

Sky <[EMAIL PROTECTED]> writes:

>   uid CHARACTER(20) NOT NULL,
>   pwd CHARACTER(20) NOT NULL,

Incidentally, are you sure you want character(20) ? The input will be padded
out to 20 characters with spaces. Usually people find varchar() more
convenient.
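
The padding is easy to see from psql. A small illustration (octet_length reports the stored length):

```sql
SELECT octet_length('sky'::character(20)) AS padded_length,  -- 20: space-filled
       octet_length('sky'::varchar(20))   AS varchar_length; -- 3: stored as typed
```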

-- 
greg


---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html


[GENERAL] glibc update.. effect on PG

2004-04-02 Thread JM
Hi,

I want to upgrade glibc on one of my boxes, and I have installed postgres from
source.  If I update glibc, do I have to recompile postgres?

TIA,
jm



[GENERAL] PostgreSQL 7.4.2 warnings in HPUX IPF: LOG: could not resolve "localhost": host nor service provided

2004-04-02 Thread Durai
Hello All,
  I built postgresql on the HPUX IPF (11.22) platform.  It works fine, but I
get the warning "could not resolve "localhost": host nor service provided,
or not known" when I start the postmaster.  I have also added entries to
pg_hba.conf for access.
$ ./postmaster -D /var/opt/iexpress/postgresql &
[1] 2410
$ LOG:  could not resolve "localhost": host nor service provided, or not known
LOG:  database system was shut down at 2004-03-26 00:54:37 PST
LOG:  checkpoint record is at 0/9DE158
LOG:  redo record is at 0/9DE158; undo record is at 0/0; shutdown TRUE
LOG:  next transaction ID: 536; next OID: 17142
LOG:  database system is ready
$

Is there any configuration missed?
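
This warning is about OS-level name resolution of "localhost", not about
pg_hba.conf.  One thing worth checking (a guess from the symptom, not a
confirmed HP-UX fix) is that a loopback entry exists and that the resolver
consults the hosts file:

```
# /etc/hosts -- make sure a loopback entry exists
127.0.0.1   localhost

# /etc/nsswitch.conf -- make sure "files" is consulted for host lookups
hosts: files dns
```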

Regs,
Durai.





[GENERAL] Optimization on UPDATEs and FOREIGN KEYs...

2004-04-02 Thread Sean Chittenden
Does the optimizer optimize away the foreign key checks on a primary 
key if its value doesn't change, even though it's had a value assigned 
to it?  Here's the example:

CREATE TABLE t1 (
  i INT PRIMARY KEY,
  j TEXT
);
CREATE TABLE t2 (
  i INT,
  k INT8,
  FOREIGN KEY(i) REFERENCES t1(i)
);
INSERT INTO t1 (i,j) VALUES (1,'foo');
UPDATE t1 SET i = 1 WHERE i = 1;
Does the optimizer optimize away the foreign key checks since t1.i's 
value hasn't changed (OLD.i = NEW.i)?  I couldn't find anything that 
suggested that this statement became a no-op internally.  In the 
EXPLAIN output, it's clear that the backend is searching through t1.i's 
index, but I don't know if the UPDATE is scanning through t2 looking 
for key violations.  Since foreign key constraints don't appear in the 
output of EXPLAIN VERBOSE (or maybe it does and it's already doing this 
optimization and is trimming it before the VERBOSE output is produced) 
and grep(1) wasn't of much use, I figured I'd ask.  TIA.  -sc

--
Sean Chittenden


Re: [GENERAL] row-level security model

2004-04-02 Thread Adam Witney
On 2/4/04 4:50 am, "John DeSoi" <[EMAIL PROTECTED]> wrote:

> Marc,
> 
> On Apr 1, 2004, at 4:21 PM, Marc Durham wrote:
> 
>> Do you think this would work?
>> There are a lot of joins. And I assumed it would need to look up the
>> parent's and teacher's  usernames, and that your many-to-many
>> relationships
>> were in tables like students_parent_link.
>> 
> 
> Thanks very much -- this helped get me started.
> 
> I think I finally came up with something that lets me have modularized
> access functions where I can combine access but still maintain a single
> view. Here is a rough example.
> 
> Let's say I have two access functions contact_students and
> staff_teaches_students. If the current user is in the contact group it
> returns the primary keys (integer dbid in my example) of the related
> students. Similarly, if the current user is on the teaching staff, it
> returns the keys for all students in his/her classes. So I create a
> function to combine all of my access functions with union:
> 
> create or replace function student_access ()
> returns setof integer as '
> select * from contact_students()
> union select * from staff_teaches_students();
> ' language sql;
> 
> Then my view is
> 
> create view student_v as select student.* from student,
> student_access() as id
> where student.dbid = id;
> 
> 
> Comments/criticisms about design or performance issues?
> 
> Is there a way to provide column security without creating different
> views for every possible scenario?

Hi John,

I don't know if this will fit your needs, but this is how I handled
row-level security in an application I have.  It uses arrays, so it is
probably PostgreSQL-specific...  Basically I have person and groups tables:

CREATE TABLE person (
  person_id     INT4    NOT NULL,
  ..
  ..
  username      TEXT    NOT NULL,
  lab_group_id  INT4    NOT NULL,
  groups_ids    INT[]   NULL
);

CREATE TABLE groups (
  group_id  INT4  NOT NULL,
  name      TEXT  NOT NULL
);

Then each object has a base table:

CREATE TABLE experiment_base (
  expt_id     INT4    NOT NULL,
  ..
  ..
  owner_id    INT     NOT NULL,
  writer_id   INT[]   NOT NULL,
  readers_id  INT[]   NOT NULL
);

I can then control who can update the row at the user level, and who can
read the row at the group level using a view like so:

CREATE OR REPLACE VIEW experiment
  AS
SELECT 
  FROM experiment_base a,
       person b
 WHERE a.owner_id = b.person_id
   AND (readers_id &&
          (select groups_ids from person a where a.username = current_user)
        OR
          (select person_id from person a where a.username = current_user)
            = ANY (writer_id)
        OR
          owner_id =
            (select person_id from person a where a.username = current_user));

I then have a couple of functions to add or remove group_ids from the
readers_id array, and to add or remove person_ids from the writer_id
array.

I don't have large numbers of users or groups, so it performs OK...  Not
sure how the array approach will scale with more, though.
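
For anyone curious, the add helper can be tiny.  A sketch only -- the
function name and the duplicate check are mine, and the array operators
assume a reasonably recent PostgreSQL:

```sql
-- Grant a group read access to one experiment (schema as in the tables above)
CREATE OR REPLACE FUNCTION add_reader_group(INT4, INT4)
RETURNS VOID AS '
    UPDATE experiment_base
       SET readers_id = readers_id || ARRAY[$2]   -- append the group_id
     WHERE expt_id = $1
       AND NOT ($2 = ANY (readers_id));           -- skip if already present
' LANGUAGE sql;
```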

I don't think this is a classical approach, but it seems to work for me.
I would appreciate comments/criticisms from others.

Cheers

Adam







Re: [GENERAL] FUNCTION problem

2004-04-02 Thread Bill Moran
Sky wrote:
> HI everybody !
> 
> I have a problem, but I don't know the solution:
> 
> CREATE TABLE person(
>  user_id SERIAL NOT NULL,
>  uid CHARACTER(20) NOT NULL,
>  pwd CHARACTER(20) NOT NULL,
>  PRIMARY KEY (user_id)
>  );
> OK, That's right...
> 
> CREATE FUNCTION getuserid (CHARACTER(20),CHARACTER(20))
> RETURNS SETOF INTEGER
> AS
> '
> SELECT user_id FROM person WHERE uid=$1 AND pwd=$2;
> '
> LANGUAGE 'sql';
> :-(
> 
> ERROR:  Unable to identify an operator '=$' for types 'character' and
> 'integer'
> You will have to retype this query using an explicit cast.
You don't state what version of Postgres you're using, but I'll bet that
it's 7.3 or older.

Be a little more liberal with spaces to separate the tokens in your
statement, i.e.:

SELECT user_id FROM person WHERE uid = $1 AND pwd = $2;

It seems to me that 7.3 and older don't parse quite as intelligently
as 7.4 does (which would explain why other people are saying "it works
for me").  What appears to be happening is that Postgres 7.3 looks at
uid=$1 and breaks it down into uid =$ 1, but (unless you created one)
it doesn't know anything about how to use =$ as a comparison, so it
throws an error.

7.4 seems to get this right more often, but that may be a bug in
the other direction ... I mean, what if you defined a =$ operator and
really wanted to compare uid =$ 1?

I think the real solution is to write your SQL so it's unambiguous
to the parser.  I saw this as a suggestion for C programming a few
years ago: always separate tokens with spaces (even if not strictly
necessary) to make the statement unambiguous to the parser, as well as
easier for humans to read.  I think it's good advice all around.
--
Bill Moran
Potential Technologies
http://www.potentialtech.com


Re: [GENERAL] select statement sorting

2004-04-02 Thread Peter Eisentraut
Alexander Cohen wrote:
> i will get them back in the following order in the tuples:
> Alex
> Barbara
> Cohen
> alex
>
> But i want them back like this:
> Alex
> alex
> Barbara
> Cohen

Set your locale to something other than C.
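
If re-running initdb with a different locale is not convenient, a per-query
workaround (table and column names are just for illustration) is:

```sql
SELECT name FROM people
 ORDER BY lower(name), name;  -- case-insensitive first, raw value as tie-break
-- Alex, alex, Barbara, Cohen
```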




[GENERAL] column oid ?

2004-04-02 Thread sferriol
Hello,
I see that each table has an OID, but what about the columns?  Is there a
column OID?
sylvain
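
For what it's worth, columns are not identified by an OID of their own but
by the table's OID plus an attribute number in pg_attribute; something like
this (the table name is only an example):

```sql
SELECT attrelid, attnum, attname
  FROM pg_attribute
 WHERE attrelid = 'person'::regclass   -- the table's OID
   AND attnum > 0                      -- skip system columns
   AND NOT attisdropped;
```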




Re: [GENERAL] Optimization on UPDATEs and FOREIGN KEYs...

2004-04-02 Thread Tom Lane
Sean Chittenden <[EMAIL PROTECTED]> writes:
> Does the optimizer optimize away the foreign key checks on a primary 
> key if its value doesn't change, even though it's had a value assigned 
> to it?

The optimizer has nothing to do with this, but in most cases the foreign
key triggers themselves will short-circuit when the key value is found
to be the same as before.

regards, tom lane



Re: [GENERAL] row-level security model

2004-04-02 Thread Bruno Wolff III
On Thu, Apr 01, 2004 at 22:50:48 -0500,
  John DeSoi <[EMAIL PROTECTED]> wrote:
> 
> Comments/criticisms about design or performance issues?

Except for SQL functions which can be inlined, the optimizer can't
optimize what is being done in the function with what is being done
in the select statement. This might be a performance issue depending
on your particular circumstances.

> Is there a way to provide column security without creating different 
> views for every possible scenario?

You should be able to have one view per table and use an OR in the WHERE
clause to check for student, teacher, or parent access.  These tests
should be simple enough that you can just do the joins in the view.



Re: [GENERAL] Large DB

2004-04-02 Thread Manfred Koizar
On Thu, 01 Apr 2004 12:22:58 +0200, I wrote:
>BTW, ANALYSE is basically a constant time operation.

On closer inspection, this is not the whole truth.  ANALY[SZ]E is a two
stage process:  First it collects a sample of rows, then these rows are
examined to produce various statistics.

The cost of the latter depends on the sample size, which itself depends
on the default or column-specific statistics target, and the number (and
types) of columns, so it *should* take more or less constant time.
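
The target, and hence the sample size, is tunable; e.g. (the table name is
illustrative, and the exact sample-size formula is internal):

```sql
-- Session default for columns that have no explicit per-column target
SET default_statistics_target = 100;
ANALYZE orders;   -- sample size grows with the target
```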

The first step, however, (acquire_sample_rows() in analyze.c) has to
read more rows than finally end up in the sample.  It visits less than
O(nblocks) pages but certainly more than O(1).

A vague feeling tries to tell me that the number of page reads is
somehow related to the harmonic numbers 1 + 1/2 + 1/3 + ... + 1/n, which
grow like O(ln(n)).

I have an idea how this could be done with O(1) page reads.  If I'm able
to translate it into C, I'll send a patch ...

Servus
 Manfred



Re: [GENERAL] row-level security model

2004-04-02 Thread Bruno Wolff III
On Thu, Apr 01, 2004 at 13:21:15 -0800,
  Marc Durham <[EMAIL PROTECTED]> wrote:
> From: "John DeSoi" <[EMAIL PROTECTED]>
> 
> Do you think this would work?
> There are a lot of joins. And I assumed it would need to look up the
> parent's and teacher's  usernames, and that your many-to-many relationships
> were in tables like students_parent_link.
> 
> CREATE VIEW your_students AS
> SELECT s.*
>  FROM student AS s
>   INNER JOIN class_student_link AS cs ON s.student_id = cs.student_id
>   INNER JOIN class AS c ON cs.class_id = c.class_id
>   INNER JOIN students_parent_link AS sp ON s.student_id = sp.student_id
>   INNER JOIN parent AS p ON sp.parent_id = p.parent_id
>   INNER JOIN teacher AS t ON c.teacher_id = t.teacher_id
> WHERE t.username = CURRENT_USER()
>  OR p.username = CURRENT_USER()

This makes the implicit assumption that students always have at least
one parent and at least one teacher. If that isn't necessarily true
you will need to use a couple of left (or right) joins or records of
students missing one or the other will not be accessible.
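
A sketch of the same view with outer joins so parent-less or teacher-less
students remain visible (table and column names as in the quoted example;
DISTINCT added to collapse the row multiplication from the joins):

```sql
CREATE VIEW your_students AS
SELECT DISTINCT s.*
  FROM student AS s
  LEFT JOIN class_student_link   AS cs ON s.student_id = cs.student_id
  LEFT JOIN class                AS c  ON cs.class_id  = c.class_id
  LEFT JOIN teacher              AS t  ON c.teacher_id = t.teacher_id
  LEFT JOIN students_parent_link AS sp ON s.student_id = sp.student_id
  LEFT JOIN parent               AS p  ON sp.parent_id = p.parent_id
 WHERE t.username = CURRENT_USER
    OR p.username = CURRENT_USER;
```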



Re: [GENERAL] Large DB

2004-04-02 Thread Tom Lane
Manfred Koizar <[EMAIL PROTECTED]> writes:
> The first step, however, (acquire_sample_rows() in analyze.c) has to
> read more rows than finally end up in the sample.  It visits less than
> O(nblocks) pages but certainly more than O(1).

> A vague feeling tries to tell me that the number of page reads is
> somehow related to the harmonic numbers 1 + 1/2 + 1/3 + ... + 1/n, which
> grow like O(ln(n)).

Good guess.  Vitter's paper says the expected time to sample n rows from
a table of size N is O(n * (1 + log(N/n))).

> I have an idea how this could be done with O(1) page reads.

The hard part is getting a genuinely random sample when we don't know N
in advance.  We do however know the table size in blocks, so if you're
willing to make assumptions about constant tuple density you could do
something different.  (But the tuple density assumption is exactly the
weak spot of what we've got, so I'm unconvinced that would be a big step
forward.)

regards, tom lane



Re: [GENERAL] Compound keys and foreign constraints

2004-04-02 Thread wespvp
On 4/1/04 6:00 PM, "Manfred Koizar" <[EMAIL PROTECTED]> wrote:

> And what are your settings for random_page_cost, effective_cache_size,
> and sort_mem?

I didn't read closely enough last time.  random_page_cost, cpu_tuple_cost,
cpu_index_tuple_cost, and cpu_operator_cost are all at their defaults.
effective_cache_size is 5, and sort_mem is 8192; shared_buffers is 16384.

I've tried setting:

  random_page_cost      1 - 20
  cpu_tuple_cost        1 - .001
  cpu_index_tuple_cost  1 - .1
  sort_mem              1000 - 65535
  effective_cache_size  1000 - 100,000

Nothing gets it to use the (message_key, message_date) index if there is a
(message_key) index defined.  The only thing that changed the plan at all
was when I changed random_page_cost to greater than 9 (see below).

Other than that, I am still in the catch 22 - index (message_key) is
required for the foreign key constraints, but index(message_key,
message_date) will never be used if index (message_key) is defined.

Is this a bug in the planner that can be fixed?  It sure would be helpful
if I could specify a planner hint like "use index xxx".

---

I just did some more testing.  At random_page_cost=1, the trivial case
picks the compound index "messages_pkey", but the real case still does the
'filter' with the (message_key) index.

However, if I set random_page_cost to less than 1 (e.g. 0.5) then I can get
it to use the compound index.

Setting random_page_cost down from 4 to .5 seems like it wouldn't be a good
idea.  However, at this point it seems to be the only solution.

Wes


db=>set random_page_cost=10;

db=> explain select count(*) from messages m, message_recipients r,
addresses a WHERE r.Message_Key=m.Message_Key AND
a.Address='[EMAIL PROTECTED]' AND a.Address_Key=r.Recipient AND (
(m.Message_Date >= '29-MAR-04') AND (m.Message_Date <=
TO_TIMESTAMP('31-MAR-04 23:59:59', 'dd-mon-yy HH24:MI:SS')::timestamp
without time zone));
   
                                QUERY PLAN
--------------------------------------------------------------------------
 Aggregate  (cost=595569.79..595569.79 rows=1 width=0)
   ->  Nested Loop  (cost=564647.77..595569.78 rows=2 width=0)
 Join Filter: ("outer".address_key = "inner".recipient)
 ->  Index Scan using addresses_i_address on addresses a
(cost=0.00..11.97 rows=2 width=11)
   Index Cond: ((address)::text =
'[EMAIL PROTECTED]'::text)
 ->  Materialize  (cost=564647.77..572920.00 rows=574623 width=10)
   ->  Nested Loop  (cost=0.00..562121.77 rows=574623 width=10)
 ->  Index Scan using messages_i_mdate on messages m
(cost=0.00..123060.87 rows=100789 width=11)
   Index Cond: ((message_date >= '2004-03-29
00:00:00'::timestamp without time zone) AND (message_date <=
(to_timestamp('31-MAR-04 23:59:59'::text, 'dd-mon-yy
HH24:MI:SS'::text))::timestamp without time zone))
 ->  Index Scan using message_recipients_i_message on
message_recipients r  (cost=0.00..3.70 rows=52 width=21)
   Index Cond: (r.message_key = "outer".message_key)



db=> set random_page_cost=1;
SET
Time: 0.342 ms


db=> explain select count(*) from messages where message_key=12345 and
(message_date = '2004-03-29 00:00:00'::timestamp without time zone);
                                QUERY PLAN
--------------------------------------------------------------------------
 Aggregate  (cost=3.02..3.02 rows=1 width=0)
   ->  Index Scan using messages_pkey on messages  (cost=0.00..3.02 rows=1
width=0)
 Index Cond: ((message_key = 12345::numeric) AND (message_date =
'2004-03-29 00:00:00'::timestamp without time zone))
(3 rows)


db=> explain analyze select count(*) from messages m, message_recipients r,
addresses a WHERE r.Message_Key=m.Message_Key AND
a.Address='[EMAIL PROTECTED]' AND a.Address_Key=r.Recipient AND (
(m.Message_Date >= '29-MAR-04') AND (m.Message_Date <=
TO_TIMESTAMP('31-MAR-04 23:59:59', 'dd-mon-yy HH24:MI:SS')::timestamp
without time zone));
   
                                QUERY PLAN
--------------------------------------------------------------------------
 Aggregate  (cost=62514.26..62514.26 rows=1 width=0) (actual
time=336976.694..336976.694 rows=1 loops=1)
   ->  Nested Loop  (cost=0.00..62514.25 rows=2 width=0) (actual
time=119.178..336959.210 rows=8812 loops=1)
 ->  Nested Loop  (cost=0.00..61907.05 rows=200 width=11) (actual
time=83.232..32412.459 rows=312741 loops=1)
   ->  Index Scan using addresses_i_address on addresses a
(cost=0.00..3.01 rows=2 width=11) (actual time=0.074..0.517 rows=1 loops=1)
 Index Cond: ((address)::text =
'[EMAIL PROTECTED]'::text)

[GENERAL] execute function after user connect

2004-04-02 Thread John DeSoi
Is there a mechanism to execute a function after the user connects to 
the database? I would like to set the schema path based on the user's 
group membership. It would also be nice to log to a table on 
connect/disconnect.
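
The schema-path half of this can be had without a connect hook via per-user
settings (the role and schema names here are examples); the
connect/disconnect logging really does seem to need something external:

```sql
-- Applied automatically at the start of each session for this user
ALTER USER teacher1 SET search_path = school, public;
```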

Thanks,

John DeSoi, Ph.D.



Re: [GENERAL] Compound keys and foreign constraints

2004-04-02 Thread Alvaro Herrera
On Fri, Apr 02, 2004 at 11:08:21AM -0600, [EMAIL PROTECTED] wrote:

> db=> explain analyze select count(*) from messages m, message_recipients r,
> addresses a WHERE r.Message_Key=m.Message_Key AND
> a.Address='[EMAIL PROTECTED]' AND a.Address_Key=r.Recipient AND (
> (m.Message_Date >= '29-MAR-04') AND (m.Message_Date <=
> TO_TIMESTAMP('31-MAR-04 23:59:59', 'dd-mon-yy HH24:MI:SS')::timestamp
> without time zone));
>
> QUERY PLAN
> ------------------------------------------------------------------------
[...]

>->  Index Scan using message_recipients_i_recipient on
> message_recipients r  (cost=0.00..30569.25 rows=30622 width=21) (actual
> time=83.146..31609.149 rows=312741 loops=1)

This estimate is off by an order of magnitude.  Maybe you want to
increase the statistics target for this column ...
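
E.g., for the misestimated column (the target value is just a starting
point to experiment with):

```sql
ALTER TABLE message_recipients ALTER COLUMN recipient SET STATISTICS 500;
ANALYZE message_recipients;
```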

-- 
Alvaro Herrera ()
"The Postgresql hackers have what I call a "NASA space shot" mentality.
Quite refreshing in a world of "weekend drag racer" developers."
(Scott Marlowe)



[GENERAL] Storage cost of a null column

2004-04-02 Thread Randall Skelton
What is the storage cost of a null entry in a column?  i.e. does a null
entry of type integer, float8, or text consume the same amount of storage
as one that is filled?

I ask because I have satellite data which is transmitted via a dodgy RF
link that drops data packets.  This means I have a number of columns in a
table that are null.  Moreover, the operations people decided to use a
compression scheme whereby non-changing bit/integer values are not output
at regular intervals, which also adds a considerable number of null
entries to the columns.

Because of this, we made a decision to have hundreds of 2-column tables
(timestamp, value) and use unions, intersections, and joins to get what
was needed.  Unfortunately, this has made application programming a real
nightmare, as we are often forced to reconstruct a snapshot frame for the
range of times either in C or have the app create temporary tables in SQL
and insert the relevant data prior to selecting it.

As it stands, we've ordered a new disk array, and provided that the
storage costs are not that high, I will probably be reorganising all this
next week.  If anyone has any other suggestions, I'd be very keen to hear
them.

Cheers,
Randall


Re: [GENERAL] row-level security model

2004-04-02 Thread John DeSoi
On Apr 2, 2004, at 10:39 AM, Bruno Wolff III wrote:

> Except for SQL functions which can be inlined, the optimizer can't
> optimize what is being done in the function with what is being done
> in the select statement. This might be a performance issue depending
> on your particular circumstances.

How does one specify a SQL function that can be inlined?
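
There is no explicit marker; the planner inlines sufficiently simple
LANGUAGE sql functions on its own, essentially single-SELECT bodies without
side effects (the precise conditions are internal, so treat this as a
rough sketch; the names are illustrative):

```sql
-- A single-SELECT function like this is a candidate for being inlined
-- into the calling query:
CREATE FUNCTION adult_ids() RETURNS SETOF integer AS '
    SELECT person_id FROM person WHERE age >= 18;
' LANGUAGE sql STABLE;
```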

Thanks,

John DeSoi, Ph.D.



Re: [GENERAL] Storage cost of a null column

2004-04-02 Thread Tom Lane
Randall Skelton <[EMAIL PROTECTED]> writes:
> What is the storage cost of a null entry in a column?

If you have even one, all the rest in that row are free, so your scheme
sounds reasonable.

Null columns are simply not stored.  There is a bitmap at the start of
the row with one bit per column, indicating which ones are null or not
null.  However we omit the bitmap if all columns are nonnull in a
particular row.  So it's reasonable to consider the cost of the first
null as being the size of the bitmap (N bits for an N-column table,
rounded up).  The rest are free.

>  Because of this, we made a decision that we would have hundreds of 2 
> column tables (timestamp, value) and use unions, intersections, and 

Narrow tables are a dead loss if you're concerned about storage space
--- you'll get eaten by the per-row overhead, which is a minimum of 28
bytes per row.

regards, tom lane



Re: [GENERAL] Large DB

2004-04-02 Thread Manfred Koizar
[time to move this to -hackers]

On Fri, 02 Apr 2004 11:16:21 -0500, Tom Lane <[EMAIL PROTECTED]> wrote:
>Manfred Koizar <[EMAIL PROTECTED]> writes:
>> The first step, however, (acquire_sample_rows() in analyze.c) has to
>> read more rows than finally end up in the sample.  It visits less than
>> O(nblocks) pages but certainly more than O(1).
>
>> A vague feeling tries to tell me that the number of page reads is
>> somehow related to the harmonic numbers 1 + 1/2 + 1/3 + ... + 1/n, which
>> grow like O(ln(n)).
>
>Good guess.  Vitter's paper says the expected time to sample n rows from
>a table of size N is O(n * (1 + log(N/n))).

Well, for what I tried to find out my wild guess seems to be wrong.

I don't doubt that Vitter's formula is correct, but it assumes that
access to any tuple has the same cost.  This does not apply to our
problem, however.  With 100 tuples per page, we access the first
sample_size tuples at a cost of 0.01 sequential page reads per tuple.
Later we use less and less tuples per page which results in higher
per-tuple-cost.  Near the end of a large relation we can expect to
access only one tuple per page and more and more pages are skipped, so
that prefetching doesn't help any more.

Playing around with some real numbers (for 100 tuples/page and a sample
size of 3000) I got:

 rel  | page
 size | reads
------+-------
   30 |    30
  300 |   300    (expectation is something like 299.9995)
  500 |   499
   1K |   990
   3K |  2.6K
  30K |    8K
 100K |   12K
   1M |   19K
  10M |   26K
 100M |   33K

This growth rate is steeper than O(log(nblocks)).

>> I have an idea how this could be done with O(1) page reads.

What I have in mind is a kind of "Double Vitter" algorithm.  Whatever we
do to get our sample of rows, in the end the sampled rows come from no
more than sample_size different blocks.  So my idea is to first create a
random sample of sample_size block numbers, and then to sample the rows
out of this pool of blocks.

I have to think harder though, what to do about those 400 pages that are
not accessed when the sample size is 3000 ...

>The hard part is getting a genuinely random sample when we don't know N
>in advance.  We do however know the table size in blocks, so if you're
>willing to make assumptions about constant tuple density you could do
>something different.  (But the tuple density assumption is exactly the
>weak spot of what we've got, so I'm unconvinced that would be a big step
>forward.)

Starting the scan at some random blocks should help against the common
case of unusual distribution of dead tuples near the start of the
relation.  And I plan to factor information about dead tuple hits into
an increasingly better estimation of dead/live tuple ratio.

Servus
 Manfred
