Re: [PERFORM] seq scan cache vs. index cache smackdown

2005-02-15 Thread Greg Stark
Tom Lane <[EMAIL PROTECTED]> writes:

> Greg Stark <[EMAIL PROTECTED]> writes:
> > 8.0, on the other hand, has a new algorithm that specifically tries to
> > protect against the shared buffers being blown out by a sequential
> > scan. But that will only help if it's the shared buffers being
> > thrashed that's hurting you, not the entire OS file system cache.
> 
> Something we ought to think about sometime: what are the performance
> implications of the real-world situation that we have another level of
> caching sitting underneath us?  

It seems inevitable that Postgres will eventually eliminate that redundant
layer of buffering. Since mmap is not workable, that means using O_DIRECT to
read table and index data.

Every other database eventually goes this direction, and for good reason.
Having two layers of caching and buffering is inherently inefficient. It also
makes it impossible for Postgres to offer any application-specific hints to
the caching replacement algorithms.

In that world you would configure Postgres much like you configure Oracle,
with shared_buffers taking up as much of your memory as you can afford. And
the OS file system cache is kept entirely out of the loop.

> AFAIK all the theoretical studies we've looked at consider only a single
> level of caching. But for example, if our buffer management algorithm
> recognizes an index page as being heavily hit and therefore keeps it in
> cache for a long time, then when it does fall out of cache you can be sure
> it's going to need to be read from disk when it's next used, because the
> OS-level buffer cache has not seen a call for that page in a long time.
> Contrariwise a page that we think is only on the fringe of usefulness is
> going to stay in the OS cache because we repeatedly drop it and then have to
> ask for it again.

Hum. Is it clear that that's bad? By the same logic it's the ones on the
fringe that you're likely to have to read again anyways. The ones that are
being heavily used are likely not to have to be read again anyways.

-- 
greg


---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [PERFORM] VACCUM FULL ANALYZE PROBLEM

2005-02-15 Thread PFC
	I don't know if this would work, but if you just want to restructure your  
rows, your could do this:

UPDATE table SET id = id WHERE id BETWEEN 0 AND 2;
VACUUM table;
UPDATE table SET id = id WHERE id BETWEEN 20001 AND 4;
VACUUM table;
wash, rinse, repeat.
	The idea is that an update rewrites the rows (in your new format) and  
that VACUUM (not FULL) is quite fast when you just modified a part of the  
table, and non-locking.

Would this work ?

"Iain" <[EMAIL PROTECTED]> writes:
another  way  to speed up full vacuum?

Hmmm... a full vacuum may help to re-organize the structure of modified
tables, but whether this is significant or not is another matter.
Actually, VACUUM FULL is designed to work nicely for the situation where
a table has say 10% wasted space and you want the wasted space all
compressed out.  When there is a lot of wasted space, so that nearly all
the rows have to be moved to complete the compaction operation, VACUUM
FULL is not a very good choice.  And it simply moves rows around, it
doesn't modify the rows internally; so it does nothing at all to reclaim
space that would have been freed up by DROP COLUMN operations.
CLUSTER is actually a better bet if you want to repack a table that's
suffered a lot of updates or deletions.  In PG 8.0 you might also
consider one of the rewriting variants of ALTER TABLE.
regards, tom lane
---(end of broadcast)---
TIP 6: Have you searched our list archives?
   http://archives.postgresql.org

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
 joining column's datatypes do not match


Re: [PERFORM] seq scan cache vs. index cache smackdown

2005-02-15 Thread Merlin Moncure
> It seems inevitable that Postgres will eventually eliminate that
redundant
> layer of buffering. Since mmap is not workable, that means using
O_DIRECT
> to
> read table and index data.

What about going the other way and simply letting the o/s do all the
caching?  How bad (or good) would the performance really be?  

Merlin

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [PERFORM] How to interpret this explain analyse?

2005-02-15 Thread Bricklen Anderson
Greg Stark wrote:
Kevin Brown <[EMAIL PROTECTED]> writes:

Ouch.  Is this really a reasonable assumption?  I figured the primary
use of a cursor was to fetch small amounts of data at a time from a
large table, so 10% seems extremely high as an average fetch size.  Or
is the optimization based on the number of rows that will be fetched
by the cursor during the cursor's lifetime (as opposed to in a single
fetch)?
Also, one has to ask what the consequences are of assuming a value too
low versus too high.  Which ends up being worse?

This is one of the things the planner really cannot know. Ultimately it's the
kind of thing for which hints really are necessary. Oracle distinguishes
between the "minimize total time" versus "minimize startup time" with
/*+ ALL_ROWS */ and /*+ FIRST_ROWS */ hints, for example.
I would also find it reasonable to have hints to specify a selectivity for
expressions the optimizer has no hope of possibly being able to estimate.
Things like "WHERE myfunction(col1,col2,?) /*+ 10% */"

Not to mention that hints would be helpful if you want to specify a particular index for a specific 
query (case in point, testing plans and response of various indices without having to drop and 
create other ones). This is a bit of functionality that I'd like to see.

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
 joining column's datatypes do not match


Re: [PERFORM] seq scan cache vs. index cache smackdown

2005-02-15 Thread Josh Berkus
Tom, Greg, Merlin,

>  But for example,
> if our buffer management algorithm recognizes an index page as being
> heavily hit and therefore keeps it in cache for a long time, then when
> it does fall out of cache you can be sure it's going to need to be read
> from disk when it's next used, because the OS-level buffer cache has not
> seen a call for that page in a long time.  Contrariwise a page that we
> think is only on the fringe of usefulness is going to stay in the OS
> cache because we repeatedly drop it and then have to ask for it again.

Now you can see why other DBMSs don't use the OS disk cache.  There's other 
issues as well; for example, as long as we use the OS disk cache, we can't 
eliminate checkpoint spikes, at least on Linux.  No matter what we do with 
the bgwriter, fsyncing the OS disk cache causes heavy system activity.

> It seems inevitable that Postgres will eventually eliminate that redundant
> layer of buffering. Since mmap is not workable, that means using O_DIRECT
> to read table and index data.

Why is mmap not workable?It would require far-reaching changes to our code 
-- certainly -- but I don't think it can be eliminated from consideration.

> What about going the other way and simply letting the o/s do all the
> caching?  How bad (or good) would the performance really be?  

Pretty bad.   You can simulate this easily by turning your shared_buffers way 
down ...

-- 
Josh Berkus
Aglio Database Solutions
San Francisco

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [PERFORM] seq scan cache vs. index cache smackdown

2005-02-15 Thread Tom Lane
Josh Berkus  writes:
> Why is mmap not workable?

We can't control write order.  There are other equally bad problems,
but that one alone eliminates it from consideration.  See past discussions.

regards, tom lane

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PERFORM] seq scan cache vs. index cache smackdown

2005-02-15 Thread Merlin Moncure
Josh Berkus wrote:
> Now you can see why other DBMSs don't use the OS disk cache.  There's
> other
> issues as well; for example, as long as we use the OS disk cache, we
can't
> eliminate checkpoint spikes, at least on Linux.  No matter what we do
with
> the bgwriter, fsyncing the OS disk cache causes heavy system activity.

MS SQL server uses the O/S disk cache...the database is very tightly
integrated with the O/S.  Write performance is one of the few things SQL
server can do better than most other databases despite running on a
mid-grade kernel and a low-grade filesystem...what does that say?
ReadFileScatter() and ReadFileGather() were added to the win32 API
specifically for SQL server...this is somewhat analogous to transaction
based writing such as in Reisfer4.  I'm not arguing ms sql server is
better in any way, IIRC they are still using table locks (!).  

> > It seems inevitable that Postgres will eventually eliminate that
> redundant
> > layer of buffering. Since mmap is not workable, that means using
> O_DIRECT
> > to read table and index data.

IMO, The O_DIRECT argument makes assumptions about storage and o/s
technology that are moving targets.  Not sure about mmap().

Merlin

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [PERFORM] seq scan cache vs. index cache smackdown

2005-02-15 Thread Greg Stark

Josh Berkus  writes:

> Why is mmap not workable?It would require far-reaching changes to our 
> code 
> -- certainly -- but I don't think it can be eliminated from consideration.

Fundamentally because there is no facility for being notified by the OS before
a page is written to disk. And there's no way to prevent a page from being
written to disk (mlock just prevents it from being flushed from memory, not
from being synced to disk). 

So there's no way to guarantee the WAL will be written before the buffer is
synced to disk.



Maybe it could be done by writing and syncing the WAL independently before the
shared buffer is written to at all, but that would be a completely different
model. And it would locking the shared buffer until the sync is done, and
require a private copy of the shared buffer necessitating more copies than the
double buffering in the first place.

-- 
greg


---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [PERFORM] seq scan cache vs. index cache smackdown

2005-02-15 Thread Magnus Hagander
>Josh Berkus wrote:
>> Now you can see why other DBMSs don't use the OS disk cache.  There's
>> other
>> issues as well; for example, as long as we use the OS disk cache, we
>can't
>> eliminate checkpoint spikes, at least on Linux.  No matter what we do
>with
>> the bgwriter, fsyncing the OS disk cache causes heavy system 
>activity.
>
>MS SQL server uses the O/S disk cache...

No, it doesn't. They open all files with FILE_FLAG_WRITE_THROUGH and
FILE_FLAG_NO_BUFFERING. It scales the size of it dynamically with the
system, but it uses it's own buffer cache.

> the database is very tightly
>integrated with the O/S.  

That it is.


>Write performance is one of the few things SQL
>server can do better than most other databases despite running on a
>mid-grade kernel and a low-grade filesystem...what does that say?
>ReadFileScatter() and ReadFileGather() were added to the win32 API
>specifically for SQL server...this is somewhat analogous to transaction
>based writing such as in Reisfer4. 

(Those are ReadFileScatter and WriteFileGather)

I don't think that's correct either. Scatter/Gather I/O is used to SQL
Server can issue reads for several blocks from disks into it's own
buffer cache with a single syscall even if these buffers are not
sequential. It did make significant performance improvements when they
added it, though.

(For those not knowing - it's ReadFile/WriteFile where you pass an array
of "this many bytes to this address" as parameters)


> I'm not arguing ms sql server is
>better in any way, IIRC they are still using table locks (!).  

Not at all. They use row level locks, escalated to page level, then
escalated to table level. Has been since 7.0. In <= 6.5 they had page
level and table level locks. I think possibly back in 4.2 (this is
16-bit days on OS/2) they had only table level locks, but that's a long
time ago.
They don't do MVCC, though.

(I'm not saying it's better either. At some things it is, at many it is
not)

//Magnus

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [PERFORM] seq scan cache vs. index cache smackdown

2005-02-15 Thread Christopher Browne
In the last exciting episode, [EMAIL PROTECTED] ("Merlin Moncure") wrote:
>> It seems inevitable that Postgres will eventually eliminate that
>> redundant layer of buffering. Since mmap is not workable, that
>> means using O_DIRECT to read table and index data.
>
> What about going the other way and simply letting the o/s do all the
> caching?  How bad (or good) would the performance really be?

I'm going to see about taking this story to OLS (Ottawa Linux
Symposium) in July and will see what hearing I can get.  There are
historically some commonalities in the way this situation is regarded,
in that there was _long_ opposition to the notion of having unbuffered
disk devices.

If there's more "story" that definitely needs to be taken, let me
know...
-- 
output = reverse("moc.enworbbc" "@" "enworbbc")
http://linuxdatabases.info/info/slony.html
Rules of  the Evil Overlord  #90. "I will  not design my  Main Control
Room  so  that  every  workstation  is facing  away  from  the  door."


---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [PERFORM] seq scan cache vs. index cache smackdown

2005-02-15 Thread PFC
In the 'wishful hand waving' department :
	read index -> determine (tuple id,page) to hit in table -> for each of  
these, tell the OS 'I'm gonna need these' via a NON BLOCKING call. Non  
blocking because you feed the information to the OS as you read the index,  
streaming it.

	Meanwhile, the OS accumulates the requests in an internal FIFO,  
reorganizes them according to the order best suited to good disk head  
movements, then reads them in clusters, and calls a callback inside the  
application when it has data available. Or the application polls it once  
in a while to get a bucketload of pages. The 'I'm gonna need these()'  
syscall would also sometimes return  'hey, I'm full, read the pages I have  
here waiting for you before asking for new ones'.

	A flag would tell the OS if the application wanted the results in any  
order, or with order preserved.
	Without order preservation, if the application has requested twice the  
same page with different tuple id's, the OS would call the callback only  
once, giving it a list of the tuple id's associated with that page.

	It involves a tradeoff between memory and performance : as the size of  
the FIFO increases, likelihood of good contiguous disk reading increases.  
However, the memory structure would only contain page numbers and tuple  
id's, so it could be pretty small.

Returning the results in-order would also need more memory.
	It could be made very generic if instead of 'tuple id' you read 'opaque  
application data', and instead of 'page' you read '(offset, length)'.

	This structure actually exists already in the Linux Kernel, it's called  
the Elevator or something, but it works for scheduling reads between  
threads.

	You can also read 'internal not yet developed postgres cache manager'  
instead of OS if you don't feel like talking kernel developers into  
implementing this thing.


(Those are ReadFileScatter and WriteFileGather)
---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
   (send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])


Re: [PERFORM] seq scan cache vs. index cache smackdown

2005-02-15 Thread Greg Stark

PFC <[EMAIL PROTECTED]> writes:

>   You can also read 'internal not yet developed postgres cache manager'
> instead of OS if you don't feel like talking kernel developers into
> implementing this thing.

It exists already, it's called aio. 

But there are a *lot* of details you skipped over. 
And as always, the devil is in the details.

-- 
greg


---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match