2011/3/24 Jim Nasby <j...@nasby.net>:
> On Mar 22, 2011, at 11:46 AM, Cédric Villemain wrote:
>> 2011/3/22 Greg Stark <gsst...@mit.edu>:
>>> On Mon, Mar 21, 2011 at 6:08 PM, Jim Nasby <j...@nasby.net> wrote:
>>>> Has anyone looked at the overhead of measuring how long IO requests
>>>> to the kernel take? If we did that, not only could we get an idea of
>>>> what our IO workload looked like, we could also figure out whether a
>>>> block came out of cache or not. That information could potentially
>>>> be useful to the planner, but even if the database couldn't use that
>>>> knowledge itself it would be a damn useful statistic to have...
>>>> IMHO, far more useful than our current hit rate statistics.
>>>
>>> I've done this -- actually better, I used mincore to check whether
>>> the block was in cache before issuing the read -- but it turns out
>>> you can't get what you're looking for this way.
>>
>> The Linux fincore() syscall never got into the kernel; maybe something
>> to revive...
>
> Is there an equivalent in other OSes? Could we use time measurement as
> an alternative if not?

The fincore() syscall is a shortcut for mmap+mincore calls, suggested by
the people working on libprefetch; see http://lwn.net/Articles/371538/

The alternative via time measurement is interesting; it should be easy
to output both measures in pg_statio_* and see what happens...
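For illustration, the mmap+mincore dance that fincore() would wrap looks
something like this (an untested standalone sketch, not PostgreSQL code;
the command-line file argument, the output format and the missing error
handling are just for the example):

/*
 * Sketch: report how many pages of a file are resident in the OS cache,
 * using mmap + mincore (roughly what a fincore() syscall would wrap).
 * Illustration only -- minimal error handling, not PostgreSQL code.
 */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

int
main(int argc, char **argv)
{
    int         fd;
    struct stat st;
    long        pagesize = sysconf(_SC_PAGESIZE);
    size_t      npages;
    unsigned char *vec;
    void       *map;
    size_t      resident = 0;

    if (argc != 2)
        return 1;
    fd = open(argv[1], O_RDONLY);
    if (fd < 0 || fstat(fd, &st) < 0 || st.st_size == 0)
        return 1;

    npages = (st.st_size + pagesize - 1) / pagesize;
    vec = malloc(npages);

    /* Map the file without touching it, then ask which pages are cached. */
    map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED || mincore(map, st.st_size, vec) < 0)
        return 1;

    for (size_t i = 0; i < npages; i++)
        if (vec[i] & 1)
            resident++;

    printf("%s: %zu of %zu pages in OS cache (%.1f%%)\n",
           argv[1], resident, npages, 100.0 * resident / npages);

    munmap(map, st.st_size);
    free(vec);
    close(fd);
    return 0;
}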
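And the time-measurement idea is basically this (again an untested
sketch; the 100us threshold and the clock choice are arbitrary
placeholders, not a proposal):

/*
 * Sketch: wrap a read with a monotonic clock and classify "fast" reads
 * as OS cache hits.  The threshold is a hand-waved placeholder.
 */
#include <stdbool.h>
#include <time.h>
#include <unistd.h>

#define CACHE_HIT_THRESHOLD_NS  (100 * 1000)    /* 100 us, arbitrary */

static bool
timed_read(int fd, void *buf, size_t len, off_t offset)
{
    struct timespec start, end;
    long long       elapsed_ns;

    clock_gettime(CLOCK_MONOTONIC, &start);
    (void) pread(fd, buf, len, offset);         /* error handling omitted */
    clock_gettime(CLOCK_MONOTONIC, &end);

    elapsed_ns = (end.tv_sec - start.tv_sec) * 1000000000LL
               + (end.tv_nsec - start.tv_nsec);

    /* true = probably served from the OS cache, false = probably disk */
    return elapsed_ns < CACHE_HIT_THRESHOLD_NS;
}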
>>> It turns out when you do this you see one block being read from disk
>>> followed by n blocks that all appear to be cache hits, because
>>> they've been prefetched by the kernel.
>>
>> I did the same, and I now believe that it is not very important to
>> have the exact numbers. Prefetched blocks *are* in memory when we
>> request them; the first read access reads more than one block because
>> the cost is the same.
>
> Yeah... there are places in the planner where we make guesses as to
> the likelihood of something being in cache. If we could actually track
> the complete hit rate over time (PG buffers + FS cache), then we
> wouldn't have to guess at things anymore.
>
> And having this info in pg_stats would be extremely valuable.

Yes. Robert also wrote some interesting points to keep in mind when
thinking about that in another thread, recently.

A FS-cache snapshot, or just a 'percent_in_cache' per relation/file (?),
is easy to add to some auto-analyze daemon. *But* making good use of it
in the planner is not as trivial as it looks (i.e. without breaking what
already works well). Once I get time to add hooks in costsize.c, a
simple extension can do the trick (it just needs some shared buffers to
keep the FS-cache stats, plus hooks to use them in some places).

>>> What you end up with is actually something like the number of iops,
>>> which is also an interesting measure but not really what you were
>>> looking for.
>>>
>>> My getrusage patch, which I should still dig out -- though it's
>>> rather too late to be committing now unless someone tells me
>>> otherwise -- would tell you how much i/o a plan node actually did.
>>> But you won't know which blocks did the i/o, since I was only
>>> tracking totals for the plan node. That's probably what you're
>>> looking for here.
>>
>> Please show us the patch :)

> --
> Jim C. Nasby, Database Architect   j...@nasby.net
> 512.569.9461 (cell)                http://jim.nasby.net

--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support