On Wed, Aug 3, 2011 at 4:38 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Robert Haas <robertmh...@gmail.com> writes:
>> On Wed, Aug 3, 2011 at 3:38 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>>> (If the query ended up being a seqscan, I'd expect a second
>>> lseek(SEEK_END) when the executor starts up, but I gather from the other
>>> complaints that the mosbench people were only testing simple indexscan
>>> queries.)
>
>> Yeah, it seems that for a sequential scan we lseek the heap, then the
>> index, then the heap again; but for index scans we just hit the heap
>> and the index.
>
> Sure.  The first two come from the planner getting the table and index
> sizes for estimation purposes (look in plancat.c).  The last is done in
> heapam.c's initscan().  We could possibly accept stale values for the
> planner estimates, but I think heapam's number had better be accurate.
I think the exact requirement is that, if the relation turns out to be larger than the size we read, the extra blocks had better not contain any tuples our snapshot can see.  There's actually no interlock between smgrnblocks() and smgrextend() right now, so presumably we don't need to add one.  However, a value cached from a few seconds ago is clearly not going to cut it.

I don't really think there's any sensible way to implement a per-backend cache, because that would require invalidation events of some kind to be sent on relation extension, and that seems utterly insane from a performance standpoint, even if we invented something less expensive than sinval.  I guess it might work for planning purposes if you only sent out invalidation events on every N'th extension or something, but penalizing the accuracy of planning to work around a Linux kernel bug that only manifests itself on machines with >32 cores doesn't seem very appealing.

A shared cache seems like it could work, but the locking is tricky. Normally we'd just use a hash table protected by an LWLock, with one LWLock per partition, but here that's clearly not going to work.  The kernel is using a spinlock per file, and even that is too heavy-weight.  I think that if we could prepopulate the cache with all the keys (i.e. relfilenodes) we care about and then never add or evict any, we could run it completely unlocked.  I believe that the existing memory barriers in things like LWLockAcquire() would be sufficient to prevent us from reading a too-old value (i.e. block count).  In particular, you couldn't read a value that predated your snapshot, if you got your snapshot by holding ProcArrayLock.  But the races involved with adding and removing items from the cache are hard to deal with without using locks, especially because the keys are 12 bytes or more and therefore can't be read or written atomically.
I've been mulling over how we might deal with this and actually coded up an implementation, but it turns out (surprise, surprise) to have problems with insufficient locking.  So I'm thinking it over some more.  And hoping that the Linux guys decide to do something about it.  This isn't really our bug - lseek is quite cheap in the uncontended case.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers