Re: [PERFORM] Postgres low end processing.
Thanks for the replies,

On Fri, 3 Oct 2003 11:08:48 -0700 Josh Berkus <[EMAIL PROTECTED]> wrote:

=> 1. Make sure that the WAL files (pg_xlog) are on a separate disk from the
=> database files, either through mounting or symlinking.

I'm not sure I understand how this helps?

=> 2. Tweak the .conf file for low vacuum_mem (1024?), but vacuum very
=> frequently, like every 1-5 minutes. Spend some time tuning your
=> max_fsm_pages to the ideal level so that you're not allocating any extra
=> memory to the FSM.
=>
=> 3. If your concern is *average* CPU/RAM consumption, and not peak load
=> activity, increase wal_files and checkpoint_segments to do more efficient
=> batch processing of pending updates at the cost of some disk space. If peak
=> load activity is a problem, don't do this.
=>
=> 4. Tune all of your queries carefully to avoid anything requiring a
=> RAM-intensive merge join or CPU-eating calculated expression hash join, or
=> similar computation-or-RAM-intensive operations.

Thanks, I'll try some of these, and post the results.
The actual machines seem to be Pentium I machines, with 32M RAM.
I've gathered that it is theoretically possible, so now to go try it.

Regards
Stef
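For reference, a starting-point postgresql.conf for a 32 MB machine might look
something like the sketch below. These numbers are guesses to tune from, not
tested recommendations for Stef's boxes:

    # postgresql.conf sketch for a ~32 MB machine (values illustrative only)
    max_connections      = 8
    shared_buffers       = 128    # 8 KB pages, so about 1 MB of shared memory
    sort_mem             = 512    # KB per sort operation
    vacuum_mem           = 1024   # KB, per Josh's suggestion above
    effective_cache_size = 1000   # 8 KB pages; guess at what the OS will cache
    checkpoint_segments  = 3

The point of the low shared_buffers/sort_mem values is simply to keep the
postmaster from pushing the rest of the system into swap on such a small box.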
Re: [PERFORM] reindex/vacuum locking/performance?
On Sun, 5 Oct 2003, Neil Conway wrote:
>
> > I don't know any portable way to do that :-(
>
> For the non-portable way of doing this, are you referring to O_DIRECT?
>
> Even if it isn't available everywhere, it might be worth considering
> this at least for the platforms on which it is supported.
>

I strongly agree here only if we can prove there is a benefit.

I think it would be silly of us not to use it if some OS supported
SnazzyFeatureC that was able to speed up PG by a large percentage
(hopefully, in a rather non-invasive way in the code). But, I do see the
problem here with bloat and PG being radically different platform to
platform. I suppose we could dictate that at least N OSes had to have it..
or perhaps supply it as contrib/ patches. Something to think about.

I'd be interested in tinkering with this, but I'm more interested at the
moment in why (with proof, not anecdotal) Solaris is so much slower than
Linux and what we can do about this. We're looking to move a rather large
Informix db to PG and ops has reservations about ditching Sun hardware.

--
Jeff Trout <[EMAIL PROTECTED]>
http://www.jefftrout.com/
http://www.stuarthamm.net/
Re: [PERFORM] reindex/vacuum locking/performance?
On Mon, Oct 06, 2003 at 08:07:27AM -0400, Jeff wrote:
> I strongly agree here only if we can prove there is a benefit.

There's plenty of academic work which purports to show that LRU is far
from the best choice. Just in principle, it seems obvious that a
single-case seqscan-type operation (such as vacuum does) is a good way to
lose your cache for no real gain.

> I'd be interested in tinkering with this, but I'm more interested at the
> moment in why (with proof, not anecdotal) Solaris is so much slower than
> Linux and what we can do about this. We're looking to move a rather large
> Informix db to PG and ops has reservations about ditching Sun hardware.

Interestingly, we're contemplating ditching Solaris because of the
terrible reliability we're getting from the hardware.

You can use truss to find some of the problems on Solaris. The open()
syscall takes forever when you don't hit the Postgres shared buffers (even
if you can be sure the page is in filesystem buffers -- we could
demonstrate it on a 1 G database on a machine with 10 G of RAM). I've
heard grumblings about spinlocks on Solaris which might explain this
problem. I certainly notice that performance gets geometrically worse when
you add a few hundred extra connections.

A

--
Andrew Sullivan                       204-4141 Yonge Street
Afilias Canada                        Toronto, Ontario, Canada
<[EMAIL PROTECTED]>                   M2P 2A8
                                      +1 416 646 3304 x110
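Truss invocations along these lines will show where a backend spends its
syscall time (the pid is whichever backend you are watching; flag spellings
are from memory, so check truss(1) on your release):

    truss -c -p <backend-pid>                      # per-syscall counts and elapsed time
    truss -d -t open,read,lseek -p <backend-pid>   # timestamped trace of just the file I/O calls

Attaching with -c for a minute or two during a slow query is usually enough
to see whether open() or something else dominates.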
Re: [PERFORM] reindex/vacuum locking/performance?
Jeff wrote:
> I'd be interested in tinkering with this, but I'm more interested at the
> moment in why (with proof, not anecdotal) Solaris is so much slower than
> Linux and what we can do about this. We're looking to move a rather large
> Informix db to PG and ops has reservations about ditching Sun hardware.

Is linux on sparc hardware an option?.. :-)

Shridhar
Re: [PERFORM] Postgres low end processing.
On Mon, Oct 06, 2003 at 09:55:51 +0200, Stef <[EMAIL PROTECTED]> wrote:
>
> Thanks, I'll try some of these, and post the results.
> The actual machines seem to be Pentium I machines,
> with 32M RAM. I've gathered that it is theoretically
> possible, so now to go try it.

I am running 7.4beta2 on a Pentium I machine with 48 MB of memory. I was
running an earlier version of Postgres (probably 7.1.x) on it when it only
had 32 MB of memory. It doesn't run very fast, but it works OK. I remember
the increase from 32MB to 48MB was very noticeable in the time it took to
serve web pages using information from the DB, but I don't remember the
details since it was a couple of years ago.
[PERFORM] locking/performance, Solaris performance discovery
On Mon, 6 Oct 2003, Andrew Sullivan wrote:
> There's plenty of academic work which purports to show that LRU is
> far from the best choice. Just in principle, it seems obvious that a
> single-case seqscan-type operation (such as vacuum does) is a good
> way to lose your cache for no real gain.

Logically bypassing caches for a seq scan also makes sense.

> Interestingly, we're contemplating ditching Solaris because of the
> terrible reliability we're getting from the hardware.

The reason ops likes solaris / sun is twofold.

1. we have a pile of big sun machines around.

2. Solaris / Sun is quite a bit more graceful in the event of a hardware
failure. We've burned out our fair share of CPUs etc and solaris has been
rather graceful about it.

I've started profiling and running tests... currently it is leaning
towards the sysv semaphores. I see in src/backend/port/ that pg_sema.c is
linked to the sysv implementation.

So what I did was create a semaphore set, and then fired off 5 copies of a
program that attaches to that semaphore and then locks/unlocks it 1M
times.

2xP2-450, Linux 2.4.18:               1 process: 221680 / sec, 5 processes: 98039 / sec
4xUltraSparc II-400Mhz, Solaris 2.6:  1 process: 142857 / sec, 5 processes: 23809 / sec

So I'm guessing that is where a LOT of the suck is coming from.

What I plan to do next is look to see if there are other interprocess
locking mechanisms on solaris (perhaps pthread_mutex with that
inter-process flag or something) to see if I can get those numbers a
little closer.

--
Jeff Trout <[EMAIL PROTECTED]>
http://www.jefftrout.com/
http://www.stuarthamm.net/
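Jeff's test program isn't included in the thread; the loop he describes
looks roughly like the sketch below. The key, loop count, SETVAL idiom and
file name are my own guesses, not his code:

    /* semloop.c -- rough sketch of a SysV semaphore lock/unlock benchmark.
     * Build:  cc -o semloop semloop.c
     * Start one copy first (it creates and initializes the semaphore),
     * then run several copies concurrently and time them. */
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/ipc.h>
    #include <sys/sem.h>

    #define LOOPS 1000000

    /* Linux and Solaris make the caller declare this for semctl(SETVAL). */
    union semun { int val; struct semid_ds *buf; unsigned short *array; };

    int main(void)
    {
        key_t  key = 0x5432;                  /* arbitrary key shared by all copies */
        int    semid = semget(key, 1, IPC_CREAT | 0600);
        struct sembuf lock   = { 0, -1, 0 };  /* P: decrement */
        struct sembuf unlock = { 0,  1, 0 };  /* V: increment */
        union  semun  arg;
        int    i;

        if (semid < 0) { perror("semget"); return 1; }
        arg.val = 1;
        semctl(semid, 0, SETVAL, arg);        /* start the semaphore at 1 */

        for (i = 0; i < LOOPS; i++)
        {
            semop(semid, &lock, 1);           /* acquire */
            semop(semid, &unlock, 1);         /* release */
        }
        return 0;
    }

Timing N concurrent copies with time(1) and dividing LOOPS*N by the elapsed
seconds gives the per-second figures quoted above.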
Re: [PERFORM] locking/performance, Solaris performance discovery
Jeff <[EMAIL PROTECTED]> writes:
> I've started profiling and running tests... currently it is leaning
> towards the sysv semaphores. I see in src/backend/port/ that pg_sema.c is
> linked to the sysv implementation.

Does Solaris have Posix semaphores? You could try using those instead.

            regards, tom lane
[PERFORM] Shopping for hardware
Ok, I asked this on [novice], but I was told it'd be better to post it
here...

I've got some money to spend on new servers. The biggest concern is the
PostgreSQL database server that will "be the company." (*Everyone* uses
the database server in some form or another.) I'm looking for
hot-swappable RAID 1 on a Linux platform at the least. Are there any
vendors to avoid or prefer? What works best? Am I better off going with a
DIY build or getting something pre-packaged?

In terms of numbers, we expect to have an average of 100 active
connections (most of which are idle 9/10ths of the time), with about 85%
reading traffic. I hope to have one server host about 1000-2000 active
databases, with the largest being about 60 meg (no blobs). Inactive
databases will only be for reading (archival) purposes, and will seldom be
accessed. (I could probably move them off to another server with a r/o
disk...)

Does any of this represent a problem for Postgres? The datasets are
typically not that large; only a few queries on a few databases ever
return over 1000 rows.

The configuration that is going on in my head is:
  RAID 1, 200gig disks
  1 server, 4g ram
  Linux 2.4 or 2.6 (depends on when we deploy and 2.6's track record at
  that time)

I want something that can do hot-swaps and auto-mirroring after swap.
Unfortunately, this is a new area for me. (I normally stick to S/W for
non-high end systems.)

Thanks!

Jason Hihn
Paytime Payroll
Re: [PERFORM] locking/performance, Solaris performance discovery
On Mon, 6 Oct 2003, Tom Lane wrote:
>
> Does Solaris have Posix semaphores? You could try using those instead.
>
>             regards, tom lane

Yep. It does.

I learned them quick enough (using posix_sema.c as a guide) and found out
that at least on Sol 2.6 they are slower than sysv - with 5 processes it
went to about 16k lock/unlock a second.

I'm going to try to find a box around here I can get sol(8|9) on that has
sufficient disk space and see. I'm guessing sun has likely made
improvements...

Another odd thing I'm trying to work out is why my profiles come out so
radically different on the linux box and the sun box.

Sun:

  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
31.17     18.90     18.90                             internal_mcount
19.10     30.48     11.58  8075381     0.00     0.00  _bt_checkkeys
 5.66     33.91      3.43 24375253     0.00     0.00  FunctionCall2
 4.82     36.83      2.92  8073010     0.00     0.00  _bt_step
 3.51     38.96      2.13    14198     0.15     0.15  _read
 2.77     40.64      1.68  8069040     0.00     0.00  varchareq
 2.59     42.21      1.57    28454     0.06     0.23  _bt_next
 2.29     43.60      1.39     1003     1.39     1.40  AtEOXact_Buffers
 1.86     44.73      1.13 16281197     0.00     0.00  pg_detoast_datum
 1.81     45.83      1.10                             _mcount
 1.68     46.85      1.02     2181     0.47     0.47  pglz_decompress

Linux:

  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
11.14      0.62      0.62     1879     0.00     0.00  pglz_decompress
 6.71      0.99      0.37     1004     0.00     0.00  AtEOXact_Buffers
 3.80      1.20      0.21  1103045     0.00     0.00  AllocSetAlloc
 3.23      1.38      0.18   174871     0.00     0.00  nocachegetattr
 2.92      1.54      0.16  1634957     0.00     0.00  AllocSetFreeIndex
 2.50      1.68      0.14    20303     0.00     0.00  heapgettup
 1.93      1.79      0.11     1003     0.00     0.00  AtEOXact_CatCache
 1.76      1.89      0.10   128442     0.00     0.00  hash_any
 1.72      1.98      0.10    90312     0.00     0.00  FunctionCall3
 1.69      2.08      0.09    50632     0.00     0.00  ExecTargetList
 1.60      2.17      0.09    51647     0.00     0.00  heap_formtuple
 1.55      2.25      0.09   406162     0.00     0.00  newNode
 1.46      2.33      0.08   133044     0.00     0.00  hash_search

It is the same query with slightly different data (the Sun has probably..
20-40k more rows in the table the query hits).

I'll be digging up more info later today.

--
Jeff Trout <[EMAIL PROTECTED]>
http://www.jefftrout.com/
http://www.stuarthamm.net/
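For reference, a process-shared POSIX-semaphore version of the same loop
looks something like this. Again a sketch, not Jeff's actual test; the
semaphore name and the link library (-lposix4 on older Solaris, -lrt
elsewhere) are assumptions about the build environment:

    /* posix_semloop.c -- the same lock/unlock loop with a named POSIX semaphore.
     * Build: cc -o posix_semloop posix_semloop.c -lrt   (or -lposix4 on Sol 2.6) */
    #include <stdio.h>
    #include <fcntl.h>
    #include <semaphore.h>

    #define LOOPS 1000000

    int main(void)
    {
        /* O_CREAT: first copy creates it with an initial count of 1 */
        sem_t *sem = sem_open("/pg_sem_test", O_CREAT, 0600, 1);
        int i;

        if (sem == SEM_FAILED) { perror("sem_open"); return 1; }
        for (i = 0; i < LOOPS; i++)
        {
            sem_wait(sem);        /* acquire */
            sem_post(sem);        /* release */
        }
        sem_close(sem);           /* the name persists until sem_unlink() */
        return 0;
    }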
Re: [PERFORM] Shopping for hardware
On Mon, 6 Oct 2003, Jason Hihn wrote:
> Ok, I asked this on [novice], but I was told it'd be better to post it
> here...
>
> I've got some money to spend on new servers. The biggest concern is the
> PostgreSQL database server that will "be the company." (*Everyone* uses
> the database server in some form or another.) I'm looking for
> hot-swappable RAID 1 on a Linux platform at the least. Are there any
> vendors to avoid or prefer? What works best? Am I better off going with a
> DIY build or getting something pre-packaged?

Depends on your hardware expertise. You can do quite well either way. I
prefer adding my own components to a pre-built vanilla server.

> In terms of numbers, we expect to have an average of 100 active
> connections (most of which are idle 9/10ths of the time), with about 85%
> reading traffic. I hope to have one server host about 1000-2000 active
> databases, with the largest being about 60 meg (no blobs). Inactive
> databases will only be for reading (archival) purposes, and will seldom
> be accessed. (I could probably move them off to another server with a
> r/o disk...)

That's not a really big load, but I'm guessing the peaks will be big
enough to notice.

> Does any of this represent a problem for Postgres? The datasets are
> typically not that large; only a few queries on a few databases ever
> return over 1000 rows.

Nah, this is pretty normal stuff for Postgresql or any other database in
its approximate class (Sybase, Oracle, Informix, DB2, MSSQL2k).

> The configuration that is going on in my head is:
>   RAID 1, 200gig disks
>   1 server, 4g ram
>   Linux 2.4 or 2.6 (depends on when we deploy and 2.6's track record at
>   that time)

That's a good starting point. I'd avoid 2.6 until it's had time for the
bugs to drop out. The latest 2.4 kernels are pretty stable.

List of things to include if you need more performance, in order of
priority:

  proper tuning of the postgresql.conf file
    (see http://www.varlena.com/varlena/GeneralBits/Tidbits/perf.html)
  hardware RAID card with battery backed cache, the bigger the cache the
    better.
  more drives for RAID 1+0
  faster CPUs.

Since you've already got 4 gigs of RAM slated, you're set there on linux,
where having more won't likely help a lot unless you go to a 64 bit
platform.

> I want something that can do hot-swaps and auto-mirroring after swap.
> Unfortunately, this is a new area for me. (I normally stick to S/W for
> non-high end systems.)

The LSI/Megaraid cards can handle hot swaps quite well; make sure you get
the right kind of hot swap shoes so they isolate the drive from the bus
when you turn it off and they don't lock up your SCSI bus.
Re: [PERFORM] Shopping for hardware
> "JH" == Jason Hihn <[EMAIL PROTECTED]> writes:

JH> The configuration that is going on in my head is:
JH> RAID 1, 200gig disks
JH> 1 server, 4g ram
JH> Linux 2.4 or 2.6 (depends on when we deploy and 2.6's track record at
JH> that time)

My recommendation is to get more disks (smaller and faster) rather than a
few large ones.

As for vendors, I always buy from Dell because they actually honor their
"4-hour 24x7 replacement parts with technician to stick 'em in" guarantee.
That and their hardware is rock solid and non-funky (ie, I can run FreeBSD
on it with no issues).

Here's my latest setup I just got:

  Dell PE 2650, dual Xeon processors (lowest speed they sell, as this is
    not a bottleneck)
  4Gb RAM
  Dell PERC3 RAID controller (rebranded AMI controller), dual channel
  2x 18Gb internal disks on RAID1 (RAID channel0)
  14x 18Gb external disks on RAID5 (RAID channel1, see my posts on this
    list from a month or so ago on how I arrived at RAID5).

All the disks are SCSI 15kRPM U320 drives, tho the controller only does
U160.

I run FreeBSD, but it should run linux just fine, too.

--
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Vivek Khera, Ph.D.                  Khera Communications, Inc.
Internet: [EMAIL PROTECTED]         Rockville, MD  +1-240-453-8497
AIM: vivekkhera Y!: vivek_khera     http://www.khera.org/~vivek/
Re: [PERFORM] Shopping for hardware
Jason,

> In terms of numbers, we expect to have an average of 100 active
> connections (most of which are idle 9/10ths of the time), with about 85%
> reading traffic. I hope to have one server host about 1000-2000 active
> databases, with the largest being about 60 meg (no blobs). Inactive
> databases will only be for reading (archival) purposes, and will seldom
> be accessed. (I could probably move them off to another server with a
> r/o disk...)

Hey, two people (one of them me) suggested that rather than putting all
2000 databases on one $15,000 server, you buy 3 $5000 servers and split
things up. You may have considered this suggestion and rejected it, but
I'm wondering if you missed it ...

If you're lumping everything on one server, you'll need to remember to
increase max_fsm_relations to the total number of tables in all databases
... for example, for 10 tables in 2000 databases you'll want a setting of
20,000 (which sounds huge but it's really only about 1 MB of memory).

--
Josh Berkus
Aglio Database Solutions
San Francisco
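For the archives, the knobs in question live in postgresql.conf and, for
that example, might end up looking like this (numbers illustrative only,
and max_fsm_pages would need its own sizing pass):

    max_fsm_relations = 20000    # roughly one slot per table/index with free
                                 # space, summed across all databases
    max_fsm_pages     = 200000   # total data pages with free space to remember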
[PERFORM] Seqscan buffer promotion (was: reindex/vacuum locking/performance?)
On Mon, 2003-10-06 at 05:15, Andrew Sullivan wrote:
> There's plenty of academic work which purports to show that LRU is
> far from the best choice. Just in principle, it seems obvious that a
> single-case seqscan-type operation (such as vacuum does) is a good
> way to lose your cache for no real gain.

Traditionally, seqscan type operations are accommodated in LRU type
managers by having multiple buffer promotion policies, primarily because
it is simple to implement. For example, if you are doing a seqscan, a
buffer loaded from disk is never promoted to the top of the LRU. Instead
it is only partially promoted (say, halfway, for example) toward the top
of the buffer list. A page that is already in the buffer is promoted
either to the halfway point or the top depending on where it was found.
There are numerous variations on the idea, some being more clever and
complex than others.

The point of this being that a pathological or rare sequential scan can
never trash more than a certain percentage of the cache, while not
significantly impacting the performance of a sequential scan.

The primary nuisance is that it slightly increases the API complexity.
I'll add that I don't know what PostgreSQL actually does in this regard,
but from the thread it appears as though seqscans are handled like the
default case.

Cheers,

-James Rogers
 [EMAIL PROTECTED]
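A toy illustration of the partial-promotion idea (this is not PostgreSQL's
buffer manager; the array, sizes and hint names are all invented for the
example):

    /* lru_toy.c -- pages touched by a seqscan are promoted only to the middle
     * of the LRU order, so a big scan can never push everything out of the
     * top half of the cache. */
    #include <stdio.h>

    #define NBUF 8

    static int lru[NBUF];   /* lru[0] = most recently used ... lru[NBUF-1] = victim */

    typedef enum { ACCESS_NORMAL, ACCESS_SEQSCAN } AccessHint;

    static void
    touch(int page, AccessHint hint)
    {
        int pos, target, i;

        /* find the page (assume it is already resident, for brevity) */
        for (pos = 0; pos < NBUF && lru[pos] != page; pos++)
            ;
        if (pos == NBUF)
            return;

        /* seqscan accesses are promoted only to the midpoint, others to the top */
        target = (hint == ACCESS_SEQSCAN) ? NBUF / 2 : 0;
        if (target > pos)           /* never demote a page that is already hot */
            target = pos;

        for (i = pos; i > target; i--)
            lru[i] = lru[i - 1];
        lru[target] = page;
    }

    int main(void)
    {
        int i;

        for (i = 0; i < NBUF; i++)
            lru[i] = i;
        touch(7, ACCESS_SEQSCAN);   /* coldest page, read by a seqscan */
        touch(6, ACCESS_NORMAL);    /* next-coldest page, read by an index scan */
        for (i = 0; i < NBUF; i++)
            printf("%d ", lru[i]);
        printf("\n");               /* the seqscan page only made it to the middle */
        return 0;
    }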
Re: [PERFORM] count(*) slow on large tables
> How will it help? This is in addition to the trigger proposal that came
> up earlier. With triggers it's not possible to make values visible
> across backends unless the trigger updates a table, which eventually
> leads to the vacuum/dead tuples problem.
>
> 1. User creates a trigger to check updates/inserts for certain conditions.
> 2. It updates the count as and when required.
> 3. If the trigger detects the count is not initialized, it would issue
>    the same query the first time. There is no avoiding this issue.
>
> Besides, providing a facility of resident variables could be used
> imaginatively as well.
>
> Does this make sense? IMO this is a more generalised approach overall.

I do this _VERY_ frequently in my databases, only I have my stored procs
do the aggregate in a predefined MVCC table that's always there. Here's a
denormalized version for public consumption/thought:

CREATE TABLE global.dba_aggregate_cache (
  dbl        TEXT NOT NULL,  -- The database location, doesn't need to be
                             -- qualified (ex: schema.table.col)
  op         TEXT NOT NULL,  -- The operation, SUM, COUNT, etc.
  qual       TEXT,           -- Any kind of conditional, such as a where clause
  val_int    INT,            -- Whatever the value is, of type INT
  val_bigint BIGINT,         -- Whatever the value is, of type BIGINT
  val_text   TEXT,           -- Whatever the value is, of type TEXT
  val_bytea  BYTEA           -- Whatever the value is, of type BYTEA
);
CREATE UNIQUE INDEX dba_aggregate_cache_dbl_op_udx
  ON global.dba_aggregate_cache(dbl,op);

Then, I use a function to retrieve this value instead of a SELECT
COUNT(*).

SELECT public.cache_count('dbl','qual'); -- In this case, the op is COUNT
SELECT public.cache_count('dbl');        -- Returns the COUNT for the table
                                         -- listed in the dbl

Then, I create 4 or 5 functions (depends on the op I'm performing):

1) A private function that _doesn't_ run as security definer, that
   populates the global.dba_aggregate_cache row if it's empty.

2) A STABLE function for SELECTs; if the row doesn't exist, it calls
   function #1 to populate its existence.

3) A STABLE function for INSERTs; if the row doesn't exist, it calls
   function #1 to populate its existence, then adds the necessary bits to
   make it accurate.

4) A STABLE function for DELETEs; if the row doesn't exist, it calls
   function #1 to populate its existence, then deletes the necessary bits
   to make it accurate.

5) A STABLE function for UPDATEs; if the row doesn't exist, it calls
   function #1 to populate its existence, then updates the necessary bits
   to make it accurate. It's not uncommon for me to not have an UPDATE
   function/trigger.

Create triggers for functions 2-5, and test away. It's MVCC, and searching
through an indexed table for a single row is obviously vastly faster than
a seqscan/aggregate. If I need any kind of an aggregate to be fast, I use
this system with a derivation of the above table.

The problem with it is that I have to retrain others to use cache_count(),
or some other function, instead of using COUNT(*). That said, it'd be nice
if there were a way to tell PostgreSQL to do the above for you and teach
COUNT(*), SUM(*), or other aggregates to use an MVCC-backed cache similar
to the above. If people want their COUNTs to be fast, then they have to
live with the INSERT, UPDATE, DELETE cost.

The above doesn't work with anything complex such as joins, but it's
certainly a start and I think satisfies everyone's gripes other than the
tuple churn that _does_ happen (*nudge nudge*, pg_autovacuum could be
integrated into the backend to handle this).
For those worried about performance, the pages that are constantly being
recycled would likely stay in disk cache (PG or the OS). There's still
some commit overhead, but still... no need to over-optimize by requiring
the table to be stored in the outdated, slow, and overused shm (also,
*nudge nudge*).

Anyway, let me throw that out there as a solution that I use and it works
quite well. I didn't explain the use of the qual column, but I think those
who grasp the above way of handling things probably grok how to use the
qual column in a dynamically executed query.

CREATE AGGREGATE CACHE anyone?

-sc

--
Sean Chittenden
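Sean doesn't show the functions themselves; a stripped-down sketch of the
read side and an INSERT trigger in the same spirit (COUNT only, no qual
handling; the table name some_table and all function names below are
invented for the example) might look like:

-- Rough sketch only: reader function for the cached COUNT.
CREATE FUNCTION public.cache_count(TEXT) RETURNS BIGINT AS '
DECLARE
    p_dbl ALIAS FOR $1;
    v_val BIGINT;
BEGIN
    SELECT val_bigint INTO v_val
      FROM global.dba_aggregate_cache
     WHERE dbl = p_dbl AND op = ''COUNT'';
    IF NOT FOUND THEN
        -- in Sean''s scheme a separate private helper would populate the
        -- row here by running the real COUNT(*) once
        RAISE EXCEPTION ''cache row for % not initialized'', p_dbl;
    END IF;
    RETURN v_val;
END;
' LANGUAGE 'plpgsql' STABLE;

-- ...and a per-table trigger that keeps the count current on INSERT,
-- assuming dbl holds the bare table name.
CREATE FUNCTION public.cache_count_ins() RETURNS TRIGGER AS '
BEGIN
    UPDATE global.dba_aggregate_cache
       SET val_bigint = val_bigint + 1
     WHERE dbl = TG_RELNAME AND op = ''COUNT'';
    RETURN NEW;
END;
' LANGUAGE 'plpgsql';

CREATE TRIGGER some_table_count_ins AFTER INSERT ON some_table
    FOR EACH ROW EXECUTE PROCEDURE public.cache_count_ins();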
[PERFORM] Solaris updates
Ran the test on another linux box - the one that generated the dump the
sun loaded (which should have similar data...) and I got a profile similar
to the Sun's. Which makes me feel more comfortable. Still interesting why
that other box gave me the different profile.

Now off to the fun and exciting world of seeing what I can do about it.

--
Jeff Trout <[EMAIL PROTECTED]>
http://www.jefftrout.com/
http://www.stuarthamm.net/
Re: [PERFORM] reindex/vacuum locking/performance?
On Sun, 2003-10-05 at 19:50, Neil Conway wrote:
> On Sun, 2003-10-05 at 19:43, Tom Lane wrote:
> > This would be relatively easy to fix as far as our own buffering is
> > concerned, but the thing that's needed to make it really useful is
> > to prevent caching of seqscan-read pages in the kernel disk buffers.
>
> For the non-portable way of doing this, are you referring to O_DIRECT?

I was hoping you'd reply to this, Tom -- you were referring to O_DIRECT,
right?

(If you were referring to O_DIRECT, I wanted to add that I wouldn't be
surprised if using O_DIRECT on many kernels reduces or eliminates any
readahead the OS will be doing on the sequential read, so the net result
may actually be a loss for a typical seqscan.)

-Neil
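For anyone who wants to play with it, opening a file for direct I/O looks
roughly like the sketch below (Linux-flavoured; the 8192-byte alignment
choice and the minimal error handling are illustrative, not a
recommendation):

    /* odirect_read.c -- sketch of reading one block with the page cache bypassed.
     * O_DIRECT requires the buffer, offset and length to be suitably aligned,
     * typically to the filesystem block size. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        char   *buf;
        int     fd;
        ssize_t n;

        if (argc < 2) { fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }

        fd = open(argv[1], O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }

        /* aligned allocation; 8192 happens to match PostgreSQL's block size */
        if (posix_memalign((void **) &buf, 8192, 8192) != 0) return 1;

        n = read(fd, buf, 8192);    /* this read bypasses the kernel page cache */
        printf("read %ld bytes\n", (long) n);
        close(fd);
        return 0;
    }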
Re: [PERFORM] Postgres low end processing.
Stef,

> => 1. Make sure that the WAL files (pg_xlog) are on a separate disk from
> => the database files, either through mounting or symlinking.
>
> I'm not sure I understand how this helps?

It gives you better fsync write performance on a low-end disk setup.
Otherwise, the disk is forced to do a hop-back-and-forth between the
database and the xlog, resulting in much slower updates and thus the
database tying up blocks of RAM longer -- particularly if your
shared_buffers are set very low, which they will be.

On RAID setups, this is unnecessary because the RAID takes care of disk
access management. But on a low-end, 2-IDE-disk machine, you have to do
it.

--
-Josh Berkus
 Aglio Database Solutions
 San Francisco
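In practice the symlink route is just the following, done with the
postmaster stopped (the /disk2 path is only an example for whatever the
second spindle is mounted as):

    pg_ctl stop
    mv $PGDATA/pg_xlog /disk2/pg_xlog
    ln -s /disk2/pg_xlog $PGDATA/pg_xlog
    pg_ctl start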
Re: [PERFORM] reindex/vacuum locking/performance?
Neil Conway <[EMAIL PROTECTED]> writes:
> On Sun, 2003-10-05 at 19:50, Neil Conway wrote:
> I was hoping you'd reply to this, Tom -- you were referring to O_DIRECT,
> right?

Not necessarily --- as you point out, it's not clear that O_DIRECT would
help us. What would be way cool is something similar to what James Rogers
was talking about: a way to tell the kernel not to promote this page all
the way to the top of its LRU list. I'm not sure that *any* Unixen have
such an API, let alone one that's common across more than one platform :-(

            regards, tom lane
Re: [PERFORM] reindex/vacuum locking/performance?
--On Monday, October 06, 2003 14:26:10 -0400 Tom Lane <[EMAIL PROTECTED]> wrote:
> Neil Conway <[EMAIL PROTECTED]> writes:
> > On Sun, 2003-10-05 at 19:50, Neil Conway wrote:
> > I was hoping you'd reply to this, Tom -- you were referring to
> > O_DIRECT, right?
>
> Not necessarily --- as you point out, it's not clear that O_DIRECT would
> help us. What would be way cool is something similar to what James
> Rogers was talking about: a way to tell the kernel not to promote this
> page all the way to the top of its LRU list. I'm not sure that *any*
> Unixen have such an API, let alone one that's common across more than
> one platform :-(

I think Veritas' VxFS has this as an option/ioctl.

You can read the Veritas doc on my http://www.lerctr.org:8458/ pages under
filesystems. That should work on UnixWare and Solaris sites that have VxFS
installed. VxFS is standard on UW.

LER

--
Larry Rosenman                     http://www.lerctr.org/~ler
Phone: +1 972-414-9812             E-Mail: [EMAIL PROTECTED]
US Mail: 1905 Steamboat Springs Drive, Garland, TX 75044-6749
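From memory, the VxFS cache advisory is set per file descriptor with an
ioctl along the lines below -- treat the header path and flag names as
approximate recollections and check the vxfsio man page before relying on
them:

    /* Sketch only: header and flag names recalled from the vxfsio docs,
     * not verified against a live VxFS installation. */
    #include <sys/fs/vx_ioctl.h>
    #include <sys/ioctl.h>
    #include <fcntl.h>
    #include <unistd.h>

    int
    open_noreuse(const char *path)
    {
        int fd = open(path, O_RDONLY);

        if (fd >= 0)
            /* ask VxFS not to let reads of this file displace the page cache */
            (void) ioctl(fd, VX_SETCACHE, VX_NOREUSE | VX_SEQ);
        return fd;
    }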
Re: [PERFORM] reindex/vacuum locking/performance?
Tom Lane wrote:
> Neil Conway <[EMAIL PROTECTED]> writes:
> > On Sun, 2003-10-05 at 19:50, Neil Conway wrote:
> > I was hoping you'd reply to this, Tom -- you were referring to
> > O_DIRECT, right?
>
> Not necessarily --- as you point out, it's not clear that O_DIRECT would
> help us. What would be way cool is something similar to what James
> Rogers was talking about: a way to tell the kernel not to promote this
> page all the way to the top of its LRU list. I'm not sure that *any*
> Unixen have such an API, let alone one that's common across more than
> one platform :-(

Solaris has "free-behind", which prevents a large kernel sequential scan
from blowing out the cache. I only read about it in the Mauro Solaris
Internals book, and it seems to be done automatically. I guess most OS's
don't do this optimization because they usually don't read files larger
than their cache.

I see BSD/OS madvise() has:

    #define MADV_NORMAL      0   /* no further special treatment */
    #define MADV_RANDOM      1   /* expect random page references */
    #define MADV_SEQUENTIAL  2   /* expect sequential references */
    #define MADV_WILLNEED    3   /* will need these pages */
--> #define MADV_DONTNEED    4   /* don't need these pages */
    #define MADV_SPACEAVAIL  5   /* insure that resources are reserved */

The marked one seems to have the control we need. Of course, the kernel
madvise() code has:

    /* Not yet implemented */

Looks like NetBSD implements it, but it also unmaps the page from the
address space, which might be more than we want. NetBSD also has:

    #define MADV_FREE        6   /* pages are empty, free them */

which frees the page. I am unclear on its use.

FreeBSD has this comment:

    /*
     *  vm_page_dontneed
     *
     *  Cache, deactivate, or do nothing as appropriate.  This routine
     *  is typically used by madvise() MADV_DONTNEED.
     *
     *  Generally speaking we want to move the page into the cache so
     *  it gets reused quickly.  However, this can result in a silly syndrome
     *  due to the page recycling too quickly.  Small objects will not be
     *  fully cached.  On the otherhand, if we move the page to the inactive
     *  queue we wind up with a problem whereby very large objects
     *  unnecessarily blow away our inactive and cache queues.
     *
     *  The solution is to move the pages based on a fixed weighting.  We
     *  either leave them alone, deactivate them, or move them to the cache,
     *  where moving them to the cache has the highest weighting.
     *  By forcing some pages into other queues we eventually force the
     *  system to balance the queues, potentially recovering other unrelated
     *  space from active.  The idea is to not force this to happen too
     *  often.
     */

The Linux comment is:

    /*
     * Application no longer needs these pages.  If the pages are dirty,
     * it's OK to just throw them away.  The app will be more careful about
     * data it wants to keep.  Be sure to free swap resources too.  The
     * zap_page_range call sets things up for refill_inactive to actually free
     * these pages later if no one else has touched them in the meantime,
     * although we could add these pages to a global reuse list for
     * refill_inactive to pick up before reclaiming other pages.
     *
     * NB: This interface discards data rather than pushes it out to swap,
     * as some implementations do.  This has performance implications for
     * applications like large transactional databases which want to discard
     * pages in anonymous maps after committing to backing store the data
     * that was kept in them.  There is no reason to write this data out to
     * the swap area if the application is discarding it.
     *
     * An interface that causes the system to free clean pages and flush
     * dirty pages is already available as msync(MS_INVALIDATE).
     */

It seems mmap is more for controlling the memory mapping of files rather
than controlling the cache itself.

--
Bruce Momjian                       |  http://candle.pha.pa.us
[EMAIL PROTECTED]                   |  (610) 359-1001
+  If your life is a hard drive,    |  13 Roberts Road
+  Christ can be your backup.       |  Newtown Square, Pennsylvania 19073
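For what it's worth, using those advisories from user space looks roughly
like the sketch below. This is illustrative only -- PostgreSQL does its
I/O with read()/write() rather than mmap(), so it is not a drop-in fix,
and the file argument is whatever data file you want to experiment with:

    /* madvise_scan.c -- hinting the kernel about a one-pass read of a file. */
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        int         fd;
        struct stat st;
        void       *p;

        if (argc < 2) return 1;
        fd = open(argv[1], O_RDONLY);
        if (fd < 0 || fstat(fd, &st) < 0) { perror(argv[1]); return 1; }

        p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        madvise(p, st.st_size, MADV_SEQUENTIAL);  /* we will scan it once, in order */
        /* ... read through the mapping here ... */
        madvise(p, st.st_size, MADV_DONTNEED);    /* done: these pages can go first */

        munmap(p, st.st_size);
        close(fd);
        return 0;
    }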
Re: [PERFORM] reindex/vacuum locking/performance?
Stepping out on a limb... (I'm not a disk kernel guy)

I have long thought that as part of a cache descriptor, there should be a
process-definable replacement strategy (RS). Each cache entry would be
associated with its process's replacement-strategy setting, and the
page-replacement algorithm would then take the desired policy into
consideration.

Imagine, for simplicity's sake, that each strategy gets its own cache
table. When it comes time to replace a page, the system scans the cache
tables, picks the most likely page for replacement from each table, then
selects the most likely page between all policies. This allows the 99% of
apps that can make excellent use of LRU to use LRU among themselves (best
for everyone), while the MRU apps (better for databases) only sacrifice
the best pages among other MRU apps. Though once you have an MRU process,
the final decision on which page to take should use MRU, not LRU.

Of course there are a number of questions: does each RS get its own table,
to be managed independently, or can we combine them all into one table?
What are the performance implications of the multiple table management?

One day, I'd like to see function pointers and kernel modules used as ways
for apps to manage replacement policy.

fantasyland# insmod MRU.o
fantasyland# vi postgresql.conf
  { replacement_policy=MRU }
{meanwhile in some postgres .c file:}
  set_cache_policy(get_cfg_replacement_policy());
fantasyland# service postmaster restart

Anyone want to throw this at the kernel developers?

> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] Behalf Of Tom Lane
> Sent: Monday, October 06, 2003 2:26 PM
> To: Neil Conway
> Cc: Andrew Sullivan; PostgreSQL Performance
> Subject: Re: [PERFORM] reindex/vacuum locking/performance?
>
> Neil Conway <[EMAIL PROTECTED]> writes:
> > On Sun, 2003-10-05 at 19:50, Neil Conway wrote:
> > I was hoping you'd reply to this, Tom -- you were referring to
> > O_DIRECT, right?
>
> Not necessarily --- as you point out, it's not clear that O_DIRECT would
> help us. What would be way cool is something similar to what James
> Rogers was talking about: a way to tell the kernel not to promote this
> page all the way to the top of its LRU list. I'm not sure that *any*
> Unixen have such an API, let alone one that's common across more than
> one platform :-(
>
> regards, tom lane
Re: [PERFORM] reindex/vacuum locking/performance?
> This would be relatively easy to fix as far as our own buffering is
> concerned, but the thing that's needed to make it really useful is
> to prevent caching of seqscan-read pages in the kernel disk buffers.
> I don't know any portable way to do that :-(

raw disc ? :-)