Re: [PERFORM] Sunfire X4500 recommendations
On Tue, 27 Mar 2007, Matt Smiley wrote:

The goal is to calculate the probability of data loss when we lose a certain number of disks within a short timespan (e.g. losing a 2nd disk before replacing+rebuilding the 1st one). For RAID 10, 50, and Z, we will lose data if any disk group (i.e. mirror or parity group) loses 2 disks. For RAID 60 and Z2, we will lose data if 3 disks die in the same parity group. The parity groups can include arbitrarily many disks. Having larger groups gives us more usable disk space but less protection. (Naturally we're more likely to lose 2 disks in a group of 50 than in a group of 5.)

g = number of disks in each group (e.g. mirroring = 2; single-parity = 3 or more; dual-parity = 4 or more)
n = total number of disks
risk of losing any 1 disk = 1/n

please explain why you are saying that the risk of losing any 1 disk is 1/n. shouldn't it be probability of failure * n instead?

risk of losing 1 disk from a particular group = g/n
risk of losing 2 disks in the same group = g/n * (g-1)/(n-1)
risk of losing 3 disks in the same group = g/n * (g-1)/(n-1) * (g-2)/(n-2)

following this logic the risk of losing all 48 disks in a single group of 48 would be 100%

also, what you are looking for is the probability of the second (and third) disks failing in time X (where X is the time necessary to notice the failure, get a replacement, and rebuild the disk).

the killer is the time needed to rebuild the disk; with multi-TB arrays it's sometimes faster to re-initialize the array and reload from backup than it is to do a live rebuild (the kernel.org servers had a RAID failure recently and HPA mentioned that it took a week to rebuild the array, but it would have only taken a couple of days to do a restore from backup).

add to this the fact that disk failures do not appear to be truly independent of each other statistically (see the recent studies released by Google and CMU), and I wouldn't bother with single-parity for a multi-TB array. If the data is easy to recreate (including from backup) or short lived (say a database of log data that cycles every month or so) I would just do RAID 0 and plan on losing the data on drive failure (this assumes that you can afford the loss of service when this happens). if the data is more important then I'd do dual-parity or more, along with a hot spare so that the rebuild can start as soon as the first failure is noticed by the system, to give myself a fighting chance to save things.

In terms of performance, I think RAID 10 should always be best for write speed. (Since it doesn't calculate parity, writing a new block doesn't require reading the rest of the RAID stripe just to recalculate the parity bits.) I think it's also normally just as fast for reading, since the controller can load-balance the pending read requests to both sides of each mirror.

this depends on your write pattern. if you are doing sequential writes (say writing a log archive) then RAID 5 can be faster than RAID 10: since there is no data there to begin with, the system doesn't have to read anything to calculate the parity, and with the data spread across more spindles you have a higher potential throughput. if your write pattern is more random, and especially if you are overwriting existing data, then the reads needed to calculate the parity will slow you down.

as for read speed, it all depends on your access pattern and stripe size. if you are reading data that spans disks (larger than your stripe size) you end up with a single read tying up multiple spindles.
with RAID 1 (and variants) you can read from either disk of the set if you need different data within the same stripe that's on different disk tracks (if it's on the same track you'll get it just as fast reading from a single drive, or so close to it that it doesn't matter). beyond that the question is how many spindles you can keep busy reading (as opposed to seeking to new data or sitting idle because you don't need their data). the worst case for reading is to be jumping through your data in strides of stripe size * number of disks available (accounting for RAID type), as all your reads will end up hitting the same disk.

David Lang
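A minimal sketch of the same-group arithmetic quoted above -- not from the original posts, and using the corrected form given later in the thread (the first failure only fixes which group is at risk, so the leading g/n term is dropped):

    # Probability that, after k unreplaced failures, all of them landed in the
    # same g-disk group of an n-disk array (assumes every disk is equally likely
    # to be the next one to fail -- the 1/n assumption debated in the follow-ups).
    def same_group_loss(g, n, k):
        p = 1.0
        for i in range(1, k):          # condition on the group of the first failure
            p *= (g - i) / (n - i)     # i disks of that group (and of the array) are gone
        return p

    print(same_group_loss(g=2, n=48, k=2))    # RAID 10 on 48 disks: ~0.021
    print(same_group_loss(g=48, n=48, k=2))   # one 48-disk group: 1.0, as noted above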
Re: [PERFORM] Sunfire X4500 recommendations
On Thu, 29 Mar 2007, Matt Smiley wrote:

Hi David, Thanks for your feedback! I'm rather a newbie at this, and I do appreciate the critique. First, let me correct myself: the formulas for the risk of losing data when you lose 2 and 3 disks shouldn't have included the first term (g/n). I'll give the corrected formulas and tables at the end of the email.

please explain why you are saying that the risk of losing any 1 disk is 1/n. shouldn't it be probability of failure * n instead?

1/n represents the assumption that all disks have an equal probability of being the next one to fail. This seems like a fair assumption in general for the active members of a stripe (not including hot spares). A possible exception would be the parity disks (because reads always skip them and writes always hit them), but that's only a consideration if the RAID configuration used dedicated disks for parity instead of distributing it across the RAID 5/6 group members. Apart from that, whether the workload is write-heavy or read-heavy, sequential or scattered, the disks in the stripe ought to handle a roughly equivalent number of iops over their lifetime.

only assuming that you have a 100% chance of some disk failing. if you have 15 disks in one array and 60 disks in another array, the chance of having _some_ failure in the 15-disk array is only 1/4 the chance of having a failure of _some_ disk in the 60-disk array.

following this logic the risk of losing all 48 disks in a single group of 48 would be 100%

Exactly. Putting all disks in one group is RAID 0 -- no data protection. If you lose even 1 active member of the stripe, the probability of losing your data is 100%.

but by your math, the chance of failure with dual parity on a 48-disk group would also be 100%, and that is just wrong.

also what you are looking for is the probability of the second (and third) disks failing in time X (where X is the time necessary to notice the failure, get a replacement, and rebuild the disk)

Yep, that's exactly what I'm looking for. That's why I said, "these probabilities are only describing the case where we don't have enough time between disk failures to recover the array." My goal wasn't to estimate how long time X is. (It doesn't seem like a generalizable quantity; due partly to logistical and human factors, it's unique to each operating environment.) Instead, I start with the assumption that time X has been exceeded, and we've lost a 2nd (or 3rd) disk in the array. Given that assumption, I wanted to show the probability that the loss of the 2nd disk has caused the stripe to become unrecoverable.

OK, so this is the chance that, if you lose N disks without replacing any of them, you will lose data in the different array layouts.

We know that RAID 10 and 50 can tolerate the loss of anywhere between 1 and n/g disks, depending on how lucky you are. I wanted to quantify the amount of luck required, as a risk management tool. The duration of time X can be minimized with hot spares and attentive administrators, but the risk after exceeding time X can only be minimized (as far as I know) by configuring the RAID stripe with small enough underlying failure groups.

but I don't think this is the question anyone is really asking.
what people want to know isn't 'how many disks can I lose without replacing them before I lose data'; what they want to know is 'with this configuration (including a drive replacement time of Y for the first N drives and Z for drives after that), what are the odds of losing data', and for the second question the chance of failure of additional disks isn't 100%.

the killer is the time needed to rebuild the disk, with multi-TB arrays it's sometimes faster to re-initialize the array and reload from backup than it is to do a live rebuild (the kernel.org servers had a RAID failure recently and HPA mentioned that it took a week to rebuild the array, but it would have only taken a couple of days to do a restore from backup)

That's very interesting. I guess the rebuild time also would depend on how large the damaged failure group was. Under RAID 10, for example, I think you'd still only have to rebuild 1 disk from its mirror, regardless of how many other disks were in the stripe, right? So shortening the rebuild time may be another good motivation to keep the failure groups small.

correct, however you have to decide how much this speed is worth to you. if you are building a ~20TB array you can do this with ~30 drives with single or dual parity, or ~60 drives with RAID 10. remember the big cost of arrays like this isn't even the cost of the drives (although you are talking an extra $20,000 or so there), but the cost of the power and cooling to run all those extra drives.

add to this the fact that disk failures do not appear to be truly independent of each other statistically ...
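A quick numerical check of the 15-disk vs 60-disk point made earlier in this message -- a sketch of my own, with an assumed per-disk failure rate, showing that the chance of *some* disk failing scales roughly linearly with the number of disks:

    # chance of at least one failure among n disks, assuming independent failures
    p = 0.10 / (365 * 24)            # assumed per-disk, per-hour failure probability
    for n in (15, 60):
        print(n, 1 - (1 - p) ** n)   # roughly n*p for small p, so 60 disks is ~4x 15 disks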
Re: [PERFORM] SCSI vs SATA
On Tue, 3 Apr 2007, Geoff Tolley wrote:

Ron wrote: At 07:07 PM 4/3/2007, Ron wrote: For random IO, the 3ware cards are better than PERC. Question: will 8*15k 73GB SCSI drives outperform 24*7K 320GB SATA II drives? Nope. Not even if the 15K 73GB HDs were the brand new Savvio 15K screamers. Example assuming 3.5" HDs and RAID 10 => 4 15K 73GB vs 12 7.2K 320GB. The 15K's are 2x faster rpm, but they are only ~23% the density => advantage per HD to SATAs. Then there's the fact that there are 1.5x as many 7.2K spindles as 15K spindles... Oops, make that =3x= as many 7.2K spindles as 15K spindles...

I don't think the density difference will be quite as high as you seem to think: most 320GB SATA drives are going to be 3-4 platters, the most that a 73GB SCSI is going to have is 2, and more likely 1, which would make the SCSIs more like 50% the density of the SATAs. Note that this only really makes a difference to theoretical sequential speeds; if the seeks are random the SCSI drives could easily get there 50% faster (lower rotational latency and they certainly will have better actuators for the heads). Individual 15K SCSIs will trounce 7.2K SATAs in terms of i/os per second.

true, but with 3x as many drives (and 4x the capacity per drive) the SATA system will have to do far less seeking.

for that matter, with 20ish 320G drives, how large would a partition be that only used the outer physical track of each drive? (almost certainly multiple logical tracks) if you took the time to set this up you could eliminate seeking entirely (at the cost of not using your capacity, but since you are considering a 12x range in capacity, it's obviously not your primary concern)

If you care about how often you'll have to replace a failed drive, then the SCSI option no question, although check the cases for hot-swappability.

note that the CMU and Google studies both commented on being surprised at the lack of difference between the reliability of SCSI and SATA drives.

David Lang
Re: [PERFORM] SCSI vs SATA
On Wed, 4 Apr 2007, Peter Kovacs wrote:

But if an individual disk fails in a disk array, sooner rather than later you would want to purchase a new fitting disk, walk/drive to the location of the disk array, replace the broken disk in the array and activate the new disk. Is this correct?

correct, but more drives also give you the chance to do multiple parity arrays so that you can lose more drives before you lose data. see the thread titled 'Sunfire X4500 recommendations' for some stats on how likely you are to lose your data in the face of multiple drive failures. you can actually get much better reliability than RAID 10.

David Lang
Re: [PERFORM] SCSI vs SATA
On Thu, 5 Apr 2007, [EMAIL PROTECTED] wrote:

I'm curious to know why you're on xfs (I've been too chicken to stray from ext3).

better support for large files (although postgres does tend to try and keep the file size down by going with multiple files) and also for more files. the multiple levels of indirection that ext3 uses for accessing large files (or large directories) can really slow things down, just from the overhead of looking up the metadata (including finding where the actual data blocks are on disk). ext4 is planning to address this and will probably be a _very_ good improvement, but ext3 has very definite limits that it inherited from ext2.

David Lang
Re: [PERFORM] SCSI vs SATA
On Thu, 5 Apr 2007, Ron wrote: At 11:19 AM 4/5/2007, Scott Marlowe wrote: On Thu, 2007-04-05 at 00:32, Tom Lane wrote: "James Mansion" <[EMAIL PROTECTED]> writes:

Right --- the point is not the interface, but whether the drive is built for reliability or to hit a low price point.

Personally I take the marketing mumblings about the enterprise drives with a pinch of salt. The low-price drives HAVE TO be reliable too, because a non-negligible failure rate will result in returns processing costs that destroy a very thin margin.

Reliability is relative. Server-grade drives are built to be beat upon 24x7x365 for the length of their warranty period. Consumer-grade drives are built to be beat upon a few hours a day, a few days a week, for the length of their warranty period. Even if the warranties mention the same number of years, there is a huge difference here.

Just a couple of points... Server drives are generally more tolerant of higher temperatures. I.e. the failure rate for consumer and server class HDs may be about the same at 40 degrees C, but by the time the internal case temps get up to 60-70 degrees C, the consumer grade drives will likely be failing at a much higher rate, whether they're working hard or not.

Exactly correct.

Which brings up my next point: I'd rather have 36 consumer grade drives in a case that moves a LOT of air and keeps the drive bays cool than 12 server class drives in a case that has mediocre / poor air flow in it.

Also exactly correct. High temperatures or unclean power issues age HDs faster than any other factors.

this I agree with, however I believe that this is _so_ much of a factor that it swamps any difference there may be between 'enterprise' and 'consumer' drives.

Until the drives have been burnt in and proven reliable, just assume that they could all fail at any time and act accordingly.

Yep. Folks should google "bath tub curve of statistical failure" or similar. Basically, always burn in your drives for at least 1/2 a day before using them in a production or mission critical role.

for this and your first point, please go and look at the google and cmu studies. unless the vendors did the burn-in before delivering the drives to the sites that installed them, there was no 'infant mortality' spike on the drives (both studies commented on this, they expected to find one).

David Lang
Re: [PERFORM] a question about Direct I/O and double buffering
On Thu, 5 Apr 2007, Xiaoning Ding wrote:

To the best of my knowledge, Postgres itself does not have a direct IO option (although it would be a good addition). So, in order to use direct IO with postgres you'll need to consult your filesystem docs for how to set the forcedirectio mount option. I believe it can be set dynamically, but if you want it to be permanent you'll have to add it to your fstab/vfstab file.

I use Linux. It supports direct I/O on a per-file basis only. To bypass the OS buffer cache, files should be opened with the O_DIRECT option. I'm afraid that I have to modify PG.

as someone who has been reading the linux-kernel mailing list for 10 years, let me comment on this a bit.

linux does have a direct i/o option, but it has significant limits on when and how you can use it (buffers must be 512-byte aligned and multiples of 512 bytes, things like that). Also, in many cases testing has shown that there is a fairly significant performance hit for this, not a performance gain.

what I think postgres really needs is to add support for write barriers (telling the OS to make sure that everything before the barrier is written to disk before anything after the barrier). I believe that these are available on SCSI drives, and on some SATA drives. this sort of support, along with appropriate async I/O support (which is probably going to end up being the 'syslets' or 'threadlets' stuff that's in the early experimental stage, rather than the current AIO API) has the potential to be a noticeable improvement.

if you haven't followed the syslets discussion on the kernel list: threadlets are an approach that basically lets you turn any syscall into an async interface (if the call doesn't block on anything you get the answer back immediately, if it does block it gets turned into an async call by the kernel). syslets are a way to combine multiple syscalls into a single call, avoiding the user->system->user calling overhead for the additional calls. (it's also viewed as a way to do prototyping of possible new calls; if a sequence of syscalls ends up being common enough the kernel devs will look at making a new, combined syscall -- for example lock, write, unlock could be made into one if it's common enough and there's enough of a performance gain)

David Lang
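To make the alignment constraints above concrete, here is a minimal Linux-only sketch (not from the thread; the path is just a placeholder): the file is opened with O_DIRECT and the read buffer comes from an anonymous mmap, which is page-aligned and a multiple of 512 bytes, so the kernel accepts it where a plain bytearray would typically fail with EINVAL.

    import os, mmap

    BLOCK = 4096                                 # a multiple of the 512-byte sector size
    fd = os.open("/tmp/datafile", os.O_RDONLY | os.O_DIRECT)   # placeholder path
    buf = mmap.mmap(-1, BLOCK)                   # anonymous mmap => page-aligned buffer
    nread = os.readv(fd, [buf])                  # offset and length are both aligned
    os.close(fd)
    print(nread, "bytes read, bypassing the page cache")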
Re: [PERFORM] a question about Direct I/O and double buffering
On Thu, 5 Apr 2007, Xiaoning Ding wrote:

Xiaoning: Looks like it. I just did a cursory search of the archives and it seems that others have looked at this before, so you'll probably want to start there if you're up to it. Linux used to have (still does?) a RAW interface which might also be useful. I think the original code was contributed by Oracle so they could support direct IO. Alex

I am more concerned with reads, and how to do direct I/O under Linux here. Reading raw devices in Linux bypasses the OS buffer cache.

it also bypasses OS readahead, which is not necessarily a win.

But how can you mount a raw device (it is a character device) as a file system?

you can do a mkfs on /dev/hda just like you do on /dev/hda2 and then mount the result as a filesystem. Postgres wants the OS layer to provide the filesystem; Oracle implements its own filesystem, so you would just point it at the drive/partition and it would do its own 'formatting'.

this is something that may be reasonable for postgres to consider doing someday: since postgres allocates things into 1M files and then keeps track of what filename is used for what, it could instead allocate things in 1M (or whatever size) chunks on the disk, and just keep track of what addresses are used for what instead of filenames. this would definitely allow you to work around problems like the ext2/3 indirect lookup problems.

now that there is the ability to have partitioned tablespaces, it would be an interesting experiment to be able to define a tablespace that used a raw device instead of a filesystem, to see if there are any noticeable performance gains.

David Lang
Re: [PERFORM] SCSI vs SATA
On Thu, 5 Apr 2007, Scott Marlowe wrote: On Thu, 2007-04-05 at 14:30, James Mansion wrote: Server drives are generally more tolerant of higher temperatures. I.e. the failure rate for consumer and server class HDs may be about the same at 40 degrees C, but by the time the internal case temps get up to 60-70 degrees C, the consumer grade drives will likely be failing at a much higher rate, whether they're working hard or not. Can you cite any statistical evidence for this? Logic? Mechanical devices have decreasing MTBF when run in hotter environments, often at non-linear rates. this I will agree with. Server class drives are designed with a longer lifespan in mind. Server class hard drives are rated at higher temperatures than desktop drives. these two I question. David Lang
Re: [PERFORM] SCSI vs SATA
On Thu, 5 Apr 2007, Ron wrote: At 10:07 PM 4/5/2007, [EMAIL PROTECTED] wrote: On Thu, 5 Apr 2007, Scott Marlowe wrote:

Server class drives are designed with a longer lifespan in mind. Server class hard drives are rated at higher temperatures than desktop drives.

these two I question. David Lang

Both statements are the literal truth. Not that I would suggest abusing your server class HDs just because they are designed to live longer and in more demanding environments. Overheating, nasty electrical phenomena, and abusive physical shocks will trash a server class HD almost as fast as they will a consumer grade one. The big difference between the two is that a server class HD can sit in a rack with literally 100's of its brothers around it, cranking away on server class workloads 24x7 in a constant vibration environment (fans, other HDs, NOC cooling systems) and be quite happy, while a consumer HD will suffer greatly shortened life and die a horrible death in such an environment and under such use.

Ron, I know that the drive manufacturers have been claiming this, but I'll say that my experience doesn't show a difference and neither do the google and CMU studies (and they were all in large datacenters, some HPC labs, some commercial companies).

again, the studies showed _no_ noticeable difference between the 'enterprise' SCSI drives and the 'consumer' SATA drives.

David Lang
Re: [PERFORM] a question about Direct I/O and double buffering
On Thu, 5 Apr 2007, Erik Jones wrote: On Apr 5, 2007, at 3:33 PM, [EMAIL PROTECTED] wrote: On Thu, 5 Apr 2007, Xiaoning Ding wrote:

To the best of my knowledge, Postgres itself does not have a direct IO option (although it would be a good addition). So, in order to use direct IO with postgres you'll need to consult your filesystem docs for how to set the forcedirectio mount option. I believe it can be set dynamically, but if you want it to be permanent you'll have to add it to your fstab/vfstab file.

I use Linux. It supports direct I/O on a per-file basis only. To bypass the OS buffer cache, files should be opened with the O_DIRECT option. I'm afraid that I have to modify PG.

as someone who has been reading the linux-kernel mailing list for 10 years, let me comment on this a bit. linux does have a direct i/o option,

Yes, I know applications can request direct i/o with the O_DIRECT flag to open(), but can this be set to be forced for all applications or for individual applications from "outside" the application (not that I've ever heard of something like the second)?

no it can't, due to the fact that direct i/o has additional requirements for what you can use for buffers that don't apply to normal i/o.

but it has significant limits on when and how you can use it (buffers must be 512-byte aligned and multiples of 512 bytes, things like that).

That's a standard limit imposed by the sector size of hard drives, and is present in all direct i/o implementations, not just Linux.

right, but you don't have those limits for normal i/o.

Also, in many cases testing has shown that there is a fairly significant performance hit for this, not a performance gain.

Have those performance hits been noticed for high i/o transaction databases? The idea here is that these kinds of database manage their own caches, and having a separate filesystem cache in virtual memory that works with system memory page sizes is an unneeded level of indirection.

ahh, you're proposing a re-think of how postgres interacts with the OS, not just an optimization to be applied to the current architecture. unlike Oracle, Postgres doesn't try to be an OS itself; it tries very hard to rely on the OS to properly implement things rather than doing its own implementation.

Yes, you should expect other "normal" utilities to suffer a performance hit, since if you are trying to cp a 500-byte file you'll still have to work with 8K writes and reads, whereas with the filesystem cache you can just write/read part of a page in memory and let the cache decide when it needs to write and read from disk. If there are other caveats to direct i/o on Linux I'd love to hear them.

other than bad interactions with "normal" utilities not compiled for direct i/o, I don't remember them offhand.

David Lang
Re: [PERFORM] SCSI vs SATA
On Fri, 6 Apr 2007, Ron wrote:

Bear in mind that Google was and is notorious for pushing their environmental factors to the limit while using the cheapest "PoS" HW they can get their hands on. Let's just say I'm fairly sure every piece of HW they were using for those studies was operating outside of manufacturer's suggested specifications.

Ron, please go read both the studies -- unless you want to say that every organization CMU picked to study abused their hardware as well.

Under such conditions the environmental factors are so deleterious that they swamp any other effect. OTOH, I've spent my career being as careful as possible to as much as possible run HW within manufacturer's suggested specifications. I've been chided for it over the years... ...usually by folks who "save" money by buying commodity HDs for big RAID farms in NOCs or push their environmental envelope or push their usage envelope or ... ...and then act surprised when they have so much more down time and HW replacements than I do. All I can tell you is that I've gotten to eat my holiday dinner far more often than my counterparts who push it in that fashion. OTOH, there are crises like the Power Outage of 2003 in the NE USA where some places had such Bad Things happen that it simply doesn't matter what you bought (power dies, generator cuts in, power comes on, but AC units crash, temperatures shoot up so fast that by the time everything is re-shutdown it's in the 100F range in the NOC. Lots 'o stuff dies on the spot + spend the next 6 months having HW failures at +considerably+ higher rates than historical norms. Ick..) IME, it really does make a difference =if you pay attention to the difference in the first place=. If you treat everything equally poorly, then you should not be surprised when everything acts equally poorly. But hey, YMMV. Cheers, Ron Peacetree
Re: [PERFORM] SCSI vs SATA
On Fri, 6 Apr 2007, Scott Marlowe wrote: Based on experience I think that on average server drives are more reliable than consumer grade drives, and can take more punishment. this I am not sure about But, the variables of manufacturer, model, and the batch often make even more difference than grade. this I will agree with fully. David Lang
Re: [PERFORM] SCSI vs SATA
On Fri, 6 Apr 2007, Charles Sprickman wrote: On Fri, 6 Apr 2007, [EMAIL PROTECTED] wrote: On Fri, 6 Apr 2007, Scott Marlowe wrote: Based on experience I think that on average server drives are more reliable than consumer grade drives, and can take more punishment. this I am not sure about

I think they should survey Tivo owners next time. Perfect stress-testing environment. Mine runs at over 50C most of the time, and it's writing 2 video streams 24/7. What more could you do to punish a drive? :)

and the drives that are in them are consumer IDE drives. I will admit that I've removed the cover from my Tivo to allow it to run cooler, and I'm still on the original drive + the 100G drive I purchased way back when (7+ years ago). before I removed the cover I did have times when the Tivo would die from the heat (Los Angeles area in the summer with no A/C).

David Lang
Re: [PERFORM] SCSI vs SATA
... 'cheap' drives, my reaction is that single redundancy isn't good enough. depending on how valuable the data is, you need to either replicate the data to another system, or go with dual-parity redundancy (or both).

while drives probably won't be this bad in real life (this is, after all, slightly worse than the studies show for their 3rd-year drives, and 'enterprise' drives may be slightly better), I have to assume that they will be for my reliability planning.

also, if you read through the cmu study, drive failures were only a small percentage of system outages (16-25% depending on the site). you have to make sure that you aren't so fixated on drive reliability that you fail to account for other types of problems (down to and including the chance of someone accidentally powering down the rack that you are plugged into, be it from hitting a power switch, to overloading a weak circuit breaker).

In looking at these problems overall I find that in most cases I need to have redundant systems with the data replicated anyway (with logs sent elsewhere), so I can get away with building failover pairs instead of having each machine with redundant drives. I've found that I can frequently get a pair of machines for less money than other departments spend on buying a single 'enterprise' machine with the same specs (although the prices are dropping enough on the top-tier manufacturers that this is less true today than it was a couple of years ago), and I find that the failure rate is about the same on a per-machine basis, so I end up with a much better uptime record due to having the redundancy of the second full system (never mind things like it being easier to do upgrades, as I can work on the inactive machine and then fail over to work on the other, now inactive, machine).

while I could ask for the budget to be doubled to provide the same redundancy with the top-tier manufacturers, I don't do so for several reasons, the top two being that these manufacturers frequently won't configure a machine the way I want them to (just try to get a box with writeable media built in, either a floppy or a CDR/DVDR -- they want you to use something external), and doing so also exposes me to people second-guessing me on where redundancy is needed ('that's only development, we don't need redundancy there', until a system goes down for a day and the entire department is unable to work).

it's not that the people who disagree with you don't care about their data, it's that they have different experiences than you do (experiences that come close to matching the studies where they tracked hundreds of thousands of drives of different types), and as a result believe that the difference (if any) between the different types of drives isn't significant in the overall failure rate (especially when you take the difference of drive capacity into account).

David Lang

P.S. here is a chart from that thread showing the chances of losing data with different array configurations.
if you say that there is a 10% chance of a disk failing each year (significantly higher than the studies listed above, but close enough) then this works out to ~0.001% chance of a drive failing per hour (a reasonably round number to work with).

to write 750G at ~45MB/sec takes 5 hours of 100% system throughput, or ~50 hours at 10% of the system throughput (background rebuilding). if we cut the rebuild rate in half to account for inefficiencies in retrieving data from other disks to calculate parity, it can take 100 hours (just over four days) to do a background rebuild, or about a 0.1% chance for each disk of losing a second disk during the rebuild. with 48 drives this is a ~5% chance of losing everything with single parity; however, the odds of losing two disks during this time are 0.25%, so double parity is _well_ worth it.

chance of losing data before the hot spare is finished rebuilding (assumes one hot spare per group; you may be able to share a hot spare between multiple groups to get slightly higher capacity):

RAID 60 or Z2 -- Double-parity, must lose 3 disks from the same group to lose data:

    disks_per_group  num_groups  total_disks  usable_disks  risk_of_data_loss
                  2          24           48           n/a  n/a
                  3          16           48           n/a  (0.0001% with manual replacement of drive)
                  4          12           48            12  0.0009%
                  6           8           48            24  0.003%
                  8           6           48            30  0.006%
                 12           4           48            36  0.02%
                 16           3           48            39  0.03%
                 24           2           48            42  0.06%
                 48           1           48            45  0.25%

RAID 10 or 50 -- Mirroring or single-parity, must lose 2 disks from the same group to lose data: ...
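For reference, a quick sketch (my own; it follows the assumptions spelled out above, and the rounding differs slightly from the figures in the post) that redoes the rebuild-window arithmetic:

    from math import comb

    p_hour    = 0.10 / (365 * 24)                  # assumed 10%/year per-disk failure rate
    rebuild_h = 2 * (750e9 / 45e6) / 0.10 / 3600   # 10% of 45MB/s, 2x inefficiency: ~93 hours
    p_disk    = p_hour * rebuild_h                 # ~0.1% per surviving disk

    survivors  = 47
    p_one_more = 1 - (1 - p_disk) ** survivors     # ~5%: single parity has lost data
    p_two_more = comb(survivors, 2) * p_disk ** 2  # ~0.1%: dual parity has lost data
    print("%.3f%% per disk, %.1f%% one more, %.3f%% two more"
          % (100 * p_disk, 100 * p_one_more, 100 * p_two_more))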
Re: [PERFORM] SCSI vs SATA
... opinion (no matter who it's from) isn't as good as a formal study. It's also insulting to everyone who puts in the time and effort to be a professional within a field rather than a lay person.

it's also insulting to assume (or appear to assume) that everyone who disagrees with you is a lay person. you may not have meant it (this is e-mail after all, with all the problems that come from that), but this is what you seem to have been implying, if not outright saying.

Whether there's a name for it or not, there's definitely an important distinction between each of anecdote, professional opinion, and study result.

the line between an anecdote and a professional opinion is pretty blurry, and hard to see without wasting a lot of time getting everyone to give their credentials, etc. if a professional doesn't spend enough time thinking about some of the details (i.e. how many drive failures of each type have I seen in the last 5 years, as opposed to in the 5-year timeframe from 1980-1985) they can end up giving an opinion that's in the range of reliability and relevance that anecdotes are.

don't assume malice so quickly.

David Lang
Re: [PERFORM] Feature Request --- was: PostgreSQL Performance Tuning
On Fri, 27 Apr 2007, Josh Berkus wrote:

Dan, Yes, this is the classic problem. I'm not demanding anyone pick up the ball and jump on this today, tomorrow, etc.. I just think it would be good for those who *could* make a difference to keep those goals in mind when they continue. If you have the right mindset, this problem will fix itself over time.

Don't I wish. Autotuning is *hard*. It took Oracle 6 years. It took Microsoft 3-4 years, and theirs still has major issues last I checked. And both of those DBs support fewer OSes than we do. I think it's going to take more than the *right mindset* and my spare time.

I think there are a couple of different things here.

1. full autotuning. as you say, this is very hard and needs a lot of info about your particular database usage.

2. getting defaults that are closer to right than the current ones. this is much easier. for this nobody is expecting that the values are right, we're just begging for some tool to get us within a couple of orders of magnitude of what's correct.

the current defaults are appropriate for a single CPU with 10's of MB of ram and a single drive. nowadays you have people trying to run quick-and-dirty tests on some spare hardware they have lying around (waiting for another project) that's got 4-8 CPUs with 10's of GB of ram and a couple dozen drives.

these people don't know about database tuning; they can learn, but they want to see if postgres is even in the ballpark. if the results are close to acceptable they will ask questions and research the tuning, but if the results are orders of magnitude lower than they need to be they'll just say that postgres is too slow and try another database.

an autodefault script that was written assuming that postgres has the box to itself would be a wonderful start. I think the next step would be to be able to tell the script 'only plan on using 1/2 of this box', and beyond that would be the steps that you are thinking of, where the usage pattern is considered.

but when every performance question is answered with "did you change the defaults? they are way too low for modern hardware, raise them by 2 orders of magnitude and then we'll start investigating"

David Lang
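As a sketch of what such an autodefault script could look like (a rough sketch of my own, not an existing tool): read total RAM from /proc/meminfo and print ballpark values, assuming postgres has the box to itself. The 25%/75% fractions and the work_mem cap are common rules of thumb, not anything blessed by the project.

    # Linux-only sketch: emit ballpark postgresql.conf values from total RAM.
    def total_ram_bytes():
        with open("/proc/meminfo") as f:
            for line in f:
                if line.startswith("MemTotal:"):
                    return int(line.split()[1]) * 1024   # /proc/meminfo reports kB
        raise RuntimeError("MemTotal not found")

    MB = 1024 * 1024
    ram = total_ram_bytes()
    print("shared_buffers = %dMB" % (ram // 4 // MB))            # ~25% of RAM
    print("effective_cache_size = %dMB" % (ram * 3 // 4 // MB))  # ~75% of RAM
    print("work_mem = %dMB" % max(4, ram // 100 // MB))          # crude per-sort ceiling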
Re: [PERFORM] Feature Request --- was: PostgreSQL Performance Tuning
On Tue, 1 May 2007, Carlos Moreno wrote:

large problem from a slog perspective; there is no standard way even within Linux to describe CPUs, for example. Collecting available disk space information is even worse. So I'd like some help on this portion.

Quite likely, naiveness follows... But aren't things like /proc/cpuinfo, /proc/meminfo, /proc/partitions / /proc/diskstats standard, at the very least across Linux distros? I'm not familiar with BSD or other Unix flavours, but I would expect these (or their equivalent) to exist in those, no? Am I just being naive?

unfortunately yes. across different linux distros they are fairly standard (however different kernel versions will change them), but different kernels need drastically different tools to get the info from them.

David Lang
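For what it's worth, a small sketch (mine) of pulling the basics out of those Linux /proc files -- exactly the part that would need a separate implementation on BSD or the other Unix flavours:

    def cpu_models():
        with open("/proc/cpuinfo") as f:
            return [l.split(":", 1)[1].strip() for l in f if l.startswith("model name")]

    def block_devices():
        devs = []
        with open("/proc/partitions") as f:
            for line in f:
                fields = line.split()
                if len(fields) == 4 and fields[0].isdigit():   # skip header and blank lines
                    devs.append((fields[3], int(fields[2])))   # (name, size in 1K blocks)
        return devs

    models = cpu_models()
    print(len(models), "logical CPUs:", models[0] if models else "unknown")
    print(block_devices())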
Re: [PERFORM] Feature Request --- was: PostgreSQL Performance Tuning
On Tue, 1 May 2007, Greg Smith wrote: On Tue, 1 May 2007, Josh Berkus wrote: there is no standard way even within Linux to describe CPUs, for example. Collecting available disk space information is even worse. So I'd like some help on this portion.

what type of description of the CPUs are you looking for?

It's also hard/impossible to devise tuning algorithms that work for both gross tuning (increase shared_buffers by 100x) and fine tuning (decrease bgwriter_interval to 45ms). I would advocate focusing on iterative improvements to an existing configuration rather than even bothering with generating a one-off config for exactly this reason. It *is* hard/impossible to get it right in a single shot, because of how many parameters interact and the way bottlenecks clear, so why not assume from the start you're going to do it several times--then you've only got one piece of software to write.

nobody is asking for things to be right the first time.

The idea I have in my head is a tool that gathers system info, connects to the database, and then spits out recommendations in order of expected effectiveness--with the specific caveat that changing too many things at one time isn't recommended, and some notion of parameter dependencies. The first time you run it, you'd be told that shared_buffers was wildly low, effective_cache_size isn't even in the right ballpark, and your work_mem looks small relative to the size of your tables; fix those before you bother doing anything else because any data collected with those at very wrong values is bogus.

why not have a much simpler script that gets these values up into the right ballpark first? then after that the process and analysis that you are suggesting would be useful. the problem is that the defaults are _so_ far off that no sane incremental program is going to be able to converge on the right answer rapidly.

David Lang

Take two, those parameters pass their sanity tests, but since you're actually running at a reasonable speed now the fact that your tables are no longer being vacuumed frequently enough might bubble to the top. It would take a few passes through to nail down everything, but as long as it's put together such that you'd be in a similar position to the single-shot tool after running it once, it would remove that as something separate that needed to be built. To argue against myself for a second, it may very well be the case that writing the simpler tool is the only way to get a useful prototype for building the more complicated one; very easy to get bogged down in feature creep on a grand design otherwise.
Re: [PERFORM] Feature Request --- was: PostgreSQL Performance Tuning
On Thu, 3 May 2007, Josh Berkus wrote:

Greg, I'm not fooled--secretly you and your co-workers laugh at how easy this is on Solaris and are perfectly happy with how difficult it is on Linux, right? Don't I wish. There's issues with getting CPU info on Solaris, too, if you get off of Sun hardware to generic white boxes. The base issue is that there's no standardization on how manufacturers report the names of their CPUs, 32/64-bit, or clock speeds. So any attempt to determine "how fast" a CPU is, even on a 1-5 scale, requires matching against a database of regexes which would have to be kept updated. And let's not even get started on Windows.

I think the only sane way to try and find the CPU speed is to just do a busy loop of some sort (ideally something that somewhat resembles the main code) and see how long it takes. you may have to do this a few times until you get a loop that takes long enough (a few seconds) on a fast processor.

David Lang
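A minimal sketch (assumptions mine) of that calibration loop: keep doubling the iteration count until one run takes a couple of seconds of wall-clock time, then report iterations per second as a crude speed score.

    import time

    def busy(n):
        acc = 0
        for i in range(n):
            acc += i * i % 7          # trivial stand-in for "something resembling the main code"
        return acc

    n, elapsed = 1 << 16, 0.0
    while elapsed < 2.0:              # grow the loop until a run takes ~2 seconds
        n *= 2
        start = time.time()
        busy(n)
        elapsed = time.time() - start
    print("%.0f iterations/sec" % (n / elapsed))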
Re: [PERFORM] Feature Request --- was: PostgreSQL Performance Tuning
On Thu, 3 May 2007, Carlos Moreno wrote:

CPUs, 32/64-bit, or clock speeds. So any attempt to determine "how fast" a CPU is, even on a 1-5 scale, requires matching against a database of regexes which would have to be kept updated. And let's not even get started on Windows.

I think the only sane way to try and find the CPU speed is to just do a busy loop of some sort (ideally something that somewhat resembles the main code) and see how long it takes. you may have to do this a few times until you get a loop that takes long enough (a few seconds) on a fast processor.

I was going to suggest just that (but then was afraid that again I may have been just being naive) --- I can't remember the exact name, but I remember using (on some Linux flavor) an API call that fills a struct with data on the resource usage for the process, including CPU time; I assume measured with precision (that is, immune to issues of other applications running simultaneously, or other random events causing the measurement to be polluted by random noise).

since what we are looking for here is a reasonable first approximation, not perfection, I don't think we should worry much about pollution of the value. if the person has other things running while they run this test that will also be running when they run the database, it's no longer 'pollution', it's part of the environment. I think a message at runtime that it may produce inaccurate results if you have other heavy processes running that won't be running with the database would be good enough (remember it's not only CPU time that's affected like this, it's disk performance as well).

As for 32/64-bit --- doesn't PG already know that information? I mean, ./configure does gather that information --- does it not?

we're not talking about compiling PG, we're talking about getting sane defaults for a pre-compiled binary. if it's a 32-bit binary assume a 32-bit CPU, if it's a 64-bit binary assume a 64-bit CPU (all hardcoded into the binary at compile time).

David Lang
Re: [PERFORM] Feature Request --- was: PostgreSQL Performance Tuning
On Thu, 3 May 2007, Carlos Moreno wrote:

been just being naive) --- I can't remember the exact name, but I remember using (on some Linux flavor) an API call that fills a struct with data on the resource usage for the process, including CPU time; I assume measured with precision (that is, immune to issues of other applications running simultaneously, or other random events causing the measurement to be polluted by random noise).

since what we are looking for here is a reasonable first approximation, not perfection, I don't think we should worry much about pollution of the value.

Well, it's not as much worrying as it is choosing the better among two equally difficult options --- what I mean is that obtaining the *real* resource usage as reported by the kernel is, from what I remember, equally hard as it is obtaining the time with milli- or micro-second resolution. So, why not choose this option? (in fact, if we wanted to do it "the scripted way", I guess we could still use "time test_cpuspeed_loop" and read the report by the command time, specifying CPU time and system calls time.)

I don't think it's that hard to get system time to a reasonable level (if this config tuner needs to run for a minute or two to generate numbers, that's acceptable, it's only run once), but I don't think that the results are really that critical.

do we really care if the loop runs 1,000,000 times per second or 1,001,000 times per second? I'd argue that we don't even care about 1,000,000 times per second vs 1,100,000 times per second; what we care about is 1,000,000 times per second vs 100,000 times per second. if you do a 10 second test and run it for 11 seconds you are still in the right ballpark (i.e. close enough that you really need to move to the stage-2 tuning to figure the exact values).

As for 32/64-bit --- doesn't PG already know that information? I mean, ./configure does gather that information --- does it not?

we're not talking about compiling PG, we're talking about getting sane defaults for a pre-compiled binary. if it's a 32-bit binary assume a 32-bit CPU, if it's a 64-bit binary assume a 64-bit CPU (all hardcoded into the binary at compile time).

Right --- I was thinking that configure, which as I understand generates the Makefiles to compile applications including initdb, could plug those values in as compile-time constants, so that initdb (or a hypothetical additional utility that would do what we're discussing in this thread) already has them. Anyway, yes, that would go for the binaries as well --- we're pretty much saying the same thing :-)

I'm thinking along the lines of a script or pre-compiled binary (_not_ initdb) that you could run and have it generate a new config file that has values that are within about an order of magnitude of being correct.

David Lang
Re: [PERFORM] Feature Request --- was: PostgreSQL Performance Tuning
On Thu, 3 May 2007, Carlos Moreno wrote:

I don't think it's that hard to get system time to a reasonable level (if this config tuner needs to run for a minute or two to generate numbers, that's acceptable, it's only run once), but I don't think that the results are really that critical.

Still --- this does not provide a valid argument against my claim. Ok, we don't need precision --- but do we *need* to have less precision?? I mean, you seem to be proposing that we deliberately go out of our way to discard a solution with higher precision and choose the one with lower precision --- just because we do not have a critical requirement for the extra precision. That would be a valid argument if the extra precision came at a considerable cost (well, or at whatever cost, considerable or not).

the cost I am seeing is the cost of portability (getting similarly accurate info from all the different operating systems).

But my point is still that obtaining the time in the right ballpark and obtaining the time with good precision are two things that have, from any conceivable point of view (programming effort, resources consumption when executing it, etc. etc.), the exact same cost --- why not pick the one that gives us the better results? Mostly when you consider that:

I'd argue that we don't even care about 1,000,000 times per second vs 1,100,000 times per second, what we care about is 1,000,000 times per second vs 100,000 times per second

Part of my claim is that measuring real time you could get an error like this or even a hundred times this!! Most of the time you wouldn't, and definitely if the user is careful it would not happen --- but it *could* happen!!! (and when I say could, I really mean: trust me, I have actually seen it happen)

if you have errors of several orders of magnitude in the number of loops it can run in a given time period then you don't have something that you can measure to any accuracy (and it wouldn't matter anyway; if your loops are that variable, your code execution would be as well).

Why not just use an *extremely simple* solution that is getting information from the kernel reporting the actual CPU time that has been used??? Of course, this goes under the premise that in all platforms there is such a simple solution like there is on Linux (the exact name of the API function still eludes me, but I have used it in the past, and I recall that it was just three or five lines of code).

I think the problem is that it's a _different_ 3-5 lines of code for each OS. if I'm wrong and it's the same for the different operating systems then I agree that we should use the most accurate clock we can get. I just don't think we have that.

David Lang
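For what it's worth, the kernel interface being half-remembered above is presumably getrusage(); here is a short sketch (mine) showing it next to the wall-clock measurement, so the difference under outside load is visible. Python exposes it through the resource module on the Unix-like systems, which is at least a partial answer to the portability worry.

    import resource, time

    def busy(n):
        acc = 0
        for i in range(n):
            acc += i * i % 7
        return acc

    r0, w0 = resource.getrusage(resource.RUSAGE_SELF), time.time()
    busy(5_000_000)
    r1, w1 = resource.getrusage(resource.RUSAGE_SELF), time.time()

    cpu = (r1.ru_utime - r0.ru_utime) + (r1.ru_stime - r0.ru_stime)
    print("cpu time %.3fs, wall time %.3fs" % (cpu, w1 - w0))   # these diverge under load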
Re: [PERFORM] Feature Request --- was: PostgreSQL Performance Tuning
On Thu, 3 May 2007, Carlos Moreno wrote:

Part of my claim is that measuring real time you could get an error like this or even a hundred times this!! Most of the time you wouldn't, and definitely if the user is careful it would not happen --- but it *could* happen!!! (and when I say could, I really mean: trust me, I have actually seen it happen)

if you have errors of several orders of magnitude in the number of loops it can run in a given time period then you don't have something that you can measure to any accuracy (and it wouldn't matter anyway; if your loops are that variable, your code execution would be as well)

Not necessarily --- operating conditions may change drastically from one second to the next; that does not mean that your system is useless; simply that the measuring mechanism is way too vulnerable to the particular operating conditions at the exact moment it was executed. I'm not sure if that was intentional, but you bring up an interesting issue --- or in any case, your comment made me drastically re-think my whole argument: do we *want* to measure the exact speed, or rather the effective speed under normal operating conditions on the target machine? I know the latter is almost impossible --- we're talking about an estimate of a random process' parameter (and we need to do it in a short period of time) ... But the argument goes more or less like this: if you have a machine that runs at 1000 MIPS, but it's usually busy running things that on average consume 500 of those 1000 MIPS, would we want PG's configuration file to be obtained based on 1000 or based on 500 MIPS??? After all, the CPU is, as far as PostgreSQL will be able to see, 500 MIPS fast, *not* 1000. I think I better stop, if we want to have any hope that the PG team will ever actually implement this feature (or similar) ... We're probably just scaring them!! :-)

simpler is better (or: perfect is the enemy of good enough). if you do your sample over a few seconds (or a few tens of seconds) things will average out quite a bit. the key is to be going for a reasonable starting point. after that the full analysis folks can start in with all their monitoring and tuning, but the 80/20 rule really applies here: 80% of the gain is from getting 'fairly close' to the right values, and that should only be 20% of the full 'tuning project'.

David Lang
Re: [PERFORM] Best OS for Postgres 8.2
On Tue, 8 May 2007, Chris wrote: David Levy wrote: Hi, I am about to order a new server for my Postgres cluster. I will probably get a Dual Xeon Quad Core instead of my current Dual Xeon. Which OS would you recommend to optimize Postgres behaviour (i/o access, multithreading, etc)? I am hesitating between Fedora Core 6, CentOS and Debian. Can anyone help with this?

Use the one you're most comfortable with. I don't think you'll notice *that* much difference between linux systems for performance - but whether you're comfortable using any of them will make a difference in managing it in general.

the tuning that you do (both of the OS and of postgres) will make more of a difference than anything else.

David Lang
Re: [PERFORM] Best OS for Postgres 8.2
On Tue, 8 May 2007, Ian Li wrote:

In #postgresql on freenode, somebody once mentioned that ZFS from Solaris helps a lot with the performance of pgsql, so does anyone have information about that?

the filesystem you use will affect the performance of postgres significantly. I've heard a lot of claims for ZFS, unfortunately many of them from people who have proven that they didn't know what they were talking about by the end of their first or second e-mails.

much of the hype for ZFS is its volume management capabilities and admin tools. Linux has most (if not all) of the volume management capabilities; it just separates them from the filesystems so that any filesystem can use them, and as a result you use one tool to set up your RAID, one to set up snapshots, and a third to format your filesystems, where ZFS does this in one userspace tool.

once you separate the volume management piece out, the actual performance question is a lot harder to answer. there are a lot of people who say that it's far faster than the alternative filesystems on Solaris, but I haven't seen any good comparisons between it and Linux filesystems.

On Linux you have the choice of several filesystems, and the performance will vary wildly depending on your workload. I personally tend to favor ext2 (for small filesystems where the application is ensuring data integrity) or XFS (for large filesystems). I personally don't trust reiserfs; jfs seems to be a tool for transitioning from AIX more than anything else; and ext3 seems to have all the scaling issues of ext2 plus the overhead (and bottleneck) of journaling.

one issue with journaling filesystems: if you journal the data as well as the metadata you end up with a very reliable setup, however it means that all your data needs to be written twice, once to the journal, and once to the final location. the write to the journal can be slightly faster than a normal write to the final location (the journal is a sequential write to an existing file), however the need to write twice can effectively cut your disk I/O bandwidth in half when doing heavy writes. worse, when you end up writing more than will fit in the journal (128M is the max for ext3) the entire system then needs to stall while the journal gets cleared to make space for the additional writes.

if you don't journal your data then you avoid the problems above, but in a crash you may find that you lost data, even though the filesystem is 'intact' according to fsck.

David Lang

Steve Atkins wrote: On May 7, 2007, at 2:55 PM, David Levy wrote: Hi, I am about to order a new server for my Postgres cluster. I will probably get a Dual Xeon Quad Core instead of my current Dual Xeon. Which OS would you recommend to optimize Postgres behaviour (i/o access, multithreading, etc)? I am hesitating between Fedora Core 6, CentOS and Debian. Can anyone help with this?

Well, all three you mention are much the same, just with a different badge on the box, as far as performance is concerned. They're all going to be a moderately recent Linux kernel, with your choice of filesystems, so any choice between them is going to be driven more by available staff and support or personal preference. I'd probably go CentOS 5 over Fedora just because Fedora doesn't get supported for very long - more of an issue with a dedicated database box with a long lifespan than your typical desktop or interchangeable webserver. I might also look at Solaris 10, though.
I've yet to play with it much, but it seems nice, and I suspect it might manage 8 cores better than current Linux setups. Cheers, Steve

Regards Ian
Re: [PERFORM] Best OS for Postgres 8.2
On Tue, 8 May 2007, Claus Guttesen wrote: > In #postgresql on freenode, somebody once mentioned that ZFS from > Solaris > helps a lot with the performance of pgsql, so does anyone have information > about that?

the filesystem you use will affect the performance of postgres significantly. I've heard a lot of claims for ZFS, unfortunately many of them from people who have proven that they didn't know what they were talking about by the end of their first or second e-mails. much of the hype for ZFS is its volume management capabilities and admin tools. Linux has most (if not all) of the volume management capabilities, it just separates them from the filesystems so that any filesystem can use them, and as a result you use one tool to set up your RAID, one to set up snapshots, and a third to format your filesystems, where ZFS does this in one userspace tool.

Even though those posters may have proven themselves wrong, zfs is still a very handy fs and it should not be judged relative to these statements.

I don't disagree with you, I'm just noting that too many of the 'ZFS is great' posts need to be discounted as a result (the same thing goes for the 'reiserfs4 is great' posts)

once you separate the volume management piece out, the actual performance question is a lot harder to answer. there are a lot of people who say that it's far faster than the alternative filesystems on Solaris, but I haven't seen any good comparisons between it and Linux filesystems.

One could install pg on solaris 10 and format the data-area as ufs and then as zfs and compare import- and query-times and other benchmarking, but comparing ufs/zfs to Linux-filesystems would also be a comparison of those two os'es.

however, such a comparison is very legitimate, it doesn't really matter which filesystem is better if the OS that it's tied to limits it so much that the other one wins out with an inferior filesystem

currently ZFS is only available on Solaris; parts of it have been released under GPLv2, but it doesn't look like enough of it to be ported to Linux (enough was released for grub to be able to access it read-only, but not the full filesystem). there are also patent concerns that are preventing any porting to Linux. on the other hand, its integrated userspace tools are pushing people to create similar tools for Linux (without needing to combine the various pieces in the kernel)

David Lang
Re: [PERFORM] Best OS for Postgres 8.2
On Tue, 8 May 2007, Trygve Laugstøl wrote: currently ZFS is only available on Solaris; parts of it have been released under GPLv2, but it doesn't look like enough of it to be ported to Linux (enough was released for grub to be able to access it read-only, but not the full filesystem). there are also patent concerns that are preventing any porting to Linux.

This is not entirely correct. ZFS is only under the CDDL license and it has been ported to FreeBSD. http://mail.opensolaris.org/pipermail/zfs-discuss/2007-April/026922.html

I wonder how they handled the license issues? I thought that if you combined stuff that was BSD licensed with stuff with a more restrictive license, the result was under the more restrictive license. thanks for the info. here's a link about the GPLv2 stuff for zfs http://blogs.sun.com/darren/entry/zfs_under_gplv2_already_exists
Re: [PERFORM] Best OS for Postgres 8.2
On Tue, 8 May 2007, Steinar H. Gunderson wrote: On Mon, May 07, 2007 at 03:14:08PM -0700, Joshua D. Drake wrote: It is my understanding (and I certainly could be wrong) that FreeBSD doesn't handle SMP nearly as well as Linux (and Linux not as well as Solaris). I'm not actually sure about the last part. There are installations as big as 1024 CPUs that run Linux -- most people won't need that, but it's probably an indicator that eight cores should run OK :-)

over the weekend the question of scalability was raised on the linux kernel mailing list, and people are shipping 1024 cpu systems with linux and testing 4096 cpu systems. there are occasionally still bottlenecks that limit scalability, but unless you run into a bad driver or filesystem you should have no problems in the 8-16 core range.

any comparison between Linux and any other OS needs to include a date for when the comparison was made; Linux is changing at a frightening pace (I think I saw something within the last few weeks that said that the rate of change for the kernel has averaged around 9000 lines of code per day over the last couple of years), so you need to re-check comparisons every year or two or you end up working with obsolete data.

David Lang
Re: [PERFORM] Throttling PostgreSQL's CPU usage
On Tue, 8 May 2007, Daniel Griscom wrote: I'm building a kiosk with a 3D front end accessing PostGIS/PostgreSQL via Apache/PHP. The 3D display is supposed to show smooth motion from location to location, with PostGIS giving dynamically updated information on the locations. Everything runs on the same machine, and it all works, but when I start a query the 3D display stutters horribly. It looks like PostgreSQL grabs hold of the CPU and doesn't let go until it's completed the query. I don't need the PostgreSQL query to return quickly, but I must retain smooth animation while the query is being processed. In other words, I need PostgreSQL to spread out its CPU usage so that it doesn't monopolize the CPU for any significant time (more than 50ms or so). Possible solutions: 1: Set the PostgreSQL task priority lower than the 3D renderer task, and make sure that the 3D renderer sleep()s enough to let PostgreSQL get its work done. The obvious objection to this obvious solution is "Priority inversion!", but I see that as an additional challenge to be surmounted rather than an absolute prohibition. So, any thoughts on setting the PostgreSQL task priority (including by the much-maligned tool shown at <http://weblog.bignerdranch.com/?p=11>)?

this may or may not help

3: ... some other solution I haven't thought of.

take a look at the scheduler discussion that has been taking place on the linux-kernel list. there are a number of things being discussed specifically to address the type of problems that you are running into (a CPU hog causing latencies for graphics processes). it looks like nothing will go into the 2.6.22 kernel officially, but if you are willing to test the bejeezus out of it before you depend on it, I suspect that either the SD or CFS schedulers will clean things up for you.

David Lang
Re: [PERFORM] What's The Difference Between VACUUM and VACUUM ANALYZE?
On Tue, 8 May 2007, Y Sidhu wrote: I am trying to follow a message thread. One guy says we should be running vacuum analyze daily and the other says we should be running vacuum multiple times a day. I have tried looking for what a vacuum analyze is to help me understand, but no luck.

vacuum frees tuples that are no longer referenced. vacuum analyze does the same thing, but then does some additional information gathering about what data is in the tables. Postgres uses this data to adjust its estimates of how long various things will take (sequential scan, etc). if these estimates are off by a huge amount (especially if you have never done a vacuum analyze after loading your table) then it's very likely that postgres will be slow because it's doing expensive operations that it thinks are cheap.

David Lang
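To make the distinction concrete, here is a minimal sketch (the table name "mytable" is made up, not taken from the thread):

    -- reclaim dead tuples only:
    VACUUM mytable;

    -- reclaim dead tuples and also refresh the planner's statistics:
    VACUUM ANALYZE mytable;

    -- the gathered statistics are visible in pg_stats and feed the
    -- planner's cost estimates:
    SELECT attname, n_distinct, most_common_vals
    FROM pg_stats
    WHERE tablename = 'mytable';

the ANALYZE step after a bulk load is usually the most important part, since without it the planner is working from default assumptions about the table.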
Re: [PERFORM] Throttling PostgreSQL's CPU usage
On Tue, 8 May 2007, Daniel Griscom wrote: Thanks for all the feedback. Unfortunately I didn't specify that this is running on a WinXP machine (the 3D renderer is an ActiveX plugin), and I don't even think "nice" is available. I've tried using the Windows Task Manager to set every postgres.exe process to a low priority, but that didn't make a difference. Several people have mentioned having multiple processors; my current machine is a uni-processor machine, but I believe we could spec the actual runtime machine to have multiple processors/cores. I'm only running one query at a time; would that query be guaranteed to confine itself to a single processor/core? In terms of performance, I don't think simply more power will do the trick; I've got an AMD 3200+, and even doubling the power/halving the stutter time won't be good enough. Someone suggested "setting the CPU costs higher"; where would I learn about that? Someone else mentioned having a custom function that sleep()ed on every row access; where would I learn more about that? I've also been reading up on VACUUM. I haven't explicitly run it in the several days since I've installed the database (by loading a humongous data.sql file); might this be part of the performance problem?

it would cause postgres to work harder than it needs to, but it doesn't solve the problem of postgres eating cpu that you need for your rendering (i.e. it may reduce the stutters, but won't eliminate them). a single query will confine itself to one core, but if you have a vacuum or autovacuum run it will affect the second core. I don't know what you can do on windows beyond this though.

David Lang

P.S. make sure you get real multi-core cpus, hyperthreading is _not_ a second core for this problem.
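If part of the stutter turns out to come from VACUUM itself rather than from the queries, cost-based vacuum delay (available since 8.0) can spread vacuum's work out. It only throttles vacuum, not ordinary queries, so it doesn't address the main problem in this thread; the values below are illustrative and "mytable" is a made-up name:

    -- make a manual VACUUM pause periodically instead of running flat out:
    SET vacuum_cost_delay = 20;   -- sleep 20 ms each time the cost budget is spent
    SET vacuum_cost_limit = 200;  -- cost budget between sleeps
    VACUUM ANALYZE mytable;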
Re: [PERFORM] Throttling PostgreSQL's CPU usage
On Tue, 8 May 2007, Carlos Moreno wrote: Daniel Griscom wrote: Several people have mentioned having multiple processors; my current machine is a uni-processor machine, but I believe we could spec the actual runtime machine to have multiple processors/cores. My estimate is that yes, you should definitely consider that. I'm only running one query at a time; would that query be guaranteed to confine itself to a single processor/core? From what Joshua mentions, looks like you do have that guarantee.

isn't there a way to limit how many processes postgres will create? if this is limited to 1, what happens when a vacuum run hits (or autovacuum)?

David Lang
Re: FW: [PERFORM]
On Tue, 8 May 2007, Orhan Aglagul wrote: No, it is one transaction per insert. -Original Message- From: Scott Marlowe [mailto:[EMAIL PROTECTED] Sent: Tuesday, May 08, 2007 5:38 PM To: Orhan Aglagul Subject: RE: [PERFORM] On Tue, 2007-05-08 at 19:36, Orhan Aglagul wrote: But 10,000 records in 65 sec comes to ~153 records per second. On a dual 3.06 Xeon. What range is acceptable? If you're doing that in one big transaction, that's horrible. Because it shouldn't be waiting for each insert to fsync, but the whole transaction.

with a standard 7200 rpm drive, ~150 transactions/sec sounds about right. to really speed things up you want to get a disk controller with a battery backed cache so that the writes don't need to hit the disk to be safe. that should get your speeds up to (and possibly above) what you got by turning fsync off.

David Lang
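The other cheap win, as Scott implies, is to stop paying for one fsync per row. A rough sketch, with a made-up "events" table:

    -- one transaction per INSERT means every COMMIT waits for its own WAL
    -- flush, which caps you at roughly one commit per disk rotation.
    -- batching many rows into one transaction amortizes that cost:
    BEGIN;
    INSERT INTO events (msg) VALUES ('row 1');
    INSERT INTO events (msg) VALUES ('row 2');
    -- ... thousands more rows ...
    COMMIT;   -- only this point waits for the WAL flush

with a battery-backed write cache the per-commit flush itself becomes cheap, so the two approaches help independently and can be combined.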
Re: [PERFORM] Best OS for Postgres 8.2
On Tue, 8 May 2007, Greg Smith wrote: On Tue, 8 May 2007, Luke Lonergan wrote: From discussions with the developers, the biggest issue is a technical one: the Linux VFS layer makes the [ZFS] port difficult. Difficult on two levels. First you'd have to figure out how to make it work at all; then you'd have to reshape it into a form that would be acceptable to the Linux kernel developers, who haven't seemed real keen on the idea so far.

given that RAID, snapshots, etc are already in the linux kernel, I suspect that what will need to happen is for the filesystem to be ported without those features and then the userspace tools (that manipulate the volumes) be ported to use the things already in the kernel.

The standard article I'm sure you've already seen this week on this topic is Jeff Bonwick's at http://blogs.sun.com/bonwick/entry/rampant_layering_violation

yep, that sounds like what I've been hearing. what the ZFS (and reiserfs4) folks haven't been wanting to hear from the linux kernel devs is that they are interested in having all these neat features available for use with all filesystems (and the linux kernel has a _lot_ of filesystems available); with solaris you basically have UFS and ZFS, so it's not as big a deal.

What really bugged me was his earlier article linked to there where he talks about how ZFS eliminates the need for hardware RAID controllers: http://blogs.sun.com/bonwick/entry/raid_z While there may be merit to that idea for some applications, like situations where you have a pig of a RAID5 volume, that's just hype for database writes. "We issue the SYNCHRONIZE CACHE command to the disks after pushing all data in a transaction group"--see, that would be the part the hardware controller is needed to accelerate. If you really care about whether your data hit disk, there is no way to break the RPM barrier without hardware support. The fact that he misunderstands such a fundamental point makes me wonder what other gigantic mistakes might be buried in his analysis.

I've seen similar comments from some of the linux kernel devs; they've used low-end raid controllers with small processors on them and think that a second core/socket in the main system to run software raid on is better.

David Lang
Re: [PERFORM] Best OS for Postgres 8.2
On Wed, 9 May 2007, Steinar H. Gunderson wrote: On Wed, May 09, 2007 at 01:57:51AM -0700, [EMAIL PROTECTED] wrote: given that RAID, snapshots, etc are already in the linux kernel, I suspect that what will need to happen is for the filesystem to be ported without those features and then the userspace tools (that manipulate the volumes) be ported to use the things already in the kernel. Well, part of the idea behind ZFS is that these parts are _not_ separated in "layers" -- for instance, the filesystem can push data down to the RAID level to determine the stripe size used.

there's nothing preventing this from happening if they are separate layers either. there are some performance implications of the separate layers, but until someone has the ability to do head-to-head comparisons it's hard to say which approach will win (in theory the lack of layers makes for faster code, but in practice the fact that each layer is gone over by experts looking for ways to optimize it may overwhelm the layering overhead)

Whether this is a good idea is of course hotly debated, but I don't think you can port just the filesystem part and call it a day.

Oh, I'm absolutely sure that doing so won't satisfy people (and would generate howls of outrage from some parts), but having watched other groups try to get things into the kernel that the kernel devs felt were layering violations, I think that it's what will ultimately happen.

David Lang
Re: ZFS and Postgresql - WASRe: [PERFORM] Best OS for Postgres 8.2
On Wed, 9 May 2007, Jignesh Shah wrote: But we still pay the penalty on WAL while writing them in the first place I guess .. Is there an option to disable it.. I can test how much is the impact, I guess a couple of %s, but good to verify :-)

on modern CPUs, where the CPU is significantly faster than RAM, calculating a checksum is free if the CPU has to touch the data anyway (cycles where it would be waiting for a cache miss are spent doing the calculations). if you don't believe me, hack the source to remove the checksum and see if you can measure any difference.

David Lang

> Regards, Jignesh

Alvaro Herrera wrote: Jignesh Shah escribió: > Now comes the thing that I am still exploring > * Do we do checksum in WAL ? I guess we do .. Which means that we are > now doing double checksumming on the data. One in ZFS and one in > postgresql. ZFS does allow checksumming to be turned off (but only on newly > allocated blocks). But of course the philosophy is where should it be > done (ZFS or PostgreSQL). > Checksums on WAL are not optional in Postgres, because AFAIR they are used to determine when it should stop recovering.
Re: [PERFORM] setting up raid10 with more than 4 drives
On Wed, 30 May 2007, Jonah H. Harris wrote: On 5/29/07, Luke Lonergan <[EMAIL PROTECTED]> wrote: AFAIK you can't RAID1 more than two drives, so the above doesn't make sense to me. Yeah, I've never seen a way to RAID-1 more than 2 drives either. It would have to be his first one: D1 + D2 = MD0 (RAID 1) D3 + D4 = MD1 ... D5 + D6 = MD2 ... MD0 + MD1 + MD2 = MDF (RAID 0)

I don't know what the failure mode ends up being, but on linux I had no problems creating what appears to be a massively redundant (but small) array:

md0 : active raid1 sdo1[10](S) sdn1[8] sdm1[7] sdl1[6] sdk1[5] sdj1[4] sdi1[3] sdh1[2] sdg1[9] sdf1[1] sde1[11](S) sdd1[0] 896 blocks [10/10] [UU]

David Lang
Re: [PERFORM] Thousands of tables versus on table?
On Mon, 4 Jun 2007, Scott Marlowe wrote: Gregory Stark wrote: "Thomas Andrews" <[EMAIL PROTECTED]> writes: > I guess my real question is, does it ever make sense to create thousands > of > tables like this? > Sometimes. But usually it's not a good idea. What you're proposing is basically partitioning, though you may not actually need to put all the partitions together for your purposes. Partitioning's main benefit is in the management of the data. You can drop and load partitions in chunks rather than have to perform large operations on millions of records. Postgres doesn't really get any faster by breaking the tables up like that. In fact it probably gets slower as it has to look up which of the thousands of tables you want to work with. That's not entirely true. PostgreSQL can be markedly faster using partitioning as long as you always access it by referencing the partitioning key in the where clause. So, if you partition the table by date, and always reference it with a date in the where clause, it will usually be noticeably faster. OTOH, if you access it without using a where clause that lets it pick partitions, then it will be slower than one big table. So, while this poster might originally think to have one table for each user, resulting in thousands of tables, maybe a compromise where you partition on userid ranges would work out well, and keep each partition table down to some 50-100 thousand rows, with smaller indexes to match.

what if he doesn't use the postgres internal partitioning, but instead makes his code access the tables named responsesN, where N is the id of the customer? this is what it sounded like he was asking about initially.

David Lang
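For readers following along, a minimal sketch of the userid-range partitioning Scott describes, using 8.x-style inheritance partitioning; all table and column names here are made up for illustration:

    -- parent table plus range-partitioned children:
    CREATE TABLE responses (customer_id int, created date, body text);

    CREATE TABLE responses_0001_0500 (
        CHECK (customer_id BETWEEN 1 AND 500)
    ) INHERITS (responses);

    CREATE TABLE responses_0501_1000 (
        CHECK (customer_id BETWEEN 501 AND 1000)
    ) INHERITS (responses);

    CREATE INDEX responses_0001_0500_cust ON responses_0001_0500 (customer_id);
    CREATE INDEX responses_0501_1000_cust ON responses_0501_1000 (customer_id);

    -- with constraint exclusion on, a query that names the partition key
    -- only touches the matching child table:
    SET constraint_exclusion = on;
    SELECT count(*) FROM responses WHERE customer_id = 42;

queries that don't mention customer_id still have to scan every child, which is Scott's point about it being slower without a partition-picking where clause.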
Re: [PERFORM] Thousands of tables versus on table?
On Tue, 5 Jun 2007, Gregory Stark wrote: "Scott Marlowe" <[EMAIL PROTECTED]> writes: Sorry, I think I initially read your response as "Postgres doesn't really get any faster by breaking the tables up" without the "like that" part. Well, breaking up the tables like that or partitioning, either way should be about equivalent really. Breaking up the tables and doing it in the application should perform even better, but it does make the schema less flexible and harder to do non-partition based queries and so on.

but he said in the initial message that they don't do cross-customer reports anyway, so there really isn't any non-partition based querying going on anyway.

I guess I should explain what I originally meant: A lot of people come from a flat-file world and assume that things get slower when you deal with large tables. In fact, due to the magic of log(n), accessing records from a large index is faster than first looking up the table and index info in a small index and then doing a second lookup in an index for a table half the size.

however, if your query plan ever does a sequential scan of a table then you are not doing a log(n) lookup, are you?

Where the win in partitioning comes in is in being able to disappear some of the data entirely. By making part of the index key implicit in the choice of partition you get away with a key that's half as large. And in some cases you can get away with using a different key entirely which wouldn't otherwise have been feasible to index. In some cases you can even do sequential scans whereas in an unpartitioned table you would have to use an index (or scan the entire table). But the real reason people partition data is really for the management ease. Being able to drop and load entire partitions in O(1) makes it feasible to manage data on a scale that would simply be impossible without partitioned tables.

remember that the original question wasn't about partitioned tables, it was about the performance problem he was having with one large table (slow insert speed) and asking if postgres would collapse if he changed his schema to use a separate table per customer.

I see many cases where people advocate collapsing databases/tables together by adding a column that indicates which customer the row is for. however I really don't understand why it is more efficient to have a 5B row table that you do a report/query against 0.1% of than it is to have 1000 different tables of 5M rows each and do a report/query against 100% of one of them. it would seem that having to skip over 99.9% of the data to find the things that _may_ be relevant would have a noticeable cost in and of itself.

David Lang
Re: [PERFORM] Thousands of tables versus on table?
On Tue, 5 Jun 2007, Tom Lane wrote: [EMAIL PROTECTED] writes: however I really don't understand why it is more efficient to have a 5B row table that you do a report/query against 0.1% of than it is to have 1000 different tables of 5M rows each and do a report/query against 100% of one.

Essentially what you are doing when you do that is taking the top few levels of the index out of the database and putting it into the filesystem; plus creating duplicative indexing information in the database's system catalogs. The degree to which this is a win is *highly* debatable, and certainly depends on a whole lot of assumptions about filesystem performance. You also need to assume that constraint-exclusion in the planner is pretty doggone cheap relative to the table searches, which means it almost certainly will lose badly if you carry the subdivision out to the extent that the individual tables become small. (This last could

what is considered 'small'? a few thousand records, a few million records? what multiplication factor would there need to be on the partitioning to make it worthwhile? 100 tables, 1000 tables, 10000 tables?

the company that I'm at started out with a separate database per customer (not using postgres); there are basically zero cross-customer queries, with a large volume of updates and lookups. overall things have now grown to millions of updates/day (some multiple of this in lookups), and ~2000 customers, with tens of millions of rows between them. having each one as a separate database has really helped us over the years as it's made it easy to scale (run 500 databases on each server instead of 1000, performance just doubled)

various people (not database experts) are pushing to install Oracle cluster so that they can move all of these to one table with a customerID column. the database folks won't comment much on this either way, but they don't seem enthusiastic to combine all the data together. I've been on the side of things that said that separate databases are better because it improves data locality to only have to look at the data for one customer at a time rather than having to pick that customer's data out from the mass of other, unrelated data.

be improved in some cases if we had a more explicit representation of partitioning, but it'll never be as cheap as one more level of index search.)

say you have a billing table of customerID, date, description, amount, tax, extended, paid and you need to do things like: report on invoices that haven't been paid, summarize the amount billed each month, summarize the tax for each month. but you need to do this separately for each customerID (not as a batch job that reports on all customerIDs at once; think of a website where the customer can request such reports at any time with a large variation in criteria). would you be able to just have one index on customerID and then another on date? or would the second one need to be on customerID||date? and would this process of going through the index and seeking to the data it points to really be faster than a sequential scan of just the data related to that customerID?

I think the main argument for partitioning is when you are interested in being able to drop whole partitions cheaply.

I fully understand this if you are doing queries across all the partitions, but if your query is confined to a single partition, especially in the case where you know ahead of time in the application which 'partition' you care about, it would seem that searching through significantly less data should be a win.
David Lang
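As a concrete illustration of the customerID-plus-date question above, a single multicolumn index usually covers that kind of per-customer reporting. This is only a sketch; the column names follow the billing example in the post, while the index name and the customer id 42 are made up:

    CREATE INDEX billing_cust_date_idx ON billing (customerid, date);

    -- unpaid invoices for one customer:
    SELECT * FROM billing
    WHERE customerid = 42 AND paid = false;

    -- amount billed and tax per month for one customer:
    SELECT date_trunc('month', date) AS month,
           sum(amount) AS billed,
           sum(tax)    AS tax
    FROM billing
    WHERE customerid = 42
    GROUP BY 1
    ORDER BY 1;

Whether walking that index beats a sequential scan of a per-customer table is exactly the trade-off being debated here; the index approach tends to win when one customer's rows are a small fraction of the whole table.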
Re: [PERFORM] Thousands of tables versus on table?
On Wed, 6 Jun 2007, Steinar H. Gunderson wrote: On Tue, Jun 05, 2007 at 05:59:25PM -0400, Tom Lane wrote: I think the main argument for partitioning is when you are interested in being able to drop whole partitions cheaply. Wasn't there also talk about adding the ability to mark individual partitions as read-only, thus bypassing MVCC and allowing queries to be satisfied using indexes only? Not that I think I've seen it on the TODO... :-)

now that's a very interesting idea, especially when combined with time-based data where the old times will never change.

David Lang
Re: [PERFORM] Thousands of tables versus on table?
On Wed, 6 Jun 2007, Scott Marlowe wrote: > pretty doggone cheap relative to the table searches, which means it > almost certainly will lose badly if you carry the subdivision out to > the extent that the individual tables become small. (This last could

what is considered 'small'? a few thousand records, a few million records?

I would say small is when the individual tables are in the 10 to 20 Megabyte range. How many records that is depends on record width, of course. Basically, once the tables get small enough that you don't really need indexes much, since you tend to grab 25% or more of each one that you're going to hit in a query.

thanks, that makes a lot of sense

what multiplication factor would there need to be on the partitioning to make it worthwhile? 100 tables, 1000 tables, 10000 tables?

Really depends on the size of the master table I think. If the master table is about 500 Megs in size, and you partition it down to about 1 meg per child table, you're probably ok. Walking through 500 entries for constraint exclusion seems pretty speedy from the tests I've run on a 12M row table that was about 250 Megs, split into 200 to 400 or so equal-sized child tables. The time to retrieve 85,000 rows that were all neighbors went from 2 to 6 seconds, to about 0.2 seconds, and we got rid of indexes entirely since they weren't really needed anymore.

remember, I'm talking about a case where you don't have to go through constraint checking. you know to start with what customerID you are dealing with, so you just check the tables for that customer.

the company that I'm at started out with a separate database per customer (not using postgres); there are basically zero cross-customer queries, with a large volume of updates and lookups. overall things have now grown to millions of updates/day (some multiple of this in lookups), and ~2000 customers, with tens of millions of rows between them. having each one as a separate database has really helped us over the years as it's made it easy to scale (run 500 databases on each server instead of 1000, performance just doubled)

I think that for what you're doing, partitioning at the database level is probably a pretty good compromise solution. Like you say, it's easy to put busy databases on a new server to balance out the load. Hardware is cheap.

various people (not database experts) are pushing to install Oracle cluster so that they can move all of these to one table with a customerID column.

Have these people identified a particular problem they're trying to solve, or is this a religious issue for them? From your description it sounds like a matter of dogma, not problem solving.

in part it is, in part it's because the commercial database companies have told management that doing database replication is impossible with so many databases (we first heard this back when we had 300 or so databases). we've gone the expensive EMC disk-layer replication route, but they think that merging everything will simplify things somehow so the database can do its job better. I see it as just a limitation of the replication solutions offered by the big-name vendors.

the database folks won't comment much on this either way, but they don't seem enthusiastic to combine all the data together.

I think they can see the fecal matter heading towards the rotational cooling device on this one. I can't imagine this being a win from the perspective of saving the company money.

neither do I.

David Lang
Re: [PERFORM] How much ram is too much
On Fri, 8 Jun 2007, Dave Cramer wrote: Is it possible that providing 128G of ram is too much? Will other systems in the server bottleneck?

the only way 128G of ram would be too much is if your total database size (including indexes) is smaller than this. now, it may not gain you as much of an advantage going from 64G to 128G as it does going from 32G to 64G, but that depends on many variables, as others have been asking.

David Lang
Re: [PERFORM] Volunteer to build a configuration tool
one thing to point out to people about this idea is that nothing says that this page needs to be served via a webserver. If all the calculations are done in javascript, this could be a local file that you open with a browser.

do any of the text-mode browsers implement javascript? if so then you have an answer even for the deeply buried, isolated, headless servers.

David Lang

On Mon, 18 Jun 2007, Campbell, Lance wrote: I am a Java Software architect, DBA, and project manager for the University of Illinois, Department of Web Services. We use PostgreSQL to serve about 2 million pages of dynamic content a month; everything from calendars, surveys, forms, discussion boards, RSS feeds, etc. I am really impressed with this tool. The only major problem area I have found where PostgreSQL is really lacking is in "what should my initial configuration settings be?" I realize that there are many elements that can impact a DBA's specific database settings but it would be nice to have a "configuration tool" that would get someone up and running better in the beginning. This is my idea: A JavaScript HTML page that would have some basic questions at the top: 1) How much memory do you have? 2) How many connections will be made to the database? 3) What operating system do you use? 4) Etc...
Re: [PERFORM] PostgreSQL Configuration Tool for Dummies
On Tue, 19 Jun 2007, Y Sidhu wrote: On 6/19/07, Francisco Reyes <[EMAIL PROTECTED]> wrote: Campbell, Lance writes: > Francisco and Richard, > Why ask about disk or raid? How would that impact any settings in > postgresql.conf? If the user has 2 disks and says that he will do a lot of updates, he could put pg_xlog on the second disk.

Let's not ask about disk or raid at this level of sanity tuning. It is important for a newbie to take the right first step. When it comes to disks, we start talking I/O, SATA, SCSI and the varying degrees of SATA and SCSI, and controller cards. Then we throw in RAID and the different levels therein. Add to that, we can talk about drivers controlling these drives and which OS is faster, more stable, etc. As you can see, a newbie would get drowned. So, please keep it simple. I know many people on this list are Gurus. We know you are the best in this field, but we are not and are just trying to improve what we have.

I strongly agree. besides, the number and types of drives, raid configurations, etc are so variable that I strongly believe that the right answer is going to be something along the lines of 'run this tool and then enter the number(s) that the tool reports', and then let the tool measure the end result of all the variables rather than trying to calculate the results.

David Lang
Re: [PERFORM] PostgreSQL Configuration Tool for Dummies
On Tue, 19 Jun 2007, Campbell, Lance wrote: Memory: There are many different ways to ask about memory. Rather than ask a series of questions I went with a single question, #2. If it is better to ask about the memory in a series of questions then please give me the questions you would ask and why you would ask each of them. From my understanding the primary memory issue as it relates to PostgreSQL is "how much memory is available to PostgreSQL". Remember that this needs to be as simple as possible.

there are three categories of memory usage:
1. needed by other software
2. available for postgres
3. needed by the OS

it's not clear if what you are asking about is #2 or a combination of #2 and #3. IMHO you should ask for #2 and #3, possibly along the lines of "how much memory is in the machine that isn't already used by other applications"

David Lang
Re: [PERFORM] PostgreSQL Configuration Tool for Dummies - feedback adjustable control
On Fri, 22 Jun 2007, Sabin Coanda wrote: Instead of (or in addition to) configuring dozens of settings, what do you say about a feedback adjustable control based on the existing system statistics and parsing logs (e.g. http://pgfouine.projects.postgresql.org/index.html )?

something like this would be useful for advanced tuning, but the biggest problem is that it's so difficult to figure out a starting point. bad choices at the starting point can cause several orders of magnitude difference in database performance. In addition, we know that the current defaults are bad for just about everyone (we just can't decide what better defaults would be). this horrible starting point gives people a bad first impression, and a simple tool like what's being discussed can go a long way towards solving that.

David Lang
Re: [PERFORM] Dell Hardware Recommendations
On Thu, 9 Aug 2007, Decibel! wrote: On Thu, Aug 09, 2007 at 08:58:19PM -0500, Scott Marlowe wrote: On Thu, Aug 09, 2007 at 05:29:18PM -0500, Scott Marlowe wrote: On 8/9/07, Decibel! <[EMAIL PROTECTED]> wrote: Also, a good RAID controller can spread reads out across both drives in each mirror on a RAID10. Though, there is an argument for not doing that... it makes it much less likely that both drives in a mirror will fail close enough to each other that you'd lose that chunk of data. I'd think that kind of failure mode is pretty uncommon, unless you're in an environment where physical shocks are common. which is not a typical database environment. (tell that to the guys writing a db for a modern tank fire control system though :) ) You'd be surprised. I've seen more than one case of a bunch of drives failing within a month, because they were all bought at the same time. while with RAID10 you can potentially lose half the array without losing any data. Yes, but the RIGHT two drives can kill EITHER RAID 5 or RAID10. Sure, but the odds of that with RAID5 are 100%, while they're much less in a RAID10.

so you go with RAID6, not RAID5.

While I agree with Merlin that for OLTP a faster drive is a must, for OLAP, more drives is often the real key. The high aggregate bandwidth of a large array of SATA drives is an amazing thing to watch when running a reporting server with otherwise unimpressive specs. True. In this case, the OP will probably want to have one array for the OLTP stuff and one for the OLAP stuff.

one thing that's interesting is that the I/O throughput on the large SATA drives can actually be higher than on the faster but smaller SCSI drives. the SCSI drives can win on seeking, but how much seeking you need to do depends on how large the OLTP database ends up being.

David Lang
Re: [PERFORM] SAN vs Internal Disks
On Fri, 7 Sep 2007, Tobias Brox wrote: We're also considering installing postgres on SAN - that is, my boss is convinced this is the right way to go. Advantages: 1. Higher I/O (at least the salesman claims so)

only if you buy better disks for the SAN than for the local system (note that this includes battery backed ram for write caching. the SAN will include a bunch because its performance would _suck_ otherwise. if you don't put any on your stand-alone system you are comparing apples to oranges)

2. Easier to upgrade the disk capacity

only if you buy a SAN with a lot of empty drive slots, but you could just as easily buy a server with empty drive slots.

3. Easy to set up "warm standby" functionality. (Then again, if the postgres server fails miserably, it's likely to be due to a disk crash).

and if postgres dies for some other reason the image on disk needs repair; unless you script stopping postgres when the SAN does its snapshots, those snapshots are not going to be that good. the problems are usually repairable, but that makes starting your warm spare harder.

Also, my boss states that "all big enterprises use SAN nowadays".

your boss would be very surprised at what the really big shops are doing (and not doing). yes they have a SAN, they have many SANs, from many different vendors, and they have many systems that don't use the SAN and use local disks instead. when you get really large you can find just about anything _somewhere_ in the company.

Disadvantages: 1. Risky? One gets the impression that there are frequent problems with data integrity when reading some of the posts in this thread.

SANs add more parts and more potential points of failure, and when you add SAN replication to the mix things get even more 'interesting'. doing SAN replication across a significant distance to your DR facility can be a LOT harder to get working right than the salesman makes it sound. it's not uncommon to see a SAN replication setup decide that it's going to take a week to catch up after doing a DR test, for example.

2. Expensive

no, _extremely_ expensive. price one and then look at how much hardware you could buy instead. you can probably buy much more storage, and a couple of complete spare systems (do replication to a local spare as well as your remote system), and end up with even more reliability.

3. "Single point of failure" ... but you have that either way; whether it's a SAN or a local disk, one will anyway need good backup systems (and eventually "warm standby" servers running from physically separated disks).

no, with local disks you can afford to have multiple systems so that you don't have a SPOF

4. More complex setup? 5. If there are several hosts with write permission towards the same disk, I can imagine the risks being higher for data integrity breakages. Particularly, I can imagine that if two postgres instances are started up against the same disk (due to some sysadmin mistake), it could be disastrous.

when you are using a SAN for a database, the SAN vendor will have you allocate complete disks to each box, so you don't have multiple boxes hitting the same drive, but you also don't get a lot of the advantages the salesman talks about.

David Lang
Re: [PERFORM] SAN vs Internal Disks
On Tue, 11 Sep 2007, Decibel! wrote: On Tue, Sep 11, 2007 at 05:09:00PM -0400, Michael Stone wrote: On Tue, Sep 11, 2007 at 03:55:51PM -0500, Decibel! wrote: Also, to reply to someone else's email... there is one big reason to use a SAN over direct storage: you can do HA that results in 0 data loss. Good SANs are engineered to be highly redundant, with multiple controllers, PSUs, etc, so that the odds of losing the SAN itself are very, very low. The same isn't true with DAS. You can get DAS arrays with multiple controllers, PSUs, etc. DAS != single disk. It's still in the same chassis, though, which means if you lose memory or mobo you're still screwed. In a SAN setup for redundancy, there's very little in the way of a single point of failure; generally only the backplane, and because there's very little that's on there it's extremely rare for one to fail.

not necessarily. direct attached doesn't mean in the same chassis; external drive shelves attached via SCSI are still DAS. you can even have DAS attached to a pair of machines, with the second box configured to mount the drives only if the first one dies.

David Lang
Re: [PERFORM] 10K vs 15k rpm for analytics
On Tue, 9 Mar 2010, Pierre C wrote: On Tue, 09 Mar 2010 08:00:50 +0100, Greg Smith wrote: Scott Carey wrote: For high sequential throughput, nothing is as optimized as XFS on Linux yet. It has weaknesses elsewhere however. When files are extended one page at a time (as postgres does), fragmentation can be pretty high on some filesystems (ext3, but NTFS is the absolute worst) if several files (indexes + table) grow simultaneously. XFS has delayed allocation which really helps. I'm curious what you feel those weaknesses are. Handling lots of small files, especially deleting them, is really slow on XFS. Databases don't care about that.

accessing lots of small files works really well on XFS compared to ext* (I use XFS with a cyrus mail server which keeps each message as a separate file, and XFS vastly outperforms ext2/3 there). deleting is slow, as you say.

David Lang

There is also the dark side of delayed allocation: if your application is broken, it will manifest itself very painfully. Since XFS keeps a lot of unwritten stuff in the buffers, an app that doesn't fsync correctly can lose lots of data if you don't have a UPS. Fortunately, postgres handles fsync like it should. A word of advice though: a few years ago, we lost a few terabytes on XFS (after that, restoring from backup was quite slow!) because a faulty SCSI cable crashed the server, then crashed it again during xfs_repair. So if you do xfs_repair on a suspicious system, please image the disks first.
Re: [PERFORM] Testing FusionIO
On Wed, 17 Mar 2010, Brad Nicholson wrote: On Wed, 2010-03-17 at 14:11 -0400, Justin Pitts wrote: On Mar 17, 2010, at 10:41 AM, Brad Nicholson wrote: On Wed, 2010-03-17 at 09:52 -0400, Justin Pitts wrote: FusionIO is publicly claiming 24 years @ 5TB/day on the 80GB SLC device, which wear levels across 100GB of actual installed capacity. http://community.fusionio.com/forums/p/34/258.aspx#258 20% of overall capacity free for levelling doesn't strike me as a lot. I don't have any idea how to judge what amount would be right. Some of the Enterprise grade stuff we are looking into (like TMS RamSan) leaves 40% (with much larger overall capacity). Also, running that drive at 80GB is the "Maximum Capacity" mode, which decreases the write performance. Very fair. In my favor, my proposed use case is probably at half capacity or less. I am getting the impression that partitioning/formatting the drive for the intended usage, and not the max capacity, is the way to go. Capacity isn't an issue with this workload. I cannot fit enough drives into these servers to get a tenth of the IOPS that even Tom's documents show the ioDrive is capable of at reduced performance levels. The actual media is only good for a very limited number of write cycles. The way that the drives get around that to be reliable is to constantly write to different areas. The more you have free, the less you have to re-use, the longer the lifespan. This is done by the drive's wear levelling algorithms, not by using partitioning utilities btw.

true, but if the drive is partitioned so that parts of it are never written to by the OS, the drive knows that those parts don't contain data and so can treat them as unallocated. once the OS writes to a part of the drive, unless the OS issues a trim command the drive can't know that the data there is worthless and can be ignored; it has to try and preserve that data, which makes doing the wear leveling harder and slower.

David Lang
Re: [PERFORM] Testing Sandforce SSD
On Sat, 24 Jul 2010, David Boreham wrote: Do you guys have any more ideas to properly 'feel this disk at its teeth'? While an 'end-to-end' test using PG is fine, I think it would be easier to determine if the drive is behaving correctly by using a simple test program that emulates the storage semantics the WAL expects. Have it write a constant stream of records, fsync'ing after each write. Record the highest record number flushed so far in some place that won't be lost with the drive under test (e.g. send it over the network to another machine). Kill the power, bring the system back up again and examine what's at the tail end of that file. I think this will give you the worst case test with the easiest result discrimination. If you want to, you could add concurrent random writes to another file for extra realism. Someone here may already have a suitable test program. I know I've written several over the years in order to test I/O performance, prove the existence of kernel bugs, and so on. I doubt it matters much how long the power is turned off. A second should be plenty time to flush pending writes if the drive is going to do so.

remember that SATA is designed to be hot-plugged, so you don't have to kill the entire system to kill power to the drive. this is a little more abrupt than the system losing power, but in terms of losing data this is about the worst case (while at the same time, it eliminates the possibility that the OS continues to perform writes to the drive as power dies, which is a completely different class of problems, independent of the drive type)

David Lang
Re: [PERFORM] Completely un-tuned Postgresql benchmark results: SSD vs desktop HDD
On Thu, 12 Aug 2010, Brad Nicholson wrote: On 10-08-12 03:22 AM, Arjen van der Meijden wrote: On 12-8-2010 2:53 gnuo...@rcn.com wrote: - The value of SSD in the database world is not as A Faster HDD(tm). Never was, despite the naïve who assert otherwise. The value of SSD is to enable BCNF datastores. Period. If you're not going to do that, don't bother. Silicon storage will never reach equivalent volumetric density, ever. SSD will never be useful in the byte bloat world of xml and other flat file datastores (resident in databases or not). Industrial strength SSD will always be more expensive/GB, and likely by a lot. (Re)factoring to high normalization strips out an order of magnitude of byte bloat, increases native data integrity by as much, reduces much of the redundant code, and puts the ACID where it belongs. All good things, but not effortless. It is actually quite common to under-utilize (short stroke) hard drives in the enterprise world, simply because 'they' need more IOps per amount of data than a completely utilized disk can offer. As such the expense/GB can be much higher than simply dividing the capacity by its price (and if you're looking at fiber channel disks, that price is quite high already). And then it is relatively easy to find enterprise SSDs with better pricing for the whole system as soon as the IOps are more important than the capacity. And when you compare the ongoing operational costs of rack space, powering and cooling for big arrays full of spinning disks to flash based solutions, the price comparison evens itself out even more.

check your SSD specs, some of the high performance ones draw quite a bit of power.

David Lang
Re: [PERFORM] Slow count(*) again...
On Mon, 11 Oct 2010, Samuel Gendler wrote: On Mon, Oct 11, 2010 at 9:06 PM, Scott Carey wrote: I can't speak to documentation, but it is something that helps as your I/O subsystem gets more powerful, and how much it helps depends more on your hardware, which may have adaptive read ahead on its own, and your file system, which may be more or less efficient at sequential I/O. For example, ext3 out of the box gets a much bigger gain from tuning read-ahead than XFS on a DELL PERC6 RAID card (but still ends up slower). Geez. I wish someone would have written something quite so bold as 'xfs is always faster than ext3' in the standard tuning docs. I couldn't find anything that made a strong filesystem recommendation. How does xfs compare to ext4? I wound up on ext4 on a dell perc6 raid card when an unexpected hardware failure on a production system caused my test system to get thrown into production before I could do any serious testing of xfs. If there is a strong consensus that xfs is simply better, I could afford the downtime to switch.

unfortunately you are not going to get a clear opinion here.

ext3 has a long track record, and since it is the default, it gets a lot of testing. it does have known issues.

xfs had problems on linux immediately after it was ported. it continues to be improved, and many people have been using it for years and trust it. XFS does have a weakness in creating/deleting large numbers of small files.

ext4 is the new kid on the block. it claims good things, but it's so new that many people don't trust it yet.

btrfs is the 'future of filesystems' that is supposed to be better than anything else, but it's definitely not stable yet, and time will tell if it really lives up to its promises.

and this is just on linux. on BSD or solaris (or with out-of-kernel patches) you also have ZFS, which some people swear by, and other people swear at.

David Lang

As it happens, this is a system where all of the heavy workload is in the form of sequential scan type load. The OLTP workload is very minimal (tens of queries per minute on a small number of small tables), but there are a lot of reporting queries that wind up doing sequential scans of large partitions (millions to tens of millions of rows). We've sized the new hardware so that the most commonly used partitions fit into memory, but if we could speed up the queries that touch less frequently used partitions, that would be good. I'm the closest thing our team has to a DBA, which really only means that I'm the one person on the dev team or the ops team to have read all of the postgres docs and wiki and the mailing lists. I claim no actual DBA experience or expertise and have limited cycles to devote to tuning and testing, so if there is an established wisdom for filesystem choice and read ahead tuning, I'd be very interested in hearing it.
Re: [PERFORM] Slow count(*) again...
On Tue, 12 Oct 2010, Craig Ringer wrote: BTW: There is a lot of talk about MVCC, but is the next solution possible: 1) Create a page information map that for each page in the table will tell you how many rows are within and if any write (either successful or not) was done to this page. This can even be two maps to make the second one really small (a bit per page) - so that it could be in memory most of the time. 2) When you need to do count(*) or an index check - first check if there were no writes to the page. If not - you can use count information from the page info/index data without going to the page itself. 3) Let vacuum clear the bit after freezing all the tuples in the page (am I using terminology correctly?). Part of this already exists. It's called the visibility map, and is present in 8.4 and above. It's not currently used for queries, but can potentially be used to aid some kinds of query. http://www.postgresql.org/docs/8.4/static/storage-vm.html In this case all read-only (archive) data will have this bit off and index/count(*) will be really fast. A count with any joins or filter criteria would still have to scan all pages with visible tuples in them. So the visibility map helps speed up scanning of bloated tables, but doesn't provide a magical "fast count" except in the utterly trivial "select count(*) from tablename;" case, and can probably only be used for accurate results when there are no read/write transactions currently open. Even if you kept a count of tuples in each page along with the mvcc transaction ID information required to determine for which transactions that count is valid, it'd only be useful if you didn't have to do any condition checks, and it'd be yet another thing to update with every insert/delete/update. Perhaps for some users that'd be worth having, but it seems to me like it'd have pretty narrow utility. I'm not sure that's the answer.

from a PR point of view, speeding up the trivial count(*) case could be worth it, just to avoid people complaining about it not being fast.

in the case where you are doing a count(*) with a where clause on an indexed column, could the search just look at the index plus the visibility map rather than doing a sequential scan through the table?

as for your worries about the accuracy of a visibility-based count in the face of other transactions, wouldn't you run into the same issues if you are doing a sequential scan with the same transactions in process?

David Lang
Re: [PERFORM] Slow count(*) again...
On Tue, 12 Oct 2010, Joe Uhl wrote: The biggest single problem with "select count(*)" is that it is seriously overused. People use that idiom to establish existence, which usually leads to a performance disaster in the application using it, unless the table has no more than a few hundred records. SQL, of which PostgreSQL offers an excellent implementation, has offered the [NOT] EXISTS clause since its inception in the Jurassic era. The problem is with the sequential scan, not with counting. I'd even go as far as to suggest that 99% of instances of the "select count(*)" idiom are probably bad use of the SQL language. I agree, I have seen many very bad examples of using count(*). I will go so far as to question the use of count(*) in my examples here. Is there a better way to come up with a page list than using count(*)? What is the best method to make a page of results and a list of links to other pages of results? Am I barking up the wrong tree here? One way I have dealt with this on very large tables is to cache the count(*) at the application level (using memcached, terracotta, or something along those lines) and then increment that cache whenever you add a row to the relevant table. On application restart that cache is re-initialized with a regular old count(*). This approach works really well and all large systems in my experience need caching in front of the DB eventually. If you have a simpler system with, say, a single application/web server you can simply store the value in a variable; the specifics would depend on the language and framework you are using.

this works if you know ahead of time what the criteria of the search are going to be. so it will work for

select count(*) from table;

what this won't work for is cases where the criteria of the search are unpredictable, i.e. ask the user for input

select count(*) from table where field=$input;

David Lang

Another, more all-DB approach is to create a statistics table into which you place aggregated statistics rows (num deleted, num inserted, totals, etc) at an appropriate time interval in your code. So you have rows containing aggregated statistics information for the past and some tiny portion of the new data happening right now that hasn't yet been aggregated. Queries then look like a summation of the aggregated values in the statistics table plus a count(*) over just the newest portion of the data table, and are generally very fast. Overall I have found that once things get big the layers of your app stack start to blend together and have to be combined in clever ways to keep speed up. Postgres is a beast, but when you run into things it can't do well, just find a way to cache it or make it work together with some other persistence tech to handle those cases.
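For the predictable, no-criteria case, the all-in-the-database version of the caching Joe describes can be as simple as a one-row counter table kept current by triggers. This is only a sketch with made-up names ("mytable", "mytable_count"), and note that the single counter row serializes concurrent writers, so it only suits tables that aren't write-heavy:

    -- seed the counter once:
    CREATE TABLE mytable_count (n bigint NOT NULL);
    INSERT INTO mytable_count SELECT count(*) FROM mytable;

    CREATE FUNCTION mytable_count_trig() RETURNS trigger AS $$
    BEGIN
        IF TG_OP = 'INSERT' THEN
            UPDATE mytable_count SET n = n + 1;
        ELSIF TG_OP = 'DELETE' THEN
            UPDATE mytable_count SET n = n - 1;
        END IF;
        RETURN NULL;  -- return value is ignored for AFTER triggers
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER mytable_count_maint
    AFTER INSERT OR DELETE ON mytable
    FOR EACH ROW EXECUTE PROCEDURE mytable_count_trig();

    -- the "count" is now a single-row lookup:
    SELECT n FROM mytable_count;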
Re: [PERFORM] Slow count(*) again...
On Tue, 12 Oct 2010, Mladen Gogala wrote: da...@lang.hm wrote: from a PR point of view, speeding up the trivial count(*) case could be worth it, just to avoid people complaining about it not being fast. Fixing PR stuff is not the approach that I would take. People are complaining about select count(*) because they're using it in all the wrong places. that may be the case, but if it's possible to make it less painful it will mean more people use postgres, both because it works better for them when they are using the suboptimal programs, and also because when people do their trivial testing of databases to decide which one they will use, they won't rule out postgres because "it's so slow". the fact of the matter is that people do use count(*), and even though there are usually ways to avoid doing so, having the programmer do something different for postgres than they do for other databases raises a barrier against postgres adoption. David Lang My assessment that there is a problem with sequential scan was wrong. Now, let's again take Oracle as the measure. Someone asked me about caching the data. Here it is: SQL> connect system/* Connected. SQL> alter system flush buffer_cache; System altered. Elapsed: 00:00:12.68 SQL> connect adbase/* Connected. SQL> alter session set db_file_multiblock_read_Count=128; Session altered. Elapsed: 00:00:00.41 SQL> select count(*) from ni_occurrence; COUNT(*) -- 402062638 Elapsed: 00:02:37.77 SQL> select bytes/1048576 MB from user_segments 2 where segment_name='NI_OCCURRENCE'; MB -- 35329 Elapsed: 00:00:00.20 SQL> So, the results weren't cached the first time around. The explanation is the fact that Oracle, as of version 10.2.0, reads the table into private process memory, not into the shared buffers. This table alone is 35GB in size, and Oracle took 2 minutes 37 seconds to read it using the full table scan. If I do the same thing with PostgreSQL and a comparable table, Postgres is, in fact, faster: psql (9.0.1) Type "help" for help. news=> \timing Timing is on. news=> select count(*) from moreover_documents_y2010m09; count -- 17242655 (1 row) Time: 113135.114 ms news=> select pg_size_pretty(pg_table_size('moreover_documents_y2010m09')); pg_size_pretty 27 GB (1 row) Time: 100.849 ms news=> The number of rows is significantly smaller, but the table contains a rather large "text" field which consumes quite a bit of TOAST storage, so the sizes are comparable. Postgres read through 27GB in 113 seconds, less than 2 minutes, and Oracle took 2 minutes 37 seconds to read through 35GB. I stand corrected: there is nothing wrong with the speed of the Postgres sequential scan. -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
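A quick back-of-the-envelope check of the throughput implied by the timings quoted above (taking 2 min 37 s = 157 s for the Oracle scan):

    # rough throughput from the numbers in the message above
    oracle_mb, oracle_s = 35329, 2 * 60 + 37      # 35 GB segment, 2 min 37 s
    pg_mb, pg_s = 27 * 1024, 113.135              # 27 GB table, 113.135 s

    print("Oracle full scan:  %.0f MB/s" % (oracle_mb / oracle_s))   # ~225 MB/s
    print("Postgres seq scan: %.0f MB/s" % (pg_mb / pg_s))           # ~244 MB/s

So the two sequential scans are within roughly 10% of each other in raw throughput, which supports the conclusion drawn above.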
Re: [PERFORM] Slow count(*) again...
On Tue, 12 Oct 2010, Craig Ringer wrote: On 10/12/2010 04:22 PM, da...@lang.hm wrote: from a PR point of view, speeding up the trivial count(*) case could be worth it, just to avoid people complaining about it not being fast. At the cost of a fair bit more complexity, though, and slowing everything else down. complexity probably, although given how complex the planner already is, is this significant? as far as slowing everything else down, why would it do that? (beyond the simple fact that any new thing the planner can do makes the planner take a little longer) David Lang The proper solution here remains, IMO, support for visibility information in indexes, whether by storing it once in the index and once in the heap (ouch!), storing it out-of-line, or using a covering index where one or more columns are stored wholly in the index not in the table heap at all. Here are a few of the many past discussions about this that have already covered some of the same ground: http://stackoverflow.com/questions/839015/postgres-could-an-index-organized-tables-paved-way-for-faster-select-count-fr http://osdir.com/ml/db.postgresql.performance/2003-10/msg00075.html (and the rest of the thread) A decent look with Google will find many, many more. in the case where you are doing a count(*) query with a where clause on an indexed column, could the search just look at the index + the visibility map rather than doing a sequential scan through the table? Nope, because the visibility map, which is IIRC only one bit per page, doesn't record how many tuples there are on the page, or enough information about them to determine how many of them are visible to the current transaction*. as for your worries about the accuracy of a visibility-based count in the face of other transactions, wouldn't you run into the same issues if you are doing a sequential scan with the same transactions in process? No. Every tuple in a table heap in postgresql has hidden fields, some of which are used to determine whether the current transaction* can "see" the tuple - it may have been inserted after this transaction started, or deleted before this transaction started, so it's not visible to this transaction but may still be to others. http://www.postgresql.org/docs/current/static/ddl-system-columns.html This information isn't available in the visibility map, or in indexes. That's why PostgreSQL has to hit the heap to find it. * current transaction should really be "current snapshot". The snapshot is taken at the start of the whole transaction for SERIALIZABLE isolation, and at the start of each statement for READ COMMITTED isolation. -- Craig Ringer -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Slow count(*) again...
On Wed, 13 Oct 2010, Tom Lane wrote: Neil Whelchel writes: That's probably got little to do with caching and everything to do with setting hint bits on the first SELECT pass. I concur with Mark's question about whether your UPDATE pushed the table size across the limit of what would fit in RAM. Neil, can you just double the size of your initial test to make sure that it's too large to fit in ram to start with? David Lang -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] BBU Cache vs. spindles
On Fri, 29 Oct 2010, Robert Haas wrote: On Thu, Oct 28, 2010 at 5:26 PM, Tom Lane wrote: James Mansion writes: Tom Lane wrote: The other and probably worse problem is that there's no application control over how soon changes to mmap'd pages get to disk. An msync will flush them out, but the kernel is free to write dirty pages sooner. So if they're depending for consistency on writes not happening until msync, it's broken by design. (This is one of the big reasons we don't use mmap'd space for Postgres disk buffers.) Well, I agree that it sucks for the reason you give - but you use write and that's *exactly* the same in terms of when it gets written, as when you update a byte on an mmap'd page. Uh, no, it is not. The difference is that we can update a byte in a shared buffer, and know that it *isn't* getting written out before we say so. If the buffer were mmap'd then we'd have no control over that, which makes it mighty hard to obey the WAL "write log before data" paradigm. It's true that we don't know whether write() causes an immediate or delayed disk write, but we generally don't care that much. What we do care about is being able to ensure that a WAL write happens before the data write, and with mmap we don't have control over that. Well, we COULD keep the data in shared buffers, and then copy it into an mmap()'d region rather than calling write(), but I'm not sure there's any advantage to it. Managing address space mappings is a pain in the butt. keep in mind that you have no way of knowing what order the data in the mmap region gets written out to disk. David Lang -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] BBU Cache vs. spindles
On Fri, 29 Oct 2010, James Mansion wrote: Tom Lane wrote: Uh, no, it is not. The difference is that we can update a byte in a shared buffer, and know that it *isn't* getting written out before we Well, I don't know where you got the idea I was referring to that sort of thing - it's the same as writing to a buffer before copying to the mmap'd area. It's true that we don't know whether write() causes an immediate or delayed disk write, but we generally don't care that much. What we do Which is what I was referring to. care about is being able to ensure that a WAL write happens before the data write, and with mmap we don't have control over that. I think you have just the same control either way, because you can only force ordering with an appropriate explicit sync, and in the absence of such a sync all bets are off for whether/when each disk page is written out, and if you can't ensure that the controller and disk are write-through, you'd better do a hardware cache flush too, right? A shame that so many systems have relatively poor handling of that hardware flush. the issue is that when you update an mmap'd chunk of data, it could be written out immediately without you doing _anything_ (and thanks to multiple cores/threads, it could get written out while you are still in the middle of updating it). When you update an internal buffer and then write that, you know that nothing will hit the disk before you issue the write command. David Lang -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
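A minimal Python sketch of the ordering point being made here. The file names are made up, and this only illustrates the WAL-before-data rule, not how PostgreSQL actually performs its I/O:

    import os

    wal_fd = os.open("wal.log", os.O_WRONLY | os.O_APPEND | os.O_CREAT)
    data_fd = os.open("datafile", os.O_WRONLY | os.O_CREAT)

    page = bytearray(8192)               # private buffer: modify it freely here,
    page[0:5] = b"hello"                 # nothing can reach disk yet

    os.write(wal_fd, b"wal record describing the change")
    os.fsync(wal_fd)                     # the log is durable *before* the data write
    os.write(data_fd, bytes(page))       # only now is the kernel even shown the page
    os.fsync(data_fd)

    # With an mmap'd data file there is no equivalent guarantee: the moment the
    # mapped page is dirtied, the kernel may write it back - possibly before the
    # WAL record above has been flushed.
    # m = mmap.mmap(data_fd, 8192); m[0:5] = b"hello"    # may hit disk at any time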
Re: [PERFORM] How to best use 32 15k.7 300GB drives?
On Thu, 27 Jan 2011, Robert Schnabel wrote: HI, I use PostgreSQL basically as a data warehouse to store all the genetic data that our lab generates. The only person that accesses the database is myself and therefore I've had it housed on my workstation in my office up till now. However, it's getting time to move it to bigger hardware. I currently have a server that is basically only storing backup images of all our other workstations so I'm going to move my database onto it. The server looks like this: Windows Server Enterprise 2008 R2 64-bit, AMD 2350 quad-core x2, 32GB RAM. For my purposes the CPUs and RAM are fine. I currently have an Adaptec 52445+BBU controller that has the OS (4 drive RAID5), FTP (4 drive RAID5) and two backup arrays (8 drive each RAID0). The backup arrays are in a 16 drive external enclosure through an expander so I actually have 16 ports free on the 52445 card. I plan to remove 3 of the drives from my backup arrays to make room for 3 - 73GB 15k.5 drives (re-purposed from my workstation). Two 16 drive enclosures with SAS2 expanders just arrived as well as 36 Seagate 15k.7 300GB drives (ST3300657SS). I also intend on getting an Adaptec 6445 controller with the flash module when it becomes available in about a month or two. I already have several Adaptec cards so I'd prefer to stick with them. Here's the way I was planning using the new hardware: xlog & wal: 3 - 73G 15k.5 RAID1+hot spare in enclosure A on 52445 controller data: 22 - 300G 15k.7 RAID10 enclosure B&C on 6445 controller indexes: 8 - 300G 15k.7 RAID10 enclosure C on 6445 controller 2 - 300G 15k.7 as hot spares enclosure C 4 spare 15k.7 for on the shelf With this configuration I figure I'll have ~3TB for my main data tables and 1TB for indexes. Right now my database is 500GB total. The 3:1 split reflects my current table structure and what I foresee coming down the road in terms of new data. So my questions are 1) am I'm crazy for doing this, 2) would you change anything and 3) is it acceptable to put the xlog & wal (and perhaps tmp filespace) on a different controller than everything else? Please keep in mind I'm a geneticist who happens to know a little bit about bioinformatics and not the reverse. :-) a number of questions spring to mind how much of the time are you expecting to spend inserting data into this system vs querying data from the system? is data arriving continuously, or is it a matter of receiving a bunch of data, inserting it, then querying it? which do you need to optimize for, insert speed or query speed? do you expect your queries to be searching for a subset of the data scattered randomly throughlut the input data, or do you expect it to be 'grab this (relativly) sequential chunk of input data and manipulate it to generate a report' type of thing what is your connectvity to the raid enclosures? (does putting 22 drives on one cable mean that you will be limited due to the bandwidth of this cable rather than the performance of the drives) can you do other forms of raid on these drives or only raid 10? how critical is the data in this database? if it were to die would it just be a matter of recreating it and reloading the data? or would you loose irreplaceable data? David Lang -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] How to best use 32 15k.7 300GB drives?
sorry for not replying properly to your response, I managed to delete the mail. as I understand your data access pattern it's the following: for the main table space: bulk loads every couple of weeks. if the data is lost you can just reload it. searches tend to be extracting large sequential chunks of data, either to external files or into different table spaces. for this table space, you are basically only inserting every couple of weeks, and it sounds as if you do not really care how long it takes to load the data. first the disclaimer: I'm not a postgres expert, but I do have good experience with large amounts of data on linux systems (and especially running into the limitations when doing it on the cheap ;-) with this data pattern your WAL is meaningless (as it's only relevant for inserts), and you may as well use raid6 as raid10 (both allow you to utilize all drives for reads, but raid6 gives you 2 drives worth of redundancy, while the wrong two drives failing on raid10 could kill the entire array). You may even want to disable fsync on imports. It will save you a lot of time, and if the system crashes during the load you can just reinitialize and reload the data. however, since you are going to be doing large sequential data transfers, you want to be utilizing multiple SAS links, preferably as evenly as possible, so rather than putting all your data drives on one port, you may want to spread them between ports so that your aggregate bandwidth to the drives is higher (with this many high speed drives, this is a significant limitation) the usual reason for keeping the index drives separate is to avoid having writes interact with index reads. Since you are not going to be doing both at the same time, I don't know if it helps to separate your indexes. now, if you pull the data from this main table into a smaller table for analysis, you may want to do more interesting things with the drives that you use for this smaller table, as you are going to be loading data into them more frequently. David Lang
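To put rough numbers on the raid6-vs-raid10 trade-off for the 22 x 300 GB data array being discussed, here is a quick sketch (it ignores hot spares and controller overhead):

    disks, size_gb = 22, 300

    raid10_usable = disks // 2 * size_gb          # mirror pairs: 3300 GB (~3 TB)
    raid6_usable  = (disks - 2) * size_gb         # two parity drives: 6000 GB (~6 TB)

    print("RAID 10: %.1f TB usable, always survives 1 failure, more only if "
          "the failures land on different mirrors" % (raid10_usable / 1000.0))
    print("RAID 6:  %.1f TB usable, survives any 2 failures" % (raid6_usable / 1000.0))

The ~3 TB figure matches the capacity quoted in the original plan, so the choice is essentially double the usable space plus guaranteed two-drive tolerance versus raid10's better small-write behaviour.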
Re: [HACKERS] [PERFORM] Slow count(*) again...
On Thu, 3 Feb 2011, Vitalii Tymchyshyn wrote: 02.02.11 20:32, Robert Haas wrote: Yeah. Any kind of bulk load into an empty table can be a problem, even if it's not temporary. When you load a bunch of data and then immediately plan a query against it, autoanalyze hasn't had a chance to do its thing yet, so sometimes you get a lousy plan. Maybe introducing something like an 'AutoAnalyze' threshold would help? I mean that any insert/update/delete statement that changes more than x% of a table (and no fewer than y records) must do an analyze right after it has finished. Defaults like x=50, y=1 seem quite good to me. If I am understanding things correctly, a full Analyze is going over all the data in the table to figure out patterns. If this is the case, wouldn't it make sense in the situation where you are loading an entire table from scratch to run the Analyze as you are processing the data? If you don't want to slow down the main thread that's inserting the data, you could copy the data to a second thread and do the analysis while it's still in RAM rather than having to read it off of disk afterwards. this doesn't make sense for updates to existing databases, but the use case of loading a bunch of data and then querying it right away isn't _that_ uncommon. David Lang -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [HACKERS] [PERFORM] Slow count(*) again...
On Thu, 3 Feb 2011, Robert Haas wrote: On Thu, Feb 3, 2011 at 5:11 AM, wrote: If I am understanding things correctly, a full Analyze is going over all the data in the table to figure out patterns. No. It's going over a small, fixed-size sample which depends on default_statistics_target but NOT on the table size. It's really important to come up with a solution that's not susceptible to running ANALYZE over and over again, in many cases unnecessarily. If this is the case, wouldn't it make sense in the situation where you are loading an entire table from scratch to run the Analyze as you are processing the data? If you don't want to slow down the main thread that's inserting the data, you could copy the data to a second thread and do the analysis while it's still in RAM rather than having to read it off of disk afterwards. Well that's basically what autoanalyze is going to do anyway, if the table is small enough to fit in shared_buffers. And it's actually usually BAD if it starts running while you're doing a large bulk load, because it competes for I/O bandwidth and the buffer cache and slows things down. Especially when you're bulk loading for a long time and it tries to run over and over. I'd really like to suppress all those asynchronous ANALYZE operations and instead do ONE synchronous one at the end, when we try to use the data. If the table is not large enough to fit in ram, then it will compete for I/O, and the user will have to wait. what I'm proposing is that as the records are created, the process doing the creation makes copies of the records (either all of them, or some of them if not all are needed for the analysis, possibly via sharing memory with the analysis process); this would be synchronous with the load, not asynchronous. this would take zero I/O bandwidth; it would take up some ram, memory bandwidth, and cpu time, but a load of a large table like this is I/O constrained. it would not make sense for this to be the default, but as an option it should save a significant amount of time. I am making the assumption that an Analyze run only has to go over the data once (a sequential scan of the table if it's >> ram for example) and gathers stats as it goes. with the current code, this is a completely separate process that knows nothing about the load, so if you kick it off when you start the load, it makes a pass over the table (competing for I/O), finishes, you continue to update the table, so it makes another pass, etc. As you say, this is a bad thing to do. I am saying to have an option that ties the two together, essentially making the data feed into the Analyze run be a fork of the data coming out of the insert run going to disk. So the Analyze run doesn't do any I/O and isn't going to complete until the insert is complete. At which time it will have seen one copy of the entire table. David Lang -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
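A minimal sketch of the "fork the data into the analyzer" idea, using reservoir sampling so the analyzer only ever holds a fixed-size sample however large the load is. insert_into_table() and build_statistics() are hypothetical stand-ins for the real load path and a real ANALYZE-style pass; this is not how PostgreSQL's ANALYZE is implemented.

    import random

    def load_with_streaming_sample(rows, sample_size=30000):
        # keep a uniform random sample of the rows as they stream past
        # (reservoir sampling, Algorithm R), so statistics can be built at
        # the end of the load without a second pass over the table
        sample = []
        for i, row in enumerate(rows):
            insert_into_table(row)              # hypothetical: the normal load path
            if len(sample) < sample_size:
                sample.append(row)
            else:
                j = random.randint(0, i)        # replace with decreasing probability
                if j < sample_size:
                    sample[j] = row
        build_statistics(sample)                # hypothetical: one stats pass at the end
        return sample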
Re: [HACKERS] [PERFORM] Slow count(*) again...
On Thu, 3 Feb 2011, Robert Haas wrote: On Thu, Feb 3, 2011 at 3:54 PM, wrote: with the current code, this is a completely separate process that knows nothing about the load, so if you kick it off when you start the load, it makes a pass over the table (competing for I/O), finishes, you continue to update the table, so it makes another pass, etc. As you say, this is a bad thing to do. I am saying to have an option that ties the two together, essentially making the data feed into the Analyze run be a fork of the data coming out of the insert run going to disk. So the Analyze run doesn't do any I/O and isn't going to complete until the insert is complete. At which time it will have seen one copy of the entire table. Yeah, but you'll be passing the entire table through this separate process that may only need to see 1% of it or less on a large table. If you want to write the code and prove it's better than what we have now, or some other approach that someone else may implement in the meantime, hey, this is an open source project, and I like improvements as much as the next guy. But my prediction for what it's worth is that the results will suck. :-) I will point out that 1% of a very large table can still be a lot of disk I/O that is avoided (especially if it's random I/O that's avoided) David Lang -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [HACKERS] [PERFORM] Slow count(*) again...
On Thu, 3 Feb 2011, Robert Haas wrote: On Thu, Feb 3, 2011 at 7:39 PM, wrote: Yeah, but you'll be passing the entire table through this separate process that may only need to see 1% of it or less on a large table. If you want to write the code and prove it's better than what we have now, or some other approach that someone else may implement in the meantime, hey, this is an open source project, and I like improvements as much as the next guy. But my prediction for what it's worth is that the results will suck. :-) I will point out that 1% of a very large table can still be a lot of disk I/O that is avoided (especially if it's random I/O that's avoided) Sure, but I think that trying to avoid it will be costly in other ways - you'll be streaming a huge volume of data through some auxiliary process, which will have to apply some algorithm that's very different from the one we use today. The reality is that I think there's little evidence that the way we do ANALYZE now is too expensive. It's typically very cheap and works very well. It's a bit annoying when it fires off in the middle of a giant data load, so we might need to change the time of it a little, but if there's a problem with the operation itself being too costly, this is the first I'm hearing of it. We've actually worked *really* hard to make it cheap. I could be misunderstanding things here, but my understanding is that it's 'cheap' in that it has little impact on the database while it is running. the issue here is that the workflow is load data analyze start work so the cost of analyze in this workflow is not "1% impact on query speed for the next X time", it's "the database can't be used for the next X time while we wait for analyze to finish running" I don't understand why the algorithm would have to be so different than what's done today, surely the analyze thread could easily be tweaked to ignore the rest of the data (assuming we don't have the thread sending the data to analyze do the filtering) David Lang -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [HACKERS] [PERFORM] Slow count(*) again...
On Fri, 4 Feb 2011, ??? wrote: 2011/2/3 If the table is not large enough to fit in ram, then it will compete for I/O, and the user will have to wait. what I'm proposing is that as the records are created, the process doing the creation makes copies of the records (either all of them, or some of them if not all are needed for the analysis, possibly via sharing memory with the analysis process); this would be synchronous with the load, not asynchronous. this would take zero I/O bandwidth; it would take up some ram, memory bandwidth, and cpu time, but a load of a large table like this is I/O constrained. it would not make sense for this to be the default, but as an option it should save a significant amount of time. I am making the assumption that an Analyze run only has to go over the data once (a sequential scan of the table if it's >> ram for example) and gathers stats as it goes. with the current code, this is a completely separate process that knows nothing about the load, so if you kick it off when you start the load, it makes a pass over the table (competing for I/O), finishes, you continue to update the table, so it makes another pass, etc. As you say, this is a bad thing to do. I am saying to have an option that ties the two together, essentially making the data feed into the Analyze run be a fork of the data coming out of the insert run going to disk. So the Analyze run doesn't do any I/O and isn't going to complete until the insert is complete. At which time it will have seen one copy of the entire table. Actually those are two different problems. One is to make analyze more automatic, so that a select right after an insert gets useful statistics. The other is to make it take less IO resources. I don't like the idea of embedding it into insert (unless the threshold can be determined before the inserts start), simply because it adds CPU/memory overhead that will slow down each insert. And if you add a knob that is disabled by default, it will be no better than a manual analyze. if it can happen during the copy instead of being a step after the copy it will speed things up. things like the existing parallel restore could use this instead of needing a separate pass. so I don't think that having to turn it on manually makes it useless, any more than the fact that you have to explicitly disable fsync makes that disabling feature useless (and the two features would be likely to be used together) when a copy command is issued, I assume that there is some indication of how much data is going to follow. I know that it's not just 'insert everything until the TCP connection terminates' because that would give you no way of knowing if the copy got everything in or was interrupted part way through. think about what happens with ftp if the connection drops: you get a partial file 'successfully' as there is no size provided, but with HTTP you get a known-bad transfer that you can abort or resume. David Lang -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [HACKERS] [PERFORM] Slow count(*) again...
On Fri, 4 Feb 2011, Vitalii Tymchyshyn wrote: 04.02.11 16:33, Kenneth Marshall wrote: In addition, the streaming ANALYZE can provide better statistics at any time during the load and it will be complete immediately. As far as passing the entire table through the ANALYZE process, a simple counter can be used to only send the required samples based on the statistics target. Where this would seem to help the most is in temporary tables which currently do not work with autovacuum, but it would streamline their use for more complicated queries that need an analyze to perform well. Actually for me the main "con" with streaming analyze is that it adds a significant CPU burden to an already not too fast load process. Especially if it's automatically done for any load operation performed (and I can't see how it can be enabled on some threshold). two thoughts 1. if it's a large enough load, isn't it I/O bound? 2. this could be done in a separate process/thread from the load itself; that way the overhead of the load is just copying the data in memory to the other process. with a multi-threaded load, this would eat up some cpu that could be used for the load, but cores/chip are still climbing rapidly so I expect that it's still pretty easy to end up with enough CPU to handle the extra load. David Lang And you can't start after some threshold of data has passed by, since you may lose significant information (like minimal values). Best regards, Vitalii Tymchyshyn -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
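Here is a hedged sketch of point 2 above: hand each loaded row to a separate analyzer thread through a bounded queue, so the loader only pays for an in-memory copy. incoming_rows, insert_into_table() and build_statistics() are hypothetical; a real implementation would sample rather than keep everything, as in the earlier reservoir-sampling sketch.

    import queue
    import threading

    rows_for_analysis = queue.Queue(maxsize=10000)

    def analyzer():
        sample = []
        while True:
            row = rows_for_analysis.get()
            if row is None:                  # sentinel: the load has finished
                break
            sample.append(row)               # or reservoir-sample to bound memory
        build_statistics(sample)             # hypothetical stats pass

    worker = threading.Thread(target=analyzer)
    worker.start()

    for row in incoming_rows:                # hypothetical load loop
        insert_into_table(row)
        rows_for_analysis.put(row)           # the only extra cost to the loader

    rows_for_analysis.put(None)
    worker.join()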
Re: [PERFORM] getting the most of out multi-core systems for repeated complex SELECT statements
On Fri, 4 Feb 2011, Chris Browne wrote: 2. The query needs to NOT be I/O-bound. If it's I/O bound, then your system is waiting for the data to come off disk, rather than to do processing of that data. yes and no on this one. it is very possible to have a situation where the process generating the I/O is waiting for the data to come off disk while there are still idle resources in the disk subsystem. it may be that the best way to address this is to have the process generating the I/O send off more requests, but that is sometimes significantly more complicated than splitting the work between two processes and letting them each generate I/O requests. with rotating disks, ideally you want to have at least two requests outstanding: one that the disk is working on now, and one for it to start on as soon as it finishes the one that it's on (so that the disk doesn't sit idle while the process decides what the next read should be). In practice you tend to want even more outstanding from the application so that they can be optimized (combined, reordered, etc) by the lower layers. if you end up with a largish raid array (say 16 disks), this can translate into a lot of outstanding requests that you want to have active to fully utilize the array, but having the same number of requests outstanding with a single disk would be counterproductive, as the disk would not be able to see all the outstanding requests and therefore would not be able to optimize them as effectively. David Lang -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
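A toy illustration of keeping several reads in flight versus issuing them one at a time. The file name is made up, and a thread pool is just one easy way to get multiple requests outstanding from a single process:

    import os
    from concurrent.futures import ThreadPoolExecutor

    BLOCK = 1024 * 1024
    fd = os.open("big_table_file", os.O_RDONLY)      # hypothetical data file

    def read_block(blockno):
        return os.pread(fd, BLOCK, blockno * BLOCK)

    # one request at a time: the array sits idle between requests
    serial = [read_block(b) for b in range(64)]

    # many requests in flight: a 16-disk array can service them on different
    # spindles, and the lower layers get a chance to merge and reorder them
    with ThreadPoolExecutor(max_workers=16) as pool:
        parallel = list(pool.map(read_block, range(64)))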
Re: [PERFORM] Which RAID Controllers to pick/avoid?
On Sun, 6 Feb 2011, Scott Marlowe wrote: On Sun, Feb 6, 2011 at 2:39 AM, Royce Ausburn wrote: On Wed, Feb 2, 2011 at 7:00 PM, Craig Ringer wrote: Whatever RAID controller you get, make sure you have a battery backup unit (BBU) installed so you can safely enable write-back caching. Without that, you might as well use software RAID - it'll generally be faster (and cheaper) than HW RAID w/o a BBU. Recently we had to pull our RAID controllers and go to plain SAS cards. While random access dropped a bit, sequential throughput skyrocketed, saturating the 4 lane cable we use. 4x300MB/s = 1200MB/s, or right around 1G of data a second off the array. VERY impressive. This is really surprising. Software raid generally outperforms hardware raid without BBU? Why is that? My company uses hardware raid quite a bit without BBU and have never thought to compare with software raid =/ For raw throughput it's not uncommon to beat a RAID card whether it has a battery backed cache or not. If I'm writing a 200G file to the disks, a BBU cache isn't gonna make that any faster, it'll fill up in a second and then it's got to write to disk. BBU cache is for faster random writes, and will handily beat SW RAID. But for raw large file read and write SW RAID is the fastest thing I've seen. keep in mind that hardware raid with BBU is safer than software raid. since the updates to the drives do not all happen at the same time, there is a chance that a write to software raid may have happened on some drives and not others when the system crashes. with hardware raid and BBU, the controller knows what it was trying to write where, and if it didn't get the acknowledgement, it will complete the write when it comes up again. but with software raid you will have updated some parts of the array and not others, which will result in a corrupted stripe in the array. David Lang -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] choosing the right RAID level for PostgresQL database
On Sun, 13 Feb 2011, Dave Crooke wrote: For any database, anywhere, the answer is pretty much always RAID-10. The only time you would do anything else is for odd special cases. there are two situations where you would opt for something other than RAID-10 1. if you need the space that raid 6 gives you compared to raid 10, you may not have much choice 2. if you do almost no updates to the disk during the time you are doing the reads, then raid 6 can be at least as fast as raid 10 in non-degraded mode (it could be faster if you are able to use faster parts of the disks in raid 6 than you could in raid 10). degraded mode suffers more, but you can tolerate any 2 drives failing rather than just any 1 drive failing for raid 10 (the wrong two drives failing can kill a raid 10, while if the right drives fail you can lose a lot more drives in raid 10 without losing data) where raid 6 is significantly slower than raid 10 is when you are doing small random writes. Also, the performance variation between raid controllers will be much higher with raid 6 than with raid 10 David Lang Cheers Dave On Sun, Feb 13, 2011 at 2:12 PM, sergey wrote: Hello, I got a disk array appliance of 8 disks, 1T each (UltraStor RS8IP4). It will be used solely by a PostgreSQL database and I am trying to choose the best RAID level for it. The top priority is read performance, since we operate large data sets (tables, indexes) and we do lots of searches/scans, joins and nested queries. With the old disks that we have now the most slowdowns happen on SELECTs. Fault tolerance is less important; it can be 1 or 2 disks. Space is the least important factor. Even 1T will be enough. Which RAID level would you recommend in this situation? The current options are 60, 50 and 10, but probably other options can be even better. Thank you! -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] TB-sized databases
On Wed, 28 Nov 2007, Csaba Nagy wrote: On Wed, 2007-11-28 at 08:27 -0500, Bill Moran wrote: Is there something wrong with: set enable_seqscan = off ? Nothing wrong with enable_seqscan = off except it is an all or nothing type of thing... if you want the big table to never use a seqscan, but a medium table which is joined in should use it, then what do you do? And setting enable_seqscan = off will actually not mean the planner can't use a sequential scan for the query if no other alternative exists. In any case it doesn't mean "please throw an error if you can't do this without a sequential scan". In fact an even more useful option would be to ask the planner to throw an error if the expected cost exceeds a certain threshold... and even better if the option can be overridden for a specific transaction or connection. that way it can be set relatively low for normal operations, but when you need to do an expensive query you can change it for that query. David Lang ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [PERFORM] TB-sized databases
On Wed, 28 Nov 2007, Simon Riggs wrote: On Wed, 2007-11-28 at 14:48 +0100, Csaba Nagy wrote: In fact an even more useful option would be to ask the planner to throw an error if the expected cost exceeds a certain threshold... Well, I've suggested it before: statement_cost_limit on pgsql-hackers, 1 March 2006 Would people like me to re-write and resubmit this patch for 8.4? Tom's previous concerns were along the lines of "How would you know what to set it to?", given that the planner costs are mostly arbitrary numbers. arbitrary numbers are fine if they are relatively consistent with each other. will a plan with an estimated cost of 1,000,000 take approximately 100 times as long as one with a cost of 10,000? or more importantly, will a plan with an estimated cost of 2000 reliably take longer than one with an estimated cost of 1000? David Lang Any bright ideas, or is it we want it and we don't care about the possible difficulties? ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
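As far as I know, statement_cost_limit never became a core setting, but the idea can be approximated on the client side by asking for the plan first and refusing to run anything whose top-level estimated cost is too high. A hedged Python/psycopg2 sketch (the connection string, table and threshold are made up; the regex simply pulls the total cost out of the first EXPLAIN line):

    import re
    import psycopg2

    COST_RE = re.compile(r"cost=[\d.]+\.\.([\d.]+)")

    def run_if_cheap(conn, sql, max_cost=100000.0):
        # client-side stand-in for the proposed statement_cost_limit
        with conn.cursor() as cur:
            cur.execute("EXPLAIN " + sql)
            top_line = cur.fetchone()[0]
            cost = float(COST_RE.search(top_line).group(1))
            if cost > max_cost:
                raise RuntimeError("estimated cost %.0f exceeds limit %.0f"
                                   % (cost, max_cost))
            cur.execute(sql)
            return cur.fetchall()

    conn = psycopg2.connect("dbname=test")            # hypothetical connection
    rows = run_if_cheap(conn, "SELECT * FROM big_table WHERE id = 42")

Of course this is only as useful as the question David raises allows: it assumes the estimated costs are at least relatively consistent with actual run time.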
Re: [PERFORM] With 4 disks should I go for RAID 5 or RAID 10
On Wed, 26 Dec 2007, Fernando Hevia wrote: Mark Mielke Wrote: In my experience, software RAID 5 is horrible. Write performance can decrease below the speed of one disk on its own, and read performance will not be significantly more than RAID 1+0 as the number of stripes has only increased from 2 to 3, and if reading while writing, you will not get 3X as RAID 5 write requires at least two disks to be involved. I believe hardware RAID 5 is also horrible, but since the hardware hides it from the application, a hardware RAID 5 user might not care. Software RAID 1+0 works fine on Linux with 4 disks. This is the setup I use for my personal server. I will use software RAID so RAID 1+0 seems to be the obvious choice. Thanks for the advice! to clarify things a bit more. with only four drives the space difference between raid 1+0 and raid 5 isn't that much, but when you do a write you must write to two drives (the drive holding the data you are changing, and the drive that holds the parity data for that stripe, possibly needing to read the old parity data first, resulting in stalling for seek/read/calculate/seek/write since the drive moves on after the read), when you read you must read _all_ drives in the set to check the data integrity. for seek heavy workloads (which almost every database application is) the extra seeks involved can be murder on your performance. if your workload is large sequential reads/writes, and you can let the OS buffer things for you, the performance of raid 5 is much better. on the other hand, doing raid 6 (instead of raid 5) gives you extra data protection in exchange for the performance hit, but with only 4 drives this probably isn't what you are looking for. Linux software raid can do more then two disks in a mirror, so you may be able to get the added protection with raid 1 sets (again, probably not relavent to four drives), although there were bugs in this within the last six months or so, so you need to be sure your kernel is new enough to have the fix. now, if you can afford solid-state drives which don't have noticable seek times, things are completely different ;-) David Lang ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [PERFORM] With 4 disks should I go for RAID 5 or RAID 10
On Wed, 26 Dec 2007, Fernando Hevia wrote: David Lang Wrote: with only four drives the space difference between raid 1+0 and raid 5 isn't that much, but when you do a write you must write to two drives (the drive holding the data you are changing, and the drive that holds the parity data for that stripe, possibly needing to read the old parity data first, resulting in stalling for seek/read/calculate/seek/write since the drive moves on after the read), when you read you must read _all_ drives in the set to check the data integrity. Thanks for the explanation David. It's good to know not only what but also why. Still I wonder why reads do hit all drives. Shouldn't only 2 disks be read: the one with the data and the parity disk? no, because the parity is of the sort (A+B+C+P) mod X = 0 so if X=10 (which means in practice that only the last decimal digit of anything matters, very convenient for examples) A=1, B=2, C=3, A+B+C=6, P=4, A+B+C+P=10=0 if you read B and get 3 and P and get 4 you don't know if this is right or not unless you also read A and C (at which point you would get A+B+C+P=11=1=error) for seek heavy workloads (which almost every database application is) the extra seeks involved can be murder on your performance. if your workload is large sequential reads/writes, and you can let the OS buffer things for you, the performance of raid 5 is much better. Well, actually most of my application involves large sequential reads/writes. The memory available for buffering (4GB) isn't bad either, at least for my scenario. On the other hand I have seen such strong posts against RAID 5 that I hesitate to even consider it. in theory a system could get the same performance with a large sequential read/write on raid5/6 as on a raid0 array of equivalent size (i.e. same number of data disks, ignoring the parity disks) because the OS could read the entire stripe in at once, do the calculation once, and use all the data (or when writing, don't write anything until you are ready to write the entire stripe, calculate the parity and write everything once). Unfortunately in practice filesystems don't support this, they don't do enough readahead to want to keep the entire stripe (so after they read it all in they throw some of it away), they (mostly) don't know where a stripe starts (and so intermingle different types of data on one stripe and spread data across multiple stripes unnecessarily), and they tend to do writes in small, scattered chunks (rather than flushing an entire stripe's worth of data at once) those who have been around long enough to remember the days of MFM/RLL (when you could still find the real layout of the drives) may remember optimizing things to work a track at a time instead of a sector at a time. this is the exact same logic, just needing to be applied to drive stripes instead of sectors and tracks on a single drive. the issue has been raised with the kernel developers, but there's a lot of work to be done (especially in figuring out how to get all the layers the info they need in a reasonable way) Linux software raid can do more than two disks in a mirror, so you may be able to get the added protection with raid 1 sets (again, probably not relevant to four drives), although there were bugs in this within the last six months or so, so you need to be sure your kernel is new enough to have the fix. Well, here arises another doubt. Should I go for a single RAID 1+0 storing OS + Data + WAL files or will I be better off with two RAID 1 sets, separating data from OS + WAL files?
if you can afford the space, you are almost certainly better off separating the WAL from the data (I think I've seen debates about which is better, OS+data/WAL or data/OS+WAL, but very little disagreement that either is better than combining them all) David Lang now, if you can afford solid-state drives which don't have noticeable seek times, things are completely different ;-) Ha, sadly budget is very tight. :) Regards, Fernando. ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [PERFORM] With 4 disks should I go for RAID 5 or RAID 10
On Wed, 26 Dec 2007, Florian Weimer wrote: seek/read/calculate/seek/write since the drive moves on after the read), when you read you must read _all_ drives in the set to check the data integrity. I don't know of any RAID implementation that performs consistency checking on each read operation. 8-( I could see a raid 1 array not doing consistency checking (after all, it has no way of knowing what's right if it finds an error), but since raid 5/6 can repair the data I would expect them to do the checking each time. David Lang ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PERFORM] With 4 disks should I go for RAID 5 or RAID 10
On Wed, 26 Dec 2007, Mark Mielke wrote: Florian Weimer wrote: seek/read/calculate/seek/write since the drive moves on after the read), when you read you must read _all_ drives in the set to check the data integrity. I don't know of any RAID implementation that performs consistency checking on each read operation. 8-( Dave had too much egg nog... :-) Yep - checking consistency on read would eliminate the performance benefits of RAID under any redundant configuration. except for raid0, raid is primarily a reliability benefit; any performance benefit is incidental, not the primary purpose. that said, I have heard of raid1 setups where it only reads off of one of the drives, but I have not heard of higher raid levels doing so. David Lang ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PERFORM] With 4 disks should I go for RAID 5 or RAID 10
On Wed, 26 Dec 2007, Mark Mielke wrote: [EMAIL PROTECTED] wrote: Thanks for the explanation David. It's good to know not only what but also why. Still I wonder why reads do hit all drives. Shouldn't only 2 disks be read: the one with the data and the parity disk? no, because the parity is of the sort (A+B+C+P) mod X = 0 so if X=10 (which means in practice that only the last decimal digit of anything matters, very convenient for examples) A=1, B=2, C=3, A+B+C=6, P=4, A+B+C+P=10=0 if you read B and get 3 and P and get 4 you don't know if this is right or not unless you also read A and C (at which point you would get A+B+C+P=11=1=error) I don't think this is correct. RAID 5 is parity which is XOR. The property of XOR is such that it doesn't matter what the other drives are. You can write any block given either: 1) The block you are overwriting and the parity, or 2) all other blocks except for the block we are writing and the parity. Now, it might be possible that option 2) is taken more than option 1) for some complicated reasons, but it is NOT to check consistency. The array is assumed consistent until proven otherwise. I was being sloppy in explaining the reason. you are correct that for writes you don't need to read all the data, you just need the current parity block, the old data you are going to replace, and the new data to be able to calculate the new parity block (and note that even with my checksum example this would be the case). however I was addressing the point that for reads you can't do any checking until you have read in all the blocks. if you never check the consistency, how will it ever be proven otherwise? in theory a system could get the same performance with a large sequential read/write on raid5/6 as on a raid0 array of equivalent size (i.e. same number of data disks, ignoring the parity disks) because the OS could read the entire stripe in at once, do the calculation once, and use all the data (or when writing, don't write anything until you are ready to write the entire stripe, calculate the parity and write everything once). For the same number of drives, this cannot be possible. With 10 disks, on raid5, 9 disks hold data, and 1 holds parity. The theoretical maximum performance is only 9/10 of the 10/10 performance possible with RAID 0. I was saying that a 10 drive raid0 could be the same performance as a 10+1 drive raid 5 or a 10+2 drive raid 6 array. this is why I said 'same number of data disks, ignoring the parity disks'. in practice you would probably not do quite this well anyway (you have the parity calculation to make and the extra drive or two's worth of data passing over your busses), but it could be a lot closer than any implementation currently is. Unfortunately in practice filesystems don't support this, they don't do enough readahead to want to keep the entire stripe (so after they read it all in they throw some of it away), they (mostly) don't know where a stripe starts (and so intermingle different types of data on one stripe and spread data across multiple stripes unnecessarily), and they tend to do writes in small, scattered chunks (rather than flushing an entire stripe's worth of data at once) In my experience, this theoretical maximum is not attainable without significant write cache, and an intelligent controller, neither of which Linux software RAID seems to have by default.
My situation was a bit worse in that I used applications that fsync() or journalled metadata that is ordered, which forces the Linux software RAID to flush far more than it should - but the same system works very well with RAID 1+0. my statements above apply to any type of raid implementation, hardware or software. the thing that saves the hardware implementation is that the data is written to a battery-backed cache and the controller lies to the system, telling it that the write is complete, and then it does the write later. on a journaling filesystem you could get very similar results if you put the journal on a solid-state drive. but for your application, the fact that you are doing lots of fsyncs is what's killing you, because the fsync forces a lot of data to be written out, swamping the caches involved, and requiring that you wait for seeks. nothing other than a battery-backed disk cache of some sort will help here (either on the controller, or a solid-state drive holding the journal on a journaled filesystem would work) David Lang ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
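To make the XOR point concrete, a small self-contained demonstration with toy 4-byte blocks: XOR parity lets any single missing block be rebuilt from all the survivors, while a read of one data block by itself has nothing it can be checked against, which is why RAID 5 implementations do not verify parity on every read.

    from functools import reduce

    def xor_blocks(blocks):
        # byte-wise XOR of equal-length blocks
        return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))

    a = b"\x01\x02\x03\x04"
    b = b"\x10\x20\x30\x40"
    c = b"\x0f\x0e\x0d\x0c"
    parity = xor_blocks([a, b, c])            # what the parity drive would hold

    # lose any one block (say b): XOR of everything that's left rebuilds it
    rebuilt_b = xor_blocks([a, c, parity])
    assert rebuilt_b == b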
Re: [PERFORM] With 4 disks should I go for RAID 5 or RAID 10
On Wed, 26 Dec 2007, Mark Mielke wrote: [EMAIL PROTECTED] wrote: I could see a raid 1 array not doing consistancy checking (after all, it has no way of knowing what's right if it finds an error), but since raid 5/6 can repair the data I would expect them to do the checking each time. Your messages are spread across the thread. :-) RAID 5 cannot repair the data. I don't know much about RAID 6, but I expect it cannot necessarily repair the data either. It still doesn't know which drive is wrong. In any case, there is no implementation I am aware of that performs mandatory consistency checks on read. This would be silliness. sorry, raid 5 can repair data if it knows which chunk is bad (the same way it can rebuild a drive). Raid 6 does something slightly different for it's parity, I know it can recover from two drives going bad, but I haven't looked into the question of it detecting bad data. David Lang ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PERFORM] With 4 disks should I go for RAID 5 or RAID 10
On Wed, 26 Dec 2007, Mark Mielke wrote: [EMAIL PROTECTED] wrote: On Wed, 26 Dec 2007, Mark Mielke wrote: Florian Weimer wrote: seek/read/calculate/seek/write since the drive moves on after the read), when you read you must read _all_ drives in the set to check the data integrity. I don't know of any RAID implementation that performs consistency checking on each read operation. 8-( Dave had too much egg nog... :-) Yep - checking consistency on read would eliminate the performance benefits of RAID under any redundant configuration. except for raid0, raid is primarily a reliability benifit, any performance benifit is incidental, not the primary purpose. that said, I have heard of raid1 setups where it only reads off of one of the drives, but I have not heard of higher raid levels doing so. What do you mean "heard of"? Which raid system do you know of that reads all drives for RAID 1? Linux dmraid reads off ONLY the first. Linux mdadm reads off the "best" one. Neither read from both. Why should it need to read from both? What will it do if the consistency check fails? It's not like it can tell which disk is the right one. It only knows that the whole array is inconsistent. Until it gets an actual hardware failure (read error, write error), it doesn't know which disk is wrong. yes, the two linux software implementations only read from one disk, but I have seen hardware implementations where it reads from both drives, and if they disagree it returns a read error rather then possibly invalid data (it's up to the admin to figure out which drive is bad at that point). no, I don't remember which card this was. I've been playing around with things in this space for quite a while. David Lang ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PERFORM] Best way to index IP data?
On Fri, 11 Jan 2008, Greg Smith wrote: You may well need netmasks to configure your interface, but there's absolutely no need for them to identify an IP endpoint, which is all you need to identify the destination the packet is going to, and that is the most common use of IP addresses. Technically you can't ever send a packet unless you know both the endpoint and your local netmask. As the sender, you're obligated to determine if the destination is on your local LAN (in which case you send it there) or if it goes to the gateway. That's similar to a routing decision, but it's not quite--if you don't have to look in a routing table, it's not actually part of routing. you also need to know your local IP address, but there is no reason to need the netmask of the other end. my IP address is 64.81.33.126; why do you need to know my netmask? how would you find out what it is? DNS doesn't report the netmask, and it's arguably the biggest database of IP addresses around ;-) one of the biggest reasons for storing IP addresses in a SQL database is as part of log analysis. David Lang I believe this sort of detail is why subnet masks are considered required for some things even though it doesn't seem like they are needed. Regardless, the details of how the packets move aren't important to some applications, and arguing over what the RFCs do and don't require doesn't change that. -- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq ---(end of broadcast)--- TIP 6: explain analyze is your friend
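For the log-analysis use case mentioned above, a hedged sketch of what that might look like with PostgreSQL's inet type. The table and column names are made up; the <<= operator is the standard inet "is contained within or equals" test, and note that only the addresses we stored are involved, so the remote host's netmask never enters into it.

    import psycopg2

    conn = psycopg2.connect("dbname=logs")        # hypothetical database
    with conn.cursor() as cur:
        cur.execute("""
            CREATE TABLE access_log (
                ts     timestamptz NOT NULL,
                src_ip inet        NOT NULL,
                url    text
            )""")

        # how many hits came from one /19 that we picked after the fact?
        cur.execute("SELECT count(*) FROM access_log "
                    "WHERE src_ip <<= inet '64.81.32.0/19'")
        print(cur.fetchone()[0])
    conn.commit()

(The example address above, 64.81.33.126, would fall inside that network.)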
Re: [PERFORM] [OT] RAID controllers blocking one another?
On Thu, 17 Jan 2008, Sean Davis wrote: We have a machine that serves as a fileserver and a database server. Our server hosts a raid array of 40 disk drives, attached to two 3ware cards, one 9640SE-24 and one 9640SE-16. We have noticed that activity on one controller blocks access on the second controller, not only for disk-IO but also for the command line tools, which become unresponsive for the inactive controller. The controllers are sitting in adjacent PCI-express slots on a machine with dual dual-core AMD and 16GB of RAM. Has anyone else noticed issues like this? Throughput for either controller is a pretty respectable 150-200MB/s writing and somewhat faster for reading, but the "blocking" is problematic, as the machine is serving multiple purposes. I know this is off-topic, but I know lots of folks here deal with very large disk arrays; it is hard to get real-world input on machines such as these. there have been a lot of discussions on the linux-kernel mailing list over the last several months on the topic of IO to one set of drives interfering with IO to another set of drives. The soon-to-be-released 2.6.24 kernel includes a substantial amount of work in this area that (at least on initial reports) is showing significant improvements. I haven't had the time to test this out yet, so I can't add personal experience, but it's definitely something to look at on a test system. David Lang ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PERFORM] scheduler
On Tue, 22 Jan 2008, Adrian Moisey wrote: Hi Which scheduler is recommended for a box that is dedicated to running postgres? I've asked google and found no answers. Is it the OS itself? Yes, in linux. I've found that cfq or deadline is best, but I haven't seen anyone try a benchmark. it also depends on your hardware. if you have a good battery-backed cache on your I/O system you may be best off with the simple 'noop' elevator. the more complexity in your I/O stack that the kernel doesn't know about, the less likely it is to make the right decision. David Lang ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
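For reference, on a 2.6 kernel the elevator can be inspected and switched per device at runtime, roughly like this (sda is just an example device):

# the scheduler shown in [brackets] is the active one
cat /sys/block/sda/queue/scheduler
# switch to deadline (or noop, anticipatory, cfq) without rebooting
echo deadline > /sys/block/sda/queue/scheduler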
Re: [PERFORM] 8x2.5" or 6x3.5" disks
On Mon, 28 Jan 2008, Arjen van der Meijden wrote: On 28-1-2008 20:25 Christian Nicolaisen wrote: So, my question is: should I go for the 2.5" disk setup or 3.5" disk setup, and does the raid setup in either case look correct? Afaik they are about equal in speed, with the smaller ones being a bit faster in random access and the larger ones a bit faster for sequential reads/writes. I missed the initial post in this thread, but I haven't seen any 15K rpm 2.5" drives, so if you compare 10K rpm 2.5" drives with 15K rpm 3.5" drives you will see differences (depending on your workload and controller cache) David Lang ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PERFORM] migration of 7.4 to 8.1
On Wed, 12 Mar 2008, sathiya psql wrote: My question is how to migrate my database from 7.4 to 8.1 -- is it just a matter of dumping the db and restoring it in 8.1? If I do that, will it work without problems, or do I have to make some manual changes? you would need to look at the release notes for each version to see if you are using any of the features that were changed. but in return for this hassle (which you will need to do eventually, you are just deciding when) you get to use the newer version with all the speed improvements, not to mention the new features. going from 7.x to 8.0 is the most painful step. the steps from 8.0 to 8.1 to 8.2 to 8.3 are relatively minor in comparison. David Lang -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
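A minimal sketch of the dump-and-reload step itself; the usual advice is to run the newer version's pg_dump against the old server (the binary path follows Debian's layout, and the host and database names are made up):

# dump the 7.4 database using the newer pg_dump
/usr/lib/postgresql/8.1/bin/pg_dump -h oldhost mydb > mydb.sql
# load it into the new cluster, then test the application against it
psql -h newhost -d mydb -f mydb.sql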
Re: [PERFORM] migration of 7.4 to 8.1
On Wed, 12 Mar 2008, Craig Ringer wrote: sathiya psql wrote: This might be a silly question, but ... why 8.1? If you're doing a major upgrade, why not go straight to 8.3? It's been out long enough that there aren't any obvious nasty bugs, and there have been a fair few fixes and improvements since prior versions. Because I am using Debian Etch stable... I need to continue using this... in Debian Etch stable only 8.1 is available. I use Etch on my servers. The magic of etch-backports from backports.org allows me to use 8.3 without messing with anything else. It works extremely well. All I had to do to install 8.3 was add the etch-backports line to my /etc/apt/sources.list, run `apt-get update', then run: apt-get -t etch-backports install postgresql-8.3 ... then migrate the data to it and remove the old packages. See backports.org for a mirror list. I will be installing my project on other machines that also run Debian Etch stable, and I don't want to face any problems... so that is why... am I right? Personally I'd use 8.3 from backports.org, but it's of course up to you. consider the fact that after 1 year Etch is 2 major releases out of date. by the time the next stable release is made you will be running on ancient versions unless you make use of backports.org (or compile it yourself, which is usually my choice). David Lang -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
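For reference, the etch-backports entry mentioned above looks something like this in /etc/apt/sources.list (the mirror host may differ; check backports.org for the current list):

deb http://www.backports.org/debian etch-backports main contrib non-free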
Re: [PERFORM] migration of 7.4 to 8.1
On Wed, 12 Mar 2008, sathiya psql wrote: On the home page itself they were saying testing ... unstable. you are talking about the debian home page, right? then we should not use that for live, so I prefer 8.1. Debian selected the version of Postgres for Etch about a year and a half ago. At that point selecting 8.1 was a reasonable choice. Debian has a policy that they will never change the version number of a package in a stable release (they will backport many bugfixes, but not upgrade the version). As a result, 2 years from now when Postgres is on 8.5 stable (and looking at 8.6), Debian Etch will still be on 8.1. it is of course your option to stick with 8.1, but before very long the answer to most of your questions about postgres is going to be 'upgrade to a reasonably current version' (people running 7.4 and 8.0 are starting to get that answer now. 8.1 and 8.2 are recent enough that it's rare to get that now, but in a year or two that will change) for most utility software you just want it to work and don't really care about new features over a couple of years (or the project has reached the stage where it just doesn't change much over a couple of years). In these situations the Debian policy is great, you don't have to worry about new stuff messing you up. however some software has much faster development cycles. The kernel has a new release about every two months, Postgres is aiming at a one-year cycle, Apache has its own release schedule. These packages are usually pretty core to your operation and the improvements (and security fixes that are not possible to backport sanely) mean that you need to think very hard about what version of them you are going to run. On my systems I bypass Debian for such packages and compile them myself; the backports.org option allows you to avoid that hassle and get versions that are fairly well tested (like any new version, you need to do some testing yourself), just wait a month or two after a new release hits backports.org and you will be pretty safe. David Lang You can get 8.3 from backports: http://www.backports.org/ - it's a debian project to get more up to date versions for existing stable releases (they package everything exactly the same way). -- Postgresql & php tutorials http://www.designmagick.com/ -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] What is the best way to storage music files in Postgresql
On Wed, 19 Mar 2008, Andrej Ricnik-Bay wrote: On 18/03/2008, Craig Ringer <[EMAIL PROTECTED]> wrote: Isn't a 10 or 24 spindle RAID 5 array awfully likely to encounter a double disk failure (such as during the load imposed by rebuild onto a spare) ? that's why you should use raid6 (allowing for dual failures) David Lang -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] What is the best way to storage music files in Postgresql
On Tue, 18 Mar 2008, Gregory Stark wrote: <[EMAIL PROTECTED]> writes: On Wed, 19 Mar 2008, Andrej Ricnik-Bay wrote: On 18/03/2008, Craig Ringer <[EMAIL PROTECTED]> wrote: Isn't a 10 or 24 spindle RAID 5 array awfully likely to encounter a double disk failure (such as during the load imposed by rebuild onto a spare) ? that's why you should use raid6 (allowing for dual failures) You can have as many parity drives as you want with RAID 5 too. you can? I've never seen a raid 5 setup with more than a single parity drive (or even the option of having more than one drive's worth of redundancy). you can have hot-spare drives, but that's a different thing. what controller/software lets you do this? David Lang -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] What is the best way to storage music files in Postgresql
On Wed, 19 Mar 2008, Craig Ringer wrote: [EMAIL PROTECTED] wrote: you can? I've never seen a raid 5 setup with more than a single parity drive (or even the option of having more than one drive's worth of redundancy). you can have hot-spare drives, but that's a different thing. With RAID 4, where the "parity drives" are in fact dedicated to parity information, the controller could just store the parity data mirrored on more than one drive. Unfortunately write performance on RAID 4 is absolutely horrible, and a second or third parity disk would not help with that. I suppose there's nothing stopping a controller adding a second disk's worth of duplicate parity information when striping a four or more disk RAID 5 array, but I thought that's basically what RAID 6 was. just duplicating the RAID 4 or 5 parity information will not help you unless the parity drive is one of the drives that fail. raid 6 uses a different parity algorithm so that any two drives in the array can fail with no data loss. even this isn't completely error-proof. I just went through a scare with a 15-disk array where it reported 3 dead drives after a power outage. one of the dead drives ended up being the hot spare, and another drive that acted up worked well enough to let me eventually recover all the data (seek errors), but it was a very scary week while I worked through this. David Lang -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
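As a sketch of the dual-parity-plus-hot-spare layout being recommended here, using linux software raid (the device names and counts are made up, and the array in the anecdote above may well have been hardware raid):

# 15 disks total: 14 in the raid6 set (12 disks of data, 2 of parity)
# plus one hot spare that a rebuild can start onto automatically
mdadm --create /dev/md0 --level=6 --raid-devices=14 \
      --spare-devices=1 /dev/sd[b-p]1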
Re: [PERFORM] Performance increase with elevator=deadline
On Tue, 15 Apr 2008, Florian Weimer wrote: * Jeff: Using 4 of these with a dataset of about 30GB across a few files (Machine has 8GB mem) I went from around 100 io/sec to 330 changing to noop. Quite an improvement. If you have a decent controller CFQ is not what you want. I tried deadline as well and it was a touch slower. The controller is a 3ware 9550sx with 4 disks in a raid10. I'll be trying this out on the big array later today. I found it surprising this info wasn't more widespread (the use of CFQ on a good controller). 3ware might be a bit special because the controller has got very deep queues on its own, so many assumptions of the kernel I/O schedulers do not seem to apply. Toying with the kernel/controller queue depths might help, but I haven't done real benchmarks to see if it's actually a difference. A few days ago, I experienced this: On a machine with a 3ware controller, a simple getxattr call on a file in an uncontended directory took several minutes because a PostgreSQL dump process was running in the background (and some other activity of a legacy database which caused frequent fdatasync calls). Clearly, this is unacceptable, and I've since switched to the deadline scheduler, too. So far, this particular behavior hasn't occurred again. one other thing to watch out for: up until very recent kernels (2.6.23 or 2.6.24) it was possible for one very busy block device to starve other block devices. they added isolation of queues for different block devices, but I've seen reports that the isolation can end up throttling high-performance devices as a result. I haven't had time to really dig into this, but there are tuning knobs available to adjust the queue space available to different devices, and the reports show significantly better behavior on a tuned system. David Lang -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
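A minimal sketch of the related knobs (the kernel version, device name, and values below are only examples, not recommendations):

# pick the deadline elevator for every device at boot, via the kernel command line
kernel /vmlinuz-2.6.24 ro root=/dev/sda1 elevator=deadline
# or adjust the per-device request queue depth at runtime
echo 512 > /sys/block/sdb/queue/nr_requests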
Re: [PERFORM] Strange behavior: pgbench and new Linux kernels
On Thu, 17 Apr 2008, Greg Smith wrote: On Thu, 17 Apr 2008, Matthew wrote: The last message in the thread says that 2.6.25-rc6 has the problem nailed. That was a month ago. So I guess, upgrade to 2.6.25, which was released today. Ah, even more support for me to distrust everything I read. The change has flattened out things, so now the pgbench results are awful everywhere. On this benchmark 2.6.25 is the worst kernel yet:

-bash-3.00$ pgbench -S -c 4 -t 1 pgbench | grep excluding
tps = 8619.710649 (excluding connections establishing)
tps = 8664.321235 (excluding connections establishing)
tps = 8671.973915 (excluding connections establishing)
(was 18388 in 2.6.9 and 16621 in 2.6.23-3)

-bash-3.00$ pgbench -S -c 8 -t 1 pgbench | grep excluding
tps = 9011.728765 (excluding connections establishing)
tps = 9039.441796 (excluding connections establishing)
tps = 9206.574000 (excluding connections establishing)
(was 15760 in 2.6.9 and 15551 in 2.6.23-3)

-bash-3.00$ pgbench -S -c 16 -t 1 pgbench | grep excluding
tps = 7063.710786 (excluding connections establishing)
tps = 6956.266777 (excluding connections establishing)
tps = 7120.971600 (excluding connections establishing)
(was 14148 in 2.6.9 and 7311 in 2.6.23-3)

-bash-3.00$ pgbench -S -c 32 -t 1 pgbench | grep excluding
tps = 7006.311636 (excluding connections establishing)
tps = 6971.305909 (excluding connections establishing)
tps = 7002.820583 (excluding connections establishing)
(was 13647 in 2.6.9 and 7141 in 2.6.23-3)

This is what happens when the kernel developers are using results from a MySQL tool to optimize things I guess. It seems I have a lot of work ahead of me to nail down and report what's going on here. report this to the kernel list so that they know, and be ready to test fixes. the kernel developers base success or failure on the results of tests. if the only people providing test results are MySQL people, how would they know there is a problem? David Lang -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Postgres replication
On Wed, 30 Apr 2008, Gernot Schwed wrote: Hi all, looking for a HA master/master or master/slave replication solution. Our setup consists of two databases and we want to use them both for queries. Aside from pgpool II there seems to be no advisable replication solution. But the problem seems to be that we will have a single point of failure with pgpool. Slony also has the disadvantage of not providing a real failover solution. Are there any other manageable and well tested tools/setups for our scenario? I'm about to set up a similar config, and what I was intending to do is to run pgpool on both boxes and use heartbeat (from http://linux-ha.org ) to move an IP address from one box to the other. clients connect to this virtual IP and then pgpool will distribute the connections to both systems from there. David Lang -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
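A minimal sketch of the heartbeat v1 side of that setup (the node name, virtual address, and the assumption that pgpool is started by an init script named "pgpool" are all made up for illustration):

# /etc/ha.d/haresources -- node1 normally owns the virtual IP and pgpool;
# heartbeat moves both to the surviving node on failure
node1 IPaddr::192.168.0.50/24/eth0 pgpool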
Re: [PERFORM] which ext3 fs type should I use for postgresql
On Thu, 15 May 2008, Matthew Wakeling wrote: On Thu, 15 May 2008, Philippe Amelant wrote: using mkfs.ext3 I can use "-T" to tune the filesystem mkfs.ext3 -T fs_type ... fs_type are in /etc/mke2fs.conf (on debian) If you look at that file, you'd see that tuning really doesn't change that much. In fact, the only thing it does change (if you avoid "small" and "floppy") is the number of inodes available in the filesystem. Since Postgres tends to produce few large files, you don't need that many inodes, so the "largefile" option may be best. However, note that the number of inodes is a hard limit of the filesystem - if you try to create more files on the filesystem than there are available inodes, then you will get an out of space error even if the filesystem has space left. The only real benefit of having not many inodes is that you waste a little less space, so many admins are pretty generous with this setting. IIRC postgres likes to do 1M/file, which isn't very large as far as the -T setting goes. Probably of more use are some of the other settings: -m reserved-blocks-percentage - this reserves a portion of the filesystem that only root can write to. If root has no need for it, you can kill this by setting it to zero. The default is for 5% of the disc to be wasted. think twice about this. ext2/3 get slow when they fill up (they have fragmentation problems when free space gets too small); this 5% that only root can use also serves as a buffer against that as well. -j turns the filesystem into ext3 instead of ext2 - many people say that for Postgres you shouldn't do this, as ext2 is faster. for the partition with the WAL on it you may as well do ext2 (the WAL is written synchronously and sequentially so the journal doesn't help you), but for the data partition you may benefit from the journal. David Lang -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
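A minimal sketch of how those options might be combined for the two partitions discussed above (the device names are only examples):

# data partition: ext3, fewer inodes, and a small reserved-block buffer kept
# as insurance against fragmentation when the filesystem gets full
mkfs.ext3 -T largefile -m 2 /dev/sdc1
# WAL partition: ext2 is often suggested, since the WAL is written
# sequentially and synchronously and doesn't gain much from the journal
mkfs.ext2 -T largefile /dev/sdb1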
Re: [PERFORM] which ext3 fs type should I use for postgresql
On Thu, 15 May 2008, [EMAIL PROTECTED] wrote: [quoting the ext2/ext3 tuning discussion above in full] a fairly recent article on the subject: http://www.commandprompt.com/blogs/joshua_drake/2008/04/is_that_performance_i_smell_ext2_vs_ext3_on_50_spindles_testing_for_postgresql/ David Lang -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Hardware vs Software Raid
On Thu, 26 Jun 2008, Peter T. Breuer wrote: "Also sprach Merlin Moncure:" The linux software raid algorithms are highly optimized, and run on a I can confidently tell you that that's balderdash both as a Linux author and as a software RAID linux author (check the attributions in the kernel source, or look up something like "Raiding the Noosphere" on google). presumably (much faster) cpu than what the controller supports. However, there is still some extra oomph you can get out of letting the raid controller do what the software raid can't...namely delay sync for a time. There are several design problems left in software raid in the linux kernel. One of them is the need for extra memory to dispatch requests with and as (i.e. buffer heads and buffers, both). bhs should be OK since the small cache per device won't be exceeded while the raid driver itself serialises requests, which is essentially the case (it does not do any buffering, queuing, whatever .. and tries hard to avoid doing so). The need for extra buffers for the data is a problem. On different platforms different aspects of that problem are important (would you believe that on ARM mere copying takes so much cpu time that one wants to avoid it at all costs, whereas on intel it's a forgettable trivium). I also wouldn't absolutely swear that request ordering is maintained under ordinary circumstances. which flavor of linux raid are you talking about (the two main families I am aware of are the md and dm ones)? David Lang -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] syslog performance when logging big statements
On Tue, 8 Jul 2008, Tom Lane wrote: Jeff <[EMAIL PROTECTED]> writes: On Jul 8, 2008, at 8:24 AM, Achilleas Mantzios wrote: File sizes of about 3M result in actual logging output of ~ 10Mb. In this case, the INSERT *needs* 20 minutes to return. This is because the logging through syslog seems to severely slow the system. If instead, i use stderr, even with logging_collector=on, the same statement needs 15 seconds to return. In syslog.conf is the destination for PG marked with a "-"? (ie -/var/log/pg.log) which tells syslog to not sync after each line logged. That could explain a large chunk of the difference in time. I experimented with this a bit here. There definitely is an O(N^2) penalty from the repeated strchr() calls, but it doesn't really start to hurt till 1MB or so statement length. Even with that patched, syslog logging pretty much sucks performance-wise. Here are the numbers I got on a Fedora 8 workstation, testing the time to log a statement of the form SELECT length('123456...lots of data, no newlines...7890');

statement length                     1MB       10MB
CVS HEAD                             2523ms    215588ms
+ patch to fix repeated strchr       529ms     36734ms
after turning off syslogd's fsync    569ms     5881ms
PG_SYSLOG_LIMIT 1024, fsync on       216ms     2532ms
PG_SYSLOG_LIMIT 1024, no fsync       242ms     2692ms

For comparison purposes:
logging statements to stderr         155ms     2042ms
execute statement without logging    42ms      520ms

This machine is running a cheap IDE drive that caches writes, so the lack of difference between fsync off and fsync on is not too surprising --- on a machine with server-grade drives there'd be a lot more difference. (The fact that there is a difference in the 10MB case probably reflects filling the drive's write cache.) On my old HPUX machine, where fsync really works (and the syslogd doesn't seem to allow turning it off), the 1MB case takes 195957ms with the strchr patch, 22922ms at PG_SYSLOG_LIMIT=1024. So there's a fairly clear case to be made for fixing the repeated strchr, but I also think that there's a case for jacking up PG_SYSLOG_LIMIT. As far as I can tell the current value of 128 was chosen *very* conservatively without thought for performance: http://archives.postgresql.org/pgsql-hackers/2000-05/msg01242.php At the time we were looking at evidence that the then-current Linux syslogd got tummyache with messages over about 1KB: http://archives.postgresql.org/pgsql-hackers/2000-05/msg00880.php Some experimentation with the machines I have handy now says that

Fedora 8:             truncates messages at 2KB (including syslog's header)
HPUX 10.20 (ancient): ditto
Mac OS X 10.5.3:      drops messages if longer than about 1900 bytes

So it appears to me that setting PG_SYSLOG_LIMIT = 1024 would be perfectly safe on current systems (and probably old ones too), and would give at least a factor of two speedup for logging long strings --- more like a factor of 8 if syslogd is fsync'ing. Comments? Anyone know of systems where this is too high? Perhaps we should make that change only in HEAD, not in the back branches, or crank it up only to 512 in the back branches? with linux ext2/ext3 filesystems I have seen similar problems when the syslog starts getting large. there are several factors here:
1. fsync after each write unless you have "-" in syslog.conf (only available on linux AFAIK)
2. ext2/ext3 tend to be very inefficient when doing appends to large files.
each write requires that the syslog daemon seek to the end of the file (because something else may have written to the file in the meantime) and with the small block sizes and chaining of indirect blocks this can start to be painful when logfiles get up into the MB range. note that you see this same problem when you start to get lots of files in one directory as well. even if you delete a lot of files the directory itself is still large and this can cause serious performance problems. other filesystems are much less sensitive to file (and directory) sizes. my suggestion would be to first make sure you are doing async writes to syslog, and then try putting the logfiles on different filesystems to see how they differ. personally I use XFS most of the time where I expect lots of files or large files. David Lang -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
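A minimal sketch of the async-syslog setup being suggested (the facility and log path are only examples):

# postgresql.conf
log_destination = 'syslog'
syslog_facility = 'LOCAL0'

# /etc/syslog.conf -- the leading "-" tells syslogd not to fsync after each line
local0.*    -/var/log/postgresql.log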