Re: [PERFORM] Sunfire X4500 recommendations
On Tue, 27 Mar 2007, Matt Smiley wrote:

The goal is to calculate the probability of data loss when we lose a certain number of disks within a short timespan (e.g. losing a 2nd disk before replacing+rebuilding the 1st one). For RAID 10, 50, and Z, we will lose data if any disk group (i.e. mirror or parity group) loses 2 disks. For RAID 60 and Z2, we will lose data if 3 disks die in the same parity group. The parity groups can include arbitrarily many disks. Having larger groups gives us more usable disk space but less protection. (Naturally we're more likely to lose 2 disks in a group of 50 than in a group of 5.)

g = number of disks in each group (e.g. mirroring = 2; single-parity = 3 or more; dual-parity = 4 or more)
n = total number of disks
risk of losing any 1 disk = 1/n

please explain why you are saying that the risk of losing any 1 disk is 1/n. shouldn't it be probability of failure * n instead?

risk of losing 1 disk from a particular group = g/n
risk of losing 2 disks in the same group = g/n * (g-1)/(n-1)
risk of losing 3 disks in the same group = g/n * (g-1)/(n-1) * (g-2)/(n-2)

following this logic the risk of losing all 48 disks in a single group of 48 would be 100%

also, what you are looking for is the probability of the second (and third) disks failing in time X (where X is the time necessary to notice the failure, get a replacement, and rebuild the disk).

the killer is the time needed to rebuild the disk; with multi-TB arrays it's sometimes faster to re-initialize the array and reload from backup than it is to do a live rebuild (the kernel.org servers had a RAID failure recently and HPA mentioned that it took a week to rebuild the array, but it would have only taken a couple of days to do a restore from backup).

add to this the fact that disk failures do not appear to be truly independent of each other statistically (see the recent studies released by Google and CMU), and I wouldn't bother with single-parity for a multi-TB array. If the data is easy to recreate (including from backup) or short lived (say a database of log data that cycles every month or so) I would just do RAID 0 and plan on losing the data on drive failure (this assumes that you can afford the loss of service when this happens). if the data is more important then I'd do dual-parity or more, along with a hot spare so that the rebuild can start as soon as the first failure is noticed by the system, to give myself a fighting chance to save things.

In terms of performance, I think RAID 10 should always be best for write speed. (Since it doesn't calculate parity, writing a new block doesn't require reading the rest of the RAID stripe just to recalculate the parity bits.) I think it's also normally just as fast for reading, since the controller can load-balance the pending read requests to both sides of each mirror.

this depends on your write pattern. if you are doing sequential writes (say writing a log archive) then RAID 5 can be faster than RAID 10: since there is no data there to begin with, the system doesn't have to read anything to calculate the parity, and with the data spread across more spindles you have a higher potential throughput. if your write pattern is more random, and especially if you are overwriting existing data, then the reads needed to calculate the parity will slow you down.

as for read speed, it all depends on your access pattern and stripe size. if you are reading data that spans disks (larger than your stripe size) you end up with a single read tying up multiple spindles.
with RAID 1 (and variants) you can read from either disk of the set if you need different data within the same stripe that's on different disk tracks (if it's on the same track you'll get it just as fast reading from a single drive, or so close to it that it doesn't matter). beyond that the question is how many spindles you can keep busy reading (as opposed to seeking to new data or sitting idle because you don't need their data). the worst case for reading is to be jumping through your data in strides of stripe size * number of disks available (accounting for RAID type), as all your reads will end up hitting the same disk.

David Lang
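A minimal sketch of the same-group arithmetic quoted above -- not from the original posts, and using the corrected form given later in the thread (the first failure only fixes which group is at risk, so the leading g/n term is dropped):

    # Probability that, after k unreplaced failures, all of them landed in the
    # same g-disk group of an n-disk array (assumes every disk is equally likely
    # to be the next one to fail -- the 1/n assumption debated in the follow-ups).
    def same_group_loss(g, n, k):
        p = 1.0
        for i in range(1, k):          # condition on the group of the first failure
            p *= (g - i) / (n - i)     # i disks of that group (and of the array) are gone
        return p

    print(same_group_loss(g=2, n=48, k=2))    # RAID 10 on 48 disks: ~0.021
    print(same_group_loss(g=48, n=48, k=2))   # one 48-disk group: 1.0, as noted above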
Re: [PERFORM] Sunfire X4500 recommendations
On Thu, 29 Mar 2007, Matt Smiley wrote:

Hi David, Thanks for your feedback! I'm rather a newbie at this, and I do appreciate the critique. First, let me correct myself: the formulas for the risk of losing data when you lose 2 and 3 disks shouldn't have included the first term (g/n). I'll give the corrected formulas and tables at the end of the email.

please explain why you are saying that the risk of losing any 1 disk is 1/n. shouldn't it be probability of failure * n instead?

1/n represents the assumption that all disks have an equal probability of being the next one to fail. This seems like a fair assumption in general for the active members of a stripe (not including hot spares). A possible exception would be the parity disks (because reads always skip them and writes always hit them), but that's only a consideration if the RAID configuration used dedicated disks for parity instead of distributing it across the RAID 5/6 group members. Apart from that, whether the workload is write-heavy or read-heavy, sequential or scattered, the disks in the stripe ought to handle a roughly equivalent number of iops over their lifetime.

only assuming that you have a 100% chance of some disk failing. if you have 15 disks in one array and 60 disks in another array, the chance of having _some_ failure in the 15-disk array is only 1/4 the chance of having a failure of _some_ disk in the 60-disk array.

following this logic the risk of losing all 48 disks in a single group of 48 would be 100%

Exactly. Putting all disks in one group is RAID 0 -- no data protection. If you lose even 1 active member of the stripe, the probability of losing your data is 100%.

but by your math, the chance of failure with dual parity on a 48-disk group would also be 100%, and that is just wrong.

also what you are looking for is the probability of the second (and third) disks failing in time X (where X is the time necessary to notice the failure, get a replacement, and rebuild the disk)

Yep, that's exactly what I'm looking for. That's why I said, "these probabilities are only describing the case where we don't have enough time between disk failures to recover the array." My goal wasn't to estimate how long time X is. (It doesn't seem like a generalizable quantity; due partly to logistical and human factors, it's unique to each operating environment.) Instead, I start with the assumption that time X has been exceeded, and we've lost a 2nd (or 3rd) disk in the array. Given that assumption, I wanted to show the probability that the loss of the 2nd disk has caused the stripe to become unrecoverable.

OK, so this is the chance that, if you lose N disks without replacing any of them, you will lose data in the different array layouts.

We know that RAID 10 and 50 can tolerate the loss of anywhere between 1 and n/g disks, depending on how lucky you are. I wanted to quantify the amount of luck required, as a risk management tool. The duration of time X can be minimized with hot spares and attentive administrators, but the risk after exceeding time X can only be minimized (as far as I know) by configuring the RAID stripe with small enough underlying failure groups.

but I don't think this is the question anyone is really asking.
what people want to know isn't 'how many disks can I lose without replacing them before I lose data'; what they want to know is 'with this configuration (including a drive replacement time of Y for the first N drives and Z for drives after that), what are the odds of losing data', and for the second question the chance of failure of additional disks isn't 100%.

the killer is the time needed to rebuild the disk, with multi-TB arrays it's sometimes faster to re-initialize the array and reload from backup than it is to do a live rebuild (the kernel.org servers had a RAID failure recently and HPA mentioned that it took a week to rebuild the array, but it would have only taken a couple of days to do a restore from backup)

That's very interesting. I guess the rebuild time also would depend on how large the damaged failure group was. Under RAID 10, for example, I think you'd still only have to rebuild 1 disk from its mirror, regardless of how many other disks were in the stripe, right? So shortening the rebuild time may be another good motivation to keep the failure groups small.

correct, however you have to decide how much this speed is worth to you. if you are building a ~20TB array you can do this with ~30 drives with single or dual parity, or ~60 drives with RAID 10. remember the big cost of arrays like this isn't even the cost of the drives (although you are talking an extra $20,000 or so there), but the cost of the power and cooling to run all those extra drives.

add to this the fact that disk failures do not appear to be truly independent of each other statistically ...
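A quick numerical check of the 15-disk vs 60-disk point made earlier in this message -- a sketch of my own, with an assumed per-disk failure rate, showing that the chance of *some* disk failing scales roughly linearly with the number of disks:

    # chance of at least one failure among n disks, assuming independent failures
    p = 0.10 / (365 * 24)            # assumed per-disk, per-hour failure probability
    for n in (15, 60):
        print(n, 1 - (1 - p) ** n)   # roughly n*p for small p, so 60 disks is ~4x 15 disks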
Re: [PERFORM] SCSI vs SATA
On Tue, 3 Apr 2007, Geoff Tolley wrote:

Ron wrote: At 07:07 PM 4/3/2007, Ron wrote: For random IO, the 3ware cards are better than PERC. Question: will 8*15k 73GB SCSI drives outperform 24*7K 320GB SATA II drives? Nope. Not even if the 15K 73GB HDs were the brand new Savvio 15K screamers. Example assuming 3.5" HDs and RAID 10 => 4 15K 73GB vs 12 7.2K 320GB. The 15K's are 2x faster rpm, but they are only ~23% the density => advantage per HD to SATAs. Then there's the fact that there are 1.5x as many 7.2K spindles as 15K spindles... Oops, make that =3x= as many 7.2K spindles as 15K spindles...

I don't think the density difference will be quite as high as you seem to think: most 320GB SATA drives are going to be 3-4 platters, the most that a 73GB SCSI is going to have is 2, and more likely 1, which would make the SCSIs more like 50% the density of the SATAs. Note that this only really makes a difference to theoretical sequential speeds; if the seeks are random the SCSI drives could easily get there 50% faster (lower rotational latency and they certainly will have better actuators for the heads). Individual 15K SCSIs will trounce 7.2K SATAs in terms of i/os per second.

true, but with 3x as many drives (and 4x the capacity per drive) the SATA system will have to do far less seeking.

for that matter, with 20ish 320G drives, how large would a partition be that only used the outer physical track of each drive? (almost certainly multiple logical tracks) if you took the time to set this up you could eliminate seeking entirely (at the cost of not using your capacity, but since you are considering a 12x range in capacity, it's obviously not your primary concern)

If you care about how often you'll have to replace a failed drive, then the SCSI option no question, although check the cases for hot-swappability.

note that the CMU and Google studies both commented on being surprised at the lack of difference between the reliability of SCSI and SATA drives.

David Lang
Re: [PERFORM] SCSI vs SATA
On Wed, 4 Apr 2007, Peter Kovacs wrote:

But if an individual disk fails in a disk array, sooner rather than later you would want to purchase a new fitting disk, walk/drive to the location of the disk array, replace the broken disk in the array and activate the new disk. Is this correct?

correct, but more drives also give you the chance to do multiple parity arrays so that you can lose more drives before you lose data. see the thread titled 'Sunfire X4500 recommendations' for some stats on how likely you are to lose your data in the face of multiple drive failures. you can actually get much better reliability than RAID 10.

David Lang
Re: [PERFORM] SCSI vs SATA
On Thu, 5 Apr 2007, [EMAIL PROTECTED] wrote:

I'm curious to know why you're on xfs (I've been too chicken to stray from ext3).

better support for large files (although postgres does tend to try and keep the file size down by going with multiple files) and also for more files. the multiple levels of indirection that ext3 uses for accessing large files (or large directories) can really slow things down, just from the overhead of looking up the metadata (including finding where the actual data blocks are on disk). ext4 is planning to address this and will probably be a _very_ good improvement, but ext3 has very definite limits that it inherited from ext2.

David Lang
Re: [PERFORM] SCSI vs SATA
On Thu, 5 Apr 2007, Ron wrote: At 11:19 AM 4/5/2007, Scott Marlowe wrote: On Thu, 2007-04-05 at 00:32, Tom Lane wrote: "James Mansion" <[EMAIL PROTECTED]> writes:

Right --- the point is not the interface, but whether the drive is built for reliability or to hit a low price point.

Personally I take the marketing mumblings about the enterprise drives with a pinch of salt. The low-price drives HAVE TO be reliable too, because a non-negligible failure rate will result in returns processing costs that destroy a very thin margin.

Reliability is relative. Server-grade drives are built to be beat upon 24x7x365 for the length of their warranty period. Consumer-grade drives are built to be beat upon a few hours a day, a few days a week, for the length of their warranty period. Even if the warranties mention the same number of years, there is a huge difference here.

Just a couple of points... Server drives are generally more tolerant of higher temperatures. I.e. the failure rate for consumer and server class HDs may be about the same at 40 degrees C, but by the time the internal case temps get up to 60-70 degrees C, the consumer grade drives will likely be failing at a much higher rate, whether they're working hard or not.

Exactly correct.

Which brings up my next point: I'd rather have 36 consumer grade drives in a case that moves a LOT of air and keeps the drive bays cool than 12 server class drives in a case that has mediocre / poor air flow in it.

Also exactly correct. High temperatures or unclean power issues age HDs faster than any other factors.

this I agree with, however I believe that this is _so_ much of a factor that it swamps any difference there may be between 'enterprise' and 'consumer' drives.

Until the drives have been burnt in and proven reliable, just assume that they could all fail at any time and act accordingly.

Yep. Folks should google "bath tub curve of statistical failure" or similar. Basically, always burn in your drives for at least 1/2 a day before using them in a production or mission critical role.

for this and your first point, please go and look at the google and cmu studies. unless the vendors did the burn-in before delivering the drives to the sites that installed them, there was no 'infant mortality' spike on the drives (both studies commented on this, they expected to find one).

David Lang
Re: [PERFORM] a question about Direct I/O and double buffering
On Thu, 5 Apr 2007, Xiaoning Ding wrote:

To the best of my knowledge, Postgres itself does not have a direct IO option (although it would be a good addition). So, in order to use direct IO with postgres you'll need to consult your filesystem docs for how to set the forcedirectio mount option. I believe it can be set dynamically, but if you want it to be permanent you'll have to add it to your fstab/vfstab file.

I use Linux. It supports direct I/O on a per-file basis only. To bypass the OS buffer cache, files should be opened with the O_DIRECT option. I'm afraid that I have to modify PG.

as someone who has been reading the linux-kernel mailing list for 10 years, let me comment on this a bit.

linux does have a direct i/o option, but it has significant limits on when and how you can use it (buffers must be 512-byte aligned and multiples of 512 bytes, things like that). Also, in many cases testing has shown that there is a fairly significant performance hit for this, not a performance gain.

what I think postgres really needs is to add support for write barriers (telling the OS to make sure that everything before the barrier is written to disk before anything after the barrier). I believe that these are available on SCSI drives, and on some SATA drives. this sort of support, along with appropriate async I/O support (which is probably going to end up being the 'syslets' or 'threadlets' stuff that's in the early experimental stage, rather than the current AIO API) has the potential to be a noticeable improvement.

if you haven't followed the syslets discussion on the kernel list: threadlets are an approach that basically lets you turn any syscall into an async interface (if the call doesn't block on anything you get the answer back immediately, if it does block it gets turned into an async call by the kernel). syslets are a way to combine multiple syscalls into a single call, avoiding the user->system->user calling overhead for the additional calls. (it's also viewed as a way to do prototyping of possible new calls; if a sequence of syscalls ends up being common enough the kernel devs will look at making a new, combined syscall -- for example lock, write, unlock could be made into one if it's common enough and there's enough of a performance gain)

David Lang
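To make the alignment constraints above concrete, here is a minimal Linux-only sketch (not from the thread; the path is just a placeholder): the file is opened with O_DIRECT and the read buffer comes from an anonymous mmap, which is page-aligned and a multiple of 512 bytes, so the kernel accepts it where a plain bytearray would typically fail with EINVAL.

    import os, mmap

    BLOCK = 4096                                 # a multiple of the 512-byte sector size
    fd = os.open("/tmp/datafile", os.O_RDONLY | os.O_DIRECT)   # placeholder path
    buf = mmap.mmap(-1, BLOCK)                   # anonymous mmap => page-aligned buffer
    nread = os.readv(fd, [buf])                  # offset and length are both aligned
    os.close(fd)
    print(nread, "bytes read, bypassing the page cache")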
Re: [PERFORM] a question about Direct I/O and double buffering
On Thu, 5 Apr 2007, Xiaoning Ding wrote:

Xiaoning: Looks like it. I just did a cursory search of the archives and it seems that others have looked at this before, so you'll probably want to start there if you're up to it. Linux used to have (still does?) a RAW interface which might also be useful. I think the original code was contributed by Oracle so they could support direct IO. Alex

I am more concerned with reads, and how to do direct I/O under Linux here. Reading raw devices in Linux bypasses the OS buffer cache.

it also bypasses OS readahead, which is not necessarily a win.

But how can you mount a raw device (it is a character device) as a file system?

you can do a mkfs on /dev/hda just like you do on /dev/hda2 and then mount the result as a filesystem. Postgres wants the OS layer to provide the filesystem; Oracle implements its own filesystem, so you would just point it at the drive/partition and it would do its own 'formatting'.

this is something that may be reasonable for postgres to consider doing someday: since postgres allocates things into 1M files and then keeps track of what filename is used for what, it could instead allocate things in 1M (or whatever size) chunks on the disk, and just keep track of what addresses are used for what instead of filenames. this would definitely allow you to work around problems like the ext2/3 indirect lookup problems.

now that there is the ability to have partitioned tablespaces, it would be an interesting experiment to be able to define a tablespace that used a raw device instead of a filesystem, to see if there are any noticeable performance gains.

David Lang
Re: [PERFORM] SCSI vs SATA
On Thu, 5 Apr 2007, Scott Marlowe wrote: On Thu, 2007-04-05 at 14:30, James Mansion wrote: Server drives are generally more tolerant of higher temperatures. I.e. the failure rate for consumer and server class HDs may be about the same at 40 degrees C, but by the time the internal case temps get up to 60-70 degrees C, the consumer grade drives will likely be failing at a much higher rate, whether they're working hard or not. Can you cite any statistical evidence for this? Logic? Mechanical devices have decreasing MTBF when run in hotter environments, often at non-linear rates. this I will agree with. Server class drives are designed with a longer lifespan in mind. Server class hard drives are rated at higher temperatures than desktop drives. these two I question. David Lang
Re: [PERFORM] SCSI vs SATA
On Thu, 5 Apr 2007, Ron wrote: At 10:07 PM 4/5/2007, [EMAIL PROTECTED] wrote: On Thu, 5 Apr 2007, Scott Marlowe wrote:

Server class drives are designed with a longer lifespan in mind. Server class hard drives are rated at higher temperatures than desktop drives.

these two I question. David Lang

Both statements are the literal truth. Not that I would suggest abusing your server class HDs just because they are designed to live longer and in more demanding environments. Overheating, nasty electrical phenomena, and abusive physical shocks will trash a server class HD almost as fast as they will a consumer grade one. The big difference between the two is that a server class HD can sit in a rack with literally 100's of its brothers around it, cranking away on server class workloads 24x7 in a constant vibration environment (fans, other HDs, NOC cooling systems) and be quite happy, while a consumer HD will suffer greatly shortened life and die a horrible death in such an environment and under such use.

Ron, I know that the drive manufacturers have been claiming this, but I'll say that my experience doesn't show a difference and neither do the google and CMU studies (and they were all in large datacenters, some HPC labs, some commercial companies).

again, the studies showed _no_ noticeable difference between the 'enterprise' SCSI drives and the 'consumer' SATA drives.

David Lang
Re: [PERFORM] a question about Direct I/O and double buffering
On Thu, 5 Apr 2007, Erik Jones wrote: On Apr 5, 2007, at 3:33 PM, [EMAIL PROTECTED] wrote: On Thu, 5 Apr 2007, Xiaoning Ding wrote:

To the best of my knowledge, Postgres itself does not have a direct IO option (although it would be a good addition). So, in order to use direct IO with postgres you'll need to consult your filesystem docs for how to set the forcedirectio mount option. I believe it can be set dynamically, but if you want it to be permanent you'll have to add it to your fstab/vfstab file.

I use Linux. It supports direct I/O on a per-file basis only. To bypass the OS buffer cache, files should be opened with the O_DIRECT option. I'm afraid that I have to modify PG.

as someone who has been reading the linux-kernel mailing list for 10 years, let me comment on this a bit. linux does have a direct i/o option,

Yes, I know applications can request direct i/o with the O_DIRECT flag to open(), but can this be set to be forced for all applications or for individual applications from "outside" the application (not that I've ever heard of something like the second)?

no it can't, due to the fact that direct i/o has additional requirements for what you can use for buffers that don't apply to normal i/o.

but it has significant limits on when and how you can use it (buffers must be 512-byte aligned and multiples of 512 bytes, things like that).

That's a standard limit imposed by the sector size of hard drives, and is present in all direct i/o implementations, not just Linux.

right, but you don't have those limits for normal i/o.

Also, in many cases testing has shown that there is a fairly significant performance hit for this, not a performance gain.

Have those performance hits been noticed for high i/o transaction databases? The idea here is that these kinds of database manage their own caches, and having a separate filesystem cache in virtual memory that works with system memory page sizes is an unneeded level of indirection.

ahh, you're proposing a re-think of how postgres interacts with the OS, not just an optimization to be applied to the current architecture. unlike Oracle, Postgres doesn't try to be an OS itself; it tries very hard to rely on the OS to properly implement things rather than doing its own implementation.

Yes, you should expect other "normal" utilities to suffer a performance hit, since if you are trying to cp a 500-byte file you'll still have to work with 8K writes and reads, whereas with the filesystem cache you can just write/read part of a page in memory and let the cache decide when it needs to write and read from disk. If there are other caveats to direct i/o on Linux I'd love to hear them.

other than bad interactions with "normal" utilities not compiled for direct i/o, I don't remember them offhand.

David Lang
Re: [PERFORM] SCSI vs SATA
On Fri, 6 Apr 2007, Ron wrote:

Bear in mind that Google was and is notorious for pushing their environmental factors to the limit while using the cheapest "PoS" HW they can get their hands on. Let's just say I'm fairly sure every piece of HW they were using for those studies was operating outside of manufacturer's suggested specifications.

Ron, please go read both the studies -- unless you want to say that every organization CMU picked to study abused their hardware as well.

Under such conditions the environmental factors are so deleterious that they swamp any other effect. OTOH, I've spent my career being as careful as possible to as much as possible run HW within manufacturer's suggested specifications. I've been chided for it over the years... ...usually by folks who "save" money by buying commodity HDs for big RAID farms in NOCs or push their environmental envelope or push their usage envelope or ... ...and then act surprised when they have so much more down time and HW replacements than I do. All I can tell you is that I've gotten to eat my holiday dinner far more often than my counterparts who push it in that fashion. OTOH, there are crises like the Power Outage of 2003 in the NE USA where some places had such Bad Things happen that it simply doesn't matter what you bought (power dies, generator cuts in, power comes on, but AC units crash, temperatures shoot up so fast that by the time everything is re-shutdown it's in the 100F range in the NOC. Lots 'o stuff dies on the spot + spend the next 6 months having HW failures at +considerably+ higher rates than historical norms. Ick..) IME, it really does make a difference =if you pay attention to the difference in the first place=. If you treat everything equally poorly, then you should not be surprised when everything acts equally poorly. But hey, YMMV. Cheers, Ron Peacetree
Re: [PERFORM] SCSI vs SATA
On Fri, 6 Apr 2007, Scott Marlowe wrote: Based on experience I think that on average server drives are more reliable than consumer grade drives, and can take more punishment. this I am not sure about But, the variables of manufacturer, model, and the batch often make even more difference than grade. this I will agree with fully. David Lang
Re: [PERFORM] SCSI vs SATA
On Fri, 6 Apr 2007, Charles Sprickman wrote: On Fri, 6 Apr 2007, [EMAIL PROTECTED] wrote: On Fri, 6 Apr 2007, Scott Marlowe wrote: Based on experience I think that on average server drives are more reliable than consumer grade drives, and can take more punishment. this I am not sure about

I think they should survey Tivo owners next time. Perfect stress-testing environment. Mine runs at over 50C most of the time, and it's writing 2 video streams 24/7. What more could you do to punish a drive? :)

and the drives that are in them are consumer IDE drives. I will admit that I've removed the cover from my Tivo to allow it to run cooler, and I'm still on the original drive + the 100G drive I purchased way back when (7+ years ago). before I removed the cover I did have times when the Tivo would die from the heat (Los Angeles area in the summer with no A/C).

David Lang
Re: [PERFORM] SCSI vs SATA
... 'cheap' drives, my reaction is that single redundancy isn't good enough. depending on how valuable the data is, you need to either replicate the data to another system, or go with dual-parity redundancy (or both).

while drives probably won't be this bad in real life (this is, after all, slightly worse than the studies show for their 3rd-year drives, and 'enterprise' drives may be slightly better), I have to assume that they will be for my reliability planning.

also, if you read through the cmu study, drive failures were only a small percentage of system outages (16-25% depending on the site). you have to make sure that you aren't so fixated on drive reliability that you fail to account for other types of problems (down to and including the chance of someone accidentally powering down the rack that you are plugged into, be it from hitting a power switch, to overloading a weak circuit breaker).

In looking at these problems overall I find that in most cases I need to have redundant systems with the data replicated anyway (with logs sent elsewhere), so I can get away with building failover pairs instead of having each machine with redundant drives. I've found that I can frequently get a pair of machines for less money than other departments spend on buying a single 'enterprise' machine with the same specs (although the prices are dropping enough on the top-tier manufacturers that this is less true today than it was a couple of years ago), and I find that the failure rate is about the same on a per-machine basis, so I end up with a much better uptime record due to having the redundancy of the second full system (never mind things like it being easier to do upgrades, as I can work on the inactive machine and then fail over to work on the other, now inactive, machine).

while I could ask for the budget to be doubled to provide the same redundancy with the top-tier manufacturers, I don't do so for several reasons, the top two being that these manufacturers frequently won't configure a machine the way I want them to (just try to get a box with writeable media built in, either a floppy or a CDR/DVDR -- they want you to use something external), and doing so also exposes me to people second-guessing me on where redundancy is needed ('that's only development, we don't need redundancy there', until a system goes down for a day and the entire department is unable to work).

it's not that the people who disagree with you don't care about their data, it's that they have different experiences than you do (experiences that come close to matching the studies where they tracked hundreds of thousands of drives of different types), and as a result believe that the difference (if any) between the different types of drives isn't significant in the overall failure rate (especially when you take the difference of drive capacity into account).

David Lang

P.S. here is a chart from that thread showing the chances of losing data with different array configurations.
if you say that there is a 10% chance of a disk failing each year (significantly higher than the studies listed above, but close enough) then this works out to ~0.001% chance of a drive failing per hour (a reasonably round number to work with).

to write 750G at ~45MB/sec takes 5 hours of 100% system throughput, or ~50 hours at 10% of the system throughput (background rebuilding). if we cut the rebuild rate in half to account for inefficiencies in retrieving data from other disks to calculate parity, it can take 100 hours (just over four days) to do a background rebuild, or about a 0.1% chance for each disk of losing a second disk during the rebuild. with 48 drives this is a ~5% chance of losing everything with single parity; however, the odds of losing two disks during this time are 0.25%, so double parity is _well_ worth it.

chance of losing data before the hot spare is finished rebuilding (assumes one hot spare per group; you may be able to share a hot spare between multiple groups to get slightly higher capacity):

RAID 60 or Z2 -- Double-parity, must lose 3 disks from the same group to lose data:

    disks_per_group  num_groups  total_disks  usable_disks  risk_of_data_loss
                  2          24           48           n/a  n/a
                  3          16           48           n/a  (0.0001% with manual replacement of drive)
                  4          12           48            12  0.0009%
                  6           8           48            24  0.003%
                  8           6           48            30  0.006%
                 12           4           48            36  0.02%
                 16           3           48            39  0.03%
                 24           2           48            42  0.06%
                 48           1           48            45  0.25%

RAID 10 or 50 -- Mirroring or single-parity, must lose 2 disks from the same group to lose data: ...
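For reference, a quick sketch (my own; it follows the assumptions spelled out above, and the rounding differs slightly from the figures in the post) that redoes the rebuild-window arithmetic:

    from math import comb

    p_hour    = 0.10 / (365 * 24)                  # assumed 10%/year per-disk failure rate
    rebuild_h = 2 * (750e9 / 45e6) / 0.10 / 3600   # 10% of 45MB/s, 2x inefficiency: ~93 hours
    p_disk    = p_hour * rebuild_h                 # ~0.1% per surviving disk

    survivors  = 47
    p_one_more = 1 - (1 - p_disk) ** survivors     # ~5%: single parity has lost data
    p_two_more = comb(survivors, 2) * p_disk ** 2  # ~0.1%: dual parity has lost data
    print("%.3f%% per disk, %.1f%% one more, %.3f%% two more"
          % (100 * p_disk, 100 * p_one_more, 100 * p_two_more))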
Re: [PERFORM] SCSI vs SATA
... opinion (no matter who it's from) isn't as good as a formal study. It's also insulting to everyone who puts in the time and effort to be a professional within a field rather than a lay person.

it's also insulting to assume (or appear to assume) that everyone who disagrees with you is a lay person. you may not have meant it (this is e-mail after all, with all the problems that come from that), but this is what you seem to have been implying, if not outright saying.

Whether there's a name for it or not, there's definitely an important distinction between each of anecdote, professional opinion, and study result.

the line between an anecdote and a professional opinion is pretty blurry, and hard to see without wasting a lot of time getting everyone to give their credentials, etc. if a professional doesn't spend enough time thinking about some of the details (i.e. how many drive failures of each type have I seen in the last 5 years, as opposed to in the 5-year timeframe from 1980-1985) they can end up giving an opinion that's in the range of reliability and relevance that anecdotes are.

don't assume malice so quickly.

David Lang
Re: [PERFORM] Feature Request --- was: PostgreSQL Performance Tuning
On Fri, 27 Apr 2007, Josh Berkus wrote:

Dan, Yes, this is the classic problem. I'm not demanding anyone pick up the ball and jump on this today, tomorrow, etc.. I just think it would be good for those who *could* make a difference to keep those goals in mind when they continue. If you have the right mindset, this problem will fix itself over time.

Don't I wish. Autotuning is *hard*. It took Oracle 6 years. It took Microsoft 3-4 years, and theirs still has major issues last I checked. And both of those DBs support fewer OSes than we do. I think it's going to take more than the *right mindset* and my spare time.

I think there are a couple of different things here.

1. full autotuning. as you say, this is very hard and needs a lot of info about your particular database usage.

2. getting defaults that are closer to right than the current ones. this is much easier. for this nobody is expecting that the values are right, we're just begging for some tool to get us within a couple of orders of magnitude of what's correct.

the current defaults are appropriate for a single CPU with 10's of MB of ram and a single drive. nowadays you have people trying to run quick-and-dirty tests on some spare hardware they have lying around (waiting for another project) that's got 4-8 CPUs with 10's of GB of ram and a couple dozen drives.

these people don't know about database tuning; they can learn, but they want to see if postgres is even in the ballpark. if the results are close to acceptable they will ask questions and research the tuning, but if the results are orders of magnitude lower than they need to be they'll just say that postgres is too slow and try another database.

an autodefault script that was written assuming that postgres has the box to itself would be a wonderful start. I think the next step would be to be able to tell the script 'only plan on using 1/2 of this box', and beyond that would be the steps that you are thinking of, where the usage pattern is considered.

but when every performance question is answered with "did you change the defaults? they are way too low for modern hardware, raise them by 2 orders of magnitude and then we'll start investigating"

David Lang
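As a sketch of what such an autodefault script could look like (a rough sketch of my own, not an existing tool): read total RAM from /proc/meminfo and print ballpark values, assuming postgres has the box to itself. The 25%/75% fractions and the work_mem cap are common rules of thumb, not anything blessed by the project.

    # Linux-only sketch: emit ballpark postgresql.conf values from total RAM.
    def total_ram_bytes():
        with open("/proc/meminfo") as f:
            for line in f:
                if line.startswith("MemTotal:"):
                    return int(line.split()[1]) * 1024   # /proc/meminfo reports kB
        raise RuntimeError("MemTotal not found")

    MB = 1024 * 1024
    ram = total_ram_bytes()
    print("shared_buffers = %dMB" % (ram // 4 // MB))            # ~25% of RAM
    print("effective_cache_size = %dMB" % (ram * 3 // 4 // MB))  # ~75% of RAM
    print("work_mem = %dMB" % max(4, ram // 100 // MB))          # crude per-sort ceiling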
Re: [PERFORM] Feature Request --- was: PostgreSQL Performance Tuning
On Tue, 1 May 2007, Carlos Moreno wrote:

large problem from a slog perspective; there is no standard way even within Linux to describe CPUs, for example. Collecting available disk space information is even worse. So I'd like some help on this portion.

Quite likely, naiveness follows... But aren't things like /proc/cpuinfo, /proc/meminfo, /proc/partitions / /proc/diskstats standard, at the very least across Linux distros? I'm not familiar with BSD or other Unix flavours, but I would expect these (or their equivalent) to exist in those, no? Am I just being naive?

unfortunately yes. across different linux distros they are fairly standard (however different kernel versions will change them), but different kernels need drastically different tools to get the info from them.

David Lang
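For what it's worth, a small sketch (mine) of pulling the basics out of those Linux /proc files -- exactly the part that would need a separate implementation on BSD or the other Unix flavours:

    def cpu_models():
        with open("/proc/cpuinfo") as f:
            return [l.split(":", 1)[1].strip() for l in f if l.startswith("model name")]

    def block_devices():
        devs = []
        with open("/proc/partitions") as f:
            for line in f:
                fields = line.split()
                if len(fields) == 4 and fields[0].isdigit():   # skip header and blank lines
                    devs.append((fields[3], int(fields[2])))   # (name, size in 1K blocks)
        return devs

    models = cpu_models()
    print(len(models), "logical CPUs:", models[0] if models else "unknown")
    print(block_devices())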
Re: [PERFORM] Feature Request --- was: PostgreSQL Performance Tuning
On Tue, 1 May 2007, Greg Smith wrote: On Tue, 1 May 2007, Josh Berkus wrote: there is no standard way even within Linux to describe CPUs, for example. Collecting available disk space information is even worse. So I'd like some help on this portion.

what type of description of the CPUs are you looking for?

It's also hard/impossible to devise tuning algorithms that work for both gross tuning (increase shared_buffers by 100x) and fine tuning (decrease bgwriter_interval to 45ms). I would advocate focusing on iterative improvements to an existing configuration rather than even bothering with generating a one-off config for exactly this reason. It *is* hard/impossible to get it right in a single shot, because of how many parameters interact and the way bottlenecks clear, so why not assume from the start you're going to do it several times--then you've only got one piece of software to write.

nobody is asking for things to be right the first time.

The idea I have in my head is a tool that gathers system info, connects to the database, and then spits out recommendations in order of expected effectiveness--with the specific caveat that changing too many things at one time isn't recommended, and some notion of parameter dependencies. The first time you run it, you'd be told that shared_buffers was wildly low, effective_cache_size isn't even in the right ballpark, and your work_mem looks small relative to the size of your tables; fix those before you bother doing anything else because any data collected with those at very wrong values is bogus.

why not have a much simpler script that gets these values up into the right ballpark first? then after that the process and analysis that you are suggesting would be useful. the problem is that the defaults are _so_ far off that no sane incremental program is going to be able to converge on the right answer rapidly.

David Lang

Take two, those parameters pass their sanity tests, but since you're actually running at a reasonable speed now the fact that your tables are no longer being vacuumed frequently enough might bubble to the top. It would take a few passes through to nail down everything, but as long as it's put together such that you'd be in a similar position to the single-shot tool after running it once, it would remove that as something separate that needed to be built. To argue against myself for a second, it may very well be the case that writing the simpler tool is the only way to get a useful prototype for building the more complicated one; very easy to get bogged down in feature creep on a grand design otherwise.
Re: [PERFORM] Feature Request --- was: PostgreSQL Performance Tuning
On Thu, 3 May 2007, Josh Berkus wrote:

Greg, I'm not fooled--secretly you and your co-workers laugh at how easy this is on Solaris and are perfectly happy with how difficult it is on Linux, right? Don't I wish. There's issues with getting CPU info on Solaris, too, if you get off of Sun hardware to generic white boxes. The base issue is that there's no standardization on how manufacturers report the names of their CPUs, 32/64-bit, or clock speeds. So any attempt to determine "how fast" a CPU is, even on a 1-5 scale, requires matching against a database of regexes which would have to be kept updated. And let's not even get started on Windows.

I think the only sane way to try and find the CPU speed is to just do a busy loop of some sort (ideally something that somewhat resembles the main code) and see how long it takes. you may have to do this a few times until you get a loop that takes long enough (a few seconds) on a fast processor.

David Lang
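A minimal sketch (assumptions mine) of that calibration loop: keep doubling the iteration count until one run takes a couple of seconds of wall-clock time, then report iterations per second as a crude speed score.

    import time

    def busy(n):
        acc = 0
        for i in range(n):
            acc += i * i % 7          # trivial stand-in for "something resembling the main code"
        return acc

    n, elapsed = 1 << 16, 0.0
    while elapsed < 2.0:              # grow the loop until a run takes ~2 seconds
        n *= 2
        start = time.time()
        busy(n)
        elapsed = time.time() - start
    print("%.0f iterations/sec" % (n / elapsed))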
Re: [PERFORM] Feature Request --- was: PostgreSQL Performance Tuning
On Thu, 3 May 2007, Carlos Moreno wrote:

CPUs, 32/64-bit, or clock speeds. So any attempt to determine "how fast" a CPU is, even on a 1-5 scale, requires matching against a database of regexes which would have to be kept updated. And let's not even get started on Windows.

I think the only sane way to try and find the CPU speed is to just do a busy loop of some sort (ideally something that somewhat resembles the main code) and see how long it takes. you may have to do this a few times until you get a loop that takes long enough (a few seconds) on a fast processor.

I was going to suggest just that (but then was afraid that again I may have been just being naive) --- I can't remember the exact name, but I remember using (on some Linux flavor) an API call that fills a struct with data on the resource usage for the process, including CPU time; I assume measured with precision (that is, immune to issues of other applications running simultaneously, or other random events causing the measurement to be polluted by random noise).

since what we are looking for here is a reasonable first approximation, not perfection, I don't think we should worry much about pollution of the value. if the person has other things running while they run this test that will also be running when they run the database, it's no longer 'pollution', it's part of the environment. I think a message at runtime that it may produce inaccurate results if you have other heavy processes running that won't be running with the database would be good enough (remember it's not only CPU time that's affected like this, it's disk performance as well).

As for 32/64-bit --- doesn't PG already know that information? I mean, ./configure does gather that information --- does it not?

we're not talking about compiling PG, we're talking about getting sane defaults for a pre-compiled binary. if it's a 32-bit binary assume a 32-bit CPU, if it's a 64-bit binary assume a 64-bit CPU (all hardcoded into the binary at compile time).

David Lang
Re: [PERFORM] Feature Request --- was: PostgreSQL Performance Tuning
On Thu, 3 May 2007, Carlos Moreno wrote:

been just being naive) --- I can't remember the exact name, but I remember using (on some Linux flavor) an API call that fills a struct with data on the resource usage for the process, including CPU time; I assume measured with precision (that is, immune to issues of other applications running simultaneously, or other random events causing the measurement to be polluted by random noise).

since what we are looking for here is a reasonable first approximation, not perfection, I don't think we should worry much about pollution of the value.

Well, it's not as much worrying as it is choosing the better among two equally difficult options --- what I mean is that obtaining the *real* resource usage as reported by the kernel is, from what I remember, equally hard as it is obtaining the time with milli- or micro-second resolution. So, why not choose this option? (in fact, if we wanted to do it "the scripted way", I guess we could still use "time test_cpuspeed_loop" and read the report by the command time, specifying CPU time and system calls time.)

I don't think it's that hard to get system time to a reasonable level (if this config tuner needs to run for a minute or two to generate numbers, that's acceptable, it's only run once), but I don't think that the results are really that critical.

do we really care if the loop runs 1,000,000 times per second or 1,001,000 times per second? I'd argue that we don't even care about 1,000,000 times per second vs 1,100,000 times per second; what we care about is 1,000,000 times per second vs 100,000 times per second. if you do a 10 second test and run it for 11 seconds you are still in the right ballpark (i.e. close enough that you really need to move to the stage-2 tuning to figure the exact values).

As for 32/64-bit --- doesn't PG already know that information? I mean, ./configure does gather that information --- does it not?

we're not talking about compiling PG, we're talking about getting sane defaults for a pre-compiled binary. if it's a 32-bit binary assume a 32-bit CPU, if it's a 64-bit binary assume a 64-bit CPU (all hardcoded into the binary at compile time).

Right --- I was thinking that configure, which as I understand generates the Makefiles to compile applications including initdb, could plug those values in as compile-time constants, so that initdb (or a hypothetical additional utility that would do what we're discussing in this thread) already has them. Anyway, yes, that would go for the binaries as well --- we're pretty much saying the same thing :-)

I'm thinking along the lines of a script or pre-compiled binary (_not_ initdb) that you could run and have it generate a new config file that has values that are within about an order of magnitude of being correct.

David Lang
Re: [PERFORM] Feature Request --- was: PostgreSQL Performance Tuning
On Thu, 3 May 2007, Carlos Moreno wrote:

I don't think it's that hard to get system time to a reasonable level (if this config tuner needs to run for a minute or two to generate numbers, that's acceptable, it's only run once), but I don't think that the results are really that critical.

Still --- this does not provide a valid argument against my claim. Ok, we don't need precision --- but do we *need* to have less precision?? I mean, you seem to be proposing that we deliberately go out of our way to discard a solution with higher precision and choose the one with lower precision --- just because we do not have a critical requirement for the extra precision. That would be a valid argument if the extra precision came at a considerable cost (well, or at whatever cost, considerable or not).

the cost I am seeing is the cost of portability (getting similarly accurate info from all the different operating systems).

But my point is still that obtaining the time in the right ballpark and obtaining the time with good precision are two things that have, from any conceivable point of view (programming effort, resources consumption when executing it, etc. etc.), the exact same cost --- why not pick the one that gives us the better results? Mostly when you consider that:

I'd argue that we don't even care about 1,000,000 times per second vs 1,100,000 times per second, what we care about is 1,000,000 times per second vs 100,000 times per second

Part of my claim is that measuring real time you could get an error like this or even a hundred times this!! Most of the time you wouldn't, and definitely if the user is careful it would not happen --- but it *could* happen!!! (and when I say could, I really mean: trust me, I have actually seen it happen)

if you have errors of several orders of magnitude in the number of loops it can run in a given time period then you don't have something that you can measure to any accuracy (and it wouldn't matter anyway; if your loops are that variable, your code execution would be as well).

Why not just use an *extremely simple* solution that is getting information from the kernel reporting the actual CPU time that has been used??? Of course, this goes under the premise that in all platforms there is such a simple solution like there is on Linux (the exact name of the API function still eludes me, but I have used it in the past, and I recall that it was just three or five lines of code).

I think the problem is that it's a _different_ 3-5 lines of code for each OS. if I'm wrong and it's the same for the different operating systems then I agree that we should use the most accurate clock we can get. I just don't think we have that.

David Lang
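For what it's worth, the kernel interface being half-remembered above is presumably getrusage(); here is a short sketch (mine) showing it next to the wall-clock measurement, so the difference under outside load is visible. Python exposes it through the resource module on the Unix-like systems, which is at least a partial answer to the portability worry.

    import resource, time

    def busy(n):
        acc = 0
        for i in range(n):
            acc += i * i % 7
        return acc

    r0, w0 = resource.getrusage(resource.RUSAGE_SELF), time.time()
    busy(5_000_000)
    r1, w1 = resource.getrusage(resource.RUSAGE_SELF), time.time()

    cpu = (r1.ru_utime - r0.ru_utime) + (r1.ru_stime - r0.ru_stime)
    print("cpu time %.3fs, wall time %.3fs" % (cpu, w1 - w0))   # these diverge under load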
Re: [PERFORM] Feature Request --- was: PostgreSQL Performance Tuning
On Thu, 3 May 2007, Carlos Moreno wrote:

Part of my claim is that measuring real time you could get an error like this or even a hundred times this!! Most of the time you wouldn't, and definitely if the user is careful it would not happen --- but it *could* happen!!! (and when I say could, I really mean: trust me, I have actually seen it happen)

if you have errors of several orders of magnitude in the number of loops it can run in a given time period then you don't have something that you can measure to any accuracy (and it wouldn't matter anyway; if your loops are that variable, your code execution would be as well)

Not necessarily --- operating conditions may change drastically from one second to the next; that does not mean that your system is useless; simply that the measuring mechanism is way too vulnerable to the particular operating conditions at the exact moment it was executed. I'm not sure if that was intentional, but you bring up an interesting issue --- or in any case, your comment made me drastically re-think my whole argument: do we *want* to measure the exact speed, or rather the effective speed under normal operating conditions on the target machine? I know the latter is almost impossible --- we're talking about an estimate of a random process' parameter (and we need to do it in a short period of time) ... But the argument goes more or less like this: if you have a machine that runs at 1000 MIPS, but it's usually busy running things that on average consume 500 of those 1000 MIPS, would we want PG's configuration file to be obtained based on 1000 or based on 500 MIPS??? After all, the CPU is, as far as PostgreSQL will be able to see, 500 MIPS fast, *not* 1000. I think I better stop, if we want to have any hope that the PG team will ever actually implement this feature (or similar) ... We're probably just scaring them!! :-)

simpler is better (or: perfect is the enemy of good enough). if you do your sample over a few seconds (or a few tens of seconds) things will average out quite a bit. the key is to be going for a reasonable starting point. after that the full analysis folks can start in with all their monitoring and tuning, but the 80/20 rule really applies here: 80% of the gain is from getting 'fairly close' to the right values, and that should only be 20% of the full 'tuning project'.

David Lang
Re: [PERFORM] Best OS for Postgres 8.2
On Tue, 8 May 2007, Chris wrote: David Levy wrote: Hi, I am about to order a new server for my Postgres cluster. I will probably get a Dual Xeon Quad Core instead of my current Dual Xeon. Which OS would you recommend to optimize Postgres behaviour (i/o access, multithreading, etc)? I am hesitating between Fedora Core 6, CentOS and Debian. Can anyone help with this?

Use the one you're most comfortable with. I don't think you'll notice *that* much difference between linux systems for performance - but whether you're comfortable using any of them will make a difference in managing it in general.

the tuning that you do (both of the OS and of postgres) will make more of a difference than anything else.

David Lang
Re: [PERFORM] Best OS for Postgres 8.2
On Tue, 8 May 2007, Ian Li wrote:

In #postgresql on freenode, somebody once mentioned that ZFS from Solaris helps a lot with the performance of pgsql, so does anyone have information about that?

the filesystem you use will affect the performance of postgres significantly. I've heard a lot of claims for ZFS, unfortunately many of them from people who have proven that they didn't know what they were talking about by the end of their first or second e-mails.

much of the hype for ZFS is its volume management capabilities and admin tools. Linux has most (if not all) of the volume management capabilities; it just separates them from the filesystems so that any filesystem can use them, and as a result you use one tool to set up your RAID, one to set up snapshots, and a third to format your filesystems, where ZFS does this in one userspace tool.

once you separate the volume management piece out, the actual performance question is a lot harder to answer. there are a lot of people who say that it's far faster than the alternative filesystems on Solaris, but I haven't seen any good comparisons between it and Linux filesystems.

On Linux you have the choice of several filesystems, and the performance will vary wildly depending on your workload. I personally tend to favor ext2 (for small filesystems where the application is ensuring data integrity) or XFS (for large filesystems). I personally don't trust reiserfs; jfs seems to be a tool for transitioning from AIX more than anything else; and ext3 seems to have all the scaling issues of ext2 plus the overhead (and bottleneck) of journaling.

one issue with journaling filesystems: if you journal the data as well as the metadata you end up with a very reliable setup, however it means that all your data needs to be written twice, once to the journal, and once to the final location. the write to the journal can be slightly faster than a normal write to the final location (the journal is a sequential write to an existing file), however the need to write twice can effectively cut your disk I/O bandwidth in half when doing heavy writes. worse, when you end up writing more than will fit in the journal (128M is the max for ext3) the entire system then needs to stall while the journal gets cleared to make space for the additional writes.

if you don't journal your data then you avoid the problems above, but in a crash you may find that you lost data, even though the filesystem is 'intact' according to fsck.

David Lang

Steve Atkins wrote: On May 7, 2007, at 2:55 PM, David Levy wrote: Hi, I am about to order a new server for my Postgres cluster. I will probably get a Dual Xeon Quad Core instead of my current Dual Xeon. Which OS would you recommend to optimize Postgres behaviour (i/o access, multithreading, etc)? I am hesitating between Fedora Core 6, CentOS and Debian. Can anyone help with this?

Well, all three you mention are much the same, just with a different badge on the box, as far as performance is concerned. They're all going to be a moderately recent Linux kernel, with your choice of filesystems, so any choice between them is going to be driven more by available staff and support or personal preference. I'd probably go CentOS 5 over Fedora just because Fedora doesn't get supported for very long - more of an issue with a dedicated database box with a long lifespan than your typical desktop or interchangeable webserver. I might also look at Solaris 10, though.
I've yet to play with it much, but it seems nice, and I suspect it might manage 8 cores better than current Linux setups. Cheers, Steve

Regards Ian
Re: [PERFORM] Best OS for Postgres 8.2
On Tue, 8 May 2007, Claus Guttesen wrote: > In #postgresql on freenode, somebody once mentioned that ZFS from > Solaris > helps a lot with the performance of pgsql, so does anyone have information > about that?

the filesystem you use will affect the performance of postgres significantly. I've heard a lot of claims for ZFS, unfortunately many of them from people who have proven that they didn't know what they were talking about by the end of their first or second e-mails. much of the hype for ZFS is its volume management capabilities and admin tools. Linux has most (if not all) of the volume management capabilities, it just separates them from the filesystems so that any filesystem can use them, and as a result you use one tool to set up your RAID, one to set up snapshots, and a third to format your filesystems, where ZFS does this in one userspace tool.

Even though those posters may have proven themselves wrong, zfs is still a very handy fs and it should not be judged relative to these statements.

I don't disagree with you, I'm just noting that too many of the 'ZFS is great' posts need to be discounted as a result (the same thing goes for the 'reiserfs4 is great' posts)

once you separate the volume management piece out, the actual performance question is a lot harder to answer. there are a lot of people who say that it's far faster than the alternative filesystems on Solaris, but I haven't seen any good comparisons between it and Linux filesystems.

One could install pg on solaris 10 and format the data-area as ufs and then as zfs and compare import- and query-times and other benchmarking, but comparing ufs/zfs to Linux-filesystems would also be a comparison of those two os'es.

however, such a comparison is very legitimate, it doesn't really matter which filesystem is better if the OS that it's tied to limits it so much that the other one wins out with an inferior filesystem

currently ZFS is only available on Solaris; parts of it have been released under GPLv2, but it doesn't look like enough of it to be ported to Linux (enough was released for grub to be able to access it read-only, but not the full filesystem). there are also patent concerns that are preventing any porting to Linux. on the other hand, its integrated userspace tools are pushing people to create similar tools for Linux (without needing to combine the various pieces in the kernel)

David Lang
Re: [PERFORM] Best OS for Postgres 8.2
On Tue, 8 May 2007, Trygve Laugstøl wrote: currently ZFS is only available on Solaris; parts of it have been released under GPLv2, but it doesn't look like enough of it to be ported to Linux (enough was released for grub to be able to access it read-only, but not the full filesystem). there are also patent concerns that are preventing any porting to Linux.

This is not entirely correct. ZFS is only under the CDDL license and it has been ported to FreeBSD. http://mail.opensolaris.org/pipermail/zfs-discuss/2007-April/026922.html

I wonder how they handled the license issues? I thought that if you combined stuff that was BSD licensed with stuff with a more restrictive license, the result was under the more restrictive license. thanks for the info. here's a link about the GPLv2 stuff for zfs http://blogs.sun.com/darren/entry/zfs_under_gplv2_already_exists
Re: [PERFORM] Best OS for Postgres 8.2
On Tue, 8 May 2007, Steinar H. Gunderson wrote: On Mon, May 07, 2007 at 03:14:08PM -0700, Joshua D. Drake wrote: It is my understanding (and I certainly could be wrong) that FreeBSD doesn't handle SMP nearly as well as Linux (and Linux not as well as Solaris). I'm not actually sure about the last part. There are installations as big as 1024 CPUs that run Linux -- most people won't need that, but it's probably an indicator that eight cores should run OK :-)

over the weekend the question of scalability was raised on the linux kernel mailing list, and people are shipping 1024 cpu systems with linux and testing 4096 cpu systems. there are occasionally still bottlenecks that limit scalability, but unless you run into a bad driver or filesystem you should have no problems in the 8-16 core range.

any comparison between Linux and any other OS needs to include a date for when the comparison was made; Linux is changing at a frightening pace (I think I saw something within the last few weeks that said that the rate of change for the kernel has averaged around 9000 lines of code per day over the last couple of years), so you need to re-check comparisons every year or two or you end up working with obsolete data.

David Lang
Re: [PERFORM] Throttling PostgreSQL's CPU usage
On Tue, 8 May 2007, Daniel Griscom wrote: I'm building a kiosk with a 3D front end accessing PostGIS/PostgreSQL via Apache/PHP. The 3D display is supposed to show smooth motion from location to location, with PostGIS giving dynamically updated information on the locations. Everything runs on the same machine, and it all works, but when I start a query the 3D display stutters horribly. It looks like PostgreSQL grabs hold of the CPU and doesn't let go until it's completed the query. I don't need the PostgreSQL query to return quickly, but I must retain smooth animation while the query is being processed. In other words, I need PostgreSQL to spread out its CPU usage so that it doesn't monopolize the CPU for any significant time (more than 50ms or so). Possible solutions: 1: Set the PostgreSQL task priority lower than the 3D renderer task, and make sure that the 3D renderer sleep()s enough to let PostgreSQL get its work done. The obvious objection to this obvious solution is "Priority inversion!", but I see that as an additional challenge to be surmounted rather than an absolute prohibition. So, any thoughts on setting the PostgreSQL task priority (including by the much-maligned tool shown at <http://weblog.bignerdranch.com/?p=11>)?

this may or may not help

3: ... some other solution I haven't thought of.

take a look at the scheduler discussion that has been taking place on the linux-kernel list. there are a number of things being discussed specifically to address the type of problems that you are running into (a CPU hog causing latencies for graphics processes). it looks like nothing will go into the 2.6.22 kernel officially, but if you are willing to test the bejeezus out of it before you depend on it, I suspect that either the SD or CFS schedulers will clean things up for you.

David Lang
Re: [PERFORM] What's The Difference Between VACUUM and VACUUM ANALYZE?
On Tue, 8 May 2007, Y Sidhu wrote: I am trying to follow a message thread. One guy says we should be running vacuum analyze daily and the other says we should be running vacuum multiple times a day. I have tried looking for what a vacuum analyze is to help me understand, but no luck.

vacuum frees tuples that are no longer referenced. vacuum analyze does the same thing, but then does some additional information gathering about what data is in the tables. Postgres uses this data to adjust its estimates of how long various things will take (sequential scan, etc). if these estimates are off by a huge amount (especially if you have never done a vacuum analyze after loading your table) then it's very likely that postgres will be slow because it's doing expensive operations that it thinks are cheap.

David Lang
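To make the distinction concrete, here is a minimal sketch (the table name "mytable" is made up, not taken from the thread):

    -- reclaim dead tuples only:
    VACUUM mytable;

    -- reclaim dead tuples and also refresh the planner's statistics:
    VACUUM ANALYZE mytable;

    -- the gathered statistics are visible in pg_stats and feed the
    -- planner's cost estimates:
    SELECT attname, n_distinct, most_common_vals
    FROM pg_stats
    WHERE tablename = 'mytable';

the ANALYZE step after a bulk load is usually the most important part, since without it the planner is working from default assumptions about the table.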
Re: [PERFORM] Throttling PostgreSQL's CPU usage
On Tue, 8 May 2007, Daniel Griscom wrote: Thanks for all the feedback. Unfortunately I didn't specify that this is running on a WinXP machine (the 3D renderer is an ActiveX plugin), and I don't even think "nice" is available. I've tried using the Windows Task Manager to set every postgres.exe process to a low priority, but that didn't make a difference. Several people have mentioned having multiple processors; my current machine is a uni-processor machine, but I believe we could spec the actual runtime machine to have multiple processors/cores. I'm only running one query at a time; would that query be guaranteed to confine itself to a single processor/core? In terms of performance, I don't think simply more power will do the trick; I've got an AMD 3200+, and even doubling the power/halving the stutter time won't be good enough. Someone suggested "setting the CPU costs higher"; where would I learn about that? Someone else mentioned having a custom function that sleep()ed on every row access; where would I learn more about that? I've also been reading up on VACUUM. I haven't explicitly run it in the several days since I've installed the database (by loading a humongous data.sql file); might this be part of the performance problem?

it would cause postgres to work harder than it needs to, but it doesn't solve the problem of postgres eating cpu that you need for your rendering (i.e. it may reduce the stutters, but won't eliminate them). a single query will confine itself to one core, but if you have a vacuum or autovacuum run it will affect the second core. I don't know what you can do on windows beyond this though.

David Lang

P.S. make sure you get real multi-core cpus, hyperthreading is _not_ a second core for this problem.
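If part of the stutter turns out to come from VACUUM itself rather than from the queries, cost-based vacuum delay (available since 8.0) can spread vacuum's work out. It only throttles vacuum, not ordinary queries, so it doesn't address the main problem in this thread; the values below are illustrative and "mytable" is a made-up name:

    -- make a manual VACUUM pause periodically instead of running flat out:
    SET vacuum_cost_delay = 20;   -- sleep 20 ms each time the cost budget is spent
    SET vacuum_cost_limit = 200;  -- cost budget between sleeps
    VACUUM ANALYZE mytable;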
Re: [PERFORM] Throttling PostgreSQL's CPU usage
On Tue, 8 May 2007, Carlos Moreno wrote: Daniel Griscom wrote: Several people have mentioned having multiple processors; my current machine is a uni-processor machine, but I believe we could spec the actual runtime machine to have multiple processors/cores. My estimate is that yes, you should definitely consider that. I'm only running one query at a time; would that query be guaranteed to confine itself to a single processor/core? From what Joshua mentions, looks like you do have that guarantee.

isn't there a way to limit how many processes postgres will create? if this is limited to 1, what happens when a vacuum run hits (or autovacuum)?

David Lang
Re: FW: [PERFORM]
On Tue, 8 May 2007, Orhan Aglagul wrote: No, it is one transaction per insert. -Original Message- From: Scott Marlowe [mailto:[EMAIL PROTECTED] Sent: Tuesday, May 08, 2007 5:38 PM To: Orhan Aglagul Subject: RE: [PERFORM] On Tue, 2007-05-08 at 19:36, Orhan Aglagul wrote: But 10,000 records in 65 sec comes to ~153 records per second. On a dual 3.06 Xeon. What range is acceptable? If you're doing that in one big transaction, that's horrible. Because it shouldn't be waiting for each insert to fsync, but the whole transaction.

with a standard 7200 rpm drive, ~150 transactions/sec sounds about right. to really speed things up you want to get a disk controller with a battery backed cache so that the writes don't need to hit the disk to be safe. that should get your speeds up to (and possibly above) what you got by turning fsync off.

David Lang
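The other cheap win, as Scott implies, is to stop paying for one fsync per row. A rough sketch, with a made-up "events" table:

    -- one transaction per INSERT means every COMMIT waits for its own WAL
    -- flush, which caps you at roughly one commit per disk rotation.
    -- batching many rows into one transaction amortizes that cost:
    BEGIN;
    INSERT INTO events (msg) VALUES ('row 1');
    INSERT INTO events (msg) VALUES ('row 2');
    -- ... thousands more rows ...
    COMMIT;   -- only this point waits for the WAL flush

with a battery-backed write cache the per-commit flush itself becomes cheap, so the two approaches help independently and can be combined.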
Re: [PERFORM] Best OS for Postgres 8.2
On Tue, 8 May 2007, Greg Smith wrote: On Tue, 8 May 2007, Luke Lonergan wrote: From discussions with the developers, the biggest issue is a technical one: the Linux VFS layer makes the [ZFS] port difficult. Difficult on two levels. First you'd have to figure out how to make it work at all; then you'd have to reshape it into a form that would be acceptable to the Linux kernel developers, who haven't seemed real keen on the idea so far.

given that RAID, snapshots, etc are already in the linux kernel, I suspect that what will need to happen is for the filesystem to be ported without those features and then the userspace tools (that manipulate the volumes) be ported to use the things already in the kernel.

The standard article I'm sure you've already seen this week on this topic is Jeff Bonwick's at http://blogs.sun.com/bonwick/entry/rampant_layering_violation

yep, that sounds like what I've been hearing. what the ZFS (and reiserfs4) folks haven't been wanting to hear from the linux kernel devs is that they are interested in having all these neat features available for use with all filesystems (and the linux kernel has a _lot_ of filesystems available); with solaris you basically have UFS and ZFS, so it's not as big a deal.

What really bugged me was his earlier article linked to there where he talks about how ZFS eliminates the need for hardware RAID controllers: http://blogs.sun.com/bonwick/entry/raid_z While there may be merit to that idea for some applications, like situations where you have a pig of a RAID5 volume, that's just hype for database writes. "We issue the SYNCHRONIZE CACHE command to the disks after pushing all data in a transaction group"--see, that would be the part the hardware controller is needed to accelerate. If you really care about whether your data hit disk, there is no way to break the RPM barrier without hardware support. The fact that he misunderstands such a fundamental point makes me wonder what other gigantic mistakes might be buried in his analysis.

I've seen similar comments from some of the linux kernel devs; they've used low-end raid controllers with small processors on them and think that a second core/socket in the main system to run software raid on is better.

David Lang
Re: [PERFORM] Best OS for Postgres 8.2
On Wed, 9 May 2007, Steinar H. Gunderson wrote: On Wed, May 09, 2007 at 01:57:51AM -0700, [EMAIL PROTECTED] wrote: given that RAID, snapshots, etc are already in the linux kernel, I suspect that what will need to happen is for the filesystem to be ported without those features and then the userspace tools (that manipulate the volumes) be ported to use the things already in the kernel. Well, part of the idea behind ZFS is that these parts are _not_ separated in "layers" -- for instance, the filesystem can push data down to the RAID level to determine the stripe size used.

there's nothing preventing this from happening if they are separate layers either. there are some performance implications of the separate layers, but until someone has the ability to do head-to-head comparisons it's hard to say which approach will win (in theory the lack of layers makes for faster code, but in practice the fact that each layer is gone over by experts looking for ways to optimize it may overwhelm the layering overhead)

Whether this is a good idea is of course hotly debated, but I don't think you can port just the filesystem part and call it a day.

Oh, I'm absolutely sure that doing so won't satisfy people (and would generate howls of outrage from some parts), but having watched other groups try to get things into the kernel that the kernel devs felt were layering violations, I think that it's what will ultimately happen.

David Lang
Re: ZFS and Postgresql - WASRe: [PERFORM] Best OS for Postgres 8.2
On Wed, 9 May 2007, Jignesh Shah wrote: But we still pay the penalty on WAL while writing them in the first place I guess .. Is there an option to disable it.. I can test how much is the impact, I guess a couple of %s, but good to verify :-)

on modern CPUs, where the CPU is significantly faster than RAM, calculating a checksum is free if the CPU has to touch the data anyway (cycles where it would be waiting for a cache miss are spent doing the calculations). if you don't believe me, hack the source to remove the checksum and see if you can measure any difference.

David Lang

> Regards, Jignesh

Alvaro Herrera wrote: Jignesh Shah escribió: > Now comes the thing that I am still exploring > * Do we do checksum in WAL ? I guess we do .. Which means that we are > now doing double checksumming on the data. One in ZFS and one in > postgresql. ZFS does allow checksumming to be turned off (but only on newly > allocated blocks). But of course the philosophy is where should it be > done (ZFS or PostgreSQL). > Checksums on WAL are not optional in Postgres, because AFAIR they are used to determine when it should stop recovering.
Re: [PERFORM] setting up raid10 with more than 4 drives
On Wed, 30 May 2007, Jonah H. Harris wrote: On 5/29/07, Luke Lonergan <[EMAIL PROTECTED]> wrote: AFAIK you can't RAID1 more than two drives, so the above doesn't make sense to me. Yeah, I've never seen a way to RAID-1 more than 2 drives either. It would have to be his first one: D1 + D2 = MD0 (RAID 1) D3 + D4 = MD1 ... D5 + D6 = MD2 ... MD0 + MD1 + MD2 = MDF (RAID 0)

I don't know what the failure mode ends up being, but on linux I had no problems creating what appears to be a massively redundant (but small) array:

md0 : active raid1 sdo1[10](S) sdn1[8] sdm1[7] sdl1[6] sdk1[5] sdj1[4] sdi1[3] sdh1[2] sdg1[9] sdf1[1] sde1[11](S) sdd1[0] 896 blocks [10/10] [UU]

David Lang
Re: [PERFORM] Thousands of tables versus on table?
On Mon, 4 Jun 2007, Scott Marlowe wrote: Gregory Stark wrote: "Thomas Andrews" <[EMAIL PROTECTED]> writes: > I guess my real question is, does it ever make sense to create thousands > of > tables like this? > Sometimes. But usually it's not a good idea. What you're proposing is basically partitioning, though you may not actually need to put all the partitions together for your purposes. Partitioning's main benefit is in the management of the data. You can drop and load partitions in chunks rather than have to perform large operations on millions of records. Postgres doesn't really get any faster by breaking the tables up like that. In fact it probably gets slower as it has to look up which of the thousands of tables you want to work with. That's not entirely true. PostgreSQL can be markedly faster using partitioning as long as you always access it by referencing the partitioning key in the where clause. So, if you partition the table by date, and always reference it with a date in the where clause, it will usually be noticeably faster. OTOH, if you access it without using a where clause that lets it pick partitions, then it will be slower than one big table. So, while this poster might originally think to have one table for each user, resulting in thousands of tables, maybe a compromise where you partition on userid ranges would work out well, and keep each partition table down to some 50-100 thousand rows, with smaller indexes to match.

what if he doesn't use the postgres internal partitioning, but instead makes his code access the tables named responsesN, where N is the id of the customer? this is what it sounded like he was asking about initially.

David Lang
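For readers following along, a minimal sketch of the userid-range partitioning Scott describes, using 8.x-style inheritance partitioning; all table and column names here are made up for illustration:

    -- parent table plus range-partitioned children:
    CREATE TABLE responses (customer_id int, created date, body text);

    CREATE TABLE responses_0001_0500 (
        CHECK (customer_id BETWEEN 1 AND 500)
    ) INHERITS (responses);

    CREATE TABLE responses_0501_1000 (
        CHECK (customer_id BETWEEN 501 AND 1000)
    ) INHERITS (responses);

    CREATE INDEX responses_0001_0500_cust ON responses_0001_0500 (customer_id);
    CREATE INDEX responses_0501_1000_cust ON responses_0501_1000 (customer_id);

    -- with constraint exclusion on, a query that names the partition key
    -- only touches the matching child table:
    SET constraint_exclusion = on;
    SELECT count(*) FROM responses WHERE customer_id = 42;

queries that don't mention customer_id still have to scan every child, which is Scott's point about it being slower without a partition-picking where clause.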
Re: [PERFORM] Thousands of tables versus on table?
On Tue, 5 Jun 2007, Gregory Stark wrote: "Scott Marlowe" <[EMAIL PROTECTED]> writes: Sorry, I think I initially read your response as "Postgres doesn't really get any faster by breaking the tables up" without the "like that" part. Well, breaking up the tables like that or partitioning, either way should be about equivalent really. Breaking up the tables and doing it in the application should perform even better, but it does make the schema less flexible and harder to do non-partition based queries and so on.

but he said in the initial message that they don't do cross-customer reports anyway, so there really isn't any non-partition based querying going on anyway.

I guess I should explain what I originally meant: A lot of people come from a flat-file world and assume that things get slower when you deal with large tables. In fact, due to the magic of log(n), accessing records from a large index is faster than first looking up the table and index info in a small index and then doing a second lookup in an index for a table half the size.

however, if your query plan ever does a sequential scan of a table then you are not doing a log(n) lookup, are you?

Where the win in partitioning comes in is in being able to disappear some of the data entirely. By making part of the index key implicit in the choice of partition you get away with a key that's half as large. And in some cases you can get away with using a different key entirely which wouldn't otherwise have been feasible to index. In some cases you can even do sequential scans whereas in an unpartitioned table you would have to use an index (or scan the entire table). But the real reason people partition data is really for the management ease. Being able to drop and load entire partitions in O(1) makes it feasible to manage data on a scale that would simply be impossible without partitioned tables.

remember that the original question wasn't about partitioned tables, it was about the performance problem he was having with one large table (slow insert speed) and asking if postgres would collapse if he changed his schema to use a separate table per customer.

I see many cases where people advocate collapsing databases/tables together by adding a column that indicates which customer the row is for. however I really don't understand why it is more efficient to have a 5B row table that you do a report/query against 0.1% of than it is to have 1000 different tables of 5M rows each and do a report/query against 100% of one of them. it would seem that having to skip over 99.9% of the data to find the things that _may_ be relevant would have a noticeable cost in and of itself.

David Lang
Re: [PERFORM] Thousands of tables versus on table?
On Tue, 5 Jun 2007, Tom Lane wrote: [EMAIL PROTECTED] writes: however I really don't understand why it is more efficient to have a 5B row table that you do a report/query against 0.1% of than it is to have 1000 different tables of 5M rows each and do a report/query against 100% of one.

Essentially what you are doing when you do that is taking the top few levels of the index out of the database and putting it into the filesystem; plus creating duplicative indexing information in the database's system catalogs. The degree to which this is a win is *highly* debatable, and certainly depends on a whole lot of assumptions about filesystem performance. You also need to assume that constraint-exclusion in the planner is pretty doggone cheap relative to the table searches, which means it almost certainly will lose badly if you carry the subdivision out to the extent that the individual tables become small. (This last could

what is considered 'small'? a few thousand records, a few million records? what multiplication factor would there need to be on the partitioning to make it worthwhile? 100 tables, 1000 tables, 10000 tables?

the company that I'm at started out with a separate database per customer (not using postgres); there are basically zero cross-customer queries, with a large volume of updates and lookups. overall things have now grown to millions of updates/day (some multiple of this in lookups), and ~2000 customers, with tens of millions of rows between them. having each one as a separate database has really helped us over the years as it's made it easy to scale (run 500 databases on each server instead of 1000, performance just doubled)

various people (not database experts) are pushing to install Oracle cluster so that they can move all of these to one table with a customerID column. the database folks won't comment much on this either way, but they don't seem enthusiastic to combine all the data together. I've been on the side of things that said that separate databases are better because it improves data locality to only have to look at the data for one customer at a time rather than having to pick that customer's data out from the mass of other, unrelated data.

be improved in some cases if we had a more explicit representation of partitioning, but it'll never be as cheap as one more level of index search.)

say you have a billing table of customerID, date, description, amount, tax, extended, paid and you need to do things like: report on invoices that haven't been paid, summarize the amount billed each month, summarize the tax for each month. but you need to do this separately for each customerID (not as a batch job that reports on all customerIDs at once; think of a website where the customer can request such reports at any time with a large variation in criteria). would you be able to just have one index on customerID and then another on date? or would the second one need to be on customerID||date? and would this process of going through the index and seeking to the data it points to really be faster than a sequential scan of just the data related to that customerID?

I think the main argument for partitioning is when you are interested in being able to drop whole partitions cheaply.

I fully understand this if you are doing queries across all the partitions, but if your query is confined to a single partition, especially in the case where you know ahead of time in the application which 'partition' you care about, it would seem that searching through significantly less data should be a win.
David Lang
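As a concrete illustration of the customerID-plus-date question above, a single multicolumn index usually covers that kind of per-customer reporting. This is only a sketch; the column names follow the billing example in the post, while the index name and the customer id 42 are made up:

    CREATE INDEX billing_cust_date_idx ON billing (customerid, date);

    -- unpaid invoices for one customer:
    SELECT * FROM billing
    WHERE customerid = 42 AND paid = false;

    -- amount billed and tax per month for one customer:
    SELECT date_trunc('month', date) AS month,
           sum(amount) AS billed,
           sum(tax)    AS tax
    FROM billing
    WHERE customerid = 42
    GROUP BY 1
    ORDER BY 1;

Whether walking that index beats a sequential scan of a per-customer table is exactly the trade-off being debated here; the index approach tends to win when one customer's rows are a small fraction of the whole table.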
Re: [PERFORM] Thousands of tables versus on table?
On Wed, 6 Jun 2007, Steinar H. Gunderson wrote: On Tue, Jun 05, 2007 at 05:59:25PM -0400, Tom Lane wrote: I think the main argument for partitioning is when you are interested in being able to drop whole partitions cheaply. Wasn't there also talk about adding the ability to mark individual partitions as read-only, thus bypassing MVCC and allowing queries to be satisfied using indexes only? Not that I think I've seen it on the TODO... :-)

now that's a very interesting idea, especially when combined with time-based data where the old times will never change.

David Lang
Re: [PERFORM] Thousands of tables versus on table?
On Wed, 6 Jun 2007, Scott Marlowe wrote: > pretty doggone cheap relative to the table searches, which means it > almost certainly will lose badly if you carry the subdivision out to > the extent that the individual tables become small. (This last could

what is considered 'small'? a few thousand records, a few million records?

I would say small is when the individual tables are in the 10 to 20 Megabyte range. How many records that is depends on record width, of course. Basically, once the tables get small enough that you don't really need indexes much, since you tend to grab 25% or more of each one that you're going to hit in a query.

thanks, that makes a lot of sense

what multiplication factor would there need to be on the partitioning to make it worthwhile? 100 tables, 1000 tables, 10000 tables?

Really depends on the size of the master table I think. If the master table is about 500 Megs in size, and you partition it down to about 1 meg per child table, you're probably ok. Walking through 500 entries for constraint exclusion seems pretty speedy from the tests I've run on a 12M row table that was about 250 Megs, split into 200 to 400 or so equal-sized child tables. The time to retrieve 85,000 rows that were all neighbors went from 2 to 6 seconds, to about 0.2 seconds, and we got rid of indexes entirely since they weren't really needed anymore.

remember, I'm talking about a case where you don't have to go through constraint checking. you know to start with what customerID you are dealing with, so you just check the tables for that customer.

the company that I'm at started out with a separate database per customer (not using postgres); there are basically zero cross-customer queries, with a large volume of updates and lookups. overall things have now grown to millions of updates/day (some multiple of this in lookups), and ~2000 customers, with tens of millions of rows between them. having each one as a separate database has really helped us over the years as it's made it easy to scale (run 500 databases on each server instead of 1000, performance just doubled)

I think that for what you're doing, partitioning at the database level is probably a pretty good compromise solution. Like you say, it's easy to put busy databases on a new server to balance out the load. Hardware is cheap.

various people (not database experts) are pushing to install Oracle cluster so that they can move all of these to one table with a customerID column.

Have these people identified a particular problem they're trying to solve, or is this a religious issue for them? From your description it sounds like a matter of dogma, not problem solving.

in part it is, in part it's because the commercial database companies have told management that doing database replication is impossible with so many databases (we first heard this back when we had 300 or so databases). we've gone the expensive EMC disk-layer replication route, but they think that merging everything will simplify things somehow so the database can do its job better. I see it as just a limitation of the replication solutions offered by the big-name vendors.

the database folks won't comment much on this either way, but they don't seem enthusiastic to combine all the data together.

I think they can see the fecal matter heading towards the rotational cooling device on this one. I can't imagine this being a win from the perspective of saving the company money.

neither do I.

David Lang
Re: [PERFORM] How much ram is too much
On Fri, 8 Jun 2007, Dave Cramer wrote: Is it possible that providing 128G of ram is too much? Will other systems in the server bottleneck?

the only way 128G of ram would be too much is if your total database size (including indexes) is smaller than this. now, it may not gain you as much of an advantage going from 64G to 128G as it does going from 32G to 64G, but that depends on many variables, as others have been asking.

David Lang
Re: [PERFORM] Volunteer to build a configuration tool
one thing to point out to people about this idea is that nothing says that this page needs to be served via a webserver. If all the calculations are done in javascript, this could be a local file that you open with a browser.

do any of the text-mode browsers implement javascript? if so then you have an answer even for the deeply buried, isolated, headless servers.

David Lang

On Mon, 18 Jun 2007, Campbell, Lance wrote: I am a Java Software architect, DBA, and project manager for the University of Illinois, Department of Web Services. We use PostgreSQL to serve about 2 million pages of dynamic content a month; everything from calendars, surveys, forms, discussion boards, RSS feeds, etc. I am really impressed with this tool. The only major problem area I have found where PostgreSQL is really lacking is in "what should my initial configuration settings be?" I realize that there are many elements that can impact a DBA's specific database settings but it would be nice to have a "configuration tool" that would get someone up and running better in the beginning. This is my idea: A JavaScript HTML page that would have some basic questions at the top: 1) How much memory do you have? 2) How many connections will be made to the database? 3) What operating system do you use? 4) Etc...
Re: [PERFORM] PostgreSQL Configuration Tool for Dummies
On Tue, 19 Jun 2007, Y Sidhu wrote: On 6/19/07, Francisco Reyes <[EMAIL PROTECTED]> wrote: Campbell, Lance writes: > Francisco and Richard, > Why ask about disk or raid? How would that impact any settings in > postgresql.conf? If the user has 2 disks and says that he will do a lot of updates, he could put pg_xlog on the second disk.

Let's not ask about disk or raid at this level of sanity tuning. It is important for a newbie to take the right first step. When it comes to disks, we start talking I/O, SATA, SCSI and the varying degrees of SATA and SCSI, and controller cards. Then we throw in RAID and the different levels therein. Add to that, we can talk about drivers controlling these drives and which OS is faster, more stable, etc. As you can see, a newbie would get drowned. So, please keep it simple. I know many people on this list are Gurus. We know you are the best in this field, but we are not and are just trying to improve what we have.

I strongly agree. besides, the number and types of drives, raid configurations, etc are so variable that I strongly believe that the right answer is going to be something along the lines of 'run this tool and then enter the number(s) that the tool reports', and then let the tool measure the end result of all the variables rather than trying to calculate the results.

David Lang
Re: [PERFORM] PostgreSQL Configuration Tool for Dummies
On Tue, 19 Jun 2007, Campbell, Lance wrote: Memory: There are many different ways to ask about memory. Rather than ask a series of questions I went with a single question, #2. If it is better to ask about the memory in a series of questions then please give me the questions you would ask and why you would ask each of them. From my understanding the primary memory issue as it relates to PostgreSQL is "how much memory is available to PostgreSQL". Remember that this needs to be as simple as possible.

there are three categories of memory usage:
1. needed by other software
2. available for postgres
3. needed by the OS

it's not clear if what you are asking about is #2 or a combination of #2 and #3. IMHO you should ask for #2 and #3, possibly along the lines of "how much memory is in the machine that isn't already used by other applications"

David Lang
Re: [PERFORM] PostgreSQL Configuration Tool for Dummies - feedback adjustable control
On Fri, 22 Jun 2007, Sabin Coanda wrote: Instead of (or in addition to) configuring dozens of settings, what do you say about a feedback adjustable control based on the existing system statistics and parsing logs (e.g. http://pgfouine.projects.postgresql.org/index.html )?

something like this would be useful for advanced tuning, but the biggest problem is that it's so difficult to figure out a starting point. bad choices at the starting point can cause several orders of magnitude difference in database performance. In addition, we know that the current defaults are bad for just about everyone (we just can't decide what better defaults would be). this horrible starting point gives people a bad first impression, and a simple tool like what's being discussed can go a long way towards solving that.

David Lang
Re: [PERFORM] Dell Hardware Recommendations
On Thu, 9 Aug 2007, Decibel! wrote: On Thu, Aug 09, 2007 at 08:58:19PM -0500, Scott Marlowe wrote: On Thu, Aug 09, 2007 at 05:29:18PM -0500, Scott Marlowe wrote: On 8/9/07, Decibel! <[EMAIL PROTECTED]> wrote: Also, a good RAID controller can spread reads out across both drives in each mirror on a RAID10. Though, there is an argument for not doing that... it makes it much less likely that both drives in a mirror will fail close enough to each other that you'd lose that chunk of data. I'd think that kind of failure mode is pretty uncommon, unless you're in an environment where physical shocks are common. which is not a typical database environment. (tell that to the guys writing a db for a modern tank fire control system though :) ) You'd be surprised. I've seen more than one case of a bunch of drives failing within a month, because they were all bought at the same time. while with RAID10 you can potentially lose half the array without losing any data. Yes, but the RIGHT two drives can kill EITHER RAID 5 or RAID10. Sure, but the odds of that with RAID5 are 100%, while they're much less in a RAID10.

so you go with RAID6, not RAID5.

While I agree with Merlin that for OLTP a faster drive is a must, for OLAP, more drives is often the real key. The high aggregate bandwidth of a large array of SATA drives is an amazing thing to watch when running a reporting server with otherwise unimpressive specs. True. In this case, the OP will probably want to have one array for the OLTP stuff and one for the OLAP stuff.

one thing that's interesting is that the I/O throughput on the large SATA drives can actually be higher than on the faster but smaller SCSI drives. the SCSI drives can win on seeking, but how much seeking you need to do depends on how large the OLTP database ends up being.

David Lang
Re: [PERFORM] SAN vs Internal Disks
On Fri, 7 Sep 2007, Tobias Brox wrote: We're also considering installing postgres on SAN - that is, my boss is convinced this is the right way to go. Advantages: 1. Higher I/O (at least the salesman claims so)

only if you buy better disks for the SAN than for the local system (note that this includes battery backed ram for write caching. the SAN will include a bunch because its performance would _suck_ otherwise. if you don't put any on your stand-alone system you are comparing apples to oranges)

2. Easier to upgrade the disk capacity

only if you buy a SAN with a lot of empty drive slots, but you could just as easily buy a server with empty drive slots.

3. Easy to set up "warm standby" functionality. (Then again, if the postgres server fails miserably, it's likely to be due to a disk crash).

and if postgres dies for some other reason the image on disk needs repair; unless you script stopping postgres when the SAN does its snapshots, those snapshots are not going to be that good. the problems are usually repairable, but that makes starting your warm spare harder.

Also, my boss states that "all big enterprises use SAN nowadays".

your boss would be very surprised at what the really big shops are doing (and not doing). yes they have a SAN, they have many SANs, from many different vendors, and they have many systems that don't use the SAN and use local disks instead. when you get really large you can find just about anything _somewhere_ in the company.

Disadvantages: 1. Risky? One gets the impression that there are frequent problems with data integrity when reading some of the posts in this thread.

SANs add more parts and more potential points of failure, and when you add SAN replication to the mix things get even more 'interesting'. doing SAN replication across a significant distance to your DR facility can be a LOT harder to get working right than the salesman makes it sound. it's not uncommon to see a SAN replication setup decide that it's going to take a week to catch up after doing a DR test, for example.

2. Expensive

no, _extremely_ expensive. price one and then look at how much hardware you could buy instead. you can probably buy much more storage, and a couple of complete spare systems (do replication to a local spare as well as your remote system), and end up with even more reliability.

3. "Single point of failure" ... but you have that either way; whether it's a SAN or a local disk, one will anyway need good backup systems (and eventually "warm standby" servers running from physically separated disks).

no, with local disks you can afford to have multiple systems so that you don't have a SPOF

4. More complex setup? 5. If there are several hosts with write permission towards the same disk, I can imagine the risks being higher for data integrity breakages. Particularly, I can imagine that if two postgres instances are started up against the same disk (due to some sysadmin mistake), it could be disastrous.

when you are using a SAN for a database, the SAN vendor will have you allocate complete disks to each box, so you don't have multiple boxes hitting the same drive, but you also don't get a lot of the advantages the salesman talks about.

David Lang
Re: [PERFORM] SAN vs Internal Disks
On Tue, 11 Sep 2007, Decibel! wrote: On Tue, Sep 11, 2007 at 05:09:00PM -0400, Michael Stone wrote: On Tue, Sep 11, 2007 at 03:55:51PM -0500, Decibel! wrote: Also, to reply to someone else's email... there is one big reason to use a SAN over direct storage: you can do HA that results in 0 data loss. Good SANs are engineered to be highly redundant, with multiple controllers, PSUs, etc, so that the odds of losing the SAN itself are very, very low. The same isn't true with DAS. You can get DAS arrays with multiple controllers, PSUs, etc. DAS != single disk. It's still in the same chassis, though, which means if you lose memory or mobo you're still screwed. In a SAN setup for redundancy, there's very little in the way of a single point of failure; generally only the backplane, and because there's very little that's on there it's extremely rare for one to fail.

not necessarily. direct attached doesn't mean in the same chassis; external drive shelves attached via SCSI are still DAS. you can even have DAS attached to a pair of machines, with the second box configured to mount the drives only if the first one dies.

David Lang
Re: [PERFORM] 10K vs 15k rpm for analytics
On Tue, 9 Mar 2010, Pierre C wrote: On Tue, 09 Mar 2010 08:00:50 +0100, Greg Smith wrote: Scott Carey wrote: For high sequential throughput, nothing is as optimized as XFS on Linux yet. It has weaknesses elsewhere however. When files are extended one page at a time (as postgres does), fragmentation can be pretty high on some filesystems (ext3, but NTFS is the absolute worst) if several files (indexes + table) grow simultaneously. XFS has delayed allocation which really helps. I'm curious what you feel those weaknesses are. Handling lots of small files, especially deleting them, is really slow on XFS. Databases don't care about that.

accessing lots of small files works really well on XFS compared to ext* (I use XFS with a cyrus mail server which keeps each message as a separate file, and XFS vastly outperforms ext2/3 there). deleting is slow, as you say.

David Lang

There is also the dark side of delayed allocation: if your application is broken, it will manifest itself very painfully. Since XFS keeps a lot of unwritten stuff in the buffers, an app that doesn't fsync correctly can lose lots of data if you don't have a UPS. Fortunately, postgres handles fsync like it should. A word of advice though: a few years ago, we lost a few terabytes on XFS (after that, restoring from backup was quite slow!) because a faulty SCSI cable crashed the server, then crashed it again during xfs_repair. So if you do xfs_repair on a suspicious system, please image the disks first.
Re: [PERFORM] Testing FusionIO
On Wed, 17 Mar 2010, Brad Nicholson wrote: On Wed, 2010-03-17 at 14:11 -0400, Justin Pitts wrote: On Mar 17, 2010, at 10:41 AM, Brad Nicholson wrote: On Wed, 2010-03-17 at 09:52 -0400, Justin Pitts wrote: FusionIO is publicly claiming 24 years @ 5TB/day on the 80GB SLC device, which wear levels across 100GB of actual installed capacity. http://community.fusionio.com/forums/p/34/258.aspx#258 20% of overall capacity free for levelling doesn't strike me as a lot. I don't have any idea how to judge what amount would be right. Some of the Enterprise grade stuff we are looking into (like TMS RamSan) leaves 40% (with much larger overall capacity). Also, running that drive at 80GB is the "Maximum Capacity" mode, which decreases the write performance. Very fair. In my favor, my proposed use case is probably at half capacity or less. I am getting the impression that partitioning/formatting the drive for the intended usage, and not the max capacity, is the way to go. Capacity isn't an issue with this workload. I cannot fit enough drives into these servers to get a tenth of the IOPS that even Tom's documents show the ioDrive is capable of at reduced performance levels. The actual media is only good for a very limited number of write cycles. The way that the drives get around that to be reliable is to constantly write to different areas. The more you have free, the less you have to re-use, the longer the lifespan. This is done by the drive's wear levelling algorithms, not by using partitioning utilities btw.

true, but if the drive is partitioned so that parts of it are never written to by the OS, the drive knows that those parts don't contain data and so can treat them as unallocated. once the OS writes to a part of the drive, unless the OS issues a trim command the drive can't know that the data there is worthless and can be ignored; it has to try and preserve that data, which makes doing the wear leveling harder and slower.

David Lang
Re: [PERFORM] Testing Sandforce SSD
On Sat, 24 Jul 2010, David Boreham wrote: Do you guys have any more ideas to properly 'feel this disk at its teeth'? While an 'end-to-end' test using PG is fine, I think it would be easier to determine if the drive is behaving correctly by using a simple test program that emulates the storage semantics the WAL expects. Have it write a constant stream of records, fsync'ing after each write. Record the highest record number flushed so far in some place that won't be lost with the drive under test (e.g. send it over the network to another machine). Kill the power, bring the system back up again and examine what's at the tail end of that file. I think this will give you the worst case test with the easiest result discrimination. If you want to, you could add concurrent random writes to another file for extra realism. Someone here may already have a suitable test program. I know I've written several over the years in order to test I/O performance, prove the existence of kernel bugs, and so on. I doubt it matters much how long the power is turned off. A second should be plenty time to flush pending writes if the drive is going to do so.

remember that SATA is designed to be hot-plugged, so you don't have to kill the entire system to kill power to the drive. this is a little more abrupt than the system losing power, but in terms of losing data this is about the worst case (while at the same time, it eliminates the possibility that the OS continues to perform writes to the drive as power dies, which is a completely different class of problems, independent of the drive type)

David Lang
Re: [PERFORM] Completely un-tuned Postgresql benchmark results: SSD vs desktop HDD
On Thu, 12 Aug 2010, Brad Nicholson wrote: On 10-08-12 03:22 AM, Arjen van der Meijden wrote: On 12-8-2010 2:53 gnuo...@rcn.com wrote: - The value of SSD in the database world is not as A Faster HDD(tm). Never was, despite the naïve who assert otherwise. The value of SSD is to enable BCNF datastores. Period. If you're not going to do that, don't bother. Silicon storage will never reach equivalent volumetric density, ever. SSD will never be useful in the byte bloat world of xml and other flat file datastores (resident in databases or not). Industrial strength SSD will always be more expensive/GB, and likely by a lot. (Re)factoring to high normalization strips out an order of magnitude of byte bloat, increases native data integrity by as much, reduces much of the redundant code, and puts the ACID where it belongs. All good things, but not effortless. It is actually quite common to under-utilize (short stroke) hard drives in the enterprise world, simply because 'they' need more IOps per amount of data than a completely utilized disk can offer. As such the expense/GB can be much higher than simply dividing the capacity by its price (and if you're looking at fiber channel disks, that price is quite high already). And then it is relatively easy to find enterprise SSDs with better pricing for the whole system as soon as the IOps are more important than the capacity. And when you compare the ongoing operational costs of rack space, powering and cooling for big arrays full of spinning disks to flash based solutions, the price comparison evens itself out even more.

check your SSD specs, some of the high performance ones draw quite a bit of power.

David Lang
Re: [PERFORM] Slow count(*) again...
On Mon, 11 Oct 2010, Samuel Gendler wrote: On Mon, Oct 11, 2010 at 9:06 PM, Scott Carey wrote: I can't speak to documentation, but it is something that helps as your I/O subsystem gets more powerful, and how much it helps depends more on your hardware, which may have adaptive read ahead on its own, and your file system, which may be more or less efficient at sequential I/O. For example, ext3 out of the box gets a much bigger gain from tuning read-ahead than XFS on a DELL PERC6 RAID card (but still ends up slower). Geez. I wish someone would have written something quite so bold as 'xfs is always faster than ext3' in the standard tuning docs. I couldn't find anything that made a strong filesystem recommendation. How does xfs compare to ext4? I wound up on ext4 on a dell perc6 raid card when an unexpected hardware failure on a production system caused my test system to get thrown into production before I could do any serious testing of xfs. If there is a strong consensus that xfs is simply better, I could afford the downtime to switch.

unfortunately you are not going to get a clear opinion here.

ext3 has a long track record, and since it is the default, it gets a lot of testing. it does have known issues.

xfs had problems on linux immediately after it was ported. it continues to be improved, and many people have been using it for years and trust it. XFS does have a weakness in creating/deleting large numbers of small files.

ext4 is the new kid on the block. it claims good things, but it's so new that many people don't trust it yet.

btrfs is the 'future of filesystems' that is supposed to be better than anything else, but it's definitely not stable yet, and time will tell if it really lives up to its promises.

and this is just on linux. on BSD or solaris (or with out-of-kernel patches) you also have ZFS, which some people swear by, and other people swear at.

David Lang

As it happens, this is a system where all of the heavy workload is in the form of sequential scan type load. The OLTP workload is very minimal (tens of queries per minute on a small number of small tables), but there are a lot of reporting queries that wind up doing sequential scans of large partitions (millions to tens of millions of rows). We've sized the new hardware so that the most commonly used partitions fit into memory, but if we could speed up the queries that touch less frequently used partitions, that would be good. I'm the closest thing our team has to a DBA, which really only means that I'm the one person on the dev team or the ops team to have read all of the postgres docs and wiki and the mailing lists. I claim no actual DBA experience or expertise and have limited cycles to devote to tuning and testing, so if there is an established wisdom for filesystem choice and read ahead tuning, I'd be very interested in hearing it.
Re: [PERFORM] Slow count(*) again...
On Tue, 12 Oct 2010, Craig Ringer wrote: BTW: There is a lot of talk about MVCC, but is the next solution possible: 1) Create a page information map that for each page in the table will tell you how many rows are within and if any write (either successful or not) was done to this page. This can even be two maps to make the second one really small (a bit per page) - so that it could be in memory most of the time. 2) When you need to do count(*) or an index check - first check if there were no writes to the page. If not - you can use count information from the page info/index data without going to the page itself. 3) Let vacuum clear the bit after freezing all the tuples in the page (am I using terminology correctly?). Part of this already exists. It's called the visibility map, and is present in 8.4 and above. It's not currently used for queries, but can potentially be used to aid some kinds of query. http://www.postgresql.org/docs/8.4/static/storage-vm.html In this case all read-only (archive) data will have this bit off and index/count(*) will be really fast. A count with any joins or filter criteria would still have to scan all pages with visible tuples in them. So the visibility map helps speed up scanning of bloated tables, but doesn't provide a magical "fast count" except in the utterly trivial "select count(*) from tablename;" case, and can probably only be used for accurate results when there are no read/write transactions currently open. Even if you kept a count of tuples in each page along with the mvcc transaction ID information required to determine for which transactions that count is valid, it'd only be useful if you didn't have to do any condition checks, and it'd be yet another thing to update with every insert/delete/update. Perhaps for some users that'd be worth having, but it seems to me like it'd have pretty narrow utility. I'm not sure that's the answer.

from a PR point of view, speeding up the trivial count(*) case could be worth it, just to avoid people complaining about it not being fast.

in the case where you are doing a count(*) with a where clause on an indexed column, could the search just look at the index plus the visibility map rather than doing a sequential scan through the table?

as for your worries about the accuracy of a visibility-based count in the face of other transactions, wouldn't you run into the same issues if you are doing a sequential scan with the same transactions in process?

David Lang
Re: [PERFORM] Slow count(*) again...
On Tue, 12 Oct 2010, Joe Uhl wrote: The biggest single problem with "select count(*)" is that it is seriously overused. People use that idiom to establish existence, which usually leads to a performance disaster in the application using it, unless the table has no more than a few hundred records. SQL, of which PostgreSQL offers an excellent implementation, has offered the [NOT] EXISTS clause since its inception in the Jurassic era. The problem is with the sequential scan, not with counting. I'd even go as far as to suggest that 99% of instances of the "select count(*)" idiom are probably bad use of the SQL language. I agree, I have seen many very bad examples of using count(*). I will go so far as to question the use of count(*) in my examples here. Is there a better way to come up with a page list than using count(*)? What is the best method to make a page of results and a list of links to other pages of results? Am I barking up the wrong tree here? One way I have dealt with this on very large tables is to cache the count(*) at the application level (using memcached, terracotta, or something along those lines) and then increment that cache whenever you add a row to the relevant table. On application restart that cache is re-initialized with a regular old count(*). This approach works really well and all large systems in my experience need caching in front of the DB eventually. If you have a simpler system with, say, a single application/web server you can simply store the value in a variable; the specifics would depend on the language and framework you are using.

this works if you know ahead of time what the criteria of the search are going to be. so it will work for

select count(*) from table;

what this won't work for is cases where the criteria of the search are unpredictable, i.e. ask the user for input

select count(*) from table where field=$input;

David Lang

Another, more all-DB approach is to create a statistics table into which you place aggregated statistics rows (num deleted, num inserted, totals, etc) at an appropriate time interval in your code. So you have rows containing aggregated statistics information for the past and some tiny portion of the new data happening right now that hasn't yet been aggregated. Queries then look like a summation of the aggregated values in the statistics table plus a count(*) over just the newest portion of the data table, and are generally very fast. Overall I have found that once things get big the layers of your app stack start to blend together and have to be combined in clever ways to keep speed up. Postgres is a beast, but when you run into things it can't do well, just find a way to cache it or make it work together with some other persistence tech to handle those cases.
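For the predictable, no-criteria case, the all-in-the-database version of the caching Joe describes can be as simple as a one-row counter table kept current by triggers. This is only a sketch with made-up names ("mytable", "mytable_count"), and note that the single counter row serializes concurrent writers, so it only suits tables that aren't write-heavy:

    -- seed the counter once:
    CREATE TABLE mytable_count (n bigint NOT NULL);
    INSERT INTO mytable_count SELECT count(*) FROM mytable;

    CREATE FUNCTION mytable_count_trig() RETURNS trigger AS $$
    BEGIN
        IF TG_OP = 'INSERT' THEN
            UPDATE mytable_count SET n = n + 1;
        ELSIF TG_OP = 'DELETE' THEN
            UPDATE mytable_count SET n = n - 1;
        END IF;
        RETURN NULL;  -- return value is ignored for AFTER triggers
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER mytable_count_maint
    AFTER INSERT OR DELETE ON mytable
    FOR EACH ROW EXECUTE PROCEDURE mytable_count_trig();

    -- the "count" is now a single-row lookup:
    SELECT n FROM mytable_count;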
Re: [PERFORM] Slow count(*) again...
On Tue, 12 Oct 2010, Mladen Gogala wrote: da...@lang.hm wrote: from a PR point of view, speeding up the trivial count(*) case could be worth it, just to avoid people complaining about it not being fast. Fixing PR stuff is not the approach that I would take. People are complaining about select count(*) because they're using it in all the wrong places. that may be the case, but if it's possible to make it less painful it will mean more people use postgres, both because it works better for them when they are using the suboptimal programs, and also because when people do their trivial testing of databases to decide which one they will use, they won't rule out postgres because "it's so slow". the fact of the matter is that people do use count(*), and even though there are usually ways to avoid doing so, having the programmer do something different for postgres than they do for other databases raises a barrier against postgres adoption. David Lang My assessment that there is a problem with sequential scan was wrong. Now, let's again take Oracle as the measure. Someone asked me about caching the data. Here it is: SQL> connect system/* Connected. SQL> alter system flush buffer_cache; System altered. Elapsed: 00:00:12.68 SQL> connect adbase/* Connected. SQL> alter session set db_file_multiblock_read_Count=128; Session altered. Elapsed: 00:00:00.41 SQL> select count(*) from ni_occurrence; COUNT(*) -- 402062638 Elapsed: 00:02:37.77 SQL> select bytes/1048576 MB from user_segments 2 where segment_name='NI_OCCURRENCE'; MB -- 35329 Elapsed: 00:00:00.20 SQL> So, the results weren't cached the first time around. The explanation is the fact that Oracle, as of version 10.2.0, reads the table into private process memory, not into the shared buffers. This table alone is 35GB in size, and Oracle took 2 minutes 37 seconds to read it using the full table scan. If I do the same thing with PostgreSQL and a comparable table, Postgres is, in fact, faster: psql (9.0.1) Type "help" for help. news=> \timing Timing is on. news=> select count(*) from moreover_documents_y2010m09; count -- 17242655 (1 row) Time: 113135.114 ms news=> select pg_size_pretty(pg_table_size('moreover_documents_y2010m09')); pg_size_pretty 27 GB (1 row) Time: 100.849 ms news=> The number of rows is significantly smaller, but the table contains a rather large "text" field which consumes quite a bit of TOAST storage, so the sizes are comparable. Postgres read through 27GB in 113 seconds, less than 2 minutes, and Oracle took 2 minutes 37 seconds to read through 35GB. I stand corrected: there is nothing wrong with the speed of the Postgres sequential scan. -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
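A quick back-of-the-envelope check of the throughput implied by the timings quoted above (taking 2 min 37 s = 157 s for the Oracle scan):

    # rough throughput from the numbers in the message above
    oracle_mb, oracle_s = 35329, 2 * 60 + 37      # 35 GB segment, 2 min 37 s
    pg_mb, pg_s = 27 * 1024, 113.135              # 27 GB table, 113.135 s

    print("Oracle full scan:  %.0f MB/s" % (oracle_mb / oracle_s))   # ~225 MB/s
    print("Postgres seq scan: %.0f MB/s" % (pg_mb / pg_s))           # ~244 MB/s

So the two sequential scans are within roughly 10% of each other in raw throughput, which supports the conclusion drawn above.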
Re: [PERFORM] Slow count(*) again...
On Tue, 12 Oct 2010, Craig Ringer wrote: On 10/12/2010 04:22 PM, da...@lang.hm wrote: from a PR point of view, speeding up the trivial count(*) case could be worth it, just to avoid people complaining about it not being fast. At the cost of a fair bit more complexity, though, and slowing everything else down. complexity probably, although given how complex the planner already is, is this significant? as far as slowing everything else down, why would it do that? (beyond the simple fact that any new thing the planner can do makes the planner take a little longer) David Lang The proper solution here remains, IMO, support for visibility information in indexes, whether by storing it once in the index and once in the heap (ouch!), storing it out-of-line, or using a covering index where one or more columns are stored wholly in the index not in the table heap at all. Here are a few of the many past discussions about this that have already covered some of the same ground: http://stackoverflow.com/questions/839015/postgres-could-an-index-organized-tables-paved-way-for-faster-select-count-fr http://osdir.com/ml/db.postgresql.performance/2003-10/msg00075.html (and the rest of the thread) A decent look with Google will find many, many more. in the case where you are doing a count(*) query with a where clause on an indexed column, could the search just look at the index + the visibility map rather than doing a sequential scan through the table? Nope, because the visibility map, which is IIRC only one bit per page, doesn't record how many tuples there are on the page, or enough information about them to determine how many of them are visible to the current transaction*. as for your worries about the accuracy of a visibility-based count in the face of other transactions, wouldn't you run into the same issues if you are doing a sequential scan with the same transactions in process? No. Every tuple in a table heap in postgresql has hidden fields, some of which are used to determine whether the current transaction* can "see" the tuple - it may have been inserted after this transaction started, or deleted before this transaction started, so it's not visible to this transaction but may still be to others. http://www.postgresql.org/docs/current/static/ddl-system-columns.html This information isn't available in the visibility map, or in indexes. That's why PostgreSQL has to hit the heap to find it. * current transaction should really be "current snapshot". The snapshot is taken at the start of the whole transaction for SERIALIZABLE isolation, and at the start of each statement for READ COMMITTED isolation. -- Craig Ringer -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Slow count(*) again...
On Wed, 13 Oct 2010, Tom Lane wrote: Neil Whelchel writes: That's probably got little to do with caching and everything to do with setting hint bits on the first SELECT pass. I concur with Mark's question about whether your UPDATE pushed the table size across the limit of what would fit in RAM. Neil, can you just double the size of your initial test to make sure that it's too large to fit in ram to start with? David Lang -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] BBU Cache vs. spindles
On Fri, 29 Oct 2010, Robert Haas wrote: On Thu, Oct 28, 2010 at 5:26 PM, Tom Lane wrote: James Mansion writes: Tom Lane wrote: The other and probably worse problem is that there's no application control over how soon changes to mmap'd pages get to disk. An msync will flush them out, but the kernel is free to write dirty pages sooner. So if they're depending for consistency on writes not happening until msync, it's broken by design. (This is one of the big reasons we don't use mmap'd space for Postgres disk buffers.) Well, I agree that it sucks for the reason you give - but you use write and that's *exactly* the same in terms of when it gets written, as when you update a byte on an mmap'd page. Uh, no, it is not. The difference is that we can update a byte in a shared buffer, and know that it *isn't* getting written out before we say so. If the buffer were mmap'd then we'd have no control over that, which makes it mighty hard to obey the WAL "write log before data" paradigm. It's true that we don't know whether write() causes an immediate or delayed disk write, but we generally don't care that much. What we do care about is being able to ensure that a WAL write happens before the data write, and with mmap we don't have control over that. Well, we COULD keep the data in shared buffers, and then copy it into an mmap()'d region rather than calling write(), but I'm not sure there's any advantage to it. Managing address space mappings is a pain in the butt. keep in mind that you have no way of knowing what order the data in the mmap region gets written out to disk. David Lang -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] BBU Cache vs. spindles
On Fri, 29 Oct 2010, James Mansion wrote: Tom Lane wrote: Uh, no, it is not. The difference is that we can update a byte in a shared buffer, and know that it *isn't* getting written out before we Well, I don't know where you got the idea I was referring to that sort of thing - it's the same as writing to a buffer before copying to the mmap'd area. It's true that we don't know whether write() causes an immediate or delayed disk write, but we generally don't care that much. What we do Which is what I was referring to. care about is being able to ensure that a WAL write happens before the data write, and with mmap we don't have control over that. I think you have just the same control either way, because you can only force ordering with an appropriate explicit sync, and in the absence of such a sync all bets are off for whether/when each disk page is written out, and if you can't ensure that the controller and disk are write-through, you'd better do a hardware cache flush too, right? A shame that so many systems have relatively poor handling of that hardware flush. the issue is that when you update an mmap'd chunk of data, it could be written out immediately without you doing _anything_ (and thanks to multiple cores/threads, it could get written out while you are still in the middle of updating it). When you update an internal buffer and then write that, you know that nothing will hit the disk before you issue the write command. David Lang -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
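A minimal Python sketch of the ordering point being made here. The file names are made up, and this only illustrates the WAL-before-data rule, not how PostgreSQL actually performs its I/O:

    import os

    wal_fd = os.open("wal.log", os.O_WRONLY | os.O_APPEND | os.O_CREAT)
    data_fd = os.open("datafile", os.O_WRONLY | os.O_CREAT)

    page = bytearray(8192)               # private buffer: modify it freely here,
    page[0:5] = b"hello"                 # nothing can reach disk yet

    os.write(wal_fd, b"wal record describing the change")
    os.fsync(wal_fd)                     # the log is durable *before* the data write
    os.write(data_fd, bytes(page))       # only now is the kernel even shown the page
    os.fsync(data_fd)

    # With an mmap'd data file there is no equivalent guarantee: the moment the
    # mapped page is dirtied, the kernel may write it back - possibly before the
    # WAL record above has been flushed.
    # m = mmap.mmap(data_fd, 8192); m[0:5] = b"hello"    # may hit disk at any time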
Re: [PERFORM] How to best use 32 15k.7 300GB drives?
On Thu, 27 Jan 2011, Robert Schnabel wrote: HI, I use PostgreSQL basically as a data warehouse to store all the genetic data that our lab generates. The only person that accesses the database is myself and therefore I've had it housed on my workstation in my office up till now. However, it's getting time to move it to bigger hardware. I currently have a server that is basically only storing backup images of all our other workstations so I'm going to move my database onto it. The server looks like this: Windows Server Enterprise 2008 R2 64-bit, AMD 2350 quad-core x2, 32GB RAM. For my purposes the CPUs and RAM are fine. I currently have an Adaptec 52445+BBU controller that has the OS (4 drive RAID5), FTP (4 drive RAID5) and two backup arrays (8 drive each RAID0). The backup arrays are in a 16 drive external enclosure through an expander so I actually have 16 ports free on the 52445 card. I plan to remove 3 of the drives from my backup arrays to make room for 3 - 73GB 15k.5 drives (re-purposed from my workstation). Two 16 drive enclosures with SAS2 expanders just arrived as well as 36 Seagate 15k.7 300GB drives (ST3300657SS). I also intend on getting an Adaptec 6445 controller with the flash module when it becomes available in about a month or two. I already have several Adaptec cards so I'd prefer to stick with them. Here's the way I was planning using the new hardware: xlog & wal: 3 - 73G 15k.5 RAID1+hot spare in enclosure A on 52445 controller data: 22 - 300G 15k.7 RAID10 enclosure B&C on 6445 controller indexes: 8 - 300G 15k.7 RAID10 enclosure C on 6445 controller 2 - 300G 15k.7 as hot spares enclosure C 4 spare 15k.7 for on the shelf With this configuration I figure I'll have ~3TB for my main data tables and 1TB for indexes. Right now my database is 500GB total. The 3:1 split reflects my current table structure and what I foresee coming down the road in terms of new data. So my questions are 1) am I'm crazy for doing this, 2) would you change anything and 3) is it acceptable to put the xlog & wal (and perhaps tmp filespace) on a different controller than everything else? Please keep in mind I'm a geneticist who happens to know a little bit about bioinformatics and not the reverse. :-) a number of questions spring to mind how much of the time are you expecting to spend inserting data into this system vs querying data from the system? is data arriving continuously, or is it a matter of receiving a bunch of data, inserting it, then querying it? which do you need to optimize for, insert speed or query speed? do you expect your queries to be searching for a subset of the data scattered randomly throughlut the input data, or do you expect it to be 'grab this (relativly) sequential chunk of input data and manipulate it to generate a report' type of thing what is your connectvity to the raid enclosures? (does putting 22 drives on one cable mean that you will be limited due to the bandwidth of this cable rather than the performance of the drives) can you do other forms of raid on these drives or only raid 10? how critical is the data in this database? if it were to die would it just be a matter of recreating it and reloading the data? or would you loose irreplaceable data? David Lang -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] How to best use 32 15k.7 300GB drives?
sorry for not replying properly to your response, I managed to delete the mail. as I understand your data access pattern it's the following: for the main table space: bulk loads every couple of weeks. if the data is lost you can just reload it. searches tend to be extracting large sequential chunks of data, either to external files or into different table spaces. for this table space, you are basically only inserting every couple of weeks, and it sounds as if you do not really care how long it takes to load the data. first the disclaimer: I'm not a postgres expert, but I do have good experience with large amounts of data on linux systems (and especially running into the limitations when doing it on the cheap ;-) with this data pattern your WAL is meaningless (as it's only relevant for inserts), and you may as well use raid6 as raid10 (both allow you to utilize all drives for reads, but raid6 gives you 2 drives worth of redundancy, while the wrong two drives failing on raid10 could kill the entire array). You may even want to disable fsync on imports. It will save you a lot of time, and if the system crashes during the load you can just reinitialize and reload the data. however, since you are going to be doing large sequential data transfers, you want to be utilizing multiple SAS links, preferably as evenly as possible, so rather than putting all your data drives on one port, you may want to spread them between ports so that your aggregate bandwidth to the drives is higher (with this many high speed drives, this is a significant limitation) the usual reason for keeping the index drives separate is to avoid having writes interact with index reads. Since you are not going to be doing both at the same time, I don't know if it helps to separate your indexes. now, if you pull the data from this main table into a smaller table for analysis, you may want to do more interesting things with the drives that you use for this smaller table, as you are going to be loading data into them more frequently. David Lang
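To put rough numbers on the raid6-vs-raid10 trade-off for the 22 x 300 GB data array being discussed, here is a quick sketch (it ignores hot spares and controller overhead):

    disks, size_gb = 22, 300

    raid10_usable = disks // 2 * size_gb          # mirror pairs: 3300 GB (~3 TB)
    raid6_usable  = (disks - 2) * size_gb         # two parity drives: 6000 GB (~6 TB)

    print("RAID 10: %.1f TB usable, always survives 1 failure, more only if "
          "the failures land on different mirrors" % (raid10_usable / 1000.0))
    print("RAID 6:  %.1f TB usable, survives any 2 failures" % (raid6_usable / 1000.0))

The ~3 TB figure matches the capacity quoted in the original plan, so the choice is essentially double the usable space plus guaranteed two-drive tolerance versus raid10's better small-write behaviour.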
Re: [HACKERS] [PERFORM] Slow count(*) again...
On Thu, 3 Feb 2011, Vitalii Tymchyshyn wrote: 02.02.11 20:32, Robert Haas wrote: Yeah. Any kind of bulk load into an empty table can be a problem, even if it's not temporary. When you load a bunch of data and then immediately plan a query against it, autoanalyze hasn't had a chance to do its thing yet, so sometimes you get a lousy plan. Maybe introducing something like an 'AutoAnalyze' threshold would help? I mean that any insert/update/delete statement that changes more than x% of a table (and no fewer than y records) must do an analyze right after it has finished. Defaults like x=50, y=1 seem quite good to me. If I am understanding things correctly, a full Analyze is going over all the data in the table to figure out patterns. If this is the case, wouldn't it make sense in the situation where you are loading an entire table from scratch to run the Analyze as you are processing the data? If you don't want to slow down the main thread that's inserting the data, you could copy the data to a second thread and do the analysis while it's still in RAM rather than having to read it off of disk afterwards. this doesn't make sense for updates to existing databases, but the use case of loading a bunch of data and then querying it right away isn't _that_ uncommon. David Lang -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [HACKERS] [PERFORM] Slow count(*) again...
On Thu, 3 Feb 2011, Robert Haas wrote: On Thu, Feb 3, 2011 at 5:11 AM, wrote: If I am understanding things correctly, a full Analyze is going over all the data in the table to figure out patterns. No. It's going over a small, fixed-size sample which depends on default_statistics_target but NOT on the table size. It's really important to come up with a solution that's not susceptible to running ANALYZE over and over again, in many cases unnecessarily. If this is the case, wouldn't it make sense in the situation where you are loading an entire table from scratch to run the Analyze as you are processing the data? If you don't want to slow down the main thread that's inserting the data, you could copy the data to a second thread and do the analysis while it's still in RAM rather than having to read it off of disk afterwards. Well that's basically what autoanalyze is going to do anyway, if the table is small enough to fit in shared_buffers. And it's actually usually BAD if it starts running while you're doing a large bulk load, because it competes for I/O bandwidth and the buffer cache and slows things down. Especially when you're bulk loading for a long time and it tries to run over and over. I'd really like to suppress all those asynchronous ANALYZE operations and instead do ONE synchronous one at the end, when we try to use the data. If the table is not large enough to fit in ram, then it will compete for I/O, and the user will have to wait. what I'm proposing is that as the records are created, the process doing the creation makes copies of the records (either all of them, or some of them if not all are needed for the analysis, possibly via sharing memory with the analysis process); this would be synchronous with the load, not asynchronous. this would take zero I/O bandwidth; it would take up some ram, memory bandwidth, and cpu time, but a load of a large table like this is I/O constrained. it would not make sense for this to be the default, but as an option it should save a significant amount of time. I am making the assumption that an Analyze run only has to go over the data once (a sequential scan of the table if it's >> ram for example) and gathers stats as it goes. with the current code, this is a completely separate process that knows nothing about the load, so if you kick it off when you start the load, it makes a pass over the table (competing for I/O), finishes, you continue to update the table, so it makes another pass, etc. As you say, this is a bad thing to do. I am saying to have an option that ties the two together, essentially making the data feed into the Analyze run be a fork of the data coming out of the insert run going to disk. So the Analyze run doesn't do any I/O and isn't going to complete until the insert is complete. At which time it will have seen one copy of the entire table. David Lang -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
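A minimal sketch of the "fork the data into the analyzer" idea, using reservoir sampling so the analyzer only ever holds a fixed-size sample however large the load is. insert_into_table() and build_statistics() are hypothetical stand-ins for the real load path and a real ANALYZE-style pass; this is not how PostgreSQL's ANALYZE is implemented.

    import random

    def load_with_streaming_sample(rows, sample_size=30000):
        # keep a uniform random sample of the rows as they stream past
        # (reservoir sampling, Algorithm R), so statistics can be built at
        # the end of the load without a second pass over the table
        sample = []
        for i, row in enumerate(rows):
            insert_into_table(row)              # hypothetical: the normal load path
            if len(sample) < sample_size:
                sample.append(row)
            else:
                j = random.randint(0, i)        # replace with decreasing probability
                if j < sample_size:
                    sample[j] = row
        build_statistics(sample)                # hypothetical: one stats pass at the end
        return sample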
Re: [HACKERS] [PERFORM] Slow count(*) again...
On Thu, 3 Feb 2011, Robert Haas wrote: On Thu, Feb 3, 2011 at 3:54 PM, wrote: with the current code, this is a completely separate process that knows nothing about the load, so if you kick it off when you start the load, it makes a pass over the table (competing for I/O), finishes, you continue to update the table, so it makes another pass, etc. As you say, this is a bad thing to do. I am saying to have an option that ties the two together, essentially making the data feed into the Analyze run be a fork of the data coming out of the insert run going to disk. So the Analyze run doesn't do any I/O and isn't going to complete until the insert is complete. At which time it will have seen one copy of the entire table. Yeah, but you'll be passing the entire table through this separate process that may only need to see 1% of it or less on a large table. If you want to write the code and prove it's better than what we have now, or some other approach that someone else may implement in the meantime, hey, this is an open source project, and I like improvements as much as the next guy. But my prediction for what it's worth is that the results will suck. :-) I will point out that 1% of a very large table can still be a lot of disk I/O that is avoided (especially if it's random I/O that's avoided) David Lang -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [HACKERS] [PERFORM] Slow count(*) again...
On Thu, 3 Feb 2011, Robert Haas wrote: On Thu, Feb 3, 2011 at 7:39 PM, wrote: Yeah, but you'll be passing the entire table through this separate process that may only need to see 1% of it or less on a large table. If you want to write the code and prove it's better than what we have now, or some other approach that someone else may implement in the meantime, hey, this is an open source project, and I like improvements as much as the next guy. But my prediction for what it's worth is that the results will suck. :-) I will point out that 1% of a very large table can still be a lot of disk I/O that is avoided (especially if it's random I/O that's avoided) Sure, but I think that trying to avoid it will be costly in other ways - you'll be streaming a huge volume of data through some auxiliary process, which will have to apply some algorithm that's very different from the one we use today. The reality is that I think there's little evidence that the way we do ANALYZE now is too expensive. It's typically very cheap and works very well. It's a bit annoying when it fires off in the middle of a giant data load, so we might need to change the time of it a little, but if there's a problem with the operation itself being too costly, this is the first I'm hearing of it. We've actually worked *really* hard to make it cheap. I could be misunderstanding things here, but my understanding is that it's 'cheap' in that it has little impact on the database while it is running. the issue here is that the workflow is load data analyze start work so the cost of analyze in this workflow is not "1% impact on query speed for the next X time", it's "the database can't be used for the next X time while we wait for analyze to finish running" I don't understand why the algorithm would have to be so different than what's done today, surely the analyze thread could easily be tweaked to ignore the rest of the data (assuming we don't have the thread sending the data to analyze do the filtering) David Lang -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [HACKERS] [PERFORM] Slow count(*) again...
On Fri, 4 Feb 2011, ??? wrote: 2011/2/3 If the table is not large enough to fit in ram, then it will compete for I/O, and the user will have to wait. what I'm proposing is that as the records are created, the process doing the creation makes copies of the records (either all of them, or some of them if not all are needed for the analysis, possibly via sharing memory with the analysis process); this would be synchronous with the load, not asynchronous. this would take zero I/O bandwidth; it would take up some ram, memory bandwidth, and cpu time, but a load of a large table like this is I/O constrained. it would not make sense for this to be the default, but as an option it should save a significant amount of time. I am making the assumption that an Analyze run only has to go over the data once (a sequential scan of the table if it's >> ram for example) and gathers stats as it goes. with the current code, this is a completely separate process that knows nothing about the load, so if you kick it off when you start the load, it makes a pass over the table (competing for I/O), finishes, you continue to update the table, so it makes another pass, etc. As you say, this is a bad thing to do. I am saying to have an option that ties the two together, essentially making the data feed into the Analyze run be a fork of the data coming out of the insert run going to disk. So the Analyze run doesn't do any I/O and isn't going to complete until the insert is complete. At which time it will have seen one copy of the entire table. Actually those are two different problems. One is to make analyze more automatic, so that a select right after an insert gets useful statistics. The other is to make it take less IO resources. I don't like the idea of embedding it into insert (unless the threshold can be determined before the inserts start), simply because it adds CPU/memory overhead that will slow down each insert. And if you add a knob that is disabled by default, it will be no better than a manual analyze. if it can happen during the copy instead of being a step after the copy it will speed things up. things like the existing parallel restore could use this instead of needing a separate pass. so I don't think that having to turn it on manually makes it useless, any more than the fact that you have to explicitly disable fsync makes that disabling feature useless (and the two features would be likely to be used together) when a copy command is issued, I assume that there is some indication of how much data is going to follow. I know that it's not just 'insert everything until the TCP connection terminates' because that would give you no way of knowing if the copy got everything in or was interrupted part way through. think about what happens with ftp if the connection drops: you get a partial file 'successfully' as there is no size provided, but with HTTP you get a known-bad transfer that you can abort or resume. David Lang -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [HACKERS] [PERFORM] Slow count(*) again...
On Fri, 4 Feb 2011, Vitalii Tymchyshyn wrote: 04.02.11 16:33, Kenneth Marshall wrote: In addition, the streaming ANALYZE can provide better statistics at any time during the load and it will be complete immediately. As far as passing the entire table through the ANALYZE process, a simple counter can be used to only send the required samples based on the statistics target. Where this would seem to help the most is in temporary tables which currently do not work with autovacuum, but it would streamline their use for more complicated queries that need an analyze to perform well. Actually for me the main "con" with streaming analyze is that it adds a significant CPU burden to an already not too fast load process. Especially if it's automatically done for any load operation performed (and I can't see how it can be enabled on some threshold). two thoughts 1. if it's a large enough load, isn't it I/O bound? 2. this could be done in a separate process/thread from the load itself; that way the overhead of the load is just copying the data in memory to the other process. with a multi-threaded load, this would eat up some cpu that could be used for the load, but cores/chip are still climbing rapidly so I expect that it's still pretty easy to end up with enough CPU to handle the extra load. David Lang And you can't start after some threshold of data has passed by, since you may lose significant information (like minimal values). Best regards, Vitalii Tymchyshyn -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
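Here is a hedged sketch of point 2 above: hand each loaded row to a separate analyzer thread through a bounded queue, so the loader only pays for an in-memory copy. incoming_rows, insert_into_table() and build_statistics() are hypothetical; a real implementation would sample rather than keep everything, as in the earlier reservoir-sampling sketch.

    import queue
    import threading

    rows_for_analysis = queue.Queue(maxsize=10000)

    def analyzer():
        sample = []
        while True:
            row = rows_for_analysis.get()
            if row is None:                  # sentinel: the load has finished
                break
            sample.append(row)               # or reservoir-sample to bound memory
        build_statistics(sample)             # hypothetical stats pass

    worker = threading.Thread(target=analyzer)
    worker.start()

    for row in incoming_rows:                # hypothetical load loop
        insert_into_table(row)
        rows_for_analysis.put(row)           # the only extra cost to the loader

    rows_for_analysis.put(None)
    worker.join()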
Re: [PERFORM] getting the most of out multi-core systems for repeated complex SELECT statements
On Fri, 4 Feb 2011, Chris Browne wrote: 2. The query needs to NOT be I/O-bound. If it's I/O bound, then your system is waiting for the data to come off disk, rather than to do processing of that data. yes and no on this one. it is very possible to have a situation where the process generating the I/O is waiting for the data to come off disk while there are still idle resources in the disk subsystem. it may be that the best way to address this is to have the process generating the I/O send off more requests, but that is sometimes significantly more complicated than splitting the work between two processes and letting them each generate I/O requests. with rotating disks, ideally you want to have at least two requests outstanding: one that the disk is working on now, and one for it to start on as soon as it finishes the one that it's on (so that the disk doesn't sit idle while the process decides what the next read should be). In practice you tend to want even more outstanding from the application so that they can be optimized (combined, reordered, etc) by the lower layers. if you end up with a largish raid array (say 16 disks), this can translate into a lot of outstanding requests that you want to have active to fully utilize the array, but having the same number of requests outstanding with a single disk would be counterproductive, as the disk would not be able to see all the outstanding requests and therefore would not be able to optimize them as effectively. David Lang -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
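A toy illustration of keeping several reads in flight versus issuing them one at a time. The file name is made up, and a thread pool is just one easy way to get multiple requests outstanding from a single process:

    import os
    from concurrent.futures import ThreadPoolExecutor

    BLOCK = 1024 * 1024
    fd = os.open("big_table_file", os.O_RDONLY)      # hypothetical data file

    def read_block(blockno):
        return os.pread(fd, BLOCK, blockno * BLOCK)

    # one request at a time: the array sits idle between requests
    serial = [read_block(b) for b in range(64)]

    # many requests in flight: a 16-disk array can service them on different
    # spindles, and the lower layers get a chance to merge and reorder them
    with ThreadPoolExecutor(max_workers=16) as pool:
        parallel = list(pool.map(read_block, range(64)))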
Re: [PERFORM] Which RAID Controllers to pick/avoid?
On Sun, 6 Feb 2011, Scott Marlowe wrote: On Sun, Feb 6, 2011 at 2:39 AM, Royce Ausburn wrote: On Wed, Feb 2, 2011 at 7:00 PM, Craig Ringer wrote: Whatever RAID controller you get, make sure you have a battery backup unit (BBU) installed so you can safely enable write-back caching. Without that, you might as well use software RAID - it'll generally be faster (and cheaper) than HW RAID w/o a BBU. Recently we had to pull our RAID controllers and go to plain SAS cards. While random access dropped a bit, sequential throughput skyrocketed, saturating the 4 lane cable we use. 4x300MB/s = 1200MB/s, or right around 1G of data a second off the array. VERY impressive. This is really surprising. Software raid generally outperforms hardware raid without BBU? Why is that? My company uses hardware raid quite a bit without BBU and have never thought to compare with software raid =/ For raw throughput it's not uncommon to beat a RAID card whether it has a battery backed cache or not. If I'm writing a 200G file to the disks, a BBU cache isn't gonna make that any faster, it'll fill up in a second and then it's got to write to disk. BBU cache is for faster random writes, and will handily beat SW RAID. But for raw large file read and write SW RAID is the fastest thing I've seen. keep in mind that hardware raid with BBU is safer than software raid. since the updates to the drives do not all happen at the same time, there is a chance that a write to software raid may have happened on some drives and not others when the system crashes. with hardware raid and BBU, the controller knows what it was trying to write where, and if it didn't get the acknowledgement, it will complete the write when it comes up again. but with software raid you will have updated some parts of the array and not others, which will result in a corrupted stripe in the array. David Lang -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] choosing the right RAID level for PostgresQL database
On Sun, 13 Feb 2011, Dave Crooke wrote: For any database, anywhere, the answer is pretty much always RAID-10. The only time you would do anything else is for odd special cases. there are two situations where you would opt for something other than RAID-10 1. if you need the space that raid 6 gives you compared to raid 10, you may not have much choice 2. if you do almost no updates to the disk during the time you are doing the reads, then raid 6 can be at least as fast as raid 10 in non-degraded mode (it could be faster if you are able to use faster parts of the disks in raid 6 than you could in raid 10). degraded mode suffers more, but you can tolerate any 2 drives failing rather than just any 1 drive failing for raid 10 (the wrong two drives failing can kill a raid 10, while if the right drives fail you can lose a lot more drives in raid 10 without losing data) where raid 6 is significantly slower than raid 10 is when you are doing small random writes. Also, the performance variation between raid controllers will be much higher with raid 6 than with raid 10 David Lang Cheers Dave On Sun, Feb 13, 2011 at 2:12 PM, sergey wrote: Hello, I got a disk array appliance of 8 disks, 1T each (UltraStor RS8IP4). It will be used solely by a PostgreSQL database and I am trying to choose the best RAID level for it. The top priority is read performance, since we operate large data sets (tables, indexes) and we do lots of searches/scans, joins and nested queries. With the old disks that we have now the most slowdowns happen on SELECTs. Fault tolerance is less important; it can be 1 or 2 disks. Space is the least important factor. Even 1T will be enough. Which RAID level would you recommend in this situation? The current options are 60, 50 and 10, but probably other options can be even better. Thank you! -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] TB-sized databases
On Wed, 28 Nov 2007, Csaba Nagy wrote: On Wed, 2007-11-28 at 08:27 -0500, Bill Moran wrote: Is there something wrong with: set enable_seqscan = off ? Nothing wrong with enable_seqscan = off except it is an all or nothing type of thing... if you want the big table to never use a seqscan, but a medium table which is joined in should use it, then what do you do? And setting enable_seqscan = off will actually not mean the planner can't use a sequential scan for the query if no other alternative exists. In any case it doesn't mean "please throw an error if you can't do this without a sequential scan". In fact an even more useful option would be to ask the planner to throw an error if the expected cost exceeds a certain threshold... and even better if the option can be overridden for a specific transaction or connection. that way it can be set relatively low for normal operations, but when you need to do an expensive query you can change it for that query. David Lang ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [PERFORM] TB-sized databases
On Wed, 28 Nov 2007, Simon Riggs wrote: On Wed, 2007-11-28 at 14:48 +0100, Csaba Nagy wrote: In fact an even more useful option would be to ask the planner to throw an error if the expected cost exceeds a certain threshold... Well, I've suggested it before: statement_cost_limit on pgsql-hackers, 1 March 2006 Would people like me to re-write and resubmit this patch for 8.4? Tom's previous concerns were along the lines of "How would you know what to set it to?", given that the planner costs are mostly arbitrary numbers. arbitrary numbers are fine if they are relatively consistent with each other. will a plan with an estimated cost of 1,000,000 take approximately 100 times as long as one with a cost of 10,000? or more importantly, will a plan with an estimated cost of 2000 reliably take longer than one with an estimated cost of 1000? David Lang Any bright ideas, or is it we want it and we don't care about the possible difficulties? ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
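As far as I know, statement_cost_limit never became a core setting, but the idea can be approximated on the client side by asking for the plan first and refusing to run anything whose top-level estimated cost is too high. A hedged Python/psycopg2 sketch (the connection string, table and threshold are made up; the regex simply pulls the total cost out of the first EXPLAIN line):

    import re
    import psycopg2

    COST_RE = re.compile(r"cost=[\d.]+\.\.([\d.]+)")

    def run_if_cheap(conn, sql, max_cost=100000.0):
        # client-side stand-in for the proposed statement_cost_limit
        with conn.cursor() as cur:
            cur.execute("EXPLAIN " + sql)
            top_line = cur.fetchone()[0]
            cost = float(COST_RE.search(top_line).group(1))
            if cost > max_cost:
                raise RuntimeError("estimated cost %.0f exceeds limit %.0f"
                                   % (cost, max_cost))
            cur.execute(sql)
            return cur.fetchall()

    conn = psycopg2.connect("dbname=test")            # hypothetical connection
    rows = run_if_cheap(conn, "SELECT * FROM big_table WHERE id = 42")

Of course this is only as useful as the question David raises allows: it assumes the estimated costs are at least relatively consistent with actual run time.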
Re: [PERFORM] With 4 disks should I go for RAID 5 or RAID 10
On Wed, 26 Dec 2007, Fernando Hevia wrote: Mark Mielke Wrote: In my experience, software RAID 5 is horrible. Write performance can decrease below the speed of one disk on its own, and read performance will not be significantly more than RAID 1+0 as the number of stripes has only increased from 2 to 3, and if reading while writing, you will not get 3X as RAID 5 write requires at least two disks to be involved. I believe hardware RAID 5 is also horrible, but since the hardware hides it from the application, a hardware RAID 5 user might not care. Software RAID 1+0 works fine on Linux with 4 disks. This is the setup I use for my personal server. I will use software RAID so RAID 1+0 seems to be the obvious choice. Thanks for the advice! to clarify things a bit more. with only four drives the space difference between raid 1+0 and raid 5 isn't that much, but when you do a write you must write to two drives (the drive holding the data you are changing, and the drive that holds the parity data for that stripe, possibly needing to read the old parity data first, resulting in stalling for seek/read/calculate/seek/write since the drive moves on after the read), when you read you must read _all_ drives in the set to check the data integrity. for seek heavy workloads (which almost every database application is) the extra seeks involved can be murder on your performance. if your workload is large sequential reads/writes, and you can let the OS buffer things for you, the performance of raid 5 is much better. on the other hand, doing raid 6 (instead of raid 5) gives you extra data protection in exchange for the performance hit, but with only 4 drives this probably isn't what you are looking for. Linux software raid can do more then two disks in a mirror, so you may be able to get the added protection with raid 1 sets (again, probably not relavent to four drives), although there were bugs in this within the last six months or so, so you need to be sure your kernel is new enough to have the fix. now, if you can afford solid-state drives which don't have noticable seek times, things are completely different ;-) David Lang ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [PERFORM] With 4 disks should I go for RAID 5 or RAID 10
On Wed, 26 Dec 2007, Fernando Hevia wrote: David Lang Wrote: with only four drives the space difference between raid 1+0 and raid 5 isn't that much, but when you do a write you must write to two drives (the drive holding the data you are changing, and the drive that holds the parity data for that stripe, possibly needing to read the old parity data first, resulting in stalling for seek/read/calculate/seek/write since the drive moves on after the read), when you read you must read _all_ drives in the set to check the data integrity. Thanks for the explanation David. It's good to know not only what but also why. Still I wonder why reads do hit all drives. Shouldn't only 2 disks be read: the one with the data and the parity disk? no, because the parity is of the sort (A+B+C+P) mod X = 0 so if X=10 (which means in practice that only the last decimal digit of anything matters, very convenient for examples) A=1, B=2, C=3, A+B+C=6, P=4, A+B+C+P=10=0 if you read B and get 3 and P and get 4 you don't know if this is right or not unless you also read A and C (at which point you would get A+B+C+P=11=1=error) for seek heavy workloads (which almost every database application is) the extra seeks involved can be murder on your performance. if your workload is large sequential reads/writes, and you can let the OS buffer things for you, the performance of raid 5 is much better. Well, actually most of my application involves large sequential reads/writes. The memory available for buffering (4GB) isn't bad either, at least for my scenario. On the other hand I have seen such strong posts against RAID 5 that I hesitate to even consider it. in theory a system could get the same performance with a large sequential read/write on raid5/6 as on a raid0 array of equivalent size (i.e. same number of data disks, ignoring the parity disks) because the OS could read the entire stripe in at once, do the calculation once, and use all the data (or when writing, don't write anything until you are ready to write the entire stripe, calculate the parity and write everything once). Unfortunately in practice filesystems don't support this, they don't do enough readahead to want to keep the entire stripe (so after they read it all in they throw some of it away), they (mostly) don't know where a stripe starts (and so intermingle different types of data on one stripe and spread data across multiple stripes unnecessarily), and they tend to do writes in small, scattered chunks (rather than flushing an entire stripe's worth of data at once) those who have been around long enough to remember the days of MFM/RLL (when you could still find the real layout of the drives) may remember optimizing things to work a track at a time instead of a sector at a time. this is the exact same logic, just needing to be applied to drive stripes instead of sectors and tracks on a single drive. the issue has been raised with the kernel developers, but there's a lot of work to be done (especially in figuring out how to get all the layers the info they need in a reasonable way) Linux software raid can do more than two disks in a mirror, so you may be able to get the added protection with raid 1 sets (again, probably not relevant to four drives), although there were bugs in this within the last six months or so, so you need to be sure your kernel is new enough to have the fix. Well, here arises another doubt. Should I go for a single RAID 1+0 storing OS + Data + WAL files or will I be better off with two RAID 1 sets, separating data from OS + WAL files?
if you can afford the space, you are almost certainly better off separating the WAL from the data (I think I've seen debates about which is better, OS+data/WAL or data/OS+WAL, but very little disagreement that either is better than combining them all) David Lang now, if you can afford solid-state drives which don't have noticeable seek times, things are completely different ;-) Ha, sadly budget is very tight. :) Regards, Fernando. ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [PERFORM] With 4 disks should I go for RAID 5 or RAID 10
On Wed, 26 Dec 2007, Florian Weimer wrote: seek/read/calculate/seek/write since the drive moves on after the read), when you read you must read _all_ drives in the set to check the data integrity. I don't know of any RAID implementation that performs consistency checking on each read operation. 8-( I could see a raid 1 array not doing consistency checking (after all, it has no way of knowing what's right if it finds an error), but since raid 5/6 can repair the data I would expect them to do the checking each time. David Lang ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PERFORM] With 4 disks should I go for RAID 5 or RAID 10
On Wed, 26 Dec 2007, Mark Mielke wrote: Florian Weimer wrote: seek/read/calculate/seek/write since the drive moves on after the read), when you read you must read _all_ drives in the set to check the data integrity. I don't know of any RAID implementation that performs consistency checking on each read operation. 8-( Dave had too much egg nog... :-) Yep - checking consistency on read would eliminate the performance benefits of RAID under any redundant configuration. except for raid0, raid is primarily a reliability benefit; any performance benefit is incidental, not the primary purpose. that said, I have heard of raid1 setups where it only reads off of one of the drives, but I have not heard of higher raid levels doing so. David Lang ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PERFORM] With 4 disks should I go for RAID 5 or RAID 10
On Wed, 26 Dec 2007, Mark Mielke wrote: [EMAIL PROTECTED] wrote: Thanks for the explanation David. It's good to know not only what but also why. Still I wonder why reads do hit all drives. Shouldn't only 2 disks be read: the one with the data and the parity disk? no, because the parity is of the sort (A+B+C+P) mod X = 0 so if X=10 (which means in practice that only the last decimal digit of anything matters, very convenient for examples) A=1, B=2, C=3, A+B+C=6, P=4, A+B+C+P=10=0 if you read B and get 3 and P and get 4 you don't know if this is right or not unless you also read A and C (at which point you would get A+B+C+P=11=1=error) I don't think this is correct. RAID 5 is parity which is XOR. The property of XOR is such that it doesn't matter what the other drives are. You can write any block given either: 1) The block you are overwriting and the parity, or 2) all other blocks except for the block we are writing and the parity. Now, it might be possible that option 2) is taken more than option 1) for some complicated reasons, but it is NOT to check consistency. The array is assumed consistent until proven otherwise. I was being sloppy in explaining the reason. you are correct that for writes you don't need to read all the data, you just need the current parity block, the old data you are going to replace, and the new data to be able to calculate the new parity block (and note that even with my checksum example this would be the case). however I was addressing the point that for reads you can't do any checking until you have read in all the blocks. if you never check the consistency, how will it ever be proven otherwise? in theory a system could get the same performance with a large sequential read/write on raid5/6 as on a raid0 array of equivalent size (i.e. same number of data disks, ignoring the parity disks) because the OS could read the entire stripe in at once, do the calculation once, and use all the data (or when writing, don't write anything until you are ready to write the entire stripe, calculate the parity and write everything once). For the same number of drives, this cannot be possible. With 10 disks, on raid5, 9 disks hold data, and 1 holds parity. The theoretical maximum performance is only 9/10 of the 10/10 performance possible with RAID 0. I was saying that a 10 drive raid0 could be the same performance as a 10+1 drive raid 5 or a 10+2 drive raid 6 array. this is why I said 'same number of data disks, ignoring the parity disks'. in practice you would probably not do quite this well anyway (you have the parity calculation to make and the extra drive or two's worth of data passing over your busses), but it could be a lot closer than any implementation currently is. Unfortunately in practice filesystems don't support this, they don't do enough readahead to want to keep the entire stripe (so after they read it all in they throw some of it away), they (mostly) don't know where a stripe starts (and so intermingle different types of data on one stripe and spread data across multiple stripes unnecessarily), and they tend to do writes in small, scattered chunks (rather than flushing an entire stripe's worth of data at once) In my experience, this theoretical maximum is not attainable without significant write cache, and an intelligent controller, neither of which Linux software RAID seems to have by default.
My situation was a bit worse in that I used applications that fsync() or journalled metadata that is ordered, which forces the Linux software RAID to flush far more than it should - but the same system works very well with RAID 1+0. my statements above apply to any type of raid implementation, hardware or software. the thing that saves the hardware implementation is that the data is written to a battery-backed cache and the controller lies to the system, telling it that the write is complete, and then it does the write later. on a journaling filesystem you could get very similar results if you put the journal on a solid-state drive. but for your application, the fact that you are doing lots of fsyncs is what's killing you, because the fsync forces a lot of data to be written out, swamping the caches involved, and requiring that you wait for seeks. nothing other than a battery-backed disk cache of some sort will help here (either on the controller, or a solid-state drive holding the journal on a journaled filesystem would work) David Lang ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
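To make the XOR point concrete, a small self-contained demonstration with toy 4-byte blocks: XOR parity lets any single missing block be rebuilt from all the survivors, while a read of one data block by itself has nothing it can be checked against, which is why RAID 5 implementations do not verify parity on every read.

    from functools import reduce

    def xor_blocks(blocks):
        # byte-wise XOR of equal-length blocks
        return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))

    a = b"\x01\x02\x03\x04"
    b = b"\x10\x20\x30\x40"
    c = b"\x0f\x0e\x0d\x0c"
    parity = xor_blocks([a, b, c])            # what the parity drive would hold

    # lose any one block (say b): XOR of everything that's left rebuilds it
    rebuilt_b = xor_blocks([a, c, parity])
    assert rebuilt_b == b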
Re: [PERFORM] With 4 disks should I go for RAID 5 or RAID 10
On Wed, 26 Dec 2007, Mark Mielke wrote: [EMAIL PROTECTED] wrote: I could see a raid 1 array not doing consistancy checking (after all, it has no way of knowing what's right if it finds an error), but since raid 5/6 can repair the data I would expect them to do the checking each time. Your messages are spread across the thread. :-) RAID 5 cannot repair the data. I don't know much about RAID 6, but I expect it cannot necessarily repair the data either. It still doesn't know which drive is wrong. In any case, there is no implementation I am aware of that performs mandatory consistency checks on read. This would be silliness. sorry, raid 5 can repair data if it knows which chunk is bad (the same way it can rebuild a drive). Raid 6 does something slightly different for it's parity, I know it can recover from two drives going bad, but I haven't looked into the question of it detecting bad data. David Lang ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PERFORM] With 4 disks should I go for RAID 5 or RAID 10
On Wed, 26 Dec 2007, Mark Mielke wrote: [EMAIL PROTECTED] wrote: On Wed, 26 Dec 2007, Mark Mielke wrote: Florian Weimer wrote: seek/read/calculate/seek/write since the drive moves on after the read), when you read you must read _all_ drives in the set to check the data integrity. I don't know of any RAID implementation that performs consistency checking on each read operation. 8-( Dave had too much egg nog... :-) Yep - checking consistency on read would eliminate the performance benefits of RAID under any redundant configuration. except for raid0, raid is primarily a reliability benifit, any performance benifit is incidental, not the primary purpose. that said, I have heard of raid1 setups where it only reads off of one of the drives, but I have not heard of higher raid levels doing so. What do you mean "heard of"? Which raid system do you know of that reads all drives for RAID 1? Linux dmraid reads off ONLY the first. Linux mdadm reads off the "best" one. Neither read from both. Why should it need to read from both? What will it do if the consistency check fails? It's not like it can tell which disk is the right one. It only knows that the whole array is inconsistent. Until it gets an actual hardware failure (read error, write error), it doesn't know which disk is wrong. yes, the two linux software implementations only read from one disk, but I have seen hardware implementations where it reads from both drives, and if they disagree it returns a read error rather then possibly invalid data (it's up to the admin to figure out which drive is bad at that point). no, I don't remember which card this was. I've been playing around with things in this space for quite a while. David Lang ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PERFORM] Best way to index IP data?
On Fri, 11 Jan 2008, Greg Smith wrote: You may well need netmasks to configure your interface, but there's absolutely no need for them to identify an IP endpoint, which is all you need to identify the destination the packet is going to, and that is the most common use of IP addresses. Technically you can't ever send a packet unless you know both the endpoint and your local netmask. As the sender, you're obligated to determine if the destination is on your local LAN (in which case you send it there) or if it goes to the gateway. That's similar to a routing decision, but it's not quite--if you don't have to look in a routing table, it's not actually part of routing. you also need to know your local IP address, but there is no reason to need the netmask of the other end. my IP address is 64.81.33.126; why do you need to know my netmask? how would you find out what it is? DNS doesn't report the netmask, and it's arguably the biggest database of IP addresses around ;-) one of the biggest reasons for storing IP addresses in a SQL database is as part of log analysis. David Lang I believe this sort of detail is why subnet masks are considered required for some things even though it doesn't seem like they are needed. Regardless, the details of how the packets move aren't important to some applications, and arguing over what the RFCs do and don't require doesn't change that. -- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq ---(end of broadcast)--- TIP 6: explain analyze is your friend
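For the log-analysis use case mentioned above, a hedged sketch of what that might look like with PostgreSQL's inet type. The table and column names are made up; the <<= operator is the standard inet "is contained within or equals" test, and note that only the addresses we stored are involved, so the remote host's netmask never enters into it.

    import psycopg2

    conn = psycopg2.connect("dbname=logs")        # hypothetical database
    with conn.cursor() as cur:
        cur.execute("""
            CREATE TABLE access_log (
                ts     timestamptz NOT NULL,
                src_ip inet        NOT NULL,
                url    text
            )""")

        # how many hits came from one /19 that we picked after the fact?
        cur.execute("SELECT count(*) FROM access_log "
                    "WHERE src_ip <<= inet '64.81.32.0/19'")
        print(cur.fetchone()[0])
    conn.commit()

(The example address above, 64.81.33.126, would fall inside that network.)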
Re: [PERFORM] [OT] RAID controllers blocking one another?
On Thu, 17 Jan 2008, Sean Davis wrote: We have a machine that serves as a fileserver and a database server. Our server hosts a raid array of 40 disk drives, attached to two 3ware cards, one 9640SE-24 and one 9640SE-16. We have noticed that activity on one controller blocks access on the second controller, not only for disk-IO but also for the command line tools, which become unresponsive for the inactive controller. The controllers are sitting in adjacent PCI-express slots on a machine with dual dual-core AMD and 16GB of RAM. Has anyone else noticed issues like this? Throughput for either controller is a pretty respectable 150-200MB/s writing and somewhat faster for reading, but the "blocking" is problematic, as the machine is serving multiple purposes. I know this is off-topic, but I know lots of folks here deal with very large disk arrays; it is hard to get real-world input on machines such as these. there have been a lot of discussions on the linux-kernel mailing list over the last several months on the topic of IO to one set of drives interfering with IO to another set of drives. The soon-to-be-released 2.6.24 kernel includes a substantial amount of work in this area that (at least on initial reports) is showing significant improvements. I haven't had the time to test this out yet, so I can't add personal experience, but it's definitely something to look at on a test system. David Lang ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PERFORM] scheduler
On Tue, 22 Jan 2008, Adrian Moisey wrote: Hi Which scheduler is recommended for a box that is dedicated to running postgres? I've asked google and found no answers. Is it the OS itself? Yes, in linux. I've found that cfq or deadline is best, but I haven't seen anyone try a benchmark. it also depends on your hardware. if you have a good battery-backed cache on your I/O system you may be best off with the simple 'noop' elevator. the more complexity in your I/O stack that the kernel doesn't know about, the less likely it is to make the right decision. David Lang ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
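For reference, on a 2.6 kernel the elevator can be inspected and switched per device at runtime, roughly like this (sda is just an example device):

# the scheduler shown in [brackets] is the active one
cat /sys/block/sda/queue/scheduler
# switch to deadline (or noop, anticipatory, cfq) without rebooting
echo deadline > /sys/block/sda/queue/scheduler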
Re: [PERFORM] 8x2.5" or 6x3.5" disks
On Mon, 28 Jan 2008, Arjen van der Meijden wrote: On 28-1-2008 20:25 Christian Nicolaisen wrote: So, my question is: should I go for the 2.5" disk setup or 3.5" disk setup, and does the raid setup in either case look correct? Afaik they are about equal in speed, with the smaller ones being a bit faster in random access and the larger ones a bit faster for sequential reads/writes. I missed the initial post in this thread, but I haven't seen any 15K rpm 2.5" drives, so if you compare 10K rpm 2.5" drives with 15K rpm 3.5" drives you will see differences (depending on your workload and controller cache) David Lang ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PERFORM] migration of 7.4 to 8.1
On Wed, 12 Mar 2008, sathiya psql wrote: My question is how to migrate my database from 7.4 to 8.1 -- is it just a matter of dumping the db and restoring it in 8.1? If I do that, will it work without problems, or do I have to make some manual changes? you would need to look at the release notes for each version to see if you are using any of the features that were changed. but in return for this hassle (which you will need to do eventually, you are just deciding when) you get to use the newer version with all the speed improvements, not to mention the new features. going from 7.x to 8.0 is the most painful step. the steps from 8.0 to 8.1 to 8.2 to 8.3 are relatively minor in comparison. David Lang -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
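A minimal sketch of the dump-and-reload step itself; the usual advice is to run the newer version's pg_dump against the old server (the binary path follows Debian's layout, and the host and database names are made up):

# dump the 7.4 database using the newer pg_dump
/usr/lib/postgresql/8.1/bin/pg_dump -h oldhost mydb > mydb.sql
# load it into the new cluster, then test the application against it
psql -h newhost -d mydb -f mydb.sql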
Re: [PERFORM] migration of 7.4 to 8.1
On Wed, 12 Mar 2008, Craig Ringer wrote: sathiya psql wrote: This might be a silly question, but ... why 8.1? If you're doing a major upgrade, why not go straight to 8.3? It's been out long enough that there aren't any obvious nasty bugs, and there have been a fair few fixes and improvements since prior versions. Because I am using Debian Etch stable... I need to continue using this... in Debian Etch stable only 8.1 is available. I use Etch on my servers. The magic of etch-backports from backports.org allows me to use 8.3 without messing with anything else. It works extremely well. All I had to do to install 8.3 was add the etch-backports line to my /etc/apt/sources.list, run `apt-get update', then run: apt-get -t etch-backports install postgresql-8.3 ... then migrate the data to it and remove the old packages. See backports.org for a mirror list. I will be installing my project on other machines that also run Debian Etch stable, and I don't want to face any problems... so that is why... am I right? Personally I'd use 8.3 from backports.org, but it's of course up to you. consider the fact that after 1 year Etch is 2 major releases out of date. by the time the next stable release is made you will be running on ancient versions unless you make use of backports.org (or compile it yourself, which is usually my choice). David Lang -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
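For reference, the etch-backports entry mentioned above looks something like this in /etc/apt/sources.list (the mirror host may differ; check backports.org for the current list):

deb http://www.backports.org/debian etch-backports main contrib non-free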
Re: [PERFORM] migration of 7.4 to 8.1
On Wed, 12 Mar 2008, sathiya psql wrote: On the home page itself they were saying testing ... unstable. you are talking about the debian home page, right? then we should not use that for live, so I prefer 8.1. Debian selected the version of Postgres for Etch about a year and a half ago. At that point selecting 8.1 was a reasonable choice. Debian has a policy that they will never change the version number of a package in a stable release (they will backport many bugfixes, but not upgrade the version). As a result, 2 years from now when Postgres is on 8.5 stable (and looking at 8.6), Debian Etch will still be on 8.1. it is of course your option to stick with 8.1, but before very long the answer to most of your questions about postgres is going to be 'upgrade to a reasonably current version' (people running 7.4 and 8.0 are starting to get that answer now. 8.1 and 8.2 are recent enough that it's rare to get that now, but in a year or two that will change) for most utility software you just want it to work and don't really care about new features over a couple of years (or the project has reached the stage where it just doesn't change much over a couple of years). In these situations the Debian policy is great, you don't have to worry about new stuff messing you up. however some software has much faster development cycles. The kernel has a new release about every two months, Postgres is aiming at a one-year cycle, Apache has its own release schedule. These packages are usually pretty core to your operation and the improvements (and security fixes that are not possible to backport sanely) mean that you need to think very hard about what version of them you are going to run. On my systems I bypass Debian for such packages and compile them myself; the backports.org option allows you to avoid that hassle and get versions that are fairly well tested (like any new version, you need to do some testing yourself), just wait a month or two after a new release hits backports.org and you will be pretty safe. David Lang You can get 8.3 from backports: http://www.backports.org/ - it's a debian project to get more up to date versions for existing stable releases (they package everything exactly the same way). -- Postgresql & php tutorials http://www.designmagick.com/ -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] What is the best way to storage music files in Postgresql
On Wed, 19 Mar 2008, Andrej Ricnik-Bay wrote: On 18/03/2008, Craig Ringer <[EMAIL PROTECTED]> wrote: Isn't a 10 or 24 spindle RAID 5 array awfully likely to encounter a double disk failure (such as during the load imposed by rebuild onto a spare) ? that's why you should use raid6 (allowing for dual failures) David Lang -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] What is the best way to storage music files in Postgresql
On Tue, 18 Mar 2008, Gregory Stark wrote: <[EMAIL PROTECTED]> writes: On Wed, 19 Mar 2008, Andrej Ricnik-Bay wrote: On 18/03/2008, Craig Ringer <[EMAIL PROTECTED]> wrote: Isn't a 10 or 24 spindle RAID 5 array awfully likely to encounter a double disk failure (such as during the load imposed by rebuild onto a spare) ? that's why you should use raid6 (allowing for dual failures) You can have as many parity drives as you want with RAID 5 too. you can? I've never seen a raid 5 setup with more than a single parity drive (or even the option of having more than one drive's worth of redundancy). you can have hot-spare drives, but that's a different thing. what controller/software lets you do this? David Lang -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] What is the best way to storage music files in Postgresql
On Wed, 19 Mar 2008, Craig Ringer wrote: [EMAIL PROTECTED] wrote: you can? I've never seen a raid 5 setup with more than a single parity drive (or even the option of having more than one drive's worth of redundancy). you can have hot-spare drives, but that's a different thing. With RAID 4, where the "parity drives" are in fact dedicated to parity information, the controller could just store the parity data mirrored on more than one drive. Unfortunately write performance on RAID 4 is absolutely horrible, and a second or third parity disk would not help with that. I suppose there's nothing stopping a controller adding a second disk's worth of duplicate parity information when striping a four or more disk RAID 5 array, but I thought that's basically what RAID 6 was. just duplicating the RAID 4 or 5 parity information will not help you unless the parity drive is one of the drives that fail. raid 6 uses a different parity algorithm so that any two drives in the array can fail with no data loss. even this isn't completely error-proof. I just went through a scare with a 15-disk array where it reported 3 dead drives after a power outage. one of the dead drives ended up being the hot spare, and another drive that acted up worked well enough to let me eventually recover all the data (seek errors), but it was a very scary week while I worked through this. David Lang -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
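As a sketch of the dual-parity-plus-hot-spare layout being recommended here, using linux software raid (the device names and counts are made up, and the array in the anecdote above may well have been hardware raid):

# 15 disks total: 14 in the raid6 set (12 disks of data, 2 of parity)
# plus one hot spare that a rebuild can start onto automatically
mdadm --create /dev/md0 --level=6 --raid-devices=14 \
      --spare-devices=1 /dev/sd[b-p]1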
Re: [PERFORM] Performance increase with elevator=deadline
On Tue, 15 Apr 2008, Florian Weimer wrote: * Jeff: Using 4 of these with a dataset of about 30GB across a few files (Machine has 8GB mem) I went from around 100 io/sec to 330 changing to noop. Quite an improvement. If you have a decent controller CFQ is not what you want. I tried deadline as well and it was a touch slower. The controller is a 3ware 9550sx with 4 disks in a raid10. I'll be trying this out on the big array later today. I found it surprising this info wasn't more widespread (the use of CFQ on a good controller). 3ware might be a bit special because the controller has got very deep queues on its own, so many assumptions of the kernel I/O schedulers do not seem to apply. Toying with the kernel/controller queue depths might help, but I haven't done real benchmarks to see if it's actually a difference. A few days ago, I experienced this: On a machine with a 3ware controller, a simple getxattr call on a file in an uncontended directory took several minutes because a PostgreSQL dump process was running in the background (and some other activity of a legacy database which caused frequent fdatasync calls). Clearly, this is unacceptable, and I've since switched to the deadline scheduler, too. So far, this particular behavior hasn't occurred again. one other thing to watch out for: up until very recent kernels (2.6.23 or 2.6.24) it was possible for one very busy block device to starve other block devices. they added isolation of queues for different block devices, but I've seen reports that the isolation can end up throttling high-performance devices as a result. I haven't had time to really dig into this, but there are tuning knobs available to adjust the queue space available to different devices, and the reports show significantly better behavior on a tuned system. David Lang -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
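A minimal sketch of the related knobs (the kernel version, device name, and values below are only examples, not recommendations):

# pick the deadline elevator for every device at boot, via the kernel command line
kernel /vmlinuz-2.6.24 ro root=/dev/sda1 elevator=deadline
# or adjust the per-device request queue depth at runtime
echo 512 > /sys/block/sdb/queue/nr_requests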
Re: [PERFORM] Strange behavior: pgbench and new Linux kernels
On Thu, 17 Apr 2008, Greg Smith wrote: On Thu, 17 Apr 2008, Matthew wrote: The last message in the thread says that 2.6.25-rc6 has the problem nailed. That was a month ago. So I guess, upgrade to 2.6.25, which was released today. Ah, even more support for me to distrust everything I read. The change has flattened out things, so now the pgbench results are awful everywhere. On this benchmark 2.6.25 is the worst kernel yet:

-bash-3.00$ pgbench -S -c 4 -t 1 pgbench | grep excluding
tps = 8619.710649 (excluding connections establishing)
tps = 8664.321235 (excluding connections establishing)
tps = 8671.973915 (excluding connections establishing)
(was 18388 in 2.6.9 and 16621 in 2.6.23-3)

-bash-3.00$ pgbench -S -c 8 -t 1 pgbench | grep excluding
tps = 9011.728765 (excluding connections establishing)
tps = 9039.441796 (excluding connections establishing)
tps = 9206.574000 (excluding connections establishing)
(was 15760 in 2.6.9 and 15551 in 2.6.23-3)

-bash-3.00$ pgbench -S -c 16 -t 1 pgbench | grep excluding
tps = 7063.710786 (excluding connections establishing)
tps = 6956.266777 (excluding connections establishing)
tps = 7120.971600 (excluding connections establishing)
(was 14148 in 2.6.9 and 7311 in 2.6.23-3)

-bash-3.00$ pgbench -S -c 32 -t 1 pgbench | grep excluding
tps = 7006.311636 (excluding connections establishing)
tps = 6971.305909 (excluding connections establishing)
tps = 7002.820583 (excluding connections establishing)
(was 13647 in 2.6.9 and 7141 in 2.6.23-3)

This is what happens when the kernel developers are using results from a MySQL tool to optimize things I guess. It seems I have a lot of work ahead of me to nail down and report what's going on here. report this to the kernel list so that they know, and be ready to test fixes. the kernel developers base success or failure on the results of tests. if the only people providing test results are MySQL people, how would they know there is a problem? David Lang -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Postgres replication
On Wed, 30 Apr 2008, Gernot Schwed wrote: Hi all, looking for a HA master/master or master/slave replication solution. Our setup consists of two databases and we want to use them both for queries. Aside from pgpool II there seems to be no advisable replication solution. But the problem seems to be that we will have a single point of failure with pgpool. Slony also has the disadvantage of not providing a real failover solution. Are there any other manageable and well tested tools/setups for our scenario? I'm about to set up a similar config, and what I was intending to do is to run pgpool on both boxes and use heartbeat (from http://linux-ha.org ) to move an IP address from one box to the other. clients connect to this virtual IP and then pgpool will distribute the connections to both systems from there. David Lang -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
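A minimal sketch of the heartbeat v1 side of that setup (the node name, virtual address, and the assumption that pgpool is started by an init script named "pgpool" are all made up for illustration):

# /etc/ha.d/haresources -- node1 normally owns the virtual IP and pgpool;
# heartbeat moves both to the surviving node on failure
node1 IPaddr::192.168.0.50/24/eth0 pgpool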
Re: [PERFORM] which ext3 fs type should I use for postgresql
On Thu, 15 May 2008, Matthew Wakeling wrote: On Thu, 15 May 2008, Philippe Amelant wrote: using mkfs.ext3 I can use "-T" to tune the filesystem mkfs.ext3 -T fs_type ... fs_type are in /etc/mke2fs.conf (on debian) If you look at that file, you'd see that tuning really doesn't change that much. In fact, the only thing it does change (if you avoid "small" and "floppy") is the number of inodes available in the filesystem. Since Postgres tends to produce few large files, you don't need that many inodes, so the "largefile" option may be best. However, note that the number of inodes is a hard limit of the filesystem - if you try to create more files on the filesystem than there are available inodes, then you will get an out of space error even if the filesystem has space left. The only real benefit of having not many inodes is that you waste a little less space, so many admins are pretty generous with this setting. IIRC postgres likes to do 1M/file, which isn't very large as far as the -T setting goes. Probably of more use are some of the other settings: -m reserved-blocks-percentage - this reserves a portion of the filesystem that only root can write to. If root has no need for it, you can kill this by setting it to zero. The default is for 5% of the disc to be wasted. think twice about this. ext2/3 get slow when they fill up (they have fragmentation problems when free space gets too small); this 5% that only root can use also serves as a buffer against that as well. -j turns the filesystem into ext3 instead of ext2 - many people say that for Postgres you shouldn't do this, as ext2 is faster. for the partition with the WAL on it you may as well do ext2 (the WAL is written synchronously and sequentially so the journal doesn't help you), but for the data partition you may benefit from the journal. David Lang -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
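A minimal sketch of how those options might be combined for the two partitions discussed above (the device names are only examples):

# data partition: ext3, fewer inodes, and a small reserved-block buffer kept
# as insurance against fragmentation when the filesystem gets full
mkfs.ext3 -T largefile -m 2 /dev/sdc1
# WAL partition: ext2 is often suggested, since the WAL is written
# sequentially and synchronously and doesn't gain much from the journal
mkfs.ext2 -T largefile /dev/sdb1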
Re: [PERFORM] which ext3 fs type should I use for postgresql
On Thu, 15 May 2008, [EMAIL PROTECTED] wrote: [quoting the ext2/ext3 tuning discussion above in full] a fairly recent article on the subject: http://www.commandprompt.com/blogs/joshua_drake/2008/04/is_that_performance_i_smell_ext2_vs_ext3_on_50_spindles_testing_for_postgresql/ David Lang -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Hardware vs Software Raid
On Thu, 26 Jun 2008, Peter T. Breuer wrote: "Also sprach Merlin Moncure:" The linux software raid algorithms are highly optimized, and run on a I can confidently tell you that that's balderdash both as a Linux author and as a software RAID linux author (check the attributions in the kernel source, or look up something like "Raiding the Noosphere" on google). presumably (much faster) cpu than what the controller supports. However, there is still some extra oomph you can get out of letting the raid controller do what the software raid can't...namely delay sync for a time. There are several design problems left in software raid in the linux kernel. One of them is the need for extra memory to dispatch requests with and as (i.e. buffer heads and buffers, both). bhs should be OK since the small cache per device won't be exceeded while the raid driver itself serialises requests, which is essentially the case (it does not do any buffering, queuing, whatever .. and tries hard to avoid doing so). The need for extra buffers for the data is a problem. On different platforms different aspects of that problem are important (would you believe that on ARM mere copying takes so much cpu time that one wants to avoid it at all costs, whereas on intel it's a forgettable trivium). I also wouldn't absolutely swear that request ordering is maintained under ordinary circumstances. which flavor of linux raid are you talking about (the two main families I am aware of are the md and dm ones)? David Lang -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] syslog performance when logging big statements
On Tue, 8 Jul 2008, Tom Lane wrote: Jeff <[EMAIL PROTECTED]> writes: On Jul 8, 2008, at 8:24 AM, Achilleas Mantzios wrote: File sizes of about 3M result in actual logging output of ~ 10Mb. In this case, the INSERT *needs* 20 minutes to return. This is because the logging through syslog seems to severely slow the system. If instead, i use stderr, even with logging_collector=on, the same statement needs 15 seconds to return. In syslog.conf is the destination for PG marked with a "-"? (ie -/var/log/pg.log) which tells syslog to not sync after each line logged. That could explain a large chunk of the difference in time. I experimented with this a bit here. There definitely is an O(N^2) penalty from the repeated strchr() calls, but it doesn't really start to hurt till 1MB or so statement length. Even with that patched, syslog logging pretty much sucks performance-wise. Here are the numbers I got on a Fedora 8 workstation, testing the time to log a statement of the form SELECT length('123456...lots of data, no newlines...7890');

statement length                     1MB       10MB
CVS HEAD                             2523ms    215588ms
+ patch to fix repeated strchr       529ms     36734ms
after turning off syslogd's fsync    569ms     5881ms
PG_SYSLOG_LIMIT 1024, fsync on       216ms     2532ms
PG_SYSLOG_LIMIT 1024, no fsync       242ms     2692ms

For comparison purposes:
logging statements to stderr         155ms     2042ms
execute statement without logging    42ms      520ms

This machine is running a cheap IDE drive that caches writes, so the lack of difference between fsync off and fsync on is not too surprising --- on a machine with server-grade drives there'd be a lot more difference. (The fact that there is a difference in the 10MB case probably reflects filling the drive's write cache.) On my old HPUX machine, where fsync really works (and the syslogd doesn't seem to allow turning it off), the 1MB case takes 195957ms with the strchr patch, 22922ms at PG_SYSLOG_LIMIT=1024. So there's a fairly clear case to be made for fixing the repeated strchr, but I also think that there's a case for jacking up PG_SYSLOG_LIMIT. As far as I can tell the current value of 128 was chosen *very* conservatively without thought for performance: http://archives.postgresql.org/pgsql-hackers/2000-05/msg01242.php At the time we were looking at evidence that the then-current Linux syslogd got tummyache with messages over about 1KB: http://archives.postgresql.org/pgsql-hackers/2000-05/msg00880.php Some experimentation with the machines I have handy now says that

Fedora 8:             truncates messages at 2KB (including syslog's header)
HPUX 10.20 (ancient): ditto
Mac OS X 10.5.3:      drops messages if longer than about 1900 bytes

So it appears to me that setting PG_SYSLOG_LIMIT = 1024 would be perfectly safe on current systems (and probably old ones too), and would give at least a factor of two speedup for logging long strings --- more like a factor of 8 if syslogd is fsync'ing. Comments? Anyone know of systems where this is too high? Perhaps we should make that change only in HEAD, not in the back branches, or crank it up only to 512 in the back branches? with linux ext2/ext3 filesystems I have seen similar problems when the syslog starts getting large. there are several factors here:
1. fsync after each write unless you have "-" in syslog.conf (only available on linux AFAIK)
2. ext2/ext3 tend to be very inefficient when doing appends to large files.
each write requires that the syslog daemon seek to the end of the file (because something else may have written to the file in the meantime) and with the small block sizes and chaining of indirect blocks this can start to be painful when logfiles get up into the MB range. note that you see this same problem when you start to get lots of files in one directory as well. even if you delete a lot of files the directory itself is still large and this can cause serious performance problems. other filesystems are much less sensitive to file (and directory) sizes. my suggestion would be to first make sure you are doing async writes to syslog, and then try putting the logfiles on different filesystems to see how they differ. personally I use XFS most of the time where I expect lots of files or large files. David Lang -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
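A minimal sketch of the async-syslog setup being suggested (the facility and log path are only examples):

# postgresql.conf
log_destination = 'syslog'
syslog_facility = 'LOCAL0'

# /etc/syslog.conf -- the leading "-" tells syslogd not to fsync after each line
local0.*    -/var/log/postgresql.log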