>From: Mikael Carneholm <[EMAIL PROTECTED]>
>Sent: Jul 16, 2006 6:52 PM
>To: pgsql-performance@postgresql.org
>Subject: [PERFORM] RAID stripe size question
>
>I have finally gotten my hands on the MSA1500 that we ordered some time
>ago. It has 28 x 10K 146Gb drives,
>
Unless I'm missing something, the only FC or SCSI HDs of ~147GB capacity are 
15Krpm, not 10Krpm (unless they are old?).
I'm not just being pedantic.  The correct, let alone optimal, answer to your 
question depends on your exact HW characteristics as well as your SW config and 
your usage pattern.
15Krpm HDs will have average access times of 5-6ms; 10Krpm ones, 7-8ms.
Most modern HDs in this class will do ~60MB/s on inner tracks, ~75MB/s on 
average, and ~90MB/s on outer tracks.
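A quick sanity check on those access times (a sketch in Python; the average 
seek figures are typical for this drive class, not taken from any particular 
datasheet):

    # Average access time = average seek + average rotational latency.
    # Rotational latency averages half a revolution.
    def avg_access_ms(rpm, avg_seek_ms):
        rotational_ms = 0.5 * 60000.0 / rpm
        return avg_seek_ms + rotational_ms

    print(avg_access_ms(15000, 3.8))  # ~5.8ms -> the 5-6ms figure
    print(avg_access_ms(10000, 4.9))  # ~7.9ms -> the 7-8ms figure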

If you are doing OLTP-like things, you are more sensitive to latency than most 
and should use the absolute lowest-latency HDs available within your budget.  
The current latency best case is 15Krpm FC HDs.
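In IOPS terms (again a sketch; the numbers come straight from the access times 
above):

    # Random IOPS per spindle is roughly the inverse of avg access time.
    for name, access_ms in (("15Krpm", 5.5), ("10Krpm", 7.5)):
        print(name, round(1000.0 / access_ms), "random IOPS/disk")
    # ~182 vs ~133: the 15Krpm drive services ~35% more random
    # requests per second, which is what OLTP latency lives and dies on.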


>currently grouped as 10 (for wal) + 18 (for data). There's only one controller 
>(an emulex), but I hope
>performance won't suffer too much from that. Raid level is 0+1,
>filesystem is ext3. 
>
I strongly suspect that having only 1 controller is an I/O chokepoint with 28 HDs.

The 28 HDs set up as above in two RAID 10 sets give 5 and 9 effective spindles:
~75MB/s * 5 = ~375MB/s for the WAL set, and ~75MB/s * 9 = ~675MB/s for the data set.
If both sets are to run at peak average speed, the Emulex would have to be able 
to handle ~1050MB/s on average.
It is doubtful the 1 Emulex can do this.
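The arithmetic, spelled out (a sketch; disk counts and the ~75MB/s average 
per-disk figure are from above, and the model follows the one used here, where 
RAID 10 streaming bandwidth scales with mirror pairs, i.e. half the drives):

    per_disk_mbps = 75                     # avg sustained MB/s per drive

    wal_bw  = (10 // 2) * per_disk_mbps    # 5 pairs -> ~375 MB/s
    data_bw = (18 // 2) * per_disk_mbps    # 9 pairs -> ~675 MB/s
    print(wal_bw, data_bw, wal_bw + data_bw)   # 375 675 1050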

In order to handle this level of bandwidth, a RAID controller must aggregate 
multiple FC, SCSI, or SATA streams as well as do any RAID 5 checksumming etc. 
that is required.
Very, very few RAID controllers can do >= 1GB/s.
One thing that helps greatly with bursty IO patterns is to make your battery-
backed RAID cache as large as you possibly can.  Even multiple GBs of BBC can be 
worth it.  Another reason to have multiple controllers ;-)

Then there is the question of the BW of the bus that the controller is plugged 
into.
~800MB/s is the real-world max to be gotten from a 64-bit 133MHz PCI-X channel.
PCI-E channels are usually good for 1/10 their rated speed in bps as Bps.
So a PCI-E x4 10Gbps bus can be counted on for 1GB/s, a PCI-E x8 for 2GB/s, etc.
At present I know of no RAID controller that can single-handedly saturate a 
PCI-E x4 or greater bus.
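To put numbers on the bus side (a sketch using the rules of thumb above):

    # PCI-X: 64 bits (8 bytes) * 133MHz ~= 1064MB/s theoretical;
    # ~800MB/s is a realistic ceiling for sustained transfers.
    pcix_theoretical_mbps = 8 * 133    # ~1064
    pcix_realistic_mbps = 800

    # PCI-E rule of thumb: ~1/10 of the rated Gbps usable as GB/s
    # (2.5Gbps/lane with 8b/10b encoding -> ~250MB/s per lane).
    def pcie_mbps(lanes):
        return lanes * 250

    print(pcie_mbps(4), pcie_mbps(8))  # 1000 2000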

...and we haven't even touched on OS, SW, and usage pattern issues.

Bottom line is that the IO chain is only as fast as its slowest component.


>Now to the interesting part: would it make sense to use different stripe
>sizes on the separate disk arrays? 
>
The short answer is Yes.
WALs are basically appends that are written in bursts of your chosen log chunk 
size and that are almost never read afterwards.  Big DB pages and big RAID 
stripes make sense for WALs.

Tables with OLTP-like characteristics need smaller DB pages and stripes to 
minimize latency issues (although locality of reference can make the optimum 
stripe size larger).

Tables with Data Mining like characteristics usually work best with larger DB 
page sizes and RAID stripe sizes.

OS and FS overhead can make things more complicated.  So can DB layout and 
access pattern issues.
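A toy model of the stripe-size tradeoff (illustrative only; the request sizes 
are examples, not your workload).  Each stripe chunk a request spans can mean 
another spindle seeking on its behalf, but spanned spindles also transfer in 
parallel, which is the tension the paragraphs above describe:

    import math

    # Number of stripe-sized chunks one request spans
    # (hence, up to that many distinct disks touched).
    def chunks_spanned(req_kb, stripe_kb):
        return math.ceil(req_kb / stripe_kb)

    for stripe_kb in (16, 64, 256):
        oltp = chunks_spanned(8, stripe_kb)      # 8KB page, OLTP read
        scan = chunks_spanned(1024, stripe_kb)   # 1MB DM-style read
        print(stripe_kb, oltp, scan)
    # With a 16KB stripe a 1MB read spans 64 chunks; with a 256KB
    # stripe, only 4.  The 8KB OLTP read stays on one disk either way.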

Side note: a 10 HD RAID 10 seems a bit much for WAL.  Do you really need 
~375MB/s of IO on average to your WAL more than you need that IO capacity for 
other tables?
If WAL IO needs to be very high, I'd suggest getting an SSD or SSD-like device 
that fits your budget and having said device async-mirror to HD. 

Bottom line is to optimize your RAID stripe sizes =after= you optimize your OS, 
FS, and pg design for best IO for your usage pattern(s).

Hope this helps,
Ron
