Re: [PERFORM] performance impact of non-C locale

2008-09-11 Thread Peter Eisentraut

Axel Rau wrote:

Some ERP software requires a change of my pgsql cluster from
  locale C, encoding UTF-8
to
  locale de_DE.UTF-8, encoding UTF-8.

Most of my databases have only ASCII text data (the 7-bit range of UTF-8)
in the text columns.

Does the above change influence index performance on such columns?


Yes.

Does postmaster keep track of any multibyte characters being inserted in
such columns, so that the planner can adapt?


No.


What other performance impacts can be expected?


The performance impact is mainly with string comparisons and sorts.  I 
suggest you run your own tests to find out what is acceptable in your 
scenario.
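
For reference, here is a minimal sketch of such a test. It assumes
PostgreSQL 8.x, where the locale is fixed per cluster at initdb time, and
uses made-up paths, ports, and a table t with a text column textcol loaded
identically into both clusters:

initdb -D /tmp/pg_c --locale=C -E UTF8
initdb -D /tmp/pg_de --locale=de_DE.UTF-8 -E UTF8
pg_ctl -D /tmp/pg_c -o "-p 5433" -l /tmp/pg_c.log start
pg_ctl -D /tmp/pg_de -o "-p 5434" -l /tmp/pg_de.log start

# Compare a locale-sensitive sort and an index build in each cluster:
time psql -p 5433 -d test -c "SELECT count(*) FROM (SELECT textcol FROM t ORDER BY textcol) s;"
time psql -p 5434 -d test -c "SELECT count(*) FROM (SELECT textcol FROM t ORDER BY textcol) s;"
time psql -p 5433 -d test -c "CREATE INDEX t_textcol_idx ON t (textcol);"
time psql -p 5434 -d test -c "CREATE INDEX t_textcol_idx ON t (textcol);"

The difference between the two clusters is the strcoll() overhead that the
de_DE.UTF-8 locale adds to every comparison and sort.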




Re: [PERFORM] performance impact of non-C locale

2008-09-11 Thread Axel Rau


On 11.09.2008, at 11:29, Peter Eisentraut wrote:



What other performance impacts can be expected?


The performance impact is mainly with string comparisons and sorts.   
I suggest you run your own tests to find out what is acceptable in  
your scenario.
I'm not yet convinced to switch to a non-C locale. Is the following
intended behavior:

With lc_ctype C:          select lower('ÄÖÜ'); => ÄÖÜ
With lc_ctype en_US.utf8: select lower('ÆÅË'); => æåë
? (Both have server encoding UTF8.)

Axel
---




Re: [PERFORM] performance impact of non-C locale

2008-09-11 Thread Peter Eisentraut

Axel Rau wrote:
I'm not yet convinced to switch to a non-C locale. Is the following
intended behavior:

With lc_ctype C:          select lower('ÄÖÜ'); => ÄÖÜ
With lc_ctype en_US.utf8: select lower('ÆÅË'); => æåë
? (Both have server encoding UTF8.)


I would expect exactly that: in the C locale, lower() case-folds only
single-byte ASCII characters, so multibyte characters pass through unchanged.



[PERFORM] Choosing a filesystem

2008-09-11 Thread Laszlo Nagy
I'm about to buy a new server. It will be a Xeon system with two 
processors (4 cores per processor) and  16GB RAM. Two RAID extenders 
will be attached to an Intel s5000 series motherboard, providing 12 
SAS/Serial ATA connectors.


The server will run FreeBSD 7.0, PostgreSQL 8, apache, PHP, mail server, 
dovecot IMAP server and background programs for database maintenance. On 
our current system, I/O performance for PostgreSQL is the biggest 
problem, but sometimes all CPUs are at 100%. Number of users using this 
system:


PostgreSQL:  30 connections
Apache: 30 connections
IMAP server: 15 connections

The databases are mostly OLTP, but the background programs are creating
historical data and statistical data continuously, and sometimes web site
visitors/search engine robots run searches in bigger tables (with
3 million+ records).


There is an expert at the company that sells the server, and he
recommended that I use SAS disks for the base system at least. I would
like to use many SAS disks, but they are just too expensive. So the
basic system will reside on a RAID 1 array, created from two SAS disks
spinning at 15,000 rpm. I will buy 10 Seagate Barracuda 320GB
SATA (ES 7200) disks for the rest.


The expert told me to use RAID 5 but I'm hesitating. I think that RAID 
1+0 would be much faster, and I/O performance is what I really need.


I would like to put the WAL file on the SAS disks to improve 
performance, and create one big RAID 1+0 disk for the data directory. 
But maybe I'm completely wrong. Can you please advise how to create 
logical partitions? The hardware is capable of handling different types 
of RAID volumes on the same set of disks. For example, a smaller RAID 0 
for indexes and a bigger RAID 5 etc.


If you need more information about the database, please ask. :-)

Thank you very much,

  Laszlo




Re: [PERFORM] Choosing a filesystem

2008-09-11 Thread Andrew Sullivan
On Thu, Sep 11, 2008 at 06:29:36PM +0200, Laszlo Nagy wrote:

> The expert told me to use RAID 5 but I'm hesitating. I think that RAID 1+0 
> would be much faster, and I/O performance is what I really need.

I think you're right.  I think it's a big mistake to use RAID 5 in a
database server where you're hoping for reasonable write performance.
In theory RAID 5 ought to be fast for reads, but I've never seen it
work that way.

> I would like to put the WAL file on the SAS disks to improve performance, 
> and create one big RAID 1+0 disk for the data directory. But maybe I'm 
> completely wrong. Can you please advise how to create logical partitions? 

I would listen to yourself before you listen to the expert.  You sound
right to me :)

A


-- 
Andrew Sullivan
[EMAIL PROTECTED]
+1 503 667 4564 x104
http://www.commandprompt.com/



Re: [PERFORM] Choosing a filesystem

2008-09-11 Thread Matthew Wakeling

On Thu, 11 Sep 2008, Laszlo Nagy wrote:
So the basic system will reside on a RAID 1 array, created from two SAS 
disks spinning at 15 000 rpm. I will buy 10 pieces of Seagate Barracuda 
320GB SATA (ES 7200) disks for the rest.


That sounds good. Put RAID 1 on the pair, and RAID 1+0 on the rest. It'll 
be pretty good. Put the OS and the WAL on the pair, and the database on 
the large array.


However, one of the biggest things that will improve your performance 
(especially in OLTP) is to use a proper RAID controller with a 
battery-backed-up cache.
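
As a sketch of how that OS+WAL / database split is usually done on 8.x
(the paths are hypothetical; the data directory is assumed to live on the
big array, the SAS pair to be mounted at /mnt/saspair, and the server must
be stopped while pg_xlog is moved):

pg_ctl -D /data/pgsql stop
mv /data/pgsql/pg_xlog /mnt/saspair/pg_xlog
ln -s /mnt/saspair/pg_xlog /data/pgsql/pg_xlog
pg_ctl -D /data/pgsql start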


Matthew

--
X's book explains this very well, but, poor bloke, he did the Cambridge Maths 
Tripos...   -- Computer Science Lecturer




Re: [PERFORM] Choosing a filesystem

2008-09-11 Thread Kenneth Marshall
On Thu, Sep 11, 2008 at 06:18:37PM +0100, Matthew Wakeling wrote:
> On Thu, 11 Sep 2008, Laszlo Nagy wrote:
>> So the basic system will reside on a RAID 1 array, created from two SAS 
>> disks spinning at 15 000 rpm. I will buy 10 pieces of Seagate Barracuda 
>> 320GB SATA (ES 7200) disks for the rest.
>
> That sounds good. Put RAID 1 on the pair, and RAID 1+0 on the rest. It'll 
> be pretty good. Put the OS and the WAL on the pair, and the database on the 
> large array.
>
> However, one of the biggest things that will improve your performance 
> (especially in OLTP) is to use a proper RAID controller with a 
> battery-backed-up cache.
>
> Matthew
>

But remember that putting the WAL on a separate drive(set) will only
help if you do not have competing I/O, such as system logging or paging,
going to the same drives. That turns your fast sequential I/O into
random I/O with the accompanying 10x or more performance decrease.
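
A quick way to check for such competing traffic (a sketch, assuming Linux
with the sysstat tools installed and the WAL pair visible as sdc):

iostat -x sdc 5
# Watch the write and await columns while the database is busy: writes
# coming from anything other than the WAL (syslog, paging, etc.) show up
# here and turn the WAL's sequential writes into seeks.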

Ken



Re: [PERFORM] Choosing a filesystem

2008-09-11 Thread Kevin Grittner
>>> Kenneth Marshall <[EMAIL PROTECTED]> wrote:
> On Thu, Sep 11, 2008 at 06:18:37PM +0100, Matthew Wakeling wrote:
>> On Thu, 11 Sep 2008, Laszlo Nagy wrote:
>>> So the basic system will reside on a RAID 1 array, created from two
>>> SAS disks spinning at 15 000 rpm. I will buy 10 pieces of Seagate
>>> Barracuda 320GB SATA (ES 7200) disks for the rest.
>>
>> That sounds good. Put RAID 1 on the pair, and RAID 1+0 on the rest.
>> It'll be pretty good. Put the OS and the WAL on the pair, and the
>> database on the large array.
>>
>> However, one of the biggest things that will improve your performance
>> (especially in OLTP) is to use a proper RAID controller with a
>> battery-backed-up cache.
>
> But remember that putting the WAL on a separate drive(set) will only
> help if you do not have competing I/O, such as system logging or
> paging, going to the same drives. This turns your fast sequential I/O
> into random I/O with the accompanying 10x or more performance decrease.
 
Unless you have a good RAID controller with battery-backed-up cache.
 
-Kevin



Re: [PERFORM] Choosing a filesystem

2008-09-11 Thread Laszlo Nagy



going to the same drives. This turns your fast sequential I/O into
random I/O with the accompanying 10x or more performance decrease.

Unless you have a good RAID controller with battery-backed-up cache.

All right. :-) This is what I'll have:

Boxed Intel Server Board S5000PSLROMB with 8-port SAS ROMB card
(supports 45nm processors, Harpertown and Wolfdale-DP)
Intel® RAID Activation Key AXXRAK18E enables full intelligent SAS RAID
on S5000PAL, S5000PSL, SR4850HW4/M, SR6850HW4/M. RoHS Compliant.
512 MB 400MHz DDR2 ECC Registered CL3 DIMM Single Rank, x8 (for
S5000PSLROMB)
6-drive SAS/SATA backplane with expander (requires 2 SAS ports) for
SC5400 and SC5299 (two pieces)

5410 Xeon 2.33 GHz / 1333 FSB / 12MB, boxed, passive cooling / 80W (2 pieces)
2048 MB 667MHz DDR2 ECC Fully Buffered CL5 DIMM Dual Rank, x8 (8 pieces)

SAS disks will be:  146.8 GB, SAS 3G, 15,000 RPM, 16 MB cache (two pieces)
SATA disks will be: HDD Server SEAGATE Barracuda ES 7200.1
(320GB, 16MB, SATA II-300) (10 pieces)


I cannot spend more money on this computer, but since you are all
talking about battery backup, I'll try to get money from the management
and buy this:


Intel® RAID Smart Battery AXXRSBBU3, optional battery backup for use
with AXXRAK18E and SRCSAS144E. RoHS Compliant.



This server will also be an IMAP server, web server etc. so I'm 100% 
sure that the SAS disks will be used for logging. I have two spare 200GB 
SATA disks here in the office but they are cheap ones designed for 
desktop computers. Is it okay to dedicate these disks for the WAL file 
in RAID1? Will it improve performance? How much trouble would it cause 
if the WAL file goes wrong? Should I just put the WAL file on the RAID 
1+0 array?


Thanks,

 Laszlo




Re: [PERFORM] Choosing a filesystem

2008-09-11 Thread Scott Marlowe
On Thu, Sep 11, 2008 at 10:29 AM, Laszlo Nagy <[EMAIL PROTECTED]> wrote:
> I'm about to buy a new server. It will be a Xeon system with two processors
> (4 cores per processor) and  16GB RAM. Two RAID extenders will be attached
> to an Intel s5000 series motherboard, providing 12 SAS/Serial ATA
> connectors.
>
> The server will run FreeBSD 7.0, PostgreSQL 8, apache, PHP, mail server,
> dovecot IMAP server and background programs for database maintenance. On our
> current system, I/O performance for PostgreSQL is the biggest problem, but
> sometimes all CPUs are at 100%. Number of users using this system:

100% what? sys? user? iowait? If it's still iowait, then the newer,
bigger, faster RAID should really help.

> PostgreSQL:  30 connections
> Apache: 30 connections
> IMAP server: 15 connections
>
> The databases are mostly OLTP, but the background programs are creating
> historical data and statistic data continuously, and sometimes web site
> visitors/serach engine robots run searches in bigger tables (with 3million+
> records).

This might be a good application for a setup where you Slony-replicate
to another server, then run your I/O intensive processes against the
slave.

> There is an expert at the company who sells the server, and he recommended
> that I use SAS disks for the base system at least. I would like to use many
> SAS disks, but they are just too expensive. So the basic system will reside
> on a RAID 1 array, created from two SAS disks spinning at 15 000 rpm. I will
> buy 10 pieces of Seagate Barracuda 320GB SATA (ES 7200) disks for the rest.

SAS = a bit faster, and better at parallel work.  However, short
stroking 7200 RPM SATA drives on the fastest parts of the platters can
get you close to SAS territory for a fraction of the cost, plus you
can then store backups etc on the rest of the drives at night.

So, you're gonna put the OS on RAID 1, and pgsql on the rest...  Makes
sense.  Consider setting up another RAID 1 for the pg_clog directory.

> The expert told me to use RAID 5 but I'm hesitating. I think that RAID 1+0
> would be much faster, and I/O performance is what I really need.

The expert is most certainly wrong for an OLTP database.  If your RAID
controller can't run RAID-10 quickly compared to RAID-5 then it's a
crap card, and you need a better one.  Or put it into JBOD and let the
OS handle the RAID-10 work.  Or split it RAID-1 sets on the
controller, RAID-0 in the OS.

> I would like to put the WAL file on the SAS disks to improve performance,

Actually, the WAL doesn't need SAS for good performance really.
Except for the 15K.6 Seagate Cheetahs, most decent SATA drives are
within a few percentage of SAS drives for sequential write / read
speed, which is what the WAL basically does.

> and create one big RAID 1+0 disk for the data directory. But maybe I'm
> completely wrong. Can you please advise how to create logical partitions?
> The hardware is capable of handling different types of RAID volumes on the
> same set of disks. For example, a smaller RAID 0 for indexes and a bigger
> RAID 5 etc.

Avoid RAID-5 on OLTP.

Now, if you have a slony slave for the aggregate work stuff, and
you're doing big reads and writes, RAID-5 on a large SATA set may be a
good and cost effective solution.



Re: [PERFORM] Choosing a filesystem

2008-09-11 Thread Scott Marlowe
On Thu, Sep 11, 2008 at 11:47 AM, Laszlo Nagy <[EMAIL PROTECTED]> wrote:
> I cannot spend more money on this computer, but since you are all talking
> about battery back up, I'll try to get money from the management and buy
> this:
>
> Intel(R) RAID Smart Battery AXXRSBBU3, optional battery backup for use with
> AXXRAK18E and SRCSAS144E.  RoHS Compliant.

Sacrifice a couple of SAS drives to get that.

I'd rather have all SATA drives and a BBU than SAS without one.



Re: [PERFORM] Effects of setting linux block device readahead size

2008-09-11 Thread Scott Carey
Hmm, I would expect this tunable to potentially be rather file system
dependent, and potentially RAID controller dependent.  The test was using
ext2, perhaps the others automatically prefetch or read ahead?   Does it
vary by RAID controller?

Well I went and found out, using ext3 and xfs.  I have about 120+ data
points but here are a few interesting ones before I compile the rest and
answer a few other questions of my own.

1:  readahead does not affect "pure" random I/O -- there seems to be a
heuristic trigger -- a single process or file probably has to request a
sequence of linear I/O of some size to trigger it.  I set it to over 64MB of
read-ahead and random iops remained the same to prove this.
2:  File system matters more than you would expect.  XFS sequential
transfers when readahead was tuned had TWICE the sequential throughput of
ext3, both for a single reader and 8 concurrent readers on 8 different
files.
3:  The RAID controller and its configuration make a pretty significant
difference as well.

Hardware:
12 7200RPM SATA (Seagate) in raid 10 on 3Ware 9650 (only ext3)
12 7200RPM SATA ('nearline SAS' : Seagate ES.2) on PERC 6 in raid 10 (ext3,
xfs)
I also have some results with PERC raid 10 with 4x 15K SAS, not reporting in
this message though


Testing process:
All tests begin with
#sync; echo 3 > /proc/sys/vm/drop_caches;
followed by
#blockdev --setra XXX /dev/sdb
Even though FIO claims that it issues reads that don't go to cache, the
read-ahead DOES go to the file system cache, and so one must drop them to
get consistent results unless you disable the read-ahead.  Even if you are
reading more than 2x the physical RAM, that first half of the test is
distorted.  By flushing the cache first my results became consistent within
about +-2%.
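
Put together, each run looked roughly like the following sketch (the
wrapper script itself and the assumption that the array is /dev/sdb are
mine, not part of the original setup):

#!/bin/sh
# usage: ./run_test.sh <readahead-in-512B-sectors> <fio-jobfile>
RA="$1"; JOB="$2"
sync
echo 3 > /proc/sys/vm/drop_caches   # drop the page cache for consistent results
blockdev --setra "$RA" /dev/sdb     # set the device readahead
fio "$JOB"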

Tests
-- fio, read 8 files concurrently, sequential read profile, one process per
file:
[seq-read8]
rw=read
; this will be total of all individual files per process
size=8g
directory=/data/test
fadvise_hint=0
blocksize=8k
direct=0
ioengine=sync
iodepth=1
numjobs=8
; this is number of files total per process
nrfiles=1
runtime=1m

-- fio, read one large file sequentially with one process
[seq-read]
rw=read
; this will be total of all individual files per process
size=64g
directory=/data/test
fadvise_hint=0
blocksize=8k
direct=0
ioengine=sync
iodepth=1
numjobs=1
; this is number of files total per process
nrfiles=1
runtime=1m

-- 'dd' in a few ways:
Measure direct to partition / disk read rate at the start of the disk:
'dd if=/dev/sdb of=/dev/null ibs=24M obs=64K'
Measure direct to partition / disk read rate near the end of the disk:
'dd if=/dev/sdb1 of=/dev/null ibs=24M obs=64K skip=160K'
Measure direct read of the large file used by the FIO one sequential file
test:
'dd if=/data/test/seq-read.1.0 of=/dev/null ibs=32K obs=32K'

The dd parameters for block sizes were chosen with much experimentation
to get the best results.


Results:
I've got a lot of results, I'm only going to put a few of them here for now
while I investigate a few other things (see the end of this message)
Preliminary summary:

PERC 6, ext3, full partition.
dd beginning of disk :  642MB/sec
dd end of disk: 432MB/sec
dd large file (readahead 49152): 312MB/sec
-- maximum expected sequential capabilities above?

fio: 8 concurrent readers and 1 concurrent reader results
readahead is in 512 byte blocks, sequential transfer rate in MiB/sec as
reported by fio.

readahead  |  8 conc read rate  |  1 conc read rate
    49152  |  311               |  314
    16384  |  312               |  312
    12288  |  304               |  309
     8192  |  292               |
     4096  |  264               |
     2048  |  211               |
     1024  |  162               |  302
      512  |  108               |
      256  |   81               |  300
        8  |   38               |

Conclusion: on this array, going up to 12288 (6MB) readahead makes a huge
impact on concurrent sequential reads -- that is 1MB per RAID slice (6
slices, 12 disks in RAID 10). It has almost no impact at all on a single
sequential read alone; the OS or the RAID controller are dealing with
that case just fine.

But, how much of the above effect is ext3?  How much is it the RAID card?
At the top end, the sequential rate for both concurrent and single
sequential access is in line with what dd can get going through ext3.  But
it is not even close to what you can get going right to the device and
bypassing the file system.

Let's try a different RAID card first.  The disks aren't exactly the same,
and there is no guarantee that the file is positioned near the beginning or
end, but I've got another 12 disk RAID 10, using a 3Ware 9650 card.

Results, as above -- don't conclude this card is faster, the files may have
just been closer to the front of the partition.
dd, beginning of disk: 522MB/sec
dd, end of disk array: 412MB/sec
dd, file read via file system (readahead 49152): 391MB/sec

readahead  |  8 conc read rate  |  1 conc read rate
    49152  |  343               |  392
    16384  |  349               |  379
    12288  |  348               |  387
     8192  |  344               |
     6144  |                    |  376
     4096  |  340               |
     2048  |  319               |
     1024  |  284

Re: [PERFORM] Effects of setting linux block device readahead size

2008-09-11 Thread James Mansion

Greg Smith wrote:
The point I was trying to make there is that even under impossibly 
optimal circumstances, you'd be hard pressed to blow out the disk's 
read cache with seek-dominated data even if you read a lot at each 
seek point.  That idea didn't make it from my head into writing very 
well though.


Isn't there a bigger danger in blowing out the cache on the controller 
and causing premature pageout of its dirty pages?


If you could get the readahead to work on the drive and not return data 
to the controller, that might be dandy, but I'm sceptical.


James




Re: [PERFORM] Choosing a filesystem

2008-09-11 Thread Craig James

Laszlo Nagy wrote:
I cannot spend more money on this computer, but since you are all 
talking about battery back up, I'll try to get money from the management 
and buy this:


Intel® RAID Smart Battery AXXRSBBU3, optional battery backup for use
with AXXRAK18E and SRCSAS144E.  RoHS Compliant.


The battery-backup is really important.  You'd be better off to drop down to 8 
disks in a RAID 1+0 and put everything on it, if that meant you could use the 
savings to get the battery-backed RAID controller.  The performance improvement 
of a BB cache is amazing.

Based on advice from this group, we configured our systems with a single
8-disk RAID 1+0 with a battery-backed cache. It holds the OS, WAL and
database, and it is VERY fast. We're very happy with it.

Craig



Re: [PERFORM] Choosing a filesystem

2008-09-11 Thread Greg Smith

On Thu, 11 Sep 2008, Laszlo Nagy wrote:


The expert told me to use RAID 5 but I'm hesitating.


Your "expert" isn't--at least when it comes to database performance. 
Trust yourself here, you've got the right general idea.


But I can't make any sense out of exactly how your disks are going to be 
connected to the server with that collection of hardware.  What I can tell 
is that you're approaching that part backwards, probably under the 
influence of the vendor you're dealing with, and since they don't 
understand what you're doing you're stuck sorting that out.


If you want your database to perform well on writes, the first thing you 
do is select a disk controller that performs well, has a well-known stable 
driver for your OS, has a reasonably large cache (>=256MB), and has a 
battery backup on it.  I don't know anything about how well this Intel 
RAID performs under FreeBSD, but you should check that if you haven't 
already.  From the little bit I read about it I'm concerned if it's fast 
enough for as many drives as you're using.  The wrong disk controller will 
make a slow mess out of any hardware you throw at it.


Then, you connect as many drives to the caching controller as you can for 
the database.  OS drives can connect to another controller (like the ports 
on the motherboard), but you shouldn't use them for either the database 
data or the WAL.  That's what I can't tell from your outline of the server 
configuration; if it presumes a couple of the SATA disks holding database 
data are using the motherboard ports, you need to stop there and get a 
larger battery backed caching controller.


If you're on a limited budget and the choice is between more SATA disks or
fewer SAS disks, get more of the SATA ones.  Once you've filled the
available disk slots on the controller or run out of room in the chassis,
if there's money left over then you can go back and analyze whether
replacing some of those with SAS disks makes sense--as long as they're
still connected to a caching controller.  I don't know what flexibility
the "SAS/SATA backplane" you listed has here.


You've got enough disks that it may be worthwhile to set aside two of them 
to be dedicated WAL volumes.  That call depends on the balance of OLTP 
writes (which are more likely to take advantage of that) versus the 
reports scans (which would prefer more disks in the database array), and 
the only way you'll know for sure is to benchmark both configurations with 
something resembling your application.  Since you should always do stress 
testing on any new hardware anyway before it goes into production, that's 
a good time to run comparisons like that.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PERFORM] Effects of setting linux block device readahead size

2008-09-11 Thread Scott Carey
Drives have their own read-ahead in the firmware.  Many can keep track of 2
or 4 concurrent file accesses.  A few can keep track of more.  This also
plays in with the NCQ or SCSI command queuing implementation.

Consumer drives will often read ahead much more than server drives
optimized for I/Os per second.
The difference in read-ahead sensitivity between the two setups I tested
may be due to one setup using nearline SAS (SATA tuned for I/Os per second
using SAS firmware) and the other using consumer SATA.
For example, here is one "nearline SAS" style server-tuned drive versus a
consumer-tuned one:
http://www.storagereview.com/php/benchmark/suite_v4.php?typeID=10&testbedID=4&osID=6&raidconfigID=1&numDrives=1&devID_0=354&devID_1=348&devCnt=2

The Linux readahead setting is _definitely_ in the kernel, definitely uses
and fills the page cache, and from what I can gather, simply issues extra
I/O's to the hardware beyond the last one requested by an app in certain
situations.  It does not make your I/O request larger, it just queues an
extra I/O following your request.
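
For reference, a sketch of inspecting and changing that setting (assumes
Linux and a device at /dev/sdb; the two interfaces express the same value
in 512-byte sectors and in KiB respectively):

blockdev --getra /dev/sdb               # current readahead, in 512-byte sectors
cat /sys/block/sdb/queue/read_ahead_kb  # the same setting, in KiB
blockdev --setra 12288 /dev/sdb         # e.g. 6MB readahead (12288 sectors)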

On Thu, Sep 11, 2008 at 12:54 PM, James Mansion <
[EMAIL PROTECTED]> wrote:

> Greg Smith wrote:
>
>> The point I was trying to make there is that even under impossibly optimal
>> circumstances, you'd be hard pressed to blow out the disk's read cache with
>> seek-dominated data even if you read a lot at each seek point.  That idea
>> didn't make it from my head into writing very well though.
>>
>>  Isn't there a bigger danger in blowing out the cache on the controller
> and causing premature pageout of its dirty pages?
>
> If you could get the readahead to work on the drive and not return data to
> the controller, that might be dandy, but I'm sceptical.
>
> James
>
>
>


Re: [PERFORM] Effects of setting linux block device readahead size

2008-09-11 Thread Scott Carey
Sorry, I forgot to mention the Linux kernel version I'm using, etc:

2.6.18-92.1.10.el5 #1 SMP x86_64
CentOS 5.2.

The "adaptive" read-ahead, as well as other enhancements in the kernel, are
taking place or coming soon in the most recent stuff.  Some distributions
offer the adaptive read-ahead as an add-on (Debian, for example).  This is
an area where much can be improved in Linux http://kerneltrap.org/node/6642
http://kernelnewbies.org/Linux_2_6_23#head-102af265937262a7a21766ae58fddc1a29a5d8d7

I obviously did not test how the new read-ahead stuff impacts these sorts of
tests.

On Thu, Sep 11, 2008 at 12:07 PM, Scott Carey <[EMAIL PROTECTED]>wrote:

> Hmm, I would expect this tunable to potentially be rather file system
> dependent, and potentially RAID controller dependent.  The test was using
> ext2, perhaps the others automatically prefetch or read ahead?   Does it
> vary by RAID controller?
>
> Well I went and found out, using ext3 and xfs.  I have about 120+ data
> points but here are a few interesting ones before I compile the rest and
> answer a few other questions of my own.
>
> 1:  readahead does not affect "pure" random I/O -- there seems to be a
> heuristic trigger -- a single process or file probably has to request a
> sequence of linear I/O of some size to trigger it.  I set it to over 64MB of
> read-ahead and random iops remained the same to prove this.
> 2:  File system matters more than you would expect.  XFS sequential
> transfers when readahead was tuned had TWICE the sequential throughput of
> ext3, both for a single reader and 8 concurrent readers on 8 different
> files.
> 3:  The RAID controller and its configuration make a pretty significant
> difference as well.
>
> Hardware:
> 12 7200RPM SATA (Seagate) in raid 10 on 3Ware 9650 (only ext3)
> 12 7200RPM SATA ('nearline SAS' : Seagate ES.2) on PERC 6 in raid 10 (ext3,
> xfs)
> I also have some results with PERC raid 10 with 4x 15K SAS, not reporting
> in this message though
>
>   . . . {snip}


Re: [PERFORM] Effects of setting linux block device readahead size

2008-09-11 Thread david

On Thu, 11 Sep 2008, Scott Carey wrote:


Drives have their own read-ahead in the firmware.  Many can keep track of 2
or 4 concurrent file accesses.  A few can keep track of more.  This also
plays in with the NCQ or SCSI command queuing implementation.

Consumer drives will often read-ahead much more than server drives optimized
for i/o per second.
The difference in read-ahead sensitivity between the two setups I tested may
be due to one setup using nearline-SAS (SATA, tuned for io-per sec using SAS
firmware) and the other used consumer SATA.
For example, here is one "nearline SAS" style server tuned drive versus a
consumer tuned one:
http://www.storagereview.com/php/benchmark/suite_v4.php?typeID=10&testbedID=4&osID=6&raidconfigID=1&numDrives=1&devID_0=354&devID_1=348&devCnt=2

The Linux readahead setting is _definitely_ in the kernel, definitely uses
and fills the page cache, and from what I can gather, simply issues extra
I/O's to the hardware beyond the last one requested by an app in certain
situations.  It does not make your I/O request larger, it just queues an
extra I/O following your request.


That extra I/O will be merged with your request by the I/O scheduler code
so that by the time it gets to the drive it will be a single request.


But even if it didn't, most modern drives read the entire cylinder into
their buffer, so any additional requests to the drive will be satisfied
from this buffer and not have to wait for the disk itself.


David Lang


On Thu, Sep 11, 2008 at 12:54 PM, James Mansion <
[EMAIL PROTECTED]> wrote:


Greg Smith wrote:


The point I was trying to make there is that even under impossibly optimal
circumstances, you'd be hard pressed to blow out the disk's read cache with
seek-dominated data even if you read a lot at each seek point.  That idea
didn't make it from my head into writing very well though.

 Isn't there a bigger danger in blowing out the cache on the controller

and causing premature pageout of its dirty pages?

If you could get the readahead to work on the drive and not return data to
the controller, that might be dandy, but I'm sceptical.

James










Re: [PERFORM] Effects of setting linux block device readahead size

2008-09-11 Thread Scott Marlowe
On Thu, Sep 11, 2008 at 3:36 PM,  <[EMAIL PROTECTED]> wrote:
> On Thu, 11 Sep 2008, Scott Carey wrote:
>
>> Drives have their own read-ahead in the firmware.  Many can keep track of
>> 2
>> or 4 concurrent file accesses.  A few can keep track of more.  This also
>> plays in with the NCQ or SCSI command queuing implementation.
>>
>> Consumer drives will often read-ahead much more than server drives
>> optimized
>> for i/o per second.
>> The difference in read-ahead sensitivity between the two setups I tested
>> may
>> be due to one setup using nearline-SAS (SATA, tuned for io-per sec using
>> SAS
>> firmware) and the other used consumer SATA.
>> For example, here is one "nearline SAS" style server tuned drive versus a
>> consumer tuned one:
>>
>> http://www.storagereview.com/php/benchmark/suite_v4.php?typeID=10&testbedID=4&osID=6&raidconfigID=1&numDrives=1&devID_0=354&devID_1=348&devCnt=2
>>
>> The Linux readahead setting is _definitely_ in the kernel, definitely uses
>> and fills the page cache, and from what I can gather, simply issues extra
>> I/O's to the hardware beyond the last one requested by an app in certain
>> situations.  It does not make your I/O request larger, it just queues an
>> extra I/O following your request.
>
> that extra I/O will be merged with your request by the I/O scheduler code so
> that by the time it gets to the drive it will be a single request.
>
> but even if it didn't, most modern drives read the entire cylinder into their
> buffer so any additional requests to the drive will be satisfied from this
> buffer and not have to wait for the disk itself.

Generally speaking I agree, but I would still make a separate logical
partition for pg_xlog so that if the OS fills up the /var/log dir or
something, it doesn't impact the db.



Re: [PERFORM] Effects of setting linux block device readahead size

2008-09-11 Thread david

On Thu, 11 Sep 2008, Scott Marlowe wrote:


On Thu, Sep 11, 2008 at 3:36 PM,  <[EMAIL PROTECTED]> wrote:

but even if it didn't, most modern drives read the entire cylinder into their
buffer so any additional requests to the drive will be satisfied from this
buffer and not have to wait for the disk itself.


Generally speaking I agree, but I would still make a separate logical
partition for pg_xlog so that if the OS fills up the /var/log dir or
something, it doesn't impact the db.


this is a completely different discussion :-)

While I agree with you in theory, in practice I've seen multiple
partitions cause far more problems than they have prevented (due to the
partitions ending up not being large enough and having to be resized after
they fill up, etc.), so I tend to go in the direction of a few large
partitions.


The only reason I do multiple partitions (besides when the hardware or
performance considerations require it) is when I can identify some data
that I would not want to touch on an OS upgrade. I try to make it so that
an OS upgrade can wipe the OS partitions if necessary.


David Lang




Re: [PERFORM] Effects of setting linux block device readahead size

2008-09-11 Thread Alan Hodgson
On Thursday 11 September 2008, [EMAIL PROTECTED] wrote:
> while I agree with you in theory, in practice I've seen multiple
> partitions cause far more problems than they have prevented (due to the
> partitions ending up not being large enough and having to be resized
> after they fill up, etc) so I tend to go in the direction of a few large
> partitions.

I used to feel this way until LVM became usable. LVM plus online resizable 
filesystems really makes multiple partitions manageable.
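
As a sketch of the kind of online grow this enables (assumes an LVM2
volume group vg0 with free extents, a logical volume lv_data carrying an
ext3 filesystem, and a kernel/tools recent enough to grow ext3 while
mounted; all names are hypothetical):

lvextend -L +20G /dev/vg0/lv_data   # grow the logical volume
resize2fs /dev/vg0/lv_data          # grow ext3 to fill it, online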


-- 
Alan



Re: [PERFORM] Effects of setting linux block device readahead size

2008-09-11 Thread david

On Thu, 11 Sep 2008, Alan Hodgson wrote:


On Thursday 11 September 2008, [EMAIL PROTECTED] wrote:

while I agree with you in theory, in practice I've seen multiple
partitions cause far more problems than they have prevented (due to the
partitions ending up not being large enough and having to be resized
after they fill up, etc) so I tend to go in the direction of a few large
partitions.


I used to feel this way until LVM became usable. LVM plus online resizable
filesystems really makes multiple partitions manageable.


Won't the fragmentation of your filesystem across the different LVM
segments hurt you?


David Lang



Re: [PERFORM] Effects of setting linux block device readahead size

2008-09-11 Thread Scott Carey
I also thought that LVM is unsafe for WAL logs and file system journals
with disk write caches enabled -- it doesn't flush the disk write caches
correctly and doesn't pass write barriers through.

As pointed out here:
http://groups.google.com/group/pgsql.performance/browse_thread/thread/9dc43991c1887129
by Greg Smith
http://lwn.net/Articles/283161/



On Thu, Sep 11, 2008 at 3:41 PM, Alan Hodgson <[EMAIL PROTECTED]> wrote:

> On Thursday 11 September 2008, [EMAIL PROTECTED] wrote:
> > while I agree with you in theory, in practice I've seen multiple
> > partitions cause far more problems than they have prevented (due to the
> > partitions ending up not being large enough and having to be resized
> > after they fill up, etc) so I tend to go in the direction of a few large
> > partitions.
>
> I used to feel this way until LVM became usable. LVM plus online resizable
> filesystems really makes multiple partitions manageable.
>
>
> --
> Alan
>


Re: [PERFORM] Effects of setting linux block device readahead size

2008-09-11 Thread Scott Marlowe
On Thu, Sep 11, 2008 at 4:33 PM,  <[EMAIL PROTECTED]> wrote:
> On Thu, 11 Sep 2008, Scott Marlowe wrote:
>
>> On Thu, Sep 11, 2008 at 3:36 PM,  <[EMAIL PROTECTED]> wrote:
>>>
>>> but even if it didn't, most modern drives read the entire cylinder into
>>> their
>>> buffer so any additional requests to the drive will be satisfied from
>>> this
>>> buffer and not have to wait for the disk itself.
>>
>> Generally speaking I agree, but I would still make a separate logical
>> partition for pg_xlog so that if the OS fills up the /var/log dir or
>> something, it doesn't impact the db.
>
> this is a completely different discussion :-)
>
> while I agree with you in theory, in practice I've seen multiple partitions
> cause far more problems than they have prevented (due to the partitions
> ending up not being large enough and having to be resized after they fill
> up, etc) so I tend to go in the direction of a few large partitions.

I've never had that problem.  I've always made them big enough.  I
can't imagine building a server where /var/log shared space with my
db.  It's not like every root-level dir gets its own partition, but
seriously, logs should never go anywhere that another application is
writing to.

> the only reason I do multiple partitions (besides when the hardware or
> performance considerations require it) is when I can identify that there is
> some data that I would not want to touch on an OS upgrade. I try to make it
> so that an OS upgrade can wipe the OS partitions if necessary.

it's quite handy to have /home on a separate partition I agree.  But
on most servers /home should be empty.  A few others like /opt or
/usr/local I tend to make a separate one for the reasons you mention
as well.



Re: [PERFORM] [GENERAL] PostgreSQL TPC-H test result?

2008-09-11 Thread Jignesh K. Shah
Moving this thread to the Performance alias, as it might make more sense
for folks searching on this topic:




Greg Smith wrote:

On Tue, 9 Sep 2008, Amber wrote:

I read something from 
http://monetdb.cwi.nl/projects/monetdb/SQL/Benchmark/TPCH/index.html 
saying that PostgreSQL can't give the correct results for some of the
TPC-H queries.


Jignesh Shah at Sun ran into that same problem.  It's mentioned 
briefly in his presentation at 
http://blogs.sun.com/jkshah/entry/postgresql_east_2008_talk_postgresql 
on pages 26 and 27.  5 of the 22 reference TPC-H queries (4, 5, 6, 10,
14) returned zero rows immediately for his tests.  Looks like the 
MonetDB crew is saying it does that on queries 4,5,6,10,12,14,15 and 
that 20 takes too long to run to generate a result.  Maybe 12/15/20 
were fixed by changes in 8.3, or perhaps there were subtle errors 
there that Jignesh didn't catch--it's not like he did a formal 
submission run, was just kicking the tires.  I suspect the difference 
on 20 was that his hardware and tuning was much better, so it probably 
did execute fast enough.


I redid a quick test with the same workload on one of my systems with
SF 10, which is about 10GB.

(I hope it comes out properly displayed)

 Q  | Time PG 8.3.3 (Jignesh) | Time PG 8.2.9 (MonetDB site) | Ratio
  1 |  429.01                 |   510                        |  0.84
  2 |    3.65                 |    54                        |  0.07
  3 |   33.49                 |   798                        |  0.04
  4 |    6.53 (Empty)         |    35 (E)                    |  0.19
  5 |    8.45 (Empty)         |   5.5 (E)                    |  1.54
  6 |   32.84 (Empty)         |   172 (E)                    |  0.19
  7 |  477.95                 |   439                        |  1.09
  8 |   58.55                 |   251                        |  0.23
  9 |  781.96                 |  2240                        |  0.35
 10 |    9.03 (Empty)         |   6.1 (E)                    |  1.48
 11 |    3.57 (Empty)         |    25                        |  0.14
 12 |   56.11 (Empty)         |   179 (E)                    |  0.31
 13 |   61.01                 |   140                        |  0.44
 14 |   30.69 (Empty)         |   169 (E)                    |  0.18
 15 |   32.81 (Empty)         |   168 (E)                    |  0.20
 16 |   23.98                 |   115                        |  0.21
 17 |  Did not finish         |  Did not finish              |
 18 |   58.93                 |   882                        |  0.07
 19 |   71.55                 |   218                        |  0.33
 20 |  Did not finish         |  Did not finish              |
 21 |  550.51                 |   477                        |  1.15
 22 |    6.21                 |  Did not finish              |

All times are in seconds (sub-second where available).
Ratio > 1 means 8.3.3 is slower; < 1 means 8.3.3 is faster.

My take on the results:

* I had to tweak the statement of Q1 in order to execute it.
  (The TPC-H kit does not directly generate PostgreSQL-compatible statements.)

* Timings with 8.3.3 and a bit of tuning are much better overall, as
  expected. (Some queries finish in 7% of the time that MonetDB reported.
  Of the queries that worked, only Q7 & Q21 seem to have regressed.)

* However, empty result sets occur consistently.
  (In fact, Q11 also returned empty for me while it worked in their test.)
  Queries: 4, 5, 6, 10, 11, 12, 14, 15.
  (ACTION ITEM: I will start separate threads for each of those queries on
   the HACKERS alias to figure out the problem, since it looks like a
   functional problem to me and should be interesting to that alias.)

* Two queries, 17 and 20, look like they will not finish. (I let Q17 run
  for 18 hours and it had still not completed; Q20 I killed as it was
  approaching an hour.)
  (ACTION ITEM: Not sure whether debugging for these queries belongs on
   the hackers or the performance alias, but I will start a thread on
   them too.)

* Looks like a bit of tuning is required for Q1, Q7, Q9, and Q21 to
  improve their overall times, especially understanding whether
  PostgreSQL is missing a more efficient plan for them.
  (ACTION ITEM: I will start separate threads on the performance alias
   to dig into those queries.)


I hope to start a separate thread for each query so we can track them
more easily, and to provide EXPLAIN ANALYZE output for each one; let's
see if there are any problems.


Feedback welcome on what you want to see for each threads.

Regards,
Jignesh


--
Jignesh Shah   http://blogs.sun.com/jkshah  
Sun Microsystems,Inc   http://sun.com/postgresql




Re: [PERFORM] [GENERAL] PostgreSQL TPC-H test result?

2008-09-11 Thread Tom Lane
"Jignesh K. Shah" <[EMAIL PROTECTED]> writes:
> * However, empty result sets occur consistently.
>   (In fact, Q11 also returned empty for me while it worked in their test.)
>   Queries: 4, 5, 6, 10, 11, 12, 14, 15.
>   (ACTION ITEM: I will start separate threads for each of those queries on
>    the HACKERS alias to figure out the problem, since it looks like a
>    functional problem to me and should be interesting to that alias.)

See discussion suggesting that this is connected to misinterpretation of
INTERVAL literals.  If TPC-H relies heavily on syntax that we'd get
wrong, then pretty much every test result has to be under suspicion,
since we might be fetching many more or fewer rows than the test
intends.
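
As an illustration of the kind of literal at issue (a sketch; the
predicate below paraphrases the Q4 date range from memory rather than
quoting the kit verbatim):

psql -d tpch -c "SELECT date '1993-07-01' + interval '3' month;"
# Intended result: 1993-10-01.  A server that ignores the trailing field
# qualifier reads interval '3' as 3 seconds instead of 3 months, so a
# Q4-style predicate such as
#   o_orderdate >= date '1993-07-01'
#   AND o_orderdate < date '1993-07-01' + interval '3' month
# matches essentially no rows.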

I've recently committed fixes that I think would cover this, but you'd
really need to compare specific query rowcounts against other DBMSes
to make sure we're all on the same page.

regards, tom lane



Re: [PERFORM] Effects of setting linux block device readahead size

2008-09-11 Thread Greg Smith

On Thu, 11 Sep 2008, Alan Hodgson wrote:

LVM plus online resizable filesystems really makes multiple partitions 
manageable.


I've seen so many reports blaming Linux's LVM for performance issues that 
its managability benefits don't seem too compelling.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PERFORM] [GENERAL] PostgreSQL TPC-H test result?

2008-09-11 Thread Jonah H. Harris
On Thu, Sep 11, 2008 at 11:30 PM, Jignesh K. Shah <[EMAIL PROTECTED]> wrote:
> Moving this thread to Performance alias as it might make more sense for
> folks searching on this topic:

You should be using DBT-3.  Similarly, a scale factor of 10 is
pointless.  How many data warehouses are only 10GB?  Also, it's
well-known that MonetDB will quickly fall over when you run a test
larger than can fit in memory.  In the real benchmark, the minimum
scale factor is 100GB; try it and see what you get.

If you have the resources and want to compare it to something, compare
it with Oracle on the exact same system.  If tuned properly, Oracle
10g (Standard Edition with the exact same tables/indexes/queries as
Postgres) is ~5-10x faster and Enterprise Edition is ~50-100x faster.
To be fair, an Oracle Enterprise Edition configuration for TPC-H uses
advanced partitioning and materialized views, both of which Postgres
does not support, which makes it an apples-to-oranges comparison.  I
haven't tried 11g, but I expect it will perform a bit better in this
area given several of the enhancements.  Also, while it's not widely
known, if you wanted to compare systems and don't want to set it all
up yourself, Oracle released Oracle-compatible versions of OSDL's
Database Test Suite for DBT-2 (TPC-C) and DBT-3 (TPC-H) as part of
their Linux Test Kit which can be found at oss.oracle.com.  Due to
Oracle's license, I can't give you exact timings, but I have confirmed
with several other benchmark professionals that the results mentioned
above have been confirmed by others as well.

To be clear, I'm not trying to bash on PG and I don't want to start a
flame-war.  I just think that people should be aware of where we stand
in comparison to commercial systems and understand that there's quite
a bit of work to be done in the VLDB area.  Specifically, I don't
think we should be striving for great TPC-H performance, but I believe
there are some areas we could improve based on it.  Similarly, this is
an area where a properly-utilized fadvise may show some benefit.

As for running the TPC-H on Postgres, you need a
default_statistics_target of at least 250.  IIRC, the last time I
checked (on 8.3), you really needed a statistics target around
400-500.  For the most part, the planner is choosing a bad plan for
several of the queries.  After you resolve that, you'll quickly notice
that Postgres' buffer manager design and the lack of a good
multi-block read quickly comes into play.  The hash join
implementation also has a couple issues which I've recently seen
mentioned in other threads.
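
As a sketch of that statistics tuning (the database, table, column, and
the value 400 are illustrative only, not a recommendation from this
thread):

# Cluster-wide, in postgresql.conf, followed by a re-ANALYZE:
#   default_statistics_target = 400
# Or per column, which is often enough for heavily filtered/joined ones:
psql -d tpch -c "ALTER TABLE lineitem ALTER COLUMN l_shipdate SET STATISTICS 400;"
psql -d tpch -c "ANALYZE lineitem;"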

Use DBT-3, it will save you quite a few headaches :)

-- 
Jonah H. Harris, Senior DBA
myYearbook.com
