Re: [PERFORM] query optimization

2012-05-03 Thread Richard Kojedzinszky

Dear Tom,

Thanks for your response. In my test cases, until an ANALYZE has been run, 
the queries run fast. After only 'a' has been analyzed, the query plan 
changes so that sequential scans are done on tables b and c, and joining 
them with 'a' is done in memory.


So, my tests are here, with autovacuum turned off:
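
(test.sql itself is not included in the archived message; a minimal
reconstruction consistent with the session below, with table, column, and
index names taken from the plans and row counts assumed to be 100,000,
would be:)

DROP TABLE IF EXISTS a;
DROP TABLE IF EXISTS b;
DROP TABLE IF EXISTS c;
DROP TABLE IF EXISTS d;    -- a fourth table is implied by the output below
CREATE TABLE a (id integer, b integer);
CREATE TABLE b (id integer);
CREATE TABLE c (id integer);
CREATE TABLE d (id integer);
INSERT INTO a SELECT i, i FROM generate_series(1, 100000) i;
INSERT INTO b SELECT i FROM generate_series(1, 100000) i;
INSERT INTO c SELECT i FROM generate_series(1, 100000) i;
INSERT INTO d SELECT i FROM generate_series(1, 100000) i;
CREATE INDEX a_idx1 ON a (id);
CREATE INDEX b_idx1 ON b (id);
CREATE INDEX c_idx1 ON c (id);
CREATE INDEX d_idx1 ON d (id);    -- index name assumed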

krichy=> \i test.sql
DROP TABLE
DROP TABLE
DROP TABLE
DROP TABLE
CREATE TABLE
CREATE TABLE
CREATE TABLE
CREATE TABLE
INSERT 0 10
INSERT 0 10
INSERT 0 10
INSERT 0 10
CREATE INDEX
CREATE INDEX
CREATE INDEX
CREATE INDEX
krichy=> explain analyze select * from a left join (b inner join c on b.id = c.id) on a.b = b.id where a.id = 1;

                                                        QUERY PLAN
----------------------------------------------------------------------------------------------------------------
 Merge Left Join  (cost=504.88..2633669.90 rows=12500 width=16) (actual time=0.183..0.189 rows=1 loops=1)
   Merge Cond: (a.b = b.id)
   ->  Sort  (cost=504.88..506.13 rows=500 width=8) (actual time=0.073..0.074 rows=1 loops=1)
         Sort Key: a.b
         Sort Method: quicksort  Memory: 17kB
         ->  Bitmap Heap Scan on a  (cost=12.13..482.47 rows=500 width=8) (actual time=0.048..0.048 rows=1 loops=1)
               Recheck Cond: (id = 1)
               ->  Bitmap Index Scan on a_idx1  (cost=0.00..12.01 rows=500 width=0) (actual time=0.044..0.044 rows=1 loops=1)
                     Index Cond: (id = 1)
   ->  Materialize  (cost=0.00..883162.52 rows=5000 width=8) (actual time=0.103..0.108 rows=2 loops=1)
         ->  Merge Join  (cost=0.00..758162.52 rows=5000 width=8) (actual time=0.100..0.104 rows=2 loops=1)
               Merge Cond: (b.id = c.id)
               ->  Index Scan using b_idx1 on b  (cost=0.00..3956.26 rows=10 width=4) (actual time=0.050..0.050 rows=2 loops=1)
               ->  Materialize  (cost=0.00..4206.26 rows=10 width=4) (actual time=0.048..0.049 rows=2 loops=1)
                     ->  Index Scan using c_idx1 on c  (cost=0.00..3956.26 rows=10 width=4) (actual time=0.046..0.047 rows=2 loops=1)
 Total runtime: 0.276 ms
(16 rows)

krichy=> ANALYZE a;
ANALYZE
krichy=> explain analyze select * from a left join (b inner join c on b.id = c.id) on a.b = b.id where a.id = 1;

                                                        QUERY PLAN
----------------------------------------------------------------------------------------------------------------
 Merge Right Join  (cost=8.29..883178.31 rows=500 width=16) (actual time=0.050..0.056 rows=1 loops=1)
   Merge Cond: (b.id = a.b)
   ->  Merge Join  (cost=0.00..758162.52 rows=5000 width=8) (actual time=0.034..0.038 rows=2 loops=1)
         Merge Cond: (b.id = c.id)
         ->  Index Scan using b_idx1 on b  (cost=0.00..3956.26 rows=10 width=4) (actual time=0.015..0.016 rows=2 loops=1)
         ->  Materialize  (cost=0.00..4206.26 rows=10 width=4) (actual time=0.015..0.017 rows=2 loops=1)
               ->  Index Scan using c_idx1 on c  (cost=0.00..3956.26 rows=10 width=4) (actual time=0.012..0.013 rows=2 loops=1)
   ->  Materialize  (cost=8.29..8.29 rows=1 width=8) (actual time=0.015..0.016 rows=1 loops=1)
         ->  Sort  (cost=8.29..8.29 rows=1 width=8) (actual time=0.013..0.013 rows=1 loops=1)
               Sort Key: a.b
               Sort Method: quicksort  Memory: 17kB
               ->  Index Scan using a_idx1 on a  (cost=0.00..8.28 rows=1 width=8) (actual time=0.007..0.008 rows=1 loops=1)
                     Index Cond: (id = 1)
 Total runtime: 0.101 ms
(14 rows)

krichy=> ANALYZE b;
ANALYZE
krichy=> explain analyze select * from a left join (b inner join c on b.id = c.id) on a.b = b.id where a.id = 1;

                                                        QUERY PLAN
----------------------------------------------------------------------------------------------------------------
 Hash Right Join  (cost=2651.29..6419.30 rows=1 width=16) (actual time=83.823..257.890 rows=1 loops=1)
   Hash Cond: (b.id = a.b)
   ->  Hash Join  (cost=2643.00..6036.00 rows=10 width=8) (actual time=83.790..224.552 rows=10 loops=1)
         Hash Cond: (c.id = b.id)
         ->  Seq Scan on c  (cost=0.00..1393.00 rows=10 width=4) (actual time=0.010..35.925 rows=10 loops=1)
         ->  Hash  (cost=1393.00..1393.00 rows=10 width=4) (actual time=83.752..83.752 rows=10 loops=1)
               Buckets: 16384  Batches: 1  Memory Usage: 2344kB
               ->  Seq Scan on b  (cost=0.00..1393.00 rows=10 width=4) (actual time=0.009..36.302 rows=10 loops=1)
   ->  Hash  (cost=8.28..8.28 rows=1 width=8) (actual time=0.012..0.012 rows=1 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 1kB
         ->  Index Scan using a_idx1 on a  (cost=0.00..8.28 rows=1 width=8) (actual time=0.007..0.008 rows=1 loops=1)
               Index Cond: 

Re: [PERFORM] scale up (postgresql vs mssql)

2012-05-03 Thread Eyal Wilde
Hi, all.

Well, I wondered why there is such a high rate of bo (blocks out). The
procedure is practically read-only during the whole test. It is not strictly
read-only, because under a certain condition there might be writes to a
certain table, but that condition cannot be met during this test.

So, I created a ramdisk:
mkfs -q /dev/ram2 10
mkdir -p /ram4
mount /dev/ram2 /ram4
df -H | grep /ram4

and then:
CREATE TABLESPACE pgram4 OWNER postgres LOCATION '/ram4';

and in postgresql.conf I configured:
temp_tablespaces = 'pgram4'

Now, I believe, all the temp tables were in RAM.
vmstat showed:
r b swpd free buff cache si so bi bo in cs us sy id wa st
6 0 0 5916720 69488 1202668 0 0 0 3386 1902 1765 25 3 72 1 0
9 1 0 5907728 69532 1204212 0 0 0 1392 5375 4510 88 8 3 1 0
7 0 0 5886584 69672 1205096 0 0 0 1472 5278 4520 88 10 2 0 0
8 0 0 5877384 69688 1206384 0 0 0 1364 5312 4522 89 8 2 1 0
8 0 0 5869332 69748 1207188 0 0 0 1296 5285 4437 88 8 3 1 0
6 1 0 5854404 69852 1208776 0 0 0 2955 5333 4518 88 9 2 0 0

10 times less bo (blocks out)
5 times less wa (percentage of CPU time spent waiting for I/O)
2 times less b (processes in the wait queue, waiting for I/O)

The overall test result was (just?) ~15% better...

When I created the ramdisk with mkfs.ext4 (instead of the default ext2),
the performance was the same (~15% better), but the vmstat output looked
much the same as before (without the ramdisk)! Why is that?

As I mentioned, the procedure is practically read-only. Shouldn't I expect
bo (blocks out) to be ~0? After forcing the temp tables into RAM, what
other reasons may cause bo (blocks out)?

I see no point in pasting the whole procedure here, since it's very long. The
general course of the procedure is:
create temp tables if they do not exist (practically, they do exist)
do a lot of: insert into temp-table select from table
and: insert into temp-table select from table join temp-table
after the inserts are finished: analyze temp-table (this was the only
way the optimizer behaved properly)
finally, open refcursors of select from temp-tables
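
(A minimal sketch of that pattern, with invented table and column names;
note that CREATE TEMP TABLE IF NOT EXISTS requires PostgreSQL 9.1 or later:)

-- create the temp table only on the first call in this session
CREATE TEMP TABLE IF NOT EXISTS tmp_orders (id integer, total numeric);
TRUNCATE tmp_orders;
-- lots of: insert into temp-table select from table
INSERT INTO tmp_orders SELECT id, total FROM orders;
-- and: insert into temp-table select from table join temp-table
INSERT INTO tmp_orders
SELECT o.id, o.total FROM orders o JOIN tmp_orders t ON t.id = o.id;
-- refresh statistics so the planner handles later temp-table joins properly
ANALYZE tmp_orders;
-- finally, inside a function: OPEN cur FOR SELECT * FROM tmp_orders;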

Thanks again.


[PERFORM] Result Set over Network Question

2012-05-03 Thread Ronald Hahn, DOCFOCUS INC.

Hi,
We have recently switched our product from MS SQL 2000 to PostgreSQL 
9.0.7. We have tuned the searches and indexes so that they are very 
close to (often better than) what sql2k was giving us.  We are noticing some 
differences now in the time it takes for the result set to make it back 
to the client and would like some help finding out why.


What we see on the PG side is that if we run:
Select SomeInt32 from someTable where something Limit 1
it consistently returns the results "instantaneously" after the fetch 
time.  If we run the same select but ask for more data, the fetch time 
stays the same but the row takes longer to come back. Bringing back 400 
bytes takes 1-2 s, but bringing back 866 bytes takes 9-11 s.


We went to the SQL2k server (on the same hardware) and ran the selects 
again. When bringing back an int32, PG was faster, with the fetch and 
the row coming back in 1-5 ms versus SQL2k's 500-700 ms. This tells me 
that the problem is not related to PG indexes or disk.  When bringing 
back 400 bytes, PG fetch time would be 1-2 ms but the results would take 
2-3 s, while SQL2k would bring them back in 700-900 ms. Working with 866 
bytes, PG fetch time is 1-3 ms with the results coming back in 9-11 s, 
and SQL2k brings the results back in 2-3 s.


The head-to-head test was run in Aqua Data Studio 10.0.8 with ODBC driver 
9.0.3.10. The same slowdown happens when using pgAdmin III. The differences 
in time do not occur when running on the PG/SQL server computer itself, so I 
think there is a network component to this.


I know that as you bring back more data it takes longer, but why is there 
such a difference between the time it takes PG to send the data and 
the time it takes sql2k to send it?


Any thoughts and suggestions are very much appreciated
Thanks
Ron
--

*Ronald Hahn* , CCP, CIPS Member
*DOCFOCUS INC.*
Suite 103, 17505 - 107 Avenue,
Edmonton, Alberta T5S 1E5
Phone: 780.444.5407
Toll Free: 800.661.2727 (ext 6)
Fax: 780.444.5409
Email: rh...@docfocus.ca
Support: dfisupp...@docfocus.ca
DOCFOCUS.ca

There are 2 kinds of people in the world.
Those who can extrapolate from incomplete data



[PERFORM] Query got slow from 9.0 to 9.1 upgrade

2012-05-03 Thread Josh Turmel
We just upgraded from 9.0 to 9.1; we're using the same server configuration, 
which has been confirmed 3 or 4 times over. Any help would be appreciated. If I 
remove the "ORDER BY" it gets fast again, because it goes back to using the 
user_id index; if I remove the LIMIT/OFFSET it gets fast again. Obviously I 
need both of those, but that was just to test and see what would happen.

Query: SELECT * FROM bookmark_groups WHERE user_id = 6708929 ORDER BY created 
DESC LIMIT 25 OFFSET 0;

explain analyze from 9.0:

Limit  (cost=1436.78..1436.84 rows=25 width=99) (actual time=15.399..15.403 rows=25 loops=1)
  ->  Sort  (cost=1436.78..1438.67 rows=757 width=99) (actual time=15.397..15.397 rows=25 loops=1)
        Sort Key: created
        Sort Method:  top-N heapsort  Memory: 28kB
        ->  Index Scan using bookmark_groups_user_id_idx on bookmark_groups  (cost=0.00..1415.42 rows=757 width=99) (actual time=0.011..9.953 rows=33868 loops=1)
              Index Cond: (user_id = 6708929)
Total runtime: 15.421 ms


explain analyze from 9.1:

Limit  (cost=0.00..1801.30 rows=25 width=101) (actual time=1565.071..5002.068 rows=25 loops=1)
  ->  Index Scan using bookmark_groups_created_idx on bookmark_groups  (cost=0.00..2592431.76 rows=35980 width=101) (actual time=1565.069..5002.062 rows=25 loops=1)
        Filter: (user_id = 6708929)
Total runtime: 5002.095 ms


DDL:

CREATE TABLE "public"."bookmark_groups" (
"id" int8 NOT NULL DEFAULT nextval('bookmark_groups_id_seq'::regclass),
"user_id" int4 NOT NULL DEFAULT NULL,
"version" varchar DEFAULT NULL,
"created" timestamp(6) WITH TIME ZONE NOT NULL DEFAULT now(),
"username" varchar NOT NULL DEFAULT NULL,
"labels" varchar DEFAULT NULL,
"reference" varchar NOT NULL DEFAULT NULL,
"human" varchar NOT NULL DEFAULT NULL,
"highlight_color" char(6) DEFAULT NULL,
"title" varchar DEFAULT NULL,
"version_id" int4 NOT NULL DEFAULT NULL,
CONSTRAINT "bookmark_groups_pkey1" PRIMARY KEY ("id", "reference")
)
WITH (OIDS=FALSE);
ALTER TABLE "public"."bookmark_groups" OWNER TO "dev";
CREATE INDEX "bookmark_groups_created_idx" ON "public"."bookmark_groups" USING 
btree(created DESC NULLS FIRST);
CREATE INDEX "bookmark_groups_user_id_idx" ON "public"."bookmark_groups" USING 
btree(user_id ASC NULLS LAST);



Re: [PERFORM] Configuration Recommendations

2012-05-03 Thread John Lister

On 03/05/2012 03:10, Jan Nielsen wrote:


300GB RAID10 2x15k drive for OS on local storage
*/dev/sda1 RA* 4096
*/dev/sda1 FS* ext4
*/dev/sda1 MO*

600GB RAID 10 8x15k drive for $PGDATA on SAN
*IO Scheduler sda* noop anticipatory deadline [cfq]
*/dev/sdb1 RA* 4096
*/dev/sdb1 FS* xfs
*/dev/sdb1 MO* allocsize=256m,attr2,logbufs=8,logbsize=256k,noatime

300GB RAID 10 2x15k drive for $PGDATA/pg_xlog on SAN
*IO Scheduler sdb* noop anticipatory deadline [cfq]
*/dev/sde1 RA* 4096
*/dev/sde1 FS* xfs
*/dev/sde1 MO* allocsize=256m,attr2,logbufs=8,logbsize=256k,noatime


I was wondering if it would be better to put the xlog on the same disk 
as the OS? Apart from the occasional log writes I'd have thought most OS 
data is loaded into cache at the beginning, so you effectively have an 
unused disk. This gives you another spindle (mirrored) for your data.


Or have I missed something fundamental?

--
www.pricegoblin.co.uk



Re: [PERFORM] Several optimization options (config/hardware)

2012-05-03 Thread Albe Laurenz
Martin Grotzke wrote:
> we want to see if we can gain better performance with our postgresql
> database. In the last year the amount of data grew from ~25G to now
> ~140G and we're currently developing a new feature that needs to get
> data faster from the database. The system is both read and write
heavy.
> 
> At first I want to give you an overview over the hardware, software
and
> configuration and the changes that I see we could check out. I'd be
very
> happy if you could review and tell if the one or the other is
nonsense.
> 
> Hardware:
> - CPU: 4x4 Cores Intel Xeon L5630  @ 2.13GHz
> - RAM: 64GB
> - RAID 1 (1+0) HP Company Smart Array G6 controllers, P410i
>   (I don't know the actual number of discs)
> - A single partition for data and wal-files
> 
> Software
> - RHEL 6, Kernel 2.6.32-220.4.1.el6.x86_64
> - postgresql90-server-9.0.6-1PGDG.rhel6.x86_64

You could try different kernel I/O elevators and see if that improves
something.

I have had good experiences with elevator=deadline and elevator=noop.

Yours,
Laurenz Albe

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Configuration Recommendations

2012-05-03 Thread Robert Klemme
Hi Jan,

On Thu, May 3, 2012 at 4:10 AM, Jan Nielsen  wrote:
> Below is the hardware, firmware, OS, and PG configuration pieces that I'm
> settling in on. As was noted, the local storage used for OS is actually two
> disks with RAID 10. If anything appears like a mistake or something is
> missing, I'd appreciate the feedback.

You should quickly patent this solution.  As far as I know you need at
least four disks for RAID 10. :-)
http://en.wikipedia.org/wiki/RAID#Nested_.28hybrid.29_RAID

Or did you mean RAID 1?

> I'm still working on the benchmarks scripts and I don't have good/reliable
> numbers yet since our SAN is still very busy reconfiguring from the 2x4 to
> 1x8. I'm hoping to get them running tomorrow when the SAN should complete
> its 60 hours of reconfiguration.

Yeah, does not seem to make a lot of sense to test during this phase.

> Thanks, again, for all the great feedback.

You're welcome!

> 300GB RAID10 2x15k drive for OS on local storage
> */dev/sda1 RA*    4096
> */dev/sda1 FS*    ext4
> */dev/sda1 MO*

See above.

> 600GB RAID 10 8x15k drive for $PGDATA on SAN
> *IO Scheduler sda*    noop anticipatory deadline [cfq]
> */dev/sdb1 RA*    4096
> */dev/sdb1 FS* xfs
> */dev/sdb1 MO*
> allocsize=256m,attr2,logbufs=8,logbsize=256k,noatime
>
> 300GB RAID 10 2x15k drive for $PGDATA/pg_xlog on SAN
> *IO Scheduler sdb*    noop anticipatory deadline [cfq]
> */dev/sde1 RA*    4096
> */dev/sde1 FS* xfs
> */dev/sde1 MO*    allocsize=256m,attr2,logbufs=8,logbsize=256k,noatime
> *IO Scheduler sde*    noop anticipatory deadline [cfq]

See above.

With regard to the scheduler, I have frequently read that [deadline]
and [noop] perform better for PG loads.  Fortunately this can be
easily changed.

Maybe this also has some additional input:
http://www.fccps.cz/download/adv/frr/hdd/hdd.html

On Thu, May 3, 2012 at 8:54 AM, John Lister  wrote:
> I was wondering if it would be better to put the xlog on the same disk as
> the OS? Apart from the occasional log writes I'd have thought most OS data
> is loaded into cache at the beginning, so you effectively have an unused
> disk. This gives you another spindle (mirrored) for your data.
>
> Or have I missed something fundamental?

Separating avoids interference between OS and WAL logging (e.g. a
script running berserk and filling the OS filesystem).  Also it's easier
to manage (e.g. in case of relocation to another volume etc.).  And
you can have different mount options (e.g. you might want atime for
the OS volume).

Kind regards

robert


-- 
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Query got slow from 9.0 to 9.1 upgrade

2012-05-03 Thread Ants Aasma
On Tue, May 1, 2012 at 12:17 AM, Josh Turmel  wrote:
> We just upgraded from 9.0 to 9.1, we're using the same server configuration,
> that has been confirmed 3 or 4 times over. Any help would be appreciated. If
> I remove the "ORDER BY" it gets fast again because it goes back to using the
> user_id index, if I remove the LIMIT/OFFSET it gets fast again, obviously I
> need both of those, but that was just to test and see what would happen.
>
> Query: SELECT * FROM bookmark_groups WHERE user_id = 6708929 ORDER BY
> created DESC LIMIT 25 OFFSET 0;

Based on the explain numbers I'd say that 9.0 was fast by accident of
having inaccurate statistics. You can see that 9.0 estimated that 757
rows have this user_id, while actually it had 33868 rows. 9.1
estimated a more accurate 35980 rows, and because of that assumed that
reading the newest created rows would return 25 rows of this user
rather fast, faster than sorting the 35980 rows. This assumption seems
to be incorrect, probably because the rows with this user_id are all
rather old.

You could try tweaking cpu_index_tuple_cost to be higher so that large
index scans get penalized. But ultimately with the current PG version
there isn't a good general way to fix this kind of behavior. You can
rewrite the query to enforce filtering before sorting:

SELECT * FROM (
SELECT * FROM bookmark_groups WHERE user_id = 6708929
OFFSET 0 -- Prevents pushdown of ordering and limit
) AS sub ORDER BY created DESC LIMIT 25 OFFSET 0;

This is the same issue that Simon Riggs talks about in this e-mail:
http://archives.postgresql.org/message-id/ca+u5nmlbxfut9cwdhj3tpxjc3btwqizbkqtwdgzebcb5bag...@mail.gmail.com

The more general approach is to be more pessimistic about limited
filtered index-scans, or collecting multi-dimensional stats to figure
out the correlation that all rows for this user are likely to be old.
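
(To make the first suggestion concrete, and to note one further option: the
cost value below is only a starting point for experimentation, and the
composite index, which is not raised in this exchange, is another common
remedy since it serves the filter and the ordering together:)

SET cpu_index_tuple_cost = 0.05;   -- default is 0.005; illustrative value
                                   -- to penalize large index scans

CREATE INDEX bookmark_groups_user_created_idx   -- hypothetical name
    ON bookmark_groups (user_id, created DESC);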

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Several optimization options (config/hardware)

2012-05-03 Thread Martin Grotzke
Hi Laurenz,

On 05/03/2012 09:26 AM, Albe Laurenz wrote:
> Martin Grotzke wrote:
>> we want to see if we can gain better performance with our postgresql
>> database. In the last year the amount of data grew from ~25G to now
>> ~140G and we're currently developing a new feature that needs to get
>> data faster from the database. The system is both read and write
> heavy.
>>
>> At first I want to give you an overview over the hardware, software
> and
>> configuration and the changes that I see we could check out. I'd be
> very
>> happy if you could review and tell if the one or the other is
> nonsense.
>>
>> Hardware:
>> - CPU: 4x4 Cores Intel Xeon L5630  @ 2.13GHz
>> - RAM: 64GB
>> - RAID 1 (1+0) HP Company Smart Array G6 controllers, P410i
>>   (I don't know the actual number of discs)
>> - A single partition for data and wal-files
>>
>> Software
>> - RHEL 6, Kernel 2.6.32-220.4.1.el6.x86_64
>> - postgresql90-server-9.0.6-1PGDG.rhel6.x86_64
> 
> You could try different kernel I/O elevators and see if that improves
> something.
> 
> I have had good experiences with elevator=deadline and elevator=noop.
Ok, great info.

Honestly, I'm not sure which device to look at to check the current
configuration.

mount/fstab shows the device /dev/mapper/VG01-www for the relevant
partition. When I check iostat, high utilization is reported for the
devices dm-4 and sda (always showing nearly the same numbers for util),
so I suspect that dm-4 is mapped onto sda.

This is the current config:
$ cat /sys/block/sda/queue/scheduler
noop anticipatory deadline [cfq]
$ cat /sys/block/dm-4/queue/scheduler
none

Which of them should be changed?
I'll discuss this also with our hosting provider next week, he'll know
what has to be done.

Cheers,
Martin



signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] Query got slow from 9.0 to 9.1 upgrade

2012-05-03 Thread Jonathan
On 4/30/2012 5:17 PM, Josh Turmel wrote:
> We just upgraded from 9.0 to 9.1, we're using the same server
> configuration, that has been confirmed 3 or 4 times over. Any help would
> be appreciated. If I remove the "ORDER BY" it gets fast again because it
> goes back to using the user_id index, if I remove the LIMIT/OFFSET it
> gets fast again, obviously I need both of those, but that was just to
> test and see what would happen.

I had this problem as well and ended up working around it by having the
application cache the highest seen user_id and send that back to the
server, which uses it in a WHERE clause.  This way I had just the LIMIT
and was able to remove the OFFSET, and things ran great.  I don't know
how feasible it is for you to change things application-side, but it
worked well for me.
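
(A sketch of that keyset-style approach against the table from this thread;
$1 stands for the last "created" value the application saw, and using
created as the paging key is an assumption:)

SELECT *
  FROM bookmark_groups
 WHERE user_id = 6708929
   AND created < $1     -- last value the application has seen
 ORDER BY created DESC
 LIMIT 25;              -- LIMIT only; the OFFSET is gone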

Jonathan


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Configuration Recommendations

2012-05-03 Thread Jan Nielsen
Hi Robert,

On Thu, May 3, 2012 at 1:28 AM, Robert Klemme wrote:

> Hi Jan,
>
> On Thu, May 3, 2012 at 4:10 AM, Jan Nielsen 
> wrote:
> > Below is the hardware, firmware, OS, and PG configuration pieces that I'm
> > settling in on. As was noted, the local storage used for OS is actually
> two
> > disks with RAID 10. If anything appears like a mistake or something is
> > missing, I'd appreciate the feedback.
>
> You should quickly patent this solution.  As far as I know you need at
> least four disks for RAID 10. :-)
> http://en.wikipedia.org/wiki/RAID#Nested_.28hybrid.29_RAID
>
> Or did you mean RAID 1?
>

Ugh - yeah - sorry. RAID-1 for the 2-disk OS and WAL.


>  > I'm still working on the benchmarks scripts and I don't have
> good/reliable
> > numbers yet since our SAN is still very busy reconfiguring from the 2x4
> to
> > 1x8. I'm hoping to get them running tomorrow when the SAN should complete
> > its 60 hours of reconfiguration.
>
> Yeah, does not seem to make a lot of sense to test during this phase.
>
> > Thanks, again, for all the great feedback.
>
> You're welcome!
>
> > 300GB RAID10 2x15k drive for OS on local storage
>

Correction: RAID-1 on the 2x15k local storage device for OS


> > */dev/sda1 RA* 4096
> > */dev/sda1 FS* ext4
> > */dev/sda1 MO*
>
> See above.
>
> > 600GB RAID 10 8x15k drive for $PGDATA on SAN
>

Clarification: RAID-10 on the 8x15k SAN device for $PGDATA


> > *IO Scheduler sda* noop anticipatory deadline [cfq]
> > */dev/sdb1 RA* 4096
> > */dev/sdb1 FS* xfs
> > */dev/sdb1 MO* allocsize=256m,attr2,logbufs=8,logbsize=256k,noatime
> >
> > 300GB RAID 10 2x15k drive for $PGDATA/pg_xlog on SAN
>

Correction: RAID-1 on the 2x15k SAN device for $PGDATA/pg_xlog


> > *IO Scheduler sdb* noop anticipatory deadline [cfq]
> > */dev/sde1 RA* 4096
> > */dev/sde1 FS* xfs
> > */dev/sde1 MO* allocsize=256m,attr2,logbufs=8,logbsize=256k,noatime
> > *IO Scheduler sde* noop anticipatory deadline [cfq]
>
> See above.
>
> With regard to the scheduler, I have frequently read that [deadline]
> and [noop] perform better for PG loads.  Fortunately this can be
> easily changed.
>
> Maybe this also has some additional input:
> http://www.fccps.cz/download/adv/frr/hdd/hdd.html
>

Thanks for the reference, Robert.


>  On Thu, May 3, 2012 at 8:54 AM, John Lister 
> wrote:
> > I was wondering if it would be better to put the xlog on the same disk as
> > the OS? Apart from the occasional log writes I'd have thought most OS
> data
> > is loaded into cache at the beginning, so you effectively have an unused
> > disk. This gives you another spindle (mirrored) for your data.
> >
> > Or have I missed something fundamental?
>
> Separating avoids interference between OS and WAL logging (i.e. a
> script running berserk and filling OS filesystem).  Also it's easier
> to manage (e.g. in case of relocation to another volume etc.).  And
> you can have different mount options (i.e. might want to have atime
> for OS volume).
>
> Kind regards
>
> robert
>
>
> --
> remember.guy do |as, often| as.you_can - without end
> http://blog.rubybestpractices.com/
>


Re: [PERFORM] Configuration Recommendations

2012-05-03 Thread Jan Nielsen
On Thu, May 3, 2012 at 7:05 AM, Shaun Thomas wrote:

> I like to follow this:
>
> http://www.westnet.com/~gsmith/content/linux-pdflush.htm
>

Thanks for the reference, Shaun.


> As a note, there are actually new tunables for some of this: dirty_bytes,
> and dirty_background_bytes. With them, you can match them better to the
> actual size of your controller write cache so you can avoid page flush
> storms causing IO stalls.


That sounds interesting. How do you identify a page flush storm?


> Mr. Nielsen's setup actually looks pretty darn good. It's my personal
> opinion he might run into some IO waits if he plans to use this for heavy
> OLTP, thanks to having only 8 spindles in his RAID1+0, but he may
> eventually grow into a SAN. That's fine. It's a good starting point.


Cool - thanks, again, for the review, Shaun.


Cheers,

Jan



>
>
> --
> Shaun Thomas
> OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
> 312-444-8534
> stho...@optionshouse.com
>
>
> __________________________________
>
> See http://www.peak6.com/email_disclaimer/ for terms and conditions
> related to this email
>


Re: [PERFORM] scale up (postgresql vs mssql)

2012-05-03 Thread Merlin Moncure
On Sun, Apr 29, 2012 at 8:21 AM, Eyal Wilde  wrote:
> hi, all.
>
> well, i wondered why there is high rate of bo (blocks out). the procedure is
> practically read-only during the whole test. although it's not strictly
> read-only, because in a certain condition, there might be writing to a
> certain table. but that condition can not be met during this test.
>
> so, i created a ramdisk:
> mkfs -q /dev/ram2 10
> mkdir -p /ram4
> mount /dev/ram2 /ram4
> df -H | grep /ram4
>
> and then:
> CREATE TABLESPACE pgram4 OWNER postgres LOCATION '/ram4';
>
> and in postgresql.conf i configured:
> temp_tablespaces = 'pgram4'
>
> now, i believe, all the temp-table were in RAM.
> vmstat showed:
> r b swpd free buff cache si so bi bo in cs us sy id wa st
> 6 0 0 5916720 69488 1202668 0 0 0 3386 1902 1765 25 3 72 1 0
> 9 1 0 5907728 69532 1204212 0 0 0 1392 5375 4510 88 8 3 1 0
> 7 0 0 5886584 69672 1205096 0 0 0 1472 5278 4520 88 10 2 0 0
> 8 0 0 5877384 69688 1206384 0 0 0 1364 5312 4522 89 8 2 1 0
> 8 0 0 5869332 69748 1207188 0 0 0 1296 5285 4437 88 8 3 1 0
> 6 1 0 5854404 69852 1208776 0 0 0 2955 5333 4518 88 9 2 0 0
>
> 10 times less bo (blocks out)
> 5 times less wa (percentage of time spent by cpu waiting to IO)
> 2 times less b (wait Queue – Process which are waiting for I/O)
>
> the overall test result was (just?) ~15% better...
>
> when i created the ramdisk with mkfs.ext4 (instead of the default ext2), the
> performance was the same (~15% better), but vmstat output looked much the
> same as before (without the ramdisk) !?? why is that?
>
> as i mentioned, the procedure is practically read-only. shouldn't i expect
> bo (blocks out) to be ~0? after forcing temp-tables to be in the RAM, what
> other reasons may be the cause for bo (blocks out)?
>
> i see no point pasting the whole procedure here, since it's very long. the
> general course of the procedure is:
> create temp-tables if they are not exist (practically, they do exist)
> do a lot of: insert into temp-table select from table
> and         : insert into temp-table select from table join temp-table
> after finished insert into temp-table: analyze temp-table (this was the only
> way the optimizer behaved properly)
> finally, open refcursors of select from temp-tables

i/o writes from read queries can be caused by a couple of things:
*) sorts, and other 'spill to disk' features of large queries
*) hint bits (what I think is happening in your case):

the first time a tuple is touched after its controlling transaction
has committed, the transaction's state (committed or aborted) is saved
on the tuple itself to optimize subsequent accesses.  For most
workloads this is barely noticeable, but it can show up if you're
moving a lot of records around per transaction.
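
(One common mitigation, sketched here with invented table names, is to set
the hint bits up front by vacuuming right after a bulk load, so later
read-only scans do not have to dirty pages:)

-- hypothetical bulk load followed by VACUUM; VACUUM visits every page once
-- and sets the hint bits, so subsequent read-only scans need not write
INSERT INTO report_stage SELECT id, total FROM orders;
VACUUM ANALYZE report_stage;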

merlin

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Configuration Recommendations

2012-05-03 Thread Jan Nielsen
Hi John,

On Thu, May 3, 2012 at 12:54 AM, John Lister wrote:

>  On 03/05/2012 03:10, Jan Nielsen wrote:
>
>
> 300GB RAID10 2x15k drive for OS on local storage
> */dev/sda1 RA* 4096
> */dev/sda1 FS* ext4
> */dev/sda1 MO*
>
> 600GB RAID 10 8x15k drive for $PGDATA on SAN
> *IO Scheduler sda* noop anticipatory deadline [cfq]
> */dev/sdb1 RA* 4096
> */dev/sdb1 FS* xfs
> */dev/sdb1 MO* allocsize=256m,attr2,logbufs=8,logbsize=256k,noatime
>
> 300GB RAID 10 2x15k drive for $PGDATA/pg_xlog on SAN
> *IO Scheduler sdb* noop anticipatory deadline [cfq]
> */dev/sde1 RA* 4096
> */dev/sde1 FS* xfs
> */dev/sde1 MO* allocsize=256m,attr2,logbufs=8,logbsize=256k,noatime
>
>
> I was wondering if it would be better to put the xlog on the same disk as
> the OS? Apart from the occasional log writes I'd have thought most OS data
> is loaded into cache at the beginning, so you effectively have an unused
> disk. This gives you another spindle (mirrored) for your data.
>
> Or have I missed something fundamental?
>

I followed Gregory Smith's arguments from PostgreSQL 9.0 High Performance,
wherein he notes that WAL is sequential with constant cache flushes whereas
OS is a mix of sequential and random with rare cache flushes. This (might)
lead one to conclude that separating these would be good for at least the
WAL and likely both. Regardless, separating these very different
use-patterns seems like a "Good Thing" if tuning is ever needed for either.


Cheers,

Jan



>
> -- www.pricegoblin.co.uk
>
>


Re: [PERFORM] Result Set over Network Question

2012-05-03 Thread Claudio Freire
On Mon, Apr 30, 2012 at 3:32 PM, Ronald Hahn, DOCFOCUS INC.
 wrote:
> We went to the SQL2k server (On the same hardware) and ran the selects
> again. When bringing back on an int32 PG was faster with the fetch and the
> row coming back in 1-5 ms and SQL2k coming back in 500-700 ms. This tells me
> that the problem is not related to PG index or Disk.  When bringing back 400
> bytes PG fetch time would be 1-2 ms but the results would take 2-3 s but
> SQL2k would it bring back in 700-900 ms. Working with 866 bytes, PG fetch
> time is 1-3 ms with the results coming back in 9 - 11 s and SQL2k bringing
> the results back in 2-3 s.

I think the opposite: it's quite probable that disk access is what's
killing you. Remember, two different database systems means two
different access patterns.

To figure it out, you have to provide a lot more information on your
system and your query. Check out how to post "Slow Query Questions"
[0]. Only with all that information will the people on the list be
able to get a clue as to what your problem is.

[0] http://wiki.postgresql.org/wiki/SlowQueryQuestions
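
(In practice that means posting the full plan with timings, and on 9.0+
buffer counts as well; the query body here is just the placeholder from
the original post:)

EXPLAIN (ANALYZE, BUFFERS)   -- the BUFFERS option requires 9.0 or later
SELECT SomeInt32 FROM someTable WHERE something;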

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Result Set over Network Question

2012-05-03 Thread Merlin Moncure
On Mon, Apr 30, 2012 at 1:32 PM, Ronald Hahn, DOCFOCUS INC.
 wrote:
> Hi,
>     We have recently switch our product from MS SQL 2000 to Postgresql
> 9.0.7. We have tuned the searches and indexes so that they are very close
> (often better) to what sql2k was giving us.  We are noticing some
> differences now in the time it takes for the result set to make it back to
> the client and would like some help finding out why.
>
> What we see on the PG side is that if we run:
>     Select SomeInt32 from someTable where something Limit 1
> It consistently returns the results "instantaneously"  after the fetch
> time.  If we run the same select but ask for more data the fetch time stays
> the same but the row takes longer to come back.  Bringing back 400 bytes
> takes 1-2 s but bringing back 866 bytes takes 9 - 11 s.
>
> We went to the SQL2k server (On the same hardware) and ran the selects
> again. When bringing back on an int32 PG was faster with the fetch and the
> row coming back in 1-5 ms and SQL2k coming back in 500-700 ms. This tells me
> that the problem is not related to PG index or Disk.  When bringing back 400
> bytes PG fetch time would be 1-2 ms but the results would take 2-3 s but
> SQL2k would it bring back in 700-900 ms. Working with 866 bytes, PG fetch
> time is 1-3 ms with the results coming back in 9 - 11 s and SQL2k bringing
> the results back in 2-3 s.
>
> The head to head test was run in Aqua Data Studio 10.0.8 and ODBC driver
> 9.0.3.10. The same slow down happens when use PGadminIII. The differnces in
> time to not occure when running on the pg/sql server computer so I think
> there is a network component to this.

to rule out network just do:
create temp table scratch as select ...

if it's a lot faster, then you have a probable network issue.
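
(For instance, time both forms of one of the problem queries; \timing is
psql's client-side stopwatch, and the query body is the placeholder from
the original post:)

\timing on
-- normal run: server execution plus transfer of the rows to the client
SELECT SomeInt32 FROM someTable WHERE something;
-- server-side only: the same work, but no rows cross the network
CREATE TEMP TABLE scratch AS
SELECT SomeInt32 FROM someTable WHERE something;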

merlin

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Result Set over Network Question

2012-05-03 Thread Ronald Hahn, DOCFOCUS INC.
After some testing using Wireshark (the poor man's profiler) to see what was 
going on with the network, I found that it was the tools I've been using. 
Both Aqua and pgAdmin III have a large per-column overhead to fetch the 
metadata. MSSQL sends that data up front, so the impact isn't as bad. I'm 
not sure if it's a limitation of the pgsql protocol vs. TDS or a 
limitation of Aqua, or a combination of both. At any rate, it turns out 
not to be part of the problem I'm having with my software stalling out, 
so I'm back to square one with my problem.


Thanks,
Ron



Re: [PERFORM] Result Set over Network Question

2012-05-03 Thread Merlin Moncure
On Thu, May 3, 2012 at 10:28 AM, Ronald Hahn, DOCFOCUS INC.
 wrote:
> After some testing using wiershark (poor mans profiler) to see what was
> going on with the network I found that it was the tools I've been using.
> Both Aqua and PGadminIII have a large overhead per column to get the meta
> data. MSSQL sends that data upfront so the impact isn't as bad. I'm not sure
> if it's a limitation of the pgsql protocol vs tds or a limitation of Aqua or
> a combination of both. At any rate it turns out not to be part of the
> problem I'm having with my software stalling out so I'm back to Square one
> with my problem.

OK, let's figure out what the issue is then.  First, let's make sure
it isn't the server that's stalling: configure
log_min_duration_statement with an appropriate value so you start
catching queries that are taking longer than you think they should.
Also, some client-side logging directly wrapping the SQL invocation
couldn't hurt.  Is your application JDBC?
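
(As a hedged illustration, with the 250 ms threshold and the database name
as placeholders; the value can equally be set cluster-wide in
postgresql.conf:)

ALTER DATABASE appdb SET log_min_duration_statement = 250;  -- log statements slower than 250 ms
-- or cluster-wide in postgresql.conf:
--   log_min_duration_statement = 250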

merlin

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Query got slow from 9.0 to 9.1 upgrade

2012-05-03 Thread Rural Hunter
This is very similar with my problem: 
http://postgresql.1045698.n5.nabble.com/index-choosing-problem-td5567320.html


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Configuration Recommendations

2012-05-03 Thread Craig James
On Thu, May 3, 2012 at 6:42 AM, Jan Nielsen  wrote:
> Hi John,
>
> On Thu, May 3, 2012 at 12:54 AM, John Lister 
> wrote:
>>
>> On 03/05/2012 03:10, Jan Nielsen wrote:
>>
>>
>> 300GB RAID10 2x15k drive for OS on local storage
>> */dev/sda1 RA*    4096
>> */dev/sda1 FS*    ext4
>> */dev/sda1 MO*
>>
>> 600GB RAID 10 8x15k drive for $PGDATA on SAN
>> *IO Scheduler sda*    noop anticipatory deadline [cfq]
>> */dev/sdb1 RA*    4096
>> */dev/sdb1 FS* xfs
>> */dev/sdb1 MO*
>> allocsize=256m,attr2,logbufs=8,logbsize=256k,noatime
>>
>> 300GB RAID 10 2x15k drive for $PGDATA/pg_xlog on SAN
>> *IO Scheduler sdb*    noop anticipatory deadline [cfq]
>> */dev/sde1 RA*    4096
>> */dev/sde1 FS* xfs
>> */dev/sde1 MO*    allocsize=256m,attr2,logbufs=8,logbsize=256k,noatime
>> *
>>
>>
>> I was wondering if it would be better to put the xlog on the same disk as
>> the OS? Apart from the occasional log writes I'd have thought most OS data
>> is loaded into cache at the beginning, so you effectively have an unused
>> disk. This gives you another spindle (mirrored) for your data.
>>
>> Or have I missed something fundamental?
>
>
> I followed Gregory Smith's arguments from PostgreSQL 9.0 High Performance,
> wherein he notes that WAL is sequential with constant cache flushes whereas
> OS is a mix of sequential and random with rare cache flushes. This (might)
> lead one to conclude that separating these would be good for at least the
> WAL and likely both. Regardless, separating these very different
> use-patterns seems like a "Good Thing" if tuning is ever needed for either.

Another consideration is journaling vs. non-journaling file systems.
If the WAL is on its own file system (not necessarily its own
spindle), you can use a non-journaling file system like ext2.  The WAL
is actually quite small and is itself a journal, so there's no reason
to use a journaling file system.  On the other hand, you don't want
the operating system on ext2 because it takes a long time to recover
from a crash.

I think you're right about the OS: once it starts, there is very
little disk activity.  I'd say put both on the same disk but on
different partitions.  The OS can use ext4 or some other modern
journaling file system, and the WAL can use ext2.  This also means you
can put the WAL on the outer (fastest) part of the disk and leave the
slow inner tracks for the OS.

Craig

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Configuration Recommendations

2012-05-03 Thread Scott Carey


On 5/3/12 8:46 AM, "Craig James"  wrote:

>On Thu, May 3, 2012 at 6:42 AM, Jan Nielsen 
>wrote:
>> Hi John,
>>
>> On Thu, May 3, 2012 at 12:54 AM, John Lister
>>
>> wrote:
>>>
>>> On 03/05/2012 03:10, Jan Nielsen wrote:
>>>
>>>
>>> 300GB RAID10 2x15k drive for OS on local storage
>>> */dev/sda1 RA* 4096
>>> */dev/sda1 FS* ext4
>>> */dev/sda1 MO*
>>>
>>> 600GB RAID 10 8x15k drive for $PGDATA on SAN
>>> *IO Scheduler sda* noop anticipatory deadline [cfq]
>>> */dev/sdb1 RA* 4096
>>> */dev/sdb1 FS* xfs
>>> */dev/sdb1 MO* allocsize=256m,attr2,logbufs=8,logbsize=256k,noatime
>>>
>>> 300GB RAID 10 2x15k drive for $PGDATA/pg_xlog on SAN
>>> *IO Scheduler sdb* noop anticipatory deadline [cfq]
>>> */dev/sde1 RA* 4096
>>> */dev/sde1 FS* xfs
>>> */dev/sde1 MO* allocsize=256m,attr2,logbufs=8,logbsize=256k,noatime
>>>
>>>
>>> I was wondering if it would be better to put the xlog on the same disk
>>>as
>>> the OS? Apart from the occasional log writes I'd have thought most OS
>>>data
>>> is loaded into cache at the beginning, so you effectively have an
>>>unused
>>> disk. This gives you another spindle (mirrored) for your data.
>>>
>>> Or have I missed something fundamental?
>>
>>
>> I followed Gregory Smith's arguments from PostgreSQL 9.0 High
>>Performance,
>> wherein he notes that WAL is sequential with constant cache flushes
>>whereas
>> OS is a mix of sequential and random with rare cache flushes. This
>>(might)
>> lead one to conclude that separating these would be good for at least
>>the
>> WAL and likely both. Regardless, separating these very different
>> use-patterns seems like a "Good Thing" if tuning is ever needed for
>>either.
>
>Another consideration is journaling vs. non-journaling file systems.

Not really.  ext4 with journaling on is faster than ext2 with it off;
ext2 should never be used if ext4 is available.

If you absolutely refuse to have a journal, turn the journal off in
ext4 and you have a faster and safer file system than ext2.

>If the WAL is on its own file system (not necessarily its own
>spindle), you can use a non-journaling file system like ext2.  The WAL
>is actually quite small and is itself a journal, so there's no reason
>to use a journaling file system.  On the other hand, you don't want
>the operating system on ext2 because it takes a long time to recover
>from a crash.
>
>I think you're right about the OS: once it starts, there is very
>little disk activity.  I'd say put both on the same disk but on
>different partitions.  The OS can use ext4 or some other modern
>journaling file system, and the WAL can use ext2.  This also means you
>can put the WAL on the outer (fastest) part of the disk and leave the
>slow inner tracks for the OS.
>
>Craig
>
>-- 
>Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
>To make changes to your subscription:
>http://www.postgresql.org/mailpref/pgsql-performance


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Configuration Recommendations

2012-05-03 Thread Scott Carey

On 4/25/12 2:29 PM, "Shaun Thomas"  wrote:

>On 04/25/2012 02:46 AM, John Lister wrote:
>
>> Hi, I'd be grateful if you could share any XFS performance tweaks as I'm
>> not entirely sure I'm getting the most out of my setup and any
>> additional guidance would be very helpful.
>
>Ok, I'll give this with a huge caveat: these settings came from lots of
>testing, both load and pgbench based. I'll explain as much as I can.

The configured file system read-ahead is also an important factor -- how
important is sequential scan performance?  More read-ahead (up to a point)
biases your I/O for sequential throughput.  The deadline scheduler is also
biased slightly for throughput, meaning it will sacrifice some random iops
in order to get a sequential scan out of the way.

We have a couple of systems that have aged a long time on XFS and ext3.  Over
time, XFS slaughters ext3.  This is due primarily to one feature: online
defragmentation.  Our ext3 systems are so horribly fragmented that
sequential scans almost no longer exist.  ext4 is supposed to be better at
preventing fragmentation, but there is no online defragmenter.  After a
parallel restore, postgres is rather fragmented; XFS can correct that,
and disk throughput for sequential scans increases significantly after
defragmentation.  We schedule defragmentation passes nightly, which do
not take long after the initial pass.

>
>For initializing the XFS filesystem, you can take advantage of a few
>settings that are pretty handy.
>
>* -d agcount=256 - Higher amount of allocation groups works better with
>multi-CPU systems. We used 256, but you'll want to do tests to confirm
>this. The point is that you can have several threads writing to the
>filesystem simultaneously.
>
>* -l lazy-count=1 - Log data is written more efficiently. Gives a
>measurable performance boost. Newer versions set this, but CentOS 5 has
>the default to 0. I'm not sure about CentOS 6. Just enable it. :)
>
>* -l version=2 - Forces the most recent version of the logging
>algorithm; allows a larger log buffer on mount. Since you're using
>CentOS, the default value is still probably 1, which you don't want.
>
>And then there are the mount options. These actually seemed to make more
>of an impact in our testing:
>
>* allocsize=256m - Database files are up to 1GB in size. To prevent
>fragmentation, always pre-allocate in 256MB chunks. In recent 3.0+
>kernels, this setting will result in phantom storage allocation as each
>file is initially allocated with 256MB until all references have exited
>memory. Due to aggressive Linux inode cache behavior, this may not
>happen for several hours. On 3.0 kernels, this setting should be
>removed. I think the 2.6.35 kernel had this backported, so *TEST THIS
>SETTING BEFORE USING IT!*
>
>* logbufs=8 - Forces more of the log buffer to remain in RAM, improving
>file deletion performance. Good for temporary files. XFS often gets
>knocked for file deletion performance, and this brings it way up. Not
>really an issue with PG usage, but handy anyway. See logbsize.
>
>* logbsize=256k - Larger log buffers keep track of modified files in
>memory for better performance. See logbufs.
>
>* noatime - Negates touching the disk for file accesses. Reduces disk IO.
>
>* attr2 - Opportunistic improvement in the way inline extended
>attributes are stored on-disk. Not strictly necessary, but handy.
>
>
>I'm hoping someone else will pipe in, because these settings are pretty
>"old" and based on a CentOS 5.5 setup. I haven't done any metrics on the
>newer kernels, but I have followed enough to know allocsize is dangerous
>on new systems.
>
>Your mileage may vary. :)
>
>-- 
>Shaun Thomas
>OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
>312-444-8534
>stho...@peak6.com
>
>__
>
>See http://www.peak6.com/email_disclaimer/ for terms and conditions
>related to this email
>
>-- 
>Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
>To make changes to your subscription:
>http://www.postgresql.org/mailpref/pgsql-performance


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance