Re: [PERFORM] hardware performance and some more

2003-07-25 Thread Kasim Oztoprak
On 24 Jul 2003 23:25 EEST you wrote:

> On Thu, 2003-07-24 at 13:25, Kasim Oztoprak wrote:
> > On 24 Jul 2003 17:08 EEST you wrote:
> > 
> > > On 24 Jul 2003 at 15:54, Kasim Oztoprak wrote:
> [snip]
> > 
> > we do not have memory problems or disk problems. as i have seen on the list, the
> > best way to use the disks is raid 10 for data and raid 1 for the os. we can put in
> > as much memory as we require.
> > 
> > now the question: if we have 100 searches per second, and each search needs 30 sql
> > statements, what will the performance of the system be in terms of time? let us say
> > we have two machines as described above, in a cluster.
> 
> That's 3000 sql statements per second, 180 thousand per minute.
> What the heck is this database doing!
> 
> A quad-CPU Opteron sure is looking useful right about now...  Or
> a quad-CPU AlphaServer ES45 running Linux, if 4x Opterons aren't
> available.
> 
> How complicated are each of these SELECT statements?

this is a kind of directory assistance application. actually the select statements are
not very complex. the database contains 25 million subscriber records, and the operators
search for subscriber numbers or addresses. there are not many update operations;
the update ratio is approximately 0.1%.

i will use at least 4 machines, each having 4 cpus (2.8 ghz xeon processors)
and suitable memory capacity to go with them.

i hope this will overcome the problem. any similar implementations?



---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [PERFORM] hardware performance and some more

2003-07-25 Thread Shridhar Daithankar
On 25 Jul 2003 at 16:38, Kasim Oztoprak wrote:
> this is a kind of directory assistance application. actually the select statements are
> not very complex. the database contains 25 million subscriber records, and the operators
> search for subscriber numbers or addresses. there are not many update operations;
> the update ratio is approximately 0.1%.
> 
> i will use at least 4 machines, each having 4 cpus (2.8 ghz xeon processors)
> and suitable memory capacity to go with them.

Are you going to duplicate the data?

If you are going to have 3000 sql statements per second, I would suggest,

1. Get quad CPU. You probably need that horsepower
2. Use prepared statements and stored procedures to avoid parsing overhead.
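
As an illustration of suggestion 2, here is a hypothetical sketch of a prepared statement — the table and column names (`subscribers`, `phone_number`, etc.) are invented for the example, not taken from Kasim's actual schema:

```sql
-- Prepare once per connection; the statement is parsed and planned a single time.
PREPARE find_subscriber(text) AS
    SELECT phone_number, name, address
    FROM subscribers
    WHERE phone_number = $1;

-- Each subsequent lookup skips the parse/plan overhead:
EXECUTE find_subscriber('555-0100');
```

At 3000 statements per second, even a fraction of a millisecond of parse time saved per statement adds up quickly.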

I doubt you would need a cluster of machines though. If you run it through a pilot 
program, that would give you an idea whether or not you need a cluster..

Bye
 Shridhar

--
Default, n.:  The hardware's, of course.




Re: [PERFORM] hardware performance and some more

2003-07-25 Thread Ron Johnson
On Fri, 2003-07-25 at 11:38, Kasim Oztoprak wrote:
> On 24 Jul 2003 23:25 EEST you wrote:
> 
> > On Thu, 2003-07-24 at 13:25, Kasim Oztoprak wrote:
> > > On 24 Jul 2003 17:08 EEST you wrote:
> > > 
> > > > On 24 Jul 2003 at 15:54, Kasim Oztoprak wrote:
> > [snip]
> > > 
> > > we do not have memory problems or disk problems. as i have seen on the list, the
> > > best way to use the disks is raid 10 for data and raid 1 for the os. we can put in
> > > as much memory as we require.
> > > 
> > > now the question: if we have 100 searches per second, and each search needs 30 sql
> > > statements, what will the performance of the system be in terms of time? let us say
> > > we have two machines as described above, in a cluster.
> > 
> > That's 3000 sql statements per second, 180 thousand per minute.
> > What the heck is this database doing!
> > 
> > A quad-CPU Opteron sure is looking useful right about now...  Or
> > a quad-CPU AlphaServer ES45 running Linux, if 4x Opterons aren't
> > available.
> > 
> > How complicated are each of these SELECT statements?
> 
> this is a kind of directory assistance application. actually the select statements are
> not very complex. the database contains 25 million subscriber records, and the operators
> search for subscriber numbers or addresses. there are not many update operations;
> the update ratio is approximately 0.1%.
> 
> i will use at least 4 machines, each having 4 cpus (2.8 ghz xeon processors)
> and suitable memory capacity to go with them.
> 
> i hope this will overcome the problem. any similar implementations?

Since PG doesn't have active-active clustering, that's out, but since
the database will be very static, why not have, say, 8 machines, each
with its own copy of the database?  (Since there are so few updates,
you feed the updates to a little Perl app that then makes the changes
on each machine.)  (A round-robin load balancer would do the trick
in utilizing them all.)

Also, with lots of machines, you could get away with less expensive
machines, say 2GHz CPU, 1GB RAM and a 40GB IDE drive.  Then, if one
goes down for some reason, you've only lost a small portion of your
capacity, and replacing a part will be very inexpensive.

And if volume increases, just add more USD1000 machines...

-- 
+-----------------------------------------------------------------+
| Ron Johnson, Jr.       Home: [EMAIL PROTECTED]                  |
| Jefferson, LA  USA                                              |
|                                                                 |
| "I'm not a vegetarian because I love animals, I'm a vegetarian  |
|  because I hate vegetables!"                                    |
|                                       unknown                   |
+-----------------------------------------------------------------+





Re: [PERFORM] hardware performance and some more

2003-07-25 Thread Kasim Oztoprak
On 25 Jul 2003 17:13 EEST you wrote:

> On 25 Jul 2003 at 16:38, Kasim Oztoprak wrote:
> > this is a kind of directory assistance application. actually the select statements
> > are not very complex. the database contains 25 million subscriber records, and the
> > operators search for subscriber numbers or addresses. there are not many update
> > operations; the update ratio is approximately 0.1%.
> > 
> > i will use at least 4 machines, each having 4 cpus (2.8 ghz xeon processors)
> > and suitable memory capacity to go with them.
> 
> Are you going to duplicate the data?
> 
> If you are going to have 3000 sql statements per second, I would suggest,
> 
> 1. Get quad CPU. You probably need that horsepower
> 2. Use prepared statements and stored procedures to avoid parsing overhead.
> 
> I doubt you would need a cluster of machines though. If you run it through a pilot 
> program, that would give you an idea whether or not you need a cluster..
> 
> Bye
>  Shridhar
>

i will try to cluster them. i can duplicate the data if i need to. in the case of 
updates, i will then propagate them through to each copy. 

what exactly do you mean by a pilot program?

-kasım 




Re: [PERFORM] hardware performance and some more

2003-07-25 Thread Shridhar Daithankar
On 25 Jul 2003 at 18:41, Kasim Oztoprak wrote:
> what exactly do you mean from a pilot program?

Like get a quad CPU box, load the data, and ask only 10 operators to test the 
system..

Beta testing basically..

Bye
 Shridhar

--
The man on top walks a lonely street; the "chain" of command is often a noose.




Re: [PERFORM] hardware performance and some more

2003-07-25 Thread Josh Berkus
Folks,

> Since PG doesn't have active-active clustering, that's out, but since
> the database will be very static, why not have, say 8 machines, each
> with it's own copy of the database?  (Since there are so few updates,
> you feed the updates to a litle Perl app that then makes the changes
> on each machine.)  (A round-robin load balancer would do the trick
> in utilizing them all.)

Another approach I've seen work is to have several servers connect to one SAN 
or NAS where the data lives.  Only one server is enabled to handle "write" 
requests; all the rest are read-only.  This does mean having dispatching 
middleware that parcels out requests among the servers, but it works very well 
for the Java-based company that's using it.

-- 
Josh Berkus
Aglio Database Solutions
San Francisco



[PERFORM] index questions

2003-07-25 Thread Franco Bruno Borghesi




Hi everyone.

I have created a simplified example of a real case, to show you what I'm trying to do. I have
a table, like this:

CREATE TABLE sales (
   saleId SERIAL,
   clientId INTEGER,
   branchId INTEGER,
   productId INTEGER,
   employeeId INTEGER,
   saleDate DATE,
   price NUMERIC(12, 2),
   qty INTEGER,
   PRIMARY KEY(saleId)
);
CREATE INDEX sales_k1 ON sales(clientId, branchId, productId, employeeId, saleDate, price, qty);

This table will grow to *many* rows in the future.

I want to make a function that returns the FIRST saleId of the sale that matches some
conditions. I will always receive the Client Id, but not always the other arguments
(sent as NULLs).

The fetched resultset should prioritize the passed arguments, and after that, the
saleDate, price and quantity.


/**
 * Finds the first sale that matches the conditions received.
 * @param $1 Client Id.
 * @param $2 Preferred Branch Id.
 * @param $3 Preferred Product Id.
 * @param $4 Preferred Employee Id.
 * @return Sale Id if found, NULL if not.
 */
CREATE OR REPLACE FUNCTION findSale(INTEGER, INTEGER, INTEGER, INTEGER) RETURNS INTEGER AS '
DECLARE
   a_clientId ALIAS FOR $1;
   a_branchId ALIAS FOR $1;
   a_productId ALIAS FOR $1;
   a_employeeId ALIAS FOR $1;
   r_result INTEGER;
BEGIN
   SELECT
  INTO r_result employeeId
   FROM
  sales
   WHERE
  clientId=a_clientId AND
  branchId=coalesce(a_branchId, branchId) AND /*branchId is null? anything will be ok*/
  productId=coalesce(a_productId, productId) AND /*productId is null? anything will be ok*/
  employeeId=coalesce(a_employeeId, employeeId) /*employeeId is null? anything will be ok*/
   ORDER BY
  clientId, branchId, productId, employeeId, saleDate, price, qty
   LIMIT 1;

   RETURN r_result;
END;
' LANGUAGE 'plpgsql';


Will findSale() in the future, when I have *many* rows, still use the index when only
the first couple of arguments are passed to the function?
If not, should I create more indexes (and functions) for each possible argument
combination? (of course, with the given order)

The thing here is that I don't understand how postgreSQL solves the query when the
COALESCEs are used... it uses the index now, with a few thousand records, but what
will happen in a few months?

Thanks in advance.






Re: [PERFORM] hardware performance and some more

2003-07-25 Thread Ron Johnson
On Fri, 2003-07-25 at 11:13, Josh Berkus wrote:
> Folks,
> 
> > Since PG doesn't have active-active clustering, that's out, but since
> > the database will be very static, why not have, say, 8 machines, each
> > with its own copy of the database?  (Since there are so few updates,
> > you feed the updates to a little Perl app that then makes the changes
> > on each machine.)  (A round-robin load balancer would do the trick
> > in utilizing them all.)
> 
> Another approach I've seen work is to have several servers connect to one SAN 
> or NAS where the data lives.  Only one server is enabled to handle "write" 
> requests; all the rest are read-only.  This does mean having dispatching 
> middleware that parcels out requests among the servers, but it works very well 
> for the Java-based company that's using it.

Wouldn't the cache on the read-only databases get out of sync with
the true on-disk data?






Re: [PERFORM] index questions

2003-07-25 Thread Ron Johnson
On Fri, 2003-07-25 at 11:52, Franco Bruno Borghesi wrote:
[snip]
> 
> 
> Will findSale() in the future, when I have *many* rows, still use the
> index when only the first couple of arguments are passed to the function?
> If not, should I create more indexes (and functions) for each possible
> argument combination? (of course, with the given order)
> 
> The thing here is that I don't understand how postgreSQL solves the
> query when the COALESCEs are used... it uses the index now, with a few
> thousand records, but what will happen in a few months?

When faced with cases like this, I cobble together a script/program
that generates a few million rows of random data (within the confines
of FKs, of course) to populate these tables like "sales", and then I
see how things perform.
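
A sketch of that kind of data generator, done directly in SQL against Franco's sales table — assuming a PostgreSQL recent enough to have generate_series() (on older versions you'd use a small set-returning function or a driving script instead); the value ranges are invented for illustration:

```sql
-- Populate "sales" with 1 million rows of random-but-plausible data.
INSERT INTO sales (clientId, branchId, productId, employeeId, saleDate, price, qty)
SELECT
    1 + (random() * 999)::int,             -- clientId   in 1..1000
    1 + (random() * 49)::int,              -- branchId   in 1..50
    1 + (random() * 499)::int,             -- productId  in 1..500
    1 + (random() * 99)::int,              -- employeeId in 1..100
    current_date - (random() * 365)::int,  -- a date within the last year
    round((random() * 100)::numeric, 2),   -- price between 0.00 and 100.00
    1 + (random() * 9)::int                -- qty in 1..10
FROM generate_series(1, 1000000);

ANALYZE sales;  -- refresh statistics so the planner sees the new volume
```

Then run EXPLAIN ANALYZE on the candidate queries to see whether the index is still chosen at realistic volumes.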






Re: [PERFORM] index questions

2003-07-25 Thread Josh Berkus
Franco,

> CREATE INDEX sales_k1 ON sales(clientId, branchId, productId,
> employeeId, saleDate, price, qty);

A 7-column index is unlikely to be effective -- the index will be almost as 
large as the table.   Try indexing only the first 3-4 columns instead.  

> I want to make a function that returns the FIRS saleId of the sale that
> matches some conditions. I will
> always receive the Client Id, but not always the other arguments (sent
> as NULLs).

Well, keep in mind that your multi-column index will only be useful if all 
columns are queried starting from the left.  That is, the index will be 
ignored if you have a "where productId = x" without a "where branchid = y".
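
To make the left-prefix rule concrete, against the sales table and sales_k1 index from the original post:

```sql
-- Can use sales_k1: the leading columns (clientId, branchId) are constrained.
SELECT saleId FROM sales WHERE clientId = 7 AND branchId = 2;

-- Cannot use sales_k1 effectively: clientId, the leftmost column, is absent.
SELECT saleId FROM sales WHERE productId = 5;
```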

> CREATE OR REPLACE FUNCTION findSale(INTEGER, INTEGER, INTEGER, INTEGER)
> RETURNS INTEGER AS '
> DECLARE
>a_clientId ALIAS FOR $1;
>a_branchId ALIAS FOR $1;
>a_productId ALIAS FOR $1;
>a_employeeId ALIAS FOR $1;

Your aliases are wrong here.

>   branchId=coalesce(a_branchId, branchId) AND /*branchId is null?
> anything will be ok*/
>   productId=coalesce(a_productId, productId) AND /*productId is
> null? anything will be ok*/

On a very large table this will be very inefficient.  You'll be comparing the 
productid, for example, even if no productid is passed ... and the index 
won't do you any good, because the planner should figure out that 100% of rows 
match the condition.  

Instead, I recommend that you build up a dynamic query as a string and then 
pass only the conditions sent by the user.  You can then EXECUTE the query 
and loop through it for a result.
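
A rough sketch of that dynamic-query approach — untested, and written in the same pre-8.0 style as Franco's function (the body is a single-quoted string, so embedded quotes are doubled):

```sql
CREATE OR REPLACE FUNCTION findSale(INTEGER, INTEGER, INTEGER, INTEGER)
RETURNS INTEGER AS '
DECLARE
   query TEXT;
   rec   RECORD;
BEGIN
   -- clientId is always supplied; each optional argument is appended
   -- only when it is non-NULL, so the planner never sees a
   -- col = col comparison that matches 100% of the rows.
   query := ''SELECT saleId FROM sales WHERE clientId = '' || $1;
   IF $2 IS NOT NULL THEN
      query := query || '' AND branchId = '' || $2;
   END IF;
   IF $3 IS NOT NULL THEN
      query := query || '' AND productId = '' || $3;
   END IF;
   IF $4 IS NOT NULL THEN
      query := query || '' AND employeeId = '' || $4;
   END IF;
   query := query || '' ORDER BY saleDate, price, qty LIMIT 1'';

   FOR rec IN EXECUTE query LOOP
      RETURN rec.saleId;
   END LOOP;
   RETURN NULL;  -- no matching sale
END;
' LANGUAGE 'plpgsql';
```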

Of course, YMMV.   My approach will require you to create more indexes which 
could be a problem if you have limited disk space.

-- 
-Josh Berkus
 Aglio Database Solutions
 San Francisco


---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [PERFORM] index questions

2003-07-25 Thread Franco Bruno Borghesi




What you say is that the index is not effective because of its size, but it would still be used *if* the conditions are right... In this case, I care about performance, not space. 

But what you say about the index not being good because 100% of the rows match the condition confirms what I suspected. 

Thanks for your help.


On Fri, 2003-07-25 at 14:28, Josh Berkus wrote:
> [snip]



Re: [PERFORM] hardware performance and some more

2003-07-25 Thread Shridhar Daithankar
On 24 Jul 2003 at 9:42, William Yu wrote:

> As far as I can tell, the performance impact seems to be minimal. 
> There's a periodic storm of replication updates in cases where there's 
> mass updates sync last resync. But if you have mostly reads and few 
> writes, you shouldn't see this situation. The biggest performance impact 
> seems to be the CPU power needed to zip/unzip/encrypt/decrypt files.

Can you use WAL based replication? I don't have a URL handy, but there are 
replication projects which transmit WAL files to another server when they fill 
up.

OTOH, I was thinking of a simple replication scheme. If postgresql provided a 
hook where it called an external library routine for each heapinsert in WAL, 
there could be a simple multi-slave replication system. One wouldn't have to 
wait till the WAL file fills up.

Of course, it's up to the library to make sure that it does not hold up 
postgresql commits so long that it hampers performance.

There would also need to be a receiving hook which would directly heapinsert 
the data on another node.

But if the external library is threaded, will that work well with postgresql?

Just a thought. If it works, load-balancing could be a lot easier and 
near-realtime..


Bye
 Shridhar

--
We fight only when there is no other choice.  We prefer the ways of peaceful contact.
 -- Kirk, "Spectre of the Gun", stardate 4385.3




[PERFORM] Wrong rows number expected

2003-07-25 Thread Mendola Gaetano
Hi all,
last week Josh Berkus pointed my attention
(see the post "Wrong plan or what")
to the fact that in this select:

select *
from user_logs ul,
 user_data ud,
 class_default cd
where
 ul.id_user = ud.id_user and
 ud.id_class = cd.id_class and
 cd.id_provider = 39;


The planner says:

QUERY PLAN
 Hash Join  (cost=265.64..32000.76 rows=40612 width=263) (actual time=11074.21..11134.28 rows=10 loops=1)
   Hash Cond: ("outer".id_user = "inner".id_user)
   ->  Seq Scan on user_logs ul  (cost=0.00..24932.65 rows=1258965 width=48) (actual time=0.02..8530.21 rows=1258966 loops=1)
   ->  Hash  (cost=264.81..264.81 rows=331 width=215) (actual time=30.22..30.22 rows=0 loops=1)
         ->  Nested Loop  (cost=0.00..264.81 rows=331 width=215) (actual time=29.95..30.20 rows=6 loops=1)
               ->  Seq Scan on class_default cd  (cost=0.00..1.39 rows=1 width=55) (actual time=0.08..0.10 rows=1 loops=1)
                     Filter: (id_provider = 39)
               ->  Index Scan using idx_user_data_class on user_data ud  (cost=0.00..258.49 rows=395 width=160) (actual time=29.82..29.96 rows=6 loops=1)
                     Index Cond: (ud.id_class = "outer".id_class)
 Total runtime: 11135.65 msec
(10 rows)


and the row estimate reported in:

Hash Join  (cost=265.64..32000.76 rows=40612 width=263) (actual time=11074.21..11134.28 rows=10 loops=1)

is wrong about the rows returned.
I did what Josh Berkus suggested:

1) Make sure you've VACUUM ANALYZED
2) Adjust the following postgresql.conf statistics:
a) effective_cache_size: increase to 70% of available (not used by other
processes) RAM.
b) random_page_cost: decrease, maybe to 2.
c) default_statistics_target: try increasing to 100
(warning: this will significantly increase the time required to do ANALYZE)


I also pushed default_statistics_target up to 1000, but the plan remains the same,
with an execution time of 11 secs. However, if I run the following 3 equivalent
queries I obtain the same result in only a few ms:


SELECT id_class from class_default where id_provider = 39;
 id_class
----------
       48
(1 row)

SELECT id_user from user_data where id_class in ( 48 );
 id_user
---------
   10943
   10942
   10934
   10927
   10910
   10909
(6 rows)


SELECT * from user_logs where id_user in (
 10943,   10942,   10934,   10927,   10910,  10909
);



May I do something else ?


Thank you in advance
Gaetano Mendola











