Re: [PERFORM] hardware performance and some more
On 24 Jul 2003 23:25 EEST you wrote:
> On Thu, 2003-07-24 at 13:25, Kasim Oztoprak wrote:
> > On 24 Jul 2003 17:08 EEST you wrote:
> > > On 24 Jul 2003 at 15:54, Kasim Oztoprak wrote:
> [snip]
> > we do not have memory problems or disk problems. as I have seen on the list,
> > the best way to use the disks is RAID 10 for data and RAID 1 for the OS. we
> > can put in as much memory as we require.
> >
> > now the question: if we have 100 searches per second, and each search needs
> > 30 sql statements, what will the performance of the system be in terms of
> > time? let us say we have two machines as described above in a cluster.
>
> That's 3000 sql statements per second, 180 thousand per minute.
> What the heck is this database doing!
>
> A quad-CPU Opteron sure is looking useful right about now... Or
> a quad-CPU AlphaServer ES45 running Linux, if 4x Opterons aren't
> available.
>
> How complicated are each of these SELECT statements?

this is a kind of directory assistance application. actually the select
statements are not very complex. the database contains 25 million subscriber
records, and the operators search for subscriber numbers or addresses. there are
not many update operations; the update ratio is approximately 0.1%.

i will use at least 4 machines, each having 4 cpus (2.8 GHz Xeon processors),
and suitable memory capacity with them.

i hope this will overcome the problem. any similar implementations?

> --
> | Ron Johnson, Jr.  Home: [EMAIL PROTECTED]                      |
> | Jefferson, LA USA                                              |
> |                                                                |
> | "I'm not a vegetarian because I love animals, I'm a vegetarian |
> | because I hate vegetables!"                                    |
> |   unknown                                                      |

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
joining column's datatypes do not match
Re: [PERFORM] hardware performance and some more
On 25 Jul 2003 at 16:38, Kasim Oztoprak wrote:
> this is a kind of directory assistance application. actually the select
> statements are not very complex. the database contains 25 million subscriber
> records, and the operators search for subscriber numbers or addresses. there
> are not many update operations; the update ratio is approximately 0.1%.
>
> i will use at least 4 machines, each having 4 cpus (2.8 GHz Xeon processors),
> and suitable memory capacity with them.

Are you going to duplicate the data?

If you are going to have 3000 sql statements per second, I would suggest:

1. Get a quad-CPU box. You probably need that horsepower.
2. Use prepared statements and stored procedures to avoid parsing overhead.

I doubt you would need a cluster of machines, though. If you run it through a
pilot program, that would give you an idea of whether or not you need a cluster.

Bye
Shridhar

--
Default, n.: The hardware's, of course.
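Shridhar's second suggestion can be sketched in SQL. This is a minimal,
hypothetical example — the `subscribers` table and its columns are illustrative
assumptions, not from the original posts — showing how a prepared statement lets
the backend parse and plan the query once and reuse the plan for every lookup:

```sql
-- Hypothetical subscriber lookup; table and column names are illustrative.
PREPARE find_subscriber (TEXT) AS
    SELECT phone_number, full_name, address
    FROM subscribers
    WHERE last_name = $1;

-- Each EXECUTE skips the parse/plan step and only substitutes the parameter.
EXECUTE find_subscriber('Oztoprak');
EXECUTE find_subscriber('Johnson');
```

Prepared statements live for the lifetime of a connection, so an operator
application holding pooled connections would prepare them once per connection at
startup.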
Re: [PERFORM] hardware performance and some more
On Fri, 2003-07-25 at 11:38, Kasim Oztoprak wrote:
[snip]
> this is a kind of directory assistance application. actually the select
> statements are not very complex. the database contains 25 million subscriber
> records, and the operators search for subscriber numbers or addresses. there
> are not many update operations; the update ratio is approximately 0.1%.
>
> i will use at least 4 machines, each having 4 cpus (2.8 GHz Xeon processors),
> and suitable memory capacity with them.
>
> i hope this will overcome the problem. any similar implementations?

Since PG doesn't have active-active clustering, that's out, but since the
database will be very static, why not have, say, 8 machines, each with its own
copy of the database? (Since there are so few updates, you feed the updates to a
little Perl app that then makes the changes on each machine.) (A round-robin
load balancer would do the trick in utilizing them all.)

Also, with lots of machines, you could get away with less expensive machines,
say a 2GHz CPU, 1GB RAM and a 40GB IDE drive. Then, if one goes down for some
reason, you've only lost a small portion of your capacity, and replacing a part
will be very inexpensive. And if volume increases, just add more USD1000
machines...

---(end of broadcast)---
TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [PERFORM] hardware performance and some more
On 25 Jul 2003 17:13 EEST you wrote:
> On 25 Jul 2003 at 16:38, Kasim Oztoprak wrote:
> [snip]
> Are you going to duplicate the data?
>
> If you are going to have 3000 sql statements per second, I would suggest:
>
> 1. Get a quad-CPU box. You probably need that horsepower.
> 2. Use prepared statements and stored procedures to avoid parsing overhead.
>
> I doubt you would need a cluster of machines, though. If you run it through a
> pilot program, that would give you an idea of whether or not you need a
> cluster.
>
> Bye
> Shridhar

i will try to cluster them. i can duplicate the data if i need to. in the case
of an update, i will then apply the fix through to each copy.

what exactly do you mean by a pilot program?

-kasım
Re: [PERFORM] hardware performance and some more
On 25 Jul 2003 at 18:41, Kasim Oztoprak wrote:
> what exactly do you mean by a pilot program?

Like get a quad-CPU box, load the data, and ask only 10 operators to test the
system... beta testing, basically.

Bye
Shridhar

--
The man on top walks a lonely street; the "chain" of command is often a noose.
Re: [PERFORM] hardware performance and some more
Folks,

> Since PG doesn't have active-active clustering, that's out, but since
> the database will be very static, why not have, say, 8 machines, each
> with its own copy of the database? (Since there are so few updates,
> you feed the updates to a little Perl app that then makes the changes
> on each machine.) (A round-robin load balancer would do the trick
> in utilizing them all.)

Another approach I've seen work is to have several servers connect to one SAN
or NAS where the data lives. Only one server is enabled to handle "write"
requests; all the rest are read-only. This does mean having dispatching
middleware that parcels out requests among the servers, but it works very well
for the java-based company that's using it.

--
Josh Berkus
Aglio Database Solutions
San Francisco
[PERFORM] index questions
Hi everyone.

I have created a simplified example of a real case, to show you what I'm trying
to do. I have a table, like this:

CREATE TABLE sales (
    saleId SERIAL,
    clientId INTEGER,
    branchId INTEGER,
    productId INTEGER,
    employeeId INTEGER,
    saleDate DATE,
    price NUMERIC(12, 2),
    qty INTEGER,
    PRIMARY KEY(saleId)
);

CREATE INDEX sales_k1 ON sales(clientId, branchId, productId, employeeId,
    saleDate, price, qty);

This table will grow to *many* rows in the future.

I want to make a function that returns the FIRST saleId of the sale that matches
some conditions. I will always receive the Client Id, but not always the other
arguments (sent as NULLs). The fetched resultset should prioritize the passed
arguments, and after that, the saleDate, price and quantity.

/**
 * Finds the first sale that matches the conditions received.
 * @param $1 Client Id.
 * @param $2 Preferred Branch Id.
 * @param $3 Preferred Product Id.
 * @param $4 Preferred Employee Id.
 * @return Sale Id if found, NULL if not.
 */
CREATE OR REPLACE FUNCTION findSale(INTEGER, INTEGER, INTEGER, INTEGER)
RETURNS INTEGER AS '
DECLARE
    a_clientId ALIAS FOR $1;
    a_branchId ALIAS FOR $1;
    a_productId ALIAS FOR $1;
    a_employeeId ALIAS FOR $1;
    r_result INTEGER;
BEGIN
    SELECT INTO r_result employeeId
    FROM sales
    WHERE clientId=a_clientId AND
        branchId=coalesce(a_branchId, branchId) AND     /* branchId is null? anything will be ok */
        productId=coalesce(a_productId, productId) AND  /* productId is null? anything will be ok */
        employeeId=coalesce(a_employeeId, employeeId)   /* employeeId is null? anything will be ok */
    ORDER BY clientId, branchId, productId, employeeId, saleDate, price, qty
    LIMIT 1;
    RETURN r_result;
END;
' LANGUAGE 'plpgsql';

Will findSale() in the future, when I have *many* rows, still use the index when
only the first couple of arguments are passed to the function? If not, should I
create more indexes (and functions) for each possible argument combination?
(of course, with the given order)

The thing here is that I don't understand how postgreSQL solves the query when
the COALESCEs are used... it uses the index now, with a few thousand records,
but what will happen in a few months?

Thanks in advance.
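One way to answer this empirically (a sketch using the table from this post) is
to EXPLAIN the query body both with and without the optional arguments, and
watch whether the plan still shows an index scan on sales_k1 as the table grows:

```sql
-- With all arguments supplied: every leading index column is constrained.
EXPLAIN SELECT saleId FROM sales
WHERE clientId = 1 AND branchId = 2 AND productId = 3 AND employeeId = 4
ORDER BY clientId, branchId, productId, employeeId, saleDate, price, qty
LIMIT 1;

-- With only clientId: the COALESCE(NULL, col) conditions match every row
-- for that client, so only the leading index column actually narrows the scan.
EXPLAIN SELECT saleId FROM sales
WHERE clientId = 1
  AND branchId = coalesce(NULL, branchId)
  AND productId = coalesce(NULL, productId)
  AND employeeId = coalesce(NULL, employeeId)
ORDER BY clientId, branchId, productId, employeeId, saleDate, price, qty
LIMIT 1;
```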
Re: [PERFORM] hardware performance and some more
On Fri, 2003-07-25 at 11:13, Josh Berkus wrote:
[snip]
> Another approach I've seen work is to have several servers connect to one SAN
> or NAS where the data lives. Only one server is enabled to handle "write"
> requests; all the rest are read-only. This does mean having dispatching
> middleware that parcels out requests among the servers, but it works very
> well for the java-based company that's using it.

Wouldn't the cache on the read-only databases get out of sync with the true
on-disk data?

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to [EMAIL PROTECTED] so that your message can get
through to the mailing list cleanly
Re: [PERFORM] index questions
On Fri, 2003-07-25 at 11:52, Franco Bruno Borghesi wrote:
[snip]
> Will findSale() in the future, when I have *many* rows, still use the index
> when only the first couple of arguments are passed to the function?
> If not, should I create more indexes (and functions) for each possible
> argument combination? (of course, with the given order)
>
> The thing here is that I don't understand how postgreSQL solves the query
> when the COALESCEs are used... it uses the index now, with a few thousand
> records, but what will happen in a few months?

When faced with cases like this, I cobble together a script/program that
generates a few million rows of random data (within the confines of FKs, of
course) to populate tables like "sales", and then I see how things perform.
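A generator like that can also be written directly in plpgsql against the
`sales` table from this thread. This is a sketch: `populate_sales` is a made-up
helper name, the value ranges are arbitrary, and real test data would also need
plausible foreign-key values:

```sql
-- Hypothetical random-data generator for the "sales" table.
CREATE OR REPLACE FUNCTION populate_sales(INTEGER) RETURNS INTEGER AS '
DECLARE
    n ALIAS FOR $1;
    i INTEGER;
BEGIN
    FOR i IN 1..n LOOP
        INSERT INTO sales (clientId, branchId, productId, employeeId,
                           saleDate, price, qty)
        VALUES (trunc(random()*100000)::int,          -- random client
                trunc(random()*50)::int,              -- random branch
                trunc(random()*10000)::int,           -- random product
                trunc(random()*500)::int,             -- random employee
                current_date - trunc(random()*365)::int,
                round((random()*1000)::numeric, 2),
                1 + trunc(random()*10)::int);
    END LOOP;
    RETURN n;
END;
' LANGUAGE 'plpgsql';

SELECT populate_sales(1000000);
ANALYZE sales;   -- refresh statistics before testing plans
```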
Re: [PERFORM] index questions
Franco,

> CREATE INDEX sales_k1 ON sales(clientId, branchId, productId,
> employeeId, saleDate, price, qty);

A 7-column index is unlikely to be effective -- the index will be almost as
large as the table. Try indexing only the first 3-4 columns instead.

> I want to make a function that returns the FIRST saleId of the sale that
> matches some conditions. I will always receive the Client Id, but not always
> the other arguments (sent as NULLs).

Well, keep in mind that your multi-column index will only be useful if all
columns are queried starting from the left. That is, the index will be ignored
if you have a "where productId = x" without a "where branchId = y".

> CREATE OR REPLACE FUNCTION findSale(INTEGER, INTEGER, INTEGER, INTEGER)
> RETURNS INTEGER AS '
> DECLARE
>     a_clientId ALIAS FOR $1;
>     a_branchId ALIAS FOR $1;
>     a_productId ALIAS FOR $1;
>     a_employeeId ALIAS FOR $1;

Your aliases are wrong here.

> branchId=coalesce(a_branchId, branchId) AND   /* branchId is null?
> anything will be ok */
> productId=coalesce(a_productId, productId) AND   /* productId is
> null? anything will be ok */

On a very large table this will be very inefficient. You'll be comparing the
productId, for example, even if no productId is passed ... and the index won't
do you any good, because the planner should figure out that 100% of rows match
the condition.

Instead, I recommend that you build up a dynamic query as a string, including
only the conditions sent by the user. You can then EXECUTE the query and loop
through it for a result.

Of course, YMMV. My approach will require you to create more indexes, which
could be a problem if you have limited disk space.

--
-Josh Berkus
Aglio Database Solutions
San Francisco
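A sketch of what that dynamic-query version of findSale() might look like (an
illustration of the suggestion, not a tested implementation; the aliases are
also corrected to $1..$4 here):

```sql
CREATE OR REPLACE FUNCTION findSale(INTEGER, INTEGER, INTEGER, INTEGER)
RETURNS INTEGER AS '
DECLARE
    a_clientId   ALIAS FOR $1;
    a_branchId   ALIAS FOR $2;
    a_productId  ALIAS FOR $3;
    a_employeeId ALIAS FOR $4;
    q   TEXT;
    rec RECORD;
BEGIN
    -- Only the conditions actually supplied end up in the WHERE clause,
    -- so the planner never sees the always-true COALESCE comparisons.
    q := ''SELECT saleId FROM sales WHERE clientId = '' || a_clientId;
    IF a_branchId IS NOT NULL THEN
        q := q || '' AND branchId = '' || a_branchId;
    END IF;
    IF a_productId IS NOT NULL THEN
        q := q || '' AND productId = '' || a_productId;
    END IF;
    IF a_employeeId IS NOT NULL THEN
        q := q || '' AND employeeId = '' || a_employeeId;
    END IF;
    q := q || '' ORDER BY clientId, branchId, productId, employeeId,''
           || '' saleDate, price, qty LIMIT 1'';
    FOR rec IN EXECUTE q LOOP
        RETURN rec.saleId;
    END LOOP;
    RETURN NULL;   -- no matching sale
END;
' LANGUAGE 'plpgsql';
```

Since the parameters are integers there is no quoting problem here; with text
arguments the values would need proper quoting before concatenation.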
Re: [PERFORM] index questions
What you say is that the index is not effective because of its size, but it
would still be used *if* the conditions are right... In this case, I care about
performance, not space. But what you say about the index not being good because
100% of rows match the condition confirms what I suspected.

Thanks for your help.

On Fri, 2003-07-25 at 14:28, Josh Berkus wrote:
> Franco,
>
> A 7-column index is unlikely to be effective -- the index will be almost as
> large as the table. Try indexing only the first 3-4 columns instead.
[snip]
> On a very large table this will be very inefficient. You'll be comparing the
> productId, for example, even if no productId is passed ... and the index
> won't do you any good, because the planner should figure out that 100% of
> rows match the condition.
>
> Instead, I recommend that you build up a dynamic query as a string, including
> only the conditions sent by the user. You can then EXECUTE the query and loop
> through it for a result.
>
> Of course, YMMV. My approach will require you to create more indexes, which
> could be a problem if you have limited disk space.
Re: [PERFORM] hardware performance and some more
On 24 Jul 2003 at 9:42, William Yu wrote:
> As far as I can tell, the performance impact seems to be minimal. There's a
> periodic storm of replication updates in cases where there are mass updates
> since the last resync. But if you have mostly reads and few writes, you
> shouldn't see this situation. The biggest performance impact seems to be the
> CPU power needed to zip/unzip/encrypt/decrypt files.

Can you use WAL-based replication? I don't have a URL handy, but there are
replication projects which transmit WAL files to another server when they fill
up.

OTOH, I was thinking of a simple replication scheme. If postgresql provided a
hook where it called an external library routine for each heap insert in the
WAL, there could be a simple multi-slave replication system. One wouldn't have
to wait until a WAL file fills up. Of course, it's up to the library to make
sure that it does not hold postgresql commits for so long that it hampers
performance. There would also need to be a receiving hook which would directly
heap-insert the data on another node. But if the external library is threaded,
will that work well with postgresql?

Just a thought. If it works, load balancing could be a lot easier and near
real-time...

Bye
Shridhar

--
We fight only when there is no other choice. We prefer the ways of peaceful
contact. -- Kirk, "Spectre of the Gun", stardate 4385.3
[PERFORM] Wrong rows number expected
Hi all,

Last week Josh Berkus pointed my attention (see the post "Wrong plan or what")
to the fact that in this select:

select *
from user_logs ul, user_data ud, class_default cd
where ul.id_user = ud.id_user
  and ud.id_class = cd.id_class
  and cd.id_provider = 39;

the planner says:

QUERY PLAN
Hash Join (cost=265.64..32000.76 rows=40612 width=263) (actual time=11074.21..11134.28 rows=10 loops=1)
  Hash Cond: ("outer".id_user = "inner".id_user)
  -> Seq Scan on user_logs ul (cost=0.00..24932.65 rows=1258965 width=48) (actual time=0.02..8530.21 rows=1258966 loops=1)
  -> Hash (cost=264.81..264.81 rows=331 width=215) (actual time=30.22..30.22 rows=0 loops=1)
    -> Nested Loop (cost=0.00..264.81 rows=331 width=215) (actual time=29.95..30.20 rows=6 loops=1)
      -> Seq Scan on class_default cd (cost=0.00..1.39 rows=1 width=55) (actual time=0.08..0.10 rows=1 loops=1)
           Filter: (id_provider = 39)
      -> Index Scan using idx_user_data_class on user_data ud (cost=0.00..258.49 rows=395 width=160) (actual time=29.82..29.96 rows=6 loops=1)
           Index Cond: (ud.id_class = "outer".id_class)
Total runtime: 11135.65 msec
(10 rows)

and the row count reported in:

Hash Join (cost=265.64..32000.76 rows=40612 width=263) (actual time=11074.21..11134.28 rows=10 loops=1)

is wrong about the rows returned (40612 estimated vs. 10 actual). I did what
Josh Berkus suggested:

1) Make sure you've VACUUM ANALYZEd.
2) Adjust the following postgresql.conf statistics:
   a) effective_cache_size: increase to 70% of available (not used by other
      processes) RAM.
   b) random_page_cost: decrease, maybe to 2.
   c) default_statistics_target: try increasing to 100 (warning: this will
      significantly increase the time required to do ANALYZE).

I pushed default_statistics_target up to 1000, but the plan remains the same,
with an execution time of 11 secs. However, if I run the following 3 equivalent
queries, I obtain the same result in only a few ms:

SELECT id_class from class_default where id_provider = 39;
 id_class
----------
       48
(1 row)

SELECT id_user from user_data where id_class in ( 48 );
 id_user
---------
   10943
   10942
   10934
   10927
   10910
   10909
(6 rows)

SELECT * from user_logs where id_user in ( 10943, 10942, 10934, 10927, 10910, 10909 );

May I do something else?

Thank you in advance,
Gaetano Mendola
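Since the three steps run fast by hand, one thing worth trying is to combine
them into a single query with nested IN subqueries instead of the three-way
join. This is a sketch: whether the planner handles it well depends on version
and statistics (IN-subquery planning improved markedly in 7.4):

```sql
SELECT *
FROM user_logs
WHERE id_user IN (
    SELECT id_user
    FROM user_data
    WHERE id_class IN (
        SELECT id_class
        FROM class_default
        WHERE id_provider = 39
    )
);
```

This mirrors the manual decomposition above: start from the one matching
class_default row, find the handful of users, and only then touch user_logs.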