Re: [PERFORM] Functionscan estimates
> But with all due respect to Joe, I think the reason that stuff got
> trimmed is that it didn't work very well. In most cases it's *hard* to
> write an estimator for a SRF. Let's see you produce one for dblink()
> for instance ...

Good one... Well, in some cases it'll be impossible, but suppose I have
a function get_id_for_something() which just grabs an ID using dblink;
then I know it returns one row, and pg would be interested in that
information too!
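(For illustration, a minimal sketch of the kind of one-row dblink wrapper
being described; the connection string, remote table, and column names are
invented. The point is that the function always yields exactly one row,
yet a function scan over it is costed with the generic default row
estimate unless the planner is told otherwise.)

    -- illustrative sketch only: connection string and remote names are made up
    CREATE FUNCTION get_id_for_something(text) RETURNS SETOF integer
    LANGUAGE sql AS $$
        SELECT id
          FROM dblink('dbname=remote',
                      'SELECT id FROM somethings WHERE name = '
                      || quote_literal($1))
               AS t(id integer)
         LIMIT 1;
    $$;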
Re: [PERFORM] Functionscan estimates
> My solution would be a lot simpler, since we could simply populate
> pg_proc.proestrows with "1000" by default if not changed by the DBA.
> In an even better world, we could tie it to a table, saying that, for
> example, proestrows = my_table*0.02.

What if the estimated row count is a function of a parameter? Say a
function takes as a parameter:

- a number to use in a LIMIT
- or it's a function that generates a certain number of values from a
  predetermined set (like an array -> set-returning function)

In all those cases it's no use to have just a fixed number. I'd suggest
two solutions:

- The ideal solution, which is impossible to do: the function tells the
  planner about its stats, looking at its parameters.

- A solution that would be possible to do: pg_proc.proestrows is the
  name of another function, defined by the user, which takes the exact
  same parameters as the set-returning function we're talking about, and
  which returns estimates. For instance, in pseudo-SQL:

    CREATE FUNCTION int_array_srf( INTEGER[] ) RETURNS SETOF INTEGER
    LANGUAGE plpgsql AS $$
    DECLARE
        _i INTEGER;
    BEGIN
        FOR _i IN 1 .. icount($1) LOOP
            RETURN NEXT $1[_i];
        END LOOP;
    END;
    $$;

In the two cases above, this would give:

    CREATE FUNCTION array_srf_estimator( INTEGER[] ) RETURNS INTEGER
    LANGUAGE plpgsql AS $$
    BEGIN
        RETURN icount( $1 );
    END;
    $$;

    ALTER FUNCTION int_array_srf SET ESTIMATOR array_srf_estimator;

Another interesting case would be the famous "Top 5 by category" case,
where we use a SRF to emulate an index skip scan. Say we have a table
Categories and a table Users, each User having columns "category_id" and
"score", and we want the N users with the best score in each category:

    CREATE FUNCTION top_n_by_category( INTEGER ) RETURNS SETOF users
    LANGUAGE plpgsql AS $$
    DECLARE
        _n ALIAS FOR $1;
        _cat_id INTEGER;
        _user users%ROWTYPE;
    BEGIN
        FOR _cat_id IN SELECT category_id FROM categories LOOP
            FOR _user IN SELECT * FROM users
                          WHERE category_id = _cat_id
                          ORDER BY score DESC LIMIT _n LOOP
                RETURN NEXT _user;
            END LOOP;
        END LOOP;
    END;
    $$;

    CREATE FUNCTION top_n_by_category_estimator( INTEGER ) RETURNS INTEGER
    LANGUAGE plpgsql AS $$
    BEGIN
        -- N times the estimated number of rows for the categories table,
        -- taken from the table statistics
        RETURN $1 * (SELECT reltuples FROM pg_class
                      WHERE relname = 'categories');
    END;
    $$;

    ALTER FUNCTION top_n_by_category SET ESTIMATOR top_n_by_category_estimator;

Got it? The array_srf case would be extremely useful, as this type of
function is generally used to join against other tables, and having
estimates is useful for that. The top_n case would be useless if we're
just returning the rows from the function directly, but would be useful
if we join them to other tables.

This sounds pretty simple, powerful, and versatile.

Additionally, in some cases (most notably the array case) it's easy to
estimate the statistics on the returned values, because they're all in
the array already, so the mechanism could be extended to have a way of
returning a pseudo pg_stats for a set-returning function.

For instance, say you have a SRF which returns N random rows from a
table. It could have an estimator which would return a rowcount of N,
and a statistics estimator which would return the stats rows for the
source table, appropriately modified. This sounds harder to do.

What do you think?
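(A sketch of that last example, following the proposal above; the Users
table is the one from the message, and the SET ESTIMATOR syntax is the
hypothetical command proposed here, not an existing one.)

    CREATE FUNCTION random_users( INTEGER ) RETURNS SETOF users
    LANGUAGE sql AS $$
        SELECT * FROM users ORDER BY random() LIMIT $1;
    $$;

    CREATE FUNCTION random_users_estimator( INTEGER ) RETURNS INTEGER
    LANGUAGE sql AS $$
        SELECT $1;   -- the row count is simply the requested N
    $$;

    ALTER FUNCTION random_users SET ESTIMATOR random_users_estimator;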
Re: [PERFORM] Any way to speed this up?
Two things to point out from this last run. 50% of the time is taken
scanning tblassociate:

    ->  Seq Scan on tblassociate a  (cost=0.00..38388.79 rows=199922 width=53)
        (actual time=62.000..10589.000 rows=176431 loops=1)
          Filter: ((clientnum)::text = 'SAKS'::text)

If you had an index on clientnum and didn't cast it to text in the view,
you might be able to use an indexscan, which could be faster (depends on
how big the table actually is).

This sort is taking about 25% of the time:

    ->  Sort  (cost=85542.59..86042.39 rows=199922 width=75)
        (actual time=19641.000..19955.000 rows=159960 loops=1)
          Sort Key: a.locationid
          ->  Merge Right Join  (cost=60850.40..62453.22 rows=199922 width=75)
              (actual time=13500.000..14734.000 rows=176431 loops=1)

I suspect it shouldn't take 5 seconds to sort 160k rows in memory, and
that this sort is spilling to disk. If you increase your working memory
the sort might fit entirely in memory. As a quick test, you could set
working memory to 80% of system memory and see how that changes the
speed. But you wouldn't want to set it that high in production. (A quick
sketch of both changes follows at the end of this message.)

On Thu, Apr 07, 2005 at 01:14:33PM -0400, Joel Fradkin wrote:
> Here is the result after putting it back to 4 the original value (I had
> done that prior to your suggestion of using 2 or 3) to see what might
> change. I also vacuumed and thought I saw records deleted in associate,
> which I found odd as this is a test site and no new records were added
> or deleted.
>
> "Merge Join  (cost=86788.09..87945.00 rows=10387 width=112) (actual
> time=19703.000..21154.000 rows=159959 loops=1)"
> "  Merge Cond: ("outer".locationid = "inner".locationid)"
> "  ->  Sort  (cost=1245.50..1246.33 rows=332 width=48) (actual
> time=62.000..62.000 rows=441 loops=1)"
> "        Sort Key: l.locationid"
> "        ->  Index Scan using ix_location on tbllocation l
> (cost=0.00..1231.60 rows=332 width=48) (actual time=15.000..62.000
> rows=441 loops=1)"
> "              Index Cond: ('SAKS'::text = (clientnum)::text)"
> "  ->  Sort  (cost=85542.59..86042.39 rows=199922 width=75) (actual
> time=19641.000..19955.000 rows=159960 loops=1)"
> "        Sort Key: a.locationid"
> "        ->  Merge Right Join  (cost=60850.40..62453.22 rows=199922
> width=75) (actual time=13500.000..14734.000 rows=176431 loops=1)"
> "              Merge Cond: (("outer".id = "inner".jobtitleid) AND
> ("outer"."?column4?" = "inner"."?column10?"))"
> "              ->  Sort  (cost=554.11..570.13 rows=6409 width=37) (actual
> time=94.000..94.000 rows=6391 loops=1)"
> "                    Sort Key: jt.id, (jt.clientnum)::text"
> "                    ->  Seq Scan on tbljobtitle jt  (cost=0.00..148.88
> rows=6409 width=37) (actual time=0.000..63.000 rows=6391 loops=1)"
> "                          Filter: (1 = presentationid)"
> "              ->  Sort  (cost=60296.29..60796.09 rows=199922 width=53)
> (actual time=13406.000..13859.000 rows=176431 loops=1)"
> "                    Sort Key: a.jobtitleid, (a.clientnum)::text"
> "                    ->  Seq Scan on tblassociate a  (cost=0.00..38388.79
> rows=199922 width=53) (actual time=62.000..10589.000 rows=176431 loops=1)"
> "                          Filter: ((clientnum)::text = 'SAKS'::text)"
> "Total runtime: 22843.000 ms"
>
> Joel Fradkin
>
> -----Original Message-----
> From: Tom Lane [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, April 07, 2005 11:43 AM
> To: Joel Fradkin
> Cc: 'PostgreSQL Perform'
> Subject: Re: [PERFORM] Any way to speed this up?
>
> "Joel Fradkin" <[EMAIL PROTECTED]> writes:
> > random_page_cost = 1.2  # 4  # units are one sequential page fetch cost
>
> That is almost certainly overoptimistic; it's causing the planner to
> use indexscans when it shouldn't. Try 2 or 3 or thereabouts.
>
> 	regards, tom lane

--
Jim C. Nasby, Database Consultant               [EMAIL PROTECTED]
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"
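(A quick sketch of the two suggestions above; the index name is invented,
the memory figure is only a test value as noted, and the parameter is
sort_mem on 7.x or work_mem from 8.0 on.)

    -- index on the filter column; the view must compare clientnum without
    -- the ::text cast for the planner to consider it
    CREATE INDEX ix_tblassociate_clientnum ON tblassociate (clientnum);

    -- per-session experiment with a larger sort memory (value is in KB);
    -- don't set it this high in postgresql.conf for production
    SET work_mem = 262144;        -- 8.0+; use SET sort_mem on 7.4 and earlier
    EXPLAIN ANALYZE SELECT ... ;  -- re-run the query and compare timings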
Re: [PERFORM] Functionscan estimates
On Sat, Apr 09, 2005 at 12:00:56AM -0400, Tom Lane wrote:
> Not too many releases ago, there were several columns in pg_proc that
> were intended to support estimation of the runtime cost and number of
> result rows of set-returning functions. I believe in fact that these
> were the remains of Joe Hellerstein's thesis on expensive-function
> evaluation, and are exactly what he was talking about here:
> http://archives.postgresql.org/pgsql-hackers/2002-06/msg00085.php
>
> But with all due respect to Joe, I think the reason that stuff got
> trimmed is that it didn't work very well. In most cases it's
> *hard* to write an estimator for a SRF. Let's see you produce
> one for dblink() for instance ...

Actually, if the remote database supported a way to get a rows estimate
from the query passed to dblink, it would be trivial, since you'd just
pass that back. In fact, having such a function
(estimate_rows_for_sql(text)) would probably be very useful to functions
that wanted to support returning a rows estimate.

--
Jim C. Nasby, Database Consultant               [EMAIL PROTECTED]
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"
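(A minimal sketch of what such a helper could look like; the function name
is the one suggested above, while the implementation, which simply scrapes
the planner's row estimate out of the first line of EXPLAIN output, is only
an illustration.)

    CREATE FUNCTION estimate_rows_for_sql(query text) RETURNS bigint
    LANGUAGE plpgsql AS $$
    DECLARE
        _line text;
    BEGIN
        FOR _line IN EXECUTE 'EXPLAIN ' || query LOOP
            -- the top plan line looks like:
            --   Seq Scan on foo  (cost=0.00..42.00 rows=1234 width=8)
            RETURN substring(_line from 'rows=([0-9]+)')::bigint;
        END LOOP;
        RETURN NULL;
    END;
    $$;

    -- dblink could then forward the remote estimate, e.g.:
    --   SELECT * FROM dblink('myconn',
    --       'SELECT estimate_rows_for_sql(''SELECT * FROM foo'')')
    --       AS t(estimate bigint);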
Re: [PERFORM] Functionscan estimates
"Jim C. Nasby" <[EMAIL PROTECTED]> writes: > On Sat, Apr 09, 2005 at 12:00:56AM -0400, Tom Lane wrote: >> But with all due respect to Joe, I think the reason that stuff got >> trimmed is that it didn't work very well. In most cases it's >> *hard* to write an estimator for a SRF. Let's see you produce >> one for dblink() for instance ... > Actually, if the remote database supported a way to get a rows estimate > from the query passed to db_link, it would be trivial, since you'd just > pass that back. This assumes that (1) you have the complete query argument at the time of estimation, and (2) it's OK to contact the remote database and do an EXPLAIN at that time. Both of these seem pretty shaky assumptions. The larger point is that writing an estimator for an SRF is frequently a task about as difficult as writing the SRF itself, and sometimes much *more* difficult due to lack of information. I don't foresee a whole lot of use of an estimator hook designed as proposed here. In particular, if the API is such that we can only use the estimator when all the function arguments are plan-time constants, it's not going to be very helpful. regards, tom lane ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
[PERFORM] performance - triggers, row existence etc.
Hello,

I'm just in the middle of performance tuning of our database running on
PostgreSQL, and I have several questions (I've searched the online docs,
but without success).

1) When I first use the EXPLAIN ANALYZE command, the time is much larger
   than for subsequent invocations of EXPLAIN ANALYZE. I suppose the plan
   prepared during the first invocation is cached somewhere, but I'm not
   sure where and for how long. I suppose the execution plans are
   connection specific, but I'm not sure whether this holds for the SQL
   queries inside triggers too. I've done some testing, but things are
   somewhat more difficult because of persistent links (the commands will
   be executed from PHP).

2) Is there some (performance) difference between BEFORE and AFTER
   triggers? I believe there's no measurable difference.

3) The vast majority of SQL commands inside the trigger check whether
   there exists a row that suits some conditions (same IP, visitor ID,
   etc.). Currently I do this by

       SELECT INTO tmp id FROM ... JOIN ... WHERE ... LIMIT 1;
       IF NOT FOUND THEN
           ...
       END IF;

   and so on. I believe this is a fast and low-cost solution (compared to
   the COUNT(*) way I used before), but is there an even better (faster)
   way to check row existence?

Thanks
t.v.
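(On question 3, a small sketch of an equivalent existence check that
avoids declaring a dummy variable; the table and column names are invented
for illustration, and like the LIMIT 1 form it stops at the first matching
row.)

    -- inside a plpgsql BEFORE row trigger; names are hypothetical
    IF EXISTS (SELECT 1
                 FROM visits v
                 JOIN visitors u ON u.id = v.visitor_id
                WHERE v.ip = NEW.ip) THEN
        -- a matching row already exists
        RETURN NULL;   -- e.g. skip the insert in a BEFORE trigger
    END IF;
    RETURN NEW;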
Re: [PERFORM] Functionscan estimates
Tom Lane wrote:
> Not too many releases ago, there were several columns in pg_proc that
> were intended to support estimation of the runtime cost and number of
> result rows of set-returning functions. I believe in fact that these
> were the remains of Joe Hellerstein's thesis on expensive-function
> evaluation

FYI, Hellerstein's thesis on xfunc optimization is available here:

    ftp://ftp.cs.wisc.edu/pub/tech-reports/reports/1996/tr1304.ps.Z

There's also a paper on this subject by Hellerstein that was published in
Transactions on Database Systems:

    http://www.cs.berkeley.edu/~jmh/miscpapers/todsxfunc.pdf

I haven't had a chance to digest either one yet, but it might be worth a
look.

-Neil
Re: [PERFORM] Functionscan estimates
Tom Lane wrote:
> The larger point is that writing an estimator for an SRF is frequently
> a task about as difficult as writing the SRF itself

True, although I think this doesn't necessarily kill the idea. If
writing an estimator for a given SRF is too difficult, the user is no
worse off than they are today. Hopefully there would be a fairly large
class of SRFs for which writing an estimator would be relatively simple,
and result in improved planner behavior.

> I don't foresee a whole lot of use of an estimator hook designed as
> proposed here. In particular, if the API is such that we can only use
> the estimator when all the function arguments are plan-time constants,
> it's not going to be very helpful.

Yes :( One approach might be to break the function's domain into pieces
and have the estimator function calculate the estimated result set size
for each piece. So, given a trivial function like:

    foo(int):
        if $1 < 10 then produce 100 rows
        else produce 1 row

If the planner has encoded the distribution of input tuples to the
function as a histogram, it could invoke the SRF's estimator function
for the boundary values of each histogram bucket, and use that to get an
idea of the function's likely result set size at runtime.

And yes, the idea as sketched is totally unworkable :) For one thing,
the difficulty of doing this grows rapidly as the number of arguments to
the function increases. But perhaps there is some variant of this idea
that might work...

Another thought is that the estimator could provide information on the
cost of evaluating the function, the number of tuples produced by the
function, and even the distribution of those tuples.

BTW, why is this on -performance? It should be on -hackers.

-Neil
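(A rough sketch of the idea, using the trivial foo() above; the function
bodies, the estimator naming, and the use of pg_stats.histogram_bounds are
all illustrative assumptions, not an existing planner API.)

    CREATE FUNCTION foo(arg integer) RETURNS SETOF integer
    LANGUAGE plpgsql AS $$
    DECLARE
        _i integer;
    BEGIN
        IF arg < 10 THEN
            FOR _i IN 1 .. 100 LOOP
                RETURN NEXT _i;      -- produce 100 rows
            END LOOP;
        ELSE
            RETURN NEXT 0;           -- produce 1 row
        END IF;
    END;
    $$;

    CREATE FUNCTION foo_estimator(arg integer) RETURNS integer
    LANGUAGE sql AS $$
        SELECT CASE WHEN $1 < 10 THEN 100 ELSE 1 END;
    $$;

    -- The planner could then evaluate foo_estimator() at each boundary in
    -- pg_stats.histogram_bounds for the column being fed to foo(), and
    -- combine (e.g. average) the per-bucket results into a row estimate.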