[PERFORM] Optimizer difference using function index between 7.3 and 7.4
We have a large (several million row) table with a field containing URLs. Now, funny thing about URLs: they mostly start with a common substring ("http://www."). But not all the rows start with this, so we can't just lop off the first N characters. However, we noticed some time ago that an index on this field wasn't as effective as an index on the REVERSE of the field. So:

CREATE OR REPLACE FUNCTION fn_urlrev(text) RETURNS text AS '
    return reverse(lc($_[0]))
' LANGUAGE 'plperl' WITH (iscachable, isstrict);

and then:

CREATE UNIQUE INDEX ix_links_3 ON links (fn_urlrev(path_base));

seemed to be much faster. When we have to look up a single entry in "links", we do so by something like:

SELECT * FROM links WHERE fn_urlrev(path_base) = ?;

and it's rather fast. When we have a bunch of them to do, under 7.3 we found it useful to create a temporary table, fill it with reversed URLs, and join:

INSERT INTO temp_link_urls VALUES (fn_urlrev(?));

SELECT l.path_base, l.link_id
FROM links l
JOIN temp_link_urls t ON (fn_urlrev(l.path_base) = t.rev_path_base);

Here are query plans from the two versions (using a temp table with 200 rows, after ANALYZE on the temp table):

7.3:

# explain select link_id from links l join clm_tmp_links t on (fn_urlrev(l.path_base) = t.rev_path_base);
                                       QUERY PLAN
-----------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..3936411.13 rows=2000937 width=152)
   ->  Seq Scan on clm_tmp_links t  (cost=0.00..5.00 rows=200 width=74)
   ->  Index Scan using ix_links_3 on links l  (cost=0.00..19531.96 rows=10005 width=78)
         Index Cond: (fn_urlrev(l.path_base) = "outer".rev_path_base)
(4 rows)

7.4:

# explain select link_id from links l join clm_tmp_links t on (fn_urlrev(l.path_base) = t.rev_path_base);
                                  QUERY PLAN
-------------------------------------------------------------------------------
 Hash Join  (cost=5.50..88832.88 rows=1705551 width=4)
   Hash Cond: (fn_urlrev("outer".path_base) = "inner".rev_path_base)
   ->  Seq Scan on links l  (cost=0.00..50452.50 rows=1705550 width=78)
   ->  Hash  (cost=5.00..5.00 rows=200 width=74)
         ->  Seq Scan on clm_tmp_links t  (cost=0.00..5.00 rows=200 width=74)
(5 rows)

Although the cost for the 7.4 query is lower, the 7.3 plan executes in about 3 seconds, while the 7.4 plan executes in 59.8 seconds! Now the odd part: if I change the query to this:

# explain analyze select link_id from links l join clm_tmp_links t on (fn_urlrev(l.path_base) = fn_urlrev(t.rev_path_base));
                                       QUERY PLAN
-----------------------------------------------------------------------------------------
 Merge Join  (cost=12.64..219974.16 rows=1705551 width=4) (actual time=17.928..17.928 rows=0 loops=1)
   Merge Cond: (fn_urlrev("outer".path_base) = "inner"."?column2?")
   ->  Index Scan using ix_links_3 on links l  (cost=0.00..173058.87 rows=1705550 width=78) (actual time=0.229..0.285 rows=7 loops=1)
   ->  Sort  (cost=12.64..13.14 rows=200 width=74) (actual time=9.652..9.871 rows=200 loops=1)
         Sort Key: fn_urlrev(t.rev_path_base)
         ->  Seq Scan on clm_tmp_links t  (cost=0.00..5.00 rows=200 width=74) (actual time=0.166..5.753 rows=200 loops=1)
 Total runtime: 18.125 ms

(i.e., apply the function to the data in the temp table), it runs a whole lot faster! Is this a bug in the optimizer? Or did something change about the way functional indexes are used?

--
Jeff Boes                                      vox 269.226.9550 ext 24
Database Engineer                                  fax 269.349.9076
Nexcerpt, Inc.                                 http://www.nexcerpt.com
         ...Nexcerpt... Extend your Expertise

---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?
       http://www.postgresql.org/docs/faqs/FAQ.html
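For what it's worth, the reversed-lookup scheme is easy to sketch outside PostgreSQL. Below is a minimal illustration using SQLite's expression indexes (needs SQLite 3.9+ and Python 3.8+; `urlrev`, `ix_links_rev`, and the sample URLs are invented stand-ins for the fn_urlrev/ix_links_3 setup above, not the original plperl code):

```python
import sqlite3

def urlrev(s):
    # Reverse the lowercased URL, so the distinctive tail of the URL
    # becomes the index key's prefix -- the trick from the post.
    return s.lower()[::-1]

conn = sqlite3.connect(":memory:")
# Only deterministic functions may appear in index expressions.
conn.create_function("urlrev", 1, urlrev, deterministic=True)

conn.execute("CREATE TABLE links (link_id INTEGER PRIMARY KEY, path_base TEXT)")
conn.executemany("INSERT INTO links (path_base) VALUES (?)",
                 [("http://www.example.com/a",),
                  ("http://www.example.com/b",),
                  ("ftp://mirror.example.org/c",)])
conn.execute("CREATE UNIQUE INDEX ix_links_rev ON links (urlrev(path_base))")

# Point lookup applies the same expression to both sides, as in the post.
row = conn.execute(
    "SELECT link_id FROM links WHERE urlrev(path_base) = urlrev(?)",
    ("http://www.example.com/b",)).fetchone()
print(row[0])   # → 2
```

The same idea carries over: a B-tree on the raw URLs would cluster millions of keys under the shared "http://www." prefix, while the reversed form spreads them out.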
Re: [PERFORM] Optimizer difference using function index between 7.3 and 7.4
Jeff Boes <[EMAIL PROTECTED]> writes:
> Is this a bug in the optimizer? Or did something
> change about the way functional indexes are used?

In 7.3, the only possible plan for these queries was a nestloop or nestloop with inner indexscan, because the planner could not generate merge or hash joins on join conditions more complex than "var1 = var2". You were fortunate that a nestloop was fast enough for your situation.

In 7.4 the planner can (as you see) generate both merge and hash options for this query. What it's not very good at yet is picking the best option to use, because it doesn't have any statistics about the distribution of functional indexes, and so it's pretty much guessing about selectivity.

As of just a couple days ago, there is code in CVS tip that keeps and uses stats about the values of functional index columns. It seems likely that this would help out tremendously in terms of estimating the costs well for your problem. Don't suppose you'd like to try setting up a test system with your data and trying it ...

BTW, as best I can tell, the amazing speed for the mergejoin is a bit of a fluke.

> Merge Join  (cost=12.64..219974.16 rows=1705551 width=4) (actual time=17.928..17.928 rows=0 loops=1)
>   Merge Cond: (fn_urlrev("outer".path_base) = "inner"."?column2?")
>   ->  Index Scan using ix_links_3 on links l  (cost=0.00..173058.87 rows=1705550 width=78) (actual time=0.229..0.285 rows=7 loops=1)
>   ->  Sort  (cost=12.64..13.14 rows=200 width=74) (actual time=9.652..9.871 rows=200 loops=1)
>         Sort Key: fn_urlrev(t.rev_path_base)
>         ->  Seq Scan on clm_tmp_links t  (cost=0.00..5.00 rows=200 width=74) (actual time=0.166..5.753 rows=200 loops=1)
> Total runtime: 18.125 ms

Notice how the indexscan on links is reporting that it only returned 7 rows. Ordinarily you'd expect that it'd scan the whole table (and that's what the cost estimate is expecting).
I think what must have happened is that the scan stopped only a little way into the table, because the sequence of values from the temp table ended with a value that was close to the start of the range of values in the main table. Mergejoin stops fetching as soon as it exhausts either input table. This was good luck for you in this case but would likely not hold up with another set of temp values.

			regards, tom lane
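The early-exit behavior Tom describes can be sketched with a toy merge join in Python (purely illustrative, not PostgreSQL's executor):

```python
def merge_join(left, right):
    """Toy merge join over two sorted lists of unique keys.

    Stops as soon as either input is exhausted -- the behavior that let
    the indexscan above return after only 7 rows.  Returns the matches
    and how far into `left` the scan actually got."""
    matches, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            matches.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:
            i += 1
        else:
            j += 1
    return matches, i

# A big sorted "links" side, and a small "temp table" side whose values
# all fall near the start of the big side's range.
big = list(range(1000))
small = [3, 5, 7]
m, scanned = merge_join(big, small)
print(m, scanned)   # → [3, 5, 7] 8 -- only 8 of the 1000 left rows touched
```

With a temp table whose largest key fell near the *end* of the big side's range, `scanned` would approach 1000, which is why Tom calls the observed speed a fluke.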
Re: [PERFORM] UPDATE with subquery too slow
> I can't get the following statement to complete with reasonable time.

Upgraded to 7.4.1, and realized that NOT IN is far more efficient than IN, EXISTS or NOT EXISTS, at least for the amount and distribution of data that I have. Here are some numbers from before and after performing the problematic clean-up operation:

                         | Before    | After
 ------------------------+-----------+-----------
 COUNT(*)                | 6'104'075 | 6'104'075
 COUNT(session)          | 5'945'272 | 3'640'659
 COUNT(DISTINCT session) | 2'865'570 |   560'957

The following query completes within less than three hours on a machine with a high load, versus many, many hours for any of the alternatives:

UPDATE requests
SET session = NULL
WHERE session NOT IN (
    SELECT r.session
    FROM requests r
    WHERE r.session IS NOT NULL
    GROUP BY r.session
    HAVING COUNT(*) > 1
);

Note that in order to correctly reverse an IN subquery, IS NOT NULL needs to be added. Interestingly, the query planner believes that using EXISTS would be more efficient than NOT IN, and IN only slightly less efficient; I assume the query planner is not able to accurately estimate the number of rows returned by the subquery. Estimated costs:

 EXISTS            351'511
 NOT IN            376'577
 IN                386'780
 LEFT JOIN      18'263'826
 NOT EXISTS  7'241'815'330
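The IS NOT NULL remark deserves a concrete demonstration, since it trips a lot of people up: if the subquery's result contains a NULL, NOT IN returns no rows at all, because `x NOT IN (..., NULL)` can never evaluate to true. A sketch using SQLite (schema and values invented, but the three-valued logic is the same as PostgreSQL's):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE requests (id INTEGER, session TEXT)")
# Session 'a' repeats; two rows have NULL sessions.
conn.executemany("INSERT INTO requests VALUES (?, ?)",
                 [(1, "a"), (2, "a"), (3, "b"), (4, None), (5, None)])

# Naive reversal: the NULL group also passes HAVING COUNT(*) > 1, so the
# NOT IN list contains NULL and every comparison yields UNKNOWN.
naive = conn.execute(
    "SELECT id FROM requests WHERE session NOT IN ("
    "  SELECT session FROM requests"
    "  GROUP BY session HAVING COUNT(*) > 1)").fetchall()

# Correct reversal: filter NULLs out of the subquery, as in the post.
fixed = conn.execute(
    "SELECT id FROM requests WHERE session NOT IN ("
    "  SELECT session FROM requests WHERE session IS NOT NULL"
    "  GROUP BY session HAVING COUNT(*) > 1)").fetchall()

print(naive, fixed)   # → [] [(3,)]
```

The naive form silently updates nothing; only the IS NOT NULL version finds the single-occurrence session 'b'.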
[PERFORM] Forcing filter/join order?
Folks,

Have an interesting issue with a complex query, where apparently I need to twist the query planner's arm, and am looking for advice on how to do so.

The situation: I have a table, events, with about 300,000 records. It does an outer join to a second table, cases, with about 150,000 records. A very simplified version of the query would be:

SELECT *
FROM events LEFT OUTER JOIN cases ON events.case_id = cases.case_id
WHERE events.event_date BETWEEN 'date1' AND 'date2'

This join is very expensive, as you can imagine. Yet I can't seem to force the query planner to apply the filter conditions to the events table *before* attempting to join it to cases. Here are the crucial explain lines:

->  Merge Left Join  (cost=0.00..11880.82 rows=15879 width=213) (actual time=5.777..901.899 rows=648 loops=1)
      Merge Cond: ("outer".case_id = "inner".case_id)
      Join Filter: (("outer".link_type)::text = 'case'::text)
      ->  Index Scan using idx_event_ends on events  (cost=0.00..4546.15 rows=15879 width=80) (actual time=4.144..333.769 rows=648 loops=1)
            Filter: ((status <> 0) AND ((event_date + duration) >= '2004-02-18 00:00:00'::timestamp without time zone) AND (event_date <= '2004-03-05 23:59:00'::timestamp without time zone))
      ->  Index Scan using cases_pkey on cases  (cost=0.00..6802.78 rows=117478 width=137) (actual time=0.139..402.363 rows=116835 loops=1)

As you can see, part of the problem is a pretty drastic (20x) mis-estimation of the selectivity of the date limits on events -- which results in 90% of the execution time of my query going to this one join. I've tried raising the statistics on event_date, duration, and case_id (to 1000), but this doesn't seem to affect the estimate or the query plan.

In the above test, idx_event_ends indexes (case_id, status, event_date, (event_date + duration)), but as you can see the planner uses only the first column. This was an attempt to circumvent the planner's tendency to completely ignore any index on (event_date, (event_date + duration)) -- even though that index is the most selective combination on the events table.

Is there anything I can do to force the query planner to filter on events before joining cases, other than waiting for version 7.5?

--
-Josh Berkus
 Aglio Database Solutions
 San Francisco
Re: [PERFORM] Forcing filter/join order?
Josh,

I'm sure the big brains have a better suggestion, but in the mean time could you do something as simple as:

SELECT *
FROM (select * from events where event_date BETWEEN 'date1' AND 'date2') e
LEFT OUTER JOIN cases ON e.case_id = cases.case_id;

Thanks,
Peter Darley

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Josh Berkus
Sent: Wednesday, February 18, 2004 4:11 PM
To: pgsql-performance
Subject: [PERFORM] Forcing filter/join order?

[original message snipped; quoted in full upthread]
Re: [PERFORM] Forcing filter/join order?
Folks,

Hmmm ... posted too soon. Figured out the problem: the planner can't, or doesn't want to, use an index on (event_date, (event_date + duration)) where the first column is an ascending sort and the second a descending sort. So I've coded a workaround that's quite inelegant but does get the correct results in 0.3 seconds (as opposed to the 2.2 seconds taken by the example plan).

Is this the sort of thing which is ever likely to get fixed, or just a fundamental limitation of index algorithms? Would using a non-B-Tree index allow me to work around this?

--
-Josh Berkus
 Aglio Database Solutions
 San Francisco
Re: [PERFORM] Forcing filter/join order?
Peter,

> I'm sure the big brains have a better suggestion, but in the mean time
> could you do something as simple as:
>
> SELECT *
> FROM (select * from events where event_date BETWEEN 'date1' AND 'date2') e
> LEFT OUTER JOIN cases ON e.case_id = cases.case_id;

Thanks, but that doesn't work; the planner will collapse the subquery into the main query, which most of the time is the right thing to do.

--
-Josh Berkus
 Aglio Database Solutions
 San Francisco
Re: [PERFORM] Forcing filter/join order?
On Wed, 18 Feb 2004, Josh Berkus wrote:

> The planner can't, or doesn't want to, use an index on (event_date,
> (event_date + duration)) where the first column is an ascending sort and the
> second a descending sort. So I've coded a workaround that's quite
> inelegant but does get the correct results in 0.3 seconds (as opposed to the
> 2.2 seconds taken by the example plan).

Can you give more information? I know that I'm not exactly certain what the situation is from the above and the original query/explain piece.
Re: [PERFORM] Forcing filter/join order?
Stephan,

> Can you give more information? I know that I'm not exactly certain what
> the situation is from the above and the original query/explain piece.

Believe me, if I posted the query it wouldn't help. Heck, I'd have trouble following it without my notes. A simplified version:

SELECT events.*, cases.case_name
FROM events LEFT OUTER JOIN cases ON events.case_id = cases.case_id
WHERE (event_date >= '2004-03-05' OR (event_date + duration) <= '2004-02-18')
  AND events.status <> 0;

... this is to get me all valid events which overlap with the range '2004-02-18' to '2004-03-05'.

I had thought, in 7.4, that adding an index on (event_date, (event_date + duration)) would improve the execution of this query. It doesn't, presumably because the multi-column index can't be used for both ascending and descending sorts at the same time, and event_date >= '2004-03-05' isn't selective enough.

There was a workaround for this posted on hackers about a year ago as I recall, that involved creating custom operators for indexing. Too much trouble when there's a hackish workaround (due to the fact that events have to be less than a month long).

--
-Josh Berkus
 Aglio Database Solutions
 San Francisco
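For what it's worth, the simplified WHERE clause can be checked against the usual interval-overlap test with a few lines of Python (days as integers, day 0 = '2004-02-18', day 16 = '2004-03-05'; this is an illustration, not code from the thread):

```python
start_q, end_q = 0, 16   # the query range, as day offsets

def posted(d, dur):
    # The clause as posted:
    # event_date >= 'end' OR (event_date + duration) <= 'start'
    return d >= end_q or d + dur <= start_q

def overlaps(d, dur):
    # The standard overlap test for [d, d + dur] vs [start_q, end_q].
    return d <= end_q and d + dur >= start_q

# An event covering days 5..8 lies inside the range, yet the posted
# clause rejects it -- the comparisons are inverted, which Tom notes
# downthread (and Josh confirms he swapped the dates when typing).
print(posted(5, 3), overlaps(5, 3))   # → False True
```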
Re: [PERFORM] Forcing filter/join order?
On Wed, 18 Feb 2004, Josh Berkus wrote:

> Stephan,
>
> > Can you give more information? I know that I'm not exactly certain what
> > the situation is from the above and the original query/explain piece.
>
> Believe me, if I posted the query it wouldn't help. Heck, I'd have trouble
> following it without my notes. A simplified version:
>
> SELECT events.*, cases.case_name
> FROM events LEFT OUTER JOIN cases ON events.case_id = cases.case_id
> WHERE (event_date >= '2004-03-05' OR (event_date + duration) <= '2004-02-18')
>   AND events.status <> 0;
>
> ... this is to get me all valid events which overlap with the range
> '2004-02-18' to '2004-03-05'.
>
> I had thought, in 7.4, that adding an index on (event_date, (event_date +
> duration)) would improve the execution of this query. It doesn't,
> presumably because the multi-column index can't be used for both ascending
> and descending sorts at the same time, and event_date >= '2004-03-05' isn't
> selective enough.

I don't think the direction issue is the problem in the above. I think the problem is that given a condition like:

  a > value OR b < value

an index on (a,b) isn't directly usable. The planner can't restrict the index scan using a > value alone, since that'd do the wrong thing (it would miss rows that satisfy only the b < value branch).
Testing on a two column table, I see behavior like the following (with seqscan off):

sszabo=# create table q2(a int, b int);
CREATE TABLE
sszabo=# create index q2ind on q2(a,b);
CREATE INDEX
sszabo=# set enable_seqscan=off;
SET
sszabo=# explain select * from q2 where a>3 and b<5;
                            QUERY PLAN
-------------------------------------------------------------------
 Index Scan using q2ind on q2  (cost=0.00..42.79 rows=112 width=8)
   Index Cond: ((a > 3) AND (b < 5))
(2 rows)

sszabo=# explain select * from q2 where a>3 or b<5;
                             QUERY PLAN
---------------------------------------------------------------------
 Seq Scan on q2  (cost=100000000.00..100000025.00 rows=556 width=8)
   Filter: ((a > 3) OR (b < 5))
(2 rows)

sszabo=# create index q2ind2 on q2(b);
CREATE INDEX
sszabo=# explain select * from q2 where a>3 or b<5;
                                QUERY PLAN
---------------------------------------------------------------------------
 Index Scan using q2ind, q2ind2 on q2  (cost=0.00..92.68 rows=556 width=8)
   Index Cond: ((a > 3) OR (b < 5))
(2 rows)
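Stephan's point -- that the planner can't evaluate the OR using only the a > 3 branch of the composite index -- can be made concrete with a few rows of plain Python (toy data, not from the thread):

```python
# Why an index scan on just "a > 3" is wrong when the WHERE clause is
# "a > 3 OR b < 5": the a-branch alone misses rows that match only the
# b-branch, so a second index (or a seqscan) is needed.
rows = [(1, 2), (2, 9), (4, 9), (5, 1), (7, 7)]

full_answer   = [r for r in rows if r[0] > 3 or r[1] < 5]
a_branch_only = [r for r in rows if r[0] > 3]

missed = [r for r in full_answer if r not in a_branch_only]
print(missed)   # → [(1, 2)] -- satisfies only b < 5
```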
Re: [PERFORM] Forcing filter/join order?
Josh Berkus <[EMAIL PROTECTED]> writes:
> SELECT events.*, cases.case_name
> FROM events LEFT OUTER JOIN cases ON events.case_id = cases.case_id
> WHERE (event_date >= '2004-03-05' OR (event_date + duration) <= '2004-02-18')
>   AND events.status <> 0;

> ... this is to get me all valid events which overlap with the range
> '2004-02-18' to '2004-03-05'.

Did you mean events that *don't* overlap with the range? Seems like what you say you want should be expressed as:

    event_date <= 'end-date' AND (event_date + duration) >= 'start-date'

This assumes duration is never negative, of course.

I think you could make this btree-indexable by negating the second clause. Imagine:

    create index evi on events (event_date, (-(event_date + duration)))

and then transforming the query to:

    event_date <= 'end-date' AND -(event_date + duration) <= -'start-date'

but that doesn't quite work because there's no unary minus for date or timestamp types. Is this too ugly for you?

    create index evi on events (event_date, ('ref-date' - event_date - duration))

    event_date <= 'end-date'
    AND ('ref-date' - event_date - duration) <= 'ref-date' - 'start-date'

where 'ref-date' is any convenient fixed reference date, say 1-1-2000.

Now, what this will look like to the planner is a one-sided two-column restriction, and I'm not certain that the planner will assign a sufficiently small selectivity estimate. But in theory it could work.

			regards, tom lane
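Tom's rewrite can be sanity-checked numerically: `(event_date + duration) >= 'start-date'` and `('ref-date' - event_date - duration) <= 'ref-date' - 'start-date'` are the same predicate, since the second form just negates both sides and adds a constant. A quick check over random intervals (dates modeled as integer day offsets; all values invented):

```python
import random

random.seed(1)
REF = 0                    # the fixed 'ref-date'; any value works
start_q, end_q = 100, 130  # the query's 'start-date' .. 'end-date'

for _ in range(1000):
    event_date = random.randint(0, 200)
    duration = random.randint(0, 40)
    original = (event_date <= end_q and
                event_date + duration >= start_q)
    transformed = (event_date <= end_q and
                   REF - event_date - duration <= REF - start_q)
    assert original == transformed

print("both forms select the same rows")
```

The point of the rewrite is purely physical: both comparisons now run in the same direction, so a single ascending two-column btree can serve the one-sided range restriction.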
Re: [PERFORM] Forcing filter/join order?
Tom,

First off, you are correct: I swapped the dates when typing the simplified query into e-mail.

> create index evi on events (event_date, ('ref-date'-event_date-duration))
>
> event_date <= 'end-date'
> AND ('ref-date'-event_date-duration) <= 'ref-date'-'start-date'
>
> where 'ref-date' is any convenient fixed reference date, say 1-1-2000.
>
> Now, what this will look like to the planner is a one-sided two-column
> restriction, and I'm not certain that the planner will assign a
> sufficiently small selectivity estimate. But in theory it could work.

Interesting idea. I'll try it just to see if it works when I have a chance. In the meantime, for production, I'll stick with the hackish solution I was using under 7.2. Knowing that events are never more than one month long for this application, I can do:

WHERE event.event_date >= (begin_date - '1 month')
  AND event.event_date <= end_date

... which works because I have a child table which has event information by day:

AND events.event_id IN (
    SELECT event_id FROM event_day
    WHERE calendar_day BETWEEN begin_date AND end_date
)

Note that this subselect isn't sufficient on its own, because once again the query planner is unable to correctly estimate the selectivity of the subselect. It needs the "help" of the filter against events.event_date. This is the workaround I was using with 7.2. I had just hoped that some of the improvements that Tom has made over the last two versions would cure the problem, but no dice.

--
-Josh Berkus
 Aglio Database Solutions
 San Francisco
Re: [PERFORM] Forcing filter/join order?
Josh Berkus <[EMAIL PROTECTED]> writes:
> Knowing that events are never more than one month long for this
> application, I can do:
> "WHERE event.event_date >= (begin_date - '1 month') AND event.event_date <=
> end_date"
> ... which works because I have a child table which has event information by
> day:

Uh, why do you need the child table? Seems like the correct incantation given an assumption about maximum duration is:

    event_date <= 'end-date'
    AND (event_date + duration) >= 'start-date'
    AND event_date >= 'start-date' - 'max-duration'

The last clause is redundant with the one involving the duration field, but it provides a lower bound for the index scan on event_date. The only index you really need here is one on event_date, but possibly one on (event_date, (event_date + duration)) would be marginally faster.

			regards, tom lane
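The role of the redundant third clause can be checked mechanically: since duration never exceeds 'max-duration', the extra lower bound on event_date cannot reject any qualifying row, so both forms return identical results (a Python sketch with integer day offsets; the numbers are invented):

```python
import random

random.seed(2)
start_q, end_q = 100, 130
MAX_DUR = 31   # 'max-duration': events are at most a month long

events = [(random.randint(0, 300), random.randint(0, MAX_DUR))
          for _ in range(2000)]

# Tom's two-clause overlap test.
overlap = [e for e in events if e[0] <= end_q and e[0] + e[1] >= start_q]

# Same test plus the redundant lower bound; it can't exclude a qualifying
# row (d + dur >= start and dur <= MAX_DUR imply d >= start - MAX_DUR),
# but it hands the index scan a start key instead of an open-ended range.
bounded = [e for e in events
           if e[0] <= end_q and e[0] + e[1] >= start_q
           and e[0] >= start_q - MAX_DUR]

assert overlap == bounded
print(len(overlap), "overlapping events; adding the bound changed nothing")
```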