[PERFORM] Optimizer difference using function index between 7.3 and 7.4

2004-02-18 Thread Jeff Boes
We have a large (several million row) table with a field containing 
URLs. Now, funny thing about URLs: they mostly start with a common 
substring ("http://www."). But not all the rows start with this, so we 
can't just lop off the first N characters. However, we noticed some time 
ago that an index on this field wasn't as effective as an index on the 
REVERSE of the field. So ...

CREATE OR REPLACE FUNCTION fn_urlrev(text) returns text as '
return reverse(lc($_[0]))
' language 'plperl' with (iscachable,isstrict);
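(For what it's worth, on later PostgreSQL releases the same helper can be written without plperl; this sketch assumes the built-in reverse(), which did not exist in the 7.x era, and uses IMMUTABLE/STRICT, the modern spelling of iscachable/isstrict:)

```sql
-- Sketch only: equivalent function on PostgreSQL 9.1+, where
-- reverse() and lower() are built in.  IMMUTABLE STRICT replaces
-- the old "with (iscachable, isstrict)" clause.
CREATE OR REPLACE FUNCTION fn_urlrev(text) RETURNS text AS $$
    SELECT reverse(lower($1));
$$ LANGUAGE sql IMMUTABLE STRICT;
```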
and then

CREATE UNIQUE INDEX ix_links_3 ON links
(fn_urlrev(path_base));
seemed to be much faster. When we have to look up a single entry in 
"links", we do so by something like --

SELECT * FROM links WHERE fn_urlrev(path_base) = ?;

and it's rather fast. When we have a bunch of them to do, under 7.3 we 
found it useful to create a temporary table, fill it with reversed URLs, 
and join:

INSERT INTO temp_link_urls VALUES (fn_urlrev(?));

SELECT l.path_base,l.link_id
   FROM links l
   JOIN temp_link_urls t
   ON (fn_urlrev(l.path_base) = t.rev_path_base);
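(The temp table's DDL isn't shown in the post; a plausible reconstruction, with the column name taken from the join condition above, would be:)

```sql
-- Hypothetical reconstruction of the temp-table setup; only the
-- column name rev_path_base is attested by the queries above.
CREATE TEMP TABLE temp_link_urls (rev_path_base text);
INSERT INTO temp_link_urls VALUES (fn_urlrev('http://www.Example.com/page'));
ANALYZE temp_link_urls;
```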
Here are query plans from the two versions (using a temp table with 200 
rows, after ANALYZE on the temp table):

7.3:

# explain select link_id from links l join clm_tmp_links t
  on (fn_urlrev(l.path_base) = t.rev_path_base);
                                   QUERY PLAN
--------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..3936411.13 rows=2000937 width=152)
   ->  Seq Scan on clm_tmp_links t  (cost=0.00..5.00 rows=200 width=74)
   ->  Index Scan using ix_links_3 on links l  (cost=0.00..19531.96 rows=10005 width=78)
         Index Cond: (fn_urlrev(l.path_base) = "outer".rev_path_base)
(4 rows)

7.4:

# explain select link_id from links l join clm_tmp_links t
  on (fn_urlrev(l.path_base) = t.rev_path_base);
                                   QUERY PLAN
--------------------------------------------------------------------------------
 Hash Join  (cost=5.50..88832.88 rows=1705551 width=4)
   Hash Cond: (fn_urlrev("outer".path_base) = "inner".rev_path_base)
   ->  Seq Scan on links l  (cost=0.00..50452.50 rows=1705550 width=78)
   ->  Hash  (cost=5.00..5.00 rows=200 width=74)
         ->  Seq Scan on clm_tmp_links t  (cost=0.00..5.00 rows=200 width=74)
(5 rows)

Although the cost for the 7.4 query is lower, the 7.3 plan executes in 
about 3 seconds, while the 7.4 plan executes in 59.8 seconds!

Now the odd part: if I change the query to this:

# explain analyze select link_id from links l join clm_tmp_links t
  on (fn_urlrev(l.path_base) = fn_urlrev(t.rev_path_base));
                                   QUERY PLAN
--------------------------------------------------------------------------------
 Merge Join  (cost=12.64..219974.16 rows=1705551 width=4) (actual time=17.928..17.928 rows=0 loops=1)
   Merge Cond: (fn_urlrev("outer".path_base) = "inner"."?column2?")
   ->  Index Scan using ix_links_3 on links l  (cost=0.00..173058.87 rows=1705550 width=78) (actual time=0.229..0.285 rows=7 loops=1)
   ->  Sort  (cost=12.64..13.14 rows=200 width=74) (actual time=9.652..9.871 rows=200 loops=1)
         Sort Key: fn_urlrev(t.rev_path_base)
         ->  Seq Scan on clm_tmp_links t  (cost=0.00..5.00 rows=200 width=74) (actual time=0.166..5.753 rows=200 loops=1)
 Total runtime: 18.125 ms

(i.e., apply the function to the data in the temp table), it runs a 
whole lot faster! Is this a bug in the optimizer? Or did something 
change about the way functional indexes are used?

--
Jeff Boes  vox 269.226.9550 ext 24
Database Engineer fax 269.349.9076
Nexcerpt, Inc. http://www.nexcerpt.com
  ...Nexcerpt... Extend your Expertise
---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?
  http://www.postgresql.org/docs/faqs/FAQ.html


Re: [PERFORM] Optimizer difference using function index between 7.3 and 7.4

2004-02-18 Thread Tom Lane
Jeff Boes <[EMAIL PROTECTED]> writes:
> Is this a bug in the optimizer? Or did something 
> change about the way functional indexes are used?

In 7.3, the only possible plan for these queries was a nestloop or
nestloop with inner indexscan, because the planner could not generate
merge or hash joins on join conditions more complex than "var1 = var2".
You were fortunate that a nestloop was fast enough for your situation.

In 7.4 the planner can (as you see) generate both merge and hash options
for this query.  What it's not very good at yet is picking the best
option to use, because it doesn't have any statistics about the
distribution of functional indexes, and so it's pretty much guessing
about selectivity.

As of just a couple days ago, there is code in CVS tip that keeps and
uses stats about the values of functional index columns.  It seems
likely that this would help out tremendously in terms of estimating
the costs well for your problem.  Don't suppose you'd like to try
setting up a test system with your data and trying it ...

BTW, as best I can tell, the amazing speed for the mergejoin is a bit of
a fluke.

>  Merge Join  (cost=12.64..219974.16 rows=1705551 width=4) (actual time=17.928..17.928 rows=0 loops=1)
>    Merge Cond: (fn_urlrev("outer".path_base) = "inner"."?column2?")
>    ->  Index Scan using ix_links_3 on links l  (cost=0.00..173058.87 rows=1705550 width=78) (actual time=0.229..0.285 rows=7 loops=1)
>    ->  Sort  (cost=12.64..13.14 rows=200 width=74) (actual time=9.652..9.871 rows=200 loops=1)
>          Sort Key: fn_urlrev(t.rev_path_base)
>          ->  Seq Scan on clm_tmp_links t  (cost=0.00..5.00 rows=200 width=74) (actual time=0.166..5.753 rows=200 loops=1)
>  Total runtime: 18.125 ms

Notice how the indexscan on links is reporting that it only returned
7 rows.  Ordinarily you'd expect that it'd scan the whole table (and
that's what the cost estimate is expecting).  I think what must have
happened is that the scan stopped only a little way into the table,
because the sequence of values from the temp table ended with a value
that was close to the start of the range of values in the main table.
Mergejoin stops fetching as soon as it exhausts either input table.
This was good luck for you in this case but would likely not hold up
with another set of temp values.

regards, tom lane



Re: [PERFORM] UPDATE with subquery too slow

2004-02-18 Thread Eric Jain
> I can't get the following statement to complete with reasonable time.

Upgraded to 7.4.1, and realized that NOT IN is far more efficient than
IN, EXISTS or NOT EXISTS, at least for the amount and distribution of
data that I have. Here are some numbers from before and after performing
the problematic clean up operation:

                        |    Before |     After
------------------------+-----------+-----------
COUNT(*)                | 6'104'075 | 6'104'075
COUNT(session)          | 5'945'272 | 3'640'659
COUNT(DISTINCT session) | 2'865'570 |   560'957

The following query completes in less than three hours on a machine
with a high load, versus many, many hours for any of the alternatives:

UPDATE requests
SET session = NULL
WHERE session NOT IN
(
  SELECT r.session
  FROM requests r
  WHERE r.session IS NOT NULL
  GROUP BY r.session
  HAVING COUNT(*) > 1
);

Note that in order to correctly negate an IN subquery, an IS NOT NULL
condition needs to be added.
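To illustrate why (the VALUES-list syntax here is newer than 7.4, but the semantics are the same): a single NULL in the subquery's output makes NOT IN return unknown for every row, so a WHERE clause would reject everything.

```sql
-- NOT IN and NULLs: 2 <> NULL is unknown, so the whole test becomes
-- unknown rather than true once a NULL sneaks into the list.
SELECT 2 NOT IN (VALUES (1));          -- true
SELECT 2 NOT IN (VALUES (1), (NULL));  -- NULL; a WHERE clause would
                                       -- reject the row
```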

Interestingly, the query planner believes that using EXISTS would be
more efficient than NOT IN, and IN only slightly less efficient; I
assume the query planner is not able to accurately estimate the number
of rows returned by the subquery.

EXISTS              351'511
NOT IN              376'577
IN                  386'780
LEFT JOIN        18'263'826
NOT EXISTS    7'241'815'330




[PERFORM] Forcing filter/join order?

2004-02-18 Thread Josh Berkus
Folks,

Have an interesting issue with a complex query, where apparently I need to 
twist the query planner's arm, and am looking for advice on how to do so.

The situation:  I have a table, events, with about 300,000 records.  
It does an outer join to a second table, cases, with about 150,000 records.

A very simplified version of the query would be:

SELECT *
FROM events LEFT OUTER JOIN cases ON events.case_id = cases.case_id
WHERE events.event_date BETWEEN 'date1' AND 'date2'

This join is very expensive, as you can imagine.   Yet I can't seem to force 
the query planner to apply the filter conditions to the events table *before* 
attempting to join it to cases.  Here's the crucial explain lines:

 ->  Merge Left Join  (cost=0.00..11880.82 rows=15879 width=213) (actual time=5.777..901.899 rows=648 loops=1)
       Merge Cond: ("outer".case_id = "inner".case_id)
       Join Filter: (("outer".link_type)::text = 'case'::text)
       ->  Index Scan using idx_event_ends on events  (cost=0.00..4546.15 rows=15879 width=80) (actual time=4.144..333.769 rows=648 loops=1)
             Filter: ((status <> 0) AND ((event_date + duration) >= '2004-02-18 00:00:00'::timestamp without time zone) AND (event_date <= '2004-03-05 23:59:00'::timestamp without time zone))
       ->  Index Scan using cases_pkey on cases  (cost=0.00..6802.78 rows=117478 width=137) (actual time=0.139..402.363 rows=116835 loops=1)

As you can see, part of the problem is a pretty drastic (20x) mis-estimation 
of the selectivity of the date limits on events, which results in 90% of the 
execution time of my query being spent on this one join.  I've tried raising 
the statistics targets on event_date, duration, and case_id (to 1000), but 
this doesn't seem to affect the estimate or the query plan.

In the above test, idx_event_ends indexes (case_id, status, event_date, 
(event_date + duration)), but as you can see the planner uses only the first 
column.  This was an attempt to circumvent the planner's tendency to 
completely ignore any index on (event_date, (event_date + duration)), even 
though that index is the most selective combination on the events table.

Is there anything I can do to force the query planner to filter on events 
before joining cases, other than waiting for version 7.5?

-- 
-Josh Berkus
 Aglio Database Solutions
 San Francisco




Re: [PERFORM] Forcing filter/join order?

2004-02-18 Thread Peter Darley
Josh,
I'm sure the big brains have a better suggestion, but in the mean time
could you do something as simple as:

SELECT *
FROM (select * from events where event_date BETWEEN 'date1' AND 'date2') e
LEFT OUTER JOIN cases ON e.case_id = cases.case_id;

Thanks,
Peter Darley



Re: [PERFORM] Forcing filter/join order?

2004-02-18 Thread Josh Berkus
Folks,

Hmmm posted too soon.  Figured out the problem:

The planner can't, or doesn't want to, use an index on (event_date, 
(event_date + duration)) where the first column is an ascending sort and the 
second a descending sort.  So I've coded a workaround that's quite 
inelegant but does get the correct results in 0.3 seconds (as opposed to the 
2.2 seconds taken by the example plan).

Is this the sort of thing which is ever likely to get fixed, or just a 
fundamental limitation of index algorithms?   Would using a non B-Tree index 
allow me to work around this?

-- 
-Josh Berkus
 Aglio Database Solutions
 San Francisco




Re: [PERFORM] Forcing filter/join order?

2004-02-18 Thread Josh Berkus
Peter,

>   I'm sure the big brains have a better suggestion, but in the mean time
> could you do something as simple as:
> 
> SELECT *
> FROM (select * from events where event_date BETWEEN 'date1' AND 'date2') e
> LEFT OUTER JOIN cases ON e.case_id = cases.case_id;

Thanks, but that doesn't work; the planner will collapse the subquery into the 
main query, which most of the time is the right thing to do.
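(One workaround often mentioned for this, not tried in the thread, so treat it as a sketch: put an OFFSET 0 in the subquery. It is a no-op for the result set but keeps the planner from pulling the subquery up into the main query.)

```sql
-- Hedged sketch: OFFSET 0 acts as an optimization fence, so the date
-- filter would be applied before the outer join; 'date1'/'date2' are
-- placeholders from the example above.
SELECT *
FROM (SELECT * FROM events
      WHERE event_date BETWEEN 'date1' AND 'date2'
      OFFSET 0) e
LEFT OUTER JOIN cases ON e.case_id = cases.case_id;
```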

-- 
-Josh Berkus
 Aglio Database Solutions
 San Francisco




Re: [PERFORM] Forcing filter/join order?

2004-02-18 Thread Stephan Szabo
On Wed, 18 Feb 2004, Josh Berkus wrote:

> The planner can't, or doesn't want to, use an index on (event_date,
> (event_date + duration)) where the first column is an ascending sort and the
> second a descending sort.So I've coded a workaround that's quite
> inelegant but does get the correct results in 0.3 seconds (as opposed to the
> 2.2 seconds taken by the example plan).

Can you give more information?  I know that I'm not exactly certain what
the situation is from the above and the original query/explain piece.



Re: [PERFORM] Forcing filter/join order?

2004-02-18 Thread Josh Berkus
Stephan,

> Can you give more information?  I know that I'm not exactly certain what
> the situation is from the above and the original query/explain piece.
> 

Believe me, if I posted the query it wouldn't help.  Heck, I'd have trouble 
following it without my notes.

a simplified version:

SELECT events.*, cases.case_name
FROM events LEFT OUTER JOIN cases ON events.case_id = cases.case_id
WHERE (event_date >= '2004-03-05' OR (event_date + duration) <= '2004-02-18')
AND events.status <> 0;

... this is to get me all valid events which overlap with the range 
'2004-02-18' to '2004-03-05'.

I had thought, in 7.4, that adding an index on (event_date, (event_date + 
duration)) would improve the execution of this query.   It doesn't, 
presumably because the multi-column index can't be used for both ascending 
and descending sorts at the same time, and event_date >= '2004-03-05' isn't 
selective enough.

There was a workaround for this posted on hackers about a year ago as I 
recally, that involved creating custom operators for indexing.  Too much 
trouble when there's a hackish workaround (due to the fact that events have 
to be less than a month long).

-- 
-Josh Berkus
 Aglio Database Solutions
 San Francisco




Re: [PERFORM] Forcing filter/join order?

2004-02-18 Thread Stephan Szabo
On Wed, 18 Feb 2004, Josh Berkus wrote:

> Stephan,
>
> > Can you give more information?  I know that I'm not exactly certain what
> > the situation is from the above and the original query/explain piece.
> >
>
> Believe me, if I posted the query it wouldn't help.  Heck, I'd have trouble
> following it without my notes.
>
> a simplified version:
>
> SELECT events.*, cases.case_name
> FROM events LEFT OUTER JOIN cases ON events.case_id = cases.case_id
> WHERE (event_date >= '2004-03-05' OR (event_date + duration) <= '2004-02-18')
>   AND events.status <> 0;
>
> ... this is to get me all valid events which overlap with the range
> '2004-02-18' to '2004-03-05'.
>
> I had thought, in 7.4, that adding an index on (event_date, (event_date +
> duration)) would improve the execution of this query.   It doesn't,
> presumably because the multi-column index can't be used for both ascending
> and descending sorts at the same time, and event_date >= '2004-03-05' isn't
> selective enough.

I don't think the direction issue is the problem in the above.  I think
the problem is that given a condition like:
  a>value1 or b<value2
it can't use an index on (a,b) to satisfy a>value1 alone, since that'd
do the wrong thing.

Testing on a two column table, I see behavior like the following (with
seqscan off)

sszabo=# create table q2(a int, b int);
CREATE TABLE
sszabo=# create index q2ind on q2(a,b);
CREATE INDEX
sszabo=# set enable_seqscan=off;
SET
sszabo=# explain select * from q2 where a>3 and b<5;
QUERY PLAN
---
 Index Scan using q2ind on q2  (cost=0.00..42.79 rows=112 width=8)
   Index Cond: ((a > 3) AND (b < 5))
(2 rows)

sszabo=# explain select * from q2 where a>3 or b<5;
 QUERY PLAN

 Seq Scan on q2  (cost=100000000.00..100000025.00 rows=556 width=8)
   Filter: ((a > 3) OR (b < 5))
(2 rows)

sszabo=# create index q2ind2 on q2(b);
CREATE INDEX
sszabo=# explain select * from q2 where a>3 or b<5;
QUERY PLAN
---
 Index Scan using q2ind, q2ind2 on q2  (cost=0.00..92.68 rows=556 width=8)
   Index Cond: ((a > 3) OR (b < 5))
(2 rows)





Re: [PERFORM] Forcing filter/join order?

2004-02-18 Thread Tom Lane
Josh Berkus <[EMAIL PROTECTED]> writes:
> SELECT events.*, cases.case_name
> FROM events LEFT OUTER JOIN cases ON events.case_id = cases.case_id
> WHERE (event_date >= '2004-03-05' OR (event_date + duration) <= '2004-02-18')
>   AND events.status <> 0;

> ... this is to get me all vaild events which overlap with the range 
> '2004-02-18' to '2004-03-05'.

Did you mean events that *don't* overlap with the range?  Seems like
what you say you want should be expressed as

event_date <= 'end-date' AND (event_date + duration) >= 'start-date'

This assumes duration is never negative of course.

I think you could make this btree-indexable by negating the second
clause.  Imagine

create index evi on events (event_date, (-(event_date+duration)))

and then transforming the query to

event_date <= 'end-date' AND -(event_date + duration) <= -'start-date'

but that doesn't quite work because there's no unary minus for date or
timestamp types.  Is this too ugly for you?

create index evi on events (event_date, ('ref-date'-event_date-duration))

event_date <= 'end-date'
AND ('ref-date'-event_date-duration) <= 'ref-date'-'start-date'

where 'ref-date' is any convenient fixed reference date, say 1-1-2000.

Now, what this will look like to the planner is a one-sided two-column
restriction, and I'm not certain that the planner will assign a
sufficiently small selectivity estimate.  But in theory it could work.
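Spelling Tom's sketch out with a concrete reference date (assuming event_date is a timestamp and duration an interval, so both subtractions yield comparable intervals; these types are not stated in the thread):

```sql
-- Concrete version of the ref-date trick; '2000-01-01' is an
-- arbitrary fixed date, and the column types are assumptions.
CREATE INDEX evi ON events
    (event_date, ('2000-01-01'::timestamp - event_date - duration));

SELECT * FROM events
WHERE event_date <= '2004-03-05'
  AND ('2000-01-01'::timestamp - event_date - duration)
      <= ('2000-01-01'::timestamp - '2004-02-18'::timestamp);
```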

regards, tom lane



Re: [PERFORM] Forcing filter/join order?

2004-02-18 Thread Josh Berkus
Tom,

First off, you are correct, I swapped the dates when typing the simplified 
query into e-mail.

> create index evi on events (event_date, ('ref-date'-event_date-duration))
> 
> event_date <= 'end-date'
> AND ('ref-date'-event_date-duration) <= 'ref-date'-'start-date'
> 
> where 'ref-date' is any convenient fixed reference date, say 1-1-2000.
> 
> Now, what this will look like to the planner is a one-sided two-column
> restriction, and I'm not certain that the planner will assign a
> sufficiently small selectivity estimate.  But in theory it could work.

Interesting idea.   I'll try it just to see if it works when I have a chance.   

In the meantime, for production, I'll stick with the hackish solution I was 
using under 7.2.   

Knowing that events are never more than one month long for this application, I 
can do:

"WHERE event.event_date >= (begin_date - '1 month') AND event.event_date <= 
end_date"

... which works because I have a child table which has event information by 
day:

AND events.event_id IN (SELECT event_id FROM event_day
WHERE calendar_day BETWEEN begin_date AND end_date);

Note that this subselect isn't sufficient on its own, because once again the 
query planner is unable to correctly estimate the selectivity of the 
subselect.  It needs the "help" of the filter against events.event_date.
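Assembled from the fragments above, the full workaround would look roughly like this (the literal dates are hypothetical; event_day is the per-day child table Josh mentions):

```sql
-- Sketch of the 7.2-era workaround: a redundant range filter on
-- event_date (valid because events last at most one month) plus the
-- per-day child-table subselect.
SELECT events.*
FROM events
WHERE events.event_date >= '2004-02-18'::date - interval '1 month'
  AND events.event_date <= '2004-03-05'
  AND events.event_id IN (SELECT event_id FROM event_day
                          WHERE calendar_day BETWEEN '2004-02-18'
                                                 AND '2004-03-05');
```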

This is the workaround I was using with 7.2.   I had just hoped that some of 
the improvements that Tom has made over the last two versions would cure the 
problem, but no dice.

-- 
-Josh Berkus
 Aglio Database Solutions
 San Francisco




Re: [PERFORM] Forcing filter/join order?

2004-02-18 Thread Tom Lane
Josh Berkus <[EMAIL PROTECTED]> writes:
> Knowing that events are never more than one month long for this
> application, I can do:

> "WHERE event.event_date >= (begin_date - '1 month') AND event.event_date <= 
> end_date"

> ... which works because I have a child table which has event information by 
> day:

Uh, why do you need the child table?  Seems like the correct incantation
given an assumption about maximum duration is

event_date <= 'end-date' AND (event_date + duration) >= 'start-date'
AND event_date >= 'start-date' - 'max-duration'

The last clause is redundant with the one involving the duration field,
but it provides a lower bound for the index scan on event_date.  The
only index you really need here is one on event_date, but possibly one
on (event_date, (event_date + duration)) would be marginally faster.
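As a concrete sketch of that incantation (the literal dates and the one-month cap are assumptions carried over from Josh's example):

```sql
-- Tom's version: the third clause is logically redundant with the
-- duration test, but it gives the scan on event_date a lower bound.
SELECT * FROM events
WHERE event_date <= '2004-03-05'
  AND (event_date + duration) >= '2004-02-18'
  AND event_date >= '2004-02-18'::date - interval '1 month';
```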

regards, tom lane
