On Sun, Aug 18, 2013 at 10:33 PM, Kevin Grittner <kgri...@ymail.com> wrote:

> Tyler Reese <juke...@gmail.com> wrote:
> > Kevin Grittner <kgri...@ymail.com> wrote:
> >> Tyler Reese <juke...@gmail.com> wrote:
>
> >>> mydb=> explain analyze SELECT * FROM "cdr" WHERE lower("CallingPartyNumber") = '9725551212' order by "key" limit 100;
> >>>
> >>> Limit  (cost=0.00..72882.05 rows=100 width=757) (actual time=20481.083..30464.960 rows=11 loops=1)
> >>>   ->  Index Scan using cdr_pkey on cdr  (cost=0.00..30036152.32 rows=41212 width=757) (actual time=20481.049..30464.686 rows=11 loops=1)
> >>>         Filter: (lower("CallingPartyNumber") = '9725551212'::text)
> >>> Total runtime: 30465.246 ms
> >>
> >> It thinks that it will only need to read 1/412th of the table to
> >> find 100 matching rows, and using that index it will be able to
> >> skip the sort.  Since there aren't 100 matching rows, it has to
> >> read the whole table through the index.
>
> > So, since it thinks it needs to read 1/412th of the table is the
> > reason why the query planner chooses to use the primary key index
> > instead of the callingpartynumber index, like it does in the
> > first 3 cases?
>
> The optimizer compares the estimated cost of reading all matching
> rows (but only matching rows) out of order and then sorting them to
> the estimated cost of reading them in order and filtering out the
> rows that don't match (and stopping when the limit is reached).
> Since it thought a lot of rows would match, that made the sort look
> more expensive and also made it look like it would not need to read
> a very large percentage of the table.
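>
> (You can see that arithmetic in your slow plan: the index scan was
> costed at 30036152.32 for an estimated 41212 matching rows, and the
> Limit's cost of 72882.05 is exactly that cost scaled by 100/41212,
> i.e. the planner expected to stop after about 1/412th of the scan.)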
>
> > I'm curious as to why it says "rows=41212".  Is that the estimate
> > of the number of rows that meet the filter condition?  Where does
> > that come from?
>
> That's based on the distribution observed in the last random sample
> when ANALYZE was run (as a command or by autovacuum).  When there
> is an error in the estimate that bad, either autovacuum is not
> configured to be aggressive enough in analyzing or the sample size
> was not large enough.
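>
> For instance, pg_stat_user_tables (a standard statistics view) shows
> when the table was last analyzed, whether manually or by autovacuum:
>
>   SELECT last_analyze, last_autoanalyze
>     FROM pg_stat_user_tables
>    WHERE relname = 'cdr';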
>
> >>  Raising the statistics target and running ANALYZE might allow
> >> it to use a more accurate estimate, and thereby make a better
> >> choice.
> >
> > I haven't heard of raising the statistics target, so I'll read up
> > on that.
>
>
> http://www.postgresql.org/docs/9.2/interactive/runtime-config-query.html#RUNTIME-CONFIG-QUERY-OTHER
>
> http://www.postgresql.org/docs/9.2/interactive/sql-altertable.html
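>
> For example, to raise the target for just that column and refresh
> the statistics (1000 here is only an illustrative value; the
> default is 100):
>
>   ALTER TABLE "cdr" ALTER COLUMN "CallingPartyNumber" SET STATISTICS 1000;
>   ANALYZE "cdr";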
>
> > A few days ago, all 4 cases were responding equally fast. I had
> > been messing around with the postgres settings, and I went and
> > dropped all of the indexes and recreated them just to see what
> > would happen.  I wouldn't think that recreating the indexes would
> > cause case 4 to go slow, but that's the symptom I am seeing now.
> > Should I be running analyze on a table after it has been
> > reindexed?
>
> Only if you have indexes on expressions rather than simple column
> names.  If you have an index on lower("CallingPartyNumber") that
> would qualify as an expression.
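>
> Assuming the index was created along these lines (the name
> callingpartynumber_lower appears in your plans):
>
>   CREATE INDEX callingpartynumber_lower ON "cdr" (lower("CallingPartyNumber"));
>
> ANALYZE gathers statistics on the lower(...) expression itself, so
> dropping and recreating the index throws those statistics away until
> the next ANALYZE.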
>
> The normal reason to need to get fresh statistics is because of
> changes in the distribution of values in a column, particularly
> after a bulk load.  Also, columns with a large number of distinct
> values tend to benefit from a higher statistics target. The down
> sides of higher statistics targets are a longer time to ANALYZE and
> increased planning time; so it's generally best to use the default
> except where a particular problem has been observed, like in this
> case.
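>
> To see what ANALYZE last recorded for the column, something like
> this works (pg_stats is the standard view over collected statistics):
>
>   SELECT n_distinct, null_frac
>     FROM pg_stats
>    WHERE tablename = 'cdr' AND attname = 'CallingPartyNumber';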
>

I performed ANALYZE on the table and now case 4 is fast again:

mydb=> explain analyze SELECT * FROM "cdr" WHERE lower("CallingPartyNumber") = '9725551212' order by "key" limit 100;
                                                                   QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=958.72..958.97 rows=100 width=758) (actual time=0.251..0.310 rows=11 loops=1)
   ->  Sort  (cost=958.72..959.32 rows=240 width=758) (actual time=0.244..0.266 rows=11 loops=1)
         Sort Key: key
         Sort Method:  quicksort  Memory: 30kB
         ->  Bitmap Heap Scan on cdr  (cost=7.30..949.55 rows=240 width=758) (actual time=0.105..0.162 rows=11 loops=1)
               Recheck Cond: (lower("CallingPartyNumber") = '9725551212'::text)
               ->  Bitmap Index Scan on callingpartynumber_lower  (cost=0.00..7.24 rows=240 width=0) (actual time=0.085..0.085 rows=11 loops=1)
                     Index Cond: (lower("CallingPartyNumber") = '9725551212'::text)
 Total runtime: 0.517 ms
(9 rows)

Thanks for the help, Kevin.
