[PERFORM] Query optimization using order by and limit

2011-09-21 Thread Michael Viscuso
First of all, thank you for taking the time to review my question.  After
attending the PostgresOpen conference in Chicago last week, I've been
poring over explain logs for hours on end, and although my system is MUCH
better, I still can't resolve a few issues.  Luckily my data is fairly well
structured, so solving one issue will likely solve many more; I'll start
with this one.

Version: PostgreSQL 9.1rc1, compiled by Visual C++ build 1500, 64-bit
OS: Windows 7 64-bit
ORM: SQLAlchemy
Postgres table structure: I have daily partitioned tables for each of 4
"core tables" (the tables holding the majority of my application's data).
Each daily table inherits from its parent.  I do not explicitly define a
REFERENCES constraint between these tables because I cannot guarantee the
order in which the events are inserted into the database, but where there
are references, the referenced row should exist in the other table's daily
partition.  The reason I partitioned the data this way is to increase query
speed and make it easy to archive old data.  (I'm new to high-end Postgres
performance so there are likely several fundamental flaws in my
assumptions.  I won't turn down any recommendation.)
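
For context, each daily partition is created along these lines (a sketch
rather than my exact DDL; the trigger or rule that routes inserts to the
correct child table is omitted):

-- Sketch of one daily child table: it inherits the parent's columns, and
-- the CHECK constraint on firstloadtime lets constraint_exclusion skip
-- this table whenever the queried time range doesn't overlap this day.
CREATE TABLE osmoduleloads_2011_09_14 (
    CHECK (firstloadtime >= 1296044640::bigint::numeric
       AND firstloadtime <  1296053280::bigint::numeric)
) INHERITS (osmoduleloads);

-- Each child gets its own primary key and indexes.
ALTER TABLE osmoduleloads_2011_09_14
    ADD CONSTRAINT osmoduleloads_2011_09_14_pkey PRIMARY KEY (guid);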

An example of a daily partitioned table follows:

cb=# \d osmoduleloads_2011_09_14;
                        Table "public.osmoduleloads_2011_09_14"
        Column         |            Type             |                  Modifiers
-----------------------+-----------------------------+----------------------------------------------
 guid                  | numeric(20,0)               | not null
 osprocess_guid        | numeric(20,0)               | not null
 filepath_guid         | numeric(20,0)               | not null
 firstloadtime         | numeric(20,0)               | not null
 md5hash               | bytea                       | not null
 host_guid             | numeric(20,0)               | default NULL::numeric
 process_create_time   | numeric(20,0)               | default NULL::numeric
 process_filepath_guid | numeric(20,0)               | default NULL::numeric
 event_time            | timestamp without time zone | default '2011-09-14 00:00:00'::timestamp without time zone
Indexes:
    "osmoduleloads_2011_09_14_pkey" PRIMARY KEY, btree (guid)
    "idx_osmoduleloads_2011_09_14_filepath_guid" btree (filepath_guid)
    "idx_osmoduleloads_2011_09_14_firstload_time" btree (firstloadtime)
    "idx_osmoduleloads_2011_09_14_host_guid" btree (host_guid)
    "idx_osmoduleloads_2011_09_14_md5hash" btree (md5hash)
    "idx_osmoduleloads_2011_09_14_osprocess_guid" btree (osprocess_guid)
Check constraints:
    "osmoduleloads_2011_09_14_event_time_check" CHECK (event_time = '2011-09-14 00:00:00'::timestamp without time zone)
    "osmoduleloads_2011_09_14_firstloadtime_check" CHECK (firstloadtime >= 1296044640::bigint::numeric AND firstloadtime < 1296053280::bigint::numeric)
Inherits: osmoduleloads

Objective: The firstloadtime check constraint ensures that a record is
applicable to that daily table.  (In case you were wondering, the large
numerics correspond to Windows 100-nanosecond intervals since the epoch.)
I'm inserting millions of records into each daily table, so "query
slowness" is quite easy to spot.  Given that there is so much data per
daily table, I was hoping to use the order by and limit clauses to "stop
out" a query once it satisfied the limit clause, rather than being forced
to visit each daily table.  However, the query spends far more time in the
older tables than I'd like, which leads me to believe that I'm doing
something wrong.  For ease of viewing, my explain analyze can be found at
http://explain.depesz.com/s/tot

I'm still very new to this, so I'm not sure whether explain.depesz.com
saves the original query.  It wasn't readily apparent that it does, so here
is the original query:

SELECT osm_1.*, storefiles_1.*, filepaths_1.*, filepaths_2.*
FROM (SELECT *
        FROM osmoduleloads
        JOIN hosts ON hosts.guid = osmoduleloads.host_guid
       WHERE hosts.guid = '2007075705813916178'
         AND osmoduleloads.firstloadtime >= 1296043200
         AND osmoduleloads.firstloadtime < 12961036799000
         AND hosts.enabled = true
         AND hosts.user_id = 111
       ORDER BY osmoduleloads.firstloadtime DESC
       LIMIT 251) AS osm_1
LEFT OUTER JOIN storefiles AS storefiles_1
       ON osm_1.md5hash = storefiles_1.md5hash
LEFT OUTER JOIN filepaths AS filepaths_1
       ON osm_1.process_filepath_guid = filepaths_1.guid
      AND osm_1.event_time = filepaths_1.event_time
LEFT OUTER JOIN filepaths AS filepaths_2
       ON osm_1.filepath_guid = filepaths_2.guid
      AND osm_1.event_time = filepaths_2.event_time
ORDER BY osm_1.firstloadtime DESC;

Hopefully my assumptions about order by and limit are correct and this query
can be optimized.

Again, appreciate any help you can lend.  Thanks in advance.

Mike


Re: [PERFORM] Query optimization using order by and limit

2011-09-21 Thread Michael Viscuso
Thanks guys,

First of all, I should have included my postgresql.conf file with the
original submission.  Sorry about that.  It is now attached.

Based on a recommendation, I also should have shown the parent/child
relationship between osmoduleloads and its daily partitioned tables.  To
reduce clutter, it is at the end of this message.
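
(The setting that drives the partition pruning discussed below is
constraint_exclusion.  A quick way to sanity-check it from a session, in
case it helps anyone reading along:)

-- constraint_exclusion defaults to 'partition' in 9.1, which applies CHECK
-- constraints when planning queries against inheritance children.
SHOW constraint_exclusion;
SET constraint_exclusion = partition;  -- session-level, for experimenting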

Taking this one step at a time, and starting with Greg's second
suggestion, issuing

select * from osmoduleloads WHERE osmoduleloads.firstloadtime >=
1296043200 AND osmoduleloads.firstloadtime < 12961036799000;

appears to only query the appropriate daily tables (2011_09_13 through
2011_09_20: http://explain.depesz.com/s/QCG), so constraint_exclusion is
working properly.  Putting a limit on the query, like:

select * from osmoduleloads WHERE osmoduleloads.firstloadtime >=
1296043200 AND osmoduleloads.firstloadtime < 12961036799000
limit 251;

has the result that I'd expect to see: http://explain.depesz.com/s/O7fZ.
Ordering by firstloadtime AND limiting, like:

select * from osmoduleloads WHERE osmoduleloads.firstloadtime >=
1296043200 AND osmoduleloads.firstloadtime < 12961036799000
order by firstloadtime desc limit 251;

also has the result that I'd expect to see:
http://explain.depesz.com/s/RDh.

Adding the hosts join condition to the mix was still OK:
http://explain.depesz.com/s/2Ns.

Adding the hosts.enabled condition was still OK:
http://explain.depesz.com/s/UYN.

Adding hosts.user_id = 111 started the descent, but it appears to still be
obeying the constraint_exclusion behavior that I'd expect, just with a ton
of rows returned from the most recent daily tables:
http://explain.depesz.com/s/4WE.

Adding the final condition hosts.guid = '2007075705813916178' is what
ultimately kills it: http://explain.depesz.com/s/8zy.  By adding the host
guid, the query spends considerably more time in the older tables than
without this condition, and I'm not sure why.

Thanks Greg for the recommendation to step through it like that -
hopefully this helps get us closer to a resolution.

Greg/Tom, you are correct: these columns should be changed to whatever
type makes it easiest for Postgres to handle 64-bit integers.  Would you
still recommend bigint for unsigned integers?  I likely read the wrong
documentation that suggested bigint for signed 64-bit integers and
numeric(20) for unsigned 64-bit integers.
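
In case it helps anyone following along, the per-partition change I'm
planning looks roughly like this (a sketch; it assumes every stored value
fits in the signed 64-bit range, and the type change rewrites the table
under an exclusive lock, so it will be slow on large partitions):

-- Sketch: convert one numeric(20,0) column on one daily partition to
-- bigint.  Repeat for each numeric column, on the parent and every child.
ALTER TABLE osmoduleloads_2011_09_14
    ALTER COLUMN firstloadtime TYPE bigint
    USING firstloadtime::bigint;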

Thanks again for all your help!  Perhaps 15 hours of poring over explain
logs will finally pan out!

Mike

cb=# \d+ osmoduleloads;
                                   Table "public.osmoduleloads"
        Column         |            Type             |       Modifiers       | Storage  | Description
-----------------------+-----------------------------+-----------------------+----------+-------------
 guid                  | numeric(20,0)               | not null              | main     |
 osprocess_guid        | numeric(20,0)               | not null              | main     |
 filepath_guid         | numeric(20,0)               | not null              | main     |
 firstloadtime         | numeric(20,0)               | not null              | main     |
 md5hash               | bytea                       | not null              | extended |
 host_guid             | numeric(20,0)               | default NULL::numeric | main     |
 process_create_time   | numeric(20,0)               | default NULL::numeric | main     |
 process_filepath_guid | numeric(20,0)               | default NULL::numeric | main     |
 event_time            | timestamp without time zone |                       | plain    |
Indexes:
    "osmoduleloads_pkey" PRIMARY KEY, btree (guid)
Child tables: osmoduleloads_2001_12_31,
  osmoduleloads_2010_10_11,
  osmoduleloads_2010_10_12,
  osmoduleloads_2010_10_13,
  osmoduleloads_2011_07_27,
  osmoduleloads_2011_08_04,
  osmoduleloads_2011_08_05,
  osmoduleloads_2011_08_06,
  osmoduleloads_2011_08_07,
  osmoduleloads_2011_08_08,
  osmoduleloads_2011_08_09,
  osmoduleloads_2011_08_10,
  osmoduleloads_2011_08_11,
  osmoduleloads_2011_08_12,
  osmoduleloads_2011_08_13,
  osmoduleloads_2011_08_14,
  osmoduleloads_2011_08_15,
  osmoduleloads_2011_08_16,
  osmoduleloads_2011_08_17,
  osmoduleloads_2011_08_18,
  osmoduleloads_2011_08_19,
  osmoduleloads_2011_08_20,
  osmoduleloads_2011_08_21,
  osmoduleloads_2011_08_22,
  osmoduleloads_2011_08_23,
  osmoduleloads_2011_08_24,
  osmoduleloads_2011_08_25,
  osmoduleloads_2011_08_26,
  osmoduleloads_2011_08_27,
  osmoduleloads_2011_08_28,
  osmoduleloads_2011_08_29,
  osmoduleloads_2011_08_30,
  osmoduleloads_2011_08_3

Re: [PERFORM] Query optimization using order by and limit

2011-09-22 Thread Michael Viscuso
Thanks Ken,

I'm discussing with my coworker how best to make that change *as we
speak*.  Do you think this will also resolve the original issue I'm seeing,
where the query doesn't "limit out" properly and spends time in child
tables that won't yield any results?  I was hoping that by using the check
constraints, I could query over a week's or month's worth of partitioned
tables and the combination of order by and limit would eliminate any time
spent searching unnecessary tables, but that doesn't appear to be true.
(I'm still very new to high-end Postgres performance, so I could be
mistaken.)

Regardless, in the meantime, I'll switch those columns to bigint instead
of numeric and have an update as soon as possible.

Thanks for your help!

Mike

On 9/22/2011 9:41 AM, k...@rice.edu wrote:
> On Wed, Sep 21, 2011 at 11:22:53PM -0400, Tom Lane wrote:
>> Michael Viscuso  writes:
>>> Greg/Tom, you are correct: these columns should be changed to whatever
>>> type makes it easiest for Postgres to handle 64-bit integers.  Would you
>>> still recommend bigint for unsigned integers?  I likely read the wrong
>>> documentation that suggested bigint for signed 64-bit integers and
>>> numeric(20) for unsigned 64-bit integers.
>> Unsigned?  Oh, hm, that's a bit of a problem because we don't have any
>> unsigned types.  If you really need to go to 2^64 and not 2^63 then
>> you're stuck with numeric ... but that last bit is costing ya a lot.
>>
>>  regards, tom lane
>>
> Hi Michael,
>
> If you have access to the application, you can map the unsigned 64-bit
> values to the PostgreSQL signed 64-bit type with a simple subtraction. That
> will allow you to drop all the numeric use. Also, if the guid is a 64-bit
> value stuffed into a numeric(20), you can do it there as well. I achieved
> a hefty performance boost by making those application-level changes in a
> similar situation.
>
> Regards,
> Ken
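
To illustrate the subtraction Ken describes: shifting by 2^63 maps the
full unsigned 64-bit range onto bigint's signed range while preserving sort
order, so indexes and range scans still behave.  A hypothetical round trip
(in practice the subtraction would happen in the application, before the
INSERT):

-- Store: s = u - 2^63 maps unsigned [0, 2^64) onto bigint [-2^63, 2^63).
SELECT (18446744073709551615::numeric - 9223372036854775808::numeric)::bigint;
-- => 9223372036854775807  (the max unsigned value lands on bigint's max)

-- Read back: u = s + 2^63 recovers the original unsigned value.
SELECT 9223372036854775807::numeric + 9223372036854775808::numeric;
-- => 18446744073709551615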




Re: [PERFORM] Query optimization using order by and limit

2011-09-22 Thread Michael Viscuso

Stephen,
 
I spent the better part of the day implementing an application-layer
nested loop and it seems to be working well.  Of course it's a little
slower than a Postgres-only solution, because it has to pass data back and
forth for each daily table query until it reaches the limit, but at least I
don't have "runaway" queries like I was seeing before.  That should be a
pretty good stopgap solution for the time being.
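
In case it helps anyone else, here is the gist of the approach translated
into a hypothetical server-side sketch (my real version lives in the
application and carries the full set of join conditions through SQLAlchemy;
the function name and the host-only filter are made up for illustration):

-- Hypothetical PL/pgSQL sketch of the application-layer loop: probe each
-- daily table newest-first and stop as soon as enough rows are collected.
CREATE OR REPLACE FUNCTION recent_module_loads(p_host_guid numeric,
                                               p_limit     integer)
RETURNS SETOF osmoduleloads AS $$
DECLARE
    part      text;
    rec       osmoduleloads%ROWTYPE;
    remaining integer := p_limit;
BEGIN
    FOR part IN
        SELECT c.relname
          FROM pg_inherits i
          JOIN pg_class c ON c.oid = i.inhrelid
         WHERE i.inhparent = 'osmoduleloads'::regclass
         ORDER BY c.relname DESC   -- zero-padded dates sort newest-first
    LOOP
        FOR rec IN EXECUTE format(
            'SELECT * FROM %I WHERE host_guid = $1
              ORDER BY firstloadtime DESC LIMIT $2', part)
            USING p_host_guid, remaining
        LOOP
            RETURN NEXT rec;
            remaining := remaining - 1;
        END LOOP;
        EXIT WHEN remaining <= 0;   -- limit satisfied; skip older tables
    END LOOP;
END;
$$ LANGUAGE plpgsql;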
 
I was really hoping there was a Postgres exclusive answer though! :)  If
there are any other suggestions, it's a simple flag in my application to
query the other way again...
 
Thanks for all your help - and I'm still looking to change those
numerics to bigints, just haven't figured out the best way yet.
 
Mike
 
On 9/22/2011 10:53 AM, Stephen Frost wrote:
> * Michael Viscuso (michael.visc...@getcarbonblack.com) wrote:
>> Adding the final condition hosts.guid = '2007075705813916178' is what
>> ultimately kills it: http://explain.depesz.com/s/8zy.  By adding the host
>> guid, the query spends considerably more time in the older tables than
>> without this condition, and I'm not sure why.
>
> What I think is happening here is that PG is pushing down that filter
> (not typically a bad thing..), but with that condition, it's going to
> scan the index until it finds a match for that filter before returning
> back up only to have that result cut out due to the limit. Having it as
> numerics isn't helping here, but the bigger issue is having to check all
> those tuples for a match to the filter.
>
> Mike, the filter has to be applied before the order by/limit, since
> those clauses come after the filter has been applied (you wouldn't want
> a 'where x = 2 limit 10' to return early just because it found 10
> records where x didn't equal 2).
>
> What would be great is if PG would realize that the CHECK constraints
> prevent earlier records from being in these earlier tables, so it
> shouldn't need to consider them at all once the records from the
> 'latest' table have been found and the limit reached (reverse all this
> for an 'ascending' query, of course), which we can do when there's no
> order by.  I don't believe we have that kind of logic or that
> information available at this late stage; the CHECK constraints are used
> to eliminate the impossible-to-match tables, but that's it.
>
> One option, which isn't great of course, would be to implement your own
> 'nested loop' construct (something I typically despise..) in the
> application which just walks backwards from the latest and pulls
> whatever records it can from each day and then stops once it hits the
> limit.
>
> Thanks,
>
> Stephen
 




Re: [PERFORM] Query optimization using order by and limit

2011-09-22 Thread Michael Viscuso
Stephen,

Yes, I couldn't agree more.  The next two things I will be looking at very
carefully are the timestamps and indexes.  I will reply to this post if
either dramatically helps.
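
To make those concrete, here is a sketch of what I have in mind (the index
name is made up, and the timestamp conversion assumes firstloadtime is a
FILETIME-style count of 100-nanosecond intervals since 1601-01-01 UTC):

-- Hypothetical composite index, one per daily partition, so the per-host,
-- newest-first probe can be a short backward index scan:
CREATE INDEX idx_osmoduleloads_2011_09_14_host_firstload
    ON osmoduleloads_2011_09_14 (host_guid, firstloadtime DESC);

-- Converting a FILETIME-style value to a native timestamp; 11644473600 is
-- the number of seconds between the Windows epoch (1601-01-01) and the
-- Unix epoch (1970-01-01):
SELECT to_timestamp((firstloadtime / 10000000 - 11644473600)::double precision)
  FROM osmoduleloads_2011_09_14
 LIMIT 5;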

Thanks again for all your help.  My eyes were starting to bleed from staring
at explain logs!

Mike

On Thu, Sep 22, 2011 at 7:14 PM, Stephen Frost  wrote:

> Mike,
>
> * Michael Viscuso (michael.visc...@getcarbonblack.com) wrote:
> > I spent the better part of the day implementing an application-layer
> > nested loop and it seems to be working well.  Of course it's a little
> > slower than a Postgres-only solution, because it has to pass data back
> > and forth for each daily table query until it reaches the limit, but at
> > least I don't have "runaway" queries like I was seeing before.  That
> > should be a pretty good stopgap solution for the time being.
>
> Glad to hear that you were able to get something going which worked for
> you.
>
> > I was really hoping there was a Postgres exclusive answer though! :)  If
> > there are any other suggestions, it's a simple flag in my application to
> > query the other way again...
>
> I continue to wonder if some combination of multi-column indexes might
> have made the task of finding the 'lowest' record from each of the
> tables fast enough that it wouldn't be an issue.
>
> > Thanks for all your help - and I'm still looking to change those
> > numerics to bigints, just haven't figured out the best way yet.
>
> Our timestamps are also implemented using 64-bit integers and would allow
> you to use all the PG date/time functions and operators.  Just a
> thought.
>
>Thanks,
>
>Stephen


Re: [PERFORM] Slow query when using ORDER BY *and* LIMIT

2011-10-15 Thread Michael Viscuso
Dave,

Since I control the application that was performing the query, and since
I've separated my data into daily partitioned tables (which enforces my
order by clause at a macro level), I took Stephen's advice and implemented
the nested loop over each daily table within the application, rather than
having Postgres figure it out for me.  Sorry I don't have a better answer
for you.

Mike

On Thu, Oct 13, 2011 at 3:34 PM, davidsarmstrong wrote:

> I'm running into the same problem. I removed the limit and it was fine. I
> guess I could have removed the order by as well but it doesn't help if you
> really need both.
>
> Have you found any more information on this?
>
> Thanks!
>
> Dave (Armstrong)