[PERFORM] Query optimization using order by and limit
First of all, thank you for taking the time to review my question. After attending the PostgresOpen conference in Chicago last week, I've been poring over explain logs for hours on end, and although my system is MUCH better, I still can't resolve a few issues. Luckily my data is pretty well structured, so solving one issue will likely solve many more; I'll start with this one.

Version: PostgreSQL 9.1rc1, compiled by Visual C++ build 1500, 64-bit
OS: Windows 7 64-bit
ORM: SQLAlchemy

Postgres table structure: I have daily partitioned tables for each of 4 "core tables" (the tables with the majority of my application's data). Each daily table inherits from its parent. I do not explicitly define a REFERENCE between these tables because I cannot guarantee the order in which the events are inserted into the database, but where there are references, the referenced row should exist in the other's daily table. The reason I partitioned the data in this manner is to increase query speed and make it easy to archive old data. (I'm new to high-end Postgres performance, so there are likely several fundamental flaws in my assumptions. I won't turn down any recommendation.) An example of a daily partitioned table follows:

cb=# \d osmoduleloads_2011_09_14;
                 Table "public.osmoduleloads_2011_09_14"
        Column         |            Type             |       Modifiers
-----------------------+-----------------------------+------------------------
 guid                  | numeric(20,0)               | not null
 osprocess_guid        | numeric(20,0)               | not null
 filepath_guid         | numeric(20,0)               | not null
 firstloadtime         | numeric(20,0)               | not null
 md5hash               | bytea                       | not null
 host_guid             | numeric(20,0)               | default NULL::numeric
 process_create_time   | numeric(20,0)               | default NULL::numeric
 process_filepath_guid | numeric(20,0)               | default NULL::numeric
 event_time            | timestamp without time zone | default '2011-09-14 00:00:00'::timestamp without time zone
Indexes:
    "osmoduleloads_2011_09_14_pkey" PRIMARY KEY, btree (guid)
    "idx_osmoduleloads_2011_09_14_filepath_guid" btree (filepath_guid)
    "idx_osmoduleloads_2011_09_14_firstload_time" btree (firstloadtime)
    "idx_osmoduleloads_2011_09_14_host_guid" btree (host_guid)
    "idx_osmoduleloads_2011_09_14_md5hash" btree (md5hash)
    "idx_osmoduleloads_2011_09_14_osprocess_guid" btree (osprocess_guid)
Check constraints:
    "osmoduleloads_2011_09_14_event_time_check" CHECK (event_time = '2011-09-14 00:00:00'::timestamp without time zone)
    "osmoduleloads_2011_09_14_firstloadtime_check" CHECK (firstloadtime >= 1296044640::bigint::numeric AND firstloadtime < 1296053280::bigint::numeric)
Inherits: osmoduleloads

Objective: The firstloadtime check constraint ensures that the record is applicable to that daily table. (In case you were wondering, the large numerics correspond to Windows 100-nanosecond intervals since the epoch.) I'm inserting millions of records into each daily table, so "query slowness" is quite easy to spot. Given that there is so much data per daily table, I was hoping to use the ORDER BY and LIMIT clauses to "stop out" a query once it satisfied the LIMIT clause and not be forced to visit each daily table. However, I'm spending far more time in the older tables than I'd like - which leads me to believe that I'm doing something wrong.

For ease of viewing, my explain analyze can be found at http://explain.depesz.com/s/tot. I'm still very new to this, so I'm not sure if explain.depesz.com saves the original query.
It wasn't readily apparent that it did, so here is the original query:

SELECT osm_1.*, storefiles_1.*, filepaths_1.*, filepaths_2.*
  FROM (SELECT *
          FROM osmoduleloads
          JOIN hosts ON hosts.guid = osmoduleloads.host_guid
         WHERE hosts.guid = '2007075705813916178'
           AND osmoduleloads.firstloadtime >= 1296043200
           AND osmoduleloads.firstloadtime < 12961036799000
           AND hosts.enabled = true
           AND hosts.user_id = 111
         ORDER BY osmoduleloads.firstloadtime DESC
         LIMIT 251) AS osm_1
  LEFT OUTER JOIN storefiles AS storefiles_1
         ON osm_1.md5hash = storefiles_1.md5hash
  LEFT OUTER JOIN filepaths AS filepaths_1
         ON osm_1.process_filepath_guid = filepaths_1.guid
        AND osm_1.event_time = filepaths_1.event_time
  LEFT OUTER JOIN filepaths AS filepaths_2
         ON osm_1.filepath_guid = filepaths_2.guid
        AND osm_1.event_time = filepaths_2.event_time
 ORDER BY osm_1.firstloadtime DESC;

Hopefully my assumptions about ORDER BY and LIMIT are correct and this query can be optimized. Again, I appreciate any help you can lend. Thanks in advance.

Mike
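P.S. In case it helps anyone reproduce the layout: each daily partition is created roughly along the lines of the sketch below (this is a reconstruction from the \d output above, with the insert-routing logic omitted):

CREATE TABLE osmoduleloads_2011_09_14 (
    CHECK (event_time = TIMESTAMP '2011-09-14 00:00:00'),
    CHECK (firstloadtime >= 1296044640::bigint::numeric
       AND firstloadtime <  1296053280::bigint::numeric)
) INHERITS (osmoduleloads);

ALTER TABLE osmoduleloads_2011_09_14 ADD PRIMARY KEY (guid);

-- each partition gets its own copies of the single-column indexes, e.g.:
CREATE INDEX idx_osmoduleloads_2011_09_14_host_guid
    ON osmoduleloads_2011_09_14 (host_guid);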
Re: [PERFORM] Query optimization using order by and limit
Thanks guys,

First of all, I should have included my postgresql.conf file with the original submission. Sorry about that. It is now attached. Based on a recommendation, I also should have shown the parent/child relationship between osmoduleloads and its daily partitioned tables. To reduce clutter, it is at the end of this message.

Taking this one step at a time, and taking Greg's second suggestion first, issuing:

select * from osmoduleloads
 WHERE osmoduleloads.firstloadtime >= 1296043200
   AND osmoduleloads.firstloadtime < 12961036799000;

appears to only query the appropriate daily tables (2011_09_13 through 2011_09_20 - http://explain.depesz.com/s/QCG). So it appears that constraint_exclusion is working properly.

Putting a limit on the query like:

select * from osmoduleloads
 WHERE osmoduleloads.firstloadtime >= 1296043200
   AND osmoduleloads.firstloadtime < 12961036799000
 limit 251;

has the result that I'd expect to see: http://explain.depesz.com/s/O7fZ.

Ordering by firstloadtime AND limiting like:

select * from osmoduleloads
 WHERE osmoduleloads.firstloadtime >= 1296043200
   AND osmoduleloads.firstloadtime < 12961036799000
 order by firstloadtime desc limit 251;

also has the result that I'd expect to see: http://explain.depesz.com/s/RDh.

Adding the hosts join condition to the mix was still OK: http://explain.depesz.com/s/2Ns. Adding the hosts.enabled condition was still OK: http://explain.depesz.com/s/UYN. Adding hosts.user_id = 111 started the descent, but it appears to still be obeying the constraint_exclusion that I'd expect, just with a ton of rows returned from the most recent daily tables: http://explain.depesz.com/s/4WE. Adding the final condition hosts.guid = '2007075705813916178' is what ultimately kills it: http://explain.depesz.com/s/8zy. By adding the host_guid, it spends considerably more time in the older tables than without this condition, and I'm not sure why.

Thanks Greg for the recommendation to step through it like that - hopefully this helps get us closer to a resolution.

Greg/Tom, you are correct: these columns should be modified to whatever is easiest for Postgres to recognize as 64-bit unsigned integers. Would you still recommend bigint for unsigned integers? I likely read the wrong documentation that suggested bigint for signed 64-bit integers and numeric(20) for unsigned 64-bit integers.

Thanks again for all your help! Perhaps 15 hours of poring over explain logs will finally pan out!
Mike

cb=# \d+ osmoduleloads;
                              Table "public.osmoduleloads"
        Column         |            Type             |       Modifiers       | Storage  | Description
-----------------------+-----------------------------+-----------------------+----------+-------------
 guid                  | numeric(20,0)               | not null              | main     |
 osprocess_guid        | numeric(20,0)               | not null              | main     |
 filepath_guid         | numeric(20,0)               | not null              | main     |
 firstloadtime         | numeric(20,0)               | not null              | main     |
 md5hash               | bytea                       | not null              | extended |
 host_guid             | numeric(20,0)               | default NULL::numeric | main     |
 process_create_time   | numeric(20,0)               | default NULL::numeric | main     |
 process_filepath_guid | numeric(20,0)               | default NULL::numeric | main     |
 event_time            | timestamp without time zone |                       | plain    |
Indexes:
    "osmoduleloads_pkey" PRIMARY KEY, btree (guid)
Child tables: osmoduleloads_2001_12_31, osmoduleloads_2010_10_11,
              osmoduleloads_2010_10_12, osmoduleloads_2010_10_13,
              osmoduleloads_2011_07_27, osmoduleloads_2011_08_04,
              osmoduleloads_2011_08_05, osmoduleloads_2011_08_06,
              osmoduleloads_2011_08_07, osmoduleloads_2011_08_08,
              osmoduleloads_2011_08_09, osmoduleloads_2011_08_10,
              osmoduleloads_2011_08_11, osmoduleloads_2011_08_12,
              osmoduleloads_2011_08_13, osmoduleloads_2011_08_14,
              osmoduleloads_2011_08_15, osmoduleloads_2011_08_16,
              osmoduleloads_2011_08_17, osmoduleloads_2011_08_18,
              osmoduleloads_2011_08_19, osmoduleloads_2011_08_20,
              osmoduleloads_2011_08_21, osmoduleloads_2011_08_22,
              osmoduleloads_2011_08_23, osmoduleloads_2011_08_24,
              osmoduleloads_2011_08_25, osmoduleloads_2011_08_26,
              osmoduleloads_2011_08_27, osmoduleloads_2011_08_28,
              osmoduleloads_2011_08_29, osmoduleloads_2011_08_30,
              osmoduleloads_2011_08_3
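(For completeness, the planner setting that drives the partition pruning seen in the explains above is constraint_exclusion; a quick sanity check looks like this - 'partition' has been the default since 8.4 and covers inheritance children:)

SHOW constraint_exclusion;
-- expect 'partition' (the default) or 'on'

SET constraint_exclusion = partition;
-- enables CHECK-constraint-based pruning for inheritance-partitioned tables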
Re: [PERFORM] Query optimization using order by and limit
Thanks Ken,

I'm discussing with my coworker how to best make that change *as we speak*. Do you think this will also resolve the original issue I'm seeing, where the query doesn't "limit out properly" and spends time in child tables that won't yield any results? I was hoping that by using the check constraints, I could query over a week's or month's worth of partitioned tables and the combination of ORDER BY and LIMIT would eliminate any time spent searching unnecessary tables, but that doesn't appear to be true. (I'm still very new to high-end Postgres performance, so I could be mistaken.) Regardless, in the meantime, I'll switch those columns to bigint instead of numeric and have an update as soon as possible.

Thanks for your help!

Mike

On 9/22/2011 9:41 AM, k...@rice.edu wrote:
> On Wed, Sep 21, 2011 at 11:22:53PM -0400, Tom Lane wrote:
>> Michael Viscuso writes:
>>> Greg/Tom, you are correct, these columns should be modified to whatever
>>> is easiest for Postgres to recognize 64-bit unsigned integers. Would
>>> you still recommend bigint for unsigned integers? I likely read the
>>> wrong documentation that suggested bigint for signed 64-bit integers and
>>> numeric(20) for unsigned 64-bit integers.
>>
>> Unsigned? Oh, hm, that's a bit of a problem because we don't have any
>> unsigned types. If you really need to go to 2^64 and not 2^63 then
>> you're stuck with numeric ... but that last bit is costing ya a lot.
>>
>> regards, tom lane
>
> Hi Michael,
>
> If you have access to the application, you can map the unsigned 64-bits
> to the PostgreSQL signed 64-bit type with a simple subtraction. That will
> allow you to drop all the numeric use. Also if the guid is a 64-bit
> value stuffed into a numeric(20), you can do it there as well. I achieved
> a hefty performance boost by making those application level changes in a
> similar situation.
>
> Regards,
> Ken
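(To make Ken's subtraction trick concrete for the archives - this is my sketch of it, not his code: an unsigned 64-bit value u in [0, 2^64) maps to u - 2^63, which always fits in a signed bigint and preserves ordering, so comparisons and ORDER BY keep working. For example:)

-- max unsigned (2^64 - 1) maps to max bigint; zero maps to min bigint:
SELECT (18446744073709551615::numeric - 9223372036854775808::numeric)::bigint;
--  => 9223372036854775807
SELECT (0::numeric - 9223372036854775808::numeric)::bigint;
--  => -9223372036854775808

-- a one-time column conversion along the same lines might look like this
-- (untested; the partitions' CHECK constraints and every literal the
-- application sends would need the same offset applied):
ALTER TABLE osmoduleloads
    ALTER COLUMN firstloadtime TYPE bigint
    USING (firstloadtime - 9223372036854775808::numeric)::bigint;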
Re: [PERFORM] Query optimization using order by and limit
Stephen,

I spent the better part of the day implementing an application-layer nested loop and it seems to be working well. Of course it's a little slower than a Postgres-only solution, because it has to pass data back and forth for each daily table query until it reaches the limit, but at least I don't have "runaway" queries like I was seeing before. That should be a pretty good stopgap solution for the time being.

I was really hoping there was a Postgres-exclusive answer though! :) If there are any other suggestions, it's a simple flag in my application to query the other way again...

Thanks for all your help - and I'm still looking to change those numerics to bigints, just haven't figured out the best way yet.

Mike

On 9/22/2011 10:53 AM, Stephen Frost wrote:
> * Michael Viscuso (michael.visc...@getcarbonblack.com) wrote:
>> Adding the final condition hosts.guid = '2007075705813916178' is what
>> ultimately kills it http://explain.depesz.com/s/8zy. By adding the
>> host_guid, it spends considerably more time in the older tables than
>> without this condition and I'm not sure why.
>
> What I think is happening here is that PG is pushing down that filter
> (not typically a bad thing..), but with that condition, it's going to
> scan the index until it finds a match for that filter before returning
> back up, only to have that result cut out due to the limit. Having it as
> numerics isn't helping here, but the bigger issue is having to check all
> those tuples for a match to the filter.
>
> Mike, the filter has to be applied before the order by/limit, since
> those clauses come after the filter has been applied (you wouldn't want
> a 'where x = 2 limit 10' to return early just because it found 10
> records where x didn't equal 2).
>
> What would be great is if PG would realize that the CHECK constraints
> prevent earlier records from being in these earlier tables, so it
> shouldn't need to consider them at all once the records from the
> 'latest' table have been found and the limit reached (reverse all this
> for an 'ascending' query, of course), which we can do when there's no
> order by. I don't believe we have that kind of logic or that
> information available at this late stage - the CHECK constraints are used
> to eliminate the impossible-to-match tables, but that's it.
>
> One option, which isn't great of course, would be to implement your own
> 'nested loop' construct (something I typically despise..) in the
> application, which just walks backwards from the latest and pulls
> whatever records it can from each day and then stops once it hits the
> limit.
>
> Thanks,
>
> Stephen
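(For the archives, the application-layer loop described above boils down to issuing one query per daily partition, newest first, and stopping once 251 rows have accumulated - a sketch of the per-partition statement, using the table and constant names from earlier in the thread:)

SELECT osm.*
  FROM osmoduleloads_2011_09_20 AS osm
  JOIN hosts ON hosts.guid = osm.host_guid
 WHERE hosts.guid = '2007075705813916178'
   AND hosts.enabled = true
   AND hosts.user_id = 111
 ORDER BY osm.firstloadtime DESC
 LIMIT 251;

-- if fewer than 251 rows come back, the application runs the same
-- statement against osmoduleloads_2011_09_19, then _2011_09_18, and
-- so on, appending results until 251 rows are collected or the date
-- range is exhausted.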
Re: [PERFORM] Query optimization using order by and limit
Stephen,

Yes, I couldn't agree more. The next two things I will be looking at very carefully are the timestamps and indexes. I will reply to this post if either dramatically helps. Thanks again for all your help. My eyes were starting to bleed from staring at explain logs!

Mike

On Thu, Sep 22, 2011 at 7:14 PM, Stephen Frost wrote:
> Mike,
>
> * Michael Viscuso (michael.visc...@getcarbonblack.com) wrote:
>> I spent the better part of the day implementing an application layer
>> nested loop and it seems to be working well. Of course it's a little
>> slower than a Postgres only solution because it has to pass data back
>> and forth for each daily table query until it reaches the limit, but at
>> least I don't have "runaway" queries like I was seeing before. That
>> should be a pretty good stopgap solution for the time being.
>
> Glad to hear that you were able to get something going which worked for
> you.
>
>> I was really hoping there was a Postgres exclusive answer though! :) If
>> there are any other suggestions, it's a simple flag in my application to
>> query the other way again...
>
> I continue to wonder if some combination of multi-column indexes might
> have made the task of finding the 'lowest' record from each of the
> tables fast enough that it wouldn't be an issue.
>
>> Thanks for all your help - and I'm still looking to change those
>> numerics to bigints, just haven't figured out the best way yet.
>
> Our timestamps are also implemented using 64-bit integers and would allow
> you to use all the PG date/time functions and operators. Just a
> thought.
>
> Thanks,
>
> Stephen
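(Sketching the two follow-ups mentioned above so the thread has something concrete - both untested, and the index name is mine:)

-- a composite per-partition index: with an equality match on host_guid,
-- rows come back already in firstloadtime DESC order, so a LIMIT-ed
-- query for one host can stop after reading only a few index entries
CREATE INDEX idx_osmoduleloads_2011_09_14_host_ftime
    ON osmoduleloads_2011_09_14 (host_guid, firstloadtime DESC);

-- and viewing firstloadtime (Windows FILETIME: 100 ns ticks since
-- 1601-01-01) as a native timestamp; 116444736000000000 is the tick
-- count at the Unix epoch:
SELECT to_timestamp((firstloadtime - 116444736000000000) / 10000000.0)
  FROM osmoduleloads_2011_09_14
 LIMIT 1;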
Re: [PERFORM] Slow query when using ORDER BY *and* LIMIT
Dave,

Since I control the application that was performing the query, and I've separated my data into daily partitioned tables (which enforces my ORDER BY clause at a macro level), I took Stephen's advice and implemented the nested loop over each daily table from within the application, versus having Postgres figure it out for me. Sorry I don't have a better answer for you.

Mike

On Thu, Oct 13, 2011 at 3:34 PM, davidsarmstrong wrote:
> I'm running into the same problem. I removed the limit and it was fine. I
> guess I could have removed the order by as well, but it doesn't help if you
> really need both.
>
> Have you found any more information on this?
>
> Thanks!
>
> Dave (Armstrong)