Varying the segment length upwards might have a salutary effect for a while, as the efficiency improvement of fewer inner loops battles with the inefficiency of having more points selected by the index filter. Worth an experiment.
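A minimal sketch of that experiment, assuming the same data table and route as in the query below, with only the ST_Segmentize() length changed (2400 here is an arbitrary trial value; the 600 m ST_DWithin tolerance stays fixed):

WITH dump AS (
    SELECT ST_DumpPoints(
               ST_Segmentize(
                   ST_GeographyFromText('SRID=4326;LINESTRING(-150.008056 61.179167,-156.77 71.285833)'),
                   2400  -- was 600; also try 1200, 4800, ...
               )::geometry
           ) AS pt
),
pts AS (
    SELECT (pt).geom, (pt).path[1] AS vert FROM dump
)
SELECT elevation
FROM data
INNER JOIN (SELECT ST_MakeLine(ARRAY[a.geom, b.geom]) AS short_line
            FROM pts a
            INNER JOIN pts b ON a.vert = b.vert - 1 AND b.vert > 1) segments
    ON ST_DWithin(location, segments.short_line, 600)
ORDER BY elevation DESC LIMIT 1;

Timing a few such lengths should show where the crossover between fewer inner loops and fatter index probes lands.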
P

On Thu, Jan 5, 2017 at 1:00 PM, Israel Brewster <isr...@ravnalaska.net> wrote:
>
> On Jan 5, 2017, at 10:38 AM, Paul Ramsey <pram...@cleverelephant.ca> wrote:
>
> Yes, you did. You want a query that spits out a tupleset of geometries
> (one for each wee segment), and then you can join that set to your main
> table using st_dwithin() as the join clause.
> So start by ditching the main table and just work on a query that
> generates a pile of wee segments.
>
>
> Ahhh, I see you've done this sort of thing before
> (http://blog.cleverelephant.ca/2015/02/breaking-linestring-into-segments.html)
> :-)
>
> So following that advice I came up with the following query:
>
> WITH dump AS (SELECT
>     ST_DumpPoints(
>         ST_Segmentize(
>             ST_GeographyFromText('SRID=4326;LINESTRING(-150.008056 61.179167,-156.77 71.285833)'),
>             600
>         )::geometry
>     ) AS pt
> ),
> pts AS (
>     SELECT (pt).geom, (pt).path[1] AS vert FROM dump
> )
> SELECT elevation
> FROM data
> INNER JOIN (SELECT
>         ST_MakeLine(ARRAY[a.geom, b.geom]) AS short_line
>     FROM pts a
>     INNER JOIN pts b
>         ON a.vert = b.vert - 1 AND b.vert > 1) segments
>     ON ST_DWithin(location, segments.short_line, 600)
> ORDER BY elevation DESC LIMIT 1;
>
> Which yields the following EXPLAIN ANALYZE (https://explain.depesz.com/s/RsTD):
>
>                                    QUERY PLAN
> ------------------------------------------------------------------------------
>  Limit  (cost=11611706.90..11611706.91 rows=1 width=4) (actual time=1171.814..1171.814 rows=1 loops=1)
>    CTE dump
>      ->  Result  (cost=0.00..5.25 rows=1000 width=32) (actual time=0.024..1.989 rows=1939 loops=1)
>    CTE pts
>      ->  CTE Scan on dump  (cost=0.00..20.00 rows=1000 width=36) (actual time=0.032..4.071 rows=1939 loops=1)
>    ->  Sort  (cost=11611681.65..11611768.65 rows=34800 width=4) (actual time=1171.813..1171.813 rows=1 loops=1)
>          Sort Key: data.elevation DESC
>          Sort Method: top-N heapsort  Memory: 25kB
>          ->  Nested Loop  (cost=0.55..11611507.65 rows=34800 width=4) (actual time=0.590..1167.615 rows=28408 loops=1)
>                ->  Nested Loop  (cost=0.00..8357.50 rows=1665 width=64) (actual time=0.046..663.475 rows=1938 loops=1)
>                      Join Filter: (a.vert = (b.vert - 1))
>                      Rows Removed by Join Filter: 3755844
>                      ->  CTE Scan on pts b  (cost=0.00..22.50 rows=333 width=36) (actual time=0.042..0.433 rows=1938 loops=1)
>                            Filter: (vert > 1)
>                            Rows Removed by Filter: 1
>                      ->  CTE Scan on pts a  (cost=0.00..20.00 rows=1000 width=36) (actual time=0.000..0.149 rows=1939 loops=1938)
>                ->  Index Scan using location_gix on data  (cost=0.55..6968.85 rows=1 width=36) (actual time=0.085..0.256 rows=15 loops=1938)
>                      Index Cond: (location && _st_expand((st_makeline(ARRAY[a.geom, b.geom]))::geography, '600'::double precision))
>                      Filter: (((st_makeline(ARRAY[a.geom, b.geom]))::geography && _st_expand(location, '600'::double precision)) AND _st_dwithin(location, (st_makeline(ARRAY[a.geom, b.geom]))::geography, '600'::double precision, true))
>                      Rows Removed by Filter: 7
>  Planning time: 4.318 ms
>  Execution time: 1171.994 ms
> (22 rows)
>
> So not bad. Went from 20+ seconds to a little over 1 second. Still
> noticeable for an end user, but definitely usable - and as mentioned,
> that's a worst-case-scenario query. Thanks!
>
> Of course, if you have any suggestions for further improvement, I'm all
> ears :-)
> -----------------------------------------------
> Israel Brewster
> Systems Analyst II
> Ravn Alaska
> 5245 Airport Industrial Rd
> Fairbanks, AK 99709
> (907) 450-7293
> -----------------------------------------------
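One further refinement worth trying (a sketch, not something from the thread): the EXPLAIN above shows the self-join on vert discarding 3,755,844 rows in its join filter and accounting for roughly 0.66 s of the 1.17 s total. A window function can pair each vertex with its successor in a single pass instead:

WITH dump AS (
    SELECT ST_DumpPoints(
               ST_Segmentize(
                   ST_GeographyFromText('SRID=4326;LINESTRING(-150.008056 61.179167,-156.77 71.285833)'),
                   600
               )::geometry
           ) AS pt
),
segments AS (
    -- LEAD() fetches the next vertex in path order; the last vertex has
    -- no successor, so its short_line is NULL and gets filtered out below.
    SELECT ST_MakeLine((pt).geom,
                       LEAD((pt).geom) OVER (ORDER BY (pt).path[1])) AS short_line
    FROM dump
)
SELECT elevation
FROM data
INNER JOIN segments
    ON ST_DWithin(location, segments.short_line::geography, 600)
WHERE segments.short_line IS NOT NULL
ORDER BY elevation DESC LIMIT 1;

Whether this actually beats the nested-loop self-join would need testing against the real table.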
>
> On Thu, Jan 5, 2017 at 11:36 AM, Israel Brewster <isr...@ravnalaska.net> wrote:
>
>> On Jan 5, 2017, at 8:50 AM, Paul Ramsey <pram...@cleverelephant.ca> wrote:
>>
>> The index filters using bounding boxes. A long, diagonal route will have
>> a large bounding box, relative to the area you actually care about
>> (within a narrow strip of the route). Use ST_Segmentize() to add points
>> to your route, ST_DumpPoints() to dump those out as points, and
>> ST_MakeLine to generate new lines from those points, each line very
>> short. The maximum index effectiveness will come when your line length
>> is close to your buffer width.
>>
>> P
>>
>> Ok, I think I understand the concept. So attempting to follow your
>> advice, I modified the query to be:
>>
>> SELECT elevation
>> FROM data
>> WHERE
>>     ST_DWithin(
>>         location,
>>         (SELECT ST_MakeLine(geom)::geography AS split_line
>>          FROM (SELECT
>>                    (ST_DumpPoints(
>>                        ST_Segmentize(
>>                            ST_GeographyFromText('SRID=4326;LINESTRING(-150.008056 61.179167,-156.77 71.285833)'),
>>                            600
>>                        )::geometry
>>                    )).geom
>>               ) s1),
>>         600
>>     )
>> ORDER BY elevation DESC LIMIT 1;
>>
>> It took some fiddling to find a syntax that PostgreSQL would accept, but
>> eventually that's what I came up with. Unfortunately, far from improving
>> performance, it killed it - in running the query, it went from 22
>> seconds to several minutes (EXPLAIN ANALYZE has yet to return a result).
>> Looking at the query execution plan shows, at least partially, why:
>>
>>                                QUERY PLAN
>> ------------------------------------------------------------------------------
>>  Limit  (cost=17119748.98..17119748.98 rows=1 width=4)
>>    InitPlan 1 (returns $0)
>>      ->  Aggregate  (cost=17.76..17.77 rows=1 width=32)
>>            ->  Result  (cost=0.00..5.25 rows=1000 width=32)
>>    ->  Sort  (cost=17119731.21..17171983.43 rows=20900890 width=4)
>>          Sort Key: data.elevation DESC
>>          ->  Seq Scan on data  (cost=0.00..17015226.76 rows=20900890 width=4)
>>                Filter: st_dwithin(location, $0, '600'::double precision)
>> (8 rows)
>>
>> So apparently it is now doing a sequential scan on data rather than
>> using the index. And, of course, sorting 20 million rows is not trivial
>> either. Did I do something wrong in forming the query?
>>
>> -----------------------------------------------
>> Israel Brewster
>> Systems Analyst II
>> Ravn Alaska
>> 5245 Airport Industrial Rd
>> Fairbanks, AK 99709
>> (907) 450-7293
>> -----------------------------------------------
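To put rough numbers on the bounding-box point above, here is an illustrative sketch (my own, not from the thread): it compares the area of the route's bounding box, which is approximately what the index filter tests, against the area of the 600 m corridor that is actually wanted:

SELECT ST_Area(ST_Envelope(line)::geography)  AS bbox_area_m2,
       ST_Area(ST_Buffer(line::geography, 600)) AS corridor_area_m2
FROM (SELECT ST_GeomFromText(
          'LINESTRING(-150.008056 61.179167,-156.77 71.285833)', 4326) AS line) AS t;

For a diagonal route on the order of 1,000 km, the box covers vastly more area than the corridor, which is why the original plan below passes roughly 4.9 million candidate rows through the index only to discard them in the exact-distance filter.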
>>
>> On Thu, Jan 5, 2017 at 9:45 AM, Israel Brewster <isr...@ravnalaska.net> wrote:
>>
>>> I have a database (PostgreSQL 9.6.1) containing 62,702,675 rows of
>>> latitude (numeric), longitude (numeric), elevation (integer) data,
>>> along with a PostGIS (2.3.0) geometry column (location), running on a
>>> CentOS 6.8 box with 64GB RAM and a RAID10 SSD data drive. I'm trying
>>> to get the maximum elevation along a path, for which purpose I've come
>>> up with the following query (for one particular path example):
>>>
>>> SELECT elevation FROM data
>>> WHERE ST_DWithin(location,
>>>     ST_GeographyFromText('SRID=4326;LINESTRING(-150.008056 61.179167,-156.77 71.285833)'),
>>>     600)
>>> ORDER BY elevation DESC LIMIT 1;
>>>
>>> The EXPLAIN ANALYZE output of this particular query
>>> (https://explain.depesz.com/s/heZ) shows:
>>>
>>>                                 QUERY PLAN
>>> --------------------------------------------------------------------------
>>>  Limit  (cost=4.83..4.83 rows=1 width=4) (actual time=22653.840..22653.842 rows=1 loops=1)
>>>    ->  Sort  (cost=4.83..4.83 rows=1 width=4) (actual time=22653.837..22653.837 rows=1 loops=1)
>>>          Sort Key: elevation DESC
>>>          Sort Method: top-N heapsort  Memory: 25kB
>>>          ->  Index Scan using location_gix on data  (cost=0.42..4.82 rows=1 width=4) (actual time=15.786..22652.041 rows=11081 loops=1)
>>>                Index Cond: (location && '0102000020E6100000020000002C11A8FE41C062C0DFC2BAF1EE964E40713D0AD7A39863C086C77E164BD25140'::geography)
>>>                Filter: (('0102000020E6100000020000002C11A8FE41C062C0DFC2BAF1EE964E40713D0AD7A39863C086C77E164BD25140'::geography && _st_expand(location, '600'::double precision)) AND _st_dwithin(location, '0102000020E6100000020000002C11A8FE41C062C0DFC2BAF1EE964E40713D0AD7A39863C086C77E164BD25140'::geography, '600'::double precision, true))
>>>                Rows Removed by Filter: 4934534
>>>  Planning time: 0.741 ms
>>>  Execution time: 22653.906 ms
>>> (10 rows)
>>>
>>> So it is using the index properly, but it still takes a good 22 seconds
>>> to run, most of which appears to be in the Index Scan.
>>>
>>> Is there any way to improve this, or is this going to be about as good
>>> as it gets with the number of rows being dealt with? I was planning to
>>> use this for a real-time display - punch in a couple of points, get
>>> some information about the route between them, including maximum
>>> elevation - but with it taking 22 seconds for the longer routes at
>>> least, that doesn't make for the best user experience.
>>>
>>> It's perhaps worth noting that the example above is most likely a
>>> worst-case scenario. I would expect the vast majority of routes to be
>>> significantly shorter, and I want to say the shorter routes query much
>>> faster [testing needed]. That said, the faster the better, even for
>>> short routes :-)
>>> -----------------------------------------------
>>> Israel Brewster
>>> Systems Analyst II
>>> Ravn Alaska
>>> 5245 Airport Industrial Rd
>>> Fairbanks, AK 99709
>>> (907) 450-7293
>>> -----------------------------------------------
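A closing note on setup: the thread never shows the definition of the location_gix index that both plans rely on. Presumably it is a GiST index along these lines (an assumption; only the name is taken from the plans):

-- Presumed definition of location_gix (not shown in the thread).
-- Works for a geometry or geography 'location' column; the && Index Cond
-- in the plans is the bounding-box operator this index serves.
CREATE INDEX location_gix ON data USING GIST (location);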