Re: [postgis-users] Question on optimizing slow geospatial query

Trang Nguyen Wed, 11 Feb 2015 00:37:17 -0800

Hi Birgit,

Thanks very much.
I'm not familiar with the "with" clause but will look into that. There are also 
indices on startts and z.uuid ( time range condition is ok, since these are 
timestamps and not days).


I am also finding that a cause of slowness was due to the OR condition: 
ST_intersects(t.startloc, z.geom)
or ST_intersects(t.endloc, z.geom)

Splitting up the query  into two unions, one using a where on 
ST_intersects(t.startloc, z.geom) and the second on ST_intersects(t.endloc, 
z.geom), but your suggestion would get me closer to what I wanted in the final 
output.

Best,
Trang

From: postgis-users-boun...@lists.osgeo.org 
[mailto:postgis-users-boun...@lists.osgeo.org] On Behalf Of Birgit Laggner
Sent: Wednesday, February 11, 2015 12:20 AM
To: postgis-users@lists.osgeo.org
Subject: Re: [postgis-users] Question on optimizing slow geospatial query

Hi Trang,

I think, it could help to create btree indices on "startts" and "uuid", too, 
since you are using them in your where clause as a filter (a probably 
unnecessary question regarding your date filter: I would expect the result of 
"t.startts > '2015-01-16' and t.startts < '2015-01-17'" to be null because 
there are no days between the 16th and the 17th of january - but perhaps it was 
only an example...). And in general, my suggestion would be to reduce the use 
of st_intersects to the necessary mimimum.
You could use the with clause for the filtration of your input data and 
afterwards double join the two tables first on the startloc and second on the 
endloc for the assignment of the origin and the destination zone. Then group by 
origin and destination zones while counting your trips and you should have your 
end result.

Here is how I would imagine the query:

with
t as (select trip_id, startloc, endloc from od1.trip_v1 where startts between 
'minimum start date' and 'maximum start date'),
z as (select zone from od1.taz where uuid in ('kansas_303', 'kansas_601', 
'kansas_603', etc))
select  count(t.trip_id) as number_of_trips, orig.zone as orig_zone, dest.zone 
as dest_zone from t left join z as orig on st_intersects(t.startloc, z.geom) 
left join z as dest on st_intersects(t.endloc, z.geom) group by orig.zone, 
dest.zone;

However, I am not sure about how the gist indices work together with the 
subselects of the with clause...

Hope this helps,

Birgit.


Am 11.02.2015 um 08:19 schrieb Trang Nguyen:
Hi,

I am a newbie to Postgres/PostGIS and have a long running query that I would 
like to optimize.
There are two tables (trip and zone) that I am joining in the query, one which 
has "startloc" and "endloc" columns with type Geometry(Point) and other which 
contains a Geometry(MultiPolygon). There are GIST indexes on all above columns:

CREATE TABLE od1.trip_v1
(
  pkey bigint NOT NULL,
  trip_id character varying,
  startts timestamp without time zone,
  endts timestamp without time zone,
  startloc geometry(Point),
  endloc geometry(Point),
  ...
)

CREATE TABLE od1.taz
(
  uuid character varying NOT NULL,
  zone character varying NOT NULL,
  createdts timestamp without time zone NOT NULL,
  updatedts timestamp without time zone NOT NULL,
  geom geometry(MultiPolygon) NOT NULL,
  CONSTRAINT taz_pkey PRIMARY KEY (uuid)
)

I'm interested in building a matrix that, for a given set of input zones, 
returns trips along with their start and end zones. Output looks like:

10 trips that start at Zone A, ends at Zone B
2 trips that start at Zone A, ends at Zone C
9 trips that start at Zone A, ends at other
13 trips that start at Zone C, ends at Zone D

Since I am dealing with a large dataset (> 24 million records and growing), I 
was planning on writing a query that returns the trips grouped by each zone 
along with match condition (start, end or both) and doing aggregation on the 
client layer. I'm not sure whether this is the best approach but I expect that 
otherwise, I would end up having to write a very complex query to handle that 
type of aggregation.

Even so, the current query is very slow with very high cost:

SELECT t.trip_id,
case
                when ST_intersects(t.startloc, z.geom) and 
ST_intersects(t.endloc, z.geom) then 'orig-dest'
                when ST_intersects(t.startloc, z.geom) then 'orig'
                when ST_intersects(t.endloc, z.geom) then 'dest'
                else 'none'
end  as match_cond,
z.zone from od1.trip_v1               t, od1.taz z
where t.startts > '2015-01-16' and t.startts < '2015-01-17'
and z.uuid in ('kansas_303', 'kansas_601', 'kansas_603', etc)
and ST_intersects(t.startloc, z.geom)
or ST_intersects(t.endloc, z.geom)
group by z.zone, t.trip_id, match_cond;

Explain plan:
"Group  (cost=231446695055.73..245971533247.59 rows=14240037443 width=3498)"
"  ->  Sort  (cost=231446695055.73..231482295149.34 rows=14240037443 
width=3498)"
"        Sort Key: z.zone, t.trip_id, (CASE WHEN ((t.startloc && z.geom) AND 
_st_intersects(t.startloc, z.geom) AND (t.endloc && z.geom) AND 
_st_intersects(t.endloc, z.geom)) THEN 'orig-dest'::text WHEN ((t.startloc && 
z.geom) AND _st_intersects(t.startloc, (...)"
"        ->  Nested Loop  (cost=91.70..14401634128.24 rows=14240037443 
width=3498)"
"              ->  Seq Scan on taz z  (cost=0.00..739.19 rows=4619 width=3406)"
"              ->  Bitmap Heap Scan on trip_v1 t  (cost=91.70..4151.26 rows=453 
width=107)"
"                    Recheck Cond: ((startloc && z.geom) OR (endloc && z.geom))"
"                    Filter: (((startts > '2015-01-16 00:00:00'::timestamp 
without time zone) AND (startts < '2015-01-17 00:00:00'::timestamp without time 
zone) AND ((z.uuid)::text = ANY 
('{kansas_303,kansas_601,kansas_603,kansas_604,kansas_10,kansas_11,kan (...)"
"                    ->  BitmapOr  (cost=91.70..91.70 rows=2706 width=0)"
"                          ->  Bitmap Index Scan on idx_trip_v1_startloc  
(cost=0.00..45.74 rows=1353 width=0)"
"                                Index Cond: (startloc && z.geom)"
"                          ->  Bitmap Index Scan on idx_trip_v1_endloc  
(cost=0.00..45.74 rows=1353 width=0)"
"                                Index Cond: (endloc && z.geom)"

Some help or suggestions on how to speed up the query would be much appreciated.
Also, I currently don't have a specific map projection defined on the geom 
columns so they are using the default of 0 in postgis. The points in both trip 
and zone geometries are lon/lat. Is this an issue for the ST_intersect?

Thanks,
Trang





_______________________________________________

postgis-users mailing list

postgis-users@lists.osgeo.org<mailto:postgis-users@lists.osgeo.org>

http://lists.osgeo.org/cgi-bin/mailman/listinfo/postgis-users

_______________________________________________
postgis-users mailing list
postgis-users@lists.osgeo.org
http://lists.osgeo.org/cgi-bin/mailman/listinfo/postgis-users

Re: [postgis-users] Question on optimizing slow geospatial query

Reply via email to