[
https://issues.apache.org/jira/browse/IMPALA-13851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Csaba Ringhofer updated IMPALA-13851:
-------------------------------------
Description:
Storing point data as double x/double y column pairs instead of a single
geometry (BINARY) column is common, and is currently the most efficient option.
For predicates that require these points to intersect a complex geometry, it
can be a useful optimization to prefilter the x/y columns directly with the
bounding rectangle of the complex geometry, in addition to evaluating the st_
predicate. An example:
{code}
WHERE st_contains(<const_geometry>, st_point(x,y))
{code}
can be rewritten as:
{code}
WHERE st_contains(<const_geometry>, st_point(x,y))
AND x>=st_xmin(<const_geometry>) AND y>=st_ymin(<const_geometry>)
AND x<=st_xmax(<const_geometry>) AND y<=st_ymax(<const_geometry>)
{code}
This has two benefits:
1. The planner will move the >=/<= predicates before st_contains(), which
avoids the expensive per-row st_contains() call for points that fail the
bounding box check.
2. The >=/<= predicates can be pushed down in more cases, for example to Parquet
or Iceberg min/max stat filtering. If the files/pages have limited bounding
boxes, this can save IO.
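The two benefits can be illustrated outside Impala with a small Python sketch.
It is not Impala code: contains() is a simple ray-casting test standing in for
st_contains() (its handling of points exactly on the boundary may differ),
each "file" is just a list of points, and the per-file min/max values stand in
for Parquet/Iceberg column stats. All names here are hypothetical.

```python
def bounding_box(points):
    """Return (xmin, ymin, xmax, ymax) of a list of (x, y) pairs."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


def contains(polygon, x, y):
    """Ray-casting point-in-polygon test; stand-in for st_contains()."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge cross the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def scan(files, polygon):
    """Filter points with file-level pruning and a row-level prefilter."""
    xmin, ymin, xmax, ymax = bounding_box(polygon)
    counters = {"files_read": 0, "contains_calls": 0}
    result = []
    for points in files:
        # Benefit 2: skip the whole file if its min/max stats cannot
        # overlap the bounding box of the constant geometry.
        fxmin, fymin, fxmax, fymax = bounding_box(points)
        if fxmax < xmin or fxmin > xmax or fymax < ymin or fymin > ymax:
            continue
        counters["files_read"] += 1
        for x, y in points:
            # Benefit 1: cheap comparisons run before the expensive test.
            if not (xmin <= x <= xmax and ymin <= y <= ymax):
                continue
            counters["contains_calls"] += 1
            if contains(polygon, x, y):
                result.append((x, y))
    return result, counters
```

With a triangle (0,0), (4,0), (0,4) and two files, one of which lies entirely
outside the bounding box, only one file is read and contains() runs for two of
the five points, while the result matches a scan without the prefilter.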
An expression rewrite rule can be added that does this automatically.
An issue that blocks this (and also makes manual rewrites harder) is
IMPALA-10349: constant folding currently only works for STRING/BINARY values
that contain only ASCII characters, and the result of a function like
st_polygon is very likely to contain non-ASCII characters, so
x>=st_xmin(<const_geometry>) can't be converted to x >= <const_double> at
compile time.
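Until constant folding handles this case, one possible manual workaround is to
compute the bounding box on the client and put double literals into the query
text. The Python sketch below assumes the client already knows the vertices of
the constant geometry and that the geometry has straight edges, so its bounding
box equals the bounding box of its vertices. bbox_predicate() is a hypothetical
helper, not part of Impala or any client library.

```python
def bbox_predicate(vertices, x_col="x", y_col="y"):
    """Build the bounding-box part of the WHERE clause as SQL text.

    vertices: list of (x, y) pairs of the constant geometry.
    Column names are assumptions and must be trusted identifiers.
    """
    xs = [float(x) for x, _ in vertices]
    ys = [float(y) for _, y in vertices]
    # repr() of a float gives the shortest string that round-trips exactly.
    return (
        f"{x_col} >= {min(xs)!r} AND {y_col} >= {min(ys)!r} "
        f"AND {x_col} <= {max(xs)!r} AND {y_col} <= {max(ys)!r}"
    )


# Append the literal predicate to the original st_contains() predicate.
triangle = [(0, 0), (4, 0), (0, 4)]
where_clause = (
    "WHERE st_contains(<const_geometry>, st_point(x,y)) AND "
    + bbox_predicate(triangle)
)
```

The <const_geometry> placeholder is left as in the example above; only the
comparison operands become literals.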
> Add geospatial expression rewrites for lat/lon coded points
> -----------------------------------------------------------
>
> Key: IMPALA-13851
> URL: https://issues.apache.org/jira/browse/IMPALA-13851
> Project: IMPALA
> Issue Type: Improvement
> Components: Frontend
> Reporter: Csaba Ringhofer
> Priority: Major
> Labels: geospatial