> The min/max stats are discussed in the doc (Phase 2), depending on the
> non-trivial encoding.

Just want to add that min/max stats filtering could be supported natively by
the file format. Adding a geometry type to the Parquet spec is under
discussion: https://github.com/apache/parquet-format/pull/240

Best,
Gang
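
For readers of the archive: the min/max stats filtering discussed in this
thread can be illustrated with a minimal sketch. It assumes each data file
carries x/y min/max bounds for its geometry column and that the table is
partitioned by a date key, as in Peter's example. The names BBox, FileStats,
and prune are hypothetical and are not part of Iceberg, Parquet, or the
proposal; the two checks are AND'ed together, as Szehon describes.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BBox:
    """Axis-aligned bounding box in planar coordinates (hypothetical type)."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def intersects(self, other: "BBox") -> bool:
        # Two boxes overlap unless one lies entirely to one side of the other.
        return not (
            self.xmax < other.xmin
            or other.xmax < self.xmin
            or self.ymax < other.ymin
            or other.ymax < self.ymin
        )


@dataclass(frozen=True)
class FileStats:
    """Hypothetical per-file metadata: x/y min/max bounds and a partition date."""
    path: str
    bounds: BBox
    date: str


def prune(files: List[FileStats], query: BBox, date: str) -> List[str]:
    # Keep a file only if BOTH evaluators pass: the partition filter on
    # date AND the min/max bounding-box check on the geo column.
    return [
        f.path
        for f in files
        if f.date == date and f.bounds.intersects(query)
    ]


files = [
    FileStats("a.parquet", BBox(0, 0, 10, 10), "2024-06-01"),
    FileStats("b.parquet", BBox(20, 20, 30, 30), "2024-06-01"),
    FileStats("c.parquet", BBox(5, 5, 15, 15), "2024-06-02"),
]
print(prune(files, BBox(8, 8, 12, 12), "2024-06-01"))  # ['a.parquet']
```

A file that survives pruning may still contain no matching rows; the bounds
only rule files out, so the engine still evaluates the exact predicate on
the rows it reads.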

On Thu, Jun 6, 2024 at 5:53 AM Szehon Ho <szehon.apa...@gmail.com> wrote:

> Hi Peter
>
> Yes, the document only concerns predicate pushdown on the geometry
> column.  Predicate pushdown takes two forms: 1) partition filter and 2)
> min/max stats.  The min/max stats are discussed in the doc (Phase 2),
> depending on the non-trivial encoding.
>
> The evaluators are always AND'ed together, so I don't see any reason why
> partitioning by another key would not work on a table with a geo column.
>
> On another note, Jia and I thought that we could discuss the Snowflake
> geo types on a call to drill down on some details.  What time zone are
> you folks in, and what time works best?  I think Jia and I are both in the
> Pacific time zone.
>
> Thanks
> Szehon
>
> On Wed, Jun 5, 2024 at 1:02 AM Peter Popov <peter.po...@snowflake.com>
> wrote:
>
>> Hi Szehon, hi Jia,
>>
>> Thank you for your replies. We now better understand the connection
>> between the metadata and partitioning in this proposal. Supporting
>> Mapping 1 is a great starting point, and we would like to work more closely
>> with you on bringing support for spherical edges and other coordinate
>> systems into Iceberg geometry.
>>
>> We have some follow-up questions regarding the partitioning (let us know
>> if it’s better to comment directly in the document): Does this proposal
>> imply that XZ2 partitioning is always required? In the current proposal,
>> do you see a possibility for predicate pushdown to rely on x/y min/max
>> column metadata instead of a partition key? We see use cases where a table
>> with a geo column is partitioned by a different key (e.g. date) or a
>> combination of keys. It would be great to support such use cases from the
>> very beginning.
>>
>> Thanks,
>>
>> Peter
>>
>> On Thu, May 30, 2024 at 8:07 AM Jia Yu <ji...@apache.org> wrote:
>>
>>> Hi Dmytro,
>>>
>>> Thanks for your email. To add to Szehon's answer,
>>>
>>> 1. How to represent Snowflake Geometry and Geography type in Iceberg,
>>> given the Geo Iceberg Phase 1 design:
>>>
>>> Answer:
>>> Mapping 1 (possible): Snowflake Geometry + SRID: 4326 -> Iceberg
>>> Geometry + CRS84 + edges: Planar
>>> Mapping 2 (impossible): Snowflake Geography -> Iceberg Geometry +
>>> CRS84 + edges: Spherical
>>> Mapping 3 (impossible): Snowflake Geometry + SRID:ABCDE -> Iceberg
>>> Geometry + SRID:ABCDE + edges: Planar
>>>
>>> As Szehon mentioned, only Mapping 1 is possible because we need to
>>> support spatial query push down in Iceberg. This function relies on the
>>> Iceberg partition transform, which requires a 1:1 mapping between a value
>>> (point/polygon/linestring) and a partition key. That is: given any
>>> precision level, a polygon must produce a single ID; and the covering
>>> indicated by this single ID must fully cover the extent of the polygon.
>>> Currently, only xz2 can satisfy this requirement. If the theory from
>>> Michael Entin can be proven to be correct, then we can support Mapping 2 in
>>> Phase 2 of Geo Iceberg.
>>>
>>> Regarding Mapping 3, this requires Iceberg to be able to understand SRID
>>> / PROJJSON such that we will know min max X Y of the CRS (@Szehon, maybe
>>> Iceberg can ask the engine to provide this information?). See my answer 2.
>>>
>>> 2. Why choose projjson instead of SRID?
>>>
>>> The projjson idea was borrowed from GeoParquet because we'd like to
>>> enable possible conversion between Geo Iceberg and GeoParquet. However, I
>>> do understand that this is not a good idea for Iceberg since not many libs
>>> can parse projjson.
>>>
>>> @Szehon Is there a way that we can support both SRID and PROJJSON in Geo
>>> Iceberg?
>>>
>>> It is also worth noting that, although there are many libs that can
>>> parse SRID and perform look-up in the EPSG database, the license of the
>>> EPSG database is NOT compatible with the Apache Software Foundation. That
>>> means: Iceberg still cannot parse / understand SRID.
>>>
>>> Thanks,
>>> Jia
>>>
>>> On Wed, May 29, 2024 at 11:08 AM Szehon Ho <szehon.apa...@gmail.com>
>>> wrote:
>>>
>>>> Hi Dmytro
>>>>
>>>> Thank you for looking through the proposal and excited to hear from you
>>>> guys!  I am not a 'geo expert' and I will definitely need to pull in Jia Yu
>>>> for some of these points.
>>>>
>>>> Although most calculations are done on the query engine, Iceberg
>>>> reference implementations (i.e., Java, Python) do have to support a few
>>>> calculations to handle filter push down:
>>>>
>>>>    1. push down of the proposed Geospatial transforms ST_COVERS,
>>>>    ST_COVERED_BY, and ST_INTERSECTS
>>>>    2. evaluation of the proposed Geospatial partition transform XZ2.  As
>>>>    you may have seen, this was chosen as it's the only standard one today
>>>>    that solves the 'boundary object' problem while still preserving a
>>>>    1-to-1 mapping of row => partition value.
>>>>
>>>> This is the primary rationale for choosing the values, as these were
>>>> implemented in the GeoLake and Havasu projects (Iceberg forks that sparked
>>>> the proposal) based on Geometry type (edge=planar, crs=OGC:CRS84/
>>>> SRID=4326).
>>>>
>>>>> 2. As you mentioned [2] in the proposal there are difficulties with
>>>>> supporting the full PROJJSON specification of the SRS. From our experience
>>>>> most of the use cases do not require the full definition of the SRS; in
>>>>> fact that definition is only needed when converting between coordinate
>>>>> systems. On the other hand, it’s often necessary to check whether two
>>>>> geometry columns have the same coordinate system, for example when joining
>>>>> two columns from different data providers.
>>>>>
>>>>> To address this we would like to propose including the option to
>>>>> specify the SRS with only an SRID in Phase 1. The query engine may choose
>>>>> to treat it as an opaque identifier or look it up in the EPSG database, if
>>>>> supported.
>>>>>
>>>>
>>>> The way to specify the CRS definition is actually taken from GeoParquet
>>>> [1]; I think we are not bound to follow it if there are better options.  I
>>>> feel we might need to at least list out supported configurations in the
>>>> spec, though.  There is some conversation on the doc about this [2].
>>>> Basically:
>>>>
>>>>    1. XZ2 assumes planar edges.  This is a feature of the algorithm,
>>>>    based on the original paper.  A possible solution to spherical edge is
>>>>    proposed by Michael Entin here: [3], please feel free to evaluate.
>>>>    2. XZ2 needs to know the coordinate range.  According to Jia's
>>>>    comments, this needs parsing of the CRS.  Can it be done with SRID
>>>>    alone?
>>>>
>>>>
>>>>> 1. The first version of the specification (Phase 1) is described as
>>>>> focused on the planar geometry model with the CRS fixed to 4326. In this
>>>>> model, Snowflake would not be able to map our Geography type, since it is
>>>>> based on the spherical Geography model. Given that Snowflake supports
>>>>> both edge types, we would like to better understand how to map them to
>>>>> the proposed Geometry type and its metadata.
>>>>>
>>>>>    - How is the edge type supposed to be interpreted by the query
>>>>>    engine? Is it necessary for the system to adhere to the edge model for
>>>>>    geospatial functions, or can it use the model that it supports or let
>>>>>    the customer choose it? Will it affect the bounding box or other row
>>>>>    group metadata?
>>>>>
>>>>>    - Is there any reason why the flexible model has to be postponed to
>>>>>    further iterations? Would it be more extensible to support a mutable
>>>>>    edge type from Phase 1, but allow systems to ignore it if they do not
>>>>>    support the spherical computation model?
>>>>>
>>>>>
>>>> It may be answered by the previous paragraph in regards to XZ2.
>>>>
>>>>    1. If we get XZ2 to work with a more variable CRS without requiring
>>>>    full PROJJSON specification, it seems it is a path to support Snowflake
>>>>    Geometry type?
>>>>    2. If we get another one-to-one partition function on spherical
>>>>    edges, like the one proposed by Michael, it seems a path to support
>>>>    Snowflake Geography type?
>>>>
>>>> Does that sound correct?  As for why certain things are marked as Phase
>>>> 1, they were chosen so we can all agree on an initial design and iterate
>>>> faster; they are not set in stone.  Maybe path 1 is possible to do
>>>> quickly, for example.
>>>>
>>>> Also, I am not sure about handling evaluation of ST_COVERS,
>>>> ST_COVERED_BY, and ST_INTERSECTS (how easy it is to handle different CRS +
>>>> spherical edges).  I will leave it to Jia.
>>>>
>>>> Thanks!
>>>> Szehon
>>>>
>>>> [1]:
>>>> https://github.com/opengeospatial/geoparquet/blob/main/format-specs/geoparquet.md#column-metadata
>>>> [2]:
>>>> https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI/edit?disco=AAABL-z6xXk
>>>> [3]:
>>>> https://docs.google.com/document/d/1tG13UpdNH3i0bVkjFLsE2kXEXCuw1XRpAC2L2qCUox0/edit
>>>>
>>>>
>>>> On Wed, May 29, 2024 at 8:30 AM Dmytro Koval
>>>> <dmytro.ko...@snowflake.com.invalid> wrote:
>>>>
>>>>> Dear Szehon and Iceberg Community,
>>>>>
>>>>>
>>>>> This is Dmytro, Peter, Aihua, and Tyler from Snowflake. As part of our
>>>>> desire to be more active in the Iceberg community, we’ve been looking over
>>>>> this geospatial proposal. We’re excited geospatial is getting traction, as
>>>>> we see a lot of geo usage within Snowflake, and expect that usage to carry
>>>>> over to our Iceberg offerings soon. After reviewing the proposal, we have
>>>>> some questions we’d like to pose given our experience with geospatial
>>>>> support in Snowflake.
>>>>>
>>>>> We would like to clarify two aspects of the proposal: handling of the
>>>>> spherical model and definition of the spatial reference system, both of
>>>>> which have a big impact on interoperability with Snowflake and other
>>>>> query engines and geo processing systems.
>>>>>
>>>>>
>>>>> Let us first share some context about geospatial types at Snowflake;
>>>>> geo experts will certainly be familiar with this context already, but for
>>>>> the sake of others we want to err on the side of being explicit and clear.
>>>>> Snowflake supports two Geospatial types [1]:
>>>>> - Geography – uses a spherical approximation of the earth for all
>>>>> computations. It does not perfectly represent the earth, but allows
>>>>> getting accurate results on WGS84 coordinates, used by GPS, without any
>>>>> need to perform coordinate system reprojections. It is also quite fast
>>>>> for end-to-end computations. In general, it has less distortion compared
>>>>> to the 2D planar model.
>>>>> - Geometry – uses a planar Euclidean geometry model. Geometric
>>>>> computations are simpler, but require transforming the data between
>>>>> coordinate systems to minimize the distortion. The Geometry data type
>>>>> allows setting a spatial reference system for each row using the SRID.
>>>>> Binary geospatial functions are only allowed on geometries with the
>>>>> same SRID. The only function that interprets the SRID is ST_TRANSFORM,
>>>>> which allows conversion between different SRSs.
>>>>>
>>>>> [Images not preserved; captions: Geography, Geometry]
>>>>>
>>>>> Given the choice of two types and a set of operations on top of them,
>>>>> the majority of Snowflake users select the Geography type to represent
>>>>> their geospatial data.
>>>>>
>>>>> From our perspective, Iceberg users would benefit most from being
>>>>> given the flexibility to store and process data using the model that
>>>>> better fits their needs and specific use cases.
>>>>>
>>>>> Therefore, we would like to ask some design clarifying questions,
>>>>> important for interoperability:
>>>>>
>>>>>
>>>>> 1. The first version of the specification (Phase 1) is described as
>>>>> focused on the planar geometry model with the CRS fixed to 4326. In this
>>>>> model, Snowflake would not be able to map our Geography type, since it is
>>>>> based on the spherical Geography model. Given that Snowflake supports
>>>>> both edge types, we would like to better understand how to map them to
>>>>> the proposed Geometry type and its metadata.
>>>>>
>>>>>    - How is the edge type supposed to be interpreted by the query
>>>>>    engine? Is it necessary for the system to adhere to the edge model for
>>>>>    geospatial functions, or can it use the model that it supports or let
>>>>>    the customer choose it? Will it affect the bounding box or other row
>>>>>    group metadata?
>>>>>
>>>>>    - Is there any reason why the flexible model has to be postponed to
>>>>>    further iterations? Would it be more extensible to support a mutable
>>>>>    edge type from Phase 1, but allow systems to ignore it if they do not
>>>>>    support the spherical computation model?
>>>>>
>>>>>
>>>>>
>>>>> 2. As you mentioned [2] in the proposal there are difficulties with
>>>>> supporting the full PROJJSON specification of the SRS. From our experience
>>>>> most of the use cases do not require the full definition of the SRS; in
>>>>> fact that definition is only needed when converting between coordinate
>>>>> systems. On the other hand, it’s often necessary to check whether two
>>>>> geometry columns have the same coordinate system, for example when joining
>>>>> two columns from different data providers.
>>>>>
>>>>> To address this we would like to propose including the option to
>>>>> specify the SRS with only an SRID in Phase 1. The query engine may choose
>>>>> to treat it as an opaque identifier or look it up in the EPSG database, if
>>>>> supported.
>>>>>
>>>>> Thank you again for driving this effort forward. We look forward to
>>>>> hearing your thoughts.
>>>>>
>>>>> [1]
>>>>> https://docs.snowflake.com/en/sql-reference/data-types-geospatial#understanding-the-differences-between-geography-and-geometry
>>>>>
>>>>> [2]
>>>>> https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI/edit#heading=h.oruaqt3nxcaf
>>>>>
>>>>>
>>>>> On 2024/05/02 00:41:52 Szehon Ho wrote:
>>>>> > Hi everyone,
>>>>> >
>>>>> > We have created a formal proposal for adding Geospatial support to
>>>>> > Iceberg.
>>>>> >
>>>>> > Please read the following for details.
>>>>> >
>>>>> >    - GitHub Proposal: https://github.com/apache/iceberg/issues/10260
>>>>> >    - Proposal Doc:
>>>>> >    https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI
>>>>> >
>>>>> >
>>>>> > Note that this proposal is built on existing extensive research and POC
>>>>> > implementations (Geolake, Havasu).  Special thanks to Jia Yu and Kristin
>>>>> > Cowalcijk from Wherobots/Geolake for extensive consultation and help in
>>>>> > writing this proposal, as well as support from Yuanyuan Zhang from
>>>>> > Geolake.
>>>>> >
>>>>> > We would love to get more feedback for this proposal from the wider
>>>>> > community and eventually discuss this in a community sync.
>>>>> >
>>>>> > Thanks
>>>>> > Szehon
>>>>> >
>>>>>
>>>>
