Hi Szehon, hi Jia,

Thank you for your replies. We now better understand the connection between
the metadata and partitioning in this proposal. Supporting Mapping 1 is
a great starting point, and we would like to work more closely with you on
bringing support for spherical edges and other coordinate systems into
Iceberg geometry.

We have some follow-up questions regarding the partitioning (let us know if
it’s better to comment directly in the document): Does this proposal imply
that XZ2 partitioning is always required? In the current proposal, do you
see a possibility of predicate pushdown to rely on x/y min/max column
metadata instead of a partition key? We see use cases where a table with a
geo column is partitioned by a different key (e.g. date) or a combination
of keys. It would be great to support such use cases from the very
beginning.
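
To make the min/max question concrete, here is a minimal sketch of the kind of pruning we have in mind. It assumes planar edges and per-file x/y min/max statistics for the geo column; `BBox`, `DataFile`, `prune`, and the file names are hypothetical illustrations, not Iceberg APIs.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class BBox:
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def intersects(self, other: "BBox") -> bool:
        # Two axis-aligned boxes overlap only if they overlap on both axes.
        return (self.xmin <= other.xmax and other.xmin <= self.xmax
                and self.ymin <= other.ymax and other.ymin <= self.ymax)


@dataclass(frozen=True)
class DataFile:
    path: str
    event_date: str   # partition value; the table is partitioned by date, not by XZ2
    geo_bounds: BBox  # hypothetical x/y min/max column statistics for the geo column


def prune(files: List[DataFile], window: BBox,
          date: Optional[str] = None) -> List[DataFile]:
    """Keep only files that may contain rows intersecting the query window."""
    return [f for f in files
            if (date is None or f.event_date == date)
            and f.geo_bounds.intersects(window)]


files = [
    DataFile("a.parquet", "2024-05-29", BBox(-10, -10, 0, 0)),
    DataFile("b.parquet", "2024-05-29", BBox(5, 5, 20, 20)),
    DataFile("c.parquet", "2024-05-30", BBox(5, 5, 20, 20)),
]
kept = prune(files, BBox(10, 10, 12, 12), date="2024-05-29")
```

Here the date partition and the geo statistics prune independently, so no spatial partition key is needed for the filter to skip files.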

Thanks,

Peter

On Thu, May 30, 2024 at 8:07 AM Jia Yu <ji...@apache.org> wrote:

> Hi Dmytro,
>
> Thanks for your email. To add to Szehon's answer,
>
> 1. How to represent Snowflake Geometry and Geography type in Iceberg,
> given the Geo Iceberg Phase 1 design:
>
> Answer:
> Mapping 1 (possible): Snowflake Geometry + SRID: 4326 -> Iceberg
> Geometry + CRS84 + edges: Planar
> Mapping 2 (impossible): Snowflake Geography -> Iceberg Geometry + CRS84 +
> edges: Spherical
> Mapping 3 (impossible): Snowflake Geometry + SRID:ABCDE -> Iceberg
> Geometry + SRID:ABCDE + edges: Planar
>
> As Szehon mentioned, only Mapping 1 is possible because we need to support
> spatial query push down in Iceberg. This function relies on the Iceberg
> partition transform, which requires a 1:1 mapping between a value
> (point/polygon/linestring) and a partition key. That is: given any
> precision level, a polygon must produce a single ID; and the covering
> indicated by this single ID must fully cover the extent of the polygon.
> Currently, only XZ2 can satisfy this requirement. If the theory from
> Michael Entin can be proven to be correct, then we can support Mapping 2 in
> Phase 2 of Geo Iceberg.
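
The covering requirement above can be sketched as follows. This is a simplified illustration of the XZ-ordering idea (each cell is enlarged to twice its width and height, and a box is assigned to the finest enlarged cell that still covers it), not the implementation in the proposal, GeoLake, or Havasu. The function names, the `(level, ix, iy)` return value (a real XZ2 encodes the cell as a single integer), and the default CRS84 bounds are assumptions, and the input box is assumed to lie within the bounds.

```python
import math


def xz2_cell(xmin, ymin, xmax, ymax, max_level=12,
             bounds=(-180.0, -90.0, 180.0, 90.0)):
    """Return (level, ix, iy) of one cell whose enlarged extent covers the box."""
    bx0, by0, bx1, by1 = bounds
    # Normalize to the unit square; this is why the CRS coordinate range is needed.
    nx0, nx1 = (xmin - bx0) / (bx1 - bx0), (xmax - bx0) / (bx1 - bx0)
    ny0, ny1 = (ymin - by0) / (by1 - by0), (ymax - by0) / (by1 - by0)

    max_dim = max(nx1 - nx0, ny1 - ny0)
    if max_dim <= 0.0:
        level = max_level  # a point goes to the finest level
    else:
        # Finest level whose plain cell width is still >= the box size.
        l1 = math.floor(math.log(max_dim) / math.log(0.5))
        if l1 >= max_level:
            level = max_level
        else:
            w2 = 0.5 ** (l1 + 1)

            def fits(lo, hi):
                # Does [lo, hi] fit in the doubled cell starting at lo's cell?
                return hi <= math.floor(lo / w2) * w2 + 2 * w2

            level = l1 + 1 if fits(nx0, nx1) and fits(ny0, ny1) else l1

    cells = 2 ** level
    ix = min(int(nx0 * cells), cells - 1)
    iy = min(int(ny0 * cells), cells - 1)
    return level, ix, iy


def enlarged_extent(level, ix, iy):
    """Normalized (xmin, ymin, xmax, ymax) of the enlarged cell."""
    w = 0.5 ** level
    return ix * w, iy * w, (ix + 2) * w, (iy + 2) * w


# A small box straddling the center of the world still maps to a single cell.
cell = xz2_cell(-0.1, -0.1, 0.1, 0.1)
```

The box in the usage line crosses the boundary between all four quadrants, which is the 'boundary object' case: a plain quadtree would have to push it up to the root, while the enlarged cell keeps it at a fine level with one ID.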
>
> Regarding Mapping 3, this requires Iceberg to be able to understand SRID /
> PROJJSON so that we know the min/max X and Y of the CRS (@Szehon, maybe
> Iceberg can ask the engine to provide this information?). See my answer 2.
>
> 2. Why choose projjson instead of SRID?
>
> The projjson idea was borrowed from GeoParquet because we'd like to enable
> possible conversion between Geo Iceberg and GeoParquet. However, I do
> understand that this is not a good idea for Iceberg since not many libs can
> parse projjson.
>
> @Szehon Is there a way that we can support both SRID and PROJJSON in Geo
> Iceberg?
>
> It is also worth noting that, although there are many libs that can parse
> SRID and perform look-up in the EPSG database, the license of the EPSG
> database is NOT compatible with the Apache Software Foundation. That means:
> Iceberg still cannot parse / understand SRID.
>
> Thanks,
> Jia
>
> On Wed, May 29, 2024 at 11:08 AM Szehon Ho <szehon.apa...@gmail.com>
> wrote:
>
>> Hi Dmytro
>>
>> Thank you for looking through the proposal; I am excited to hear from you
>> guys!  I am not a 'geo expert' and I will definitely need to pull in Jia Yu
>> for some of these points.
>>
>> Although most calculations are done in the query engine, the Iceberg
>> reference implementations (i.e., Java, Python) do have to support a few
>> calculations to handle filter push down:
>>
>>    1. push down of the proposed Geospatial transforms ST_COVERS,
>>    ST_COVERED_BY, and ST_INTERSECTS
>>    2. evaluation of proposed Geospatial partition transform XZ2.  As you
>>    may have seen, this was chosen as it's the only standard one today that
>>    solves the 'boundary object' problem while still preserving a 1-to-1
>>    mapping of row => partition value.
>>
>> This is the primary rationale for choosing the values, as these were
>> implemented in the GeoLake and Havasu projects (Iceberg forks that sparked
>> the proposal) based on Geometry type (edge=planar, crs=OGC:CRS84/
>> SRID=4326).
>>
>>> 2. As you mentioned [2] in the proposal there are difficulties with
>>> supporting the full PROJJSON specification of the SRS. From our experience
>>> most of the use-cases do not require the full definition of the SRS, in
>>> fact that definition is only needed when converting between coordinate
>>> systems. On the other hand, it’s often needed to check whether two geometry
>>> columns have the same coordinate system, for example when joining two
>>> columns from different data providers.
>>>
>>> To address this we would like to propose including the option to specify
>>> the SRS with only a SRID in phase 1. The query engine may choose to treat
>>> it as an opaque identifier or make a look-up in the EPSG database, if
>>> supported.
>>>
>>
>> The way to specify the CRS definition is actually taken from GeoParquet [1];
>> I think we are not bound to follow it if there are better options.  I feel
>> we might need to at least list out supported configurations in the spec,
>> though.  There is some conversation on the doc here about this [2].
>> Basically:
>>
>>    1. XZ2 assumes planar edges.  This is a feature of the algorithm,
>>    based on the original paper.  A possible solution for spherical edges is
>>    proposed by Michael Entin in [3]; please feel free to evaluate it.
>>    2. XZ2 needs to know the coordinate range.  According to Jia's
>>    comments, this needs parsing of the CRS.  Can it be done with SRID alone?
>>
>>
>>> 1. The first version of the specification (Phase 1) is described as
>>> focusing on the planar geometry model with the CRS fixed to 4326. In
>>> this model, Snowflake would not be able to map our Geography type
>>> since it is based on the spherical Geography model. Given that Snowflake
>>> supports both edge types, we would like to better understand how to map
>>> them to the proposed Geometry type and its metadata.
>>>
>>>    - How is the edge type supposed to be interpreted by the query engine?
>>>    Is it necessary for the system to adhere to the edge model for geospatial
>>>    functions, or can it use the model that it supports or let the customer
>>>    choose it? Will it affect the bounding box or other row group metadata?
>>>    - Is there any reason why the flexible model has to be postponed to
>>>    further iterations? Would it be more extensible to support a mutable edge
>>>    type from Phase 1, but allow systems to ignore it if they do not
>>>    support the spherical computation model?
>>>
>>>
>> It may be answered by the previous paragraph in regards to XZ2.
>>
>>    1. If we get XZ2 to work with a more variable CRS without requiring
>>    full PROJJSON specification, it seems it is a path to support Snowflake
>>    Geometry type?
>>    2. If we get another one-to-one partition function on spherical
>>    edges, like the one proposed by Michael, it seems a path to support
>>    Snowflake Geography type?
>>
>> Does that sound correct?  As for why certain things are marked as Phase
>> 1, they were chosen so that we can all agree on an initial design and
>> iterate faster; they are not set in stone.  Path 1, for example, may be
>> possible to do quickly.
>>
>> Also, I am not sure about handling evaluation of ST_COVERS,
>> ST_COVERED_BY, and ST_INTERSECTS (how easy to handle different CRS +
>> spherical edges).  I will leave it to Jia.
>>
>> Thanks!
>> Szehon
>>
>> [1]:
>> https://github.com/opengeospatial/geoparquet/blob/main/format-specs/geoparquet.md#column-metadata
>> [2]:
>> https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI/edit?disco=AAABL-z6xXk
>> [3]:
>> https://docs.google.com/document/d/1tG13UpdNH3i0bVkjFLsE2kXEXCuw1XRpAC2L2qCUox0/edit
>>
>>
>> On Wed, May 29, 2024 at 8:30 AM Dmytro Koval
>> <dmytro.ko...@snowflake.com.invalid> wrote:
>>
>>> Dear Szehon and Iceberg Community,
>>>
>>>
>>> This is Dmytro, Peter, Aihua, and Tyler from Snowflake. As part of our
>>> desire to be more active in the Iceberg community, we’ve been looking over
>>> this geospatial proposal. We’re excited geospatial is getting traction, as
>>> we see a lot of geo usage within Snowflake, and expect that usage to carry
>>> over to our Iceberg offerings soon. After reviewing the proposal, we have
>>> some questions we’d like to pose given our experience with geospatial
>>> support in Snowflake.
>>>
>>> We would like to clarify two aspects of the proposal: handling of the
>>> spherical model and definition of the spatial reference system, both of
>>> which have a big impact on interoperability with Snowflake and other
>>> query engines and geo processing systems.
>>>
>>>
>>> Let us first share some context about geospatial types at Snowflake; geo
>>> experts will certainly be familiar with this context already, but for the
>>> sake of others we want to err on the side of being explicit and clear.
>>> Snowflake supports two Geospatial types [1]:
>>> - Geography – uses a spherical approximation of the earth for all
>>> computations. It does not perfectly represent the earth, but it gives
>>> accurate results on the WGS84 coordinates used by GPS, without any need to
>>> perform coordinate system reprojections. It is also quite fast for
>>> end-to-end computations. In general, it has less distortion than the
>>> 2D planar model.
>>> - Geometry – uses a planar Euclidean geometry model. Geometric
>>> computations are simpler, but require transforming the data between
>>> coordinate systems to minimize the distortion. The Geometry data type
>>> allows setting a spatial reference system for each row using the SRID.
>>> Binary geospatial functions are only allowed on geometries with the
>>> same SRID. The only function that interprets the SRID is ST_TRANSFORM,
>>> which allows conversion between different SRSs.
>>>
>>> [Images omitted: Geography, Geometry]
>>>
>>> Given the choice of two types and a set of operations on top of them,
>>> the majority of Snowflake users select the Geography type to represent
>>> their geospatial data.
>>>
>>> From our perspective, Iceberg users would benefit most from being given
>>> the flexibility to store and process data using the model that better fits
>>> their needs and specific use cases.
>>>
>>> Therefore, we would like to ask some clarifying design questions that are
>>> important for interoperability:
>>>
>>>
>>> 1. The first version of the specification (Phase 1) is described as
>>> focusing on the planar geometry model with the CRS fixed to 4326. In
>>> this model, Snowflake would not be able to map our Geography type
>>> since it is based on the spherical Geography model. Given that Snowflake
>>> supports both edge types, we would like to better understand how to map
>>> them to the proposed Geometry type and its metadata.
>>>
>>>    - How is the edge type supposed to be interpreted by the query engine?
>>>    Is it necessary for the system to adhere to the edge model for geospatial
>>>    functions, or can it use the model that it supports or let the customer
>>>    choose it? Will it affect the bounding box or other row group metadata?
>>>    - Is there any reason why the flexible model has to be postponed to
>>>    further iterations? Would it be more extensible to support a mutable edge
>>>    type from Phase 1, but allow systems to ignore it if they do not
>>>    support the spherical computation model?
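
On the bounding-box question in the list above, a small worked example may help. It is a sketch assuming a perfect sphere, and the helper name is hypothetical: for an edge between two points at the same latitude, the planar interpretation stays at that latitude, while the great-circle interpretation bulges toward the pole, so a bounding box computed from the vertices alone no longer covers the edge.

```python
import math


def great_circle_midpoint_lat(lon1, lat1, lon2, lat2):
    """Latitude in degrees of the midpoint of the great-circle arc.

    Assumes a unit sphere and endpoints that are not antipodal.
    """
    def to_xyz(lon, lat):
        lam, phi = math.radians(lon), math.radians(lat)
        return (math.cos(phi) * math.cos(lam),
                math.cos(phi) * math.sin(lam),
                math.sin(phi))

    x1, y1, z1 = to_xyz(lon1, lat1)
    x2, y2, z2 = to_xyz(lon2, lat2)
    # The normalized sum of the two unit vectors points at the arc midpoint;
    # atan2 makes the explicit normalization unnecessary.
    mx, my, mz = x1 + x2, y1 + y2, z1 + z2
    return math.degrees(math.atan2(mz, math.hypot(mx, my)))


# Edge from (lon=-60, lat=50) to (lon=60, lat=50).
planar_ymax = max(50.0, 50.0)  # planar edge: the edge never leaves latitude 50
spherical_mid_lat = great_circle_midpoint_lat(-60.0, 50.0, 60.0, 50.0)
# spherical_mid_lat is roughly 67 degrees, well above the planar maximum of 50.
```

A reader that prunes on min/max statistics written under one edge interpretation, but evaluates predicates under the other, could therefore skip files that actually match.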
>>>
>>>
>>>
>>> 2. As you mentioned [2] in the proposal there are difficulties with
>>> supporting the full PROJJSON specification of the SRS. From our experience
>>> most of the use-cases do not require the full definition of the SRS, in
>>> fact that definition is only needed when converting between coordinate
>>> systems. On the other hand, it’s often needed to check whether two geometry
>>> columns have the same coordinate system, for example when joining two
>>> columns from different data providers.
>>>
>>> To address this we would like to propose including the option to specify
>>> the SRS with only a SRID in phase 1. The query engine may choose to treat
>>> it as an opaque identifier or make a look-up in the EPSG database, if
>>> supported.
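
As a minimal sketch of the opaque-identifier option described above: an engine can compare SRIDs for equality before a join without ever resolving them to a full CRS definition. `GeoColumn` and `check_joinable` are hypothetical names used only for illustration; they are not part of Iceberg or Snowflake.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeoColumn:
    name: str
    srid: int  # treated as an opaque identifier; never looked up in EPSG


def check_joinable(left: GeoColumn, right: GeoColumn) -> None:
    """Raise if the two columns do not declare the same SRID."""
    if left.srid != right.srid:
        raise ValueError(
            f"SRID mismatch: {left.name} has {left.srid}, "
            f"{right.name} has {right.srid}"
        )


parcels = GeoColumn("parcels.geom", 4326)
roads = GeoColumn("roads.geom", 4326)
check_joinable(parcels, roads)  # same SRID, so no error is raised
```

Only a reprojection would need the full definition of the SRS; the equality check itself does not.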
>>>
>>> Thank you again for driving this effort forward. We look forward to
>>> hearing your thoughts.
>>>
>>> [1]
>>> https://docs.snowflake.com/en/sql-reference/data-types-geospatial#understanding-the-differences-between-geography-and-geometry
>>>
>>> [2]
>>> https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI/edit#heading=h.oruaqt3nxcaf
>>>
>>>
>>> On 2024/05/02 00:41:52 Szehon Ho wrote:
>>> > Hi everyone,
>>> >
>>> > We have created a formal proposal for adding Geospatial support to
>>> > Iceberg.
>>> >
>>> > Please read the following for details.
>>> >
>>> >    - GitHub Proposal: https://github.com/apache/iceberg/issues/10260
>>> >    - Proposal Doc:
>>> >    https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI
>>> >
>>> > Note that this proposal is built on existing extensive research and POC
>>> > implementations (Geolake, Havasu).  Special thanks to Jia Yu and Kristin
>>> > Cowalcijk from Wherobots/Geolake for extensive consultation and help in
>>> > writing this proposal, as well as support from Yuanyuan Zhang from
>>> > Geolake.
>>> >
>>> > We would love to get more feedback for this proposal from the wider
>>> > community and eventually discuss this in a community sync.
>>> >
>>> > Thanks
>>> > Szehon
>>> >
>>>
>>
