Hi Jia,

Thanks for the update. I'm going to re-read the whole thread and the document to get a better understanding.
Thanks!

Regards,
JB

On Mon, Jun 17, 2024 at 7:44 PM Jia Yu <ji...@apache.org> wrote:

Hi Snowflake folks,

Please let me know if you have any other questions regarding the proposal. If so, Szehon and I can set up a Zoom call with you to clarify some details. We are in the Pacific time zone; if you are in Europe, maybe early morning Pacific Time works best for you?

Thanks,
Jia

On Wed, Jun 5, 2024 at 6:28 PM Gang Wu <ust...@gmail.com> wrote:

> The min/max stats are discussed in the doc (Phase 2), depending on the non-trivial encoding.

Just want to add that min/max stats filtering could be supported natively by the file format. Adding a geometry type to the Parquet spec is under discussion: https://github.com/apache/parquet-format/pull/240

Best,
Gang

On Thu, Jun 6, 2024 at 5:53 AM Szehon Ho <szehon.apa...@gmail.com> wrote:

Hi Peter,

Yes, the document only concerns predicate pushdown on the geometry column. Predicate pushdown takes two forms: 1) partition filters and 2) min/max stats. The min/max stats are discussed in the doc (Phase 2), depending on the non-trivial encoding.

The evaluators are always AND'ed together, so I don't see any issue with partitioning by another key on a table that has a geo column.

On another note, Jia and I thought we might have a discussion about Snowflake geo types in a call to drill down on some details. What time zone are you folks in / what time works better? I think Jia and I are both in the Pacific time zone.

Thanks,
Szehon
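To make the two pushdown forms concrete: a minimal sketch, in Python, of how a scan might AND a non-geo partition filter with per-file geometry min/max (bounding-box) stats. The field names (partition_day, x_min, ...) are illustrative placeholders, not from the Iceberg spec.

```python
from dataclasses import dataclass

@dataclass
class FileStats:
    partition_day: str          # partition value from a non-geo key, e.g. days(ts)
    x_min: float; x_max: float  # hypothetical per-file geometry bbox stats
    y_min: float; y_max: float

def bbox_intersects(f: FileStats, qx0, qy0, qx1, qy1) -> bool:
    # min/max stats evaluator: can this file contain rows intersecting the query window?
    return not (f.x_max < qx0 or qx1 < f.x_min or f.y_max < qy0 or qy1 < f.y_min)

def may_match(f: FileStats, day: str, window) -> bool:
    # Evaluators are AND'ed: a file survives only if every filter may match,
    # so a date partition key and a geo min/max filter compose naturally.
    return f.partition_day == day and bbox_intersects(f, *window)

files = [FileStats("2024-06-01", -122.5, -122.3, 37.6, 37.9),
         FileStats("2024-06-02", 2.2, 2.5, 48.8, 48.9)]
window = (-123.0, 37.0, -122.0, 38.0)  # query bbox around San Francisco
print([f for f in files if may_match(f, "2024-06-01", window)])  # keeps only the first file
```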
On Wed, Jun 5, 2024 at 1:02 AM Peter Popov <peter.po...@snowflake.com> wrote:

Hi Szehon, hi Jia,

Thank you for your replies. We now better understand the connection between the metadata and partitioning in this proposal. Supporting Mapping 1 is a great starting point, and we would like to work more closely with you on bringing support for spherical edges and other coordinate systems into Iceberg geometry.

We have some follow-up questions regarding the partitioning (let us know if it's better to comment directly in the document): Does this proposal imply that XZ2 partitioning is always required? In the current proposal, do you see a possibility of predicate pushdown relying on x/y min/max column metadata instead of a partition key? We see use cases where a table with a geo column is partitioned by a different key (e.g. date) or a combination of keys. It would be great to support such use cases from the very beginning.

Thanks,
Peter

On Thu, May 30, 2024 at 8:07 AM Jia Yu <ji...@apache.org> wrote:

Hi Dmytro,

Thanks for your email. To add to Szehon's answer:

1. How to represent the Snowflake Geometry and Geography types in Iceberg, given the Geo Iceberg Phase 1 design:

Answer:
Mapping 1 (possible): Snowflake Geometry + SRID:4326 -> Iceberg Geometry + CRS84 + edges: planar
Mapping 2 (impossible): Snowflake Geography -> Iceberg Geometry + CRS84 + edges: spherical
Mapping 3 (impossible): Snowflake Geometry + SRID:ABCDE -> Iceberg Geometry + SRID:ABCDE + edges: planar

As Szehon mentioned, only Mapping 1 is possible, because we need to support spatial query pushdown in Iceberg. This function relies on the Iceberg partition transform, which requires a 1:1 mapping between a value (point/polygon/linestring) and a partition key. That is: given any precision level, a polygon must produce a single ID, and the covering indicated by this single ID must fully cover the extent of the polygon. Currently, only XZ2 can satisfy this requirement. If the theory from Michael Entin can be proven correct, then we can support Mapping 2 in Phase 2 of Geo Iceberg.

Regarding Mapping 3, this requires Iceberg to be able to understand SRID / PROJJSON so that we know the min/max X and Y of the CRS (@Szehon, maybe Iceberg can ask the engine to provide this information?). See my answer 2.

2. Why choose PROJJSON instead of SRID?

The PROJJSON idea was borrowed from GeoParquet because we'd like to enable conversion between Geo Iceberg and GeoParquet. However, I do understand that this is not a good fit for Iceberg, since not many libraries can parse PROJJSON.

@Szehon Is there a way that we can support both SRID and PROJJSON in Geo Iceberg?

It is also worth noting that, although there are many libraries that can parse an SRID and perform a look-up in the EPSG database, the license of the EPSG database is NOT compatible with the Apache Software Foundation. That means Iceberg still cannot parse / understand SRIDs.

Thanks,
Jia
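A minimal sketch of the 1:1 property Jia describes, loosely following the XZ-ordering idea: each quadtree cell is notionally enlarged to twice its width and height, and a geometry maps to the single deepest cell whose enlarged extent still covers its bounding box. This is a simplified illustration with made-up parameters, not the algorithm from the proposal or from GeoLake/Havasu.

```python
def xz2_cell(bbox, max_level=16, domain=(-180.0, -90.0, 180.0, 90.0)):
    """Map a bounding box (x0, y0, x1, y1) to a single quadtree path.

    Descend the quadtree from the root, always into the quadrant containing
    the bbox's lower-left corner; stop as soon as that child cell, doubled in
    x and y (the XZ enlargement), no longer fully covers the bbox. Every
    geometry thus gets exactly one ID, and the region indicated by that ID
    covers the geometry's whole extent.
    """
    x0, y0, x1, y1 = domain
    path = []
    for _ in range(max_level):
        xm, ym = (x0 + x1) / 2, (y0 + y1) / 2
        w, h = (x1 - x0) / 2, (y1 - y0) / 2
        # candidate child: the quadrant holding the bbox's lower-left corner
        cx0 = x0 if bbox[0] < xm else xm
        cy0 = y0 if bbox[1] < ym else ym
        # XZ enlargement: the child cell doubled to the right and to the top
        if bbox[2] > cx0 + 2 * w or bbox[3] > cy0 + 2 * h:
            break  # enlarged child no longer covers the bbox; stop here
        path.append((0 if bbox[0] < xm else 1) + (0 if bbox[1] < ym else 2))
        x0, y0, x1, y1 = cx0, cy0, cx0 + w, cy0 + h
    return tuple(path)  # one deterministic ID per geometry

# Two geometries; the second straddles the x midpoint (a 'boundary object')
print(xz2_cell((10.0, 10.0, 11.0, 11.0)))
print(xz2_cell((-1.0, 10.0, 1.0, 11.0)))  # still exactly one ID, no duplication
```

Because the descent is deterministic, a boundary-straddling object still yields exactly one partition value instead of being duplicated into several cells, which is the 1:1 requirement above.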
On Wed, May 29, 2024 at 11:08 AM Szehon Ho <szehon.apa...@gmail.com> wrote:

Hi Dmytro,

Thank you for looking through the proposal; excited to hear from you guys! I am not a 'geo expert', so I will definitely need to pull in Jia Yu for some of these points.

Although most calculations are done in the query engine, the Iceberg reference implementations (i.e., Java, Python) do have to support a few calculations to handle filter pushdown:

1. Pushdown of the proposed geospatial transforms ST_COVERS, ST_COVERED_BY, and ST_INTERSECTS.
2. Evaluation of the proposed geospatial partition transform XZ2. As you may have seen, this was chosen because it is the only standard one today that solves the 'boundary object' problem while still preserving a 1-to-1 mapping of row => partition value.

This is the primary rationale for choosing the values, as these were implemented in the GeoLake and Havasu projects (the Iceberg forks that sparked the proposal) based on the Geometry type (edges=planar, crs=OGC:CRS84 / SRID=4326).

> 2. As you mentioned [2] in the proposal, there are difficulties with supporting the full PROJJSON specification of the SRS. In our experience, most use cases do not require the full definition of the SRS; in fact, that definition is only needed when converting between coordinate systems. On the other hand, it's often necessary to check whether two geometry columns have the same coordinate system, for example when joining two columns from different data providers.
>
> To address this, we would like to propose including the option to specify the SRS with only an SRID in Phase 1. The query engine may choose to treat it as an opaque identifier or perform a look-up in the EPSG database if supported.

The way to specify the CRS definition is actually taken from GeoParquet [1]; I think we are not bound to follow it if there are better options. I feel we might need to at least list out the supported configurations in the spec, though. There is some conversation about this on the doc here [2]. Basically:

1. XZ2 assumes planar edges. This is a feature of the algorithm, based on the original paper. A possible solution for spherical edges is proposed by Michael Entin here: [3]; please feel free to evaluate it.
2. XZ2 needs to know the coordinate range. According to Jia's comments, this requires parsing the CRS. Can it be done with the SRID alone? (See the sketch after this email.)

> 1. In the first version of the specification (Phase 1), it is described as focused on the planar geometry model with the CRS fixed to 4326. In this model, Snowflake would not be able to map our Geography type, since it is based on the spherical geography model. Given that Snowflake supports both edge types, we would like to better understand how to map them to the proposed Geometry type and its metadata.
>
> - How is the edge type supposed to be interpreted by the query engine? Is it necessary for the system to adhere to the edge model for geospatial functions, or can it use the model that it supports, or let the customer choose it? Will it affect the bounding box or other row-group metadata?
>
> - Is there any reason why the flexible model has to be postponed to further iterations? Would it be more extensible to support a mutable edge type from Phase 1, but allow systems to ignore it if they do not support the spherical computation model?

This may be answered by the previous paragraph regarding XZ2:

1. If we get XZ2 to work with a more variable CRS without requiring the full PROJJSON specification, that seems to be a path to supporting the Snowflake Geometry type.
2. If we get another one-to-one partition function on spherical edges, like the one proposed by Michael, that seems to be a path to supporting the Snowflake Geography type.

Does that sound correct? As for why certain things are marked as Phase 1: they were chosen just so we can all agree on an initial design and iterate faster; nothing is set in stone, and maybe path 1 can be done quickly, for example.

Also, I am not sure about handling the evaluation of ST_COVERS, ST_COVERED_BY, and ST_INTERSECTS (how easy it is to handle different CRSs + spherical edges). I will leave that to Jia.

Thanks!
Szehon

[1]: https://github.com/opengeospatial/geoparquet/blob/main/format-specs/geoparquet.md#column-metadata
[2]: https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI/edit?disco=AAABL-z6xXk
[3]: https://docs.google.com/document/d/1tG13UpdNH3i0bVkjFLsE2kXEXCuw1XRpAC2L2qCUox0/edit
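One way to read Szehon's point 2 in code: XZ2 only needs the coordinate range of the CRS, so an engine could supply bounds for the SRIDs it knows and refuse otherwise. The lookup table below is a hypothetical engine-side mapping; per Jia's note on EPSG licensing, it is not something Iceberg itself could ship.

```python
# Hypothetical engine-side lookup: Iceberg itself cannot ship EPSG data
# (licensing), but an engine could map the SRIDs it supports to the
# coordinate range that the XZ2 transform needs.
KNOWN_CRS_BOUNDS = {
    4326: (-180.0, -90.0, 180.0, 90.0),   # WGS 84 lon/lat (CRS84 axis order)
    3857: (-20037508.34, -20037508.34, 20037508.34, 20037508.34),  # Web Mercator
}

def xz2_domain(srid=None):
    """Return the (xmin, ymin, xmax, ymax) domain for the XZ2 transform.

    With only an opaque SRID, the engine either knows the bounds or must
    refuse to build the partition transform; Phase 1 sidesteps this by
    fixing the CRS to CRS84.
    """
    if srid is None:
        return KNOWN_CRS_BOUNDS[4326]  # Phase 1 default: CRS84
    try:
        return KNOWN_CRS_BOUNDS[srid]
    except KeyError:
        raise ValueError(f"SRID {srid} unknown to this engine; "
                         "cannot derive the XZ2 coordinate range")

print(xz2_domain())      # CRS84 default
print(xz2_domain(3857))  # engine-known projected CRS
```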
On Wed, May 29, 2024 at 8:30 AM Dmytro Koval <dmytro.ko...@snowflake.com.invalid> wrote:

Dear Szehon and Iceberg Community,

This is Dmytro, Peter, Aihua, and Tyler from Snowflake. As part of our desire to be more active in the Iceberg community, we've been looking over this geospatial proposal. We're excited geospatial is getting traction, as we see a lot of geo usage within Snowflake and expect that usage to carry over to our Iceberg offerings soon. After reviewing the proposal, we have some questions we'd like to pose, given our experience with geospatial support in Snowflake.

We would like to clarify two aspects of the proposal: the handling of the spherical model and the definition of the spatial reference system. Both have a big impact on interoperability with Snowflake and with other query engines and geo processing systems.

Let us first share some context about geospatial types at Snowflake; geo experts will certainly be familiar with this already, but for the sake of others we want to err on the side of being explicit and clear. Snowflake supports two geospatial types [1]:

- Geography – uses a spherical approximation of the earth for all computations. It does not perfectly represent the earth, but it allows getting accurate results on the WGS84 coordinates used by GPS, without any need to perform coordinate system reprojections. It is also quite fast for end-to-end computations. In general, it has less distortion compared to the 2D planar model.
- Geometry – uses the planar Euclidean geometry model. Geometric computations are simpler, but require transforming the data between coordinate systems to minimize distortion. The Geometry data type allows setting a spatial reference system for each row using an SRID. Binary geospatial functions are only allowed on geometries with the same SRID. The only function that interprets the SRID is ST_TRANSFORM, which allows conversion between different SRSs.
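To make the edge-model difference concrete, a small sketch (assuming pyproj is available; none of this is from the proposal): the same pair of lon/lat endpoints yields a different edge depending on whether it is read as planar (a straight segment in the coordinate plane) or spherical (a geodesic on the WGS84 ellipsoid).

```python
import math
from pyproj import Geod  # assumed available: pip install pyproj

# The same edge, read under the two models described above:
lon1, lat1 = -73.97, 40.78   # New York
lon2, lat2 = 2.35, 48.86     # Paris

# Geometry-style (planar edges): a straight segment in the coordinate plane,
# so its 'length' is just Euclidean distance in degrees, which is physically
# meaningless for lon/lat without a projection.
planar_degrees = math.hypot(lon2 - lon1, lat2 - lat1)

# Geography-style (spherical edges): the edge is a geodesic on the WGS84
# ellipsoid, and its length comes back in meters.
geodesic_m = Geod(ellps="WGS84").line_length([lon1, lon2], [lat1, lat2])

print(f"planar: {planar_degrees:.2f} degrees, geodesic: {geodesic_m / 1000:.0f} km")
```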
Given the choice of the two types and a set of operations on top of them, the majority of Snowflake users select the Geography type to represent their geospatial data.

From our perspective, Iceberg users would benefit most from being given the flexibility to store and process data using the model that better fits their needs and specific use cases.

Therefore, we would like to ask some design-clarifying questions that are important for interoperability:

1. In the first version of the specification (Phase 1), it is described as focused on the planar geometry model with the CRS fixed to 4326. In this model, Snowflake would not be able to map our Geography type, since it is based on the spherical geography model. Given that Snowflake supports both edge types, we would like to better understand how to map them to the proposed Geometry type and its metadata.

- How is the edge type supposed to be interpreted by the query engine? Is it necessary for the system to adhere to the edge model for geospatial functions, or can it use the model that it supports, or let the customer choose it? Will it affect the bounding box or other row-group metadata?

- Is there any reason why the flexible model has to be postponed to further iterations? Would it be more extensible to support a mutable edge type from Phase 1, but allow systems to ignore it if they do not support the spherical computation model?

2. As you mentioned [2] in the proposal, there are difficulties with supporting the full PROJJSON specification of the SRS. In our experience, most use cases do not require the full definition of the SRS; in fact, that definition is only needed when converting between coordinate systems. On the other hand, it's often necessary to check whether two geometry columns have the same coordinate system, for example when joining two columns from different data providers.

To address this, we would like to propose including the option to specify the SRS with only an SRID in Phase 1. The query engine may choose to treat it as an opaque identifier or perform a look-up in the EPSG database if supported.

Thank you again for driving this effort forward. We look forward to hearing your thoughts.

[1] https://docs.snowflake.com/en/sql-reference/data-types-geospatial#understanding-the-differences-between-geography-and-geometry
[2] https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI/edit#heading=h.oruaqt3nxcaf

On 2024/05/02 00:41:52 Szehon Ho wrote:

> Hi everyone,
>
> We have created a formal proposal for adding geospatial support to Iceberg. Please read the following for details.
>
> - Github Proposal: https://github.com/apache/iceberg/issues/10260
> - Proposal Doc: https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI
>
> Note that this proposal is built on existing extensive research and POC implementations (Geolake, Havasu). Special thanks to Jia Yu and Kristin Cowalcijk from Wherobots/Geolake for extensive consultation and help in writing this proposal, as well as support from Yuanyuan Zhang from Geolake.
>
> We would love to get more feedback on this proposal from the wider community and eventually discuss it in a community sync.
>
> Thanks
> Szehon