Hi Szehon, hi Jia,

Thank you for your replies. We now better understand the connection between the metadata and partitioning in this proposal. Supporting Mapping 1 is a great starting point, and we would like to work more closely with you on bringing support for spherical edges and other coordinate systems into Iceberg geometry.

We have some follow-up questions regarding the partitioning (let us know if it’s better to comment directly in the document):

Does this proposal imply that XZ2 partitioning is always required? In the current proposal, do you see a possibility for predicate pushdown to rely on x/y min/max column metadata instead of a partition key? We see use cases where a table with a geo column is partitioned by a different key (e.g. date) or a combination of keys. It would be great to support such use cases from the very beginning.

Thanks,
Peter

On Thu, May 30, 2024 at 8:07 AM Jia Yu <ji...@apache.org> wrote:

> Hi Dmytro,
>
> Thanks for your email. To add to Szehon's answer,
>
> 1. How to represent Snowflake Geometry and Geography type in Iceberg,
> given the Geo Iceberg Phase 1 design:
>
> Answer:
> Mapping 1 (possible): Snowflake Geometry + SRID: 4326 -> Iceberg
> Geometry + CRS84 + edges: Planar
> Mapping 2 (impossible): Snowflake Geography -> Iceberg Geometry + CRS84 +
> edges: Spherical
> Mapping 3 (impossible): Snowflake Geometry + SRID:ABCDE -> Iceberg
> Geometry + SRID:ABCDE + edges: Planar
>
> As Szehon mentioned, only Mapping 1 is possible because we need to support
> spatial query push down in Iceberg. This function relies on the Iceberg
> partition transform, which requires a 1:1 mapping between a value
> (point/polygon/linestring) and a partition key. That is: given any
> precision level, a polygon must produce a single ID, and the covering
> indicated by this single ID must fully cover the extent of the polygon.
> Currently, only XZ2 can satisfy this requirement. If the theory from
> Michael Entin can be proven to be correct, then we can support Mapping 2 in
> Phase 2 of Geo Iceberg.
>
> Regarding Mapping 3, this requires Iceberg to be able to understand SRID /
> PROJJSON so that we know the min/max X and Y of the CRS (@Szehon, maybe
> Iceberg can ask the engine to provide this information?). See my answer 2.
>
> 2. Why choose projjson instead of SRID?
>
> The projjson idea was borrowed from GeoParquet because we'd like to enable
> possible conversion between Geo Iceberg and GeoParquet. However, I do
> understand that this is not a good idea for Iceberg since not many libs can
> parse projjson.
>
> @Szehon Is there a way that we can support both SRID and PROJJSON in Geo
> Iceberg?
>
> It is also worth noting that, although there are many libs that can parse
> SRID and perform look-up in the EPSG database, the license of the EPSG
> database is NOT compatible with the Apache Software Foundation. That means:
> Iceberg still cannot parse / understand SRID.
>
> Thanks,
> Jia
>
> On Wed, May 29, 2024 at 11:08 AM Szehon Ho <szehon.apa...@gmail.com>
> wrote:
>
>> Hi Dmytro
>>
>> Thank you for looking through the proposal; excited to hear from you
>> guys! I am not a 'geo expert' and I will definitely need to pull in Jia Yu
>> for some of these points.
>>
>> Although most calculations are done on the query engine, Iceberg
>> reference implementations (i.e., Java, Python) do have to support a few
>> calculations to handle filter push down:
>>
>> 1. push down of the proposed Geospatial transforms ST_COVERS,
>> ST_COVERED_BY, and ST_INTERSECTS
>> 2. evaluation of proposed Geospatial partition transform XZ2. As you
>> may have seen, this was chosen as it's the only standard one today that
>> solves the 'boundary object' problem, still preserving 1-to-1 mapping of
>> row => partition value.
>>
>> This is the primary rationale for choosing the values, as these were
>> implemented in the GeoLake and Havasu projects (Iceberg forks that sparked
>> the proposal) based on Geometry type (edge=planar, crs=OGC:CRS84/
>> SRID=4326).
>>
>> 2. As you mentioned [2] in the proposal there are difficulties with
>>> supporting the full PROJJSON specification of the SRS.
>>> From our experience
>>> most of the use-cases do not require the full definition of the SRS; in
>>> fact, that definition is only needed when converting between coordinate
>>> systems. On the other hand, it’s often necessary to check whether two geometry
>>> columns have the same coordinate system, for example when joining two
>>> columns from different data providers.
>>>
>>> To address this we would like to propose including the option to specify
>>> the SRS with only a SRID in phase 1. The query engine may choose to treat
>>> it as an opaque identifier or look it up in the EPSG database, if
>>> supported.
>>>
>>
>> The way to specify CRS definition is actually taken from GeoParquet [1];
>> I think we are not bound to follow it if there are better options. I feel
>> we might need to at least list out supported configurations in the spec,
>> though. There is some conversation on the doc here about this [2].
>> Basically:
>>
>> 1. XZ2 assumes planar edges. This is a feature of the algorithm,
>> based on the original paper. A possible solution to spherical edge is
>> proposed by Michael Entin here: [3], please feel free to evaluate.
>> 2. XZ2 needs to know the coordinate range. According to Jia's
>> comments, this needs parsing of the CRS. Can it be done with SRID alone?
>>
>>
>>> 1. Phase 1 of the specification is described as
>>> focused on the planar geometry model, with the CRS fixed to
>>> 4326. In this model, Snowflake would not be able to map our Geography type
>>> since it is based on the spherical Geography model. Given that Snowflake
>>> supports both edge types, we would like to better understand how to map
>>> them to the proposed Geometry type and its metadata.
>>>
>>> - How is the edge type supposed to be interpreted by the query engine?
>>>   Is it necessary for the system to adhere to the edge model for geospatial
>>>   functions, or can it use the model that it supports or let the customer
>>>   choose it?
>>>   Will it affect the bounding box or other row group metadata?
>>> - Is there any reason why the flexible model has to be postponed to
>>>   further iterations? Would it be more extensible to support a mutable edge
>>>   type from Phase 1, but allow systems to ignore it if they do not
>>>   support the spherical computation model?
>>>
>>
>> It may be answered by the previous paragraph with regard to XZ2.
>>
>> 1. If we get XZ2 to work with a more variable CRS without requiring
>> full PROJJSON specification, it seems it is a path to support Snowflake
>> Geometry type?
>> 2. If we get another one-to-one partition function on spherical
>> edges, like the one proposed by Michael, it seems a path to support
>> Snowflake Geography type?
>>
>> Does that sound correct? As for why certain things are marked as Phase
>> 1, they are just chosen so we can all agree on an initial design and
>> iterate faster; they are not set in stone. Maybe path 1 is possible to do
>> quickly, for example.
>>
>> Also, I am not sure about handling evaluation of ST_COVERS,
>> ST_COVERED_BY, and ST_INTERSECTS (how easy to handle different CRS +
>> spherical edges). I will leave it to Jia.
>>
>> Thanks!
>> Szehon
>>
>> [1]:
>> https://github.com/opengeospatial/geoparquet/blob/main/format-specs/geoparquet.md#column-metadata
>> [2]:
>> https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI/edit?disco=AAABL-z6xXk
>> [3]:
>> https://docs.google.com/document/d/1tG13UpdNH3i0bVkjFLsE2kXEXCuw1XRpAC2L2qCUox0/edit
>>
>>
>> On Wed, May 29, 2024 at 8:30 AM Dmytro Koval
>> <dmytro.ko...@snowflake.com.invalid> wrote:
>>
>>> Dear Szehon and Iceberg Community,
>>>
>>>
>>> This is Dmytro, Peter, Aihua, and Tyler from Snowflake.
>>> As part of our
>>> desire to be more active in the Iceberg community, we’ve been looking over
>>> this geospatial proposal. We’re excited geospatial is getting traction, as
>>> we see a lot of geo usage within Snowflake, and expect that usage to carry
>>> over to our Iceberg offerings soon. After reviewing the proposal, we have
>>> some questions we’d like to pose given our experience with geospatial
>>> support in Snowflake.
>>>
>>> We would like to clarify two aspects of the proposal: handling of the
>>> spherical model and definition of the spatial reference system. Both
>>> have a big impact on interoperability with Snowflake and other
>>> query engines and Geo processing systems.
>>>
>>>
>>> Let us first share some context about geospatial types at Snowflake; geo
>>> experts will certainly be familiar with this context already, but for the
>>> sake of others we want to err on the side of being explicit and clear.
>>> Snowflake supports two Geospatial types [1]:
>>> - Geography – uses a spherical approximation of the earth for all the
>>> computations. It does not perfectly represent the earth, but allows getting
>>> accurate results on WGS84 coordinates, as used by GPS, without any need to
>>> perform coordinate system reprojections. It is also quite fast for
>>> end-to-end computations. In general, it has less distortion compared to
>>> the 2D planar model.
>>> - Geometry – uses a planar Euclidean geometry model. Geometric
>>> computations are simpler, but require transforming the data between
>>> coordinate systems to minimize the distortion. The Geometry data type
>>> allows setting a spatial reference system for each row using the SRID. The
>>> binary geospatial functions are only allowed on geometries with the
>>> same SRID. The only function that interprets the SRID is ST_TRANSFORM, which
>>> allows conversion between different SRSs.
>>>
>>> Geography
>>>
>>> Geometry
>>>
>>>
>>> Given the choice of two types and a set of operations on top of them,
>>> the majority of Snowflake users select the Geography type to represent
>>> their geospatial data.
>>>
>>> From our perspective, Iceberg users would benefit most from being given
>>> the flexibility to store and process data using the model that better fits
>>> their needs and specific use cases.
>>>
>>> Therefore, we would like to ask some clarifying design questions that are
>>> important for interoperability:
>>>
>>>
>>> 1. Phase 1 of the specification is described as
>>> focused on the planar geometry model, with the CRS fixed to
>>> 4326. In this model, Snowflake would not be able to map our Geography type
>>> since it is based on the spherical Geography model. Given that Snowflake
>>> supports both edge types, we would like to better understand how to map
>>> them to the proposed Geometry type and its metadata.
>>>
>>> - How is the edge type supposed to be interpreted by the query engine?
>>>   Is it necessary for the system to adhere to the edge model for geospatial
>>>   functions, or can it use the model that it supports or let the customer
>>>   choose it? Will it affect the bounding box or other row group metadata?
>>> - Is there any reason why the flexible model has to be postponed to
>>>   further iterations? Would it be more extensible to support a mutable edge
>>>   type from Phase 1, but allow systems to ignore it if they do not
>>>   support the spherical computation model?
>>>
>>>
>>> 2. As you mentioned [2] in the proposal there are difficulties with
>>> supporting the full PROJJSON specification of the SRS. From our experience
>>> most of the use-cases do not require the full definition of the SRS; in
>>> fact, that definition is only needed when converting between coordinate
>>> systems.
>>> On the other hand, it’s often necessary to check whether two geometry
>>> columns have the same coordinate system, for example when joining two
>>> columns from different data providers.
>>>
>>> To address this we would like to propose including the option to specify
>>> the SRS with only a SRID in phase 1. The query engine may choose to treat
>>> it as an opaque identifier or look it up in the EPSG database, if
>>> supported.
>>>
>>> Thank you again for driving this effort forward. We look forward to
>>> hearing your thoughts.
>>>
>>> [1]
>>> https://docs.snowflake.com/en/sql-reference/data-types-geospatial#understanding-the-differences-between-geography-and-geometry
>>>
>>> [2]
>>> https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI/edit#heading=h.oruaqt3nxcaf
>>>
>>>
>>> On 2024/05/02 00:41:52 Szehon Ho wrote:
>>> > Hi everyone,
>>> >
>>> > We have created a formal proposal for adding Geospatial support to Iceberg.
>>> >
>>> > Please read the following for details.
>>> >
>>> > - GitHub Proposal: https://github.com/apache/iceberg/issues/10260
>>> > - Proposal Doc: https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI
>>> >
>>> >
>>> > Note that this proposal is built on existing extensive research and POC
>>> > implementations (Geolake, Havasu). Special thanks to Jia Yu and Kristin
>>> > Cowalcijk from Wherobots/Geolake for extensive consultation and help in
>>> > writing this proposal, as well as support from Yuanyuan Zhang from Geolake.
>>> >
>>> > We would love to get more feedback for this proposal from the wider
>>> > community and eventually discuss this in a community sync.
>>> >
>>> > Thanks
>>> > Szehon
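
For readers following the thread, the question about relying on x/y min/max column metadata instead of a partition key can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not part of the proposal or of any Iceberg API: the `BBox` and `FileStats` classes, the `prune` function, the `event_date` partition column, and the file names are all hypothetical. The sketch also assumes planar edges and coordinates that do not wrap around the antimeridian, which is the Phase 1 setting discussed in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BBox:
    """Axis-aligned bounding box in planar coordinates (hypothetical helper)."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def intersects(self, other: "BBox") -> bool:
        # Two boxes are disjoint only if one lies entirely to one side of the other.
        return not (
            self.xmax < other.xmin or other.xmax < self.xmin
            or self.ymax < other.ymin or other.ymax < self.ymin
        )


@dataclass(frozen=True)
class FileStats:
    """Hypothetical per-file summary: path, partition value, geo column bounds."""
    path: str
    event_date: str   # non-spatial partition key, e.g. a date
    bounds: BBox      # x/y min/max of the geo column within this file


def prune(files, query_window, event_date=None):
    """Return paths of files that may contain rows matching the filters."""
    kept = []
    for f in files:
        if event_date is not None and f.event_date != event_date:
            continue  # skipped using the non-spatial partition key
        if not f.bounds.intersects(query_window):
            continue  # skipped using x/y min/max statistics only
        kept.append(f.path)
    return kept


files = [
    FileStats("a.parquet", "2024-05-29", BBox(-10.0, -10.0, 0.0, 0.0)),
    FileStats("b.parquet", "2024-05-29", BBox(5.0, 5.0, 20.0, 20.0)),
    FileStats("c.parquet", "2024-05-30", BBox(5.0, 5.0, 20.0, 20.0)),
]
window = BBox(8.0, 8.0, 12.0, 12.0)
print(prune(files, window, event_date="2024-05-29"))  # ['b.parquet']
```

In this sketch the table is partitioned by date only, and the spatial filter is applied purely from per-file bounds. How selective that is depends on how the data is clustered within files, which a spatial partition transform such as XZ2 would otherwise enforce.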