> The min/max stats are discussed in the doc (Phase 2), depending on the non-trivial encoding.
Just want to add that min/max stats filtering could be supported natively by the file format. Adding a geometry type to the Parquet spec is under discussion: https://github.com/apache/parquet-format/pull/240

Best,
Gang

On Thu, Jun 6, 2024 at 5:53 AM Szehon Ho <szehon.apa...@gmail.com> wrote:

> Hi Peter
>
> Yes, the document only concerns the predicate pushdown of geometric columns. Predicate pushdown takes two forms: 1) partition filter and 2) min/max stats. The min/max stats are discussed in the doc (Phase 2), depending on the non-trivial encoding.
>
> The evaluators are always AND'ed together, so I don't see any issue with a table that has a geo column being partitioned by another key.
>
> On another note, Jia and I thought that we may have a discussion about Snowflake geo types in a call to drill down on some details? What time zone are you folks in / what time works better? I think Jia and I are both in the Pacific time zone.
>
> Thanks
> Szehon
>
> On Wed, Jun 5, 2024 at 1:02 AM Peter Popov <peter.po...@snowflake.com> wrote:
>
>> Hi Szehon, hi Jia,
>>
>> Thank you for your replies. We now better understand the connection between the metadata and partitioning in this proposal. Supporting Mapping 1 is a great starting point, and we would like to work closer with you on bringing the support for spherical edges and other coordinate systems into Iceberg geometry.
>>
>> We have some follow-up questions regarding the partitioning (let us know if it’s better to comment directly in the document): Does this proposal imply that XZ2 partitioning is always required? In the current proposal, do you see a possibility of predicate pushdown relying on x/y min/max column metadata instead of a partition key? We see use cases where a table with a geo column can be partitioned by a different key (e.g. date) or a combination of keys. It would be great to support such use cases from the very beginning.
>>
>> Thanks,
>>
>> Peter
>>
>> On Thu, May 30, 2024 at 8:07 AM Jia Yu <ji...@apache.org> wrote:
>>
>>> Hi Dmytro,
>>>
>>> Thanks for your email. To add to Szehon's answer,
>>>
>>> 1. How to represent Snowflake Geometry and Geography type in Iceberg, given the Geo Iceberg Phase 1 design:
>>>
>>> Answer:
>>> Mapping 1 (possible): Snowflake Geometry + SRID: 4326 -> Iceberg Geometry + CRS84 + edges: Planar
>>> Mapping 2 (impossible): Snowflake Geography -> Iceberg Geometry + CRS84 + edges: Spherical
>>> Mapping 3 (impossible): Snowflake Geometry + SRID:ABCDE -> Iceberg Geometry + SRID:ABCDE + edges: Planar
>>>
>>> As Szehon mentioned, only Mapping 1 is possible because we need to support spatial query push down in Iceberg. This function relies on the Iceberg partition transform, which requires a 1:1 mapping between a value (point/polygon/linestring) and a partition key. That is: given any precision level, a polygon must produce a single ID; and the covering indicated by this single ID must fully cover the extent of the polygon. Currently, only xz2 can satisfy this requirement. If the theory from Michael Entin can be proven to be correct, then we can support Mapping 2 in Phase 2 of Geo Iceberg.
>>>
>>> Regarding Mapping 3, this requires Iceberg to be able to understand SRID / PROJJSON such that we will know min max X Y of the CRS (@Szehon, maybe Iceberg can ask the engine to provide this information?). See my answer 2.
>>>
>>> 2. Why choose projjson instead of SRID?
>>>
>>> The projjson idea was borrowed from GeoParquet because we'd like to enable possible conversion between Geo Iceberg and GeoParquet. However, I do understand that this is not a good idea for Iceberg since not many libs can parse projjson.
>>>
>>> @Szehon Is there a way that we can support both SRID and PROJJSON in Geo Iceberg?
>>>
>>> It is also worth noting that, although there are many libs that can parse SRID and perform look-up in the EPSG database, the license of the EPSG database is NOT compatible with the Apache Software Foundation. That means: Iceberg still cannot parse / understand SRID.
>>>
>>> Thanks,
>>> Jia
>>>
>>> On Wed, May 29, 2024 at 11:08 AM Szehon Ho <szehon.apa...@gmail.com> wrote:
>>>
>>>> Hi Dmytro
>>>>
>>>> Thank you for looking through the proposal and excited to hear from you guys! I am not a 'geo expert' and I will definitely need to pull in Jia Yu for some of these points.
>>>>
>>>> Although most calculations are done on the query engine, Iceberg reference implementations (i.e., Java, Python) do have to support a few calculations to handle filter push down:
>>>>
>>>> 1. push down of the proposed Geospatial transforms ST_COVERS, ST_COVERED_BY, and ST_INTERSECTS
>>>> 2. evaluation of proposed Geospatial partition transform XZ2. As you may have seen, this was chosen as it's the only standard one today that solves the 'boundary object' problem, still preserving 1-to-1 mapping of row => partition value.
>>>>
>>>> This is the primary rationale for choosing the values, as these were implemented in the GeoLake and Havasu projects (Iceberg forks that sparked the proposal) based on Geometry type (edge=planar, crs=OGC:CRS84/SRID=4326).
>>>>
>>>>> 2. As you mentioned [2] in the proposal there are difficulties with supporting the full PROJJSON specification of the SRS. From our experience most of the use-cases do not require the full definition of the SRS, in fact that definition is only needed when converting between coordinate systems. On the other hand, it’s often needed to check whether two geometry columns have the same coordinate system, for example when joining two columns from different data providers.
>>>>>
>>>>> To address this we would like to propose including the option to specify the SRS with only a SRID in phase 1. The query engine may choose to treat it as an opaque identifier or make a look-up in the EPSG database if supported.
>>>>>
>>>>
>>>> The way to specify CRS definition is actually taken from GeoParquet [1], I think we are not bound to follow it if there are better options. I feel we might need to at least list out supported configurations in the spec, though. There is some conversation on the doc here about this [2]. Basically:
>>>>
>>>> 1. XZ2 assumes planar edges. This is a feature of the algorithm, based on the original paper. A possible solution to spherical edge is proposed by Michael Entin here: [3], please feel free to evaluate.
>>>> 2. XZ2 needs to know the coordinate range. According to Jia's comments, this needs parsing of the CRS. Can it be done with SRID alone?
>>>>
>>>>> 1. In the first version of the specification Phase1 it is mentioned as the version focused on the planar geometry model with a CRS system fixed on 4326. In this model, Snowflake would not be able to map our Geography type since it is based on the spherical Geography model. Given that Snowflake supports both edge types, we would like to better understand how to map them to the proposed Geometry type and its metadata.
>>>>>
>>>>> - How is the edge type supposed to be interpreted by the query engine? Is it necessary for the system to adhere to the edge model for geospatial functions, or can it use the model that it supports or let the customer choose it? Will it affect the bounding box or other row group metadata?
>>>>> - Is there any reason why the flexible model has to be postponed to further iterations?
>>>>>   Would it be more extensible to support mutable edge type from the Phase 1, but allow systems to ignore it if they do not support the spherical computation model?
>>>>>
>>>>
>>>> It may be answered by the previous paragraph in regards to XZ2.
>>>>
>>>> 1. If we get XZ2 to work with a more variable CRS without requiring full PROJJSON specification, it seems it is a path to support Snowflake Geometry type?
>>>> 2. If we get another one-to-one partition function on spherical edges, like the one proposed by Michael, it seems a path to support Snowflake Geography type?
>>>>
>>>> Does that sound correct? As for why certain things are marked as Phase 1, they are just chosen so we can all agree on an initial design and iterate faster and not set in stone, maybe the path 1 is possible to do quickly, for example.
>>>>
>>>> Also, I am not sure about handling evaluation of ST_COVERS, ST_COVERED_BY, and ST_INTERSECTS (how easy to handle different CRS + spherical edges). I will leave it to Jia.
>>>>
>>>> Thanks!
>>>> Szehon
>>>>
>>>> [1]: https://github.com/opengeospatial/geoparquet/blob/main/format-specs/geoparquet.md#column-metadata
>>>> [2]: https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI/edit?disco=AAABL-z6xXk
>>>> [3]: https://docs.google.com/document/d/1tG13UpdNH3i0bVkjFLsE2kXEXCuw1XRpAC2L2qCUox0/edit
>>>>
>>>> On Wed, May 29, 2024 at 8:30 AM Dmytro Koval <dmytro.ko...@snowflake.com.invalid> wrote:
>>>>
>>>>> Dear Szehon and Iceberg Community,
>>>>>
>>>>> This is Dmytro, Peter, Aihua, and Tyler from Snowflake. As part of our desire to be more active in the Iceberg community, we’ve been looking over this geospatial proposal.
>>>>> We’re excited geospatial is getting traction, as we see a lot of geo usage within Snowflake, and expect that usage to carry over to our Iceberg offerings soon. After reviewing the proposal, we have some questions we’d like to pose given our experience with geospatial support in Snowflake.
>>>>>
>>>>> We would like to clarify two aspects of the proposal: handling of the spherical model and definition of the spatial reference system, both of which have a big impact on the interoperability with Snowflake and other query engines and Geo processing systems.
>>>>>
>>>>> Let us first share some context about geospatial types at Snowflake; geo experts will certainly be familiar with this context already, but for the sake of others we want to err on the side of being explicit and clear. Snowflake supports two Geospatial types [1]:
>>>>> - Geography – uses a spherical approximation of the earth for all the computations. It does not perfectly represent the earth, but allows getting accurate results on WGS84 coordinates, used by GPS, without any need to perform coordinate system reprojections. It is also quite fast for end-to-end computations. In general, it has fewer distortions compared to the 2D planar model.
>>>>> - Geometry – uses a planar Euclidean geometry model. Geometric computations are simpler, but require transforming the data between coordinate systems to minimize the distortion. The Geometry data type allows setting a spatial reference system for each row using the SRID. The binary geospatial functions are only allowed on geometries with the same SRID. The only function that interprets SRID is ST_TRANSFORM, which allows conversion between different SRSs.
>>>>>
>>>>> Geography
>>>>>
>>>>> Geometry
>>>>>
>>>>> Given the choice of two types and a set of operations on top of them, the majority of Snowflake users select the Geography type to represent their geospatial data.
>>>>>
>>>>> From our perspective, Iceberg users would benefit most from being given the flexibility to store and process data using the model that better fits their needs and specific use cases.
>>>>>
>>>>> Therefore, we would like to ask some design clarifying questions, important for interoperability:
>>>>>
>>>>> 1. In the first version of the specification Phase1 it is mentioned as the version focused on the planar geometry model with a CRS system fixed on 4326. In this model, Snowflake would not be able to map our Geography type since it is based on the spherical Geography model. Given that Snowflake supports both edge types, we would like to better understand how to map them to the proposed Geometry type and its metadata.
>>>>>
>>>>> - How is the edge type supposed to be interpreted by the query engine? Is it necessary for the system to adhere to the edge model for geospatial functions, or can it use the model that it supports or let the customer choose it? Will it affect the bounding box or other row group metadata?
>>>>> - Is there any reason why the flexible model has to be postponed to further iterations? Would it be more extensible to support mutable edge type from the Phase 1, but allow systems to ignore it if they do not support the spherical computation model?
>>>>>
>>>>> 2. As you mentioned [2] in the proposal there are difficulties with supporting the full PROJJSON specification of the SRS.
>>>>> From our experience most of the use-cases do not require the full definition of the SRS, in fact that definition is only needed when converting between coordinate systems. On the other hand, it’s often needed to check whether two geometry columns have the same coordinate system, for example when joining two columns from different data providers.
>>>>>
>>>>> To address this we would like to propose including the option to specify the SRS with only a SRID in phase 1. The query engine may choose to treat it as an opaque identifier or make a look-up in the EPSG database if supported.
>>>>>
>>>>> Thank you again for driving this effort forward. We look forward to hearing your thoughts.
>>>>>
>>>>> [1] https://docs.snowflake.com/en/sql-reference/data-types-geospatial#understanding-the-differences-between-geography-and-geometry
>>>>> [2] https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI/edit#heading=h.oruaqt3nxcaf
>>>>>
>>>>> On 2024/05/02 00:41:52 Szehon Ho wrote:
>>>>> > Hi everyone,
>>>>> >
>>>>> > We have created a formal proposal for adding Geospatial support to Iceberg.
>>>>> >
>>>>> > Please read the following for details.
>>>>> >
>>>>> > - Github Proposal: https://github.com/apache/iceberg/issues/10260
>>>>> > - Proposal Doc: https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI
>>>>> >
>>>>> > Note that this proposal is built on existing extensive research and POC implementations (Geolake, Havasu). Special thanks to Jia Yu and Kristin Cowalcijk from Wherobots/Geolake for extensive consultation and help in writing this proposal, as well as support from Yuanyuan Zhang from Geolake.
>>>>> >
>>>>> > We would love to get more feedback for this proposal from the wider community and eventually discuss this in a community sync.
>>>>> >
>>>>> > Thanks
>>>>> > Szehon
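
For readers following the pushdown discussion in this thread, here is a minimal sketch of how a partition filter on a non-geo key (e.g. date) and an x/y min/max stats filter on a geo column could be AND'ed together during file pruning. It is only an illustration under stated assumptions: `DataFile`, `partition_evaluator`, `stats_evaluator`, and `plan_files` are hypothetical names, not Iceberg APIs, and the bounding-box test assumes planar edges.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Axis-aligned bounding box: (xmin, ymin, xmax, ymax)
BBox = Tuple[float, float, float, float]


@dataclass(frozen=True)
class DataFile:
    """Hypothetical per-file metadata; not Iceberg's actual manifest schema."""
    path: str
    partition_date: str  # value of a non-geo partition key, e.g. a date
    geo_bbox: BBox       # x/y min/max stats of the geometry column


def bbox_intersects(a: BBox, b: BBox) -> bool:
    """True when two axis-aligned boxes overlap (planar edges assumed)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def partition_evaluator(date: str) -> Callable[[DataFile], bool]:
    """Keep only files whose partition value equals the requested date."""
    return lambda f: f.partition_date == date


def stats_evaluator(query_bbox: BBox) -> Callable[[DataFile], bool]:
    """Keep only files whose stats box overlaps the query box."""
    return lambda f: bbox_intersects(f.geo_bbox, query_bbox)


def plan_files(
    files: List[DataFile], evaluators: List[Callable[[DataFile], bool]]
) -> List[DataFile]:
    """A file survives pruning only if every evaluator accepts it (AND)."""
    return [f for f in files if all(ev(f) for ev in evaluators)]


files = [
    DataFile("a.parquet", "2024-06-01", (-123.0, 37.0, -122.0, 38.0)),
    DataFile("b.parquet", "2024-06-01", (2.0, 48.0, 3.0, 49.0)),
    DataFile("c.parquet", "2024-06-02", (-123.0, 37.0, -122.0, 38.0)),
]
query = (-122.6, 37.6, -122.3, 37.9)
selected = plan_files(
    files, [partition_evaluator("2024-06-01"), stats_evaluator(query)]
)
print([f.path for f in selected])  # ['a.parquet']
```

In this sketch the table is partitioned by date only, and the geo column contributes pruning purely through its min/max stats, which is the use case Peter asks about.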