Hi all,

Please take a look at the proposed spec change to support the Geo type for V3 in https://github.com/apache/iceberg/pull/10981, and comment or otherwise let me know your thoughts.
Just as an FYI, it incorporated the feedback from our last meeting (with Snowflake and Wherobots engineers).

Thanks,
Szehon

On Wed, Jun 26, 2024 at 7:29 PM Szehon Ho <szehon.apa...@gmail.com> wrote:
> Hi
>
> It was great to meet in person with Snowflake engineers and we had a good discussion on the paths forward.
>
> Meeting notes for the Snowflake-Iceberg sync:
>
> - The proposed Iceberg Geometry type defaults to (edges=planar, crs=CRS84).
> - Snowflake has two types, Geography (spherical) and Geometry (planar, with customizable CRS). The data layout/encoding is the same for both types. Let's see how we can support each in the Iceberg type, especially wrt Iceberg partition/file pruning.
> - Geography type support
>   - Main concern is the need for a suitable partition transform for partition-level filtering; the candidate is Michael Entin's proposal <https://docs.google.com/document/d/1tG13UpdNH3i0bVkjFLsE2kXEXCuw1XRpAC2L2qCUox0/edit>.
>   - Secondary concern is file- and RG-level filtering. Gang's Parquet proposal <https://github.com/apache/parquet-format/pull/240/files> allows storage of S2 / H3 IDs in Parquet stats, so we can also leverage that in Iceberg pruning code (the Google and Uber libraries are compatible).
> - Geometry type support
>   - Main concern is that the partition transform needs to understand the CRS, but this can be solved by having the XZ2 transform created with a customizable min/max lat/long range (it's all it needs).
> - Should (CRS, edges) be stored properties on the Geography type in Phase 1?
>   - Should be fine to store, while only allowing defaults in Phase 1.
>   - Concern 1: if edges is stored, there will be asks to store other properties like (orientation, epoch). Solution is to punt these follow-on properties for later.
>   - Concern 2: if crs is stored, what format? PROJJSON vs SRID. Solution is to leave it as a string.
>   - Concern 3: if crs is stored as a string, Iceberg cannot read it. This should be ok, as we only need it for the XZ2 transform, where the user already passes in the info from the CRS (up to the user to make sure these align).
>
> Thanks
> Szehon
>
> On Tue, Jun 18, 2024 at 12:23 PM Szehon Ho <szehon.apa...@gmail.com> wrote:
>> Jia and I will sync with the Snowflake folks to see if we can have a solution, or a roadmap to a solution, in the proposal.
>>
>> Thanks JB for the interest! By the way, I want to schedule a meeting to go over the proposal; it seems there's good feedback from folks on the geo side (and even the Parquet community), but not too many eyes/feedback from other folks/PMC in the Iceberg community. This might be due to lack of familiarity / time to read through it all. In fact, a lot of the advanced discussions like this one are for Phase 2 items, and the Phase 1 items are relatively straightforward, so I wanted to explain that. As I know it's summer vacation for some folks, we can do this in a week or in early July; hope that sounds good to everyone.
>>
>> Thanks,
>> Szehon
>>
>> On Tue, Jun 18, 2024 at 1:54 AM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>> Hi Jia
>>>
>>> Thanks for the update. I'm gonna re-read the whole thread and document to have a better understanding.
>>>
>>> Thanks!
>>> Regards
>>> JB
>>>
>>> On Mon, Jun 17, 2024 at 7:44 PM Jia Yu <ji...@apache.org> wrote:
>>>> Hi Snowflake folks,
>>>>
>>>> Please let me know if you have other questions regarding the proposal. If any, Szehon and I can set up a zoom call with you guys to clarify some details. We are in the Pacific time zone.
>>>> If you are in Europe, maybe early morning Pacific Time works best for you?
>>>>
>>>> Thanks,
>>>> Jia
>>>>
>>>> On Wed, Jun 5, 2024 at 6:28 PM Gang Wu <ust...@gmail.com> wrote:
>>>>> > The min/max stats are discussed in the doc (Phase 2), depending on the non-trivial encoding.
>>>>>
>>>>> Just want to add that min/max stats filtering could be supported by the file format natively. Adding a geometry type to the Parquet spec is under discussion: https://github.com/apache/parquet-format/pull/240
>>>>>
>>>>> Best,
>>>>> Gang
>>>>>
>>>>> On Thu, Jun 6, 2024 at 5:53 AM Szehon Ho <szehon.apa...@gmail.com> wrote:
>>>>>> Hi Peter
>>>>>>
>>>>>> Yes, the document only concerns the predicate pushdown of the geometric column. Predicate pushdown takes two forms: 1) partition filters and 2) min/max stats. The min/max stats are discussed in the doc (Phase 2), depending on the non-trivial encoding.
>>>>>>
>>>>>> The evaluators are always AND'ed together, so I don't see any issue of partitioning with another key not working on a table with a geo column.
>>>>>>
>>>>>> On another note, Jia and I thought that we may have a discussion about Snowflake geo types in a call to drill down on some details? What time zone are you folks in / what time works better? I think Jia and I are both in the Pacific time zone.
>>>>>>
>>>>>> Thanks
>>>>>> Szehon
>>>>>>
>>>>>> On Wed, Jun 5, 2024 at 1:02 AM Peter Popov <peter.po...@snowflake.com> wrote:
>>>>>>> Hi Szehon, hi Jia,
>>>>>>>
>>>>>>> Thank you for your replies. We now better understand the connection between the metadata and partitioning in this proposal. Supporting Mapping 1 is a great starting point, and we would like to work closer with you on bringing support for spherical edges and other coordinate systems into Iceberg geometry.
>>>>>>>
>>>>>>> We have some follow-up questions regarding the partitioning (let us know if it's better to comment directly in the document): Does this proposal imply that XZ2 partitioning is always required? In the current proposal, do you see a possibility of predicate pushdown relying on x/y min/max column metadata instead of a partition key? We see use cases where a table with a geo column can be partitioned by a different key (e.g. date) or a combination of keys. It would be great to support such use cases from the very beginning.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Peter
>>>>>>>
>>>>>>> On Thu, May 30, 2024 at 8:07 AM Jia Yu <ji...@apache.org> wrote:
>>>>>>>> Hi Dmytro,
>>>>>>>>
>>>>>>>> Thanks for your email. To add to Szehon's answer,
>>>>>>>>
>>>>>>>> 1. How to represent the Snowflake Geometry and Geography types in Iceberg, given the Geo Iceberg Phase 1 design:
>>>>>>>>
>>>>>>>> Answer:
>>>>>>>> Mapping 1 (possible): Snowflake Geometry + SRID:4326 -> Iceberg Geometry + CRS84 + edges: Planar
>>>>>>>> Mapping 2 (impossible): Snowflake Geography -> Iceberg Geometry + CRS84 + edges: Spherical
>>>>>>>> Mapping 3 (impossible): Snowflake Geometry + SRID:ABCDE -> Iceberg Geometry + SRID:ABCDE + edges: Planar
>>>>>>>>
>>>>>>>> As Szehon mentioned, only Mapping 1 is possible because we need to support spatial query push down in Iceberg.
>>>>>>>> This function relies on the Iceberg partition transform, which requires a 1:1 mapping between a value (point/polygon/linestring) and a partition key. That is: given any precision level, a polygon must produce a single ID, and the covering indicated by this single ID must fully cover the extent of the polygon. Currently, only XZ2 can satisfy this requirement. If the theory from Michael Entin can be proven correct, then we can support Mapping 2 in Phase 2 of Geo Iceberg.
>>>>>>>>
>>>>>>>> Regarding Mapping 3, this requires Iceberg to be able to understand SRID / PROJJSON such that we will know the min/max X/Y of the CRS (@Szehon, maybe Iceberg can ask the engine to provide this information?). See my answer 2.
>>>>>>>>
>>>>>>>> 2. Why choose PROJJSON instead of SRID?
>>>>>>>>
>>>>>>>> The PROJJSON idea was borrowed from GeoParquet because we'd like to enable possible conversion between Geo Iceberg and GeoParquet. However, I do understand that this is not a good idea for Iceberg since not many libs can parse PROJJSON.
>>>>>>>>
>>>>>>>> @Szehon Is there a way that we can support both SRID and PROJJSON in Geo Iceberg?
>>>>>>>>
>>>>>>>> It is also worth noting that, although there are many libs that can parse SRID and perform a look-up in the EPSG database, the license of the EPSG database is NOT compatible with the Apache Software Foundation. That means: Iceberg still cannot parse / understand SRID.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Jia
>>>>>>>>
>>>>>>>> On Wed, May 29, 2024 at 11:08 AM Szehon Ho <szehon.apa...@gmail.com> wrote:
>>>>>>>>> Hi Dmytro
>>>>>>>>>
>>>>>>>>> Thank you for looking through the proposal, and excited to hear from you guys! I am not a 'geo expert' and I will definitely need to pull in Jia Yu for some of these points.
>>>>>>>>>
>>>>>>>>> Although most calculations are done by the query engine, the Iceberg reference implementations (i.e., Java, Python) do have to support a few calculations to handle filter push down:
>>>>>>>>>
>>>>>>>>> 1. push down of the proposed Geospatial transforms ST_COVERS, ST_COVERED_BY, and ST_INTERSECTS
>>>>>>>>> 2. evaluation of the proposed Geospatial partition transform XZ2. As you may have seen, this was chosen as it's the only standard one today that solves the 'boundary object' problem while still preserving a 1-to-1 mapping of row => partition value.
>>>>>>>>>
>>>>>>>>> This is the primary rationale for choosing the values, as these were implemented in the GeoLake and Havasu projects (Iceberg forks that sparked the proposal) based on the Geometry type (edge=planar, crs=OGC:CRS84 / SRID=4326).
>>>>>>>>>
>>>>>>>>>> 2. As you mentioned [2] in the proposal there are difficulties with supporting the full PROJJSON specification of the SRS. From our experience, most of the use cases do not require the full definition of the SRS; in fact, that definition is only needed when converting between coordinate systems. On the other hand, it's often needed to check whether two geometry columns have the same coordinate system, for example when joining two columns from different data providers.
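>>>>>>>>>>
>>>>>>>>>> For illustration only (a minimal sketch with hypothetical names, not part of the proposal or of any existing Snowflake or Iceberg API), such a same-CRS check only needs the SRID treated as an opaque identifier, with no parsing of the full SRS definition:
>>>>>>>>>>
>>>>>>>>>>   // Hypothetical engine-side guard before a binary geospatial operation or join.
>>>>>>>>>>   // SRIDs are compared as opaque identifiers; no EPSG lookup or PROJJSON parsing is needed.
>>>>>>>>>>   final class SridCheck {
>>>>>>>>>>     private SridCheck() {}
>>>>>>>>>>
>>>>>>>>>>     static void requireSameSrid(int leftSrid, int rightSrid) {
>>>>>>>>>>       if (leftSrid != rightSrid) {
>>>>>>>>>>         throw new IllegalArgumentException(
>>>>>>>>>>             "Geometry columns use different SRIDs: " + leftSrid + " vs " + rightSrid);
>>>>>>>>>>       }
>>>>>>>>>>     }
>>>>>>>>>>   }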
>>>>>>>>>> To address this we would like to propose including the option to specify the SRS with only a SRID in Phase 1. The query engine may choose to treat it as an opaque identifier or make a look-up in the EPSG database if supported.
>>>>>>>>>
>>>>>>>>> The way to specify the CRS definition is actually taken from GeoParquet [1]; I think we are not bound to follow it if there are better options. I feel we might need to at least list out the supported configurations in the spec, though. There is some conversation on the doc here about this [2]. Basically:
>>>>>>>>>
>>>>>>>>> 1. XZ2 assumes planar edges. This is a feature of the algorithm, based on the original paper. A possible solution for spherical edges is proposed by Michael Entin here [3]; please feel free to evaluate.
>>>>>>>>> 2. XZ2 needs to know the coordinate range. According to Jia's comments, this needs parsing of the CRS. Can it be done with the SRID alone?
>>>>>>>>>
>>>>>>>>>> 1. In the first version of the specification, Phase 1 is described as focused on the planar geometry model with a CRS fixed to 4326. In this model, Snowflake would not be able to map our Geography type since it is based on the spherical Geography model. Given that Snowflake supports both edge types, we would like to better understand how to map them to the proposed Geometry type and its metadata.
>>>>>>>>>>
>>>>>>>>>> - How is the edge type supposed to be interpreted by the query engine? Is it necessary for the system to adhere to the edge model for geospatial functions, or can it use the model that it supports or let the customer choose it? Will it affect the bounding box or other row group metadata?
>>>>>>>>>>
>>>>>>>>>> - Is there any reason why the flexible model has to be postponed to further iterations? Would it be more extensible to support a mutable edge type from Phase 1, but allow systems to ignore it if they do not support the spherical computation model?
>>>>>>>>>
>>>>>>>>> It may be answered by the previous paragraph in regards to XZ2.
>>>>>>>>>
>>>>>>>>> 1. If we get XZ2 to work with a more variable CRS without requiring the full PROJJSON specification, it seems it is a path to support the Snowflake Geometry type?
>>>>>>>>> 2. If we get another one-to-one partition function on spherical edges, like the one proposed by Michael, it seems a path to support the Snowflake Geography type?
>>>>>>>>>
>>>>>>>>> Does that sound correct? As for why certain things are marked as Phase 1, they are just chosen so we can all agree on an initial design and iterate faster, and are not set in stone; maybe path 1 is possible to do quickly, for example.
>>>>>>>>>
>>>>>>>>> Also, I am not sure about handling evaluation of ST_COVERS, ST_COVERED_BY, and ST_INTERSECTS (how easy it is to handle different CRS + spherical edges). I will leave it to Jia.
>>>>>>>>>
>>>>>>>>> Thanks!
>>>>>>>>> Szehon
>>>>>>>>>
>>>>>>>>> [1]: https://github.com/opengeospatial/geoparquet/blob/main/format-specs/geoparquet.md#column-metadata
>>>>>>>>> [2]: https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI/edit?disco=AAABL-z6xXk
>>>>>>>>> [3]: https://docs.google.com/document/d/1tG13UpdNH3i0bVkjFLsE2kXEXCuw1XRpAC2L2qCUox0/edit
>>>>>>>>>
>>>>>>>>> On Wed, May 29, 2024 at 8:30 AM Dmytro Koval <dmytro.ko...@snowflake.com.invalid> wrote:
>>>>>>>>>> Dear Szehon and Iceberg Community,
>>>>>>>>>>
>>>>>>>>>> This is Dmytro, Peter, Aihua, and Tyler from Snowflake. As part of our desire to be more active in the Iceberg community, we've been looking over this geospatial proposal. We're excited geospatial is getting traction, as we see a lot of geo usage within Snowflake, and expect that usage to carry over to our Iceberg offerings soon. After reviewing the proposal, we have some questions we'd like to pose given our experience with geospatial support in Snowflake.
>>>>>>>>>>
>>>>>>>>>> We would like to clarify two aspects of the proposal: handling of the spherical model and definition of the spatial reference system, both of which have a big impact on interoperability with Snowflake and other query engines and Geo processing systems.
>>>>>>>>>>
>>>>>>>>>> Let us first share some context about geospatial types at Snowflake; geo experts will certainly be familiar with this context already, but for the sake of others we want to err on the side of being explicit and clear. Snowflake supports two Geospatial types [1]:
>>>>>>>>>> - Geography – uses a spherical approximation of the earth for all the computations. It does not perfectly represent the earth, but allows getting accurate results on WGS84 coordinates (as used by GPS) without any need to perform coordinate system reprojections. It is also quite fast for end-to-end computations. In general, it has fewer distortions compared to the 2D planar model.
>>>>>>>>>> - Geometry – uses a planar Euclidean geometry model. Geometric computations are simpler, but require transforming the data between coordinate systems to minimize distortion. The Geometry data type allows setting a spatial reference system for each row using the SRID. The binary geospatial functions are only allowed on geometries with the same SRID. The only function that interprets the SRID is ST_TRANSFORM, which allows conversion between different SRSs.
>>>>>>>>>>
>>>>>>>>>> Given the choice of two types and a set of operations on top of them, the majority of Snowflake users select the Geography type to represent their geospatial data.
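>>>>>>>>>>
>>>>>>>>>> As a rough illustration of why the edge model matters (a minimal, hypothetical sketch, not Snowflake's or Iceberg's actual code), the same pair of WGS84 coordinates gives different results depending on whether an edge is treated as a straight line in lon/lat space or as a great-circle arc:
>>>>>>>>>>
>>>>>>>>>>   // Illustrative sketch only: contrasts planar and spherical edge interpretation.
>>>>>>>>>>   final class EdgeModelSketch {
>>>>>>>>>>     private static final double EARTH_RADIUS_M = 6371008.8; // mean Earth radius (assumption)
>>>>>>>>>>
>>>>>>>>>>     private EdgeModelSketch() {}
>>>>>>>>>>
>>>>>>>>>>     // Spherical model: great-circle (haversine) distance in meters.
>>>>>>>>>>     static double sphericalDistanceMeters(double lon1, double lat1, double lon2, double lat2) {
>>>>>>>>>>       double dLat = Math.toRadians(lat2 - lat1);
>>>>>>>>>>       double dLon = Math.toRadians(lon2 - lon1);
>>>>>>>>>>       double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
>>>>>>>>>>           + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
>>>>>>>>>>               * Math.sin(dLon / 2) * Math.sin(dLon / 2);
>>>>>>>>>>       return 2 * EARTH_RADIUS_M * Math.asin(Math.sqrt(a));
>>>>>>>>>>     }
>>>>>>>>>>
>>>>>>>>>>     // Planar model: Euclidean distance in degree units, ignoring the earth's curvature.
>>>>>>>>>>     static double planarDistanceDegrees(double lon1, double lat1, double lon2, double lat2) {
>>>>>>>>>>       double dx = lon2 - lon1;
>>>>>>>>>>       double dy = lat2 - lat1;
>>>>>>>>>>       return Math.sqrt(dx * dx + dy * dy);
>>>>>>>>>>     }
>>>>>>>>>>   }
>>>>>>>>>>
>>>>>>>>>> (For distant points, a spherical edge can also bow outside the planar lon/lat bounding box of its endpoints, which is one way the edge model can affect bounding boxes and similar metadata.)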
>>>>>>>>>> From our perspective, Iceberg users would benefit most from being given the flexibility to store and process data using the model that better fits their needs and specific use cases.
>>>>>>>>>>
>>>>>>>>>> Therefore, we would like to ask some clarifying design questions, important for interoperability:
>>>>>>>>>>
>>>>>>>>>> 1. In the first version of the specification, Phase 1 is described as focused on the planar geometry model with a CRS fixed to 4326. In this model, Snowflake would not be able to map our Geography type since it is based on the spherical Geography model. Given that Snowflake supports both edge types, we would like to better understand how to map them to the proposed Geometry type and its metadata.
>>>>>>>>>>
>>>>>>>>>> - How is the edge type supposed to be interpreted by the query engine? Is it necessary for the system to adhere to the edge model for geospatial functions, or can it use the model that it supports or let the customer choose it? Will it affect the bounding box or other row group metadata?
>>>>>>>>>>
>>>>>>>>>> - Is there any reason why the flexible model has to be postponed to further iterations? Would it be more extensible to support a mutable edge type from Phase 1, but allow systems to ignore it if they do not support the spherical computation model?
>>>>>>>>>>
>>>>>>>>>> 2. As you mentioned [2] in the proposal there are difficulties with supporting the full PROJJSON specification of the SRS. From our experience, most of the use cases do not require the full definition of the SRS; in fact, that definition is only needed when converting between coordinate systems. On the other hand, it's often needed to check whether two geometry columns have the same coordinate system, for example when joining two columns from different data providers.
>>>>>>>>>>
>>>>>>>>>> To address this we would like to propose including the option to specify the SRS with only a SRID in Phase 1. The query engine may choose to treat it as an opaque identifier or make a look-up in the EPSG database if supported.
>>>>>>>>>>
>>>>>>>>>> Thank you again for driving this effort forward. We look forward to hearing your thoughts.
>>>>>>>>>>
>>>>>>>>>> [1] https://docs.snowflake.com/en/sql-reference/data-types-geospatial#understanding-the-differences-between-geography-and-geometry
>>>>>>>>>>
>>>>>>>>>> [2] https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI/edit#heading=h.oruaqt3nxcaf
>>>>>>>>>>
>>>>>>>>>> On 2024/05/02 00:41:52 Szehon Ho wrote:
>>>>>>>>>> > Hi everyone,
>>>>>>>>>> >
>>>>>>>>>> > We have created a formal proposal for adding Geospatial support to Iceberg.
>>>>>>>>>> >
>>>>>>>>>> > Please read the following for details.
>>>>>>>>>> >
>>>>>>>>>> > - Github Proposal: https://github.com/apache/iceberg/issues/10260
>>>>>>>>>> > - Proposal Doc: https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI
>>>>>>>>>> >
>>>>>>>>>> > Note that this proposal is built on existing extensive research and POC implementations (Geolake, Havasu). Special thanks to Jia Yu and Kristin Cowalcijk from Wherobots/Geolake for extensive consultation and help in writing this proposal, as well as support from Yuanyuan Zhang from Geolake.
>>>>>>>>>> >
>>>>>>>>>> > We would love to get more feedback for this proposal from the wider community and eventually discuss this in a community sync.
>>>>>>>>>> >
>>>>>>>>>> > Thanks
>>>>>>>>>> > Szehon
>>>>>>>>>> >