Hi Jia,

Thanks for the update. I'm going to re-read the whole thread and the document to get a better understanding.
Thanks!

Regards,
JB

On Mon, Jun 17, 2024 at 7:44 PM Jia Yu <ji...@apache.org> wrote:

Hi Snowflake folks,

Please let me know if you have any other questions regarding the proposal. If so, Szehon and I can set up a Zoom call with you to clarify some details. We are in the Pacific time zone; if you are in Europe, maybe early morning Pacific Time works best for you?

Thanks,
Jia

On Wed, Jun 5, 2024 at 6:28 PM Gang Wu <ust...@gmail.com> wrote:

> The min/max stats are discussed in the doc (Phase 2), depending on the non-trivial encoding.

Just want to add that min/max stats filtering could be supported natively by the file format. Adding a geometry type to the Parquet spec is under discussion: https://github.com/apache/parquet-format/pull/240

Best,
Gang

On Thu, Jun 6, 2024 at 5:53 AM Szehon Ho <szehon.apa...@gmail.com> wrote:

Hi Peter,

Yes, the document only concerns predicate pushdown on the geometry column. Predicate pushdown takes two forms: 1) partition filters and 2) min/max stats. The min/max stats are discussed in the doc (Phase 2), depending on the non-trivial encoding.

The evaluators are always AND'ed together, so I don't see any issue with partitioning by another key on a table that has a geo column.

On another note, Jia and I thought we might have a discussion about Snowflake geo types in a call to drill down on some details. What time zone are you folks in / what time works better? I think Jia and I are both in the Pacific time zone.

Thanks,
Szehon
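To make the two pushdown forms concrete: a minimal sketch, in Python, of how a scan might AND a non-geo partition filter with per-file geometry min/max (bounding-box) stats. The field names (partition_day, x_min, ...) are illustrative placeholders, not from the Iceberg spec.

```python
from dataclasses import dataclass

@dataclass
class FileStats:
    partition_day: str          # partition value from a non-geo key, e.g. days(ts)
    x_min: float; x_max: float  # hypothetical per-file geometry bbox stats
    y_min: float; y_max: float

def bbox_intersects(f: FileStats, qx0, qy0, qx1, qy1) -> bool:
    # min/max stats evaluator: can this file contain rows intersecting the query window?
    return not (f.x_max < qx0 or qx1 < f.x_min or f.y_max < qy0 or qy1 < f.y_min)

def may_match(f: FileStats, day: str, window) -> bool:
    # Evaluators are AND'ed: a file survives only if every filter may match,
    # so a date partition key and a geo min/max filter compose naturally.
    return f.partition_day == day and bbox_intersects(f, *window)

files = [FileStats("2024-06-01", -122.5, -122.3, 37.6, 37.9),
         FileStats("2024-06-02", 2.2, 2.5, 48.8, 48.9)]
window = (-123.0, 37.0, -122.0, 38.0)  # query bbox around San Francisco
print([f for f in files if may_match(f, "2024-06-01", window)])  # keeps only the first file
```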
On Wed, Jun 5, 2024 at 1:02 AM Peter Popov <peter.po...@snowflake.com> wrote:

Hi Szehon, hi Jia,

Thank you for your replies. We now better understand the connection between the metadata and partitioning in this proposal. Supporting Mapping 1 is a great starting point, and we would like to work more closely with you on bringing support for spherical edges and other coordinate systems into Iceberg geometry.

We have some follow-up questions regarding the partitioning (let us know if it's better to comment directly in the document): Does this proposal imply that XZ2 partitioning is always required? In the current proposal, do you see a possibility of predicate pushdown relying on x/y min/max column metadata instead of a partition key? We see use cases where a table with a geo column is partitioned by a different key (e.g. date) or a combination of keys. It would be great to support such use cases from the very beginning.

Thanks,
Peter

On Thu, May 30, 2024 at 8:07 AM Jia Yu <ji...@apache.org> wrote:

Hi Dmytro,

Thanks for your email. To add to Szehon's answer:

1. How to represent the Snowflake Geometry and Geography types in Iceberg, given the Geo Iceberg Phase 1 design:

Answer:
Mapping 1 (possible): Snowflake Geometry + SRID:4326 -> Iceberg Geometry + CRS84 + edges: planar
Mapping 2 (impossible): Snowflake Geography -> Iceberg Geometry + CRS84 + edges: spherical
Mapping 3 (impossible): Snowflake Geometry + SRID:ABCDE -> Iceberg Geometry + SRID:ABCDE + edges: planar

As Szehon mentioned, only Mapping 1 is possible, because we need to support spatial query pushdown in Iceberg. This function relies on the Iceberg partition transform, which requires a 1:1 mapping between a value (point/polygon/linestring) and a partition key. That is: given any precision level, a polygon must produce a single ID, and the covering indicated by this single ID must fully cover the extent of the polygon. Currently, only XZ2 can satisfy this requirement. If the theory from Michael Entin can be proven correct, then we can support Mapping 2 in Phase 2 of Geo Iceberg.

Regarding Mapping 3, this requires Iceberg to be able to understand SRID / PROJJSON so that we know the min/max X and Y of the CRS (@Szehon, maybe Iceberg can ask the engine to provide this information?). See my answer 2.

2. Why choose PROJJSON instead of SRID?

The PROJJSON idea was borrowed from GeoParquet because we'd like to enable conversion between Geo Iceberg and GeoParquet. However, I do understand that this is not a good fit for Iceberg, since not many libraries can parse PROJJSON.

@Szehon Is there a way that we can support both SRID and PROJJSON in Geo Iceberg?

It is also worth noting that, although there are many libraries that can parse an SRID and perform a look-up in the EPSG database, the license of the EPSG database is NOT compatible with the Apache Software Foundation. That means Iceberg still cannot parse / understand SRIDs.

Thanks,
Jia
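A minimal sketch of the 1:1 property Jia describes, loosely following the XZ-ordering idea: each quadtree cell is notionally enlarged to twice its width and height, and a geometry maps to the single deepest cell whose enlarged extent still covers its bounding box. This is a simplified illustration with made-up parameters, not the algorithm from the proposal or from GeoLake/Havasu.

```python
def xz2_cell(bbox, max_level=16, domain=(-180.0, -90.0, 180.0, 90.0)):
    """Map a bounding box (x0, y0, x1, y1) to a single quadtree path.

    Descend the quadtree from the root, always into the quadrant containing
    the bbox's lower-left corner; stop as soon as that child cell, doubled in
    x and y (the XZ enlargement), no longer fully covers the bbox. Every
    geometry thus gets exactly one ID, and the region indicated by that ID
    covers the geometry's whole extent.
    """
    x0, y0, x1, y1 = domain
    path = []
    for _ in range(max_level):
        xm, ym = (x0 + x1) / 2, (y0 + y1) / 2
        w, h = (x1 - x0) / 2, (y1 - y0) / 2
        # candidate child: the quadrant holding the bbox's lower-left corner
        cx0 = x0 if bbox[0] < xm else xm
        cy0 = y0 if bbox[1] < ym else ym
        # XZ enlargement: the child cell doubled to the right and to the top
        if bbox[2] > cx0 + 2 * w or bbox[3] > cy0 + 2 * h:
            break  # enlarged child no longer covers the bbox; stop here
        path.append((0 if bbox[0] < xm else 1) + (0 if bbox[1] < ym else 2))
        x0, y0, x1, y1 = cx0, cy0, cx0 + w, cy0 + h
    return tuple(path)  # one deterministic ID per geometry

# Two geometries; the second straddles the x midpoint (a 'boundary object')
print(xz2_cell((10.0, 10.0, 11.0, 11.0)))
print(xz2_cell((-1.0, 10.0, 1.0, 11.0)))  # still exactly one ID, no duplication
```

Because the descent is deterministic, a boundary-straddling object still yields exactly one partition value instead of being duplicated into several cells, which is the 1:1 requirement above.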
On Wed, May 29, 2024 at 11:08 AM Szehon Ho <szehon.apa...@gmail.com> wrote:

Hi Dmytro,

Thank you for looking through the proposal; excited to hear from you guys! I am not a 'geo expert', so I will definitely need to pull in Jia Yu for some of these points.

Although most calculations are done in the query engine, the Iceberg reference implementations (i.e., Java, Python) do have to support a few calculations to handle filter pushdown:

1. Pushdown of the proposed geospatial transforms ST_COVERS, ST_COVERED_BY, and ST_INTERSECTS.
2. Evaluation of the proposed geospatial partition transform XZ2. As you may have seen, this was chosen because it is the only standard one today that solves the 'boundary object' problem while still preserving a 1-to-1 mapping of row => partition value.

This is the primary rationale for choosing the values, as these were implemented in the GeoLake and Havasu projects (the Iceberg forks that sparked the proposal) based on the Geometry type (edges=planar, crs=OGC:CRS84 / SRID=4326).

> 2. As you mentioned [2] in the proposal, there are difficulties with supporting the full PROJJSON specification of the SRS. In our experience, most use cases do not require the full definition of the SRS; in fact, that definition is only needed when converting between coordinate systems. On the other hand, it's often necessary to check whether two geometry columns have the same coordinate system, for example when joining two columns from different data providers.
>
> To address this, we would like to propose including the option to specify the SRS with only an SRID in Phase 1. The query engine may choose to treat it as an opaque identifier or perform a look-up in the EPSG database if supported.

The way to specify the CRS definition is actually taken from GeoParquet [1]; I think we are not bound to follow it if there are better options. I feel we might need to at least list out the supported configurations in the spec, though. There is some conversation about this on the doc here [2]. Basically:

1. XZ2 assumes planar edges. This is a feature of the algorithm, based on the original paper. A possible solution for spherical edges is proposed by Michael Entin here: [3]; please feel free to evaluate it.
2. XZ2 needs to know the coordinate range. According to Jia's comments, this requires parsing the CRS. Can it be done with the SRID alone? (See the sketch after this email.)

> 1. In the first version of the specification (Phase 1), it is described as focused on the planar geometry model with the CRS fixed to 4326. In this model, Snowflake would not be able to map our Geography type, since it is based on the spherical geography model. Given that Snowflake supports both edge types, we would like to better understand how to map them to the proposed Geometry type and its metadata.
>
> - How is the edge type supposed to be interpreted by the query engine? Is it necessary for the system to adhere to the edge model for geospatial functions, or can it use the model that it supports, or let the customer choose it? Will it affect the bounding box or other row-group metadata?
>
> - Is there any reason why the flexible model has to be postponed to further iterations? Would it be more extensible to support a mutable edge type from Phase 1, but allow systems to ignore it if they do not support the spherical computation model?

This may be answered by the previous paragraph regarding XZ2:

1. If we get XZ2 to work with a more variable CRS without requiring the full PROJJSON specification, that seems to be a path to supporting the Snowflake Geometry type.
2. If we get another one-to-one partition function on spherical edges, like the one proposed by Michael, that seems to be a path to supporting the Snowflake Geography type.

Does that sound correct? As for why certain things are marked as Phase 1: they were chosen just so we can all agree on an initial design and iterate faster; nothing is set in stone, and maybe path 1 can be done quickly, for example.

Also, I am not sure about handling the evaluation of ST_COVERS, ST_COVERED_BY, and ST_INTERSECTS (how easy it is to handle different CRSs + spherical edges). I will leave that to Jia.

Thanks!
Szehon

[1]: https://github.com/opengeospatial/geoparquet/blob/main/format-specs/geoparquet.md#column-metadata
[2]: https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI/edit?disco=AAABL-z6xXk
[3]: https://docs.google.com/document/d/1tG13UpdNH3i0bVkjFLsE2kXEXCuw1XRpAC2L2qCUox0/edit
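One way to read Szehon's point 2 in code: XZ2 only needs the coordinate range of the CRS, so an engine could supply bounds for the SRIDs it knows and refuse otherwise. The lookup table below is a hypothetical engine-side mapping; per Jia's note on EPSG licensing, it is not something Iceberg itself could ship.

```python
# Hypothetical engine-side lookup: Iceberg itself cannot ship EPSG data
# (licensing), but an engine could map the SRIDs it supports to the
# coordinate range that the XZ2 transform needs.
KNOWN_CRS_BOUNDS = {
    4326: (-180.0, -90.0, 180.0, 90.0),   # WGS 84 lon/lat (CRS84 axis order)
    3857: (-20037508.34, -20037508.34, 20037508.34, 20037508.34),  # Web Mercator
}

def xz2_domain(srid=None):
    """Return the (xmin, ymin, xmax, ymax) domain for the XZ2 transform.

    With only an opaque SRID, the engine either knows the bounds or must
    refuse to build the partition transform; Phase 1 sidesteps this by
    fixing the CRS to CRS84.
    """
    if srid is None:
        return KNOWN_CRS_BOUNDS[4326]  # Phase 1 default: CRS84
    try:
        return KNOWN_CRS_BOUNDS[srid]
    except KeyError:
        raise ValueError(f"SRID {srid} unknown to this engine; "
                         "cannot derive the XZ2 coordinate range")

print(xz2_domain())      # CRS84 default
print(xz2_domain(3857))  # engine-known projected CRS
```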
On Wed, May 29, 2024 at 8:30 AM Dmytro Koval <dmytro.ko...@snowflake.com.invalid> wrote:

Dear Szehon and Iceberg Community,

This is Dmytro, Peter, Aihua, and Tyler from Snowflake. As part of our desire to be more active in the Iceberg community, we've been looking over this geospatial proposal. We're excited geospatial is getting traction, as we see a lot of geo usage within Snowflake and expect that usage to carry over to our Iceberg offerings soon. After reviewing the proposal, we have some questions we'd like to pose, given our experience with geospatial support in Snowflake.

We would like to clarify two aspects of the proposal: the handling of the spherical model and the definition of the spatial reference system. Both have a big impact on interoperability with Snowflake and with other query engines and geo processing systems.

Let us first share some context about geospatial types at Snowflake; geo experts will certainly be familiar with this already, but for the sake of others we want to err on the side of being explicit and clear. Snowflake supports two geospatial types [1]:

- Geography – uses a spherical approximation of the earth for all computations. It does not perfectly represent the earth, but it allows getting accurate results on the WGS84 coordinates used by GPS, without any need to perform coordinate system reprojections. It is also quite fast for end-to-end computations. In general, it has less distortion compared to the 2D planar model.
- Geometry – uses the planar Euclidean geometry model. Geometric computations are simpler, but require transforming the data between coordinate systems to minimize distortion. The Geometry data type allows setting a spatial reference system for each row using an SRID. Binary geospatial functions are only allowed on geometries with the same SRID. The only function that interprets the SRID is ST_TRANSFORM, which allows conversion between different SRSs.
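To make the edge-model difference concrete, a small sketch (assuming pyproj is available; none of this is from the proposal): the same pair of lon/lat endpoints yields a different edge depending on whether it is read as planar (a straight segment in the coordinate plane) or spherical (a geodesic on the WGS84 ellipsoid).

```python
import math
from pyproj import Geod  # assumed available: pip install pyproj

# The same edge, read under the two models described above:
lon1, lat1 = -73.97, 40.78   # New York
lon2, lat2 = 2.35, 48.86     # Paris

# Geometry-style (planar edges): a straight segment in the coordinate plane,
# so its 'length' is just Euclidean distance in degrees, which is physically
# meaningless for lon/lat without a projection.
planar_degrees = math.hypot(lon2 - lon1, lat2 - lat1)

# Geography-style (spherical edges): the edge is a geodesic on the WGS84
# ellipsoid, and its length comes back in meters.
geodesic_m = Geod(ellps="WGS84").line_length([lon1, lon2], [lat1, lat2])

print(f"planar: {planar_degrees:.2f} degrees, geodesic: {geodesic_m / 1000:.0f} km")
```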
Given the choice of the two types and a set of operations on top of them, the majority of Snowflake users select the Geography type to represent their geospatial data.

From our perspective, Iceberg users would benefit most from being given the flexibility to store and process data using the model that better fits their needs and specific use cases.

Therefore, we would like to ask some design-clarifying questions that are important for interoperability:

1. In the first version of the specification (Phase 1), it is described as focused on the planar geometry model with the CRS fixed to 4326. In this model, Snowflake would not be able to map our Geography type, since it is based on the spherical geography model. Given that Snowflake supports both edge types, we would like to better understand how to map them to the proposed Geometry type and its metadata.

- How is the edge type supposed to be interpreted by the query engine? Is it necessary for the system to adhere to the edge model for geospatial functions, or can it use the model that it supports, or let the customer choose it? Will it affect the bounding box or other row-group metadata?

- Is there any reason why the flexible model has to be postponed to further iterations? Would it be more extensible to support a mutable edge type from Phase 1, but allow systems to ignore it if they do not support the spherical computation model?

2. As you mentioned [2] in the proposal, there are difficulties with supporting the full PROJJSON specification of the SRS. In our experience, most use cases do not require the full definition of the SRS; in fact, that definition is only needed when converting between coordinate systems. On the other hand, it's often necessary to check whether two geometry columns have the same coordinate system, for example when joining two columns from different data providers.

To address this, we would like to propose including the option to specify the SRS with only an SRID in Phase 1. The query engine may choose to treat it as an opaque identifier or perform a look-up in the EPSG database if supported.

Thank you again for driving this effort forward. We look forward to hearing your thoughts.

[1] https://docs.snowflake.com/en/sql-reference/data-types-geospatial#understanding-the-differences-between-geography-and-geometry
[2] https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI/edit#heading=h.oruaqt3nxcaf

On 2024/05/02 00:41:52 Szehon Ho wrote:

> Hi everyone,
>
> We have created a formal proposal for adding geospatial support to Iceberg. Please read the following for details.
>
> - Github Proposal: https://github.com/apache/iceberg/issues/10260
> - Proposal Doc: https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI
>
> Note that this proposal is built on existing extensive research and POC implementations (Geolake, Havasu). Special thanks to Jia Yu and Kristin Cowalcijk from Wherobots/Geolake for extensive consultation and help in writing this proposal, as well as support from Yuanyuan Zhang from Geolake.
>
> We would love to get more feedback on this proposal from the wider community and eventually discuss it in a community sync.
>
> Thanks
> Szehon