Menelaos Karavelas updated SPARK-51658:
---------------------------------------
    Labels: SPIP  (was: )

SPIP: Add geospatial types in Spark
-----------------------------------

                Key: SPARK-51658
                URL: https://issues.apache.org/jira/browse/SPARK-51658
            Project: Spark
         Issue Type: New Feature
         Components: SQL
   Affects Versions: 4.1.0
           Reporter: Menelaos Karavelas
           Priority: Major
             Labels: SPIP

*Q1. What are you trying to do? Articulate your objectives using absolutely no jargon.*

Add two new data types to Spark SQL and PySpark, GEOMETRY and GEOGRAPHY, for handling location and shape data, specifically:
# GEOMETRY: for working with shapes in a Cartesian space (a flat surface, like a paper map)
# GEOGRAPHY: for working with shapes on an ellipsoidal surface (like the Earth's surface)

These will let Spark users work with standard shape types like:
* POINT, MULTIPOINT
* LINESTRING, MULTILINESTRING
* POLYGON, MULTIPOLYGON
* GEOMETRYCOLLECTION

These are the [standard shape types|https://portal.ogc.org/files/?artifact_id=25355] defined by the [Open Geospatial Consortium (OGC)|https://www.ogc.org/], the international governing body for standardizing Geographic Information Systems (GIS).

The new SQL types will be parametrized with an integer value (a Spatial Reference IDentifier, or SRID) that defines the underlying coordinate reference system the geospatial values live in (like GPS coordinates vs. local measurements).

Popular data storage formats ([Parquet|https://github.com/apache/parquet-format/blob/master/Geospatial.md] and [Iceberg|https://github.com/apache/iceberg/pull/10981]) are adding support for GEOMETRY and GEOGRAPHY, and the proposal is to extend Spark so that it can read, write, and operate on the new geo types.

*Q2. What problem is this proposal NOT designed to solve?*

Building a comprehensive system for advanced geospatial analysis in Spark. While we are adding basic support for storing and reading location/shape data, we are not:
* creating complex geospatial processing functions,
* building spatial analysis tools, or
* developing geometric or geographic calculations and transformations.

This proposal lays the foundation: it builds the infrastructure to handle geospatial data, not a full-featured geospatial processing system. Such extensions can be added later as separate improvements.

*Q3. How is it done today, and what are the limits of current practice?*

Current state:
# Users store geospatial data by converting it into basic text (string) or binary formats, because Spark does not understand geospatial data directly.
# To work with this data, users need external tools to make sense of it and process it.

Limitations:
* Users have to rely on external libraries.
* Users have to convert between representations.
* Geospatial values cannot be read or written natively.
* External tools cannot natively recognize the data as geospatial.
* No file/data skipping is possible.

*Q4. What is new in your approach and why do you think it will be successful?*

The high-level approach is not new, and we have a clear picture of how to split the work into sub-tasks based on our experience of adding new types such as ANSI intervals and TIMESTAMP_NTZ. Most existing data processing systems already support working with geospatial data.

*Q5. Who cares? If you are successful, what difference will it make?*

The new types create the foundation for working with geospatial data in Spark.
Other data analytics systems, such as PostgreSQL, Redshift, Snowflake, and BigQuery, all have geospatial support. Spark stays relevant and compatible with the latest Parquet and Iceberg community developments.

*Q6. What are the risks?*

The addition of the new data types allows for strictly new workloads, so the risks in this direction are minimal. The only overlap with existing functionality is at the type system level (e.g., casts between the new geospatial types and the existing Spark SQL types). That risk is low, however, and can be handled through testing.

*Q7. How long will it take?*

In total it might take around 9 months. The estimate is based on similar tasks: ANSI intervals (SPARK-27790) and TIMESTAMP_NTZ (SPARK-35662). We can split the work into functional blocks:
# Base functionality - 2 weeks: add the new geospatial types, literals, type constructors, and external types.
# Persistence - 2.5 months: the ability to create Parquet tables with the new types, read/write from/to Parquet and other built-in data sources, stats, and predicate pushdown.
# Basic data skipping operators - 1 month: implement rudimentary operators for data skipping (e.g., ST_BoxesIntersect, which checks whether the bounding boxes of two geospatial values intersect).
# Clients support - 1 month: JDBC, Hive, Thrift server, Connect.
# PySpark integration - 1 month: DataFrame support, pandas API, Python UDFs, Arrow column vectors.
# Docs + testing/benchmarking - 1 month.

*Q8. What are the mid-term and final "exams" to check for success?*

Mid-term criteria: new type definitions, and reading/writing geospatial data from/to Parquet.
Final criteria: functionality equivalent to that of any other scalar data type.

*Appendix A. Proposed API Changes.*

h4. _Geospatial data types_

We propose to support two new parametrized data types: GEOMETRY and GEOGRAPHY. Both take as parameter a non-negative integer value called SRID (a Spatial Reference IDentifier). See Appendix B for more details on the type system. Examples:

||Syntax||Comments||
|GEOMETRY(0)|Materializable, every row has SRID 0|
|GEOMETRY(ANY)|Not materializable, every row can have a different SRID|
|GEOGRAPHY(4326)|Materializable, every row has SRID 4326|
|GEOGRAPHY(ANY)|Not materializable, every row can have a different SRID|

GEOMETRY(ANY) and GEOGRAPHY(ANY) act as least common types among the parametrized GEOMETRY and GEOGRAPHY types.

h4. _Table creation_

Users will be able to create GEOMETRY(<srid>) or GEOGRAPHY(<srid>) columns:

{code:sql}
CREATE TABLE tbl (geom GEOMETRY(0), geog GEOGRAPHY(4326));
{code}

Users will not be able to create GEOMETRY(ANY) or GEOGRAPHY(ANY) columns. See details in Appendix B.

Users will be able to insert geospatial values into such columns:

{code:sql}
-- The hex literal below is the WKB encoding of POINT(1 2).
INSERT INTO tbl
VALUES (X'0101000000000000000000f03f0000000000000040'::GEOMETRY(0),
        X'0101000000000000000000f03f0000000000000040'::GEOGRAPHY(4326));
{code}

h4. _Casts_

The following explicit casts are allowed (all other casts are initially disallowed):
* GEOMETRY/GEOGRAPHY column to BINARY (the output is in WKB format)
* BINARY to GEOMETRY/GEOGRAPHY column (the input is expected to be in WKB format)
* GEOMETRY(ANY) to GEOMETRY(<srid>)
* GEOGRAPHY(ANY) to GEOGRAPHY(<srid>)

The following implicit casts are allowed (and also supported as explicit casts):
* GEOMETRY(<srid>) to GEOMETRY(ANY)
* GEOGRAPHY(<srid>) to GEOGRAPHY(ANY)
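Below is a hedged sketch of how these cast rules might look in practice. It is not part of the proposal text: the table and column names (tbl, geom, src, wkb_col, srid_col) are illustrative, and the error semantics of narrowing casts are an assumption.

{code:sql}
-- Explicit cast: GEOMETRY to BINARY yields the WKB bytes of the value.
SELECT CAST(geom AS BINARY) FROM tbl;

-- Explicit cast: BINARY (here, the WKB of POINT(1 2)) to a GEOMETRY with a concrete SRID.
SELECT CAST(X'0101000000000000000000f03f0000000000000040' AS GEOMETRY(0));

-- Explicit cast: narrow GEOMETRY(ANY) to a concrete SRID. GEOMETRY(ANY) can arise
-- from ST_GeomFromWKB (proposed in the next section) with a non-foldable SRID
-- argument; presumably the cast fails for rows whose SRID does not match.
SELECT CAST(ST_GeomFromWKB(wkb_col, srid_col) AS GEOMETRY(4326)) FROM src;
{code}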
h4. _Geospatial expressions_

The following geospatial expressions are to be exposed to users (a usage sketch is given in the illustrative examples at the end of this document):
* Scalar expressions:
** Boolean ST_BoxesIntersect(BoxStruct1, BoxStruct2)
** BoxStruct ST_Box2D(GeoVal)
** BinaryVal ST_AsBinary(GeoVal)
** GeoVal ST_GeomFromWKB(BinaryVal, IntVal)
** GeoVal ST_GeogFromWKB(BinaryVal, IntVal)
* Aggregate expressions:
** BoxStruct ST_Extent(GeoCol)

BoxStruct above refers to a struct of 4 double values.

h4. _Ecosystem_

Support for the SQL, DataFrame, and PySpark APIs, including the Spark Connect APIs.

*Appendix B.*

h4. _Type System_

We propose to introduce two parametrized types: GEOMETRY and GEOGRAPHY. The parameter is either a non-negative integer value or the special specifier ANY.
* The integer value represents a Spatial Reference IDentifier (SRID), which uniquely defines the coordinate reference system. These integers will be mapped to coordinate reference system (CRS) definitions maintained by authorities such as OGC, EPSG, or ESRI.
* The ANY specifier refers to GEOMETRY or GEOGRAPHY values whose SRID can differ across rows. GEOMETRY(ANY) and GEOGRAPHY(ANY) will not be materialized. This constraint is imposed by the storage specifications: the Parquet and Iceberg specs require a unique SRID value per GEOMETRY or GEOGRAPHY column.
* GEOMETRY(ANY) and GEOGRAPHY(ANY) act as the least common type across all parametrized GEOMETRY and GEOGRAPHY types, respectively. They are also needed for the proposed ST_GeomFromWKB and ST_GeogFromWKB expressions when the second argument is not foldable; for non-foldable arguments the return type of these expressions is GEOMETRY(ANY) and GEOGRAPHY(ANY), respectively.
* All known SRID values are positive integers. For the GEOMETRY data type we will also allow the SRID value 0, which is the typical way of specifying geometric values with an unknown or unspecified underlying CRS. Mathematically, such values are understood as embedded in a Cartesian space.

h4. _In-memory representation_

We propose to internally represent geospatial values as a byte array containing the concatenation of a BINARY value and an INTEGER value. The BINARY value is the WKB representation of the GEOMETRY or GEOGRAPHY value; the INTEGER value is the SRID. WKB stands for Well-Known Binary, one of the standard formats for representing geospatial values (see [https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry#Well-known_binary]). WKB is also the format used by the latest Parquet and Iceberg specs for representing geospatial values.

h4. _Serialization_

We propose to serialize only GEOMETRY and GEOGRAPHY columns that have a numeric SRID value. For such columns the SRID can be recovered from the column metadata. The serialization format for values will be WKB.

h4. _Casting_

Implicit casting from GEOMETRY(<srid>) to GEOMETRY(ANY), and from GEOGRAPHY(<srid>) to GEOGRAPHY(ANY), is primarily introduced to smoothly support expressions such as CASE/WHEN, COALESCE, and NVL2 that depend on least-common-type logic (see the illustrative examples at the end of this document).

h4. _Dependencies_

We propose to add a build dependency on the PROJ library ([https://proj.org/en/stable]) for generating the list of supported CRS definitions. We will need to extend this list with SRID 0, which is necessary when dealing with geometry inputs with an unspecified CRS.
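h4. _Illustrative examples (non-normative)_

To make the proposed API more concrete, here is a hedged sketch of how the scalar and aggregate expressions from Appendix A might compose. The table and column names (shapes, id, geom) are assumptions, not part of the proposal.

{code:sql}
-- Pairs of shapes whose bounding boxes intersect: a coarse, data-skipping-friendly
-- pre-filter, not an exact geometric intersection test.
SELECT a.id, b.id
FROM shapes a JOIN shapes b
  ON ST_BoxesIntersect(ST_Box2D(a.geom), ST_Box2D(b.geom));

-- The overall extent (bounding box) of a GEOMETRY column, returned as a BoxStruct.
SELECT ST_Extent(geom) FROM shapes;

-- Round-trip through WKB: serialize a value and parse it back with SRID 0.
SELECT ST_GeomFromWKB(ST_AsBinary(geom), 0) FROM shapes;
{code}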
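Similarly, a hedged sketch of the least-common-type behavior described in the Casting subsection. The result types in the comments are what the proposed rules imply, not confirmed behavior; column names are illustrative.

{code:sql}
-- geom_a and geom_b are both GEOMETRY(0): the branches agree, so the
-- result type stays GEOMETRY(0).
SELECT CASE WHEN flag THEN geom_a ELSE geom_b END FROM t1;

-- g0 is GEOMETRY(0) and g4326 is GEOMETRY(4326): both sides are implicitly
-- widened to GEOMETRY(ANY), so the result type is GEOMETRY(ANY).
SELECT COALESCE(g0, g4326) FROM t2;
{code}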