[ 
https://issues.apache.org/jira/browse/SPARK-51658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Menelaos Karavelas updated SPARK-51658:
---------------------------------------
    Labels: SPIP  (was: )

> SPIP: Add geospatial types in Spark
> -----------------------------------
>
>                 Key: SPARK-51658
>                 URL: https://issues.apache.org/jira/browse/SPARK-51658
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 4.1.0
>            Reporter: Menelaos Karavelas
>            Priority: Major
>              Labels: SPIP
>
> *Q1. What are you trying to do? Articulate your objectives using absolutely 
> no jargon.*
> Add two new data types to Spark SQL and PySpark, GEOMETRY and GEOGRAPHY, for 
> handling location and shape data, specifically:
>  # GEOMETRY: For working with shapes on a Cartesian space (a flat surface, 
> like a paper map)
>  # GEOGRAPHY: For working with shapes on an ellipsoidal surface (like the 
> Earth's surface)
> These will let Spark users work with standard shape types like:
>  * POINTS, MULTIPOINTS
>  * LINESTRINGS, MULTILINESTRINGS
>  * POLYGONS, MULTIPOLYGONS
>  * GEOMETRY COLLECTIONS
> These are [standard shape 
> types|https://portal.ogc.org/files/?artifact_id=25355] as defined by the 
> [Open Geospatial Consortium (OGC)|https://www.ogc.org/]. OGC is the 
> international governing body for standardizing Geographic Information Systems 
> (GIS). 
> The new SQL types will be parametrized with an integer value (a Spatial 
> Reference IDentifier, or SRID) that defines the underlying coordinate 
> reference system that geospatial values live in (like GPS coordinates vs. 
> local measurements).
> Popular data storage formats 
> ([Parquet|https://github.com/apache/parquet-format/blob/master/Geospatial.md] 
> and [Iceberg|https://github.com/apache/iceberg/pull/10981]) are adding 
> support for GEOMETRY and GEOGRAPHY, and the proposal is to extend Spark so 
> that it can read, write, and operate on the new geo types.
>  
> *Q2. What problem is this proposal NOT designed to solve?*
> This proposal does not aim to build a comprehensive system for advanced 
> geospatial analysis in Spark. While we are adding basic support for storing 
> and reading location/shape data, we are not:
>  * Creating complex geospatial processing functions
>  * Building spatial analysis tools
>  * Developing geometric or geographic calculations and transformations
> This proposal lays the foundation: it builds the infrastructure to handle 
> geospatial data, but does not create a full-featured geospatial processing 
> system. Such extensions can be added later as separate improvements.
>  
> *Q3. How is it done today, and what are the limits of current practice?*
> Current State:
>  # Users store geospatial data by converting it into basic text (string) or 
> binary formats because Spark doesn't understand geospatial data directly.
>  # To work with this data, users must rely on external tools to interpret 
> and process it.
> Limitations:
>  * Users have to rely on external libraries
>  * Users have to convert data back and forth between representations
>  * Geospatial values cannot be read or written natively
>  * External tools cannot natively recognize the data as geospatial
>  * No file/data skipping is possible
>  
> *Q4. What is new in your approach and why do you think it will be successful?*
> The high-level approach is not new, and we have a clear picture of how to 
> split the work by sub-tasks based on our experience of adding new types such 
> as ANSI intervals and TIMESTAMP_NTZ.
> Most existing data processing systems already support working with geospatial data.
>  
> *Q5. Who cares? If you are successful, what difference will it make?*
> The new types create the foundation for working with geospatial data in 
> Spark. Other data analytics systems, such as PostgreSQL, Redshift, Snowflake, 
> and BigQuery, all have geospatial support.
> Spark stays relevant and compatible with the latest Parquet and Iceberg 
> community developments.
>  
> *Q6. What are the risks?*
> The addition of the new data types will allow for strictly new workloads, so 
> the risks in this direction are minimal. The only overlap with existing 
> functionality is at the type system level (e.g., casts between the new 
> geospatial types and the existing Spark SQL types). The risk is low, however, 
> and can be handled through testing.
>  
> *Q7. How long will it take?*
> In total it might take around 9 months. The estimation is based on similar 
> tasks: ANSI intervals (SPARK-27790) and TIMESTAMP_NTZ (SPARK-35662). We can 
> split the work by function blocks:
>  # Base functionality - 2 weeks
> Add the new geospatial types, literals, type constructors, and external types.
>  # Persistence - 2.5 months
> Ability to create Parquet tables with the new geospatial types, read/write 
> from/to Parquet and other built-in data sources, stats, predicate pushdown.
>  # Basic data skipping operators - 1 month
> Implement rudimentary operators for data skipping (e.g., ST_BoxesIntersect, 
> which checks whether the bounding boxes of two geospatial values intersect).
>  # Clients support - 1 month
> JDBC, Hive, Thrift server, connect
>  # PySpark integration - 1 month
> DataFrame support, pandas API, python UDFs, Arrow column vectors
>  # Docs + testing/benchmarking - 1 month
>  
> *Q8. What are the mid-term and final “exams” to check for success?*
> Mid-term criteria: new type definitions, plus the ability to read/write 
> geospatial data from/to Parquet.
> Final criteria: Equivalent functionality with any other scalar data type.
>  
> *Appendix A. Proposed API Changes.*
> h4. _Geospatial data types_
> We propose to support two new parametrized data types: GEOMETRY and 
> GEOGRAPHY. Both take as input a non-negative integer value called SRID 
> (referring to a spatial reference identifier). See Appendix B for more 
> details on the type system.
> Examples:
>  
> ||Syntax||Comments||
> |GEOMETRY(0)|Materializable, every row has SRID 0|
> |GEOMETRY(ANY)|Not materializable, every row can have different SRID|
> |GEOGRAPHY(4326)|Materializable, every row has SRID 4326|
> |GEOGRAPHY(ANY)|Not materializable, every row can have different SRID|
> GEOMETRY(ANY) and GEOGRAPHY(ANY) act as least common types among the 
> parametrized GEOMETRY and GEOGRAPHY types.
> h4. _Table creation_
> Users will be able to create GEOMETRY(<srid>) or GEOGRAPHY(<srid>) columns:
>  
> {code:java}
> CREATE TABLE tbl (geom GEOMETRY(0), geog GEOGRAPHY(4326)); {code}
>  
> Users will not be able to create GEOMETRY(ANY) or GEOGRAPHY(ANY) columns. See 
> details in Appendix B.
> Users will be able to insert geospatial values to such columns:
>  
> {code:java}
> INSERT INTO tbl 
> VALUES (X'0101000000000000000000f03f0000000000000040'::GEOMETRY(0), 
>         X'0101000000000000000000f03f0000000000000040'::GEOGRAPHY(4326)); {code}
>  
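> As an aside, the hex literal in the INSERT example above is the WKB encoding of the point (1, 2). The following Python sketch (illustrative only; the parser function is not part of the proposed API) decodes such a literal using only the standard library:

```python
import struct

def parse_wkb_point(wkb: bytes):
    """Parse a WKB-encoded 2D point (little-endian only, for illustration)."""
    byte_order = wkb[0]                                # 1 = little-endian
    geom_type = struct.unpack_from("<I", wkb, 1)[0]    # 1 = Point
    assert byte_order == 1 and geom_type == 1
    x, y = struct.unpack_from("<dd", wkb, 5)           # two IEEE 754 doubles
    return x, y

wkb = bytes.fromhex("0101000000000000000000f03f0000000000000040")
print(parse_wkb_point(wkb))  # (1.0, 2.0)
```

> The first byte selects the byte order, the next four bytes encode the geometry type, and the two doubles that follow are the X and Y coordinates.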
> h4. _Casts_
> The following explicit casts are allowed (all other casts are initially 
> disallowed):
>  * GEOMETRY/GEOGRAPHY column to BINARY (output is WKB format)
>  * BINARY to GEOMETRY/GEOGRAPHY column (input is expected to be in WKB format)
>  * GEOMETRY(ANY) to GEOMETRY(<srid>)
>  * GEOGRAPHY(ANY) to GEOGRAPHY(<srid>)
> The following implicit casts are allowed (also supported as explicit):
>  * GEOMETRY(<srid>) to GEOMETRY(ANY)
>  * GEOGRAPHY(<srid>) to GEOGRAPHY(ANY)
> h4. _Geospatial expressions_
> The following geospatial expressions are to be exposed to users:
>  * Scalar expressions
>  ** Boolean ST_BoxesIntersect(BoxStruct1, BoxStruct2)
>  ** BoxStruct ST_Box2D(GeoVal)
>  ** BinaryVal ST_AsBinary(GeoVal)
>  ** GeoVal ST_GeomFromWKB(BinaryVal, IntVal)
>  ** GeoVal ST_GeogFromWKB(BinaryVal, IntVal)
>  * Aggregate expressions
>  ** BoxStruct ST_Extent(GeoCol)
> BoxStruct above refers to a struct of 4 double values (the minimum and 
> maximum X and Y coordinates of a bounding box).
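> To make the intended semantics concrete, here is a minimal Python sketch of the two box-related operations, assuming (this ordering is an assumption, not specified above) that BoxStruct holds its four doubles as (xmin, ymin, xmax, ymax):

```python
# Illustrative sketch only; not part of the proposed API.
from typing import Iterable, Tuple

Box = Tuple[float, float, float, float]  # assumed (xmin, ymin, xmax, ymax)

def st_boxes_intersect(a: Box, b: Box) -> bool:
    """True if the two bounding boxes overlap (touching edges count)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def st_extent(boxes: Iterable[Box]) -> Box:
    """Aggregate a collection of boxes into their combined bounding box."""
    xmins, ymins, xmaxs, ymaxs = zip(*boxes)
    return (min(xmins), min(ymins), max(xmaxs), max(ymaxs))

print(st_boxes_intersect((0, 0, 2, 2), (1, 1, 3, 3)))  # True
print(st_extent([(0, 0, 1, 1), (2, 2, 3, 3)]))         # (0, 0, 3, 3)
```

> Box intersection of this kind is cheap to evaluate on column statistics, which is what makes it suitable for data skipping.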
> h4. _Ecosystem_
> Support for SQL, DataFrame, and PySpark APIs, including Spark Connect APIs.
>  
> *Appendix B.* 
> h4. _Type System_
> We propose to introduce two parametrized types: GEOMETRY and GEOGRAPHY. The 
> parameter would either be a non-negative integer value or the special 
> specifier ANY.
>  * The integer value represents a Spatial Reference IDentifier (SRID) which 
> uniquely defines the coordinate reference system. These integers will be 
> mapped to coordinate reference system (CRS) definitions defined by 
> authorities like OGC, EPSG, or ESRI.
>  * The ANY specifier refers to GEOMETRY or GEOGRAPHY columns where the SRID 
> value can be different across the column rows. GEOMETRY(ANY) and 
> GEOGRAPHY(ANY) will not be materialized. This constraint is imposed by the 
> storage specifications (the Parquet and Iceberg specs require a unique SRID 
> value per GEOMETRY or GEOGRAPHY column).
>  * GEOMETRY(ANY) and GEOGRAPHY(ANY) act as the least common type across all 
> GEOMETRY and GEOGRAPHY parametrized types. They are also needed for the 
> proposed ST_GeomFromWKB and ST_GeogFromWKB expressions when the second 
> argument is not foldable; for non-foldable arguments the return type for 
> these expressions is GEOMETRY(ANY) and GEOGRAPHY(ANY), respectively.
>  * All known SRID values are positive integers. For the GEOMETRY data type we 
> will also allow the SRID value of 0. This is the typical way of specifying 
> geometric values with unknown or unspecified underlying CRS. Mathematically 
> these values are understood as embedded in a Cartesian space.
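> As an illustration of the SRID-to-CRS mapping described above (the lookup table and function below are hypothetical; the real list would be generated from an authority database, e.g., via Proj):

```python
# Hypothetical SRID -> CRS lookup with the special SRID 0 entry.
CRS_BY_SRID = {
    0: "Unknown CRS (plain Cartesian plane)",  # allowed for GEOMETRY only
    4326: "EPSG:4326 - WGS 84",
    3857: "EPSG:3857 - WGS 84 / Pseudo-Mercator",
}

def lookup_crs(srid: int) -> str:
    """Resolve a non-negative SRID to its CRS definition, or fail."""
    if srid < 0:
        raise ValueError("SRID must be a non-negative integer")
    try:
        return CRS_BY_SRID[srid]
    except KeyError:
        raise ValueError(f"Unsupported SRID: {srid}") from None

print(lookup_crs(4326))  # EPSG:4326 - WGS 84
```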
> h4. _In-memory representation_
> We propose to internally represent geospatial values as a byte array 
> containing the concatenation of the BINARY value and an INTEGER value. The 
> BINARY value is the WKB representation of the GEOMETRY or GEOGRAPHY value. 
> The INTEGER value is the SRID value. WKB stands for Well-Known Binary and it 
> is one of the standard formats for representing geospatial values (see 
> [https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry#Well-known_binary]).
>  WKB is also the format used by the latest Parquet and Iceberg specs for 
> representing geospatial values.
> h4. _Serialization_
> We propose to only serialize GEOMETRY and GEOGRAPHY columns that have a 
> numeric SRID value. For such columns the SRID value can be deciphered from 
> the column metadata. The serialization format for values will be WKB.
> h4. _Casting_
> Implicit casting from GEOMETRY(<srid>) to GEOMETRY(ANY) and GEOGRAPHY(<srid>) 
> to GEOGRAPHY(ANY) is primarily introduced in order to smoothly support 
> expressions like CASE/WHEN, COALESCE, NVL2, etc., that depend on least common 
> type logic.
> h4. _Dependencies_
> We propose to add a build dependency on the Proj library 
> ([https://proj.org/en/stable]) for generating the list of supported CRS 
> definitions. We will need to extend this list with SRID 0, which is necessary 
> when dealing with geometry inputs with an unspecified CRS.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
