Hi Jia, I really appreciate your very instructive answer. I truly believe that discussing topics with people who know far more than I do is a great way to learn new and interesting things. Your explanations are quite logical and make perfect sense to me. Sh**, I'm not that sure about this proposal now ... 😂
*"If you haven’t encountered this kind of ‘dependency hell’ while working on geospatial projects with Spark, you may have been fortunate to deal with relatively simple cases."* Yes, that was the case for us. We loaded OpenStreetMap data from Spain, calculated some Haversine distances between points in rasters using a custom-made C++ library, and did a bit more. We also used GeoTrellis for something, but I can't quite remember what Thanks again! El dom, 30 mar 2025 a las 9:25, Jia Yu (<ji...@apache.org>) escribió: > Hey Angel, > > I am glad that you asked these questions. Please see my answers below. > > > *1. Domain types evolve quickly. - It has taken years for Parquet to > include these new types in its format... We could evolve alongside Parquet. > Unfortunately, Spark is not known for upgrading its dependencies quickly.* > > Exactly — domain-specific types evolve rapidly and may head in directions > that aren’t fully aligned with formats like Parquet, Avro, and others. In > such cases, should Spark, as a general-purpose compute engine, really be > tightly coupled to the specifics of a single storage format? > > Personally, I really appreciate Spark’s UserDefinedType mechanism and > Apache Arrow’s ExtensionType — both offer maximum flexibility while keeping > the core engine clean and extensible. > > > > * 2. Geospatial in Java and Python is a dependency hell.- How has Parquet > solved that problem, then?* > > Exactly — this problem is not fully solved by Parquet. While the Parquet > spec now includes a definition for geospatial types, it’s more of a vision > than a complete, production-ready solution. Many aspects of the spec are > not yet implemented in Spark. In fact, the spec represents a compromise > among multiple vendors (e.g., BigQuery, Snowflake), and many design choices > are not aligned with Spark’s architecture or ecosystem. > > For example: > • The CRS property in the spec uses a PROJJSON string, which currently > only has a C++ implementation — there is no Java implementation available. > • The edge interpolation algorithms (e.g., for great-circle arcs) > mentioned in the spec also only exist in C++ libraries. > • Handling of antimeridian-crossing geometries is another complex topic > that isn’t addressed in Spark today. > > The Sedona community is actively working on solutions — either building > Java equivalents for these features or creating workarounds. These are > deeply domain-specific efforts and often require non-trivial geospatial > expertise. > > We are currently contributing a Java implementation of the Parquet Geo > format here: https://github.com/apache/parquet-java/pull/2971 > > In Python, geospatial manipulation depends on libraries like Shapely and > GeoPandas, which evolve quickly and frequently introduce breaking changes. > Sedona has invested significant effort to maintain compatibility and > stability for Python UDFs across these ecosystems. > > If you haven’t encountered this kind of “dependency hell” while working on > geospatial projects with Spark, you may have been fortunate to deal with > relatively simple cases — e.g., only working with point data or simple > polygons. > > That usually means: > 1. All geometries are in a single CRS, typically WGS84 (SRID 4326) > 2. No antimeridian-crossing geometries > 3. No need for high-precision distance calculations or spherical geometry > 4. No need to handle topology or wraparound issues > > If that’s the case, then Spark already works fine as-is for your use case > — so why complicate it? > > > *3. 
> *3. Sedona already supports Geo fully in (Geo)Parquet. - The default format > in Spark is Parquet, and Parquet now natively supports these types. Are we > going to force users to add Sedona?* > > While opinions may vary, I would encourage users to adopt a solution like > Apache Sedona that is laser-focused on geospatial. Sedona provides > comprehensive, step-by-step tutorials on how to handle geospatial > dependencies across major platforms — including Databricks, AWS EMR, > Microsoft Azure, and Google Cloud. We’re also actively collaborating with > cloud providers to bundle Sedona natively into their offerings, making it > even easier for users to get started. > > > That said, I generally share the same perspective — if the Spark community > believes it would benefit from having basic geospatial support built in, > the Sedona community would be happy to collaborate on this effort. We’re > open to contributing the necessary functionality and, if appropriate, > having Spark depend on Sedona directly to avoid reinvention. > > Thanks, > Jia > > > > On Sat, Mar 29, 2025 at 11:02 PM Ángel Álvarez Pascua < > angel.alvarez.pas...@gmail.com> wrote: > >> >> * 1. Domain types evolve quickly.* >> It has taken years for Parquet to include these new types in its >> format... We could evolve alongside Parquet. Unfortunately, Spark is not >> known for upgrading its dependencies quickly. >> >> * 2. Geospatial in Java and Python is a dependency hell.* >> How has Parquet solved that problem, then? I don't recall experiencing >> any "dependency hell" when working on geospatial projects with Spark, to be >> honest. Besides, Spark already includes Parquet as a dependency, so... >> where is the problem? >> >> *3. Sedona already supports Geo fully in (Geo)Parquet.* >> The default format in Spark is Parquet, and Parquet now natively supports >> these types. Are we going to force users to add Sedona (along with all its >> third-party dependencies, I assume) to their projects just for reading, >> writing, and performing basic operations with these types? >> >> Anyway, let's vote and see... >> >> On Sat, Mar 29, 2025 at 10:41 PM, Reynold Xin (<r...@databricks.com.invalid>) >> wrote: >> >>> While I don’t think Spark should become a super-specialized geospatial >>> processing engine, I don’t think it makes sense to focus *only* on reading >>> and writing from storage. Geospatial is a pretty common and fundamental >>> capability of analytics systems, and virtually every mature and popular >>> analytics system, be it open source or proprietary, storage or query, has >>> some basic geospatial data types and support. Adding a geospatial type and >>> some basic expressions is such a no-brainer. >>> >>> On Sat, Mar 29, 2025 at 2:27 PM Jia Yu <ji...@apache.org> wrote: >>> >>>> Hi Wenchen, Menelaos and Szehon, >>>> >>>> Thanks for the clarification — I’m glad to hear the primary motivation >>>> of this SPIP is focused on reading and writing geospatial data with Parquet >>>> and Iceberg. That’s an important goal, and I want to highlight that this >>>> problem is being solved by the Apache Sedona community. >>>> >>>> Since the primary motivation here is Parquet-level support, I suggest >>>> shifting the focus of this discussion toward enabling geo support in the Spark >>>> Parquet DataSource rather than introducing core types. >>>> >>>> ** Why Spark Should Avoid Hardcoding Domain-Specific Types like geo >>>> types ** >>>> >>>> 1. Domain types evolve quickly.
>>>> >>>> In geospatial, we already have geometry, geography, raster, trajectory, >>>> point clouds — and the list keeps growing. In AI/ML, we’re seeing tensors, >>>> vectors, and multi-dimensional arrays. Spark’s strength has always been in >>>> its general-purpose architecture and extensibility. Introducing hardcoded >>>> support for fast-changing domain-specific types risks long-term maintenance >>>> issues and eventual incompatibility with emerging standards. >>>> >>>> 2. Geospatial in Java and Python is a dependency hell. >>>> >>>> There are multiple competing geometry libraries with incompatible APIs. >>>> No widely adopted Java library supports geography types. The most >>>> authoritative CRS dataset (EPSG) is not Apache-compatible. The JSON format >>>> for CRS definitions (PROJJSON) is only fully supported in PROJ, a C++ >>>> library with no Java equivalent and no formal OGC standard status. On the >>>> Python side, this might involve Shapely and GeoPandas dependencies. >>>> >>>> 3. Sedona already supports Geo fully in (Geo)Parquet. >>>> >>>> Sedona has supported reading, writing, metadata preservation, and data >>>> skipping for GeoParquet (the predecessor of Parquet Geo) for over two years >>>> [2][3]. These features are production-tested and widely used. >>>> >>>> ** Proposed Path Forward: Geo Support via Spark Extensions ** >>>> >>>> To enable seamless Parquet integration without burdening Spark core, >>>> here are two options: >>>> >>>> Option 1: >>>> Sedona offers a dedicated `parquet-geo` DataSource that handles type >>>> encoding, metadata, and data skipping. No changes to Spark are required. >>>> This is already underway and will be maintained by the Sedona community to >>>> keep up with the evolving Geo standards. >>>> >>>> Option 2: >>>> Spark provides hooks to inject: >>>> - custom logical types / user-defined types (UDTs) >>>> - custom statistics and filter pushdowns >>>> Sedona can then extend the built-in `parquet` DataSource to integrate >>>> geo type metadata, predicate pushdown, and serialization seamlessly (a >>>> plug-in sketch follows after this message). >>>> >>>> For Iceberg, we’ve already published a proof-of-concept connector [4] >>>> showing Sedona, Spark, and Iceberg working together without any Spark core >>>> changes [5]. >>>> >>>> ** On the Bigger Picture ** >>>> >>>> I also agree with your long-term vision. I believe Spark is on the path >>>> to becoming a foundational compute engine — much like Postgres or Pandas — >>>> where the core remains focused and stable, while powerful domain-specific >>>> capabilities emerge from its ecosystem. >>>> >>>> To support this future, Spark could prioritize flexible extension hooks >>>> so that third-party libraries can thrive — just like we’ve seen with >>>> PostGIS, pgvector, and TimescaleDB in the Postgres ecosystem, and GeoPandas in >>>> the Pandas ecosystem. >>>> >>>> Sedona is following this model by building geospatial support around >>>> Spark — not inside it — and we’d love to continue collaborating in this >>>> spirit. >>>> >>>> Happy to work together on providing Geo support in Parquet! >>>> >>>> Best, >>>> Jia
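To make the extension path concrete: Spark already discovers third-party sources through the DataSourceRegister service-loader interface, which is what lets a `parquet-geo` DataSource (Option 1 above) ship entirely outside Spark core. A rough sketch, with illustrative class names and the actual reader logic elided:

    // Sketch of the plug-in mechanics behind Option 1; the class name and
    // format name are illustrative, and the bodies are elided. Spark finds
    // this class through the Java ServiceLoader file:
    // META-INF/services/org.apache.spark.sql.sources.DataSourceRegister
    import java.util

    import org.apache.spark.sql.connector.catalog.{Table, TableProvider}
    import org.apache.spark.sql.connector.expressions.Transform
    import org.apache.spark.sql.sources.DataSourceRegister
    import org.apache.spark.sql.types.StructType
    import org.apache.spark.sql.util.CaseInsensitiveStringMap

    class GeoParquetProvider extends TableProvider with DataSourceRegister {
      // Enables spark.read.format("parquet-geo") with no Spark core changes.
      override def shortName(): String = "parquet-geo"
      // Geo type encoding, metadata, and data skipping would live behind these.
      override def inferSchema(options: CaseInsensitiveStringMap): StructType = ???
      override def getTable(schema: StructType, partitioning: Array[Transform],
                            properties: util.Map[String, String]): Table = ???
    }

From the user's side this is just spark.read.format("parquet-geo").load(path), the same shape as the built-in parquet source.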
>>>> >>>> References >>>> >>>> [1] GeoParquet project: >>>> https://github.com/opengeospatial/geoparquet >>>> >>>> [2] Sedona’s GeoParquet DataSource implementation: >>>> >>>> https://github.com/apache/sedona/tree/master/spark/common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet >>>> >>>> [3] Sedona’s GeoParquet documentation: >>>> https://sedona.apache.org/latest/tutorial/files/geoparquet-sedona-spark/ >>>> >>>> [4] Sedona-Iceberg connector (PoC): >>>> https://github.com/wherobots/sedona-iceberg-connector >>>> >>>> [5] Spark-Sedona-Iceberg working example: >>>> >>>> https://github.com/wherobots/sedona-iceberg-connector/blob/main/src/test/scala/com/wherobots/sedona/TestGeospatial.scala#L53 >>>> >>>> >>>> On 2025/03/29 19:27:08 Menelaos Karavelas wrote: >>>> > To continue along the line of thought of Szehon: >>>> > >>>> > I am really excited that the Parquet and Iceberg communities have >>>> adopted geospatial logical types, and of course I am grateful for the work >>>> put in that direction. >>>> > >>>> > As both Wenchen and Szehon pointed out in their own way, the goal is >>>> to have minimal support in Spark, as a common platform, for these types. >>>> > >>>> > To be more specific and explicit: the proposal's scope is to add >>>> support for reading/writing to Parquet, based on the new standard, as well >>>> as adding the types as built-in types in Spark to complement the storage >>>> support. The few ST expressions in the proposal seem to be the minimal set >>>> needed to work with geospatial values in the Spark engine in a meaningful >>>> way. >>>> > >>>> > Best, >>>> > >>>> > Menelaos >>>> > >>>> > >>>> > > On Mar 29, 2025, at 12:06 PM, Szehon Ho <szehon.apa...@gmail.com> >>>> wrote: >>>> > > >>>> > > Thank you Menelaos, will do! >>>> > > >>>> > > To give a little background, Jia and the Sedona community, along with >>>> the GeoParquet community and others, put a great deal of effort into >>>> defining the Parquet and Iceberg geo types, which couldn't have been done >>>> without their experience and help! >>>> > > >>>> > > But I do agree with Wenchen: now that the types are in most common >>>> data sources in the ecosystem, I think Apache Spark as a common platform needs >>>> to have this type definition for interop; otherwise users of vanilla Spark >>>> cannot work with the geospatial data stored in those data sources. (IMO, a >>>> similar rationale applies to adding timestamp nano in the other ongoing SPIP.) >>>> > > >>>> > > And like Wenchen said, the SPIP’s goal doesn’t seem to be to >>>> fragment the ecosystem by implementing Sedona’s advanced geospatial >>>> analytics in Spark itself, which you may be right belongs in pluggable >>>> frameworks. Menelaos may explain more about the SPIP goal. >>>> > > >>>> > > I do hope there can be more collaboration across communities (like >>>> the Iceberg/Parquet collaboration) in drawing on the Sedona community’s >>>> experience to make sure these type definitions are optimal and compatible >>>> with Sedona. >>>> > > >>>> > > Thanks! >>>> > > Szehon >>>> > > >>>> > > >>>> > >> On Mar 29, 2025, at 8:04 AM, Menelaos Karavelas < >>>> menelaos.karave...@gmail.com> wrote: >>>> > >> >>>> > >> >>>> > >> Hello Szehon, >>>> > >> >>>> > >> I just created a Google doc and also linked it in the JIRA: >>>> > >> >>>> > >> >>>> https://docs.google.com/document/d/1cYSNPGh95OjnpS0k_KDHGM9Ae3j-_0Wnc_eGBZL4D3w/edit?tab=t.0 >>>> > >> >>>> > >> Please feel free to comment on it.
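For concreteness, here is roughly how the existing GeoParquet support in [2][3] looks from the user's side today. This is a hedged usage sketch: the path, column names, and polygon are illustrative, and it assumes an existing SparkSession named spark with Sedona on the classpath:

    // Usage sketch for Sedona's GeoParquet DataSource ([2][3] above);
    // the path, column names, and polygon are illustrative.
    import org.apache.sedona.spark.SedonaContext

    val sedona = SedonaContext.create(spark) // registers ST_ functions and the "geoparquet" format
    val df = sedona.read.format("geoparquet").load("s3://bucket/buildings.parquet")
    df.createOrReplaceTempView("buildings")
    // GeoParquet's per-column bbox metadata is what enables the data
    // skipping mentioned above.
    sedona.sql(
      """SELECT id, ST_AsText(geometry)
        |FROM buildings
        |WHERE ST_Within(geometry, ST_GeomFromWKT(
        |  'POLYGON ((0 0, 10 0, 10 10, 0 10, 0 0))'))""".stripMargin).show()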
>>>> > >> >>>> > >> Best, >>>> > >> >>>> > >> Menelaos >>>> > >> >>>> > >> >>>> > >>> On Mar 28, 2025, at 2:19 PM, Szehon Ho <szehon.apa...@gmail.com> >>>> wrote: >>>> > >>> >>>> > >>> Thanks Menelaos, this is exciting! Is there a Google doc we can >>>> comment on, or just the JIRA? >>>> > >>> >>>> > >>> Thanks >>>> > >>> Szehon >>>> > >>> >>>> > >>> On Fri, Mar 28, 2025 at 1:41 PM Ángel Álvarez Pascua < >>>> angel.alvarez.pas...@gmail.com <mailto:angel.alvarez.pas...@gmail.com>> >>>> wrote: >>>> > >>>> Sorry, I only had a quick look at the proposal, looked for WKT, >>>> and didn't find anything. >>>> > >>>> >>>> > >>>> It's been years since I worked on geospatial projects and I'm >>>> not an expert (at all). Maybe start with something simple but useful, >>>> like WKT<=>WKB conversion? (A round-trip sketch follows at the end of this >>>> thread.) >>>> > >>>> >>>> > >>>> >>>> > >>>> On Fri, Mar 28, 2025 at 9:27 PM, Menelaos Karavelas < >>>> menelaos.karave...@gmail.com <mailto:menelaos.karave...@gmail.com>> >>>> wrote: >>>> > >>>>> In the SPIP JIRA, the proposal is to add the expressions >>>> ST_AsBinary, ST_GeomFromWKB, and ST_GeogFromWKB. >>>> > >>>>> Is there anything else that you think should be added? >>>> > >>>>> >>>> > >>>>> Regarding WKT, what do you think should be added? >>>> > >>>>> >>>> > >>>>> - Menelaos >>>> > >>>>> >>>> > >>>>> >>>> > >>>>>> On Mar 28, 2025, at 1:02 PM, Ángel Álvarez Pascua < >>>> angel.alvarez.pas...@gmail.com <mailto:angel.alvarez.pas...@gmail.com>> >>>> wrote: >>>> > >>>>>> >>>> > >>>>>> What about adding support for WKT < >>>> https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry>/WKB >>>> < >>>> https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry#Well-known_binary >>>> >? >>>> > >>>>>> >>>> > >>>>>> On Fri, Mar 28, 2025 at 8:50 PM, Ángel Álvarez Pascua (< >>>> angel.alvarez.pas...@gmail.com <mailto:angel.alvarez.pas...@gmail.com>>) >>>> wrote: >>>> > >>>>>>> +1 (non-binding) >>>> > >>>>>>> >>>> > >>>>>>> On Fri, Mar 28, 2025 at 6:48 PM, Menelaos Karavelas < >>>> menelaos.karave...@gmail.com <mailto:menelaos.karave...@gmail.com>> >>>> wrote: >>>> > >>>>>>>> Dear Spark community, >>>> > >>>>>>>> >>>> > >>>>>>>> I would like to propose the addition of new geospatial data >>>> types (GEOMETRY and GEOGRAPHY), which represent the geospatial values >>>> recently added as new logical types in the Parquet specification. >>>> > >>>>>>>> >>>> > >>>>>>>> The new types should improve Spark’s ability to read the new >>>> Parquet logical types and perform some minimal meaningful operations on >>>> them. >>>> > >>>>>>>> >>>> > >>>>>>>> SPIP: https://issues.apache.org/jira/browse/SPARK-51658 >>>> > >>>>>>>> >>>> > >>>>>>>> Looking forward to your comments and feedback. >>>> > >>>>>>>> >>>> > >>>>>>>> >>>> > >>>>>>>> Best regards, >>>> > >>>>>>>> >>>> > >>>>>>>> Menelaos Karavelas >>>> > >>>>>>>> >>>> > >>>>> >>>> > >> >>>> > >>>> >
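On the WKT/WKB question above: the conversions behind ST_AsBinary, ST_GeomFromWKB, and a WKT equivalent are a thin layer over JTS, the geometry library Sedona builds on. A self-contained round-trip sketch (the coordinates are illustrative):

    // WKT <-> WKB round trip with JTS. This illustrates what ST_GeomFromWKB
    // and ST_AsBinary do conceptually; it is not the proposed Spark code.
    import org.locationtech.jts.io.{WKBReader, WKBWriter, WKTReader, WKTWriter}

    object WktWkbRoundTrip extends App {
      val geom = new WKTReader().read("POINT (2.1744 41.4036)") // WKT -> Geometry
      val wkb: Array[Byte] = new WKBWriter().write(geom)        // ST_AsBinary analogue
      val back = new WKBReader().read(wkb)                      // ST_GeomFromWKB analogue
      println(new WKTWriter().write(back))                      // prints POINT (2.1744 41.4036)
    }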