*1. Domain types evolve quickly.* It has taken years for Parquet to include these new types in its format... We could evolve alongside Parquet. Unfortunately, Spark is not known for upgrading its dependencies quickly.
*2. Geospatial in Java and Python is a dependency hell.* How has Parquet solved that problem, then? I don't recall experiencing any "dependency hell" when working on geospatial projects with Spark, to be honest. Besides, Spark already includes Parquet as a dependency, so... where is the problem?

*3. Sedona already supports Geo fully in (Geo)Parquet.* The default format in Spark is Parquet, and Parquet now natively supports these types. Are we going to force users to add Sedona (along with all its third-party dependencies, I assume) to their projects just for reading, writing, and performing basic operations on these types?

Anyway, let's vote and see...

On Sat, 29 Mar 2025 at 22:41, Reynold Xin (<r...@databricks.com.invalid>) wrote:

> While I don’t think Spark should become a super specialized geospatial processing engine, I don’t think it makes sense to focus *only* on reading and writing from storage. Geospatial is a pretty common and fundamental capability of analytics systems, and virtually every mature and popular analytics system, be it open source or proprietary, storage or query, has some basic geospatial data type and support. Adding a geospatial type and some basic expressions is such a no-brainer.
>
> On Sat, Mar 29, 2025 at 2:27 PM Jia Yu <ji...@apache.org> wrote:
>
>> Hi Wenchen, Menelaos and Szehon,
>>
>> Thanks for the clarification — I’m glad to hear the primary motivation of this SPIP is focused on reading and writing geospatial data with Parquet and Iceberg. That’s an important goal, and I want to highlight that this problem is being solved by the Apache Sedona community.
>>
>> Since the primary motivation here is Parquet-level support, I suggest shifting the focus of this discussion toward enabling geo support in the Spark Parquet DataSource rather than introducing core types.
>>
>> ** Why Spark Should Avoid Hardcoding Domain-Specific Types like Geo Types **
>>
>> 1. Domain types evolve quickly.
>>
>> In geospatial, we already have geometry, geography, raster, trajectory, point clouds — and the list keeps growing. In AI/ML, we’re seeing tensors, vectors, and multi-dimensional arrays. Spark’s strength has always been in its general-purpose architecture and extensibility. Introducing hardcoded support for fast-changing domain-specific types risks long-term maintenance issues and eventual incompatibility with emerging standards.
>>
>> 2. Geospatial in Java and Python is a dependency hell.
>>
>> There are multiple competing geometry libraries with incompatible APIs. No widely adopted Java library supports geography types. The most authoritative CRS dataset (EPSG) is not Apache-compatible. The JSON format for CRS definitions (PROJJSON) is only fully supported in PROJ, a C++ library with no Java equivalent and no formal OGC standard status. On the Python side, this might involve Shapely and GeoPandas dependencies.
>>
>> 3. Sedona already supports Geo fully in (Geo)Parquet.
>>
>> Sedona has supported reading, writing, metadata preservation, and data skipping for GeoParquet (the predecessor of Parquet Geo) for over two years [2][3]. These features are production-tested and widely used.
>>
>> ** Proposed Path Forward: Geo Support via Spark Extensions **
>>
>> To enable seamless Parquet integration without burdening Spark core, here are two options:
>>
>> Option 1:
>> Sedona offers a dedicated `parquet-geo` DataSource that handles type encoding, metadata, and data skipping. No changes to Spark are required. This is already underway and will be maintained by the Sedona community to keep up with the evolving Geo standards.
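As a rough sketch of how Option 1 already looks from the user's side, here is Sedona's existing `geoparquet` DataSource in Scala; the proposed `parquet-geo` source would presumably be used the same way. The dependency, file paths, and the `geometry` column name are illustrative assumptions, not part of the proposal.

```scala
// Sketch only: assumes Apache Sedona's Spark artifacts (e.g. sedona-spark-shaded, 1.5+) are on
// the classpath; the paths and the `geometry` column name are made up for illustration.
import org.apache.sedona.spark.SedonaContext

object GeoParquetRoundTrip {
  def main(args: Array[String]): Unit = {
    // SedonaContext.create registers the geometry UDT and the ST_ SQL functions on the session.
    val sedona = SedonaContext.create(
      SedonaContext.builder().appName("geoparquet-roundtrip").master("local[*]").getOrCreate())

    // Read GeoParquet: geometry columns come back as a geometry type, with CRS/bbox metadata
    // preserved; spatial filtering is handled by the DataSource, not by Spark core.
    val df = sedona.read.format("geoparquet").load("/data/buildings.geoparquet")
    df.printSchema()

    val filtered = df.where(
      "ST_Intersects(geometry, ST_GeomFromWKT('POLYGON ((0 0, 10 0, 10 10, 0 10, 0 0))'))")

    // Write GeoParquet back out, keeping the geo metadata.
    filtered.write.format("geoparquet").save("/data/buildings_filtered.geoparquet")
  }
}
```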
>> Option 2:
>> Spark provides hooks to inject:
>> - custom logical types / user-defined types (UDTs)
>> - custom statistics and filter pushdowns
>> Sedona can then extend the built-in `parquet` DataSource to integrate geo type metadata, predicate pushdown, and serialization seamlessly.
>>
>> For Iceberg, we’ve already published a proof-of-concept connector [4] showing Sedona, Spark, and Iceberg working together without any Spark core changes [5].
>>
>> ** On the Bigger Picture **
>>
>> I also agree with your long-term vision. I believe Spark is on the path to becoming a foundational compute engine — much like Postgres or Pandas — where the core remains focused and stable, while powerful domain-specific capabilities emerge from its ecosystem.
>>
>> To support this future, Spark could prioritize flexible extension hooks so that third-party libraries can thrive — just like we’ve seen with PostGIS, pgvector, and TimescaleDB in the Postgres ecosystem, and GeoPandas in the Pandas ecosystem.
>>
>> Sedona is following this model by building geospatial support around Spark — not inside it — and we’d love to continue collaborating in this spirit.
>>
>> Happy to work together on providing Geo support in Parquet!
>>
>> Best,
>> Jia
>>
>> References
>>
>> [1] GeoParquet project:
>> https://github.com/opengeospatial/geoparquet
>>
>> [2] Sedona’s GeoParquet DataSource implementation:
>> https://github.com/apache/sedona/tree/master/spark/common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet
>>
>> [3] Sedona’s GeoParquet documentation:
>> https://sedona.apache.org/latest/tutorial/files/geoparquet-sedona-spark/
>>
>> [4] Sedona-Iceberg connector (PoC):
>> https://github.com/wherobots/sedona-iceberg-connector
>>
>> [5] Spark-Sedona-Iceberg working example:
>> https://github.com/wherobots/sedona-iceberg-connector/blob/main/src/test/scala/com/wherobots/sedona/TestGeospatial.scala#L53
>>
>> On 2025/03/29 19:27:08 Menelaos Karavelas wrote:
>> > To continue along the line of thought of Szehon:
>> >
>> > I am really excited that the Parquet and Iceberg communities have adopted geospatial logical types, and of course I am grateful for the work put in that direction.
>> >
>> > As both Wenchen and Szehon pointed out in their own way, the goal is to have minimal support in Spark, as a common platform, for these types.
>> >
>> > To be more specific and explicit: the proposal's scope is to add support for reading from and writing to Parquet, based on the new standard, as well as adding the types as built-in types in Spark to complement the storage support. The few ST expressions in the proposal seem to be the minimal set of expressions needed to support working with geospatial values in the Spark engine in a meaningful way.
>> >
>> > Best,
>> >
>> > Menelaos
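For a concrete sense of that minimal surface, the sketch below exercises the proposed WKB constructor/exporter pair (the expression names appear further down this thread) through Apache Sedona, whose SQL functions already use the same ST_ names. The built-in expressions do not exist in vanilla Spark today, so this is only an approximation of what the SPIP describes; the Sedona dependency and literal values are assumptions.

```scala
// Approximation only: runs with Apache Sedona (1.5+) on the classpath, not with vanilla Spark.
// ST_GeomFromWKB / ST_AsBinary here are Sedona's implementations, standing in for the
// expressions proposed in the SPIP.
import org.apache.sedona.spark.SedonaContext

object MinimalStExpressions {
  def main(args: Array[String]): Unit = {
    val spark = SedonaContext.create(
      SedonaContext.builder().appName("st-minimal").master("local[*]").getOrCreate())

    // Produce a WKB value, then round-trip it: WKB -> geometry -> WKB.
    spark.sql("SELECT ST_AsBinary(ST_GeomFromWKT('POINT (1 2)')) AS wkb")
      .createOrReplaceTempView("raw")

    spark.sql("SELECT ST_GeomFromWKB(wkb) AS geom FROM raw").show(false)
    spark.sql("SELECT ST_AsBinary(ST_GeomFromWKB(wkb)) AS wkb_again FROM raw").show(false)
  }
}
```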
>> > > On Mar 29, 2025, at 12:06 PM, Szehon Ho <szehon.apa...@gmail.com> wrote:
>> > >
>> > > Thank you Menelaos, will do!
>> > >
>> > > To give a little background, Jia and the Sedona community, as well as the GeoParquet community and others, really put much effort into defining the Parquet and Iceberg geo types, which couldn't have been done without their experience and help!
>> > >
>> > > But I do agree with Wenchen: now that the types are in the most common data sources in the ecosystem, I think Apache Spark as a common platform needs to have this type definition for interop; otherwise users of vanilla Spark cannot work with those data sources' stored geospatial data. (IMO a similar rationale applies to adding timestamp nano in the other ongoing SPIP.)
>> > >
>> > > And like Wenchen said, the SPIP’s goal doesn’t seem to be to fragment the ecosystem by implementing Sedona’s advanced geospatial analytic tech in Spark itself, which, you may be right, belongs in pluggable frameworks. Menelaos may explain more about the SPIP goal.
>> > >
>> > > I do hope there can be more collaboration across communities (like the Iceberg/Parquet collaboration) in drawing on the Sedona community’s experience to make sure these type definitions are optimal and compatible for Sedona.
>> > >
>> > > Thanks!
>> > > Szehon
>> > >
>> > >
>> > >> On Mar 29, 2025, at 8:04 AM, Menelaos Karavelas <menelaos.karave...@gmail.com> wrote:
>> > >>
>> > >> Hello Szehon,
>> > >>
>> > >> I just created a Google doc and also linked it in the JIRA:
>> > >>
>> > >> https://docs.google.com/document/d/1cYSNPGh95OjnpS0k_KDHGM9Ae3j-_0Wnc_eGBZL4D3w/edit?tab=t.0
>> > >>
>> > >> Please feel free to comment on it.
>> > >>
>> > >> Best,
>> > >>
>> > >> Menelaos
>> > >>
>> > >>
>> > >>> On Mar 28, 2025, at 2:19 PM, Szehon Ho <szehon.apa...@gmail.com> wrote:
>> > >>>
>> > >>> Thanks Menelaos, this is exciting! Is there a Google doc we can comment on, or just the JIRA?
>> > >>>
>> > >>> Thanks
>> > >>> Szehon
>> > >>>
>> > >>> On Fri, Mar 28, 2025 at 1:41 PM Ángel Álvarez Pascua <angel.alvarez.pas...@gmail.com> wrote:
>> > >>>> Sorry, I only had a quick look at the proposal, looked for WKT and didn't find anything.
>> > >>>>
>> > >>>> It's been years since I worked on geospatial projects and I'm not an expert (at all). Maybe start with something simple but useful, like WKT <=> WKB conversion?
>> > >>>>
>> > >>>> On Fri, 28 Mar 2025 at 21:27, Menelaos Karavelas <menelaos.karave...@gmail.com> wrote:
>> > >>>>> In the SPIP Jira the proposal is to add the expressions ST_AsBinary, ST_GeomFromWKB, and ST_GeogFromWKB.
>> > >>>>> Is there anything else that you think should be added?
>> > >>>>>
>> > >>>>> Regarding WKT, what do you think should be added?
>> > >>>>>
>> > >>>>> - Menelaos
>> > >>>>>
>> > >>>>>
>> > >>>>>> On Mar 28, 2025, at 1:02 PM, Ángel Álvarez Pascua <angel.alvarez.pas...@gmail.com> wrote:
>> > >>>>>>
>> > >>>>>> What about adding support for WKT <https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry> / WKB <https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry#Well-known_binary>?
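To make the WKT/WKB suggestion concrete, here is a minimal round-trip sketch using JTS (org.locationtech.jts), the geometry library Sedona already builds on. The object name and sample point are illustrative; the SPIP itself would expose this via SQL expressions rather than a library call.

```scala
// Minimal WKT <-> WKB round trip with JTS; assumes org.locationtech.jts:jts-core is available.
import org.locationtech.jts.geom.Geometry
import org.locationtech.jts.io.{WKBReader, WKBWriter, WKTReader, WKTWriter}

object WktWkbRoundTrip {
  def main(args: Array[String]): Unit = {
    val wkt = "POINT (30 10)"

    // WKT -> geometry -> WKB
    val geom: Geometry = new WKTReader().read(wkt)
    val wkb: Array[Byte] = new WKBWriter().write(geom)

    // WKB -> geometry -> WKT (round trip)
    val back: Geometry = new WKBReader().read(wkb)
    println(new WKTWriter().write(back)) // prints: POINT (30 10)
  }
}
```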
>> > >>>>>> On Fri, 28 Mar 2025 at 20:50, Ángel Álvarez Pascua (<angel.alvarez.pas...@gmail.com>) wrote:
>> > >>>>>>> +1 (non-binding)
>> > >>>>>>>
>> > >>>>>>> On Fri, 28 Mar 2025 at 18:48, Menelaos Karavelas <menelaos.karave...@gmail.com> wrote:
>> > >>>>>>>> Dear Spark community,
>> > >>>>>>>>
>> > >>>>>>>> I would like to propose the addition of new geospatial data types (GEOMETRY and GEOGRAPHY), which represent geospatial values and were recently added as new logical types in the Parquet specification.
>> > >>>>>>>>
>> > >>>>>>>> The new types should improve Spark’s ability to read the new Parquet logical types and perform some minimal meaningful operations on them.
>> > >>>>>>>>
>> > >>>>>>>> SPIP: https://issues.apache.org/jira/browse/SPARK-51658
>> > >>>>>>>>
>> > >>>>>>>> Looking forward to your comments and feedback.
>> > >>>>>>>>
>> > >>>>>>>>
>> > >>>>>>>> Best regards,
>> > >>>>>>>>
>> > >>>>>>>> Menelaos Karavelas
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org