Dear Menelaos,

Thanks for bringing this up again. I’ve seen similar proposals come up on the 
mailing list before, and I’d like to offer some thoughts.

For full transparency, I’m Jia Yu, PMC Chair of Apache Sedona 
(https://github.com/apache/sedona), a widely used open-source cluster computing 
framework for processing large-scale geospatial data on Spark, Flink, and other 
engines.

From what I understand, this proposal aims to add native geospatial types and 
functionality directly into Spark. However, this seems to replicate much of 
the work already done by the Sedona project over the past 10 years.

Sedona has a strong and active community with well-established contribution 
guidelines. It is already used extensively with Spark in production—on 
platforms like Databricks, AWS EMR, Microsoft Fabric, and Google Cloud. Users 
simply add the Sedona jar, flip a Spark config, and it just works—similar to 
other mature Spark ecosystem libraries.
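To illustrate how lightweight the setup is, a typical invocation looks roughly like the sketch below. This is a configuration fragment, not an exact recipe: the artifact coordinate and version placeholder are indicative, and you should check the Sedona documentation for the shaded jar that matches your Spark and Scala versions.

```
spark-submit \
  --packages org.apache.sedona:sedona-spark-shaded-3.5_2.12:<version> \
  --conf spark.sql.extensions=org.apache.sedona.sql.SedonaSqlExtensions \
  my_geospatial_job.py
```

With the extension registered, all ST_* and RS_* functions become available in Spark SQL with no further application changes.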

The project sees over 2 million downloads per month across PyPI, Maven, etc., 
and has been downloaded more than 45 million times overall. Thousands of 
organizations rely on Sedona in production Spark environments.

Sedona has also actively contributed to upstream ecosystem efforts, such as 
geospatial support in Parquet and Iceberg formats.

Additionally, Sedona’s core technology has been peer-reviewed and published at 
top academic conferences. Its performance has been evaluated and benchmarked by 
many independent research articles: 
https://sedona.apache.org/latest/community/publication/

Given all of this, I’m genuinely unsure what gap a new Spark-native effort is 
aiming to fill. If there’s a specific limitation that Sedona cannot address, 
I’d be eager to understand it. Otherwise, duplicating this functionality risks 
fragmenting the ecosystem and introducing confusion for current users. I would 
strongly advocate for close coordination with the Sedona community to avoid 
disruption and ensure alignment with the broader Spark ecosystem.

Thanks again for raising this—we’re always happy to collaborate and strengthen 
the ecosystem together.


Here is a quick overview of what Apache Sedona already offers on Spark:
        •       Geospatial type support:
                -       Vector: Geometry, partial Geography
                -       Raster
        •       Vector data sources:
                -       GeoParquet (read/write), GeoJSON (read/write), 
Shapefile, GeoPackage, OpenStreetMap PBF
        •       Raster data sources:
                -       STAC catalog, GeoTIFF (read/write), NetCDF/HDF
        •       Functions:
                -       209+ vector (ST_*) functions
                -       100+ raster (RS_*) functions
                -       GeoStats SQL: DBSCAN, hotspot analysis, outlier 
detection
        •       Language support:
                -       Scala, Java, SQL, Python, R
        •       Query acceleration via R-Tree:
                -       Distributed and broadcast spatial joins
                -       KNN joins
                -       Range queries
        •       UDF support:
                -       Scala UDFs (JTS), Python UDFs (Shapely, Rasterio, 
NumPy), Pandas UDFs
        •       Serialization:
                -       Custom serializers for geometry types
        •       Ecosystem integrations:
                -       Jupyter, Zeppelin, Apache Arrow, GeoPandas read/write, 
GeoPandas-like API
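
As a small taste of the SQL surface, a spatial aggregation might look like the sketch below. The table and column names are hypothetical; the ST_Contains predicate is one of the ST_* functions listed above, and Sedona accelerates such joins with its R-Tree-based spatial join planning.

```
-- Hypothetical tables: pois(name, geom), neighborhoods(region, geom)
SELECT n.region, COUNT(*) AS poi_count
FROM pois p
JOIN neighborhoods n
  ON ST_Contains(n.geom, p.geom)
GROUP BY n.region;
```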

Jia Yu


On 2025/03/28 17:46:15 Menelaos Karavelas wrote:
> Dear Spark community,
> 
> I would like to propose the addition of new geospatial data types (GEOMETRY 
> and GEOGRAPHY) which represent geospatial values as recently added as new 
> logical types in the Parquet specification.
> 
> The new types should improve Spark’s ability to read the new Parquet logical 
> types and perform some minimal meaningful operations on them.
> 
> SPIP: https://issues.apache.org/jira/browse/SPARK-51658
> 
> Looking forward to your comments and feedback.
> 
> 
> Best regards,
> 
> Menelaos Karavelas
> 
> 
