Dear Menelaos,

Thanks for bringing this up again. I’ve seen similar proposals come up on the mailing list before, and I’d like to offer some thoughts.
For full transparency: I’m Jia Yu, PMC Chair of Apache Sedona (https://github.com/apache/sedona), a widely used open-source cluster computing framework for processing large-scale geospatial data on Spark, Flink, and other engines.

From what I understand, this proposal aims to add native geospatial types and functionality directly into Spark. However, this seems to replicate much of the work already done by the Sedona project over the past 10 years.

Sedona has a strong and active community with well-established contribution guidelines. It is already used extensively with Spark in production, on platforms such as Databricks, AWS EMR, Microsoft Fabric, and Google Cloud. Users simply add the Sedona jar, flip a Spark config, and it just works, much like other mature Spark ecosystem libraries. The project sees over 2 million downloads per month across PyPI, Maven, and other channels, and has been downloaded more than 45 million times overall. Thousands of organizations rely on Sedona in production Spark environments.

Sedona has also actively contributed to upstream ecosystem efforts, such as geospatial support in the Parquet and Iceberg formats. Additionally, Sedona’s core technology has been peer-reviewed and published at top academic conferences, and its performance has been evaluated and benchmarked by many independent research articles: https://sedona.apache.org/latest/community/publication/

Given all of this, I’m genuinely unsure what gap a new Spark-native effort is aiming to fill. If there’s a specific limitation that Sedona cannot address, I’d be eager to understand it. Otherwise, duplicating this functionality risks fragmenting the ecosystem and introducing confusion for current users. I would strongly advocate for close coordination with the Sedona community to avoid disruption and ensure alignment with the broader Spark ecosystem.

Thanks again for raising this. We’re always happy to collaborate and strengthen the ecosystem together.
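To make the "add the jar, flip a Spark config" point concrete, here is a minimal, illustrative sketch of enabling Sedona in a Spark SQL session. The package coordinates and version numbers below are assumptions on my part; please check the Sedona download page for the artifact that matches your Spark and Scala versions.

```shell
# Hedged sketch: package coordinates and versions are assumptions; verify
# against the Sedona docs for your Spark/Scala build.
spark-sql \
  --packages org.apache.sedona:sedona-spark-shaded-3.5_2.12:1.7.1,org.datasyslab:geotools-wrapper:1.7.1-28.5 \
  --conf spark.sql.extensions=org.apache.sedona.sql.SedonaSqlExtensions \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.kryo.registrator=org.apache.sedona.core.serde.SedonaKryoRegistrator \
  -e "SELECT ST_Contains(ST_GeomFromWKT('POLYGON ((0 0, 0 10, 10 10, 10 0, 0 0))'), ST_Point(5.0, 5.0))"
```

The `spark.sql.extensions` setting is the "one config flip": it registers the ST_* and RS_* functions with the session catalog, so they are available in plain Spark SQL with no per-session registration call.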
Here is a quick overview of what Apache Sedona already offers on Spark:

• Geospatial type support:
  • Vector: Geometry, partial Geography
  • Raster
• Vector data sources:
  • GeoParquet (read/write), GeoJSON (read/write), Shapefile, GeoPackage, OpenStreetMap PBF
• Raster data sources:
  • STAC catalog, GeoTiff (read/write), NetCDF/HDF
• Functions:
  • 209+ vector (ST_*) functions
  • 100+ raster (RS_*) functions
  • GeoStats SQL: DBSCAN, hotspot analysis, outlier detection
• Language support:
  • Scala, Java, SQL, Python, R
• Query acceleration via R-Tree:
  • Distributed and broadcast spatial joins
  • KNN joins
  • Range queries
• UDF support:
  • Scala UDFs (JTS), Python UDFs (Shapely, Rasterio, NumPy), Pandas UDFs
• Serialization:
  • Custom serializers for geometry types
• Ecosystem integrations:
  • Jupyter, Zeppelin, Apache Arrow, GeoPandas read/write, GeoPandas-like API

Jia Yu

On 2025/03/28 17:46:15 Menelaos Karavelas wrote:
> Dear Spark community,
>
> I would like to propose the addition of new geospatial data types (GEOMETRY
> and GEOGRAPHY) which represent geospatial values as recently added as new
> logical types in the Parquet specification.
>
> The new types should improve Spark’s ability to read the new Parquet logical
> types and perform some minimal meaningful operations on them.
>
> SPIP: https://issues.apache.org/jira/browse/SPARK-51658
>
> Looking forward to your comments and feedback.
>
>
> Best regards,
>
> Menelaos Karavelas
>

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org