Hello Jia,

Wenchen summarized the intent very clearly. The scope of the proposal is 
primarily the type system and storage, not processing. Let’s work together on 
the technical details and make sure the work we propose to do in Spark works 
best with Apache Sedona.

Best,

Menelaos


> On Mar 29, 2025, at 5:23 AM, Wenchen Fan <cloud0...@gmail.com> wrote:
> 
> Hi Jia,
> 
> This is a good question. As the shepherd of this SPIP, I'd like to clarify 
> the motivation here: the focus of this project is more about the storage 
> part, not the processing. Apache Sedona is a great library for geo 
> processing, but without native geo type support in Spark, users can't do the 
> following things:
> - read the geo type columns from Parquet files (or other data sources) 
> directly
> - write geo values into Parquet files (or other data sources) as native geo 
> types.
> - push down geo predicates to the data source when reading
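> 
> As a toy illustration of the pushdown point (all names below are hypothetical, 
> not a real Spark or Parquet API): Parquet row groups can carry min/max 
> statistics for a geometry column as a bounding box, so a reader can skip any 
> row group whose box cannot intersect the query region.

```python
# Hypothetical sketch of bbox-based predicate pushdown for a geo column.
# Assumption: each Parquet row group carries min/max statistics for the
# geometry column as a bounding box (xmin, ymin, xmax, ymax).

def bbox_intersects(a, b):
    """True if two (xmin, ymin, xmax, ymax) boxes overlap."""
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

def prune_row_groups(row_groups, query_bbox):
    """Keep only row groups whose bbox statistics intersect the query box."""
    return [rg for rg in row_groups if bbox_intersects(rg["bbox"], query_bbox)]

row_groups = [
    {"id": 0, "bbox": (0.0, 0.0, 10.0, 10.0)},    # overlaps the query box
    {"id": 1, "bbox": (50.0, 50.0, 60.0, 60.0)},  # disjoint -> skipped
]
survivors = prune_row_groups(row_groups, (5.0, 5.0, 15.0, 15.0))
print([rg["id"] for rg in survivors])  # -> [0]
```

> Without a native geo type, Spark has no typed column statistics to hang such 
> a filter on, which is the gap this part of the SPIP targets.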
> 
> In the SPIP JIRA, we explicitly mentioned that "This proposal is laying the 
> foundation - building the infrastructure to handle geospatial data, but not 
> creating a full-featured geospatial processing system. Such extension can be 
> done later as a separate improvement." Maybe the right direction is to not do 
> it and leave it to third-party libraries.
> 
> The ultimate goal is to establish Spark as a comprehensive platform that can 
> connect to a rich ecosystem of third-party data sources and processing 
> libraries. For this project, we should definitely work with the Apache Sedona 
> community closely, to figure out the best protocol (what binary/text format 
> to use? how to represent geo values in Java? etc.)
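> 
> On the binary side, WKB (well-known binary) is the natural candidate, since 
> it is what Parquet's new GEOMETRY logical type stores. A minimal stdlib-only 
> sketch of decoding a 2-D WKB point, just to show the kind of byte-level 
> contract the two communities would need to agree on:

```python
import struct

# Minimal decoder for a 2-D WKB (well-known binary) Point.
# WKB layout: a 1-byte byte-order flag, a uint32 geometry type, then coordinates.

def decode_wkb_point(buf: bytes):
    endian = "<" if buf[0] == 1 else ">"   # 1 = little-endian, 0 = big-endian
    (geom_type,) = struct.unpack_from(endian + "I", buf, 1)
    if geom_type != 1:                     # 1 = Point in the WKB spec
        raise ValueError(f"not a WKB Point: type {geom_type}")
    x, y = struct.unpack_from(endian + "dd", buf, 5)
    return x, y

# Round-trip a little-endian WKB point.
wkb = struct.pack("<BIdd", 1, 1, -122.3, 47.6)
print(decode_wkb_point(wkb))  # -> (-122.3, 47.6)
```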
> 
> Thanks,
> Wenchen
> 
> On Sat, Mar 29, 2025 at 5:28 AM Jia Yu <ji...@apache.org> wrote:
>> Dear Menelaos,
>> 
>> Thanks for bringing this up again. I’ve seen similar proposals come up on 
>> the mailing list before, and I’d like to offer some thoughts.
>> 
>> For full transparency, I’m Jia Yu, PMC Chair of Apache Sedona 
>> (https://github.com/apache/sedona), a widely used open-source cluster 
>> computing framework for processing large-scale geospatial data on Spark, 
>> Flink, and other engines.
>> 
>> From what I understand, this proposal aims to add native geospatial types 
>> and functionality directly into Spark. However, this seems to replicate much 
>> of the work already done by the Sedona project over the past 10 years.
>> 
>> Sedona has a strong and active community with well-established contribution 
>> guidelines. It is already used extensively with Spark in production—on 
>> platforms like Databricks, AWS EMR, Microsoft Fabric, and Google Cloud. 
>> Users simply add the Sedona jar, flip a Spark config, and it just 
>> works—similar to other mature Spark ecosystem libraries.
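>> 
>> As a concrete illustration (the package coordinates and versions below are 
>> examples from one release line and will drift; check the Sedona docs for 
>> current ones), that setup typically looks like:

```shell
# Illustrative only: coordinates/versions change between releases, and some
# releases also need an extra geotools-wrapper dependency.
spark-submit \
  --packages org.apache.sedona:sedona-spark-shaded-3.5_2.12:1.6.1 \
  --conf spark.sql.extensions=org.apache.sedona.sql.SedonaSqlExtensions \
  my_geo_job.py
```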
>> 
>> The project sees over 2 million downloads per month across PyPI, Maven, 
>> etc., and has been downloaded more than 45 million times overall. Thousands 
>> of organizations rely on Sedona in production Spark environments.
>> 
>> Sedona has also actively contributed to upstream ecosystem efforts, such as 
>> geospatial support in Parquet and Iceberg formats.
>> 
>> Additionally, Sedona’s core technology has been peer-reviewed and published 
>> at top academic conferences. Its performance has been evaluated and 
>> benchmarked by many independent research articles: 
>> https://sedona.apache.org/latest/community/publication/
>> 
>> Given all of this, I’m genuinely unsure what gap a new Spark-native effort 
>> is aiming to fill. If there’s a specific limitation that Sedona cannot 
>> address, I’d be eager to understand it. Otherwise, duplicating this 
>> functionality risks fragmenting the ecosystem and introducing confusion for 
>> current users. I would strongly advocate for close coordination with the 
>> Sedona community to avoid disruption and ensure alignment with the broader 
>> Spark ecosystem.
>> 
>> Thanks again for raising this—we’re always happy to collaborate and 
>> strengthen the ecosystem together.
>> 
>> 
>> Here is a quick overview of what Apache Sedona already offers on Spark:
>> - Geospatial type support:
>>   - Vector: Geometry, partial Geography
>>   - Raster
>> - Vector data sources:
>>   - GeoParquet (read/write), GeoJSON (read/write), Shapefile, GeoPackage, OpenStreetMap PBF
>> - Raster data sources:
>>   - STAC catalog, GeoTIFF (read/write), NetCDF/HDF
>> - Functions:
>>   - 209+ vector (ST_*) functions
>>   - 100+ raster (RS_*) functions
>>   - GeoStats SQL: DBSCAN, hotspot analysis, outlier detection
>> - Language support:
>>   - Scala, Java, SQL, Python, R
>> - Query acceleration via R-Tree:
>>   - Distributed and broadcast spatial joins
>>   - KNN joins
>>   - Range queries
>> - UDF support:
>>   - Scala UDFs (JTS), Python UDFs (Shapely, Rasterio, NumPy), Pandas UDFs
>> - Serialization:
>>   - Custom serializers for geometry types
>> - Ecosystem integrations:
>>   - Jupyter, Zeppelin, Apache Arrow, GeoPandas read/write, GeoPandas-like API
>> 
>> Jia Yu
>> 
>> 
>> On 2025/03/28 17:46:15 Menelaos Karavelas wrote:
>> > Dear Spark community,
>> > 
>> > I would like to propose the addition of new geospatial data types 
>> > (GEOMETRY and GEOGRAPHY), matching the geospatial logical types recently 
>> > added to the Parquet specification.
>> > 
>> > The new types should improve Spark’s ability to read the new Parquet 
>> > logical types and perform some minimal meaningful operations on them.
>> > 
>> > SPIP: https://issues.apache.org/jira/browse/SPARK-51658
>> > 
>> > Looking forward to your comments and feedback.
>> > 
>> > 
>> > Best regards,
>> > 
>> > Menelaos Karavelas
>> > 
>> > 
>> 