Hi Jia, I really appreciate your very instructive answer. I truly believe that discussing topics with people who know far more than I do is a great way to learn new and interesting things. Your explanations are quite logical and make perfect sense to me. Sh**, I'm not that sure about this proposal now ... 😂
*"If you haven’t encountered this kind of ‘dependency hell’ while working on geospatial projects with Spark, you may have been fortunate to deal with relatively simple cases."* Yes, that was the case for us. We loaded OpenStreetMap data from Spain, calculated some Haversine distances between points in rasters using a custom-made C++ library, and did a bit more. We also used GeoTrellis for something, but I can't quite remember what Thanks again! El dom, 30 mar 2025 a las 9:25, Jia Yu (<ji...@apache.org>) escribió: > Hey Angel, > > I am glad that you asked these questions. Please see my answers below. > > > *1. Domain types evolve quickly. - It has taken years for Parquet to > include these new types in its format... We could evolve alongside Parquet. > Unfortunately, Spark is not known for upgrading its dependencies quickly.* > > Exactly — domain-specific types evolve rapidly and may head in directions > that aren’t fully aligned with formats like Parquet, Avro, and others. In > such cases, should Spark, as a general-purpose compute engine, really be > tightly coupled to the specifics of a single storage format? > > Personally, I really appreciate Spark’s UserDefinedType mechanism and > Apache Arrow’s ExtensionType — both offer maximum flexibility while keeping > the core engine clean and extensible. > > > > * 2. Geospatial in Java and Python is a dependency hell.- How has Parquet > solved that problem, then?* > > Exactly — this problem is not fully solved by Parquet. While the Parquet > spec now includes a definition for geospatial types, it’s more of a vision > than a complete, production-ready solution. Many aspects of the spec are > not yet implemented in Spark. In fact, the spec represents a compromise > among multiple vendors (e.g., BigQuery, Snowflake), and many design choices > are not aligned with Spark’s architecture or ecosystem. > > For example: > • The CRS property in the spec uses a PROJJSON string, which currently > only has a C++ implementation — there is no Java implementation available. > • The edge interpolation algorithms (e.g., for great-circle arcs) > mentioned in the spec also only exist in C++ libraries. > • Handling of antimeridian-crossing geometries is another complex topic > that isn’t addressed in Spark today. > > The Sedona community is actively working on solutions — either building > Java equivalents for these features or creating workarounds. These are > deeply domain-specific efforts and often require non-trivial geospatial > expertise. > > We are currently contributing a Java implementation of the Parquet Geo > format here: https://github.com/apache/parquet-java/pull/2971 > > In Python, geospatial manipulation depends on libraries like Shapely and > GeoPandas, which evolve quickly and frequently introduce breaking changes. > Sedona has invested significant effort to maintain compatibility and > stability for Python UDFs across these ecosystems. > > If you haven’t encountered this kind of “dependency hell” while working on > geospatial projects with Spark, you may have been fortunate to deal with > relatively simple cases — e.g., only working with point data or simple > polygons. > > That usually means: > 1. All geometries are in a single CRS, typically WGS84 (SRID 4326) > 2. No antimeridian-crossing geometries > 3. No need for high-precision distance calculations or spherical geometry > 4. No need to handle topology or wraparound issues > > If that’s the case, then Spark already works fine as-is for your use case > — so why complicate it? > > > *3. 
> *3. Sedona already supports Geo fully in (Geo)Parquet. - The default format > in Spark is Parquet, and Parquet now natively supports these types. Are we > going to force users to add Sedona?* > > While opinions may vary, I would encourage users to adopt a solution like > Apache Sedona that is laser-focused on geospatial. Sedona provides > comprehensive, step-by-step tutorials on how to handle geospatial > dependencies across major platforms — including Databricks, AWS EMR, > Microsoft Azure, and Google Cloud. We’re also actively collaborating with > cloud providers to bundle Sedona natively into their offerings, making it > even easier for users to get started. > > > That said, I generally share the same perspective — if the Spark community > believes it would benefit from having basic geospatial support built in, > the Sedona community would be happy to collaborate on this effort. We’re > open to contributing the necessary functionality and, if appropriate, > having Spark depend on Sedona directly to avoid reinvention. > > Thanks, > Jia > > > > On Sat, Mar 29, 2025 at 11:02 PM Ángel Álvarez Pascua < > angel.alvarez.pas...@gmail.com> wrote: > >> >> * 1. Domain types evolve quickly.* >> It has taken years for Parquet to include these new types in its >> format... We could evolve alongside Parquet. Unfortunately, Spark is not >> known for upgrading its dependencies quickly. >> >> * 2. Geospatial in Java and Python is a dependency hell.* >> How has Parquet solved that problem, then? I don't recall experiencing >> any "dependency hell" when working on geospatial projects with Spark, to be >> honest. Besides, Spark already includes Parquet as a dependency, so... >> where is the problem? >> >> *3. Sedona already supports Geo fully in (Geo)Parquet.* >> The default format in Spark is Parquet, and Parquet now natively supports >> these types. Are we going to force users to add Sedona (along with all its >> third-party dependencies, I assume) to their projects just for reading, >> writing, and performing basic operations with these types? >> >> Anyway, let's vote and see... >> >> On Sat, Mar 29, 2025 at 10:41 PM, Reynold Xin (<r...@databricks.com.invalid>) >> wrote: >> >>> While I don’t think Spark should become a super-specialized geospatial >>> processing engine, I don’t think it makes sense to focus *only* on reading >>> and writing from storage. Geospatial is a pretty common and fundamental >>> capability of analytics systems, and virtually every mature and popular >>> analytics system, be it open source or proprietary, storage or query, has >>> some basic geospatial data types and support. Adding a geospatial type and >>> some basic expressions is such a no-brainer. >>> >>> On Sat, Mar 29, 2025 at 2:27 PM Jia Yu <ji...@apache.org> wrote: >>> >>>> Hi Wenchen, Menelaos and Szehon, >>>> >>>> Thanks for the clarification — I’m glad to hear the primary motivation >>>> of this SPIP is focused on reading and writing geospatial data with Parquet >>>> and Iceberg. That’s an important goal, and I want to highlight that this >>>> problem is being solved by the Apache Sedona community. >>>> >>>> Since the primary motivation here is Parquet-level support, I suggest >>>> shifting the focus of this discussion toward enabling geo support in the Spark >>>> Parquet DataSource rather than introducing core types. >>>> >>>> ** Why Spark Should Avoid Hardcoding Domain-Specific Types like geo >>>> types ** >>>> >>>> 1. Domain types evolve quickly.
>>>> >>>> In geospatial, we already have geometry, geography, raster, trajectory, >>>> point clouds — and the list keeps growing. In AI/ML, we’re seeing tensors, >>>> vectors, and multi-dimensional arrays. Spark’s strength has always been in >>>> its general-purpose architecture and extensibility. Introducing hardcoded >>>> support for fast-changing domain-specific types risks long-term maintenance >>>> issues and eventual incompatibility with emerging standards. >>>> >>>> 2. Geospatial in Java and Python is a dependency hell. >>>> >>>> There are multiple competing geometry libraries with incompatible APIs. >>>> No widely adopted Java library supports geography types. The most >>>> authoritative CRS dataset (EPSG) is not Apache-compatible. The JSON format >>>> for CRS definitions (PROJJSON) is only fully supported in PROJ, a C++ >>>> library with no Java equivalent and no formal OGC standard status. On the >>>> Python side, this might involve Shapely and GeoPandas dependencies. >>>> >>>> 3. Sedona already supports Geo fully in (Geo)Parquet. >>>> >>>> Sedona has supported reading, writing, metadata preservation, and data >>>> skipping for GeoParquet (the predecessor of Parquet Geo) for over two years >>>> [2][3]. These features are production-tested and widely used. >>>> >>>> ** Proposed Path Forward: Geo Support via Spark Extensions ** >>>> >>>> To enable seamless Parquet integration without burdening Spark core, >>>> here are two options: >>>> >>>> Option 1: >>>> Sedona offers a dedicated `parquet-geo` DataSource that handles type >>>> encoding, metadata, and data skipping. No changes to Spark are required. >>>> This is already underway and will be maintained by the Sedona community to >>>> keep up with the evolving Geo standards. >>>> >>>> Option 2: >>>> Spark provides hooks to inject: >>>> - custom logical types / user-defined types (UDTs) >>>> - custom statistics and filter pushdowns >>>> Sedona can then extend the built-in `parquet` DataSource to integrate >>>> geo type metadata, predicate pushdown, and serialization seamlessly (a >>>> plug-in sketch follows after this message). >>>> >>>> For Iceberg, we’ve already published a proof-of-concept connector [4] >>>> showing Sedona, Spark, and Iceberg working together without any Spark core >>>> changes [5]. >>>> >>>> ** On the Bigger Picture ** >>>> >>>> I also agree with your long-term vision. I believe Spark is on the path >>>> to becoming a foundational compute engine — much like Postgres or Pandas — >>>> where the core remains focused and stable, while powerful domain-specific >>>> capabilities emerge from its ecosystem. >>>> >>>> To support this future, Spark could prioritize flexible extension hooks >>>> so that third-party libraries can thrive — just like we’ve seen with >>>> PostGIS, pgvector, and TimescaleDB in the Postgres ecosystem, and GeoPandas in >>>> the Pandas ecosystem. >>>> >>>> Sedona is following this model by building geospatial support around >>>> Spark — not inside it — and we’d love to continue collaborating in this >>>> spirit. >>>> >>>> Happy to work together on providing Geo support in Parquet! >>>> >>>> Best, >>>> Jia
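To make the extension path concrete: Spark already discovers third-party sources through the DataSourceRegister service-loader interface, which is what lets a `parquet-geo` DataSource (Option 1 above) ship entirely outside Spark core. A rough sketch, with illustrative class names and the actual reader logic elided:

    // Sketch of the plug-in mechanics behind Option 1; the class name and
    // format name are illustrative, and the bodies are elided. Spark finds
    // this class through the Java ServiceLoader file:
    // META-INF/services/org.apache.spark.sql.sources.DataSourceRegister
    import java.util

    import org.apache.spark.sql.connector.catalog.{Table, TableProvider}
    import org.apache.spark.sql.connector.expressions.Transform
    import org.apache.spark.sql.sources.DataSourceRegister
    import org.apache.spark.sql.types.StructType
    import org.apache.spark.sql.util.CaseInsensitiveStringMap

    class GeoParquetProvider extends TableProvider with DataSourceRegister {
      // Enables spark.read.format("parquet-geo") with no Spark core changes.
      override def shortName(): String = "parquet-geo"
      // Geo type encoding, metadata, and data skipping would live behind these.
      override def inferSchema(options: CaseInsensitiveStringMap): StructType = ???
      override def getTable(schema: StructType, partitioning: Array[Transform],
                            properties: util.Map[String, String]): Table = ???
    }

From the user's side this is just spark.read.format("parquet-geo").load(path), the same shape as the built-in parquet source.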
>>>> >>>> References >>>> >>>> [1] GeoParquet project: >>>> https://github.com/opengeospatial/geoparquet >>>> >>>> [2] Sedona’s GeoParquet DataSource implementation: >>>> >>>> https://github.com/apache/sedona/tree/master/spark/common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet >>>> >>>> [3] Sedona’s GeoParquet documentation: >>>> https://sedona.apache.org/latest/tutorial/files/geoparquet-sedona-spark/ >>>> >>>> [4] Sedona-Iceberg connector (PoC): >>>> https://github.com/wherobots/sedona-iceberg-connector >>>> >>>> [5] Spark-Sedona-Iceberg working example: >>>> >>>> https://github.com/wherobots/sedona-iceberg-connector/blob/main/src/test/scala/com/wherobots/sedona/TestGeospatial.scala#L53 >>>> >>>> >>>> On 2025/03/29 19:27:08 Menelaos Karavelas wrote: >>>> > To continue along the line of thought of Szehon: >>>> > >>>> > I am really excited that the Parquet and Iceberg communities have >>>> adopted geospatial logical types, and of course I am grateful for the work >>>> put in that direction. >>>> > >>>> > As both Wenchen and Szehon pointed out in their own way, the goal is >>>> to have minimal support in Spark, as a common platform, for these types. >>>> > >>>> > To be more specific and explicit: the proposal's scope is to add >>>> support for reading/writing to Parquet, based on the new standard, as well >>>> as adding the types as built-in types in Spark to complement the storage >>>> support. The few ST expressions in the proposal seem to be the minimal set >>>> needed to work with geospatial values in the Spark engine in a meaningful >>>> way. >>>> > >>>> > Best, >>>> > >>>> > Menelaos >>>> > >>>> > >>>> > > On Mar 29, 2025, at 12:06 PM, Szehon Ho <szehon.apa...@gmail.com> >>>> wrote: >>>> > > >>>> > > Thank you Menelaos, will do! >>>> > > >>>> > > To give a little background, Jia and the Sedona community, along with >>>> the GeoParquet community and others, put a great deal of effort into >>>> defining the Parquet and Iceberg geo types, which couldn't have been done >>>> without their experience and help! >>>> > > >>>> > > But I do agree with Wenchen: now that the types are in most common >>>> data sources in the ecosystem, I think Apache Spark as a common platform needs >>>> to have this type definition for interop; otherwise users of vanilla Spark >>>> cannot work with the geospatial data stored in those data sources. (IMO, a >>>> similar rationale applies to adding timestamp nano in the other ongoing SPIP.) >>>> > > >>>> > > And like Wenchen said, the SPIP’s goal doesn’t seem to be to >>>> fragment the ecosystem by implementing Sedona’s advanced geospatial >>>> analytics in Spark itself, which you may be right belongs in pluggable >>>> frameworks. Menelaos may explain more about the SPIP goal. >>>> > > >>>> > > I do hope there can be more collaboration across communities (like >>>> the Iceberg/Parquet collaboration) in drawing on the Sedona community’s >>>> experience to make sure these type definitions are optimal and compatible >>>> with Sedona. >>>> > > >>>> > > Thanks! >>>> > > Szehon >>>> > > >>>> > > >>>> > >> On Mar 29, 2025, at 8:04 AM, Menelaos Karavelas < >>>> menelaos.karave...@gmail.com> wrote: >>>> > >> >>>> > >> >>>> > >> Hello Szehon, >>>> > >> >>>> > >> I just created a Google doc and also linked it in the JIRA: >>>> > >> >>>> > >> >>>> https://docs.google.com/document/d/1cYSNPGh95OjnpS0k_KDHGM9Ae3j-_0Wnc_eGBZL4D3w/edit?tab=t.0 >>>> > >> >>>> > >> Please feel free to comment on it.
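For concreteness, here is roughly how the existing GeoParquet support in [2][3] looks from the user's side today. This is a hedged usage sketch: the path, column names, and polygon are illustrative, and it assumes an existing SparkSession named spark with Sedona on the classpath:

    // Usage sketch for Sedona's GeoParquet DataSource ([2][3] above);
    // the path, column names, and polygon are illustrative.
    import org.apache.sedona.spark.SedonaContext

    val sedona = SedonaContext.create(spark) // registers ST_ functions and the "geoparquet" format
    val df = sedona.read.format("geoparquet").load("s3://bucket/buildings.parquet")
    df.createOrReplaceTempView("buildings")
    // GeoParquet's per-column bbox metadata is what enables the data
    // skipping mentioned above.
    sedona.sql(
      """SELECT id, ST_AsText(geometry)
        |FROM buildings
        |WHERE ST_Within(geometry, ST_GeomFromWKT(
        |  'POLYGON ((0 0, 10 0, 10 10, 0 10, 0 0))'))""".stripMargin).show()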
>>>> > >> >>>> > >> Best, >>>> > >> >>>> > >> Menelaos >>>> > >> >>>> > >> >>>> > >>> On Mar 28, 2025, at 2:19 PM, Szehon Ho <szehon.apa...@gmail.com> >>>> wrote: >>>> > >>> >>>> > >>> Thanks Menelaos, this is exciting! Is there a Google doc we can >>>> comment on, or just the JIRA? >>>> > >>> >>>> > >>> Thanks >>>> > >>> Szehon >>>> > >>> >>>> > >>> On Fri, Mar 28, 2025 at 1:41 PM Ángel Álvarez Pascua < >>>> angel.alvarez.pas...@gmail.com <mailto:angel.alvarez.pas...@gmail.com>> >>>> wrote: >>>> > >>>> Sorry, I only had a quick look at the proposal, looked for WKT, >>>> and didn't find anything. >>>> > >>>> >>>> > >>>> It's been years since I worked on geospatial projects and I'm >>>> not an expert (at all). Maybe start with something simple but useful, >>>> like WKT<=>WKB conversion? (A round-trip sketch follows at the end of this >>>> thread.) >>>> > >>>> >>>> > >>>> >>>> > >>>> On Fri, Mar 28, 2025 at 9:27 PM, Menelaos Karavelas < >>>> menelaos.karave...@gmail.com <mailto:menelaos.karave...@gmail.com>> >>>> wrote: >>>> > >>>>> In the SPIP JIRA, the proposal is to add the expressions >>>> ST_AsBinary, ST_GeomFromWKB, and ST_GeogFromWKB. >>>> > >>>>> Is there anything else that you think should be added? >>>> > >>>>> >>>> > >>>>> Regarding WKT, what do you think should be added? >>>> > >>>>> >>>> > >>>>> - Menelaos >>>> > >>>>> >>>> > >>>>> >>>> > >>>>>> On Mar 28, 2025, at 1:02 PM, Ángel Álvarez Pascua < >>>> angel.alvarez.pas...@gmail.com <mailto:angel.alvarez.pas...@gmail.com>> >>>> wrote: >>>> > >>>>>> >>>> > >>>>>> What about adding support for WKT < >>>> https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry>/WKB >>>> < >>>> https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry#Well-known_binary >>>> >? >>>> > >>>>>> >>>> > >>>>>> On Fri, Mar 28, 2025 at 8:50 PM, Ángel Álvarez Pascua (< >>>> angel.alvarez.pas...@gmail.com <mailto:angel.alvarez.pas...@gmail.com>>) >>>> wrote: >>>> > >>>>>>> +1 (non-binding) >>>> > >>>>>>> >>>> > >>>>>>> On Fri, Mar 28, 2025 at 6:48 PM, Menelaos Karavelas < >>>> menelaos.karave...@gmail.com <mailto:menelaos.karave...@gmail.com>> >>>> wrote: >>>> > >>>>>>>> Dear Spark community, >>>> > >>>>>>>> >>>> > >>>>>>>> I would like to propose the addition of new geospatial data >>>> types (GEOMETRY and GEOGRAPHY), which represent the geospatial values >>>> recently added as new logical types in the Parquet specification. >>>> > >>>>>>>> >>>> > >>>>>>>> The new types should improve Spark’s ability to read the new >>>> Parquet logical types and perform some minimal meaningful operations on >>>> them. >>>> > >>>>>>>> >>>> > >>>>>>>> SPIP: https://issues.apache.org/jira/browse/SPARK-51658 >>>> > >>>>>>>> >>>> > >>>>>>>> Looking forward to your comments and feedback. >>>> > >>>>>>>> >>>> > >>>>>>>> >>>> > >>>>>>>> Best regards, >>>> > >>>>>>>> >>>> > >>>>>>>> Menelaos Karavelas >>>> > >>>>>>>> >>>> > >>>>> >>>> > >> >>>> > >>>> >
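On the WKT/WKB question above: the conversions behind ST_AsBinary, ST_GeomFromWKB, and a WKT equivalent are a thin layer over JTS, the geometry library Sedona builds on. A self-contained round-trip sketch (the coordinates are illustrative):

    // WKT <-> WKB round trip with JTS. This illustrates what ST_GeomFromWKB
    // and ST_AsBinary do conceptually; it is not the proposed Spark code.
    import org.locationtech.jts.io.{WKBReader, WKBWriter, WKTReader, WKTWriter}

    object WktWkbRoundTrip extends App {
      val geom = new WKTReader().read("POINT (2.1744 41.4036)") // WKT -> Geometry
      val wkb: Array[Byte] = new WKBWriter().write(geom)        // ST_AsBinary analogue
      val back = new WKBReader().read(wkb)                      // ST_GeomFromWKB analogue
      println(new WKTWriter().write(back))                      // prints POINT (2.1744 41.4036)
    }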