*1. Domain types evolve quickly.* It has taken years for Parquet to include these new types in its format... We could evolve alongside Parquet. Unfortunately, Spark is not known for upgrading its dependencies quickly.
*2. Geospatial in Java and Python is a dependency hell.* How has Parquet solved that problem, then? I don't recall experiencing any "dependency hell" when working on geospatial projects with Spark, to be honest. Besides, Spark already includes Parquet as a dependency, so... where is the problem?

*3. Sedona already supports Geo fully in (Geo)Parquet.* The default format in Spark is Parquet, and Parquet now natively supports these types. Are we going to force users to add Sedona (along with all its third-party dependencies, I assume) to their projects just for reading, writing, and performing basic operations on these types?

Anyway, let's vote and see...

On Sat, 29 Mar 2025 at 22:41, Reynold Xin (<r...@databricks.com.invalid>) wrote:

> While I don’t think Spark should become a super specialized geospatial processing engine, I don’t think it makes sense to focus *only* on reading and writing from storage. Geospatial is a pretty common and fundamental capability of analytics systems, and virtually every mature and popular analytics system, be it open source or proprietary, storage or query, has some basic geospatial data type and support. Adding a geospatial type and some basic expressions is such a no-brainer.
>
> On Sat, Mar 29, 2025 at 2:27 PM Jia Yu <ji...@apache.org> wrote:
>
>> Hi Wenchen, Menelaos and Szehon,
>>
>> Thanks for the clarification — I’m glad to hear the primary motivation of this SPIP is focused on reading and writing geospatial data with Parquet and Iceberg. That’s an important goal, and I want to highlight that this problem is being solved by the Apache Sedona community.
>>
>> Since the primary motivation here is Parquet-level support, I suggest shifting the focus of this discussion toward enabling geo support in the Spark Parquet DataSource rather than introducing core types.
>>
>> ** Why Spark Should Avoid Hardcoding Domain-Specific Types like Geo Types **
>>
>> 1. Domain types evolve quickly.
>>
>> In geospatial, we already have geometry, geography, raster, trajectory, point clouds — and the list keeps growing. In AI/ML, we’re seeing tensors, vectors, and multi-dimensional arrays. Spark’s strength has always been in its general-purpose architecture and extensibility. Introducing hardcoded support for fast-changing domain-specific types risks long-term maintenance issues and eventual incompatibility with emerging standards.
>>
>> 2. Geospatial in Java and Python is a dependency hell.
>>
>> There are multiple competing geometry libraries with incompatible APIs. No widely adopted Java library supports geography types. The most authoritative CRS dataset (EPSG) is not Apache-compatible. The JSON format for CRS definitions (PROJJSON) is only fully supported in PROJ, a C++ library with no Java equivalent and no formal OGC standard status. On the Python side, this might involve Shapely and GeoPandas dependencies.
>>
>> 3. Sedona already supports Geo fully in (Geo)Parquet.
>>
>> Sedona has supported reading, writing, metadata preservation, and data skipping for GeoParquet (the predecessor of Parquet Geo) for over two years [2][3]. These features are production-tested and widely used.
>>
>> ** Proposed Path Forward: Geo Support via Spark Extensions **
>>
>> To enable seamless Parquet integration without burdening Spark core, here are two options:
>>
>> Option 1:
>> Sedona offers a dedicated `parquet-geo` DataSource that handles type encoding, metadata, and data skipping. No changes to Spark are required. This is already underway and will be maintained by the Sedona community to keep up with the evolving Geo standards.
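As a rough sketch of how Option 1 already looks from the user's side, here is Sedona's existing `geoparquet` DataSource in Scala; the proposed `parquet-geo` source would presumably be used the same way. The dependency, file paths, and the `geometry` column name are illustrative assumptions, not part of the proposal.

```scala
// Sketch only: assumes Apache Sedona's Spark artifacts (e.g. sedona-spark-shaded, 1.5+) are on
// the classpath; the paths and the `geometry` column name are made up for illustration.
import org.apache.sedona.spark.SedonaContext

object GeoParquetRoundTrip {
  def main(args: Array[String]): Unit = {
    // SedonaContext.create registers the geometry UDT and the ST_ SQL functions on the session.
    val sedona = SedonaContext.create(
      SedonaContext.builder().appName("geoparquet-roundtrip").master("local[*]").getOrCreate())

    // Read GeoParquet: geometry columns come back as a geometry type, with CRS/bbox metadata
    // preserved; spatial filtering is handled by the DataSource, not by Spark core.
    val df = sedona.read.format("geoparquet").load("/data/buildings.geoparquet")
    df.printSchema()

    val filtered = df.where(
      "ST_Intersects(geometry, ST_GeomFromWKT('POLYGON ((0 0, 10 0, 10 10, 0 10, 0 0))'))")

    // Write GeoParquet back out, keeping the geo metadata.
    filtered.write.format("geoparquet").save("/data/buildings_filtered.geoparquet")
  }
}
```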
>> Option 2:
>> Spark provides hooks to inject:
>> - custom logical types / user-defined types (UDTs)
>> - custom statistics and filter pushdowns
>> Sedona can then extend the built-in `parquet` DataSource to integrate geo type metadata, predicate pushdown, and serialization seamlessly.
>>
>> For Iceberg, we’ve already published a proof-of-concept connector [4] showing Sedona, Spark, and Iceberg working together without any Spark core changes [5].
>>
>> ** On the Bigger Picture **
>>
>> I also agree with your long-term vision. I believe Spark is on the path to becoming a foundational compute engine — much like Postgres or Pandas — where the core remains focused and stable, while powerful domain-specific capabilities emerge from its ecosystem.
>>
>> To support this future, Spark could prioritize flexible extension hooks so that third-party libraries can thrive — just like we’ve seen with PostGIS, pgvector, and TimescaleDB in the Postgres ecosystem, and GeoPandas in the Pandas ecosystem.
>>
>> Sedona is following this model by building geospatial support around Spark — not inside it — and we’d love to continue collaborating in this spirit.
>>
>> Happy to work together on providing Geo support in Parquet!
>>
>> Best,
>> Jia
>>
>> References
>>
>> [1] GeoParquet project:
>> https://github.com/opengeospatial/geoparquet
>>
>> [2] Sedona’s GeoParquet DataSource implementation:
>> https://github.com/apache/sedona/tree/master/spark/common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet
>>
>> [3] Sedona’s GeoParquet documentation:
>> https://sedona.apache.org/latest/tutorial/files/geoparquet-sedona-spark/
>>
>> [4] Sedona-Iceberg connector (PoC):
>> https://github.com/wherobots/sedona-iceberg-connector
>>
>> [5] Spark-Sedona-Iceberg working example:
>> https://github.com/wherobots/sedona-iceberg-connector/blob/main/src/test/scala/com/wherobots/sedona/TestGeospatial.scala#L53
>>
>> On 2025/03/29 19:27:08 Menelaos Karavelas wrote:
>> > To continue along the line of thought of Szehon:
>> >
>> > I am really excited that the Parquet and Iceberg communities have adopted geospatial logical types, and of course I am grateful for the work put in that direction.
>> >
>> > As both Wenchen and Szehon pointed out in their own way, the goal is to have minimal support in Spark, as a common platform, for these types.
>> >
>> > To be more specific and explicit: the proposal's scope is to add support for reading from and writing to Parquet, based on the new standard, as well as adding the types as built-in types in Spark to complement the storage support. The few ST expressions in the proposal seem to be the minimal set of expressions needed to support working with geospatial values in the Spark engine in a meaningful way.
>> >
>> > Best,
>> >
>> > Menelaos
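For a concrete sense of that minimal surface, the sketch below exercises the proposed WKB constructor/exporter pair (the expression names appear further down this thread) through Apache Sedona, whose SQL functions already use the same ST_ names. The built-in expressions do not exist in vanilla Spark today, so this is only an approximation of what the SPIP describes; the Sedona dependency and literal values are assumptions.

```scala
// Approximation only: runs with Apache Sedona (1.5+) on the classpath, not with vanilla Spark.
// ST_GeomFromWKB / ST_AsBinary here are Sedona's implementations, standing in for the
// expressions proposed in the SPIP.
import org.apache.sedona.spark.SedonaContext

object MinimalStExpressions {
  def main(args: Array[String]): Unit = {
    val spark = SedonaContext.create(
      SedonaContext.builder().appName("st-minimal").master("local[*]").getOrCreate())

    // Produce a WKB value, then round-trip it: WKB -> geometry -> WKB.
    spark.sql("SELECT ST_AsBinary(ST_GeomFromWKT('POINT (1 2)')) AS wkb")
      .createOrReplaceTempView("raw")

    spark.sql("SELECT ST_GeomFromWKB(wkb) AS geom FROM raw").show(false)
    spark.sql("SELECT ST_AsBinary(ST_GeomFromWKB(wkb)) AS wkb_again FROM raw").show(false)
  }
}
```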
>> > > On Mar 29, 2025, at 12:06 PM, Szehon Ho <szehon.apa...@gmail.com> wrote:
>> > >
>> > > Thank you Menelaos, will do!
>> > >
>> > > To give a little background, Jia and the Sedona community, as well as the GeoParquet community and others, really put much effort into defining the Parquet and Iceberg geo types, which couldn't have been done without their experience and help!
>> > >
>> > > But I do agree with Wenchen: now that the types are in the most common data sources in the ecosystem, I think Apache Spark as a common platform needs to have this type definition for interop; otherwise users of vanilla Spark cannot work with those data sources' stored geospatial data. (IMO a similar rationale applies to adding timestamp nano in the other ongoing SPIP.)
>> > >
>> > > And like Wenchen said, the SPIP’s goal doesn’t seem to be to fragment the ecosystem by implementing Sedona’s advanced geospatial analytic tech in Spark itself, which, you may be right, belongs in pluggable frameworks. Menelaos may explain more about the SPIP goal.
>> > >
>> > > I do hope there can be more collaboration across communities (like the Iceberg/Parquet collaboration) in drawing on the Sedona community’s experience to make sure these type definitions are optimal and compatible for Sedona.
>> > >
>> > > Thanks!
>> > > Szehon
>> > >
>> > >
>> > >> On Mar 29, 2025, at 8:04 AM, Menelaos Karavelas <menelaos.karave...@gmail.com> wrote:
>> > >>
>> > >> Hello Szehon,
>> > >>
>> > >> I just created a Google doc and also linked it in the JIRA:
>> > >>
>> > >> https://docs.google.com/document/d/1cYSNPGh95OjnpS0k_KDHGM9Ae3j-_0Wnc_eGBZL4D3w/edit?tab=t.0
>> > >>
>> > >> Please feel free to comment on it.
>> > >>
>> > >> Best,
>> > >>
>> > >> Menelaos
>> > >>
>> > >>
>> > >>> On Mar 28, 2025, at 2:19 PM, Szehon Ho <szehon.apa...@gmail.com> wrote:
>> > >>>
>> > >>> Thanks Menelaos, this is exciting! Is there a Google doc we can comment on, or just the JIRA?
>> > >>>
>> > >>> Thanks
>> > >>> Szehon
>> > >>>
>> > >>> On Fri, Mar 28, 2025 at 1:41 PM Ángel Álvarez Pascua <angel.alvarez.pas...@gmail.com> wrote:
>> > >>>> Sorry, I only had a quick look at the proposal, looked for WKT and didn't find anything.
>> > >>>>
>> > >>>> It's been years since I worked on geospatial projects and I'm not an expert (at all). Maybe start with something simple but useful, like WKT <=> WKB conversion?
>> > >>>>
>> > >>>> On Fri, 28 Mar 2025 at 21:27, Menelaos Karavelas <menelaos.karave...@gmail.com> wrote:
>> > >>>>> In the SPIP Jira the proposal is to add the expressions ST_AsBinary, ST_GeomFromWKB, and ST_GeogFromWKB.
>> > >>>>> Is there anything else that you think should be added?
>> > >>>>>
>> > >>>>> Regarding WKT, what do you think should be added?
>> > >>>>>
>> > >>>>> - Menelaos
>> > >>>>>
>> > >>>>>
>> > >>>>>> On Mar 28, 2025, at 1:02 PM, Ángel Álvarez Pascua <angel.alvarez.pas...@gmail.com> wrote:
>> > >>>>>>
>> > >>>>>> What about adding support for WKT <https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry> / WKB <https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry#Well-known_binary>?
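To make the WKT/WKB suggestion concrete, here is a minimal round-trip sketch using JTS (org.locationtech.jts), the geometry library Sedona already builds on. The object name and sample point are illustrative; the SPIP itself would expose this via SQL expressions rather than a library call.

```scala
// Minimal WKT <-> WKB round trip with JTS; assumes org.locationtech.jts:jts-core is available.
import org.locationtech.jts.geom.Geometry
import org.locationtech.jts.io.{WKBReader, WKBWriter, WKTReader, WKTWriter}

object WktWkbRoundTrip {
  def main(args: Array[String]): Unit = {
    val wkt = "POINT (30 10)"

    // WKT -> geometry -> WKB
    val geom: Geometry = new WKTReader().read(wkt)
    val wkb: Array[Byte] = new WKBWriter().write(geom)

    // WKB -> geometry -> WKT (round trip)
    val back: Geometry = new WKBReader().read(wkb)
    println(new WKTWriter().write(back)) // prints: POINT (30 10)
  }
}
```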
>> > >>>>>> On Fri, 28 Mar 2025 at 20:50, Ángel Álvarez Pascua (<angel.alvarez.pas...@gmail.com>) wrote:
>> > >>>>>>> +1 (non-binding)
>> > >>>>>>>
>> > >>>>>>> On Fri, 28 Mar 2025 at 18:48, Menelaos Karavelas <menelaos.karave...@gmail.com> wrote:
>> > >>>>>>>> Dear Spark community,
>> > >>>>>>>>
>> > >>>>>>>> I would like to propose the addition of new geospatial data types (GEOMETRY and GEOGRAPHY), which represent geospatial values and were recently added as new logical types in the Parquet specification.
>> > >>>>>>>>
>> > >>>>>>>> The new types should improve Spark’s ability to read the new Parquet logical types and perform some minimal meaningful operations on them.
>> > >>>>>>>>
>> > >>>>>>>> SPIP: https://issues.apache.org/jira/browse/SPARK-51658
>> > >>>>>>>>
>> > >>>>>>>> Looking forward to your comments and feedback.
>> > >>>>>>>>
>> > >>>>>>>>
>> > >>>>>>>> Best regards,
>> > >>>>>>>>
>> > >>>>>>>> Menelaos Karavelas
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org