This is an automated email from the ASF dual-hosted git repository.
jiayu pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/sedona.git
The following commit(s) were added to refs/heads/master by this push:
new e0dbd8a6d6 [GH-2227] Add Geopandas Developer Guide README.md (#2229)
e0dbd8a6d6 is described below
commit e0dbd8a6d64c0cabf114771b41222822dcb324bb
Author: Peter Nguyen <[email protected]>
AuthorDate: Wed Aug 6 11:43:12 2025 -0700
[GH-2227] Add Geopandas Developer Guide README.md (#2229)
* Add geopandas README.md
* Move docs to docs/community/geopandas and add entry to mkdocs.yml
* Update docs/community/geopandas.md
Co-authored-by: Copilot <[email protected]>
* Update docs/community/geopandas.md
Co-authored-by: Copilot <[email protected]>
* Update docs/community/geopandas.md
Co-authored-by: Copilot <[email protected]>
* Apply suggestions from code review
Co-authored-by: Copilot <[email protected]>
* Add example of spark explain output
* Move explain query plans section below rest of section
* Apply suggestions from code review
Co-authored-by: Copilot <[email protected]>
---------
Co-authored-by: Jia Yu <[email protected]>
Co-authored-by: Copilot <[email protected]>
---
docs/community/geopandas.md | 85 +++++++++++++++++++++++++++++++++++++++++++++
mkdocs.yml | 1 +
2 files changed, 86 insertions(+)
diff --git a/docs/community/geopandas.md b/docs/community/geopandas.md
new file mode 100644
index 0000000000..626507f5da
--- /dev/null
+++ b/docs/community/geopandas.md
@@ -0,0 +1,85 @@
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements. See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership. The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied. See the License for the
+ specific language governing permissions and limitations
+ under the License.
+ -->
+
+# GeoPandas on Sedona
+
+This guide outlines a few important considerations for developers contributing
changes to the GeoPandas component of Apache Sedona. **This guide is targeted
at contributors**; the official documentation is tailored to users.
+
+**General Approach**: This component is built on top of the PySpark Pandas
API. The `GeoDataFrame` and `GeoSeries` classes inherit from PySpark pandas'
`ps.DataFrame` and `ps.Series` classes, respectively. When possible, prefer to
leverage PySpark pandas' implementation of a function and extend it from there
(i.e., find a way to use `super()` rather than copying over parts of the
logic). The code structure resembles the structure of the
[Geopandas repo](https:/ [...]
+
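A minimal sketch of the `super()` approach described above, in plain Python. `BaseSeries` and `GeoLikeSeries` are hypothetical stand-ins (not Sedona or PySpark classes): the subclass reuses the parent's slicing logic and only re-attaches its own metadata.

```python
class BaseSeries:
    """Hypothetical stand-in for a parent class such as ps.Series."""

    def __init__(self, values):
        self.values = list(values)

    def head(self, n=5):
        # Parent logic that the subclass should reuse, not copy.
        return BaseSeries(self.values[:n])


class GeoLikeSeries(BaseSeries):
    """Hypothetical subclass that carries extra metadata (a CRS label)."""

    def __init__(self, values, crs=None):
        super().__init__(values)
        self.crs = crs

    def head(self, n=5):
        # Delegate the slicing to the parent, then wrap the result again
        # so the subclass type and its metadata are preserved.
        parent_result = super().head(n)
        return GeoLikeSeries(parent_result.values, crs=self.crs)


points = GeoLikeSeries(["POINT (0 0)", "POINT (1 1)", "POINT (2 2)"], crs="EPSG:4326")
result = points.head(2)
```

If the parent later changes how `head()` slices, the subclass picks up the change without any edits.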
+**Lazy Evaluation**: Spark uses lazy evaluation. Spark's distributed and lazy
nature occasionally gets in the way of implementing functionality the same way
the original GeoPandas library does. For example, GeoPandas checks for an
invalid CRS in many places (e.g., `GeoSeries.__init__()`, `set_crs()`).
Sedona's implementation for getting the `crs` is currently expensive compared
to GeoPandas because it requires us to run an eager
`ST_SRID()` query. If we eagerly query for the c [...]
+
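The cost of an eager lookup can be illustrated with a toy class. This is a sketch under stated assumptions, not Sedona's code: `ToyGeoSeries` and `_run_srid_query` are hypothetical, and the counter simply simulates how many Spark jobs would be triggered.

```python
class ToyGeoSeries:
    """Hypothetical stand-in; `_run_srid_query` simulates an eager ST_SRID() job."""

    def __init__(self, geometries):
        # No query runs here, so construction stays cheap.
        self._geometries = list(geometries)
        self.queries_run = 0  # counts simulated Spark jobs

    def _run_srid_query(self):
        self.queries_run += 1
        return 4326  # pretend every geometry carries SRID 4326

    @property
    def crs(self):
        # The expensive query runs only when a caller asks for the CRS.
        return f"EPSG:{self._run_srid_query()}"


series = ToyGeoSeries(["POINT (0 0)", "POINT (1 1)"])
jobs_after_init = series.queries_run
crs = series.crs
jobs_after_access = series.queries_run
```

Every CRS check added to a constructor or setter would increment the counter, which is why such checks are worth weighing against their cost.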
+**Maintaining Order**: Because Spark data is distributed, maintaining the
order of data across operations takes extra time and effort. For some
operations, maintaining order is not very meaningful. In those cases, it's
reasonable to skip the post-sorting to avoid an unnecessary performance hit.
The documentation should note this behavior. For example, `sjoin` currently
does not maintain the traditional pandas dataframe order after performing the
join. This follows
the same convention as [...]
+
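The trade-off above can be sketched in plain Python, with no Spark involved. The partition layout and the shuffle are assumptions used only to simulate partitions finishing in an arbitrary order; the order key plays a role similar to the `__natural_order__` column that appears in query plans.

```python
import random

# Each row carries an explicit order key alongside its value.
rows = [(key, value) for key, value in enumerate(["a", "b", "c", "d", "e"])]

# Simulate a distributed operation: partitions complete in an arbitrary order.
partitions = [rows[0:2], rows[2:4], rows[4:5]]
random.Random(7).shuffle(partitions)
unordered = [(key, value.upper()) for part in partitions for key, value in part]

# Post-sorting on the order key restores the original order, at extra cost.
restored = sorted(unordered)
restored_values = [value for _, value in restored]
```

Skipping the final `sorted()` call is the shortcut the paragraph describes: the result is still complete, only its row order is no longer guaranteed.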
+**Conventions**: The conventional shorthand for the Sedona GeoPandas package
is `sgpd`. It is the same as the GeoPandas shorthand (`gpd`), prefixed with an
's'. The conventional shorthands for adjacent packages are also shown below.
+
+```python
+import pandas as pd
+import geopandas as gpd
+import pyspark.pandas as ps
+import sedona.spark.geopandas as sgpd
+```
+
+**Conversion Methods**: Sedona's implementation of GeoPandas provides the
following methods to convert to and from other dataframes. These apply to both
`GeoDataFrame` and `GeoSeries`:
+
+- `to_geopandas()`: Sedona Geo(DataFrame/Series) to Geopandas
+- `to_geoframe()`: Sedona GeoSeries to Sedona GeoDataFrame
+- `to_spark_pandas()`: Sedona Geo(DataFrame/Series) to Pandas on PySpark
+- `to_spark()` (inherited): Sedona GeoDataFrame to Spark DataFrame
+- `to_frame()` (inherited): Sedona GeoSeries to PySpark Pandas DataFrame
+
+**GeoSeries functions**: Most geometry manipulation operations in GeoPandas
are considered GeoSeries functions. However, we can also call them from a
`GeoDataFrame` object to execute them on its active geometry column. We
implement the functions in the `GeoSeries` class. In `base.py`, however, we add
a `_delegate_to_geometry_column()` call to allow the `GeoDataFrame` to also
execute the function on its active geometry column. We also specify the
docstring for the function here instead of `G [...]
+
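The delegation pattern described above can be sketched as follows. This is a toy illustration, not the code in `base.py`: the helper's signature, the `Toy*` classes, and the box-based geometries are all assumptions made for the example.

```python
def _delegate_to_geometry_column(op, this, *args, **kwargs):
    """Toy helper: forward `op` to the frame's active geometry column."""
    series = this.geometry
    attr = getattr(series, op)
    # Properties such as `area` come back as values; methods must be called.
    return attr(*args, **kwargs) if callable(attr) else attr


class ToyGeoFrameBase:
    """Plays the role of the shared base class; docstrings are written here."""

    @property
    def area(self):
        """Return the area of each geometry."""
        return _delegate_to_geometry_column("area", self)


class ToyGeoSeries(ToyGeoFrameBase):
    def __init__(self, boxes):
        self.boxes = list(boxes)  # each box is (minx, miny, maxx, maxy)

    @property
    def area(self):
        # The actual computation lives in the series class.
        return [(maxx - minx) * (maxy - miny) for minx, miny, maxx, maxy in self.boxes]


class ToyGeoDataFrame(ToyGeoFrameBase):
    def __init__(self, names, geometry):
        self.names = list(names)
        self.geometry = geometry  # the active geometry column


series = ToyGeoSeries([(0, 0, 2, 3), (1, 1, 2, 2)])
frame = ToyGeoDataFrame(["a", "b"], series)
frame_area = frame.area
series_area = series.area
```

The frame class contains no geometry logic of its own; it inherits the base property, which forwards to the series implementation.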
+**Explain Query Plans**: Because these dataframe abstractions are built on
Spark, we can retrieve the query plan for an operation on a dataframe by using
the `.spark.explain()` method.
+
+Example:
+
+```python
+geoseries = GeoSeries([Polygon([(0, 0), (1, 0), (1, 1), (0, 0)])])
+# PySpark pandas Series currently has no spark.explain() method, so convert it to a dataframe first; explain() prints the plan itself and returns None.
+geoseries.area.to_frame().spark.explain(extended=True)
+```
+
+```
+== Parsed Logical Plan ==
+Project [__index_level_0__#19L, 0#27 AS None#31]
++- Project [ **org.apache.spark.sql.sedona_sql.expressions.ST_Area** AS
0#27, __index_level_0__#19L, __natural_order__#23L]
+ +- Project [__index_level_0__#19L, 0#20, monotonically_increasing_id() AS
__natural_order__#23L]
+ +- LogicalRDD [__index_level_0__#19L, 0#20], false
+
+== Analyzed Logical Plan ==
+...
+
+== Optimized Logical Plan ==
+...
+
+== Physical Plan ==
+Project [__index_level_0__#19L,
**org.apache.spark.sql.sedona_sql.expressions.ST_Area** AS None#31]
++- *(1) Scan ExistingRDD[__index_level_0__#19L,0#20]
+```
+
+## Suggested Short Readings
+
+- [PySpark Pandas Best
Practices](https://spark.apache.org/docs/latest/api/python/tutorial/pandas_on_spark/best_practices.html)
- This mentions other useful notes to consider such as why `__iter__()` is not
supported
+- [Geopandas User
Guide](https://geopandas.org/en/stable/docs/user_guide/data_structures.html) -
Specifically it's useful to understand the GeoDataFrame's "active geometry
column"
+
+## Other References
+
+- [Public API Pages for Pandas API on
Spark](https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/index.html)
+- [Geopandas API Page](https://geopandas.org/en/stable/docs/reference.html)
diff --git a/mkdocs.yml b/mkdocs.yml
index 0ff8147b15..02cf474803 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -142,6 +142,7 @@ nav:
- Contributor Guide:
- Rules: community/rule.md
- Develop: community/develop.md
+ - Contribute to Geopandas: community/geopandas.md
- Committer Guide:
- Project Management Committee: community/contributor.md
- Become a release manager: community/release-manager.md