This is an automated email from the ASF dual-hosted git repository.
jiayu pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/sedona.git
The following commit(s) were added to refs/heads/master by this push:
new 8512ecd7c4 [DOCS] Add GeoPandas API on Sedona tutorial (#2341)
8512ecd7c4 is described below
commit 8512ecd7c41bf26cfa4888f5311bb5d9c58321d1
Author: Jia Yu <[email protected]>
AuthorDate: Sun Sep 7 20:20:05 2025 -0700
[DOCS] Add GeoPandas API on Sedona tutorial (#2341)
---
docs/tutorial/geopandas-api.md | 398 +++++++++++++++++++++++++++++++++++++++++
mkdocs.yml | 6 +-
2 files changed, 401 insertions(+), 3 deletions(-)
diff --git a/docs/tutorial/geopandas-api.md b/docs/tutorial/geopandas-api.md
new file mode 100644
index 0000000000..1c99c8711d
--- /dev/null
+++ b/docs/tutorial/geopandas-api.md
@@ -0,0 +1,398 @@
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements. See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership. The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied. See the License for the
+ specific language governing permissions and limitations
+ under the License.
+-->
+
+# GeoPandas API for Apache Sedona
+
+The GeoPandas API for Apache Sedona provides a familiar GeoPandas interface that scales your geospatial analysis beyond single-node limitations. This API combines the intuitive GeoPandas DataFrame syntax with the distributed processing power of Apache Sedona on Apache Spark, enabling you to work with planetary-scale datasets using the same code patterns you already know.
+
+## Overview
+
+### What is the GeoPandas API for Apache Sedona?
+
+The GeoPandas API for Apache Sedona is a compatibility layer that allows you to use GeoPandas-style operations on distributed geospatial data. Instead of being limited to single-node processing, your GeoPandas code can leverage the full power of Apache Spark clusters for large-scale geospatial analysis.
+
+### Key Benefits
+
+- **Familiar API**: Use the same GeoPandas syntax and methods you're already familiar with
+- **Distributed Processing**: Scale beyond single-node limitations to handle large datasets
+- **Lazy Evaluation**: Benefit from Apache Sedona's query optimization and lazy execution (see the sketch after this list)
+- **Performance**: Leverage distributed computing for complex geospatial operations
+- **Seamless Migration**: Minimal code changes required to migrate existing GeoPandas workflows
+
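+A minimal sketch of the lazy-evaluation point, assuming standard pandas-on-Spark semantics (the file path is illustrative):
+
+```python
+import sedona.spark.geopandas as gpd
+
+buildings = gpd.read_parquet("buildings.parquet")  # illustrative path
+valid = buildings[buildings.geometry.is_valid]  # builds a lazy plan; no work happens yet
+valid.head()  # an action like head() triggers distributed execution
+```
+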
+## Setup
+
+The GeoPandas API for Apache Sedona automatically handles SparkSession management through PySpark's pandas-on-Spark integration. There are three options for setup:
+
+### Option 1: Automatic SparkSession (Recommended)
+
+The GeoPandas API automatically uses the default SparkSession from PySpark:
+
+```python
+from sedona.spark.geopandas import GeoDataFrame, read_parquet
+
+# No explicit SparkSession setup needed - uses default session
+# The API automatically handles Sedona context initialization
+```
+
+### Option 2: Manual SparkSession Setup
+
+If you need to configure a custom SparkSession or are working in an environment where you need explicit control:
+
+```python
+from sedona.spark.geopandas import GeoDataFrame, read_parquet
+from sedona.spark import SedonaContext
+
+# Create and configure SparkSession
+config = SedonaContext.builder().getOrCreate()
+sedona = SedonaContext.create(config)
+
+# The GeoPandas API will use this configured session
+```
+
+### Option 3: Using Existing SparkSession
+
+If you already have a SparkSession (e.g., in Databricks, EMR, or other managed environments):
+
+```python
+from sedona.spark.geopandas import GeoDataFrame, read_parquet
+from sedona.spark import SedonaContext
+
+# Use existing SparkSession (e.g., 'spark' in Databricks)
+sedona = SedonaContext.create(spark) # 'spark' is the existing session
+```
+
+### How SparkSession Management Works
+
+The GeoPandas API leverages PySpark's pandas-on-Spark functionality, which automatically manages the SparkSession lifecycle:
+
+1. **Default Session**: When you import `sedona.spark.geopandas`, it automatically uses PySpark's default session via `pyspark.pandas.utils.default_session()`
+
+2. **Automatic Sedona Registration**: The API automatically registers Sedona's spatial functions and optimizations with the SparkSession when needed
+
+3. **Transparent Integration**: All GeoPandas operations are translated to Spark SQL operations under the hood, using the configured SparkSession
+
+4. **No Manual Context Management**: Unlike traditional Sedona usage, you don't need to explicitly call `SedonaContext.create()` unless you need custom configuration
+
+This design makes the API more user-friendly by hiding the complexity of SparkSession management while still providing the full power of distributed processing.
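+
+To see this in practice, a small sketch (the parquet path is illustrative):
+
+```python
+from pyspark.sql import SparkSession
+
+import sedona.spark.geopandas as gpd
+
+# Reading data creates the default session if none exists yet
+gdf = gpd.read_parquet("buildings.parquet")  # illustrative path
+
+# The same session is visible through the normal PySpark entry point
+spark = SparkSession.getActiveSession()
+print(spark.version)
+```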
+
+### S3 Configuration
+
+When working with S3 data, the GeoPandas API uses Spark's built-in S3 support rather than external libraries like s3fs. Configure anonymous access to public S3 buckets using Spark configuration:
+
+```python
+from sedona.spark import SedonaContext
+
+# For anonymous access to public S3 buckets
+config = (
+    SedonaContext.builder()
+    .config(
+        "spark.hadoop.fs.s3a.bucket.bucket-name.aws.credentials.provider",
+        "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider",
+    )
+    .getOrCreate()
+)
+
+sedona = SedonaContext.create(config)
+```
+
+For authenticated S3 access, use appropriate AWS credential providers:
+
+```python
+# For IAM roles (recommended for EC2/EMR)
+config = (
+    SedonaContext.builder()
+    .config(
+        "spark.hadoop.fs.s3a.aws.credentials.provider",
+        "com.amazonaws.auth.InstanceProfileCredentialsProvider",
+    )
+    .getOrCreate()
+)
+
+# For access keys (not recommended for production)
+config = (
+    SedonaContext.builder()
+    .config("spark.hadoop.fs.s3a.access.key", "your-access-key")
+    .config("spark.hadoop.fs.s3a.secret.key", "your-secret-key")
+    .getOrCreate()
+)
+```
+
+## Basic Usage
+
+### Importing the API
+
+Instead of importing GeoPandas directly, import from the Sedona GeoPandas module:
+
+```python
+# Traditional GeoPandas import
+# import geopandas as gpd
+
+# Sedona GeoPandas API import
+import sedona.spark.geopandas as gpd
+
+# or
+from sedona.spark.geopandas import GeoDataFrame, read_parquet
+```
+
+### Reading Data
+
+The API supports reading from various geospatial formats, including Parquet files from cloud storage. For S3 access with anonymous credentials, configure Spark to use anonymous AWS credentials:
+
+```python
+from sedona.spark import SedonaContext
+
+# Configure Spark for anonymous S3 access
+config = (
+    SedonaContext.builder()
+    .config(
+        "spark.hadoop.fs.s3a.bucket.wherobots-examples.aws.credentials.provider",
+        "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider",
+    )
+    .getOrCreate()
+)
+
+sedona = SedonaContext.create(config)
+
+# Load GeoParquet file directly from S3
+s3_path = "s3://wherobots-examples/data/onboarding_1/nyc_buildings.parquet"
+nyc_buildings = gpd.read_parquet(s3_path)
+
+# Display basic information
+print(f"Dataset shape: {nyc_buildings.shape}")
+print(f"Columns: {nyc_buildings.columns.tolist()}")
+nyc_buildings.head()
+```
+
+### Spatial Filtering
+
+Use spatial indexing and filtering methods. Note that `cx` spatial indexing is not yet implemented in the current version:
+
+```python
+from shapely.geometry import box
+
+# Define bounding box for Central Park
+central_park_bbox = box(
+    -73.973, 40.764,  # bottom-left (min longitude, min latitude)
+    -73.951, 40.789,  # top-right (max longitude, max latitude)
+)
+
+# Filter buildings within the bounding box using spatial index
+# Note: This requires collecting data to driver for spatial filtering
+# For large datasets, consider using spatial joins instead
+buildings_sample = nyc_buildings.sample(frac=0.01)  # sample for demonstration (pandas-on-Spark samples by fraction, not count)
+central_park_buildings = buildings_sample[
+    buildings_sample.geometry.intersects(central_park_bbox)
+]
+
+# Display results
+print(
+    central_park_buildings[["BUILD_ID", "PROP_ADDR", "height_val", "geometry"]].head()
+)
+```
+
+**Alternative approach for large datasets using spatial joins:**
+
+```python
+# Create a GeoDataFrame with the bounding box
+bbox_gdf = gpd.GeoDataFrame(
+    {"id": [1]}, geometry=[central_park_bbox], crs="EPSG:4326"
+)
+
+# Use spatial join to filter buildings within the bounding box
+central_park_buildings = nyc_buildings.sjoin(bbox_gdf, predicate="intersects")
+```
+
+## Advanced Operations
+
+### Spatial Joins
+
+Perform spatial joins using the same syntax as GeoPandas:
+
+```python
+# Load two datasets
+left_df = gpd.read_parquet("s3://bucket/left_data.parquet")
+right_df = gpd.read_parquet("s3://bucket/right_data.parquet")
+
+# Spatial join with a distance predicate (distance is in CRS units)
+result = left_df.sjoin(right_df, predicate="dwithin", distance=50)
+
+# Other spatial predicates
+intersects_result = left_df.sjoin(right_df, predicate="intersects")
+contains_result = left_df.sjoin(right_df, predicate="contains")
+```
+
+### Coordinate Reference System Operations
+
+Transform geometries between different coordinate reference systems:
+
+```python
+# Set initial CRS
+buildings = gpd.read_parquet("buildings.parquet")
+buildings = buildings.set_crs("EPSG:4326")
+
+# Transform to projected CRS for area calculations
+buildings_projected = buildings.to_crs("EPSG:3857")
+
+# Calculate areas
+buildings_projected["area"] = buildings_projected.geometry.area
+```
+
+### Geometric Operations
+
+Apply geometric transformations and analysis:
+
+```python
+# Buffer operations
+buffered = buildings.geometry.buffer(100)  # 100 units in the layer's CRS; use a projected CRS for meters
+
+# Geometric properties
+buildings["is_valid"] = buildings.geometry.is_valid
+buildings["is_simple"] = buildings.geometry.is_simple
+buildings["bounds"] = buildings.geometry.bounds
+
+# Distance calculations
+from shapely.geometry import Point
+
+reference_point = Point(-73.9857, 40.7484) # Times Square
+buildings["distance_to_times_square"] =
buildings.geometry.distance(reference_point)
+
+# Area and length calculations (requires projected CRS)
+buildings_projected = buildings.to_crs("EPSG:3857") # Web Mercator
+buildings_projected["area"] = buildings_projected.geometry.area
+buildings_projected["perimeter"] = buildings_projected.geometry.length
+```
+
+## Performance Considerations
+
+### Use Traditional GeoPandas when:
+
+- Working with small datasets (< 1GB)
+- Simple operations on local data
+- Complete GeoPandas API coverage is required (Sedona implements a subset)
+- Single-node processing is sufficient
+
+### Use GeoPandas API for Apache Sedona when:
+
+- Working with large datasets (> 1GB)
+- Complex geospatial analyses
+- Distributed processing is needed
+- Data is stored in cloud storage (S3, HDFS, etc.)
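+
+The two modes combine well: do the heavy filtering distributed, then hand the small result to traditional GeoPandas. A sketch (the path is illustrative):
+
+```python
+import sedona.spark.geopandas as gpd
+
+# Heavy lifting runs distributed on the cluster
+big = gpd.read_parquet("s3://bucket/large_dataset.parquet")  # illustrative path
+small = big[big.geometry.is_valid]
+
+# Convert the (now small) result for single-node tooling
+local_gdf = small.to_geopandas()
+local_gdf.plot()
+```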
+
+## Supported Operations
+
+The GeoPandas API for Apache Sedona implements **39 GeoSeries functions** and **10 GeoDataFrame functions**, covering the most commonly used GeoPandas operations:
+
+### Data I/O
+
+- `read_parquet()` - Read GeoParquet files
+- `read_file()` - Read various geospatial formats
+- `to_parquet()` - Write to Parquet format
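+
+A short sketch of the I/O round trip (file names are illustrative):
+
+```python
+import sedona.spark.geopandas as gpd
+
+zones = gpd.read_file("data/zones.shp")  # illustrative Shapefile
+zones.to_parquet("data/zones.parquet")  # persist as GeoParquet
+zones_again = gpd.read_parquet("data/zones.parquet")
+```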
+
+### CRS Operations
+
+- `set_crs()` - Set coordinate reference system
+- `to_crs()` - Transform between CRS
+- `crs` - Access CRS information
+
+### Geometric Properties
+
+- `area`, `length`, `bounds` - Geometric measurements
+- `is_valid`, `is_simple`, `is_empty` - Geometric validation
+- `centroid`, `envelope`, `boundary` - Geometric properties
+- `x`, `y`, `z`, `has_z` - Coordinate access
+- `total_bounds`, `estimate_utm_crs` - Bounds and CRS utilities
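+
+For example, assuming these utilities mirror their GeoPandas counterparts (`buildings` as in the earlier examples):
+
+```python
+# Pick a locally accurate projected CRS, then measure in meters
+utm = buildings.geometry.estimate_utm_crs()
+areas_m2 = buildings.to_crs(utm).geometry.area
+
+print(buildings.geometry.total_bounds)  # [minx, miny, maxx, maxy]
+```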
+
+### Spatial Operations
+
+- `sjoin()` - Spatial joins with various predicates
+- `buffer()` - Geometric buffering
+- `distance()` - Distance calculations
+- `intersects()`, `contains()`, `within()` - Spatial predicates
+- `intersection()` - Geometric intersection
+- `make_valid()` - Geometry validation and repair
+- `sindex` - Spatial indexing (limited functionality)
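+
+A sketch combining repair and intersection (reusing `central_park_bbox` from the filtering example):
+
+```python
+# Repair invalid geometries before overlay work, then clip to the bbox
+repaired = buildings.geometry.make_valid()
+clipped = repaired.intersection(central_park_bbox)
+```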
+
+### Data Conversion
+
+- `to_geopandas()` - Convert to traditional GeoPandas
+- `to_wkb()`, `to_wkt()` - Convert to WKB/WKT formats
+- `from_xy()` - Create geometries from coordinates
+- `geom_type` - Get geometry types
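+
+A sketch of round-tripping coordinates, assuming a `GeoSeries` class is exported alongside `GeoDataFrame` and that `from_xy()` mirrors the GeoPandas signature:
+
+```python
+from sedona.spark.geopandas import GeoSeries  # assumed export
+
+pts = GeoSeries.from_xy([-73.99, -73.95], [40.73, 40.78], crs="EPSG:4326")
+print(pts.to_wkt())
+```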
+
+## Complete Workflow Example
+
+```python
+import sedona.spark.geopandas as gpd
+from sedona.spark import SedonaContext
+
+# Configure Spark for anonymous S3 access
+config = (
+    SedonaContext.builder()
+    .config(
+        "spark.hadoop.fs.s3a.bucket.wherobots-examples.aws.credentials.provider",
+        "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider",
+    )
+    .getOrCreate()
+)
+
+sedona = SedonaContext.create(config)
+
+# Load data
+DATA_DIR = "s3://wherobots-examples/data/geopandas_blog/"
+overture_size = "1M"
+postal_codes_path = DATA_DIR + "postal-code/"
+overture_path = DATA_DIR + overture_size + "/" + "overture-buildings/"
+
+postal_codes = gpd.read_parquet(postal_codes_path)
+buildings = gpd.read_parquet(overture_path)
+
+# Spatial analysis
+buildings = buildings.set_crs("EPSG:4326")
+buildings_projected = buildings.to_crs("EPSG:3857")
+
+# Calculate areas and filter
+buildings_projected["area"] = buildings_projected.geometry.area
+large_buildings = buildings_projected[buildings_projected["area"] > 1000]
+
+result = large_buildings.sjoin(postal_codes, predicate="intersects")
+
+# Aggregate by postal code
+summary = (
+    result.groupby("postal_code")
+    .agg({"area": "sum", "BUILD_ID": "count"})
+    .rename(columns={"BUILD_ID": "building_count"})
+)
+
+print(summary.head())
+```
+
+## Resources and Contributing
+
+For detailed and up-to-date API documentation, including complete method signatures, parameters, and examples, see:
+
+**📚 [GeoPandas API Documentation](https://sedona.apache.org/latest/api/pydocs/sedona.spark.geopandas.html)**
+
+The GeoPandas API for Apache Sedona is an open-source project. Contributions are welcome through the [GitHub issue tracker](https://github.com/apache/sedona/issues/2230) for reporting bugs, requesting features, or contributing code. For more information on contributing, see the [Contributor Guide](../community/develop.md).
diff --git a/mkdocs.yml b/mkdocs.yml
index 9b63533d15..24ceb44970 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -55,12 +55,12 @@ nav:
- Download: download.md
- Programming Guides:
- Sedona with Apache Spark:
- - Spatial SQL app: tutorial/sql.md
- - Raster SQL app: tutorial/raster.md
+ - Spatial DataFrame / SQL app: tutorial/sql.md
+ - Raster DataFrame / SQL app: tutorial/raster.md
- Pure SQL environment: tutorial/sql-pure-sql.md
+ - GeoPandas API on Sedona: tutorial/geopandas-api.md
- Spatial RDD app: tutorial/rdd.md
- Sedona R: api/rdocs
- - Sedona Python: api/pydocs
- Work with GeoPandas and Shapely: tutorial/geopandas-shapely.md
- Files:
- CSV: tutorial/files/csv-geometry-sedona-spark.md