This is an automated email from the ASF dual-hosted git repository.

jiayu pushed a commit to branch geopandas
in repository https://gitbox.apache.org/repos/asf/sedona.git
commit ff8f3aceb6e8d6bf21292c8e9fb31d14c3e6f96c
Author: Jia Yu <[email protected]>
AuthorDate: Sun Sep 7 20:12:05 2025 -0700

    Add GeoPandas tutorial
---
 docs/tutorial/geopandas-api.md | 398 +++++++++++++++++++++++++++++++++++++++++
 mkdocs.yml                     |   6 +-
 2 files changed, 401 insertions(+), 3 deletions(-)

diff --git a/docs/tutorial/geopandas-api.md b/docs/tutorial/geopandas-api.md
new file mode 100644
index 0000000000..1c99c8711d
--- /dev/null
+++ b/docs/tutorial/geopandas-api.md
@@ -0,0 +1,398 @@
<!--
  Licensed to the Apache Software Foundation (ASF) under one
  or more contributor license agreements. See the NOTICE file
  distributed with this work for additional information
  regarding copyright ownership. The ASF licenses this file
  to you under the Apache License, Version 2.0 (the
  "License"); you may not use this file except in compliance
  with the License. You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing,
  software distributed under the License is distributed on an
  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  KIND, either express or implied. See the License for the
  specific language governing permissions and limitations
  under the License.
-->

# GeoPandas API for Apache Sedona

The GeoPandas API for Apache Sedona provides a familiar GeoPandas interface that scales your geospatial analysis beyond single-node limitations. This API combines the intuitive GeoPandas DataFrame syntax with the distributed processing power of Apache Sedona on Apache Spark, enabling you to work with planetary-scale datasets using the same code patterns you already know.

## Overview

### What is the GeoPandas API for Apache Sedona?

The GeoPandas API for Apache Sedona is a compatibility layer that allows you to use GeoPandas-style operations on distributed geospatial data. Instead of being limited to single-node processing, your GeoPandas code can leverage the full power of Apache Spark clusters for large-scale geospatial analysis.

### Key Benefits

- **Familiar API**: Use the same GeoPandas syntax and methods you're already familiar with
- **Distributed Processing**: Scale beyond single-node limitations to handle large datasets
- **Lazy Evaluation**: Benefit from Apache Sedona's query optimization and lazy execution
- **Performance**: Leverage distributed computing for complex geospatial operations
- **Seamless Migration**: Migrate existing GeoPandas workflows with minimal code changes (see the sketch below)
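
To make the migration point concrete, here is a minimal sketch. It assumes an existing GeoParquet file at a hypothetical local path `buildings.parquet`; the only change from single-node GeoPandas code is the import line.

```python
# Traditional single-node GeoPandas:
# import geopandas as gpd

# Distributed GeoPandas API on Sedona: same syntax, different import
import sedona.spark.geopandas as gpd

# Hypothetical input path, for illustration only
buildings = gpd.read_parquet("buildings.parquet")

# Familiar GeoPandas calls now run as distributed Spark operations
print(buildings.geometry.is_valid.head())
```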
## Setup

The GeoPandas API for Apache Sedona automatically handles SparkSession management through PySpark's pandas-on-Spark integration. You have three options for setup:

### Option 1: Automatic SparkSession (Recommended)

The GeoPandas API automatically uses the default SparkSession from PySpark:

```python
from sedona.spark.geopandas import GeoDataFrame, read_parquet

# No explicit SparkSession setup needed - uses the default session
# The API automatically handles Sedona context initialization
```

### Option 2: Manual SparkSession Setup

If you need to configure a custom SparkSession or are working in an environment where you need explicit control:

```python
from sedona.spark.geopandas import GeoDataFrame, read_parquet
from sedona.spark import SedonaContext

# Create and configure the SparkSession
config = SedonaContext.builder().getOrCreate()
sedona = SedonaContext.create(config)

# The GeoPandas API will use this configured session
```

### Option 3: Using an Existing SparkSession

If you already have a SparkSession (e.g., in Databricks, EMR, or other managed environments):

```python
from sedona.spark.geopandas import GeoDataFrame, read_parquet
from sedona.spark import SedonaContext

# Use the existing SparkSession (e.g., 'spark' in Databricks)
sedona = SedonaContext.create(spark)  # 'spark' is the existing session
```

### How SparkSession Management Works

The GeoPandas API leverages PySpark's pandas-on-Spark functionality, which automatically manages the SparkSession lifecycle:

1. **Default Session**: When you import `sedona.spark.geopandas`, it automatically uses PySpark's default session via `pyspark.pandas.utils.default_session()` (see the sketch after this list)

2. **Automatic Sedona Registration**: The API automatically registers Sedona's spatial functions and optimizations with the SparkSession when needed

3. **Transparent Integration**: All GeoPandas operations are translated to Spark SQL operations under the hood, using the configured SparkSession

4. **No Manual Context Management**: Unlike traditional Sedona usage, you don't need to explicitly call `SedonaContext.create()` unless you need custom configuration

This design makes the API more user-friendly by hiding the complexity of SparkSession management while still providing the full power of distributed processing.
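
As a quick illustration of the default-session behavior in point 1, the sketch below retrieves the session the GeoPandas API picks up. Note that `default_session()` is a pandas-on-Spark internal rather than a documented public API, so treat this as illustrative only:

```python
from pyspark.pandas.utils import default_session

# The session returned here is the one the GeoPandas API reuses
# when you do not configure a SparkSession explicitly.
spark = default_session()

# Configuration set on this session applies to later GeoPandas operations.
print(spark.version)
```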
### S3 Configuration

When working with S3 data, the GeoPandas API uses Spark's built-in S3 support rather than external libraries like s3fs. Configure anonymous access to public S3 buckets through the Spark configuration:

```python
from sedona.spark import SedonaContext

# For anonymous access to public S3 buckets
# (replace `bucket-name` with the bucket you are reading from)
config = (
    SedonaContext.builder()
    .config(
        "spark.hadoop.fs.s3a.bucket.bucket-name.aws.credentials.provider",
        "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider",
    )
    .getOrCreate()
)

sedona = SedonaContext.create(config)
```

For authenticated S3 access, use the appropriate AWS credential providers:

```python
# For IAM roles (recommended for EC2/EMR)
config = (
    SedonaContext.builder()
    .config(
        "spark.hadoop.fs.s3a.aws.credentials.provider",
        "com.amazonaws.auth.InstanceProfileCredentialsProvider",
    )
    .getOrCreate()
)

# For access keys (not recommended for production)
config = (
    SedonaContext.builder()
    .config("spark.hadoop.fs.s3a.access.key", "your-access-key")
    .config("spark.hadoop.fs.s3a.secret.key", "your-secret-key")
    .getOrCreate()
)
```

## Basic Usage

### Importing the API

Instead of importing GeoPandas directly, import from the Sedona GeoPandas module:

```python
# Traditional GeoPandas import
# import geopandas as gpd

# Sedona GeoPandas API import
import sedona.spark.geopandas as gpd

# or
from sedona.spark.geopandas import GeoDataFrame, read_parquet
```

### Reading Data

The API supports reading from various geospatial formats, including Parquet files from cloud storage. For anonymous access to a public S3 bucket, configure Spark to use anonymous AWS credentials:

```python
import sedona.spark.geopandas as gpd
from sedona.spark import SedonaContext

# Configure Spark for anonymous S3 access
config = (
    SedonaContext.builder()
    .config(
        "spark.hadoop.fs.s3a.bucket.wherobots-examples.aws.credentials.provider",
        "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider",
    )
    .getOrCreate()
)

sedona = SedonaContext.create(config)

# Load a GeoParquet file directly from S3
s3_path = "s3://wherobots-examples/data/onboarding_1/nyc_buildings.parquet"
nyc_buildings = gpd.read_parquet(s3_path)

# Display basic information
print(f"Dataset shape: {nyc_buildings.shape}")
print(f"Columns: {nyc_buildings.columns.tolist()}")
nyc_buildings.head()
```
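
For quick inspection or plotting on the driver, a small subset can be converted back to a traditional GeoPandas GeoDataFrame with `to_geopandas()` (listed under Data Conversion below). A minimal sketch, assuming the `nyc_buildings` frame from above and a result small enough for driver memory:

```python
# Bring a small slice back to the driver as a plain GeoPandas GeoDataFrame.
# Only do this for results that comfortably fit in driver memory.
local_sample = nyc_buildings.head(100).to_geopandas()

print(type(local_sample))  # a geopandas GeoDataFrame
local_sample.plot()  # regular single-node GeoPandas from here on
```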
### Spatial Filtering

Filter data spatially with the familiar GeoPandas predicates. Note that the `cx` coordinate-based indexer is not yet implemented in the current version:

```python
from shapely.geometry import box

# Define a bounding box around Central Park
central_park_bbox = box(
    -73.973,
    40.764,  # bottom-left corner (longitude, latitude)
    -73.951,
    40.789,  # top-right corner (longitude, latitude)
)

# Filter buildings that intersect the bounding box
# Note: this requires collecting data to the driver for spatial filtering;
# for large datasets, consider using spatial joins instead
buildings_sample = nyc_buildings.sample(1000)  # sample for demonstration
central_park_buildings = buildings_sample[
    buildings_sample.geometry.intersects(central_park_bbox)
]

# Display results
print(
    central_park_buildings[["BUILD_ID", "PROP_ADDR", "height_val", "geometry"]].head()
)
```

**Alternative approach for large datasets using spatial joins:**

```python
# Create a GeoDataFrame holding the bounding box
bbox_gdf = gpd.GeoDataFrame({"id": [1]}, geometry=[central_park_bbox], crs="EPSG:4326")

# Use a spatial join to filter buildings within the bounding box
central_park_buildings = nyc_buildings.sjoin(bbox_gdf, predicate="intersects")
```

## Advanced Operations

### Spatial Joins

Perform spatial joins using the same syntax as GeoPandas:

```python
# Load two datasets
left_df = gpd.read_parquet("s3://bucket/left_data.parquet")
right_df = gpd.read_parquet("s3://bucket/right_data.parquet")

# Spatial join with a distance predicate
result = left_df.sjoin(right_df, predicate="dwithin", distance=50)

# Other spatial predicates
intersects_result = left_df.sjoin(right_df, predicate="intersects")
contains_result = left_df.sjoin(right_df, predicate="contains")
```

### Coordinate Reference System Operations

Transform geometries between different coordinate reference systems:

```python
# Set the initial CRS
buildings = gpd.read_parquet("buildings.parquet")
buildings = buildings.set_crs("EPSG:4326")

# Transform to a projected CRS for area calculations
buildings_projected = buildings.to_crs("EPSG:3857")

# Calculate areas
buildings_projected["area"] = buildings_projected.geometry.area
```

### Geometric Operations

Apply geometric transformations and analysis:

```python
# Buffer operations (the buffer distance is in CRS units,
# so buffer on a projected CRS to work in meters)
buffered = buildings_projected.geometry.buffer(100)  # 100 meter buffer

# Geometric properties
buildings["is_valid"] = buildings.geometry.is_valid
buildings["is_simple"] = buildings.geometry.is_simple
bounds = buildings.geometry.bounds  # DataFrame with minx, miny, maxx, maxy

# Distance calculations (also in CRS units, here degrees for EPSG:4326)
from shapely.geometry import Point

reference_point = Point(-73.9857, 40.7484)  # Times Square
buildings["distance_to_times_square"] = buildings.geometry.distance(reference_point)

# Area and length calculations (require a projected CRS)
buildings_projected = buildings.to_crs("EPSG:3857")  # Web Mercator
buildings_projected["area"] = buildings_projected.geometry.area
buildings_projected["perimeter"] = buildings_projected.geometry.length
```

## Performance Considerations

### Use traditional GeoPandas when:

- Working with small datasets (< 1 GB)
- Performing simple operations on local data
- Complete GeoPandas API coverage is required
- Single-node processing is sufficient

### Use the GeoPandas API for Apache Sedona when:

- Working with large datasets (> 1 GB)
- Running complex geospatial analyses
- Distributed processing is needed
- Data is stored in cloud storage (S3, HDFS, etc.)

The two approaches also combine well, as sketched below.
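
A minimal sketch of that combined pattern, assuming a large GeoParquet dataset at a hypothetical S3 path: run the expensive projection and filtering distributed, then hand the reduced result to single-node GeoPandas.

```python
import sedona.spark.geopandas as gpd

# The expensive steps run distributed across the cluster
big = gpd.read_parquet("s3://bucket/large_dataset.parquet")  # hypothetical path
big = big.set_crs("EPSG:4326")
big_projected = big.to_crs("EPSG:3857")
big_projected["area"] = big_projected.geometry.area

# Reduce to a small subset before leaving the cluster
large_only = big_projected[big_projected["area"] > 10000]

# Hand the small result to traditional GeoPandas for local plotting
local = large_only.to_geopandas()
local.plot()
```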
## Supported Operations

The GeoPandas API for Apache Sedona has implemented **39 GeoSeries functions** and **10 GeoDataFrame functions**, covering the most commonly used GeoPandas operations:

### Data I/O

- `read_parquet()` - Read GeoParquet files
- `read_file()` - Read various geospatial formats
- `to_parquet()` - Write to Parquet format

### Spatial Operations

- `sjoin()` - Spatial joins with various predicates
- `buffer()` - Geometric buffering
- `distance()` - Distance calculations
- `intersects()`, `contains()`, `within()` - Spatial predicates
- `intersection()` - Geometric intersection
- `make_valid()` - Geometry validation and repair
- `sindex` - Spatial indexing (limited functionality)

### CRS Operations

- `set_crs()` - Set the coordinate reference system
- `to_crs()` - Transform between CRSs
- `crs` - Access CRS information

### Geometric Properties

- `area`, `length`, `bounds` - Geometric measurements
- `is_valid`, `is_simple`, `is_empty` - Geometric validation
- `centroid`, `envelope`, `boundary` - Derived geometries
- `x`, `y`, `z`, `has_z` - Coordinate access
- `total_bounds`, `estimate_utm_crs` - Bounds and CRS utilities

### Data Conversion

- `to_geopandas()` - Convert to traditional GeoPandas
- `to_wkb()`, `to_wkt()` - Convert to WKB/WKT formats
- `from_xy()` - Create geometries from coordinates
- `geom_type` - Get geometry types

## Complete Workflow Example

```python
import sedona.spark.geopandas as gpd
from sedona.spark import SedonaContext

# Configure Spark for anonymous S3 access
config = (
    SedonaContext.builder()
    .config(
        "spark.hadoop.fs.s3a.bucket.wherobots-examples.aws.credentials.provider",
        "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider",
    )
    .getOrCreate()
)

sedona = SedonaContext.create(config)

# Load data
DATA_DIR = "s3://wherobots-examples/data/geopandas_blog/"
overture_size = "1M"
postal_codes_path = DATA_DIR + "postal-code/"
overture_path = DATA_DIR + overture_size + "/" + "overture-buildings/"

postal_codes = gpd.read_parquet(postal_codes_path)
buildings = gpd.read_parquet(overture_path)

# Spatial analysis
buildings = buildings.set_crs("EPSG:4326")
buildings_projected = buildings.to_crs("EPSG:3857")

# Calculate areas and filter
buildings_projected["area"] = buildings_projected.geometry.area
large_buildings = buildings_projected[buildings_projected["area"] > 1000]

result = large_buildings.sjoin(postal_codes, predicate="intersects")

# Aggregate by postal code
summary = (
    result.groupby("postal_code")
    .agg({"area": "sum", "BUILD_ID": "count"})
    .rename(columns={"BUILD_ID": "building_count"})
)

print(summary.head())
```

## Resources and Contributing

For detailed and up-to-date API documentation, including complete method signatures, parameters, and examples, see:

**📚 [GeoPandas API Documentation](https://sedona.apache.org/latest/api/pydocs/sedona.spark.geopandas.html)**

The GeoPandas API for Apache Sedona is an open-source project. Contributions are welcome through the [GitHub issue tracker](https://github.com/apache/sedona/issues/2230) for reporting bugs, requesting features, or contributing code. For more information on contributing, see the [Contributor Guide](../community/develop.md).
diff --git a/mkdocs.yml b/mkdocs.yml
index 9b63533d15..24ceb44970 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -55,12 +55,12 @@ nav:
   - Download: download.md
   - Programming Guides:
       - Sedona with Apache Spark:
-          - Spatial SQL app: tutorial/sql.md
-          - Raster SQL app: tutorial/raster.md
+          - Spatial DataFrame / SQL app: tutorial/sql.md
+          - Raster DataFrame / SQL app: tutorial/raster.md
           - Pure SQL environment: tutorial/sql-pure-sql.md
+          - GeoPandas API on Sedona: tutorial/geopandas-api.md
           - Spatial RDD app: tutorial/rdd.md
           - Sedona R: api/rdocs
-          - Sedona Python: api/pydocs
           - Work with GeoPandas and Shapely: tutorial/geopandas-shapely.md
       - Files:
           - CSV: tutorial/files/csv-geometry-sedona-spark.md
