Re: [PR] [GH-2208] Geopandas: Fix sjoin implementation + proper naming and index behavior [sedona]

via GitHub Sun, 03 Aug 2025 00:17:04 -0700


Copilot commented on code in PR #2209:
URL: https://github.com/apache/sedona/pull/2209#discussion_r2249634624



##########
python/sedona/geopandas/geoseries.py:
##########
@@ -465,30 +465,21 @@ def crs(self) -> Union["CRS", None]:
         if len(self) == 0:
             return None
 
-        if parse_version(pyspark.__version__) >= parse_version("3.5.0"):
-            spark_col = stf.ST_SRID(F.first_value(self.spark.column, 
ignoreNulls=True))
-            # Set this to avoid error complaining that we don't have a groupby 
column
-            is_aggr = True
-        else:
-            spark_col = stf.ST_SRID(self.spark.column)
-            is_aggr = False
+        # F.first is non-deterministic, but it doesn't matter because all 
non-null values should be the same
+        spark_col = stf.ST_SRID(F.first(self.spark.column, ignorenulls=True))

Review Comment:
   The parameter name should be `ignoreNulls` (camel case) instead of 
`ignorenulls` (lowercase) to match Spark SQL function naming conventions.
   ```suggestion
           spark_col = stf.ST_SRID(F.first(self.spark.column, ignoreNulls=True))
   ```



##########
python/sedona/geopandas/tools/sjoin.py:
##########
@@ -195,36 +203,35 @@ def _frame_join(
             final_columns.append(f"{col_name} as {base_name}")
 
     # Select final columns
-    result_df = spatial_join_df.selectExpr(*final_columns)
-
-    # Return appropriate type based on input
-    if isinstance(left_df, GeoSeries) and isinstance(right_df, GeoSeries):
-        # Return GeoSeries for GeoSeries inputs
-        internal = InternalFrame(
-            spark_frame=result_df,
-            index_spark_columns=None,
-            column_labels=[left_df._col_label],
-            data_spark_columns=[scol_for(result_df, "geometry")],
-            data_fields=[left_df._internal.data_fields[0]],
-            column_label_names=left_df._internal.column_label_names,
-        )
-        return _to_geo_series(first_series(ps.DataFrame(internal)))
-    else:
-        # Return GeoDataFrame for GeoDataFrame inputs
-        return GeoDataFrame(result_df)
+    result_df = spatial_join_df.selectExpr(*final_columns).orderBy(
+        SPARK_DEFAULT_INDEX_NAME
+    )

Review Comment:
   Adding an `orderBy` operation on the entire result could be expensive for 
large datasets. Consider if this ordering is necessary or if it can be 
optimized for better performance in distributed environments.
   ```suggestion
       result_df = spatial_join_df.selectExpr(*final_columns)
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] [GH-2208] Geopandas: Fix sjoin implementation + proper naming and index behavior [sedona]

Reply via email to