Re: [PR] [GH-2208] Geopandas: Fix sjoin implementation + proper naming and index behavior [sedona]

via GitHub Mon, 04 Aug 2025 12:14:40 -0700


Copilot commented on code in PR #2209:
URL: https://github.com/apache/sedona/pull/2209#discussion_r2252375654



##########
python/sedona/geopandas/tools/sjoin.py:
##########
@@ -196,35 +206,33 @@ def _frame_join(
 
     # Select final columns
     result_df = spatial_join_df.selectExpr(*final_columns)
+    # Note, we do not .orderBy(SPARK_DEFAULT_INDEX_NAME) to avoid a performance hit

Review Comment:
   This comment should be more descriptive about the trade-off. Consider 
explaining that this means the result order may not match geopandas exactly, 
and users should call .sort_index() if order preservation is needed.
   ```suggestion
       # Note: we do not call .orderBy(SPARK_DEFAULT_INDEX_NAME) here to avoid a performance hit.
       # As a result, the order of the returned rows may not match the order produced by geopandas.
       # If you require the result to preserve the original order, call .sort_index() on the output.
   ```
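
The trade-off described above can be sketched without Spark. This is a minimal pure-Python illustration, not the Sedona implementation: `restore_index_order` and the sample rows are hypothetical, and sorting on the index key merely stands in for what `.sort_index()` would do on the real output.

```python
import random


def restore_index_order(rows):
    """Sort (index, value) rows by index, standing in for .sort_index()."""
    return sorted(rows, key=lambda row: row[0])


# Hypothetical join output as (left_index, right_index) pairs in input order.
expected = [(0, 10), (1, 11), (2, 12), (3, 13)]

# Simulate an engine that returns rows in no guaranteed order.
shuffled = expected[:]
random.Random(42).shuffle(shuffled)

# Sorting on the index recovers the original order, at the cost of a sort.
restored = restore_index_order(shuffled)
```

Skipping the sort by default keeps the common case cheap; callers who need a deterministic order pay for it explicitly.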



##########
python/sedona/geopandas/geoseries.py:
##########
@@ -465,30 +465,21 @@ def crs(self) -> Union["CRS", None]:
         if len(self) == 0:
             return None
 
-        if parse_version(pyspark.__version__) >= parse_version("3.5.0"):
-            spark_col = stf.ST_SRID(F.first_value(self.spark.column, ignoreNulls=True))
-            # Set this to avoid error complaining that we don't have a groupby column
-            is_aggr = True
-        else:
-            spark_col = stf.ST_SRID(self.spark.column)
-            is_aggr = False
+        # F.first is non-deterministic, but it doesn't matter because all non-null values should be the same
+        spark_col = stf.ST_SRID(F.first(self.spark.column, ignorenulls=True))

Review Comment:
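   PySpark's F.first names this parameter 'ignorenulls' (lowercase). The 
camelCase 'ignoreNulls' spelling belongs to F.first_value, which the removed 
branch used. Passing 'ignoreNulls' to F.first would raise a TypeError, so the 
lowercase spelling in this line should stay as written.
   ```suggestion
           spark_col = stf.ST_SRID(F.first(self.spark.column, ignorenulls=True))
   ```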
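
The claim in the added code comment, that a non-deterministic "first" is harmless when every non-null value is identical, can be checked with a small pure-Python sketch. It does not use Spark; `first_non_null` and the sample SRIDs are hypothetical stand-ins for `F.first(..., ignorenulls=True)` and the per-row SRID values.

```python
from itertools import permutations


def first_non_null(values):
    """Return the first value that is not None, or None if all are None."""
    return next((v for v in values if v is not None), None)


# Hypothetical per-row SRIDs: some nulls plus one SRID shared by all rows.
srids = [None, 4326, 4326, None, 4326]

# Collect the result for every possible row order.
results = {first_non_null(order) for order in permutations(srids)}
```

Every ordering yields the same value, so which row the engine happens to see first does not change the reported CRS. The argument no longer holds if a series mixes different SRIDs.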



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

