Copilot commented on code in PR #2209:
URL: https://github.com/apache/sedona/pull/2209#discussion_r2249634624
##########
python/sedona/geopandas/geoseries.py:
##########
@@ -465,30 +465,21 @@ def crs(self) -> Union["CRS", None]:
if len(self) == 0:
return None
- if parse_version(pyspark.__version__) >= parse_version("3.5.0"):
- spark_col = stf.ST_SRID(F.first_value(self.spark.column,
ignoreNulls=True))
- # Set this to avoid error complaining that we don't have a groupby
column
- is_aggr = True
- else:
- spark_col = stf.ST_SRID(self.spark.column)
- is_aggr = False
+ # F.first is non-deterministic, but it doesn't matter because all
non-null values should be the same
+ spark_col = stf.ST_SRID(F.first(self.spark.column, ignorenulls=True))
Review Comment:
The parameter name should be `ignoreNulls` (camel case) instead of
`ignorenulls` (lowercase) to match Spark SQL function naming conventions.
```suggestion
spark_col = stf.ST_SRID(F.first(self.spark.column, ignoreNulls=True))
```
##########
python/sedona/geopandas/tools/sjoin.py:
##########
@@ -195,36 +203,35 @@ def _frame_join(
final_columns.append(f"{col_name} as {base_name}")
# Select final columns
- result_df = spatial_join_df.selectExpr(*final_columns)
-
- # Return appropriate type based on input
- if isinstance(left_df, GeoSeries) and isinstance(right_df, GeoSeries):
- # Return GeoSeries for GeoSeries inputs
- internal = InternalFrame(
- spark_frame=result_df,
- index_spark_columns=None,
- column_labels=[left_df._col_label],
- data_spark_columns=[scol_for(result_df, "geometry")],
- data_fields=[left_df._internal.data_fields[0]],
- column_label_names=left_df._internal.column_label_names,
- )
- return _to_geo_series(first_series(ps.DataFrame(internal)))
- else:
- # Return GeoDataFrame for GeoDataFrame inputs
- return GeoDataFrame(result_df)
+ result_df = spatial_join_df.selectExpr(*final_columns).orderBy(
+ SPARK_DEFAULT_INDEX_NAME
+ )
Review Comment:
Adding an `orderBy` operation on the entire result could be expensive for
large datasets. Consider if this ordering is necessary or if it can be
optimized for better performance in distributed environments.
```suggestion
result_df = spatial_join_df.selectExpr(*final_columns)
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]