xinrong-meng commented on code in PR #51006: URL: https://github.com/apache/spark/pull/51006#discussion_r2110394870
##########
python/pyspark/pandas/strings.py:
##########
@@ -2031,7 +2031,13 @@ def pudf(s: pd.Series) -> pd.Series:
         if expand:
             psdf = psser.to_frame()
             scol = psdf._internal.data_spark_columns[0]
-            spark_columns = [scol[i].alias(str(i)) for i in range(n + 1)]
+
+            if ps.get_option("compute.ansi_mode_support"):
+                spark_columns = [
+                    F.try_element_at(scol, F.lit(i + 1)).alias(str(i)) for i in range(n + 1)

Review Comment:
   Thanks for the suggestion! There shouldn't be a significant performance difference between creating `F.lit` inside the loop and hoisting it out beforehand: it just wraps a Python literal in a Spark expression, which isn't executed immediately (it's only a node in the DAG) and will be deduplicated by Catalyst. With that said, I'd like to keep the original for simplicity, but feel free to share if you have other opinions!

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
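[Editor's sketch] The "just nodes in the DAG" point in the review comment can be illustrated with a toy model of lazy expression trees. This is deliberately not pyspark (the `Lit` and `ElementAt` classes below are invented stand-ins for `F.lit` and `F.try_element_at`): it only shows that building a literal node inside a list comprehension versus hoisting it out produces identical expression objects, because nothing is evaluated at construction time.

```python
# Toy sketch, NOT pyspark: expression objects are plain DAG nodes.
# Constructing a literal node inside a loop vs. beforehand only changes
# where cheap Python objects are allocated, not what gets executed later.
from dataclasses import dataclass


@dataclass(frozen=True)
class Lit:
    """Stand-in for F.lit: wraps a Python literal as an expression node."""
    value: int


@dataclass(frozen=True)
class ElementAt:
    """Stand-in for F.try_element_at: a node referencing a literal child."""
    index: Lit


# Literal nodes created inline, as in the PR:
inline = [ElementAt(Lit(i + 1)) for i in range(3)]

# Literal nodes hoisted out of the comprehension:
lits = [Lit(i + 1) for i in range(3)]
hoisted = [ElementAt(lit) for lit in lits]

# Both spellings build the same expression trees; a real optimizer
# (Catalyst, in Spark's case) would see identical plans either way.
assert inline == hoisted
```

In real pyspark the analogous observation is that `F.lit(i + 1)` returns a `Column` wrapping an unevaluated literal expression, so hoisting it is a readability choice rather than an optimization.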