Re: [PR] [SPARK-51282][ML][PYTHON][CONNECT] Optimize OneVsRestModel transform by eliminating the JVM-Python data exchange [spark]

via GitHub Fri, 21 Feb 2025 05:41:24 -0800


zhengruifeng commented on code in PR #50041:
URL: https://github.com/apache/spark/pull/50041#discussion_r1965493411



##########
python/pyspark/sql/internal.py:
##########
@@ -130,3 +130,42 @@ def make_interval(unit: str, e: Union[Column, int, float]) -> Column:
             "SECOND": "secs",
         }
         return F.make_interval(**{unit_mapping[unit]: F.lit(e)})
+
+    @staticmethod
+    def get_vector(vec: Column, idx: Column) -> Column:
+        unwrapped = F.unwrap_udt(vec)
+        is_dense = unwrapped.getField("type") == F.lit(1)
+        values = unwrapped.getField("values")
+        size = F.when(is_dense, F.array_size(values)).otherwise(unwrapped.getField("size"))
+        sparse_idx = InternalFunction.array_binary_search(unwrapped.getField("indices"), idx)
+        value = (
+            F.when(is_dense, F.get(values, idx))
+            .when(sparse_idx >= 0, F.get(values, sparse_idx))
+            .otherwise(F.lit(0.0))
+        )
+
+        return F.when((0 <= idx) & (idx < size), value).otherwise(
+            F.raise_error(F.printf(F.lit("Vector index must be in [0, %s), but got %s"), size, idx))
+        )
+
+    @staticmethod
+    def array_argmax(arr: Column) -> Column:

Review Comment:
   There is a slight difference in the NaN handling.
   
   ```
   In [3]: spark.range(1).select(sf.array_max(sf.lit([1.0,2.0,float("nan")]))).show()
   +-------------------------------+
   |array_max(array(1.0, 2.0, NaN))|
   +-------------------------------+
   |                            NaN|
   +-------------------------------+
   ```
   
   
https://github.com/apache/spark/blob/89fb67f7e88044bcf364d8e70cd171647d7671fe/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2363-L2364
   
   `array_max` treats NaN as the largest value, while in this Python UDF, NaN is ignored.
   
   Also, using a lambda function needs only one pass over this array.
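
To make the two NaN behaviours concrete, here is a minimal plain-Python sketch. It is not Spark code and not the code in this PR; the helper names `argmax_nan_largest` and `argmax_nan_ignored` are made up for illustration. Both scan the array once. The first ranks NaN above every other value, as `array_max` does, and the second skips NaN entries.

```python
import math
from typing import Sequence


def argmax_nan_largest(values: Sequence[float]) -> int:
    """Single-pass argmax that ranks NaN above every other value.

    Hypothetical helper: mirrors the ordering array_max uses, not Spark's code.
    Returns the first index of the largest value (the first NaN, if any).
    """
    if not values:
        raise ValueError("argmax of an empty sequence is undefined")
    best_idx = 0
    for i in range(1, len(values)):
        best = values[best_idx]
        if math.isnan(best):
            break  # NaN already ranks highest, so keep the first NaN
        if math.isnan(values[i]) or values[i] > best:
            best_idx = i
    return best_idx


def argmax_nan_ignored(values: Sequence[float]) -> int:
    """Single-pass argmax that skips NaN entries.

    Hypothetical helper: returns -1 when there is no non-NaN entry.
    """
    best_idx = -1
    for i, v in enumerate(values):
        if math.isnan(v):
            continue  # NaN never becomes the winner
        if best_idx < 0 or v > values[best_idx]:
            best_idx = i
    return best_idx


scores = [1.0, 2.0, float("nan")]
nan_largest_idx = argmax_nan_largest(scores)
nan_ignored_idx = argmax_nan_ignored(scores)
```

For the array from the example above, the two helpers pick different positions, so the choice of NaN semantics can change the predicted label whenever a raw prediction is NaN.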
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

