HyukjinKwon commented on code in PR #54014:
URL: https://github.com/apache/spark/pull/54014#discussion_r2766463361


##########
sql/api/src/main/scala/org/apache/spark/sql/Dataset.scala:
##########
@@ -2010,6 +2010,37 @@ abstract class Dataset[T] extends Serializable {
    */
   def exceptAll(other: Dataset[T]): Dataset[T]
 
+  /**
+   * Returns a new [[Dataset]] by appending a column containing consecutive 0-based Long indices,
+   * similar to `RDD.zipWithIndex()`.
+   *
+   * The index column is appended as the last column of the resulting [[DataFrame]].
+   *
+   * @group untypedrel
+   * @since 4.2.0
+   */
+  def zipWithIndex(): DataFrame = zipWithIndex("index")
+
+  /**
+   * Returns a new [[Dataset]] by appending a column containing consecutive 0-based Long indices,
+   * similar to `RDD.zipWithIndex()`.
+   *
+   * The index column is appended as the last column of the resulting [[DataFrame]].
+   *
+   * @note
+   *   If a column with `indexColName` already exists in the schema, the resulting [[DataFrame]]

Review Comment:
   ```suggestion
      *   If a column with `indexColName` already exists in the schema, the resulting `DataFrame`
   ```



##########
sql/api/src/main/scala/org/apache/spark/sql/Dataset.scala:
##########
@@ -2010,6 +2010,37 @@ abstract class Dataset[T] extends Serializable {
    */
   def exceptAll(other: Dataset[T]): Dataset[T]
 
+  /**
+   * Returns a new [[Dataset]] by appending a column containing consecutive 0-based Long indices,
+   * similar to `RDD.zipWithIndex()`.
+   *
+   * The index column is appended as the last column of the resulting [[DataFrame]].
+   *
+   * @group untypedrel
+   * @since 4.2.0
+   */
+  def zipWithIndex(): DataFrame = zipWithIndex("index")
+
+  /**
+   * Returns a new [[Dataset]] by appending a column containing consecutive 0-based Long indices,
+   * similar to `RDD.zipWithIndex()`.
+   *
+   * The index column is appended as the last column of the resulting [[DataFrame]].

Review Comment:
   ```suggestion
      * The index column is appended as the last column of the resulting `DataFrame`.
   ```



##########
sql/api/src/main/scala/org/apache/spark/sql/Dataset.scala:
##########
@@ -2010,6 +2010,37 @@ abstract class Dataset[T] extends Serializable {
    */
   def exceptAll(other: Dataset[T]): Dataset[T]
 
+  /**
+   * Returns a new [[Dataset]] by appending a column containing consecutive 0-based Long indices,
+   * similar to `RDD.zipWithIndex()`.
+   *
+   * The index column is appended as the last column of the resulting [[DataFrame]].

Review Comment:
   ```suggestion
      * The index column is appended as the last column of the resulting `DataFrame`.
   ```



##########
sql/api/src/main/scala/org/apache/spark/sql/Dataset.scala:
##########
@@ -2010,6 +2010,37 @@ abstract class Dataset[T] extends Serializable {
    */
   def exceptAll(other: Dataset[T]): Dataset[T]
 
+  /**
+   * Returns a new [[Dataset]] by appending a column containing consecutive 0-based Long indices,
+   * similar to `RDD.zipWithIndex()`.
+   *
+   * The index column is appended as the last column of the resulting [[DataFrame]].
+   *
+   * @group untypedrel
+   * @since 4.2.0
+   */
+  def zipWithIndex(): DataFrame = zipWithIndex("index")
+
+  /**
+   * Returns a new [[Dataset]] by appending a column containing consecutive 0-based Long indices,
+   * similar to `RDD.zipWithIndex()`.
+   *
+   * The index column is appended as the last column of the resulting [[DataFrame]].
+   *
+   * @note
+   *   If a column with `indexColName` already exists in the schema, the resulting [[DataFrame]]
+   *   will have duplicate column names. Selecting the duplicate column by name will throw
+   *   `AMBIGUOUS_REFERENCE`, and writing the [[DataFrame]] will throw `COLUMN_ALREADY_EXISTS`.

Review Comment:
   ```suggestion
      *   `AMBIGUOUS_REFERENCE`, and writing the `DataFrame` will throw `COLUMN_ALREADY_EXISTS`.
   ```



##########
sql/api/src/main/scala/org/apache/spark/sql/Dataset.scala:
##########
@@ -2010,6 +2010,37 @@ abstract class Dataset[T] extends Serializable {
    */
   def exceptAll(other: Dataset[T]): Dataset[T]
 
+  /**
+   * Returns a new [[Dataset]] by appending a column containing consecutive 0-based Long indices,
+   * similar to `RDD.zipWithIndex()`.
+   *
+   * The index column is appended as the last column of the resulting [[DataFrame]].
+   *
+   * @group untypedrel
+   * @since 4.2.0
+   */
+  def zipWithIndex(): DataFrame = zipWithIndex("index")
+
+  /**
+   * Returns a new [[Dataset]] by appending a column containing consecutive 0-based Long indices,

Review Comment:
   ```suggestion
      * Returns a new `Dataset` by appending a column containing consecutive 0-based Long indices,
   ```



##########
sql/api/src/main/scala/org/apache/spark/sql/Dataset.scala:
##########
@@ -2010,6 +2010,37 @@ abstract class Dataset[T] extends Serializable {
    */
   def exceptAll(other: Dataset[T]): Dataset[T]
 
+  /**
+   * Returns a new [[Dataset]] by appending a column containing consecutive 0-based Long indices,

Review Comment:
   ```suggestion
      * Returns a new `Dataset` by appending a column containing consecutive 0-based Long indices,
   ```
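
For context, the behavior the new Scaladoc describes (appending consecutive 0-based Long indices as a last column, similar to `RDD.zipWithIndex()`) can be sketched without Spark in plain Python; the `rows` data below is made up for illustration:

```python
# Not Spark: a plain-Python sketch of the documented semantics, i.e. appending
# consecutive 0-based indices as the last "column" of each row. The rows here
# are hypothetical sample data.
rows = [("alice",), ("bob",), ("carol",)]
with_index = [row + (i,) for i, row in enumerate(rows)]
print(with_index)  # [('alice', 0), ('bob', 1), ('carol', 2)]
```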



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

