Re: Add row IDs column to data frame

akbar501 Thu, 12 Jan 2017 00:01:21 -0800

Below are two approaches to adding an id/index to an RDD and one approach to
adding an id column to a DataFrame.


Add an index column to an RDD


```scala
// RDD
val dataRDD = sc.textFile("./README.md")
// Add index then set index as key in map() transformation
// Results in RDD[(Long, String)]
val indexedRDD = dataRDD.zipWithIndex().map(pair => (pair._2, pair._1))
```

Add a unique id column to an RDD


```scala
// RDD
val dataRDD = sc.textFile("./README.md")
// Add unique id then set id as key in map() transformation
// Results in RDD[(Long, String)]
val indexedRDD = dataRDD.zipWithUniqueId().map(pair => (pair._2, pair._1))
indexedRDD.collect
```
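
To make the difference between the two RDD approaches concrete, here is a small sketch, assuming spark-shell (where `sc` is predefined) and a made-up four-element RDD split over two partitions. The names `letters`, `byIndex` and `byUniqueId` are only for illustration. As far as I know, zipWithIndex gives consecutive positions but has to run a Spark job when there is more than one partition, while zipWithUniqueId skips that job and gives element i of partition k the id i * n + k, where n is the number of partitions.

```scala
// Assumes spark-shell, where `sc` is predefined.
// Four elements split across two partitions: ("a", "b") and ("c", "d")
val letters = sc.parallelize(Seq("a", "b", "c", "d"), numSlices = 2)

// zipWithIndex: consecutive positions 0, 1, 2, 3 in RDD order
val byIndex = letters.zipWithIndex().map(_.swap).collect()

// zipWithUniqueId: unique but not consecutive in RDD order;
// partition 0 gets ids 0, 2 and partition 1 gets ids 1, 3
val byUniqueId = letters.zipWithUniqueId().map(_.swap).collect()

byIndex.foreach(println)
byUniqueId.foreach(println)
```

If you need ids that match the row position, zipWithIndex is the safer choice; if you only need uniqueness, zipWithUniqueId avoids the extra job.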

Add an index column to a DataFrame


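Note: You could use a similar approach with a Dataset. The generated ids are
unique and monotonically increasing, but they are not guaranteed to be
consecutive.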

```scala
import spark.implicits._
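import org.apache.spark.sql.functions.monotonically_increasing_id

// spark.read.textFile returns a Dataset[String] with a single "value" column
val dataDF = spark.read.textFile("./README.md")
// withColumn returns a DataFrame with the generated id appended
val indexedDF = dataDF.withColumn("id", monotonically_increasing_id())
indexedDF.select($"id", $"value").show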
```
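
If you need a consecutive 0-based index on a DataFrame, one possible approach (a sketch, not the only way) is to go through the underlying RDD with zipWithIndex and rebuild the DataFrame with an extended schema. This assumes spark-shell, where `spark` is predefined; the names `lettersDF`, `indexedRows`, `indexedSchema` and `consecutiveDF`, and the three-row sample data, are made up for illustration.

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}
import spark.implicits._

// Small in-memory DataFrame with a single "value" column (sample data)
val lettersDF = Seq("a", "b", "c").toDF("value")

// Pair each Row with its position, then append the position as a new field
val indexedRows = lettersDF.rdd.zipWithIndex().map {
  case (row, idx) => Row.fromSeq(row.toSeq :+ idx)
}

// Extend the original schema with a non-nullable Long "id" column
val indexedSchema = StructType(
  lettersDF.schema.fields :+ StructField("id", LongType, nullable = false)
)

val consecutiveDF = spark.createDataFrame(indexedRows, indexedSchema)
consecutiveDF.show()
```

The trade-off is the round trip through the RDD API plus the extra job that zipWithIndex may run, so monotonically_increasing_id is usually cheaper when gaps between ids are acceptable.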



-----
Delixus.com - Spark Consulting