Just in case you are more comfortable with SQL, `row_number() over ()`
should also generate a unique id (note that Spark requires the window to be ordered, so in practice you need an `order by` clause).

On Thu, Jan 12, 2017 at 7:00 PM, akbar501 <akbar...@gmail.com> wrote:

> The following are 2 different approaches to adding an id/index to RDDs and 1
> approach to adding an index to a DataFrame.
>
> Add an index column to an RDD
>
> ```scala
> // RDD
> val dataRDD = sc.textFile("./README.md")
> // Add index then set index as key in map() transformation
> // Results in RDD[(Long, String)]
> val indexedRDD = dataRDD.zipWithIndex().map(pair => (pair._2, pair._1))
> ```
>
> Add a unique id column to an RDD
>
> ```scala
> // RDD
> val dataRDD = sc.textFile("./README.md")
> // Add unique id then set id as key in map() transformation
> // Results in RDD[(Long, String)]
> val indexedRDD = dataRDD.zipWithUniqueId().map(pair => (pair._2, pair._1))
> indexedRDD.collect
> ```
>
> Add an index column to a DataFrame
>
> Note: You could use a similar approach with a Dataset.
>
> ```scala
> import spark.implicits._
> import org.apache.spark.sql.functions.monotonically_increasing_id
>
> val dataDF = spark.read.textFile("./README.md")
> val indexedDF = dataDF.withColumn("id", monotonically_increasing_id)
> indexedDF.select($"id", $"value").show
> ```
>
> -----
> Delixus.com - Spark Consulting
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Append-column-to-Data-Frame-or-RDD-tp22385p28300.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org

--
Best Regards,
Ayan Guha
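For completeness, here is a minimal sketch of the `row_number()` suggestion above in the DataFrame API, assuming a `SparkSession` named `spark` and reusing the `./README.md` input from the quoted examples. Spark insists on an ordered window for `row_number()`, so this orders by a literal; that forces all rows into a single partition, so on real data you should order by an actual column instead.

```scala
import spark.implicits._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lit, row_number}

val dataDF = spark.read.textFile("./README.md")
// row_number() needs an ordered window; ordering by a constant gives
// sequential ids 1, 2, 3, ... but collapses the data to one partition.
val indexedDF = dataDF.withColumn("id", row_number().over(Window.orderBy(lit(1))))
indexedDF.select($"id", $"value").show()
```

Unlike `monotonically_increasing_id`, which produces ids that are unique and increasing but not consecutive, `row_number()` yields a gap-free sequence at the cost of the single-partition shuffle.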