Hey Daniel, not sure this will help, but... I had a similar need, where I wanted 
the content of a dataframe to become a "cell" or a row in a parent dataframe. 
I grouped the child dataframe, then collected each group as a list in the parent 
dataframe after a join operation (see the sketch below). As I said, not sure it 
matches your use case, but HIH...
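
Roughly, the pattern looked like this (just an illustration -- childDF/parentDF 
and the column names are placeholders, not from your code):

      import org.apache.spark.sql.functions.collect_list

      // collapse the child rows into one list per key...
      val childLists = childDF
        .groupBy("parent_id")
        .agg(collect_list("item").as("items"))

      // ...then attach each list as a single column on the parent rows
      val parentWithLists = parentDF.join(childLists, Seq("parent_id"))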
jg

> On Sep 15, 2017, at 5:42 AM, Daniel O' Shaughnessy 
> <danieljamesda...@gmail.com> wrote:
> 
> Hi guys,
> 
> I'm having trouble implementing this scenario:
> 
> I have a column where a typical entry is: ['apple', 'orange', 'apple', 'pear', 'pear']
> 
> I need to use a StringIndexer to transform this to: [0, 2, 0, 1, 1]
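> 
> (For concreteness, a minimal sketch of that mapping on a flat column: with the
> default frequency ordering, apple (x2) -> 0.0, pear (x2) -> 1.0, orange (x1) -> 2.0,
> assuming ties resolve alphabetically:)
> 
>       import org.apache.spark.ml.feature.StringIndexer
>       import sqlContext.implicits._
> 
>       val flat = Seq("apple", "orange", "apple", "pear", "pear").toDF("fruit")
>       val indexed = new StringIndexer()
>         .setInputCol("fruit")
>         .setOutputCol("fruitIndex")
>         .fit(flat)
>         .transform(flat)
>       // fruitIndex column: 0.0, 2.0, 0.0, 1.0, 1.0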
> 
> I'm attempting to do this, but because of the nested operation on another RDD 
> (calling transform and parallelize inside an rdd.map) I get the NPE.
> 
> Here's my code so far, thanks:
> 
> import org.apache.spark.ml.feature.{IndexToString, StringIndexer}
> 
> val dfWithSchema = sqlContext.createDataFrame(eventFeaturesRDD).toDF("email", "event_name")
> 
>       // attempting
>       import sqlContext.implicits._
>       val event_list = dfWithSchema.select("event_name").distinct
>       val event_listDF = event_list.toDF()
>       val eventIndexer = new StringIndexer()
>         .setInputCol("event_name")
>         .setOutputCol("eventIndex")
>         .fit(event_listDF)
> 
>       val eventIndexed = eventIndexer.transform(event_listDF)
> 
>       val converter = new IndexToString()
>         .setInputCol("eventIndex")
>         .setOutputCol("originalCategory")
> 
>       val convertedEvents = converter.transform(eventIndexed)
> 
>       // parse the string column "[a, b, c]" into a List[String]
>       val rddX = dfWithSchema.select("event_name").rdd
>         .map(_.getString(0).split(",").map(_.trim.replaceAll("[\\[\\]\"]", "")).toList)
> 
>       //val oneRow = Converted(eventIndexer.transform(sqlContext.sparkContext.parallelize(Seq("CCB")).toDF("event_name")).select("eventIndex").first().getDouble(0))
> 
>       val severalRows = rddX.map(row => {
>         // Split array into n tools
>         println("ROW: " + row(0).toString)
>         println(row(0).getClass)
>         // NPE occurs here: sqlContext/parallelize/transform are driver-side
>         // and are null inside a function shipped to the executors
>         println("PRINT: " + eventIndexer.transform(sqlContext.sparkContext.parallelize(Seq(row(0))).toDF("event_name")).select("eventIndex").first().getDouble(0))
>         (eventIndexer.transform(sqlContext.sparkContext.parallelize(Seq(row)).toDF("event_name")).select("eventIndex").first().getDouble(0),
>          Seq(row).toString)
>       })
>       // attempting
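> 
> (I'm guessing the fix is to avoid sqlContext inside the map entirely, e.g. by
> building a plain lookup map on the driver from the fitted model's labels --
> just a sketch, eventToIndex is a name I'm making up:)
> 
>       // labels(i) is the label for index i, so zipWithIndex gives label -> index
>       val eventToIndex: Map[String, Double] =
>         eventIndexer.labels.zipWithIndex.map { case (l, i) => (l, i.toDouble) }.toMap
> 
>       // plain Scala lookup, safe inside rdd.map (throws if a label is unseen)
>       val indexedRows = rddX.map(row => row.map(eventToIndex))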
