Re: Saving and Loading Dataframes

Raj Kumar Fri, 26 Feb 2016 10:02:15 -0800

Thanks for the response, Yanbo. Here is the source (it uses the
sample_libsvm_data.txt file used in the
mllib examples).


-Raj
------------- IOTest.scala -------------

import org.apache.spark.{SparkConf,SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.DataFrame

object IOTest {
  val InputFile = "/tmp/sample_libsvm_data.txt"
  val OutputDir ="/tmp/out"

  val sconf = new SparkConf().setAppName("test").setMaster("local[*]")
  val sqlc  = new SQLContext( new SparkContext( sconf ))
  val df = sqlc.read.format("libsvm").load( InputFile  )
  df.show; df.printSchema

  df.write.format("json").mode("overwrite").save( OutputDir )
  val data = sqlc.read.format("json").load( OutputDir )
  data.show; data.printSchema

  def main( args: Array[String]):Unit = {}
}


-----------------------
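
Editor's sketch, not part of the original message: one possible workaround for the schema change described in this thread is to write Parquet instead of JSON. It assumes that Parquet output keeps the Spark SQL schema, including the vector user-defined type, in the file metadata, so the reloaded `features` column should still be a vector. The object name `ParquetIOTest` and the output path `/tmp/out_parquet` are hypothetical, and the code targets the same SQLContext-style API as the listing above.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Sketch only: same pipeline as IOTest, but the round trip goes through Parquet.
object ParquetIOTest {
  def main(args: Array[String]): Unit = {
    val inputFile = "/tmp/sample_libsvm_data.txt"
    val outputDir = "/tmp/out_parquet" // hypothetical path, chosen for this example

    val conf = new SparkConf().setAppName("parquet-io-test").setMaster("local[*]")
    val sc   = new SparkContext(conf)
    val sqlc = new SQLContext(sc)

    // Load the libsvm sample; this yields the label/features (vector) schema.
    val df = sqlc.read.format("libsvm").load(inputFile)
    df.printSchema()

    // Round trip through Parquet instead of JSON.
    df.write.format("parquet").mode("overwrite").save(outputDir)
    val data = sqlc.read.format("parquet").load(outputDir)
    data.printSchema()

    // Compare only the data type of the features column; nullability flags
    // may legitimately differ after a round trip through files.
    val sameType = data.schema("features").dataType == df.schema("features").dataType
    println(s"features type preserved: $sameType")

    sc.stop()
  }
}
```

If the assumption holds, the second `printSchema` should report `features` as a vector and not as a struct of `indices`, `size`, `type` and `values`. Unlike the listing above, the work is done inside `main` so that the context can be stopped explicitly at the end.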

On Feb 26, 2016, at 12:47 AM, Yanbo Liang wrote:

Hi Raj,

Could you share your code so that others can help diagnose this issue? Which
version did you use?
I cannot reproduce this problem in my environment.

Thanks
Yanbo

2016-02-26 10:49 GMT+08:00 raj.kumar:
Hi,

I am using mllib. I use the ml vectorization tools to create the vectorized
input dataframe for
the ml/mllib machine-learning models with schema:

root
 |-- label: double (nullable = true)
 |-- features: vector (nullable = true)

To avoid repeated vectorization, I am trying to save and load this dataframe
using:
    df.write.format("json").mode("overwrite").save( url )
    val data = Spark.sqlc.read.format("json").load( url )

However, when I load the dataframe, the newly loaded dataframe has the
following schema:
root
 |-- features: struct (nullable = true)
 |    |-- indices: array (nullable = true)
 |    |    |-- element: long (containsNull = true)
 |    |-- size: long (nullable = true)
 |    |-- type: long (nullable = true)
 |    |-- values: array (nullable = true)
 |    |    |-- element: double (containsNull = true)
 |-- label: double (nullable = true)

which the machine-learning models do not recognize.

Is there a way I can save and load this dataframe without the schema
changing?
I assume it has to do with the fact that Vector is not a basic type.

thanks
-Raj






