Please see: https://issues.apache.org/jira/browse/SPARK-19477
Thanks.

-Don

On Wed, Feb 8, 2017 at 6:51 AM, 萝卜丝炒饭 <1427357...@qq.com> wrote:

> I checked it; it seems to be a bug. Could you create a JIRA now, please?
>
> ---Original---
> From: "Don Drake" <dondr...@gmail.com>
> Date: 2017/2/7 01:26:59
> To: "user" <user@spark.apache.org>
> Subject: Re: Spark 2 - Creating datasets from dataframes with extra columns
>
> This seems like a bug to me; the schemas should match.
>
> scala> import org.apache.spark.sql.Encoders
> import org.apache.spark.sql.Encoders
>
> scala> val fEncoder = Encoders.product[F]
> fEncoder: org.apache.spark.sql.Encoder[F] = class[f1[0]: string, f2[0]: string, f3[0]: string]
>
> scala> fEncoder.schema == ds.schema
> res2: Boolean = false
>
> scala> ds.schema
> res3: org.apache.spark.sql.types.StructType = StructType(StructField(f1,StringType,true), StructField(f2,StringType,true), StructField(f3,StringType,true), StructField(c4,StringType,true))
>
> scala> fEncoder.schema
> res4: org.apache.spark.sql.types.StructType = StructType(StructField(f1,StringType,true), StructField(f2,StringType,true), StructField(f3,StringType,true))
>
> I'll open a JIRA.
>
> -Don
>
> On Thu, Feb 2, 2017 at 2:46 PM, Don Drake <dondr...@gmail.com> wrote:
>
>> In 1.6, when you created a Dataset from a DataFrame that had extra
>> columns, the columns not in the case class were dropped from the Dataset.
>>
>> For example, in 1.6 the column c4 is gone:
>>
>> scala> case class F(f1: String, f2: String, f3: String)
>> defined class F
>>
>> scala> import sqlContext.implicits._
>> import sqlContext.implicits._
>>
>> scala> val df = Seq(("a","b","c","x"), ("d","e","f","y"), ("h","i","j","z")).toDF("f1", "f2", "f3", "c4")
>> df: org.apache.spark.sql.DataFrame = [f1: string, f2: string, f3: string, c4: string]
>>
>> scala> val ds = df.as[F]
>> ds: org.apache.spark.sql.Dataset[F] = [f1: string, f2: string, f3: string]
>>
>> scala> ds.show
>> +---+---+---+
>> | f1| f2| f3|
>> +---+---+---+
>> |  a|  b|  c|
>> |  d|  e|  f|
>> |  h|  i|  j|
>> +---+---+---+
>>
>> This behavior seems to have changed in Spark 2.0 and also 2.1:
>>
>> Spark 2.1.0:
>>
>> scala> case class F(f1: String, f2: String, f3: String)
>> defined class F
>>
>> scala> import spark.implicits._
>> import spark.implicits._
>>
>> scala> val df = Seq(("a","b","c","x"), ("d","e","f","y"), ("h","i","j","z")).toDF("f1", "f2", "f3", "c4")
>> df: org.apache.spark.sql.DataFrame = [f1: string, f2: string ... 2 more fields]
>>
>> scala> val ds = df.as[F]
>> ds: org.apache.spark.sql.Dataset[F] = [f1: string, f2: string ... 2 more fields]
>>
>> scala> ds.show
>> +---+---+---+---+
>> | f1| f2| f3| c4|
>> +---+---+---+---+
>> |  a|  b|  c|  x|
>> |  d|  e|  f|  y|
>> |  h|  i|  j|  z|
>> +---+---+---+---+
>>
>> Is there a way to get a Dataset that conforms to the case class in Spark
>> 2.1.0? Basically, I'm attempting to use the case class to define an output
>> schema, and these extra columns are getting in the way.
>>
>> Thanks.
>>
>> -Don
>>
>> --
>> Donald Drake
>> Drake Consulting
>> http://www.drakeconsulting.com/
>> https://twitter.com/dondrake
>> 800-733-2143

--
Donald Drake
Drake Consulting
http://www.drakeconsulting.com/
https://twitter.com/dondrake
800-733-2143
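[Editor's note: a common workaround for the behavior discussed above is to project the DataFrame down to exactly the case class's columns before calling `.as[F]`. The sketch below is untested and assumes a Spark 2.1 REPL with a `SparkSession` bound to `spark`; it derives the column list from the case class's encoder so the projection stays in sync with the class. It is not confirmed as the resolution of SPARK-19477.]

```scala
// Untested sketch; assumes a running spark-shell (SparkSession as `spark`).
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.functions.col
import spark.implicits._

case class F(f1: String, f2: String, f3: String)

val df = Seq(("a", "b", "c", "x"), ("d", "e", "f", "y"), ("h", "i", "j", "z"))
  .toDF("f1", "f2", "f3", "c4")

// Derive the projection from the encoder so it always matches the case class.
val wantedCols = Encoders.product[F].schema.fieldNames.map(col)

// Select only the declared fields, then convert; c4 is dropped as in 1.6.
val ds = df.select(wantedCols: _*).as[F]

// ds.schema should now contain only f1, f2, f3.
```

Because `wantedCols` is computed from `Encoders.product[F].schema`, adding or removing a field in the case class changes the projection automatically, rather than requiring a hand-maintained column list.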