Please see: https://issues.apache.org/jira/browse/SPARK-19477
Thanks.

-Don

On Wed, Feb 8, 2017 at 6:51 AM, 萝卜丝炒饭 <1427357...@qq.com> wrote:

> I checked it; it seems to be a bug. Could you create a JIRA now, please?
>
> ---Original---
> From: "Don Drake" <dondr...@gmail.com>
> Date: 2017/2/7 01:26:59
> To: "user" <user@spark.apache.org>
> Subject: Re: Spark 2 - Creating datasets from dataframes with extra columns
>
> This seems like a bug to me; the schemas should match.
>
> scala> import org.apache.spark.sql.Encoders
> import org.apache.spark.sql.Encoders
>
> scala> val fEncoder = Encoders.product[F]
> fEncoder: org.apache.spark.sql.Encoder[F] = class[f1[0]: string, f2[0]: string, f3[0]: string]
>
> scala> fEncoder.schema == ds.schema
> res2: Boolean = false
>
> scala> ds.schema
> res3: org.apache.spark.sql.types.StructType = StructType(StructField(f1,StringType,true), StructField(f2,StringType,true), StructField(f3,StringType,true), StructField(c4,StringType,true))
>
> scala> fEncoder.schema
> res4: org.apache.spark.sql.types.StructType = StructType(StructField(f1,StringType,true), StructField(f2,StringType,true), StructField(f3,StringType,true))
>
> I'll open a JIRA.
>
> -Don
>
> On Thu, Feb 2, 2017 at 2:46 PM, Don Drake <dondr...@gmail.com> wrote:
>
>> In 1.6, when you created a Dataset from a DataFrame that had extra
>> columns, the columns not in the case class were dropped from the Dataset.
>>
>> For example, in 1.6 the column c4 is gone:
>>
>> scala> case class F(f1: String, f2: String, f3: String)
>> defined class F
>>
>> scala> import sqlContext.implicits._
>> import sqlContext.implicits._
>>
>> scala> val df = Seq(("a","b","c","x"), ("d","e","f","y"), ("h","i","j","z")).toDF("f1", "f2", "f3", "c4")
>> df: org.apache.spark.sql.DataFrame = [f1: string, f2: string, f3: string, c4: string]
>>
>> scala> val ds = df.as[F]
>> ds: org.apache.spark.sql.Dataset[F] = [f1: string, f2: string, f3: string]
>>
>> scala> ds.show
>> +---+---+---+
>> | f1| f2| f3|
>> +---+---+---+
>> |  a|  b|  c|
>> |  d|  e|  f|
>> |  h|  i|  j|
>> +---+---+---+
>>
>> This behavior seems to have changed in Spark 2.0 and also 2.1:
>>
>> Spark 2.1.0:
>>
>> scala> case class F(f1: String, f2: String, f3: String)
>> defined class F
>>
>> scala> import spark.implicits._
>> import spark.implicits._
>>
>> scala> val df = Seq(("a","b","c","x"), ("d","e","f","y"), ("h","i","j","z")).toDF("f1", "f2", "f3", "c4")
>> df: org.apache.spark.sql.DataFrame = [f1: string, f2: string ... 2 more fields]
>>
>> scala> val ds = df.as[F]
>> ds: org.apache.spark.sql.Dataset[F] = [f1: string, f2: string ... 2 more fields]
>>
>> scala> ds.show
>> +---+---+---+---+
>> | f1| f2| f3| c4|
>> +---+---+---+---+
>> |  a|  b|  c|  x|
>> |  d|  e|  f|  y|
>> |  h|  i|  j|  z|
>> +---+---+---+---+
>>
>> Is there a way to get a Dataset that conforms to the case class in Spark
>> 2.1.0? Basically, I'm attempting to use the case class to define an output
>> schema, and these extra columns are getting in the way.
>>
>> Thanks.
>>
>> -Don
>>
>> --
>> Donald Drake
>> Drake Consulting
>> http://www.drakeconsulting.com/
>> https://twitter.com/dondrake
>> 800-733-2143

--
Donald Drake
Drake Consulting
http://www.drakeconsulting.com/
https://twitter.com/dondrake
800-733-2143
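[Editor's note: a common workaround for the behavior discussed above is to project the DataFrame down to exactly the case class's columns before calling `.as[F]`. The sketch below is untested and assumes a Spark 2.1 REPL with a `SparkSession` bound to `spark`; it derives the column list from the case class's encoder so the projection stays in sync with the class. It is not confirmed as the resolution of SPARK-19477.]

```scala
// Untested sketch; assumes a running spark-shell (SparkSession as `spark`).
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.functions.col
import spark.implicits._

case class F(f1: String, f2: String, f3: String)

val df = Seq(("a", "b", "c", "x"), ("d", "e", "f", "y"), ("h", "i", "j", "z"))
  .toDF("f1", "f2", "f3", "c4")

// Derive the projection from the encoder so it always matches the case class.
val wantedCols = Encoders.product[F].schema.fieldNames.map(col)

// Select only the declared fields, then convert; c4 is dropped as in 1.6.
val ds = df.select(wantedCols: _*).as[F]

// ds.schema should now contain only f1, f2, f3.
```

Because `wantedCols` is computed from `Encoders.product[F].schema`, adding or removing a field in the case class changes the projection automatically, rather than requiring a hand-maintained column list.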