This seems like a bug to me; the two schemas should match.

scala> import org.apache.spark.sql.Encoders
import org.apache.spark.sql.Encoders

scala> val fEncoder = Encoders.product[F]
fEncoder: org.apache.spark.sql.Encoder[F] = class[f1[0]: string, f2[0]:
string, f3[0]: string]

scala> fEncoder.schema == ds.schema
res2: Boolean = false

scala> ds.schema
res3: org.apache.spark.sql.types.StructType =
StructType(StructField(f1,StringType,true),
StructField(f2,StringType,true), StructField(f3,StringType,true),
StructField(c4,StringType,true))

scala> fEncoder.schema
res4: org.apache.spark.sql.types.StructType =
StructType(StructField(f1,StringType,true),
StructField(f2,StringType,true), StructField(f3,StringType,true))
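
In the meantime, a possible workaround (just an untested sketch, assuming
the df from the 2.1.0 example below and the fEncoder above, and that the
DataFrame column names match the case class fields exactly) is to select
only the encoder's columns before calling .as[F], so the extra column gets
dropped explicitly:

import org.apache.spark.sql.functions.col

// Keep only the columns the case class encoder knows about, then convert.
val trimmed = df.select(fEncoder.schema.fieldNames.map(col): _*).as[F]
// trimmed.schema should now equal fEncoder.schema, with c4 gone.

I believe ds.map(identity) (with spark.implicits._ in scope) would also
force a round trip through the encoder and drop c4, but the select above
makes the intent clearer.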

I'll open a JIRA.

-Don

On Thu, Feb 2, 2017 at 2:46 PM, Don Drake <dondr...@gmail.com> wrote:

> In 1.6, when you created a Dataset from a DataFrame that had extra
> columns, the columns not in the case class were dropped from the Dataset.
>
> For example in 1.6, the column c4 is gone:
>
> scala> case class F(f1: String, f2: String, f3:String)
> defined class F
>
> scala> import sqlContext.implicits._
> import sqlContext.implicits._
>
> scala> val df = Seq(("a","b","c","x"), ("d", "e", "f","y"), ("h", "i",
> "j","z")).toDF("f1", "f2", "f3", "c4")
> df: org.apache.spark.sql.DataFrame = [f1: string, f2: string, f3: string,
> c4: string]
>
> scala> val ds = df.as[F]
> ds: org.apache.spark.sql.Dataset[F] = [f1: string, f2: string, f3: string]
>
> scala> ds.show
> +---+---+---+
> | f1| f2| f3|
> +---+---+---+
> |  a|  b|  c|
> |  d|  e|  f|
> |  h|  i|  j|
> +---+---+---+
>
> This seems to have changed in Spark 2.0 and also 2.1:
>
> Spark 2.1.0:
>
> scala> case class F(f1: String, f2: String, f3:String)
> defined class F
>
> scala> import spark.implicits._
> import spark.implicits._
>
> scala> val df = Seq(("a","b","c","x"), ("d", "e", "f","y"), ("h", "i",
> "j","z")).toDF("f1", "f2", "f3", "c4")
> df: org.apache.spark.sql.DataFrame = [f1: string, f2: string ... 2 more
> fields]
>
> scala> val ds = df.as[F]
> ds: org.apache.spark.sql.Dataset[F] = [f1: string, f2: string ... 2 more
> fields]
>
> scala> ds.show
> +---+---+---+---+
> | f1| f2| f3| c4|
> +---+---+---+---+
> |  a|  b|  c|  x|
> |  d|  e|  f|  y|
> |  h|  i|  j|  z|
> +---+---+---+---+
>
> Is there a way to get a Dataset that conforms to the case class in Spark
> 2.1.0?  Basically, I'm attempting to use the case class to define an output
> schema, and these extra columns are getting in the way.
>
> Thanks.
>
> -Don
>
> --
> Donald Drake
> Drake Consulting
> http://www.drakeconsulting.com/
> https://twitter.com/dondrake
> 800-733-2143
>



-- 
Donald Drake
Drake Consulting
http://www.drakeconsulting.com/
https://twitter.com/dondrake
800-733-2143
