Hi,

AFAIK `Dataset#printSchema` just prints the output schema of the analyzed
logical plan that the Dataset holds.
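
For example, you can pull that schema straight from the analyzed plan
(just a rough sketch against the `x` from your example; `queryExecution`
is a developer API, so the exact output may vary by Spark version):

scala> x.as[Array[Byte]].printSchema
scala> println(x.as[Array[Byte]].queryExecution.analyzed.schema.treeString)  // same tree as printSchema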

The analyzed logical plans in your example are as follows:

---

scala> x.as[Array[Byte]].explain(true)

== Analyzed Logical Plan ==
x: string
Project [value#1 AS x#3]
+- LocalRelation [value#1]

scala> x.as[Array[Byte]].map(x => x).explain(true)

== Analyzed Logical Plan ==
value: binary
SerializeFromObject [input[0, binary, true] AS value#43]
+- MapElements <function1>, class [B, [StructField(value,BinaryType,true)], obj#42: binary
   +- DeserializeToObject cast(x#3 as binary), obj#41: binary
      +- Project [value#1 AS x#3]
         +- LocalRelation [value#1]

---

So, they print different schemas: as far as I can tell, `.as[Array[Byte]]`
only attaches an encoder and leaves the analyzed plan output (`x: string`)
as-is, while `map` forces deserialization (the `DeserializeToObject
cast(x#3 as binary)` node above), so the resulting plan's output schema
becomes `value: binary`.
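
If you want to see the schema that the `Array[Byte]` encoder itself carries
(the one `map` materializes), something like this should show it (just a
sketch, assuming Spark 2.x):

scala> org.apache.spark.sql.Encoders.BINARY.schema.printTreeString()
// should print a single "value: binary (nullable = true)" field, matching the second plan above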


// maropu



On Wed, Jan 25, 2017 at 1:28 AM, Koert Kuipers <ko...@tresata.com> wrote:

> scala> val x = Seq("a", "b").toDF("x")
> x: org.apache.spark.sql.DataFrame = [x: string]
>
> scala> x.as[Array[Byte]].printSchema
> root
>  |-- x: string (nullable = true)
>
> scala> x.as[Array[Byte]].map(x => x).printSchema
> root
>  |-- value: binary (nullable = true)
>
> why does the first schema show string instead of binary?
>



-- 
---
Takeshi Yamamuro
