Hi, AFAIK `Dataset#printSchema` just prints the output schema of the analyzed logical plan that the Dataset holds.
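If it helps, here is a rough way to check that in the shell (just a sketch, assuming a spark-shell session where `spark` and its implicits are in scope; `queryExecution` is a developer API, so its exact output can differ between versions):

---
import spark.implicits._

val x = Seq("a", "b").toDF("x")
val ds = x.as[Array[Byte]]

// printSchema follows the analyzed plan's output, so both report x as string here
ds.printSchema()
println(ds.queryExecution.analyzed.schema)

// map forces a deserialize/serialize round-trip, so the output
// becomes the encoder's schema (value: binary)
val mapped = ds.map(x => x)
mapped.printSchema()
println(mapped.queryExecution.analyzed.schema)
---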
The logical plans in your example are as follows;

---
scala> x.as[Array[Byte]].explain(true)
== Analyzed Logical Plan ==
x: string
Project [value#1 AS x#3]
+- LocalRelation [value#1]

scala> x.as[Array[Byte]].map(x => x).explain(true)
== Analyzed Logical Plan ==
value: binary
SerializeFromObject [input[0, binary, true] AS value#43]
+- MapElements <function1>, class [B, [StructField(value,BinaryType,true)], obj#42: binary
   +- DeserializeToObject cast(x#3 as binary), obj#41: binary
      +- Project [value#1 AS x#3]
         +- LocalRelation [value#1]
---

So, it seems they print different schemas.

// maropu

On Wed, Jan 25, 2017 at 1:28 AM, Koert Kuipers <ko...@tresata.com> wrote:

> scala> val x = Seq("a", "b").toDF("x")
> x: org.apache.spark.sql.DataFrame = [x: string]
>
> scala> x.as[Array[Byte]].printSchema
> root
>  |-- x: string (nullable = true)
>
> scala> x.as[Array[Byte]].map(x => x).printSchema
> root
>  |-- value: binary (nullable = true)
>
> why does the first schema show string instead of binary?
>

--
---
Takeshi Yamamuro