It's not throwing away any information from the point of view of the SQL
optimizer. The schema preserves all the type information that Catalyst
uses. The type information T in Dataset[T] is only used at the API level to
ensure compile-time type checks of the user program.
On Thu, Jun 16, 20
I'm clear on what a type alias is. My question is more that moving
from e.g. Dataset[T] to Dataset[Row] involves throwing away
information. Reading through code that uses the DataFrame alias, it's
a little hard for me to know when that's intentional or not.
On Thu, Jun 16, 2016 at 2:50 PM, Tath
There are different ways to view this. If it's confusing to think of the
Source API as returning DataFrames, it's equivalent to thinking of it as
returning a Dataset[Row]; DataFrame is just a shorthand.
And DataFrame/Dataset[Row] is to Dataset[String] what Java's
Array[Object] is to Array[String].
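To make the array analogy above concrete, here is a minimal sketch in plain Java (Object[] vs. String[]). The class and method names are made up for illustration and nothing here is Spark API. It only shows that the untyped view holds the same data, while the compiler checks less for you.

```java
public class ArrayAnalogy {
    // Untyped view: elements are only known to be Object at compile time,
    // so the caller has to cast, and a wrong cast fails only at run time.
    public static int totalLengthUntyped(Object[] values) {
        int total = 0;
        for (Object v : values) {
            total += ((String) v).length();
        }
        return total;
    }

    // Typed view: the compiler already knows every element is a String.
    public static int totalLengthTyped(String[] values) {
        int total = 0;
        for (String v : values) {
            total += v.length();
        }
        return total;
    }

    public static void main(String[] args) {
        String[] typed = {"spark", "sql"};
        Object[] untyped = typed; // same underlying data, less static type information
        System.out.println(totalLengthTyped(typed));
        System.out.println(totalLengthUntyped(untyped));
    }
}
```

Both calls return the same total because the data is identical; the only difference is whether the element type is checked by the compiler or by a cast at run time.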
Is this really an internal / external distinction?
For a concrete example, Source.getBatch seems to be a public
interface, but returns DataFrame.
On Thu, Jun 16, 2016 at 1:42 PM, Tathagata Das
wrote:
> DataFrame is a type alias of Dataset[Row], so externally it seems like
> Dataset is the main t
DataFrame is a type alias of Dataset[Row], so externally it seems like
Dataset is the main type and DataFrame is a derivative type.
However, internally, since everything is processed as Rows, everything uses
DataFrames. The typed classes used in a Dataset are internally converted to
Rows for processing.
Sorry, meant DataFrame vs Dataset
On Thu, Jun 16, 2016 at 12:53 PM, Cody Koeninger wrote:
> Is there a principled reason why sql.streaming.* and
> sql.execution.streaming.* are making extensive use of DataFrame
> instead of Datasource?
>
> Or is that just a holdover from code written before the m