Re: Structured streaming use of DataFrame vs Datasource

2016-06-16 Thread Tathagata Das
It's not throwing away any information from the point of view of the SQL optimizer. The schema preserves all the type information that Catalyst uses. The type information T in Dataset[T] is only used at the API level to ensure compile-time type checks of the user program. On Thu, Jun 16, 20
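
To make the distinction above concrete, here is a minimal sketch in plain Scala. It does not use Spark: Event, UntypedRow, schema, and toUntyped are hypothetical stand-ins chosen for illustration. It only shows the general idea that a runtime schema can keep field names and types while the static type T is what lets the compiler reject a misspelled field.

```scala
// Toy illustration in plain Scala; none of these names are Spark classes.
object TypeInfoSketch {
  // A user-level type, playing the role of T in Dataset[T].
  final case class Event(id: Long, topic: String)

  // Stand-in for a schema: field names paired with runtime type names.
  val schema: Vector[(String, String)] =
    Vector("id" -> "Long", "topic" -> "String")

  // Stand-in for a Row: values keyed by field name, statically typed only as Any.
  type UntypedRow = Map[String, Any]

  def toUntyped(e: Event): UntypedRow =
    Map("id" -> e.id, "topic" -> e.topic)

  val typed: Event = Event(1L, "clicks")
  val untyped: UntypedRow = toUntyped(typed)

  // Typed access: the compiler checks the field name and the result type.
  // Writing typed.topik here would be a compile error.
  val topicTyped: String = typed.topic

  // Untyped access: the value is still there, but only known as Any.
  val topicUntyped: Option[Any] = untyped.get("topic")

  // A misspelled field name compiles and only shows up at runtime.
  val misspelled: Option[Any] = untyped.get("topik")
}
```

In this sketch nothing is lost at runtime (the schema still lists both fields and their types), but a reader of the untyped code can no longer see from the signature which fields are expected, which is the concern raised in the message below.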

Re: Structured streaming use of DataFrame vs Datasource

2016-06-16 Thread Cody Koeninger
I'm clear on what a type alias is. My question is more that moving from e.g. Dataset[T] to Dataset[Row] involves throwing away information. Reading through code that uses the DataFrame alias, it's a little hard for me to know when that's intentional or not. On Thu, Jun 16, 2016 at 2:50 PM, Tath

Re: Structured streaming use of DataFrame vs Datasource

2016-06-16 Thread Tathagata Das
There are different ways to view this. If it's confusing to think of the Source API as returning DataFrames, it's equivalent to think of it as returning a Dataset[Row], with DataFrame as just a shorthand. And DataFrame/Dataset[Row] is to Dataset[String] what Java's Array[Object] is to Array[String

Re: Structured streaming use of DataFrame vs Datasource

2016-06-16 Thread Cody Koeninger
Is this really an internal / external distinction? For a concrete example, Source.getBatch seems to be a public interface, but returns DataFrame. On Thu, Jun 16, 2016 at 1:42 PM, Tathagata Das wrote: > DataFrame is a type alias of Dataset[Row], so externally it seems like > Dataset is the main t

Re: Structured streaming use of DataFrame vs Datasource

2016-06-16 Thread Tathagata Das
DataFrame is a type alias of Dataset[Row], so externally it seems like Dataset is the main type and DataFrame is a derivative type. However, internally, since everything is processed as Rows, everything uses DataFrames. The typed classes used in a Dataset are internally converted to rows for processing.
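
The relationship described above can be sketched as a toy model in plain Scala. This is not Spark's implementation: Row, Encoder, Dataset, createDataset, and Person below are simplified, hypothetical stand-ins that only mirror the shape of the idea (an alias for the Row-typed dataset, with typed objects converted to rows for storage).

```scala
// Toy model only: simplified stand-ins, not Spark's actual classes.
object AliasSketch {
  // A generic row: positional values with no static field types.
  final case class Row(values: Vector[Any])

  // Hypothetical encoder: converts between a typed object and a Row.
  trait Encoder[T] {
    def toRow(value: T): Row
    def fromRow(row: Row): T
  }

  // The dataset always stores rows; T only matters at the API surface.
  final class Dataset[T](val rows: Vector[Row], enc: Encoder[T]) {
    def collect(): Vector[T] = rows.map(enc.fromRow)
    // Switching to the untyped view reuses the same rows unchanged.
    def toDF: DataFrame = new Dataset[Row](rows, rowEncoder)
  }

  // The alias: DataFrame is just another name for Dataset[Row].
  type DataFrame = Dataset[Row]

  val rowEncoder: Encoder[Row] = new Encoder[Row] {
    def toRow(value: Row): Row = value
    def fromRow(row: Row): Row = row
  }

  final case class Person(name: String, age: Int)

  val personEncoder: Encoder[Person] = new Encoder[Person] {
    def toRow(value: Person): Row = Row(Vector(value.name, value.age))
    def fromRow(row: Row): Person =
      Person(row.values(0).asInstanceOf[String], row.values(1).asInstanceOf[Int])
  }

  // Typed objects are converted to rows as soon as the dataset is built.
  def createDataset[T](items: Seq[T], enc: Encoder[T]): Dataset[T] =
    new Dataset[T](items.map(enc.toRow).toVector, enc)

  val people: Dataset[Person] = createDataset(Seq(Person("Ann", 30)), personEncoder)
  val frame: DataFrame = people.toDF
}
```

In this model, people and frame hold the same underlying rows; only the element type seen by the caller differs, which is why internal code that works purely on rows can be written against the alias.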

Re: Structured streaming use of DataFrame vs Datasource

2016-06-16 Thread Cody Koeninger
Sorry, meant DataFrame vs Dataset On Thu, Jun 16, 2016 at 12:53 PM, Cody Koeninger wrote: > Is there a principled reason why sql.streaming.* and > sql.execution.streaming.* are making extensive use of DataFrame > instead of Datasource? > > Or is that just a holdover from code written before the m