Hi Spark-users,
I came across a few sources which mentioned DataFrame can be more
efficient than Dataset. I can understand this is true because Dataset
allows functional transformations which Catalyst cannot look into and hence
cannot optimize well. But can DataFrame be more efficient than Dataset if
only relational transformations are used?
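For concreteness, here is a minimal spark-shell sketch (the case class, data and
column names are made up, not taken from those sources) contrasting the kind of
functional transformation Catalyst cannot look into with an equivalent relational
one that it can optimize:

import spark.implicits._

case class Item(name: String, price: Double)
val ds = Seq(Item("a", 1.0), Item("b", 20.0)).toDS()

// Functional (typed) transformation: the lambda is a black box to Catalyst,
// so the predicate cannot be analyzed or pushed down.
val typed = ds.filter(item => item.price > 10.0)

// Relational transformation: the Column expression is visible to Catalyst
// and is optimized like any other DataFrame predicate.
val relational = ds.filter($"price" > 10.0)

typed.explain()       // plan contains an opaque Scala function
relational.explain()  // plan contains an ordinary Catalyst Filter on the column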
On Fri, Apr 7, 2017 at 8:04 PM, Subhash Sriram wrote:
> Hi,
>
> We use monotonically_increasing_id() as well, but just cache the table
> first like Ankur suggested. With that method, we get the same keys in all
> derived tables.
>
Ah, okay, awesome. Let me give that a go.
>
> Thanks,
> Subhash
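For reference, a minimal sketch of the cache-first approach (table and column
names are made up):

import org.apache.spark.sql.functions.monotonically_increasing_id
import spark.implicits._

// Generate the key column once and cache it, so later recomputation of the
// lineage cannot reassign different ids.
val withKeys = spark.range(0, 1000).toDF("value")
  .withColumn("key", monotonically_increasing_id())
  .cache()
withKeys.count()   // materialize the cache

// Both derived tables are built from the cached data, so they see the same keys.
val evens = withKeys.filter($"value" % 2 === 0)
val odds  = withKeys.filter($"value" % 2 =!= 0)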
As far as I am aware, in newer Spark versions a DataFrame is the same as
Dataset[Row].
In fact, performance depends on so many factors that I am not sure such a
comparison makes sense.
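In Spark 2.x the sql package object literally defines the alias
type DataFrame = Dataset[Row], so for example (minimal spark-shell sketch):

import org.apache.spark.sql.{DataFrame, Dataset, Row}

def describe(df: DataFrame): Unit = df.printSchema()

// Compiles because DataFrame and Dataset[Row] are the same type.
val asDataset: Dataset[Row] = spark.range(3).toDF()
describe(asDataset)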
> On 8. Apr 2017, at 20:15, Shiyuan wrote:
>
> Hi Spark-users,
> I came across a few sources which mentioned DataFrame can be more
> efficient than Dataset.
It is because you use Dataset[X], but the actual computations are still done
in Dataset[Row] (so DataFrame). Well... the actual computations are done in
RDD[InternalRow], with Spark's internal types to represent String, Map, Seq,
structs, etc.
so for example if you do:
scala> val x: Dataset[(String,
Let me try that again. I left some crap at the bottom of my previous email as I
was editing it. Sorry about that. Here it goes:
It is because you use Dataset[X], but the actual computations are still done
in Dataset[Row] (so DataFrame). Well... the actual computations are done in
RDD[InternalRow], with Spark's internal types to represent String, Map, Seq,
structs, etc.
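So, for example (a minimal spark-shell sketch; the concrete tuple type and data
are just for illustration):

import org.apache.spark.sql.Dataset
import spark.implicits._

val x: Dataset[(String, Int)] = Seq(("alice", 30), ("bob", 25)).toDS()

// A typed transformation has to decode InternalRow into JVM objects,
// run the lambda, and encode the result back into InternalRow.
val mapped = x.map { case (name, age) => (name.toUpperCase, age + 1) }

mapped.explain()
// The physical plan shows SerializeFromObject / MapElements / DeserializeToObject,
// i.e. the work still runs over Spark's internal row representation.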
How would you use only relational transformations on a Dataset?
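I guess it would mean sticking to Column expressions and built-in functions on
the Dataset instead of lambdas, something like this rough sketch (names made up),
in which case the typed API is hardly being used:

import org.apache.spark.sql.Dataset
import org.apache.spark.sql.functions._
import spark.implicits._

case class Person(name: String, age: Int)
val people: Dataset[Person] = Seq(Person("alice", 30), Person("bob", 17)).toDS()

// Only relational operators with Column expressions, no Scala lambdas,
// so Catalyst sees (and can optimize) the whole plan.
val adults = people
  .where($"age" >= 18)
  .groupBy($"name")
  .agg(avg($"age").as("avg_age"))

adults.explain()   // no DeserializeToObject / SerializeFromObject in the plan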
On Sat, Apr 8, 2017 at 2:15 PM, Shiyuan wrote:
> Hi Spark-users,
> I came across a few sources which mentioned DataFrame can be more
> efficient than Dataset. I can understand this is true because Dataset
> allows functional transformations which Catalyst cannot look into and hence
> cannot optimize well.
Thanks Jules. It was helpful.
On Fri, Apr 7, 2017 at 8:32 PM, Jules Damji wrote:
> This blog shows how to write a custom sink:
> https://databricks.com/blog/2017/04/04/real-time-end-to-end-integration-with-apache-kafka-in-apache-sparks-structured-streaming.html
>
> Cheers
> Jules
>
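In case it is useful to others on the thread: one way to push Structured
Streaming output to a custom destination is a ForeachWriter. A rough sketch
(not taken from the blog; the sink here just prints, as a placeholder for a
real external system):

import org.apache.spark.sql.{ForeachWriter, Row}

class PrintlnSink extends ForeachWriter[Row] {
  // Open a connection to the external system here; return false to skip this partition.
  def open(partitionId: Long, version: Long): Boolean = true

  // Called for every output row.
  def process(record: Row): Unit = println(record)

  // Close connections / flush here.
  def close(errorOrNull: Throwable): Unit = ()
}

// Hypothetical usage with some streaming DataFrame streamDF:
// val query = streamDF.writeStream.foreach(new PrintlnSink).start()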
Links that were helpful to me while learning about the Spark source code:
- Articles with the "spark" tag on this blog:
http://hydronitrogen.com/tag/spark.html
- Jacek's "Mastering Apache Spark" GitBook:
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
Hope these help.
On Sat, Apr 8,