RE: Difference between dataset and dataframe

2019-02-18 Thread Lunagariya, Dhaval
It does for dataframe also. Please try this example:

df1 = spark.range(2, 1000, 2)
df2 = spark.range(2, 1000, 4)
step1 = df1.repartition(5)
step12 = df2.repartition(6)
step2 = step1.selectExpr("id * 5 as id")
step3 = step2.join(step12, ["id"])
step4 = step3.selectExpr("sum(id)")
step4.collect()
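A small continuation of the snippet above (a sketch, not part of the original mail; it assumes the step1..step4 variables just defined and an active SparkSession named spark, e.g. the one the pyspark shell provides). Calling explain() before collect() prints the physical plan, where the Exchange nodes from the repartitions and the join, and the whole-stage (Tungsten) code generation stages, are visible for the DataFrame version:

# Continuing from step1..step4 above; requires only an active SparkSession named `spark`.
print(step1.rdd.getNumPartitions())  # 5 -- repartition(5) did take effect on the DataFrame
step4.explain()                      # physical plan: Exchange (shuffle) nodes plus
                                     # WholeStageCodegen stages for the project/aggregate
print(step4.collect())               # [Row(sum(id)=...)]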

Recall: Difference between dataset and dataframe

2019-02-18 Thread Lunagariya, Dhaval
Lunagariya, Dhaval [CCC-OT] would like to recall the message, "Difference between dataset and dataframe".

Re: Difference between dataset and dataframe

2019-02-18 Thread Akhilanand
Thanks for the reply. But can you please tell me why dataframes are more performant than datasets? Any specifics would be helpful. Also, could you comment on the tungsten code gen part of my question? > On Feb 18, 2019, at 10:47 PM, Koert Kuipers wrote: > > in the api DataFrame is just Dataset[Row].
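Not an authoritative answer, but a minimal PySpark sketch of the usual explanation (the session setup and column names are mine, not from this thread): an expression built from Columns stays visible to Catalyst and gets whole-stage (Tungsten) code generation, while an opaque function, here a Python UDF standing in for a typed Dataset lambda, is a black box to the optimizer and forces rows to be serialized out to user code. That gap is the main reason DataFrame-style expressions usually beat typed Dataset operations.

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import LongType

spark = SparkSession.builder.master("local[*]").appName("codegen-demo").getOrCreate()
df = spark.range(0, 1000)

# Column expression: Catalyst can optimize it and Tungsten generates code for it.
# In the printed plan, the '*' prefix (Spark 2.x) marks whole-stage code generation.
df.select((F.col("id") * 5).alias("x")).explain()

# Opaque function: the optimizer only sees a black box; rows are handed out to Python.
# The plan shows a BatchEvalPython node instead of a code-generated Project.
times_five = F.udf(lambda i: i * 5, LongType())
df.select(times_five("id").alias("x")).explain()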

Re: Difference between dataset and dataframe

2019-02-18 Thread Koert Kuipers
In the API, DataFrame is just Dataset[Row], so this makes you think Dataset is the generic API. Interestingly enough, under the hood everything is really Dataset[Row], so DataFrame is really the "native" language for Spark SQL, not Dataset. I find DataFrame to be significantly more performant. In ge

Difference between dataset and dataframe

2019-02-18 Thread Akhilanand
Hello, I have recently been exploring datasets and dataframes. I would really appreciate it if someone could answer these questions: 1) Is there any difference in terms of performance when we use datasets over dataframes? Is it significant enough to choose one over the other? I do realise there would be

Streaming Tab in Kafka Structured Streaming

2019-02-18 Thread KhajaAsmath Mohammed
Hi, I am new to structured streaming but used DStreams a lot. The difference I saw in the Spark UI is the Streaming tab for DStreams. Is there a way to know how many records and batches were executed in structured streaming, and is there any option to see a streaming tab? Thanks, Asmath
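Structured streaming does not get the DStream-style Streaming tab in the Spark 2.x UI, but every StreamingQuery reports per-micro-batch metrics you can poll. A rough sketch below (the Kafka servers, topic and checkpoint path are placeholders, and the kafka source needs the spark-sql-kafka package on the classpath): lastProgress/recentProgress expose batchId, numInputRows and processing rates for each batch.

import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stream-progress").getOrCreate()

# Placeholder source: substitute your own bootstrap servers and topic.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "events")
          .load())

query = (events.writeStream
         .format("console")
         .option("checkpointLocation", "/tmp/stream-progress-chk")  # placeholder path
         .start())

time.sleep(30)  # let a few micro-batches run

# Each progress entry is a dict with batchId, numInputRows, processedRowsPerSecond, ...
for p in query.recentProgress:
    print(p["batchId"], p["numInputRows"])
print(query.lastProgress)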

Avoiding MUltiple GroupBy

2019-02-18 Thread Kumar sp
Can we avoid multiple group by? I have a million records and it's a performance concern. Below is my query; even with window functions I guess it is a performance hit. Can you please advise if there is a better alternative? I need to get the max number of equipments for that house for a list of dates
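Hard to say much without the actual query, but one pattern that often removes the extra groupBy is a single aggregation followed by a window over its result, as sketched below (house, date and equipment are assumed column names, not the real schema):

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("single-pass-agg").getOrCreate()

# Hypothetical input: one row per equipment reading.
df = spark.createDataFrame(
    [("h1", "2019-02-18", "e1"), ("h1", "2019-02-18", "e2"),
     ("h1", "2019-02-17", "e1"), ("h2", "2019-02-18", "e3")],
    ["house", "date", "equipment"])

# One shuffle for the aggregation ...
per_day = df.groupBy("house", "date").agg(
    F.countDistinct("equipment").alias("n_equipment"))

# ... then a window partitioned by house picks the maximum per house
# without a second groupBy over the data.
w = Window.partitionBy("house")
result = per_day.withColumn("max_equipment", F.max("n_equipment").over(w))
result.show()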

Re: [ANNOUNCE] Announcing Apache Spark 2.3.3

2019-02-18 Thread Wenchen Fan
Great job! On Mon, Feb 18, 2019 at 4:24 PM Hyukjin Kwon wrote: > Yay! Good job Takeshi! > > On Mon, 18 Feb 2019, 14:47 Takeshi Yamamuro >> We are happy to announce the availability of Spark 2.3.3! >> >> Apache Spark 2.3.3 is a maintenance release, based on the branch-2.3 >> maintenance branch o

Re: [ANNOUNCE] Announcing Apache Spark 2.3.3

2019-02-18 Thread Hyukjin Kwon
Yay! Good job Takeshi! On Mon, 18 Feb 2019, 14:47 Takeshi Yamamuro We are happy to announce the availability of Spark 2.3.3! > > Apache Spark 2.3.3 is a maintenance release, based on the branch-2.3 > maintenance branch of Spark. We strongly recommend all 2.3.x users to > upgrade to this stable re