This is a guess, but I would bet that most of the time went into loading the 
data. The second time around, there are many places where the data could be 
cached (either by Spark or even by the OS, if you are reading from a file).
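
To make that guess concrete, here is a toy model in plain Python (not Spark 
code). ToySource, LazyFrame, and the delay values are all invented for 
illustration. The only point is that both counts re-run the whole lineage, yet 
the second one can still be much cheaper because the read itself is served 
from a warm cache.

```python
import time


class ToySource:
    """Hypothetical data source: the first read is slow (cold), later reads are fast (warm)."""

    def __init__(self, rows, cold_delay=0.2, warm_delay=0.01):
        self.rows = rows
        self.cold_delay = cold_delay  # assumed cost of a cold read
        self.warm_delay = warm_delay  # assumed cost once a lower layer has cached the data
        self.reads = 0

    def read(self):
        # Simulate a cold read followed by reads served from a warm cache.
        time.sleep(self.cold_delay if self.reads == 0 else self.warm_delay)
        self.reads += 1
        return list(self.rows)


class LazyFrame:
    """Toy lazy dataset: stores a recipe and re-runs it on every action."""

    def __init__(self, compute):
        self._compute = compute

    def select_all(self):
        # Transformation: only builds a new recipe, nothing runs yet.
        return LazyFrame(lambda: [row for row in self._compute()])

    def count(self):
        # Action: executes the whole recipe, starting from the source.
        return len(self._compute())


def timed(action):
    """Run an action and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = action()
    return result, time.perf_counter() - start


source = ToySource(range(1000))
df_1 = LazyFrame(source.read)   # LOAD
df_2 = df_1.select_all()        # SELECT ALL FROM df_1

n1, t1 = timed(df_2.count)      # first COUNT(df_2)
n2, t2 = timed(df_2.count)      # second COUNT(df_2)

print(f"first count:  {n1} rows in {t1:.3f}s")
print(f"second count: {n2} rows in {t2:.3f}s")
print(f"source reads: {source.reads}")
```

In this sketch the source is read twice, so nothing was skipped, but the 
second read is cheap. If something similar is happening in your test, the 11 
seconds would not contradict lazy re-execution.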

-----Original Message-----
From: brccosta [mailto:brunocosta....@gmail.com] 
Sent: Friday, December 09, 2016 1:24 PM
To: user@spark.apache.org
Subject: About transformations

Dear guys,

We're running some tests to evaluate the behavior of transformations and 
actions in Spark with Spark SQL. In our tests, we first built a simple 
dataflow with 2 transformations and 1 action:

LOAD (result: df_1) > SELECT ALL FROM df_1 (result: df_2) > COUNT(df_2)

The execution time for this first dataflow was 10 seconds. Next, we added 
another action to our dataflow:

LOAD (result: df_1) > SELECT ALL FROM df_1 (result: df_2) > COUNT(df_2) >
COUNT(df_2)

Looking at the second version of the dataflow: since all transformations are 
lazy and are re-executed for each action (according to the documentation), the 
second count should require re-executing the two previous transformations 
(LOAD and SELECT ALL). Thus, we expected the second version of our dataflow to 
take around 20 seconds. However, the execution time was 11 seconds. 
Apparently, the results of the transformations required by the first count 
were cached by Spark for the second count.

Does anyone know what is happening here?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/About-transformations-tp28188.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

