Hi,
 
I'm new to Spark and trying to understand its inner workings in the scenarios
described below. I'm using PySpark on Spark 2.1.1.
 
spark.read.json():
 
I am executing the line spark.read.json('s3a://<bucket-name>/*.json') on a
cluster with three worker nodes (AWS m4.xlarge instances). The bucket contains
about 19,949 JSON files totaling roughly 4.4 GB. This line created three Spark
jobs: the first with 10,000 tasks, the second with 19,949 tasks, and the third
with 10,000 tasks. Each job has a single stage. Please refer to the attached
images:

job0.jpg
<http://apache-spark-user-list.1001560.n3.nabble.com/file/n28708/job0.jpg>
job1.jpg
<http://apache-spark-user-list.1001560.n3.nabble.com/file/n28708/job1.jpg>
job2.jpg
<http://apache-spark-user-list.1001560.n3.nabble.com/file/n28708/job2.jpg>
I was expecting it to create one job with 19,949 tasks. I'd like to understand
why there are three jobs instead of just one, and why reading JSON files
involves a map operation.
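
For reference, here is roughly what I'm running, along with a variant that
supplies an explicit schema, which I assume would skip a separate schema
inference pass over the files (the bucket name and the schema fields below are
just placeholders):

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, LongType

    spark = SparkSession.builder.appName("read-json-test").getOrCreate()

    # What I'm currently running: Spark has to scan the files to infer the
    # schema, which (I assume) accounts for the extra jobs before the read.
    files_df = spark.read.json("s3a://<bucket-name>/*.json")

    # Variant with an explicit schema (field names are placeholders) --
    # my understanding is that this should avoid the inference pass.
    schema = StructType([
        StructField("id", LongType(), True),
        StructField("payload", StringType(), True),
    ])
    files_df2 = spark.read.schema(schema).json("s3a://<bucket-name>/*.json")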
 
Caching and Count():
 
Once Spark reads the 19,949 JSON files into a DataFrame (let's call it
files_df), I call two operations: files_df.createOrReplaceTempView("files")
and files_df.cache(). I expected files_df.cache() to keep the entire DataFrame
in memory so that any subsequent operation would be faster. My next statement
is files_df.count(). This operation took a full 8.8 minutes, and it looks like
it read the files from S3 again to calculate the count. Please refer to the
attached count.jpg for reference:

count.jpg
<http://apache-spark-user-list.1001560.n3.nabble.com/file/n28708/count.jpg>

Why is this happening? If I call files_df.count() a second time, it comes back
within a few seconds. Can someone explain this?
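
For completeness, here is the exact sequence I'm running. My current
understanding (please correct me if I'm wrong) is that cache() is lazy, so the
first action both computes the result and populates the cache:

    # Register the DataFrame as a temp view and mark it for caching.
    files_df.createOrReplaceTempView("files")
    files_df.cache()          # lazy: nothing is materialized yet

    # First action: reads from S3, computes the count, and fills the cache
    # (this is the step that took ~8.8 minutes for me).
    print(files_df.count())

    # Second action: served from the cached data, returns within seconds.
    print(files_df.count())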
 
In general, I am looking for a good resource for learning about Spark
internals so I can understand what's happening under the hood.
 
Thanks in advance!
 
Ram


