Hi, I'm new to Spark and trying to understand its inner workings in the two scenarios below. I'm using PySpark on Spark 2.1.1.

spark.read.json():
I am executing the line spark.read.json('s3a://<bucket-name>/*.json') on a cluster with three worker nodes (AWS m4.xlarge instances). The bucket has about 19949 JSON files totaling about 4.4 GB. This single line created three Spark jobs: the first with 10000 tasks, the second with 19949 tasks, and the third with 10000 tasks. Each job has a single stage. Please refer to the attached images:

job0.jpg <http://apache-spark-user-list.1001560.n3.nabble.com/file/n28708/job0.jpg>
job1.jpg <http://apache-spark-user-list.1001560.n3.nabble.com/file/n28708/job1.jpg>
job2.jpg <http://apache-spark-user-list.1001560.n3.nabble.com/file/n28708/job2.jpg>

I was expecting a single job with 19949 tasks. I'd like to understand why there are three jobs instead of one, and why reading JSON files involves a map operation.

Caching and count():
Once Spark reads the 19949 JSON files into a DataFrame (let's call it files_df), I call files_df.createOrReplaceTempView("files") and files_df.cache(). I expected files_df.cache() to cache the entire DataFrame in memory, so that any subsequent operation would be faster. My next statement is files_df.count(). This operation took a full 8.8 minutes, and it looks like it read the files from S3 again to compute the count. Please refer to the attached count.jpg:

count.jpg <http://apache-spark-user-list.1001560.n3.nabble.com/file/n28708/count.jpg>

Why is this happening? If I call files_df.count() a second time, it comes back within a few seconds. Can someone explain this?

In general, I am looking for a good resource on Spark internals, so I can understand what's happening under the hood. A sketch of the exact sequence I am running is in the P.S. below.

Thanks in advance!
Ram
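P.S. For reference, here is a minimal, self-contained sketch of the sequence described above (PySpark, Spark 2.1.1). The app name is arbitrary and <bucket-name> stands in for my real bucket:

# Minimal sketch of the sequence described above.
from pyspark.sql import SparkSession

# In the pyspark shell `spark` already exists; building it here just
# makes the sketch self-contained. The app name is arbitrary.
spark = SparkSession.builder.appName("json-read-test").getOrCreate()

# Read ~19949 JSON files (~4.4 GB total) from S3. This single line is
# what produced the three jobs shown in job0.jpg, job1.jpg and job2.jpg.
files_df = spark.read.json("s3a://<bucket-name>/*.json")

files_df.createOrReplaceTempView("files")

# As I understand it, cache() only marks the DataFrame as cacheable;
# nothing is materialized until an action runs.
files_df.cache()

# First count(): took ~8.8 minutes and appears to re-read from S3
# (presumably this is also what populates the cache).
print(files_df.count())

# Second count(): returns within a few seconds.
print(files_df.count())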