Re: Questions regarding Jobs, Stages and Caching

2017-05-25 Thread Ram Navan
Thank you Steffen and Nicholas. I passed an explicit schema to spark.read.json() and the execution time dropped from the original 8 minutes to 4 minutes! I also see only two jobs created (instead of three when calling it with no schema). Please refer to attachments job0 and job2 from the
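
A minimal sketch of passing an explicit schema, for reference (the path and field names below are hypothetical):

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, LongType

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical fields; the real schema depends on your JSON documents.
    schema = StructType([
        StructField("id", LongType(), True),
        StructField("name", StringType(), True),
    ])

    # With an explicit schema, Spark skips the inference scan over the files.
    df = spark.read.json("s3://my-bucket/events/*.json", schema=schema)

On large directories that skipped inference pass is typically where the saved minutes come from.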

Re: Questions regarding Jobs, Stages and Caching

2017-05-25 Thread Nicholas Hakobian
If you do not specify a schema, then the json() function will attempt to determine the schema, which requires a full scan of the file. Any subsequent actions will again have to read in the data. See the documentation at: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.D
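
One pattern that follows from this, sketched here with hypothetical paths, is to run the inference once on a small representative sample and reuse the resulting schema for the full read:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Infer the schema once from a small sample file.
    schema = spark.read.json("s3://my-bucket/events/sample.json").schema

    # Reuse it; the full read no longer needs an inference scan.
    df = spark.read.json("s3://my-bucket/events/*.json", schema=schema)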

Re: Questions regarding Jobs, Stages and Caching

2017-05-25 Thread Steffen Schmitz
Hi Ram, spark.read.json() should be evaluated on the first call of .count(). The data should then be read into memory once and the rows counted. After this operation it will be in memory and access will be faster. If you add println statements between your function calls you should see
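
A PySpark equivalent of the println suggestion would be timing prints between the calls, along these lines (the path is hypothetical and exact timings will vary):

    import time
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    t0 = time.time()
    df = spark.read.json("s3://my-bucket/events/*.json")  # schema inference scan runs here
    print("read.json returned after %.1fs" % (time.time() - t0))

    t0 = time.time()
    df.cache()  # lazy: only marks the plan for caching
    print("cache returned after %.1fs" % (time.time() - t0))

    t0 = time.time()
    df.count()  # first action: reads the data and fills the cache
    print("first count took %.1fs" % (time.time() - t0))

    t0 = time.time()
    df.count()  # second action: served from the cached partitions
    print("second count took %.1fs" % (time.time() - t0))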

Re: Questions regarding Jobs, Stages and Caching

2017-05-25 Thread Ram Navan
Hi Steffen, Thanks for your response. Isn't spark.read.json() an action function? It reads the files from the source directory, infers the schema and creates a dataframe, right? dataframe.cache() prints out this schema as well. I am not sure why dataframe.count() will try to do the same thing again

Re: Questions regarding Jobs, Stages and Caching

2017-05-25 Thread Steffen Schmitz
Hi Ram, Regarding your caching question: the data frame is evaluated lazily. That means it isn’t cached directly when .cache() is invoked, but when the first action on it is called (in your case count). Only then is it loaded into memory and the rows counted, not on the call of .cache(). On the second
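
A quick way to observe this laziness, sketched with a hypothetical path, is to check the caching flags before and after the first action:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = spark.read.json("s3://my-bucket/events/*.json")

    df.cache()
    print(df.is_cached)     # True: the plan is marked for caching...
    print(df.storageLevel)  # ...at the requested level, but nothing is materialized yet

    df.count()  # first action: data is read and the cached partitions are built
    df.count()  # second action: answered from the cache, so it is much faster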