Thank you, Stephen and Nicholas.
I specified an explicit schema to spark.read.json() and the time to execute
this instruction dropped from the original 8 minutes to 4 minutes! I also see
only two jobs created (instead of three when calling it with no schema).
Please refer to attachments job0 and job2 from the
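For reference, a minimal sketch of passing an explicit schema in Python; the
input path and field names below are placeholders, not the actual data:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, LongType

    spark = SparkSession.builder.appName("json-with-schema").getOrCreate()

    # Hypothetical schema; the real field names and types must match the files.
    schema = StructType([
        StructField("id", LongType(), True),
        StructField("name", StringType(), True),
    ])

    # With an explicit schema, Spark skips the separate inference pass.
    df = spark.read.json("/path/to/json", schema=schema)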
If you do not specify a schema, the json() function will attempt to
infer the schema, which requires a full scan of the file. Any
subsequent action will then have to read the data in again. See the
documentation at:
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.D
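If providing a full schema is not practical, one alternative (depending on
your Spark version) is the samplingRatio option of the json() reader, which
infers the schema from a fraction of the records instead of a full scan; the
path here is a placeholder:

    # Infer the schema from roughly 10% of the records.
    df = spark.read.json("/path/to/json", samplingRatio=0.1)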
Hi Ram,
spark.read.json() should be evaluated on the first call of .count(). The
data should then be read into memory once and the rows counted. After this
operation it will be in memory and subsequent access will be faster.
If you add println statements between your function calls you should see
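Something along these lines, sketched in Python to match the pyspark docs
linked above (the path is a placeholder):

    import time

    t0 = time.time()
    df = spark.read.json("/path/to/json")  # a job may run here for schema inference
    print("read.json returned after %.1fs" % (time.time() - t0))

    t0 = time.time()
    df.cache()  # lazy: only marks the plan for caching
    print("cache returned after %.1fs" % (time.time() - t0))

    t0 = time.time()
    print("first count = %d after %.1fs" % (df.count(), time.time() - t0))
    # the first count reads the data, materializes the cache, and counts

    t0 = time.time()
    print("second count = %d after %.1fs" % (df.count(), time.time() - t0))
    # the second count is served from the cache and should be much faster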
Hi Steffen,
Thanks for your response.
Isn't spark.read.json() an action function? It reads the files from the
source directory, infers the schema, and creates a DataFrame, right?
dataframe.cache() prints out this schema as well. I am not sure why
dataframe.count() will try to do the same thing again.
Hi Ram,
Regarding your caching question:
The DataFrame is evaluated lazily. That means it isn't cached directly upon
invoking .cache(), but upon the first action called on it (in your case,
count()). Only then is it loaded into memory and the rows counted, not on
the call of .cache().
On the second
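A small sketch that makes this laziness visible (the path is a placeholder):

    df = spark.read.json("/path/to/json")
    df.cache()      # nothing is materialized yet; the plan is only marked
    df.explain()    # the physical plan now shows an InMemoryTableScan
    df.count()      # first action: the JSON is read, cached, and the rows counted
    df.count()      # later actions are answered from the cache, with no rescan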