Thank you, Stephen and Nicholas.
I specified an explicit schema to spark.read.json() and the time to execute
this instruction dropped from the original 8 minutes to 4 minutes! I also see
only two jobs created (instead of three when calling it with no schema).
Please refer to attachments job0 and job2 from the
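For reference, a minimal sketch of passing an explicit schema in Python; the
input path and field names below are placeholders, not the actual data:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, LongType

    spark = SparkSession.builder.appName("json-with-schema").getOrCreate()

    # Hypothetical schema; the real field names and types must match the files.
    schema = StructType([
        StructField("id", LongType(), True),
        StructField("name", StringType(), True),
    ])

    # With an explicit schema, Spark skips the separate inference pass.
    df = spark.read.json("/path/to/json", schema=schema)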
If you do not specify a schema, the json() function will attempt to
infer the schema, which requires a full scan of the file. Any
subsequent action will then have to read the data in again. See the
documentation at:
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.D
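If providing a full schema is not practical, one alternative (depending on
your Spark version) is the samplingRatio option of the json() reader, which
infers the schema from a fraction of the records instead of a full scan; the
path here is a placeholder:

    # Infer the schema from roughly 10% of the records.
    df = spark.read.json("/path/to/json", samplingRatio=0.1)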
Hi Ram,
spark.read.json() should be evaluated on the first call of .count(). The
data should then be read into memory once and the rows counted. After this
operation it will be in memory and subsequent access will be faster.
If you add println statements between your function calls you should see
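Something along these lines, sketched in Python to match the pyspark docs
linked above (the path is a placeholder):

    import time

    t0 = time.time()
    df = spark.read.json("/path/to/json")  # a job may run here for schema inference
    print("read.json returned after %.1fs" % (time.time() - t0))

    t0 = time.time()
    df.cache()  # lazy: only marks the plan for caching
    print("cache returned after %.1fs" % (time.time() - t0))

    t0 = time.time()
    print("first count = %d after %.1fs" % (df.count(), time.time() - t0))
    # the first count reads the data, materializes the cache, and counts

    t0 = time.time()
    print("second count = %d after %.1fs" % (df.count(), time.time() - t0))
    # the second count is served from the cache and should be much faster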
Hi Steffen,
Thanks for your response.
Isn't spark.read.json() an action function? It reads the files from the
source directory, infers the schema, and creates a DataFrame, right?
dataframe.cache() prints out this schema as well. I am not sure why
dataframe.count() will try to do the same thing again.
Hi Ram,
Regarding your caching question:
The DataFrame is evaluated lazily. That means it isn't cached directly upon
invoking .cache(), but upon the first action called on it (in your case,
count()). Only then is it loaded into memory and the rows counted, not on
the call of .cache().
On the second
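A small sketch that makes this laziness visible (the path is a placeholder):

    df = spark.read.json("/path/to/json")
    df.cache()      # nothing is materialized yet; the plan is only marked
    df.explain()    # the physical plan now shows an InMemoryTableScan
    df.count()      # first action: the JSON is read, cached, and the rows counted
    df.count()      # later actions are answered from the cache, with no rescan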