Thank you, Stephen and Nicholas.

I specified the schema to spark.read.json(), and the time to execute this
instruction dropped from the original 8 minutes to 4 minutes! I also see
only two jobs created (instead of three when calling with no schema).
Please refer to attachments job0 and job2.
> My next statement is files_df.count(). This operation took an entire 8.8
> minutes, and it looks like it read the files again from S3 and calculated
> the count. Please refer to the attached count.jpg file for reference.
> <http://apache-spark-user-list.1001560.n3.nabble.com/file/n28708/count.jpg>
> Why is this happening? If I call files_df.count() for the second time, it
> comes back fast, within a few seconds. Can someone explain this?
>
> In general, I am looking for a good source to learn about Spark Internals
> and try to understand what's happening beneath the hood.
>
> Thanks in advance!
>
> Ram
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Questions-regarding-Jobs-Stages-and-Caching-tp28708.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.