Actually, when I did a simple test on parquet
(spark.read.parquet("somefile").cache().count()), the UI showed me that the
entire file is cached. Is this just a fluke?
In any case, I believe the question is still valid: how to make sure a DataFrame
is cached.
Consider for example a case where we r
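For reference, a minimal sketch of one way to do that: mark the DataFrame for
caching, run an action that computes every partition rather than relying on
count() alone, and then inspect the storage level and the Storage tab in the UI.
The path, master, and app name below are placeholders for the example, not
anything from the original test:

  import org.apache.spark.sql.SparkSession

  object CacheCheck {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder()
        .appName("cache-check")
        .master("local[*]")
        .getOrCreate()

      // Mark the DataFrame for caching; nothing is materialized yet.
      val df = spark.read.parquet("somefile").cache()

      // Run an action over the full plan so every partition is computed
      // and written into the cache.
      df.rdd.count()

      // storageLevel (Spark 2.1+) confirms the DataFrame is marked as cached;
      // the Storage tab in the UI shows how much of it is actually materialized.
      println(s"storage level: ${df.storageLevel}")

      spark.stop()
    }
  }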
I think your example relates to scheduling; e.g., it makes sense to use Oozie or
something similar to fetch the data at specific points in time.
I am also not a big fan of caching everything. In a multi-user cluster with a
lot of applications, you waste a lot of resources and make everybody less
efficient.
I am not saying you should cache everything, just that it is a valid use case.
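To illustrate that trade-off, a rough sketch of caching only what is actually
reused and releasing it afterwards; the path, column names, and storage level
below are assumptions made up for the example, not a recommendation for any
particular workload:

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.storage.StorageLevel

  object SelectiveCache {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder()
        .appName("selective-cache")
        .master("local[*]")
        .getOrCreate()

      val events = spark.read.parquet("somefile")

      // Cache only the projection that several downstream queries share,
      // spilling to disk instead of failing when executor memory is tight.
      val hot = events.select("user_id", "event_type")
        .persist(StorageLevel.MEMORY_AND_DISK)

      val byUser = hot.groupBy("user_id").count()
      val byType = hot.groupBy("event_type").count()
      byUser.show()
      byType.show()

      // Free the cached blocks for the other applications on the cluster.
      hot.unpersist()
      spark.stop()
    }
  }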
From: Jörn Franke [via Apache Spark Developers List]
[mailto:ml-node+s1001551n21026...@n3.nabble.com]
Sent: Sunday, February 19, 2017 12:13 PM
To: Mendelson, Assaf
Subject: Re: Will .count() always trigger an evaluati