RE: Will .count() always trigger an evaluation of each row?

2017-02-19 Thread assaf.mendelson
Actually, when I ran a simple test on parquet (spark.read.parquet("somefile").cache().count()), the UI showed me that the entire file was cached. Is this just a fluke? In any case, I believe the question is still valid: how do you make sure a dataframe is cached? Consider, for example, a case where we r…
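The behavior under discussion follows from Spark's lazy evaluation: `cache()` only marks a DataFrame for caching, and a subsequent action such as `count()` must scan every row, which is what materializes the cache. A minimal pure-Python analogy of that contract (no Spark required; the `LazyDataset` class and all names here are illustrative, not Spark APIs):

```python
# Pure-Python sketch of Spark's lazy-evaluation + caching contract.
# cache() is lazy (marks only); the first action (count) scans all rows
# and materializes the cache; later actions reuse the cached rows.

class LazyDataset:
    def __init__(self, make_rows):
        self._make_rows = make_rows     # source scan, re-run per action if uncached
        self._cache_requested = False
        self._cached = None             # filled by cache() + first action

    def cache(self):
        # Like DataFrame.cache(): marks the dataset, computes nothing yet.
        self._cache_requested = True
        return self

    def _rows(self):
        if self._cached is not None:
            return self._cached         # served from cache, no source scan
        rows = list(self._make_rows()) if self._cache_requested else self._make_rows()
        if self._cache_requested:
            self._cached = rows
        return rows

    def count(self):
        # An action: forces a full pass over every row.
        return sum(1 for _ in self._rows())

scans = []
def make_rows():
    scans.append(1)                     # track how often the source is scanned
    return iter(range(5))

ds = LazyDataset(make_rows).cache()
assert len(scans) == 0                  # cache() alone computed nothing
assert ds.count() == 5                  # first action scans source, fills cache
assert ds.count() == 5                  # second action served from cache
assert len(scans) == 1                  # source scanned exactly once
```

This mirrors why the UI shows the whole file as cached after one `count()`: the action cannot complete without touching every row, so the full dataset passes through the cache.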

Re: Will .count() always trigger an evaluation of each row?

2017-02-19 Thread Jörn Franke
I think your example relates to scheduling; e.g., it makes sense to use Oozie or similar to fetch the data at specific points in time. I am also not a big fan of caching everything. In a multi-user cluster with a lot of applications, you waste a lot of resources and make everybody less efficient.

RE: Will .count() always trigger an evaluation of each row?

2017-02-19 Thread assaf.mendelson
I am not saying you should cache everything, just that it is a valid use case.