The cache() method on the DataFrame API caught me out. Having learnt that DataFrames are built on RDDs and that RDDs are immutable, when I saw the statement df.cache() in our codebase I thought ‘This must be a bug: the result is not assigned, so the statement will have no effect.’
However, I’ve since learnt that the cache method actually mutates the DataFrame object*, so the statement was valid after all. I understand that the underlying user data is immutable, but doesn’t mutating the DataFrame object make the API a little inconsistent and harder to reason about?

Regards,
Chris

* As do the persist and rdd.setName methods. I expect there are others.
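
P.S. To show the distinction I mean, here is a toy sketch in plain Python. It is not Spark code: ToyFrame and its methods are names I made up for illustration, and it only assumes the behaviour described above (transformations return a new object, while cache() changes bookkeeping on the existing object and returns it).

```python
class ToyFrame:
    """Hypothetical stand-in for a DataFrame: immutable rows plus mutable cache metadata."""

    def __init__(self, rows):
        self._rows = tuple(rows)  # the "user data" never changes after construction
        self.is_cached = False    # bookkeeping about this handle, not about the data

    def filter(self, predicate):
        # Transformation: leaves self untouched and returns a NEW ToyFrame.
        return ToyFrame(r for r in self._rows if predicate(r))

    def cache(self):
        # Mutates metadata on this object and returns self,
        # so the caller does not have to assign the result.
        self.is_cached = True
        return self

    def collect(self):
        return list(self._rows)


df = ToyFrame([1, 2, 3, 4])

df.filter(lambda x: x % 2 == 0)          # result discarded: df is unchanged
evens = df.filter(lambda x: x % 2 == 0)  # result kept: a separate object

df.cache()                               # result discarded, yet df is now marked as cached
```

In this sketch the unassigned filter call really is a no-op, whereas the unassigned cache call is not, which is the asymmetry that surprised me.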