The cache() method on the DataFrame API caught me out.

Having learnt that DataFrames are built on RDDs and that RDDs are
immutable, when I saw the statement df.cache() in our codebase I thought
‘This must be a bug: the result is not assigned, so the statement will have no
effect.’

However, I’ve since learnt that the cache method actually mutates the
DataFrame object*. The statement was valid after all.
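
To make the distinction concrete, here is a toy sketch in plain Python. It is
not Spark code and not how Spark implements caching; the ToyFrame class and
its methods are made up purely for illustration. It only models the behaviour
I described: a transformation hands back a new object, whereas cache() changes
a flag on the existing object and returns that same object, which is why the
unassigned call still has an effect.

```python
class ToyFrame:
    """Hypothetical stand-in for a DataFrame handle (illustration only)."""

    def __init__(self, rows):
        self._rows = tuple(rows)   # the data itself is never modified
        self.is_cached = False     # bookkeeping on the handle, which can change

    def filter(self, predicate):
        # A transformation: returns a new handle and leaves this one untouched.
        return ToyFrame(r for r in self._rows if predicate(r))

    def cache(self):
        # Changes only the handle's bookkeeping, then returns the same object,
        # so the result does not need to be assigned.
        self.is_cached = True
        return self

    def collect(self):
        return list(self._rows)


df = ToyFrame([1, 2, 3, 4])
evens = df.filter(lambda x: x % 2 == 0)  # new object, must be assigned
df.cache()                               # unassigned, yet df is now flagged
```

In this sketch df.is_cached is True after the last line, evens.is_cached is
still False, and the rows held by either object are unchanged.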

I understand that the underlying user data is immutable, but doesn’t
mutating the DataFrame object make the API a little inconsistent and harder
to reason about?

Regards

Chris


* (as do the persist and rdd.setName methods; I expect there are others)
