I'm still wrapping my head around the fact that the data backing an RDD is
immutable, since an RDD may need to be reconstructed from its lineage at any
point. In the context of clustering there are many iterations in which an RDD
may need to change (cluster assignments, for instance) based on a broadcast
variable holding a list of centroids, each of which is an object that in turn
contains a list of features. So immutability is all well and good for the
purposes of being
able to replay a lineage. But now I'm wondering: during each iteration, in
which this RDD goes through many transformations, it will be transformed based
on that broadcast variable of centroids, and the centroids are mutable. How
would Spark replay the lineage in this case? Does a dependency on mutable
variables mess up the whole lineage thing?
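
To make the question concrete, here is a minimal sketch of the situation I have
in mind, written in plain Python with no Spark involved. Everything in it
(`nearest`, `make_step`, the sample points) is a hypothetical stand-in: a
closure plays the role of one lineage step, and the centroids it captures play
the role of the broadcast value. It only illustrates my worry, not how Spark
itself behaves.

```python
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, ...]


def nearest(point: Point, centroids: Sequence[Point]) -> int:
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def make_step(centroids: Sequence[Point]) -> Callable[[Sequence[Point]], List[int]]:
    """Hypothetical stand-in for one lineage step that reads the centroids."""
    return lambda points: [nearest(p, centroids) for p in points]


points = ((0.0, 0.0), (1.0, 1.0), (9.0, 9.0), (10.0, 10.0))

# Case 1: the step captures an immutable snapshot (a tuple of tuples).
snapshot = ((0.0, 0.0), (10.0, 10.0))
step = make_step(snapshot)
first_run = step(points)
# The next iteration rebinds the name to a NEW snapshot; the old step still
# holds the old one, so replaying it reproduces the same assignments.
snapshot = ((100.0, 100.0), (200.0, 200.0))
replay = step(points)

# Case 2: the step captures a list that is later mutated in place.
shared = [(0.0, 0.0), (10.0, 10.0)]
leaky_step = make_step(shared)
leaky_first = leaky_step(points)
shared[0], shared[1] = shared[1], shared[0]  # in-place update of the centroids
leaky_replay = leaky_step(points)

print(first_run == replay)          # replay agrees with the original run
print(leaky_first == leaky_replay)  # replay disagrees after the mutation
```

In the first case the replay matches the original run because the step never
sees the update; in the second the replay silently changes. The second case is
what I am afraid of when the centroids are mutable objects.
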
Any help appreciated. Just trying to wrap my head around using Spark correctly.
I will say there does seem to be a common misconception that Spark RDDs are
in-memory arrays, but perhaps there is a reason for that. Perhaps in some cases
an option for mutability, with an exception on failure, is exactly what is
needed for a one-off algorithm that doesn't necessarily need resiliency. Just
a thought.