I agree - it is very easy for users to shoot themselves in the foot if we
don't put in the safeguards, or mislead them by giving them the impression
that operations are cheap. A DataFrame in Spark isn't like a single-node
in-memory data structure.
Note that the repr string work is very different.
In the case of len, I think we should examine how Python handles iterators and
generators: https://docs.python.org/3/library/collections.abc.html
Iterators have __iter__ and __next__ but are not Sized, so they don't have
__len__. If you ask for the len() of a generator (like len(x for x in
range(10))), you get a TypeError.
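For reference, this is how that plays out in plain Python (stdlib only, nothing Spark-specific):

```python
from collections.abc import Iterator, Sized

gen = (x for x in range(10))

print(isinstance(gen, Iterator))  # True  -- it has __iter__ and __next__
print(isinstance(gen, Sized))     # False -- it does not define __len__

try:
    len(gen)
except TypeError as err:
    print(err)  # object of type 'generator' has no len()
```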
Ok, so let's say you made a Spark DataFrame and you call len() on it -- what do
you expect to happen?
Personally, I expect Spark to evaluate the DataFrame; this is what happens with
collections and even iterables.
The interplay with cache is a bit strange, but presumably if you've marked
your DataFrame for caching, you expect some action to materialize it eventually.
> (2) If the method forces evaluation and that matches the most obvious way it
would be implemented, then we should add it with a note in the docstring
I am not sure about this, because forcing evaluation can have side effects. For
example, df.count() can realize a cache, so if we implement __len__ on top of
count(), a plain len(df) call would trigger that side effect.
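To make the concern concrete, here is a rough sketch of what that naive wiring would look like (purely illustrative, not a proposal for the actual implementation):

```python
from pyspark.sql import SparkSession, DataFrame

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000).cache()  # marked for caching, not yet materialized

# Hypothetical: delegate __len__ straight to count()
DataFrame.__len__ = DataFrame.count

# This innocent-looking call now runs a full job *and* realizes the cache.
n = len(df)
```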
That all sounds reasonable, but I think in the case of 4 (and maybe also 3) I
would rather see it implemented to raise an error message that explains
what's going on and suggests the explicit operation that would do the most
equivalent thing. And perhaps raise a warning (using the warnings module) for
the cases where we do force evaluation; see the sketch below.
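Something along these lines is what I have in mind -- a rough sketch only, with hypothetical helper names:

```python
import warnings
from pyspark.sql import DataFrame

# Option for cases 3/4: refuse, and point at the explicit equivalent.
def _raise_on_len(self):
    raise TypeError(
        "len() is not supported on a Spark DataFrame because it would "
        "trigger a full distributed job; call df.count() explicitly if "
        "that is really what you want."
    )

# Option for the force-evaluation case: allow it, but warn loudly.
def _warn_then_count(self):
    warnings.warn(
        "len(df) forces evaluation of the whole DataFrame and may realize "
        "a pending cache; consider calling df.count() explicitly.",
        UserWarning,
        stacklevel=2,
    )
    return self.count()

# Pick one behaviour; both are shown here only for illustration.
DataFrame.__len__ = _raise_on_len
```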