Re: Helper methods for PySpark discussion

2018-10-28 Thread Reynold Xin
I agree - it is very easy for users to shoot themselves in the foot if we don't put in the safeguards, or to mislead them by giving the impression that these operations are cheap. A DataFrame in Spark isn't like a single-node in-memory data structure. Note that the repr string work is very different. …

Re: Helper methods for PySpark discussion

2018-10-27 Thread Leif Walsh
In the case of len, I think we should examine how Python does iterators and generators: https://docs.python.org/3/library/collections.abc.html Iterators have __iter__ and __next__ but are not Sized, so they don’t have __len__. If you ask for the len() of a generator (like len(x for x in range(10))), …
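The precedent Leif points to can be checked directly in a Python shell: a generator is an Iterator but not Sized, and len() refuses it outright instead of silently consuming the stream to count it.

```python
# Generators implement __iter__ and __next__ but not __len__,
# so they register as Iterator but not as Sized, and len()
# raises TypeError rather than exhausting the generator.
from collections.abc import Iterator, Sized

gen = (x for x in range(10))

print(isinstance(gen, Iterator))  # True
print(isinstance(gen, Sized))     # False -- no __len__

try:
    len(gen)
except TypeError as e:
    print("len() refused:", e)
```

The standard library's position here is that len() must be a cheap, side-effect-free query; anything that would require work (or consume the data) to answer is left to an explicit operation.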

Re: Helper methods for PySpark discussion

2018-10-26 Thread Holden Karau
Ok so let's say you made a Spark dataframe and you call length -- what do you expect to happen? Personally I expect Spark to evaluate the dataframe; this is what happens with collections and even iterables. The interplay with cache is a bit strange, but presumably if you've marked your DataFrame for …
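The semantics Holden describes can be sketched with a toy stand-in for a lazy DataFrame: len() simply delegates to count(), so asking for the length runs a (potentially expensive) job. The class name and internals below are hypothetical illustrations, not PySpark's actual implementation.

```python
# Illustrative sketch only: a minimal stand-in for a lazy frame,
# showing len() forcing evaluation by delegating to count().
class LazyFrame:
    def __init__(self, rows):
        self._plan = rows        # pretend this is an unevaluated plan
        self.evaluations = 0     # track how often we "run the job"

    def count(self):
        self.evaluations += 1    # each count() triggers a full evaluation
        return sum(1 for _ in self._plan)

    def __len__(self):
        return self.count()      # len(df) behaves like df.count()

df = LazyFrame([1, 2, 3])
print(len(df))          # 3 -- but this ran a full evaluation
print(df.evaluations)   # 1
```

The cost is invisible at the call site: nothing about `len(df)` tells the user a cluster-wide job just ran, which is exactly the trade-off the thread is weighing.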

Re: Helper methods for PySpark discussion

2018-10-26 Thread Li Jin
> (2) If the method forces evaluation and this matches the most obvious way it would be implemented, then we should add it with a note in the docstring

I am not sure about this, because forcing evaluation could have side effects. For example, df.count() can realize a cache, and if we implement …
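Li Jin's concern can be made concrete with a toy model (names and internals are hypothetical, not Spark's): once a frame has been marked for caching, count() is no longer a pure query -- it has the side effect of materializing the cache, and a `__len__` built on it would inherit that side effect.

```python
# Sketch of the side-effect concern: caching is lazy, and the
# first count() after cache() materializes the cached data.
class CachedFrame:
    def __init__(self, rows):
        self._rows = rows
        self._cache = None
        self._cache_requested = False

    def cache(self):
        self._cache_requested = True   # lazy: nothing materialized yet
        return self

    def count(self):
        if self._cache_requested and self._cache is None:
            self._cache = list(self._rows)  # side effect: realize the cache
        data = self._cache if self._cache is not None else self._rows
        return sum(1 for _ in data)

df = CachedFrame(range(5)).cache()
print(df._cache is None)   # True -- cache not yet materialized
print(df.count())          # 5
print(df._cache is None)   # False -- count() realized the cache
```

So a user who writes `len(df)` expecting a cheap inspection could unknowingly trigger cache materialization, which is a meaningful state change, not just a slow read.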

Re: Helper methods for PySpark discussion

2018-10-26 Thread Leif Walsh
That all sounds reasonable, but I think in the case of 4, and maybe also 3, I would rather see it implemented to raise an error whose message explains what’s going on and suggests the explicit operation that would do the most equivalent thing. And perhaps raise a warning (using the warnings module) for …
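One way Leif's suggestion could look in code (an illustrative sketch, not PySpark's actual API): `__len__` raises a TypeError that points the user at the explicit, visibly-expensive operation, and the warnings module covers the softer cases. The class and the `head_rows` helper are hypothetical.

```python
import warnings

class DataFrameLike:
    def __len__(self):
        # Refuse the implicit operation, but explain why and name
        # the explicit equivalent, as suggested in the thread.
        raise TypeError(
            "len() is not supported on a distributed DataFrame because it "
            "forces a full evaluation; call .count() explicitly instead."
        )

    def head_rows(self, n=5):
        # Hypothetical helper for the milder case: allow the operation
        # but warn that it triggers a job.
        warnings.warn(
            "head_rows() triggers a Spark job; this may be expensive.",
            stacklevel=2,
        )
        return []

df = DataFrameLike()
try:
    len(df)
except TypeError as e:
    print(e)
```

This keeps the API honest: the error message teaches the user the right call rather than silently running an expensive job or failing with an opaque `TypeError: object of type 'DataFrame' has no len()`.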