Re: Will .count() always trigger an evaluation of each row?

2017-02-18 Thread Sean Owen
I think the right answer is "don't do that", but if you really had to, you could trigger a Dataset operation that does nothing per partition. I presume that would be more reliable, because in practice the whole partition has to be computed to make it available. Or, go so far as to loop over every elem
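A minimal sketch of the second idea above (looping over every element), assuming PySpark. `drain` and `materialize` are hypothetical helper names, not part of Spark; `DataFrame.foreachPartition` is the only Spark call used. The loop consumes each partition's iterator explicitly, so the sketch does not depend on a no-op function being enough to force the computation.

```python
from typing import Any, Iterator


def drain(rows: Iterator[Any]) -> None:
    """Pull every row out of one partition's iterator and discard it."""
    for _ in rows:
        pass


def materialize(df) -> None:
    """Force every row of `df` to be computed (hypothetical helper).

    `df` is assumed to be a pyspark.sql.DataFrame. foreachPartition is an
    action, so it runs a job that calls `drain` once per partition.
    """
    df.foreachPartition(drain)
```

With a live session this would be called as `materialize(spark.range(1000))`, where `spark` is an existing `SparkSession`. Nothing is returned to the driver; the job runs only for its side effect of computing the rows.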

RE: Design document - MLlib's statistical package for DataFrames

2017-02-18 Thread Pritish Nawlakhe
Hi, would anyone know how to unsubscribe from this list? Thank you!! Regards, Pritish. Nirvana International Inc. Big Data, Hadoop, Oracle EBS and IT Solutions VA - SWaM, MD - MBE Certified Company prit...@nirvana-international.com http://www.nirvana-international.com Twitter: @nirvanainternat

Re: Design document - MLlib's statistical package for DataFrames

2017-02-18 Thread Holden Karau
It's at the bottom of every message (although some mail clients hide it for some reason), send an email to dev-unsubscr...@spark.apache.org

Re: Will .count() always trigger an evaluation of each row?

2017-02-18 Thread Matei Zaharia
Count behaves differently on DataFrames and Datasets than on RDDs. On RDDs, it always evaluates everything, but on a DataFrame/Dataset, it turns into the equivalent of "select count(*) from ..." in SQL, which can be done without scanning the data for some data formats (e.g. Parquet). On the other hand thoug
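The two counting paths described above can be put side by side. This is a rough sketch assuming PySpark; the function names are hypothetical, and only `DataFrame.count()`, `DataFrame.rdd` and `RDD.count()` are real Spark API. Whether the aggregate path actually avoids scanning the data depends on the source format and the query plan, as the message notes.

```python
def count_as_aggregate(df) -> int:
    """Count through the DataFrame API (hypothetical helper).

    Spark plans this as an aggregate, like SELECT count(*), so it may
    not need to evaluate every column of every row.
    """
    return df.count()


def count_as_rdd(df) -> int:
    """Count through the RDD API (hypothetical helper).

    `df.rdd` exposes the data as an RDD of Row objects; counting an RDD
    iterates over every element.
    """
    return df.rdd.count()
```

Both functions return the same number for the same `df`; the difference is how much work Spark does to get it.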