Re: Will .count() always trigger an evaluation of each row?

2017-02-18 Thread Matei Zaharia
Count behaves differently on DataFrames and Datasets than on RDDs. On RDDs, it always evaluates everything, but on a DataFrame/Dataset it turns into the equivalent of "select count(*) from ..." in SQL, which can be done without scanning the data for some data formats (e.g. Parquet). On the other hand thoug

Re: Design document - MLlib's statistical package for DataFrames

2017-02-18 Thread Holden Karau
It's at the bottom of every message (although some mail clients hide it for some reason): send an email to dev-unsubscr...@spark.apache.org On Sat, Feb 18, 2017 at 11:07 AM Pritish Nawlakhe <prit...@nirvana-international.com> wrote: > Hi > > Would anyone know how to unsubscribe to this list? > >

RE: Design document - MLlib's statistical package for DataFrames

2017-02-18 Thread Pritish Nawlakhe
Hi, would anyone know how to unsubscribe to this list? Thank you!! Regards, Pritish. Nirvana International Inc. Big Data, Hadoop, Oracle EBS and IT Solutions. VA - SWaM, MD - MBE Certified Company. prit...@nirvana-international.com http://www.nirvana-international.com Twitter: @nirvanainternat

Re: Will .count() always trigger an evaluation of each row?

2017-02-18 Thread Sean Owen
I think the right answer is "don't do that", but if you really had to, you could trigger a Dataset operation that does nothing per partition. I presume that would be more reliable, because in practice the whole partition has to be computed to make it available. Or, go so far as to loop over every elem