As Keith said, it depends on what you want to do with your data. From a pipelining perspective the general flow (YMMV) is:
Load dataset(s) -> Transform and/or Join -> Aggregate -> Write dataset

Each step in the pipeline does something distinct with the data. The end step is usually loading the final data into something that can display or query it (i.e. a DBMS of some sort); that's where you'd start doing your queries etc. IMO there's generally no good reason to be collecting your data on the driver, except for testing, validation, or exploratory work.

I hope that helps.

Gary Lucas

On 4 April 2017 at 12:12, Keith Chapman <keithgchap...@gmail.com> wrote:

> As Paul said, it really depends on what you want to do with your data;
> perhaps writing it to a file would be a better option. But again, it
> depends on what you want to do with the data you collect.
>
> Regards,
> Keith.
>
> http://keith-chapman.com
>
> On Tue, Apr 4, 2017 at 7:38 AM, Eike von Seggern <eike.segg...@sevenval.com> wrote:
>
>> Hi,
>>
>> depending on what you're trying to achieve, `RDD.toLocalIterator()`
>> might help you.
>>
>> Best
>>
>> Eike
>>
>> 2017-03-29 21:00 GMT+02:00 szep.laszlo.it <szep.laszlo...@gmail.com>:
>>
>>> Hi,
>>>
>>> after I created a dataset
>>>
>>> Dataset<Row> df = sqlContext.sql("query");
>>>
>>> I need the result values, so I call:
>>>
>>> List<Row> list = df.collectAsList();
>>>
>>> But it's very slow when I work with large datasets (20-30 million
>>> records). I know that the result isn't produced in the driver app;
>>> that's why it takes a long time: collectAsList() collects all the
>>> data from the worker nodes.
>>>
>>> But then what is the right way to get the result values? Is there
>>> another way to iterate over the rows of a result dataset, or to get
>>> the values? Can anyone post a small, working example?
>>>
>>> Thanks & Regards,
>>> Laszlo Szep
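Gary's load -> transform -> aggregate -> write flow can be sketched in plain Java. This is only an illustrative stand-in (java.util.stream instead of Spark, and the class name, sample data, and "key,value" format are all made up for the example); in real Spark you would end the pipeline with something like `df.write()` to a sink rather than `collectAsList()`, or use the `toLocalIterator()` that Eike mentions if you must pull rows to the driver incrementally.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Plain-Java stand-in for the pipeline flow described above:
// load -> transform -> aggregate -> write. Illustrative only.
public class PipelineSketch {

    // Transform + aggregate: parse "key,value" lines, group by key, sum values.
    static Map<String, Integer> aggregate(List<String> lines) {
        return lines.stream()
            .map(s -> s.split(","))                                    // transform: parse each line
            .collect(Collectors.groupingBy(
                parts -> parts[0],                                     // group by key
                Collectors.summingInt(p -> Integer.parseInt(p[1])))); // aggregate: sum per key
    }

    public static void main(String[] args) {
        // "Load": a tiny in-memory dataset; in Spark this would be spark.read / spark.sql.
        List<String> lines = Arrays.asList("a,1", "b,2", "a,3");
        Map<String, Integer> totals = aggregate(lines);
        // "Write": print here; in Spark, write to a queryable sink (file, table, DBMS).
        System.out.println(totals);
    }
}
```

The point of the sketch is the shape, not the API: the final step hands the aggregated result to something built for querying, so the driver never has to hold 20-30 million raw rows in memory.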