I see, thanks a lot for the clarifications.
You can use `Dataset.limit`, which returns a new `Dataset` instead of an
Array. Then you can transform it and still get the top-k optimization from
Spark.
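A minimal sketch of this suggestion (the `events` data and `score` column are hypothetical, not from the thread):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.desc

val spark = SparkSession.builder().appName("limit-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Illustrative data only.
val events = Seq(("a", 3L), ("b", 9L), ("c", 5L)).toDF("id", "score")

// orderBy + limit returns a new DataFrame, not an Array, so the
// top-k result stays in the lazy, optimized query plan and can be
// transformed further before (or instead of) collecting.
val top2 = events.orderBy(desc("score")).limit(2)
val doubled = top2.withColumn("score2", $"score" * 2)
doubled.show()
```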
On Wed, Jan 31, 2018 at 3:39 PM, Yacine Mazari wrote:
Thanks for the quick reply and explanation @rxin.
So if one does not want to collect()/take() but wants the top k as a Dataset
for further transformations, there is no optimized API; that's why I am
suggesting adding this "top()" as a public method.
If that sounds like a good idea, I will open a
For the DataFrame/Dataset API, the optimizer actually rewrites an orderBy
followed by a take into a priority-queue-based top-k implementation.
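One way to see this rewrite is to inspect the physical plan (a sketch; the `df` data here is hypothetical, and the exact operator name may vary across Spark versions):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.desc

val spark = SparkSession.builder().appName("plan-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Illustrative data only.
val df = Seq((1, "x"), (2, "y"), (3, "z")).toDF("k", "v")

// Sort + Limit is planned as a single top-k physical operator, so the
// full dataset is never globally sorted just to take k rows.
df.orderBy(desc("k")).limit(2).explain()
// The physical plan typically contains a TakeOrderedAndProject node
// rather than a full Sort followed by a Limit.
```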
On Tue, Jan 30, 2018 at 11:10 PM, Yacine Mazari wrote:
Hi All,
Would it make sense to add a "top()" method to the Dataset API?
This method would return a Dataset containing the top k elements; the caller
may then do further processing on the Dataset or call collect(). This is in
contrast with RDD's top(), which returns a collected array.
In terms of i
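The API contrast described in this message can be sketched as follows (illustrative data, not from the thread):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("top-contrast").master("local[*]").getOrCreate()
import spark.implicits._

// RDD.top(k) eagerly returns a collected Array on the driver.
val arr: Array[Int] = spark.sparkContext.parallelize(Seq(3, 9, 5)).top(2)

// With the Dataset API, orderBy + limit keeps the top k as a Dataset,
// so further distributed, lazy transformations remain possible.
val ds = Seq(3, 9, 5).toDS()
val topDs = ds.orderBy(ds("value").desc).limit(2)
val transformed = topDs.map(_ * 10)
```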