For the DataFrame/Dataset API, the optimizer actually already rewrites an
orderBy followed by a take into a priority-queue-based top-k implementation.
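
To make the bounded priority queue idea concrete, here is a minimal sketch in
plain Python. It is not Spark's code; `top_k` is a hypothetical helper name
used only for illustration. It keeps a min-heap of at most k elements, so each
of the n inputs costs at most O(log k), instead of sorting everything in
O(n log n).

```python
import heapq
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def top_k(items: Iterable[T], k: int) -> List[T]:
    """Return the k largest items in descending order.

    Hypothetical helper for illustration only: it keeps a min-heap of at
    most k elements, so the smallest retained element is always heap[0].
    """
    if k <= 0:
        return []
    heap: List[T] = []
    for item in items:
        if len(heap) < k:
            # Heap not full yet: always keep the item.
            heapq.heappush(heap, item)
        elif item > heap[0]:
            # Item beats the smallest retained element: swap it in.
            heapq.heapreplace(heap, item)
    # Only the k survivors are sorted, which costs O(k log k).
    return sorted(heap, reverse=True)
```

In a distributed setting one would presumably run something like this per
partition and then merge the per-partition results, but that part is an
assumption and is not shown here.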


On Tue, Jan 30, 2018 at 11:10 PM, Yacine Mazari <y.maz...@gmail.com> wrote:

> Hi All,
>
> Would it make sense to add a "top()" method to the Dataset API?
> This method would return a Dataset containing the top k elements; the caller
> may then do further processing on the Dataset or call collect(). This is in
> contrast with RDD's top(), which returns a collected array.
>
> In terms of implementation, this would use a bounded priority queue, which
> will avoid sorting all elements and run in O(n log k).
>
> I know something similar can be achieved by "orderBy().take()", but I am not
> sure whether this is optimized.
> If it is not, and it sorts all elements (therefore running in O(n log n)),
> it might be handy to add this method.
>
> What do you think?
>
> Regards,
> Yacine.
