@antoine thanks for the heads up.
This was the example. I want to merge this table into a final table
+-----------+--+-----------+
|  Table 0  |  |  Table 1  |
+---+---+---+  +---+---+---+
|   | A | B |  |   | A | B |
+---+---+---+  +---+---+---+
| 0 | 3 | x |  | 0 | 1 | x |
+---+---+---+  +---+---+---+
| 1 | 3 | x |  | 1 | 2 | x |
+---+---+---+  +---+---+---+
| 2 | 4 | x |  | 2 | 2 | x |
+---+---+---+  +---+---+---+
| 3 | 5 | x |  | 3 | 5 | x |
+---+---+---+--+---+---+---+

I agree. For the 2 tables case, we don't need a min-heap. But I think the
complexity of the min-heap approach would be O(N log K) for K tables.

I looked at the chunked array index sort impl. If I want to create a final
table, I'd still have to use `sort_index` followed by `take`, wouldn't it?

On Tue, Aug 31, 2021 at 1:22 PM Antoine Pitrou <anto...@python.org> wrote:

>
> Hi Niranda,
>
> Le 31/08/2021 à 18:09, Niranda Perera a écrit :
> > Hi all,
> >
> > Is there an efficient way to merge a set of sorted tables/ record batches
> > (like in the merge step of a merge sort)?
> > Ex: Simplest case would be to merge these 2 tables, based on A
>
> The formatting of your example looks broken, so I don't really know what
> the two tables are exactly :-)
>
> > Any thoughts on these approaches? Is there a better way to do this?
>
> Merging two sorted sequences is a O(N) operation, it doesn't need a
> min-heap (which would make your algorithm O(N log N)).
>
> You can take a look at the sample implementations in
> https://en.cppreference.com/w/cpp/algorithm/merge
>
> If you have more than 2 tables, you can merge them by pairs,
> recursively, like is already done for the C++ sorting kernel on chunked
> arrays:
>
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/vector_sort.cc#L796
>
> Regards
>
> Antoine.
>


-- 
Niranda Perera
https://niranda.dev/
@n1r44 <https://twitter.com/N1R44>

Reply via email to