@antoine thanks for the heads up. This was the example. I want to merge this table into a final table +-----------+--+-----------+ | Table 0 | | Table 1 | +---+---+---+ +---+---+---+ | | A | B | | | A | B | +---+---+---+ +---+---+---+ | 0 | 3 | x | | 0 | 1 | x | +---+---+---+ +---+---+---+ | 1 | 3 | x | | 1 | 2 | x | +---+---+---+ +---+---+---+ | 2 | 4 | x | | 2 | 2 | x | +---+---+---+ +---+---+---+ | 3 | 5 | x | | 3 | 5 | x | +---+---+---+--+---+---+---+
I agree. For the 2 tables case, we don't need a min-heap. But I think the complexity of the min-heap approach would be O(N log K) for K tables. I looked at the chunked array index sort impl. If I want to create a final table, I'd still have to use `sort_index` followed by `take`, wouldn't it? On Tue, Aug 31, 2021 at 1:22 PM Antoine Pitrou <anto...@python.org> wrote: > > Hi Niranda, > > Le 31/08/2021 à 18:09, Niranda Perera a écrit : > > Hi all, > > > > Is there an efficient way to merge a set of sorted tables/ record batches > > (like in the merge step of a merge sort)? > > Ex: Simplest case would be to merge these 2 tables, based on A > > The formatting of your example looks broken, so I don't really know what > the two tables are exactly :-) > > > Any thoughts on these approaches? Is there a better way to do this? > > Merging two sorted sequences is a O(N) operation, it doesn't need a > min-heap (which would make your algorithm O(N log N)). > > You can take a look at the sample implementations in > https://en.cppreference.com/w/cpp/algorithm/merge > > If you have more than 2 tables, you can merge them by pairs, > recursively, like is already done for the C++ sorting kernel on chunked > arrays: > > https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/vector_sort.cc#L796 > > Regards > > Antoine. > -- Niranda Perera https://niranda.dev/ @n1r44 <https://twitter.com/N1R44>