Re: Merging sorted tables/ record batches

Micah Kornfield Tue, 31 Aug 2021 18:58:14 -0700

According to Wikipedia there is a min-heap approach that is O(N log k) not
sure if this matches with Niranda's proposal [1].  On the surface the
analysis make sense to me but I could be missing something.


[1] https://en.m.wikipedia.org/wiki/K-way_merge_algorithm

On Tuesday, August 31, 2021, Antoine Pitrou <anto...@python.org> wrote:

> On Tue, 31 Aug 2021 17:47:02 -0400
> Niranda Perera <niranda.per...@gmail.com> wrote:
> > @antoine thanks for the heads up.
> > This was the example. I want to merge this table into a final table
> > +-----------+--+-----------+
> > |  Table 0  |  |  Table 1  |
> > +---+---+---+  +---+---+---+
> > |   | A | B |  |   | A | B |
> > +---+---+---+  +---+---+---+
> > | 0 | 3 | x |  | 0 | 1 | x |
> > +---+---+---+  +---+---+---+
> > | 1 | 3 | x |  | 1 | 2 | x |
> > +---+---+---+  +---+---+---+
> > | 2 | 4 | x |  | 2 | 2 | x |
> > +---+---+---+  +---+---+---+
> > | 3 | 5 | x |  | 3 | 5 | x |
> > +---+---+---+--+---+---+---+
> >
> > I agree. For the 2 tables case, we don't need a min-heap. But I think the
> > complexity of the min-heap approach would be O(N log K) for K tables.
>
> The min-heap approach you're proposing looks like a regular heap sort,
> and so would be O(N log N) (it doesn't take advantage of the existing
> order). Besides, heap sort is not stable.
>
> The pairwise merge approach should O(N log K), however.
>
> (of course, if N is small-ish this may not be a problem for you)
>
> > I looked at the chunked array index sort impl. If I want to create a
> final
> > table, I'd still have to use `sort_index` followed by `take`, wouldn't
> it?
>
> If you use `sort_indices`, you'll have the entire sort.  If you want to
> take advantage of each chunk being ordered, you should extract or
> rewrite the merge logic.
>
> (and, yes, if you use `sort_indices`, you'll need to call `take`
> afterwards; but that's much more efficient that sorting the table
> directly, which requires shuffling entire rows when you need to swap
> values)
>
> Regards
>
> Antoine.
>
>
>

Re: Merging sorted tables/ record batches

Reply via email to