For what it is worth, my experience with some SQL databases has been the
opposite -- ordering can and does differ from statement to statement if the
clause has a GROUP BY but no ORDER BY).

As Andy mentioned, the core reason is performance -- as can require
additional computation to ensure the results come out in a specific,
repeatable order, so SQL allows users to choose if they want to pay the
potential cost for that feature or not (by including ORDER BY)

Some databases that use (or might use) merge based grouping (as opposed to
hash based grouping, as DataFusion currently has) will often produce data
in out in a consistent order due to an artifact of that implementation.

Andrew

On Sat, Feb 20, 2021 at 10:30 AM Marc Prud'hommeaux <mprud...@apache.org>
wrote:

>
> I understand that GROUP BY ought not imply any particular ordering; it's
> just that working with other SQL databases, I've come to expect that
> ordering will be consistent between multiple runs of the same statement, at
> least within the context of a single transaction on a single connection.
>
> I do note that the random ordering is still present when I reduce the CSV
> to 2 rows, so I doubt it is #2, exclusively at least. But the ability to
> process and return the data in parallel is very compelling, so I don't
> think it matters whether the phenomenon is a result of #1 or something else.
>
> Thanks for indulging my curiosity!
>
>     –Marc
>
> On 2021/02/20 14:49:59, Andy Grove <andygrov...@gmail.com> wrote:
> > The SQL standard in general makes no guarantee of the order of resulting
> > data unless there is an explicit ORDER BY clause.
> >
> > I would guess that there are two factors in play here:
> >
> > 1. The use of hash-based data structures, as you mention
> > 2. If you have partitioned data then it is processed on multiple threads
> > and that can affect ordering as well
> >
> > Andy.
> >
> > On Sat, Feb 20, 2021 at 7:31 AM Marc Prud'hommeaux <mprud...@apache.org>
> > wrote:
> >
> > > When I group by a column in DataFusion SQL, the order of the results is
> > > different every time. For example, "select country from data group by
> > > country" against
> > >
> https://github.com/Teradata/kylo/blob/master/samples/sample-data/csv/userdata3.csv
> > > might return "Moldova" first one time, and then "Sweden" first the next
> > > time I execute it.
> > >
> > > It appears that this is known and acknowledged behavior (it is
> mentioned
> > > at https://issues.apache.org/jira/browse/ARROW-5680), but is there
> good
> > > reason for it (e.g., performance; simplicity; random hash seeding)? I
> > > understand why it makes sense to not unnecessarily impose a particular
> > > ordering, but is there any reason the results are not consistent
> between
> > > two identical SQL statements executed against the same
> > > datafusion::execution::context::ExecutionContext?
> > >
> > >
> >
>

Reply via email to