alamb commented on issue #10572:
URL: https://github.com/apache/datafusion/issues/10572#issuecomment-2170404296
Hi @twitu -- I am very sorry for the delay in responding -- I have been
traveling for sever
> You'll see that the queries with ORDER BY have a Sort expression in the
plan. It's not clear to me why despite specifying the sort order in the
configuration the plan still has a sort. I hope the optimizations you've
mentioned will take this into account.
One thing that might be going on is that the NULLS FIRST / NULLS LAST setting
doesn't seem to match.
In your plan the sort is putting nulls last:
```
Sort: data.ts_init ASC NULLS LAST
```
but in your code you specify NULLS FIRST:
```rust
file_sort_order: vec![vec![Expr::Sort(Sort {
    expr: Box::new(col("ts_init")),
    asc: true,
    nulls_first: true,
})]],
```
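To illustrate why a single flag can matter, here is a minimal sketch of how a declared file ordering might be compared against the ordering a query requires. `SortKey` and `ordering_satisfied` are hypothetical names made up for this illustration; they are not DataFusion types or functions, and the real optimizer logic is more involved. The sketch only assumes that every field of each sort key, including the nulls placement, has to agree before the sort can be skipped.

```rust
/// Simplified stand-in for a sort key. NOT DataFusion's real `Sort` type.
#[derive(Debug, Clone, PartialEq, Eq)]
struct SortKey {
    column: String,
    asc: bool,
    nulls_first: bool,
}

/// Returns true when the required keys are a prefix of the declared keys,
/// comparing column, direction, and nulls placement.
fn ordering_satisfied(declared: &[SortKey], required: &[SortKey]) -> bool {
    required.len() <= declared.len()
        && declared.iter().zip(required).all(|(d, r)| d == r)
}

fn main() {
    // What the plan asks for: `ts_init ASC NULLS LAST`.
    let required = vec![SortKey {
        column: "ts_init".to_string(),
        asc: true,
        nulls_first: false,
    }];

    // What the configuration declares: same column, but NULLS FIRST.
    let declared_mismatch = vec![SortKey {
        column: "ts_init".to_string(),
        asc: true,
        nulls_first: true,
    }];

    // Same declaration with the nulls placement flipped to match the plan.
    let declared_match = vec![SortKey {
        column: "ts_init".to_string(),
        asc: true,
        nulls_first: false,
    }];

    println!(
        "NULLS FIRST declared: sort can be skipped = {}",
        ordering_satisfied(&declared_mismatch, &required)
    );
    println!(
        "NULLS LAST declared:  sort can be skipped = {}",
        ordering_satisfied(&declared_match, &required)
    );
}
```

Under that assumption, changing `nulls_first: true` to `nulls_first: false` in your configuration would be the first thing to try.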
> I don't think this is equivalent to adding a LIMIT clause because for the
purpose of the query I'm reading the whole file. It is only that the consumer
decides to stop after reading one row group.
DataFusion is a streaming engine, so if you open a parquet file, read one
batch, and then stop, the entire file will not be read (the batches are
basically created on demand).
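As a rough model of what "created on demand" means, the sketch below uses a plain Rust iterator in place of a file reader. `BatchReader` is a hypothetical type invented for this illustration, with made-up batch contents; it is not how DataFusion reads parquet, it only shows that a consumer that stops after one batch never triggers the rest.

```rust
/// Hypothetical stand-in for a file reader that yields batches lazily.
/// This is not DataFusion's API; it only models on-demand batch creation.
struct BatchReader {
    next_batch: usize,
    total_batches: usize,
    batches_read: usize,
}

impl BatchReader {
    fn new(total_batches: usize) -> Self {
        BatchReader { next_batch: 0, total_batches, batches_read: 0 }
    }
}

impl Iterator for BatchReader {
    type Item = Vec<i64>;

    fn next(&mut self) -> Option<Vec<i64>> {
        if self.next_batch == self.total_batches {
            return None;
        }
        // A batch is only materialized when the consumer asks for it.
        let start = (self.next_batch * 3) as i64;
        self.next_batch += 1;
        self.batches_read += 1;
        Some(vec![start, start + 1, start + 2])
    }
}

fn main() {
    let mut reader = BatchReader::new(100);

    // The consumer takes one batch and then simply stops polling.
    let first = reader.next();

    println!("first batch: {:?}", first);
    println!(
        "batches read: {} of {}",
        reader.batches_read, reader.total_batches
    );
}
```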
There are certain "pipeline breaking" operators that do require reading the
entire input, such as `Sort` and `GroupHashAggregate`, which is why I think you
are seeing the entire file read when your query has a sort.
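The difference between a streaming operator and a pipeline breaker can be sketched with plain iterators. Everything here (`Counting`, `first_even`, `first_after_sort`) is a hypothetical name for this illustration, not a DataFusion operator: the filter-like function can return as soon as it finds a row, while the sort-like function has to buffer all of its input before it can emit anything.

```rust
/// Wraps any iterator and counts how many items were pulled from it.
struct Counting<I> {
    inner: I,
    pulled: usize,
}

impl<I: Iterator> Iterator for Counting<I> {
    type Item = I::Item;

    fn next(&mut self) -> Option<I::Item> {
        let item = self.inner.next();
        if item.is_some() {
            self.pulled += 1;
        }
        item
    }
}

/// Streaming: stops pulling input as soon as one matching row is found.
fn first_even<I: Iterator<Item = i64>>(input: &mut I) -> Option<i64> {
    input.find(|v| v % 2 == 0)
}

/// Pipeline breaking: the smallest row is unknown until every row was seen,
/// so the whole input is buffered and sorted before the first row is emitted.
fn first_after_sort<I: Iterator<Item = i64>>(input: &mut I) -> Option<i64> {
    let mut all: Vec<i64> = input.collect();
    all.sort();
    all.into_iter().next()
}

fn main() {
    let rows = vec![5, 4, 9, 2, 7, 1];

    let mut streaming = Counting { inner: rows.clone().into_iter(), pulled: 0 };
    let even = first_even(&mut streaming);
    println!("filter: {:?} after pulling {} rows", even, streaming.pulled);

    let mut sorting = Counting { inner: rows.into_iter(), pulled: 0 };
    let smallest = first_after_sort(&mut sorting);
    println!("sort:   {:?} after pulling {} rows", smallest, sorting.pulled);
}
```

This is why a consumer that stops early still pays for the full read once a sort sits between it and the file.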
> If you need an additional contributor in any of the above mentioned
issues, I'm happy to help 😄
We are always looking for contributors -- anything you can do to help others
would be most appreciated. For example, perhaps you can add an example to
`datafusion-examples`
(https://github.com/apache/datafusion/tree/main/datafusion-examples) showing how
to use a pre-sorted input file to avoid sorting during a query (assuming that
you can actually get that working).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]