adriangb commented on PR #16461:
URL: https://github.com/apache/datafusion/pull/16461#issuecomment-3006801288
Great I'll address
https://github.com/apache/datafusion/pull/16461#discussion_r2159713329 and then
I think this will be ready to merge!
--
This is an automated message from the A
kosiew commented on PR #16461:
URL: https://github.com/apache/datafusion/pull/16461#issuecomment-3006767151
@adriangb
> @kosiew any objections to merging this?
Nope.
I am excited to see the solution of the puzzle.
--
This is an automated message from the Apache Git Servic
adriangb commented on PR #16461:
URL: https://github.com/apache/datafusion/pull/16461#issuecomment-3006603280
@kosiew any objections to merging this?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go
adriangb commented on PR #16461:
URL: https://github.com/apache/datafusion/pull/16461#issuecomment-2998806438
https://github.com/apache/datafusion/pull/16530 will be able to easily be
incorporated into this work, completely eliminating what are currently
expensive casts
--
This is an aut
adriangb commented on code in PR #16461:
URL: https://github.com/apache/datafusion/pull/16461#discussion_r2162856077
##
datafusion/physical-expr/src/schema_rewriter.rs:
##
@@ -0,0 +1,318 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor li
adriangb commented on PR #16461:
URL: https://github.com/apache/datafusion/pull/16461#issuecomment-3004630417
Thank you very much for the feedback @kosiew ππ»! I don't mean to disregard
it, you make great points, but I think they are surmountable! Let's move
forward with this and keep iterat
adriangb commented on PR #16461:
URL: https://github.com/apache/datafusion/pull/16461#issuecomment-3004580959
> Every invocation in the evaluator will loop over rows to build child
arrays, then pack them into a StructArray
As far as I know a PhysicalExpr can operate at the array level
kosiew commented on PR #16461:
URL: https://github.com/apache/datafusion/pull/16461#issuecomment-3003420880
hi @adriangb
> could you take a look at
https://github.com/apache/datafusion/commit/32725dd621ec5e96caf1970433f3549dca977a80?
πππ
The new tests in
`PhysicalExpr
adriangb commented on PR #16461:
URL: https://github.com/apache/datafusion/pull/16461#issuecomment-3000255690
@kosiew I'm not sure I agree with the conclusions there. Why can't we use
expressions to do the schema adapting during the scan? It's very possible as
@alamb pointed out in
https:/
adriangb commented on PR #16461:
URL: https://github.com/apache/datafusion/pull/16461#issuecomment-3000698443
@kosiew could you take a look at 32725dd?
> Complex handling for deeply nested types.
I do think this is a concern, I'm not sure how hard it would be to actually
implem
kosiew commented on PR #16461:
URL: https://github.com/apache/datafusion/pull/16461#issuecomment-3000612368
@adriangb ,
Sorry, it was not my intention to presume the conclusions.
I do look forward to a solution that handles schema adaptation in one pass.
--
This is an automat
kosiew commented on PR #16461:
URL: https://github.com/apache/datafusion/pull/16461#issuecomment-2999138805
Adding notes for future reference:
---
# Summary: Adapting Filter Expressions to File Schema During Parquet Scan
---
## Background & Goal
- Apache Dat
adriangb commented on PR #16461:
URL: https://github.com/apache/datafusion/pull/16461#issuecomment-2998670275
I opened https://github.com/apache/datafusion/issues/16528 to track further
ideas / steps
--
This is an automated message from the Apache Git Service.
To respond to the message, p
alamb commented on code in PR #16461:
URL: https://github.com/apache/datafusion/pull/16461#discussion_r2162492604
##
datafusion/physical-expr/src/schema_rewriter.rs:
##
@@ -0,0 +1,318 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor licen
adriangb commented on code in PR #16461:
URL: https://github.com/apache/datafusion/pull/16461#discussion_r2162548231
##
datafusion/physical-expr/src/schema_rewriter.rs:
##
@@ -0,0 +1,318 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor li
alamb commented on PR #16461:
URL: https://github.com/apache/datafusion/pull/16461#issuecomment-2997873713
I would personally recommend proceeding in parallel with the two approaches,
ensuring there are good end to end tests (.slt) -- and then if we find that the
projection pushdown / rewri
alamb commented on PR #16461:
URL: https://github.com/apache/datafusion/pull/16461#issuecomment-2997870791
> But weβd still need an array-level counterpart to actually materialize
those null nested fields in the RecordBatch when we call map_batch.
FWIW I think this is one mechanism to
adriangb commented on PR #16461:
URL: https://github.com/apache/datafusion/pull/16461#issuecomment-2996225226
Thanks for the thoughtful reply. An important point that I forgot to
mention: we're actively working on projection / selection pushdown which would
involve pushing down expressions
kosiew commented on PR #16461:
URL: https://github.com/apache/datafusion/pull/16461#issuecomment-2994947013
@adriangb
Thanks for the ping on this.
> Would it be possible to implement the nested struct imputation work you're
doing with this approach?
Do you mean reusing the
kosiew commented on PR #16461:
URL: https://github.com/apache/datafusion/pull/16461#issuecomment-2994965537
Putting more words to how I understand pushdown and data adaptation:
1. Pushdown β βWhich rows or pages should I read?β
- Input: your original predicate (e.g. col("foo.b") > 5
adriangb commented on PR #16461:
URL: https://github.com/apache/datafusion/pull/16461#issuecomment-2994216351
@kosiew I'm curious what you think about this. Would it be possible to
implement the nested struct imputation work you're doing with this approach?
--
This is an automated message
adriangb commented on PR #16461:
URL: https://github.com/apache/datafusion/pull/16461#issuecomment-2994242282
Just for fun I opened https://github.com/pydantic/datafusion/pull/31 to see
how hard it would be to incorporate
https://github.com/apache/datafusion/issues/15780#issuecomment-282471
adriangb commented on PR #16461:
URL: https://github.com/apache/datafusion/pull/16461#issuecomment-2994214075
My hopes with this work is that we can:
- Make it easier to do further optimizations like
https://github.com/apache/datafusion/issues/15780#issuecomment-2824716928
- Replace th
adriangb commented on PR #16461:
URL: https://github.com/apache/datafusion/pull/16461#issuecomment-2993667691
@alamb I've created the builder, moved the implementation and added some
unit tests
--
This is an automated message from the Apache Git Service.
To respond to the message, please
adriangb commented on code in PR #16461:
URL: https://github.com/apache/datafusion/pull/16461#discussion_r2159712965
##
datafusion/datasource-parquet/src/opener.rs:
##
@@ -524,6 +539,84 @@ fn should_enable_page_index(
.unwrap_or(false)
}
+use datafusion_physical_
adriangb commented on code in PR #16461:
URL: https://github.com/apache/datafusion/pull/16461#discussion_r2159713329
##
datafusion/datasource-parquet/src/row_filter.rs:
##
@@ -520,111 +489,15 @@ mod test {
let expr = col("int64_list").is_not_null();
let expr =
alamb commented on code in PR #16461:
URL: https://github.com/apache/datafusion/pull/16461#discussion_r2159679800
##
datafusion/datasource-parquet/src/opener.rs:
##
@@ -524,6 +539,84 @@ fn should_enable_page_index(
.unwrap_or(false)
}
+use datafusion_physical_exp
adriangb commented on code in PR #16461:
URL: https://github.com/apache/datafusion/pull/16461#discussion_r2157528156
##
datafusion/datasource-parquet/src/opener.rs:
##
@@ -879,4 +972,107 @@ mod test {
assert_eq!(num_batches, 0);
assert_eq!(num_rows, 0);
}
adriangb commented on code in PR #16461:
URL: https://github.com/apache/datafusion/pull/16461#discussion_r2157528156
##
datafusion/datasource-parquet/src/opener.rs:
##
@@ -879,4 +972,107 @@ mod test {
assert_eq!(num_batches, 0);
assert_eq!(num_rows, 0);
}
adriangb commented on code in PR #16461:
URL: https://github.com/apache/datafusion/pull/16461#discussion_r2157496405
##
datafusion/datasource-parquet/src/opener.rs:
##
@@ -524,6 +532,62 @@ fn should_enable_page_index(
.unwrap_or(false)
}
+use datafusion_physical_
adriangb opened a new pull request, #16461:
URL: https://github.com/apache/datafusion/pull/16461
The idea here is to move us one step closer to
https://github.com/apache/datafusion/issues/15780 although there is still work
to do (e.g.
https://github.com/apache/datafusion/issues/15780#issue
31 matches
Mail list logo