Re: [PR] adapt filter expressions to file schema during parquet scan [datafusion]

2025-06-25 Thread via GitHub
adriangb commented on PR #16461: URL: https://github.com/apache/datafusion/pull/16461#issuecomment-3006801288 Great I'll address https://github.com/apache/datafusion/pull/16461#discussion_r2159713329 and then I think this will be ready to merge! -- This is an automated message from the A

Re: [PR] adapt filter expressions to file schema during parquet scan [datafusion]

2025-06-25 Thread via GitHub
kosiew commented on PR #16461: URL: https://github.com/apache/datafusion/pull/16461#issuecomment-3006767151 @adriangb > @kosiew any objections to merging this? Nope. I am excited to see the solution of the puzzle. -- This is an automated message from the Apache Git Servic

Re: [PR] adapt filter expressions to file schema during parquet scan [datafusion]

2025-06-25 Thread via GitHub
adriangb commented on PR #16461: URL: https://github.com/apache/datafusion/pull/16461#issuecomment-3006603280 @kosiew any objections to merging this? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go

Re: [PR] adapt filter expressions to file schema during parquet scan [datafusion]

2025-06-25 Thread via GitHub
adriangb commented on PR #16461: URL: https://github.com/apache/datafusion/pull/16461#issuecomment-2998806438 https://github.com/apache/datafusion/pull/16530 will be able to easily be incorporated into this work, completely eliminating what are currently expensive casts -- This is an aut

Re: [PR] adapt filter expressions to file schema during parquet scan [datafusion]

2025-06-25 Thread via GitHub
adriangb commented on code in PR #16461: URL: https://github.com/apache/datafusion/pull/16461#discussion_r2162856077 ## datafusion/physical-expr/src/schema_rewriter.rs: ## @@ -0,0 +1,318 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor li

Re: [PR] adapt filter expressions to file schema during parquet scan [datafusion]

2025-06-25 Thread via GitHub
adriangb commented on PR #16461: URL: https://github.com/apache/datafusion/pull/16461#issuecomment-3004630417 Thank you very much for the feedback @kosiew πŸ™πŸ»! I don't mean to disregard it, you make great points, but I think they are surmountable! Let's move forward with this and keep iterat

Re: [PR] adapt filter expressions to file schema during parquet scan [datafusion]

2025-06-25 Thread via GitHub
adriangb commented on PR #16461: URL: https://github.com/apache/datafusion/pull/16461#issuecomment-3004580959 > Every invocation in the evaluator will loop over rows to build child arrays, then pack them into a StructArray As far as I know a PhysicalExpr can operate at the array level

Re: [PR] adapt filter expressions to file schema during parquet scan [datafusion]

2025-06-24 Thread via GitHub
kosiew commented on PR #16461: URL: https://github.com/apache/datafusion/pull/16461#issuecomment-3003420880 hi @adriangb > could you take a look at https://github.com/apache/datafusion/commit/32725dd621ec5e96caf1970433f3549dca977a80? πŸ‘πŸ‘πŸ‘ The new tests in `PhysicalExpr

Re: [PR] adapt filter expressions to file schema during parquet scan [datafusion]

2025-06-24 Thread via GitHub
adriangb commented on PR #16461: URL: https://github.com/apache/datafusion/pull/16461#issuecomment-3000255690 @kosiew I'm not sure I agree with the conclusions there. Why can't we use expressions to do the schema adapting during the scan? It's very possible as @alamb pointed out in https:/

Re: [PR] adapt filter expressions to file schema during parquet scan [datafusion]

2025-06-24 Thread via GitHub
adriangb commented on PR #16461: URL: https://github.com/apache/datafusion/pull/16461#issuecomment-3000698443 @kosiew could you take a look at 32725dd? > Complex handling for deeply nested types. I do think this is a concern, I'm not sure how hard it would be to actually implem

Re: [PR] adapt filter expressions to file schema during parquet scan [datafusion]

2025-06-24 Thread via GitHub
kosiew commented on PR #16461: URL: https://github.com/apache/datafusion/pull/16461#issuecomment-3000612368 @adriangb , Sorry, it was not my intention to presume the conclusions. I do look forward to a solution that handles schema adaptation in one pass. -- This is an automat

Re: [PR] adapt filter expressions to file schema during parquet scan [datafusion]

2025-06-24 Thread via GitHub
kosiew commented on PR #16461: URL: https://github.com/apache/datafusion/pull/16461#issuecomment-2999138805 Adding notes for future reference: --- # Summary: Adapting Filter Expressions to File Schema During Parquet Scan --- ## Background & Goal - Apache Dat

Re: [PR] adapt filter expressions to file schema during parquet scan [datafusion]

2025-06-23 Thread via GitHub
adriangb commented on PR #16461: URL: https://github.com/apache/datafusion/pull/16461#issuecomment-2998670275 I opened https://github.com/apache/datafusion/issues/16528 to track further ideas / steps -- This is an automated message from the Apache Git Service. To respond to the message, p

Re: [PR] adapt filter expressions to file schema during parquet scan [datafusion]

2025-06-23 Thread via GitHub
alamb commented on code in PR #16461: URL: https://github.com/apache/datafusion/pull/16461#discussion_r2162492604 ## datafusion/physical-expr/src/schema_rewriter.rs: ## @@ -0,0 +1,318 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor licen

Re: [PR] adapt filter expressions to file schema during parquet scan [datafusion]

2025-06-23 Thread via GitHub
adriangb commented on code in PR #16461: URL: https://github.com/apache/datafusion/pull/16461#discussion_r2162548231 ## datafusion/physical-expr/src/schema_rewriter.rs: ## @@ -0,0 +1,318 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor li

Re: [PR] adapt filter expressions to file schema during parquet scan [datafusion]

2025-06-23 Thread via GitHub
alamb commented on PR #16461: URL: https://github.com/apache/datafusion/pull/16461#issuecomment-2997873713 I would personally recommend proceeding in parallel with the two approaches, ensuring there are good end to end tests (.slt) -- and then if we find that the projection pushdown / rewri

Re: [PR] adapt filter expressions to file schema during parquet scan [datafusion]

2025-06-23 Thread via GitHub
alamb commented on PR #16461: URL: https://github.com/apache/datafusion/pull/16461#issuecomment-2997870791 > But we’d still need an array-level counterpart to actually materialize those null nested fields in the RecordBatch when we call map_batch. FWIW I think this is one mechanism to

Re: [PR] adapt filter expressions to file schema during parquet scan [datafusion]

2025-06-23 Thread via GitHub
adriangb commented on PR #16461: URL: https://github.com/apache/datafusion/pull/16461#issuecomment-2996225226 Thanks for the thoughtful reply. An important point that I forgot to mention: we're actively working on projection / selection pushdown which would involve pushing down expressions

Re: [PR] adapt filter expressions to file schema during parquet scan [datafusion]

2025-06-23 Thread via GitHub
kosiew commented on PR #16461: URL: https://github.com/apache/datafusion/pull/16461#issuecomment-2994947013 @adriangb Thanks for the ping on this. > Would it be possible to implement the nested struct imputation work you're doing with this approach? Do you mean reusing the

Re: [PR] adapt filter expressions to file schema during parquet scan [datafusion]

2025-06-23 Thread via GitHub
kosiew commented on PR #16461: URL: https://github.com/apache/datafusion/pull/16461#issuecomment-2994965537 Putting more words to how I understand pushdown and data adaptation: 1. Pushdown β€” β€œWhich rows or pages should I read?” - Input: your original predicate (e.g. col("foo.b") > 5

Re: [PR] adapt filter expressions to file schema during parquet scan [datafusion]

2025-06-22 Thread via GitHub
adriangb commented on PR #16461: URL: https://github.com/apache/datafusion/pull/16461#issuecomment-2994216351 @kosiew I'm curious what you think about this. Would it be possible to implement the nested struct imputation work you're doing with this approach? -- This is an automated message

Re: [PR] adapt filter expressions to file schema during parquet scan [datafusion]

2025-06-22 Thread via GitHub
adriangb commented on PR #16461: URL: https://github.com/apache/datafusion/pull/16461#issuecomment-2994242282 Just for fun I opened https://github.com/pydantic/datafusion/pull/31 to see how hard it would be to incorporate https://github.com/apache/datafusion/issues/15780#issuecomment-282471

Re: [PR] adapt filter expressions to file schema during parquet scan [datafusion]

2025-06-22 Thread via GitHub
adriangb commented on PR #16461: URL: https://github.com/apache/datafusion/pull/16461#issuecomment-2994214075 My hopes with this work is that we can: - Make it easier to do further optimizations like https://github.com/apache/datafusion/issues/15780#issuecomment-2824716928 - Replace th

Re: [PR] adapt filter expressions to file schema during parquet scan [datafusion]

2025-06-21 Thread via GitHub
adriangb commented on PR #16461: URL: https://github.com/apache/datafusion/pull/16461#issuecomment-2993667691 @alamb I've created the builder, moved the implementation and added some unit tests -- This is an automated message from the Apache Git Service. To respond to the message, please

Re: [PR] adapt filter expressions to file schema during parquet scan [datafusion]

2025-06-20 Thread via GitHub
adriangb commented on code in PR #16461: URL: https://github.com/apache/datafusion/pull/16461#discussion_r2159712965 ## datafusion/datasource-parquet/src/opener.rs: ## @@ -524,6 +539,84 @@ fn should_enable_page_index( .unwrap_or(false) } +use datafusion_physical_

Re: [PR] adapt filter expressions to file schema during parquet scan [datafusion]

2025-06-20 Thread via GitHub
adriangb commented on code in PR #16461: URL: https://github.com/apache/datafusion/pull/16461#discussion_r2159713329 ## datafusion/datasource-parquet/src/row_filter.rs: ## @@ -520,111 +489,15 @@ mod test { let expr = col("int64_list").is_not_null(); let expr =

Re: [PR] adapt filter expressions to file schema during parquet scan [datafusion]

2025-06-20 Thread via GitHub
alamb commented on code in PR #16461: URL: https://github.com/apache/datafusion/pull/16461#discussion_r2159679800 ## datafusion/datasource-parquet/src/opener.rs: ## @@ -524,6 +539,84 @@ fn should_enable_page_index( .unwrap_or(false) } +use datafusion_physical_exp

Re: [PR] adapt filter expressions to file schema during parquet scan [datafusion]

2025-06-19 Thread via GitHub
adriangb commented on code in PR #16461: URL: https://github.com/apache/datafusion/pull/16461#discussion_r2157528156 ## datafusion/datasource-parquet/src/opener.rs: ## @@ -879,4 +972,107 @@ mod test { assert_eq!(num_batches, 0); assert_eq!(num_rows, 0); }

Re: [PR] adapt filter expressions to file schema during parquet scan [datafusion]

2025-06-19 Thread via GitHub
adriangb commented on code in PR #16461: URL: https://github.com/apache/datafusion/pull/16461#discussion_r2157528156 ## datafusion/datasource-parquet/src/opener.rs: ## @@ -879,4 +972,107 @@ mod test { assert_eq!(num_batches, 0); assert_eq!(num_rows, 0); }

Re: [PR] adapt filter expressions to file schema during parquet scan [datafusion]

2025-06-19 Thread via GitHub
adriangb commented on code in PR #16461: URL: https://github.com/apache/datafusion/pull/16461#discussion_r2157496405 ## datafusion/datasource-parquet/src/opener.rs: ## @@ -524,6 +532,62 @@ fn should_enable_page_index( .unwrap_or(false) } +use datafusion_physical_

[PR] adapt filter expressions to file schema during parquet scan [datafusion]

2025-06-19 Thread via GitHub
adriangb opened a new pull request, #16461: URL: https://github.com/apache/datafusion/pull/16461 The idea here is to move us one step closer to https://github.com/apache/datafusion/issues/15780 although there is still work to do (e.g. https://github.com/apache/datafusion/issues/15780#issue