Re: [PR] feat: change Expr Alias ,OuterReferenceColumn, Column to Box type for reducing expr struct size [datafusion]

2025-07-15 Thread via GitHub
kosiew commented on code in PR #16771: URL: https://github.com/apache/datafusion/pull/16771#discussion_r2209435726 ## datafusion/expr/src/utils.rs: ## @@ -992,7 +996,7 @@ pub fn iter_conjunction_owned(expr: Expr) -> impl Iterator<Item = Expr> { stack.push(*right);
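
The `stack.push(*right)` in the diff context above suggests an iterative, stack-based walk over boxed `AND` operands. A minimal sketch of that idea follows; `ToyExpr` and `split_conjunction_owned` are hypothetical names invented for illustration and are not DataFusion's `Expr` or the function touched by the PR.

```rust
/// Toy expression type (hypothetical; far smaller than DataFusion's `Expr`).
#[derive(Debug, Clone, PartialEq)]
enum ToyExpr {
    Column(String),
    And(Box<ToyExpr>, Box<ToyExpr>),
}

/// Split `a AND b AND c` into its leaves, left to right, without recursion.
fn split_conjunction_owned(expr: ToyExpr) -> Vec<ToyExpr> {
    let mut stack = vec![expr];
    let mut leaves = Vec::new();
    while let Some(e) = stack.pop() {
        match e {
            ToyExpr::And(left, right) => {
                // `*right` moves the operand out of its Box.
                // Push right first so that left is popped (visited) first.
                stack.push(*right);
                stack.push(*left);
            }
            other => leaves.push(other),
        }
    }
    leaves
}
```

Because the operands are boxed, each `And` node stores two pointers rather than two inline sub-expressions, which is the general mechanism by which boxing variants can shrink an enum.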

Re: [PR] feat: Upgrade to the official DataFusion 49.0.0 release [datafusion-comet]

2025-07-15 Thread via GitHub
dharanad commented on code in PR #1997: URL: https://github.com/apache/datafusion-comet/pull/1997#discussion_r2209357954 ## native/Cargo.toml: ## @@ -38,8 +38,8 @@ arrow = { version = "55.1.0", features = ["prettyprint", "ffi", "chrono-tz"] } async-trait = { version = "0.1" }

Re: [PR] feat: change Expr Alias ,OuterReferenceColumn, Column to Box type for reducing expr struct size [datafusion]

2025-07-15 Thread via GitHub
zhuqi-lucas commented on PR #16771: URL: https://github.com/apache/datafusion/pull/16771#issuecomment-3076797095 Strange, I can't reproduce this locally: ```rust with_param_values_many_columns 1.00 129.8±0.74µs ? ?/sec 1.06 137.2±0.79µs ```

Re: [PR] Automatically split large single RecordBatches in `MemorySource` into smaller batches [datafusion]

2025-07-15 Thread via GitHub
kosiew commented on PR #16734: URL: https://github.com/apache/datafusion/pull/16734#issuecomment-3076777548 Thanks @2010YOUY01 , @alamb for your review. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to

Re: [PR] Automatically split large single RecordBatches in `MemorySource` into smaller batches [datafusion]

2025-07-15 Thread via GitHub
kosiew commented on code in PR #16734: URL: https://github.com/apache/datafusion/pull/16734#discussion_r2209233129 ## datafusion/sqllogictest/test_files/window.slt: ## @@ -5181,6 +5184,10 @@ order by c1; 3 1 1 3 10 2 + +statement ok +set datafusion.execution.batch_size = 1;

Re: [PR] Automatically split large single RecordBatches in `MemorySource` into smaller batches [datafusion]

2025-07-15 Thread via GitHub
kosiew commented on code in PR #16734: URL: https://github.com/apache/datafusion/pull/16734#discussion_r2209232180 ## datafusion/sqllogictest/test_files/window.slt: ## @@ -4341,6 +4341,9 @@ LIMIT 5; 24 31 14 94 +statement ok +set datafusion.execution.batch_size = 100; + Rev

Re: [PR] Automatically split large single RecordBatches in `MemorySource` into smaller batches [datafusion]

2025-07-15 Thread via GitHub
kosiew commented on code in PR #16734: URL: https://github.com/apache/datafusion/pull/16734#discussion_r2209227792 ## datafusion/sqllogictest/test_files/window.slt: ## @@ -4341,6 +4341,9 @@ LIMIT 5; 24 31 14 94 +statement ok Review Comment: @alamb, I added another

[PR] CI: Fix slow join test [datafusion]

2025-07-15 Thread via GitHub
2010YOUY01 opened a new pull request, #16796: URL: https://github.com/apache/datafusion/pull/16796 ## Which issue does this PR close? - Closes https://github.com/apache/datafusion/issues/16792 ## Rationale for this change #16443 added one batch size option to

Re: [PR] Snowflake: Improve accuracy of lookahead in implicit LIMIT alias [datafusion-sqlparser-rs]

2025-07-15 Thread via GitHub
yoavcloud commented on code in PR #1941: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1941#discussion_r2209218297 ## src/dialect/snowflake.rs: ## @@ -365,6 +364,18 @@ impl Dialect for SnowflakeDialect { false } +// `LIM

Re: [PR] Automatically split large single RecordBatches in `MemorySource` into smaller batches [datafusion]

2025-07-15 Thread via GitHub
kosiew commented on code in PR #16734: URL: https://github.com/apache/datafusion/pull/16734#discussion_r2209201224 ## datafusion/physical-plan/src/stream.rs: ## @@ -522,6 +524,139 @@ impl Stream for ObservedStream { } } +pin_project! { +/// Stream wrapper that splits
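
The PR title describes splitting one large batch into smaller ones bounded by a batch size. A minimal sketch of that idea, assuming a plain slice of rows in place of an Arrow `RecordBatch`; `split_into_batches` is a hypothetical helper, and the actual PR wraps a stream rather than materializing every chunk up front.

```rust
/// Split `rows` into chunks of at most `batch_size` rows each.
/// Hypothetical stand-in: a slice of rows instead of a columnar batch.
fn split_into_batches<T: Clone>(rows: &[T], batch_size: usize) -> Vec<Vec<T>> {
    assert!(batch_size > 0, "batch_size must be non-zero");
    rows.chunks(batch_size) // the last chunk may be shorter
        .map(|chunk| chunk.to_vec())
        .collect()
}
```

With `batch_size = 1`, as in the `set datafusion.execution.batch_size = 1;` test statement quoted elsewhere in this thread, every row ends up in its own batch.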

Re: [PR] fix: hdfs read into buffer fully [datafusion-comet]

2025-07-15 Thread via GitHub
andygrove commented on code in PR #2031: URL: https://github.com/apache/datafusion-comet/pull/2031#discussion_r2209180081 ## native/hdfs/src/object_store/hdfs.rs: ## @@ -88,19 +88,18 @@ impl HadoopFileSystem { fn read_range(range: &Range, file: &HdfsFile) -> Result {

Re: [PR] feat: change Expr Alias ,OuterReferenceColumn, Column to Box type for reducing expr struct size [datafusion]

2025-07-15 Thread via GitHub
zhuqi-lucas commented on PR #16771: URL: https://github.com/apache/datafusion/pull/16771#issuecomment-3076583715 > 🤖: Benchmark completed > > Details Thank you @alamb for benchmark, interesting, it seems the performance decrease from 112 size to 80, i will investigate it, if i

Re: [PR] [Update author info] Blog: Embedding User-Defined Indexes in Apache Parquet Files #79 [datafusion-site]

2025-07-15 Thread via GitHub
zhuqi-lucas commented on PR #89: URL: https://github.com/apache/datafusion-site/pull/89#issuecomment-3076581217 > > > Hi @alamb One small typo in https://bsky.app/profile/andrewlamb.bsky.social I read: the name should be `Qi Zhu` as https://github.com/zhuqi-lucas > > > > > >

Re: [PR] limit intermediate batch size in nested_loop_join [datafusion]

2025-07-15 Thread via GitHub
2010YOUY01 commented on PR #16443: URL: https://github.com/apache/datafusion/pull/16443#issuecomment-3076513755 > select t1.value from range(100) t1 join range(819200) t2 on t1.value + t2.value < t1.value * t2.value; I'm happy to include this benchmark in the bench suite this week, un

Re: [I] [substrait] [sqllogictest] Unsupported cast type: FixedSizeList [datafusion]

2025-07-15 Thread via GitHub
niebayes commented on issue #16278: URL: https://github.com/apache/datafusion/issues/16278#issuecomment-3076494962 take

[I] Postgres VACUUM not supported [datafusion-sqlparser-rs]

2025-07-15 Thread via GitHub
achristmascarl opened a new issue, #1948: URL: https://github.com/apache/datafusion-sqlparser-rs/issues/1948 VACUUM does not seem to be parsed. Docs: https://www.postgresql.org/docs/current/sql-vacuum.html

Re: [PR] Add xxhash algorithms in SQL and expression api [datafusion]

2025-07-15 Thread via GitHub
github-actions[bot] closed pull request #14367: Add xxhash algorithms in SQL and expression api URL: https://github.com/apache/datafusion/pull/14367

Re: [PR] [wip] update list & struct coercion to support incrementality [datafusion]

2025-07-15 Thread via GitHub
github-actions[bot] closed pull request #15259: [wip] update list & struct coercion to support incrementality URL: https://github.com/apache/datafusion/pull/15259

Re: [PR] Make Clickbench Q29 5x faster for datafusion [datafusion]

2025-07-15 Thread via GitHub
github-actions[bot] closed pull request #15532: Make Clickbench Q29 5x faster for datafusion URL: https://github.com/apache/datafusion/pull/15532

Re: [PR] Improve push down limit (logical optimizer rule) [datafusion]

2025-07-15 Thread via GitHub
github-actions[bot] closed pull request #15744: Improve push down limit (logical optimizer rule) URL: https://github.com/apache/datafusion/pull/15744

Re: [PR] POC: Parse to Merge Logical Plan [datafusion]

2025-07-15 Thread via GitHub
github-actions[bot] closed pull request #15862: POC: Parse to Merge Logical Plan URL: https://github.com/apache/datafusion/pull/15862

Re: [PR] Deprecate `ExprSchema` functions [datafusion]

2025-07-15 Thread via GitHub
github-actions[bot] commented on PR #15847: URL: https://github.com/apache/datafusion/pull/15847#issuecomment-3076487214 Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or

Re: [PR] fix: Allow ORDER BY aggregates not present in SELECT list [datafusion]

2025-07-15 Thread via GitHub
github-actions[bot] commented on PR #15876: URL: https://github.com/apache/datafusion/pull/15876#issuecomment-3076487145 Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or

Re: [PR] fix: hdfs read into buffer fully [datafusion-comet]

2025-07-15 Thread via GitHub
codecov-commenter commented on PR #2031: URL: https://github.com/apache/datafusion-comet/pull/2031#issuecomment-3076413162 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/2031?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_ca

Re: [PR] feat: Upgrade to the official DataFusion 49.0.0 release [datafusion-comet]

2025-07-15 Thread via GitHub
rishvin commented on code in PR #1997: URL: https://github.com/apache/datafusion-comet/pull/1997#discussion_r2208977125 ## native/Cargo.toml: ## @@ -38,8 +38,8 @@ arrow = { version = "55.1.0", features = ["prettyprint", "ffi", "chrono-tz"] } async-trait = { version = "0.1" }

Re: [PR] fix: hdfs read into buffer fully [datafusion-comet]

2025-07-15 Thread via GitHub
parthchandra commented on PR #2031: URL: https://github.com/apache/datafusion-comet/pull/2031#issuecomment-3076378047 @Kontinuation, fyi

[PR] fix: hdfs read into buffer fully [datafusion-comet]

2025-07-15 Thread via GitHub
parthchandra opened a new pull request, #2031: URL: https://github.com/apache/datafusion-comet/pull/2031 The `get` and `read_range` methods in hdfs object store implementation do not always read the data requested because the underlying hdfs call may return fewer bytes than requested. ht
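
The failure mode described here, a read call returning fewer bytes than requested, is usually handled by looping until the buffer is full. A minimal sketch using only the standard library follows; `read_fully` and `ShortReader` are hypothetical names for illustration and are not the code in the PR or the HDFS client API.

```rust
use std::io::{self, Read};

/// Keep calling `read` until `buf` is full or the source reports end-of-file.
/// Returns the number of bytes actually placed in `buf`.
fn read_fully<R: Read>(reader: &mut R, buf: &mut [u8]) -> io::Result<usize> {
    let mut filled = 0;
    while filled < buf.len() {
        match reader.read(&mut buf[filled..]) {
            Ok(0) => break, // end of file: nothing more to read
            Ok(n) => filled += n,
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
    Ok(filled)
}

/// Hypothetical reader that hands back at most `max_per_call` bytes per call,
/// imitating an underlying call that returns short reads.
struct ShortReader<'a> {
    data: &'a [u8],
    max_per_call: usize,
}

impl<'a> Read for ShortReader<'a> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let n = self.max_per_call.min(buf.len()).min(self.data.len());
        buf[..n].copy_from_slice(&self.data[..n]);
        self.data = &self.data[n..];
        Ok(n)
    }
}
```

A single `read` on `ShortReader` would fill only part of the buffer; `read_fully` keeps going until the requested length is reached or the data runs out. The standard library's `Read::read_exact` offers similar behavior but returns an error on early end-of-file.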

Re: [I] Expected: decimal(7,2), Found: DOUBLE when running TPC-DS benchmarks on Spark 3.5 [datafusion-comet]

2025-07-15 Thread via GitHub
parthchandra commented on issue #2029: URL: https://github.com/apache/datafusion-comet/issues/2029#issuecomment-3076327201 This error is likely because the schema for the column `sr_return_amt` is a `decimal(7,2)` while in the parquet file that field has a physical type of `double`. (see

[I] count_all() aggregations cannot be aliased [datafusion]

2025-07-15 Thread via GitHub
BlakeOrth opened a new issue, #16795: URL: https://github.com/apache/datafusion/issues/16795 ### Describe the bug Attempting to apply an `alias` to a `count_all()` aggregation results in an Internal Error with the following message: ``` called `Result::unwrap()` on an `Err`

Re: [I] Expected: decimal(7,2), Found: DOUBLE when running TPC-DS benchmarks on Spark 3.5 [datafusion-comet]

2025-07-15 Thread via GitHub
ShreyeshArangath commented on issue #2029: URL: https://github.com/apache/datafusion-comet/issues/2029#issuecomment-3075918084 Update: Looking at the executor logs, it looks like it's likely because it's not using HDFS as the data source. The file reader is using S3 related configs?

Re: [PR] fix: [iceberg] Add LogicalTypeAnnotation in ParquetColumnSpec [datafusion-comet]

2025-07-15 Thread via GitHub
kazuyukitanimura commented on PR #2000: URL: https://github.com/apache/datafusion-comet/pull/2000#issuecomment-3075840104 Thanks @huaxingao @andygrove @parthchandra @hsiang-c

Re: [PR] fix: [iceberg] Add LogicalTypeAnnotation in ParquetColumnSpec [datafusion-comet]

2025-07-15 Thread via GitHub
kazuyukitanimura merged PR #2000: URL: https://github.com/apache/datafusion-comet/pull/2000

Re: [PR] feat: change Expr Alias ,OuterReferenceColumn, Column to Box type for reducing expr struct size [datafusion]

2025-07-15 Thread via GitHub
alamb commented on PR #16771: URL: https://github.com/apache/datafusion/pull/16771#issuecomment-3075698800 🤖: Benchmark completed Details ``` group main reduce_expr_size -

[PR] fix tests in page_pruning when filter pushdown is enabled by default [datafusion]

2025-07-15 Thread via GitHub
XiangpengHao opened a new pull request, #16794: URL: https://github.com/apache/datafusion/pull/16794 ## Which issue does this PR close? This should unblock our way to enable filter pushdown by default. ## Rationale for this change When working on #16711 we kee

Re: [PR] Implement equals for stateful functions [datafusion]

2025-07-15 Thread via GitHub
alamb commented on code in PR #16781: URL: https://github.com/apache/datafusion/pull/16781#discussion_r2208644548 ## datafusion/doc/src/lib.rs: ## @@ -158,7 +158,7 @@ impl Documentation { } } -#[derive(Debug, Clone, PartialEq)] +#[derive(Debug, Clone, PartialEq, Hash)]

Re: [PR] fix: [iceberg] Switch to OSS Spark and run Iceberg Spark tests in parallel [datafusion-comet]

2025-07-15 Thread via GitHub
hsiang-c commented on PR #1987: URL: https://github.com/apache/datafusion-comet/pull/1987#issuecomment-3075575929 Most of the exceptions in Iceberg Spark SQL Tests can be reproduced by 1. Follow the official guide to build Comet and Iceberg, configure Spark shell and populate the Ice

Re: [PR] cache generation of dictionary keys and null arrays for ScalarValue [datafusion]

2025-07-15 Thread via GitHub
adriangb commented on PR #16789: URL: https://github.com/apache/datafusion/pull/16789#issuecomment-3075549678 > With this code, can we now also remove ZeroBufferGenerators ? Perhaps as a follow on PR? Not yet. That will still be used for projection (selecting a partition column) unti

Re: [PR] feat: randn expression support [datafusion-comet]

2025-07-15 Thread via GitHub
mbutrovich commented on code in PR #2010: URL: https://github.com/apache/datafusion-comet/pull/2010#discussion_r2208611091 ## spark/src/test/scala/org/apache/comet/CometExpressionSuite.scala: ## @@ -2765,6 +2765,26 @@ class CometExpressionSuite extends CometTestBase with Adapti

Re: [PR] Add reproducing test cases for stackoverflows [datafusion]

2025-07-15 Thread via GitHub
alamb commented on PR #16787: URL: https://github.com/apache/datafusion/pull/16787#issuecomment-3075487540 cc @gabotechs and @LiaCastaneda

Re: [PR] fix: support nullable columns in pre-sorted data sources [datafusion]

2025-07-15 Thread via GitHub
alamb commented on code in PR #16783: URL: https://github.com/apache/datafusion/pull/16783#discussion_r2208603160 ## datafusion/datasource/src/statistics.rs: ## @@ -230,14 +230,7 @@ impl MinMaxStatistics { .zip(sort_columns.iter().copied()) .map

Re: [I] Release DataFusion `49.0.0` (July 2025) [datafusion]

2025-07-15 Thread via GitHub
adriangb commented on issue #16235: URL: https://github.com/apache/datafusion/issues/16235#issuecomment-3075486861 I've opened https://github.com/apache/datafusion/pull/16791 to address backwards compatibility

Re: [PR] Restore custom SchemaAdapter functionality for Parquet [datafusion]

2025-07-15 Thread via GitHub
adriangb commented on PR #16791: URL: https://github.com/apache/datafusion/pull/16791#issuecomment-3075481633 > I'd like to add a unit test that confirms the custom schema adapter factory will be used if specified. done

Re: [PR] Fix: Preserve sorting for the COPY TO plan [datafusion]

2025-07-15 Thread via GitHub
alamb commented on code in PR #16785: URL: https://github.com/apache/datafusion/pull/16785#discussion_r2208580080 ## datafusion/core/tests/dataframe/mod.rs: ## @@ -6193,3 +6194,59 @@ async fn test_copy_schema() -> Result<()> { assert_logical_expr_schema_eq_physical_expr_sch

Re: [PR] cache generation of dictionary keys and null arrays for ScalarValue [datafusion]

2025-07-15 Thread via GitHub
alamb commented on code in PR #16789: URL: https://github.com/apache/datafusion/pull/16789#discussion_r2208548427 ## datafusion/common/src/scalar/mod.rs: ## @@ -854,6 +854,140 @@ pub fn get_dict_value( Ok((dict_array.values(), dict_array.key(index))) } +/// Cache for dic

Re: [PR] feat: change Expr Alias ,OuterReferenceColumn, Column to Box type for reducing expr struct size [datafusion]

2025-07-15 Thread via GitHub
alamb commented on PR #16771: URL: https://github.com/apache/datafusion/pull/16771#issuecomment-3075424716 🤖 `./gh_compare_branch_bench.sh` [Benchmark Script](https://github.com/alamb/datafusion-benchmarking/blob/main/gh_compare_branch_bench.sh) Running Linux aal-dev 6.11.0-1016-gcp #16~

Re: [PR] Add support for Float16 type in substrait [datafusion]

2025-07-15 Thread via GitHub
alamb commented on PR #16793: URL: https://github.com/apache/datafusion/pull/16793#issuecomment-3075381830 Thank you @jatin510 @gabotechs or @LiaCastaneda would you have time to review this PR?

Re: [PR] chore(deps): bump prost-build from 0.13.5 to 0.14.1 in the proto group [datafusion]

2025-07-15 Thread via GitHub
dependabot[bot] commented on PR #16546: URL: https://github.com/apache/datafusion/pull/16546#issuecomment-3075345489 This pull request was built based on a group rule. Closing it will not ignore any of these versions in future pull requests. To ignore these dependencies, configure [ig

Re: [PR] chore(deps): bump prost-build from 0.13.5 to 0.14.1 in the proto group [datafusion]

2025-07-15 Thread via GitHub
alamb closed pull request #16546: chore(deps): bump prost-build from 0.13.5 to 0.14.1 in the proto group URL: https://github.com/apache/datafusion/pull/16546

Re: [PR] Support min/max aggregates for FixedSizeBinary type [datafusion]

2025-07-15 Thread via GitHub
alamb commented on PR #16765: URL: https://github.com/apache/datafusion/pull/16765#issuecomment-3075342025 Thanks again @theirix

Re: [I] FixedSizeBinary support in min/max accumulators [datafusion]

2025-07-15 Thread via GitHub
alamb closed issue #16513: FixedSizeBinary support in min/max accumulators URL: https://github.com/apache/datafusion/issues/16513

Re: [PR] Support min/max aggregates for FixedSizeBinary type [datafusion]

2025-07-15 Thread via GitHub
alamb merged PR #16765: URL: https://github.com/apache/datafusion/pull/16765

Re: [I] Bug: `make_date(year, month, day)` reports error if one of the fileds is NULL [datafusion]

2025-07-15 Thread via GitHub
alamb closed issue #16746: Bug: `make_date(year, month, day)` reports error if one of the fileds is NULL URL: https://github.com/apache/datafusion/issues/16746

Re: [PR] fix: return NULL if any of the param to make_date is NULL [datafusion]

2025-07-15 Thread via GitHub
alamb commented on PR #16759: URL: https://github.com/apache/datafusion/pull/16759#issuecomment-3075340476 Thanks again @feniljain and @xudong963

Re: [PR] fix: return NULL if any of the param to make_date is NULL [datafusion]

2025-07-15 Thread via GitHub
alamb merged PR #16759: URL: https://github.com/apache/datafusion/pull/16759
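
The fix named in this PR title is NULL propagation: if any argument to `make_date` is NULL, the result is NULL instead of an error. A minimal sketch of that rule, modeling SQL NULL as Rust's `None`; `make_date_parts` is a hypothetical helper that does no calendar validation and is not DataFusion's implementation.

```rust
/// Return the (year, month, day) triple only when every input is non-NULL.
/// `None` models SQL NULL; any NULL input makes the whole result NULL.
fn make_date_parts(
    year: Option<i32>,
    month: Option<u32>,
    day: Option<u32>,
) -> Option<(i32, u32, u32)> {
    // The `?` operator returns `None` early as soon as one argument is `None`.
    Some((year?, month?, day?))
}
```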

Re: [PR] POC: Test DataFusion with experimental Parquet Filter Pushdown (try 4) [datafusion]

2025-07-15 Thread via GitHub
alamb commented on PR #16711: URL: https://github.com/apache/datafusion/pull/16711#issuecomment-3075251060 🤖: Benchmark completed Details ``` Comparing HEAD and alamb_test_pushdown Benchmark clickbench_pushdown.json

Re: [PR] add filter to handle backtrace [datafusion]

2025-07-15 Thread via GitHub
blaginin commented on PR #16752: URL: https://github.com/apache/datafusion/pull/16752#issuecomment-3075203014 Thanks @geetanshjuneja 🙏

Re: [PR] add filter to handle backtrace [datafusion]

2025-07-15 Thread via GitHub
blaginin merged PR #16752: URL: https://github.com/apache/datafusion/pull/16752

Re: [I] Handle panic stacktrace in `datafusion-cli` tests [datafusion]

2025-07-15 Thread via GitHub
blaginin closed issue #16146: Handle panic stacktrace in `datafusion-cli` tests URL: https://github.com/apache/datafusion/issues/16146

Re: [PR] POC: Test DataFusion with experimental Parquet Filter Pushdown (try 4) [datafusion]

2025-07-15 Thread via GitHub
alamb commented on PR #16711: URL: https://github.com/apache/datafusion/pull/16711#issuecomment-3075168061 🤖 `./gh_compare_branch.sh` [Benchmark Script](https://github.com/alamb/datafusion-benchmarking/blob/main/gh_compare_branch.sh) Running Linux aal-dev 6.11.0-1016-gcp #16~24.04.1-Ubun

Re: [PR] POC: Test DataFusion with experimental Parquet Filter Pushdown (try 4) [datafusion]

2025-07-15 Thread via GitHub
alamb commented on PR #16711: URL: https://github.com/apache/datafusion/pull/16711#issuecomment-3075166705 🤖 `./gh_compare_branch.sh` [Benchmark Script](https://github.com/alamb/datafusion-benchmarking/blob/main/gh_compare_branch.sh) Running Linux aal-dev 6.11.0-1016-gcp #16~24.04.1-Ubun

[PR] Add support for Float16 type in substrait [datafusion]

2025-07-15 Thread via GitHub
jatin510 opened a new pull request, #16793: URL: https://github.com/apache/datafusion/pull/16793 ## Which issue does this PR close? - Closes https://github.com/apache/datafusion/issues/16298. ## Rationale for this change ## What changes are included in thi

Re: [PR] DRAFT: Update arrow/parquet to 56.0.0 [datafusion]

2025-07-15 Thread via GitHub
alamb commented on PR #16690: URL: https://github.com/apache/datafusion/pull/16690#issuecomment-3074932979 🤖: Benchmark completed Details ``` Comparing HEAD and alamb_update_arrow_56.0.0 Benchmark clickbench_pushdown.json

Re: [PR] DRAFT: Update arrow/parquet to 56.0.0 [datafusion]

2025-07-15 Thread via GitHub
alamb commented on PR #16690: URL: https://github.com/apache/datafusion/pull/16690#issuecomment-3074793041 🤖 `./gh_compare_branch.sh` [Benchmark Script](https://github.com/alamb/datafusion-benchmarking/blob/main/gh_compare_branch.sh) Running Linux aal-dev 6.11.0-1016-gcp #16~24.04.1-Ubun

[I] joins::nested_loop_join::tests::join_maintains_right_order tests take over 60 seconds [datafusion]

2025-07-15 Thread via GitHub
alamb opened a new issue, #16792: URL: https://github.com/apache/datafusion/issues/16792 ### Describe the bug Some CI tests are taking over 60 seconds to complete, which results in slower CI runs and slower local development ### To Reproduce Using cargo nextest https

Re: [PR] fix: skip predicates on struct unnest in FilterPushdown [datafusion]

2025-07-15 Thread via GitHub
alamb commented on PR #16790: URL: https://github.com/apache/datafusion/pull/16790#issuecomment-3074709053 @adriangb is there any chance you have time to review this PR?

Re: [I] Add a datafusion benchmark for `filter_pushdown` [datafusion]

2025-07-15 Thread via GitHub
alamb closed issue #16729: Add a datafusion benchmark for `filter_pushdown` URL: https://github.com/apache/datafusion/issues/16729

Re: [PR] Add `clickbench_pushdown` benchmark [datafusion]

2025-07-15 Thread via GitHub
alamb merged PR #16731: URL: https://github.com/apache/datafusion/pull/16731

Re: [PR] Restore custom SchemaAdapter functionality for Parquet [datafusion]

2025-07-15 Thread via GitHub
adriangb commented on PR #16791: URL: https://github.com/apache/datafusion/pull/16791#issuecomment-3074657503 I'd like to add a unit test that confirms the custom schema adapter factory will be used if specified.

[PR] Chore: Improve array contains test coverage [datafusion-comet]

2025-07-15 Thread via GitHub
kazantsev-maksim opened a new pull request, #2030: URL: https://github.com/apache/datafusion-comet/pull/2030 ## Which issue does this PR close? Part of: https://github.com/apache/datafusion-comet/issues/1902 ## Rationale for this change Part of: https://github.com/apache/

[PR] add fallback and docs [datafusion]

2025-07-15 Thread via GitHub
adriangb opened a new pull request, #16791: URL: https://github.com/apache/datafusion/pull/16791 https://github.com/apache/datafusion/issues/16235#issuecomment-3067125182 This essentially partially reverts https://github.com/apache/datafusion/pull/16461 by keeping backward compatibil

Re: [PR] feat: change Expr Alias ,OuterReferenceColumn, Column to Box type for reducing expr struct size [datafusion]

2025-07-15 Thread via GitHub
alamb commented on PR #16771: URL: https://github.com/apache/datafusion/pull/16771#issuecomment-3074440773 🤖: Benchmark completed Details ``` group main reduce_expr_size -

[PR] fix: skip predicates on struct unnest in FilterPushdown [datafusion]

2025-07-15 Thread via GitHub
akoshchiy opened a new pull request, #16790: URL: https://github.com/apache/datafusion/pull/16790 ## Which issue does this PR close? - Closes #16695. ## Rationale for this change ## What changes are included in this PR? ## Are these changes

[I] Expected: decimal(7,2), Found: DOUBLE when running TPC-DS benchmarks on Spark 3.5 [datafusion-comet]

2025-07-15 Thread via GitHub
ShreyeshArangath opened a new issue, #2029: URL: https://github.com/apache/datafusion-comet/issues/2029 ### Describe the bug Built the most recent version of Comet for some internal benchmarking, but running into this following issue ``` org.apache.spark.SparkException: Parq

[PR] [WIP] chore: refactor Comparison out of QueryPlanSerde [datafusion-comet]

2025-07-15 Thread via GitHub
CuteChuanChuan opened a new pull request, #2028: URL: https://github.com/apache/datafusion-comet/pull/2028 This is a work-in-progress implementation for issue #2019. Following the pattern established in #2027 (math expressions) and #2026 (array expressions), this PR refactors comp

Re: [PR] chore(deps): bump prost-build from 0.13.5 to 0.14.1 in the proto group [datafusion]

2025-07-15 Thread via GitHub
crepererum commented on PR #16546: URL: https://github.com/apache/datafusion/pull/16546#issuecomment-3073623302 Let's wait for upstream (i.e. arrow) to upgrade, so we don't pull two versions of prost

Re: [PR] feat: Add JNI-based Hadoop FileSystem support for S3 and other Hadoop-compatible stores [datafusion-comet]

2025-07-15 Thread via GitHub
parthchandra commented on PR #1992: URL: https://github.com/apache/datafusion-comet/pull/1992#issuecomment-3074316229 > The [`get`](https://github.com/apache/datafusion-comet/blob/0.9.0/native/hdfs/src/object_store/hdfs.rs#L138-L163) and [`read_range`](https://github.com/apache/datafusion

Re: [PR] feat: change Expr Alias ,OuterReferenceColumn, Column to Box type for reducing expr struct size [datafusion]

2025-07-15 Thread via GitHub
alamb commented on PR #16771: URL: https://github.com/apache/datafusion/pull/16771#issuecomment-3074267706 🤖 `./gh_compare_branch_bench.sh` [Benchmark Script](https://github.com/alamb/datafusion-benchmarking/blob/main/gh_compare_branch_bench.sh) Running Linux aal-dev 6.11.0-1016-gcp #16~

Re: [PR] [Update author info] Blog: Embedding User-Defined Indexes in Apache Parquet Files #79 [datafusion-site]

2025-07-15 Thread via GitHub
alamb commented on PR #89: URL: https://github.com/apache/datafusion-site/pull/89#issuecomment-3074254910 > > Hi @alamb One small typo in https://bsky.app/profile/andrewlamb.bsky.social I read: the name should be `Qi Zhu` as https://github.com/zhuqi-lucas > > Thank you @JigaoLuo

Re: [I] Optimize the join operators [datafusion]

2025-07-15 Thread via GitHub
zhuqi-lucas commented on issue #16710: URL: https://github.com/apache/datafusion/issues/16710#issuecomment-3074225566 > DataFusion is underperforming the Polars streaming engine on some localhost join queries (1e8 rows of data on a Macbook M3 with 16GB of RAM): > > https://private-use

Re: [PR] Per file filter evaluation [datafusion]

2025-07-15 Thread via GitHub
alamb commented on PR #15057: URL: https://github.com/apache/datafusion/pull/15057#issuecomment-3074155476 Thank you @adriangb

Re: [PR] Automatically split large single RecordBatches in `MemorySource` into smaller batches [datafusion]

2025-07-15 Thread via GitHub
alamb commented on code in PR #16734: URL: https://github.com/apache/datafusion/pull/16734#discussion_r2207881274 ## datafusion/physical-plan/src/stream.rs: ## @@ -522,6 +524,139 @@ impl Stream for ObservedStream { } } +pin_project! { +/// Stream wrapper that splits

[PR] Postgres: ALTER TABLE SET ( storage_parameters ) [datafusion-sqlparser-rs]

2025-07-15 Thread via GitHub
achristmascarl opened a new pull request, #1947: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1947 Closes #1946

[PR] cache generation of dictionary keys and null arrays for ScalarValue [datafusion]

2025-07-15 Thread via GitHub
adriangb opened a new pull request, #16789: URL: https://github.com/apache/datafusion/pull/16789 This was inspired by what is already being done for partition values: https://github.com/apache/datafusion/blob/62dbebdeaa78782aa7fe357ce629684a7ec143de/datafusion/datasource/src/file_scan_config

Re: [PR] Per file filter evaluation [datafusion]

2025-07-15 Thread via GitHub
alamb merged PR #15057: URL: https://github.com/apache/datafusion/pull/15057

Re: [PR] Per file filter evaluation [datafusion]

2025-07-15 Thread via GitHub
adriangb commented on PR #15057: URL: https://github.com/apache/datafusion/pull/15057#issuecomment-3073986907 I plan on merging this once CI passes. It has been approved / reviewed and we need the customization of the rewriters for https://github.com/apache/datafusion/issues/16235#issuecomm

Re: [I] Release DataFusion `49.0.0` (July 2025) [datafusion]

2025-07-15 Thread via GitHub
adriangb commented on issue #16235: URL: https://github.com/apache/datafusion/issues/16235#issuecomment-3073988803 xpost: https://github.com/apache/datafusion/pull/15057#issuecomment-3073986907

Re: [I] Support `VARIANT` type for unstructured data [datafusion]

2025-07-15 Thread via GitHub
paleolimbot commented on issue #16116: URL: https://github.com/apache/datafusion/issues/16116#issuecomment-3073876353 Just listing a few specific places where I've had to integrate extension types outside of existing DataFusion mechanisms: - The `Signature` (i.e., how do you use the S

Re: [I] Optimize the join operators [datafusion]

2025-07-15 Thread via GitHub
Dandandan commented on issue #16710: URL: https://github.com/apache/datafusion/issues/16710#issuecomment-3073881761 Thanks @nuno-faria that's a great insight (for TPC-H / very nested joins we probably should implement a smarter join order algorithm). For h2o joins however, it seems it

[I] Postgres ALTER TABLE SET ( storage_parameter [= value] [, ... ] ) fails to parse [datafusion-sqlparser-rs]

2025-07-15 Thread via GitHub
achristmascarl opened a new issue, #1946: URL: https://github.com/apache/datafusion-sqlparser-rs/issues/1946 The following example is valid syntax (https://www.postgresql.org/docs/current/sql-altertable.html) but fails to parse: ```sql ALTER TABLE your_table SET ( autovacuum_v

Re: [PR] Per file filter evaluation [datafusion]

2025-07-15 Thread via GitHub
adriangb commented on code in PR #15057: URL: https://github.com/apache/datafusion/pull/15057#discussion_r2207679888 ## datafusion-examples/examples/default_column_values.rs: ## @@ -0,0 +1,366 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contribu

[PR] MySQL: EXPLAIN ANALYZE format type [datafusion-sqlparser-rs]

2025-07-15 Thread via GitHub
yoavcloud opened a new pull request, #1945: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1945 Add support for the MySQL `FORMAT=` syntax for `EXPLAIN ANALYZE`

Re: [PR] Per file filter evaluation [datafusion]

2025-07-15 Thread via GitHub
adriangb commented on code in PR #15057: URL: https://github.com/apache/datafusion/pull/15057#discussion_r2207512528 ## datafusion-examples/examples/default_column_values.rs: ## @@ -0,0 +1,366 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contribu

Re: [PR] Per file filter evaluation [datafusion]

2025-07-15 Thread via GitHub
adriangb commented on code in PR #15057: URL: https://github.com/apache/datafusion/pull/15057#discussion_r2207520996 ## datafusion-examples/examples/default_column_values.rs: ## @@ -0,0 +1,366 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contribu

Re: [PR] Per file filter evaluation [datafusion]

2025-07-15 Thread via GitHub
adriangb commented on code in PR #15057: URL: https://github.com/apache/datafusion/pull/15057#discussion_r2207512528 ## datafusion-examples/examples/default_column_values.rs: ## @@ -0,0 +1,366 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contribu

Re: [PR] Fix for Postgres regex and like binary operators [datafusion-sqlparser-rs]

2025-07-15 Thread via GitHub
solontsev commented on code in PR #1928: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1928#discussion_r2207480227 ## tests/sqlparser_postgres.rs: ## @@ -2207,19 +2223,31 @@ fn parse_pg_like_match_ops() { ]; for (str_op, op) in pg_like_match_ops { -

Re: [PR] Remove fixed version from MSRV check [datafusion]

2025-07-15 Thread via GitHub
crepererum merged PR #16786: URL: https://github.com/apache/datafusion/pull/16786

Re: [PR] fix: support nullable columns in pre-sorted data sources [datafusion]

2025-07-15 Thread via GitHub
crepererum commented on code in PR #16783: URL: https://github.com/apache/datafusion/pull/16783#discussion_r2207466414 ## datafusion/sqllogictest/test_files/parquet.slt: ## @@ -130,8 +130,7 @@ STORED AS PARQUET; 3 -# Check output plan again, expect no "output_ordering"

Re: [PR] feat: change Expr Alias ,OuterReferenceColumn, Column to Box type for reducing expr struct size [datafusion]

2025-07-15 Thread via GitHub
zhuqi-lucas commented on code in PR #16771: URL: https://github.com/apache/datafusion/pull/16771#discussion_r2207426064 ## datafusion/sqllogictest/test_files/joins.slt: ## @@ -4009,12 +4009,12 @@ logical_plan 09)Unnest: lists[__unnest_placeholder(generate_series(In
