Re: [PR] feat: introduce hadoop mini cluster to test native scan on hdfs [datafusion-comet]

2025-04-04 Thread via GitHub
wForget commented on code in PR #1556: URL: https://github.com/apache/datafusion-comet/pull/1556#discussion_r2006771877 ## spark/src/test/scala/org/apache/spark/sql/benchmark/CometReadBenchmark.scala: ## @@ -63,6 +65,7 @@ object CometReadBenchmark extends CometBenchmarkBase {

[PR] Migrate datafusion/sql tests to insta, part4 [datafusion]

2025-04-04 Thread via GitHub
qstommyshu opened a new pull request, #15548: URL: https://github.com/apache/datafusion/pull/15548 ## Which issue does this PR close? - Related #15484, #15397, #15497, #15499, #15533 - this is a part of #15484 breaking down. - Checkout things to note of the whole migrati

Re: [I] Unsupported Arrow Vector for export: class org.apache.arrow.vector.complex.ListVector [datafusion-comet]

2025-04-04 Thread via GitHub
andygrove commented on issue #1289: URL: https://github.com/apache/datafusion-comet/issues/1289#issuecomment-2779554887 I now have a repro for this issue in https://github.com/apache/datafusion-comet/pull/1610 -- This is an automated message from the Apache Git Service. To respond to the

Re: [PR] Add dynamic pruning filters from TopK state [datafusion]

2025-04-04 Thread via GitHub
adriangb commented on PR #15301: URL: https://github.com/apache/datafusion/pull/15301#issuecomment-2771480246 > FYI I will likely try and review this PR again carefully first thing tomorrow morning I got everything mostly nice... but it seems there are still a couple minor bugs. Will

Re: [PR] docs: various improvements to tuning guide [datafusion-comet]

2025-04-04 Thread via GitHub
andygrove commented on code in PR #1525: URL: https://github.com/apache/datafusion-comet/pull/1525#discussion_r2003562001 ## common/src/main/scala/org/apache/comet/CometConf.scala: ## @@ -274,11 +272,9 @@ object CometConf extends ShimCometConf { .createWithDefault(true)

Re: [PR] Remove CoalescePartitions insertion from HashJoinExec [datafusion]

2025-04-04 Thread via GitHub
ctsk commented on PR #15476: URL: https://github.com/apache/datafusion/pull/15476#issuecomment-2762368030 I've amended the PR so that `ExecutionPlan::execute` fails if one tries to execute such a problematic plan.

Re: [PR] Clean up hash_join's ExecutionPlan::execute [datafusion]

2025-04-04 Thread via GitHub
ctsk commented on PR #15418: URL: https://github.com/apache/datafusion/pull/15418#issuecomment-2764253958 Closed in favor of https://github.com/apache/datafusion/pull/15476

Re: [PR] parquet reader: move pruning predicate creation from ParquetSource to ParquetOpener [datafusion]

2025-04-04 Thread via GitHub
alamb commented on PR #15561: URL: https://github.com/apache/datafusion/pull/15561#issuecomment-2779567108 🤖 `./gh_compare_branch.sh` [Benchmark Script](https://github.com/alamb/datafusion-benchmarking) Running Linux aal-dev 6.8.0-1016-gcp #18-Ubuntu SMP Fri Oct 4 22:16:29 UTC 2024 x86_

Re: [PR] Migrate datafusion/sql tests to insta, part6 [datafusion]

2025-04-04 Thread via GitHub
qstommyshu commented on PR #15578: URL: https://github.com/apache/datafusion/pull/15578#issuecomment-2779568810 This is the last PR for #15397 The migration is mostly done. There are just a few really big tests that I think they would be too big to break down and use inline assertion

Re: [PR] parquet reader: move pruning predicate creation from ParquetSource to ParquetOpener [datafusion]

2025-04-04 Thread via GitHub
adriangb commented on PR #15561: URL: https://github.com/apache/datafusion/pull/15561#issuecomment-2779541725 A caching implementation also adds overhead, if we can't measure it how do we know the caching version is not slower, etc.

Re: [I] TPCH DataGen Not working [datafusion-comet]

2025-04-04 Thread via GitHub
andygrove closed issue #1157: TPCH DataGen Not working URL: https://github.com/apache/datafusion-comet/issues/1157

Re: [PR] parquet reader: move pruning predicate creation from ParquetSource to ParquetOpener [datafusion]

2025-04-04 Thread via GitHub
adriangb commented on PR #15561: URL: https://github.com/apache/datafusion/pull/15561#issuecomment-2779540313 > This would be great -- maybe you can file a ticket to track it I mean my point is: if we think the impact on performance is minimal / not measurable in a real world scenario

Re: [PR] [ignore] see which tests do not explicitly enable Comet [datafusion-comet]

2025-04-04 Thread via GitHub
codecov-commenter commented on PR #1559: URL: https://github.com/apache/datafusion-comet/pull/1559#issuecomment-2738147280 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1559?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_ca

Re: [PR] Introduce load-balanced `split_groups_by_statistics` method [datafusion]

2025-04-04 Thread via GitHub
alamb commented on PR #15473: URL: https://github.com/apache/datafusion/pull/15473#issuecomment-2779542553 > I'm curious why the PR triggers the `security audit` CI - It is not related to this PR: https://github.com/apache/datafusion/issues/15571

Re: [PR] (WIP) Upgrading to arrow 55 [datafusion]

2025-04-04 Thread via GitHub
alamb commented on PR #15466: URL: https://github.com/apache/datafusion/pull/15466#issuecomment-2779539336 🤖 `./gh_compare_branch.sh` [Benchmark Script](https://github.com/alamb/datafusion-benchmarking) Running Linux aal-dev 6.8.0-1016-gcp #18-Ubuntu SMP Fri Oct 4 22:16:29 UTC 2024 x86_

Re: [PR] Add SQL logic tests for compound field access in JOIN conditions [datafusion]

2025-04-04 Thread via GitHub
alexwilcoxson-rel commented on PR #15556: URL: https://github.com/apache/datafusion/pull/15556#issuecomment-2775874398 The SLT added here should suffice: https://github.com/apache/datafusion/pull/15153/files#diff-065f63a22dadb8c281756ba1a6354ec6661224df0aae1398912e57259d1c6cee

Re: [PR] perf: replace `merge` `uninitiated_partitions` `VecDeque` with custom fixed size queue [datafusion]

2025-04-04 Thread via GitHub
alamb commented on PR #15562: URL: https://github.com/apache/datafusion/pull/15562#issuecomment-2776774911 FYI @jayzhan211

Re: [PR] Add dynamic pruning filters from TopK state [datafusion]

2025-04-04 Thread via GitHub
alamb commented on code in PR #15301: URL: https://github.com/apache/datafusion/pull/15301#discussion_r2021697091 ## datafusion/physical-plan/src/sorts/sort.rs: ## @@ -1197,35 +1197,55 @@ impl ExecutionPlan for SortExec { ) -> Result { trace!("Start SortExec::execu

Re: [PR] parquet reader: move pruning predicate creation from ParquetSource to ParquetOpener [datafusion]

2025-04-04 Thread via GitHub
adriangb commented on PR #15561: URL: https://github.com/apache/datafusion/pull/15561#issuecomment-2779564211 Agreed then let's just proceed as is. Thanks for your help with this; sorry I was a bit MIA this morning!

Re: [PR] fix: Making shuffle files generated in native shuffle mode reclaimable [datafusion-comet]

2025-04-04 Thread via GitHub
kazuyukitanimura commented on code in PR #1568: URL: https://github.com/apache/datafusion-comet/pull/1568#discussion_r2025420990 ## spark/src/main/scala/org/apache/spark/sql/comet/execution/shuffle/CometShuffleExchangeExec.scala: ## @@ -232,20 +222,23 @@ object CometShuffleExcha

Re: [PR] Change default `EXPLAIN` format in `datafusion-cli` to `tree` format [datafusion]

2025-04-04 Thread via GitHub
alamb commented on PR #15427: URL: https://github.com/apache/datafusion/pull/15427#issuecomment-2766364269 🚀 📖 Thanks again @blaginin

[PR] feat: make scheduler session context stateless [datafusion-ballista]

2025-04-04 Thread via GitHub
milenkovicm opened a new pull request, #1226: URL: https://github.com/apache/datafusion-ballista/pull/1226 ... as session context was not cleaned up. # Which issue does this PR close? Closes #1220. # Rationale for this change - Scheduler was keeping HashMap of ses

Re: [PR] Only unnest source for `EmptyRelation` [datafusion]

2025-04-04 Thread via GitHub
blaginin commented on code in PR #15159: URL: https://github.com/apache/datafusion/pull/15159#discussion_r2007638250 ## datafusion/sql/tests/cases/plan_to_sql.rs: ## Review Comment: Sorry -- before I added one more test, but then I have to remove it due to (https://github.

Re: [PR] feat: make scheduler session context stateless [datafusion-ballista]

2025-04-04 Thread via GitHub
Copilot commented on code in PR #1226: URL: https://github.com/apache/datafusion-ballista/pull/1226#discussion_r2029326895 ## ballista/scheduler/src/scheduler_server/grpc.rs: ## @@ -276,10 +276,11 @@ impl SchedulerGrpc let session_config = session_config.

[PR] chore(deps): bump blake3 from 1.7.0 to 1.8.0 [datafusion]

2025-04-04 Thread via GitHub
dependabot[bot] opened a new pull request, #15502: URL: https://github.com/apache/datafusion/pull/15502 Bumps [blake3](https://github.com/BLAKE3-team/BLAKE3) from 1.7.0 to 1.8.0. Release notes Sourced from [blake3's releases](https://github.com/BLAKE3-team/BLAKE3/releases). 1

Re: [PR] fix: check if handle has been initialized before closing [datafusion-comet]

2025-04-04 Thread via GitHub
andygrove merged PR #1554: URL: https://github.com/apache/datafusion-comet/pull/1554

Re: [I] Add most functions to the Expr class so that they're chainable. [datafusion-python]

2025-04-04 Thread via GitHub
timsaucer commented on issue #1064: URL: https://github.com/apache/datafusion-python/issues/1064#issuecomment-2741380684 The reluctance is only because I think some of them don't improve ergonomics for our users. The 1 and 2 functions are *absolutely* helpful. But I can be persuaded othe

Re: [I] Use spill manager in row hasher [datafusion]

2025-04-04 Thread via GitHub
alamb closed issue #15401: Use spill manager in row hasher URL: https://github.com/apache/datafusion/issues/15401

Re: [PR] feat: add test to check for `ctx.read_json()` [datafusion-ballista]

2025-04-04 Thread via GitHub
milenkovicm commented on PR #1212: URL: https://github.com/apache/datafusion-ballista/pull/1212#issuecomment-2742580843 I believe the whole job should be cancelled

Re: [PR] Blog post on Parquet pruning in datafusion [datafusion-site]

2025-04-04 Thread via GitHub
alamb merged PR #60: URL: https://github.com/apache/datafusion-site/pull/60

Re: [D] More thorough contribution guideline [datafusion]

2025-04-04 Thread via GitHub
GitHub user ozankabak added a comment to the discussion: More thorough contribution guideline Thank you @logan-keede for this. There is indeed a lot of refactoring going on and I think we can do much better w.r.t. how we approach refactoring. A few thoughts: - We don't do a good job at managi

[PR] Add support for MSSQL IF/ELSE statements. [datafusion-sqlparser-rs]

2025-04-04 Thread via GitHub
romanb opened a new pull request, #1791: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1791 This PR is a follow-up to https://github.com/apache/datafusion-sqlparser-rs/pull/1741. In particular, it addresses the parsing of IF/ELSE statements for MSSQL. These are syntactically

[PR] Fix: Snowflake ALTER SESSION cannot be followed by other statements. [datafusion-sqlparser-rs]

2025-04-04 Thread via GitHub
romanb opened a new pull request, #1786: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1786 Fixes https://github.com/apache/datafusion-sqlparser-rs/issues/1775. Currently `parse_session_options` for Snowflake does not check for semicolons, which makes it impossible to pa

Re: [PR] Draft: Make Clickbench Q29 5x faster for datafusion [datafusion]

2025-04-04 Thread via GitHub
Dandandan commented on code in PR #15532: URL: https://github.com/apache/datafusion/pull/15532#discussion_r2024211429 ## datafusion/sqllogictest/test_files/explain.slt: ## @@ -244,6 +244,159 @@ physical_plan DataSourceExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/ ph

Re: [PR] (WIP) Upgrading to arrow 55 [datafusion]

2025-04-04 Thread via GitHub
alamb commented on PR #15466: URL: https://github.com/apache/datafusion/pull/15466#issuecomment-2778655813 `./gh_compare_branch.sh` [Benchmark Script](https://github.com/alamb/datafusion-benchmarking) Running Linux aal-dev 6.8.0-1016-gcp #18-Ubuntu SMP Fri Oct 4 22:16:29 UTC 2024 x86_64

Re: [PR] Add `statistics_by_partition API` to ExecutionPlan [datafusion]

2025-04-04 Thread via GitHub
xudong963 commented on PR #15503: URL: https://github.com/apache/datafusion/pull/15503#issuecomment-2775970366 @alamb @berkaysynnada My thought about unifying the two methods: ```rust /// Specifies what statistics to compute pub enum StatisticsType { /// Only compute global st

Re: [I] Running Spark Shell with Comet throws Exception [datafusion-comet]

2025-04-04 Thread via GitHub
andygrove commented on issue #872: URL: https://github.com/apache/datafusion-comet/issues/872#issuecomment-2775874807 This issue has not been updated for a long time so I will close. @radhikabajaj123 Please feel free to reopen if you still have the issue.

Re: [PR] chore: Create simple fuzz test as part of test suite [datafusion-comet]

2025-04-04 Thread via GitHub
comphead commented on code in PR #1610: URL: https://github.com/apache/datafusion-comet/pull/1610#discussion_r2029557330 ## common/src/main/scala/org/apache/spark/sql/comet/util/Utils.scala: ## @@ -278,7 +277,7 @@ object Utils { case v @ (_: BitVector | _: TinyIntVector |

Re: [PR] feat: add MAP type support for first level [datafusion-comet]

2025-04-04 Thread via GitHub
comphead commented on PR #1603: URL: https://github.com/apache/datafusion-comet/pull/1603#issuecomment-2779913933 Some of failing tests with MapVector could be fixed like https://github.com/apache/datafusion-comet/pull/1610#discussion_r2029452928

[PR] Add disk usage limit configuration to datafusion-cli [datafusion]

2025-04-04 Thread via GitHub
jsai28 opened a new pull request, #15586: URL: https://github.com/apache/datafusion/pull/15586 ## Which issue does this PR close? Closes #15553. ## Rationale for this change Allows users to specify a disk limit for spill queries. ## What changes are included in this PR?

Re: [PR] minor: Fix clippy warnings [datafusion-comet]

2025-04-04 Thread via GitHub
codecov-commenter commented on PR #1606: URL: https://github.com/apache/datafusion-comet/pull/1606#issuecomment-2775693271 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1606?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_ca

Re: [I] Will Comet support closed-source forks of Apache Spark (e.g. CSP versions)? [datafusion-comet]

2025-04-04 Thread via GitHub
andygrove closed issue #414: Will Comet support closed-source forks of Apache Spark (e.g. CSP versions)? URL: https://github.com/apache/datafusion-comet/issues/414

Re: [PR] Migrate physical plan tests to `insta` (Part-1) [datafusion]

2025-04-04 Thread via GitHub
alamb commented on code in PR #15313: URL: https://github.com/apache/datafusion/pull/15313#discussion_r2005880020 ## datafusion/physical-plan/Cargo.toml: ## @@ -58,6 +58,7 @@ futures = { workspace = true } half = { workspace = true } hashbrown = { workspace = true } indexmap

Re: [PR] docs: change OSX/OS X to macOS [datafusion-comet]

2025-04-04 Thread via GitHub
codecov-commenter commented on PR #1584: URL: https://github.com/apache/datafusion-comet/pull/1584#issuecomment-2767636712 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1584?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_ca

Re: [I] [DISCUSS] Switch to `tree` explain by default [datafusion]

2025-04-04 Thread via GitHub
alamb closed issue #15343: [DISCUSS] Switch to `tree` explain by default URL: https://github.com/apache/datafusion/issues/15343

Re: [PR] STRING_AGG missing functionality [datafusion]

2025-04-04 Thread via GitHub
gabotechs commented on code in PR #14412: URL: https://github.com/apache/datafusion/pull/14412#discussion_r2026461993 ## datafusion/functions-aggregate/src/string_agg.rs: ## @@ -129,52 +172,326 @@ impl AggregateUDFImpl for StringAgg { #[derive(Debug)] pub(crate) struct Strin

[I] `count` fails for FFI Table Providers [datafusion]

2025-04-04 Thread via GitHub
timsaucer opened a new issue, #15569: URL: https://github.com/apache/datafusion/issues/15569 ### Describe the bug When using FFI Table Providers, we generate an error because the input schemas do not match for cases like `count` where the input schema is irrelevant. See the minim

Re: [PR] parquet reader: move pruning predicate creation from ParquetSource to ParquetOpener [datafusion]

2025-04-04 Thread via GitHub
zhuqi-lucas commented on code in PR #15561: URL: https://github.com/apache/datafusion/pull/15561#discussion_r2029658117 ## datafusion/datasource-parquet/src/opener.rs: ## @@ -109,47 +108,84 @@ impl FileOpener for ParquetOpener { .schema_adapter_factory

Re: [PR] Docs : Added Sql examples for window Functions : `nth_val` , etc [datafusion]

2025-04-04 Thread via GitHub
Adez017 commented on PR #1: URL: https://github.com/apache/datafusion/pull/1#issuecomment-2780153961 Is it going to be merged now? @alamb

Re: [I] Trivial WHERE filter not eliminated when combined with CTE [datafusion]

2025-04-04 Thread via GitHub
ding-young commented on issue #15387: URL: https://github.com/apache/datafusion/issues/15387#issuecomment-2780253832 take

Re: [PR] Add support for MSSQL IF/ELSE statements. [datafusion-sqlparser-rs]

2025-04-04 Thread via GitHub
iffyio commented on code in PR #1791: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1791#discussion_r2029718008 ## src/ast/spans.rs: ## @@ -739,19 +740,12 @@ impl Spanned for CreateIndex { impl Spanned for CaseStatement { fn span(&self) -> Span { le

Re: [PR] MSSQL: Add support for functionality `MERGE` output clause [datafusion-sqlparser-rs]

2025-04-04 Thread via GitHub
dilovancelik commented on PR #1790: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1790#issuecomment-2780303913 Hey, I did a rebase and I think something went wrong. I have now done a classic merge, and it looks like the extra commits are gone. Sorry.

Re: [PR] chore: Fix some inconsistencies in memory pool configuration [datafusion-comet]

2025-04-04 Thread via GitHub
viirya commented on code in PR #1561: URL: https://github.com/apache/datafusion-comet/pull/1561#discussion_r2006929718 ## spark/src/main/scala/org/apache/comet/CometExecIterator.scala: ## @@ -63,9 +64,28 @@ class CometExecIterator( }.toArray private val plan = { val c

Re: [PR] docs: various improvements to tuning guide [datafusion-comet]

2025-04-04 Thread via GitHub
andygrove commented on code in PR #1525: URL: https://github.com/apache/datafusion-comet/pull/1525#discussion_r2003780906 ## spark/src/main/scala/org/apache/spark/Plugins.scala: ## @@ -63,13 +63,10 @@ class CometDriverPlugin extends DriverPlugin with Logging with ShimCometDrive

Re: [PR] chore: Create simple fuzz test as part of test suite [datafusion-comet]

2025-04-04 Thread via GitHub
codecov-commenter commented on PR #1610: URL: https://github.com/apache/datafusion-comet/pull/1610#issuecomment-2779892103 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1610?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_ca

Re: [PR] Test: configuration fuzzer for (external) sort queries [datafusion]

2025-04-04 Thread via GitHub
2010YOUY01 commented on PR #15501: URL: https://github.com/apache/datafusion/pull/15501#issuecomment-2774917623 > In my mind the only thing remaining for this PR is to reduce the time down from 30 seconds somehow (maybe split it into multiple smaller tests that can run in parallel, for exam

[PR] fix: corrected the logic of eliminating CometSparkToColumnarExec [datafusion-comet]

2025-04-04 Thread via GitHub
wForget opened a new pull request, #1597: URL: https://github.com/apache/datafusion-comet/pull/1597 ## Which issue does this PR close? Closes #1314 and #1588. ## Rationale for this change `EliminateRedundantTransitions` eliminates the required `ColumnarToRowExec`

[PR] Allow single quotes in EXTRACT() for Redshift. [datafusion-sqlparser-rs]

2025-04-04 Thread via GitHub
romanb opened a new pull request, #1795: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1795 Just like Postgres, Redshift supports enclosing the `datepart` parameter in single quotes. See also the examples in https://docs.aws.amazon.com/redshift/latest/dg/r_EXTRACT_function.htm

Re: [PR] Add support for MSSQL IF/ELSE statements. [datafusion-sqlparser-rs]

2025-04-04 Thread via GitHub
iffyio commented on code in PR #1791: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1791#discussion_r2025392312 ## src/ast/mod.rs: ## @@ -2145,116 +2149,189 @@ impl fmt::Display for CaseStatement { } if let Some(else_block) = else_block { -

Re: [PR] Use `any` instead of `for_each` [datafusion]

2025-04-04 Thread via GitHub
xudong963 commented on code in PR #15289: URL: https://github.com/apache/datafusion/pull/15289#discussion_r2007912619 ## datafusion/datasource-parquet/src/file_format.rs: ## @@ -839,9 +839,10 @@ pub fn statistics_from_parquet_meta_calc( total_byte_size += row_group_meta

Re: [I] Collecting parquet without any transformations throws an exception [datafusion-comet]

2025-04-04 Thread via GitHub
comphead closed issue #1588: Collecting parquet without any transformations throws an exception URL: https://github.com/apache/datafusion-comet/issues/1588

Re: [PR] Improve collection during repr and repr_html [datafusion-python]

2025-04-04 Thread via GitHub
konjac commented on code in PR #1036: URL: https://github.com/apache/datafusion-python/pull/1036#discussion_r2007612621 ## src/dataframe.rs: ## @@ -771,3 +871,82 @@ fn record_batch_into_schema( RecordBatch::try_new(schema, data_arrays) } + +/// This is a helper function

Re: [PR] chore: Attach Diagnostic to "incompatible type in unary expression" error [datafusion]

2025-04-04 Thread via GitHub
alamb commented on PR #15209: URL: https://github.com/apache/datafusion/pull/15209#issuecomment-2734188614 Thanks again everyone!

Re: [PR] Add all missing table options to be handled in any order [datafusion-sqlparser-rs]

2025-04-04 Thread via GitHub
tomershaniii commented on code in PR #1747: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1747#discussion_r2009156970 ## src/ast/dml.rs: ## @@ -138,6 +143,30 @@ pub struct CreateTable { pub engine: Option, pub comment: Option, pub auto_increment_off

Re: [I] Cache Parquet Metadata [datafusion]

2025-04-04 Thread via GitHub
matthewmturner commented on issue #15582: URL: https://github.com/apache/datafusion/issues/15582#issuecomment-2780145360 I am working on this for `dft` right now actually and I plan on integrating it into the observability feature that I have been working on (where different observability m

[I] Enable `split_file_groups_by_statistics` by default [datafusion]

2025-04-04 Thread via GitHub
alamb opened a new issue, #10336: URL: https://github.com/apache/datafusion/issues/10336 ### Is your feature request related to a problem or challenge? - Part of https://github.com/apache/datafusion/issues/10313 In https://github.com/apache/datafusion/pull/9593, @suremarc added

Re: [I] Enable `split_file_groups_by_statistics` by default [datafusion]

2025-04-04 Thread via GitHub
xudong963 commented on issue #10336: URL: https://github.com/apache/datafusion/issues/10336#issuecomment-2780179649 I'll open a follow-up PR to make it the default

[PR] Chore: Call arrow's methods `row_count` and `skipped_row_count` [datafusion]

2025-04-04 Thread via GitHub
jayzhan211 opened a new pull request, #15587: URL: https://github.com/apache/datafusion/pull/15587 ## Which issue does this PR close? - Closes #. ## Rationale for this change ## What changes are included in this PR? ## Are these changes test

Re: [PR] Blog post on Parquet pruning in datafusion [datafusion-site]

2025-04-04 Thread via GitHub
kevinjqliu commented on code in PR #60: URL: https://github.com/apache/datafusion-site/pull/60#discussion_r2006136201 ## content/blog/2025-03-20-parquet-pruning.md: ## @@ -0,0 +1,118 @@ +--- +layout: post +title: Parquet Pruning in DataFusion: Read Only What Matters +date: 2025-

Re: [I] Spark executor fail to start occasionally with SIGILL [datafusion-comet]

2025-04-04 Thread via GitHub
mbutrovich commented on issue #1598: URL: https://github.com/apache/datafusion-comet/issues/1598#issuecomment-2776011568 A release build will by default set target-cpu=native: https://github.com/apache/datafusion-comet/blob/c5e78b6b59778f0429f0fc8157c6a959bfd9d4c3/Makefile#L101 which

Re: [PR] Update changelog and version number [datafusion-python]

2025-04-04 Thread via GitHub
timsaucer commented on PR #1089: URL: https://github.com/apache/datafusion-python/pull/1089#issuecomment-2764539321 Included changes were approved via vote on dev mailing list.

Re: [I] Will Comet support closed-source forks of Apache Spark (e.g. CSP versions)? [datafusion-comet]

2025-04-04 Thread via GitHub
andygrove commented on issue #414: URL: https://github.com/apache/datafusion-comet/issues/414#issuecomment-2775835470 The documentation does now state that we only support open-source Apache Spark, so I will close this issue

Re: [PR] Add disk usage limit configuration to datafusion-cli [datafusion]

2025-04-04 Thread via GitHub
2010YOUY01 commented on code in PR #15586: URL: https://github.com/apache/datafusion/pull/15586#discussion_r2029691057 ## docs/source/user-guide/cli/usage.md: ## @@ -57,6 +57,9 @@ OPTIONS: --mem-pool-type Specify the memory pool type 'greedy' or 'fair', de

Re: [PR] Add disk usage limit configuration to datafusion-cli [datafusion]

2025-04-04 Thread via GitHub
jsai28 commented on code in PR #15586: URL: https://github.com/apache/datafusion/pull/15586#discussion_r2029700750 ## datafusion-cli/src/main.rs: ## @@ -125,6 +127,14 @@ struct Args { #[clap(long, help = "Enables console syntax highlighting")] color: bool, + +#[c

Re: [PR] parquet reader: move pruning predicate creation from ParquetSource to ParquetOpener [datafusion]

2025-04-04 Thread via GitHub
adriangb commented on code in PR #15561: URL: https://github.com/apache/datafusion/pull/15561#discussion_r2029708307 ## datafusion/datasource-parquet/src/opener.rs: ## @@ -109,47 +108,84 @@ impl FileOpener for ParquetOpener { .schema_adapter_factory .cr

Re: [PR] Add support for 'IN ' [datafusion-sqlparser-rs]

2025-04-04 Thread via GitHub
iffyio commented on code in PR #1793: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1793#discussion_r2029711949 ## src/parser/mod.rs: ## @@ -3742,24 +3742,23 @@ impl<'a> Parser<'a> { }); } self.expect_token(&Token::LParen)?; -

[PR] Blog post about user defined window functions [datafusion-site]

2025-04-04 Thread via GitHub
Adez017 opened a new pull request, #66: URL: https://github.com/apache/datafusion-site/pull/66 ## Is your feature request related to a problem or challenge? Solving the Issue [#6781](https://github.com/apache/datafusion/issues/6781) from data fusion repo ## Describe th

Re: [PR] Blog post about user defined window functions [datafusion-site]

2025-04-04 Thread via GitHub
Dandandan commented on code in PR #66: URL: https://github.com/apache/datafusion-site/pull/66#discussion_r2028680588 ## content/blog/2025-04-04-datafusion-userdefined-window-functions.md: ## @@ -0,0 +1,154 @@ +--- +layout: post +title: User defined Window Functions in DataFusion

Re: [PR] (WIP) Upgrading to arrow 55 [datafusion]

2025-04-04 Thread via GitHub
alamb commented on PR #15466: URL: https://github.com/apache/datafusion/pull/15466#issuecomment-2778478143 Comparing main_base and alamb_test_upgrade_54 Note: Skipping /home/alamb/arrow-datafusion/benchmarks/results/main_base/*.json as /home/alamb/arrow-datafusion/benchmarks/results/alam

Re: [PR] (WIP) Upgrading to arrow 55 [datafusion]

2025-04-04 Thread via GitHub
alamb commented on PR #15466: URL: https://github.com/apache/datafusion/pull/15466#issuecomment-2778396004 ./gh_compare_branch.sh Benchmark Script Running Linux aal-dev 6.8.0-1016-gcp #18-Ubuntu SMP Fri Oct 4 22:16:29 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux Comparing alamb/test_upgrad

[PR] Minor: rm session downcast [datafusion]

2025-04-04 Thread via GitHub
jayzhan211 opened a new pull request, #15575: URL: https://github.com/apache/datafusion/pull/15575 ## Which issue does this PR close? - Closes #. ## Rationale for this change ## What changes are included in this PR? ## Are these changes test

Re: [PR] Docs : Added Sql examples for window Functions : `nth_val` , etc [datafusion]

2025-04-04 Thread via GitHub
Adez017 commented on PR #1: URL: https://github.com/apache/datafusion/pull/1#issuecomment-2777865305 I think it should work now

Re: [PR] Fix clippy lint on rust 1.86 [datafusion-sqlparser-rs]

2025-04-04 Thread via GitHub
alamb merged PR #1796: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1796 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr.

Re: [PR] (WIP) Upgrading to arrow 55 [datafusion]

2025-04-04 Thread via GitHub
alamb commented on PR #15466: URL: https://github.com/apache/datafusion/pull/15466#issuecomment-2778408978 ./gh_compare_branch.sh Benchmark Script Running Linux aal-dev 6.8.0-1016-gcp #18-Ubuntu SMP Fri Oct 4 22:16:29 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux Comparing alamb/test_upgrad

Re: [PR] (WIP) Upgrading to arrow 55 [datafusion]

2025-04-04 Thread via GitHub
alamb commented on PR #15466: URL: https://github.com/apache/datafusion/pull/15466#issuecomment-2778416199 `./gh_compare_branch.sh` Benchmark Script Running Linux aal-dev 6.8.0-1016-gcp #18-Ubuntu SMP Fri Oct 4 22:16:29 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux Comparing alamb/test_upgr

Re: [PR] (WIP) Upgrading to arrow 55 [datafusion]

2025-04-04 Thread via GitHub
alamb commented on PR #15466: URL: https://github.com/apache/datafusion/pull/15466#issuecomment-2778390928 ./gh_compare_branch.sh Benchmark Script Running Linux aal-dev 6.8.0-1016-gcp #18-Ubuntu SMP Fri Oct 4 22:16:29 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux Comparing alamb/test_upgrad

Re: [PR] (WIP) Upgrading to arrow 55 [datafusion]

2025-04-04 Thread via GitHub
alamb commented on PR #15466: URL: https://github.com/apache/datafusion/pull/15466#issuecomment-2778419409 `./gh_compare_branch.sh` [Benchmark Script](https://github.com/alamb/datafusion-benchmarking) Running Linux aal-dev 6.8.0-1016-gcp #18-Ubuntu SMP Fri Oct 4 22:16:29 UTC 2024 x86_64

Re: [PR] perf: Add TopK benchmarks as variation over the `sort_tpch` benchmarks [datafusion]

2025-04-04 Thread via GitHub
geoffreyclaude commented on PR #15560: URL: https://github.com/apache/datafusion/pull/15560#issuecomment-2778019358 > Thanks @geoffreyclaude -- would it be possible to add some documentation about this benchmark in the readme? > > https://github.com/apache/datafusion/tree/main/benchma

Re: [I] Extend TopK early termination to partially sorted inputs [datafusion]

2025-04-04 Thread via GitHub
geoffreyclaude commented on issue #15529: URL: https://github.com/apache/datafusion/issues/15529#issuecomment-2772276984 @alamb: > This may be some overlap with this work from @adriangb (though I realize you are talking about a different optimization) The two are complementary. @ad

Re: [PR] chore: update clickbench [datafusion]

2025-04-04 Thread via GitHub
Dandandan commented on PR #15574: URL: https://github.com/apache/datafusion/pull/15574#issuecomment-2777865667 > Should we make sure clickbench.slt is in sync with queries.sql in the clickbench directory? Currently they are a bit different. Yes, that makes sense. Could you do that as well? >

Re: [PR] perf: Introduce sort prefix computation for early TopK exit optimization on partially sorted input [datafusion]

2025-04-04 Thread via GitHub
geoffreyclaude commented on code in PR #15563: URL: https://github.com/apache/datafusion/pull/15563#discussion_r2028543416 ## datafusion/physical-plan/src/topk/mod.rs: ## @@ -90,15 +90,38 @@ pub struct TopK { scratch_rows: Rows, /// stores the top k values and their so

Re: [I] Apache DataFusion Google Summer of Code (GSoC) 2025 Application Guidelines [datafusion]

2025-04-04 Thread via GitHub
ding-young commented on issue #14577: URL: https://github.com/apache/datafusion/issues/14577#issuecomment-2778240355 > > Hello, [@ozankabak](https://github.com/ozankabak). May I ask for a new Discord invite link? I would like to take a look at the ongoing discussion on project ideas, but the link

Re: [I] A complete solution for stable and safe sort with spill [datafusion]

2025-04-04 Thread via GitHub
alamb commented on issue #14692: URL: https://github.com/apache/datafusion/issues/14692#issuecomment-2778254506 Thanks @qstommyshu! I think @2010YOUY01 has been working on this one most recently, so I would recommend checking in with them before coding too much -- This is an automated

Re: [PR] Test: configuration fuzzer for (external) sort queries [datafusion]

2025-04-04 Thread via GitHub
alamb merged PR #15501: URL: https://github.com/apache/datafusion/pull/15501 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusi

Re: [I] Add SQL examples to window functions: `nth_value`, etc [datafusion]

2025-04-04 Thread via GitHub
alamb commented on issue #13399: URL: https://github.com/apache/datafusion/issues/13399#issuecomment-2778263737 > Hi [@alamb](https://github.com/alamb), I am also interested in this. Can I do that? Absolutely -- see https://datafusion.apache.org/contributor-guide/index.html#open-c

[PR] Support additional DuckDB integer types such as HUGEINT, UHUGEINT, etc [datafusion-sqlparser-rs]

2025-04-04 Thread via GitHub
alexander-beedie opened a new pull request, #1797: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1797 * Adds `DataType` support for the following integer types, with tests: `HUGEINT` `UHUGEINT` `UTINYINT` `USMALLINT` `UBIGINT` (ref: https

Re: [PR] STRING_AGG missing functionality [datafusion]

2025-04-04 Thread via GitHub
gabotechs commented on code in PR #14412: URL: https://github.com/apache/datafusion/pull/14412#discussion_r2028725710 ## datafusion/functions-aggregate/src/string_agg.rs: ## @@ -129,52 +172,326 @@ impl AggregateUDFImpl for StringAgg { #[derive(Debug)] pub(crate) struct Strin

Re: [I] A complete solution for stable and safe sort with spill [datafusion]

2025-04-04 Thread via GitHub
qstommyshu commented on issue #14692: URL: https://github.com/apache/datafusion/issues/14692#issuecomment-2778461103 > Thanks @qstommyshu! I think @2010YOUY01 has been working on this one most recently, so I would recommend checking in with them before coding too much Sounds good, no

Re: [PR] parquet reader: move pruning predicate creation from ParquetSource to ParquetOpener [datafusion]

2025-04-04 Thread via GitHub
alamb commented on PR #15561: URL: https://github.com/apache/datafusion/pull/15561#issuecomment-2778671317 🤖 `./gh_compare_branch.sh` [Benchmark Script](https://github.com/alamb/datafusion-benchmarking) Running Linux aal-dev 6.8.0-1016-gcp #18-Ubuntu SMP Fri Oct 4 22:16:29 UTC 2024 x86_
