Re: [PR] [logical-types] add NativeType and LogicalType [datafusion]

2024-10-13 Thread via GitHub
findepi commented on code in PR #12853: URL: https://github.com/apache/datafusion/pull/12853#discussion_r1798843685 ## datafusion/common/src/types/logical.rs: ## @@ -0,0 +1,58 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agree

Re: [PR] [logical-types] add NativeType and LogicalType [datafusion]

2024-10-13 Thread via GitHub
findepi commented on code in PR #12853: URL: https://github.com/apache/datafusion/pull/12853#discussion_r1798841086 ## datafusion/common/src/types/logical.rs: ## @@ -0,0 +1,41 @@ +use core::fmt; +use std::{cmp::Ordering, hash::Hash, sync::Arc}; + +use super::NativeType; + +/// A

Re: [PR] Add support for parsing MsSql alias with equals [datafusion-sqlparser-rs]

2024-10-13 Thread via GitHub
yoavcloud commented on PR #1467: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1467#issuecomment-2410086902 @iffyio I tried inlining the function; it's not as short as I wished it would be, due to pattern matching the boxed fields. But maybe I'm missing something... Would love you

Re: [PR] Remove unused `math_expressions.rs` [datafusion]

2024-10-13 Thread via GitHub
jonahgao commented on code in PR #12917: URL: https://github.com/apache/datafusion/pull/12917#discussion_r1798789739 ## datafusion/physical-expr/src/math_expressions.rs: ## @@ -1,126 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor l

Re: [PR] Remove unused `math_expressions.rs` [datafusion]

2024-10-13 Thread via GitHub
jonahgao commented on code in PR #12917: URL: https://github.com/apache/datafusion/pull/12917#discussion_r1798786852 ## datafusion/physical-expr/src/math_expressions.rs: ## @@ -1,126 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor l

Re: [PR] Optimize `isnan` (2-5x faster) [datafusion]

2024-10-13 Thread via GitHub
jonahgao commented on code in PR #12889: URL: https://github.com/apache/datafusion/pull/12889#discussion_r1798785780 ## datafusion/physical-expr/src/math_expressions.rs: ## @@ -17,60 +17,27 @@ //! Math expressions -use std::any::type_name; use std::sync::Arc; -use arrow:

[PR] Remove unused `math_expressions.rs` [datafusion]

2024-10-13 Thread via GitHub
jonahgao opened a new pull request, #12917: URL: https://github.com/apache/datafusion/pull/12917 ## Which issue does this PR close? N/A ## Rationale for this change Found when reviewing #12889 It should be a leftover from moving math functions. ## What

Re: [PR] Dynamic filter pushdown to probe side [datafusion]

2024-10-13 Thread via GitHub
Lordworms commented on PR #12781: URL: https://github.com/apache/datafusion/pull/12781#issuecomment-2409938373 When I was implementing this, the hard part was how to collect global build-side information in partition-mode; currently my idea is to use a lock to ensure all the build-side op

Re: [PR] Introduce `binary_as_string` parquet option [datafusion]

2024-10-13 Thread via GitHub
goldmedal commented on PR #12816: URL: https://github.com/apache/datafusion/pull/12816#issuecomment-2409843953 > Here is the performance of this PR. Some queries are slower, some are faster. > > I believe once we turn on string view everything will be faster. > Thanks @alam

Re: [I] `octet_length()` function not working for StringView columns (SQLancer) [datafusion]

2024-10-13 Thread via GitHub
jonahgao closed issue #12149: `octet_length()` function not working for StringView columns (SQLancer) URL: https://github.com/apache/datafusion/issues/12149 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to

Re: [PR] octet_length + string view == ❤️ [datafusion]

2024-10-13 Thread via GitHub
jonahgao merged PR #12900: URL: https://github.com/apache/datafusion/pull/12900

Re: [I] hooks for `temporary` tables in the catalog [datafusion]

2024-10-13 Thread via GitHub
jonahgao closed issue #12463: hooks for `temporary` tables in the catalog URL: https://github.com/apache/datafusion/issues/12463

Re: [PR] Minor: add flags for temporary ddl [datafusion]

2024-10-13 Thread via GitHub
jonahgao merged PR #12561: URL: https://github.com/apache/datafusion/pull/12561

Re: [PR] Support Null aware anti join by HashJoin [datafusion]

2024-10-13 Thread via GitHub
github-actions[bot] commented on PR #10584: URL: https://github.com/apache/datafusion/pull/10584#issuecomment-2409696079 Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or

Re: [PR] fix: bugs when having and group by are all false [datafusion]

2024-10-13 Thread via GitHub
github-actions[bot] commented on PR #11897: URL: https://github.com/apache/datafusion/pull/11897#issuecomment-2409695828 Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or

Re: [PR] Better multi-column aggregation support with StringView [datafusion]

2024-10-13 Thread via GitHub
github-actions[bot] closed pull request #11794: Better multi-column aggregation support with StringView URL: https://github.com/apache/datafusion/pull/11794

Re: [PR] Generate GroupByHash output in multiple RecordBatches [datafusion]

2024-10-13 Thread via GitHub
github-actions[bot] commented on PR #11758: URL: https://github.com/apache/datafusion/pull/11758#issuecomment-2409695918 Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or

Re: [PR] Update Datafusion Ray architecture docs [datafusion-ray]

2024-10-13 Thread via GitHub
andygrove merged PR #27: URL: https://github.com/apache/datafusion-ray/pull/27

Re: [PR] Fixing docker image publishing [datafusion-ray]

2024-10-13 Thread via GitHub
andygrove merged PR #31: URL: https://github.com/apache/datafusion-ray/pull/31

[PR] fix: Add Int32 type override for Dialects [datafusion]

2024-10-13 Thread via GitHub
peasee opened a new pull request, #12916: URL: https://github.com/apache/datafusion/pull/12916 ## Which issue does this PR close? Closes #12915. ## Rationale for this change * To fix a bug with casting integers in MySQL ## What changes are included

[I] CAST(.. AS INTEGER) is not supported in MySQL [datafusion]

2024-10-13 Thread via GitHub
peasee opened a new issue, #12915: URL: https://github.com/apache/datafusion/issues/12915 ### Describe the bug `CAST(.. AS INTEGER)` is not supported in MySQL, as `INTEGER` is not a valid cast type. ### To Reproduce Create an execution plan using the MySQL Dialect that i
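The bug described above can be pictured with a small, self-contained sketch. It is not the DataFusion unparser API: the `Dialect` trait, the `int32_cast_type` method, and the `cast_to_int32_sql` helper are hypothetical names chosen for illustration. The only outside fact assumed is that MySQL's `CAST` accepts `SIGNED` as an integer target type rather than `INTEGER`.

```rust
/// Hypothetical dialect trait: each dialect names the SQL type it wants
/// as the target of an Int32 cast. The default is plain `INTEGER`.
trait Dialect {
    fn int32_cast_type(&self) -> &'static str {
        "INTEGER"
    }
}

/// A dialect that keeps the default cast type.
struct DefaultDialect;
impl Dialect for DefaultDialect {}

/// A MySQL-like dialect that overrides the cast type, because
/// `CAST(x AS INTEGER)` is not accepted there.
struct MySqlDialect;
impl Dialect for MySqlDialect {
    fn int32_cast_type(&self) -> &'static str {
        "SIGNED"
    }
}

/// Render a cast of `expr` to a 32-bit integer for the given dialect.
fn cast_to_int32_sql(dialect: &dyn Dialect, expr: &str) -> String {
    format!("CAST({} AS {})", expr, dialect.int32_cast_type())
}
```

With this shape, `cast_to_int32_sql(&MySqlDialect, "a")` renders the MySQL-friendly form while other dialects keep `INTEGER`; a per-dialect override of this kind is one plausible reading of the PR title "Add Int32 type override for Dialects".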

Re: [PR] Update Datafusion Ray architecture docs [datafusion-ray]

2024-10-13 Thread via GitHub
austin362667 commented on code in PR #27: URL: https://github.com/apache/datafusion-ray/pull/27#discussion_r1798588033 ## docs/README.md: ## @@ -17,12 +17,12 @@ under the License. --> -# RaySQL Design Documentation +# Datafusion Ray Design Documentation Review Comment:

Re: [PR] Update Datafusion Ray architecture docs [datafusion-ray]

2024-10-13 Thread via GitHub
austin362667 commented on code in PR #27: URL: https://github.com/apache/datafusion-ray/pull/27#discussion_r1798588305 ## docs/README.md: ## @@ -260,13 +257,14 @@ child plans, building up a DAG of futures. ## Distributed Shuffle -The output of each query stage needs to be p

[PR] Fixing docker image publishing [datafusion-ray]

2024-10-13 Thread via GitHub
edmondop opened a new pull request, #31: URL: https://github.com/apache/datafusion-ray/pull/31 (no comment)

[PR] Remove redundant unsafe in test [datafusion]

2024-10-13 Thread via GitHub
findepi opened a new pull request, #12914: URL: https://github.com/apache/datafusion/pull/12914 Unsafe `StructArray::new_unchecked` is a more performant alternative to `StructArray::new`. In test code there is no benefit from using the unsafe code path.

Re: [PR] Fix #1469 (SET ROLE regression) [datafusion-sqlparser-rs]

2024-10-13 Thread via GitHub
lovasoa commented on code in PR #1474: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1474#discussion_r1798561445 ## src/parser/mod.rs: ## @@ -9416,27 +9416,42 @@ impl<'a> Parser<'a> { } } +fn parse_set_role(&mut self, modifier: Option) -> Opti

Re: [PR] Fix #1469 (SET ROLE regression) [datafusion-sqlparser-rs]

2024-10-13 Thread via GitHub
lovasoa commented on PR #1474: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1474#issuecomment-2409117537 I changed the code to use maybe_parse, which is indeed much cleaner and resolves both your points. Thanks @iffyio !

Re: [I] Improve k8s documentation to cover upgrading to Ray 2.37.0 [datafusion-ray]

2024-10-13 Thread via GitHub
andygrove commented on issue #30: URL: https://github.com/apache/datafusion-ray/issues/30#issuecomment-2409070032 Thanks. I hadn't understood that from reading the docs, but it makes sense now.

Re: [PR] Implement special `GroupColumn` support for byte view [datafusion]

2024-10-13 Thread via GitHub
Rachelint commented on PR #12809: URL: https://github.com/apache/datafusion/pull/12809#issuecomment-2409067977 @alamb 👍 Thanks for the reminder about test coverage. After checking the code again more carefully, I found that some test cases indeed don't cover the code paths I expected.

Re: [PR] Building docker images during CI/CD and publish on tag [datafusion-ray]

2024-10-13 Thread via GitHub
andygrove commented on code in PR #29: URL: https://github.com/apache/datafusion-ray/pull/29#discussion_r1798471260 ## .github/workflows/k8s.yml: ## @@ -0,0 +1,32 @@ +name: Kubernetes Review Comment: I'll go ahead and merge so we can test. I added a bullet to https://github

Re: [PR] Building docker images during CI/CD and publish on tag [datafusion-ray]

2024-10-13 Thread via GitHub
andygrove merged PR #29: URL: https://github.com/apache/datafusion-ray/pull/29

[PR] Fix zero data type in `expr % 1` simplification [datafusion]

2024-10-13 Thread via GitHub
eejbyfeldt opened a new pull request, #12913: URL: https://github.com/apache/datafusion/pull/12913 ## Which issue does this PR close? Follow up to #12814 ## Rationale for this change This addresses a bug that previously always replaced a `% 1` expression with a 0 of type i3
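To illustrate why the type of the replacement zero matters, here is a toy sketch using only the standard library. `DataType`, `Scalar`, and `simplify_mod_one` are hypothetical stand-ins, not DataFusion's types or its simplifier; the sketch also assumes a non-null integer operand, since a null input would have to stay null.

```rust
/// Toy stand-in for a column data type (integer types only, because
/// `x % 1 == 0` does not hold for fractional values).
#[derive(Debug, Clone, Copy, PartialEq)]
enum DataType {
    Int32,
    Int64,
}

/// Toy stand-in for a typed literal value.
#[derive(Debug, Clone, PartialEq)]
enum Scalar {
    Int32(i32),
    Int64(i64),
}

impl Scalar {
    /// Build a zero literal whose type matches `dt`.
    fn zero(dt: DataType) -> Scalar {
        match dt {
            DataType::Int32 => Scalar::Int32(0),
            DataType::Int64 => Scalar::Int64(0),
        }
    }

    /// Report the data type carried by this literal.
    fn data_type(&self) -> DataType {
        match self {
            Scalar::Int32(_) => DataType::Int32,
            Scalar::Int64(_) => DataType::Int64,
        }
    }
}

/// Simplify `expr % 1` for a non-null integer `expr` of type `expr_type`.
/// The result is a zero of the SAME type as `expr`, so the rewritten
/// expression keeps the type the original expression had.
fn simplify_mod_one(expr_type: DataType) -> Scalar {
    Scalar::zero(expr_type)
}
```

A rewrite that always emitted one fixed zero type would silently change the output type of, say, an `Int64` column expression; deriving the zero from the operand's type avoids that.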

Re: [PR] feat: add `with_columns` [datafusion-python]

2024-10-13 Thread via GitHub
timsaucer commented on code in PR #909: URL: https://github.com/apache/datafusion-python/pull/909#discussion_r1798465579 ## python/datafusion/dataframe.py: ## @@ -163,7 +163,20 @@ def with_column(self, name: str, expr: Expr) -> DataFrame: def with_columns( self, *e

Re: [PR] refactor: rename to `with_column_renamed` to `rename` [datafusion-python]

2024-10-13 Thread via GitHub
timsaucer commented on code in PR #908: URL: https://github.com/apache/datafusion-python/pull/908#discussion_r1798465440 ## python/datafusion/dataframe.py: ## @@ -175,7 +178,23 @@ def with_column_renamed(self, old_name: str, new_name: str) -> DataFrame: Returns:

Re: [I] Improve k8s documentation to cover upgrading to Ray 2.37.0 [datafusion-ray]

2024-10-13 Thread via GitHub
vakarisbk commented on issue #30: URL: https://github.com/apache/datafusion-ray/issues/30#issuecomment-2409037630 Cluster version is determined by the container image that is used to launch the cluster (step 3 in the docs). The default container image is `rayproject/ray:2.9.0`. The d

Re: [I] Add `fill_null` on dataframe/expr level [datafusion]

2024-10-13 Thread via GitHub
ion-elgreco commented on issue #12906: URL: https://github.com/apache/datafusion/issues/12906#issuecomment-2409035766 @Omega359 indeed, for now I've opened a PR for a fill_null in datafusion-python that calls nvl: https://github.com/apache/datafusion-python/pull/919

[PR] feat: add fill_null/nan expressions [datafusion-python]

2024-10-13 Thread via GitHub
ion-elgreco opened a new pull request, #919: URL: https://github.com/apache/datafusion-python/pull/919 # Which issue does this PR close? - related https://github.com/apache/datafusion-python/issues/875 # Rationale for this change Fill_na/null is a quite common method for doing t

Re: [PR] Fix 2 bugs related to push down partition filters [datafusion]

2024-10-13 Thread via GitHub
eejbyfeldt commented on code in PR #12902: URL: https://github.com/apache/datafusion/pull/12902#discussion_r1798432714 ## datafusion/sqllogictest/test_files/errors.slt: ## @@ -133,3 +133,7 @@ create table foo as values (1), ('foo'); query error No function matches select 1 g

Re: [PR] Add spilling related metrics for aggregation [datafusion]

2024-10-13 Thread via GitHub
2010YOUY01 commented on code in PR #12888: URL: https://github.com/apache/datafusion/pull/12888#discussion_r1798425953 ## datafusion/physical-plan/src/aggregates/row_hash.rs: ## @@ -102,6 +102,19 @@ struct SpillState { /// true when streaming merge is in progress is_

Re: [PR] feat: Implement bloom_filter_agg [datafusion-comet]

2024-10-13 Thread via GitHub
mbutrovich commented on PR #987: URL: https://github.com/apache/datafusion-comet/pull/987#issuecomment-2409027007 Merged in updated main, thanks for the quick fix!

Re: [PR] feat: add `with_columns` [datafusion-python]

2024-10-13 Thread via GitHub
ion-elgreco commented on code in PR #909: URL: https://github.com/apache/datafusion-python/pull/909#discussion_r1798413292 ## python/datafusion/dataframe.py: ## @@ -163,7 +163,20 @@ def with_column(self, name: str, expr: Expr) -> DataFrame: def with_columns( self,

[I] Make all read methods available on DataFusion module [datafusion-python]

2024-10-13 Thread via GitHub
ion-elgreco opened a new issue, #918: URL: https://github.com/apache/datafusion-python/issues/918 **Is your feature request related to a problem or challenge? Please describe what you are trying to do.** Instead of having users explicitly create a sessioncontext, we could just create one

Re: [PR] Building docker images during CI/CD and publish on tag [datafusion-ray]

2024-10-13 Thread via GitHub
andygrove commented on code in PR #29: URL: https://github.com/apache/datafusion-ray/pull/29#discussion_r1798408562 ## .github/workflows/k8s.yml: ## @@ -0,0 +1,32 @@ +name: Kubernetes Review Comment: I would also be okay with explicitly excluding GitHub workflows from RAT c

Re: [PR] feat: add `with_columns` [datafusion-python]

2024-10-13 Thread via GitHub
timsaucer commented on code in PR #909: URL: https://github.com/apache/datafusion-python/pull/909#discussion_r1798407548 ## python/datafusion/dataframe.py: ## @@ -163,7 +163,20 @@ def with_column(self, name: str, expr: Expr) -> DataFrame: def with_columns( self, *e

[PR] Move StringArrayType, StringViewArrayBuilder, etc outside of string module [datafusion]

2024-10-13 Thread via GitHub
Omega359 opened a new pull request, #12912: URL: https://github.com/apache/datafusion/pull/12912 ## Which issue does this PR close? Closes #12898 ## Rationale for this change This refactor is to eliminate the requirement for other function modules to depend on th

[PR] refactor: `from_arrow` use protocol typehints [datafusion-python]

2024-10-13 Thread via GitHub
ion-elgreco opened a new pull request, #917: URL: https://github.com/apache/datafusion-python/pull/917 @timsaucer we could probably drop the rust code for from_pandas and from_polars and keep those in Python as aliases but just call from_arrow or even drop them. The only thing is we will ha

[I] Improve k8s documentation to cover upgrading to Ray 2.37.0 [datafusion-ray]

2024-10-13 Thread via GitHub
andygrove opened a new issue, #30: URL: https://github.com/apache/datafusion-ray/issues/30 I followed the k8s docs and it set up a cluster using Ray 2.9.0. DataFusion Ray requires Ray 2.37.0, so I think that we should add documentation explaining how to upgrade.

Re: [PR] Building docker images during CI/CD and publish on tag [datafusion-ray]

2024-10-13 Thread via GitHub
andygrove commented on code in PR #29: URL: https://github.com/apache/datafusion-ray/pull/29#discussion_r1798403887 ## .github/workflows/k8s.yml: ## @@ -0,0 +1,32 @@ +name: Kubernetes Review Comment: Could you add the ASF license header? It looks like we don't have the RAT

Re: [PR] Update Datafusion Ray architecture docs [datafusion-ray]

2024-10-13 Thread via GitHub
andygrove commented on code in PR #27: URL: https://github.com/apache/datafusion-ray/pull/27#discussion_r1798402790 ## docs/README.md: ## @@ -260,13 +257,14 @@ child plans, building up a DAG of futures. ## Distributed Shuffle -The output of each query stage needs to be pers

Re: [PR] Update Datafusion Ray architecture docs [datafusion-ray]

2024-10-13 Thread via GitHub
andygrove commented on code in PR #27: URL: https://github.com/apache/datafusion-ray/pull/27#discussion_r1798401864 ## docs/README.md: ## @@ -17,12 +17,12 @@ under the License. --> -# RaySQL Design Documentation +# Datafusion Ray Design Documentation Review Comment: ni

Re: [PR] feat: add `cast` to DataFrame [datafusion-python]

2024-10-13 Thread via GitHub
ion-elgreco commented on PR #916: URL: https://github.com/apache/datafusion-python/pull/916#issuecomment-2409012940 > Very nice addition, but it looks like this branch is on top of your other one since it uses `with_columns` so we will need to merge that one before this. Yess, we nee

Re: [PR] Migrate documentation for all math functions from scalar_functions.md to code [datafusion]

2024-10-13 Thread via GitHub
juroberttyb commented on PR #12908: URL: https://github.com/apache/datafusion/pull/12908#issuecomment-2409012291 PR updated with `get_pow_doc()` removed.

Re: [PR] Migrate documentation for all math functions from scalar_functions.md to code [datafusion]

2024-10-13 Thread via GitHub
juroberttyb commented on code in PR #12908: URL: https://github.com/apache/datafusion/pull/12908#discussion_r1798392093 ## datafusion/functions/src/math/monotonicity.rs: ## @@ -218,14 +545,76 @@ pub fn sqrt_order(input: &[ExprProperties]) -> Result { } } +static DOCUMEN

[I] Add `tail` method on DataFrame [datafusion]

2024-10-13 Thread via GitHub
ion-elgreco opened a new issue, #12911: URL: https://github.com/apache/datafusion/issues/12911 Is there a better way we could do this? Maybe add something upstream if necessary? As I'm thinking of it, I don't know that this operation is necessarily well defined. Just li

Re: [PR] feat: add `head`, `tail` methods [datafusion-python]

2024-10-13 Thread via GitHub
ion-elgreco commented on code in PR #915: URL: https://github.com/apache/datafusion-python/pull/915#discussion_r1798390360 ## python/datafusion/dataframe.py: ## @@ -223,6 +223,30 @@ def limit(self, count: int, offset: int = 0) -> DataFrame: """ return DataFrame

Re: [PR] feat: add `head`, `tail` methods [datafusion-python]

2024-10-13 Thread via GitHub
ion-elgreco commented on code in PR #915: URL: https://github.com/apache/datafusion-python/pull/915#discussion_r1798390091 ## python/datafusion/dataframe.py: ## @@ -223,6 +223,30 @@ def limit(self, count: int, offset: int = 0) -> DataFrame: """ return DataFrame

Re: [PR] feat: expose `join_on` [datafusion-python]

2024-10-13 Thread via GitHub
ion-elgreco commented on code in PR #914: URL: https://github.com/apache/datafusion-python/pull/914#discussion_r1798388988 ## python/tests/test_dataframe.py: ## @@ -259,6 +259,43 @@ def test_join(): assert table.to_pydict() == expected +def test_join_on(): +ctx = Se

Re: [PR] feat: support inner iejoin [datafusion]

2024-10-13 Thread via GitHub
my-vegetable-has-exploded commented on PR #12754: URL: https://github.com/apache/datafusion/pull/12754#issuecomment-2409003636 ![btree-perf](https://github.com/user-attachments/assets/b5728eec-eab7-43bf-ab50-9984b9b2d39a) It seems the main cost is sorting.

[PR] Optimize performance of `math::cot` (~2x faster) [datafusion]

2024-10-13 Thread via GitHub
tlm365 opened a new pull request, #12910: URL: https://github.com/apache/datafusion/pull/12910 ## Rationale for this change Using the `unary` functions allows faster processing by avoiding branching on nulls. ## What changes are included in this PR? - Apply `unary`
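The rationale above (apply the operation to every slot and skip the per-element null check) can be sketched with the standard library alone. This is a toy model, not the arrow-rs `unary` kernel or the PR's code: `NullableF64`, `cot_branching`, and `unary_like` are hypothetical names, and the validity mask is a plain `Vec<bool>` rather than a packed bitmap.

```rust
/// Toy nullable column: a dense value buffer plus a validity mask.
/// A slot is null when its `validity` entry is false.
#[derive(Debug, Clone, PartialEq)]
struct NullableF64 {
    values: Vec<f64>,
    validity: Vec<bool>,
}

/// Branching style: check validity for each element before computing.
/// The `if` inside the loop body is what can hinder vectorization.
fn cot_branching(input: &NullableF64) -> NullableF64 {
    let mut values = Vec::with_capacity(input.values.len());
    for (v, valid) in input.values.iter().zip(&input.validity) {
        if *valid {
            values.push(1.0 / v.tan());
        } else {
            values.push(0.0); // placeholder stored in a null slot
        }
    }
    NullableF64 { values, validity: input.validity.clone() }
}

/// "unary" style: apply `op` to every slot, null or not, and carry the
/// validity mask over unchanged. Results computed for null slots are
/// never observed because the mask still marks those slots as null.
fn unary_like(input: &NullableF64, op: impl Fn(f64) -> f64) -> NullableF64 {
    NullableF64 {
        values: input.values.iter().map(|v| op(*v)).collect(),
        validity: input.validity.clone(),
    }
}
```

Calling `unary_like(&col, |x| 1.0 / x.tan())` yields the same values as `cot_branching(&col)` in every valid slot. The trade-off is that the operation must be safe to run on whatever happens to be stored in null slots, which holds for infallible float math like this.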

Re: [PR] Optimize performance of math::trunc (~2.5x faster) [datafusion]

2024-10-13 Thread via GitHub
tlm365 commented on code in PR #12909: URL: https://github.com/apache/datafusion/pull/12909#discussion_r1798363818 ## datafusion/functions/src/math/trunc.rs: ## @@ -111,44 +111,66 @@ fn trunc(args: &[ArrayRef]) -> Result { ); } -//if only one arg then invoke

Re: [I] Move StringArrayType, StringViewArrayBuilder, StringArrayBuilder & LargeStringArrayBuilder from functions/string/common.rs to a more common crate/module [datafusion]

2024-10-13 Thread via GitHub
Omega359 commented on issue #12898: URL: https://github.com/apache/datafusion/issues/12898#issuecomment-2408988141 take

Re: [PR] Implement special `GroupColumn` support for byte view [datafusion]

2024-10-13 Thread via GitHub
Rachelint commented on PR #12809: URL: https://github.com/apache/datafusion/pull/12809#issuecomment-2408987937 > Thank you so much @Rachelint -- this looks so great. I found it well commented, well structured, and well tested. > > cc @jayzhan211 your GroupColumn pattern is really work

Re: [PR] Optimize performance of math::trunc (~2.5x faster) [datafusion]

2024-10-13 Thread via GitHub
simonvandel commented on code in PR #12909: URL: https://github.com/apache/datafusion/pull/12909#discussion_r1798350152 ## datafusion/functions/src/math/trunc.rs: ## @@ -111,44 +111,66 @@ fn trunc(args: &[ArrayRef]) -> Result { ); } -//if only one arg then in

Re: [I] Add `pivot` & `unpivot` (melt) to DataFrame [datafusion]

2024-10-13 Thread via GitHub
Omega359 commented on issue #12907: URL: https://github.com/apache/datafusion/issues/12907#issuecomment-2408986299 referenced in https://github.com/apache/datafusion-python/issues/875

Re: [I] Add `fill_null` on dataframe/expr level [datafusion]

2024-10-13 Thread via GitHub
Omega359 commented on issue #12906: URL: https://github.com/apache/datafusion/issues/12906#issuecomment-2408983472 This seems like it may almost be an alias for [nvl](https://datafusion.apache.org/user-guide/sql/scalar_functions_new.html#nvl) or [nvl2](https://datafusion.apache.org/user-gu

Re: [PR] feat: add `with_columns` [datafusion-python]

2024-10-13 Thread via GitHub
ion-elgreco commented on code in PR #909: URL: https://github.com/apache/datafusion-python/pull/909#discussion_r1798340584 ## python/datafusion/dataframe.py: ## @@ -160,6 +160,40 @@ def with_column(self, name: str, expr: Expr) -> DataFrame: """ return DataFrame

Re: [PR] Migrate documentation for all math functions from scalar_functions.md to code [datafusion]

2024-10-13 Thread via GitHub
Omega359 commented on PR #12908: URL: https://github.com/apache/datafusion/pull/12908#issuecomment-2408982151 Beyond the `pow` removal this PR LGTM 👍

Re: [PR] refactor: rename to `with_column_renamed` to `rename` [datafusion-python]

2024-10-13 Thread via GitHub
ion-elgreco commented on code in PR #908: URL: https://github.com/apache/datafusion-python/pull/908#discussion_r1798339065 ## python/datafusion/dataframe.py: ## @@ -175,7 +178,23 @@ def with_column_renamed(self, old_name: str, new_name: str) -> DataFrame: Returns:

Re: [PR] Migrate documentation for all math functions from scalar_functions.md to code [datafusion]

2024-10-13 Thread via GitHub
Omega359 commented on code in PR #12908: URL: https://github.com/apache/datafusion/pull/12908#discussion_r1798337489 ## datafusion/functions/src/math/monotonicity.rs: ## @@ -218,14 +545,76 @@ pub fn sqrt_order(input: &[ExprProperties]) -> Result { } } +static DOCUMENTAT

Re: [PR] feat: add `cast` to DataFrame [datafusion-python]

2024-10-13 Thread via GitHub
timsaucer commented on code in PR #916: URL: https://github.com/apache/datafusion-python/pull/916#discussion_r1798329026 ## python/datafusion/dataframe.py: ## @@ -211,6 +245,19 @@ def sort(self, *exprs: Expr | SortExpr) -> DataFrame: exprs_raw = [sort_or_default(expr) f

[PR] Optimize performance of math::trunc (~2.5x faster) [datafusion]

2024-10-13 Thread via GitHub
tlm365 opened a new pull request, #12909: URL: https://github.com/apache/datafusion/pull/12909 ## Rationale for this change Same idea as https://github.com/apache/datafusion/pull/12881. Using the `unary`/`binary` functions allows faster processing (most likely auto-vectorized code) by avo

Re: [PR] Implement special `GroupColumn` support for byte view [datafusion]

2024-10-13 Thread via GitHub
Rachelint commented on code in PR #12809: URL: https://github.com/apache/datafusion/pull/12809#discussion_r1798329004 ## datafusion/physical-plan/src/aggregates/group_values/group_column.rs: ## @@ -376,6 +385,399 @@ where } } +/// An implementation of [`GroupColumn`] for

Re: [PR] feat: add `head`, `tail` methods [datafusion-python]

2024-10-13 Thread via GitHub
timsaucer commented on code in PR #915: URL: https://github.com/apache/datafusion-python/pull/915#discussion_r1798327582 ## python/datafusion/dataframe.py: ## @@ -223,6 +223,30 @@ def limit(self, count: int, offset: int = 0) -> DataFrame: """ return DataFrame(s

Re: [PR] Implement special `GroupColumn` support for byte view [datafusion]

2024-10-13 Thread via GitHub
Rachelint commented on code in PR #12809: URL: https://github.com/apache/datafusion/pull/12809#discussion_r1798325968 ## datafusion/physical-plan/src/aggregates/group_values/group_column.rs: ## @@ -376,6 +385,399 @@ where } } +/// An implementation of [`GroupColumn`] for

Re: [I] Migrate documentation for all math functions from scalar_functions.md to code [datafusion]

2024-10-13 Thread via GitHub
juroberttyb commented on issue #12867: URL: https://github.com/apache/datafusion/issues/12867#issuecomment-2408973361 Hi @alamb, thank you for providing this opportunity for newcomers to learn more about the repo! I have opened a PR, could you or someone help me review it when you ha

Re: [PR] feat: expose `join_on` [datafusion-python]

2024-10-13 Thread via GitHub
timsaucer commented on code in PR #914: URL: https://github.com/apache/datafusion-python/pull/914#discussion_r1798323879 ## python/tests/test_dataframe.py: ## @@ -259,6 +259,43 @@ def test_join(): assert table.to_pydict() == expected +def test_join_on(): +ctx = Sess

Re: [I] [DISCUSSION] Make DataFusion the fastest engine for querying parquet data in ClickBench [datafusion]

2024-10-13 Thread via GitHub
alamb commented on issue #12821: URL: https://github.com/apache/datafusion/issues/12821#issuecomment-2408972279 Thank you @tustvold -- that content is so good I made a PR to propose putting it in the readme of arrow-rs: https://github.com/apache/arrow-rs/pull/6554

[PR] migrate: math doc from static to code gen [datafusion]

2024-10-13 Thread via GitHub
juroberttyb opened a new pull request, #12908: URL: https://github.com/apache/datafusion/pull/12908 ## Which issue does this PR close? Closes #12867. ## Rationale for this change ## What changes are included in this PR? ## Are these changes

Re: [PR] refactor: dataframe `join` params [datafusion-python]

2024-10-13 Thread via GitHub
timsaucer commented on code in PR #912: URL: https://github.com/apache/datafusion-python/pull/912#discussion_r1798320329 ## python/datafusion/dataframe.py: ## @@ -284,14 +324,41 @@ def join( Args: right: Other DataFrame to join with. -join_key

Re: [PR] feat: add `with_columns` [datafusion-python]

2024-10-13 Thread via GitHub
timsaucer commented on code in PR #909: URL: https://github.com/apache/datafusion-python/pull/909#discussion_r1798315618 ## python/datafusion/dataframe.py: ## @@ -160,6 +160,40 @@ def with_column(self, name: str, expr: Expr) -> DataFrame: """ return DataFrame(s

Re: [PR] Implement special `GroupColumn` support for byte view [datafusion]

2024-10-13 Thread via GitHub
alamb commented on code in PR #12809: URL: https://github.com/apache/datafusion/pull/12809#discussion_r1798291818 ## datafusion/physical-plan/src/aggregates/group_values/group_column.rs: ## @@ -579,4 +986,208 @@ mod tests { assert!(!builder.equal_to(4, &input_array, 4))

Re: [I] [DISCUSSION] Make DataFusion the fastest engine for querying parquet data in ClickBench [datafusion]

2024-10-13 Thread via GitHub
tustvold commented on issue #12821: URL: https://github.com/apache/datafusion/issues/12821#issuecomment-2408966481 I found that LLVM is relatively good at vectorizing vertical operations provided: * There are no conditionals within the loop body * You've been careful to avoid inlin

Re: [PR] refactor: rename to `with_column_renamed` to `rename` [datafusion-python]

2024-10-13 Thread via GitHub
timsaucer commented on code in PR #908: URL: https://github.com/apache/datafusion-python/pull/908#discussion_r1798309123 ## python/datafusion/dataframe.py: ## @@ -175,7 +178,23 @@ def with_column_renamed(self, old_name: str, new_name: str) -> DataFrame: Returns:

Re: [I] [DISCUSSION] Make DataFusion the fastest engine for querying parquet data in ClickBench [datafusion]

2024-10-13 Thread via GitHub
alamb commented on issue #12821: URL: https://github.com/apache/datafusion/issues/12821#issuecomment-2408961671 > Can we use nightly rust that enable std::simd for vectorization? Although in arrow-rs, the simd code is rewritten with auto-vectorization, but when I check the generated asm, I

Re: [PR] Make PruningPredicate's rewrite public [datafusion]

2024-10-13 Thread via GitHub
alamb commented on PR #12850: URL: https://github.com/apache/datafusion/pull/12850#issuecomment-2408959974 Thanks again @adriangb -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific com

Re: [PR] Make PruningPredicate's rewrite public [datafusion]

2024-10-13 Thread via GitHub
alamb merged PR #12850: URL: https://github.com/apache/datafusion/pull/12850

Re: [I] Enable `datafusion.execution.parquet.schema_force_string_view` by default [datafusion]

2024-10-13 Thread via GitHub
alamb commented on issue #11682: URL: https://github.com/apache/datafusion/issues/11682#issuecomment-2408959341 Update: we have enough of the pieces implemented thanks to @Rachelint and @goldmedal and @jayzhan211 so I have hacked it together in a branch and am now running the performance

Re: [PR] Introduce `binary_as_string` parquet option [datafusion]

2024-10-13 Thread via GitHub
alamb commented on PR #12816: URL: https://github.com/apache/datafusion/pull/12816#issuecomment-2408955172 I rebased / squashed all the code in this branch so it would be easier to pull in to test in https://github.com/apache/datafusion/pull/12092

Re: [PR] Implement special min/max accumulator for Strings and Binary (10% faster for Clickbench Q28) [datafusion]

2024-10-13 Thread via GitHub
alamb commented on PR #12792: URL: https://github.com/apache/datafusion/pull/12792#issuecomment-2408953006 🚀

Re: [I] Implement fast min/max accumulator for binary / strings (now it uses the slower path) [datafusion]

2024-10-13 Thread via GitHub
alamb closed issue #6906: Implement fast min/max accumulator for binary / strings (now it uses the slower path) URL: https://github.com/apache/datafusion/issues/6906

Re: [PR] Implement special min/max accumulator for Strings and Binary (10% faster for Clickbench Q28) [datafusion]

2024-10-13 Thread via GitHub
alamb merged PR #12792: URL: https://github.com/apache/datafusion/pull/12792

[PR] feat: add `cast` to DataFrame [datafusion-python]

2024-10-13 Thread via GitHub
ion-elgreco opened a new pull request, #916: URL: https://github.com/apache/datafusion-python/pull/916 # Which issue does this PR close? - related https://github.com/apache/datafusion-python/issues/875 # Rationale for this change A top level cast is very practical and a common p

Re: [PR] Enable reading `StringViewArray` by default from Parquet [datafusion]

2024-10-13 Thread via GitHub
alamb commented on PR #12092: URL: https://github.com/apache/datafusion/pull/12092#issuecomment-2408950500 My plan is to pull the changes from the following PRs into this PR and rerun the overall perf test - [ ] https://github.com/apache/datafusion/pull/12792 - [ ] https://github.com

Re: [PR] Implement special `GroupColumn` support for byte view [datafusion]

2024-10-13 Thread via GitHub
alamb commented on code in PR #12809: URL: https://github.com/apache/datafusion/pull/12809#discussion_r1798283664 ## datafusion/physical-plan/src/aggregates/group_values/group_column.rs: ## @@ -376,6 +385,399 @@ where } } +/// An implementation of [`GroupColumn`] for bin

Re: [PR] Implement special `GroupColumn` support for byte view [datafusion]

2024-10-13 Thread via GitHub
alamb commented on code in PR #12809: URL: https://github.com/apache/datafusion/pull/12809#discussion_r1798281370 ## datafusion/physical-plan/src/aggregates/group_values/group_column.rs: ## @@ -376,6 +385,399 @@ where } } +/// An implementation of [`GroupColumn`] for bin

[I] Add `pivot` & `unpivot` (melt) to DataFrame [datafusion]

2024-10-13 Thread via GitHub
ion-elgreco opened a new issue, #12907: URL: https://github.com/apache/datafusion/issues/12907 ### Is your feature request related to a problem or challenge? Pivoting and unpivoting are common operations for data scientists, but they are currently missing from the DF API. ### Describe

Re: [PR] Introduce `binary_as_string` parquet option [datafusion]

2024-10-13 Thread via GitHub
alamb commented on PR #12816: URL: https://github.com/apache/datafusion/pull/12816#issuecomment-2408940698 Here is the performance of this PR. Some queries are slower, some are faster. I believe once we turn on string view everything will be faster. ``` ---

Re: [PR] Implement special min/max accumulator for Strings and Binary (10% faster for Clickbench Q28) [datafusion]

2024-10-13 Thread via GitHub
alamb commented on code in PR #12792: URL: https://github.com/apache/datafusion/pull/12792#discussion_r1798266004 ## datafusion/functions-aggregate-common/src/aggregate/groups_accumulator/nulls.rs: ## @@ -91,3 +100,105 @@ pub fn filtered_null_mask( let opt_filter = opt_filt

[I] Add `fill_null` on dataframe/expr level [datafusion]

2024-10-13 Thread via GitHub
ion-elgreco opened a new issue, #12906: URL: https://github.com/apache/datafusion/issues/12906 ### Is your feature request related to a problem or challenge? I would like to be able to fill_nulls per col/expr and on the dataframe level, akin to polars/pyspark/pandas ### Describ
