deanm0000 commented on PR #1074: URL: https://github.com/apache/datafusion-python/pull/1074#issuecomment-2741334821
I wrote this script to help produce this PR. I wanted to start with just the functions which have a single Expr input and return an Expr. The script writes all those defs to a file; I then copied that over to the source and let the linter fix the bad formatting that my code created.

```python
from inspect import signature, Parameter
from datafusion import functions as f, Expr
from types import FunctionType
from pathlib import Path

funcs_not_exprs = set(dir(f)) - set(dir(Expr))
funcs = []
for fun in funcs_not_exprs:
    if isinstance(getattr(f, fun), FunctionType):
        funcs.append(fun)

expr_in_out = {
    "one_in_out": [],
    "multi_expr": [],
    "other_ins": [],
    "other_return": [],
    "other": [],
}
for fun in funcs:
    sig = signature(getattr(f, fun))
    params = sig.parameters
    return_annotation = sig.return_annotation
    if return_annotation != "Expr":
        expr_in_out["other_return"].append(fun)
        continue
    all_expr = True
    no_star = True
    for name, param in params.items():
        if param.annotation != "Expr":
            all_expr = False
            break
        if param.kind in (Parameter.VAR_POSITIONAL, Parameter.KEYWORD_ONLY):
            no_star = False
    if len(params) == 1 and all_expr and no_star:
        expr_in_out["one_in_out"].append(fun)
    elif len(params) > 1 and all_expr:
        expr_in_out["multi_expr"].append(fun)
    elif len(params) > 1 and not all_expr:
        expr_in_out["other_ins"].append(fun)
    else:
        expr_in_out["other"].append(fun)

expr_defs = Path("./expr_defs.py")
with expr_defs.open("w") as ff:
    for fun in expr_in_out["one_in_out"]:
        ff.write(f"    def {fun}(self) -> Expr:\n")
        docstring = getattr(f, fun).__doc__
        if docstring is not None:
            ff.write('        """')
            docstring = docstring.strip()
            ff.write(docstring)
            ff.write('\n        """\n')
        ff.write(f"        return F.{fun}(self)\n")
```

Before I do tests for all of them, I wanted to put this in the world for feedback.
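To make the bucketing rules easier to review, here is a minimal sketch of the same classification logic run against a few stub functions instead of `datafusion.functions`. The stubs (`tan`, `levenshtein`, `concat`, `col`, `order_by`) and the `classify` helper are hypothetical stand-ins that only mirror the shape of the real signatures. The sketch assumes, as the script does, that annotations are visible to `inspect.signature` as the string `"Expr"`.

```python
from inspect import Parameter, signature


# Hypothetical stubs; annotations are written as strings on purpose,
# because the script compares them against the literal "Expr".
def tan(arg: "Expr") -> "Expr": ...
def levenshtein(string1: "Expr", string2: "Expr") -> "Expr": ...
def concat(*args: "Expr") -> "Expr": ...
def col(name: "str") -> "Expr": ...
def order_by(expr: "Expr") -> "SortExpr": ...


def classify(func) -> str:
    """Return the bucket name that the script's rules assign to func."""
    sig = signature(func)
    if sig.return_annotation != "Expr":
        return "other_return"
    params = sig.parameters
    all_expr = True
    no_star = True
    for param in params.values():
        if param.annotation != "Expr":
            all_expr = False
            break
        if param.kind in (Parameter.VAR_POSITIONAL, Parameter.KEYWORD_ONLY):
            no_star = False
    if len(params) == 1 and all_expr and no_star:
        return "one_in_out"
    if len(params) > 1 and all_expr:
        return "multi_expr"
    if len(params) > 1 and not all_expr:
        return "other_ins"
    return "other"


buckets = {fn.__name__: classify(fn) for fn in (tan, levenshtein, concat, col, order_by)}
```

Under these rules only `tan` lands in `one_in_out`. A single `*args: Expr` parameter (`concat`) and a single non-Expr parameter (`col`) both fall through to `other`, so neither gets a generated method.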
One additional idea that wasn't in the original PR would be to create namespaces that group functions by category, so instead of `col("a").tan()` it'd be `col("a").trig.tan()`, `col("b").list.length()`, `col("c").str.reverse()`, `col("d").dt.to_timestamp()`. That avoids having too many functions to choose from at once, and it puts similar functions together. If someone is working with datetimes, then with a datetime namespace all the functions they look through are for datetimes. Similarly, there wouldn't be datetime functions clogging up the root of Expr. Same for trig, strings, lists, arrays, and anything else that deserves a namespace. That, of course, requires some more manual effort to categorize the functions.

As another forward thought, when it comes to functions that take extra Expr inputs like `levenshtein`, I would also put in a convenience check, so the function would be

```python
def levenshtein(self, string2: Expr | str) -> Expr:
    if isinstance(string2, str):
        string2 = col(string2)
    return F.levenshtein(self, string2)
```

That would be consistent with polars with respect to literals: a bare string is treated as a column name, so if someone wanted the levenshtein against a literal they'd have to use `lit("other_string")` rather than just using "other_string" directly. For functions that take a number as the second argument, it'd wrap the number in `lit` and use it directly, such as:

```python
def pow(self, exponent: Expr | int | float) -> Expr:
    if isinstance(exponent, (int, float)):
        exponent = lit(exponent)
    return F.pow(self, exponent)
```
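For discussion, here is a rough sketch of how the namespace idea and the coercion rule could fit together. Everything in it is a toy: `Expr`, `col`, and `lit` are stand-ins that only record a textual form rather than the real datafusion classes, and `TrigNamespace`, `StringNamespace`, `_as_col`, and `_as_lit` are hypothetical names, not part of this PR.

```python
from __future__ import annotations


class Expr:
    """Toy stand-in for datafusion.Expr; it only stores a text rendering."""

    def __init__(self, text: str) -> None:
        self.text = text

    def pow(self, exponent: Expr | int | float) -> Expr:
        # Numbers are wrapped with lit(); Expr inputs pass through.
        return Expr(f"pow({self.text}, {_as_lit(exponent).text})")

    @property
    def trig(self) -> TrigNamespace:
        return TrigNamespace(self)

    @property
    def str(self) -> StringNamespace:
        return StringNamespace(self)


def col(name: str) -> Expr:
    return Expr(name)


def lit(value: object) -> Expr:
    return Expr(repr(value))


def _as_col(value: Expr | str) -> Expr:
    # A bare string is treated as a column name, not as a literal.
    return col(value) if isinstance(value, str) else value


def _as_lit(value: Expr | int | float) -> Expr:
    return value if isinstance(value, Expr) else lit(value)


class TrigNamespace:
    def __init__(self, expr: Expr) -> None:
        self._expr = expr

    def tan(self) -> Expr:
        return Expr(f"tan({self._expr.text})")


class StringNamespace:
    def __init__(self, expr: Expr) -> None:
        self._expr = expr

    def reverse(self) -> Expr:
        return Expr(f"reverse({self._expr.text})")

    def levenshtein(self, other: Expr | str) -> Expr:
        return Expr(f"levenshtein({self._expr.text}, {_as_col(other).text})")
```

With this shape, `col("a").str.levenshtein("b")` compares against column `b`, while `col("a").str.levenshtein(lit("b"))` compares against the string literal. Only `pow` and the namespace properties sit on the root of the toy `Expr`.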