deanm0000 commented on PR #1074:
URL: 
https://github.com/apache/datafusion-python/pull/1074#issuecomment-2741334821

   I wrote this script to help produce this PR. I wanted to start with just the 
functions which have a single Expr input that return an Expr. The script writes 
all those defs to a file and then I copied that over to the source and let the 
linter fix the bad formatting that my code created. 
   
   ```python
   from inspect import signature, Parameter
   from datafusion import functions as f, Expr
   from types import FunctionType
   from pathlib import Path
   
   funcs_not_exprs = set(dir(f)) - set(dir(Expr))
   funcs = []
   for fun in funcs_not_exprs:
       if isinstance(getattr(f, fun), FunctionType):
           funcs.append(fun)
   
   expr_in_out = {
       "one_in_out": [],
       "multi_expr": [],
       "other_ins": [],
       "other_return": [],
       "other": [],
   }
   for fun in funcs:
       sig = signature(getattr(f, fun))
       params = sig.parameters
       return_annotation = sig.return_annotation
       if return_annotation != "Expr":
           expr_in_out["other_return"].append(fun)
           continue
       all_expr = True
       no_star = True
       for name, param in params.items():
           if param.annotation != "Expr":
               all_expr = False
               break
           if param.kind in (Parameter.VAR_POSITIONAL, Parameter.KEYWORD_ONLY):
               no_star=False
       if len(params) == 1 and all_expr and no_star:
           expr_in_out["one_in_out"].append(fun)
       elif len(params) > 1 and all_expr:
           expr_in_out["multi_expr"].append(fun)
       elif len(params) > 1 and not all_expr:
           expr_in_out["other_ins"].append(fun)
       else:
           expr_in_out["other"].append(fun)
   
   expr_defs = Path("./expr_defs.py")
   with expr_defs.open("w") as ff:
       for fun in expr_in_out["one_in_out"]:
           ff.write(f"    def {fun}(self) -> Expr:\n")
   
           docstring = getattr(f, fun).__doc__
           if docstring is not None:
               ff.write('        """')
               docstring = docstring.strip()
               ff.write(docstring)
               ff.write('\n        """\n')
           ff.write(f"        return F.{fun}(self)\n")
   
   ```
   
   Before I do tests for all of them, I wanted to put this in the world for 
feedback.
   
   One additional idea that wasn't in the original PR would be to create 
namespaces to group the category of function so instead of `col('a").tan()` 
it'd be `col("a").trig.tan()`, `col("b").list.length()`, 
`col("c").str.reverse()`, `col("d").dt.to_timestamp()`. That keeps there from 
being too many available functions to choose from and it puts similar functions 
together. That way if someone is working with datetimes then with a datetime 
namespace, all the functions they look through are for datetimes. Similarly, 
there wouldn't be datetime functions clogging up the root of Expr. Same for 
trig, strings, lists, arrays, and anything else that deserves a namespace.
   
   That, of course, requires some more manual effort to categorize the 
functions. 
   
   
   As another forward thought, when it comes to functions that take extra Expr 
inputs like `levenshtein`, I would also put in a convenience check where the 
function would be
   
   ```python
   def levenshtein(self, string2: Expr|str) -> Expr:
       if isinstance(string2, str):
           string2=col(string2)
       return F.levenshtein(self, string2)
   ```
   
   That would be consistent with polars wrt to literals, so if someone wanted 
the levenshtein against a literal they'd have to use `lit("other_string")` 
rather than just using the "other_string" directly.
   
   For functions that take a number as the second then it'd use that directly, 
such as:
   
   ```python
   def pow(self, exponent: Expr | int | float) -> Expr:
       if isinstance(exponent, (int, float)):
           exponent=lit(exponent)
       return f.pow(self, exponent)
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to