timsaucer commented on PR #17289:
URL: https://github.com/apache/datafusion/pull/17289#issuecomment-3218278949

   After experimenting a little more I can see two paths forward for supporting 
cases like this dataframe:
   
   ```
   +--------------+--------------+
   | a            | b            |
   +--------------+--------------+
   | 0.1111111111 | [1, 2, 3]    |
   | 0.2222222222 |              |
   |              | [4, 5, 6, 7] |
   | 0.4444444444 | []           |
   +--------------+--------------+
   ```
   
   Suppose I wanted to call `round`, passing column `a` as the value to round 
and column `b` as the number of decimal places to round to. Ultimately I want 
this to give an output like:
   
   ```
   +--------------+--------------+--------------------+
   | a            | b            | round(a, b[])      |
   +--------------+--------------+--------------------+
   | 0.1111111111 | [1, 2, 3]    | [0.1, 0.11, 0.111] |
   | 0.2222222222 |              |                    |
   |              | [4, 5, 6, 7] |                    |
   | 0.4444444444 | []           | []                 |
   +--------------+--------------+--------------------+
   ```
   
   A difficulty here is that each entry of `a` has to be paired with every 
element of the corresponding list in `b`. It appears the best way to do this is 
run-end encoding (REE). Then we could keep the `ArrayRef` for column `a` and 
create a small primitive array of run ends `[3, 4, 8, 9]`, which should give us 
an array with the same length as the `values` array of column `b`.
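
To make the run-end idea concrete, here is a minimal sketch in plain Python lists. It does not use Arrow or DataFusion; `build_run_ends`, `decode`, and the `reserve_slot_for_empty` flag are hypothetical names for illustration. It assumes run ends must be strictly increasing, so a row whose list contributes no elements gets no run at all. Under that assumption the example gives run ends `[3, 7]`; reserving one slot for each null or empty row gives `[3, 4, 8, 9]` instead.

```python
def build_run_ends(list_lengths, reserve_slot_for_empty=False):
    """Return (row_indices, run_ends) saying how often each row of `a` repeats.

    Hypothetical helper: rows with zero elements are skipped unless
    reserve_slot_for_empty is set, in which case they occupy one slot.
    """
    rows, ends, total = [], [], 0
    for row, n in enumerate(list_lengths):
        if n == 0 and reserve_slot_for_empty:
            n = 1
        if n == 0:
            continue  # a zero-length run would repeat the previous run end
        total += n
        rows.append(row)
        ends.append(total)
    return rows, ends


def decode(values, rows, ends):
    """Expand the run-end encoded view back into a flat list."""
    out, start = [], 0
    for row, end in zip(rows, ends):
        out.extend([values[row]] * (end - start))
        start = end
    return out


a = [0.1111111111, 0.2222222222, None, 0.4444444444]
b = [[1, 2, 3], None, [4, 5, 6, 7], []]

# Treat a null list as contributing zero elements (an assumption).
lengths = [0 if lst is None else len(lst) for lst in b]
rows, ends = build_run_ends(lengths)
expanded = decode(a, rows, ends)
```

Here `expanded` has one entry per element of the flattened `b`, while only the short `rows` and `ends` lists are stored alongside the untouched `a`.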
   
   I have tested this locally, but I run into the problem that the existing 
scalar functions do not handle run-end encoded arrays. All of these functions 
would need to be updated, as would any UDFs that users have written.
   
   An alternative would be to create a new array from `a` and simply duplicate 
each value as many times as necessary to line up with the elements of `b`. This 
feels like it could lead to excessive memory consumption, as we are duplicating 
values just to feed them into a function and throw them away afterwards. Yet it 
has the advantage that it would immediately support *all* scalar functions we 
have with no additional work.
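
The duplication approach can be sketched the same way in plain Python lists. This is only an illustration of the data flow, not DataFusion code; `round_lists` is a hypothetical name and Python's built-in `round` stands in for the scalar kernel. It assumes a null in either `a` or `b` produces a null output row, as in the table above.

```python
def round_lists(a, b):
    """Round each a[i] to every precision in b[i] by materializing copies of a."""
    # 1. Flatten the list column into its child values.
    flat_b = [d for lst in b if lst is not None for d in lst]
    # 2. Duplicate each value of `a` once per list element (the memory cost).
    flat_a = [x for x, lst in zip(a, b) if lst is not None for _ in lst]
    # 3. Run the ordinary element-wise scalar function on equal-length inputs.
    flat_out = [None if x is None else round(x, d) for x, d in zip(flat_a, flat_b)]
    # 4. Regroup the flat result using the original list lengths.
    out, pos = [], 0
    for x, lst in zip(a, b):
        if lst is None:
            out.append(None)
            continue
        chunk = flat_out[pos:pos + len(lst)]
        pos += len(lst)
        out.append(None if x is None else chunk)
    return out


a = [0.1111111111, 0.2222222222, None, 0.4444444444]
b = [[1, 2, 3], None, [4, 5, 6, 7], []]
result = round_lists(a, b)
```

Step 3 is the reason every existing scalar function would work unchanged: it only ever sees two flat inputs of equal length. Step 2 is where the temporary copies are allocated.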
   
   I'm a bit torn on this. I also have other reasons to want additional REE 
support throughout DataFusion. So pushing the first approach would bring 
long-term benefit but have a much longer tail of implementation work.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

