[I] [DISCUSSION] Making it easier (lessons from GlareDB) [datafusion]

via GitHub Fri, 22 Nov 2024 06:46:32 -0800


alamb opened a new issue, #13525:
URL: https://github.com/apache/datafusion/issues/13525


   ### Is your feature request related to a problem or challenge?
   
   I recently watched the [Biting the Bullet: Rebuilding GlareDB from the 
Ground Up (Sean 
Smith)](https://www.youtube.com/watch?v=Sor3KZpmbHg&list=PLSE8ODhjZXjZc2AdXq_Lc1JS62R48UX2L&index=9)
 video from the CMU database series by @scsmithr. I found it very informative 
(thanks for the talk)
   
   My high level summary of the talk is "we are replacing DataFusion with our 
own engine, rewritten from the the ground up". It sounds like GlareDB they are 
still using DataFusion as the new engine is not yet at feature parity
   
   From my perspective, it seems that GlareDB's decision is a result of their 
judgement on the relative benefits vs estimated costs between
   1. Developing a new engine by themselves (and the associated testing, 
maintenance, etc)
   2. Working to integrate DataFusion / work with the community to help make it 
better
   
   Of course, not every organization / project will have the same tradeoffs, 
but several of the challenges that @scsmithr described in the talk I think are 
worth discussing to make them better.
   
   Here is my summary of those challenges:
   
   # Effort required / Regressions during upgrades 
   The general sentiment was that it took a long time to upgrade to DataFusion 
versions. This is something I have heard from others (@paveltiunov  for example 
from Cube). We also experience this at InfluxData
   
   Also, he mentioned they had hit issues where queries that used to work did 
not after upgrade.
   
   **Possible Improvements**
   - [ ] More testing via additional test coverage (it would be great to see 
some explicit investment in this area)
   - [ ] https://github.com/apache/datafusion/issues/13470
   - [ ] Discussion on being more deliberate with API changes / slowing down 
breaking
   - [ ] Discuss / document ways to make it easier for users (e.g. 
@jacksonrnewhouse mentioned keeping to the higher levels made upgrades a lot 
easier)
   
   # Dependency management
   
   Glare DB uses Lance and [delta-rs](https://github.com/delta-io/delta-rs), 
which both use DataFusion, and currently this requires the version of 
DataFusion used by GlareDB to match the(he specifically cited 
https://github.com/delta-io/delta-rs/pull/2886 which *just* finally merged)
   
   For what it is worth, @matthewmturner  and I hit the same thing in dft (see 
https://github.com/datafusion-contrib/datafusion-dft/pull/150)
   
   **Possible Improvements**
   * I think the FFI work that @timsaucer  is working on / some examples of how 
it can be used to integrate delta and iceberg and lance built with different 
DataFusion versions would be super helpful.
   
   # WASM support
   I don't think he said explicitly that DataFusion couldn't do WASM, but it 
seemed to be implied
   
   Thanks to the work from @jonmmease @waynexia and others, DataFusion 
absolutely be compiled to WASM (check out this cool example from @XiangpengHao: 
https://parquet-viewer.haoxp.xyz/) but maybe it needs to be better documented / 
explained in a blog
   
   - [ ] Blog about how to compile DataFusion for WASM??
   
   # Implicit assumptions in LogicalPlans
   Another thing that was mentioned was the challenge of writing custom 
optimizer rules was challenging because there were implicit assumptions (e.g. 
that column names were unique)
   
   **Possible Improvements**
   - [ ] Better document what those assumptions are
   - [ ] For example, https://github.com/apache/datafusion/issues/12736
   
   # SQL Features
   Repeated column names came up as an example feature that was missing in 
DataFusion.
   
   ```sql
   select * from (values (1), (2)) v(a, a)
   ```
   
   
   **Possible Improvements**
   - [ ] I think @findepi is also thinking about this as part of  
https://github.com/apache/datafusion/issues/12723 
   
   # Distributed Planning and Execution
   
   ## Planning:
   The talk mentioned that the DataFusion planner was linear in the way it 
resolved references and the GlareDB system needed to resolve references from a 
remote catalog. Their solution seems to have been to fork the DataFusion SQL 
planner and make it `async`
   
   Another approach that is taken by the 
[`SessionContext::sql`](https://docs.rs/datafusion/latest/datafusion/execution/context/struct.SessionContext.html#method.sql)
 Is:
   1. Does  an initial pass through the parse tree to find all references (non 
async)
   2. Then fetch all references (can be `async`)
   3. Then does the planning (non async) with all the relevant references
   
   I don't think this is particularly well documented
   
   **Possible Improvements**
   - [ ] Add an example of implementing a remote / async catalog
   
   
   ## Execution:
   The talk mentioned several times that GlareDB runs distributed queries and 
found it challenging to use a different number of threads on different 
executors (it sounds like maybe they split the `ExecutionPlan`, which already 
has a target number of partitions baked in). It sounds like their solution was 
to write a new scheduler  / execution engine that didn't have the parallelism 
baked in
   
    Another potential way to achieve a different threads per execution node  is 
to do the distribution at the `LogicalPlan` level and then run the physical 
planner on each sub part of the LogicalPlan. 
   
   **Possible Improvements**
   - [ ] Maybe we could document that better / show an example of how to split 
/ run a distributed plan
   
   ### Describe the solution you'd like
   
   _No response_
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[I] [DISCUSSION] Making it easier (lessons from GlareDB) [datafusion]

Reply via email to