alamb opened a new issue, #13525: URL: https://github.com/apache/datafusion/issues/13525
### Is your feature request related to a problem or challenge?

I recently watched the [Biting the Bullet: Rebuilding GlareDB from the Ground Up (Sean Smith)](https://www.youtube.com/watch?v=Sor3KZpmbHg&list=PLSE8ODhjZXjZc2AdXq_Lc1JS62R48UX2L&index=9) video from the CMU database series by @scsmithr. I found it very informative (thanks for the talk).

My high level summary of the talk is "we are replacing DataFusion with our own engine, rewritten from the ground up". It sounds like GlareDB is still using DataFusion, as the new engine is not yet at feature parity.

From my perspective, it seems that GlareDB's decision is a result of their judgement on the relative benefits vs estimated costs between

1. Developing a new engine by themselves (and the associated testing, maintenance, etc)
2. Working to integrate DataFusion / work with the community to help make it better

Of course, not every organization / project will have the same tradeoffs, but I think several of the challenges that @scsmithr described in the talk are worth discussing so we can improve on them.

Here is my summary of those challenges:

# Effort required / Regressions during upgrades

The general sentiment was that it took a long time to upgrade to new DataFusion versions. This is something I have heard from others (@paveltiunov from Cube, for example). We also experience this at InfluxData.

Also, he mentioned they had hit issues where queries that used to work no longer did after an upgrade.

**Possible Improvements**
- [ ] More testing via additional test coverage (it would be great to see some explicit investment in this area)
- [ ] https://github.com/apache/datafusion/issues/13470
- [ ] Discussion on being more deliberate with API changes / slowing down breaking changes
- [ ] Discuss / document ways to make it easier for users (e.g.
@jacksonrnewhouse mentioned keeping to the higher levels made upgrades a lot easier)

# Dependency management

GlareDB uses Lance and [delta-rs](https://github.com/delta-io/delta-rs), which both use DataFusion, and currently this requires the version of DataFusion used by GlareDB to match the version they use (he specifically cited https://github.com/delta-io/delta-rs/pull/2886 which *just* finally merged).

For what it is worth, @matthewmturner and I hit the same thing in dft (see https://github.com/datafusion-contrib/datafusion-dft/pull/150).

**Possible Improvements**
* I think the FFI work that @timsaucer is working on / some examples of how it can be used to integrate delta and iceberg and lance built with different DataFusion versions would be super helpful.

# WASM support

I don't think he said explicitly that DataFusion couldn't do WASM, but it seemed to be implied.

Thanks to the work from @jonmmease, @waynexia and others, DataFusion can absolutely be compiled to WASM (check out this cool example from @XiangpengHao: https://parquet-viewer.haoxp.xyz/), but maybe it needs to be better documented / explained in a blog.

- [ ] Blog about how to compile DataFusion for WASM??

# Implicit assumptions in LogicalPlans

Another thing that was mentioned was that writing custom optimizer rules was challenging because there were implicit assumptions (e.g. that column names were unique).

**Possible Improvements**
- [ ] Better document what those assumptions are
- [ ] For example, https://github.com/apache/datafusion/issues/12736

# SQL Features

Repeated column names came up as an example of a feature that was missing in DataFusion.
```sql
select * from (values (1), (2)) v(a, a)
```

**Possible Improvements**
- [ ] I think @findepi is also thinking about this as part of https://github.com/apache/datafusion/issues/12723

# Distributed Planning and Execution

## Planning:

The talk mentioned that the DataFusion planner was linear in the way it resolved references and the GlareDB system needed to resolve references from a remote catalog. Their solution seems to have been to fork the DataFusion SQL planner and make it `async`.

Another approach, which is the one taken by [`SessionContext::sql`](https://docs.rs/datafusion/latest/datafusion/execution/context/struct.SessionContext.html#method.sql), is to:

1. Do an initial pass through the parse tree to find all references (non async)
2. Fetch all references (can be `async`)
3. Do the planning (non async) with all the relevant references

I don't think this is particularly well documented.

**Possible Improvements**
- [ ] Add an example of implementing a remote / async catalog

## Execution:

The talk mentioned several times that GlareDB runs distributed queries and found it challenging to use a different number of threads on different executors (it sounds like maybe they split the `ExecutionPlan`, which already has a target number of partitions baked in). It sounds like their solution was to write a new scheduler / execution engine that didn't have the parallelism baked in.

Another potential way to achieve a different number of threads per execution node is to do the distribution at the `LogicalPlan` level and then run the physical planner on each sub part of the `LogicalPlan`.

**Possible Improvements**
- [ ] Maybe we could document that better / show an example of how to split / run a distributed plan

### Describe the solution you'd like

_No response_

### Describe alternatives you've considered

_No response_

### Additional context

_No response_
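To make the three-step planning approach from the "Planning" section concrete, here is a minimal, library-free sketch in Python. It is not DataFusion's API: the `REMOTE_CATALOG` dictionary, all function names, and the regex that stands in for a parse-tree walk are hypothetical and exist only for illustration. The point it tries to show is that only the middle step needs to be `async`, so the planner itself can stay synchronous.

```python
import asyncio
import re

# Hypothetical stand-in for a remote catalog: table name -> column names.
REMOTE_CATALOG = {
    "orders": ["id", "amount"],
    "users": ["id", "name"],
}


def find_table_references(sql: str) -> list[str]:
    """Step 1 (sync): collect every table the query refers to.

    A real planner would walk the parse tree; a regex over FROM / JOIN
    clauses is used here purely as an assumption to keep the sketch short.
    """
    names = re.findall(
        r"\b(?:from|join)\s+([A-Za-z_][A-Za-z0-9_]*)", sql, flags=re.IGNORECASE
    )
    # De-duplicate while preserving first-seen order.
    return list(dict.fromkeys(name.lower() for name in names))


async def fetch_schema(name: str) -> list[str]:
    """Step 2 (async): simulate one network round trip per table."""
    await asyncio.sleep(0)  # placeholder for real remote I/O
    if name not in REMOTE_CATALOG:
        raise KeyError(f"table not found in remote catalog: {name}")
    return REMOTE_CATALOG[name]


async def resolve_all(names: list[str]) -> dict[str, list[str]]:
    """Fetch all referenced schemas concurrently into a local snapshot."""
    schemas = await asyncio.gather(*(fetch_schema(n) for n in names))
    return dict(zip(names, schemas))


def plan(sql: str, schemas: dict[str, list[str]]) -> str:
    """Step 3 (sync): plan using only the prefetched snapshot, no I/O."""
    scans = [
        f"Scan({table}: {', '.join(schemas[table])})"
        for table in find_table_references(sql)
    ]
    return " JOIN ".join(scans)


def plan_with_remote_catalog(sql: str) -> str:
    names = find_table_references(sql)         # 1. sync pass
    schemas = asyncio.run(resolve_all(names))  # 2. async fetch
    return plan(sql, schemas)                  # 3. sync planning


if __name__ == "__main__":
    print(plan_with_remote_catalog(
        "SELECT * FROM orders JOIN users ON orders.id = users.id"
    ))
```

Because the planner only ever sees the prefetched snapshot, it does not need to be forked or made `async`. The cost is that the catalog is consulted once up front rather than lazily during planning.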
