scsmithr commented on issue #13525:
URL: https://github.com/apache/datafusion/issues/13525#issuecomment-2494375757

   # Choosing DataFusion
   
   We chose DataFusion as the base of our product around June 2022 and kept up 
with it until earlier this year.
   
   When deciding, we looked at:
   - Community: DataFusion seemed to have quite a few contributors, and activity 
didn't look like it was slowing down. Seeing the investment from Influx was also 
a confidence builder.
   - Performance: At the time, DataFusion was already pretty fast, and again, it 
felt like it would only get better over time.
   - Extensibility: The big thing for us. It was really cool to see that almost 
every part could be extended and customized.
   
   The main alternative we looked at was DuckDB, but we opted against it since 
it's written in C++ (not a bad thing; I was just much more familiar with Rust 
and was focused on building a Rust team) and it seemed a bit less extensible 
than DataFusion for our needs.
   
   And as I mentioned in the talk, choosing DataFusion got us to a POC very 
quickly and let us build out the rest of the features we wanted.
   
   # Core challenges
   
   ## Dependencies
   
   Huge challenge for us. As a startup, we wanted to bring in libraries instead 
of writing our own. And since we were billing ourselves as a database that could 
query across data sources, bringing in libraries like 
[delta-rs](https://github.com/delta-io/delta-rs) and 
[lance](https://github.com/lancedb/lance), which already had `TableProvider` 
implementations written for those formats, seemed like a no-brainer.
   
   However, trying to keep all of these in sync with the right versions of 
DataFusion (and Arrow) was incredibly challenging and time consuming. Sometimes 
there were pretty major blockers for either of those to upgrade. Some folks on 
our team helped out with a few upgrades (e.g. 
https://github.com/delta-io/delta-rs/pull/2249), but some upgrades ended up 
taking a fair bit of time away from building the features we needed.
   
   Did we have to upgrade? No, but we chose DataFusion not for where it is 
today, but for where it'll be in the future. Delaying the upgrades would just 
have stacked up incompatibilities while we missed out on the fixes and features 
that were going in.
   
   This issue only started to crop up when we decided to depend on delta-rs and 
lance; before then we didn't really have dependency issues.
   
   ## Upgrades
   
   This goes hand-in-hand with our dependency challenges, but I want to be more 
specific about upgrading _DataFusion_ itself.
   
   > More testing via additional test coverage (it would be great to see some 
explicit investment in this area)
   
   I accept that bugs will be introduced. It's software, it happens. The real 
challenge was actually upgrading to get the bug fixes because of the deps...
   
   > Discussion on being more deliberate with API changes / slowing down 
breaking
   
   I actually don't really care about API breakages if there's a clear path to 
getting the equivalent functionality working.
   
   If something's bad, change it/break it and make it better. As an end user, 
this is fine. "Better" is subjective and depends on the use case.
   
   I'd sum this up as: upgrading DataFusion itself was fine; our challenges came 
from upgrading DataFusion together with the dependencies that also depend on it.
   
   ## Component composability
   
   We originally chose DataFusion for the composability, and I went in with the 
idea that I'd be able to have all components _within_ DataFusion be composable. 
I tried to shoehorn that ideal into our early versions of GlareDB with mixed 
success.
   
   For example, I can't just have an `Optimizer` struct configured with some 
parameters and rules; I need to go through a `SessionContext` to have the 
optimizer work right. We had a bug where `select now()` would return the wrong 
time because the optimizer was using a field provided by the context when 
replacing the `now()` call with a constant. See: 
https://github.com/GlareDB/glaredb/issues/1427
   
   There were a few other issues where attempting to use the pieces separately 
ended up biting us because of these implicit dependencies. I really wanted to 
avoid the `SessionContext` so that we could have fine-grained control over 
planning and execution, but we were forced into using it because the optimizer 
wouldn't work right for certain queries, or an `ExecutionPlan` would read a 
configuration variable from the `TaskContext` that could only be set through 
the `SessionContext` and properly threaded down.
   
   Essentially it felt like we had to depend on the `SessionContext` in order 
to use anything with confidence that it would work as expected. Either because 
things were designed that way, or because we had limited visibility 
(`pub(crate)`) into the actual fields/types.
   
   This _has_ gotten better over time (e.g. the `OptimizerConfig` would have 
solved our `select now()` problem above), but we've also had changes go the 
other way that increased dependence on the `SessionContext`. One of our earliest 
issues was the change to `FileScanConfig` removing the ability to just provide 
an `ObjectStore` (https://github.com/apache/datafusion/pull/2668). Since we're 
building a multi-tenant database, we want very complete control over how 
`ObjectStore`s are constructed and used, since they'll typically be unique to a 
session. The above change dictated how we had to manage object stores if we 
wanted to use the file scan features from DataFusion. It has also introduced 
some weirdness when more "dynamic" object stores are needed; e.g. delta-rs 
needs to generate unique URLs for object stores it creates just to store them 
on the context: 
https://github.com/delta-io/delta-rs/blob/main/crates/core/src/logstore/mod.rs#L257-L265.
   
   I'd summarize this as: we wanted to use all of DataFusion's features, but in 
a way where we could control _how_ those features were used. I _want_ full 
control over configuring the optimizer, or the listing table, or ... But in 
many cases, that configuration can only happen through the `SessionContext`.
   
   A suggestion might be to open up explicit configuration of components in 
DataFusion without the use of a `SessionContext`, and have the `SessionContext` 
just provide the easy-to-use interface over everything.
   
   # Features (misc)
   
   ## WASM
   
   > I don't think he said explicitly that DataFusion couldn't do WASM, but it 
seemed to be implied
   
   I didn't mean to imply that, just that it's not something actively tested or 
developed for in DataFusion. I can't recall the exact issue, but late last year 
(2023) I was working on getting things running in the browser and hit a panic 
because _something_ was blocking.
   
   There are a lot of considerations when bringing in dependencies while trying 
to compile for WASM, and most of our issues there were just that we didn't have 
a clean split between what works with WASM and what doesn't. That's not a 
DataFusion issue.
   
   ## Column naming/aliasing
   
   > Repeated column names came up as an example feature that was missing in 
DataFusion.
   >
   >     select * from (values (1), (2)) v(a, a)
   
   This came up quite a bit for us early on, since folks would try out joins 
across tables with very similar schemas (e.g. finding the intersection of users 
between service "A" and service "B"), and sometimes those queries would fail 
due to conflicting column names, even when the user wasn't explicitly selecting 
them.
   
   This has gotten better in DataFusion, but I don't think DataFusion's 
handling of column names (and generation of column names) is good. There's a 
lot of code to ensure that string representations of expressions stay the same 
across planning steps, and it's very prone to breaking. I did try to add more 
correlated subquery support, but stopped because, frankly, I found the logical 
planning around column names very unpleasant.
   
   ## Async planning
   
   > Another approach that is taken by the 
[SessionContext::sql](https://docs.rs/datafusion/latest/datafusion/execution/context/struct.SessionContext.html#method.sql) is:
   > 
   > 1. Does an initial pass through the parse tree to find all references (non 
async)
   > 2. Then fetch all references (can be async)
   > 3. Then does the planning (non async) with all the relevant references
   > 
   > I don't think this is particularly well documented
   
   Yes, this is what we needed. At the time, we just forked the planner since 
it was easier to make changes that way (we were also adding table functions for 
reading from remote sources).
   
   ## Execution
   
   > The talk mentioned several times that GlareDB runs distributed queries and 
found it challenging to use a different number of threads on different 
executors (it sounds like maybe they split the ExecutionPlan, which already has 
a target number of partitions baked in). It sounds like their solution was to 
write a new scheduler / execution engine that didn't have the parallelism baked 
in
   
   Yes, we wanted to independently execute parts of the query across 
heterogeneous machines.
   
   We looked at Ballista quite a bit, but ultimately went with a more 
stream-based approach to fit in better with how DataFusion executes queries. 
There's a high-level comment 
[here](https://github.com/GlareDB/glaredb/blob/legacy/crates/sqlexec/src/remote/planner.rs#L411-L503)
 about what we were trying to accomplish (splitting execution across a local 
and a remote machine), but the code itself ended up being quite complex.
   
   I don't think it makes sense for DataFusion to try to cater to this use 
case, but it ultimately ended up being a reason why we moved away, and it's 
very central to the design of the new system.
   
   ---
   
   Happy to go deeper in any of this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

