scsmithr commented on issue #13525: URL: https://github.com/apache/datafusion/issues/13525#issuecomment-2494375757
# Choosing DataFusion

We chose DataFusion as the base of our product around June 2022 and kept up with it until earlier this year. When deciding, we looked at:

- Community: DataFusion seemed to have quite a few contributors, and that didn't look like it was slowing down. Seeing the investment from Influx was a confidence builder.
- Performance: At the time, DataFusion was already pretty fast, and it felt like it would only get better over time.
- Extensibility: The big thing for us; it was really cool to see that almost every part could be extended and customized.

The main alternative we looked at was DuckDB, but we opted for DataFusion since DuckDB is written in C++ (not a bad thing in itself; I was just much more familiar with Rust and was focused on building a Rust team) and it seemed a bit less extensible than DataFusion for our needs. And as I mentioned in the talk, choosing DataFusion got us to a POC very quickly and let us build out the rest of the features we wanted.

# Core challenges

## Dependencies

This was a huge challenge for us. As a startup, we wanted to bring in libraries instead of writing our own. And since we were billing ourselves as a database to query across data sources, bringing in libraries like [delta-rs](https://github.com/delta-io/delta-rs) and [lance](https://github.com/lancedb/lance), with already-written `TableProvider`s for those formats, seemed like a no-brainer. However, keeping all of these in sync with the right versions of DataFusion (and Arrow) was incredibly challenging and time consuming. Sometimes there were pretty major blockers for either of those to upgrade. Some folks on our team helped out with a few upgrades (e.g. https://github.com/delta-io/delta-rs/pull/2249), but some upgrades ended up taking time away from building the features we needed. Did we have to upgrade? No, but we chose DataFusion not for where it is today, but for where it'll be in the future.
And delaying the upgrades meant we would just be stacking up incompatibilities and missing out on fixes and features that were going in. This issue only started to crop up when we decided to depend on delta-rs and lance; before then we didn't really have dependency issues.

## Upgrades

This goes hand-in-hand with our dependency challenges, but I want to be more specific about upgrading _DataFusion_.

> More testing via additional test coverage (it would be great to see some explicit investment in this area)

I accept that bugs will be introduced. It's software, it happens. The real challenge was actually upgrading to get the bug fixes, because of the deps...

> Discussion on being more deliberate with API changes / slowing down breaking

I actually don't really care about API breakages if there's a clear path to getting the equivalent functionality working. If something's bad, change it, break it, and make it better. As an end user, this is fine. "Better" is subjective and depends on the use case. I'd sum this up as: upgrades were fine; our challenges were with upgrades plus upgrading dependencies.

## Component composability

We originally chose DataFusion for the composability, and I went in with the idea that I'd be able to have all components _within_ DataFusion be composable. I tried to shoehorn that ideal into our early versions of GlareDB with mixed success. For example, I can't just have an `Optimizer` struct configured with some parameters and rules; I need to go through a `SessionContext` to have the optimizer work right. We had a bug where `select now()` would return the wrong time because the optimizer was using a field provided by the context when replacing the `now()` call with a constant. See: https://github.com/GlareDB/glaredb/issues/1427. There were a few other issues where attempting to use the pieces separately ended up biting us because of these implicit dependencies.
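A minimal, hypothetical sketch of that `now()` bug class — these types are stand-ins I made up for illustration, not DataFusion's actual `Optimizer` or `SessionContext`. The point is that constant-folding `now()` against a timestamp captured when the context was *built* is only correct if the context is short-lived:

```rust
use std::time::{Duration, SystemTime};

// Hypothetical context that captures a timestamp once, at creation.
// A long-lived instance will hand out an increasingly stale "now".
struct Context {
    query_start: SystemTime,
}

impl Context {
    fn new() -> Self {
        Self { query_start: SystemTime::now() }
    }

    /// Stand-in for the optimizer folding `now()` into a constant using the
    /// context's captured timestamp rather than the current wall clock.
    fn fold_now(&self) -> SystemTime {
        self.query_start
    }
}

fn main() {
    let ctx = Context::new();
    std::thread::sleep(Duration::from_millis(50));
    // The folded value lags the wall clock by however long the context has lived.
    let lag = SystemTime::now().duration_since(ctx.fold_now()).unwrap();
    assert!(lag >= Duration::from_millis(50));
    println!("folded now() is {:?} behind the wall clock", lag);
}
```

If the timestamp is instead supplied per query (which is roughly what a per-query optimizer config enables), the folded constant is fresh each time.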
I really wanted to avoid the `SessionContext` so that we could have fine-grained control over planning and execution, but we were forced into using it because the optimizer wouldn't work right for certain queries, or an `ExecutionPlan` would read a configuration var from the `TaskContext` that could only be set through the `SessionContext` and have it properly threaded down. Essentially it felt like we had to depend on the `SessionContext` in order to use anything with confidence that it would work as expected, either because things were designed that way, or because we had limited visibility (`pub(crate)`) into the actual fields/types. This _has_ gotten better over time (e.g. the `OptimizerConfig` would have solved our `select now()` problem above), but we've also had changes go the other way and increase dependence on the `SessionContext`.

One of our earliest issues was the change to `FileScanConfig` removing the ability to just provide an `ObjectStore` (https://github.com/apache/datafusion/pull/2668). Since we're building a multi-tenant database, we want very complete control over how `ObjectStore`s are constructed and how they're used, since they'll typically be unique to the session. The above change dictated how we managed object stores if we wanted to use the file scan features from DataFusion. This has also introduced some weirdness when more "dynamic" object stores are needed, e.g. delta-rs needs to generate unique URLs for the object stores it creates just to register them on the context: https://github.com/delta-io/delta-rs/blob/main/crates/core/src/logstore/mod.rs#L257-L265.

I'd summarize this as: we wanted to use all of DataFusion's features, but in a way where we could control _how_ those features were used. I _want_ full control over configuring the optimizer, or the listing table, or ... But in many cases, that configuration can only happen through a `SessionContext`.
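To make the object-store workaround concrete, here is a rough, hypothetical sketch of the pattern: when stores can only be looked up through a global, URL-keyed registry, per-session or dynamically created stores have to be registered under synthetic unique URLs. None of these names are DataFusion's or delta-rs's; they only illustrate the shape of the problem:

```rust
use std::collections::HashMap;

// Hypothetical minimal object-store trait (stand-in, not the real crate).
trait ObjectStore {
    fn name(&self) -> &str;
}

struct InMemoryStore(&'static str);
impl ObjectStore for InMemoryStore {
    fn name(&self) -> &str {
        self.0
    }
}

// Hypothetical URL-keyed registry, like what a context exposes.
#[derive(Default)]
struct Registry {
    stores: HashMap<String, Box<dyn ObjectStore>>,
    counter: u64,
}

impl Registry {
    /// Registers a store under a synthetic, unique URL so two sessions
    /// pointing at the "same" bucket don't collide in the shared registry.
    /// This mirrors the delta-rs pattern of generating unique URLs just to
    /// satisfy a URL-keyed lookup.
    fn register_unique(&mut self, store: Box<dyn ObjectStore>) -> String {
        self.counter += 1;
        let url = format!("session-store://{}/", self.counter);
        self.stores.insert(url.clone(), store);
        url
    }

    fn get(&self, url: &str) -> Option<&dyn ObjectStore> {
        self.stores.get(url).map(|b| b.as_ref())
    }
}

fn main() {
    let mut reg = Registry::default();
    // Two tenants, each with their own store, forced through one registry.
    let a = reg.register_unique(Box::new(InMemoryStore("tenant-a")));
    let b = reg.register_unique(Box::new(InMemoryStore("tenant-b")));
    assert_ne!(a, b);
    println!("tenant-a registered at {a}, tenant-b at {b}");
}
```

The synthetic-URL indirection works, but it is exactly the kind of incidental complexity you'd avoid if the scan path accepted a store handle directly.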
A suggestion might be to open up explicit configuration of components in DataFusion without the use of a `SessionContext`, and have the `SessionContext` just provide the easy-to-use interface over everything.

# Features (misc)

## WASM

> I don't think he said explicitly that DataFusion couldn't do WASM, but it seemed to be implied

I didn't mean to imply that, just that it's not something actively tested or developed for in DataFusion. I can't recall the exact issue, but late last year (2023) I was working on getting stuff running in the browser and was hitting a panic since _something_ was blocking. There are a lot of considerations when bringing in dependencies while trying to compile for WASM, and most of our issues there were just that we didn't have a clean split between what works with WASM and what doesn't. That's not a DataFusion issue.

## Column naming/aliasing

> Repeated column names came up as an example feature that was missing in DataFusion.
>
> select * from (values (1), (2)) v(a, a)

This came up quite a bit for us early on, since we would have folks try out joins across tables with very similar schemas (e.g. finding the intersection of users between service "A" and service "B"), and sometimes those queries would fail due to conflicting column names even when the user wasn't explicitly selecting them. This has gotten better in DataFusion, but I don't think DataFusion's handling of column names (and generating column names) is good. There's a lot of code to ensure string representations of expressions are the same across planning steps, and it's very prone to breaking. I did try to add more correlated subquery support, but stopped because, frankly, I found the logical planning with column names to be very unpleasant.

## Async planning

> Another approach that is taken by [SessionContext::sql](https://docs.rs/datafusion/latest/datafusion/execution/context/struct.SessionContext.html#method.sql) is:
>
> 1. Does an initial pass through the parse tree to find all references (non-async)
> 2. Then fetch all references (can be async)
> 3. Then does the planning (non-async) with all the relevant references
>
> I don't think this is particularly well documented

Yes, this is what we needed. At the time, we just forked the planner since it was easier to make changes that way (since we were also adding table functions for reading from remote sources).

## Execution

> The talk mentioned several times that GlareDB runs distributed queries and found it challenging to use a different number of threads on different executors (it sounds like maybe they split the ExecutionPlan, which already has a target number of partitions baked in). It sounds like their solution was to write a new scheduler / execution engine that didn't have the parallelism baked in

Yes, we wanted to independently execute parts of queries across heterogeneous machines. We looked at Ballista quite a bit, but ultimately went with a more stream-based approach to fit in better with how DataFusion executes queries. There's a high-level comment [here](https://github.com/GlareDB/glaredb/blob/legacy/crates/sqlexec/src/remote/planner.rs#L411-L503) about what we were trying to accomplish (splitting execution across a local and a remote machine), but the code itself ended up being quite complex. I don't think it makes sense for DataFusion to try to cater to this use case, but it ultimately ended up being a reason why we moved away, and it is very central to the design of the new system.

---

Happy to go deeper on any of this.
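As an appendix, the three-phase pattern quoted under "Async planning" above (collect references synchronously, fetch them asynchronously, then plan synchronously) can be sketched roughly like this. This is a deliberately simplified, synchronous stand-in — in real code phase 2 would be `async`, and none of these functions or types are DataFusion's:

```rust
use std::collections::BTreeMap;

/// Phase 1 (sync): walk the statement and collect table references.
/// A real implementation walks the parse tree; this toy version just
/// grabs the identifier after each FROM/JOIN keyword.
fn collect_references(stmt: &str) -> Vec<String> {
    stmt.split_whitespace()
        .zip(stmt.split_whitespace().skip(1))
        .filter(|(kw, _)| kw.eq_ignore_ascii_case("from") || kw.eq_ignore_ascii_case("join"))
        .map(|(_, name)| name.trim_end_matches(';').to_string())
        .collect()
}

/// Phase 2 (async in real code): resolve each reference, e.g. against a
/// remote catalog. Here every table resolves to a fixed dummy schema.
fn fetch_tables(refs: &[String]) -> BTreeMap<String, Vec<&'static str>> {
    refs.iter()
        .map(|r| (r.clone(), vec!["col_a", "col_b"]))
        .collect()
}

/// Phase 3 (sync): plan using only the pre-fetched tables, so no I/O
/// is needed during planning itself.
fn plan(stmt: &str, tables: &BTreeMap<String, Vec<&'static str>>) -> Result<String, String> {
    for r in collect_references(stmt) {
        if !tables.contains_key(&r) {
            return Err(format!("table {r} was not pre-fetched"));
        }
    }
    Ok(format!("plan over {} table(s)", tables.len()))
}

fn main() {
    let stmt = "SELECT * FROM users JOIN orders";
    let refs = collect_references(stmt);
    assert_eq!(refs, vec!["users", "orders"]);
    let tables = fetch_tables(&refs);
    let planned = plan(stmt, &tables).unwrap();
    println!("{planned}");
}
```

The useful property is that all the `await` points are confined to phase 2, which keeps the planner itself synchronous and easy to call from non-async contexts.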
