scsmithr commented on issue #13525: URL: https://github.com/apache/datafusion/issues/13525#issuecomment-2494375757
# Choosing DataFusion

We chose DataFusion as the base of our product around June 2022 and kept up with it until earlier this year. When deciding, we looked at:

- Community: DataFusion seemed to have quite a few contributors, and that didn't look like it was slowing down. Seeing the investment from Influx was a confidence builder.
- Performance: At the time, DataFusion was already pretty fast, and it felt like it would only get better over time.
- Extensibility: The big thing for us; it was really cool to see that almost every part could be extended and customized.

The main alternative we looked at was DuckDB, but we opted for DataFusion since DuckDB is written in C++ (not a bad thing in itself; I was just much more familiar with Rust and was focused on building a Rust team) and it seemed a bit less extensible than DataFusion for our needs. And as I mentioned in the talk, choosing DataFusion got us to a POC very quickly and let us build out the rest of the features we wanted.

# Core challenges

## Dependencies

This was a huge challenge for us. As a startup, we wanted to bring in libraries instead of writing our own. And since we were billing ourselves as a database to query across data sources, bringing in libraries like [delta-rs](https://github.com/delta-io/delta-rs) and [lance](https://github.com/lancedb/lance), with already-written `TableProvider`s for those formats, seemed like a no-brainer. However, keeping all of these in sync with the right versions of DataFusion (and Arrow) was incredibly challenging and time consuming. Sometimes there were pretty major blockers for either of those to upgrade. Some folks on our team helped out with a few upgrades (e.g. https://github.com/delta-io/delta-rs/pull/2249), but some upgrades ended up taking time away from building the features we needed. Did we have to upgrade? No, but we chose DataFusion not for where it is today, but for where it'll be in the future.
And delaying the upgrades meant we would just be stacking up incompatibilities and missing out on fixes and features that were going in. This issue only started to crop up when we decided to depend on delta-rs and lance; before then we didn't really have dependency issues.

## Upgrades

This goes hand-in-hand with our dependency challenges, but I want to be more specific about upgrading _DataFusion_.

> More testing via additional test coverage (it would be great to see some explicit investment in this area)

I accept that bugs will be introduced. It's software, it happens. The real challenge was actually upgrading to get the bug fixes, because of the deps...

> Discussion on being more deliberate with API changes / slowing down breaking

I actually don't really care about API breakages if there's a clear path to getting the equivalent functionality working. If something's bad, change it, break it, and make it better. As an end user, this is fine. "Better" is subjective and depends on the use case. I'd sum this up as: upgrades were fine; our challenges were with upgrades plus upgrading dependencies.

## Component composability

We originally chose DataFusion for the composability, and I went in with the idea that I'd be able to have all components _within_ DataFusion be composable. I tried to shoehorn that ideal into our early versions of GlareDB with mixed success. For example, I can't just have an `Optimizer` struct configured with some parameters and rules; I need to go through a `SessionContext` to have the optimizer work right. We had a bug where `select now()` would return the wrong time because the optimizer was using a field provided by the context when replacing the `now()` call with a constant. See: https://github.com/GlareDB/glaredb/issues/1427. There were a few other issues where attempting to use the pieces separately ended up biting us because of these implicit dependencies.
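A minimal, hypothetical sketch of that `now()` bug class — these types are stand-ins I made up for illustration, not DataFusion's actual `Optimizer` or `SessionContext`. The point is that constant-folding `now()` against a timestamp captured when the context was *built* is only correct if the context is short-lived:

```rust
use std::time::{Duration, SystemTime};

// Hypothetical context that captures a timestamp once, at creation.
// A long-lived instance will hand out an increasingly stale "now".
struct Context {
    query_start: SystemTime,
}

impl Context {
    fn new() -> Self {
        Self { query_start: SystemTime::now() }
    }

    /// Stand-in for the optimizer folding `now()` into a constant using the
    /// context's captured timestamp rather than the current wall clock.
    fn fold_now(&self) -> SystemTime {
        self.query_start
    }
}

fn main() {
    let ctx = Context::new();
    std::thread::sleep(Duration::from_millis(50));
    // The folded value lags the wall clock by however long the context has lived.
    let lag = SystemTime::now().duration_since(ctx.fold_now()).unwrap();
    assert!(lag >= Duration::from_millis(50));
    println!("folded now() is {:?} behind the wall clock", lag);
}
```

If the timestamp is instead supplied per query (which is roughly what a per-query optimizer config enables), the folded constant is fresh each time.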
I really wanted to avoid the `SessionContext` so that we could have fine-grained control over planning and execution, but we were forced into using it because the optimizer wouldn't work right for certain queries, or an `ExecutionPlan` would read a configuration var from the `TaskContext` that could only be set through the `SessionContext` and have it properly threaded down. Essentially it felt like we had to depend on the `SessionContext` in order to use anything with confidence that it would work as expected, either because things were designed that way, or because we had limited visibility (`pub(crate)`) into the actual fields/types. This _has_ gotten better over time (e.g. the `OptimizerConfig` would have solved our `select now()` problem above), but we've also had changes go the other way and increase dependence on the `SessionContext`.

One of our earliest issues was the change to `FileScanConfig` removing the ability to just provide an `ObjectStore` (https://github.com/apache/datafusion/pull/2668). Since we're building a multi-tenant database, we want very complete control over how `ObjectStore`s are constructed and how they're used, since they'll typically be unique to the session. The above change dictated how we managed object stores if we wanted to use the file scan features from DataFusion. This has also introduced some weirdness when more "dynamic" object stores are needed, e.g. delta-rs needs to generate unique URLs for the object stores it creates just to register them on the context: https://github.com/delta-io/delta-rs/blob/main/crates/core/src/logstore/mod.rs#L257-L265.

I'd summarize this as: we wanted to use all of DataFusion's features, but in a way where we could control _how_ those features were used. I _want_ full control over configuring the optimizer, or the listing table, or ... But in many cases, that configuration can only happen through a `SessionContext`.
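To make the object-store workaround concrete, here is a rough, hypothetical sketch of the pattern: when stores can only be looked up through a global, URL-keyed registry, per-session or dynamically created stores have to be registered under synthetic unique URLs. None of these names are DataFusion's or delta-rs's; they only illustrate the shape of the problem:

```rust
use std::collections::HashMap;

// Hypothetical minimal object-store trait (stand-in, not the real crate).
trait ObjectStore {
    fn name(&self) -> &str;
}

struct InMemoryStore(&'static str);
impl ObjectStore for InMemoryStore {
    fn name(&self) -> &str {
        self.0
    }
}

// Hypothetical URL-keyed registry, like what a context exposes.
#[derive(Default)]
struct Registry {
    stores: HashMap<String, Box<dyn ObjectStore>>,
    counter: u64,
}

impl Registry {
    /// Registers a store under a synthetic, unique URL so two sessions
    /// pointing at the "same" bucket don't collide in the shared registry.
    /// This mirrors the delta-rs pattern of generating unique URLs just to
    /// satisfy a URL-keyed lookup.
    fn register_unique(&mut self, store: Box<dyn ObjectStore>) -> String {
        self.counter += 1;
        let url = format!("session-store://{}/", self.counter);
        self.stores.insert(url.clone(), store);
        url
    }

    fn get(&self, url: &str) -> Option<&dyn ObjectStore> {
        self.stores.get(url).map(|b| b.as_ref())
    }
}

fn main() {
    let mut reg = Registry::default();
    // Two tenants, each with their own store, forced through one registry.
    let a = reg.register_unique(Box::new(InMemoryStore("tenant-a")));
    let b = reg.register_unique(Box::new(InMemoryStore("tenant-b")));
    assert_ne!(a, b);
    println!("tenant-a registered at {a}, tenant-b at {b}");
}
```

The synthetic-URL indirection works, but it is exactly the kind of incidental complexity you'd avoid if the scan path accepted a store handle directly.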
A suggestion might be to open up explicit configuration of components in DataFusion without the use of a `SessionContext`, and have the `SessionContext` just provide the easy-to-use interface over everything.

# Features (misc)

## WASM

> I don't think he said explicitly that DataFusion couldn't do WASM, but it seemed to be implied

I didn't mean to imply that, just that it's not something actively tested or developed for in DataFusion. I can't recall the exact issue, but late last year (2023) I was working on getting stuff running in the browser and was hitting a panic since _something_ was blocking. There are a lot of considerations when bringing in dependencies while trying to compile for WASM, and most of our issues there were just that we didn't have a clean split between what works with WASM and what doesn't. That's not a DataFusion issue.

## Column naming/aliasing

> Repeated column names came up as an example feature that was missing in DataFusion.
>
> select * from (values (1), (2)) v(a, a)

This came up quite a bit for us early on, since we would have folks try out joins across tables with very similar schemas (e.g. finding the intersection of users between service "A" and service "B"), and sometimes those queries would fail due to conflicting column names even when the user wasn't explicitly selecting them. This has gotten better in DataFusion, but I don't think DataFusion's handling of column names (and generating column names) is good. There's a lot of code to ensure string representations of expressions are the same across planning steps, and it's very prone to breaking. I did try to add more correlated subquery support, but stopped because, frankly, I found the logical planning with column names to be very unpleasant.

## Async planning

> Another approach that is taken by [SessionContext::sql](https://docs.rs/datafusion/latest/datafusion/execution/context/struct.SessionContext.html#method.sql) is:
>
> 1. Does an initial pass through the parse tree to find all references (non-async)
> 2. Then fetch all references (can be async)
> 3. Then does the planning (non-async) with all the relevant references
>
> I don't think this is particularly well documented

Yes, this is what we needed. At the time, we just forked the planner since it was easier to make changes that way (since we were also adding table functions for reading from remote sources).

## Execution

> The talk mentioned several times that GlareDB runs distributed queries and found it challenging to use a different number of threads on different executors (it sounds like maybe they split the ExecutionPlan, which already has a target number of partitions baked in). It sounds like their solution was to write a new scheduler / execution engine that didn't have the parallelism baked in

Yes, we wanted to independently execute parts of queries across heterogeneous machines. We looked at Ballista quite a bit, but ultimately went with a more stream-based approach to fit in better with how DataFusion executes queries. There's a high-level comment [here](https://github.com/GlareDB/glaredb/blob/legacy/crates/sqlexec/src/remote/planner.rs#L411-L503) about what we were trying to accomplish (splitting execution across a local and a remote machine), but the code itself ended up being quite complex. I don't think it makes sense for DataFusion to try to cater to this use case, but it ultimately ended up being a reason why we moved away, and it is very central to the design of the new system.

---

Happy to go deeper on any of this.
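As an appendix, the three-phase pattern quoted under "Async planning" above (collect references synchronously, fetch them asynchronously, then plan synchronously) can be sketched roughly like this. This is a deliberately simplified, synchronous stand-in — in real code phase 2 would be `async`, and none of these functions or types are DataFusion's:

```rust
use std::collections::BTreeMap;

/// Phase 1 (sync): walk the statement and collect table references.
/// A real implementation walks the parse tree; this toy version just
/// grabs the identifier after each FROM/JOIN keyword.
fn collect_references(stmt: &str) -> Vec<String> {
    stmt.split_whitespace()
        .zip(stmt.split_whitespace().skip(1))
        .filter(|(kw, _)| kw.eq_ignore_ascii_case("from") || kw.eq_ignore_ascii_case("join"))
        .map(|(_, name)| name.trim_end_matches(';').to_string())
        .collect()
}

/// Phase 2 (async in real code): resolve each reference, e.g. against a
/// remote catalog. Here every table resolves to a fixed dummy schema.
fn fetch_tables(refs: &[String]) -> BTreeMap<String, Vec<&'static str>> {
    refs.iter()
        .map(|r| (r.clone(), vec!["col_a", "col_b"]))
        .collect()
}

/// Phase 3 (sync): plan using only the pre-fetched tables, so no I/O
/// is needed during planning itself.
fn plan(stmt: &str, tables: &BTreeMap<String, Vec<&'static str>>) -> Result<String, String> {
    for r in collect_references(stmt) {
        if !tables.contains_key(&r) {
            return Err(format!("table {r} was not pre-fetched"));
        }
    }
    Ok(format!("plan over {} table(s)", tables.len()))
}

fn main() {
    let stmt = "SELECT * FROM users JOIN orders";
    let refs = collect_references(stmt);
    assert_eq!(refs, vec!["users", "orders"]);
    let tables = fetch_tables(&refs);
    let planned = plan(stmt, &tables).unwrap();
    println!("{planned}");
}
```

The useful property is that all the `await` points are confined to phase 2, which keeps the planner itself synchronous and easy to call from non-async contexts.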
