geoffreyclaude commented on PR #14547: URL: https://github.com/apache/datafusion/pull/14547#issuecomment-2694747715
> Thank you for working on this @geoffreyclaude -- I am sorry for the delay in responding. > > Primarily my concern about this PR is that it adds new more dependencies (albiet optional ones) to DataFusion. We already spend a non trivial amount of maintenance time keeping up with dependencies / fixing related issues, etc . > > I also worry that not everyone uses the tracing crate. > > New features also increase the compile time, such as > > * [Build time regressionĀ #14256](https://github.com/apache/datafusion/issues/14256) > > > the "TracingExec", but it's a pretty wonky workaround. It also fails when the source nodes of the plan spawn new tasks, as in the ParquetSink, which prevent the >underlying object_store from linking its spans to the main trace. > > FWIW the way we have handled this is to pass a tracing span along in PartitionedFile::extensions, but that is non ideal > > https://docs.rs/datafusion/latest/datafusion/datasource/listing/struct.PartitionedFile.html#structfield.extensions > > ## Questions / Suggestions > Is there some way to avoid adding a direct dependency on `tracing` but still keep the API flexibility you need? > > For example, perhaps we could create a different `JoinSet` that instead of hard coding `tracing` could provide the callbacks for spawning tasks š¤ And thank you @alamb for the detailed response! I'll try to address your points one by one. > I also worry that not everyone uses the tracing crate. the `tracing` crate has pretty much become the standard crate for instrumentation, same as how `tokio` is the standard crate for async runtimes. Almost everyone who's interested in instrumenting their DataFusion execution will be using the `tracing` crate. Moreover, if we push this further and deliver a full execution plan instrumentation solution that relies on `tracing` (as a separate contrib crate,) it will probably become the de-facto DataFusion standard. > New features also increase the compile time This is why I added a feature guard, as tracing definitely adds some compilation overhead. With the feature disabled, there should be no impact at all. > pass a tracing span along in PartitionedFile::extensions This works if you want to link the object store calls to the root span, but I believe it won't allow you to link them to the "real" parent span, the `DataSinkExec` that wraps the `ParquetSink`. It also doesn't allow you to instrument native and custom execution nodes past a `RepartitionExec` that spawns tasks on separate threads. > Is there some way to avoid adding a direct dependency on `tracing` but still keep the API flexibility you need? I really like that idea, as it doesn't need to pull in additional dependencies, and let's users chose their instrumentation solution. Unfortunately, if we want to dynamically "inject" the new tracing behavior at runtime, it's particularly tricky, in particular because: 1. it should be a static injection, so each new `JoinSet` created with `JoinSet<T>::new()` picks it up 2. it needs to work for all the generic types of `JoinSet`, from `JoinSet<()>` to `JoinSet<std::result::Result<usize, DataFusionError>>`. (So we probably need complex macros, or a hard-coded list of all the possible types.) 3. It needs to be done once and once only, globally, without relying on `unsafe`. It's of course possible, but the additional complexity just doesn't seem worth it. ... Or maybe there's a super obvious way to do it which I completely missed! To me, I don't see a simpler way to open up full instrumentation of execution plans than what's done in this PR. It's super unfortunate that the `tracing` crate has a thread-local only storage for its context, but it doesn't seem likely to move away from this implementation decision. Since the minimal change is pretty small, we can keep our fork of the change internally. But it also seemed like a good reason to push it upstream and contribute the full execution plan instrumentation code on top of it! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org