geoffreyclaude commented on PR #14547:
URL: https://github.com/apache/datafusion/pull/14547#issuecomment-2694747715

   > Thank you for working on this @geoffreyclaude -- I am sorry for the delay 
in responding.
   > 
   > Primarily my concern about this PR is that it adds new more dependencies 
(albiet optional ones) to DataFusion. We already spend a non trivial amount of 
maintenance time keeping up with dependencies / fixing related issues, etc .
   > 
   > I also worry that not everyone uses the tracing crate.
   > 
   > New features also increase the compile time, such as
   > 
   > * [Build time regressionĀ 
#14256](https://github.com/apache/datafusion/issues/14256)
   > 
   > > the "TracingExec", but it's a pretty wonky workaround. It also fails 
when the source nodes of the plan spawn new tasks, as in the ParquetSink, which 
prevent the >underlying object_store from linking its spans to the main trace.
   > 
   > FWIW the way we have handled this is to pass a tracing span along in 
PartitionedFile::extensions, but that is non ideal
   > 
   > 
https://docs.rs/datafusion/latest/datafusion/datasource/listing/struct.PartitionedFile.html#structfield.extensions
   > 
   > ## Questions / Suggestions
   > Is there some way to avoid adding a direct dependency on `tracing` but 
still keep the API flexibility you need?
   > 
   > For example, perhaps we could create a different `JoinSet` that instead of 
hard coding `tracing` could provide the callbacks for spawning tasks šŸ¤”
   
   And thank you @alamb for the detailed response! I'll try to address your 
points one by one.
   
   > I also worry that not everyone uses the tracing crate.
   the `tracing` crate has pretty much become the standard crate for 
instrumentation, same as how `tokio` is the standard crate for async runtimes. 
Almost everyone who's interested in instrumenting their DataFusion execution 
will be using the `tracing` crate. Moreover, if we push this further and 
deliver a full execution plan instrumentation solution that relies on `tracing` 
(as a separate contrib crate,) it will probably become the de-facto DataFusion 
standard.
   
   > New features also increase the compile time
   This is why I added a feature guard, as tracing definitely adds some 
compilation overhead. With the feature disabled, there should be no impact at 
all.
   
   > pass a tracing span along in PartitionedFile::extensions
   This works if you want to link the object store calls to the root span, but 
I believe it won't allow you to link them to the "real" parent span, the 
`DataSinkExec` that wraps the `ParquetSink`. It also doesn't allow you to 
instrument native and custom execution nodes past a `RepartitionExec` that 
spawns tasks on separate threads.
   
   > Is there some way to avoid adding a direct dependency on `tracing` but 
still keep the API flexibility you need?
   I really like that idea, as it doesn't need to pull in additional 
dependencies, and let's users chose their instrumentation solution.
   Unfortunately, if we want to dynamically "inject" the new tracing behavior 
at runtime, it's particularly tricky, in particular because:
   1. it should be a static injection, so each new `JoinSet` created with 
`JoinSet<T>::new()` picks it up
   2. it needs to work for all the generic types of `JoinSet`, from 
`JoinSet<()>` to `JoinSet<std::result::Result<usize, DataFusionError>>`. (So we 
probably need complex macros, or a hard-coded list of all the possible types.)
   3. It needs to be done once and once only, globally, without relying on 
`unsafe`.
   It's of course possible, but the additional complexity just doesn't seem 
worth it. ... Or maybe there's a super obvious way to do it which I completely 
missed!
   
   To me, I don't see a simpler way to open up full instrumentation of 
execution plans than what's done in this PR. It's super unfortunate that the 
`tracing` crate has a thread-local only storage for its context, but it doesn't 
seem likely to move away from this implementation decision. 
   
   Since the minimal change is pretty small, we can keep our fork of the change 
internally. But it also seemed like a good reason to push it upstream and 
contribute the full execution plan instrumentation code on top of it!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to