ShreyeshArangath commented on PR #1381:
URL:
https://github.com/apache/datafusion-python/pull/1381#issuecomment-3931838295
Thanks for the suggestion! Initially when I designed the change, I did
consider moving `collect()` / `execute_*()` onto plan object. The reason I
didn’t go that route was more about how observability fits into real usage
patterns (from the cases that I have seen).
Today, I think the users naturally treat a dataframe as the primary handle
for a query:
```py
df = ctx.sql("SELECT * FROM t WHERE column1 > 1")
batches = df.collect()
```
Requiring metrics to go through ExecutionPlan would effectively change the
model to look something like so
```py
df = ctx.sql("SELECT * FROM t WHERE column1 > 1")
plan = df.execution_plan()
batches = plan.collect()
metrics = plan.collect_metrics()
```
I thought that this would require users to restructure pipelines and thread
a plan object through call chains purely to have access to metrics. The LoE
required to get people to use it seemed high to me.
My goal was to make minimal changes to how users can add support for metrics
without changing how they run queries
```py
df = ctx.sql("SELECT * FROM t WHERE column1 > 1")
batches = df.collect()
plan = df.execution_plan()
metrics = plan.collect_metrics()
```
I’m happy to switch to the plan-based approach if we prefer stronger
alignment with the upstream API, but I leaned toward this design to make
observability easier to adopt without disrupting current usage patterns — lmk
what you think
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]