1) In my opinion this is too complex for the average user. In this case I'm
assuming you have some sort of optimizer that would apply it automatically for
the user? If it's just at the research stage of things, can you just modify
Spark to do your experiments?
2) I think the main thing is having …
Thanks for the clarification, Tom!
A bit more background on what we want to do: we have proposed a fine-grained
(stage-level) resource optimization approach at VLDB 2022
(https://www.vldb.org/pvldb/vol15/p3098-lyu.pdf) and would like to try it on
Spark. Our approach can recommend the resource con…
See the original SPIP as to why we only support RDDs:
https://issues.apache.org/jira/browse/SPARK-27495
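For anyone following along, this is roughly what the current RDD-only API looks like; a minimal sketch, with placeholder resource amounts rather than recommendations, and the app name is just illustrative:

import org.apache.spark.resource.{ExecutorResourceRequests, ResourceProfileBuilder, TaskResourceRequests}
import org.apache.spark.sql.SparkSession

object RddStageLevelSchedulingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("rdd-stage-level-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Describe the executor and task resources wanted for the stages that
    // compute this RDD. The amounts are placeholders, not recommendations.
    val execReqs = new ExecutorResourceRequests().cores(4).memory("8g")
    val taskReqs = new TaskResourceRequests().cpus(1)
    val profile  = new ResourceProfileBuilder().require(execReqs).require(taskReqs).build()

    // withResources is only available on RDDs; there is no equivalent hook on
    // the Dataset/DataFrame API. (Stage-level scheduling also has
    // cluster-manager / dynamic-allocation requirements not shown here.)
    val doubled = sc.parallelize(1 to 1000000, numSlices = 100)
      .withResources(profile)
      .map(_ * 2)

    println(doubled.sum())
    spark.stop()
  }
}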
The main problem is exactly what you are referring to. The RDD level is not
exposed to the user when using the SQL or DataFrame APIs. This is on purpose,
and the user shouldn't have to know anything …
Thanks for the reply!
To clarify, for issue 2: a query can still be broken apart into multiple jobs
even without AQE; I have turned AQE off in my posted example.
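(For concreteness, the only switch flipped was the standard AQE conf; the session setup below is a generic sketch, not the actual job:)

import org.apache.spark.sql.SparkSession

// Generic sketch: disabling adaptive query execution for a session.
val spark = SparkSession.builder()
  .appName("aqe-off-sketch")
  .config("spark.sql.adaptive.enabled", "false") // AQE off from the start
  .getOrCreate()

// The same flag can also be flipped at runtime, or passed via
// --conf spark.sql.adaptive.enabled=false on spark-submit.
spark.conf.set("spark.sql.adaptive.enabled", "false")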
For issue 1: an end user only needs to turn a knob on or off to use the
stage-level scheduling for Spark SQL; I am considering adding a comp…
I think issue 2 is caused by adaptive query execution. This will break
queries apart into multiple jobs; each subsequent job generates an RDD that is
based on the previous ones.
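(A quick way to see this, using made-up data and a made-up query rather than the original one: count job starts with a listener while running a single query with AQE enabled.)

import java.util.concurrent.atomic.AtomicInteger

import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}
import org.apache.spark.sql.SparkSession

object AqeJobCountSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .config("spark.sql.adaptive.enabled", "true") // AQE on
      .getOrCreate()
    import spark.implicits._

    val jobCount = new AtomicInteger(0)
    spark.sparkContext.addSparkListener(new SparkListener {
      override def onJobStart(jobStart: SparkListenerJobStart): Unit =
        jobCount.incrementAndGet()
    })

    val left  = (1 to 100000).toDF("id")
    val right = (1 to 100000).map(i => (i, i % 10)).toDF("id", "bucket")

    // With AQE on, shuffle stages may be materialized as separate jobs, each
    // reading the shuffle output (RDDs) produced by the previous ones.
    left.join(right, "id").groupBy("bucket").count().collect()

    // Listener events are delivered asynchronously, so give the bus a moment.
    Thread.sleep(2000)
    println(s"Jobs triggered for one query: ${jobCount.get()}")
    spark.stop()
  }
}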
As for issue 1, I am not sure how much you want to expose to an end user here.
SQL is declarative, and it does not specify how …
Hi,
I plan to use stage-level scheduling with Spark SQL to apply some fine-grained
optimizations over the DAG of stages. However, I am blocked by the following
issues:
1. The current stage-level scheduling supports the RDD API only. Is there a way
to reuse the stage-level scheduling for …