villebro opened a new issue, #29515:
URL: https://github.com/apache/superset/issues/29515
## [SIP] Proposal for Simplified Global Async Queries
### Motivation
With [\[SIP-39\] Global Async Query Support](#9190) (GAQ for short) still
behind an experimental feature flag, and not actively maintained, I've been
thinking about ways we could simplify the architecture, and finally make
this generally available in a forthcoming Superset release. I feel the
following issues have all contributed to the feature not gaining wide
community traction:
- **No deduplication support**: In my experience, the lack of deduplication
of heavy queries is one of the main bottlenecks of Superset, and tends to cause
major issues when many people try to access the same charts/dashboards. There
was an effort to add [deduplication support to GAQ](#14112), but the proposal
went stale.
- **Design complexity**: There was heated discussion during the SIP process
about websockets potentially being too heavy-handed for this particular task,
with channels, reconnection functionality etc. In retrospect, I feel most of
the concerns raised turned out to be valid - while the implementation
may be more elegant than a simple polling solution, very few committers or
community members ended up fully understanding the minutiae of the
implementation, causing it to fall out of active maintenance.
- **New components**: Using the websocket feature required adding a new
component to the Superset deployment, along with tightly coupling with the
Redis Streams protocol. A simpler polling solution would likely have received
more adoption from the community, leading to the feature stabilizing more
quickly.
Having said all this, the feature is still as relevant today as it was when
the original SIP was opened, and I think stabilizing it is very important,
because Superset's current synchronous query execution model causes lots of
issues:
- If many people open the same chart/dashboard at the same time, they will
all start a query against the underlying database, as there is no locking of
queries.
- If a user refreshes a dashboard multiple times, they can quickly congest
the downstream database with heavy queries, eating up both webserver threads
and database resources.
- There's no way to cancel queries that get orphaned by closed browsers.
- In some cases, the web worker threads/processes get blocked waiting for
long-running queries to complete, making it impossible to effectively
scale web worker replica sets based on CPU consumption. By moving queries to
async workers, it should become possible to get by with much slimmer web
worker replica sets. Furthermore, async workers could be scaled up/down based
on queue depth.
### Proposed Change
To simplify the architecture and reuse existing functionality, I propose the
following:
- The websocket architecture is removed, as it adds a lot of complexity -
going forward, only polling would be supported.
- The concept of a "query context cache key" is removed in favor of only a
single cache key, i.e. the one we already use for chart data.
- A new model is introduced for async queries. If data for a particular
cache key isn't found, an entry is added to the new model, which tracks the
query progress. The model will get a menu in the UI, from which users will be
able to cancel their own queries (Admins will see all queries). Ultimately,
this entry gets deleted once the data request is completed.
- When requesting chart data, if the data exists in the cache, the data is
returned normally. When chart data isn't available in the cache, only the
`cache_key` is returned, along with additional details: when the most recent
chart data request was submitted, the status (pending, executing), the last
heartbeat from the async worker etc.
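The metadata tracked by the new model could be shaped roughly as follows (a minimal sketch; the field names and the `QueryStatus` enum are illustrative assumptions, not an actual Superset model):

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional


class QueryStatus(str, Enum):
    """Hypothetical lifecycle states for a tracked async query."""
    PENDING = "pending"
    EXECUTING = "executing"


@dataclass
class AsyncQueryEntry:
    """Sketch of the new model tracking an in-flight chart data query.

    An entry is created when a chart data request misses the cache, and
    deleted once the result has been written to the cache.
    """
    cache_key: str
    user_id: int
    status: QueryStatus
    submitted_at: datetime
    last_heartbeat: Optional[datetime] = None  # last signal from the async worker
    last_poll_time: Optional[datetime] = None  # last time a client polled this key
    poll_ttl: Optional[int] = None  # seconds; None disables auto-cancellation
```

The UI menu for managing queued/executing queries would essentially be a filtered listing over entries like this, scoped to the current user (or unscoped for Admins).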
- The async execution flow is changed to be similar to SQL Lab async
execution, with the following changes:
- When the async worker starts executing the query, the cache key is locked
using the `KeyValueDistributedLock` context manager. This means that only a
single worker executes any given cache key's query at a time.
- To support automatic cancellation of queries, we add a new optional field
`poll_ttl` to the query context, which makes it possible to automatically
cancel queries that are not being actively polled. Every time the cache key is
polled, the latest poll time is updated on the metadata object. While
executing, the worker periodically checks the metadata object, and if
`poll_ttl` is defined and the last poll time exceeds the TTL, the query is
cancelled. This ensures that if a person closes a dashboard with lots of long-
running queries, the queries are automatically cancelled if nobody is actively
waiting for the results. By default, frontend requests have `poll_ttl` set to
whichever value is set in the config (`DEFAULT_CHART_DATA_POLL_TTL`). Cache
warmup requests would likely not have a `poll_ttl` set, so that they aren't
cancelled just because nobody is polling for the results.
- To limit hammering the polling endpoint, we introduce a customizable
backoff function in `superset_config.py`, which makes it possible to define how
polling backoff should be implemented. The default behavior would be some sort
of exponential backoff, where freshly started queries are polled more actively,
and queries that have been pending/running for a long time are polled less
frequently. When the frontend requests chart data, the backend provides the
recommended wait time in the response based on the backoff function. Note
that backoff is based on the time passed since query submission; this means
that if I open a dashboard with a chart whose query has been running for 10
minutes, the browser will re-poll much more slowly than it would if the query
had just been dispatched to the async workers.
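The TTL-based cancellation and backoff behavior described above can be sketched as follows (illustrative only; the function names, the `CHART_DATA_POLL_BACKOFF` hook, and the metadata shape are assumptions, not existing Superset code):

```python
from datetime import datetime, timedelta
from typing import Optional


def should_cancel_query(
    poll_ttl: Optional[int],
    last_poll_time: Optional[datetime],
    now: datetime,
) -> bool:
    """Worker-side check: cancel only when a TTL is defined and the last
    client poll is older than that TTL. Requests without a poll_ttl
    (e.g. cache warmup) are never auto-cancelled."""
    if poll_ttl is None or last_poll_time is None:
        return False
    return now - last_poll_time > timedelta(seconds=poll_ttl)


def chart_data_poll_backoff(elapsed_seconds: float) -> float:
    """Recommended wait (seconds) before the next poll, based on time
    since query submission: fresh queries are polled aggressively,
    long-running ones at most once a minute."""
    base_delay = 1.0  # poll after ~1s for a freshly submitted query
    max_delay = 60.0  # never wait longer than a minute between polls
    # Double the delay every 10 seconds of elapsed time.
    return min(max_delay, base_delay * 2 ** (elapsed_seconds / 10))


# Hypothetical hooks in superset_config.py:
CHART_DATA_POLL_BACKOFF = chart_data_poll_backoff
DEFAULT_CHART_DATA_POLL_TTL = 300  # seconds
```

With a backoff like this, a chart whose query has been running for 10 minutes would be re-polled at most once a minute, while a freshly submitted query is checked again after about a second.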
Some random thoughts:
- Currently multi-query query contexts get executed serially. With this new
approach the queries can be executed in parallel, as each query is dispatched
separately.
- I feel synchronous execution is very problematic in the context of
Superset due to the problems described in the intro of this post. Originally I
thought about proposing making async queries the *only* supported query
mechanism in Superset. However, as @betodealmeida [pointed
out](https://github.com/apache/superset/issues/9190#issuecomment-2206778896),
certain databases that are expected to return data at sub-second latencies are
better suited to a synchronous flow, as dispatching Celery tasks can add a few
seconds of extra overhead to the process. For this reason, we should probably
keep async execution an optional feature.
### New or Changed Public Interfaces
- New config flag for providing a backoff function for chart data polling.
This will be used to tell the frontend when it should next poll for completed
chart data.
- New config flag for setting the default chart data poll TTL.
- New FAB model + DAO for async queries, along with a new menu for managing
currently queued/executing queries.
- New optional fields added to chart data request:
- `poll_ttl`: if set, the query will be cancelled unless a client has
asked for data within the TTL bounds. This will ensure that dashboards that are
closed don't leave orphaned chart data requests.
- `execution_mode`: the client can ask the query to be executed sync,
async or using the default mode. This is specifically added for programmatic
integrations, where implementing the polling mechanism may sometimes add
unnecessary complexity.
- New fields added to the chart data response:
- `poll_delay`: how many seconds should the client wait before checking
if the query has completed. This value will be calculated by the backend based
on the backoff function.
- `status` and `start_dttm`: is the query queued or started, and when did
the query start executing. This information can be used by the chart component
to give the user information about the state of the query, similar to how we
currently display how stale the cached data is. So in the future, we may
display the following: "Query is queued", or "Query executing for 2 minutes".
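Putting the new request and response fields together, a polling exchange might look roughly like this (a sketch; the field values and payload shapes are illustrative, not the final API contract):

```python
# Hypothetical chart data request with the new optional fields.
chart_data_request = {
    "datasource": {"id": 1, "type": "table"},
    "queries": [{"metrics": ["count"], "columns": ["state"]}],
    "poll_ttl": 300,  # cancel if no client polls within 5 minutes
    "execution_mode": "async",  # "sync" | "async" | "default"
}

# Hypothetical response while the query is still running: no data yet,
# only the cache key and polling metadata.
pending_response = {
    "cache_key": "a1b2c3",
    "status": "executing",  # "queued" | "executing"
    "start_dttm": "2024-07-08T12:00:00Z",
    "poll_delay": 4.0,  # seconds; computed server-side from the backoff function
}
```

Once the query completes and its result lands in the cache, a subsequent request for the same `cache_key` would return the chart data normally, exactly as a cache hit does today.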
### New dependencies
In the base proposal, I suggest not adding any new dependencies, and simply
supporting polling. However, we may consider using [Server-sent
events](https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events),
as noted by @betodealmeida.
### Migration Plan and Compatibility
1. Remove feature flag + related code, Websocket worker code
2. Remove references to websocket deployment/service from Helm chart
3. Add new feature flag `SIMPLIFIED_GLOBAL_ASYNC_QUERIES`
### Rejected Alternatives
- For now I suggest not implementing `Server-sent events` to keep the
implementation simple.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]