As outlined in AURORA-458, using the new jobs page with a large (but
reasonable) number of active and completed tasks can take a long time[1] to
render. Performance profiling done as part of AURORA-471 shows that the
main factor in response time is the size of the uncompressed payload
rendered and returned to the client.

To address this, I think we have two approaches:

 1) Add pagination to the getTasksStatus call.
 2) Make the getTasksStatus response more lightweight.


Pagination
---------------

Pagination would be the simplest approach, and would scale to arbitrarily
large numbers of tasks moving forward. The main issue with this is that we
need all active tasks to build the configuration summary at the top of the
job page.

As a workaround we could add a new API call - getTaskConfigSummary - which
returns something like:


struct ConfigGroup {
  1: TaskConfig config
  2: set<i32> instanceIds
}

struct ConfigSummary {
  1: JobKey jobKey
  2: set<ConfigGroup> groups
}
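
The call itself could be as simple as the following sketch, in the same
style as the structs above (the exact signature is illustrative, not
final):


getTaskConfigSummary(1: JobKey key)


returning a ConfigSummary for the given job.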


To support pagination without breaking the existing API, we could add
offset and limit fields to the TaskQuery struct.
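
As a sketch, assuming two new optional fields (the field IDs below are
illustrative and would need to be chosen to avoid colliding with existing
TaskQuery fields):


struct TaskQuery {
  // ... existing fields unchanged ...
  14: optional i32 offset  // number of matching tasks to skip
  15: optional i32 limit   // maximum number of tasks to return
}


Leaving both fields unset would keep the current behaviour of returning
all matching tasks, so existing clients would be unaffected.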


Make getTasksStatus more lightweight
------------------------------------

getTasksStatus currently returns a list of ScheduledTask instances. The
biggest (in terms of payload size) child object of a ScheduledTask is the
TaskConfig struct, which itself contains an ExecutorConfig.

I took a sample response from one of our internal production instances and
it turns out that around 65% of the total response size was for
ExecutorConfig objects, and specifically the "cmdline" property of these.
We currently do not use this information anywhere in the UI, nor do we
inspect it when grouping TaskConfigs, so it would be a relatively easy win
to simply drop it from the response.

We'd still need this information for the config grouping, so we could add
the response suggested for getTaskConfigSummary as another property and
allow the client to reconcile these objects if it needs to:


struct TaskStatusResponse {
  1: list<LightweightTask> tasks
  2: set<ConfigGroup> configSummary
}


This would significantly reduce the uncompressed payload size while still
containing the same data.
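
The LightweightTask struct isn't pinned down above; one possible shape,
assuming it simply mirrors ScheduledTask with the ExecutorConfig omitted
from its embedded TaskConfig (field names and IDs illustrative):


struct LightweightTask {
  1: AssignedTask assignedTask   // with executorConfig stripped from its TaskConfig
  2: ScheduleStatus status
  3: i32 failureCount
  4: list<TaskEvent> taskEvents
}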

However, there is still one potentially significant contributor to the
payload size remaining: task events (and these *are* currently used in the
UI). We could solve this by dropping task events from the LightweightTask
struct too, and fetching them lazily in batches.

e.g. an API call like:


getTaskEvents(1: JobKey key, 2: set<i32> instanceIds)


could return:


struct TaskEventResult {
  1: i32 instanceId
  2: list<TaskEvent> taskEvents
}

struct TaskEventResponse {
  1: JobKey key
  2: list<TaskEventResult> results
}


Events would then be fetched and rendered only as the user clicks through
the pages of tasks.
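
For example, with a (purely illustrative) page size of 50, moving to the
second page of tasks would trigger a call like:


getTaskEvents(jobKey, {50, 51, ..., 99})


and the returned TaskEventResults would be merged into the rows already
rendered from the lightweight task list.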


Proposal
-------------

I think pagination makes more sense here. It adds moderate complexity to
the UI (purely due to our use of smart-table, which is not very friendly
to server-side pagination), but the client logic would actually be simpler
with the new getTaskConfigSummary API call.

I do think there is value in considering whether the ScheduledTask struct
needs to contain all of the information it does - but this could be done as
part of a separate or complementary performance improvement ticket.




[1] - At Twitter we observed that a job with 2000 active + 100 finished
tasks had a payload size of 10MB and took 8-10 seconds to complete.
