Interestingly, when we first expanded getTasksStatus, I didn't like the
idea, because I thought it would have exactly this problem! It's a *lot* of
information to get in a single burst.

Have you checked what effect it'll have on the command-line client? In
general, the command-line client has the context perform a single API
call, gather the results, and return them to a command implementation.
Adding pagination will definitely complicate things. How much of an
effect will it have?

   -Mark



On Tue, May 27, 2014 at 5:32 PM, David McLaughlin <da...@dmclaughlin.com> wrote:

> As outlined in AURORA-458, using the new jobs page with a large (but
> reasonable) number of active and complete tasks can take a long time[1] to
> render. Performance profiling done as part of AURORA-471 shows that the
> main factor in response time is rendering and returning the size of the
> uncompressed payload to the client.
>
> To that end, I think we have two approaches:
>
>  1) Add pagination to the getTasksStatus call.
>  2) Make the getTasksStatus response more lightweight.
>
>
> Pagination
> ---------------
>
> Pagination would be the simplest approach, and would scale to arbitrarily
> large numbers of tasks moving forward. The main issue with this is that we
> need all active tasks to build the configuration summary at the top of the
> job page.
>
> As a workaround we could add a new API call - getTaskConfigSummary - which
> returns something like:
>
>
> struct ConfigGroup {
>   1: TaskConfig config
>   2: set<i32> instanceIds
> }
>
> struct ConfigSummary {
>   1: JobKey jobKey
>   2: set<ConfigGroup> groups
> }
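>
> For illustration, the corresponding service method could be as simple as
> (the exact signature here is just a sketch, not a settled API):
>
>
> ConfigSummary getTaskConfigSummary(1: JobKey job)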
>
>
> To support pagination without breaking the existing API, we could add
> offset and limit fields to the TaskQuery struct.
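>
> For example (the field ids and optionality below are illustrative only;
> real ids would need to be chosen so they don't collide with existing
> TaskQuery fields):
>
>
> struct TaskQuery {
>   // ... existing fields unchanged ...
>   98: optional i32 offset  // index of the first matching task to return
>   99: optional i32 limit   // maximum number of tasks to return
> }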
>
>
> Make getTasksStatus more lightweight
> ------------------------------------
>
> getTasksStatus currently returns a list of ScheduledTask instances. The
> biggest (in terms of payload size) child object of a ScheduledTask is the
> TaskConfig struct, which itself contains an ExecutorConfig.
>
> I took a sample response from one of our internal production instances,
> and it turns out that around 65% of the total response size was for
> ExecutorConfig objects, and specifically the "cmdline" property of these.
> We currently do not use this information anywhere in the UI, nor do we
> inspect it when grouping taskConfigs, so it would be a relatively easy win
> to just drop these fields from the response.
>
> We'd still need this information for the config grouping, so we could add
> the response suggested for getTaskConfigSummary as another property and
> allow the client to reconcile these objects if it needs to:
>
>
> struct TaskStatusResponse {
>   1: list<LightweightTask> tasks
>   2: set<ConfigGroup> configSummary
> }
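>
> (LightweightTask doesn't exist yet - roughly, I'd picture it as a
> ScheduledTask with the heavyweight config data stripped out, e.g.:
>
>
> struct LightweightTask {
>   1: AssignedTask assignedTask  // with the executorConfig omitted
>   2: ScheduleStatus status
>   3: i32 failureCount
> }
>
> though the exact fields are up for discussion.)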
>
>
> This would significantly reduce the uncompressed payload size while still
> containing the same data.
>
> However, there is still a potentially significant part of the payload
> size remaining: task events (and these *are* currently used in the UI). We
> could solve this by dropping task events from the LightweightTask struct
> too, and fetching them lazily in batches.
>
> i.e. an API call like:
>
>
> getTaskEvents(1: JobKey key, 2: set<i32> instanceIds)
>
>
> Could return:
>
>
> struct TaskEventResult {
>   1: i32 instanceId
>   2: list<TaskEvent> taskEvents
> }
>
> struct TaskEventResponse {
>   1: JobKey key
>   2: list<TaskEventResult> results
> }
>
>
> Events would then be fetched and rendered only as the user clicks through
> the pages of tasks.
>
>
> Proposal
> -------------
>
> I think pagination makes more sense here. It adds moderate complexity to
> the UI (purely because of our use of smart-table, which is not very
> friendly to server-side pagination), but the client logic would actually
> be simpler with the new getTaskConfigSummary API call.
>
> I do think there is value in considering whether the ScheduledTask struct
> needs to contain all of the information it does - but this could be done
> as part of a separate or complementary performance improvement ticket.
>
>
>
>
> [1] - At Twitter we observed a job with 2000 active + 100 finished tasks
> producing a payload of 10MB, which took 8-10 seconds to return and render.
>
