Re: Proposal: API changes to getTasksStatus

Suman Karumuri Fri, 30 May 2014 11:42:11 -0700

The new API can be a flagged version of the getTasksStatus API also.


On Fri, May 30, 2014 at 11:27 AM, Bill Farner <wfar...@apache.org> wrote:

> >
> > I feel that we should drop the extra information from the
> > getTasksStatus first (if the client is unaffected by the change)
>
>
> The client would indeed be affected by that (config diffing when performing
> job updates).  However, there's nothing stopping us from defining a new RPC
> and message to define exactly the fields we want to send for the UI.
>
> -=Bill
>
>
> On Fri, May 30, 2014 at 11:22 AM, Suman Karumuri <ma...@apache.org> wrote:
>
> > Looks like my earlier mail has bounced. Sending it again.
> >
> > Can you please add some explanation to AURORA-471 explaining where the
> time
> > is being spent with some data and some micro benchmarks. As is it is
> > unclear why thrift.flush should be taking so long. Is the default
> > implementation of TServlet.doPost() doing something that can be
> > optimized?If thrift serialization is really quick, where are we spending
> > the time? Or is thrift.js spending too much time parsing 10 megs of JSON?
> >
> > Pagination is a good idea. The only downside is that users would loose
> the
> > ability to sort and filter on the UI (this feature is to be implemented).
> > While pagination improves the time it takes to load a page, it doesn't
> > necessarily improve the time it takes for a user to achieve his goal.
> >
> > For this reason, I feel that we should drop the extra information from
> the
> > getTasksStatus first (if the client is unaffected by the change) to
> reduce
> > the page load time and see how much it improves the performance.  To
> > improve perceived performance, we can reduce the need to reload the job
> > page (by loading tasks in the back ground and client side caching using
> > $cache), since the only time the user will notice the delay is when
> loading
> > the page which is ok if it's few seconds after throwing out the executor
> > config.
> >
> > I think we should delay pagination as long as possible since operating on
> > the all the tasks in the UI with sorting, filtering and visualization
> adds
> > a lot of powerful capabilities. We can also add filtering and searching
> > capabilities in the client with a paginated API as well. But I feel
> > delaying pagination would result in a quicker win instead since the API
> can
> > be changed in a backwards compatible manner.
> >
> >
> > On Wed, May 28, 2014 at 4:42 AM, Mark Chu-Carroll <
> mchucarr...@apache.org>
> > wrote:
> >
> > > Great. +1 from me.
> > >
> > >
> > > On Tue, May 27, 2014 at 7:39 PM, David McLaughlin <
> da...@dmclaughlin.com
> > > >wrote:
> > >
> > > > Pagination would be a no-op to the client because it would be opt-in
> > > only,
> > > > so it would continue to fetch all the tasks in one request.
> > > >
> > > > But you raise a good point in that presumably the client is also
> going
> > to
> > > > be blocked for several seconds while executing getTasksStatus for
> large
> > > > jobs. Making the response more lightweight could be a big win there,
> > but
> > > I
> > > > would need a better understanding of how the client is using those
> > > > responses first.
> > > >
> > > >
> > > > On Tue, May 27, 2014 at 3:34 PM, Mark Chu-Carroll <
> > > mchucarr...@apache.org
> > > > >wrote:
> > > >
> > > > > Interestingly, when we first expanded getTasksStatus, I didn't like
> > the
> > > > > idea, because I thought it would have exactly this problem! It's a
> > > *lot*
> > > > of
> > > > > information to get in a single burst.
> > > > >
> > > > > Have you checked what effect it'll have on the command-line client?
> > In
> > > > > general, the command-line has the context do a single API call,
> > gathers
> > > > the
> > > > > results, and returns to a command implementation. It'll definitely
> > > > > complicate things to add pagination.  How much of an effect will it
> > be?
> > > > >
> > > > >    -Mark
> > > > >
> > > > >
> > > > >
> > > > > On Tue, May 27, 2014 at 5:32 PM, David McLaughlin <
> > > da...@dmclaughlin.com
> > > > > >wrote:
> > > > >
> > > > > > As outlined in AURORA-458, using the new jobs page with a large
> > (but
> > > > > > reasonable) number of active and complete tasks can take a long
> > > time[1]
> > > > > to
> > > > > > render. Performance profiling done as part of AURORA-471 shows
> that
> > > the
> > > > > > main factor in response time is rendering and returning the size
> of
> > > the
> > > > > > uncompressed payload to the client.
> > > > > >
> > > > > > To that end, I think we have two approaches:
> > > > > >
> > > > > >  1) Add pagination to the getTasksStatus call.
> > > > > >  2) Make the getTasksStatus response more lightweight.
> > > > > >
> > > > > >
> > > > > > Pagination
> > > > > > ---------------
> > > > > >
> > > > > > Pagination would be the simplest approach, and would scale to
> > > > arbitrarily
> > > > > > large numbers of tasks moving forward. The main issue with this
> is
> > > that
> > > > > we
> > > > > > need all active tasks to build the configuration summary at the
> top
> > > of
> > > > > the
> > > > > > job page.
> > > > > >
> > > > > > As a workaround we could add a new API call -
> getTaskConfigSummary
> > -
> > > > > which
> > > > > > returns something like:
> > > > > >
> > > > > >
> > > > > > struct ConfigGroup {
> > > > > >   1: TaskConfig config
> > > > > >   2: set<i32> instanceIds
> > > > > > }
> > > > > >
> > > > > > struct ConfigSummary {
> > > > > >   1: JobKey jobKey
> > > > > >   2: set<ConfigGroup> groups
> > > > > > }
> > > > > >
> > > > > >
> > > > > > To support pagination without breaking the existing API, we could
> > add
> > > > > > offset and limit fields to the TaskQuery struct.
> > > > > >
> > > > > >
> > > > > > Make getTasksStatus more lightweight
> > > > > > ------------------------------------
> > > > > >
> > > > > > getTasksStatus currently returns a list of ScheduledTask
> instances.
> > > The
> > > > > > biggest (in terms of payload size) child object of a
> ScheduledTask
> > is
> > > > the
> > > > > > TaskConfig struct, which itself contains an ExecutorConfig.
> > > > > >
> > > > > > I took a sample response from one of our internal production
> > > instances
> > > > > and
> > > > > > it turns out that around 65% of the total response size was for
> > > > > > ExecutorConfig objects, and specifically the "cmdline" property
> of
> > > > these.
> > > > > > We currently do not use this information anywhere in the UI nor
> do
> > we
> > > > > > inspect it when grouping taskConfigs, and it would be a
> relatively
> > > easy
> > > > > win
> > > > > > to just drop these from the response.
> > > > > >
> > > > > > We'd still need this information for the config grouping, so we
> > could
> > > > add
> > > > > > the response suggested for getTaskConfigSummary as another
> property
> > > and
> > > > > > allow the client to reconcile these objects if it needs to:
> > > > > >
> > > > > >
> > > > > > struct TaskStatusResponse {
> > > > > >   1: list<LightweightTask> tasks
> > > > > >   2: set<ConfigGroup> configSummary
> > > > > > }
> > > > > >
> > > > > >
> > > > > > This would significantly reduce the uncompressed payload size
> while
> > > > still
> > > > > > containing the same data.
> > > > > >
> > > > > > However, there is still a potentially significant part of a
> payload
> > > > size
> > > > > > remaining: task events (and these *are* currently used in the
> UI).
> > We
> > > > > could
> > > > > > solve this by dropping task events from the LightweightTask
> struct
> > > too,
> > > > > and
> > > > > > fetching them lazily in batches.
> > > > > >
> > > > > > i.e. an API call like:
> > > > > >
> > > > > >
> > > > > > getTaskEvents(1: JobKey key, 2: set<i32> instanceIds)
> > > > > >
> > > > > >
> > > > > > Could return:
> > > > > >
> > > > > >
> > > > > > struct TaskEventResult {
> > > > > >   1: i32 instanceId
> > > > > >   2: list<TaskEvent> taskEvents
> > > > > > }
> > > > > >
> > > > > > struct TaskEventResponse {
> > > > > >   1: JobKey key
> > > > > >   2: list<TaskEventResult> results
> > > > > > }
> > > > > >
> > > > > >
> > > > > > Events could then only be fetched and rendered as the user clicks
> > > > through
> > > > > > the pages of tasks.
> > > > > >
> > > > > >
> > > > > > Proposal
> > > > > > -------------
> > > > > >
> > > > > > I think pagination makes more sense here. It adds moderate
> overhead
> > > to
> > > > > the
> > > > > > complexity of the UI (this is purely due to our use of
> smart-table
> > > > which
> > > > > is
> > > > > > not so server-side pagination friendly) but the client logic
> would
> > > > > actually
> > > > > > be simpler with the new getTaskConfigSummary api call.
> > > > > >
> > > > > > I do think there is value in considering whether the
> ScheduledTask
> > > > struct
> > > > > > needs to contain all of the information it does - but this could
> be
> > > > done
> > > > > as
> > > > > > part of a separate or complimentary performance improvement
> ticket.
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > [1] - At Twitter we observed 2000 active + 100 finished tasks
> > having
> > > a
> > > > > > payload size of 10MB which took 8~10 seconds to complete.
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Proposal: API changes to getTasksStatus

Reply via email to