The new API can be a flagged version of the getTasksStatus API also.
On Fri, May 30, 2014 at 11:27 AM, Bill Farner <wfar...@apache.org> wrote: > > > > I feel that we should drop the extra information from the > > getTasksStatus first (if the client is unaffected by the change) > > > The client would indeed be affected by that (config diffing when performing > job updates). However, there's nothing stopping us from defining a new RPC > and message to define exactly the fields we want to send for the UI. > > -=Bill > > > On Fri, May 30, 2014 at 11:22 AM, Suman Karumuri <ma...@apache.org> wrote: > > > Looks like my earlier mail has bounced. Sending it again. > > > > Can you please add some explanation to AURORA-471 explaining where the > time > > is being spent with some data and some micro benchmarks. As is it is > > unclear why thrift.flush should be taking so long. Is the default > > implementation of TServlet.doPost() doing something that can be > > optimized?If thrift serialization is really quick, where are we spending > > the time? Or is thrift.js spending too much time parsing 10 megs of JSON? > > > > Pagination is a good idea. The only downside is that users would loose > the > > ability to sort and filter on the UI (this feature is to be implemented). > > While pagination improves the time it takes to load a page, it doesn't > > necessarily improve the time it takes for a user to achieve his goal. > > > > For this reason, I feel that we should drop the extra information from > the > > getTasksStatus first (if the client is unaffected by the change) to > reduce > > the page load time and see how much it improves the performance. To > > improve perceived performance, we can reduce the need to reload the job > > page (by loading tasks in the back ground and client side caching using > > $cache), since the only time the user will notice the delay is when > loading > > the page which is ok if it's few seconds after throwing out the executor > > config. > > > > I think we should delay pagination as long as possible since operating on > > the all the tasks in the UI with sorting, filtering and visualization > adds > > a lot of powerful capabilities. We can also add filtering and searching > > capabilities in the client with a paginated API as well. But I feel > > delaying pagination would result in a quicker win instead since the API > can > > be changed in a backwards compatible manner. > > > > > > On Wed, May 28, 2014 at 4:42 AM, Mark Chu-Carroll < > mchucarr...@apache.org> > > wrote: > > > > > Great. +1 from me. > > > > > > > > > On Tue, May 27, 2014 at 7:39 PM, David McLaughlin < > da...@dmclaughlin.com > > > >wrote: > > > > > > > Pagination would be a no-op to the client because it would be opt-in > > > only, > > > > so it would continue to fetch all the tasks in one request. > > > > > > > > But you raise a good point in that presumably the client is also > going > > to > > > > be blocked for several seconds while executing getTasksStatus for > large > > > > jobs. Making the response more lightweight could be a big win there, > > but > > > I > > > > would need a better understanding of how the client is using those > > > > responses first. > > > > > > > > > > > > On Tue, May 27, 2014 at 3:34 PM, Mark Chu-Carroll < > > > mchucarr...@apache.org > > > > >wrote: > > > > > > > > > Interestingly, when we first expanded getTasksStatus, I didn't like > > the > > > > > idea, because I thought it would have exactly this problem! It's a > > > *lot* > > > > of > > > > > information to get in a single burst. > > > > > > > > > > Have you checked what effect it'll have on the command-line client? > > In > > > > > general, the command-line has the context do a single API call, > > gathers > > > > the > > > > > results, and returns to a command implementation. It'll definitely > > > > > complicate things to add pagination. How much of an effect will it > > be? > > > > > > > > > > -Mark > > > > > > > > > > > > > > > > > > > > On Tue, May 27, 2014 at 5:32 PM, David McLaughlin < > > > da...@dmclaughlin.com > > > > > >wrote: > > > > > > > > > > > As outlined in AURORA-458, using the new jobs page with a large > > (but > > > > > > reasonable) number of active and complete tasks can take a long > > > time[1] > > > > > to > > > > > > render. Performance profiling done as part of AURORA-471 shows > that > > > the > > > > > > main factor in response time is rendering and returning the size > of > > > the > > > > > > uncompressed payload to the client. > > > > > > > > > > > > To that end, I think we have two approaches: > > > > > > > > > > > > 1) Add pagination to the getTasksStatus call. > > > > > > 2) Make the getTasksStatus response more lightweight. > > > > > > > > > > > > > > > > > > Pagination > > > > > > --------------- > > > > > > > > > > > > Pagination would be the simplest approach, and would scale to > > > > arbitrarily > > > > > > large numbers of tasks moving forward. The main issue with this > is > > > that > > > > > we > > > > > > need all active tasks to build the configuration summary at the > top > > > of > > > > > the > > > > > > job page. > > > > > > > > > > > > As a workaround we could add a new API call - > getTaskConfigSummary > > - > > > > > which > > > > > > returns something like: > > > > > > > > > > > > > > > > > > struct ConfigGroup { > > > > > > 1: TaskConfig config > > > > > > 2: set<i32> instanceIds > > > > > > } > > > > > > > > > > > > struct ConfigSummary { > > > > > > 1: JobKey jobKey > > > > > > 2: set<ConfigGroup> groups > > > > > > } > > > > > > > > > > > > > > > > > > To support pagination without breaking the existing API, we could > > add > > > > > > offset and limit fields to the TaskQuery struct. > > > > > > > > > > > > > > > > > > Make getTasksStatus more lightweight > > > > > > ------------------------------------ > > > > > > > > > > > > getTasksStatus currently returns a list of ScheduledTask > instances. > > > The > > > > > > biggest (in terms of payload size) child object of a > ScheduledTask > > is > > > > the > > > > > > TaskConfig struct, which itself contains an ExecutorConfig. > > > > > > > > > > > > I took a sample response from one of our internal production > > > instances > > > > > and > > > > > > it turns out that around 65% of the total response size was for > > > > > > ExecutorConfig objects, and specifically the "cmdline" property > of > > > > these. > > > > > > We currently do not use this information anywhere in the UI nor > do > > we > > > > > > inspect it when grouping taskConfigs, and it would be a > relatively > > > easy > > > > > win > > > > > > to just drop these from the response. > > > > > > > > > > > > We'd still need this information for the config grouping, so we > > could > > > > add > > > > > > the response suggested for getTaskConfigSummary as another > property > > > and > > > > > > allow the client to reconcile these objects if it needs to: > > > > > > > > > > > > > > > > > > struct TaskStatusResponse { > > > > > > 1: list<LightweightTask> tasks > > > > > > 2: set<ConfigGroup> configSummary > > > > > > } > > > > > > > > > > > > > > > > > > This would significantly reduce the uncompressed payload size > while > > > > still > > > > > > containing the same data. > > > > > > > > > > > > However, there is still a potentially significant part of a > payload > > > > size > > > > > > remaining: task events (and these *are* currently used in the > UI). > > We > > > > > could > > > > > > solve this by dropping task events from the LightweightTask > struct > > > too, > > > > > and > > > > > > fetching them lazily in batches. > > > > > > > > > > > > i.e. an API call like: > > > > > > > > > > > > > > > > > > getTaskEvents(1: JobKey key, 2: set<i32> instanceIds) > > > > > > > > > > > > > > > > > > Could return: > > > > > > > > > > > > > > > > > > struct TaskEventResult { > > > > > > 1: i32 instanceId > > > > > > 2: list<TaskEvent> taskEvents > > > > > > } > > > > > > > > > > > > struct TaskEventResponse { > > > > > > 1: JobKey key > > > > > > 2: list<TaskEventResult> results > > > > > > } > > > > > > > > > > > > > > > > > > Events could then only be fetched and rendered as the user clicks > > > > through > > > > > > the pages of tasks. > > > > > > > > > > > > > > > > > > Proposal > > > > > > ------------- > > > > > > > > > > > > I think pagination makes more sense here. It adds moderate > overhead > > > to > > > > > the > > > > > > complexity of the UI (this is purely due to our use of > smart-table > > > > which > > > > > is > > > > > > not so server-side pagination friendly) but the client logic > would > > > > > actually > > > > > > be simpler with the new getTaskConfigSummary api call. > > > > > > > > > > > > I do think there is value in considering whether the > ScheduledTask > > > > struct > > > > > > needs to contain all of the information it does - but this could > be > > > > done > > > > > as > > > > > > part of a separate or complimentary performance improvement > ticket. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > [1] - At Twitter we observed 2000 active + 100 finished tasks > > having > > > a > > > > > > payload size of 10MB which took 8~10 seconds to complete. > > > > > > > > > > > > > > > > > > > > >