As a general rule we're trying to stick to Python 3.4. I don't imagine
implementing a THTTPClient transport of my own will be too difficult,
especially given that I have the Aurora client's TRequestsTransport [1]
for reference.

[1]
https://github.com/apache/incubator-aurora/blob/master/src/main/python/apache/aurora/common/transport.py

Hussein Elgridly
Senior Software Engineer, DSDE
The Broad Institute of MIT and Harvard
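For the curious, a minimal sketch of what such a transport might look
like, modeled on the TRequestsTransport referenced in [1]. The
TTransportBase import path and the exact set of methods thriftpy expects
(open/close/read/write/flush) are assumptions here, so check
thriftpy.transport before relying on this:

---
# Hedged sketch of a thriftpy THTTPClient, modeled on Aurora's
# TRequestsTransport [1]. The TTransportBase import path and method set
# are assumptions, not verified against thriftpy.
from io import BytesIO

import requests
from thriftpy.transport import TTransportBase  # assumed import path


class THttpClient(TTransportBase):
    """Buffer writes; POST them on flush(); serve the response via read()."""

    def __init__(self, uri):
        self._uri = uri
        # A persistent Session keeps the TCP connection alive between
        # calls -- the main latency win over reconnecting per request.
        self._session = requests.Session()
        self._wbuf = BytesIO()
        self._rbuf = BytesIO()

    def is_open(self):
        return self._session is not None

    def open(self):
        pass  # requests opens the connection lazily on first POST

    def close(self):
        self._session.close()
        self._session = None

    def read(self, sz):
        return self._rbuf.read(sz)

    def write(self, buf):
        self._wbuf.write(buf)

    def flush(self):
        # Ship the buffered request as one HTTP POST, then expose the
        # response body to the protocol layer through the read buffer.
        data, self._wbuf = self._wbuf.getvalue(), BytesIO()
        resp = self._session.post(
            self._uri,
            data=data,
            headers={"Content-Type": "application/x-thrift"},
            timeout=30)
        resp.raise_for_status()
        self._rbuf = BytesIO(resp.content)
---

Wiring this into thriftpy's client machinery (which builds its own
TSocket by default) would still take some care; the sketch only covers
the transport itself.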
On 16 March 2015 at 22:58, Bill Farner <wfar...@apache.org> wrote:

> Exploring the possibilities - can you use python 2.7? If so, you could
> leverage some of the private libraries within the client and lower the
> surface area of what you need to build. It won't be a stable
> programmatic API, but you might get moving faster. I assume this is
> what Stephan is suggesting.
>
> -=Bill
>
> On Mon, Mar 16, 2015 at 7:52 PM, Hussein Elgridly <
> huss...@broadinstitute.org> wrote:
>
> > I'm not quite sure I understand your question, so I'll be painfully
> > explicit instead.
> >
> > I don't want to use the existing Aurora client because it's slow
> > (Pystachio + repeated HTTP connection overheads, as detailed earlier
> > in this thread). Instead, I want to use the Thrift interface to talk
> > to the Aurora scheduler directly (I can skip Pystachio entirely and
> > keep the HTTP connection open).
> >
> > I cannot use the official Thrift bindings for Python as they do not
> > yet support Python 3 [1]. There is a third-party, pure Python
> > implementation of Thrift that does support Python 3 called thriftpy
> > [2]. However, thriftpy does not include a THTTPClient transport,
> > which is what the Aurora scheduler uses. I will therefore have to
> > write my own THTTPClient transport (and probably contribute it back
> > to thriftpy).
> >
> > [1] https://issues.apache.org/jira/browse/THRIFT-1857
> > [2] https://github.com/eleme/thriftpy
> >
> > Hussein Elgridly
> > Senior Software Engineer, DSDE
> > The Broad Institute of MIT and Harvard
> >
> > On 16 March 2015 at 19:11, Erb, Stephan <stephan....@blue-yonder.com>
> > wrote:
> >
> > > Just to make sure I get this correctly: you say you cannot use the
> > > existing python client because it is python 2.7 only, so you want
> > > to write a new one in python 3?
> > >
> > > Regards,
> > > Stephan
> > > ________________________________________
> > > From: Hussein Elgridly <huss...@broadinstitute.org>
> > > Sent: Monday, March 16, 2015 11:44 PM
> > > To: dev@aurora.incubator.apache.org
> > > Subject: Re: Speeding up Aurora client job creation
> > >
> > > So this has now bubbled back to the top of my TODO list and I'm
> > > actively working on it. I am entirely new to Thrift so please
> > > forgive the newbie questions...
> > >
> > > I would like to talk to the Aurora scheduler directly from my
> > > (Python) application using Thrift. Since I'm on Python 3.4 I've
> > > had to use thriftpy: https://github.com/eleme/thriftpy
> > >
> > > As far as I can tell, the following should work (by default,
> > > thriftpy uses a TBufferedTransport around a TSocket):
> > >
> > > ---
> > > import thriftpy
> > > import thriftpy.rpc
> > >
> > > aurora_api = thriftpy.load("api.thrift")
> > >
> > > client = thriftpy.rpc.make_client(
> > >     aurora_api.AuroraSchedulerManager,
> > >     host="localhost", port=8081,
> > >     proto_factory=thriftpy.protocol.TJSONProtocolFactory())
> > >
> > > print(client.getJobSummary())
> > > ---
> > >
> > > Obviously I wouldn't be writing this email if it did work :) It
> > > hangs.
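The hang is consistent with the transport mismatch called out above: the
scheduler serves its Thrift API over HTTP, while make_client's default
TSocket writes raw Thrift bytes that an HTTP server will never answer. A
hedged probe along the following lines can test this; the /api path and
the Content-Type header are assumptions inferred from how the Aurora
client connects, and the four-byte length prefix visible in the pdb
capture below is socket-level framing, omitted here:

---
# Hedged probe: POST the same JSON payload to the scheduler's HTTP
# endpoint instead of a raw socket. The /api path and the header are
# assumptions; the b'\x00\x00\x00\\' length prefix from the pdb capture
# is transport framing, not part of the message, and is left out.
import requests

body = (b'{"metadata": {"name": "getJobSummary", "seqid": 0,'
        b' "ttype": 1, "version": 1}, "payload": {}}')

resp = requests.post("http://localhost:8081/api",
                     data=body,
                     headers={"Content-Type": "application/x-thrift"},
                     timeout=10)
print(resp.status_code, resp.content[:200])
---

Note too that thriftpy's JSON envelope ({"metadata": ..., "payload":
...}) appears to differ from Apache Thrift's TJSONProtocol wire format,
so the protocol may need attention as well as the transport.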
> > > I jumped into pdb and found it was sending the following payload:
> > >
> > > b'\x00\x00\x00\\{"metadata": {"name": "getJobSummary", "seqid": 0,
> > > "ttype": 1, "version": 1}, "payload": {}}'
> > >
> > > to a socket that looked like this:
> > >
> > > <socket.socket fd=3, family=AddressFamily.AF_INET, type=2049,
> > > proto=0, laddr=('<localhost's_private_ip>', 49167),
> > > raddr=('<localhost's_private_ip>', 8081)>
> > >
> > > ...but was waiting forever to receive any data. Adding a timeout
> > > just triggered the timeout.
> > >
> > > I'm stumped. Any clues?
> > >
> > > Hussein Elgridly
> > > Senior Software Engineer, DSDE
> > > The Broad Institute of MIT and Harvard
> > >
> > > On 12 February 2015 at 04:15, Erb, Stephan <
> > > stephan....@blue-yonder.com> wrote:
> > >
> > > > Hi Hussein,
> > > >
> > > > we also had slight performance problems when talking to Aurora.
> > > > We ended up using the existing python client directly in our
> > > > code (see apache.aurora.client.api.__init__.py). This allowed us
> > > > to reuse the api object and its scheduler connection, dropping a
> > > > connection latency of about 0.3-0.4 seconds per request.
> > > >
> > > > Best Regards,
> > > > Stephan
> > > > ________________________________________
> > > > From: Bill Farner <wfar...@apache.org>
> > > > Sent: Wednesday, February 11, 2015 9:29 PM
> > > > To: dev@aurora.incubator.apache.org
> > > > Subject: Re: Speeding up Aurora client job creation
> > > >
> > > > To reduce that time you will indeed want to talk directly to the
> > > > scheduler. This will definitely require you to roll up your
> > > > sleeves a bit and set up a thrift client to our api (based on
> > > > api.thrift [1]), since you will need to specify your tasks in a
> > > > format that the thermos executor can understand. Turns out this
> > > > is JSON data, so it should not be *too* prohibitive.
> > > >
> > > > However, there is another technical limitation you will hit for
> > > > the submission rate you are after. The scheduler is backed by a
> > > > durable store whose write latency is at minimum the amount of
> > > > time required to fsync.
> > > >
> > > > [1]
> > > > https://github.com/apache/incubator-aurora/blob/master/api/src/main/thrift/org/apache/aurora/gen/api.thrift
> > > >
> > > > -=Bill
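For illustration, the client-reuse approach Stephan describes above
might look roughly like the sketch below. AuroraClientAPI and create_job
live in apache.aurora.client.api, but this is a private, unstable,
python 2.7-only API; the constructor arguments, the cluster definition,
and the pre-built job_configs list here are all assumptions:

---
# Hedged sketch: build one api object and reuse its scheduler connection
# across submissions, per Stephan's suggestion. Constructor arguments
# and config loading are illustrative, not verified against this era's
# source tree.
from apache.aurora.client.api import AuroraClientAPI
from apache.aurora.common.cluster import Cluster

cluster = Cluster(name="devcluster")  # assumed cluster definition
api = AuroraClientAPI(cluster)        # one connection, reused below

for config in job_configs:            # job_configs: preloaded AuroraConfig objects
    resp = api.create_job(config)     # no per-job connection setup
    print(resp.responseCode)
---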
> > > > On Wed, Feb 11, 2015 at 11:46 AM, Hussein Elgridly <
> > > > huss...@broadinstitute.org> wrote:
> > > >
> > > > > Hi folks,
> > > > >
> > > > > I'm looking at a use case that involves submitting potentially
> > > > > hundreds of jobs a second to our Mesos cluster. My tests show
> > > > > that the aurora client is taking 1-2 seconds for each job
> > > > > submission, and that I can run about four client processes in
> > > > > parallel before they peg the CPU at 100%. I need more
> > > > > throughput than this!
> > > > >
> > > > > Squashing jobs down to the Process or Task level doesn't
> > > > > really make sense for our use case. I'm aware that with some
> > > > > shenanigans I can batch jobs together using job instances, but
> > > > > that's a lot of work on my current timeframe (and of
> > > > > questionable utility given that the jobs certainly won't have
> > > > > identical resource requirements).
> > > > >
> > > > > What I really need is (at least) an order of magnitude speedup
> > > > > in terms of being able to submit jobs to the Aurora scheduler
> > > > > (via the client or otherwise).
> > > > >
> > > > > Conceptually it doesn't seem like adding a job to a queue
> > > > > should be a thing that takes a couple of seconds, so I'm
> > > > > baffled as to why it's taking so long. As an experiment, I
> > > > > wrapped the call to client.execute() in client.py:proxy_main
> > > > > in cProfile and called aurora job create with a very simple
> > > > > test job.
> > > > >
> > > > > Results of the profile are in the Gist below:
> > > > >
> > > > > https://gist.github.com/helgridly/b37a0d27f04a37e72bb5
> > > > >
> > > > > Out of a 0.977s profile time, the two things that stick out to
> > > > > me are:
> > > > >
> > > > > 1. 0.526s spent in Pystachio for a job that doesn't use any
> > > > > templates
> > > > > 2. 0.564s spent in create_job, presumably talking to the
> > > > > scheduler (and setting up the machinery for doing so)
> > > > >
> > > > > I imagine I can sidestep #1 with a check for "{{" in the job
> > > > > file and bypass Pystachio entirely. Can I also skip the Aurora
> > > > > client entirely and talk directly to the scheduler? If so what
> > > > > does that entail, and are there any risks associated?
> > > > >
> > > > > Thanks,
> > > > > -Hussein
> > > > >
> > > > > Hussein Elgridly
> > > > > Senior Software Engineer, DSDE
> > > > > The Broad Institute of MIT and Harvard
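For reference, the profiling experiment described above takes only a few
lines of stdlib cProfile. The wrapping site (the client.execute() call
inside client.py:proxy_main) is as described in the message; the output
path and the number of rows printed are arbitrary choices:

---
# Hedged sketch: wrap the client.execute() call inside proxy_main with
# cProfile, dump the stats to disk, and print the hottest entries.
import cProfile
import pstats

# In place of the plain client.execute() call inside proxy_main:
cProfile.runctx("client.execute()", globals(), locals(),
                "/tmp/aurora_job_create.prof")

stats = pstats.Stats("/tmp/aurora_job_create.prof")
stats.sort_stats("cumulative").print_stats(25)
---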