Another update: Thrift has a pull request open for Python 3 support [1], but it was out of date and needed rebasing onto master. I did this in my own fork [2] and managed to build a Py3-generating version of Thrift. This allowed me to generate Python 3 Thrift bindings for Aurora, which I'm including in my project along with a tarball of the Python 3 Thrift libraries. Success!
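For reference, a minimal sketch of what that setup looks like end to end with the official Thrift libraries and the generated Aurora bindings. The gen.apache.aurora.api package path and the /api endpoint are assumptions (check your generated code and scheduler URL), and it presumes the Python 3 build ships THttpClient:

---
from thrift.transport import THttpClient
from thrift.protocol import TJSONProtocol

# Generated bindings; the package path depends on the namespace declared in api.thrift.
from gen.apache.aurora.api import AuroraSchedulerManager

# The scheduler speaks Thrift-over-HTTP using the standard TJSONProtocol.
transport = THttpClient.THttpClient('http://localhost:8081/api')
protocol = TJSONProtocol.TJSONProtocol(transport)
client = AuroraSchedulerManager.Client(protocol)

transport.open()
try:
    print(client.getJobSummary())  # the same call that hangs with thriftpy below
finally:
    transport.close()
---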
[1] https://github.com/apache/thrift/pull/213
[2] https://github.com/broadinstitute/thrift/tree/eevee/python3

The changes make Thrift fail on Python 2, so I imagine it'll be a while before they make it into official Thrift. But it works for me, so I'm happy :)

Hussein Elgridly
Senior Software Engineer, DSDE
The Broad Institute of MIT and Harvard


On 17 March 2015 at 15:18, Hussein Elgridly <huss...@broadinstitute.org> wrote:

> For anyone following along at home, I managed to make my own THTTPClient for thriftpy just fine. Unfortunately, thriftpy's TJSONProtocol seems to be *a* JSON protocol, not *the* JSON protocol:
>
> thrift:   [1,"getJobSummary",1,0,{}]
> thriftpy: {"metadata": {"ttype": 1, "name": "getJobSummary", "version": 1, "seqid": 0}, "payload": {}}
>
> Which is frustrating, to say the least. I am now debating whether to:
>
> 1. Stub out the subset of the API that I actually need (currently only createJob and getTasksWithoutConfigs);
> 2. Roll my own protocol, based on Thrift's code [1]; or
> 3. Backport my project to Python 2.7 and use official Thrift.
>
> [1] https://github.com/apache/thrift/blob/93fea15b51494a79992a5323c803325537134bd8/lib/py/src/protocol/TJSONProtocol.py
>
> Hussein Elgridly
> Senior Software Engineer, DSDE
> The Broad Institute of MIT and Harvard
>
>
> On 16 March 2015 at 23:37, Hussein Elgridly <huss...@broadinstitute.org> wrote:
>
>> As a general rule we're trying to stick to Python 3.4. I don't imagine implementing a THTTPClient of my own will be too difficult, especially given that I have the Aurora client's TRequestsTransport [1] for reference.
>>
>> [1] https://github.com/apache/incubator-aurora/blob/master/src/main/python/apache/aurora/common/transport.py
>>
>> Hussein Elgridly
>> Senior Software Engineer, DSDE
>> The Broad Institute of MIT and Harvard
>>
>>
>> On 16 March 2015 at 22:58, Bill Farner <wfar...@apache.org> wrote:
>>
>>> Exploring the possibilities - can you use Python 2.7? If so, you could leverage some of the private libraries within the client and lower the surface area of what you need to build. It won't be a stable programmatic API, but you might get moving faster. I assume this is what Stephan is suggesting.
>>>
>>> -=Bill
>>>
>>> On Mon, Mar 16, 2015 at 7:52 PM, Hussein Elgridly <huss...@broadinstitute.org> wrote:
>>>
>>> > I'm not quite sure I understand your question, so I'll be painfully explicit instead.
>>> >
>>> > I don't want to use the existing Aurora client because it's slow (Pystachio + repeated HTTP connection overheads, as detailed earlier in this thread). Instead, I want to use the Thrift interface to talk to the Aurora scheduler directly - I can skip Pystachio entirely and keep the HTTP connection open.
>>> >
>>> > I cannot use the official Thrift bindings for Python as they do not yet support Python 3 [1]. There is a third-party, pure-Python implementation of Thrift that does support Python 3 called thriftpy [2]. However, thriftpy does not include a THTTPClient transport, which is what the Aurora scheduler uses. I will therefore have to write my own THTTPClient transport (and probably contribute it back to thriftpy).
>>> >
>>> > [1] https://issues.apache.org/jira/browse/THRIFT-1857
>>> > [2] https://github.com/eleme/thriftpy
>>> >
>>> > Hussein Elgridly
>>> > Senior Software Engineer, DSDE
>>> > The Broad Institute of MIT and Harvard
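A rough sketch of what such a THTTPClient-style transport for thriftpy could look like, loosely modeled on the Aurora client's TRequestsTransport. The class and attribute names here are made up for illustration, and it assumes the protocol layer only needs open/close/is_open/read/write/flush from a transport; as noted further up in the thread, this alone does not fix the TJSONProtocol mismatch:

---
from io import BytesIO
import requests

class THttpClientTransport(object):
    """Buffers writes, POSTs them to the scheduler on flush(), and exposes
    the response body for subsequent reads."""

    def __init__(self, uri):
        self._uri = uri
        self._session = None
        self._wbuf = BytesIO()   # outgoing request body
        self._rbuf = BytesIO()   # last response body

    def is_open(self):
        return self._session is not None

    def open(self):
        # requests.Session keeps the underlying TCP connection alive between calls.
        self._session = requests.Session()

    def close(self):
        self._session.close()
        self._session = None

    def read(self, sz):
        return self._rbuf.read(sz)

    def write(self, buf):
        self._wbuf.write(buf)

    def flush(self):
        body = self._wbuf.getvalue()
        self._wbuf = BytesIO()
        resp = self._session.post(self._uri, data=body,
                                  headers={'Content-Type': 'application/x-thrift'})
        resp.raise_for_status()
        self._rbuf = BytesIO(resp.content)
---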
>>> >
>>> >
>>> > On 16 March 2015 at 19:11, Erb, Stephan <stephan....@blue-yonder.com> wrote:
>>> >
>>> > > Just to make sure I get this correctly: you say you cannot use the existing Python client because it is Python 2.7 only, so you want to write a new one in Python 3?
>>> > >
>>> > > Regards,
>>> > > Stephan
>>> > > ________________________________________
>>> > > From: Hussein Elgridly <huss...@broadinstitute.org>
>>> > > Sent: Monday, March 16, 2015 11:44 PM
>>> > > To: dev@aurora.incubator.apache.org
>>> > > Subject: Re: Speeding up Aurora client job creation
>>> > >
>>> > > So this has now bubbled back to the top of my TODO list and I'm actively working on it. I am entirely new to Thrift, so please forgive the newbie questions...
>>> > >
>>> > > I would like to talk to the Aurora scheduler directly from my (Python) application using Thrift. Since I'm on Python 3.4 I've had to use thriftpy: https://github.com/eleme/thriftpy
>>> > >
>>> > > As far as I can tell, the following should work (by default, thriftpy uses a TBufferedTransport around a TSocket):
>>> > >
>>> > > ---
>>> > > import thriftpy
>>> > > import thriftpy.rpc
>>> > >
>>> > > aurora_api = thriftpy.load("api.thrift")
>>> > >
>>> > > client = thriftpy.rpc.make_client(
>>> > >     aurora_api.AuroraSchedulerManager,
>>> > >     host="localhost", port=8081,
>>> > >     proto_factory=thriftpy.protocol.TJSONProtocolFactory())
>>> > >
>>> > > print(client.getJobSummary())
>>> > > ---
>>> > >
>>> > > Obviously I wouldn't be writing this email if it did work :) It hangs.
>>> > >
>>> > > I jumped into pdb and found it was sending the following payload:
>>> > >
>>> > > b'\x00\x00\x00\\{"metadata": {"name": "getJobSummary", "seqid": 0, "ttype": 1, "version": 1}, "payload": {}}'
>>> > >
>>> > > to a socket that looked like this:
>>> > >
>>> > > <socket.socket fd=3, family=AddressFamily.AF_INET, type=2049, proto=0, laddr=('<localhost's_private_ip>', 49167), raddr=('localhost's_private_ip', 8081)>
>>> > >
>>> > > ...but was waiting forever to receive any data. Adding a timeout just triggered the timeout.
>>> > >
>>> > > I'm stumped. Any clues?
>>> > >
>>> > >
>>> > > Hussein Elgridly
>>> > > Senior Software Engineer, DSDE
>>> > > The Broad Institute of MIT and Harvard
>>> > >
>>> > >
>>> > > On 12 February 2015 at 04:15, Erb, Stephan <stephan....@blue-yonder.com> wrote:
>>> > >
>>> > > > Hi Hussein,
>>> > > >
>>> > > > We also had slight performance problems when talking to Aurora. We ended up using the existing Python client directly in our code (see apache.aurora.client.api.__init__.py). This allowed us to reuse the api object and its scheduler connection, dropping a connection latency of about 0.3-0.4 seconds per request.
>>> > > >
>>> > > > Best Regards,
>>> > > > Stephan
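A rough sketch of the reuse Stephan describes: build one api object and keep submitting through it. This is not a stable programmatic API, and the AuroraClientAPI constructor arguments, the CLUSTERS lookup, and the pre-built config objects are assumptions for illustration rather than checked against the client code:

---
# Build the client API object once and keep submitting through it, so the
# scheduler connection (and its ~0.3-0.4s setup cost) is paid only once.
from apache.aurora.client.api import AuroraClientAPI
from apache.aurora.common.clusters import CLUSTERS

api = AuroraClientAPI(CLUSTERS['devcluster'], user_agent='batch-submitter')  # assumed signature

for config in job_configs:         # job_configs: AuroraConfig objects built elsewhere
    resp = api.create_job(config)  # each call reuses the open scheduler connection
    print(resp.responseCode)
---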
>>> > > > ________________________________________
>>> > > > From: Bill Farner <wfar...@apache.org>
>>> > > > Sent: Wednesday, February 11, 2015 9:29 PM
>>> > > > To: dev@aurora.incubator.apache.org
>>> > > > Subject: Re: Speeding up Aurora client job creation
>>> > > >
>>> > > > To reduce that time you will indeed want to talk directly to the scheduler. This will definitely require you to roll up your sleeves a bit and set up a Thrift client to our API (based on api.thrift [1]), since you will need to specify your tasks in a format that the thermos executor can understand. Turns out this is JSON data, so it should not be *too* prohibitive.
>>> > > >
>>> > > > However, there is another technical limitation you will hit for the submission rate you are after. The scheduler is backed by a durable store whose write latency is at minimum the amount of time required to fsync.
>>> > > >
>>> > > > [1] https://github.com/apache/incubator-aurora/blob/master/api/src/main/thrift/org/apache/aurora/gen/api.thrift
>>> > > >
>>> > > > -=Bill
>>> > > >
>>> > > > On Wed, Feb 11, 2015 at 11:46 AM, Hussein Elgridly <huss...@broadinstitute.org> wrote:
>>> > > >
>>> > > > > Hi folks,
>>> > > > >
>>> > > > > I'm looking at a use case that involves submitting potentially hundreds of jobs a second to our Mesos cluster. My tests show that the Aurora client is taking 1-2 seconds for each job submission, and that I can run about four client processes in parallel before they peg the CPU at 100%. I need more throughput than this!
>>> > > > >
>>> > > > > Squashing jobs down to the Process or Task level doesn't really make sense for our use case. I'm aware that with some shenanigans I can batch jobs together using job instances, but that's a lot of work on my current timeframe (and of questionable utility given that the jobs certainly won't have identical resource requirements).
>>> > > > >
>>> > > > > What I really need is (at least) an order of magnitude speedup in terms of being able to submit jobs to the Aurora scheduler (via the client or otherwise).
>>> > > > >
>>> > > > > Conceptually it doesn't seem like adding a job to a queue should be a thing that takes a couple of seconds, so I'm baffled as to why it's taking so long. As an experiment, I wrapped the call to client.execute() in client.py:proxy_main in cProfile and called aurora job create with a very simple test job.
>>> > > > >
>>> > > > > Results of the profile are in the Gist below:
>>> > > > >
>>> > > > > https://gist.github.com/helgridly/b37a0d27f04a37e72bb5
>>> > > > >
>>> > > > > Out of a 0.977s profile time, the two things that stick out to me are:
>>> > > > >
>>> > > > > 1. 0.526s spent in Pystachio for a job that doesn't use any templates
>>> > > > > 2. 0.564s spent in create_job, presumably talking to the scheduler (and setting up the machinery for doing so)
>>> > > > >
>>> > > > > I imagine I can sidestep #1 with a check for "{{" in the job file and bypass Pystachio entirely. Can I also skip the Aurora client entirely and talk directly to the scheduler? If so what does that entail, and are there any risks associated?
>>> > > > >
>>> > > > > Thanks,
>>> > > > > -Hussein
>>> > > > >
>>> > > > > Hussein Elgridly
>>> > > > > Senior Software Engineer, DSDE
>>> > > > > The Broad Institute of MIT and Harvard
>>> > > > >
>>> > > >
>>> > >
>>> >
>>>
>>
>