On 10/05/2016 23:28, Gregory Haynes wrote:
> On Tue, May 10, 2016, at 11:10 AM, Hayes, Graham wrote:
>> On 10/05/2016 01:01, Gregory Haynes wrote:
>>>
>>> On Mon, May 9, 2016, at 03:54 PM, John Dickinson wrote:
>>>> On 9 May 2016, at 13:16, Gregory Haynes wrote:
>>>>>
>>>>> This is a bit of an aside but I am sure others are wondering the same
>>>>> thing - Is there some info (specs/etherpad/ML thread/etc) that has more
>>>>> details on the bottleneck you're running in to? Given that the only
>>>>> clients of your service are the public facing DNS servers I am now even
>>>>> more surprised that you're hitting a python-inherent bottleneck.
>>>>
>>>> In Swift's case, the summary is that it's hard[0] to write a network
>>>> service in Python that shuffles data between the network and a block
>>>> device (hard drive) and effectively utilizes all of the hardware
>>>> available. So far, we've done very well by fork()'ing child processes,
>>>> using cooperative concurrency via eventlet, and basic "write more
>>>> efficient code" optimizations. However, when it comes down to it,
>>>> managing all of the async operations across many cores and many drives
>>>> is really hard, and there just isn't a good, efficient interface for
>>>> that in Python.
>>>
>>> This is a pretty big difference from hitting an unsolvable performance
>>> issue in the language and instead is a case of language preference -
>>> which is fine. I don't really want to fall in to the language-comparison
>>> trap, but I think more detailed reasoning for why it is preferable over
>>> python in specific use cases we have hit is good info to include /
>>> discuss in the document you're drafting :). Essentially it's a matter of
>>> weighing the costs (which lots of people have hit on so I won't) with
>>> the potential benefits and so unless the benefits are made very clear
>>> (especially if those benefits are technical) it's pretty hard to evaluate
>>> IMO.
>>>
>>> There seemed to be an assumption in some of the designate rewrite posts
>>> that there is some language-inherent performance issue causing a
>>> bottleneck. If this does actually exist then that is a good reason for
>>> rewriting in another language and is something that would be very useful
>>> to clearly document as a case where we support this type of thing. I am
>>> highly suspicious that this is the case though, but I am trying hard to
>>> keep an open mind...
>>
>> The way this component works makes it quite difficult to make any major
>> improvement.
>
> OK, I'll bite.
>
> I had a look at the code and there's a *ton* of low hanging fruit. I
> decided to hack in some fixes or emulation of fixes to see whether I
> could get any major improvements. Each test I ran 4 workers using
> SO_REUSEPORT and timed doing 1k axfr's with 4 in parallel at a time and
> recorded 5 timings. I also added these changes on top of one another in
> the order they follow.

Thanks for the analysis - any suggestions about how we can improve the
current design are more than welcome.

For this test, was it a single static zone? What size was it?

>
> Base timings: [9.223, 9.030, 8.942, 8.657, 9.190]
>
> Stop spawning a thread per request - there are a lot of ways to do this
> better, but let's not even mess with that and just literally remove the
> thread spawning that happens per request because it's a silly idea here:
> [8.579, 8.732, 8.217, 8.522, 8.214] (almost 10% increase).
>
> Stop instantiating oslo config object per request - this should be a
> no-brainer, we don't need to parse config inside of a request handler:
> [8.544, 8.191, 8.318, 8.086] (a few more percent).
>
> Now, the slightly less low hanging fruit - there are 3 round trips to
> the database *every request*. This is where the vast majority of request
> time is spent (not in python). I didn't actually implement a full on
> cache (I just hacked around the db queries), but this should be trivial
> to do since designate does know when to invalidate the cache data. Some
> numbers on how much a warm cache will help:
>
> Caching zone: [5.968, 5.942, 5.936, 5.797, 5.911]
>
> Caching records: [3.450, 3.357, 3.364, 3.459, 3.352].
>
> I would also expect real-world usage to be similar in that you should
> only get 1 cache miss per worker per notify, and then all the other
> public DNS servers would be getting cache hits. You could also remove
> the cost of that 1 cache miss by pre-loading data in to the cache.

I actually would expect the real world use of this to have most of the
servers have a cache miss.

We shuffle the order of the miniDNS servers sent out to the user facing
DNS servers, so I would expect them to hit different miniDNS servers
at nearly the same time, and each of them to try to generate the cache
entry.

For pre-loading - this could work, but I *really* don't like relying on
a cache for one of the critical path components.
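
For reference, the warm-cache idea being debated above could look roughly
like the sketch below. This is not Designate code: ZoneCache, fake_loader
and the explicit invalidate() hook are hypothetical names, and the loader
just stands in for the database round trips. The cache is per process, so
it does nothing about several miniDNS servers each taking their own miss
at the same moment; it only ensures a single worker loads a zone once.

```python
import threading


class ZoneCache:
    """Per-worker, in-memory cache of zone data (hypothetical helper)."""

    def __init__(self, loader):
        # loader(zone_name) is an assumed callable standing in for the
        # database queries a request handler would otherwise make.
        self._loader = loader
        self._data = {}
        self._locks = {}
        self._guard = threading.Lock()

    def _lock_for(self, zone_name):
        # One lock per zone, created on first use.
        with self._guard:
            return self._locks.setdefault(zone_name, threading.Lock())

    def get(self, zone_name):
        # Fast path: warm cache, no database access.
        try:
            return self._data[zone_name]
        except KeyError:
            pass
        # Slow path: only one thread per zone runs the loader; any other
        # thread asking for the same zone waits and reuses the result.
        with self._lock_for(zone_name):
            if zone_name not in self._data:
                self._data[zone_name] = self._loader(zone_name)
            return self._data[zone_name]

    def invalidate(self, zone_name):
        # Assumed to be called whenever the zone is updated.
        with self._lock_for(zone_name):
            self._data.pop(zone_name, None)


calls = []


def fake_loader(zone_name):
    # Pretend database read; records how often it was invoked.
    calls.append(zone_name)
    return ["%s record %d" % (zone_name, len(calls))]


cache = ZoneCache(fake_loader)
cache.get("example.com.")         # miss: loader runs
cache.get("example.com.")         # hit: served from memory
cache.invalidate("example.com.")  # zone changed
cache.get("example.com.")         # miss again: loader runs a second time
print(len(calls))
```

Pre-loading would amount to calling get() right after invalidate(), which
moves the miss off the request path but keeps the cache on the critical
path.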

>
> All said and done, I think that's almost a 3x speed increase with
> minimal effort. So, can we stop saying that this has anything to do with
> Python as a language and has everything to do with the algorithms being
> used?

As I have said before - for us, the performance improvement we get per
hour of development time is just much higher (for our dev team at
least) with Go.

We saw a 50x improvement for small SOA queries, and a ~10x improvement
for a 2000-record AXFR (without caching). The majority of your
improvement came from caching, so I would imagine that would speed up
the Go implementation as well.
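
To put numbers on that, the timings quoted earlier in this mail can be
averaged and compared with a few lines of Python. This is only arithmetic
on the figures as posted (note the config run has four samples, not five);
the dictionary labels are my own shorthand for the steps.

```python
from statistics import mean

# Timings in seconds, copied from the quoted benchmark runs.
timings = {
    "base": [9.223, 9.030, 8.942, 8.657, 9.190],
    "no thread per request": [8.579, 8.732, 8.217, 8.522, 8.214],
    "no config per request": [8.544, 8.191, 8.318, 8.086],
    "cache zone": [5.968, 5.942, 5.936, 5.797, 5.911],
    "cache records": [3.450, 3.357, 3.364, 3.459, 3.352],
}

means = {name: mean(values) for name, values in timings.items()}
base = means["base"]
final = means["cache records"]

for name, value in means.items():
    print("%-22s mean %.3fs  speedup vs base %.2fx" % (name, value, base / value))

# Overall speedup after all changes are stacked.
overall_speedup = base / final

# Fraction of the total time saved that came from the two caching steps,
# i.e. everything after the "no config per request" run.
caching_share = (means["no config per request"] - final) / (base - final)

print("overall speedup: %.2fx" % overall_speedup)
print("share of saving from caching: %.0f%%" % (caching_share * 100))
```

On these figures the stacked changes come out at roughly 2.65x overall,
with roughly 87% of the saved time coming from the two caching steps.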

>>
>> MiniDNS (the component) takes data and sends a zone transfer every time
>> a recordset gets updated. That is a full (AXFR) zone transfer, so every
>> record in the zone gets sent to each of the DNS servers that end users
>> can hit.
>>
>> This can be quite a large number - ns[1-6].example.com. may well be
>> tens or hundreds of servers behind anycast IPs and load balancers.
>>
>
> This design sounds like a *perfect* contender for caching. If you're
> designing this properly it's purely a question of how quickly you can
> shove memory over the wire and as a result your choice in language will
> have almost no effect - it'll be entirely an i/o bound problem.
>
>> In many cases, internal zones (or even external zones) can be quite
>> large - I have seen zones that are 200-300MB. If a zone is high traffic
>> (like say cloud.example.com. where a record is added / removed for
>> each boot / destroy, or the reverse DNS zones for a cloud), there can
>> be a lot of data sent out from this component.
>
> Great! It's even more I/O bound, then.
>
>>
>> We are a small development team, and after looking at our options, and
>> judging the amount of developer hours we had available, a different
>> language was the route we decided on. I was going to go implement a few
>> POCs and see what was most suitable.
>
> This is what especially surprises me. The problems going on here are
> purely algorithmic, and the thinking is that rather than solve those
> issues the small amount of development time needs to be spent on a
> reimplementation which is also going to have costs to the wider
> community due to the language choice.
>
>>
>> Golang was then being proposed as a new "blessed" language, and as it
>> was a language that we had a pre-existing POC in we decided to keep
>> this within the potential new list of languages.
>>
>> As I said before, we did not just randomly decide this. We have been
>> talking about it for a while, and at this summit we dedicated an entire
>> session to it, and decided to do it.
>
> Cheers,
> Greg
>
> __________________________________________________________________________
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>


