On 10/05/2016 23:28, Gregory Haynes wrote:
> On Tue, May 10, 2016, at 11:10 AM, Hayes, Graham wrote:
>> On 10/05/2016 01:01, Gregory Haynes wrote:
>>>
>>> On Mon, May 9, 2016, at 03:54 PM, John Dickinson wrote:
>>>> On 9 May 2016, at 13:16, Gregory Haynes wrote:
>>>>>
>>>>> This is a bit of an aside, but I am sure others are wondering the
>>>>> same thing - is there some info (specs/etherpad/ML thread/etc.)
>>>>> that has more details on the bottleneck you're running into?
>>>>> Given that the only clients of your service are the public-facing
>>>>> DNS servers, I am now even more surprised that you're hitting a
>>>>> Python-inherent bottleneck.
>>>>
>>>> In Swift's case, the summary is that it's hard[0] to write a
>>>> network service in Python that shuffles data between the network
>>>> and a block device (hard drive) and effectively utilizes all of
>>>> the hardware available. So far, we've done very well by fork()'ing
>>>> child processes, using cooperative concurrency via eventlet, and
>>>> basic "write more efficient code" optimizations. However, when it
>>>> comes down to it, managing all of the async operations across many
>>>> cores and many drives is really hard, and there just isn't a good,
>>>> efficient interface for that in Python.
>>>
>>> This is a pretty big difference from hitting an unsolvable
>>> performance issue in the language; instead, it is a case of
>>> language preference - which is fine. I don't really want to fall
>>> into the language-comparison trap, but I think more detailed
>>> reasoning for why it is preferable over Python in the specific use
>>> cases we have hit is good info to include / discuss in the document
>>> you're drafting :). Essentially, it's a matter of weighing the
>>> costs (which lots of people have hit on, so I won't) against the
>>> potential benefits, and unless the benefits are made very clear
>>> (especially if those benefits are technical) it's pretty hard to
>>> evaluate IMO.
>>>
>>> There seemed to be an assumption in some of the designate rewrite
>>> posts that there is some language-inherent performance issue
>>> causing a bottleneck. If this does actually exist, then that is a
>>> good reason for rewriting in another language and is something that
>>> would be very useful to clearly document as a case where we support
>>> this type of thing. I am highly suspicious that this is the case,
>>> though, but I am trying hard to keep an open mind...
>>
>> The way this component works makes it quite difficult to make any
>> major improvement.
>
> OK, I'll bite.
>
> I had a look at the code and there's a *ton* of low hanging fruit. I
> decided to hack in some fixes or emulation of fixes to see whether I
> could get any major improvements. For each test I ran 4 workers
> using SO_REUSEPORT and timed doing 1k AXFRs, 4 in parallel at a
> time, and recorded 5 timings. I also added these changes on top of
> one another in the order they follow.

Thanks for the analysis - any suggestions about how we can improve the
current design are more than welcome. For this test, was it a single
static zone? What size was it?
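For anyone wanting to reproduce this, I read the setup as roughly the
sketch below - the names, addresses, and handler body are my guesses,
not Greg's actual harness:

    import os
    import socket

    WORKERS = 4                      # 4 workers, as in the test
    ADDR = ("127.0.0.1", 5354)       # assumed MiniDNS test address

    def serve():
        # Each worker binds its own socket; SO_REUSEPORT lets the
        # kernel spread incoming AXFR connections across the workers.
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
        sock.bind(ADDR)
        sock.listen(128)
        while True:
            conn, _ = sock.accept()
            # ... answer one AXFR over conn ...
            conn.close()

    for _ in range(WORKERS):
        if os.fork() == 0:           # child process becomes a worker
            serve()                  # never returns
    os.wait()                        # parent: block while workers run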
> Base timings: [9.223, 9.030, 8.942, 8.657, 9.190]
>
> Stop spawning a thread per request - there are a lot of ways to do
> this better, but let's not even mess with that and just remove the
> thread spawning that happens per request, because it's a silly idea
> here: [8.579, 8.732, 8.217, 8.522, 8.214] (almost a 10% improvement).
>
> Stop instantiating an oslo.config object per request - this should be
> a no-brainer; we don't need to parse config inside of a request
> handler: [8.544, 8.191, 8.318, 8.086] (a few more percent).
>
> Now, the slightly less low hanging fruit - there are 3 round trips
> to the database *every request*. This is where the vast majority of
> request time is spent (not in Python). I didn't actually implement a
> full-on cache (I just hacked around the db queries), but this should
> be trivial to do since designate does know when to invalidate the
> cache data. Some numbers on how much a warm cache will help:
>
> Caching zone: [5.968, 5.942, 5.936, 5.797, 5.911]
>
> Caching records: [3.450, 3.357, 3.364, 3.459, 3.352]
>
> I would also expect real-world usage to be similar in that you
> should only get 1 cache miss per worker per notify, and then all the
> other public DNS servers would be getting cache hits. You could also
> remove the cost of that 1 cache miss by pre-loading data into the
> cache.

I actually would expect the real-world use of this to have most of
the servers hit a cache miss. We shuffle the order of the miniDNS
servers sent out to the user-facing DNS servers, so I would expect
them to hit different miniDNS servers at nearly the same time, and
each of them would try to generate the cache entry. For pre-loading -
this could work, but I *really* don't like relying on a cache for one
of the critical path components.

> All said and done, I think that's almost a 3x speed increase with
> minimal effort. So, can we stop saying that this has anything to do
> with Python as a language, and accept that it has everything to do
> with the algorithms being used?

As I have said before - for us, the ratio of performance improvement
to time spent is just much higher (for our dev team at least) with
Go. We saw a 50x improvement for small SOA queries, and a ~10x
improvement for a 2000 record AXFR (without caching). The majority of
your improvement came from caching, so I would imagine that would
speed up the Go implementation as well.

>> MiniDNS (the component) takes data and sends a zone transfer every
>> time a recordset gets updated. That is a full (AXFR) zone transfer,
>> so every record in the zone gets sent to each of the DNS servers
>> that end users can hit.
>>
>> This can be quite a large number - ns[1-6].example.com. may well be
>> tens or hundreds of servers behind anycast IPs and load balancers.
>
> This design sounds like a *perfect* contender for caching. If you're
> designing this properly, it's purely a question of how quickly you
> can shove memory over the wire, and as a result your choice of
> language will have almost no effect - it'll be entirely an I/O-bound
> problem.
>
>> In many cases, internal zones (or even external zones) can be quite
>> large - I have seen zones that are 200-300 MB. If a zone is high
>> traffic (like, say, cloud.example.com., where a record is added /
>> removed for each boot / destroy, or the reverse DNS zones for a
>> cloud), there can be a lot of data sent out from this component.
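To put rough numbers on that fan-out (purely illustrative, using the
figures above):

    # One recordset update triggers a full AXFR to every downstream
    # server, so the data sent per update is roughly:
    zone_size_mb = 250       # "200-300 MB" zones exist
    downstream = 100         # "tens or hundreds" of servers
    print(zone_size_mb * downstream / 1024.0)  # ~24 GB per update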
> Great! It's even more I/O bound, then.
>
>> We are a small development team, and after looking at our options,
>> and judging the amount of developer hours we had available, a
>> different language was the route we decided on. I was going to go
>> implement a few POCs and see what was most suitable.
>
> This is what especially surprises me. The problems going on here are
> purely algorithmic, and the thinking is that rather than solve those
> issues, the small amount of development time needs to be spent on a
> reimplementation, which is also going to have costs to the wider
> community due to the language choice.
>
>> Golang was then being proposed as a new "blessed" language, and as
>> it was a language that we had a pre-existing POC in, we decided to
>> keep it within the potential new list of languages.
>>
>> As I said before, we did not just randomly decide this. We have
>> been talking about it for a while, and at this summit we dedicated
>> an entire session to it, and decided to do it.
>
> Cheers,
> Greg
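For completeness - the kind of invalidation-aware cache being
suggested here would look something like the sketch below. To be
clear, this is a minimal illustration with made-up names
(get_records / on_zone_updated are not designate's actual
interfaces), and it shows exactly the critical-path dependency I am
wary of: a cold or flushed cache puts the DB round trips right back
on the AXFR path.

    # Minimal sketch of a records cache keyed by zone, evicted on
    # update. MiniDNS knows when a zone changes (it receives the
    # update), so it can invalidate eagerly; the stampede problem is
    # that every worker on every miniDNS host misses at the same time.
    _cache = {}

    def get_records(zone_id, fetch_from_db):
        # fetch_from_db is a callable standing in for the 3
        # per-request DB round trips measured above.
        if zone_id not in _cache:
            _cache[zone_id] = fetch_from_db(zone_id)
        return _cache[zone_id]

    def on_zone_updated(zone_id):
        # Called when a recordset changes, before sending NOTIFYs.
        _cache.pop(zone_id, None)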