On 05/10/2016 09:56 PM, Samuel Merritt wrote:
> On 5/9/16 5:21 PM, Robert Collins wrote:
>> On 10 May 2016 at 10:54, John Dickinson <m...@not.mn> wrote:
>>> On 9 May 2016, at 13:16, Gregory Haynes wrote:
>>>>
>>>> This is a bit of an aside, but I am sure others are wondering the same
>>>> thing: is there some info (specs/etherpad/ML thread/etc.) that has more
>>>> details on the bottleneck you're running into? Given that the only
>>>> clients of your service are the public-facing DNS servers, I am now even
>>>> more surprised that you're hitting a Python-inherent bottleneck.
>>>
>>> In Swift's case, the summary is that it's hard[0] to write a network
>>> service in Python that shuffles data between the network and a block
>>> device (hard drive) and effectively utilizes all of the hardware
>>> available. So far, we've done very well by fork()'ing child processes,
>> ...
>>> Initial results from a golang reimplementation of the (Python) object
>>> server are very positive[1]. We're not proposing to rewrite Swift
>>> entirely in Golang. Specifically, we're looking at improving object
>>> replication time in Swift. This service must discover what data is on
>>> a drive, talk to other servers in the cluster about what they have,
>>> and coordinate any data sync process that's needed.
>>>
>>> [0] Hard, not impossible. Of course, given enough time, we can do
>>> anything in a Turing-complete language, right? But we're not talking
>>> about possible, we're talking about efficient tools for the job at
>>> hand.
>> ...
>>
>> I'm glad you're finding you can get good results in (presumably)
>> clean, understandable code.
>>
>> Given Go's historically poor performance with multiple cores
>> (https://golang.org/doc/faq#Why_GOMAXPROCS), I'm going to presume the
>> major advantage is in the CSP programming model - something that
>> Twisted does very well: and frustratingly we've had numerous
>> discussions with folk in the Twisted world who see the pain we have
>> and want to help, but as a community we've consistently stayed with
>> eventlet, which has a threaded programming model - and threaded models
>> are poorly suited for the case here.
>
> At its core, the problem is that filesystem IO can take a surprisingly
> long time, during which the calling thread/process is blocked, and
> there's no good asynchronous alternative.
>
> Some background:
>
> With Eventlet, when your greenthread tries to read from a socket and the
> socket is not readable, then recvfrom() returns -1/EWOULDBLOCK; then,
> the Eventlet hub steps in, unschedules your greenthread, finds an
> unblocked one, and lets it proceed. It's pretty good at servicing a
> bunch of concurrent connections and keeping the CPU busy.
>
> On the other hand, when the socket is readable, then recvfrom() returns
> quickly (a few microseconds). The calling process was technically
> blocked, but the syscall is so fast that it hardly matters.
>
> Now, when your greenthread tries to read from a file, that read() call
> doesn't return until the data is in your process's memory. This can take
> a surprisingly long time. If the data isn't in buffer cache and the
> kernel has to go fetch it from a spinning disk, then you're looking at a
> seek time of ~7 ms, and that's assuming there are no other pending
> requests for the disk.
>
> There's no EWOULDBLOCK when reading from a plain file, either. If the
> file pointer isn't at EOF, then the calling process blocks until the
> kernel fetches data for it.
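To make that contrast concrete, here is a minimal sketch (illustrative
only, not Swift code) of the two cases under eventlet: the socket read
cooperatively yields to the hub when no data is ready, while the plain
file read() blocks the whole process until the kernel has the data.

    # Illustrative sketch only (not Swift code), assuming eventlet is
    # installed.  Socket I/O cooperates with the hub; file I/O does not.
    import eventlet
    eventlet.monkey_patch()  # green sockets; regular file I/O stays blocking

    def serve_client(sock):
        # When no data is ready, the non-blocking socket returns
        # EWOULDBLOCK under the hood; the hub unschedules this greenthread
        # and runs another, so many clients make progress concurrently.
        while True:
            chunk = sock.recv(65536)
            if not chunk:
                break
            sock.sendall(chunk)

    def read_from_disk(path):
        # No EWOULDBLOCK here: if the data is not in buffer cache, read()
        # blocks the entire process for the disk seek (~7 ms, or far
        # longer on a slow disk) and no other greenthread can run.
        with open(path, 'rb') as f:
            return f.read(65536)
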
>
> Back to Swift:
>
> The Swift object server basically does two things: it either reads from
> a disk and writes to a socket, or vice versa. There's a little HTTP
> parsing in there, but the vast majority of the work is shuffling bytes
> between network and disk. One Swift object server can service many
> clients simultaneously.
>
> The problem is those pauses due to read(). If your process is servicing
> hundreds of clients reading from and writing to dozens of disks (in,
> say, a 48-disk 4U server), then all those little 7 ms waits are pretty
> bad for throughput. Now, a lot of the time, the kernel does some
> readahead so your read() calls can quickly return data from buffer
> cache, but there are still lots of little hitches.
>
> But wait: it gets worse. Sometimes a disk gets slow. Maybe it's got a
> lot of pending IO requests, maybe its filesystem is getting close to
> full, or maybe the disk hardware is just starting to get flaky. For
> whatever reason, IO to this disk starts taking a lot longer than 7 ms on
> average; think dozens or hundreds of milliseconds. Now, every time your
> process tries to read from this disk, all other work stops for quite a
> long time. The net effect is that the object server's throughput
> plummets while it spends most of its time blocked on IO from that one
> slow disk.
>
> Now, of course there are things we can do. The obvious one is to use a
> couple of IO threads per disk and push the blocking syscalls out
> there... and, in fact, Swift did that. In commit b491549, the object
> server gained a small threadpool for each disk[1] and started doing its
> IO there.
>
> This worked pretty well for avoiding the slow-disk problem. Requests
> that touched the slow disk would back up, but requests for the other
> disks in the server would proceed at a normal pace. Good, right?
>
> The problem was all the threadpool overhead. Remember, a significant
> fraction of the time, write() and read() only touch buffer cache, so the
> syscalls are very fast. Adding in the threadpool overhead in Python
> slowed those down. Yes, if you were hit with a 7 ms read penalty, the
> threadpool saved you, but if you were reading from buffer cache then you
> just paid a big cost for no gain.
>
> On some object-server nodes where the CPUs were already fully utilized,
> people saw a 25% drop in throughput when using the Python threadpools.
> It's not worth that performance loss just to gain protection from slow
> disks.
>
> The second thing Swift tried was to run separate object-server processes
> for each disk[2]. This also mitigates slow disks, while avoiding the
> threadpool overhead. The downside here is that dense nodes end up with
> lots of processes; for example, a 48-disk node with 2 object servers per
> disk will end up with about 96 object-server processes running. While
> these processes aren't particularly RAM-heavy, that's still a decent
> chunk of memory that could have been holding directories in buffer cache.
>
> Aside: there are a few other things we looked at but rejected. Using
> Linux AIO (kernel AIO, not POSIX AIO) would let the object server have
> many pending IOs cheaply, but it only works in O_DIRECT mode, so there's
> no buffer cache. We also looked at the preadv2() syscall to let us
> perform buffer-cache-only reads in the main thread and fall back to a
> blocking read() in a threadpool, but unfortunately preadv2() and
> pwritev2() only hit Linux in March 2016, so people running such ancient
> software as Ubuntu Xenial Xerus [3] can't use it.
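As a concrete illustration of the per-disk threadpool approach described
above, here is a rough sketch (hypothetical code, not what actually
landed in Swift): the blocking read() runs in a native OS thread, so only
the calling greenthread waits, at the cost of thread-dispatch overhead on
every call, including the fast buffer-cache hits.

    # Rough sketch (hypothetical, not the code from commit b491549) of
    # pushing blocking disk I/O into native threads under eventlet.  A
    # single shared pool (eventlet.tpool) is used here for brevity; the
    # Swift change described above used a small pool per disk.
    import eventlet
    from eventlet import tpool
    eventlet.monkey_patch()

    def read_chunk(path, offset, length):
        def _blocking_read():
            # Runs in a real OS thread: a slow disk stalls only this
            # thread and the greenthread waiting on it, not the whole hub.
            with open(path, 'rb') as f:
                f.seek(offset)
                return f.read(length)
        # tpool.execute() hands the call to the thread pool and suspends
        # only the calling greenthread.  The dispatch overhead is paid on
        # every call, even when the data was already in buffer cache and
        # read() would have returned in microseconds; hence the ~25%
        # throughput drop seen on CPU-saturated nodes.
        return tpool.execute(_blocking_read)
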
Didn't you try asyncio from Python 3? Wouldn't it be helpful here?

Cheers,

Thomas Goirand (zigo)

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev