On 05/11/2016 03:23 PM, Samuel Merritt wrote:
> On 5/11/16 7:09 AM, Thomas Goirand wrote:
>> On 05/10/2016 09:56 PM, Samuel Merritt wrote:
>>> On 5/9/16 5:21 PM, Robert Collins wrote:
>>>> On 10 May 2016 at 10:54, John Dickinson <m...@not.mn> wrote:
>>>>> On 9 May 2016, at 13:16, Gregory Haynes wrote:
>>>>>>
>>>>>> This is a bit of an aside, but I am sure others are wondering the same thing - is there some info (specs/etherpad/ML thread/etc) that has more details on the bottleneck you're running into? Given that the only clients of your service are the public-facing DNS servers, I am now even more surprised that you're hitting a Python-inherent bottleneck.
>>>>>
>>>>> In Swift's case, the summary is that it's hard[0] to write a network service in Python that shuffles data between the network and a block device (hard drive) and effectively utilizes all of the hardware available. So far, we've done very well by fork()'ing child processes,
>>>> ...
>>>>> Initial results from a golang reimplementation of the Python object server are very positive[1]. We're not proposing to rewrite Swift entirely in Golang. Specifically, we're looking at improving object replication time in Swift. This service must discover what data is on a drive, talk to other servers in the cluster about what they have, and coordinate any data sync process that's needed.
>>>>>
>>>>> [0] Hard, not impossible. Of course, given enough time, we can do anything in a Turing-complete language, right? But we're not talking about possible, we're talking about efficient tools for the job at hand.
>>>> ...
>>>>
>>>> I'm glad you're finding you can get good results in (presumably) clean, understandable code.
>>>>
>>>> Given Go's historically poor performance with multiple cores (https://golang.org/doc/faq#Why_GOMAXPROCS), I'm going to presume the major advantage is in the CSP programming model - something that Twisted does very well. Frustratingly, we've had numerous discussions with folks from the Twisted world who see the pain we have and want to help, but as a community we've consistently stayed with eventlet, which has a threaded programming model - and threaded models are poorly suited for the case here.
>>>
>>> At its core, the problem is that filesystem IO can take a surprisingly long time, during which the calling thread/process is blocked, and there's no good asynchronous alternative.
>>>
>>> Some background:
>>>
>>> With Eventlet, when your greenthread tries to read from a socket and the socket is not readable, recvfrom() returns -1/EWOULDBLOCK; the Eventlet hub then steps in, unschedules your greenthread, finds an unblocked one, and lets it proceed. It's pretty good at servicing a bunch of concurrent connections and keeping the CPU busy.
>>>
>>> On the other hand, when the socket is readable, recvfrom() returns quickly (a few microseconds). The calling process was technically blocked, but the syscall is so fast that it hardly matters.
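As a rough sketch of that socket-side behavior (made-up port and handler, not Swift code), a single eventlet process can service many connections at once because any recv() or sendall() that would block simply yields to the hub:

    import eventlet

    def handle(sock):
        while True:
            data = sock.recv(4096)    # would-be EWOULDBLOCK -> greenthread yields to the hub
            if not data:
                break
            sock.sendall(data)        # likewise yields if the send buffer is full

    server = eventlet.listen(('127.0.0.1', 6000))   # made-up address/port
    pool = eventlet.GreenPool(1000)
    while True:
        client, addr = server.accept()              # accept() also yields while idle
        pool.spawn_n(handle, client)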
>>> Now, when your greenthread tries to read from a file, that read() call doesn't return until the data is in your process's memory. This can take a surprisingly long time. If the data isn't in buffer cache and the kernel has to go fetch it from a spinning disk, then you're looking at a seek time of ~7 ms, and that's assuming there are no other pending requests for the disk.
>>>
>>> There's no EWOULDBLOCK when reading from a plain file, either. If the file pointer isn't at EOF, the calling process blocks until the kernel fetches data for it.
>>>
>>> Back to Swift:
>>>
>>> The Swift object server basically does two things: it either reads from a disk and writes to a socket or vice versa. There's a little HTTP parsing in there, but the vast majority of the work is shuffling bytes between network and disk. One Swift object server can service many clients simultaneously.
>>>
>>> The problem is those pauses due to read(). If your process is servicing hundreds of clients reading from and writing to dozens of disks (in, say, a 48-disk 4U server), then all those little 7 ms waits are pretty bad for throughput. Now, a lot of the time, the kernel does some readahead so your read() calls can quickly return data from buffer cache, but there are still lots of little hitches.
>>>
>>> But wait: it gets worse. Sometimes a disk gets slow. Maybe it's got a lot of pending IO requests, maybe its filesystem is getting close to full, or maybe the disk hardware is just starting to get flaky. For whatever reason, IO to this disk starts taking a lot longer than 7 ms on average; think dozens or hundreds of milliseconds. Now, every time your process tries to read from this disk, all other work stops for quite a long time. The net effect is that the object server's throughput plummets while it spends most of its time blocked on IO from that one slow disk.
>>>
>>> Now, of course there are things we can do. The obvious one is to use a couple of IO threads per disk and push the blocking syscalls out there... and, in fact, Swift did that. In commit b491549, the object server gained a small threadpool for each disk[1] and started doing its IO there.
>>>
>>> This worked pretty well for avoiding the slow-disk problem. Requests that touched the slow disk would back up, but requests for the other disks in the server would proceed at a normal pace. Good, right?
>>>
>>> The problem was all the threadpool overhead. Remember, a significant fraction of the time, write() and read() only touch buffer cache, so the syscalls are very fast. Adding in the threadpool overhead in Python slowed those down. Yes, if you were hit with a 7 ms read penalty, the threadpool saved you, but if you were reading from buffer cache, then you just paid a big cost for no gain.
>>>
>>> On some object-server nodes where the CPUs were already fully utilized, people saw a 25% drop in throughput when using the Python threadpools. It's not worth that performance loss just to gain protection from slow disks.
>>>
>>> The second thing Swift tried was to run separate object-server processes for each disk[2]. This also mitigates slow disks, but it avoids the threadpool overhead. The downside here is that dense nodes end up with lots of processes; for example, a 48-disk node with 2 object servers per disk will end up with about 96 object-server processes running. While these processes aren't particularly RAM-heavy, that's still a decent chunk of memory that could have been holding directories in buffer cache.
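To make the threadpool idea above concrete, here is a rough sketch (not Swift's actual code from commit b491549; names, paths, and sizes are made up) using eventlet's tpool to push the blocking file IO into a real OS thread, so that a slow disk stalls only that thread rather than the whole hub:

    from eventlet import tpool

    CHUNK = 65536   # made-up chunk size

    def stream_object(path, client_sock):
        # open() can itself block on a slow disk, so run it in the pool too
        f = tpool.execute(open, path, 'rb')
        try:
            while True:
                # the blocking read() runs in a real OS thread; the eventlet
                # hub keeps scheduling other greenthreads in the meantime
                chunk = tpool.execute(f.read, CHUNK)
                if not chunk:
                    break
                client_sock.sendall(chunk)   # cooperative socket write
        finally:
            f.close()

Note that eventlet's tpool is a single shared pool; Swift's change used a small pool per disk, which is what lets requests for healthy disks keep moving when one disk goes bad.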
>>> Aside: there are a few other things we looked at but rejected. Using Linux AIO (kernel AIO, not POSIX libaio) would let the object server have many pending IOs cheaply, but it only works in O_DIRECT mode, so there's no buffer cache. We also looked at the preadv2() syscall to let us perform buffer-cache-only reads in the main thread and fall back to a blocking read() in a threadpool when the data wasn't already cached, but unfortunately preadv2() and pwritev2() only hit Linux in March 2016, so people running such ancient software as Ubuntu Xenial Xerus [3] can't use it.
>>
>> Didn't you try asyncio from Python 3? Wouldn't it be helpful here?
>
> Unfortunately, it would not help.
>
> The core problem is that filesystem syscalls can take a long time and that they block the calling thread.
>
> At a syscall level, Eventlet and asyncio look pretty similar:
>
> (1) Call select()/epoll()/whatever to wait for something to happen on many file descriptors.
>
> (2) For each ready file descriptor, do something. For example, if a socket fd is readable, call recvfrom(fd, buf, ...) repeatedly until the kernel returns -1/EWOULDBLOCK, then go to the next fd.
>
> (3) Repeat. (Yes, there are timed events and such, but they don't matter for purposes of this discussion.)
>
> The key thing that makes this all work is that, when select() says a file descriptor is readable, reading from that file descriptor is extremely fast. In a matter of just a few microseconds, the program either gets some data to operate on or it gets -1/EWOULDBLOCK returned. And, of course, the same goes for writability.
>
> Now, with files, this breaks down. Consider a file descriptor opened for reading on a normal file on a filesystem on a spinning disk. select() says it's readable, so the program calls read(), and the whole thing blocks for anywhere from a few tens of microseconds up to tens of seconds.
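To see that breakdown concretely, here is a tiny sketch (the path is made up): select() happily reports a regular file as readable, yet the read() that follows can still block for a full disk seek or longer:

    import os
    import select

    fd = os.open('/srv/node/sdb1/objects/1234/obj.data', os.O_RDONLY)  # made-up path

    r, _, _ = select.select([fd], [], [], 0)
    print(r)                     # [fd]: regular files always poll as "readable"

    data = os.read(fd, 65536)    # ...but this can still block for the whole seek
                                 # (or much longer on a misbehaving disk)
    os.close(fd)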
Have you tried putting the file into non-blocking mode? From the above description, it doesn't sound like you have. gevent[1] supports that and provides an interface for wrapping Python file objects to do so, at least on Linux. This keeps the thread from being blocked, thereby allowing a different greenlet to keep running while the original greenlet waits for its I/O to become available. I don't see any equivalent in eventlet[2], but that would seem like a more worthwhile contribution to eventlet, or even to oslo.

$0.02

Ben

[1] http://www.gevent.org/gevent.fileobject.html
[2] http://eventlet.net/doc/
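For reference, a rough sketch of the wrapper Ben is pointing at (made-up path and chunk size; gevent ships both a non-blocking-descriptor flavor, FileObjectPosix, and a threadpool flavor, FileObjectThread, and the FileObject alias picks one depending on platform and configuration):

    from gevent.fileobject import FileObject

    raw = open('/srv/node/sdb1/objects/1234/obj.data', 'rb')   # made-up path
    gf = FileObject(raw)   # wrap the plain file object so gevent manages the waiting
    try:
        while True:
            chunk = gf.read(65536)   # this greenlet waits here, but the gevent
                                     # hub keeps running other greenlets
            if not chunk:
                break
            # ... hand chunk to whatever is sending it to the client ...
    finally:
        gf.close()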