On Wed, Nov 16, 2011 at 8:03 AM, Kyle Sluder <kyle.slu...@gmail.com> wrote:
> On Nov 16, 2011, at 1:00 AM, Don Quixote de la Mancha
> <quix...@dulcineatech.com> wrote:
>> Calling accessors is also quite slow compared to a direct iVar access,
>> because it has to go through Objective-C's message dispatch mechanism.
>
> objc_msgSend isn't very slow. What measurements have you done that indicate
> objc_msgSend is taking any appreciable amount of time?
objc_msgSend is slow as Alaskan Molasses compared to a simple C function call. According to Instruments, my iOS App now spends about half its time in a C (not Objective-C) void function that updates the state of a Cellular Automaton grid, and the other half in a low-level iOS Core Graphics routine that fills rectangles with constant colors. Even though I implemented my most time-critical routine in C, objc_msgSend still takes up two percent of my run time. I expect it would be a lot more if I hadn't implemented that grid update routine in C.

>> Focussing on interface is no excuse for weighty, bloated code!
>
> I would argue that you have your priorities confused. Focusing on interface
> recognizes that programmer time is far more expensive than computer time.

End-user time is even more expensive than programmer time. Large executables take up more disk or Flash storage space, so they load more slowly and require more electrical power to load. On desktop and server machines, large executables are less likely to be completely resident in physical memory, so the user's hard drive gets hit for paging more often and spins down less often. On iOS and other mobile devices, your App will be slower to launch and, after having been suspended, slower to resume. Large executables also shorten battery life significantly.

It is worthwhile to reduce the size even of the portions of an executable that don't seem to impact run time. The reason is that code takes up space in the CPU's code cache. That displaces your time-critical code from the cache, so your App has to slow down and hit main memory to reload the cache.

Calling any Objective-C method also loads data into the data cache, at the very least to refer to self, as well as to search the selector table. That selector table is fscking HUGE! My iOS App isn't particularly big, but Instruments tells me that just about the very first thing my App does is malloc() 300,000 CFStrings right at startup. I expect each such CFString is either a selector, or one component of a selector that takes multiple parameters.

While the Mach-O executable format is designed to speed up selector table searching, that search still reads stuff into both the code and data caches. If the code and data are not already cached, then objc_msgSend will be very, very slow compared to a C subroutine call.

I expect that all Objective-C methods are implemented just like C subroutines that take "self" as a parameter that's not named explicitly by the coder. So objc_msgSend blows away some cache lines to determine which C implementation to call, as well as to check that the object to which we are sending the message is not nil. Only then does it call the C subroutine, at which point it becomes as efficient as C code.

Not all Objective-C methods refer to self, not even implicitly. But self is always passed by objc_msgSend, even if you don't need it. While the compiler can optimize away the register self is stored in, the runtime has no choice but to always pass self to every method. That means the method's parameters include self whether or not it is needed. On ARM, the first four parameters are, more or less, passed in registers, and one of those is always self (another is the selector, _cmd), even if the called method uses neither.

In C++, one can declare such functions as "static", so that "this" - the C++ equivalent of self - is not passed. But while Objective-C supports Class Methods, those whose signature starts with "+" instead of "-", using a Class Method instead of an Instance Method doesn't save you anything: rather than self, a pointer to the class object is passed.
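If anyone wants to see for themselves, here is a minimal sketch of the kind of measurement Kyle asked about. The Adder class and the iteration count are made up purely for illustration - this is not code from Warp Life - and the exact ratio you measure will depend on the device, the compiler's optimization settings and the state of the caches, but the direct C call will come out well ahead of the message send:

#import <Foundation/Foundation.h>
#include <mach/mach_time.h>

/* A direct C call.  The compiler is free to inline this away entirely,
 * which is part of the point. */
static int PlainCAdd(int a, int b) { return a + b; }

@interface Adder : NSObject
- (int)add:(int)a to:(int)b;
@end

@implementation Adder
- (int)add:(int)a to:(int)b { return a + b; }  /* always goes through objc_msgSend */
@end

int main(void)
{
    @autoreleasepool {
        const uint64_t kIterations = 10000000;
        Adder *adder = [[Adder alloc] init];
        volatile int sink = 0;          /* keeps the loops from being optimized out */

        uint64_t start = mach_absolute_time();
        for (uint64_t i = 0; i < kIterations; ++i)
            sink += PlainCAdd(1, 2);
        uint64_t cTicks = mach_absolute_time() - start;

        start = mach_absolute_time();
        for (uint64_t i = 0; i < kIterations; ++i)
            sink += [adder add:1 to:2];
        uint64_t objcTicks = mach_absolute_time() - start;

        mach_timebase_info_data_t tb;
        mach_timebase_info(&tb);        /* converts Mach ticks to nanoseconds */
        NSLog(@"direct C call: %llu ns",
              (unsigned long long)(cTicks * tb.numer / tb.denom));
        NSLog(@"message send:  %llu ns",
              (unsigned long long)(objcTicks * tb.numer / tb.denom));
    }
    return 0;
}

Profile it under Instruments' Time Profiler and the extra time in the second loop should show up attributed to objc_msgSend.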
I've been trying to beat these arguments into the heads of my fellow coders since most of you lot were in diapers. Just about always the responses are that "Premature Optimization is the Root of All Evil," and that programmer time is too valuable to spend on optimization. But at the same time, the vast majority of the software I use on a daily basis is huge, bloated and slow. It has been my experience for many years that it is simply not possible for me to buy a computer with a CPU, memory, network or disk fast enough - or with enough memory and disk - that it remains useful for any length of time.

How much do you all know about how hardware engineers try to make our software run faster? Moore's "Law" claims that the number of transistors on any given type of chip doubles roughly every eighteen months, and most such chips also double in their maximum throughput. Designing each new generation of chip, and constructing and equipping the wafer fabs that make them, is colossally expensive. A low-end wafer fab that makes chips that aren't particularly fancy costs at least a billion dollars; a fab for any kind of really interesting chip, like a high-end microprocessor or really large and fast memory, costs quite a lot more than that. But the woeful industry practice of just assuming that memory, CPU power, disk storage and network bandwidth are infinite more than reverses the speed and capacity gains developed at colossal expense by the hardware people.

You all speak as if you think I'm a clueless newbie, but I was a "White Badge" Senior Engineer in Apple's Traditional OS Integration team from 1995 through 1996. For most of that time I worked as a "Debug Meister", isolating and fixing the most serious bugs and performance problems in the Classic Mac OS System 7.5.2 - the System for the first PCI PowerPC Macs - and 7.5.3.

One of my tasks was to find a way to speed up a new hardware product that wasn't as fast as it needed to be to compete well against similar products from other vendors. After tinkering with a prototype unit for a while, I rewrote an important code path in the Resource Manager so that it used less of both the code and data caches. That code path was taken quite commonly by every Classic App as well as by the System software, so the change improved the performance of the entire system.

Even so, our product wasn't going to sell well unless most if not all of the code paths in the entire Classic System software improved their cache utilization, so I wrote and distributed a Word document pointing out that the code and data caches in our product's CPU were very limited resources. Rather than writing our software with the assumption that we had the use of - at the time - dozens to hundreds of megabytes of very fast memory, the document argued that we should focus instead on cache utilization. I illustrated the point with a rectangular array of bullet characters, one for each of the 32-byte code or data cache lines in the PowerPC chips of the day.

Let me give you such a diagram for the ARM Cortex A8 CPUs that are used by the iPhone 4, the first-generation iPad, and the third and - I think - fourth generation iPod Touch. The Cortex A8 has 64 bytes in each cache line. You might think that's a good thing, but it isn't if your memory access patterns aren't harmonious with the Cortex's cache design: if you read or write so much as one byte in a cache line without then using the remaining 63 bytes somehow, you are wasting the user's time and draining their battery in a way that you should not have to.
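To make that concrete, here is a small sketch in plain C - the grid size and fill pattern are arbitrary, and this is not Warp Life's actual update loop - of two ways to walk the same one-megabyte grid. The row-major walk uses all 64 bytes of every cache line it pulls in; the column-major walk does exactly the same arithmetic but strides a kilobyte between accesses, so it drags in a fresh line for nearly every byte it touches:

#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define ROWS   1024
#define COLS   1024
#define PASSES 100

static uint8_t grid[ROWS][COLS];   /* 1 MB, far larger than a 32 KB L1 cache */

/* Walks the grid in the order it is laid out in memory, so every byte of
 * each 64-byte cache line that gets pulled in is actually used. */
static uint64_t SumRowMajor(void)
{
    uint64_t sum = 0;
    for (int r = 0; r < ROWS; ++r)
        for (int c = 0; c < COLS; ++c)
            sum += grid[r][c];
    return sum;
}

/* Same arithmetic, but strides 1024 bytes between accesses, so nearly every
 * access drags in a fresh cache line and uses only one byte of it. */
static uint64_t SumColumnMajor(void)
{
    uint64_t sum = 0;
    for (int c = 0; c < COLS; ++c)
        for (int r = 0; r < ROWS; ++r)
            sum += grid[r][c];
    return sum;
}

int main(void)
{
    for (int r = 0; r < ROWS; ++r)
        for (int c = 0; c < COLS; ++c)
            grid[r][c] = (uint8_t)(r ^ c);

    uint64_t sum = 0;
    clock_t t0 = clock();
    for (int p = 0; p < PASSES; ++p)
        sum += SumRowMajor();
    clock_t t1 = clock();
    for (int p = 0; p < PASSES; ++p)
        sum += SumColumnMajor();
    clock_t t2 = clock();

    printf("row-major:    %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("column-major: %.3f s\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
    printf("(checksum %llu)\n", (unsigned long long)sum);
    return 0;
}

The second loop should take noticeably longer than the first even though both compute exactly the same sum, and that difference is pure waste of the user's time and battery.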
The ARM Holdings company doesn't manufacture chips itself; it designs them, then licenses the designs to other companies, who use the "IP" or "cores" in what are usually more complex chips. In the case of the first-generation iPad and the iPhone 4, while the design is based on the Cortex A8, the proper name for the chip is the Apple A4, and the A4 differs in some respects from other Cortex A8 implementations. From Wikipedia:

http://en.wikipedia.org/wiki/Apple_A4

... the Apple A4 has a 32 KB L1 code cache and a 32 KB L1 data cache. At 64 bytes per cache line, that gives us just 512 cache lines each for code and for data. Thus, rather than assuming that programmer time is too valuable to take pride in your work, you should be assuming that your code and your data each have to make the very best possible use of 512 64-byte cache lines.

Here is a graphical diagram of how many cache lines are available in each cache:

* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *

That's actually quite a limited resource.

I didn't have much of a clue about Objective-C, Cocoa or Cocoa Touch when I first started writing Warp Life; I learned about them as I went along. Much of my work has focused on making Warp Life run faster and use less memory, but because I didn't really know what I was doing when I started coding, my early implementations were quite half-baked.

A few nights ago I decided to put quite a lot of time and effort into refactoring the code so as to reduce my executable size. I know a lot of algorithmic improvements that will speed it up dramatically, but I'm putting those off until the refactoring is complete. At the start of the refactoring, Warp Life's executable was about 395 KB. I've now done most of the refactoring that I can, and the executable is about 360 KB. That's roughly a nine percent reduction in code size for about twelve hours of work. I assert that is a good use of my time.
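Here is one more small sketch of what I mean by making the best use of those 512 data cache lines. The struct and its fields are hypothetical - this is not Warp Life's actual cell representation - but the point stands: order your fields carelessly and the compiler pads them out, and those padding bytes squander cache lines:

#include <stdio.h>
#include <stdint.h>

struct CellPadded {          /* fields ordered carelessly            */
    uint8_t  alive;
    uint32_t age;
    uint8_t  neighborCount;
    uint32_t color;
};                           /* typically 16 bytes once padded       */

struct CellPacked {          /* same fields, widest first            */
    uint32_t age;
    uint32_t color;
    uint8_t  alive;
    uint8_t  neighborCount;
};                           /* typically 12 bytes                   */

int main(void)
{
    printf("padded: %zu bytes, %zu cells per 64-byte line\n",
           sizeof(struct CellPadded), (size_t)(64 / sizeof(struct CellPadded)));
    printf("packed: %zu bytes, %zu cells per 64-byte line\n",
           sizeof(struct CellPacked), (size_t)(64 / sizeof(struct CellPacked)));
    return 0;
}

Four cells per line versus five doesn't sound like much, but multiplied across a whole grid it is a meaningful number of extra cache line fills on every single generation. The same thinking applies to choosing a uint8_t over an int when a byte will do, and to keeping data that is used together next to each other in memory.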
One more thing to consider: unless you use some kind of profile-directed optimization, or you write a linker script by hand, the linker isn't very intelligent about laying out the code in your executable file. That means uncommonly-used code will quite often land in the same virtual memory page as commonly-used code. Even if that uncommon code is never called, so that it doesn't blow away your code cache, your program will use more physical memory than it would if your executable were smaller.

The simple fix is to refactor your source so that as many of your methods as possible compile to less executable code. Not as easy, but far more effective, is to create a linker script that places all of your less-commonly used code together at the end of your executable file, with the more commonly-used code at the beginning. With most of the rarely-used code gathered at the end of the file, large portions of your program may never be paged in from disk or Flash storage at all.

Also, despite the fact that iOS doesn't have backing store, it DOES have virtual memory. The Cortex A8 has hardware memory management, so executable code is read into memory by way of page faults: your program jumps into an unmapped memory region, the fault saves all the registers on the stack - the A8 has quite a few registers - and enters the kernel, which eventually figures out that the faulting access really is valid. A page table entry is allocated in the kernel's memory, the Flash storage page that entry refers to is read into physical memory, and the page fault exception returns, with all of those registers being restored from the stack.

If your code is smaller, and you gather the less-frequently used code all in one place, that colossal overhead of faulting in executable code won't happen as often. Your user's device will spend more of its time and battery power running your App instead of executing kernel code, the battery charge will last longer, and the kernel will allocate fewer page tables.

I Hope That Clears This All Up.

Don Quixote
-- 
Don Quixote de la Mancha
Dulcinea Technologies Corporation
Software of Elegance and Beauty
http://www.dulcineatech.com
quix...@dulcineatech.com