On Wed, Nov 16, 2011 at 8:03 AM, Kyle Sluder <kyle.slu...@gmail.com> wrote:
> On Nov 16, 2011, at 1:00 AM, Don Quixote de la Mancha
> <quix...@dulcineatech.com> wrote:
>> Calling accessors is also quite slow compared to a direct iVar access,
>> because it has to go through Objective-C's message dispatch mechanism.
>
> objc_msgSend isn't very slow. What measurements have you done that indicate
> objc_msgSend is taking any appreciable amount of time?
objc_msgSend is slow as Alaskan Molasses compared to a simple C function call. According to Instruments, my iOS App now spends about half its time in a C (not Objective-C) void function that updates the state of a Cellular Automaton grid, and the other half in a low-level iOS Core Graphics routine that fills rectangles with constant colors. Even though I implemented my most time-critical routine in C, objc_msgSend still takes up two percent of my run time. I expect it would be a lot more if I hadn't implemented that grid update routine in C.

>> Focussing on interface is no excuse for weighty, bloated code!
>
> I would argue that you have your priorities confused. Focusing on interface
> recognizes that programmer time is far more expensive than computer time.

End-user time is even more expensive than programmer time. Large executables take up more disk or Flash storage space, so they load more slowly and require more electrical power to load. On desktop and server machines, large executables are less likely to be completely resident in physical memory, so the user's hard drive gets hit for paging more often and spins down less often. On iOS and other mobile devices, your App will be slower to launch and, after having been suspended, slower to resume. Large executables also shorten battery life significantly.

It is worthwhile to reduce the size even of the portions of an executable that don't seem to impact run time. The reason is that code takes up space in the CPU's code cache. That displaces your time-critical code from the cache, so your App has to slow down and hit main memory to reload the cache.

Calling any Objective-C method also loads data into the data cache, at the very least to refer to self, as well as to search the selector table. That selector table is fscking HUGE! My iOS App isn't particularly big, but Instruments tells me that just about the very first thing my App does is malloc() 300,000 CFStrings right at startup. I expect each such CFString is either a selector, or one component of a selector that takes multiple parameters.

While the Mach-O executable format is designed to speed up selector table searching, that search still reads stuff into both the code and data caches. If the code and data are not already cached, then objc_msgSend will be very, very slow compared to a C subroutine call.

I expect that all Objective-C methods are implemented just like C subroutines that take "self" as a parameter that's not named explicitly by the coder. So objc_msgSend blows away some cache lines to determine which C implementation to call, as well as to check that the object to which we are sending the message is not nil. Only then does it call the C subroutine, at which point it becomes as efficient as C code.

Not all Objective-C methods refer to self, not even implicitly. But self is always passed by objc_msgSend, even if you don't need it. While the compiler can optimize away the register self is stored in, the runtime has no choice but to always pass self to every method. That means the method's parameters include self whether or not it is needed. On ARM, the first four parameters are, more or less, passed in registers, and one of those is always self (another is the selector, _cmd), even if the called method uses neither.

In C++, one can declare such functions as "static", so that "this" - the C++ equivalent of self - is not passed. But while Objective-C supports Class Methods, those whose signature starts with "+" instead of "-", using a Class Method instead of an Instance Method doesn't save you anything: rather than self, a pointer to the class object is passed.
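If anyone wants to see for themselves, here is a minimal sketch of the kind of measurement Kyle asked about. The Adder class and the iteration count are made up purely for illustration - this is not code from Warp Life - and the exact ratio you measure will depend on the device, the compiler's optimization settings and the state of the caches, but the direct C call will come out well ahead of the message send:

#import <Foundation/Foundation.h>
#include <mach/mach_time.h>

/* A direct C call.  The compiler is free to inline this away entirely,
 * which is part of the point. */
static int PlainCAdd(int a, int b) { return a + b; }

@interface Adder : NSObject
- (int)add:(int)a to:(int)b;
@end

@implementation Adder
- (int)add:(int)a to:(int)b { return a + b; }  /* always goes through objc_msgSend */
@end

int main(void)
{
    @autoreleasepool {
        const uint64_t kIterations = 10000000;
        Adder *adder = [[Adder alloc] init];
        volatile int sink = 0;          /* keeps the loops from being optimized out */

        uint64_t start = mach_absolute_time();
        for (uint64_t i = 0; i < kIterations; ++i)
            sink += PlainCAdd(1, 2);
        uint64_t cTicks = mach_absolute_time() - start;

        start = mach_absolute_time();
        for (uint64_t i = 0; i < kIterations; ++i)
            sink += [adder add:1 to:2];
        uint64_t objcTicks = mach_absolute_time() - start;

        mach_timebase_info_data_t tb;
        mach_timebase_info(&tb);        /* converts Mach ticks to nanoseconds */
        NSLog(@"direct C call: %llu ns",
              (unsigned long long)(cTicks * tb.numer / tb.denom));
        NSLog(@"message send:  %llu ns",
              (unsigned long long)(objcTicks * tb.numer / tb.denom));
    }
    return 0;
}

Profile it under Instruments' Time Profiler and the extra time in the second loop should show up attributed to objc_msgSend.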
I've been trying to beat these arguments into the heads of my fellow coders since most of you lot were in diapers. Just about always the responses are that "Premature Optimization is the Root of All Evil," and that programmer time is too valuable to spend on optimization. But at the same time, the vast majority of the software I use on a daily basis is huge, bloated and slow. It has been my experience for many years that it is simply not possible for me to buy a computer with a CPU, memory, network or disk fast enough - or with enough memory and disk - that it remains useful for any length of time.

How much do you all know about how hardware engineers try to make our software run faster? Moore's "Law" claims that the number of transistors on any given type of chip doubles roughly every eighteen months, and most such chips also double in their maximum throughput. Designing each new generation of chip, and constructing and equipping the wafer fabs that make them, is colossally expensive. A low-end wafer fab that makes chips that aren't particularly fancy costs at least a billion dollars; a fab for any kind of really interesting chip, like a high-end microprocessor or really large and fast memory, costs quite a lot more than that. But the woeful industry practice of just assuming that memory, CPU power, disk storage and network bandwidth are infinite more than reverses the speed and capacity gains developed at colossal expense by the hardware people.

You all speak as if you think I'm a clueless newbie, but I was a "White Badge" Senior Engineer in Apple's Traditional OS Integration team from 1995 through 1996. For most of that time I worked as a "Debug Meister", isolating and fixing the most serious bugs and performance problems in the Classic Mac OS System 7.5.2 - the System for the first PCI PowerPC Macs - and 7.5.3.

One of my tasks was to find a way to speed up a new hardware product that wasn't as fast as it needed to be to compete well against similar products from other vendors. After tinkering with a prototype unit for a while, I rewrote an important code path in the Resource Manager so that it used less of both the code and data caches. That code path was taken quite commonly by every Classic App as well as by the System software, so the change improved the performance of the entire system.

Even so, our product wasn't going to sell well unless most if not all of the code paths in the entire Classic System software improved their cache utilization, so I wrote and distributed a Word document pointing out that the code and data caches in our product's CPU were very limited resources. Rather than writing our software with the assumption that we had the use of - at the time - dozens to hundreds of megabytes of very fast memory, the document argued that we should focus instead on cache utilization. I illustrated the point with a rectangular array of bullet characters, one for each of the 32-byte code or data cache lines in the PowerPC chips of the day.

Let me give you such a diagram for the ARM Cortex A8 CPUs that are used by the iPhone 4, the first-generation iPad, and the third and - I think - fourth generation iPod Touch. The Cortex A8 has 64 bytes in each cache line. You might think that's a good thing, but it isn't if your memory access patterns aren't harmonious with the Cortex's cache design: if you read or write so much as one byte in a cache line without then using the remaining 63 bytes somehow, you are wasting the user's time and draining their battery in a way that you should not have to.
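To make that concrete, here is a small sketch in plain C - the grid size and fill pattern are arbitrary, and this is not Warp Life's actual update loop - of two ways to walk the same one-megabyte grid. The row-major walk uses all 64 bytes of every cache line it pulls in; the column-major walk does exactly the same arithmetic but strides a kilobyte between accesses, so it drags in a fresh line for nearly every byte it touches:

#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define ROWS   1024
#define COLS   1024
#define PASSES 100

static uint8_t grid[ROWS][COLS];   /* 1 MB, far larger than a 32 KB L1 cache */

/* Walks the grid in the order it is laid out in memory, so every byte of
 * each 64-byte cache line that gets pulled in is actually used. */
static uint64_t SumRowMajor(void)
{
    uint64_t sum = 0;
    for (int r = 0; r < ROWS; ++r)
        for (int c = 0; c < COLS; ++c)
            sum += grid[r][c];
    return sum;
}

/* Same arithmetic, but strides 1024 bytes between accesses, so nearly every
 * access drags in a fresh cache line and uses only one byte of it. */
static uint64_t SumColumnMajor(void)
{
    uint64_t sum = 0;
    for (int c = 0; c < COLS; ++c)
        for (int r = 0; r < ROWS; ++r)
            sum += grid[r][c];
    return sum;
}

int main(void)
{
    for (int r = 0; r < ROWS; ++r)
        for (int c = 0; c < COLS; ++c)
            grid[r][c] = (uint8_t)(r ^ c);

    uint64_t sum = 0;
    clock_t t0 = clock();
    for (int p = 0; p < PASSES; ++p)
        sum += SumRowMajor();
    clock_t t1 = clock();
    for (int p = 0; p < PASSES; ++p)
        sum += SumColumnMajor();
    clock_t t2 = clock();

    printf("row-major:    %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("column-major: %.3f s\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
    printf("(checksum %llu)\n", (unsigned long long)sum);
    return 0;
}

The second loop should take noticeably longer than the first even though both compute exactly the same sum, and that difference is pure waste of the user's time and battery.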
The ARM Holdings company doesn't manufacture chips itself; it designs them, then licenses the designs to other companies, who use the "IP" or "cores" in what are usually more complex chips. In the case of the first-generation iPad and the iPhone 4, while the design is based on the Cortex A8, the proper name for the chip is the Apple A4, and the A4 differs in some respects from other Cortex A8 implementations. From Wikipedia:

http://en.wikipedia.org/wiki/Apple_A4

... the Apple A4 has a 32 KB L1 code cache and a 32 KB L1 data cache. At 64 bytes per cache line, that gives us just 512 cache lines each for code and for data. Thus, rather than assuming that programmer time is too valuable to take pride in your work, you should be assuming that your code and your data each have to make the very best possible use of 512 64-byte cache lines.

Here is a graphical diagram of how many cache lines are available in each cache:

* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *

That's actually quite a limited resource.

I didn't have much of a clue about Objective-C, Cocoa or Cocoa Touch when I first started writing Warp Life; I learned about them as I went along. Much of my work has focused on making Warp Life run faster and use less memory, but because I didn't really know what I was doing when I started coding, my early implementations were quite half-baked.

A few nights ago I decided to put quite a lot of time and effort into refactoring the code so as to reduce my executable size. I know a lot of algorithmic improvements that will speed it up dramatically, but I'm putting those off until the refactoring is complete. At the start of the refactoring, Warp Life's executable was about 395 KB. I've now done most of the refactoring that I can, and the executable is about 360 KB. That's roughly a nine percent reduction in code size for about twelve hours of work. I assert that is a good use of my time.
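Here is one more small sketch of what I mean by making the best use of those 512 data cache lines. The struct and its fields are hypothetical - this is not Warp Life's actual cell representation - but the point stands: order your fields carelessly and the compiler pads them out, and those padding bytes squander cache lines:

#include <stdio.h>
#include <stdint.h>

struct CellPadded {          /* fields ordered carelessly            */
    uint8_t  alive;
    uint32_t age;
    uint8_t  neighborCount;
    uint32_t color;
};                           /* typically 16 bytes once padded       */

struct CellPacked {          /* same fields, widest first            */
    uint32_t age;
    uint32_t color;
    uint8_t  alive;
    uint8_t  neighborCount;
};                           /* typically 12 bytes                   */

int main(void)
{
    printf("padded: %zu bytes, %zu cells per 64-byte line\n",
           sizeof(struct CellPadded), (size_t)(64 / sizeof(struct CellPadded)));
    printf("packed: %zu bytes, %zu cells per 64-byte line\n",
           sizeof(struct CellPacked), (size_t)(64 / sizeof(struct CellPacked)));
    return 0;
}

Four cells per line versus five doesn't sound like much, but multiplied across a whole grid it is a meaningful number of extra cache line fills on every single generation. The same thinking applies to choosing a uint8_t over an int when a byte will do, and to keeping data that is used together next to each other in memory.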
One more thing to consider: unless you use some kind of profile-directed optimization, or you write a linker script by hand, the linker isn't very intelligent about laying out the code in your executable file. That means uncommonly-used code will quite often land in the same virtual memory page as commonly-used code. Even if that uncommon code is never called, so that it doesn't blow away your code cache, your program will use more physical memory than it would if your executable were smaller.

The simple fix is to refactor your source so that as many of your methods as possible compile to less executable code. Not as easy, but far more effective, is to create a linker script that places all of your less-commonly used code together at the end of your executable file, with the more commonly-used code at the beginning. With most of the rarely-used code gathered at the end of the file, large portions of your program may never be paged in from disk or Flash storage at all.

Also, despite the fact that iOS doesn't have backing store, it DOES have virtual memory. The Cortex A8 has hardware memory management, so executable code is read into memory by way of page faults: your program jumps into an unmapped memory region, the fault saves all the registers on the stack - the A8 has quite a few registers - and enters the kernel, which eventually figures out that the faulting access really is valid. A page table entry is allocated in the kernel's memory, the Flash storage page that entry refers to is read into physical memory, and the page fault exception returns, with all of those registers being restored from the stack.

If your code is smaller, and you gather the less-frequently used code all in one place, that colossal overhead of faulting in executable code won't happen as often. Your user's device will spend more of its time and battery power running your App instead of executing kernel code, the battery charge will last longer, and the kernel will allocate fewer page tables.

I Hope That Clears This All Up.

Don Quixote
-- 
Don Quixote de la Mancha
Dulcinea Technologies Corporation
Software of Elegance and Beauty
http://www.dulcineatech.com
quix...@dulcineatech.com