On Sun, 2004-10-24 at 10:46 +0200, Jan Kesten wrote:
> Hi all!
>
> |> I'm not sure but what do you mean when you say "Linux doesn't see
> |> the CPU cache."
>
> I think he meant that the CPU cache isn't directly accessible for linux
> as for any other OS. It is inside the chip and there it stores recently
> accessed memory pages (and their addresses) to speed up following
> accessed to that memory page (indeed since cpu cache runs at cpu clock
> speed and is sram it is much much more faster than fetching a memory
> page from main memory or even from a disk).
CPU cache holds data or a single instruction. It doesn't contain memory
addresses, because that would still mean we have to fetch the actual
data. A "page" is a concept that the OS uses; a CPU has no concept of
paging, though set associative cache acts much like paging in the OS on
a conceptual level.

> |> On my Intel Centrino laptop the /proc/cpuinfo file does point out
> |> that the processor has a 2mb capacity of L2 Cache.
>
> Right, it also will have some layer 1 cache. Both are on chip, if a
> requested memory page isn't in layer 1 cache at next the cpu (not a OS)
> will look into layer 2 cache. 2MB doesn't seem very large in view of
> sizes of main memory - but often programms have some variables (data)
> that is read/written quite often (a counter in a loop for a simple
> example) and these variables could fit into one memory page. So if this
> page is in the cache, while running the cpu doesn't need to fetch pages
> from memory too often.

This is referred to as "temporal locality", the idea that if an item is
accessed once, chances are it will be accessed again. Although CPU cache
does consider temporal locality, it leans more strongly (at the CPU
level) on "spatial locality": the idea that if we access some data in
memory, chances are we will need the surrounding data. Spatial locality
is so predominant in CPU caches because of the linear, sequential
fashion in which instructions are arranged.
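To make the two ideas concrete, here is a rough, untested sketch in C
(sum_array() is just a made-up example, not from any real code base):

    #include <stddef.h>

    /* Sum an array of ints - nothing clever, but it shows both kinds
     * of locality the cache is built around. */
    long sum_array(const int *buf, size_t len)
    {
        long sum = 0;
        size_t i;

        /* "sum" and "i" are touched on every iteration: temporal
         * locality, so they stay in registers or a hot cache line. */
        for (i = 0; i < len; i++) {
            /* buf is walked sequentially: spatial locality, so one
             * cache line fetched from memory serves several
             * iterations. */
            sum += buf[i];
        }
        return sum;
    }

The instruction stream for the loop body behaves the same way: the next
instruction is almost always the next one in memory, which is exactly
what the prefetched cache line assumed.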
The CPU has three main stages: fetch, decode, and execute. The fetch
cycle is expensive because it needs to access external memory. Spatial
locality tells us that if we are going out for an instruction we might
as well bring back the next few and store them on the chip (L1 cache).
Now all would be perfect except for those darn branch instructions
(if-then-else). Branching causes cache misses because we suddenly jump
to some arbitrary instruction we were not anticipating. This may not
seem like a big deal, but it's a fundamental computer science problem
that computer engineers have spent years trying to solve.

Things are a bit more complicated than just a simple cache miss. If we
were to operate on an instruction from start to finish, we would first
fetch it, then decode it, and finally execute it, then move to the next
instruction and repeat. Newer processors instead use what is known as
"pipelining": we work on instructions as if we had an assembly line. As
soon as we fetch an instruction we pass it on to the next stage and go
fetch another. The reason has to do with clock timing. A complex
instruction may take 10 or 12 clock cycles to complete; one
micro-operation (micro-instruction) takes one clock cycle, and an
instruction (a push or a pop, for example) may take several micro-ops
to complete. If it takes that many clock cycles to complete an
instruction, then you can see that, on average, we probably complete an
instruction only about every 10th clock cycle. With a pipeline we
actually complete an instruction every clock cycle.

This may seem odd, but picture a 100-foot hose. Let's say we need to
fill a glass using this hose. You stand at the faucet and I stand at
the glass. I shout to you to turn the water on, and when the glass is
full I shout for you to turn the water off. I go get another glass and
we repeat the process until 100 glasses have been filled. That seems
silly, though. Where is all our time being lost? It is lost in moving
the water from the faucet to the glass (filling the hose). It may take
a few seconds until water from the faucet is actually leaving the other
end of the hose. A pipeline solves this type of problem. Once we have
the pipeline filled, we are theoretically completing an instruction on
every clock edge. As one instruction comes into the front of the pipe,
an instruction is pushed out the back, and anything coming out the back
is a completed instruction. Just think of an assembly line in some
factory.

Now back to our cache problem. We can see that initially filling the
pipeline is the most expensive part. Once it's full, we are in good
shape. In order to fill it we are actually relying on CPU cache, or
spatial locality, to assume that the next instruction in is the one
following the last. When a branch instruction causes a cache miss, it
means we are most likely going to have to flush our pipeline, because
what we assumed to be the next sequence of instructions has changed.
This is a major problem. The P4 has a very deep pipeline (around 31
stages on the newest cores), so a mispredicted branch is especially
expensive, which is why it puts so much effort into predicting which
way a branch will go before it knows for sure. Newer versions of GCC
support branch prediction hinting to tell the CPU that a branch may be
coming (see the example at the end of this mail).

L1 cache is relatively small because it is on the CPU. L2 cache is
cheaper and larger because it sits further from the core (on older
designs it was literally a separate chip). The general flow of a cache
fill is that we fill L2 cache and then, from that, fill L1 with a
smaller subset.

Most modern CPUs don't use a direct mapped cache. Direct mapped means
that each cache lookup points to a single piece of data. The Pentium
based processors use a set associative cache, which means that one
cache mapping points to several locations, so when we do the cache
lookup we see a small set of candidate entries. This may seem odd, but
it works in the same fashion as paging in the OS. Think about how we
map page tables: one entry in the table points to an entire page (4k of
data), yet we are in search of a single piece of data. Set associative
cache mapping works just like page table mapping, where a certain
number of bits are used as an index into the table and then more bits
are used to index into the page itself, and so on...

Lastly you have a form of address mapping cache called a "Translation
Lookaside Buffer" or TLB. The TLB caches virtual to physical address
mappings so that the MMU (memory management unit) on the CPU doesn't
have to keep walking the page tables to perform this translation.
Remember that the OS deals with virtual addressing, so when it hands
the CPU a virtual address, the CPU must convert it into a physical
address. The TLB caches these translations for fast reference.

I've heard a few of you state that the OS doesn't see these types of
caches. Although it has no direct control over them, it definitely
considers them and the value they offer. Things like processor affinity
exist to reduce cache misses (the CPUs in an SMP machine do not
necessarily share their caches). The compiler also plays a major part
in CPU cache: most optimizations exist to increase cache performance on
the CPU.

This is probably more than most of you care to know about CPU cache,
but I wanted to clear up some common misconceptions as well as test my
retention. My computer architecture professor would be proud ;-)
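For anyone curious, the branch hinting mentioned above is exposed by
GCC as __builtin_expect(). A minimal, untested sketch (read_byte() is
my own made-up example; the likely/unlikely macros mirror the ones the
Linux kernel uses):

    #include <unistd.h>

    /* Tell GCC which way a branch usually goes, so it can lay out the
     * common path as straight-line code and keep the instruction
     * cache and pipeline fed with instructions we will actually
     * execute. */
    #define likely(x)   __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

    int read_byte(int fd, unsigned char *out)
    {
        ssize_t n = read(fd, out, 1);

        if (unlikely(n < 0))    /* error path is cold; GCC moves it
                                   out of the fall-through path */
            return -1;

        return (int)n;          /* 1 on success, 0 on end of file */
    }

Compiled with optimization, GCC arranges the common (n >= 0) case as
the fall-through, so the prefetched cache lines hold the instructions
we actually end up executing.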
--
Eric Gaumer <[EMAIL PROTECTED]>