On Fri, Sep 24, 2010 at 02:43:52PM +1000, Benjamin Herrenschmidt wrote:
>> The DMA is what I use in the "real world case" to get data into and out
>> of these buffers. However, I can disable the DMA completely and do only
>> the kmalloc. In this case I still see the same poor performance. My
>> prefetching is part of my algo using the dcbt instructions. I know the
>> instructions are effective b/c without them the algo is much less
>> performant. So yes, my prefetches are explicit.
>
> Could be some "effect" of the cache structure, L2 cache, cache geometry
> (number of ways etc...). You might be able to alleviate that by changing
> the "stride" of your prefetch.
>
> Unfortunately, I'm not familiar enough with the 440 microarchitecture
> and its caches to be able to help you much here.
Also, doesn't kmalloc have a limit on the size of the request it will let
you allocate? I know in the distant past you could allocate 128K with
kmalloc, and 2M with an explicit call to get_free_pages; anything larger
than that had to use vmalloc. The limit might well be higher now, but a
4MB kmalloc buffer sounds very large given that it has to be physically
contiguous pages, and two of them even less likely to succeed.

>> Ok, I will give that a try ... in addition, is there an easy way to use
>> any sort of gprof-like tool to see the system performance? What about
>> looking at the 44x performance counters in some meaningful way? All
>> the experiments point to the fetching being slower in the full program
>> as opposed to the algo in a testbench, so I want to determine what it is
>> that could cause that.
>
> Does it have any useful performance counters? I didn't think it did but
> I may be mistaken.

No, it doesn't.

josh

_______________________________________________
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev
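The kmalloc-vs-vmalloc point above can be sketched as a kernel code fragment. This is an illustrative sketch only (it cannot run outside a kernel module), and it assumes a kernel new enough to have the `kvmalloc`/`kvfree` helpers, which postdate this 2010 thread; the buffer size and function names are hypothetical.

```c
#include <linux/slab.h>
#include <linux/vmalloc.h>
#include <linux/mm.h>

#define BUF_SIZE (4 * 1024 * 1024)	/* 4MB: a large ask for kmalloc */

static void *alloc_big_buffer(void)
{
	/* kmalloc needs physically contiguous pages, so a 4MB request can
	 * easily fail once memory is fragmented.  kvmalloc() tries the
	 * contiguous path first and falls back to vmalloc(), which only
	 * needs virtually contiguous pages (but cannot be handed straight
	 * to a DMA engine expecting a contiguous physical range). */
	return kvmalloc(BUF_SIZE, GFP_KERNEL);
}

static void free_big_buffer(void *buf)
{
	kvfree(buf);	/* correct for either backing allocator */
}
```

For a buffer that must also be DMA'd, as in the original problem, the usual route is `dma_alloc_coherent()` or a scatter-gather mapping rather than a bare kmalloc of this size.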