Hi,

I'm sorry that this is not 100% specific to gcc, but this mailing list is the last place where I think this knowledge may lie.

I have written some image processing routines in assembly language that make extensive use of MMX, and now I want to start optimizing them. However, I can't for the life of me find any documentation comparable to the Intel/AMD optimization manuals for the Pentium/Athlon/Opteron cores. I can't even find much useful information on mailing lists such as this one, where I was hoping to find it.

I'm no expert in the matter, but I do understand the concepts of instruction pairing and pipelining. I know that the Geode LX core, which is what we have for RoboCup, is non-superscalar. From what I understand, the core has two pipelines: one to the integer unit and the other to the FPU/MMX/3DNow! unit. Does this more or less mean that instruction pairing has no effect? Is it still worth scheduling instructions in a pattern, such as the 4-1-1 pattern the Intel optimization manual suggests for its cores?

I saw that gcc 4.3 added Geode support, and I'm hoping someone here has better knowledge of the subject. Can anyone give me any pointers as to what I should be trying to optimize or, better yet, links to documentation or hard benchmarks?

Thanks in advance.