I'm certain that Paul has done his share of this, but hand-scheduling instruction execution was something of an art on the CDC 6600. There was at least one class for this--and probably more. The CPU could issue one instruction every cycle, provided there were no conflicts, and the 6600 had several functional units whose operation could overlap.
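For anyone who hasn't done this by hand, here is a toy sketch of the idea in Python. It is not a model of the real 6600: the unit names, register names, and latencies are illustrative assumptions, and the only rules are that issue is in order, at most one instruction issues per cycle, and an instruction waits until its functional unit is free and its source operands are ready.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Instr:
    unit: str      # functional unit label (hypothetical)
    dest: str      # destination register
    srcs: tuple    # source registers
    latency: int   # cycles the unit stays busy (illustrative numbers)

def finish_time(program):
    """Return the cycle at which the last result is available.

    Toy rules: in-order issue, one instruction per cycle at most; an
    instruction stalls until its unit is free and its sources are ready.
    """
    unit_free = {}   # unit -> first cycle it is free again
    reg_ready = {}   # register -> cycle its value becomes available
    cycle = 0        # earliest cycle the next instruction may issue
    done = 0
    for ins in program:
        start = max([cycle, unit_free.get(ins.unit, 0)] +
                    [reg_ready.get(r, 0) for r in ins.srcs])
        end = start + ins.latency
        unit_free[ins.unit] = end
        reg_ready[ins.dest] = end
        cycle = start + 1          # the next instruction issues no earlier
        done = max(done, end)
    return done

mul = Instr("mul", "X1", ("X2", "X3"), 10)
dependent_add = Instr("add", "X4", ("X1", "X5"), 4)    # needs the product
independent_add = Instr("add", "X6", ("X7", "X0"), 4)  # needs nothing from above

naive = [mul, dependent_add, independent_add]
scheduled = [mul, independent_add, dependent_add]

print(finish_time(naive), finish_time(scheduled))
```

In the first ordering the dependent add sits waiting on the multiply and holds up the independent add behind it. Moving the independent add ahead lets it run in the shadow of the multiply, which is the whole game in hand scheduling.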
But we've discussed this before... On the large vector STAR-100, operands were fetched over a 512-bit-wide memory bus (not counting error-checking bits) and streamed through pipelined vector units. The trick there was not so much scheduling scalar instructions as avoiding "bubbles" in the vector pipes. --Chuck