I had a conversation with Cadence at ARM TechCon, and one of the things they're talking about is what they call clock/data co-optimization, an alternative to the traditional approach to optimizing synchronous logic.
Typically, designers work hard to make sure each logic stage in the overall pipeline can do its work within one clock period, finishing before the next clock edge arrives. Some stages have more logic than others, and those are the ones that take extra effort to get the speed right.
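In other words, with a conventional balanced clock, every stage has to satisfy the same constraint: the launch flop's clock-to-Q delay, plus the stage's logic delay, plus the capture flop's setup time, all within one clock period. Here's a minimal sketch of that check in Python; every number in it is made up for illustration, not taken from any real design:

```python
# Classic per-stage setup check under a balanced (zero-skew) clock:
# t_clk_to_q + t_logic + t_setup must fit in one clock period.
CLK_PERIOD_NS = 1.0    # 1 GHz clock (assumed for illustration)
T_CLK_TO_Q_NS = 0.05   # launch-flop clock-to-Q delay (assumed)
T_SETUP_NS = 0.05      # capture-flop setup time (assumed)

stage_delays_ns = [0.55, 0.88, 0.42, 0.70]  # made-up logic delays

for i, d in enumerate(stage_delays_ns):
    slack = CLK_PERIOD_NS - (T_CLK_TO_Q_NS + d + T_SETUP_NS)
    status = "meets timing" if slack >= 0.0 else "FAILS timing"
    print(f"stage {i}: slack = {slack:+.2f} ns -> {status}")
```

The clock period is set by the slowest stage, so all the slack sitting in the faster stages simply goes unused.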
Meanwhile, someone spends a lot of effort synthesizing a clock tree that's balanced and homogeneous, delivering a strobe to every register at exactly the same time.
But Cadence's point is, it doesn't have to be that way. What if you borrowed time from the easy logic stages to cut the hard logic stages more slack? Shift the clock a little at the registers around a hard stage, arriving early at the register that launches into it (or late at the one that captures its result), and that stage gets more than one period to finish. And because the clock tree no longer has to be perfectly balanced, you might be able to use smaller or fewer buffers, reducing logic and power.
This means that different stages may get different timing budgets and, most unconventionally, that clock edges may arrive at different times in different places. (Which actually isn't such a bad thing from an EMI standpoint: staggered edges spread the switching current out in time instead of piling it all onto one instant.)
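To make the borrowing concrete, here's a toy sketch in Python for a simple chain of registers. It delays the clock at any register that captures a slow stage's output, so that stage gets more than one period while the stage after it gives some time back. Everything here is an illustration I've made up (the delays, the lumped flop overhead, the greedy one-pass scheduling); a real flow treats this as a global optimization over the whole netlist and checks hold timing as well:

```python
# Toy "useful skew" scheduler for a chain of registers.
# Instead of forcing every register's clock edge to t = 0, give each
# register an offset so that slow stages borrow time from fast ones.
CLK_PERIOD_NS = 1.0
T_OVERHEAD_NS = 0.10  # clock-to-Q + setup, lumped together (assumed)

stage_delays_ns = [0.55, 1.15, 0.42, 0.70]  # stage 1 alone misses 1.0 ns

offsets = [0.0]  # clock-arrival offset at each register; reg 0 at t = 0
for d in stage_delays_ns:
    # Setup constraint across one stage:
    #   launch_offset + overhead + delay <= capture_offset + period
    needed = offsets[-1] + T_OVERHEAD_NS + d - CLK_PERIOD_NS
    # Only push clocks later, never earlier; a real tool can do both,
    # and must also re-check hold (min-delay) at every skewed register.
    offsets.append(max(0.0, needed))

for i, off in enumerate(offsets):
    print(f"register {i}: clock edge arrives at t = {off:.2f} ns")
```

In this toy run the slow stage ends up with 1.25 ns of usable time while the stage after it gets only 0.75 ns, and both meet timing at the original 1 ns period, with no register clocked more than 0.25 ns off nominal.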
By doing this, they found that a particular ARM Cortex-A9 design gained 40 MHz of clock speed while lowering dynamic power by 10.4% and clock-tree area (and hence leakage) by 31%. A complete win-win.