
Municipal Clock Design

Picture yourself living in a big city with lots of traffic. That could be anywhere in the world. Now picture that city with a robust subway/rail system (in other words, not buses that also have to contend with traffic). Admittedly, that narrows things down (to places mostly outside the US, but never mind… work with me here).

In this city, you have a choice. When you want to go from your home to your work (both within the city), you could drive the entire distance. Or, if you were lucky, you could take public transit the entire distance and not use your car at all. Perhaps you wouldn’t even need to own a car.

Now… for the sake of this discussion, let’s assume we’re working with a transit system that has an excellent on-time record. I know, none of them are perfect, but let’s say that the variation in your train trip duration rounds to zero – it always takes exactly the same time. We can then make the following observation: due to the vagaries of traffic and random variations in behind-the-wheel idiocy, driving all the way to work gives the most uncertainty on when you’ll get there – it has the highest arrival time variation or trip-to-trip skew; the train trip has the lowest (zero in this case).

Granted, the logistics of using your car are easiest; for the train, there’s more to do to get there: get a ticket, wait for it to arrive, etc. So the car has the best ease of use, but is the least predictable; the train is more work, but is the most predictable.

Now let’s say that you work somewhere in a plant located outside the edge of town, out of the reach of the transit system, and you live in the heart of town. Then you might have a beater car that you park at the train station nearest your work. You take the train that far, and you then drive the remaining distance. You’ve now put some unpredictability back into your arrival time due to the drive. But because the drive is so much shorter than driving all the way from home, the variation in your trip times is likely to be much smaller.

And if, for any reason, you decided not to park at the closest-to-work station, but rather at some farther-away station (perhaps it’s near your favorite pub or grocery store), then your drive would be longer and less predictable – but still shorter and more predictable than driving all the way from home.

The point here is that the shorter the drive portion of the trip, the more predictable your arrival time will be. Which relates rather nicely to clock design on large ICs. Really. Work with me here.

The easiest way to get your clocks where you need them with the desired timing is good ol’, tried-and-true, push-button clock tree synthesis (CTS). Each clock load gets a custom-crafted signal path that originates at the clock source. It’s like driving to work: you get the best control, the easiest methodology, and it’s what most people do.

There’s just one issue: on-chip variation at today’s aggressive dimensions is a tremendous consideration, which makes it really hard to manage the fact that on any given die, one side may be faster or slower than the other (to oversimplify things). And you don’t know which side and you don’t know if it’s faster or slower. It’s like trying to drive all the way to work and figuring out what time you should leave so that you’ll never arrive at work late and never arrive too ridiculously early (which would make you look too eager… no one wants to be that guy…). If the traffic varies too much from day to day, there’s really no way to do that. The only way you would be able to manage your arrival time would be to look at a traffic map before leaving home so you could time your departure (and hope things didn’t change too much after you left). Unfortunately, there are no on-line traffic maps on an IC, so that option isn’t available to clock designers.

The other option, then, is to use transit. The IC version of this has been so-called “clock mesh” design. So-called because it involves a large, fine-grained clock mesh. The idea is that you have many sources of a given clock all around the die, and you hook them all onto the mesh. You are literally shorting the outputs of all of these drivers together, so if one is a little slower than another, the faster one starts yanking on the line earlier, which compensates for the guy that’s slow. In other words, by shorting all these drivers, their delay variations are sort of averaged out.
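
If you want to see that averaging effect in numbers, here’s a toy sketch. The 100 ps nominal delay and 10 ps per-driver variation are made-up figures, and simple averaging stands in for the actual electrical behavior of shorted drivers; the point is only that the effective variation shrinks roughly as one over the square root of the number of drivers.

```python
import random
import statistics

# Toy model: each clock driver has a nominal delay plus random on-chip variation.
# Shorting N drivers onto the mesh is crudely approximated as averaging their
# delays; the numbers below are illustrative assumptions, not process data.
NOMINAL_PS = 100.0   # assumed nominal driver delay, picoseconds
SIGMA_PS = 10.0      # assumed per-driver random variation, picoseconds

def effective_delay(n_drivers: int) -> float:
    delays = [random.gauss(NOMINAL_PS, SIGMA_PS) for _ in range(n_drivers)]
    return sum(delays) / len(delays)

for n in (1, 4, 16, 64):
    samples = [effective_delay(n) for _ in range(10_000)]
    print(f"{n:3d} drivers: effective-delay sigma ~ {statistics.stdev(samples):.2f} ps")
```

In this crude model, shorting 16 drivers cuts the variation to about a quarter of what a single driver would give you.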

In that manner, all of the paths from the clock sources to the mesh end up having pretty much the same effective delay. It’s like the transit system – low skew. And if you’re lucky enough to be able to drive your clock loads directly from the mesh, then you have effectively zero skew, like a door-to-door transit system provides.

But, of course, just as none of us can reasonably expect a complete door-to-door transit system, you also can’t drive the mesh to absolutely everywhere you need a clock signal. So you get it as close as possible – the tight mesh gives you that – and then you have a bit more logic that gets you from the mesh to your actual load. You get one or two levels of logic where you can, for instance, do some gating. This is like taking the train to the outer station and then driving the last little bit.

Because the source-to-mesh skew is roughly zero, you really have to worry only about the skew of that last couple of logic levels, and that’s manageable. So it sounds like an easy solution. But, of course, as with all things engineering, there are tradeoffs.

First off, in the same way that driving is easier than using transit, CTS is easier than designing a clock mesh. It’s push-button. Clock meshes are harder to design and analyze, and they have to have a nice, clean, uninterrupted area. If your chip has a circuit – perhaps it’s some hard IP you purchased whose layout you can’t control – that uses the same metal layers as the mesh, now you’ve blocked the mesh and screwed things up.

The upshot is that, as Synopsys tells it, clock meshes aren’t used very often: they tend to be limited to the processor areas on SoCs, and they’re created by well-trained, dedicated teams. Other companies are reluctant to go that route because of the risk inherent in changing to an approach that’s significantly harder than CTS.

There’s another problem: if a transit system has too many stations – if it tries to deliver you too close to any possible destination – then the train will barely get underway from one station before it has to start slowing down for the next. The train spends practically no time actually traveling; it’s constantly starting and stopping and waiting for folks to get on and off. And we all know that starting and stopping frequently is more work – takes more energy – than simply gliding along the rail. So a train system with closely packed stations will use more power than one with widely spaced stations. To a first approximation, anyway. Work with me here… (Will it use less than if everyone takes their car? Probably not… work with me here…)

Albeit for different reasons, the clock mesh approach also uses much more power than a CTS approach. It’s because of those big metal mesh lines. Unlike CTS, where only the lines going from point A to point B switch, here we have parts of the mesh that aren’t near point A or point B – and they’re still switching. Bottom line, lots of metal, all of it switching: higher power.
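
A quick back-of-the-envelope makes the point. Dynamic power goes as P = αCV²f, so more switched capacitance means more power. The capacitance figures in this sketch are pure assumptions chosen only to illustrate the trend, not numbers from any real design.

```python
# Back-of-the-envelope dynamic power: P = alpha * C * V^2 * f.
# All figures below are illustrative assumptions; the structural point is only
# that a mesh toggles far more metal than a tree, so its switched C is larger.
ALPHA = 1.0          # a clock net switches every cycle
VDD = 0.9            # assumed supply voltage, volts
FREQ = 1.0e9         # assumed 1 GHz clock

def clock_power_watts(switched_cap_farads: float) -> float:
    return ALPHA * switched_cap_farads * VDD ** 2 * FREQ

tree_cap = 50e-12    # assumed total switched capacitance of a clock tree: 50 pF
mesh_cap = 120e-12   # assumed mesh plus drivers: 120 pF

print(f"tree: {clock_power_watts(tree_cap) * 1e3:.1f} mW")
print(f"mesh: {clock_power_watts(mesh_cap) * 1e3:.1f} mW")
```

The exact ratio depends entirely on the design; the takeaway is simply that the mesh keeps a lot of extra metal toggling every cycle.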

The compromise here is to make the mesh less fine: coarsen the granularity. This is like designing a transit system with more space between stations. You consume less power, but you also end up getting dropped off farther from your destination. Synopsys calls this multi-source CTS. You still end up designing a clock tree between the mesh and the loads – 8-9 levels of logic rather than the 1 or 2 of a full mesh – but now the source for that tree isn’t a single clock source (as it would be in standard CTS); it’s the combination of all of the sources that drive the mesh.
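
To get a feel for why a deeper local tree costs some predictability, here’s a crude model in which each buffer stage contributes independent random variation, so the local tree’s skew contribution grows roughly with the square root of its depth. The 3 ps per-stage figure is an invented assumption, not process data.

```python
import math

# Crude skew model: each buffer level in the local tree adds independent random
# delay variation of SIGMA_STAGE_PS, so the tree's contribution grows roughly
# with sqrt(depth). The per-stage figure is an assumption for illustration only.
SIGMA_STAGE_PS = 3.0

def local_tree_sigma_ps(levels: int) -> float:
    return SIGMA_STAGE_PS * math.sqrt(levels)

for levels, label in [(2, "full mesh, 1-2 local levels"),
                      (9, "multi-source CTS, 8-9 local levels"),
                      (20, "deep conventional tree")]:
    print(f"{label:32s}: ~{local_tree_sigma_ps(levels):.1f} ps of random skew")
```

Nine levels is noticeably worse than one or two, but still far better than letting variation pile up across an entire chip-spanning tree.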

Doing this helps with the power problem while compromising a little on the skew. The thinking is that even nine levels of logic is still manageable. That aside, there’s still the potential problem of ease of use. That’s where Synopsys’s recent announcement comes in: as part of its 20-nm support, the company has introduced a more automated multi-source CTS methodology in IC Compiler, intended to make the flow feel more like plain CTS when it comes to ease of use.

One of the issues they have to solve is, what’s the best place to tap off of the mesh to drive a local tree? And where do you run the local clock tree from that tap? To automate this design issue, they have a “clustering” algorithm that finds clusters of the circuit that are naturally near each other and need the same clock. Those clusters will often reflect the design hierarchy, and the tool will pay attention to the hierarchy in its deliberations, but it’s not tightly bound to that – it can bring in or leave out pieces of the circuit that cross block boundaries.
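
Synopsys hasn’t published how that clustering works, so purely to illustrate the idea, here’s a sketch that groups hypothetical clock-sink locations with plain k-means so that each cluster could be served from a single mesh tap. The sink coordinates, tap count, and the algorithm itself are all assumptions for illustration, not the tool’s actual method.

```python
import random

# Hypothetical illustration only: group clock-sink locations into clusters so
# that each cluster could be served from a single mesh tap. This is ordinary
# k-means on (x, y) coordinates, NOT Synopsys's actual (unpublished) algorithm.
def kmeans(points, k, iterations=20):
    centers = random.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: (p[0] - centers[c][0]) ** 2 + (p[1] - centers[c][1]) ** 2)
            clusters[nearest].append(p)
        centers = [(sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c)) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# 200 random sink locations on a notional 1 mm x 1 mm die (illustrative)
sinks = [(random.random(), random.random()) for _ in range(200)]
taps, groups = kmeans(sinks, k=8)
for tap, group in zip(taps, groups):
    print(f"tap at ({tap[0]:.2f}, {tap[1]:.2f}) drives {len(group)} sinks")
```

A real tool would also weigh load and fanout, honor (or deliberately cross) hierarchy boundaries, and bound the tap-to-sink wire length, as described above.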

The result is that you now have three clock design choices:

  • CTS, which will give you the lowest power and is the easiest to do but has the least ability to deal with on-chip variation;
  • Multi-source CTS, which tolerates on-chip variation better than CTS but uses 10-20% (preliminary numbers) more power than CTS and is a bit more work to implement; and
  • a full clock mesh, which has the best on-chip variation tolerance but uses 20-40% more power than CTS and is harder to implement than either CTS or multi-source CTS.

So don’t sell your car quite yet. You may decide you want it just to make that last leg between the station and your work a little bit more flexible.

Image: Bryon Moyer
