Aggressive Power and Performance Optimizations in Next-Generation DSPs

by Octasic (http://www.octasic.com)

 

The ever-growing demand for rich multimedia signal processing in mobile devices raises a chronic technology challenge: squeezing higher functionality and performance into increasingly tight power and space constraints. As a result, power-performance metrics are now a central concern in DSP design. New methods enable designers to address the main areas of power consumption (leakage power, clock trees, logic transitions, and power grids) and to significantly improve performance compared with conventional techniques.

 

In today's CMOS technology, power is consumed in two basic ways: statically and dynamically. Static power is consumed through various leakage mechanisms, while dynamic power is attributable to logic and interface operations.

 

Leakage Power

 

For pure standby operation, leakage current may be the only power draw. In general, however, an ideal, application-dependent static-to-dynamic power ratio exists and should be attained as a compromise. To achieve that ideal ratio, a combination of design techniques can be applied to limit the leakage power to a given value. Such techniques include using more conservative CMOS processes (e.g. opting for a 0.13µm process over a 90nm process), using lower-leakage transistors, and/or using circuit techniques that dynamically shut down the power of entire circuit sections on a duty-cycle basis or when not in use.

 

Although this may seem counterintuitive at first, to optimize leakage power it is often best to select a higher-leakage process and then to limit the overall leakage by circuit design. For example, a lower-leakage process may use high-threshold transistors (e.g. VT = 0.4V for HVT), while a higher-leakage process may use lower-threshold transistors (e.g. VT = 0.3V for SVT). The higher-leakage process could draw up to 10 times more leakage current than the low-leakage process, yet deliver the same performance at a lower supply voltage, which in turn reduces dynamic power.

 

Let us consider an example. A design using an HVT process operates with a supply voltage of 0.8V. Operating at maximum capacity (100% duty cycle), it consumes 5mW of leakage power and 1W of dynamic power, for a total of 1.005W. A similar design using an SVT process (operating at 0.7V) delivers exactly the same performance and draws 10 times more leakage power (50mW), but its lower supply voltage cuts dynamic power to 0.76W. The higher-leakage SVT design therefore consumes only 810mW in total, a power saving of roughly 20%.

 

Power Consumption    HVT process (supply = 0.8V)    SVT process (supply = 0.7V)
Leakage Power        0.005W                         0.05W
Dynamic Power        1W                             0.76W
Total Power          1.005W                         0.810W
Performance          Both processes deliver the same performance.

 

Table 1 – Comparison of Power Consumption of HVT and SVT processes
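
The arithmetic behind Table 1 can be checked with a short sketch. It assumes that dynamic power scales with the square of the supply voltage at a fixed clock frequency, which is the usual CMOS approximation and not a rule stated in the article itself. The function name is hypothetical.

```python
def scaled_dynamic_power(p_dyn_ref, v_ref, v_new):
    """Dynamic power after a supply change.

    Assumption: P_dyn is proportional to V^2 at a fixed clock frequency.
    """
    return p_dyn_ref * (v_new / v_ref) ** 2


# HVT design: figures taken from the example (watts).
hvt_leak, hvt_dyn = 0.005, 1.0
hvt_total = hvt_leak + hvt_dyn

# SVT design: ten times the leakage, supply lowered from 0.8V to 0.7V.
svt_leak = 10 * hvt_leak
svt_dyn = scaled_dynamic_power(hvt_dyn, v_ref=0.8, v_new=0.7)
svt_total = svt_leak + svt_dyn

saving = 1 - svt_total / hvt_total
print(f"HVT total: {hvt_total:.3f} W")
print(f"SVT total: {svt_total:.3f} W")
print(f"Saving:    {saving:.1%}")
```

Under this assumption the SVT dynamic power comes out slightly above the 0.76W listed in Table 1, so the computed saving is a little under 20%; the conclusion is the same either way.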

 

Power can be reduced even further when the SVT circuit does not need to run at full capacity. The circuit can be powered down during the inactive portions of its duty cycle, thereby completely eliminating leakage current during those periods.
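
As a rough illustration of the duty-cycle effect, the sketch below compares the SVT design's average power with and without power-down. It is a simplified model: it assumes dynamic power is drawn only while the circuit is active and it ignores any wake-up or state-retention overhead. The 25% duty cycle and the function name are hypothetical.

```python
def average_power(p_dynamic, p_leakage, duty_cycle, power_gated):
    """Average power in watts over a full duty-cycle period.

    Assumptions: dynamic power is drawn only while active; when the block
    is power gated, leakage is also drawn only while active.
    """
    active = duty_cycle * (p_dynamic + p_leakage)
    idle_leakage = 0.0 if power_gated else (1 - duty_cycle) * p_leakage
    return active + idle_leakage


# SVT figures from Table 1, with a hypothetical 25% duty cycle.
P_DYN, P_LEAK, DUTY = 0.76, 0.05, 0.25

always_on = average_power(P_DYN, P_LEAK, DUTY, power_gated=False)
gated = average_power(P_DYN, P_LEAK, DUTY, power_gated=True)

print(f"always powered: {always_on * 1e3:.1f} mW")
print(f"power gated:    {gated * 1e3:.1f} mW")
```

The lower the duty cycle, the larger the share of the total that leakage represents, and the more power-down matters.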

 

Dynamic Power

 

Dynamic power-performance metrics vary based on many factors including the type of processes and algorithms run, the DSP architecture and instruction sets used, as well as the way memory is partitioned. Inside a chip, however, dynamic power is generally consumed by three main processes: clock trees, logic transitions, and power grid losses. So power-performance metrics can be substantially improved by aggressively optimizing each of these three power consumers.

 

Power Grid Losses

Power grid (IR) losses can easily be reduced to an insignificant value by using a substantial power distribution mesh. By analogy, circuit board designs have long used dedicated power plane layers for impedance control, shielding, and minimizing both emissions and susceptibility to them (crosstalk), as well as for power distribution.

 

In smaller-geometry chips (90nm and below), crosstalk has become such a prevalent and difficult issue that some advanced designs shield signal lines by interleaving power and signal traces on each metal layer. Because of the sheer size of the resulting power mesh, power grid losses become insignificant compared with those due to clock trees and logic transitions.

 

Clock Tree Losses

Every time a flip-flop is clocked some energy is spent in the flip-flop operation itself and in charging and discharging the (often massive) clock trees that span modern chips. The power consumption in clock trees can be minimized via a combination of increasingly sophisticated techniques, including the use of:

 

·         Individually clock-enabled flip-flops to restrict flip-flop operation to the times when clocking is absolutely necessary. 

·         Gated clock trees to dynamically prevent clocking entire circuit sections when not in use.

·         Multi-cycle path design to reduce the number of flip-flops in circuits as well as the frequency at which they are triggered.

·         Asynchronous computational circuitry whenever architecturally feasible.

 

For example, a typical power-hungry DSP Sum-Of-Products operation can be implemented in a cascaded asynchronous circuit (without interspersed flip-flops), rather than in a synchronous feedback circuit. (Traditional synchronous circuits typically have lots of flip-flops clocked very frequently.) Like the multi-cycle path, this approach substantially reduces the number of flip-flops used and the frequency at which they are triggered.

 

·         Minimizing the size of flip-flops and the size of circuits to have physically smaller clock trees requiring smaller drive buffers.

·         Reducing the voltage level of a clock tree. (This is coupled with the voltage level technique used with logic operation.)

 

The voltage level and size minimizing techniques are discussed further in the next section on logic transition.
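
To give a feel for how gating and voltage reduction combine, here is a minimal sketch of clock-tree switching power. It assumes the standard first-order model in which the tree's capacitance is charged and discharged once per clock cycle (P = C × V² × f) and that gating removes switching in proportion to the part of the tree that is disabled. The capacitance, frequency, and enabled fraction are hypothetical values chosen only for illustration.

```python
def clock_tree_power(c_clock, v_clock, f_clock, enabled_fraction=1.0):
    """Average clock-tree switching power in watts.

    Assumptions: P = C * V^2 * f for the whole tree, scaled linearly by
    the fraction of the tree that is actually being clocked.
    """
    return enabled_fraction * c_clock * v_clock ** 2 * f_clock


# Hypothetical figures, for illustration only.
C_CLOCK = 200e-12  # total clock-tree capacitance, farads
F_CLOCK = 1e9      # clock frequency, hertz

ungated = clock_tree_power(C_CLOCK, 1.0, F_CLOCK)
gated = clock_tree_power(C_CLOCK, 1.0, F_CLOCK, enabled_fraction=0.3)
gated_low_v = clock_tree_power(C_CLOCK, 0.8, F_CLOCK, enabled_fraction=0.3)

print(f"ungated, 1.0V:       {ungated * 1e3:.1f} mW")
print(f"30% enabled, 1.0V:   {gated * 1e3:.1f} mW")
print(f"30% enabled, 0.8V:   {gated_low_v * 1e3:.1f} mW")
```

Because the voltage term is squared, lowering the clock-tree voltage multiplies whatever gating has already saved.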

 

Logic Transition Losses

Every time a logic circuit changes state some power is consumed in charging or discharging the circuit to its new state. Here again a combination of increasingly sophisticated techniques can be applied to minimize the power consumed in logic transitions:

 

Eliminate circuits that change states uselessly. (That is, remove any circuit whose changed output is not used.) This can be done through clock gating design.

 

Reduce frequency of operation. Modern personal computer CPUs (like Pentium) are proof that frequency of operation can be pushed higher and higher at the expense of using an inordinate amount of power. Increasing a circuit's performance requires using one or more of the following techniques:

 

·                     Use more complex circuits (e.g. look-ahead vs. ripple-carry adders) which, regrettably, consume more area and power to operate;

·                     Use larger gates, buffers, and drivers to speed transitions. However, as designers know all too well, an increase in driver size and power consumption does not translate into a proportional gain in performance (i.e. speed). So extensively using bigger and stronger gates, drivers, and buffers is a losing proposition to improve power-performance metrics.

·                     Use simpler and slower circuits that operate in parallel or in a staggered multi-cycle path to deliver the same performance, at a lower power cost.

 

What's more, such circuits yield much smaller implementations. In general, even when used in parallel, their aggregate implementations are smaller than those of faster, more power-hungry circuits!

 

Compact circuitry size. This technique can deliver by far the highest return in terms of power-performance metrics, especially with smaller-geometry technologies. Although very simple in principle, it is the most difficult technique to implement for those who use conventional back-end design tools. Let us first analyze why it is so effective in improving the power-performance metrics.

 

Firstly, in today's technology, wire interconnects (charging and discharging interconnect wires) are the chip elements that consume the most power. They consume far more than gates.

 

Secondly, despite their sophistication and complexity, conventional back-end design tools are often unable to optimize placement so as to minimize routing, in contrast to human designers. This is most apparent in the regular, repetitive, and parallel circuits that are so common today and that consume the most power in data-path engines, DSPs, and CPUs. Within a very short time, a human brain can normally make out the overall picture from the details, a simple exercise that still baffles the most powerful supercomputers. As a result, a human designer can quite quickly figure out the best element placement, alignment, rotation, I/O positioning, etc., in a data-path engine or a section thereof. The designer may also readily see that slightly altering the functionality or I/O positioning of a frequently used element (cell) would slash interconnect wire lengths by a factor of two, four, or even eight.

 

Practical results reveal that in typical data-path engines, such techniques reduce average circuit wire lengths by a factor of 8 (compared to circuits implemented by conventional, state-of-the-art automatic back-end tools). Furthermore, circuit compaction stemming from strategic placement easily yields 90% silicon usage efficiency, an impressive 30-50% increase over results with automatic back-end tools. But that is not all! Over and above these gains, there is a strong compounding effect that propels the power-performance metrics even further: the gates that drive these very short wires are generally minuscule in size and power consumption. Accordingly, entire circuits are much smaller and operate faster than their automatically placed counterparts, while consuming a fraction of the power. Using this simple circuit compaction technique with 90nm technology, entire data-path engines can be pushed to run at 1.5-2GHz while consuming 10 times less power than equivalent, conventionally designed circuits!

 

 

 

Figure 1 - Advantages of Optimized Placement: Circuit Compaction and Power Reduction, up to a factor of 10. The gates are illustrated in yellow, while the space between them is shown in purple.

 

 

Reduce voltage transition swing. There are plenty of long, heavily loaded parallel buses in data-path engines, DSPs, and CPUs. These buses consume a significant amount of power while switching and are most often a drag on performance because of their heavy capacitive loading. Using transmission line technology with smaller voltage swings, like that used in high-performance memory design (with differential amplifiers, for example), can drastically improve the picture. These transmission lines operate with smaller voltage transitions, greatly reducing power consumption. More important still, these transmission lines can change states ten times faster than conventional rail-to-rail CMOS circuitry for the same power consumption, leading to much improved power-performance metrics.
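
The effect of a smaller swing on bus power can be sketched as follows. The model assumes single-ended lines charged from the main supply, so each charging event draws C × Vsupply × Vswing from the supply; it ignores the power of the receiving amplifiers and the extra wires a differential scheme would need. The bus width, line capacitance, frequency, activity, and 0.2V swing are hypothetical values.

```python
def bus_power(n_lines, c_line, v_supply, v_swing, f_clock, activity):
    """Average bus switching power in watts.

    Assumptions: each line is charged from v_supply through a swing of
    v_swing, drawing c_line * v_supply * v_swing per charging event, and
    `activity` is the fraction of cycles in which a line is charged.
    """
    return n_lines * activity * c_line * v_supply * v_swing * f_clock


# Hypothetical 64-bit bus, for illustration only.
BUS = dict(n_lines=64, c_line=2e-12, v_supply=1.0, f_clock=500e6, activity=0.25)

full_swing = bus_power(v_swing=1.0, **BUS)  # rail-to-rail CMOS
low_swing = bus_power(v_swing=0.2, **BUS)   # reduced-swing signaling

print(f"rail-to-rail: {full_swing * 1e3:.1f} mW")
print(f"0.2V swing:   {low_swing * 1e3:.1f} mW")
```

In this model the saving is linear in the swing. If the low-swing lines were fed from their own reduced supply rail instead, the saving would scale with the square of the swing.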

 

Modulate voltage operation. With the level of integration achievable today, entire systems are built on a single die. As in any complex system, not every element needs to run at full tilt. By analogy, not every transport vehicle needs to be fueled with 130 octane (aviation-grade) gasoline. It is the same in chip design. A portion of the circuit, say 10%, may be a bottleneck in the whole design that cannot be redesigned or parallelized in other ways. We would want to run this portion in overdrive. In contrast, the rest of the circuit, where the required performance is already met, could be run as frugally as possible, analogous to fueling it with diesel, whose octane rating is a mere 20! In chip design this translates into using different voltage rails for different parts of the circuit. For example, 10% of a chip's circuitry could be fed with 1.2V to run at 4GHz, another 40% with 1.0V to run at 2GHz, while the remaining 50% is fed with 0.7V to run at 500MHz. The aggregate would provide the best overall power-performance metrics achievable for that particular situation.
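
A back-of-the-envelope sketch of the three-rail example follows. It assumes that each section's dynamic power is proportional to its share of the circuitry times V² times its clock frequency (that is, equal switched capacitance per unit of circuitry and equal activity everywhere), and it ignores leakage and level-shifter overhead. The single 1.2V rail used as the comparison baseline is an assumption, not a figure from the article.

```python
# (share of circuitry, supply voltage in volts, clock in GHz) from the example.
ISLANDS = [(0.10, 1.2, 4.0), (0.40, 1.0, 2.0), (0.50, 0.7, 0.5)]


def relative_dynamic_power(islands, single_rail=None):
    """Dynamic power in arbitrary units, summing share * V^2 * f per island.

    If single_rail is given, every island runs at that voltage while
    keeping its own clock frequency.
    """
    total = 0.0
    for share, volts, ghz in islands:
        v = single_rail if single_rail is not None else volts
        total += share * v ** 2 * ghz
    return total


multi_rail = relative_dynamic_power(ISLANDS)
one_rail = relative_dynamic_power(ISLANDS, single_rail=1.2)
saving = 1 - multi_rail / one_rail

print(f"multi-rail / single-rail dynamic power: {multi_rail / one_rail:.2f}")
print(f"saving from separate rails: {saving:.0%}")
```

Under these assumptions, separate rails save a little over a quarter of the dynamic power compared with clocking each section at its own frequency from a single 1.2V rail.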

 

Designers today have access to a wide array of tools and techniques to manage their chip's power-performance metrics. These techniques range from very simple to ultra complex and also offer a wide spectrum of improvement possibilities. Surprisingly enough, the most efficient ones, like optimized placement, are very simple. Unfortunately, the ever increasing compartmentalization and specialization in chip design methodology keeps these techniques out of reach for those who use only conventional – albeit state-of-the-art – back-end design tools.