Aggressive Power and Performance
Optimizations in Next-Generation DSPs
by Octasic (http://www.octasic.com)
The ever-growing demand for rich multimedia signal processing in mobile
devices raises a chronic technology challenge: squeezing more
functionality and performance into increasingly tight power and space
constraints. As a result, power-performance metrics are now a central
concern in DSP design. New methods have been devised that enable
designers to address the main areas of power consumption – namely
leakage power, clock trees, logic transitions, and power grids – and
to significantly improve performance compared to conventional techniques.
In today's CMOS technology, power is consumed in two basic ways: statically
and dynamically. Static power is consumed through various leakage mechanisms,
while dynamic power is attributable to logic and interface operations.
Leakage Power
For pure standby operation, leakage current may be the only power draw.
In general, however, an ideal, application-dependent static-to-dynamic
power ratio exists and should be attained as a compromise. To achieve
that ideal ratio, a combination of design techniques can be applied
to limit the leakage power to a given value. Such techniques include
using more conservative CMOS processes (e.g. opting for a 0.13µm process
over a 90nm process), using lower-leakage transistors, and/or using
circuit techniques that dynamically shut down the power of entire circuit
sections on a duty-cycle basis or when not in use.
Although this may seem counterintuitive at first, to optimize leakage power
it is often best to select a higher-leakage process and
then to limit the overall leakage by circuit design. For example, a
lower-leakage process may use high-threshold transistors (e.g. VT =
0.4V for HVT), while a higher-leakage process may use lower-threshold
transistors (e.g. VT = 0.3V for SVT). The higher-leakage
process could draw up to 10 times more leakage current than a low-leakage
process, but deliver the same performance because it can use a lower
supply voltage.
Let us consider an example. A design using an HVT process operates with
a supply voltage of 0.8V. Operating at maximum capacity (100% duty
cycle), it consumes 5mW of leakage power and 1W of dynamic power. A
similar design using an SVT process (operating at 0.7V) delivers exactly
the same performance and draws 10 times more leakage power (50mW).
However, its dynamic power drops to 0.76W, so the higher-leakage SVT
design consumes a total of only 810mW: a power saving of roughly 20%!
| Power Consumption | HVT process (supply = 0.8V) | SVT process (supply = 0.7V) |
| Leakage Power | 0.005W | 0.05W |
| Dynamic Power | 1W | 0.76W |
| Total Power | 1.005W | 0.810W |
| Performance | Both processes deliver the same performance. |
Table 1 – Comparison of Power Consumption of HVT and SVT processes
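The figures in Table 1 can be approximately reproduced with a short calculation. This is only a minimal sketch: it assumes that dynamic power scales with the square of the supply voltage at a fixed clock frequency and switched capacitance (the usual first-order CMOS model, which the article does not state explicitly), and the function and variable names are illustrative.

```python
def scaled_dynamic_power(p_dyn_ref_w, v_ref, v_new):
    """Scale dynamic power with the square of the supply voltage.

    Assumption: P_dyn is proportional to V^2 at constant frequency and
    switched capacitance (first-order model, not taken from the article).
    """
    return p_dyn_ref_w * (v_new / v_ref) ** 2


# Values from the example: HVT at 0.8 V, SVT at 0.7 V with 10x the leakage
hvt_leak_w, hvt_dyn_w = 0.005, 1.0
svt_leak_w = 10 * hvt_leak_w
svt_dyn_w = scaled_dynamic_power(hvt_dyn_w, v_ref=0.8, v_new=0.7)

hvt_total_w = hvt_leak_w + hvt_dyn_w
svt_total_w = svt_leak_w + svt_dyn_w
savings = 1 - svt_total_w / hvt_total_w

print(f"SVT dynamic power: {svt_dyn_w:.3f} W")
print(f"SVT total power:   {svt_total_w:.3f} W")
print(f"Savings vs. HVT:   {savings:.1%}")
```

Under this assumption the SVT dynamic power comes out at about 0.77W and the saving at a little under 20%, which is within rounding of the values given in Table 1.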
Power can be reduced even further by not running the SVT circuit at full
capacity. The circuit can be powered down during inactive portions
of its duty cycle, thereby completely eliminating leakage current during
those periods.
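As a rough illustration of the effect of powering down during idle periods, the sketch below compares average power with and without power-down. It assumes that dynamic power is zero whenever the circuit is idle, that a powered-down circuit draws no leakage at all, and that switching the power on and off costs nothing; the 30% duty cycle is an invented value, not a figure from the article.

```python
def average_power_w(leak_w, dyn_w, duty_cycle, powered_down_when_idle):
    """Average power over a duty cycle.

    Assumptions (simplifications for illustration only):
    - dynamic power is drawn only while the circuit is active;
    - leakage is drawn while active, and while idle unless powered down;
    - power-up and power-down transitions are free.
    """
    if not 0.0 <= duty_cycle <= 1.0:
        raise ValueError("duty_cycle must be between 0 and 1")
    active = duty_cycle * (leak_w + dyn_w)
    idle = 0.0 if powered_down_when_idle else (1.0 - duty_cycle) * leak_w
    return active + idle


# SVT figures from Table 1; the 30% duty cycle is hypothetical
always_on = average_power_w(0.05, 0.76, 0.30, powered_down_when_idle=False)
gated = average_power_w(0.05, 0.76, 0.30, powered_down_when_idle=True)
print(f"Idle but powered: {always_on * 1e3:.0f} mW")
print(f"Powered down:     {gated * 1e3:.0f} mW")
```

The lower the duty cycle, the larger the share of the remaining power that is pure leakage, and the more powering down helps.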
Dynamic Power
Dynamic power-performance metrics vary based on many factors, including the
type of processes and algorithms run, the DSP architecture and instruction
sets used, as well as the way memory is partitioned. Inside a chip,
however, dynamic power is generally consumed in three main ways: clock
trees, logic transitions, and power grid losses.
Power-performance metrics can therefore be substantially improved by
aggressively optimizing each of these three power consumers.
Power Grid Losses – Power
grid (IR) losses can easily be reduced to an insignificant value
by using a substantial power distribution mesh. By analogy, circuit
board designs have long used dedicated power plane layers for impedance
control, shielding, minimization of emissions and of susceptibility
to them (crosstalk), as well as for power distribution.
In smaller-geometry chips (90nm and below), crosstalk has become such
a prevalent and difficult issue that some advanced designs shield signal
lines by interleaving power and signal traces on each metal layer.
Due to the sheer size of the power mesh, when this method is used, power
grid losses become insignificant compared to those due to clock trees
and logic transitions.
Clock Tree Losses – Every
time a flip-flop is clocked, some energy is spent in the flip-flop
operation itself and in charging and discharging the (often massive)
clock trees that span modern chips. The power consumption in clock
trees can be minimized via a combination of increasingly sophisticated
techniques, including the use of:
· Individually clock-enabled flip-flops to restrict flip-flop operation
to the times when clocking is absolutely necessary.
· Gated clock trees to dynamically prevent the clocking of entire
circuit sections when not in use.
· Multi-cycle path design to reduce the number of flip-flops in circuits
as well as the frequency at which they are triggered.
· Asynchronous computational circuitry whenever architecturally feasible.
For example, a typical power-hungry DSP Sum-Of-Products operation can be
implemented in a cascaded asynchronous circuit (without interspersed
flip-flops), rather than in a synchronous feedback circuit. (Traditional
synchronous circuits typically have lots of flip-flops clocked very
frequently.) Like the multi-cycle path, this approach substantially
reduces the number of flip-flops used and the frequency at
which they are triggered.
· Minimizing the size of flip-flops and the size of circuits, so that
clock trees are physically smaller and require smaller drive buffers.
· Reducing the voltage level of a clock tree. (This is coupled with the
voltage-level technique used for logic operation.)
The voltage-level and size-minimizing techniques are discussed further
in the next section on logic transitions.
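The effect of several of the clock-tree techniques listed above can be sketched with the usual first-order model, in which a clock network dissipates roughly C·V²·f. This is an illustrative estimate only: the capacitance, voltage, frequency and gating values below are hypothetical, and the function name is not taken from any tool.

```python
def clock_tree_power_w(cap_farads, vdd_volts, freq_hz, enabled_fraction=1.0):
    """First-order clock network power: C * V^2 * f.

    The clock charges and discharges its load once per cycle, so the
    activity factor is taken as 1. `enabled_fraction` models clock gating
    as the share of cycles in which the gated clock actually toggles.
    """
    return cap_farads * vdd_volts ** 2 * freq_hz * enabled_fraction


# Hypothetical numbers, chosen only to show the trends
baseline = clock_tree_power_w(200e-12, 1.0, 1e9)        # ungated tree
gated = clock_tree_power_w(200e-12, 1.0, 1e9, 0.25)     # enabled 25% of cycles
low_voltage = clock_tree_power_w(200e-12, 0.8, 1e9)     # reduced clock voltage
smaller = clock_tree_power_w(100e-12, 1.0, 1e9)         # half the capacitance

for name, power in [("baseline", baseline), ("gated", gated),
                    ("low voltage", low_voltage), ("smaller tree", smaller)]:
    print(f"{name:12s} {power * 1e3:6.1f} mW")
```

Because the three factors multiply, gating, a lower clock voltage and a physically smaller tree compound when they are applied together.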
Logic Transition Losses – Every
time a logic circuit changes state, some power is consumed in charging
or discharging the circuit to its new state. Here again, a combination
of increasingly sophisticated techniques can be applied to minimize
the power consumed in logic transitions:
Eliminate circuits that change states uselessly. (That is, remove any
circuit whose changed output is not used.) This can be done through
clock gating design.
Reduce frequency of operation. Modern personal computer CPUs (like Pentium)
are proof that frequency of operation can be pushed higher and higher,
but only at the expense of using an inordinate amount of power.
Increasing a circuit's performance
requires using one or more of the following techniques:
· Use more complex circuits (e.g. look-ahead vs. ripple-carry adders)
which, regrettably, consume more area and power to operate;
· Use larger gates, buffers, and drivers to speed transitions. However,
as designers know all too well, an increase
in driver size and power consumption does not translate into a proportional
gain in performance (i.e. speed). So extensively using bigger
and stronger gates, drivers, and buffers is a losing proposition
for improving power-performance metrics.
· Use simpler and slower circuits that operate in parallel or in a staggered
multi-cycle path to deliver the same performance
at a lower power cost.
What's more, such circuits yield much smaller implementations. In general,
even when used in parallel, their aggregate implementations are smaller
than those of faster, more power-hungry circuits!
Compact circuitry size. This
technique delivers by far the highest return
in terms of power-performance metrics, especially with smaller-geometry
technologies. Although very simple in principle,
this technique is very difficult to implement for those who use conventional
back-end design tools. Let us first analyze why it is so effective
in improving the power-performance metrics.
Firstly, in today's technology, wire interconnects (charging and discharging
interconnect wires) are the chip elements that consume the most power.
They consume far more than gates.
Secondly, despite the sophistication and complexity of conventional back-end
design tools, they are often unable to figure out how to optimize placement
to minimize routing, in contrast to human designers. This is most apparent
in regular, repetitive, and parallel circuits that are so common today
and that consume the most power in data path engines, DSPs, and CPUs.
Within a very short time, a human brain can normally make out the overall
picture from the details, a simple exercise that still baffles the
most powerful super-computers. As a result, the human brain can
figure out the best element placement, alignment, rotation, I/O positioning,
etc., in a data-path engine or section thereof, quite quickly. The
designer may also readily see that slightly altering the functionality
or I/O positioning of a frequently used element (cell) would slash
interconnect wire lengths by a factor of two, four or even eight.
Practical results reveal that in typical data-path engines, such techniques reduce
average circuit wire lengths by a factor of 8 (compared to circuits
implemented by conventional, state-of-the-art automatic back-end tools).
Furthermore, circuit compaction stemming from strategic placement easily
yields 90% silicon usage efficiency, an impressive 30-50% increase
over results with automatic back-end tools. But that is not all! Over
and above these gains, there is a strong compounding effect that
propels the power-performance metrics even further: the gates that
drive these very short wires are generally minuscule in size and power
consumption. Accordingly, entire circuits are much smaller and operate
faster than their automatically-placed counterparts, while consuming
a fraction of the power. Using this simple circuit compaction technique
with 90nm technology, entire data-path engines can be pushed to run
at 1.5-2GHz while consuming 10 times less power than equivalent,
conventionally-designed circuits!

Figure 1 - Advantages
of Optimized Placement: Circuit Compaction and Power Reduction,
up to a factor of 10. The gates are illustrated in yellow, while the
space between them is shown in purple.
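To see why shorter wires and smaller drivers compound, the switched capacitance of a circuit can be split into a wire part and a gate part. The sketch below is a toy model, not the article's measurement method: only the wire-length factor of 8 comes from the text, while the 80/20 wire-to-gate split and the gate shrink factor of 4 are assumptions made up for illustration. At a fixed voltage and frequency, dynamic power is taken to be proportional to the total switched capacitance.

```python
def relative_switched_capacitance(wire_share, wire_shrink, gate_shrink):
    """Switched capacitance after compaction, relative to 1.0 before it.

    wire_share  : fraction of the original switched capacitance due to wires
    wire_shrink : factor by which average wire length (and capacitance) drops
    gate_shrink : factor by which gate/driver capacitance drops
    """
    if not 0.0 <= wire_share <= 1.0:
        raise ValueError("wire_share must be between 0 and 1")
    gate_share = 1.0 - wire_share
    return wire_share / wire_shrink + gate_share / gate_shrink


# wire_shrink=8 is the article's figure; the other two values are assumptions
after = relative_switched_capacitance(wire_share=0.8, wire_shrink=8, gate_shrink=4)
print(f"Switched capacitance after compaction: {after:.2f}x the original")
print(f"Reduction factor: {1 / after:.1f}x")
```

The result depends entirely on the assumed split and shrink factors, so it should be read as a sensitivity check rather than as a reproduction of the factor-of-10 figure quoted above.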
Reduce voltage transition swing – There are plenty of long, heavily loaded
parallel buses in data-path
engines, DSPs, and CPUs. These buses consume a significant amount
of power while switching and are most often a drag on performance
due to their heavy capacitive loading. The use of transmission
line technology with smaller voltage swings, such as the differential
amplifiers used in high-performance memory design, can
drastically improve the picture. These transmission lines operate
with smaller voltage transitions, greatly reducing power consumption.
More important still is that these transmission lines can change
states ten times faster than conventional CMOS rail-to-rail circuitry for
the same power consumption, leading to much improved power-performance
metrics.
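The power benefit of a reduced swing can be estimated with a first-order model of the energy drawn from the supply per transition. The sketch assumes that the line is charged from the full supply rail, so the energy is C·Vdd·Vswing rather than C·Vdd², and it ignores the power of the receiving amplifier; the capacitance and voltages are hypothetical.

```python
def energy_per_transition_j(cap_farads, vdd_volts, swing_volts=None):
    """Energy drawn from the supply to charge a line by `swing_volts`.

    Full-swing CMOS: C * Vdd^2. Reduced swing: C * Vdd * Vswing.
    First-order model only; sense-amplifier overhead is ignored.
    """
    if swing_volts is None:
        swing_volts = vdd_volts  # rail-to-rail transition
    return cap_farads * vdd_volts * swing_volts


# Hypothetical 2 pF bus line on a 1.0 V supply
full_swing = energy_per_transition_j(2e-12, 1.0)
low_swing = energy_per_transition_j(2e-12, 1.0, swing_volts=0.2)
print(f"Full swing: {full_swing * 1e12:.2f} pJ per transition")
print(f"Low swing:  {low_swing * 1e12:.2f} pJ per transition")
print(f"Ratio:      {full_swing / low_swing:.1f}x")
```

In this model the energy saving is simply the ratio of the supply voltage to the swing, which is why small swings pay off most on heavily loaded buses.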
Modulate voltage operation – With
the level of integration achievable today, entire systems are built
on a single die. As in any complex system, not every element needs
to run at full tilt. By analogy, not every transport vehicle needs
to be fueled with 130 octane rating (aviation caliber) gasoline.
It is the same in chip design. A portion of the circuit, say 10%,
may be a bottleneck in the whole design and cannot
be redesigned or parallelized to relieve it. In fact, we would want
to run this circuit in overdrive. In contrast, the rest of the
circuit, where the required performance is already attained, could be run
in the leanest way possible - analogous to fueling it with diesel
whose octane rating is a mere 20! In chip design this translates
into applying different voltage rails to different parts of the
circuit. For example, 10% of a chip's circuitry could be fed with
1.2V to run at 4GHz, another 40% at 1.0V to run at 2GHz, while
the remaining 50% is fed with 0.7V to run at 500MHz. The aggregate
would provide the best overall power-performance metrics achievable
for that particular situation.
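The example partition can be compared against running the whole chip on the fastest rail with a simple estimate. This sketch assumes that each region's dynamic power is proportional to its share of the circuitry times V² times f, with uniform capacitance and switching activity across the chip; these are simplifying assumptions, and only the shares, voltages and frequencies come from the text.

```python
def relative_dynamic_power(regions):
    """Sum of share * V^2 * f over (share, volts, ghz) regions.

    The result is in arbitrary units; only ratios between two
    configurations are meaningful.
    """
    total_share = sum(share for share, _, _ in regions)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("region shares must add up to 1")
    return sum(share * volts ** 2 * ghz for share, volts, ghz in regions)


# (share of circuitry, supply voltage in V, clock in GHz) from the example
partitioned = relative_dynamic_power([
    (0.10, 1.2, 4.0),
    (0.40, 1.0, 2.0),
    (0.50, 0.7, 0.5),
])
# Baseline assumption: the entire chip runs at 1.2 V and 4 GHz
single_rail = relative_dynamic_power([(1.0, 1.2, 4.0)])

ratio = partitioned / single_rail
print(f"Partitioned chip uses {ratio:.0%} of the single-rail dynamic power")
```

Under these assumptions the partitioned design draws roughly a quarter of the dynamic power of the single-rail baseline, with most of what remains going to the 10% and 40% regions.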
Designers today have access to a wide array of tools and techniques to manage
their chip's power-performance metrics. These techniques range from
very simple to highly complex and offer a wide spectrum of improvement
possibilities. Surprisingly enough, the most efficient ones, like optimized
placement, are very simple. Unfortunately, the ever-increasing compartmentalization
and specialization in chip design methodology keep these techniques
out of reach for those who use only conventional – albeit state-of-the-art – back-end
design tools.