Timing-Closure In High-End FPGAs
Jeff Garrison, Sr. Director, FPGA
Implementation Marketing, Synplicity, Inc.
Introduction
Timing-closure
is a growing concern for FPGA designers, particularly with the recent
introduction of multi-million gate architectures fabricated at the
90nm and 65nm technology nodes. It is not sufficient for a timing-closure
solution – the entire flow, including synthesis – to only
meet the required timing; such a solution must also minimize the number
of time-consuming synthesis-place-route iterations and provide results
that remain stable across multiple physical synthesis runs and during
final routing.
A Problem With Conventional Technologies
It
is no secret that wire delays dominate cell delays in modern silicon
chips. In the case of FPGAs implemented at the 90nm technology node,
for example, wire delays can account for 80-to-90% of each delay path.
This causes a problem with conventional FPGA synthesis solutions, because
only the cell delays are known, while the wire delays – which
cannot be fully characterized until after place-and-route – are
estimated.
As
noted in Figure 1 below, the distribution of delays – the difference
between what is estimated after logic synthesis compared to the actual
delays following place-and-route – increases with each technology
node and also as a function of the size of a design.
The
result of these inaccurate delay estimations is a timing-closure nightmare,
because the critical paths seen by the synthesis engine do not match
the critical paths seen by place-and-route. In many cases, based on
these inaccurate delay estimations, true critical paths may be under-optimized
(thereby degrading performance), while non-critical paths may be over-optimized
(thereby degrading area). Ultimately, this results in a significant
increase in the number of synthesis-place-route iterations with poor
convergence. Also, it leads to instability, because minor changes to
the RTL can result in large, unpredictable changes to the results from
synthesis and place-and-route.

Figure 1. There is increasing uncertainty
in delay distribution as designs use newer technology nodes
Some
Considerations Associated With FPGA Architectures
The
majority of today's conventional FPGA logic synthesis solutions are
derived from technology that was originally designed with ASIC synthesis
in mind. In the case of physically-aware synthesis for ASICs, for example,
placement and synthesis are combined in a single pass. The reason this
works so well in the ASIC world is that routing tracks are built-to-order.
This means that the delays associated with the final placed-and-routed
design tend to correlate reasonably well to those estimated by the
synthesis engine.
One
way to think of this is that working with an ASIC is like living in
an undeveloped area (such as a desert) with a large budget. In this
case you can build new roads as you need them and – generally
speaking – each road can be as fast as you wish. Thus, if you
wished to construct a path from your home to the office, this path
could be implemented as a point-to-point super-fast highway. Based
on this, it would be easy for you to accurately estimate the time it
will take for you to travel to work and back.
By
comparison, working with an FPGA is similar to living in the middle
of a large, well-established city, whose “interconnects” are
a mixture of small back streets, medium-sized roads, and really fast
highways. In that case, when you are returning home from the office,
you don’t necessarily aim at driving the shortest distance. Instead,
you may commence your journey by heading the “wrong way”
until you can get on a highway, at which point you may reverse your direction
and pass the office again as you head toward your home.
Planning
on traveling in a straight line is great if nothing goes wrong. For
example, if you wish to travel from point A to point B in a car and
you plan on taking a certain road, you may estimate that the trip will
take only 1 hour assuming good weather and no significant amount of
traffic. But suppose some fog comes in and it starts to rain, plus
it turns out that a major festival is taking place and there are cars
everywhere. There may be lots of alternative routes available: some
offering a short detour but with lots of tight curves, while others
are straighter but longer – which route offers the best solution?
A
similar situation occurs with regard to synthesis-place-route – basing
one's estimations on a straight-line route for a signal is great if
nothing gets in the way and nothing goes wrong. In this case, the logic
synthesis will work and the place-and-route will closely match its
results. In a real-world design, however, multiple gates and routes
end up fighting for the same resources. This means that some placement-signal
combinations will have to "take a detour", but which ones
have to take the detour and which detour should they take?
The
situation is further confused by the fact that having two logical functions
in close proximity to each other on an FPGA does not necessarily imply
the availability of a fast connection between them. Although this is
somewhat non-intuitive – and it is dependant on the available
routing resources – one may achieve better routing and timing
results by placing the logic functions further apart. This is why ASIC-derived
synthesis technologies return less-than-optimal results when applied
to FPGA architectures. This is one reason why design flows based on
these technologies require large numbers of time-consuming iterations
between the front-end (synthesis) and back-end (place-and-route) engines
in order to achieve correlation and timing closure.
A
More-Detailed Look At Conventional Synthesis Technologies
In
order to improve the results from standard RTL synthesis, there are
three conventional synthesis-enhancement techniques: floorplanning,
in-place optimization (IPO), and physically-aware synthesis.
Floorplanning
Floorplanning
tools allow designers to define physical regions on the device, to
place regions (by hand or auto-interactively), and to assign distinct
portions of the design to the regions. (In the case of FPGAs, floorplanning
is not limited to module assignments – it is also very common
to floorplan detailed, point-to-point paths.) The floorplanned design
is typically synthesized and optimized on a block-by-block basis and
then everything is "stitched together" at the end.
In
the case of FPGA implementations, analysis of a number of representative
designs shows that critical paths in a design typically end up passing
through three
“rooms” in the floorplan. Due to the regular structures associated
with FPGA architectures, as logic modules are assigned to physical regions
things actually end up getting spread out, with the result that the use
of floorplanning can actually end up degrading the design’s performance.
Floorplanning
tools do not consider routing resources globally and it is not possible
to accurately analyze all of the timing paths until after the design
has been fully routed, which makes it difficult to realize global optimizations.
This can result in large numbers of time-consuming iterations between
the front-end and back-end tools. Furthermore, having a floorplan is
detrimental with regard to migrating the design to a different device
or FPGA vendor. Since floorplanning is largely a manual
process, it requires special expertise on the part of the user.

Figure 2. Manual floorplanning can
reduce the FPGA delay distribution, but is limited and requires special
expertise
In-Place
Optimization
In
the case of in-place optimization (IPO), the output from synthesis
is passed to the downstream place-and-route engines. Following place-and-route
and parasitic extraction, real-world delays are back-annotated into
the synthesis domain. These new values trigger incremental optimizations
in the synthesis engine such as logic restructuring and replication.
The result is a new netlist that has been partially modified. This
netlist is then handed over to incremental place-and-route engines
which generate a modified design topology.
The
end results from an IPO-based flow are typically better than those
achieved by all but the very best of floorplans and hand routing. Once
again, however, this technique can require numerous time-consuming
iterations between the front-end and back-end tools.
Furthermore,
a significant problem with an IPO-based technique is that optimizing
timing paths on an individual basis often leads to degradation of other
paths and incomplete timing closure. Designers need reliable results
that allow them to make changes to the design without losing what they
have achieved in earlier versions. However, the IPO-based technique
does not produce stable results over multiple design iterations, because
optimizing critical paths in one iteration can create new critical
paths in the next. Similarly, adding constraints to improve timing
in one area can worsen timing in other areas. All of this can result
in convergence issues.
Physically-Aware
Synthesis
As
noted earlier, the underlying concept behind physically-aware synthesis
is to combine placement and synthesis in a single pass. This works
well in the ASIC domain, because the placement-aware synthesis engine
can base its timing estimations on the proximity of the placed cells
and on the ensuing Steiner and Manhattan routing estimations.
Starting
around the 2002-2003 timeframe, some conventional synthesis solutions
started to apply ASIC-derived physically-aware synthesis technology
to FPGA designs, but this technology has not enjoyed a high level of
success because things are different in the FPGA domain. The problem
is that, unlike tracks in the ASIC world that are “built-to-order,” an
FPGA has a fixed amount of pre-defined routing resources, and not all
of these routes are created equal.
Most
of the physically-aware synthesis techniques apply partial solutions
for improving the critical paths in the design. There are integrations
of retiming and placement, replication and placement, clustering and
placement, etc. As a result of these approaches, the netlist is modified
in a physically-aware way, but not legalized. The last step of legalization
can undo the improvement made by physically-aware optimizations, or
make them show unrealistic improvements. They don't provide integral
solution of placement, mapping and optimizations.
Lets
take a look at the nature of the problem that we are trying to solve.
The
Local Optima Problem
Assume
that (for purposes unknown) you construct a simple device consisting
of a power source and two variable resistors in series with an incandescent
light bulb (Figure 3). Let’s further assume that the resistors
can be rotated through a 180 degree range from –90 degrees to
+90 degrees, and that the point of least resistance for each resistor
is its center (upright) position.

Figure 3. A simple device consisting
of a power source
and two potentiometers in series with a light.
Now,
suppose that you were to set the resistors to random positions, then
you give this device to someone and ask them to use it to make the
light as bright as possible. Most people would select one of the knobs,
turn it a little way to the left or right, and note the result. If
the light gets brighter they will continue to turn the knob in that
direction. Alternatively, if the light dims, they will reverse their
original action and try turning the knob the other way. After a little
experimentation, the user would quickly determine the optimal setting
for the first resistor. They would then turn their attentions to the
second resistor and perform a similar sequence of actions.
The
phrase “solution space” refers to all of the possible solutions
associated with a particular problem. We can represent the simple solution
space corresponding to our light bulb problem in the form of a three-dimensional
graph (Figure 4).

Figure 4. The simple solution space
for our light bulb problem.
In this particular case, we’re assuming that the x
axis corresponds to the rotation of the first resistor, the z axis
corresponds to the second, and the y axis reflects the brightness of
the light.
If
we were to use a software program to solve a similar problem, one approach
would be to use a “hill climbing” algorithm. In this case,
the program would commence at some random point, make a small modification
to one of the variables under its control, and then check to see if
the quality of the solution had improved or degraded. In the case of
the former, the program would continue to effect changes in the same
direction; but if the original modification had caused the solution
to worsen, then the program would reverse the direction in which it
had changed the variable and try again. The program could repeat this
process with all of the variables under its control. Also, the program
would require some way to recognize when it had reached the optimal
solution.
In
the real world, of course, the solution space associated with a particular
problem may consist of a number of “hills” and “valleys” (Figure
5). In this case, a hill climbing algorithm may end up on some local
maxima (the top of one of the
“hills”), but the algorithm has no way of “seeing” any
other “hills,” some of which may be higher.

Figure 5. A slightly more complex
solution space.
Furthermore,
as the number of variables used to describe the problem increases,
the result can be a solution space containing constructs such as “tunnels,” “bridges,” and
even elements like “cities in the sky” (the names of these
constructs are fairly self-explanatory). As the topology of a solution
space approaches these levels of complexity, finding the highest peak
(or even being able to tell the difference between “up” and “down”)
becomes highly problematic.
The
point of all this is that – due to the complexities of FPGA fabrics
and routing topologies – design flows based on the use of conventional
synthesis solutions can spend vast amounts of time and computational
resources, only to arrive at a less-than-optimal solution.
Graph-Based
Physical Synthesis Solution
An
alternative synthesis solution that truly handles the complexities
associated with FPGA architectures is that of graph-based physical
synthesis. Conventional synthesis approaches are reactive in that they
constantly have to react to bad decisions and delay estimations. By
comparison, graph-based physical synthesis is proactive in that it
starts off by making the correct decisions, which result in accurate
delay estimations. In turn, these good delay estimations dramatically
reduce the number of synthesis-place-route estimations. Furthermore,
the way in which graph-based physical synthesis works means that it
is inherently stable; small changes to the RTL result in small localized
changes to the synthesis-place-route timing path results.
The
way this works is to characterize all of the tracks in the FPGA – including
entry points, end points, and internal exit points, and to then build
a “map” of all of these tracks. In the software world this
type of map is referred to as a Graph; hence the reason why this technique
is known as "Graph-Based Physical Synthesis".

Figure 6. Graph-based physical synthesis
further reduces FPGA delay distribution resulting in very tight correlation
between post synthesis timing and final post P&R timing
In
addition to the tracks themselves, this map also includes details as
to which LUT pins have access to which types of track, any differences
in input-to-output delays through each LUT, and the size and locations
of any hard macros in the device. Returning to our analogy for a moment,
this is like having access to a map of a city showing the streets and
highways and features such as parks (hard macros) that you will have
to drive around. When wishing to travel between two points in the city
as quickly as possible, you would use the map to select the fastest
route – in many cases this will not be the shortest and most
direct route.
Similarly,
instead of looking for proximity, the synthesis engine focuses on speed
using an interconnect-centric approach. Starting with the most critical
paths and working its way down to the least critical paths (thereby
ensuring that the fastest routes are available for the most critical
paths), the synthesis engine will select tracks and their associated
entry points and exit points; from these tracks it will derive placements;
from the tracks and placements it will derive accurate delays; and
it will then optimize and iterate as required.
A
key point is that all of the optimization and iteration is performed
in the front-end (synthesis) portion of the flow. During all steps
of graph-based synthesis, the netlist is legally placed and routing
is accurately estimated. Therefore, all improvements to the critical
path done during physical synthesis are realistic. The output
from the graph-based physical synthesis engine is a fully placed netlist
(including the specific LUT pins that are to be associated with each
track) that can be handed over to the FPGA’s back-end route engine.
This
dramatically reduces the amount of delay distribution. In turn, these
highly accurate delay estimations significantly reduce iterations and
increase design stability.
As
we noted in the previous topic, a conventional synthesis solution can
easily arrive at a less-than-optimal solution. It mistakenly thinks
it's reached the highest peak in its solution space, but it can't "see" taller
surrounding hills because they are too far away. One approach is the
FPGA synthesis engine's equivalent of the film Groundhog Day – that
is, to generate a new pseudo-random seed (starting point) and do the
whole thing again (sort of like shaking a snow globe and seeing where
all the flakes end up falling).
By
comparison, the graph-based physical synthesis engine's view of the
solution space is like flying high over the landscape in a plane. Its
ability to
"see" the whole world means that the graph-based physical synthesis
engine can easily spot the highest mountain and then "parachute"
directly in to a soft landing on top.
The
end result is a single-pass, push-button synthesis step requiring minimal
iterations with the downstream place-and-route tools.
Summary
There
is an increasingly serious timing closure problem when using high-performance,
high-complexity FPGAs implemented at the 90nm technology node and lower.
The solution is to use a graph-based physical synthesis approach.
Unlike
the ASIC-centric physically-aware synthesis world in which routing
tracks are derived from placement selections, it is the placements
derived from the routing selections when using a Graph-Based Physical
Synthesis approach in the FPGA domain.
Due
to potentially inaccurate delay estimation, FPGA design flows based
on existing (ASIC-derived) physical synthesis engines tend to exhibit
a high degree of instability with regard to their results (small changes
to the RTL can result in major changes to the synthesis-place-route
results). In turn, this may require many time-consuming iterations
between the front-end (synthesis) and back-end (place-and-route) portions
of the flow. And, after all of this, they may still fail to converge
on a timing solution.
By
comparison, an analysis of graph-based physical synthesis results from
more than 200 designs shows that 90% are within 10% of the final real-world
timing and 80% of these designs are within 5% (by comparison, only
30% of designs using state-of-the art ASIC-derived physical synthesis
are within 5% of the real-world timing values and many designs can
easily be in error by 30% or more).
In
addition to the fact that a flow using graph-based physical synthesis
requires minimal iterations with the downstream place-and-route engines,
the quality of the placed netlist is dramatically improved; this optimal
placement means that the router is presented with obvious decisions,
thereby allowing it to run significantly faster.