FIR Filter Design-Space Exploration
with the Synplify® DSP Design Tool
Shiv Balakrishnan, DSP Technical
Marketing Engineer, Synplicity, Inc.
Introduction:
Finite Impulse Response (FIR) filters are basic DSP building blocks that have
been studied and analyzed thoroughly over the past few decades. One
might be tempted to assume that the design of FIR filters can be handled
with a ‘cookbook’ approach. In other words, a user might just
have to specify a few high level design parameters and the software does
the rest – down to the implementation details.
While a few situations lend themselves to the scenario above, there are many
situations where it is necessary to explore a range of options in the
overall design-space. One set of factors that makes this task
complicated and time-consuming relates to fixed point quantization:
arithmetic and coefficient word lengths can vary, and each choice affects
product performance. Another is implementation
details such as pipeline registers for maximizing throughput, or optimizations
for area. When implementing in FPGAs or ASICs it is often necessary
to explore the tradeoffs in achievable rates versus system resource
usage, and going through the design flow for each alternative
involves a lot of effort.
The Synplify DSP tool allows you to rapidly explore a number of architectural
alternatives from a high level (e.g. MATLAB/Simulink [1]) domain by
automating the synthesis of the logic architecture and evaluating the
gate-level implementation results.
This design exploration flow is shown in Figure 1.

Figure 1: DSP synthesis enables
rapid design-space exploration
We illustrate the process with a design example which highlights the power
of the Synplify DSP design tool, coupled with other tools from the
FPGA design flow.
FIR Example:
The most basic FIR filter implements the sum-of-products shown in Figure
2 below. The simplest case assumes that the filter coefficients are
fixed and the output sample rate is the same as the input sample rate.

Figure 2: Basic FIR filter
equation (filter length is N). x(n) and y(n) are input and output
sequences; h(k) are filter coefficients
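The figure's equation is the standard FIR convolution sum, y(n) = sum over k = 0..N-1 of h(k) x(n-k). As a minimal sketch of that sum-of-products (the function name `fir_filter` and the example coefficients are illustrative assumptions, not part of the Synplify DSP tool), it can be evaluated directly in Python:

```python
def fir_filter(x, h):
    """Evaluate y(n) = sum_{k=0}^{N-1} h(k) * x(n - k) for every input sample."""
    n_taps = len(h)
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k in range(n_taps):
            if n - k >= 0:  # samples before the start of x are taken as zero
                acc += h[k] * x[n - k]
        y.append(acc)
    return y


# Hypothetical 3-tap coefficients; an impulse input returns the coefficients.
impulse = [1.0, 0.0, 0.0, 0.0, 0.0]
h = [0.25, 0.5, 0.25]
print(fir_filter(impulse, h))  # [0.25, 0.5, 0.25, 0.0, 0.0]
```

Each output sample needs N multiply-accumulate operations, which is why the number of MACs provided in hardware determines how many clock cycles an output takes.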
The FIR filter above can clearly be implemented in a completely serial
fashion (with one time-shared Multiplier-Accumulator (MAC)) or, at the
other extreme, completely in parallel using N MACs for every output
point. The latter yields the highest throughput (roughly N times
that of the fully serial implementation at the same operating frequency); however,
there is a range of implementations in between these two extremes.
There are other well known optimizations in FIR implementation, such as using
coefficient symmetry to halve the number of multipliers at the cost
of more adders, which is often a useful tradeoff. Another basic
difference in FIR architectures is Direct Form versus Transposed Form
implementations; in the former, input samples are buffered (i.e. effectively
move through a delay line) whereas in the Transposed Form, partial
sums are stored and propagated. Although the theoretical number
of computations is often nominally the same, these differences will
often show up in the word lengths required, control logic, pipeline stages,
etc.
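The three structures described above can be compared with a small behavioral sketch. The function names are illustrative assumptions, and the code models only the arithmetic, not the RTL that any synthesis tool generates:

```python
def fir_direct(x, h):
    """Direct Form: input samples move through a delay line."""
    delay = [0.0] * len(h)
    y = []
    for sample in x:
        delay = [sample] + delay[:-1]
        y.append(sum(c * d for c, d in zip(h, delay)))
    return y


def fir_transposed(x, h):
    """Transposed Form: partial sums are stored and propagated."""
    n = len(h)
    state = [0.0] * (n - 1)  # registered partial sums
    y = []
    for sample in x:
        products = [c * sample for c in h]
        y.append(products[0] + (state[0] if n > 1 else 0.0))
        # Each register takes its own product plus the next register's old value.
        state = [products[i + 1] + (state[i + 1] if i + 1 < n - 1 else 0.0)
                 for i in range(n - 1)]
    return y


def fir_symmetric(x, h):
    """Direct Form using h(k) == h(N-1-k): add mirrored samples, multiply once."""
    n = len(h)
    half = n // 2
    delay = [0.0] * n
    y = []
    for sample in x:
        delay = [sample] + delay[:-1]
        acc = sum(h[k] * (delay[k] + delay[n - 1 - k]) for k in range(half))
        if n % 2:  # the middle tap of an odd-length filter has no mirror
            acc += h[half] * delay[half]
        y.append(acc)
    return y


h = [1.0, 2.0, 2.0, 1.0]          # hypothetical symmetric coefficients
x = [1.0, 2.0, 3.0, 4.0, 5.0]
print(fir_direct(x, h), fir_transposed(x, h), fir_symmetric(x, h))
```

All three functions return the same sequence for this input. The symmetric version performs N/2 multiplications per output instead of N, and the transposed version holds partial sums rather than input samples, which is why the register word lengths differ between the two forms.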
Let us follow an example of a simple FIR filter through a number of design
iterations showing some of the tradeoffs mentioned above.
Algorithm Design:
The first step is to create a model that includes fixed-point and sample
rate behavior.
Synplify DSP software makes it very easy to do this using the powerful
features available in Simulink. Shown below in Figure 3 is a Synplify
DSP model of a basic 16-tap FIR filter with fixed coefficients.
Figure 3: Simulink block diagram
of a 16-tap FIR. Also shown are the filter design tool, input
waveform, and output analysis in time and frequency domains.
If we examine the specification of the filter itself, some of
the parameters are as shown below in Figure 4. Note that we can specify
the fixed point properties of the input and output data as well as
the coefficients, which gives yet another, often important, axis of
exploration.

Figure 4: Parameters of the
Synplify DSP FIR filter block. Note specified data formats.
Successive runs of the Synplify DSP tool, which take place at the Simulink level
within MATLAB, give a quick estimate of whether the desired functionality
is achieved; the typical progression would be to first make sure that
the coefficient and data word lengths are sufficient to achieve the
desired results. The design and analysis tools of Simulink shown in
Figure 3 are very useful for this purpose.
This step then gives us workable specifications for the FIR block (Figure
4). It is often useful to keep a record of how close different word length
choices come to the desired (or floating point) performance.
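One way to keep such a record outside the tool is to quantize the floating point coefficients to several word lengths and log the worst-case coefficient error. The sketch below is only an illustration under stated assumptions: the Hamming-windowed lowpass design, the cutoff of 0.125 cycles per sample, and the signed format with one sign bit and W-1 fractional bits are all hypothetical choices, not the settings used in Figure 4.

```python
import math


def lowpass_taps(n_taps, cutoff):
    """Hamming-windowed sinc lowpass; cutoff is in cycles per sample (assumed design)."""
    mid = (n_taps - 1) / 2
    taps = []
    for k in range(n_taps):
        t = k - mid
        ideal = 2 * cutoff if t == 0 else math.sin(2 * math.pi * cutoff * t) / (math.pi * t)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * k / (n_taps - 1))
        taps.append(ideal * window)
    return taps


def quantize(value, word_length, frac_bits):
    """Round to a signed fixed-point value of the given word length, with saturation."""
    scale = 1 << frac_bits
    lo = -(1 << (word_length - 1))
    hi = (1 << (word_length - 1)) - 1
    code = max(lo, min(hi, round(value * scale)))
    return code / scale


def max_coeff_error(taps, word_length):
    """Worst-case coefficient error, assuming W-1 fractional bits."""
    frac_bits = word_length - 1
    return max(abs(c - quantize(c, word_length, frac_bits)) for c in taps)


taps = lowpass_taps(16, 0.125)
for word_length in (8, 12, 16):
    print(word_length, max_coeff_error(taps, word_length))
```

Without saturation the error is bounded by half a least significant bit, so each extra coefficient bit halves the bound. Coefficient error is only a proxy; the frequency response and output noise still need to be checked against the floating point reference.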
Algorithm Implementation Using DSP Synthesis:
Next we can look at the invocation of the Synplify DSP tool itself, a typical
instance of which is shown in Figure 5 below.

Figure 5: Parameters of a typical
run of Synplify DSP. Note that the target device architecture is
selected at this stage.
Of particular note are the Retiming and Folding options. Essentially,
the Retiming option allows the Synplify DSP tool to modify the architecture,
using pipelining and other techniques, to reach the desired performance
goal at the expense of latency at the output. The Folding option
allows the design to share hardware at the expense of lower throughput
(i.e. it trades off maximum sample rate for resource utilization).
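The effect of folding can be illustrated with a behavioral model in which each output is accumulated over several clock cycles by a smaller number of shared multipliers. This is a sketch of the general time-sharing idea only; the function name, the tap-to-multiplier assignment, and the restriction that the folding factor divides the filter length are assumptions, not a description of how the Synplify DSP tool folds a design.

```python
def fir_folded(x, h, folding):
    """Compute each output over `folding` clock cycles using len(h) // folding MACs."""
    n = len(h)
    if folding < 1 or n % folding:
        raise ValueError("this sketch needs a folding factor that divides the filter length")
    macs = n // folding
    delay = [0.0] * n
    y = []
    cycles = 0
    for sample in x:
        delay = [sample] + delay[:-1]
        acc = 0.0
        for cycle in range(folding):
            # In one clock cycle every shared MAC processes one tap of its share.
            for m in range(macs):
                k = cycle * macs + m
                acc += h[k] * delay[k]
            cycles += 1
        y.append(acc)
    return y, cycles, macs


h = [1.0] * 16                      # hypothetical 16-tap coefficients
x = [float(i) for i in range(20)]
for folding in (1, 4, 8, 16):
    y, cycles, macs = fir_folded(x, h, folding)
    print(folding, macs, cycles)
```

The filtered output is the same for every folding factor, but the clock cycles needed per output sample grow with the folding factor while the multiplier count shrinks. For 16 taps, folding factors of 4, 8 and 16 use 4, 2 and 1 multipliers in this model.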
Creating Architectural Alternatives:
Once the baseline architecture is established, the exploration can commence.
The rest of this article deals with the results of using the Retiming
and Folding controls, particularly the latter. As noted in Figure 5’s
caption, at this point we need knowledge of the target FPGA’s
architecture and resources. In this particular example we map
to a Xilinx Virtex-5 FPGA with a maximum clock frequency of 400 MHz
(-10 speed grade). This family uses a versatile piece of hard-coded
logic called a DSP48 slice [2]. A key objective of our design-space
exploration in this example is to measure the tradeoff between sharing DSP48
slices and using parallelism for maximum throughput.
Evaluating Results:
Using the architecturally-optimized RTL from Synplify DSP, we can now use
logic synthesis tools to evaluate the speed and area results. For
FPGA targets, the Synplify DSP tool writes the RTL in a way that optimizes
mapping of multipliers, registers, and memory for the target device
using the Synplify Pro® or Synplify® Premier software. This includes
using hardware multipliers, block
memory, and shift register elements when necessary. For ASIC targets
we can use any commercial ASIC synthesis tool.
Figure 6 shows how the Synplify Pro tool reports on speed and resource utilization
for a typical run.

Figure 6: Output from the Synplify
Pro tool showing Estimated Frequency and Resource Usage (DSP48’s
highlighted)
As can be seen from the above, the simulation is very efficient, enabling
rapid turn around of multiple architectures to find an optimum for
a particular design.
If we examine the RTL output from the Synplify Pro tool for this run, we can
see the implementation details; a typical portion with multiple DSP48
slices is shown below in Figure 7.

Figure 7: Analyzing results in the
Synplify Pro HDL Analyst. On inspection, this is seen as a
Transposed Form structure.
The results of a few experiments with different Folding options in this
particular FIR filter example are summarized below in Table 1:
Folding Factor | Est. Frequency | Max. Throughput | DSP48 Slices |
None | 414.5 MHz | > 400 Ms/s | 16 |
4 | 346.7 MHz | 86.7 Ms/s | 4 |
8 | 352.0 MHz | 44 Ms/s | 2 |
16 | 338.5 MHz | 21.2 Ms/s | 1 |
Table 1: Folding effects on
filter throughput and hardware sharing
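The throughput column of Table 1 follows from dividing the estimated clock frequency by the folding factor. The short sketch below reproduces that arithmetic and picks the most heavily folded run that still meets a required sample rate. The helper names are illustrative assumptions, and the unfolded "None" row is treated as a folding factor of 1.

```python
# (folding factor, estimated frequency in MHz) from Table 1; "None" is entered as 1.
TABLE_1_RUNS = [(1, 414.5), (4, 346.7), (8, 352.0), (16, 338.5)]


def max_throughput_msps(est_freq_mhz, folding_factor):
    """Maximum sample rate in Ms/s: one output every `folding_factor` clock cycles."""
    return est_freq_mhz / folding_factor


def pick_folding(required_msps):
    """Largest folding factor from Table 1 that still meets the required sample rate."""
    candidates = [f for f, mhz in TABLE_1_RUNS
                  if max_throughput_msps(mhz, f) >= required_msps]
    return max(candidates) if candidates else None


for folding, freq in TABLE_1_RUNS:
    print(folding, round(max_throughput_msps(freq, folding), 1))

print(pick_folding(50.0))  # 4: the folding-8 run reaches only 44 Ms/s
```

The unfolded run's 414.5 MHz estimate is above the 400 MHz device limit quoted earlier, which is consistent with the "> 400 Ms/s" entry in the table.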
This particular Virtex-5 FPGA has a total of 32 DSP48 slices. The additional
hardware resources (LUTs, Register bits etc.) are small (< 4%) in
all cases compared to the available resources in the part. The
Synplify tools, of course, provide all the detailed information on
timing, constraints, resource utilization etc. taking into account
any extra latencies needed by Retiming and other optimizations.
What we can observe from the table is that you can rapidly create more serialized,
area efficient architectures when the sample rates are lower. The
powerful capability of doing this from a single algorithm model allows
you to easily map to different technologies and exploit higher clock
frequencies without changing and re-verifying the algorithm model.
Conclusion:
What we have shown from the simple FIR example is that the Synplify DSP
software provides a rapid and easy way of making architectural tradeoffs
while keeping the target implementation in the design loop. This
enables you to explore multiple architectural possibilities, including
important fixed point considerations, and obtain useful implementation
cost versus performance tradeoffs in an efficient way. The result
is an optimal FPGA implementation of high level algorithms with minimal
design time.
References:
1. MATLAB R2006b, Version 7.3.0.267, The MathWorks, Aug. 2006
2. ‘DSP: Designing for Optimal Results’, Advanced Design Guide, Xilinx Inc., 2005