FIR Filter Design-Space Exploration with the Synplify® DSP Design Tool

 

Shiv Balakrishnan, DSP Technical Marketing Engineer, Synplicity, Inc.

 

Introduction:

 

Finite Impulse Response (FIR) filters are basic DSP building blocks that have been studied and analyzed thoroughly over the past few decades.  One might be tempted to assume that the design of FIR filters can be handled with a ‘cookbook’ approach.  In other words, a user might just specify a few high-level design parameters and let the software do the rest, down to the implementation details.

 

While a few situations lend themselves to the scenario above, there are many where it is necessary to explore a range of options in the overall design-space.  Some of the factors that make this task complicated and time-consuming relate to fixed-point quantization: the arithmetic and coefficient word lengths can vary, and each choice affects product performance.  Added to this are implementation details such as inserting pipeline registers to maximize throughput, or optimizing for area.  When targeting FPGAs or ASICs it is often necessary to explore the tradeoffs between achievable sample rates and system resource usage, and going through the design flow for each alternative involves a lot of effort.

 

The Synplify DSP tool allows you to rapidly explore a number of architectural alternatives from a high level (e.g. MATLAB/Simulink [1]) domain by automating the synthesis of the logic architecture and evaluating the gate-level implementation results.  This design exploration flow is shown in Figure 1. 

 


 

Figure 1:  DSP synthesis enables rapid design-space exploration

 

We illustrate the process with a design example which highlights the power of the Synplify DSP design tool, coupled with other tools from the FPGA design flow.

 

FIR Example:

 

The most basic FIR filter implements the sum-of-products shown in Figure 2 below. The simplest case assumes that the filter coefficients are fixed and the output sample rate is the same as the input sample rate.

 

 

Figure 2:  Basic FIR filter equation (filter length is N). x(n) and y(n) are input and output sequences; h(k) are filter coefficients
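
Written out with the symbols defined in the caption, the sum-of-products takes the standard convolution form. The zero-based index range for k is an assumption of this notation:

```latex
y(n) = \sum_{k=0}^{N-1} h(k)\, x(n-k)
```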

 

The FIR filter above can be implemented in a completely serial fashion (with one time-shared Multiplier-Accumulator, or MAC) or, at the other extreme, completely in parallel, using N MACs for every output point.  The latter yields the highest throughput (roughly N times that of the fully serial implementation at the same clock frequency); however, there is a range of implementations between these two extremes.
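
The range between the two extremes can be sketched in a few lines of Python. This is only a behavioral model, not what any tool generates: the function name `fir_output`, the coefficient values, and the input samples are hypothetical, and each MAC unit is assumed to complete one multiply-accumulate per clock cycle.

```python
from math import ceil

def fir_output(h, window, num_macs):
    """Compute one output sample using `num_macs` MAC units.

    window[k] holds x(n-k). Returns (y, cycles), where `cycles` is the
    number of clock cycles needed if every MAC unit performs one
    multiply-accumulate per cycle.
    """
    n_taps = len(h)
    accumulators = [0] * num_macs
    cycles = ceil(n_taps / num_macs)
    for cycle in range(cycles):
        for mac in range(num_macs):
            k = cycle * num_macs + mac  # tap handled by this MAC in this cycle
            if k < n_taps:
                accumulators[mac] += h[k] * window[k]
    # Combining the per-MAC accumulators models the final adder stage.
    return sum(accumulators), cycles

h = list(range(1, 17))            # 16 hypothetical integer coefficients
window = list(range(16, 0, -1))   # hypothetical samples x(n), x(n-1), ...

y_serial, cycles_serial = fir_output(h, window, num_macs=1)
y_parallel, cycles_parallel = fir_output(h, window, num_macs=16)
y_mid, cycles_mid = fir_output(h, window, num_macs=4)
```

All three calls produce the same output value; only the cycle count per output (16, 1, and 4 respectively) and the number of MAC units differ.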

 

There are other well-known optimizations in FIR implementation, such as using coefficient symmetry to halve the number of multipliers at the cost of more adders, which is often a useful tradeoff.  Another basic difference in FIR architectures is Direct Form versus Transposed Form implementation: in the former, input samples are buffered (i.e. they effectively move through a delay line), whereas in the Transposed Form, partial sums are stored and propagated.  Although the theoretical number of computations is nominally the same, these differences often show up in the word lengths required, control logic, pipeline stages, etc.
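
As a rough illustration of these structures, the sketch below models Direct Form, Transposed Form, and a symmetric (pre-adder) Direct Form in plain Python. The function names, coefficients, and input samples are hypothetical; integer values are used so that the three outputs can be compared exactly.

```python
def fir_direct(h, x):
    """Direct Form: input samples move through a delay line."""
    delay = [0] * len(h)
    out = []
    for sample in x:
        delay = [sample] + delay[:-1]
        out.append(sum(hk * xk for hk, xk in zip(h, delay)))
    return out

def fir_transposed(h, x):
    """Transposed Form: partial sums are stored and propagated."""
    n = len(h)
    state = [0] * n  # partial-sum registers; the last entry is always 0
    out = []
    for sample in x:
        products = [hk * sample for hk in h]
        out.append(products[0] + state[0])
        # Each register takes the next tap's product plus the next register.
        state = [products[k + 1] + state[k + 1] for k in range(n - 1)] + [0]
    return out

def fir_symmetric(h, x):
    """Direct Form exploiting h(k) == h(N-1-k): add sample pairs first,
    then multiply, so only ceil(N/2) multiplications are needed per output."""
    n = len(h)
    assert h == h[::-1], "coefficients must be symmetric"
    delay = [0] * n
    out = []
    for sample in x:
        delay = [sample] + delay[:-1]
        acc = 0
        for k in range(n // 2):
            acc += h[k] * (delay[k] + delay[n - 1 - k])  # pre-add, then multiply
        if n % 2:
            acc += h[n // 2] * delay[n // 2]  # unpaired center tap
        out.append(acc)
    return out

h = [1, 3, 5, 7, 7, 5, 3, 1]             # hypothetical symmetric coefficients
x = [4, -2, 9, 0, 1, 6, -3, 8, 2, 5]     # hypothetical input samples

y_direct = fir_direct(h, x)
y_transposed = fir_transposed(h, x)
y_symmetric = fir_symmetric(h, x)
```

The three functions return identical sequences; what differs is what is stored (samples versus partial sums) and how many multiplications each output needs.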

 

Let us follow an example of a simple FIR filter through a number of design iterations showing some of the tradeoffs mentioned above. 

 

Algorithm Design:

 

The first step is to create a model that includes fixed-point and sample rate behavior.  Synplify DSP software makes it very easy to do this using the powerful features available in Simulink.  Shown below in Figure 3 is a Synplify DSP model of a basic 16-tap FIR filter with fixed coefficients.

 

                  

Figure 3: Simulink block diagram of a 16-tap FIR.  Also shown are the filter design tool, input waveform, and output analysis in time and frequency domains.

 

If we examine the specification of the filter itself, some of the parameters are as shown below in Figure 4. Note that we can specify the fixed-point properties of the input and output data as well as of the coefficients, which is yet another, often important, axis of exploration.

 

 

Figure 4:  Parameters of the Synplify DSP FIR filter block.  Note specified data formats.

 

Successive runs of the Synplify DSP tool, which take place at the Simulink level within MATLAB, give a quick estimate of whether the desired functionality is achieved; the typical progression is to first make sure that the coefficient and data word lengths are sufficient to achieve the desired results. The design and analysis tools of Simulink shown in Figure 3 are very useful for this purpose.  This step then gives us workable specifications for the FIR block (Figure 4). It is often useful to keep a record of how close different word length choices come to the desired (or floating-point) performance.
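
One way to keep such a record outside of Simulink is sketched below in plain Python. Everything in it is an assumption made for illustration: the 16-tap Hamming-windowed low-pass coefficients, the signed fixed-point format with one sign bit and the remaining bits fractional, and the choice of the worst-case frequency-response deviation as the error measure.

```python
import cmath
import math

def quantize(value, word_length, frac_bits):
    """Round to a signed fixed-point value of `word_length` bits, with saturation."""
    scale = 1 << frac_bits
    lo = -(1 << (word_length - 1))
    hi = (1 << (word_length - 1)) - 1
    code = max(lo, min(hi, round(value * scale)))
    return code / scale

def freq_response(h, num_points=64):
    """Complex frequency response sampled at num_points frequencies on [0, pi)."""
    return [
        sum(hk * cmath.exp(-1j * math.pi * i / num_points * k)
            for k, hk in enumerate(h))
        for i in range(num_points)
    ]

def sinc(t):
    return 1.0 if t == 0 else math.sin(math.pi * t) / (math.pi * t)

# Hypothetical floating-point reference: 16-tap Hamming-windowed low-pass.
N = 16
FC = 0.125  # cutoff in cycles per sample (assumed)
h_ref = [
    2 * FC * sinc(2 * FC * (k - (N - 1) / 2))
    * (0.54 - 0.46 * math.cos(2 * math.pi * k / (N - 1)))
    for k in range(N)
]

ref = freq_response(h_ref)
error_by_word_length = {}
for word_length in (8, 12, 16):
    h_q = [quantize(c, word_length, word_length - 1) for c in h_ref]
    resp = freq_response(h_q)
    # Worst-case deviation from the floating-point response.
    error_by_word_length[word_length] = max(abs(a - b) for a, b in zip(ref, resp))
```

The resulting dictionary maps each coefficient word length to its worst-case deviation from the floating-point response, which is the kind of record that makes it easy to pick the shortest word length that still meets the specification.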

 

Algorithm Implementation Using DSP Synthesis:

 

Next we can look at the invocation of the Synplify DSP tool itself, a typical instance of which is shown in Figure 5 below.

 

 

Figure 5: Parameters of a typical run of Synplify DSP. Note that the target device architecture is selected at this stage.

 

Of particular note are the Retiming and Folding options.  Essentially, the Retiming option allows the Synplify DSP tool to modify the architecture, using pipelining and other techniques, to reach the desired performance goal at the expense of latency at the output.  The Folding option allows the design to share hardware at the expense of lower throughput (i.e. it trades off maximum sample rate against resource utilization).
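
The latency cost of Retiming can be pictured with a simplified behavioral model. The sketch below assumes that the added pipeline registers behave, as seen from outside the block, like a pure delay on the output stream; a real retiming pass moves registers inside the datapath, which this model does not attempt to show. The function names, coefficients, and samples are hypothetical.

```python
def fir_direct(h, x):
    """Reference Direct Form FIR with no extra pipeline registers."""
    delay = [0] * len(h)
    out = []
    for sample in x:
        delay = [sample] + delay[:-1]
        out.append(sum(hk * xk for hk, xk in zip(h, delay)))
    return out

def pipeline_registers(samples, stages):
    """Model `stages` registers in series: each adds one clock of latency."""
    regs = [0] * stages
    out = []
    for s in samples:
        regs.append(s)            # new value enters the pipeline
        out.append(regs.pop(0))   # oldest value leaves the pipeline
    return out

h = [1, 2, 3, 2, 1]               # hypothetical coefficients
x = [5, -1, 4, 0, 2, 7, -3, 1]    # hypothetical input samples

y = fir_direct(h, x)
y_retimed = pipeline_registers(y, stages=3)
```

The retimed stream carries the same values as the reference output, shifted later by three clock cycles: the function is unchanged and only the latency grows.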

 

Creating Architectural Alternatives:

 

Once the baseline architecture is established, the exploration can commence. The rest of this article deals with the results of using the Retiming and Folding controls, particularly the latter. As noted in Figure 5’s caption, at this point we need knowledge of the target FPGA’s architecture and resources.  In this particular example we map to a Xilinx Virtex-5 FPGA with a maximum clock frequency of 400 MHz (-10 speed grade).  This family uses a versatile piece of hard-coded logic called a DSP48 slice [2]. A key objective of our design-space exploration in this example is to measure the tradeoff between sharing DSP48 slices and using parallelism for maximum throughput.

 

Evaluating Results:

 

Using the architecturally optimized RTL from Synplify DSP, we can now use logic synthesis tools to evaluate the speed and area results.  For FPGA targets, the Synplify DSP tool writes the RTL in a way that optimizes the mapping of multipliers, registers, and memory for the target device using the Synplify Pro® or Synplify® Premier software.  This includes using hardware multipliers, block memory, and shift register elements where appropriate.  For ASIC targets we can use any commercial ASIC synthesis tool.

 

Figure 6 shows how the Synplify Pro tool reports on speed and resource utilization for a typical run.

 


 

Figure 6: Output from the Synplify Pro tool showing Estimated Frequency and Resource Usage (DSP48s highlighted)

 

As can be seen from the above, the simulation is very efficient, enabling rapid turn around of multiple architectures to find an optimum for a particular design.

 

If we examine the RTL output from the Synplify Pro tool for this run, we can see the implementation details; a typical portion with multiple DSP48 slices is shown below in Figure 7.

 

 

Figure 7: Analyzing results in the Synplify Pro HDL Analyst.  On inspection, this is seen as a Transposed Form structure.

 

The results of a few experiments with different Folding options in this particular FIR filter example are summarized below in Table 1:

 

Folding Factor    Est. Frequency    Max. Throughput    DSP48 Slices
None              414.5 MHz         > 400 Ms/s         16
4                 346.7 MHz         86.7 Ms/s          4
8                 352.0 MHz         44 Ms/s            2
16                338.5 MHz         21.2 Ms/s          1

    

Table 1:  Folding effects on filter throughput and hardware sharing

 

This particular Virtex-5 FPGA has a total of 32 DSP48 slices. The additional hardware resources (LUTs, register bits, etc.) are small (< 4%) in all cases compared to the available resources in the part.  The Synplify tools, of course, provide all the detailed information on timing, constraints, resource utilization, etc., taking into account any extra latencies needed by Retiming and other optimizations.

 

What we can observe from the table is that you can rapidly create more serialized, area-efficient architectures when the sample rates are lower.  The powerful capability of doing this from a single algorithm model allows you to easily map to different technologies and exploit higher clock frequencies without changing and re-verifying the algorithm model.
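
The folded rows of Table 1 are consistent with two simple relationships, which we state here as an inference from the numbers rather than as a documented property of the tool: maximum throughput is the estimated clock frequency divided by the folding factor, and the DSP48 count is the number of taps divided by the folding factor. The Python sketch below reproduces the table entries on that assumption and uses them to pick the most area-efficient option for a required sample rate. The function names are hypothetical.

```python
TAPS = 16                  # filter length used in this example
DEVICE_DSP48_SLICES = 32   # DSP48 slices available in the target part

# (folding factor, estimated clock frequency in MHz) for the folded rows of Table 1
runs = [(4, 346.7), (8, 352.0), (16, 338.5)]

def folded_metrics(folding, f_clk_mhz, taps=TAPS):
    """Return (max throughput in Ms/s, DSP48 slices) for a folded filter.

    Assumes one output every `folding` clocks and taps/folding shared slices.
    """
    throughput = round(f_clk_mhz / folding, 1)
    slices = taps // folding
    return throughput, slices

results = {folding: folded_metrics(folding, clk) for folding, clk in runs}

def max_folding_for(sample_rate_msps):
    """Largest folding factor in `results` that still meets the sample rate."""
    feasible = [f for f, (tput, _) in results.items() if tput >= sample_rate_msps]
    return max(feasible) if feasible else None

def dsp48_utilization(folding):
    """Fraction of the device's DSP48 slices used at this folding factor."""
    return results[folding][1] / DEVICE_DSP48_SLICES
```

For example, `max_folding_for(40)` selects a folding factor of 8 (two DSP48 slices), while a requirement above 86.7 Ms/s returns `None`, indicating that a less folded or unfolded architecture is needed.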

 

Conclusion:

 

What we have shown from the simple FIR example is that the Synplify DSP software provides a rapid and easy way of making architectural tradeoffs while keeping the target implementation in the design loop.  This enables you to explore multiple architectural possibilities, including important fixed point considerations, and obtain useful implementation cost versus performance tradeoffs in an efficient way.  The result is optimal FPGA implementation of high level algorithms while minimizing design time.

 

References:

 

1. MATLAB R2006b, Version 7.3.0.267, The MathWorks, Aug. 2006

2. “DSP: Designing for Optimal Results”, Advanced Design Guide, Xilinx Inc., 2005