Many high-performance signal processing products are now being implemented in field-programmable gate arrays (FPGAs). FPGAs can offer an order of magnitude performance increase over standard DSP chips, making programmable logic a natural choice for high-performance DSP electronics. There are typically two groups involved in the design and realization of DSP algorithms in an FPGA: DSP architects and hardware design engineers. Unfortunately, there is a "wall of abstraction" between the architects who formulate the algorithms and the design engineers who are charged with their physical implementation (Figure 1).
This article reviews past attempts to build a bridge between the DSP design domain and physical implementations. Also discussed are the ways in which each of these conventional approaches falls short. Next, we introduce a new DSP design methodology—true DSP synthesis—which integrates into existing design flows without any disruption. This solution bridges the gap between the algorithmic and implementation domains by automating the processes of system-level optimization and implementation-level mapping.
Background: RTL, MATLAB, and Simulink
Hardware design engineers typically visualize their world as a collection of blocks described in a hardware description language (HDL) such as Verilog and/or VHDL. These blocks are captured at a very low level of abstraction referred to as register transfer level (RTL). In contrast, DSP architects typically create, explore, evaluate, and analyze DSP algorithms at a very high level of abstraction. These evaluations are usually performed using the de facto industry standard MATLAB environment from The MathWorks, Inc. It should be noted that MATLAB refers both to a language and an algorithmic-level simulation, visualization, and analysis environment. In order to avoid confusion, it is common to refer to M-code (meaning "MATLAB code") and M-files (meaning "files containing MATLAB code").
MATLAB allows DSP architects to represent a complex signal transformation, such as an FFT, using a single statement along the lines of:
y = fft(x);
This means that MATLAB can be used to describe complex systems in a concise manner that is suitable for analysis at a very high level of abstraction.
However, real-world digital signals are characterized by varying sampling rates and discrete amplitudes. Thus, in order to verify DSP algorithms to a level suitable for implementation, the verification environment must efficiently represent discrete time in multi-rate DSP applications. Unfortunately, MATLAB does not support the concept of discrete time. Furthermore, the native simulation engine in MATLAB is optimized for floating-point mathematical operations. MATLAB slows down significantly with fixed-point representations, in which each operation is "wrapped" with checks for overflow, underflow, rounding, and so forth.
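To make the cost of "wrapped" fixed-point operations concrete, here is a minimal Python sketch (not M-code, and not how MATLAB implements it). The function names, the 16-bit word with 8 fractional bits, and the choice of saturation on overflow are all assumptions made for illustration.

```python
def to_fixed(value, frac_bits=8, word_bits=16):
    """Quantize a float to a signed fixed-point integer (hypothetical format)."""
    scaled = round(value * (1 << frac_bits))       # rounding check
    hi = (1 << (word_bits - 1)) - 1                # largest representable value
    lo = -(1 << (word_bits - 1))                   # smallest representable value
    return max(lo, min(hi, scaled))                # overflow check: saturate


def fixed_mul(a, b, frac_bits=8, word_bits=16):
    """Multiply two fixed-point integers, rounding and saturating the result."""
    product = a * b                                # carries 2 * frac_bits fraction bits
    rounded = (product + (1 << (frac_bits - 1))) >> frac_bits  # round, then rescale
    hi = (1 << (word_bits - 1)) - 1
    lo = -(1 << (word_bits - 1))
    return max(lo, min(hi, rounded))               # overflow check: saturate


def to_float(x, frac_bits=8):
    """Convert a fixed-point integer back to a float for inspection."""
    return x / (1 << frac_bits)


# 1.5 * 2.0 fits in the word; 100.0 * 100.0 overflows and saturates.
small = fixed_mul(to_fixed(1.5), to_fixed(2.0))
large = fixed_mul(to_fixed(100.0), to_fixed(100.0))
print(to_float(small), to_float(large))
```

A single floating-point multiply becomes a multiply, an add, a shift, and two comparisons here, which suggests why a simulator tuned for floating point slows down when every operation carries such checks.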
In order to address these issues, The MathWorks introduced the Simulink environment, which has been designed to natively handle DSP issues such as multi-rate discrete time definitions and fast, efficient simulation for both floating-point and fixed-point design representations. At the time of this writing, Simulink is already being employed in the majority of existing design environments for which FPGAs are the target implementation technology.
Problems with conventional techniques
Once a suite of DSP algorithms has been proven in Simulink and the system architecture has been defined at a high level of abstraction, hardware design engineers have to transform the design into a physical implementation. Traditional techniques for bridging the gap between the architectural and implementation domains typically fall into two main camps: language translation, and IP instantiation and netlisting, as discussed below.
Before looking at these approaches, it is important to note that there are many ways to implement a given DSP design on a given FPGA. Each implementation will vary in terms of resource utilization (such as the number of logic blocks used) and performance. Thus, the ability to quickly evaluate a wide variety of alternative implementations is critical to achieving high-quality DSP realizations in a timely manner.
Language translation: MATLAB/Simulink to RTL (hand translation)
There are, of course, numerous problems associated with this flow. One key concern is that there is no clear handoff between the DSP architect working in the MATLAB/Simulink domain and the hardware design engineer working in the implementation domain. In fact, this worst-case scenario requires an engineer who is expert in both domains, and there are few such experts. Furthermore, it is very time-consuming to hand-code the RTL for a large, complex design. In turn, modifying (and re-verifying) the RTL to evaluate alternative implementations is difficult and time-consuming. This limits the number of such evaluations the design team can perform, which can easily result in a less-than-optimal implementation. Another concern with this approach is that the implementation will be device-specific. Even though the RTL synthesis tool is capable of targeting the RTL to any FPGA, achieving the best implementation requires that the RTL be coded with a specific device in mind.
Language translation: MATLAB to RTL (auto-interactive translation)
At least one EDA vendor boasts the ability to go directly from MATLAB M-code representations into equivalent RTL. This process, which may be referred to as "algorithmic synthesis," involves language translation requiring a large amount of user interaction (Figure 3).
The way in which this works is as follows. As was previously noted, MATLAB (and M-code) does not support the concept of discrete time. For example, in the case of a "pure" mathematical M-code function such as y = fft(x), it would be possible to present the input with an entire frame of data and immediately receive a corresponding frame of output without any time having elapsed. Not surprisingly, there is no obvious corresponding implementation for such a construct. The solution is for the user to analyze the M-code associated with the DSP design and, for each abstract construct such as y = fft(x), to assign the function a new name such as y = my_fft(x). The user also employs a forms-based interface to specify details as to a specific implementation to be associated with this function. For example, in the case of the "my_fft" instantiation of an FFT, the user will be obliged to select between different ways of implementing the algorithm; to decide whether to perform buffering and storage using FIFOs, registers, or RAM blocks; and to make micro-architecture decisions such as how many pipeline stages to use. Once all of the M-code constructs have been treated in this manner, an appropriate tool can be used to take the new M-code representation, combine it with the library of forms-based functional specifications, and generate a corresponding RTL representation. Once again, a major problem with this approach is that there is no clear handoff between the algorithmic and implementation domains. As before, a worst-case scenario requires an engineer who is expert in both domains. Furthermore, the design team will not have a clear picture of the implementation's performance until the design has been fully converted into RTL.
Language translation: C/C++ to RTL or netlist
An alternative approach is to use a tool that translates a design from a C/C++ representation to either an equivalent RTL description or directly into an implementation-level netlist (Figure 4). However, there is still a wall of abstraction between the DSP architects and the hardware design engineers in charge of the physical implementation. Although both MATLAB and Simulink are capable of generating C/C++ representations, this capability is rarely used to create a hardware implementation of the DSP algorithms. Instead, it is common practice for the design team to generate the C/C++ representation by hand.
Another big problem with this approach is that C/C++ is inherently sequential in nature, so such representations have to be augmented with special keywords called pragmas (for "pragmatic information") that specify concurrency (parallelism), resource-sharing, timing, and so forth. Some of these design flows take the augmented C/C++ representation and synthesize it directly into an implementation-level FPGA representation. Others use a synthesis/translation engine to generate RTL as an intermediate step, and then use traditional RTL synthesis to progress the design into its final implementation. These latter flows are preferred by some because RTL synthesis technology is extremely mature, whereas many C-based synthesis offerings are not.
IP instantiation and netlisting
One big consideration with this solution is that it is vendor-specific; that is, once a design has been created in Simulink using a vendor-supplied library of IP blocks, that design cannot easily be ported to another FPGA vendor's offerings. As usual, a major problem with this technique is that there is no clear handoff between the algorithmic and implementation domains. Although the DSP architects are supposed to be working at a high level of abstraction in the Simulink domain, in reality they are obliged to parameterize every aspect of the models with low-level implementation-specific decisions that are really the purview of the hardware design engineers. In the case of a FIR block, for example, the user will have to specify how to implement the delay line (RAM versus distributed registers), how much latency is to be involved, whether or not to use a shared MAC infrastructure, and so forth.
The solution: true DSP synthesis
To solve these problems, we can look to the methods used for logic synthesis. Rather than requiring the hardware designer to specify the details of the netlist-level implementation, the designer only needs to provide RTL and identify requirements such as timing and area utilization. Given this information, the synthesis engine can rapidly explore a tremendous number of different implementation alternatives and perform appropriate optimizations to ensure that the design meets its objectives (Figure 6a).
A true DSP synthesis solution, such as Synplicity's Synplify DSP, is conceptually similar. The key difference is that the synthesis tool starts with a Simulink representation and outputs RTL (Figure 6b). By chaining a DSP synthesis tool to an RTL synthesis tool, this approach brings the DSP architects and the hardware design engineers into a common environment (Figure 7).
A key feature of a true DSP synthesis solution is an associated architecture-independent, vendor-independent blockset (library) for use with Simulink. In order to aid in the quantization process (the conversion of the initial floating-point representations into their fixed-point counterparts), each of these library elements supports automatic data-type propagation. This means that the user need only specify the fixed-point data types (signed, unsigned, etc.) and bit-widths of selected signals, and derived values will then automatically propagate throughout the design.
Unlike conventional IP solutions, this blockset maintains the entry point for the DSP architects at the pure algorithmic level. That is, the architects are not obliged to define any low-level implementation decisions (such as whether internal storage is to be based on FIFOs, registers, or memory). The only parameters that need to be specified at this level are high-level attributes such as filter coefficients and gain requirements. The resulting Simulink representation therefore has no architectural implications and provides the most appropriate handoff point to the hardware design engineers. These engineers need only inform the DSP synthesis engine as to the target FPGA architecture, the desired sample rate(s) associated with the system, and the speed requirements of the design. The DSP synthesis engine then explores the alternative implementations in order to find the one that best meets these constraints.
The DSP synthesis engine performs system-level optimization techniques such as retiming, resource allocation, scheduling (folding), multichannelization, and architectural selection. In this context, folding refers to taking the operations associated with a datapath and folding those operations onto fewer resources operating at a higher rate. For example, consider a FIR filter with 100 taps (stages) running at 1 MHz. Each tap has an associated multiplier and adder function.
As opposed to using 100 multipliers and 100 adders running at 1 MHz, an equivalent filter can be created using only one multiplier and one adder running at 100 MHz with the intermediate results being stored in memory. With regard to multichannelization, consider a video signal in which the same DSP operations are required to be performed on the Red, Green, and Blue channels. In this case, the user need only identify one channel and instruct the DSP synthesis engine to use it for multiple signals if it can. If the sample rate is low enough compared to the system clock, the synthesis engine will automatically identify the additional channels and apply the multichannelization technique to them.
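The folding trade-off described above can be sketched in a few lines of Python. This is a behavioral illustration only, not the algorithm used by any particular synthesis tool; the function names and the cycle-counting convention (one multiply-accumulate per clock in the folded version) are assumptions.

```python
def fir_parallel(samples, coeffs):
    """Fully parallel FIR: one multiplier and adder per tap, one clock per sample."""
    delay = [0] * len(coeffs)                      # tap delay line
    out = []
    for x in samples:
        delay = [x] + delay[:-1]                   # shift the new sample in
        out.append(sum(c * d for c, d in zip(coeffs, delay)))
    return out, len(samples)                       # (outputs, clock cycles used)


def fir_folded(samples, coeffs):
    """Folded FIR: a single multiply-accumulate unit reused for every tap."""
    delay = [0] * len(coeffs)                      # stands in for the memory
    out = []
    cycles = 0
    for x in samples:
        delay = [x] + delay[:-1]
        acc = 0
        for k in range(len(coeffs)):               # one MAC operation per clock
            acc += coeffs[k] * delay[k]
            cycles += 1
        out.append(acc)
    return out, cycles


def required_clock_hz(sample_rate_hz, taps):
    """Clock rate a single MAC needs in order to keep up with the sample rate."""
    return sample_rate_hz * taps


coeffs = list(range(1, 101))                       # 100 arbitrary integer taps
samples = [3, -1, 4, 1, -5, 9, 2, -6]
par_out, par_cycles = fir_parallel(samples, coeffs)
fold_out, fold_cycles = fir_folded(samples, coeffs)
print(par_out == fold_out, par_cycles, fold_cycles)
print(required_clock_hz(1_000_000, len(coeffs)))
```

Both versions produce identical outputs; the folded one simply spends 100 clock cycles per sample on a single multiplier, which is why a 1 MHz sample rate calls for a 100 MHz clock.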
All libraries are not created equal!
As an analogy, suppose that we are designing a military training course comprising a number of halls (blocks), where each hall is connected to the next hall in the chain by a corridor. Now imagine a line of soldiers (samples) jogging down a corridor toward the entrance to the first hall. Initially these soldiers are wearing green uniforms. The soldiers enter the hall one after the other. At some time in the future the soldiers start to emerge, still jogging one after the other, but now they have been "processed" and are wearing blue uniforms. Similarly, once the soldiers have passed through the second hall, they emerge wearing red uniforms, and so it goes.
Algorithmic/math libraries
Now let's consider the way in which the various DSP libraries work. At the highest level of abstraction we have pure algorithmic/math library elements such as the y = fft(x) element provided in MATLAB. As was previously noted, it is possible to present the input to such an element with an entire frame of data and immediately receive a corresponding frame of output without any time having elapsed. This would be the same as if 1,000 soldiers had simultaneously arrived at the entrance to the first hall, and then instantaneously reappeared at the exit. In this case, it's very difficult for us to envisage what is actually taking place inside the hall. Similarly, in the DSP world, the algorithmic/math library is so far removed from a physical realization that there's no way for a tool to automatically infer a functional implementation. This is why the MATLAB to RTL language translation approach discussed earlier requires the user to specify details such as how to perform buffering and how many pipeline stages are to be used.
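The gap between a frame-based math function and a sample-by-sample implementation can be illustrated with a small Python sketch. The `dft` function plays the role of the timeless y = fft(x) call (using a direct DFT for brevity), while the hypothetical `StreamingDFT` class shows one of many possible ways to add buffering and a notion of time; neither name comes from any tool discussed here.

```python
import cmath


def dft(frame):
    """Frame-based transform: a whole frame in, a whole spectrum out, no time."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


class StreamingDFT:
    """Accepts one sample per clock tick; emits a spectrum once a frame is full."""

    def __init__(self, frame_len):
        self.frame_len = frame_len
        self.buffer = []      # buffering choice: a simple list stands in for a FIFO
        self.ticks = 0        # elapsed clock ticks, i.e. discrete time

    def push(self, sample):
        self.ticks += 1
        self.buffer.append(sample)
        if len(self.buffer) < self.frame_len:
            return None       # nothing to output yet: the latency is visible
        frame, self.buffer = self.buffer, []
        return dft(frame)


stream = StreamingDFT(frame_len=4)
results = [stream.push(x) for x in [1, 0, 0, 0]]
print(results[:3], stream.ticks)
```

The frame-based call says nothing about when samples arrive or where they wait, whereas the streaming version has to commit to a buffer and a latency. Those are exactly the kinds of decisions the user is asked to make by hand in the translation flow.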
IP blockset libraries
At the other end of the spectrum we have the highly parameterized IP blockset libraries. In this case, the user is obliged to parameterize every aspect of the models with low-level, implementation-specific decisions. In our military training course world, this would be like making specifications along the lines of: "Corridor A is 10 m long. This corridor leads to Room B, in which each soldier takes off his socks. The soldiers then jog down Corridor C (5 m) to Room D, where ..." Obviously, this approach would be excruciatingly tedious for all concerned. An additional concern with this type of library is that designers are restricted to using only those library elements provided by the FPGA vendor. Although such libraries may contain a relatively large number of elements, these elements may not perform their functions in exactly the same way as required by the designers.
DSP synthesis libraries
Occupying the middle ground, the elements forming a true DSP synthesis library instruct the system what we want to do but not how to do it. This is akin to telling a special "training course generator" application what we require ("The soldiers enter the hall one after the other, change into their new uniforms, and exit the hall.") without being obliged to specify the actual contents of the various halls. Once we have established the high-level functionality associated with each hall, we specify some constraints such as the fact that we wish the soldiers to take no longer than thirty minutes to make their way through the entire training course. Then we leave it up to the application to determine how the insides of each hall should be constructed.
An additional advantage associated with this type of DSP synthesis library is that it is both extensible and reconfigurable. Designers can quickly and easily add custom, complex, parameterizable library elements that take full advantage of the high-level optimization and automatic RTL generation capabilities afforded by the DSP synthesis engine.
Summary
DSP design is currently one of the fastest growing application areas in digital electronics. Both DSP architects and hardware design engineers have robust and proven design tools: MATLAB and Simulink for the architects, and logic simulation and synthesis for the hardware design engineers. Until now, however, there has been no efficient automated methodology to link the two domains. The solution is true DSP synthesis, which is not just another Simulink solution or just another IP-based DSP builder approach. Historically, new EDA solutions requiring radical changes to existing design flows and methodologies have rarely been widely adopted. True DSP synthesis breaks down the wall of abstraction between DSP architects and hardware design engineers. It bridges the gap between algorithmic modeling and hardware realization by automating the processes of system-level optimization and implementation mapping. Although extremely sophisticated, DSP synthesis fully complements and in no way disrupts existing DSP design environments. The result is higher-performance designs that have shorter design cycles and require fewer engineering resources.
NB: Previously published in DSP Designline, July 06