Abstract
Use of a generic FPGA board together with a powerful programming environment is investigated. It is demonstrated that high-performance real-time analysis can be achieved at a reasonable cost. The system is simple to integrate with off-the-shelf acquisition systems such as LabVIEW. A flexible wavelet algorithm is programmed on the FPGA and is found to run considerably faster than on the processor of the host computer. The flexibility of the processing unit makes it a promising tool for applications where ordinary desktop computers do not provide enough processing power.
Research-oriented experimental equipment must often offer high-performance computing capabilities and a high degree of flexibility simultaneously. To achieve that, manufacturers of such equipment often rely on specialized devices produced at relatively low volume and high cost.
Building systems based on reusable and reconfigurable components, on the other hand, gives the opportunity of reducing cost without any loss of performance and flexibility. With the increase of the computing power of CPUs in personal computers (PC), PC-based systems have become popular and are used in most situations. Such systems offer standard and recognizable user-interfaces and can easily be extended and reconfigured. Problems appear when the computing power of an available machine is not sufficient, and clusters of PCs must be used. The system will then become more complex, and different side-problems, such as space and external cooling, occur, in addition to the need for experience and knowledge from users.
A Field Programmable Gate Array (FPGA) is a digital chip that can be configured to perform the task of any digital logic design. It may be configured as a general flexible micro-processor that executes conventional C/C++ code at one time and as a highly specific algorithm engine the next time. The (re-)programmability of FPGAs distinguishes them from other Application Specific Integrated Circuits (ASICs).
Using FPGA technology, one can configure a circuit to perform a specific algorithm at very high speed. Compared to traditional ASICs, prototyping time is on the order of hours rather than months, with a cost less than a tenth of that for an ASIC. The drawbacks are that FPGAs run at lower clock frequencies and draw more power. For large volume products, FPGAs are more expensive to produce.
In a scientific laboratory, the FPGA-based technologies allow the scientists to build systems based on standard PC-technology, with hardware acceleration for fast computing, using a standard extension board rather than acquiring specialized processing units.
The use of FPGAs has been demonstrated in various application areas (Herzen 1998, Wickert & Papenfuss 2001, Meribout, Nakanishi & Ogura 2002), and they are also used in a number of measurement and analysis applications (Hernandez et al. 2004, Toledo et al. 2002). Using FPGAs requires, in general, specialized programming skills. Attempts have been made to overcome this limitation by preparing off-the-shelf packages with common algorithms, e.g. FFTs (Uzun, Amira & Bouridane 2005).
In this note, an alternative method is presented for desktop high-performance computing using an FPGA-equipped extension card. It will be illustrated that using a high-level language, the FPGA can be programmed for performing mathematical algorithms without any deep knowledge of the underlying hardware architecture.
In the following, the FPGA technology will be briefly described. An overview of the equipment used in the present work will be given, and the novelties in the use of the technology will be pointed out. A description of a wavelet algorithm and its implementation on the FPGA will follow as an illustration of the possibilities.
Finally, we will discuss some limitations that appear when the processing speed in the measurement system is increased.
An FPGA consists of a large number of small programmable digital logic elements, often referred to as Logic Elements (LEs). An LE typically has four binary inputs and one binary output. The LEs are positioned in a 2D pattern on a chip and are connected through configurable switches, so that different configurations are possible (see figure 1). An LE may be configured to implement any function with four inputs and one output, and by combining several logic elements through the switches, any digital function may be realized.
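The idea that one element can implement any four-input function can be illustrated with a small software model. The sketch below treats a logic element purely as a 16-entry lookup table, which is only the view given in the text; the function names `make_lut4` and `eval_lut4` are hypothetical and do not correspond to any vendor tool.

```python
from itertools import product


def make_lut4(func):
    """Build the 16-entry truth table of a 4-input boolean function.

    This plays the role of the configuration data of one logic element.
    """
    return [int(bool(func(a, b, c, d)))
            for a, b, c, d in product((0, 1), repeat=4)]


def eval_lut4(table, a, b, c, d):
    """Evaluate the element: the four inputs form a 4-bit address (a is the MSB)."""
    return table[(a << 3) | (b << 2) | (c << 1) | d]


# Two different configurations of the same (modelled) element.
parity = make_lut4(lambda a, b, c, d: a ^ b ^ c ^ d)
majority = make_lut4(lambda a, b, c, d: a + b + c + d >= 3)

print(eval_lut4(parity, 1, 0, 1, 1), eval_lut4(majority, 1, 0, 0, 1))
```

Reconfiguring the element amounts to loading a different 16-bit table; the evaluation logic itself never changes.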
Figure 1: Array of logical elements and detail of a typical logic element.
FPGAs with as many as 100,000 logic elements exist today. The typical frequency at which they will run is 150-250 MHz. The large number of logic resources and the fairly high operating frequency make FPGAs operate in the Tera-Instructions-Per-Second range if the system is decomposed into parallel subsystems.
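The order of magnitude can be checked with a rough estimate. It assumes, hypothetically, that every logic element is busy and performs one elementary operation in every clock cycle:

$$
10^{5}\ \text{LEs} \times (1.5\text{ to }2.5) \times 10^{8}\ \mathrm{s^{-1}} = (1.5\text{ to }2.5) \times 10^{13}\ \text{LE operations per second}
$$

This is an upper bound under that assumption. One LE operation is a single four-input logic function rather than a processor instruction, so the figure indicates the available parallelism and is not a directly comparable instruction rate.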
An FPGA is traditionally programmed using a dedicated programming language, such as Verilog or VHDL. The digital design must first be described at a low abstraction level using one of these languages and then compiled through several development tools before the FPGA can be configured to perform the task. The initial development is carried out by a hardware engineer with typical issues to be resolved such as clock cycles, registers, memory buses, arbitration schemes etc.
The complexity of those tools and the effort needed to manually translate the design specification to a low-level language make it impossible for most engineers to benefit from the advantages the technology offers. This note covers the evaluation of a new method for programming FPGAs: the Mitrion compiler and its integration with a standard acquisition package (LabVIEW).
The Mitrion compiler allows high-level programming of FPGAs with a language similar to conventional software languages such as C, extended with a few constructs to make it better suited for parallel execution. Programming with Mitrion C does not require the programmer to handle hardware design issues such as timing constraints and synchronisation of data; these details are all taken care of by the compiler.
A conventional compiler translates C to assembler, which is executed one instruction at a time if one processor is available. The C code can be easily compiled for any processor, and programs of any size and complexity may be executed on the same processor if execution time is not critical.
The Mitrion compiler creates a data flow representation of the program, which is mapped directly to the resources on the target FPGA. The compiler optimizes the mapping so that any nodes that are not interdependent are executed in parallel, resulting in reduction of the total execution time.
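How a data flow representation exposes parallelism can be sketched in a few lines. The code below is a generic illustration and not the Mitrion compiler's actual algorithm: it groups the nodes of a hypothetical dependency graph into levels, so that all nodes within one level are mutually independent and could execute at the same time. The function name `schedule_levels` and the example graph are assumptions made for illustration.

```python
def schedule_levels(deps):
    """Group nodes into levels of mutually independent nodes.

    deps maps each node name to the list of nodes whose results it needs.
    A node is placed one level after the deepest node it depends on.
    """
    level = {}

    def depth(node):
        if node not in level:
            level[node] = 1 + max((depth(d) for d in deps[node]), default=-1)
        return level[node]

    for node in deps:
        depth(node)
    n_levels = max(level.values()) + 1
    return [sorted(n for n in level if level[n] == i) for i in range(n_levels)]


# Hypothetical expression y = (a * b) + (c * d).
deps = {
    "a": [], "b": [], "c": [], "d": [],
    "ab": ["a", "b"],
    "cd": ["c", "d"],
    "y": ["ab", "cd"],
}
levels = schedule_levels(deps)
for i, nodes in enumerate(levels):
    print(i, nodes)
```

In this example the two multiplications end up in the same level, so seven nodes are covered in three steps where a strictly sequential processor would take one step per operation.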
This compiler can be used with any FPGA. The FPGA used in the present work was obtained from Mitrionics. The FPGA was delivered together with memory on a PCI board, compatible with standard PCs. Once installed, the board can be configured with a configuration file obtained from the compiler, and then the processing unit is ready to perform the task assigned to it.
In order to demonstrate the feasibility of using the Mitrion technology in a typical laboratory setup, it was integrated with an existing system developed in LabVIEW. The system is used for wavelet-transform analysis of measurement data acquired by an AD-board. The setup is sketched in figure 2. The processing unit developed in Mitrion C is accessed by linking to an external library. For the LabVIEW user, the FPGA is well integrated, and after compilation, a call to the FPGA is similar to a call to any subroutine (sub-VI).
Figure 2: Schematic of the setup.
The information flow is illustrated in figure 3. First, the signal is read from the input buffer on the AD-board and then transferred to the Mitrion board (possibly after some preprocessing), where the analysis takes place. Once the data is analyzed, the results are available and can be presented to the user.
Figure 3: Block diagram of the algorithm.
The continuous wavelet transform, Ψ(k,t) where k is the frequency of events and t is time, of a signal f(t) is defined as:
$$\Psi(k,t) = k^{0.5} \int_{-\infty}^{\infty} f(y)\, g[k(y-t)]\, dy. \qquad (1)$$
The function g(t) is a wavelet function, which must integrate to zero over the real line. The wavelet transform can be understood as the correlation of the signal under study with a number of stretched versions of the wavelet function. The amplitude of Ψ(k,t) will be large for values of k and t where the signal f(t) has a structure similar to the wavelet function g(kt).
The wavelet transform can be calculated by Fast Fourier Transforms as sketched in figure 3 and described below.
First, the FPGA is initialized and the Fourier transforms of all the stretched wavelet functions are put on the memory of the Mitrion board.
A large number of Fourier-transformed signal sequences to be analyzed are sent to the card at once. This is done in order to avoid overloading the PCI-bus. This will be discussed later.
The processing unit on the FPGA calculates the wavelet transform by multiplying each transform of the wavelet function with the transform of the signal and then transforming it back.
Finally, the data needed from the wavelet transform is extracted and returned to the host PC over the PCI-bus.
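The steps above can be sketched in software with NumPy. This is a minimal reference sketch, not the FPGA implementation: it assumes a Mexican-hat wavelet (the text does not name the wavelet used), a uniformly sampled periodic signal, and hypothetical function names. The stored wavelet transforms are conjugated so that the product in Fourier space yields the correlation of equation (1) rather than a convolution.

```python
import numpy as np


def mexican_hat(x):
    """Assumed wavelet g(x); its integral over the real line is zero."""
    return (1.0 - x**2) * np.exp(-0.5 * x**2)


def wavelet_bank_fft(n, scales, dt=1.0):
    """Step 1: Fourier transforms of all stretched wavelets, computed once."""
    # Sample offsets 0, 1, ..., -2, -1 in FFT (wrapped) order.
    offsets = np.fft.fftfreq(n, d=1.0 / n) * dt
    bank = np.array([np.sqrt(k) * mexican_hat(k * offsets) for k in scales])
    # Conjugation turns the product below into a correlation (real wavelet).
    return np.conj(np.fft.fft(bank, axis=1)) * dt


def cwt_fft(signal, bank_hat):
    """Steps 2 and 3: transform the signal, multiply, and transform back."""
    signal_hat = np.fft.fft(signal)
    return np.fft.ifft(bank_hat * signal_hat, axis=1).real


def cwt_direct(signal, scales, dt=1.0):
    """Reference: direct (slow) circular correlation, for checking only."""
    n = len(signal)
    idx = np.arange(n)
    out = np.empty((len(scales), n))
    for i, k in enumerate(scales):
        for t in range(n):
            lag = ((idx - t + n // 2) % n - n // 2) * dt  # wrapped y - t
            out[i, t] = np.sqrt(k) * np.sum(signal * mexican_hat(k * lag)) * dt
    return out


scales = [0.1, 0.2, 0.4]
n = 512
signal = mexican_hat(0.2 * (np.arange(n) - 200.0))  # one event at sample 200
psi = cwt_fft(signal, wavelet_bank_fft(n, scales))
print(psi.shape, int(np.argmax(psi[1])))
```

Each row of `psi` corresponds to one value of k. In a batched setting, `cwt_fft` would be called for many signal vectors against the same precomputed bank, which mirrors the initialization step being carried out only once.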
The algorithm relies on an implementation of the inverse FFT, the Decimation-in-Frequency (DIF) radix-2 FFT shown in figure 4, with floating-point arithmetic. A 512-point IFFT requires log2 512 = 9 stages of butterflies. This structure can be pipelined with a throughput of 1/256 datawords/clock cycle if
1) a butterfly is either a one-cycle implementation or pipelined with a throughput of 1,
2) the buffer memories are dual ported and
3) the buffer memories are double buffered.
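The stage structure can be made concrete with a software model of a radix-2 DIF inverse FFT. The sketch below is a sequential reference written for illustration (the function names are hypothetical, and it says nothing about the pipelining or buffering of the hardware design); it counts stages and butterflies so the numbers for a 512-point transform can be checked.

```python
import numpy as np


def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of the integer i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def ifft_dif_radix2(x_hat):
    """Inverse FFT by radix-2 decimation in frequency.

    Returns (result, number_of_stages, number_of_butterflies).
    """
    x = np.asarray(x_hat, dtype=complex).copy()
    n = x.size
    stages = n.bit_length() - 1
    if n < 2 or (1 << stages) != n:
        raise ValueError("length must be a power of two, at least 2")
    butterflies = 0
    half = n // 2
    while half >= 1:
        # Inverse-transform twiddle factors for blocks of size 2 * half.
        w = np.exp(2j * np.pi * np.arange(half) / (2 * half))
        for start in range(0, n, 2 * half):
            top = x[start:start + half].copy()
            bot = x[start + half:start + 2 * half].copy()
            x[start:start + half] = top + bot
            x[start + half:start + 2 * half] = (top - bot) * w
            butterflies += half
        half //= 2
    # DIF leaves the results in bit-reversed order; undo that and scale by 1/n.
    order = [bit_reverse(i, stages) for i in range(n)]
    return x[order] / n, stages, butterflies


rng = np.random.default_rng(0)
spectrum = rng.standard_normal(512) + 1j * rng.standard_normal(512)
result, stages, butterflies = ifft_dif_radix2(spectrum)
print(stages, butterflies // stages)
```

For 512 points the model uses 9 stages with 256 butterflies each.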
Figure 4: Schematic of the implementation of the FFT algorithm.
The Mitrion PCI platform has four memory interfaces of 64 bits. Single-precision floating-point complex numbers, used in the present case, require 64 bits each. Each input vector implies reading 2 × 512 complex values (the inputs to the correlation) and writing 512 complex numbers (the result). The maximum throughput is 1/512 datawords/clock cycle, so there is no need to achieve 1/256 datawords/clock cycle in the butterfly stages; 1/512 datawords/clock cycle is sufficient. The precision could easily be extended for a larger device. The size of the processing unit has been optimized by reuse of the floating-point blocks and by reduction of the precision within the inverse Fourier transform block.
During the implementation of the algorithm shown in figure 3, the performance of the PCI-bus and of the DLL-link in LabVIEW was found to be a strongly limiting factor. These bottlenecks can be mitigated by sending a large number of vectors to the FPGA in each call (100 vectors in the present study). Distributed over 100 vectors, the transfer penalty becomes acceptable. After the calculations, some user-defined information such as peak values and positions, typically of the same size as the original data, is returned from the FPGA.
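The effect of batching can be illustrated with a simple cost model in which every call to the board carries a fixed overhead that is shared by all vectors in the batch. The model and all timing values below are hypothetical placeholders chosen for illustration; they are not measurements from the present setup.

```python
def time_per_vector(batch_size, call_overhead_s, transfer_s, compute_s):
    """Average time per vector when a fixed call overhead is shared by a batch."""
    return call_overhead_s / batch_size + transfer_s + compute_s


# Hypothetical timings: 1 ms per call, 20 us transfer and 10 us compute per vector.
single = time_per_vector(1, 1e-3, 20e-6, 10e-6)
batched = time_per_vector(100, 1e-3, 20e-6, 10e-6)
print(single / batched)
```

With these placeholder values the fixed overhead dominates for single-vector calls and drops to a fraction of the per-vector cost for batches of 100.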
The performance gain, compared to a software implementation run on a 2800 MHz Pentium 4 machine, is demonstrated with 1D wavelet transforms, as presented in table 1. The table shows a speed-up of about a factor of 6.
Table 1: Calculation times for the wavelet calculations when implemented in software and on the FPGA.
In this note, the potential of FPGAs for flexible high-performance computing was studied. It was shown that using Mitrion C, processing units can be developed without any requirement of deep hardware knowledge and can be integrated in off-the-shelf measurement systems. In the development of such systems, special attention must be paid to the communication bandwidth between different units. As an example, in the present set-up, the PCI bandwidth proved to be a bottleneck. If the data collection, the FPGA, and the memory chips are integrated on the same board, the bandwidth issues are overcome. Hardware products satisfying these conditions are available on the market today and, combined with efficient and powerful FPGA programming tools such as Mitrion, offer a flexible online-analysis tool with a performance comparable to PC clusters. Both FPGA-based hardware solutions and systems for their convenient use are experiencing fast development. New, better and more efficient systems can be expected in the near future.
References
B. von Herzen, "Signal processing at 250 MHz using high-performance FPGA's," IEEE transactions on very large scale integration (VLSI) systems 6, 238 (1998).
M. A. Wickert and J. Papenfuss, "Implementation of a real-time frequency-selective RF channel simulator using a hybrid DSP-FPGA architecture," IEEE transactions on microwave theory and techniques 49, 1390 (2001).
M. Meribout, M. Nakanishi, and T. Ogura, "Accurate and real-time image processing on a new pc-compatible board," Real-time imaging 8, 35 (2002).
Á. Hernández, J. Ureña, J. J. Garcia, M. Mazo, D. Hernanz, J.-P. Dérutin, and J. Sérot, "Ultrasonic ranging sensor using simultaneous emissions from different transducers," IEEE transactions on ultrasonics, ferroelectrics, and frequency control 51, 1660 (2004).
J. Toledo, H. Müller, J. Buytaert, F. Bal, A. David, A. Guirao, and F. J. Mora, "A plug and play approach to data acquisition," IEEE transactions on nuclear science 49, 1190 (2002).
I. S. Uzun, A. Amira, and A. Bouridane, "FPGA implementations of fast Fourier transforms for real-time signal and image processing," in IEEE proceedings-vision image and signal processing 152 3 (IEEE institute of electrical engineering, Hertford, England, 2005), pp. 283–296.
1 Stravus Engineering, Bankgatan 14b, S-223 52 Lund, Sweden
2 Dept. Mechanics, KTH, S-100 44 Stockholm, Sweden
3 ABB, Box 6610, S-721 57 Västerås, Sweden
4 STFI-Packforsk AB, Box 5604, S-114 86 Stockholm, Sweden