FPGAs, possibly the most powerful processors in existence today by many measures, were never intended to be processors at all. Conceived as general-purpose programmable logic devices, their simple arrays of logic elements were not designed to accelerate computationally intensive tasks. Instead, they fell into the role in a Rube-Goldbergian fashion – evolving their processing prowess over a decade or more with engineering’s version of emergent behavior rather than ground-up purposeful design.
Today, FPGAs are used in many applications for heavy-lifting chores such as video processing, often paired with conventional processors charged with handling the housekeeping. While the hardware is clearly capable, the ad-hoc programming model is anything but straightforward. Typically, FPGAs achieve “accelerator” status only after the algorithm goes through a bizarre set of transformations: from some high-level description to a bit-true, cycle-accurate representation, then to a register-transfer-level description, then through logic synthesis to a netlist, and finally through mapping and place-and-route to a bitstream. Does this seem convoluted yet?
Several companies have been founded in the past several years on the proposition that there’s got to be an easier way. Surely we could design something from the ground up to accelerate computationally intense algorithms, avoiding the inherent complexities of FPGA design, while maintaining the performance and power benefits of highly parallel processor architectures.
Stretch, one of the best known of those companies, has just launched a new processor, dubbed the S6, that incorporates Tensilica’s Xtensa core with Stretch’s Instruction Set Extension Fabric (ISEF). At this level, the architecture resembles many modern FPGAs, particularly those with integrated processor cores. The ISEF has the same function in an accelerated processor that the LUT fabric has in an FPGA: to provide parallel processing power for algorithmic hot spots. This part of the S6 architecture is nothing new; Stretch’s previous-generation processor took the same approach, although the new family has significant refinements. The big news with S6 is a selection of dedicated acceleration elements pre-designed for multimedia processing algorithms.
With the new S6 design, Stretch has created almost a multimedia ASSP coloring book, with plenty of blank pages left for the specific value-added features of any given multimedia application. In the new dedicated programmable accelerator, highly optimized implementations of several functions commonly used for video and image processing, software-defined wireless protocols, and audio processing are pre-implemented and available to the programmer through APIs (meaning no hardware design is required). Specifically, there are blocks for motion estimation (for video encoding), entropy encoding (for H.264 video), encryption/decryption (including AES, DES, 3DES), and audio CODECs (AAC, AC3, MP3, etc.).
As an example of the acceleration available, Stretch says that the motion estimation function can process an entire 16x16-pixel macroblock sum-of-absolute-differences (SAD) operation in a single cycle. With the pipelining available in the hardware accelerator, 256 absolute-difference calculations (one per pixel of the macroblock) are made on each clock cycle, and all 41 possible H.264 sub-macroblock combinations are returned with corresponding SAD values.
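To make the figure of 41 concrete, here is a minimal software sketch in plain C of what such a block computes. It is not Stretch's API or implementation: the names `mb_sad_t` and `mb_sad_all` are hypothetical, and both blocks are assumed to share one row stride. It accumulates the 256 absolute differences into sixteen 4x4 SADs, then sums those upward to get the SAD for every H.264 partition size (16 + 8 + 8 + 4 + 2 + 2 + 1 = 41 values).

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define MB 16  /* macroblock edge in pixels */

/* One SAD per H.264 partition of a macroblock: 16+8+8+4+2+2+1 = 41 values. */
typedef struct {
    uint32_t s4x4[4][4];  /* 16 blocks, indexed [row][col]    */
    uint32_t s8x4[4][2];  /*  8 blocks, 8 wide by 4 tall      */
    uint32_t s4x8[2][4];  /*  8 blocks, 4 wide by 8 tall      */
    uint32_t s8x8[2][2];  /*  4 blocks                        */
    uint32_t s16x8[2];    /*  2 blocks: top half, bottom half */
    uint32_t s8x16[2];    /*  2 blocks: left half, right half */
    uint32_t s16x16;      /*  1 block: the whole macroblock   */
} mb_sad_t;

/* Hypothetical helper: cur and ref point at the top-left pixel of each
 * 16x16 block; stride is the distance in bytes between rows (assumed to
 * be the same for both blocks). */
void mb_sad_all(const uint8_t *cur, const uint8_t *ref, int stride,
                mb_sad_t *out)
{
    int r, c, x, y;

    assert(stride >= MB);

    for (r = 0; r < 4; r++)
        for (c = 0; c < 4; c++)
            out->s4x4[r][c] = 0;

    /* 256 absolute differences, accumulated into the sixteen 4x4 SADs. */
    for (y = 0; y < MB; y++)
        for (x = 0; x < MB; x++)
            out->s4x4[y / 4][x / 4] +=
                (uint32_t)abs(cur[y * stride + x] - ref[y * stride + x]);

    /* Larger partitions are sums of the smaller ones they contain. */
    for (r = 0; r < 4; r++)
        for (c = 0; c < 2; c++)
            out->s8x4[r][c] = out->s4x4[r][2 * c] + out->s4x4[r][2 * c + 1];
    for (r = 0; r < 2; r++)
        for (c = 0; c < 4; c++)
            out->s4x8[r][c] = out->s4x4[2 * r][c] + out->s4x4[2 * r + 1][c];
    for (r = 0; r < 2; r++)
        for (c = 0; c < 2; c++)
            out->s8x8[r][c] = out->s8x4[2 * r][c] + out->s8x4[2 * r + 1][c];
    for (r = 0; r < 2; r++) {
        out->s16x8[r] = out->s8x8[r][0] + out->s8x8[r][1];
        out->s8x16[r] = out->s8x8[0][r] + out->s8x8[1][r];
    }
    out->s16x16 = out->s16x8[0] + out->s16x8[1];
}
```

In software this takes hundreds of operations per candidate position. The 256 differences are mutually independent, which is the property that lets dedicated hardware compute them side by side.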
With S6, Stretch has also streamlined the process of combining processors into a multiprocessor array. The new architecture employs what Stretch calls a processor array interface that abstracts away inter-chip communications, allowing multiple devices to collaborate. While this in no way trivializes the difficult task of optimizing a multiprocessor environment, it should at least simplify the housekeeping on the hardware side.
While every “alternative” architecture processor can crunch mountains of data with ferocious abandon, the problem of getting that data on and off the chip is often overlooked. Stretch has built a robust and purposeful I/O infrastructure into S6, aimed at the specific class of applications they’re targeting and designed to reduce the need for external I/O devices and glue logic. This includes a “quad data port” composed of four 10-bit data ports that can interface directly to a variety of video devices using standards like BT656 (standard definition) and BT1120 (high definition) or directly with raw video. Also included are a 10/100/1000 Ethernet MAC, DDR/DDR2 interfaces, serial interfaces, and eGIB/GPIO.
The ISEF also warrants discussion, as the basic arrangement of the ISEF and the Xtensa processor is what most differentiates the Stretch approach. The Tensilica Xtensa VLIW processor core is configured to allow complete hardware accelerators for user-supplied algorithms to be called as single instructions by the host processor. This makes high-level invocation of the accelerator a pure programming task with no direct connection to the mechanism of hardware acceleration. Since the parts of programs that need accelerating typically amount to arithmetic on large arrays inside loops, the Stretch ISEF is built to mirror just that. The S6 ISEF contains 4096 ALUs that can be configured in various groupings to achieve a variety of bit widths. There are also 64 dedicated 8x16 hardware multipliers that can be ganged to create wider operations. A rich set of registers, muxes, priority encoders and shifters allows these datapath elements to be configured in a variety of ways and provides storage for coefficients and intermediate results.
To provide programmatic access to all that hardware, Stretch provides a tool flow that automates the creation of extension instructions in the ISEF. The algorithm is described in C, and the compiler and cycle-accurate simulator allow the programmer to see the performance improvement provided by the ISEF accelerator. Code fragments are tagged for compilation to the ISEF, and the compiler searches for opportunities for parallelism by unrolling loops and analyzing data dependencies. The parallelized structure is sent through a place-and-route algorithm to map it onto ISEF resources. At this point, the compiler generates a report showing how much of the ISEF’s resources is used and how much remains for further acceleration.
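As a rough illustration of the loop unrolling and dependency analysis described above, the sketch below shows the same 16-tap dot product twice in plain C. It deliberately uses no Stretch-specific keywords or tagging syntax; the function names `dot16_loop` and `dot16_tree` and the choice of a 16-tap kernel with 8-bit coefficients and 16-bit samples are assumptions made for illustration. The first form is what a programmer would naturally write. The second exposes the structure a parallelizing compiler looks for: sixteen products with no dependencies among them, followed by a four-level adder tree.

```c
#include <assert.h>
#include <stdint.h>

#define TAPS 16

/* Natural form: one multiply-accumulate per iteration, so each iteration
 * depends on the running sum from the one before it. */
int32_t dot16_loop(const int16_t *x, const int8_t *h)
{
    int32_t acc = 0;
    int i;

    assert(x && h);
    for (i = 0; i < TAPS; i++)
        acc += (int32_t)x[i] * h[i];
    return acc;
}

/* Restructured form: the 16 products are independent of one another and
 * could all be formed at once; the sums then collapse in log2(16) = 4
 * levels instead of 16 sequential additions. */
int32_t dot16_tree(const int16_t *x, const int8_t *h)
{
    int32_t p[TAPS];
    int i, span;

    assert(x && h);
    for (i = 0; i < TAPS; i++)          /* 16 independent 8x16 products */
        p[i] = (int32_t)x[i] * h[i];
    for (span = TAPS / 2; span >= 1; span /= 2)  /* spans 8, 4, 2, 1 */
        for (i = 0; i < span; i++)
            p[i] += p[i + span];
    return p[0];
}
```

Both functions return the same value on a conventional processor, where the second form buys nothing. The difference only matters when the products and each adder level can be mapped onto parallel hardware, which is the step the tool flow is said to automate.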
Since the ISEF can be reconfigured very quickly (Stretch claims 27 microseconds), code can be architected so that the ISEF is reconfigured for different tasks at different stages of the process. This effectively multiplies the acceleration available when complex algorithms can be broken down into sequential stages, each with highly parallelizable components.
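The trade-off is simple arithmetic, sketched below in C. The only figure taken from Stretch is the 27-microsecond reconfiguration time; the function names and the per-stage timings a caller would pass in are hypothetical, and the model assumes one reconfiguration before every stage with no overlap between reconfiguration and computation.

```c
#include <assert.h>
#include <stdint.h>

#define RECONFIG_US 27u  /* reconfiguration time claimed by Stretch */

/* Total time for a pipeline of stages when the fabric is reconfigured
 * before each stage (assumed model; no overlap with computation). */
uint32_t staged_time_us(const uint32_t *accel_us, int n_stages)
{
    uint32_t total = 0;
    int i;

    assert(accel_us && n_stages >= 0);
    for (i = 0; i < n_stages; i++)
        total += RECONFIG_US + accel_us[i];
    return total;
}

/* Nonzero when offloading a stage still wins after paying for the
 * reconfiguration that precedes it. */
int stage_worth_offloading(uint32_t software_us, uint32_t accel_us)
{
    return accel_us + RECONFIG_US < software_us;
}
```

Under this model, three stages cost 81 microseconds of reconfiguration in total, so swapping configurations pays off whenever each stage runs for much longer than 27 microseconds in software and is substantially faster in the fabric.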
The ISEF is fed by thirty-two 128-bit-wide registers. It also contains 64KB of embedded RAM distributed through the fabric in 32 banks of 2KB. In much the same fashion as FPGA block RAM, this RAM can be used as storage for intermediate data, coefficients, etc. The RAM is mapped into the address space so it can be loaded directly by the processor, and it also has a dedicated DMA channel so it can be loaded without processor intervention, which avoids a frequent bottleneck in other custom-instruction-based acceleration schemes.
As with any exotic, alternative-architecture processor, we believe the proof of the pudding for S6 will be in the programming model. If Stretch has created an architecture whose features can be easily harnessed by the average programmer, the performance, price, and power consumption of the device will be compelling enough reasons for adoption in a wide variety of devices. If the programming model proves too cumbersome, however, all that elegant hardware will be waiting for only the few with the wherewithal to dive in and master yet another new high-performance computing programming paradigm.