
Optimizing Architectures for Performance and Area using Virtual System Prototypes

With new virtual system prototyping technology, system engineers can make far better-informed design decisions: they can evaluate those decisions under the actual operating conditions of the final system, and measure them against concrete goals such as run-time performance and cost.

The Problem

Traditionally, architecture design for embedded silicon systems has been done with very ad hoc approaches. Architects might use a spreadsheet or other back-of-the-envelope calculations to estimate system performance and cost. These estimates relied heavily on the engineer’s prior experience and “gut feel” for how parameters such as cache size or memory speed might affect design constraints.

Historically, any system-level simulator that may have been used as an aid in this process was likely to have been “home-grown” and primitive such that only high-level activity was simulated; details reflecting actual hardware behavior were abstracted out. Furthermore, stimuli for the simulated system were also likely to be very primitive.

Clearly, a better approach is required in order to make informed decisions at architecture specification time. This paper outlines the use of a high-speed, cycle-accurate virtual system prototype (VSP) simulation environment to arrive at better-optimized architectures.

The System under Design

Before addressing the solution, let us consider the nature of embedded SoC design today. These circuits are heavily used in the automotive, wireless, and consumer markets. They typically contain one or more processors (such as an ARM926E), some form of hierarchical bus interconnect (such as AHB), and multiple supporting peripheral blocks (e.g., a UART or an interrupt controller). Such a system may run an operating system and applications built on top of that operating system. Typical products featuring such a design are cell phones and automotive power-train control systems.

The Solution

Now let us examine what will be required from our architecture exploration and optimization tool so that we may better inform our design decisions.

A virtual system prototype is a high-speed, timing-accurate software model of an embedded hardware system. The VSP is executed within the context of a system-level simulation and includes all aspects of the system being designed: processors, buses, peripheral blocks, and target software applications.

An effective and efficient VSP simulation system must have the following characteristics:

  1. Near Silicon Speed: The solution must be fast enough so that the real software applications written for the SoC may be run on the VSP, including the operating system (OS) and any target application that may run on top of the OS.

  2. Complete System: The solution must model and simulate the whole system (processors, buses, peripherals, external hardware).

  3. Cycle-Accurate: The solution must retain accuracy; i.e., the simulated hardware must have timing associated with it that is reflected in the real hardware. This must also include asynchronous events and multiple clock domains.

  4. Model Library: For the purpose of architecture design productivity and efficiency, the system should offer a portfolio of processor, bus, and peripheral models.  

  5. High-Speed Modeling Method: There must be a proven modeling method by which custom hardware is modeled as high-speed, system-level modules in the VSP, supporting simulation results orders of magnitude faster than traditional RTL simulations.

  6. Binary Compatibility: The solution must be capable of executing, on the modeled processors, the same target images that will be executed by the real hardware; that is, there must be binary compatibility between the simulated and actual systems. The solution must also provide the capability to use commercial debugging and development tools for those applications.

  7. Configurable: The solution must include run-time configurability for as many parameters as possible. That is, no recompilation should be necessary in order to run experiments with different parameters, such as the cache size of the processor models.

  8. Visibility: The solution must expose, for data mining, statistics on the events that occur in the hardware system. For example, the VSP must be able to track instruction counts, cache statistics (hits, misses, fetches), and bus transactions.

The Experimental System

In order to illustrate the utility of using a VSP for architectural optimization, we will run an experiment in which measurements for system performance and cost are gathered and summarized. The SoC under design is representative of what may be found in a hand-held wireless device. The application being used to measure performance will be video decoding.

The system we construct will have the following characteristics:

There are two processor cores; initially, each will be an ARM926E. Each core communicates with memory and other peripherals via a hierarchical bus system. There are sections of “private” memory and peripherals for each core, as well as a shared memory region visible to both processors. A postbox device lets the processors pass messages to each other. Finally, there is a UART driving a simulated console, plus a liquid crystal display (LCD) controller and simulated LCD display. The file system is modeled using a ramdisk device that is part of the system. (See Figure 1.)

Figure 1—The SoC under design

In the software domain, the first processor will boot Linux. Once the boot has completed, a user will log on and get a Linux prompt. At this point the video decode will be initiated by issuing a command at the Linux command prompt on the simulated console device. The presence of Linux indicates an intermediate state of development for the device under design; i.e., we may be driving applications from Linux in order to facilitate testing numerous applications during development before locking down which applications are loaded in the device at productization time.

Upon issuing the decode command, the first processor (Linux) will read in an encoded video frame from the specified MPEG file on the file system and place it in shared memory. Then the Linux core will notify the second core that the data is ready for decoding. The second core will decode the frame via a software decoding algorithm and place the decoded frame in shared memory, at which time it also will notify the first core that a decoded frame is available.

At that point, the first core will initiate a DMA of the decoded frame to the LCD controller which then will drive the simulated LCD display. Meanwhile the first processor will load up the next encoded frame and the process continues until all frames are decoded and displayed.

Experiment and Analysis

In our experiment, we will run the dual ARM core system with a “fast” (and expensive) configuration and a “slow” (and inexpensive) configuration. Then we will swap out the second ARM core for a dedicated digital signal processor (DSP), in our case a StarCore SC3400. Finally, we will move the software decode of the video frames into a hardware ASIC that will be designed into the system.

We will consider system performance and cost in our experiments. The Linux application that manages the video decoding has a built-in calculator that determines the achieved frames per second of the decoded video system. This will be our measure of system performance.

Cost will be loosely determined based on the configuration of the dual-ARM cores, the price of the SC3400, and the cost of developing the dedicated ASIC to do the decoding.

Executing the Experiments in CoMET

To perform the experiments described above, we will be using VaST Systems Technology’s CoMET™ System Engineering Environment. We chose this tool because it fulfills the requirements for system-level simulation laid out previously:

  1. VaST’s CoMET has high performance, typically 20-100 MIPS, depending on the complexity of the platform. Therefore it is possible to run real applications at near real-time speeds.

  2. Engineers and architects may construct VSPs from a portfolio of processor and bus models along with customized models developed by the user.

  3. VaST simulation technology is cycle accurate.

  4.  VaST provides a large library of models of commercially available processors, bus architectures, and peripheral devices.

  5. VaST simulation supports SystemC models and also provides an API so that models may interface directly with the VaST simulation kernel.

  6. Target images may be specified for each processor core in the design; furthermore, third-party debuggers (such as Lauterbach T32) are supported so that users may use their standard environment to debug software in the virtual environment.

  7. Virtual processor model (VPM) parameters (such as cache size, processor speed, etc.) may be specified at run-time and therefore no recompilation of the system model is necessary.

  8. VaST, through its Metrix™ profiling tool, enables data mining and tracking of system events so that true system performance may be more accurately evaluated.

The CoMET environment is shown below.

Figure 2 – The CoMET design environment

Experimental Results

As mentioned above, we have four configurations that we wish to try:

  • 2 ARM cores, slow configuration
  • 2 ARM cores, fast configuration
  • 1 ARM / 1 SC3400
  • 2 ARM cores + hardware ASIC for decoding

Two ARM Cores, Slow Configuration

Figure 3 — A visual summary of the slow configuration

Note that the processor (and bus) speed is 100 MHz, the caches are 8K, and the memories require 10 cycles per read or write access.

The cost factor for this configuration is calculated to be 100 units (base line).

The achieved system performance was found to be 6.7 frames per second (FPS).

Two ARM Cores, Fast Configuration

Figure 4 — A visual summary of the fast configuration

In this configuration, we have 120 MHz processors (and buses), 64K caches, and memories that return data in 5 cycles.

Achieved FPS is 10.8.

Cost factor is 150 units (based on higher-speed component cost).

1 ARM Core, 1 StarCore

Using the “fast” configuration above, but with an SC3400 for the DSP, yields the following results:

Achieved FPS is 15.5.

Cost factor is 200 (more expensive components, plus dedicated DSP).

Hardware ASIC

Fast configuration plus additional hardware for decoding:

Achieved FPS is 30.1.

Cost factor is 400 (additional hardware required for ASIC).

Power usage is 1773 Joules.

 

The results of our experiment are summarized in the following table:

Configuration               Frames per second   Cost Factor
2 ARM926E, Slow             6.7                 100
2 ARM926E, Fast             10.8                150
1 ARM926E, 1 SC3400         15.5                200
2 ARMs + Hardware ASIC      30.1                400

As expected, better system performance comes at greater cost.

Power Analysis

In addition to doing cost and performance analysis, a VSP may be used to get an idea of relative power usage for the system for different configurations under consideration.

This estimated power usage may be done based on heuristics of energy usage per simulated event. These per-unit estimates are based on research of targeted process, core processor power consumption, and peripheral power consumption of representative designs. A sample assignment of events and their estimated energy usage is shown below.

Event               PicoJoules per unit
Instructions        2000
Cache Fetches       1000
Cache Hits          800
Cache Misses        1000
Bus Transactions    800
Memory Accesses     1000

In the case of a VaST VSP, these events may be monitored via the Metrix data mining capability. This is an integrated utility that, at run time, accumulates counts of these transactions, instructions, and other system events that a designer may wish to track. These raw counts may then be fed, at execution time, through a user-produced software filter to configure and display the data in a visual format. Two such displays are shown below as examples:

Figure 5 – Power analysis

Power prediction and analysis is an interesting and complex topic, probably worthy of a separate paper. For the purposes of this discourse, we present a simple example of how it may be enabled by using a VSP.

Summary

As the experiments show, system architects now have the capability to have their decisions more accurately driven by quantitative results from system simulation, using stimuli that accurately reflect the environment in which the design will operate. These stimuli are the actual target software application for the system. Of course, to use these applications, high-performance, high-speed simulation is a requirement.

The system-level simulation must also be cycle-accurate to ensure that timing-dependent behavior is visible to the architects and software developers. Such simulation eliminates integration rework once the hardware returns from the vendor, because the software, including tricky timing-based corner cases that will not be exposed by a simple instruction-set simulation, has been pre-verified on the virtual system.

Better visibility is also available to the software engineer when development is done in a virtual environment. Helpful warning messages may be embedded into hardware models that provide clues to suspicious behavior in the software application. An example of this would be violation of the usage model or protocol of a peripheral device or bus interface.

Finally, the resulting virtual system prototype may be used as an executable specification by hardware engineers to develop RTL models. Written specifications may contain ambiguities in many cases. (In fact, there are some usage models that may not even have been considered by system architects that are exposed in high-speed, cycle-accurate simulation.) By writing RTL to the executable specification, the hardware engineer is assured that these ambiguities do not result in bugs undiscovered until the hardware is brought up in the lab.

With virtual system prototypes, all three design communities (architects, software, hardware) use the same model, removing much of the ambiguity that occurs when teams use different models. This commonality of model usage also reduces the overall modeling effort. All members of the design team can be confident of their decisions when using VSPs to pre-verify their designs.
