Taming Embedded Multi-Core on FPGAs for Packet Processing

In public discussions about embedded multi-processing and FPGAs, most of the focus has been on DSP. But there’s another application area requiring embedded multi-processing that has remained elusive, not because of silicon deficiencies, but due to the lack of an easy methodology. That application is packet processing. Packet processing performance as high as 10 Gbps is possible in FPGAs, so the silicon is fast, but even single gigabit performance has required RTL and a hardware approach to design.

Many companies with projects involving packet processing work exclusively in software, predominantly C. They have infrastructures and methodologies based around software. They have hardware groups that provide them with the boards and systems they need, but see the bulk of their value in software. These companies do not want to learn RTL or use a hardware design approach; they want to work in C using a software approach. As a result, they have not included FPGAs in their consideration of packet processors.

Going after flexibility

One of the main benefits of a software approach is flexibility. Software flexibility refers to the ability to use the vast array of tools and methodologies that exist for common languages like C. It’s much easier to debug a design and produce robust working code in that kind of environment. Design changes are easily and quickly accommodated, and the rich set of tools helps with everything from debugging to documentation to code-style checking. Software upgrades are possible on systems already deployed by simply placing a new software image in the code store.

Missing in this focus on software has been the main benefit of the FPGA: hardware flexibility. This deals with the designer’s ability to allocate hardware resources much more judiciously. It means using only as much hardware as needed and leaving as little logic unused as possible, reducing the cost of the design. In addition, the ability to upgrade the hardware on systems in the field provides a huge new flexibility dimension, extending the life of boards and systems far beyond what would otherwise be possible.

Given that both kinds of flexibility haven’t been achievable at the same time, most companies that implement packet processing have opted for software flexibility with more rigid architectures. There are companies that have used the more flexible FPGA, but have had to turn to RTL to get the performance they need. RTL doesn’t provide the same kind of flexibility that C does; it takes longer to design, and because it defines hardware, design turns can take longer since a complete hardware build is needed.

The key to achieving both kinds of flexibility is the existence of embedded processors on FPGAs, which provide a software engine within the FPGA. The challenge has been that getting performance requires a multi-core approach, since one processor cannot process packets quickly enough to keep up with the incoming rate. Doing a multi-core design from scratch in an ad-hoc fashion is not a trivial task, and has simply raised too high a barrier. What has been missing is an infrastructure and a methodology for managing multi-core designs on an FPGA. Work that Teja has done developing a packet-processing architecture for the Xilinx Virtex-4 FX family has led to both, allowing multi-gigabit FPGA packet-processing performance with both software and hardware flexibility.

A multi-core infrastructure

When putting together a multi-core fabric, the tough bits have to do with assigning and scheduling tasks on processors, communication between processors, and accessing shared resources like memories and coprocessors. We can simplify this problem somewhat by restricting the domain of possible processor configurations to one well suited to packet processing: the parallel pipeline. We can then design a library of modular scalable blocks to create such a pipeline.

Pipelines can be structured a number of different ways, focusing on length of the pipeline, amount of parallelism, or both, as shown in Figure 1. This structure immediately takes care of the problem of assigning and scheduling tasks: by partitioning the program at design time, various pieces of code are assigned to various stages. While running, each stage will start its work as soon as it receives the go-ahead from the previous stage.

Figure 1. Parallel pipeline configurations

Each engine will need some resources in order to be able to execute at speed without waiting for other engines. By putting private code and data store along with each engine, as illustrated in Figure 2, the most onerous shared-memory contention issues are avoided. By allowing the use of hardware offloads, specific computation- or time-intensive functions can be accelerated if needed. Note that in an FPGA, such custom offloads can be created and collocated in the same silicon. In addition, the only offloads required are the ones actually used, unlike other silicon that may provide fixed offload functionality that sits there whether or not it’s used.

Figure 2. Contents of a processing engine

Getting packets into and out of the pipeline is non-trivial. The logic that does so must operate very fast, since it has to touch every packet. This is only possible with a dedicated block created out of RTL. Getting packets and tasks between stages is also tricky. It requires memory copies, something that you don’t want your processor to spend time doing. So another high-speed RTL block is needed here to take care of the bookkeeping involved in moving a task from one stage to the next. This block adds more value if it also acts as a load balancer. Load balancing prevents a high-maintenance packet from jamming up one “row” of the pipeline; packet flow can continue around it. It also makes irregularly shaped pipelines possible: to move from a stage with two parallel engines to one with four, something has to decide which of the four engines receives each task. This, in turn, solves part of the assignment and scheduling problem. Figure 3 illustrates how load balancing can work.

Figure 3. Load balancing between stages
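The load-balancing decision can be illustrated with a small behavioral model in C. This is only a sketch: in the architecture described here the load balancer is an RTL block, and the `stage` structure, the `MAX_ENGINES` limit, and the round-robin search for an idle engine are assumptions made for illustration, not the actual implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_ENGINES 8   /* hypothetical upper bound on engines per stage */

/* Hypothetical model of one pipeline stage. */
struct stage {
    size_t num_engines;        /* parallel engines in this stage */
    bool   busy[MAX_ENGINES];  /* true while engine i holds a packet */
    size_t next;               /* where the next search starts */
};

/* Pick an idle engine, searching round-robin from the last grant.
 * Returns the engine index, or -1 if every engine is busy, in which
 * case the upstream stage has to hold the task. */
int dispatch(struct stage *s)
{
    assert(s->num_engines > 0 && s->num_engines <= MAX_ENGINES);
    for (size_t i = 0; i < s->num_engines; i++) {
        size_t e = (s->next + i) % s->num_engines;
        if (!s->busy[e]) {
            s->busy[e] = true;
            s->next = (e + 1) % s->num_engines;
            return (int)e;
        }
    }
    return -1;
}

/* Called when an engine finishes its packet. */
void release(struct stage *s, size_t engine)
{
    assert(engine < s->num_engines);
    s->busy[engine] = false;
}
```

Because `dispatch` skips engines that are still busy, a slow packet occupying one engine does not block the packets behind it; they are simply handed to whichever engines are free.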

The issue of access to shared resources is addressed by carefully considering the design of the offloads. Offloads themselves can be shared, and they can also provide access to memory, the most common shared resource. By defining a well-thought-out interface for the offloads, with arbitration where needed, the use of shared resources becomes predictable and manageable. The issue of contention and its effect on performance remains, but the problem is vastly simplified to one of whether to share, and if so, how much sharing to provide. Some examples of different sharing schemes are shown in Figure 4.

Figure 4. Offload sharing schemes
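One common way to arbitrate a shared offload is round-robin granting. The C model below is a hedged sketch of that idea only; the article does not say which arbitration policy is used, and the `arbiter` structure and the request bitmask are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical round-robin arbiter for one offload shared by up to
 * 32 engines. Bit i of `requests` is set when engine i wants access. */
struct arbiter {
    unsigned num_engines;  /* engines sharing the offload (1..32) */
    unsigned last_grant;   /* engine granted most recently */
};

/* Grant one requester, starting the search just after the previous
 * winner so that no engine is starved. Returns the engine index, or
 * -1 when nobody is requesting. */
int arbitrate(struct arbiter *a, uint32_t requests)
{
    assert(a->num_engines >= 1 && a->num_engines <= 32);
    for (unsigned i = 1; i <= a->num_engines; i++) {
        unsigned e = (a->last_grant + i) % a->num_engines;
        if (requests & (UINT32_C(1) << e)) {
            a->last_grant = e;
            return (int)e;
        }
    }
    return -1;
}
```

Initializing `last_grant` to `num_engines - 1` makes engine 0 the first candidate. The more engines share one offload, the longer each may wait for its grant, which is the contention cost weighed against the logic saved by sharing.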

So at this point, we have access to the pipeline (in and out), we have communication between stages, we have worked out assignment and scheduling, and we have a clean solution for shared resources. A fully-assembled pipeline using this infrastructure is shown in Figure 5. Given such an infrastructure library already pre-designed, a system architect can now simply snap the pieces together to get a fully-working efficient multi-core engine. This means that the bulk of the designer’s time can be spent on the actual application, not the underlying implementation.

Figure 5. Assembled infrastructure

A multi-core parallel-pipeline methodology

Armed with a convenient infrastructure, the remaining question is how a designer goes about building a real data plane application out of it. The first step is simply to get the protocol working in C as a single monolithic program. This allows you to work the bugs out at a high level without worrying about any intricacies of dealing with a multi-core structure. Only after you have confidence that the algorithm itself is clean do you worry about trying to achieve line rate performance.

The performance problem being solved is that of executing a program requiring many cycles within a budget of few cycles. The cycle budget is simply the number of clock cycles available to operate on each packet; if you exceed it, you have to start dropping packets because a new packet comes in before you’re done with the old one. Using multiple engines, however, makes room for more packets: even though one engine isn’t yet done with its old packet, the new packet goes into a different engine, and the one after that goes perhaps into yet another. This is shown in Figure 6.

Figure 6. Assigning packets to engines
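As a rough illustration of where a cycle budget comes from, the sketch below divides the time a packet occupies the line by the clock period. The function name and parameters are assumptions for this example, and the figures in the note that follows (a 100 MHz engine clock on Gigabit Ethernet) are hypothetical rather than taken from the design described here.

```c
#include <assert.h>
#include <stdint.h>

/* Clock cycles available per packet when packets arrive back to back.
 *   clock_hz:      engine clock frequency in Hz
 *   line_rate_bps: line rate in bits per second
 *   wire_bytes:    bytes each packet occupies on the wire, including
 *                  per-packet overhead such as preamble and
 *                  inter-frame gap
 * The result is rounded down, since a partial cycle cannot be used. */
uint64_t cycle_budget(uint64_t clock_hz, uint64_t line_rate_bps,
                      uint64_t wire_bytes)
{
    assert(line_rate_bps > 0);
    return (clock_hz * wire_bytes * 8) / line_rate_bps;
}
```

For example, a minimum-size Ethernet frame occupies 84 bytes on the wire (64 bytes of frame, 8 of preamble, 12 of inter-frame gap), so `cycle_budget(100000000, 1000000000, 84)` gives 67 cycles per packet under those assumed rates. Worst-case sizing uses the smallest packet, because it leaves the fewest cycles.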

The bottom line is that you end up managing multiple packets in flight, such that if the budget is, say, 50 cycles, a packet completes processing every 50 cycles. When a packet is done, its engine is freed up for a new packet. The trick, then, is to provide just enough engines, but not too many. You can get a general sense of the total number of engines by dividing the actual cycle count by the cycle budget; code requiring 285 cycles, for example, will require about 6 engines (dividing 285 by 50 and rounding up).
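The estimate above is a ceiling division, which can be written in C without floating point. The function name `engines_needed` is chosen for this sketch only.

```c
#include <assert.h>

/* Engines needed so that one packet can complete every `budget` cycles
 * when each packet takes `cycles` cycles to process:
 * ceil(cycles / budget), computed in integer arithmetic. */
unsigned engines_needed(unsigned cycles, unsigned budget)
{
    assert(budget > 0);
    return (cycles + budget - 1) / budget;
}
```

With the numbers from the text, `engines_needed(285, 50)` evaluates to 6. Code that fits the budget exactly, such as 50 cycles against a 50-cycle budget, needs a single engine, while 51 cycles already needs two.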

This is where the use of soft processors on FPGAs is particularly convenient. By using, for example, a MicroBlaze processor on the Virtex-4 FPGA, as few or as many processors as needed can be instantiated. The flexibility of the FPGA really pays off here, because engines can be added or eliminated one at a time as the design is optimized. The question then becomes how to arrange the engines: in parallel or in a long pipeline?

If one engine isn’t enough, the easiest way to get a second engine is to replicate the one you’ve got in parallel. The cost of doing so is the fact that you’ve also replicated any offloads in that engine; this literally doubles the resources required. An alternative is to add a pipeline stage. In this case, the offload is not replicated, and merely stays with the stage that needs it. Partitioning the design takes a bit more work than simple replication, but is a very viable alternative. Note that if contention is not a worry, then when in parallel, an offload can be shared between the parallel elements, saving resources. Figure 7 shows some alternatives.

Figure 7. Going parallel vs. extending the pipeline
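The resource trade-off can be made concrete with a back-of-the-envelope calculation. The sketch below uses arbitrary, hypothetical cost units and ignores the arbitration or inter-stage logic that a real design would add; it only shows why replication duplicates the offload while pipelining or sharing does not.

```c
#include <assert.h>

/* Hypothetical resource costs in arbitrary logic units. */
struct costs {
    unsigned engine;   /* one processor with its private code/data store */
    unsigned offload;  /* one hardware offload */
};

/* n engines in parallel, each with its own copy of the offload. */
unsigned cost_replicated(struct costs c, unsigned n)
{
    return n * (c.engine + c.offload);
}

/* n engines with a single offload instance: either n pipeline stages
 * where only one stage needs the offload, or n parallel engines that
 * share it (reasonable only when contention is low). */
unsigned cost_single_offload(struct costs c, unsigned n)
{
    return n * c.engine + c.offload;
}
```

With an engine costing 1000 units and an offload costing 400, two replicated engines cost 2800 units, exactly double a single engine with its offload, while two engines with one offload cost 2400.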

In reality, a combination of pipelining and parallelizing usually works best for applications large enough to require several engines. The first step is to create a partition (or multiple partitions) using obvious breaks at the highest level of the “main” C program implementing the algorithm. The more partitions made here, the fewer parallel instantiations will be needed. But as it is a bit of work to partition a design, you wouldn’t want to overdo it. And you already have a sense of the number of engines required, so a six-engine problem, for example, would probably want at most two partitions (three stages).
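A minimal sketch of what partitioning at an “obvious break” can look like in C follows. The `pkt_ctx` structure and the `parse` and `classify` steps are placeholders invented for this example; a real application supplies its own processing steps, and the infrastructure, not this code, moves the context from one stage to the next.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-packet context handed from one stage to the next. */
struct pkt_ctx {
    const uint8_t *data;   /* packet bytes */
    unsigned       len;    /* packet length in bytes */
    unsigned       proto;  /* filled in by parse() */
    unsigned       port;   /* filled in by classify() */
};

/* Placeholder processing steps. */
static void parse(struct pkt_ctx *p)    { p->proto = p->len ? p->data[0] : 0; }
static void classify(struct pkt_ctx *p) { p->port = p->proto % 4; }

/* Step 1: the whole algorithm as one monolithic function. */
unsigned process_packet(struct pkt_ctx *p)
{
    parse(p);
    /* ---- obvious break: what follows needs only the context ---- */
    classify(p);
    return p->port;
}

/* Step 2: the same code split at that break into two stage entry
 * points, each of which can run on its own engine. */
void     stage1(struct pkt_ctx *p) { parse(p); }
unsigned stage2(struct pkt_ctx *p) { classify(p); return p->port; }
```

A break is a good candidate when the code after it depends only on data that can travel with the packet, as `proto` does here, so that the two halves can run on different engines without sharing state.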

Once you have the design partitioned, you can find the cycle count of each stage (your C program can be easily “instrumented” to count cycles). Dividing the cycle count of each stage by the budget and rounding up tells you how many parallel elements that stage needs. If our 285-cycle program were partitioned into a 135-cycle stage and a 150-cycle stage, then with a 50-cycle budget three parallel engines would be required for each stage (135/50 = 2.7, rounded up to 3, and 150/50 = 3).
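The per-stage sizing can be sketched as a short C routine. The names `stage_width` and `size_pipeline` are assumptions for illustration; the arithmetic is the division and rounding up described in the text.

```c
#include <assert.h>
#include <stddef.h>

/* Parallel engines needed by one stage: ceil(cycles / budget). */
static unsigned stage_width(unsigned cycles, unsigned budget)
{
    assert(budget > 0);
    return (cycles + budget - 1) / budget;
}

/* Fill widths[i] with the engine count of every stage and return the
 * total number of engines in the pipeline. */
unsigned size_pipeline(const unsigned *stage_cycles, size_t num_stages,
                       unsigned budget, unsigned *widths)
{
    unsigned total = 0;
    for (size_t i = 0; i < num_stages; i++) {
        widths[i] = stage_width(stage_cycles[i], budget);
        total += widths[i];
    }
    return total;
}
```

For stages of 135 and 150 cycles and a 50-cycle budget this yields widths of 3 and 3, six engines in total. Because rounding happens per stage, where the break falls matters: splitting the same 285 cycles into 110 and 175 would need 3 + 4 = 7 engines.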

Putting it all together

At this point, we have a structure that, theoretically, is guaranteed to achieve line rate. The only reason I say “theoretically” is that contention for shared resources can make cycle budgets inaccurate. You have to make an assumption as to how much contention will occur; if you’re right, then when you test it you will find that you meet line rate. If you’re not right, then you have to go back and review your assumptions and see where the bottleneck is. The great thing about working with an FPGA and this methodology is that you can tweak your elements very easily and try again. It could be as simple as changing the sharing of the critical resources. Or, if you can’t do that (for example because it’s external memory), then you can add another engine somewhere to pick up some slack. Only in an FPGA will you have this flexibility.
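One simple way to reason about the contention assumption is to add an assumed stall per shared-resource access to a stage’s cycle count and then re-run the engine calculation. The model below is a deliberately crude sketch; the function names and the fixed average stall are assumptions, and real contention depends on traffic and arbitration.

```c
#include <assert.h>

/* Effective cycle count of a stage once waits on shared resources are
 * included. stall_cycles is the ASSUMED average wait per access. */
unsigned effective_cycles(unsigned cycles, unsigned shared_accesses,
                          unsigned stall_cycles)
{
    return cycles + shared_accesses * stall_cycles;
}

/* Engines needed for a stage: ceil(cycles / budget). */
unsigned engines_for(unsigned cycles, unsigned budget)
{
    assert(budget > 0);
    return (cycles + budget - 1) / budget;
}
```

With hypothetical numbers, a 150-cycle stage that makes four shared accesses, each stalling five cycles on average, effectively takes 170 cycles. Against a 50-cycle budget that stage then needs four engines rather than three, unless the sharing is reduced enough to remove the stalls.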

One extra benefit of the Virtex-4 FX family is that it includes a built-in PowerPC (or two) that can be used for the control plane. And for Ethernet-based applications, the MAC and part of the PHY are built in, so they don’t have to be built out of internal logic or external chips. So the single FPGA plus some external memory (for large tables, for example) makes a complete packet-processing subsystem, as shown in Figure 8.

Figure 8. Complete packet processing subsystem

The bottom line

By using a pre-designed infrastructure, a systematic methodology can be created for generating multi-core packet-processing structures and applications in FPGAs. The methodology itself can be bolstered by tools that automate or simplify some of the tasks and bookkeeping, further streamlining the design work. This makes it possible to get gigabit performance together with both software and hardware flexibility, and to get a product to market even faster, opening up the potential of FPGAs to the many companies that have steered clear of them to avoid RTL.
