Warp Processors
Warp Processing Press Release (Oct 2007)
Warp processors dynamically optimize their software to improve execution time and energy consumption. By performing optimizations at runtime, warp processors eliminate the tool flow restrictions and extra designer effort associated with traditional compile-time optimizations. In addition, warp processors greatly improve upon previous dynamic optimization approaches, such as BOA and Dynamo. Those approaches rely on dynamic software optimizations and generally achieve speedups ranging from 1.1 to 1.3. By performing hardware/software partitioning at runtime, warp processors currently achieve speedups averaging 7.4 and energy savings of up to 94%. In the near future, we expect warp processors to achieve speedups much greater than an order of magnitude.

The functionality of a warp processor is illustrated in the figure shown above. Initially, software executes on the microprocessor. As the software executes, a profiler monitors the software to determine regions of code responsible for a large percentage of execution time (which we call critical regions). Once the profiler has identified critical regions, the warp processor will partition the critical regions to hardware. Hardware for the critical regions is synthesized in the DPM (dynamic partitioning module). The DPM then programs the configurable logic to implement the synthesized hardware. The DPM also updates the software binary so that the hardware will be used during execution. Finally, the partitioned application begins executing much faster while consuming less energy.
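The profiling step can be pictured with a short Python sketch. It is only an illustration of what "critical regions" means, not the on-chip profiler itself: the region names, the cycle counts, and the 90% coverage threshold are all assumptions made up for the example.

```python
def find_critical_regions(region_cycles, coverage_pct=90):
    """Pick the hottest regions until they cover coverage_pct percent of all cycles.

    region_cycles maps a (hypothetical) region name to the cycles spent in it.
    """
    total = sum(region_cycles.values())
    selected, covered = [], 0
    # Visit regions from most to least time-consuming.
    for region, cycles in sorted(region_cycles.items(),
                                 key=lambda kv: kv[1], reverse=True):
        if covered * 100 >= coverage_pct * total:
            break
        selected.append(region)
        covered += cycles
    return selected


# Hypothetical profile: two small loops dominate execution time.
profile = {"fir_loop": 700, "crc_loop": 200, "init": 60, "io": 40}
critical = find_critical_regions(profile)
print(critical)  # ['fir_loop', 'crc_loop']
```

Only the selected regions would be handed to the DPM for synthesis; everything else keeps running in software.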
Warp Tools
Hardware/software partitioning tools typically execute on powerful workstations with gigabytes of memory and extremely fast processors. Warp processors execute these same tools in an on-chip environment. To make on-chip execution possible, warp processors have specialized tools that target the regions most commonly implemented in hardware, generally small, frequent loops. These specialized tools are designed to be very lean, requiring orders of magnitude less memory and execution time.
The tool flow implemented by Warp processors is shown in the adjacent figure. Initially, partitioning selects the critical regions identified by the on-chip profiler that are appropriate for hardware. Next, decompilation recovers high-level constructs (loops, arrays, etc.) to create a representation of the code that is more suitable for synthesis. The decompiled representation is then passed to the register-transfer synthesis tool that creates a standard hardware binary. Next, JIT FPGA compilation converts the standard binary into a binary for the specialized WCLA (Warp Configurable Logic Architecture). During JIT FPGA compilation, logic synthesis performs logic optimizations to reduce the number of gates required by the hardware. Technology mapping handles mapping the gate-level netlist onto configurable logic in the WCLA. Place and route determines all connections in the WCLA and then outputs a bitstream that programs the WCLA. The binary updater modifies the software binary by replacing the original software loops with hardware initialization and communication code.
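The ordering of the stages can be summarized as a chain in which each stage consumes what the previous one produced. The Python sketch below only models that ordering. The stage names come from the description above, but the artifact labels (for example "mapped WCLA logic") are our own shorthand, and none of the real tools are implemented.

```python
# (stage name, artifact consumed, artifact produced); artifact labels are illustrative.
STAGES = [
    ("partitioning", "profiled regions", "critical region binary"),
    ("decompilation", "critical region binary", "high-level representation"),
    ("register-transfer synthesis", "high-level representation", "standard hardware binary"),
    # The next three stages together make up JIT FPGA compilation.
    ("logic synthesis", "standard hardware binary", "gate-level netlist"),
    ("technology mapping", "gate-level netlist", "mapped WCLA logic"),
    ("place and route", "mapped WCLA logic", "WCLA bitstream"),
]


def run_flow(stages, start):
    """Walk the stages in order, checking that each one receives the artifact it expects."""
    artifact = start
    trace = [artifact]
    for name, consumes, produces in stages:
        if consumes != artifact:
            raise ValueError(f"{name} expects {consumes!r}, got {artifact!r}")
        artifact = produces
        trace.append(artifact)
    return trace


trace = run_flow(STAGES, "profiled regions")
print(" -> ".join(trace))
```

The binary updater is left out of the chain because it works on the software binary rather than on the hardware description.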

Warp Architecture
The warp processor architecture consists of several components: a main microprocessor, an on-chip profiler, a dynamic partitioning module (DPM), and a specialized warp configurable logic architecture (WCLA). The main microprocessor executes the software partition. The profiler monitors instruction fetches to determine the most frequently executed regions of the software. The DPM executes all of the CAD tools described earlier; it consists of an additional microprocessor and a small amount of memory. The WCLA is a specialized configurable logic fabric that allows for very efficient place and route operations.
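One way to picture how a profiler can find frequent loops just by watching instruction fetches is to count short backward jumps in the fetch address stream. The Python sketch below is a software illustration under that assumption; the 256-byte loop-size limit and the example trace are hypothetical, and the real profiler is hardware, not software.

```python
from collections import Counter


def count_loop_branches(fetch_addresses, max_loop_bytes=256):
    """Count short backward jumps in a fetch trace, keyed by the jump target.

    A fetch address that is lower than the previous one by at most
    max_loop_bytes is treated as the start of another loop iteration.
    """
    counts = Counter()
    prev = None
    for addr in fetch_addresses:
        if prev is not None and 0 < prev - addr <= max_loop_bytes:
            counts[addr] += 1
        prev = addr
    return counts


# Hypothetical trace: a 3-instruction loop at 0x100 that runs 4 iterations,
# which means 3 backward jumps to 0x100.
fetches = [0xFC] + [0x100, 0x104, 0x108] * 4 + [0x10C]
hot = count_loop_branches(fetches)
print(hot.most_common(1))  # [(256, 3)]
```

The addresses with the highest counts mark the loops that are the best candidates for hardware.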
Results
The charts shown below illustrate the speedup achievable by warp processors. The first chart shows the experimental setup. For these experiments, the main microprocessor was an ARM7 running at 100 MHz. The DPM used an additional ARM7 to execute the CAD tools, which ran in under two seconds. The second chart shows the speedup of the single most frequent region when implemented in hardware, compared to software-only execution of the same region. The final chart shows overall application speedup, averaging 7.4, after warp processing has implemented multiple critical regions in hardware. The energy savings for the same experiments ranged from 38% to 94%.
The benchmarks used in the experiments were selected from PowerStone, EEMBC, NetBench, and our own benchmark suite.
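The difference between single-kernel speedup and overall application speedup follows from Amdahl's law: code left in software limits the overall gain. The Python sketch below works through that relationship. The time fractions and kernel speedups in it are invented for illustration and are not taken from the experiments.

```python
def overall_speedup(kernels):
    """Amdahl's law for several accelerated regions.

    kernels is a list of (fraction_of_software_time, kernel_speedup) pairs
    for the regions moved to hardware; the rest of the program is unchanged.
    """
    accelerated = sum(fraction for fraction, _ in kernels)
    if accelerated > 1:
        raise ValueError("time fractions must not sum to more than 1")
    new_time = (1 - accelerated) + sum(fraction / speedup
                                       for fraction, speedup in kernels)
    return 1 / new_time


# One kernel taking 80% of the time, sped up 20x: overall gain is only about 4.2.
one_kernel = overall_speedup([(0.80, 20)])
# Moving a second region (15% of the time, 10x) to hardware raises it to about 9.5.
two_kernels = overall_speedup([(0.80, 20), (0.15, 10)])
print(round(one_kernel, 2), round(two_kernels, 2))
```

This is why implementing multiple critical regions in hardware matters: each additional region shrinks the software remainder that bounds the overall speedup.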
Experimental Setup

Single Kernel Speedup

Overall Speedup
Thread Warping Speedup
The following shows results of executing highly parallelizable benchmarks using warp processing (with entire threads mapped to FPGAs) versus execution on systems with 4, 8, 16, 32, and 64 microprocessors (uP). Even compared to 64-processor systems with optimistic communication assumptions, warp processing of threads still achieves large speedups.

People
Miscellaneous Presentations
- Self-Improving Computing Chips -- Warp Processing, UC Riverside CS&E Colloquium, Oct 2007. PPT
- Thread Warping -- Int. Conf. on Hw/Sw Codesign and System Synthesis (Austria), Oct. 2007. PPT
- Warp Processor: A Dynamically Reconfigurable Coprocessor -- Talk at Intel's System Design Symposium (San Jose), Nov. 2005. PPT
- Warp Processors -- Talk at ASU, April 2004 PPT
- Warp Processors -- Talk at IBM Research, Yorktown Heights, Apr 2004 PPT
Publications
- Thread Warping: A Framework for Dynamic Synthesis of Thread Accelerators PDF
G. Stitt and F. Vahid.
Int. Conf. on Hardware/Software Codesign and System Synthesis (CODES/ISSS), 2007.
- Binary Synthesis. G. Stitt and F. Vahid.
ACM Transactions on Design Automation of Electronic Systems (TODAES), 2007 (to appear).
- A Code Refinement Methodology for Performance-Improved Synthesis from C PDF
G. Stitt, F. Vahid, W. Najjar
IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Nov. 2006, pp. 716-723.
- Warp Processors PDF
R. Lysecky, G. Stitt, F. Vahid
ACM Transactions on Design Automation of Electronic Systems (TODAES), July 2006, pp. 659-681.
- New Decompilation Techniques for Binary-level Co-processor Generation PDF
G. Stitt, F. Vahid
IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Nov. 2005, pp. 547-554.
- Hardware/Software Partitioning of Software Binaries: A Case Study of H.264 Decode PDF
G. Stitt, F. Vahid, G. McGregor, B. Einloth
International Conference on Hardware/Software Codesign and System Synthesis (CODES/ISSS), Sep. 2005, pp. 285-290.
Shows that binary-level partitioning and synthesis of a real highly-optimized h264 video decoder application is competitive with source (C) level partitioning/synthesis. Also introduces several simple C coding guidelines that greatly improve synthesis results.
- Frequent Loop Detection Using Efficient Non-Intrusive On-Chip Hardware PDF
A. Gordon-Ross and F. Vahid
IEEE Transactions on Computers, Special Issue on Embedded Systems, Microarchitecture, and Compilation Techniques in Memory of B. Ramakrishna (Bob) Rau, Oct. 2005, Vol. 54, Issue 10, pp. 1203-1215.
Describes extensive studies resulting in lean profiler hardware that effectively finds addresses corresponding to frequent loops in an executing software binary.
- A Decompilation Approach to Partitioning Software for Microprocessor/FPGA Platforms PDF
G. Stitt and F. Vahid
Design Automation and Test in Europe (DATE), March 2005, pp. 396-397
Utilizing advanced decompilation techniques enables synthesis of hardware from binaries to recover nearly all high-level constructs that existed in the source code, even for different compiler optimization levels.
- A Study of the Scalability of On-Chip Routing for Just-in-Time FPGA Compilation PDF
R. Lysecky, F. Vahid and S. Tan
IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), 2005
Describes an FPGA routing approach that is lean in terms of runtime and memory, running three times faster while using over 15 times less memory than a popular router, yet creating a critical path that is only 30% longer on average and about equal for very large circuits compared to that other router. Our approach, ROCR (Riverside On-Chip Router), can be useful for methods requiring just-in-time FPGA compilation, like our warp processing method, and future methods using a standard FPGA binary.
- A Study of the Speedups and Competitiveness of FPGA Soft Processor Cores using Dynamic Hardware/Software Partitioning PDF
R. Lysecky and F. Vahid
Design Automation and Test in Europe (DATE), March 2005
Highlights speedup and energy results of implementing warp processing, which dynamically and transparently remaps software kernels to FPGA using on-chip synthesis tools, for software running on a Xilinx MicroBlaze soft-core processor. Results show competitive performance and energy compared to software on regular "hard core" embedded microprocessors, thus making soft-cores on FPGA even more attractive beyond just their flexibility of putting different numbers of cores and custom circuitry on a single chip.
- Techniques for Synthesizing Binaries to an Advanced Register/Memory Structure PDF
G. Stitt, Z. Guo, F. Vahid, and W. Najjar
ACM/SIGDA Symp. on Field Programmable Gate Arrays (FPGA), Feb. 2005.
Advanced decompilation methods can make synthesizing FPGA hardware from software binaries competitive with synthesizing directly from C-level source code, even when utilizing an advanced memory structure (smart buffer) requiring knowledge of loops and arrays. Synthesis from binaries provides numerous advantages of language independence, tool independence, portability, and support of legacy code.
- Dynamic FPGA Routing for Just-in-Time Compilation PDF
R. Lysecky, F. Vahid, S. Tan
IEEE/ACM Design Automation Conference (DAC), June 2004.
Describes an FPGA routing heuristic for on-chip execution, to support just-in-time compilation for FPGAs.
- A Configurable Logic Architecture for Dynamic Hardware/Software Partitioning PDF
R. Lysecky, F. Vahid
Design Automation and Test in Europe Conference (DATE), February 2004.
Describes a simple configurable logic (FPGA) fabric and surrounding architecture specifically intended to support dynamic hardware/software partitioning -- meaning on-chip CAD tools must be able to quickly map a netlist to the fabric.
- Energy Savings and Speedups from Partitioning Critical Software Loops to Hardware in Embedded Systems PDF
G. Stitt, F. Vahid, S. Nematbakhsh
ACM Transactions on Embedded Computing Systems (TECS), January 2004.
Partitioning a program's kernels to FPGA hardware can reduce overall system energy.
- Dynamic Hardware/Software Partitioning: A First Approach PDF PPT
G. Stitt, R. Lysecky, F. Vahid
Design Automation Conference (DAC), 2003, pp. 250-255.
Dynamically partitioning an executing software application onto on-chip FPGA is not only possible, but quite effective.
- A Codesigned On-Chip Logic Minimizer PDF
R. Lysecky, F. Vahid
IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), October 2003.
Hardware/software partitioning of an on-chip logic minimizer results in 8x speedup and 60% energy savings, improving the usefulness of on-chip logic minimization in a variety of applications.
- On-Chip Logic Minimization PDF
R. Lysecky, F. Vahid
Design Automation Conference (DAC), 2003.
Executing a lean form of logic minimization on-chip is feasible and has several immediate applications in networking.
- Frequent Loop Detection Using Efficient Non-Intrusive On-Chip Hardware PDF
A. Gordon-Ross, F. Vahid
ACM/IEEE Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES) 2003.
Describes extensive studies resulting in lean profiler hardware that effectively finds addresses corresponding to frequent loops in an executing software binary.
- The Energy Advantages of Microprocessor Platforms with On-Chip Configurable Logic PDF
G. Stitt, F. Vahid
IEEE Design and Test of Computers, November/December 2002, pp. 36-43.
Partitioning critical software kernels to on-chip FPGA improves energy consumption in addition to performance.
- Hardware/Software Partitioning of Software Binaries PDF
G. Stitt and F. Vahid
IEEE/ACM International Conference on Computer Aided Design, November 2002, pp. 164-170.
Performing hw/sw partitioning on software binaries can achieve results similar to a compiler-based approach without imposing restrictions on the high-level language or compiler. Binary-level partitioning also supports partitioning of library code, legacy code, and hand-optimized assembly.
- Binary-Level Hardware/Software Partitioning of MediaBench, NetBench, and EEMBC Benchmarks PDF
G. Stitt and F. Vahid
Technical Report UCR-CSE-03-01. January 2003.
Binary-level hw/sw partitioning achieves similar speedups compared to a compiler-based approach for standard benchmarks from MediaBench, NetBench, and EEMBC.