Warp Processors

Warp Processing Press Release (Oct 2007)

Warp processors dynamically optimize their software to improve execution time and energy consumption. By performing optimizations at runtime, Warp processors have the advantages of eliminating the tool flow restrictions and extra designer effort associated with traditional compile-time optimizations. In addition, Warp processors greatly improve upon previous dynamic optimization approaches, such as BOA and Dynamo. Those approaches rely on dynamic software optimizations and generally achieve speedups ranging from 1.1 to 1.3. By performing hardware/software partitioning at runtime, warp processors are currently capable of achieving speedups averaging 7.4 and energy savings of up to 94%. In the near future, we expect warp processors to achieve speedups much greater than an order of magnitude.


The functionality of a warp processor is illustrated in the figure shown above. Initially, software executes on the microprocessor. As the software executes, a profiler monitors the software to determine regions of code responsible for a large percentage of execution time (which we call critical regions). Once the profiler has identified critical regions, the warp processor will partition the critical regions to hardware. Hardware for the critical regions is synthesized in the DPM (dynamic partitioning module). The DPM then programs the configurable logic to implement the synthesized hardware. The DPM also updates the software binary so that the hardware will be used during execution. Finally, the partitioned application begins executing much faster while consuming less energy.
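
The profiling step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' profiler: the function name `find_critical_regions`, the region labels, and the 90% coverage threshold are all hypothetical, and the trace is made-up data in which each entry stands for one sampled execution of a code region.

```python
from collections import Counter


def find_critical_regions(region_trace, coverage=0.9):
    """Return the most frequently sampled regions, most frequent first,
    stopping once they account for at least `coverage` of all samples.

    `region_trace` is a hypothetical list of region labels, one per sample.
    """
    counts = Counter(region_trace)
    total = sum(counts.values())
    selected, covered = [], 0
    for region, count in counts.most_common():
        if covered / total >= coverage:
            break  # the regions chosen so far already dominate execution time
        selected.append(region)
        covered += count
    return selected


# Made-up trace: two small loops dominate the samples.
trace = ["loop_a"] * 70 + ["loop_b"] * 20 + ["init"] * 6 + ["exit"] * 4
critical = find_critical_regions(trace, coverage=0.9)
print(critical)
```

In this sketch the two loops are selected and the rarely executed regions stay in software, which mirrors the idea that only a small part of the code is worth moving to hardware.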

Warp Tools

Hardware/software partitioning tools typically execute on powerful workstations with gigabytes of memory and extremely fast processors. Warp processors execute these same tools in an on-chip environment. To make on-chip execution possible, Warp processors have specialized tools that target the regions most commonly implemented in hardware, generally small, frequent loops. These specialized tools are designed to be very lean, requiring orders of magnitude less memory and execution time.

The tool flow implemented by Warp processors is shown in the adjacent figure. Initially, partitioning selects the critical regions identified by the on-chip profiler that are appropriate for hardware. Next, decompilation recovers high-level constructs (loops, arrays, etc.) to create a representation of the code that is more suitable for synthesis. The decompiled representation is then passed to the register-transfer synthesis tool, which creates a standard hardware binary. Next, JIT FPGA compilation converts the standard binary into a binary for the specialized WCLA (Warp Configurable Logic Architecture). During JIT FPGA compilation, logic synthesis performs logic optimizations to reduce the number of gates required by the hardware. Technology mapping maps the gate-level netlist onto configurable logic in the WCLA. Place and route determines all connections in the WCLA and then outputs a bitstream that programs the WCLA. Finally, the binary updater modifies the software binary by replacing the original software loops with hardware initialization and communication code.
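
The ordering of the tool flow can be summarized as a simple pipeline. The sketch below is purely illustrative: the stage names follow the text above, but the `Artifact` class, the `stage` helper, and the artifact descriptions are hypothetical placeholders, and none of the stages does any real CAD work.

```python
from dataclasses import dataclass, field


@dataclass
class Artifact:
    """Hypothetical record of what a stage produced and which stages ran."""
    kind: str
    history: list = field(default_factory=list)


def stage(name, produces):
    """Build a placeholder stage that only records its name and output kind."""
    def run(artifact):
        return Artifact(kind=produces, history=artifact.history + [name])
    return run


# Stage order as described in the text; the outputs are informal labels.
TOOL_FLOW = [
    stage("partitioning", "selected critical regions"),
    stage("decompilation", "high-level representation"),
    stage("register-transfer synthesis", "standard hardware binary"),
    stage("logic synthesis", "optimized gate-level netlist"),
    stage("technology mapping", "netlist mapped to WCLA logic"),
    stage("place and route", "WCLA bitstream"),
    stage("binary updater", "updated software binary and WCLA bitstream"),
]


def run_tool_flow(artifact):
    for step in TOOL_FLOW:
        artifact = step(artifact)
    return artifact


result = run_tool_flow(Artifact(kind="software binary and profile"))
print(" -> ".join(result.history))
```

Logic synthesis, technology mapping, and place and route are listed as separate stages here, although the text groups them under JIT FPGA compilation.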


Warp Architecture


The Warp processor architecture consists of several components: a main microprocessor, an on-chip profiler, a dynamic partitioning module (DPM), and a specialized warp configurable logic architecture (WCLA). The main microprocessor executes the software partition. The profiler monitors instruction fetches to determine the most frequently executed regions of the software. The DPM, which consists of an additional microprocessor and a small amount of memory, is responsible for executing all of the CAD tools described earlier. The WCLA is a specialized configurable logic fabric that allows for very efficient place and route operations.
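
One way a profiler could find frequent regions by watching instruction fetches is to count taken short backward branches, since a jump back to a nearby lower address usually closes a loop. The sketch below assumes that heuristic; it is a software illustration, not a description of the actual profiler hardware, and the function name, the `max_distance` parameter, and the addresses are hypothetical.

```python
from collections import Counter


def count_backward_branches(fetch_addresses, max_distance=256):
    """Count taken short backward branches in a fetch-address trace.

    Keys are (branch_address, target_address) pairs. A transition counts
    when the next fetch is at a lower address within `max_distance` bytes.
    """
    loops = Counter()
    for prev, curr in zip(fetch_addresses, fetch_addresses[1:]):
        if curr < prev and prev - curr <= max_distance:
            loops[(prev, curr)] += 1
    return loops


# Made-up trace: a four-instruction loop body at 0x100..0x10C runs three
# times, so the branch at 0x10C jumps back to 0x100 twice before exiting.
body = [0x100, 0x104, 0x108, 0x10C]
fetches = [0x0FC] + body * 3 + [0x110]
hot = count_backward_branches(fetches)
for (branch, target), taken in hot.most_common():
    print(f"branch at {branch:#x} -> {target:#x} taken {taken} times")
```

The pairs with the highest counts would be the candidate loops to hand to the partitioning step.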

Results

The charts shown below illustrate the speedup achievable by Warp processors. The first chart shows the experimental setup. For these experiments, the main microprocessor consisted of an ARM7 running at 100 MHz. The DPM used an additional ARM7 to execute the CAD tools, which executed in under two seconds. The second chart shows the speedups of the single most frequent region when implemented in hardware compared to the software-only execution of the same region. The final chart shows overall application speedup, averaging 7.4, after warp processing has implemented multiple critical regions in hardware. The energy savings for the same experiments ranged from 38% to 94%.
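
The gap between single-kernel speedup and overall application speedup can be estimated with Amdahl's law. The sketch below uses that standard formula as an assumption about how the two figures relate; it is not taken from the authors' methodology, the function name is hypothetical, and the input values are illustrative rather than measured.

```python
def overall_speedup(kernel_fraction, kernel_speedup):
    """Amdahl's law: overall speedup when a fraction of the original
    execution time is accelerated by a factor of `kernel_speedup`."""
    if not 0.0 <= kernel_fraction <= 1.0:
        raise ValueError("kernel_fraction must be between 0 and 1")
    if kernel_speedup <= 0:
        raise ValueError("kernel_speedup must be positive")
    remaining = 1.0 - kernel_fraction            # time still spent in software
    accelerated = kernel_fraction / kernel_speedup  # time spent in hardware
    return 1.0 / (remaining + accelerated)


# Illustrative values only: a kernel taking 90% of the run time, made 10x faster.
print(round(overall_speedup(0.9, 10.0), 2))
```

Because the time left in software bounds the result, moving several critical regions to hardware raises the overall figure more than speeding up a single kernel further.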

The benchmarks used in the experiments were selected from PowerStone, EEMBC, NetBench, and our own benchmark suite.

Experimental Setup



Single Kernel Speedup


Overall Speedup


Thread Warping Speedup

The following shows the results of executing highly parallelizable benchmarks using warp processing (with entire threads mapped to FPGAs) versus execution on 4-, 8-, 16-, 32-, and 64-microprocessor (uP) systems. Even compared to 64-processor systems with optimistic communication assumptions, warp processing of threads still achieves huge speedups.


Miscellaneous Presentations

  1. Self-Improving Computing Chips -- Warp Processing, UC Riverside CS&E Colloquium, Oct 2007. PPT

  2. Thread Warping -- Int. Conf. on Hw/Sw Codesign and System Synthesis (Austria), Oct. 2007. PPT
  3. Warp Processor: A Dynamically Reconfigurable Coprocessor -- Talk at Intel's System Design Symposium (San Jose), Nov. 2005 PPT
  4. Warp Processors -- Talk at ASU, April 2004 PPT
  5. Warp Processors -- Talk at IBM Research, Yorktown Heights, Apr 2004 PPT

Publications

  1. Thread Warping: A Framework for Dynamic Synthesis of Thread Accelerators PDF
    G. Stitt and F. Vahid.
    Int. Conf. on Hardware/Software Codesign and System Synthesis (CODES/ISSS), 2007.

  2. Binary Synthesis
    G. Stitt and F. Vahid.
    ACM Transactions on Design Automation of Electronic Systems (TODAES), 2007 (to appear).

  3. A Code Refinement Methodology for Performance-Improved Synthesis from C PDF
    G. Stitt, F. Vahid, W. Najjar
    IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Nov. 2006, pp. 716-723.

  4. Warp Processors PDF
    R. Lysecky, G. Stitt, F. Vahid
    ACM Transactions on Design Automation of Electronic Systems (TODAES), July 2006, pp. 659-681.

  5. New Decompilation Techniques for Binary-level Co-processor Generation PDF
    G. Stitt, F. Vahid
    IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Nov. 2005, pp. 547-554.

  6. Hardware/Software Partitioning of Software Binaries: A Case Study of H.264 Decode PDF
    G. Stitt, F. Vahid, G. McGregor, B. Einloth
    International Conference on Hardware/Software Codesign and System Synthesis (CODES/ISSS), Sep. 2005, pp. 285-290.
    Shows that binary-level partitioning and synthesis of a real highly-optimized h264 video decoder application is competitive with source (C) level partitioning/synthesis. Also introduces several simple C coding guidelines that greatly improve synthesis results.

  7. Frequent Loop Detection Using Efficient Non-Intrusive On-Chip Hardware PDF
    A. Gordon-Ross and F. Vahid
    IEEE Transactions on Computers, Special Issue-Embedded Systems, Microarchitecture, and Compilation Techniques in Memory of B. Ramakrishna (Bob) Rau, Oct. 2005, Vol. 54, Issue 10, pp. 1203-1215.
    Describes extensive studies resulting in lean profiler hardware that effectively finds addresses corresponding to frequent loops in an executing software binary.

  8. A Decompilation Approach to Partitioning Software for Microprocessor/FPGA Platforms PDF
    G. Stitt and F. Vahid
    Design Automation and Test in Europe (DATE), March 2005, pp. 396-397
    Utilizing advanced decompilation techniques enables synthesis of hardware from binaries to recover nearly all high-level constructs that existed in the source code, even for different compiler optimization levels.

  9. A Study of the Scalability of On-Chip Routing for Just-in-Time FPGA Compilation PDF
    R. Lysecky, F. Vahid and S. Tan
    IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), 2005
    Describes an FPGA routing approach that is lean in terms of runtime and memory, running three times faster while using over 15 times less memory than a popular router, yet creating a critical path that is only 30% longer on average and about equal for very large circuits compared to that other router. Our approach, ROCR (Riverside On-Chip Router), can be useful for methods requiring just-in-time FPGA compilation, like our warp processing method, and future methods using a standard FPGA binary.

  10. A Study of the Speedups and Competitiveness of FPGA Soft Processor Cores using Dynamic Hardware/Software Partitioning PDF
    R. Lysecky and F. Vahid
    Design Automation and Test in Europe (DATE), March 2005
    Highlights speedup and energy results of implementing warp processing, which dynamically and transparently remaps software kernels to FPGA using on-chip synthesis tools, for software running on a Xilinx MicroBlaze soft-core processor. Results show competitive performance and energy compared to software on regular "hard core" embedded microprocessors, thus making soft-cores on FPGA even more attractive beyond just their flexibility of putting different numbers of cores and custom circuitry on a single chip.

  11. Techniques for Synthesizing Binaries to an Advanced Register/Memory Structure PDF
    G. Stitt, Z. Guo, F. Vahid, and W. Najjar
    ACM/SIGDA Symp. on Field Programmable Gate Arrays (FPGA), Feb. 2005.
    Advanced decompilation methods can make synthesizing FPGA hardware from software binaries competitive with synthesizing directly from C-level source code, even when utilizing an advanced memory structure (smart buffer) requiring knowledge of loops and arrays. Synthesis from binaries provides numerous advantages of language independence, tool independence, portability, and support of legacy code.

  12. Dynamic FPGA Routing for Just-in-Time Compilation PDF
    R. Lysecky, F. Vahid, S. Tan
    IEEE/ACM Design Automation Conference (DAC), June 2004.
    Describes an FPGA routing heuristic for execution on-chip, to support Just-in-Time compilation for FPGAs.

  13. A Configurable Logic Architecture for Dynamic Hardware/Software Partitioning PDF
    R. Lysecky, F. Vahid
    Design Automation and Test in Europe Conference (DATE), February 2004.
    Describes a simple configurable logic (FPGA) fabric and surrounding architecture specifically intended to support dynamic hardware/software partitioning -- meaning on-chip CAD tools must be able to quickly map a netlist to the fabric.

  14. Energy Savings and Speedups from Partitioning Critical Software Loops to Hardware in Embedded Systems PDF
    G. Stitt, F. Vahid, S. Nematbakhsh
    ACM Transactions on Embedded Computing Systems (TECS), January 2004.
    Partitioning a program's kernels to FPGA hardware can reduce overall system energy.

  15. Dynamic Hardware/Software Partitioning: A First Approach PDF PPT
    G. Stitt, R. Lysecky, F. Vahid
    Design Automation Conference (DAC), 2003, pp. 250-255.
    Dynamically partitioning an executing software application onto on-chip FPGA is not only possible, but quite effective.

  16. A Codesigned On-Chip Logic Minimizer PDF
    R. Lysecky, F. Vahid
    IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), October 2003.
    Hardware/software partitioning of an on-chip logic minimizer results in 8x speedup and 60% energy savings, improving the usefulness of on-chip logic minimization in a variety of applications.

  17. On-Chip Logic Minimization PDF
    R. Lysecky, F. Vahid
    Design Automation Conference (DAC), 2003.
    Executing a lean form of logic minimization on-chip is feasible and has several immediate applications in networking.

  18. Frequent Loop Detection Using Efficient Non-Intrusive On-Chip Hardware PDF
    A. Gordon-Ross, F. Vahid
    ACM/IEEE Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES) 2003.
    Describes extensive studies resulting in lean profiler hardware that effectively finds addresses corresponding to frequent loops in an executing software binary.

  19. The Energy Advantages of Microprocessor Platforms with On-Chip Configurable Logic PDF
    G. Stitt, F. Vahid
    IEEE Design and Test of Computers, November/December 2002, pp. 36-43.
    Partitioning critical software kernels to on-chip FPGA improves energy consumption in addition to performance.

  20. Hardware/Software Partitioning of Software Binaries PDF
    G. Stitt and F. Vahid
    IEEE/ACM International Conference on Computer Aided Design, November 2002, pp. 164-170.
    Performing hw/sw partitioning on software binaries can achieve results similar to a compiler-based approach without imposing restrictions on the high-level language or compiler. Binary-level partitioning also supports partitioning of library code, legacy code, and hand-optimized assembly.

  21. Binary-Level Hardware/Software Partitioning of MediaBench, NetBench, and EEMBC Benchmarks PDF
    G. Stitt and F. Vahid
    Technical Report UCR-CSE-03-01. January 2003.
    Binary-level hw/sw partitioning achieves similar speedups compared to a compiler-based approach for standard benchmarks from MediaBench, NetBench, and EEMBC.

Last updated by Robert E Dickinson, May 14, 2008.
