Publications by the Reconfigurable Network Group

(In Reverse Chronological Order)
(Also available in BibTeX format)

Journals and Conferences

Sensitivity Analysis of Gigabit Concept Mining System; by Andrew Levine, Ron Loui, John W. Lockwood, Young H. Cho; IEEE Aerospace Conference; Big Sky, MT; March 3-10, 2007; Paper 10.0801.

Abstract: As described in our prior papers, we have implemented a system that performs real-time analysis and classification of network traffic using reconfigurable hardware. In this paper, we consider how to optimize the performance and make best use of the hardware resources by simulating the effect of parameter variation. We have devised a systematic method to determine the best parameters for the hardware such that we do not sacrifice the quality of the result. We applied the method to determine how our existing system could best identify the topics of Internet newsgroup postings as the content streams over a Gigabit Ethernet link.

Hardware-Accelerated Parser for Extraction of Metadata in Semantic Network Content; by James Moscola, Young H. Cho, John W. Lockwood; IEEE Aerospace Conference; Big Sky, MT, March 3-10, 2007, Paper 10.0702.

Abstract: We have implemented a new network information processing system using reconfigurable hardware that scans volumes of data in real time. One of the key functions of the system is to extract semantic information. Before we can determine the meaning of text, we must identify its language. In a previous project, we implemented an N-gram based language identifier that can process data at up to 1 Gbps. However, a large percentage of computer network traffic, such as email and web data, consists of markup information such as tags and protocol-specific options. This additional data interferes with the language identification process, causing decreased accuracy. Thus, we developed a hardware architecture for configurable application-level processing. Our Application Level Processing System (ALPS) is a custom processor that is automatically generated using the syntactic structure of the content. The resulting circuit is mapped onto a reconfigurable device to efficiently extract only the relevant data for the language identifier. To illustrate the effectiveness of the architecture, we have implemented a system that can process electronic mail. Our experiments show that ALPS can improve the accuracy of the hardware language identifier by up to a factor of 200 as compared to a system that does not decode the application-level protocol data.

Streaming Hierarchical Clustering for Concept Mining; by Moshe Looks, Andrew Levine, G. Adam Covington, Ronald P. Loui, John W. Lockwood, Young H. Cho; IEEE Aerospace Conference; Big Sky, MT, March 3-10, 2007; Paper 10.0701

Abstract: We are concerned with the general problem of concept mining: discovering useful associations, relationships, and groupings in large collections of data. Mathematical transformation algorithms have proven effective at reducing the content of multilingual, unstructured data into a vector that describes the content. Such methods are particularly desirable in fields undergoing information explosions, such as network traffic analysis, bioinformatics, and the intelligence community. In response, concept mining methodology is being extended to improve performance and permit hardware implementation; traditional methods are not sufficiently scalable.

Management and Service Discovery in Satellite and Avionic Networks; by Todd Sproull, John W. Lockwood, John Meier; IEEE Aerospace Conference; Big Sky, MT; March 3-10, 2007; Paper 10.1501

Abstract: Command and control services manage network-attached assets deployed in distributed systems that can be separated by thousands of miles. Networks that rely on satellite communications to transit all data to a centralized control center are troubled by high latency due to long propagation delays to satellites and limited data transit over bandwidth-constrained links. Low-latency communications can be achieved by using a combination of distributed airborne and space-based systems. This research investigates how deployment of a Peer-to-Peer (P2P) overlay network in a region of conflict can reduce the latency for real-time control and communication. This overlay network utilizes a hybrid of both satellite and aircraft links to provide services that best satisfy the immediate needs of ground units. Experiments have been performed using 147 compute nodes in the Emulab testbed to study the latency and throughput of the overlay network. The overlay network is developed using a Peer-to-Peer Application Programming Interface (API) called JXTA. Nodes simulate resources requesting and offering several types of video and data services.

Dynamic Optimization of Workload for FPGA Application using Thermal Measurements, by Phillip H. Jones, Young H. Cho, and John W. Lockwood, 20th IEEE/ACM International Conference on VLSI Design, Bangalore, India, January 6-10, 2007.

Abstract: This paper extends the previous work by additionally making this adaptive frequency mechanism workload aware and evaluating power and latency performance under bursty workload conditions. Our working system has been implemented on the Field Programmable Port Extender (FPX) platform developed at Washington University in St. Louis. Experimental results with a scalable image correlation circuit show up to a 30% saving in power for bursty workloads and up to a 2x factor improvement in latency performance as compared to a system without thermal or workload feedback. Our circuit provides power efficient high performance processing of bursty workloads, while ensuring the device always operates within a safe temperature range.

An Adaptive Frequency Control Method Using Thermal Feedback for Reconfigurable Hardware Applications, by Phillip Jones, Young Cho, John Lockwood, IEEE International Conference on Field Programmable Technology (FPT), Bangkok, Thailand, Dec 13-15, 2006.

Abstract: Reconfigurable circuits running in Field Programmable Gate Arrays (FPGAs) can be dynamically optimized for power based on computational requirements and thermal conditions of the environment. In the past, FPGA circuits were typically small and operated at a low frequency. Few users were concerned about high power consumption and the heat generated by FPGA devices. The current generation of FPGAs, however, uses extensive pipelining techniques to achieve high data processing rates and dense layouts that can generate significant amounts of heat. FPGA circuits can be synthesized that generate more heat than the package can dissipate. For FPGAs that operate in controlled environments, heatsinks and fans can be mounted to the device to extract heat from it. When FPGA devices do not operate in a controlled environment, however, changes to ambient temperature due to factors such as the failure of a fan or a reconfiguration of the bitfile running on the device can drastically change the operating conditions. A protection mechanism is needed to ensure the proper operation of the FPGA circuits when such a change occurs. To address these issues, we have devised a reconfigurable temperature monitoring system that gives feedback to the FPGA circuit using the measured junction temperature of the device. Using this feedback, we designed a novel dual frequency switching system that allows the FPGA circuits to maintain the highest level of performance for a given maximum junction temperature. Our working system has been implemented and deployed on the Field Programmable Port Extender (FPX) platform at Washington University in St. Louis. Our experimental results with a scalable image correlation circuit show up to a 2.4x factor increase in performance as compared to a system without thermal feedback. Our circuit ensures that the device performs the maximum required computation while always operating within a safe temperature range.

Fast Packet Classification Using Bloom Filters, by Sarang Dharmapurikar, Haoyu Song, Jonathan Turner, John Lockwood, ACM Symposium on Architectures for Networking and Communications Systems (ANCS), San Jose, CA, Dec 4-5, 2006

Abstract: Ternary Content Addressable Memory (TCAM), although widely used for general packet classification, is an expensive and high power-consuming device. Algorithmic solutions which rely on commodity memory chips are relatively inexpensive and power-efficient but have not been able to match the generality and performance of TCAMs. Therefore, the development of fast and power-efficient algorithmic packet classification techniques continues to be a research subject.
In this paper we propose a new approach to packet classification which combines architectural and algorithmic techniques. Our starting point is the well-known crossproduct algorithm which is fast but has significant memory overhead due to the extra rules needed to represent the crossproducts. We show how to modify the crossproduct method in a way that drastically reduces the memory requirement without compromising on performance. Unnecessary accesses to the off-chip memory are avoided by filtering them through on-chip Bloom filters. For packets that match p rules in a rule set, our algorithm requires just 4 + p + ε independent memory accesses to return all matching rules, where ε << 1 is a small constant that depends on the false positive rate of the Bloom filters. Using two commodity SRAM chips, a throughput of 38 million packets per second can be achieved. For rule set sizes ranging from a few hundred to several thousand filters, the average rule set expansion factor attributable to the algorithm is just 1.2 to 1.4. The average memory consumption per rule is 32 to 45 bytes.

Fast and Scalable Pattern Matching for Network Intrusion Detection Systems, by Sarang Dharmapurikar and John W. Lockwood, IEEE Journal on Selected Areas in Communications (JSAC), Oct. 2006, Volume: 24, Issue: 10, pp. 1781-1792

Abstract: High-speed packet content inspection and filtering devices rely on a fast multi-pattern matching algorithm which is used to detect predefined keywords or signatures in the packets. Multi-pattern matching is known to require intensive memory accesses and is often a performance bottleneck. Hence specialized hardware-accelerated algorithms are required for line-speed packet processing.
We present a hardware-implementable pattern matching algorithm for content filtering applications, which is scalable in terms of speed, the number of patterns and the pattern length. Our algorithm is based on a memory-efficient multi-hashing data structure called a Bloom filter. We use embedded on-chip memory blocks in FPGA/VLSI chips to construct Bloom filters which can suppress a large fraction of memory accesses and speed up string matching. Based on this concept, we first present a simple algorithm which can scan for several thousand short (up to 16 bytes) patterns at multi-gigabit per second speeds with a moderately small amount of embedded memory and a few megabytes of external memory. Furthermore, we modify this algorithm to be able to handle arbitrarily large strings at the cost of a little more on-chip memory. We demonstrate the merit of our algorithm through theoretical analysis and simulations performed on Snort's string set.
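
The idea of using a Bloom filter to suppress slow memory accesses can be illustrated with a minimal Python sketch. This is not the authors' hardware design: the class `BloomFilter`, the function `scan`, the SHA-256-based hashing, and the filter sizes are all illustrative assumptions, and the sketch handles a single fixed pattern length only. The Python set stands in for the exact-match table held in external memory.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter (hypothetical; stands in for an on-chip filter)."""

    def __init__(self, num_bits=1024, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, item: bytes):
        # Derive num_hashes bit positions by salting SHA-256 with a seed byte.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(bytes([seed]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: bytes) -> bool:
        # False means "definitely absent"; True may be a false positive.
        return all(self.bits[pos] for pos in self._positions(item))


def scan(payload: bytes, patterns: set, pattern_len: int, bloom: BloomFilter):
    """Slide a window over the payload; consult the exact table only on a filter hit."""
    matches = []
    slow_lookups = 0  # counts accesses to the (simulated) external memory
    for i in range(len(payload) - pattern_len + 1):
        window = payload[i:i + pattern_len]
        if bloom.might_contain(window):
            slow_lookups += 1
            if window in patterns:  # exact check removes false positives
                matches.append((i, window))
    return matches, slow_lookups
```

Because a Bloom filter never reports a stored item as absent, every true match still reaches the exact check; only windows that the filter rejects skip the slow lookup.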

A Thermal Management and Profiling Method for Reconfigurable Hardware Applications; by Phillip H. Jones, John W. Lockwood, Young H. Cho; 16th Annual Conference on Field Programmable Logic and Applications (FPL); Madrid, Spain, August 28-30, 2006; pp. 103-109.

Abstract: Given large circuit sizes, high clock frequencies, and possibly extreme operating environments, Field Programmable Gate Arrays (FPGAs) are capable of heating beyond their designed thermal limits. As new circuits are developed for FPGAs and deployed remotely, engineers are challenged to determine in advance if the device will operate within recommended thermal ranges. The amount of power consumed by the circuit depends on how an algorithm is compiled into hardware, how the circuit is placed and routed, and the patterns of data that pass through the system. The amount of heat that can be dissipated depends on the thermal transfer characteristics of the package, the air flow that passes over the package, and the ambient temperature of the remote systems. Rather than designing a system to handle unreasonable worst-case situations, we have implemented a thermal management system that continuously monitors the temperature of the FPGA and reprograms the device if the temperature approaches the outer limits of safe operating conditions. Our system measures the junction temperature of a Xilinx Virtex FPGA using a built-in thermal diode. Using the temperature monitoring mechanism, we have studied the steady-state and transient conditions of multiple benchmark circuits implemented in FPGA logic on the Field-programmable Port Extender (FPX) development platform. We observed properties of these benchmark circuits that enable us to predict power and thermal characteristics for real applications. We propose a Dynamic Thermal Management (DTM) strategy for FPGAs based on temperature feedback.

High Speed Document Clustering in Reconfigurable Hardware; by G. Adam Covington, Charles L.G. Comstock, Andrew A. Levine, John W. Lockwood, Young H. Cho; 16th Annual Conference on Field Programmable Logic and Applications (FPL); Madrid, Spain, August 28-30, 2006; pp. 411-417.

Abstract: High-performance document clustering systems enable similar documents to be automatically organized into groups. In the past, the large amount of computational time needed to cluster documents prevented practical use of such systems with a large number of documents. A full hardware implementation of the K-means clustering algorithm has been designed and implemented in reconfigurable hardware that clusters 512k documents rapidly. This implementation uses four parallel cosine distance metrics to cluster document vectors that each have 4000 dimensions. The synthesized hardware runs on the Field Programmable Port Extender (FPX) platform at a clock rate of 80 MHz. Although the clock rate on the Xilinx VirtexE 2000 is slower than a CPU, the implementation runs 26 times faster than an algorithmically equivalent software implementation running on an Intel 3.60 GHz Xeon. The same architecture was used to synthesize a faster and larger design for the Xilinx Virtex4 LX200. This larger implementation can contain up to 25 parallel cosine distance metrics. The implementation synthesized with a clock rate of 250 MHz and outperforms the equivalent software by a factor of 328.
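
As a rough software illustration of the assignment step of K-means with a cosine distance metric, the following Python sketch may help. It is not the hardware design from the paper: the function names `cosine_distance` and `assign_to_cluster`, and the convention of treating a zero vector as maximally distant, are assumptions made for this example.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cos(angle between a and b); zero vectors get distance 1.0 (assumed convention)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def assign_to_cluster(doc, centroids):
    """Return the index of the centroid closest to doc under cosine distance."""
    return min(range(len(centroids)),
               key=lambda k: cosine_distance(doc, centroids[k]))
```

Each call to `cosine_distance` is independent of the others for a given document, which is the kind of work that can be evaluated by several distance units in parallel.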

A Reconfigurable Architecture for Multi-Gigabit Speed Content-Based Routing, by James Moscola, Young H. Cho, and John W. Lockwood, IEEE Symposium on High Performance Interconnects (Hot Interconnects-14), Stanford, CA, August 23-24, 2006, pp. 61-66.

Abstract: This paper presents a reconfigurable architecture for high-speed content-based routing. Our architecture goes beyond simple pattern matching by implementing a parsing engine that defines the semantics of patterns that are parsed within the data stream. Defining the semantics of patterns allows for more accurate processing and routing of packets using any fields that appear within the payload of the packet. The architecture consists of several components, including a pattern matcher, a parsing structure, and a routing module. Both the pattern matcher and parsing structure are automatically generated using an application-specific compiler that is described in this paper. The compiler accepts a grammar specification as input and outputs a data parser in VHDL. The routing module receives control signals from both the pattern matcher and the parsing structure that aid in the routing of packets. We illustrate how a content-based router can be implemented with our technique using an XML parser as an example. The XML parser presented was designed, implemented, and tested in a Xilinx Virtex XCV2000E FPGA on the FPX platform. It is capable of processing 32 bits of data per clock cycle and runs at 100 MHz. This allows the system to process and route XML messages at 3.2 Gbps.
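
The 3.2 Gbps figure follows from the datapath width multiplied by the clock rate. A small Python sketch of that arithmetic, with the hypothetical helper name `throughput_gbps`, and assuming one full word is consumed on every clock cycle:

```python
def throughput_gbps(bits_per_cycle: int, clock_mhz: float) -> float:
    """Peak throughput in Gbps, assuming one word is processed every cycle."""
    bits_per_second = bits_per_cycle * clock_mhz * 1e6
    return bits_per_second / 1e9


# Figures taken from the abstract: 32 bits per cycle at 100 MHz.
xml_parser_gbps = throughput_gbps(32, 100)
```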

Rethinking Hardware Support for Network Analysis and Intrusion Prevention, by Vern Paxson, Krste Asanovic, Sarang Dharmapurikar, John Lockwood, Ruoming Pang, Robin Sommer, Nick Weaver; USENIX First Workshop on Hot Topics in Security (HotSec), Vancouver, B.C., July 31, 2006.

Abstract: The performance pressures on implementing effective network security monitoring are growing fiercely due to rising traffic rates, the need to perform much more sophisticated forms of analysis, the requirement for inline processing, and the collapse of Moore's law for sequential processing. Given these growing pressures, we argue that it is time to fundamentally rethink the nature of using hardware to support network security analysis. Clearly, to do so we must leverage massively parallel computing elements, as only these can provide the necessary performance. The key, however, is to devise an abstraction of parallel processing that will allow us to expose the parallelism latent in semantically rich, stateful analysis algorithms; and that we can then further compile to hardware platforms with different capabilities.

Context-Free Grammar based Token Tagger in Reconfigurable Devices, by Young H. Cho, James Moscola, John W. Lockwood; Proceedings of International Workshop on Data Engineering (ICDE/SeNS), Atlanta, GA, April 3-7, 2006.

Abstract: In this paper, we present a reconfigurable hardware architecture for detecting semantics of streaming data on 1+ Gbps networks. The design leverages the characteristics of context-free grammars (CFGs) that allow computers to understand the semantics of data. Although our parser is not a true CFG parser, we use the linguistic structure defined in the grammars to explore a new way of parsing data using Field Programmable Gate Array (FPGA) hardware. Our system consists of pattern matchers and a syntax detector. The pattern matchers are automatically generated using the grammar token list while the syntax detector is generated based on the aspects of the grammar that define the order of all possible token sequences. Since all the rules are mapped onto the hardware as parallel processing engines, the meaning of each token can be determined by monitoring where it is being processed. Our highly parallel and fine-grained pipelined engines can operate at a frequency above 500 MHz. Our initial implementation is an XML content-based router for XML remote procedure calls (RPC). The implementation can process the data at 1.57 Gbps on a Xilinx VirtexE FPGA and 4.26 Gbps on the Virtex 4 FPGA.

Hardware Accelerated Algorithms for Semantic Processing of Document Streams, by John W. Lockwood, Stephen G. Eick, Justin Mauger, John Byrnes, Ron Loui, Andrew Levine, Doyle J. Weishar, Alan Ratner, IEEE Aerospace Conference, Big Sky, MT, March 4-11, 2006, Paper 10.0802

Abstract: There is a need within the intelligence communities to analyze massive streams of multilingual unstructured data. Mathematical transformation algorithms have proven effective at interpreting multilingual, unstructured data, but high computational requirements of such algorithms prevent their widespread use. The rate of computation can be vastly increased with Field Programmable Gate Array (FPGA) hardware.
To experiment with this approach, we developed a system with FPGAs that ingests content over a network at high data rates. The system extracts basewords, counts words, scores documents, and discovers concepts on data that are carried in TCP/IP network flows as packets over a Gigabit Ethernet link or in cells transported over an OC48 link. These algorithms, as implemented in FPGA hardware, introduce certain constraints on the complexity and richness of the semantic processing algorithms.
To understand the implications of these constraints and to benchmark the performance of the system, we have performed a series of experiments processing multilingual documents. In these experiments, we compare techniques to generate basewords for our semantic concepts, score documents, and discover concepts across a variety of processing operational scenarios.

Cycle-Accurate Microarchitecture Performance Evaluation, by Richard Hough, Phillip Jones, Scott Friedman, Roger Chamberlain, Jason Fritts, John Lockwood, and Ron Cytron, In Proceedings of Workshop on Introspective Architecture, Austin, TX, February 2006.

Abstract: We present the design, implementation, and evaluation of a circuit we call the Statistics Module that captures cycle-accurate performance data at (or above) the microarchitecture layer. The circuit is deployed introspectively (in the architecture itself) using an FPGA in the context of a soft-core implementation of a SPARC architecture (LEON). Accessible over the Internet, the circuit can be dynamically configured (without resynthesis) to capture program-level, function-level, and instruction-level statistics on any subset of predefined VHDL signals. The circuit is deployed outside the actual soft core, so that its operation does not interfere with a program's execution at any level.
In contrast with simulations, StatsMod monitors actual real-time program executions, including runtime artifacts such as multithreading, operating system support, and external interrupts. Furthermore, unlike software-introduced instrumentation, the measurements do not affect the statistics, and microarchitecture characteristics are easily captured.
Our design avoids the otherwise combinatorial size of circuitry that would be required to accommodate all methods and events, scaling well with the number of artifacts that are actually measured. We have used this circuit to measure cycle-accurate cache-RAM statistics, such as cache hits and misses, RAM reads and writes, using both write-through and write-back policies. In this paper, we show the scalability of our design as it accommodates more methods and events.

Multi-pattern Signature Matching for Hardware Network Intrusion Detection Systems, by Haoyu Song and John W. Lockwood, IEEE Globecom 2005, St. Louis, MO, Nov. 28, 2005, pp. CN-02-3.

Abstract: A Network Intrusion Detection System (NIDS) performs deep inspection of the packet payload to identify, deter, and contain malicious attacks over the Internet. It needs to perform exact matching on multi-pattern signatures in real time. In this paper we introduce an efficient data structure called the Extended Bloom Filter (EBF) and the corresponding algorithm to perform multi-pattern signature matching. We also present a technique to support long signature matching so that we need only maintain a limited number of supported signature lengths for the EBFs. We show that at reasonable hardware cost we can achieve very fast and almost time-deterministic exact matching for thousands of signatures. The architecture takes advantage of embedded multi-port memories in FPGAs and can be used to build a full-featured hardware-based NIDS.

Optimizing Memory Bandwidth of a Multi-Channel Packet Buffer, by Sarang Dharmapurikar, Sailesh Kumar, John W. Lockwood, and Patrick Crowley; IEEE Globecom 2005, St. Louis, MO, Nov. 28, 2005, pp. CG-3-8.

Abstract: Backbone routers typically require large buffers to hold packets during congestion. A rule of thumb is to provide a buffer at every link equal to the product of the round trip time and the link capacity. This translates into Gigabytes of buffers operating at line rate at every link. Such a size and rate necessitates the use of SDRAM with bandwidth of, for example, 80 Gbps for a link speed of 40 Gbps. With speedup in the switch fabrics used in most routers, the bandwidth requirement of the buffer increases further. While multiple SDRAM devices can be used in parallel to achieve high bandwidth and storage capacity, a wide logical data bus composed of these devices results in suboptimal performance for arbitrarily sized packets. An alternative is to divide the wide logical data bus into multiple logical channels and store packets into them independently. However, in such an organization, the cumulative pin count grows due to additional address buses, which might offset the performance gained. We find that due to several existing memory technologies and their characteristics and with Internet traffic composed of particular sized packets, a judiciously architected data channel can greatly enhance the performance per pin. In this paper, we derive an expression for the effective memory bandwidth of a parallel channel packet buffer and show how it can be optimized for a given number of I/O pins available for interfacing to memory. We believe that our model can greatly aid packet buffer designers to achieve the best performance.
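
The buffer-sizing rule of thumb above can be worked through numerically. In the Python sketch below, the 250 ms round trip time is an assumed example value that does not appear in the abstract, and the factor of two in `buffer_memory_bandwidth_bps` reflects the assumption that each packet is written to the buffer once and read from it once, which is one plausible reading of the 80 Gbps figure quoted for a 40 Gbps link. Both function names are hypothetical.

```python
def buffer_size_bytes(rtt_seconds: float, link_bps: float) -> float:
    """Rule-of-thumb buffer size: round trip time times link capacity, in bytes."""
    return rtt_seconds * link_bps / 8


def buffer_memory_bandwidth_bps(link_bps: float) -> float:
    """Memory bandwidth if every packet is written once and read once (assumption)."""
    return 2 * link_bps


link_bps = 40e9          # 40 Gbps link, as in the abstract
assumed_rtt = 0.25       # 250 ms round trip time (assumed example value)

size_gigabytes = buffer_size_bytes(assumed_rtt, link_bps) / 1e9
bandwidth_gbps = buffer_memory_bandwidth_bps(link_bps) / 1e9
```

Under these assumptions the buffer comes to 1.25 gigabytes and the memory bandwidth to 80 Gbps, in line with the orders of magnitude the abstract mentions.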

Shape Shifting Tries for Faster IP Route Lookup, by Haoyu Song, Jonathan Turner and John Lockwood. Proceedings of the IEEE International Conference on Network Protocols (ICNP), Boston, MA, Nov. 6, 2005, pp. 358-367.

Abstract: Some of the fastest practical algorithms for IP route lookup are based on space-efficient encodings of multi-bit tries. Unfortunately, the time required by these algorithms grows in proportion to the address length, making them less attractive for IPv6. This paper describes and evaluates a new data structure called a shape-shifting trie, in which the data structure nodes correspond to arbitrarily shaped subtrees of the underlying binary trie for a given set of address prefixes. The ability to adapt the node shape to the trie reduces the number of nodes that must be accessed to perform a lookup, especially for tries with large sparse regions. We give a fast algorithm for optimally dividing a trie into nodes so as to minimize the maximum lookup depth. We show that seven data structure accesses are sufficient for route tables with more than 150,000 IPv6 prefixes. This makes it possible to achieve wire-speed processing for an OC192 link using a single QDRII SRAM chip.

Fast and Scalable Pattern Matching for Content Filtering, by Sarang Dharmapurikar, John Lockwood, Proceedings of Symposium on Architectures for Networking and Communications Systems (ANCS), Princeton, NJ, Oct 2005.

Abstract: High-speed packet content inspection and filtering devices rely on a fast multi-pattern matching algorithm which is used to detect predefined keywords or signatures in the packets. Multi-pattern matching is known to require intensive memory accesses and is often a performance bottleneck. Hence specialized hardware-accelerated algorithms are being developed for line-speed packet processing. While several pattern matching algorithms have already been developed for such applications, we find that most of them suffer from scalability issues. To support a large number of patterns, the throughput is compromised or vice versa.
We present a hardware-implementable pattern matching algorithm for content filtering applications, which is scalable in terms of speed, the number of patterns and the pattern length. We modify the classic Aho-Corasick algorithm to consider multiple characters at a time for higher throughput. Furthermore, we suppress a large fraction of memory accesses by using Bloom filters implemented with a small amount of on-chip memory. The resulting algorithm can support matching of several thousands of patterns at more than 10 Gbps with the help of less than 50 KBytes of embedded memory and a few megabytes of external SRAM. We demonstrate the merit of our algorithm through theoretical analysis and simulations performed on Snort's string set.

Fast Hash Table Lookup Using Extended Bloom Filter: An Aid to Network Processing, by Haoyu Song, Sarang Dharmapurikar, Jonathan Turner, John Lockwood; ACM SIGCOMM; Philadelphia, PA; August 21-26, 2005.

Abstract: The hash table is used as one of the fundamental modules in several network processing algorithms and applications such as route lookup, packet classification, per-flow state management and network monitoring. These applications, which typically form components of the data path in a high-speed router, must process and forward packets with little or no buffer in order to maintain wire-speed throughput. A poorly designed hash table can critically affect the worst-case throughput due to multiple memory accesses required for each lookup. Hence, the high throughput requirement in turn underscores the need for a hash table having good and more predictable worst-case lookup performance. While most of the existing hash table based packet processing algorithms rely on the assumption that hash table lookup needs constant time, very little discussion is provided on the underlying engineering considerations to achieve this performance.
We present a novel hash table data structure and lookup algorithm which improves the performance of a naive hash table by providing better bounds on the hash collisions and memory accesses per search. Our algorithm extends the multiple-hashing Bloom Filter data structure to support exact matches. We contrive our hash table architecture by coupling our algorithm with the latest advances in embedded memory technology. Through theoretical analysis and simulations we show that our algorithm is significantly faster for practical purposes than the naive hash table using the same amount of memory, hence it can support better throughput for router applications based on hash tables.

Automatic Application-Specific Microarchitecture Reconfiguration; by Shobana Padmanabhan, Ron K. Cytron, Roger D. Chamberlain and John W. Lockwood; 13th Reconfigurable Architectures Workshop (RAW), Apr 25-26, 2006

Abstract: Applications for constrained embedded systems are subject to strict time constraints and restrictive resource utilization. With soft core processors, application developers can customize the processor for their application, constrained by resources but aimed at high application performance. With such freedom in the design space of the processor, however, comes complexity. We present here an automatic optimization technique that helps the developers with the processor microarchitecture customization.
A naive approach exploring all possible configurations is exponential with the number of parameters and hence is clearly infeasible, even with only tens of reconfigurable parameters. Instead, our approach runs in time that is linear with the number of parameter values, based on an assumption of parameter independence. This makes the approach feasible and scalable. For the dimensions that we customize, namely application runtime and hardware resources, we formulate their costs as a constrained binary integer nonlinear optimization program. Though the results are not guaranteed to be optimal, we find they are near-optimal in practice. Our technique itself is general and can be applied to other design-space exploration problems.

HAIL: A Hardware-Accelerated Algorithm for Language Identification; by Charles M. Kastner, G. Adam Covington, Andrew A. Levine, John W. Lockwood; 15th Annual Conference on Field Programmable Logic and Applications (FPL); August 24-26, 2005

Abstract: A hardware-accelerated algorithm has been designed to automatically identify the primary languages used in documents transferred over the Internet. The algorithm has been implemented in hardware on the Field Programmable Port Extender (FPX) platform. This system, referred to as the Hardware-Accelerated Identification of Languages (HAIL) project, identifies the primary languages used in content transferred over Transmission Control Protocol (TCP) / Internet Protocol (IP) networks that operate at rates exceeding 2.4 Gigabits/second. We demonstrate that this hardware-accelerated circuit, operating on a Xilinx XCV2000E-8 FPGA, far outperforms software algorithms running on modern personal computers while maintaining extremely high levels of accuracy.

SIFT: Snort Intrusion Filter for TCP, by Michael Attig and John W. Lockwood, 13th Annual Proceedings of Hot Interconnects (HotI-13), Stanford, CA, August 17-19, 2005.

Abstract: Intrusion rule processing in reconfigurable hardware enables intrusion detection and prevention services to run at multi-Gigabit/second rates. High-level intrusion rules mapped directly into hardware separate malicious content from benign content in network traffic. Hardware parallelism allows intrusion systems to scale to support fast network links, such as OC-192 and 10 Gbps Ethernet. In this paper, a Snort Intrusion Filter for TCP (SIFT) is presented that operates as a preprocessor to prevent benign traffic from being inspected by an intrusion monitor running Snort. Snort is a popular open-source rule-processing intrusion system. SIFT selectively forwards IP packets that contain questionable headers or defined signatures to a PC where complete rule processing is performed. SIFT alleviates the need for most network traffic to be inspected by software. Statistics, like how many packets match rules, are used to optimize rule processing systems. SIFT has been implemented and tested in FPGA hardware and used to process Internet traffic from a campus Internet backbone with live data.

Extracting and Improving Microarchitecture Performance on Reconfigurable Architectures; by Shobana Padmanabhan, Phillip Jones, David V. Schuehler, Scott J. Friedman, Praveen Krishnamurthy, Huakai Zhang, Roger Chamberlain, Ron K. Cytron, Jason Fritts and John W. Lockwood; International Journal on Parallel Programming (IJPP), 33(2-3):115-136, June 2005

Abstract: Applications for constrained embedded systems require careful attention to the match between the application and the support offered by an architecture, at the ISA and microarchitecture levels. Generic processors, such as ARM and Power PC, are inexpensive, but with respect to a given application, they often overprovision in areas that are unimportant for the application's performance. Moreover, while application-specific, customized logic could dramatically improve the performance of an application, that approach is typically too expensive to justify its cost for most applications. In this paper, we describe our experience using reconfigurable architectures to develop an understanding of an application's performance and to enhance its performance with respect to customized, constrained logic. We begin with a standard ISA currently in use for embedded systems. We modify its core to measure performance characteristics, obtaining a system that provides cycle-accurate timings and presents results in the style of gprof, but with absolutely no software overhead. We then provide cache-behavior statistics that are typically unavailable in a generic processor. In contrast with simulation, our approach executes the program at full speed and delivers statistics based on the actual behavior of the cache subsystem. Finally, in response to the performance profile developed on our platform, we evaluate various uses of the FPGA-realized instruction and data caches in terms of the application's performance.

A Framework for Rule Processing in Reconfigurable Network Systems, by Michael Attig and John W. Lockwood, In Proceedings of: IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), Napa, CA, April 18-20, 2005. [Slides]

Abstract: High-performance rule processing systems are needed by network administrators in order to protect Internet systems from attack. Researchers have been working to implement components of intrusion detection systems (IDS), such as the highly popular Snort system, in reconfigurable hardware. While considerable progress has been made in the areas of string matching and header processing, complete systems have not yet been demonstrated that effectively combine all of the functionality necessary to perform rule processing for network systems. In this paper, a framework for implementing a rule processing system in reconfigurable hardware is presented. The framework integrates the functionality to scan data flows for regular expressions, fixed strings, and header values. It also allows modules to be added to perform extended functionality to support all features found in Snort rules. Reconfigurability and flexibility are key components of the framework that enable it to adapt to protect Internet systems from threats including malicious worms, computer viruses, and network intruders. To prove the framework viable, a system has been built that scans all bytes of Transmission Control Protocol/Internet Protocol (TCP/IP) traffic entering and leaving a network's gateway at multi-gigabit rates. Using Xilinx FPGA hardware on the Field programmable Port eXtender (FPX) platform, the framework can process 32,768 complex rules at data rates of 2.5 Gbps. Systems to handle data at 10 Gbps rates can be built today using the same framework in the latest reconfigurable hardware devices such as the Virtex 4.

Transformation Algorithms for Data Streams, by John W. Lockwood, Stephen G. Eick, Doyle J. Weishar, Ron Loui, James Moscola, Chip Kastner, Andrew Levine, Mike Attig, IEEE Aerospace Conference, Big Sky, Montana, March 2005.

Abstract: Next-generation data processing systems must deal with very high data ingest rates and massive volumes of data. Such conditions are typically encountered in the Intelligence Community (IC) where analysts must search through huge volumes of data in order to gather evidence to support or refute their hypotheses. Their effort is made all the more difficult given that the data appears as unstructured text that is written in multiple languages using characters that have different encodings. Human analysts have not been able to keep pace with reading the data and a large amount of data is discarded even though it might contain key information. The goal of our project is to assess the feasibility of incrementally replacing humans with automation in key areas of information processing. These areas include document ingest, content categorization, language translation, and context- and temporally-based information retrieval.
Mathematical transformation algorithms, when implemented in rapidly reconfigurable hardware, offer the potential to continuously (re)process and (re)interpret extremely high volumes of multi-lingual, unstructured text data. These technologies can automatically elicit the semantics of streaming input data, organize the data by concept (regardless of language), and associate related concepts in order to parameterize models. To test that hypothesis, we are building an experimentation testbed that enables the rapid implementation of semantic processing algorithms in hardware. The system includes a high-performance infrastructure that includes a hardware-accelerated content processing platform; mass storage to hold training data, test data, and experiment scenarios; and tools for analysis and visualization of the data.
In our first use of the testbed, we performed an experiment where we implemented three transformation algorithms using FPX hardware platforms to perform semantic processing on document streams. Our platform uses Field-programmable Port Extender (FPX) modules developed at Washington University in Saint Louis. This paper describes our approach to building the experimental hardware platform components, discusses the major features of the circuit designs, overviews our first experiment, and offers a detailed discussion of the results, which are promising.

Efficient Packet Classification for Network Intrusion Detection using FPGA, by Haoyu Song and John W. Lockwood, International Symposium on Field-Programmable Gate Arrays (FPGA'05), Monterey, CA, Feb 20-22, 2005. [Slides]

Abstract: FPGA technology has become widely used for real-time network intrusion detection. In this paper, a novel packet classification architecture called BV-TCAM is presented, which is implemented for an FPGA-based Network Intrusion Detection System (NIDS). The classifier can report multiple matches at gigabit per second network link rates. The BV-TCAM architecture combines the Ternary Content Addressable Memory (TCAM) and the Bit Vector (BV) algorithm to effectively compress the data representations and boost throughput. A tree-bitmap implementation of the BV algorithm is used for source and destination port lookup while a TCAM performs the lookup of the other header fields, which can be represented as a prefix or exact value. The architecture eliminates the requirement for prefix expansion of port ranges. With the aid of a small embedded TCAM, packet classification can be implemented in a relatively small part of the available logic of an FPGA. The design is prototyped and evaluated in a Xilinx FPGA XCV2000E on the FPX platform. Even with the most difficult set of rules and packet inputs, the circuit is fast enough to sustain OC48 traffic throughput. Using larger and faster FPGAs, the system can work at speeds greater than OC192.
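
The way per-field results combine into a multi-match answer can be sketched in software. The code below is a minimal illustration, not the BV-TCAM circuit: it assumes a hypothetical rule format of `(value, mask, (port_low, port_high))`, uses a linear scan in place of both the TCAM and the tree-bitmap port lookup, and represents each bit vector as a Python integer whose bit i stands for rule i.

```python
def tcam_match(rules, header):
    """Bit i is set when ternary entry i matches the header field.

    A ternary entry matches when the header agrees with `value`
    on every bit selected by `mask`; unselected bits are don't-care.
    """
    vector = 0
    for i, (value, mask, _ports) in enumerate(rules):
        if (header & mask) == value:
            vector |= 1 << i
    return vector


def port_match(rules, port):
    """Bit i is set when the port falls inside rule i's port range.

    Ranges are compared directly, so no prefix expansion is needed.
    """
    vector = 0
    for i, (_value, _mask, (low, high)) in enumerate(rules):
        if low <= port <= high:
            vector |= 1 << i
    return vector


def classify(rules, header, port):
    """Return the indices of every rule matching both fields."""
    combined = tcam_match(rules, header) & port_match(rules, port)
    return [i for i in range(len(rules)) if (combined >> i) & 1]


# Hypothetical rule set: (address value, address mask, port range).
RULES = [
    (0x0A000000, 0xFF000000, (0, 65535)),     # 10.0.0.0/8, any port
    (0x0A010000, 0xFFFF0000, (80, 80)),       # 10.1.0.0/16, port 80
    (0xC0A80000, 0xFFFF0000, (1024, 65535)),  # 192.168.0.0/16, high ports
]

print(classify(RULES, 0x0A010203, 80))  # address 10.1.2.3, port 80
```

Because the final step is a bitwise AND, every rule that matches is reported, not only the highest-priority one.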

Wide-area Hardware-accelerated Intrusion Prevention Systems (WHIPS), by Todd Sproull and John Lockwood, International Working Conference on Active Networking (IWAN), October 27-29, 2004, Lawrence, Kansas, USA.

Abstract: The proliferation of computer viruses and Internet worms has had a major impact on the Internet Community. Cleanup and control of malicious software (malware) has become a key problem for network administrators. Effective techniques are now needed to protect networks against outbreaks of malware. Wire-speed firewalls have been widely deployed to limit the flow of traffic from untrusted domains. But the weakness of these devices lies in their limited ability to protect networks from infected machines on otherwise trusted networks. Progressive network administrators have been using an Intrusion Prevention System (IPS) to actively block the flow of malicious traffic. New types of active and extensible network systems that use both microprocessors and reconfigurable logic can perform wire-speed services in order to protect networks against computer virus and Internet worm propagation. This paper discusses a scalable system that makes use of automated worm detection and intrusion prevention to stop the spread of computer viruses and Internet worms using extensible hardware components distributed throughout a network. The contribution of this work is to present how to manage and configure large numbers of distributed and extensible IPSs.

Extracting and Improving Microarchitecture Performance on Reconfigurable Architectures; by Shobana Padmanabhan, Phillip Jones, David V. Schuehler, Scott J. Friedman, Praveen Krishnamurthy, Huakai Zhang, Roger Chamberlain, Ron K. Cytron, Jason Fritts, and John W. Lockwood; Workshop on Compilers and Tools for Constrained Embedded Systems (CTCES), Washington DC, Sep 22, 2004.

Abstract: We describe our experience using reconfigurable architectures to develop an understanding of an application's performance and to enhance its performance with respect to customized, constrained logic. We begin with a standard ISA currently in use for embedded systems. We modify its core to measure performance characteristics, obtaining a system that provides cycle-accurate timings and presents results in the style of gprof, but with absolutely no software overhead. We then provide cache-behavior statistics that are typically unavailable in a generic processor. In contrast with simulation, our approach executes the program at full speed and delivers statistics based on the actual behavior of the cache subsystem. Finally, in response to the performance profile developed on our platform, we evaluate various uses of the FPGA-realized instruction and data caches in terms of the application's performance.

A Modular System for FPGA-based TCP Flow Processing in High-Speed Networks; by David Schuehler, John Lockwood; 14th International Conference on Field Programmable Logic and Applications (FPL), Springer LNCS 3203, Antwerp, Belgium, August 2004, pp. 301-310.

Abstract: Field Programmable Gate Arrays (FPGAs) can be used in Intrusion Prevention Systems (IPS) to inspect application data contained within network flows. An IPS operating on high-speed network traffic can be used to stop the propagation of Internet worms and to protect networks from Denial of Service (DoS) attacks. When used in the backbone of a core network, the device will be exposed to millions of active flows simultaneously. In order to protect the data in each connection, network devices will need to track the state of every flow. This must be done at multi-gigabit line rates without introducing significant delays. This paper describes a high performance TCP processing system called TCP-Processor which supports flow processing in high-speed networks utilizing multiple devices. This circuit provides stateful flow tracking, TCP stream reassembly, context storage, and flow manipulation services for applications which process TCP data streams. A simple client interface eases the complexities associated with processing TCP data streams. In addition, a set of encoding and decoding circuits has been developed which efficiently transports this interface between multiple FPGA devices. The circuit has been implemented in FPGA hardware and tested using live Internet traffic.

Automated Method to Generate Bitstream Intellectual Property Cores for Virtex FPGAs; by Edson Horta and John W. Lockwood; 14th International Conference on Field Programmable Logic and Applications (FPL), Springer LNCS 3203, Antwerp, Belgium, August 2004, pp. 975-979.

Abstract: This paper presents an innovative way to deploy Bitstream Intellectual Property (BIP) cores. By using standard tools to generate bitstreams for Field Programmable Gate Arrays (FPGAs) and a tool called PARBIT, it is possible to extract a partial bitstream containing a modular component developed on one Virtex FPGA that can be placed or relocated inside another Virtex FPGA. The methodology to obtain the BIP cores is explained, along with details about PARBIT and Virtex devices.

Design of a System for Real-Time Worm Detection, by Bharath Madhusudan and John Lockwood; 12th Annual Proceedings of IEEE Hot Interconnects (HotI-12); Stanford, CA, August, 2004, pp. 77-83.

Abstract: Recent well-publicized attacks have made it clear that worms constitute a threat to Internet security. Systems that secure networks against malicious code are expected to be a part of critical Internet infrastructure in the future. Intrusion Detection and Prevention Systems (IDPS) currently have limited use because they can filter only known worms. In this paper, we present the design and implementation of a system that automatically detects new worms in real-time by monitoring traffic on a network. The system uses Field Programmable Gate Arrays (FPGAs) to scan packets for patterns of similar content. Given that a new worm hits the network and the rate of infection is high, the system is automatically able to detect an outbreak. Frequently occurring strings in packet payloads are instantly reported as likely worm signatures.
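
The idea of flagging frequently occurring payload content can be sketched in a few lines of software. This is a rough illustration under stated assumptions, not the hardware design: the window size, the threshold, the exact counting with a `Counter`, and the function name `frequent_substrings` are all hypothetical choices made for the example.

```python
from collections import Counter


def frequent_substrings(payloads, window=8, threshold=3):
    """Return every `window`-byte string seen in at least `threshold` payloads."""
    counts = Counter()
    for payload in payloads:
        # Count each distinct window once per packet, so a single packet
        # that repeats itself cannot reach the threshold alone.
        seen = {payload[i:i + window] for i in range(len(payload) - window + 1)}
        counts.update(seen)
    return sorted(s for s, n in counts.items() if n >= threshold)


# Hypothetical traffic: three packets share the same 8-byte string.
packets = [b"xxEVILCODEyy", b"aaEVILCODEbb", b"EVILCODE!!", b"harmless data"]
print(frequent_substrings(packets))
```

A string that recurs across many packets is only a candidate signature; common benign content would also be reported by a sketch this simple.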

Liquid Architecture, by Phillip Jones, Shobana Padmanabhan, Daniel Rymarz, John Maschmeyer, David V. Schuehler, John W. Lockwood, and Ron K. Cytron; International Parallel and Distributed Processing Symposium: Workshop on Next Generation Software (NSF-NGS); Santa Fe, NM, April 26, 2004, Paper: W10-NSFNGS-13.

Abstract: We present an implementation of a liquid-architecture system that supports efficient development, prototyping, and performance evaluation of custom architectures. The implementation integrates the LEON soft-core, SPARC-compatible processor into the Field-programmable Port Extender (FPX). The resulting platform can be instantiated, configured, and executed via the Internet.

Implementation Results of Bloom Filters for String Matching, by Michael Attig, Sarang Dharmapurikar, and John Lockwood; In Proceedings of: IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), Napa, CA, April 20-23, 2004.

Abstract: Network Intrusion Detection and Prevention Systems (IDPS) use string matching to scan Internet packets for malicious content. Bloom filters offer a mechanism to search for a large number of strings efficiently and concurrently when implemented with Field Programmable Gate Array (FPGA) technology. A string matching circuit has been implemented within the FPX platform using Bloom filters. Using 155 block RAMs on a single Xilinx VirtexE 2000 FPGA, the circuit scans for 35,475 unique signatures.

Secure Remote Control of Field-programmable Network Devices, by Haoyu Song, Jing Lu, John Lockwood, James Moscola; In Proceedings of: IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), Napa, CA, April 20-23, 2004.

Abstract: A circuit and an associated lightweight protocol have been developed to secure communication between a control console and remote programmable network devices. The circuit provides encryption, data integrity checking and sequence number verification to ensure confidentiality, integrity and authentication of control messages sent over the public Internet. All of these functions are performed directly in FPGA hardware to provide high throughput and near-zero latency. The circuit has been used to control and configure remote firewalls and intrusion detection systems. The circuit could also be used to control and configure other distributed network applications.

Deep Packet Inspection using Parallel Bloom Filters, by Sarang Dharmapurikar, Praveen Krishnamurthy, Todd S. Sproull, John W. Lockwood; IEEE Micro, Vol. 24, No. 1, Jan 2004, pp. 52-61

Abstract: Because conventional software-based packet inspection algorithms have not kept pace with high-speed networks, interest has turned to using hardware to process network data quickly. String scanning with Bloom filters can scan entire packet payloads for predefined signatures at multi-Gigabit-per-second line speeds.

Architecture for a Hardware-Based, TCP/IP Content-Processing System, by David V. Schuehler, James Moscola, John W. Lockwood; IEEE Micro, Vol. 24, No. 1, Jan 2004, pp. 62-69

Abstract: A new architecture performs content scanning of TCP flows in high-speed networks. Combining a TCP processing engine, a per-flow state store, and a content-scanning engine, this architecture permits complete payload inspections on 8 million TCP flows at 2.5 Gbps.

System-on-Chip Packet Processor for an Experimental Network Services Platform, by David Taylor, Alex Chandra, Yuhua Chen, Sarang Dharmapurikar, John Lockwood, Wenjing Tang, Jonathan Turner; Proceedings of IEEE Globecom 2003, December 2003.

Abstract: As the focus of networking research shifts from raw performance to the delivery of advanced network services, there is a growing need for open-platform systems for extensible networking research. The Applied Research Laboratory at Washington University in Saint Louis has developed a flexible Network Services Platform (NSP) to meet this need. The NSP provides an extensible platform for prototyping next-generation network services and applications. This paper describes the design of a system-on-chip Packet Processor for the NSP which performs all core packet processing functions including segmentation and reassembly, packet classification, route lookup, and queue management. Targeted to a commercial configurable logic device, the system is designed to support gigabit links and switch fabrics with a 2:1 speed advantage. We provide resource consumption results for each component of the Packet Processor design.

Application of Hardware Accelerated Extensible Network Nodes for Internet Worm and Virus Protection, by John W. Lockwood, James Moscola, David Reddick, Matthew Kulig, and Tim Brooks, International Working Conference on Active Networks (IWAN), Kyoto, Japan, December, 2003.

Abstract: Today's crucial information networks are vulnerable to fast moving attacks by Internet worms and computer viruses. These attacks have the potential to cripple the Internet and compromise the integrity of the data on the end-user machines. Without new types of protection, the Internet remains susceptible to the assault of increasingly aggressive attacks. A platform has been implemented that actively detects and blocks worms and viruses at multi-Gigabit/second rates. It uses the Field-programmable Port Extender (FPX) to scan for signatures of malicious software (malware) carried in packet payloads. Dynamically reconfigurable Field Programmable Gate Array (FPGA) logic tracks the state of Internet flows and searches for regular expressions and fixed strings that appear in the content of packets. Protection is achieved by the incremental deployment of systems throughout the Internet.

Longest Prefix Matching Using Bloom Filters; by Sarang Dharmapurikar, Praveen Krishnamurthy, David E. Taylor, Proceedings of SIGCOMM 2003, Karlsruhe, Germany, Aug. 25-29, 2003.

Abstract: We introduce the first algorithm that we are aware of to employ Bloom filters for Longest Prefix Matching (LPM). The algorithm performs parallel queries on Bloom filters, an efficient data structure for membership queries, in order to determine address prefix membership in sets of prefixes sorted by prefix length. We show that use of this algorithm for Internet Protocol (IP) routing lookups results in a search engine providing better performance and scalability than TCAM-based approaches. The key feature of our technique is that the performance, as determined by the number of dependent memory accesses per lookup, can be held constant for longer address lengths or additional unique address prefix lengths in the forwarding table given that memory resources scale linearly with the number of prefixes in the forwarding table. Our approach is equally attractive for Internet Protocol Version 6 (IPv6) which uses 128-bit destination addresses, four times longer than IPv4. We present a basic version of our approach along with optimizations leveraging previous advances in LPM algorithms. We also report results of performance simulations of our system using snapshots of IPv4 BGP tables and extend the results to IPv6. Using less than 2Mb of embedded RAM and a commodity SRAM device, our technique achieves average performance of one hash probe per lookup and a worst case of two hash probes and one array access per lookup.
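
The lookup flow described above, one Bloom filter per prefix length followed by hash-table probes from the longest candidate downward, can be sketched in software. This is a simplified, IPv4-only illustration under stated assumptions: the class names, the filter sizes, and the use of SHA-256 as a hash source are hypothetical, the filters are queried in a loop where hardware would query them in parallel, and none of the paper's optimizations are included.

```python
import hashlib
import ipaddress


class BloomFilter:
    """Set summary that can give false positives but never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        return all(self.bits[pos] for pos in self._positions(item))


class BloomLpmTable:
    """Longest prefix match: Bloom filters pick lengths, hash tables confirm."""

    def __init__(self):
        self.filters = {}  # prefix length -> Bloom filter (small, fast memory)
        self.tables = {}   # prefix length -> {prefix bits: next hop}

    @staticmethod
    def _key(address, length):
        # Keep only the top `length` bits of the 32-bit address.
        return (address >> (32 - length)).to_bytes(4, "big")

    def insert(self, cidr, next_hop):
        net = ipaddress.ip_network(cidr)
        key = self._key(int(net.network_address), net.prefixlen)
        self.filters.setdefault(net.prefixlen, BloomFilter()).add(key)
        self.tables.setdefault(net.prefixlen, {})[key] = next_hop

    def lookup(self, address):
        addr = int(ipaddress.IPv4Address(address))
        # Step 1: ask every filter whether its prefix of the address is present.
        candidates = [length for length, bloom in self.filters.items()
                      if bloom.might_contain(self._key(addr, length))]
        # Step 2: probe hash tables longest-first; a miss is a false positive.
        for length in sorted(candidates, reverse=True):
            hop = self.tables[length].get(self._key(addr, length))
            if hop is not None:
                return hop
        return None


table = BloomLpmTable()
table.insert("0.0.0.0/0", "default")
table.insert("10.0.0.0/8", "A")
table.insert("10.1.0.0/16", "B")
print(table.lookup("10.1.2.3"))
```

When the filters produce no false positives, the first probe in step 2 succeeds, which is the intuition behind an average of about one hash probe per lookup.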

Internet Worm and Virus Protection in Dynamically Reconfigurable Hardware; by John W. Lockwood, James Moscola, Matthew Kulig, David Reddick, Tim Brooks, Military and Aerospace Programmable Logic Device (MAPLD), Washington DC, 2003, Paper E10, Sep 9-11, 2003. [Photos]

Abstract: The security of the Internet can be improved using Programmable Logic Devices (PLDs). A platform has been implemented that actively scans and filters Internet traffic for Internet worms and viruses at multi-Gigabit/second rates using the Field-programmable Port Extender (FPX). Modular components implemented with Field Programmable Gate Array (FPGA) logic on the FPX process packet headers and scan for signatures of malicious software (malware) carried in packet payloads. FPGA logic is used to implement circuits that track the state of Internet flows and search for regular expressions and fixed-strings that appear in the content of packets. The FPX contains logic that allows modules to be dynamically reconfigured to scan for new signatures. Network-wide protection is achieved by the deployment of multiple systems throughout the Internet.

An Extensible, System-On-Programmable-Chip, Content-Aware Internet Firewall, by John W. Lockwood, Christopher Neely, Christopher Zuver, James Moscola, Sarang Dharmapurikar, and David Lim; Field Programmable Logic and Applications (FPL), Lisbon, Portugal, Paper 14B, Sep 1-3, 2003.

Abstract: An extensible firewall has been implemented that performs packet filtering, content scanning, and per-flow queuing of Internet packets at Gigabit/second rates. The firewall uses layered protocol wrappers to parse the content of Internet data. Packet payloads are scanned for keywords using parallel regular expression matching circuits. Packet headers are compared to rules specified in Ternary Content Addressable Memories (TCAMs). Per-flow queuing is performed to mitigate the effect of Denial of Service attacks. All packet processing operations were implemented with reconfigurable hardware and fit within a single Xilinx Virtex XCV2000E Field Programmable Gate Array (FPGA). The single-chip firewall has been used to filter Internet SPAM and to guard against several types of network intrusion. Additional features were implemented in extensible hardware modules deployed using run-time reconfiguration.

Deep Packet Inspection Using Parallel Bloom Filters, by Sarang Dharmapurikar, Praveen Krishnamurthy, Todd Sproull, John W. Lockwood; Hot Interconnects 11 (HotI), Stanford, CA, USA, pp. 44-51, Aug. 2003.

Abstract: Recent advances in network packet processing focus on payload inspection for applications that include content-based billing, layer-7 switching and Internet security. Most of the applications in this family need to search for predefined signatures in the packet payload. Hence an important building block of these processors is string matching infrastructure. Since conventional software-based algorithms for string matching have not kept pace with high network speeds, specialized high-speed, hardware-based solutions are needed. We describe a technique based on Bloom filters for detecting predefined signatures (a string of bytes) in the packet payload. A Bloom filter is a data structure for representing a set of strings in order to support membership queries. We use hardware Bloom filters to isolate all packets that potentially contain predefined signatures. Another independent process eliminates false positives produced by Bloom filters.
We outline our approach for string matching at line speeds and present the performance analysis. Finally, we report the results for a prototype implementation of this system on the FPX platform. Our analysis shows that with the state-of-the-art FPGAs, a set of 10,000 strings can be scanned in the network data at the line speed of OC48 (2.4 Gbps).
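
The two-stage scheme in this abstract, a Bloom filter test followed by an independent check that removes false positives, can be sketched in software. This is a minimal illustration, not the FPX implementation: the filter sizes, the SHA-256 hash source, the one-filter-per-signature-length layout, and the names `BloomFilter` and `scan_payload` are assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Set summary that can give false positives but never false negatives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, item):
        # Derive k bit positions by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        return all(self.bits[pos] for pos in self._positions(item))


def scan_payload(payload, signatures, num_bits=4096, num_hashes=4):
    """Return sorted (offset, signature) pairs confirmed in the payload."""
    exact = set(signatures)
    # One filter per signature length, standing in for a bank of filters
    # that hardware would query in parallel.
    filters = {}
    for sig in exact:
        filters.setdefault(len(sig), BloomFilter(num_bits, num_hashes)).add(sig)

    hits = []
    for offset in range(len(payload)):
        for length, bloom in filters.items():
            window = payload[offset:offset + length]
            if len(window) < length:
                continue  # window runs past the end of the payload
            # Stage 1: cheap Bloom filter test flags possible matches.
            if bloom.might_contain(window):
                # Stage 2: exact lookup discards false positives.
                if window in exact:
                    hits.append((offset, window))
    return sorted(hits)


print(scan_payload(b"GET /cmd.exe HTTP/1.0", [b"cmd.exe", b"root"]))
```

Most windows fail stage 1 and never reach the exact lookup, which is what lets the expensive check run off the fast path.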

Architecture for a Hardware Based, TCP/IP Content Scanning System, by David V. Schuehler, James Moscola, and John W. Lockwood; Hot Interconnects 11 (HotI), Stanford, CA, USA, pp. 89-94, Aug. 2003.

Abstract: Hardware assisted intrusion detection systems and content scanning engines are needed to process data at multi-gigabit line rates. These systems, when placed within the core of the Internet, are subject to millions of simultaneous flows, with each flow potentially containing data of interest. Existing IDS systems are not capable of processing millions of flows at gigabit-per-second data rates. This paper describes an architecture which is capable of performing complete, stateful, payload inspections on 8 million TCP flows at 2.5 gigabits-per-second. To accomplish this task, a hardware circuit is used to combine a TCP protocol processing engine, a per flow state store, and a content scanning engine.

Implementation of a Streaming Content Search-and-Replace Module for an Internet Firewall, by James Moscola, Michael Pachos, John W. Lockwood, Ron P. Loui; Hot Interconnects 11 (HotI), Stanford, CA, USA, pp. 122-129, Aug. 2003.

Abstract: A module has been implemented in Field Programmable Gate Array (FPGA) hardware that is able to perform regular expression search-and-replace operations on the content of Internet packets at Gigabit/second rates. All of the packet processing operations are performed using reconfigurable hardware within a single Xilinx Virtex XCV2000E FPGA. A set of layered protocol wrappers is used to parse the headers and payloads of packets for Internet protocol data. A content matching server automatically generates, compiles, synthesizes, and programs the module into the Field-programmable Port Extender (FPX) platform.

Scalable IP Lookup for Internet Routers, by David E. Taylor, Jonathan S. Turner, John W. Lockwood, Todd S. Sproull, David B. Parlour; IEEE Journal on Selected Areas in Communications (JSAC), Vol. 21, No. 4, May 2003, pp. 522-534. (Journal version of Infocom 2002 paper)

Abstract: Internet protocol (IP) address lookup is a central processing function of Internet routers. While a wide range of solutions to this problem have been devised, very few simultaneously achieve high lookup rates, good update performance, high memory efficiency, and low hardware cost. High performance solutions using content addressable memory devices are a popular but high-cost solution, particularly when applied to large databases. We present an efficient hardware implementation of a previously unpublished IP address lookup architecture, invented by Eatherton and Dittia. Our experimental implementation uses a single commodity synchronous random access memory chip and less than 10% of the logic resources of a commercial configurable logic device, operating at 100 MHz. With these quite modest resources, it can perform over 9 million lookups/s, while simultaneously processing thousands of updates/s, on databases with over 100000 entries. The lookup structure requires 6.3 bytes per address prefix: less than half that required by other methods. The architecture allows performance to be scaled up by using parallel fast IP lookup (FIPL) engines, which interleave accesses to a common memory interface. This architecture allows performance to scale up directly with available memory bandwidth. We describe the Tree Bitmap algorithm, our implementation of it in a dynamically extensible gigabit router being developed at Washington University in Saint Louis, and the results of performance experiments designed to assess its performance under realistic operating conditions.

Implementation of a Content-Scanning Module for an Internet Firewall, by James Moscola, John Lockwood, Ronald P. Loui, Michael Pachos; FCCM, Napa, CA, April 9-11, 2003. [Presentation], [Demo Night Photos], [Audience Photos]

Abstract: A module has been implemented in Field Programmable Gate Array (FPGA) hardware that scans the content of Internet packets at Gigabit/second rates. All of the packet processing operations are performed using reconfigurable hardware within a single Xilinx Virtex XCV2000E FPGA. A set of layered protocol wrappers is used to parse the headers and payloads of packets for Internet protocol data. A content matching server automatically generates the Finite State Machines (FSMs) to search for regular expressions. The complete system is operated on the Field-programmable Port Extender (FPX) platform.

TCP-Splitter: A TCP/IP Flow Monitor in Reconfigurable Hardware, by David V. Schuehler, John Lockwood; Hot Interconnects 10 (HotI-10), Stanford, CA, Aug 21-23, 2002, pp. 127-131.

Abstract: TCP/IP is the most commonly-used protocol on the Internet. It provides a reliable transport for nearly all applications that utilize a network. These include Web browsers, FTP, Telnet, Secure Shell and other applications. New types of routers require the examination of TCP/IP flows transiting this networking equipment. This paper describes TCP-Splitter, a reconfigurable hardware based solution for analyzing and processing TCP/IP flows at multi-gigabit line rates. A consistent byte stream is delivered to a client application for every TCP/IP connection processed by TCP-Splitter. In order to maintain a design that is lightweight, efficient, and able to process a nearly unlimited number of flows at gigabit line rates, the system uses a non-passive flow processing algorithm.

Implementing a dynamically reconfigurable ATM switch on the VIRTEX FPGA of the FPX platform, by E. L. Horta, J. W. Lockwood, S. T. Kofuji; Proceedings of SPIE, volume 4867, Boston, MA, July 30 2002.

Scalable IP Lookup for Programmable Routers, by David E. Taylor, John W. Lockwood, Todd S. Sproull, Jonathan S. Turner, David B. Parlour, IEEE Infocom 2002, New York NY, June 23-27, 2002.

Dynamic Hardware Plugins in an FPGA with Partial Run-time Reconfiguration, by Edson L. Horta, John W. Lockwood, David E. Taylor, David Parlour; Design Automation Conference (DAC), New Orleans, LA, June 10-14, 2002.

Abstract: Tools and a design methodology have been developed to support partial run-time reconfiguration of FPGA logic on the Field Programmable Port Extender. High-speed Internet packet processing circuits on this platform are implemented as Dynamic Hardware Plugin (DHP) modules that fit within a specific region of an FPGA device. The PARBIT tool has been developed to transform and restructure bitfiles created by standard computer aided design tools into partial bitstreams that program DHPs. The methodology allows the platform to hot-swap application-specific DHP modules without disturbing the operation of the rest of the system.

Design of a High Performance Dynamically Extensible Router, by Fred Kuhns, John DeHart, Anshul Kantawala, Ralph Keller, John Lockwood, Prashanth Pappu, David Richards, David Taylor, Jyoti Parwatikar, Ed Spitznagel, Jon Turner, Ken Wong; DARPA Active Networks Conference and Exposition (DANCE), San Francisco, May 2002.

Abstract: This paper describes the design, implementation and performance of an open, high performance, dynamically extensible router under development at Washington University in St. Louis. This router supports the dynamic installation of software and hardware plugins in the data path of application data flows. It provides an experimental platform for research on programmable networks, protocols, router software and hardware design, network management, quality of service and advanced applications. It is designed to be flexible, without sacrificing performance. It supports gigabit links and uses a scalable architecture suitable for supporting hundreds or even thousands of links. The system's flexibility makes it an ideal platform for experimental research on dynamically extensible networks that implement higher level functions in direct support of individual application sessions.

Control and Configuration Software for a Reconfigurable Networking Hardware Platform, by Todd Sproull, John W. Lockwood, David E. Taylor; IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), Napa, CA, April 24, 2002. [Photos]

Abstract: A suite of tools called NCHARGE (Networked Configurable Hardware Administrator for Reconfiguration and Governing via End-systems) has been developed to simplify the co-design of hardware and software components that process packets within a network of Field Programmable Gate Arrays (FPGAs). A key feature of NCHARGE is that it provides a high-performance packet interface to hardware and a standard Application Programming Interface (API) between software and reprogrammable hardware modules. Using this API, multiple software processes can communicate with one or more hardware modules using standard TCP/IP sockets. NCHARGE also provides a Web-Based User Interface to simplify the configuration and control of an entire network switch that contains several software and hardware modules.

Protocol Wrappers for Layered Network Packet Processing in Reconfigurable Hardware, IEEE Micro, Volume 22, Number 3, Feb 2002, by Florian Braun, John Lockwood, and Marcel Waldvogel, pp. 66-74.

Abstract: A network platform called the Field-programmable Port Extender (FPX) streamlines and simplifies network transmission processing directly in hardware.

Dynamic Hardware Plugins (DHP): Exploiting Reconfigurable Hardware for High-Performance Programmable Routers, by David E. Taylor, Jonathan S. Turner, John W. Lockwood, Edson L. Horta; Computer Networks, February 2002, Volume 38, Issue 3, pp. 295-310, Elsevier Science.

Layered Protocol Wrappers for Internet Packet Processing in Reconfigurable Hardware, by Florian Braun, John Lockwood, and Marcel Waldvogel, Proceedings of Hot Interconnects 9 (HotI-9) Stanford, CA, Aug 22-24, 2001, pp. 93-98.

Abstract: A library of layered protocol wrappers has been developed that process Internet packets in reconfigurable hardware. These wrappers can be used with a reprogrammable network platform called the Field Programmable Port Extender (FPX) to rapidly prototype hardware circuits for processing Internet packets. We present a framework to streamline and simplify the development of networking applications that process ATM cells, AAL5 frames, Internet Protocol (IP) packets and UDP datagrams directly in hardware.
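
Illustrative sketch (not taken from the paper): the layering idea in this abstract, where each wrapper strips its own header and hands the remaining payload to the next layer, can be shown in software for the IP and UDP layers. The function names `unwrap_ipv4` and `unwrap_udp` are hypothetical, the ATM and AAL5 layers are omitted, and checksums are not verified.

```python
import struct


def unwrap_ipv4(packet: bytes):
    """Strip an IPv4 header; return (protocol, src, dst, payload)."""
    if len(packet) < 20:
        raise ValueError("too short for an IPv4 header")
    version = packet[0] >> 4
    if version != 4:
        raise ValueError("not an IPv4 packet")
    header_len = (packet[0] & 0x0F) * 4          # IHL is in 32-bit words
    total_len = struct.unpack("!H", packet[2:4])[0]
    protocol = packet[9]
    src = ".".join(str(b) for b in packet[12:16])
    dst = ".".join(str(b) for b in packet[16:20])
    return protocol, src, dst, packet[header_len:total_len]


def unwrap_udp(datagram: bytes):
    """Strip a UDP header; return (src_port, dst_port, payload)."""
    if len(datagram) < 8:
        raise ValueError("too short for a UDP header")
    src_port, dst_port, length, _checksum = struct.unpack("!HHHH", datagram[:8])
    return src_port, dst_port, datagram[8:length]


def unwrap(packet: bytes):
    """Apply the IP layer, then the UDP layer if the protocol field is 17."""
    protocol, src, dst, inner = unwrap_ipv4(packet)
    if protocol != 17:
        raise ValueError("not a UDP datagram")
    src_port, dst_port, data = unwrap_udp(inner)
    return (src, src_port), (dst, dst_port), data
```

An application written against `unwrap` sees only addresses, ports, and data, which is the kind of simplification the wrapper library aims to provide to hardware modules.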

Evolvable Internet Hardware Platforms, by John W. Lockwood, NASA/DoD Workshop on Evolvable Hardware (EHW'01), Long Beach, CA, July 12-14, 2001, pp. 271-279.

Abstract: Network routing platforms and Internet firewalls of the next decade will be radically different than the platforms of today. They will contain modular components that can be dynamically reconfigured over the Internet. But, unlike the active networks that are in the research labs today, these new platforms will not suffer from the performance penalty of processing packets in software.
These platforms will implement routing, packet filtering, and queuing functions in reprogrammable hardware. The hardware of the system will evolve over time as packet processing algorithms and protocols progress. The granularity of the system will be configurable down to the level of the logic gates. These logic gates, and the interconnections between them, will be reconfigurable over the Internet. These routers will enable new services to be rapidly deployed over the Internet and operate at the full rate of an Internet backbone link.
Through the development of the Field Programmable Port Extender (FPX), a platform has been built that demonstrates how networking modules can be used for rapid prototyping and deployment of networking hardware. The platform includes high-speed network interfaces, multiple banks of memory, and Field Programmable Gate Array (FPGA) logic. Applications have been developed for the FPX that include Internet packet routing, data queuing, and application-level data modification. The FPX is currently used as a component in an evolvable router.

Platform and Methodology for Teaching Design of Hardware Modules in Internet Routers and Firewalls, by John W. Lockwood, IEEE Computer Society International Conference on Microelectronic Systems Education (MSE'2001), Las Vegas, NV, June 17-18, 2001, pp. 56-57.

Dynamic Hardware Plugins (DHP): Exploiting Reconfigurable Hardware for High-Performance Programmable Routers, by David E. Taylor, Jonathan S. Turner, John W. Lockwood; IEEE OPENARCH 2001: 4th IEEE Conference on Open Architectures and Network Programming, Anchorage, AK, April 2001.

Reprogrammable Network Packet Processing on the Field Programmable Port Extender (FPX), by John W. Lockwood, Naji Naufel, Jon S. Turner, and David E. Taylor, ACM International Symposium on Field Programmable Gate Arrays (FPGA'2001), Monterey, CA, February 2001, pp. 87-93.

Abstract: A prototype platform has been developed that allows processing of packets at the edge of a multi-gigabit-per-second network switch. This system, the Field Programmable Port Extender (FPX), enables packet processing functions to be implemented as modular components in reprogrammable hardware. All logic on the FPX is implemented in two Field Programmable Gate Arrays (FPGAs). Packet processing functions in the system are implemented as dynamically-loadable modules.
Core functionality of the FPX is implemented on an FPGA called the Networking Interface Device (NID). The NID contains the logic to transmit and receive packets over a network, dynamically reprogram hardware modules, and route individual traffic flows. A full, non-blocking switch is implemented on the NID to route packets between the networking interfaces and the modular components. Modular components of the FPX are implemented on a second FPGA called the Reprogrammable Application Device (RAD). Modules are loaded onto the RAD via reconfiguration and/or partial reconfiguration of the FPGA.
Through the combination of the NID and the RAD, the FPX can individually reconfigure the packet processing functionality for one set of traffic flows, while the rest of the system continues to operate. The platform simplifies the development and deployment of new hardware-accelerated packet processing circuits. The modular nature of the system allows an active router to migrate functionality from software plugins to hardware modules.

An Open Platform for Development of Network Processing Modules in Reprogrammable Hardware, by John W. Lockwood, IEC DesignCon 2001, Santa Clara, CA, Jan. 2001, Paper WB-19.

Field Programmable Port Extender (FPX) for Distributed Routing and Queuing, by John W. Lockwood, Jon S. Turner, David E. Taylor, ACM International Symposium on Field Programmable Gate Arrays (FPGA'2000), Monterey, CA, February 2000, pp. 137-144.

Abstract: Field Programmable Gate Arrays (FPGAs) are being used to provide fast Internet Protocol (IP) packet routing and advanced queuing in a highly scalable network switch. A new module, called the Field-programmable Port Extender (FPX), is being built to augment the Washington University Gigabit Switch (WUGS) with reprogrammable logic.
FPX modules reside at the edge of the WUGS switching fabric. Physically, the module is inserted between an optical line card and the WUGS gigabit switch backplane. The hardware used for this project allows ports of the switch populated with an FPX to operate at rates up to 2.4 Gigabits/second. The aggregate throughput of the system scales with the number of switch ports.
Logic on the FPX module is implemented with two FPGA devices. The first device is used to interface between the switch and the line card, while the second is used to prototype new networking functions and protocols. The logic on the second FPGA can be reprogrammed dynamically via control cells sent over the network.

Design of a Flexible Open Platform for High Performance Active Networks, by Sumi Choi, John Dehart, Ralph Keller, John Lockwood, Jonathan Turner and Tilman Wolf; Allerton Conference, Champaign, IL, 1999.

Abstract: This paper describes an architecture for a high performance active router. The system is designed as an open research platform, with a range of configuration options and possibilities for extension in both the software and hardware dimensions. The system is built around a scalable switch fabric and includes a general-purpose processor subsystem at each port, enabling flexible packet processing and deployment of flow-specific active plugins. Such a research platform is becoming an indispensable tool for effective systems research in networking and distributed systems.

Available Bit Rate Architecture and Simulation for an Input-Buffered and Per-VC Queued ATM Switch, by M. Bossardt, J. W. Lockwood, S. M. Kang, S.-Y. Park, IEEE Global Telecommunications Conference (Globecom'98), Sydney, Australia, Nov. 1998, pp. 1817-1822.

Abstract: This paper proposes an innovative concept called virtual output queue to support Available Bit Rate (ABR) traffic on an input-buffered, per-virtual circuit queued switch. This technique allows ABR models developed for output-buffered systems to be migrated to an input-buffered system.
In order to evaluate the virtual output queue and to compare different ABR algorithms, a simulator of the ATM testbed at the University of Illinois has been enhanced with ABR functions. This paper provides simulation results for the input-buffered variation of the ERICA+ algorithm.

A High-performance OC-12/OC-48 Queue Design Prototype for Input-buffered ATM Switches, by H. Duan, J. W. Lockwood, S. M. Kang, J.D. Will; IEEE Infocom '97, Kobe, Japan, April 7-11, 1997, pp. 20-28.

Abstract: This paper presents the design and prototype of an intelligent, 3-Dimensional Queue (3DQ) system for high-performance, scalable, input buffered ATM switches. The 3DQ uses pointers and linked lists to organize ATM cells into multiple virtual queues according to priority, destination, and virtual connections, then selects proper cells for switching based on Quality-of-Service (QoS) parameters and run-time traffic conditions. Using Field-Programmable-Gate-Array (FPGA) devices, our prototype hardware can process ATM cells at 622 Mb/s (OC-12). Using more aggressive technology (Multi-Chip-Module (MCM) and fast GaAs logic), the same 3DQ design can process cells at 2.5 Gb/s (OC-48). Combined with the Matrix-Unit-Cell-Scheduler (MUCS) module, a high-performance input-buffered ATM switch system has been designed, which avoids Head-Of-Line (HOL) blocking and achieves near-100% link bandwidth utilization.
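
Illustrative sketch (not taken from the paper): the pointer-and-linked-list organization described in this abstract can be modeled in software as one shared cell memory whose slots are threaded into a free list and into per-key virtual queues. The class name `SharedCellMemory`, the use of `(priority, destination)` as the queue key, and the rule that a lower number means higher priority are all assumptions; the per-connection dimension and QoS-based selection are left out.

```python
class SharedCellMemory:
    """Cells live in one array; queues are linked lists of slot indices."""

    def __init__(self, capacity: int):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.cells = [None] * capacity
        # Initially every slot is chained into the free list: 0 -> 1 -> ... -> -1
        self.next_ptr = list(range(1, capacity)) + [-1]
        self.free_head = 0
        self.heads = {}  # (priority, destination) -> first slot of that queue
        self.tails = {}  # (priority, destination) -> last slot of that queue

    def enqueue(self, key, cell) -> bool:
        slot = self.free_head
        if slot == -1:
            return False  # shared memory is full; the cell is dropped
        self.free_head = self.next_ptr[slot]
        self.cells[slot] = cell
        self.next_ptr[slot] = -1
        if key in self.tails:
            self.next_ptr[self.tails[key]] = slot  # append to existing queue
        else:
            self.heads[key] = slot                 # start a new virtual queue
        self.tails[key] = slot
        return True

    def dequeue(self, key):
        slot = self.heads.get(key)
        if slot is None:
            return None
        cell = self.cells[slot]
        nxt = self.next_ptr[slot]
        if nxt == -1:
            del self.heads[key], self.tails[key]
        else:
            self.heads[key] = nxt
        # Return the slot to the free list.
        self.cells[slot] = None
        self.next_ptr[slot] = self.free_head
        self.free_head = slot
        return cell

    def dequeue_for(self, destination):
        """Pick the highest-priority waiting cell for one output port."""
        keys = [k for k in self.heads if k[1] == destination]
        return self.dequeue(min(keys)) if keys else None
```

Because each destination has its own virtual queues, a cell waiting for a busy output does not hold up cells bound for other outputs, which is how this kind of organization sidesteps head-of-line blocking.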

Workshops

June 2002 FPX Workshop, by John W. Lockwood, Todd Sproull, James Moscola, David Schuehler, Dave Lim, Sarang Dharmapurikar, Chris Neely, Gigabit Kits Workshop, Saint Louis, MO, June 19-20, 2002.
January 2002 FPX Workshop, by John W. Lockwood, David Taylor, James Moscola, Todd Sproull, Dave Lim, Gigabit Kits Workshop, Saint Louis, MO, January 3-4, 2002.
August 2001 FPX Workshop, by John W. Lockwood, Henry Fu, Todd Sproull, Sarang Dharmapurikar, Edson Horta, Dave Lim, Gigabit Kits Workshop, Saint Louis, MO, August 15-16, 2001.
January 2001 FPX Workshop, by John W. Lockwood, David E. Taylor, Todd Sproull, Sarang Dharmapurikar, Florian Braun, Gigabit Kits Workshop, Saint Louis, MO, January 11-12, 2001.

Technical Reports

Techniques for Processing TCP/IP Flow Content in Network Switches at Gigabit Line Rates

Abstract: The growth of the Internet has enabled it to become a critical component used by businesses, governments and individuals. While most of the traffic on the Internet is legitimate, a proportion of the traffic includes worms, computer viruses, network intrusions, computer espionage, security breaches and illegal behavior. This rogue traffic causes computer and network outages, reduces network throughput, and costs governments and companies billions of dollars each year.
This dissertation investigates the problems associated with TCP stream processing in high-speed networks. It describes an architecture that simplifies the processing of TCP data streams in these environments and presents a hardware circuit capable of TCP stream processing on multi-gigabit networks for millions of simultaneous network connections. Live Internet traffic is analyzed using this new TCP processing circuit.

Eliot I. Sinclair, John W. Lockwood, Multiflow TCP, UDP, IP, and ATM Traffic Generation Module, WUCSE-2003-24, April 24, 2003.

Abstract: Networking devices must be capable of processing traffic flows from multiple sources. In order to verify that such devices operate properly, a network testbench can be used to inject traffic into the device. The specification of the traffic flows can be difficult. At the low level, there are header fields, data checksums, and packet length fields that all must be formatted correctly. Further, there can be multiple flows of traffic that will arrive simultaneously.
It is desirable to specify traffic at a high level of abstraction. A software program can then be written to parse the specification and generate the low-level data that is actually processed by the networking hardware.
For this project, a traffic generation program was built that accepts high-level traffic flow specifications. The program generates a cell-by-cell representation of the combined traffic flows. These flows can then be read by a testbench and fed into a simulation. With a hardware module capable of sending the generated traffic, the same flows can also be used to conduct a hardware test.
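
Illustrative sketch (not the program described in this report): the step from a high-level flow specification to a cell-by-cell representation can be pictured as segmenting each flow into fixed-size cell payloads and interleaving the flows. The names `segment` and `generate`, the dictionary input format, and the round-robin interleaving are assumptions; the 48-byte payload size is that of an ATM cell. Cell headers, the AAL5 trailer, and checksums are omitted.

```python
CELL_PAYLOAD = 48  # bytes of payload carried by one ATM cell


def segment(vci: int, data: bytes):
    """Split one flow's data into (vci, 48-byte payload) cells, zero-padded."""
    cells = []
    for offset in range(0, len(data), CELL_PAYLOAD):
        chunk = data[offset:offset + CELL_PAYLOAD]
        cells.append((vci, chunk.ljust(CELL_PAYLOAD, b"\x00")))
    return cells


def generate(flows: dict):
    """Interleave several flows round-robin into one combined cell stream.

    `flows` maps a virtual circuit identifier to the bytes to send on it.
    """
    queues = [segment(vci, data) for vci, data in flows.items()]
    stream = []
    index = 0
    while any(index < len(q) for q in queues):
        for q in queues:
            if index < len(q):
                stream.append(q[index])
        index += 1
    return stream
```

A testbench could then read the resulting list one cell at a time and drive it into a simulated device.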

John Lockwood et al., Field Programmable Port Extender (FPX) User Guide: Version 2.2, Washington University, Department of Computer Science, Technical Report WUCS-02-15, June, 2002.

Abstract: The Field-programmable Port Extender (FPX) is a general-purpose, reprogrammable platform that performs data processing in Field Programmable Gate Array (FPGA) hardware. The FPX extends operation of the Washington University Gigabit Switch (WUGS) by adding FPGA hardware at ingress and egress ports. As with the Smart Port Card (SPC), data packets can be actively processed by user-defined, reprogrammable modules as they pass through the device. Unlike the SPC, however, the FPX uses reprogrammable hardware, not software, to process the packets. The hardware-based processing allows the FPX to achieve multi-Gigabit per second throughput, even when performing deep processing of the packet payload. This manual summarizes how to insert the FPX into the Washington University Gigabit Switch, how to install the NCHARGE control software, how to initialize the system, and how to reprogram a user-defined module into the FPX over the network using the included web-based tools.

David Taylor, John W. Lockwood, Todd Sproull, and David B. Parlour, Scalable IP Lookup for Programmable Routers, Washington University, Department of Computer Science, Technical Report WUCS-01-33, October, 2001.

Synthesizable Design of a Multi-Module Memory Controller, Washington University, Department of Computer Science, Technical Report WUCS-01-26, October, 2001.

Fred Kuhns, John Dehart, Ralph Keller, John Lockwood, P. Pappu, J. Parwatikar, Ed Spitznagel, Dave Richards, Dave Taylor, Jon Turner, and Ken Wong, Implementation of an Open Multi-Service Router, Washington University, Department of Computer Science, Technical Report WUCS-01-20, August, 2001.

Abstract: This paper describes the design, implementation and performance of an open, high-performance, dynamically reconfigurable Multi-Service Router (MSR) being developed at Washington University in St. Louis. This router provides an experimental platform for research on protocols, router software and hardware design, network management, quality of service and advanced applications. The MSR has been designed to be flexible, without sacrificing performance. It supports gigabit links and uses a scalable architecture suitable for supporting hundreds or even thousands of links. The MSR's flexibility makes it an ideal platform for experimental research on dynamically extensible networks that implement higher level functions in direct support of individual application sessions.

The FPX KCPSM Module: An Embedded, Reconfigurable Active Processing Module for the Field Programmable Port Extender (FPX), by Henry Fu and John W. Lockwood, Washington University, Department of Computer Science, Technical Report WUCS-01-14, July, 2001.

Abstract: While hardware plugins are well suited for processing data with high throughput, software plugins are well suited for implementing complex control functions. A plugin module has been implemented for the FPX that executes software on an embedded soft-core processor. By including this module in an FPX design, it is possible to implement active networking functions on the FPX using both hardware and software. The KCPSM, an 8-bit microcontroller developed by Xilinx Corp., has been embedded into an FPX module. The module includes circuits to be reprogrammed over the network and to execute new programs between the processing of data packets. A sample application, called the FPX KCPSM Module, has been developed that illustrates how easily an application can make use of the hybrid system. This module loads the program memory of the KCPSM from an incoming UDP packet, and executes the new program upon receiving a new incoming UDP packet. The resulting circuit runs at 70 MHz and occupies 35% of a Xilinx XCV1000E-7-FG680.

PARBIT: A Tool to Transform Bitfiles to Implement Partial Reconfiguration of Field Programmable Gate Arrays (FPGAs), by Edson Horta and John W. Lockwood, Washington University, Department of Computer Science, Technical Report WUCS-01-13, July, 2001.

Abstract: Field Programmable Gate Arrays (FPGAs) can be partially reconfigured to implement Dynamically loadable Hardware Plugin (DHP) modules. A tool called PARBIT has been developed that transforms FPGA configuration bitfiles to enable DHP modules. With this tool it is possible to define a partial reconfigurable area inside the FPGA and download it into a specified region of the FPGA device. One or more DHPs, with different sizes, can be implemented using PARBIT.

Layered Protocol Wrappers for Internet Packet Processing in Reconfigurable Hardware, by Florian Braun, John W. Lockwood, and Marcel Waldvogel, Washington University, Department of Computer Science, Technical Report WUCS-01-10, July, 2001.

Parallel FPGA Programming over Backplane Chassis, by John W. Lockwood, Tom McLaughlin, Tom Chaney, Yuhua Chen, Fred Rosenberger, Alex Chandra, and Jon Turner, Washington University, Department of Computer Science, Technical Report WUCS-00-11, June 12, 2000.

David E. Taylor, John W. Lockwood, Sarang Dharmapurikar, Generalized RAD Module Interface Specification of the Field Programmable Port Extender (FPX), Washington University, Department of Computer Science. Version 2.0, Technical Report, January 8, 2000.

John W. Lockwood, David Lim, "Hello World": A Simple application for the Field Programmable Port Extender (FPX), Washington University, Department of Computer Science, Technical Report WUCS-00-12, July 11, 2000.

Abstract: The FPX provides simple and fast mechanisms to process cells or packets at the full line speed of the card [currently 2.4 Gbits/sec]. A sample application, called "Hello World", has been developed that illustrates how easily an application can be implemented on the FPX. This application uses the FPGA hardware to search for a string on a particular flow and selectively replace contents of the payload. The resulting circuit operates at 119 MHz on a Xilinx XCV1000E-FG680-7, and occupies less than 1% of the available gates on the device.
Related Files:
- Helloworld: VHDL source
- Pins.ucf
- Pins_bracket.ucf

John W. Lockwood, Project History and Previous Research on the Illinois Input Queue and WUGS/iiQueue Integration, Aug. 1998

John W. Lockwood, Quality of Service Enhancement of Washington University Gigabit Switch Using the Illinois Input Queue, December 1996

John W. Lockwood, Design and Implementation of a Multicast, Input-Buffered ATM Switch for the iPOINT Testbed, PhD Dissertation, Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, UMI Number: 9625160, 1995

Abstract: This thesis presents the design and implementation of the multicast, input-buffered Asynchronous Transfer Mode (ATM) switch for use with the iPOINT testbed. The input-buffered architecture of this switch is optimal in terms of the memory bandwidth required for the implementation of an ATM queue module. The contention resolution algorithm used by the iPOINT switch supports atomic multicast, enabling the simultaneous delivery of ATM cells to multiple output ports without the need for recirculation buffers, duplication of cells in memory, or multiple clock cycles to transfer a cell from an input queue module.
The implementation of the prototype switch is unique in that it was entirely constructed using Field Programmable Gate Array (FPGA) technology. A fully functional, five-port, 800 Mbps ATM switch has been developed and currently serves as the high-speed, optically interconnected, local area network for a cluster of Sun SPARCstations and the gateway to the wide-area Blanca/XUNET gigabit testbed. Through the use of FPGA technology, new hardware-based switching algorithms and functionality can be implemented without the need to modify hard-wired logic. Further, through the use of the remote switch manager, switch controller, and FPGA controller, the management, operation, and even logic functionality of the iPOINT testbed can be dynamically altered, all without the need for physical access to the iPOINT hardware.
Based on the existing prototype switch, the design of the FPGA-based, gigabit-per-second "Any-Queue" module is presented. For this design in its maximum configuration, up to 256 queue modules can be supported, providing an aggregate throughput of 180 Gbps. Further, the design of a 16-port, 11.2 Gbps aggregate throughput, switch fabric is documented that can be entirely implemented using only eight FPGA devices.
In addition to the design of the switch module, this thesis describes the supporting components of the iPOINT testbed, including the network control and application software, the hardware specifications of the switch interface, and the device requirements of the optoelectronic components used in the testbed. VHDL and schematics of the switch hardware and C/C++ source code for the supporting systems are included.

Guided Tutorials

Development of a System-On-Chip Extensible Network Processor and debugging using Identify, by John W. Lockwood and Chris Zuver, Reconfigurable Network Group Tutorial.
- Adobe .PDF or PowerPoint .PPT format.

Related Web Pages

Reconfigurable Network Group Homepage
FPX Homepage