Final Projects
Completed as part of the graduate course:
CSE535: Acceleration of Networking Algorithms in Hardware
Fall 2003

Prof. John W. Lockwood
Department of Computer Science and Engineering
Washington University in St. Louis



Liquid Processor

Software functions can be performed on the FPX by implementing a 32-bit microprocessor in FPGA logic. Such a processor is often called a "soft" core. Several standard microprocessors have already been implemented as soft cores, including one called the LEON processor. The LEON processor implements the 32-bit instruction set defined by Sun Microsystems for the SPARC architecture. LEON was developed as an open core, and details about it are available on-line.

The goal of this project is to port the LEON to the FPX platform. The LEON has already been made to work on other Xilinx Virtex platforms. For this project, however, memory is accessed over the network. Much of the infrastructure needed to implement this project is already in place on the FPX platform: both the SRAM and SDRAM on the FPX have multiple, arbitrated interfaces. For this project, one port is used by the processor and the other by a control packet processor that reads and writes the contents of SRAM and/or SDRAM over the network.

The gcc cross-compiler is used to compile programs for the LEON. The compiled code is downloaded into memory on the FPX and executed. Programs read from one area of memory and write their output to another. After a program finishes, the network is used to read the contents of memory and verify that the task completed correctly. Control packets transfer the contents of a compiled software program over the network and place it into memory on the FPX, and a control packet command is used to start and halt the processor.
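
To make the flow concrete, the following is a minimal sketch of the kind of C program that could be cross-compiled for the LEON and loaded into FPX memory over the network. The memory-region addresses and word count are hypothetical; the actual values depend on how the SRAM/SDRAM is mapped for the soft core.

  /* Minimal sketch of a program cross-compiled for the LEON soft core.
   * The input/output base addresses and word count are hypothetical;
   * the real values depend on how SRAM/SDRAM is mapped on the FPX. */
  #define IN_BASE   ((volatile unsigned int *) 0x40000000)  /* assumed input region  */
  #define OUT_BASE  ((volatile unsigned int *) 0x40010000)  /* assumed output region */
  #define NUM_WORDS 1024

  int main(void)
  {
      unsigned int i;

      /* Read each word from the input region, transform it, and write
       * the result to the output region. A host can later verify the
       * output by reading the memory contents over the network. */
      for (i = 0; i < NUM_WORDS; i++)
          OUT_BASE[i] = IN_BASE[i] + 1;

      /* Spin until the processor is halted by a control packet. */
      for (;;)
          ;
  }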

Project Team:


Redundant Alignment Removal for BLASTN

Basic Local Alignment Search Tool (BLAST) is a collection of algorithms, traditionally implemented in software, for performing comparative searches through biosequence databases. Comparative searches are useful for annotating relationships between the genes and proteins of different organisms, as well as the evolution of genomes through time. Information on related work is available on-line. While BLAST handles both protein and DNA searches, this project focuses on BLASTN, which performs only DNA similarity searches.

In the first stage of BLASTN, exact genetic alignments of a given length are found. These alignments are returned as a pair of indices, one corresponding to the beginning of the match in the query sequence and the other corresponding to the beginning of the match in the database sequence. Some of these alignments overlap with nearby alignments, and those are also returned as answers. In the second stage of BLASTN, however, probabilistic (ungapped) matches are performed that expand the window of observation around each exact alignment. Hence, once the first of a series of overlapping matches has been evaluated, examining all of the other overlapping matches is a waste of time and resources. This matters because the database files can be on the order of 10 to 100 GB, leading to runtimes of hundreds of CPU-days.

A high-speed circuit will be developed that takes the alignment outputs from the first stage, removes all answers that are redundant with the first non-redundant match, and returns the index of the valid match. The circuit will also allow out-of-order matches to be kept in memory.
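
The kind of filtering the circuit performs can be sketched in software as follows. This sketch assumes hits arrive ordered by database offset and treats a hit as redundant when it lies on the same diagonal (database index minus query index) as an earlier hit and starts within the seed length of it; the diagonal-table size is hypothetical, and the hardware keeps analogous state in on-chip memory.

  /* Illustrative, table-based version of the redundancy-removal step. */
  #include <stdio.h>

  #define W         28        /* seed length (24-32 bytes in BLASTN) */
  #define NUM_DIAGS 4096      /* hypothetical diagonal table size    */

  struct hit { long query; long db; };

  /* Returns 1 if the hit should be passed to stage two, 0 if it is
   * redundant with an earlier hit on the same diagonal. */
  static int keep_hit(const struct hit *h, long *last_db_on_diag)
  {
      long diag = h->db - h->query;
      unsigned idx = (unsigned long)diag % NUM_DIAGS;

      if (last_db_on_diag[idx] >= 0 && h->db - last_db_on_diag[idx] < W)
          return 0;                     /* overlaps previous hit: drop */
      last_db_on_diag[idx] = h->db;     /* remember first hit          */
      return 1;
  }

  int main(void)
  {
      long last[NUM_DIAGS];
      struct hit hits[] = { {10, 110}, {12, 112}, {40, 140}, {50, 400} };
      unsigned i;

      for (i = 0; i < NUM_DIAGS; i++)
          last[i] = -1;
      for (i = 0; i < sizeof(hits) / sizeof(hits[0]); i++)
          if (keep_hit(&hits[i], last))
              printf("keep hit: query=%ld db=%ld\n", hits[i].query, hits[i].db);
      return 0;
  }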

Project Team:


Snort Lite

SNORT is a popular network intrusion detection system. For this project, a small subset of the SNORT rules was implemented in hardware on the FPX platform. Many features of SNORT do not translate easily to hardware; this project focuses on the implementation of rules that pair one content rule with up to four header rules.

The circuit supports scanning for 10,000 content rules, each between 24 and 32 bytes long. Bloom filters perform the string-scanning functions. A software controller allows rules to be added to and removed from the Bloom filter over the network. Statistics are maintained that identify which rules matched. Using a single Bloom filter engine, the circuit achieves 500 Mbps of throughput.
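
The following is a small software sketch of how a Bloom filter lets one window of payload bytes be tested against thousands of content rules at once. The bit-vector size, number of hash functions, and hash mixing shown here are illustrative only and are not those of the FPX circuit.

  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>

  #define FILTER_BITS (1u << 20)   /* assumed bit-vector size      */
  #define NUM_HASHES  4            /* assumed number of hashes     */

  static uint8_t filter[FILTER_BITS / 8];

  /* Simple FNV-1a style hash, salted per hash function. */
  static uint32_t hash(const uint8_t *data, size_t len, uint32_t seed)
  {
      uint32_t h = 2166136261u ^ seed;
      size_t i;
      for (i = 0; i < len; i++) {
          h ^= data[i];
          h *= 16777619u;
      }
      return h % FILTER_BITS;
  }

  /* Program a content rule into the filter by setting its k bits. */
  static void add_rule(const uint8_t *s, size_t len)
  {
      uint32_t i;
      for (i = 0; i < NUM_HASHES; i++) {
          uint32_t b = hash(s, len, i);
          filter[b / 8] |= (uint8_t)(1u << (b % 8));
      }
  }

  /* Returns 1 if the window *may* match a rule (possible false
   * positive), 0 if it definitely does not. */
  static int query(const uint8_t *s, size_t len)
  {
      uint32_t i;
      for (i = 0; i < NUM_HASHES; i++) {
          uint32_t b = hash(s, len, i);
          if (!(filter[b / 8] & (1u << (b % 8))))
              return 0;
      }
      return 1;
  }

  int main(void)
  {
      const char *rule = "GET /default.ida?XXXXXXXX";   /* 25-byte example rule */
      add_rule((const uint8_t *)rule, strlen(rule));
      printf("match: %d\n", query((const uint8_t *)rule, strlen(rule)));
      printf("match: %d\n", query((const uint8_t *)"benign payload..........", 24));
      return 0;
  }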

Project Team:


Nachi Worm Detector

The Nachi worm spreads by having an infected host perform a port scan that sends ICMP ping messages to a sequence of potential victim hosts. Newer versions of Nachi tend to send ping messages 66 bytes long, while older versions sent pings of 96 bytes. When a machine replies to a ping, the infected host attempts to connect to it on a TCP port between 666 and 765. The worm then instructs the machine to download the worm executable via TFTP and run it. More information about the Nachi worm is available on-line.

For this project, a circuit was implemented to monitor a network for hosts that perform port scans. A counting scheme tracks which hosts generate pings to the largest number of other machines. When a threshold is exceeded, a UDP/IP warning packet is sent from the circuit to a monitoring host. All tracking statistics kept by the circuit can be read at any time via a control packet.
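
A software sketch of the counting idea is shown below. A table indexed by a hash of the source address approximates, for each source, how many distinct destinations it has pinged, and crossing a threshold triggers the warning. The table size, threshold, and distinct-destination check are simplifications made for illustration; the actual circuit maintains its statistics in hardware.

  #include <stdint.h>
  #include <stdio.h>

  #define NUM_SOURCES 1024    /* assumed source-tracking table size */
  #define THRESHOLD   32      /* assumed scan-detection threshold   */

  static uint16_t ping_count[NUM_SOURCES];
  static uint32_t last_dest[NUM_SOURCES];

  static unsigned src_index(uint32_t src_ip)
  {
      return (src_ip * 2654435761u) >> 22;   /* hash down to 10 bits */
  }

  /* Called for each ICMP echo request seen on the wire. Returns 1 when
   * the source crosses the threshold and a warning packet should be
   * sent to the monitoring host. */
  static int observe_ping(uint32_t src_ip, uint32_t dst_ip)
  {
      unsigned i = src_index(src_ip);

      if (last_dest[i] != dst_ip) {     /* crude distinct-destination check */
          last_dest[i] = dst_ip;
          if (++ping_count[i] == THRESHOLD)
              return 1;
      }
      return 0;
  }

  int main(void)
  {
      uint32_t dst;
      for (dst = 1; dst <= 40; dst++)
          if (observe_ping(0x0A000001u, 0xC0A80000u + dst))
              printf("send UDP warning: host 10.0.0.1 pinged %u machines\n",
                     THRESHOLD);
      return 0;
  }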

Project Team:


Major Content Detection

For this project, an automated system was implemented on the FPX platform that detects when a network carries large amounts of repeated content. This would be the case when a new Internet worm or virus is spreading, since the contents of the worm appear within the payload of the traffic flows coming from infected machines attempting to subvert targeted machines.

Motivation for such a system is discussed in a paper, The EarlyBird System for Real-time Detection of Unknown Worms by Sumeet Singh, Cristian Estan, George Varghese, and Stefan Savage of UCSD.

One challenge to automated worm detection is the appearance of polymorphic worms that change their payload by inserting no-ops. We assume that worms are characterized by a well-known signature. Instead of trying to detect a signature over the entire packet payload, we calculate several signatures over smaller window sizes. For instance, if a packet is 1500 bytes long, we might calculate 50 signatures of 30 bytes each. Worm signatures are counted using techniques described in New Directions in Traffic Measurement and Accounting by Cristian Estan and George Varghese (in Proceedings of ACM SIGCOMM 2002). Frequently occurring signatures are flagged.
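
The windowed-signature idea can be sketched as follows: every 30-byte window of a payload is hashed into a signature, a counter per signature is incremented across packets, and signatures whose counts grow large are flagged. The paper above uses multistage counting filters, whereas this sketch uses a single counter table with a hypothetical size and threshold purely for illustration.

  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>

  #define WINDOW      30
  #define NUM_BUCKETS 65536     /* assumed counter-table size        */
  #define THRESHOLD   1000      /* assumed "repeated content" level  */

  static uint32_t counters[NUM_BUCKETS];

  /* Hash one 30-byte window down to a signature (bucket index). */
  static uint32_t fingerprint(const uint8_t *p)
  {
      uint32_t h = 2166136261u;
      int i;
      for (i = 0; i < WINDOW; i++) {
          h ^= p[i];
          h *= 16777619u;
      }
      return h & (NUM_BUCKETS - 1);
  }

  /* Count every window of the payload; report signatures that become
   * frequent enough to suggest repeated (possibly worm) content. */
  static void scan_payload(const uint8_t *payload, size_t len)
  {
      size_t off;
      for (off = 0; off + WINDOW <= len; off += WINDOW) {
          uint32_t sig = fingerprint(payload + off);
          if (++counters[sig] == THRESHOLD)
              printf("frequent signature 0x%04x flagged\n", (unsigned)sig);
      }
  }

  int main(void)
  {
      uint8_t packet[1500];
      int n;

      memset(packet, 'A', sizeof(packet));   /* stand-in repeated payload */
      for (n = 0; n < 25; n++)
          scan_payload(packet, sizeof(packet));
      return 0;
  }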

Project Team:


A CORBA Event Channel in Hardware

CORBA is a standard for distributed object computing (DOC) middleware. In DOC middleware, objects, which may or may not reside locally, provide services, and clients issue requests for those services to be performed on their behalf. At the heart of distributed object computing are Object Request Brokers (ORBs), which deliver requests to objects and return any output values to clients. Clients do not need to know where on the network objects reside, how they communicate, how they are implemented, or how they execute. Before an application can issue a request to an object, it must hold an object reference to that object. An ORB uses object references to identify and locate objects so it can deliver requests to them. As long as an object reference exists, the ORB allows the holder of that reference to request services from the object.

The CORBA Event Service specification defines an asynchronous communication model that allows an application to send an event that will be received by a number of objects. The Distributed Object Computing Research group at Washington University has implemented a CORBA-compliant ORB endsystem called TAO. TAO's event service defines an Event Channel that acts as a mediator, propagating events to consumers (objects) on behalf of suppliers (applications) as follows (a simplified sketch appears after the list):

  1. The Event Channel allows consumers to register for events.
  2. The Event Channel accepts incoming events from suppliers.
  3. The Event Channel forwards supplier-generated events to registered consumers.
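
The mediation described in this list can be sketched in C as follows: consumers register interest in an event type, and the channel forwards each supplier-generated event to the consumers registered for that type. The types and function names are illustrative and are not TAO's actual interfaces.

  #include <stdio.h>

  #define MAX_CONSUMERS 8

  struct event { int type; const char *data; };
  typedef void (*consumer_cb)(const struct event *e);

  struct registration { int type; consumer_cb push; };

  static struct registration consumers[MAX_CONSUMERS];
  static int num_consumers;

  /* 1. Consumers register for an event type. */
  static void register_consumer(int type, consumer_cb push)
  {
      if (num_consumers < MAX_CONSUMERS) {
          consumers[num_consumers].type = type;
          consumers[num_consumers].push = push;
          num_consumers++;
      }
  }

  /* 2./3. The channel accepts an event from a supplier and forwards it
   * to every consumer registered for its type. */
  static void push_event(const struct event *e)
  {
      int i;
      for (i = 0; i < num_consumers; i++)
          if (consumers[i].type == e->type)
              consumers[i].push(e);
  }

  static void log_consumer(const struct event *e)
  {
      printf("consumer received event type %d: %s\n", e->type, e->data);
  }

  int main(void)
  {
      struct event e = { 42, "sensor reading" };
      register_consumer(42, log_consumer);
      push_event(&e);          /* a supplier pushes an event */
      return 0;
  }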

Currently, TAO's event channel must be implemented as a CORBA object residing on a host. All end systems supplying or receiving events through the event channel object must acquire a valid reference to this object. This project entails implementing the event channel in hardware as a module on the FPX platform. Our implementation must:

  1. Understand Internet Inter-ORB Protocol (IIOP) messages in order to decipher and reply to requests from different ORBs.
  2. Support event type based consumer-supplier subscriptions, and filtering to allow the forwarding of event types to interested consumers.
  3. Support event correlation to notify consumers when a sequence of events has occurred.
  4. Support prioritized dispatching of events through the event channel.

Once the Event Channel has been implemented in hardware, the experiments that were previously run to measure the overhead of TAO's event channel will be re-run to gauge the effect of hardware acceleration.

Details about TAO are on-line. Details about TAO's event channel are also on-line.

Project Team:


On-chip, Networked Logic Analyzer

When debugging the operation of a Field Programmable Gate Array, it can be useful to see the internal values of logic when certain events occur. A logic analyzer lets the circuit developer know the values of internal signals as they change when an event occurs. Most logic analyzers are standalone systems that connect to a host.

For this project, a logic analyzer was implemented that can debug FPGA circuits running on the FPX platform over a network. The circuit consists of auto-generated VHDL code that includes user-defined logic to wait for an event to occur. Once the event occurs, all of the signals identified as important by the user are stored to an on-chip BlockRAM on every consecutive clock cycle until memory is full.

The system uses UDP/IP control packets to transfer the signal values collected from the on-chip BlockRAM to an external software application. The software client controls the operation of the logic analyzer by sending and receiving UDP/IP packets.
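
The read-out path on the software side can be sketched as a short UDP client: send a control packet asking the logic analyzer to dump its captured samples, then wait for the reply. The control port, FPX address, and one-byte "read capture" opcode used here are hypothetical; the real control-packet format is defined by the FPX control software.

  #include <arpa/inet.h>
  #include <netinet/in.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/socket.h>
  #include <sys/types.h>
  #include <unistd.h>

  int main(void)
  {
      int sock = socket(AF_INET, SOCK_DGRAM, 0);
      struct sockaddr_in fpx;
      unsigned char request = 0x01;        /* assumed "read capture" opcode */
      unsigned char reply[1472];
      ssize_t n;

      if (sock < 0)
          return 1;

      memset(&fpx, 0, sizeof(fpx));
      fpx.sin_family = AF_INET;
      fpx.sin_port = htons(7777);                        /* assumed control port */
      inet_pton(AF_INET, "192.168.0.10", &fpx.sin_addr); /* assumed FPX address  */

      /* Ask the logic analyzer to return the captured signal values. */
      sendto(sock, &request, sizeof(request), 0,
             (struct sockaddr *)&fpx, sizeof(fpx));

      /* Each reply packet carries a block of samples from BlockRAM. */
      n = recvfrom(sock, reply, sizeof(reply), 0, NULL, NULL);
      if (n > 0)
          printf("received %zd bytes of captured samples\n", n);

      close(sock);
      return 0;
  }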

Project Team:


On-chip, Networked Logic Analyzer II

This is a second on-chip, networked logic analyzer project. As with the first project, a networked logic analyzer was implemented on the FPX platform. The circuit consists of auto-generated VHDL code that records the values of internal FPGA signals, as well as a triggering circuit that waits for an event to occur. Once the event occurs, all of the signals identified as important are stored to on-chip BlockRAM on every consecutive clock cycle until the memory is full.

The system uses UDP/IP control packets to interface with an external software client. This client program controls the operation of the logic analyzer by sending and receiving UDP/IP packets. It fetches the contents of the BlockRAMs and saves the data to a file on the local machine.

This implementation of the logic analyzer lies within the Internet protocol wrappers on the FPX platform and connects to the circuit under test. A front-end application allows developers to select which internal signals they want to view and when data collection should begin. VHDL code for those signals, as well as the code for the logic analyzer interface, is then generated automatically. Control packets are sent between the logic analyzer hardware and the control software over the network.

Project Team:


Network Packet Compression in Hardware

Most types of data exhibit consistent patterns that follow certain probability distributions. These characteristics can often be exploited for data compression.

The focus of this project is to implement hardware support for network data compression on the FPX platform. Several potential compression algorithms were evaluated to determine their suitability for implementation in hardware. The resulting circuit runs on the FPX and resides directly above the transport layer. The compression circuit encodes all application data in a packet prior to transmission, adding a custom header to convey the necessary control information to the decompression circuit. Upon reception of the packet, the decompression circuit decodes the application data and outputs the packet in its original, uncompressed form. The circuits use arithmetic coding to compress and decompress the data.
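
One way to picture the framing is sketched below: a small custom header is prepended to each payload to tell the decompressor whether the data is compressed and how long it originally was. The header fields are assumptions made for illustration (the project's actual header format is not described here), field byte ordering is glossed over, and the arithmetic coder itself is omitted.

  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>

  /* Hypothetical control header carried ahead of the compressed data. */
  struct comp_hdr {
      uint8_t  compressed;      /* 1 if the payload is arithmetic-coded  */
      uint8_t  reserved;
      uint16_t original_len;    /* payload length before compression     */
  };

  /* Prepend the header to an (already encoded) payload. Falls back to
   * sending the data uncompressed if encoding did not shrink it. */
  static size_t wrap_payload(const uint8_t *encoded, size_t encoded_len,
                             const uint8_t *original, size_t original_len,
                             uint8_t *out)
  {
      struct comp_hdr hdr;
      int use_compressed = encoded_len < original_len;

      hdr.compressed   = use_compressed ? 1 : 0;
      hdr.reserved     = 0;
      hdr.original_len = (uint16_t)original_len;

      memcpy(out, &hdr, sizeof(hdr));
      if (use_compressed) {
          memcpy(out + sizeof(hdr), encoded, encoded_len);
          return sizeof(hdr) + encoded_len;
      }
      memcpy(out + sizeof(hdr), original, original_len);
      return sizeof(hdr) + original_len;
  }

  int main(void)
  {
      uint8_t original[64], encoded[40], packet[128];
      size_t len;

      memset(original, 'A', sizeof(original));
      memset(encoded, 0, sizeof(encoded));   /* stand-in for coder output */
      len = wrap_payload(encoded, sizeof(encoded),
                         original, sizeof(original), packet);
      printf("outgoing payload: %zu bytes (header + data)\n", len);
      return 0;
  }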

Project Team: