Washington University's Applied Research Lab has designed a gigabit
ATM switch architecture and supporting chip set that enable the economical
construction of large-scale ATM switches with link speeds from 150 Mb/s
up to 2.4 Gb/s. The laboratory seeks to license this technology to commercial
organizations for use in state-of-the-art products that offer unique
advantages to customers. This chip set implements a patented architecture
(U.S. patent #5,402,415, March 28, 1995) that provides nearly flat cost
per port across a very wide range of system configurations (a system with
4096 ports of 2.4 Gb/s each has a cost per port only 20% higher than a
system with eight ports) and offers unsurpassed efficiency and flexibility
in the provision of switched multicast.
The figure at right shows an efficient general interconnection topology that can be used for this purpose. If the access switches (the ones connecting to the end systems) are configured so that the number of ports connecting to other switches equals the number connecting to end systems (a 1:1 concentration ratio), then the total number of ports on all the switches is seven times the number of users. For efficient switch architectures, the total system cost is dominated by the per port cost, making the cost of this hierarchical network roughly seven times that of a single large switch. The following plots show how this small switch penalty is affected by the total number of users and the concentration ratio.
In these figures, the vertical axis shows the ratio of the total number
of switch ports required to the number of users.
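To make the port-counting arithmetic concrete, the short Python sketch below computes the ratio of total switch ports to users for a simple hierarchical model in which every level below the top provides c ports toward the end systems for each port toward the next level up, and the top level has only downward-facing ports. The model, the four-level case, and the function name are illustrative assumptions rather than the exact topology of the figure, but the 1:1 case reproduces the factor of seven quoted above.

def ports_per_user(levels, c=1.0):
    """Total switch ports per user for an assumed hierarchy in which each of
    the lower (levels - 1) levels has c down-facing ports for every up-facing
    port, and the top level has down-facing ports only."""
    total = 0.0
    links = 1.0                       # links entering the current level, per user
    for _ in range(levels - 1):
        total += links * (1 + 1 / c)  # down-facing plus up-facing ports
        links /= c                    # links handed up to the next level
    total += links                    # top level: down-facing ports only
    return total

print(ports_per_user(4, c=1))   # 7.0  -- the "seven ports per user" case
print(ports_per_user(4, c=4))   # ~1.66 -- larger concentration ratios shrink the penalty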
It takes large concentration ratios to reduce the small switch penalty
to a tolerable level. In addition to having a higher cost, the
hierarchical network design yields greatly inferior performance with
respect to virtual circuit blocking, queueing delay and cell loss even
with a 1:1 concentration ratio. Its blocking performance is
particularly poor for multicast connections.
The switch consists of three main components, each implemented as a single custom integrated circuit. The Input Port Processors (IPP) at left receive cells from the incoming links, buffer them while they await transmission through the central switching network, and perform the virtual path/circuit translation required to route cells to their proper output (or outputs). The Output Port Processors (OPP) resequence cells received from the switching network and queue them while they await transmission on the outgoing link. Each OPP is connected to its corresponding IPP, providing the ability to recycle cells belonging to multicast connections; this is discussed in more detail below.

The central switching network is made up of Switching Elements (SE) with eight inputs, eight outputs, and a common buffer to resolve local contention. The SEs switch cells to the proper output (or outputs) using information contained in the cell header, or can distribute cells dynamically to provide load balancing. The load balancing option is used in the first k-1 stages of a 2k-1 stage network; in particular, for the configuration shown above (k=2), load distribution is performed in the first stage. Adjacent switch elements employ a simple hardware flow control mechanism to regulate the flow of cells between successive stages, eliminating the possibility of cell loss within the switching network. With this approach, relatively small buffers suffice within the network; larger buffers are provided at the OPPs.
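As a rough illustration of how the stage count mentioned above scales with port count, the sketch below assumes a Benes-style arrangement of the 8-port switch elements: a network with roughly 8^k ports uses 2k-1 stages, with dynamic load distribution in the first k-1 stages. The function and the exact wiring are assumptions consistent with the description, not a specification of the actual network.

def benes_dimensions(num_ports, se_size=8):
    """Stage count and per-stage element count for an assumed Benes-style
    network built from se_size x se_size switching elements."""
    k = 1
    while se_size ** k < num_ports:   # k = levels needed to cover num_ports
        k += 1
    stages = 2 * k - 1
    distribution_stages = k - 1       # stages that perform dynamic load balancing
    elements_per_stage = num_ports // se_size
    return stages, distribution_stages, elements_per_stage

print(benes_dimensions(64))     # (3, 1, 8)   -- the k = 2 configuration shown above
print(benes_dimensions(4096))   # (7, 3, 512) -- a 4096-port configuration (k = 4)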
To provide sufficient bandwidth for 2.4 Gb/s data rates on the external links, the switch carries ATM cells in a 36 bit wide format. Four of the 36 bits contain addressing information and the remainder contains the cell payload, together with auxiliary fields that are added by the IPPs and removed by the OPPs. The internal cell cycle is 16 clock ticks long and the clock frequency used is 120 MHz. This yields an internal cell processing rate that is about 1.3 times the cell processing rate for external links operating at 2.4 Gb/s. The switching network is implemented in four parallel planes, with each plane receiving the same four bits of address information, plus eight bits of data. The core of the system operates fully synchronously, and the switch elements' operation is completely deterministic, meaning that the cells proceed through the four planes in parallel, without any explicit coordination, and are reconstructed at the OPP.
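The bandwidth margin quoted above follows directly from these numbers; the following check assumes a standard 53-byte ATM cell on the external links.

CLOCK_HZ = 120e6           # internal clock frequency
TICKS_PER_CELL = 16        # internal cell cycle, in clock ticks
LINK_RATE = 2.4e9          # external link rate, bits per second
CELL_BITS = 53 * 8         # standard 53-byte ATM cell

internal_rate = CLOCK_HZ / TICKS_PER_CELL   # 7.5 million cells per second
external_rate = LINK_RATE / CELL_BITS       # about 5.66 million cells per second
print(internal_rate / external_rate)        # about 1.33 -- the "about 1.3" margin

# Each internal cell occupies 16 ticks x 32 data bits = 512 bits (64 bytes),
# leaving room for the 53-byte cell plus the auxiliary fields added by the IPPs.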
Because the cell recycling technique consumes part of the system's bandwidth to handle multicast connections, engineering rules are needed to specify how much of the system's capacity to set aside for multicast. In a system with n ports, we can avoid blocking new connection requests if we dedicate a certain fraction of the ports to recycling. The figure below plots the fraction of ports that must be dedicated to recycling under a wide range of conditions. It's important to note that the fraction of the system's capacity that must be dedicated to multicast is independent of the number of ports.
The plot gives the fraction of the total outgoing traffic that belongs to multicast connections on the x axis, and the fraction of the switch's ports that must be dedicated to multicast to ensure that virtual circuit blocking does not occur on the y axis. The different curves reflect different aspects of the system configuration. In particular, the speed advantage is the ratio of the speed of the switch's internal data paths to the speed of the external links (1 if all external links operate at 2.4 Gb/s, 2 if they operate at 1.2 Gb/s). For systems in which the external links can be fully loaded, but the end users are limited to individual virtual circuit rates of 150 Mb/s, the middle curve is the relevant one. Note that if 20% of the traffic is multicast, fewer than 20% of the switch ports must be dedicated to recycling. Also note that the amount of recycling capacity can be smoothly adjusted as the traffic demands change. This ability to adjust the multicast capacity to a wide range of situations, even after a system is deployed, is a unique advantage of the recycling approach to multicast.
The WUGS switch includes a special upstream discard mechanism to enable efficient handling of many-to-many virtual paths and circuits. When a many-to-many connection has traffic entering a switch on multiple input links, the entering traffic streams are first brought together by routing them through a common recycling port. From this point, the cells are forwarded to the required set of output links using a common multicast connection (either virtual path or circuit). If a given link is used as both an input and an output in the multicast connection, cells can be forwarded back to the sender, which is often not desirable. The upstream discard mechanism can be used to prevent this. To implement upstream discard, the IPPs label cells arriving from the external link with the source port number. This allows cells to be optionally filtered on output if the outgoing link is the same as the one on which they arrived.
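A minimal sketch of this filtering step is shown below. The field and function names (Cell, source_port, and so on) are made up for illustration; only the behavior, stamping the source port on arrival and suppressing copies that would return to the sender, comes from the description above.

from dataclasses import dataclass

@dataclass
class Cell:
    vc: int                    # virtual path/circuit identifier
    source_port: int = -1      # filled in by the IPP when the cell arrives
    payload: bytes = b""

def ipp_receive(cell, input_port):
    # IPP step: label the arriving cell with the port it came in on.
    cell.source_port = input_port
    return cell

def opp_forward(cell, output_port, upstream_discard_enabled):
    # Output-side step: suppress the copy that would go back to the sender.
    if upstream_discard_enabled and cell.source_port == output_port:
        return None            # upstream discard
    return cell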
Arriving cells enter the switch element at the upper left, passing through a Skew Compensation Circuit (SKUC) and a Distribution Circuit (DSTC) before entering the Input Crossbar (IXBAR). The IXBAR forwards cells to free buffer slots in the central Cell Buffer (CBUF). Each row of the CBUF stores one cell (actually one fourth of the cell, plus control information) and has an associated Buffer Control Circuit (BCC). Outgoing cells pass through the Output Crossbar (OXBAR) and Header Modification Circuit (HMC) before proceeding to the output. Downstream neighbors provide a one bit flow control signal each cycle, indicating whether they are prepared to receive an incoming cell. These grant signals are used within the OXBAR to control the flow of cells from the CBUF to the outputs. Upstream grants are generated by the Grant Generation Circuit (GGC) based on the number of available CBUF slots. The CBUF has been dimensioned to hold 40 cells.
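The following is a simplified software model of this hop-by-hop flow control. The grant threshold and the dest_output field are assumptions; the description above specifies only that grants are derived from the number of free CBUF slots and that cells are never lost inside the network.

class SwitchElementModel:
    """Toy model of an 8x8 switch element sharing a 40-cell buffer."""
    PORTS = 8
    CBUF_SLOTS = 40

    def __init__(self):
        self.cbuf = []            # cells currently held in the shared buffer

    def upstream_grant(self):
        # GGC: grant the upstream neighbors only if all eight inputs could
        # send this cycle without overflowing the buffer (the exact threshold
        # policy is illustrative, not taken from the text).
        return len(self.cbuf) + self.PORTS <= self.CBUF_SLOTS

    def cycle(self, arriving_cells, downstream_grants):
        # Accept arrivals (never dropped, since upstream sends only when
        # granted), then forward at most one cell to each output whose
        # downstream neighbor asserted its grant this cycle.
        self.cbuf.extend(arriving_cells)
        sent, served = [], set()
        for cell in list(self.cbuf):
            out = cell.dest_output
            if downstream_grants.get(out, False) and out not in served:
                self.cbuf.remove(cell)
                sent.append(cell)
                served.add(out)
        return sent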
The switch element is implemented using four chips, each of which handles
eight bits of the 32 bit data path. The switch element chip, shown at right,
has been fabricated in a 0.8 micron, 3.3 volt CMOS process.
The data flow through the chip is from left to right, with skew
compensation on the periphery at left, followed by a block that
includes the input crossbar split into an upper and lower half
surrounding the DSTC. The dense section that runs from top to bottom
just left of center is the CBUF and the BCCs are immediately to its
right. The large area to the right of the BCCs is the OXBAR. The chip
has approximately 650,000 transistors. The design was specified in
VHDL and implemented with industry-standard synthesis tools and a
standard cell library, using conservative engineering rules
to allow for variations in process, operating temperature, and
power supply voltage.
In addition, the XMB implements a simple block discard mechanism on selected virtual circuits, which performs block discarding on the basis of AAL 5 frames, rather than on individual cells. The particular algorithm used is a variation of the well-known Early Packet Discard technique in which hysteresis is added to damp the oscillations that can otherwise occur, making it possible to achieve 100% throughput with modest buffer sizes. The block discard is implemented through the use of a Block Discard Index (BDI) which is inserted into arriving cells at the VXT. A non-zero BDI indicates a virtual circuit on which block discarding can be performed. The index is used by the Block Discard Controller in the XMB to access one bit of state information that controls the discarding of cells for that virtual circuit.
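One possible rendering of this scheme in software is sketched below. The thresholds, table size, and the way frame boundaries are signalled are assumptions; the description above fixes only the general approach: Early Packet Discard with hysteresis, driven by one bit of state per BDI.

class BlockDiscardController:
    """Illustrative AAL5 frame-level discard with hysteresis."""

    def __init__(self, on_threshold=800, off_threshold=400, num_indices=1024):
        self.on_threshold = on_threshold          # start discarding new frames here
        self.off_threshold = off_threshold        # resume accepting frames only below this
        self.congested = False                    # hysteresis state
        self.discarding = [False] * num_indices   # one bit per BDI value

    def accept(self, bdi, start_of_frame, end_of_frame, buffer_occupancy):
        """Return True to queue the cell, False to discard it."""
        if bdi == 0:                         # VC not enabled for block discard
            return True
        # Update the hysteresis state from the current buffer occupancy.
        if buffer_occupancy >= self.on_threshold:
            self.congested = True
        elif buffer_occupancy <= self.off_threshold:
            self.congested = False
        # Decide only at frame boundaries, so whole frames are kept or dropped.
        if start_of_frame:
            self.discarding[bdi] = self.congested
        keep = not self.discarding[bdi]
        if end_of_frame:
            self.discarding[bdi] = False     # the next frame gets a fresh decision
        return keep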
While the switching network has sufficient bandwidth to ensure that cells normally proceed to the output ports for buffering without congestion, sustained high rate bursts can lead to congestion in the switching network that ultimately backs up to the input ports, causing the RCBs to fill up and exceed a preset threshold. When this happens at a particular IPP, it begins discarding all received cells with CLP=1 or CS=0 and continues discarding these cells until a timer has expired. The timer is generally set with a long enough period to ensure that the congestion has time to clear completely. This is a fairly drastic response to an event that should occur very rarely in a network that is properly engineered and controlled at higher levels.
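In outline, this input-side response amounts to a threshold test and a hold-down timer; the sketch below uses placeholder values and field names (rcb_threshold, hold_seconds, cell.clp, cell.cs), since only the general behavior is specified above.

import time

class InputOverloadGuard:
    """Illustrative IPP response to a backed-up cell buffer (RCB)."""

    def __init__(self, rcb_threshold=1000, hold_seconds=0.01):
        self.rcb_threshold = rcb_threshold   # preset occupancy threshold
        self.hold_seconds = hold_seconds     # long enough for congestion to clear
        self.discard_until = 0.0

    def admit(self, cell, rcb_occupancy, now=None):
        """Return True to admit the cell, False to discard it."""
        now = time.monotonic() if now is None else now
        if rcb_occupancy > self.rcb_threshold:
            self.discard_until = now + self.hold_seconds   # (re)start the timer
        if now < self.discard_until and (cell.clp == 1 or cell.cs == 0):
            return False      # shed these cells until the timer expires
        return True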
The port processors can also accommodate interfaces using a simple serial data link chip set produced by Hewlett Packard (G-link). These interfaces can be operated at data rates of more than 1.2 Gb/s and are currently less expensive than OC-12C devices (roughly one-fourth the cost). The port processors also support a special dual G-link mode in which cells are sent at 2.4 Gb/s over a pair of G-links in parallel. When operated in this mode, the receiving IPP synchronizes the incoming streams to recover the transmitted data. The synchronization mechanism accommodates skew of about 12 ns, so the fiber lengths can differ by no more than about two meters. The dual G-link interface is intended primarily to permit demonstration of 2.4 Gb/s switch operation in the absence of commercially available OC-48C framer chips. While not a substitute for a true 2.4 Gb/s link, it can be useful in certain specialized applications.
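The two-meter guideline follows from the propagation speed of light in fiber; a quick check, assuming a typical core refractive index of about 1.5:

C = 3.0e8            # speed of light in vacuum, m/s
N_FIBER = 1.5        # assumed refractive index of the fiber core
SKEW_BUDGET = 12e-9  # skew the dual G-link synchronizer can absorb, in seconds

v = C / N_FIBER                # roughly 2e8 m/s propagation speed in fiber
print(v * SKEW_BUDGET)         # about 2.4 m -- the same order as the two-meter figure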
A 20 Gb/s capacity system (designated WUGS-20) is currently being built at Washington University. These systems have eight ports on the core fabric, and each port can accept line cards supporting several different transmission interfaces. The kit distribution will include six line cards supporting an interface speed of 1.2 Gb/s, based on Hewlett-Packard serial data link chips (called G-link chips). The other two ports will come equipped with OC-3 interfaces, probably with two OC-3 interfaces on each line card, although the plans for this have not been finalized yet. Line cards are also being designed for OC-12 and OC-48 interface speeds. These will not be included in the core kit distribution, but may be made available for purchase.
The photograph above shows the WUGS-20 switch packaging. The switches are rack-mountable (or can be simply placed on a desktop) and will be about 11 cm tall. Connections will be through the front panel, which can be customized for different line card configurations.
This photo shows the internals of the WUGS-20 switch (looking straight
down from the top with the cover removed) using mock-ups
of the internal components. The switch elements, IPPs and OPPs will
be mounted on a large circuit board, a mock-up of which can be seen
at the bottom of the enclosure. You can see the PGA sockets where the
IPP and OPP chips will be mounted along the left and right sides of the
board mock-up. Line cards sit above this main board,
connecting to it using connectors at the left and right edges of the
enclosure. A mock-up of one line card is shown in the photo, along with
optical modules, which connect to the front panel connectors via fiber optic
pigtails. The main board and line cards are mounted, using stand-offs,
on a central metal plate that can be seen in the figure.
The power supply is at the rear of the enclosure, and power connections
are made to the rear edge of the main board. A set of fans at the rear
draws air through the enclosure to cool the components.
The enclosure was designed to be taller than needed for the standard
line cards, so that experimental line cards needing more board real estate
could be accommodated within the enclosure. Typically, these cards will
have the same dimensions as a standard card, but will also have a
mezzanine card that sits above the base card. These mezzanine cards may
extend over a larger area within the enclosure if need be (occupying
space above the other standard line cards).
This system would be made up of eight I/O modules, each containing a large circuit board hosting two switch elements (eight chips) and eight port processors (16 chips), together with eight line cards, each containing two 1.2 Gb/s interfaces. These would connect to a backplane that would provide connections to a set of small cards at the top that would carry the switch element chips constituting the center stage of the required three stage network. Such a system could be packaged in a rack-mountable unit, approximately 75 cm in height with fiber connections through a front panel.
Still larger systems can be constructed using subsystems similar to this. The figure above shows components that can be used for systems with a capacity of up to 1.2 Tb/s. This system could be implemented in five standard equipment racks.
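For orientation, these capacities are consistent with simple port arithmetic. The module counts below are assumptions chosen to match the quoted figures, not configurations taken from the text.

PORT_RATE = 2.4e9         # bits per second per port at full speed
PORTS_PER_MODULE = 8      # eight port processors per I/O module

def capacity_bps(num_modules):
    return num_modules * PORTS_PER_MODULE * PORT_RATE

print(capacity_bps(8) / 1e9)    # about 154 Gb/s for the eight-module system above
print(capacity_bps(64) / 1e12)  # about 1.23 Tb/s, matching the quoted upper end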
Acknowledgements. This work has been supported by a major contract from the Advanced Research Projects Agency and a grant from the National Science Foundation.
Prepared by Jonathan Turner: jst@cs.wustl.edu. Last updated 3/4/96.