Current Research Interests

My research approach has always been driven by trying to understand how technology trends give us new approaches for overcoming the limits of existing systems architectures. Today, the challenge is to develop, deploy, and evolve very-large-scale applications that typically execute across thousands of network-connected servers, either within a single datacenter or distributed around the world. The applications of interest are large-scale Internet sites like Amazon or eBay, Internet search like Google, or globe-spanning content distribution like Akamai. My interest is not so much the specific application, but rather the system architecture that makes it possible to develop such applications orders of magnitude faster, and to run them with far fewer people, than is possible today.

The key technology enablers we are investigating are statistical learning theory, a powerful mathematical framework that supports a quantitative approach to characterizing the behavior of systems, and programmable network elements, essentially network routers with the ability to "deep classify" packets and flows by examining packet contents beyond IP headers. Our technical approach builds on the paradigm of Observe network behavior, Analyze it using statistical methods, and Act upon it to control network behavior.

Today, edge services are deployed not as shrink-wrapped software but as plug-in network appliances. Essentially, these are routers with service extensions that provide protocol-aware packet and stream classification at line speed, invocation of actions based on those classifications, and policy-based routing. Packet contents may be modified, delayed, or filtered based on preferences, context, or policies specified by administrators. New devices, Programmable Network Elements (PNEs), are now available that give even more control over network processing. Using them as a foundation, the OASIS Project is investigating a comprehensive network behavior observation infrastructure for enterprise networks. It collects observations (protocol types, packet sizes, source and destination addresses, and numerous other attributes of packets and their flows) from points within the enterprise (server-side, Internet-side, and access-side) to determine correlations and to test their causality. For example, if two attributes correlate statistically, we can test causality by using the PNEs to reduce or increase one attribute of the traffic flow and verifying whether the second attribute follows. Observation collection and guided causality experimentation feed our development of statistical models that discriminate “normal” from “abnormal” behaviors. In turn, this enables management algorithms to invoke the actions supported by the underlying PNEs to block “bad” traffic, slow “suspect” traffic, and protect “good” traffic when network and system performance are under stress. Successfully developing such an infrastructure and its associated algorithms, and interfacing them to higher middleware layers in collaboration with other RADLab researchers, will ultimately enhance the reliability and dependability of distributed applications deployed in enterprise networks and large-scale datacenters.
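The correlate-then-intervene idea described above can be illustrated with a small Python sketch. It is only a toy under stated assumptions: the two attributes (request rate and latency), the simulate_network function, the rate caps, and the 1.0 ms threshold are all hypothetical stand-ins invented for illustration, not part of the OASIS infrastructure or any PNE interface.

```python
import random
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate_network(request_rate, rng):
    """Hypothetical network model: latency (ms) grows with request rate."""
    return 5.0 + 0.02 * request_rate + rng.gauss(0, 0.5)


def observe(n, rng):
    """Observe step: passively sample both attributes without interfering."""
    rates = [rng.uniform(100, 1000) for _ in range(n)]
    latencies = [simulate_network(r, rng) for r in rates]
    return rates, latencies


def intervene(rate_cap, n, rng):
    """Act step: cap the first attribute (as a PNE might throttle traffic)
    and return the mean of the second attribute under that cap."""
    capped = [min(rng.uniform(100, 1000), rate_cap) for _ in range(n)]
    return mean(simulate_network(r, rng) for r in capped)


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Analyze: do the two attributes correlate in passive observations?
rates, latencies = observe(500, rng)
correlation = pearson(rates, latencies)

# Test causality: throttle the first attribute and see whether the second follows.
baseline = intervene(rate_cap=1000, n=500, rng=rng)   # cap has no effect
throttled = intervene(rate_cap=300, n=500, rng=rng)   # cap is binding
second_attribute_follows = throttled < baseline - 1.0  # assumed threshold

print(f"correlation={correlation:.2f}")
print(f"baseline={baseline:.1f} ms, throttled={throttled:.1f} ms")
print(f"second attribute follows: {second_attribute_follows}")
```

In this toy model the dependence is built in, so the intervention confirms what the correlation suggests. With real traffic, the value of the intervention step is that it can also refute a correlation that arises from a shared hidden cause.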

Last updated: 27 December 2005, Randy H. Katz, randy@cs.Berkeley.edu