Tree-based Overlay Networks for Scalable, Reliable Tools and Applications.
Dorian Arnold (University of Wisconsin-Madison)
Monday February 11, 2008
10:00 a.m. - SENSQ 5317
Refreshments at 9:30 a.m.
Hosted by Sangyeun Cho
Abstract
HPC systems continue to grow in size and complexity, making the development of scalable software systems increasingly difficult. As a result, very few tools and applications run effectively, or at all, at today's largest scales (tens and hundreds of thousands of processors). To make matters worse, million-processor systems are scheduled for availability within the next two to four years.
Tree-based Overlay Networks (TBONs) have proven to be an effective computational model for scalable distributed tools and applications. A TBON is a network of hierarchically organized processes that exploits the logarithmic scaling properties of trees to provide scalable data multicast, gather, and in-network aggregation. In this talk, I will describe the TBON model, demonstrating its power and flexibility with scalability results up to 131,072 processors from a variety of application domains. I will also describe our novel TBON failure recovery model, state compensation, which relies on inherent information redundancies among TBON processes. State compensation features fast, decentralized tree reconstruction and state recovery protocols involving a small subset of the tree and no process coordination. The protocols are scalable because their performance is a function of the tree's fan-out, not total size. A tree with a fan-out of 64 recovers from failures in milliseconds: with only four levels, such a tree supports over 16,000,000 processes!
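The scaling arithmetic in the abstract can be checked with a short sketch. This is an illustration only, not code from the talk or from any TBON implementation: the names `leaf_capacity` and `tree_reduce` are hypothetical, and it assumes that "four levels" means four levels of fan-out below the root, which gives 64^4 = 16,777,216 leaf processes. The second function simulates in-network aggregation in a single process, combining at most `fan_out` values per internal node at each level.

```python
def leaf_capacity(fan_out: int, levels: int) -> int:
    """Leaf count of a complete tree with `levels` levels of fan-out below the root.

    Assumption: the root itself is not counted as a level.
    """
    return fan_out ** levels


def tree_reduce(values, fan_out, combine=sum):
    """Aggregate leaf values level by level, as a TBON-style reduction would.

    Each pass groups the current layer into chunks of at most `fan_out`
    values and combines every chunk into one value for the parent layer.
    Returns the final result and the number of levels that were needed.
    """
    layer = list(values)
    if not layer:
        raise ValueError("need at least one leaf value")
    levels = 0
    while len(layer) > 1:
        layer = [combine(layer[i:i + fan_out])
                 for i in range(0, len(layer), fan_out)]
        levels += 1
    return layer[0], levels


if __name__ == "__main__":
    # Fan-out 64 with four levels, as in the abstract.
    print(leaf_capacity(64, 4))

    # Summing 4096 leaf values with fan-out 64 takes two levels.
    total, depth = tree_reduce(range(4096), fan_out=64)
    print(total, depth)
```

The number of levels grows with the logarithm of the leaf count, base the fan-out, which is why a wide tree stays shallow even at very large process counts.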
Biography of speaker
Dorian Arnold is a doctoral candidate and Intel Foundation Ph.D. fellow in the Computer Sciences Department at the University of Wisconsin. He holds an M.S. degree in Computer Science from the University of Tennessee and a B.S. degree in Mathematics and Computer Science from Regis University (Denver, CO). From 1999 to 2001, Dorian served as technical lead of the NetSolve project at the University of Tennessee's Innovative Computing Laboratory. In 2006, Dorian was a technical scholar at Lawrence Livermore National Laboratory. His research focuses on the performance and scalability issues of large distributed systems, including efficient communication and runtime data analysis, fault tolerance, and system deployment.





