Date of Award
Campus Access Dissertation
Computer Science and Engineering
The Internet is increasingly being used for mission-critical applications and it is expected to be always available. Unfortunately, service disruptions happen even in well-managed networks, due to link and node failures. Although most failures affect only a single link or component, a significant fraction of failures involve more than one link or communications node. In order to support time-sensitive applications such as Voice Over IP (VOIP), these networks need to survive failures or route changes with minimal service interruptions. For example, a disruption time of longer than 50 ms can cause calls using Voice over IP be dropped. Therefore, providing uninterrupted service availability despite transient and possibly multiple independent failures is a major challenge for service providers, which is the focus of this dissertation. There have been several recent proposals based on local rerouting to provide greater network availability despite failures. While most of these proposals are effective in handling single failures, they can cause either loops or dropped packets in the case of multiple independent failures. This dissertation illustrates three ways to handle multiple failures either by isolating them in time, so as to treat them as serial failures, or by isolating them spatially in the network, and dealing with them in parallel. In addition, we use a combination of these two methods to expand the capabilities of current networks to recover from disruptions. In order to isolate multiple failures in time, a routing scheme needs to quickly converge to a new topology while at the same time ensuring forwarding continuity during the transition. Therefore, fast restoration of loop-free forwarding along the optimal paths is imperative for a routing scheme. Three potential problems exist which can degrade performance: failures can be circumvented quickly with local rerouting, but packets sometimes take potentially long detours, increasing the path stretch; global recomputation of new optimal routes creates a convergence delay, and routing updates can cause forwarding loops during convergence; attempts to avoid these transient loops by imposing order on the process can increase the convergence delay further. Fast Convergence with Fast Reroute (FCFR) overcomes these problems, i.e., it is always loop-free, and it returns to optimal routing as quickly as possible without inducing any convergence delay. FCFR employs an existing fast reroute scheme such as NotVia, and needs just one additional bit in the packet header to achieve loop-free fast convergence. If isolation in time is not possible, the network must ensure forwarding continuity even during recovery from multiple failures. By using Localized On-demand Link State (LOLS) routing, each packet can carry a blacklist, which is a set of failed links encountered along its path, and its next hop is then determined by excluding the blacklisted links. The blacklist is reset as soon as the packet makes forward progress towards the destination and therefore, can be encoded in just a few bits. Furthermore, blacklist-based forwarding entries at a router can be precomputed for a given scenario of failures requiring protection, thus permitting implementation in high-speed networks. Our ultimate goal is to extend the benefits of FCFR to any two failures in the network, while only introducing a minimum amount of overhead per packet. Thus, we propose Multi-failure FCFR, which utilizes a multi-failure fast reroute scheme, such as LOLS, and requires just 2 additional bits in the packet header in order to achieve loop-free fast convergence for multiple failures.
Robertson, G. S.(2012). Handling Multiple Failures in IP Networks. (Doctoral dissertation). Retrieved from http://scholarcommons.sc.edu/etd/796