Originally written for the APNIC Blog

by Samuel Steffen (former Master’s student at NSG)

Arguably, the most effective method to prevent downtime due to network misconfigurations is detecting them before configurations are deployed.

Network verification, which is concerned with checking properties such as reachability or load balancing, is an important instrument for reasoning about network behaviour. As such, there is an increasing number of tools allowing network operators to check properties ‘off-line’, before any issues manifest in the real network.

In practice, network components such as links and nodes are subject to failures, and incorporating these in the analysis is essential. Most existing tools are able to check a given property for a whole set of failure scenarios, such as all scenarios where up to k links fail simultaneously (for a small number k), and produce a binary answer of whether the property holds in _all_ considered scenarios or not. For instance, you can use these to verify whether some customer is guaranteed reachability, even in the presence of, at most, one link failure.

## Analysing ‘soft’ properties

While performing such an analysis is useful, what often matters in practice is the fraction of time (or, more technically, the probability) a property holds.

There are many such ‘soft’ properties, which may be violated for a small fraction of time in exchange for cheaper network designs, for example. For instance, an operator may accept traffic leaving via a suboptimal egress at the West Coast 5% of the time, as long as reachability is guaranteed for 99.99% of the time. As such, soft properties often emerge when reasoning about compliance with service level agreements (SLAs).

The results of a survey we at ETH Zurich conducted among 52 network operators indicated that performing such a probabilistic analysis is indeed practically relevant, yet very challenging. This led us to investigate in a recent research project how to accurately compute the probability of network properties under random failures. In the process, we have designed the network analyser, NetDice.

## Probabilistic verification of network configurations using NetDice

When provided with a probabilistic failure model and a network configuration, NetDice computes the probability that a given property holds in the network. It supports the most common routing protocols and forwarding mechanisms (OSPF, BGP, ECMP, and static routes), and is able to verify four 9s SLAs within minutes, even in large networks with hundreds of links.

## Pruning the failure space using ‘cold edges’

NetDice computes the property probability by efficiently exploring the failure space and pruning irrelevant scenarios. Our key insight is the existence of so-called ‘cold edges’, which are links whose failure is guaranteed not to affect whether the property holds. For example, consider the OSPF network in Figure 1 (left), where traffic from A to B is forwarded along the shortest path as indicated. From this we can make the following key observation: if any combination of links outside the shortest path fails, the forwarding graph does not change.

The failures shown in Figure 1 (centre) do not change how the traffic is forwarded. As a consequence, any property reasoning about that traffic is not affected by the failures. All edges outside the shortest path are cold, as they are irrelevant for the property. In contrast, a failure of the ‘hot’ edge shown in Figure 1 (right) leads to a different forwarding graph.

Once the cold edges are identified, their failures can essentially be skipped when computing the probability of the property. That is, by computing the probability mass of all possible cold edge failures, we can cover these at once. By introducing failures of hot edges and applying this idea recursively, NetDice explores the space of remaining failure scenarios.

Identifying cold edges for basic shortest-paths routing is rather simple. However, for more complex routing protocols such as BGP in interaction with OSPF, identifying cold edges is challenging. At its core, NetDice runs an elaborate algorithm to identify cold edges for common routing protocols. This enables fast inference even in large networks, and allows network operators to verify SLA compliance within minutes.

## Inter-department collaboration leading to open-source tool

NetDice is a project at the ICE Center, an interdepartmental research centre formed by the Networked Systems Group and the Secure, Reliable, and Intelligent Systems Lab at ETH Zurich.

A prototype implementation of NetDice supporting various common properties is freely available on GitHub. You can also watch a video of our recent presentation on it at SIGCOMM 2020.

Though the set of currently supported protocols and configurations is limited, we believe that NetDice makes a valuable first step towards practical network analysis in the face of probabilistic failures.

We are looking forward to extending NetDice and seeing its application in real-world networks!