Random testing has proven to be an effective way to catch bugs in concurrent and distributed

systems. This is often surprising, as the space of executions is enormous and conventional formal

methods intuition would suggest that bad behaviors would only be found by extremely

unlikely coincidences.

Empirically, many bugs in distributed systems can be explained by interactions among

only a small number of features. Thus, one can attempt to explain the effectiveness of

random testing under various “small depth" hypotheses. In particular, it may be possible to

test all interactions of k features for a small constant k by executing a family of tests that is

exponentially or even doubly-exponentially smaller than the family of all tests. Moreover,

under certain conditions, a randomly chosen small set of tests is sufficient to cover all k-wise

interactions with high probability.

I will describe two concrete scenarios. First, I will describe bugs in distributed systems

caused by network partition faults. In many practical instances, these bugs occur due to two

or three key nodes, such as leaders or replicas, not being able to communicate, or because

the leading node finds itself in a block of the partition without quorum. In this case, I will

show using the probabilistic method that a small set of randomly chosen tests will cover all

“small partition” scenarios with high probability.

Second, I will consider bugs that arise due to unexpected schedules (interleavings) of

concurrent events. Again, many bugs depend only on the relative ordering of a small number

of events (the “bug depth” of the bug). In this case, I will show a testing algorithm that

prioritizes low depth interleavings and a randomized testing algorithm that bounds the

probability of sampling any behavior of bug depth k for a fixed k. The testing algorithm

is based on combinatorial insights from the theory of partial orders, such as the notion

of dimension and its generalization to d-hitting families as well as results on online chain

partitioning.