This article presents the algorithm that solves the Byzantine General’s Problem, as first described by Lamport, Pease, and Shostak in 1982 [1]. While Lamport’s algorithm is not particularly complex, programmers who aren’t used to working on distributed computation might find it difficult to implement. To accompany the explanation of the algorithm, I have included a C++ program designed for experimentation with the solution.

Introduction

The Byzantine General’s Problem is one of many in the field of agreement protocols. In 1982, Leslie Lamport described this problem in a paper written with Marshall Pease and Robert Shostak. Lamport framed his paper around a story problem after observing what he felt was an inordinate amount of attention received by Dijkstra’s Dining Philosophers problem [2].

This problem is built around an imaginary General who makes a decision to attack or retreat, and must communicate the decision to his lieutenants. A given number of these actors are traitors (possibly including the General.) Traitors cannot be relied upon to properly communicate orders; worse yet, they may actively alter messages in an attempt to subvert the process.

When we’re not dwelling in storybook land, the generals are collectively known as processes, the general who initiates the order is the source process, and the orders sent to the other processes are messages. Traitorous generals and lieutenants are faulty processes, and loyal generals and lieutenants are correct processes. The order to retreat or attack is a message with a single bit of information: a one or a zero.

In general, a solution to an agreement problem must pass three tests: termination, agreement, and validity. As applied to the Byzantine General’s problem, these three tests are:

  1. A solution has to guarantee that all correct processes eventually reach a decision regarding the value of the order they have been given.
  2. All correct processes have to decide on the same value of the order they have been given.
  3. If the source process is a correct process, all processes have to decide on the value that was original given by the source process.

Note that one interesting side effect of this is that if the source process is faulty, all other processes still have to agree on the same value. It doesn’t matter what value they agree on, they simply all have to agree. So if the General is subversive, all lieutenants still have to come to a common, unanimous decision.

Difficulties

This agreement problem doesn’t lend itself to an easy naïve solution. Imagine, for example, that the source process is the only faulty process. It tells half the processes that the value of their order is zero, and the other half that their value is one.

After receiving the order from the source process, the remaining processes have to agree on a value that they will all decide on. The processes could quickly poll one another to see what value they received from the source process.

In this scenario, imagine the decision algorithm of a process which receives an initial message of zero from the source process, but sees that one of the other processes says that the correct value is one. Given the conflict, the process knows that either the source process is faulty, having given different values to two different peers, or the peer is faulty, and is lying about the value it received from the source process.

It’s fine to reach the conclusion that someone is lying, but making a final decision on who is the traitor seems to be an insurmountable problem. And in fact it can be proven that it is impossible to decide in some cases. The classic example used to show this is when there are only three processes: one source process and two peer processes.

In the two configurations shown in Figure 1 and Figure 2, the peer processes attempt to reach consensus by sending each other their proposed value after receiving it from the source process. In Figure 1, the source process (P1) is faulty, sending two different values to the peers. In Figure 2, P3 is faulty, sending an incorrect value to the peer.

You can see the difficulty P2 faces in this situation. Regardless of which configuration it is in, the incoming data is the same. He has no way to distinguish between the two configurations, and no way to know which of the two other processes to trust.



Figure 1
The case in which the source process is faulty


Figure 2
The case in which P3 is faulty


This situation doesn’t necessarily get better just by throwing more non-faulty processes at the problem. A naïve algorithm as shown in Figure 1 and Figure 2 might have each process tell every other process what it received from P1. A process would then decide on the correct value by taking a simple majority of the values in its incoming messages.

Under the rules of this approach, it is easy to show that regardless of how many processes are in the system, a subversive source process with one collaborator can cause half the processes to choose to attack, while half the processes elect to retreat, leading to maximum confusion.

The Lamport, Pease and Shostak Algorithm

In 1982, Lamport, Pease, and Shostak published a fairly simple solution to this problem. The algorithm assumes that there are n processes, with m faulty processes, where n > 3m. Thus, for a scenario such as that in Figure 1 and 2 with 1 faulty process, there would have to be a minimum of 4 processes in the system to come to agreement. (For the rest of this article, n will always refer to the count of processes, and m will always refer to the number of faulty processes.)

The definition of the algorithm in the original paper is short and succinct, but at least in my experience, is somewhat confusing for programmers without a lot of experience in distributed algorithms.

Lamport’s algorithm is a recursive definition, with a base case for m=0, and a recursive step for m > 0:

  Algorithm OM(0)

  1. The general sends his value to every lieutenant.
  2. Each lieutenant uses the value he receives from the general.

Algorithm OM(m), m > 0

  1. The general sends his value to each lieutenant.
  2. For each i, let vi be the value lieutenant i receives from the general. Lieutenant i acts as the general in Algorithm OM(m-1) to send the value vi to each of the n-2 other lieutenants.
  3. For each i, and each j ≠ i, let vi be the value lieutenant i received from lieutenant j in step 2 (using Algorithm (m-1)). Lieutenant i uses the value majority (v1, v2, … vn).

Lamport’s Algorithm Definition


To most programmers, this is going to look like a conventional recursive function definition, but it doesn’t quite fit into the mold you learned when studying the example of factorial( n ).

Lamport’s algorithm actually works in two stages. In the first step, the processes iterate through m+1 rounds of sending and receiving messages. In the second stage of the algorithm, each process takes all the information it has been given and uses it to come up with its decision.

I found this to be non-obvious without quite a bit of study, which is the reason for this article.

The First Stage

The first stage of the algorithm is simply one of data gathering. The algorithm defines m+1 rounds of messaging between all the processes.

In round 0, the General sends the order to all of its lieutenants. Having completed his work, the General now retires and stands by idly waiting for the remaining work to complete. Nobody sends any additional messages to the General, and the General won’t send any more messages.

In each of the remaining rounds, each lieutenant composes a batch of messages, each of which is a tuple containing a value and a path. The value is simply a 1 or a 0. The path is a string of process ids, <ID1, ID2, …, IDn>. What the path means in this context is that in Round N, PIDN is saying that was told in round N-1 that PIDN-1 was told by… PID1 that the command value was v. (This is very much like the classic party game in which a message is whispered from ear to ear through a chain of players, becoming slightly mangled along the way.) No path can contain a cycle. In other words, if ID1 is 1, no other ID in the string of process IDs will be a 1.

The message definition is easy in round 1. Each process broadcasts a message to all the other processes, including itself, but excluding the General, with the value it received from the General and its own process ID.

In subsequent rounds, things get more complicated. Each process takes all the messages it received from the previous round, appends its process ID where allowed, and sends those messages to all other processes, including itself. (The “where allowed” just means that the process skips any messages where adding its process ID to the list would create a cycle in the string of process IDs.)

For example, let’s suppose that in Round 0 that P1, a faulty general told P2, P3, and P4 that the command value was 0, and told P5, P6, and P7 that the command value was 1. In round 1, the following messages would be sent:

Sender=P2 Sender=P3 Sender=P4 Sender=P5 Sender=P6 Sender=P7
Dest Msg Dest Msg Dest Msg Dest Msg Dest Msg Dest Msg
P2 {0,12} P2 {0,13} P2 {0,14} P2 {1,15} P2 {1,16} P2 {1,17}
P3 {0,12} P3 {0,13} P3 {0,14} P3 {1,15} P3 {1,16} P3 {1,17}
P4 {0,12} P4 {0,13} P4 {0,14} P4 {1,15} P4 {1,16} P4 {1,17}
P5 {0,12} P5 {0,13} P5 {0,14} P5 {1,15} P5 {1,16} P5 {1,17}
P6 {0,12} P6 {0,13} P6 {0,14} P6 {1,15} P6 {1,16} P6 {1,17}
P7 {0,12} P7 {0,13} P7 {0,14} P7 {1,15} P7 {1,16} P7 {1,17}

Table 1
Messages sent by all six lieutenant processes in round 1


The number of messages goes up in in the second round. From the previous iteration, we know that each process now has six values that it received in the previous round – one message from each of the six other non-source processes – and it needs to send each of those messages to all of the other processes, which might mean each process would send 36 messages out.

In the previous table I showed the messages being sent to all six processes, which is fairly redundant, since the same messages are broadcast to all processes. For round 2, I’ll just show you the set of messages that each process sends to all of its neighbors.

Sender=P2 Sender=P3 Sender=P4 Sender=P5 Sender=P6 Sender=P7
{0,132}
{0,142}
{1,152}
{1,162}
{1,172}
{0,123}
{0,143}
{1,153}
{1,163}
{1,173}
{0,124}
{0,134}
{1,154}
{1,164}
{1,174}
{0,125}
{0,135}
{0,145}
{1,165}
{1,175}
{0,126}
{0,136}
{0,146}
{1,156}
{1,176}
{0,127}
{0,137}
{0,147}
{1,157}
{1,167}

Table 2
Messages sent by all six processes in round 2


The six messages that P2 received in round 1 were {0,12}, {0,13}, {0,14}, {1,15}, {1,16}, and {1,17}. According to the earlier definition, P2 will append its process ID to the path and forward each resulting message to all other processes. The possible messages it could broadcast in round 2 are {0,122}, {0,132}, {0,142}, {1,152}, {1,162}, and {1,172}. The first message, {1,122} contains a cycle in the path value of the tuple, so it is tossed out, leaving five messages to be sent to all processes.

The first message that P2 is sending in round 2, {0,132}, is equivalent to saying “P2 is telling you that in round 1 P3 told it that in round 0 that P1 (the General) told it that the value was 0″. The five messages shown in P2’s column in the table are sent to all six lieutenant processes, include itself.

It’s easy to see that as the number of processes increases, the number of messages being exchanged starts to go up rapidly. If there are N processes, each process sends N-1 messages in round 1, then (N-1)*(N-2) in round 2, (N-1)*(N-2)*(N-3) in round 3. That can add up to a lot of messages in a big system.

The Second Stage

While sending messages in each round, processes are also accumulating incoming messages. The messages are stored in a tree format, with each round of messages occupying one rank of the tree. Figure 3 shows the layout of the tree for a simple configuration with six processes, one of which can be faulty. Since m=1, there are just two rounds of messaging: the first, in which the general sends a value to each lieutenant process, and a second, in which each process broadcasts its value to all the other processes. Two rounds of messaging are equivalent to two ranks in the tree.

Each node in the tree has three elements: an input value, a path, and an output value. The input value and path are defined in the first stage of the algorithm – they are simply the messages received from the peer processes. The output value is left undetermined until the second stage of the algorithm, which I am defining here. Note that in the figure below, the output values are initially set to ‘?’, indicating that they are presently unknown.



Figure 3
The Tree Layout for 5 processes with 1 faulty process

In Figure 3, there are six processes, and the General (P1) is faulty – sending a 1 to the first three lieutenants and 0 to the last two. The subsequent round of messaging results in P2 having an information tree that looks just like that shown in Figure 3. (Because only the General is faulty, in this case all other processes will have an identical tree.)

Once a process has completed building its tree, it is ready to decide on a value. It does this by working its way up from the leaves of the tree, calculating the majority value at each rank and assigning it to the rank above it. The output value at each level is the third item in the data structure attached to each node, and those values are all undefined during the information gathering stage.

Calculating the output values is a three step process:

  1. Each leaf node in the tree (all values at rank m) copies its input value to the output value.
  2. Starting at rank m-1 and working down to 0, the output value of each internal node is set to be the majority of the output values of all its children. In the event of a tie, an arbitrary tie-breaker is used to assign a default value. The same default value must be used by all processes.
  3. When complete, the process has a decision value in the output of the sole node at rank 0.

In Figure 3, step 1 of the process assigns the initial values to the leaf nodes. In the next step, the majority value of { 1, 1, 1, 0, 0 } is evaluated and returns a value of 1, which is assigned to the output value in rank 0. Because that is the top rank, the process is done, and P1 decides on a value of 1.

Every lieutenant value in a given exercise will have the same paths for all its nodes, and in this case, since only the General is faulty, we know that all lieutenants will have the same input values on all its leaves. As a result, all processes will agree on the same value, 1, which fulfills the agreement property.

A More Complicated Example

Getting a good understanding of the algorithm really requires walking through an example that has at least three ranks. (Examples on the web, mostly extracted from lecture notes, nearly always have the simple two-rank example.) For this example, consider an example with n=7 and m=2. We’ll continue with the convention that the General is P1, and instead of having a faulty general, we’ll have P6 and P7 be faulty processes. After the initial three rounds of information exchange, we have the three-ranked tree shown in Figure 4:



Figure 4
A tree with n=7, m=2, and faulty processes P6 and P7

The important thing to note in these trees is that I’ve inserted the value ‘X’ for the input values of any input value that comes from the two faulty processes. We don’t know what P6 and P7 might send in any given round, so in general, we’ll try to work through the algorithm without constricting their incorrect messages to any specific values.

You’ll see that at rank 1, the values from path 17 and 16 are both set to X. In the first round the two faulty processes communicated possibly false values to all other processes, and may have arbitrarily changed the values sent to different processes in order to skew the results.

As a result of those bad values in rank 1, we see their frequent occurrence in rank 2. The incorrect values show up not only in direct messages from the faulty processes, but also in any message from a correct process that includes a faulty process earlier in its path.

All in all, at the leaf nodes, we have 18 deceptive values at the leaf nodes, and only 12 accurate messages that trace their way all the way back to the general through nothing but correct processes. Obviously, if we just voted on the majority of the messages we had received, we would be susceptible to falling for the wrong value.

Fortunately, the layout of the tree guarantees that we will actually get a correct value. In Figure 4, the roll up of the output values hasn’t occurred yet, so every node has a question mark in the output value. In Figure 5, the output values are shown. The leaf rank has the output values set to the input values, with X used to indicate unknown values from faulty processes.

When the leaf rank is rolled up to the second rank, the nodes with paths 12, 13, 14, and 15 all have clear majority values of 0 for their output values, with 16 and 17 set to X, as their values are uncertain.

The final roll up to the top rank successfully sets the output value to 0, as four of the inputs are set to 0 and only 2 are set to X. Mission accomplished. And because of the way this was calculated, we know that the correct result will be achieved regardless of what deceptive values are sent by the two faulty processes.



Figure 5
The tree after calculating the output values

The Sample Code

I’ve included a simple C++ program that implements this algorithm, with extensive internal documentation. It has a Process class that is used to send and receive messages, as well as to roll up the decision tree. A Traits class is used to define the number of processes, the number of faulty processes, the source process, and what values the faulty processes send in various rounds.

To help with visualization, the program will output the tree for a given process in the format used by dot, part of the free Graphviz program. You can then use dot to create a nice picture of the output graph – all the figures in this article were created that way. I find that using the SVG format for the output produces very readable results.

As supplied, the program is set for values of n=7 and m=2. Good exercises to perform while experimenting with it include:

  • Attempt to invalidate the program or the algorithm by getting incorrect results with some particular combination of faulty messages.
  • Add a third faulty process and show that it is relatively easy to get invalid output when n=7 and m=2.
  • Reduce n to 6 and show that it is relatively easy to get invalid output with two faulty processes.
  • Move up to m=3 and n=10. Experiment with various combinations of faulty Generals and lieutenants and see if you can create incorrect results.

Note

Most implementations of this algorithm include logic designed to deal with the case in which a faulty process fails to send a message. This is equivalent to simply having the faulty process send an arbitrary value, so I don’t treat it as a separate case.

Source Code

source.zip, which contains:

  • main.cpp
  • README.TXT
  • VS2003/byzantine.sln
  • VS2003/byzantine.vcproj
  • VS2005/byzantine.sln
  • VS2005/byzantine.vcproj

References

[1] The Byzantine Generals Problem (with Marshall Pease and Robert Shostak)
ACM Transactions on Programming Languages and Systems 4, 3 (July 1982), 382-401.
http://research.microsoft.com/users/lamport/pubs/byz.pdf

[2] The Writings of Leslie Lamport: http://research.microsoft.com/users/lamport/pubs/pubs.html#byz

Lynch, Nancy A. Distributed Algorithms. San Francisco, CA: Morgan Kaufmann, 1997. ISBN: 1558603484.

Graphviz – Graph Visualization Software. http://www.graphviz.org/

CS 6378: Advanced Operating Systems Section 081 Notes on Agreement Protocols (lecture notes by Neeraj Mittal, University of Texas at Dallas) http://www.utdallas.edu/~neerajm/cs6378su07/agreement.pdf

Update 01-November-2007
Distributed Computing With Malicious Processors w/o Crypto or Private Channels, YouTube Video of Google TechTalk, Valerie King, Department of Computer Science, University of Victoria, Victoria, BC, Canada