Software is terrible. This is mostly okay, except that there are a few specific types of software which we would much rather wasn’t terrible. For example, the software that controls your car’s throttle. Or the software that controls spaceship navigation. Or the first artificial general intelligence, if one is eventually made.

When correctness really matters, we break out the mathematical machinery and formally prove that our software is correct. Or rather, we feel guilty about how we aren’t formally proving our software correct, while academics prove the correctness of toy versions of our software and wonder why we aren’t doing it too. (It’s not like a line count and feature set 10^3 times bigger is a *fundamental* obstacle).

This is my attempt to outline the problem of making large-scale provably software. At the end of this blog post, you will be ready to apply formal methods to proving the correctness of a large system have an appreciation for why formally proving the correctness of software is hard.

Like most discussions of formal proof in software, we’re going to start with a sort function. This might seem silly, because sorting isn’t that hard, but applying formal proof methods revealed a bug in OpenJDK’s sort function which had gone undetected for years and survived extensive unit tests.

The proof-making process goes like this. First, we define what it means for our function to be correct. In the case of a sort function, that means it takes a list and returns a list, elements of the input and output lists match up one-to-one, and each element of the output list other than the first element is greater than or equal to the element before it. This is the *extensional* definition of sorting. Then we write the sort function itself, which is a *constructive* definition: it defines the same thing, but from a different angle. Then we informally prove that the two definitions are equivalent, and then we convert the informal proof into a machine-checkable formal representation.

Consider, for example, this inefficient C++ implementation of selection sort.

`int leastElementIndex(vector<int> &vec)`

{

int leastElement = vec[0];

int leastIndex = 0;

for(int ii=1; ii<vec.size(); ii++) {

if(vec[ii] < leastElement) {

leastElement = vec[ii];

leastIndex = ii;

}

}

return leastIndex;

}

```
```vector<int> sort(vector<int> input)

{

vector<int> sortedItems;

vector<int> itemsLeft = input;

` for(int ii=0; ii<input.size(); ii++) {`

int leastIndex = leastElementIndex(itemsLeft);

int leastValue = itemsLeft[leastIndex];

sortedItems.push_back(leastValue);

itemsLeft.erase(itemsLeft.begin()+leastIndex);

}

return sortedItems;

}

Our initial, informal proof goes like this. Upon each entry into the loop, `sortedItems` is sorted, every element of `sortedItems` is less than the smallest element in `itemsLeft`, and each item in `input` corresponds one-to-one with an item either in `sortedItems` or in `itemsLeft`. These are all true on the first entry into the loop because `sortedItems` is empty and `itemsLeft` is a copy of the input. If this was true at the start of a loop iteration, then it is true at the end of a loop iteration: `sortedItems` is still sorted because the item added to the end is greater than or equal to all of the items that were previously there; the one-to-one correspondence is preserved because the item that was removed from `itemsLeft` was added to `sortedItems`; and all items in `sortedItems` are all still less than or equal to the smallest item in `itemsLeft` because the item that was moved was the smallest item. The loop terminates after `input.size()` steps, at which point every element has been moved from `itemsLeft` to `sortedItems` so that’s the sorted list and we can return it.

This proof is incorrect and the function is not correct either. Can you spot the bug?

This function is incorrect for inputs with size >= 2^31 because input.size() is larger than the largest possible int. When you compile this code, the compiler will give you a warning. This is an overflow problem: when we encounter numbers that don’t fit in a 32-bit int, the things we know about basic arithmetic go out the window. This class of error is pervasive in real programs. An experiment found that most programmers couldn’t write a correct binary search, and this was a major reason. In most cases, the compiler will not catch these errors; our election sort was a fortunate exception.

## Unfortunate Primitives

The first obstacle to formal proof is that software is usually built out of primitives that are mathematically inconvenient, creating strange caveats and impedance mismatches. We want integers, but instead we get integers mod 2^32, so n+1 isn’t necessarily bigger than n. This is why we have one less Ariane 5 rocket. We want real numbers, but instead we get floating-point numbers, and suddenly addition isn’t an associative operation. This famously caused the Patriot missile defense system to fail. We either can’t prove that values won’t suddenly change in between accesses, or we have to invest significant effort into proving they won’t, because of mutability. Some languages force us to reason about whether objects have any references left, or else we’ll have memory leaks, or worse, try to use an object after it’s been freed (causing a use-after-free security vulnerability).

Some languages do a better job at avoiding these problems than others. Languages are commonly classified as functional or imperative; the “functional” side is generally seen as doing better at avoiding these problems. However, this distinction is *very* fuzzy, languages that have traditionally been thought of as imperative and object-oriented have tended to adopt traits that used to identify functional languages, such as lightweight function syntax, garbage collection and immutability.

## Static Analysis

Inside every major compiler, there is an inference and proof engine. No, not the type system-the other one. Actually, there’s two proof systems. Type-checking is a sort of proof system, thanks to the Curry-Howard isomorphism; it proves that we’ll never try to do particular silly things like divide a number by a string. But the *main* inference engine is invisible. Its purpose is to find and prove statements of the form “program P is equivalent to program P’, which is faster”. In most cases, these are independent of programming language and shared between multiple programming languages, and independent of processor architecture and shared between processor architectures.

These proof engines are less ambitious than formal verification in general, but all formal program manipulation starts roughly the same way: by translating the program to a representation that’s easier to work with and easier to prove things about. (Actually, compilers tend to translate programs through a *series* of representations which highlight different aspects of it, performing different optimizations in each). Because the optimizers in a compiler are entirely automated, and the correctness of their output is important, they don’t get to gloss over any details; they have to thoroughly check their assumptions, including assumptions that would normally be swept under the rug.

When we look at them, we find that many of the techniques are about figuring out and proving things that make good lemmas for a larger proof, such as:

- Which variable usages correspond to which assignments (SSA form)
- Which variables/memory locations might or might not change in between accesses (escape analysis)
- Which pointers/references definitely point to the same thing, or definitely don’t point to the same thing (alias analysis)
- Upper and lower bounds on variables (value range analysis)

Several of these steps serve to eliminate the differences between functional and imperative programming languages. In particular, the conversion to static single assignment form eliminates mutability in local variables, and the escape analysis reduces (but doesn’t always eliminate) need to worry about mutability in heap variables.

## The Dual Definitions Problem

When we wrote a sort function and formally defined what properties we wanted it to have, I phrased this in terms of writing two definitions from two different angles. In practice, this can be the hardest part. There are two ways this can go wrong. First, we might not be able to write a second definition. It was hard enough figuring out what we wanted the first time; if the second specification has to be *sufficiently different*, we might not be able to do that at all.

This is especially hard if we have components that aren’t well modularized or well factored. Solving it is mainly a matter of good software architecture: identifying the problems your software solves and carving it at the joints, choosing the *right* joints and carving it cleanly. This is a very hard problem. While some speak of programmers who don’t use formal proof being *lazy*, this problem is not one that’s usually solved by mere effort; it requires time, deep understanding, and cleverness.

The second problem is that we could write down a second definition, with the exact same error or the exact same blind spot. This is particularly likely for corner cases, like empty lists and extreme values.

Be careful about using the following code — I’ve only proven that it works, I haven’t tested it.

— Donald Knuth

The solution to this problem is *unit testing*. A unit test is a written example with an expected result. This makes fundamental confusions less likely. It’s also a way of verifying corner cases that would much up a proof. There’s only one empty list, for example, so a unit test on the empty list means not needing to worry about it in a proof. There’s an important caveat, however, which is that a unit test does not necessarily represent a considered belief about what should happen; a developer could cheat, by writing code first, observing what the results are, and then writing unit tests which check for those results, without knowing whether they’re actually *correct*. This is why Test-Driven Development advocates writing unit tests first, before writing code.

## The Complex Interface Problem

CompCert is a formally verified C compiler. It has a proof, verified by Coq, that output assembly has the same semantics as input C programs. Unsure whether it was really for real, I checked its issue tracker. This is unusually good; there are few entries, and almost all of them are new-feature requests. But this one is a straight-up bug, and a fairly serious one. What happened?

What happened was that at the boundary where CompCert-generated code interacts with other compilers’ code in a library, it formats some argument lists differently, causing values to be misinterpreted. This is outside the realm of CompCert’s proof. Applying formal proof to this problem wouldn’t help, because the same problem would show up in the specification used by the proof.

In the case of CompCert, the problem was detected by a modest amount of testing and then fixed. In other cases, however, the interfaces are much larger, and more full of special cases. Proving the correctness of a PDF renderer or a web browser, for example, would be a hopeless endeavor, because what correctness means is defined by others’ expectations about what happens at an interface, and that interface has not been formally specified anywhere.

## Conclusion

Formal verification of software is hard, but there is more than one angle to attack the problem from. The first, traditional angle is: write better automated proof-searchers and proof-checkers. But the best proof-checker in the world can’t help if you don’t know what you’re proving, or you’re proving your software does something you don’t want, or you’re building on architectural quicksand. The second angle is purely technical: we need to take the traps out of the core abstractions we’re building on. In 2015, there is no excuse for programming languages to silently wrap integers on overflow, and yet most of them do. And the third angle is: we need to get better at software architecture. This requires improving our cognitive strategies for thinking about it, and looks somewhat like rationality research.