Every now and then you stumble across some phenomenon of nature that is so surprising that it almost seems like magic.
Imagine that you could conduct a code review that not only told you how many bugs you had found, but also how many you hadn’t. What? How can you know how much of something you didn’t find? This is the magic of the capture-recapture method.
In population biology, when you want to measure the size of an animal population in the wild, you don’t just go out and count every single animal in the environment. That would be impossible in most circumstances. Instead you do something that I’m sure you’ll recognize from television nature documentaries1. You set traps and capture a group of animals from that environment. After you have captured those animals, you tag them. Then you release them back into the wild, and wait for enough time to pass for the population to randomly redistribute. Then you set traps again and capture another group of animals from that same environment.
If many of the animals you capture the second time are tagged, then you have probably captured a large portion of the total population:
If few of the animals in the second capture are tagged, then the total population is probably much larger than the size of your capture:
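The arithmetic behind this intuition is easy to check with a tiny simulation. In the sketch below, the population of 1,000 animals and the capture sizes of 100 are made-up numbers for illustration; in the field you would not know the true population size, which is the whole point:

```python
import random

random.seed(42)

population = range(1000)                        # 1,000 animals (normally unknown)
tagged = set(random.sample(population, 100))    # first capture: tag 100 animals
recapture = random.sample(population, 100)      # second, independent capture of 100

overlap = sum(1 for animal in recapture if animal in tagged)
estimate = 100 * 100 / overlap                  # capture-recapture estimate

print(f"tagged animals in recapture: {overlap}")
print(f"estimated population: {estimate:.0f}")  # should land near the true 1,000
```

Run it a few times with different seeds: the estimate hovers around the true population, and the more the two captures overlap, the smaller the estimate becomes.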
The trick to making the population estimate work is the statistical independence of the samples. In the wild, this is done with the spatial arrangement of traps and the time difference between captures. Amazingly enough, you can apply the same reasoning to the population of design defects in a product specification. In a quality inspection, statistical independence can be achieved by assigning multiple reviewers to independently analyze the design artifact (e.g. source code). Each reviewer separately inspects the artifact for defects and submits a marked-up copy.
If each reviewer has discovered many of the same defects, then most of the defects have been found:
If each reviewer identifies a mostly unique set of defects, then the total number of defects, and hence the number of undiscovered defects, is large:
At the very least, a high number of latent defects should trigger a reinspection. Such a finding might even be cause to reject the artifact and perform root cause analysis to find the upstream source of poor quality.
Almost as surprising as the method itself is the simplicity of the formula for estimating the total population:

N = (n1 × n2) / m

where:
N = Estimate of total defect population size
n1 = Total number of defects discovered by first reviewer
n2 = Total number of defects discovered by second reviewer
m = Number of common defects discovered by both reviewers
A more accurate, and slightly more complicated, formula is the Chapman estimator:

N = (n1 + 1)(n2 + 1) / (m + 1) − 1
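Both estimators are one-liners in any language. Here is a minimal sketch in Python; the tallies in the usage example are made up for illustration:

```python
def lincoln_petersen(n1, n2, m):
    """Simple capture-recapture estimate of the total defect population.

    n1, n2: defects found by the first and second reviewer; m: defects found by both.
    """
    if m == 0:
        raise ValueError("no common defects: the population cannot be estimated")
    return n1 * n2 / m


def chapman(n1, n2, m):
    """Less biased variant; also works when the reviewers share no defects."""
    return (n1 + 1) * (n2 + 1) / (m + 1) - 1


# Example: reviewer A finds 20 defects, B finds 15, and 10 are common.
total = lincoln_petersen(20, 15, 10)    # 30.0
found = 20 + 15 - 10                    # 25 distinct defects actually found
print(f"estimated total: {total}, estimated undiscovered: {total - found}")
```

Note the degenerate case: when the reviewers find no defects in common, the simple formula divides by zero, which is one practical reason to prefer the second estimator.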
I have a warm place in my heart for these little probability tricks.
The procedure for conducting this type of review is simple. The basic version requires four roles:
- the reviewee
- reviewer A
- reviewer B
- review moderator
The size of the code to be reviewed is important. Approximately 200 lines of code will result in the highest inspection yield, or defects found per line. Review effectiveness starts to decline as the review target increases beyond 200 lines. Fortunately for us, our lean, feature-oriented work definition should produce artifacts that are suitable for such a review.
Prior to the review meeting, the code to be reviewed should be identified and sent to the two reviewers. The reviewers inspect the documents to their satisfaction, or better yet, time-box their inspections. It is essential that the reviewers do not interact with one another prior to the review meeting. Capture-recapture depends upon the statistical independence of the samples. The reviewers should each inspect the code according to a defined set of criteria. The review criteria should include the requirements, the design, and code standards. The emphasis of their inspections should be on finding defects rather than nitpicking style issues. Defects might be true coding mistakes like an off-by-one loop index or an operator precedence error. They might be low-level design mistakes like a missing argument validation or an incomplete rollback after an early function return. Defects might also be high-level design or even requirements errors that were not discovered previously.
The reviewers should bring their marked-up documents to the review meeting. Each inspector will enumerate their findings, and the group will validate each defect. Legitimate defects are marked for logging. Any discussion about corrections should be deferred to another time. As the defects are enumerated by the first inspector, the second inspector will call out any defects that are common. The tally of common defects is recorded by the facilitator.
After the review, the totals of each inspector’s approved defects and the total of common approved defects are applied to the formula. If the estimated population of undiscovered defects is large, then the inspection should be rerun, with different inspectors if possible. If reinspection produces a similar result, then the code change should be rejected and a root cause analysis should be performed to determine the source of the defects and propose a process improvement.
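The accept-or-reinspect decision described above can be sketched as a small helper. The threshold of two estimated undiscovered defects is an illustrative assumption, not a value prescribed by the method:

```python
def review_outcome(n1, n2, m, max_undiscovered=2):
    """Apply the capture-recapture estimate to a review's approved defect tallies.

    Returns "accept" when the estimated number of undiscovered defects is small,
    otherwise "reinspect". The max_undiscovered threshold is an assumption
    chosen for illustration.
    """
    estimated_total = (n1 + 1) * (n2 + 1) / (m + 1) - 1   # Chapman estimator
    found = n1 + n2 - m                                   # distinct defects found
    undiscovered = max(0.0, estimated_total - found)
    return "accept" if undiscovered <= max_undiscovered else "reinspect"


print(review_outcome(12, 11, 9))   # heavy overlap: most defects were found
print(review_outcome(12, 11, 2))   # little overlap: many defects likely remain
```

With heavy overlap (9 of roughly a dozen defects in common) the estimate says almost everything was found; with only 2 in common, the same tallies imply dozens of latent defects and a reinspection.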
There really is only one legitimate purpose for a code review: to find defects. Code reviews are expensive and bureaucratic. They are only worth doing if they are done very well, and most code reviews are not done well at all. There is a small set of techniques that are genuinely effective, and fortunately, some of them can be combined. Capture-recapture probably gives the best value of all of them.
1. How Many Fish are in the Pond?