Every now and then you stumble across some phenomenon of nature that is so surprising that it almost seems like magic.
Imagine that you could conduct a code review that not only told you how many bugs you have found, but also told you how many bugs you haven’t found. What? How can you know how much of something you didn’t find? This is the magic of the capture-recapture method.
The theory
In population biology, when you want to measure the size of an animal population in the wild, you don’t just go out and count every single animal in the environment. That would be impossible in most circumstances. Instead, you do something that I’m sure you’ll recognize from television nature documentaries [1]. You set traps and capture a group of animals from that environment. After you have captured those animals, you tag them. Then you release them back into the wild, and wait for enough time to pass for the population to randomly redistribute. Then you set traps again and capture another group of animals from that same environment.
If many of the animals you capture the second time are tagged, then you have probably captured a large portion of the total population.
If few of the animals in the second capture are tagged, then the total population is probably much larger than the size of your capture.
The trick to making the population estimate work is the statistical independence of the samples. In the wild, this is done with the spatial arrangement of traps and the time difference between captures. Amazingly enough, you can apply the same reasoning to the population of design defects in a product specification. In a quality inspection, statistical independence can be achieved by assigning multiple reviewers to independently analyze the design artifact (e.g. source code). Each reviewer separately inspects the artifact for defects and submits a marked up copy.
If each reviewer has discovered many of the same defects, then most of the defects have been found.
If each reviewer identifies a mostly unique set of defects, then the total number of defects, and hence the number of undiscovered defects, is large.
At the very least, a high number of latent defects should trigger a reinspection. Such a finding might even be cause to reject the artifact and perform root cause analysis to find the upstream source of poor quality.
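The overlap intuition can be checked with a small simulation. This is a minimal sketch, not part of the original method: it assumes every defect is equally easy to find and that both reviewers detect each defect independently with the same probability. The function names, the defect count, and the probabilities are all hypothetical.

```python
import random


def simulate_review(total_defects, detect_prob, seed=0):
    """Simulate two reviewers inspecting the same artifact.

    Assumption: each reviewer finds each defect independently with
    probability `detect_prob` (a simplification of real reviews).
    """
    rng = random.Random(seed)
    defects = range(total_defects)
    found_a = {d for d in defects if rng.random() < detect_prob}
    found_b = {d for d in defects if rng.random() < detect_prob}
    return found_a, found_b


def overlap_report(found_a, found_b, total_defects):
    """Summarize how much the two reviewers' findings overlap."""
    common = found_a & found_b   # defects both reviewers found
    found = found_a | found_b    # distinct defects found by either
    return {
        "found": len(found),
        "common": len(common),
        "missed": total_defects - len(found),
        "overlap_ratio": len(common) / len(found) if found else 0.0,
    }


# Compare thorough reviewers with hurried ones on 50 seeded defects.
for p in (0.8, 0.2):
    a, b = simulate_review(total_defects=50, detect_prob=p, seed=1)
    print(p, overlap_report(a, b, 50))
```

Under these assumptions, a high detection probability should produce a high overlap ratio and few missed defects, while a low detection probability should produce mostly unique findings and many missed defects.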
Almost as surprising as the method itself is the simplicity of the formula for estimating the total population:
N = (n1 × n2) / m
where:
N = Estimate of total defect population size
n1 = Total number of defects discovered by first reviewer
n2 = Total number of defects discovered by second reviewer
m = Number of common defects discovered by both reviewers
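The estimate follows from a simple proportion. As a sketch, assuming both reviewers sample the same defect population independently, the share of the first reviewer's defects that the second reviewer also found should roughly equal the second reviewer's share of all defects:

```latex
\frac{m}{n_1} \approx \frac{n_2}{N}
\quad\Longrightarrow\quad
N \approx \frac{n_1 \, n_2}{m}
```

With illustrative numbers n1 = 9, n2 = 7, and m = 3, the estimate is 63 / 3 = 21 defects in total. The two reviewers found 9 + 7 − 3 = 13 distinct defects, so roughly 8 would remain undiscovered.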
A more accurate, and slightly more complicated, formula is:
N = ((n1 + 1) × (n2 + 1)) / (m + 1) − 1
I have a warm place in my heart for these little probability tricks.
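Both estimates are easy to compute. The sketch below uses the variables defined above. The first function is the simple ratio estimate, commonly called the Lincoln-Petersen estimator. The second assumes that the "more accurate" formula is the Chapman bias correction, which the text does not name. The function names and sample counts are hypothetical.

```python
def lincoln_petersen(n1, n2, m):
    """Simple estimate of the total defect population: n1 * n2 / m."""
    if m == 0:
        # With no common defects the ratio is undefined; the data
        # cannot put an upper bound on the population.
        raise ValueError("no common defects: the simple estimate is undefined")
    return n1 * n2 / m


def chapman(n1, n2, m):
    """Bias-corrected estimate (assumed here to be the 'more accurate' formula).

    Unlike the simple estimate, it stays defined when m == 0.
    """
    return (n1 + 1) * (n2 + 1) / (m + 1) - 1


# Illustrative counts: reviewer A found 9 defects, reviewer B found 7,
# and 3 defects were found by both.
n1, n2, m = 9, 7, 3
print(lincoln_petersen(n1, n2, m))  # 21.0
print(chapman(n1, n2, m))           # 19.0
```

The two estimates differ noticeably at small counts like these, which is where the correction matters most.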
The practice
The procedure for conducting this type of review is simple. The basic version requires four roles:
- the reviewee
- reviewer A
- reviewer B
- review moderator
The size of the code to be reviewed is important. Approximately 200 lines of code will result in the highest inspection yield, or defects found per line. Review effectiveness starts to decline as the review target increases beyond 200 lines. Fortunately for us, our lean, feature-oriented work definition should produce artifacts that are suitable for such a review.
Prior to the review meeting, the code to be reviewed should be identified and sent to the two reviewers. The reviewers inspect the documents to their satisfaction, or better yet, time-box their inspections. It is essential that the reviewers do not interact with one another prior to the review meeting. Capture-recapture depends upon the statistical independence of the samples. The reviewers should each inspect the code according to a defined set of criteria. The review criteria should include the requirements, the design, and code standards. The emphasis of their inspections should be on finding defects rather than nitpicking style issues. Defects might be true coding mistakes like an off-by-one loop index or an operator precedence error. They might be low-level design mistakes like a missing argument validation or an incomplete rollback after an early function return. Defects might also be high-level design or even requirements errors that were not discovered previously.
The reviewers should bring their marked-up documents to the review meeting. Each reviewer will enumerate his findings, and the group will validate each defect. Legitimate defects are marked for logging. Any discussion about corrections should be deferred to another time. As the defects are enumerated by the first reviewer, the second reviewer will call out any defects that are common. The tally of common defects is recorded by the moderator.
After the review, the totals of each reviewer’s approved defects and the total of common approved defects are applied to the formula. The estimated number of undiscovered defects is the estimated total minus the number of distinct defects found, n1 + n2 − m. If that estimate is large, then the inspection should be rerun, with different reviewers if possible. If reinspection produces a similar result, then the code change should be rejected and a root cause analysis should be performed to determine the source of the defects and propose a process improvement.
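The moderator's post-meeting tally can be sketched in a few lines. This is a hypothetical helper, not a prescribed tool: it assumes each approved defect has been given an identifier, uses the simple n1 × n2 / m estimate, and treats `max_latent` as an arbitrary threshold that each team would have to choose for itself.

```python
def review_summary(defects_a, defects_b, max_latent=3):
    """Tally two reviewers' approved defects and flag a reinspection.

    `max_latent` is a hypothetical threshold for how many estimated
    undiscovered defects the team is willing to tolerate.
    """
    a, b = set(defects_a), set(defects_b)
    n1, n2, m = len(a), len(b), len(a & b)
    distinct_found = n1 + n2 - m
    summary = {"n1": n1, "n2": n2, "m": m, "found": distinct_found}
    if m == 0:
        # No common defects: the estimate is undefined, so be
        # conservative and ask for another inspection.
        summary.update(estimated_total=None, estimated_latent=None,
                       reinspect=True)
        return summary
    estimated_total = n1 * n2 / m
    estimated_latent = estimated_total - distinct_found
    summary.update(estimated_total=estimated_total,
                   estimated_latent=estimated_latent,
                   reinspect=estimated_latent > max_latent)
    return summary


# Heavy overlap between the two reviewers: few defects likely remain.
print(review_summary(["D1", "D2", "D3", "D4", "D5", "D6"],
                     ["D2", "D3", "D5", "D7"]))

# Little overlap: many defects likely remain, so reinspect.
print(review_summary(["D1", "D2", "D3", "D4", "D5"],
                     ["D5", "D6", "D7", "D8"]))
```

In the first call the estimate is 8 defects with 7 already found, so no reinspection is flagged. In the second the estimate is 20 defects with only 8 found, which exceeds the assumed threshold.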
There really is only one legitimate purpose for a code review: to find defects. Code reviews are expensive and bureaucratic. They are only worth doing if they are done very well, and most code reviews are not done well at all. There is a small set of techniques that are genuinely effective, and fortunately, some of them can be combined. Capture-recapture probably gives the best value of all of them.
1. How Many Fish are in the Pond?



Noah Iliinsky | 05-Jun-07 at 3:24 pm
Interesting. It’s sort of an inversion of the logic for calculating how many combinations of n items can be selected from a field of N.
As n approaches N, the number of combinations decreases, and the contents of n1 and n2 must necessarily become similar.
The proof is left as an exercise to the reader.
Joe Arnold | 05-Jun-07 at 6:00 pm
This is totally awesome.
It might be something that could be built into bug tracking tools (if you use one). Most tools have a ‘Merge Duplicate’ feature. Perhaps a ‘Hey, I found it too’ feature could be used to capture data. Although that sounds like it would pollute the statistical independence.
Also, I disagree with the sole reason for code reviews (which is to find defects). I’ve also seen it used (in a very cheap and un-bureaucratic way) to share knowledge of system changes.
-Joe Arnold
J.D. Meier's Blog | 08-Jun-07 at 11:16 am
Get Lean, Eliminate Waste…
If you want to tune your software engineering, take a look at Lean. Lean is a great discipline with…
Michael | 07-Mar-08 at 7:05 pm
Very elegant idea, but what about the “visibility” of bugs? It seems to me that some bugs are very easy to catch and some only manifest themselves in very specialized situations. This ordering of the bugs in terms of their visibility or ease of detection could lead to a false assumption that two independent testers have found more of the bugs than they have.
Do you have any experience with or data about this phenomenon? Also, have you checked the results of this heuristic against a more in-depth inspection to see what its success rate is?
Corey Ladas | 07-Mar-08 at 10:49 pm
Hi Michael,
Capture-recapture is a practice associated with the SEI Team Software Process. Mature TSP teams have been documented to produce code with quality on a scale of 10 defects per million lines of code. Different practices tend to discover different types of defects, so TSP/PSP employs multiple, complementary practices, with feedback and statistical analysis to systematically eliminate the root causes of defect insertion. Inspections tend to find some kinds of things, static analysis tends to find other kinds of things, etc. But then there also can be correlations between the results of different practices, and that’s where things get really interesting.
Then your goal is limited to the more manageable problem of:
Make your inspections good at finding the things that inspections are good at finding. And make them efficient.
One of the things we’re trying to do is to simplify some of the methods of TSP/PSP for the Agile audience, and provide a gentler learning curve. Of course, we think that Lean provides some of the right tools for that, and deep down, we know that TSP and Lean have Deming and SPC in common.
http://www.ddj.com/architect/184415470
Paul W. Homer | 09-Mar-08 at 8:30 pm
Way back, someone once told me that in C, for example, there was an expectation of one bug for every 200 lines of code (for normal or better coders). I’ve always kept track of the number of ‘new’ and changed lines going into any implementation. That frequently leaves me in testing with some ‘expectation’ for the number of bugs we are looking for. I’ll go up or down depending on the coders and the complexity of the changes. If the results in my head are way over or way under, I’ll generally start looking around for reasons.
Paul.
Michael Chermside | 10-Mar-08 at 5:55 am
A really brilliant idea! I think this is an excellent approach for measuring unseen bug counts. The only method I knew of previously was based on the assumption that bugs found per day of QA in a given piece of code followed an exponential decay. This method had the property (good or bad… your call) that it *always* predicted there were more bugs to be found.
But I want to vigorously disagree with your final paragraph. Your thesis statement: “There really is only one legitimate purpose for a code review: to find defects.” is simply false. Finding defects is ONE benefit of performing code reviews, but it is one of the more minor benefits.
Bigger benefits include: (1) increased communication among team members: they actually talk about the code to each other, forcing people to describe and defend their commonly used idioms and practices, (2) mentoring opportunities, (3) increasing reuse due to awareness of different parts of the code: if you’ve never seen a piece of code, how likely are you to reuse it? Conversely, if you’ve code reviewed something you’ll likely remember it the next time you need that subroutine, (4) no piece of code ever languishes unmaintainable because only a single developer understands how it works. There are other benefits (mostly centered around communication and understanding), but this gives you the flavor.
I DO agree that beyond a certain point code reviews are no longer improving communication and they can become “expensive and bureaucratic”. The solution is the same one that fixes “carpenter’s thumb”: if it hurts (isn’t benefiting the code or the team), then *stop doing it*!
Joe Grossberg | 10-Mar-08 at 1:32 pm
It’s an interesting metric, but I can think of two problems:
* the visibility issue mentioned above — people might always catch the most obvious bugs, in the same way that trappers will catch the animals most likely to get trapped (gullible, less cautious, hungrier for the bait, etc.)
* as you say, “The trick to making the population estimate work is the statistical independence of the samples.” If two programmers have very similar approaches to smoking out bugs, the results will be akin to two biologists who tend to place traps in similar areas.
Corey Ladas | 11-Mar-08 at 10:04 am
@Michael,
Sometimes terminology can help clarify a point. The practice I described here as a review would be more accurately described as an inspection, the purpose of which is defect removal. The benefits you describe are usually conferred by a “walkthrough”. Unfortunately, both these types of meetings get lumped together under the ambiguous category of “review”. My workflows usually contain at least one walkthrough, in addition to an inspection. Efficient walkthroughs might be a good topic for another article!
Bernie Thompson | 11-Mar-08 at 12:37 pm
A sample spreadsheet template for this method is available from us at http://leansoftwareengineering.....nspection/