How do you handle bugs and rework in a software development kanban system?
One way or another, buggy work-in-process is still in process and counts against the total WIP limit. The only question is which part of the workflow gets stuck with a kanban token for rework.
Bug scenario

Here we have a two-stage work package decomposition. Green tickets represent a “requirement” work item. These could be Use Cases, Functional Requirements, or something similar. What matters is that they represent customer value and that they seem like a “thing” to analysts, UI designers, testers, and the like.
These are decomposed into smaller work items for developers. Each yellow ticket represents a “feature,” in the spirit of FDD. Each feature will be designed, reviewed, coded, tested, reviewed, and integrated into a development branch. When all of a requirement’s features are complete, they will be rolled back up for review between analysis, testing, and development, merged into the test branch, and approved for functional verification and validation.
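The roll-up rule above can be sketched in a few lines of Python. This is only an illustration: the class names, the `ready_for_test` method, and the feature names F1 and F2 are assumptions made for the example, not part of any real tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Feature:
    """A yellow ticket: one developer-sized work item (hypothetical model)."""
    name: str
    done: bool = False


@dataclass
class Requirement:
    """A green ticket: a customer-valued item decomposed into features."""
    name: str
    features: List[Feature] = field(default_factory=list)

    def ready_for_test(self) -> bool:
        # A requirement rolls up only when it has features and all are complete.
        return bool(self.features) and all(f.done for f in self.features)


r5 = Requirement("R5", [Feature("F1"), Feature("F2")])
r5.features[0].done = True
print(r5.ready_for_test())  # False: F2 is still in development
r5.features[1].done = True
print(r5.ready_for_test())  # True: all features complete, roll up for review
```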
We should expect that bugs will be found from time to time, though we hope that this happens with decreasing frequency as a team matures. In this example, we have two requirements in testing, and they have found three bugs so far (blue tickets).
The question is: what do we do with these bugs, now that we have found them? Let’s consider two options: 1) Reinsert defective WIP into an upstream station, 2) Assign bugs to a shared rework station. Each case has pros and cons, which depend on circumstances, like process maturity.
Option 1: reinsert defective WIP upstream

In this scenario, we’ve taken the kanban for work item R5 out of the Test station and placed it back in Development, where it’s treated like any other requirement work item. When it’s done, it will be placed back in the Resolved:Ready queue and retested. It will be subject to all of the usual limits and rules along the way. The kanban is charged to Development, and Test is free to pick up the next thing that appears in their Ready queue.
This is the softer approach. It’s less disruptive and it treats bugs with less urgency. A downside of treating rework like a regular work item is that if Development capacity is full, then the buggy work item will have to be placed in a queue to wait its turn.
The attitude here is that bugs are an expected common cause variation.
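A minimal sketch of the reinsertion model, assuming each station is just a WIP list with a limit and a waiting queue. The `Station` class, the `reinsert_upstream` helper, the limits of 2, and the items R6 and R7 are all hypothetical, chosen only to show what happens when Development is already full.

```python
class Station:
    """A workflow station with a WIP limit (hypothetical model)."""

    def __init__(self, name, limit):
        self.name = name
        self.limit = limit
        self.wip = []    # kanbans currently charged to this station
        self.queue = []  # items waiting for a free slot

    def offer(self, item):
        """Start the item if a slot is free; otherwise make it wait its turn."""
        if len(self.wip) < self.limit:
            self.wip.append(item)
            return True
        self.queue.append(item)
        return False

    def release(self, item):
        """Free the item's slot and pull the next waiting item, if any."""
        self.wip.remove(item)
        if self.queue:
            self.wip.append(self.queue.pop(0))


def reinsert_upstream(item, downstream, upstream):
    """Option 1: move a defective item's kanban back to the upstream station."""
    downstream.release(item)
    return upstream.offer(item)


development = Station("Development", limit=2)
test = Station("Test", limit=2)
development.offer("R6")
development.offer("R7")
test.offer("R4")
test.offer("R5")

started = reinsert_upstream("R5", test, development)
print(started)            # False: Development is full, so R5 waits
print(development.queue)  # ['R5']
print(test.wip)           # ['R4']: Test has a free slot for new work

development.release("R6")
print(development.wip)    # ['R7', 'R5']: R5 starts once a slot opens
```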
Option 2: shared rework station

In this scenario, we’re leaving the kanbans for work items R4 and R5 in test and moving the bug kanbans to a special rework station. Test may continue to work on R4 and R5 while the rework is done, but since they are at capacity, they can’t pull in any new work. That means that the bugs have to be treated like an expedited request. Otherwise, the system will stall until they are resolved.
This is the harder approach. It treats the bugs like a process failure that must be attended to immediately. The kanban is still allocated to the Test station, and the Rework station does not count against the Development limit. Since the Rework station is dedicated, there’s no waiting for a slot to open up in Development. Even so, development capacity is reduced, because people have to give priority to the bugs before the regularly scheduled work items can move again.
The attitude here is that bugs are a special cause variation and call for corrective action. This might be the right configuration if a team or project is new, or the team is having an acute quality problem. Once they get the problem under control, they can relax to the reinsertion model.
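The shared rework station can be sketched the same way. In this hypothetical model the requirement kanbans stay charged to Test while their bugs sit in a separate rework area, so Test cannot pull new work until the bugs are fixed and a requirement passes. The `Board` class, the bug names B1 to B3, and the assignment of bugs to requirements are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Board:
    """Option 2: Test keeps its kanbans; bugs go to a shared rework station."""
    test_limit: int
    test_wip: List[str] = field(default_factory=list)
    rework: Dict[str, List[str]] = field(default_factory=dict)  # requirement -> open bugs

    def can_pull_into_test(self) -> bool:
        return len(self.test_wip) < self.test_limit

    def report_bug(self, requirement: str, bug: str) -> None:
        # The requirement's kanban is NOT moved; only the bug ticket is added.
        self.rework.setdefault(requirement, []).append(bug)

    def fix_bug(self, requirement: str, bug: str) -> None:
        self.rework[requirement].remove(bug)
        if not self.rework[requirement]:
            del self.rework[requirement]

    def pass_test(self, requirement: str) -> None:
        if requirement in self.rework:
            raise ValueError(f"{requirement} still has open bugs")
        self.test_wip.remove(requirement)


board = Board(test_limit=2, test_wip=["R4", "R5"])
board.report_bug("R4", "B1")
board.report_bug("R5", "B2")
board.report_bug("R5", "B3")
print(board.can_pull_into_test())  # False: the line is stalled on rework

board.fix_bug("R5", "B2")
board.fix_bug("R5", "B3")
board.pass_test("R5")
print(board.can_pull_into_test())  # True: fixing the bugs frees a Test slot
```

Because nothing else can enter Test while the bugs are open, the model makes the expedite pressure visible: the only way to restore flow is to clear the rework station.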



Agile Management by David Anderson | 27-Nov-07 at 2:59 pm | Permalink
Bugs and Kanban…
Corey Ladas takes a look at two ways to treat bugs in a kanban system. The second option is the more…
ronpih's weblog | 28-Nov-07 at 12:38 pm | Permalink
Lean Software Engineering Blog…
I need to add this to my blogroll… http://leansoftwareengineering.com/...
Carl Joseph | 28-Nov-07 at 10:57 pm | Permalink
This post got me thinking about using a kanban system to manage various “projects” at the same time.
Our team would be working on a number of different projects/products at any given time and managing the throughput of them as a whole is sometimes difficult (due to lack of transparency).
Have you used kanban in an environment similar to that before?
Corey Ladas | 29-Nov-07 at 12:08 am | Permalink
I can think of a few ways to deal with a situation like that. David’s sustaining engineering process was designed for this sort of thing, where kanbans are allocated to each project in proportion to a percentage of total engineering capacity of the organization. Each project gets its own board, and there’s a “central bank” that controls both the total supply of kanbans in the system and their allocation to projects. We’re still doing that today.
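One way the “central bank” allocation might be computed is a largest-remainder split of a fixed kanban supply. This is a sketch under assumptions: the function name, the project names, and the percentages are hypothetical, and the original process may well allocate differently.

```python
def allocate_kanbans(total, shares):
    """Split `total` kanbans across projects in proportion to `shares`.

    Uses the largest-remainder method so the allocations are whole numbers
    that always add up to `total`.
    """
    weight = sum(shares.values())
    exact = {p: total * s / weight for p, s in shares.items()}
    alloc = {p: int(q) for p, q in exact.items()}  # round every share down
    leftover = total - sum(alloc.values())
    # Hand the remaining kanbans to the projects with the largest fractions.
    by_remainder = sorted(shares, key=lambda p: exact[p] - alloc[p], reverse=True)
    for p in by_remainder[:leftover]:
        alloc[p] += 1
    return alloc


# Hypothetical capacity percentages for three projects.
print(allocate_kanbans(10, {"A": 50, "B": 30, "C": 20}))  # {'A': 5, 'B': 3, 'C': 2}
print(allocate_kanbans(7, {"A": 50, "B": 30, "C": 20}))   # {'A': 4, 'B': 2, 'C': 1}
```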
A related approach is to manage multiple projects on the same board, with hard partitions between them. Everything is visible to everybody, but each team is only responsible for its own cell/partition. A departmental manager might be responsible for distributing resources across the projects, and those choices would be visible to all on the board. There’s a team here that does it this way, too.
Another approach would be a “heijunka” scheduling rule that mixes work items from each of the projects through a common workflow and with some shared resources. This is a good way to make the most out of specialist or constrained resources, and allows you to respond gracefully to changes in business priority. Say you have 3 project backlogs, each with their own queue limits. Each project’s stakeholder is responsible for the content and prioritization of his/her own queue. The production manager dequeues items from each backlog according to an algorithm and places them in a single engineering ready queue. This is my favorite, but it’s been harder to get people to understand this one.
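As a rough sketch of such a scheduling rule, the simplest possible algorithm is a round-robin that takes one item per project in turn until the engineering ready queue is full. The round-robin choice, the function name, and the backlog contents are assumptions; a real rule could weight projects by priority or capacity.

```python
from collections import deque
from itertools import cycle


def level_schedule(backlogs, ready_limit):
    """Fill a single ready queue by taking one item per project in turn."""
    queues = {name: deque(items) for name, items in backlogs.items()}
    ready = []
    for name in cycle(list(queues)):
        # Stop when the ready queue is full or every backlog is empty.
        if len(ready) >= ready_limit or not any(queues.values()):
            break
        if queues[name]:
            ready.append((name, queues[name].popleft()))
    return ready


# Three hypothetical project backlogs, each already prioritized by its stakeholder.
backlogs = {"A": ["a1", "a2", "a3"], "B": ["b1"], "C": ["c1", "c2"]}
ready = level_schedule(backlogs, ready_limit=5)
print(ready)
# [('A', 'a1'), ('B', 'b1'), ('C', 'c1'), ('A', 'a2'), ('C', 'c2')]
```

Project B runs out of items after one round, so its turn is skipped and the remaining slots are shared between A and C.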
Project Shrink Links 5-12-2007 | 05-Dec-07 at 12:24 pm | Permalink
[...] Accounting for bugs and rework [...]
Bernie Thompson | 05-Dec-07 at 10:16 pm | Permalink
This is a great post and topic.
David’s post (linked above) advocates Option #2 as the one with ultimately better throughput, partially because David feels option #1 is more tolerant of rework.
Based on my experience, I’d actually bet on the other — that strategy #1 will be better at managing resources in the short term, with appropriate feedback (that is, of course, pain) to create better long-term behavior, and ultimately better throughput.
The problem I fear with #2 is as long as there are special rework stations, side-by-side with the new feature stations, I think a key feedback loop won’t be closed. The result will be that the new feature work can keep chugging along, piling up WIP in front of the test station, while test is piling up bugs in front of the rework station.
Option #1, by contrast, provides less opportunity for WIP to pile up anywhere.
Note that most teams I’ve been on haven’t managed rework WIP well, so either of these options might be a better starting point. I’ve generally seen the typical pattern: features sent to test keep spinning in test, while the dev team fixes the found bugs on an interrupt basis. So at Microsoft, WIP limits were introduced with the concept of “bug caps,” which worked like this: when the count of active bugs hit some limit (usually a certain number of bugs per team, based on the size of the team), all new feature work would halt until the bug backlog was shrunk.
It did actually work quite well. And Option #1 is much like this, but with a more granular bug cap (every defect escape delays a new feature that otherwise would have started). I am a bit worried that making bug fixes non-interrupting will be more painful than management and the organization can stand (and could create a long feedback loop if the bugs, in fact, relate to knowledge that is being created in test). But it definitely is strong motivation to prevent defect escapes!
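The bug cap rule is easy to state as code. This is a minimal sketch: the per-person allowance, the team size, and the function names are hypothetical, since the comment only says the cap was based on team size.

```python
def bug_cap(team_size, bugs_per_person):
    """Team-level cap on active bugs, scaled by team size (assumed formula)."""
    return team_size * bugs_per_person


def may_start_feature(active_bugs, cap):
    """New feature work halts once the active bug count reaches the cap."""
    return active_bugs < cap


cap = bug_cap(team_size=6, bugs_per_person=3)  # hypothetical numbers
print(cap)                         # 18
print(may_start_feature(12, cap))  # True: under the cap, features may start
print(may_start_feature(18, cap))  # False: at the cap, fix bugs first
```

Under this reading, Option 1 behaves like a cap of one bug per kanban slot, since each reinserted item occupies a slot a new feature would otherwise have taken.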
An example where knowledge is created in test: a domain like device drivers, where a big part of the functional requirements is patching over undocumented and subtle OS/hardware differences. In this case, some key knowledge is only discovered when the otherwise “working code” is deployed out to the big matrix of real-world devices and platforms.
All-in-all these options feel like great starting points to explore — but I also suspect that both options might have some pretty severe side-effects that might dominate the benefits for some teams.
It’ll be interesting to see more people comment on what they’ve seen, and what works for them …
software developer | 18-Dec-07 at 9:00 am | Permalink
I like the second way; it seems the most intelligent and professional. I will think about how to apply it in my own work.
David Anderson | 04-Jan-08 at 10:30 pm | Permalink
Some interesting news from the Alchemy team… While they originally opted for an option 1 type approach, there is now an opinion that they need to switch to option 2. The advocacy for the change was led by the test manager and the project managers.
The “stop the line” nature of option 2 makes it less palatable, but the “take the pain early” aspect of option 2 also means that the team focuses on delivery of working code (desirable from a project reporting perspective) and learns to reduce the number of inserted defects in order to reduce the flow-interrupting, expedited nature of bug fixing.
Hence, I think it might be a natural maturity progression that teams will start with option 1 and then change their process to option 2 when they are mature enough to emotionally handle its effect.
David