Queue utilization is a leading indicator

I talk a lot about how to apply Lean ideas to software development. Perhaps I sometimes take it for granted that we understand why we should apply them. Mary Poppendieck has already written quite a bit on that rationale, and I try not to rehash things I think she’s already covered adequately. I do think there are a few characteristic scenarios where Lean principles most clearly apply to software development:

  • Any kind of live network service, whether customer-facing (Google.com, Amazon.com) or machine-facing (Bigtable, SimpleDB)
  • Any kind of sustaining engineering process: bug fixing, security patching, incremental enhancement
  • Evolutionary product design (which is to say, effective product design)

That said, there is a very pragmatic reason to adopt a Lean workflow strategy, regardless of what sort of product you are building: Lean scheduling provides crystal clear leading indicators of process health.

I am speaking of kanban limits and andon lights.

Work in process is a leading indicator


For a stable workflow, lead time is a function of both throughput (how much stuff we complete every day) and work-in-process. For a given rate of throughput (with everybody busy at their jobs), an increase in WIP necessarily means an increase in lead time.

It’s simple cause and effect: an increase in WIP today will mean an increase in the time to deliver that work in the future. As far as leading indicators go, this one’s rock solid. You can’t take on more work than you have the capacity to do without taking longer to deliver it.
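
This is Little’s Law at work: for a stable system, lead time equals WIP divided by throughput. Here is a minimal sketch of the arithmetic, with made-up numbers chosen only to illustrate the cause and effect above:

```python
# Little's Law for a stable workflow: lead_time = WIP / throughput.
# The numbers below are invented, purely for illustration.
throughput = 5                 # work items completed per day, held constant
for wip in (20, 40):
    lead_time = wip / throughput
    print(f"WIP of {wip} items -> lead time of {lead_time:.0f} days")
# WIP of 20 items -> lead time of 4 days
# WIP of 40 items -> lead time of 8 days
```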

A simple management technique is to simplify the problem with policy. If lead time is a function of both throughput and WIP, and you can hold WIP near constant by an act of policy, then you can begin to address the more difficult problem of throughput. WIP is relatively easy to control, because somebody in your business should have the power to approve or deny starting on a new work order. Throttling work orders is a much easier problem than learning how to work faster.

This is effectively the result of a Drum-Buffer-Rope system, or its Lean cousin, a kanban system. Only after you get the simpler variable under control can you begin to make consistent progress on the more difficult one.
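
In code, the policy amounts to nothing more than an admission check before a new work order is started. A minimal sketch, where the class, the names, and the limit are all illustrative rather than anything prescribed by kanban or Drum-Buffer-Rope:

```python
class KanbanBoard:
    """Toy model of a WIP-limited workflow; the limit is the policy knob."""

    def __init__(self, wip_limit):
        self.wip_limit = wip_limit
        self.in_process = []

    def try_start(self, work_order):
        """Admit a new work order only if a kanban slot is free."""
        if len(self.in_process) >= self.wip_limit:
            return False                     # throttled: the order waits upstream
        self.in_process.append(work_order)
        return True

    def finish(self, work_order):
        """Completing an item frees its kanban, making room to pull the next order."""
        self.in_process.remove(work_order)


board = KanbanBoard(wip_limit=10)
admitted = [order for order in range(15) if board.try_start(order)]
print(f"admitted {len(admitted)} of 15 requested orders")   # admitted 10 of 15
```

Throttling is the easy half of the problem: the hard half, raising throughput, is what the freed-up attention goes toward.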

If we have a well-defined workflow, then the total work-in-process is the sum of the WIP of all of the parts of that workflow. Limiting the total WIP in the system can still mean quite a bit of variation in the distribution of WIP between the parts of the system. Our next step after limiting total WIP will be managing that component WIP more closely, and it turns out that some parts of that component WIP are more sensitive predictors of lead time than others.

Which is to say, given the same root cause, some inter-process workflow queue will grow from 2 to 4 long before unregulated global WIP would grow from 20 to 40. If you set your system up right, one or more of those internal queues will telegraph problems well before they manifest elsewhere.
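
One way to picture that sensitivity is to monitor utilization stage by stage instead of only the global total. A toy snapshot, with hypothetical stage names and limits, where the aggregate number still looks healthy while one internal queue is already flashing:

```python
# Hypothetical stage names, limits, and counts, just to illustrate the idea.
stage_limits = {"ready": 4, "analysis": 3, "design": 3, "build": 4, "test": 3}
stage_wip    = {"ready": 2, "analysis": 2, "design": 4, "build": 1, "test": 1}

global_cap = 17
total_wip = sum(stage_wip.values())
print(f"total WIP {total_wip}/{global_cap}: looks fine at the global level")

for stage, limit in stage_limits.items():
    wip = stage_wip[stage]
    if wip >= limit:
        print(f"ALERT: '{stage}' queue is at or over its limit ({wip}/{limit})")
    else:
        print(f"{stage}: {wip}/{limit} ({wip / limit:.0%} of its limit)")
```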

Development workflows need buffers


The irregularity of requirements and the creative, knowledge-intensive nature of a design activity like software development rule out clocked workflow synchronization. Sometimes the interface to something will be simple, but the algorithm behind it will not. Sometimes the opposite is true. Sometimes an apparently simple design change has wide-reaching effects that require reverification and a lot of thinking about edge cases. Risk and uncertainty are built into the nature of development work. Novelty is what gives software its value, so you can only get so far in reducing this kind of variation before you have to mitigate and adapt to it. Abandoning takt time for development work has been our big concession to the messy reality, although we still look for opportunities to introduce a regular cadence at a higher scale of integration. Of course, we’d be delighted and astounded to hear of anybody making a takt time concept work.

Instead, we have to use small inventory buffers between value-adding processes in order to absorb variation in the duration of each activity across work items. We allocate kanban to those buffers just like anywhere else, and those kanban count towards our total allocation. Making the buffers random-access makes them even more flexible in absorbing process variation.

What is this inventory? Specifications that have not been implemented. Designs that have not been reviewed. Code that has not been tested and deployed. You can measure things like “weeks of specs-on-hand” and “percentage of specs complete.” The higher that first number is, the lower the second one probably is. For orgs that carry months’ worth of specs at a time, that second number can quickly converge on zero. So don’t do that! If you’re carrying more than a few weeks’ worth of detailed specifications at a time, ask yourself: why? What are you going to do with them? Specification inventory is a liability just like any other kind of inventory.
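
Both measures are easy to compute once you count your spec inventory and your consumption rate. A small sketch with invented numbers; the way I’ve defined “percentage complete” (done specs over all specs written) is one reasonable reading, not the only one:

```python
# Invented numbers, just to show the two inventory measures.
specs_on_hand = 120             # detailed specs written but not yet implemented
specs_per_week = 10             # how many specs the team actually implements weekly
specs_completed = 45            # specs already implemented and delivered

weeks_on_hand = specs_on_hand / specs_per_week                       # 12 weeks
pct_complete = specs_completed / (specs_completed + specs_on_hand)   # ~27%

print(f"weeks of specs on hand: {weeks_on_hand:.0f}")
print(f"percentage of specs complete: {pct_complete:.0%}")
```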

So we’re carrying a few hours’ or days’ worth of inventory at a time, because it’s still faster than the alternatives of generalist labor or pipeline congestion. To be clear: when I talk about carrying kanban inventory, I mean hours or days, not weeks or months. And I like hours a whole lot better than days.

The joy of complementary side effects


Agile development has long rallied around the “inspect and adapt” paradigm of process improvement. It is a philosophy that it shares with its Lean cousin. But early Agile methods built their model of feedback around the notion of velocity, and velocity is a trailing indicator. Velocity, and even lead time, can only tell you about things that have already happened.

To be fair, all Agile methods include higher-frequency feedback in the form of the daily standup. But a qualitative assessment is not the same as a quantitative indicator. Done well, the right measure can tell you things that people in a conversational meeting either can’t see, or won’t admit to. An informal, qualitative, Scrum-style approach to issue management leads to confusion between circumstantial and systemic problems, and the obstacle-clearing function of the Scrum Master often leads to one of Deming’s “two mistakes”. But then, Deming might have taken exception to a number of beliefs and practices common to today’s Agile practitioner. That’s okay: we Planned and we Did, and now we are Studying and Acting.

The regulating power of the in-process inventory limit is that it tells you about problems in your process while you are experiencing the problem. You don’t have to extract a belated confession from a stubborn problem-solver or wait for the end of the month to have a review in order to notice that something went wrong. You watch it going wrong in front of your eyes as it happens.

In a kanban workflow system, inter-process queues start backing up immediately following any blockage in their downstream processes. If your team is all working within a line of sight of a visual control representation of that inventory, then you all see the problem together as it manifests. A backed-up queue is not a matter of opinion and the consequences are highly predictable.

Making the indicator work for us


If we’re using a kanban system, we have the WIP limit indicator at our disposal. How can we use this to our advantage?

Under normal conditions of smooth flow, the kanban queues should be operating below their limits. Which is to say, the system has some slack. Slack is good, and optimum flow means “just enough slack.” The limits for the queues are set according to a different rule than the limits for value-added states. Buffer states are non-value-added processing time, so we want to make them as small as we can. The queues are there for the purpose of smooth flow. Make them too big, and they just increase inventory and lead time. Make them too small and they cause traffic jams…which also increases lead time.

So there’s a “just right” size for kanban queues, and that is as small as possible without stalling X% of the time. Since the queue size is a tradeoff, there is an optimal value for X which is less than 100. The difference between X and 100 is your expectation of process improvement which will be triggered by the occasional stall event. So our process has slack, but our slack doesn’t. When we run out of slack, we want to stop what we’re doing and try to learn how to operate with less slack in the future.
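
To make that sizing tradeoff concrete, here is a rough Monte Carlo sketch. The arrival distribution and every number in it are assumptions, chosen only to show the shape of the curve: stall frequency falls as the buffer grows, while inventory (and therefore lead time) rises.

```python
import random

def stall_fraction(buffer_size, cycles=10_000, seed=1):
    """Estimate how often downstream starves for a given buffer size.

    Upstream output is lumpy (0-2 items per cycle, mean 1.0); downstream pulls
    exactly one item per cycle. Arrivals beyond the buffer limit are blocked,
    which is what the kanban allocation does. All numbers are illustrative.
    """
    rng = random.Random(seed)
    queue, stalls = 0, 0
    for _ in range(cycles):
        queue = min(buffer_size, queue + rng.choice([0, 1, 1, 2]))
        if queue > 0:
            queue -= 1            # downstream pulls one item
        else:
            stalls += 1           # nothing to pull: a stall event
    return stalls / cycles

for size in (1, 2, 3, 5, 8):
    print(f"buffer limit {size}: stalls {stall_fraction(size):.1%} of the time")
```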

Picture a board in a healthy state: a lot of working, not much waiting. When the next analysis task is done, there will be room to store the result, even if design is busy. Design is not under any particular pressure to complete something…yet. But conditions can change quickly, so no excuse to dawdle!

Since our system is a pull system, our process breaks down in a characteristic way. When a queue fills up, there’s nowhere for the output of the process before it to go, so that process will begin to back up itself, and so on, until the entire pipeline in front of the jam eventually stops while the remainder of the pipeline flushes itself out. Good! That’s what we want. Every process in the system serves as a throttle for its predecessor. That means that the system as a whole is regulated by the health of its parts. Shortly after any part of the system starts to go wrong, the entire system responds by slowing down and freeing up resources to fix the problem. That automatic reflection of process health is a powerful mechanism for continuous improvement.
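
You can watch that cascade in a toy pull simulation. The stage names, the queue limit, and the choice of which stage to halt are placeholders, not anything from a real board; the point is only to show back-pressure propagating upstream while the downstream stages flush out.

```python
# Toy pull pipeline: each stage has an input queue with a kanban limit of 2.
# Halting "design" shows back-pressure propagating upstream.
STAGES = ["analysis", "design", "build", "test"]
LIMIT = 2

queues = {s: 2 for s in STAGES}   # every input queue starts full
done = 0
halted = {"design"}               # e.g. the designers are all out sick

def room(i):
    """Is there space in the next queue downstream (or is this the last stage)?"""
    return i == len(STAGES) - 1 or queues[STAGES[i + 1]] < LIMIT

for tick in range(1, 6):
    # Work the pipeline right to left so downstream frees space before upstream pushes.
    for i in reversed(range(len(STAGES))):
        stage = STAGES[i]
        if stage in halted or queues[stage] == 0 or not room(i):
            continue              # stopped, starved, or blocked: no pull this tick
        queues[stage] -= 1
        if i == len(STAGES) - 1:
            done += 1
        else:
            queues[STAGES[i + 1]] += 1
    print(f"tick {tick}: queues={queues} done={done}")
```

Within a few ticks the stages downstream of the jam run dry while the queues upstream of it stay full: exactly the self-regulating shutdown described above.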

Let’s walk through a typical failure mode:

1. Something is going wrong in the design process, but nobody knows it yet. The senior devs are all sick with the flu. Nobody signals the andon light because they’re at home, or they have other problems on their minds.
2. The analysts, who are in a different hallway, seem immune and continue to complete their assignments, so the queue feeding design starts to fill. At this point, the process is already signaling that something is amiss.
3. The analysts start up their next tasks anyway. The pipeline to the right of design continues processing from its own queue.
4. There’s nowhere for the analysts to put their completed work, so now they are also stalled. The right side of the pipeline has flushed out whatever work was already in process and now they are idle as well. The ready queue has backed up, and so the whole pipeline is now stalled.

With no intervention other than enforcing the kanban allocation, the system spontaneously responds to problems by shutting itself down. This would be an example of jidoka applied to our development workflow. The people who are idled by this process can and should spend their time looking into the root cause of the problem, either to mitigate it (if it is a special cause) or to prevent it from happening in the future (if it is a common cause). You can’t really predict when the design team will get sick, so in this case, perhaps the analysts and junior devs can work together and complete some of the design tasks until the missing devs get back to health. In this case, it may be an opportunity to discover if the team is sufficiently cross-trained to cover the gap and ask questions about roles and responsibilities.

Even though the problem is self-limiting by step 4, we already know in step 2 that steps 3 and 4 are likely to happen if we don’t intervene. It would have been better if somebody had taken greater notice of the signal in step 2 and begun an investigation. It would also be nice if the system itself could respond both more quickly and more gracefully than in this example.

In the next article, we’ll look at another queueing method that will allow us to simultaneously reduce lead times, smooth out flow, and respond more quickly and gracefully to disruptions.