June 2008

Completion queue as incremental throttle

In the last two posts, we’ve discussed some useful properties of internal workflow queues:

  • queue states between processes can provide an early warning of process breakdowns
  • local work-in-process limits serve to slow down a malfunctioning workflow and free up resources to fix it
  • queues can sometimes be combined to reduce the total work-in-process while still preserving their buffering function

I gave an example of workflow throttling, and suggested there was another configuration of those internal queues that could respond more smoothly and gracefully than the simple, independent queues given in the example.

In order to pull a work item, there has to be a place to pull it from, and there should be some way to distinguish work that is eligible to be pulled from work that is still in process. At the same time, there has to be a place to put completed work when you are done with it. A completion queue serves both these functions.

In this case, we can have up to 3 items in the “specify” state AND we can have up to 3 items waiting for the next state in the workflow. The team can pull new work into “specify” whenever there are fewer than 3 work items in process. If there are already 3 work items in process, then the team will have to wait until something is moved into the completion queue. If there is some kind of blockage downstream, first the completion queue will fill up, THEN the “specify” state will fill up, THEN the specify process will stall. And when it stalls, it stalls all at once. The flow is either on or off: there’s no middle speed, and work keeps flowing at full rate right up until it stalls.

In another example, we still have a busy state and a complete state, but the token limit is shared between them. In this case, we can have 4 items in process OR 4 waiting. Or we can have (3 busy + 1 waiting) OR (1 busy + 3 waiting).

In the ideal case of 3 busy and 1 waiting, this queue works just like the first example does. However, if work starts to accumulate in the “complete” state, then the “specify” state will incrementally throttle down. The effective WIP limit for “specify” goes from 4->3->2->1->0 as more items are completed ahead of the rate of downstream intake. So, the process slows before it stops, and it slows much sooner than it would have under the independent queues.

What’s more, even though it operates in the same way in the normal case, it does it with two fewer kanban in the system. Fewer kanban, with gradual throttling and smoother flow, should result in lower lead times.
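As a rough sketch of the difference between the two configurations, the pull rules can be written as two small predicates. The function names and default limits here are illustrative assumptions taken from the examples above (3 + 3 independent kanban versus a shared limit of 4), not part of any tool.

```python
def can_pull_independent(busy, done, busy_limit=3, done_limit=3):
    """Independent queues: 'specify' pulls whenever it is under its own limit.

    The number of completed items waiting (done) does not affect the decision.
    """
    return busy < busy_limit


def can_pull_shared(busy, done, shared_limit=4):
    """Shared limit: busy and completed items draw on one pool of kanban."""
    return busy + done < shared_limit


def effective_busy_limit(done, shared_limit=4):
    """WIP that 'specify' is allowed, given how many completed items are waiting."""
    return max(shared_limit - done, 0)


# Effective 'specify' limit as completed work accumulates under the shared rule.
limits = [effective_busy_limit(done) for done in range(5)]
print(limits)  # [4, 3, 2, 1, 0]

# Kanban in the system: 3 + 3 for the independent queues, 4 for the shared one.
print((3 + 3) - 4)  # 2
```

The independent rule ignores the completion queue entirely until it is full, which is why that configuration has no middle speed.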

With this in mind, let’s reconsider our scenario from the previous topic:

1. Something is going wrong in the design process, but nobody knows it yet.
2. The specify-complete queue starts to back up, thereby throttling down the WIP limit for specify. One person is freed as a result, and should now look into the cause of the backup, which may be only random variation. The code process continues to complete work and pull from the existing backlog.
3. Code state begins to starve and specify state throttles down another level. Two more people are released as a result. There are now more than enough free people to either fix the problem or shut down the process.
4. The stall completes by flushing out the specify and code states.

It still takes a while for the system to stall completely. The difference is that it begins stalling immediately, and when it does stall, it stalls with less WIP. For equivalent throughput, this pipeline should operate with fewer kanban and less variation in WIP, and therefore should have smoother flow and shorter lead times. It should respond faster to problems and free up resources earlier to correct those problems.
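To make the comparison concrete, here is a toy simulation of a downstream blockage under both rules. It is only a sketch under simplifying assumptions that are mine, not the post's: one item finishes per tick, the team pulls a replacement whenever the rule allows, nothing is pulled downstream, and both configurations start with 3 busy and 1 waiting.

```python
def simulate_blockage(can_finish, can_pull, busy=3, done=1):
    """Return (busy, done) after each tick until the state stops changing."""
    history = []
    while True:
        before = (busy, done)
        if busy > 0 and can_finish(busy, done):
            busy, done = busy - 1, done + 1  # one item is completed
        if can_pull(busy, done):
            busy += 1  # the team pulls a replacement item
        if (busy, done) == before:
            return history  # nothing moved: the process has stalled
        history.append((busy, done))


# Independent queues: 3 busy slots plus 3 completion slots.
independent = simulate_blockage(
    can_finish=lambda busy, done: done < 3,
    can_pull=lambda busy, done: busy < 3,
)

# Shared limit: 4 kanban across busy and completed items.
shared = simulate_blockage(
    can_finish=lambda busy, done: True,  # finishing keeps the same kanban
    can_pull=lambda busy, done: busy + done < 4,
)

print(independent)  # [(3, 2), (3, 3)]
print(shared)       # [(2, 2), (1, 3), (0, 4)]
```

In this toy run the independent queues keep 3 items busy until the stall and end with 6 items of WIP, while the shared queue drops a level on the very first tick and ends with 4.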

These shared completion queues might be the most common type of workflow queue. There are a couple of other types that we use, and we’ll take a look at those in a future post.


Queue utilization is a leading indicator

I talk a lot about how to apply Lean ideas to software development. Perhaps I sometimes take it for granted that we understand why we should apply them. Mary Poppendieck has already written quite a bit on that rationale, and I try not to rehash things I think she’s already covered adequately. I do think there are a few characteristic scenarios where Lean principles most clearly apply to software development:

  • Any kind of live network service, whether customer-facing (Google.com, Amazon.com) or machine-facing (Bigtable, SimpleDB)
  • Any kind of sustaining engineering process: bug fixing, security patching, incremental enhancement
  • Evolutionary product design (which is to say, effective product design)

That said, there is a very pragmatic reason to adopt a Lean workflow strategy, regardless of what sort of product you are building: Lean scheduling provides crystal clear leading indicators of process health.

I am speaking of kanban limits and andon lights.

Work in process is a leading indicator

For a stable workflow, lead time is a function of both throughput (how much stuff we complete every day) and work-in-process. For a given rate of throughput (with everybody busy at their jobs), an increase in WIP necessarily means an increase in lead time.

It’s simple cause and effect: an increase in WIP today will mean an increase in the time to deliver that work in the future. As far as leading indicators go, this one’s rock solid. You can’t take on more work than you have the capacity to do without taking longer to do it.
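The relationship described here is usually formalized as Little's law: for a stable system, average lead time equals average WIP divided by average throughput. The post does not name it, so treat the function below as an illustrative sketch; the numbers are made up.

```python
def average_lead_time(wip, throughput_per_day):
    """Little's law: average lead time (days) = average WIP / average throughput."""
    if throughput_per_day <= 0:
        raise ValueError("throughput must be positive")
    return wip / throughput_per_day


# Same throughput, twice the WIP: lead time doubles.
print(average_lead_time(20, 4))  # 5.0
print(average_lead_time(40, 4))  # 10.0
```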

A simple management technique is to simplify the problem with policy. If lead time is a function of both throughput and WIP, and you can hold WIP near constant by an act of policy, then you can begin to address the more difficult problem of throughput. WIP is relatively easy to control, because somebody in your business should have the power to approve or deny starting on a new work order. Throttling work orders is a much easier problem than learning how to work faster.

This is effectively the result of a Drum-Buffer-Rope system, or its Lean cousin, a kanban system. Only after you get the simpler variable under control can you begin to make consistent progress on the more difficult one.

If we have a well-defined workflow, then the total work-in-process is the sum of the WIP of all of the parts of that workflow. Limiting the total WIP in the system can still mean quite a bit of variation in the distribution of WIP between the parts of the system. Our next step after limiting total WIP will be managing that component WIP more closely, and it turns out that some parts of that component WIP are more sensitive predictors of lead time than others.

Which is to say that, given the same root cause, some inter-process workflow queue will go from 2 to 4 long before the global WIP would go from 20 to 40 if it were unregulated. If you set your system up right, one or more of those internal queues will telegraph problems well before they manifest elsewhere.

Development workflows need buffers

The irregularity of requirements and the creative, knowledge-intensive nature of a design activity like software development rule out clocked workflow synchronization. Sometimes the interface to something will be simple, but the algorithm behind it will not. Sometimes the opposite is true. Sometimes an apparently simple design change has wide-reaching effects that require reverification and a lot of thinking about edge cases. Risk and uncertainty are built into the nature of development work. Novelty is what gives software its value, so you can only get so far in reducing this kind of variation before you have to mitigate and adapt to it. Abandoning takt time for development work has been our big concession to the messy reality, although we still look for opportunities to introduce a regular cadence at a higher scale of integration. Of course, we’d be delighted and astounded to hear of anybody making a takt time concept work.

Instead, we have to use small inventory buffers between value-adding processes in order to absorb variation in the duration of each activity across work items. We allocate kanban to those buffers just like anywhere else, and those kanban count towards our total allocation. Making the buffers random-access makes them even more flexible in absorbing process variation.

What is this inventory? Specifications that have not been implemented. Designs that have not been reviewed. Code that has not been tested and deployed. You can measure things like “weeks of specs-on-hand” and “percentage of specs complete.” The higher that first number is, the lower the second one probably is. For orgs that carry months’ worth of specs at a time, that second number can quickly converge on zero. So don’t do that! If you’re carrying more than a few weeks’ worth of detailed specifications at a time, ask yourself: why? What are you going to do with them? Specification inventory is a liability just like any other kind of inventory.

So we’re carrying a few hours’ or days’ worth of inventory at a time, because it’s still faster than the alternatives of generalist labor or pipeline congestion. And to be clear, when I’m talking about carrying kanban inventory, I’m talking about hours or days, not weeks or months. And I like hours a whole lot better than days.

The joy of complementary side effects

Agile development has long rallied around the “inspect and adapt” paradigm of process improvement. It is a philosophy that it shares with its Lean cousin. But early Agile methods built their model of feedback around the notion of velocity, and velocity is a trailing indicator. Velocity, and even lead time, can only tell you about things that have already happened.

To be fair, all Agile methods include higher-frequency feedback in the form of the daily standup. But a qualitative assessment is not the same as a quantitative indicator. Done well, the right measure can tell you things that people in a conversational meeting either can’t see, or won’t admit to. An informal, qualitative, Scrum style of issue management leads to confusion between circumstantial vs systemic problems, and the obstacle-clearing function of the Scrum Master often leads to one of Deming’s “two mistakes”. But then, Deming might have taken exception to a number of beliefs and practices common to today’s Agile practitioner. That’s okay, we Planned and we Did, and now we are Studying and Acting.

The regulating power of the in-process inventory limit is that it tells you about problems in your process while you are experiencing the problem. You don’t have to extract a belated confession from a stubborn problem-solver or wait for the end of the month to have a review in order to notice that something went wrong. You watch it going wrong in front of your eyes as it happens.

In a kanban workflow system, inter-process queues start backing up immediately following any blockage in their downstream processes. If your team is all working within a line of sight of a visual control representation of that inventory, then you all see the problem together as it manifests. A backed-up queue is not a matter of opinion and the consequences are highly predictable.

Making the indicator work for us

If we’re using a kanban system, we have the WIP limit indicator at our disposal. How can we use this to our advantage?

Under normal conditions of smooth flow, the kanban queues should be operating below their limits. Which is to say, the system has some slack. Slack is good, and optimum flow means “just enough slack.” The limits for the queues are set according to a different rule than the limits for value-added states. Buffer states are non-value-added processing time, so we want to make them as small as we can. The queues are there for the purpose of smooth flow. Make them too big, and they just increase inventory and lead time. Make them too small and they cause traffic jams, which also increase lead time. So there’s a “just right” size for kanban queues, and that is as small as possible while still running without a stall X% of the time. Since the queue size is a tradeoff, there is an optimal value for X which is less than 100. The difference between X and 100 is your expectation of process improvement which will be triggered by the occasional stall event. So our process has slack, but our slack doesn’t. When we run out of slack, we want to stop what we’re doing and try to learn how to operate with less slack in the future.
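One way to put a number on the “just right” size is to treat it as a percentile of observed demand for buffer slots. This is only a sketch of that idea, not a method from the post: `smallest_limit` is a hypothetical helper, and the sample data is invented.

```python
import math


def smallest_limit(observed_demand, target_pct):
    """Smallest queue limit that covers at least target_pct% of the observations.

    observed_demand: how many buffer slots were needed at each sampling point.
    target_pct: the X in "runs without stalling X% of the time".
    """
    if not observed_demand:
        raise ValueError("need at least one observation")
    if not 0 < target_pct <= 100:
        raise ValueError("target_pct must be in (0, 100]")
    ordered = sorted(observed_demand)
    needed = math.ceil(len(ordered) * target_pct / 100)
    return ordered[needed - 1]


# Invented samples of how many finished items were waiting for the next process.
samples = [0, 1, 1, 2, 1, 0, 3, 1, 2, 1]

print(smallest_limit(samples, 100))  # 3: never stalls, but carries the most inventory
print(smallest_limit(samples, 90))   # 2: stalls in roughly 1 sample out of 10
print(smallest_limit(samples, 70))   # 1
```

Choosing X below 100 deliberately accepts the occasional stall as the trigger for improvement.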

A healthy state of affairs. A lot of working, not much waiting. When the next analysis task is done, there will be room to store the result, even if design is busy. Design is not under any particular pressure to complete something…yet. But conditions can change quickly, so no excuse to dawdle!

Since our system is a pull system, our process breaks down in a characteristic way. When a queue fills up, there’s nowhere for the output of the process before it to go, so that process will begin to back up itself, and so on, until the entire pipeline in front of the jam eventually stops while the remainder of the pipeline flushes itself out. Good! That’s what we want. Every process in the system serves as a throttle for its predecessor. That means that the system as a whole is regulated by the health of its parts. Shortly after any part of the system starts to go wrong, the entire system responds by slowing down and freeing up resources to fix the problem. That automatic reflection of process health is a powerful mechanism for continuous improvement.

Let’s walk through a typical failure mode:

1. Something is going wrong in the design process, but nobody knows it yet. The senior devs are all sick with the flu. Nobody signals the andon light because they’re at home, or they have other problems on their minds.
2. The analysts, who are in a different hallway, seem immune and continue to complete their assignments. At this point, the process is already signaling that something is amiss.
3. The analysts start up their next tasks anyway. The pipeline to the right of design continues on processing from its own queue.
4. There’s nowhere for the analysts to put their completed work, so now they are also stalled. The right side of the pipeline has flushed out whatever work was already in process and now they are idle as well. The ready queue has backed up, and so the whole pipeline is now stalled.

With no intervention other than enforcing the kanban allocation, the system spontaneously responds to problems by shutting itself down. This would be an example of jidoka applied to our development workflow. The people who are idled by this process can and should spend their time looking into the root cause of the problem, either to mitigate it (if it is a special cause) or to prevent it from happening in the future (if it is a common cause). You can’t really predict when the design team will get sick, so in this case, perhaps the analysts and junior devs can work together and complete some of the design tasks until the missing devs get back to health. In this case, it may be an opportunity to discover if the team is sufficiently cross-trained to cover the gap and ask questions about roles and responsibilities.
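The walk-through above can be approximated with a very small simulation. Everything in it is an assumption made for illustration: the stage names, the limits of 2, the starting state, and the rule that each side of the pipeline finishes one item per tick while design does nothing at all.

```python
def simulate_design_outage(ticks=4, buffer_limit=2, code_limit=2):
    """Design is stopped; track the buffer before it and the work after it.

    Returns one tuple per tick:
    (tick, ready_for_design, upstream_stalled, code_busy, downstream_idle)
    """
    ready_for_design = 1        # finished analysis waiting for design
    ready_for_code = 1          # finished design waiting for coding
    code_busy = code_limit      # coders start fully loaded
    log = []
    for tick in range(1, ticks + 1):
        # Downstream of design: finish one item, then pull from the buffer.
        if code_busy > 0:
            code_busy -= 1
        if ready_for_code > 0 and code_busy < code_limit:
            ready_for_code -= 1
            code_busy += 1
        # Upstream of design: analysts can finish only if the buffer has room.
        upstream_stalled = ready_for_design >= buffer_limit
        if not upstream_stalled:
            ready_for_design += 1
        downstream_idle = code_busy == 0 and ready_for_code == 0
        log.append((tick, ready_for_design, upstream_stalled,
                    code_busy, downstream_idle))
    return log


for row in simulate_design_outage():
    print(row)
# (1, 2, False, 2, False)  buffer before design fills up: the early signal
# (2, 2, True, 1, False)   analysts have nowhere to put work; coders drain
# (3, 2, True, 0, True)    both sides of the pipeline are stopped
# (4, 2, True, 0, True)
```

The full buffer shows up on the first tick, two ticks before the whole pipeline is idle, which is the window in which somebody could have started investigating.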

Even though the problem is self-limiting by step 4, we already know in step 2 that steps 3 and 4 are likely to happen if we don’t intervene. It would have been better if somebody had taken greater notice of the signal in step 2 and begun an investigation. It would also be nice if the system itself could respond both more quickly and more gracefully than in this example.

In the next article, we’ll look at another queueing method that will allow us to simultaneously reduce lead times, smooth out flow, and respond more quickly and gracefully to disruptions.


Shaping Software

My good friend and impossibly prolific writer J.D. Meier has a new blog called Shaping Software, which promises to be a general review of software engineering patterns and practices. He’s currently riffing on evolutionary development and process engineering. His old blog was already a terrific resource, but the new one promises to be even better.


Pool queue

Manufacturing systems have workflows and knowledge work systems have workflows (and little lambs eat ivy). There are principles that apply to workflows in general, regardless of whether they operate on bits or atoms, and that accounts for much of what we discuss here at Lean Software Engineering. There are also things that are completely different about information workflows. One of those things is the physical space necessary to operate the system. The nature of information space is fundamentally different from any physical process.

Fortunately for us, that often works to our advantage. It means we can manipulate our workflows and work products in ways that would be nonsensical to a traditional industrial engineer. Since most of the literature about Lean is still about moving atoms around, you have to pinch yourself every now and then as a reminder that moving bits around involves a different set of rules.

Bits or atoms, the notion of an inter-process inventory buffer is generally important to our scheduling methodology. Our overall goal is to minimize lead times for new work requests, and a great part of how we do that is by managing our in-process inventory very carefully. But an information inventory is different from a manufacturing inventory, in that it doesn’t occupy exclusive space in a meaningful way. Our information WIP might go into a virtual queue, effectively infinite in size, with no definite order for queuing or dequeuing, and no conflict between objects in the queue. A virtual queue can be random-in-random-out in a way that’s improbable for more spatially-oriented storage.

An issue that seems to come up regularly for development teams is how to distribute multiple work product types across the team’s resources. One approach says dedicate resources to each product type, say, a couple of “feature teams” and some bug fixers. Or a “front end” team and a “back end” team. Another approach says make a prioritization rule and assign all of the work to the common team. A kanban system enables us to use a hybrid approach that dedicates capacity to each work product type, without actually dedicating people.

Suppose we have a fairly simple, generic, 2-stage development process, common to all work product types:

Because it’s knowledge work, there’s too much variation between the two subprocesses to synchronize according to a clock interval, so we make an inter-process queue to absorb the variation:

The queue just holds the kanban; the actual inventory is still sitting in the same document, database, or code repository that it was in when somebody was working on it. It doesn’t matter where the real inventory is because nobody is competing for the storage.

Then we scale that process according to the available resources and demand:

But we can hybridize even further by exploiting some of our “virtual space” advantage. Because our “workcells” and “buffer stocks” don’t actually occupy any spatially constrained floor space, we can arrange them in any logical arrangement that suits us. In this case, we’re going to make a single pooled buffer that straddles both production lines:

Why would we do that? Pooling the variation across the queues for both lines allows us to reduce the total number of kanban in the system, and thereby reduce the lead time for the system as a whole. The dedicated queues each needed a minimum capacity of 2, for a total of 4, to avoid stalling. The combined queue only needs 3 to avoid stalling, because it is rare that both independent queues are simultaneously at their limit of 2. We can reduce the queue further by reducing the variability of either of the surrounding processes. Again, it will be easier to reduce from 3 to 2 than it would be to reduce from 2 to 1.
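The pooling argument can be checked with a little arithmetic. The sketch below assumes the two lines fill their queues independently and uses an invented occupancy distribution for each line; both are assumptions for illustration, not measurements.

```python
from fractions import Fraction
from itertools import product


def pooled_demand(dist_a, dist_b):
    """Distribution of total slots needed when two independent queues are pooled."""
    pooled = {}
    for (a, pa), (b, pb) in product(dist_a.items(), dist_b.items()):
        pooled[a + b] = pooled.get(a + b, 0) + pa * pb
    return pooled


def overflow_probability(dist, capacity):
    """Probability that demand exceeds the given queue capacity."""
    return sum(p for slots, p in dist.items() if slots > capacity)


# Invented distribution: how often one line's queue holds 0, 1, or 2 items.
line = {0: Fraction(5, 10), 1: Fraction(3, 10), 2: Fraction(2, 10)}

pooled = pooled_demand(line, line)
print(overflow_probability(pooled, 4))  # 0: four slots always suffice
print(overflow_probability(pooled, 3))  # 1/25: both lines full at once is rare
print(overflow_probability(pooled, 2))  # 4/25
```

With these made-up numbers, dropping the pooled queue from 4 kanban to 3 costs a stall only 4% of the time, whereas cutting a dedicated queue from 2 to 1 would stall that line 20% of the time.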
