Scrum-ban

As more people become interested in Lean ideas and their application to knowledge work and project management, it’s helpful to find ways to make it easier to get started, or to learn a few basic concepts that can lead to deeper insights later. Among those who are curious about kanban in an office context, it’s not unusual to find people who are either currently using Scrum or have some understanding of Scrum as representative of Agile thinking. One way or another, Scrum users are an important constituency of the Kanban audience. Since Scrum can be described as a statement in the language we use to describe kanban systems, it is also fairly easy to elaborate on that case in order to describe Scrum/Kanban hybrids…

read this paper…

Software development pull systems at Agile 2008 Toronto

There’s a nice theme going on at this year’s Agile conference from practitioners of kanban and other pull-like systems. I wanted to make a t-shirt that said something fun like “Have you pulled your kanban today?”, but the idea ultimately got voted down.

Here’s a list of pull-related topics. Please let me know if I missed anything.

Tuesday, August 5


Agile Game Development with Clinton Keith: Agile video game teams have adopted not only Scrum and XP, but are adopting Lean, Kanban and other practices to find ways to make better games.

Wednesday, August 6


Value Stream Mapping - Extending Our View to the Enterprise with Alan Shalloway: As effective as Agile teams have become, many times it isn’t the team that is the problem. In many cases, the structure within which the team exists is more problematic than the performance of the team itself…

Come and Take It! Lean Pull Applied with Rod Coffin and Don McGreal: The concept of “pull” from lean manufacturing challenges mainstream approaches to software development and reconsiders how value is delivered to the customer by inverting the thought process and focusing first on delivery.

GTD + Kanban + Round Robin for Product Owners with Thomas Nilsson: This demonstration will show how a Kanban board (task board with “states”) can be combined with a “round robin” scheme to keep analysts and Product Owners working on multiple tasks of high priority, but with potentially long lead times and fuzzy done criteria.

Thursday, August 7


Future Directions for Agile with David Anderson: How does our definition of agile evolve? How do we learn and adapt as a community? What about new ideas like Behavior-Driven Development, Kanban, Real Options, and others? Are they agile or not?

KFC Development - Finger Lickin’ Good with Karl Scotland and Aaron Sanders: This workshop explores three important Lean concepts - Kanban, Flow and Cadence (KFC) - which can be combined to generate a more pipeline-based approach to software development, as opposed to the more common timebox-based approaches of most Agile methods.

Estimating Considered Wasteful: Introducing Micro-Releases with Joshua Kerievsky: Micro-releasing has simplified our process by eliminating traditional agile planning activities…Instead, our focus remains on the mini-release. What important user story or stories do we most need to ship to production in the next few days?

Friday, August 8


Starting a Kanban System for Software Engineering with Value Stream Maps and Theory of Constraints with Corey Ladas: Any process with a recognizable workflow can be made into an efficient pull system by applying the kanban method. We can use kanban to transform either a traditional phase/gate software development system or a time-boxed iterative system into a lean continuous-flow system.

Completion queue as incremental throttle

In the last two posts, we’ve discussed some useful properties of internal workflow queues:

  • queue states between processes can provide an early warning of process breakdowns
  • local work-in-process limits serve to slow down a malfunctioning workflow and free up resources to fix it
  • queues can sometimes be combined to reduce the total work-in-process while still preserving their buffering function

I gave an example of workflow throttling, and suggested there was another configuration of those internal queues that could respond more smoothly and gracefully than the simple, independent queues given in the example.

In order to pull a work item, there has to be a place to pull it from, and there should be some way to distinguish work that is eligible to be pulled from work that is still in process. At the same time, there has to be a place to put completed work when you are done with it. A completion queue serves both these functions.

In the simplest case, the busy state and its completion queue each get their own independent limit. Here, we can have up to 3 items in the “specify” state AND up to 3 items waiting for the next state in the workflow. The team can pull new work into “specify” whenever there are fewer than 3 work items in process. If there are already 3 work items in process, then the team will have to wait until something is moved into the completion queue. If there is some kind of blockage downstream, first the completion queue will fill up, THEN the specify queue will fill up, THEN the specify process will stall. And when it stalls, it stalls all at once. The flow is either on or off: there’s no middle speed, and it keeps going at full speed until it stalls.
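To make the on/off behavior concrete, here is a minimal Python sketch of the independent queues. It is only an illustration under assumptions that are not part of the original example: the function name `simulate_blocked_independent` is hypothetical, downstream intake is completely blocked, the busy state starts full, and exactly one item finishes per tick as long as the completion queue has room for it.

```python
def simulate_blocked_independent(busy_limit: int = 3,
                                 done_limit: int = 3,
                                 ticks: int = 6) -> tuple[list[int], int]:
    """Return (items started per tick, total WIP at the end).

    Assumptions for illustration only: downstream pulls nothing,
    and one busy item finishes per tick when there is room for it.
    """
    busy, done = busy_limit, 0  # start with "specify" fully loaded
    started = []
    for _ in range(ticks):
        pulls = 0
        # A finished item needs a free slot in the completion queue.
        if busy > 0 and done < done_limit:
            busy -= 1
            done += 1
        # The busy limit is independent, so any freed slot is refilled at once.
        while busy < busy_limit:
            busy += 1
            pulls += 1
        started.append(pulls)
    return started, busy + done


started, total_wip = simulate_blocked_independent()
print(started)    # new work started on each tick
print(total_wip)  # work stuck in the system once it stalls
```

Under these assumptions, new work keeps starting at the full rate for three ticks and then stops dead, with all 6 kanban occupied. Nothing in the "specify" state slows down until the moment it stalls.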

In another example, we still have a busy state and a complete state, but the token limit is shared between them. With a shared limit of 4, we can have 4 items in process OR 4 waiting, or any split in between: (3 busy + 1 waiting), (2 busy + 2 waiting), or (1 busy + 3 waiting).

In the ideal case of 3 busy and 1 waiting, this queue works just like the first example does. However, if work starts to accumulate in the “complete” state, then the “specify” state will incrementally throttle down. The effective WIP limit for “specify” goes from 4->3->2->1->0 as more items are completed ahead of the rate of downstream intake. So, the process slows before it stops, and it slows much sooner than it would have under the independent queues.

What’s more, even though it operates in the same way in the normal case, it does it with two fewer kanban in the system. Fewer kanban, with gradual throttling and smoother flow, should result in lower lead times.
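The incremental throttle can be sketched in a few lines of Python. This is a hedged illustration, not a definitive implementation: the names `effective_specify_limit` and `simulate_blocked_shared` are hypothetical, the shared limit of 4 and the starting point of 3 busy + 1 waiting come from the example above, and the assumption that one item finishes per tick while downstream pulls nothing is mine.

```python
def effective_specify_limit(shared_limit: int, waiting: int) -> int:
    """Slots left for "specify" when `waiting` items sit in "complete"."""
    if not 0 <= waiting <= shared_limit:
        raise ValueError("waiting must be between 0 and the shared limit")
    return shared_limit - waiting


def simulate_blocked_shared(shared_limit: int = 4, busy: int = 3,
                            waiting: int = 1,
                            ticks: int = 5) -> list[tuple[int, int]]:
    """Trace (busy, waiting) per tick while downstream pulls nothing."""
    trace = [(busy, waiting)]
    for _ in range(ticks):
        # Assumption: one busy item finishes per tick and moves to "complete".
        # The move never violates the shared limit, because busy + waiting
        # stays the same.
        if busy > 0:
            busy -= 1
            waiting += 1
        # New work may only be pulled into slots not taken by waiting items.
        while busy < effective_specify_limit(shared_limit, waiting):
            busy += 1
        trace.append((busy, waiting))
    return trace


print([effective_specify_limit(4, w) for w in range(5)])
print(simulate_blocked_shared())
```

In this sketch, the number of busy items steps down by one on every tick instead of staying at full speed until the stall, and the system comes to rest holding 4 items rather than 6.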

With this in mind, let’s reconsider our scenario from the previous topic:

1. Something is going wrong in the design process, but nobody knows it yet.
2. The specify-complete queue starts to back up, thereby throttling down the WIP limit for specify. A resource is freed as a result, who should now inquire into the cause of the backup, which may only be random variation. The code process continues to complete work and pull from the existing backlog.
3. Code state begins to starve and specify state throttles down another level. Two more people are released as a result. There are more than enough free resources now to either fix the problem or shut down the process.
4. The stall completes by flushing out the specify and code states.

It still takes a while for the system to stall completely. The difference is that it begins stalling immediately, and when it does stall, it stalls with less WIP. For equivalent throughput, this pipeline should operate with fewer kanban and less variation in WIP, and therefore should have smoother flow and shorter lead times. It should respond faster to problems and free up resources earlier to correct those problems.

These shared completion queues might be the most common type of workflow queue. There are a couple of other types that we use, and we’ll take a look at those in a future post.

Queue utilization is a leading indicator

I talk a lot about how to apply Lean ideas to software development. Perhaps I sometimes take it for granted that we understand why we should apply them. Mary Poppendieck has already written quite a bit on that rationale, and I try not to rehash things I think she’s already covered adequately. I do think there are a few characteristic scenarios where Lean principles most clearly apply to software development:

  • Any kind of live network service, whether customer-facing (Google.com, Amazon.com) or machine-facing (Bigtable, SimpleDB)
  • Any kind of sustaining engineering process: bug fixing, security patching, incremental enhancement
  • Evolutionary product design (which is to say, effective product design)

That said, there is a very pragmatic reason to adopt a Lean workflow strategy, regardless of what sort of product you are building: Lean scheduling provides crystal clear leading indicators of process health.

I am speaking of kanban limits and andon lights.

Work in process is a leading indicator


For a stable workflow, lead time is a function of both throughput (how much stuff we complete every day) and work-in-process. For a given rate of throughput (with everybody busy at their jobs), an increase in WIP necessarily means an increase in lead time.

It’s simple cause and effect: an increase in WIP today will mean an increase in the time to deliver that work in the future. As far as leading indicators go, this one’s rock solid. You can’t take on more work than you have the capacity to do without taking longer to do it.
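One common way to make this relationship precise is Little's law: in a stable system, average lead time equals average WIP divided by average throughput. The post does not name a formula, so treat the choice of Little's law as my assumption about which function is meant. The function name `lead_time` and the numbers below are made up for illustration.

```python
def lead_time(wip: float, throughput: float) -> float:
    """Average lead time under Little's law: WIP / throughput.

    `wip` is the average number of items in process, and `throughput`
    is the average number of items completed per unit of time.
    The relationship only holds for a stable workflow.
    """
    if throughput <= 0:
        raise ValueError("throughput must be positive")
    return wip / throughput


# Hypothetical team completing 2 items per day.
print(lead_time(wip=20, throughput=2.0))  # days, with 20 items in process
print(lead_time(wip=40, throughput=2.0))  # days, after WIP doubles
```

With throughput held constant, doubling WIP doubles the lead time, which is why capping WIP by policy is the easier of the two levers to pull.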

A simple management technique is to simplify the problem with policy. If lead time is a function of both throughput and WIP, and you can hold WIP near constant by an act of policy, then you can begin to address the more difficult problem of throughput. WIP is relatively easy to control, because somebody in your business should have the power to approve or deny starting on a new work order. Throttling work orders is a much easier problem than learning how to work faster.

This is effectively the result of a Drum-Buffer-Rope system, or its Lean cousin, a kanban system. Only after you get the simpler variable under control can you begin to make consistent progress on the more difficult one.

If we have a well-defined workflow, then the total work-in-process is the sum of the WIP of all of the parts of that workflow. Limiting the total WIP in the system can still mean quite a bit of variation in the distribution of WIP between the parts of the system. Our next step after limiting total WIP will be managing that component WIP more closely, and it turns out that some parts of that component WIP are more sensitive predictors of lead time than others.

Which is to say that, given the same root cause, some inter-process workflow queue will go from 2 to 4 long before the global WIP would go from 20 to 40 if it were unregulated. If you set your system up right, one or more of those internal queues will telegraph problems well before they manifest elsewhere.
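A small sketch shows why the internal queue is the more sensitive signal. The state names and counts below are hypothetical. They are chosen only so that the total starts at 20 and one queue goes from 2 to 4, as in the paragraph above. The helper `growth` is likewise just an illustration.

```python
def growth(before: int, after: int) -> float:
    """Ratio of a WIP count after a disturbance to the count before it."""
    if before <= 0:
        raise ValueError("need a positive starting count")
    return after / before


# Hypothetical WIP per state before and after a downstream problem begins.
before = {"specify": 6, "specify-complete": 2, "design": 6, "code": 6}
after = {"specify": 6, "specify-complete": 4, "design": 6, "code": 6}

queue_growth = growth(before["specify-complete"], after["specify-complete"])
total_growth = growth(sum(before.values()), sum(after.values()))

print(queue_growth)  # the internal queue has already doubled
print(total_growth)  # the global WIP has barely moved
```

In this made-up case the queue has doubled while total WIP has grown by only a tenth, so a limit on the queue would trip long before anyone noticed a change in the global number.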

Development workflows need buffers


The irregularity of requirements and the creative, knowledge-intensive nature of a design activity like software development rule out clocked workflow synchronization. Sometimes the interface to something will be simple, but the algorithm behind it will not. Sometimes the opposite is true. Sometimes an apparently simple design change has wide-reaching effects that require reverification and a lot of thinking about edge cases. Risk and uncertainty are built into the nature of development work. Novelty is what gives software its value, so you can only get so far in reducing this kind of variation before you have to mitigate and adapt to it. Abandoning takt time for development work has been our big concession to the messy reality, although we still look for opportunities to introduce a regular cadence at a higher scale of integration. Of course, we’d be delighted and astounded to hear of anybody making a takt time concept work.

Instead, we have to use small inventory buffers between value-adding processes in order to absorb variation in the duration of each activity across work items. We allocate kanban to those buffers just like anywhere else, and those kanban count towards our total allocation. Making the buffers random-access makes them even more flexible in absorbing process variation.

What is this inventory? Specifications that have not been implemented. Designs that have not been reviewed. Code that has not been tested and deployed. You can measure things like “weeks of specs-on-hand” and “percentage of specs complete.” The higher that first number is, the lower the second one probably is. For orgs that carry months’ worth of specs at a time, that second number can quickly converge on zero. So don’t do that! If you’re carrying more than a few weeks’ worth of detailed specifications at a time, ask yourself: why? What are you going to do with them? Specification inventory is a liability just like any other kind of inventory.

So we’re carrying a few hours’ or days’ worth of inventory at a time, because it’s still faster than the alternatives of generalist labor or pipeline congestion. And to be clear, when I’m talking about carrying kanban inventory, I’m talking about hours or days, not weeks or months. And I like hours a whole lot better than days.

The joy of complementary side effects


Agile development has long rallied around the “inspect and adapt” paradigm of process improvement. It is a philosophy that it shares with its Lean cousin. But early Agile methods built their model of feedback around the notion of velocity, and velocity is a trailing indicator. Velocity, and even lead time, can only tell you about things that have already happened.

To be fair, all Agile methods include higher-frequency feedback in the form of the daily standup. But a qualitative assessment is not the same as a quantitative indicator. Done well, the right measure can tell you things that people in a conversational meeting either can’t see or won’t admit to. An informal, qualitative, Scrum style of issue management leads to confusion between circumstantial and systemic problems, and the obstacle-clearing function of the Scrum Master often leads to one of Deming’s “two mistakes”. But then, Deming might have taken exception to a number of beliefs and practices common to today’s Agile practitioner. That’s okay, we Planned and we Did, and now we are Studying and Acting.

The regulating power of the in-process inventory limit is that it tells you about problems in your process while you are experiencing the problem. You don’t have to extract a belated confession from a stubborn problem-solver or wait for the end of the month to have a review in order to notice that something went wrong. You watch it going wrong in front of your eyes as it happens.

In a kanban workflow system, inter-process queues start backing up immediately following any blockage in their downstream processes. If your team is all working within a line of sight of a visual control representation of that inventory, then you all see the problem together as it manifests. A backed-up queue is not a matter of opinion and the consequences are highly predictable.

Making the indicator work for us


If we’re using a kanban system, we have the WIP limit indicator at our disposal. How can we use this to our advantage?

Under normal conditions of smooth flow, the kanban queues should be operating below their limits. Which is to say, the system has some slack. Slack is good, and optimum flow means “just enough slack.” The limits for the queues are set according to a different rule than the limits for value-added states. Buffer states are non-value-added processing time, so we want to make them as small as we can. The queues are there for the purpose of smooth flow. Make them too big, and they just increase inventory and lead time. Make them too small, and they cause traffic jams, which also increase lead time. So there’s a “just right” size for kanban queues, and that is as small as possible while still flowing without a stall X% of the time. Since the queue size is a tradeoff, there is an optimal value for X which is less than 100. The difference between X and 100 is your expectation of process improvement, which will be triggered by the occasional stall event. So our process has slack, but our slack doesn’t. When we run out of slack, we want to stop what we’re doing and try to learn how to operate with less slack in the future.

A healthy state of affairs. A lot of working, not much waiting. When the next analysis task is done, there will be room to store the result, even if design is busy. Design is not under any particular pressure to complete something…yet. But conditions can change quickly, so no excuse to dawdle!

Since our system is a pull system, our process breaks down in a characteristic way. When a queue fills up, there’s nowhere for the output of the process before it to go, so that process will begin to back up itself, and so on, until the entire pipeline in front of the jam eventually stops while the remainder of the pipeline flushes itself out. Good! That’s what we want. Every process in the system serves as a throttle for its predecessor. That means that the system as a whole is regulated by the health of its parts. Shortly after any part of the system starts to go wrong, the entire system responds by slowing down and freeing up resources to fix the problem. That automatic reflection of process health is a powerful mechanism for continuous improvement.

Let’s walk through a typical failure mode:

1. Something is going wrong in the design process, but nobody knows it yet. The senior devs are all sick with the flu. Nobody signals the andon light because they’re at home, or they have other problems on their minds.
2. The analysts, who are in a different hallway, seem immune and continue to complete their assignments. At this point, the process is already signaling that something is amiss.
3. The analysts start up their next tasks anyway. The pipeline to the right of design continues on processing from its own queue.
4. There’s nowhere for the analysts to put their completed work, so now they are also stalled. The right side of the pipeline has flushed out whatever work was already in process and now they are idle as well. The ready queue has backed up, and so the whole pipeline is now stalled.
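The four steps above can be sketched as a toy simulation. Everything specific in it is an assumption made for illustration and is not taken from the original slides: the stage names, the limit of 2 per stage, the starting counts, the one-item-per-tick rate, and the function name `simulate_stall`.

```python
def simulate_stall(ticks: int = 6) -> list[tuple[int, dict[str, int], int]]:
    """Return (tick, WIP per stage, items moved) while design is blocked."""
    names = ["analysis", "design_queue", "design", "code_queue", "code"]
    limits = [2, 2, 2, 2, 2]   # hypothetical kanban limit per stage
    counts = [2, 1, 2, 1, 2]   # hypothetical starting WIP
    blocked = {"design"}       # the failing process completes nothing
    history = []
    for tick in range(1, ticks + 1):
        moved = 0
        # Walk right to left so downstream pulls free up room first.
        for i in reversed(range(len(names))):
            if names[i] in blocked or counts[i] == 0:
                continue
            is_last = i == len(names) - 1
            # An item may only move if the next stage is under its limit.
            if is_last or counts[i + 1] < limits[i + 1]:
                counts[i] -= 1
                if not is_last:
                    counts[i + 1] += 1
                moved += 1
        # The backlog feeds analysis whenever analysis has a free kanban.
        if counts[0] < limits[0]:
            counts[0] += 1
            moved += 1
        history.append((tick, dict(zip(names, counts)), moved))
    return history


for tick, wip, moved in simulate_stall():
    print(tick, wip, "moved:", moved)
```

In this sketch the design queue reaches its limit on the very first tick, which is the early signal, while the pipeline as a whole does not stop moving until tick 4, after the code stage has flushed out its remaining work.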

With no intervention other than enforcing the kanban allocation, the system spontaneously responds to problems by shutting itself down. This would be an example of jidoka applied to our development workflow. The people who are idled by this process can and should spend their time looking into the root cause of the problem, either to mitigate it (if it is a special cause) or to prevent it from happening in the future (if it is a common cause). You can’t really predict when the design team will get sick, so in this case, perhaps the analysts and junior devs can work together and complete some of the design tasks until the missing devs get back to health. In this case, it may be an opportunity to discover if the team is sufficiently cross-trained to cover the gap and ask questions about roles and responsibilities.

Even though the problem is self-limiting by step 4, we already know in step 2 that steps 3 and 4 are likely to happen if we don’t intervene. It would have been better if somebody had taken greater notice of the signal in step 2 and begun an investigation. It would also be nice if the system itself could respond both more quickly and more gracefully than in this example.

In the next article, we’ll look at another queueing method that will allow us to simultaneously reduce lead times, smooth out flow, and respond more quickly and gracefully to disruptions.

Shaping Software

My good friend and impossibly prolific writer J.D. Meier has a new blog called Shaping Software, which promises to be a general review of software engineering patterns and practices. He’s currently riffing on evolutionary development and process engineering. His old blog was already a terrific resource, but the new one promises to be even better.
