August 2007

Kanban systems for software development

(part 4 in the series including 1, 2, 3, 4)

The pipeline model shares a common problem with network model scheduling. Variation in product development activities is simply too hard to control. Pipelines and network models can be made to work by adding a lot of time padding, and indeed, we started to resort to Critical Chain methods to try to make the pipeline work.

We can do something to manage the variation in work orders by controlling coupling and by specifying requirements according to a well-defined schema that limits complexity. Those things help, but they give us only order-of-magnitude levels of control.

Even if we could control work order size to, say, a factor of two, we’d still have the problem of variation within the workflow. Some requirements may be easy to design, but devilishly difficult to test, perhaps for instrumental reasons. There might be a simple and elegant design that almost matches the requirement and a monstrously complex design that exactly matches it. How long something takes might depend on who gets the assignment. High value problems are uncertain. Uncertainty is risk. Sometimes risk doesn’t go your way.

So a viable solution will have to do its best to control variation while still operating within the reality-based paradigm. Given the challenge, it’s unsurprising that craft production is attractive to practitioners, but I am certain that we can do better.

If you take a horizontal slice across the cumulative flow diagram of a development process, you get the sequence of the workflow for a particular work order. If you take a vertical slice, you get a snapshot of all of the current work-in-process. Curiously, with our pipeline model, these two sequences are the same:

[Image: task_leveling_2.png]

Not only are they the same, they’re always the same. A consequence of our pipeline design is that it strictly limits work-in-process according to the proportions of the workflow. That’s good! That’s what we want.

Is there a way to more directly control work-in-process that allows for more variation than clock synchronization?

WIP and Flow

Imagine that we make a pooled workcell, where each station represents one step in the logical workflow of our hypothetical feature design process.

Each station has a work-in-process limit that corresponds to the time proportion of that state in the workflow. The limit governs the maximum number of work items that can be in that state at any instant. In this case, the analyze and test states have limits of 2. The design state has a limit of 3, and build and deploy have limits of 1.

If a state is below its limit, it may take possession of a work item from the preceding state once that state has completed it. If a state is at its limit, it must wait for one of its own items to be completed and pulled into a downstream state before it can pull another item from upstream.
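To make that pull rule concrete, here is a minimal sketch in Python. The state names and WIP limits are those of the hypothetical workcell above; the classes and structure are my own invention, not any particular tool.

```python
# Sketch of the WIP-limited pull rule: a state accepts a completed item
# from upstream only while it is below its own work-in-process limit.
# analyze(2) -> design(3) -> build(1) -> test(2) -> deploy(1)

class State:
    def __init__(self, name, wip_limit):
        self.name = name
        self.wip_limit = wip_limit
        self.in_progress = []   # items being worked in this state
        self.done = []          # items finished here, awaiting a downstream pull

    def can_pull(self):
        return len(self.in_progress) < self.wip_limit

    def pull_from(self, upstream):
        """Take one completed item from the upstream state, if allowed."""
        if self.can_pull() and upstream.done:
            self.in_progress.append(upstream.done.pop(0))
            return True
        return False

states = [State(n, w) for n, w in
          [("analyze", 2), ("design", 3), ("build", 1),
           ("test", 2), ("deploy", 1)]]

# Example: analyze has finished two items; design (limit 3) may pull both,
# whereas build (limit 1) could have pulled at most one.
states[0].done = ["F1", "F2"]
while states[1].pull_from(states[0]):
    pass
print(states[1].in_progress)
```

Note that nothing pushes: each transfer happens only when the downstream state asks for it.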

Without any further intervention, this design is already far less susceptible to stalls than any of our synchronized designs. If 2 out of 3 work items in a workflow state exceed the control limit on completion time, the rest of the system will continue to function. Even if an entire state locks up, it will take time for the upstream states to back up and the downstream states to starve. In the meantime, management will have ample warning to intervene. Flow will resume naturally once capacity is added or the obstacle is cleared.

Buffers in space vs. time

Most future events in a project plan will occur according to the long-tailed probability distribution that you ought to have tattooed on your other eyeball. The fundamental strategy for managing uncertainty in any cause-and-effect process is the buffer. Buffers in network model schedules take the form of time padding. Thinking about a flow model might lead us to think about other kinds of buffers.

Every buffer is waste of some form or another, but some wastes are easier to control than others, and sometimes exchanging one waste for another is still an improvement to the whole. Schedule buffers represent the waste of delay. With a pull system, we might introduce small inventories to smooth out the flow between adjacent processes and reduce delays due to congestion.

Kanban buffers

A Kanban queue is a small inventory between two processes to create the appearance of instant availability to the downstream process. Think about the stock on the shelf at the grocery store. When the shelf space is empty, this signals the grocer to replenish it. The inventory on the shelf can be very small if the grocer replenishes frequently. In fact, this is where the kanban idea comes from in the first place.

In a production process, a kanban buffer signals an upstream state to produce work only when there is actual demand for it. The productivity of downstream processes regulates the productivity of upstream processes, and this kind of regulation is called a pull system. At the moment a downstream process consumes a component from an upstream process, the upstream process begins production of a replacement. The kanban queue itself has a limit, so that if the queue fills up, the upstream producer will halt. The simple case is a queue limit of one.
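Here is a minimal sketch of that replenishment signal, using the simple case of a queue limit of one. The class and method names are illustrative assumptions, not any real library.

```python
# Sketch of a kanban buffer with a limit of one: the upstream producer
# works only while the buffer has an empty slot.

from collections import deque

class KanbanBuffer:
    def __init__(self, limit=1):
        self.limit = limit
        self.items = deque()

    def has_demand(self):
        # An empty slot is the signal to the upstream producer.
        return len(self.items) < self.limit

    def put(self, item):
        assert self.has_demand(), "producer must halt when the queue is full"
        self.items.append(item)

    def take(self):
        # The moment downstream consumes, demand reappears upstream.
        return self.items.popleft()

buf = KanbanBuffer(limit=1)
buf.put("component-1")       # buffer full: upstream halts
assert not buf.has_demand()
consumed = buf.take()        # downstream pulls; replenishment is triggered
assert buf.has_demand()
buf.put("component-2")
```

The `assert` in `put` is the halt condition: a full queue stops the upstream producer until the downstream process consumes again.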

The pipeline model forced process states to produce at the same rate and to release work at the exact moment that the next state needed it. By removing the synchronization clock, adjacent states can get out of step. Rate control can still be managed by setting WIP limits. The timing of work transfers can be managed by kanban buffers.

One of the nice things about not being a manufacturing process is that you are not constrained by the limits of physical space. Our kanban container can be infinitely large and occupy no space. Items can enter and exit in any order. In a manufacturing process, the kanban buffer will most likely be a first-in-first-out queue. A development workflow need not observe any such ordering. It might be advantageous to pick the dequeuing order according to local conditions visible to the people on the line.

However, it is still very important to control the size of the buffers. The ideal buffer size is 0, because all buffers are waste. But if a buffer is necessary for synchronization, then the ideal size is one. If the buffer size grows much bigger than one, then you might consider adjusting downward the WIP limit of the upstream state instead. Remember that the one and only goal of the kanban buffer is to create the appearance of instant availability downstream.

Decoupling buffers

There’s another kind of inventory buffer that we might need to keep things flowing. Some resources may have batch sizes greater than one, or have less-than-immediate availability. We don’t want to burden the upstream state with managing such a resource, so we can add a feeding buffer in front of the resource that still looks like a kanban buffer to the upstream state. Goldratt has more to say about this kind of buffer and we may come back to it sometime.

Are we there yet?

We had to take a stroll from the real through the imaginary in order to get back to the real again. But we are, in fact, back to reality, because the ideas described in this post are being applied at Corbis every day. And it works!

A perfect state of flow may be very difficult, or at least uneconomical, to achieve in a robust product development process. But we can get pretty close with a well-tuned kanban pull system. We have managed to combine most of the flexibility of craft production with most of the control of a pipeline. Work-in-process is limited, and cycle time can be managed. Most importantly, it is a highly transparent and repeatable process with all of the right conditions for continuous improvement.

And continuous improvement is really what this is all about.


In search of one piece flow (part 2)

(part 3 in the series including 1, 2, 3, 4)

In the previous post we considered the notion of synchronizing a design workflow like an assembly line in order to realize our ideal case. A little skepticism about such an idea is surely justified, and the simplest interpretation of that idea has some undesirable consequences. Nonetheless, the idea still offers a lot of room to explore, and I am the curious sort, so let’s see how far we can take it!

Task Leveling

It is natural to want to align a design workflow with the logical boundaries of the activities involved. However, if we are trying to synchronize work, it is unlikely that the logical boundaries of tasks will align well with the clock:

[Image: task_leveling_1.png]

…which means that people will always be waiting for the bottleneck to finish its work:

[Image: task_leveling_5.png]

It should be possible (even desirable) to break large activities into smaller, similarly-sized pieces:

[Image: task_leveling_2.png]

You might even interleave some of the activities in order to smooth out the flow of information from one brain to another:

[Image: task_leveling_3.png]

If the variation in the completion time of each of the tasks is under control, then the pipeline can flow.

A long pipeline of small steps will carry a lot of work-in-process. The cost of a pipeline stall will be lower, but the probability of a stall will be higher. Considerable slack may be needed to buffer variation in the cycle time for component tasks. However, related tasks can be combined into task groups:

[Image: task_leveling_4.png]

…where the task group is internally self-organized and externally synchronized:

[Image: task_leveling_6.png]

By recombining things in such a way, we can also apply Critical Chain-style buffering to each task group, in order to reduce the total amount of buffering required to keep things moving.

Concurrent Pipelines

If all work is moving through a single pipeline, then a stall in that pipeline will disrupt everything. The penalty for a pipeline stall is reduced if there is more than one pipeline. Additionally, a single pipeline can only carry one pipeline’s worth of capacity. We can expand capacity and smooth out disruptions at the same time by adding a second pipeline:

[Image: pipelines_1.png]

The capacity of a single unstalled pipeline will be 100% minus whatever buffering is needed to optimize stalls vs. slack time; suppose full capacity is 80%. If a lone pipeline stalls, capacity drops to 0%. If one of two pipelines stalls, capacity is still 40%. With one of three stalled, it is 53%, and so on. And if there is a 25% chance of a stall in any clock tick, there is only about a 6% chance of both of two pipelines stalling at once.
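Assuming stalls are independent and each unstalled pipeline runs at 80%, the arithmetic in this paragraph is easy to check:

```python
# Checking the capacity figures above: each unstalled pipeline runs at 80%
# of nominal, and stalls are assumed independent with probability p per tick.

full = 0.80  # capacity of one unstalled pipeline

def capacity(total, stalled):
    """Aggregate capacity when `stalled` of `total` pipelines are down."""
    return (total - stalled) / total * full

print(f"{capacity(2, 1):.0%}")    # one of two pipelines stalled
print(f"{capacity(3, 1):.0%}")    # one of three pipelines stalled

def p_all_stalled(n, p=0.25):
    """Probability that all n pipelines stall in the same clock tick."""
    return p ** n

print(f"{p_all_stalled(2):.0%}")  # both of two stalling at once
```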

Management overhead will scale linearly for a while. Will management overhead eventually scale to a point where a different organization is more efficient? Most likely.

We’re getting pretty creative with our efforts to make this pipeline idea work! I don’t know if we’ll ever be able to control work order variation enough to make this viable, but we’ve certainly identified some ideas that are worth exploring further. Next time, we’ll relax the requirement for synchronization and look at more ideas about using buffers to smooth out the variation between tasks and work orders.


In search of one piece flow

(part 2 in the series including 1, 2, 3, 4)

In our ideal case, I set out a goal to partition incoming requirements into similarly-sized pieces and run them through a one piece flow process through to integration and deployment. I concluded with the question:

a team of experts will somehow have to work together and coordinate with one another to make all of this happen without tripping over one another’s feet…how, exactly, are we going to do that?

There are a few ways to go about this, so let’s start with something simple, see where that falls short, and work our way up to a better solution. Let us also assume that there is a common workflow that will be applied to each work request.


Craft Production

Imagine a small team of generalists. The work-in-process limit is made equal to the size of the team. As new work orders appear in the incoming queue, each idle team member will take ownership of one work order until there are no pending work orders or no idle team members. Each assigned team member applies the workflow to one requirement, continuously, until the requirement is integrated and deployed. A team member may only own one work order at a time. Upon completion, the team member then returns to the idle pool for reassignment.

[Image: craft.png]

Pro:

  • incoming work-in-process is controlled
  • defined workflow is possible
  • pull is possible
  • one piece flow is possible
  • variation in the size of work orders is buffered

Con:

  • generalists are slower than specialists
  • competent generalists are rare
  • knowledge transfer is hindered
  • standardized work is hindered
  • quality is inconsistent
  • accountability is limited
  • process improvement feedback is limited

There are pros and cons to this way of working, but the bottom line is that craft production is not lean production. Software development under this model is unlikely to qualify as software engineering.
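For concreteness, the craft-production assignment rule can be sketched in a few lines. The worker names and work order IDs here are, of course, made up.

```python
# Sketch of the craft-production rule: each idle generalist owns at most one
# work order, pulled from the queue until workers or pending orders run out.

from collections import deque

def assign(idle_workers, queue):
    """Pair idle workers with pending work orders; return the assignments."""
    assignments = {}
    while idle_workers and queue:
        worker = idle_workers.pop(0)
        assignments[worker] = queue.popleft()
    return assignments

idle = ["ann", "bob", "cho"]
orders = deque(["WO-1", "WO-2", "WO-3", "WO-4"])
result = assign(idle, orders)
print(result)  # WIP limit == team size, so WO-4 stays queued
```

The WIP limit is implicit: it is simply the size of the team, exactly as described above.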


Feature Crew

If it is not possible to assemble enough generalists to implement effective craft production, then perhaps small multidisciplinary teams could work. A Feature Crew contains a small set of workers with complementary skills. Work orders are defined in such a way to engage the team for a few days or a few weeks. The work-in-process limit is made equal to the number of teams available. As new work orders appear in the incoming queue, each idle feature team will take ownership of one work order until there are no pending work orders or no idle teams. The assigned feature team applies the workflow to the requirement, continuously, until the requirement is integrated and deployed. The feature team may only own one requirement at a time. Upon completion, the feature team then returns to the idle pool to be reassigned or recombined.

[Image: craft.png]

Pro:

  • incoming work-in-process is controlled
  • defined workflow is possible
  • pull is possible
  • one piece flow is possible
  • variation in the size of work orders is buffered
  • division of labor

Con:

  • specialists are hoarded
  • resources are underutilized
  • knowledge diffuses slowly
  • standardized work is limited
  • quality is inconsistent
  • process improvement feedback is limited

Feature crews have most of the advantages of solitary craft production, but fewer disadvantages. Within the feature team, people will self-organize around the workflow. Resource utilization will be lower than in the simple craft model, but the productivity loss will be offset by specialization and division of labor. Knowledge transfer will be greater if teams are periodically recombined.

These craft production approaches are essentially the domain of Agile development. While we could continue to explore the possibilities by evaluating various Agile methods, we will still be in the domain of craft production, so let’s back up and try a different approach altogether.


Synchronized Workflow

The principle, Schedule is Orthogonal to Workflow, suggests that there are two fundamental approaches to partitioning work: by schedule or by workflow. Traditional project management schedules large work orders and aligns resources by workflow. Agile/craft methods schedule small work orders and align resources by schedule. Why doesn’t anybody try to schedule small work orders and align resources by workflow? I don’t know…so let’s try it!

Imagine that you have a small cross-functional team. There is one specialist for each step in the workflow, a classic division of labor. Work-in-process is limited to the number of steps in the workflow, and hence to the number of team members. The work is synchronized according to a clock. In other words, this is a discrete pipeline. At the first clock tick, a work order is pulled from a queue and placed in the first processing state. At the second clock tick, the first work order moves to the second processing state and a new work order is started. Once the pipeline is full, each clock tick completes one work order, begins a new one, and advances all work-in-process to the next step.
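A minimal simulation of such a discrete pipeline might look like this (the stage count and work order IDs are illustrative):

```python
# Sketch of the synchronized pipeline: on each clock tick the last stage
# completes its order, everything shifts one stage downstream, and a new
# order (if any) enters the first stage.

from collections import deque

def tick(pipeline, queue, completed):
    """Advance the synchronized pipeline by one clock tick."""
    if pipeline[-1] is not None:
        completed.append(pipeline[-1])                 # last stage finishes
    for i in range(len(pipeline) - 1, 0, -1):
        pipeline[i] = pipeline[i - 1]                  # shift downstream
    pipeline[0] = queue.popleft() if queue else None   # start next order

stages = [None] * 4      # e.g. analyze, design, build, test
backlog = deque(f"WO-{n}" for n in range(1, 7))
done = []
for _ in range(6):
    tick(stages, backlog, done)
print(stages, done)
```

Notice the pipeline's fragility: every stage must be ready to hand off at the tick, which is exactly the synchronization burden discussed below.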

In order for this to work efficiently, there must be limited variation in the size of work orders, limited variation between the durations of the different processing states for any work order, and limited variation within any specific processing state. None of those conditions is likely to hold in any kind of creative process. Synchronization will only be possible if you set the clock interval proportional to the worst-case duration of any of the processing states. That would give you control, at the cost of an enormous waste of delay.

Pro:

  • incoming work-in-process is controlled
  • standardized work is possible
  • pull is possible
  • division of labor
  • knowledge transfer
  • transparency and accountability
  • process improvement feedback

Con:

  • variation in the size of incoming work orders is difficult to control
  • process variation will cause delay and/or inventory between workflow steps
  • resources will thrash between underutilization and overutilization
  • does not accommodate design iteration
  • pipeline stalls disrupt flow

The fundamental problem with synchronized workflow is that there is too much variation in product development work to be strictly synchronized. But the idea of aligning resources by workflow has much more potential than this simplistic case. We’ll explore those ideas further in the next post.


An ideal case

(part 1 in the series including 1, 2, 3, 4)

Let’s imagine an ideal scenario for software development.

In this scenario, there are some users who have real needs that you can identify. Further, some of these users are paying customers who will gladly give you money if you can deliver value to them. You can express their needs as a set of criteria to be satisfied, and these criteria can be measured. Your customers bring you their business because you promise not only to identify what they want and build them a solution, but also to deliver that solution quickly.

The flow of information is something like:

latent demand -> characterized demand -> value-adding design -> production -> deployed solution

…and your goal is to have smooth and continuous flow through this process, gently accelerating for all eternity. The scope of the demand you can address and the supply you can deliver will continue to grow as long as your own capability continues to grow (remember that we’re talking about an ideal scenario!).

That is only a very high-level description. What might a detailed process to realize such a system look like? Imagining in detail how our ideal scenario might work may also give us some ideas about what might be possible in real life (in TRIZ, this practice is called the Ideal Final Result).

As information flows through the system, we must have some representation of it:

   properties that we can observe and measure about customer utility
-> properties that we can observe and measure about the product
-> functions that the product will perform
-> mechanisms that realize the functions that the product will perform
-> processes that produce the mechanisms that...

…or…

   what does the user want?
-> what will the product do?
-> how will the product do it?
-> how will we build the product?

There are many such representations, but let’s use:

  • use case: a description of how the product will be used, in the context of the user
  • functional requirement: an operational definition of what the product will do
  • design parameter: an operational definition of how the product will implement a functional requirement
  • constraint: some limitation on what design parameters may be chosen
  • process variable: an operational definition of how a design parameter will be produced

Because we always want to deliver new value quickly, we want to limit the amount of work that we take on at one time. The smallest amount means one work request. But one of what? Since we are value-oriented, we will pick use cases, since that speaks directly in user terms.

A use case describes the value that the product delivers to the user, roughly by telling a story about how the product will be used. A use case is a structured story, and may make reference to, or be composed of, other use cases. A use case will also make reference to functional requirements, as a description of the role of the product in the user’s story.

A composite use case might be large, so we will decompose new large use cases until they no longer contain or reference other use cases. Such an atomic use case is then a candidate to schedule for development. A goal for our descriptions of atomic use cases is that they should all be of a similar size. A use case should be testable and traceable to customer satisfaction criteria. A use case should say as much as possible about the user’s needs, expectations, and goals, and as little as possible about the design of the product. For convenience, let’s call such an atomic use case a feature. A feature is the simplest practical expression of: what does the user want?
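As a sketch of that decomposition rule, here is one way to flatten a composite use case into atomic features. The dictionary structure is a hypothetical representation, not a real schema.

```python
# Sketch of the decomposition rule: split composite use cases until only
# atomic use cases (features) remain.

def atomic_features(use_case):
    """Yield the atomic use cases (features) contained in a use case."""
    subs = use_case.get("contains", [])
    if not subs:
        yield use_case["name"]      # atomic: a candidate feature
    else:
        for sub in subs:
            yield from atomic_features(sub)

checkout = {
    "name": "check out",
    "contains": [
        {"name": "review cart"},
        {"name": "pay", "contains": [{"name": "enter card"},
                                     {"name": "confirm payment"}]},
    ],
}
features = list(atomic_features(checkout))
print(features)
```

Each yielded leaf is a candidate to schedule for development, one at a time.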

So our system (so far) consists of a process for identifying and describing features, and then scheduling them, one at a time, for further development. The old (and broken) way of development might accumulate a long list of such features, and then give them to somebody to analyze to produce another long list enumerating what will the product do? Which in turn would be given to somebody else to design, and so on.

But that’s not what we will do. As soon as we identify any new feature, we immediately enumerate a short list describing what will the product do? Then we will get right to work on a corresponding list describing how will the product do it? Then we immediately produce a description of how will we build the product? Then we build it.

In other words, we practice depth-first design.

Now, it takes a lot of expertise to design a product of any significance, so a team of experts will somehow have to work together and coordinate with one another to make all of this happen without tripping over one another’s feet. It’s one thing to say that we’re going to deliver one feature at a time, but how, exactly, are we going to do that?

And that question is precisely what makes this story interesting…


Total Design

In the 1970s and ’80s, Stuart Pugh developed a philosophy of product development which he called Total Design. This philosophy anticipated many of the values of Agile development and Design for Six Sigma, but in some ways it is still more advanced than either of those. Pugh is remembered most often for the set-based development method of Pugh Concept Selection, but there is much more to his philosophy than just that.

Principles of Total Design
  1. The user need/customer requirement/voice of the customer is paramount to the success or failure of the product
  2. All facets of a business need to be involved in (and interact with) the design core in parallel and not sequentially
  3. To satisfy the user need, rigorous systematic working is required throughout the design core using modern methods
  4. A product’s status needs to be assessed accurately before starting any new design
  5. Within systematic working, a cyclical process of synthesis/analysis/synthesis is necessary, brought to a satisfactory conclusion by the appropriate methods
  6. The most up-to-date elements of engineering, based on sound engineering principles, must be used as appropriate
  7. Total design teams must be multi-disciplinary, with sufficient expertise within the team, and sufficient diversity of experience
  8. Consideration must be given to a wide range of alternatives without prior commitment to any particular alternative
  9. The design team must repeatedly scrutinize and test the information and reasoning on which a design is based
  10. People performance is critical to total design performance
  11. Engineering principles are a vital subset of total design; they influence but do not necessarily relate directly to the user need
  12. To minimize the cycle time for completion of the design core (to minimize process losses), systematic working with modern methods and aids is required
  13. Total product quality is only achievable through total design

I know of no popular software development methodology that lives up to these ideals. So, we are trying to define one!

