Debugging code, McKinsey-style
The consulting firm McKinsey teaches all its analysts a concept called “Mutually Exclusive, Collectively Exhaustive” (MECE). It’s a way of breaking down a problem into a set of smaller problems that guarantees that the smaller problems contain the answer to the larger one.
Mutually exclusive means: each small problem doesn’t overlap with others; they can be analyzed independently.
Collectively exhaustive means: when you sum the small problems up, they equal the larger problem.
The classic example is business profit. Suppose John’s apple farm is no longer making as much money as it used to, and you want to know why.
One way of breaking down profit is into revenue and cost. This is a MECE breakdown, because revenue and cost can be analyzed independently, and revenue - cost = profit.
Therefore, it must be the case that if John’s profit is going down, the apple farm either has decreasing revenue, or increasing cost (or both).
Let’s suppose we check, and it turns out revenue went down. Well, one way of breaking down revenue is revenue = number of units sold * unit price. Therefore, it must be the case that John is selling fewer apples, or he’s selling them for less money (or both).
And so on, until you’ve figured out what’s going on!
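To make the apple-farm breakdown concrete, here is a minimal Python sketch. The figures and the names (Year, explain_profit_drop) are invented for illustration; the only things taken from the example above are the two identities, profit = revenue - cost and revenue = units sold * unit price.

```python
from dataclasses import dataclass


@dataclass
class Year:
    units_sold: int
    unit_price: float
    cost: float

    @property
    def revenue(self) -> float:
        # revenue = number of units sold * unit price
        return self.units_sold * self.unit_price

    @property
    def profit(self) -> float:
        # profit = revenue - cost
        return self.revenue - self.cost


def explain_profit_drop(before: Year, after: Year) -> list[str]:
    """Walk the breakdown top-down and return the leaf-level causes."""
    causes: list[str] = []
    if after.profit >= before.profit:
        return causes  # profit did not drop, nothing to explain
    # Level 1: profit splits into revenue and cost.
    if after.cost > before.cost:
        causes.append("cost increased")
    if after.revenue < before.revenue:
        # Level 2: revenue splits into units sold and unit price.
        if after.units_sold < before.units_sold:
            causes.append("fewer units sold")
        if after.unit_price < before.unit_price:
            causes.append("lower unit price")
    return causes


# Made-up numbers for John's farm.
last_year = Year(units_sold=10_000, unit_price=0.50, cost=3_000.0)
this_year = Year(units_sold=8_000, unit_price=0.50, cost=3_000.0)
print(explain_profit_drop(last_year, this_year))  # ['fewer units sold']
```

Because each level of the breakdown is exhaustive, a drop in profit (with non-negative quantities) always produces at least one cause in the returned list.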
This is the same idea behind systematic debugging. When I get stuck on a bug, this is what I do:
First, I state the outcome that should be happening. (“When the user clicks confirm, their credit card should be charged”).
Then, I separate the outcome into mutually exclusive, collectively exhaustive parts:
1. When the user clicks the button, an event handler fires
2. When the event handler fires, we send a message to the server with the order
3. When the server gets a message with the order, it charges the credit card
Once I’ve attempted to break the problem down, there are two possibilities:
1. I was able to achieve a MECE breakdown. Good, I move on to the next step.
2. I could not figure out how to achieve MECE. This means one of two things:
2a. The system, as designed, does not guarantee that the bug won’t happen. If that’s the case, I stop looking for the bug, and redesign the system: that will usually make the bug go away on its own.
2b. I don’t understand the system well enough to know how to break it down into MECE parts. If so, then the next step has to be understanding the system better, until I’m able to achieve MECE, or conclude that 2a is the case.
Next, once I’ve achieved MECE, I sort the pieces based on my intuitive probability that the issue is with that piece. I then go through each piece and perform an experiment to confirm whether or not that piece is functioning correctly.
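Here is a rough sketch of that step in Python, using the three pieces from the checkout example. Everything in it is hypothetical: Piece, find_broken_piece, the suspicion numbers, and the fake experiments, which stand in for whatever real check you would run (a log line, a breakpoint, a network trace).

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Piece:
    name: str
    suspicion: float           # my intuitive probability that this piece is broken
    works: Callable[[], bool]  # experiment: True if the piece behaves correctly


def find_broken_piece(pieces: list[Piece]) -> Optional[Piece]:
    """Test pieces from most to least suspicious; return the first broken one."""
    for piece in sorted(pieces, key=lambda p: p.suspicion, reverse=True):
        if not piece.works():
            return piece
    return None  # nothing failed: the breakdown or an experiment was wrong


tested: list[str] = []  # records the order in which experiments ran


def fake_experiment(name: str, result: bool) -> Callable[[], bool]:
    # Stand-in for a real experiment; it just records that it ran.
    def run() -> bool:
        tested.append(name)
        return result
    return run


names = [
    "click fires event handler",
    "handler sends order to server",
    "server charges card",
]
pieces = [
    Piece(names[0], 0.2, fake_experiment(names[0], True)),
    Piece(names[1], 0.3, fake_experiment(names[1], False)),
    Piece(names[2], 0.5, fake_experiment(names[2], True)),
]

broken = find_broken_piece(pieces)
print(tested)  # ['server charges card', 'handler sends order to server']
print(broken.name if broken else None)  # handler sends order to server
```

In this made-up run my intuition was wrong about the most likely culprit, but that only costs one extra experiment; the ordering affects how fast I find the broken piece, not whether I find it.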
If I get through all the pieces, and haven’t found one that’s broken, then either:
1. I was wrong that my breakdown was MECE. I go back and double-check that I fully understand the system and didn’t miss a possibility. Often, this occurs when I’m taking something for granted, such as assuming that the underlying technology I’m building on works correctly (I found a bug in Google Chrome’s JavaScript evaluation engine once by doing this).
2. I made a mistake in thinking that one of the pieces was working, when it actually wasn’t, which I catch by backtracking and testing more thoroughly!
Once I’ve found the sub-component that’s broken, I’ve made progress! I discard my initial definition of what the bug is, and redefine the bug as the issue with that component. Then, I go back and start the process from the top, until I’ve found the root cause of the original issue.
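The "redefine the bug and start again from the top" loop can be sketched as a walk down a tree of components. This is only an illustration under simplifying assumptions: Component and root_cause are hypothetical names, the broken flag stands in for the result of a real experiment, and the two sub-parts under "handler sends order to server" are invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    name: str
    broken: bool  # stand-in for the outcome of an experiment on this component
    parts: list["Component"] = field(default_factory=list)  # its MECE breakdown


def root_cause(component: Component) -> list[str]:
    """Follow broken parts downward; return the path to the deepest one."""
    path = [component.name]
    current = component
    while True:
        broken_parts = [p for p in current.parts if p.broken]
        if not broken_parts:
            # Either this is a leaf, or no part failed its experiment, in
            # which case the breakdown or an experiment needs rechecking.
            return path
        # Redefine the bug as the issue with the broken part and repeat.
        current = broken_parts[0]
        path.append(current.name)


system = Component("card not charged on confirm", True, [
    Component("click fires event handler", False),
    Component("handler sends order to server", True, [
        Component("handler builds order payload", False),   # hypothetical
        Component("HTTP request reaches server", True),     # hypothetical
    ]),
    Component("server charges card", False),
])

print(root_cause(system))
# ['card not charged on confirm', 'handler sends order to server',
#  'HTTP request reaches server']
```

In real debugging the tree is not known up front; each level gets built only after the experiment above it has pointed at a broken piece.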
The cool thing about this debugging process is that it always works. 100% of the time. If it doesn’t work, it just means I made a mistake somewhere along the way, and I have to backtrack and find it.
In practice, I generally do this process in my head, much more intuitively than the rigid process I describe. But when my intuitive attempts aren’t working, I get increasingly rigorous until, in a worst-case scenario, I’m literally writing down each step.
I view the internalization of this process as one of the things that sets amateur programmers apart from experienced ones. I’ve noticed that people learning to program are able to find bugs if they have a good understanding of how a piece of code is supposed to work, but they get stuck as the program grows more complicated. An expert programmer can find a bug in any piece of technology, no matter how well they know it at the outset, because applying this process eventually yields enough understanding of the technology to find the issue. Even if it means, say, buying a book on assembly or diving into your compiler: although generally at that point it’s more efficient to ask someone else for help!