Measuring Code Stability – Part II
In my last article, I introduced the concept of the Stability Index, and in this follow-up, I’ll take you through the formula we’re currently using to calculate it.
Calculating the Stability Index
Note that while there is some mathematics behind the ‘Stability Index’, it’s not the result of years of scientific research. I’m also confident the formula and approach presented here will be refined over time, so it’s a work in progress. My hope is simply that it will be a useful and effective metric in that it will prompt the right questions at the right times.
We wanted to present the Stability Index on a scale where high is good, rather than calling it the Instability Index (or some other name with negative connotations) with zero as the baseline and no ceiling. That meant we had to have a ceiling, and if we had to have one, why not pick 100 and make it a percentage?
To calculate the Stability Index as a percentage, we ensure that the numbers from each component map into the range 0 to 1000 (there's one exception to this – see below), add the three together, clamp the overall result so that it is greater than or equal to zero, and divide by 3000. We can then express it as a percentage:

Stability Index = max(0, Bug Score + Churn Score + Rework Score) / 3000 × 100
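As a concrete sketch, the combination step looks like this in Python (the three component scores are assumed to have already been computed and mapped onto their respective scales):

```python
def stability_index(bug_score, churn_score, rework_score):
    """Combine the three component scores into a percentage.

    bug_score lies in -1000..+1000; churn_score and rework_score
    lie in 0..1000.  The sum is clamped at zero, then divided by
    the 3000 maximum to give a result between 0 and 100.
    """
    total = max(0, bug_score + churn_score + rework_score)
    return 100.0 * total / 3000.0
```

For example, component scores of 500, 800, and 900 give a Stability Index of about 73%, while a strongly negative bug score can drag the clamped result all the way to zero.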
With that goal in mind, let’s take a look at each of these components in turn.
Weighted Bug Score
Our QA team has used this metric for some time now and find it very useful as a general indicator of the state of play for a project. Like many people, we grade our bugs A, B, or C depending on their severity. Each severity is assigned a weight, in our case they are: A=10, B=3, C=1.
The weighted bug score for a project is defined as:

Weighted Bug Score = Σ over severities s of weight(s) × (reported(s) − fixed(s))

Or, in English: for each severity, take the difference between the number of reported and fixed defects, multiply it by the weighting for that severity, and sum the results.
Depending on what’s going on with a project, this can yield a large number (positive or negative), so we clamp the result to the range −1000 to +1000; beyond that range, the magnitude doesn’t really tell us anything more. This is the only component that is permitted to be negative: a negative value indicates that the defect resolution rate is outpacing the reporting rate, and it seems appropriate to reflect that in the result.
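Putting that together, here is a minimal sketch in Python. The severity weights are the ones given above; the dict-based interface is just for illustration:

```python
# Severity weights from the article: A=10, B=3, C=1.
WEIGHTS = {"A": 10, "B": 3, "C": 1}

def weighted_bug_score(reported, fixed):
    """Weighted bug score from per-severity defect counts.

    reported and fixed each map a severity letter to a count of
    defects.  The raw score is clamped into the -1000..+1000 band
    used by the Stability Index.
    """
    raw = sum(weight * (reported.get(sev, 0) - fixed.get(sev, 0))
              for sev, weight in WEIGHTS.items())
    return max(-1000, min(1000, raw))
```

With 4 A, 10 B, and 25 C bugs reported, and 1, 6, and 20 respectively fixed, the score is 10×3 + 3×4 + 1×5 = 47.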
Churn Score
The Churn Score provides a normalized indication of the proportion of a project’s codebase that has changed during the period in question. There are lots of context-dependent factors here: changing 100% of the files in a small project is probably OK, but in a large project it might be cause for concern. I say “might” be cause for concern, because what if the time period we’re considering is two years? That seems like it might be OK, whereas if we’re talking about a project of 100K files and a period of 90 days, there’s definitely something worth looking into. It may just be bulk whitespace formatting, but it’s worth asking the question – context matters!
The key factors that affect the Churn Score, then, are the length of the time period and the proportion of the total files changed in that time. We also need the effect of churning the code to lose significance the longer the time period we’re looking at.
After many happy hours examining data in Excel, we came up with this:
I know, I know. What’s with the magic numbers, right? They’re the result of a time-honored technique: start from a target point and work backwards. There are so many factors at play here that we needed a fixed point on the graph to base things on. To get that fixed point, I took the view that changing 100% of the files in a project within a period of 30 days should result in a Churn Score of 1000 (the upper limit). The smaller the project, the less accurate this assumption becomes, but I think it’s a broadly acceptable proposition for most projects in practice.
That is the fundamental baseline for the Churn Score: all files changed in 30 days or fewer equals a Churn Score of 1000.
From there we needed a way to degrade the Churn Score smoothly over time, and to map it onto the desired scale. The constants in the formula do just that. They are the result of some trial and error and judicious use of Excel’s very handy ‘Goal Seek’ tool.
The result is a profile like this for projects with 100% Churn, 50% Churn, and 20% Churn respectively.
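The actual constants came out of Goal Seek and aren’t reproduced here, so the exponent in the sketch below is a placeholder of my choosing; but the shape of the calculation, anchored on the 100%-in-30-days baseline, looks like this:

```python
def churn_score(files_changed, total_files, period_days, power=1.5):
    """Sketch of the Churn Score.

    NOTE: power=1.5 is an assumed decay exponent, not the
    article's tuned constant.  Churning everything within the
    30-day baseline scores the full 1000; longer periods decay
    the score along a power curve.
    """
    churn = files_changed / total_files
    decay = min(1.0, (30.0 / period_days) ** power)
    return min(1000.0, 1000.0 * churn * decay)
```

Under these assumptions, 100% churn in 30 days or fewer hits the 1000 ceiling, and the same churn spread over 60 or 120 days scores progressively less, giving the decaying profiles described above.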
Rework Score
The Rework Score is a normalized measure of the number of files that are changed repeatedly during a period as a proportion of the total files in the project. Again we have to account for the length of the period under analysis: editing all the files in a project multiple times might be fine over the course of a year, but is unlikely to be a good thing inside a week for all but the smallest projects. Project size is also clearly an important factor.
We followed the same process used for the Churn Score, keeping the same power constant to form the curved decay profile, but using a different multiplier constant to map the results onto our desired scale. The resulting formula is:
The profile this formula generates looks like this:
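Again, the tuned constants aren’t reproduced here, so both the exponent and the multiplier below are illustrative placeholders; the sketch just shows the structure the article describes, i.e. the Churn Score’s decay curve with a different scaling multiplier:

```python
def rework_score(files_reworked, total_files, period_days,
                 power=1.5, multiplier=2000.0):
    """Sketch of the Rework Score.

    NOTE: power and multiplier are assumed values, not the
    article's tuned constants.  files_reworked counts files
    changed more than once in the period; the same power decay
    as the Churn Score applies, but a different multiplier maps
    the result onto the 0..1000 scale.
    """
    reworked = files_reworked / total_files
    decay = min(1.0, (30.0 / period_days) ** power)
    return min(1000.0, multiplier * reworked * decay)
```

With these placeholder constants, reworking half the files in a project within 30 days saturates the score at 1000, while the same rework over a longer period scores proportionally less.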
So now you know how to calculate the Stability Index, the question is, will you use it? We know it’s not a panacea and your mileage may vary, but in the long run the hope is that the use of a Stability Index combining bug data with change data will go some way towards prompting the questions that we need to ask.
When we have the answers to those questions we’ll be able to say whether or not it’s good squeezy.
If you calculate the Stability Index for your projects, and have feedback on how helpful that was for you, please drop me a line. I’d love to hear how it works out.