November 20, 2012

Lies, Damn Lies, and Statistics

One of the most difficult tasks when working with code is identifying which areas of an application are most suspect. A variety of code analysis tools and theories examine the structure of your code to flag questionable constructs and patterns. These can be very helpful, but they often require combing through a lot of false positives to get to the actual risk areas.

We took a different approach with Surround SCM risk analysis. Rather than try to make our software understand your software, we use some "extra" information we already have: defect association. Whenever you attach a file to a defect (using the TestTrack Pro integration, for example), we gain some insider information. We know the file in question was associated with an issue, and we can use that knowledge to glean insight into potentially risky areas. Building on work originally done at UC Davis and later at Google research, we've incorporated these concepts into a new Analyze Risk feature in Surround SCM.

Analyze Risk uses two pieces of information to create a risk score. The first is how often a file has been associated with defects: a file associated with more defects scores higher than one with fewer associations. The second is how recently the file has been associated with defects: a file associated with a defect recently scores higher than one whose defect associations are further in the past. These two pieces of information are combined into a relative risk score, which lets you compare a set of files and identify those that may need further review.

To use this in Surround SCM, select a repository and choose Tools > Analyze Risk. There are a few options you can set in the Analyze Risk dialog. Some are self-explanatory (e.g., changing the repository or choosing which TestTrack connection to use), but a couple are worth noting.
The first is which (if any) TestTrack filter to apply. Only issues that pass the selected filter count toward the score. This can help exclude files that were attached to feature requests, or limit scoring to high-severity issues. You can also limit the timeframe over which events are analyzed. Finally, the search expanded history option includes defect events that occurred on other branches.

After setting the options, click the Analyze button to start the calculation. Depending on the number of files and defects and the length of time specified, this might take a while. Once the scores are calculated, the list of files is updated with their rankings, as shown in the following screenshot.

[Screenshot: Analyze Risk dialog]

One thing to note is that the scores don't have any absolute meaning; they only show relative ranking. In this example, Program2.cs and WysiMain.Designer.cs both have significantly higher scores than the other files. From here I can review their history or properties, or change custom fields on them such as owner or state. Conveniently, I can also use Surround SCM's code review feature right from here, assigning these files out for review by other team members.

Identifying suspect files within your system lets you focus your review and refactoring efforts on the areas with the highest likelihood of improving overall quality. Surround SCM's Analyze Risk capability takes advantage of Surround's integration with TestTrack and other defect tracking tools to identify problem areas and make sure your team is working on the right parts of the code.
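To make the frequency-plus-recency idea concrete, here is a minimal sketch in Python. Surround SCM's exact formula isn't published, so this is only an illustration in the spirit of the UC Davis and Google research mentioned above: each defect association adds to a file's score, with a logistic decay so recent associations contribute close to 1 and old ones close to 0. The function name and input shape are my own for the example.

```python
import math
from datetime import datetime

def risk_scores(defect_events):
    """Compute relative risk scores from defect-association timestamps.

    defect_events: dict mapping file path -> list of datetimes when the
    file was attached to a defect. Returns a dict of relative scores;
    only the ranking is meaningful, not the absolute values.
    """
    # Normalize each event's timestamp into [0, 1]: 0 = oldest event
    # across all files, 1 = most recent.
    all_times = [t for times in defect_events.values() for t in times]
    oldest, newest = min(all_times), max(all_times)
    span = (newest - oldest).total_seconds() or 1.0

    scores = {}
    for path, times in defect_events.items():
        score = 0.0
        for t in times:
            ti = (t - oldest).total_seconds() / span
            # Logistic decay: an event at ti = 1 (newest) adds ~0.5,
            # one at ti = 0 (oldest) adds almost nothing. Summing over
            # events captures frequency; the decay captures recency.
            score += 1.0 / (1.0 + math.exp(-12.0 * ti + 12.0))
        scores[path] = score
    return scores

events = {
    "Program2.cs": [datetime(2012, 10, 15), datetime(2012, 11, 1),
                    datetime(2012, 11, 10)],
    "Utils.cs": [datetime(2012, 1, 5)],
}
print(risk_scores(events))
```

A file with several recent defect associations (like Program2.cs here) ends up well above a file with a single old one, matching the ranking behavior described above.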