What Are Machine Learning Uses to Improve Static Analysis?
As code is being written, static analysis tools — such as Helix QAC and Klocwork — identify coding defects, vulnerabilities, and compliance issues. However, static analysis can also produce a large volume of results, and depending on your perspective and goals, not all of them will be relevant or interesting in every case.
Here, we explain three machine learning uses to help improve the relevance of static analysis results.
Machine Learning Uses to Improve Static Analysis Results
While there are many machine learning uses to help improve static analysis results, these are the three most common.
Machine Learning Uses for Grouping Defects
Grouping defects is the process of collecting together defects that are similar in nature. This enables you to review similar defects one after the other, which reduces the cognitive effort of context switching.
Defects can be grouped together by defining and calculating a similarity measure. However, there are a few possible problems with this approach:
- The similarity measure is likely static and non-configurable.
- It may not guarantee a real relationship with every defect in the group.
A more beneficial approach is to use unsupervised machine learning clustering algorithms to group defects together automatically.
Consider an example: a call to readUntrustedData() is performed at line 11, and its result is stored in the variable ind. This call presumably reads a value from an untrusted source, such as a network socket or another untrusted channel.
Later, at line 19, this variable is passed to the function defectSink(), where it is used as an index to access array A. Since no validation is performed on the index ind, this could lead to an out-of-bounds read or write, which represents a security risk.
The static analysis tool correctly detects this defect and reports the following information:
- Defect code SV.TAINTED.CALL.INDEX_ACCESS and message “Unvalidated integer value ‘ind’…”
- Defect traceback identifying:
- Defect source (line 11)
- Defect sink (line 6)
- Path events indicating the control flow (lines 18, 19)
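The source-to-sink pattern the tool flags can be sketched as follows. This is a hedged Python analog of the C-style example described above: the names readUntrustedData and defectSink follow the description, the canned return value is an assumption, and the line numbers of the original listing do not correspond to this sketch.

```python
A = [0] * 10  # fixed-size buffer

def readUntrustedData():
    # Source: stands in for a value read from a network socket or
    # other untrusted channel; a canned value here for illustration.
    return 42

def defectSink(ind):
    # Sink: 'ind' is used as an index with no validation, so an
    # attacker-controlled value causes an out-of-bounds access
    # (an IndexError in Python; memory corruption in C).
    return A[ind]

ind = readUntrustedData()    # defect source
try:
    defectSink(ind)          # defect sink: unvalidated index
except IndexError:
    print("out-of-bounds access with ind =", ind)
```

A real fix would validate ind against len(A) before the access; the tool's traceback points you at exactly the source, sink, and path events you need to decide where that check belongs.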
Intuitively, two defects appear similar to each other if they have the same source, the same sink, or control-flow paths with common segments. In other words, the more aspects two defects have in common, the more similar they will appear to the user, and the faster they will be to review.
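This intuition can be sketched as a Jaccard-style similarity over a report's source, sink, and path events, followed by single-linkage grouping. The field names, the threshold, and the measure itself are illustrative assumptions, not the tool's actual algorithm.

```python
def similarity(a, b):
    """Fraction of shared traits: same source, same sink, shared path events."""
    traits = lambda d: ({("source", d["source"]), ("sink", d["sink"])}
                        | {("path", p) for p in d["path"]})
    ta, tb = traits(a), traits(b)
    return len(ta & tb) / len(ta | tb)

def cluster(defects, threshold=0.5):
    """Single-linkage grouping: a defect joins a group if it is similar
    enough to any member already in that group."""
    groups = []
    for d in defects:
        for g in groups:
            if any(similarity(d, m) >= threshold for m in g):
                g.append(d)
                break
        else:
            groups.append([d])
    return groups

defects = [
    {"id": 1, "source": 11, "sink": 6,  "path": [18, 19]},
    {"id": 2, "source": 11, "sink": 6,  "path": [18, 25]},  # shares source and sink with 1
    {"id": 3, "source": 40, "sink": 52, "path": [44]},      # unrelated
]
groups = cluster(defects)
```

Here reports 1 and 2 end up in one group (they share a source, a sink, and a path event) while report 3 forms its own, so a reviewer sees the two related issues back to back.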
From the example, you can see that most of the defects within the same cluster were reasonably similar to each other and different from those in different clusters. This can noticeably lessen the work required to investigate a typical set of defect reports from a static analysis tool.
Machine Learning Uses for Ranking Defects
One of the most common strategies for finding defects with static analysis tools is using dataflow and control-flow analyses.
Defects found through dataflow and control-flow analyses require semantic analysis of the code, and the underlying questions are often undecidable. For that reason, the analyses use heuristics to control their speed and accuracy. As a result, these defects are more likely to be false positives.
For our example, we focused only on defects found through dataflow and control-flow analyses. A static analysis tool finds these potential defects by a procedure that may involve approximations.
As an example of an approximation, consider multiplying a variable known to be within the range [0, 1000] by 2. It would be accurate to say that the result of this multiplication will be an even number between 0 and 2000.
However, it would be an approximation to say that it could be any number between 0 and 2000. Our hypothesis was that the fewer approximations that are made while finding a defect, the more likely the defect is to be real.
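The gap between the precise answer and the approximation can be made concrete with a small sketch, enumerating the precise abstract value of "x * 2 for x in [0, 1000]" next to an interval approximation of the same expression:

```python
# Precise abstract value: exactly the even numbers 0..2000.
precise = {x * 2 for x in range(0, 1001)}

# Interval approximation: any integer in [0, 2000].
approx = set(range(0, 2001))

# The approximation is sound: it contains every precise value ...
assert precise <= approx
# ... but it also admits values the precise set rules out, e.g. odd numbers.
assert 3 in approx and 3 not in precise
```

Every such over-approximation widens the set of values the analysis must assume possible, which is why a defect found through fewer approximations is a stronger candidate for being real.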
We tested this hypothesis and found that the overall ratio of real defects in this project was 74% for the checks enabled.
This new ranking method also has a beneficial side effect: straightforward defects are ranked higher than more complex ones. This means that if two defects are linked and you are reviewing the list sequentially, there is a chance that fixing the more straightforward defect will also resolve the more complex one.
Machine Learning Uses for AI-Assisted Defect Ranking
Traditional defect ranking approaches can be difficult to integrate with other DevOps tools. For that reason, consider a machine learning-based approach.
For our example, we took an approach that assigned each defect report to one of two groups: “True Positive Reports” (TP) or “False Positive Reports” (FP). These two groups are based on how human reviewers assessed the defects reported in the past.
We then evaluated several supervised machine learning algorithms and settled on one based on the Support Vector Machine (SVM) model. The SVM model maps the labeled examples into an N-dimensional feature space and attempts to separate them with a hypersurface.
To build and test the model, we used an open-source SVM implementation called “libSVM.” This is a C/C++ library with comprehensive documentation and a well-defined interface, available under a modified BSD license.
We applied the trained SVM model to the testing subset to classify reports into the “TP” and “FP” categories. The resulting classification was accurate between 79% and 85% of the time.
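As a hedged sketch of this classification step: the article used the libSVM library, but the same idea fits in a few lines as a from-scratch linear SVM trained by sub-gradient descent on the hinge loss. The two-dimensional feature vectors (for instance, an approximation count and a path length) and the training data are assumptions for illustration.

```python
def train_svm(samples, labels, lam=0.001, lr=0.1, epochs=500):
    """Linear SVM trained by sub-gradient descent on the hinge loss."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for (x1, x2), y in zip(samples, labels):
            # L2 regularization nudges the weights toward zero.
            w[0] -= lr * lam * w[0]
            w[1] -= lr * lam * w[1]
            if y * (w[0] * x1 + w[1] * x2 + b) < 1:  # inside the margin
                w[0] += lr * y * x1
                w[1] += lr * y * x2
                b += lr * y
    return w, b

def predict(w, b, x):
    """+1 = likely true positive (TP), -1 = likely false positive (FP)."""
    return 1 if w[0] * x[0] + w[1] * x[1] + b >= 0 else -1

# Hypothetical feature vectors labeled from past human review:
# +1 = TP report, -1 = FP report.
samples = [(1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
labels = [1, 1, -1, -1]
w, b = train_svm(samples, labels)
```

A production setup would use libSVM (or a wrapper such as scikit-learn's libsvm-backed SVC) with kernels and cross-validation; the sketch only shows the core idea of separating labeled reports in feature space.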
Supervised learning — when used for defect post-processing with available historical defect report review data — can be a useful tool for prioritizing reports, especially those that are more likely to be associated with a real problem in the code.
Machine Learning and Static Analysis
While you don’t have to use any particular static analysis tool to take advantage of these machine learning uses for improved results, we recommend the Perforce static analysis tools, which are capable of:
- Applying coding standards, such as MISRA, AUTOSAR, and OWASP.
- Enforcing coding best practices.
- Identifying coding errors earlier in development — saving both time and money.
- Eliminating security vulnerabilities.
- Managing code quality over time by measuring, tracking, and reporting on quality metrics, such as Cyclomatic Complexity.
If you want to see either static analysis tool in action, be sure to sign up for a demo.