Evaluation of Machine Learning Algorithms for Outlier Detection in Clustered Code Fragments

Student: James Wafula
Title: Evaluation of Machine Learning Algorithms for Outlier Detection in Clustered Code Fragments
Advisors: Dotzler, G.; Ring, M.; Eskofier, B.; Philippsen, M.
State: submitted on October 30, 2015

During its life cycle, software undergoes constant change. The reasons for changes vary, but in most cases developers fix bugs or adapt their software to new APIs. Frequently, these modifications must be applied not just to one location in the code but to several at once. To reduce development time and to avoid errors introduced by manual changes, tools that simplify this repetitive task are needed. Our group has developed such a tool. It finds all code positions in a project that are affected by a bug or an API change. For each identified location, the tool presents a recommendation to the developer that includes the position and a possible source code replacement. Because libraries are used in more than one project, the tool can also identify bugs and recommend changes in independent projects.

Our group developed a tool to extract code changes from Git archives. To create the change patterns on which the recommendations are based, the tool searches software archives for clusters of similar code changes. Because these clusters are based solely on syntactic similarity, they contain outliers, i.e., code changes that do not belong to their assigned cluster. Currently, the tool requires these outliers to be identified manually. The goal of this thesis is to identify them automatically by combining syntactic and semantic similarity with pattern recognition techniques. To compute semantic similarity values, our symbolic code execution framework must be adapted for use with pattern recognition techniques. The thesis also includes an evaluation of different pattern recognition algorithms for this goal.
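To make the idea of combining syntactic and semantic similarity concrete, the following is a minimal sketch and not the method of the thesis or of the group's tool. It assumes that pairwise similarity values in [0, 1] are already available for all members of a cluster. The function names, the `weight` and `margin` parameters, and the median-based rule are hypothetical choices made for illustration only.

```python
from statistics import mean, median


def combined_similarity(syntactic, semantic, weight=0.5):
    """Blend two similarity values in [0, 1].

    `weight` is the share given to the syntactic value (an assumed parameter).
    """
    return weight * syntactic + (1.0 - weight) * semantic


def find_outliers(members, syntactic, semantic, weight=0.5, margin=0.2):
    """Return the members whose mean similarity to the rest of the cluster
    lies more than `margin` below the cluster median.

    `syntactic` and `semantic` map frozenset({a, b}) to a similarity value.
    This simple rule is an illustrative stand-in for a trained classifier.
    """
    if len(members) < 3:
        # Too few members to say which one deviates from the others.
        return []

    scores = {}
    for m in members:
        scores[m] = mean(
            combined_similarity(
                syntactic[frozenset((m, other))],
                semantic[frozenset((m, other))],
                weight,
            )
            for other in members
            if other != m
        )

    threshold = median(scores.values()) - margin
    return [m for m in members if scores[m] < threshold]


# Hypothetical cluster: all four changes look alike syntactically,
# but change "d" behaves differently from the other three.
members = ["a", "b", "c", "d"]
pairs = [frozenset((x, y)) for i, x in enumerate(members) for y in members[i + 1:]]
syntactic = {p: 0.9 for p in pairs}
semantic = {p: (0.1 if "d" in p else 0.9) for p in pairs}

outliers = find_outliers(members, syntactic, semantic)
```

In this made-up example, syntactic similarity alone cannot separate the four changes, whereas the blended score singles out change `d`. A trained classifier, as planned in the thesis, would replace the fixed `margin` rule.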


  • Research of applicable pattern recognition techniques
  • Training of classifiers to optimize the cluster assignments
  • Training of an outlier detection classifier
  • Use of the semantic similarity in the classifier training
  • Evaluation
  • Thesis


  • http://www.cs.waikato.ac.nz/~ml/weka/
  • C. Bishop, Pattern Recognition and Machine Learning, Springer, 2007
  • Kamp, Marius: Entwicklung eines Werkzeugs zum Vergleich von Code-Fragmenten durch symbolische Ausführung. Masterarbeit, Lehrstuhl für Informatik 2, Friedrich-Alexander-Universität Erlangen-Nürnberg, Oktober 2014.
  • Romstöck, Christoph: Entwicklung eines Werkzeugs zur Identifizierung vergleichbarer Code-Modifikationen in Software-Archiven. Masterarbeit, Lehrstuhl für Informatik 2, Friedrich-Alexander-Universität Erlangen-Nürnberg, Dezember 2013.
  • Dotzler, G.; Veldema, R.; Philippsen, M.: Annotation support for generic patches.