Design for Diagnosability

Director:Philippsen, M.
Period:May 15, 2013 - September 30, 2018
Coworker:Adersberger, J.; Lautenschlager, F.
Description:

Many software systems behave obtrusively during the test phase or even in normal operation. The diagnosis and the therapy of such runtime anomalies is often time consuming and complex, up to being impossible. There are several possible consequences for using the software system: long response times, inexplicable behaviors, and crashes. The longer the consequences remain unresolved, the higher is the accumulated economic damage.
"Design for Diagnosability" is a tool chain targeted towards increasing the diagnosability of software systems. By using the tool chain that consists of modeling languages, components, and tools, runtime anomalies can easily be identified and solved, ideally already while developing the software system. Our cooperation partner QAware GmbH provides a tool called Software EKG that enables developers to explore runtime metrics of software systems by visualizing them as time series.
The research project Design for Diagnosability enhances the eco-system of the existing Software EKG. The Software-Blackbox measures technical and functional runtime values of a software system in a minimally intrusive way. We store the measured values as time series in a newly developed time series database, called Chronix. Chronix is an extremely efficient storage of time series that optimizes disk space as well as response times. Chronix is an open source project (www.chronix.io) and is free to use for everyone.
The newly developed Time-Series-API analyzes these values, e.g., by means of an outlier detection mechanism. The Time-Series-API provides multiple additional building blocks to implement further strategies for identifying runtime anomalies.
The mentioned tools in combination with the existing Software EKG will become the so-called Dynamic Analysis Workbench. This tool enables developers to diagnose, explain, and fix any occurring runtime anomalies both quickly and reliably. It will provide diagnosis plans to localize and identify the root causes of runtime anomalies. The full tool chain aims at increasing the quality of software systems, particularly with respect to the metrics mean-time-to-repair and mean-time-between-defects.
Before we have successfully completed the project in July 2016, we have made the following contributions:

  • We have linked Chronix and a framework for distributed data processing so that our anomaly analyses now scale to huge sets of time series data.
  • We extended Chronix with additional components. Among them are, for example, a more efficient storage model, some adapters for more time series databases, additional server-side analysis functions, and some new time series types.
  • We have published our benchmark for time series databases.
  • We have investigated and implemented an approach to link application-level calls, e.g., a login of a user, down to the resulting calls on the OS level.

Although funding expired in 2016, we made further contributions in 2017:

  • We presented Chronix at the FAST conference in Santa Clara, CA in February 2017.
  • We have equipped Chronix with interfaces to attach time series databases that are used in the industry.
  • We have developed an approach that determines the ideal cluster configuration (w.r.t. processing time and costs) for a given analysis (specific function and set of time series).
  • We have expanded Spark, a framework for distributed processing of large-scale data, so that it now can make use of GPUs in distributed time series analyses. We presented the results at the Apache Big Data Conference in Miami, Florida, in May 2017.

We continued to make further contributions to the research project in 2018:

  • We have published a paper at PROFES 2018 that describes techniques and insights on how runtime data in a large software project can be offered to all project participants at the development stage to improve their collaboration.
  • We have maintained the Chronix Open Source project and stabilized it further (updating versions, fixing bugs, etc.).
watermark seal