Application Migration in Peer-to-peer Compute Clusters

Student:Xiaofan Liu
Title:Application Migration in Peer-to-peer Compute Clusters
Type:masters thesis
Advisors:Klemm, M.; Philippsen, M.; Guan, J.
State:submitted on August 5, 2008
Prerequisits:

Researchers often can access various computing resources through Computational Grids. However, to actually use these resources is far from user-friendly because of different architectures, network interconnects, memory bandwidths, etc. Users are also faced with the systems' schedulers that assign CPUs/nodes to jobs. Schedulers ask the user to provide an estimate of what amount of resources their application will demand at runtime. It is especially difficult to estimate wall clock times, as the runtime not only depends on the algorithms and the degree of parallelism used in the application. Moreover, it is influenced by environmental issues (e.g. the current system load or network load) that are often difficult to predict and change in unforeseen ways. If a job's runtime is underestimated the scheduler terminates it. This causes a loss of computation.

There are two common ways out to deal with the runtime estimation problem. First, the application can be decomposed into smaller phases after each of which the application can resume if it was terminated. However, this increases both programming effort and code complexity. Second, the runtime can be overestimated extensively (twice what is estimated or maximal allowed runtime). By accepting the penalty of extra long waiting in the cluster queues users "buy" a likely complete run of their job. Besides waiting penalties this also causes reservation holes in the job schedules and hence reduced overall cluster throughput.

The Programming Systems Group of the Computer Science Department at the University Erlangen-Nuremberg has proposed a solution to this problem that does not suffer from the above mentioned drawbacks and that also makes Grid computing more transparent [1]. If an OpenMP application is about to exceed the reserved time, it is automatically checkpointed and transparently migrated to either a new local reservation or to a reservation on a different accessible (possible remote) system. The target system may have a different architecture or network interconnect. If the available CPU count is changed in this migration, the application is automatically reparallelized and adapts to the new degree of parallelism by adding threads to or removing threads from the parallel region that was being executed when checkpointing and migration commenced.

The existing prototype uses a checkpointing algorithm to save the state of the application and to move it to the destination system. However, right now, the migration has to be performed manually, i.e., the application is terminated by the user, the state must be copied, and a new reservation has to be made on the target system. An automatic solution should monitor the state of the accessible clusters and automatically create a schedule to execute the application on several nodes to deliver as much computing power to the application as possible.

Topic:

It is the student's task to automate the task of migrating the application to the target system. The task is composed of several sub-tasks.

First, the current load of the systems has to be monitored. Each of the clusters should run a small daemon that gathers information about the system load, the CPU performance, network and disk bandwidth, etc. The daemon reports its information to the application. To retrieve the information from the cluster, the scheduler has to be queried. Each of the clusters runs a scheduler to assign nodes to applications. As each potentially uses a different type of scheduler, a plugin-based framework has to be developed to provide an API to query different schedulers and to assemble the gathered information in a common information base.

Second, as high-end computing systems are mostly hidden in private networks and/or are secured by firewalls, a flexible peer-to-peer protocol has to be implemented that provides sort of a "virtual etwork" that interconnects the different computing sides. The virtual network must provide an API to transport the cluster state information to the application. The student has to search for existing solutions that might prove useful with this task of interconnecting various sites. For example, Smartsockets [2] or Juxta [3] could provide the needed functionality.

Third, based on heuristics, the application determines a potential schedule spanning several clusters for the execution. Before the local reservation is about to be exceeded, the application must checkpoint and transfer its state to the target system. To transport the application state, the peer-to-peer network has to be extended to transport the application state to a target system. (This sub-task is optional.)

Finally, the performance of the implementation has to be assessed by a set of measurements. Based on these measurements, the student has to evaluate the implementation and propose optimization to it if the results show bottlenecks or problems of the implementation. The implementation and all results have to presented in the thesis.

Meilensteine

  • Implementation of a set of example parsers (e.g. for Maui [4] state files, OpenPBS qstat [5], etc.) to gather information from the clusters' schedulers.
  • Implementation of a peer-to-peer virtual network to transport the cluster state information to the application.
  • Automatic creation of the application schedule to determine when and where to migrate to.
  • Transfer of the application state through the peer-to-peer network.
  • Assess the implementation with measurements.
  • Write the thesis.

Literature
[1] Michael Klemm, Matthias Bezold, Stefan Gabriel, Ronald Veldema, and Michael Philippsen: Reparallelization and Migration of OpenMP Programs. In: Proceedings of the 7th International Symposium on Cluster Computing and the Grid, pages 529-537, Rio de Janeiro, Brazil, May 2007.

[2] Jason Maassen and Henri E. Bal: Smartsockets: Solving the Connectivity Problems in Grid Computing. In: Proceedings of the 16th International Symposium on High Performance Distributed Computing, pages 1-10, Monterey, CA, USA, June 2007.

[3] Joan Esteve Riasol and Fatos Xhafa: Juxta-Cat: a JXTA-based Platform for Distributed Computing. In: Proceedings of the 4th International Symposium on Principles and Practice of Programming in Java, pages 72-81, Mannheim, Germany, August 2006.

[4] |http://www.clusterresources.com/pages/products/maui-cluster-scheduler.php

[5] |http://www-unix.mcs.anl.gov/openpbs/

watermark seal