Multi-GPU Cluster use for Java/OpenMP

Student: Thorsten Blass
Title: Multi-GPU Cluster use for Java/OpenMP
Type: diploma thesis
Advisors: Veldema, R.; Philippsen, M.; Schneider, T.; Sadeghi, A.
State: submitted on August 2, 2010
Prerequisites:

JaMP is an implementation of the well-known OpenMP standard adapted for Java. JaMP allows one to program, for example, a parallel for loop or a barrier without resorting to low-level thread programming.
For example:
class Test {
    static final int N = 1024;
    int[] a = new int[N], b = new int[N], c = new int[N];

    void foo() {
        //#omp parallel for
        for (int i = 0; i < N; i++) {
            a[i] = b[i] + c[i];
        }
    }
}
is valid JaMP code. JaMP currently supports all of OpenMP 2.0 and parts of the 3.0 features. The current implementation is pure Java, although an older version integrated into Jackal also exists. The older (Jackal) JaMP version allows transparent cluster computing and can even migrate a running application from one cluster to another via the OGRE framework. The newer (pure Java) JaMP version translates parallel for loops to CUDA for extra speed gains. If a particular loop cannot be mapped to CUDA, it is translated to a threaded version that uses the cores of a multicore machine.
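To illustrate the threaded fallback, the following sketch shows how such a loop can be split across the cores of a multicore machine with standard java.util.concurrent means. It is not the code actually emitted by the JaMP compiler; class and method names are made up for illustration.

import java.util.concurrent.*;

class ThreadedLoopSketch {
    // Hypothetical equivalent of the parallel for loop above: the iteration
    // space is split into one chunk per core and executed on a thread pool.
    static void parallelFor(int n, int[] a, int[] b, int[] c) throws InterruptedException {
        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        int chunk = (n + threads - 1) / threads;
        for (int t = 0; t < threads; t++) {
            final int lo = t * chunk;
            final int hi = Math.min(n, lo + chunk);
            pool.submit(() -> {
                for (int i = lo; i < hi; i++) {
                    a[i] = b[i] + c[i];          // same loop body as in the JaMP example
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);  // corresponds to the implicit barrier
    }
}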

Topic:

Today's workstations and desktop PCs are becoming more and more heterogeneous parallel systems. This thesis is to extend the existing JaMP framework so that the programmer can transparently use any available hardware accelerator at runtime. Currently, JaMP generates only multi-threaded code or code for a single CUDA-capable GPU.
JaMP should be extended to use multiple (remote) hardware devices of different types (e.g. CPU + GPU) at the same time. For this purpose, the existing JaMP architecture has to be extended by a middleware. This middleware consists of two modules: the Compute Kernel Manager (CKM), which virtualizes the available computing devices, and the Cluster Array Package (CAP), which handles all aspects of data distribution.
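A minimal interface sketch of this division of labour is given below. All names and signatures are assumptions made for illustration; the actual CKM and CAP APIs of the existing basic implementations are not specified here.

import java.util.List;

// A virtualized computing device (a set of CPU cores or a GPU, possibly remote).
interface ComputeDevice {
    String id();
}

// CKM: enumerates the available devices and runs compute kernels on them.
interface ComputeKernelManager {
    List<ComputeDevice> availableDevices();
    void launch(Runnable kernel, ComputeDevice device);
}

// CAP: distributes array data across devices and collects it again.
interface ClusterArrayPackage {
    List<float[]> distribute(float[] data, List<ComputeDevice> devices);
    void gather(List<float[]> partitions, float[] target);
}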
Basic implementations of both modules already exist. This middleware should be integrated into the JaMP framework and extended where necessary. The use of multiple computing devices raises a fundamental problem of cluster computing: how many nodes of a cluster should be used for a program? For JaMP, the question becomes broader: which device types should be used? This thesis should therefore also provide a starting point and a solution for finding the best-matching hardware set for an application; one possible heuristic is sketched below.
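One conceivable starting point is a simple cost model: characterize each device by a measured throughput and a data-transfer cost, then greedily add devices as long as the estimated runtime keeps dropping. The sketch below only illustrates this idea under those assumptions; the numbers would have to be calibrated by micro-benchmarks, and the actual selection strategy is part of the thesis work.

import java.util.*;

class DeviceSelectorSketch {
    static class Device {
        final String name;
        final double elementsPerMs;    // measured compute throughput
        final double transferMsPerMB;  // cost of moving data to the device
        Device(String name, double elementsPerMs, double transferMsPerMB) {
            this.name = name;
            this.elementsPerMs = elementsPerMs;
            this.transferMsPerMB = transferMsPerMB;
        }
    }

    // Greedily add the fastest remaining device as long as it lowers the runtime estimate.
    static List<Device> select(List<Device> devices, long elements, double megabytes) {
        List<Device> sorted = new ArrayList<>(devices);
        sorted.sort(Comparator.comparingDouble(d -> -d.elementsPerMs));
        List<Device> chosen = new ArrayList<>();
        double best = Double.MAX_VALUE;
        for (Device d : sorted) {
            chosen.add(d);
            double throughput = chosen.stream().mapToDouble(x -> x.elementsPerMs).sum();
            double transfer = megabytes / chosen.size()
                    * chosen.stream().mapToDouble(x -> x.transferMsPerMB).max().orElse(0);
            double estimate = elements / throughput + transfer;
            if (estimate < best) {
                best = estimate;
            } else {
                chosen.remove(chosen.size() - 1);  // this device does not pay off
                break;
            }
        }
        return chosen;
    }
}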
All extensions have to be verified by benchmarks. The target machine is a cluster of 8 nodes connected by a fast InfiniBand network, where each node contains two NVIDIA Tesla GPUs.

Milestones

  • understand the JaMP Framework (Compiler, Classloader)
  • introduction to the middleware layers CKM and CAP
  • integrate the middleware into JaMP
  • extend CKM and CAP where necessary
  • implement a heuristic to find the best matching hardware set for an application
  • benchmark, benchmark, benchmark, ...
  • write up