Vectorization of recursive parallel programs

Student: Patrick Kreutzer
Title: Vectorization of recursive parallel programs
Type: Bachelor thesis
Advisors: Werth, T.; Philippsen, M.
Status: completed on January 31, 2014
Prerequisites:
Topic:

Background
The goal of this BA/MA is to answer the following research questions for many-core architectures such as Intel MIC (the Xeon Phi processor):

  • How can the correct grainsize for a cilkFor be determined (see the runLoop function below)? Can this be done dynamically at runtime?
  • How can the job queues be implemented so that consecutive jobs can be found and kernels can be merged?
  • How can control flow best be managed in the presence of kernel divergence?

Topic
The biggest problems with MIC are:

  • getting enough threads running
  • getting enough vectorization

Given the following (dummy) Cilk program:
..void add(int arr[], int i, int end) {
....arr[i]++;
....if (i < end) spawn add(arr, i + 1, end);
....if (i < end) spawn add(arr, i + 2, end);
....if (i < end) spawn add(arr, i + 3, end);
....if (i < end) spawn add(arr, i + 4, end);
....sync;
..}
..add(arr, 0, arr.length-4);

Each time a spawn happens, a job is added to a job queue. The arr[i]++ operations from different jobs can be combined into a vector add IF we can get our hands on four consecutive jobs, for example jobs with 'i' = {4,5,6,7}. The compiler could then perform a vector add with a vector job. This vector job is created by first merging the four jobs using a form of inlining/unrolling that duplicates each statement. For example:

..// jobs {arr, i, end}, {arr, i+1, end}, {arr, i+2, end}, {arr, i+3, end}
..void add4(int arr[], int i, int end) {
....arr[i]++;
....arr[i+1]++;
....arr[i+2]++;
....arr[i+3]++;
....if (i < end) spawn add(arr, i + 1, end);
....if (i < end) spawn add(arr, i + 2, end);
....if (i < end) spawn add(arr, i + 3, end);
....if (i < end) spawn add(arr, i + 4, end);
....if (i+1 < end) spawn add(arr, i + 1 + 1, end);
....if (i+1 < end) spawn add(arr, i + 1 + 2, end);
....if (i+1 < end) spawn add(arr, i + 1 + 3, end);
....if (i+1 < end) spawn add(arr, i + 1 + 4, end);
....if (i+2 < end) spawn add(arr, i + 2 + 1, end);
....if (i+2 < end) spawn add(arr, i + 2 + 2, end);
....if (i+2 < end) spawn add(arr, i + 2 + 3, end);
....if (i+2 < end) spawn add(arr, i + 2 + 4, end);
....if (i+3 < end) spawn add(arr, i + 3 + 1, end);
....if (i+3 < end) spawn add(arr, i + 3 + 2, end);
....if (i+3 < end) spawn add(arr, i + 3 + 3, end);
....if (i+3 < end) spawn add(arr, i + 3 + 4, end);
....sync;
....sync;
....sync;
....sync;
..}

The four arr[i]++ ... arr[i+3]++ operations above can then be trivially recognized as a vector operation.
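To make the lowering concrete, the sketch below shows what the compiler could emit for the four merged increments, assuming a mapping to 128-bit SSE intrinsics (on the 512-bit vector units of MIC the same pattern would use the corresponding 512-bit operations); the function name add4Body is purely illustrative:
..#include <emmintrin.h>  // SSE2 intrinsics
..
..// Illustrative lowering of the merged body: the four scalar increments
..// arr[i] .. arr[i+3] become a single vector add.
..void add4Body(int arr[], int i) {
....__m128i v = _mm_loadu_si128((__m128i *) &arr[i]);  // load arr[i..i+3]
....v = _mm_add_epi32(v, _mm_set1_epi32(1));           // increment all four lanes
...._mm_storeu_si128((__m128i *) &arr[i], v);          // store the result back
..}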
Note that for this to work, we need a job queue that can pop four consecutive jobs and, instead of taking the normal path, call the merged/vectorized version to do all four jobs in one step.
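One possible shape for such a queue is sketched below; the names Job, JobQueue, pop, and popFourConsecutive are illustrative and not part of any existing runtime. The worker first tries to pop four jobs whose 'i' values are consecutive and, if that succeeds, runs the merged add4 in one step; otherwise it falls back to the scalar add:
..// Illustrative dispatch: prefer four consecutive jobs over the scalar path.
..struct Job { int *arr; int i; int end; };
..
..void runNext(JobQueue *q) {
....Job batch[4];
....if (popFourConsecutive(q, batch)) {              // batch[k].i == batch[0].i + k
......add4(batch[0].arr, batch[0].i, batch[0].end);  // merged/vectorized version
....} else {
......Job j = pop(q);                                // normal path
......add(j.arr, j.i, j.end);
....}
..}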
Another problem with the above scheme are loops and if statements. For an 'if' statement, if the condition evaluates to true for all four merged functions, control stays on a single path and vectorization can continue. If the condition evaluates to false for one or more of the merged functions, the rest of the kernel must either be spawned again as a new Cilk thread or be executed sequentially. This problem is equivalent to the 'divergence' known from GPGPU computing.
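One conceivable way to handle this divergence in a merged kernel is sketched below (the helpers cond, work, and workVector are placeholders for an arbitrary condition, its scalar body, and its vectorized body): if all four lanes agree, execution stays on the vector path; otherwise the lanes whose condition holds are re-spawned as ordinary scalar jobs:
..// Sketch: merged body for 'if (cond(arr[k])) work(arr, k)' over four lanes.
..void body4(int arr[], int i) {
....bool m0 = cond(arr[i]),   m1 = cond(arr[i+1]);
....bool m2 = cond(arr[i+2]), m3 = cond(arr[i+3]);
....if (m0 && m1 && m2 && m3) {
......workVector(arr, i);             // all lanes agree: stay vectorized
....} else {
......if (m0) spawn work(arr, i);     // divergent lanes fall back to scalar jobs
......if (m1) spawn work(arr, i+1);
......if (m2) spawn work(arr, i+2);
......if (m3) spawn work(arr, i+3);
......sync;
....}
..}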
Loops annotated with cilkFor (instead of an ordinary for) can be executed in parallel. At runtime, Cilk essentially applies the following divide-and-conquer scheme (provided the termination condition does not change across loop iterations):
..void runLoop(int first, int last) {
....if ((last - first) < grainsize) {
......for (int i = first; i < last; i++) LOOPBODY;
....} else {
......int mid = first + (last - first) / 2;
......cilkSpawn runLoop(first, mid);
......runLoop(mid, last);
....}
..}

The selection of the correct grainsize is crucial for performance, especially on a many-core architecture like Intel MIC.
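As a starting point, the runtime could derive the grainsize from the trip count and the number of workers, similar in spirit to the heuristic used by Cilk Plus for cilk_for; the concrete constants below (about eight chunks per worker, capped at 2048 iterations) are illustrative and would have to be tuned for MIC:
..// Illustrative heuristic: chunks small enough to keep all workers busy,
..// large enough that the loop body amortizes the spawn overhead.
..int chooseGrainsize(int first, int last, int numWorkers) {
....int n = last - first;
....int g = n / (8 * numWorkers);
....if (g < 1) g = 1;
....if (g > 2048) g = 2048;
....return g;
..}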
