
Parent: MorphoOptimizationProject

Our algorithms deal with a 3D volume or surface. I am going to concentrate on the 3D mesh of vertices, faces, and edges for this discussion.

Consider the problem of finding a maximum in a list of values. Serial code scanning from one end will easily find either the first or last of equal maxima, and get the same answer every time. Parallel code may find any of the maxima, and find a different one every time.
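As a sketch (the names below are illustrative, not FreeSurfer code), a serial scan with a strict comparison always returns the first of several equal maxima, whereas a per-thread scan followed by a combine step can return any of them:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Serial scan: the strict '>' comparison keeps the earliest maximum,
// so ties are always broken the same way and the answer is stable.
std::size_t serialArgmax(const std::vector<double>& v) {
    std::size_t best = 0;
    for (std::size_t i = 1; i < v.size(); i++)
        if (v[i] > v[best]) best = i;
    return best;
}

// Parallel scan: each thread keeps a local winner over its iterations,
// and the winners are merged under a critical section.  When several
// entries tie for the maximum, which tied index survives depends on how
// iterations were assigned to threads, so the answer can change per run.
std::size_t parallelArgmax(const std::vector<double>& v) {
    std::size_t best = 0;
    #pragma omp parallel
    {
        std::size_t localBest = 0;
        #pragma omp for nowait
        for (long i = 1; i < (long)v.size(); i++)
            if (v[i] > v[localBest]) localBest = (std::size_t)i;
        #pragma omp critical
        if (v[localBest] > v[best]) best = localBest;
    }
    return best;
}
```

When the maximum is unique both versions agree; the instability only appears on ties.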

A similar problem is calculating an average. Floating point addition is not associative: (A+B)+C might differ slightly from A+(B+C). In serial code, the exact order of evaluation may differ depending on compiler optimizations. In parallel code it may also differ depending on the number of threads and on the assignment of loop iterations to threads by the OMP implementation.
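The non-associativity is easy to demonstrate with ordinary doubles:

```cpp
// Summing the same three terms in two groupings: each addition rounds to
// the nearest representable double, and the rounding depends on the
// intermediate value, so the two grouped results can differ in the last bit.
double leftToRight(double a, double b, double c) { return (a + b) + c; }
double rightToLeft(double a, double b, double c) { return a + (b + c); }
// e.g. with a = 0.1, b = 0.2, c = 0.3 the two results differ by one ulp.
```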

A surface folding algorithm calculates a value for each point on a surface using floating point operations, and one of several almost identical values becomes the fold point, so very small numerical differences can produce different answers.

## Reproducible OMP programming (ROMP)

With this in mind, a loop can be categorized as

1. Is being experimented on to see if it supports parallel execution
2. Is known to get slightly different, but acceptable, answers
3. Is thought to get exactly the same answer regardless of the number of threads
4. Is verified, by eye or by checking the code, to get exactly the same answer

The romp_support.h header allows each parallel loop to be annotated with which of these categories it is in, and then allows execution of the lower-category loops in parallel to be suppressed when reproducibility is needed.
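The mechanism can be sketched like this (the names and helper below are illustrative, not the actual romp_support.h macros): each loop carries its category, and OpenMP's `if` clause turns the parallelism off for loops below the trusted level when a reproducible run is requested.

```cpp
#include <cstddef>
#include <vector>

// Categories from the list above, in increasing order of trust.
// (Hypothetical names; the real annotations live in romp_support.h.)
enum RompCategory {
    Experimental        = 1,  // being tried out for parallel execution
    AcceptablyDifferent = 2,  // answers differ slightly but acceptably
    AssumedReproducible = 3,  // thought to be thread-count independent
    CheckedReproducible = 4   // verified by eye or by checking the code
};

// When reproducibility is required, only categories 3 and 4 run in parallel.
bool mayRunParallel(RompCategory cat, bool reproducibleRun) {
    return !reproducibleRun || cat >= AssumedReproducible;
}

void scaleVertices(std::vector<double>& x, double s, bool reproducibleRun) {
    // OpenMP's 'if' clause executes the loop serially when the
    // condition is false, so annotated loops need no duplicated code.
    #pragma omp parallel for if (mayRunParallel(Experimental, reproducibleRun))
    for (long i = 0; i < (long)x.size(); i++)
        x[i] *= s;
}
```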

### Reductions and similar sums

Reductions by summation of floating point values may produce different results from one execution of the parallel loop to another, especially if dynamic scheduling is used, or if a different number of threads is used.

To avoid this, there is support in the romp_for*.h files for decomposing the loop into the same chunks regardless of the number of threads, and for summing the reductions in a stable manner.
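A minimal sketch of the idea (an assumption about the technique, not the actual romp_for*.h implementation): partition the iterations into fixed-size chunks that do not depend on the thread count, sum each chunk in index order, then combine the partial sums in chunk order, so the sequence of rounded additions is identical on every run.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Reproducible parallel sum: the chunk boundaries and both summation
// orders are fixed, so only the *assignment of chunks to threads* varies
// between runs, and that cannot change any rounded intermediate value.
double reproducibleSum(const std::vector<double>& v, std::size_t chunkSize = 256) {
    std::size_t nChunks = (v.size() + chunkSize - 1) / chunkSize;
    std::vector<double> partial(nChunks, 0.0);
    #pragma omp parallel for
    for (long c = 0; c < (long)nChunks; c++) {
        std::size_t lo = (std::size_t)c * chunkSize;
        std::size_t hi = std::min(lo + chunkSize, v.size());
        double s = 0.0;
        for (std::size_t i = lo; i < hi; i++) s += v[i];  // fixed order within chunk
        partial[c] = s;
    }
    double total = 0.0;
    for (std::size_t c = 0; c < nChunks; c++) total += partial[c];  // fixed combine order
    return total;
}
```

Note that changing chunkSize can still change the result; it must be a constant of the loop, not derived from the thread count.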

## Which loops to make parallel

The ROMP support includes timing code, which produces a report on program exit showing which loops executed, how much time they used, and which of the above categories they are in.

This report has enabled us to identify all the important loops executed during recon-all, and enough of them have been modified to satisfy at least the category 3 criteria that there is no good reason to run the lower-category loops in parallel.

### Parallelizing apparently serial algorithms

Some of our algorithms work across the whole surface or volume, one vertex or one face at a time. Because each vertex's behavior is influenced by the behavior of the rest, these algorithms are inherently serial, and need to be replaced by similar parallel algorithms.

mrisAsynchronousTimeStep_optionalDxDyDzUpdate is an excellent example of this.

In other cases, it was necessary to split a serial loop into several loops, only some of which could be parallelized.

In several places the original code looked approximately like

    loop
        if this iteration has something to do
            do it
        end if
    end loop

If the expensive iterations were clustered, some of the chunks of iterations given to the parallel code were much faster than the others, and dynamic scheduling did not significantly improve this.

To get good load balancing it was effective to recode this as two loops

    loop
        if this iteration has something to do
            append it to a list
        end if
    end loop
    for each item in the list loop
        do it
    end loop
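The two-loop recoding above can be sketched as follows (the helper name and predicates are illustrative, not FreeSurfer code): the cheap filtering pass stays serial, and the parallel loop then runs over a dense list of real work, so equal-sized chunks carry roughly equal cost.

```cpp
#include <cstddef>
#include <vector>

// Gather-then-process: first collect the indices that have work, then
// parallelize only over that dense list.  Even a static schedule now
// balances well, because every iteration of the second loop is expensive.
template <class HasWork, class DoWork>
void gatherThenProcess(std::size_t n, HasWork hasWork, DoWork doWork) {
    std::vector<std::size_t> todo;
    for (std::size_t i = 0; i < n; i++)   // cheap filter stays serial
        if (hasWork(i)) todo.push_back(i);
    #pragma omp parallel for
    for (long j = 0; j < (long)todo.size(); j++)
        doWork(todo[j]);                  // expensive work, evenly spread
}
```

Each doWork call must still be independent of the others (here each writes only its own element) for the second loop to be safely parallel.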

MorphoOptimizationProject_addingParallelism (last edited 2021-09-22 09:45:48 by DevaniCordero)