Parent: MorphoOptimizationProject

Ways to reduce serial code elapsed time

Background

Many books, papers, and courses are available about this topic. Here is a very brief overview, with descriptions of the techniques being applied to Freesurfer.

The execution of the program requires fetching both instructions and data from memory, via a multi-layered cache, executing the instructions, and storing the results. Getting the data from the memory and distant levels of the cache can be between tens and thousands of times slower than the instruction execution.

To reduce the execution time, find ways of removing data movement and instruction execution. Data movement is reduced by reordering the data placement to pack the 64 byte cache line fetches with useful data, and by reordering the instructions so that fetched data is reused while it is available rather than having to refetch it. Instruction execution is removed by looking for ways to eliminate large numbers of operations, and then by looking more closely at just the loops in the code that take the most time.

Sometimes it is better to defer calculations until the result is definitely needed

It is quick and simple to write code that says * Compute some properties of all vertexs into per-vertex data member * Loop over some subset of the vertexs, using those properties but this approach is only efficient if most of the data computed in the first loop is used in the second.

If the second loop does not use all the data, it may be best to instead defer the calculation until it is needed. This risks redoing the calculation, which could also be expensive, so it may be necessary to add a cache. Adding the cache leads to a problem determining when to invalidate the cache, and how to quickly invalidate many entries in the cache, and how to build a thread-safe cache. It is not surprising that, for all but the hot code, the quick and simple approach is often best.

For an example of building a deferral mechanism, see

Better intersection algorithms

There are several places in the code where vertexs are moved to improve some measurement, but their movement must be constrained by not introducing intersections.

Rather than testing one entity involved in the intersection against every other entity, several techniques are used to reduce possible suspects

  1. mrishash.c maps every face into a 3D voxellation of the space. It knows that the faces will not intersect if their voxellations don't. This gives good results but both the insertion and lookup are quite expensive
  2. realm.c creates a 3D tree of the space, maps each vertex into one node in the tree, and maps each face into the common ancestor of the vertex nodes. It knows that faces will not intersect if they don't share subtrees. This approach has fast insertion, removal, and lookup. The structure has to be maintained as the defect elimination code adds and removes entities as well as moving them.
  3. realm.c also optimizes the determination of intersecting great arcs during defect repair

Making such algorithms possible

Manipulations of related fields should be done in only a few places

Consider three of MRIS's members- max_vertices, nvertices, vertices

There is an obvious rule - nvertices <= max_vertices && max_vertices is the number of elements vertices points to

Maintaining this rule is hard, when code in many places is assigning to nvertices.

To stop this from happening, I changed nvertices to a const int, causing all the assignments to become error messages. Then I wrote a small set of functions for manipulating these three values, and editted all the errors to use these functions.

The same problem and solution was adopted with nfaces. Here is was harder and more important because there are multiple vectors that must be nfaces long, and the original code was changing nfaces in several places without updating all of the relevant vectors.