Ways to reduce serial code elapsed time
Many books, papers, and courses are available about this topic. Here is a very brief overview, with descriptions of the techniques being applied to Freesurfer.
The execution of the program requires fetching both instructions and data from memory, via a multi-layered cache, executing the instructions, and storing the results. Getting the data from the memory and distant levels of the cache can be between tens and thousands of times slower than the instruction execution.
To reduce the execution time, find ways of removing data movement and instruction execution. Data movement is reduced by reordering the data placement to pack the 64 byte cache line fetches with useful data, and by reordering the instructions so that fetched data is reused while it is available rather than having to refetch it. Instruction execution is removed by looking for ways to eliminate large numbers of operations, and then by looking more closely at just the loops in the code that take the most time.
In short: use fewer operations or use cheaper operations.
However it is important to remember that a computer has distinct subsystems
- Core's doing integer operations
- Floating point units doing operations such as add, sub, mul
- Floating point units doing expensive operations such as div, sqrt
A Hierarchy of memory caches, connected by memory buses - see https://software.intel.com/en-us/articles/why-efficient-use-of-the-memory-subsystem-is-critical-to-performance
One needs to determine which of these subsystems is the bounding factor on the performance of each portion of the program.
As described in MorphoOptimizationProject_improvingParallelism, there are several ways to measure the performance, but Intel's VTune tool provides insights that are hard to get other ways. It is always best to measure to find the places in the code that are truly hot, to find the real cause of the heat, and to guide the fix.
Using fewer operations
Sometimes it is better to defer calculations until the result is definitely needed
It is quick and simple to write code that says
- Compute some properties of all vertexs into per-vertex data member
- Loop over some subset of the vertexs, using those properties
but this approach is only efficient if most of the data computed in the first loop is used in the second.
If the second loop does not use all the data, it may be best to instead defer the calculation until it is needed. This risks redoing the calculation, which could also be expensive, so it may be necessary to add a cache. Adding the cache leads to a problem determining when to invalidate the cache, and how to quickly invalidate many entries in the cache, and how to build a thread-safe cache. It is not surprising that, for all but the hot code, the quick and simple approach is often best.
For an example of building a deferral mechanism, see
Fast invalidation of caches and tables
There are several places in our code where tables are filled in, and later need all entries invalidated. For instance in realm.c there is a vector, passed, which acts as a set of index. There are four basic operations
The set is initialized by creating the passed vector with all entries zeroed, and setting the passedClock to 1.
To test if an entry is in the set, the passed[index] value is compared to see if it is equal to passedClock
To add an entry to the set, the passed[index] is set to passedClock
To empty the set, the passedClock is incremented.
It is the efficiency of this last operation that is particularly valuable.
Cheaper matrix operations
The code has several different implementations of matrices and matrix operations, coming from varying sources. However almost all the matrices used are 1x3 or 3x3 or, sometimes 1x4, and 4x4. It is rare to exceed this. However almost all the code is written using generalized algorithms that do not know this.
Some measurements were suggesting that the locking caused by using malloc and free were contributing the to serializing of parallel code, so I made changes to eliminate these mallocs and frees, as described here.
qsort is used extensively in our code, and it is an excellent sorting algorithm, but it does have two significant penalties that may make it worth replacing in hot code
- It uses a call-back to do the comparison
- It does not use parallelism
There was one place in the serial code for mris_fix_topology where sorting was important, and so I added include/sort.h to provide a sort that doesn't have these handicaps.
Using specialized algorithms requiring auxiliary data structures
The code has the essential large data structures of the MRIS with its arrays of VERTEX and FACE, and algorithms can often be thought of, and coded, as manipulations of these. Sometimes the algorithm repeatedly makes a decision about the same entities in this data structure, and time can be saved by making it possible to ignore many of them or to visit them just once, rather than repeatedly, during the execution.
Using hash tables to accelerate searches
mrisReadSTLfile is an example of a using a hash table to avoid repeatedly searching many objects for ones that match some property.
Precomputing some information outside nested loops
Compilers are not good at turning
- for I
- for J
- answer[I,J] = function(J) * expensiveFunction(I)
- for J
- for I
- table[I] = expensiveFunction(I)
- for J
- answer[I,J] = function(J) * table[I]
but this optimization can give massive reductions in execution times.
Exploiting the spatial nature of Freesurfer's data
Our data represents a physical reality where only close things interact with each other. The vertexs and faces etc. are numbered, but sadly the physical closeness is not fully captured by the numbering, because of the folding of the surface. However by partitioning the vertexs etc. by their physical location, some of the hottest loops can have many of their iterations dropped from consideration.
There are several places in the code where vertexs are moved to improve some measurement, but their movement must be constrained by not introducing intersections.
Rather than testing one entity involved in the intersection against every other entity, several techniques are used to reduce possible suspects
- mrishash.c maps every face into a 3D voxellation of the space. It knows that the faces will not intersect if their voxellations don't. This gives good results but both the insertion and lookup are quite expensive
- realm.c creates a 3D tree of the space, maps each vertex into one node in the tree, and maps each face into the common ancestor of the vertex nodes. It knows that faces will not intersect if they don't share subtrees. This approach has fast insertion, removal, and lookup. The structure has to be maintained as the defect elimination code adds and removes entities as well as moving them.
realm.c also optimizes the determination of intersecting great arcs during defect repair
The spatial nature of the data also makes it possible to do quick checks before doing expensive checks to filter out some candidates, for instance MorphoOptimizationProject_distanceToNearestFace
Using cheaper operations
Square root and trig functions can be surprisingly expensive when used in hot loops.
'sqrt' can often be replaced by observing that 'sqrt(a) < sqrt(b)' is the same as 'a < b', so rather than choosing the shortest length of a line (length requiring the calculation of sqrt), calculate the shortest length-squared and then if really needed, take the sqrt only of the shortest length.
Trig functions can sometimes be replaced by just the first few terms of their taylor series. Near zero, sin(x) is x-x**3/6, cos(x) is 1-x**2/2. asin and acos and atan have similar simple approximations.
Alternatively a table can be be used to give a few digits accuracy approximations that are often adequate.
Making such algorithms possible
Manipulations of related fields should be done in only a few places
Consider three of MRIS's members- max_vertices, nvertices, vertices
There are requirements that the rest of the code relies on
nvertices <= max_vertices
max_vertices must be the number of elements vertices points to
if the max_vertices is increased, the vertices must be realloc'ed
if the vertices is increased, the new VERTEX elements must be bzero'ed
Maintaining theses rule is difficult when many places in the code are assign to nvertices.
To stop this from happening, I changed nvertices to a const int, causing all the assignments to become error messages. Then I wrote a small set of functions for manipulating these three values, and editted all the errors to use these functions.
The same problem and solution was adopted with nfaces. Here is was harder and more important because there are multiple vectors that must be nfaces long, and the original code was changing nfaces in several places without updating all of the relevant vectors.
To compute or to cache... that is the question
The gcas->prior member illustrates this tradeoff beautifully. Almost all the hot uses of it needed log(gcas->prior), and the call to the log was hot, so the member was renamed and the new code sets it via a call to gcas_setPrior(gcas,prior). This call computes and saves both the prior and its log.
Calls to get the log via gcas_getPriorLog can then fetch the precomputed value.
Since setting the log did not show up as a hot spot, and since it appeared at least at a cursory examination that all the logs were repeatedly used, I did not investigate adding a full cache mechanism.
See the full pull request at https://github.com/freesurfer/freesurfer/pull/216/files
Complete example - mris_register's alignment code
mris_register has an alignment algorithm that implements the following pseudo-code
- A list of 3D vertices around the (0,0,0) origin, with an intensity at each vertex
- A 2D picture that is effectively the projection of a sphere onto a containing open-ended cylinder that is truncated
- The 3 angles, Alpha Beta and Gamma, that rotate the vertices so that their projection onto the cylinder from the origin has a intensities closest to those on the cylinder
The previous algorithm for finding the least (A,B,G)
For Alpha, Beta, and Gamma iterating stepping over a range of angles. Typically there are about 8x8x8 resulting (A,B,G) possibilities.
- rotate all vertices through the 3 angles
- project each in turn onto a sphere, then onto the 2D picture, and sum the differences of the vertex's intensity and the 2D picture's value
A rough estimate of the work per vertex per (A,B,G) possibility is
- A 3x3 * 3x1 matrix multiply to do the rotation
- 3 squares, a squareroot and three divides to do the projection onto the sphere
- Two trig functions to project into the picture
for a total of 12 muls, 3 divides and a sqrt, and 2 trig functions for each of 512 possible (A,B,G).
The new algorithm
A rough estimate of the work per vertex per (A,B,G) possibility is
Fetch the vertex x,y,z into dense vector. This will save lots of memory accesses during the next steps, because each cache line fetched will be full of useful data. During this fetch, project them onto the sphere, because rotation will keep them on the sphere's surface. It is 3 muls, 1 divs (to get the inverse), 3 more muls (by the inverse, to do the div), and a sqrt. It replaces the old code doing these for each (A,B,G) possibility.
For each Gamma, rotate the x,y,z vertices into a second temp data set. This rotation is a 2x2 * 2x1 matrix multiply, and has to be done only 8 times.
For each Gamma, for each Beta, rotate the second temp data set to get a third temp data set. This rotation is a 2x2 * 2x1 matrix multiply, and has to be done only 8x8 times.
Project this point onto the 2D image with the first of the Alpha's. Because the alpha rotation of the sphere is around the axis of the cylinder, the various alpha rotations hit points in a line parallel to the edge of the image, and hence no projection is required, just an addition. This projection is two trig functions but done only 8x8 times.
The result is each vertex gets 6 mul, 1 div, 1 sqrt + 8 lots of 4 muls + 64 lots of 4 muls + 64 lots of two trig functions = about 300 muls + 1 sqrt + 128 trig functions. Compare this to the old code's 6000 muls, 512 sqrt's, 1536 sqrts, and 1024 trig functions, and it is easy to see why the new code takes much less time to execute.
Refining the (A,B,G)
The old algorithm tried one grid around the initial (A,B,G) angle, found the lowest point on this grid, and repeated the search over a tighter grid around this point.
The new algorithm refines this, to reuse some of the points on the wider grid on the tighter grid, and thus not have to recalculate them. Basically each tighter grid shares about 1/8th of its points with the previous grid, for about a 12% saving in execution time.