Notes on FreeSurfer code optimization

This page is for free-form notes on ways to optimize the FreeSurfer code base, from simple fixes to larger-scale problems.

Format is: name:, <short label> - description

* nick:, -ffast-math - Try the -ffast-math flag of gcc v4.x. Prior experiments with this on the AMD compiler showed output differences in recon-all, but perhaps selective use of this flag is possible.

* nick:, SSE math lib - Replace instances of sin, cos, log and exp with routines optimized for the SSE instructions found on Intel processors. See http://gruntthepeon.free.fr/ssemath/. Tried this, but ran into problems and gave up. With some wrangling it might be possible to build with it optionally via #ifdefs. Note that it is a header file rather than a library.

* Richard:, MRI data structure - The pointer-chasing implied by ***slices is horrific for the CPU caches (and a non-starter on the GPU). The 'chunking' alternative is much better, but needs to be used uniformly.
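
To make the contrast with ***slices concrete, here is a minimal sketch of a volume stored in one contiguous allocation, where a voxel address is computed arithmetically instead of by following three pointers. The FlatVolume type and the flat_volume_* functions are hypothetical names made up for this note; they are not the existing MRI struct or its chunked variant.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical flat volume: one contiguous buffer, no per-row pointers. */
typedef struct {
  int width, height, depth;
  float *data; /* width * height * depth voxels, x varies fastest */
} FlatVolume;

/* Allocate a zero-filled volume; returns NULL on allocation failure. */
static FlatVolume *flat_volume_alloc(int width, int height, int depth) {
  assert(width > 0 && height > 0 && depth > 0);
  FlatVolume *vol = malloc(sizeof *vol);
  if (!vol) return NULL;
  vol->width = width;
  vol->height = height;
  vol->depth = depth;
  vol->data = calloc((size_t)width * height * depth, sizeof *vol->data);
  if (!vol->data) {
    free(vol);
    return NULL;
  }
  return vol;
}

/* Linear offset of voxel (x, y, z): pure arithmetic, no pointer lookups. */
static size_t flat_volume_index(const FlatVolume *vol, int x, int y, int z) {
  assert(x >= 0 && x < vol->width);
  assert(y >= 0 && y < vol->height);
  assert(z >= 0 && z < vol->depth);
  return ((size_t)z * vol->height + y) * vol->width + x;
}

static void flat_volume_free(FlatVolume *vol) {
  if (vol) {
    free(vol->data);
    free(vol);
  }
}
```

With this layout, neighbouring voxels along x are adjacent in memory, and the whole buffer can be copied to a GPU in a single transfer.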

* Richard:, MATRIX data type - If this is only used for 4x4 affine transformations, it should be coded as such. If used more generally too, then a separate 'Affine' class should be considered (this exists for the GPU already in the file affinegpu.cu)
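
As a rough illustration of what a dedicated affine type could look like on the CPU side, the sketch below uses a fixed-size 4x4 array so that no heap allocation or size checks are needed. The Affine4 type and affine4_* functions are hypothetical and are not taken from affinegpu.cu; the apply routine assumes the bottom row is (0, 0, 0, 1).

```c
#include <assert.h>

/* Hypothetical fixed-size affine transform; storage is inline, row-major. */
typedef struct {
  float m[4][4];
} Affine4;

static Affine4 affine4_identity(void) {
  Affine4 a = {{{0}}};
  for (int i = 0; i < 4; i++) a.m[i][i] = 1.0f;
  return a;
}

static Affine4 affine4_translation(float tx, float ty, float tz) {
  Affine4 a = affine4_identity();
  a.m[0][3] = tx;
  a.m[1][3] = ty;
  a.m[2][3] = tz;
  return a;
}

/* r = a * b; the loop bounds are compile-time constants. */
static Affine4 affine4_multiply(const Affine4 *a, const Affine4 *b) {
  assert(a && b);
  Affine4 r;
  for (int i = 0; i < 4; i++) {
    for (int j = 0; j < 4; j++) {
      float sum = 0.0f;
      for (int k = 0; k < 4; k++) sum += a->m[i][k] * b->m[k][j];
      r.m[i][j] = sum;
    }
  }
  return r;
}

/* Transform a 3D point, assuming the bottom row of a is (0, 0, 0, 1). */
static void affine4_apply(const Affine4 *a, const float in[3], float out[3]) {
  assert(a && in && out);
  for (int i = 0; i < 3; i++) {
    out[i] = a->m[i][0] * in[0] + a->m[i][1] * in[1] +
             a->m[i][2] * in[2] + a->m[i][3];
  }
}
```

Because the size is fixed, such a struct can be passed by value and the compiler is free to unroll the loops.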

* Richard:, Boundary conditions - MRIconvolve1d and MRImean handle out-of-range accesses differently. MRImean effectively returns zero, whereas MRIconvolve1d uses the [x|y|z]i pointers, which clamp to the edge of the range. There are probably other places where this happens. A uniform treatment would be best.
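
One possible shape for a uniform treatment is a single sampling helper that takes the boundary policy as an explicit argument, so every filter states which behaviour it wants. The sketch below is a 1D toy with hypothetical names (BoundaryPolicy, sample_1d, mean3_1d); it does not reproduce the actual MRImean or MRIconvolve1d code, only the two behaviours described above.

```c
#include <assert.h>

/* Hypothetical policy enum covering the two behaviours seen so far. */
typedef enum { BOUNDARY_ZERO, BOUNDARY_CLAMP } BoundaryPolicy;

/* Read data[i], resolving out-of-range indices according to the policy. */
static float sample_1d(const float *data, int n, int i, BoundaryPolicy policy) {
  assert(data && n > 0);
  if (i >= 0 && i < n) return data[i];
  if (policy == BOUNDARY_ZERO) return 0.0f; /* treat outside as zero */
  return data[i < 0 ? 0 : n - 1];           /* clamp to nearest edge */
}

/* Three-point mean centred on i; all edge handling goes through sample_1d. */
static float mean3_1d(const float *data, int n, int i, BoundaryPolicy policy) {
  return (sample_1d(data, n, i - 1, policy) +
          sample_1d(data, n, i, policy) +
          sample_1d(data, n, i + 1, policy)) / 3.0f;
}
```

For the signal {3, 6, 9}, the mean at index 0 is 3 under the zero policy and 4 under the clamp policy, which is the kind of discrepancy that currently depends on which routine is called.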

* Richard:, Data structure memory management - Data structures that allocate RAM, or that are allocated as arrays, should always carry their lengths with them. I'm thinking particularly of GCA_SAMPLE arrays here, but I imagine there are other examples.
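
A minimal sketch of the idea: wrap the pointer and its element count in one struct, so the length cannot get separated from the array. The Sample struct here is a placeholder with made-up fields, not the real GCA_SAMPLE, and SampleArray and the sample_array_* functions are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Placeholder element type; the fields are illustrative only. */
typedef struct {
  int x, y, z;
  float value;
} Sample;

/* The array and its length travel together. */
typedef struct {
  size_t count;
  Sample *items;
} SampleArray;

/* Allocate count zeroed samples; returns 0 on success, -1 on failure. */
static int sample_array_init(SampleArray *arr, size_t count) {
  assert(arr);
  arr->items = count ? calloc(count, sizeof *arr->items) : NULL;
  arr->count = arr->items ? count : 0;
  return (count == 0 || arr->items) ? 0 : -1;
}

static void sample_array_free(SampleArray *arr) {
  assert(arr);
  free(arr->items);
  arr->items = NULL;
  arr->count = 0;
}

/* Callers need no separate length argument. */
static float sample_array_sum(const SampleArray *arr) {
  assert(arr);
  float sum = 0.0f;
  for (size_t i = 0; i < arr->count; i++) sum += arr->items[i].value;
  return sum;
}
```

Functions then take a single SampleArray argument, which also makes it straightforward to size a device buffer when copying the data to the GPU.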

* Richard:, MRI data structure - Do we really need support for all of UCHAR, SHORT, LONG and FLOAT? On the GPU, manipulating data types which aren't 4-byte aligned is slow, and I imagine that modern CPUs face similar difficulties. The smaller types do save some RAM, but only by a factor of four at most; if we want to edit bigger volumes or long sequences, we should be thinking about better data structures, not trying to 'cram down' the existing ones.