Notes on FreeSurfer code optimization

This page is for free-form entry of notes on ways to optimize the freesurfer code base, whether they be simple things, or notes on larger scale problems.

Format is: name:, <short label> - description

* nick:, -ffast-math - Try the -ffast-math flag of gcc v4.x. Prior experiments with this on the AMD compiler showed output differences in recon-all, but perhaps selective use of this flag is possible.

* nick:, SSE math lib - Replace instances of sin, cos, log and exp with routines optimized for the SSE instructions found on Intel processors. See http://gruntthepeon.free.fr/ssemath/. Tried this, but ran into problems and gave up. Some wrangling could make it possible to optionally build with it via #ifdefs. Note that it's not actually a lib but a header file.

* Richard:, MRI data structure - The pointer-chasing implied by ***slices is horrific for the CPU caches (and a non-starter on the GPU). The 'chunking' alternative is much better, but needs to be used uniformly.

* Richard:, MATRIX data type - If this is only used for 4x4 affine transformations, it should be coded as such. If it is used more generally too, then a separate 'Affine' class should be considered (this exists for the GPU already in the file affinegpu.cu).

* Richard:, Boundary conditions - MRIconvolve1d and MRImean handle out-of-range accesses differently. MRImean effectively returns zero, MRIconvolve1d uses the [x|y|z]i pointers which clamp to the edge of the range. There are probably other places where this happens. A uniform treatment would be best.

* Richard:, Data structure memory management - Data structures that allocate RAM, or that are allocated as arrays, should always carry their lengths with them. I'm thinking particularly of GCA_SAMPLE arrays here, but I imagine there are other examples.

* Richard:, MRI data structure - Do we really need support for all of UCHAR, SHORT, LONG and FLOAT? On the GPU, manipulating data types which aren't 4-byte aligned is slow, and I imagine that modern CPUs face similar difficulties. The smaller types do save some RAM, but it's only a factor of four; if we want to edit bigger volumes or long sequences, we should be thinking about better data structures, not trying to 'cram down' the existing ones.

* Richard:, const correctness - It would be very useful if arguments could be declared const whenever possible.

* Richard:, Array ordering - Within an MRI structure, we have mri->slices[z][y][x], but within a GCA there's gca->nodes[xn][yn][zn], and a GCAmorph has gcam->nodes[i][j][k]. These should be made consistent. I'm pretty sure that this difference is the reason for the horrible performance of GCAmri: whatever the order of the loop nest, one of the structures is going to be traversed in a cache-unfriendly order. As a note, the MRI ordering is good for CUDA.

* Richard:, strcpy - Only an optimisation in the sense that safe code is optimised as compared to insecure code. I can see strcpy calls littered all over the place (e.g. in mriio.c), which is begging for trouble when someone decides to use lengthy identifiers.
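
Sketch for the 'MRI data structure' and 'Array ordering' notes: one way a uniformly 'chunked' volume might look, with a single contiguous buffer and x as the fastest-varying index (the same logical order as slices[z][y][x]). FlatVolume and the flat_volume_* functions are hypothetical names for illustration, not existing FreeSurfer types, and float voxels are assumed only to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical flat volume: one contiguous buffer instead of float ***slices. */
typedef struct {
  int width, height, depth; /* x, y, z extents */
  float *chunk;             /* width*height*depth voxels, x varies fastest */
} FlatVolume;

/* Allocate a zero-filled volume; returns 0 on success, -1 on failure. */
static int flat_volume_alloc(FlatVolume *v, int width, int height, int depth)
{
  assert(width > 0 && height > 0 && depth > 0);
  v->width = width;
  v->height = height;
  v->depth = depth;
  v->chunk = calloc((size_t)width * (size_t)height * (size_t)depth,
                    sizeof *v->chunk);
  return v->chunk ? 0 : -1;
}

static void flat_volume_free(FlatVolume *v)
{
  free(v->chunk);
  v->chunk = NULL;
}

/* Linear index of voxel (x, y, z): no pointer tables to chase. */
static size_t flat_volume_index(const FlatVolume *v, int x, int y, int z)
{
  return ((size_t)z * (size_t)v->height + (size_t)y) * (size_t)v->width
         + (size_t)x;
}

/* Sum all voxels with x in the innermost loop, so memory is walked linearly. */
static double flat_volume_sum(const FlatVolume *v)
{
  double sum = 0.0;
  for (int z = 0; z < v->depth; z++)
    for (int y = 0; y < v->height; y++)
      for (int x = 0; x < v->width; x++)
        sum += v->chunk[flat_volume_index(v, x, y, z)];
  return sum;
}
```

If the GCA and GCAmorph node arrays used the same index order, a single z/y/x loop nest could walk all three structures linearly.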
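
Sketch for the 'Boundary conditions' note: one possible way to make the out-of-range policy explicit, so that a mean and a convolution can share the same accessor. Row, BoundaryMode, row_sample and row_mean3 are hypothetical names, and the 1-D row only stands in for one line of a volume; this is not how MRImean or MRIconvolve1d are actually written.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical 1-D row of samples. */
typedef struct {
  const float *data;
  int length;
} Row;

typedef enum { BOUNDARY_ZERO, BOUNDARY_CLAMP } BoundaryMode;

/* Read sample i, applying one explicit policy to out-of-range indices. */
static float row_sample(const Row *row, int i, BoundaryMode mode)
{
  assert(row != NULL);
  if (i >= 0 && i < row->length)
    return row->data[i];
  if (mode == BOUNDARY_ZERO || row->length <= 0)
    return 0.0f;
  return row->data[i < 0 ? 0 : row->length - 1]; /* clamp to nearest edge */
}

/* 3-point mean centred on i; the caller chooses the boundary policy. */
static float row_mean3(const Row *row, int i, BoundaryMode mode)
{
  return (row_sample(row, i - 1, mode) + row_sample(row, i, mode)
          + row_sample(row, i + 1, mode)) / 3.0f;
}
```

With the samples {3, 6, 9}, the mean at index 0 is 3 under BOUNDARY_ZERO and 4 under BOUNDARY_CLAMP, which is the kind of silent difference the note describes.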
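
Sketch for the 'Data structure memory management' and 'const correctness' notes: an array wrapper that carries its own length, with read-only arguments declared const. Sample and SampleArray are hypothetical; the real GCA_SAMPLE has different fields.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical sample record, for illustration only. */
typedef struct {
  int x, y, z;
  int label;
} Sample;

/* Array that carries its own length. */
typedef struct {
  Sample *items;
  size_t count;
} SampleArray;

/* Allocate count zeroed samples; returns 0 on success, -1 on failure. */
static int sample_array_alloc(SampleArray *a, size_t count)
{
  a->items = calloc(count ? count : 1, sizeof *a->items);
  a->count = a->items ? count : 0;
  return a->items ? 0 : -1;
}

static void sample_array_free(SampleArray *a)
{
  free(a->items);
  a->items = NULL;
  a->count = 0;
}

/* Bounds-checked access: returns NULL instead of reading past the end. */
static Sample *sample_array_at(SampleArray *a, size_t i)
{
  assert(a != NULL);
  return i < a->count ? &a->items[i] : NULL;
}

/* Read-only pass: const argument, and no separate length parameter. */
static size_t sample_array_count_label(const SampleArray *a, int label)
{
  size_t n = 0;
  for (size_t i = 0; i < a->count; i++)
    if (a->items[i].label == label)
      n++;
  return n;
}
```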
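
Sketch for the 'strcpy' note: a bounded copy built on the standard snprintf, which always NUL-terminates and reports truncation. copy_string is a hypothetical helper name, not an existing FreeSurfer function, and the file names in the usage note are made up.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Copy src into dst (capacity dst_size bytes), always NUL-terminating.
 * Returns 0 if src fit completely, -1 if it had to be truncated. */
static int copy_string(char *dst, size_t dst_size, const char *src)
{
  int needed;
  assert(dst != NULL && src != NULL && dst_size > 0);
  needed = snprintf(dst, dst_size, "%s", src);
  return (needed >= 0 && (size_t)needed < dst_size) ? 0 : -1;
}
```

With an 8-byte buffer, copying "T1.mgz" succeeds, while copying "a_very_long_identifier.mgz" returns -1 and leaves "a_very_" in the buffer rather than writing past its end.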
