Freesurfer Test Plan
WORK IN PROGRESS
This page documents the Freesurfer software test plan. A formal software test plan ([http://en.wikipedia.org/wiki/Test_plan see Wikipedia reference]) describes a systematic approach to testing a software application (or suite), and includes these elements:
- Scope of testing
- Test deliverables
- Release criteria
- Risks and contingencies
The tests should also cover the following categories:
- Functional - can the software execute its basic functionality under optimal conditions?
- Boundary - determine the breaking points of the software, and whether the software gracefully handles input near and beyond these boundaries.
- Stability - gauge the long-term behavior of the software: whether it has a memory leak, or is prone to crashes that are not repeatable in any single run of the other tests.
- Coverage - what percentage of the code-base is exercised by the tests?
- Performance - produce benchmarks on the performance of the software.
The Freesurfer test plan is a work in progress. It was not developed top-down, but rather grown bottom-up as necessity and time have dictated. The goal is to build a test suite that meets the criteria of a formal test plan. This will take time.
The current test suite is an ad-hoc collection of test scripts and C/C++ code providing rudimentary testing of most of the Freesurfer code-base, consisting of unit, module and system tests. The primary aim of these tests is simple: the output files produced by the recon-all stream ([wiki:ReconAllDevTable as documented here]) must be 'correct' relative to reference files that are known to be 'correct', as determined by manual inspection or by some formal method (such as a table of precalculated results from another program). The word 'correct' is in quotes because Freesurfer, being a research tool, is constantly evolving, and because any complex scientific software application has inherent variability.
The term 'unit test' is defined in our Freesurfer test plan to mean a test of a Freesurfer binary (such as mri_ca_register) or of something smaller (a subroutine). These tests use the 'make check' framework built into the 'make' utility (and the 'automake' tools). The 'check' target of 'make' builds and runs the tests that a developer has written to test whatever the 'all' target of a Makefile builds. Freesurfer has a number of 'make check' tests, and 'make check' is run after 'make' on each nightly build platform (see the section [wiki:DevelopersGuide/MartinosCenter "How the nightly build works"] for details).
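As a rough illustration of what a single 'make check' test does, the sketch below runs a command and compares its output with an expected value. It is not taken from the Freesurfer test scripts: the function name `run_unit_test` is hypothetical, and a Python one-liner stands in for a real binary. The return values follow the Automake convention for test scripts, where exit status 0 means pass and 77 means skipped.

```python
import subprocess
import sys

# Automake test-script convention: 0 = pass, 77 = skipped, other non-zero = fail.
PASS, FAIL, SKIP = 0, 1, 77


def run_unit_test(command, expected_stdout, timeout=60):
    """Run `command` and compare its standard output with `expected_stdout`.

    Hypothetical helper for illustration only.
    """
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout
        )
    except FileNotFoundError:
        return SKIP  # the binary was not built on this platform
    except subprocess.TimeoutExpired:
        return FAIL  # the binary hung
    if result.returncode != 0:
        return FAIL  # the binary itself reported an error
    if result.stdout.strip() == expected_stdout.strip():
        return PASS
    return FAIL


# Stand-in for a real binary: a Python one-liner whose output is known.
status = run_unit_test([sys.executable, "-c", "print(2 + 2)"], "4")
print("status:", status)
```

A real test script would end with `sys.exit(status)` so that 'make check' can read the result from the exit status.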
Future - To formalize the unit tests, documentation (a wiki page) should be created which lists 1. all the binaries used in recon-all, 2. other important binaries not in the stream, and 3. the critical subroutines, as determined by name (see Bruce Fischl and Doug Greve) and/or by profiling the binaries during a run of the recon-all stream. For each of these, the table should list the name of the test (as run by 'make check'). Such a table makes it possible to assess coverage and to identify tests that still need to be developed.
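A minimal sketch of how such a table could be used to assess coverage follows. The table contents are invented for illustration: the binary names are real Freesurfer binaries, but the test names and which binaries have tests are assumptions, not a statement about the actual test suite. The function name `coverage_report` is also hypothetical.

```python
def coverage_report(test_table):
    """Return (percent of entries with a test, sorted list of untested entries).

    `test_table` maps a binary or subroutine name to the name of its
    'make check' test, or to None when no test exists yet.
    """
    untested = sorted(name for name, test in test_table.items() if test is None)
    covered = len(test_table) - len(untested)
    percent = 100.0 * covered / len(test_table) if test_table else 0.0
    return percent, untested


# Illustrative table only; the test names and gaps are made up.
example_table = {
    "mri_ca_register": "test_mri_ca_register",
    "mri_convert": "test_mri_convert",
    "mris_inflate": None,
    "mri_normalize": "test_mri_normalize",
}

percent, untested = coverage_report(example_table)
print(f"{percent:.1f}% covered; tests still needed for: {untested}")
```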
The term 'module test' is defined in our Freesurfer test plan to apply, at this time, to the atlases used by the recon-all stream. The AtlasSubjects page describes how these atlases are built and tested. There are currently two module tests, summarized (from AtlasSubjects) here:
- Aseg atlas test - 27 manually-segmented subjects are automatically segmented, and Dice coefficients indicate the degree of overlap (correctness) between the manual and automatic segmentations. This test is run manually, and requires manual inspection of the results (the Dice coefficients).
- Aparc atlas test - 97 manually-parcellated subjects are automatically parcellated, and Dice coefficients indicate the degree of overlap. Again, this test is run manually, and requires manual inspection of the results.
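For reference, the Dice coefficient for one label is twice the number of voxels that carry the label in both segmentations, divided by the total number of voxels that carry it in each. The sketch below computes it on flat lists of label values; the function name and the tiny example volumes are hypothetical, and a real test would read the label volumes from files.

```python
def dice_coefficient(manual, automatic, label):
    """Dice overlap for `label` between two equally sized label sequences."""
    if len(manual) != len(automatic):
        raise ValueError("segmentations must have the same number of voxels")
    in_manual = sum(1 for value in manual if value == label)
    in_automatic = sum(1 for value in automatic if value == label)
    overlap = sum(
        1 for m, a in zip(manual, automatic) if m == label and a == label
    )
    total = in_manual + in_automatic
    if total == 0:
        # Label absent from both: treated here as perfect agreement
        # (a convention chosen for this sketch).
        return 1.0
    return 2.0 * overlap / total


# Six voxels, labels 0 (background), 1 and 2.
manual_labels = [0, 1, 1, 2, 2, 2]
automatic_labels = [0, 1, 1, 2, 2, 0]

print(dice_coefficient(manual_labels, automatic_labels, 1))  # 1.0
print(dice_coefficient(manual_labels, automatic_labels, 2))  # 0.8
```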
Future - These tests ought to be run automatically on a regular schedule, say, once a month. Whether the results pass or fail should also be determined automatically.
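One way the results could be judged automatically is to compare each Dice coefficient against a minimum acceptable value. The sketch below assumes this approach; the threshold, the subject names and the structure names are all placeholders, and choosing real thresholds would require input from those who maintain the atlases.

```python
def failing_results(dice_by_subject, threshold):
    """Return (subject, structure, dice) for every score below `threshold`.

    `dice_by_subject` maps a subject name to a dict of structure -> Dice score.
    """
    failures = []
    for subject, scores in sorted(dice_by_subject.items()):
        for structure, dice in sorted(scores.items()):
            if dice < threshold:
                failures.append((subject, structure, dice))
    return failures


# Placeholder data; these are not real results.
example_scores = {
    "subject_a": {"structure_1": 0.91, "structure_2": 0.78},
    "subject_b": {"structure_1": 0.88, "structure_2": 0.86},
}

failures = failing_results(example_scores, threshold=0.80)
for subject, structure, dice in failures:
    print(f"FAIL {subject} {structure}: Dice {dice:.2f}")
```

An empty list would mean the run passed, which a scheduled job could translate into an exit status or a report.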
The term 'system test' is defined in our Freesurfer test plan to apply to the recon-all stream as a whole. In the current setup, several test platforms, representing the supported operating systems (Linux and Mac, 32 and 64bit), each run the recon-all stream on one subject (bert), and each output file is then compared against a known-good reference set for that subject. This is briefly described in the section [wiki:DevelopersGuide/MartinosCenter "How the daily testing works"].
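The comparison step can be sketched as a walk over the reference directory that checks each file against its counterpart in the new output. This is an illustration under assumptions, not the actual daily test: the function name is hypothetical, and a byte-for-byte comparison is used for simplicity, whereas a real comparison may need format-aware tools or tolerances.

```python
import filecmp
from pathlib import Path


def compare_to_reference(output_dir, reference_dir):
    """Return (missing, different): reference files absent from or differing in the output.

    Paths in both lists are relative to the reference directory.
    """
    output_dir = Path(output_dir)
    reference_dir = Path(reference_dir)
    missing, different = [], []
    for ref_file in sorted(reference_dir.rglob("*")):
        if not ref_file.is_file():
            continue
        relative = ref_file.relative_to(reference_dir)
        out_file = output_dir / relative
        if not out_file.is_file():
            missing.append(relative.as_posix())
        elif not filecmp.cmp(ref_file, out_file, shallow=False):
            # shallow=False forces a comparison of the file contents.
            different.append(relative.as_posix())
    return missing, different
```

The test passes when both returned lists are empty; otherwise the lists name the files to inspect.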
Future - A 64bit Mac OS platform needs to be set up. Additionally, and more importantly, a bigger set of test subjects needs to be included in the test suite. Currently, just 'bert' is used, but Freesurfer by its very nature can react quite differently to different scan parameters, brain pathologies and ages. An automatic test of the Buckner40 set of subjects is necessary, and is documented here: Bucker40Testing.
Another future item is to regularly run 'valgrind' on each binary, to check for memory corruption or large memory leaks.
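A possible starting point for such a check is sketched below. It wraps a binary in a valgrind invocation using the `--leak-check=full` and `--error-exitcode` options, so that memory errors and definite leaks turn into a non-zero exit status. The helper names are hypothetical, and the arguments each binary needs are left as placeholders.

```python
import shutil
import subprocess


def valgrind_command(binary, args=()):
    """Build the argument list for running `binary` under valgrind."""
    return [
        "valgrind",
        "--leak-check=full",     # report each leak in detail
        "--error-exitcode=1",    # exit with status 1 if valgrind reports errors
        binary,
        *args,
    ]


def run_under_valgrind(binary, args=()):
    """Return True if clean, False if errors were found, None if valgrind is unavailable."""
    if shutil.which("valgrind") is None:
        return None
    result = subprocess.run(valgrind_command(binary, args))
    return result.returncode == 0


# Placeholder arguments; a real run needs each binary's actual inputs.
print(valgrind_command("mri_ca_register", ["arg1", "arg2"]))
```

Note that a binary that itself exits with a non-zero status is also reported as a failure by this sketch, so the inputs should be ones the binary is expected to handle successfully.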
Test results need to be reported firstly to those who can determine the causes of failures (and fix them), and secondly to users of the software, so that they can be aware of the general state of health of Freesurfer.
Future - A [http://public.kitware.com/Dart/HTML/Index.shtml Dart Dashboard] needs to be created to allow intuitive and global (public) reporting of test results. This application is a reporting manager, not a test framework, so existing unit, module and system test scripts would report to it (in place of, or in addition to, emailing results).