MorphoOptimizationProject_DebuggingNondeterminism

Parent: MorphoOptimizationProject

===Overview=== Nondeterminism shows up as a difference between runs, either in the stdout lines or in the created files. The problem is to find where the difference first appears.

My general approach is:

Add debugging code in the serial portion of the code that produces more output on stdout. I use stdout rather than stderr because the two streams are flushed at different times, so stderr output is not well correlated with the stdout lines. I only put it in the serial code to reduce spurious differences.
Run with 1 and 8 threads to maximize the difference in behavior.
Use bcompare or kdiff3 to compare the outputs.
Add code to have the diverging loop execute serially, but with an iteration order that can be changed by an environment variable.
Run it with the two different orders (I used iterating from lo to hi, and from hi to lo).
Identify the failing iteration.
Add run-time conditionalized tracing within the code executed by the iterations, but only enable it during the iterations just before and including the difference.
Add more such tracing to 'binary search' for a narrow range where the variation happened.
Use either debugging or tracing to compare the two runs, and understand the cause and impact of any variation.
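The run-time conditionalized tracing step can be sketched as follows. This is only an illustration, not code from the project: the environment variable names TRACE_ITER_LO and TRACE_ITER_HI, the helper names, and the stand-in loop body are all assumptions.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical trace window, inclusive. -1 means tracing is off. */
static int trace_lo = -1;
static int trace_hi = -1;

/* Read the window once, from the serial part of the program.
 * TRACE_ITER_LO and TRACE_ITER_HI are assumed names, not project settings. */
static void trace_init(void) {
    const char *lo = getenv("TRACE_ITER_LO");
    const char *hi = getenv("TRACE_ITER_HI");
    trace_lo = lo ? atoi(lo) : -1;
    trace_hi = hi ? atoi(hi) : -1;
}

/* Returns 1 when iteration i falls inside the trace window. */
static int trace_enabled(int i) {
    return trace_lo >= 0 && i >= trace_lo && i <= trace_hi;
}

/* Stand-in for the real loop body: traces only inside the window. */
static double do_iteration(int i) {
    double result = i * 0.5;
    if (trace_enabled(i)) {
        /* %.17g prints a double with enough digits to expose any difference */
        fprintf(stdout, "iter %d: result=%.17g\n", i, result);
    }
    return result;
}
```

Narrowing the window between runs gives the 'binary search' without rebuilding, and keeping the output on stdout keeps it in order with the other debugging lines.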

===Details=== The most useful additional output was the hash of the MRIS, because that is the main data that is output from one step and input to the next. If the hash is the same between runs, it is likely that the final outputs will be the same.

The comparison soon revealed where the divergence happened, but to nail this down I added code to romp_support.c that counts the number of parallel loops executed (not iterations, but the number of times such a loop is executed). To enable this tracing, at the start of utils/romp_support.c there is a line

static const int tracing = [0 or 1];

Rebuild. Now when you run, you will get stdout lines that show when the parallel loops have executed, and you will see at the end all the loops that have executed.

Now add fprintf calls close to where those parallel loops are, to show what their inputs and outputs are. To show an MRIS, simply print its hash using mris_print_hash. If the MRIS has the same state, its hash will be the same. It is highly unlikely that two different ones will have the same hash.
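To illustrate why printing a hash is enough to spot a divergence, here is a minimal stand-in. It is not the implementation behind mris_print_hash: the Vertex type, the function names, and the choice of the FNV-1a hash are assumptions made for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical vertex type, much simpler than the real MRIS vertex. */
typedef struct { float x, y, z; } Vertex;

/* Fold len bytes into a running 64-bit FNV-1a hash. */
static uint64_t fnv1a(uint64_t h, const void *data, size_t len) {
    const unsigned char *p = (const unsigned char *)data;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;          /* FNV-1a 64-bit prime */
    }
    return h;
}

/* Hash each field separately so struct padding never enters the hash. */
static uint64_t surface_hash(const Vertex *v, size_t n) {
    uint64_t h = 14695981039346656037ULL;   /* FNV-1a 64-bit offset basis */
    for (size_t i = 0; i < n; i++) {
        h = fnv1a(h, &v[i].x, sizeof v[i].x);
        h = fnv1a(h, &v[i].y, sizeof v[i].y);
        h = fnv1a(h, &v[i].z, sizeof v[i].z);
    }
    return h;
}

/* Print one short line per call so that diffing two runs is cheap. */
static void surface_print_hash(FILE *f, const char *label,
                               const Vertex *v, size_t n) {
    fprintf(f, "%s hash=%016llx\n", label,
            (unsigned long long)surface_hash(v, n));
}
```

Because the hash is computed over the raw bytes, even a one-bit difference in a single coordinate changes the printed line.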

Now diff'ing the outputs should enable you to zoom into where the two runs differ. The biggest problem is finding shared variables - variables written by one thread while being read or written by another. Intel's Inspector tool is aimed at finding these, and is very powerful.

A simple fast-executing alternative for finding some of the violations is to have code keep track of which omp parallel loop iteration has accessed each object for read or write, and check that either only one iteration accesses the object, or that all the accesses are reads. Sadly this requires adding significant code. In some projects I have worked on, major classes have a member that specifies the current owner.
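A minimal sketch of such access tracking, assuming one tag per object. The AccessTag type and its functions are hypothetical, and the tag updates are not themselves thread-safe, so this form suits a loop that has been forced to run serially; in a truly parallel run the updates would need to be atomic.

```c
#include <assert.h>
#include <stdio.h>

#define NO_ITER (-1)

/* Hypothetical per-object record of which loop iterations touched it. */
typedef struct {
    int first_iter;   /* first iteration to touch the object, or NO_ITER */
    int multi_iter;   /* nonzero once a second iteration has touched it */
    int written;      /* nonzero once any iteration has written it */
} AccessTag;

/* Call before each execution of the parallel loop. */
static void tag_reset(AccessTag *t) {
    t->first_iter = NO_ITER;
    t->multi_iter = 0;
    t->written = 0;
}

/* Record an access by iteration iter.
 * Returns 1 while the pattern is legal (a single iteration, or reads only)
 * and 0 once an object written by one iteration is touched by another. */
static int tag_access(AccessTag *t, int iter, int is_write) {
    if (t->first_iter == NO_ITER) {
        t->first_iter = iter;
    } else if (t->first_iter != iter) {
        t->multi_iter = 1;
    }
    if (is_write) {
        t->written = 1;
    }
    return !(t->multi_iter && t->written);
}
```

A caller would typically wrap each read or write of a tracked object in `assert(tag_access(&obj->tag, i, is_write))`, or print the object and iteration number when the check fails.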

Another alternative, once the problem loop is identified, is to do the iterations in a different order - say hi to lo - printing out the behavior of each iteration. The behavior should be the same, so the iterations that do different things give a hint as to the problem.
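One way to switch the order without rebuilding is sketched below. The environment variable name LOOP_ORDER, its value hi2lo, and the helper names are assumptions for illustration, not existing project settings.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical switch: LOOP_ORDER=hi2lo reverses the order,
 * anything else (or unset) keeps lo to hi. */
static int loop_reversed(void) {
    const char *s = getenv("LOOP_ORDER");
    return s != NULL && strcmp(s, "hi2lo") == 0;
}

/* Map the k-th executed step to an iteration index in [lo, hi). */
static int iteration_for_step(int k, int lo, int hi, int reversed) {
    return reversed ? hi - 1 - k : lo + k;
}

/* Run the loop body serially in the selected order, announcing each
 * iteration on stdout so the two runs can be matched up by index. */
static void run_loop_serially(int lo, int hi, void (*body)(int i)) {
    int reversed = loop_reversed();
    for (int k = 0; k < hi - lo; k++) {
        int i = iteration_for_step(k, lo, hi, reversed);
        fprintf(stdout, "iteration %d begin\n", i);
        body(i);
    }
}
```

Since the two runs print their iterations in opposite orders, sort the output by iteration index (or compare per-iteration blocks) before diffing.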

Getting the same answer for different reasons

Sometimes the code is searching a list of candidates for any one that matches a criterion, e.g. does this face intersect any other?

The function that returns true or false should also return at least one candidate, so that if one run matches and another does not, we know at least one matching candidate to investigate.
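A sketch of that interface, using one-dimensional intervals as a stand-in for faces. The Interval type and both function names are hypothetical; the point is only the extra out-parameter that reports which candidate matched.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for a face: a closed interval on a line. */
typedef struct { double lo, hi; } Interval;

static bool intervals_overlap(const Interval *a, const Interval *b) {
    return a->lo <= b->hi && b->lo <= a->hi;
}

/* Returns true if items[self] overlaps any other item.
 * On success, *match (if not NULL) receives the index of the first
 * overlapping candidate, so a diverging run has something to inspect. */
static bool overlaps_any(const Interval *items, size_t n, size_t self,
                         size_t *match) {
    for (size_t j = 0; j < n; j++) {
        if (j == self) {
            continue;
        }
        if (intervals_overlap(&items[self], &items[j])) {
            if (match != NULL) {
                *match = j;
            }
            return true;
        }
    }
    return false;
}
```

Printing the reported index alongside the true or false result also exposes the case in this section's title: two runs that both return true but matched different candidates.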