This is easily done, at least on Linux.
rm -rf oprofile_data
operf -g -t ./mris_fix_topology . . .
opreport --callgraph
The resulting table is a little harder to understand, but it is essentially a list of hot spots. Each entry lists some of the hot spot's callers, then the hot spot itself on a slightly less indented line, then some of the functions it calls. Typically you only need to look at the first few hot spots, because they are the most important. The % samples column tells you how important each hot spot (the slightly less indented line) is compared to the others.
Inlining often makes functions disappear from the profile, including very large functions that have only one caller. The NOINLINE macro in include/base.h can be used to prevent this.
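As a rough illustration of the inlining point, the sketch below marks a single-caller function so that it stays a separate symbol in the profile. The macro definition here is an assumption (GCC/Clang noinline attribute); the real NOINLINE in include/base.h may be defined and placed differently. The function names sum_squares and hot_caller are hypothetical.

```c
/* Assumed stand-in for NOINLINE; the real definition lives in
   include/base.h and may differ. Relies on the GCC/Clang attribute. */
#ifndef NOINLINE
#if defined(__GNUC__)
#define NOINLINE __attribute__((noinline))
#else
#define NOINLINE /* no-op on compilers without the attribute */
#endif
#endif

/* Hypothetical hot function with exactly one caller. Without NOINLINE an
   optimizing build may fold it into hot_caller, so its samples would be
   attributed to hot_caller instead of appearing under its own name. */
static NOINLINE long sum_squares(long n)
{
    long total = 0;
    for (long i = 1; i <= n; i++) {
        total += i * i;
    }
    return total;
}

/* The only caller of sum_squares. */
long hot_caller(long n)
{
    return sum_squares(n);
}
```

With the attribute in place, sum_squares should show up as its own entry in the opreport output rather than being merged into hot_caller.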
This requires a rebuild after editing include/romp_support.h to enable it. The resulting indented display shows approximately how much elapsed time was spent in each parallel region and approximately how well the available CPUs were used.
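One plausible way to read "how well the available CPUs were used" is as the ratio of CPU time consumed to the CPU time that was available during the region. The helper below is a hypothetical sketch of that arithmetic only; it is not taken from include/romp_support.h, and the actual display may compute its figures differently.

```c
/* Hypothetical helper, not part of include/romp_support.h.
   Returns the fraction of available CPU time that a region actually used:
   cpu_seconds / (elapsed_seconds * ncpus).
   A value near 1.0 means all CPUs were busy; a value near 1.0 / ncpus
   means the region ran mostly serially. */
double cpu_utilization(double cpu_seconds, double elapsed_seconds, int ncpus)
{
    if (elapsed_seconds <= 0.0 || ncpus <= 0) {
        return 0.0; /* nothing meaningful to report */
    }
    return cpu_seconds / (elapsed_seconds * (double)ncpus);
}
```

For example, a region that takes 2 seconds of elapsed time on 4 CPUs while consuming 4 CPU-seconds used half of the available capacity.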
Using Intel VTune