This is easily done, at least on Linux.
rm -rf oprofile_data
operf -g -t ./mris_fix_topology ...
The resulting call-graph table (as printed by, for example, opreport --callgraph) is a little harder to read, but it is essentially a list of hotspots. Each entry lists some of the hotspot's callers, then the hotspot itself, slightly less indented, and then some of the functions it calls. Typically you only need to look at the first few hotspots, because they are the most important. The % samples column tells you how important the slightly less indented hotspot is compared with the others.
Inlining often causes functions to disappear from the profile, including very large functions that have only one caller. Marking such a function with the NOINLINE macro from include/base.h prevents this.
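As a rough sketch of how a function can be kept visible to the profiler, the fragment below marks a helper as non-inlinable. The fallback definition of NOINLINE shown here is an assumption (a GCC/Clang-style attribute), and the real macro in include/base.h may be defined differently. The function names sum_of_squares and vector_energy are hypothetical and exist only for illustration.

```c
/* Assumption: a GCC/Clang-style definition, used only if NOINLINE is not
 * already provided. The real macro in include/base.h may differ. */
#ifndef NOINLINE
#if defined(__GNUC__)
#define NOINLINE __attribute__((noinline))
#else
#define NOINLINE
#endif
#endif

/* Hypothetical hot helper with a single caller. Without NOINLINE the
 * optimizer would probably fold it into vector_energy, and its samples
 * would then be attributed to the caller in the profile. */
NOINLINE static double sum_of_squares(const double *v, int n)
{
    double total = 0.0;
    for (int i = 0; i < n; i++) {
        total += v[i] * v[i];
    }
    return total;
}

/* Hypothetical single caller of the helper above. */
double vector_energy(const double *v, int n)
{
    return sum_of_squares(v, n);
}
```

With the helper kept out of line, it should show up as its own entry in the profile rather than being merged into its caller. Remove the annotation once the measurement is done if the inlined version is faster.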
However, a hotspot profile does not do a good job of showing execution time that is spread over many functions, so once you have driven out the hotspots this way, you need a better tool...
This requires a rebuild after editing include/romp_support.h to #define ROMP_SUPPORT_ENABLED. Once it is enabled, executables write statistics to stderr as they exit. They also try to write .csv files containing the statistics into the /tmp/ROMP_statsFiles directory.
The resulting indented display shows statistics for each scope or parallel loop that has been annotated with ROMP macros. For each one, the statistics show approximately how much elapsed time was spent in it and approximately how well the available CPUs were used.
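One plausible way to read the "how well the available CPUs were used" figure is as the ratio of CPU time consumed to the CPU time that was available during the scope. The helper below is a minimal sketch of that arithmetic only. It is not ROMP's actual formula, and the name cpu_utilisation and its parameters are hypothetical.

```c
/* Hypothetical helper, not part of ROMP: fraction of the available CPU
 * capacity that was actually used while a scope was running.
 *
 *   cpu_seconds     total CPU time consumed by all threads in the scope
 *   elapsed_seconds wall-clock time spent in the scope
 *   ncpus           number of CPUs available to the program
 *
 * Returns 0.0 when the inputs cannot describe a real measurement. */
double cpu_utilisation(double cpu_seconds, double elapsed_seconds, int ncpus)
{
    if (elapsed_seconds <= 0.0 || ncpus <= 0) {
        return 0.0;
    }
    return cpu_seconds / (elapsed_seconds * ncpus);
}
```

Under this reading, a parallel loop that consumes 6 CPU-seconds in 2 elapsed seconds on 4 CPUs scores 0.75, meaning roughly a quarter of the available capacity was idle. A value near 1/ncpus suggests the scope is effectively serial.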
Using Intel VTune