This is easily done, at least on Linux.
rm -rf oprofile_data
operf -g -t ./mris_fix_topology ...
opreport --callgraph
The resulting table is a little more difficult to understand, but it is essentially a list of hot spots. Each entry shows some of the hot spot's callers, then the hot spot itself on a slightly less indented line, then some of the functions it calls. Typically you only need to look at the first few hot spots, because they are the most important. The % samples value tells you how important the slightly less indented hot spot is compared to the others.
This requires a rebuild after editing include/romp_support.h to enable it. The resulting indented display shows approximately how much elapsed time was spent in each parallel region, and approximately how well the available CPUs were used.
Using Intel VTune