THIS PAGE HAS BEEN MOVED TO SHAREPOINT!

Please refer to this site/make edits here for the most updated information: https://partnershealthcare.sharepoint.com/sites/LCN/SitePages/Morpho-Optimization-Project.aspx





Parent: MorphoOptimizationProject

Causes and Cures

There are five main causes of parallelism not resulting in speedup:

  1. Not enough of the application is parallel
  2. Insufficient work per thread
  3. Excessive locking
  4. Excessive memory traffic
  5. Work not spread equally between the threads, aka load imbalance

Insufficient work per thread

OpenMP starts the threads when it needs to, then keeps them idle so that it can assign work to them and restart them fairly quickly. Nevertheless, it takes thousands of instructions to give a thread work, so there must be thousands of instructions' worth of work for each thread to do. Because OpenMP assigns more than one iteration of a loop to a thread at once, this means that the number of iterations divided by the number of threads, multiplied by the amount of work in one iteration, needs to be in the thousands of instructions. This is usually the case, unless you use schedule(static,1). Don't do that unless each iteration has a lot of work to do! In general, let OpenMP decide how to schedule the work unless that results in an imbalance.
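As a rough sketch of this advice (the function name and the threshold of 10000 iterations are illustrative assumptions, not values taken from FreeSurfer), the loop below gives no schedule clause, so OpenMP chooses the schedule, and it uses an if clause so that short loops stay serial:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical threshold: below this many iterations the cost of handing
 * work to the threads is assumed to outweigh the benefit. Tune by measuring. */
enum { MIN_PARALLEL_ITERATIONS = 10000 };

/* Sum of squares of the first n elements of a.
 * No schedule clause is given, so the OpenMP implementation picks its default,
 * which typically hands each thread one contiguous block of iterations.
 * Compiled without OpenMP, the pragma is ignored and the loop runs serially. */
double sum_of_squares(const float *a, int n) {
    assert(a != NULL || n == 0);
    double sum = 0.0;
    int i;
#pragma omp parallel for reduction(+:sum) if(n >= MIN_PARALLEL_ITERATIONS)
    for (i = 0; i < n; i++) {
        sum += (double)a[i] * (double)a[i];
    }
    return sum;
}
```

The right threshold depends on the work per iteration and on the machine, so it is best chosen by measurement rather than copied from here.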

Excessive locking

Locking and unlocking takes only a few instructions - unless another thread is competing for the same lock. Bouncing locks between threads is expensive.

  1. #pragma omp critical should only be used if it locks less than 1% of the work, and you are sure that it is not nested.

  2. omp_lock_t operations are safer, but you may need to lock the initialization of the lock! Partition complex data structures and have a lock per partition to reduce contention. This is good for caches where multiple threads provide cached results for each other.

  3. Create per-thread data structures to hold each thread's contribution and, if necessary, merge the results after the parallel loop has exited. It is often possible to parallelize the merge.
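The lock-per-partition idea from item 2 can be sketched as follows. This is a minimal illustration under stated assumptions: PartitionedCounts, the pc_ functions, and the sizes are hypothetical names, and the counter update stands in for a more expensive cache lookup or insert. The locks are initialized before any parallel region starts, which avoids having to lock the initialization itself.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallbacks so the sketch also builds without OpenMP. */
typedef int omp_lock_t;
static void omp_init_lock(omp_lock_t *l)    { *l = 0; }
static void omp_set_lock(omp_lock_t *l)     { *l = 1; }
static void omp_unset_lock(omp_lock_t *l)   { *l = 0; }
static void omp_destroy_lock(omp_lock_t *l) { (void)l; }
#endif

enum { NUM_PARTITIONS = 16, NUM_BUCKETS = 256 };  /* hypothetical sizes */

typedef struct {
    omp_lock_t locks[NUM_PARTITIONS];  /* one lock per partition of the buckets */
    long       counts[NUM_BUCKETS];
} PartitionedCounts;

/* Call from serial code, before any thread uses the structure. */
void pc_init(PartitionedCounts *pc) {
    int i;
    for (i = 0; i < NUM_PARTITIONS; i++) omp_init_lock(&pc->locks[i]);
    for (i = 0; i < NUM_BUCKETS; i++) pc->counts[i] = 0;
}

void pc_destroy(PartitionedCounts *pc) {
    int i;
    for (i = 0; i < NUM_PARTITIONS; i++) omp_destroy_lock(&pc->locks[i]);
}

/* Threads only contend when their keys fall in the same partition. */
void pc_count_all(PartitionedCounts *pc, const unsigned char *keys, int n) {
    int i;
    assert(n >= 0);
#pragma omp parallel for
    for (i = 0; i < n; i++) {
        int bucket = keys[i];
        omp_lock_t *lock = &pc->locks[bucket % NUM_PARTITIONS];
        omp_set_lock(lock);
        pc->counts[bucket]++;   /* stands in for the real protected update */
        omp_unset_lock(lock);
    }
}
```

For a plain counter an atomic update would be cheaper than a lock; the lock is kept here only to show the partitioning pattern.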

Excessive memory traffic

This is the biggest, and least understood, performance killer. If the data is written by one thread and is in its 256KB L2 cache, it can take the equivalent of twenty or more instructions to move the 64-byte cache line it is in to another thread.

For example, if you have a 128KB float array written in one loop and read in the following loop, it may require at least three or four operations per element to cover the cost of moving the data. There is a lot of system-specific variation in these numbers.

If most of the data is coming out of the L3 cache or the DRAM, then that can become the bottleneck very easily. This is also a problem for serial code, so the issue is described in MorphoOptimizationProject_ReducingMemoryTraffic
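One way to reduce the movement of data between threads in the write-then-read case is to give both loops the same static schedule, so that each thread reads the elements it wrote itself. The sketch below assumes both loops have the same iteration count and sit in the same parallel region; scale_then_offset is a hypothetical function, and how much this helps will vary by system.

```c
#include <assert.h>

/* Writes tmp[i] in the first loop and reads it in the second.
 * Both loops use schedule(static) with the same iteration count inside one
 * parallel region, so iteration i is handled by the same thread each time
 * and tmp[i] is likely to still be in that thread's cache. */
void scale_then_offset(const float *in, float *tmp, float *out, int n) {
    assert(n >= 0);
#pragma omp parallel
    {
        int i;
#pragma omp for schedule(static)
        for (i = 0; i < n; i++) {
            tmp[i] = 2.0f * in[i];
        }
#pragma omp for schedule(static)
        for (i = 0; i < n; i++) {
            out[i] = tmp[i] + 1.0f;
        }
    }
}
```

When the two loop bodies are this simple, fusing them into one loop avoids the intermediate array altogether; the two-loop form is kept here because it mirrors the situation described above.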

Load Imbalance

If the work is not spread evenly amongst the iterations, but is instead concentrated in a few regions of the iteration space, then the threads that are not assigned such regions must wait for those that are.

This is quite possible given the spatial locality in our data.

Diagnosis

There is one easy way to see if more parallelism can speed up your program: if two copies of the program can run simultaneously in about the same time that one copy does, then there is potential for additional parallelism to speed up one copy.

There are three levels of tools to diagnose why this potential exists.

  1. Timing code built into the application.
  2. oprofile et al.
  3. Intel's VTune product.

Of these, VTune is by far the easiest to use. Writing and interpreting timing code, or using oprofile, is significantly more difficult and cannot achieve the same level of visibility into the execution.
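For the first level, timing code built into the application, a minimal sketch might look like the following. Timer, timer_start, and timer_stop are hypothetical helpers, not FreeSurfer functions, and the sketch assumes a C11 library that provides timespec_get. It measures wall-clock time, which is the quantity that parallelism is meant to reduce.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical helper: remembers a label and the wall-clock start time. */
typedef struct {
    const char     *label;
    struct timespec start;
} Timer;

Timer timer_start(const char *label) {
    Timer t;
    t.label = label;
    timespec_get(&t.start, TIME_UTC);
    return t;
}

/* Prints the elapsed wall-clock seconds to stderr and returns them. */
double timer_stop(const Timer *t) {
    struct timespec end;
    double elapsed;
    assert(t != NULL);
    timespec_get(&end, TIME_UTC);
    elapsed = (double)(end.tv_sec - t->start.tv_sec)
            + (double)(end.tv_nsec - t->start.tv_nsec) * 1e-9;
    fprintf(stderr, "%s: %.6f s\n", t->label, elapsed);
    return elapsed;
}
```

A typical use is to call timer_start before a parallel loop and timer_stop after it, then compare the figures for different OMP_NUM_THREADS settings.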

Using VTune to improve concurrency

The following assumes Intel Parallel Studio is installed and you have executed psxevars.sh script. More about using VTune is described in MorphoOptimizationProject_usingVTune.

For mris_fix_topology the normal testing commands are

cd mris_fix_topology ; rm -rf testdata; tar xzf testdata.tar.gz
cd testdata 
cp ./subjects/bert/surf/lh.orig{.before,}
export SUBJECTS_DIR=`pwd`/subjects
../mris_fix_topology -niters 2 -mgz -sphere qsphere.nofix -ga -seed 1234 bert lh

The last is the command whose behaviour I will study.

The first step is to build the VTune Amplifier command that will collect the data. Start the GUI...

amplxe-gui &

Create a project. I made my project name study_mris_fix_topology

Specifying Launch Application

Fill in the Application by browsing to it. Mine was /home/rheticus/freesurfer_repo_nf_faster_distance_map_update_6/freesurfer/mris_fix_topology/mris_fix_topology

Copy the rest of the Application parameters from above -niters 2 -mgz -sphere qsphere.nofix -ga -seed 1234 bert lh

Specify the Working directory. For me it was /home/rheticus/freesurfer_repo_nf_faster_distance_map_update_6/freesurfer/mris_fix_topology/testdata

In the Advanced section, I increased the Result size from 500 MB to 5000 MB. I did not specify any user-defined environment variables, because we will do this ourselves below.

Specify How to analyse the execution

You may see an error message in this panel telling you to set /proc/sys/kernel/yama/ptrace_scope to 0. You must do this, for instance by...

sudo -i
echo 0 > /proc/sys/kernel/yama/ptrace_scope
exit

The retry button then clears this error.

The default How is Basic Hotspots without analyzing OpenMP regions, but we really need to understand the OpenMP behaviour, so check that box. In the Details, you can see that it is sampling every 10ms, and collecting call stacks. All the defaults here are good enough.

Copy the command line

There is a button near the bottom that will show you the command line, which can then be copied to the clipboard.

My command line is shown below.

Run the collection

Because we have so much other stuff to set up, I don't use the GUI to start the collection. Instead I go to the command line terminal, do the setup as before, and then execute the command line

cd /home/rheticus/freesurfer_repo_nf_faster_distance_map_update_6/freesurfer/mris_fix_topology/testdata
cp ./subjects/bert/surf/lh.orig{.before,}
export SUBJECTS_DIR=`pwd`/subjects
export OMP_NUM_THREADS=4
/opt/intel/vtune_amplifier_2019.0.0.553900/bin64/amplxe-cl -collect hotspots -knob analyze-openmp=true -app-working-dir /home/rheticus/freesurfer_repo_nf_faster_distance_map_update_6/freesurfer/mris_fix_topology/testdata -- /home/rheticus/freesurfer_repo_nf_faster_distance_map_update_6/freesurfer/mris_fix_topology/mris_fix_topology -niters 2 -mgz -sphere qsphere.nofix -ga -seed 1234 bert lh

Look at the result summary

There is a really simple summary printed

Summary
-------
Elapsed Time:             31.787
Paused Time:              0.0   
CPU Time:                 66.530
Average CPU Utilization:  2.095 
CPI Rate:                 1.014 

which tells us that we are only effectively using 2.095 of the available 4 cores. However to really see what is happening we need to run the GUI.

amplxe-gui r000hs/r000hs.amplxe &

The r000hs will have a new number for each run, and the hs stands for hotspot. We will see different numbers and abbreviations in later analyses.
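As a quick check on the utilization figure in the summary above (a back-of-envelope reading of the printed numbers, not additional VTune output): the average utilization is approximately the CPU time divided by the elapsed time, and dividing the reported utilization by the 4 threads requested gives the parallel efficiency.

```latex
\[
\frac{\text{CPU Time}}{\text{Elapsed Time}} = \frac{66.530}{31.787} \approx 2.09,
\qquad
\frac{\text{Average CPU Utilization}}{\text{threads}} = \frac{2.095}{4} \approx 0.52
\]
```

So roughly half of the available core time is going unused over the run as a whole.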

The Project Navigator on the LHS is worthless. Click its X button to close it.

The Top Hotspots shows us

Function                                            Module             CPU Time 
func@0x11ad0                                        libgomp.so.1       13.6
mrisComputeDefectMRILogUnlikelihood_wkr._omp_fn.47  mris_fix_topology   9.5

This immediately shows the application is spending a lot of CPU time in the gomp stuff - locking or waiting!

In the Effective CPU Utilization Histogram, slide the left-most slider to the right so that 2 cores is Poor utilisation, and click the Apply button that appears.

Look at the Bottom-Up details

Switch to the Bottom-up tab.

This view clearly shows in the bottom graphs that one core does most of the pre- and post-processing of mris_fix_topology, about 9 secs for the pre- and 4 secs for the post-. We are ignoring these because recon-all does not pass -niters 2, and so the middle is much more important there.

Using the left mouse button, select a few seconds out of the middle of the timeline, and zoom and filter in by selection.

We can see that the behavior of the secondary cores is to have intervals where they are idle, and other intervals where they are about half-used. We need to understand both these intervals, but we have only collected samples every 10ms, which is not good enough.

We decide that we really need to understand the 14sec - 17sec portion very well, so we will run another collection.

Collect Advanced Hotspots for the 14-17 sec portion

Start the GUI again

amplxe-gui &

and open the same project.

DO NOT click Advanced Hotspots Analysis, which would run it, incorrectly configured! Instead click Configure Analysis....

In the Launch Application panel, in the Advanced section, select

  1. Automatically resume collection after 14 secs
  2. Automatically stop collection after 17 secs
  3. Result size 5000 MB
  4. Full finalization mode

In the How, choose Advanced Hotspots Analysis, and select

  1. 1ms sampling (the default)
  2. Hotspots and stacks (the default)
  3. Analyze OpenMP regions (NOT the default)

Again, get the command line into the clipboard. Note: I had to work around a bug in the Beta product and change 14000 to 14 in the following command.

Run the collection as before

cd /home/rheticus/freesurfer_repo_nf_faster_distance_map_update_6/freesurfer/mris_fix_topology/testdata
cp ./subjects/bert/surf/lh.orig{.before,}
export SUBJECTS_DIR=`pwd`/subjects
export OMP_NUM_THREADS=4

rm -rf r001ah

/opt/intel/vtune_amplifier_2019.0.0.553900/bin64/amplxe-cl -collect advanced-hotspots -knob collection-detail=stack-sampling -knob analyze-openmp=true -knob enable-characterization-insights=false -data-limit=5000 -finalization-mode=full -app-working-dir /home/rheticus/freesurfer_repo_nf_faster_distance_map_update_6/freesurfer/mris_fix_topology/testdata --start-paused --resume-after 14 --duration 17 -- /home/rheticus/freesurfer_repo_nf_faster_distance_map_update_6/freesurfer/mris_fix_topology/mris_fix_topology -niters 2 -mgz -sphere qsphere.nofix -ga -seed 1234 bert lh

It takes a minute or more to finalize the collected data - be patient.

Look at the details

amplxe-gui r001ah/r001ah.amplxe &

By zooming in on the portion of the timeline where the big gaps in the secondary threads are, it is possible to see these being caused by a 20ms time to build a realmTree, which is not being done in parallel. This is best seen in the Top-down Tree tab.

Zooming in more closely on the times when the secondary CPUs look somewhat busy shows idle gaps where the main thread is running, but we are getting to the limits of the 1ms collection resolution to see these.

Selecting a longer interval of these, and filtering to only the main thread, and choosing instead the Bottom-up tab, we can see the hot functions.

This illustrates the problem with the Bottom-up view: when time is spread across many small functions, it is hard to see the cumulative effects.

Investigating the Top-down view filtered to just a secondary thread reveals it is idle about 50% of the time.

Comparing the Top-down view of the tid0 and the tid-non0 can help see where the non-threaded time is going.

Another approach is to examine the same time interval when there are only 1 or 2 threads. Compare the execution of the main thread in both collections. The functions where the time has not changed are the ones that are being done in serial, and those where the time has doubled (4 threads v 2 threads) or quadrupled (4 threads v 1 thread) are the ones being done in parallel. Of course it is hard to get the same time interval. It is possible to use the Intel ITT API to turn collection on and off around the same interval, but it may be easier to just consider ratios of the function time to the time of the interval.

In our case, this revealed that makeRealmTree was a big serial component. We are not getting a good view of the smaller serial portions.

Double-clicking on makeRealmTree in the Top-down Tree tab takes us to a source tab which showed where the time was going in makeRealmTree, and most of it was in a loop that was easily parallelized.
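Turning collection on and off around a chosen interval, as mentioned above, can be done from inside the program with the __itt_resume and __itt_pause calls declared in Intel's ittnotify.h. The sketch below is an assumption-laden illustration: HAVE_ITTNOTIFY is a hypothetical build flag (not a FreeSurfer macro), and run_profiled and example_work are made-up names. Without the flag the macros do nothing, so the code still builds on machines that lack the header.

```c
#include <assert.h>
#ifdef HAVE_ITTNOTIFY              /* hypothetical build flag */
#include <ittnotify.h>
#define COLLECTION_RESUME() __itt_resume()
#define COLLECTION_PAUSE()  __itt_pause()
#else
#define COLLECTION_RESUME() ((void)0)
#define COLLECTION_PAUSE()  ((void)0)
#endif

/* Stand-in for one iteration of the real work. */
int example_work(int iteration) {
    return iteration * iteration;
}

/* Runs work(0) .. work(niters-1), with collection enabled only for
 * iterations first..last, so every run profiles the same interval. */
int run_profiled(int (*work)(int), int first, int last, int niters) {
    int total = 0;
    int i;
    assert(work != 0 && first <= last);
    for (i = 0; i < niters; i++) {
        if (i == first) COLLECTION_RESUME();
        total += work(i);
        if (i == last) COLLECTION_PAUSE();
    }
    return total;
}
```

For this to select the interval, the collection needs to start paused, as the --start-paused option in the commands on this page does; the time-based resume and duration options would then be dropped in favour of the calls in the code.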

Make the change and re-measure

After making a possible improvement, measurements can show whether the overall time has gone down and, if it has not, whether the time spent in the changed code has gone down and (if so) what has taken its place.

In my case I replaced this simple loop with a parallel reduction, and was irked to discover that there was no reduction at all in execution time.

I replaced...

    int vno = 0;
    VERTEX       const * vertex0      = &mris->vertices[vno];
    Captured_VERTEX_xyz* captured_xyz = &rt->captured_VERTEX_xyz[vno];
    getXYZ(vertex0, &captured_xyz->x, &captured_xyz->y, &captured_xyz->z);
    float xLo = captured_xyz->x, yLo = captured_xyz->y, zLo = captured_xyz->z; 
    float xHi = xLo, yHi = yLo, zHi = zLo;
    for (vno = 1; vno < mris->nvertices; vno++) {
        VERTEX const * vertex = &mris->vertices[vno];
        captured_xyz = &rt->captured_VERTEX_xyz[vno];
        getXYZ(vertex, &captured_xyz->x, &captured_xyz->y, &captured_xyz->z);
        float x = captured_xyz->x, y = captured_xyz->y, z = captured_xyz->z; 
        xLo = MIN(xLo, x); yLo = MIN(yLo, y); zLo = MIN(zLo, z); 
        xHi = MAX(xHi, x); yHi = MAX(yHi, y); zHi = MAX(zHi, z); 
    }

with the somewhat complicated...

    int vno = 0;
    VERTEX       const * vertex0      = &mris->vertices[vno];
    Captured_VERTEX_xyz* captured_xyz = &rt->captured_VERTEX_xyz[vno];
    getXYZ(vertex0, &captured_xyz->x, &captured_xyz->y, &captured_xyz->z);
    float xLo = captured_xyz->x, yLo = captured_xyz->y, zLo = captured_xyz->z; 
    float xHi = xLo,             yHi = yLo,             zHi = zLo;

    {   // This is expensive and easily parallelized
        // Unfortunately older compilers don't have a min and max reduction, so it is hand coded
        //
        float 
            xLos[_MAX_FS_THREADS], 
            xHis[_MAX_FS_THREADS],
            yLos[_MAX_FS_THREADS],
            yHis[_MAX_FS_THREADS],
            zLos[_MAX_FS_THREADS],
            zHis[_MAX_FS_THREADS];

        {   int tid;
            for (tid = 0; tid < max_threads; tid++) {
                xLos[tid] = xLo; xHis[tid] = xHi;
                yLos[tid] = yLo; yHis[tid] = yHi;
                zLos[tid] = zLo; zHis[tid] = zHi;
            }
        }
      
        int       vnoLo   = 1;
        int const vnoStep = MAX(1, (mris->nvertices - vnoLo + max_threads - 1) / max_threads);
        
        ROMP_PF_begin
#if defined(HAVE_OPENMP)
        #pragma omp parallel for if_ROMP(assume_reproducible) /* reduction(max:xHi,yHi,zHi) reduction(min:xLo,yLo,zLo) */
#endif
        for (vnoLo = 1; vnoLo < mris->nvertices; vnoLo += vnoStep) {
            ROMP_PFLB_begin
            int const vnoHi = MIN(mris->nvertices, vnoLo + vnoStep);
            int const tid   = omp_get_thread_num();
            int vno;
            for (vno = vnoLo; vno < vnoHi; vno++) {
                VERTEX const *       vertex       = &mris->vertices         [vno];
                Captured_VERTEX_xyz* captured_xyz = &rt->captured_VERTEX_xyz[vno];
                getXYZ(vertex, &captured_xyz->x, &captured_xyz->y, &captured_xyz->z);
                float x = captured_xyz->x, y = captured_xyz->y, z = captured_xyz->z; 
                xLos[tid] = MIN(xLos[tid], x); yLos[tid] = MIN(yLos[tid], y); zLos[tid] = MIN(zLos[tid], z); 
                xHis[tid] = MAX(xHis[tid], x); yHis[tid] = MAX(yHis[tid], y); zHis[tid] = MAX(zHis[tid], z); 
            }
            ROMP_PFLB_end
        }
        ROMP_PF_end
        
        {   int tid;
            for (tid = 0; tid < max_threads; tid++) {
              xLo = MIN(xLo,xLos[tid]); xHi = MAX(xHi, xHis[tid]);
              yLo = MIN(yLo,yLos[tid]); yHi = MAX(yHi, yHis[tid]);
              zLo = MIN(zLo,zLos[tid]); zHi = MAX(zHi, zHis[tid]);
            }
        }
    }
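The commented-out reduction clauses in the pragma above hint at a shorter form. A minimal sketch, assuming a compiler that supports OpenMP 3.1 or later (which added min and max reductions for C); Point3, Bounds3, and bounds_of are hypothetical stand-ins, not the FreeSurfer types:

```c
#include <assert.h>

typedef struct { float x, y, z; } Point3;                        /* stand-in for the captured xyz */
typedef struct { float xLo, xHi, yLo, yHi, zLo, zHi; } Bounds3;

/* Bounding box of n > 0 points. Each thread works on private copies of the
 * six variables, and OpenMP combines them when the loop ends. */
Bounds3 bounds_of(const Point3 *p, int n) {
    assert(n > 0);
    float xLo = p[0].x, xHi = xLo;
    float yLo = p[0].y, yHi = yLo;
    float zLo = p[0].z, zHi = zLo;
    int i;
#pragma omp parallel for reduction(min:xLo,yLo,zLo) reduction(max:xHi,yHi,zHi)
    for (i = 1; i < n; i++) {
        if (p[i].x < xLo) xLo = p[i].x;
        if (p[i].x > xHi) xHi = p[i].x;
        if (p[i].y < yLo) yLo = p[i].y;
        if (p[i].y > yHi) yHi = p[i].y;
        if (p[i].z < zLo) zLo = p[i].z;
        if (p[i].z > zHi) zHi = p[i].z;
    }
    Bounds3 b = { xLo, xHi, yLo, yHi, zLo, zHi };
    return b;
}
```

Because the private copies are ordinary local variables, this form also avoids threads updating adjacent elements of shared per-thread arrays.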

Why was there no change?

Look back at the list of explanations at the top of this page.

  1. Not enough of the application is parallel
  2. Insufficient work per thread
  3. Excessive locking
  4. Excessive memory traffic
  5. Work not spread equally between the threads, aka load imbalance

The initial and final loops are tiny compared to the parallel loop, so enough is parallel. It would appear that the amount of work per thread is sufficient - about 200,000 vertices per thread. Locking has been eliminated by the use of per-thread temporaries. And there is no load imbalance. I wonder if we can show it is a memory traffic problem - it does seem possible, given the very little work done with each item.

We are going to use a different VTune collection - general exploration - and see what it shows us.

Generate the command line as before, but this time select general exploration. We have to deal with another warning message, which is fixed by

sudo -i
cat /proc/sys/kernel/perf_event_paranoid
echo 0 > /proc/sys/kernel/perf_event_paranoid
exit

cd /home/rheticus/freesurfer_repo_nf_faster_distance_map_update_6/freesurfer/mris_fix_topology/testdata
cp ./subjects/bert/surf/lh.orig{.before,}
export SUBJECTS_DIR=`pwd`/subjects
export OMP_NUM_THREADS=4

rm -rf r000ge

/opt/intel/vtune_amplifier_2019.0.0.553900/bin64/amplxe-cl -collect general-exploration \
-knob collect-memory-bandwidth=true -knob analyze-openmp=true \
-data-limit=5000 -finalization-mode=full \
-app-working-dir /home/rheticus/freesurfer_repo_nf_faster_distance_map_update_6/freesurfer/mris_fix_topology/testdata \
--start-paused --resume-after 14 --duration 17 \
-- /home/rheticus/freesurfer_repo_nf_faster_distance_map_update_6/freesurfer/mris_fix_topology/mris_fix_topology \
-niters 2 -mgz -sphere qsphere.nofix -ga -seed 1234 bert lh

amplxe-gui r000ge/r000ge* &

The Summary page here is, for us, only vaguely interesting. It is an average, and we are interested in a specific time interval. Change to the Bottom-up view and zoom and filter in on the parallelism gap where we can see the spike that we have added with the above additional parallel loop. VTune is reporting that the two hot functions in this gap - makeRealmTree_omp_fn.o and get_origxyz - are 80% and 100% Back-End Bound.

Hovering over the title Back-End Bound shows the help text that describes it as being caused EITHER by data-cache misses or by DIV unit (floating point divide and sqrt) being overloaded. Expanding this cell shows that the problem is 'Contested Accesses' and 'Data Sharing'.

This happens when a cache line is being bounced between the cores - which in our case is caused by the way the per-thread data is defined. Let us fix that and try again...

The code now looks like this, although, to be honest, once I replaced the inner accesses with local variables there probably isn't any benefit to the reordering of the per thread data. I did it here just to show the whole concept.

        struct PerThread {
            float xLo, xHi, yLo, yHi, zLo, zHi;
            char cacheLinePadding[64];
        } perThreads[_MAX_FS_THREADS];

        {   int tid;
            for (tid = 0; tid < max_threads; tid++) {
                struct PerThread* pt = &perThreads[tid];
                pt->xLo = xLo; pt->xHi = xHi;
                pt->yLo = yLo; pt->yHi = yHi;
                pt->zLo = zLo; pt->zHi = zHi;
            }
        }
      
        int       vnoLo   = 1;
        int const vnoStep = MAX(1, (mris->nvertices - vnoLo + max_threads - 1) / max_threads);
        
        ROMP_PF_begin
#if defined(HAVE_OPENMP)
        #pragma omp parallel for if_ROMP(assume_reproducible) /* reduction(max:xHi,yHi,zHi) reduction(min:xLo,yLo,zLo) */
#endif
        for (vnoLo = 1; vnoLo < mris->nvertices; vnoLo += vnoStep) {
            ROMP_PFLB_begin
            int const vnoHi = MIN(mris->nvertices, vnoLo + vnoStep);
            int const tid   = omp_get_thread_num();
            struct PerThread* const pt = &perThreads[tid];
            float 
                xLo = pt->xLo, yLo = pt->yLo, zLo = pt->zLo,
                xHi = pt->xHi, yHi = pt->yHi, zHi = pt->zHi;
            int vno;
            for (vno = vnoLo; vno < vnoHi; vno++) {
                VERTEX const *       vertex       = &mris->vertices         [vno];
                Captured_VERTEX_xyz* captured_xyz = &rt->captured_VERTEX_xyz[vno];
                getXYZ(vertex, &captured_xyz->x, &captured_xyz->y, &captured_xyz->z);
                float x = captured_xyz->x, y = captured_xyz->y, z = captured_xyz->z; 
                xLo = MIN(xLo, x); yLo = MIN(yLo, y); zLo = MIN(zLo, z); 
                xHi = MAX(xHi, x); yHi = MAX(yHi, y); zHi = MAX(zHi, z); 
            }
            pt->xLo = xLo, pt->yLo = yLo, pt->zLo = zLo,
            pt->xHi = xHi, pt->yHi = yHi, pt->zHi = zHi;
            ROMP_PFLB_end
        }
        ROMP_PF_end
        
        {   int tid;
            for (tid = 0; tid < max_threads; tid++) {
                struct PerThread* const pt = &perThreads[tid];
                xLo = MIN(xLo,pt->xLo); xHi = MAX(xHi, pt->xHi);
                yLo = MIN(yLo,pt->yLo); yHi = MAX(yHi, pt->yHi);
                zLo = MIN(zLo,pt->zLo); zHi = MAX(zHi, pt->zHi);
            }
        }

The result of this change is massive. This function has basically disappeared from view. It is now showing as 100% Backend-Bound.
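An alternative to the explicit padding array, assuming a C11 compiler, is to ask for cache-line alignment with _Alignas, which also rounds the size of each per-thread slot up to a whole number of cache lines. In this sketch PaddedRange, range_of, and the sizes are hypothetical, and the 64-byte line size is an assumption about the target machine.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallbacks so the sketch also builds without OpenMP. */
static int omp_get_thread_num(void)  { return 0; }
static int omp_get_max_threads(void) { return 1; }
#endif

enum { CACHE_LINE_BYTES = 64, MAX_SLOTS = 64 };  /* assumed sizes */

/* _Alignas on the first member raises the alignment of the whole struct,
 * so sizeof(PaddedRange) is rounded up to a multiple of 64 bytes and no
 * two array elements share a cache line. */
typedef struct {
    _Alignas(CACHE_LINE_BYTES) float lo;
    float hi;
} PaddedRange;

/* Minimum and maximum of n > 0 values, using one padded slot per thread. */
void range_of(const float *v, int n, float *lo_out, float *hi_out) {
    assert(n > 0);
    PaddedRange slots[MAX_SLOTS];
    int nslots = omp_get_max_threads();
    int t;
    if (nslots > MAX_SLOTS) nslots = MAX_SLOTS;
    for (t = 0; t < nslots; t++) slots[t].lo = slots[t].hi = v[0];

#pragma omp parallel num_threads(nslots)
    {
        PaddedRange *mine = &slots[omp_get_thread_num()];
        float lo = mine->lo, hi = mine->hi;   /* accumulate in locals */
        int i;
#pragma omp for
        for (i = 1; i < n; i++) {
            if (v[i] < lo) lo = v[i];
            if (v[i] > hi) hi = v[i];
        }
        mine->lo = lo;                        /* one write per thread */
        mine->hi = hi;
    }

    *lo_out = *hi_out = v[0];
    for (t = 0; t < nslots; t++) {
        if (slots[t].lo < *lo_out) *lo_out = slots[t].lo;
        if (slots[t].hi > *hi_out) *hi_out = slots[t].hi;
    }
}
```

The alignment applies to arrays declared with this type; memory obtained from malloc would need an aligned allocator to get the same guarantee.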

It is time to measure again, and see if there is a detectable improvement now, but before we leave, zoom out and look at the measurements for the entire 3 seconds. It shows that mrisComputeDefectMRILogUnlikelihood_wkr._omp_fn.47 is spending 25% of its cycles on 'bad speculation'.

Sad measurements

Sadly, redoing the test_mris_fix_topology_timing shows the win was only about 1% on the -niters 2 case, and none at all on the unlimited case.

This suggests another problem - we are not measuring the case we are trying to improve! Perhaps it has a different hotspot. Let us redo the experiment without the -niters 2 option. We will also increase the time interval to 10 seconds...

cd /home/rheticus/freesurfer_repo_nf_faster_distance_map_update_6/freesurfer/mris_fix_topology/testdata
cp ./subjects/bert/surf/lh.orig{.before,}
export SUBJECTS_DIR=`pwd`/subjects
export OMP_NUM_THREADS=4

rm -rf r000ge

/opt/intel/vtune_amplifier_2019.0.0.553900/bin64/amplxe-cl -collect general-exploration \
-knob collect-memory-bandwidth=true -knob analyze-openmp=true \
-data-limit=5000 -finalization-mode=full \
-app-working-dir /home/rheticus/freesurfer_repo_nf_faster_distance_map_update_6/freesurfer/mris_fix_topology/testdata \
--start-paused --resume-after 14 --duration 24 \
-- /home/rheticus/freesurfer_repo_nf_faster_distance_map_update_6/freesurfer/mris_fix_topology/mris_fix_topology \
-mgz -sphere qsphere.nofix -ga -seed 1234 bert lh

amplxe-gui r000ge/r000ge* &

Now we see the truth. The big notches, the ones with the makeRealmTree calls, are much further apart. Improving these will not have a big impact on the case with unlimited -niters, while it will (uselessly) improve the -niters 2 case. We need to turn our attention to the time spent doing the iterations.

Measure Advanced Hotspots again

cd /home/rheticus/freesurfer_repo_nf_faster_distance_map_update_6/freesurfer/mris_fix_topology/testdata
cp ./subjects/bert/surf/lh.orig{.before,}
export SUBJECTS_DIR=`pwd`/subjects
export OMP_NUM_THREADS=4

rm -rf r00*ah

amplxe-cl -collect advanced-hotspots -knob collection-detail=stack-sampling -knob analyze-openmp=true -knob enable-characterization-insights=false -data-limit=5000 -finalization-mode=full -app-working-dir /home/rheticus/freesurfer_repo_nf_faster_distance_map_update_6/freesurfer/mris_fix_topology/testdata --start-paused --resume-after 14 --duration 24 -- /home/rheticus/freesurfer_repo_nf_faster_distance_map_update_6/freesurfer/mris_fix_topology/mris_fix_topology -mgz -sphere qsphere.nofix -ga -seed 1234 bert lh

amplxe-gui r00*ah/r0* &

This shows that, in the serial portion, the following functions each take about an equal share of the serial time.

  1. retessellateDefect_wkr 20%
  2. retessellateDefect_wkr calls intersectDefectEdges 20%
  3. defectMatch calls defectMaximizeLikelihood calls computeDefectVertexNormals 20%
  4. mrisComputeDefectMRILogUnlikelihood_wkr calls qsort - 20%
  5. retessellateDefect_wkr calls intersectDefectEdges calls ... calls possiblyIntersectingGreatArcs_callback 20%

All of these look improvable, which may give as much as a 2x improvement in our throughput. We now mark each of these in the code, and consider how to improve them.

Also consider the hottest places in the parallel code again. Double-clicking mrisComputeDefectMRILogUnlikelihood_wkr._omp_fn.47 takes us to a really hot line

    new_distance = SIGN(least_sign)*sqrt(least_distance_squared);

These are all examples of situations described in MorphoOptimizationProject_BetterSerialCode.

One strange place in the serial code to be hot is

#ifdef HAVE_OPENMP              
#pragma omp parallel for if_ROMP(shown_reproducible)              527.219ms
#endif                          
for (bufferIndex = 0; bufferIndex < bufferSize; bufferIndex++) {

This line should have no cost at all, so it makes me wonder about the load balance of this loop.

Measure concurrency

I tried to look at the load balancing using Amplifier. It was a waste of time - perhaps because we are using gcc rather than icc. At some point I will redo this using an icc build. I reported the issues to Intel.

Generate a command line as before, but this time choose Concurrency, change the interval to 1ms, and analyze OpenMP regions.

Note: My command line is a little different this time because I have moved to a different repository where I am building a change worthy of having a pull request generated. Also I have copied mris_fix_topology to mris_fix_topology.bf_wrongBufferSize so I can quickly switch between several branches while doing measurements.

cd /home/rheticus/freesurfer_repo_root/freesurfer/mris_fix_topology/testdata
cp ./subjects/bert/surf/lh.orig{.before,}
export SUBJECTS_DIR=`pwd`/subjects
export OMP_NUM_THREADS=4

rm -rf r00*cc

amplxe-cl -collect concurrency \
  -knob sampling-interval=1 -knob analyze-openmp=true \
  -data-limit=5000 -finalization-mode=full \
  -app-working-dir /home/rheticus/freesurfer_repo_root/freesurfer/mris_fix_topology/testdata \
  --start-paused --resume-after 14 --duration 17 \
  -- /home/rheticus/freesurfer_repo_root/freesurfer/mris_fix_topology/mris_fix_topology.bf_wrongBufferSize \
     -niters 2 -mgz -sphere qsphere.nofix -ga -seed 1234 bert lh

amplxe-gui r00*cc/r0* &

Use schedule(guided)

OpenMP has a variety of algorithms for deciding how to assign iterations to the threads, and schedule(guided) is one that copes with load imbalance by initially assigning a significant number of the iterations, but keeping some in reserve to spread amongst any threads that finish early.

Adding it to this line reduced the time spent here from 527ms to only 5.98ms.

#ifdef HAVE_OPENMP              
#pragma omp parallel for if_ROMP(shown_reproducible) schedule(guided)              5.98ms
#endif                          
for (bufferIndex = 0; bufferIndex < bufferSize; bufferIndex++) {
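A self-contained sketch of the same clause on a deliberately unbalanced loop is shown below. The function triangular_work is hypothetical: iteration i does about i units of work, so the later iterations are the expensive ones.

```c
#include <assert.h>

/* Iteration i performs i inner steps, so the work grows with i.
 * With schedule(guided) the early chunks are large and the later chunks
 * are small, which lets threads that finish early pick up more work. */
long triangular_work(int n) {
    assert(n >= 0);
    long total = 0;
    int i;
#pragma omp parallel for schedule(guided) reduction(+:total)
    for (i = 0; i < n; i++) {
        long local = 0;
        int j;
        for (j = 0; j < i; j++) {
            local += j;
        }
        total += local;
    }
    return total;
}
```

Guided scheduling suits this shape because the large early chunks cover the cheap iterations. If the expensive iterations came first, schedule(dynamic) with a modest chunk size might balance better; either way it is worth measuring.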

Getting better visibility into short intervals

See MorphoOptimizationProject_usingVTune.

MorphoOptimizationProject_improvingParallelism (last edited 2021-09-22 09:53:45 by DevaniCordero)