Parent: MorphoOptimizationProject_BetterSerialCode

The initial state

The general matrix code allows arbitrary dimensions of either float or complex-float data.

Prior to my change, matrices were implemented as a heap-allocated struct containing two pointers:

  1. one to a vector of heap-allocated elements
  2. one to a vector of row pointers into these elements

So creating a matrix required three heap operations, as did freeing it.
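The original layout can be sketched roughly as follows. This is an illustrative reconstruction, not the project's actual declarations; the names are hypothetical.

```c
#include <stdlib.h>

/* Hypothetical sketch of the original two-pointer layout. */
typedef struct {
    int    rows, cols;
    float  *data;   /* heap block 2: rows*cols elements */
    float **rptr;   /* heap block 3: one pointer per row, into data */
} Matrix;

/* Three heap operations to create a matrix... */
static Matrix *matrix_alloc_old(int rows, int cols) {
    Matrix *m = malloc(sizeof *m);                        /* 1 */
    m->rows = rows;
    m->cols = cols;
    m->data = malloc(sizeof *m->data * (size_t)rows * cols);  /* 2 */
    m->rptr = malloc(sizeof *m->rptr * (size_t)rows);         /* 3 */
    for (int r = 0; r < rows; r++)
        m->rptr[r] = m->data + (size_t)r * cols;
    return m;
}

/* ...and three to free it. */
static void matrix_free_old(Matrix *m) {
    free(m->rptr);
    free(m->data);
    free(m);
}
```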

Lots of code depended on this shape, so changing it was not practical.

Reducing to one heap operation

By calculating the total storage required for all three of these previous allocations, it was possible to malloc and free with a single operation each: malloc obtains all the needed storage and the three pieces are built within this one allocation, so free has only one block to deallocate.
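The single-allocation technique can be sketched like this (again with illustrative names): the header goes first, followed by the row-pointer vector and then the elements, all in one malloc'd block.

```c
#include <stdlib.h>

/* Sketch of the single-allocation layout. The struct, the row-pointer
   vector, and the elements all live in one heap block. */
typedef struct {
    int    rows, cols;
    float **rptr;
    float  *data;
} Matrix;

static Matrix *matrix_alloc_one(int rows, int cols) {
    size_t bytes = sizeof(Matrix)
                 + sizeof(float *) * (size_t)rows          /* row pointers */
                 + sizeof(float) * (size_t)rows * cols;    /* elements */
    Matrix *m = malloc(bytes);                 /* the only heap operation */
    m->rows = rows;
    m->cols = cols;
    m->rptr = (float **)(m + 1);               /* row pointers follow header */
    m->data = (float *)(m->rptr + rows);       /* elements follow row pointers */
    for (int r = 0; r < rows; r++)
        m->rptr[r] = m->data + (size_t)r * cols;
    return m;
}

static void matrix_free_one(Matrix *m) {
    free(m);    /* one free releases all three pieces */
}
```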

Reducing to zero heap operations most of the time

Code was added to matrix.h to pass the calling file and line as well as the dimensions to the call to MatrixAlloc. Code within matrix.c then gathered statistics to identify the prime callers and the dimensions they requested. This identified the handful of sites where most of the allocations were done, and their preferred sizes.
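The instrumentation can be sketched as below. The macro and the table here are hypothetical stand-ins for what was added to matrix.h and matrix.c; only the MatrixAlloc name comes from the source.

```c
#include <stdlib.h>
#include <string.h>

/* A macro passes the caller's file and line, so the allocator can
   tally which sites allocate, and at what dimensions. */
#define MatrixAlloc(rows, cols) \
    matrix_alloc_traced((rows), (cols), __FILE__, __LINE__)

typedef struct { const char *file; int line, rows, cols; long count; } SiteStat;
static SiteStat stats[64];
static int nstats;

static float *matrix_alloc_traced(int rows, int cols,
                                  const char *file, int line) {
    int i;
    for (i = 0; i < nstats; i++)            /* find an existing entry */
        if (stats[i].line == line && stats[i].rows == rows
            && stats[i].cols == cols && strcmp(stats[i].file, file) == 0)
            break;
    if (i == nstats && nstats < 64) {       /* first visit to this site */
        stats[nstats].file  = file;
        stats[nstats].line  = line;
        stats[nstats].rows  = rows;
        stats[nstats].cols  = cols;
        stats[nstats].count = 0;
        nstats++;
    }
    if (i < nstats)
        stats[i].count++;
    /* real code would build the matrix here; this sketch just allocates */
    return calloc((size_t)rows * cols, sizeof(float));
}
```

Dumping the table at exit then ranks the call sites by count, which is how the handful of dominant sites and their preferred sizes were found.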

Another struct - MatrixBuffer - was defined that is large enough to hold a 4x4 float matrix. Any code that was calling MatrixAlloc ... MatrixFree could now pass such a buffer to MatrixAlloc2; if the buffer was large enough, it was used instead of calling malloc. MatrixFree was taught not to free such buffers.
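The buffer mechanism can be sketched as follows. MatrixBuffer, MatrixAlloc2, and MatrixFree are the names from the text; the struct layout, the `buffered` flag, and the size constant are assumptions made for this sketch.

```c
#include <stdalign.h>
#include <stdlib.h>
#include <string.h>

enum { kMaxBufferedElems = 4 * 4 };         /* fits a 4x4 float matrix */

typedef struct {
    int   rows, cols;
    int   buffered;              /* 1 = lives in a caller's MatrixBuffer */
    float data[];
} Matrix;

/* Caller-owned storage, typically on the stack. */
typedef struct {
    alignas(Matrix) char space[sizeof(Matrix)
                               + sizeof(float) * kMaxBufferedElems];
} MatrixBuffer;

static Matrix *MatrixAlloc2(int rows, int cols, MatrixBuffer *buf) {
    Matrix *m;
    if (buf && rows * cols <= kMaxBufferedElems) {
        m = (Matrix *)buf->space;            /* no heap traffic */
        m->buffered = 1;
    } else {
        m = malloc(sizeof(Matrix) + sizeof(float) * (size_t)rows * cols);
        m->buffered = 0;
    }
    m->rows = rows;
    m->cols = cols;
    memset(m->data, 0, sizeof(float) * (size_t)rows * cols);
    return m;
}

static void MatrixFree(Matrix *m) {
    if (!m->buffered)                        /* buffers are never freed */
        free(m);
}
```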

In addition, operations such as MatrixMultiply already supported being given a destination matrix, but this was often not exploited. The hottest callers were changed to create and pass a buffered matrix, further exploiting the above work.
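The destination-matrix pattern looks roughly like this (a simplified row-major sketch, not the actual MatrixMultiply signature): the hot loop reuses one destination instead of allocating a result per call.

```c
/* Illustrative: a multiply that writes into a caller-provided
   destination does no allocation, so a hot loop can reuse one dst. */
static void matrix_multiply_into(const float *a, const float *b,
                                 float *dst, int n) {  /* n x n, row-major */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            float s = 0.0f;
            for (int k = 0; k < n; k++)
                s += a[i * n + k] * b[k * n + j];
            dst[i * n + j] = s;
        }
}
```

Combined with the buffered matrices above, the hottest call sites end up doing no heap traffic at all.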

This change produced a massive reduction - 75% - in the number of calls to malloc and free!

Customizing for 1x3, 3x3, etc. multiplies

So far these functions have not shown up as being hot - but when they do, changing to compiler-known bounds may win.

In particular there are places where the code computes a transpose and then multiplies by it. This is especially heinous because it is faster to multiply a matrix stored in transposed form, so the code is doing work to enable it to use a slower algorithm!
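To see why the explicit transpose is wasted work: for row-major storage, row j of B is exactly column j of B^T, so A * B^T can be computed by walking both operands contiguously - the cache-friendly access pattern - with no transpose materialized. A minimal sketch (illustrative, not the project's code):

```c
/* Compute dst = A * B^T for n x n row-major matrices, without ever
   building B^T. The inner loop reads a row of A and a row of B, both
   contiguously, which is the fast access pattern. */
static void multiply_by_transpose(const float *a, const float *b,
                                  float *dst, int n) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            float s = 0.0f;
            for (int k = 0; k < n; k++)
                s += a[i * n + k] * b[j * n + k];  /* row j of B = col j of B^T */
            dst[i * n + j] = s;
        }
}
```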