4 February 2018

Optimizing CGraph with SSE and AVX

As you might have seen on my GitHub profile, I have been working on a new project called CGraph https://github.com/praisethemoon/cgraph, a computational graph library usable through a C or Lua API.

The CGraph API internally uses CBLAS for fast matrix/vector operations. Unfortunately, BLAS does not cover every operation, such as adding two vectors or multiplying them element-wise.

Therefore, for the sake of efficiency and learning, I decided to learn SSE and apply it to CGraph. Keep in mind that CGraph works only on double-precision vectors/matrices.

SSE is an abbreviation of Streaming SIMD Extensions, where SIMD stands for Single Instruction, Multiple Data.

In short, SSE is an extension to the x86 instruction set that speeds up SIMD workloads. SSE works with 128-bit-wide registers called XMM registers. You can treat such a register as a set of blocks storing integers, floats, or doubles, and an operation is applied to the entire register, block by block.

Example: we can store up to four floats in an XMM register, since a float is 32 bits wide and 32 * 4 = 128.

```
r1 = {f0_1, f1_1, f2_1, f3_1}
       +     +     +     +
r2 = {f0_2, f1_2, f2_2, f3_2}
```

When we add these two registers as floats, we obtain

`r3 = {f0_1 + f0_2, f1_1 + f1_2, f2_1 + f2_2, f3_1 + f3_2}`

The power of SSE comes from the fact that all these operations are done in a single instruction!

Let’s try a real example and compare the performance of SSE against a standard for loop.

For simplicity, we will use arrays whose length is a multiple of 4.

If you do not understand the code, it is pretty straightforward:

`__m128` is the C type representing an XMM register holding four floats (remember that XMM registers can hold other data types as well).

`_mm_load_ps(float*)` loads four float elements starting at the (16-byte-aligned) pointer into a `__m128` register.

`_mm_mul_ps` multiplies two `__m128` registers and returns a new one.

`_mm_store_ps` writes a `__m128` register back into the equivalent C array of floats.

Easy.

Compiling and running that with GCC 6.3.0 I get the following output:

```
Hello, world!
SSE: 41.862408
std: 83.412814
bye!
```

Now it is pretty clear that the performance gain is huge! But let’s profile the program to get clearer insights. For this task I am going to use `valgrind` and `kcachegrind`:

```
valgrind --tool=callgrind ./sse_test
kcachegrind
```

And here we can see the output of the valgrind profiler!

As you can see, the SSE block took around 40.05% of main’s execution time, while the standard loop version took around 54.91%.

However, when testing the same code with double-precision floating point, the advantage of SSE over the standard for loop actually shrank. That raised my curiosity: maybe the improvement mostly came from looping 4 times fewer over floats, while we only loop 2 times fewer over doubles (since an XMM register can hold only 2 doubles).

So it is about time we compare all this with AVX using double precision!

First, the AVX API is very similar to SSE’s, with just different type prefixes/suffixes. It only requires including `immintrin.h`.

AVX stands for Advanced Vector Extensions. These instructions operate on YMM registers, which are 256 bits wide and can thus hold up to 8 floats or 4 doubles.

Let’s do the same process again:

Running the benchmark returned:

```
Hello, world!
SSE: 33.140252
std: 26.112234
AVX: 20.766444
bye!
```